Issue 52 * July 3, 2008

Text Comparison
Writing a text compare tool in Revolution

by Jan Schenkel

As some of you know by now, I do Quartam Software by night, and by day I work for a leading provider of Laboratory Information Systems. One of the most powerful features of our products is MISPL: a domain-specific scripting language that allows service engineers and system administrators to customize the application and fine-tune the workflow. The MISPL infrastructure comes complete with version tracking, and recently an intriguing project landed on my desk: integrating a source code compare tool so that engineers and administrators could verfify their changes before committing them to the system.

With only a few hours to whip up a prototype, I launched Revolution and started coding away. And as you might have guessed, my initial attempts proved futile – so I decided to save myself the upcoming headaches, and headed over to my web browser. Surely there had to be some open source code that I could take apart? After quite a bit of googling and clicking, I found a clue in a Java forum: text comparison boils down to the Longest Common Subsequence problem.

And some kind soul had indeed put up a Wikipedia page for this particular problem: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem - even if it took me a while to digest the way it actually worked, I now had pseudo-code to derive my solution! So I started with a fresh stack, dragged a few fields and buttons onto it, and wrote my own text compare tool.

With the layout done, I could concentrate on the script:

on mouseUp
   -- turn the sources into arrays of lines
   local tOldSource, tNewSource
   put field "OldSource" into tOldSource
   put field "NewSource" into tNewSource
   split tOldSource using return
   split tNewSource using return
   -- we're also going to need the linecount
   local tOldLineCount, tNewLineCount
   put item 2 of line 1 of the extents of tOldSource into \
   tOldLineCount
   put item 2 of line 1 of the extents of tNewSource into \
   tNewLineCount
   -- calculate the longest common subsequence lengths
   local tLCSLength
   CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount
   -- finally determine and display the source diferences
   local tSourceDiff
   DetermineDiff tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount, tSourceDiff
   put char 1 to -2 of tSourceDiff into field "SourceDiff"
end mouseUp

on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \
   pOldLineCount, pNewLineCount
   local tRow, tCol
   repeat with tRow = 0 to pOldLineCount 
      put 0 into pLCSLength[tRow,0]
   end repeat
   repeat with tCol = 0 to pNewLineCount 
      put 0 into pLCSLength[0,tCol]
   end repeat
   repeat with tRow = 1 to pOldLineCount
      repeat with tCol = 1 to pNewLineCount
         put pOldSource[tRow] into tOldLine
         put pNewSource[tCol] into tNewLine
         if tOldLine = tNewLine then
            put pLCSLength[tRow - 1,tCol - 1] + 1 into \
            pLCSLength[tRow,tCol]
         else
            put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \
            - 1]) into pLCSLength[tRow,tCol]
         end if
      end repeat
   end repeat
end CalculateLCSLengths

on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \
   pCol, @pSourceDiff
   put pOldSource[pRow] into tOldLine
   put pNewSource[pCol] into tNewLine
   if pRow > 0 and pCol > 0 and tOldLine = tNewLine then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - \
      1, pCol - 1, pSourceDiff
      put " " & tab & tOldLine & return after pSourceDiff
   else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \
   >= pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \
      pCol - 1, pSourceDiff
      put "+" & tab & tNewLine & return after pSourceDiff
   else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] \
   < pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - \
      1, pCol, pSourceDiff
      put "-" & tab & tOldLine & return after pSourceDiff
   end if
end DetermineDiff

In case you're wondering about the '@' sign in front of the parameters: this is Revolution's way of passign variables 'by reference'. Any modifications to this parameter in your handler or function will propagate to the calling handler, which is what we want for some of the parameters; but this is also a good way to save memory for those parameters that you don't intend to change in the handler or function.

But I digress - let's type in some text and click the "Diff sources" button:

Old-school developers brought up with command-line tools will surely recognize the output they'd get from the 'diff' tool. While this is a nice way of displaying the changes, I figured it would be better displayed in two fields. So after adding a tabbed button and another field, the new script became:

constant kHtmlTab = "&#9;"

on mouseUp
   -- turn the sources into arrays of lines
   local tOldSource, tNewSource
   put field "OldSource" into tOldSource
   put field "NewSource" into tNewSource
   split tOldSource using return
   split tNewSource using return
   -- we're also going to need the linecount
   local tOldLineCount, tNewLineCount
   put item 2 of line 1 of the extents of tOldSource into \
   tOldLineCount
   put item 2 of line 1 of the extents of tNewSource into \
   tNewLineCount
   -- calculate the longest common subsequence lengths
   local tLCSLength
   CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount
   -- finally determine and display the source diferences
   local tSourceDiff, tSourceCompare
   DetermineDiff tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount, tSourceDiff, tSourceCompare
   set the text of field "SourceDiff" to char 1 to -2 of \
   tSourceDiff
   set the htmlText of field "SourceCompare" to char 1 to -2 of \
   tSourceCompare
end mouseUp

on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \
   pOldLineCount, pNewLineCount
   local tRow, tCol
   repeat with tRow = 0 to pOldLineCount 
      put 0 into pLCSLength[tRow,0]
   end repeat
   repeat with tCol = 0 to pNewLineCount 
      put 0 into pLCSLength[0,tCol]
   end repeat
   repeat with tRow = 1 to pOldLineCount
      repeat with tCol = 1 to pNewLineCount
         put pOldSource[tRow] into tOldLine
         put pNewSource[tCol] into tNewLine
         if tOldLine = tNewLine then
            put pLCSLength[tRow - 1,tCol - 1] + 1 into \
            pLCSLength[tRow,tCol]
         else
            put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \
            - 1]) into pLCSLength[tRow,tCol]
         end if
      end repeat
   end repeat
end CalculateLCSLengths

on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \
   pCol, @pSourceDiff, @pSourceCompare
   put pOldSource[pRow] into tOldLine
   put pNewSource[pCol] into tNewLine
   if pRow > 0 and pCol > 0 and tOldLine = tNewLine then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \
      pCol - 1, pSourceDiff, pSourceCompare
      put " " & tab & tOldLine & return after pSourceDiff
      put "<p> " & kHtmlTab & tOldLine & kHtmlTab & tNewLine & \
      "</p>" & return after pSourceCompare
   else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \
   >= pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \
      pCol - 1, pSourceDiff, pSourceCompare
      put "+" & tab & tNewLine & return after pSourceDiff
      put "<p>+" & kHtmlTab & "" & kHtmlTab & tNewLine & "</p>" & \
      return after pSourceCompare
   else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] < \
   pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \
      pCol, pSourceDiff, pSourceCompare
      put "-" & tab & tOldLine & return after pSourceDiff
      put "<p>-" & kHtmlTab & tOldLine & kHtmlTab & "" & "</p>" & \
      return after pSourceCompare
   end if
end DetermineDiff

And when you click on the button "Diff sources" we now get the following display of the text compare.

You can easily add colouring to the htmlText by inserting appropriate <font color="red"> tags, but I'll leave that as an exercise to the reader. The final bit I'd like to add is a way to ignore leading and trailing spaces – after all, when youy're comparing scripts you don't want to get bogged down by added or missing spaces, do you? So let's drag out a checkbox and modify the script some more.

constant kHtmlTab = "&#9;"

on mouseUp
   -- turn the sources into arrays of lines
   local tOldSource, tNewSource
   put field "OldSource" into tOldSource
   put field "NewSource" into tNewSource
   split tOldSource using return
   split tNewSource using return
   -- we're also going to need the linecount
   local tOldLineCount, tNewLineCount
   put item 2 of line 1 of the extents of tOldSource into \
   tOldLineCount
   put item 2 of line 1 of the extents of tNewSource into \
   tNewLineCount
   -- calculate the longest common subsequence lengths
   local tLCSLength
   CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount
   -- finally determine and display the source diferences
   local tSourceDiff, tSourceCompare
   DetermineDiff tLCSLength, tOldSource, tNewSource, \
   tOldLineCount, tNewLineCount, tSourceDiff, tSourceCompare
   set the text of field "SourceDiff" to char 1 to -2 of \
   tSourceDiff
   set the htmlText of field "SourceCompare" to char 1 to -2 of \
   tSourceCompare
end mouseUp

on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \
   pOldLineCount, pNewLineCount
   local tRow, tCol
   repeat with tRow = 0 to pOldLineCount 
      put 0 into pLCSLength[tRow,0]
   end repeat
   repeat with tCol = 0 to pNewLineCount 
      put 0 into pLCSLength[0,tCol]
   end repeat
   repeat with tRow = 1 to pOldLineCount
      repeat with tCol = 1 to pNewLineCount
         put pOldSource[tRow] into tOldLine
         put pNewSource[tCol] into tNewLine
         if the hilite of button "IgnoreSpaces" is true then
            put word 1 to -1 of tOldLine into tOldLine
            put word 1 to -1 of tNewLine into tNewLine
         end if
         if tOldLine = tNewLine then
            put pLCSLength[tRow - 1,tCol - 1] + 1 into \
            pLCSLength[tRow,tCol]
         else
            put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \
            - 1]) into pLCSLength[tRow,tCol]
         end if
      end repeat
   end repeat
end CalculateLCSLengths

on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \
   pCol, @pSourceDiff, @pSourceCompare
   put pOldSource[pRow] into tOldLine
   put pNewSource[pCol] into tNewLine
   if the hilite of button "IgnoreSpaces" is true then
      put word 1 to -1 of tOldLine into tOldCompare
      put word 1 to -1 of tNewLine into tNewCompare
   else
      put tOldLine into tOldCompare
      put tNewLine into tNewCompare
   end if
   if pRow > 0 and pCol > 0 and tOldCompare = tNewCompare then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \
      pCol - 1, pSourceDiff, pSourceCompare
      put " " & tab & tOldLine & return after pSourceDiff
      put "<p> " & kHtmlTab & tOldLine & kHtmlTab & tNewLine & \
      "</p>" & return after pSourceCompare
   else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \
   >= pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \
      pCol - 1, pSourceDiff, pSourceCompare
      put "+" & tab & tNewLine & return after pSourceDiff
      put "<p>+" & kHtmlTab & "" & kHtmlTab & tNewLine & "</p>" & \
      return after pSourceCompare
   else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] < \
   pLCSLength[pRow - 1, pCol]) then
      DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \
      pCol, pSourceDiff, pSourceCompare
      put "-" & tab & tOldLine & return after pSourceDiff
      put "<p>-" & kHtmlTab & tOldLine & kHtmlTab & "" & "</p>" & \
      return after pSourceCompare
   end if
end DetermineDiff

Finally, let's remove some leading spaces from the old source, tick the checkbox and click the 'Diff sources' button:

And there you have it: a quick way to compare two texts, optionally ignoring leading and trailing spaces. You'll probably want to read the linked Wikipedia pages for optimisations and experiment. But I hope I sent you off to a running start!

Jan Schenkel.

Main Menu What's New