Issue 52 July 3, 2008 | ||
Text Comparison As some of you know by now, I do Quartam Software by night, and by day I work for a leading provider of Laboratory Information Systems. One of the most powerful features of our products is MISPL: a domain-specific scripting language that allows service engineers and system administrators to customize the application and fine-tune the workflow. The MISPL infrastructure comes complete with version tracking, and recently an intriguing project landed on my desk: integrating a source code compare tool so that engineers and administrators could verfify their changes before committing them to the system. With only a few hours to whip up a prototype, I launched Revolution and started coding away. And as you might have guessed, my initial attempts proved futile – so I decided to save myself the upcoming headaches, and headed over to my web browser. Surely there had to be some open source code that I could take apart? After quite a bit of googling and clicking, I found a clue in a Java forum: text comparison boils down to the Longest Common Subsequence problem. And some kind soul had indeed put up a Wikipedia page for this particular problem: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem - even if it took me a while to digest the way it actually worked, I now had pseudo-code to derive my solution! So I started with a fresh stack, dragged a few fields and buttons onto it, and wrote my own text compare tool. With the layout done, I could concentrate on the script: on mouseUp -- turn the sources into arrays of lines local tOldSource, tNewSource put field "OldSource" into tOldSource put field "NewSource" into tNewSource split tOldSource using return split tNewSource using return -- we're also going to need the linecount local tOldLineCount, tNewLineCount put item 2 of line 1 of the extents of tOldSource into \ tOldLineCount put item 2 of line 1 of the extents of tNewSource into \ tNewLineCount -- calculate the longest common subsequence lengths local tLCSLength CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount -- finally determine and display the source diferences local tSourceDiff DetermineDiff tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount, tSourceDiff put char 1 to -2 of tSourceDiff into field "SourceDiff" end mouseUp on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \ pOldLineCount, pNewLineCount local tRow, tCol repeat with tRow = 0 to pOldLineCount put 0 into pLCSLength[tRow,0] end repeat repeat with tCol = 0 to pNewLineCount put 0 into pLCSLength[0,tCol] end repeat repeat with tRow = 1 to pOldLineCount repeat with tCol = 1 to pNewLineCount put pOldSource[tRow] into tOldLine put pNewSource[tCol] into tNewLine if tOldLine = tNewLine then put pLCSLength[tRow - 1,tCol - 1] + 1 into \ pLCSLength[tRow,tCol] else put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \ - 1]) into pLCSLength[tRow,tCol] end if end repeat end repeat end CalculateLCSLengths on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \ pCol, @pSourceDiff put pOldSource[pRow] into tOldLine put pNewSource[pCol] into tNewLine if pRow > 0 and pCol > 0 and tOldLine = tNewLine then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - \ 1, pCol - 1, pSourceDiff put " " & tab & tOldLine & return after pSourceDiff else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \ >= pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \ pCol - 1, pSourceDiff put "+" & tab & tNewLine & return after pSourceDiff else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] \ < pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - \ 1, pCol, pSourceDiff put "-" & tab & tOldLine & return after pSourceDiff end if end DetermineDiff In case you're wondering about the '@' sign in front of the parameters: this is Revolution's way of passign variables 'by reference'. Any modifications to this parameter in your handler or function will propagate to the calling handler, which is what we want for some of the parameters; but this is also a good way to save memory for those parameters that you don't intend to change in the handler or function. But I digress - let's type in some text and click the "Diff sources" button: Old-school developers brought up with command-line tools will surely recognize the output they'd get from the 'diff' tool. While this is a nice way of displaying the changes, I figured it would be better displayed in two fields. So after adding a tabbed button and another field, the new script became: constant kHtmlTab = "	" on mouseUp -- turn the sources into arrays of lines local tOldSource, tNewSource put field "OldSource" into tOldSource put field "NewSource" into tNewSource split tOldSource using return split tNewSource using return -- we're also going to need the linecount local tOldLineCount, tNewLineCount put item 2 of line 1 of the extents of tOldSource into \ tOldLineCount put item 2 of line 1 of the extents of tNewSource into \ tNewLineCount -- calculate the longest common subsequence lengths local tLCSLength CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount -- finally determine and display the source diferences local tSourceDiff, tSourceCompare DetermineDiff tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount, tSourceDiff, tSourceCompare set the text of field "SourceDiff" to char 1 to -2 of \ tSourceDiff set the htmlText of field "SourceCompare" to char 1 to -2 of \ tSourceCompare end mouseUp on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \ pOldLineCount, pNewLineCount local tRow, tCol repeat with tRow = 0 to pOldLineCount put 0 into pLCSLength[tRow,0] end repeat repeat with tCol = 0 to pNewLineCount put 0 into pLCSLength[0,tCol] end repeat repeat with tRow = 1 to pOldLineCount repeat with tCol = 1 to pNewLineCount put pOldSource[tRow] into tOldLine put pNewSource[tCol] into tNewLine if tOldLine = tNewLine then put pLCSLength[tRow - 1,tCol - 1] + 1 into \ pLCSLength[tRow,tCol] else put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \ - 1]) into pLCSLength[tRow,tCol] end if end repeat end repeat end CalculateLCSLengths on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \ pCol, @pSourceDiff, @pSourceCompare put pOldSource[pRow] into tOldLine put pNewSource[pCol] into tNewLine if pRow > 0 and pCol > 0 and tOldLine = tNewLine then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \ pCol - 1, pSourceDiff, pSourceCompare put " " & tab & tOldLine & return after pSourceDiff put "<p> " & kHtmlTab & tOldLine & kHtmlTab & tNewLine & \ "</p>" & return after pSourceCompare else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \ >= pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \ pCol - 1, pSourceDiff, pSourceCompare put "+" & tab & tNewLine & return after pSourceDiff put "<p>+" & kHtmlTab & "" & kHtmlTab & tNewLine & "</p>" & \ return after pSourceCompare else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] < \ pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \ pCol, pSourceDiff, pSourceCompare put "-" & tab & tOldLine & return after pSourceDiff put "<p>-" & kHtmlTab & tOldLine & kHtmlTab & "" & "</p>" & \ return after pSourceCompare end if end DetermineDiff And when you click on the button "Diff sources" we now get the following display of the text compare. You can easily add colouring to the htmlText by inserting appropriate <font color="red"> tags, but I'll leave that as an exercise to the reader. The final bit I'd like to add is a way to ignore leading and trailing spaces – after all, when youy're comparing scripts you don't want to get bogged down by added or missing spaces, do you? So let's drag out a checkbox and modify the script some more. constant kHtmlTab = "	" on mouseUp -- turn the sources into arrays of lines local tOldSource, tNewSource put field "OldSource" into tOldSource put field "NewSource" into tNewSource split tOldSource using return split tNewSource using return -- we're also going to need the linecount local tOldLineCount, tNewLineCount put item 2 of line 1 of the extents of tOldSource into \ tOldLineCount put item 2 of line 1 of the extents of tNewSource into \ tNewLineCount -- calculate the longest common subsequence lengths local tLCSLength CalculateLCSLengths tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount -- finally determine and display the source diferences local tSourceDiff, tSourceCompare DetermineDiff tLCSLength, tOldSource, tNewSource, \ tOldLineCount, tNewLineCount, tSourceDiff, tSourceCompare set the text of field "SourceDiff" to char 1 to -2 of \ tSourceDiff set the htmlText of field "SourceCompare" to char 1 to -2 of \ tSourceCompare end mouseUp on CalculateLCSLengths @pLCSLength, @pOldSource, @pNewSource, \ pOldLineCount, pNewLineCount local tRow, tCol repeat with tRow = 0 to pOldLineCount put 0 into pLCSLength[tRow,0] end repeat repeat with tCol = 0 to pNewLineCount put 0 into pLCSLength[0,tCol] end repeat repeat with tRow = 1 to pOldLineCount repeat with tCol = 1 to pNewLineCount put pOldSource[tRow] into tOldLine put pNewSource[tCol] into tNewLine if the hilite of button "IgnoreSpaces" is true then put word 1 to -1 of tOldLine into tOldLine put word 1 to -1 of tNewLine into tNewLine end if if tOldLine = tNewLine then put pLCSLength[tRow - 1,tCol - 1] + 1 into \ pLCSLength[tRow,tCol] else put max(pLCSLength[tRow - 1,tCol],pLCSLength[tRow,tCol \ - 1]) into pLCSLength[tRow,tCol] end if end repeat end repeat end CalculateLCSLengths on DetermineDiff @pLCSLength, @pOldSource, @pNewSource, pRow, \ pCol, @pSourceDiff, @pSourceCompare put pOldSource[pRow] into tOldLine put pNewSource[pCol] into tNewLine if the hilite of button "IgnoreSpaces" is true then put word 1 to -1 of tOldLine into tOldCompare put word 1 to -1 of tNewLine into tNewCompare else put tOldLine into tOldCompare put tNewLine into tNewCompare end if if pRow > 0 and pCol > 0 and tOldCompare = tNewCompare then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \ pCol - 1, pSourceDiff, pSourceCompare put " " & tab & tOldLine & return after pSourceDiff put "<p> " & kHtmlTab & tOldLine & kHtmlTab & tNewLine & \ "</p>" & return after pSourceCompare else if pCol > 0 and (pRow = 0 or pLCSLength[pRow,pCol - 1] \ >= pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow, \ pCol - 1, pSourceDiff, pSourceCompare put "+" & tab & tNewLine & return after pSourceDiff put "<p>+" & kHtmlTab & "" & kHtmlTab & tNewLine & "</p>" & \ return after pSourceCompare else if pRow > 0 and (pCol = 0 or pLCSLength[pRow,pCol - 1] < \ pLCSLength[pRow - 1, pCol]) then DetermineDiff pLCSLength, pOldSource, pNewSource, pRow - 1, \ pCol, pSourceDiff, pSourceCompare put "-" & tab & tOldLine & return after pSourceDiff put "<p>-" & kHtmlTab & tOldLine & kHtmlTab & "" & "</p>" & \ return after pSourceCompare end if end DetermineDiff Finally, let's remove some leading spaces from the old source, tick the checkbox and click the 'Diff sources' button: And there you have it: a quick way to compare two texts, optionally ignoring leading and trailing spaces. You'll probably want to read the linked Wikipedia pages for optimisations and experiment. But I hope I sent you off to a running start! Jan Schenkel. |