Issue 63 * January 15 2009

Who's Afraid of the Big Byte Wolf?
Or: future-proofing your scripts for binary files

by Jan Schenkel

Back in September, the Runtime Revolution team shipped version 3.0 of our beloved cross-platform development tool Revolution. If you haven't checked it out yet, you should: Revolution 3.0 is an impressive release, with a revamped script editor, gradients, nested arrays and a much-improved out-of-the-box experience thanks to the Start Center and Resource Center. One item that was also introduced, but not mentioned in the press release, was the new 'byte' chunk type.

If you're wondering what it's good for, or why you would want to use it instead of the good old 'character' chunk type, then allow me to explain the underlying reason for its introduction. Back in the HyperCard days, not much attention was paid to languages that didn't fall neatly within the ASCII domain: based on the English alphabet, it assumes that one character takes up a single byte, giving at most 256 possible characters (a number of which are control characters that can't even be displayed on screen).

Naturally, this isn't nearly enough room for the many thousands of Chinese or Japanese characters, and that's where Unicode comes into play. This is a set of standards that governs how this much larger number of characters can be stored and should be interpreted. The UTF-8 standard still relies on single bytes as its main storage unit, but uses the high bits of each lead byte to signal whether a character occupies a single byte or a sequence of several bytes.
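To make this concrete, here is a quick illustration in Python rather than Revolution code (the encoding behaviour is language-agnostic): an ASCII letter encodes to one byte in UTF-8, an accented letter to two, and a Chinese character to three.

```python
# UTF-8 stores ASCII characters in one byte, but anything beyond
# ASCII needs a multi-byte sequence.
samples = ["A", "\u00e9", "\u4e2d"]  # 'A', 'é', '中'
for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), [hex(b) for b in encoded])

# The lead byte's high bits announce the sequence length:
# 0xxxxxxx = a single-byte character, 110xxxxx = start of a 2-byte
# sequence, 1110xxxx = start of a 3-byte sequence, and 10xxxxxx
# marks a continuation byte.
assert len("A".encode("utf-8")) == 1
assert len("\u00e9".encode("utf-8")) == 2
assert len("\u4e2d".encode("utf-8")) == 3
```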

My other favorite cross-platform technology, Java, embraced Unicode from the very first day, which made perfect sense given its goal of "write once, run anywhere" global software. To make this easier, if less memory-efficient in those circumstances where you only have to deal with ASCII characters, it uses UTF-16 everywhere.
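The memory trade-off is easy to demonstrate, again sketched in Python for illustration: UTF-16 spends two bytes on every ASCII character, where UTF-8 spends one.

```python
# UTF-16 uses at least two bytes per character, even for plain ASCII.
# ("utf-16-le" is little-endian UTF-16 without a byte-order mark.)
text = "Java"
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))
print(utf8_size, utf16_size)  # 4 bytes vs. 8 bytes for the same text
```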

When Revolution 2.0 saw the light of day, it introduced support for Unicode text entry and manipulation. Rather than jumping on the UTF-16 bandwagon, the engineers decided to implement a more flexible system that would use plain old ASCII for everyday operations, while allowing the developer to use UTF-16 only when necessary.

While the implementation could still use some love and care, Revolution's technology director Mark Waddingham has big plans for it: he wants to evolve towards a system that is far less painful than anything else out there, yet as resource-efficient as the current implementation. This means that in the more distant future, Revolution developers should be able to use the 'character', 'word', 'item' and 'line' chunks without having to care about the underlying encoding; the engine will know what constitutes a word or a sentence in the language of that piece of text, and will take care of collation and all the gruesome details of Unicode.

Traditionally, the one-byte-is-one-character paradigm of xCard meant that developers could use the 'character' keyword, along with the 'charToNum' and 'numToChar' functions, to manipulate data at the byte-level. But with this equivalence fading away at some point in the future, the new 'byte' chunk type is our saviour!

Quartam PDF Library reads PNG and JPEG image files as byte streams, extracting information about the image height and width, bit depth and other binary-encoded information. To ensure that this and other binary-reading code will keep working correctly in the future, I've decided to make the necessary changes in the upcoming version 1.1 to use the 'byte' chunk type rather than the 'character' chunk type.

And it all boils down to the following set of rules:
- make sure to 'open file theFilePath for binary read'
- make sure to 'open file theFilePath for binary write'
- replace 'length(theBinaryVariable)' / 'the number of chars in theBinaryVariable' with 'the number of bytes in theBinaryVariable'
- replace 'char x to y of theBinaryVariable' with 'byte x to y of theBinaryVariable'
- replace 'charToNum(char x of theBinaryVariable)' with 'byteToNum(byte x of theBinaryVariable)'
- replace 'numToChar(theNumber)' with 'numToByte(theNumber)'

While that may seem like a lot of steps, it's actually a breeze to implement. Allow me to demonstrate with some old code that I used to read a file produced by an old Turbo Pascal application, which had to be imported into Revolution. The file contains a number of Pascal strings, which come with the following twist: unlike in C, where a string can be of any length because it ends with a NUL byte, Pascal uses the first byte to store the length of the string (thus effectively limiting a string to 255 bytes).

The file format is as follows:
- byte 1 to 4 = the number of strings to follow in the file (little-endian)
- per string, you have the following block
+ 1 byte to designate the length of the string
+ n bytes that contain the actual string
In short, nothing too complicated - but it requires a few byte conversions and simple calculations to figure out how many bytes to read next.
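To make the layout concrete, here is a hypothetical round-trip sketch in Python rather than Revolution code (the function names `write_pstrings` and `read_pstrings` are my own invention for this illustration): a four-byte little-endian count, followed by one length-prefixed string per block.

```python
import os
import struct
import tempfile

def write_pstrings(path, strings):
    """Write a 4-byte little-endian count, then length-prefixed strings."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(strings)))  # "<I" = little-endian uint32
        for s in strings:
            data = s.encode("latin-1")
            if len(data) > 255:
                raise ValueError("a Pascal string holds at most 255 bytes")
            f.write(bytes([len(data)]))  # 1 length byte
            f.write(data)                # n string bytes

def read_pstrings(path):
    """Read the file back: count first, then one string per block."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        result = []
        for _ in range(count):
            length = f.read(1)[0]  # one byte gives the string length
            result.append(f.read(length).decode("latin-1"))
    return result

# Round-trip demonstration through a temporary file.
demo_path = os.path.join(tempfile.gettempdir(), "pstrings_demo.bin")
write_pstrings(demo_path, ["Hello", "Revolution"])
print(read_pstrings(demo_path))
os.remove(demo_path)
```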

Here's the original script:

on mouseUp
   local tFilePath, tPStringCount, tPStringIndex, \
   tPStringLength, tPStringsA
   -- prepare for import
   put empty into field "ImportList"
   answer file "Select a PStrings file"
   if the result is "Cancel" then exit mouseUp
   put it into tFilePath
   open file tFilePath for binary read
   -- read the first 4 characters to see how many pstrings are
   -- in the file
   read from file tFilePath for 4 chars
   -- as it is little-endian, the first character is the least
   -- significant, so we multiply by ascending powers of 256
   put charToNum(char 1 of it) \
   + charToNum(char 2 of it) * 256 \
   + charToNum(char 3 of it) * 256 * 256 \
   + charToNum(char 4 of it) * 256 * 256 * 256 \
   into tPStringCount
   -- now read one pstring at a time
   repeat with tPStringIndex = 1 to tPStringCount
      -- one character to give us the length of the pstring
      read from file tFilePath for 1 char
      put charToNum(it) into tPStringLength
      -- and now we know how many characters to read
      read from file tFilePath for tPStringLength chars
      put it into tPStringsA[tPStringIndex]
   end repeat
   -- wrap things up
   close file tFilePath
   combine tPStringsA using return
   put tPStringsA into field "ImportList"
   answer "Done"
end mouseUp 

As I was already using 'binary read', I could concentrate on replacing the 'character' chunk keywords with the new 'byte' keyword. In just a few minutes, I had applied the rules and could test my update with the click of a button.

on mouseUp
   local tFilePath, tPStringCount, tPStringIndex, \
   tPStringLength, tPStringsA
   -- prepare for import
   put empty into field "ImportList"
   answer file "Select a PStrings file"
   if the result is "Cancel" then exit mouseUp
   put it into tFilePath
   open file tFilePath for binary read
   -- read the first 4 bytes to see how many pstrings are in
   -- the file
   read from file tFilePath for 4 bytes
   -- as it is little-endian, the first byte is the least
   -- significant, so we multiply by ascending powers of 256
   put byteToNum(byte 1 of it) \
   + byteToNum(byte 2 of it) * 256 \
   + byteToNum(byte 3 of it) * 256 * 256 \
   + byteToNum(byte 4 of it) * 256 * 256 * 256 \
   into tPStringCount
   -- now read one pstring at a time
   repeat with tPStringIndex = 1 to tPStringCount
      -- one byte to give us the length of the pstring
      read from file tFilePath for 1 byte
      put byteToNum(it) into tPStringLength
      -- and now we know how many bytes to read
      read from file tFilePath for tPStringLength bytes
      put it into tPStringsA[tPStringIndex]
   end repeat
   -- wrap things up
   close file tFilePath
   combine tPStringsA using return
   put tPStringsA into field "ImportList"
   answer "Done"
end mouseUp 

That's it - no other changes required! The modification was straightforward and consistent, and everything works just fine. I know that my code will continue to work in future versions, and I can enjoy the same linear speed that I have now with the 'character' chunk type, even when that part of the engine is overhauled to embrace Unicode.

If only all requested or necessary changes were so easy to implement...

Jan Schenkel is the developer behind Quartam Reports and Quartam PDF Library for Revolution. Find out more at www.quartam.com.
