Runtime Revolution
 

Quick Tips: Web Page Parsing

by Bill Marriott

In a parallel universe, they don't use web browsers to consume information from the web; they use a database client. As our counterparts in that universe click links, a massive data store on their hard drives is automatically populated with the latest prices on digital cameras, performance benchmarks on the latest processors, and the balances on their accounts. It's a data store that can be sorted, totaled, averaged, and graphed. In that universe, people make better decisions more quickly.

In our universe, however, the web is a medium designed primarily for casual human consumption, like a newspaper. Even though the vast majority of information we read online comes from databases, it is presented with formatting and other candy for the eye. Web browsers excel at displaying this formatted data, but really don't know much about the contents of pages we view.

As a result, when we are comparison shopping for a new widget, for example, we end up jotting down notes on which models have the best specifications and which stores have the best prices. The computer is merely an electronic catalog reader; it's not helping us do the actual comparison shopping. A database would enable us to tabulate all the specifications and prices we found and produce a nice, neat report.

If you're just researching these kinds of things once every few months or longer, you don't mind "online shopping." It's easier on your feet than real shopping, and maybe even kind of fun. But when data from the web needs to be captured and analyzed on an ongoing basis, using pen and paper becomes a real drag. You look at the various tables of data on your screen and begin to wonder how difficult it would be to get that information into a format you could actually use. With Revolution, it's easier than you imagine.

Processing HTML pages is generally the business of opening a web page and “digesting” the information contained within it to extract the data you need. It's also sometimes called “screen scraping” or “web scraping.” It requires being able to transform information designed for human eyes into the essential data, ignoring all the unwanted headings, formatting, labels, etc.

It's not an elegant process, but it's extremely useful in situations where something better is not possible:

  • you're working with legacy systems where the original source code is not available or maintainable
  • you do not have access to the back-end system (via XML or ODBC, for example)
  • the web page is the only output available
  • you're in a hurry

I add the last bullet point because it can often be easier to obtain results without going through the rigmarole of XML, SOAP, and the rest. Between Revolution's built-in internet functions and its powerful chunk expressions, web scraping is a snap.

As an example, I'm going to use Google Images. Here's a situation where you are using a search engine that queries a database and provides results in human-readable format. An API for elegantly implementing Google Search is available, but the technique I'm about to show you will give you results in the time it would take for you to sign up for API access.

  1. Open up a web browser to Google Images and search for "fish."

Your address bar will change to "http://images.google.com/images?hl=en&q=fish&gbv=2" and you'll see a bunch of pictures of fish.

  2. Open Revolution and create a new mainstack.
  3. Add a field to the stack. Make it large enough to fill the card. (You may also want to quickly visit the Geometry Manager to enable the field to grow as the stack is resized.)
  4. Set the field's Don't Wrap property to true.
  5. Open up the message box if it is not already open and enter the following text:

put url "http://images.google.com/images?q=fish" into fld 1

Revolution makes it easy to grab the HTML content of any web page. Just one line does it.

(I simplified the URL a little bit. I've experimented enough to have learned that ? marks the beginning of parameters to a web server-based program, and that multiple parameters are separated by & signs. Of the three parameters in the original URL, only the q=fish part was really needed.)
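One caution before going further: a url fetch can fail, and Revolution reports any download problem in "the result" after a blocking fetch. A minimal defensive version of the same fetch might look like this sketch:

on mouseUp
  put url "http://images.google.com/images?q=fish" into fld 1
  -- the result is non-empty if the download failed
  if the result is not empty then
    answer "Download failed: " & the result
  end if
end mouseUp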

You'll see the field you added fill up with the HTML source code for the page. What can we do with this? It looks like a mess! Well, a little study will show it's not as messy as it first appears.

The first thing I'm going to do to "tame" it is to put each HTML tag on its own line:

replace "<" with return & "<" in fld 1

This makes it a lot easier for you to read (without going through the work of a full formatting of HTML with indents, etc.) and also prepares the way for processing things on a line-by-line basis if necessary.
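For example, the standard way to walk text a line at a time in Revolution is the "repeat for each" form, which we could now apply to the reformatted source (the variable name is mine, and the test is just a placeholder):

repeat for each line tLine in fld 1
  if tLine begins with "<a " then
    -- inspect or collect each link tag here
  end if
end repeat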

Scanning through, I see what I'm looking for. For each image returned by the search there is a unique line, and it starts with "<a href=/imgres?imgurl=" in every instance. This is the key to the art of web scraping: looking for patterns and structure in the formatted information.

The link for each graphic tells us a lot of information about the picture. The HTML for it contains a unique "marker" that makes filtering for this information easier.

Trying another quick command from the message line:

filter fld 1 with "*href=/imgres*"

I press Return and watch as I'm left with 20 lines of HTML (which, voilà, matches the number of images shown in my web browser).
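You can double-check that count straight from the message box; counting lines is a standard chunk expression:

put the number of lines of fld 1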

What can I do with what's left? Each line appears to be a long jumble of HTML codes describing a link.

replace "=" with tab in fld 1
replace "&" with tab in fld 1

I use these two commands and things get spaced out very cleanly. To see what I mean, select the field and choose the "Table" panel in the properties palette. Turn on the vertical grid, and set the tabstops to 100.

Setting tab stops enables us to see how cleanly the HTML can be broken up to reveal useful information about the links shown.
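If you'd rather script those settings than click through the palette, the same options are exposed as field properties (a sketch, assuming the single 100-pixel tab stop described above):

set the vGrid of fld 1 to true
set the tabStops of fld 1 to 100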

You'll see a nice table now clearly showing useful data for each image:

  • column 3 is the image URL
  • column 5 is the page holding the image
  • columns 9 & 11 are the height and width of the original image

There's other information in there such as the location and size of the thumbnail image, as well. With just a few lines of code, we were able to transform this page into neat, tabular data.

The whole script might look something like this:

on mouseUp
  -- fetch the raw HTML of the results page
  put url "http://images.google.com/images?q=fish" into x
  -- one tag per line, keep only the image-link lines, then tab-delimit
  replace "<" with return & "<" in x
  filter x with "*href=/imgres*"
  replace "=" with tab in x
  replace "&" with tab in x
  -- show the finished table in the field
  put x into fld 1
end mouseUp

Now, with the link HTML converted into a handy table, one can simply set the itemDelimiter to tab, cycle through each line, and use the "item" chunk expression to pick out whichever tidbit of information is desired.
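Here is a minimal sketch of that loop. The column number follows the table we just built, so treat it as an assumption that will break if Google rearranges its markup; fld 2 is a hypothetical output field:

on mouseUp
  set the itemDelimiter to tab
  put empty into tURLs
  repeat for each line tLine in fld 1
    -- item 3 held the image URL in the table above
    -- (an observed position, not a guarantee)
    put item 3 of tLine & return after tURLs
  end repeat
  put tURLs into fld 2 -- hypothetical output field
end mouseUp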

For your finished application, you may want to take this forward a few steps:

  • Adding a field so the search term can easily be changed (see the sketch after this list)
  • Gathering information from several pages of Google at a time
  • Displaying the results in a tabular form, or perhaps icons with the thumbnails on them
  • A method for selecting "favorites" and adding them to a scrapbook stored locally on the user's computer
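For the first of these, the search term just needs to be URL-encoded before being spliced into the query string. A sketch, assuming a field named "searchTerm":

put urlEncode(fld "searchTerm") into tQuery
put url ("http://images.google.com/images?q=" & tQuery) into fld 1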

Not all information will be so easy to extract. In addition to the filter and replace commands, you may also want to become comfortable with "regular expressions." These are a set of symbols that take "wildcard" and "pattern matching" searches to the next level. You're probably familiar with using wildcards in DOS/Terminal mode or even in some word processors.

dir *.txt

lists all files with the extension "txt" in MS-DOS (most Unix shells use ls *.txt for the same job). And we've already seen this simple wildcard convention in use with the filter command. Regular expressions add a whole lot of new symbols and new capabilities.

As an example, I'll introduce the "replaceText" function. Go ahead and reset fld 1 in our practice stack to the contents of the Google HTML, using the "put url…" command again as above. Then try the following:

put replaceText(fld 1, "<.*?>", "") into fld 1

This will zap away every HTML tag in the source, leaving you with just the raw text plus some scripting and style-sheet code that isn't HTML markup. The "<.*?>" pattern is a regular expression that says, "match a <, then as few characters as possible, then the next >." The online help in Revolution contains a useful Regular Expressions syntax reference.
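A close cousin of replaceText is the matchText function, which can copy a regular-expression capture group straight into a variable. As a sketch, this would pull just the image URL out of one raw filtered link line (that is, before the = and & substitutions; tLine and tImageURL are my names):

if matchText(tLine, "imgurl=(.*?)&", tImageURL) then
  put tImageURL
end if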

As you experiment with "web scraping" you'll add more tools and procedures to your arsenal. And you'll continue to be amazed at how easy Revolution makes it for you to distill valuable data out of web pages formatted for human eyes.

 
©2008 Runtime Revolution Ltd, 15-19 York Place, Edinburgh, Scotland, UK, EH1 3EB.
Questions? Email info@runrev.com for answers.