Screen Scraping for Modern Times

While screen scraping is normally associated with mainframes and other technologies that send a cold shiver up the PC’s spine, sometimes it has its uses. In this case, a customer wanted driving distance information (and only driving distance information) to appear on a web site. Linking to a routing page with maps and everything was not appealing to the customer, and there are technical, learning-curve, and data currency issues with trying to do that with RoutemapIMS. As many online routing sites can figure out driving distance and time better than we can, why not let them do the heavy lifting? Enter the screen scrape. Sort of.

What we’re using here is the new Google Maps site. And we aren’t so much scraping the screen as we’re scraping the returned HTML, parsing it, and extracting the goodies we’re interested in. Let’s take a look at the code.

First, we’ll need two namespaces. System.Net has our WebClient (the thing pretending to be a browser, sending a request to a URL, and grabbing our HTML for us). System.Text has our UTF8 encoding type (turning the raw bytes that come back into a string of HTML we can work with).

<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Text" %>

If you do code-behinds, use “Imports” statements for these two namespaces rather than page-level Import directives. That’s all we’ll need to write the code. Next, in your sub or function, make a web client object and a UTF8 encoding object:

Dim objWebClient As New WebClient()
Dim objUTF8 As New UTF8Encoding()

Next, build the URL string, putting in your own two addresses. As Google Maps doesn’t break up the address components in the URL, it makes it fairly easy:

Dim strURL As String = ",charlotte&daddr=700+n+tryon+st,charlotte+nc&hl=en"

Then send the URL to your webclient object and return the result into a string:

Dim txtDist As String = objUTF8.GetString(objWebClient.DownloadData(strURL))

txtDist contains what you would see if you went to that URL and viewed the source. Now comes the only hard part in this process - parsing that information. For driving distance like we need here, Google Maps makes things pretty easy. It includes a segments tag with the information we need:

<segments meters="8482" seconds="556" distance="5.3 mi" time="9 mins">

So, all we have to do is drag it out of the rest of the code. Here we parse the returned HTML for the tag and strip it out:

'Make sure the segments tag exists. If it doesn't, one or both of your addresses weren't found.
If InStr(txtDist, "<segments") > 0 Then
    'Eliminate everything up to and including the < of the segments tag.
    txtDist = Right(txtDist, Len(txtDist) - InStr(txtDist, "<segments"))

    'Loop through the first 200 characters after the segments tag to get the ending >
    Dim i As Integer = 1
    Dim t As Integer
    Do While i < 200
        If Mid(txtDist, i, 1) = ">" Then
            'You found it, so set your position holder variable and break out of the loop.
            t = i - 1
            i = 200
        End If
        i = i + 1
    Loop

    'Set your text string to the contents of the segments tag you stripped out.
    txtDist = Left(txtDist, t)
Else
    'This is where you can run some code if no match was found on the Google site.
    Response.Write("We can't find you! Aaaaaargh!")
End If
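
If you'd rather skip the character-by-character loop, the same extraction can be done with the string methods built into .NET. This is only a sketch: the module and function names (SegmentsTagSketch, ExtractSegments) are made up for illustration, and it assumes the tag looks like the sample above.

```vb
Imports System

Module SegmentsTagSketch
    ' Hypothetical helper: returns the text between the "<" and ">" of the
    ' first segments tag, or an empty string if the tag is not in the page.
    Function ExtractSegments(ByVal html As String) As String
        Dim startPos As Integer = html.IndexOf("<segments", StringComparison.Ordinal)
        If startPos < 0 Then Return ""

        ' Find the first ">" at or after the start of the tag.
        Dim endPos As Integer = html.IndexOf(">"c, startPos)
        If endPos < 0 Then Return ""

        ' Skip the leading "<" so the result matches the loop-based version.
        Return html.Substring(startPos + 1, endPos - startPos - 1)
    End Function
End Module
```

An empty return value plays the same role as the Else branch above: one or both addresses weren't found, or the page layout changed.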

In this case, the resulting string will be:
segments meters="8482" seconds="556" distance="5.3 mi" time="9 mins"

We can parse out the information we need, in this case driving distance, using a split function or some similar device. You can check out Starnes’ Fire District site to see a good example in action.
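
Here's roughly what the split approach might look like. Treat it as a sketch rather than the code running on any particular site: GetAttribute and its module are hypothetical names, and it assumes every value in the tag is wrapped in plain double quotes, as in the string above.

```vb
Imports System

Module SegmentsAttributeSketch
    ' Hypothetical helper: returns the value of one attribute from the
    ' stripped-out tag text, or an empty string if the attribute is missing.
    Function GetAttribute(ByVal tagText As String, ByVal name As String) As String
        ' Splitting on the double quote gives alternating pieces:
        ' even indexes end with "name=", odd indexes hold the values.
        Dim parts() As String = tagText.Split(""""c)
        For i As Integer = 0 To parts.Length - 2 Step 2
            ' Pad with a space so "time" cannot match a longer name ending in "time".
            If (" " & parts(i)).EndsWith(" " & name & "=", StringComparison.Ordinal) Then
                Return parts(i + 1)
            End If
        Next
        Return ""
    End Function
End Module
```

With the string above in txtDist, GetAttribute(txtDist, "distance") returns 5.3 mi, and asking for "meters" returns 8482 if you'd rather do your own unit conversion.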

Screen scraping in .NET is fairly easy and powerful, but you should use it judiciously. Your application will be slowed by the response time of the other site, and even minor changes in the other site could make your screen scrape stop working altogether. Ask your PC if a particular case is a good place to use a screen scrape if you’re not sure.
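
One way to soften those risks is to cap how long you'll wait for the other site and to treat a failed request as "no answer" instead of letting it take your page down. The sketch below is an assumption-laden example, not a drop-in fix: TimedWebClient, TryDownload, and the five-second limit are all made up for illustration.

```vb
Imports System
Imports System.Net
Imports System.Text

' Hypothetical WebClient subclass that gives up after a fixed wait.
Class TimedWebClient
    Inherits WebClient

    Protected Overrides Function GetWebRequest(ByVal address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        ' Assumed limit of 5000 milliseconds; tune it to your own page.
        request.Timeout = 5000
        Return request
    End Function
End Class

Module SafeScrapeSketch
    ' Hypothetical helper: returns the page source as a string,
    ' or Nothing if the request fails or times out.
    Function TryDownload(ByVal url As String) As String
        Try
            Using client As New TimedWebClient()
                Return Encoding.UTF8.GetString(client.DownloadData(url))
            End Using
        Catch ex As WebException
            ' Covers timeouts, unreachable hosts, and error responses.
            Return Nothing
        End Try
    End Function
End Module
```

Your page can then check for Nothing and show a fallback message, the same way it would when the segments tag is missing.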