Three functions pull just the data you need from an HTML document
One task that Windows administrators need to accomplish occasionally is extracting specific information from an HTML document. The document might be a local file, a status page on a LAN-attached network device, a Web-accessible database report, or any one of a thousand other types of pages, but in every case, we face two problems with using data from these sources. The first problem is connecting to the Web page and reading the data the page contains. If a page isn't a static file accessible through a Windows share or file system somewhere on the managed network, we can't use standard tools such as the Scripting.FileSystemObject object to read it. We might even need to supply a username and password to the device serving the Web page. Once we solve that problem, we have an even larger one: How do we reliably extract the snippet of information we need from virtually unreadable raw HTML?
We can resolve both problems by using standard components available on any Windows Script Host (WSH)-capable workstation and a little bit of thinking. I'll demonstrate the general process by walking through a demo script that fetches a Web page from a DSL router and extracts the router's public IP address. I'll then distill the process into three generic functions we can use to extract specific information from a wide variety of pages.
Note that in my discussion, the words information and data each have a particular meaning. When I use information, I mean the particular thing we want to find out--in this case, the public IP address of the router. When I use data, I'm referring to the raw page material that contains the information as well as extraneous material we need to remove.
Retrieving the Data
Out of the box, Windows 2000 and newer OSs have multiple components that we can use to retrieve Web data. Earlier 32-bit versions of Windows typically do as well if they've had patches and feature packs installed since 1999. The most obvious component for retrieving data from a Web page, and the one I personally don't use, is Microsoft Internet Explorer (IE). Although IE has many uses in scripts, it's designed for the interactive display of material. IE has problems when used in nongraphical sessions, might throw errors that block progress, and potentially has side effects when used against arbitrary remote content. IE is also slow because it automatically retrieves and displays additional content such as embedded images.
The component I typically use to retrieve data over an HTTP connection is Microsoft's XMLHTTP requester in msxml.dll. We begin the process by creating a reference to the requester like this:
Set xml = CreateObject("Microsoft.XMLHTTP")
Using the XMLHTTP requester consists of three steps: opening a connection, sending the request, and retrieving the response. For this component, opening the connection also means specifying the connection details. The full form of the open method for Microsoft.XMLHTTP looks like this, with optional arguments shown in square brackets:
open(method, url, [async], [user], [pass])
The method argument is a string specifying the type of request we're making. For HTTP connections, this is typically "GET". The url argument is also a string, and should be a complete legal URL if we're requesting remote content.
The url argument works with local file content as well, and in those cases, just the full path to the file will work--there's no need to add a file:// prefix to the file path.
In this case, we just use the URL we see in the browser when looking at the remote configuration page we want. The particular router I'm working with, a HomePortal 1800HG DSL router used in many home and small offices for Internet access, shows its public IP address on the page http://10.1.1.1/?PAGE=B01, so this will be the URL we specify.
The optional async argument tells the requester whether it should wait for the response (synchronous behavior--a False value) or continue on to the next line of code immediately after the request is sent (asynchronous behavior--a True value). If not specified, the value defaults to True, which isn't what we want. When we send our request, we want our script to wait for a response because the next task processes the results. So we need to specify False.
The next two arguments are extremely useful if you need to access restricted resources--just be careful about hard-coding them in a script. If you need to supply a username and password to access a resource, you can specify these as the user and pass arguments so that your request doesn't return an authentication error. You don't need to specify them if the resource doesn't need authentication, or you can specify an empty string for the arguments by using vbNullString or a set of empty double quotes ("").
We now have all the information necessary to open the connection, so we can add the following line of script to our code:
xml.open "GET", "http://10.1.1.1/?PAGE=B01", False
Although we've configured and opened the connection, we haven't actually made a request yet. To do that, we call the send method:
xml.send
As soon as the send is completed, we can get the page data back by reading the requester's responseText property:
data = xml.responseText
We now have the entire page available as data in our script. Although my explanation is lengthy, the actual code required is only four lines. As you'll see when I wrap the code up in a function at the end of the article, we can call it with a single line of code for repeated use.
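As a rough illustration of how the four lines might be wrapped for reuse, here is a minimal sketch. The function name GetWebXml is borrowed from later in this article, but the parameter list and the body are assumptions made for illustration, not the article's own listing. The sketch passes credentials only when a username is supplied.

```vbscript
' Hypothetical wrapper around the four lines discussed above.
' The (url, user, pass) signature is an assumption for illustration.
Function GetWebXml(ByVal url, ByVal user, ByVal pass)
    Dim xml
    Set xml = CreateObject("Microsoft.XMLHTTP")
    ' False = synchronous: the script waits for the response.
    If Len(user) > 0 Then
        xml.open "GET", url, False, user, pass
    Else
        xml.open "GET", url, False
    End If
    xml.send
    GetWebXml = xml.responseText
End Function
```

With a wrapper like this in place, a call such as data = GetWebXml("http://10.1.1.1/?PAGE=B01", "", "") would fetch the router page in one line.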
Now we have our data, but we still have some work to do. We're after one IP address. When we look at the router configuration page in a browser, we see about a dozen lines of text containing about five actual bits of hard data, including the IP address shown as follows:
Internet Address: www.xxx.yyy.zzz
The data returned from our request, however, includes a lot of "noise" that has to be removed before we have what we want. There are more than 200 lines of text and roughly 1250 words, and we need to find what is basically a single word in this mass.
Filtering the Noise
It would be nice if we could filter out some of this noise quickly. In theory, we could eliminate HTML and XML tags by loading the data into an XmlDocument or HtmlDocument object and then requesting just the document's text, but problems arise if the data doesn't exactly fit the requirements of the object.
In general, I've found that the risks outweigh the benefits for the following reasons. An XmlDocument object would work for our well-crafted router configuration page, but if a page isn't a well-formed XML document, parsing will fail. An HtmlDocument object would give us a nice way to filter out unneeded content, but it has significant risks for breaking our script. If a page has odd data--including some content that renders without errors in IE--an HtmlDocument object might produce a blocking GUI error box. Furthermore, scripts in a page might still run in the virtual document and cause odd behavior.
The best solution is to use regular expressions. Although we could find the public IP address with one short regular expression, that would be cheating in the same way that using an XmlDocument object would be: it conveniently works in this case but would be useless for most other tasks. Thus, it's better to use regular expressions to clean up generic Web-like data to the point at which a much simpler regular expression or even a basic string search will let us find the information we want quickly.
We start by creating a reference to the VBScript Regular Expression engine. Because we want to work with patterns that might appear in many lines throughout our data, we then turn on the multiline and global support:
Set rx = New RegExp
rx.Multiline = True
rx.Global = True
Our first step toward isolating our target information is to remove all HTML tags from the data. Because HTML and XML tags always begin and end with angle brackets (<>), we can easily identify them by using the following pattern:
rx.Pattern = "<[^>]+>"
If you use regular expressions frequently, you recognize that this pattern looks for strings that begin with < and end with >, but the portion of the expression that's within the angle brackets (i.e., [^>]+) might be puzzling. When used within the character-set markers [ and ], the caret (^) character means match any characters not in this set. The plus (+) character, which is more familiar to "regex" users, means match any sequence of one or more characters in the set. So [^>]+ just means match any characters that aren't >. Thus, the entire <[^>]+> pattern matches any sequence that begins with <, ends with >, and doesn't have > inside it.
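To see what the pattern actually picks up, the following sketch runs it against a made-up table cell and collects every match. The sample string and the variable names found and m are assumptions for illustration only.

```vbscript
Dim rx, m, found
Set rx = New RegExp
rx.Global = True
rx.Pattern = "<[^>]+>"

' Collect each matched tag, separated by a pipe character.
found = ""
For Each m In rx.Execute("<td class=""x"">Internet Address:</td>")
    found = found & m.Value & "|"
Next
' found is now: <td class="x">|</td>|
```

Note that the pattern has no notion of HTML syntax. Any text that contains a bare < followed later by a > (for example, in an inline script) will be treated as a tag as well, which is usually acceptable for this kind of rough cleanup.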
Once we've identified the tags, all we need to do to remove them is to use a regular expression replacement, substituting an empty string for each tag:
data = rx.Replace(data, "")
If we peek at a typical Web page's data after such a replacement, we generally find that a lot of data remains, but most of it is white space: a mixture of tabs, standard spaces, and line terminations. We'll also frequently see special HTML character entities. Although these aren't significant in the case of the router Web page, one of them is worth cleaning up to make searching easier: the nonbreaking-space entity, typically encoded in Web pages as "&nbsp;". Although we could use a regular expression for this replacement as well, it's just as easy to use VBScript's Replace function. We can replace each of these entities with a single space by using the following line:
data = Replace(data, "&nbsp;", " ", 1, -1, 1)
The last step in our generic clean up is to condense all the sequences of white space in our data into single spaces so that our search string doesn't need to be concerned about just how many or what type of nonprinting characters separate specific words. To translate white space to single spaces, we specify a new regular expression pattern that matches one or more white-space characters of any type:
rx.Pattern = "\s+"
Then we substitute a single space for every occurrence we encounter:
data = rx.Replace(data, " ")
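Pulled together, the three cleanup steps could be packaged along the following lines. This is only a sketch: the name CleanTaggedText is borrowed from later in this article, while the single-argument signature and the body are assumptions rather than the article's listing.

```vbscript
' Hypothetical packaging of the cleanup steps described above.
Function CleanTaggedText(ByVal text)
    Dim rx, result
    Set rx = New RegExp
    rx.Multiline = True
    rx.Global = True

    ' Step 1: remove anything that looks like a tag.
    rx.Pattern = "<[^>]+>"
    result = rx.Replace(text, "")

    ' Step 2: turn nonbreaking-space entities into plain spaces
    ' (final argument 1 = case-insensitive text comparison).
    result = Replace(result, "&nbsp;", " ", 1, -1, 1)

    ' Step 3: collapse every run of white space into one space.
    rx.Pattern = "\s+"
    result = rx.Replace(result, " ")

    CleanTaggedText = result
End Function
```

Because tags are replaced with an empty string, text from adjacent elements that has no white space between the tags will run together. If that matters for a particular page, replacing tags with a single space instead is one possible variation.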
The Crucial Information
Most of the remaining work will be specific to the particular Web page from which we're extracting information, but we can do it with only two or three lines of code in most cases. Because most network monitoring or management tasks involve repeatedly accessing only a handful of different pages, a minor time investment in creating a simple extraction routine will pay off quite well.
Figure 1 shows a subset of data from my router configuration page after removing HTML tags and excess white space. This isn't the full page--I trimmed material to make the extract short--but the trick we'll use to get our information will work fine with long data sets too. The trick is to look at the data immediately before what we're after and pick out the shortest bit of it that's always on the page but only appears once.
For example, we're after the Internet address: the obviously bogus value 256.261.381.125. A colon and space immediately precede the address, but that set of characters isn't unique in the data. If we reach further back, we get "Address: ", but that set of characters isn't unique either. However, "Internet Address: " is unique. So we split the data at "Internet Address: ", which gives us an array of two strings, one containing the text before "Internet Address: " and one containing the text after. We need just the second string, which has an index of 1, so we do this:
data = Split(data, "Internet Address: ")(1)
Now our job is simple. The first character after the information we want is a space, and because our IP address should never have a space, we can split the data at the space. We keep the first string of the two-string array that Split returns (the first string has an index of 0), like this:
data = Split(data, " ")(0)
and are left with the value 256.261.381.125.
To see the power of this approach for repeated use, look at Listing 1. I've turned each of the major technical steps--retrieving the page, removing tags and extra white space, and isolating the information substring we want--into a separate function, all shown in callout B. With the functions implemented, extracting the IP address takes just the three lines of code in callout A.
If you read through the code, you'll notice that I use VBScript's Trim function in the GetSubString function even though I didn't use it in this example. Trim simply removes spaces at the beginning and end of text. I use it in the function so you don't need to be precise about including trailing and leading spaces in strings you use to extract data.
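For readers without the listing at hand, a function of this kind might look roughly like the sketch below. The name GetSubString comes from this article, but the parameter names, the guards against missing markers, and the fallback to a single space are assumptions made for illustration.

```vbscript
' Hypothetical sketch: return the text found between two markers.
Function GetSubString(ByVal text, ByVal startText, ByVal endText)
    Dim parts, s, e
    GetSubString = ""

    s = Trim(startText)
    e = Trim(endText)
    ' A marker consisting only of spaces trims to nothing,
    ' so fall back to a single space as the end marker.
    If Len(e) = 0 Then e = " "

    ' Keep what follows the first occurrence of the start marker.
    parts = Split(text, s, -1, 1)
    If UBound(parts) < 1 Then Exit Function

    ' Keep what precedes the first occurrence of the end marker.
    parts = Split(Trim(parts(1)), e, -1, 1)
    If UBound(parts) >= 0 Then GetSubString = Trim(parts(0))
End Function
```

Called as GetSubString(data, "Internet Address:", " ") on the cleaned router data, a function like this would return just the address. It returns an empty string when the start marker isn't found, so the caller can detect a changed page layout instead of hitting a subscript error.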
The functions also make it easier to adapt the code to your own situation. If you already know how to extract the material from a page using regular expressions and want to do it your own way, you can use just the GetWebXml function to retrieve data and then process it as you wish. If you want more work done for you but have special needs such as extracting multiple strings, you can use the CleanTaggedText function and ignore the GetSubString function. Finally, if you definitely need just one item from a page and have consistent unique text before and after the item, you can use all three functions as I've done to get the information you need.