
Using Regular Expressions

Regular expressions—text strings that define a pattern—are an old acquaintance for seasoned programmers who have worked with UNIX shells or Web developers who have worked with Perl or JavaScript. Although Perl and JavaScript have natively supported regular expressions for a long time, regular expressions are new to VBScript. Microsoft added support for regular expressions in VBScript 5.0 and is expanding their capabilities in VBScript 5.5. (A free beta version of VBScript 5.5 is available from http://msdn.microsoft.com/scripting/.)

The introduction of regular expressions to VBScript is an important advancement. To use regular expressions, you first must understand what they are in general and how they work in VBScript. With this basic understanding, you can use regular expressions to perform advanced text processing, such as search-and-replace and search-and-modify operations.

Understanding Regular Expressions
By means of special functions, you apply a regular expression to given text and verify whether the text matches the pattern. To use a regular expression, you set the pattern, then test whether the text matches that pattern. The test returns a Boolean value: True if a match occurs and False if it doesn't.

You're probably using simple regular expressions already. When you type an MS-DOS command such as

Dir *.*

you're processing all the files and directories that match the *.* pattern. Thus, *.* is a simple regular expression that encompasses all the filenames formed by two strings (of any length and format) that are separated by a dot. The asterisk in this regular expression is a metacharacter. Metacharacters are characters that have a special meaning in a set of regular expressions. Metacharacters play roughly the same role as keywords in scripting languages.

The MS-DOS command prompt supports a limited set of regular expressions commonly called wildcard expressions. The MS-DOS wildcard expressions include only two metacharacters: * (represents any variable-length combination of letters and digits) and ? (represents a single occurrence of one character chosen from the set of letters and digits).

In applications, you typically use regular expressions that consist of a combination of metacharacters and constant strings. For example, consider the MS-DOS command

Dir a*.exe

which lists all the filenames that begin with the letter a and end with the .exe extension. The regular expression a*.exe consists of one metacharacter (*) and two constant strings (a and .exe). The MS-DOS command

Dir a?b*.exe

uses a regular expression that has two metacharacters (? and *) and three constant strings (a, b, and .exe). When you use this regular expression with the Dir command, you receive a list of filenames that begin with the letter a, have the letter b as the third character, and end with the .exe extension, such as abb.exe or axb.exe.

As these examples show, metacharacters are key to regular expressions. Each metacharacter has a special meaning that affects the way in which the underlying regular expression processor applies the pattern to the text you're checking.

A scripting language that fully supports regular expressions defines a complex set of metacharacters and has an adequate regular expression processor to process that set. To be considered adequate, the regular expression processor must, at minimum, be able to test a string against a pattern and have a function that automatically replaces the portions of the text that match the pattern.

In JavaScript and a few other scripting languages, the regular expression processor is built into the language's interpreter. As a result, you can't use JavaScript's regular expressions outside a JavaScript program. In VBScript 5.0 and later, however, the regular expression processor is a COM object called RegExp. As a result, the support for VBScript's regular expressions is language independent. You can use VBScript's regular expressions in any development environment that supports COM, including Visual Basic (VB), Visual C++ (VC++), and Active Server Pages (ASP).

Using VBScript Regular Expressions
To use VBScript regular expressions, you need to create an instance of the RegExp object. One way is to use the New keyword in code such as

Set regexp = New RegExp

Another way is to use the CreateObject function with RegExp's programmatic identifier (ProgID) VBScript.RegExp in code such as

Set regexp = CreateObject _
("VBScript.RegExp")

After you create an instance of RegExp, you use that object's Pattern property to set the pattern you want to match text against. The pattern can contain any combination of constant strings and valid metacharacters. Table 1 contains commonly used metacharacters. (You can find a complete list of metacharacters at http://msdn.microsoft.com/scripting/vbscript/doc/vspropattern.htm.) Note that MS-DOS's ? metacharacter differs in meaning from VBScript's ? metacharacter. Metacharacters' meanings are not universal among metacharacter sets.
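To make the difference between the two metacharacter sets concrete, the following sketch expresses roughly what the MS-DOS wildcard a?b*.exe means as a VBScript pattern. The helper name MatchesWildcardLike and the sample filenames are illustrative assumptions, and the sketch uses WScript.Echo, so it assumes you run it under Windows Script Host. The translation is only approximate because MS-DOS applies a few extra rules to wildcards.

```vbscript
Option Explicit

' Hypothetical helper: tests a filename against a pattern that roughly
' mirrors the MS-DOS wildcard a?b*.exe.
Function MatchesWildcardLike(fileName)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = True   ' MS-DOS filenames are not case sensitive
    ' ^ and $  anchor the pattern to the whole filename
    ' .        any single character (the role of the MS-DOS ?)
    ' .*       any run of characters (the role of the MS-DOS *)
    ' \.       a literal dot
    re.Pattern = "^a.b.*\.exe$"
    MatchesWildcardLike = re.Test(fileName)
End Function

WScript.Echo MatchesWildcardLike("axb.exe")    ' True
WScript.Echo MatchesWildcardLike("abbot.exe")  ' True
WScript.Echo MatchesWildcardLike("ab.exe")     ' False: b is not the third character
```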

When you specify the pattern, you must enclose it in double quotes. For example, the code

regexp.Pattern = "\w+\@\w+\.\w+"

sets a pattern that defines a simple email address: a word (\w+), the at sign (\@), another word, a dot (\.), and a final word. A word is a non-null sequence of letters, digits, and underscores. You don't need to use metacharacters in the pattern. The pattern can be a constant string such as "dino@win2000mag.com".

After you set the pattern, you can use RegExp's Test method to apply the regular expression against any text you want. For example, the code

buf = "dino@win2000mag.com"
If regexp.Test(buf) Then
    MsgBox "The string matches."
End If

tests whether the text in the buf variable (dino@win2000mag.com) matches the pattern. If a match occurs, you receive the message The string matches.
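One detail worth seeing in code is that the Test method succeeds when any part of the text matches the pattern. The sketch below, which assumes Windows Script Host for WScript.Echo and uses a made-up sample sentence and the variable name re, shows how the ^ and $ metacharacters restrict the match to the whole string.

```vbscript
Option Explicit

Dim re, buf
Set re = New RegExp
buf = "Write to dino@win2000mag.com today"   ' illustrative sample text

' Unanchored pattern: a match anywhere in the text is enough.
re.Pattern = "\w+\@\w+\.\w+"
WScript.Echo re.Test(buf)                     ' True

' Anchored pattern: the whole string must be an address.
re.Pattern = "^\w+\@\w+\.\w+$"
WScript.Echo re.Test(buf)                     ' False
WScript.Echo re.Test("dino@win2000mag.com")   ' True
```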

RegExp's Methods and Properties
Table 2 outlines all the RegExp object's methods and properties. Overall, the Microsoft Developer Network (MSDN) documentation does a good job of explaining the methods' and properties' syntax. However, errors exist in the documentation for the Global and IgnoreCase properties.

Global is a read/write property that tells the regular expression processor to search for all occurrences of the pattern or search for only the first occurrence. If you want all occurrences, you set Global to the Boolean value of True. If you want only the first occurrence, you set Global to False. The MSDN documentation states that the default value for Global is True. However, the real default value is False. Thus, if you want to search for all occurrences, you must set Global to True.

Similarly, the MSDN documentation for IgnoreCase is incorrect. IgnoreCase tells the regular expression processor whether a pattern search is case insensitive or case sensitive. For case-insensitive searches, you set IgnoreCase to True. For case-sensitive searches, you set IgnoreCase to False. The MSDN documentation erroneously states that True is the default setting. However, the real default value is False. Therefore, if you want case-insensitive searches, you must set IgnoreCase to True.
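You can check the effect of both defaults with a short experiment. The following sketch is only an illustration: the sample string and the variable names are assumptions, and it uses the RegExp object's Replace method and WScript.Echo under Windows Script Host.

```vbscript
Option Explicit

Dim re, text
Set re = New RegExp
text = "Aa aA"       ' illustrative sample text
re.Pattern = "a"

' Defaults (Global = False, IgnoreCase = False):
' only the first lowercase a changes.
WScript.Echo re.Replace(text, "x")   ' Ax aA

' Global = True: every lowercase a changes.
re.Global = True
WScript.Echo re.Replace(text, "x")   ' Ax xA

' IgnoreCase = True as well: uppercase A matches too.
re.IgnoreCase = True
WScript.Echo re.Replace(text, "x")   ' xx xx
```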

Searching and Replacing Text
If you often use VBScript code to search and replace text in documents, regular expressions will come in handy. Before VBScript 5.0, you had to walk through each document's text and use many string manipulation functions such as Mid, InStr, and Replace to search and replace text. Because of the repeated calls to these functions, the code was often cumbersome to write and read and slow to run. With regular expressions, you can write compact, understandable, and fast-running code.

Using regular expressions for search-and-replace operations is simple. For example, suppose you want to replace all occurrences of the letter a with the letter x in a text string. As the code in Listing 1 shows, you first create an instance of the RegExp object. You then explicitly set the Global and IgnoreCase properties to True because you want to search for and replace all occurrences of the pattern and you want a case-insensitive search. Next, you set the buf variable to the text string you want to manipulate (i.e., aoAoa) and use RegExp's Pattern property to set the pattern you're searching for (i.e., the letter a). Finally, you use RegExp's Replace method to replace each occurrence of the letter a with the letter x and assign the resulting string to the str variable, which you display on screen.
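The steps just described can be sketched as follows. This is a reconstruction from the description, not the published Listing 1: the variable names re and result are assumptions, and the sketch displays the result with WScript.Echo, so it assumes Windows Script Host.

```vbscript
Option Explicit

Dim re, buf, result

' Create the RegExp object and ask for all matches, ignoring case.
Set re = New RegExp
re.Global = True
re.IgnoreCase = True

' The text to manipulate and the pattern to look for.
buf = "aoAoa"
re.Pattern = "a"

' Replace every match with the letter x and show the result.
result = re.Replace(buf, "x")
WScript.Echo result   ' xoxox
```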

Although the Replace method works correctly in this example, I don't recommend using this method for such simple substitutions of strings. (I used it for illustration purposes only.) A simpler way to achieve the same result is to use VBScript's Replace function

Replace(expression, find, replacewith[, start[, count[, compare]]])

which I covered in my November 1999 column. In this example, the code would be

str = Replace(buf, "a", "x", 1, -1, 1)

The Replace function is not only easier to use but also lets you specify where to start replacing strings. The Replace function even gives you better control of how many occurrences you want to replace. Thus, when you need to replace strings that match a simple constant pattern, using the Replace function is preferable to using regular expressions. However, when you need to replace strings that match complex patterns, the Replace function is no longer adequate. You need to use regular expressions instead.
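The following sketch shows how the optional arguments of the Replace function change the result for the same sample string. It assumes Windows Script Host for WScript.Echo, and it uses the built-in vbTextCompare constant in place of the literal value 1.

```vbscript
Option Explicit

Dim buf
buf = "aoAoa"

' All occurrences, case-insensitive comparison.
WScript.Echo Replace(buf, "a", "x", 1, -1, vbTextCompare)   ' xoxox

' Only the first occurrence (count = 1).
WScript.Echo Replace(buf, "a", "x", 1, 1, vbTextCompare)    ' xoAoa

' Default comparison is case sensitive, so the A survives.
WScript.Echo Replace(buf, "a", "x")                         ' xoAox

' Start at position 3: the returned string also begins there.
WScript.Echo Replace(buf, "a", "x", 3, -1, vbTextCompare)   ' xox
```

Note that in the last call the returned string begins at the start position, so the first two characters of buf are not part of the result.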

Searching and Modifying Text
Regular expressions are handy for searching and modifying text. For example, suppose you have text that contains URLs expressed in the short form without the protocol signature (e.g., www.expoware.com instead of http://www.expoware.com) and you want to automatically insert the http:// prefix in all of them. The text you're searching is "If you comply with regular expression, go to www.regexp.com or www.re.com". To trap all the URLs, you can set the pattern

regexp.Pattern = "www\.\w+\.\w+"

The www\.\w+\.\w+ pattern traps all the substrings that start with the www. expression and are followed by two dot-separated words. Each dot is escaped (\.) so that it matches a literal dot; an unescaped dot is a metacharacter that matches any single character. (This example doesn't consider URLs with more dots, such as www.xxx.co.uk. I'll discuss these URLs next month.)

At this point, you might be tempted to use the Replace method to add the http:// prefix. However, this method won't work. Although the Replace method would correctly find all the matching strings, it would replace them with the same constant string (i.e., the same URL). Instead, you need to modify, not replace, each matching string. To modify text, you use RegExp's Execute method.

The Execute method performs a regular expression search against a specified pattern, finds matches, and returns those matches in the Matches collection object. A Match object represents each match in the collection object. The Match object has three properties:

  • Value—returns the matching substring
  • Length—returns the matching substring's length
  • FirstIndex—returns the matching substring's position within the original text
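As a minimal sketch of how these properties can be read, the code below runs the Execute method against the sample sentence and lists each match. The variable names are assumptions, the pattern escapes every dot so that it matches a literal dot, and WScript.Echo assumes Windows Script Host. FirstIndex counts from zero, unlike VBScript string functions such as Mid, which count from 1.

```vbscript
Option Explicit

Dim re, matches, m, text
text = "If you comply with regular expression, go to www.regexp.com or www.re.com"

Set re = New RegExp
re.Global = True                  ' without this, Execute stops at the first match
re.Pattern = "www\.\w+\.\w+"

' Execute returns a Matches collection of Match objects.
Set matches = re.Execute(text)
WScript.Echo matches.Count & " matches found"

For Each m In matches
    WScript.Echo m.Value & " starts at index " & m.FirstIndex & _
        " and is " & m.Length & " characters long"
Next
```

For this text, the collection holds two Match objects whose Value properties are www.regexp.com and www.re.com.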

As Listing 2 shows, you can use the Execute method to search the text and capture the matching URLs in the Matches collection. You can then use a For Each...Next statement to walk through each Match object (i.e., each URL) in the Matches collection and read the Match object's Value property to obtain the URL.

At this point, you can apply the Replace function. However, the Replace function has a quirk: When you specify a start position greater than 1, the function returns only the part of the string that begins at that position and drops the characters that precede it. In other words, unless a URL sits at the very beginning of the text, you'll receive only a portion of the text and not the entire string. For this reason, you must first save the characters that the Replace function will drop before you use that function. As callout A in Listing 2 shows, you use the Left function with the FirstIndex property to extract those characters and assign them to the temp variable. You then concatenate the temp variable and the string that the Replace function returns to obtain the full text. Without regular expressions, you can't obtain this result so easily.
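The technique can be sketched as follows. This is an illustration written from the description, not the published Listing 2: the function name AddHttpPrefix and the variables head, tail, pos, and offset are assumptions. The offset variable is my addition; it compensates for the characters that earlier insertions add, because FirstIndex refers to positions in the original text. The sketch also assumes that no URL in the text already carries the http:// prefix.

```vbscript
Option Explicit

' Hypothetical helper: returns a copy of text in which every short-form
' URL (www.word.word) is prefixed with http://.
Function AddHttpPrefix(text)
    Const PREFIX = "http://"
    Dim re, matches, m, result, offset, pos, head, tail

    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = "www\.\w+\.\w+"
    Set matches = re.Execute(text)

    result = text
    offset = 0
    For Each m In matches
        ' Zero-based position of this match in the updated string.
        pos = m.FirstIndex + offset
        ' Save the characters that Replace drops when start > 1.
        head = Left(result, pos)
        ' Replace only this occurrence, starting where the match begins.
        tail = Replace(result, m.Value, PREFIX & m.Value, pos + 1, 1)
        result = head & tail
        offset = offset + Len(PREFIX)
    Next

    AddHttpPrefix = result
End Function

WScript.Echo AddHttpPrefix("If you comply with regular expression, " & _
    "go to www.regexp.com or www.re.com")
```

Under Windows Script Host, the last statement displays the sentence with both URLs rewritten as http://www.regexp.com and http://www.re.com.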

What's Next
Data validation is another area in which you can exploit the full power of regular expressions. Each time you need users to enter formatted data, you can define the input mask as a pattern and have RegExp parse the data for you. Next month, I'll show you how to enhance the InputBox function to make it support regular expressions and automatic data validation. I'll also show you how to use runtime code evaluation with text processing to create an improved version of the Replace function. With this subroutine, you can use pattern matching to identify candidates for replacement and runtime code evaluation to execute special code on each match.
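As a small preview of the validation idea, the sketch below treats an input mask as an anchored pattern. The helper name IsFiveDigitCode and the five-digit format are purely illustrative assumptions and aren't taken from the upcoming column.

```vbscript
Option Explicit

' Hypothetical helper: True when value consists of exactly five digits.
Function IsFiveDigitCode(value)
    Dim re
    Set re = New RegExp
    ' ^ and $ force the whole input to match; \d{5} means five digits.
    re.Pattern = "^\d{5}$"
    IsFiveDigitCode = re.Test(value)
End Function

WScript.Echo IsFiveDigitCode("80525")    ' True
WScript.Echo IsFiveDigitCode("8052")     ' False: too short
WScript.Echo IsFiveDigitCode("80525a")   ' False: trailing letter
```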