Process the data in delimited files more easily than with string parsing
| Executive Summary:|
Although you can use a TextStream object to open a plain-text file, it's not advantageous because you can only read the file from the beginning. However, you can use ADO, which enables you to use the Microsoft Jet OLE DB text driver to parse the contents of a delimited file, to read delimited files. The Jet OLE DB text driver uses the registry to determine the format of the delimited file.
Reading delimited files is a common scripting task. For example, you might want a script to process a list of users and their email addresses that you exported to a delimited text file using the Microsoft Management Console (MMC) Active Directory Users and Computers snap-in.
Delimited files are plain-text files that often represent the contents of a database table. The data on each line in the file is separated by a delimiter (e.g., a comma or a tab). Figure 1 shows a sample delimited file in which the delimiter is a comma. This type of delimited file is known as a comma-separated values (CSV) file. The first line of the file, which is referred to as the header line, names the fields (columns) in the table. The subsequent lines contain the table's records (rows). In this example, the field names and record data are enclosed in double quotes (").
Of course, you can use the FileSystemObject's TextStream object to open plain-text files, but parsing the contents of delimited files presents some interesting problems. For example, you could open the file and use VBScript's Split function to split each line into fields, but doing so can present a problem if the line contains quoted data with an embedded delimiter. For example, consider the following comma-delimited address data:
"13100 Ellison Rd. NW","Albuquerque, NM","87114"
Now suppose you use the following line of VBScript code to parse the line:
Data = Split(Address, ",")
In the above line of code, assume the Address variable contains the address text mentioned above. If you use this code, the Data variable will contain the four-element array shown in Figure 2. The reason is that the Split function doesn't take the double quotes into account. Embedded delimiters are very common when dealing with AD distinguished names (DNs).
Another problem with trying to use a TextStream object to handle database data is that you can read the file only starting from the beginning. If you need to go back to a previous line, you must close the file and open it again, which can be time-consuming if you've got a large text file.
The ADO Solution
Rather than deal with these limitations, you can use ADO to read delimited files. ADO lets you use the Microsoft Jet OLE DB text driver to parse the contents of a delimited file. The text driver treats a delimited file as if it were a database table. The following steps provide a general overview of what you must do to use ADO to read a delimited file:
1. Create a Connection object to establish a connection to the datasource. When dealing with a delimited file, the datasource is the delimited file's directory.
2. Create a Recordset object and query the text file using a SQL query.
3. Iterate the Recordset object to obtain the query's results.
Let's take a more detailed look at these steps. Listing 1 shows a VBScript example of how to read the data from Sample.csv (shown in Figure 1). At the top of Listing 1, the script declares its constants and variables, and then it specifies the directory and filename for the delimited file. (The directory and filename are specified separately for reasons that I'll discuss later in the article.) Callout A shows how GetAddresses.vbs creates a Connection object and executes the Open method. The Open method's parameter is a connection string that describes the location and nature of the datasource, and it uses a semicolon-delimited list of property=value pairs. The Provider property is always Microsoft.Jet.OLEDB.4.0, and the Data Source property is the directory containing the delimited file. (This is why the script declares the CSV file's directory separately—the Data Source property must be a directory name. Declaring the directory separately also provides you with the flexibility of querying more than one delimited file using a single Connection object.)
The Extended Properties property is a semicolon-delimited string, and it must be enclosed in single or double quotes. The first argument in the string is Text, which tells the data provider (i.e., Jet OLE DB) to use its text driver. Next, the Hdr parameter (which must be Yes or No) specifies whether the delimited file has a header line. If you use Hdr=No, the data provider names the fields F1, F2, F3, and so forth. The default value is Hdr=Yes.
In the past, I've seen an additional Fmt=Delimited argument inside the Extended Parameters string, which seems to be a way of telling the data provider the type of delimited file you're using. (For example, Fmt=TabDelimited.) However, as far as I can tell, the Fmt parameter is ignored, and the text driver will always use the setting specified in the registry or from the Schema.ini file, if it exists. I'll describe both the registry location and the Schema.ini file shortly.
Next, you need to use a Recordset object to query the delimited file. Callout B shows how GetAddresses.vbs creates the Recordset object and calls its Open method. The Open method's first parameter is a SQL statement, and its second parameter is the Connection object. The correct values for the final three parameters are specified as constants at the top of the script. (For more information about these parameters, see http://msdn2.microsoft.com/en-us/library/ms675544.aspx.) GetAddresses.vbs uses the following SQL statement to read the contents of the delimited file:
SELECT * FROM \[File\] ORDER BY LastName
File is the CSV file's name as specified at the top of the script. The script uses square brackets (\[\]) around the file's name in case it contains spaces. This query selects every record in the table and sorts the results by the LastName field. Because it's a SQL query, it gives you lots of flexibility in retrieving data from the text file. For example, in a SQL query, you can use SELECT DISTINCT to return only unique records, WHERE to specify a filter, and ORDER BY to sort the results. See msdn2.microsoft.com/en-us/library/ms690481.aspx for information about how to construct a SQL query.
After calling the Open method, the Recordset object contains fields and records in a table format. Because GetAddresses.vbs queries the Sample.csv file shown in Figure 1, the Recordset object will look like the table shown in Figure 3. (Note that the Recordset object's contents are sorted by the LastName column, as specified by the SQL query.) The column headers are the fields' names, and the rows are the individual records. You can navigate through the Recordset object by using the MoveFirst, MoveNext, MovePrevious, and MoveLast methods. The end of file (EOF) property returns a value of True if the Recordset object contains no records or if you've moved past the last record (e.g., by using the MoveNext method).
To access the data in an individual field in the current record, use the following syntax:
In this command, Recordset is the Recordset object, and fieldname is the name of the field. Callout C shows how the GetAddresses.vbs script displays the data from the CSV file. The Do Until loop executes when the Recordset object's EOF property is False, and the WScript.Echo method echoes the data for the current record.
Controlling the Text Driver
As I mentioned previously, the Jet OLE DB text driver uses the registry (or the Schema.ini file, which I'll describe shortly) to determine the format of a delimited file. These values are stored in the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Text registry subkey. Table 1 lists the two main values (i.e., DisabledExtensions, Format) that you might need to modify. For security reasons, the text file driver allows only certain file extensions to be considered as delimited files, so if you need to work with a delimited file that has an extension that's not listed in Table 1, you'll need to rename the delimited file or add the file's extension to the DisabledExtensions value in the registry.
The other thing you can do to control the text driver's behavior is to create a plain-text file called Schema.ini in the same directory as the delimited file. Inside the Schema.ini file, place the delimited file's name in square brackets. Underneath that, you can specify a set of values that describe the format and layout of the delimited file. Table 2 lists some of the possible values you can use. Figure 4 shows a sample Schema.ini file that tells the text driver that the Accounts.tab file is tab-delimited. For more information about how to use the Scema.ini file, go to msdn2.microsoft.com/en-us/library/ms709353.aspx. Note that you can also use the text driver to read fixed-length files (meaning that the data in the file is stored in specific columns); however, because of space limitations, I won't describe how to do that here.
Using ADO to Read Delimited Files
ADO lets you process the data in delimited files much more easily than using string parsing. Now that you know how to set up ADO and control the Jet OLE DB text driver, you should be able to read delimited files without having to deal with many of the time-consuming limitations that come with using the TextStream object.