Automatically convert Microsoft Word files to a variety of formats
Daily IT tasks, such as network administration and user support, often focus on files, particularly word processing documents. Unfortunately, Windows has no built-in tools for working with documents other than as files; processes for handling Word internals, such as converting the document type, are beyond the reach of standard tools.
I've written a Windows Script Host (WSH) script, ConvertWord, which I designed as a command-line wrapper for Microsoft Word to help with document manipulation. ConvertWord lets you quickly convert documents to any export format supported by your Word installation or extract text from those documents to files. You might also find the script useful in testing documents for problems.
You must have Word 97 or later installed to use ConvertWord. Download the entire ConvertWord code from the Windows Scripting Solutions Web site. (Excerpts from the ConvertWord script appear in this article.) Go to http://www.windowsitpro.com/windowsscripting, enter 44361 in the InstantDoc ID box, then click the 44361.zip hotlink. Save convertword.wsf and convertword.cmd in the same folder.
ConvertWord can automatically use any file converter that's available for Word. Word comes with a core set of file converters for generic documents. However, this set doesn't include any special converters, such as those needed for Microsoft Works or WordPerfect documents. To obtain these and other optional converters, you'll need to perform a custom installation of Word.
You can also download the standard Word converters, which are bundled into the Microsoft Office Resource Kits, at the Office 2003 Editions Resource Kit page at http://www.microsoft.com/office/ork/2003/default.htm. You can use the resource kit converters with Word 97 and later. After installing the resource kit, navigate to the directory where it's installed (the default directory is %programfiles%\orktools) and locate the converter pack (oconvpck.exe). Run oconvpck.exe on any PC on which you want to install the converters.
What ConvertWord Does
I originally began writing ConvertWord to handle some tasks that Word's Batch Conversion Wizard can't do. If you don't already have the Batch Conversion Wizard, consider adding it to your toolkit. The wizard is a Word template that performs single-input-format to single-output-format conversion. (For more information about this type of conversion, see the Microsoft article "How to automatically convert many documents to Word 2002 format" at http://support.microsoft.com/?kbid=313714.)
Although the Batch Conversion Wizard handles many tasks, it isn't optimized for certain scenarios, such as remote administration or automated basic conversions for end users who share documents in separate network locations. ConvertWord can help with these types of distributed conversion scenarios by letting you perform these basic tasks:
- Query the system about the installed Word version.
- Automatically open arbitrarily long lists of mixed document types.
- Save documents with guaranteed unique names as Word (default) or other document types.
- Test documents for format problems and user passwords.
How ConvertWord Works
ConvertWord performs a four-step conversion process. First, the script creates an instance of the Word application, as callout A in Listing 1 shows. The script also includes code to suppress as many dialog boxes as possible. For example, the code at callout B disables dialog boxes that can be turned off.
The second step that ConvertWord performs is to open each document. The Word object contains a Documents collection; calling that collection's Open method—which the code at callout A in Listing 2 does—returns a document. If you know the document name and want to let Word automatically determine its format, you can call the method and specify only the name of the document as an argument.
Alternatively, you can specify the document format as another parameter of the Open method. Unfortunately, depending on your version of Word, the Open method takes up to 16 parameters. Because the parameter that controls the format is the 10th parameter, you must also specify the nine parameters that precede it. This makes for a long, ugly line of code. You can find information about the parameters at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_wrcore/html/wrconwordobjectmodeloverview.asp or in Word Help.
The parameters that ConvertWord uses are FileName, ConfirmConversions, ReadOnly, AddToRecentFiles, PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument, WritePasswordTemplate, and Format. The FileName parameter is the Word document filename. The ConfirmConversions parameter lets you force a dialog box to appear when Word converts the document when it's opened. In ConvertWord, this parameter is always set to False to enable easy automation.
The ReadOnly parameter controls whether Word opens the document as read-only; ConvertWord always sets this parameter to True to ensure that the original document isn't changed. AddToRecentFiles controls whether the opened document is added to the current user's RecentFiles list. Because this document is probably one of dozens or even hundreds, we don't want to add it to the list, so this parameter is set to False.
PasswordDocument is a password for opening protected documents, and PasswordTemplate is a password for templates. These values are useless for non-Word documents, so the script uses two double quotes ("") on each parameter to specify an empty string. The Revert parameter determines whether the script "reverts" to the currently open version of a document if the document that's being converted is already open. ConvertWord sets this parameter to True to avoid a possible loss of current changes and to activate only the open copy of the document.
The WritePasswordDocument and WritePasswordTemplate parameters specify the passwords that are required to save the opened document or template. For our purposes, these parameters are unnecessary because ConvertWord doesn't overwrite the original document; therefore, the script specifies "" for each of these arguments.
Finally, the Format parameter is a number that signifies the method that Word will use to try to interpret the format of the opened document. Specifying the correct number is tricky because the usable numbers and what they represent depend on which version of Word is installed, which additional document converters have been installed, and the installation order. For example, let's say you want to open and convert a Rich Text Format (RTF) document, which has an open-format code of 3. To open the sample document by using the standard RTF converter, you'd use the following code:
Set doc = Word.Documents.Open("c:\my.rtf", False, True, False, "", "", True, "", "", 3)
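To make the positional layout easier to see, here is a minimal sketch in Python (not the script's own VBScript) that assembles the ten arguments in the order the article lists them. The function name open_arguments and the constant OPEN_PARAMS are hypothetical; the values mirror the ones described above, and 0 is Word's automatic-detection open format.

```python
# Parameter order of Documents.Open as listed in the article.
OPEN_PARAMS = (
    "FileName", "ConfirmConversions", "ReadOnly", "AddToRecentFiles",
    "PasswordDocument", "PasswordTemplate", "Revert",
    "WritePasswordDocument", "WritePasswordTemplate", "Format",
)

def open_arguments(file_name, open_format=0):
    """Return the ten positional arguments in the order Open expects.

    open_format=0 asks Word to detect the format automatically;
    3 selects the standard RTF converter, as in the article's example.
    """
    values = {
        "FileName": file_name,
        "ConfirmConversions": False,   # never show the conversion dialog box
        "ReadOnly": True,              # leave the original untouched
        "AddToRecentFiles": False,     # keep the recent-files list clean
        "PasswordDocument": "",
        "PasswordTemplate": "",
        "Revert": True,                # the value the article's call passes
        "WritePasswordDocument": "",
        "WritePasswordTemplate": "",
        "Format": open_format,
    }
    # Format is the 10th parameter, so all nine before it must be supplied.
    return tuple(values[name] for name in OPEN_PARAMS)

rtf_call = open_arguments(r"c:\my.rtf", 3)
```

Keeping the names next to the values makes it harder to shift an argument by one position, which is the usual mistake with long positional calls.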
(Some code lines wrap to several lines in this article because of space constraints.) You can find the list of add-on document converters with their numeric values and the standard extensions by using the object's FileConverters collection. The ConvertWord code in Listing 3 displays a list of these converters. This code doesn't list the standard converters available to Word. You can find the list of standard Word converters in Web Table 1 (http://www.windowsitpro.com/windowsscripting, InstantDoc ID 44361) or in the Word Help documentation.
In ConvertWord, the CreateFormatCollections subroutine displays the list of Word converters. Be aware that although the script simplifies the problem of determining open and save formats to some extent, the format that the script uses to open or save a document depends on the Word version and the order in which the converters were installed.
After ConvertWord opens the document, it saves the new version of the document by using the SaveAs method, which callout A in Listing 4 shows. Although the SaveAs method takes up to 16 parameters, we need to use only two because the parameter we need, SaveFormat (named FileFormat in Word's object model), is the second parameter. As with the Format parameter of the Open method, you specify a numeric code on this parameter, but here you use save-format codes rather than open-format codes. Web Table 2 shows a list of standard Word save-format converters and their codes. To specify a save format in a line of code (for example, to save a document as the plaintext file C:\my.txt), you'd write
doc.SaveAs "C:\my.txt", 2
After saving the document, ConvertWord closes it by using the Close method, which callout B in Listing 4 shows. The False value tells Word to discard the changes if Word detects that the document has been modified since it was saved. When the script has finished opening, saving, and closing all the documents, its final step is to quit Word by calling Word's Quit method as Listing 5 shows.
When you first run ConvertWord, you'll probably find it helpful to view the local Word version information by running the command
This command displays important information, including the Word version that's installed on the system. Although Microsoft dropped version numbers from product names beginning with Office 95 (which otherwise would have been Office 7), the version numbers that are used internally increment by 1 with each successive major release. As a member of the Office suite, Word uses the same numbering scheme. The internal Word version numbers are 8 (Word 97), 9 (Word 2000), 10 (Word 2002), and 11 (Word 2003).
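The version mapping above is small enough to express as a lookup table. The following Python sketch (the names WORD_RELEASES and release_name are hypothetical, and this is not code from ConvertWord) assumes the version arrives as a string such as "11.0", which is the form Word's Version property typically takes.

```python
# Internal major version numbers and the product names given in the article.
WORD_RELEASES = {
    8: "Word 97",
    9: "Word 2000",
    10: "Word 2002",
    11: "Word 2003",
}

def release_name(version):
    """Map an internal version such as "11.0" (or 11) to a product name."""
    major = int(str(version).split(".")[0])
    # Versions outside the table are reported rather than guessed at.
    return WORD_RELEASES.get(major, "unknown (internal version %d)" % major)
```

For example, release_name("9.0") yields the entry for Word 2000, and any version not in the table is reported as unknown.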
By default, ConvertWord automatically opens files by making informed guesses about their format (e.g., Word, plaintext, WordPerfect, RTF) and saves them as Word documents. If the output name is already taken, the script makes the name unique by appending an underscore and a number to the base filename. ConvertWord lets you provide document names in several ways. You can enter one or more filenames as command-line arguments like this:
convertword unicode.txt plain.txt otherdocs\corel.wps
This approach produces output files in Word format saved as unicode.doc, plain.doc, and otherdocs\corel.doc. Alternatively, you can tell ConvertWord to read files from standard input—for example:
convertword < convertthese.txt
Or, you can pipe the output of a command that creates a file list into ConvertWord like this:
dir /s /b c:\inbox\*.txt | convertword
If you don't provide any input data, ConvertWord will prompt you to input document names until you press Ctrl+C twice.
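The input handling described above amounts to "use the arguments if there are any, otherwise read names line by line." Here is a rough Python sketch of that decision; collect_names is a hypothetical helper, and treating every argument that begins with "/" as a switch is my assumption, not something taken from the script.

```python
import io

def collect_names(args, stdin):
    """Return document names from the arguments, or from stdin if none were given."""
    # Assumption: anything that starts with "/" is a switch such as /sa:6.
    names = [arg for arg in args if not arg.startswith("/")]
    if names:
        return names
    # One name per line, as produced by "dir /s /b"; blank lines are skipped.
    return [line.strip() for line in stdin if line.strip()]

# Simulated piped input instead of a real console stream.
piped = io.StringIO("c:\\inbox\\a.txt\n\nc:\\inbox\\b.txt\n")
demo_names = collect_names(["/sa:6"], piped)
```

The same function covers redirected files and piped commands, because both simply appear as lines on standard input.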
ConvertWord implements a simple strategy to avoid overwriting files that have the same name. Suppose you want to save a Word file as a text file called mylist.txt. If this filename already exists, ConvertWord starts checking a series of derived names—mylist_1.txt, mylist_2.txt, and so on—until it finds an unused name. ConvertWord then uses that name as the save-file name. The filename search is generally fast compared with the time needed to manually open and save the document.
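The name search is easy to illustrate. The sketch below is written in Python rather than the script's VBScript, and unique_name is a hypothetical function; it follows the strategy just described by trying the plain name first and then _1, _2, and so on.

```python
import os
import tempfile

def unique_name(path):
    """Return path if it is free, else the first free name_N.ext variant."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    n = 1
    # Walk mylist_1.txt, mylist_2.txt, ... until an unused name turns up.
    while os.path.exists("%s_%d%s" % (stem, n, ext)):
        n += 1
    return "%s_%d%s" % (stem, n, ext)

# Demonstration in a scratch folder where two of the names are already taken.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "mylist.txt")
    open(target, "w").close()
    open(os.path.join(folder, "mylist_1.txt"), "w").close()
    demo_result = os.path.basename(unique_name(target))

print(demo_result)  # mylist_2.txt
```

Note that a check followed by a later save isn't atomic; if several conversions write to the same folder at once, two of them could pick the same name.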
Modifying Save Locations and Names
ConvertWord saves files to the same folder as the source file with the same base name. This way, if you're converting files for many users or groups of users, they'll see their new documents next to the old ones. They can usually find "their" files and know what names those files have.
However, you can modify the save location for converted documents. To do so, simply use the /d switch with a pathname that's either absolute or relative to the path in which the script runs. ConvertWord expands the path to its full form and creates the corresponding directory if it doesn't already exist.
For example, to save all files to the C:\temp\exports directory, run ConvertWord and specify that directory as the /d value:
You can specify the /b switch to override the base name (i.e., the filename without its extension). In the event that ConvertWord finds multiple files with the same filename, ConvertWord modifies filenames as I explained earlier. You can also use the /x switch to specify a filename extension other than the default extension for the file type you're exporting.
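To show how the three overrides combine, here is a small Python sketch of the path arithmetic only. The function target_path and its parameter names are hypothetical stand-ins for the /d, /b, and /x values, and I'm assuming the extension is passed with its leading dot. The sketch doesn't expand relative paths or create the directory, both of which ConvertWord does.

```python
from pathlib import PureWindowsPath

def target_path(source, out_dir=None, base=None, ext=".doc"):
    """Build the output path from a source file and optional overrides.

    out_dir stands in for /d, base for /b, and ext for /x.
    """
    src = PureWindowsPath(source)
    # Default: save next to the source file, keeping its base name.
    folder = PureWindowsPath(out_dir) if out_dir else src.parent
    name = (base or src.stem) + ext
    return str(folder / name)
```

With no overrides, target_path(r"c:\inbox\plain.txt") stays in c:\inbox and only swaps the extension; supplying out_dir redirects the file to that folder. The result would still need the unique-name check before saving.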
Creating Non-Word Documents
ConvertWord automatically generates Word documents by default. If you want to create non-Word documents, you can use ConvertWord's /sa option to override the default save format. The formats you can save to vary depending on the version of Word and the add-on converters that are installed on the system on which ConvertWord is running. Your first step when saving to a particular file format should be to run ConvertWord with the /cnv switch to view the installed converters; the converter's index number will be the type you want to save the new file as. If you then want to save all your files in a particular format (say, RTF, which has an index number of 6), you'd add the switch /sa:6 to the ConvertWord arguments. For example, to convert all the WordPerfect files in the current folder and its subfolders to RTF, you'd run the command
dir /s /b *.wpd | convertword /sa:6
Many different document formats might be available to you, depending on the version of Word and the installed converters you're working with. Always check the types before you convert files because the index values won't be the same on different systems. The only exceptions to this caveat are the standard built-in Word converters. Word 97 and later versions have the same values for 0 through 6, and the standard types increase with newer versions; for Word 2003, 0 through 11 have standard values across all installations. The exception to these standard values is an output with the index value of -1. This value doesn't represent a Word converter; instead, you use it to tell ConvertWord to write data from a document file to the console. You specify this value on the /sa switch either as /sa:-1 or /sa+.
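If you generate ConvertWord command lines from another script, a quick check for whether a save-format index is safe to reuse across machines can prevent surprises. The Python sketch below is only an illustration: is_portable is a hypothetical helper, the names for 0 through 6 are Word's built-in save-format constants, and the rule is deliberately conservative in that it trusts 0 through 11 only when the Word version is exactly 11.

```python
# Built-in save formats that Word 97 and later share (values 0 through 6).
STANDARD_SAVE_FORMATS = {
    0: "wdFormatDocument",
    1: "wdFormatTemplate",
    2: "wdFormatText",
    3: "wdFormatTextLineBreaks",
    4: "wdFormatDOSText",
    5: "wdFormatDOSTextLineBreaks",
    6: "wdFormatRTF",
}
CONSOLE_OUTPUT = -1  # ConvertWord's own value for writing text to the console

def is_portable(index, word_major):
    """True if index should mean the same thing on every such installation."""
    if index == CONSOLE_OUTPUT:
        return True  # handled by ConvertWord itself, not by a Word converter
    # Conservative rule: 0-11 only for Word 2003 (version 11), else 0-6.
    upper = 11 if word_major == 11 else max(STANDARD_SAVE_FORMATS)
    return 0 <= index <= upper
```

Any index for which this returns False comes from an add-on converter or a version-specific format and should be looked up on the target system first.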
In large-scale document-conversion operations, you'll probably encounter problems with some files. You need a way to track which documents fail to convert. If ConvertWord has a problem with a file, it sends the filename and some descriptive information to the standard error stream (StdErr); you can track the failures by watching these messages scroll by on screen or by redirecting the error output to a file for later follow-up, like this:
convertword < c:\filelist.txt 2> errors.txt
By default, ConvertWord displays errors with only the filename and an error number:
c:\demo.rtf FAILED: 2
Specifying the /v+ (verbose output) switch as follows outputs more detailed error information:
convertword /v+ < c:\filelist.txt 2> errors.txt
Using the /v- switch doesn't provide any error numbers; instead, it just sends the filename to StdErr for easier processing later.
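For follow-up processing, the redirected error file can be read back line by line. This Python sketch handles the two simple layouts described above: the default "name FAILED: number" form and the bare filename that /v- produces. The function parse_error_line is hypothetical, and the verbose /v+ layout isn't covered because its exact format isn't shown here.

```python
def parse_error_line(line):
    """Split one StdErr line into (filename, error number or None)."""
    text = line.rstrip("\r\n")
    # rpartition keeps filenames that contain spaces in one piece.
    name, sep, code = text.rpartition(" FAILED: ")
    if sep:
        try:
            return name, int(code)
        except ValueError:
            pass  # not a plain number; fall through and keep the whole line
    # /v- output: only the filename is present.
    return text, None
```

Applied to the sample line shown earlier, the function returns the filename c:\demo.rtf together with the error number 2.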
The last error that ConvertWord encounters will always be set as the script's exit errorlevel value; when the script finishes executing, this value will be available in the command environment so that it can be read by a shell script to determine whether the ConvertWord call succeeded or failed.
If you want to check for possible problems without converting documents, you can run ConvertWord with the /w (what if) switch. This switch tells ConvertWord to open all documents without saving them. If any of the files have problems, such as internal corruption, the files will produce the usual error output.
Handling Password Problems
Passwords are particularly troublesome for batch processing because they can vary from document to document. ConvertWord defaults to using a space as a password, which opens all documents that don't have passwords and causes all documents that have passwords to produce an error without interrupting processing.
You can change this behavior with the /p (password) switch. If you supply an empty argument (e.g., /p:""), Word prompts you for a password on all protected documents. You can also provide a specific password with the /p switch. Remember, though, that any document that has no password, or a password different from the one you specified, will fail to open with no prompting.
ConvertWord in the Real World
Because I've used ConvertWord for approximately 30,000 conversions, I've encountered a few common problems in running the script. If you encounter any unusual problems, they're almost certainly Word automation errors; the error number and message echoed back are actually from Word in most cases. Most errors are easy to resolve (e.g., incorrect password) or understand. A few might look unusual. Here are three errors that I've seen occur with some regularity.
The first is a Word pop-up dialog box for documents that contain macros. ConvertWord disables document macros by default to protect you from malicious code. However, when Word opens documents that contain macros, it displays a startup dialog box telling you that the macros are disabled. I haven't found a way to suppress this dialog box other than letting the macros run. You can do this by running ConvertWord and specifying the /as (automation security) switch with a value of 0 (/as:0), which is the default value for Word documents that are opened programmatically. Before you use the /as switch, you should ensure that the opened documents don't contain any malicious code.
The second error relates to some RTF documents that don't open successfully but that display correctly in WordPad. These documents are usually malformed RTF and don't open correctly in Word, either. ConvertWord can't fix this problem; thus, you can't use ConvertWord to convert these files.
The third error occurs because Word identifies Unicode text documents by an initial Byte Order Mark in the file. If this mark is missing, Word identifies the document as normal text and you'll see blanks after each visible letter when you open the converted document (the blanks are actually null characters). The only way to handle this problem is to convert the files with the /oa (OpenAs) switch set specifically to Encoded or Unicode text (/oa:5 for Word 97 and later versions).
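You can spot likely candidates before converting by checking the first bytes of each text file. The Python sketch below is a heuristic of my own, not part of ConvertWord: the hypothetical function looks_like_bomless_utf16 reports files that have no UTF-16 Byte Order Mark but in which nearly every second byte is null, which is the pattern that produces the "blanks" described above. It works best for mostly Latin text.

```python
import codecs

def looks_like_bomless_utf16(data):
    """Guess whether raw bytes are UTF-16 text that lacks a Byte Order Mark."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False  # the mark is present, so Word can identify the file
    sample = data[:512]
    if len(sample) < 4:
        return False  # too little data to judge
    even, odd = sample[0::2], sample[1::2]
    # In UTF-16 Latin text, one byte of each pair is almost always null.
    null_share = max(even.count(0) / len(even), odd.count(0) / len(odd))
    return null_share > 0.9
```

Files for which this returns True are the ones worth converting with /oa:5 instead of relying on automatic detection.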
Fortunately, such errors occur relatively infrequently. I've found ConvertWord to be an extremely useful tool for working with document operations on a large scale. If you're looking for a way to eliminate the tedious procedure of opening and converting Word documents manually, ConvertWord offers relief.