Retrieve text and binary data from Internet addresses
Executive Summary:
Windows PowerShell doesn't have any cmdlets specifically designed to download Internet data. However, you can use the Microsoft .NET Framework's System.Net.WebClient class in PowerShell scripts to download Internet data. To demonstrate how to use the Microsoft .NET Framework's WebClient class, here are three Windows PowerShell scripts—Get-WebString.ps1, Get-Rfc.ps1, and Copy-WebFile.ps1—that you can use to download and even copy text-based content from the Uniform Resource Identifier (URI) you specify.
Although Windows PowerShell doesn't have any cmdlets specifically designed to download Internet data, the Microsoft .NET Framework exposes Internet data access through its System.Net.WebClient class. You can write PowerShell scripts that use this class to retrieve text and binary data from Internet addresses. To show you how, I've created three scripts: Get-WebString.ps1, Get-Rfc.ps1, and Copy-WebFile.ps1. After I explain how to use these scripts, I'll discuss how they work so you know how to create your own custom tools.
Downloading Text-Based Content
Get-WebString.ps1 in Listing 1 is a script that lets you download text-based content from the Internet. To obtain Get-WebString.ps1 and the other scripts I discuss in this article, click the Download the Code Here button at the top of the page. You need to place these scripts somewhere in the search path PowerShell uses for finding files. If you don't have a folder set up for this, you might want to add one as outlined in the sidebar "Adding a Folder to the Windows Command Search Path."
To download text-based content from any Internet address, you simply execute Get-WebString.ps1 with a Uniform Resource Identifier (URI) as its argument. For example, suppose you're checking the Federal Reserve's daily exchange rates. These rates are always posted at http://www.federalreserve.gov/releases/h10/update/h10daily.txt, so all you need to do is run the command
Get-WebString http://www.federalreserve.gov/releases/h10/update/h10daily.txt
Many text-based resources on the Internet can be retrieved this way. For example, the Internet Engineering Task Force (IETF) makes all its Request for Comments (RFC) documents available as text at an address that follows the format http://www.ietf.org/rfc/rfcNNNN.txt, where NNNN is a 4-digit RFC number. To retrieve RFC 1408, you'd use the command
Get-WebString http://www.ietf.org/rfc/rfc1408.txt
Because IETF follows this format, you can create a script solely for the purpose of retrieving RFCs, as Get-Rfc.ps1 in Listing 2 shows. Get-Rfc.ps1 takes an RFC number as an argument and returns the corresponding RFC text, so all you need to do to get RFC 1408 is run the command
Get-Rfc 1408
Because PowerShell automatically tries prepending Get- to a command name it can't otherwise find, you can abbreviate the command to
Rfc 1408
PowerShell prepends Get- only while searching for a command. So, if you specify an explicit path (for example, if you saved this script as C:\tmp\Get-Rfc.ps1 and set your current PowerShell location to C:\tmp), you can invoke the script and have it retrieve RFC 1408 using either of the following commands:
.\Get-Rfc 1408
C:\tmp\Get-Rfc 1408
However, you can't use these commands
.\Rfc 1408
C:\tmp\Rfc 1408
because PowerShell isn't performing a command search. Instead, you've told it that you want to load a command file explicitly from the C:\tmp folder, with the exact basename of rfc. PowerShell expands command names only when it's performing a search for a command.
Whichever technique you use, the data is returned as in-console text, so you can treat it like any other PowerShell data. You can pipe it through the More command to page through the RFC text:
Get-Rfc 1408 | More
You can store it in a variable:
$rfc1408 = Get-Rfc 1408
You can even save it directly to a local file:
Get-Rfc 1408 | Set-Content rfc1408.txt
You can use the resulting variable or text file any way you want. If you want to see the content in the PowerShell console, you can use
$rfc1408
to display the variable's content or
Get-Content rfc1408.txt
to display the text file's content.
Get-WebString.ps1 makes it just as easy to read data from the Internet as from a locally stored file. It works with all protocols that can be used for direct text data transfer (including FTP) and can even be used to read local files by specifying a file-system path.
One problem you can encounter with Get-WebString.ps1 is that the underlying WebClient class doesn't automatically determine which encoding is used for the text it reads from a particular location. For example, suppose you want to use Get-WebString.ps1 to read the Unicode text file at C:\tmp\weather.txt. If you use the default command
Get-WebString C:\tmp\weather.txt
you'll see onscreen text that looks something like what is shown in Figure 1. Unicode files allocate 2 bytes for each character; for most Western European languages, one byte of each pair is 0 (in the little-endian byte order that .NET calls Unicode, it's the second byte). Because Get-WebString.ps1 assumes each byte is a separate character, that zero byte is read as the character with a code value of 0 (the null character). This is typically displayed in console windows as an empty space, causing the stretched-out look you see in Figure 1.
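The effect is easy to reproduce without a file. The following is a small sketch, not part of the article's listings: it encodes a sample string as Unicode, then decodes the bytes one byte per character. ASCII stands in here for a single-byte default code page, and the sample text and variable names are made up for illustration.

```powershell
# Encode a short string as "Unicode" (UTF-16, little-endian), the way a
# Unicode text file stores it on disk.
$text  = 'Rain'
$bytes = [System.Text.Encoding]::Unicode.GetBytes($text)

# Show the raw bytes in hex: each letter is followed by a 00 byte.
$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex        # 52 00 61 00 69 00 6E 00

# Decode the same bytes one byte per character (ASCII is an assumption that
# stands in for a single-byte code page) and count the resulting nulls.
$misread   = [System.Text.Encoding]::ASCII.GetString($bytes)
$nullCount = ($misread.ToCharArray() | Where-Object { [int]$_ -eq 0 }).Count
"Characters read: $($misread.Length); null characters: $nullCount"

# Decoding with the matching encoding recovers the original text.
$correct = [System.Text.Encoding]::Unicode.GetString($bytes)
$correct    # Rain
```

Four letters become eight characters, half of them nulls, which is the stretched-out look described above.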
You can modify the encoding that Get-WebString.ps1 uses to read files with the -Encoding parameter. To read the Unicode file C:\tmp\weather.txt correctly, you just specify Unicode after the -Encoding parameter:
Get-WebString C:\tmp\weather.txt -Encoding Unicode
The -Encoding parameter name is optional. Get-WebString.ps1 will understand what you mean if you just use
Get-WebString C:\tmp\weather.txt Unicode
With the proper encoding specified, you'll see the text rendered as shown in Figure 2.
The values you can use with the -Encoding parameter are ASCII, BigEndianUnicode, Default, Unicode, UTF32, UTF7, and UTF8. All the values except for Default are the standard character encodings. Default represents the system's current code page.
I won't go into the details of character encodings here, apart from a few tips for dealing with Unicode text. Remember that spaced-out text generally indicates that the file is Unicode. If you get a series of blocks onscreen when you try Unicode, the file might be BigEndianUnicode, where each pair of bytes is reversed compared to standard Unicode. Figure 3 shows the onscreen display you get with a BigEndianUnicode file when you try reading it without the -Encoding parameter, as a Unicode file, and as a BigEndianUnicode file. Fortunately, the next application of the WebClient class, copying files from the Internet, isn't affected by the encoding of characters.
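To see why the wrong byte order produces blocks rather than spaced-out text, consider this short sketch. It is an illustration only; the sample string and variable names are assumptions, and no file is involved.

```powershell
# Encode sample text with the high byte of each character first.
$text  = 'Sunny'
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes($text)

# Decode the same bytes with both byte orders.
$asUnicode   = [System.Text.Encoding]::Unicode.GetString($bytes)
$asBigEndian = [System.Text.Encoding]::BigEndianUnicode.GetString($bytes)

# With the wrong byte order, 'S' (0x0053) is read as 0x5300, a character far
# outside the Latin range that many console fonts can't draw.
$codes = $asUnicode.ToCharArray() | ForEach-Object { '{0:X4}' -f [int]$_ }
$codes -join ' '    # 5300 7500 6E00 6E00 7900

$asBigEndian        # Sunny
```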
Downloading and Copying Text-Based Content
Listing 3 contains Copy-WebFile.ps1, which lets you download and save text-based content from the Internet in a single step. For example, if you run the command
the script will retrieve the content of RFC 1408 from the Internet and save it to a file named rfc1408.txt in the current PowerShell location.
When you use Copy-WebFile.ps1, you need to be aware of three limitations:
- Files saved this way aren't blocked from execution the way files downloaded through Microsoft Internet Explorer (IE) are. If you use Copy-WebFile.ps1 to save an executable locally and run the executable later, you won't receive a warning that it came from an Internet location.
- If you choose to save the target file under a different name or in a different location, you should specify the entire pathname; otherwise, it might end up where you don't expect it (more on this later). You can use environment variables in the path to save some typing. For example, to save the rfc1408.txt file to your Temporary Files folder, you can use the following command:
Copy-WebFile http://www.ietf.org/rfc/rfc1408.txt $env:temp\rfc1408.txt
If you're wondering how the code $env:temp works, see the sidebar "How to Access an Environment Variable Without Writing a Complete PowerShell Statement."
- You won't be prompted before a pre-existing file gets overwritten. If you repeatedly run the previous command, Copy-WebFile.ps1 will silently overwrite the rfc1408.txt file each time.
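If silent overwriting is a concern, one option is to check for the target file before downloading. The function below is a hedged sketch rather than a change to Copy-WebFile.ps1: the name Copy-WebFileSafely and its -Force switch are hypothetical, and it assumes you pass a complete path for the target.

```powershell
function Copy-WebFileSafely {
    param(
        [Parameter(Mandatory = $true)] [System.Uri] $Uri,
        [Parameter(Mandatory = $true)] [string] $Path,   # pass a complete path
        [switch] $Force
    )

    # Refuse to replace an existing file unless the caller asks for it.
    if ((Test-Path -LiteralPath $Path) -and -not $Force) {
        Write-Warning "Skipping download: '$Path' already exists. Use -Force to overwrite."
        return $false
    }

    $webClient = New-Object System.Net.WebClient
    try {
        # DownloadFile saves the resource at $Uri to the local file $Path.
        $webClient.DownloadFile($Uri, $Path)
    }
    finally {
        $webClient.Dispose()
    }
    return $true
}
```

A call such as Copy-WebFileSafely http://www.ietf.org/rfc/rfc1408.txt $env:temp\rfc1408.txt would then return $false and leave the file alone on the second run.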
Understanding the Scripts
Get-WebString.ps1, Get-Rfc.ps1, and Copy-WebFile.ps1 are wrapper scripts for the WebClient class. Understanding how these scripts work can help you create your own custom scripts.
Get-WebString.ps1. Let's start with Get-WebString.ps1 in Listing 1. In callout A, I declare the named parameters $Uri and $Encoding. I could have used PowerShell's generic $args variable, but this way, I have meaningful names for the parameters. I can also specify a default value for the -Encoding parameter. In this case, that value is the string Default, which represents the system default encoding.
In callout B, I use PowerShell's New-Object cmdlet to create a System.Net.WebClient object and assign it to the $WebClient variable. Note that you can omit System. in the object's class name because PowerShell automatically searches the .NET System namespace if it can't find a class by the name you specify. For clarity and to slightly speed up the object creation process, I generally include the System. prefix.
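A quick way to confirm that both spellings refer to the same class is to compare the types of the resulting objects. This is only an illustrative check; the variable names are made up.

```powershell
# Create one object with the short name and one with the full name.
$short = New-Object Net.WebClient
$full  = New-Object System.Net.WebClient

# Both objects report the same full type name.
$short.GetType().FullName    # System.Net.WebClient
$sameType = $short.GetType() -eq $full.GetType()
$sameType                    # True

# Release the objects when finished.
$short.Dispose()
$full.Dispose()
```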
Next, I tell $WebClient what encoding it should use for the text, as callout C shows. This code might seem to work like magic if you don't know much about the WebClient class, so it's worth stepping through in detail. To find out the WebClient class's properties, you can run the command
New-Object System.Net.WebClient | Get-Member -MemberType Property
Figure 4 shows the output from this command. The Encoding property is defined as System.Text.Encoding Encoding, which you can translate as "the property named Encoding holds a value of the type System.Text.Encoding." What's important about this is that the various text encodings you can use are a special type of element in .NET called static properties. Due to how PowerShell works with .NET data types (which is a much longer story than I can go into here), PowerShell needs to address static members of a .NET class in a way that might look unusual.
Typically when you work with a class in PowerShell, you need to create an instance of it with the New-Object cmdlet, after which you can access the object's properties by adding to the object name a dot followed by the property's name. When you want to use a static property of a class, however, you don't create an instance of the class. You simply refer to the class by name, followed by a double colon (::) to make it clear to PowerShell that the next item is a static member of the class. Normally, the System.Text.Encoding static members are hidden by Get-Member, but you can use Get-Member with its special -Static parameter to see them, like this:
[System.Text.Encoding] | Get-Member -Static -MemberType Property
The output provides the list of predefined encodings I mentioned previously: ASCII, BigEndianUnicode, Default, Unicode, UTF32, UTF7, and UTF8. If you specify UTF8 as the -Encoding parameter value when running Get-WebString.ps1, the script reads the line in callout C as
The :: operator tells PowerShell that a static member name follows, so the script knows that UTF8 means the static property named UTF8 in the System.Text.Encoding type.
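As a rough illustration of the :: operator, the snippet below looks up a static property both by writing its name literally and by supplying the name in a variable. It is a sketch under the assumption that a string parameter value is used to pick the encoding; it is not a copy of the line in callout C, and the variable names are chosen for the example.

```powershell
# The name of the encoding, as it might arrive in a script parameter.
$Encoding = 'UTF8'

# Static property written out literally...
$literal = [System.Text.Encoding]::UTF8

# ...and the same static property selected by the name held in a variable.
$selected = [System.Text.Encoding]::$Encoding

$literal.WebName     # utf-8
$selected.WebName    # utf-8

# The selected encoding object can then be assigned to a WebClient instance.
$webClient = New-Object System.Net.WebClient
$webClient.Encoding = $selected
$webClient.Encoding.WebName    # utf-8
$webClient.Dispose()
```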
The final task that Get-WebString.ps1 performs is to retrieve the text. The WebClient class has a method called DownloadString that returns string data from a request. In callout D, this method is called into action. Because I don't assign the returned data to a variable, the script automatically writes the data to the output stream, so you see it returned directly as text.
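Putting the four steps together, a script along these lines could look like the following. This is a minimal sketch written as a function so it can be pasted into a session; it is not the code in Listing 1, and the name Get-WebStringSketch is hypothetical.

```powershell
function Get-WebStringSketch {
    # Step 1: named parameters, with the system default encoding as fallback.
    param(
        [Parameter(Mandatory = $true)] [string] $Uri,
        [string] $Encoding = 'Default'
    )

    # Step 2: create the WebClient object.
    $webClient = New-Object System.Net.WebClient
    try {
        # Step 3: select the static System.Text.Encoding property by name.
        $webClient.Encoding = [System.Text.Encoding]::$Encoding

        # Step 4: download the text. The result isn't assigned to a variable,
        # so it goes straight to the output stream.
        $webClient.DownloadString($Uri)
    }
    finally {
        $webClient.Dispose()
    }
}
```

With this loaded, Get-WebStringSketch C:\tmp\weather.txt Unicode would read the local Unicode file from the earlier example, and passing an http address would retrieve text from the Internet.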
Get-Rfc.ps1. Get-Rfc.ps1 is a simple script, but it's worth discussing because it demonstrates how easy it is to reuse tools in PowerShell. Get-Rfc.ps1 directly reuses Get-WebString.ps1. If you tried to do something similar in the Windows Script Host (WSH) environment, you would need to change Get-WebString into a Windows Script Component (WSC) and register it on each machine where it would be used. The Get-Rfc script would then need to reference and invoke the WSC. Furthermore, the WSC wouldn't be a tool in its own right; you would need to write a wrapper script to use it.
Setting up and using a WSC takes a lot of effort, which discourages some people from using it. With the PowerShell approach, all you need to do is have both scripts in the command search path for PowerShell. You don't need to write a component or create a wrapper script. And you don't even have to deploy the scripts to clients if you save the scripts in a shared network location.
As Listing 2 shows, Get-Rfc.ps1 works by invoking Get-WebString.ps1 with a generic URI. The RFC number you provide replaces the $RfcId placeholder in the URI. If the Internet address for RFCs is ever changed, you can change the generic URI to reflect that.
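The placeholder substitution can be sketched as follows. This is an illustration of the idea rather than the code in Listing 2; the helper name Get-RfcUri is hypothetical, and the commented-out last line assumes Get-WebString.ps1 is in the command search path.

```powershell
function Get-RfcUri {
    param(
        [Parameter(Mandatory = $true)] [int] $RfcId
    )

    # PowerShell expands $RfcId inside the double-quoted string; the ".txt"
    # that follows the variable name is treated as literal text.
    "http://www.ietf.org/rfc/rfc$RfcId.txt"
}

Get-RfcUri 1408    # http://www.ietf.org/rfc/rfc1408.txt

# Hypothetical use with the download script:
# Get-WebString (Get-RfcUri 1408)
```

Keeping the generic address in one place means a change of location on the IETF side needs only a one-line edit.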
Copy-WebFile.ps1. Copy-WebFile.ps1 in Listing 3 is the most complex of the three scripts. The complex portions are in the code that declares the parameters. In callout A, I treat the URI string (which is provided as an argument) as a System.Uri object because a System.Uri object provides some features that simplify the code in callout B.
The code in callout B assigns a location where the file is to be saved. If you don't specify a location, the script saves the data to a file in the current PowerShell location, with the file having the same name as the source Web file. Because I made the Web file address into a System.Uri object, I can use some features of the System.Uri class to make this assignment easier. A System.Uri object exposes the components of the Internet address, including the elements of the file path, through its properties. You can see what these properties are by piping a System.Uri object to Get-Member with the -MemberType Property parameter.
As Figure 5 shows, Segments is one of the properties.
The Segments property, which is referenced in callout B, is an array of strings, and the last string in that array is the filename. The number of elements in $uri.Segments is $uri.Segments.Count. Because array elements are indexed beginning with 0, I use $uri.Segments[$uri.Segments.Count - 1] to obtain the name for the file. To get the correct path, I use Get-Location -PSProvider FileSystem. This code ensures that PowerShell will use the file-system location, even if you switch to a registry drive. To construct a complete path, I use the Join-Path cmdlet to combine the PowerShell location and filename. As you can see, the System.Uri class provides a way to have the filename automatically generated.
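The steps just described can be tried interactively. The snippet below is a sketch of the technique, not the code from callout B; the variable names are assumptions chosen for the example.

```powershell
# Treat the address as a System.Uri object so its parts are available.
$uri = [System.Uri]'http://www.ietf.org/rfc/rfc1408.txt'

# Segments is an array of strings: '/', 'rfc/', and 'rfc1408.txt'.
$uri.Segments

# The last element is the filename. Indexes start at 0, so it sits at Count - 1.
$fileName = $uri.Segments[$uri.Segments.Count - 1]
$fileName    # rfc1408.txt

# PowerShell also accepts -1 as a shorthand for the last element.
$sameName = $uri.Segments[-1]

# Ask specifically for the file-system location, then build the full path.
$folder = (Get-Location -PSProvider FileSystem).Path
$target = Join-Path $folder $fileName
$target
```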
You can, of course, provide your own filename, but if you do so, you need to specify a complete path. As I mentioned earlier, if you specify just a filename, it won't necessarily be saved where you expect. For example, suppose you set your PowerShell location to the Temporary Files folder, which is easily done with the command
Set-Location $env:temp
Now you retrieve the file located at http://someserver.net/bin/setup.exe, and want to save it as SpecialApp-Setup.exe in the temp folder. If you use the command
Copy-WebFile http://someserver.net/bin/setup.exe SpecialApp-Setup.exe
you won't see SpecialApp-Setup.exe in your Temporary Files folder. If you search the computer for SpecialApp-Setup.exe, you'll probably find it in your home directory. On Vista, for example, it would be at a location such as C:\Users\YourName\SpecialApp-Setup.exe, where YourName is your username. For information about why this happens, see the sidebar "Why the PowerShell Working Directory and the PowerShell Location Aren't One in the Same."
There are some workarounds for saving the data to a location other than the working directory. One workaround is to specify the complete path. Another workaround is to rename the file immediately after it is downloaded. You can do that with the following two PowerShell commands:
Copy-WebFile http://someserver.net/bin/setup.exe
Rename-Item setup.exe SpecialApp-Setup.exe
This is almost as awkward as specifying the complete path, but if you're in a deep folder path (and don't have an existing file named setup.exe that would be overwritten), it can save some effort.
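If typing a complete path by hand is the sticking point, the path can be built from the current PowerShell location instead. The following sketch also prints the process working directory so you can compare the two; the variable names are assumptions, and the final commented line shows a hypothetical call to Copy-WebFile.ps1.

```powershell
# The PowerShell location (what Get-Location reports) and the process working
# directory (what .NET methods use for relative paths) are tracked separately.
$psLocation       = (Get-Location -PSProvider FileSystem).Path
$processDirectory = [System.Environment]::CurrentDirectory
"PowerShell location: $psLocation"
"Process directory:   $processDirectory"

# Build a complete path from the PowerShell location so that a .NET method
# saves the file where the prompt suggests it will go.
$fileName = 'SpecialApp-Setup.exe'
$fullPath = Join-Path $psLocation $fileName
$fullPath

# Hypothetical usage with the address from the example above:
# Copy-WebFile http://someserver.net/bin/setup.exe $fullPath
```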
Take Advantage of the WebClient Class
Get-WebString.ps1 and Copy-WebFile.ps1 are useful tools for downloading text-based content from the Internet. You can also reference Get-WebString.ps1 in a specialty script in order to access files or data exposed by a specific Web site. Get-Rfc.ps1 is an example of such a specialty script. Using that script as a template, you can create your own specialty script, such as one that accesses Project Gutenberg's collection of more than 20,000 free electronic books. Each release from Project Gutenberg is assigned a numeric ID and can be accessed through an address that follows the format http://www.gutenberg.org/files/id/id.txt, where id is the ID of the book you want to download.
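A specialty script for that address format might start from something like this. It is only a sketch: the function name Get-GutenbergUri and the sample ID are made up for illustration, and the commented-out call assumes Get-WebString.ps1 is in the command search path.

```powershell
function Get-GutenbergUri {
    param(
        [Parameter(Mandatory = $true)] [int] $Id
    )

    # The -f operator fills both {0} placeholders with the same ID.
    'http://www.gutenberg.org/files/{0}/{0}.txt' -f $Id
}

Get-GutenbergUri 1342    # http://www.gutenberg.org/files/1342/1342.txt

# Hypothetical use with the download script:
# Get-WebString (Get-GutenbergUri 1342)
```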
All three scripts take advantage of the WebClient class. This class, though, can be used for other purposes. For example, you can use it to upload data to a URI. If you'd like more information about this class, check out MSDN's WebClient Class Web page at http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx.