A little-known Microsoft performance tool gives you information you can't get anywhere else
It had to happen sooner or later: Your Active Directory (AD) performance just went to heck for no obvious reason. Last week, everything was humming along just fine, but this week, you've received half a dozen complaints about lethargic logons, failed Microsoft Exchange Server address book lookups, and slow application startups. Running Performance Monitor on each of your domain controllers (DCs) shows that the CPU utilization on one of your DCs is pegged at 100 percent much of the time. But nothing has changed, and everything else seems to be running fine. Now what do you do?
That's where Windows Server 2003 Performance Advisor (SPA) comes in. SPA is a nifty but largely unknown performance analysis utility that Microsoft made available more than two years ago. It automates the collection of configuration, Event Tracing for Windows (ETW), and performance counter data from one or more servers, crunches the resulting mountain of data, and produces easy-to-read performance reports with alerts and recommendations as to how to fix problems. SPA ships with predefined data collectors and performance rules for generic file servers, AD DCs, DNS servers, and servers running Microsoft Internet Information Services (IIS).
Step 1: Download and Install SPA
SPA doesn't ship with the Windows image; you have to download it from http://www.microsoft.com/downloads/details.aspx?FamilyID=09115420-8c9d-46b9-a9a5-9bffcd237da2. Make sure you download the most recent version and not the earlier Server Performance Advisor 1.0. The installer file is spa_v2.msi.
Running SPA on a busy server, such as a DC, can generate a lot of data. Be sure you have several gigabytes of free disk space for the data storage folder. Ideally, you should place the data storage folder on its own spindle to minimize the performance impact of running SPA.
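A quick free-space check on the intended data volume is easy to script before you install. The following is a minimal Python sketch, not part of SPA: the helper name has_free_space is hypothetical, and the 4GB threshold is only one reading of "several gigabytes."

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def has_free_space(path, required_gib):
    """Return True if the volume that holds `path` has at least
    `required_gib` gibibytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


if __name__ == "__main__":
    # "." checks the current drive; point this at the drive that will
    # hold the SPA data storage folder. The 4 is an assumed threshold.
    print(has_free_space(".", 4))
```

Run it from the drive you plan to use for the data storage folder, or pass that drive's path instead of ".".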
Installing SPA is easy on 32-bit Windows. Run the Windows Installer package you downloaded; accept the End-User License Agreement (EULA); accept the defaults for the installation, data storage, and reports folders; and you're good to go.
Installing SPA on 64-bit versions of Windows is a little more involved. SPA requires the Microsoft .NET Framework version 1.1, but that version isn't available for 64-bit platforms. However, you can use the 32-bit version of the .NET Framework on your 64-bit server, and SPA will work fine. Just do the following: Download and install the .NET Framework version 1.1 redistributable package, which is available at http://www.microsoft.com/downloads/details.aspx?FamilyID=262d25e3-f589-4842-8157-034d1e7cf3a3&displaylang=en.
Next, download and install .NET Framework Service Pack 1 (SP1) for Windows Server 2003. You'll find it at http://www.microsoft.com/downloads/details.aspx?familyid=AE7EDEF7-2CB7-4864-8623-A1038563DF23&displaylang=en. Finally, install SPA.
In addition to copying the executables and creating the SPA directories, the installer creates several scheduled tasks to collect performance data. You can see these tasks by clicking the Scheduled Tasks icon in the Control Panel. The tasks that SPA creates are dormant; that is, they're created but don't have a scheduled run time. When you use the SPA client to start a collection, the client simply schedules the task to run. It's an unusual design, but simpler than creating a Windows service and just as effective.
Step 2: Run SPA
You can launch the SPA client by clicking Start, All Programs, Server Performance Advisor. The SPA client presents a somewhat inscrutable UI at startup, initially hiding the navigation hierarchy. To expose the hierarchy, select Scope Tree from the View menu or click the document icon in the gray border on the left side of the window.
The SPA client uses the conventional Microsoft Management Console (MMC) layout, displaying the navigation hierarchy in the left pane and data in the right pane. The Trace Providers and Performance Counters nodes are useful for composing new kinds of SPA collections. But the Data Collectors and Reports node is where the interesting stuff lies.
SPA gathers performance data from a server using a set of data collectors. There are four types of data collectors: performance counter collectors, registry setting collectors, trace collectors, and kernel trace collectors. SPA organizes data collectors into data collector groups, each of which targets performance data for a particular subsystem, such as IIS or AD.
SPA ships with about 90 predefined collector groups. The installation process detects what role or roles your server is configured for and adds one or more of the following eight data collector groups depending on those roles:
- Active Directory
- Active Directory Application Mode (ADAM)
- DNS Server
- DNS Server Extended
- Print Spooler
- System Overview
SPA enables the appropriate collector groups at installation and displays them under the Data Collectors and Reports node in the Scope Tree. So, for instance, when you install SPA on a DC, the Active Directory collector group will be displayed. You can enable or disable individual collector groups from the SPA client by clicking File, Add/Repair Data Collector Groups, Server Roles.
SPA also ships with specialized collector groups that you can use and modify to suit your needs. You can enable these collector groups by clicking File, Add/Repair Data Collector Groups and selecting a collector group from the menu.
Step 3: Run a Collection
I set up a DC named DC2 running Windows 2003 SP1, along with several client machines, to generate a test load using the Active Directory Performance Testing Tool (ADTest.exe, which you can download at http://www.microsoft.com/downloads/details.aspx?familyid=4814FE3F-92CE-4871-B8A4-99F98B3F4338&displaylang=en). I modified the ADTest scripts to generate a mixture of authentication traffic and poorly designed LDAP searches and ran the scripts. For all practical purposes, the DC ground to a halt. To see SPA's analysis of the load on the DC, I ran a collection.
To start a collection, simply start the SPA client, open the scope tree, select the collector group you want to run (in our case, the Active Directory collector group), and select Start from the Record menu (or press F9, or click the green record arrow on the toolbar). SPA schedules the dormant data collection tasks to run immediately and displays several data-collection-in-progress icons that represent the currently running data collectors. Out of the box, the Active Directory collector group has four collection tasks, with an icon for each: performance counter data collection, registry data collection, Active Directory ETW data collection, and Kernel ETW data collection.
By default, the data collectors dump their raw data into the C:\PerfLogs\Data\collector group name\Current folder. After the collection completes, SPA moves the raw data files into a transfer directory, then runs the SPARPT program, which crunches the raw data and produces an XML report. SPA stores the report in the Reports directory in a folder labeled by server name and the date and time of the data collection (e.g., C:\PerfLogs\Report\Active Directory\DC2_200607051641).
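If you want to sort or archive report folders with a script, the folder name can be split back into its parts. The Python sketch below assumes the layout is the server name, an underscore, and a 12-digit YYYYMMDDHHMM timestamp. That layout is inferred from the single example above and isn't documented here, and the names ReportFolder and parse_report_folder are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class ReportFolder(NamedTuple):
    server: str
    collected_at: datetime


def parse_report_folder(name):
    """Split a folder name such as 'DC2_200607051641' into the server
    name and the collection time.

    Assumes <server>_<YYYYMMDDHHMM>; splitting on the last underscore
    lets server names contain underscores of their own.
    """
    server, _, stamp = name.rpartition("_")
    if not server:
        raise ValueError("expected <server>_<YYYYMMDDHHMM>: %r" % name)
    return ReportFolder(server, datetime.strptime(stamp, "%Y%m%d%H%M"))


if __name__ == "__main__":
    print(parse_report_folder("DC2_200607051641"))
```

Under that assumption, the example folder corresponds to a collection on DC2 at 16:41 on July 5, 2006.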
Step 4: Review the Report
Performance Monitor can collect most of the data that SPA can, but SPA really shines in its ability to summarize and present hundreds of megabytes of data in easy-to-understand reports. SPA presents performance data as a single—and possibly quite large—HTML page. The SPA client organizes performance reports by data collector group. To view the performance reports for a collector group, open the Reports node under the collector group and select Current. SPA displays available reports in the data pane on the right, organized first by machine name, and then by year, month, day, and time. To view a report, you have to click to open the machine; click to open the year; click to open the month; and click to open the day. Finally, click the particular report you want to view, and SPA will display it in the data pane.
To ease navigation, a table of contents at the top of the report provides hyperlinks to the sections of the report: the Performance Advice section, several sections of AD-specific performance data, and detailed sections about CPU, network, disk, and memory utilization. At the end of the report are some system-tuning parameters from the registry and some general system configuration and data collection information. Let's walk through some of the report's sections.
Summary. I suggest first reviewing the Summary section, which Figure 1 shows. Here you'll find the following information:
- CPU Usage(%): the CPU load during the collection period
- Top Process Group: the process responsible for the largest chunk of that load (on a DC, this should be LSASS)
- Top Activity: the most CPU-intensive operation performed by that process
- Top Client: the IP address of the client with the most CPU usage
- Top Disk by IO Rate: the busiest disk drive
SPA can show you the specific client and AD operation that generated the highest CPU load and disk I/O during the collection period, which is often all you need to determine the cause of a DC performance problem. When you click an item in the Summary, SPA takes you to the relevant report detail.
Performance warnings. Next, click the Warnings hotlink in the Performance Advice section of the table of contents for details about conditions that violated performance alert rules. SPA provides 17 AD-specific alert rules plus 17 general alert rules that apply to all server roles. You can configure each rule by selecting Rules from the Edit menu.
In our case, we have three warnings, as you can see in Figure 2:
- The top client is consuming 24.74 percent of the available CPU—far more than a single client should consume.
- The output queue length of the DC's NIC is at 12, which is long—you'd expect a length of 1 or 2. The long queue indicates that the DC is sending a lot of data out on the NIC.
- Clients' AD LDAP searches are using the ancestors index. AD uses the ancestors index to search on an un-indexed attribute. In this situation, AD has to read and inspect every object in the container. Use of the ancestors index can indicate a poorly designed query or the need to create a new index in AD.
Directory Search section. When you click the hotlink in a warning's Item column, SPA displays the section of the report that provides more detail about the warning. Clicking the hotlink for the first warning in Figure 2 displays the Directory Search section of the report shown in Figure 3. The Clients with the Most CPU Usage table displays a list of client IP addresses and information about the clients' search performance. The Unique Searches table shows that the searches generated by the client at 10.7.0.131 are using an extraordinary amount of CPU. In the first line of that table, the flag in the Index column corresponds to the performance warning in Figure 2 and tells you that the client at 10.7.0.131 is the one that accounts for 24.74 percent of CPU utilization.
If you click the plus sign to the left of a client's IP address in the Clients with the Most CPU Usage table, you can see more detail about all the unique searches attributed to that client, along with the search parameters and other search-related information, as Figure 4 shows. Each row represents one or more search operations that have the same LDAP search base, scope, filter, and result code. The Top: 3 of 7 notation in the table's title bar tells you that SPA is showing only the top three LDAP searches. To see more entries, click the 3 and type another number. To sort the data by the values in a particular column, click the column header. Most tables in the SPA report work this way.
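The grouping behind such a table is simple to picture in code. The Python sketch below illustrates the idea and is not SPA's implementation: it groups operations that share a search base, scope, filter, and result code, then keeps the top few groups. Ranking by total CPU time is an assumption, and the field names (base, scope, filter, result, cpu_ms) are hypothetical.

```python
from collections import defaultdict


def summarize_searches(operations, top_n=3):
    """Group search operations by (base, scope, filter, result) and
    return the top_n groups by total CPU time, plus the group count."""
    totals = defaultdict(lambda: {"count": 0, "cpu_ms": 0})
    for op in operations:
        key = (op["base"], op["scope"], op["filter"], op["result"])
        totals[key]["count"] += 1
        totals[key]["cpu_ms"] += op["cpu_ms"]
    ranked = sorted(totals.items(),
                    key=lambda item: item[1]["cpu_ms"],
                    reverse=True)
    return ranked[:top_n], len(ranked)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        {"base": "CN=Schema", "scope": "subtree", "filter": "(cn=*a*)",
         "result": "success", "cpu_ms": 40},
        {"base": "CN=Schema", "scope": "subtree", "filter": "(cn=*a*)",
         "result": "success", "cpu_ms": 50},
        {"base": "DC=example", "scope": "base", "filter": "(objectClass=*)",
         "result": "success", "cpu_ms": 1},
    ]
    rows, total = summarize_searches(sample, top_n=1)
    print("Top: %d of %d" % (len(rows), total))
```

The second return value plays the role of the "of 7" in the table's title bar: it tells you how many groups exist beyond the ones displayed.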
The Index column shows the index that AD used to perform the search. In the most egregious search shown, a search of the Schema naming context (NC), the search filter cn=*a* specifies a medial-string search (i.e., a search using a filter containing a wildcard that is not at the end of the string) on the Common Name (CN) attribute. The CN attribute is indexed by default in AD, but not for medial-string searches, so AD has to read every object in the Schema NC to determine whether it matches the filter.
Looking at the other searches this client performed, you'll see they do the same thing as the schema search: They perform subtree-scoped medial searches on the CN attribute, causing AD to use either the ancestors index or distinguished name tag index (dnt_index), neither of which is good for performance. When you see a lot of searches that use the ancestors index or dnt_index, you should either modify the search filter to take advantage of the indexes that already exist in AD or create new indexes. For example, if you determine that cn=*a* is a legitimate search filter that should be optimized, you can add a medial search index on the CN attribute.
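When you review an application's queries, it helps to flag medial filters mechanically. The Python sketch below classifies a simple attribute=value filter by where its wildcards sit, using this article's definition of a medial search (a wildcard anywhere other than the end of the string). The function names are hypothetical, and the code handles only simple equality-style filters, not compound filters built with &, |, or !.

```python
def classify_value(value):
    """Classify an LDAP assertion value by the position of its '*'."""
    if "*" not in value:
        return "equality"      # cn=admin
    if value.strip("*") == "":
        return "presence"      # cn=*
    if "*" in value.rstrip("*"):
        return "medial"        # cn=*a*, cn=*a, cn=a*b
    return "initial"           # cn=a* (wildcard only at the end)


def classify_simple_filter(text):
    """Return (attribute, kind) for a filter like '(cn=*a*)' or 'cn=*a*'."""
    inner = text.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    attribute, sep, value = inner.partition("=")
    if not sep:
        raise ValueError("expected attribute=value: %r" % text)
    return attribute, classify_value(value)


if __name__ == "__main__":
    for f in ("cn=*a*", "(cn=a*)", "(cn=admin)"):
        print(f, classify_simple_filter(f))
```

Only the "initial" and "equality" kinds can be served by an ordinary index on the attribute. Anything classified as "medial" is a candidate for either a rewritten filter or a medial index.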
Adding an index to AD is simple, as the Microsoft article "Index an attribute in Active Directory" (http://go.microsoft.com/fwlink/?LinkId=46790) explains. Keep in mind that each index consumes additional disk space and will cause update operations to take longer. If you already have a large Directory Information Tree (DIT), that additional disk space could be substantial. Also, all objects in AD will be indexed, not just the objects in one container or NC, so consider the performance implications of a new index and carefully test it before making the change.
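For a medial index specifically, the setting involved is the attribute's searchFlags value in the schema, which is a bit field. The Python sketch below shows only the bit arithmetic. It assumes the commonly documented values of 1 for a regular index and 32 for a tuple (medial) index; confirm both against Microsoft's schema documentation before you change a production schema. The function names are hypothetical, and the code computes a value without writing anything to AD.

```python
# Assumed bit values of searchFlags on an attributeSchema object.
# Only the two bits relevant to this discussion are listed.
INDEX = 0x01        # regular index on the attribute
TUPLE_INDEX = 0x20  # tuple index, used for medial searches


def with_tuple_index(current_flags):
    """Return searchFlags with the tuple-index bit set and every
    other bit left exactly as it was."""
    return current_flags | TUPLE_INDEX


def has_tuple_index(flags):
    """True if the tuple-index bit is already set."""
    return (flags & TUPLE_INDEX) == TUPLE_INDEX


if __name__ == "__main__":
    current = INDEX  # an attribute that has only a regular index
    print(with_tuple_index(current))
```

Using a bitwise OR rather than overwriting the value matters because searchFlags carries other settings in the same number.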
We've explained two of the three performance warnings we started with. Now let's look at the third: the NIC's long output queue.
NIC performance counters section. Go back to the Warnings section at the top of the report and click the hotlink for the NIC output queue length warning. SPA will navigate to the Network Interface performance counter table, which Figure 5 shows. Hovering your cursor over the flag that appears in the Mean column on the Output Queue Length counter displays a description of the performance warning.
You'll notice several interesting values in Figure 5. The Current Bandwidth counter shows 100 megabits, indicating that the 10/100 NIC is in fact running at 100Mbps. The Bytes Total/sec counter averages 3.6MBps, which is well below the network's capacity. (I know that on the server that produced this report, all network traffic on the segment was either going to or coming from the DC.) Finally, we see that Bytes Sent/sec accounts for nearly all the traffic, which makes sense considering that the LDAP searches are retrieving a lot of data.
So why is the output queue so long? There are several possibilities:
- The NIC is running at or near capacity.
- Too few output buffers are configured for the NIC.
- Output queue processing is slow because of insufficient CPU horsepower.
Although the report shows that the average total throughput on the NIC is 3.6MBps, the throughput peaked at 10.8MBps. A 100Mbps link tops out at a theoretical 12.5MBps, and framing and protocol overhead push the practical ceiling lower, so the peak is at or near what the link can actually carry. It would seem that the NIC is occasionally overloaded, which suggests that other components on the network segment might also be overloaded. Because the CPU is nearly maxed out, adding another NIC probably won't help. More analysis would be required to come up with a definitive answer, but the best strategy is probably to reduce the load on the server by optimizing the queries (or by adding an index) and, if necessary, either upgrading to a faster CPU or adding another DC and distributing the query load.
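The bandwidth arithmetic is easy to check. The Python sketch below converts the link speed from megabits to megabytes per second and expresses the mean and peak figures from the report as a share of that raw ceiling. The function names are hypothetical, and the calculation ignores framing and protocol overhead, which lowers the usable ceiling in practice.

```python
def megabits_to_megabytes(megabits_per_sec):
    """Convert a rate in megabits per second to megabytes per second."""
    return megabits_per_sec / 8


def utilization(observed_mbytes, link_megabits):
    """Fraction of the raw link capacity that a throughput figure uses."""
    return observed_mbytes / megabits_to_megabytes(link_megabits)


if __name__ == "__main__":
    link = 100                 # Current Bandwidth, in Mbps
    mean, peak = 3.6, 10.8     # Bytes Total/sec, in MBps
    print("ceiling: %.1f MBps" % megabits_to_megabytes(link))
    print("mean:    %.1f%%" % (100 * utilization(mean, link)))
    print("peak:    %.1f%%" % (100 * utilization(peak, link)))
```

By this measure the mean load uses under a third of the raw link capacity, while the peak uses most of it, which is consistent with a NIC that is saturated in bursts rather than continuously.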
LDAP requests. There's one more anomaly we should investigate. In the Summary section of Figure 1, if you click the Top Client hotlink, SPA displays the Clients with the Most CPU Usage subsection of the LDAP Request section of the report. Sure enough, the client at 10.7.0.131 is the prime offender.
When you expand the entry for that IP address, SPA displays a summary of the operations the client performed and the percentage of CPU resources each operation consumed, as Figure 6 shows. We can see that the client at 10.7.0.131 used most of the CPU by issuing nonDSE searches (i.e., searches of some part of the directory tree). This usage is consistent with what we discovered earlier.
What's surprising, however, is the second row in the operation summary, which shows that a significant number of the client's searches failed with a Size Limit Exceeded error. These search failures accounted for another 12 percent of CPU utilization. So SPA has helped uncover at least one other problem with this client: It isn't using paged searches. Not using paged searches is a common application programming error and can cause AD to not return all the data the application is searching for.
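To see why paging matters, consider a toy model of the exchange. The Python sketch below is a simulation and does not use a real LDAP API: ToyDirectory, its methods, and the integer cookie are all hypothetical stand-ins. Real clients request paging through their LDAP library's support for the paged results control. The model shows an unpaged search being cut off at the server's size limit, and a paged loop retrieving everything by asking for one page at a time.

```python
class ToyDirectory:
    """Stand-in for a directory server that caps entries per response."""

    def __init__(self, entries, size_limit):
        self.entries = list(entries)
        self.size_limit = size_limit

    def search(self):
        """Unpaged search: truncated once the result set exceeds the limit."""
        if len(self.entries) > self.size_limit:
            return self.entries[:self.size_limit], "sizeLimitExceeded"
        return list(self.entries), "success"

    def search_page(self, page_size, cookie=0):
        """Paged search: return one page and a cookie marking where to
        resume, or None as the cookie when no entries remain."""
        if page_size < 1:
            raise ValueError("page_size must be at least 1")
        page_size = min(page_size, self.size_limit)
        page = self.entries[cookie:cookie + page_size]
        next_cookie = cookie + len(page)
        if next_cookie >= len(self.entries):
            next_cookie = None
        return page, next_cookie


def fetch_all(directory, page_size):
    """Keep requesting pages until the server signals there are no more."""
    results, cookie = [], 0
    while cookie is not None:
        page, cookie = directory.search_page(page_size, cookie)
        results.extend(page)
    return results


if __name__ == "__main__":
    directory = ToyDirectory(range(2500), size_limit=1000)
    partial, code = directory.search()
    print(len(partial), code)
    print(len(fetch_all(directory, page_size=500)))
```

In the model, the unpaged call comes back with only part of the data and an error code, which mirrors the failure mode described above; the paged loop returns the full set.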
What Have We Accomplished?
We started out with an overloaded DC, and after running the SPA Active Directory data collector group on the DC and generating a report, we identified three performance problems:
- Clients are issuing medial search queries on an attribute not indexed for medial searches.
- Clients are not using paged searches.
- The NIC (and possibly other system or network components) is overloaded.
Not bad for 15 minutes of work! You can also use SPA to collect and archive performance data on a scheduled basis, generating baseline data. In an upcoming article, I'll discuss how to set up SPA to collect data from multiple servers and generate the reports on a centralized reporting server.
PROBLEM: AD performance has plummeted.
SOLUTION: Run SPA and use its easy-to-read performance reports and recommendations to troubleshoot the problem.
WHAT YOU NEED: The most recent version of Windows Server 2003 Performance Advisor (SPA); Windows Server 2003
DIFFICULTY: 3 out of 5