Digging up the dirt
When your Exchange Server 2003 environment experiences a performance or stability problem, you need to be able to quickly diagnose and correct the malfunction and stem the flood of user complaints. Knowing how to collect the relevant debugging data is important. As an Exchange administrator, I've learned about several tools and techniques that can benefit your troubleshooting efforts and maximize the uptime on your Exchange servers.
Logging Debugging Information
Exchange 2003 logs informational, warning, and error events to the Application event log. Therefore, your first step in troubleshooting Exchange problems is to look in the Application event log. To view the log, open Windows Event Viewer (Start, All Programs, Administrative Tools, Event Viewer).
The default diagnostics logging level for all Exchange services is None. This level logs events such as start-up and shut-down of services, status of backup events, and errors. This logging level is adequate for single-server deployments that aren't experiencing any problems, but you'll probably need to increase logging levels for complex enterprise deployments that have many servers.
You can use the Microsoft Management Console (MMC) Exchange System Manager (ESM) snap-in to change the level of diagnostics logging for Exchange components. To do so, open ESM, right-click the server for which you want to set the logging levels, and select Properties. Select the Diagnostics Logging tab, which Figure 1 shows.
I recommend that you set logging levels conservatively when you first deploy Exchange, then refine the settings as you gain monitoring experience. Table 1 shows the settings I've configured on my production server, but these settings probably won't suit every deployment. For example, I've set diagnostics logging levels for public folder replication events to Maximum. My server is part of an Exchange deployment that closely monitors public folder replication, and setting the diagnostics logging to these levels helps us troubleshoot replication problems more effectively. If your organization doesn't use public folders, you could set the diagnostics logging for public folder components to None.
Increasing the diagnostics level to Maximum for any component causes that component to write every internal processing event to the Application event log. Some components can write hundreds of informational events in a short time, thereby complicating your efforts to find useful information in the log. Thus, you should use caution when setting Maximum logging levels.
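To see which components are flooding the log after you raise a logging level, a quick tally by event source can help. The following is a minimal Python sketch, assuming you've already exported Application log entries into (source, event type) pairs; the sample records and the function name are hypothetical and only illustrate the idea.

```python
from collections import Counter

def noisiest_sources(events, top=3):
    """Return the `top` event sources with the most entries, busiest first."""
    counts = Counter(source for source, _event_type in events)
    return counts.most_common(top)

# Hypothetical sample data: (source, event type) pairs taken from an export
# of the Application event log.
sample_events = [
    ("MSExchangeIS Public Store", "Information"),
    ("MSExchangeIS Public Store", "Information"),
    ("MSExchangeSA", "Warning"),
    ("MSExchangeIS Public Store", "Information"),
    ("MSExchangeSA", "Information"),
    ("MSExchangeTransport", "Error"),
]

print(noisiest_sources(sample_events, top=2))
```

A source that dominates the tally is a candidate for a lower logging level once you've finished troubleshooting it.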
However, setting Maximum diagnostics logging levels can be useful for troubleshooting problems with individual Exchange services. For example, the Microsoft article "XADM: How to Troubleshoot Exchange 2000 System Attendant Startup Failures" (http://support.microsoft.com/?kbid=245024) describes how to troubleshoot System Attendant failures by turning up logging on System Attendant service subcomponents such as the Name Service Provider Interface (NSPI), which is the service that Outlook Messaging API (MAPI) clients use to locate an Exchange Global Catalog (GC) server. After you resolve a problem, return the logging levels to their normal settings.
Troubleshooting Exchange Hangs
When an Exchange server hangs, Exchange processes go into a frozen state and can't write information to the event log. Therefore, these types of problems are often difficult to troubleshoot. By default, Exchange 2003 doesn't create any dump files in the event of a crash, and by default, debugging symbols aren't installed. You can read more about installing debugging symbols in the Web-exclusive sidebar "Dr. Watson Debugging Symbols" (http://www.winnetmag.com/microsoftexchangeoutlook, InstantDoc ID 42879).
A server hang can have many causes, including insufficient virtual memory, a memory leak caused by an application, a process that hogs available CPU resources, or a conflict for system resources between the OS and a third-party application. In such circumstances, Microsoft Product Support Services (PSS) engineers or anyone else attempting to troubleshoot the problem will require certain information before they can discover the root cause.
To debug an Exchange server crash or hang event, PSS typically requests that you run three utilities (MPS_Reports, ADPlus, and VaDump) and send the output to PSS for analysis. For more information about sending your debugging information to PSS, see the Web-exclusive sidebar "Getting Help from Microsoft PSS" (http://www.winnetmag.com/microsoftexchangeoutlook, InstantDoc ID 42881).
I recommend that you install the MPS_Reports, ADPlus, and VaDump debugging tools on all your servers as part of your server build process. Without these tools, you might not be able to diagnose a hanging server. For more information about installing and using these tools, as well as the URLs for downloading them, see "Microsoft Resources." Microsoft regularly adds new features to its debugging tools, so I suggest checking the Microsoft site for updates every few months and upgrading the tools on a quarterly basis as part of your usual server-update process.
The MPS_Reports utility gathers configuration information from a server and runs on all versions of Windows Server 2003, Windows XP, Windows 2000, and Windows NT 4.0. You'll need local administrator privileges to run the utility. MPS_Reports generates compressed files that contain configuration information, so it requires disk space to run. The program is available in the following versions:
- Alliance edition (mpsrpt_alliance.exe) is a general, all-purpose version that captures a range of configuration information.
- Cluster edition (mpsrpt_cluster.exe) captures information relevant to Windows Cluster Service.
- Directory Services edition (mpsrpt_dirsvc.exe) captures information relevant to Microsoft Directory Services.
- Network edition (mpsrpt_network.exe) captures information relevant to networking.
- Setup edition (mpsrpt_setupperf.exe) captures information relevant to setup and performance.
- Software Update Services edition (mpsrpt_sus.exe) captures information relevant to Microsoft Software Update Services (SUS).
- SQL edition (mpsrpt_sql.exe) captures information relevant to Microsoft SQL Server.
- MDAC edition (mpsrpt_mdac.exe) captures information relevant to Microsoft Data Access Components (MDAC).
The Alliance, Setup, and Directory Services editions are the versions that apply to debugging Exchange. The Alliance and Setup versions help you eliminate drivers and incorrect Windows configuration as a root cause for Exchange server instability, and the Directory Services edition helps you troubleshoot Directory Services problems. Note that MPS_Reports is only a reporting tool; it doesn't change the registry or require any downtime. See "Microsoft Resources" for more information about installing and running MPS_Reports.
ADPlus is another debugging tool that will help you troubleshoot a process or an application that hangs or crashes. ADPlus is supported on all versions of Windows 2003, XP, Win2K, and NT and requires Windows Script Host (WSH) 5.6 or later. ADPlus generates memory dumps and log files that contain debugging information. You can also use ADPlus instead of userdump.exe to obtain memory dumps of processes. The most recent version, ADPlus 6.3.11, released March 1, 2004, includes support for Itanium (formerly code-named Merced) servers and Longhorn (Client Preview version, build 4051). ADPlus is part of the Microsoft Debugging Tools for Windows, which you can download at http://www.microsoft.com/whdc/devtools/debugging/default.mspx.
Before you run ADPlus, always verify that Exchange services are indeed in a hung or frozen state. A hanging server can exhibit a variety of symptoms, including the following:
- Outlook clients can't connect. In some hanging scenarios, existing client connections are unaffected. Try connecting to a mailbox on the server to confirm that the server is frozen.
- Processor utilization is high.
- Exchange services go into an uncontrolled state, and you can't start or restart them.
- ESM is unresponsive.
Before running ADPlus, make sure that network problems aren't causing the Exchange problems. Ping the server to verify that it's reachable. Also, verify that your Exchange GC is online and that no DSAccess problems exist.
ADPlus runs in two modes: hang mode and crash mode. You can use ADPlus in hang mode to troubleshoot an unresponsive process or a process that's using 100 percent of your CPU. While running in this mode, ADPlus generates memory dumps for all active processes. To generate these memory dumps, ADPlus gains exclusive access to the processes, so clients will be unable to connect. For this reason, don't use ADPlus in hang mode on a production server that's online and functioning as usual.
You can run ADPlus in crash mode to troubleshoot applications that terminate unexpectedly. Unlike hang mode, crash mode requires that ADPlus be running in crash mode before the crash occurs. When ADPlus is running in crash mode, a debugger remains attached to each process that you specify on the command line.
After you run ADPlus on your server, send the dump files to PSS for analysis, then reboot the server to clear the hanging processes. The dump files that ADPlus generates contain information that isn't easy for the average user to understand. PSS engineers have special tools and training that enable them to interpret the contents of debug files.
When a production Exchange server hosting several thousand users goes into a hung state, the Help desk is usually swamped with calls, which puts a lot of pressure on support personnel. Under pressure, people are more likely to make mistakes. To alleviate this pressure and reduce the time needed to diagnose problems, I recommend the following best practices for using ADPlus to debug your Exchange servers.
Use batch files to automate data collection. Using ADPlus to generate dump files for Exchange processes requires quite a few keystrokes. I recommend that you create batch files to automate the collection process. Listing 1 shows a sample batch file that creates dump files for the Store, System Attendant, Microsoft Search, Microsoft IIS, and Exchange management processes. In this example, I used ADPlus with the -pn switch to specify the name of the process to be dumped. You'll have downtime for each process you run ADPlus against because ADPlus gains exclusive access to a process to generate a dump file.
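As a rough illustration of the kind of command lines such a batch file contains, here is a minimal Python sketch that generates one hang-mode ADPlus command per process, using only the -hang, -pn, and -o switches that this article describes. The process image names (mad.exe for System Attendant, inetinfo.exe for IIS) and the output folder are assumptions for illustration; confirm the image names in Task Manager on your own server. This is not a reproduction of Listing 1.

```python
def hang_dump_commands(process_names, output_dir):
    """Build one 'adplus -hang -pn <name> -o <folder>' line per process."""
    commands = []
    for name in process_names:
        commands.append("adplus -hang -pn %s -o %s" % (name, output_dir))
    return commands

# Assumed image names; verify them on your server before relying on this list.
EXCHANGE_PROCESSES = ["store.exe", "mad.exe", "inetinfo.exe"]

# Assumed dump folder on a non-system drive.
for line in hang_dump_commands(EXCHANGE_PROCESSES, r"f:\dumps"):
    print(line)
```

You could paste the printed lines into a batch file and keep the file on each server, so that nobody has to type the commands during an outage.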
Run ADPlus against IIS. Because IIS and Exchange are tightly integrated, you need to run ADPlus against IIS. Type the command
to retrieve debugging information that you can use to troubleshoot Exchange protocols (e.g., SMTP, IMAP, POP) that IIS manages and to troubleshoot Outlook Web Access (OWA) client connection problems.
Ensure that enough free space is available for ADPlus dumps. The -o switch, which Listing 1 also shows, directs ADPlus to create dump files in a specified folder. Dump files can be quite large, so I recommend that you create them in a location other than the system drive. For each ADPlus session, the tool creates a unique folder name that contains the date and time the dump was captured. This feature prevents you from overwriting dump files. If a server experiences multiple hangs, you can run ADPlus for each incident and send a collection of dump files to Microsoft to identify a common root cause. Figure 2 shows the folder structure for a sample batch file execution.
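A free-space check before you start a dump session can save you from a half-written dump. The sketch below uses Python's standard shutil.disk_usage call; the 2GB threshold is only an assumed figure for illustration, so pick a value based on the dump sizes you've actually seen on your servers.

```python
import shutil

GIB = 1024 ** 3

def has_room_for_dumps(folder, required_bytes):
    """Return True if the drive that holds `folder` has enough free space."""
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= required_bytes

# "." stands in for the dump folder; 2 GiB is an assumed threshold.
print(has_room_for_dumps(".", 2 * GIB))
```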
Windows Task Manager and ADPlus
When a process is consuming 100 percent of the CPU, you can use ADPlus to generate a dump for that process, based on the process identifier (PID). To obtain the PID, open Task Manager (Start, Run, taskmgr.exe) and, on the Processes tab, select the Show processes from all users check box to view all the active processes on your server. Under the View menu, choose Select Columns to add the PID column to your Task Manager view, as Figure 3 shows. Always look up the PID immediately before you run ADPlus: the server assigns a PID when a process starts, so the PID changes whenever the server reboots or the service associated with the process is restarted. Using Task Manager will ensure that you retrieve the current PID. After you obtain the PID, you can generate a dump file for the process that's hogging the CPU by entering
adplus -hang -p xxxx
where xxxx represents the offending process's PID.
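If you'd rather script the PID lookup than read it from Task Manager, one alternative (not the method described above) is to parse the output of the tasklist command, which ships with XP and Windows 2003. The following Python sketch assumes you've captured the output of tasklist /FO CSV /NH as text; the sample lines are made up for illustration, apart from the Store PID of 4368 used elsewhere in this article.

```python
import csv
import io

def pids_for_image(tasklist_csv, image_name):
    """Return the PIDs of every row whose image name matches `image_name`."""
    reader = csv.reader(io.StringIO(tasklist_csv))
    wanted = image_name.lower()
    return [int(row[1]) for row in reader
            if len(row) >= 2 and row[0].lower() == wanted]

# Illustrative text in the shape that 'tasklist /FO CSV /NH' produces:
# image name, PID, session name, session number, memory usage.
sample_output = (
    '"store.exe","4368","Console","0","412,316 K"\n'
    '"mad.exe","1520","Console","0","38,104 K"\n'
)

for pid in pids_for_image(sample_output, "store.exe"):
    print("adplus -hang -p %d" % pid)
```

Because the PID changes whenever the process restarts, capture the tasklist output immediately before you build the ADPlus command.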
VaDump is a Microsoft Windows 2000 Resource Kit utility (it also runs on Windows 2003) that lets PSS analyze a process's virtual address memory in detail. VaDump provides the following memory information about a process's virtual address memory use:
- each address, along with its size, state, protection, and type
- total committed memory for the image, the .exe file, and each .dll file, including system .dll files
- total mapped-committed, private-committed, and reserved memory
- information about the working set and about paged and nonpaged pool usage
Unlike running ADPlus, running VaDump doesn't result in any server downtime. I recommend that you check with PSS before running VaDump; many command switches are available for VaDump, and PSS might require that you use the tool with specific switches. Web Figure 1 (http://www.winnetmag.com/microsoftexchangeoutlook, InstantDoc ID 42878) shows the switches you can use with VaDump.
The Microsoft article "How to Gather Data to Troubleshoot Exchange Server 2003 Virtual Memory Issues" (http://support.microsoft.com/?kbid=823150) provides guidance about the information you need to gather to troubleshoot Exchange virtual memory problems. The article recommends that you run VaDump against store.exe because it's the most memory-intensive Exchange process. VaDump requires that you provide the PID for the process you want to debug, so use Task Manager to obtain the PID, as I described earlier. In Figure 3, the PID for the Store process is 4368. To capture the Store process that has a PID of 4368, use the following command:
vadump -v -p 4368 > f:\diagnosis\va_store_23mar2004.txt
VaDump doesn't generate a log file by default, so you'll need to use the redirection symbol (>) to generate a log file that PSS can analyze. I recommend that you give the dump file a meaningful name that includes the process name and the date of the VaDump session, such as the name I used in the preceding example. The -v switch directs VaDump to capture verbose information.
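To keep log names consistent from one session to the next, you could generate the command line rather than type it. This Python sketch builds a VaDump command that uses the -v and -p switches and the redirection shown above, with a file name made up of the process name and the session date. The function name, the naming pattern, and the output folder are illustrative assumptions, and PSS might ask you to use different switches.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def vadump_command(pid, process_label, when, out_dir):
    """Build a verbose VaDump command that redirects output to a dated log."""
    stamp = "%02d%s%d" % (when.day, MONTHS[when.month - 1], when.year)
    log_path = "%s\\va_%s_%s.txt" % (out_dir.rstrip("\\"), process_label, stamp)
    return "vadump -v -p %d > %s" % (pid, log_path)

# The PID and date match the example in the text; the folder is an assumption.
print(vadump_command(4368, "store", date(2004, 3, 23), "f:\\diagnosis"))
```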
Gathering Performance Monitor Information
Performance Monitor can help you diagnose the root causes of Exchange crashes, especially in cases of memory leaks and virtual memory fragmentation, because it lets you see how the server is performing over time. The Microsoft article "How to Create a Log Using System Monitor in Windows 2000" (http://support.microsoft.com/?kbid=248345) describes how to configure Performance Monitor to log diagnostic information. To troubleshoot Exchange problems, you'll want to set up Performance Monitor to log the counters that Web Figure 2 lists.
Don't create Performance Monitor logs on the server that you're monitoring. If the server is rebooted, you'll have to manually restart performance monitoring. Also, if you run Performance Monitor on a production server and the server crashes or hangs, the server won't be able to send a notification message. If you have a spare machine, set up Performance Monitor on a workstation running XP. XP's version of Performance Monitor (called System Monitor) lets you use additional tools to automate performance monitoring. See the Microsoft article "Description of the Windows XP Logman.exe, Relog.exe, and Typeperf.exe Tools" (http://support.microsoft.com/?kbid=303133) for information about these tools.
Also ensure that adequate disk space is available. The size of the log files depends on the number of counters you monitor and the logging time interval. Performance logs can become quite large (up to 500MB for a 24-hour logging period).
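Because log size scales with the number of counters and the sampling interval, you can make a back-of-the-envelope estimate before you start logging. The Python sketch below shows only the arithmetic; the bytes_per_value figure is a placeholder, not a measured number, so run a short test log on your own system and substitute what you observe.

```python
def estimated_log_bytes(counter_count, interval_seconds, hours, bytes_per_value):
    """Estimate log size as samples x counters x bytes stored per value."""
    samples = (hours * 3600) // interval_seconds
    return samples * counter_count * bytes_per_value

MB = 1024 * 1024

# All four inputs are assumptions chosen only to show the calculation.
size = estimated_log_bytes(counter_count=200, interval_seconds=15,
                           hours=24, bytes_per_value=16)
print("%.1f MB" % (size / MB))
```

Halving the sampling interval doubles the estimate, which is why a short interval on a long counter list fills a disk quickly.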
Diagnosing Memory Leaks
A memory leak occurs when a process requests memory and doesn't release the memory back to the OS when the process finishes. Over time, this occurrence exhausts all available memory, causing the system to crash. Some common causes of memory leaks include the following:
- An application's processes are leaking memory because of a bug in the program.
- A device driver is leaking memory in kernel mode. In this scenario, the root cause of the problem won't be Exchange; a device driver associated with a hardware device is consuming memory.
- Malicious users are directing Denial of Service (DoS) attacks against a server to try to exhaust system resources such as memory or disk space.
When a system crashes, a blue screen with the message STOP 0x0000001E (0xC0000044) "STATUS_QUOTA_EXCEEDED" signals a memory leak. Microsoft provides the poolmon.exe tool to troubleshoot memory leaks. The tool is part of the Support Tools on the Windows 2003 and Win2K installation CD-ROM.
Use Poolmon to perform an audit of the device drivers and third-party products installed on the server. Compare the installed versions with the most recent release versions to see whether a newer version exists. Contact the vendor to see whether a newer version of the application or driver fixes a memory-leak problem. As a best practice, you should upgrade layered applications and device drivers regularly so that memory leaks and other problems are fixed as newer releases become available.
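The leak pattern described in this section, memory that is requested and never released, shows up in logged counter data as steady growth over time. As a minimal sketch, the following Python code fits a least-squares trend line to (hours elapsed, megabytes in use) samples and flags growth above a threshold. The sample values, the function names, and the threshold are all hypothetical.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator

def looks_like_leak(samples, threshold_mb_per_hour):
    """Flag a process whose memory trend rises faster than the threshold."""
    return growth_per_hour(samples) > threshold_mb_per_hour

# Hypothetical samples: memory use climbing 10MB every hour.
rising = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(growth_per_hour(rising))
print(looks_like_leak(rising, threshold_mb_per_hour=5))
```

A rising trend is a prompt for further investigation rather than proof of a leak, because some processes legitimately grow as they cache more data.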
Diagnosing Virtual Memory Fragmentation
A common cause of instability in early Exchange 2000 deployments was virtual memory fragmentation. Microsoft released an updated version of the Store in Exchange 2000 Service Pack 3 (SP3) that improved virtual memory usage. The Windows & .NET Magazine article "Monitoring Virtual Memory" (November 2003, InstantDoc ID 40458) describes how to configure Performance Monitor to log information relevant to memory allocation. The list of performance counters in Web Figure 2 includes the relevant memory allocation counters.
The tools and techniques I've described will help you troubleshoot Exchange stability and performance problems. Before you use these procedures in production, I recommend that you practice them on a test server. You can use MPS_Reports and VaDump with no service disruption; running ADPlus in hang mode locks out processes, so use it with caution. Also, make sure device-driver revision levels and third-party software versions are up-to-date to help prevent memory leaks. Your users will benefit if you have the necessary tools and preparation for quickly diagnosing and fixing Exchange problems.