A pervasive problem and some temporary solutions

Memory leaks have plagued Windows NT since its first release. While diagnosing a slow server problem, I searched the Microsoft Knowledge Base for articles related to memory leaks. The sidebar "Windows NT Memory Leaks" lists Knowledge Base articles about memory leaks and demonstrates how pervasive the problem is. Troublesome areas include the Server Service, remote procedure call (RPC) service, Remote Access Service (RAS), Performance Monitor, LMRepl service (the replication service), Client Services for Novell NetWare, File and Print Services for NetWare, Active Server Pages (ASP), Winlogon, Domain Name System (DNS), NTFS, and print spooling. Service Pack 4 (SP4) corrects most of these problems.

To understand memory leaks, you need to understand how OS services and applications allocate memory. NT’s Memory Manager manages OS and application program memory requests. The Memory Manager assigns and releases memory for OS components and applications and separates the OS address space from the application processes’ address space. The Memory Manager also prevents processes from accidentally or purposefully changing an address space other than their own. For an in-depth discussion of NT memory management, see Mark Russinovich, "Inside Memory Management," parts 1 and 2, August and September 1998.

A thread is the basic unit of execution in NT, and one or more threads make up a process. When a thread starts, the Memory Manager allocates physical memory and paging file space to let the thread load and run. While running, a thread can request additional memory to accomplish its task. The Memory Manager handles thread space on a process basis: when a process has many threads, the process’ allocated memory grows accordingly. The Memory Manager allocates application memory from free pages or the paged pool and OS memory from the paged or nonpaged pool.

The Memory Manager allocates pages from the paged pool when it can temporarily write process data to the paging file, whereas pages in the nonpaged pool never leave memory. NT components such as device drivers use the nonpaged pool to store data structures that are necessary for interrupt processing or symmetric multiprocessing (SMP) routines. These pages must remain in memory because NT doesn’t allow page faults (i.e., references to a page that isn’t resident in physical memory) while processing device code.

What Is a Memory Leak?
A memory leak results when a memory allocation occurs repeatedly without the corresponding memory release. Over time, memory leaks can cause the system to allocate all available memory (i.e., exhaust physical memory and paging file space). When no more memory is available, NT hangs until the system releases memory.

Memory leaks commonly occur in two scenarios: when a process creates worker threads but doesn’t delete them, and when a thread allocates but doesn’t release memory. In many cases, the effects of a memory leak are subtle—a few pages here, a few pages there—until much later when the offending application or service has run for a long time.
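To make the "few pages here, a few pages there" effect concrete, here is a minimal sketch in Python. It does not touch NT's Memory Manager; `TrackedPool`, `well_behaved_request`, and `leaky_request` are hypothetical names that only count pages handed out and returned, so you can see how a one-page mismatch per request adds up.

```python
class TrackedPool:
    """Toy stand-in for a memory pool: it only counts pages in use."""

    def __init__(self):
        self.pages_in_use = 0

    def allocate(self, pages):
        self.pages_in_use += pages

    def release(self, pages):
        self.pages_in_use -= pages


def well_behaved_request(pool):
    pool.allocate(4)          # take four pages for the request
    try:
        pass                  # ... the real work would happen here ...
    finally:
        pool.release(4)       # every allocated page is handed back


def leaky_request(pool):
    pool.allocate(4)          # take four pages for the request
    pool.release(3)           # bug: one page is never released


def pages_after(request, calls):
    """Run `request` `calls` times on a fresh pool; return pages still held."""
    pool = TrackedPool()
    for _ in range(calls):
        request(pool)
    return pool.pages_in_use
```

After 1,000 calls, the well-behaved routine holds no pages, whereas the leaky routine still holds 1,000. A single call's loss is invisible, which is why the symptom appears only after the offending code has run for a long time.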

Memory leaks can cause temporary memory shortages in application programs that run for a short time. However, memory leaks in services that run as part of the OS or in production applications (e.g., a database) can cause severe problems. If pages in the paged pool cause a memory leak, system performance slows as the paging file fills up. A nonpaged pool memory leak is at least as serious: because nonpaged pool pages never leave physical memory, the leak steadily reduces the memory available to everything else until allocations begin to fail.

When a paged pool memory leak exists, you see a pop-up error message on the console and an entry in the system event log. A common error message is "Your system is running without a properly sized paging file. Please use the System applet’s Virtual Memory option in Control Panel to create a paging file or to increase the size of your paging file." This message appears when you boot NT without a paging file and when the paging file is full.

To increase the size of the paging file, right-click My Computer, select Properties, and go to the Performance tab. The total of all paging files appears in the lower part of the screen. Click Change to adjust the size of the current paging file or add another paging file on another drive. I recommend a starting paging file size of at least twice the amount of physical memory in your system; multiple paging files on different disks can significantly improve performance for systems running large applications.
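The sizing rule above is simple arithmetic, sketched below. The function names are hypothetical, and the even split across drives is my assumption for illustration only; the rule itself says nothing about how to divide the total among multiple paging files.

```python
def suggested_paging_file_mb(physical_mb, multiplier=2):
    """Starting paging file size: a multiple (default 2x) of physical memory."""
    return physical_mb * multiplier


def split_across_drives(total_mb, drives):
    """Divide the total evenly across drives (an assumed policy).

    Any remainder that doesn't divide evenly goes to the first drive.
    """
    share, extra = divmod(total_mb, len(drives))
    sizes = {drive: share for drive in drives}
    sizes[drives[0]] += extra
    return sizes
```

For a server with 128MB of RAM, `suggested_paging_file_mb(128)` gives 256, and `split_across_drives(256, ["C:", "D:"])` gives 128MB on each drive.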

When the paging file or paged and nonpaged pools don’t contain enough free space, you receive an error message. Two common error messages are "The server was unable to allocate the nonpaged pool from the system because the pool was empty" and "Your system is running low on virtual memory. Please close some applications."

If you have a memory leak, increasing the size of the paging file provides a temporary solution. A larger paging file gives the system a little longer to run before exhausting the space in the extended or secondary paging file.

Diagnosing Memory Leaks
You can diagnose most memory leaks with Performance Monitor and several Microsoft Windows NT Server 4.0 Resource Kit utilities. (For a list of resource kit tools, see the sidebar "Resource Kit Tools for Diagnosing and Monitoring Memory Leaks.") You start by verifying that a memory leak exists; then you identify the process or service responsible for the leak. Memory leaks in the System process are usually the result of an errant device driver; unfortunately, you can’t dynamically stop and start most device drivers, so driver leaks are difficult to find.

With Performance Monitor, you can watch overall statistics for thread, pool, and paging file usage (to verify that the leak exists). You can also monitor these counters on an individual process basis (to identify the problem service or application). Diagnosing memory leaks in services or applications is an iterative process that can take days and can require several Performance Monitor profiles.

Performance Monitor objects. Performance Monitor has several object classes that assist in memory leak diagnosis. The four most important objects are the Paging File (% Usage, % Usage Peak), Memory (Pool Nonpaged Bytes, Pool Paged Bytes), Objects (Threads), and Process (Page File Bytes, Pool Nonpaged Bytes, Pool Paged Bytes, Private Bytes, Thread Count).

The Paging File category monitors overall paging file usage for the paging files you select from the instance list. The Memory category tracks overall system paging rates, pool space usage, and other metrics. The Objects class tracks systemwide counters for six items, including processes and threads. You use these classes and counters to verify that a memory leak exists. The Process object monitors activity on an individual process basis, rather than overall system statistics.

After you select the metrics to monitor in the Counter box, you select specific processes from the alphabetical list in the Instance box. For example, to monitor the DNS service and the spooler service, select dns and spoolss from the instance list. Microsoft applications and third-party applications and services also appear in the instance list (e.g., WinWord, DKService) if the application is currently active. You use the Process object and counters to identify the source of a memory leak.

Diagnosing thread leaks. Two potential sources of memory leaks exist in NT: undeleted threads and unreleased memory. For diagnostic purposes, you can monitor threads, pool space, or both. You can monitor the total number of threads with the Performance Monitor class Objects:Threads. If you see an increasing overall thread count, monitor the Process:Thread Count metric for individual processes to identify the process responsible for creating the threads.
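If you record Process:Thread Count readings by hand or copy them out of a log, a few lines of code can rank the processes by growth. The sketch below assumes a hypothetical data layout (one dictionary of process name to thread count per sample, in time order); it is not a Performance Monitor export format, and the numbers are invented.

```python
def thread_growth(samples):
    """Rank processes by thread-count growth between first and last sample.

    `samples` is a time-ordered list of {process_name: thread_count} dicts.
    Processes missing from either the first or the last sample are ignored.
    Returns (process_name, growth) pairs, largest growth first.
    """
    first, last = samples[0], samples[-1]
    growth = {name: last[name] - first[name] for name in first if name in last}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


# Invented readings taken at three different times.
readings = [
    {"dns": 12, "spoolss": 9, "winword": 4},
    {"dns": 12, "spoolss": 31, "winword": 4},
    {"dns": 13, "spoolss": 58, "winword": 4},
]
ranked = thread_growth(readings)
```

Here `ranked[0]` is `("spoolss", 49)`, which marks the spooler as the process to examine first.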

If you think an NT or third-party service is causing the problem, start the Services applet in Control Panel. Then, tile the windows so that you can view the Performance Monitor and Services windows at the same time. When you stop and start the services one at a time, a sharp decrease in the total number of threads identifies the culprit service.

To identify problems in application software, use a similar technique. When you stop an application that is leaking memory, the total number of threads decreases dramatically. This sudden dip in the count becomes obvious when you chart the Objects:Threads and Process:Thread Count metrics.
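The stop-and-watch comparison for services and applications can be sketched as follows. The tuple layout and the function name `largest_drop` are hypothetical; the counts stand for Objects:Threads readings you note just before and just after stopping each candidate, and the values are invented.

```python
def largest_drop(observations):
    """Find the candidate whose stop caused the biggest fall in total threads.

    `observations` is a list of (name, threads_before_stop, threads_after_stop).
    Returns (name, drop) for the largest drop.
    """
    name, before, after = max(observations, key=lambda o: o[1] - o[2])
    return name, before - after


# Invented systemwide thread totals around each stop.
observed = [
    ("dns", 412, 400),
    ("spoolss", 414, 365),
    ("lmrepl", 411, 405),
]
culprit = largest_drop(observed)
```

With these readings, `culprit` is `("spoolss", 49)`. A drop that is much larger than the others is the signal; small differences such as 6 or 12 threads are what you would expect from stopping any healthy service.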

Diagnosing pool leaks. When an NT component or service has a pool leak, the number of bytes in the paged or nonpaged pool increases steadily and never declines. To document this rise with Performance Monitor, profile the suspect services and watch Page File Bytes, Pool Paged Bytes, and Pool Nonpaged Bytes for an extended period. To create the performance profile, choose the Process category and metrics and pick the list of suspect components and services from the instance list. For example, when an application has a memory leak, the number of Private Bytes increases and never shrinks. To diagnose this problem, create a performance profile to track the application process and Private Bytes usage (Process:Private Bytes).

You might have to monitor a combination of classes and counters for hours or even days before you can clearly identify a leak. (To find out how to implement performance monitoring, see Marcia Loughry, "Monitoring Windows NT Server’s Performance," October 1998.) All memory leaks exhibit a similar pattern: pool usage and thread counts grow in steady stair steps over time. Usage remains flat for a time, then jumps up, repeatedly and indefinitely. If the Memory Manager is releasing threads and pool space appropriately, the counters decline accordingly. (For a quick way to spot some memory leaks, see the sidebar "Shortcut for Spotting Memory Leaks.")
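The stair-step pattern is easy to test for once you have a series of counter readings. The sketch below is one simple heuristic, not a Performance Monitor feature: it flags a series that never declines and ends higher than it started. The function names, the `min_growth` threshold, and the sample values are all assumptions for illustration.

```python
def looks_like_leak(samples, min_growth=0):
    """Return True if the readings never decline and show net growth.

    `samples` is a time-ordered list of counter readings (e.g., bytes or
    thread counts). Fewer than two readings can't show a trend.
    """
    if len(samples) < 2:
        return False
    never_declines = all(
        later >= earlier for earlier, later in zip(samples, samples[1:])
    )
    return never_declines and samples[-1] - samples[0] > min_growth


def count_steps(samples):
    """Count the upward jumps between consecutive readings."""
    return sum(1 for a, b in zip(samples, samples[1:]) if b > a)


stair_step = [100, 100, 100, 140, 140, 140, 180, 180, 220]   # flat, jump, flat
healthy = [100, 140, 120, 100, 150, 100]                     # rises and falls
```

`looks_like_leak(stair_step)` is true with three upward steps, whereas `looks_like_leak(healthy)` is false because the counter declines when memory is released. A short series can give a false alarm, so treat the result as a reason to keep monitoring rather than as proof.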

Taking Corrective Action
Many documented cases of NT memory leaks exist. Corrective action includes installing updated software; stopping and starting the service, process, or application responsible for the memory leak; and rebooting the machine regularly until a permanent fix is available. When the source of a memory leak is a core component of the OS that you can’t start and stop, the only way to correct the problem is to reboot the system often enough to keep adequate memory free until updated software is available.

When a native or third-party service is responsible for a memory leak, stopping the service releases its allocated memory. To temporarily correct the problem, stop and start the service in the Services applet in Control Panel or from the command line (using NET STOP and NET START commands). If memory leaks are causing problems on several servers, you can write a script to stop and start the responsible service and run the script on a daily or weekly basis.
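A restart script of this kind can be very small. The sketch below wraps the NET STOP and NET START commands in Python purely for illustration; it assumes a machine with a current Python interpreter, and a two-line batch file would do the same job. The function name `restart_service` and the injectable `run` parameter are my own choices, and `Spooler` is only an example service name.

```python
import subprocess


def restart_service(name, run=subprocess.run):
    """Stop a service, then start it again, using NET STOP and NET START.

    `run` defaults to subprocess.run; pass a substitute to rehearse the
    script without touching a real service. With check=True, a failed
    NET STOP raises an error and NET START is not attempted.
    """
    for action in ("stop", "start"):
        run(["net", action, name], check=True)


# Example (not executed here): restart_service("Spooler")
```

Passing a stand-in for `run` lets you confirm the exact commands before you schedule the script against a production server.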

You can also use the stop and start technique as a temporary solution for applications that are leaking memory. If you can’t stop an application because it is mission-critical, you might be able to schedule the application to restart after business hours. Be sure to use the application’s native shutdown feature, if available, to avoid database or file corruption and ensure an optimal application restart. These corrective techniques will help you keep memory utilization under control until a permanent solution (i.e., a service pack, hotfix, or upgrade) is available. Now that Windows 2000 (Win2K—formerly NT 5.0) is on the horizon, Microsoft has announced the elimination of more than 400 memory leaks—let’s just hope the developers don’t introduce as many problems as they fixed.