Learn how NT keeps tabs on memory

Last month, I began this two-part series about memory management in Windows NT by introducing the concept of virtual memory. I reviewed the way x86 and Alpha processors use a two-level virtual address-to-physical address translation scheme. I discussed paging and introduced two powerful features of the Memory Manager: memory-mapped files and copy-on-write memory.

This month, I'll go into more detail about the internal data structures the Memory Manager uses to track the state of memory. I'll discuss working sets and the Page Frame Number (PFN) Database. I'll wrap up with a tour inside the additional data structures the Memory Manager uses to track memory shared by two or more applications, and I'll discuss Section Objects, the data structures the PFN Database uses to implement memory-mapped files.

Working Sets
The biggest effect the Memory Manager has on individual applications' performance and on the system is in its allocation of physical memory to each active process. The amount of memory the Memory Manager assigns to a process is called the working set. Every process has a working set. A special working set, called the system working set, is physical memory that belongs to parts of the NT Executive, device drivers, and the Cache Manager. If a process' working set is too small, the process will incur a high number of page faults as it accesses page table entries that describe data not present in memory but located either in the process' executable image on disk or in a paging file. For each such access, the Memory Manager must intervene and perform disk I/O to retrieve the data. If a process' working set is too large, the process will not incur page faults, but its physical memory might be holding data that the process will not access for some time, and that other processes might require. Thus, the Memory Manager must try to achieve a balance for all processes according to their memory usage patterns.

The choice NT makes for two memory-management policies--the page fetch policy and the page replacement policy--affects the way NT manages working sets. The page fetch policy is the method NT uses to bring in a process' data. NT uses the most common page fetch policy, demand paging. In demand paging, a process' data is paged-in as a process accesses it. A different page fetch policy, prefetching, requires the Memory Manager to aggressively bring in a process' data before the process asks for it. Prefetching can improve the performance of an application at the expense of other applications, because the Memory Manager's prediction of what data a process will ask for is likely to be imperfect and can waste physical memory in storing unused data. NT therefore implements a slight variation on demand paging: clustered demand paging. Instead of bringing into a process' working set only the page that a process accesses, the Memory Manager will attempt to bring in pages that surround the requested page. The idea behind clustered demand paging is that if a process accesses a page, it will likely also access the memory just preceding or following that page. The pages the Memory Manager thus brings in are known as a cluster. In NT, clusters can vary between 0 pages and 7 pages, depending on the amount of physical memory present on the system and whether the system is accessing code or data.

The page replacement policy affects the way the Memory Manager decides which page to remove from a working set. The Memory Manager must remove pages whenever physical memory is full of in-use pages and space is necessary to accommodate a page of data a process is accessing. Two characterizations for replacement policies are global and local. In a global replacement policy, the Memory Manager considers all pages of physical memory as replacement candidates, without regard to which working set the pages belong to. In a local replacement policy, the Memory Manager considers as replacement candidates only pages in the working set of the process that is accessing the page to be brought in. The disadvantage of global policy is that a rogue process can, by accessing large amounts of memory very quickly, adversely affect all other processes in the system by forcing their data out of memory. As with its page fetch policy, NT implements a variation of the replacement policy. This variation is local, because it first considers the pages in the working set of the process that is accessing the data. But the variation is also global in that it removes pages from the working sets of other processes if their memory requirements are low.

Replacement policies are further characterized by the method with which they choose a page to replace out of the set of local or global candidates. One policy, first in, first out (FIFO), removes pages in the order in which they were added to the working set (i.e., the first page added is the first page removed). An algorithm that has a more positive effect on the performance of applications is least recently used (LRU). The LRU algorithm requires the aid of the MMU to give the Memory Manager an idea of how recently a process accessed a particular page. LRU replaces first those pages that processes have not accessed for the longest period of time. On uniprocessors, NT uses the clock algorithm, a simple form of LRU replacement that is based on the limited access tracking x86 and Alpha MMUs provide. Whenever a process references a page, these MMUs set the Accessed flag bit in the page table entries (PTEs) of the page. If a page's Accessed flag is not set, the Memory Manager knows that no processes have accessed the page since the last time the Memory Manager examined the page's PTE.

The working set of every process, including the system working set, has a minimum size, a current size, and a maximum size. On a typical NT system with 64MB of physical memory, the default minimum size of a working set is 200KB, and the default maximum size is 1.4MB. Processes with the PROCESS_SET_QUOTA privilege can override these values using a Win32 API. When a process starts, it generates a large number of page faults as it references data in its virtual memory map (most often in its executable image) that it must bring into physical memory. The Memory Manager lets a process' current working set size grow as needed to its working set maximum. When a working set reaches its maximum size, the Memory Manager will allow additional pages into the working set only if plenty of free (unused) pages are available. If free pages are not available, the Memory Manager uses the replacement policies I just described to determine which page in the working set to replace. The pages in the working set are linked in a list that the Memory Manager scans. On a uniprocessor, if the Memory Manager finds a page with its Accessed flag set, the Memory Manager clears the flag and proceeds to the following pages, selecting for replacement the next page it finds with a cleared Accessed flag. Thus, the Memory Manager avoids pages that the process has accessed since the previous scan, because the Memory Manager assumes that the process will access those pages again. If the Memory Manager finds no pages to select for replacement, it restarts the scan and almost certainly finds a page whose Accessed flag it cleared on the previous pass and the process has not accessed since the first scan.

On a multiprocessor system, the Memory Manager does not clear Accessed flags during scans, because any modification to a PTE on a multiprocessor system invalidates Translation Look-aside Buffer (TLB) entries (which I described last month) on all processors. If the Memory Manager invalidates an address' cached translation by clearing its Accessed flag, the address must again undergo the slow three-step virtual memory translation process the next time a process references it. The result is an effectively random page-replacement algorithm on multiprocessors. Microsoft is developing more effective replacement algorithms for the multiprocessor version of NT 5.0.

The Balance Set Manager
The Memory Manager adjusts the sizes of working sets once every second, in response to page-in operations or when free memory drops below a certain threshold. The Memory Manager also examines all working sets to determine whether the Balance Set Manager thread needs to trim them. If free memory is plentiful, the Balance Set Manager removes pages from only the working sets of processes whose current size is above minimum that have not recently incurred many page faults. However, if the Memory Manager awakens the Balance Set Manager thread because free memory is scarce, the Balance Set Manager can trim pages from any process until it creates an adequate number of free pages--even to the point of dropping working sets below their minimum size.

In addition to trimming working sets, once every 4 seconds the Balance Set Manager can swap out pages that belong to the call stacks of threads that have been sleeping for more than 7 seconds (3 seconds on systems with less than 19MB of memory). The call stack is a special type of memory that records the function invocations a thread makes. If a thread has been asleep (e.g., while waiting for a user to enter a keystroke) for a long time, the Memory Manager assumes the thread will sleep for a longer time. If the Balance Set Manager swaps out the stacks of all the threads in a process, it trims the process' working set to nothing, and the system assumes that the process is inactive. Swapping out the stacks of all the threads in a process minimizes the footprints of processes that don't require memory because they are waiting for infrequent events to occur before they become active.

The Page Frame Database
Last month, I described how NT organizes physical memory as a list of page frames, each of which is one page in size (4096 bytes on x86 processors, 8192 bytes on Alphas). The Memory Manager mirrors this list of page frames with its own list--the PFN Database. Each entry in the PFN Database corresponds to a physical page of memory (e.g., entry 5 in the PFN Database corresponds to page frame 5 in physical memory). Entries in the PFN Database record information about the state of data stored in corresponding physical pages.

A physical page can exist in one of eight states, which Table 1, page 64, describes. Figure 1 illustrates how the Memory Manager adds to a list all the PFN Database entries that belong to pages in the same state (except pages in the valid or transition states). For example, all modified pages go on the modified page list. A given page undergoes several status changes during its lifetime; therefore, its PFN entry moves among different lists. Figure 1 shows how working sets are associated with valid pages in the PFN Database.

Let's look at some of these state changes, beginning with a valid page. Figure 2 depicts the state changes I'll discuss. A page is in the valid state if it is currently part of a process working set. A valid page can be dirty or clean--a dirty page is one a process has written to, so that the page's content in physical memory might be different from its content in a backing file (either a paging file or a memory-mapped file). The MMU sets a bit in a page's PTE when a process writes to that page.

When the Balance Set Manager trims a valid page from a working set, the page moves to one of three lists. If the page is clean, it will move to the standby list. If the page is dirty, it almost always moves to the modified page list. In some cases, as when a file system accesses pages backed by files the file system creates to represent its metadata (such as the FAT on FAT file systems, or the MFT--Master File Table--on NTFS file systems), a dirty page moves to the modified no-write page list instead. NTFS takes advantage of this type of marking so that it can ensure that pages belonging to NTFS logging files write to disk before file metadata writes to disk.

A page on the modified page list moves to the standby page list after the Memory Manager's modified page writer writes the page's data to disk. However, while the modified page writer is writing its data to disk, the page makes a stop in the transition state (which has no list). A page on the modified no-write page list moves to the standby page list when the file system instructs the Memory Manager to move the page. This instruction comes after a log file is safely written to disk, for example. When the number of pages on the modified page list exceeds a threshold, or when free memory drops below a threshold (based on the amount of physical memory on the system and determined during the boot), one of two modified page writer threads wakes up to perform disk I/O that sends the data to a paging file or a data file.

A page on the standby page list moves to the free page list when the process that had the page in its working set either frees the virtual memory associated with the page, or the process terminates (which is another way of freeing the virtual memory). An interesting optimization based on the lists I've already discussed is possible. If a page moves from the working set of a process to the standby page, modified page, or modified no-write page lists, and if that process subsequently accesses the page before the Memory Manager assigns the page to a different process, the Memory Manager can place the page back in the process' working set. This operation is known as soft-faulting (or soft-page faulting) memory back to a process. Soft-faulting is a way the Memory Manager can tentatively remove memory from a process but give it back quickly if the process reaccesses the memory in a short amount of time.

Pages on the standby page list move to the zeroed page list after a special thread, called the zero-page thread, clears their content. The zero-page thread executes in the background at priority 0. It runs only if no other thread can run, and its job is to move pages from the free page list to the zeroed page list as it clears their content.

When a process accesses a page and the Memory Manager decides that the process' working set can expand, the Memory Manager checks the zeroed page list to find a physical page to assign to the process. If the Memory Manager finds a page in the zeroed page list, it can immediately use the page. If the zeroed page list is empty, the Memory Manager checks the free page list. If the free page list is also empty, the Memory Manager checks the standby page list. Finally, if the standby page list is empty, the Memory Manager's thread waits while the Balance Set Manager trims a page from a working set, or until a page shows up on the standby page, free page, or zeroed page lists (e.g., a page might be in transition, and after it writes to disk it appears on the standby list).

The necessity of zeroing a page before assigning it to the working set of a different process is a C2 security requirement. (For more information about NT's C2 security rating, see "Windows NT Security, Part 1," May 1998.) The operating system (OS) must reinitialize all OS resources (e.g., memory, disk space, objects) before reassigning them to prevent the creation of security holes in which one process can see another process' potentially sensitive data. In some cases, the process does not require zero-filled memory, as when the Memory Manager allocates a page to store data read from a memory-mapped file. In these cases, the Memory Manager checks the free list for an available page before it checks the zeroed list.

The final list is the bad page list. As its name implies, the bad page list is an off-limits holding area in which the Memory Manager, with the support of memory parity error detection, places pages it has detected as faulty.

The number of pages that are on the standby page, free page, and zeroed page lists defines the free memory that various memory-tracking tools (e.g., the Task Manager) report. The commit total is the amount of currently allocated memory that paging file space backs. If the Memory Manager pages the data in commit total memory out of physical memory, the Memory Manager stores that data in a paging file. The commit limit is the amount of commit total memory the Memory Manager can allocate without expanding the sizes of existing paging files.

Shared Memory and Prototype PTEs
PFN Database entries contain varying information, depending on which state corresponding pages are in. In most cases, a PFN Database entry contains a pointer to a PTE that references a page. However, if two or more processes share the same page, multiple PTEs reference the page: one PTE in the virtual address map of each process sharing the page. Instead of pointing the PFN Database entry at one of these PTEs, the Memory Manager points the PFN Database entry at a data structure the Memory Manager allocates, called a Prototype PTE (PPTE), as Figure 3 shows. I'll describe the way the Memory Manager uses PPTEs to manage shared and mapped memory.

I explained last month that to share memory, a process must create a Section Object. Section Objects hold information about the name of a file, its size, and what portions of it are mapped. Section Object creation defines the shareable data, and each additional process that wants to participate in the sharing must map a portion of the data into its address space. This mapping is known as mapping a view of a section, because processes might map only a portion of the data that a Section Object defines.

When a process allocates a Section Object, the Memory Manager allocates another data structure called a Segment. Segments contain storage to hold enough PPTEs to describe all the pages in the Section Object. Usually, when a page moves from a process' working set to the standby page, modified page, or modified no-write page lists, the Memory Manager marks the page's PTE invalid and sets a bit in the page to indicate that the page can be soft-page faulted, which means the PFN of the page stays in the PTE. PTEs that the Memory Manager marks as invalid for shared pages do not continue to store PFNs; rather, the Memory Manager updates them to point at the shared page's PPTE. This trick makes it easy for the Memory Manager to update the PFN of a shared page without manually updating the PTEs that refer to the page in the address spaces of all the processes that reference the page.

Consider an example in which two processes share a page. When the page's data is in memory, each process has a valid PTE that stores the PFN where the page's data resides in physical memory. If the Memory Manager removes the PTEs from the working sets of both processes and sends the page's data to a paging file, the PTEs are both invalid and contain pointers to the PPTE. When the Memory Manager brings the page's data back into physical memory, the Memory Manager updates the PPTE to reflect the page's new PFN. When one of the processes tries to access the page, it generates a page fault. Then the Memory Manager looks at the PTE, finds the new PFN in the PPTE the PTE points to, marks the PTE as valid, and updates its PFN. When the second process accesses the page, the Memory Manager updates that process' PTE similarly. Without this optimization, the Memory Manager needs to track down both PTEs (or more, if a greater number of processes shared the page) and update them when it brings the page back into memory--an expensive and potentially wasteful operation.

A reference count in the PFN Database tracks the number of processes that access a shared page, and the Memory Manager knows that when the count becomes zero, the page is not marked valid in any working set. At that point, the Memory Manager moves the page to the standby page or modified page lists.

Memory-Mapped Files
Memory-mapped files are a special form of shared memory. When a process maps a file into its address space, the Memory Manager creates several support data structures that aid the process' interaction with file systems. Figure 4 shows how these data structures are related. A File Object represents the file on disk, and the File Object is the target of all I/O the Memory Manager performs on the file. The Section Object points at its Segment, which contains the PPTEs for the Section Object. The Segment points to a control area that the File Object also points at. This control area is the nerve center for a mapped file, so that even if a process creates more than one Section Object for a file, there is still only one shared con-trol area.

When a process references a virtual address that the process indicates should be backed by a file, the Memory Manager examines the Virtual Address Descriptor (VAD) that describes the memory range, then locates the control area. The Memory Manager allocates a page of physical memory to bring the requested data into and updates the PTE for the virtual address map. Because the Memory Manager finds the File Object through the control area, the Memory Manager can initiate an I/O operation to the file system that owns the disk on which the file resides. The operation reads the page from the file on disk (the page is in the transition state while the I/O is in progress). After the operation reads the page, the Memory Manager marks the PTE as valid and lets the process continue its attempt to access the data, which is now present.

A caveat with regard to memory-mapped files exists: Files can map as data files or as images. Files map as images when the NT Process Manager loads them for execution. The same file can map as a data file and an image, and maintaining separate control areas for data files and images lets NT ensure the consistency of the different mappings that different processes make.

More on Memory
Last month, I claimed that memory management is one of the most complex tasks an OS faces. In this two-part series on memory management, I've provided only an overview of the policies and mechanisms NT implements to provide applications with memory resources appropriate to their needs and the needs of other concurrently running programs. If you want to learn more about the Memory Manager, I recommend Inside Windows NT, Second Edition, by David A. Solomon (Microsoft Press).

Next month, I'll cover a subsystem that's closely tied to the Memory Manager--the Cache Manager. I'll discuss how the Memory Manager sizes system working sets (including the file system cache) differently from process working sets.