Learn how NT keeps tabs on memory
Last month, I began this two-part series about memory management in Windows NT by introducing the concept of virtual memory. I reviewed the way x86 and Alpha processors use a two-level virtual address-to-physical address translation scheme. I discussed paging and introduced two powerful features of the Memory Manager: memory-mapped files and copy-on-write memory.
This month, I'll go into more detail about the internal data structures the Memory Manager uses to track the state of memory. I'll discuss working sets and the Page Frame Number (PFN) Database. I'll wrap up with a tour inside the additional data structures the Memory Manager uses to track memory shared by two or more applications, and I'll discuss Section Objects, the data structures the PFN Database uses to implement memory-mapped files.
Working Sets
The biggest effect the Memory Manager has on individual applications'
performance and on the system is in its allocation of physical memory to each active process. The amount of memory the Memory Manager assigns to a process is called the working set. Every process has a working set. A special working set, called the system working set, is physical memory that belongs to parts of the NT Executive, device drivers, and the Cache Manager. If a process' working set is too small, the process will incur a high number of page faults as it accesses page table entries that describe data not present in memory but located either in the process' executable image on disk or in a paging file. For each such access, the Memory Manager must intervene and perform disk I/O to retrieve the data. If a process' working set is too large, the process will not incur page faults, but its physical memory might be holding data that the process will not access for some time, and that other processes might require. Thus, the Memory Manager must try to achieve a balance for all processes according to their memory usage patterns.
The choice NT makes for two memory-management policies--the page fetch policy and the page replacement policy--affects the way NT manages working sets. The page fetch policy is the method NT uses to bring in a process' data. NT uses the most common page fetch policy, demand paging. In demand paging, a process' data is paged-in as a process accesses it. A different page fetch policy, prefetching, requires the Memory Manager to aggressively bring in a process' data before the process asks for it. Prefetching can improve the performance of an application at the expense of other applications, because the Memory Manager's prediction of what data a process will ask for is likely to be imperfect and can waste physical memory in storing unused data. NT therefore implements a slight variation on demand paging: clustered demand paging. Instead of bringing into a process' working set only the page that a process accesses, the Memory Manager will attempt to bring in pages that surround the requested page. The idea behind clustered demand paging is that if a process accesses a page, it will likely also access the memory just preceding or following that page. The pages the Memory Manager thus brings in are known as a cluster. In NT, clusters can vary between 0 pages and 7 pages, depending on the amount of physical memory present on the system and whether the system is accessing code or data.
The page replacement policy affects the way the Memory Manager decides
which page to remove from a working set. The Memory Manager must remove pages
whenever physical memory is full of in-use pages and space is necessary to
accommodate a page of data a process is accessing. Two characterizations for
replacement policies are global and local. In a global
replacement policy, the Memory Manager considers all pages of physical memory as
replacement candidates, without regard to which working set the pages belong to.
In a local replacement policy, the Memory Manager considers as replacement
candidates only pages in the working set of the process that is accessing the
page to be brought in. The disadvantage of global policy is that a rogue process
can, by accessing large amounts of memory very quickly, adversely affect all
other processes in the system by forcing their data out of memory. As with its
page fetch policy, NT implements a variation of the replacement policy. This
variation is local, because it first considers the pages in the working set of
the process that is accessing the data. But the variation is also global in that
it removes pages from the working sets of other processes if their memory
requirements are low.
Replacement policies are further characterized by the method with which
they choose a page to replace out of the set of local or global candidates. One
policy, first in, first out (FIFO), removes pages in the order in which
they were added to the working set (i.e., the first page added is the first page
removed). An algorithm that has a more positive effect on the performance of
applications is least recently used (LRU). The LRU algorithm requires
the aid of the MMU to give the Memory Manager an idea of how recently a process
accessed a particular page. LRU replaces first those pages that processes have
not accessed for the longest period of time. On uniprocessors, NT uses the clock
algorithm, a simple form of LRU replacement that is based on the limited access
tracking x86 and Alpha MMUs provide. Whenever a process references a page, these
MMUs set the Accessed flag bit in the page table entries (PTEs) of the page. If
a page's Accessed flag is not set, the Memory Manager knows that no processes
have accessed the page since the last time the Memory Manager examined the
page's PTE.
The working set of every process, including the system working set, has a
minimum size, a current size, and a maximum size. On a typical NT system with
64MB of physical memory, the default minimum size of a working set is 200KB, and
the default maximum size is 1.4MB. Processes with the PROCESS_SET_QUOTA
privilege can override these values using a Win32 API. When a process starts, it
generates a large number of page faults as it references data in its virtual
memory map (most often in its executable image) that it must bring into physical
memory. The Memory Manager lets a process' current working set size grow as
needed to its working set maximum. When a working set reaches its maximum size,
the Memory Manager will allow additional pages into the working set only if
plenty of free (unused) pages are available. If free pages are not available,
the Memory Manager uses the replacement policies I just described to determine
which page in the working set to replace. The pages in the working set are
linked in a list that the Memory Manager scans. On a uniprocessor, if the Memory
Manager finds a page with its Accessed flag set, the Memory Manager clears the
flag and proceeds to the following pages, selecting for replacement the next
page it finds with a cleared Accessed flag. Thus, the Memory Manager avoids
pages that the process has accessed since the previous scan, because the Memory
Manager assumes that the process will access those pages again. If the Memory
Manager finds no pages to select for replacement, it restarts the scan and
almost certainly finds a page whose Accessed flag it cleared on the previous
pass and the process has not accessed since the first scan.
On a multiprocessor system, the Memory Manager does not clear Accessed
flags during scans, because any modification to a PTE on a multiprocessor system
invalidates Translation Look-aside Buffer (TLB) entries (which I described last
month) on all processors. If the Memory Manager invalidates an address' cached
translation by clearing its Accessed flag, the address must again undergo the
slow three-step virtual memory translation process the next time a process
references it. The result is an effectively random page-replacement algorithm on
multiprocessors. Microsoft is developing more effective replacement algorithms
for the multiprocessor version of NT 5.0.
The Balance Set Manager
The Memory Manager adjusts the sizes of working sets once every second, in response to page-in operations or when free memory drops below a certain
threshold. The Memory Manager also examines all working sets to determine
whether the Balance Set Manager thread needs to trim them. If free memory is
plentiful, the Balance Set Manager removes pages from only the working sets of processes whose current size is above minimum that have not recently incurred many page faults. However, if the Memory Manager awakens the Balance Set Manager thread because free memory is scarce, the Balance Set Manager can trim pages from any process until it creates an adequate number of free pages--even to the point of dropping working sets below their minimum size.
In addition to trimming working sets, once every 4 seconds the Balance Set
Manager can swap out pages that belong to the call stacks of threads that have
been sleeping for more than 7 seconds (3 seconds on systems with less than 19MB
of memory). The call stack is a special type of memory that records the function
invocations a thread makes. If a thread has been asleep (e.g., while waiting for
a user to enter a keystroke) for a long time, the Memory Manager assumes the
thread will sleep for a longer time. If the Balance Set Manager swaps out the
stacks of all the threads in a process, it trims the process' working set to
nothing, and the system assumes that the process is inactive. Swapping out the
stacks of all the threads in a process minimizes the footprints of processes
that don't require memory because they are waiting for infrequent events to
occur before they become active.
The Page Frame Database
Last month, I described how NT organizes physical memory as a list of page
frames, each of which is one page in size (4096 bytes on x86 processors, 8192 bytes on Alphas). The Memory Manager mirrors this list of page frames with its own list--the PFN Database. Each entry in the PFN Database corresponds to a physical page of memory (e.g., entry 5 in the PFN Database corresponds to page frame 5 in physical memory). Entries in the PFN Database record information about the state of data stored in corresponding physical pages.
A physical page can exist in one of eight states, which Table 1, page 64,
describes. Figure 1 illustrates how the Memory Manager adds to a list all the
PFN Database entries that belong to pages in the same state (except pages in the
valid or transition states). For example, all modified pages go on the modified
page list. A given page undergoes several status changes during its lifetime;
therefore, its PFN entry moves among different lists. Figure 1 shows how working
sets are associated with valid pages in the PFN Database.
Let's look at some of these state changes, beginning with a valid page.
Figure 2 depicts the state changes I'll discuss. A page is in the valid state if
it is currently part of a process working set. A valid page can be dirty
or clean--a dirty page is one a process has written to, so that the
page's content in physical memory might be different from its content in a
backing file (either a paging file or a memory-mapped file). The MMU sets a bit
in a page's PTE when a process writes to that page.
When the Balance Set Manager trims a valid page from a working set, the
page moves to one of three lists. If the page is clean, it will move to the
standby list. If the page is dirty, it almost always moves to the modified page
list. In some cases, as when a file system accesses pages backed by files the
file system creates to represent its metadata (such as the FAT on FAT file
systems, or the MFT--Master File Table--on NTFS file systems), a dirty page
moves to the modified no-write page list instead. NTFS takes advantage of this
type of marking so that it can ensure that pages belonging to NTFS logging files
write to disk before file metadata writes to disk.
A page on the modified page list moves to the standby page list after the
Memory Manager's modified page writer writes the page's data to disk. However,
while the modified page writer is writing its data to disk, the page makes a
stop in the transition state (which has no list). A page on the modified
no-write page list moves to the standby page list when the file system instructs
the Memory Manager to move the page. This instruction comes after a log file is safely written to disk, for example. When the number of pages on the modified page list exceeds a threshold, or when free memory drops below a
threshold (based on the amount of physical memory on the system and determined during the boot), one of two modified page writer threads wakes up to perform disk I/O that sends the data to a paging file or a data file.
A page on the standby page list moves to the free page list when the
process that had the page in its working set either frees the virtual memory
associated with the page, or the process terminates (which is another way of
freeing the virtual memory). An interesting optimization based on the lists I've
already discussed is possible. If a page moves from the working set of a process
to the standby page, modified page, or modified no-write page lists, and if that
process subsequently accesses the page before the Memory Manager assigns the
page to a different process, the Memory Manager can place the page back in the
process' working set. This operation is known as soft-faulting (or soft-page
faulting) memory back to a process. Soft-faulting is a way the Memory
Manager can tentatively remove memory from a process but give it back quickly if
the process reaccesses the memory in a short amount of time.
Pages on the standby page list move to the zeroed page list after a special
thread, called the zero-page thread, clears their content. The zero-page thread
executes in the background at priority 0. It runs only if no other thread can
run, and its job is to move pages from the free page list to the zeroed page
list as it clears their content.
When a process accesses a page and the Memory Manager decides that the
process' working set can expand, the Memory Manager checks the zeroed page list
to find a physical page to assign to the process. If the Memory Manager finds a
page in the zeroed page list, it can immediately use the page. If the zeroed
page list is empty, the Memory Manager checks the free page list. If the free
page list is also empty, the Memory Manager checks the standby page list.
Finally, if the standby page list is empty, the Memory Manager's thread waits
while the Balance Set Manager trims a page from a working set, or until a page
shows up on the standby page, free page, or zeroed page lists (e.g., a page
might be in transition, and after it writes to disk it appears on the standby
list).
The necessity of zeroing a page before assigning it to the working set of a
different process is a C2 security requirement. (For more information about NT's
C2 security rating, see "Windows NT Security, Part 1," May 1998.) The
operating system (OS) must reinitialize all OS resources (e.g., memory, disk
space, objects) before reassigning them to prevent the creation of security
holes in which one process can see another process' potentially sensitive data.
In some cases, the process does not require zero-filled memory, as when the
Memory Manager allocates a page to store data read from a memory-mapped file. In
these cases, the Memory Manager checks the free list for an available page
before it checks the zeroed list.
The final list is the bad page list. As its name implies, the bad page list
is an off-limits holding area in which the Memory Manager, with the support of
memory parity error detection, places pages it has detected as faulty.
The number of pages that are on the standby page, free page, and zeroed
page lists defines the free memory that various memory-tracking tools (e.g., the
Task Manager) report. The commit total is the amount of currently
allocated memory that paging file space backs. If the Memory Manager pages the
data in commit total memory out of physical memory, the Memory Manager stores
that data in a paging file. The commit limit is the amount of commit
total memory the Memory Manager can allocate without expanding the sizes of
existing paging files.
Shared Memory and Prototype PTEs
PFN Database entries contain varying information, depending on which state
corresponding pages are in. In most cases, a PFN Database entry contains a
pointer to a PTE that references a page. However, if two or more processes share
the same page, multiple PTEs reference the page: one PTE in the virtual address
map of each process sharing the page. Instead of pointing the PFN Database entry
at one of these PTEs, the Memory Manager points the PFN Database entry at a data
structure the Memory Manager allocates, called a Prototype PTE (PPTE), as
Figure 3 shows. I'll describe the way the Memory Manager uses PPTEs to manage shared and mapped memory.
I explained last month that to share memory, a process must create a
Section Object. Section Objects hold information about the name of a file, its size, and what portions of it are mapped. Section Object creation defines the shareable data, and each additional process that wants to participate in the sharing must map a portion of the data into its address space. This mapping is known as mapping a view of a section, because processes might map only a portion of the data that a Section Object defines.
When a process allocates a Section Object, the Memory Manager allocates
another data structure called a Segment. Segments contain storage to hold enough PPTEs to describe all the pages in the Section Object. Usually, when a page moves from a process' working set to the standby page, modified page, or modified no-write page lists, the Memory Manager marks the page's PTE invalid and sets a bit in the page to indicate that the page can be soft-page faulted, which means the PFN of the page stays in the PTE. PTEs that the Memory Manager marks as invalid for shared pages do not continue to store PFNs; rather, the Memory Manager updates them to point at the shared page's PPTE. This trick makes it easy for the Memory Manager to update the PFN of a shared page without
manually updating the PTEs that refer to the page in the address spaces of all
the processes that reference the page.
Consider an example in which two processes share a page. When the page's
data is in memory, each process has a valid PTE that stores the PFN where the
page's data resides in physical memory. If the Memory Manager removes the PTEs
from the working sets of both processes and sends the page's data to a paging
file, the PTEs are both invalid and contain pointers to the PPTE. When the
Memory Manager brings the page's data back into physical memory, the Memory
Manager updates the PPTE to reflect the page's new PFN. When one of the
processes tries to access the page, it generates a page fault. Then the Memory
Manager looks at the PTE, finds the new PFN in the PPTE the PTE points to, marks
the PTE as valid, and updates its PFN. When the second process accesses the
page, the Memory Manager updates that process' PTE similarly. Without this
optimization, the Memory Manager needs to track down both PTEs (or more, if a
greater number of processes shared the page) and update them when it brings the
page back into memory--an expensive and potentially wasteful operation.
A reference count in the PFN Database tracks the number of processes that
access a shared page, and the Memory Manager knows that when the count becomes
zero, the page is not marked valid in any working set. At that point, the Memory
Manager moves the page to the standby page or modified page lists.
Memory-Mapped Files
Memory-mapped files are a special form of shared memory. When a process maps a file into its address space, the Memory Manager creates several support data structures that aid the process' interaction with file systems. Figure 4 shows how these data structures are related. A File Object represents the file on disk, and the File Object is the target of all I/O the Memory Manager performs on the file. The Section Object points at its Segment, which contains the PPTEs for the Section Object. The Segment points to a control area that the File Object also points at. This control area is the nerve center for a mapped file, so that even if a process creates more than one Section Object for a file, there is still only one shared con-trol area.
When a process references a virtual address that the process indicates
should be backed by a file, the Memory Manager examines the Virtual Address
Descriptor (VAD) that describes the memory range, then locates the control area.
The Memory Manager allocates a page of physical memory to bring the requested
data into and updates the PTE for the virtual address map. Because the Memory
Manager finds the File Object through the control area, the Memory Manager can
initiate an I/O operation to the file system that owns the disk on which the
file resides. The operation reads the page from the file on disk (the page is in
the transition state while the I/O is in progress). After the operation reads
the page, the Memory Manager marks the PTE as valid and lets the process
continue its attempt to access the data, which is now present.
A caveat with regard to memory-mapped files exists: Files can map as data
files or as images. Files map as images when the NT Process Manager loads them
for execution. The same file can map as a data file and an image, and
maintaining separate control areas for data files and images lets NT ensure the consistency of the different mappings that different processes make.
More on Memory
Last month, I claimed that memory management is one of the most complex
tasks an OS faces. In this two-part series on memory management, I've provided only an overview of the policies and mechanisms NT implements to provide applications with memory resources appropriate to their needs and the needs of other concurrently running programs. If you want to learn more about the Memory Manager, I recommend Inside Windows NT, Second Edition, by David A. Solomon (Microsoft Press).
Next month, I'll cover a subsystem that's closely tied to the Memory
Manager--the Cache Manager. I'll discuss how the Memory Manager sizes system
working sets (including the file system cache) differently from process working sets.
I am very interested in articles. Could you give me next parts.
Thank you very much
Nguyen Xuan Khiem
The articles I've read on NT's Memory Management served as an exceptional source for my Operating Systems assignment. Well done Mark Russinovich!
good article, but how can I repair a PFN_LIST_CORRUPT error message? Thanks in advance for investigating, best regards.
Good Article!! I helped me lot to understand wim memory management with best regards...