NT's Executive, Kernel, and HAL

Last month I began a two-part primer on Windows NT architecture. This month I conclude with a description and discussion of the components that make up the NT Executive. I'll discuss the responsibilities of the Kernel and delve into one of the more mysterious elements of NT, the hardware abstraction layer (HAL).

The Executive
NT's Executive subsystems make up the meatiest layer in kernel mode, and they perform most of the functions traditionally associated with operating systems. Table 1 lists NT's Executive subsystems, and Figure 1, page 60, shows their position in NT's architecture. These subsystems have separate responsibilities and names, so you might think they are different processes. They are not. For example, when a program like Microsoft Word requests an operating-system service such as memory allocation, the flow of control proceeds from Word into kernel mode through NT's native system service interface. A system service handler for memory allocation then directly invokes the Virtual Memory Manager's allocation function. The requested allocation executes in the context of the process (Word) that requested it--there is no context switch to a different system process.

If you've seen the system process in NT's Performance Monitor (Perfmon), you might think that the Executive subsystems are different processes. However, the purpose of the system process in Perfmon is to own Executive threads (commonly called worker threads) that carry out work, usually of a background nature, for Executive subsystems. For example, the Cache Manager creates system process threads for lazy-write operations: Every few seconds the threads will flush dirty disk data from memory back to the disk. Because no user-mode application is associated with a system process, the user-mode portion of the system process' address map is not defined. And because the address map's user-mode portion does not change when a thread from the system process executes, the computer's address-mapping structures are not updated. This situation is different from a change from one application to another, in which case the user-mode portion of the address map would have to be changed from, say, Word's to Netscape's.

Just as NT doesn't assign Executive subsystems to different processes, NT doesn't place the Executive subsystems in different image files (an image file is an executable file). The ntoskrnl.exe file contains all NT Executive subsystems (except the Win32 subsystem, which is in win32k.sys) and the Kernel. NT loads the ntoskrnl.exe file during the system boot into the kernel-mode half of the virtual memory map.

Object Manager. The Object Manager, which I characterized in a previous column as probably the least known of NT's Executive subsystems (see "Inside NT's Object Manager," October 1997), is also one of the most important. An operating system's primary role is to manage a computer's physical and logical resources. Other Executive subsystems use the Object Manager to define and manage objects that represent resources. For example, through the Object Manager, the Process Manager defines a process object to track active processes. Table 2 lists the NT 4.0-defined objects and the kernel-mode subsystems that manage them.

The Object Manager performs object-management duties that include identification and reference counting. When an application opens a resource, the Object Manager either locates the associated object or creates a new object. Instead of returning an object pointer to the application that opened the resource, the Object Manager returns an opaque identifier called a handle. The handle's value is unique within the application that opened the resource, but it is not unique across the system: a different application can hold the same handle value for a different object. The application uses the handle to identify the resource in subsequent operations. When the application is finished with the object, the application closes the handle. The Object Manager uses reference counting to track how many system parts, including applications and Executive subsystems, are accessing an object that represents a resource. When the reference count goes to zero, the object is no longer in use representing the resource, and the Object Manager deletes the object (but not necessarily the resource).
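The handle and reference-count behavior can be sketched as a toy model in Python. Everything in it (the ToyObjectManager class, its method names, and the choice to number handles in steps of 4) is invented for illustration and says nothing about how NT actually lays out its handle tables.

```python
class ToyObject:
    """A named object with a reference count (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.refcount = 0


class ToyObjectManager:
    def __init__(self):
        self.objects = {}         # name -> ToyObject (system-wide)
        self.handle_tables = {}   # process name -> {handle: ToyObject}

    def open(self, process, name):
        # Locate the existing object or create a new one.
        obj = self.objects.get(name)
        if obj is None:
            obj = self.objects[name] = ToyObject(name)
        obj.refcount += 1
        table = self.handle_tables.setdefault(process, {})
        # Handle values are per process: take the next free slot.
        handle = 4
        while handle in table:
            handle += 4
        table[handle] = obj
        return handle

    def lookup(self, process, handle):
        # Translate a handle back to the object it represents.
        return self.handle_tables[process][handle]

    def close(self, process, handle):
        obj = self.handle_tables[process].pop(handle)
        obj.refcount -= 1
        if obj.refcount == 0:
            # Delete the object; the resource it represented may live on.
            del self.objects[obj.name]


om = ToyObjectManager()
h_word = om.open("Word", "report.doc")
h_note = om.open("Notepad", "report.doc")
```

In this sketch both processes receive the same handle value, yet each value is resolved through its own process's table, and the object disappears only after the second close.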

The Object Manager implements NT's namespace to provide object identification. All shareable resources in NT have names that are rooted in this namespace. For example, when a program opens a file, the Object Manager parses the file's name to locate the file-system driver for the disk that stores the file. Similarly, when an application opens a Registry key, the Object Manager determines from the Registry key's name that the Configuration Manager must be called.
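A rough way to picture the name parsing is a longest-prefix lookup that hands the rest of the name to whichever subsystem owns that part of the namespace. The sketch below is a simplification under that assumption; the PARSE_OWNERS table, the resolve function, and the two prefixes in it are hypothetical stand-ins chosen for the example.

```python
# Hypothetical prefixes mapped to the subsystem that finishes the parse.
PARSE_OWNERS = {
    "\\Device\\Harddisk0": "file-system driver",
    "\\Registry": "Configuration Manager",
}


def resolve(path):
    """Return (owner, remaining name) for the longest matching prefix."""
    components = path.strip("\\").split("\\")
    for end in range(len(components), 0, -1):
        prefix = "\\" + "\\".join(components[:end])
        if prefix in PARSE_OWNERS:
            remainder = "\\".join(components[end:])
            return PARSE_OWNERS[prefix], remainder
    raise LookupError("name not found: " + path)


owner, rest = resolve("\\Registry\\Machine\\Software")
```

Here the Object Manager's share of the work ends at the matching prefix, and the owner is left to interpret the remainder of the name.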

Most native system services that NT implements are resource related; thus, almost every system service invokes Object Manager functions. For example, services that open an existing resource call on the Object Manager to look up the resource name in the Object Manager namespace, ensure the caller has sufficient rights to open the resource, and allocate and return a handle to identify the open instance. Services that require a handle to a previously opened resource call the Object Manager to translate the handle to the object it represents.

The Object Manager calls other Executive subsystems when necessary. Every object type has functions that execute when NT performs particular operations on objects of that type. Thus, when the Object Manager creates a file object to represent an open file, the Object Manager invokes the I/O Manager's function for opening files. Similarly, the Object Manager creates an associated process object for an open process and invokes the Process Manager's function for opening processes.

Security Reference Monitor. The Security Reference Monitor is closely associated with the Object Manager. The Object Manager calls the Security Reference Monitor for an access check before letting an application open an object. The Object Manager also calls the Security Reference Monitor before it lets applications perform other operations on objects, such as reading from the object or writing to it.

The Security Reference Monitor implements a security model based on security identifiers (SIDs) and Discretionary Access Control Lists (DACLs). Every process in NT has an associated access token object that contains the SID identifying the user that owns the process and the SIDs of the groups the user belongs to. When a security check takes place, the SIDs in the access token of the process describe the user trying to complete an action on an object. Figure 2 gives an example of a DACL that does not let the process owner, Mark, read the object, although it lets the group the owner belongs to (Administrators) read from and delete the object.

NT's security model has a powerful capability that lets a process impersonate any user other than the user associated with the process. Server applications such as NT's built-in file server (SRV) rely heavily on impersonation. When a client on a different machine opens a file on the server, the server impersonates the client by temporarily adopting an access token that identifies the server as the remote client. NT creates the token on the server, but the token contains the client's SIDs. When the server opens the file, it invokes the Object Manager, which then calls the Security Reference Monitor to make the appropriate access check. The client can have more or less privilege than the server (the server might not be allowed to open the file), but impersonation lets the server temporarily identify as the client and thus hides the discrepancy.

DACLs specify the actions that particular SIDs can perform on an object. A DACL can contain any number of access control entries (ACEs), including no entries, that contain the information about actions SIDs can perform. Each ACE contains a SID, a flag specifying whether the ACE is of the deny type or allow type, and an operations mask (e.g., read, write, delete). Every object can have a DACL connected to it, as the example in Figure 2 shows. NT references the DACL when a user attempts to open the object.

Given the access token object and the DACL in Figure 2, the Security Reference Monitor would deny Mark read access to the object, even though it allows members of the Administrator group read access to the object. The Security Reference Monitor would deny Mark read access to the object because the deny ACE is in front of the allow ACE in the DACL.

When a process wants to open an object, it must indicate the access it desires (e.g., read, write, delete). The Object Manager calls the Security Reference Monitor for an access check, and the Security Reference Monitor takes the desired access and the SIDs from the process' access token and goes through the object's DACL until it finds matching information. The Security Reference Monitor then looks at the DACL's ACE type: If the ACE is an allow type, the process can open the object. If the ACE is a deny type, the process cannot access the object. Two special cases exist in DACL security. First, users can fully access an object that does not have a DACL. Second, users cannot access an object with an existing but empty DACL.
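The ordered walk through a DACL, including the two special cases, can be sketched in a few lines of Python. This is a simplified model, not the Security Reference Monitor's code: the bit values, the tuple layout of an ACE, and the access_check name are assumptions, and the sample DACL is rebuilt from the description of Figure 2 in the text (deny Mark read, then allow Administrators read and delete).

```python
READ, WRITE, DELETE = 1, 2, 4   # illustrative bit values, not NT's real masks


def access_check(token_sids, dacl, desired):
    """Return True if every access bit in `desired` is granted.

    dacl is None (the object has no DACL) or an ordered list of
    (ace_type, sid, mask) tuples, where ace_type is "allow" or "deny".
    """
    if dacl is None:
        return True               # no DACL: full access
    remaining = desired
    for ace_type, sid, mask in dacl:
        if sid not in token_sids:
            continue              # this ACE is for someone else
        if ace_type == "deny" and mask & remaining:
            return False          # a deny ACE matched first
        if ace_type == "allow":
            remaining &= ~mask
            if remaining == 0:
                return True       # everything requested is allowed
    return False                  # empty DACL or no matching allow ACE


mark_token = {"Mark", "Administrators"}
figure2_dacl = [
    ("deny", "Mark", READ),
    ("allow", "Administrators", READ | DELETE),
]
mark_can_read = access_check(mark_token, figure2_dacl, READ)
mark_can_delete = access_check(mark_token, figure2_dacl, DELETE)
```

Reversing the two entries in figure2_dacl would let Mark read the object, which is why the position of the deny ACE matters.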

When an object opens successfully, NT associates the access types granted to the calling process (and that match the access types specified during the open function) with the handle that NT returns to the calling process. When the calling process later performs an operation on the object, all the Security Reference Monitor must do is verify that the granted access types permit the operation--there is no need for the Security Reference Monitor to rescan the DACL.
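A minimal sketch of that shortcut follows, assuming the DACL scan has already been reduced to a single mask of permitted access types. ToyHandle, open_object, and check_operation are invented names, and the bit values are arbitrary.

```python
READ, WRITE, DELETE = 1, 2, 4     # illustrative bit values


class ToyHandle:
    def __init__(self, obj_name, granted):
        self.obj_name = obj_name
        self.granted = granted    # access mask fixed at open time


def open_object(obj_name, dacl_allows, desired):
    """Stand-in for the open-time check: dacl_allows is what the DACL permits."""
    if desired & ~dacl_allows:
        raise PermissionError("access denied")
    # Store only the access types that were both requested and granted.
    return ToyHandle(obj_name, desired)


def check_operation(handle, needed):
    # No DACL scan here: compare against the mask stored with the handle.
    return (handle.granted & needed) == needed


h = open_object("report.doc", READ | WRITE, READ)
```

Note that the handle in this example permits only reads, even though the DACL would also have allowed writes, because the caller asked only for read access when it opened the object.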

The Security Reference Monitor also implements System Access Control Lists (SACLs), which are similar to DACLs. SACLs tell the system to log specific actions when particular users perform those actions. Systems administrators typically use SACLs to monitor and record attempted security violations.

Virtual Memory Manager. The Virtual Memory Manager has two main duties: to create and manage address maps for processes and to control physical memory allocation. NT 4.0 implements a 32-bit (4GB) address space; however, applications can directly access only the first 2GB, as Figure 3, page 62, shows. This portion of the address space is the user-mode half of the address map, and it changes to reflect the currently executing program's address map (e.g., Netscape, Notepad, Word). The 2GB to 4GB portion of the address space is for the kernel-mode portions of NT, and it doesn't change. NT 4.0 Service Pack 3 (SP3) and NT Server, Enterprise Edition 4.0, let administrators move the boundary in the address space so that user-mode applications can access 3GB of the map and kernel-mode components can use only 1GB.
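The split is simple arithmetic on the address, as the following sketch shows. The classify function and its user_gb parameter are illustrative names, and the model ignores any small reserved regions near the boundary.

```python
GB = 1 << 30


def classify(address, user_gb=2):
    """Say which half of the 32-bit address map an address falls in.

    user_gb is 2 for the default split or 3 for the optional 3GB split.
    """
    if not 0 <= address < 4 * GB:
        raise ValueError("not a 32-bit address")
    return "user" if address < user_gb * GB else "kernel"


default_split = classify(0x80000000)
moved_boundary = classify(0x80000000, user_gb=3)
```

The same address, 0x80000000, belongs to kernel mode under the default layout and to user mode once the boundary moves to 3GB.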

The Virtual Memory Manager implements demand-paged virtual memory, which means it manages memory in individual segments, or pages. In x86 systems, a page is 4096 bytes; in Alpha systems, a page is 8192 bytes. The total memory applications require can exceed the computer's physical memory space. The Virtual Memory Manager stores the data that exceeds a computer's physical memory on the hard disk in paging files. The Virtual Memory Manager transfers data to physical memory from a paging file when an application requests the data.
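To make the page sizes concrete, here is the arithmetic that turns an address into a page number and an offset, and a byte count into a page count. The function names are made up for this example; only the two page sizes come from the text.

```python
PAGE_SIZE = {"x86": 4096, "Alpha": 8192}


def split_address(address, cpu="x86"):
    """Split a virtual address into (page number, offset within the page)."""
    return divmod(address, PAGE_SIZE[cpu])


def pages_needed(nbytes, cpu="x86"):
    """Number of whole pages needed to hold nbytes (rounding up)."""
    size = PAGE_SIZE[cpu]
    return (nbytes + size - 1) // size


x86_page, x86_offset = split_address(0x12345)
alpha_page, alpha_offset = split_address(0x12345, cpu="Alpha")
```

A 10,000-byte allocation therefore occupies three pages on x86 but only two on Alpha.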

The Virtual Memory Manager has advanced capabilities that implement file memory mapping, memory sharing, and copy-on-write page protection. NT uses file memory mapping to load executable images and DLLs efficiently. In memory mapping, the Virtual Memory Manager learns through the operating system that a portion of a process' address map is connected to a particular file. When the process touches these portions of its address map (e.g., when it tries to execute code), the Virtual Memory Manager automatically loads the data into physical memory.
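Applications can use the same mechanism for their own data files. The sketch below uses Python's standard mmap module as a user-mode analogy; it is not how NT's loader maps images, and the temporary file and its contents exist only for the example.

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"hello, mapped file")
    f.flush()
    # Map the whole file (length 0) read-only into this process.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Indexing the mapping touches memory; the operating system
        # brings the file data in on demand.
        first_word = bytes(view[0:5])
        size = len(view)
```

The program never issues an explicit read call for the mapped range; it simply touches the addresses and lets the memory manager supply the data.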

NT uses memory sharing to enhance physical memory use and to communicate between processes. For example, multiple instances of a program share a memory-mapped file image, and this sharing increases memory efficiency.

Copy-on-write is an optimization related to memory sharing in which several programs share common data that each program can modify individually. When one program writes to a copy-on-write page that it shares with another program, the program that makes the modification gets its own version of the copy-on-write page to scribble on. The other program then becomes the original page's sole owner. NT uses copy-on-write optimization when several applications share the writable portions of system DLLs.
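The following toy model walks through that sequence. The Page and AddressMap classes, the sharers counter, and the 16-byte page are all invented for illustration; a real memory manager does this work with page-table protection bits and page faults.

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.sharers = 0          # how many address maps point at this page


class AddressMap:
    """One program's view: virtual page number -> Page."""
    def __init__(self):
        self.pages = {}

    def map_shared(self, vpn, page):
        page.sharers += 1
        self.pages[vpn] = page

    def read(self, vpn, offset):
        return self.pages[vpn].data[offset]

    def write(self, vpn, offset, value):
        page = self.pages[vpn]
        if page.sharers > 1:
            # Copy-on-write: give the writer its own private copy.
            page.sharers -= 1
            page = Page(page.data)
            page.sharers = 1
            self.pages[vpn] = page
        page.data[offset] = value


original = Page(b"\x00" * 16)
prog_a, prog_b = AddressMap(), AddressMap()
prog_a.map_shared(0, original)
prog_b.map_shared(0, original)
prog_a.write(0, 3, 0xFF)          # prog_a gets a private copy
```

After the write, prog_b still sees the untouched original page and is its only remaining user, so a later write by prog_b needs no copy.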

The Virtual Memory Manager divides physical memory among several executing programs. It uses a function called working-set tuning to allocate additional memory to programs that require it and ensure that other executing programs have enough memory to keep running.

I/O Manager. The I/O Manager is responsible for integrating add-on device drivers with NT. Microsoft did not build support for various hardware devices into NT. Device drivers, which are dynamically loaded kernel-mode components, provide hardware support. A device driver controls a specific type of hardware device by translating the commands that NT and applications direct to the device and manipulating the hardware to carry out the commands. Microsoft supplies several device drivers for common hardware. If you purchase a nonstandard hardware item, the hardware vendor will provide a device driver for it.

The I/O Manager supports asynchronous, packet-based I/O. For example, when a program such as Lotus Notes reads from a file, the read-file system service, NtReadFile, allocates an I/O request packet (IRP) that describes everything a device driver needs to know to complete the program's request. The IRP information includes the location of the buffer into which the program must read the requested data, a pointer to the file object that represents the open file, the offset into the file where the data resides, and the amount of data the program must read. The I/O Manager takes the IRP and passes it to the device driver--in this example, a file system driver responsible for the target file. A file object represents an open file in this example, but a file object can represent a keyboard, a mouse, or an open network connection.

NT's developers designed IRPs to contain everything pertinent to an application's request to make implementing asynchronous packet-based I/O easier. For example, after the I/O Manager gives an application's request (and an IRP) to a device driver, the I/O Manager returns control to the application. The application can continue performing useful work while the device driver is transferring data and can check the device driver after the data transfer. Because asynchrony (or the overlapping of control between the application and the device driver) is sometimes difficult to program to, standard Win32 APIs hide the asynchrony. For example, when an application calls the Win32 function to read from a file, the application does not resume execution until after the function reads the data. Most advanced applications use Win32 APIs that expose the I/O Manager's asynchrony, because doing so can improve performance.
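The overlap between application and driver can be imitated in Python with a worker thread standing in for the device driver. This is an analogy only, not Win32 overlapped I/O; driver_read and the disk_ready event are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

disk_ready = threading.Event()


def driver_read(length):
    disk_ready.wait()             # stand-in for a slow data transfer
    return b"x" * length


with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(driver_read, 8)    # request issued, control returns
    still_pending = not pending.done()       # the transfer is not finished
    other_work = sum(range(10))              # application keeps working
    disk_ready.set()                         # the "device" finishes
    data = pending.result()                  # check back for the data
```

A synchronous read would amount to calling pending.result() immediately after submit, which blocks the caller until the transfer completes.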

The I/O Manager supports 64-bit file offsets and layered device drivers. Using 64-bit offsets lets NT's file systems address extremely large files and lets disk device drivers address extremely large disks. Layering lets device drivers divide their labor. As the example in Figure 4 shows, the NTFS driver is layered above the fault-tolerant disk driver, which sits above a standard disk driver. As the I/O Manager processes requests, IRPs move down the layers, and the results pass back up from the bottom as each driver finishes its work.
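The path of a request through a driver stack can be sketched as follows. ToyIrp holds a few of the fields listed earlier under invented names, ToyDriver is a hypothetical stand-in for a driver, and the bottom driver only pretends to read the disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIrp:
    """A few of the fields an IRP carries; the field names are invented."""
    file_object: str
    offset: int
    length: int
    buffer: bytearray = field(default_factory=bytearray)
    trace: list = field(default_factory=list)   # records the path taken


class ToyDriver:
    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower        # next driver down the stack, if any

    def dispatch(self, irp):
        irp.trace.append("down:" + self.name)
        if self.lower is not None:
            self.lower.dispatch(irp)                  # pass the IRP down
        else:
            irp.buffer.extend(b"\x00" * irp.length)   # pretend to read disk
        irp.trace.append("up:" + self.name)           # results flow back up


disk = ToyDriver("disk")
ftdisk = ToyDriver("fault-tolerant disk", lower=disk)
ntfs = ToyDriver("NTFS", lower=ftdisk)

irp = ToyIrp(file_object="report.doc", offset=0, length=512)
ntfs.dispatch(irp)
```

The trace recorded in the IRP shows the request descending from NTFS to the disk driver and the completion retracing the same layers in reverse.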

Cache Manager. The Cache Manager works closely with the Virtual Memory Manager and file system drivers. The Cache Manager maintains NT's global (shared by all file systems) file system cache. The working-set tuner assigns physical memory to the file system cache. The NT cache is file oriented rather than disk-block oriented, as the Windows 95 cache is. When the working-set tuner takes memory containing modified file data away from the Cache Manager, the I/O Manager invokes file systems that manage the moved files to write their data back to the disk.

Local Procedure Call Facility. NT's Local Procedure Call (LPC) Facility optimizes communications for applications, including operating system environments. The LPC function is based on two types of port object: connection ports and communication ports. A server creates a connection port, which a client connects to. After the client establishes that connection, the server creates a communication port, which the server and client transmit data through.

Three kinds of LPC exist: data copying, shared memory, and shared memory with event pairs (Quick-LPC). NT uses data copying for small messages (less than 256 bytes). One end of the communications link (client or server) copies a message to a port, and the other end of the link copies the message out of the port.

NT uses shared memory for messages larger than 256 bytes. In shared-memory LPC, the connected client and server share a region of memory. When one end of the communications link wants to send a message larger than 256 bytes to the other, it sends through the communications port a short message whose only function is to point to the location of the primary message in the shared memory. Shared memory avoids a copy operation but requires dedicated shared memory.
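The choice between the first two LPC methods can be modeled with a toy port. ToyPort and its methods are invented for this sketch, and treating a message of exactly 256 bytes as large is an assumption, because the text leaves that boundary case open.

```python
SMALL_MESSAGE_LIMIT = 256   # bytes, the threshold given in the text


class ToyPort:
    """Toy communication port plus a region of shared memory."""
    def __init__(self):
        self.queue = []               # messages copied into the port
        self.shared = bytearray()     # region both ends can see

    def send(self, payload):
        if len(payload) < SMALL_MESSAGE_LIMIT:
            self.queue.append(("data", bytes(payload)))    # copy in
        else:
            start = len(self.shared)
            self.shared.extend(payload)
            # Short message whose only job is to point at the real one.
            self.queue.append(("pointer", (start, len(payload))))

    def receive(self):
        kind, body = self.queue.pop(0)
        if kind == "data":
            return body                                    # copy out
        start, length = body
        # A real receiver would use the shared region in place.
        return bytes(self.shared[start:start + length])


port = ToyPort()
port.send(b"hi")
small_kind = port.queue[0][0]
small = port.receive()
big_payload = bytes(1000)
port.send(big_payload)
big_kind = port.queue[0][0]
big = port.receive()
```

Only the short pointer message travels through the port in the second case; the 1000-byte payload stays in the shared region.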

Win32 under NT 3.51 uses Quick-LPC. With Quick-LPC, Win32 doesn't send messages through ports. Instead, one end of the communications link uses an event-pair synchronization object to signal the other end of the communications link that it has placed a message in shared memory. Using event-pair synchronization objects avoids the overhead of communicating through a port but, as a trade-off, consumes more resources than the other LPC methods.

Configuration Manager. Microsoft rarely discusses the Configuration Manager in its NT architecture documentation, but Configuration Manager is an important Executive subsystem. The Configuration Manager manages the Registry, and Win32 Registry API functions rely on NT native APIs the Configuration Manager implements. The Configuration Manager also exports functions to the I/O Manager, and the I/O Manager uses these functions to assign physical resources to device drivers. The Configuration Manager stores this assignment information in the Registry to detect and help prevent resource conflicts.

Process Manager. The Process Manager works with the Kernel to define process and thread objects. The Process Manager wraps the Kernel's process object and adds to it a process identifier (PID), the access token, an address map, and a handle table. The Process Manager performs a similar operation on the Kernel's thread object, adding to it a thread identifier (TID) and statistics. These statistics include process and thread start and exit times and various virtual-memory counters.
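One way to picture the wrapping is as an outer record that contains the Kernel's record and adds the Executive's fields around it. The two classes below are a loose illustration; the field names and the sample values (priority 8, PID 120) are assumptions, not NT's structure layout.

```python
from dataclasses import dataclass, field


@dataclass
class KernelProcess:
    """Scheduling-related state kept by the Kernel (fields illustrative)."""
    base_priority: int = 8
    context_switches: int = 0


@dataclass
class ExecutiveProcess:
    """Process Manager object: wraps the Kernel object and adds to it."""
    kernel: KernelProcess
    pid: int
    access_token: frozenset
    address_map: dict = field(default_factory=dict)
    handle_table: dict = field(default_factory=dict)


word = ExecutiveProcess(KernelProcess(), pid=120,
                        access_token=frozenset({"Mark", "Administrators"}))
word.kernel.context_switches += 1   # the scheduler updates the inner object
```

The scheduler in this picture touches only the inner Kernel object, while the handle table, token, and PID belong to the Process Manager's outer layer.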

The Process Manager exports an interface that lets other Executive subsystems and user-mode applications manipulate processes and threads. For example, applications can access Process Manager functions to create processes, delete them, and modify their characteristics (such as their priority). You can access many Process Manager functions in user mode through system services.

Win32. Win32 consists of the messaging and drawing functions of the Win32 API. As I discussed in Part 1, in NT 3.51 these modules resided in user mode as part of the Win32 environment subsystem.

Plug-and-Play Manager and Power Manager. Two new Executive subsystems will debut in NT 5.0: the Plug-and-Play Manager and the Power Manager. The Plug-and-Play Manager notifies device drivers and the I/O Manager when hardware devices come online or are removed. The Power Manager maintains central control of the computer's power level, letting it shift into low power modes when possible. Both new subsystems work primarily with the I/O Manager and device drivers.

The Kernel
NT's Kernel operates more closely with hardware than the Executive does, and it contains CPU-specific code. NT's thread scheduler, called the dispatcher by NT's developers, resides in the Kernel. The dispatcher implements 32 priority levels, 0-31. The dispatcher reserves priority level 0 for a system thread that zeros memory pages as a background task. Priority levels 1 through 15 are variable (with some fixed priority levels) and are where programs execute; priority levels 16 through 31 are fixed priority levels that only administrators can access.
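The three ranges can be written down as a small lookup. The priority_class function and the labels it returns are illustrative names for the ranges described above.

```python
def priority_class(level):
    """Classify one of the dispatcher's 32 priority levels (0-31)."""
    if not 0 <= level <= 31:
        raise ValueError("priority must be between 0 and 31")
    if level == 0:
        return "zero-page thread"
    if level <= 15:
        return "variable"
    return "fixed"


classes = [priority_class(p) for p in (0, 1, 15, 16, 31)]
```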

The NT dispatcher is a preemptive scheduler. The CPU's time is divided into slices called quantums. When a thread runs to the end of its quantum and doesn't yield the CPU, the dispatcher steps in, preempts it, and schedules another thread of equal priority that is waiting to run (see "Inside the Windows NT Scheduler, Part 1," July 1997).
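A stripped-down model of this behavior keeps one ready queue per priority level and rotates threads of equal priority at the end of each quantum. ToyDispatcher is a hypothetical sketch that leaves out priority adjustments, waits, and multiple CPUs.

```python
from collections import deque


class ToyDispatcher:
    def __init__(self):
        # One ready queue per priority level, 0-31.
        self.ready = [deque() for _ in range(32)]

    def make_ready(self, thread, priority):
        self.ready[priority].append((thread, priority))

    def pick_next(self):
        """Remove and return the first thread at the highest ready priority."""
        for level in range(31, -1, -1):
            if self.ready[level]:
                return self.ready[level].popleft()
        return None

    def run(self, quantums):
        """Run for a number of quantums; return the order threads ran in."""
        order = []
        for _ in range(quantums):
            current = self.pick_next()
            if current is None:
                break
            order.append(current[0])
            # Quantum expired: go to the back of the queue for this level.
            self.make_ready(*current)
        return order


d = ToyDispatcher()
d.make_ready("A", 8)
d.make_ready("B", 8)
d.make_ready("C", 4)
schedule = d.run(4)
```

Threads A and B alternate because they share a priority level, while C never runs in this simplified model as long as a higher-priority thread is ready.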

NT implements most synchronization primitives in the Kernel. NT has a rich set of synchronization types, including mutexes, semaphores, events, and spin locks. The Kernel implements and manages its own object types, and Kernel objects represent NT's synchronization primitives. In most cases, NT wraps Kernel objects with Executive objects so that applications can access them from user mode through the native API. As I described previously, the Process Manager wraps its process and thread objects around the Kernel's objects. NT stores all priority-related information and statistical information related to scheduling (e.g., context switches, user time) in the Kernel's objects.

The Kernel manages interrupt vectors (for more information on interrupt vectors, see my column "Inside NT's Interrupt Handling," November 1997). NT defines and implements IRQ levels in the Kernel.

The Hardware Abstraction Layer
The HAL is NT's interface to the raw CPU. Microsoft wanted to make NT portable across different processors. To make this portability feasible, NT's developers isolated as much CPU-specific code as possible into a separate, dynamically replaceable module, the HAL. The HAL exports a common processor model that masks the differences in various processor chips from NT. Device drivers use this common processor model rather than a particular CPU type. Even different motherboards in the same processor family can differ significantly, but hardware vendors can ensure that NT will work with their boards by writing a custom HAL to work with NT.

A common difference between motherboards in the same processor family is that some are for multiprocessor systems and others are for uniprocessor systems. A multiprocessor HAL is different from a uniprocessor HAL. The multiprocessor-uniprocessor issue brings me to a little-known fact. Microsoft provides three versions of each NT release: the uniprocessor version, the multiprocessor version, and the debug version (or the checked-build version). In addition to having different HALs, uniprocessor and multiprocessor versions of NT have different ntoskrnl.exe images. The uniprocessor version of ntoskrnl.exe does not include code that is necessary for correct execution on multiprocessors. Microsoft provides the debug version of NT for device-driver developers. The debug version contains additional sanity checks and symbolic information not contained in the uniprocessor and multiprocessor versions of NT. Microsoft offers only one debug version of NT, which contains multiprocessor code and will work on multiprocessor and uniprocessor machines.

In a Nutshell
There you have it: NT's big picture in two articles. For more in-depth information, refer to previous columns in which I focused on subsystems or portions of subsystems (you can find these columns easily by searching the magazine archives on Windows NT Magazine's Web site, at http://www.winntmag.com). Another good source of information on NT subsystems is Helen Custer's Inside Windows NT (Microsoft Press, 1993).