Learn how Windows 2000 scales on multiple processors

I began this two-part series last month by highlighting the scalability enhancements that Microsoft has introduced in Windows 2000 (Win2K) for memory-intensive applications. These enhancements let applications directly address as much as 64GB of physical memory and introduce optimizations that increase the amount of certain types of memory (e.g., kernel nonpaged memory) and the file system cache's virtual size.

This month, I discuss Win2K multiprocessor scalability enhancements. Saying that an OS can use a certain number of processors is different than saying that the OS scales well. Scaling on multiprocessors requires that vendors develop all the levels of software that execute as part of a server application—including the OS, middleware, and layered services—with scaling in mind. If any part of the software in a server application isn't developed for scalability, the server application won't scale. Microsoft has little control over third-party middleware and applications, but the company has made the goal of OS scalability a top priority in Win2K. I conclude this series with a look at the changes Microsoft has made that enable applications to use processing cycles more efficiently and that help the kernel scale more effectively.

Multiprocessor Support
Before diving into the SMP enhancements Win2K implements, I want to describe the theoretical and licensing SMP limitations Microsoft has placed on Win2K's various versions. So that you can see whether changes from Windows NT 4.0 exist, I'll first review NT 4.0's capabilities. NT 4.0 comes in three basic flavors: NT Workstation; NT Server; and NT Server, Enterprise Edition (NTS/E). All three versions rely on the same NT kernel, but the kernel configures itself differently for each NT type. Theoretically, the NT 4.0 kernel can support systems with as many as 32 processors. The 32-processor limitation comes from the fact that the kernel uses 32-bit values (in which each bit in a value represents a processor) to represent processors in several places internally. For example, the idle mask value identifies which processors aren't performing useful work. NT sets each idle processor's bit in the mask to 1 and sets each active processor's bit to 0. Because NT 4.0 is a 32-bit OS, this use of 32-bit values comes naturally.

Although NT 4.0 internally supports up to 32 processors, licensing limitations typically prevent NT from using that many processors. The HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Session Manager\ LicensedProcessors Registry value designates the maximum number of processors that the kernel can use. On NT Workstation 4.0, the LicensedProcessors value is 2; on NT Server 4.0, the value is 4; and on NTS/E, the value is 8. Hardware vendors that sell NT systems with more than eight processors bundle OEM versions of NT that Microsoft licenses for the appropriate number of processors.

Regarding theoretical and licensing SMP limitations, Win2K is unchanged from NT 4.0. Win2K continues to use 32-bit values internally to identify processors, and Microsoft has licensed Win2K Professional (Win2K Pro), the Win2K equivalent of NT Workstation 4.0, for two processors. Similarly, Microsoft has licensed Win2K Server for four processors, and Win2K Advanced Server (Win2K AS—the equivalent of NTS/E) for eight processors. Microsoft has licensed Win2K Datacenter Server (Datacenter), which has no NT 4.0 equivalent, for 16 processors. As with NT 4.0, hardware vendors can distribute OEM versions of Win2K that support as many as 32 processors.

New Quantum Options and Scheduling Classes
The Win2K scheduler is a preemptive multitasking scheduler, which means that the scheduler divides processing time among the active threads running on a system. The scheduler gives each thread a quantum, which is a defined processing period. If a thread yields its turn early (e.g., when the thread stops executing to wait for an I/O operation to complete), Win2K invokes scheduler code to find another thread to take over the processor. If a thread runs to the end of its quantum, Win2K invokes the scheduler to give other threads a chance to execute. The Win2K scheduler (which Microsoft developers call the dispatcher) evaluates thread priorities to determine which thread should run at any time on each processor. The scheduler always chooses the thread with the highest priority, and if multiple threads exist with the same highest-priority value, the scheduler rotates among the threads, giving each thread a quantum of processor time. The term context-switch describes switching execution from one thread to another.

A thread's owning-process priority determines the thread's priority; a value between 1 and 31 designates the priority level. Application developers or administrators manipulate a process' priority by setting the priority to one of several possible priority classes. Each class corresponds to a value in the priority spectrum, and Win2K interprets the priority settings that programming APIs and administrative tools apply to a thread's priority as modifiers of the thread's process priority. NT 4.0 gives developers and administrators four process priority classes to choose from: Realtime, High, Normal, and Idle. Thread-priority modifiers position a thread's priority within two priority levels of the thread's process class. The Realtime class is the highest priority class. Developers and administrators rarely use the Realtime class because threads running at such elevated priorities can interfere with the kernel's operation. Therefore, most applications use only the remaining three classes. In many cases, developers and administrators want a richer priority spectrum to more precisely prioritize individual threads or entire applications based on the tasks that the threads and applications perform. Win2K therefore introduces two more priority classes: Below Normal and Above Normal. Figure 1 shows the complete Win2K process-priority-class spectrum.

Microsoft defines Win2K thread quanta in terms of scheduler clock ticks. A system's hardware abstraction layer (HAL) module defines the length of a clock tick. Typical clock-tick lengths are 7ms, 10ms, and 15ms. Short quanta are appropriate for systems running multiple interactive applications because applications more frequently get turns to respond to user activity. Long quanta are best for systems running a handful of noninteractive server applications because overall performance improves with fewer context switches.

On NT Workstation 4.0, the scheduler assigns quanta that are 2 clock ticks long to threads associated with background windows or applications that have no windows. A foreground application receives quanta that are 2, 4, or 6 clock ticks long, depending on the setting that an administrator chooses for the foreground-application boost slider in the Performance tab of the Control Panel System applet. A 6 clock-tick foreground quantum corresponds to a maximum performance boost on the slider. NT Server 4.0 assigns 12 clock-tick quanta to all threads, regardless of whether the threads are associated with a foreground window. (The slider functions on NT Server but has no effect.)

Although NT 4.0 quanta values are usually suitable for the different workloads that NT Workstation and NT Server run, cases exist in which those quanta values aren't suitable. In such cases, application performance, interactivity, or scalability suffers. As a result, Win2K lets administrators select the best quantum setting for a workload on any Win2K OS. The Win2K Control Panel System applet's Performance Options dialog box, which Screen 1 shows, lets administrators optimize performance for applications or background services. Optimizing performance for applications applies the quanta that NT Workstation 4.0 uses (i.e., background threads receive 2-clock-tick quanta and foreground applications receive 6-clock-tick quanta). Optimizing performance for background services applies the quanta that NT Server 4.0 uses (i.e., all threads receive 12-clock-tick quanta).

The Job Object
Many server workload types comprise related processes that cooperate to perform a computational or data-processing task. For example, a data-mining task might require the cooperation of a database client, middleware application, and custom analysis program. A server will often execute such multiple workloads simultaneously, or run the workloads as background operations while the server executes a client/server application (e.g., a database or Web server) in the foreground. Thus, letting an administrator or programmer control the processes of a task as a unit and apply limits on various characteristics of the task so that the task is prevented from interfering with the server's operation is desirable. For example, an administrator needs to assign a low scheduling priority to all the processes of background operations so that the task minimally impacts foreground execution.

NT 4.0 lacks programmatic and administrative interfaces to manage the processes in a group of related processes. Although interfaces exist to manipulate individual processes, control applications have difficulty determining programmatically which processes make up a common task, particularly when the task dynamically creates new processes as part of its control flow. Further move, other than setting the priority of a process, few options exist for limiting a process' resource usage.

Microsoft has addressed this limitation in Win2K by introducing a new process object, the job object. A job object is a container for related processes; job object programming APIs allow simultaneous management of all processes associated with a job and provide a variety of resource-limitation options. Figure 2 shows an example of a job object. A group of related processes might include special code that the group uses to manage itself, or administrative utilities can let administrators manage arbitrary applications as jobs. For example, Sequent has developed an administrative job object tool that Microsoft will include with Datacenter.

Developers use the AssignProcessToJobObject API to add a process to a job. When a process belongs to a job, you can't remove the process from the job. And, unless the developer dictates otherwise, any processes that a job's process creates or any of the process' descendents also join the job. The TerminateJobObject API provides the ability to terminate all a job's processes at one time. Another API lets a developer associate a completion port synchronization object with a job. An application can wait on a job's synchronization object to receive notification of various job events, such as the addition of a new process to the job or the termination of a job process, either because the process terminated normally or because of an error. The job object's core is the QueryInformationJobObject and SetInformationJobObject APIs. These APIs let an application monitor the collective resource usage of a job's processes and impose resource limitations or performance characteristics on the job.

Figure 3 lists the various limits that a job supports, including limits on memory usage, processor time consumption, number of processes, and various security restrictions. For example, a programmer can specify that an individual process in a job can't use more than a certain amount of virtual memory, or that the job's total memory usage can't exceed a particular value. When the job reaches a memory limit, the Win2K memory manager won't give the job more memory. Job object APIs give developers a choice of two behaviors when a job exceeds a CPU time limit. The first option is for Win2K to terminate the job and all of its processes; the second option is for Win2K to deliver a message to the job's completion port, if the port exists.

The security restrictions that job object APIs can apply to a job are flexible. For example, giving a background job the ability to reboot a server is undesirable. If an administrator runs a job, the ability to reboot a server would usually occur in the job's processes' security token because tokens reflect the security credentials of the account they are created with. One class of security restriction therefore enables the removal of certain privileges from the security tokens that belong to the job's processes. A programmer could remove the reboot and other privileges to prevent a job from disrupting the server's operation. Other security restrictions remove certain user-account associations from a job. If a job is executing in Joe's account and Joe belongs to the Administrators group, Joe might want to remove the Administrators group from the job processes' tokens so that the processes can't access resources such as files, devices, and synchronization objects that only administrators can usually access.

Performance characteristics that you can control with a job object include process priority classes and thread quantum lengths. Using a job object, you can simultaneously set the process priority class of every process in a job, a useful technique for assigning the job a priority among other system processes. If you want the job to execute in the background, you set a low priority class, such as Below Normal. Win2K designates the quantum lengths that a job object supports as scheduling class values from 0 through 9. On Win2K server versions, each scheduling class corresponds to a different quantum length. Scheduling class has no effect on Win2K Pro.

Table 1 shows quantum lengths for each scheduling class. When you apply scheduling class 9 to processes in the Realtime priority class, special behavior results: The Realtime process-thread quanta become infinite. In other words, the Win2K scheduler never ends the turn of a thread with a priority higher than 15 and a scheduling class of 9. Developers and systems administrators therefore need to be careful with this designation. Setting the scheduling class to a value higher than 5 requires the increase base priority privilege because scheduling-class values higher than 5 give a job longer quanta than standard processes possess.

With the job object, Microsoft has introduced a powerful and flexible way to control entire trees of related processes. What might not be immediately obvious is that the performance settings and resource limitations you configure with a job object can either help a particular task scale better or, as is more often the case, help the rest of a system scale better by degrading the capabilities of a particular job.

Thread Pooling
Virtually every application with the goal of scaling well on multiprocessors includes asynchronous I/O operation and multiple threads that cooperate with one another. If an application has only one thread, the application can execute on only one CPU at a time. However, if the application has at least as many threads as there are processors, the application can perform work on all processors of an SMP system simultaneously. Asynchronous I/O lets an application's threads initiate I/O operations but not idly wait for the operations to complete. Instead of waiting, the threads perform other work.

Implementing multithreaded code is tedious, particularly if threads are necessary for only certain parts of the application or if a developer wants to maintain an appropriate number of threads as the application's workload characteristics vary. Having too many threads can hurt an application's performance just as surely as having too few threads can. Win2K introduces thread pooling, a new API that makes the job of efficiently using threads easier.

Each application has a private pool of threads that the ntdll.dll system library manages. The kernel32.dll Win32 API library presents the thread-pooling interface to an application but relies on NTDLL to implement the APIs. An application uses the thread-pooling API whenever the application wants to efficiently execute specific functions with different threads. The functions might execute as the result of timer expirations, an I/O operation's completion, or simply because the application has queued the function to run in the thread pool. NTDLL dynamically creates and deletes threads for the thread pool as the demands of the application dictate.

When an application associates a function with the thread pool, the application specifies what type of processing the function will perform. Processing options include CPU processing, I/O operations, and long-running activity. These options aid the thread pool in tuning the number of active threads that the pool creates. A thread pool maintains two thread types: one type for CPU-processing, and another type for I/O operations. The I/O-operation threads use wait-semantics to wait for new work items from an application. Wait-semantics lets the threads receive notification when the I/O operations that the threads scheduled within application functions complete. If an application designates a function as long-running, the thread pool is more aggressive about creating new threads to handle subsequent requests because the pool knows that a thread might remain busy while it executes the long-running function.

The thread pool will increase the number of each thread type to an upper limit of about 2000 under extreme conditions, as when an application queues many long-running functions. However, the thread pool tries to keep the number of active threads to fewer than 10. Thus, an application can easily take advantage of multithreading simply by handing work to the thread pool, letting the pool worry about the details of thread management.

The Queued Spinlock
In addition to the new APIs I've described, Win2K includes numerous subtle multiprocessor-scalability-related changes beneath the surface. First, Microsoft has optimized the kernel's memory-footprint layout and the memory-footprint code locality for most system components such as DLLs and driver files. These optimizations ensure that pieces of code that execute together are stored near one another and thus place fewer demands on a computer's memory subsystem. Microsoft calls this technology Lego. Lego works in two steps. In the first step, executable code generates a Basic Block Testing (BBT) trace, which is a trace of the code's execution. In the second step, postprocessing occurs to analyze the code trace and reorganize the way in which the code is stored. The resulting reorganized code is the code that Microsoft ships.

Another performance optimization is the kernel's use of spinlocks. Spinlocks are synchronization objects that the kernel uses to synchronize access to data structures across an SMP system's processors. For example, if the kernel is accessing the scheduler database on one processor, the kernel must not simultaneously manipulate the database on the system's other processors. The kernel therefore acquires a spinlock before accessing the database and releases the lock after the access operation completes. If the kernel has already acquired the spinlock on a different processor before accessing the database, the kernel will spin in a tight busy-loop while it waits to acquire the spinlock for the database, hence the origin of the term spinlock.

A spinlock serializes the execution of code that it's protecting, and because parallel execution is required for good scalability, spinlocks are detrimental to multiprocessor scalability. Microsoft has extended considerable effort to fine-tune the kernel's use of spinlocks so that the kernel acquires spinlocks only when necessary. Further, Microsoft has introduced a new type of spinlock called a queued spinlock.

Queued spinlocks scale better on multiprocessors than standard spinlocks do. Queued spinlocks work in this way: When a processor wants to acquire a queued spinlock that another processor is holding, the first processor places its identifier in the spinlock's associated queue. When the processor that is holding the spinlock releases it, the releasing processor hands the spinlock to the next processor waiting in the queue. In the meantime, processors waiting in the global spinlock queue spin by monitoring the status not of the spinlock but of a per-processor flag that the releasing processor sets to signal that the turn of the next processor in the queue has arrived.

Two effects follow from the fact that queued spinlocks cause processors to spin on per-processor flags rather than on the global spinlock. The first effect is that the multiprocessor's bus doesn't experience the heavy traffic of interprocessor synchronization. The second effect is that queued spinlocks enforce first in/first out (FIFO) access to the lock. FIFO ordering ensures more consistent performance across processors that access the same locks.

Microsoft hasn't converted all the Win2K kernel's locks to queued spinlocks—only the dozen or so locks that protect the core data structures of the kernel, such as the Cache Manager's database, the scheduler's thread database, and the Memory Manager's physical memory database. Queued spinlocks will undoubtedly give Win2K a significant scalability boost over NT 4.0.

Scaling the Enterprise
With the scalability enhancements I've described in this two-part series—including memory sizing enhancements, APIs for accessing large amounts of physical memory and for controlling groups of processes as a unit, and tweaks to resource and processor management—Win2K is well equipped to help applications take advantage of enterprise-class hardware. The result is that Win2K is positioned to pick up where NT 4.0 left off and better address the mid- to high-end server space.