Is this OS ready for enterprise prime time?

In "Windows NT vs. UNIX: Is One Substantially Better?" (December 1998), I included a sidebar titled "Linux and the Enterprise." This sidebar discussed the Linux kernel's shortcomings, including the lack of kernel-mode threads and the use of non-reentrant kernel code that affects Linux's ability to scale on multiprocessor systems. The Linux community's response to the sidebar has been vociferous, and most of the email I received contains two assertions: Linux does have kernel threads, and Linux 2.2 will remove multiprocessor scalability limitations. Although these assertions are true to a certain extent, significant problems with kernel threads and multiprocessor scalability in Linux 2.2 will prevent it from competing head to head with NT and Linux's UNIX cousins for enterprise applications. In this article, I'll describe several key areas in which Linux lacks maturity. I'll also discuss some powerful features that other OSs, including UNIX variants and NT, have adopted to improve scalability.

Let me state clearly at the outset that I don't intend to bash Linux in this article, nor do I intend to proclaim NT's superiority to Linux. I base the information I present on a thorough study of Linux 2.2, including its source code. I hope that by revealing these problems, I will encourage Linux developers to focus on making Linux ready for the enterprise. I also want to dispel some of the hype that many Linux users have uncritically accepted, which has given them the false impression that Linux is ready for enterprise prime time.

What Does "Enterprise-Ready" Mean?
Before an OS can compete in the enterprise, it must deliver performance levels on network server applications that rival or exceed the levels that other OSs achieve. Examples of network server applications include Web servers, database servers, and mail servers. OS and hardware vendors typically use results from industry-standard benchmarks such as Transaction Processing Performance Council (TPC)-C, TPC-D, and Standard Performance Evaluation Corporation (SPEC) SPECweb to measure proprietary OSs or hardware against other vendors' products. Vendors spend millions of dollars on benchmark laboratories in which they tune, tweak, and otherwise push state-of-the-art hardware and software technology to the limit. Enterprise customers, in search of a computing infrastructure that provides maximum capacity, often turn to benchmark results for guidance, and OS and hardware vendors holding benchmark leads take great pride in being king of the hill at any given time.

Thus, competing in the enterprise means removing every performance impediment possible. Overlooking even the smallest drag will create an opening that a competitor can drive through to obtain a valuable lead on a benchmark. What complicates the science of engineering an OS for the enterprise is that an OS might have subtle design or implementation problems that don't adversely affect performance in casual desktop or workgroup environments. Yet these problems can keep the OS from achieving competitive results in an enterprise-class benchmarking environment. A typical enterprise-application benchmarking environment includes dozens of powerful multiprocessor computers sending requests as fast as they can over gigabit Ethernet to an 8-way server with 4GB of memory and hundreds of gigabytes of disk space.

Efficient Request Processing
Network server applications typically communicate with clients via TCP or UDP. The server application has either a published or well-known port address on which it waits for incoming client requests. When the server establishes a connection with a client or receives a client request, the server must then process the request. When the server application is a Web server, the Web server has to parse the HTTP information in the request and send requested file data back to the client. A database server application must parse the client's database query and obtain the desired information from the database.

For a network server application to scale, the application must use multiple kernel-mode threads to process client requests simultaneously on a multiprocessor's CPUs. The obvious way to make a server application take advantage of multiple threads is to program the application to create a large pool of threads when it initializes. A thread from the pool will process each incoming request or series of requests issued over the same client connection, so that each client request has a dedicated thread. This approach is easy to implement but suffers from several drawbacks. First, the server must know at the time it initializes what kind of client load it will be subject to, so that it can create the appropriate number of threads. Another drawback is that large numbers of threads (an enterprise environment can produce thousands of simultaneous client requests) can drain server resources significantly. Sometimes, resources might not be adequate to create all the threads the application wants to create. Furthermore, many threads actively processing requests force the server to divide CPU time among the threads. Managing the threads will consume precious processor time, and switching between competing threads introduces significant overhead.

Because a one-thread-to-one-client-request model is inefficient in enterprise environments, server applications must be able to specify a small number of threads that divide among themselves the processing for a large number of client requests. Where this client-multiplexing capability is present, a thread is not tied to a single client request, and a client request is not tied to a single thread: several threads might share the processing of one request. Several OS requirements are necessary for a client-multiplexing server design to be feasible. The first requirement is that a thread must be able to simultaneously wait for multiple events: the arrival of a new client request on a new client connection, and a new client request occurring on an existing client connection. For example, a Web server will keep multiple browser connections open and active while accepting new browser connections as multiple users access a Web site the server manages. Connections between a browser and the server can stay open for several seconds while large files transmit over a connection, or while the browser requests multiple files over the connection.
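To make the first requirement concrete, here is a minimal Python sketch of one thread waiting, in a single call, on both a listening endpoint and an already-open connection. It uses the standard select call purely for illustration. The names serve_once and demo, the loopback address, and the socket pair that stands in for an open browser connection are all illustrative assumptions, not code from any server discussed in this article.

```python
import select
import socket


def serve_once(listener, connections, timeout=1.0):
    """Wait once for either kind of event and handle whatever is ready.

    Returns a list of (kind, payload) tuples: ("connect", socket) for a new
    client connection, ("request", bytes) for data on an existing connection.
    """
    events = []
    readable, _, _ = select.select([listener, *connections], [], [], timeout)
    for sock in readable:
        if sock is listener:
            conn, _addr = listener.accept()
            connections.append(conn)
            events.append(("connect", conn))
        else:
            events.append(("request", sock.recv(4096)))
    return events


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen()
    browser_a, existing = socket.socketpair()   # stands in for an open connection
    connections = [existing]

    browser_b = socket.create_connection(listener.getsockname())  # new connection
    browser_a.sendall(b"GET /page2.html")                         # new request

    seen = []
    for _ in range(5):       # both events may not be ready on the first pass
        seen += serve_once(listener, connections)
        if len(seen) >= 2:
            break
    kinds = sorted(kind for kind, _ in seen)
    for sock in (listener, browser_a, browser_b, *connections):
        sock.close()
    return kinds


if __name__ == "__main__":
    print(demo())
```

The single wait covers both event types, which is the property the requirement asks for. Whether the wait is efficient is a separate question.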

The second requirement is that the threads must be able to issue asynchronous I/O requests. Asynchronous I/O is an OS-provided feature whereby a thread can initiate I/O and perform other work while the I/O is in progress—the thread can check the I/O result at a later time. For example, if a server thread wants to asynchronously read a file for a client request, the thread can start the read operation and wait for other client requests while the read is in progress. When the read completes, the system notifies a thread (not necessarily the thread that began the read operation) so that the thread can check the I/O's status (i.e., success or failure) and whether the I/O is complete.
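The calling pattern, initiate now and collect the result later, can be sketched in a few lines of Python. This sketch only emulates the pattern: a single background worker thread stands in for the OS, whereas true asynchronous I/O needs no dedicated thread. The helper start_read and the temporary file are hypothetical names introduced for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Assumption: this one-thread executor stands in for the OS I/O subsystem.
_io_engine = ThreadPoolExecutor(max_workers=1)


def start_read(path):
    """Initiate a read and return at once with a handle to the pending I/O."""
    def _read():
        with open(path, "rb") as f:
            return f.read()
    return _io_engine.submit(_read)


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"<html>hello</html>")
        path = f.name
    try:
        pending = start_read(path)        # the read is now in flight
        other_work = sum(range(1000))     # the caller keeps doing useful work
        data = pending.result(timeout=5)  # later: collect status and data
        return other_work, data
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

If the read fails, result() raises the underlying exception, which corresponds to checking the I/O's status after completion.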

Without asynchronous I/O, a thread initiating an I/O operation must wait while the operation takes place. This synchronous I/O causes multiple-client-per-thread server designs to perform poorly. Because such server designs designate limited thread pools, taking threads out of commission to perform I/O can lead to a situation in which no threads are available to accept new client requests or connections. In such a case, a multiprocessor's CPUs might remain idle while client requests sit backlogged. Worse, the server might never have a chance to service client requests, because the client might stop waiting for the server. Figure 1 contrasts asynchronous and synchronous I/O.

Linux and Request Processing
Unfortunately, Linux 2.2 doesn't satisfy either client-multiplexing server-design requirement: Linux 2.2 cannot efficiently wait for multiple events, and it doesn't support asynchronous I/O. Let's look more closely at each of these concerns.

Linux provides only one general API to server applications that want to wait on multiple requests: the select API. Select is a UNIX system call that dates back to 4.2BSD and is available on virtually every UNIX variant. Select is one of the OS interface functions that has become part of the POSIX standard for UNIX API compatibility. One reason that the Linux select implementation is not an acceptable function for waiting on multiple events is that whenever an event occurs (e.g., the arrival of a request from a new client), the kernel wakes up every thread that is waiting in select on that event. Notifying multiple threads in this way degrades server performance: Only one thread can handle the new request or connection, and the other notified threads must return to a state of waiting. In addition, synchronization causes overhead as the threads agree among themselves which one will service the request. Other secondary overhead results when the OS divides CPU time among the threads it has needlessly notified. This kind of limitation forces a network server application to designate only one thread to wait for new incoming client requests. This thread can either process the new request itself, waking up another thread to take over the role of waiting for new requests, or hand the request off to a waiting thread. Both alternatives add overhead, because every time a new client request arrives, the waiting thread receives notification and must then notify another thread.
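A small Python sketch shows the effect of several threads waiting in select on the same endpoint. It is an illustration under stated assumptions, not a measurement of Linux 2.2: the function herd_demo is a hypothetical name, a socket pair stands in for a client connection, and a barrier is added only so that every thread observes the event before any thread consumes it.

```python
import select
import socket
import threading


def herd_demo(num_threads=4):
    """One event, many waiters: count wakeups versus threads that got work."""
    client, server = socket.socketpair()
    server.setblocking(False)
    barrier = threading.Barrier(num_threads)
    woken, served = [], []

    def waiter(ident):
        ready, _, _ = select.select([server], [], [], 5.0)
        if ready:
            woken.append(ident)
        barrier.wait(timeout=5.0)   # demo only: let every waiter see the event
        try:
            if server.recv(1):
                served.append(ident)
        except BlockingIOError:
            pass                    # another thread already took the request

    threads = [threading.Thread(target=waiter, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    client.send(b"x")               # a single client request arrives
    for t in threads:
        t.join()
    client.close()
    server.close()
    return len(woken), len(served)


if __name__ == "__main__":
    print(herd_demo())
```

Every waiter is told the endpoint is ready, but only one of them obtains the request. The others have been scheduled for nothing and must go back to waiting.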

If Linux provided some additional application support, the OS could wake up only one thread. For example, an application could specify that even though multiple threads are waiting for a particular event to occur, the application wants only one of the threads to receive notification for each occurrence of the event. NT provides such support for its waiting functions (NT server applications do not typically use select, although NT implements the select call for compatibility with the POSIX standard) to allow multiple threads to efficiently wait for incoming client requests.

Select suffers another serious problem: It doesn't scale. A Linux application can use select to wait for up to 1024 client connections or request endpoints. However, when an application receives notification of an event, the select call must determine which event occurred, before reporting the event to the application. Select uses a linear search to determine the first triggered event in the set the application is waiting for. In a linear search, select checks events sequentially until it arrives at the event responsible for the notification. Furthermore, the network server application must go through a similar search to determine which event select reports. As the number of events a thread waits for grows, so does the overhead of these two searches. The resulting CPU cost can significantly degrade a server's performance in an enterprise environment.
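The application's half of that search can be sketched in Python. Python's select returns the ready sockets directly, so the explicit loop below deliberately mimics what a C caller does when it tests each descriptor in turn. The function names and the 64 socket pairs standing in for client connections are illustrative assumptions.

```python
import select
import socket


def find_first_ready(watched, timeout=5.0):
    """Call select, then scan the watched set in order, as a C caller would.

    Returns (ready_socket, checks), where checks is how many entries the
    application examined before it reached the one that triggered.
    """
    readable, _, _ = select.select(watched, [], [], timeout)
    ready = set(readable)
    checks = 0
    for sock in watched:
        checks += 1
        if sock in ready:
            return sock, checks
    return None, checks


def demo(num_connections=64):
    pairs = [socket.socketpair() for _ in range(num_connections)]
    watched = [server for _client, server in pairs]
    pairs[-1][0].send(b"x")          # only the last connection has a request
    found, checks = find_first_ready(watched)
    position = watched.index(found)
    for client, server in pairs:
        client.close()
        server.close()
    return position, checks


if __name__ == "__main__":
    print(demo())
```

When the active connection happens to be last in the set, the application examines every entry before it finds it, and the whole set is handed to select again on the next call.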

NT incorporates a unique feature known as completion ports to avoid the overhead of searching. A completion port represents a group of events. To wait on multiple events, a server associates the events with a completion port and waits for the completion port event. No hard upper limit exists on the number of events a server can associate with a completion port, and the server application need not search for which event occurred—when the server receives notification of a completion port event, the server also receives information about which event occurred. Similarly, the kernel doesn't perform searches, because the kernel knows which events the system associates with specific completion ports. Completion ports simplify the design and implementation of highly scalable server applications, and most enterprise-class NT network server applications use completion ports.
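The idea can be approximated in Python with a thread-safe queue of completion packets. This is a toy analogy, not the NT API: ToyCompletionPort, post, wait, and run_workers are hypothetical names, and the events are posted by hand rather than by a kernel.

```python
import queue
import threading


class ToyCompletionPort:
    """Illustrative stand-in for a completion port: a queue of (key, result)."""

    def __init__(self):
        self._packets = queue.Queue()

    def post(self, key, result):
        self._packets.put((key, result))

    def wait(self):
        # Blocks until a packet is available; exactly one waiter receives it.
        return self._packets.get()


def run_workers(num_workers, events):
    port = ToyCompletionPort()
    handled = []
    lock = threading.Lock()

    def worker(name):
        while True:
            key, result = port.wait()
            if key is None:              # shutdown packet
                return
            with lock:
                handled.append((name, key, result))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for key, result in events:
        port.post(key, result)
    for _ in threads:
        port.post(None, None)            # one shutdown packet per worker
    for t in threads:
        t.join()
    return handled


if __name__ == "__main__":
    print(run_workers(3, [("conn-%d" % i, i) for i in range(5)]))
```

Each packet carries the key of the event that completed, so no worker searches for it, and each packet wakes exactly one worker.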

Linux 2.2 introduces a new mechanism that lets network server applications request notification of certain events in a more efficient way than select. A Linux server thread marks a communications endpoint as a notification endpoint; any new connections a client establishes through that endpoint cause the system to notify the application—the thread doesn't wait for the event. Furthermore, when such an event occurs, the kernel tells the application precisely which event is occurring, thus eliminating the searching that select requires. Unfortunately, this feature has two major limitations. First, the feature applies only to particular communications endpoints (TCP/IP and TTY devices) and events related to a new client connection—this mechanism does not notify a server thread of new requests over existing connections. Second, just as for select, the kernel wakes up all threads waiting for a new client request event—not just one thread, as efficient server applications require.

Linux fails to meet the second requirement for supporting scalable server applications because the OS currently has no asynchronous I/O APIs. If a Linux server thread initiates an I/O request, the thread can't perform any useful work while the I/O is in progress. Instead, the thread waits.

Many Linux developers mistakenly believe that the existence of a form of I/O in Linux known as nonblocking I/O means that Linux supports asynchronous I/O. An application that requests nonblocking I/O can attempt to read from a network connection, for example, and the application doesn't wait until data is available on the connection before the application continues executing. A major difference exists between Linux nonblocking I/O and true asynchronous I/O, however: An application performing a nonblocking I/O call does not initiate an I/O operation if the I/O cannot be immediately satisfied. If the application wants to initiate I/O, the application must issue the I/O request when I/O is possible.

A quick example illustrates nonblocking I/O. If a server thread performs a nonblocking read from a network connection and no client data is ready on the connection, the server thread will not wait for data to become available. Instead, the thread can perform other work. However, the thread must issue further read operations until data is ready on the connection. By contrast, a server thread that issues an asynchronous read actually initiates a read I/O but can also perform other work without issuing additional reads. The OS notifies the thread when client data arriving at the connection can satisfy the issued request.
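The nonblocking half of this example can be written as a short Python sketch. The socket pair stands in for a client connection, and the function name nonblocking_read_demo is an illustrative assumption.

```python
import select
import socket


def nonblocking_read_demo():
    client, server = socket.socketpair()
    server.setblocking(False)
    attempts = []

    try:
        server.recv(4096)
    except BlockingIOError:
        # No data yet. The call returned at once, and no read is left pending.
        attempts.append("would block")

    client.sendall(b"GET /index.html")

    # The application must find out for itself when a read is possible
    # and then issue the read again.
    select.select([server], [], [], 5.0)
    attempts.append(server.recv(4096))

    client.close()
    server.close()
    return attempts


if __name__ == "__main__":
    print(nonblocking_read_demo())
```

The first call does not wait, but it also starts nothing. The data reaches the application only because the application asked a second time.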

Even in a network server implementation that has a small pool of threads that share the processing of new client connections and client requests, a syndrome known as overscheduling can adversely affect server performance. If a server thread takes a client request and actively uses a CPU to process the request, and another server thread starts processing another client request on the same CPU, both threads will compete for CPU time. This situation introduces overhead when the OS switches between the threads to give each access to the processor. The higher the number of threads that actively compete for CPU time, the worse the overhead problem becomes. The goal of a high-performance server application is therefore to have as few threads competing for the CPU as possible. To achieve this goal, the application requires OS support.

The OS must make it likely that only one thread will process a request at a time, and that when that thread finishes with the request, the OS will choose the same thread to process the next request. Such support prevents a situation in which the thread finishing a request goes back to waiting while the OS launches another thread to handle the next request. A network server application that achieves this support will almost never have overhead that the OS scheduler causes when it switches the CPU among multiple threads.

To achieve this support, the OS scheduler must keep track of which server threads are active and which threads are waiting for events. NT integrates this knowledge into its completion ports and uses completion ports as gateways for threads to use the CPU. (Figure 2 shows an example of a completion port.) If a server thread begins processing a request after receiving notification from a completion port, the scheduler will not notify any other threads waiting on that completion port for client requests until the processing thread voluntarily gives up the CPU, usually by blocking on I/O. If the active thread finishes its processing without giving up the CPU and waits for another event at the completion port, the scheduler will immediately notify that thread of the next waiting request, and the thread will continue running. If a thread gives up the CPU while processing a request—for example, while it waits for some other event not associated with the completion port (such as transmitting a very large Web response)—the scheduler will notify another thread waiting at the completion port of the next client request. This thread-throttling mechanism helps the server application minimize the number of actively scheduled threads and get the most out of a CPU.

Without thread-throttling support in the Linux scheduler, Linux server applications must rely on less-precise methods in their application code to try to achieve the same goal. However, these Linux applications will not realize performance benefits to the extent that NT applications do with NT scheduler support.

Kernel Reentrancy
The Linux community heralded Linux 2.2's recent release as Linux's coming of age in the multiprocessing world. One big reason for this jubilation is that in Linux 2.2, parts of the kernel are reentrant. A reentrant kernel function is a function that can simultaneously execute on multiple CPUs in a multiprocessor. If one CPU is executing a non-reentrant function, another CPU wanting to execute the same function must wait until the first CPU is finished. This effect is known as serialization, because the two CPUs' execution of the function is sequential if viewed on a timeline, as Figure 3 shows. Serialized execution defeats the advantages of multiprocessor execution, because the non-reentrant functions execute as if they were on a uniprocessor.

Linux 2.2 is more reentrant than previous versions of Linux were. However, several major Linux functions are still not reentrant. These functions include read and write, the two most common functions network server applications use. A Linux server application will read client requests from a communication endpoint, read data from a file (such as a Web file or email database) to respond to the requests, and write the file to the client via the communications endpoint. Even if the data the client requests is in a memory cache and a read from a file is not necessary, the write paths still serialize.

Figure 3 demonstrates the difference between the execution of a non-reentrant function and that of a reentrant function. In the top half of the figure, the kernel spends time waiting for both CPUs to execute the non-reentrant function; in the bottom half of the figure, the kernel doesn't spend time waiting for the reentrant function. The OS in the bottom half of the figure finishes executing the same code sooner than the OS in the top half of the figure does.
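The same contrast can be modeled in user space with a short Python sketch. It is a model only: threads stand in for CPUs, a single lock stands in for whatever makes a kernel function non-reentrant, and the names InFlightMeter and peak_concurrency are hypothetical. The barrier exists solely to make the overlap observable without relying on timing.

```python
import threading


class InFlightMeter:
    """Records the peak number of threads inside a function at one moment."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)

    def __exit__(self, *exc_info):
        with self._lock:
            self.current -= 1


def peak_concurrency(num_cpus, reentrant):
    big_lock = threading.Lock()          # models a lock around the function
    meter = InFlightMeter()
    rendezvous = threading.Barrier(num_cpus)

    def kernel_function():
        with meter:
            if reentrant:
                # Demo only: hold every caller inside until all have entered.
                rendezvous.wait(timeout=5.0)

    def cpu():
        if reentrant:
            kernel_function()
        else:
            with big_lock:               # callers queue up one at a time
                kernel_function()

    threads = [threading.Thread(target=cpu) for _ in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meter.peak


if __name__ == "__main__":
    print(peak_concurrency(2, reentrant=False), peak_concurrency(2, reentrant=True))
```

With the lock in place, at most one caller is ever inside the function, no matter how many CPUs are available. Without it, every caller can be inside at once.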

Many members of the Linux development community believe this kernel-waiting-time difference to be an insignificant performance problem. This belief comes about almost certainly because no performance studies of Linux 2.2 on enterprise workloads have yet taken place. To glimpse how serious a problem kernel waiting time will be for Linux, look at recent developments in NT. NT's write function for network I/O was reentrant except for the part of the function that NIC drivers (Network Driver Interface Specification, or NDIS, drivers) handled, wherein they transfer data to their network hardware. Making this small part of the entire write function non-reentrant was enough to prevent NT from competing effectively with Sun's Solaris OS on enterprise applications executing on 4-way multiprocessors. To remedy the situation, Microsoft let NDIS drivers have deserialized, or reentrant, write paths in NT 4.0 Service Pack 4 (SP4) and Windows 2000 (Win2K). In Linux 2.2, several functions, in addition to read and write, are still not reentrant, and each time a Linux server application uses such a function, the function's non-reentrancy hampers multiprocessor scalability.

A final area in which Linux is at a disadvantage is its implementation of the sendfile API. Sendfile is an API that Microsoft introduced to NT several years ago as a feature that network server applications can use to enhance their performance. Obvious candidates for the API are Web servers. Without sendfile, a Web server that receives a request from a browser for an HTML file must first read the contents of the file into its memory and then send the contents to the client via a communications endpoint. The process of reading the file into private application memory is wasteful, because the server application doesn't want the contents of the file—it merely wants to send the contents to the client.

Sendfile eliminates the necessity that a server application read a file before sending it. With sendfile, the server application specifies the file to send and the communications endpoint in the sendfile API, and the OS reads and sends the file. Thus, the server doesn't have to issue a read API or dedicate memory for the file contents, and the OS can use its file system cache to efficiently cache files that clients request. Soon after Microsoft implemented sendfile in NT, UNIX vendors implemented sendfile in their OSs.
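The difference between the two approaches shows up clearly in a short Python sketch. It assumes a Linux or other UNIX system on which Python's os.sendfile is available. The function names, the temporary file, and the socket pair that stands in for a browser connection are illustrative assumptions.

```python
import os
import socket
import tempfile


def send_with_read(path, sock):
    """Classic path: copy the file into application memory, then send it."""
    with open(path, "rb") as f:
        sock.sendall(f.read())


def send_with_sendfile(path, sock):
    """sendfile path: the OS moves the data; the application never holds it."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # sendfile may transfer fewer bytes than requested, so loop.
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent


def demo():
    payload = b"<html>hello</html>"
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    server, browser = socket.socketpair()
    try:
        send_with_read(path, server)
        first = browser.recv(len(payload), socket.MSG_WAITALL)
        send_with_sendfile(path, server)
        second = browser.recv(len(payload), socket.MSG_WAITALL)
        return first, second
    finally:
        server.close()
        browser.close()
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The client receives the same bytes either way. What changes is that the second version never allocates an application buffer for the file contents.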

Linux has a sendfile implementation, but the Linux sendfile has several problems that developers must fix before Web servers (and other applications that can use the sendfile API) on Linux can achieve the same benefits that UNIX variants and NT obtain from their sendfile implementations. On NT, sendfile doesn't incur a copy operation if the file being sent is in the NT file system cache. In other words, the network software can send the data directly from the cache. But on Linux, sendfile copies the file data into buffers that sendfile hands to the networking code. This extra copy operation consumes CPU time and creates a larger memory footprint for the server application, both of which adversely affect performance. Another problem with Linux's sendfile is that it is non-reentrant. Thus, using Linux's sendfile leads to the serialized execution of part of a server application, which inhibits the application's ability to scale on a multiprocessor. A final problem with Linux's sendfile is that it doesn't let the application prepend a data buffer to the front of the file sendfile is sending. In the case of Web servers, this limitation necessitates another system call to send an HTTP header before the server can send the file.

Linux Isn't There—Yet
The limitations I've cited are most of the major shortcomings in Linux's support for enterprise applications. However, other limitations might lurk beneath those I've pointed out. Despite the Linux community's claims to the contrary, Linux 2.2 is not ready for the enterprise or for multiprocessors. Linux is not engineered with enterprise computing in mind, nor has the OS been present in enterprise environments where administrators, programmers, and users can notice its limitations. Consequently, Linux's kernel threads are ineffective at supporting enterprise applications, and the Linux kernel is unable to scale applications on multiprocessors as well as other OSs can. Certainly over the next year or two, as Linux's momentum pushes it into the enterprise, Linux will face its shortcomings. When that happens, and Linux's developers address the OS's problems, UNIX variants and NT will feel a compelling threat to their enterprise dominance from this open-source OS.