Linux 2.2 introduces a new mechanism that lets network server applications request notification of certain events more efficiently than select does. A Linux server thread marks a communications endpoint as a notification endpoint; any new connection a client establishes through that endpoint causes the system to notify the application (the thread doesn't wait for the event). Furthermore, when such an event occurs, the kernel tells the application precisely which event is occurring, thus eliminating the searching that select requires. Unfortunately, this feature has two major limitations. First, the feature applies only to particular communications endpoints (TCP/IP and TTY devices) and only to events related to a new client connection; the mechanism does not notify a server thread of new requests over existing connections. Second, just as for select, the kernel wakes up all threads waiting for a new client request event, not just one thread, as efficient server applications require.
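For comparison, the select pattern that the notification mechanism is meant to improve on can be sketched as follows. This is a minimal illustration rather than code from any particular server; the function name wait_for_readable and its parameters are hypothetical. It shows the searching step: select reports only how many descriptors are ready, so the application must test each one.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Block until at least one descriptor in fds[] is readable, store the
 * readable descriptors in ready[], and return how many were found
 * (-1 on error). ready[] must have room for count entries. */
int wait_for_readable(const int *fds, int count, int *ready)
{
    fd_set readset;
    int maxfd = -1;

    FD_ZERO(&readset);
    for (int i = 0; i < count; i++) {
        /* select cannot handle descriptors at or above FD_SETSIZE. */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* The return value is a count of ready descriptors, not their identity. */
    if (select(maxfd + 1, &readset, NULL, NULL, NULL) < 0)
        return -1;

    /* The search: every descriptor is tested, ready or not. */
    int found = 0;
    for (int i = 0; i < count; i++) {
        if (FD_ISSET(fds[i], &readset))
            ready[found++] = fds[i];
    }
    return found;
}
```

The cost of the final loop grows with the number of descriptors the server watches, even when only one of them has anything to report.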
Linux fails to meet the second requirement for supporting scalable server applications because the OS currently has no asynchronous APIs. If a Linux server thread initiates an I/O request, the thread can't perform any useful work while the I/O is in progress. Instead, the thread waits.
Many Linux developers mistakenly believe that the existence of a form of I/O in Linux known as nonblocking I/O means that Linux supports asynchronous I/O. An application that requests nonblocking I/O can attempt to read from a network connection, for example, and the application doesn't wait until data is available on the connection before the application continues executing. A major difference exists between Linux nonblocking I/O and true asynchronous I/O, however: An application performing a nonblocking I/O call does not initiate an I/O operation if the I/O cannot be immediately satisfied. If the application wants to initiate I/O, the application must issue the I/O request when I/O is possible.
A quick example illustrates nonblocking I/O. If a server thread performs a nonblocking read from a network connection and no client data is ready on the connection, the server thread will not wait for data to become available. Instead, the thread can perform other work. However, the thread must issue further read operations until data is ready on the connection. By contrast, a server thread that issues an asynchronous read actually initiates a read I/O but can also perform other work without issuing additional reads. The OS notifies the thread when client data arriving at the connection can satisfy the issued request.
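The nonblocking behavior described above can be sketched in C. This is a minimal illustration under stated assumptions: the helper names set_nonblocking and try_read and the READ_WOULD_BLOCK constant are hypothetical, while fcntl, O_NONBLOCK, read, and EAGAIN are the standard POSIX interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define READ_WOULD_BLOCK (-2)   /* hypothetical marker: no data was ready */

/* Put a descriptor into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Attempt one read. Returns the number of bytes read (0 at end of stream),
 * -1 on a real error, or READ_WOULD_BLOCK if no data was available.
 * In the READ_WOULD_BLOCK case the kernel has not queued any I/O on the
 * caller's behalf; the caller must issue the read again later. */
ssize_t try_read(int fd, char *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n = read(fd, buf, len);
    if (n >= 0)
        return n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK;
    return -1;
}
```

A thread that gets READ_WOULD_BLOCK is free to do other work, but nothing is pending in the kernel: the data will be delivered only when the thread calls try_read again. That retry is the difference from an asynchronous read, where the request stays outstanding and the OS reports its completion.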
Overscheduling
Even in a network server implementation that has a small pool of threads that share the processing of new client connections and client requests, a syndrome known as overscheduling can adversely affect server performance. If a server thread takes a client request and actively uses a CPU to process the request, and the server starts processing another client request on the same CPU, both threads will compete for CPU time. This situation introduces overhead when the OS switches between the threads to give each access to the processor. The higher the number of threads that actively compete for CPU time, the worse the overhead problem becomes. The goal of a high-performance server application is therefore to have as few threads competing for the CPU as possible. To achieve this goal, the application requires OS support.
The OS must make it likely that only one thread will process a request at a time, and that when that thread finishes with the request, the OS will choose the same thread to process the next request. Such support prevents a situation in which the thread finishing a request goes back to waiting while the OS launches another thread to handle the next request. A network server application that has this support will almost never incur the overhead that the OS scheduler causes when it switches the CPU among multiple threads.
To achieve this support, the OS scheduler must keep track of which server threads are active and which threads are waiting for events. NT integrates this knowledge into its completion ports and uses completion ports as gateways for threads to use the CPU. (Figure 2 shows an example of a completion port.) If a server thread begins processing a request after receiving notification from a completion port, the scheduler will not notify any other threads waiting on that completion port for client requests until the processing thread voluntarily gives up the CPU, usually by blocking on I/O. If the active thread finishes its processing without giving up the CPU and waits for another event at the completion port, the scheduler will immediately notify that thread of the next waiting request, and the thread will continue running. If a thread gives up the CPU while processing a request (for example, while it waits for some other event not associated with the completion port, such as transmitting a very large Web response), the scheduler will notify another thread waiting at the completion port of the next client request. This thread-throttling mechanism helps the server application minimize the number of actively scheduled threads and get the most out of a CPU.
Without thread-throttling support in the Linux scheduler, Linux server applications must rely on less-precise methods in their application code to try to achieve the same goal. However, these Linux applications will not realize performance benefits to the extent that NT applications do with NT scheduler support.
Kernel Reentrancy
The Linux community heralded Linux 2.2's recent release as Linux's coming of age in the multiprocessing world. One big reason for this jubilation is that in Linux 2.2, parts of the kernel are reentrant. A reentrant kernel function is a function that can simultaneously execute on multiple CPUs in a multiprocessor. If one CPU is executing a non-reentrant function, another CPU wanting to execute the same function must wait until the first CPU is finished. This effect is known as serialization, because the two CPUs' execution of the function is sequential if viewed on a timeline, as Figure 3, page 98, shows. Serialized execution defeats the advantages of multiprocessor execution, because the non-reentrant functions execute as if they were on a uniprocessor.
Linux 2.2 is more reentrant than previous versions of Linux were. However, several major Linux functions are still not reentrant. These functions include read and write, the two most common functions network server applications use. A Linux server application will read client requests from a communications endpoint, read data from a file (such as a Web file or email database) to respond to the requests, and write the file to the client via the communications endpoint. Even if the data the client requests is in a memory cache and a read from a file is not necessary, the write paths still serialize.
Figure 3 demonstrates the difference between the execution of a non-reentrant function and that of a reentrant function. In the top half of the figure, the kernel spends time waiting for both CPUs to execute the non-reentrant function; in the bottom half of the figure, the kernel doesn't spend time waiting for the reentrant function. The OS in the bottom half of the figure finishes executing the same code sooner than the OS in the top half of the figure does.
Many members of the Linux development community believe this kernel-waiting-time difference to be an insignificant performance problem. This belief comes about almost certainly because no performance studies of Linux 2.2 on enterprise workloads have yet taken place. To glimpse how serious a problem kernel waiting time will be for Linux, look at recent developments in NT. NT's write function for network I/O was reentrant except for the part of the function that NIC drivers (network driver interface specification, or NDIS, drivers) handled, in which they transferred data to their network hardware. Leaving this small part of the entire write function non-reentrant was enough to prevent NT from competing effectively with Sun's Solaris OS on enterprise applications executing on 4-way multiprocessors. To remedy the situation, Microsoft let NDIS drivers have deserialized, or reentrant, write paths in NT 4.0 Service Pack 4 (SP4) and Windows 2000 (Win2K). In Linux 2.2, several functions, in addition to read and write, are still not reentrant, and each time a Linux server application uses such a function, the function's non-reentrancy hampers multiprocessor scalability.
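A rough way to see why even a small serialized code path matters is Amdahl's law, a standard model that does not come from the measurements discussed here. Assume, purely for illustration, that a fraction s of each request's processing time is spent in non-reentrant kernel code and that the remaining work spreads perfectly across N CPUs. The best possible speedup S(N) over a uniprocessor is then:

```latex
S(N) = \frac{1}{s + \dfrac{1 - s}{N}}, \qquad
S(4)\big|_{s = 0.10} = \frac{1}{0.10 + 0.225} \approx 3.08, \qquad
S(4)\big|_{s = 0.25} = \frac{1}{0.25 + 0.1875} \approx 2.29
```

The values of s are hypothetical, not measured. Under this model, a server that spends a quarter of its time in serialized paths gets little more than two CPUs' worth of work out of a 4-way machine, and no number of additional CPUs can push the speedup past 1/s.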
Sendfile
A final area in which Linux is at a disadvantage is its implementation of the sendfile API. Sendfile is an API that Microsoft introduced to NT several years ago as a feature that network server applications can use to enhance their performance. Obvious candidates for the API are Web servers. Without sendfile, a Web server that receives a request from a browser for an HTML file must first read the contents of the file into its memory and then send the contents to the client via a communications endpoint. The process of reading the file into private application memory is wasteful, because the server application doesn't want the contents of the file; it merely wants to send the contents to the client.
Sendfile eliminates the necessity that a server application read a file before sending it. With sendfile, the server application specifies the file to send and the communications endpoint in the sendfile API, and the OS reads and sends the file. Thus, the server doesn't have to issue a read API or dedicate memory for the file contents, and the OS can use its file system cache to efficiently cache files that clients request. Soon after Microsoft implemented sendfile in NT, UNIX vendors implemented sendfile in their OSs.
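The difference between the two approaches can be sketched in C. This is a minimal sketch, assuming the sendfile system call as it exists on current Linux systems (declared in sys/sendfile.h), which may differ from the Linux 2.2 implementation; the function names copy_with_read_write, copy_with_sendfile, and serve_file are hypothetical, and interrupted-call and partial-failure handling are kept to a minimum.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Conventional path: every byte passes through a buffer in application memory. */
ssize_t copy_with_read_write(int out_fd, int in_fd)
{
    char buf[4096];
    ssize_t total = 0, n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                      /* handle short writes */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}

/* sendfile path: the application names the two descriptors and the kernel
 * moves the data; no application buffer is involved. */
ssize_t copy_with_sendfile(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) < 0)
        return -1;
    while (offset < st.st_size) {
        /* sendfile advances offset by the number of bytes it sent. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;                             /* file shrank underneath us */
    }
    return (ssize_t)offset;
}

/* Open a file by path and send it to out_fd with sendfile. */
ssize_t serve_file(int out_fd, const char *path)
{
    assert(path != NULL);
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;
    ssize_t sent = copy_with_sendfile(out_fd, in_fd);
    close(in_fd);
    return sent;
}
```

A Web server would call serve_file with the client's connected socket as out_fd. The read/write version is shown only for contrast: it needs the 4KB buffer and two system calls per chunk, whereas the sendfile version needs neither.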
Linux has a sendfile implementation, but the Linux sendfile has several problems that developers must fix before Web servers (and other applications that can use the sendfile API) on Linux can achieve the same benefits that UNIX variants and NT obtain from their sendfile implementations. On NT, sendfile doesn't incur a copy operation if the file being sent is in the NT file system cache. In other words, the network software can send the data directly from the cache. But on Linux, sendfile copies the file data into buffers that sendfile hands to the networking code. This extra copy operation consumes CPU time and creates a larger memory footprint for the server application, both of which adversely affect performance. Another problem with Linux's sendfile is that it is non-reentrant. Thus, using Linux's sendfile leads to the serialized execution of part of a server application, which inhibits the application's ability to scale on a multiprocessor. A final problem with Linux's sendfile is that it doesn't let the system prepend a data buffer to the front of the file sendfile is sending. In the case of Web servers, this limitation necessitates another system call to send an HTTP header before the server can send the file.
Linux Isn't There Yet
The limitations I've cited are most of the major shortcomings in Linux's support for enterprise applications. However, other limitations might lurk beneath those I've pointed out. Despite the Linux community's claims to the contrary, Linux 2.2 is not ready for the enterprise or for multiprocessors. Linux is not engineered with enterprise computing in mind, nor has the OS been present in enterprise environments where administrators, programmers, and users can notice its limitations. Consequently, Linux's kernel threads are ineffective at supporting enterprise applications, and the Linux kernel is unable to scale applications on multiprocessors as well as other OSs can. Certainly over the next year or two, as Linux's momentum pushes it into the enterprise, Linux will face its shortcomings. When that happens, and Linux's developers address the OS's problems, UNIX variants and NT will feel a compelling threat to their enterprise dominance from this open-source OS.
Perhaps if Mark had written the article for just the Linux kernel mailing list, he would have found a less zealous and more technically capable audience already familiar with the topics he discussed. Furthermore, any weaknesses he uncovered would be fixed immediately.
Seeing all the interest in Linux, why don’t you start another Linux magazine or at least add a regular column to Windows NT Magazine? In the real world, a lot of companies have both NT and Linux (or some other flavor of UNIX), so I imagine interoperability would be of great interest to your readers.
--V.C. in Alameda, CA, August 09, 1999