Ask a company that sells NT Server systems just how reliable its systems are, and you'll probably hear a sales pitch about redundant fans, RAID, hot-swappable components, and other reliable design features. You might even hear mention of the number of nines of availability or a recounting of the many satisfied customers with continuous uptime. But a nagging question remains: How well will a system operate with your particular applications and workloads? The most direct answer lies in explicit testing of your system with the OS, application, and fault-tolerant support that you'll be using. If you'll be running a Web server using Microsoft Internet Information Server (IIS) 3.0 on Windows NT 4.0 Service Pack 4 (SP4) with Microsoft Cluster Server (MSCS), you should test that configuration to determine whether your system is reliable.

People are increasingly using NT as a platform for server applications that demand high reliability and availability, and while it’s important to be able to accurately characterize the reliability of such server systems, few tools exist. Lucent-Bell Labs’ NT Dependability Test Suite (ntDTS) fault-injection package is one example of a tool that lets you test fault-tolerant software, including NT applications. Let's take a look at what makes a system reliable and how you go about testing your machine and applications without destroying them. Along the way, we'll introduce you to the ntDTS tool and present experiments that demonstrate the utility of the tool for testing and comparing the reliability of several well-known applications and fault-tolerance software.

Reliability Testing
How do you test reliability? Functional testing (i.e., does the application produce the right answer?) and performance testing (i.e., how fast is my application?) are straightforward: simply run the program and see how fast the correct answer appears. In contrast, reliability testing requires fault injection, that is, the insertion of faults, or at least the effects of faults, into the system you’re testing. You base such faults on models of hardware or software problems that might occur. An example of a hardware fault is an electrical short in a memory chip that causes a "stuck-at-0" memory bit. A software fault might be a coding mistake that inadvertently omits the initialization statement for a variable. In either case, the fault corrupts the system state, meaning that the OS or application starts to behave improperly and might eventually crash or hang the machine or application.

A reliable system contains measures to tolerate faults. To test the reliability of such a system, you run your application on the system in the presence of faults and examine the application’s ability to produce correct results despite them. However, a few complications exist. The faults of interest usually aren’t trivial to produce. Imagine trying to cause a short in a memory chip just to test its effect on your application: you would permanently damage the chip. In addition, you must coordinate fault injection with program execution, because it makes no sense to inject a fault after the program has finished producing its results.

Software-implemented fault injection, wherein a program injects fault effects that simulate a particular fault, greatly simplifies the process. Instead of causing a short in a memory chip, the software-implemented approach uses an injection program to zero out the corresponding memory bit. The fault effect is the same, but the chip hardware isn’t damaged. In addition, you can easily reconfigure the injection program to inject different fault effects into different memory locations. Another benefit is that you can synchronize the injection program with the application you’re testing so that the fault is injected at a particular point in the application’s execution.
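To make the stuck-at-0 example concrete, here is a minimal sketch in portable C of how an injection program might emulate that fault in software. The function name and its parameters are our own illustrative choices, not part of ntDTS, and the sketch operates on an ordinary buffer rather than on another process's memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulate a "stuck-at-0" fault by forcing one bit of a memory buffer to 0.
 * byte_index selects the byte; bit_index selects the bit within that byte
 * (bit 0 is the least significant bit). Hypothetical helper for illustration. */
void inject_stuck_at_0(uint8_t *mem, size_t len, size_t byte_index, unsigned bit_index)
{
    assert(mem != NULL && byte_index < len && bit_index < 8);
    mem[byte_index] &= (uint8_t)~(1u << bit_index);
}
```

Because the target byte and bit are ordinary arguments, retargeting the fault is a matter of changing two numbers, which is the reconfigurability the software approach offers over damaging real hardware.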

The ntDTS Tool
The ntDTS fault-injection tool is a free package from Lucent-Bell Labs that facilitates the fault-injection testing process for NT systems. (To download ntDTS, go to the Lucent-Bell Labs Web site.) Using the ntDTS tool, you can test the reliability of your critical applications. There are two main reasons to use ntDTS:

  • To improve the reliability of your applications. The tool will discover specific failure scenarios that you can repeat, so you can go back and improve your code to handle those scenarios. Note that you can make improvements to the application itself, the OS, or the fault-tolerance software.
  • To compare the reliability of different systems. If you’re deliberating the merits of two different fault-tolerance packages, you can use the tool to produce quantitative reliability results to help you decide.

ntDTS consists of two main parts: a set of programs and scripts that equip the test application for fault injection, and the actual fault-injection tool that runs the application, injects faults, and collects the results. Screen 1 shows the fault-injection tool’s GUI.

Lucent-Bell Labs developed ntDTS in Java, which increases the portability of the tool (the subsequent port to Linux required substantially less time). Java’s object-oriented nature lets extensible objects encapsulate the basic fault-injection tool functionality. Consequently, you can configure the tool for different applications with a minimal amount of additional Java code. In fact, in many cases, you don’t need to write any additional Java code because parameter files contain most of the necessary configuration information. The parameter files describe the types of desired faults and applications.

What results does the tool produce? To understand the output, you need to understand the fault-injection mechanism.

Fault Injection
ntDTS injects faults by corrupting system call parameters. Suppose we want to inject a fault into a call to the CreatePipe() function. Listing 1 shows the function prototype for CreatePipe().

Listing 1. The Function Prototype for CreatePipe()
BOOL CreatePipe(
PHANDLE hReadPipe,                        // pointer to read handle
PHANDLE hWritePipe,                       // pointer to write handle
LPSECURITY_ATTRIBUTES lpPipeAttributes,   // pointer to security attributes
DWORD nSize                               // pipe size
);

ntDTS corrupts one parameter at a time. For example, to corrupt the hReadPipe parameter, the tool does one of the following:

  • Sets hReadPipe to 0x0
  • Sets hReadPipe to 0xffffffff
  • Flips all the bits in hReadPipe (if the original value is 0x3, the corrupted value is 0xfffffffc)

ntDTS then calls the original function with the corrupted parameter. This type of fault injection causes fault effects such as corrupted kernel data or bad return values. For NT systems, functions in system DLLs invoke most system calls. The tool injects faults into the DLL functions, which then affect the associated system calls.
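The three corruption modes listed above are simple enough to sketch in a few lines of C. This is an illustration under our own assumptions, not ntDTS source code: the enum and function names are hypothetical, and we treat the parameter as a 32-bit value, as it would be on the 32-bit NT systems discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* The three corruption modes described in the text (names are hypothetical). */
typedef enum { FAULT_ZERO, FAULT_ALL_ONES, FAULT_FLIP_BITS } fault_mode;

/* Return the corrupted form of a 32-bit parameter value. */
uint32_t corrupt_param(uint32_t value, fault_mode mode)
{
    switch (mode) {
    case FAULT_ZERO:      return 0x0u;         /* set to 0x0 */
    case FAULT_ALL_ONES:  return 0xffffffffu;  /* set to 0xffffffff */
    case FAULT_FLIP_BITS: return ~value;       /* invert every bit */
    }
    assert(!"unknown fault mode");
    return value;
}
```

For example, corrupt_param(0x3, FAULT_FLIP_BITS) yields 0xfffffffc, matching the bit-flip case in the list.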

To prepare an application for fault injection, follow these steps:

  1. Create a wrapper DLL. The wrapper DLL contains masquerade functions for each function in the original DLL.
  2. Alter the target application Import Address Table (IAT) entries to point to the wrapper DLL. The IAT is the table in the application's executable image through which calls to imported DLL functions are resolved.
  3. Create configuration files to activate the wrapper DLL.

Figure 1 illustrates the roles of the wrapper DLL and the IAT in the fault-injection mechanism.
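The redirection idea behind the wrapper DLL and the IAT can be modeled in portable C with a table of function pointers. The sketch below is a toy model under our own assumptions: it doesn't parse a real executable image or touch a real IAT, and every name in it (import_entry, real_demo_call, wrapper_demo_call, redirect_import) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified signature shared by every function in this toy import table. */
typedef int (*api_fn)(int arg);

/* One entry of the simulated import table: a name and the address to call. */
typedef struct {
    const char *name;
    api_fn      target;
} import_entry;

/* Stand-in for a function in the original DLL. */
int real_demo_call(int arg) { return arg * 2; }

/* Address of the original function, saved when the entry is redirected. */
static api_fn saved_original = NULL;

/* Masquerade function: corrupts the argument (sets it to 0), then forwards
 * the call to the original function. */
int wrapper_demo_call(int arg)
{
    (void)arg;
    assert(saved_original != NULL);
    return saved_original(0);
}

/* Repoint the named table entry at the wrapper and remember the original.
 * Returns 1 on success, or 0 if the name isn't in the table. */
int redirect_import(import_entry *table, size_t count, const char *name, api_fn wrapper)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0) {
            saved_original = table[i].target;
            table[i].target = wrapper;
            return 1;
        }
    }
    return 0;
}
```

The application code never changes: it keeps calling through the table, and only the table entry decides whether the call reaches the original function directly or passes through the wrapper first.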

Figure 2 shows what happens during the fault injection. The rectangular boxes in the figure represent executable code. IIS is the IIS application code, which links at runtime to the wrapper DLL, kernel32.dll, and ntdll.dll. HttpClient is a small program that sends client HTTP requests to the IIS server. Kernel Mode Components include the NT executive, kernel, and device drivers. The Parameter File and Output File contain the contents for an actual fault injection. The following sequence describes the events for a sample fault injection (the step numbers correspond to the numbers in Figure 2):

  1. IIS starts by linking with the wrapper DLL and other DLLs. The linking causes the initialization routine in the wrapper DLL to read the fault injection Parameter File.
  2. The HttpClient program starts and sends a request to IIS.
  3. IIS processes the client HTTP request and calls the CreatePipe() function.
  4. The modified IIS IAT routes the call to the wrapper DLL function that corresponds to CreatePipe(). The wrapper DLL corrupts one of the CreatePipe() parameters.
  5. The wrapper DLL creates an Output File containing the details of the injected fault.
  6. The wrapper DLL calls the actual CreatePipe() function in kernel32.dll with the corrupted parameter.
  7. The CreatePipe() function eventually calls functions in ntdll.dll that access the resources in the Kernel Mode Components.
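Steps 1, 4, and 5 of this sequence can be sketched as a small helper that a wrapper function might call before forwarding a request. This is a guess at the general shape, not the ntDTS implementation: the fault_spec structure, the record format, and the choice to inject only on the first matching call are all our assumptions, and the sketch writes its record to a string buffer instead of an actual Output File.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one fault description from a parameter file. */
typedef struct {
    const char *function;  /* function to target, e.g. "CreatePipe" */
    int         param;     /* zero-based index of the parameter to corrupt */
    uint32_t    value;     /* value that replaces the parameter */
    int         fired;     /* set to 1 once the fault has been injected */
} fault_spec;

/* Corrupt args[spec->param] the first time the targeted function is called,
 * write a one-line record of the injection into log, and return 1.
 * Otherwise leave args and log untouched and return 0. */
int maybe_inject(fault_spec *spec, const char *called,
                 uint32_t *args, int nargs, char *log, size_t log_len)
{
    assert(spec != NULL && called != NULL && args != NULL && log != NULL);
    if (spec->fired || strcmp(spec->function, called) != 0 ||
        spec->param < 0 || spec->param >= nargs) {
        return 0;
    }
    uint32_t before = args[spec->param];
    args[spec->param] = spec->value;
    spec->fired = 1;
    snprintf(log, log_len, "%s param=%d old=0x%lx new=0x%lx",
             called, spec->param, (unsigned long)before, (unsigned long)spec->value);
    return 1;
}
```

Recording the original and corrupted values alongside the function name is what would make a failure scenario repeatable later.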

Results
Using ntDTS, we tested IIS 3.0, Apache Web server 1.3.3 for Win32, and Microsoft SQL Server 7.0. We executed these applications in three different configurations: as standalone NT services, with MSCS, and with NT-SwiFT's watchd component (NT-SwiFT is a fault-tolerance package from Lucent-Bell Labs). We conducted all experiments on the same machine, a 100MHz Pentium with 48MB of RAM running NT Server 4.0 with SP4.

For each application, a simple client program sends requests to the application. For the IIS and Apache Web servers, the HttpClient program sends two types of requests: an HTTP request for a 115KB static HTML file, and an HTTP request for a 1KB static HTML file via the Common Gateway Interface (CGI).

For SQL Server, the SqlClient program sends a SQL select request based on one table. Both HttpClient and SqlClient check the correctness of the server reply. If the reply is incorrect or if the client program doesn’t receive any response within a timeout period (the default is 15 seconds), ntDTS retries the request (and retries again, if necessary). After the client program receives a correct reply (or after three attempts fail), the client program reports the success or failure of the requests and the number of retries it attempted.
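The retry policy described above (a timeout per attempt, up to three attempts, then a report of success or failure and the retry count) might look like the following sketch. The callback type, the report structure, and the fail_n_times stub are hypothetical names we introduce for illustration; the real HttpClient and SqlClient programs would send actual network requests.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { REPLY_CORRECT, REPLY_INCORRECT, REPLY_TIMEOUT } reply_status;

/* Hypothetical callback: sends one request, waits up to timeout_seconds,
 * and classifies the server's reply. */
typedef reply_status (*send_fn)(void *ctx, int timeout_seconds);

typedef struct {
    int success;  /* 1 if a correct reply was received, 0 otherwise */
    int retries;  /* number of attempts made after the first one */
} client_report;

/* Try the request up to max_attempts times and report the outcome. */
client_report request_with_retries(send_fn send, void *ctx,
                                   int timeout_seconds, int max_attempts)
{
    assert(send != NULL && max_attempts > 0);
    client_report report = { 0, max_attempts - 1 };
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (send(ctx, timeout_seconds) == REPLY_CORRECT) {
            report.success = 1;
            report.retries = attempt - 1;
            break;
        }
    }
    return report;
}

/* Demo stub: times out for the first *ctx calls, then replies correctly. */
reply_status fail_n_times(void *ctx, int timeout_seconds)
{
    int *remaining = ctx;
    (void)timeout_seconds;
    if (*remaining > 0) {
        (*remaining)--;
        return REPLY_TIMEOUT;
    }
    return REPLY_CORRECT;
}
```

With the defaults from the text, a call would pass a timeout of 15 and a max_attempts of 3, so a request that never succeeds is reported as a failure after two retries.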

The following results show how ntDTS is useful. First, the tool enables quantitative comparisons of different fault-tolerance packages. Second, the tool explicitly identifies situations that the fault-tolerance software doesn't handle adequately. Thus, the tool can improve the failure coverage of the fault-tolerance software and the reliability of the entire system.

Figure 3 shows results for experiments with IIS as a standalone service, with MSCS, and with watchd. We also performed experiments with Apache and SQL Server, but the IIS results are representative of the other results. Table 1 describes the five possible outcomes. Each outcome is expressed as a percentage of the total number of activated faults for that particular application (a fault is activated when the application calls the corresponding function).
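The percentage calculation is simple arithmetic, sketched below. We assume that every activated fault ends in exactly one outcome, so the total number of activated faults equals the sum of the per-outcome counts; the function name is hypothetical and the sketch contains no measured data.

```c
#include <assert.h>
#include <stddef.h>

/* Convert per-outcome counts into percentages of all activated faults.
 * counts[i] is the number of activated faults that ended in outcome i;
 * percent[i] receives that outcome's share (all zeros if nothing was activated). */
void outcome_percentages(const unsigned *counts, size_t n, double *percent)
{
    assert(counts != NULL && percent != NULL);
    unsigned activated = 0;
    for (size_t i = 0; i < n; i++) {
        activated += counts[i];
    }
    for (size_t i = 0; i < n; i++) {
        percent[i] = activated ? 100.0 * counts[i] / activated : 0.0;
    }
}
```

Faults whose target function is never called don't appear in any count, so they don't dilute the percentages.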

Perhaps the most important and obvious conclusion we can draw from Figure 3 is that both MSCS and watchd are effective in increasing the reliability of IIS. The fault injections that resulted in complete failure outcomes (i.e., cases where the application couldn't produce the correct response even after repeated client request retries) decreased markedly for all applications when we added MSCS or watchd.

MSCS and watchd can reduce the number of Complete Failure outcomes because they detect when the monitored application malfunctions and then restart it. The number of normal success outcomes remains essentially the same for each application. When we added MSCS or watchd, some Complete Failure outcomes became Application Restart with Success outcomes.

Figure 3 also shows that while both MSCS and watchd decrease the number of Complete Failure outcomes, watchd does a much better job for the fault set we used. In fairness to MSCS, we used only the default failure-detection mechanisms in our experiments. You can configure MSCS with custom failure-detection mechanisms that interact with and monitor all aspects of IIS and SQL Server, but creating such detection mechanisms requires additional knowledge and effort.

Another important use of ntDTS is for identifying reliability problems. The ntDTS results list the specific faults that caused the application to fail. The failures might result from problems in the application, the OS, or the fault-tolerance software. By replaying the fault scenarios that led to failures, you can isolate and correct the specific deficiencies. For example, we pinpointed small timing problems in watchd that related to the startup of services. Fixes to watchd resulted in a version with improved results for IIS. We repeated the process and produced a third version with improved results for all three applications. After each round of improvements to watchd, we used ntDTS to determine the actual reliability improvement. We stopped after the third version because our goal was to show how ntDTS helped us improve watchd. Figure 4 shows the effect of the watchd improvements for Apache, IIS, and SQL Server. Watchd1 is the initial watchd version, Watchd2 is the intermediate version, and Watchd3 is the final version. We focused on watchd improvements because we had access to the watchd source code. Similar improvements are possible with MSCS or the applications themselves.

What Lies Ahead
Reliability testing with fault injection is becoming more popular. ntDTS is one tool that, when used properly, can help you make your applications and systems more reliable.