Real-world implementation and high-availability design guidelines

Today, systems administrators are facing the challenge of making Windows 2000 available more than 99.9 percent of the time. To address this challenge, Microsoft has partnered with several top-tier OEMs to deliver and support Win2K Datacenter Server. The result of this collaboration is the Windows Datacenter Program, which provides customers a list of certified configurations that Microsoft has thoroughly tested for reliability. Hewlett-Packard (HP), an OEM involved in the Windows Datacenter Program, has been working through the challenges and pitfalls of Datacenter implementations. Learning from their experiences, HP engineers and consultants have developed a valuable list of best practices to share with Datacenter customers around the world. With these best practices in mind, you can more easily decide whether Datacenter makes sense for you and see what you must do to create your own high-availability infrastructure.

For more information about the Windows Datacenter Program, see Greg Todd, "Win2K Datacenter Server," December 2000, and the Microsoft article "The Datacenter Program and Windows 2000 Datacenter Server Product" (http://support.microsoft.com/support/kb/articles/q265/1/73.asp). You can also visit Microsoft's Datacenter Web page at http://www.microsoft.com/windows2000/datacenter.

High Availability 101
Does your environment need a high-availability solution? To determine which high-availability technologies are relevant to your environment, you need to understand your availability requirements. Only then can you begin to design an infrastructure that meets your needs.

You also need to understand the difference between fault resilience and fault tolerance. Fault-resilient systems consist of clusters that achieve high availability through failover. Microsoft Cluster service is a clustering solution that makes Datacenter and Win2K Advanced Server fault-resilient. Cluster nodes have independent system images, and failover can take from a few seconds to several minutes. (A system image, which completely describes the point-in-time status of a particular system, is unique to each computer system and changes rapidly. This image includes such information as memory, CPU registers, disk and memory buffers, and message queues.)

Applications on fault-resilient systems use checkpoint files to recover application data. A checkpoint file is a log file, such as a database transaction log, that lets an application recover its state—the processing stage of the application at a certain point in time—after a power failure or hardware failure. Following a failure, the application first looks at checkpoint files stored on the disk to either roll forward or roll back transactions that were incomplete at the time of failure. Fault-resilient systems recover only to the most recent checkpoint. Information not saved to some form of checkpoint file (i.e., residing only in memory) will be lost on failover.
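
To make the roll-forward and roll-back idea concrete, here is a minimal sketch of checkpoint-style recovery. It is an illustration only: the LogRecord type, the record kinds, and the recover function are hypothetical names, and the sketch treats roll back as simply ignoring writes from transactions that never committed. Real database logs are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One entry in a hypothetical checkpoint (transaction) log."""
    txn_id: int
    kind: str        # "begin", "write", or "commit"
    key: str = ""
    value: int = 0


def recover(log):
    """Rebuild application state from the log after a failure.

    Writes that belong to committed transactions are rolled forward.
    Writes from transactions that were incomplete at the time of the
    failure are skipped, which has the effect of rolling them back.
    """
    committed = {r.txn_id for r in log if r.kind == "commit"}
    state = {}
    for r in log:
        if r.kind == "write" and r.txn_id in committed:
            state[r.key] = r.value
    return state


log = [
    LogRecord(1, "begin"),
    LogRecord(1, "write", "balance", 100),
    LogRecord(1, "commit"),
    LogRecord(2, "begin"),
    LogRecord(2, "write", "balance", 40),
    # The node fails here, before transaction 2 commits.
]

print(recover(log))  # {'balance': 100}
```

Transaction 2 existed only partly on disk when the node failed, so recovery returns the state as of the last committed transaction.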

Fault-tolerant systems, which have tighter coupling of resources, keep applications available by protecting one system image. When constituent components fail, redundant components take over so that the system image runs uninterrupted. Applications that run on a fault-tolerant system don't require checkpoint files; they simply depend on the underlying fault-tolerant platform to keep the system running. Proprietary and highly customized hardware and software characterize fault-tolerant systems, so they are typically more expensive than their fault-resilient counterparts. Most high-availability computing therefore uses fault-resilient systems, which don't require the same level of expensive custom hardware or software. However, fault-tolerant systems more commonly achieve 99.999 percent planned availability.

In terms of high availability, a key difference between fault-tolerant and fault-resilient systems is recovery time. Fault-tolerant systems boast recovery times that approach zero. Fault-resilient systems (i.e., Cluster service clusters) have recovery times that range from a few seconds to several minutes because of the time necessary for failover.

By the Numbers
Availability is the ratio of the amount of time that a system is available to the amount of time the system should be available. Industry convention is to express availability as a percentage. The mythical perfect system would be available 100 percent of the time. Real systems, of course, post lower percentages.

You can use the simple calculation

A = MTBF/(MTBF+MTTR)

where A is availability, MTBF is mean time between failures, and MTTR is mean time to repair (or recover), to find a system's availability. "Three nines" conveys that availability is 99.9 percent, "four nines" conveys that availability is 99.99 percent, and so on. If you use 20 minutes as the MTTR value (Microsoft claims 20 minutes is the average time necessary to restore a Win2K or Windows NT system) and .999 as the A value, you get an MTBF value of approximately 14 days. (Not coincidentally, 14 days is the duration of the Microsoft stress test for Datacenter hardware and kernel-mode drivers.) The primary high-availability design goal is to increase A by increasing MTBF and decreasing MTTR.
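
The arithmetic behind the 14-day figure can be checked by solving the formula for MTBF, which gives MTBF = A × MTTR / (1 − A). The sketch below does that; the function names are my own, and all times are in minutes.

```python
def availability(mtbf_minutes, mttr_minutes):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


def mtbf_for(target_availability, mttr_minutes):
    """Solve A = MTBF / (MTBF + MTTR) for MTBF."""
    return target_availability * mttr_minutes / (1.0 - target_availability)


# Three nines with a 20-minute mean time to repair.
mtbf = mtbf_for(0.999, 20)
mtbf_days = mtbf / (60 * 24)
print(round(mtbf_days, 1))  # 13.9
```

A system that takes 20 minutes to restore must therefore run for roughly two weeks between failures to reach three nines. Halving MTTR to 10 minutes halves the required MTBF, which is why the design goal targets both numbers.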

Table 1 gives an overview of availability in terms of nines. The table's downtime numbers are measurements of unplanned downtime. (In today's world of high availability, techniques such as online backup and rolling upgrades for system maintenance or hardware updates keep planned downtime close to zero.) Do you need three or more nines? Costs can increase 10-fold for each nine that you add. Take a close look at your business. What does downtime cost you? To justify a high-availability solution, you need to start by calculating the cost of an unavailable system. Table 2 shows sample downtime costs per hour from various industries. Table 3 shows causes of downtime as evenly divided among planned outages, software, and physical factors (i.e., people, hardware, and environment).
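
If you want to reproduce this kind of downtime figure for your own case, the following sketch converts a number of nines into unplanned downtime per year and multiplies it by an hourly cost. It assumes a 365-day year, and the $50,000 hourly cost is a made-up placeholder, not a figure from Table 2.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def annual_downtime_minutes(nines):
    """Unplanned downtime per year for an availability of N nines."""
    unavailability = 10 ** (-nines)   # 3 nines -> 0.001
    return unavailability * MINUTES_PER_YEAR


def annual_downtime_cost(nines, cost_per_hour):
    """Yearly cost of downtime at a given hourly cost."""
    return annual_downtime_minutes(nines) / 60 * cost_per_hour


HOURLY_COST = 50_000  # hypothetical value; substitute your own

for n in range(2, 6):
    minutes = annual_downtime_minutes(n)
    cost = annual_downtime_cost(n, HOURLY_COST)
    print(f"{n} nines: {minutes:8.1f} min/year, about ${cost:,.0f}/year")
```

Comparing the yearly cost at your current availability with the cost at one more nine gives you a ceiling on what that extra nine is worth paying for.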

Glancing at this data, you can easily understand the importance of people and processes to achieving high availability. In a recent white paper, "Increasing System Reliability and Availability with Windows 2000," Microsoft refers to industry studies showing that 80 percent of system failures are the result of human error or flawed processes.

Always-On Design Guidelines
High availability isn't a product that you can buy—it's a goal. High availability is the result of carefully blended technologies, support services, and human processes. More important, high availability is about attention to detail and the discipline to manage every aspect of your environment. (For an overview of the high-availability goal, see David Chernicoff, "Components of a High-Availability System," November 2000.) At its simplest, high availability means increasing the time between failures and decreasing the time to recovery. Using the following basic guidelines, you can begin to take specific steps to improve your system availability:

  • Use redundant components and automatic failover to eliminate single points of failure. Each redundant component should also be highly reliable. Possible single points of failure are sometimes obvious. When you're designing a high-availability system, the reliability of the power supply, the connection to the Internet, the disk drives, and the network components should be your primary concerns. Sometimes, however, you find potential single points of failure in unexpected places. For example, microsoft.com, msn.com, expedia.co.uk, and msnbc.com recently were unavailable for long periods of time. The rumor is that the root cause was a configuration change to a router at the edge of Microsoft's DNS network. Even the best sites can suffer from unexpected single points of failure.
  • Comprehensively test new hardware and software. Maintain separate environments for production, development, and testing. You should document, justify, and fully test all proposed changes to the production environment.
  • Use reliable components. Even when you configure a system as a redundant cluster node, be sure to configure the system to minimize the likelihood of failure. Consider implementing duplexed RAID 1 mirroring (i.e., mirroring in which each disk has its own controller) on the boot partition. That way, if a controller fails, both halves of the mirror don't go down. Use redundant, hot-swappable power supplies, hot-swappable SCSI disks, error-correcting memory, and redundant fans.
  • Reduce chances for human error. Minimize unstructured human contact with the system. Create processes that minimize the chance of failure and the time to recover. Use scripts to automate routine tasks. Use systems management tools to identify trends, conduct root-cause analysis, and trigger automatic responses to error conditions and events. Record the entire production infrastructure in a configuration management database (CMDB)—a centralized and comprehensive record of configurable items (CIs) in your IT infrastructure.
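
The first guideline can be quantified with two standard reliability formulas. This is a rough sketch that assumes component failures are independent, an assumption that real systems often violate (a shared power feed or a bad configuration change can take out both copies at once). The function names and the sample availabilities are illustrative.

```python
def series_availability(availabilities):
    """Availability of a chain in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(component_availability, copies):
    """Availability of redundant copies when any one copy is enough.

    Assumes the copies fail independently of one another.
    """
    return 1.0 - (1.0 - component_availability) ** copies


# Power, router, and disk, each available 99.9 percent of the time,
# with no redundancy: the chain is weaker than any single link.
chain = series_availability([0.999, 0.999, 0.999])

# Two redundant power supplies, each available 99 percent of the time.
redundant_power = parallel_availability(0.99, 2)

print(f"{chain:.4f} {redundant_power:.4f}")
```

Every single point of failure multiplies into the chain, whereas each redundant copy shrinks only that component's share of the downtime.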

The primary goal of the Windows Datacenter Program is to use Win2K to host mission-critical applications that must be highly available and highly scalable. Remember that high availability isn't a Datacenter feature—that line of thinking is similar to believing that a great golf score is a feature of great golf clubs. High availability is a result of disciplined processes, only some of which are part of the Windows Datacenter Program.

Datacenter or Win2K AS?
In terms of OS stability, Win2K is a tremendous step beyond NT 4.0. One question you need to answer is whether you need Datacenter or Win2K AS. Table 4 shows obvious surface differences between the OSs. (For more information, see Greg Todd, "Windows 2000 Datacenter Server," December 2000, and "Microsoft Clustering Solutions," November 2000.)

Is Datacenter the answer for your application? Win2K AS provides many excellent and affordable improvements, including 2-node Cluster service clusters, 32-node Network Load Balancing (NLB) clusters, 16-node Component Load Balancing (CLB) clusters, and reliable restart in Microsoft Internet Information Services (IIS) 5.0. If you combine Win2K AS's features with disciplined backup, recovery, and change management, you can achieve two or even three nines of availability.

Reasons for choosing Datacenter over Win2K AS include the following:

  • A memory requirement of greater than 8GB. Large databases and applications that you write specifically to take advantage of a large amount of memory (using Address Windowing Extensions—AWE) are good candidates for Datacenter.
  • A stateful application that requires the extra availability that 3-node or 4-node clustering provides (compared with the 2-node clustering that Win2K AS provides). Cluster service clustering will enhance the availability of stateful applications. (For information about stateless versus stateful applications, see the sidebar "Scaling Up vs. Scaling Out.")
  • A network-performance requirement for Winsock Direct (WSD). Using WSD, some vendors have boasted gigabit speeds. (These claims are fairly controversial, considering Microsoft's recent removal of Giganet cards from the Datacenter Hardware Compatibility List, or HCL.)
  • The need for a GUI-based process-control tool. A process-control tool lets network administrators easily enforce the terms of service level agreements (SLAs) with their customers. For example, an administrator can limit the amount of memory, number of processors, or amount of time available to any group of processes. Datacenter has such a tool; Win2K AS doesn't.
  • The desire to access the Joint Support Queue's technical support resources, which are available only with the Windows Datacenter Program. The Windows Datacenter Program also provides change-management functions in the form of stress testing, signing, and certifying kernel-mode drivers; certifying hardware changes; and certifying application software.
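
The list above can be turned into a simple checklist. The following sketch is only a convenience for working through the criteria; the Requirements fields and the reasons_for_datacenter function are hypothetical names, and the thresholds are the ones stated in the list.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of one application's needs."""
    memory_gb: int
    cluster_nodes: int
    needs_winsock_direct: bool = False
    needs_process_control: bool = False
    needs_joint_support_queue: bool = False


def reasons_for_datacenter(req):
    """Return the criteria from the list that this application meets."""
    reasons = []
    if req.memory_gb > 8:
        reasons.append("needs more than 8GB of memory")
    if req.cluster_nodes > 2:
        reasons.append("needs 3-node or 4-node Cluster service clustering")
    if req.needs_winsock_direct:
        reasons.append("needs Winsock Direct")
    if req.needs_process_control:
        reasons.append("needs the GUI-based process-control tool")
    if req.needs_joint_support_queue:
        reasons.append("needs Joint Support Queue access")
    return reasons


web_tier = Requirements(memory_gb=4, cluster_nodes=2)
database = Requirements(memory_gb=32, cluster_nodes=4)

print(reasons_for_datacenter(web_tier))   # []
print(reasons_for_datacenter(database))
```

An empty list suggests that Win2K AS is worth evaluating first; any entry is a reason to look more closely at Datacenter.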

Evaluate Applications
Now that you've worked through the costs associated with your downtime and reviewed the general guidelines for keeping systems available, the next step is to determine which of your applications can benefit from a Datacenter implementation. This assessment will depend on your current application architecture. Today, the applications that work most smoothly with Datacenter are n-tiered, COM-based, line of business (LOB) applications that store and retrieve data from a back-end data store (e.g., Microsoft SQL Server 2000).

At a basic level, an application has three functions:

  • It interacts with the user to collect input and display output (i.e., a UI).
  • It applies some transformation to collected input (i.e., business rules or logic).
  • It stores information for later use or retrieval (i.e., data storage).

Many applications, such as Microsoft Word, perform all three functions on a single computer. To be scalable, however, an application must be able to perform more work than one computer can manage.

In the client/server model, for example, components of an application cooperate while running from two or more computers. One part of the application handles user interaction and local processing while another part performs back-end processing. An example of the client/server model is the interaction of Microsoft Office and Microsoft Exchange Server, which work together to provide group collaboration. Multi-tiered (i.e., n-tiered) applications spread tasks around to different computers to increase efficiency.

Microsoft has provided a standard—COM—for creating distributable parts (i.e., components) of an application. Developers use COM to create reusable application components that can cooperate in a distributed computing environment. Microsoft's .NET Server architecture defines n-tier applications as applications that you can divide into specific functions that you can implement as components. The first tier, on which user interaction occurs, is called User Services or Presentation Services. The middle tier—called Business Logic—takes input from User Services and queries the back-end tier, which is called Data Store.

N-tiered applications leverage Microsoft's development, scalability, and high-availability technologies. For now, SQL Server 2000 is the only application to take full advantage of Datacenter's large memory space and 4-node clustering. Soon, however, you might be able to add Oracle and Microsoft Exchange 2000 Server (post-Service Pack 1, or SP1) to the Datacenter list of one. Further testing will determine whether Oracle and Exchange 2000 will be good applications for Datacenter.

Microsoft provides clustering technologies that are appropriate for each of the three application tiers. (For more information about Microsoft clustering technologies, see Greg Todd, "Microsoft Clustering Solutions," November 2000.) For an application's Presentation Services tier, use IIS 5.0 on an NLB cluster. (Be sure to implement IIS's reliable restart enhancement, which lets the Web service automatically attempt a restart after failure.) NLB, which Microsoft introduced in Win2K AS, supports as many as 32 nodes per cluster. NLB provides stateless clustering with stateful connections to client browsers.

Alternatively, you can use Cisco Systems' LocalDirector (CLD) hardware solution to load-balance IP traffic. You might prefer CLD because of its support for network switches. (NLB requires hubs in conjunction with switches because of NLB-induced switch flooding.) Although many claim that using IIS on Datacenter with Cluster service is a reasonable first-tier solution, I recommend Win2K AS running IIS and NLB clusters—a solution that provides scalability, availability, and control at a relatively low cost.

For the Business Logic tier, use a CLB cluster, which can run on any Win2K server OS. CLB lets you load-balance COM components across multiple computers and requires Microsoft Application Center 2000, which provides management tools for CLB and NLB clusters in a highly distributed environment. CLB also dynamically load-balances COM+ components, which are an enhancement to COM components. CLB clustering is good for stateless components and supports as many as 16 nodes. Although CLB runs on Win2K Server, Win2K AS, and Datacenter, your OS decision will depend largely on the nature and scaling requirements of your components.

If your application is stateful, you need the Datacenter implementation of Cluster service to scale out beyond two nodes. (Cluster service is the clustering technology of choice for stateful applications. NLB and CLB are best suited for stateless workloads: IIS and Win2K Terminal Services in the case of NLB, and COM+ components in the case of CLB.) You also need Datacenter to scale up above the eight-processor SMP and 8GB limits of Win2K AS.

Even for an application that doesn't take advantage of Datacenter's large memory space, you might see substantial benefits if you run the application on a Datacenter system with lots of memory. The large system cache and reduced physical paging could have major performance benefits.

Finally, you might experience performance benefits if you use Datacenter's WSD for communications with the Data Store tier. (WSD lets two computers communicate over an extremely high-speed network link.)

For the final tier of a standard n-tier architecture, the Data Store tier, I recommend Datacenter. The large memory requirements of terabyte databases and SQL Server 2000's Physical Address Extension (PAE) support benefit from the scale-up that Datacenter's 64GB of RAM permits. Exchange 2000 Server SP1 will likely find support on Datacenter, but it doesn't yet take advantage of the large memory space.

Applications that don't use COM components can still benefit from Datacenter. To take best advantage of 4-node clustering, you should modify these applications to use the Cluster service Cluster API. (The Cluster API features several functions that let applications respond to cluster messages and report their status back to Cluster service.) Best use of Datacenter's large amount of memory requires that you add AWE to your applications. (AWE is a Microsoft API that lets developers take full advantage of the 64GB of RAM that Datacenter supports.) Unmodified applications can still benefit from Cluster service and PAE, just not as much.

For all applications that use Cluster service, you need to enable the use of checkpoint files so that application data recovery occurs. As I mentioned earlier, checkpoint files are important for recovering the transactions that are in memory when a cluster node fails.

The decision to modify a large database application to use AWE or to be cluster-aware (i.e., incorporate the Cluster API) might need substantial justification. You should initiate any change to a mission-critical application with a Request for Change (RFC) and go through the same justification process you would with any other RFC.

For applications that use COM+, development might be easier: COM+ components are typically small and easy to modify. However, stateful COM+ components don't perform nearly as well as their stateless counterparts, and storing and recovering component-state information limits the reusability of COM+ components. You can modify stateful COM+ components to become stateless or to store state information in the Data Store tier. For example, in Visual Basic (VB) applications, a Property Bag object holds state information. You can rewrite the VB code to avoid the use of the property bag and instead use ActiveX Data Objects (ADO) to store state information in a different machine's database table.
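
The same pattern is easy to see outside VB. The sketch below is a language-neutral illustration in Python, not COM+ or ADO code: an in-memory SQLite database stands in for the Data Store tier, and the table and function names are hypothetical. The point is that each call receives everything it needs as arguments and keeps nothing in the component between calls.

```python
import sqlite3


def open_store():
    """Stand-in for the Data Store tier (in-memory for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cart (session_id TEXT, item TEXT)")
    return conn


def add_item(conn, session_id, item):
    """Stateless call: state goes straight to the data store."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO cart (session_id, item) VALUES (?, ?)",
            (session_id, item),
        )


def list_items(conn, session_id):
    """Any node can answer, because no node holds the state."""
    rows = conn.execute(
        "SELECT item FROM cart WHERE session_id = ? ORDER BY rowid",
        (session_id,),
    )
    return [row[0] for row in rows]


store = open_store()
add_item(store, "session-1", "widget")
add_item(store, "session-1", "gadget")
print(list_items(store, "session-1"))  # ['widget', 'gadget']
```

Because the component holds nothing between calls, a load balancer can send each request to a different node without losing the user's data.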

Select Your Administration Tools
Although systems administration tools aren't a component of the Windows Datacenter Program, they're essential to a high-availability architecture. These tools will help you manage all your infrastructure's components. Such tools typically use SNMP and agents to monitor the condition of your site, trap errors, generate alerts, carry out preprogrammed responses to specific conditions, identify dependencies between components, and perform root-cause analysis of dangerous trends. HP's OpenView and Computer Associates' (CA's) Unicenter TNG are popular examples of systems administration tools.

Large servers are often shipped with preinstalled management utilities (e.g., HP's Toptools, Compaq's Compaq Insight Manager) that let you perform a detailed investigation of your hardware while remaining fully online. On a remote server, remote control cards can cycle power, monitor a boot sequence, and provide remote keyboard/video/mouse (KVM) capabilities.

Organize Support
After you design your high-availability infrastructure, you need to align internal and third-party support with your availability requirements. One function of the Windows Datacenter Program is to formalize such support. To do so, Microsoft, the OEM partners, and certified application developers perform extensive testing and change management. OEMs must offer SLAs that outline time-to-repair commitments. (Depending on the contract, OEMs can be available to answer support questions within 30 minutes and can be on site within 6 hours.) You must establish and enforce SLAs with your external and internal support teams.

The Joint Support Queue, staffed by both Microsoft and OEM personnel, provides a well-defined support-escalation path. First, the customer calls first-tier OEM support. The first tier can call the second tier, which can escalate to the Joint Support Queue. The Joint Support Queue determines whether the problem is related to hardware, the application, or the OS. If the problem is hardware-related, the call goes to the OEM's hardware support team. If the problem is application-related, the Joint Support Queue contacts the certified application developer's Help desk. If the problem is OS-related, the call goes to Microsoft Critical Problem Resolution teams, then to Microsoft Quick Fix Engineering. Of course, the problem might be resolved at any point along this path. Some OEMs (e.g., HP, IBM, Compaq) offer consulting and support beyond the Joint Support Queue and Windows Datacenter Program's minimum requirements.

Standards groups in the UK started codifying best practices for systems management in the IT Infrastructure Library (ITIL) in the late 1980s. (For more information about ITIL, go to http://www.itil.co.uk/index.htm.) Most successful high-availability sites today use some or all of these practices. Several training programs and publications are available to introduce you and your staff to the world of high-availability computing and IT Service Management (ITSM). (For more information, as well as a glimpse at the world of enterprise and high-availability computing from the perspective of big-systems management, go to http://www.itsmf.net.)

The Microsoft Operations Framework (MOF) builds upon the ITIL and ITSM and—according to Microsoft—is better suited to the rapidly changing needs of Windows environments. The MOF emphasizes iterative processes for risk assessment, configuration management, and adoption. For more information about the MOF, see the "MOF Executive Overview" white paper at http://www.microsoft.com/trainingandservices/default.asp?PageID=enterprise&Subsite=whitepapers&PageCall=mof#MOFoverview.

Monitor System State
To monitor your high-availability infrastructure, you can leverage best practices from ITSM and MOF and use your systems management tools. Create a CMDB that includes information about CIs. A CI is simply a configurable element of your infrastructure—anything to which you can apply (or from which you can derive) configuration or status information or that can cause the system to be unavailable. The CMDB also needs to include information about dependencies between CIs. A problem's root cause isn't always obvious. Dependency trees can help you discover the root cause of any problem. Your comprehensive CMDB should contain information about every component involved in the task of keeping your infrastructure available. For example, you should include configuration settings, firmware versions, and build or service pack numbers. Create this record during the infrastructure's initial installation, then update it after any change. The CMDB will help you troubleshoot your system and help you bring failed systems back into service rapidly (thereby decreasing MTTR).
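
As a rough illustration of how dependency information narrows a root-cause search, the sketch below stores CI dependencies in a dictionary and reports the failed CIs that don't depend on any other failed CI. The CI names and the root_causes function are invented for the example; a real CMDB would hold far more detail for each CI.

```python
def root_causes(depends_on, failed):
    """Return failed CIs whose own dependencies are all healthy.

    depends_on maps each CI to the CIs it directly depends on.
    A failed CI that depends on another failed CI is treated as a
    symptom rather than a root-cause candidate.
    """
    failed = set(failed)
    return sorted(
        ci for ci in failed
        if not any(dep in failed for dep in depends_on.get(ci, []))
    )


# Hypothetical dependency records from a CMDB.
cmdb_dependencies = {
    "web-app": ["sql-cluster", "nlb-cluster"],
    "sql-cluster": ["san-array", "core-switch"],
    "nlb-cluster": ["core-switch"],
    "san-array": [],
    "core-switch": [],
}

alerts = ["web-app", "sql-cluster", "nlb-cluster", "core-switch"]
print(root_causes(cmdb_dependencies, alerts))  # ['core-switch']
```

Four alerts collapse to one candidate, which is the kind of shortcut that reduces MTTR.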

Microsoft and third parties offer tools to help you create your CMDB. To get baseline data about hardware, software, and detailed system configuration, use your systems management tools and the utilities in the Microsoft Windows 2000 Server Resource Kit. To detect changes in a Datacenter configuration, run the Datacenter Config Comparison utility (cfgcmp.exe)—a command-line tool in the Datacenter CD-ROM's \Support\Tools directory. For stateful clusters (i.e., Cluster service with two, three, or four nodes), create an initial cluster log, which documents whether the cluster starts and runs correctly. (You and your vendor must debug any errors in the initial cluster log before accepting the installation as complete.) Run Network Diagnostics (netdiag.exe)—a resource kit tool—to ensure that no network problems exist. Also, to ensure that no errors or warnings are occurring on boot, be sure to save and review your event logs.
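
The idea behind comparing a current configuration with a recorded baseline can be sketched in a few lines. This is not how cfgcmp.exe works internally, which the tools' own documentation covers; it is a generic illustration, and the setting names and values are made up.

```python
def diff_config(baseline, current):
    """Return {setting: (baseline_value, current_value)} for changes.

    A setting missing from one side shows up with None on that side.
    """
    changes = {}
    for key in sorted(set(baseline) | set(current)):
        old = baseline.get(key)
        new = current.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken at installation and after a change.
baseline = {
    "service_pack": "SP1",
    "nic_driver": "5.1",
    "bios_firmware": "A07",
}
current = {
    "service_pack": "SP1",
    "nic_driver": "5.2",
    "bios_firmware": "A07",
    "hba_driver": "2.0",
}

print(diff_config(baseline, current))
# {'hba_driver': (None, '2.0'), 'nic_driver': ('5.1', '5.2')}
```

Each difference should correspond to an approved change record; one that doesn't is a lead worth following during root-cause analysis.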

You also need to define the change-management roles within your enterprise. Outline the method by which specific teams create and submit RFCs. You might assign teams to ITSM functions such as Cost Management, Build and Test, Customer Management, and Change Management. Representatives from these groups can submit an RFC. The other teams would then assess the RFC to determine its effect on each ITSM function. Create a Change Advisory Board (CAB), the responsibility of which is to determine how any RFC will affect availability, capacity, and adherence to SLAs. Make sure that you've established formal procedures for implementing changes and recording new or changed CIs in your CMDB. Also, ensure that the CMDB is available for root-cause analysis of any failures. As appropriate, create new RFCs to address design changes following a failure.

Troubleshooting is a vital step on the path to recovery. The quicker you can detect and fix a problem (i.e., the lower your MTTR), the higher your availability will be. You can expedite system recovery with fast troubleshooting. To provide redundant "safe boot" on failure, place parallel Win2K installations on all servers. (Although you can boot Win2K in safe mode, a parallel installation of Win2K provides another way to repair a system and return it to service.) Create, document, and practice procedures for handling a blue screen, a Dr. Watson message, a hung server, and hung processes. Ensure that support personnel are familiar with these procedures.

Call to Action
Do you require Windows applications that are scalable and available 99.9 percent of the time? If so, Datacenter might be the solution for you. Remember, however, that high availability isn't a product feature. You'll achieve three or more nines only as a result of meticulous design choices and strict adherence to processes that keep systems up.

The Windows Datacenter Program lets you put some necessary high-availability tasks—such as Change and Configuration Management (CCM) and testing of certified hardware, OSs, and applications—into the hands of Microsoft and its OEM partners. But you'll still need to understand the myriad other aspects of designing, implementing, and supporting a high-availability infrastructure. The Windows Datacenter Program and high-availability computing will represent a substantial departure from the way you've managed Microsoft systems in the past. Despite the challenges, the rewards will be tangible and immediate.