The server is down! The Internet is down! Systems administrators and network administrators would prefer never to hear these words—and after all, the words are seldom literally accurate. How often is an entire server destroyed? How often does the Internet suffer a global failure? Most system failures are the result of a single component failure. Your job is to find that component, fix it, and return the system to normal operation.
For crucial systems, you're always looking for ways to predict and reduce downtime. One approach is to analyze the system's communication path from servers to users and look for potential single points of failure, that is, individual system components that, when broken, can make the entire system unavailable. After you identify potential single points of failure, your next challenge is to decide what to do about them. Because money is often a consideration, you undertake risk analysis, either formally or informally. A considered response often includes one or more of the following strategies:
- Do nothing. Either the risk is low or the cost of a fix is too high.
- Acquire cold spare parts. Cold spare parts are components that you can use to replace broken parts quickly. This strategy comes with moderate cost and risk and is appropriate when some downtime is acceptable.
- Acquire hot spare parts. Hot spare parts are redundant components that are running all the time, ready to take over for broken components in the system. Clustering, load balancing, and hot sites are all forms of such redundancy, depending on the part of a system that needs repair.
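One way to make the search for single points of failure concrete is to model the communication path as a graph and test what happens when each component is removed. The following is a minimal sketch, not a production tool: the topology, the component names, and the functions `reachable` and `single_points_of_failure` are all hypothetical and exist only for illustration.

```python
from collections import deque

def reachable(links, start, goal, removed=None):
    """Breadth-first search: can `start` still reach `goal` when `removed` is down?"""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

def single_points_of_failure(links, start, goal):
    """Return the intermediate components whose loss alone cuts start off from goal."""
    candidates = set(links) - {start, goal}
    return sorted(c for c in candidates
                  if not reachable(links, start, goal, removed=c))

# Hypothetical topology: two NICs and two switches, but only one router.
links = {
    "server": ["nic1", "nic2"],
    "nic1": ["server", "switch_a"],
    "nic2": ["server", "switch_b"],
    "switch_a": ["nic1", "router"],
    "switch_b": ["nic2", "router"],
    "router": ["switch_a", "switch_b", "users"],
    "users": ["router"],
}

print(single_points_of_failure(links, "server", "users"))  # ['router']
```

In this assumed layout the NICs and switches are already duplicated, so the single router is the only component whose failure isolates the users. That is the component to weigh against the three strategies above.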
As a network administrator, you need to ensure that packets continue to flow. Often, redundant network connections are your best bet. In a network setting, you can use redundancy to provide fault tolerance and to increase communications capacity. To build reliable network communications paths, you need to understand how to implement redundant LAN and WAN connections. For information about the standards and protocols that enable the following redundancy scenarios, see the sidebar "A Glossary of Standards and Protocols Relevant to Redundant Networks."
Redundant LAN Connections
Sooner or later, you'll need to handle a system communication failure that occurs within a server's local subnet. The server's NIC and default gateway are both potential points of failure, but you can add redundancy in a variety of ways.
Multiple NICs on the same subnet. Whether your server system is standalone, clustered, or load-balanced, the NIC is a potential point of failure. Starting with Windows 2000, Microsoft simplified the installation of multiple NICs configured for the same IP subnet. To provide NIC redundancy, you can connect such NICs to the same hub or switch or preferably to different switches. The Interface metric property determines which of the active (i.e., enabled) NICs the system will use for outbound traffic; the system uses the NIC with the lowest number in the Interface metric field. Go to Control Panel, Network and Dial-up Connections, Local Area Connection, Properties. Select Internet Protocol (TCP/IP), and click Properties. On the General tab, click Advanced. Clear the Automatic Metric check box at the bottom of the resulting dialog box, and enter the metric you want to assign to this NIC.
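The selection rule is simple enough to express in a few lines. The sketch below only illustrates the rule as described here (lowest Interface metric among enabled NICs); it is not how Windows implements it, and the `Nic` class, the NIC names, and the metric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Nic:
    name: str
    metric: int           # value entered in the Interface metric field
    enabled: bool = True  # only active NICs are considered

def pick_outbound_nic(nics):
    """Return the enabled NIC with the lowest interface metric, or None."""
    enabled = [nic for nic in nics if nic.enabled]
    if not enabled:
        return None
    return min(enabled, key=lambda nic: nic.metric)

# Hypothetical server with two NICs on the same subnet.
primary = Nic("Local Area Connection", metric=10)
backup = Nic("Local Area Connection 2", metric=20)
nics = [primary, backup]

print(pick_outbound_nic(nics).name)  # Local Area Connection

primary.enabled = False              # simulate losing the preferred NIC
print(pick_outbound_nic(nics).name)  # Local Area Connection 2
```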
Multiple default gateways. A failure of the default gateway on the subnet will cause traffic to remote subnets to fail. Implementing multiple routers on the subnet provides a measure of fault tolerance to this kind of failure. The Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP) support such fault tolerance without requiring configuration changes at the client. You can also implement multiple default gateways at each client by defining more than one default gateway address on each NIC. Starting with Win2K, Microsoft lets you assign a metric to a default gateway the same way that you assign a metric to a NIC.
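To see why VRRP needs no client changes, it helps to look at how the routers decide among themselves which one answers for the shared gateway address. The sketch below models only the election rule from the VRRP specification (highest priority wins, and a tie goes to the higher IP address). It ignores advertisement timers and preemption, and the router names, addresses, and priorities are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class VrrpRouter:
    name: str
    ip: str
    priority: int      # higher priority is preferred
    alive: bool = True

def elect_master(routers):
    """Pick the live router with the highest priority; break ties by higher IP."""
    live = [r for r in routers if r.alive]
    if not live:
        return None
    return max(live, key=lambda r: (r.priority, ipaddress.ip_address(r.ip)))

# Hypothetical pair of routers backing one virtual gateway address.
router_a = VrrpRouter("router_a", "10.10.0.252", priority=120)
router_b = VrrpRouter("router_b", "10.10.0.253", priority=100)
group = [router_a, router_b]

print(elect_master(group).name)  # router_a

router_a.alive = False           # simulate a router failure
print(elect_master(group).name)  # router_b
```

Clients keep pointing at the same virtual gateway address throughout, so the change of master is invisible to them.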
In earlier versions of Windows, you can assign a metric to a default gateway by installing additional default gateways directly into the IP routing table. To make such routing table changes, you use the Route Add command with the metric option at a standard command prompt. For example, the command
route -p add 0.0.0.0 mask 0.0.0.0 10.10.0.254 metric 15
adds a persistent default gateway for the router at 10.10.0.254 with a metric of 15. Understand that only connection-oriented traffic such as TCP will trigger a default gateway change; UDP and Internet Control Message Protocol (ICMP) traffic such as Ping won't. Defining different default gateways for different NICs in a multihomed computer can cause problems when the NICs connect to networks that can't communicate with one another. Even when default gateways are defined on different NICs, only one of a computer's default gateways is active at a time. For more information about configuring default gateways, see the Microsoft article "Default Gateway Configuration for Multihomed Computers" (http://support.microsoft.com/?kbid=157025).
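The behavior described above (one active gateway at a time, with only TCP failures triggering a change) can be modeled roughly as follows. This is a simplified sketch rather than the actual Windows algorithm: the failure threshold of 3 is an assumed value chosen for illustration, the class `DefaultGatewayList` is hypothetical, and 10.10.0.1 is a made-up second gateway.

```python
class DefaultGatewayList:
    """Toy model of dead-gateway detection across several default gateways."""

    def __init__(self, gateways, threshold=3):
        # gateways: list of (address, metric); the lowest metric is tried first.
        self.gateways = [addr for addr, _ in sorted(gateways, key=lambda g: g[1])]
        self.threshold = threshold  # assumed value, not taken from Windows
        self.active_index = 0
        self.failures = 0

    @property
    def active(self):
        """Only one default gateway is active at a time."""
        return self.gateways[self.active_index]

    def report_failure(self, protocol):
        """Count a failed send; only TCP failures can trigger a gateway change."""
        if protocol.lower() != "tcp":
            return self.active
        self.failures += 1
        if self.failures >= self.threshold:
            self.active_index = (self.active_index + 1) % len(self.gateways)
            self.failures = 0
        return self.active

table = DefaultGatewayList([("10.10.0.254", 15), ("10.10.0.1", 10)])
print(table.active)                  # 10.10.0.1 (lowest metric)

for _ in range(5):
    table.report_failure("icmp")     # failed pings never switch gateways
print(table.active)                  # 10.10.0.1

for _ in range(3):
    table.report_failure("tcp")      # repeated TCP failures do
print(table.active)                  # 10.10.0.254
```

The practical consequence is that a failed Ping test alone doesn't prove failover is broken; test with a TCP connection instead.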
The Internet Router Discovery Protocol (IRDP) is yet another way to handle dead-gateway detection. Routers that support IRDP use ICMP messages to advertise their presence. In Windows NT 4.0, Microsoft added IRDP support, which is disabled by default. You use registry modifications to enable IRDP individually for each NIC, as described in the Microsoft articles "Internet Router Discovery Protocol (IRDP) Client Support Added to Windows NT 4.0" (http://support.microsoft.com/?kbid=223756) and "Router Discovery Protocol Is Disabled by Default" (http://support.microsoft.com/?kbid=269734). After you enable IRDP, the protocol stack will listen for and request router advertisements and use them to set a default gateway.
Link aggregation. Several years ago, NIC vendors began to offer proprietary solutions to the single-NIC vulnerability. These solutions evolved into the IEEE 802.3ad Link Aggregation Control Protocol (LACP) standard. LACP supports multiple parallel switch-to-switch and server-to-switch connections. You can use this standard—variously called NIC teaming, port bonding, and link aggregation—to configure LACP-based products for fault tolerance, increased bandwidth, and load balancing across parallel links.
Figure 1 shows the concept of server-to-switch link aggregation. In this example, four NIC ports on the server connect to four ports on one switch. LACP static-mode support in the NIC driver and the switch combines the bandwidth of the four ports for a total effective bandwidth equal to the sum of the NIC speeds. Traffic across the four links is load-balanced, and when a link fails, the load-balancing algorithm quickly converges to balance the load across the remaining links. This configuration doesn't provide fault tolerance in the event of a switch failure.
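The load balancing and reconvergence can be pictured as a hash over each traffic flow that maps it onto one of the links that are currently up. The sketch below assumes a simple CRC-based hash of source and destination addresses; real NIC drivers and switches use their own vendor-specific hashing policies, and the port names, addresses, and the 100Mbps link speed are hypothetical.

```python
import zlib

def assign_link(flow, active_links):
    """Map a (source, destination) flow onto one of the currently active links."""
    key = "|".join(flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]

def total_bandwidth(active_links, mbps_per_link):
    """Aggregate bandwidth is the sum of the speeds of the active links."""
    return len(active_links) * mbps_per_link

links = ["port1", "port2", "port3", "port4"]
flows = [("10.10.0.5", f"10.10.1.{host}") for host in range(1, 9)]

print(total_bandwidth(links, 100))   # 400
before = {flow: assign_link(flow, links) for flow in flows}

# Simulate a link failure: flows are rebalanced across the three survivors.
links.remove("port3")
after = {flow: assign_link(flow, links) for flow in flows}

print(total_bandwidth(links, 100))   # 300
print(all(link != "port3" for link in after.values()))  # True
```

Hashing per flow, rather than spreading individual frames across links, keeps each conversation on a single link so that its frames arrive in order.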
Figure 2 shows a server-to-switch configuration that provides fault tolerance in the event of a switch failure but doesn't provide link aggregation or load balancing. This configuration requires that you enable the Spanning Tree Algorithm (STA) in both switches to ensure that only one link is active at a time, thereby preventing packets from circulating between the links.
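The idea behind spanning tree is to keep every device reachable while blocking any link that would close a loop. The sketch below illustrates only that idea with a breadth-first search from a chosen root; the real protocol elects the root by bridge ID and weighs port costs, neither of which is modeled here. The three-switch triangle is a hypothetical topology.

```python
from collections import deque

def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets for a loop-free tree rooted at root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    seen, queue, forwarding = {root}, deque([root]), set()
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in seen:
                # First path found to a device stays active.
                seen.add(nxt)
                forwarding.add(frozenset((node, nxt)))
                queue.append(nxt)

    # Every remaining link would create a loop, so it is blocked.
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked

# Hypothetical triangle of switches: one link must be blocked.
links = [("switch_a", "switch_b"), ("switch_a", "switch_c"), ("switch_b", "switch_c")]
forwarding, blocked = spanning_tree(links, root="switch_a")

print(sorted(sorted(link) for link in blocked))  # [['switch_b', 'switch_c']]
```

If an active link later fails, rerunning the calculation on the surviving links unblocks the standby path, which is the source of the fault tolerance.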
Figure 3 shows a server-to-switch configuration that requires LACP dynamic-mode support. Because the server has connections to two switches, the configuration provides switch fault tolerance. The server has multiple connections to each switch, and the connections to each switch are grouped together (i.e., teamed). In this configuration, the teamed connections to Switch A are active, whereas the teamed connections to Switch B remain in standby mode. LACP provides link aggregation, load balancing, and fault tolerance to link failures within the active team. In the event of a switch failure, LACP fails communications over to the standby team connected to Switch B.
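The two levels of protection in this configuration (surviving a link failure within the active team, and failing over to the standby team when a whole switch is lost) can be sketched as a simple selection rule. This is an illustration of the behavior described above, not an implementation of LACP; the team and link names are hypothetical.

```python
def active_links(teams, link_up):
    """Return (team name, live links) for the first team with any live link.

    teams maps a team name to its links, listed in order of preference.
    link_up maps each link name to True (up) or False (down).
    """
    for name, links in teams.items():
        live = [link for link in links if link_up.get(link, False)]
        if live:
            return name, live
    return None, []

# Hypothetical server with two links to each switch.
teams = {
    "team_switch_a": ["a1", "a2"],   # preferred (active) team
    "team_switch_b": ["b1", "b2"],   # standby team
}
link_up = {"a1": True, "a2": True, "b1": True, "b2": True}

print(active_links(teams, link_up))  # ('team_switch_a', ['a1', 'a2'])

link_up["a1"] = False                # one link fails: the team carries on
print(active_links(teams, link_up))  # ('team_switch_a', ['a2'])

link_up["a2"] = False                # Switch A fails: the standby team takes over
print(active_links(teams, link_up))  # ('team_switch_b', ['b1', 'b2'])
```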
Figure 4 shows a switch-to-switch configuration. This configuration supports additional switch-to-switch bandwidth, load balancing, and link-failure fault tolerance.
Redundant WAN Links
Whereas building redundancy into your LAN involves (typically Ethernet) server-to-switch and switch-to-switch connections, building redundancy into your WAN involves router-to-router connections. Let's look at the network architectures you can use to implement redundant paths to remote destinations and to implement fault tolerance in the event of WAN link or router failure.
Consider the simplest Internet-connection scenario, which Figure 5 shows. The local network connects to the ISP through a single link at Router A. If the local network is sufficiently simple, Router A can use static routing rather than run an interior gateway protocol such as the Routing Information Protocol (RIP) or the Open Shortest Path First (OSPF) protocol. However, this configuration offers no fault tolerance.
Figure 6 shows one level of fault tolerance: two independent connections from one site into the same autonomous system (AS) of one ISP. Routers A and B both run Border Gateway Protocol 4 (BGP-4). For information about Internal BGP (IBGP) and External BGP (EBGP), see the sidebar "A Glossary of Standards and Protocols Relevant to Redundant Networks." Although this configuration offers fault tolerance to the failure of either Router A or Router B or to either of their communications lines, the organization is still vulnerable to an outage. If both communication lines use the same "last-mile" communication path between the datacom provider (e.g., your local phone company) and the local network, anything that inadvertently damages that path, such as a backhoe, will take out both links. If both links terminate at the same ISP Point of Presence (POP), a problem at the ISP's location can also take down both links.
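A little availability arithmetic shows why a shared last-mile path matters so much. The sketch below assumes that component failures are independent and uses made-up availability figures (99 percent per link, 99.5 percent for the shared path); redundant components combine in parallel, and anything every path depends on combines in series.

```python
import math

def series(*availabilities):
    """All components must work: multiply the availabilities."""
    return math.prod(availabilities)

def parallel(*availabilities):
    """Any one component is enough: 1 minus the chance that all fail."""
    return 1 - math.prod(1 - a for a in availabilities)

link = 0.99          # assumed availability of each router-plus-line path
last_mile = 0.995    # assumed availability of a shared last-mile path

two_independent_links = parallel(link, link)
two_links_shared_last_mile = series(last_mile, parallel(link, link))

print(f"{link:.2%}")                        # 99.00%
print(f"{two_independent_links:.2%}")       # 99.99%
print(f"{two_links_shared_last_mile:.2%}")  # 99.49%
```

Under these assumptions, the second link buys almost nothing once both links ride the same last-mile path, because the shared segment caps the availability of the whole connection.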
Figure 7 shows a more fault-tolerant configuration: a network connection to two ISPs. This connection might still be vulnerable to last-mile disruption but can survive an ISP outage. This configuration requires the local network to use a globally unique Autonomous System Number (ASN). For information about the ASN, see the sidebar "A Glossary of Standards and Protocols Relevant to Redundant Networks."
Organizations that have facilities in several locations around the country or around the world can take advantage of even more robust fault-tolerant configurations. Suppose a private network interconnects an organization's various locations. The private network connects each location to two or more other locations. At least two of these locations would have fault-tolerant ISP connections. Depending on the ISP facility's fault tolerance, the organization might be able to use the same ISP in several locations or might choose to contract with different ISPs. This network would be able to survive a regional outage (each organizational location would have at least two paths to reach other locations) and would be able to bypass regional Internet problems by connecting through an ISP outside the troubled region.
The Key to Success
A thorough analysis of your network communication paths is key to successfully implementing redundancy for fault tolerance. Do you know whether you can accomplish last-mile connections over more than one physical path? Does your ISP have fault-tolerant Internet connections?
Competent network administrators make the effort to correctly implement the components for which they are directly responsible. The best network administrators, however, look beyond that core responsibility by constantly searching for potential points of failure in the communication path.