Windows IT Pro is the leading independent community for IT professionals deploying Microsoft Windows server and client applications and technologies.
  
  

The 7 Habits of Highly Available Exchange Servers
 

Lessons in self- and server improvement

Consulting about Microsoft Exchange Server availability is like watching the Looney Tunes' Wile E. Coyote: Watch for a while, and you can begin to predict the mistakes that lead to the falls. You also learn that the falls aren't as deadly as the pounding that follows close behind. After years of working with Exchange Server organizations, I've identified the factors that can lead to falls from high availability and the disaster recovery mistakes that can make these falls catastrophic. Inspired by Stephen R. Covey's bestseller The 7 Habits of Highly Effective People (Simon & Schuster, 1999), I've distilled seven habits that help organizations prevent Exchange Server system failures and maintain high availability.

Seek First to Understand Downtime
Administrators must commit to solving the problems that decrease Exchange Server availability. Such problems fall into one of two categories: planned downtime or unplanned downtime. Planned downtime (e.g., applying service packs, upgrading hardware) is by far the easier category to manage. The best approach, when feasible, is to schedule planned downtime for nonbusiness hours.

Highly available Exchange Server organizations conduct risk assessments of unplanned downtime events. An important part of these assessments is the list you generate of possible downtime events. You can sort this list by the events' relative risks, then concentrate on preventing high-probability, high-impact events (e.g., Software Component A causing Software Component B to behave unexpectedly) and give less attention to the low-probability events (e.g., a meteorite striking your data center).
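One way to sort such a list is to score each event as probability times impact. The following Python sketch illustrates the idea only: the first two event names echo the examples above, the third is made up, and every probability and impact figure is an invented placeholder rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    name: str
    probability: float   # assumed chance of the event occurring in a year (0.0 to 1.0)
    impact_hours: float  # assumed hours of downtime if the event occurs

    @property
    def risk(self) -> float:
        # Relative risk expressed as expected downtime hours per year.
        return self.probability * self.impact_hours


def rank_by_risk(events):
    """Return the events sorted from highest to lowest relative risk."""
    return sorted(events, key=lambda e: e.risk, reverse=True)


# Placeholder figures for illustration only.
events = [
    DowntimeEvent("Meteorite strikes data center", 0.000001, 720.0),
    DowntimeEvent("Component A destabilizes Component B", 0.5, 4.0),
    DowntimeEvent("Hard disk failure without RAID", 0.1, 8.0),
]

for event in rank_by_risk(events):
    print(f"{event.name}: {event.risk:.4f} expected hours/year")
```

With these placeholder numbers, the software interaction ranks first and the meteorite last, which matches the priority order the assessment is meant to produce.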

In my experience, software quality problems—bugs—are most often the cause of unplanned downtime. However, your response to outages—the decisions you make and the procedures you follow—determines the duration of the downtime. Unplanned downtime cycles have several stages, from problem identification through recovery. Understanding these stages and preparing yourself for action helps minimize downtime.

The first stage is notification that a problem exists. Automated notification systems—either built-in or added on—can detect hardware problems before they cause outages. OS- and application-level monitors, such as NetIQ's AppManager Suite and BMC Software's PATROL for Microsoft Exchange 2000 Servers, also aid in early problem detection. Undetected problems can lead to cascading failures that obscure the source problem. For example, suppose a mail connector queue fills Server A's hard disk. If this problem goes unnoticed, it might result in a connector on Server B failing to deliver messages to Server A. Thus, Server B appears to be the source of the problem, which diverts attention from the actual source: Server A.
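The disk-full scenario above is the kind of condition a simple threshold check can catch early. The sketch below is a generic Python illustration, not a description of how AppManager or PATROL work; the 80 percent and 95 percent thresholds and the function names are assumptions chosen for the example.

```python
import shutil


def classify(fraction_used, warn_at=0.80, critical_at=0.95):
    """Map a used-space fraction to 'ok', 'warning', or 'critical'.

    The default thresholds are assumed values, not vendor recommendations.
    """
    if fraction_used >= critical_at:
        return "critical"
    if fraction_used >= warn_at:
        return "warning"
    return "ok"


def disk_usage_fraction(path):
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_volume(path, warn_at=0.80, critical_at=0.95):
    """Classify the volume that holds `path` by how full it is."""
    return classify(disk_usage_fraction(path), warn_at, critical_at)


if __name__ == "__main__":
    # "." stands in for the volume that holds the connector queue.
    print(check_volume("."))
```

Running a check like this on Server A on a schedule would raise a warning before the queue fills the disk, so the failure never reaches Server B.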

The second stage is thorough problem analysis. Analysis helps you develop a troubleshooting course of action. The troubleshooting team must react quickly, but mistakes can be costly. The team members need to first isolate the problem to prevent further harm. Then, they must gather information about the problem, whether from tracking logs, Windows event logs, or the server operator's records of system changes.

Implementing and testing your recovery solution is the third stage. But don't consider the downtime cycle complete until the fourth stage: analysis of the lessons you've learned. Most unplanned downtime events contain lessons that can help you prevent a recurrence of the problem.

Put Hardware First
Hardware is the foundation of availability. Application stability doesn't matter if you don't run your applications on solid hardware. Fault-tolerant hardware often lets you repair hardware faults without taking systems down. Redundant components can keep systems running when the inevitable hardware faults occur. Hot-swappable components let you replace failed parts without downtime.

RAID-protected hard disk subsystems are key to protecting your Exchange servers from the effects of hard disk failure. Best practice is to place Exchange Server log files on a RAID 1 volume and the database on a RAID 5 or, better yet, RAID 0+1 volume. For more information about the pros and cons of these RAID configurations, see the sidebar "Comparing RAID 5 and RAID 0+1."
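Part of the trade-off between the two database layouts is usable capacity: RAID 5 gives up one disk's worth of space to parity, whereas RAID 0+1 gives up half the disks to mirroring. The Python sketch below works through that arithmetic; the six-disk, 36GB configuration is a hypothetical example, and the functions ignore performance and rebuild behavior.

```python
def raid5_usable_gb(disk_count, disk_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_gb


def raid01_usable_gb(disk_count, disk_gb):
    """Usable capacity of a RAID 0+1 set: half the disks mirror the other half."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID 0+1 needs an even number of disks, at least four")
    return (disk_count // 2) * disk_gb


# Hypothetical configuration: six 36GB disks.
print("RAID 5  :", raid5_usable_gb(6, 36), "GB usable")
print("RAID 0+1:", raid01_usable_gb(6, 36), "GB usable")
```

For the same six disks, RAID 5 yields 180GB and RAID 0+1 yields 108GB, which is why RAID 0+1 costs more per usable gigabyte.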

Storage planning is another important consideration. One organization's Exchange Server administrators told me that migrations to larger storage cabinets and more or larger hard disks were their servers' most significant sources of downtime (corporate policy prevented these administrators from enforcing mailbox limits). The organization was looking into a Storage Area Network (SAN) as a solution. A SAN provides a high-performance pool of hard disks from which you can allocate storage to servers. SANs also simplify storage expansion, reconfiguration, and backup and recovery. However, transitioning to SAN-based storage can be difficult and can increase downtime.

Clustering for a Win-Win Environment
Clustering improves application reliability and helps prevent system failures. But the real beauty of clustering is that it can make even unreliable applications highly available to end users. For example, one day Node A in my 2-node Exchange Server 5.5 cluster began failing over to Node B. When I looked in the event log, I noticed that the failovers were occurring at 2-hour intervals. The person who installed the cluster had mistakenly installed an evaluation edition of Windows NT Server. When the 120-day evaluation period had expired, the OS began performing hard shutdowns every 2 hours. Clustering kept our Exchange Server system available to end users until we resolved the problem.

Clustering also helps you manage planned downtime. In a clustered environment, you can fail over Node A's services to Node B, then apply a service pack, hotfix, or upgrade to Node A.

Exchange Server 5.5 permits only 2-node active-passive clustering. Only the active node can perform Exchange Server processing. The passive node can't perform any processing until failover occurs. This limitation has lowered clustering's adoption rate, because 2-node active-passive clustering requires you to spend twice as much money on hardware without increasing processing capacity.

Exchange 2000 active-passive clusters are slightly different from Exchange Server 5.5 clusters: One node runs an Exchange Virtual Server (EVS) and the other has Exchange 2000 and doesn't run EVS until a failover occurs. Exchange 2000 with Service Pack 1 (SP1) permits 2-node active-active clustering on Windows 2000 Advanced Server. However, to ensure failover, you need to carefully distribute active user connections and keep processor utilization within the range that lets failovers occur. You can progress to 4-node clustering (i.e., 3+1 clustering) on Win2K Datacenter. Although you get better returns for your hardware investment when you cluster on Exchange 2000 and Win2K, you must still purchase special storage that lets two or more cluster nodes share a hard disk. Fibre channel SANs are a must for 3+1 clusters. For more information about clustering, see Greg Todd, "Microsoft Clustering Solutions," November 2000.
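The utilization constraint on active-active clusters comes down to headroom: after a failover, the surviving node carries both workloads. The Python sketch below is a deliberately simplified model that assumes identical nodes and load that adds linearly. The 80 percent ceiling is an assumed figure for illustration, not a documented Microsoft limit.

```python
def can_fail_over(node_a_cpu, node_b_cpu, ceiling=0.80):
    """Return True if one node could carry both nodes' CPU load.

    Utilizations are fractions (0.0 to 1.0). The model assumes identical
    nodes and linearly additive load; `ceiling` is an assumed maximum
    sustainable utilization, not a documented limit.
    """
    return node_a_cpu + node_b_cpu <= ceiling


# Hypothetical readings from two active nodes.
print(can_fail_over(0.35, 0.40))  # combined load of 75 percent fits under the ceiling
print(can_fail_over(0.50, 0.45))  # combined load of 95 percent does not
```

A check of this kind makes the point concrete: two nodes that each look comfortably loaded on their own can still be unable to fail over to each other.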

Back Up with Restores in Mind
A nasty crash can result in a corrupted Information Store (IS) that won't mount. This situation can necessitate a lengthy recovery process. Checking database integrity can take several hours. Eseutil, Exchange Server's primary integrity check and repair utility, could take an hour to check and repair a 15GB database, even with the fastest disk technology.



To a large extent, the techniques you employ for backing up your IS determine the length of the recovery process. If you plan for disaster recovery, you'll get back on your feet more quickly after a failure. Exchange Server 5.5 availability takes its biggest hit from the unpartitioned IS because when you need to restore this monolithic IS, you need to restore the entire IS. If you run Exchange 2000 Enterprise Server, you can partition the IS, which improves recovery time.

The most common approach to IS backups is doing full nightly backups to tape, then rotating the tapes off site. Database (.edb) restoration from tape drives runs at 15GB to 30GB per hour on the best DLT technology and more slowly on other tape technology or over the network.
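Those throughput figures translate directly into a restore window. The Python sketch below does the arithmetic using the 15GB and 30GB per hour rates quoted above; the 60GB database size is a hypothetical example, and the calculation ignores the time needed to locate and load the tape or replay logs.

```python
def restore_hours(database_gb, rate_gb_per_hour):
    """Hours needed to restore a database at a given sustained rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return database_gb / rate_gb_per_hour


def restore_window(database_gb, slow_rate=15.0, fast_rate=30.0):
    """Return (best_case_hours, worst_case_hours) for the quoted DLT rates."""
    return (
        restore_hours(database_gb, fast_rate),
        restore_hours(database_gb, slow_rate),
    )


# Hypothetical 60GB database.
best, worst = restore_window(60)
print(f"Restore takes between {best:.1f} and {worst:.1f} hours")
```

For a 60GB store, the data transfer alone takes between 2 and 4 hours, before any integrity checks.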

Win2K's Ntbackup utility lets you perform online Exchange 2000 and Exchange Server 5.5 backups to disk. You can then back up the resulting .bkf file to tape and rotate the file off site. The advantage of this approach is that in the event of an IS problem, you can go directly to the disk-based backup set instead of locating and loading a tape. Restores from disk are also typically faster than restores from tape. For more information about Exchange 2000 backup and recovery, see Jerry Cochran, "Exchange 2000 Storage Exposed, Part 2," August 2000.

If you're willing to spend the extra money, advanced backup techniques—cloning, snapshots, and data replication—lead to much faster recoveries and are approaches to consider as your situation requires (e.g., if you need to satisfy a service level agreement—SLA). Cloning is a function of RAID 0+1 mirroring. The clone is the third member of a triple mirrored set. Extracting the clone requires that you stop the Exchange Server services so that the database is consistent. This action immediately affects uptime, but SLAs typically permit such brief outages if they take place during off-hours.

To run utilities such as integrity checks, you can present the clone to another host on the SAN. You can then take the clone offline and back it up to tape. To restore a database that's been totally lost, you can make the clone stripe set the primary member of a new mirror set, then bring your Exchange Server system back online. Even if your database is large, you're back online in minutes instead of hours. The RAID controller will rebuild the mirror set in the background, with a negligible impact on performance.

A snapshot is a point-in-time copy of a disk. Snapshot software, running on the OS or at the RAID controller level, creates a disk map. As your source disk changes, your snapshot records those changes.
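One common way to implement this behavior is copy-on-write: before a block on the source changes, the snapshot saves the block's original contents. The toy Python model below sketches that technique with in-memory lists. The class and method names are hypothetical, and real snapshot products may work differently.

```python
class Snapshot:
    """Point-in-time view of a volume, kept with copy-on-write."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> contents at snapshot time

    def preserve(self, index, old_data):
        # Keep only the first pre-change copy of each block.
        self.saved.setdefault(index, old_data)

    def read(self, index):
        # Changed blocks come from the saved copies; unchanged blocks
        # are read straight from the source volume.
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]


class Volume:
    """Toy disk: a list of blocks plus any snapshots taken of it."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Save the old contents into every snapshot before overwriting.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")
print(vol.blocks)                        # current state of the source
print([snap.read(i) for i in range(3)])  # state as of the snapshot
```

Because the snapshot stores only the blocks that have changed, creating it is nearly instant and its size grows with the rate of change on the source rather than with the size of the disk.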

Some snapshot software lets you present the snapshot to other systems. This feature can be valuable if you need to test an application without risking the production database. Snapshots are also handy for individual item or mailbox restores. The traditional method for restoring individual items and mailboxes is to restore the entire IS from tape to a recovery server. Restoring from a snapshot follows the same pattern, but you don't need to wait for a lengthy tape restore; instead, you can mount the snapshot to the recovery server and immediately extract the specific information that you want to recover.

To guarantee database integrity, creating a snapshot requires that you take offline all stores that share a disk volume. (In Exchange 2000, you dismount each Mailbox Store and Public Folder Store individually; in Exchange Server 5.5, you dismount the IS as a whole.) Some vendors provide snapshot technology with online backup capabilities, but database consistency is difficult to guarantee.

Data replication helps protect you against the most serious disasters, such as loss of the data center. Data replication can copy the IS in realtime to a distant location. The underlying technology (e.g., fibre channel, Asynchronous Transfer Mode—ATM) determines how distant this location can be. Data replication solutions typically involve specialized, high-end hardware (e.g., Marathon Technologies' Marathon Exchange Servers, Compaq SANworks Data Replication Manager) or specialized software (e.g., VERITAS Software's Storage Replicator), all of which can be expensive.

Monitor Proactively
Proactively monitoring and maintaining your system can prevent downtime. Exchange Server's basic server and link monitoring tools provide limited functionality compared with third-party tools such as AppManager Suite and PATROL. You can monitor your servers at several levels: network, system hardware, OS, and application. The number of platforms you monitor and how you want the product to integrate with your systems will help you decide which product to use. But more important than what product you use is using it proactively: Respond to all early warnings to prevent detected problems from recurring or becoming more severe.

Sharpen Your Network Defense
Administrators of highly available Exchange Server organizations defend their systems vigorously against viruses and network attacks. Without a solid defense, you risk taking a hit to availability. I've seen an email virus outbreak shut down Exchange Server systems that previously had great availability track records. Cleaning up the aftereffects of such an outbreak can take hours.

A common network defense myth is that virus detection software is your most important method of protection. Virus scanning protects your systems against older, known viruses but can't protect you against new ones. For information about antivirus software placement on SMTP or Exchange servers, see "A Viral Survival Checklist," http://www.exchangeadmin.com, InstantDoc ID 8513, and "Update to 'A Viral Survival Checklist,'" http://www.exchangeadmin.com, InstantDoc ID 8778. For information about antivirus applications, see Tony Redmond, "The Great Antivirus Crusade," April 2001.

You also need to educate your users about how to recognize and dispose of suspicious attachments. You and your users need to configure systems in ways that limit the damage of virus attacks. Microsoft offers security patches for Outlook, and Outlook 2002 will offer security options that help control virus attacks.

Although essential, purchasing antivirus software isn't enough. To sharpen your network defense, you need to stay on top of security bulletins and hotfixes. If you run Exchange 2000, you can take advantage of Win2K Server's security benefits. To read about leveraging Exchange 2000 and Win2K integration, see Jan De Clercq, "Win2K Security and Exchange 2000," October 2000.

Synergize Expertise
Organizations that have the most highly available Exchange Server systems have an amazing amount of in-house expertise, although they might not have started out with such experts. Even if they did, ever-changing technology levels the field every few years. What organizations with highly available Exchange Server systems have in common is that they continually develop their in-house expertise. And what they can't do, they outsource.

To be a high-availability system, a system's downtime must be less than 52 minutes per year. These 52 minutes don't leave much room for outages and planned downtime, so don't be discouraged if your system isn't one of the elite and highly available. Instead of counting downtime minutes, concentrate on developing these seven habits, and one day you'll be the Exchange Server expert whom others seek out.
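The 52-minute figure corresponds to 99.99 percent availability over a 365-day year. The short Python sketch below shows the arithmetic; the other availability levels are included only for comparison and don't come from the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(level)
    print(f"{level}% availability allows about {minutes:.1f} minutes per year")
```

At 99.99 percent, the budget is roughly 52.6 minutes per year, so a single unplanned outage or an overrunning service pack installation can consume it.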







Reader Comments

PLEASE, PLEASE, PLEASE reconsider the clustering advice for Exchange 5.5, at least on an NT4.0 platform. We have had such an environment (based on Microsoft initial advice) for close to six months now - and we - FINALLY - got rid of it last weekend (with Microsoft's agreement and support). It doesn't do one any good because the Exchange database is still shared - and - like in our case - when we had corruptions of it - it was actually more damaging to have a cluster environment, as the second server won't come up either ("shared" data), and it would require much, much, much more work for rebuilds. We have gone through all options a couple of times, with Microsoft by our side (support), when we were finally advised to get rid of it altogether. And - guess what - even the response time of the server improved now. Just my $0.02 Regards,

Calin -July 31, 2001

One method we have used for fast recovery (Exch 5.5).
(a) Shutdown services.
(b) Cold backup to disk ( clone / snapshot etc. ).
(c) Backup disk to tape.

In the event of failure:-
(a) Point Exchange server databases to clone copies
(b) restart services ( applying transactions )

This saved my skin once and was pretty quick. Probably not recommended by anyone though.

Nigel Robinson -August 06, 2001

I agree with Nigel and Calin. The best method for backing up your Exchange org environment is still cloning the disk/partition, rather than wasting time and money on clustering solutions.

In my company, we use two HDDs with Lucor ExactCopy software ($700) installed on NT 4.0/Exch 5.5 SP3. Twice a week, we schedule a complete sector-by-sector copy from Disk 0 to Disk 1 through ExactCopy, in addition to our weekly tape backup of the Exchange server databases. In case of HD failure, simply remove the failed disk and replace it with the other one; ExactCopy maintains a 100% copy of your original HD. If the copy disk fails, just replace it and rebuild it with ExactCopy! For any other DB/mailbox restore, use the weekly tape backup on the NT server.

M. Yassin -August 08, 2001

I agree with the previous comments about the perils of clustering. I've just spent the week from hell fixing a major problem with Exchange 5.5 running on a Win2k cluster at one of our client's sites. After a week without mail, users (and more importantly Management) were becoming increasingly frustrated at their inability to do their jobs...heads were on the block here. The problem was eventually traced to a badly misbehaving cluster node (the node was evicted, rebuilt and brought back into the cluster without problems); however, there were very few indicators pointing to the cluster node as the source of the problem. For what this client actually requires (which is **NOT** 24x7 availability) clustering is a complex, overpriced solution that has its own set of problems. A single server solution with RAID 0+1 and RAID 5 storage would have been more appropriate for this customer; however, the cluster was sold to the client as a "must have". Beware the perils of an over-ambitious sales department where clustering is concerned...

Shane Woodman -August 11, 2001



Not Just a Storage Solution Provider

I'm writing in regard to Evan Morris's "The 7 Habits of Highly Available Exchange Servers" (August 2001). Although Marathon Technologies is pleased to be included in this article, the article represents the company's solutions too narrowly.

Habit 4: Back Up with Restores in Mind mentions Marathon as a storage solution provider. Marathon has storage capabilities, but the company's key goal is to ensure that mission-critical applications keep running.

In keeping with the title of the article, mention of Marathon's NoFail EMail offerings seems appropriate. Marathon's NoFail Email continuous Exchange solutions guarantee that email will always be available and accessible with no loss of data or transactions. Marathon's Long Distance SplitSite Disaster Tolerance offering lets you physically separate Marathon Exchange Servers and place them in different geographic locations, thus providing high availability and disaster tolerance.

As you can see, Marathon is more than just a storage solution provider. Marathon is an overall system-availability solutions provider for the Windows environment that spans a variety of applications, including messaging, storage, process automation, e-commerce, online financial services, computer-aided emergency dispatch, and Web-based applications.

Linda Mentzer, Vice President of Marketing, Marathon Technologies

Linda Mentzer -January 18, 2002

Is this article likely to be updated for Exchange 2003 or later? I'd be happy to have a go.

adamfield -January 25, 2006
 Windows IT Pro is a Division of Penton Media Inc.
 © 2009 Penton Media, Inc. Terms of Use | Privacy Statement