AD System State and Exchange Store recovery are key
Is your Exchange 2000 Server environment a disaster waiting to happen? Exchange 2000's dependence on Windows 2000 Active Directory (AD) complicates Exchange 2000 disaster-recovery planning. Your recovery efforts might involve not just your Exchange team but also the people responsible for AD. Knowing how to back up and recover non-Exchange components, as well as being aware of recent changes in backup technologies, can help you plan and implement a course of action that can postpone disaster and speed recovery.
As an Exchange administrator, you're probably responsible for several aspects of your company's disaster-recovery process. This process can include conducting a risk analysis that identifies the probability and impact of an outage according to specific points of risk, as Table 1 shows; developing a risk-mitigation plan that defines risk-mitigation techniques for each possible type of outage; and implementing the plan on a day-to-day basis.
The first step in a disaster-recovery plan is a risk analysis. With the results of that analysis in hand, you can begin developing detailed procedures to protect each risk point and calm executives suffering from "Chicken Little Syndrome," who might insist that you look at high-impact events even though those events have a low probability. Concentrate instead on the most likely disasters, which typically result from hardware or application faults.
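One way to turn a risk analysis into a working priority list is to score each risk point and sort by that score. The following Python sketch assumes a simple probability-times-impact score; the risk points, probabilities, and outage hours are invented for illustration and are not the values from Table 1.

```python
from dataclasses import dataclass


@dataclass
class RiskPoint:
    name: str
    annual_probability: float  # assumed chance of an outage in a year (0..1)
    impact_hours: float        # assumed length of the resulting outage

    @property
    def exposure(self) -> float:
        # Expected outage hours per year under the assumed scoring rule.
        return self.annual_probability * self.impact_hours


def rank_risks(points):
    """Return risk points ordered from highest to lowest exposure."""
    return sorted(points, key=lambda p: p.exposure, reverse=True)


# Hypothetical figures, for illustration only.
RISKS = [
    RiskPoint("Disk failure in Store array", 0.30, 8),
    RiskPoint("Power-supply failure", 0.20, 4),
    RiskPoint("Database corruption", 0.10, 12),
    RiskPoint("Site-wide catastrophe", 0.0001, 240),
]

if __name__ == "__main__":
    for point in rank_risks(RISKS):
        print(f"{point.name:30s} {point.exposure:6.3f}")
```

With these made-up numbers, the high-impact but improbable catastrophe sorts to the bottom of the list, which is the argument to make to the Chicken Littles.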
From a planning standpoint, you need to make a distinction between increasing availability and speeding recovery. This distinction is important because anything you can do to increase availability and avoid disaster can save you a lot of time in recovery. For example, suppose that a power supply in one of your servers stops working. No big deal—if you have redundant power supplies. If you don't, though, the server will go down hard, meaning that the OS might flag the drives to run Chkdsk to check for disk errors. This process can take hours for large disk arrays, increasing your outage window even if you don't need to recover any data. If the server is part of a cluster, server failover will increase your availability to the extent that you might not even need to measure recovery time. For another example, using Exchange 2000's built-in deleted mailbox retention can avert the need to set up an Exchange recovery server.
If you can't make components fully fault tolerant, though, you need to build recovery mechanisms that will help get failed components back up quickly. For Exchange 2000 environments, the most likely disaster scenarios involve Exchange or AD server recovery and Exchange Store recovery. Therefore, your disaster-recovery kit needs to include System State backups (the restoration point for both AD and Exchange servers) and Store backups, and you need to know how to recover these components.
System State Recovery
Ideally, you can maintain a standby recovery server that consists of the same hardware as your production servers. You need to keep the standby server updated with the same service packs and hotfixes you've installed on your production servers (or at least verify the OS and application versions before performing a recovery). A standby server shouldn't be a member of a domain; a System State restore will establish the server's domain identity. (I've even seen a System State restore work across partitions, meaning that you can create or maintain a second Windows installation on a production server and simply boot to that installation to perform the restore. However, with the many security patches that have appeared since the CodeRed virus, managing multiple boot partitions is an unwieldy process.) Be sure to keep the standby server off the network if you haven't applied the most recent security patches. And be prepared to deal with not only an Exchange server failure but AD server failures as well.
Exchange server failure. When you recover an Exchange server, first restore the System State to the standby server and reboot. This action reestablishes the server's identity. Then, run Exchange Setup with the Disaster Recovery option, as Figure 1 shows (or use the /disasterrecovery switch from the command line). This option pulls the Exchange configuration directly from AD. Reapply the Exchange service pack that you were using.
Be aware, however, of two potential problems with the Disaster Recovery option. First, the option doesn't work in a cluster (you must evict the node and reinstall it manually). Second, the option might not correctly install the Microsoft Search component, which is necessary for full-text indexing. "Troubleshooter: Restoring a Clustered Exchange Database to a Nonclustered System," September 2002, http://www.exchangeadmin.com, InstantDoc ID 25839, discusses the first problem; the Microsoft article "XADM: Disaster Recovery Does Not Correctly Setup Full-Text Indexing" (Q295921, http://support.microsoft.com) documents the latter problem.
AD server failure. When an AD server crashes, you must decide whether to rebuild the server or replace it with a new server. Because AD uses multimaster replication, you might not immediately notice the loss of a domain controller (DC), with two exceptions. First, if the DC is a Global Catalog (GC) server, which Exchange uses for directory access and referrals to Outlook clients, Outlook clients might hang. If you're running a version earlier than Exchange 2000 Service Pack 2 (SP2), you might even need to reboot the Exchange server so that it can find another GC server (another good reason to apply SP2 or SP3). Second, if the DC owns a Flexible Single-Master Operation (FSMO) role, you need to manually seize that role, or specific operations such as password resets will fail.
As soon as you realize that a DC is down, you need to determine whether it owned any domain FSMO roles. To do so, open the Microsoft Management Console (MMC) Active Directory Users and Computers snap-in. Right-click the domain object, then select Operations Masters from the context menu. The resulting dialog box includes the Infrastructure, PDC, and RID tabs, which display each role. Alternatively, you can use the Microsoft Windows 2000 Server Resource Kit's dumpfsmos.cmd batch file or the Netdom utility (a Win2K Support Tool) to see which server holds which role. Be aware that you should seize the Schema Master only when you plan to replace, rather than rebuild, the server. I suggest you read the Microsoft article "Flexible Single Master Operation Transfer and Seizure Process" (Q223787, http://support.microsoft.com) before seizing any FSMO roles.
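If you capture the role holders to a text file ahead of time, a few lines of Python can tell you which roles a failed DC owned. This is a minimal sketch: the sample report, its column layout (a role label and a server name separated by two or more spaces), and the server names are assumptions for illustration, so adjust the parsing to whatever your tool actually prints.

```python
import re

# Hypothetical role report; the labels, layout, and names are assumed.
SAMPLE_REPORT = """\
Schema owner                dc01.example.com
Domain role owner           dc01.example.com
PDC role                    dc02.example.com
RID pool manager            dc02.example.com
Infrastructure owner        dc03.example.com
"""


def parse_role_holders(report):
    """Map each role label to its holder (lowercased server name)."""
    holders = {}
    for line in report.splitlines():
        # Split on runs of two or more spaces so labels may contain spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 2:
            role, holder = parts
            holders[role] = holder.lower()
    return holders


def roles_held_by(holders, failed_dc):
    """Return the sorted role labels owned by the failed DC."""
    failed = failed_dc.lower()
    return sorted(role for role, holder in holders.items() if holder == failed)


if __name__ == "__main__":
    holders = parse_role_holders(SAMPLE_REPORT)
    print(roles_held_by(holders, "DC02.example.com"))
```

The script only reports which roles are affected. Whether to seize them, particularly the Schema Master, remains the judgment call described above.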
When restoring the System State of an AD DC or GC server, you must first use Ntdsutil to run a metadata cleanup and remove the ntdsDSA object, or the recovered server won't be able to rejoin AD. The Microsoft article "HOW TO: Remove Data in Active Directory After an Unsuccessful Domain Controller Demotion" (Q216498, http://support.microsoft.com) explains this process.
Exchange Store Recovery
As an Exchange administrator, you're probably more familiar with the process of recovering the Exchange Store than with the process of recovering an AD server. However, to be sure your recovery plan is up-to-date, you need to be aware of recent changes in backup technology.
Most administrators make online Exchange Server 5.5 backups to tape, either over the network or to tape drives attached directly to the Exchange server. Improvements in networking speed have helped decrease restore times, but nothing can compare to dedicated drives. (The process of backing up Storage Area Network—SAN—drives over a Fibre Channel network might challenge that statement, but that process requires SAN hardware and backup software that can support LAN-free backups. Major vendors such as CommVault Systems, LEGATO Systems, UltraBac Software, VERITAS Software, and Hewlett-Packard—HP—support this type of backup, but few companies are ready to switch to this design.) New Linear Tape-Open (LTO) and Super DLTtape (SDLT) drives improve tape capacities and backup speeds, but only for those who can afford these drives. To truly reduce restore times without dedicating one tape drive per server, some companies back up to disk, then rotate the backup file to tape.
Companies that invest in the necessary technologies to back up to disk might eventually realize that they can achieve even greater restore performance when they use those disks for Business Continuance Volumes. BCVs come in two flavors: snapshots—which are virtual copies of the original data—and clones—which are additional mirrors of the physical data. (Think of a snapshot as a mental image of your high-school classmates—despite how they age and change, you maintain that virtual image of what they looked like years ago. For more information about data replication and remote copy sets, see "Data Replication Technology for your Exchange Deployments," http://www.exchangeadmin.com, InstantDoc ID 16077.) With either of these techniques—especially with cloning (after you break off a clone, it's just another disk volume and can stand on its own)—database-restore time doesn't involve copying data back into place. Recovering to a BCV merely involves giving the new volume the original database drive letter. (Exchange is rather particular about having databases in the same place when it attempts to mount recovered Stores. However, Exchange 2000 is a bit more forgiving than Exchange 5.5 and generates event-log errors informing you that you need to place the files in the correct location to restore the Stores.)
So how do you create a BCV? Either your RAID controller must provide the necessary functionality (e.g., a SAN controller that supports multimember mirror sets) or you must provide the functionality through software that runs on the host OS. BCV functionality doesn't require a SAN; you can use snapshot software and Direct Attached Storage (DAS). In addition, you need software to manage the process of breaking off, or splitting, the BCVs, especially in the case of snapshots. This management software can be a GUI or scripts, depending on your preference and budget.
How do you manage Exchange during the BCV split process? Earlier BCV backup solutions shut down the Exchange databases before splitting off the BCV. This process placed the database files in a consistent state but caused an interruption of service, as the Microsoft article "XADM: Offline Backup and Restoration Procedures for Exchange 2000 Server" (Q296788, http://support.microsoft.com) explains. To avoid this problem, some vendors have begun touting "hot-split" solutions that can break off the BCV while Exchange runs. The BCV that results from a hot-split backup is an inconsistent database, meaning that Exchange doesn't know the state of pending transactions—similar to what would happen if you pulled the power plugs out of your Exchange server. When the Stores come back online, Exchange must perform soft recovery to roll back partial transactions and roll forward pending transactions. If you perform a swap before Exchange comes online and instead mount the BCV, you must remove the checkpoint file, which applies to the replaced database and is no longer accurate.
Microsoft doesn't support this kind of solution but recognizes that customers need to be aware of the risks involved in using hot-split BCVs, as the Microsoft article "Hot Split Snapshot Backups of Exchange" (Q311898, http://support.microsoft.com) explains. Microsoft recommends that you use a backup application and the Microsoft online backup API to back up and restore Exchange databases. I've tested hot-split cloning solutions and have learned that you have no guarantee that the database will be recoverable (which is what the Microsoft article points out). If you implement such a solution incorrectly, it can cause more problems than it solves. For example, most BCV solutions operate at the disk-volume level. A drive might contain multiple Stores within multiple storage groups (SGs), so when you need to recover only one Store, a traditional restore will affect fewer users than a BCV swap, which replaces every Store on the volume.
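To see how much wider the impact of a volume-level swap can be, compare the number of users touched by each approach. The Python sketch below uses an invented layout; the Store names, drive letters, and user counts are hypothetical.

```python
# Hypothetical layout: every name, drive letter, and user count is invented.
STORES = [
    {"store": "SG1-Mailbox1", "volume": "E:", "users": 400},
    {"store": "SG1-Mailbox2", "volume": "E:", "users": 350},
    {"store": "SG2-Mailbox1", "volume": "E:", "users": 250},
    {"store": "SG3-Mailbox1", "volume": "F:", "users": 500},
]


def users_hit_by_store_restore(stores, failed_store):
    """Users affected when only the failed Store is restored."""
    return sum(s["users"] for s in stores if s["store"] == failed_store)


def users_hit_by_volume_swap(stores, failed_store):
    """Users affected when the whole volume holding the Store is replaced."""
    volumes = {s["volume"] for s in stores if s["store"] == failed_store}
    return sum(s["users"] for s in stores if s["volume"] in volumes)


if __name__ == "__main__":
    print(users_hit_by_store_restore(STORES, "SG1-Mailbox2"))
    print(users_hit_by_volume_swap(STORES, "SG1-Mailbox2"))
```

In this made-up layout, a single-Store restore touches 350 users, whereas swapping the E: volume touches all 1,000 users whose Stores share that drive.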
Handling log files is another concern. You must either identify the log files that need to be replayed into the databases or remove the checkpoint file so that all transaction logs are replayed. I prefer the latter method, which is more foolproof and can replay committed log files as fast as 4 seconds per file (the rate I achieve on my servers by using a hardware RAID controller with a mirrored pair of log disks). Of course, your mileage might vary (I've seen some servers take up to 1 minute per log file when dealing with many files), but as long as you don't let outstanding transaction log files grow to some ridiculous number, you're looking at an outage of minutes instead of hours. The key is to purge your log files with an online API-based backup before you perform your nightly clone split. You'll then know that you have a BCV for recovery and that the log files have been purged.
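The replay rates quoted above make it easy to estimate an outage window before you commit to replaying every log. This Python sketch simply multiplies the number of outstanding log files by a per-file rate; the 4-second and 60-second defaults come from the rates mentioned above, and the log count in the example is hypothetical.

```python
def replay_minutes(log_count, seconds_per_log):
    """Estimated replay time in minutes for a given per-file rate."""
    if log_count < 0 or seconds_per_log < 0:
        raise ValueError("log_count and seconds_per_log must be non-negative")
    return log_count * seconds_per_log / 60


def replay_window(log_count, best_seconds=4.0, worst_seconds=60.0):
    """Return (best-case, worst-case) replay time in minutes."""
    return (
        replay_minutes(log_count, best_seconds),
        replay_minutes(log_count, worst_seconds),
    )


if __name__ == "__main__":
    # Hypothetical backlog of 300 outstanding transaction log files.
    best, worst = replay_window(300)
    print(f"best case: {best:.0f} minutes, worst case: {worst:.0f} minutes")
```

A backlog of 300 files works out to 20 minutes at the fast rate but 300 minutes at the slow one, which is why purging the logs before the nightly split matters.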
If you want to reduce log-replay time by finding only the exact log sequence numbers to replay, you must dump the database header (for example, by running Eseutil with the /mh switch against the database file) to find the Log Signature value, then match this value to the log sequence. Doing so will ensure that you're specifying the correct log files (e.g., E00xxxxx.log) for that database. This method is more difficult, though, because you must dump the signature, match it to the log files, and remove the log files that don't need to be replayed. Each of these steps can introduce errors, whereas removing the checkpoint file simply lets Exchange blast through the log files.
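Once you know which log generation to start from, picking the matching files out of a directory listing is mechanical. The Python sketch below covers only that file-name filtering step; it assumes log names of the form E00xxxxx.log in which xxxxx is a hexadecimal sequence number, and the starting generation and file list are hypothetical. It doesn't read database or log headers, so you still need to verify the signature match yourself.

```python
import re


def log_generation(filename, prefix="E00"):
    """Return the sequence number encoded in a log file name, or None."""
    pattern = rf"{re.escape(prefix)}([0-9A-Fa-f]{{5}})\.log"
    match = re.fullmatch(pattern, filename, re.IGNORECASE)
    return int(match.group(1), 16) if match else None


def logs_to_replay(filenames, first_generation, prefix="E00"):
    """Return numbered logs at or above first_generation, in replay order."""
    selected = []
    for name in filenames:
        generation = log_generation(name, prefix)
        if generation is not None and generation >= first_generation:
            selected.append((generation, name))
    return [name for _, name in sorted(selected)]


# Hypothetical directory listing. The unnumbered current log (E00.log),
# the checkpoint file, and another storage group's log are skipped.
LISTING = [
    "E000000B.log", "E0000009.log", "E000000A.log",
    "E00.log", "E00.chk", "E0100001.log",
]

if __name__ == "__main__":
    print(logs_to_replay(LISTING, first_generation=0x0A))
```

Because the filter ignores the unnumbered current log file, handle that file separately.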
On a final note, if you upgrade your servers from Exchange 2000 Service Pack 1 (SP1) to SP2 or later, be sure you have a valid backup of the Stores after the upgrade. You can't restore an SP1 backup to SP2 or later because the backup and restore processes have changed so that no patch (.pat) file is created during backup. The Microsoft article "XADM: A Patch File Is Not Created During Backup" (Q316796, http://support.microsoft.com) acknowledges the change.
Ready to Recover?
When you're faced with disaster recovery, your level of success boils down to two factors: having what you need and knowing what to do. Putting the information in this article (and in the resources in "Related Reading") into practice can substantially improve your odds of success. Be sure your recovery team understands your plan, and practice in a test environment so that your team can recover quickly and correctly.
WINDOWS & .NET MAGAZINE ARTICLES
You can obtain the following articles from Windows & .NET Magazine's web site at http://www.winnetmag.com.
"Exchange 2000 Storage Exposed, Part 2," August 2000, InstantDoc ID 9073
"Repairing and Recovering AD," September 2002, InstantDoc ID 25957
"Practice Proactive AD Maintenance," August 2002, InstantDoc ID 25637
"Determining Operations Masters in a Win2K Forest and Domain," February 2002, InstantDoc ID 23403
"AD Disaster Recovery," August 2001, InstantDoc ID 21509
"Win2K Support Tools," February 2001, InstantDoc ID 16457
"6 Essential Tools for Troubleshooting AD Replication," April 2002, InstantDoc ID 24222
"Active Directory Disaster Recovery"
"Backup and Recovery Tip: Backing Up and Restoring Connectors on Exchange 2000"
"Disaster Recovery for Microsoft Exchange 2000 Server"
"Exchange 2000 Server Database Recovery"