Recovering from a database corruption in Exchange Server 5.5 isn't a simple task, but Exchange Server 5.5 administrators have accumulated years of experience in building disaster-recovery plans. Resources such as Microsoft's excellent disaster-recovery white paper (http://www.microsoft.com/exchange/techinfo/disaster.htm) explain the steps that you need to take when corruption strikes.
Exchange 2000 Server introduces a more complex database environment to master. In Exchange 2000 Enterprise Server, you can partition the Store across up to four storage groups (SGs), each of which can support up to five mailbox or public databases. In addition, Active Directory (AD) has replaced the Exchange Directory database, so you must revisit directory-restore procedures in case problems occur on a server acting as a domain controller (DC) or Global Catalog (GC) server. Dealing with an AD failure deserves separate consideration.
My group's Exchange 2000 server recently had a hardware failure that resulted in a corrupt mailbox store. I hope that the experience I gained in diagnosing, understanding, and dealing with the corruption is helpful to anyone who has to prepare a disaster-recovery plan for Exchange 2000.
The Exchange 2000 Environment
I manage a two-node Exchange 2000 cluster that's running inside Compaq's worldwide Exchange organization, which is in the process of migrating from Exchange Server 5.5. So far, roughly 30 percent of Compaq's 300 or so servers have moved to Exchange 2000. The cluster consists of two Compaq ProLiant DL380 servers (an entry-level cluster solution from Compaq). About 30 users are on the cluster, and all the mailbox limits are disabled, so the mailbox store is relatively large (about 6.5GB) for the small number of users. A small group of consultants use the cluster, which is an atypical deployment at Compaq. The consultants work with Microsoft technologies and try to establish the limits of Exchange 2000 and Windows 2000. In this case, the group found one of the limits and learned some hard lessons.
The database corruption occurred on a Friday, when the deployment experienced a sudden power outage at about 5:00 p.m. Unfortunately, the team hadn't configured the cluster hardware properly. Although we had enabled the write-back cache on the RAID array controller, the cache wasn't protected against sudden outages because we had incorrectly connected it to a UPS. (The array controller on our cluster—CR3500—has no battery backup. Owners of this array controller usually buy a UPS and connect it directly to the controller to get around this limitation.) The improper connection meant that any transaction data for the Store that was in the cache and hadn't been committed to disk was lost. Usually, the transaction logs rescue this situation, but they didn't in this case because the transaction logs themselves were corrupted.
One important difference between Exchange 2000 and Exchange Server 5.5 is how a database problem manifests itself. In Exchange Server 5.5, if either of the Information Store (IS—public or private) databases is corrupt, the Information Store service won't start. In Exchange 2000, the Information Store service will start, but it won't mount any database that has a problem. This difference is important because in Exchange 2000, users whose mailboxes are in the unaffected stores can continue working while administrators fix the corrupt databases. Unfortunately, all our mailboxes are in one store, so the failure affected everyone.
When power was restored, I used Microsoft Cluster Administrator, as Figure 1 shows, to monitor the startup of the cluster and noted that the Information Store service for the Exchange virtual server appeared to have started. (Exchange 2000 runs in cluster environments in one or more virtual servers that are allocated to the physical nodes in the cluster.) I then checked the Microsoft Management Console (MMC) Exchange System Manager snap-in to verify that the Information Store service had mounted the mailbox store. At this point, the problem became evident because the store hadn't been mounted. I attempted to mount the store and saw the error that Figure 2 shows. The Application event log showed event ID 412, The log file is corrupt, which further confirmed that a problem existed. Figure 3 shows the details of this error.
The Next Steps
At this point, I needed a cool head. When an Exchange server goes down, the Help desk gets swamped with calls. Despite pressure from all sides, I knew I shouldn't set an unrealistic time for completing the restoration of service. Based on my experience, I recommend that you follow these steps.
Find the last backup tape. Some IT departments ship their backup tapes to a remote site for storage. Because you'll definitely need the last backup tape, you need to start the process of retrieving it immediately.
Take a copy of the mailbox and public folder stores (Exchange and streaming databases). Although a database might be corrupt, you must take a copy of the existing databases. In Exchange Server 5.5, the private IS consists of one file (priv.edb). In Exchange 2000, each store consists of an Exchange (.edb) file and a streaming (.stm) file. Don't forget to take a copy of the streaming database.
A restore can overwrite the corrupt database, so you need a way back to the state of your database when the corruption occurred. If the restore is unsuccessful, you might be able to repair this database. In this scenario, you want the most recent version of the database, even if it's damaged. If the files are large, you can save time by renaming them to something meaningful (e.g., priv1.oldedb, priv1.oldstm). Remember to leave enough disk space on the database drive for the restore.
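The copy step can be scripted so that nothing is missed under pressure. The following is a minimal Python sketch, not a procedure from Microsoft: the function name, the directory arguments, and the restore_bytes_needed figure are all hypothetical, and the sketch assumes the store is dismounted so that the files aren't locked. It copies every .edb and .stm file to a safe location and reports whether the database drive still has room for the restore.

```python
import shutil
from pathlib import Path


def preserve_store_files(db_dir, safe_dir, restore_bytes_needed):
    """Copy every .edb and .stm file in db_dir into safe_dir.

    Returns (copied, enough_space): the list of copied paths and whether the
    database drive still has at least restore_bytes_needed bytes free.
    All names and paths here are illustrative assumptions.
    """
    db_dir, safe_dir = Path(db_dir), Path(safe_dir)
    safe_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(db_dir.iterdir()):
        # Pick up both halves of each store: the Exchange and streaming files.
        if src.is_file() and src.suffix.lower() in {".edb", ".stm"}:
            dst = safe_dir / src.name
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            if dst.stat().st_size != src.stat().st_size:
                raise IOError(f"size mismatch copying {src.name}")
            copied.append(dst)
    # Check the space left on the database drive for the restore itself.
    free = shutil.disk_usage(db_dir).free
    return copied, free >= restore_bytes_needed
```

If the second return value is False, free up space or restore to a different drive before you start.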
Make a copy of the transaction logs. Making a copy of the transaction logs is crucial because the transaction logs enable recovery up to the moment of the outage. Check the dates of the transaction logs, and verify that they were created since the last backup.
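Checking the log dates by eye is error prone when a directory holds many files. The following Python sketch shows one way to automate the check; the function name and arguments are hypothetical, and it assumes the file modification time is a fair proxy for when Exchange last wrote to each log.

```python
from datetime import datetime
from pathlib import Path


def logs_since_backup(log_dir, last_backup):
    """Split *.log files into those modified after last_backup and the rest.

    last_backup is a datetime in local time. Returns (newer, older) as lists
    of file names. Modification time is used as an assumption about log age.
    """
    cutoff = last_backup.timestamp()
    newer, older = [], []
    for path in sorted(Path(log_dir).glob("*.log")):
        if path.stat().st_mtime > cutoff:
            newer.append(path.name)
        else:
            older.append(path.name)
    return newer, older
```

Logs that land in the older list predate the backup and shouldn't be needed for replay; an empty newer list suggests you're looking in the wrong directory.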
Disable inbound mail connections. A server recovery might require several restore attempts, and you don't want queued inbound messages delivered until you're satisfied that the restore process was successful and everything is running normally. Therefore, you need to disable the Message Transfer Agent (MTA) in Exchange Server 5.5 or mixed-mode sites or the default SMTP virtual server in pure Exchange 2000 sites.
If you're using an MTA, be sure to make a copy of the \mtadata directory. If queued messages are accidentally delivered during troubleshooting, a copy of the \mtadata directory lets you replay the messages in a procedure I describe later in this article.
Keep a log of the restore process. Document every task performed as part of the restore process. That way, if you need to hand over the restore process to someone else or get help from Microsoft, the other people will better understand the steps you've taken to restore service. You can also use the log to refine your Exchange disaster-recovery procedures.
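A restore log doesn't need to be elaborate. As a minimal sketch, the following Python helper appends one timestamped line per task to a text file; the function name, the file layout, and the field separator are my own assumptions, and a paper notebook works just as well.

```python
from datetime import datetime


def record_step(log_path, action, outcome, now=None):
    """Append one timestamped line describing a restore step and return it.

    now is optional and exists only so the timestamp can be supplied
    explicitly; by default the current local time is used.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} | {action} | {outcome}"
    # Append mode keeps earlier entries intact across repeated calls.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

For example, record_step("restore.log", "Renamed E00.chk to E00.old", "store still won't mount") leaves a trail that someone taking over the restore can follow.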
Troubleshooting the Corruption
I renamed the checkpoint file from E00.chk to E00.old. The checkpoint file keeps track of the buffers that the Information Store service has written from memory into databases by using a pointer that specifies which transaction log has the latest transaction that was written to the database. After a restore of a database, the service consults the checkpoint file to find out which transactions are outstanding and need to be applied. In earlier versions of Exchange, the checkpoint file would occasionally become corrupt. Deleting the checkpoint file and restarting the Information Store service recreates the checkpoint file and often lets a database recover successfully. This action was unsuccessful in my case.
The event log showed that the most recent transaction log file was corrupt. I attempted to replay the transaction logs without this transaction log (E00.log) by removing the most recent log file, renaming the preceding log file to E00.log, and deleting the checkpoint file. An attempt to mount the database again ended in failure. The event log indicated that the checksums in the headers of the log files didn't match the headers in the database.
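To rename the preceding log file, you first have to identify it. The following Python sketch picks the highest-numbered log from a list of file names. It rests on an assumption that isn't spelled out above: that numbered logs are named with the storage group's prefix (E00 here) followed by five hexadecimal digits, as in E0000001.log. The function name is hypothetical, and the sketch only reads names; it renames and deletes nothing.

```python
import re


def latest_numbered_log(names, prefix="E00"):
    """Return the numbered log with the highest generation, or None.

    Assumes names of the form <prefix> + 5 hex digits + ".log". The current
    log (<prefix>.log) and other files such as the checkpoint are ignored.
    """
    pattern = re.compile(
        re.escape(prefix) + r"([0-9A-F]{5})\.log", re.IGNORECASE
    )
    best_name, best_generation = None, -1
    for name in names:
        match = pattern.fullmatch(name)
        if match:
            generation = int(match.group(1), 16)  # generations count in hex
            if generation > best_generation:
                best_name, best_generation = name, generation
    return best_name
```

Comparing the numbers as hexadecimal matters: a plain alphabetical sort would still work for fixed-width names, but parsing the generation makes the intent explicit and guards against stray files.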
Finally, I copied back the version of the database I'd saved before I began troubleshooting and tried in vain to mount it without any transaction logs. I used the Eseutil utility to examine the database header and learned that the database was in an inconsistent state.
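If you save the header dump that Eseutil produces (for example, by redirecting the output of eseutil /mh to a text file), you can pull out the state field with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, it assumes the dump contains a line labeled State:, and the exact wording of the value varies between Exchange versions, so the function returns the text as found rather than interpreting it.

```python
def database_state(header_text):
    """Return the text after the first 'State:' label, or None if absent.

    header_text is the saved output of a database header dump. Only a line
    whose label is exactly 'State' (ignoring case and indentation) matches.
    """
    for line in header_text.splitlines():
        label, separator, value = line.strip().partition(":")
        if separator and label.strip().lower() == "state":
            return value.strip()
    return None
```

Record the value in your restore log before each recovery attempt so that you can tell whether an attempt changed anything.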
At this point, I had two options: Use Eseutil to repair the database or restore the database from tape. I decided not to perform an Eseutil repair because that repair would prevent recovery of the information in the transaction logs, as the warning in Figure 4 shows. In addition, running Eseutil repair typically results in loss of data.
I had a problem starting the restore. Windows NT Backup was unable to determine the server name it was running under. After a quick search of the Microsoft Web site, I discovered that this problem is a known bug in Exchange 2000. The Microsoft article "XADM: The 'ESEUTIL /CC' Command Does Not Work on Cluster Server" (http://support.microsoft.com/support/kb/articles/q266/6/89.asp) explains the problem. Eseutil attempts to use the local node name instead of the cluster name to perform recovery. The workaround is to type
at a command prompt. In the command, cluster_name corresponds to your cluster's network name. This action let me begin the restore procedure.
The Restore Sequence
I selected the This database can be overwritten by a restore check box on each mailbox store and public folder store I was restoring. Then, I opened NT Backup and selected the databases from the backup set. As Joseph Neubauer explains in "Restoring the Exchange 2000 Store Step by Step," page 1, if you perform full backups, you need to select the Last Backup Set check box in the Restoring Database Store dialog box. This option tells NT Backup that you're restoring from the most recent backup and that you want recovery to take place. Alternatively, you can use the /cc switch with Eseutil to replay the transaction logs. You also need a temporary location for files created during the restore.
My first attempt at the restore was ineffective. I copied back the log files created since the last backup and let the restore automatically replay the transaction logs since the backup. Unfortunately, some of the transaction logs Exchange had created since the backup were corrupt, so they in turn corrupted the restored database when I replayed them.
I contacted Microsoft Premier Support Services for assistance. Microsoft informed me that if a transaction log corruption occurs, you can't recover any data by replaying the transaction logs created since the last backup. I reran the restore procedure. This time, I removed all transaction logs, and I mounted the stores successfully. At this stage, the server was restored to the point of the last backup.
Before the restore, I had made a copy of the \mtadata directory on the Exchange 2000 server. Now, I replayed the messages in the following sequence:
- I stopped the MTA on the Exchange Server 5.5 server that was relaying messages to the Exchange 2000 server.
- I allowed Exchange to deliver all messages in the \mtadata directory.
- I stopped the MTA on the Exchange 2000 server.
- I deleted the contents of the \mtadata directory.
- I copied the contents of my saved copy of the \mtadata directory back to the live \mtadata directory.
- I restarted the MTA on the Exchange 2000 server.
- I restarted the MTA on the Exchange Server 5.5 server that was relaying messages to the Exchange 2000 server.
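The two file-handling steps in the middle of this sequence (emptying the \mtadata directory and copying the saved contents back) are the easiest to get wrong by hand. The following Python sketch shows one way to do them in a single call. The function name and both path arguments are hypothetical, and the sketch assumes that you've already stopped the MTA so that no files in the live directory are open.

```python
import shutil
from pathlib import Path


def replace_directory_contents(saved_dir, live_dir):
    """Empty live_dir, then copy everything from saved_dir into it.

    Returns the sorted names now present in live_dir. Run this only while
    the service that owns live_dir is stopped (an assumption of the sketch).
    """
    saved_dir, live_dir = Path(saved_dir), Path(live_dir)
    if not saved_dir.is_dir():
        # Refuse to delete anything if the saved copy is missing.
        raise FileNotFoundError(saved_dir)
    for entry in live_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    for entry in saved_dir.iterdir():
        target = live_dir / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target)
        else:
            shutil.copy2(entry, target)
    return sorted(path.name for path in live_dir.iterdir())
```

Checking for the saved copy before deleting anything is deliberate: if the path is mistyped, the live directory is left untouched.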
My experience with Exchange 2000 disaster recovery has taught me some important lessons. Remember these points.
- Make sure that you configure your hardware properly, especially the I/O subsystem. Place transaction logs and databases on separate physical disks and, if possible, on separate array controllers. Consult your hardware vendor to verify that the cache settings on the array controllers are correct for your configuration.
- If transaction log corruption occurs, you might not be able to recover messages sent and received since the last backup.
- Make a copy of the databases and transaction logs before attempting any recovery.
- Disable the incoming mail connections on your server. You don't want to deliver queued messages until you've completed the restores successfully.
- Practice and document your disaster-recovery procedures. Revisit these procedures when you deploy Exchange 2000 Service Pack 1. SP1 includes changes to the restore process.
- Assemble a disaster-recovery kit. This kit should include Exchange 2000 and Win2K CD-ROMs, disaster-recovery procedures, server build documents, emergency contact lists, and contact details for Microsoft Product Support Services (PSS).
- When you've resumed service, buy yourself a coffee. And be sure to keep all the thank-you messages from your grateful Microsoft Outlook users!