You've probably been there. The help desk calls and says, "Users can't get into email. We think the Exchange server might be down." You check the server and find that the Information Store service or one of the message stores is offline. You work through the usual troubleshooting steps and try to restart the Information Store service, but your luck has run out: the database is corrupted beyond what a soft recovery can handle, and you need to restore from tape to get things going again.
Although most of what you've learned about Exchange Server 5.5 recovery is still valuable, Exchange 2000 Server has both subtle and obvious differences in the sequence, files, and procedures of the recovery process. Let's review some of Exchange Server 5.5's backup and restore basics and see how they've changed in Exchange 2000. Then, let's go through a step-by-step procedure for restoring a production Store in Exchange 2000.
What We Know from Exchange Server 5.5
Similar to the way in which some relational databases record data, Exchange Server stores use separate transaction log files to quickly record changes and later write the same information to the main database files. Figure 1 illustrates how the online backup process works. During normal operations (Figure 1A), Exchange doesn't immediately update the Store's primary database files with the latest transactions. As users compose, read, and delete mail, the Information Store service uses a large memory buffer to manage the changes that their transactions make. At the same time, the Information Store service writes these changes to the Store transaction log file (edb.log). At scheduled intervals (or when the memory buffer becomes full), the service writes the changes to the primary Store database files (i.e., the .edb files). As Exchange writes the pages to disk, it updates a checkpoint to reflect which transactions it has committed to the .edb files and which are still pending commitment (Figure 1B).
When the backup begins (Figure 1C), Exchange closes the current transaction log (E) and creates a new transaction log (F). During a normal backup, the Extensible Storage Engine (ESE) reads from the .edb files one page at a time. To the users and any other processes, the system continues to operate normally while the backup is running. Exchange continues to update the transaction logs immediately and periodically writes the memory buffer area into the .edb files. The checkpoint in Figure 1C signifies that Exchange has committed transaction logs A, B, and C, but not D, E, and F. As the backup proceeds (Figure 1D), Exchange reads 4KB pages from the .edb file and writes them to tape. Any new transactions are processed into the memory buffer and immediately written to the transaction logs as usual. When Exchange commits the buffer pages, it writes them from memory to the .edb file. If the section of the database file in which the page is written has already been backed up (the shaded squares in Figure 1), Exchange also writes the pages to a patch file. After Exchange has written all the pages in the .edb file to tape, it closes the patch file and saves that file to tape as well. Exchange doesn't save the committed transaction logs (anything before the checkpoint, in this case A, B, and C) to tape but deletes them because the data is on the tape in the form of the .edb and patch files. Similarly, Exchange doesn't write the current transaction log (F) to tape because it's an open file. The only transaction log files that are saved are the closed, uncommitted logs. In this example, Exchange saves the .edb file, the patch file, and transaction logs D and E.
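To make the bookkeeping in Figure 1 concrete, here is a minimal Python sketch of the decisions described above. It is a toy model rather than anything the ESE exposes: the function names, the letter-named logs, and the page numbers are all hypothetical, and the open log (F) is simply left out of the input because it is never written to tape.

```python
def plan_full_backup(closed_logs, committed_logs):
    """Split the closed logs into those saved to tape and those only purged.

    Follows the article's description: committed logs are deleted because
    their data is already in the .edb and patch files; closed but
    uncommitted logs are written to tape.
    """
    committed = set(committed_logs)
    saved = [log for log in closed_logs if log not in committed]
    purged = [log for log in closed_logs if log in committed]
    return saved, purged


def pages_for_patch_file(writes_during_backup):
    """Return the pages that must also be written to the patch file.

    Each write is a (pages_already_copied, page_number) pair. Pages are
    copied to tape in ascending order, so a write to a page below the
    current copy position touches a page that is already on tape.
    """
    return [page for copied, page in writes_during_backup if page < copied]


# The Figure 1 scenario: A, B, C committed; D and E closed but uncommitted.
saved, purged = plan_full_backup(["A", "B", "C", "D", "E"], ["A", "B", "C"])
print("saved to tape:", saved)
print("purged:", purged)

# Three hypothetical page writes that arrive while the backup is running.
patched = pages_for_patch_file([(3, 1), (3, 5), (6, 5)])
print("pages in patch file:", patched)
```

The second function is the reason patch files exist at all: a page that changes after the backup has already passed it would otherwise be missing from the tape.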
To reverse this process and restore an IS from tape, Exchange uses the Restore API to write the .edb files, the transaction logs, and the patch files to their original locations. During the restore process, Exchange also uses the API to create the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\Restore in Progress registry subkey. When the Information Store service starts, it checks for the presence of this subkey. If the subkey isn't present, the Information Store service initiates soft recovery and attempts to replay the existing transaction logs. If the subkey is present, the service begins hard recovery. Hard recovery first reads the patch files and integrates those changes into the .edb files. When the patches are in place, the system replays the transaction logs, bringing the .edb file up-to-date, and fully starts the service.
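The startup decision described above reduces to a single branch. The sketch below only models that branch in Python; the function name and the step labels are hypothetical, and it does not read the real registry.

```python
def recovery_plan(restore_key_present):
    """Return the ordered recovery steps the Information Store service takes.

    restore_key_present stands in for the presence of the
    'Restore in Progress' registry subkey described in the article.
    """
    if restore_key_present:
        return [
            "hard recovery",
            "apply patch files to the .edb files",
            "replay restored transaction logs",
            "start service",
        ]
    return [
        "soft recovery",
        "replay existing transaction logs",
        "start service",
    ]


print(recovery_plan(True))
print(recovery_plan(False))
```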
How Is an Exchange 2000 Backup Different?
In Exchange 2000, the general backup process is much the same, but storage groups (SGs) and multiple stores mean that you have more files to keep track of and back up. In Exchange 2000 Enterprise Server, each server can support up to four SGs; each SG can have up to five databases, but each SG uses only one set of transaction logs for all the databases within that group.
When you're performing a full backup, the process I described for Exchange Server 5.5 is basically the same in Exchange 2000. Exchange 2000 still reads the databases and verifies the checksum page by page as it saves them to tape, and it generates patch files for the transactions that occur in pages that have already been saved to tape. However, the process has a few significant differences:
- Both the .edb and new streaming (.stm) files are saved to tape.
- If you have multiple backup devices on your system, you can now back up multiple SGs in parallel.
- The process uses multiple checkpoint files, one for each SG.
- You can choose to back up at the SG level, which captures all the databases within the group, or you can back up at the individual database level.
If you back up at the database level, Exchange 2000 doesn't purge the transaction logs until it has backed up all the databases in the SG. This behavior is necessary because the transactions for the individual databases within the SG are intermixed throughout the single set of transaction log files. If the system were to purge the committed log files during the backup of the first database in the group without backing up the other databases, you might not be able to easily recover those later databases. When you back up each database individually, Exchange 2000 might save the uncommitted transaction logs to tape many times, depending on which logs have been committed when the individual backup job starts. Therefore, if you back up at the database level, you need to allocate enough tapes to allow for multiple copies of the logs. You might also need some extra disk capacity to maintain the log files until all databases in the SG are safely on tape and the logs are purged.
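The tape and disk cost of database-level backups is easier to see in a small simulation. The Python sketch below is a deliberate simplification with hypothetical names: it assumes that every job copies every log still on disk and that no new logs appear between jobs, which is not how a live server behaves.

```python
def database_level_backups(databases, logs):
    """Simulate backing up each database in one SG as a separate job.

    Returns (tape, logs_left_on_disk). tape holds one (database, logs_copied)
    entry per job. Logs are purged only after the last database in the
    SG has been backed up.
    """
    remaining = set(databases)
    logs_on_disk = list(logs)
    tape = []
    for db in databases:
        tape.append((db, list(logs_on_disk)))  # this job's copy of the logs
        remaining.discard(db)
        if not remaining:
            logs_on_disk = []  # whole SG is on tape, so the purge can happen
    return tape, logs_on_disk


# Hypothetical three-database SG sharing logs D and E.
tape, left = database_level_backups(
    ["Mailbox Store 1", "Mailbox Store 1b", "Public Folder Store"], ["D", "E"]
)
copies_of_d = sum(1 for _, copied in tape if "D" in copied)
print("copies of log D on tape:", copies_of_d)
print("logs left on disk:", left)
```

In this model, a three-database SG puts three copies of each shared log on tape, which is the extra tape capacity the paragraph above warns about.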
How Is an Exchange 2000 Restore Different?
The differences between the Exchange 2000 and the Exchange Server 5.5 restore procedures are much more pronounced than between the two backup procedures. First, in Exchange 2000, the restore process communicates directly with the Store instead of the System Attendant, so the Information Store service must be running. Because the Store process now runs as a separate instance for each SG, you don't need to take the other stable databases offline to restore a corrupt database.
Second, Exchange 2000 has safeguards in place to ensure that you don't accidentally restore over a running database. When a database is in an online, or mounted, state, a restore can't overwrite it. You must manually specify that the database file can be overwritten during the restore operation. These safeguards prevent accidental corruption of a running Store and the accidental overwriting of the wrong file when you have multiple stores dismounted.
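The preconditions from the first two differences can be collected into one preflight check. This Python sketch is only a checklist model; the function name and its three boolean inputs are hypothetical and would have to be filled in by whatever means you use to inspect the server.

```python
def restore_blockers(store_service_running, database_mounted, overwrite_allowed):
    """Return the reasons a restore of one database would be refused.

    An empty list means the preconditions described in the article are met.
    """
    blockers = []
    if not store_service_running:
        blockers.append("Information Store service is not running")
    if database_mounted:
        blockers.append("database is still mounted")
    if not overwrite_allowed:
        blockers.append("database is not marked as overwritable by a restore")
    return blockers


print(restore_blockers(True, False, True))   # ready to restore
print(restore_blockers(True, True, False))   # both safeguards still active
```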
Third, the restore process doesn't create an HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeIS\Restore in Progress registry subkey. Instead, the restore process creates a restore.env file that contains the same type of information. This file, like its registry subkey predecessor, controls what happens during a hard recovery by specifying, among other things, log sequence information and the locations of the original and restored files. When you initiate a restore-from-tape operation and select the Last Backup Set check box in the Restoring Database Store dialog box, as Figure 2 shows, the restore process uses the information in the restore.env file to complete the hard-recovery process. Selecting this check box signals that immediately after the files are restored, the ESE should begin to apply the patch files and replay the transaction logs.
Fourth, Exchange 2000 restores the .edb and the .stm files to their original locations. However, Exchange 2000 places the patch and log files into a temporary directory that you specify in the Restoring Database Store dialog box when you start the restore job. This action prevents the restored log files from becoming mixed in with the production logs. The alternate location is also used because one or more of the databases within the system might still be running. Using an alternate directory reduces the risk of conflicts with other operations, such as backing up one database at the same time you're restoring another.
Fifth, the order in which you restore the backup sets is important. Generally, for Store recovery you need a full backup, but you might also need a differential backup or series of incrementally captured transaction logs. You must restore the full backup set first, followed by the incremental or differential backups. This sequence lets Exchange 2000 write the correct information about committed transaction logs as well as the paths of all the restored files to the restore.env file.
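The ordering rule can be written down as a short sorting routine. The Python sketch below covers only the full-plus-incrementals case; differential sets are not modeled, and the dictionary keys, labels, and dates are all hypothetical.

```python
from datetime import date


def restore_sequence(backup_sets, use_last_backup_set=True):
    """Order backup sets for restore: newest full first, then later incrementals.

    Returns (label, select_last_backup_set) pairs. The flag is True only on
    the final set, and only if hard recovery should start automatically.
    """
    fulls = [b for b in backup_sets if b["kind"] == "full"]
    if not fulls:
        raise ValueError("a full backup is required before any incrementals")
    full = max(fulls, key=lambda b: b["taken"])
    incrementals = sorted(
        (b for b in backup_sets
         if b["kind"] == "incremental" and b["taken"] > full["taken"]),
        key=lambda b: b["taken"],
    )
    ordered = [full] + incrementals
    last = len(ordered) - 1
    return [(b["label"], use_last_backup_set and i == last)
            for i, b in enumerate(ordered)]


sets = [
    {"label": "Tue incremental", "kind": "incremental", "taken": date(2001, 3, 6)},
    {"label": "Weekend full", "kind": "full", "taken": date(2001, 3, 3)},
    {"label": "Mon incremental", "kind": "incremental", "taken": date(2001, 3, 5)},
]
for label, last_set in restore_sequence(sets):
    print(label, "(Last Backup Set)" if last_set else "")
```

Passing use_last_backup_set=False models the case in which you leave the check box cleared on every set and start hard recovery yourself.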
Finally, if you have multiple tape devices, you can restore databases in parallel. In addition, you can restore databases from within the same SG in parallel. However, be careful about the order and steps you use when performing parallel restore operations. The ESE can handle only one hard-recovery operation at a time within an SG. To prevent problems, use the Eseutil utility instead of the Last Backup Set option to trigger hard recovery.
The restore.env file is crucial to a successful recovery. When you're performing parallel restores, best practice is to use different temporary directories to prevent unintentionally overwriting this file.
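One way to follow that practice is to derive a separate temporary directory from each database name before you start the restore jobs. The Python sketch below assumes a hypothetical base directory (D:\restore) and a hypothetical naming rule; it only computes the paths and creates nothing on disk.

```python
import re
from pathlib import PureWindowsPath


def temp_restore_dirs(base_dir, database_names):
    """Map each database name to its own temporary restore directory.

    Raises ValueError if two names would collapse to the same directory,
    because a shared directory would let one restore.env overwrite another.
    """
    dirs = {}
    seen = set()
    for name in database_names:
        folder = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
        path = str(PureWindowsPath(base_dir) / folder)
        if path.lower() in seen:  # Windows paths are case-insensitive
            raise ValueError(f"directory collision for {name!r}: {path}")
        seen.add(path.lower())
        dirs[name] = path
    return dirs


dirs = temp_restore_dirs(
    r"D:\restore", ["Mailbox Store 2 (EX1)", "Mailbox Store 2b (EX1)"]
)
for name, path in dirs.items():
    print(name, "->", path)
```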
A Restore Example
In the following recovery scenario, the server has two SGs. As Figure 3 shows, one SG has three databases (two mailbox stores and one public folder store), and the second SG has two mailbox stores. (The Web-exclusive sidebar "Keep It Simple and Consistent," http://www.exchangeadmin.com, InstantDoc ID 20021, gives suggestions for naming SGs and databases.) For larger sites, this arrangement is likely to be one of the most common partitioning schemes because of the number of available disk spindles and RAID configurations, as well as the high virtual-memory requirements associated with each additional storage group. For more information about factors to consider, see Jerry Cochran's Exchange & Outlook UPDATE "Exchange 2000 Server Sizing: Memory and Disk Subsystems," http://www.win2000mag.com, InstantDoc ID 16499.
Don't assume that this layout of disks and directories will always be optimal. You must consider many factors when you're designing a disk subsystem for a production deployment. For example, in the extreme case, you might need to place the streaming files on a separate disk array to get the best performance.
For this example, assume that the disk array that supported the second SG suffered a catastrophic failure and required replacement. Before the catastrophe, you were performing SG-level full backups on the weekend and incremental backups each day of the workweek. After you replace the hardware, you need to restore from tape to get the databases back online. Here are the steps you follow:
1. Unlike in Exchange Server 5.5, the Information Store service must be running for a restore to succeed; if it isn't running, the restore will fail. Using the Microsoft Management Console (MMC) Computer Management snap-in, verify that the Information Store service is running. (On the system I've described, this step isn't strictly necessary: if the Information Store service weren't running, the users hosted on the First Storage Group wouldn't have email access. I've included the step here just to be complete.)
2. Using the MMC Exchange System Manager snap-in, right-click the database you want to restore and confirm that the context menu shows Mount Store as an option. If the menu shows Dismount Store, select this option to dismount the store. In Figure 3, you would right-click Mailbox Store 2 (EX1) to confirm that you've disabled the first accidental restore safeguard.
3. Right-click Mailbox Store 2 (EX1) again, select Properties, and click the Databases tab.
4. Select the This database can be overwritten by a restore check box, as Figure 4 shows. You must select this check box to deactivate the other safeguard that prevents you from accidentally restoring over a production database.
5. Repeat Steps 2 through 4 for the other database in the SG, Mailbox Store 2b (EX1).
6. Before you proceed with the tape restore, save copies of the existing .edb and .stm files by copying them to another directory, partition, tape, or even to another server. Although the files are damaged, they might turn out to be your best option for recovering your users' email. If luck isn't with you and you find that you have bad backup tapes or corrupt files saved on those tapes, you might need to recover what you can from the files you're trying to replace. So, save these files in advance to guarantee that the restore doesn't overwrite them.
7. Using the Windows NT Backup program, restore the most recent full backup. (Although you can use third-party products to back up and restore databases, I use NT Backup here because everyone will have the program as part of the OS.) When you initiate the restore, you'll see the dialog box that Figure 2 shows. Enter a path to a directory in which the system can temporarily restore the transaction logs. If you're restoring multiple databases in parallel, specify an empty or nonexistent directory to avoid accidentally overwriting other restored files or copies of the restore.env file. If the directory doesn't exist, the restore process will create it.
8. Use the same temporary path that you specified in the previous step to restore the incremental backups in chronological order. Ordinarily, you would select the Last Backup Set check box when you restore the last incremental set. However, in this example, leave the check box cleared so that you can perform Steps 9 and 10 to see how you can use Eseutil to manually trigger the hard-recovery process.
9. Open a command prompt, and change to the temporary directory that you used during the restore. In this directory, you'll see the patch files, the log files, and the restore.env file. By default, Eseutil resides in the C:\program files\exchsrvr\bin directory. Because of the space between program and files in the path, you must surround the path with quotation marks. To trigger hard recovery, enter the following command:
   "C:\program files\exchsrvr\bin\eseutil" /cc
10. When the hard-recovery process completes, Exchange 2000 automatically deletes the restored files from the temporary directory. To be sure the recovery was successful, review the Application event log. Event ID pairs 204 and 205 from the ESE98 event source signal the start and completion of the log replay process.
11. In Exchange System Manager, right-click the databases one by one and select Mount Store to bring them back online for users to access.
Because the backup scheme created incremental backups after the weekend's full backup, remember two other points. First, you must use the same restore directory path when you restore the incremental backups so that all the necessary files are in the same location. Second, if you didn't have any incremental backups, you would select the Last Backup Set check box. However, because you have incremental backups, leave that option blank to prevent initiating hard recovery prematurely.
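If you script the Eseutil step of the procedure above, the two details that matter are the quoting of the path and the working directory. The Python sketch below assumes the default install path given in the procedure and uses /cc, Eseutil's hard-recovery switch; the function names are hypothetical, and the function that actually launches Eseutil is defined but never called here.

```python
import subprocess
from pathlib import PureWindowsPath

# Default location from the procedure above; adjust for your installation.
ESEUTIL = PureWindowsPath(r"C:\program files\exchsrvr\bin") / "eseutil"


def hard_recovery_command(eseutil_path=ESEUTIL):
    """Build the command line, quoting the path only if it contains a space."""
    path = str(eseutil_path)
    quoted = f'"{path}"' if " " in path else path
    return f"{quoted} /cc"


def run_hard_recovery(temp_restore_dir):
    """Run Eseutil from the directory that holds restore.env (Windows only).

    check=True raises CalledProcessError if Eseutil exits with an error.
    """
    return subprocess.run(hard_recovery_command(), cwd=temp_restore_dir, check=True)


print(hard_recovery_command())
```

Setting cwd to the temporary restore directory is the scripted equivalent of changing to that directory at the command prompt, so that Eseutil finds the restore.env file there.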
Practice Makes Perfect
I hope this information has helped you understand the processes and steps in Exchange 2000 backup and recovery. If you want to become more familiar with the concepts before you have to deal with a catastrophe, I suggest building a test server with a configuration similar to the one that I used to develop this article. Back up your server at the SG and individual database level, then restore the databases one by one to see the differences in the files and directories that the process creates. Monitor the files in the original and temporary directories to see what is created, and review the event-log details to see what happens at each step.
Although the urge is hard to resist, don't start with complex operations such as parallel or multiple database restores. Keep your experiments and procedures simple as you're learning and get a firm grasp on the basics before you move on to more challenging tasks. The closer you observe and the more you practice these procedures, the more able you'll be to build and execute bulletproof disaster-recovery plans.