Understand the Basics
Email systems depend on many hardware and software components. If a component malfunctions, if the hardware fails catastrophically, or if a physical disaster such as a power outage strikes, you need good system backups to get users back online as quickly as possible.
The Exchange 2000 Server installation procedure enhances the standard Windows 2000 Server Backup utility (ntbackup.exe) to support the Exchange Store's transactional nature. These enhancements add support for Exchange 2000's .edb and .stm file formats, let backup agents (i.e., ntbackup.exe or third-party products) copy databases to tape without shutting down Exchange services, and let you select which servers and databases to back up or restore. Understanding the basics of the most important and useful disaster-recovery processes—including full backups, snapshot and clone backups, and the general recovery procedure—can help you prepare for disasters and recover from them quickly.
Incremental and differential backups copy only transaction logs to the backup set. Incremental backups copy the logs created since the most recent backup of any type; differential backups copy the logs created since the most recent full backup. Thus, to restore Exchange databases, you need one of three combinations: the most recent full backup alone, the most recent full backup plus all incremental backups taken since then, or the most recent full backup plus the most recent differential backup. (Some companies believe that taking a mailbox-level backup, aka a brick-level backup, is useful because then you can quickly restore a mailbox or specific items that have been deleted accidentally. However, Exchange 2000's Deleted Mailbox Recovery feature generally eliminates the need for this type of backup. Brick-level backups are now an anachronism; avoid them whenever possible because of the performance penalty they impose.) Restoring from a full backup is easiest because it involves the fewest tapes and the least chance for mistakes.
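The selection rule above can be sketched in a few lines of Python. This is only an illustration: the BackupSet type, its field names, and the integer timestamps are hypothetical, and the sketch assumes a schedule that uses either incremental or differential backups between full backups, not a mix of both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupSet:
    taken_at: int  # any sortable timestamp; integers keep the example short
    kind: str      # "full", "incremental", or "differential"


def sets_needed_for_restore(history):
    """Return the backup sets a restore needs, listed chronologically."""
    ordered = sorted(history, key=lambda b: b.taken_at)
    fulls = [b for b in ordered if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available; nothing to restore from")
    last_full = fulls[-1]

    # Only sets taken after the most recent full backup matter.
    later = [b for b in ordered if b.taken_at > last_full.taken_at]
    incrementals = [b for b in later if b.kind == "incremental"]
    differentials = [b for b in later if b.kind == "differential"]
    if incrementals and differentials:
        # Assumption of this sketch: mixed schedules are not modeled.
        raise ValueError("mixed incremental/differential schedule not modeled")

    if differentials:
        # A differential holds every log since the full backup,
        # so only the newest one is needed.
        return [last_full, differentials[-1]]
    # Each incremental holds only the logs since the previous backup,
    # so all of them are needed.
    return [last_full] + incrementals
```

For a week of nightly incremental backups after a Sunday full backup, the function returns the full backup and every incremental set; for nightly differentials, it returns the full backup and only the latest differential.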
Whenever possible, avoid taking Exchange 2000 offline when you perform a backup. When Exchange is offline, users can't connect to their mailboxes. Also, each time you bring the Information Store service online again, the Store generates public folder replication messages to request updates from replication partners. Online backups are perfectly safe. During online backups, the Store calculates checksums for each page before streaming the pages to the backup media; if a checksum doesn't match, the Store generates the infamous -1018 error and halts the backup operation.
To prepare for a full online Exchange backup, the backup agent establishes the type of backup (i.e., full) and the target media (i.e., tape or disk—see the sidebar "Snapshots and Clones," page 10, for an evaluation of using snapshots or clones to speed backups to disk). You can perform a remote backup across the network, but I recommend against doing so unless you have a capable high-speed link between the source database and the target backup device.
The agent then makes a function call to inform the Extensible Storage Engine (ESE) that the backup is about to begin. ESE logs event ID 210, which indicates the start of a full backup, to the Application log. ESE closes the current transaction log and opens a new transaction log. The Store then directs all transactions that occur during the backup to the new set of logs, which will remain on the server after the backup is complete. (For information about the role of checkpoint files and patch files in backing up transaction logs, see the sidebar "Checkpoint and Patch Files," page 11.)
The backup process begins. The backup agent requests data, and the Store streams the data to the media in 64KB chunks, each made up of sixteen 4KB pages. As it begins the backup of each database, ESE writes event ID 220 to the Application log, noting the size of the file.
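To make the chunk arithmetic concrete, here is a minimal Python sketch that reads a file-like object in 64KB chunks and splits each chunk into 4KB pages. It models only the sizes described above; the function names are my own, and nothing here calls the real Exchange backup API.

```python
import io

PAGE_SIZE = 4 * 1024                      # one database page
PAGES_PER_CHUNK = 16
CHUNK_SIZE = PAGE_SIZE * PAGES_PER_CHUNK  # 65,536 bytes, i.e., 64KB


def iter_chunks(stream):
    """Yield successive chunks of up to 64KB from a binary file-like object."""
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        yield chunk


def split_pages(chunk):
    """Split one chunk into its 4KB pages."""
    return [chunk[i:i + PAGE_SIZE] for i in range(0, len(chunk), PAGE_SIZE)]


# Example: a stand-in "database" of 33 empty pages.
database = io.BytesIO(bytes(33 * PAGE_SIZE))
pages_per_chunk = [len(split_pages(c)) for c in iter_chunks(database)]
```

In this example, pages_per_chunk holds two full chunks of 16 pages and a final partial chunk of 1 page.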
As the Store processes each 4KB page, it verifies that the cyclic redundancy check (CRC) checksum and the page number stored in the page header are correct (the checksum occupies the first 4 bytes of each page, and the page number occupies the next 4 bytes). This verification ensures that the page contains valid data. If either value is incorrect, the Store records a -1018 error in the Application log and the backup API stops processing data. This step might seem excessive, but it stops administrators from blithely taking backups of databases that might contain internal errors.
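The idea behind the per-page check can be illustrated with a short Python sketch. Treat every detail as an assumption made for illustration: the header layout used here (a 4-byte checksum followed by a 4-byte page number), the use of zlib.crc32 as the checksum, and the helper names are mine and don't reproduce ESE's actual on-disk format or algorithm. Only the -1018 error code comes from the text.

```python
import struct
import zlib

PAGE_SIZE = 4096
HEADER = struct.Struct("<II")  # assumed layout: checksum, then page number
ERR_PAGE_CHECK = -1018         # error code named in the text


class PageVerificationError(Exception):
    def __init__(self, page_number, reason):
        super().__init__(f"{ERR_PAGE_CHECK}: page {page_number}: {reason}")
        self.code = ERR_PAGE_CHECK


def make_page(page_number, payload):
    """Build a page whose header matches its contents (test helper)."""
    body_size = PAGE_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\x00")
    # The checksum covers everything after the checksum field itself.
    checksum = zlib.crc32(struct.pack("<I", page_number) + body)
    return HEADER.pack(checksum, page_number) + body


def verify_page(page, expected_number):
    """Raise PageVerificationError if the page number or checksum is wrong."""
    if len(page) != PAGE_SIZE:
        raise PageVerificationError(expected_number, "page is not 4KB")
    stored_checksum, stored_number = HEADER.unpack_from(page)
    if stored_number != expected_number:
        raise PageVerificationError(expected_number, "page number mismatch")
    if zlib.crc32(page[4:]) != stored_checksum:
        raise PageVerificationError(expected_number, "checksum mismatch")
```

A backup loop built on this sketch would call verify_page for each page before writing it to the media and abandon the backup on the first exception, which mirrors the stop-on-error behavior described above.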
ESE logs event ID 221 as the backup of each database is finished. After writing all the pages from the target databases to the backup media, the backup agent requests that ESE write the prebackup transaction logs to the backup media. ESE records event ID 223 to indicate that the transaction logs have been written to the backup media. During a full backup, ESE then deletes those logs (noting the fact in event ID 224) to release disk space back to the system. Doing so is quite safe because the transactions are committed to the database and are available in the backup log set.
ESE closes the backup set, and typical operations resume. ESE records event ID 213, which indicates successful backup completion.
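If you collect the event IDs that a backup writes to the Application log, you can check them against the sequence just described. The following Python sketch assumes that you already have the IDs as a list of integers in the order they were logged (how you extract them is up to you) and that the log contains only the events from a single full backup. The function name and the problem messages are hypothetical.

```python
BACKUP_START = 210   # full backup begins
DB_START = 220       # backup of one database begins
DB_END = 221         # backup of one database finishes
LOGS_WRITTEN = 223   # transaction logs written to the media
LOGS_DELETED = 224   # backed-up logs deleted from disk
BACKUP_DONE = 213    # backup completed successfully


def check_full_backup(event_ids):
    """Return a list of problems; an empty list means the sequence looks complete."""
    ids = list(event_ids)
    problems = []
    if not ids or ids[0] != BACKUP_START:
        problems.append("backup start (210) is not the first event")
    if not ids or ids[-1] != BACKUP_DONE:
        problems.append("backup completion (213) is not the last event")

    started, finished = ids.count(DB_START), ids.count(DB_END)
    if started == 0:
        problems.append("no database backup was started (220)")
    if started != finished:
        problems.append(f"{started} database backups started (220) "
                        f"but {finished} finished (221)")

    if LOGS_WRITTEN not in ids:
        problems.append("transaction logs were not written (223)")
    elif DB_END in ids:
        last_db_end = len(ids) - 1 - ids[::-1].index(DB_END)
        if ids.index(LOGS_WRITTEN) < last_db_end:
            problems.append("logs written (223) before the last database finished (221)")

    if LOGS_DELETED not in ids:
        problems.append("backed-up logs were not deleted (224)")
    elif LOGS_WRITTEN in ids and ids.index(LOGS_DELETED) < ids.index(LOGS_WRITTEN):
        problems.append("logs deleted (224) before they were written (223)")
    return problems
```

For a two-database SG, the sequence 210, 220, 221, 220, 221, 223, 224, 213 produces no problems, whereas a backup that stops partway reports each missing step.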
The Restore Process
Exchange 2000 Enterprise Server supports as many as four storage groups (SGs), with an additional special SG reserved for restore operations. Store partitioning lets Exchange 2000 take a granular approach to backup and restore operations, so you can back up or restore as few as one database, rather than the entire Store. (However, I suggest you process backups at the SG level whenever possible. Backing up individual databases doesn't typically make sense because ESE includes all of the SG's transaction logs in every backup to ensure that the backup captures transactions that aren't yet fully committed to the database.)
Exchange Server 5.5 and earlier versions support only offline restores, meaning that the server can do no other work until the Information Store service is back online. Exchange 2000, by contrast, requires only that the Information Store service be running before you begin the restore. ESE uses a reserved SG to enable online restores. (This SG is different from Exchange Server 2003's Recovery Storage Group, which lets you bring a copy of a database online on the same server as the original database, then use the database copy to recover mailbox data without disrupting typical operations.) The Store overwrites the failed database with the backup database and moves the transaction logs from the backup set into a temporary directory. The Store then replays transactions from the logs and commits changes to the restored database to make it consistent and up-to-date. After the database is updated, the restore SG turns over control to the regular SG and operations recommence. This technique ensures that all unaffected databases on the Exchange server continue operating while you restore the failed database. If you're restoring multiple databases in one SG, you must restore all the failed databases before you begin to recover transactions. ESE interleaves the transactions for all databases in an SG in the same set of logs, so replaying the logs replays transactions for all databases in the SG.
If you're restoring a corrupt database, you probably want to overwrite the file on disk. (You might want to copy the corrupt database before the restore overwrites it and store the copy for later investigation.) A database property controls whether the Store will let ESE overwrite a database; before beginning the restore, access the database's properties and select the This Database can be overwritten by a Restore check box (or simply delete the corrupt file).
Ntbackup.exe initializes and displays details of the backup sets that you can restore from the available media, as Figure 1 shows. After you select a set, ntbackup.exe notifies ESE that it wants to begin a restore. The Store then launches an ESE recovery instance to manage the special recovery SG, which exists only during restore operations. Ntbackup.exe begins to stream data out of the backup set into the necessary databases, using direct Win32 file system calls to copy the files into the appropriate locations. You restore the most recent full backup first, followed by any differential or incremental sets. After you select the backup set and begin the restore, you must specify the server to which to restore the backup, as Figure 2 shows.
When you're restoring only one backup set or you're restoring the final set in a series of backups, select the Last Backup Set check box so that ESE quits searching for other backup sets and can complete the recovery operation after restoring the specified databases. This step is crucial because ESE won't perform the final steps to make the database consistent until you select the check box. (If a database proves to be inconsistent after the restore, you can use the Eseutil utility to try to correct any errors.) You can also select the Mount Database After Restore check box to place the restored mailbox store into operation as quickly as possible after the restore is complete.
ESE restores the .edb and .stm files to the production directory, but you must specify a temporary location to hold the backup transaction logs for the duration of the restore. Otherwise, those logs might overwrite logs that contain transactions that ESE needs to replay later. Make sure that the temporary location (e.g., C:\temp\backup) has enough disk space to accommodate all the log files in the backup set.
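A quick space check before you start the restore can save you from a failed operation. The Python sketch below assumes that you already know the total size of the log files in the backup set (for example, from the backup catalog). The function names and the 10 percent safety margin are my own choices, not values that Exchange requires.

```python
import shutil


def has_room(free_bytes, needed_bytes, margin_percent=10):
    """True if free space covers the logs plus a safety margin.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return free_bytes * 100 >= needed_bytes * (100 + margin_percent)


def temp_location_ok(temp_dir, needed_bytes, margin_percent=10):
    """Check the volume that holds temp_dir (e.g., C:\\temp\\backup)."""
    free_bytes = shutil.disk_usage(temp_dir).free
    return has_room(free_bytes, needed_bytes, margin_percent)
```

For example, a backup set that holds 400 log files of 5MB each needs about 2GB, so has_room(free_bytes, 400 * 5 * 1024 * 1024) tells you whether the temporary volume can hold them with room to spare.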
During restores, ESE creates a file called restore.env in the temporary location to control the processing of transaction logs during the restore. This file, which replaces the registry subkey that earlier Exchange versions use to signal that a restore operation is in progress, contains information about databases, paths, signatures, and other restore components. ESE creates a separate restore.env file for each restore operation. (If you're curious, you can use Eseutil to examine the contents of restore.env.)
After ESE restores the specified databases and transaction logs, it begins to apply page splits to the respective databases (see "Checkpoint and Patch Files" for more information about page splits). Next, ESE processes the transaction logs. During this procedure, ESE validates the log signature and generation sequence to ensure that the correct transaction logs are present and are available to recover transactions. If a log signature doesn't match the database signature, ESE returns error -610; if ESE discovers a gap in the log generations, it returns error -611. Either of these errors stops the recovery process before ESE replays any transactions into the database so that you can fix the problem before restarting the restore.
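The two checks that ESE makes before it replays any transactions can be modeled in a few lines of Python. This sketch is a simplified illustration rather than ESE's implementation: the LogFile type, the use of strings for signatures, and the return value of 0 for success are assumptions. The error codes are the ones cited above.

```python
from dataclasses import dataclass

ERR_BAD_SIGNATURE = -610   # log signature doesn't match the database
ERR_GENERATION_GAP = -611  # a log generation is missing from the sequence


@dataclass(frozen=True)
class LogFile:
    generation: int  # position of the log in the sequence
    signature: str   # stand-in for the signature stored in the log


def validate_logs(db_signature, logs):
    """Return 0 if replay may proceed; otherwise return the error code."""
    ordered = sorted(logs, key=lambda log: log.generation)
    for log in ordered:
        if log.signature != db_signature:
            return ERR_BAD_SIGNATURE
    for previous, current in zip(ordered, ordered[1:]):
        if current.generation != previous.generation + 1:
            return ERR_GENERATION_GAP
    return 0
```

Because both checks run before any transaction is applied, a failure leaves the restored database untouched, which is why you can fix the problem and restart the restore.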
ESE actually reads two sets of logs during the restore. First, ESE processes the logs held in the temporary location, then processes the regular logs that have accumulated since the backup. Transaction logs contain data for all the databases in an SG, so even when you restore only one database, ESE must scan all the data in the logs to isolate the transactions for that database before applying them (ESE ignores all the other data in the logs). This phase of the operation can be the longest, depending on the number of transaction logs that ESE must process.
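The two-pass scan can be pictured with the following Python sketch. The data shapes are hypothetical: each log is modeled as a list of (database name, operation) pairs, which is far simpler than a real transaction log, but it shows why ESE must read every record even when it restores just one database.

```python
from itertools import chain


def transactions_to_replay(target_db, backup_logs, current_logs):
    """Collect, in order, the operations that belong to target_db.

    backup_logs  -- logs restored to the temporary location (read first)
    current_logs -- logs that accumulated on the server since the backup
    """
    replay = []
    for log in chain(backup_logs, current_logs):
        for db_name, operation in log:
            # Records for other databases in the SG are read but ignored.
            if db_name == target_db:
                replay.append(operation)
    return replay
```

The cost of the scan grows with the total number of log records in the SG, not with the number of records that belong to the database being restored.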
After ESE applies the transaction log data, the recovery SG performs some cleanup operations, exits, and returns control to the primary SG (i.e., the SG that you're restoring), which brings the newly restored databases back online. ESE also deletes all files in the temporary location.
At the end of a successful restore, ESE writes event ID 902 to the Application log. ESE also records the details of the restore operation in a text file in the temporary directory (the file is named after the backup set). To be certain that everything is OK before you let any users connect to the store, make sure that ESE logs event ID 902 and that ESE has recorded no errors in the text file.
Knowledge Is Power
Understanding the process of creating and restoring a full backup can help any disaster recovery go more smoothly. For more information about Exchange 2000 disaster recovery, see "Disaster Recovery for Microsoft Exchange 2000 Server" (http://www.microsoft.com/exchange/techinfo/deployment/2000/e2krecovery.asp).