The increasing importance of messaging and collaboration as mission-critical services has left Microsoft Exchange Server administrators and implementers looking for new ways to improve server recoverability and availability. In addition, hardware and software vendors are supporting technologies that enable more rapid recovery of server data and applications. Two specific enabling technologies are volume snapshots and volume cloning. These technologies are available in a variety of hardware and software implementations ranging from full-blown snapshot-manager products for Exchange to integration kits for a custom Exchange recovery solution. In the past, regardless of how these technologies were delivered, Microsoft products didn't support snapshots or cloning. Table 1, page 52, lists pertinent Microsoft articles that discuss snapshot and cloning support for Exchange 2000 Server, Exchange Server 5.5, and Exchange 4.0. However, with the advent of Windows Server 2003 and Exchange Server 2003, Microsoft has introduced Volume Shadow Copy Service (VSS), which makes snapshots and cloning available natively to Exchange administrators. Let's look at VSS and what it means to Exchange disaster recovery and availability.

Volume Snapshot and Volume Cloning Overview
Let's briefly review volume snapshots and volume cloning. Although these technologies aren't new, they're relatively new to the Windows platform. Cloning and snapshot technologies provide Business Continuance Volumes (BCVs)—a mechanism for data duplication and point-in-time copies. On the surface, these technologies might appear to be the same; however, they're quite different in technical implementation.

Volume snapshots. A volume snapshot (also known as a copy-on-write snapshot) is a metadata mapping of specific volume blocks as they exist at the time you create the snapshot. For example, if you create a snapshot of your Exchange database volume, the snapshot represents the blocks on disk that compose your Exchange database and any other files on the volume at the time you create the snapshot. Therefore, after you create a snapshot, the original volume blocks must be preserved for the snapshot to remain intact. As a result, when a block is about to change, the system must copy it to another location in the storage pool. From an Exchange viewpoint, this requirement means that if you create a snapshot for the volume on which Exchange data resides, then make a change to a page in the Exchange database, the VSS-supported storage system will typically copy the affected blocks to a special "diff" area in the storage pool allocated from free volume pool space. In this manner, the system preserves the original subset of volume blocks that represent the snapshot. After you create the snapshot, the production data consists of a combination of original unchanged blocks from the snapshot and new blocks of data. The snapshot remains intact and represents the data state at the time the snapshot was created. Creating snapshots is relatively quick and simple: the VSS-supported storage system creates the volume block mapping, and the snapshot exists. However, a copy-on-write snapshot isn't a complete redundant copy of the data; it depends on the base volume and is exposed to the same disk failures. As a result, snapshot recovery can be problematic if the base volume is lost or corrupted, because an administrator must complete several steps to recover the original volume. For this reason, snapshots are less desirable than clones as a recovery mechanism.
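The copy-on-write behavior described above can be illustrated with a small simulation. This is a minimal sketch, not how any VSS Provider is actually implemented: the class CowVolume, its methods, and the block contents are hypothetical names chosen for illustration, and a real storage system works on disk blocks rather than Python lists.

```python
class CowVolume:
    """Toy volume that supports a single copy-on-write snapshot (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # live production blocks
        self.diff_area = None       # block index -> preserved original contents

    def create_snapshot(self):
        # Creating the snapshot copies no data; it only starts tracking changes.
        self.diff_area = {}

    def write(self, index, data):
        # Copy-on-write: preserve the original block the first time it changes.
        if self.diff_area is not None and index not in self.diff_area:
            self.diff_area[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index):
        # Snapshot view: the preserved original if the block changed, else the live block.
        if self.diff_area is None:
            raise RuntimeError("no snapshot exists")
        return self.diff_area.get(index, self.blocks[index])


volume = CowVolume(["page-A", "page-B", "page-C"])
volume.create_snapshot()
volume.write(1, "page-B2")  # a database page changes after the snapshot

snapshot_view = [volume.read_snapshot(i) for i in range(3)]
print(volume.blocks)   # live data after the write
print(snapshot_view)   # data as it looked when the snapshot was created
```

Note that the snapshot view still reads unchanged blocks from the live volume, which is why losing the base volume also damages the snapshot.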

Volume clones. Similar to volume snapshots, volume cloning isn't a recent development. Cloning comes from RAID 0+1 technology. A clone is an additional member of a RAID 0+1 mirror set. For example, if you have a RAID 0+1 set with three disks mirrored to three disks, you have a two-member RAID 0+1 set. By normalizing another three disks to the existing RAID 0+1 set of six disks, you create a three-member mirror set (i.e., a triple mirror of nine disks). You can add other members to the mirror set as well. By creating multimember mirror sets and then separating members from the set, you enable clones. Because the existing production data has multiple mirrored copies, you can use some of these copies to create point-in-time copies of the data. Unlike a snapshot, a clone is a complete standalone copy of the data. To create a clone, you simply separate one or more members of the RAID 0+1 mirror set from the production set. The result is a production mirror set that supports the application (i.e., the two-member RAID 0+1 array) and clones (single-member sets) that you've separated from the production data, as Figure 1 shows. In the event of data corruption or loss, you can use a clone to recover system data by making the clone available as the production LUN. Because clones are a complete redundant copy of the data, they're most useful as rapid-recovery mechanisms.

The advantage of volume clones lies in how quickly you can create them. The downside is having to resynchronize an old (or new) member with the primary mirror set, which can take time depending on the size of the disks and the capabilities of the controller and enclosure.
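The trade-off between fast clone creation and slow resynchronization can be sketched as follows. This is a simplified model under stated assumptions: MirrorSet, split_clone, and resync are hypothetical names, each mirror member is modeled as a Python list, and resynchronization is modeled as a full copy whose cost grows with the member's size.

```python
class MirrorSet:
    """Toy multimember mirror set; every member holds a full copy of the data."""

    def __init__(self, data, members):
        self.members = [list(data) for _ in range(members)]

    def write(self, index, value):
        # A production write goes to every member still in the set.
        for member in self.members:
            member[index] = value

    def split_clone(self):
        # Splitting is quick: the member already holds a complete copy.
        if len(self.members) < 2:
            raise ValueError("cannot split the last member of the mirror set")
        return self.members.pop()

    def resync(self, member):
        # Rejoining is the slow part: the whole member is brought up to date.
        member[:] = self.members[0]
        self.members.append(member)
        return len(member)  # blocks copied, a stand-in for resync cost


production = MirrorSet(["a", "b", "c"], members=3)
clone = production.split_clone()      # point-in-time copy
production.write(0, "a2")             # production keeps changing
clone_at_split = list(clone)          # the clone is unaffected by later writes
blocks_copied = production.resync(clone)
print(clone_at_split, blocks_copied)
```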

The VSS Foundation
Because volume snapshots and volume cloning have had limited availability in the Windows space, the OS and applications haven't been able to take full advantage of these technologies. Hardware and software developers have implemented snapshot and clone solutions with little or no exposure or integration with the OS and applications. As a result, third-party vendors have been primarily responsible for supporting these solutions.

Microsoft has made substantial storage technology investments, including VSS, in Windows 2003. VSS attempts to solve a key problem: the constant expansion of the backup and restore window. Because of today's inexpensive disk space and applications' large appetites for storage, administrators are constantly challenged with data sets that continually grow and disaster-recovery facilities that don't. Administrators have many methods of dealing with this problem, such as growth management (e.g., archiving, Hierarchical Storage Management (HSM), quotas), but what they really want is a way to increase backup and (more importantly) restore speeds. If you could increase backup rates from 10GB per hour to 20GB per hour, your backup window for the same amount of data would shrink by 50 percent.
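The backup-window arithmetic can be checked with a few lines of code. The 100GB data size below is an assumed figure chosen only for illustration; the 10GB and 20GB per hour rates come from the example above, and the percentage reduction is the same for any data size.

```python
def backup_window_hours(data_gb, rate_gb_per_hour):
    """Hours needed to back up data_gb at a constant rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return data_gb / rate_gb_per_hour


data_gb = 100  # hypothetical data set size
slow_window = backup_window_hours(data_gb, 10)
fast_window = backup_window_hours(data_gb, 20)
reduction_percent = (slow_window - fast_window) / slow_window * 100

print(slow_window, fast_window, reduction_percent)
```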

VSS addresses one primary concern: much of today's data is online and in use. For example, consider a 24 x 7 file server with thousands of user files. Whenever a typical backup runs, a few files will be open. To complete the backup, you have three choices. First, you can stop the service or session and close the open files. Second, you can skip open files and hope they don't get lost or corrupted before the next backup. Third, and best, you can take a snapshot of the data and use the snapshot as the basis for recovery.

VSS provides a framework for using snapshot and cloning technologies with the Windows platform. More specifically, VSS provides services that deliver an infrastructure upon which the OS, applications, and vendors (e.g., Hewlett-Packard (HP), EMC, VERITAS Software, LEGATO Systems) can leverage these technologies.

VSS has three primary goals: to provide application synchronization, including synchronizing application data spread over multiple volumes; to provide discovery and enumeration of snapshots or clones (called Shadow Copies); and to provide a framework in which hardware and software vendors can plug in interoperable Shadow Copy creation components (called Providers). With these goals in mind, VSS on Windows 2003 lets a hardware or software vendor supply a Provider, an application developer expose Shadow Copy packages that contain XML-based metadata (called Writers), and a backup vendor build applications (called Requestors) that can initiate backup and restore operations that leverage these components on a common infrastructure. Figure 2 shows Windows 2003's VSS architecture.

VSS Providers. VSS exposes APIs that let vendors VSS-enable their solutions. For a vendor's snapshot or clone technology to function within the VSS framework, that vendor must develop a Provider—components that manage volumes and create clones and snapshots according to a specific vendor's technology and implementation. Typically, a Provider is a process containing some kernel-mode and some user-mode code that persists data about a physical Shadow Copy and exposes that Shadow Copy to the OS or applications. Vendors must build Providers regardless of whether the vendors create hardware- or software-based solutions. In the case of a software-based Provider, the implementation typically consists of a user-mode process coupled with a kernel-mode device driver. Details about the hardware- and software-based solutions and the Provider implementation are at the discretion of the vendors, as long as they follow the VSS framework-implementation rules. Windows 2003 includes a software-based Provider, which the OS implements as a copy-on-write software snapshot.

VSS Requestors. Backup and disaster-recovery vendors can develop applications that make use of the VSS architecture, APIs, and implementation rules. To do so, these vendors must develop a Requestor—an automated or GUI-based process or application that requests one or more Shadow Copy sets from one or more volumes. The Requestor is the main process that communicates with the Shadow Copy interface, which coordinates activities between Requestors, Providers, and Writers. The Requestor also communicates with Writers to gather backup components, files, and metadata that the Writers manage. This communication lets a Requestor select which volumes should be Shadow Copied to complete the requirements of the backup operation. Windows 2003 doesn't include a Requestor.

VSS Writers. The most important players in the VSS framework are arguably the applications. An application must carefully expose recovery packages that are specific to the application's technology, implementation, and disaster-recovery requirements and constraints. For example, because Exchange uses a transacted database engine, its requirements are unique, even when compared with similar applications (e.g., Microsoft SQL Server, Oracle). Writers are code and related data embedded in applications and components of those applications to enable VSS compatibility. Writers respond to the Shadow Copy interface to let the application prepare, freeze, and thaw application I/O to ensure that no writes occur on the volume when the Provider creates the Shadow Copy. Through the VSS interface, Writers also respond to Requestors by supplying Writer metadata that includes details about what the Requestor requires to perform Shadow Copy operations for the specific application.

A backup operation that uses VSS is a well-orchestrated process that involves the interaction of each component in the VSS framework. Figure 3 shows a generalized flow and interaction diagram of a backup operation using VSS technology.
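The interaction among Requestor, Writers, and Provider can be sketched as a short simulation. This is only a model of the sequence described above, not the real VSS API: the Writer and Provider classes, run_backup, and the volume names are all hypothetical, and the actual framework coordinates these steps through the Shadow Copy interface rather than direct method calls.

```python
class Writer:
    """Toy application Writer that can pause its own I/O (illustrative only)."""

    def __init__(self, name, volumes):
        self.name = name
        self.volumes = list(volumes)
        self.frozen = False

    def metadata(self):
        # Tells the Requestor which volumes hold this application's data.
        return {"writer": self.name, "volumes": list(self.volumes)}

    def freeze(self):
        self.frozen = True

    def thaw(self):
        self.frozen = False


class Provider:
    """Toy Provider that refuses to copy a volume while a Writer is still writing."""

    def create_shadow_copy(self, volume, writers):
        for writer in writers:
            if volume in writer.volumes and not writer.frozen:
                raise RuntimeError(f"{writer.name} has not frozen I/O on {volume}")
        return f"shadow-of-{volume}"


def run_backup(writers, provider, log):
    """Requestor-side flow: gather metadata, freeze, copy, thaw."""
    volumes = sorted({v for w in writers for v in w.metadata()["volumes"]})
    log.append("gather-metadata")
    for writer in writers:
        writer.freeze()
    log.append("freeze")
    try:
        copies = [provider.create_shadow_copy(v, writers) for v in volumes]
        log.append("create-shadow-copies")
    finally:
        # Always thaw, even if Shadow Copy creation fails.
        for writer in writers:
            writer.thaw()
        log.append("thaw")
    return copies


store_writer = Writer("exchange-store", ["E:", "F:"])
events = []
shadow_copies = run_backup([store_writer], Provider(), events)
print(shadow_copies)
print(events)
```

The try/finally block reflects one design point worth noting: application I/O should resume whether or not the Shadow Copy succeeds, so the freeze window stays as short as possible.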

Exchange 2003 Support for VSS
To support the VSS framework, an application such as Exchange must provide the Writer component. Because Microsoft has no plans to provide a Writer for Exchange 2000, Exchange 5.5, or Exchange 4.0, the company won't support VSS for these versions. However, Exchange 2003, paired with Windows 2003, does provide VSS support for Store backup and recovery. In Exchange 2003, Microsoft has built the Writer functionality into the Store process. This Writer provides the necessary support for Requestors to initiate backup operations for Exchange 2003.

Exchange 2003 Backups Using VSS
Traditional Exchange API-based backups focused on four backup types for Exchange databases: Full, Incremental, Differential, and Copy. However, the Exchange 2003 Writer supports only a Full backup at the storage group (SG) level. VSS performs Exchange Full backups at the SG level, even though the Exchange Writer treats individual databases as separate components. VSS uses the AddComponent call to add each database component to the Shadow Copy set, which in the case of a Full backup is the entire SG (i.e., all databases and log files). In a Full backup of an SG, VSS creates a complete Shadow Copy of all volumes that hold the SG's data, so the Shadow Copy contains the database and transaction log files associated with that SG. In addition, as is the case with non-VSS Full backups, VSS truncates the transaction log files after successfully creating and backing up the Shadow Copy. To truncate the transaction log files, the Shadow Copy set must include all databases. For this reason, Microsoft will use the metadata definition for the Exchange Writer to force the Requestor applications to process only Full backups that have all SG components (i.e., databases and log files) in the Shadow Copy set.
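The rule that a Full backup must cover every database in the SG before logs are truncated can be sketched as follows. This is a simplified illustration, not the Exchange Writer's actual logic: full_backup, the dictionary layout, and the file names are hypothetical, and truncation is modeled as clearing the whole log list, whereas a real system removes only the log files that are no longer needed.

```python
def full_backup(storage_group, selected_databases):
    """Back up an SG only if every database is selected, then truncate logs."""
    required = set(storage_group["databases"])
    missing = required - set(selected_databases)
    if missing:
        raise ValueError(f"Full backup must include every database; missing: {sorted(missing)}")

    shadow_copy_set = {
        "databases": sorted(required),
        "logs": list(storage_group["logs"]),
    }
    # Truncate only after the Shadow Copy set has been built successfully.
    storage_group["logs"] = []
    return shadow_copy_set


sg = {
    "databases": ["mailbox1.edb", "mailbox2.edb"],  # hypothetical file names
    "logs": ["E0000001.log", "E0000002.log"],
}

try:
    full_backup(sg, ["mailbox1.edb"])  # partial selection is rejected
    partial_rejected = False
except ValueError:
    partial_rejected = True
logs_after_rejection = list(sg["logs"])

backup = full_backup(sg, ["mailbox1.edb", "mailbox2.edb"])
print(partial_rejected, backup, sg["logs"])
```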

Exchange 2003 Recovery Using VSS
Although VSS backup for Exchange 2003 is at the SG level, you can recover individual databases from the SG Shadow Copy set. VSS-based restoration of an Exchange 2003 SG is useful when data in one or more databases in the SG is lost or corrupted, but the current log files remain intact on disk; when the current log files on disk are lost or corrupted, but the databases remain intact; or when databases and current log files within an SG are lost or corrupted.

In the context of Exchange 2003 and VSS, the backup application is responsible only for restoring data to disk. The Exchange 2003 database engine, not the Requestor, is responsible for recovering the data to a consistent, up-to-date state through playback of the log files. To do so, the database engine activates existing soft or hard recovery procedures. After the VSS-aware backup application restores the transaction log files and databases, Exchange 2003 remounts and restarts the SG, then the database engine initiates recovery. The database engine determines that the state of the databases isn't consistent with the end of the log files on disk and begins the recovery procedure.

Three Exchange 2003 data restoration scenarios exist, but only two procedures cover them: Roll-Forward recovery and Point-in-Time recovery. The procedure for restoring data is the same whether you've lost only the SG's log files or you've lost an SG's log files and databases, because the loss of the log files is a catastrophic failure in Exchange and requires restoring the entire SG. Both recovery options follow the same step-by-step process:

  1. The backup application (the Requestor), through the Exchange Writer and APIs, takes the SG offline.
  2. The backup application performs a VSS-based recovery of the volumes required from the SG Shadow Copy set.
     • If one LUN per SG is configured, Exchange recovers all databases, including those that are intact.
     • If multiple LUNs per SG are configured, Exchange recovers only the LUNs with the databases needing recovery from the Shadow Copy set.
  3. Exchange performs an Extensible Storage Engine (ESE) hard recovery and replays applicable log files for databases being recovered, depending on whether a Roll-Forward recovery or Point-in-Time recovery is occurring.
  4. The backup application (the Requestor), through the Exchange Writer and APIs, brings the SG online.

Roll-Forward recovery. In a Roll-Forward recovery, one or more databases in the SG are lost, but the log files are intact on the server at the time of the recovery. In this case, you can selectively restore each of the affected databases from a Full backup of the SG. Within the context of the VSS framework, you select from the SG backup only those database components that correspond to the databases you want to restore. The VSS-aware backup application restores the databases, and Exchange recovers the databases and brings them up to date from their state at the time of the snapshot by rolling forward through the transaction logs (i.e., Exchange hard recovery). The Roll-Forward recovery option lets you recover backed-up data as well as data that has accumulated (e.g., in transaction logs) since the last backup.

Point-in-Time recovery. When the SG's log-file volume has been damaged or lost, or the log files have been lost or damaged together with some or all of the SG's databases, you must restore the log files from a previous backup, together with all the databases backed up at the time of the last Full backup of the SG. Because the log files written since the last backup have been lost or damaged, you can't recover to the point of the failure; you can recover only to the point of the last Full backup. This process is known as a Point-in-Time recovery. Because this option doesn't provide roll-forward capability, some data will be lost. To provide Point-in-Time recovery, you must restore the databases that you backed up at the time of the Full backup as well as the log files from the Full backup. In addition, you must recover all databases associated with the SG. You can't assume that any of the databases were left in a transaction-consistent state at the time the log files were lost and the SG went offline, because the loss of the transaction log is a fatal error that causes the Store to shut down immediately with no guarantee of consistency. Therefore, to ensure that the databases are in a consistent state when you restart the SG, you must return the entire SG to its state at the time of the last Full backup.
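The choice between the two procedures can be summarized as a small decision function. This is a sketch of the reasoning in the Roll-Forward and Point-in-Time descriptions, not part of any backup product: plan_recovery, its parameters, and the database names are hypothetical.

```python
def plan_recovery(all_databases, damaged_databases, logs_intact):
    """Pick a recovery procedure from what was lost in the storage group."""
    damaged = set(damaged_databases)
    if logs_intact:
        if not damaged:
            return {"procedure": "none", "restore_databases": [], "restore_logs": False}
        # Logs survive: restore only the damaged databases, then replay current logs.
        return {
            "procedure": "roll-forward",
            "restore_databases": sorted(damaged),
            "restore_logs": False,
        }
    # Logs are lost: every database and the backed-up logs must be restored.
    return {
        "procedure": "point-in-time",
        "restore_databases": sorted(all_databases),
        "restore_logs": True,
    }


databases = ["mailbox1", "mailbox2", "public1"]  # hypothetical SG contents
roll_forward_plan = plan_recovery(databases, ["mailbox2"], logs_intact=True)
point_in_time_plan = plan_recovery(databases, [], logs_intact=False)
print(roll_forward_plan)
print(point_in_time_plan)
```

Losing the log files alone forces the same plan as losing log files and databases together, which is why three failure scenarios map to only two procedures.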

Implications for Exchange Administrators
As organizations move to Windows 2003 and Exchange 2003, the use of VSS-based backup and recovery will become a standard mechanism for Exchange disaster recovery. However, VSS solutions aren't yet proven or readily available. In addition, VSS adds complexity to your disaster-recovery scenario, and we're only beginning to learn the best practices and pitfalls. The non-VSS solutions that exist today let you use snapshot and clone technologies with Exchange. However, these technologies have no native OS or application support. Organizations must rely on the vendors of these solutions for support, both for current non-VSS solutions and for future VSS solutions. Now that Microsoft has shipped Windows 2003 and Exchange 2003 is scheduled for release this year, vendors likely will follow closely with robust VSS Provider and Requestor support.