Most organizations completely rely on their IT infrastructures to function. To provide IT resiliency, many ensure that they have backups of their systems. When technically possible, companies also implement high-availability solutions, such as clusters running their Microsoft SQL Server instances and file services, a network load-balanced web farm, multiple domain controllers (DCs) replicating to one another, and so on.
For applications that have no native high-availability capabilities, virtualization can provide a solution by applying high availability at the virtual machine (VM) level. This approach allows the VM to be restarted automatically on an alternative virtualization host if an unplanned failure occurs. It also allows the migration of VMs between hosts, with no downtime, in planned situations such as maintenance events. These solutions handle a failure at the host level (i.e., when one host fails).
However, natural disasters (such as recent "hole in the earth" disasters, referring to the complete loss of the data center) and man-made events (which can be as innocent as road work cutting through both your redundant connections to the Internet) can effectively seal off your data center from the rest of the world. Organizations must plan for continuing business even if their primary data center is lost.
Assuming that your organization has a second location that can be used as a data center, many of those application-level technologies that I mentioned previously can also be used across locations. Many solutions, such as multiple DCs and failover cluster–enabled applications, are geographically aware. But there can be a catch with using the traditional Windows Failover Clustering feature over geographically separate locations. In many applications that use clustering and in all virtualization clustering, shared storage must be available to all the nodes in the cluster. This is generally very expensive, because it requires SANs at both locations, great connectivity between the locations, and storage replication to keep the content on both SANs synchronized.
Most SAN-to-SAN replication solutions are synchronous: A write action on the primary SAN is acknowledged to the writing process only when the write is also performed on the replica SAN, thus ensuring that both SANs are synchronized at all times. Although synchronous replication gives the greatest protection, it's costly.
Very large organizations can afford these storage solutions and enable virtualization clusters across locations for top-tier applications and virtual environments that host critical services. But many other organizations and non–top tier applications were left without a way to provide disaster recovery until the introduction of Hyper-V Replica.
Introducing Hyper-V Replica
Windows Server 2012 was an enormous release, with particularly significant changes around virtualization and cloud services. One of the biggest new features is Hyper-V Replica, which introduces the ability to asynchronously replicate a VM to a second Hyper-V host. The target Hyper-V server (i.e., the replica) does not need to be part of a cluster with the primary Hyper-V host. In fact, the replica cannot be in the same cluster as the primary. Nor does the replica need any shared storage or even a dedicated network infrastructure for the replication. The goal of Hyper-V Replica is to enable disaster recovery capabilities for any Hyper-V environment, without steep requirements, through its use of asynchronous replication.
Some SANs also offer asynchronous replication, which works by replicating data from the primary to the replica, but not in real time. Write actions are performed on the primary host, acknowledged to the writing process, and then replicated, when possible. There is a delay between when the write occurs on the primary and when it occurs on the replica. Depending on this delay, a certain amount of data can be missing from the replica server (and therefore possibly lost) if the primary host fails. The acceptable size of this gap is expressed as the recovery point objective (RPO), which defines the maximum amount of data loss that is acceptable in a disaster. For example, an RPO of 5 minutes means that no more than 5 minutes of data should be lost.
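The relationship between the replication delay and the RPO can be sketched with a small model. This is only an illustration of the idea described above, not how any SAN or Hyper-V implements it; the class `AsyncReplicationModel`, the function `within_rpo`, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncReplicationModel:
    """Toy model: writes are acknowledged at once and replicated later."""
    pending: list = field(default_factory=list)   # acknowledged, not yet on the replica
    replica: list = field(default_factory=list)   # writes already on the replica

    def write(self, t: int, data: str) -> None:
        # The writer gets its acknowledgment here, before any replication.
        self.pending.append((t, data))

    def replicate(self) -> None:
        # Ship everything accumulated so far to the replica.
        self.replica.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_fails(self) -> list:
        # Writes that exist only on the primary at this moment.
        return list(self.pending)


def within_rpo(last_replicated_s: int, failure_s: int, rpo_s: int) -> bool:
    """True if the data-loss window at failure time fits inside the RPO."""
    return failure_s - last_replicated_s <= rpo_s


model = AsyncReplicationModel()
model.write(10, "a")
model.write(200, "b")
model.replicate()                 # replication completes at t = 300 seconds
model.write(420, "c")             # acknowledged, but not yet replicated
lost = model.lost_if_primary_fails()
```

In this sketch, a failure at t = 500 seconds loses only the write made at t = 420, which fits a 300-second (5-minute) RPO because the last replication finished at t = 300.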
SAN-level asynchronous replication might not be desirable to many organizations because it requires the same vendor in both the primary and replica locations. But Hyper-V Replica uses asynchronous replication very efficiently. At a high level, Hyper-V Replica works as follows:
- When a VM is enabled for replication, a new VM is created on the Hyper-V replica host. This replica VM matches the configuration of the primary VM and is turned off.
- The storage of the primary VM is replicated to the replica VM on the replica Hyper-V server. A log is started on the primary Hyper-V host to store writes to the replicated virtual hard disks (VHDs). This log file is stored in the same location as the source VHD.
- After the initial replication of the storage is complete, the log file is closed. A new log file is started to track ongoing changes; the closed log file is sent to the replica Hyper-V host and is merged with the VHDs for the replica VM. The replica VM remains turned off.
- Every 5 minutes, the log file is closed, a new one is created, and the closed file is sent to the replica Hyper-V host and merged with the replica VHDs.
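The steps above can be sketched as a toy model. It is a simplification under stated assumptions: disks are dictionaries of block number to content, the log keeps only the latest write per block, and the class name `ReplicaCycle` and its methods are hypothetical rather than anything exposed by Hyper-V.

```python
class ReplicaCycle:
    """Toy model of log-based replication between a primary and a replica disk."""

    def __init__(self, primary_disk: dict):
        self.primary = primary_disk   # block number -> content
        self.replica = {}             # replica VHD; the replica VM stays off
        self.log = {}                 # open log of writes made on the primary

    def initial_replication(self) -> None:
        # Copy the full disk content to the replica.
        self.replica = dict(self.primary)

    def write(self, block: int, content: str) -> None:
        # Apply the write on the primary and record it in the open log.
        self.primary[block] = content
        self.log[block] = content

    def run_cycle(self) -> int:
        # Close the current log, start a new one, and merge the closed log
        # into the replica. Returns the number of blocks merged.
        closed, self.log = self.log, {}
        self.replica.update(closed)
        return len(closed)


vm = ReplicaCycle({0: "boot", 1: "data-v1"})
vm.initial_replication()
vm.write(1, "data-v2")
vm.write(2, "new-block")
in_sync_before = vm.replica == vm.primary   # the replica lags behind
merged = vm.run_cycle()                     # what happens every 5 minutes
in_sync_after = vm.replica == vm.primary
```

Between cycles the replica trails the primary by whatever is in the open log, which is the data at risk if the primary is lost.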
Hyper-V Replica's use of asynchronous replication opens up the use of replication to many more companies and many more disaster recovery scenarios:
- Data center–to–data center replication for Tier 1 applications in organizations without SAN-level replication, such as small-to-midsized organizations
- Data center–to–data center replication for Tier 2 applications in organizations that have SAN-level replication but don't want to use it for non–Tier 1 applications
- Branch office–to–head office replication, to protect applications that are hosted at a branch location
- Hoster location–to–hoster location replication, for hosting companies
- Replication to a hoster, for disaster recovery at small organizations that don't have a second data center
There are many more potential scenarios. The key point is that with Hyper-V Replica, the ability to replicate VMs is now an option for any organization.
Using Hyper-V Replica
Hyper-V Replica is simple to configure. The easiest way to really understand how Hyper-V Replica works is to walk through its setup options and enable replication for a VM.
The first step is to configure the replica Hyper-V server to accept requests to host a replica. In Hyper-V Manager, choose Hyper-V Settings from the server's list of actions. Within Hyper-V Settings, choose the Replication Configuration list of configurations, as shown in Figure 1. Check the Enable this computer as a Replica server check box. You'll then need to make several choices.
The first choice is to enable the use of Kerberos (which uses HTTP) or certificate-based authentication (which uses HTTP Secure—HTTPS). Kerberos is easier to configure but requires that both the primary and replica Hyper-V servers use Kerberos authentication and therefore be part of the same Active Directory (AD) forest or trusted domains. Using Kerberos, the replication of data between the primary and replica servers is not encrypted and is sent over the standard HTTP port 80. However, if encryption is required, the Windows IPsec implementation can be used.
The second option is to use certificate-based authentication, which enables the primary and replica servers to be part of different AD forests or organizations. This choice requires a certificate to be specified for use. As an added benefit of using HTTPS, all transferred data is encrypted. If both Kerberos and certificate-based authentication are enabled, then when a new replication relationship is established, the administrator who configures the replication can choose which method to use.
The only other configuration choice is to specify the servers from which the replica will accept replication requests, as well as where those replicas will be stored. One option is to allow replication from any authenticated server. In this case, choose one location to store all replicas. The other option is to specify the servers that can replicate to the replica; each server can have a different storage location.
When specifying servers, you can use one (but only one) wildcard character within the server name. This allows the enablement of a group of servers; for example, *.na.savilltech.net for all servers with a Fully Qualified Domain Name (FQDN) that ends in na.savilltech.net. The Trust Group tag allows VMs to move between Hyper-V hosts with the same trust group and to continue replicating without issue. With Shared Nothing Live Migration, VMs can be moved between unclustered Hyper-V hosts, with no downtime. With this new mobility capability, you need to ensure that groups of servers have the same Trust Group tag to enable unaffected replication when VMs are moved between servers within a trust group.
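As a rough illustration of how a single-wildcard entry could map primary servers to a storage location and a trust group, consider the sketch below. The matching rule is an assumption about the behavior described above, not the product's actual logic, and the function names, storage paths, and trust group tags are hypothetical.

```python
from typing import NamedTuple, Optional


class AuthEntry(NamedTuple):
    pattern: str        # server name, with at most one "*" wildcard
    storage_path: str   # hypothetical replica storage location
    trust_group: str    # hypothetical Trust Group tag


def matches(pattern: str, fqdn: str) -> bool:
    """Case-insensitive match that allows one (and only one) wildcard."""
    pattern, fqdn = pattern.lower(), fqdn.lower()
    if pattern.count("*") > 1:
        raise ValueError("only one wildcard character is allowed")
    if "*" not in pattern:
        return pattern == fqdn
    prefix, suffix = pattern.split("*")
    return (len(fqdn) >= len(prefix) + len(suffix)
            and fqdn.startswith(prefix)
            and fqdn.endswith(suffix))


def find_entry(fqdn: str, entries: list) -> Optional[AuthEntry]:
    """Return the first entry that authorizes this primary server, if any."""
    for entry in entries:
        if matches(entry.pattern, fqdn):
            return entry
    return None


entries = [
    AuthEntry("*.na.savilltech.net", r"D:\Replicas\NA", "NA-Hosts"),
    AuthEntry("hv01.eu.savilltech.net", r"D:\Replicas\EU", "EU-Hosts"),
]
na_entry = find_entry("HV07.na.savilltech.net", entries)
unknown = find_entry("hv99.asia.savilltech.net", entries)
```

Two servers that resolve to entries with the same trust group can hand a VM to each other without the replica rejecting the new sender, which is the point of keeping the tag consistent across a group of hosts.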
If you use Failover Clustering, there is an additional requirement. A failover cluster consists of multiple Hyper-V hosts. Therefore, if a failover cluster is the target for Hyper-V Replica, it's important that the whole cluster—not just one host—can host the replicated VM. Therefore, the storage of the replica must be on a Server Message Block (SMB) share or cluster shared volume (CSV). Hyper-V Replica support in a failover cluster is enabled by adding the Hyper-V Replica Broker role to the failover cluster. This action requires a name and IP address for the broker, which serves as the client access point for Hyper-V Replica and will be the name that is used when choosing the cluster as a replication target. When enabling replication within a cluster, you perform the replication configuration within the Failover Cluster Manager tool, after the Hyper-V Replica Broker role is added. When the configurations for replication (which are the same as for a standalone Hyper-V host) are completed, all hosts in the cluster are automatically configured, unless certificate-based authentication was selected. In that case, each host needs its own configured certificate.
The final step is to enable the required firewall exception for the used port: 80 for HTTP and 443 for HTTPS. The firewall exceptions are built into Windows Server but are not enabled, even after replication configuration is complete. You will need to start the Windows Firewall with Advanced Security administrative tool, choose Inbound Rules, and enable either (or both) Hyper-V Replica HTTP Listener (TCP-In) or Hyper-V Replica HTTPS Listener (TCP-In), depending on your authentication method.
When the replica server has been enabled for replication, it is important to also enable the primary Hyper-V server as a replica. This allows the reversal of replication if the VM is activated on the replica server and needs to start replicating back to the previous primary server (which would then be considered the replica).
One item that is not configured is which network to use for the replication traffic. The assumption is that this technology is used between data centers. There would be only one valid path between them, so Hyper-V Replica automatically chooses the correct network to use for the replication traffic. (I suspect that a number of clients would like more granularity of the network used for Hyper-V Replica; if you are one of them, give Microsoft that feedback!)
Replicating a VM
After the Hyper-V hosts and clusters are configured to enable the Hyper-V Replica capability, the next step is to enable VMs to be replicated. Use Hyper-V Manager or Windows PowerShell (particularly in any kind of automated, bulk configuration). Choose the VM on which you want to enable replication, and then choose the Enable Replication action. This action launches the replication-configuration wizard, which comprises several steps. I walk through the whole process in the accompanying video.
After the target Hyper-V server is specified, choose the authentication type to use. This will depend on which types the replica server supports. Also choose whether to compress the data that is sent over the network; compression saves network bandwidth but uses additional CPU cycles on both the primary and replica Hyper-V servers. If a VM has multiple VHDs, then you can choose which hard disks to replicate. You can use this choice to ensure that only the required VHDs (e.g., only those VHDs that contain more than just a pagefile) are replicated. Only VHDs can be replicated; if a VM uses pass-through disks, those disks cannot be replicated with Hyper-V Replica (another reason to avoid pass-through disks).
The next configuration step is to configure the recovery history. By default, the replica has a single recovery point: the most recent replication state. However, an extended recovery history can be configured to include additional hourly recovery points, as shown in Figure 2.
These additional points are manifested as snapshots on the VM that is created on the replica server. You can then choose a specific recovery point by choosing the desired snapshot. An additional option lets you create an incremental Microsoft Volume Shadow Copy Service (VSS) copy at a configurable interval, specified in hours. This gives you an additional level of assurance in the integrity of the replica at that point in time. The normal log files that are sent every 5 minutes provide the latest storage content. However, the disk on the source VM might have been in an inconsistent state when a log file was closed, so there is no guarantee that the replica VHD will be in a consistent state when the replica is started. When enabled, the incremental VSS option triggers a VSS snapshot on the source prior to that cycle's replication, in the same manner as when a backup is taken. This forces the source VM to ensure that the disk content is in an application-consistent state. The log file is then closed and sent to the replica, and that state is saved as the application-consistent recovery point on the target, as shown in Figure 3.
If the VM contains applications that have VSS writers, I suggest using the option to create an application-consistent recovery point. The default of 4 hours is a good balance between integrity and the additional work caused by creating a VSS recovery point on the source VM.
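To see how hourly recovery points and the VSS interval interact, here is a simplified sketch. It assumes one recovery point per hour and an application-consistent point whenever the hour is a multiple of the VSS interval; the exact scheduling inside Hyper-V Replica may differ, and the function name and parameters are hypothetical.

```python
def recovery_points(current_hour: int, keep_hourly: int, vss_every_hours: int) -> list:
    """Return (hour, kind) for each recovery point available at current_hour.

    keep_hourly:      number of additional hourly points retained
    vss_every_hours:  interval for application-consistent points (0 disables them)
    """
    points = []
    oldest = max(0, current_hour - keep_hourly)
    for hour in range(oldest, current_hour + 1):
        consistent = vss_every_hours > 0 and hour % vss_every_hours == 0
        kind = "application-consistent" if consistent else "standard"
        points.append((hour, kind))
    return points


# Ten hours in, keeping 4 additional hourly points, VSS every 4 hours.
points = recovery_points(current_hour=10, keep_hourly=4, vss_every_hours=4)
consistent_hours = [hour for hour, kind in points if kind == "application-consistent"]
```

With these values, five recovery points are available and only one of them is application-consistent. If the retention window is shorter than the VSS interval, there can be moments when no application-consistent point is retained at all, which is worth keeping in mind when choosing the two settings.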
After the recovery-point configuration is complete, you must choose the method to initially replicate the storage:
- Send the VHD content over the network.
- Send the VHD content via external media; specify a location to which to export the content.
- Use an existing VM on the replica server as the initial copy. You can use this option if you already restored the VM to the target Hyper-V server or previously had replication enabled and broke the replica but now want to re-enable it. A very efficient bit-by-bit comparison will be performed between the primary and replica, to ensure consistency.
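The idea of comparing an existing copy with the primary and sending only what differs can be sketched as follows. The source does not describe how Hyper-V performs its comparison, so the per-block SHA-256 hashing, the tiny block size, and the function names here are assumptions made purely for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of a disk image."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(primary: bytes, replica: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indexes of blocks that differ or exist on only one side."""
    p, r = block_hashes(primary, block_size), block_hashes(replica, block_size)
    return [i for i in range(max(len(p), len(r)))
            if i >= len(p) or i >= len(r) or p[i] != r[i]]


def resync(primary: bytes, replica: bytes, block_size: int = BLOCK_SIZE) -> tuple:
    """Bring the replica in line with the primary by copying only changed blocks."""
    delta = changed_blocks(primary, replica, block_size)
    # Resize the replica image to the primary's length before patching.
    out = bytearray(replica[:len(primary)].ljust(len(primary), b"\0"))
    for i in delta:
        start = i * block_size
        out[start:start + block_size] = primary[start:start + block_size]
    return bytes(out), delta


primary_image = b"AAAABBBBCCCCDDDD"
stale_replica = b"AAAAXXXXCCCCDDDD"
synced, delta = resync(primary_image, stale_replica)
```

Only one of the four blocks has to be sent here, which is why seeding the replica from a restored copy can save a great deal of network traffic compared with a full initial replication.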
The initial replication can be configured to begin immediately or at a later, specified time; for example, outside of normal business hours, when contention for network resources is reduced. Depending on your choices, the VM is created on the replica Hyper-V server in the off state, and the initial replication begins. Every 5 minutes, the Hyper-V Replica log (.hrl) file is closed, sent to the replica, and merged into the replica VHD. The entire time, the replica VM is turned off. Only disk content (not memory, processor, or device state) is replicated to the replica VM. If the replica is activated, it will be turned on and will boot from a crash-consistent state, as if the VM had been powered off without a clean shutdown. This is one of the reasons why performing the periodic VSS snapshot recovery point is useful for ensuring disk integrity.
After the replica VM is created, it is separate from the primary VM. Any changes in configuration to the primary VM are not reflected in the replica VM. This allows changes to be made on either VM, and the replication of the VHD content will continue.
Failing Over with Hyper-V Replica
Remember that Hyper-V Replica is a disaster-recovery solution. It is not designed to be used in place of failover clusters or other high-availability technologies. Typically, during a disaster, many steps and processes must be performed to activate a disaster-recovery site. Hyper-V Replica is not an automatic solution. It will not detect that the primary VM host is missing and start the VM on the replica server, because incorrectly detecting a site failure could cause a huge problem. Out of the box, failover must be initiated manually, but there's no reason that it can't be automated through PowerShell as part of your other processes. (Perhaps in the future, the feature will be automated through some Microsoft system management solution, such as System Center Virtual Machine Manager, to allow multiple VMs to fail over as part of a larger site-recovery process.)
There are three types of Hyper-V Replica failover—one for testing purposes and two for real disaster scenarios:
- Test failover: This type of failover is triggered on the replica VM. Hyper-V creates a temporary VM that is based on the selected recovery point, and this temporary VM can then be started on the replica Hyper-V host. Use it to confirm that replication is working as planned, or as part of a larger site-failover test process. During the test failover, the primary VM continues to send log updates to the replica VM. These updates are merged into the replica VHDs, ensuring that replication continues. When testing is complete, the temporary VM is deleted.
- Planned failover: This type of failover is triggered on the primary VM and is the preferred failover type. This process shuts down the primary VM, replicates any pending changes to ensure that no data is lost, fails over to the replica VM, reverses the replication so that changes flow in the reverse direction, and then starts the replica VM. That VM becomes the primary, whereas the old primary becomes the replica.
- Unplanned failover: This failover type is triggered on the replica VM, the assumption being that in an unplanned failover the primary is unavailable because of a disaster. When this type of failover is performed, a replication of pending changes is not possible, and reverse replication must be manually enabled with a resynchronization because there is no way to know at which point replication stopped. When starting the reverse replication, choose Do not copy the initial replication on the Initial Replication page. The original primary VM can be used, and a block-by-block comparison is performed to synchronize between the replica VM and the original primary VM. Only the delta content needs to be sent over the network.
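The practical difference between the three failover types, especially what happens to changes that have not yet been replicated, can be sketched with a small model. This is a conceptual illustration only; the `Site` and `ReplicationPair` classes and their methods are hypothetical and do not correspond to Hyper-V APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    disk: dict = field(default_factory=dict)   # block number -> content
    running: bool = False


@dataclass
class ReplicationPair:
    primary: Site
    replica: Site
    pending: dict = field(default_factory=dict)  # changes not yet replicated

    def write(self, block: int, content: str) -> None:
        self.primary.disk[block] = content
        self.pending[block] = content

    def start_test_failover(self) -> Site:
        # A temporary VM is created from the replica; replication is unaffected.
        return Site(self.replica.name + "-test", dict(self.replica.disk), True)

    def planned_failover(self) -> dict:
        self.primary.running = False                 # shut down the primary VM
        self.replica.disk.update(self.pending)       # send pending changes
        self.pending.clear()
        self.primary, self.replica = self.replica, self.primary  # reverse roles
        self.primary.running = True                  # start the former replica
        return {}                                    # nothing is lost

    def unplanned_failover(self) -> dict:
        lost = dict(self.pending)                    # primary is unreachable
        self.pending.clear()
        self.replica.running = True                  # activate the replica as-is
        return lost


planned = ReplicationPair(Site("hq", {0: "a"}, True), Site("dr", {0: "a"}))
planned.write(1, "b")
test_vm = planned.start_test_failover()
lost_planned = planned.planned_failover()

unplanned = ReplicationPair(Site("hq", {0: "a"}, True), Site("dr", {0: "a"}))
unplanned.write(1, "b")
lost_unplanned = unplanned.unplanned_failover()
```

In the planned case the pending write reaches the replica before the roles swap, so nothing is lost. In the unplanned case the same write exists only on the unreachable primary, which is why a resynchronization is needed before replication can be reversed.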
Something might be bothering you about the failover to the disaster-recovery site in a different location: The VM has a TCP/IP configuration that is unlikely to work in a separate location, which will almost certainly be on a different subnet. As part of the Hyper-V Replica functionality, an additional Failover TCP/IP configuration is available on the VM when replication has been enabled. This configuration (found under the Network Adapter configuration of the VM) allows an alternative IPv4 or IPv6 configuration to be specified on the replica VM. The network configuration is injected into the VM during a failover, as shown in Figure 4.
This process works by Hyper-V updating the VM through the Windows Server 2012 Hyper-V integration services running inside the VM. The process works only on synthetic network adapters, not on legacy network adapters, and requires Windows XP Service Pack 2 (SP2) or Windows Server 2003 SP2 and later to be running inside the VM. At the time of writing, this process does not work with Linux VMs but is actively being worked on, so that functionality should be available soon. A good practice is to complete the Failover TCP/IP configuration on the primary VM with its normal IP configuration. That way, if the replica is ever activated, replication reversed, and the VM failed back to the original primary, the correct IP address for the primary location can automatically be reinstated.
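The good practice of defining a failover configuration on both sides can be illustrated with the short sketch below, which uses Python's standard `ipaddress` module to check that each configuration is self-consistent. The site names, addresses, and the `IPv4Config` class are hypothetical, and the lookup merely stands in for the injection that the integration services perform.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IPv4Config:
    address: str
    prefix_length: int
    gateway: str

    @property
    def network(self) -> ipaddress.IPv4Network:
        # The subnet implied by the address and prefix length.
        return ipaddress.ip_interface(f"{self.address}/{self.prefix_length}").network

    def gateway_reachable(self) -> bool:
        # A default gateway must sit on the same subnet as the address.
        return ipaddress.ip_address(self.gateway) in self.network


# One failover configuration per copy of the VM (hypothetical values).
failover_configs = {
    "primary-site": IPv4Config("10.1.0.20", 24, "10.1.0.1"),
    "dr-site": IPv4Config("10.2.0.20", 24, "10.2.0.1"),
}


def injected_config(active_site: str) -> IPv4Config:
    """Configuration applied to the VM when it is activated at active_site."""
    return failover_configs[active_site]


after_failover = injected_config("dr-site")
after_failback = injected_config("primary-site")
```

Because the primary copy carries its own normal settings as its failover configuration, failing back reinstates the original address without manual work inside the guest.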
Replication for Recovery
Hyper-V Replica is a powerful feature. I teased earlier that it's useful even for organizations without a second data center; remember, certificate-based authentication is possible with replication over HTTPS. If you have a hoster that supports Windows Server 2012 Hyper-V (or hopefully Windows Azure Infrastructure as a Service—IaaS), you can replicate from your data center to the public cloud for disaster-recovery purposes. On its own, Hyper-V Replica is a great way to enable failover for individual VMs, but this functionality can also be used by other processes and orchestration components to quickly provide a powerful site-recovery feature that will benefit most organizations.