With virtualization, you can drastically reduce the number of physical boxes in your environment, carving up fewer but more powerful servers into multiple virtual environments and allocating resources based on the needs of the particular guest instances. This sounds great—until you realize you’re taking all of your eggs and putting them into a much smaller number of baskets.

To manage a virtual environment well, you need to be able to move virtual machines (VMs) between the virtual servers with no downtime and provide high availability for services that don't natively support it. You also need ways to make the virtual environments themselves highly available. For that, you need Failover Clustering (see http://windowsitpro.com/article/articleid/101489/windows-server-2008-failover-clustering.html).

2 Challenges With Windows Server 2008’s Failover Clustering
Windows Server 2008 introduced a Failover Clustering Virtual Machine application/service type, which allows Hyper-V VM configuration and virtual disk resources to be part of a resource group that can be moved between the nodes in the failover cluster. The VM configuration and virtual disk resources must be stored on shared storage. (See http://windowsitpro.com/article/articleid/101489/windows-server-2008-failover-clustering.html for background.)

With the VM as part of a resource group, you can perform a quick migration in planned situations, suspending the VM on the active node and writing the content of the memory, processor, and device registers related to the VM to a file on the shared storage. The LUN (essentially a portion of space carved from a SAN; think of it as a disk) containing the configuration and virtual hard disks (VHDs) is moved to the target node, then the memory is read from the file into a new VM created on the target node. After all this is done, the VM becomes available again.

It sounds like a lot of time, but in reality it takes around eight seconds per 1GB of memory configured to the VM; still, it’s a period of unavailability and clients with connections to the VM will time out. You could perform these failovers after hours, so the downtime wouldn’t be a big deal; however, many people want to be able to move VMs between nodes without downtime.
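As a rough illustration of the arithmetic, the sketch below turns the eight-seconds-per-gigabyte figure into an outage estimate. The function name is hypothetical and the rate is only the approximate figure quoted above; real timings depend on the storage and the hardware.

```python
SECONDS_PER_GB = 8  # approximate figure quoted in the text, not a guarantee


def estimate_quick_migration_downtime(memory_gb: float) -> float:
    """Return the approximate outage, in seconds, for a quick migration."""
    if memory_gb < 0:
        raise ValueError("memory_gb must be non-negative")
    return memory_gb * SECONDS_PER_GB


if __name__ == "__main__":
    for gb in (1, 4, 16):
        seconds = estimate_quick_migration_downtime(gb)
        print(f"{gb} GB VM: roughly {seconds:.0f} seconds offline")
```

Even a modest 4GB VM is offline for about half a minute under this estimate, which is long enough for many client connections to time out.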

Quick migration has two potential challenges, however. First, it works only in planned situations where you manually move the VM to a new node. In the event of a node crash, the memory can’t be written to a file first, so there’s no way to perform a quick migration. Although the VM is started on an alternate node, it starts in a crash-consistent state, which basically means it performs a full boot from the current VHD content, and anything in memory that had not been written to disk is lost.

The second challenge is that because you’re moving the LUN between nodes when you perform a quick migration, if you want the granularity of failover to be at the VM level, then you can have only one VM on each LUN. This is because the LUN is the smallest disk unit that can be moved between nodes in a cluster. If you placed two VMs on a single LUN and wanted to move only one VM to another node, you couldn’t; the move would force the second VM to also move.
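The granularity problem can be shown with a small model. This is only a sketch: the function and the VM and LUN names are made up for illustration and are not part of any Microsoft API.

```python
from typing import Dict, Set


def vms_moved_with(vm: str, vm_to_lun: Dict[str, str]) -> Set[str]:
    """Return every VM that moves when `vm` is quick-migrated.

    The LUN is the smallest unit that can change owner, so every VM
    stored on the same LUN has to move along with the requested one.
    """
    lun = vm_to_lun[vm]
    return {name for name, owner_lun in vm_to_lun.items() if owner_lun == lun}


if __name__ == "__main__":
    placement = {"vm1": "lun1", "vm2": "lun1", "vm3": "lun2"}
    print(sorted(vms_moved_with("vm1", placement)))  # vm2 is dragged along
    print(sorted(vms_moved_with("vm3", placement)))  # vm3 moves alone
```

Only a one-VM-per-LUN layout gives a result set containing just the VM you asked to move.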

The Solution: R2's Live Migration and Cluster Shared Volumes
In Windows Server 2008 R2, both Hyper-V and Failover Clustering have undergone changes that help to support improved high availability in a virtual environment. The goal with Server 2008 R2 is to provide a zero-downtime planned failover. However, in the event of a node crash, the VM will still start in a crash-consistent state on the new owning node with a period of downtime.

Still, Server 2008 R2’s changes address the two challenges with Server 2008 and planned failover:
1. The need to pause the VM to copy its memory to the target node.
2. The need to move LUN ownership from one node to another, which requires a time-consuming dismount and mount operation of the physical disk resource.
Let’s take a look at the changes in Server 2008 R2. They can help you get to a zero-downtime planned failover.

Live Migration and Challenge #1: Pausing the VM
To address the first challenge of having to suspend the VM to copy the memory, the Hyper-V team came up with Live Migration, which copies the VM’s memory to the target node while it’s still running. This sounds very easy, but it’s a little more complicated.

We can’t just copy the memory of a VM to another node, because as we are copying the memory, the VM is still running and parts of the memory are changing. Although we are copying from memory to memory over very fast networks, it still takes a finite amount of time. We can’t just pause the VM while we copy the memory, as that would be an outage.

The solution is to take an iterative approach. The first stage in Live Migration is to copy the VM’s configuration and device information from the existing node to the target node. This creates a shell VM on the target node that acts as a container and receives the VM memory and state.

The next stage is the transfer of the VM memory, which is the bulk of the information and takes up the bulk of the time during a Live Migration. Remember that the VM is still running, so we need a way to track pages of memory that change while we are copying. To this end, the worker process on the current node creates a “dirty bitmap” of the memory pages used by the VM and registers for modify notifications on those pages.

When a memory page is modified, the bitmap is updated to show that the page has changed. After the first pass of the memory copy is complete, all the pages that have been marked “dirty” in the bitmap are re-copied to the target. This time only the changed pages are copied, which means fewer pages to copy, so the operation should be much faster. However, while we are copying these pages, other memory pages change, and so the memory copy process repeats itself.

In an ideal world, each iteration of the memory copy has less data to copy and takes less time, so fewer pages change during it, and we eventually reach a point where all the memory has been copied and we can perform a switch. However, this might not always be the case, which is why there’s a limit to the number of memory copy passes that are performed; otherwise the memory copy might just repeat forever.

After the memory pages have all been copied or we have reached our maximum number of copy passes (eight at publication time, but this could change), it’s time to switch the VM to execute on the target node. To make this switch we suspend the VM on the source node, transfer any final memory pages that couldn’t be copied as part of the memory transfer phase, then transfer the state of the VM to the target, which includes items such as device and processor state.
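The iterative copy described above can be sketched as a small simulation. This is a toy model under stated assumptions, not Hyper-V's implementation: memory is a list of integers, the guest's activity is modeled by a hypothetical `writes_per_page_copied` probability, and a Python set stands in for the dirty bitmap. Only the eight-pass limit comes from the text.

```python
import random
from typing import List, Tuple

MAX_PASSES = 8  # limit quoted in the text at publication time


def live_migrate(source: List[int], writes_per_page_copied: float,
                 max_passes: int = MAX_PASSES,
                 seed: int = 0) -> Tuple[List[int], int, int]:
    """Simulate iterative pre-copy of `source` memory to a new target.

    Returns (target_memory, passes_used, pages_copied_while_suspended).
    `source` is mutated to imitate the guest writing while it runs.
    """
    rng = random.Random(seed)
    page_count = len(source)
    target = [0] * page_count        # empty memory of the shell VM
    dirty = set(range(page_count))   # first pass: every page must be copied
    passes = 0

    while dirty and passes < max_passes:
        to_copy = sorted(dirty)
        dirty = set()                # start tracking fresh modifications
        for page in to_copy:
            target[page] = source[page]
            # The guest is still running, so it may modify some page now.
            if rng.random() < writes_per_page_copied:
                victim = rng.randrange(page_count)
                source[victim] += 1
                dirty.add(victim)    # modify notification marks the page
        passes += 1

    # Switch-over: the VM is suspended, so nothing changes any more and
    # the remaining dirty pages can be copied safely.
    final_pages = len(dirty)
    for page in dirty:
        target[page] = source[page]
    return target, passes, final_pages


if __name__ == "__main__":
    memory = list(range(1000))
    copied, passes, final_pages = live_migrate(memory, 0.5)
    print(f"passes={passes}, pages copied while suspended={final_pages}")
    print("target matches source:", copied == memory)
```

In this model, a lower write rate makes each pass shorter than the last, so only a handful of pages remain for the brief suspended phase. A write rate near 1 shows why the pass limit is needed.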

We then resume the VM on the target node. An unsolicited ARP reply is sent to announce that the IP address used by the VM has moved to a new location, which enables routing devices to update their tables. From this moment, clients connect to the target node.

You might be wondering which of these actions is done automatically and which requires admin actions. The answer is that all of this is automatic: The only action an admin performs is to initiate a live migration.

Yes, there’s a brief suspension of the VM, which is required to copy the state information, but it lasts only milliseconds and is well below the TCP connection timeout threshold. Clients won’t disconnect during the live migration process, and users are unlikely to notice anything.

After the migration to the new target is complete, the previous host is notified that it can clean up the VM environment. Figure 1 shows the entire process: A VM container is created on the target, the memory is copied in several phases, then the VM state is transferred, which then allows the VM to start on the target.

So Live Migration allows the migration of the configuration, memory, and state of a VM, with essentially no downtime. Great—but that’s only one of the two challenges solved. What about the movement of the LUN containing the VM configuration files and VHDs? We need to remove the requirement to move the LUN between nodes in the cluster.

Cluster Shared Volumes and Challenge #2: Moving the LUN
The dismount and mount operations involved in moving the LUN require downtime, which might exceed the TCP connection timeout window and result in client disconnections. The underlying limitation is that NTFS is a shared-nothing file system and doesn’t support multiple OS instances connecting to it concurrently. (The actual SAN holding the LUNs supports multiple concurrent connections with no problem.)

One solution would have been to create a new cluster-aware file system that could be mounted on multiple nodes in the cluster at the same time, which would remove the LUN failover requirement. However, this would have been a huge undertaking from both a development perspective and a testing perspective, considering how many services, applications, and tools are based around features of NTFS. Adding another file system would also have increased hardware and support costs, which would have discouraged deployment.

So Microsoft looked at ways to make NTFS-formatted LUNs available to multiple nodes in a cluster concurrently, enabling all the nodes to read and write at the same time, and came up with Cluster Shared Volumes (CSVs).

After you enable CSV, which Figure 2 shows, you select one or more disks that are available as cluster storage and enable them for CSV. When you enable a disk for CSV, any previous mounts or drive letters are removed and the disk is made available as a child folder of the %SystemDrive%\ClusterStorage folder, named Volume followed by a number; for example, C:\ClusterStorage\Volume1 for the first volume, C:\ClusterStorage\Volume2 for the next.

The content of the disk will be visible as content within that disk’s volume folder, which Figure 3 shows. As a best practice, place each VM in its own folder.
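To make the naming convention concrete, here is a minimal sketch that builds the paths described above. The helper names and the VM name are hypothetical, and the system drive is assumed to be C:. Python's `ntpath` module is used so the example produces Windows-style paths on any OS.

```python
import ntpath


def csv_volume_path(index: int, system_drive: str = "C:") -> str:
    """Return the folder under which the Nth CSV disk appears."""
    if index < 1:
        raise ValueError("volume numbering starts at 1")
    return ntpath.join(system_drive + "\\", "ClusterStorage", f"Volume{index}")


def vm_folder(volume_index: int, vm_name: str) -> str:
    """Return a per-VM folder on a CSV, following the one-folder-per-VM practice."""
    return ntpath.join(csv_volume_path(volume_index), vm_name)


if __name__ == "__main__":
    print(csv_volume_path(1))
    print(vm_folder(2, "web01"))
```

Because every node sees the same paths, a VM's configuration can refer to its folder in the same way no matter which node is currently running it.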

The ClusterStorage structure is shared, providing a single consistent file name space to all nodes in the cluster, so every node has the same view. After you add a disk to CSV, it’s accessible to all nodes at the same time.

All nodes can read and write concurrently to storage that’s part of ClusterStorage. It sounds great, but as NTFS doesn’t support multiple owners at the same time, how does it work?

How CSVs Work
Each CSV is physically mounted on only one node in the cluster, which is assigned to act as the coordinator node. The coordinator node has complete access to the disk as a locally mounted device.

The other nodes don’t have the disk mounted but instead receive a raw sector map of the files of interest to them on each LUN that’s part of the CSV. This sector map enables the non-coordinator nodes to perform read and write operations directly to the disk without actually mounting the NTFS volume, a process called Direct I/O. However, non-coordinator nodes route all metadata operations to NTFS on the coordinator node. When a non-coordinator node needs to perform a metadata action, it forwards the action over the network to the coordinator node, which then makes the namespace changes on the non-coordinator node's behalf.

In Figure 4, the CSV filter allows Direct I/O to all disks using CSV but redirects metadata operations to the coordinator for each volume. In this example, one node is the coordinator for both disks but this doesn’t have to be the case. Note the CSV filter is present on all nodes but for simplicity is not shown on the coordinator.

The CSV filter actually gives us another great feature. In the event a non-coordinator node loses direct access to the LUN—for example, its iSCSI network connection fails—all of its I/O can be performed over the network via the coordinator node using the cluster network (more on this in a second).

This is known as redirected I/O, and it works great. During testing, I accidentally shut off the iSCSI network from one of my boxes, and I didn’t know until I happened to see the CSV was in Online (Redirected I/O) mode. All of the VMs on it were still running great with no performance degradation. Everything continued to work because all the I/O was now being sent over the network between the node running the VMs and the coordinator node for the LUN, where the VMs resided.

Figure 5 shows such a scenario, in which a node has lost access to the storage directly and the CSV filter redirects all I/O via the network.
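The routing decisions described so far can be summarized in a short sketch. It is a simplified model of the behavior in the text, not the CSV filter's actual logic, and the class, enum, and function names are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "handled by the locally mounted NTFS volume"
    DIRECT = "direct I/O to the LUN using the sector map"
    VIA_COORDINATOR = "sent over the cluster network to the coordinator"


@dataclass
class Node:
    name: str
    is_coordinator: bool
    has_storage_path: bool = True  # False when, say, the iSCSI link is down


def route_io(node: Node, is_metadata: bool) -> Route:
    """Decide how a single I/O request reaches the CSV disk."""
    if node.is_coordinator:
        return Route.LOCAL            # the disk is mounted here
    if is_metadata:
        return Route.VIA_COORDINATOR  # namespace changes go through NTFS
    if not node.has_storage_path:
        return Route.VIA_COORDINATOR  # redirected I/O
    return Route.DIRECT


if __name__ == "__main__":
    nodes = [
        Node("NodeA", is_coordinator=True),
        Node("NodeB", is_coordinator=False),
        Node("NodeC", is_coordinator=False, has_storage_path=False),
    ]
    for node in nodes:
        for is_metadata in (False, True):
            kind = "metadata" if is_metadata else "data"
            print(f"{node.name} {kind}: {route_io(node, is_metadata).value}")
```

The only difference between a healthy non-coordinator node and one in redirected mode is the path that plain data I/O takes; metadata goes through the coordinator in both cases.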

One question that often comes up when talking about the CSV redirect I/O is, which network is used? Suddenly potentially huge amounts of traffic are being sent over the network between nodes in the cluster instead of over the dedicated storage networks (if iSCSI is used) or cabling (for Fibre Channel/SAS). 

The answer is the NetFT network, a virtual network that binds to one of the physical cluster networks that has been enabled for cluster use. It’s the equivalent of the old private network we had in Windows Server 2003 and is used for internal cluster communications such as heartbeat.

The route that NetFT creates for internal communication traffic is based on an automatic metric, which is given to each cluster network: The network with the lowest metric is used for internal communication. The metric assignment is such that a cluster-enabled network that’s not enabled for client communications, and therefore is only for cluster communications, will have the lowest metric and thus will more likely be used by NetFT.

Given the critical role of the network with CSV and Live Migration, you need to make sure your cluster has a dedicated network just for cluster communications that’s connected to a gigabit or higher dedicated switch. Actually, Live Migration doesn’t really use the NetFT virtual network; instead, every VM has its own properties including those that determine which networks can be used for the Live Migration traffic.

In the beta builds of Windows Server 2008 R2, the default order for Live Migration was based on the same metrics used by NetFT, so whatever network NetFT bound to would be the top network used by Live Migration. This changed in the Release Candidate and the final code, as Microsoft decided it didn't want the NetFT traffic and Live Migration traffic competing on the same network.

So, by default, the Live Migration traffic is enabled on the network with the second lowest metric. You can change the Live Migration network order and available networks for Live Migration traffic at your discretion.
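The default selection can be illustrated with a few lines of code. This sketch assumes the metrics are already known; the network names and metric values are invented for the example, and the real cluster assigns its own values automatically.

```python
from typing import Dict, Tuple


def pick_default_networks(metrics: Dict[str, int]) -> Tuple[str, str]:
    """Return (NetFT network, default Live Migration network).

    NetFT uses the network with the lowest metric; Live Migration
    defaults to the network with the second lowest metric.
    """
    if len(metrics) < 2:
        # Assumption of this sketch: at least two cluster networks exist.
        raise ValueError("need at least two cluster networks")
    ordered = sorted(metrics, key=lambda name: metrics[name])
    return ordered[0], ordered[1]


if __name__ == "__main__":
    example = {"Client": 10000, "Cluster Internal": 1000, "iSCSI": 1100}
    netft, live_migration = pick_default_networks(example)
    print("NetFT binds to:", netft)
    print("Live Migration defaults to:", live_migration)
```

With these made-up values the second lowest metric belongs to the iSCSI network, which is exactly the kind of default you would want to review before relying on it.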

In Figure 6, you can see that I manually deselected the other networks so Live Migration traffic can be sent only over the Cluster Internal network, as I didn't want to use a separate network for Live Migration. Make sure you check the networks you are using for Live Migration in your environment, as it's quite possible that Live Migration will choose a network you did not want used for cluster traffic, such as the iSCSI network!

The actual coordinator node can be changed with minimal impact. There’s a slight pause in I/O if you move the coordinator to another node, as the I/O is queued at each node. However, the pause is unlikely to be noticed, which is crucial given how important the coordinator node is to CSV.

Having multiple nodes directly writing to blocks on the disk can cause some complications, mainly because most utilities don’t expect it. When you want to perform a backup or other disk action such as a defragmentation or chkdsk, you need to put the disk in maintenance mode, which disables direct I/O from the other nodes in the cluster and makes them use redirected I/O. This ensures only the coordinator node is accessing the disk, which stops interference with backups and disk operations.

The good news is that in the final Server 2008 R2 release, a PowerShell cmdlet, Repair-ClusterSharedVolume, exposes the defrag and chkdsk actions and performs all the other preparation tasks for you.

The Current CSV Scenario
It’s important to note that currently CSV supports only Hyper-V. Although CSV is visible from all nodes (and I’m sure we can all think of many other uses for this method to share a LUN concurrently on multiple nodes in the cluster for Server 2008 R2), when you enable CSV in the Failover Cluster Management MMC snap-in, you are reminded of the Hyper-V exclusive use, so don’t stray.

In the future, other scenarios for CSV might be added. By using CSV, we’re no longer required to move LUNs between nodes in the cluster during the migration of a VM because the LUN is available to all nodes all the time, solving the mount/dismount problem.

CSV combined with Live Migration offers a migration with no user impact. To perform a migration, you simply complete the action shown in Figure 7. You can also still perform the old-style quick migration using the Move virtual machine action, which Figure 7 also shows.

It’s important to look at CSV as more than part of a zero-downtime VM migration story. Previously we had to maintain multiple LUNs to be able to make the information on them available to different nodes in the cluster. For example, at a minimum, a four-node cluster required four LUNs to be able to move VMs independently of one another. Now, with CSV, the LUNs that are part of cluster storage are available to all nodes, so you don’t need separate LUNs. This lets you share your free space among all VMs on a LUN and makes the configuration validation wizard faster, since it has to test fewer LUNs.

Likewise, you don’t have to use CSV with Live Migration. You can use Live Migration on its own and accept the small suspension of availability while a LUN is failed over to the new target node. (But why would you want to?)

Or you can use another cluster file system such as Melio FS, which allows multiple concurrent connections from nodes in a cluster. However, it costs more to use a proprietary file system, whereas CSV only requires standard NTFS.

A Great High Availability Story
Live Migration and Cluster Shared Volumes together offer a great high availability story with Hyper-V—and after trying for a long time to break Hyper-V, I can honestly say it works well. For those of us using the standalone Hyper-V Server, the great news is that Hyper-V Server 2008 R2 is built on the Enterprise Edition of 2008 R2 Server Core, which means the free virtualization platform has clustering support—we get Live Migration and CSV for nothing!

 
