Architecture basics for improved storage
Microsoft Hyper-V was introduced in Windows Server 2008. This enterprise-ready virtualization solution provides true hypervisor-based virtualization that enables virtual machine (VM) performance that’s equivalent to running on native hardware. Hyper-V uses failover clustering as the mechanism to create highly available Hyper-V environments. Failover clustering lets you move a VM from one node to another with minimal downtime through quick migration, which essentially saves the VM memory and state to disk, pauses the VM, dismounts the LUN on the current VM owner node, mounts the LUN on the target VM owner node, reads the memory and state information back into the VM on the target, and then starts the VM. This process is fairly fast but typically makes the VM unavailable for about 30 seconds (or longer, depending on configuration), which causes users to be disconnected.
One of the biggest shortcomings of Server 2008’s Hyper-V is its inability to move a VM between nodes in a failover cluster without any downtime. For Hyper-V to realistically compete against other virtualization solutions, its VM migration technology needed an overhaul in Windows Server 2008 R2. This revamp required two major changes. First, a VM’s memory and state had to be copied between nodes while the VM was still running. This change avoids the long downtime associated with saving memory to disk and then reading it back from disk on the target machine. Hyper-V in Server 2008 R2 introduced Live Migration to address this issue. Second, the LUN dismount and mount operation had to be removed, which meant making the configuration and Virtual Hard Disk (VHD) available to the source and target nodes simultaneously. This article focuses on the second Hyper-V change in Server 2008 R2.
The Shared Nothing Problem
NTFS is a very monogamous file system. An NTFS volume can be accessed and used by only one OS instance at a time; it wasn’t designed as a cluster file system. However, NTFS is also a very powerful file system: highly secure, industry tested, and with a huge support ecosystem that includes backup, defragmentation, and maintenance tools, as well as many services that rely on features of the file system.
A cluster consists of multiple nodes, each running an instance of Windows Server. But when NTFS volumes are used in a cluster, how is the integrity of NTFS maintained, and how are multiple concurrent mounts of an NTFS volume prevented? In a failover cluster, shared disks (i.e., disks that all the nodes in a cluster have a path to—which means they’re typically LUNs on a SAN) are resources of the cluster. Like other resources in a failover cluster, these resources have only one owner at any one time. Therefore, multiple mounts of a LUN are blocked because only one node can own the resource. The resource owner mounts the LUN and can access the NTFS volumes stored in the LUN.
Consider what this lack of sharing means for Hyper-V on failover clusters, and for VMs in particular. Hyper-V uses disks to store not only the VHD for the VM but also the various configuration and state files. If an NTFS volume can’t be accessed by more than one node at a time, any VMs that share a LUN for storage of the VHD and configuration must run on the same node. Moving one VM on its own that shares a LUN for storage with another VM isn’t possible; all the VMs sharing the LUN must move as a collective unit. This restriction means that in Server 2008, each VM must have its own LUN so that each VM can be moved around the cluster independently of other VMs, by dismounting the LUN and mounting it on the new node. A limit of one VM per LUN means administrators must deal with a lot of LUNs, as well as a lot of potential wasted space, because each LUN is provisioned with a certain amount of space and room to grow.
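The "move as a collective unit" restriction can be illustrated with a small Python sketch. This is only a toy model, not anything Hyper-V exposes: the VM names, LUN names, and the function vms_that_move_together are all hypothetical.

```python
# Hypothetical placement: VM name -> LUN holding its VHD and configuration files.
VM_TO_LUN = {
    "vm-web": "LUN1",
    "vm-sql": "LUN1",
    "vm-file": "LUN2",
}


def vms_that_move_together(vm, vm_to_lun):
    """Return every VM that must move when `vm` moves.

    Without cluster shared volumes a LUN is mounted on exactly one node,
    so all VMs stored on the same LUN have to follow it.
    """
    lun = vm_to_lun[vm]
    return sorted(name for name, other in vm_to_lun.items() if other == lun)


print(vms_that_move_together("vm-web", VM_TO_LUN))   # ['vm-sql', 'vm-web']
print(vms_that_move_together("vm-file", VM_TO_LUN))  # ['vm-file']
```

In this model, only a VM that has a LUN to itself can be moved on its own, which is why Server 2008 deployments end up with one LUN per VM.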
In addition to extra management of multiple LUNs and the potential for a lot of wasted space, another problem with Hyper-V in Server 2008 is the act of dismounting the LUN from the current owning node and mounting on the target node when a VM needs to be moved. This dismount/mount operation takes a lot of time. The goal in Server 2008 R2 was to achieve a zero downtime migration solution for VMs, with no visible end-user effects. The Hyper-V team implemented Live Migration to transfer memory and state information without stopping the VM. However, dismounting and mounting the LUN that contains the VHD requires a pause of the VHD, which results in downtime. To prevent the need to dismount and mount during moves and to avoid having one LUN per VM, as well as to save space, prevent administrative headaches, and—ultimately—save money, it’s necessary to have a LUN that’s accessible by all nodes in the cluster at once.
Cluster Shared Volumes
Although NTFS has a sharing issue, SANs themselves have no problem with multiple concurrent connections to a LUN, which means that the only changes necessary involve the file system or how it’s accessed. One option is to create a whole new file system that’s clusterable; however, this solution requires a lot of development and testing—plus, it negates all the support and trust in NTFS. Another solution is to fundamentally change NTFS and how it handles metadata updates; however, changing NTFS in this manner would be a Herculean task and would require numerous changes to the Windows OS, as well as changes to many applications and services that use NTFS—not to mention the additional testing required.
Microsoft’s solution to the shared nothing problem was to make NTFS sharable without changing the file system at all—which seems like an obvious solution but is more difficult to implement than you might think. Server 2008 R2 achieves this goal with the addition of cluster shared volumes.
NTFS has problems with concurrent access because of the way it handles metadata changes, which are changes that affect the actual file system structure, such as file size. Having multiple entities updating the metadata at the same time can lead to corruption. Cluster shared volumes solve this problem by nominating one of the nodes in the cluster as the owner (i.e., the coordinator) for each cluster shared volume. That owner node then performs all metadata updates to the volume, while the other nodes can perform direct I/O to the volume, which for virtualization loads is the majority of the access (such as reading and writing to a VHD). The owner node has the cluster shared volume locally mounted and therefore has full access. Each cluster shared volume disk has its own owning node, rather than having one node nominated to be the owning node for every cluster shared volume disk in the cluster. Multiple nodes can be owners for different cluster shared volumes. The owning node selection is dynamic in nature. If an owning node is shut down or fails, then another node automatically becomes the owning node for any cluster shared volume disks that node owned.
This separation of metadata activity and normal I/O is achieved through the introduction of the cluster shared volume filter into the file system stack on each node in the cluster when the cluster shared volume is enabled. When a non-owner node needs to make a change to metadata, such as a file extend operation on a dynamic VHD, then that metadata change, which is generally small in size, is captured by the cluster shared volume filter and sent over the cluster network to the owning node, which performs the metadata update on behalf of the non-owning node. The network used for the cluster shared volume communication between the owner and non-owner nodes is the same network used for the cluster health communication (i.e., the cluster network). The non-owning nodes can perform direct I/O to the LUNs because the cluster shared volume filter actually creates a sector map for each cluster shared volume disk that shows where the file data resides. This map is shared with all nodes in the cluster, giving them direct access to the correct sectors.
As you can see in Figure 1, both nodes have connectivity to the storage. However, the LUNs are actually locally mounted on the owning node, which can perform both data and metadata actions, whereas the non-owning node can perform only direct I/O, such as reads and writes of blocks on disk.
Figure 1: Cluster shared volume architecture
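The split between metadata updates and direct I/O can be sketched as a toy Python model. It is not the real cluster shared volume filter: the class, its method names, and the returned strings are hypothetical, and the rule used to pick a new owner after a failure (first remaining node) is an assumption, because the article doesn't describe the actual selection logic.

```python
class ClusterSharedVolume:
    """Toy model of one cluster shared volume and its owning (coordinator) node."""

    def __init__(self, name, owner, nodes):
        self.name = name
        self.owner = owner        # the only node that applies metadata updates
        self.nodes = list(nodes)  # nodes assumed to have a path to the LUN

    def route(self, node, operation):
        """Describe how `node` carries out `operation` ('data' or 'metadata')."""
        if node not in self.nodes:
            raise ValueError(f"{node} is not a node of this cluster")
        if operation == "data":
            return "direct I/O to LUN"
        if operation == "metadata":
            if node == self.owner:
                return "local metadata update"
            return f"metadata forwarded to {self.owner}"
        raise ValueError(f"unknown operation: {operation}")

    def fail_node(self, node):
        """Drop a failed node and pick a new owner if the owner failed."""
        self.nodes.remove(node)
        if node == self.owner:
            # Assumption: take the first remaining node as the new owner.
            self.owner = self.nodes[0]


csv = ClusterSharedVolume("Volume1", owner="NodeA", nodes=["NodeA", "NodeB"])
print(csv.route("NodeB", "data"))      # direct I/O to LUN
print(csv.route("NodeB", "metadata"))  # metadata forwarded to NodeA
csv.fail_node("NodeA")
print(csv.route("NodeB", "metadata"))  # local metadata update
```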
An important point here is that cluster shared volumes are a solution for Hyper-V VM workloads; as such, this solution has been optimized around the way in which storage is used by VMs. Cluster shared volumes shouldn’t be used, nor are they supported, for anything other than storing Hyper-V VMs. When you enable cluster shared volumes for a failover cluster through the Microsoft Management Console (MMC) Failover Cluster Management snap-in, you must accept the dialog box terms and restrictions that explain that the cluster shared volume can be used only for files created by the Hyper-V role.
The cluster shared volume filter provides an additional benefit to storage access within the failover cluster. Normally, the cluster shared volume filter intercepts metadata activity and passes these requests to the owner node for action; however, the filter can also intercept all I/O made to a cluster shared volume disk and pass it over the cluster network to the owning node for execution. When this interception of all I/O is used, the cluster shared volume disk is in redirected mode. Two scenarios show why this ability matters.
You should eliminate single points of failure in any high-availability solution. Microsoft Multipath I/O (MPIO) is a great solution to allow multiple paths to storage, which eliminates a single point of failure for accessing storage. Problems can still occur, and many installations don’t use MPIO—which is where redirected mode can be a life saver. If a node in a cluster loses the direct connectivity to a LUN hosting a cluster shared volume disk (typically because of a problem connecting to the SAN hosting the LUN), then the access to the cluster shared volume automatically switches to redirected mode until the node reestablishes direct connectivity to the LUN. In the meantime, all the I/O is sent over the cluster network to the owning node and executed, which lets the node that lost connectivity to the storage continue functioning with no interruption to the VMs, as Figure 2 shows. In redirected mode, a lot more traffic is sent over the cluster network. This consideration is important in selecting the specifications for the cluster network (e.g., 10Gb versus 1Gb). Only the node that lost connectivity to the LUN goes into redirected mode for the cluster shared volume hosted on the LUN. Other nodes in the cluster that still have connectivity will continue to perform direct I/O and only send the metadata over the cluster network to the owning node.
Figure 2: Redirection of I/O in the case of lost connectivity
Beyond connectivity failures, there are certain types of maintenance operations that simply don’t work well with multiple OS instances directly writing to blocks on disk. This is another reason for redirected mode—sometimes it’s necessary to have only one node writing to a disk. Manually placing a cluster shared volume disk in redirected mode accomplishes this. When you manually place a cluster shared volume disk in redirected mode, all nodes in the cluster go into redirected mode for the specific cluster shared volume disk. All I/O for the cluster shared volume disk for all nodes is then sent over the cluster network to the owning node for that specific cluster shared volume disk.
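The two ways a node ends up in redirected mode can be summarized in a short Python sketch. The function and parameter names are hypothetical, and the sketch assumes that the owning node always keeps its own path to the LUN, a case the article doesn't cover.

```python
def data_io_path(node, owner, nodes_without_san_path=frozenset(), manual_redirect=False):
    """Return how `node` performs data I/O on one cluster shared volume disk.

    'local'      - the owning node, which has the volume locally mounted
    'direct'     - a non-owning node writing blocks straight to the LUN
    'redirected' - I/O sent over the cluster network to the owning node
    """
    if node == owner:
        # Assumption: the owner still has connectivity to the LUN.
        return "local"
    if manual_redirect or node in nodes_without_san_path:
        return "redirected"
    return "direct"


nodes = ["NodeA", "NodeB", "NodeC"]

# NodeB loses its path to the SAN: only NodeB is redirected.
print([data_io_path(n, "NodeA", {"NodeB"}) for n in nodes])
# ['local', 'redirected', 'direct']

# Manual redirected mode: every non-owning node is redirected.
print([data_io_path(n, "NodeA", manual_redirect=True) for n in nodes])
# ['local', 'redirected', 'redirected']
```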
Cluster Shared Volume Requirements
Cluster shared volumes don’t have any specific requirements regarding shared storage within a failover cluster. If the storage is available as shared storage within a cluster, it can be added to the cluster shared volume namespace and made accessible to all nodes in the cluster concurrently. However, there are some network requirements when cluster shared volumes are used, especially with Live Migration. When you use cluster shared volumes, some storage actions (namely metadata updates) are sent over the network rather than directly to the storage. The network needs to be low latency to avoid introducing any lag in disk operations but typically doesn’t need to be high bandwidth because of the minimal size of the metadata traffic under normal circumstances.
The cluster network has some specific configuration requirements to support cluster shared volume communication, in addition to some configuration on the nodes themselves. The cluster network should be a private network that only the network adapters that are used for the cluster shared volume are connected to. The IPv6 protocol must be enabled, because Microsoft’s development and testing was based on IPv6. IPv4 can be disabled on the cluster network adapter.
Because the cluster shared volume communication between nodes occurs via Server Message Block (SMB), the Client for Microsoft Networks and File and Printer Sharing for Microsoft Networks services must be enabled on the network adapter that’s used for the cluster network (as well as the cluster shared volume). Disabling NetBIOS over TCP/IP on the network adapter is also recommended.
On the failover cluster nodes, you need to ensure that the server and workstation services are running and that NTLM hasn’t been disabled. Although failover clustering uses Kerberos exclusively for its normal operations, NTLM is required for the cluster shared volume communication over the cluster network. NTLM is typically disabled through Group Policy; therefore, checking the Group Policy Resultant Set of Policy (RSoP) results will help confirm that NTLM isn’t disabled. A great way to actually check SMB connectivity between nodes is to run the command
net use * \\<cluster network IP address of another node>\c$
on all the nodes.
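Before running net use, a lighter precheck is to confirm that the SMB port on the other node is reachable at all. The following Python sketch assumes SMB over TCP port 445; the function name is hypothetical. It tests only TCP reachability and says nothing about NTLM authentication or share access, so it complements the net use test rather than replacing it.

```python
import socket


def smb_port_reachable(host, port=445, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling smb_port_reachable("192.0.2.11") from each node, with the placeholder address replaced by the cluster network IP address of another node, shows whether a failing net use is a connectivity problem or an authentication problem.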
Picking the Cluster Network
I discussed the cluster network for transporting cluster shared volume traffic, which is typically just metadata but in redirected mode could be all the I/O for a cluster shared volume disk. It’s important to architect your networks to ensure optimal performance and availability.
With Server 2008 failover clusters, you don’t usually have only a single network that the cluster can use for communication—which would constitute a single point of failure. Instead, the Network Fault Tolerant (NetFT) driver picks a network to use for the cluster communication, based on several attributes of the available networks.
For each network that’s available in the cluster, the administrator can specify whether the network can be used for cluster communication, such as cluster health and cluster shared volume traffic, and whether clients can connect through the network, as Figure 3 shows. These settings are used to calculate a metric for each network adapter. The metric assignment is such that a cluster-enabled network that isn’t enabled for client communication, and therefore is used only for cluster communication, will have the lowest metric and therefore will more likely be used by NetFT. Given the critical role of NetFT with cluster shared volumes, you need to ensure that your cluster has a dedicated network—only for cluster communication—that’s connected to a gigabit or higher dedicated switch to prevent disruption to heartbeat communication and avoid false failovers.
Figure 3: Setting cluster network properties
To see all your networks, including which metric they’ve been assigned, you can use Server 2008 R2’s new PowerShell Failover Clustering module, as follows:
PS C:\> Import-Module FailoverClusters
PS C:\> Get-ClusterNetwork | ft Name, Metric, AutoMetric
Name             Metric  AutoMetric
----             ------  ----------
Client Network     1000       False
Cluster Network     100       False
iSCSI Network     10000        True
LM Network          110       False
The first line imports the Failover Clustering module; the next line lists the cluster networks, which are then formatted into a table.
Note in my example that AutoMetric is false for my Client, Cluster, and Live Migration networks. I set these metrics manually, to ensure that NetFT always uses the network I want for cluster communication and the cluster shared volume.
To set a metric, you must first create an object pointer that points to the network. For example:
$netobj = Get-ClusterNetwork "Cluster Network"
Now that we have the object pointer, we can modify its Metric attribute as follows:
$netobj.Metric = <custom value, e.g., 100>
You should be careful when using this method. Typically, if you set all other attributes correctly (e.g., which networks can be used by the cluster, which networks are exclusively for client communication), you don’t need to manually change the metrics.
If you’re using Live Migration in the cluster, Live Migration uses the network with the second lowest metric, by default. This prevents the Live Migration traffic from clashing with the cluster and the cluster shared volume traffic. You can use the Failover Cluster Management MMC to change which networks Live Migration can use for each VM, via the Network for live migration tab of each VM’s properties.
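The selection rule (lowest metric for cluster and cluster shared volume traffic, second lowest for Live Migration by default) can be checked against the example output with a few lines of Python. This is only a sketch of the rule as the article states it. The function name is hypothetical, and the sketch ignores whether a network is enabled for cluster use at all.

```python
# Metrics copied from the example Get-ClusterNetwork output earlier in this section.
NETWORK_METRICS = {
    "Client Network": 1000,
    "Cluster Network": 100,
    "iSCSI Network": 10000,
    "LM Network": 110,
}


def pick_cluster_and_live_migration(metrics):
    """Return (cluster network, default Live Migration network) by ascending metric."""
    ranked = sorted(metrics, key=metrics.get)
    if len(ranked) < 2:
        raise ValueError("need at least two networks to separate the traffic")
    return ranked[0], ranked[1]


print(pick_cluster_and_live_migration(NETWORK_METRICS))
# ('Cluster Network', 'LM Network')
```

With the manually assigned metrics of 100 and 110, the cluster and Live Migration traffic land on the two dedicated networks, which is the outcome the manual metrics were meant to guarantee.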
Networking is an art unto itself in failover clusters that use cluster shared volumes and Live Migration, and even more so when iSCSI is used. In this scenario, you’d typically see five network adapters, each of which might actually consist of multiple physical network adapters teamed together for resiliency and performance:
- One network adapter for management of the Hyper-V host
- At least one network adapter for virtual networks used by the VMs
- One network adapter for cluster communication and the cluster shared volume
- One network adapter for Live Migration traffic
- One network adapter for iSCSI communication
For more information about network adapters, see the Microsoft TechNet article “Hyper-V: Live Migration Network Configuration Guide.” This article discusses all the network adapters you might have, including how to best separate the various types of traffic. To learn more about cluster network configurations, watch my video, "CSV and NetFT Traffic Deep Dive."
Using Cluster Shared Volumes
Using a cluster shared volume is easy. After you add available disk storage to the cluster shared volume, the volumes on the disk are exposed as folders under the \%systemdrive%\ClusterStorage folder, which by default are named Volume1, Volume2, Volume3, and so on. For example, on a node with C as the system drive, the first cluster shared volume would be accessible as C:\ClusterStorage\Volume1, the second cluster shared volume as C:\ClusterStorage\Volume2, and so on, as Figure 4 shows.
Figure 4: Cluster shared volume folder structure
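Because the namespace is identical on every node, the path to a VM’s files can be derived from the system drive and the volume number alone. The following Python sketch shows this using the default VolumeN folder names; the helper name and the VM folder and file names are hypothetical.

```python
from pathlib import PureWindowsPath


def csv_path(system_drive, volume_number, *parts):
    """Build a path under <system drive>:\\ClusterStorage\\Volume<N>.

    Assumes the default folder names; a renamed volume folder would differ.
    """
    root = PureWindowsPath(f"{system_drive}:\\") / "ClusterStorage" / f"Volume{volume_number}"
    return str(root.joinpath(*parts))


print(csv_path("C", 1))                         # C:\ClusterStorage\Volume1
print(csv_path("C", 2, "vm-web", "disk0.vhd"))  # C:\ClusterStorage\Volume2\vm-web\disk0.vhd
```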
Every node in the failover cluster has exactly the same ClusterStorage namespace, all with the same volumes and content. You can rename VolumeX, but you can’t rename the ClusterStorage folder. Note that each cluster shared volume doesn’t need a drive letter; therefore, there are no restrictions based on available drive letters and no more management through a globally unique identifier (GUID). Your VMs and VHDs are placed in a volume under \ClusterStorage and used as usual. Nothing special needs to be done—although this doesn’t mean there are no considerations to take into account.
Server 2008 requires you to put one VM on each LUN, to enable maximum flexibility in VM placement and migration. With cluster shared volumes, all the VMs and VHDs can sit on a single cluster shared volume–enabled LUN. But just because you can put all VHDs on one LUN doesn’t mean you should. Consider the typical rules for placing application data or databases, log files, and the OS on separate disks. If all these VHDs are on the same cluster shared volume, you lose a lot of the benefit of different disks providing protection. To provide better protection, you might instead want to consider separate cluster shared volumes, with one designated to store OS VHDs, one to store log VHDs, and one for application data or database storage.
Performance is also critical. You need to consider your VMs’ I/O operations per second (IOPS) requirement. Putting numerous VMs on a single LUN might help cut down on management and wasted space; however, you need to understand the capabilities of the LUN to ensure it meets the combined IOPS requirements of all the VMs on the LUN.
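The IOPS check is simple arithmetic: add up the requirements of every VM on the LUN and compare the total with what the LUN can deliver. The following Python sketch uses hypothetical figures; real numbers would come from your storage vendor and from measuring the VMs.

```python
def lun_iops_headroom(lun_iops_capacity, vm_iops):
    """Return spare IOPS on the LUN; a negative result means it is oversubscribed."""
    return lun_iops_capacity - sum(vm_iops.values())


# Hypothetical requirements for VMs sharing one cluster shared volume LUN.
vm_iops = {"vm-web": 300, "vm-sql": 1200, "vm-file": 400}

print(lun_iops_headroom(2000, vm_iops))  # 100

vm_iops["vm-mail"] = 250
print(lun_iops_headroom(2000, vm_iops))  # -150
```

In this example the fourth VM pushes the combined requirement past the LUN’s capability, so it belongs on a different cluster shared volume.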
Maintaining Cluster Shared Volumes
As I mentioned earlier, having every node able to directly access the blocks on cluster shared volumes is great for flexible architecture, availability, and functionality but does introduce some complications when we consider volume maintenance. Many utilities expect exclusive access to a volume and its blocks during operation. Imagine running a disk defragmentation on an owning node, while meanwhile another node writes to a block on the disk that the defrag operation just moved—not good! Consider Microsoft Volume Shadow Copy Service (VSS) backups, chkdsk, and so on—none of these operations would work if multiple nodes could write to blocks on the disk while the backup or utility was trying to run.
This is the situation in which placing a cluster shared volume disk in redirected mode is important. However, the good news is that you don’t have to worry about this as long as you use utilities and backup software that support cluster shared volumes.
Actions such as running the inbox defragmentation utility and chkdsk should be performed via the Repair-ClusterSharedVolume PowerShell cmdlet, which automatically places the cluster shared volume in redirected mode, performs the tasks, and then takes the cluster shared volume out of redirected mode to resume normal operations.
Backup vendors also have access to a special API call: PrepareVolumeForSnapshotSet(). This call automatically places the cluster shared volume in redirected mode and releases it after the backup is complete. If you’re performing a hardware-based VSS snapshot, the redirected mode should be required for only a few seconds. A software-based VSS snapshot, however, might keep the cluster shared volume in redirected mode for quite some time while the backup completes.
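Both the repair cmdlet and the backup API follow the same pattern: enter redirected mode, do the work, then always leave redirected mode. The following Python sketch models that pattern with a context manager. It is a conceptual illustration only and doesn’t call any cluster API; the Volume class and function names are hypothetical.

```python
from contextlib import contextmanager


class Volume:
    """Toy stand-in for a cluster shared volume disk."""

    def __init__(self, name):
        self.name = name
        self.redirected = False
        self.log = []


@contextmanager
def redirected_access(volume):
    """Hold `volume` in redirected mode for the duration of the block."""
    volume.redirected = True
    volume.log.append("enter redirected mode")
    try:
        yield volume
    finally:
        # Runs even if the maintenance task raises, so direct I/O always resumes.
        volume.redirected = False
        volume.log.append("leave redirected mode")


def run_maintenance(volume, task_name):
    """Run one maintenance task while only the owning node writes to the disk."""
    with redirected_access(volume):
        volume.log.append(f"run {task_name}")
    return volume.log


vol = Volume("Volume1")
print(run_maintenance(vol, "chkdsk"))
# ['enter redirected mode', 'run chkdsk', 'leave redirected mode']
```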
Multisite Considerations With Cluster Shared Volumes
Server 2008 introduced widespread support for multisite clusters spanning multiple subnets. However, the cluster network that cluster shared volumes use must be a single-subnet network. This means that although all the other networks used by the cluster can span subnets, you must use some kind of stretched Virtual LAN (VLAN) for the cluster network.
For the actual storage, various hardware and software vendors support the replication of cluster shared volumes and ensure that the integrity of the cluster shared volumes is maintained. You only need to be sure that your VM replication solution supports cluster shared volumes.
The question often comes up of why DFS Replication (DFSR) can’t be used to replicate cluster shared volume content between sites. DFSR is a great solution to replicate files; however, it works by replicating the changes to a file when the file closes. With VMs, the files remain open all the time until the VM is stopped, so changes wouldn’t be replicated while the VM is running, which isn’t useful for most needs.
Easier Than You Might Think
Cluster shared volumes give Hyper-V environments the flexibility necessary for VM placement and the capability to enable storage optimization. Cluster shared volumes don’t change the underlying file system but instead open up a volume’s availability to all nodes in the cluster simultaneously. Therefore, the learning curve for using cluster shared volumes is fairly small because an environment’s processes and tools typically don’t need to change to allow these storage management changes.
Although we typically think of cluster shared volumes and Live Migration as working together for zero downtime VM migration, the two technologies are separate. Cluster shared volumes can be used even if Live Migration isn’t used. Using cluster shared volumes alone still lets organizations simplify storage management and optimize storage while gaining VM placement flexibility.