In “Troubleshooting Windows Server 2008 R2 Failover Clusters,” I discussed troubleshooting failover clusters—specifically, the locations and tips for where you can go to get the data you need in order to troubleshoot a problem. The Microsoft Program Management Team looked at quite a few of the top problems and worked to improve them in Windows Server 2012 Failover Clustering. So this month, I’ll talk about the new features and functionality of Server 2012 Failover Clustering. The new changes for failover clustering offer easier management, increased scalability, and more flexibility.
One of the first things to talk about is scalability limits. Server 2012 clustering raises the maximum to 64 nodes per cluster. If you're running a Hyper-V cluster with highly available virtual machines (VMs), the limit has increased to 4,000 VMs per cluster and 1,024 VMs per node. To help you manage at these larger scales, Server Manager can now discover cluster nodes and manage them remotely. When you've configured a cluster, Server Manager shows all nodes, the name of the cluster, and any VMs on the cluster, and it indicates where remote management is available, as Figure 1 shows. With this capability, you can enable additional roles and features remotely.
Figure 1: Configuring remote management
A New Level of AD Integration
When you create a cluster, you'll notice smarter detection of where the cluster creates its object in Active Directory (AD). When you allow the cluster to create the object, it detects the organizational unit (OU) where the nodes reside and creates the object in that same OU. It still uses the logged-on user account to create the Cluster Name Object (CNO), so this account needs Read and Create permissions on the OU. If you want to bypass this detection, or place the object in a different OU, you can specify the location during creation. For example, to place the cluster name in an OU called Cluster, I would input the data that Figure 2 shows during creation.
Figure 2: Placing the cluster name in an OU called Cluster
If you're doing it through PowerShell, you can pass the OU's distinguished name as the cluster name to the New-Cluster cmdlet.
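As a sketch (the node names, IP address, and contoso.com domain here are hypothetical), the command would look like this:

```powershell
# Create the cluster, placing the CNO in the Cluster OU.
# The OU's distinguished name is passed as the cluster name.
New-Cluster -Name "CN=CLUSTER1,OU=Cluster,DC=contoso,DC=com" `
    -Node NODE1,NODE2 -StaticAddress 192.168.1.50
```

You can omit -StaticAddress if the cluster nodes obtain their addresses through DHCP.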
Dynamic Quorum
The quorum configuration has been simplified, and a new dynamic quorum model is available; it's now the default when you create a cluster. You can also manually remove nodes from participating in the voting. When you go through the Configure Cluster Quorum Wizard, you're given three options:
- Use typical settings (recommended)—The cluster determines quorum management options and, if necessary, selects the quorum witness.
- Add or change the quorum witness—You can select the quorum witness; the cluster determines quorum management options.
- Advanced quorum configuration and witness selection—You determine the quorum management options and the quorum witness.
When you choose the typical settings, the wizard selects the dynamic quorum type. With a dynamic quorum, the number of votes changes depending on the number of participating nodes. It works like this: to keep a cluster up, you must have a quorum, or majority, of votes. Each node that participates in the cluster is one vote. If you've also chosen a witness disk or share, that's an additional vote. To keep the cluster going, more than half of the votes must remain up. The required majority is (total votes / 2), rounded down, plus 1. Say I have nine total nodes in a cluster without a witness disk: (9 / 2) rounds down to 4, plus 1 gives 5 total votes needed to keep the cluster up.
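The majority calculation above can be sketched in a few lines of PowerShell (a simple illustration of the arithmetic, not a cluster cmdlet):

```powershell
# Votes needed to maintain quorum: more than half of the total votes.
function Get-VotesNeeded ([int]$TotalVotes) {
    [math]::Floor($TotalVotes / 2) + 1
}

Get-VotesNeeded 9    # 5: a 9-node cluster needs 5 votes up
Get-VotesNeeded 10   # 6: an even total (e.g., 9 nodes plus a witness)
```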
So, for example, consider what occurs in a Server 2008 R2 cluster versus a Server 2012 cluster. In a Server 2008 R2 cluster with the same nine nodes, I have nine total votes, and five votes (nodes) must remain up to keep the cluster running. If only four nodes are up, the cluster service will terminate because there aren't enough remaining votes, and the administrator must take manual action to force the cluster to start and get back to production. In a new Server 2012 failover cluster, when a node goes down, the total number of votes, and therefore the number needed for quorum, dynamically goes down with it. With my nine nodes (votes), if one node goes down, the total vote count becomes eight. If another two nodes go down, the vote count becomes six. The Server 2012 cluster continues running and stays in production without intervention. Dynamic quorum is the default and recommended quorum configuration for Server 2012 clusters.
Figure 3: Changing the quorum configuration
To change the quorum witness configuration, right-click the name of the cluster in the far left pane, choose More Actions, and select Configure Cluster Quorum Settings, as Figure 3 shows. The wizard lets you set a disk witness, set a file share witness, or leave the configuration as dynamic. If you choose the advanced settings, one of the first settings you'll determine is which nodes actually have a vote in the cluster, as Figure 4 shows. By default, all nodes participate with a vote to achieve quorum. If you de-select a node, it won't have a vote. Using the earlier example of nine nodes, a Server 2008 R2 cluster would then have only eight voting members, an even number, so a witness disk or share should be added. In both Server 2008 R2 and Server 2012 clusters, this non-voting node has no vote; if it's the only node left, the cluster service will stop and manual intervention will be necessary.
Figure 4: Determining which cluster nodes have a vote
When going through the Configure Cluster Quorum Wizard, the next screen lets you select or de-select the dynamic quorum option, as Figure 5 shows. As you can see, the dynamic option is selected by default and is also recommended. If you want to change the quorum configuration to add a witness disk or share, the next screen in the wizard, Select Quorum Witness, is where you choose the witness disk or share to use.
Figure 5: The Configure Quorum Management page
There are some new cluster-validation enhancements. One of the big benefits is that the storage tests run significantly faster. The storage tests verify things such as which nodes can see the drives, fail the disks over individually and as groups to all nodes, check whether a drive can be taken away from one node by another, and so on. In Server 2008 R2 failover clusters, if you had a large number of disks, the storage tests took a long time to complete. With Server 2012 cluster validation, the tests have been streamlined in both their execution and their speed. A new option for the storage tests is the ability to target specific LUNs, as you see in Figure 6. If you want to test a single LUN or a specific set of LUNs, just select the ones you want. There are also new tests for Cluster Shared Volumes (CSV) as well as for Hyper-V and the VMs. These tests check whether your networking is configured with the recommended settings to ensure that network connectivity can be made between machines, that quick/live migrations are set up to work, that the same virtual switches are created on all nodes, and so on.
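From PowerShell, the Test-Cluster cmdlet can scope validation in a similar way (a sketch; the node names are hypothetical, and -Include/-Ignore take the test or category names shown in the wizard):

```powershell
# Run only the storage validation tests against two nodes.
Test-Cluster -Node NODE1,NODE2 -Include "Storage"

# Or run full validation but skip the storage tests,
# which is useful against a cluster already in production.
Test-Cluster -Node NODE1,NODE2 -Ignore "Storage"
```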
Figure 6: Reviewing your storage status
Cluster Virtual Machine Monitoring
When you’re running highly available VMs with the Hyper-V role in a cluster, you can take advantage of a new feature called Virtual Machine Monitoring. With this new monitoring, you can actually have Failover Clustering monitor specific services within the VM and react if there is a problem with a service. For example, if you’re running a VM that provides print services, you can monitor the Print Spooler service. To set this up, you can:
- Right-click the VM in Failover Clustering.
- Choose More Actions.
- Choose Configure Monitoring.
- Choose the service or services that you would like to monitor.
If you want to set this up with PowerShell, you can use the Add-ClusterVMMonitoredItem cmdlet.
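For example, to monitor the Print Spooler service in a highly available VM (the VM name here is hypothetical), you can run the following from a cluster node:

```powershell
# Have the cluster monitor the Print Spooler service inside the VM.
Add-ClusterVMMonitoredItem -VirtualMachine "PRINTVM1" -Service Spooler

# Verify which services are being monitored.
Get-ClusterVMMonitoredItem -VirtualMachine "PRINTVM1"
```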
Failover Clustering will then monitor the VM and service through periodic health checks. If it determines that the monitored service is unhealthy, it considers the VM to be in a critical state and first logs an event like the following on the host:
Description: Cluster Resource “Virtual Machine Name” in clustered role “Virtual Machine Name” has received a critical state notification. For a virtual machine this indicates that an application or service inside the virtual machine is in an unhealthy state. Verify the functionality of the service or application being monitored within the virtual machine.
It will then restart the VM (a forced but graceful shutdown) on the host where it's currently running. If the service fails again, the cluster moves the VM to another node and starts it there. Virtual Machine Monitoring gives you finer granularity over the kind of monitoring you want for your VMs, and it brings the added benefit of additional health-checking as well as availability. Without Virtual Machine Monitoring, a service that has a problem would simply stay in that state until someone intervened to get it back up.
Cluster Aware Updating
Cluster Aware Updating (CAU) is new to Server 2012 Failover Clustering. This feature automates software updating (security patches) while maintaining availability. With CAU, you have the following actions available:
- Apply updates to this cluster
- Preview updates to this cluster
- Create or modify the Updating Run Profile
- Generate a report on past Updating Runs
- Configure cluster self-updating options
- Analyze cluster updating readiness
CAU will work in conjunction with your existing Windows Update Agent (WUA) and Windows Server Update Services (WSUS) infrastructures to apply important Microsoft updates. When CAU begins to update, it will go through the following steps:
- Put each node of the cluster into node maintenance mode.
- Move the clustered roles off the node. In the case of highly available VMs, it will perform a live migration of the VMs.
- Install updates and any dependent updates.
- Perform a reboot of the node, if necessary.
- Bring the node out of maintenance mode.
- Restore the clustered roles on the node.
- Move to update the next node.
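You can trigger the updating run above on demand with the ClusterAwareUpdating PowerShell module (a sketch; the cluster name is hypothetical):

```powershell
Import-Module ClusterAwareUpdating

# Preview which updates each node would receive, without applying them.
Invoke-CauScan -ClusterName CLUSTER1

# Perform a full updating run, tolerating at most one failed node.
Invoke-CauRun -ClusterName CLUSTER1 -MaxFailedNodes 1 -Force
```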
You can start CAU from Server Manager, Failover Cluster Manager, or a remote machine. The recommendations and considerations for setting this up are as follows:
- Don’t configure the nodes for automatic updating either from Windows Update or a WSUS server.
- All cluster nodes should be uniformly configured to use the same update source (WSUS server, Windows Update, or Microsoft Update).
- If you update using Microsoft System Center Configuration Manager 2007 and Microsoft System Center Virtual Machine Manager 2008, exclude cluster nodes from all required or automatic updates.
- If you use internal software distribution servers (e.g., WSUS servers) to contain and deploy the updates, ensure that those servers correctly identify the approved updates for the cluster nodes.
- Review any preferred owner settings for clustered roles. Configure these settings so that when the software-update process is complete, the clustered roles will be distributed across the cluster nodes.
More Flexible Share Connectivity
In previous versions, the only way you could connect to a share was through the Client Access Point (the network name in the group), because shares were scoped to that name only. This behavior is explained in more detail in the blog post "File Share 'Scoping' in Windows Server 2008 Failover Clusters." Having only one way to connect limited administrators. In some cases, it made server consolidations more difficult and time consuming because additional steps had to be taken into account, which in turn led to longer downtimes during the consolidations. Because of this, Server 2012 Failover Clustering now lets you connect to shares via the virtual network name, the virtual IP address, or a CNAME created in DNS. One caveat: when you use a CNAME, additional configuration is needed on the network name. For example, suppose you had a file share with the clustered network name TXFILESERVER and you wanted to set up a CNAME of TEXAS in DNS to connect. Through PowerShell, you would add TEXAS as an alias on the network name resource.
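A sketch of that configuration, assuming the network name resource is named TXFILESERVER:

```powershell
# Allow clients to connect to the file server via the DNS alias TEXAS.
Get-ClusterResource "TXFILESERVER" |
    Set-ClusterParameter -Name Aliases -Value "TEXAS"
```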
When this is done, you must take the name offline and bring it back online before the alias takes effect and answers connections.
You need to consider the repercussions of connecting via the IP address or alias. When connecting via these methods, Kerberos won’t be the authentication method used because it will drop to using NTLM security. So, although connecting via these alternative methods does bring flexibility, the security trade-off for NTLM authentication must be taken into consideration.
Cluster Shared Volume Enhancements
CSV has been updated with the following capabilities. These features provide easier setup, broader workload support, enhanced security and performance across a wider variety of deployments, and greater availability.
- Storage capabilities for Scale-Out File Servers (more on this later), not just highly available VMs
- A new CSV Proxy File System (CSVFS) to provide a single, consistent filename space
- Support for BitLocker drive encryption
- Direct I/O for file data access, enhancing VM creation and copy performance
- Removal of external authentication dependencies when a domain controller (DC) might not be available
- Integration with the new Server Message Block (SMB) 3.0 to provide for file servers, Hyper-V VMs, and applications such as SQL Server
- Use of SMB Multichannel and SMB Direct to allow CSV traffic to stream across multiple networks, and use of network adapters that support Remote Direct Memory Access (RDMA)
- Ability to scan and correct volumes with zero offline time as NTFS identifies, logs, and repairs issues without affecting the availability of CSV drives
Scale-Out File Servers
Scale-Out File Servers can host continuously available and scalable storage by using the SMB 3.0 protocol, and they utilize CSV for the storage. Benefits of Scale-Out File Servers include the following:
- Provides for active-active file shares in which all nodes accept and serve SMB client requests. This functionality provides for transparent failover to other cluster nodes during planned maintenance and unplanned failures.
- Increases the total bandwidth of all file server nodes. There is no longer the bandwidth concern of all network client connections going to a single node; instead, a Scale-Out File Server gives you the ability to transparently move a client connection to another node to continue servicing that client without any network disruption. The limit to a Scale-Out File Server at this time is eight nodes.
- CSV takes the improved Chkdsk times a step further by eliminating the offline phase. With CSVFS, you can run Chkdsk without affecting applications.
- Another new CSV feature is CSV Cache, which can improve performance in some scenarios such as Virtual Desktop Infrastructure (VDI). CSV Cache allows you to allocate system memory (RAM) as a write-through cache. This provides caching of read-only unbuffered I/O that isn’t cached by the Windows Cache Manager. CSV Cache boosts the performance of read requests with write-through for no caching of write requests.
- Eases management tasks. It's no longer necessary to create multiple clustered file servers, each requiring its own disks and placement strategies.
Scale-Out File Servers are ideal for SQL Server and Hyper-V configurations. The design behind Scale-Out File Servers targets applications that keep files open for long periods of time and perform mostly data operations. A SQL Server database or a Hyper-V VM .vhd file performs a lot of data operations (changes to the file itself) but not a lot of metadata updates. A Scale-Out File Server shouldn't be used for user data shares, where the workload has a high number of NTFS metadata updates. With NTFS, metadata updates are operations such as opening and closing files, creating new files, renaming existing files, deleting files, and so on, which change the file system of the drive.
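As a sketch of standing one up (the names, path, and group are hypothetical, and a CSV is assumed to already exist on the cluster):

```powershell
# Create the Scale-Out File Server role on the cluster.
Add-ClusterScaleOutFileServerRole -Name SOFS1

# Create a continuously available share on a CSV for Hyper-V or
# SQL Server data. The folder must already exist on the volume.
New-SmbShare -Name "VMStore" -Path "C:\ClusterStorage\Volume1\VMStore" `
    -ContinuouslyAvailable $true -FullAccess "CONTOSO\Hyper-V-Hosts"
```

Continuous availability (the SMB Transparent Failover behavior described above) is what lets an open file handle move to another node without disrupting the client.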
There are further enhancements, but I just don't have the space to list them all. Our Microsoft Program Management Team has worked hard, listened to users about the features they've wanted from failover clusters, and delivered. We've also taken some of the top issues seen in previous versions and turned them into strengths in the new version.