4 more steps to high availability
In "8 Ways to Improve Your Exchange Cluster, Part 1," April 2004, InstantDoc ID 41630, I discuss how getting cluster-specific training, planning ahead, building in extra redundancy, and deploying a solid infrastructure are vital to a successful Exchange Server 2003 or Exchange 2000 Server cluster deployment. Now let's take a look at four more important factors: using the correct configurations, implementing the right security measures, minimizing the downtime and impact of failovers, and efficiently deploying Exchange service packs to your cluster.
As I explain in Step 4 in Part 1, the stability of the Windows infrastructure underlying your cluster is key to the cluster's success. Properly configuring that infrastructure can also improve your cluster's performance. Important configuration steps include setting staggered boot delays for the cluster nodes, obtaining the applicable OS resource kit, and tuning memory.
Set staggered boot delays. When power returns after a power failure, each node in your cluster will attempt to access shared storage at the same time. To avoid this conflict, set your preferred passive node's boot delay to be longer than the active node's delay.
To access the delay setting on Windows 2000 servers, right-click My Computer and select Properties from the context menu. Click Advanced, then click Startup and Recovery. On Windows Server 2003 nodes, open the My Computer Properties dialog box, click Advanced, then click Settings under Startup and Recovery. In the Startup and Recovery dialog box, select the Display a list of Operating Systems for __ seconds check box and enter the desired delay in the scroll box. Set the active node to 5 seconds and the passive node to 20 seconds. Alternatively, you can manually edit the boot.ini file on each node to implement a specific delay.
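As a rough illustration of the manual boot.ini edit, the Python sketch below rewrites the timeout value in the [boot loader] section of a boot.ini file's text. The sample file contents and the function name set_boot_timeout are assumptions made for illustration. The 5-second and 20-second values are the ones suggested above.

```python
import re

# Hypothetical boot.ini contents for illustration; a real file will differ.
SAMPLE_BOOT_INI = (
    "[boot loader]\n"
    "timeout=30\n"
    "default=multi(0)disk(0)rdisk(0)partition(1)\\WINNT\n"
    "[operating systems]\n"
    "multi(0)disk(0)rdisk(0)partition(1)\\WINNT="
    "\"Windows 2000 Advanced Server\" /fastdetect\n"
)

# Matches the "timeout=<n>" line; only spaces and tabs are allowed around "=",
# so the match never runs into a neighboring line.
_TIMEOUT_LINE = re.compile(
    r"^([ \t]*timeout[ \t]*=[ \t]*)\d+", re.IGNORECASE | re.MULTILINE
)


def set_boot_timeout(boot_ini_text, seconds):
    """Return the boot.ini text with the boot menu timeout set to `seconds`."""
    if seconds < 0:
        raise ValueError("timeout must not be negative")
    new_text, count = _TIMEOUT_LINE.subn(
        lambda m: m.group(1) + str(seconds), boot_ini_text, count=1
    )
    if count == 0:
        raise ValueError("no timeout= line found")
    return new_text


# Active node waits 5 seconds, passive node waits 20 seconds.
active_ini = set_boot_timeout(SAMPLE_BOOT_INI, 5)
passive_ini = set_boot_timeout(SAMPLE_BOOT_INI, 20)
```

The sketch works on text only. On a real node, boot.ini normally carries the read-only, hidden, and system attributes, so you'd need to clear the read-only attribute before saving a changed file, and keep a copy of the original.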
Obtain the OS resource kit. The Microsoft Windows 2000 Server Resource Kit contains valuable tools for cluster administrators. The resource kit provides approximately 300 utilities that aid management of Active Directory (AD) and Win2K servers, and several of these utilities are specific to clusters. Among the most important are dumpcfg.exe, which manages and records disk signature information; the Cluster Tool (clustool.exe), which backs up and restores cluster configurations; and clusrest.exe, which restores the quorum database. The Microsoft Windows Server 2003 Resource Kit tools include new and improved cluster utilities such as the Cluster Server Recovery Utility (clusterrecovery.exe), which you can use when restoring resource checkpoint files, replacing a failed disk, recovering from disk signature changes, or migrating cluster data to a different disk in the cluster; and the Cluster Diagnostics and Verification Tool (clusdiag.exe), which provides diagnostic tests to verify a cluster's functionality and which assists in reading the cluster log files.
Copy the tools to a standard folder on each cluster node as part of your cluster installation. Having the tools readily available can reduce the amount of time you need to diagnose a clustering problem if one arises. See http://www.microsoft.com/windows2000/techinfo/reskit/default.asp for more information about obtaining resource kits or resource kit tools.
Tune memory. Exchange 2000 servers that run on Win2K Advanced Server or Win2K Datacenter Server and that have more than 1GB of RAM require you to add the /3GB switch to the startup line, as the Microsoft article "XGEN: Exchange 2000 Requires /3GB Switch with More Than 1 Gigabyte of Physical RAM" (http://support.microsoft.com/?kbid=266096) explains. However, using the /3GB switch reduces the number of available Free System page table entries (PTEs), a situation that can cause performance problems—most noticeably the server's loss of network connectivity or blue screens. Microsoft recommends that you monitor the Free System PTE counter under the Performance Monitor's Memory object. If the value drops below 10,000, modify the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management registry subkey's SystemPages entry, as the Microsoft article "XADM: An Exchange 2000 Server with the '/3GB' Switch in the Boot.ini File May Lose Network Connectivity Under a Heavy Messaging Load" (http://support.microsoft.com/?kbid=313707) describes.
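To make the threshold check concrete, here is a minimal Python sketch that flags readings of the Free System PTE counter that fall below 10,000. The readings and the function name low_pte_samples are hypothetical. In practice the values would come from a Performance Monitor log rather than a hard-coded list.

```python
FREE_PTE_THRESHOLD = 10_000  # threshold quoted in the article


def low_pte_samples(samples, threshold=FREE_PTE_THRESHOLD):
    """Return the (label, value) samples whose value is below the threshold."""
    return [(label, value) for label, value in samples if value < threshold]


# Hypothetical readings taken through a working day.
readings = [("09:00", 24500), ("11:00", 12800), ("13:00", 9400), ("15:00", 7100)]
low = low_pte_samples(readings)
```

Any entries in low are a cue to review the SystemPages entry as the Microsoft article describes.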
Windows 2003, Standard Edition and Windows 2003, Enterprise Edition both support the use of the /3GB switch. However, both editions also support a new switch, /userva, which allows a custom environment size for the application virtual address space and lets you allocate PTEs from boot.ini (rather than from the registry). For Exchange 2003 servers that have more than 1GB of RAM, use /userva=3030 in conjunction with the /3GB switch. For more Exchange 2003 memory-tuning procedures, see the Microsoft article "How to Optimize Memory Usage in Exchange Server 2003" (http://support.microsoft.com/?kbid=815372). Both Windows 2003 and Win2K require a reboot after you make these memory changes.
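The following Python sketch shows one way to append the /3GB and /USERVA=3030 switches to an operating system entry from boot.ini without duplicating a switch that's already there. The sample entry and the function name add_memory_switches are assumptions for illustration; check the resulting line against the Microsoft article before using it on a production node.

```python
def add_memory_switches(os_entry, switches=("/3GB", "/USERVA=3030")):
    """Append any missing switches to one [operating systems] entry."""
    # Compare by switch name only (the part before "="), ignoring case, so an
    # existing /USERVA value is left alone rather than added a second time.
    present = {token.split("=")[0].upper() for token in os_entry.split()}
    missing = [s for s in switches if s.split("=")[0].upper() not in present]
    if not missing:
        return os_entry
    return " ".join([os_entry.rstrip()] + missing)


# Hypothetical entry for illustration.
entry = (
    "multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="
    "\"Windows Server 2003, Enterprise\" /fastdetect"
)
tuned = add_memory_switches(entry)
```

Running the function a second time on tuned returns it unchanged, which makes the edit safe to repeat.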
Another way to reduce virtual memory usage is to minimize the number of storage groups (SGs). Additional virtual memory is used when an SG is mounted, but additional databases within an existing SG have little effect on the amount of virtual memory used.
You need to lock down your Exchange cluster to prevent the spread of W32.Blaster.Worm, the Nimda virus, and other network-based attacks. Viruses can infect systems through file shares, Web browsers, OS vulnerabilities, or email, and your antivirus strategy should address each of these areas. But be careful: When deploying file-based antivirus scanning, be sure to exclude the Exchange database files and transient files (i.e., the Message Transfer Agent—MTA—and mailroot folders). A file-based virus scanner that attempts to disinfect or quarantine an Exchange database or transaction log can prevent Exchange from accessing the database or log, thus causing data corruption. Ideally, the file-based antivirus scanning product you use will let you define these exclusions during installation to ensure that the transient files aren't accidentally included. (For Microsoft's recommendations about which antivirus measures to take for Exchange, see the Microsoft article "XADM: Exchange and Antivirus Software" at http://support.microsoft.com/?kbid=328841.)
Recent security rollup patches from Microsoft include protection from W32.Blaster.Worm and fix other OS vulnerabilities exploited by virus writers. Use the Microsoft Baseline Security Analyzer (MBSA) to audit cluster nodes for vulnerabilities and to get a list of recommended security updates. (See http://www.microsoft.com/technet/security/tools/mbsahome.asp for more information about MBSA.) Alternatively, use the Windows Update service to download the most recent security patches.
If your cluster runs Exchange 2000, set access on the built-in Message Tracking Log share and Address share to Read only. By default, the Everyone security group has full-write access to these shares. The Microsoft article "XADM: The Nimda Virus May Infect the Files in Log Folders on New Exchange 2000 Virtual Servers in a Cluster" (http://support.microsoft.com/?kbid=312465) describes how to change the permissions. As an added precaution, don't create any shared folders on your Exchange cluster. (By default, Exchange 2003 sets built-in Exchange shares to read-only.)
Implement an antivirus solution to protect your cluster from email viruses. Third-party antivirus products for Exchange (such as those listed at http://www.microsoft.com/exchange/partners/antivirus.asp) can scan mailboxes in real time for viruses such as Sobig.F and ILOVEYOU. Be sure to schedule regular virus-pattern updates from the product's vendor. Microsoft introduced the Virus Scanning API (VS API) so that antivirus vendors could develop software that can scan Exchange components such as databases, SMTP queues, and the MTA. Choose an antivirus solution that's VS API-compliant and that runs on clusters.
Failover occurs when an Exchange cluster group is moved from one node to another. Microsoft Outlook clients can't access Exchange during a failover, so minimizing failover times is necessary to provide high availability and meet service level agreements (SLAs). To reduce the impact of failovers, you can deploy Microsoft Office Outlook 2003 clients running in cached mode, which lets users work from a local cache when no network connection is available. Outlook 2003 cached mode handles the loss of network connectivity much more efficiently than earlier versions of Outlook, which must be restarted to handle changes in network connectivity. Outlook 2003 cached mode can detect whether the Exchange server is reachable and seamlessly reconnect and synchronize without any action from the user. (For more information about the improvements that make Outlook 2003 a better network client, see "Outlook 11 and Exchange Server," April 2003, InstantDoc ID 38271.)
Exchange 2003 can achieve better failover times than Exchange 2000 because Microsoft has enhanced the resource model by flattening the Exchange dependency tree. Web Figure 1 (http://www.winnetmag.com/microsoftexchangeoutlook, InstantDoc ID 41943) shows the cluster resource model for Exchange 2000; protocol resources such as HTTP can come online only after the Exchange Store resource has started. Web Figure 2 shows the resource model for Exchange 2003; protocol resources depend only on the System Attendant resource. This change leads to faster failover times because cluster resources can start in parallel. Microsoft implemented a 3-minute timeout in Exchange 2003, after which, if the failover hasn't happened, the Store process is terminated to expedite the failover. This timeout should result in faster failovers compared with Exchange 2000.
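To see why a flatter dependency tree helps, consider the Python sketch below, which groups resources into "waves" that can start in parallel. The two dependency maps are deliberately simplified assumptions based only on the relationships described above (plus the assumption that the Store depends on the System Attendant); they are not the complete resource models shown in the Web figures.

```python
# Simplified, assumed dependency maps: resource -> resources it waits for.
EXCHANGE_2000 = {
    "System Attendant": [],
    "Store": ["System Attendant"],
    "HTTP": ["Store"],
    "IMAP": ["Store"],
    "POP": ["Store"],
}
EXCHANGE_2003 = {
    "System Attendant": [],
    "Store": ["System Attendant"],
    "HTTP": ["System Attendant"],
    "IMAP": ["System Attendant"],
    "POP": ["System Attendant"],
}


def startup_waves(deps):
    """Group resources into waves; all resources in one wave can start together."""
    level = {}

    def depth(name):
        # A resource starts one wave after the latest resource it depends on.
        if name not in level:
            level[name] = 1 + max((depth(d) for d in deps[name]), default=0)
        return level[name]

    waves = {}
    for name in deps:
        waves.setdefault(depth(name), []).append(name)
    return [sorted(waves[k]) for k in sorted(waves)]


waves_2000 = startup_waves(EXCHANGE_2000)
waves_2003 = startup_waves(EXCHANGE_2003)
```

With these assumed maps, the Exchange 2000 model needs three waves (System Attendant, then the Store, then the protocols), whereas the Exchange 2003 model needs two because the Store and the protocols start together.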
Two types of failover exist: planned and unplanned. Let's take a look at each type and how you can best handle each, as well as how monitoring unplanned failovers can help improve your deployment.
Planned failovers. Planned failovers usually take place as part of scheduled system maintenance tasks, such as an Exchange service pack rolling upgrade (which I explain in more detail later). To use that task as an example, on a two-node cluster, the process involves installing the service pack upgrade on the passive node (node 2) and rebooting if required. After you complete maintenance on node 2, you use Cluster Administrator to move the Exchange Virtual Servers (EVS) from the active node (node 1) to the passive node and perform the upgrade on node 1. An additional failover operation to fail the EVS back over to node 1 (called failback) will be necessary if node 1 is the preferred node in the cluster. During the planned failover process, the Exchange Resource DLL (exres.dll) takes the Exchange components in the Exchange cluster group offline; dismounts the SGs; stops protocols such as IMAP, POP, and HTTP; and takes the EVS Network Name and IP Address resources offline. Exres.dll then brings the EVS Network Name and IP Address resources, followed by the Exchange resources, online on the other node. The Cluster Service also updates the quorum database.
You can reduce planned failover time by performing the failover outside working hours. During hours of operation, a heavily loaded Exchange server can host as many as 3000 Messaging API (MAPI) connections, each of which must be terminated during the failover. The number of connections decreases outside working hours. Outlook 2003 cached mode, which makes better use of the network, might also help reduce failover times.
After a Windows service pack upgrade, the Store process rebuilds indexes, which can add several minutes to your failover time. The delay depends on the size of your Exchange databases. Event ID 611 in the Application event log indicates that an index rebuild is taking place.
Unplanned failovers. Unplanned failovers occur when the node hosting the EVS crashes or loses power. When the active node (node 1) goes down, the Cluster Service detects that the heartbeat connection—and therefore the active node—is no longer available and brings the EVS IP Address and Network Name resources online on the passive node (node 2). The disks belonging to the EVS are brought online on node 2. Exres.dll brings the Exchange cluster resources online. The Store mounts the SGs and performs recovery tasks. The quorum database is updated. If you've designated node 1 as the preferred owner for the cluster, failback will occur to return the EVS to node 1 when it comes back online.
You can't do much to reduce unplanned failover time. However, if you've designated a preferred node in your cluster and your cluster fails over, you can at least schedule failback to occur outside working hours. To set failback times, right-click the Exchange Group in Cluster Administrator, select Properties, and choose the Failback tab. Select the Allow failback option and specify the permitted failback window (according to the 24-hour clock) in the drop-down boxes in the Failback between ___ and ___ hours option. Choose a time that doesn't conflict with other events such as backups and online database defragmentations. The failback process, which dismounts and remounts the Exchange databases, would disrupt these tasks.
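When you pick a failback window, it can help to check it against your other scheduled jobs. The Python sketch below lists the hours a window covers, including windows that wrap past midnight, and reports any overlap with another window. The function names and the example schedule are hypothetical, and the sketch assumes the end hour is exclusive, which might not match exactly how the Cluster Service interprets the setting.

```python
def window_hours(start, end):
    """Hours (0-23) covered by a window from start up to, not including, end."""
    for value in (start, end):
        if not 0 <= value <= 23:
            raise ValueError("hours must be between 0 and 23")
    hours = []
    hour = start
    while hour != end:
        hours.append(hour)
        hour = (hour + 1) % 24  # wrap past midnight
    return hours


def overlapping_hours(window_a, window_b):
    """Return the sorted hours that two (start, end) windows share."""
    return sorted(set(window_hours(*window_a)) & set(window_hours(*window_b)))


# Hypothetical schedule: failback 22:00-02:00, backups 01:00-04:00.
conflict = overlapping_hours((22, 2), (1, 4))
```

A nonempty result (here, hour 1) suggests moving either the failback window or the conflicting job.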
If your cluster serves as a back end in your messaging organization, you can reduce the time needed to perform unplanned failovers. The Microsoft article "How to Configure IPSec on an Exchange Server 2003 Back-End Server That Is Running on a Windows Server 2003 Server Cluster" (http://support.microsoft.com/?kbid=821839) describes procedures for improving unplanned failover times for back-end virtual servers running on Windows 2003 clusters with Exchange 2003 and using IP Security (IPSec) to secure traffic between front-end and back-end servers.
Monitoring. One of the best ways to reduce both the frequency of unplanned failovers and the time they take is to monitor and analyze performance data on an ongoing basis. As "Monitoring Exchange 2000," October 2002, InstantDoc ID 26183, explains, Windows and Exchange both include built-in monitoring tools. (For large production deployments, a management framework such as Microsoft Operations Manager, or MOM, can simplify monitoring.) Before an outage occurs, Windows and Exchange often log events that indicate a hardware or application problem. For example, the virtual memory fragmentation issue that "Monitoring Virtual Memory," November 2003, InstantDoc ID 40458, describes is logged to the Application log as event ID 9582. By actively monitoring event logs, disk performance, and memory usage, you can prevent many outages, or at least delay them until later in the day when they'll affect fewer users.
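As a small illustration of event log monitoring, the Python sketch below picks out Application log records with the two event IDs this article mentions: 9582 (virtual memory fragmentation) and 611 (index rebuild). The record layout, the sample records, and the function name find_watched_events are assumptions; a real deployment would read the records from the event log or from a monitoring tool's export.

```python
# Event IDs mentioned in this article, with a short description of each.
WATCHED_EVENT_IDS = {
    9582: "virtual memory fragmentation",
    611: "index rebuild in progress",
}


def find_watched_events(records, watched=WATCHED_EVENT_IDS):
    """Return (time, event_id, description) for watched Application log events."""
    hits = []
    for record in records:
        if record.get("log") != "Application":
            continue
        event_id = record.get("event_id")
        if event_id in watched:
            hits.append((record.get("time"), event_id, watched[event_id]))
    return hits


# Hypothetical exported records.
sample_records = [
    {"log": "Application", "event_id": 1000, "time": "08:15"},
    {"log": "Application", "event_id": 9582, "time": "10:40"},
    {"log": "System", "event_id": 611, "time": "11:05"},
    {"log": "Application", "event_id": 611, "time": "23:10"},
]
alerts = find_watched_events(sample_records)
```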
8. Tips for Exchange Service Packs
As with single-server Exchange implementations, knowing the proper procedures for dealing with the installation and configuration of service packs is an important part of keeping your cluster running smoothly. You can get the best performance by keeping your Exchange service packs up-to-date, performing full backups of the Exchange database before and after service pack installation, verifying permissions, and testing upgrades in a clustered test environment before rolling them out.
Upgrade to the most recent service pack. As I write this article, no service packs are available yet for Exchange 2003. For clusters running Exchange 2000, Service Pack 3 (SP3) is available for download (http://www.microsoft.com/exchange/downloads/2000/sp3/default.asp) and incorporates more than 400 bug fixes. Virtual memory fragmentation is a major problem for large Exchange installations, especially for those that host many MAPI clients. SP3 includes an updated version of the Store (store.exe) to help address this problem. SP3 also includes several fixes that specifically address clustering concerns, as the Microsoft articles "XCON: Cluster Failover Process Is Delayed Because of Message Transfer Agent Remote Procedure Call Timeouts" (http://support.microsoft.com/?kbid=316354), "XCON: Messages Back Up in Queue When the Virtual Server Is Set to Forward All Messages with Unresolved Recipients" (http://support.microsoft.com/?kbid=316204), "XADM: Cluster Service Terminate Function Does Not Kill the Information Store Unless It Times Out" (http://support.microsoft.com/?kbid=322126), and "XADM: The Information Store Stops on a Cluster Because of the IsAlive Check" (http://support.microsoft.com/?kbid=315771) describe.
Perform full backups. Take full backups of your Exchange server before and immediately after installing an Exchange service pack. Set aside tapes that are outside your typical backup cycle. Exchange service packs usually include an updated version of store.exe. When you remount the Exchange databases after service pack installation, the service pack upgrades the databases to work with the new Store binary. The differences in Store versions mean that you can't roll back to a backup performed on an earlier service pack. For example, you can't restore to an SP3 server a backup that you took when the server ran SP2. (The Microsoft article "XADM: Exchange 2000 SP2 Does Not Allow You to Restore Exchange 2000 or Exchange 2000 SP1" at http://support.microsoft.com/?kbid=316794 contains some background information about Store version mismatch scenarios.) Be sure to update your disaster-recovery servers and procedures to reflect changes in Exchange service packs.
Verify permissions. As I explain in Part 1, you must have Exchange Full Administrator permissions to upgrade an Exchange 2000 cluster. Applying service packs, however, can cause permissions to be reset to their default values. Before and after you apply a service pack, verify permissions on the administrative group in which your EVS resides. (By default, Exchange System Manager—ESM—doesn't show security settings. To enable the Security tab, add the REG_DWORD entry ShowSecurityPage to the HKEY_CURRENT_USER\Software\Microsoft\Exchange\ExAdmin registry subkey and set the entry's value to 1.)
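If you need to enable the Security tab on several administrative workstations, one option is to distribute a .reg file instead of editing the registry by hand. The Python sketch below generates the text of such a file for the ShowSecurityPage entry. The function name is hypothetical, and the sketch assumes the version 5.00 .reg format that Regedit uses on Win2K and later.

```python
def show_security_page_reg(enabled=True):
    """Build .reg file text that sets ShowSecurityPage to 1 (or 0)."""
    value = 1 if enabled else 0
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        "[HKEY_CURRENT_USER\\Software\\Microsoft\\Exchange\\ExAdmin]\r\n"
        # REG_DWORD values are written as eight hexadecimal digits.
        "\"ShowSecurityPage\"=dword:%08x\r\n" % value
    )


reg_text = show_security_page_reg()
```

Because the entry lives under HKEY_CURRENT_USER, each administrator must import the file under his or her own account.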
Test upgrades. Create a parallel test environment so that you can try out service packs, hotfixes, and third-party products before deploying them in production. (Many third-party products aren't supported on clusters and might require some customization to work properly.) Exactly replicating your production configuration might not be cost-effective, but consider implementing a test cluster that uses virtual server technology, such as VMware, to test third-party products. Cluster-aware third-party applications create cluster resources in Cluster Administrator; you can move these resources between nodes as failover operations are performed. For third-party products that don't create cluster resources, however, be sure that you can automatically shut down and start the products on cluster nodes during failover and failback operations.
One benefit of using clusters with Exchange is the ability to perform rolling service pack upgrades to minimize downtime for end users. A rolling upgrade entails keeping the EVS on one node while you perform the installation on the passive node, then failing the EVS over to the upgraded node and upgrading the other node. For example, suppose you have a two-node cluster in which node 1 is the active node with one EVS (EVS1) and node 2 is the passive node with no active cluster resources. First, back up node 2, apply the Exchange service pack to node 2, then move EVS1 to node 2. Check the Application log for errors and verify that Exchange starts correctly on node 2. Assuming that everything is working as it should, back up node 1 and apply the service pack to that node. Move EVS1 back to node 1, check the Application log for errors, and verify that Exchange starts correctly on node 1. Finally, take a full backup of the cluster, including the system state on each node and the Exchange databases.
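The two-node procedure above can be written down as an ordered checklist. The Python sketch below builds that checklist and walks through it, stopping at the first step that fails so that the second node isn't touched when the first upgrade goes wrong. The function names and the perform callback are hypothetical; the sketch doesn't call any cluster or Exchange tools.

```python
def rolling_upgrade_steps(active="node 1", passive="node 2", evs="EVS1"):
    """Return the rolling-upgrade steps for a two-node cluster, in order."""
    return [
        "Back up %s" % passive,
        "Apply the Exchange service pack to %s" % passive,
        "Move %s to %s" % (evs, passive),
        "Check the Application log and verify Exchange on %s" % passive,
        "Back up %s" % active,
        "Apply the Exchange service pack to %s" % active,
        "Move %s back to %s" % (evs, active),
        "Check the Application log and verify Exchange on %s" % active,
        "Take a full backup of the cluster",
    ]


def run_steps(steps, perform):
    """Call perform(step) in order; stop at the first step that returns False.

    Returns (completed_steps, failed_step); failed_step is None on success.
    """
    completed = []
    for step in steps:
        if not perform(step):
            return completed, step
        completed.append(step)
    return completed, None


steps = rolling_upgrade_steps()
# Dry run: pretend every step succeeds.
completed, failed = run_steps(steps, lambda step: True)
```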
The guidelines in this two-part article series can help you achieve success when deploying Exchange 2003 or Exchange 2000 clusters. One last bit of advice for improving performance on Exchange 2000 clusters: Take a look at the improvements in Windows 2003 and Exchange 2003 clusters. (For more information, see "Exchange Server 2003 Clusters," November 2003, InstantDoc ID 40457.)