Flex your hardware muscle

Microsoft Exchange Server 5.5's performance characteristics are well known. In 1996, Exchange Server 4.0 laid down the basic principles of achieving optimum performance through file distribution, and not much has changed since. True, Microsoft expanded the capacity of the Information Store (IS) to a theoretical limit of 16TB, but the messaging server's essential characteristics remain. The hot spots—the files that generate the heaviest I/O load—are the IS and Directory Store databases, their transaction logs, the Windows NT swap file, and the Message Transfer Agent (MTA) work directory.

Exchange 2000 Server is a different beast. The new messaging server boasts the following improvements:

  • The IS architecture has evolved from the simple partitioning of the private and public databases to a point at which, theoretically, the architecture lets you run as many as 90 databases on one server.
  • Microsoft Internet Information Server (IIS) handles all protocol access for SMTP, Internet Message Access Protocol 4 (IMAP4), HTTP, Network News Transfer Protocol (NNTP), and POP3, so IIS is more important to Exchange 2000 than it was to earlier versions of Exchange Server.
  • A new streaming database can hold native Internet content.
  • Windows 2000's (Win2K's) Active Directory (AD) replaces the Directory Store.
  • A new SMTP-based Routing and Queuing engine replaces the older X.400-based MTA.

These improvements come in a customizable package that third-party solutions will likely extend to provide Exchange 2000 with antivirus, fax, workflow, document management, and other capabilities that aren't part of the base server. Exchange 2000 introduces important architectural changes that have a profound effect on performance. The question that system designers now face is how to best optimize these new features in terms of system and hardware configurations. To answer that question, let's start by investigating Exchange 2000's partitioning of the IS.

Partitioning the IS
Exchange Server 5.5 uses one storage group composed of the private and public stores. Exchange 2000 extends this concept to multiple storage groups. A storage group is an instance of the Extensible Storage Engine (ESE) database engine, which runs in the store.exe process and manages a set of databases. Exchange Server 5.5 uses a variant called ESE 97; Exchange 2000 uses the updated ESE 98.

Each Exchange 2000 storage group has a separate set of transaction log files that as many as six message databases share. A message database consists of two files—the .edb file (i.e., the property database) and the .stm file (i.e., the streaming database). The .edb file holds message properties (e.g., author, recipients, subject, priority), which Exchange Server typically indexes for use in search operations. The .edb file also stores message and attachment content that Messaging API (MAPI) clients such as Microsoft Outlook 2000 generate. The .stm file holds native Internet content (e.g., MIME). The ESE manages the seamless join between the .edb and .stm files. The new IS architecture permits as many as 16 storage groups on a server. Exchange 2000 devotes 15 of these storage groups to regular operation and 1 to restoring or recovering databases. Each active group consumes system resources such as virtual memory. Microsoft is working to identify the maximum number of storage groups and databases that can be active on a 32-bit platform. That number will likely be well under the maximum that the architecture allows—possibly between four and six storage groups. As Windows and Exchange Server move toward a 64-bit platform, memory management will become less important and servers will be able to use as many storage groups as the architecture allows.

In response to criticism about the original 16GB database limit, Microsoft lifted some internal Exchange Server restrictions to let the database grow as large as available disk space permits. (The limit still exists for the standard version of Exchange Server 5.5.) A larger database lets you allocate greater mailbox quotas to users and lets a server support more mailboxes. However, when a database grows past 50GB, you need to pay special attention to backup and restore procedures, as well as the performance characteristics of the I/O subsystem. Although databases of any size require backups, the larger a database grows, the more challenging it becomes to manage. The ability to store massive amounts of data is useless if poor operational discipline compromises that data or if the data can't get to the CPU for processing because of I/O bottlenecks. In this respect, 50GB is an artificial limit.

Despite the larger store, the practical limit for user mailboxes on one Exchange server—even when you involve Microsoft Cluster Server (MSCS)—remains at about 3000. Hardware vendors have published performance data that suggests the possibility of supporting 30,000 or more simulated users on one 8-way Xeon server. Regardless of that data, if one large database experiences a problem, thousands of users will be unhappy. Large databases are potential single points of failure. Therefore, you won't find many Exchange servers that support more than 3000 mailboxes. The largest database in production today is approaching 200GB, so functioning with very large Exchange Server databases is possible—but only when you devote great care to day-to-day operations and system performance.

Partitioning the store is interesting from several perspectives. First, by removing a potential single point of failure (i.e., dividing user mailboxes across multiple databases), you can minimize the impact of database failure. Second, you can let users have larger mailbox quotas. Third, you can avoid potential I/O bottlenecks by dividing the I/O load that large user populations generate across multiple spindles. Finally, the advent in Win2K of active-active 2-way and 4-way clustering (which Exchange 2000 supports) increases overall system resilience through improved failovers.

On an operational level, Microsoft has gone to great lengths to ensure that multiple databases are easier to manage. As Screen 1 shows, Win2K's Backup utility can back up and restore individual storage groups and databases rather than process the entire IS. Third-party backup utilities (e.g., VERITAS Software's Backup Exec, Legato Systems' NetWorker, Computer Associates' (CA's) ARCserveIT) will probably support this feature by the time Microsoft releases Exchange 2000. Using the Exchange System Manager Microsoft Management Console (MMC) snap-in, you can dismount and mount an individual database for maintenance without halting all store operations, as Exchange Server 5.5 requires. For example, suppose that the Applied Microsoft Technologies database in the First Storage Group is dismounted, as Screen 2 shows. Right-clicking the database brings up a context-sensitive menu in which you can choose the All Tasks, Mount Store option to bring the store online. Generally, you'll find that Exchange 2000 database operations are easier than Exchange Server 5.5 operations because the databases are smaller, and because you can process operations against multiple databases in parallel.

However, Exchange 2000 uses a single storage group by default. Out of the box, or following an upgrade from Exchange Server 5.5, Exchange 2000 operations proceed exactly as they do in Exchange Server 5.5. To take advantage of the new features and gain extra resilience, you need to partition the store, and you can't partition the store until you carefully consider database placement, file protection, and I/O patterns.

In terms of I/O, the Exchange Server 5.5 store is a set of hot files. All mailbox operations flow through priv.edb, whereas all public folder operations channel through pub.edb. If you partition the store and create multiple databases, you need to consider how to separate I/O across a system's available disks in such a way that you increase performance and protect your data. I don't mean to suggest that you ought to rush out and buy a set of large disks. Increased information density means that 72GB disks are now available, and the price per megabyte is constantly dropping. However, you won't attain good I/O performance by merely increasing disk capacity. The number of disks, as well as the intelligent placement of files across the disks, is much more important. CPU power increases every 6 months, and 8-way servers are now reasonably common. However, even the most powerful system can't outperform a humble single-processor machine if its CPUs can't get data. Therefore, storage configuration is crucial to overall system performance.

Storage Performance Basics
Figure 1 illustrates a typical disk response-time curve. As the number of requests to the disk increases, the response time also increases along an exponential curve. Disk queuing causes this behavior, and you can't do anything about it. Any disk can service only a limited number of I/Os, and I/O queues accumulate after a disk reaches that limit. Also, the larger the disk, the slower it typically is. For example, don't expect a 50GB disk to process more than 70 I/O requests per second. Over time, disks might spin faster, get denser, and hold more data, but they can still serve I/O at only a set rate, and that rate isn't increasing.
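You can approximate this curve with a simple single-server queuing model. The sketch below assumes M/M/1 behavior and the 70-I/O-per-second service rate cited above; real disks and controllers are more complicated, so treat it as an illustration of why response time explodes near saturation, not as a capacity-planning tool:

```python
# Approximate disk response time with a simple M/M/1 queuing model.
# Assumption: the disk services about 70 I/Os per second (the figure
# cited above for a large drive); arrival rates are illustrative.

def response_time_ms(arrival_rate, service_rate=70.0):
    """Mean response time in milliseconds; undefined at saturation."""
    if arrival_rate >= service_rate:
        raise ValueError("disk saturated: I/O queue grows without bound")
    return 1000.0 / (service_rate - arrival_rate)

for ios_per_sec in (10, 35, 60, 68):
    print(f"{ios_per_sec:3d} I/O/s -> {response_time_ms(ios_per_sec):6.1f} ms")
```

Note how the last few I/Os per second of load cost far more in latency than the first few: the model gives roughly 17ms at 10 I/Os per second but half a second at 68.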

Transactions that the ESE applies to the Exchange Server databases use a two-phase commit (2PC) process, which ensures that all database changes that are part of a transaction occur together or not at all. A transaction modifies database pages as it proceeds, and the transaction log buffer stores the changes. To ensure the integrity of the database, a special memory area called the Version Store holds the original page content. When the transaction commits, the database engine writes the page changes from the transaction log buffers to the transaction log files, then removes the pages from the Version Store. If the ESE must abort the transaction, any changes related to the transaction will roll back.

Writing to the transaction log is the performance-critical part of the process. The IS orders the pages in memory and commits them in an efficient, multithreaded manner, but the writes to the transaction log are sequential. If the disk holding the logs is unresponsive, delay will occur and Exchange Server won't log transactions quickly. Therefore, you need to ensure that the I/O path to the disk holding the transaction logs is as efficient as possible. Note that the same performance characteristic is evident in AD, which uses a modified version of the ESE.

For optimum performance, you need to place the transaction logs on the most responsive volume or the device with optimal write performance. Your aim is for Exchange Server to write transactions to the log files as quickly as possible. The typical—and correct—approach is to locate the log files on a disk separate from the disk that holds the database. To ensure data resilience, the logs and database must be separate. Remember that an operational database is never fully up-to-date. The transaction logs contain transactions that the IS might not have committed yet. Therefore, if the disk holding the database fails, you need to rebuild the database (by restoring the most recent full backup) and let Exchange Server replay the outstanding transactions from the logs created since the backup. Clearly, if the transaction logs and the database reside on the same disk and a fault occurs, you're in big trouble. To ensure resilience, mirror the transaction log disk. Don't use RAID 5 on the volume that hosts transaction logs, because it slows down the write operations to the logs and degrades overall system performance. (For more information about RAID 5, see the sidebar "Why Is RAID 5 Slow on Writes?" page 80.) RAID 0+1 (i.e., striping and mirroring) delivers the best write performance for larger volumes and is highly resilient to failure. However, RAID 0+1 is typically too expensive in terms of allocating disks to transaction logs. RAID 1 (i.e., mirroring), which provides an adequate level of protection balanced with good I/O performance, is the usual choice for volumes that host transaction logs. Never use RAID 0 for a disk that holds transaction logs—if one disk fails, you run the risk of losing data.

Each storage group uses a separate set of transaction logs. You need to separate the log sets as effectively as possible on multiple mirrored volumes. However, one storage array can support only a limited number of logical units (LUs), so compromise might be necessary. On small servers, you can combine log sets from different storage groups on one volume. This approach reduces the amount of storage the server requires at the expense of placing all your eggs in one basket. A fault that occurs on the volume affects all log sets; therefore, you need to take every storage group offline.

Exchange Server databases' I/O characteristics exhibit random access across the entire database file. The IS uses parallel threads to update pages within the database, so a multispindle volume helps service multiple concurrent read or write requests. In fact, the system's ability to process multithreaded requests increases as you add more disks to a volume.

Since Exchange Server's earliest days, most system designers have recommended RAID 5 protection for the databases. RAID 5 is a good compromise for protecting storage and delivering reasonable read/write performance without using too many disks. However, given the low cost of disks and the need to drive up I/O performance, many high-end Exchange Server 5.5 implementations now use RAID 0+1 volumes to host the databases. Expect this trend to continue in Exchange 2000. Although you can now partition I/O across multiple databases, the number of mailboxes that an individual server supports will likely increase, thereby driving up the total generated I/O. Large 4-way Exchange 2000 clusters need to be able to support as many as 10,000 mailboxes and manage 200GB to 400GB of databases across multiple storage groups. In terms of write operations, RAID 0+1 can perform at twice the speed of RAID 5, so any large Exchange 2000 server needs to deploy this configuration for database protection.
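The "twice the speed" figure follows directly from the classic RAID write penalties: a small RAID 5 write costs four physical I/Os (read data, read parity, write data, write parity), whereas a mirrored write costs two. A minimal sketch of the arithmetic (the function name and workload figures are illustrative, not from any vendor tool):

```python
# Back-end physical I/Os generated per logical write, using the
# classic RAID write-penalty figures. RAID 5's small-write penalty
# is 4: read data, read parity, write data, write parity.

WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 0+1": 2, "RAID 5": 4}

def backend_writes(logical_writes_per_sec, raid_level):
    """Physical write-related I/Os the array must absorb per second."""
    return logical_writes_per_sec * WRITE_PENALTY[raid_level]

for level in ("RAID 1", "RAID 0+1", "RAID 5"):
    print(f"{level:8s}: {backend_writes(100, level)} back-end I/O/s")
```

For the same logical write load, RAID 5 pushes twice as many physical I/Os to the spindles as RAID 0+1 does, which is why write-heavy Exchange databases favor RAID 0+1 when the disk budget allows.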

To yield the best performance for both transaction log and database writes, use the write cache on the storage controller. However, don't use the write cache unless you're sure that you've adequately protected the data in the cache against failure and loss. You need to mirror the cache and use battery backup to protect it from power failure. You also need to be able to transfer the cache between controllers in case you want to replace the controller. Read operations to access messages and attachments from the database typically retrieve information across the entire file, so controller read cache doesn't help performance. The ESE performs application-level caching.

Don't attempt to combine too many spindles in a RAID 5 volume. Each time a failure occurs, the entire volume rebuilds. The duration of the rebuild is directly proportional to the number and size of disks in the volume, so each disk you add increases the rebuild time. Most volume rebuilds occur in the background, and the volume remains online. However, if another failure occurs during the rebuild, you might experience data loss. Therefore, reducing rebuild time by reducing the number of disks in the volume set is good practice. Deciding the precise number of disks to place in a volume can be a balancing act between the size of the volume you want to create, the expected rebuild time, the data that you want to store on the volume, and the expected mean time between failures. If you want to store nonessential data on a large volume for the sake of convenience, you can combine many disks into the volume. However, an Exchange Server database tends to be sensitive to failure. I recommend erring on the side of caution and not placing more than 12 disks into the volume.

Examination of Exchange 2000 servers' I/O patterns reveals some interesting points, some of which differ significantly from Exchange Server 5.5 patterns. The streaming database delivers sparkling performance to IMAP4, POP3, and HTTP clients because they can store or retrieve data much faster from the streaming database than they can from the traditional Exchange Database (EDB). Clients access the streaming database through a kernel-mode filter driver called the Exchange Server Installable File System (ExIFS). Like the EDB, the ExIFS processes data in 4KB pages. However, the ExIFS can allocate and access the pages contiguously, whereas the EDB merely requests pages from ESE and might end up receiving pages that are scattered across the file. You won't see a performance advantage for small messages, but consider the amount of work necessary to access a large attachment from a series of 4KB pages that the IS needs to fetch from multiple locations. Because its access is contiguous, the streaming database delivers much better performance for large files. Interestingly, contiguous disk access transfers far more data (as much as 64KB per I/O); therefore, to achieve the desired performance, the storage subsystem must be able to handle such demands.

Advances in storage technology often focus on the amount of data that can reside on a physical device. As we move toward the consolidation of small servers into larger clusters, I/O performance becomes key. System designers need to focus on how to incorporate new technologies that enable I/O to get to CPUs faster. Exchange 2000 is the first general-purpose application to take full advantage of the fibre channel protocol, which delivers transfer rates as high as 100MBps. Systems that support thousands of users must manage large quantities of data. The ability to store and manage data isn't new, but advances such as fibre channel now let system configurations attain a better balance between CPU, memory, and storage.

Storage Configuration
Most Exchange Server 5.5 servers use SCSI connections. As a hardware layer, SCSI demonstrates expandability limitations, especially in the number of disks that you can connect to one SCSI bus and the distance over which you can connect the disks. As Exchange servers get larger and handle more data, SCSI becomes less acceptable.

As I noted, fibre channel delivers high I/O bandwidth and great flexibility. You can increase storage without making major changes to the underlying system, and fibre channel storage solutions that extend over several hundred meters are common. Win2K's disk-management tools simplify the addition or expansion of volumes, so you can add capacity for new storage groups or databases without affecting users. Better yet, fibre channel implementations let servers share powerful and highly protected storage enclosures called Storage Area Networks (SANs). For most individual servers, a SAN is an expensive data storage solution. However, a SAN makes sense when you need to host a large corporate user community by colocating several Exchange 2000 servers or clusters in a data center. You need to weigh the advantages of a SAN, as well as its additional cost, against the advantages and disadvantages of server-attached storage. A SAN can grow as storage requirements change. Its adaptability and ability to change without affecting server uptime might be a crucial factor in installations that need to support large user communities and deliver 99.99 percent or greater system availability.

Example Configuration
Let's put some of the theory I've discussed into the context of an example Exchange 2000 system configuration. Assume that your server must support 3000 mailboxes and you want to allocate a 100MB mailbox quota. This size might seem large, but given the ever-increasing size of messages and lower cost of storage, installations are raising mailbox quotas from the 10MB-to-20MB limits imposed in the early days of Exchange Server to 50MB-to-70MB limits. A simple calculation (i.e., mailboxes × quota) gives you a storage requirement of 300GB. This calculation doesn't consider the single-instance ratio or the effect of the Deleted Items cache, but it serves as a general sizing figure.

A casual look at system configuration options suggests that you can solve your storage problem by combining seven 50GB disks into a RAID 5 volume. Although this volume would deliver the right capacity, the seven spindles probably couldn't handle the I/O load that 3000 users generate. Observation of production Exchange Server 5.5 servers reveals that each spindle in a RAID 5 volume can handle the I/O load of approximately 200 mailboxes. Spindles in a RAID 0+1 volume push the supported I/O load up to 300 mailboxes. If you apply these guidelines to our Exchange 2000 example, you'll need 15 spindles (i.e., physical disks) in a RAID 5 volume, or 10 spindles in a RAID 0+1 volume, to support the expected load.
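These rules of thumb are easy to turn into arithmetic. The following sketch reproduces the spindle counts for the example server; the per-spindle mailbox figures are the observational guidelines quoted above, not guarantees, and the function name is mine:

```python
import math

# Spindle-count sizing from the observed rules of thumb (assumptions:
# roughly 200 mailboxes per RAID 5 spindle, 300 per RAID 0+1 spindle).

MAILBOXES_PER_SPINDLE = {"RAID 5": 200, "RAID 0+1": 300}

def spindles_needed(mailboxes, raid_level):
    """Minimum number of physical disks to carry the mailbox I/O load."""
    return math.ceil(mailboxes / MAILBOXES_PER_SPINDLE[raid_level])

print(spindles_needed(3000, "RAID 5"))    # 15 spindles
print(spindles_needed(3000, "RAID 0+1"))  # 10 spindles
```

Note that the I/O-driven spindle count (15 disks), not the capacity-driven one (seven disks), sets the floor for this configuration.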

Exchange Server 5.5 has one storage group, so splitting the I/O load across multiple volumes is difficult. Exchange 2000 lets you split the 3000 mailboxes across four storage groups. If you use one message database in each storage group, each database is 75GB, which is unwieldy for the purpose of maintenance. To achieve a situation in which each database supports 150 users and is about 15GB, you can split the users further across five message databases in each storage group.
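A quick calculation confirms the partitioning arithmetic (a sketch; the decimal 1000MB-per-GB rounding matches the article's round numbers):

```python
# Partitioning the example server: 3000 mailboxes with a 100MB quota,
# split across four storage groups of five databases each.

mailboxes, quota_mb = 3000, 100
groups, dbs_per_group = 4, 5

users_per_db = mailboxes // (groups * dbs_per_group)  # 150 users per database
db_size_gb = users_per_db * quota_mb / 1000           # about 15GB per database
total_gb = mailboxes * quota_mb / 1000                # 300GB overall

print(users_per_db, db_size_gb, total_gb)  # 150 15.0 300.0
```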

Splitting users this way affects the single-instance storage model that Exchange Server uses. Single-instance storage means that users who receive the same message share one copy of the message's content. But single-instance storage extends across only one database. After you split users into separate databases, multiple copies of messages are necessary—one for each database that holds a recipient's mailbox. However, experience shows that most Exchange servers have low sharing ratios (e.g., between 1.5 and 2.5), and dividing users across multiple databases produces manageable databases that you can back up in less than 1 hour using a DLT. Also, a disk failure that affects a database will concern only 150 users, and you can restore the database in an hour or two. Although four storage groups, each containing five databases, might seem excessive, this example realistically represents the types of configurations that system designers are now considering for early Exchange 2000 deployments.
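To sanity-check the sub-1-hour backup claim, a rough estimate helps. The 5MBps sustained throughput below is my assumption for an era-typical DLT drive, not a vendor figure:

```python
# Rough backup-window estimate for one 15GB database.
# Assumption: about 5MBps sustained native DLT throughput.

def backup_minutes(db_gb, throughput_mbps=5.0):
    """Minutes to stream db_gb gigabytes to tape at the given rate."""
    return db_gb * 1024 / throughput_mbps / 60

print(f"{backup_minutes(15):.0f} minutes")  # 51 minutes
```

At that rate a 15GB database streams to tape in roughly 51 minutes, comfortably inside the 1-hour window; a 75GB database on the same drive would take more than four hours.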

Each storage group contains a set of transaction logs. Recalling the basics of disk configuration, you might think that you need four mirror sets for the logs and four RAID 5 or RAID 0+1 sets, one for each set of databases. Managing such a large amount of storage from a backplane adapter—you'd probably double the physical storage to 600GB because you don't want to fill disks and you want room to grow—is impractical because you'd probably encounter a limit to the number of disks you can connect. Also, a system this large is a candidate for clustering, so you need a solution that can deliver the I/O performance, handle the number of spindles required to deliver the capacity, and support Win2K clustering. For all Exchange 2000 clusters, consider using a SAN either to share load between servers that use the Win2K active/active clustering model or to benefit from advanced data-protection mechanisms such as online snapshots and distant mirroring. If you need to add users, you simply create a new storage group, create a new volume in the SAN, and mount the database without interrupting service. The Win2K Disk Administrator can bring new disks online without requiring a system reboot. Generally speaking, Win2K greatly improves disk administration—a welcome advance given the size of volumes in large configurations. Screen 3 shows the Disk Management MMC snap-in dealing with some very large volumes, including one whose size is 406.9GB! This volume should be large enough to keep many Exchange Server databases happy.

Not every database or storage group requires its own volume. You can divide the databases across available volumes as long as you keep an eye on overall resilience against failure and don't put too many databases on the same volume. Exchange 2000 clusters use storage groups as cluster resources, so you need to place all the databases for a storage group on the same volume. This placement ensures that the complete storage group and the disks holding the databases will fail over as one unit.

Transaction logs that handle the traffic of 750 users will be busy. In such a configuration, you could create four separate RAID 1 sets for the logs. If you use 9.2GB disks, you'll need eight disks in four volumes. A 9GB volume has more than enough space to hold the transaction logs of even the most loaded server. For best performance, don't put files that other applications use on the transaction log volumes.

Systems that run with more than a couple of storage groups can group transaction logs from different storage groups on the same volumes. You don't want to create too many volumes only for the purpose of holding transaction logs. Figure 2 illustrates how you might lay out databases and transaction logs across a set of available volumes.

Disks that you use in Win2K clusters must be independently addressable, so if you want to consider a clustered system, you need to use hardware-based partitions, which let the controller present multiple LUs to the cluster or server and let you use fewer disks. Clusters require a disk called the quorum disk to hold quorum data. I recommend using a hardware partition for this data; the actual data hardly ever exceeds 100MB, and dedicating an entire physical disk is a waste.

If you use RAID 5 to protect the four storage groups, you'll need five 18.2GB disks for each volume. You can gain better I/O performance by using 9 × 9.2GB disks. The volumes have 72GB capacity, which is more than the predicted database size (3 × 15GB = 45GB). You need the extra space for maintenance purposes (e.g., rebuilding a database with the Eseutil utility) and to ensure that the disk never becomes full. Stuffing a disk to its capacity is unwise because you'll probably reach capacity at the most inconvenient time. Exchange Server administrators typically find that databases grow past predicted sizes over time. After all, databases never shrink—they only get bigger as users store more mail.
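The capacity arithmetic works out as follows (a sketch; it assumes RAID 5 usable space is simply one disk's worth less than the raw total and ignores formatting overhead):

```python
# Usable capacity of a RAID 5 volume: parity consumes the equivalent
# of one disk, so usable space is (n - 1) * disk_size. Formatting
# overhead is ignored in this sketch.

def raid5_usable_gb(disks, disk_gb):
    """Approximate usable gigabytes of an n-disk RAID 5 set."""
    return (disks - 1) * disk_gb

print(raid5_usable_gb(5, 18.2))  # about 72.8GB from five 18.2GB disks
print(raid5_usable_gb(9, 9.2))   # about 73.6GB from nine 9.2GB disks
```

Both layouts land near the 72GB figure quoted above; the nine-disk layout simply spreads the same capacity across more spindles, which is where the extra I/O performance comes from.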

Expanding Boundaries
The changes that Microsoft has introduced in the Exchange 2000 IS offer system designers extra flexibility in hardware configurations. Partitioning the IS means that you can exert more control over I/O patterns. I'm still investigating the opportunities that SANs, Exchange 2000 clusters, and different storage group configurations offer, but clearly the number of mailboxes that one production server can support will climb well past Exchange Server 5.5's practical limit of 3000.