When I speak about Exchange at seminars and other events, the topic of best operational practices often comes up. People want to know the steps they must take to operate an efficient and effective Exchange installation once the software moves from pilot status into production. Email is now a mission-critical application for many large companies, and these organizations want to minimize the company's risk in the investment they make to implement client/server-based messaging. In "Planning a Large-Scale Exchange Implementation," May 1997, I discussed how to plan for a successful implementation; now I'll consider day-to-day operations in an Exchange environment and explain the five guiding principles that will make your operations successful.

Microsoft designed Exchange to be scalable, robust, and reliable in distributed environments. Exchange manages reasonably large user populations on individual servers (one server at Digital has supported more than 2,750 mailboxes) and will manage far larger populations as Windows NT and hardware evolve. Exchange is more akin to mainframe or minicomputer messaging systems, such as IBM PROFS or Digital ALL-IN-1, than to Microsoft Mail or Lotus cc:Mail.

Guiding Principles

Managing very large user communities is impossible if you don't follow disciplined systems management practices. I have several principles that guide efficient system management for a production-category Exchange server.

  1. Plan for success. Assume that users will increase the demand on the servers, the volume of mail traffic will increase, and you'll deploy new messaging applications (such as workflow). Make sure that system configurations incorporate room for growth and accommodate periods of increased demand.
  2. Use dedicated hardware for Exchange. Configure the hardware to provide a resilient and reliable service on a continuous basis for three years with a minimum number of interventions (and system downtime) required. After three years, replace the hardware.
  3. Keep downtime to a minimum. Never take an action that interferes with or removes the Exchange service from users. For any intervention that requires taking servers offline, plan in advance and clearly communicate your intentions to users. Also, be prepared for catastrophic hardware failure. Outline a recovery plan to handle emergencies.
  4. Track system statistics. Proactive system monitoring is a prerequisite for delivering a production-quality service. While you're monitoring the system, gather regular statistics on system use and analyze the data to help identify potential problems and protect the quality of service.
  5. Follow well-defined, regular housekeeping procedures.

Exchange needs disciplined management to achieve maximum potential. Anyone can take the Exchange CD-ROM, slap it into a drive, install the software, and have a server up and running with clients connected in 30 minutes. Such a system can handle a small user community. This approach is OK if that level of service is all you need. The strategy I outline here is geared to large, corporate deployments, but the logic that drives the strategy is valuable no matter what size shop you run. The five principles are generic, but they have proved to work over a large number of Exchange deployments in the past two years.

1. Plan for Success

Any configuration will come under increasing pressure as it ages. You experience the best performance immediately after you install the system, when disks are not fragmented, users put little demand on the computer, and application files are as small as they'll ever be.

As people get to know an application, the user-generated load increases. Users send more messages, and the messages are larger. Users find more reasons to use the underlying service: For example, you might install a fax connector for better communication with external agencies or deploy a full-text retrieval package to improve manageability of public folder contents. The disks fill up with user and application data. With Exchange, the information store swells to occupy as much space as you can devote to it. If you don't configure the system with success in mind and incorporate room for growth, you'll end up with a system that runs smoothly at the beginning only to suffer increasingly as time goes by.

I recommend overconfiguring the service at the start so that you don't become entangled in a cycle of constant upgrades. Install two CPUs rather than one, use 128MB of RAM rather than 96MB, have 20GB of disk instead of 16GB, and so on. Build server configurations that can handle at least some expected software developments over the next few years. For example, consider RAID controllers for system clustering. Look at the hardware that existing clustering solutions use and see whether you can include hardware with the same or superior capabilities. (For more information on clustering solutions, see Mark Smith, "Clusters for Everyone," and Joel Sloss, "Clustering Solutions for Windows NT," June 1997.) Because the upcoming release of 64-bit NT 5.0 will require a new version of Exchange before it can be used for messaging, it is probably at the outer range of consideration. But think about Alpha CPUs if you're interested in building high-end servers that you want to eventually run 64-bit NT on. Alpha CPUs are also appropriate as servers that must handle high levels of format translation work, such as those that host Internet connectors. Configure systems that will be successful over time rather than just today. Any other approach might require more hardware upgrades than you want in a production environment.
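To put a number on "room for growth," a rough projection helps. The following is a minimal sketch, assuming the store grows by a fixed percentage each month; the function name and the 5 percent rate in the usage note are illustrative assumptions, not measurements from any site.

```python
def months_until_full(used_gb: float, capacity_gb: float, monthly_growth: float) -> int:
    """Count the whole months for which compound growth still fits on the disk.

    monthly_growth is a fraction (0.05 means 5 percent a month) and is an
    assumed planning figure; substitute the rate you measure on your servers.
    """
    if used_gb <= 0 or capacity_gb <= 0 or monthly_growth <= 0:
        raise ValueError("sizes and growth rate must be positive")
    months = 0
    size = used_gb
    # Keep growing the store until the next month's size would not fit.
    while size * (1 + monthly_growth) <= capacity_gb:
        size *= 1 + monthly_growth
        months += 1
    return months
```

With 4GB in use and an assumed 5 percent monthly growth, comparing `months_until_full(4, 16, 0.05)` with `months_until_full(4, 20, 0.05)` shows how much extra life the larger disk configuration buys before an upgrade becomes unavoidable.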

2. Use Dedicated Hardware for Exchange

You can install Exchange on just about any NT server that has the correct revision level of the operating system (for Exchange 5.0, the correct level is NT 4.0 with Service Pack 3--SP3) and a minimum of 32MB of RAM. The same server can run other BackOffice applications and some personal productivity applications such as Office 97. For good measure, the server can provide file and print sharing to a set of workstations, not to mention Domain Name System (DNS), Windows Internet Name Service (WINS), and Dynamic Host Configuration Protocol (DHCP), and act as a domain controller. The applications will install and run, but run slowly. And, with all those applications, think of the steps you'll have to take to get the server back online in case of hardware failure. I do not recommend this mix on a production system. Having dedicated hardware lets you tailor and tune the configuration to meet the needs of an application.

Most accountants are happy to depreciate servers over three years. Plan to run Exchange on dedicated boxes without interruption for three years and replace the servers at the end of that time.

We've already discussed configuring systems for success. Apart from the obvious need for a fast CPU and enough memory, the I/O subsystem and hardware for system backups require special attention in an Exchange environment.

With a database at its center (the information and directory stores), Exchange is sensitive to disk I/O. If you design systems to support hundreds of users, you must pay attention to the number of disks and the way you arrange the Exchange files across the disks. If you don't pay attention to I/O, your system will run into an I/O bottleneck long before it exhausts CPU or memory resources. The bottleneck is often masked by 100 percent CPU usage, largely because of the work that the CPU does in swapping processes around.

Classically, the major sources of I/O on an Exchange server are pub.edb and priv.edb, the public and private information stores; (to a lesser extent) dir.edb, the directory store; the transaction logs; and the Message Transfer Agent (MTA) work directory. Servers hosting the Internet Mail Server (IMS) have to cope with its work directory as well. Ideally, allocate a separate physical disk to each I/O source to give the system a separate channel for the I/O activity each source generates. Resilience is also important, and you need to protect Exchange against the effects of a disk failure: Place the stores in a RAID-5 array, and keep the stores separate from the transaction logs. If you have to restore a database, you'll want the transaction logs structured so that you can roll forward any outstanding transactions once you restart Exchange. If the stores and the logs are on the same drive and a problem occurs, you can recover the store from a backup, but all transactions since the backup will vanish.
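You can check a planned disk layout against these rules before you build the server. The following is a minimal sketch: the source names mirror the files and directories listed above, and the drive letters are hypothetical. It flags any drive that carries more than one I/O source and singles out the case where a store shares a drive with the transaction logs.

```python
STORES = {"priv.edb", "pub.edb", "dir.edb"}
LOGS = "transaction logs"

def check_layout(layout: dict) -> list:
    """Return warnings for a layout that maps each I/O source to a drive letter."""
    by_drive = {}
    for source, drive in layout.items():
        by_drive.setdefault(drive.upper(), []).append(source)

    warnings = []
    for drive, sources in sorted(by_drive.items()):
        if len(sources) < 2:
            continue  # one source per drive is the ideal case
        names = ", ".join(sorted(sources))
        if LOGS in sources and STORES.intersection(sources):
            # Losing this drive loses both the store and the logs needed to roll forward.
            warnings.append(f"{drive}: stores and logs together ({names}); roll-forward recovery at risk")
        else:
            warnings.append(f"{drive}: I/O contention between {names}")
    return warnings
```

An empty list means every source has its own drive. A layout that groups the stores on one RAID-5 volume will still produce a contention warning, which you might reasonably accept as long as the logs live elsewhere.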

Servers that do a lot of work involving connectors generate a large amount of traffic through the MTA work directory. (For a description of how Exchange uses connectors, see "Planning a Large-Scale Exchange Implementation," May 1997.) Unlike the stores, which are databases, the MTA uses the NT file system to hold information about messages as they pass through. Exchange maintains a set of indexes and writes each message to disk as it is processed. Exchange takes some steps to minimize I/O, but this is generally how the MTA works. With servers hosting connectors to the Internet, other Exchange sites, or other messaging systems, thousands of messages pass through the MTA daily. In these cases, you must isolate the MTA work directory and prevent the I/O it generates from interfering with other processing. For example, do not put the MTA work directory on the same drive as the information store or on the same drive as NT. Put this directory on a drive allocated to user directories or anywhere else where disk I/O is low.

3. Keep Downtime to a Minimum

Managing a system also means anticipating downtime. Each time you take a server down for preventative maintenance or to upgrade hardware or software, you risk that the server might not come up again smoothly. Errors happen; software and hardware aren't perfect. Doesn't minimizing the number of times that you'll have to interfere with a server during its lifetime make sense?

You must do preventative maintenance, and you can't avoid software upgrades. Despite misgivings, you will probably install every service pack, at least for Exchange if not for NT. So all you can do to minimize system downtime is configure hardware so that it can comfortably last its predicted lifetime without requiring an upgrade. You compensate for the extra up-front expense of such configuration with peace of mind for the systems administrator and more predictable service for users.

But what about getting a system back online quickly if a disaster occurs, specifically if some catastrophic hardware failure happens? In a problem situation, you don't want to install half-a-dozen applications back onto new hardware just to get a mail server back online. I prefer a situation where I can follow four steps to get the server back online:

  1. Install and configure NT (including service packs).
  2. Install and configure Exchange (including service packs) using a "forklift install," meaning you install Exchange, but its services will not be started. You don't want services such as the directory to start immediately after you install the software because the directory will synchronize its brand new databases with other servers and sites, leading to possible data loss. Allow synchronization to proceed only after you've restored the information and directory stores.
  3. Restore the information and directory stores and restart the Exchange services.
  4. Check that everything has worked and that users can access their mail.

Have you ever noticed how usually logical people do the craziest things in pressure situations? If you keep things simple and have dedicated hardware for Exchange, you'll make a recovery exercise much easier. I assume Exchange will be around for at least three or four more years. How many hardware problems can you expect on a server in that time? Now multiply the chance of a hardware problem occurring across many servers, and you'll understand why it pays to run dedicated hardware.

Recently at a customer site, a server had gone down Friday evening and wasn't back online until Sunday afternoon. Such an outage is barely acceptable over the weekend when you have less user demand, but the same outage is unacceptable during peak working hours. No one knew how to get a replacement server online. The customer had no clear and simple steps outlined, and the staff went down many blind alleys before they restarted the server.

While we're discussing hardware backup and restore, let me make a couple of points. First, get the fastest backup devices you can afford. Moving away from the digital audio tape (DAT) device that is often automatically configured into every server will cost extra money, but you'll be glad you made the investment. The time for backups (and restores) will be shorter, and you'll be able to make full daily backups instead of incremental daily backups and a weekly full backup. Exchange stores have a maximum size of 16GB, but Microsoft will remove this restriction in the Exchange Osmium release, due by the end of 1997. Then you might have to back up stores as large as the disks you attach to a server, conceivably hundreds of gigabytes. The faster the backup device, the easier the task. Even on small servers, a digital linear tape (DLT) device is preferable to a DAT.
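The arithmetic behind the choice of backup device is simple enough to sketch. In the example below, the throughput figures you pass in are placeholders and not measured rates for DAT or DLT drives; time a real backup on your own hardware and use that number.

```python
def backup_hours(store_gb: float, gb_per_hour: float) -> float:
    """Hours needed to stream a store of store_gb to a device of the given speed."""
    if gb_per_hour <= 0:
        raise ValueError("device throughput must be positive")
    return store_gb / gb_per_hour

def fits_window(store_gb: float, gb_per_hour: float, window_hours: float) -> bool:
    """True if a full backup completes inside the nightly window."""
    return backup_hours(store_gb, gb_per_hour) <= window_hours
```

For a store at the 16GB limit and a hypothetical six-hour overnight window, `fits_window(16, 2, 6)` fails while `fits_window(16, 8, 6)` passes. That difference decides whether a full daily backup is practical.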

Second, don't assume that NTBACKUP scores 100 percent in the backup software desirability stakes. The best things about NTBACKUP are the price (it's free) and that it comes ready to work with Exchange. Screen 1 shows NTBACKUP ready to back up a server selected from an Exchange organization. NTBACKUP works, and you must make a conscious decision to purchase replacement backup software (and not just for one server; use the same software everywhere). Increased speed, a greater degree of control over backup operations, and a scheduling engine are among the justifications for these purchases. All these reasons are valid. Seagate's Backup Exec, Cheyenne's ARCServe, and Barratt Edwards International's UltraBac are good examples of third-party backup software that works with Exchange. If the extra expense is not for you, be sure that you are happy with NTBACKUP and take the time to create some batch files to help automate backup procedures. You can use the AT and WINAT utilities to schedule backups, but if you use these utilities, you'll need some handcrafted batch code to start off the backups with the proper command switches.

4. Track System Statistics

You can say you know what's happening on a server, but proving it is another thing. Recording regular statistics about message throughput, growth in disk usage, number of supported users, volume of Help desk calls, average message transmission time, and so on provides the evidence of a system's workload. Good statistics can also give you the necessary background to help justify hardware upgrades or replacements when the time arrives.

Gather some statistics that don't directly relate to Exchange, such as the growth of disk space allocated to networked personal drives. You can use the Exchange message tracking logs to analyze a server's workload. Unfortunately this measurement is relatively crude because it is based on the transactions recorded in the tracking logs as they pass through Exchange. Each message generates a number of transactions depending on the number of components (the MTA and connectors) that handle the message. A message to a local recipient generates fewer transactions than a message that an external connector processes.

You must create tracking logs before you can use them for analysis. Select the Enable message tracking checkbox on the properties of the MTA Site Configuration object to create message tracking logs. Exchange will automatically create the logs and store them on a network share called \\server_name\tracking.log. The network share lets you track the path of a message from server to server as it makes its way to its final destination. The Message Tracking Center option in the administration program lets you track messages.

Logs are simple ASCII files. Each entry, such as messages being submitted and then delivered to a recipient or connector, contains a code (to identify the type of transaction, see Chapter 17, "Troubleshooting Tools and Resources," of the Exchange Administrator's Guide) and some information about the message, such as the recipient. Exchange creates a new log every day, and the log size varies from server to server, depending on the amount of message traffic.

Screen 2 shows the set of tracking logs on a server. In this case, the logs are reasonably small. Based on figures from some reasonably large servers at Digital and other customers, even on the largest server, you'll probably see no more than 40MB of logs generated daily. Of course, servers that deliver a high proportion of messages to local recipients will generate smaller logs than servers that route many messages to different connectors. Distribution list expansion also creates entries for the logs. Writing entries into the logs does not place a strain on the server, and you have no reason not to generate tracking logs.
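Because Exchange writes one tracking log a day, the size of each file is itself a crude workload statistic. The following is a minimal sketch that reads the sizes from a directory; it assumes the logs carry a .log extension and that you pass in the path to your own tracking log directory. It does not parse the entries themselves.

```python
from pathlib import Path

def daily_log_sizes(log_dir) -> dict:
    """Return {file name: size in bytes} for every .log file in log_dir."""
    return {p.name: p.stat().st_size for p in sorted(Path(log_dir).glob("*.log"))}

def logs_over(sizes: dict, limit_bytes: int) -> list:
    """Names of the logs whose size exceeds limit_bytes, in name order."""
    return [name for name, size in sorted(sizes.items()) if size > limit_bytes]
```

Recording the output of `daily_log_sizes` once a week gives you a growth trend, and `logs_over` with a limit of, say, 40MB picks out the days when a server handled unusually heavy traffic.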

You can analyze the log contents with Crystal Reports for Exchange, which is in the Microsoft Exchange Resource Kit. You can view data in report format or export the data into Excel for further manipulation. Screen 1 shows the result of analyzing the message traffic through one of Digital's large Exchange servers in the U.S. The time line is based on Greenwich Mean Time (GMT), which is five hours ahead of Eastern Standard Time (EST). Thus, the peak load at 16:00 GMT is 11:00 EST.
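If you export hourly totals for further analysis, converting the GMT hours to local time makes the peaks easier to read. The following is a minimal sketch of that conversion for EST; the function names are my own, and the message counts in the usage note are invented for illustration.

```python
EST_OFFSET_HOURS = -5  # EST is five hours behind GMT

def gmt_to_est(hour_gmt: int) -> int:
    """Convert an hour of the day (0-23) from GMT to EST, wrapping past midnight."""
    return (hour_gmt + EST_OFFSET_HOURS) % 24

def peak_hour_est(messages_by_gmt_hour: dict) -> int:
    """Return the EST hour with the highest message count."""
    peak_gmt = max(messages_by_gmt_hour, key=messages_by_gmt_hour.get)
    return gmt_to_est(peak_gmt)
```

For example, passing `{15: 900, 16: 1200, 17: 1100}` to `peak_hour_est` identifies the 16:00 GMT peak and reports it as hour 11 in EST.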

You can also extract statistics from Exchange by examining properties of mailboxes and other objects through the Administration program. However, this manual process is difficult when you have a server hosting more than a hundred users.

5. Housekeeping

Regular housekeeping and systems monitoring are important. You must monitor servers regularly if you want to maintain a predictable quality of service. Exchange provides several tools for monitoring important system indicators, including counters, link monitors, and server monitors.

Exchange publishes more than 100 counters that NT's Performance Monitor can use. Exchange server installs eight predefined workspaces automatically. You can use these workspaces or define your own.

Link monitors check whether the network links to other servers are available. The monitor sends probe messages to the Exchange System Attendant process on remote servers. If the System Attendant is active, it replies to the probe and the monitor notes the reply.

Server monitors check whether important NT services (such as the Exchange MTA or Information Store) are active on remote servers. You can use server monitors only if you have administration permission for the servers you want to monitor.

You can run all the standard monitors on an NT server or workstation (you have to install the Exchange administration program to use them on a workstation). Link and server monitors run as windows inside the Exchange administration program. Screen 3 shows a server monitor keeping an eye on six servers in five sites. The monitor has detected problems on four servers, ranging from serious (the IMS is not active on one server) to inconsequential (the time on the server is off by 51 seconds). You can define actions if a server monitor detects a problem. For example, you can have the Exchange System Attendant send an email message to an administrator, attempt to restart a missing service, or display an NT alert. Compare the information available from the server monitor with the information from a link monitor, which Screen 4 shows. The link monitor shows only whether a network path to a remote server exists.

Many installations have Performance Monitor running constantly, checking on important Exchange indicators such as message queues and the number of users logged on. Screen 5 shows the server health workspace (a workspace is a set of Performance Monitor counters) monitoring a lightly loaded Exchange server. The four essential Exchange components (the store, directory, MTA, and System Attendant) are being monitored with the overall CPU usage and system paging. All the standard monitors are fine in small deployments, but they become less useful when you need to check more than a couple of servers on a regular basis. At this stage, consider other options such as NetIQ's AppManager Console, a command-center type utility.

If you're concerned about message delivery times, use pings to check how quickly messages get from one point of the network to another. A ping is a message that the system sends to a mailbox on a remote server, which then bounces it back to its originator. The system measures how long the roundtrip takes. If you don't want to write procedures to send and measure pings, consider solutions such as Baranof Software's MailCheck for Exchange. You can think of MailCheck as a highly developed version of the standard link monitor, complete with reporting facilities.
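If you do decide to write your own ping procedure, the timing logic is the easy part. The following is a minimal sketch. The two callables are hypothetical placeholders for whatever mechanism you use to send the probe message and to wait for its return, because that part depends entirely on your mail client and environment.

```python
import time
from typing import Callable, Optional

def measure_roundtrip(
    send_probe: Callable[[], None],
    wait_for_reply: Callable[[float], bool],
    timeout_s: float = 300.0,
) -> Optional[float]:
    """Time one mail ping.

    send_probe sends the probe message. wait_for_reply blocks for up to
    timeout_s seconds and returns True if the bounced message arrived.
    Both are supplied by the caller. Returns the roundtrip time in seconds,
    or None if no reply arrived in time.
    """
    start = time.monotonic()
    send_probe()
    if not wait_for_reply(timeout_s):
        return None
    return time.monotonic() - start
```

Run the measurement at regular intervals and record the results with your other statistics. A `None` result is as much worth logging as a slow roundtrip.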

Automated monitoring is all very well, but you need some manual checks to back up the tools. The checklist, "Regular Maintenance Tasks for Exchange," lists everyday maintenance items for Exchange that fill this gap.

On a weekly basis, check the public folder hierarchy to ensure that unauthorized folders have not appeared and that users have not created unauthorized replicas on servers in the organization. You can perform this check less often in deployments where you use a small number of public folders. The aim here is to keep the public folder hierarchy well organized so that it doesn't degenerate into anarchy. Also, review the directory contents regularly to ensure that email addresses are as up to date and accurate as possible. This step is especially important when you synchronize the Exchange directory with information from other messaging systems.

Every three months or so, review the system configuration (hardware and software). A planned software upgrade might be available, or moving files to different disks might create a more efficient configuration, especially for controlling disk I/O.

Aside from a regular system review, the most important intervention you need to consider is database defragmentation. Exchange databases do not support online defragmentation. In other words, over time, the databases swell to occupy all available disk space, halting only when the disk is filled. Of course, the database won't be filled with messages and other items, but instead, a great deal of white space will intersperse the useful material. You can remove the white space and defragment the database only if you take Exchange offline and run the EDBUTIL utility. Because mail messages have a shorter lifetime than items in public folders, you can recover more white space in the private information store.

You can run EDBUTIL only after you stop the Exchange services. Screen 6 shows a successful run. The time you need depends on the size of the database, the speed of the CPU and I/O subsystem, and whether the server is doing any other work at the same time. Expect to be able to process 1GB to 2GB an hour on small to medium servers (100MHz to 200MHz single Pentiums) and up to 4GB per hour on systems with dual CPUs or on Alpha processors. Your mileage may vary, so always depend on the results achieved in your environment rather than what anyone tells you.
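These rates translate directly into the length of the outage you need to schedule. The following is a minimal sketch of the arithmetic; the rates you pass in should come from a timed run in your own environment, and the figures in the usage note simply reuse the 1GB to 2GB per hour range quoted above.

```python
def edbutil_hours(store_gb: float, slow_gb_per_hour: float, fast_gb_per_hour: float) -> tuple:
    """Return (best case, worst case) hours to process a store at the given rates."""
    if slow_gb_per_hour <= 0 or fast_gb_per_hour < slow_gb_per_hour:
        raise ValueError("rates must be positive, with fast >= slow")
    return (store_gb / fast_gb_per_hour, store_gb / slow_gb_per_hour)
```

For an 8GB store on a small to medium server, `edbutil_hours(8, 1, 2)` gives a window of 4 to 8 hours, before you add the time for the backups on either side of the run.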

My experience with many servers shows that if you run EDBUTIL every three months, you can recover substantial disk space. You might not see the same results as Digital, which recovered more than 5GB of space when we defragmented a 15.5GB store, but I'm sure that you'll recover between 10 percent and 20 percent. Because you alter the internal structure of the database during compaction, be sure to make a backup before and after any EDBUTIL run.
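To compare your own results with these figures, express the recovered space as a fraction of the store. The following is a minimal sketch; the function names are my own, and the 10 to 20 percent defaults simply restate the range given above.

```python
def recovered_fraction(recovered_gb: float, store_gb: float) -> float:
    """Fraction of the store's original size that defragmentation recovered."""
    if store_gb <= 0:
        raise ValueError("store size must be positive")
    return recovered_gb / store_gb

def expected_recovery_gb(store_gb: float, low: float = 0.10, high: float = 0.20) -> tuple:
    """Range of space (in GB) you might recover, using the 10-20 percent rule of thumb."""
    return (store_gb * low, store_gb * high)
```

Recovering 5GB from a 15.5GB store works out to roughly a third of the store, well above the rule of thumb, which suggests a store that had gone a long time without compaction.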

The Payoff

Systems won't deliver reliable performance if you leave them alone. A proactive approach pays big dividends when you configure and maintain systems. The suggestions in this article are generic, and you need to refine them for your installation. Use them as input to your plans, but always remember you're the expert when it comes to the details of your site.