From the UNIX/mainframe perspective, large disk systems with gigabytes of capacity are nothing new, and fault tolerance and RAID configurations are taken for granted. From the PC point of view, however, these systems can be daunting and confusing. Many administrators just take the default settings on their server and leave it at that. This approach is simple, but you give up performance when you don't tune your disk subsystem for the applications you run and the configuration of your hardware.

Our Exchange/NT scalability tests gave us an ideal place to begin analyzing how disk configuration affects overall system performance. On a system such as the Tricord PowerFrame, which offers a wide variety of setup options, you have plenty of room to find the ideal arrangement of disks, controllers, and files.

Tricord studied RAID performance on NTFS volumes, and we used its results as the basis for setting up our test environment. What Tricord found (using its own hardware-accelerated RAID controller and nine 2.1GB Seagate drives) was that stripe size has a big impact on performance, that the type of disk transaction (sequential or random, read or write) affects throughput, and that the RAID level (which determines how much fault tolerance you get and what it costs) plays an important role in overall system performance. This may all seem patently obvious (the word "duh" comes to mind), but if you aren't thinking about performance when you set up your system, you aren't taking advantage of what's available.

The Tricord test (as seen in Graph 1) varied the stripe size from eight sectors per stripe to 4096 sectors per stripe on three types of RAID volumes: 0, 5, and 10. Stripe size (disk sectors per stripe, a.k.a. the logical blocking factor) is not the same as cluster size on a single drive. A cluster determines the minimum number of bytes that form a logical unit on a disk; a file cannot take up less space than a single cluster. RAID stripe width determines how much data (in 512-byte sectors) is written to a disk in a single chunk before the controller submits the I/O to the next drive in the stripe set. The test ran on a PowerFrame ES133 (two 133MHz Pentium CPUs, 256MB of RAM, and one two-channel Intelligent Storage Subsystem (ISS) controller, running NT 3.51 with no service packs). Tricord configured the nine drives as follows: Drive 1 on bus 1 held a 300MB FAT partition with the OS, and the rest of that drive was a second FAT partition with the pagefile; drives 2 through 9 resided on bus 2 and were striped (RAID 0, 5, or 10, at the varying stripe widths) into an NTFS volume with 8GB of usable storage.
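
To put those stripe widths in more familiar units, the following quick sketch (in Python; it is not part of the Tricord test) converts sectors per stripe into the size of the chunk written to one drive before the controller moves to the next. The three widths shown are the ones the article calls out.

SECTOR_BYTES = 512   # sector size used in the test

# Convert sectors per stripe to kilobytes per drive chunk. Purely
# illustrative arithmetic based on the figures in the article.
for sectors_per_stripe in (8, 1024, 4096):
    stripe_kb = sectors_per_stripe * SECTOR_BYTES / 1024
    print(f"{sectors_per_stripe} sectors/stripe = {stripe_kb:g}KB written per drive")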

The RAID 0 volume (striping only) used just four drives, two on each available SCSI channel, to get optimal controller performance. (Splitting read/write activity across multiple drives greatly enhances performance over a standard sequential volume set, and further splitting that activity across multiple channels on the same controller improves it even more, because the controller can parallelize the I/O.) The RAID 10 volume (mirrored stripe sets) used eight drives (as seen in Figure A), split up so that half of each mirror set and each stripe set straddled the two SCSI channels. (Using both buses for both striping and mirroring only slightly improved performance over using one channel for each stripe set.) The RAID 5 volume (striping with parity) also used both buses: three drives on bus 1 and two on bus 2. And although the 8MB disk cache offered only a minimal performance improvement in this test, it was enabled for each of the logical devices under test.
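
For reference, here is the simple capacity arithmetic behind those three layouts, assuming the 2.1GB drives from the test. This is generic RAID math, not output from the ISS controller.

DRIVE_GB = 2.1   # drive size used in the Tricord setup

def usable_gb(level, drives):
    if level == "RAID 0":    # pure striping: every drive holds data
        return drives * DRIVE_GB
    if level == "RAID 10":   # mirrored stripe sets: half the raw space is copies
        return drives * DRIVE_GB / 2
    if level == "RAID 5":    # striping with parity: one drive's worth of capacity lost
        return (drives - 1) * DRIVE_GB
    raise ValueError(level)

for level, drives in [("RAID 0", 4), ("RAID 10", 8), ("RAID 5", 5)]:
    print(f"{level}: {drives} drives -> about {usable_gb(level, drives):.1f}GB usable")

All three layouts work out to roughly the same 8GB of usable space, which keeps the comparison fair across RAID levels.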

The tool Tricord used (and that we will also use to test RAID solutions in the near future) was designed by National Peripherals for performance testing RAID on NT. It generates all of the load (sequential reads and writes, random reads and writes) at the server instead of over a network. For the Tricord runs, the tool used a 100MB test file accessed in 64KB blocks, and each test run lasted 180 seconds. Results are reported as data transfer rates in MB per second.
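
We can't reproduce the National Peripherals tool here, but the following rough Python sketch shows the same basic idea: hammer a 100MB test file in 64KB blocks, sequentially or at random, reading or writing, and report MB per second. The file name and the setup code are our own choices, and a real benchmark would bypass the file-system cache (for example, with unbuffered I/O), which this sketch does not.

import os, random, time

FILE = "testfile.dat"           # hypothetical file on the volume under test
FILE_SIZE = 100 * 1024 * 1024   # 100MB test file, as in the Tricord runs
BLOCK = 64 * 1024               # 64KB per I/O
DURATION = 180                  # seconds per run

def run(random_io, write):
    blocks = FILE_SIZE // BLOCK
    data = os.urandom(BLOCK)
    done = 0
    start = time.time()
    with open(FILE, "r+b") as f:
        while time.time() - start < DURATION:
            # Pick the next block sequentially or at random.
            block_index = random.randrange(blocks) if random_io else done % blocks
            f.seek(block_index * BLOCK)
            if write:
                f.write(data)
            else:
                f.read(BLOCK)
            done += 1
    elapsed = time.time() - start
    return done * BLOCK / (1024 * 1024) / elapsed   # MB per second

# Create the 100MB test file, then run the four workload types.
with open(FILE, "wb") as f:
    f.write(b"\0" * FILE_SIZE)
for random_io in (False, True):
    for write in (False, True):
        label = ("random " if random_io else "sequential ") + ("write" if write else "read")
        print(f"{label}: {run(random_io, write):.1f} MB/s")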

As you can see in Graph 1, a stripe size of 1024 sectors per stripe is the point where performance starts to dip or level off for all RAID levels and I/O activity types. On the RAID 0 volume, read and write performance are very similar, and random activity performs slightly better than sequential activity. On the RAID 5 volume, writes are far slower than reads, and again random I/O is faster than sequential I/O. On the RAID 10 volume, there is a significant gap between read and write performance: Reads are much faster than writes for both random and sequential I/O.

Analysis
Choosing a RAID level depends on what you are doing and how much money you have to spend. RAID 0 offers high performance at low cost: You can stripe many drives for the best I/O throughput, but you get no redundancy or fault tolerance. (A hardware RAID controller is much faster than NT's software striping.) RAID 5 offers fault tolerance at a cost only slightly higher than RAID 0, because it sacrifices only one drive's worth of capacity to parity data. However, RAID 5 has the slowest I/O: A hardware-accelerated RAID controller can alleviate some of the performance problems (and will be much faster than NT's software RAID 5), but it will still be far slower than RAID 0. RAID 10 offers the best combination of performance and fault tolerance (especially if your system supports hot-swap drives) on the system we used for the NT scalability tests; future tests will verify this finding on other systems. The problem is that RAID 10 is very hardware intensive, requiring multichannel hardware-accelerated controllers and twice as many drives as RAID 0. Because you are building mirrored stripe sets, you don't need a parity drive, and you don't need to duplex controllers (you can, but you'll take a performance hit because NT handles duplexing in software).
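
To make the cost argument concrete, here is a rough sketch of the drive-count arithmetic, again assuming 2.1GB drives and a target of about 8GB of usable space. It is back-of-the-envelope math, not vendor pricing.

import math

DRIVE_GB = 2.1    # drive size from the Tricord test
TARGET_GB = 8.0   # usable capacity we want

data_drives = math.ceil(TARGET_GB / DRIVE_GB)   # drives' worth of actual data

layouts = {
    "RAID 0":  data_drives,        # striping only, no redundancy
    "RAID 5":  data_drives + 1,    # one extra drive's worth of capacity for parity
    "RAID 10": data_drives * 2,    # every data drive is mirrored
}

for level, drives in layouts.items():
    raw = drives * DRIVE_GB
    usable = data_drives * DRIVE_GB
    print(f"{level}: {drives} drives, {raw:.1f}GB raw, "
          f"{usable:.1f}GB usable, {raw - usable:.1f}GB spent on redundancy")

The drive counts match the test volumes described earlier: RAID 10 buys its fault tolerance with a full second set of drives, where RAID 5 pays only one drive's worth of parity.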

RAID 10 is an excellent option for enterprise mission-critical applications where fault tolerance is absolutely necessary, high performance is a must, and money is no object. Most IS shops do have financial constraints, so they should consider smaller RAID 10 volumes for the portions of the system that need both the performance and the fault tolerance (such as data drives) and use plain striping or RAID 5 for everything else.

This consideration brings up the question of the I/O transaction mix. Graph 1 shows the disparities among the types of I/O on the different RAID volumes, so analyze your workload before choosing a RAID level. If you are in a write-intensive environment, do not use RAID 5; RAID 0 (with frequent backups) gives the best write performance, and RAID 10 is a good choice if you also need fault tolerance. A mixed environment runs very well on a RAID 10 volume. A read-intensive environment benefits most from RAID 10, followed by RAID 5 and then RAID 0. Again, the right choice depends on how much money you have, what balance of performance and fault tolerance you need, and what your workload looks like.

We chose RAID 10 for all volumes under test with Exchange/LoadSim on the PowerFrame because we anticipated a very mixed I/O environment (write-intensive and sequential for the log volumes, random reads and writes for the data volumes), and we also gained a high level of fault tolerance in case of problems during a test run. The I/O turned out to be predominantly write-oriented (anywhere from 97% writes and 3% reads when the system had 1024MB of RAM, to a 60/40 write/read split at 128MB), so RAID 10 was definitely the best choice. Plus, price was no object, since we had 60 drives lying around the Lab!