This high-end server promises high availability

The NetFRAME 9008XP system (NF9008XP), a 4-way clustering server from NetFRAME (which Micron recently acquired), incorporates some interesting features that move the system beyond fault tolerance to what NetFRAME calls high availability: no (or minimal) downtime. NetFRAME pursues this goal by making the NF9008XP's SCSI hard disks, power supplies, and PCI cards hot-swappable.

The NF9008XP's four 200MHz Pentium Pro processors (each of which has a 1MB Level 2 cache), four 128MB DIMMs, slots for 12 more DIMMs (for a total of 2GB of RAM), eight 9GB hard disks, and eight PCI expansion slots offer network administrators a variety of possible configurations. The system's Maestro and Maestro Recovery Manager (MRM) programs monitor many of the NF9008XP's major internal devices, including cooling fans, temperature sensors, CPUs, PCI bus slots, hard disks, and power supplies, to orchestrate system performance. Maestro and MRM provide status reports on NF9008XP components in an understandable, graphical format, as Screen 1, page 96, shows. Maestro gives device specifics and includes a field that specifies the physical location of each component.

The MRM option lets you remotely monitor and tweak NF9008XP components and even reset the machine. To set up remote access, you cable a dedicated MRM RJ-45 port on the back of the NF9008XP to a small black box. You then connect the box either to a modem or directly to the serial port on a remote machine running Windows NT Workstation 4.0. Both methods give you identical access and functionality, but the direct connection is significantly faster.

Opening Up the NF9008XP
The NF9008XP is roughly the size of three full towers side by side, and it has many flaps and folds that open systematically, giving computer mechanics easy access to its internal components. Looking down at the top of an open NF9008XP, you see the eight PCI expansion slots.

Each expansion slot has a light-emitting diode (LED), which signals its power status. Plastic separators between the slots prevent you from touching one card to another during removal and installation and accidentally shorting out the system. Slots 1 and 8 house the QLogic SCSI adapter cards that access the hard disks.

Maestro lets you disable, power down, power up, and restart any of the slots. Removing the adapter cards requires some effort. The screws and small metal bands that make up the cards' locks do not come out easily. This design feature prevents the screw and band from falling into the internals of the machine when you hot-swap a card. But you must work the card around a half-inserted screw and protruding band to get it out. This design needs improvement, but with practice, I could switch a card in just over 2 minutes.

The NF9008XP's PCI bus lets you take advantage of NetFRAME's MultiSpan technology: You can bind two or more NICs to a MultiSpan virtual adapter and bind the virtual adapter to your network's communications protocols (including TCP/IP, NetBEUI, IPX, and SPX). When you use this configuration, one NIC handles incoming network traffic, and the NF9008XP load balances outgoing traffic among the cards in the virtual adapter set. MultiSpan thus provides fail-over redundancy for your network adapters while still putting the I/O capacity of the redundant cards in a set to work.
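The virtual-adapter idea can be illustrated with a small model. This is only a sketch, not NetFRAME's implementation: the class names, the round-robin rotation, and the rule that the first healthy card receives traffic are all assumptions made for illustration, because the review does not describe how MultiSpan actually distributes load.

```python
from dataclasses import dataclass, field


@dataclass
class Nic:
    name: str
    healthy: bool = True


@dataclass
class VirtualAdapter:
    """Hypothetical model of a set of NICs bound to one virtual adapter."""
    nics: list
    _turn: int = field(default=0, repr=False)

    def _healthy(self):
        live = [nic for nic in self.nics if nic.healthy]
        if not live:
            raise RuntimeError("no working NIC left in the set")
        return live

    def receiver(self):
        # One NIC handles incoming traffic; here, the first healthy one (assumed).
        return self._healthy()[0]

    def next_sender(self):
        # Assumed policy: rotate outgoing traffic across the healthy NICs.
        live = self._healthy()
        nic = live[self._turn % len(live)]
        self._turn += 1
        return nic


adapter = VirtualAdapter([Nic("nic0"), Nic("nic1"), Nic("nic2")])
before = [adapter.next_sender().name for _ in range(6)]

adapter.nics[0].healthy = False  # simulate losing or pulling the first card
after = [adapter.next_sender().name for _ in range(4)]
```

In this model, outgoing traffic keeps flowing over the remaining cards after one card drops out, and the receiving role moves to the next healthy card.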

When you configure the NF9008XP's PCI bus and adapters, pay attention to the QLogic SCSI adapters' connection to the hard disks. If you create a stripe set rather than two separate buses running mirrored hard disks, attempting to hot-swap either SCSI card will lock up the machine. I encountered this situation in my tests when I removed and reinserted a SCSI adapter: The NF9008XP froze, forcing me to manually reset it.

Fortunately, hot-swapping power supplies does not cause similar problems. With the NF9008XP open, you can easily access the power supplies from the rear of the box. To remove one of the test system's three Cherokee power supplies, I turned a locking screw, unfolded a handle on its side, and simply slid out the unit.

NF9008XP
Contact: NetFRAME, 408-474-1000 or 800-737-8377
Web: http://www.netframe.com
Price: $33,437
System Configuration:
Four 200MHz Pentium Pro processors, with 1MB Level 2 cache each
512MB of RAM
Eight 8.7GB hot-swappable hard disks
Eight hot-swappable PCI slots
Three Cherokee hot-swappable power supplies

Shutting Down the NF9008XP
Maestro monitors temperature sensors and tracks the rotational speed of the NF9008XP's numerous fans (two for the hard disk bays, two for the expansion slots, four for the processors, and one for each power supply) to keep temperature levels throughout the unit within acceptable parameters. When a power supply fan fails, Maestro disables that power supply. When another fan fails or the system overheats, Maestro increases the rotational speed of other fans to compensate. If increasing fan speed doesn't solve the problem, Maestro powers down the machine before excessive heat causes permanent damage.
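The escalation described above can be written down as a small decision function. This is a sketch of the behavior as the review describes it, not Maestro's actual logic; the event names, the `fans_boosted` flag, and the `Action` values are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    DISABLE_POWER_SUPPLY = "disable the affected power supply"
    SPEED_UP_OTHER_FANS = "speed up the other fans"
    POWER_DOWN = "power down the machine"


def choose_action(event: str, fans_boosted: bool = False) -> Action:
    """Pick a response to one monitoring event.

    event        -- "power_supply_fan_failed", "fan_failed", or "overheat"
                    (hypothetical names)
    fans_boosted -- True if the other fans have already been sped up
    """
    if event == "power_supply_fan_failed":
        return Action.DISABLE_POWER_SUPPLY
    if event == "overheat" and fans_boosted:
        # Faster fans did not solve the problem, so shut down before damage.
        return Action.POWER_DOWN
    if event in ("fan_failed", "overheat"):
        return Action.SPEED_UP_OTHER_FANS
    return Action.NONE


first_response = choose_action("overheat")
last_resort = choose_action("overheat", fans_boosted=True)
```

Here `first_response` is the fan speed-up and `last_resort` is the shutdown, which mirrors the order in which the review says Maestro escalates.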

When the Windows NT Magazine Lab received the test unit, a NetFRAME representative demonstrated Maestro's reaction to fan failure. He stuck a small straw into a fan blade and stopped the fan's motor. Taking this demonstration as an endorsement of such abuse, I grabbed a plastic fork and started stopping fans all over the machine.

Maestro has a screen that shows when a particular fan stops functioning properly and other fans speed up to compensate. But I had to look for this information. I expected a warning about the failure to pop up in Maestro, but no such warning appeared.

Fortunately, a small LCD screen on the front of the unit notes fan failures, and the NF9008XP's Simple Network Management Protocol (SNMP) notification feature can send notification messages when the system has a problem. You can specify which functions, failures, or operations will send an SNMP alert. NetFRAME is developing Desktop Management Interface (DMI) support for lower-level hardware failures.

Armed with my fork, I momentarily stopped the fan on each power supply. As each fan stopped, Maestro showed that its power supply was offline. Each fan resumed spinning, but Maestro didn't indicate that its power supply had come back online. Stabbing the third power supply's fan caused a total system failure. I removed and reinserted all three power supplies and rebooted the machine, but I could not get it back online.

I continued my testing on a second NF9008XP. This time, I created a 45GB RAID 5 stripe set using the NT system tools. The machine configured and formatted the hard disks in roughly 4 hours. After killing one NF9008XP, I chose not to experiment with this unit's power supply fans. Still, I found that powering down the unit and trying to restart it produced the same results as my fork experiments: no power to the PCI bus.

In lieu of running away and hiding, I did everything I could think of to get the second machine back online, with no results at first. Two days later, I again tried to bring the NF9008XP back to life. This time I pushed really hard on a black button on the back of the machine. The button, identified with only a circle and a horizontal line inside it, did the trick. The PCI bus blinked back to life, the system found its hard disks, and the NF9008XP booted up.

Testing the NF9008XP
I finally tested file and print services performance by comparing the NF9008XP with a brand-name control server that has performed well in recent Lab tests. The control server has four 200MHz Pentium Pro processors, 512MB of RAM, four 10/100Mbps Fast Ethernet PCI network cards, and four 2GB SCSI hard disks. For my tests, both servers were running Windows NT Server 4.0 with Service Pack 3 (SP3).

To test the two systems, I ran Bluecurve's Dynameasure/File Services 1.5's Copy All Bidirectional tests. (For information about Dynameasure, see Carlos Bernal, "Dynameasure Enterprise 1.5," September 1997.) These tests process, in random order, 16 transactions that copy compressed data, uncompressed data, binary files, text files, and image files between the server and clients. The test files range in size from 500KB to 5MB. The test specifications called for six steps, starting with 10 motors (simulated users) at step 1 and increasing that number at each step to 100 motors at step 6. I ran the tests on the Lab's standard configuration: a set of client machines on a 100Mbps Ethernet network simulating the workload of multiple users. (For more information about the Lab's benchmarking network, see "The Lab's Test Environment," page 96.)

Graph 1 shows the two systems' throughput at each step. Throughput measures system capacity in terms of the number of bytes that all the motors copy during the measurement phase, divided by the elapsed time of the measurement phase. The NF9008XP reached its maximum throughput of 4.43MB per second (MBps) in step 2. The control server reached its maximum throughput of 4.28MBps in step 3.
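The throughput definition above reduces to one division. The sketch below assumes that 1MB means 1,048,576 bytes, which the review does not state, and the byte count and duration in the example are made-up values for illustration, not measurements from these tests.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def throughput_mbps(bytes_copied: int, elapsed_seconds: float) -> float:
    """Total bytes copied by all motors divided by the measurement time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_copied / elapsed_seconds / BYTES_PER_MB


# Illustrative only: 600MB copied during a 120-second measurement phase.
example = throughput_mbps(600 * BYTES_PER_MB, 120.0)
```

With these illustrative inputs, `example` works out to 5.0MBps.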

Graph 2 shows the systems' average response times, which measure the average time each system takes to read a file and copy it to another disk. The NF9008XP was faster than the control server for low numbers of motors. At step 2, with 20 motors, the NF9008XP had an average response time of 2.52 seconds, and the control server had an average response time of 7.78 seconds. After step 2, the NF9008XP's performance degraded rapidly. At step 3, with 39 motors, the control server had an average response time of 9.84 seconds, and the NF9008XP had an average response time of 17.13 seconds. After step 3, both systems' performance degraded, but the NF9008XP's performance degraded much more quickly than the control server's performance.
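A little arithmetic on the figures quoted above makes the reversal between steps 2 and 3 concrete. The sketch uses only the four response times given in the text; the dictionary layout and function name are illustrative.

```python
# Average response times in seconds, as quoted in the text for steps 2 and 3.
response_times = {
    "NF9008XP": {2: 2.52, 3: 17.13},
    "control": {2: 7.78, 3: 9.84},
}


def slowdown(server: str, from_step: int, to_step: int) -> float:
    """How many times longer the average response took at the later step."""
    times = response_times[server]
    return times[to_step] / times[from_step]


nf_slowdown = slowdown("NF9008XP", 2, 3)      # roughly 6.8 times slower
control_slowdown = slowdown("control", 2, 3)  # roughly 1.3 times slower

# Head-to-head ratios at each step.
step2_advantage = response_times["control"][2] / response_times["NF9008XP"][2]
step3_deficit = response_times["NF9008XP"][3] / response_times["control"][3]
```

At step 2 the NF9008XP responded about three times faster than the control server; at step 3 it took about 1.7 times as long.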

Returning the NF9008XP
Back at NetFRAME's laboratories, the first machine came back to life. A NetFRAME engineer told me that nothing was wrong with it. I have a feeling that the problem was related to the unmarked black button.

I obviously found some less-than-desirable traits in the NF9008XP. But systems administrators aren't likely to poke their servers with eating utensils. If you need access to mission-critical information in seconds, 24 hours a day, 7 days a week, and if no more than 30 users will generate transactions simultaneously, then the NF9008XP might be a good deal for you. If you require access for more than 30 users at a time, keep shopping.