Will this chipset bring an end to NT's scalability problem?

Many companies use Intel processors to run Windows NT Server. Choosing Intel processors confers several advantages: For example, the hardware is cheaper, you have more hardware vendors to choose from, and you have a greater choice of applications to support your business. As NT Server acceptance continues to grow, many companies are deploying enterprise applications such as data mining, enterprise resource planning (ERP), and terminal servers on NT Server running on Intel-based SMP servers. At the same time, these companies demand higher levels of scalability from their systems to gain improved performance. The 4-way SMP server can no longer meet these needs. In response to this situation, Intel has developed a new industry-standard SMP system that the company calls Profusion. Profusion offers a standardized method of placing eight processors in an Intel-based server. Many 4-way and greater than 4-way machines are available, but they use proprietary system architectures. A truly scalable 8-way SMP server helps NT applications such as SAP and Microsoft SQL Server 7.0 achieve higher performance.

Profusion's architecture is unique in its support for eight Intel processors. Understanding Profusion architecture before you evaluate and buy a new 8-way SMP server is important. In this article, I dig into the Profusion architecture, take a look at each of its components, and help you better understand its scalability.

More than Four Is Hard
Before I introduce you to the Profusion architecture, let's take a quick look at the traditional Intel-based 4-way SMP server architecture. I want to show you why having more than four processors in the same server without an architecture redesign is difficult.

Figure 1 shows the traditional 4-way SMP server architecture on Intel Pentium Pro or Xeon processors. This architecture consists of as many as four processors; one or more PCI bridges, each of which supports as many as four PCI devices; and a memory controller for memory access. A processor bus, which many vendors call the system bus, connects the processors, PCI bridges, and memory controller. This architecture supports as many as four processors well but is hard to scale beyond four, for three reasons. The first limitation is electrical. For example, a 100MHz system bus in a Pentium II or Pentium III Xeon server supports a maximum of only five loads. A higher bus clock speed requires a shorter bus length, which restricts the number of components you can connect to the bus. The second limitation is bandwidth. A traditional SMP server has approximately 400MBps of bandwidth on its system bus, which is acceptable for four processors but is difficult to stretch across more than four. The third limitation is logical design. Traditional SMP architecture uses a two-bit code to identify processors, so one bus can address no more than four. In addition, Pentium Pro and later SMP systems use a pipelined transaction protocol, which lets the system bus begin handling new data requests from other processors before it finishes delivering data for the first processor's request. Additional processors on one bus bring more simultaneous data requests to the bus, which complicates the coherency tracking that keeps the system free of errors.
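Two of these limits reduce to simple arithmetic. The following Python sketch uses only the figures quoted above (a two-bit processor code and roughly 400MBps of bus bandwidth). The function names are illustrative, and the even split of bandwidth among processors is a simplifying assumption rather than a measurement.

```python
# Back-of-the-envelope limits of the traditional single-bus design.
# The figures come from the text; the even split of bandwidth among
# processors is an idealization used only for illustration.

AGENT_ID_BITS = 2          # two-bit processor identifier on the bus
BUS_BANDWIDTH_MBPS = 400   # approximate system-bus bandwidth cited in the text


def max_processors(id_bits: int) -> int:
    """Number of distinct processors that an id_bits-wide code can name."""
    return 2 ** id_bits


def share_per_processor(bandwidth_mbps: float, processors: int) -> float:
    """Bandwidth available to each processor if the bus were shared evenly."""
    return bandwidth_mbps / processors


if __name__ == "__main__":
    print("addressable processors:", max_processors(AGENT_ID_BITS))
    for count in (4, 8):
        share = share_per_processor(BUS_BANDWIDTH_MBPS, count)
        print(f"{count} processors -> {share} MBps each")
```

Under these assumptions, doubling the processor count on one bus halves each processor's share of bandwidth, which is why simply adding processors to the existing bus is unattractive.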

Because of these limitations, successfully integrating more than four Intel processors in an SMP system requires redesigning SMP architecture. Researchers have proposed several approaches to overcome the limitations, such as using existing cluster technology, using Non-Uniform Memory Access (NUMA), and expanding the number of system buses. (The industry rejected two other bus-based solutions—dual-ported memory and chained buses—as too inefficient to scale.) Cluster technology can place two or more 4-way SMP servers into a cluster. Although this solution scales well and supports fault tolerance and load balancing, it requires redesigning OSs and applications. NUMA, which vendors use in high-end UNIX systems, lets you add multiple blocks of four processors to a system (e.g., 8 blocks for a 32-way system). This solution also requires rewriting OSs and applications. However, increasing the number of system buses doesn't require changing OSs and applications. An intuitive approach with system buses is to build a hierarchy of buses in one system. Hierarchical bus architecture, which Figure 2 shows, adds a common processor bus to the traditional SMP architecture and attaches two 4-way subsystems to the common processor bus through a third-level cache. This L3 cache filters memory access from the processors before the access operation reaches the common processor bus. Although this architecture scales to eight (and potentially more) processors, it involves high overhead in cache use and affords slow access to memory because of multiple memory layers.

A New Fusion of Buses
In 1996, Corollary, a company that develops multiprocessing products, introduced Profusion, the company's 8-way Intel-based SMP architecture. (Intel acquired Corollary in 1997.) Profusion is a bus-based solution that differs from conventional bus-based solutions. Profusion creates a crossbar switch to interconnect all subsystems and lets the subsystems access shared memory independently at high speed. Network-design aficionados will recognize the technique: The switching fabric of high-speed network switches works the same way.

Figure 3 illustrates Profusion's architecture. The core of this architecture is the five-port crossbar switch, which Intel calls the Profusion chipset. This crossbar switch builds an 8-way SMP system by linking one processor bus that is attached to four processors, a second processor bus that is attached to four processors, one I/O system bus, and two memory banks. The Profusion chipset also uses two cache coherency filters to maintain data coherency. Let's delve into the Profusion architecture.

System buses, processors, and bridges. In addition to two processor buses, the Profusion architecture includes a dedicated I/O bus for I/O traffic. The dedicated I/O bus improves system performance and reliability. The bus can communicate directly with memory while keeping processor overhead and processor interruptions to a minimum. In addition, the bus can isolate processors from misbehaving I/O subsystems to reduce system failures.

Profusion's three buses can run at 100MHz or 133MHz and carry five loads. Each processor bus carries one load for each of four processors and one load for the Profusion chipset. The I/O bus carries one load for each of the four PCI bridges and one load for the Profusion chipset.

Two processor types can work with the Profusion chipset: 400MHz and 450MHz Pentium II Xeon processors and 500MHz and faster Pentium III Xeon processors with as much as 2MB of L2 cache. Most high-end server vendors will likely use Pentium III Xeon processors to fully take advantage of the performance gains that the Profusion architecture can support. Some vendors, such as Compaq, have incorporated 64-bit system buses and 64-bit 66MHz PCI buses under PCI bridges into their 8-way servers to support the new PCI specification 2.1. These vendors' PCI bridges also support hot-plug technology, which lets you install and replace PCI devices without shutting the system down. Each PCI bridge can connect as many as four PCI devices.

Memory. The Profusion architecture contains two memory banks. The memory banks share the same address space but are cache-line interleaved, which means that one memory bank contains the even-numbered cache lines (i.e., cache-line addresses 0, 2, 4, and so on) and the other memory bank contains the odd-numbered cache lines (i.e., cache-line addresses 1, 3, 5, and so on). The banks use a uniform memory access method, which gives every processor equal access to both banks. If accesses are randomly distributed, about half of them go to the even-numbered bank and the other half go to the odd-numbered bank. Thus, the two banks can serve requests in parallel, and memory bandwidth can approach twice that of one bank.
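The interleaving rule can be sketched in a few lines of Python. The article numbers cache lines directly, so `bank_for_line` takes a cache-line number. The helper that converts a byte address to a line number, and its 32-byte line size, are assumptions added for illustration and are not stated in the article.

```python
import random

ASSUMED_LINE_BYTES = 32  # assumption: cache-line size, not given in the article


def line_of_byte_address(byte_address: int, line_bytes: int = ASSUMED_LINE_BYTES) -> int:
    """Convert a byte address to a cache-line number (hypothetical helper)."""
    return byte_address // line_bytes


def bank_for_line(line_number: int) -> int:
    """Return 0 for the even-numbered bank and 1 for the odd-numbered bank."""
    return line_number % 2


if __name__ == "__main__":
    # Random accesses should split roughly evenly between the two banks.
    random.seed(0)
    counts = [0, 0]
    for _ in range(100_000):
        counts[bank_for_line(random.randrange(1 << 20))] += 1
    print("accesses per bank:", counts)
```

Because consecutive cache lines alternate between banks, a sequential scan of memory also keeps both banks busy, not just a random access pattern.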

Each of Profusion's memory banks can contain as much as 16GB of memory—usually Synchronous DRAM (SDRAM). Therefore, a Profusion server can contain a total of 32GB of memory. Windows 2000 Datacenter Server (Datacenter), which will support as much as 64GB of memory access, can fully take advantage of this scalability. However, a vendor's Profusion server implementation might not support as much as 32GB of memory. For example, currently you can expand Compaq's 8-way servers, the ProLiant 8000 and ProLiant 8500, to only 8GB of memory. You need to check your vendor's implementation carefully when you buy an 8-way server.

Cache coherency filter. Intel-based SMP servers share data among processors and I/O subsystems. Copies of the data at one memory address can therefore exist in several processor caches and in memory at the same time, and those copies can become inconsistent. Such inconsistency causes system operation errors. A well-designed SMP system maintains a consistent view of memory, or data coherency. If one or more processors cache copies of data with the same memory address, the system needs to synchronize the cached copies and the corresponding data in memory. Intel processors use a snooping protocol for data coherency. That is, when a processor bus handles a memory transaction, the bus asks its other processors and the processors on the remote processor bus whether they cache the data at that memory address. If one or more processors do, the system synchronizes the cached copies with the data in memory and uses the most recent data. However, snooping consumes extra processor cycles and bus bandwidth and increases traffic between the processor buses.

The Profusion architecture uses two cache coherency filters to help reduce snooping on the remote processor bus. The cache coherency filter on the left in Figure 3 keeps the addresses of the data that resides in the L2 caches of the processors on the processor bus on the left. The cache coherency filter on the right keeps the addresses of the data that resides in the L2 processor caches of the processors on the processor bus on the right. When a memory transaction occurs, the local processor bus checks the remote cache coherency filter first, rather than snooping the remote processor bus to determine whether to maintain data coherency. The local processor bus will snoop the other processor bus only when the local bus finds that the remote cache coherency filter contains the same address of the data in the memory transaction. Otherwise, the snoop remains on the local bus. Some vendors refer to cache coherency filters as cache accelerators because of the performance improvement the filters make possible.
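The filtering decision can be modeled with a small Python sketch. The `CoherencyFilter` class and its method names are hypothetical. The real filters are hardware structures, and this model captures only the rule described above: snoop the remote bus only when the remote filter lists the address.

```python
class CoherencyFilter:
    """Hypothetical model of one filter: the set of cache-line addresses
    that reside in the L2 caches of the processors on one processor bus."""

    def __init__(self):
        self._lines = set()

    def record_fill(self, line):
        """Note that a processor on this bus has cached the line."""
        self._lines.add(line)

    def record_evict(self, line):
        """Note that the line has left the caches on this bus."""
        self._lines.discard(line)

    def may_hold(self, line):
        """Report whether a cache on this bus holds the line."""
        return line in self._lines


def needs_remote_snoop(line, remote_filter):
    """Snoop the remote bus only if the remote filter lists the line."""
    return remote_filter.may_hold(line)


def count_remote_snoops(lines, remote_filter):
    """Count how many of the requested lines require a remote snoop."""
    return sum(needs_remote_snoop(line, remote_filter) for line in lines)


if __name__ == "__main__":
    right_filter = CoherencyFilter()
    for cached_line in (10, 11, 12):
        right_filter.record_fill(cached_line)

    requests_from_left_bus = [1, 10, 2, 3, 12, 4]
    snoops = count_remote_snoops(requests_from_left_bus, right_filter)
    print(f"{snoops} of {len(requests_from_left_bus)} requests snoop the remote bus")
```

In this toy run, only the requests for lines 10 and 12 cross to the remote bus. Without the filter, all six requests would generate remote snoop traffic.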

The Profusion chipset. The Profusion chipset, a five-port crossbar switch, is the central component in the Profusion architecture. The chipset's five ports connect to one another, as Figure 3 shows. The five ports comprise two processor bus interfaces, two memory interfaces, and one I/O bus interface. All 5 ports are bidirectional; 10 unidirectional ports in a 64-bit static RAM (SRAM) implement the 5 bidirectional ports. Ten bidirectional direct paths [(5 ports * 4 paths per port) ÷ 2, because each path is counted once at each of its two ends] exist between the processor buses, memory banks, and I/O bus. Therefore, read and write operations can occur simultaneously between the processor buses, memory, and I/O bus. This one-hop simultaneous communication can tremendously reduce access latency. Theoretically, a 100MHz bus (the minimum speed of a system bus in a Profusion server) can provide a peak throughput of 800MBps. With this throughput, the Profusion crossbar switch can allow a peak throughput of 4GBps (800MBps * 5 ports).
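The path count and the peak-throughput figures can be checked with a short Python sketch. The port names are illustrative labels. The 8-byte (64-bit) transfer width is the width implied by 800MBps at 100MHz, and the results are theoretical peaks rather than sustained rates.

```python
from itertools import combinations

# Illustrative labels for the five crossbar ports described in the text.
PORTS = ["cpu_bus_left", "cpu_bus_right", "memory_even", "memory_odd", "io_bus"]

BUS_CLOCK_MHZ = 100   # minimum system-bus speed in a Profusion server
BUS_WIDTH_BYTES = 8   # 64-bit data path, implied by 800MBps at 100MHz


def direct_paths(ports):
    """Every unordered pair of ports is one bidirectional direct path."""
    return list(combinations(ports, 2))


def peak_port_mbps(clock_mhz, width_bytes):
    """Theoretical peak of one port: one transfer of width_bytes per clock."""
    return clock_mhz * width_bytes


if __name__ == "__main__":
    paths = direct_paths(PORTS)
    per_port = peak_port_mbps(BUS_CLOCK_MHZ, BUS_WIDTH_BYTES)
    print("direct paths:", len(paths))
    print("peak per port (MBps):", per_port)
    print("aggregate peak (MBps):", per_port * len(PORTS))
```

Counting unordered pairs gives the same result as the formula in the text: each of the 5 ports has 4 paths, and dividing by 2 removes the double counting.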

The Profusion chipset consists of two physical application-specific integrated circuit (ASIC) partitions: a memory access controller (MAC) and a data interface buffer (DIB), as Figure 4 shows. The MAC transfers address and control information between the processor buses, I/O bus, and memory, and manages the two cache coherency filters. The DIB transfers and buffers data between the processor buses, I/O bus, and memory.

Here's a simple example to illustrate MAC and DIB functions. Suppose a processor needs to retrieve the data that memory address 100 contains. The processor's local processor bus sends address 100 and the read-memory control information from the processor to the MAC. The local processor bus uses the MAC to check the cache coherency filter of the remote processor bus to see whether any processor cache on the remote processor bus is holding a copy of the data at address 100. If no processor on the remote processor bus caches that data, the MAC forwards the address and read control information to the even-numbered memory bank. If a processor on the remote processor bus caches the data, the local processor bus will snoop the remote processor bus to make sure that the system synchronizes the data copies in the memory and the cache. Then, memory returns the data to the DIB. The DIB checks to see whether the local processor bus is busy handling other requests. If the local processor bus is busy, the DIB keeps the data in its buffer and forwards the data to the local bus when the bus is free.
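The walkthrough above can be traced step by step in Python. The function name, its parameters, and the step strings are hypothetical. The sketch also assumes that memory still supplies the data after a remote snoop, which is one reading of the description and not a confirmed detail of the chipset.

```python
def trace_read(line, remote_filter_lines, local_bus_busy):
    """Return the ordered steps for reading cache line `line`.

    remote_filter_lines: set of lines listed in the remote coherency filter.
    local_bus_busy: whether the local processor bus is busy when data returns.
    """
    steps = [
        "local processor bus -> MAC: address and read control",
        "MAC: check remote cache coherency filter",
    ]
    if line in remote_filter_lines:
        # The filter lists the line, so the remote bus must be snooped.
        steps.append("snoop remote processor bus and synchronize copies")

    bank = "even" if line % 2 == 0 else "odd"
    steps.append(f"MAC -> {bank}-numbered memory bank: address and read control")
    steps.append("memory -> DIB: data")

    if local_bus_busy:
        steps.append("DIB: hold data in buffer until local bus is free")
    steps.append("DIB -> local processor bus: data")
    return steps


if __name__ == "__main__":
    for step in trace_read(100, remote_filter_lines=set(), local_bus_busy=True):
        print(step)
```

Separating the address path (MAC) from the data path (DIB) lets the chipset accept new addresses while earlier data is still waiting in the buffer.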

8-Way to 2000
Server hardware vendors can use the Profusion chipset and architecture to build a new generation of Intel SMP servers that scale to as many as eight processors. Leading server vendors are delivering Profusion-based 8-way SMP servers, so you can run your performance-sensitive enterprise applications on these new faster servers. The Profusion architecture, in combination with Windows 2000 Advanced Server (Win2K AS), which works with as many as 8 processors and 8GB of memory, and Datacenter, which supports as many as 32 processors and 64GB of memory, will take your enterprise NT applications to new heights of performance and scalability in 2000.

Corrections to this Article:
  • "Profusion Architecture" doesn't mention Compaq as a codeveloper of the Profusion architecture. Corollary, a wholly owned subsidiary of Intel, and Compaq codeveloped the Profusion 8-way system architecture. We apologize for any inconvenience this error might have caused.