NVIDIA's latest Graphics Processing Unit (GPU), the GeForce4, is a technical marvel featuring 63 million transistors, a full 20 percent more than Intel's Pentium 4 CPU. NVIDIA builds the GeForce4 on a 0.15-micron process (the same process size used for the GeForce3, and a step down from the GeForce2's 0.18-micron fabrication), which lets the new GPU hit higher clock speeds, consume less power, and generate less heat. But the new chip's features, including Lightspeed Memory Architecture II, Accuview High Resolution Anti-Aliasing (HRAA), and DirectX 8.0 support, are what make the chip ready for the next generation of games.

Lightspeed Memory Architecture II


One problem that has always affected 3-D accelerators is a lack of available memory bandwidth, which causes real-world performance to fall far short of the vendor's published fill-rate specifications. Although vendors usually list a high throughput rating (e.g., most GeForce2 cards claimed 5.3GBps), these numbers represent the graphics card's peak performance. So, although many graphics cards can push multiple gigabytes of data per second on paper, you won't see anything approaching these speeds because of memory latency issues inherent to 3-D rendering. The more pixels a 3-D card has to render, the more latency it incurs in reading the data.
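To see where a peak figure such as 5.3GBps comes from, here is a back-of-the-envelope sketch. It assumes a 128-bit memory bus and 166MHz DDR memory (two transfers per clock); the clock speed is an assumption for illustration and is not stated in this article, and the function name is hypothetical.

```python
def peak_bandwidth_bytes(bus_width_bits, clock_hz, transfers_per_clock):
    """Theoretical peak: bytes per transfer x clock x transfers per clock."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_hz * transfers_per_clock


# Assumed figures: 128-bit bus, 166MHz memory clock, DDR (2 transfers/clock).
geforce2_peak = peak_bandwidth_bytes(128, 166_000_000, 2)
print(geforce2_peak / 1e9)  # about 5.3 (GB per second)
```

The result is a ceiling that assumes every clock cycle carries a full, useful transfer, which is exactly what latency prevents in practice.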

Previous GeForce GPUs used a 128-bit interface that let the controller transfer data in 128-bit pieces. Although a 128-bit interface is fast, it's not very efficient. For example, if the controller needs to send a 32-bit piece of data, it must still use a full 128-bit transfer, which wastes 96 bits of that clock cycle. The advent of Double Data Rate (DDR) memory further complicates matters because you're effectively looking at a 256-bit interface.

With the GeForce3, NVIDIA implemented a new memory architecture called the Crossbar Memory Controller. This new design uses four 32-bit (effectively 64-bit, because the GeForce3 reference design calls for DDR memory) load-balanced memory subcontrollers that interact with each other to improve efficiency. Each subcontroller can have its own open pages to store data. By using four independent memory subcontrollers, NVIDIA cut latency by 75 percent.
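The efficiency argument comes down to transfer granularity, and a few lines of Python make the arithmetic concrete. This is a deliberately simplified model, not NVIDIA's controller logic: it ignores DDR, page opens, and load balancing, and the function names are hypothetical.

```python
import math


def transfer_cost(request_bits, granule_bits):
    """Return (bits_moved, bits_wasted) for one request on an interface
    that can only move data in whole granules of `granule_bits`."""
    granules = math.ceil(request_bits / granule_bits)
    moved = granules * granule_bits
    return moved, moved - request_bits


def efficiency(request_bits, granule_bits):
    """Fraction of the moved bits that the requester actually wanted."""
    moved, _ = transfer_cost(request_bits, granule_bits)
    return request_bits / moved


print(transfer_cost(32, 128))  # (128, 96): single 128-bit interface
print(transfer_cost(32, 32))   # (32, 0): one 32-bit subcontroller
print(efficiency(32, 128))     # 0.25
```

In this model, a 32-bit request uses only a quarter of a 128-bit transfer, whereas a 32-bit subcontroller wastes nothing and leaves the other three subcontrollers free for other requests.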

In designing the GeForce4, NVIDIA went a step beyond the original Lightspeed Memory Architecture by essentially tuning every design component. Lightspeed Memory Architecture II's Crossbar Memory Controller uses the same partitioning scheme to separate data among the four memory subcontrollers. The enhancements come in the GPU's load-balancing algorithms, which make more efficient use of bandwidth. This advancement, above all else, is what lets the GeForce4 handle FSAA with no discernible performance hit.

The other component of NVIDIA's Lightspeed Memory Architecture is the GeForce4's support for Z-occlusion detection and culling (i.e., Hidden Surface Removal). When displaying a scene, a conventional 3-D card renders the entire scene, regardless of whether everything in that scene is visible to the player. So, if you're running down a hall and 10 monsters suddenly spawn into the room, the GPU has to render all 10 models and the background in which they're hiding. Taking a page from ATI Technologies, NVIDIA uses a Z-occlusion culling routine on the GeForce4 to determine whether a pixel is hiding behind another pixel. If so, the GPU discards the pixel from the pipeline to save bandwidth for what the player sees on screen.
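The idea behind occlusion culling can be sketched as a depth test that runs before any expensive per-pixel work. The following is a minimal software illustration, not the GeForce4's hardware routine; the function name and the (x, y, depth, color) fragment layout are assumptions, and a smaller depth value means "closer to the viewer."

```python
def render_with_z_cull(fragments, width, height):
    """Keep only the nearest fragment per pixel, discarding hidden ones
    before they would be shaded. Returns (color_buffer, shaded, culled)."""
    far = float("inf")
    depth = [[far] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    shaded = 0
    culled = 0
    for x, y, z, c in fragments:
        if z >= depth[y][x]:
            culled += 1  # hidden behind an earlier, nearer fragment
            continue
        depth[y][x] = z  # new nearest surface at this pixel
        color[y][x] = c
        shaded += 1
    return color, shaded, culled


frags = [(0, 0, 1.0, "wall"), (0, 0, 5.0, "monster"), (1, 0, 5.0, "monster")]
print(render_with_z_cull(frags, 2, 1))  # ([['wall', 'monster']], 2, 1)
```

Note that this simple test only saves work when nearer surfaces arrive first; a fragment drawn before the surface that eventually covers it is still shaded.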

Because the AGP bus is slower than local memory, the GeForce4 uses several caches to prevent data from having to be sent over the bus repeatedly. Like the GeForce3, the GeForce4 includes Dual Texture Caches, but NVIDIA has also built in a Pixel Cache, a Vertex Cache (used to store vertices that are sent across the AGP bus), and a Primitive Cache (used to assemble vertices before setting up triangles).

Finally, the GeForce4 includes the same lossless Z-buffer compression found in the GeForce3. New to the GeForce4 is a more effective 4:1 compression ratio that reduces the amount of Z-buffer data the GPU has to move to and from memory.

Accuview


With digital graphics, the GPU strategically places several pixels together to resemble an object, and at low screen resolutions, a simple diagonal line will look jagged. Anti-aliasing is a technique used to smooth these jagged edges by blending the colors of neighboring pixels. By default, Windows uses anti-aliasing to enhance the look of screen fonts by adding information to the image. Because Windows does this anti-aliasing on an object-by-object basis, you don't notice much of a performance hit.

With 3-D scenes, however, GPUs don't have the luxury of selecting which polygons to smooth out and which to leave alone. As such, GPUs use a technique called Full Scene Anti-Aliasing (FSAA) to smooth every polygon on screen. The GeForce2 chipset used a method called supersampling to perform FSAA: The GPU rendered the scene internally at a high resolution, then sampled the scene down to the screen resolution. The problem with supersampling is that it requires an enormous amount of fill rate from the graphics card, causing a huge performance hit and forcing the user to choose between high frame rates with jagged edges or low frame rates with smooth visual quality.
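Supersampling is easy to sketch in software: render big, then average blocks of pixels down to the target size. The example below uses a plain box filter on a grayscale image stored as a list of rows; the filter choice and the function name are assumptions for illustration, not a description of the GeForce2's exact hardware filter.

```python
def downsample(hi_res, factor):
    """Average each factor x factor block of the high-resolution image
    into one output pixel (a simple box filter)."""
    height = len(hi_res) // factor
    width = len(hi_res[0]) // factor
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in range(factor):
                for dx in range(factor):
                    total += hi_res[y * factor + dy][x * factor + dx]
            row.append(total / (factor * factor))
        out.append(row)
    return out


# A hard diagonal edge rendered at 4x4, shown on a 2x2 "screen".
edge = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(downsample(edge, 2))  # [[0.75, 0.0], [1.0, 0.75]]
```

The pixels along the edge come out as intermediate shades (0.75) rather than pure black or white, which is the smoothing effect. The cost is also visible: the GPU had to fill 16 pixels to display 4.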

The GeForce4 takes a different approach to FSAA: Accuview. By using multisampling, the GeForce4 takes multiple samples of a frame, filters the samples, and combines the subpixels from each sample to create a smooth final image. The GeForce4 supports 2X HRAA and 4X HRAA, with the latter offering the best image quality at the highest performance cost.

To strike a balance between performance and visual quality, the GeForce4 includes a special anti-aliasing mode called Quincunx (which, apparently, is the name of the five-dot pattern on a six-sided die). Quincunx takes only two samples per pixel and then blends five subpixel samples, some borrowed from neighboring pixels, to produce each final pixel. The end result is an image that is almost as pristine as 4X FSAA, while only incurring the performance hit associated with 2X FSAA. Of course, your mileage may vary depending on the game you're playing, but trust me, any form of anti-aliasing is better than dealing with the dreaded "jaggies." And with the GeForce4's more efficient memory architecture, you can now enable FSAA in any game without seeing much of a performance hit.
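As a rough illustration of a five-sample blend in the quincunx pattern, the sketch below weights a center sample and four corner samples. The weights used here (1/2 for the center, 1/8 for each corner) are the commonly quoted ones and are an assumption; this article does not give them, and the function name is hypothetical.

```python
CENTER_WEIGHT = 0.5   # assumed weight for the center sample
CORNER_WEIGHT = 0.125  # assumed weight for each of the four corner samples


def quincunx_blend(center, corners):
    """Blend one center sample with four corner samples into a final
    pixel value using the assumed quincunx weights."""
    if len(corners) != 4:
        raise ValueError("expected exactly four corner samples")
    return CENTER_WEIGHT * center + CORNER_WEIGHT * sum(corners)


print(quincunx_blend(1.0, [0.0, 0.0, 0.0, 0.0]))  # 0.5
print(quincunx_blend(1.0, [1.0, 1.0, 1.0, 1.0]))  # 1.0
```

Because the weights sum to 1, flat areas keep their original brightness, while a pixel that differs from its surroundings is pulled toward them, softening the edge.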

DirectX 8.0


The promise of DirectX 8.0 is improved visual quality. By implementing features such as programmable shaders, DirectX 8.0 serves as a conduit that developers can use to make their titles look as stunning as possible. Because most of today's PC games are based on Direct3D, having a graphics card that can take advantage of these features is becoming increasingly important; otherwise, developers are stuck emulating these functions in software.

For example, programmable shaders are what make the GeForce4 a true next-generation 3-D GPU. Historically, 3-D accelerators have forced developers to send data through an inflexible pipeline. This coercive approach limits the types of effects that game developers can use. By letting these developers create and implement their own effects, the GeForce4 realizes its full potential by delivering effects that are worlds beyond what was previously available with legacy GPUs.

The core of the GeForce4 is its nFiniteFX II Engine, which consists of a programmable vertex processor and a programmable pixel processor. Because 3-D objects consist of multiple triangles, the vertex processor carries and computes information on the vertices of each object. The pixel processor generates the pixels used to represent the 3-D objects on screen. Combined with DirectX 8.0, the two processors let the GeForce4 compute almost any type of effect the developer desires.

Now, this feature is nothing new if you already own a GeForce3. However, what the GeForce4 brings to the table is a second vertex shader that works in parallel with the original unit. Just like the XGPU found in Microsoft's Xbox, the two vertex shaders are completely multithreaded, which lets them process more than three times as many vertices as the high-end GeForce3 Ti 500.