The Fermi Architecture

The GeForce GTX 400 family of GPUs is based on NVIDIA’s Fermi architecture. The Fermi architecture is the most significant leap forward in GPU architecture since the original G80. G80 was our initial vision of what a unified graphics and compute processor should look like. GT200 extended the performance and functionality of G80. With Fermi, we have taken all we have learned from the two prior processors, analyzed the various applications that were written for them, and developed a completely new architecture optimized for next-generation games and applications.

Parallel Tessellation Engines

Traditional GPU designs use a single geometry engine to perform tessellation. This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading. Having observed how pixel pipelines grew from a single unit to many parallel units, and the impact that growth had on 3D realism, we designed our tessellation architecture to be parallel from day one.

Fermi GPUs implement up to fifteen parallel tessellation units, each with its own dedicated shading resources. Up to four parallel Raster Engines transform newly tessellated triangles into a fine stream of pixels for shading. The close coupling of tessellation, shading, and raster units provides enormous on-chip bandwidth and high execution efficiency. The result is a breakthrough in tessellation performance at over 1.6 billion triangles per second. Compared to competing products, Fermi GPUs are 2x to 8x faster as measured by independent reviews using Microsoft’s DirectX 11 software development kit.

Superscalar Streaming Multiprocessor

The Streaming Multiprocessor (SM) is the heart of the GPU. It performs vital functions such as pixel shading, tessellation, and physics and compute calculations. The GeForce GTX 460 SM is a highly parallel processor that employs superscalar execution for optimal performance.

Superscalar execution is a technique that allows sequential instructions from a program to be executed in parallel. Unlike thread-level parallelism, which improves throughput, superscalar execution also improves latency, since the same program executes in less time.

The GeForce GTX 460 introduces a mode of superscalar execution that balances performance with efficiency. Wide superscalar designs have high peak performance that is often not realized in real-world applications. Based on thorough analysis of shaders used in modern games, NVIDIA engineers determined that the number of parallelizable instructions is approximately two. With this in mind, we designed the GTX 460 SM to process two instructions per thread, per clock. Since two warps (groups of 32 threads) execute concurrently, a peak of four instructions per clock is realized. This architecture provides an optimum balance between the benefits of scalar efficiency and superscalar performance.
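The trade-off described above can be illustrated with a small toy model. The sketch below is not a description of the actual GTX 460 scheduler: the function `cycles_to_issue`, the six-instruction `program`, and the one-cycle latency are all hypothetical assumptions chosen for illustration. It counts how many cycles an in-order machine needs to issue a short instruction sequence at different issue widths, then reproduces the peak issue arithmetic from the text (two warps times two instructions each).

```python
from typing import List, Tuple

# Each instruction is (destination register, tuple of source registers).
Instr = Tuple[str, Tuple[str, ...]]


def cycles_to_issue(program: List[Instr], width: int) -> int:
    """Count cycles for an in-order machine issuing up to `width` instructions per cycle.

    Toy assumptions (not taken from the source): every instruction has a
    one-cycle latency, and an instruction cannot issue in the same cycle as an
    earlier instruction whose result it reads or whose destination it overwrites.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if dest in written or any(s in written for s in srcs):
                break  # dependency on this cycle's work: wait for the next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


# Hypothetical shader-like sequence with roughly two independent instructions
# available at any point.
program: List[Instr] = [
    ("a", ("x", "y")),
    ("b", ("z", "w")),  # independent of a
    ("c", ("a", "b")),  # needs a and b
    ("d", ("x", "z")),  # independent of c
    ("e", ("c", "d")),  # needs c and d
    ("f", ("y", "w")),  # independent of e
]

for width in (1, 2, 4):
    print(width, cycles_to_issue(program, width))  # prints 1 6, then 2 3, then 4 3

# Peak issue rate as stated in the text: two warps, two instructions per warp.
WARPS_ISSUING_PER_CLOCK = 2
INSTRUCTIONS_PER_WARP = 2
peak_instructions_per_clock = WARPS_ISSUING_PER_CLOCK * INSTRUCTIONS_PER_WARP
print(peak_instructions_per_clock)  # prints 4
```

In this toy sequence, going from one to two instructions per cycle halves the cycle count, while doubling the width again to four yields no further gain, because the code never exposes more than two independent instructions at a time.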

To facilitate greater instruction throughput, the GTX 460 SM greatly expands the number of execution units. The number of CUDA cores is increased from 32 to 48. The number of special function units (SFUs) and texture units is doubled from four to eight. Texture filtering for FP16 textures is now performed at full speed. These improvements enable exceptional performance in DirectX 11 games such as Battlefield: Bad Company 2 and Metro 2033.

New Cache Architecture

Fermi is the first GPU architecture with fully cached memory access. Caches have long been used in CPUs to improve memory access performance. With Fermi, we’ve built a unified cache architecture that extends the benefit of caching to all graphics and compute programs.

Fermi programs have access to a texture cache, an L1 cache, and an L2 cache. The L1 and L2 caches improve performance for programs with random memory access patterns such as raytracing and physics. The texture cache enables fast and efficient texture filtering. Programs also have access to a fast, dedicated shared memory that greatly improves the performance of GPGPU applications such as video transcoding and photo processing.
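To make the benefit of a two-level cache for random access patterns more concrete, here is a minimal sketch of an L1/L2 lookup path using least-recently-used replacement. Everything in it is an assumption for illustration: the class `LRUCache`, the function `simulate`, the replacement policy, the line size, and the cache capacities are hypothetical and are not Fermi's actual parameters or policies.

```python
import random
from collections import OrderedDict
from typing import Dict, Iterable


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity_lines: int) -> None:
        self.capacity = capacity_lines
        self.lines: "OrderedDict[int, None]" = OrderedDict()

    def access(self, line: int) -> bool:
        """Return True on a hit; on a miss, fill the line and evict the oldest."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        self.lines[line] = None
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line
        return False


def simulate(addresses: Iterable[int], l1_lines: int, l2_lines: int,
             line_bytes: int = 128) -> Dict[str, int]:
    """Count where each access is served: L1, L2, or off-chip memory.

    The sizes and the 128-byte line are illustrative assumptions only.
    """
    l1, l2 = LRUCache(l1_lines), LRUCache(l2_lines)
    counts = {"l1": 0, "l2": 0, "dram": 0}
    for addr in addresses:
        line = addr // line_bytes
        if l1.access(line):
            counts["l1"] += 1
        elif l2.access(line):
            counts["l2"] += 1
        else:
            counts["dram"] += 1
    return counts


# Random reads scattered over a 64 KB region (512 lines of 128 bytes).
rng = random.Random(0)
addresses = [rng.randrange(0, 64 * 1024) for _ in range(10_000)]
counts = simulate(addresses, l1_lines=128, l2_lines=1024)
print(counts)
```

Because the hypothetical L2 here is large enough to hold the whole region, each line is fetched from off-chip memory at most once, and every later access is served from one of the two caches even though the access order is random.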

New Render Output Units with Improved Antialiasing

Fermi’s Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency. One Fermi ROP partition contains eight ROP units, a twofold improvement over prior architectures. 8x antialiasing, an expensive operation on prior-generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer. Along with performance improvements, we’ve also enhanced our image quality. Fermi supports 32x coverage sampling antialiasing (CSAA), the highest sample count of any antialiasing mode on any GPU. The result is perfectly smooth geometry edges with exceptional performance.