The Fermi Architecture
The GeForce GTX 400/500 family of GPUs is based on NVIDIA’s Fermi architecture, the most significant leap in GPU architecture since the original G80. G80 was our initial vision of what a unified graphics and compute processor should look like. GT200 extended the performance and functionality of G80. With Fermi, we have taken everything we learned from the two prior processors, analyzed the applications that were written for them, and developed a completely new architecture optimized for next-generation games and applications.
Parallel Tessellation Engines
Traditional GPU designs use a single geometry engine to perform tessellation. This approach is analogous to early GPU designs, which used a single pixel pipeline to perform pixel shading. Having observed how pixel pipelines grew from a single unit to many parallel units, and how that growth improved 3D realism, we designed our tessellation architecture to be parallel from day one.
Fermi GPUs implement up to sixteen parallel tessellation units, each with its own dedicated shading resources. Up to four parallel Raster Engines transform newly tessellated triangles into a fine stream of pixels for shading. The close coupling of the tessellation, rasterization, and shading units provides enormous on-chip bandwidth and high execution efficiency. The result is a breakthrough in tessellation performance, with a peak rate of up to two billion triangles per second. Compared to competing products, Fermi GPUs are up to 8x faster at tessellation, as measured by independent reviews using Microsoft’s DirectX 11 software development kit.
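To put the peak figure in per-frame terms, the arithmetic can be sketched in a few lines of host-side C++. The function name and the 60 frames-per-second target are assumptions chosen for illustration; the two-billion figure is the peak rate quoted above, not a sustained measurement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: triangles available per frame at a given peak rate.
// Integer division truncates, so the result is rounded down.
std::uint64_t triangles_per_frame(std::uint64_t triangles_per_second,
                                  std::uint64_t frames_per_second) {
    assert(frames_per_second > 0);  // guard against division by zero
    return triangles_per_second / frames_per_second;
}
```

At the quoted peak of 2,000,000,000 triangles per second and an assumed 60 frames per second, `triangles_per_frame(2000000000, 60)` gives a budget of roughly 33 million triangles per frame.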
Third Generation Streaming Multiprocessor
Fermi’s third-generation Streaming Multiprocessor (SM) introduces several architectural innovations that improve both the performance and accuracy of complex graphics and compute workloads.
Each of Fermi’s SMs contains 32 CUDA cores (also called CUDA processors), a fourfold increase over prior SM designs. By employing a flexible scalar architecture, CUDA cores achieve full performance on a variety of workloads such as texture operations, shadow maps, and complex shaders. The result is consistently high performance in current as well as future games.
Each CUDA core has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). The FPU complies with the IEEE 754-2008 industry standard for floating-point arithmetic, the same standard used by the world’s premier computer scientists and research institutions. Fermi applies this high standard of precision to all workloads, whether in games, video transcoding, or desktop applications.
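One concrete feature of IEEE 754-2008 is the fused multiply-add (FMA) operation, which the 2008 revision added to the standard and which rounds once instead of twice. The text above does not name FMA, so treat the following host-side C++ sketch as an illustration of the standard rather than a description of Fermi’s hardware; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round to double, then add: two rounding steps.
// The volatile store keeps the compiler from fusing the two operations.
double multiply_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a * b + c computed with a single final rounding.
double fused_multiply_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-30, b = 1 - 2^-30, and c = -1, the exact product is 1 - 2^-60, which rounds to 1.0 in double precision. The two-step version therefore returns 0, while the fused version returns -2^-60, the exact answer.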
Fermi’s third-generation SM also improves execution efficiency through improved scheduling. The SM schedules work in groups of 32 threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently, twice as many as in the prior generation. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies within the instruction stream. Using this elegant dual-issue model, Fermi achieves high performance efficiency with minimal hardware.
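The warp arithmetic described here can be sketched in host-side C++. This is a deliberately simplified model, not a description of the actual scheduler: it counts abstract issue slots rather than clock cycles, assumes every warp is always ready, and uses hypothetical function names. Only the warp size of 32 and the two schedulers per SM come from the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;       // threads per warp, from the text
constexpr std::uint32_t kWarpSchedulers = 2;  // warp schedulers per SM, from the text

// Number of warps needed to cover thread_count threads. Rounds up, so the
// last warp may be only partly filled.
std::uint32_t warps_needed(std::uint32_t thread_count) {
    return (thread_count + kWarpSize - 1) / kWarpSize;
}

// Simplified model: in each issue slot, every scheduler issues one instruction
// from one ready warp. Returns the slots needed for every warp to issue once.
std::uint32_t issue_slots(std::uint32_t warp_count, std::uint32_t schedulers) {
    assert(schedulers > 0);  // guard against division by zero
    return (warp_count + schedulers - 1) / schedulers;
}
```

For example, 100 threads occupy `warps_needed(100)`, or 4 warps, with the last warp only partly filled. Under this model those 4 warps need `issue_slots(4, kWarpSchedulers)`, or 2 slots, compared with 4 slots for a single scheduler.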
New Cache Architecture
Fermi is the first GPU architecture with fully cached memory access. Caches have long been used in CPUs to improve memory access performance. With Fermi, we’ve built a unified cache architecture that extends the benefit of caching to all graphics and compute programs.
Fermi programs have access to a texture cache, an L1 cache, and an L2 cache. The L1 and L2 caches improve performance for programs with random memory access patterns, such as ray tracing and physics. The texture cache enables fast and efficient texture filtering. In addition, programs have access to a fast, dedicated shared memory that greatly improves the performance of GPGPU applications such as video transcoding and photo processing.
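The benefit of an L1/L2 hierarchy is often reasoned about with the textbook average-memory-access-time formula. The host-side C++ sketch below applies that general formula; it is not a model of Fermi’s caches, and the function name and every latency or miss rate passed to it are hypothetical.

```cpp
#include <cassert>

// Average memory access time for a two-level cache hierarchy:
//   l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)
// All latencies share one unit (for example, clock cycles).
// Miss rates are fractions in the range [0, 1].
double average_access_time(double l1_hit, double l1_miss_rate,
                           double l2_hit, double l2_miss_rate,
                           double dram) {
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram);
}
```

With invented inputs of 1 unit for an L1 hit, 10 units for an L2 hit, 100 units for memory, and a 50% miss rate at each level, `average_access_time(1, 0.5, 10, 0.5, 100)` returns 31 units, compared with 100 units if every access went to memory. These numbers are illustrative only and are not Fermi measurements.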
New Render Output Units with Improved Antialiasing
Fermi’s Render Output (ROP) subsystem has been redesigned for improved throughput and efficiency. One Fermi ROP partition contains eight ROP units, twice as many as in prior architectures. 8x antialiasing, an expensive operation on prior-generation GPUs, is now much faster thanks to improved memory compression and a larger framebuffer. Along with the performance improvements, we’ve also improved image quality. Fermi supports 32x coverage sampling antialiasing (CSAA), the highest sample-count antialiasing mode on any GPU. The result is smooth geometry edges with exceptional performance.
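A rough size estimate shows why 8x antialiasing is expensive and why compression and framebuffer capacity matter. The host-side C++ sketch below assumes an uncompressed multisampled framebuffer that stores color and depth for every sample; the function name, the 1920x1080 resolution, and the 4-byte color and 4-byte depth/stencil formats are assumptions for illustration, not figures from the text.

```cpp
#include <cassert>
#include <cstdint>

// Uncompressed size of a multisampled framebuffer, in bytes, assuming every
// sample stores both a color value and a depth/stencil value.
std::uint64_t msaa_framebuffer_bytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t samples_per_pixel,
                                     std::uint64_t color_bytes,
                                     std::uint64_t depth_bytes) {
    assert(samples_per_pixel > 0);  // at least one sample per pixel
    return width * height * samples_per_pixel * (color_bytes + depth_bytes);
}
```

Under these assumptions, `msaa_framebuffer_bytes(1920, 1080, 8, 4, 4)` comes to 132,710,400 bytes (about 127 MiB), eight times the 16,588,800 bytes needed with one sample per pixel. The estimate ignores compression, which reduces the memory traffic in practice.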