Temperature-Aware GPU Design

Jeremy W. Sheaffer, Kevin Skadron, David P. Luebke
{jws9c, skadron, luebke}@cs.virginia.edu
University of Virginia, Charlottesville, VA 22904
http://gfx.cs.virginia.edu
http://qsilver.cs.virginia.edu
http://lava.cs.virginia.edu

Problem Statement

Cooling for graphics processors is becoming prohibitively expensive, but cooling solutions are designed for worst-case behavior. Since power dissipation is spatially non-uniform across the chip, localized heating occurs much faster than chip-wide heating, which leads to “hot spots” and spatial gradients that can cause accelerated aging and timing errors. Reducing hot spots reduces cooling requirements. In fact, as true worst-case behavior is rare, a solution designed for the worst case is overdesigned for typical operating conditions. However, a package designed for typical behavior could be overcome by some unusual application, requiring dynamic thermal management (DTM).

GPU Simulation with Qsilver

To study thermal issues in a GPU, we have developed a simulator called Qsilver that:
• models GPU clock-cycle-by-cycle activity and power in the microarchitecture domain.
• uses the Chromium† system to intercept a stream of OpenGL calls, annotating it with aggregate information about the vertices and fragments, textures, lighting, and other relevant rendering state.

Qsilver is useful for:
• analyzing performance bottlenecks
• estimating power
• exploring new graphics architectural ideas

[Figure: Qsilver output. Panels: vertices transformed, fragments generated, texture accesses.]

We have used Qsilver to analyze a hypothetical fixed-function console-like GPU architecture. For these results, we augment Qsilver with an architectural thermal model called HotSpot‡ that tracks temperature in each functional unit over time.

† http://chromium.sourceforge.net/
‡ http://lava.cs.virginia.edu/HotSpot/

Architecture-Level Thermal Modeling

Requirements: general, simple, and fast, and must model heating at the granularity of architectural objects.
• Must be able to dynamically calculate temperatures for each block in the architecture
• Must be able to simulate billions of clock cycles in a few hours
• Must be general enough to use for modeling a variety of processor architectures
• Must be able to reason about results at the architecture level

Solution: derive an equivalent thermal circuit of lumped resistances and capacitances. This circuit must be derived at the granularity of the processor architecture.

Key components:
• Floorplanning
• Lumped-RC circuit derivation

Floorplans

In order to add thermal modeling to Qsilver, the simulator must first be instrumented with an architectural floorplan. From the left, these floorplans are:
• Default: based on an nVIDIA marketing photo. We use this chip to drive an 800×600, console-like display in our simulations.
• Separating Hot Units: based on the default floorplan. The two hottest units, framebuffer operations and the vertex engine, are separated.
• High Resolution: also based on the default, but modified to drive a PC display at 1280×1024. The framebuffer, fragment engine, and texture cache are enlarged to maintain reasonable power densities under higher workload.
• Partitioned High Resolution: this novel floorplan maintains the functional unit area of the high resolution design, but partitions units into separate blocks per pipe, and separates hot blocks from cooler ones.

[Figure: the four floorplans (Default, Separating Hot Units, High Resolution, Partitioned High Resolution). Labeled blocks: Vertex Engine, Rasterizer, Fragment Engine, Texture Cache, Framebuffer control, Framebuffer and Data Compression, Host Interface, 2D Video, Unused.]

Thermal Simulation Results

Cost is the performance cost; Max °C is the maximum temperature.

Floorplan →                 Default          Separating       High             Partitioned
                                             Hot Units        Resolution       High Resolution
Technique ↓                 Cost    Max °C   Cost    Max °C   Cost    Max °C   Cost    Max °C
No DTM                      0.0%    106.4    0.0%    105.5    0.0%    103.7    0.0%    100.9
Clock Gating                62.0%   97.0     13.6%   97.0     14.8%   97.0     0.7%    97.0
Fetch Gating (Vertex Fetch) 25.9%   102.9    10.2%   98.7     9.2%    101.3    0.5%    98.1
Fetch Gating (Rasterizer)   90.1%   98.1     17.7%   97.8     17.4%   97.0     0.7%    97.8
Dynamic Voltage Scaling     13.1%   100.7    3.4%    98.2     3.4%    97.4     0.1%    97.0
Multiple Clock Domains      16.7%   98.4     4.1%    97.0     3.7%    97.0     0.5%    97.4

From left to right, below: no architectural thermal management with the default floorplan yields a very hot vertex engine; the hot units moved apart, combined with DVS, make the chip cooler with a less pronounced spatial thermal gradient; fetch gating on the high resolution system; and DVS on the redesigned high-res chip, where the effect of separating hot spots on the spatial gradient is more obvious: combining static and dynamic
techniques is a double win. Note that to better illustrate their full dynamic range, these thermal maps are not all on the same scale.

Simulator Setup and Output

For these results, our simulator is configured to model a system:
• Built on a 180nm process at 1.8V and 300MHz
• Using an aluminum cooling solution with no fan
• With a temperature sensor on each functional unit block

We assume that the vendor specifies a 100°C maximum safe operating temperature and enable dynamic thermal management at 97°C to account for sensor imprecision.

We have implemented the following DTM techniques on Qsilver:
• Clock Gating: the clock is stopped until the chip drops below the threshold temperature.
• Fetch Gating: a single stage in the pipeline is slowed down. We implement this in both the vertex fetch and rasterization stages.
• Dynamic Voltage Scaling: DVS scales the core voltage, and with it frequency, yielding a cubic reduction in power.
• Multiple Clock Domains: MCD also scales voltage and frequency, but at the granularity of individual functional units.

Both DVS and MCD require a sync time ‘penalty’ when they are enabled and disabled.
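As a rough illustration of how a lumped-RC thermal model interacts with a 97°C DTM trigger, the sketch below simulates a single block with one thermal resistance and one capacitance to ambient. It is not Qsilver or HotSpot code. The names `Block`, `simulate`, and `dvs_power_scale` are hypothetical, and the resistance, capacitance, power, ambient temperature, and voltage ratio are illustrative assumptions rather than values from this poster. Leakage power and the DVS sync penalty are ignored.

```python
from dataclasses import dataclass


@dataclass
class Block:
    r_th: float   # thermal resistance to ambient, K/W (illustrative value)
    c_th: float   # thermal capacitance, J/K (illustrative value)
    power: float  # power at nominal voltage and frequency, W (illustrative)


def dvs_power_scale(v_ratio: float) -> float:
    """Power ~ V^2 * f; scaling f along with V gives a cubic factor."""
    return v_ratio ** 3


def simulate(block, policy="none", t_amb=45.0, t_trigger=97.0,
             v_ratio=0.85, dt=1e-3, steps=10_000):
    """Return (peak temperature in C, fraction of work lost)."""
    temp = t_amb
    peak = temp
    lost = 0.0
    for _ in range(steps):
        throttled = policy != "none" and temp >= t_trigger
        if throttled and policy == "clock_gating":
            power, speed = 0.0, 0.0      # clock stopped; leakage ignored
        elif throttled and policy == "dvs":
            power = block.power * dvs_power_scale(v_ratio)
            speed = v_ratio              # frequency follows voltage
        else:
            power, speed = block.power, 1.0
        # Forward-Euler step of C * dT/dt = P - (T - T_amb) / R
        temp += dt * (power - (temp - t_amb) / block.r_th) / block.c_th
        peak = max(peak, temp)
        lost += 1.0 - speed
    return peak, lost / steps


block = Block(r_th=2.0, c_th=0.5, power=30.0)
results = {p: simulate(block, p) for p in ("none", "clock_gating", "dvs")}
for policy, (peak, cost) in results.items():
    print(f"{policy:13s} peak={peak:6.2f} C  cost={cost:6.1%}")
```

With these assumed parameters the unmanaged block settles above 100°C, while both policies hold it near the trigger temperature. DVS loses less work than clock gating in this toy setup because the cubic power reduction costs only a linear slowdown. A multi-block model would add lateral resistances between neighboring floorplan blocks.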