Temperature-Aware GPU Design
Jeremy W. Sheaffer, Kevin Skadron, David P. Luebke
{jws9c, skadron, luebke}@cs.virginia.edu
University of Virginia, Charlottesville, VA 22904
http://gfx.cs.virginia.edu
http://qsilver.cs.virginia.edu
http://lava.cs.virginia.edu
Problem Statement
Cooling for graphics processors is becoming prohibitively expensive, but cooling solutions are designed for worst-case behavior. Since power
dissipation is spatially non-uniform across the chip, localized heating occurs much faster than chip-wide heating, which leads to “hot spots”
and spatial gradients that can cause accelerated aging and timing errors. Reducing hot spots reduces cooling requirements. In fact, as true
worst-case behavior is rare, a solution designed for the worst case is overdesigned for typical operating conditions. However, a package
designed for typical behavior could be overcome by some unusual application, requiring dynamic thermal management (DTM).
GPU Simulation with Qsilver
To study thermal issues in a GPU, we have developed a simulator
called Qsilver that:
• models GPU clock-cycle-by-cycle activity and power in the microarchitecture domain
• uses the Chromium† system to intercept a stream of OpenGL calls, annotating it with aggregate information about the vertices and fragments, textures, lighting, and other relevant rendering state
Qsilver is useful for:
• analyzing performance bottlenecks
• estimating power
• exploring new graphics architectural ideas
[Figure: Qsilver output plots. Panels: Texture accesses, Fragments generated, Vertices transformed.]
We have used Qsilver to analyze a hypothetical fixed-function, console-like GPU architecture. For these results, we augment Qsilver with an architectural thermal model called HotSpot‡ that tracks temperature in each functional unit over time.
† http://chromium.sourceforge.net/
‡ http://lava.cs.virginia.edu/HotSpot/
Architecture-Level Thermal Modeling
Requirements: general, simple, and fast, and must model heating at the granularity of architectural objects.
• Must be able to dynamically calculate temperatures for each block in the architecture
• Must be able to simulate billions of clock cycles in a few hours
• Must be general enough to use for modeling a variety of processor architectures
• Must be able to reason about results at the architecture level
Solution: Derive an equivalent circuit of lumped thermal resistances and capacitances. This circuit must be derived at the granularity of the processor architecture.
Key components:
• Floorplanning
• Lumped-RC circuit derivation
Floorplans
In order to add thermal modeling to Qsilver, the simulator must first be instrumented with an architectural floorplan. From the left, these floorplans are:
• Default: based on an nVIDIA marketing photo. We use this chip to drive an 800×600, console-like display in our simulations.
• Separating Hot Units: based on the default floorplan. The two hottest units, framebuffer operations and the vertex engine, are separated.
• High Resolution: also based on the default, but modified to drive a PC display at 1280×1024. The framebuffer, fragment engine, and texture cache are enlarged to maintain reasonable power densities under higher workload.
• Partitioned High Resolution: this novel floorplan maintains the functional unit area of the high resolution design, but partitions units into separate blocks per pipe, and separates hot blocks from cooler ones.
[Figure: the four floorplans (Default, Separating Hot Units, High Resolution, Partitioned High Resolution). Labeled blocks: Vertex Engine, Rasterizer, Fragment Engine, Texture Cache, Framebuffer control, Framebuffer and Data Compression, 2D Video, Host Interface, Unused.]
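A minimal sketch of how a floorplan and a lumped-RC circuit can fit together, assuming each block conducts heat vertically to a fixed-temperature reference and laterally to its neighbors. The material constants, die thickness, block areas, block powers, lateral resistance, and reference temperature are illustrative assumptions, not values from this work, and HotSpot itself models the package in more detail than this.

```python
from dataclasses import dataclass

# Assumed, illustrative constants (not taken from the poster).
K_SI = 100.0            # thermal conductivity of silicon, W/(m*K)
RHO_C_SI = 1.75e6       # volumetric heat capacity of silicon, J/(m^3*K)
DIE_THICKNESS = 0.5e-3  # die thickness, m


@dataclass
class Block:
    """One floorplan block with a constant (assumed) power draw."""
    name: str
    area: float   # m^2
    power: float  # W

    @property
    def resistance(self) -> float:
        """Vertical thermal resistance through the die, K/W."""
        return DIE_THICKNESS / (K_SI * self.area)

    @property
    def capacitance(self) -> float:
        """Thermal capacitance of the block, J/K."""
        return RHO_C_SI * DIE_THICKNESS * self.area


def step(temps, blocks, lateral, t_ref, dt):
    """Advance all block temperatures by one forward-Euler step of dt seconds.

    lateral maps a pair of block indices (i, j) to the thermal
    resistance (K/W) between those two blocks.
    """
    # Net heat into each block: dissipated power minus vertical loss.
    flow = [b.power - (t - t_ref) / b.resistance for b, t in zip(blocks, temps)]
    for (i, j), r in lateral.items():
        q = (temps[i] - temps[j]) / r  # heat flowing from i to j, W
        flow[i] -= q
        flow[j] += q
    return [t + dt * f / b.capacitance for t, f, b in zip(temps, flow, blocks)]


def simulate(blocks, lateral, t_ref, dt, steps):
    """Start every block at t_ref and integrate for the given number of steps."""
    temps = [t_ref] * len(blocks)
    for _ in range(steps):
        temps = step(temps, blocks, lateral, t_ref, dt)
    return temps


# Hypothetical two-block floorplan: a hot unit next to a cooler one.
blocks = [
    Block("Vertex Engine", area=10e-6, power=8.0),
    Block("Texture Cache", area=10e-6, power=2.0),
]
lateral = {(0, 1): 2.0}
temps = simulate(blocks, lateral, t_ref=85.0, dt=1e-4, steps=5000)
for b, t in zip(blocks, temps):
    print(f"{b.name}: {t:.1f} C")
```

With these assumed numbers the hot block settles a few degrees above its neighbor. Raising the lateral resistance, which is roughly what moving two hot units apart does, widens that gap while leaving each block's vertical path unchanged.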
Thermal Simulation Results
Each cell gives performance cost / maximum temperature (°C).
Technique ↓ / Floorplan →   | Default        | Separating Hot Units | High Resolution | Partitioned High Resolution
No DTM                      | 0.0% / 106.4   | 0.0% / 105.5         | 0.0% / 103.7    | 0.0% / 100.9
Clock Gating                | 62.0% / 97.0   | 13.6% / 97.0         | 14.8% / 97.0    | 0.7% / 97.0
Fetch Gating (Vertex Fetch) | 25.9% / 102.9  | 10.2% / 98.7         | 9.2% / 101.3    | 0.5% / 98.1
Fetch Gating (Rasterizer)   | 90.1% / 98.1   | 17.7% / 97.8         | 17.4% / 97.0    | 0.7% / 97.8
Dynamic Voltage Scaling     | 13.1% / 100.7  | 3.4% / 98.2          | 3.4% / 97.4     | 0.1% / 97.0
Multiple Clock Domains      | 16.7% / 98.4   | 4.1% / 97.0          | 3.7% / 97.0     | 0.5% / 97.4
Simulator Setup and Output
From left to right, below: with no architectural thermal management, the default floorplan yields a very hot vertex engine; moving the hot units apart and applying DVS makes the chip cooler, with a less pronounced spatial thermal gradient; fetch gating on the high resolution system; and DVS on the redesigned high resolution chip, where the effect of separating hot spots on the spatial gradient is more obvious. Combining static and dynamic techniques is a double win.
[Figure: thermal maps for the four configurations described above. Labeled blocks include Vertex Engine and Framebuffer control.]
For these results, our simulator is configured to model a system:
• Built on a 180 nm process at 1.8 V and 300 MHz
• Using an aluminum cooling solution with no fan
• With a temperature sensor on each functional unit block. We assume that the vendor specifies a 100°C maximum safe operating temperature and enable dynamic thermal management at 97°C to account for sensor imprecision.
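The trigger rule in the last item can be sketched as follows. This is only an illustration of the stated thresholds: the function name, the dictionary of readings, and the choice to engage when any single sensor reaches the trigger are assumptions, not Qsilver's actual interface.

```python
MAX_SAFE_C = 100.0  # vendor-specified maximum safe operating temperature
TRIGGER_C = 97.0    # DTM engages here, leaving margin for sensor imprecision


def dtm_should_engage(sensor_readings_c, trigger_c=TRIGGER_C):
    """Return True if any functional-unit sensor reads at or above the trigger.

    sensor_readings_c maps a functional-unit name to its reading in Celsius.
    """
    return max(sensor_readings_c.values()) >= trigger_c


# Margin between the trigger point and the vendor limit, in degrees Celsius.
SENSOR_MARGIN_C = MAX_SAFE_C - TRIGGER_C

# Hypothetical readings for three units.
readings = {"Vertex Engine": 97.2, "Rasterizer": 91.0, "Texture Cache": 88.5}
print(dtm_should_engage(readings))
```

Because the hottest block decides, one sensor per functional unit is what lets the trigger respond to a localized hot spot rather than to an average chip temperature.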
We have implemented the following DTM techniques on Qsilver:
• Clock Gating: the clock is stopped until the chip drops below the threshold temperature.
• Fetch Gating: a single stage in the pipeline is slowed down. We implement this in both the vertex fetch and rasterization stages.
• Dynamic Voltage Scaling: DVS scales the core voltage, and with it frequency, yielding a cubic reduction in power.
• Multiple Clock Domains: MCD also scales voltage and frequency, but at the granularity of individual functional units. Both DVS and MCD incur a synchronization time penalty when they are enabled and disabled.
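The cubic relationship for DVS follows from the standard dynamic power expression P = C·V²·f when frequency is scaled by the same factor as voltage. The sketch below works this through at the 1.8 V, 300 MHz operating point; the effective capacitance and the 0.9 scale factor are hypothetical, and leakage power is ignored.

```python
def dynamic_power(c_eff, voltage, frequency):
    """Dynamic switching power P = C_eff * V^2 * f, in watts."""
    return c_eff * voltage ** 2 * frequency


def dvs_scaled_power(c_eff, v_nominal, f_nominal, scale):
    """Power after scaling voltage and frequency together by the same factor."""
    return dynamic_power(c_eff, v_nominal * scale, f_nominal * scale)


V_NOMINAL = 1.8    # volts, from the simulated configuration
F_NOMINAL = 300e6  # hertz, from the simulated configuration
C_EFF = 1e-8       # farads, hypothetical effective switched capacitance

base = dynamic_power(C_EFF, V_NOMINAL, F_NOMINAL)
scaled = dvs_scaled_power(C_EFF, V_NOMINAL, F_NOMINAL, scale=0.9)
ratio = scaled / base  # equals scale**3 up to floating-point rounding
print(f"relative power at 0.9x voltage and frequency: {ratio:.3f}")
```

Under these assumptions a 10% reduction in voltage and frequency removes roughly 27% of the dynamic power, which is why DVS can cool the chip at a comparatively small performance cost.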
Note that to better illustrate their full dynamic range, these thermal
maps are not all on the same scale.