Streaming Architectures and GPUs
Ian Buck
Bill Dally & Pat Hanrahan
Stanford University
February 11, 2004
To Exploit VLSI Technology We Need:
• Parallelism
  – To keep 100s of ALUs per chip (thousands/board, millions/system) busy
• Latency tolerance
  – To cover 500-cycle remote memory access time
• Locality
  – To match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth
• Moore's Law
  – Growth of transistors, not performance
[Figure: 90nm chip, 12mm on a side, $200, 1GHz, 1 clock; a 0.5mm 64-bit FPU shown to scale, 50pJ/FLOP]
Courtesy of Bill Dally
Arithmetic is cheap, global bandwidth is expensive
Local << global on-chip << off-chip << global system
Arithmetic Intensity
Lots of ops per word transferred
• Compute-to-bandwidth ratio
• High arithmetic intensity is desirable
  – App limited by ALU performance, not off-chip bandwidth
  – More chip real estate for ALUs, not caches
[Figure: the same 90nm chip diagram as the previous slide (12mm die, $200, 1GHz, 1 clock; 0.5mm 64-bit FPU to scale, 50pJ/FLOP)]
Courtesy of Pat Hanrahan
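To make the ratio concrete, here is a minimal C sketch (not from the talk; the kernels and counts are illustrative) comparing the intensity of a SAXPY loop with that of a richer per-record kernel:

#include <stdio.h>

/* Arithmetic intensity = floating-point ops / words transferred.
 * SAXPY does 2 FLOPs per element (multiply + add) but moves
 * 3 words per element (load x, load y, store y). */
int main(void) {
    double flops_per_elem = 2.0;   /* y[i] = a*x[i] + y[i] */
    double words_per_elem = 3.0;   /* read x[i], read y[i], write y[i] */
    printf("SAXPY intensity:       %.2f ops/word\n",
           flops_per_elem / words_per_elem);

    /* A kernel that performs ~100 ops on a 10-word record (e.g. a force
     * calculation) reaches ~10 ops/word -- the regime the slide argues
     * for, where ALUs rather than off-chip bandwidth set the limit. */
    printf("Rich kernel intensity: %.2f ops/word\n", 100.0 / 10.0);
    return 0;
}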
Brook: Stream Programming Model
– Enforce data-parallel computing
– Encourage arithmetic intensity
– Provide fundamental ops for stream computing
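As a rough illustration of the model, a plain-C sketch follows (not actual Brook syntax; in Brook, streams are declared with <> and a single kernel call maps over whole streams):

/* Plain-C sketch of the stream model: a kernel is applied to each
 * element of an input stream to produce an output stream, with no
 * dependencies between elements. */
typedef struct { float x, y, z; } Record;

/* Kernel: sees one element at a time, no global or static state. */
static Record scale_kernel(Record in, float s) {
    Record out = { in.x * s, in.y * s, in.z * s };
    return out;
}

/* "Stream apply": the runtime/hardware is free to run iterations
 * in parallel because each output depends only on its own input. */
static void stream_apply(const Record *in, Record *out, int n, float s) {
    for (int i = 0; i < n; ++i)
        out[i] = scale_kernel(in[i], s);
}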
Streams & Kernels
• Streams
– Collection of records requiring similar computation
• Vertex positions, voxels, FEM cell, …
– Provide data parallelism
• Kernels
– Functions applied to each element in stream
• transforms, PDE, …
• No dependencies between stream elements
– Encourage high Arithmetic Intensity
Vectors vs. Streams
• Vectors:
  – v: array of floats
  – Instruction sequence:
      LD v0
      LD v1
      ADD v0, v1, v2
      ST v2
  – Large set of temps
• Streams:
  – s: stream of records
  – Instruction sequence:
      LD s0
      LD s1
      CALLS f, s0, s1, s2
      ST s2
  – Small set of temps
Higher arithmetic intensity: |f|/|s| >> |+|/|v|
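A hedged C sketch of the same contrast (operation and word counts are illustrative): the vector op does one add per three words moved, while the stream kernel reuses each record for many operations once it is loaded.

/* Vector op: one ADD per element, three words of traffic per element
 * (load a, load b, store c) -> intensity |+|/|v| ~ 1/3 op per word. */
void vector_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];               /* 1 op, 3 words */
}

/* Stream kernel f: the record is loaded once and then reused in
 * registers; a real kernel would do far more arithmetic per record. */
typedef struct { float p[3], v[3]; } Particle;

void integrate(Particle *s, int n, float dt) {
    for (int i = 0; i < n; ++i) {
        Particle r = s[i];                /* ~6 words in */
        for (int k = 0; k < 3; ++k) {     /* many ops on the same record */
            r.v[k] += dt * (-1.0f * r.p[k]);  /* toy spring force */
            r.p[k] += dt * r.v[k];
        }
        s[i] = r;                         /* ~6 words out */
    }
}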
Imagine
• Stream processor for image and signal processing
• 16mm die in 0.18um TI process
• 21M transistors
[Figure: four SDRAM channels connect to a Stream Register File at 2GB/s; the SRF feeds the ALU clusters at 32GB/s; local register bandwidth within the clusters is 544GB/s]
Merrimac Processor
• 90nm tech (1 V), ASIC technology
• 1 GHz (37 FO4)
• 128 GOPs
• Inter-cluster switch between clusters
• 127.5 mm2 (small, ~12 x 10 mm)
  – Stanford Imagine is 16mm x 16mm
  – MIT Raw is 18mm x 18mm
• 25 Watts (P4 = 75 W)
  – ~41W with memories
[Figure: 12.5mm x 10.2mm die floorplan: 16 compute clusters, each with four 64-bit FP/INT MADD units, 64-word local register files, and an 8K-word SRF bank; 8 cache banks; microcontroller; network interface with forward ECC; 16 RDRAM interfaces; two MIPS64 20kc scalar processors; memory switch; address generators; and reorder buffers]
Merrimac Streaming Supercomputer
[Figure: system hierarchy of nodes (16 per board), boards (32 per backplane), and backplanes (32), connected by on-board, intra-cabinet, and inter-cabinet networks]
• Node: one stream processor (128 FPUs, 128GFLOPS) with 16 DRDRAM chips (2GBytes, 16GBytes/s); 16GBytes/s (32+32 pairs) to the on-board network
• Board: 16 nodes, 1K FPUs, 2TFLOPS, 32GBytes; on-board network 64GBytes/s (128+128 pairs), 6" Teradyne GbX
• Backplane: 32 boards, 512 nodes, 64K FPUs, 64TFLOPS, 1TByte; intra-cabinet network 1TBytes/s (2K+2K links), ribbon fiber (E/O, O/E)
• Inter-cabinet network connects the backplanes
• All links 5Gb/s per pair or fiber; all bandwidths are full duplex
• Bisection 32TBytes/s
Streaming Applications
• Finite volume – StreamFLO (from TFLO)
• Finite element – StreamFEM
• Molecular dynamics code (ODEs) – StreamMD
• Model (elliptic, hyperbolic, and parabolic) PDEs
• PCA applications: FFT, Matrix Mul, SVD, Sort
StreamFLO
• StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil.
• The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations.
• The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.
StreamFEM
• A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
StreamMD: motivation
• Application: study the folding of human proteins.
• Molecular dynamics: computer simulation of the dynamics of macromolecules.
• Why this application?
  – Expect high arithmetic intensity.
  – Requires variable-length neighbor lists.
  – Molecular dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, and droplet deformation.
• Test case chosen for initial evaluation: a box of water molecules.
[Images: a DNA molecule; the human immunodeficiency virus (HIV)]
Summary of Application Results

Application                    | Sustained GFLOPS (1) | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
StreamFEM2D (Euler, quadratic) | 32.2                 | 23.5             | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
StreamFEM2D (MHD, cubic)       | 33.5                 | 50.6             | 733.3M (94.0%) | 43.8M (5.6%) | 3.2M (0.4%)
StreamMD                       | 14.2 (2)             | 12.1             | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
StreamFLO                      | 11.4 (2)             | 7.4              | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

(1) Simulated on a machine with 64GFLOPS peak performance
(2) The low numbers are a result of many divide and square-root operations
Streaming on graphics hardware?

Pentium 4 SSE theoretical*:
  3GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed:
  MULR R0, R0, R0: 20 GFLOPS
  equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 (6 months)

[Chart: observed GFLOPS from Jun-01 through Jul-03 for the Pentium 4, NV30, and NV35, with the NV35 near 20 GFLOPS]

*from Intel P4 Optimization Manual
GPU Program Architecture
[Figure: a shader program reads Input Registers, fetches from Textures, reads Constants, uses temporary Registers, and writes Output Registers]
Example Program
Simple Specular and Diffuse Lighting

!!VP1.0
#
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
# outputs homogeneous position and color
#
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
Cg/HLSL: High level language for GPUs
Specular Lighting
// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);
// Multiply 3 X 2 matrix generated using lightDir and halfAngle with
// scaled normal followed by lookup in intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);
// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);
// Blend/Modulate intensity with color
return color * intensity;
GPU: Data Parallel
• Each fragment shaded independently
  – No dependencies between fragments
    • Temporary registers are zeroed
    • No static variables
    • No read-modify-write textures
• Multiple "pixel pipes"
  – Data parallelism
    • Support ALU-heavy architectures
    • Hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
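A rough C analogy for these restrictions (an illustrative sketch, not a real shader API): each fragment's result is a pure function of that fragment's own inputs, and the commented-out lines show what is disallowed.

/* Sketch of the per-fragment restrictions in C terms (illustrative only;
 * real fragment programs are written in Cg/HLSL or assembly as above). */
typedef struct { float r, g, b, a; } Color;
typedef struct { float u, v; } TexCoord;

/* static float accumulator;           -- not allowed: no static variables */

Color shade_fragment(TexCoord tc, const Color *texture, int tex_w) {
    /* Temporaries start from a clean slate for every fragment. */
    Color tmp = {0, 0, 0, 0};

    /* Reads from textures are fine (unnormalized coords assumed here)... */
    Color texel = texture[(int)tc.v * tex_w + (int)tc.u];

    /* ...but writing back to the texture being read is not:
     * texture[...] = tmp;             -- no read-modify-write textures */

    tmp.r = texel.r; tmp.g = texel.g; tmp.b = texel.b; tmp.a = 1.0f;
    return tmp;   /* output depends only on this fragment's inputs */
}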
GPU: Arithmetic Intensity
Lots of ops per word transferred
• Graphics pipeline
  – Vertex
    • BW: 1 triangle = 32 bytes
    • OP: 100-500 f32-ops / triangle
  – Rasterization
    • Creates 16-32 fragments per triangle
  – Fragment
    • BW: 1 fragment = 10 bytes
    • OP: 300-1000 i8-ops / fragment
  (Shader programs run at the vertex and fragment stages.)
Courtesy of Pat Hanrahan
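Plugging the slide's numbers into a quick back-of-the-envelope check (a sketch, not a measurement):

#include <stdio.h>

/* Arithmetic intensity implied by the slide's pipeline figures. */
int main(void) {
    /* Vertex stage: 100-500 f32 ops per 32-byte triangle. */
    printf("vertex:   %.1f - %.1f ops/byte\n", 100.0 / 32, 500.0 / 32);
    /* Fragment stage: 300-1000 i8 ops per 10-byte fragment. */
    printf("fragment: %.1f - %.1f ops/byte\n", 300.0 / 10, 1000.0 / 10);
    return 0;
}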
Streaming Architectures
[Figure: a generic streaming architecture: SDRAM channels feed a Stream Register File, which feeds parallel ALU clusters]
Streaming Architectures
[Figure: the same architecture with a Kernel Execution Unit issuing kernel instructions (MAD R3, R1, R2; MAD R5, R2, R3;) across the ALU clusters]
Streaming Architectures
[Figure: the same architecture mapped onto a GPU: the ALU clusters correspond to the parallel fragment pipelines, which execute the kernel (MAD R3, R1, R2; MAD R5, R2, R3;)]
Streaming Architectures
[Figure: the same GPU mapping, with the kernel (MAD R3, R1, R2; MAD R5, R2, R3;) running on the parallel fragment pipelines]
Stream Register File:
• Texture cache?
• F-Buffer [Mark et al.]
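One way to picture the Stream Register File's role is strip-mining the stream through an explicitly managed on-chip buffer, which is why most references in the earlier results table hit the LRF/SRF rather than memory. The following C sketch assumes a hypothetical buffer size and kernel signature, not any specific hardware's interface.

#include <string.h>

#define SRF_WORDS 2048              /* illustrative on-chip buffer size */

typedef struct { float x[8]; } Rec;
static Rec srf[SRF_WORDS / 8];      /* stands in for the Stream Register File */

/* Apply a kernel to a long stream one SRF-sized strip at a time.
 * Off-chip traffic: each record crosses the memory interface twice;
 * every other reference during the kernel stays on chip. */
void run_stream(const Rec *in, Rec *out, int n, void (*kernel)(Rec *, int)) {
    int strip = (int)(sizeof srf / sizeof srf[0]);
    for (int base = 0; base < n; base += strip) {
        int cnt = (n - base < strip) ? (n - base) : strip;
        memcpy(srf, in + base, cnt * sizeof(Rec));   /* mem -> SRF */
        kernel(srf, cnt);                            /* SRF/LRF traffic only */
        memcpy(out + base, srf, cnt * sizeof(Rec));  /* SRF -> mem */
    }
}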
Conclusions
• The problem is bandwidth – arithmetic is cheap
• Stream processing & architectures can provide VLSI-efficient scientific computing
– Imagine
– Merrimac
• GPUs are first-generation streaming architectures
  – Apply the same stream programming model for general-purpose computing on GPUs