Streaming Architectures and GPUs
Ian Buck, Bill Dally & Pat Hanrahan
Stanford University
February 11, 2004

To Exploit VLSI Technology We Need:
• Parallelism
  – To keep 100s of ALUs per chip (thousands per board, millions per system) busy
• Latency tolerance
  – To cover 500-cycle remote memory access times
• Locality
  – To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth
• Moore's Law
  – Growth of transistors, not performance
[Figure: 12 mm, 90 nm chip ($200, 1 GHz) with a 0.5 mm 64-bit FPU shown to scale; 50 pJ/FLOP, 1 clock. Courtesy of Bill Dally]
Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system

Arithmetic Intensity
Lots of ops per word transferred
• Compute-to-bandwidth ratio
• High arithmetic intensity is desirable
  – App limited by ALU performance, not off-chip bandwidth
  – More chip real estate for ALUs, not caches
[Figure: same 12 mm, 90 nm chip and 0.5 mm 64-bit FPU floorplan as above. Courtesy of Pat Hanrahan]

Brook: Stream Programming Model
– Enforce data-parallel computing
– Encourage arithmetic intensity
– Provide fundamental ops for stream computing

Streams & Kernels
• Streams
  – Collections of records requiring similar computation
    • Vertex positions, voxels, FEM cells, …
  – Provide data parallelism
• Kernels
  – Functions applied to each element in a stream
    • Transforms, PDEs, …
  – No dependencies between stream elements
  – Encourage high arithmetic intensity

Vectors vs. Streams
• Vectors:
  – v: array of floats
  – Instruction sequence:
      LD v0
      LD v1
      ADD v0, v1, v2
      ST v2
  – Large set of temps
• Streams:
  – s: stream of records
  – Instruction sequence:
      LD s0
      LD s1
      CALLS f, s0, s1, s2
      ST s2
  – Small set of temps
Higher arithmetic intensity: |f|/|s| >> |+|/|v|

Imagine
• Stream processor for image and signal processing
• 16 mm die in a 0.18 um TI process
• 21M transistors
[Figure: four SDRAM channels (2 GB/s) feed a Stream Register File (32 GB/s), which feeds the ALU clusters (544 GB/s)]

Merrimac Processor
• 90 nm tech (1 V) ASIC technology
• 1 GHz (37 FO4)
• 128 GOPS
• Inter-cluster switch between clusters
• 127.5 mm² (small: ~12 x 10 mm)
  – Stanford Imagine is 16 mm x 16 mm
  – MIT Raw is 18 mm x 18 mm
• 25 Watts (P4 = 75 W)
  – ~41 W with memories
[Figure: 12.5 mm x 10.2 mm floorplan: MIPS64 20Kc scalar core, microcontroller, 16 RDRAM interfaces with forward ECC, memory switch, address generators, reorder buffers, cache banks, inter-cluster network, and 16 compute clusters; each 2.3 mm x 1.6 mm cluster contains an 8K-word SRF bank, four 64-bit FP/INT MADD units, and 64-word register files]

Merrimac Streaming Supercomputer
• Node: one stream processor (128 FPUs, 128 GFLOPS) with 16 x DRDRAM = 2 GBytes at 16 GBytes/s
• Board: 16 nodes: 1K FPUs, 2 TFLOPS, 32 GBytes; on-board network 64 GBytes/s (32+32 pairs)
• Backplane: 32 boards: 512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte; intra-cabinet network 1 TBytes/s (2K+2K links, 6" Teradyne GbX)
• Inter-cabinet network: E/O and O/E over ribbon fiber
• All links 5 Gb/s per pair or fiber; all bandwidths are full duplex
• Bisection bandwidth 32 TBytes/s
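The per-node numbers above multiply straightforwardly up the hierarchy; a small C sketch (illustrative only, using just the counts and per-node figures stated on the slide) makes the scaling arithmetic explicit:

    #include <stdio.h>

    int main(void) {
        /* Per-node figures from the Merrimac Streaming Supercomputer slide */
        const double node_gflops = 128.0;  /* one stream processor: 128 GFLOPS */
        const double node_gbytes = 2.0;    /* 16 x DRDRAM = 2 GBytes           */

        const int nodes_per_board      = 16;
        const int boards_per_backplane = 32;

        /* Board: 16 nodes -> 2 TFLOPS, 32 GBytes */
        double board_gflops = node_gflops * nodes_per_board;
        double board_gbytes = node_gbytes * nodes_per_board;

        /* Backplane: 32 boards (512 nodes) -> 64 TFLOPS, 1 TByte */
        double backplane_gflops = board_gflops * boards_per_backplane;
        double backplane_gbytes = board_gbytes * boards_per_backplane;

        printf("board:     %6.0f GFLOPS  %5.0f GBytes\n", board_gflops, board_gbytes);
        printf("backplane: %6.0f GFLOPS  %5.0f GBytes\n", backplane_gflops, backplane_gbytes);
        return 0;
    }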
Streaming Applications
• Finite volume – StreamFLO (from TFLO)
• Finite element – StreamFEM
• Molecular dynamics code (ODEs) – StreamMD
• Model (elliptic, hyperbolic, and parabolic) PDEs
• PCA applications: FFT, matrix multiply, SVD, sort

StreamFLO
• StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil.
• The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations.
• The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.

StreamFEM
• A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.

StreamMD: motivation
• Application: study the folding of human proteins.
• Molecular dynamics: computer simulation of the dynamics of macromolecules.
• Why this application?
  – Expect high arithmetic intensity.
  – Requires variable-length neighbor lists.
  – Molecular dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, and deformation of droplets.
• Test case chosen for initial evaluation: a box of water molecules.
[Figures: DNA molecule; human immunodeficiency virus (HIV)]

Summary of Application Results

  Application                     | Sustained GFLOPS (1) | FP Ops / Mem Ref | LRF Refs       | SRF Refs     | Mem Refs
  StreamFEM2D (Euler, quadratic)  | 32.2                 | 23.5             | 169.5M (93.6%) | 10.3M (5.7%) | 1.4M (0.7%)
  StreamFEM2D (MHD, cubic)        | 33.5                 | 50.6             | 733.3M (94.0%) | 43.8M (5.6%) | 3.2M (0.4%)
  StreamMD                        | 14.2 (2)             | 12.1             | 90.2M (97.5%)  | 1.6M (1.7%)  | 0.7M (0.8%)
  StreamFLO                       | 11.4 (2)             | 7.4              | 234.3M (95.7%) | 7.2M (2.9%)  | 3.4M (1.4%)

  (1) Simulated on a machine with 64 GFLOPS peak performance
  (2) The low numbers are a result of many divide and square-root operations

Streaming on graphics hardware?
• Pentium 4 SSE theoretical*: 3 GHz x 4 wide x 0.5 inst/cycle = 6 GFLOPS
• GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
  – Equivalent to a 10 GHz P4
• GeForce FX is getting faster: 3x improvement over NV30 (6 months)
[Chart: GFLOPS over time (Jun-01 through Jul-03) for NV30, NV35, and Pentium 4]
* From the Intel P4 Optimization Manual

GPU Program Architecture
[Diagram: a GPU program reads input registers, constants, and textures, uses temporary registers, and writes output registers]
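The vertex program on the next slide is easier to read with the diagram above in mind: a GPU program is essentially a pure function from input registers, constants, and texture fetches to output registers, with only private temporary registers in between. The C sketch below is a conceptual model only; the types and the example_program function are invented for illustration and are not part of any GPU API.

    #include <stdio.h>

    typedef struct { float x, y, z, w; } float4;

    typedef struct { float4 position, normal; } InputRegisters;   /* cf. v[OPOS], v[NRML] */
    typedef struct { float4 hpos, color; } OutputRegisters;       /* cf. o[HPOS], o[COL0] */

    /* The "Program" box: reads inputs and constants (and, for fragment
     * programs, textures), writes outputs, and touches no other state,
     * so every invocation is independent of every other one.            */
    static OutputRegisters example_program(InputRegisters in, const float4 *c)
    {
        OutputRegisters out;
        float brightness = in.normal.z * c[0].x;   /* private temporary register */
        out.hpos  = in.position;
        out.color = (float4){ brightness, brightness, brightness, 1.0f };
        return out;
    }

    int main(void)
    {
        const float4 constants[1] = { { 0.5f, 0.0f, 0.0f, 0.0f } };
        InputRegisters v = { { 1, 2, 3, 1 }, { 0, 0, 1, 0 } };
        OutputRegisters o = example_program(v, constants);
        printf("color = %.2f\n", o.color.x);
        return 0;
    }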
Example Program: Simple Specular and Diffuse Lighting

    !!VP1.0
    #
    # c[0-3]  = modelview projection (composite) matrix
    # c[4-7]  = modelview inverse transpose
    # c[32]   = eye-space light direction
    # c[33]   = constant eye-space half-angle vector (infinite viewer)
    # c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
    # c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
    # c[36]   = specular color
    # c[38].x = specular power
    #
    # outputs homogeneous position and color
    #
    DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
    DP4 o[HPOS].y, c[1], v[OPOS];
    DP4 o[HPOS].z, c[2], v[OPOS];
    DP4 o[HPOS].w, c[3], v[OPOS];
    DP3 R0.x, c[4], v[NRML];          # Compute normal.
    DP3 R0.y, c[5], v[NRML];
    DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
    DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
    DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
    MOV R1.w, c[38].x;                # R1.w = specular power
    LIT R2, R1;                       # Compute lighting values
    MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
    MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
    END

Cg/HLSL: High-level language for GPUs (Specular Lighting)

    // Look up the normal map
    float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);

    // Multiply the 3 x 2 matrix generated from lightDir and halfAngle with the
    // scaled normal, then look the result up in the intensity map.
    float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                                dot(I.halfAngle.xyz, normal.xyz));
    float4 intensity = tex2D(intensityMap, intensCoord);

    // Look up the color
    float4 color = tex2D(colorMap, I.texCoord3.xy);

    // Blend/modulate intensity with color
    return color * intensity;

GPU: Data Parallel
• Each fragment shaded independently
  – No dependencies between fragments
  – Temporary registers are zeroed
  – No static variables
  – No read-modify-write textures
• Multiple "pixel pipes" provide data parallelism
  – Support ALU-heavy architectures
  – Hide memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]

GPU: Arithmetic Intensity
Lots of ops per word transferred
Graphics pipeline (shader programs run at the vertex and fragment stages):
• Vertex
  – BW: 1 triangle = 32 bytes
  – OP: 100-500 f32 ops / triangle
• Rasterization
  – Creates 16-32 fragments per triangle
• Fragment
  – BW: 1 fragment = 10 bytes
  – OP: 300-1000 i8 ops / fragment
Courtesy of Pat Hanrahan

Streaming Architectures
[Diagram, built up over four slides: SDRAM channels feed a Stream Register File, which feeds the ALU clusters, as in Imagine]
• Kernel execution unit: the ALU clusters execute the kernel code, e.g.
      MAD R3, R1, R2;
      MAD R5, R2, R3;
• On a GPU, the parallel fragment pipelines take the place of the ALU clusters
• Stream Register File: what is the GPU analog?
  – Texture cache?
  – F-Buffer [Mark et al.]

Conclusions
• The problem is bandwidth – arithmetic is cheap
• Stream processing & architectures can provide VLSI-efficient scientific computing
  – Imagine
  – Merrimac
• GPUs are first-generation streaming architectures
  – Apply the same stream programming model for general-purpose computing on GPUs
[Image: GeForce FX]
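To make the stream programming model in the conclusion concrete, here is a minimal C sketch of a kernel applied over a stream. It is illustrative only: Brook's real syntax extends C with stream types and kernel declarations, which are not reproduced here. Because the kernel sees only its own element, the loop's iterations are independent and can be spread across ALU clusters or fragment pipelines in any order.

    #include <stdio.h>
    #include <stddef.h>

    /* A "kernel": a pure function of one stream element (plus constants),
     * with no dependence on any other element.                            */
    static float saxpy_kernel(float x, float y, float a)
    {
        return a * x + y;
    }

    /* Applying the kernel to every element of the stream. The iterations
     * are independent, which is what lets a stream processor or GPU run
     * them across many ALUs while hiding memory latency.                  */
    static void run_kernel(const float *xs, const float *ys, float *out,
                           size_t n, float a)
    {
        for (size_t i = 0; i < n; ++i)
            out[i] = saxpy_kernel(xs[i], ys[i], a);
    }

    int main(void)
    {
        float xs[4]  = { 1, 2, 3, 4 };
        float ys[4]  = { 10, 20, 30, 40 };
        float out[4];

        run_kernel(xs, ys, out, 4, 0.5f);
        for (int i = 0; i < 4; ++i)
            printf("%g ", out[i]);
        printf("\n");
        return 0;
    }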