Graphics Processing Unit (GPU) Architecture and Programming
TU/e 5kk73
Zhenyu Ye, Bart Mesman, Henk Corporaal
2010-11-08

Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

System Architecture

GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)

What Can It Do?
Render triangles. The NVIDIA GTX480 can render 1.6 billion triangles per second!

General-Purpose Computing
ref: http://www.nvidia.com/object/tesla_computing_solutions.html

The Vision of NVIDIA
"Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price." -- David Kirk, NVIDIA, 1998

Single-Chip GPU vs. Fastest Supercomputers
ref: http://www.llnl.gov/str/JanFeb05/Seager.html

Top500 Supercomputers in June 2010

GPU Will Top the List in Nov 2010

The Gap Between CPU and GPU
ref: Tesla GPU Computing Brochure

GPU Has 10x the Computational Density
Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.

Evolution of the Intel Pentium
Chip area breakdown: Pentium I, Pentium II, Pentium III, Pentium IV.
Q: What can you observe? Why?

Extrapolation of the Single-Core CPU
If we extrapolate the trend, in a few generations the Pentium would look like this.
Of course, we know it did not happen.
Q: What happened instead? Why?

Evolution of Multi-core CPUs
Chip area breakdown: Penryn, Bloomfield, Gulftown, Beckton.
Q: What can you observe? Why?

Let's Take a Closer Look
Less than 10% of the total chip area is used for the actual execution.
Q: Why?

The Memory Hierarchy
Notes on energy at 45nm:
• A 64-bit integer ADD takes about 1 pJ.
• A 64-bit FP FMA takes about 200 pJ.
It seems we cannot further increase the computational density.

The Brick Wall -- UC Berkeley's View
• Power Wall: power expensive, transistors free
• Memory Wall: memory slow, multiplies fast
• ILP Wall: diminishing returns on more ILP hardware
Power Wall + Memory Wall + ILP Wall = Brick Wall
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link

How to Break the Brick Wall?
Hint: how can we exploit the parallelism inside the application?

Step 1: Trade Latency for Throughput
Hide the memory latency through fine-grained interleaved threading.

Interleaved Multi-threading
The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency and instruction dependencies

Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?
Interleaved Multi-threading
Fine-grained interleaved multi-threading:
Pros: removes the branch predictor, the OoO scheduler, and the large cache.
Cons: register pressure, etc.

Fine-Grained Interleaved Threading
Without and with fine-grained interleaved threading.
Pros: reduced cache size, no branch predictor, no OoO scheduler.
Cons: register pressure, thread scheduler, requires huge parallelism.

HW Support
The register file supports zero-overhead context switching between interleaved threads.

Can We Make Further Improvements?
Reducing the large cache gives 2x computational density.
Q: Can we make further improvements?
Hint: we have only utilized thread-level parallelism (TLP) so far.

Step 2: Single Instruction Multiple Data
GPUs use wide SIMD: 8/16/24/... processing elements (PEs).
CPUs use short SIMD: usually a vector width of 4.
SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.

Hardware Support
Supporting interleaved threading + SIMD execution.

Single Instruction Multiple Thread (SIMT)
Hide the vector width using scalar threads.

Example of SIMT Execution
Assume 32 threads are grouped into one warp.

Step 3: Simple Core
The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.
Lightweight PE: Fused Multiply-Add (FMA).
SFU: Special Function Unit.

NVIDIA's Motivation for the Simple Core
"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." -- Bill Dally, NVIDIA

Review: How Did We Get Here?
NVIDIA Fermi, 512 Processing Elements (PEs)

Throughput-Oriented Architectures
1. Fine-grained interleaved threading (~2x computational density)
2. SIMD/SIMT (>10x computational density)
3. Simple core (~2x computational density)
These are the key architectural features of throughput-oriented processors.
ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. (link)

Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

CUDA Programming
A massive number (>10,000) of lightweight threads.

Express Data Parallelism in Threads
Compare a thread program with a vector program.

Vector Program
The vector width is exposed to programmers.

Scalar program:
float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}

Vector program (vector width of 8):
float A[4][8];
do-all(i=0;i<4;i++){
  movups xmm0, [ &A[i][0] ]
  incps xmm0
  movups [ &A[i][0] ], xmm0
}
(The "incps" mnemonic is illustrative; SSE has no packed increment, in practice it would be an addps with a vector of ones.)

CUDA Program
• A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
• The hardware converts TLP into DLP at run time.

Scalar program:
float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}

CUDA program:
float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);
__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
(A kernel launched from the host with <<< >>> must be declared __global__.)

Two Levels of Thread Hierarchy
kernelF<<<(4,1),(8,1)>>>(A);
__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

Multi-dimensional Thread and Block IDs
Both the grid and the thread block can have a two-dimensional index.
kernelF<<<(2,2),(4,2)>>>(A);
__global__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}

Scheduling Thread Blocks on an SM
Example: scheduling 4 thread blocks on 3 SMs.
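For reference, a minimal compilable version of the kernelF example above. The host-side setup (allocation, copies, dim3 launch configuration) is an assumption, since the slides only sketch the kernel; the index computation is exactly the linearized block/thread index used on the slides.

// Minimal compilable sketch of the kernelF example (host code is assumed, not from the slides).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelF(float *A) {
    // i = linear block index in a (2,2) grid, j = linear thread index in a (4,2) block
    int i = gridDim.x  * blockIdx.y  + blockIdx.x;    // 0..3
    int j = blockDim.x * threadIdx.y + threadIdx.x;   // 0..7
    A[i * 8 + j] += 1.0f;                             // A is a flattened 4x8 array
}

int main() {
    float hA[4][8] = {{0}};                           // host copy, initialized to zero
    float *dA;
    cudaMalloc((void **)&dA, sizeof(hA));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    kernelF<<<dim3(2, 2), dim3(4, 2)>>>(dA);          // 2x2 blocks of 4x2 threads = 32 threads
    cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
    cudaFree(dA);

    printf("A[3][7] = %.1f\n", hA[3][7]);             // expect 1.0
    return 0;
}

Each of the 32 elements of A is incremented by exactly one thread, regardless of how many PEs the machine has.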
Executing a Thread Block on an SM
kernelF<<<(2,2),(4,2)>>>(A);
__global__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
Note: the number of Processing Elements (PEs) is transparent to the programmer.
Executed on a machine with a width of 4:
Executed on a machine with a width of 8:

Multiple Levels of Memory Hierarchy
Name      Cache?  Latency (cycles)       Read/Write
Global    L1/L2   200~400 (cache miss)   R/W
Shared    No      1~3                    R/W
Constant  Yes     1~3                    Read-only
Texture   Yes     ~100                   Read-only
Local     L1/L2   200~400 (cache miss)   R/W

Explicit Management of Shared Memory
Shared memory is frequently used to exploit locality.

Shared Memory and Synchronization
Example: an average filter over a 3x3 window on an image, with the data staged from DRAM into shared memory (a complete, compilable version of this kernel is sketched below).
kernelF<<<(1,1),(16,16)>>>(A);
__global__ kernelF(A){
  __shared__ smem[16][16];   // allocate shared memory
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];      // every thread loads one pixel into shared memory
  __syncthreads();           // barrier: wait until all threads have finished the load
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ...
            + smem[i+1][j+1] ) / 9;   // start the computation
}

Programmers Think in Threads
Q: Why go through this hassle?

Why Use Threads Instead of Vectors?
Thread pros:
• Portability. The machine width is transparent in the ISA.
• Productivity. Programmers do not need to care about the vector width of the machine.
Thread cons:
• Manual synchronization. We give up lock-step execution within a vector.
• Thread scheduling can be inefficient.
• Debugging. "Threads considered harmful": threaded programs are notoriously hard to debug.

Features of CUDA
• Programmers explicitly express DLP in terms of TLP.
• Programmers explicitly manage the memory hierarchy.
• etc.

Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Micro-architecture
GF100 micro-architecture.

HW Groups Threads Into Warps
Example: 32 threads per warp.

Example of Implementation
Note: NVIDIA may use a more complicated implementation.

Example Program
Address: Instruction
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5
Assume warp 0 and warp 1 are scheduled for execution.
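For reference, a compilable sketch of the 3x3 average-filter kernel from the shared-memory slides above. The edge clamping, the kernel name avgFilter3x3, and the host-side launch are assumptions, since the slides elide them. Note that the 16x16 thread block is executed by the hardware as 8 warps of 32 threads, the grouping introduced just above.

// Compilable sketch of the 3x3 average filter; clamping at the image border is an
// assumption not shown on the slides.
#include <cuda_runtime.h>

__global__ void avgFilter3x3(float *A) {
    __shared__ float smem[16][16];              // allocate shared memory
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i * 16 + j];                 // stage one pixel per thread
    __syncthreads();                            // wait until all loads are done

    float sum = 0.0f;
    for (int di = -1; di <= 1; di++) {          // 3x3 window, clamped at the edges
        for (int dj = -1; dj <= 1; dj++) {
            int ii = min(max(i + di, 0), 15);
            int jj = min(max(j + dj, 0), 15);
            sum += smem[ii][jj];
        }
    }
    A[i * 16 + j] = sum / 9.0f;                 // write the averaged pixel back
}

// Launch, as on the slides: one block of 16x16 threads.
// avgFilter3x3<<<dim3(1, 1), dim3(16, 16)>>>(dA);

The barrier is required because each thread reads its neighbours' pixels from smem, not only the pixel it loaded itself.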
Read Src Op
Read the source operands: r1 for warp 0, r4 for warp 1.

Buffer Src Op
Push the operands to the operand collector: r1 for warp 0, r4 for warp 1.

Read Src Op
Read the source operands: r2 for warp 0, r5 for warp 1.

Buffer Src Op
Push the operands to the operand collector: r2 for warp 0, r5 for warp 1.

Execute
Compute the first 16 threads in the warp.

Execute
Compute the last 16 threads in the warp.

Write Back
Write back: r0 for warp 0, r3 for warp 1.

Other High-Performance GPUs
• ATI Radeon 5000 series

ATI Radeon 5000 Series Architecture

Radeon SIMD Engine
• 16 Stream Cores (SC)
• Local Data Share

VLIW Stream Core (SC)

Local Data Share (LDS)

Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Performance Optimization
Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure
Optimizations on memory bandwidth:
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping
Optimizations on computation efficiency:
• Mul/Add balancing
• Increase the floating-point proportion
Optimizations on operational intensity:
• Use tiled algorithms
• Tune the thread granularity

Shared Memory Contains Multiple Banks

Compute Capability
We need architecture information (the compute capability) to perform these optimizations.
ref: NVIDIA, "CUDA C Programming Guide", (link)

Shared Memory (compute capability 2.x)
Without bank conflict:
With bank conflict:
(A code sketch illustrating bank-conflict avoidance is given at the end of this document.)

Global Memory in Off-Chip DRAM
The address space is interleaved among multiple channels.

Global Memory

Roofline Model
Identify the performance bottleneck: computation bound vs. bandwidth bound.

Optimization Is Key for Attainable GFlop/s

Computation, Bandwidth, Latency
Illustrating the three bottlenecks in the Roofline model.

Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Trends
Coming architectures:
• Intel's Larrabee successor: Many Integrated Core (MIC)
• CPU/GPU fusion: Intel Sandy Bridge, AMD Llano
Intel Many Integrated Core (MIC)
The 32-core version of MIC:

Intel Sandy Bridge
Highlights:
• Reconfigurable L3 cache shared between the CPU and the GPU
• Ring bus

Sandy Bridge's New CPU-GPU Interface
ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech, (link)

AMD Llano Fusion APU (expected Q3 2011)
Notes:
• The CPU and GPU do not appear to share a cache.
• The interface between the CPU and GPU is unknown.

GPU Research in the ES Group
GPU research in the Electronic Systems group: http://www.es.ele.tue.nl/~gpuattue/
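To make the bank-conflict and coalescing items from the performance-optimization section concrete, a minimal sketch of a tiled matrix transpose follows. The kernel itself is illustrative (not taken from the slides), but padding the shared tile by one column is the standard way to keep the threads of a warp in different banks on compute capability 2.x, while the tiling keeps both the global read and the global write coalesced.

#include <cuda_runtime.h>

#define TILE 32

// Illustrative only: transpose a square matrix tile-by-tile through shared memory.
// Launch with dim3(width/TILE, width/TILE) blocks of dim3(TILE, TILE) threads.
__global__ void transposeTile(const float *in, float *out, int width) {
    // Padding the second dimension by 1 makes a column access touch 32
    // different banks instead of hitting the same bank 32 times.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced global read

    __syncthreads();                                        // tile fully loaded

    int tx = blockIdx.y * TILE + threadIdx.x;               // swapped tile origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // coalesced write, conflict-free read
}

The sketch assumes 32x32 thread blocks and a square matrix whose width is a multiple of 32; removing the "+ 1" padding reintroduces a 32-way bank conflict on the column read in the last line.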