EE382A – Autumn 2009, Christos Kozyrakis
Lecture 15: Graphics Processors
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a

Announcements
• Readings for today: P&H appendix A
  – Some Nvidia hype included, but overall a good and detailed GPU discussion
• Due today: summary of the paper handed out in L14
• Exam on Fri 11/13, 9am – noon, room 200-305
  – All lectures + required papers
  – Closed books, 1 page of notes, calculator
  – Review session on Friday 11/6, 2-3pm, Gates Hall Room 498

Review: Vector Processors
[Figure: SCALAR (1 operation) vs. VECTOR (N operations) – a scalar add (add r3, r1, r2) produces one result from r1 and r2, while a vector add (vadd.vv v3, v1, v2) adds v1 and v2 element-wise into v3 over the full vector length]
• Scalar processors operate on single numbers (scalars)
• Vector processors operate on vectors of numbers
  – Linear sequences of numbers

Review: Advantages of Vector ISAs
• Compact: a single instruction defines N operations
  – Also reduces the frequency of branches
• Parallel: the N operations are (data) parallel
  – No dependencies
  – No need for complex hardware to detect parallelism (similar to VLIW)
  – Can execute in parallel assuming N parallel datapaths
• Expressive: memory operations describe patterns
  – Continuous or regular memory access patterns
  – Can prefetch or accelerate using wide/multi-banked memory
  – Can amortize the high latency of the 1st element over a large sequential pattern

Review: Vector Lanes
[Figure: a vector lane – a pipelined datapath with a functional unit and its partition of the vector register elements, connected to the memory system]
• Elements of each vector register are interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions

The Timeline of Vector Processors
• Widely used for supercomputing systems in the 70s – 90s
  – Cray, CDC, Convex, TI, IBM, …
• Fell out of fashion in the 80s and 90s
  – Difficult to fit a vector processor in a single chip
  – Building supercomputers out of commodity microprocessors
• Remaining vector supercomputer: NEC SX-9
  – 8 lanes (5 functional units), 8+64 vregs (256 elements/reg), 3.2GHz
• But now vectors are making a comeback
  – Short vectors in all ISAs (SIMD), Intel Larrabee, …
  – Why?
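Before asking which applications fit, here is a minimal sketch (plain C-style code with made-up names) of the kind of data-parallel loop the vadd.vv example above captures; a vector ISA encodes the entire loop body as one instruction, while a scalar ISA needs n adds plus loop overhead:

```cuda
// Hypothetical data-parallel loop: every iteration is independent,
// so a vector ISA can execute it as a single vadd.vv with the vector
// length set to n (strip-mined if n exceeds the register length).
void add_vectors(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];   // no dependence between iterations
}
```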
Which Applications Fit the Vector Model?
• Vectors are great when we have data-level parallelism (DLP)
  – Most efficient way to exploit DLP
  – Remember, we can also exploit DLP as ILP or TLP
    • On a superscalar or a multiprocessor
• Which applications have DLP?
  – Scientific computing
    • Weather forecasting, car-crash simulation, biological modeling
    • Vector processors were invented for this purpose (supercomputers)
  – Multimedia computing
    • Speech, image, and video processing
    • Identical operations executed on streams or arrays of sound samples, pixels, and video frames
    • The reason for the recent revival of vector architectures
• Multimedia on embedded devices
  – Need high performance @ low power @ low complexity @ small code size

Vector Power Consumption
• Can trade off parallelism for power
  – Power = C · Vdd² · F
  – If we double the lanes, peak performance doubles
  – Halving F restores the original peak performance but also allows halving Vdd
  – Power_new = (2C) · (Vdd/2)² · (F/2) = Power/4
• Simpler logic for a large number of operations/cycle
  – Replicated control for all lanes
  – No multiple-issue or dynamic execution logic
• Simpler to gate clocks
  – Each vector instruction explicitly describes all the resources it needs for a number of cycles
  – Conditional execution leads to further savings

SIMD Extensions for Superscalar Processors
• Every CISC/RISC processor today has SIMD extensions
  – MMX, SSE, SSE2, SSE3, 3DNow!, AltiVec, VIS, …
• Basic idea: accelerate multimedia processing
  – Define vectors of 16- and 32-bit elements in regular registers
  – Apply SIMD arithmetic on these vectors
• Nice and cheap
  – Don't need to define a big vector register file
    • Takes up area and complicates exceptions
  – All we need to do
    • Add the proper opcodes for SIMD arithmetic
    • Modify datapaths to execute SIMD arithmetic
  – Certain operations are easier on short vectors
    • Reductions, random permutations

Example of Simple SIMD Instruction
[Figure: SIMD ADD – two 64-bit registers are added as four 16-bit lanes in parallel, producing a 64-bit result]

Example of Fancy SIMD Instruction
[Figure: Sum of partial products – the four 16-bit lanes of two 64-bit registers are multiplied, and adjacent products are added pairwise into a 64-bit result]

Loading & Storing SIMD Values
• Typical case: no vector-like loads & stores
  – Must use regular 64-bit load/store instructions
  – Problems: data sizes, alignment, strides
• Solution: multiple loads/stores & manipulation instructions
  – Pack & unpack
    • To solve problems with data sizes
  – Rotate & shift
    • To solve problems with alignment

Problems with SIMD Extensions
• SIMD defines short, fixed-size vectors
  – Cannot capture data parallelism wider than 64 bits
  – Must use wide issue to utilize more than 64-bit datapaths
  – SSE and AltiVec have switched to 128 bits because of this
• SIMD does not support vector memory accesses
  – Strided and indexed accesses for narrow elements
  – Need multi-instruction sequences to emulate
    • Pack, unpack, shift, rotate, merge, etc.
  – Cancels most of the performance and code density benefits of vectors
• Compiler support for SIMD?
  – They change too often…
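A minimal sketch of the two SIMD operations drawn a few slides above, using Intel's SSE2 intrinsics (the 128-bit descendants of the 64-bit MMX forms paddw and pmaddwd shown in the figures); the data values and variable names are made up for illustration:

```cuda
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

int main() {
    // Eight 16-bit lanes per 128-bit register (the slides show the older 64-bit, 4-lane case).
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);   // lane 0 holds 1, lane 7 holds 8
    __m128i b = _mm_set_epi16(1, 1, 1, 1, 1, 1, 1, 1);

    __m128i sum  = _mm_add_epi16(a, b);    // simple SIMD add: eight 16-bit adds in one instruction
    __m128i sopp = _mm_madd_epi16(a, b);   // "sum of partial products": 16-bit multiplies,
                                           // adjacent 32-bit products added pairwise (pmaddwd)

    short s[8]; int p[4];
    _mm_storeu_si128((__m128i*)s, sum);
    _mm_storeu_si128((__m128i*)p, sopp);
    printf("add:  %d %d ... %d\n", s[0], s[1], s[7]);        // 2 3 ... 9
    printf("madd: %d %d %d %d\n", p[0], p[1], p[2], p[3]);   // 3 7 11 15
    return 0;
}
```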
Intel Larrabee: The Design Tradeoff for Data-Level Parallelism
• CPU design experiment: specify a throughput-optimized processor with the same area and power as a standard dual-core CPU.

                           Dual-core out-of-order CPU   Throughput-optimized CPU
  # CPU cores              2 out-of-order               10 in-order
  Instructions per issue   4 per clock                  2 per clock
  VPU lanes per core       4-wide SSE                   16-wide
  L2 cache size            4 MB                         4 MB
  Single-stream            4 per clock                  2 per clock
  Vector throughput        8 per clock                  160 per clock

• 20 times the multiply-add operations per clock: peak vector throughput for the given power and area. Ideal for graphics & other throughput applications.
• Data in table from Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture for visual computing. SIGGRAPH '08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY.

Intel Larrabee: A Single-Chip Vector Multiprocessor
[Figure: Intel® Microarchitecture (Larrabee) – multi-threaded wide-SIMD cores, each with I$ and D$, attached to a partitioned L2 cache on a ring, plus memory controllers, fixed-function texture logic, a display interface, and a system interface]
• 2-way issue, in-order cores with vector capabilities
• + 4-way multithreaded
• Cores communicate on a wide ring bus
• L2 cache is partitioned among the cores
  – Provides high aggregate bandwidth
  – Allows data replication & sharing

Larrabee x86 Core Block Diagram
[Figure: Intel® Microarchitecture (Larrabee) core – instruction decode feeding a scalar unit (scalar registers) and a vector unit (vector registers), L1 I-cache & D-cache, a 256 KB local subset of the L2 cache, and the ring connection]
• Separate scalar and vector units with separate registers
• In-order x86 scalar core
• Vector unit: 16 32-bit ops/clock
• Short execution pipelines
• Fast access from the L1 cache
• Direct connection to each core's subset of the L2 cache
• Prefetch instructions load the L1 and L2 caches

Larrabee Vector Unit Block Diagram
[Figure: vector unit – mask registers, 16-wide vector ALU, vector registers, replicate/reorder and numeric-convert stages on the path from the L1 data cache]
• Complete vector instruction set
  – 32 vector registers (512 bits), 8 mask registers
  – Scatter/gather for vector load/store
  – Mask registers select lanes to write, which allows data-parallel flow control
  – This enables mapping a separate execution kernel to each VPU lane
• Vector instructions support
  – Fast reads from the L1 cache
  – Numeric type conversion and data replication while reading from memory
  – Rearranging the lanes on register read
  – Fused multiply-add (three arguments)
  – Int32, Float32, and Float64 data

Summary
• Vector processors
  – Processors that operate on linear sequences of numbers
    • Vector add, vector load, vector store, …
  – Can express and exploit data-level parallelism in applications
• SIMD extensions
  – Short vector extensions for ILP processors
  – Get some of the advantages of vector processors without most of the cost
• Remember what Jim Smith said:
  – "The most efficient way to execute a vectorizable application is a vector processor"

Graphics Processors (GPUs)

Graphics Processors Timeline
• Till the mid 90s
  – VGA controllers used to accelerate some display functions
• Mid 90s to mid 00s
  – Fixed-function graphics accelerators for the OpenGL and DirectX APIs
    • Some GP-GPU capabilities layered on top of these interfaces
  – 3D graphics: triangle setup & rasterization, texture mapping & shading
• Modern GPUs
  – Programmable multiprocessors (optimized for data-parallel ops)
    • OpenGL/DirectX and general-purpose languages
  – Some fixed-function hardware (texture, raster ops, …)
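As a concrete (and hypothetical) illustration of "programmable multiprocessors with general-purpose languages", here is a minimal CUDA sketch; the kernel and variable names are made up, and the thread/block terminology it uses is introduced on the thread-model slide below:

```cuda
// Hypothetical SPMD kernel: every thread runs the same program on its own element.
__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side (assuming d_a, d_b, d_c were already allocated on the GPU with cudaMalloc):
// vadd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // enough 256-thread blocks to cover n
```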
GPU's Role in Modern Workstations
• Coprocessor to the CPU
• PCIe-based interconnect
  – 8 GB/sec per direction
• Separate GPU memory
  – Aka the frame buffer
  – Provides high-bandwidth access to local data
• Upcoming trend
  – Fusion: CPU + GPU integration

GPU Thread Model (Software View)
• Single-program multiple-data (SPMD) model
• Each thread has local memory
• Parallel threads packed in blocks
  – Access to per-block shared memory
  – Can synchronize with a barrier
• Grids include independent blocks (groups of threads)
  – May execute concurrently

GPU Architecture: Nvidia GeForce 8800 (aka Tesla Architecture)

GPU Architecture
• A highly multithreaded, multiprocessor system
  – 100s of streaming processors (SPs)
  – 8 SPs in a streaming multiprocessor (SM) with some caches
  – 2 SMs in a texture processor cluster (TPC) with one texture pipe
• Scheduling controlled mostly by hardware
• Scalability
  – By scaling the number of TPCs and memory channels
• Fixed-function components for graphics
  – Texture pipes and caches, raster operation units (ROPs), …

Streaming Multiprocessor
[Figure: streaming multiprocessor block diagram]

Streaming Multiprocessor Details
• Each SP is a simple processor core
  – 1024 32-bit registers shared flexibly by up to 64 threads
  – Integer and floating-point arithmetic units
    • Including multiply-add
• Special function unit
  – Implements functions such as divide, square root, sine, cosine, …
• Instruction cache and constant cache
  – Shared by all threads
• Multibanked shared memory
  – E.g. 16 banks to allow parallel accesses by all SPs
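Tying the software view back to this hardware, here is a minimal, hypothetical CUDA sketch in which each thread block uses the SM's banked shared memory and synchronizes with the per-block barrier (the 256-thread block size and all names are illustrative):

```cuda
// Hypothetical per-block sum: the __shared__ array lives in the SM's banked shared
// memory, and __syncthreads() is the per-block barrier from the thread-model slide.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];                    // per-block shared memory
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                              // wait until the whole block has loaded

    // Tree reduction; assumes blockDim.x == 256 (a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];                 // one partial sum per block
}
```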
Instruction and Thread Scheduling: Where Thread Parallelism Meets Data Parallelism
• In theory, all threads can be independent
  – SM hardware implements zero-overhead switching
• For efficiency, 32 threads are packed in warps
  – Warp: a set of parallel threads that execute the same instruction
  – Warps introduce data parallelism (SIMT)
  – 1 warp instruction keeps the SPs busy for 4 cycles
• Individual threads may be inactive
  – Because they branched differently, or due to predication
  – Loss of efficiency if not data parallel
• Software thread blocks mapped to warps
  – When HW resources are available

Instruction Buffering & Warp Scheduling
[Figure: SM pipeline – L1 I$, multithreaded instruction buffer, L1 constant cache (C$), operand select, register file (RF), shared memory, MAD units, and SFU]
• Fetch one instruction/cycle
  – From the L1 instruction cache into an instruction buffer slot
• Issue one "ready-to-go" instruction/cycle
  – All elements of the warp must be ready
  – Scoreboarding used to track hazards and determine ready warps
  – Round-robin or age-based selection between ready warps
• Instruction broadcast to all SPs
  – Will keep the SPs busy for up to 4 cycles

Dependency Tracking Using a Scoreboard
• Status of all register operands is tracked
  – RAW hazards for high-latency operations
  – Dependencies on memory operations
• Instructions become ready when all register operands are ready for the whole warp
  – Divergence of threads within warps is also tracked
• A warp may be blocked because of
  – Dependencies through registers
  – Synchronization operations (barriers or atomic ops)
• But other warps can proceed in order to hide latency

Memory System
• Per-SM caches/memories
  – Instruction and constant caches
  – Multi-banked shared memory
• Distributed texture cache
  – Per-TPC L1 and distributed L2 cache
  – Specialized for texture accesses in the graphics pipeline
  – Moving towards a generalized and shared L2 in upcoming chips
• Multi-channel DRAM main memory (e.g. 8 DDR3 channels)
  – Interleaved addresses to achieve higher bandwidth
  – Lossless and lossy compression used to increase bandwidth
  – Aggressive access scheduling used to increase bandwidth
• Per-thread private memory and global memory mapped to DRAM
  – Relying on threads to hide long latencies

Synchronization
• Barrier synchronization within a thread block
  – Tracking simplified by grouping threads into warps
  – A counter tracks the number of threads that have arrived at the barrier
• Atomic operations to global memory
  – Atomic read-modify-write (add, min, max, and, or, xor)
  – Atomic exchange or compare-and-swap
  – They are tied to DRAM latency
    • Moving to a shared L2 in upcoming chips

GPUs vs. Vector Processors: Discussion
• How are GPUs similar to or different from vector processors?
  – What are the primary issues to consider here?
• How are GPUs different from an architecture like Larrabee?

Why is Data Parallelism Interesting Again: Building a 100TF Datacenter

                       CPU 1U Server         GPU 1U System (Tesla)
  Compute              4 CPU cores           4 GPUs: 960 cores
  Performance          0.07 Teraflops        4 Teraflops
  Price                $2,000                $8,000
  Power                400 W                 700 W

  100 TF system        1429 CPU servers      25 CPU servers + 25 Tesla systems
  Cost                 $3.1 M                $0.31 M   (10x lower cost)
  Power                571 kW                27 kW     (21x lower power)
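A back-of-the-envelope check of the server counts and power figures in this table; a minimal sketch using only the per-node numbers above (plain host code, all names made up; the slide's dollar totals likely fold in costs beyond the listed per-node prices):

```cuda
// Rough check of the 100 TFLOP comparison: how many 1U nodes of each kind
// are needed, and how much power they draw (per-node figures from the slide).
#include <cstdio>
#include <cmath>

int main() {
    const double target_tf    = 100.0;
    const double cpu_tf       = 0.07, cpu_watts   = 400.0;  // 4-core CPU 1U server
    const double tesla_tf     = 4.0,  tesla_watts = 700.0;  // 4-GPU Tesla 1U system

    const int cpu_servers   = (int)std::ceil(target_tf / cpu_tf);    // ~1429
    const int tesla_systems = (int)std::ceil(target_tf / tesla_tf);  // 25 (plus 25 host servers)

    const double cpu_kw = cpu_servers * cpu_watts / 1000.0;                    // ~571 kW
    const double gpu_kw = tesla_systems * (tesla_watts + cpu_watts) / 1000.0;  // ~27 kW

    printf("CPU only: %d servers, %.1f kW\n", cpu_servers, cpu_kw);
    printf("GPU:      %d Tesla + %d host servers, %.1f kW (about %.0fx less power)\n",
           tesla_systems, tesla_systems, gpu_kw, cpu_kw / gpu_kw);
    return 0;
}
```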