EE382A – Autumn 2009, Christos Kozyrakis
Lecture 15: Graphics Processors
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a

Announcements
• Readings for today: P&H appendix A
  – Some Nvidia hype included, but overall a good and detailed GPU discussion
• Due today: summary of the paper handed out in L14
• Exam on Fri 11/13, 9am – noon, room 200-305
  – All lectures + required papers
  – Closed books, 1 page of notes, calculator
  – Review session on Friday 11/6, 2-3pm, Gates Hall Room 498

Review: Vector Processors
[Figure: SCALAR (1 operation) vs. VECTOR (N operations) – a scalar add (add r3, r1, r2) produces one result from r1 and r2, while a vector add (vadd.vv v3, v1, v2) adds v1 and v2 element-wise into v3 over the full vector length]
• Scalar processors operate on single numbers (scalars)
• Vector processors operate on vectors of numbers
  – Linear sequences of numbers

Review: Advantages of Vector ISAs
• Compact: a single instruction defines N operations
  – Also reduces the frequency of branches
• Parallel: the N operations are (data) parallel
  – No dependencies
  – No need for complex hardware to detect parallelism (similar to VLIW)
  – Can execute in parallel assuming N parallel datapaths
• Expressive: memory operations describe patterns
  – Continuous or regular memory access patterns
  – Can prefetch or accelerate using wide/multi-banked memory
  – Can amortize the high latency of the 1st element over a large sequential pattern

Review: Vector Lanes
[Figure: a vector lane – a pipelined datapath with a functional unit and its partition of the vector register elements, connected to the memory system]
• Elements of each vector register are interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions

The Timeline of Vector Processors
• Widely used for supercomputing systems in the 70s – 90s
  – Cray, CDC, Convex, TI, IBM, …
• Fell out of fashion in the 80s and 90s
  – Difficult to fit a vector processor in a single chip
  – Building supercomputers out of commodity microprocessors
• Remaining vector supercomputer: NEC SX-9
  – 8 lanes (5 functional units), 8+64 vregs (256 elements/reg), 3.2GHz
• But now vectors are making a comeback
  – Short vectors in all ISAs (SIMD), Intel Larrabee, …
  – Why?
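Before asking which applications fit, here is a minimal sketch (plain C-style code with made-up names) of the kind of data-parallel loop the vadd.vv example above captures; a vector ISA encodes the entire loop body as one instruction, while a scalar ISA needs n adds plus loop overhead:

```cuda
// Hypothetical data-parallel loop: every iteration is independent,
// so a vector ISA can execute it as a single vadd.vv with the vector
// length set to n (strip-mined if n exceeds the register length).
void add_vectors(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];   // no dependence between iterations
}
```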
Which Applications Fit the Vector Model?
• Vectors are great when we have data-level parallelism (DLP)
  – Most efficient way to exploit DLP
  – Remember, we can also exploit DLP as ILP or TLP
    • On a superscalar or a multiprocessor
• Which applications have DLP?
  – Scientific computing
    • Weather forecasting, car-crash simulation, biological modeling
    • Vector processors were invented for this purpose (supercomputers)
  – Multimedia computing
    • Speech, image, and video processing
    • Identical operations executed on streams or arrays of sound samples, pixels, and video frames
    • The reason for the recent revival of vector architectures
• Multimedia on embedded devices
  – Need high performance @ low power @ low complexity @ small code size

Vector Power Consumption
• Can trade off parallelism for power
  – Power = C · Vdd² · F
  – If we double the lanes, peak performance doubles
  – Halving F restores the original peak performance but also allows halving Vdd
  – Power_new = (2C) · (Vdd/2)² · (F/2) = Power/4
• Simpler logic for a large number of operations/cycle
  – Replicated control for all lanes
  – No multiple-issue or dynamic execution logic
• Simpler to gate clocks
  – Each vector instruction explicitly describes all the resources it needs for a number of cycles
  – Conditional execution leads to further savings

SIMD Extensions for Superscalar Processors
• Every CISC/RISC processor today has SIMD extensions
  – MMX, SSE, SSE2, SSE3, 3DNow!, AltiVec, VIS, …
• Basic idea: accelerate multimedia processing
  – Define vectors of 16- and 32-bit elements in regular registers
  – Apply SIMD arithmetic on these vectors
• Nice and cheap
  – Don't need to define a big vector register file
    • Takes up area and complicates exceptions
  – All we need to do
    • Add the proper opcodes for SIMD arithmetic
    • Modify datapaths to execute SIMD arithmetic
  – Certain operations are easier on short vectors
    • Reductions, random permutations

Example of Simple SIMD Instruction
[Figure: SIMD ADD – two 64-bit registers are added as four 16-bit lanes in parallel, producing a 64-bit result]

Example of Fancy SIMD Instruction
[Figure: Sum of partial products – the four 16-bit lanes of two 64-bit registers are multiplied, and adjacent products are added pairwise into a 64-bit result]

Loading & Storing SIMD Values
• Typical case: no vector-like loads & stores
  – Must use regular 64-bit load/store instructions
  – Problems: data sizes, alignment, strides
• Solution: multiple loads/stores & manipulation instructions
  – Pack & unpack
    • To solve problems with data sizes
  – Rotate & shift
    • To solve problems with alignment

Problems with SIMD Extensions
• SIMD defines short, fixed-size vectors
  – Cannot capture data parallelism wider than 64 bits
  – Must use wide issue to utilize more than 64-bit datapaths
  – SSE and AltiVec have switched to 128 bits because of this
• SIMD does not support vector memory accesses
  – Strided and indexed accesses for narrow elements
  – Need multi-instruction sequences to emulate
    • Pack, unpack, shift, rotate, merge, etc.
  – Cancels most of the performance and code density benefits of vectors
• Compiler support for SIMD?
  – They change too often…
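A minimal sketch of the two SIMD operations drawn a few slides above, using Intel's SSE2 intrinsics (the 128-bit descendants of the 64-bit MMX forms paddw and pmaddwd shown in the figures); the data values and variable names are made up for illustration:

```cuda
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

int main() {
    // Eight 16-bit lanes per 128-bit register (the slides show the older 64-bit, 4-lane case).
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);   // lane 0 holds 1, lane 7 holds 8
    __m128i b = _mm_set_epi16(1, 1, 1, 1, 1, 1, 1, 1);

    __m128i sum  = _mm_add_epi16(a, b);    // simple SIMD add: eight 16-bit adds in one instruction
    __m128i sopp = _mm_madd_epi16(a, b);   // "sum of partial products": 16-bit multiplies,
                                           // adjacent 32-bit products added pairwise (pmaddwd)

    short s[8]; int p[4];
    _mm_storeu_si128((__m128i*)s, sum);
    _mm_storeu_si128((__m128i*)p, sopp);
    printf("add:  %d %d ... %d\n", s[0], s[1], s[7]);        // 2 3 ... 9
    printf("madd: %d %d %d %d\n", p[0], p[1], p[2], p[3]);   // 3 7 11 15
    return 0;
}
```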
Intel Larrabee: The Design Tradeoff for Data-Level Parallelism
• CPU design experiment: specify a throughput-optimized processor with the same area and power as a standard dual-core CPU.

                           Dual-core out-of-order CPU   Throughput-optimized CPU
  # CPU cores              2 out-of-order               10 in-order
  Instructions per issue   4 per clock                  2 per clock
  VPU lanes per core       4-wide SSE                   16-wide
  L2 cache size            4 MB                         4 MB
  Single-stream            4 per clock                  2 per clock
  Vector throughput        8 per clock                  160 per clock

• 20 times the multiply-add operations per clock: peak vector throughput for the given power and area. Ideal for graphics & other throughput applications.
• Data in table from Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture for visual computing. SIGGRAPH '08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY.

Intel Larrabee: A Single-Chip Vector Multiprocessor
[Figure: Intel® Microarchitecture (Larrabee) – multi-threaded wide-SIMD cores, each with I$ and D$, attached to a partitioned L2 cache on a ring, plus memory controllers, fixed-function texture logic, a display interface, and a system interface]
• 2-way issue, in-order cores with vector capabilities
• + 4-way multithreaded
• Cores communicate on a wide ring bus
• L2 cache is partitioned among the cores
  – Provides high aggregate bandwidth
  – Allows data replication & sharing

Larrabee x86 Core Block Diagram
[Figure: Intel® Microarchitecture (Larrabee) core – instruction decode feeding a scalar unit (scalar registers) and a vector unit (vector registers), L1 I-cache & D-cache, a 256 KB local subset of the L2 cache, and the ring connection]
• Separate scalar and vector units with separate registers
• In-order x86 scalar core
• Vector unit: 16 32-bit ops/clock
• Short execution pipelines
• Fast access from the L1 cache
• Direct connection to each core's subset of the L2 cache
• Prefetch instructions load the L1 and L2 caches

Larrabee Vector Unit Block Diagram
[Figure: vector unit – mask registers, 16-wide vector ALU, vector registers, replicate/reorder and numeric-convert stages on the path from the L1 data cache]
• Complete vector instruction set
  – 32 vector registers (512 bits), 8 mask registers
  – Scatter/gather for vector load/store
  – Mask registers select lanes to write, which allows data-parallel flow control
  – This enables mapping a separate execution kernel to each VPU lane
• Vector instructions support
  – Fast reads from the L1 cache
  – Numeric type conversion and data replication while reading from memory
  – Rearranging the lanes on register read
  – Fused multiply-add (three arguments)
  – Int32, Float32, and Float64 data

Summary
• Vector processors
  – Processors that operate on linear sequences of numbers
    • Vector add, vector load, vector store, …
  – Can express and exploit data-level parallelism in applications
• SIMD extensions
  – Short vector extensions for ILP processors
  – Get some of the advantages of vector processors without most of the cost
• Remember what Jim Smith said:
  – "The most efficient way to execute a vectorizable application is a vector processor"

Graphics Processors (GPUs)

Graphics Processors Timeline
• Till the mid 90s
  – VGA controllers used to accelerate some display functions
• Mid 90s to mid 00s
  – Fixed-function graphics accelerators for the OpenGL and DirectX APIs
    • Some GP-GPU capabilities layered on top of these interfaces
  – 3D graphics: triangle setup & rasterization, texture mapping & shading
• Modern GPUs
  – Programmable multiprocessors (optimized for data-parallel ops)
    • OpenGL/DirectX and general-purpose languages
  – Some fixed-function hardware (texture, raster ops, …)
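As a concrete (and hypothetical) illustration of "programmable multiprocessors with general-purpose languages", here is a minimal CUDA sketch; the kernel and variable names are made up, and the thread/block terminology it uses is introduced on the thread-model slide below:

```cuda
// Hypothetical SPMD kernel: every thread runs the same program on its own element.
__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side (assuming d_a, d_b, d_c were already allocated on the GPU with cudaMalloc):
// vadd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // enough 256-thread blocks to cover n
```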
GPU's Role in Modern Workstations
• Coprocessor to the CPU
• PCIe-based interconnect
  – 8 GB/sec per direction
• Separate GPU memory
  – Aka the frame buffer
  – Provides high-bandwidth access to local data
• Upcoming trend
  – Fusion: CPU + GPU integration

GPU Thread Model (Software View)
• Single-program multiple-data (SPMD) model
• Each thread has local memory
• Parallel threads packed in blocks
  – Access to per-block shared memory
  – Can synchronize with a barrier
• Grids include independent blocks (groups of threads)
  – May execute concurrently

GPU Architecture: Nvidia GeForce 8800 (aka Tesla Architecture)

GPU Architecture
• A highly multithreaded, multiprocessor system
  – 100s of streaming processors (SPs)
  – 8 SPs in a streaming multiprocessor (SM) with some caches
  – 2 SMs in a texture processor cluster (TPC) with one texture pipe
• Scheduling controlled mostly by hardware
• Scalability
  – By scaling the number of TPCs and memory channels
• Fixed-function components for graphics
  – Texture pipes and caches, raster operation units (ROPs), …

Streaming Multiprocessor
[Figure: streaming multiprocessor block diagram]

Streaming Multiprocessor Details
• Each SP is a simple processor core
  – 1024 32-bit registers shared flexibly by up to 64 threads
  – Integer and floating-point arithmetic units
    • Including multiply-add
• Special function unit
  – Implements functions such as divide, square root, sine, cosine, …
• Instruction cache and constant cache
  – Shared by all threads
• Multibanked shared memory
  – E.g. 16 banks to allow parallel accesses by all SPs
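Tying the software view back to this hardware, here is a minimal, hypothetical CUDA sketch in which each thread block uses the SM's banked shared memory and synchronizes with the per-block barrier (the 256-thread block size and all names are illustrative):

```cuda
// Hypothetical per-block sum: the __shared__ array lives in the SM's banked shared
// memory, and __syncthreads() is the per-block barrier from the thread-model slide.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];                    // per-block shared memory
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                              // wait until the whole block has loaded

    // Tree reduction; assumes blockDim.x == 256 (a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];                 // one partial sum per block
}
```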
Instruction and Thread Scheduling: Where Thread Parallelism Meets Data Parallelism
• In theory, all threads can be independent
  – SM hardware implements zero-overhead switching
• For efficiency, 32 threads are packed in warps
  – Warp: a set of parallel threads that execute the same instruction
  – Warps introduce data parallelism (SIMT)
  – 1 warp instruction keeps the SPs busy for 4 cycles
• Individual threads may be inactive
  – Because they branched differently, or due to predication
  – Loss of efficiency if not data parallel
• Software thread blocks mapped to warps
  – When HW resources are available

Instruction Buffering & Warp Scheduling
[Figure: SM pipeline – L1 I$, multithreaded instruction buffer, L1 constant cache (C$), operand select, register file (RF), shared memory, MAD units, and SFU]
• Fetch one instruction/cycle
  – From the L1 instruction cache into an instruction buffer slot
• Issue one "ready-to-go" instruction/cycle
  – All elements of the warp must be ready
  – Scoreboarding used to track hazards and determine ready warps
  – Round-robin or age-based selection between ready warps
• Instruction broadcast to all SPs
  – Will keep the SPs busy for up to 4 cycles

Dependency Tracking Using a Scoreboard
• Status of all register operands is tracked
  – RAW hazards for high-latency operations
  – Dependencies on memory operations
• Instructions become ready when all register operands are ready for the whole warp
  – Divergence of threads within warps is also tracked
• A warp may be blocked because of
  – Dependencies through registers
  – Synchronization operations (barriers or atomic ops)
• But other warps can proceed in order to hide latency

Memory System
• Per-SM caches/memories
  – Instruction and constant caches
  – Multi-banked shared memory
• Distributed texture cache
  – Per-TPC L1 and distributed L2 cache
  – Specialized for texture accesses in the graphics pipeline
  – Moving towards a generalized and shared L2 in upcoming chips
• Multi-channel DRAM main memory (e.g. 8 DDR3 channels)
  – Interleaved addresses to achieve higher bandwidth
  – Lossless and lossy compression used to increase bandwidth
  – Aggressive access scheduling used to increase bandwidth
• Per-thread private memory and global memory mapped to DRAM
  – Relying on threads to hide long latencies

Synchronization
• Barrier synchronization within a thread block
  – Tracking simplified by grouping threads into warps
  – A counter tracks the number of threads that have arrived at the barrier
• Atomic operations to global memory
  – Atomic read-modify-write (add, min, max, and, or, xor)
  – Atomic exchange or compare-and-swap
  – They are tied to DRAM latency
    • Moving to a shared L2 in upcoming chips

GPUs vs. Vector Processors: Discussion
• How are GPUs similar to or different from vector processors?
  – What are the primary issues to consider here?
• How are GPUs different from an architecture like Larrabee?

Why is Data Parallelism Interesting Again: Building a 100TF Datacenter

                       CPU 1U Server         GPU 1U System (Tesla)
  Compute              4 CPU cores           4 GPUs: 960 cores
  Performance          0.07 Teraflops        4 Teraflops
  Price                $2,000                $8,000
  Power                400 W                 700 W

  100 TF system        1429 CPU servers      25 CPU servers + 25 Tesla systems
  Cost                 $3.1 M                $0.31 M   (10x lower cost)
  Power                571 kW                27 kW     (21x lower power)
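A back-of-the-envelope check of the server counts and power figures in this table; a minimal sketch using only the per-node numbers above (plain host code, all names made up; the slide's dollar totals likely fold in costs beyond the listed per-node prices):

```cuda
// Rough check of the 100 TFLOP comparison: how many 1U nodes of each kind
// are needed, and how much power they draw (per-node figures from the slide).
#include <cstdio>
#include <cmath>

int main() {
    const double target_tf    = 100.0;
    const double cpu_tf       = 0.07, cpu_watts   = 400.0;  // 4-core CPU 1U server
    const double tesla_tf     = 4.0,  tesla_watts = 700.0;  // 4-GPU Tesla 1U system

    const int cpu_servers   = (int)std::ceil(target_tf / cpu_tf);    // ~1429
    const int tesla_systems = (int)std::ceil(target_tf / tesla_tf);  // 25 (plus 25 host servers)

    const double cpu_kw = cpu_servers * cpu_watts / 1000.0;                    // ~571 kW
    const double gpu_kw = tesla_systems * (tesla_watts + cpu_watts) / 1000.0;  // ~27 kW

    printf("CPU only: %d servers, %.1f kW\n", cpu_servers, cpu_kw);
    printf("GPU:      %d Tesla + %d host servers, %.1f kW (about %.0fx less power)\n",
           tesla_systems, tesla_systems, gpu_kw, cpu_kw / gpu_kw);
    return 0;
}
```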