Lecture 15
Graphics Processors
Department of Electrical Engineering
Stanford University
http://eeclass.stanford.edu/ee382a
EE382A – Autumn 2009
Christos Kozyrakis
Announcements
• Readings for today: P&H appendix A
– Some Nvidia hype included, but overall a good and detailed GPU discussion
• Due today: summary of paper handed out in L14
• Exam on Fri 11/13, 9am - noon, room 200-305
– All lectures + required papers
– Closed books, 1 page of notes, calculator
– Review session on Friday 11/6, 2-3pm, Gates Hall Room 498
Review: Vector Processors
[Figure: a scalar add (add r3, r1, r2) reads registers r1 and r2 and produces one result in r3; a vector add (vadd.vv v3, v1, v2) reads vector registers v1 and v2 and produces N results in v3, where N is the vector length.]
• Scalar processors operate on single numbers (scalars)
• Vector processors operate on vectors of numbers
– Linear sequences of numbers
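To make the semantics concrete, here is a minimal C++ sketch of what the two instructions above compute; the function names and the explicit loop are illustrative, not part of any ISA:

    #include <cstddef>

    // Semantics of "add r3, r1, r2": one operation, one result.
    int scalar_add(int r1, int r2) { return r1 + r2; }

    // Semantics of "vadd.vv v3, v1, v2" with vector length vl:
    // N independent additions expressed by a single instruction.
    // A real vector unit would spread these across its lanes in parallel.
    void vector_add(const int* v1, const int* v2, int* v3, size_t vl) {
        for (size_t i = 0; i < vl; ++i)
            v3[i] = v1[i] + v2[i];
    }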
Review: Advantages of Vector ISAs
• Compact: single instruction defines N operations
– Also reduces the frequency of branches
• Parallel: N operations are (data) parallel
– No dependencies
– No need for complex hardware to detect parallelism (similar to VLIW)
– Can execute in parallel assuming N parallel datapaths
• Expressive: memory operations describe patterns
– Contiguous or regular (strided) memory access patterns
– Can prefetch or accelerate using wide/multi-banked memory
– Can amortize high latency for 1st element over large sequential pattern
Review: Vector Lanes
[Figure: a vector unit built from multiple pipelined lanes; each lane holds a partition of the vector register file (its share of the elements), a functional-unit datapath, and a connection to the memory system.]
• Elements of each vector register interleaved across the lanes
• Each lane receives identical control
• Multiple element operations executed per cycle
• Modular, scalable design
• No need for inter-lane communication for most vector instructions
The Timeline of Vector Processors
• Widely used for supercomputing systems in the 70s – 90s
– Cray, CDC, Convex, TI, IBM, ..
• Fell out of fashion in the 80s and 90s
– Difficult to fit a vector processor in a single chip
– Building supercomputers out of commodity microprocessors
• Remaining vector supercomputer: NEC SX-9
– 8 lanes (5 functional units), 8+64 vregs (256 elements/reg), 3.2GHz
• But now vectors are making a comeback
– Short vectors in all ISAs (SIMD), Intel Larrabee, …
– Why?
Which Applications Fit the Vector Model?
• Vectors are great when we have data-level parallelism (DLP)
– Most efficient way to exploit DLP
– Remember, we can exploit DLP as ILP or TLP
• On a superscalar or multiprocessor
• Which applications have DLP?
– Scientific computing
• Weather forecast, car-crash simulation, biological modeling
• Vector processors were invented for this purpose (supercomputers)
– Multimedia computing
• Speech, image, and video processing
• Identical operations executed on streams or arrays of sound samples, pixels,
and video frames
• The reason for the recent revival of vector architectures
• Multimedia on embedded devices
– Need high performance @ low power @ low complexity @ small code size
Vector Power Consumption
• Can trade off parallelism for power
– Power = C · Vdd² · f
– If we double the lanes, peak performance doubles
– Halving f restores peak performance but also allows halving Vdd
– Power_new = (2C) · (Vdd/2)² · (f/2) = Power/4
• Simpler logic for large number of operations/cycle
– Replicated control for all lanes
– No multiple issue or dynamic execution logic
• Simpler to gate clocks
– Each vector instruction explicitly describes all the resources it needs for a
number of cycles
– Conditional execution leads to further savings
SIMD Extensions for Superscalar Processors
• Every CISC/RISC processor today has SIMD extensions
– MMX, SSE, SSE2, SSE3, 3DNow!, AltiVec, VIS, …
• Basic idea: accelerate multimedia processing
– Define vectors of 16 and 32 bit elements in regular registers
– Apply SIMD arithmetic on these vectors
• Nice and cheap
– Don’t need to define big vector register file
• Takes up area and complicates exceptions
– All we need to do
• Add the proper opcodes for SIMD arithmetic
• Modify datapaths to execute SIMD arithmetic
– Certain operations are easier on short vectors
• Reductions, random permutations
Example of Simple SIMD Instruction
[Figure: SIMD ADD — two 64-bit registers, each holding four 16-bit elements, are added lane by lane to produce four 16-bit sums in a 64-bit destination register.]
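The figure's 64-bit, four-lane add corresponds to the MMX paddw instruction. The sketch below uses its 128-bit SSE2 successor (_mm_add_epi16, eight 16-bit lanes), assuming an x86 host compiler; the values are illustrative:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdio>

    int main() {
        // Eight 16-bit elements packed into each 128-bit register.
        alignas(16) short a[8] = { 1,  2,  3,  4,  5,  6,  7,  8};
        alignas(16) short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        alignas(16) short c[8];

        __m128i va = _mm_load_si128(reinterpret_cast<__m128i*>(a));
        __m128i vb = _mm_load_si128(reinterpret_cast<__m128i*>(b));
        __m128i vc = _mm_add_epi16(va, vb);  // 8 independent adds, 1 instruction
        _mm_store_si128(reinterpret_cast<__m128i*>(c), vc);

        for (int i = 0; i < 8; ++i) printf("%d ", c[i]);  // 11 22 33 ... 88
        return 0;
    }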
Example of Fancy SIMD Instruction
[Figure: Sum of Partial Products — four pairs of 16-bit elements from two 64-bit registers are multiplied into temporary products, and adjacent products are added in pairs to produce two 32-bit results in a 64-bit destination register.]
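This operation ships on x86 as pmaddwd. A minimal sketch using the 128-bit SSE2 intrinsic _mm_madd_epi16 (the figure shows the older 64-bit MMX form); the values are illustrative:

    #include <emmintrin.h>  // SSE2
    #include <cstdio>

    int main() {
        // pmaddwd: multiply 16-bit pairs, then add adjacent 32-bit products.
        // Useful for dot products, e.g. in filters and DCTs.
        alignas(16) short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        alignas(16) short b[8] = {1, 1, 1, 1, 2, 2, 2, 2};
        alignas(16) int   r[4];

        __m128i va = _mm_load_si128(reinterpret_cast<__m128i*>(a));
        __m128i vb = _mm_load_si128(reinterpret_cast<__m128i*>(b));
        // r[0]=1*1+2*1, r[1]=3*1+4*1, r[2]=5*2+6*2, r[3]=7*2+8*2
        __m128i vr = _mm_madd_epi16(va, vb);
        _mm_store_si128(reinterpret_cast<__m128i*>(r), vr);

        for (int i = 0; i < 4; ++i) printf("%d ", r[i]);  // 3 7 22 30
        return 0;
    }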
Loading & Storing SIMD Values
• Typical case: no vector-like loads & stores
– Must use regular 64-bit load/store instructions
– Problems: data-sizes, alignment, strides
• Solution: multiple load/stores & manipulation instructions
– Pack & unpack
• To solve problems with data sizes
– Rotate & shift
• To solve problems with alignment
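As an illustration (assuming SSE2 on an x86 host), the sketch below widens unsigned 8-bit pixels to 16 bits by unpacking with zeros, operates at the wider precision, and packs the results back with saturation — the kind of multi-instruction sequence these manipulation instructions enable:

    #include <emmintrin.h>  // SSE2

    // Brighten 16 8-bit pixels: widen, add, pack back with unsigned saturation.
    void brighten16(const unsigned char* src, unsigned char* dst, short amount) {
        __m128i zero = _mm_setzero_si128();
        __m128i add  = _mm_set1_epi16(amount);
        __m128i px   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

        // Unpack: interleave with zeros to widen 8-bit lanes to 16-bit lanes.
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);

        lo = _mm_add_epi16(lo, add);
        hi = _mm_add_epi16(hi, add);

        // Pack: narrow 16-bit lanes back to 8 bits with unsigned saturation.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                         _mm_packus_epi16(lo, hi));
    }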
Problems with SIMD Extensions
• SIMD defines short, fixed-sized, vectors
– Cannot capture data parallelism wider than 64 bits
– Must use wide-issue to utilize more than 64-bit datapaths
– SSE and AltiVec have switched to 128 bits because of this
• SIMD does not support vector memory accesses
– Strided and indexed accesses for narrow elements
– Needs multi-instruction sequence to emulate
• Pack, unpack, shift, rotate, merge, etc
– Cancels most of the performance and code-density benefits of vectors
• Compiler support for SIMD?
– The extensions change too often…
Intel Larrabee:
The Design Tradeoff for Data-level Parallelism
CPU design experiment: specify a throughput-optimized processor with the same area and power as a standard dual-core CPU.

                         Dual-core CPU     Throughput-optimized
# CPU cores              2 out-of-order    10 in-order
Instructions per issue   4 per clock       2 per clock
VPU lanes per core       4-wide SSE        16-wide
L2 cache size            4 MB              4 MB
Single-stream            4 per clock       2 per clock
Vector throughput        8 per clock       160 per clock

20 times the multiply-add operations per clock: peak vector throughput for the given power and area. Ideal for graphics & other throughput applications.
Data in table from Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture
for visual computing. SIGGRAPH ’08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY
Intel Larrabee:
A Single-Chip Vector Multiprocessor
[Figure: Larrabee chip block diagram — multiple multi-threaded cores with wide SIMD units, each with its own I$ and D$, connected by a ring to a partitioned L2 cache, memory controllers, fixed-function texture logic, and the display and system interfaces.]
• 2-way issue, in-order cores with vector capabilities
• Plus 4-way multithreading
• Cores communicate on a wide ring bus
• L2 cache is partitioned among the cores
– Provides high aggregate bandwidth
– Allows data replication & sharing
Larrabee x86 Core Block Diagram
[Figure: core block diagram — instruction decode feeds separate scalar and vector units, each with its own register file; both share the L1 I-cache and D-cache and a 256K local subset of the L2 cache, connected to the ring.]
• Separate scalar and vector units with separate registers
• In-order x86 scalar core
• Vector unit: 16 32-bit ops/clock
• Short execution pipelines
• Fast access from L1 cache
• Direct connection to each core's subset of the L2 cache
• Prefetch instructions load L1 and L2 caches
Larrabee Vector Unit Block Diagram
[Figure: vector unit block diagram — a 16-wide vector ALU fed by the vector registers and mask registers, with replicate, reorder, and numeric-convert stages on the path from the L1 data cache.]
• Complete vector instruction set
– 32 vector registers (512 bits), 8 mask registers
– Scatter/gather for vector load/store
– Mask registers select lanes to write, which allows
data-parallel flow control
– This enables mapping a separate execution
kernel to each VPU lane
• Vector instructions support
– Fast read from L1 cache
– Numeric type conversion and data replication
while reading from memory
– Rearrange the lanes on register read
– Fused multiply add (three arguments)
– Int32, Float32 and Float64 data
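Larrabee's vector instruction set never shipped in this form, but it evolved into AVX-512, which kept the 512-bit registers, mask registers, and fused multiply-add. As a rough analogue (not Larrabee's actual ISA), a sketch of mask-driven data-parallel flow control with AVX-512 intrinsics:

    #include <immintrin.h>  // AVX-512F

    // y[i] = a[i]*b[i] + c[i] where a[i] > 0; elsewhere y[i] = c[i].
    // The compare writes a mask register; the FMA updates only masked lanes.
    // (n is assumed to be a multiple of 16 to keep the sketch short.)
    void masked_fma(const float* a, const float* b, const float* c,
                    float* y, int n) {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            __m512 vc = _mm512_loadu_ps(c + i);
            __mmask16 m =
                _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
            // Fused multiply-add on active lanes; inactive lanes pass c through.
            __m512 vy = _mm512_mask3_fmadd_ps(va, vb, vc, m);
            _mm512_storeu_ps(y + i, vy);
        }
    }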
Summary
• Vector processors
– Processors that operate on linear sequences of numbers
• Vector add, vector load, vector store, …
– Can express and exploit data-level parallelism in applications
• SIMD extension
– Short vector extensions for ILP processors
– Get some of the advantages of vector processors without most of the cost
• Remember what Jim Smith said:
– "The most efficient way to execute a vectorizable application is a vector
processor"
Graphics Processors (GPUs)
Graphics Processors Timeline
• Until the mid-90s
– VGA controllers used to accelerate some display functions
• Mid 90s to mid 00s
– Fixed-function graphics accelerators for the OpenGL and DirectX APIs
• Some GP-GPU capabilities built on top of these interfaces
– 3D graphics: triangle setup & rasterization, texture mapping & shading
• Modern GPUs
– Programmable multiprocessors (optimized for data-parallel ops)
• OpenGL/DirectX and general purpose languages
– Some fixed function hardware (texture, raster ops, …)
GPU’s Role in Modern Workstations
• Coprocessor to the CPU
• PCIe based interconnect
– 8GB/sec per direction
• Separate GPU memory
– Aka frame buffer
– Provides high bandwidth access to
local data
• Upcoming trend
– Fusion: CPU + GPU integration
GPU Thread Model (Software View)
• Single-program multiple data
(SPMD) model
• Each thread has local memory
• Parallel threads packed in blocks
– Access to per-block shared
memory
– Can synchronize with barrier
• Grids include independent groups
– May execute concurrently
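A minimal CUDA sketch of this model: threads within a block cooperate through per-block shared memory and a barrier, while the blocks of the grid are independent. The kernel and the tile size of 256 are illustrative:

    // Each block reverses its own 256-element tile of the input.
    // Shared memory is per-block; __syncthreads() is the per-block barrier.
    __global__ void reverse_tiles(const float* in, float* out) {
        __shared__ float tile[256];            // per-block shared memory
        int g = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[g];             // each thread loads one element
        __syncthreads();                       // barrier: tile fully loaded
        out[g] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Launch: a grid of independent blocks, 256 threads per block.
    // reverse_tiles<<<numBlocks, 256>>>(d_in, d_out);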
GPU Architecture: Nvidia GeForce 8800
(aka Tesla Architecture)
GPU Architecture
• A highly multithreaded, multiprocessor system
– 100s of streaming processors (SPs)
– 8 SPs in a streaming multiprocessor (SM) with some caches
– 2 SMs in a texture processor cluster (TPC) with one texture pipe
• Scheduling controlled mostly by hardware
• Scalability
– By scaling the number of TPCs and memory channels
• Fixed function components for graphics
– Texture pipes and caches, raster operation units (ROP), …
Streaming Multiprocessor
Streaming Multiprocessor Details
• Each SP is a simple processor core
– 1024 32-bit registers shared flexibly by up to 64 threads
– Integer and floating-point arithmetic units
• Including multiply add
• Special function unit
– Implements functions such as divide, square root, sine, cosine, …
• Instruction cache and constants cache
– Shared by all threads
• Multibanked shared memory
– E.g. 16 banks to allow parallel accesses by all SPs
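The banking is visible to software: threads that hit distinct banks proceed in parallel, while threads that map to the same bank serialize. A hypothetical CUDA sketch contrasting the two patterns:

    // 16 banks, 32-bit wide: shared-memory word i lives in bank (i % 16).
    __global__ void bank_demo(float* out) {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = (float)t;
        __syncthreads();

        // Conflict-free: consecutive threads touch consecutive words,
        // so each thread of a half-warp hits a different bank.
        float a = s[t];

        // 16-way conflict: threads 0..15 all map to bank 0
        // (indices 0, 16, 32, ... are congruent mod 16) and serialize.
        float b = s[(t * 16) % 256];

        out[t] = a + b;
    }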
Instruction and Thread Scheduling:
Where Thread Parallelism Meets Data Parallelism
• In theory, all threads can be independent
– SM hardware implements zero-overhead switching
• For efficiency, 32 threads are packed in warps
– Warp: set of parallel threads that execute the same instruction
– Warps introduce data parallelism (SIMT)
– 1 warp instruction keeps SPs busy for 4 cycles
• Individual threads may be inactive
– Because they branched differently or due to predication
– Loss of efficiency if not data parallel
• Software thread blocks mapped to warps
– When HW resources are available
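A CUDA sketch of divergence (the kernel is illustrative): when threads of a warp take different sides of a branch, the SM executes both paths back to back with inactive lanes masked off:

    __global__ void divergent(const int* x, int* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Threads of one warp may take different sides of this branch.
        // The SM then runs the two paths back to back, masking inactive
        // lanes, so the warp pays for both paths.
        if (x[i] % 2 == 0)
            y[i] = x[i] * 3;
        else
            y[i] = x[i] + 7;

        // If the condition were uniform per warp (e.g. based on blockIdx.x),
        // there would be no divergence and no efficiency loss.
    }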
Instruction Buffering &
Warp Scheduling
• Fetch one instruction/cycle
– From the L1 instruction cache into an instruction buffer slot
• Issue one "ready-to-go" instruction/cycle
– All elements of the warp must be ready
– Scoreboarding used to track hazards and determine ready warps
– Round-robin or age-based selection between ready warps
• Instruction broadcast to all SPs
– Will keep SPs busy for up to 4 cycles
[Figure: SM front end — the L1 I$ feeds a multithreaded instruction buffer; operand select reads the register file (RF) and shared memory; execution proceeds in the MAD units and the SFU; C$ is the L1 constant cache.]
Dependency Tracking Using a Scoreboard
• Status of all register operands is tracked
– RAW hazards for high-latency operations
– Dependencies to memory operations
• Instructions become ready when all register operands are ready for the
whole warp
– Divergence of threads within warps is also tracked
• A warp may be blocked because of
– Dependencies through registers
– Synchronization operations (barriers or atomic ops)
• But other warps can proceed in order to hide latency
Memory System
• Per SM caches/memories
– Instruction and constant caches
– Multi-banked shared memory
• Distributed texture cache
– Per TPC L1 and distributed L2 cache
– Specialized for texture accesses in graphics pipeline
– Moving towards generalized and shared L2 in upcoming chips
• Multi-channel DRAM main memory (e.g. 8 DDR-3 channels)
– Interleaved addresses to achieve higher bandwidth
– Lossless and lossy compression used to increase bandwidth
– Aggressive access scheduling used to increase bandwidth
• Per thread private memory and global memory mapped to DRAM
– Relying on threads to hide long latencies
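Since threads, not caches, hide latency, delivered bandwidth depends heavily on access patterns: a warp touching consecutive addresses is serviced by a few wide DRAM transfers, while a strided pattern may need one transfer per thread. An illustrative CUDA sketch (bounds checks omitted):

    // Coalesced: thread i of a warp reads word i of a contiguous segment,
    // so the warp's 32 loads merge into a few wide memory transactions.
    __global__ void coalesced(const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    // Strided: consecutive threads read addresses 128 words apart, so the
    // hardware cannot merge them and effective bandwidth collapses.
    __global__ void strided(const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * 128] * 2.0f;
    }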
Synchronization
• Barrier synchronization within a thread block
– Tracking simplified by grouping threads into warps
– Counter used to track the number of threads that have arrived at the barrier
• Atomic operations to global memory
– Atomic read-modify-write (add, min, max, and, or, xor)
– Atomic exchange or compare and swap
– They are tied to DRAM latency
• Moving to shared L2 in upcoming chips
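A CUDA sketch of these operations (kernel and names are illustrative): a global-memory histogram built with atomicAdd, and a claim/lock pattern built with atomicCAS:

    // Global-memory histogram over 256 bins: many threads may hit the same
    // bin, so the read-modify-write must be atomic.
    __global__ void histogram(const unsigned char* data, int n,
                              unsigned int* bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);   // atomic add to global memory
    }

    // Compare-and-swap: update a location only if it still holds 0.
    __device__ bool try_claim(unsigned int* lock) {
        // atomicCAS returns the old value; the swap happened iff old == 0.
        return atomicCAS(lock, 0u, 1u) == 0u;
    }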
GPUs vs. Vector Processors: Discussion
• How are GPUs similar to or different from vector processors?
– What are the primary issues to consider here?
• How are GPUs different from an architecture like Larrabee?
Why is Data Parallelism Interesting Again:
Building a 100TF Datacenter
                    CPU 1U Server      GPU 1U System
Compute             4 CPU cores        4 GPUs: 960 cores
Performance         0.07 Teraflop      4 Teraflops
Cost                $2,000             $8,000
Power               400 W              700 W

For 100 Teraflops: 1429 CPU servers ($3.1M, 571 KW) vs. 25 CPU servers + 25 Tesla systems ($0.31M, 27 KW)
→ 10x lower cost, 21x lower power