Operation of the SM Pipeline
©Sudhakar Yalamanchili unless otherwise noted
(1)
Objectives
• Cycle-level examination of the operation of the major pipeline stages in a streaming multiprocessor (SM)
• Understand the type of information necessary for each stage of operation
• Identification of performance bottlenecks
 Detailed implementations are addressed in subsequent modules
(2)
Reading
• Documentation for the GPGPU-Sim simulator
 A good source of information about the general organization and operation of a streaming multiprocessor
 http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
• Operation of a scoreboard
 https://en.wikipedia.org/wiki/Scoreboarding
• P. Xiang, Y. Yang, and H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," International Symposium on High Performance Computer Architecture (HPCA), 2014.
• D. Tarjan and K. Skadron, "On-Demand Register Allocation and Deallocation for a Multithreaded Processor," US Patent Application 2011/0161616 A1, June 2011.
(3)
NVIDIA GK110 (Kepler)
Thread Block Scheduler
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
(4)
SMX Organization: GK110
Multiple warp schedulers
64K 32-bit registers
192 cores: 6 clusters of 32 cores each
What are the main stages of a generic SMX pipeline?
Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/
(5)
A Generic SM Pipeline
[Figure: a generic SM pipeline. Front-end: I-Fetch (scalar fetch & decode) and Decode fill a per-warp I-Buffer, from which the Instruction Issue & Warp Scheduler stage selects among pending warps (Warp 1, Warp 2, ..., Warp 6). Back-end: operand access from the Predicate (PRF) and general-purpose (RF) register files, execution in parallel scalar pipelines (scalar cores), Data Memory Access through the D-Cache (All Hit? vs. Miss?), and Writeback/Commit.]
(6)
Single Warp Execution
[Figure: a Grid of Thread Blocks supplies warps to the SM; per-warp state includes State, PC, AM (active mask), and WID (warp ID).]
PTX (Assembly):
    setp.lt.s32 %p, %r5, %rd4;    // r5 = index, rd4 = N
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];     // r6 = &a[index]
    ld.global.f32 %f2, [%r7];     // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;     // r8 = &c[index]
L2: ret;
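
This PTX is what a guarded vector-add kernel compiles to; a minimal CUDA source it could come from (the names a, b, c, and N are inferred from the register comments):

__global__ void vecAdd(const float *a, const float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) {                     // setp.lt.s32 + @p bra L1
        c[index] = a[index] + b[index];  // ld, ld, add.f32, st
    }
}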
(7)
Instruction Fetch & Decode
Examples from the Harmonica2 GPU
[Figure: generic SM pipeline with the I-Fetch stage highlighted. A table of per-warp PCs (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) drives Next Warp selection to the I-Cache; fetched warp state includes State, PC, AM, WID, and Instr.]
• May realize multiple fetch policies (see the sketch below)
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
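
One common fetch policy is round-robin across ready warps; a minimal sketch of such a selector (an assumed model, not GPGPU-Sim's actual code):

// Hypothetical per-warp fetch state.
struct WarpFetchState {
    unsigned pc;       // next instruction address for this warp
    bool     active;   // warp still has work to execute
    bool     ibufFree; // warp's I-buffer slot is empty
};

// Round-robin: pick the next fetchable warp after the last one served.
int nextWarpToFetch(const WarpFetchState warps[], int numWarps, int lastFetched) {
    for (int i = 1; i <= numWarps; ++i) {
        int w = (lastFetched + i) % numWarps;
        if (warps[w].active && warps[w].ibufFree)
            return w;  // fetch from the I-Cache at warps[w].pc
    }
    return -1;         // no warp can fetch this cycle
}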
(8)
Instruction Buffer
Example: buffer 2 instructions/warp
[Figure: Decode writes decoded instructions into the I-Buffer; each entry is (V, R, Instr i, Warp j), e.g., V|R|Instr 1 W1, V|R|Instr 2 W1, V|R|Instr 1 W2, ..., V|R|Instr 2 Wn. The R bits are driven by the scoreboard (scoreboarding: see ECE 6100/CS 6290).]
• Buffer a fixed number of instructions per warp
• Coordinated with instruction fetch
 Need an empty I-buffer slot for the warp
• V: valid instruction in the buffer
• R: instruction ready to be issued
 Set using the scoreboard logic
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
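
A sketch of the bookkeeping this implies (field and constant names are assumptions; two slots per warp as in the example):

#include <cstdint>

constexpr int MAX_WARPS      = 48;  // assumed warps per SM
constexpr int SLOTS_PER_WARP = 2;   // as in the example above

struct IBufEntry {
    bool     valid;  // V: slot holds a decoded instruction
    bool     ready;  // R: scoreboard reports no outstanding hazards
    uint32_t instr;  // decoded instruction bits
};

struct IBuffer {
    IBufEntry slots[MAX_WARPS][SLOTS_PER_WARP];

    // Fetch for a warp is allowed only if it has an empty slot.
    bool hasFreeSlot(int warp) const {
        for (const IBufEntry &e : slots[warp])
            if (!e.valid) return true;
        return false;
    }
};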
(9)
Instruction Buffer (2)
[Figure: generic SM pipeline; the scoreboard sits beside the I-Buffer and sets the R bits of the per-warp entries.]
• Scoreboard enforces WAW and RAW hazards
 Indexed by warp ID
 Each entry hosts the required registers
 Destination registers are reserved at issue
 Reserved registers are released at writeback
• Enables multiple instructions to be issued from a single warp
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
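
A minimal sketch of such a per-warp scoreboard (an assumed model reusing MAX_WARPS from the I-buffer sketch above; not GPGPU-Sim's implementation):

#include <set>
#include <vector>

struct WarpScoreboard {
    // Destination registers reserved by in-flight instructions, per warp.
    std::set<int> reserved[MAX_WARPS];

    // At issue: reserve the destination register.
    void reserve(int warp, int destReg) { reserved[warp].insert(destReg); }

    // At writeback: release it.
    void release(int warp, int destReg) { reserved[warp].erase(destReg); }

    // Ready if no source is still being produced (RAW) and the
    // destination is not already reserved (WAW).
    bool ready(int warp, const std::vector<int> &srcs, int destReg) const {
        for (int r : srcs)
            if (reserved[warp].count(r)) return false;  // RAW hazard
        return reserved[warp].count(destReg) == 0;      // WAW hazard
    }
};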
(10)
Instruction Buffer (3)
[Figure: generic SM pipeline; the scoreboard is attached to the per-warp I-Buffer entries.]
A generic (CDC 6600 style) scoreboard entry, shown here for the instruction Load F2, R3:

Name | Busy | Op   | Fi (dest reg) | Fj (src1) | Fk (src2) | Qj | Qk | Rj | Rk
Int  | Yes  | Load | F2            |           | R3        |    |    | No |

Qj, Qk: function unit producing each source value. Rj, Rk: do the source registers have their values?
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
(11)
Instruction Issue
[Figure: the Issue stage draws an instruction from a pool of ready warps (e.g., Warp 3, Warp 8, Warp 7), selected by the Warp Scheduler.]
• Manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
(12)
Instruction Issue (2)
[Figure: ready-warp pool in which some warps are parked at a barrier instead of being eligible for issue.]
• Barriers: warps wait here for barrier synchronization
 All threads in the CTA must reach the barrier
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
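
In CUDA source, this is where a thread block's warps stall on __syncthreads(); a minimal kernel showing the pattern (kernel name and tile size are assumptions):

__global__ void tileDemo(const float *in, float *out) {
    __shared__ float tile[256];          // assumes a CTA of 256 threads
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;
    tile[t] = in[g];
    __syncthreads();  // no warp of this CTA issues past this point
                      // until every thread in the CTA has arrived
    int left = (t == 0) ? 0 : t - 1;
    out[g] = 0.5f * (tile[t] + tile[left]);
}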
(13)
Instruction Issue (3)
[Figure: the scoreboard sets the R (ready) bits of the I-Buffer entries; warps with ready instructions join the scheduler's pool.]
• Register dependencies: tracked through the scoreboard
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
(14)
Instruction Issue (4)
[Figure: a per-warp SIMT stack beside the scheduler keeps track of divergent threads at a branch; divergent warps stay in the pool with reduced active masks.]
• Control divergence: handled by a per-warp SIMT stack (see the example below)
From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
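
A branch like the following splits a warp into two active-mask groups that the SIMT stack serializes (an illustrative sketch):

__global__ void divergeDemo(int *data) {
    int t = threadIdx.x;
    if (t % 2 == 0)      // even lanes take one path...
        data[t] *= 2;
    else                 // ...odd lanes take the other; the SIMT stack
        data[t] += 1;    // executes the two paths one after the other
    data[t] -= 3;        // reconvergence point: full active mask again
}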
(15)
Instruction Issue (5)
• Scheduler can issue multiple instructions from a warp
• Issue conditions (combined in the sketch below)
 Has valid instructions
 Not waiting at a barrier
 Passes the scoreboard check
 Pipeline is not stalled: operand access stage (will get to it later)
• Reserve destination registers
• Instructions may issue to memory, SP, or SFU pipelines
• Warp scheduling disciplines  more later in the course
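
A sketch of the eligibility check these conditions describe (an assumed model, reusing IBufEntry and MAX_WARPS from the earlier sketches):

// Hypothetical barrier bookkeeping: is each warp parked at a barrier?
struct BarrierState {
    bool waiting[MAX_WARPS];
};

// An I-buffer entry for a warp may issue only if all conditions hold.
bool canIssue(int warp, const IBufEntry &e,
              const BarrierState &barriers, bool operandStageStalled) {
    return e.valid                  // has a valid instruction
        && !barriers.waiting[warp]  // not waiting at a barrier
        && e.ready                  // scoreboard check passed (R bit set)
        && !operandStageStalled;    // operand access stage can accept it
}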
(16)
Register File Access
[Figure: operand-collector-based register file access. Sixteen single-ported, 1024-bit register file banks (Banks 0-15) sit behind an arbiter; the registers of successive warps (n-1, n-2, n-3, n-4, ...) are interleaved across the banks (RF0, RF1, ...). A crossbar (Xbar) routes operands to the Operand Collectors (OC), and Dispatch Units (DU) forward ready instructions to the ALUs, L/S unit, and SFU. Results return through Writeback.]
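
With registers interleaved across banks, the bank holding an operand can be modeled as below (a commonly assumed mapping, 16 banks as in the figure):

// Assumed mapping: a warp's registers are striped across 16 single-ported
// banks, offset by warp ID so different warps' hot registers spread out.
constexpr int NUM_BANKS = 16;

int operandBank(int warpId, int regIndex) {
    // Two source operands that land in the same bank must be read in
    // separate cycles, which is what the arbiter serializes.
    return (regIndex + warpId) % NUM_BANKS;
}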
(17)
Scalar Pipeline
[Figure: a single core's scalar pipeline: a Dispatch stage feeds pipelined LD/ST, ALU, and FPU units, which drain into a Result Queue.]
• Functional units are pipelined
• Designs with multiple issue
(18)
Shared Memory Access
[Figure: shared memory built from multiple banks, contrasting conflict-free access with 2-way conflicting access.]
• Multiple bank organization
• Data is interleaved across banks
• Bank conflicts extend access times (see the example below)
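
Both patterns in one kernel (an illustrative sketch; assumes 32 banks of 4-byte words and a single 32-thread warp):

__global__ void bankDemo(float *out) {
    __shared__ float s[64];
    int t = threadIdx.x;          // lanes 0..31
    s[t] = t;
    s[t + 32] = t;
    __syncthreads();
    float fast = s[t];            // stride 1: every lane hits a different
                                  // bank, so the access is conflict free
    float slow = s[(2 * t) % 64]; // stride 2: lane pairs (e.g., 0 and 16)
                                  // collide on a bank -> 2-way conflict,
                                  // and the access takes twice as long
    out[t] = fast + slow;
}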
(19)
Memory Request Coalescing
[Figure: memory address coalescing. Each thread's request carries (Tid, RQ Size, Base Add, Offset) into a Pending Request Table (PRT); coalescing logic produces per-transaction address masks and the corresponding thread masks.]
• The PRT is filled whenever a memory request is issued
• Generate a set of address masks  one for each memory transaction
• Issue the transactions
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
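
The effect of coalescing is easiest to see from the access patterns themselves (an illustrative sketch; assumes in[] is large enough for the strided index):

__global__ void accessDemo(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Coalesced: a warp's 32 consecutive loads fall into one or two
    // 128-byte segments, so very few transactions are generated.
    float a = in[i];
    // Strided: with a large stride each lane touches a different segment,
    // so the coalescer must emit up to 32 separate transactions.
    float b = in[i * stride];
    out[i] = a + b;
}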
(20)
Case Study: Kepler GK110
From GK110: NVIDIA white paper
(21)
Kepler SMX
A slice of the SMX
• Up to two instructions can be issued per warp
 E.g., LD and SFU
• More flexible instruction pairing rules
• More efficient support for atomic operations in global memory  both latency and throughput
 E.g., atomicAdd, atomicExch
From GK110: NVIDIA white paper
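
A global-memory histogram is a typical beneficiary of these atomics (a minimal sketch; assumes bins[] covers the value range):

__global__ void histogram(const int *vals, int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[vals[i]], 1);  // GK110 improves both the latency
                                       // and throughput of this operation
}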
(22)
Shuffle Instruction
From GK110: NVIDIA white paper
• Permits threads in a warp to share data
 Avoids a load-store sequence through memory
• Reduces the shared memory requirement per TB  increases occupancy
 Data is exchanged in registers without using shared memory
• Some operations become more efficient (e.g., the warp reduction below)
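
The classic example is a warp-level sum, where shuffles replace a shared-memory buffer entirely (a sketch using the modern __shfl_down_sync intrinsic; Kepler's original form was __shfl_down):

__device__ float warpReduceSum(float val) {
    // Each step halves the distance; data moves register-to-register,
    // with no shared-memory loads or stores.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the sum over all 32 lanes
}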
(23)
Memory Hierarchy
[Figure: per-SMX L1 cache, shared memory, and read-only cache, backed by a shared L2 cache and DRAM.]
From GK110: NVIDIA white paper
• Configurable cache/shared memory configuration for the L1
• Read-only cache for compiler or developer (intrinsics) use
• Shared L2 across all SMXs
• ECC coverage across the hierarchy
 Performance impact
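
The read-only path can be reached through the compiler (const __restrict__ pointers) or explicitly with the __ldg() intrinsic (a minimal sketch):

__global__ void scale(const float * __restrict__ in, float *out,
                      float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);  // explicit read-only cache load;
                                     // const __restrict__ alone also lets
                                     // the compiler route loads this way
}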
(24)
Dynamic Parallelism
From GK110: NVIDIA white paper
• The ability to launch nested kernels from the device side
 Eliminates host-GPU interactions
 Current overheads are high
• Matches a wider range of parallelism patterns  will be covered in more depth later
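
A minimal device-side launch (a sketch; requires compute capability 3.5+ and -rdc=true, and child/parent are hypothetical names):

__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Nested launch: no round trip to the host is needed.
        child<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
                                  // (as supported in Kepler-era CUDA)
    }
}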
(25)
Concurrent Kernel Launch
From GK110: NVIDIA white paper
• Kernels from multiple streams are now mapped to distinct hardware queues
• TBs from multiple kernels can share an SMX
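
Concurrency is exposed through CUDA streams (a sketch; kernelA and kernelB are hypothetical):

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

void launchConcurrent(float *dx, float *dy) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Distinct streams map to distinct hardware queues on GK110, so the
    // kernels can run concurrently and their TBs can share an SMX.
    kernelA<<<1, 256, 0, s1>>>(dx);
    kernelB<<<1, 256, 0, s2>>>(dy);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}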
(26)
Warp and Instruction Dispatch
From GK110: NVIDIA white paper
(27)
HW Work Queues
[Figure: hardware work queues. Grids launched by the host CPU and by the GPU itself enter the Kernel Management Unit (pending kernels) and Grid Management, then the Kernel Distributor; each Kernel Distributor entry holds (PC, Dim, Param, ExeBL). The SMX Scheduler (control registers) dispatches thread blocks over an interconnection bus to the SMXs, each with warp schedulers, warp contexts, registers, cores, and L1 cache / shared memory, backed by the L2 cache, memory controller, and DRAM.]
• Multiple grids launched from both the CPU and the GPU can be handled in Kepler
• Need the ability to re-prioritize and schedule new grids
(28)
Summary
• Synchronous progress of a warp through the SM pipelines
• Warp progress in a thread block can diverge for many reasons
 Barriers
 Control divergence
 Memory divergence
• How is the execution optimized? Next 
(29)