Operation of the SM Pipeline
©Sudhakar Yalamanchili unless otherwise noted
(1)

Objectives
• Cycle-level examination of the operation of the major pipeline stages in a streaming multiprocessor
• Understand the type of information necessary for each stage of operation
• Identification of performance bottlenecks
Detailed implementations are addressed in subsequent modules
(2)

Reading
• Documentation for the GPGPU-Sim simulator – a good source of information about the general organization and operation of a streaming multiprocessor: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual
• Operation of a scoreboard: https://en.wikipedia.org/wiki/Scoreboarding
• P. Xiang, Y. Yang, and H. Zhou, “Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation,” International Symposium on High Performance Computer Architecture (HPCA), 2014.
• D. Tarjan and K. Skadron, “On Demand Register Allocation and Deallocation for a Multithreaded Processor,” US Patent 2011/0161616 A1, June 2011.
(3)

NVIDIA GK110 (Kepler): Thread Block Scheduler
[Figure: GK110 chip-level organization with the thread block scheduler feeding the SMXs. Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/]
(4)

SMX Organization: GK110
• Multiple warp schedulers
• 64K 32-bit registers
• 192 cores – 6 clusters of 32 cores each
• What are the main stages of a generic SMX pipeline?
[Figure: SMX block diagram. Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/]
(5)

A Generic SM Pipeline
[Figure: the generic SM pipeline used throughout these slides. Front-end: scalar fetch and decode (I-Fetch, Decode), I-Buffer, and instruction issue with a warp scheduler holding pending warps (Warp 1, Warp 2, ..., Warp 6), backed by the predicate and general-purpose register files (PRF, RF). Back-end: the scalar pipelines (scalar cores), data memory access (D-Cache, with an “All hit?” check and a miss path), and writeback/commit.]
(6)

Single Warp Execution
[Figure: per-warp state (PC, active mask (AM), warp ID (WID)) within the thread block state of a grid.]
PTX (assembly):
      setp.lt.s32  %p, %r5, %rd4;    // r5 = index, rd4 = N
  @p  bra  L1;
      bra  L2;
  L1: ld.global.f32  %f1, [%r6];     // r6 = &a[index]
      ld.global.f32  %f2, [%r7];     // r7 = &b[index]
      add.f32  %f3, %f1, %f2;
      st.global.f32  [%r8], %f3;     // r8 = &c[index]
  L2: ret;
A CUDA fragment that plausibly compiles to this PTX is sketched after slide (11).
(7)

Instruction Fetch & Decode
[Figure: the fetch stage keeps per-warp state (PC, active mask (AM), warp ID (WID), fetched instruction) and a table of per-warp PCs (Warp 0 PC ... Warp n-1 PC); the next warp’s PC indexes the I-Cache, and decoded instructions fill the I-Buffer. Examples from the Harmonica2 GPU. From the GPGPU-Sim documentation: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores]
• May realize multiple fetch policies
(8)

Instruction Buffer
[Figure: I-Buffer holding two decoded instructions per warp; each entry carries a valid bit (V), a ready bit (R), and the instruction, e.g., Instr 1 W1, Instr 2 W1, Instr 1 W2, ..., Instr 2 Wn. Adapted from the GPGPU-Sim manual.]
• Example: buffer 2 instructions per warp
• Buffering is coordinated with instruction fetch – a warp is fetched only when it has an empty I-Buffer entry
• V: valid instruction in the buffer
• R: instruction ready to be issued; set using the scoreboard logic
(9)

Instruction Buffer (2)
• The scoreboard enforces register dependencies, guarding against RAW and WAW hazards
• Indexed by warp ID; each entry holds the registers reserved by that warp’s in-flight instructions
• Destination registers are reserved at issue and released at writeback
• Enables multiple instructions to be issued from a single warp
(10)

Instruction Buffer (3)
• A generic (CDC 6600-style) scoreboard entry per function unit:

  Name | Busy | Op   | Fi (dest) | Fj (src1) | Fk (src2) | Qj | Qk | Rj | Rk
  -----+------+------+-----------+-----------+-----------+----+----+----+----
  Int  | Yes  | Load | F2        | –         | R3        | –  | –  | No | –

• Fi is the destination register; Fj and Fk are the source registers; Qj and Qk name the function units producing the source values; Rj and Rk flag whether the source registers have their values
(11)
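To make the reserve-at-issue / release-at-writeback behavior on slides (10)–(11) concrete, here is a minimal host-side C++ sketch of a per-warp scoreboard: an instruction is ready only when none of its source or destination registers is still reserved (covering RAW and WAW). The structure sizes and field names are illustrative assumptions, not GPGPU-Sim’s implementation.

  #include <bitset>

  // Illustrative per-warp scoreboard (cf. slides (10)-(11)); sizes are assumptions.
  constexpr int MAX_WARPS = 48;
  constexpr int NUM_REGS  = 256;   // architectural registers tracked per warp

  struct Scoreboard {
      // One reservation bit per (warp, register): set at issue, cleared at writeback.
      std::bitset<NUM_REGS> reserved[MAX_WARPS];

      // Ready check: no pending source (RAW) and no pending destination (WAW).
      bool ready(int wid, const int *srcs, int nsrc, int dst) const {
          for (int i = 0; i < nsrc; ++i)
              if (reserved[wid][srcs[i]]) return false;   // RAW hazard
          return !reserved[wid][dst];                     // WAW hazard
      }
      void issue(int wid, int dst)     { reserved[wid].set(dst);   }  // reserve at issue
      void writeback(int wid, int dst) { reserved[wid].reset(dst); }  // release at writeback
  };

Because reservations are tracked per warp, independent instructions of the same warp can be in flight simultaneously, which is what allows multiple instructions to be issued from a single warp.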
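Referring back to slide (7): a CUDA source fragment that plausibly compiles to that PTX is a guarded vector add. Only the PTX body is shown on the slide, so the kernel name, parameter names, and index computation below are assumptions.

  // Hypothetical CUDA source for the PTX on slide (7).
  __global__ void vecAdd(const float *a, const float *b, float *c, int N)
  {
      int index = blockIdx.x * blockDim.x + threadIdx.x;   // %r5
      if (index < N) {                    // setp.lt.s32 %p, %r5, %rd4; @p bra L1
          c[index] = a[index] + b[index]; // L1: ld, ld, add.f32, st
      }                                   // L2: ret
  }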
Instruction Issue
[Figure: the warp scheduler selects an instruction from a pool of ready warps (e.g., Warp 3, Warp 8, Warp 7) in the I-Buffer. Adapted from the GPGPU-Sim manual.]
• The warp scheduler manages the implementation of barriers, register dependencies, and control divergence
(12)

Instruction Issue (2)
[Figure: a warp waiting at a barrier is held in the scheduler until the rest of its CTA arrives.]
• Barriers – warps wait here for barrier synchronization
• All threads in the CTA must reach the barrier before any may proceed
(13)

Instruction Issue (3)
[Figure: the I-Buffer ready bits are driven by the scoreboard.]
• Register dependencies – tracked through the scoreboard (slides (10)–(11))
(14)

Instruction Issue (4)
[Figure: divergent warps managed via a per-warp SIMT stack.]
• Control divergence – handled with a per-warp SIMT stack, which keeps track of divergent threads at a branch (a minimal model is sketched after slide (17))
(15)

Instruction Issue (5)
• The scheduler can issue multiple instructions from a warp
• Issue conditions:
  – the warp has valid instructions
  – the warp is not waiting at a barrier
  – the instructions pass the scoreboard check
  – the pipeline is not stalled at the operand access stage (will get to it later)
• Destination registers are reserved at issue
• Instructions may issue to the memory, SP, or SFU pipelines
• Warp scheduling disciplines – more later in the course
(16)

Register File Access
[Figure: 16 single-ported, 1024-bit register file banks (Banks 0–15) behind an arbiter; a crossbar (Xbar) routes operands to operand collectors (OC), and dispatch units (DU) feed the ALUs, LD/ST units, and SFUs. Adapted from the GPGPU-Sim manual.]
• Source operands are gathered in operand collectors; the arbiter schedules bank accesses so that each single-ported bank serves one read per cycle (a toy arbitration sketch follows this slide)
(17)
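The following is a toy model of the bank arbitration on slide (17): each single-ported bank can grant at most one operand read per cycle, and colliding requests retry. The register-to-bank interleaving (register number mod 16) is an assumed policy; real designs may also swizzle by warp ID.

  #include <array>
  #include <vector>

  // Toy operand-read arbiter for slide (17); mod-16 mapping is an assumption.
  constexpr int NUM_BANKS = 16;

  struct OperandRequest { int collector_id; int reg; };

  // Grant at most one pending request per bank this cycle; the rest retry later.
  std::vector<OperandRequest> arbitrate(const std::vector<OperandRequest> &pending)
  {
      std::array<bool, NUM_BANKS> busy{};   // bank already granted this cycle?
      std::vector<OperandRequest> granted;
      for (const auto &rq : pending) {
          int bank = rq.reg % NUM_BANKS;    // register-to-bank interleaving
          if (!busy[bank]) { busy[bank] = true; granted.push_back(rq); }
      }
      return granted;
  }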
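As referenced from slide (15), here is a minimal model of a per-warp SIMT reconvergence stack: at a divergent branch, the top entry is rewritten as the reconvergence entry and the two sides of the branch are pushed; an entry is popped when its threads reach the reconvergence PC. Field names and the 32-lane mask width are illustrative assumptions, not GPGPU-Sim’s exact structure.

  #include <cstdint>
  #include <vector>

  // Illustrative per-warp SIMT stack (cf. slide (15)).
  struct SimtEntry {
      uint32_t pc;     // next PC for this subset of threads
      uint32_t rpc;    // reconvergence PC (immediate post-dominator of the branch)
      uint32_t mask;   // active mask: which of the 32 lanes run this path
  };

  struct SimtStack {
      std::vector<SimtEntry> s;
      SimtStack(uint32_t start_pc, uint32_t mask) {
          s.push_back({start_pc, 0xFFFFFFFFu, mask});   // sentinel rpc: never reached
      }
      // On a divergent branch: rewrite the top as the reconvergence entry,
      // then push the not-taken and taken sides (taken executes first).
      void diverge(uint32_t taken_pc, uint32_t fallthru_pc, uint32_t rpc,
                   uint32_t taken_mask, uint32_t active_mask) {
          s.back() = {rpc, s.back().rpc, active_mask};
          uint32_t nt = active_mask & ~taken_mask;
          if (nt)         s.push_back({fallthru_pc, rpc, nt});
          if (taken_mask) s.push_back({taken_pc,    rpc, taken_mask});
      }
      // Pop when the current path reaches its reconvergence point.
      void maybe_reconverge(uint32_t next_pc) {
          if (next_pc == s.back().rpc) s.pop_back();
      }
      uint32_t pc()          const { return s.back().pc;   }
      uint32_t active_mask() const { return s.back().mask; }
  };

The issue stage consults the top-of-stack PC and active mask when scheduling the warp, so only the threads on the current path execute until reconvergence.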
Scalar Pipeline
[Figure: a single core – dispatch feeds pipelined LD/ST, ALU, and FPU units, whose outputs drain into a result queue.]
• Functional units are pipelined
• Designs with multiple issue
(18)

Shared Memory Access
[Figure: conflict-free access vs. 2-way conflicting access across shared memory banks.]
• Multiple-bank organization; data is interleaved across the banks
• Bank conflicts extend access times (a CUDA example follows slide (29))
(19)

Memory Request Coalescing
[Figure: memory requests, each carrying a thread ID (Tid), request size (RQ Size), base address, and offset, enter the pending request table (PRT, with a pending-request count per entry); memory address coalescing produces per-transaction address masks and thread masks. From J. Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs,” ISCA 2013.]
• The PRT is filled whenever a memory request is issued
• Generate a set of address masks, one for each memory transaction (a toy coalescer is sketched after slide (29))
• Issue the transactions
(20)

Case Study: Kepler GK110
[Figure: from the NVIDIA GK110 white paper.]
(21)

Kepler SMX
[Figure: a slice of the SMX. From the NVIDIA GK110 white paper.]
• Up to two instructions can be issued per warp, e.g., an LD and an SFU operation
• More flexible instruction pairing rules
• More efficient support for atomic operations in global memory – in both latency and throughput, e.g., atomicAdd(), atomicExch()
(22)

Shuffle Instruction
[Figure: from the NVIDIA GK110 white paper.]
• Permits threads in a warp to share data: values are exchanged directly between registers without using shared memory, avoiding a load-store sequence (an example follows slide (29))
• Reduces the shared memory requirement per thread block, which can increase occupancy
• Some operations become more efficient
(23)

Memory Hierarchy
[Figure: per-SMX L1 cache, shared memory, and read-only cache above a shared L2 cache and DRAM. From the NVIDIA GK110 white paper.]
• Configurable split between the L1 cache and shared memory
• Read-only cache, used by the compiler or by the developer (via intrinsics)
• L2 is shared across all SMXs
• ECC coverage across the hierarchy – with a performance impact
(24)

Dynamic Parallelism
[Figure: from the NVIDIA GK110 white paper.]
• The ability to launch nested kernels from the device, eliminating host-GPU interactions; current launch overheads are high (a sketch follows slide (29))
• Matches a wider range of parallelism patterns – will cover in more depth later
(25)

Concurrent Kernel Launch
[Figure: from the NVIDIA GK110 white paper.]
• Kernels from multiple streams are now mapped to distinct hardware queues
• Thread blocks from multiple kernels can share an SMX
(26)

Warp and Instruction Dispatch
[Figure: warp schedulers and their instruction dispatch units. From the NVIDIA GK110 white paper.]
(27)

HW Work Queues
[Figure: pending kernels in the Kernel Management Unit feed the Kernel Distributor (entries: PC, Dim, Param, ExeBL); the SMX scheduler and its control registers dispatch grids over the interconnection bus to the SMXs (warp schedulers, warp contexts, registers, cores, L1 cache/shared memory), above the shared L2 cache, memory controller, and DRAM; grids arrive from both the host CPU and the GPU.]
• Multiple grids launched from both the CPU and the GPU can be handled in Kepler
• Requires the ability to re-prioritize and schedule newly arriving grids
(28)

Summary
• Synchronous progress of a warp through the SM pipelines
• Warp progress within a thread block can diverge for many reasons: barriers, control divergence, memory divergence
• How is the execution optimized? Next.
(29)
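The sketches below expand on earlier slides; all are illustrative rather than definitive implementations.

For slide (19): a CUDA illustration of conflict-free vs. 2-way-conflicting shared memory access, assuming 32 banks of 4-byte words (the common configuration; bank counts vary by architecture). The kernel and array names are made up for the example.

  __global__ void bank_demo(float *out)
  {
      __shared__ float tile[64];
      int t = threadIdx.x;               // assume one warp: blockDim.x == 32
      tile[t] = (float)t;
      tile[t + 32] = (float)t;
      __syncthreads();

      // Conflict-free: 32 consecutive 4-byte words map to 32 distinct banks,
      // so every bank services exactly one thread.
      float a = tile[t];

      // 2-way conflict: with stride-2 indexing, threads t and t+16 both hit
      // bank (2*t) % 32, so each bank serializes two requests.
      float b = tile[(2 * t) % 64];

      out[t] = a + b;
  }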
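For slide (20): a host-side sketch of how coalescing hardware might derive transactions from a warp’s pending requests, grouping thread addresses by 128-byte memory segment and recording a thread mask per transaction. This follows the spirit of the PRT description; the segment size and data structures are assumptions, not a specific implementation.

  #include <cstdint>
  #include <map>
  #include <vector>

  // Toy coalescer for slide (20): one transaction per touched 128-byte segment.
  constexpr uint64_t SEG_BYTES = 128;   // transaction granularity: an assumption

  struct Transaction { uint64_t seg_base; uint32_t thread_mask; };

  std::vector<Transaction> coalesce(const std::vector<uint64_t> &addr /* per lane */)
  {
      std::map<uint64_t, uint32_t> segs;   // segment base -> lanes touching it
      for (uint32_t lane = 0; lane < addr.size(); ++lane)
          segs[addr[lane] / SEG_BYTES * SEG_BYTES] |= (1u << lane);

      std::vector<Transaction> txns;
      for (const auto &kv : segs) txns.push_back({kv.first, kv.second});
      return txns;   // a fully coalesced warp yields a single transaction
  }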
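For slide (23): a standard warp-sum reduction built on the shuffle-down intrinsic, exchanging partial sums register-to-register with no shared memory. __shfl_down_sync is the modern (CUDA 9+) spelling; GK110-era code used __shfl_down without the mask argument.

  // Warp-level tree reduction in registers (cf. slide (23)).
  __device__ float warp_reduce_sum(float val)
  {
      // Each step halves the number of live partial sums; no shared memory.
      for (int offset = 16; offset > 0; offset >>= 1)
          val += __shfl_down_sync(0xffffffffu, val, offset);
      return val;   // lane 0 ends up with the sum across the warp
  }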
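For slide (25): the shape of a device-side nested launch under CUDA dynamic parallelism. The kernel names, work, and launch configuration are illustrative; compiling this requires relocatable device code (-rdc=true) and, on Kepler, compute capability 3.5.

  __global__ void child(float *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;          // placeholder work
  }

  __global__ void parent(float *data, int n)
  {
      // A single thread launches a nested grid directly from the device;
      // no host-GPU round trip is needed to spawn the child kernel.
      if (threadIdx.x == 0 && blockIdx.x == 0)
          child<<<(n + 255) / 256, 256>>>(data, n);
  }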