OCELOT: PTX EMULATOR SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 1 Overview Ocelot PTX Emulator Multicore-Backend NVIDIA AMD GPU Backend GPU Backend SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2 3 Execution Model NVIDIA’s PTX Execution Model Parallel thread Execution (PTX) Explicit memory hierarchy Cooperative thread arrays (CTAs) and ordering constraints Array of multiprocessors each executing a CTA (coarse grain) SIMD multiprocessors (fine grain) Single instruction multiple thread (SIMT) execution Enables hardware to exploit control flow uniformity and data locality SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 4 The Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX execution model Access to complete machine state Enables detailed performance evaluation/debugging Program correctness checking Alignment checks, out of bounds etc. Workload modeling characterization and Trace generation to drive architecture simulators PTX 3.0 (Fermi) support Timing information not available SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 4 Emulator Implementation Serialized execution of CTAs Implements the CUDA execution model semantics CUDA does not guarantee concurrent CTA execution! Implements Multiple reconvergence mechanisms are implemented: IPDOM, Barrier, Thread frontiers Software abstract reconvergence mechanism support for special functions: Texture sampling 1D, 2D, 3D, cube Nearest, linear New instructions may be prototyped SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot Source Code: PTX Emulator Device Backend • ocelot/ • executive/ • • • • • interface/EmulatorDevice.h interface/EmulatedKernel.h interface/EmulatorCallStack.h interface/CooperativeThreadArray.h interface/TextureOperations.h • trace/ • interface/TraceEvent.h • interface/TraceGenerator.h SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 6 7 Trace Generator Interface Emulator broadcasts events to event trace analyzers during execution Events provide detailed device state: PC, activity mask, operand data, thread ID, etc. Used for error checking, instrumentation, and simulation SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 7 8 Trace Generator Interface // ocelot/trace/interface/TraceGenerator.h // // Base class for generating traces class TraceGenerator { public: TraceGenerator(); virtual ~TraceGenerator(); // called when a traced kernel is launched to // retrieve some parameters from the kernel virtual void initialize( const executive::ExecutableKernel& kernel); // Called whenever an event takes place. virtual void event(const TraceEvent & event); // called when an event is committed virtual void postEvent(const TraceEvent & event); // Called when a kernel is finished. There will // be no more events for this kernel. virtual void finish(); }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 8 9 TraceEvent Object Captures execution of a dynamic instruction, including Device pointer to access device state Kernel grid, CTA dimensions, parameter values PC and internal representation of PTX instruction Set of active threads executing instruction Memory addresses and size of transfer Branch target(s) and diverging threads SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 9 trace::TraceEvent // \file ocelot/trace/interface/TraceEvent.h class TraceEvent { public: // ID of the block that generated the event ir::Dim3 blockId; // PC index into EmulatedKernel's packed instruction sequence ir::PTXU64 PC; // Depth of call stack [i.e. number of contexts on the runtime stack] ir::PTXU32 contextStackSize; // Instruction const pointer to instruction pointed to by PC const ir::PTXInstruction* instruction; // Bit mask of active threads that executed this instruction BitMask active; // Taken thread mask in case of a branch BitMask taken; // Fall through thread mask in case of a branch BitMask fallthrough; // Vector of memory addresses possibly generated for this instruction U64Vector memory_addresses; // Vector of sizes of memory operations possibly issued by this // instruction ir::PTXU32 memory_size; // Dimensions of the kernel grid that generated the event ir::Dim3 gridDim; // Dimensions of the kernel block that generated the event ir::Dim3 blockDim; // Captures just events related to thread reconvergence ReconvergenceTraceEvent reconvergence; }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 11 PTX Emulator – CUDA Debugging – Race Detection // file: raceCondition.cu __global__ void raceCondition(int *A) { __shared__ int SharedMem[64]; SharedMem[threadIdx.x] = A[threadIdx.x]; // no synchronization barrier! A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load } . . . raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr ); . . . ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception: ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between. ==Ocelot== Near raceCondition.cu:9:0 SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 11 12 PTX Emulator – CUDA Debugging- Illegal Memory Accesses // file: memoryCheck.cu __global__ void badMemoryReference(int *A) { A[threadIdx.x] = 0; // line 3 - faulting store } int main() { int *invalidPtr = 0x0234; // arbitrary pointer does not refer // int *validPtr = 0; to an existing memory allocation cudaMalloc((void **)&validPtr, sizeof(int)*64); badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr ); return 0; } ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range. ==Ocelot== ==Ocelot== Nearby Device Allocations ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes) ==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0 SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 12 Interactive Debugger Interactive command-line debugger implemented using TraceGenerator interface • Enables • Inspection of application state • Single-stepping of instructions • Breakpoints and watchpoints $ ./TestCudaSequence A_gpu = 0x16dcbe0 (ocelot-dbg) Attaching debugger to kernel 'sequence' (ocelot-dbg) (ocelot-dbg) watch global address 0x16dcbe4 s32[3] set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes (ocelot-dbg) (ocelot-dbg) continue st.global.s32 [%r11 + 0], %r7 watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6 break on watchpoint (ocelot-dbg) • Faults in MemoryChecker and RaceDetector invoke ocelotdbg automatically SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Performance Tuning Identify critical regions • Memory demand • Floating-point intensity • Shared memory bank conflicts • .... for (int offset = 1; offset < n; offset *= 2) { // line 61 pout = 1 - pout; pin = 1 - pout; __syncthreads(); temp[pout*n+thid] = temp[pin*n+thid]; if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset]; } } .... SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 15 Example: Inter-thread Data Flow • Which kernels exchange computed results through shared memory? • Track id of producer thread • Ensure threads are well synchronized • Optionally ignore uses of shared memory to transfer working sets SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 15 16 Current Trace Generators Trace Generators Branch Instruction Integrated Debugger Kernel Dimension Machine Attributes Memory Memory Checker Memory Race Detector Parallelism Performance Bound Shared Computation Warp Synchronous Function Measures control flow uniformity and branch divergence Static and dynamic instruction count GDB-like interface Kernel grid and block dimensions Observe and record machine characteristics Working set size, memory intensity, memory efficiency i) Bounds checks, ii) alignment checks, and iii) uninitialized loads (shared memory) Race conditions on shared memory MIMD and SIMD parallelism limits Compute and memory throughput Extent of data flow among threads Hot-paths/regions for warp synchronous execution SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 16 17 Using Trace Generators Implement TraceGenerator interface override methods: Add initialize( ), event( ), postEvent( ), finish( ) to Ocelot runtime explicitly: ocelot::addTraceGenerator( ) or, add to trace::TraceConfiguration Online Link analysis or serialize event traces applications with libocelotTrace.so SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 17 18 Configuring & Executing Applications Controls Ocelot’s initial state Located in application’s startup directory trace: { Edit configure.ocelot memoryChecker: { enabled: true, checkInitialization: false trace specifies which trace generators are initially attached }, raceDetector: { executive controls device properties enabled: true, ignoreIrrelevantWrites: true trace: memoryChecker – ensures raceDetector - enforces synchronized access to .shared debugger - interactive debugger }, debugger: { enabled: true, kernelFilter: "_Z13scalarProdGPUPfS_S_ii", executive: devices: List of Ocelot backend devices that are enabled emulated – Ocelot PTX emulator (trace generators) alwaysAttach: false }, }, executive: { devices: [ "emulated" ], Additional devices: nvidia – execution on NVIDIA GPUs llvm – efficient execution of PTX on multicore CPU amd – translation to AMD IL for PTX on AMD RADEON GPU SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY } } 19 Dynamic Instruction Count Example: Thread Load Imbalance //! Computes number of dynamic instructions for each thread class ThreadLoadImbalance: public trace::TraceGenerator { •Mandelbrot (CUDA SDK) public: std::vector< size_t > dynamicInstructions; // For each dynamic instruction, increment counters of each // thread that executes it virtual void event(const TraceEvent & event) { if (!dynamicInstructions.size()) dynamicInstructions.resize(event.active.size(), 0); for (int i = 0; i < event.active.size(); i++) { if (event.active[i]) dynamicInstructions[i]++; } } }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 19 EXAMPLE: CONTROL-FLOW DIVERGENCE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20 Control-Flow Divergence in PTX Emulator PTX Emulator facilitates customizable handlers for control flow divergence Currently implements: Immediate post-dominator (ipdom) Barrier divergence Thread frontiers, sorted stack Thread frontiers, GEN6 Assumes warp is CTA wide Abstract handlers implement potentially divergent control instructions eg. Bra, Bar, Exit, Vote Executing instructions drives TraceGenerators Reconvergence affects active threads, dynamic instruction count, and instruction trace Analysis tools can group threads into warps of arbitrary size SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21 class ReconvergenceMechanism // ocelot/executive/interface/ReconvergenceMechanism.h // virtual void eval_Reconverge(CTAContext &context, class ReconvergenceMechanism { const ir::PTXInstruction &instr) = 0; public: ReconvergenceMechanism(CooperativeThreadArray *cta); virtual void eval_Exit(CTAContext &context, virtual ~ReconvergenceMechanism(); const ir::PTXInstruction &instr) = 0; virtual void initialize() = 0; virtual void eval_Vote(CTAContext &context, const ir::PTXInstruction &instr); virtual void evalPredicate(CTAContext &context) = 0; virtual bool nextInstruction(CTAContext &context, virtual bool eval_Bra(CTAContext &context, const ir::PTXInstruction &instr, const ir::PTXInstruction &instr, const boost::dynamic_bitset<> & branch, const ir::PTXInstruction::Opcode &) = 0; const boost::dynamic_bitset<> & fallthrough) = 0; virtual CTAContext& getContext() = 0; virtual void eval_Bar(CTAContext &context, } const ir::PTXInstruction &instr) = 0; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22 Example: Immediate Post-Dominator Reconvergence // ocelot/executive/interface/ReconvergenceMechanism.h // CTAContext& getContext(); size_t stackSize() const; class ReconvergenceIPDOM: public ReconvergenceMechanism { void push(CTAContext&); public: void pop(); ReconvergenceIPDOM(CooperativeThreadArray *cta); ~ReconvergenceIPDOM(); std::vector<CTAContext> runtimeStack; std::vector<int> pcStack; void initialize(); void evalPredicate(CTAContext &context); bool eval_Bra(CTAContext &context, unsigned int reconvergeEvents; }; PTXInstruction &instr, dynamic_bitset<> & branch, dynamic_bitset<> & fallthrough); void eval_Bar(CTAContext &context, PTXInstruction &instr); void eval_Reconverge(CTAContext &context, PTXInstruction &instr); void eval_Exit(CTAContext &context, PTXInstruction &instr); bool nextInstruction(CTAContext &context, PTXInstruction &instr, PTXInstruction::Opcode &); SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23 Example: Immediate Post-Dominator Reconvergence // ocelot/executive/implementation/ReconvergenceMechanism.cpp // void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context, const ir::PTXInstruction &instr) { if (context.active.count() < context.active.size()) { // deadlock - not all threads reach synchronization barrier #if REPORT_BAR report(" Bar called - " << context.active.count() << " of " << context.active.size() << " threads active"); #endif std::stringstream message; message << "barrier deadlock:\n"; throw RuntimeException(message.str(), context.PC, instr); } } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 24 Example: Immediate Post-Dominator Reconvergence void executive::ReconvergenceIPDOM::eval_Reconverge( executive::CTAContext &context, const ir::PTXInstruction &instr) { if(runtimeStack.size() > 1) { if(pcStack.back() == context.PC) { pcStack.pop_back(); runtimeStack.pop_back(); ++reconvergeEvents; } else { context.PC++; } } else { context.PC++; } } void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context, const ir::PTXInstruction &instr) { eval_Bar(context, instr); context.running = false; } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 25 Applications: Thread Frontiers Evaluate the impact of novel warp reconvergence mechanisms on unstructured control-flow graphs Approach: Control the layout of basic blocks Select threads with highest priority PC Model priority queue in hardware Evaluation: measure impact on Activity factor - SIMD utilization Dynamic instruction count Effective memory bandwidth Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, Sudhakar Yalamanchili. SIMD Reconvergence at Thread Frontiers. 44th International Symposium on Microarchitecture (MICRO 44). December 2011. SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 26 Applications: Thread Frontiers // ocelot/executive/interface/ReconvergenceMechanism.h // class ReconvergenceTFSortedStack: public ReconvergenceMechanism { public: ReconvergenceTFSortedStack(CooperativeThreadArray *cta); ~ReconvergenceTFSortedStack(); // ... omitted typedef std::map<int, CTAContext> RuntimeStack; typedef std::vector<RuntimeStack> StackVector; StackVector stack; unsigned int reconvergeEvents; }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 27 Applications: Thread Frontiers // ocelot/executive/implementation/ReconvergenceMechanism.cpp // executive::CTAContext& executive::ReconvergenceTFSortedStack::getContext() { bool executive::ReconvergenceTFSortedStack::nextInstruction( executive::CTAContext &context, PTXInstruction &instr, return stack.back().begin()->second; PTXInstruction::Opcode &opcode) { } // advance to next instruction if the current instruction void executive::ReconvergenceTFSortedStack::eval_Exit( // wasn't a branch executive::CTAContext &context, PTXInstruction &instr) { if (opcode != ir::PTXInstruction::Bra if (stack.back().size() == 1) { && opcode != ir::PTXInstruction::Call context.running = false; && opcode != ir::PTXInstruction::Ret) { } context.PC++; else { } throw RuntimeException("not all threads hit the exit: ", context.PC, instr); } return context.running; } } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 28 Applications: Thread Frontiers SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 29 Summary: Ocelot PTX Emulator Used Can for instruction level analysis be attached to trace generators and trace analyzers Simple processing and filtering of machine state Forms the basis for a range of productivity tools Correctness tools Debugging tools Workload characterization Drives instruction and address traces to MACSIM (Part 2 of this tutorial) SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 30