Overview of Ocelot - CompArch - Georgia Institute of Technology

OVERVIEW OF OCELOT: ARCHITECTURE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Ocelot: Multiplatform Dynamic Compilation
[Figure: language front-end and data-parallel IR. Credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]
• Just-in-time code generation and optimization for data-intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools

NVIDIA's Compute Unified Device Architecture (CUDA)
• Integrates the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer

NVIDIA's Compute Unified Device Architecture (CUDA)
[Figure: host and GPU]
• For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training

Structure of a Compute Kernel
• Parallel Thread Execution (PTX) instruction set architecture
• Arrays of (data-parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to a single instruction stream, multiple data stream (SIMD) processor

NVIDIA Fermi GF100
[Figure labels: streaming multiprocessor (SM), ALU]
• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each SM has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own shared memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit memory controllers (total 384 bits wide)

Ocelot Structure [1]
[Figure labels: CUDA Application, nvcc, PTX Kernel]
• Ocelot is built with nvcc and the LLVM backend
• Structured around a PTX IR to LLVM IR translator
• Compile stock CUDA applications without modification
[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems," PACT, September 2010.

CUDA to PTX
• PTX modules stored as string literals in fat binary
• We ignore the accompanying binary image (GPU native binary)

Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Dependencies
• Software
  • C++ Compiler (GCC 4.5.x)
  • Lex Lexer Generator (Flex 2.5.35)
  • YACC Parser Generator (Bison 2.4.1)
  • Scons (Python 2.7)
  • LLVM (3.1)
• Libraries
  • boost_system (1.46)
  • boost_filesystem (1.46)
  • boost_serialization (1.46)
  • GLEW (1.5) (optional, for GL interop)
  • GL (for NVIDIA GPU devices)
• Library headers
  • Boost (1.46)
http://code.google.com/p/gpuocelot/wiki/Installation

Ocelot Source Code
• Freely available via Google Code project site (New BSD License): http://code.google.com/p/gpuocelot/
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only

Building GPU Ocelot
• Obtain source code
    svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with Scons
    sudo ./build.py --install
• Build and execute unit tests
    sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib
http://code.google.com/p/gpuocelot/wiki/Installation

Configuring Ocelot
• configure.ocelot
  • Controls Ocelot's initial state
  • Located in application's startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties
• trace:
  • memoryChecker - ensures ...
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices: list of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation to AMD IL for PTX on AMD RADEON GPU
Example configure.ocelot:
    {
        trace: {
            memoryChecker: {
                enabled: true,
                checkInitialization: false
            },
            raceDetector: {
                enabled: false,
                ignoreIrrelevantWrites: true
            },
            debugger: {
                enabled: false,
                kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
                alwaysAttach: true
            },
        },
        executive: {
            devices: [ "emulated" ],
        }
    }

Building and Executing CUDA Programs
    nvcc -c example.cu -arch sm_23
    g++ -o example example.o `OcelotConfig -l`
• `OcelotConfig -l` expands to '-locelot'
• libocelot.so replaces libcudart.so

Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

CUDA Runtime API
• Ocelot implements the CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • Override methods of cuda::CudaDeviceInterface
• Maps CUDA RT onto Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule( );

Ocelot CUDA Runtime Overview
[Figure credit: R. Domingo & D. Kaeli (NEU)]
• A reimplementation of the CUDA Runtime API
  • Compatible with existing applications
  • Link against libocelot.so instead of libcudart
• Kernels execute anywhere
  • Key to portability!
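
Sketch: one front end, interchangeable back ends
The idea that the runtime maps API calls onto a common device abstraction, so that kernels can "execute anywhere", can be illustrated with a small standalone C++ model. This is only a sketch under stated assumptions: the classes Device, HostBackedDevice, SerialDevice, and ChunkedDevice and the functions selectDevice and runOn are hypothetical names invented for this illustration, not Ocelot's classes, and the "kernel" is a fixed add-constant loop rather than PTX. The back-end names merely echo the "emulated" and "llvm" entries of the devices list.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical device interface: every back end offers the same operations.
class Device {
public:
    virtual ~Device() = default;
    virtual std::string name() const = 0;
    virtual std::size_t allocate(std::size_t count) = 0;  // returns a buffer handle
    virtual void copyIn(std::size_t handle, const std::vector<int>& src) = 0;
    virtual std::vector<int> copyOut(std::size_t handle) const = 0;
    // Toy "kernel": add `value` to every element of the buffer.
    virtual void launchAddConstant(std::size_t handle, int value) = 0;
};

// Shared buffer bookkeeping for the two toy back ends.
class HostBackedDevice : public Device {
public:
    std::size_t allocate(std::size_t count) override {
        buffers_.emplace_back(count, 0);
        return buffers_.size() - 1;
    }
    void copyIn(std::size_t handle, const std::vector<int>& src) override {
        buffers_.at(handle) = src;
    }
    std::vector<int> copyOut(std::size_t handle) const override {
        return buffers_.at(handle);
    }
protected:
    std::vector<std::vector<int>> buffers_;
};

// Back end 1: visits one element at a time.
class SerialDevice : public HostBackedDevice {
public:
    std::string name() const override { return "emulated"; }
    void launchAddConstant(std::size_t handle, int value) override {
        for (int& x : buffers_.at(handle)) x += value;
    }
};

// Back end 2: same result, but walks the buffer in fixed-size chunks.
class ChunkedDevice : public HostBackedDevice {
public:
    std::string name() const override { return "llvm"; }
    void launchAddConstant(std::size_t handle, int value) override {
        std::vector<int>& data = buffers_.at(handle);
        const std::size_t chunk = 4;
        for (std::size_t base = 0; base < data.size(); base += chunk)
            for (std::size_t i = base; i < data.size() && i < base + chunk; ++i)
                data[i] += value;
    }
};

// Picks the first name this toy model recognizes from a device list.
std::unique_ptr<Device> selectDevice(const std::vector<std::string>& devices) {
    for (const std::string& d : devices) {
        if (d == "emulated") return std::make_unique<SerialDevice>();
        if (d == "llvm") return std::make_unique<ChunkedDevice>();
    }
    throw std::runtime_error("no supported device requested");
}

// Front-end code written once against Device; it never names a back end.
std::vector<int> runOn(Device& device, const std::vector<int>& input, int value) {
    std::size_t handle = device.allocate(input.size());
    device.copyIn(handle, input);
    device.launchAddConstant(handle, value);
    return device.copyOut(handle);
}
```

Calling runOn with the device returned by selectDevice for a list containing "emulated" or "llvm" yields the same output vector; only the device list changes, which is the portability argument of the slide in miniature.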

Ocelot CUDA Runtime
• Clean device abstraction
  • All back-ends implement the same interface
• Ocelot API Extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Ocelot Source Code: CUDA Runtime API
• ocelot/
  • analysis/ -- analysis passes
  • api/ -- Ocelot-specific API extensions
  • cuda/ -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/ -- Device interface and backend implementations
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
  • parser/ -- parser (to PTX)
  • tools/ -- standalone applications using Ocelot
  • trace/ -- trace generation and analysis tools
  • translator/ -- translators from PTX to LLVM and AMD IL
  • transforms/ -- program transformations

Ocelot CUDA Runtime API Implementation
• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
• class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()

Ocelot Source Code: Device Interface
• ocelot/
  • executive/ -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h

Ocelot Device Interface
• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • 57 functions (versus CUDA Runtime's 120+)
• Capture device state:
  • Memory allocations, global variables, textures, graphics interoperability
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface

Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

Ocelot PTX Intermediate Representation (IR)
[Figure label: PTX Kernel]
• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)

Ocelot Source Code: Intermediate Representation
• ocelot/
  • ir/ -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/ -- parser (to PTX)
    • interface/PTXParser.h

Ocelot PTX Internal Representation
• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid( )
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator

Ocelot PTX IR: Kernels
The comments mark which IR class represents each part of the module (ir::Module is the whole listing).
    .global .f32 globalVariable;                    // ir::Global
    .entry sequence (                               // ir::Kernel
        .param .u64 __cudaparm_sequence_A,          // ir::Parameter
        .param .s32 __cudaparm_sequence_N)
    {
        .reg .u32 %r<11>;
        .reg .u64 %rd<6>;
        .local .u32 %rp0;                           // ir::Local
        ...
        ...
    $LDWbegin_sequence:                             // ir::BasicBlock
        ld.param.s32 %r6, [__cudaparm_sequence_N];
        setp.le.s32 %p1, %r6, %r5;
        @%p1 bra $Lt_0_1026;
        ...
        ...
    $Lt_0_1026:
        exit;
    $LDWend_sequence:
    } // sequence

Ocelot PTX IR: Instructions
Class annotations: ir::BasicBlock (the instruction sequence), ir::PTXInstruction (each instruction), ir::PTXOperand (each operand).
    add.s32 %r7, %r5, 1;                            // 1: addressMode immediate
    ld .param .u64 %rd1, [__cudaparm_sequence_A];   // fields: opcode, addressSpace, dataType, d, a
                                                    // [__cudaparm_sequence_A]: addressMode address
    cvt.s64.s32 %rd2, %r5;                          // %rd2, %r5: addressMode register
    mul.wide.s32 %rd3, %r5, 4;
    add.u64 %rd4, %rd1, %rd3;
    st .global .s32 [ %rd4 + 0 ], %r7;              // [ %rd4 + 0 ]: addressMode indirect
    @%p1 bra $Lt_0_6146;                            // @%p1: guard predicate; $Lt_0_6146: addressMode label

Control and Data-Flow Graphs
• Data structure for representing kernels
  • Basic blocks
    • fall-through and branch edges
    • instruction vector
    • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG transformations:
  • split blocks, edges
• DFG transformations:
  • insert and remove values
  • iterate over def-use

Example: Control-Flow Graphs
    // example: splits basic blocks containing barriers
    //
    for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
         bb_it != kernel->cfg()->end(); ++bb_it) {
        // iterate over basic blocks
        unsigned int n = 0;
        ir::BasicBlock::InstructionList::iterator inst_it;
        for (inst_it = (bb_it)->instructions.begin();
             inst_it != (bb_it)->instructions.end(); ++inst_it, n++) {
            // iterate over instructions in *bb_it
            const ir::PTXInstruction *inst =
                static_cast<const ir::PTXInstruction *>(*inst_it);
            if (inst->opcode == ir::PTXInstruction::Bar) {
                if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                    std::string label = (bb_it)->label + "_bar";
                    kernel->cfg()->split_block(bb_it, n+1,
                        ir::BasicBlock::Edge::FallThrough, label);
                }
                break;  // split block containing bar.sync
                        // so that it's always the last
                        // instruction in a block
            }
        } // end for (inst_it)
    } // end for (bb_it)

Example: Spilling Live Values
    // ocelot/analysis/implementation/RemoveBarrierPass.cpp
    //
    void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
        const DataflowGraph::Block::RegisterSet& alive )
    {
        unsigned int bytes = 0;

        ir::PTXInstruction move ( ir::PTXInstruction::Mov );
        move.type = ir::PTXOperand::u64;
        move.a.identifier = "__ocelot_remove_barrier_pass_stack";
        move.a.addressMode = ir::PTXOperand::Address;
        move.a.type = ir::PTXOperand::u64;
        move.d.reg = _kernel->dfg()->newRegister();
        move.d.addressMode = ir::PTXOperand::Register;
        move.d.type = ir::PTXOperand::u64;

        _kernel->dfg()->insert( block, move,
            block->instructions().size() - 1 );
        ...

Example: Spilling Live Values (continued)
        ...
        for( DataflowGraph::Block::RegisterSet::const_iterator
            reg = alive.begin(); reg != alive.end(); ++reg )
        {
            ir::PTXInstruction save( ir::PTXInstruction::St );
            save.type = reg->type;
            save.addressSpace = ir::PTXInstruction::Local;
            save.d.addressMode = ir::PTXOperand::Indirect;
            save.d.reg = move.d.reg;
            save.d.type = ir::PTXOperand::u64;
            save.d.offset = bytes;
            bytes += ir::PTXOperand::bytes( save.type );
            save.a.addressMode = ir::PTXOperand::Register;
            save.a.type = reg->type;
            save.a.reg = reg->id;

            _kernel->dfg()->insert( block, save,
                block->instructions().size() - 1 );
        }
        _spillBytes = std::max( bytes, _spillBytes );
    }

IR for AMD and LLVM
• LLVM IR
  • Implements all of the LLVM instruction set
  • Decouples translator from LLVM project
  • Easier to construct than LLVM's actual IR
• AMD IL (AMD Backend: R. Domingo & D. Kaeli (NEU))
  • Supports translation from PTX to AMD interface
• Emitters construct parseable string representations of modules

Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager

PTX PassManager
• Orchestrates analysis and transformation passes
  • Derived from LLVM model
  • Analysis passes generate meta-data
  • Meta-data consumed by transformations
  • Transformation passes modify the IR

Using the Pass Manager
• Passes added to a manager
  • Schedules execution
  • Manages analysis meta-data
    • Ensures meta-data is available
    • Up to date; not redundantly computed

Analysis Passes
• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and post-dominator trees
  • Thread frontiers

Analysis Passes: Supported Analysis Structures
• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h
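
Sketch: a pass manager that caches and invalidates analysis meta-data
The scheduling behaviour described above (meta-data made available on demand, not redundantly computed, and invalidated once the IR changes) can be modelled in a few lines of standalone C++. This is a minimal sketch, not Ocelot's PassManager: Kernel, AnalysisKind, Pass, ToyPassManager, and demoAnalysesComputed are hypothetical names, the "IR" is a list of strings, the two "analyses" are simple counts, and "nop" is a placeholder string rather than a PTX opcode.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy IR: a kernel is just a list of instruction strings.
struct Kernel {
    std::vector<std::string> instructions;
};

// Analysis kinds as bit flags, so a pass can request several at once.
enum AnalysisKind : unsigned { InstructionCount = 1u << 0, BarrierCount = 1u << 1 };

// Hypothetical pass description: the analyses it requires and a body that
// returns true if it modified the kernel.
struct Pass {
    std::string name;
    unsigned required;
    std::function<bool(Kernel&, const std::map<unsigned, int>&)> run;
};

class ToyPassManager {
public:
    void add(Pass pass) { passes_.push_back(std::move(pass)); }

    void runOnKernel(Kernel& kernel) {
        for (Pass& pass : passes_) {
            // Compute only the required analyses that are not already cached.
            for (unsigned kind : {unsigned(InstructionCount), unsigned(BarrierCount)}) {
                if ((pass.required & kind) && !metadata_.count(kind)) {
                    metadata_[kind] = compute(kind, kernel);
                    ++analysesComputed;
                }
            }
            // A pass that changed the IR invalidates all cached meta-data.
            if (pass.run(kernel, metadata_)) metadata_.clear();
        }
    }

    int analysesComputed = 0;  // how many times any analysis actually ran

private:
    static int compute(unsigned kind, const Kernel& kernel) {
        if (kind == InstructionCount)
            return static_cast<int>(kernel.instructions.size());
        int barriers = 0;
        for (const std::string& inst : kernel.instructions)
            if (inst.rfind("bar.sync", 0) == 0) ++barriers;  // starts with "bar.sync"
        return barriers;
    }

    std::vector<Pass> passes_;
    std::map<unsigned, int> metadata_;
};

// Demo: two read-only passes share one cached analysis; a transformation
// then invalidates the cache, so the last pass forces a recomputation.
int demoAnalysesComputed() {
    Kernel k{{"add.s32 %r1, %r2, 1", "bar.sync 0", "nop", "exit"}};
    ToyPassManager pm;
    auto readOnly = [](Kernel&, const std::map<unsigned, int>&) { return false; };
    pm.add({"report-a", InstructionCount, readOnly});
    pm.add({"report-b", InstructionCount, readOnly});
    pm.add({"strip-nops", BarrierCount,
            [](Kernel& kernel, const std::map<unsigned, int>&) {
                std::vector<std::string>& v = kernel.instructions;
                std::size_t before = v.size();
                v.erase(std::remove(v.begin(), v.end(), std::string("nop")), v.end());
                return v.size() != before;
            }});
    pm.add({"report-c", InstructionCount, readOnly});
    pm.runOnKernel(k);
    return pm.analysesComputed;
}
```

In the demo, four passes request an analysis but only three computations happen: the second read-only pass reuses the cached count, while the pass after the transformation has to recompute it. Clearing everything on any change is the simplest policy; a finer-grained manager could let each pass declare which meta-data it preserves.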

Transformation Passes
• Modify the PTX IR
  • Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation

Example: Dead Code Elimination Transformation Pass

Dead Code Elimination
• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp

Dead Code Elimination (1 of 5)
• Set up pass dependencies
    DeadCodeEliminationPass::DeadCodeEliminationPass()
        : KernelPass(Analysis::DataflowGraphAnalysis |
              Analysis::StaticSingleAssignment,
              "DeadCodeEliminationPass")
    {
    }

Dead Code Elimination (2 of 5)
• Run pass
    void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
    {
• Get analysis metadata
        Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
        assert(dfgAnalysis != 0);

        // downcast to the concrete analysis type
        analysis::DataflowGraph& dfg =
            *static_cast<analysis::DataflowGraph*>(dfgAnalysis);

        assert(dfg.ssa());

Dead Code Elimination (3 of 5)
• Loop until change
        BlockSet blocks;

        for (iterator block = dfg.begin(); block != dfg.end(); ++block)
        {
            report(" Queueing up BB_" << block->id());
            blocks.insert(block);
        }

        while(!blocks.empty())
        {
            iterator block = *blocks.begin();
            blocks.erase(blocks.begin());

            eliminateDeadInstructions(dfg, blocks, block);
        }

Dead Code Elimination (4 of 5)
• Remove unused live-out values
        AliveKillList aliveOutKillList;

        for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
             aliveOut != block->aliveOut().end(); ++aliveOut)
        {
            if (canRemoveAliveOut(dfg, block, *aliveOut))
            {
                report(" removed " << aliveOut->id);
                aliveOutKillList.push_back(aliveOut);
            }
        }

        for (AliveKillList::iterator killed = aliveOutKillList.begin();
             killed != aliveOutKillList.end(); ++killed)
        {
            block->aliveOut().erase(*killed);
        }

Dead Code Elimination (5 of 5)
• Check if an instruction can be removed
        if (ptx.hasSideEffects()) return false;

        for (RegisterPointerVector::iterator reg = instruction->d.begin();
             reg != instruction->d.end(); ++reg)
        {
            // the reg is alive outside the block
            if (block->aliveOut().count(*reg) != 0) return false;

            InstructionVector::iterator next = instruction;
            for (++next; next != block->instructions().end(); ++next)
            {
                for (RegisterPointerVector::iterator source = next->s.begin();
                     source != next->s.end(); ++source)
                {
                    // found a user in the block
                    if (*source->pointer == *reg->pointer) return false;
                }
            }
        }

Dead Code Elimination
• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid

Running Passes on PTX
• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp

Questions
• GPU Ocelot
  • Google Code site: http://code.google.com/p/gpuocelot
  • Research Project site: http://gpuocelot.gatech.edu
  • Mailing list: gpuocelot@googlegroups.com
• Contributors
  • Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili
• Sponsors
  • AMD, IBM, Intel, LogicBlox, NSF, NVIDIA
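
Backup sketch: dead-code elimination on a single block
The removal test from the dead-code elimination walkthrough (no side effects, not alive out of the block, no later user in the block) can be tried on a toy IR. This is a minimal sketch under stated assumptions, not the Ocelot pass: Inst, canRemove, eliminateDeadInstructions, and exampleBlock are hypothetical names, registers are plain strings, there is only one basic block, and each register is assumed to be defined once (SSA form), so scanning later instructions is enough to find users.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy instruction: one optional destination register and its sources.
struct Inst {
    std::string dest;                  // empty if nothing is defined
    std::vector<std::string> sources;
    bool hasSideEffects;               // e.g. stores, barriers, branches
};

// True if the instruction at `index` may be deleted: it has no side
// effects, its result is not alive out of the block, and no later
// instruction in the block reads it.
bool canRemove(const std::vector<Inst>& block, std::size_t index,
               const std::set<std::string>& aliveOut) {
    const Inst& inst = block[index];
    if (inst.hasSideEffects) return false;
    if (aliveOut.count(inst.dest) != 0) return false;
    for (std::size_t next = index + 1; next < block.size(); ++next)
        for (const std::string& source : block[next].sources)
            if (source == inst.dest) return false;  // found a user
    return true;
}

// Repeats the scan until nothing changes; returns how many instructions
// were deleted. Scanning backwards lets a chain of dead values disappear
// in a single sweep.
std::size_t eliminateDeadInstructions(std::vector<Inst>& block,
                                      const std::set<std::string>& aliveOut) {
    std::size_t removed = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = block.size(); i-- > 0;) {
            if (canRemove(block, i, aliveOut)) {
                block.erase(block.begin() + static_cast<std::ptrdiff_t>(i));
                ++removed;
                changed = true;
            }
        }
    }
    return removed;
}

// Example block: %r2 feeds only %r3, and %r3 feeds nothing, so both are dead.
std::vector<Inst> exampleBlock() {
    return {
        {"%r1", {}, false},              // mov  %r1, 1
        {"%r2", {"%r1"}, false},         // add  %r2, %r1, 1
        {"%r3", {"%r2", "%r1"}, false},  // mul  %r3, %r2, %r1   (dead)
        {"%r4", {"%r1"}, false},         // add  %r4, %r1, 4
        {"", {"%r4"}, true},             // st   [...], %r4      (side effect)
    };
}
```

Running eliminateDeadInstructions on exampleBlock() with an empty alive-out set deletes the definitions of %r3 and then %r2, leaving the definition of %r1, the definition of %r4, and the store. The slides' version works across blocks with a worklist and also trims alive-in, alive-out, and phi values, which this sketch leaves out.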
