>> Aaron Smith: So I'm Aaron Smith, and I want to introduce Andrew Kerr to you from Georgia Tech. Andrew is currently a PhD student, and he's getting ready to graduate. His advisor is Suda -- you'll have to say the last name. >> Andrew Kerr: Suda Yalamanchili. >> Aaron Smith: Thank you. So Andrew has been one of the lead developers of this dynamic compilation infrastructure for GPUs called GPU Ocelot. And he's been going around to different conferences giving this tutorial. So he was kind enough to come today and give it to us. The plan is the tutorial will take about two hours. Then we're going to have some sandwiches and coffee and stuff delivered, and we'll do an hour or an hour and a half of demos. And if you want to skip out at that point, or if you're interested, feel free to stick around. Okay. So I'll let Andrew get started. >> Andrew Kerr: Okay. Thanks. Thank you, Aaron. So as he said, I'm one of the lead developers of GPU Ocelot. These are some of the other numerous contributors. Gregory Diamos and I started the project several years ago. He's now in NVIDIA research. We had several other students join us. Rodrigo Dominguez is actually from Northeastern University, and he developed an AMD back end, which I'll provide some detail on. So the structure of the talk is: describe the motivation for developing a heterogeneous dynamic compilation infrastructure like Ocelot; present an overview of Ocelot and what its purpose is; describe how we implemented the CUDA Runtime API and how we implemented an abstract device interface for basically extending CUDA onto multiple types of processors. I'll describe what was necessary to support each of these types of processors, taking maybe a deeper dive into our internal representation of PTX, which we're using as a device-agnostic data-parallel internal representation for compute kernels. I'll describe some of the research topics that we're currently looking at, and provide some examples for modifying Ocelot to cover the tutorial aspect. And then afterwards, I'm going to boot up Linux and show several demos of Ocelot running off-the-shelf CUDA applications and how we can use it to do analysis, and basically let you suggest any additional tasks or have a discussion at that point. So the motivation here is that heterogeneous computing is fairly mainstream, and it has been for quite a while. We've seen many systems with both CPUs and GPUs, and they're currently making their way into cluster environments. So they're ubiquitous, but there are still programming challenges. The challenges can be divided into maybe two different areas. Number one is that you just have multiple ISAs. Each of these processors has different instructions that reveal different functionality. So -- not two, but multiple. And so a programming model or an application would really need to be written for each of the processors available on a system to be able to take advantage of all of them. And we also have execution model differences. CPUs are optimized for maybe a small collection of high-speed serial threads. GPUs are optimized for a large number of threads and require that much concurrency to hide latencies. So we have this mismatch between each of the processors in a system and the way that they're programmed. And this is the environment in which a heterogeneous compilation framework can really excel.
And so the GPU Ocelot project tries to take a single representation of a program, in the form of a data-parallel internal representation, do program analysis on that, and translate it for execution on multiple different types of processor back ends. And so this is the overview slide of GPU Ocelot. We envision multiple language front ends, device-independent parallel program transformation and analysis, and then multiple back ends. And we've tried to make this not just a research project but also a useful vehicle for other developers and other researchers. So we've tried to adopt off-the-shelf programming languages and programming models. Most of this work started when CUDA was roughly the only thing available, so a lot of it is informed by design decisions that went into CUDA. But as Rodrigo has shown, it's possible to run CUDA applications on AMD GPUs. Some of our work has shown it's possible to run CUDA applications efficiently on CPUs. And we've also seen some examples of languages other than CUDA being compiled to PTX and run on each of these processors. At Georgia Tech we're working on building basically a database system that compiles Datalog, as opposed to SQL, to an internal representation and then provides CUDA implementations of primitives. And then the whole thing runs through Ocelot and achieves heterogeneous execution. It's also worth pointing out that we consider this a dynamic compiler because Ocelot is capable of monitoring the program as it's executing. It's able to record application behaviors, transform the code, just-in-time compile it for the processor, and continue executing it. And as a research tool we've used it for doing research in compilation and optimization for heterogeneous processors, used it to drive GPU simulators, and made contributions to GPU architectures in general. And we've also used it as a way of enhancing developer productivity: to provide basically correctness checks for applications, to make sure that they're running correctly, and also feedback for doing performance tuning. So now I'll discuss some of the additional details. First, most of this work is intended to support NVIDIA's CUDA. And we've supported several versions over the years, beginning with CUDA version 2.2 and all the way through the current version, which is CUDA 4.1. They rewrote their own compilers, so that trickled some modifications down to us. The notion here is that the core of an application's computation is expressed in data-parallel compute kernels, and the host application, using an API, launches these kernels on a device. And so we're trying to support as many different devices as possible using that one programming model. Here's the structure of a typical CUDA application. There's a host thread which calls just standard C API functions that allocate memory on a device, register compute kernels for execution on the device, launch them, and then receive the results. And then on the accelerator side, in this case the GPU, we have a kernel that is launched over a hierarchy of threads, and the threads have a two-level hierarchy in which a set of threads is grouped into a cooperative thread array. This is basically a set of threads that are mapped to the same core and can synchronize and exchange data through an on-chip scratch-pad. And then a collection of these are distributed throughout the cores of the GPU.
So this is how they map this programming model of massive amounts of parallelism onto a processor that has hundreds or one day thousands of cores. And another point here is CUDA's been around for a while. So there are a large number of CUDA applications, many of which get fairly high speedups. And Ocelot is meant to be compatible with all of them. The structure of a compute kernel itself is worth understanding in greater detail. The notion is that a C or C++ function is written, and it's fairly standard with a few exceptions. There are some keywords to indicate that it's an actual kernel or a function that's called from a kernel. There are some built-in variables to allow a thread to determine what its location is within this hierarchy of threads. These built-in variables are things like the thread ID; the block dimension, which is the number of threads per block; and the ID within the block -- I'm sorry, the ID of the block within the collection of blocks. And then this kernel itself is replicated for each of the threads. This thread hierarchy then is meant to be executed on an abstract virtual machine. The notion is there's an arbitrary-width SIMD processor where each thread is mapped to one of the lanes of the processor. There's a register file, local memory, some other explicitly addressed memory spaces like texture memory, constant memory, and parameters, and also shared memory, which is a heavily ported scratch-pad memory that allows threads to write to it, explicitly synchronize, and then fetch the results. So threads can have producer/consumer relationships. So that's the execution model of CUDA. Here's just an overview of NVIDIA's GPU. These are the GPUs that evolved with the CUDA programming model and execution model, so it's probably worth understanding. It's just a collection of processors. The processors themselves have numerous ALUs, which are meant to execute SIMD -- both spatial and temporal SIMD, in which the same instruction is issued for multiple clock cycles for different threads. Each streaming multiprocessor is the equivalent of a core, and it executes one CTA. And then finally there's a single shared last-level cache with a large crossbar interconnect and then access to six memory controllers. So GPUs have fairly high bandwidth. And the whole point of all these threads basically is to hide latency. The L2 is fairly high latency, and off-chip memory is very high latency. But by launching a large number of threads and having some of them stalled on memory requests while others are computing, you can achieve fairly high throughput. And so back to Ocelot. Ocelot is meant to be able to execute CUDA applications. CUDA programs themselves are written in C++ with compute kernels. They are compiled with NVCC, which is NVIDIA's CUDA compiler. A source-to-source translator basically separates the program into the host code and the compute kernels. The compute kernels go through a separate compilation path and ultimately wind up in this virtual assembly language called PTX. And so Ocelot uses PTX as its own internal representation. We have an implementation of the CUDA runtime API for allowing the application to register compute kernels; Ocelot parses PTX into an internal representation, does analysis and program transformations, and then ultimately issues it to one of several Ocelot devices that might be present in the system. And these actually execute the program.
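Just to ground the kernel structure and launch syntax described above in code -- this is a generic example I'm adding for reference, not a kernel from the slides -- a minimal CUDA program using the built-in variables looks like this:

    // A trivial kernel: each thread computes its global index from the built-in
    // variables and processes one element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // position in the thread hierarchy
        if (i < n) c[i] = a[i] + b[i];
    }

    void runVecAdd(const float *hA, const float *hB, float *hC, int n) {
        float *dA, *dB, *dC;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        int threadsPerBlock = 256;                                 // CTA size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid size
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);        // launch over the thread hierarchy
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

The host side here is exactly the kind of C API sequence (allocate, copy, launch, copy back) that Ocelot's CUDA runtime implementation intercepts.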
And so at runtime, the kernels are actually parsed, analyzed and translated. But you can also extract some of this functionality as stand-alone tools. For instance, if you wanted just to do a characterization study, you could use Ocelot's parser and pass manager to do optimizations or analyses of PTX kernels. One of the demos will demonstrate this. You can also use it to make online scheduling decisions: if you have multiple devices, you can have an application that changes devices depending on what might be available. So in a virtualized environment, some other application might need the GPU, and Ocelot allows you to switch context over to the multicore CPU Ocelot device and continue executing the CUDA application as if nothing had happened. If the CUDA application itself performs better on the multicore CPU than the GPU, Ocelot provides a seamless way of taking advantage of that. >>: [inaudible]. >> Andrew Kerr: Yes, please. >>: So [inaudible] I was wondering if you were going to talk about if there was any support for isolation for the shared resources? Like, for example you talked about [inaudible] cache. Is there isolation of -- >> Andrew Kerr: Okay. So -- >>: [inaudible]. >> Andrew Kerr: So on NVIDIA GPUs isolation is provided by the memory manager and this notion of a device context. And that includes things like the page table and allocation of resources. But it's fairly coarse grain. Different applications can utilize the GPU, but they utilize it by the driver partitioning certain SMs between them. And there's a virtual-to-physical mapping so they can't necessarily read each other's data. But they could evict other applications' data from the last-level cache. And they'll never share the same SMs. And so we're not really in a position to look at those types of problems because we can't control the driver. So most of this work is geared toward making use of different types of processors by translating execution models and doing efficient code generation. Fair enough? Okay. So within a CUDA application there's host code that's compiled by the system's compiler. Then there are also the GPU kernels, which are compiled by NVCC, embedded in the application's data segment, and registered when the application starts up. This is all transparent to the CUDA programmer, but the function calls are actually visible if you take a look at some of the intermediate compilation output. There are several CUDA functions that bind these GPU kernels to objects or variables. And then when a program runs, it uses those bound objects to reference the function. And so when the host application calls the kernel, these are used by Ocelot, or by any other CUDA runtime implementation, to pass the representation of the GPU kernel to the driver or to the CUDA implementation, which then does just-in-time compilation and executes it on the device. And that's just the architecture of CUDA. So we tried to implement as much of this CUDA interface as possible within Ocelot to achieve transparency. So in this case this CUDA register fat binary function passes basically a string literal containing PTX into Ocelot, which parses the PTX into its own IR. We do control and dataflow analysis. We have support for doing additional PTX-to-PTX transformations.
And we've used those to explore things like doing online optimization as well as instrumentation of compute kernels. And then these are ultimately translated into the native ISA of the target device. So if the application is executing on an NVIDIA GPU, we just re-emit the same PTX. But it's not the string literal that came from the application, it's coming from Ocelot, so any transformations are present in that PTX. On a multicore CPU, we translate PTX to LLVM's IR and use LLVM as sort of a virtual machine for compiling down to x86, or whatever other ISA you're supporting. There are some additional transformations to map the PTX thread hierarchy onto the amount of concurrency that the processor actually supports. And in the case of AMD's GPUs it's translated to the equivalent of PTX from AMD, called IL, and there are some additional heavyweight transformations that I'll talk about later. And then finally, we also have implemented a PTX emulator that directly executes PTX as if it were an instruction set and provides very fine-grain callbacks. So you can do things like trace every instruction, inspect the contents of the register file before and after the instruction executes, and intercept memory addresses. You could do things like verify that those addresses are valid or just save them to a trace file for driving some simulator down the line. So that's the overarching structure here. Some additional information before we go any further: Ocelot is available as an open source project. It's hosted on Google Code. We also have a research project site that's hosted at Georgia Tech, which has all of the publications that we've worked on. If you go to /news, you can see a blog post describing this talk, and there's a link to the presentation slides. There might be a lot of technical content that I don't necessarily cover in the greatest detail, like code snippets. But if you're really interested, you can treat this set of slides as documentation. So go visit that. And then we also have a mailing list. So if you have any questions, or you can't get it to compile, or you want to test out some ideas, you can post to the mailing list, and we'll try to get back to you as soon as possible. A complete list of contributors. And our sponsors. So if you visit the installation page you can see a list of the steps needed to compile Ocelot, and I'm just going to summarize them here. Dependencies: a C++ compiler. We use C++0x features, so it needs to be able to support those. Visual Studio 2010 didn't have any problem compiling Ocelot, which I thought was pretty impressive. We depend on Flex and Bison to implement the parser. And that was something I wasn't able to get to run on Windows without lots of hacking that I didn't get done before this talk. But presumably it's possible to build the parser generator. We use SCons as the build system. We use LLVM to support execution on multicore. We use Boost, and OpenGL libraries to do visualization. So for the source code, generally we try to do package releases about once a year. But there are a lot of bug fixes and modifications that happen in the meantime. So I generally recommend that anyone actually trying to use Ocelot just do an anonymous checkout.
The source code -- the checkout directory -- contains some scripts, but the ocelot directory has most of the source code, and each of these subdirectories corresponds to a namespace in the C++ application. So we have an implementation of CUDA, which there are some additional slides on. We have some Ocelot-specific APIs so applications can interact with Ocelot. They can add trace generators, which are analysis tools that each of the devices drives. We have IR, which defines our PTX internal representation and then IRs for both LLVM and AMD IL. An implementation of the parser. Some stand-alone applications that can be useful. For instance, it's possible to run Ocelot in a server mode. So if you have a high-end workstation running the Ocelot server, your local machine can run an Ocelot client, and through RPC over sockets your laptop can be running a heavyweight CUDA application and use the CPU in the server. So it's kind of useful. We have a set of trace generators that we've developed for doing specific studies. These are maybe one-off analysis tools, but we have a few that were actually useful enough to include with the main project, so they're located here. These do things like ensuring that memory addresses are valid, detecting race conditions, and an interactive debugger. We have translators from PTX to LLVM and PTX to AMD's IL. And then we have a set of program transformations and optimizations which do things like register allocation, splitting basic blocks at barriers, and a structural transformation for handling unstructured control flow. And finally I just want to point out there's a link to Doxygen. This is automatically generated code documentation. This set of presentations and Doxygen are basically the bulk of our documentation. We have some additional wiki pages that try to help solve common problems. For instance, some versions of Boost actually require manual patching, so we try to record that. Maybe at some point we'll write an actual manual or a book, but this is it so far. To build Ocelot, just obtain the source code. We have a script that builds with SCons, so it's pretty straightforward. You could also just type scons and it should compile libocelot.so and OcelotConfig. And then if you pass the test flag, full, it will compile a number of white-box unit tests. These evaluate Ocelot's ability to implement and run a set of CUDA programs that we wrote. It also runs unit tests on the emulator and the translators. So these are the minimal set of applications that are within Ocelot to make sure that it works. We also distribute applications that we took from the CUDA software development kit. We added these as a separate set of applications, also distributed through our Google Code site. So if you want to run Ocelot with these other applications, you don't really have to do very much; there are SCons build scripts for all of those as well. The installation directories are just the usual places: /usr/local/include, /usr/local/lib, and /usr/local/bin. And these directions are also available on the wiki. So when a CUDA application starts up, the first CUDA call instantiates Ocelot internally. And Ocelot looks in the application startup directory for this JSON document called configure.ocelot. This is a snippet from it. And this configures both Ocelot's devices as well as any trace generators that might be present.
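To make that concrete, here is a rough sketch of what a configure.ocelot file can look like. This is an illustration based on the structure described in this talk -- an executive section listing the devices the application should see, and a trace section enabling trace generators -- and the exact key names may differ between Ocelot versions, so check the sample configurations shipped with the source:

    {
        trace: {
            memoryChecker: { enabled: true },   // validate addresses on loads and stores
            raceDetector:  { enabled: true },   // flag unsynchronized shared-memory accesses
            debugger:      { enabled: false }
        },
        executive: {
            devices: [ "nvidia", "emulated", "llvm", "amd" ],  // back ends exposed as CUDA devices
            workerThreadLimit: 4                               // worker threads for the multicore back end
        }
    }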
So if you look in executive, devices is just a list of the devices that you actually want the application to see as CUDA devices. So we define nvidia, emulated, llvm, and amd. Trace contains the trace generators that we've implemented. I'll go into greater detail later in this talk, but these are basically the correctness tools that I was referring to. And then there are some additional options -- certain devices support certain options. For instance, the multicore back end will let you control the number of worker threads, so those are also in executive. Basically this is how you control the startup configuration of Ocelot. And we've implemented API functions for many of these as well. The goal is to basically never have to modify a CUDA application to be able to configure Ocelot, but there's also an API for changing these values at runtime. So you can add additional worker threads, for instance. To build an application using Ocelot, you have to do very little. The only changes are that you run NVCC on the source file as usual and you link libocelot instead of libcudart. And so that speaks to the notion that we're trying to be as transparent as possible by reimplementing all of CUDA. And then some additional libraries might be necessary, for example if you compiled with LLVM. And so we wrote this OcelotConfig application which, if you're familiar with the LLVM project, is like llvm-config. Basically this program has hard-coded paths to each of the important directories, like the library directory and the include directory, and you can use it to generate additional inputs for the command string for a compiler. And then the last point: libocelot.so replaces libcudart.so. And so that's it for the overview. Next I'm going to move on to our implementation of the device interface and the CUDA runtime API. And so once again, the host application interacts with the accelerator through just standard C functions. Some examples are cudaMalloc, which allocates memory on the device and returns a pointer; and cudaMemcpy, which takes a pointer in system memory as well as a pointer on the device and copies data to or from that allocation. Then there's the syntax for launching a kernel, which is not strictly C++. The additional parameters are the grid dimensions and block dimensions: the grid dimensions specify the number of blocks within the kernel grid, and the CTA dimensions are the number of threads within the block. So multiplying these two together gives the total number of threads. And then finally when the compute kernel is finished, another cudaMemcpy copies the data in the GPU's memory back to the host. And so NVCC, NVIDIA's CUDA compiler, does source-to-source translation and replaces this with a set of additional function calls which set up the kernel and copy the parameter values into a block of memory which ultimately gets copied onto the device. But the point is, these are CUDA runtime API functions, and this is a CUDA kernel. And by the time Ocelot is actually running, this CUDA kernel has been compiled into the PTX virtual ISA. And so for Ocelot to be able to run the CUDA application, it has to implement each of these API function calls. And so we tried to make this as modular as possible. So we have this abstract CUDA device interface. Okay. Yeah. So this CUDA device interface.
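To give a flavor of what such a device abstraction looks like before going into the details, here is a reduced, hypothetical sketch I'm adding. The names and signatures are illustrative only, not Ocelot's actual class (the real interface lives in the executive namespace and is documented in the Doxygen):

    #include <cstddef>
    #include <string>

    // Hypothetical, simplified device back-end interface in the spirit of
    // Ocelot's abstract device class.
    class DeviceSketch {
    public:
        virtual ~DeviceSketch() {}

        // Memory management in the device's address space.
        virtual void* allocate(size_t bytes) = 0;
        virtual void  release(void* pointer) = 0;
        virtual void  copyHostToDevice(void* dst, const void* src, size_t bytes) = 0;
        virtual void  copyDeviceToHost(void* dst, const void* src, size_t bytes) = 0;

        // Module and kernel management: register PTX, then launch a kernel from it.
        virtual void  loadModule(const std::string& name, const std::string& ptx) = 0;
        virtual void  launch(const std::string& module, const std::string& kernel,
                             const int gridDim[3], const int blockDim[3],
                             size_t sharedMemoryBytes,
                             const void* parameterBlock, size_t parameterBytes) = 0;
    };

Each back end (the emulator, the NVIDIA GPU, the multicore/LLVM device, the AMD device, the remote device) subclasses something like this, and the CUDA runtime front end only ever calls these methods.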
So our implementation of CUDA maps these CUDA runtime API calls onto this set of functions. The CUDA runtime API is a C API. It's very verbose; there are probably 120 or 140 functions in it, depending on which version you're looking at. And we tried to avoid as much redundancy as possible and to simplify the interface as much as possible. The goal here is to be able to support multiple devices without going to the trouble of reimplementing CUDA for all of them. So instead of implementing CUDA for each new architecture that we wanted to add to Ocelot, we just implement this device interface. And then, conversely, we wanted to be able to add multiple API front ends. And so we actually have a student working on OpenCL support. To do that, they just have to implement the OpenCL functions, which are fairly similar to the CUDA runtime API functions, in terms of the device interface. And then all of the existing devices actually work. And so we have base classes for the device in executive and then our CUDA runtime implementation in cuda. We also have a virtual interface that this CUDA runtime implements. So if you wanted to add instrumentation just to the API, you could implement that, chain it to the CUDA runtime, and just record any important data within the API calls. And then we've also extended some additional API functions, like registering a PTX module. Currently the CUDA runtime doesn't allow you to provide your own PTX; you basically always have to go through NVCC unless you use the lower-level driver API. And so this is a simplification or a way around that. There are some additional APIs for registering and removing analysis tools, and then for controlling the state of Ocelot. So here's a visualization. The CUDA runtime API and the Ocelot API sit on top of the device interface, and then each of these devices implements that interface. And I guess one additional point worth stressing here is that each of these devices behaves as if it's a single Ocelot device, and you can always program it using CUDA. And so you get kernel portability, because suddenly this CUDA kernel or OpenCL kernel or PTX kernel can now execute on each of these devices. And the only difference might be the amount of performance that you get, or in the case of the emulator you get some additional callbacks or hooks to monitor the execution of the CUDA kernel. And I also included this: this is the remote device interface that I mentioned. It's basically a custom RPC layer implemented using this Ocelot device interface. And since Ocelot is both the compiler -- meaning you could modify the program -- and also the execution manager, it has access to all of the memory allocations that are present. And so it's actually able to reproduce state from one device on another device. Actually this is kind of a useful feature if you're doing operating systems research. It's also kind of useful if you are debugging an application. You might want to run the application for, you know, minutes or hours on a high-speed device like an NVIDIA GPU until you get to a certain kernel that is only executed late in the program, but you want very detailed performance results or very detailed application behaviors -- you want to monitor the application behavior at that point.
So you could actually switch devices just for the execution of that kernel, run it on the emulator to capture an instruction or memory trace, and then resume execution on the NVIDIA GPU. The CUDA runtime API is implemented using a base class that defines all the API functions, and then one particular implementation, CudaRuntime. There are some additional data structures that maintain host-thread-local state, like parameter values. These are sort of implied by NVIDIA's design decisions when they implemented the CUDA runtime. Not much more to say about that. The class here is CudaRuntime.cpp; it's implemented in Ocelot's CUDA implementation directory. If you wanted to add your own API front end, like OpenCL, you'd follow the same strategy here of mapping OpenCL API functions onto the device interface functions. There are some additional undocumented functions like CUDA register module. So if you're doing any CUDA hacking, you might notice that a CUDA application contains PTX, but it also contains a fat binary containing a binary representation of the kernel for several different types of GPU architectures. People have been able to reverse engineer this data structure. And so we try to parse as much of that as possible, so in some sense the source code here is documenting NVIDIA's undocumented implementation choices. It's also worth noting that the CUDA runtime implementation as it stands now from NVIDIA prevents a host thread from changing devices. So if you have a multi-GPU system and you want to use multiple GPUs, you basically have to launch a new thread and kill the old one if you want to change devices at runtime. And also you can't pass pointers from one device's memory -- say, a pointer for GPU 1 -- to another thread running on GPU 2. And that's sort of a quirk, an undesirable consequence of their choice of how to implement the CUDA runtime. And so since Ocelot reimplements it, we try to get around that. The short story is Ocelot's runtime implementation allows you to change devices. So that's kind of useful. Okay. So the device interface. As I said before, this sits underneath the CUDA runtime implementation. There's an abstract base class which defines a set of methods for allocating memory, binding textures, configuring textures, registering PTX modules, getting kernels from those modules, and executing those kernels. And then there are implementations for each of the back ends that we've been interested in: the emulator, NVIDIA GPU, multicore, ATI, and then some other research vehicles like the remote device implementation. And so the point of the device interface is just to simplify each of the API calls. So we've reduced maybe a hundred or more CUDA runtime API functions to around 57 or so. And this also includes some additional data structures. So if you wanted to iterate over all the allocations in a GPU's global address space, the NVIDIA GPU device contains a vector of those, and you just use a C++ iterator to do that. And once again, the goal is to make it easy to add different devices and different APIs. >>: So how complete is your [inaudible]. >> Andrew Kerr: Our implementation supports most of the features from CUDA 3.2. NVIDIA released CUDA 4.0 and then 4.1, and they added many new, let's say, addressing modes. They added surfaces as well as textures.
And so we've tried to implement most of the features that the applications we have encountered actually use. So that means textures but not surfaces. Some things have been added to support OpenCL, and we've only been supporting those to the extent that we actually have to, to get our applications to run and to satisfy the goal of having OpenCL support. There are also features within CUDA that programs don't actually use. I was telling Aaron about this. There are many PTX instructions added by, say, people within NVIDIA to accelerate their own, like, personal kernels -- things like doing population counts to see how many threads at a certain point have evaluated a predicate variable as true or false, or doing a reduction across those, or allowing basically hardware performance counters to be copied into registers and used. And so you can write PTX that expresses these things, but CUDA programs don't actually use them, because there's no high-level language support for them. And we've tried to support those when it makes sense to, but it's a very large spec, and this is kind of a research project. So the short story is many CUDA programs work. It's very easy to write a CUDA program that exercises a control path that we don't actually support, but you'd probably be the only person to try to do that, if that makes sense. >>: [inaudible] CUDA 3.2 [inaudible] it should run on Ocelot. >> Andrew Kerr: It should run on Ocelot, yes. Some additional things have sort of broken in the last release. NVIDIA added a back door between their driver and the runtime. It's not something that any third party would actually be able to use, because they don't have any documentation about it. But since NVIDIA has taken over some third-party libraries and is distributing them as their own -- things like CUDA FFT and CUDA BLAS and the CUDA sparse primitives -- some of them actually make use of this back door. And we haven't been able to successfully reverse engineer it. So as of 3.2, we could run CUDA FFT programs through Ocelot and everything was fine. As of CUDA 4.1, the FFT library uses this back door, and it's just broken. So that's one of the challenges. But presumably it's something that could be worked out given more time. And so now we're going to take a dive into the PTX emulator. In this part we're going to go through basically all of the devices; the PTX emulator is first. And the goal here is to execute PTX programs directly on the abstract virtual machine implied by the PTX specification and then do very detailed trace analysis. So assume the thread hierarchy; here's the abstract virtual machine. We tried to implement this in software as faithfully as possible. So we have an arbitrary-width SIMD processor, an arbitrarily sized register file, arbitrarily sized local memory and shared memory, and texture sampling through software support. And so the PTX emulator just executes each instruction over all the threads and then moves on. And with this, we were able to do things like generate actual traces of PTX instructions, monitor a kernel's interaction with the memory hierarchy, characterize the application, and then use it to drive some additional timing simulators. There's another project at Georgia Tech that's developing MACSIM, which is a heterogeneous trace-driven cycle-level simulator.
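For example, here is an illustrative hand-written PTX fragment (not taken from a real application) using the warp vote and population-count instructions of the kind mentioned above:

    // Count how many threads currently hold predicate %p1 true.
    vote.ballot.b32  %r1, %p1;   // gather each thread's predicate into a 32-bit mask
    popc.b32         %r2, %r1;   // population count of that mask

This is legal PTX; the point above is that ordinary CUDA applications rarely exercise such instructions, so the corresponding paths in Ocelot get far less testing than the common ones.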
So one of the first applications of Ocelot may have been to drive instruction and address traces to it. We've exploited some of the underspecified properties of the PTX execution model, one of which is that different CTAs, or different cooperative thread arrays, within a CUDA kernel can be executed out of order, and they don't actually have to be executed concurrently. So, for example, if this is one CTA and this is another CTA, a thread here doesn't have to be live at the same time as the thread over here. So you could actually execute those CTAs serially; however, within a CTA the threads do have to be live, because they have to be able to execute barriers. And so what we do is serialize the execution of one instruction over each of the threads that are present in the CTA. Since it's a SIMD processor, if threads are taking the same control path, the same instruction is issued for each of the threads before moving on to the next. But in the case of control divergence, we predicate off the threads that haven't taken the same control path, and that predicate value is maintained within the emulator. The order in which you reconverge threads that have diverged affects the overall SIMD utilization of the processor. We're capable of measuring that, and we've also done some research that evaluates different techniques for doing thread reconvergence. The emulator is a good vehicle for seeing what the impact would be on average SIMD utilization. And we've abstracted the thread reconvergence mechanism, so it's very easy to develop and add your own reconvergence policy. PTX defines a number of special functions like texture sampling. That's a big one. Since CUDA was developed targeting GPUs, they wanted to make a lot of the GPU's special-purpose hardware available to the programming model. So texture sampling has a very large configuration space. You have one-, two-, or three-dimensional textures and, as of CUDA 4.1, cube mapping as well; different texture filtering modes, different address clamping modes, different data types, different numbers of channels, et cetera. And so we actually developed a software texture sampling library that tries to implement all of these as faithfully as possible to what the GPU actually produces. So with the exception of rounding errors, it's very close. And the emulator is also interesting as a way of adding new instructions to a GPU and testing them there, seeing how they might affect the application -- for instance, you could insert explicit reconvergence points within the application that hint to the hardware about which control path to take first. The emulator is a very convenient way of doing that kind of research. It's implemented in the following header files and corresponding C++ files. So there's the actual device interface. There's a kernel class which takes a PTX kernel with its control flow graph and dataflow analysis and then does register allocation, forms a dense packing of the instructions, and replaces labels with PC offsets. There's a call stack for actually supporting function calling within CUDA. Cooperative thread array is sort of the heart of the emulator and has implementations for each of the PTX instructions. Texture operations implements each of the texture sampling methods. And then finally, trace defines the trace generator and trace event classes.
And this is the preferred way of getting instrumentation results out of the emulator. It allows you to define a set of trace generators which are attached to the emulator. And then every time the emulator executes an instruction, it constructs a trace event object, sends it to each of the attached trace generators, and those are user defined. They can take some kind of action, like record a trace to a file or update some performance counters -- whatever you'd like. Then after the instruction is committed, another call goes back to the trace generator. There's maybe a better illustration. And so the goal here is to provide the greatest level of detail to an analysis tool for observing execution of a CUDA kernel. And so within TraceGenerator.h, we have this class that has four methods: initialize, event, postEvent, and finish. It's worth pointing out that the trace generator is applied to each of the back ends. So initialize and finish are called before and after the kernel has actually executed. At that point, parameter values are known. The state of the device is made available to the trace generator, so you could look at all the memory allocations if you're interested. You could see what parameters are being passed if you wanted to do profiling and see if specialization based on parameter values made sense. You can see how many threads are going to be launched for the kernel. And then finish is called when the kernel is finished executing. So these are called for each of the device back ends. So if you wanted to do timing for kernels running on the NVIDIA GPU, a trace generator would be a convenient way of doing that. But on the emulator, it also will call event and postEvent. And these receive a trace event object, which I'll go into detail on in the next slide. These are how you would actually inspect the state of the device on every instruction. So here we have three PTX instructions executing in the CTA, and event and postEvent are called before and after each one. The event object contains roughly the same information before and after, except that the postEvent method can look in the GPU's memory allocations and see which values were overwritten. And so the trace event object contains a device pointer, kernel grid dimensions, the program counter, the PTX instruction instance, which is the actual instruction being executed, a bit vector indicating which threads are actually active, a vector indicating which addresses have been generated by a load or store instruction, and then PCs for branch targets in the case of a control flow instruction. And it's worth pointing out that, imagine this were a load instruction: the event object would see which addresses were about to be loaded before it actually performs the load. So if it was an invalid address, the trace generator should presumably see that before it actually happens. Here's the actual class. Here's the PTX instruction -- I have a whole set of slides on this later in the tutorial that explain how PTX is represented. But the short story is the trace generator can see all the analysis that the rest of Ocelot can. There's the bit vector indicating which threads are active -- these correspond to which lanes of the SIMD processor are predicated on for execution -- and then the vector of memory addresses. So here's a very simple example of race detection.
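As a sketch of what implementing one of these looks like -- hypothetical code, not one of Ocelot's shipped generators, and the field names on the event object are stand-ins for whatever the real trace event exposes -- a generator that counts dynamic instructions and memory operations might be:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Stand-in for the trace event described above: which threads are active and
    // which memory addresses a load/store touched.
    struct TraceEventSketch {
        int activeThreadCount;
        std::vector<uint64_t> memoryAddresses;
    };

    // The four callbacks named in the talk: initialize, event, postEvent, finish.
    class InstructionCounter {
    public:
        void initialize() { dynamicInstructions = 0; memoryOperations = 0; }

        // Called before each instruction executes across the active threads.
        void event(const TraceEventSketch& e) {
            dynamicInstructions += e.activeThreadCount;
            memoryOperations    += e.memoryAddresses.size();
        }

        // Called after the instruction commits; results are visible in memory here.
        void postEvent(const TraceEventSketch&) {}

        void finish() {
            std::cout << "dynamic instructions: " << dynamicInstructions
                      << ", memory operations: " << memoryOperations << std::endl;
        }

    private:
        uint64_t dynamicInstructions;
        uint64_t memoryOperations;
    };

A real generator would derive from Ocelot's trace generator base class and read these fields off the actual trace event object.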
So here we have a CUDA kernel that declares an array in shared memory. This means that all threads in the block can write to it. But it's possible that some thread could read from shared memory expecting a value that another thread would have written, and because there's no synchronization barrier, that other thread might not have actually executed that code yet. There could be, say, a hundred or 128 threads running on one of the multiprocessors of the GPU, but it's only a 32-wide SIMD machine, so it has to do some temporal interleaving as well. So race conditions are possible. In this case we've implemented a trace generator that will initialize an array of thread IDs for every byte of shared memory. Then whenever it intercepts a store instruction, it will update that table with the ID of the thread that wrote to that byte. When it sees a synchronization instruction, it clears that. When it sees a load instruction -- a PTX load corresponding to dereferencing the shared memory array and loading its value into a register -- it will look at that annotation table, see that some other thread last wrote to it and that there was no barrier that cleared it, so it must be a race condition. And so when executed, Ocelot can be configured to throw an exception. The exception object identifies the name of the kernel -- there's a mangled name -- the program counter, the thread ID, the CTA ID, and the actual PTX instruction that's faulting; it tells you which address was being accessed and which thread last wrote to it. And it also provides the file name and line number. So when PTX is compiled by NVCC, it inserts some additional directives that store the file name and the line number, and Ocelot preserves those. So there's generally a way to map the actual PTX instruction back to the original CUDA source file. So this is what the user wrote, and this is what they see when they run it. And so they find that line 9 contains the race condition. So it's just kind of a useful tool for debugging applications. One problem that we've discovered, though, is a lot of programs have intentional race conditions. And the reason behind that is this: the extra barrier instruction here would take some additional cycles. It might be within the inner loop of the kernel, so the cycles actually matter. And the programmer knows that the threads that are reading and writing the same set of shared memory locations are likely to be packed into a warp, which is the set of threads that are actually issued to the SIMD processor at the same time. So the programmer knows it's executing in lock step, and so the race condition will never actually happen, or has a well-defined result, on today's hardware. And so as a performance optimization they remove the barrier. And it causes problems for us. But in our opinion, those programs are actually incorrect; they just happen to work. So occasionally you see these things: the program would still be correct with the barrier, but they get rid of it. Since the emulator tries to preserve the SIMD execution semantics, those types of race conditions never actually happen, because threads are always executed in lock step -- the same instruction is broadcast over all the threads before moving on. But it's still nice to be able to detect them. And there are many other programs that just have real race conditions that you don't want.
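To make the scenario concrete, here is a generic illustration I'm adding (not the kernel from the slide) of the pattern the race detector flags, along with the barrier that fixes it:

    // Each thread writes one element of shared memory and then reads a neighbor's
    // element. Launched with 256-thread blocks so the shared array covers the CTA.
    __global__ void neighborSum(const int *in, int *out, int n) {
        __shared__ int buffer[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buffer[threadIdx.x] = (i < n) ? in[i] : 0;

        __syncthreads();   // remove this barrier and the read below races with the neighbor's write

        int neighbor = (threadIdx.x + 1) % blockDim.x;
        if (i < n) out[i] = buffer[threadIdx.x] + buffer[neighbor];
    }

Without the __syncthreads, the race detector would report the load of buffer[neighbor] as reading a byte last written by another thread with no intervening barrier.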
And this can be useful for finding them. As I said before, it's also useful for catching invalid memory accesses. So here's a very simple program that tries to store values to global memory. The programmer is passing an invalid value. They also have a real allocation that they meant to pass, but unfortunately they passed the invalid pointer. And so when you run it, this address is being dereferenced and it doesn't correspond to anything. So this trace generator just uses the ability to iterate over the allocation table, and on store instructions it sees that this global address doesn't correspond to a real allocation. It lists the actual allocations that might be nearby, with their sizes. And then, again, it tells you the file name and the line number. So this is kind of a useful feature. In the years since we first developed this, NVIDIA's debugger has gotten a little bit better, so you can run a real debugger on the device. But you need their latest hardware and their latest tool chain to do that, and you might not want to upgrade. And this is also sort of composable. It throws an exception, and if you wanted your system to be able to respond to that exception without just crashing the program or without running the program in a debugger, this is a useful way of dealing with that. Yes? >>: When you use the [inaudible] did they do that in software or hardware [inaudible]. >> Andrew Kerr: They have some hardware hooks. They have some back doors into the driver, presumably, that have hardware support. I don't know how they detect race conditions without doing something heavyweight like this through software. >>: Do they actually tell you like there's [inaudible] and this thread? They give you a lot of information [inaudible]. >> Andrew Kerr: I would imagine that they do that with some kind of software instrumentation. So, I mean, that's the same information that this provides. It's just that this is running in an emulator because we don't have the ability to instrument the hardware. I guess you could implement an instrumentation tool on PTX that does this kind of tracking as well, and you'd just pay a performance penalty. I assume they do something to that effect. It's worth noting -- >>: [inaudible]. >> Andrew Kerr: Okay. Here's -- this is the interactive debugger. This is another trace generator that we implemented. And the idea is basically to take concepts that you might find in a traditional debugger and implement them in the emulator. So here the programmer is starting the application and setting a watch point, which is just a break point that's triggered when a thread accesses a particular location. So they're passing an address, a data type, and a number of elements. And then when the program runs, it breaks on the store instruction that's trying to access it. And it presents some useful information, including this here -- just as an example of what's possible to do with this trace generator framework. Because none of this requires modifying Ocelot; it just requires implementing that trace generator class and implementing each of the methods. And we've included these within the main Ocelot project, as opposed to an additional library, just because we find them pretty useful. For instance, we have the memory checker and the race detector launch the debugger automatically if configured. So you can single-step the program.
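Going back to the invalid-pointer example for a second, the shape of that bug in generic form (my own illustration, not the program on the slide) is something like:

    // The kernel itself is fine; the fault comes from the pointer passed to it.
    __global__ void writeValues(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i;          // faults when 'out' is not a real device allocation
    }

    void launchWithBug() {
        int *good = 0, *bad = 0;                   // 'bad' is never allocated
        cudaMalloc(&good, 1024 * sizeof(int));     // the allocation the programmer meant to use
        writeValues<<<4, 256>>>(bad, 1024);        // passes the invalid pointer instead
        cudaFree(good);
    }

The memory checker sees that the stored-to address corresponds to no entry in the device's allocation table and reports the nearby real allocation ('good') along with the faulting source line.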
So beyond correctness, it's also useful to understand the performance of the application. And since we're able to inspect the instruction trace, determine SIMD utilization, count instructions, and also see which addresses are being accessed, we've implemented this performance bound trace generator, which computes all of these statistics for each basic block. For instance, the amount of memory demand, how much data was accessed from shared memory, whether there are any bank conflicts on the various banks of shared memory, the number of dynamic instructions, and the number of flops. It stores this for every basic block and then provides this nice visualization -- this is basically directly out of the tool. And then it also provides an aggregate over the kernel. So if you wanted to know if your program was memory bound or compute bound, this might give you an indication of what the theoretical limits might be. It tells you the number of flops per word transferred off chip. It also lets you know where the hottest paths are. The color intensity is logarithmic with respect to dynamic instructions. And it also, again, includes the file name and the line number. So if you visualize this, you actually know that there's a loop here near scannativekernel.cu, line 51 or 61, and you can very quickly go back to the program and see the lines of CUDA source that correspond to each of these basic blocks. So we find that pretty useful. This is just another analysis that we did, to see whether there were real producer/consumer relationships in a lot of CUDA applications. So if you have an FFT, basically some thread is producing a value which ultimately is loaded by another thread. We found that a lot of applications were using a scratch pad and a synchronization but were never actually producing values consumed by other threads. They were just using the scratch pad to reorder their accesses to global memory, kind of like a cache. So we built this tool to filter out those types of applications from applications that really did need the same threads to be running on the same SM, because they're actually sharing data quite frequently. Here's a list of each of the trace generators that we've implemented. The memory checker, race detector, and interactive debugger are part of it. There's the performance bound generator and some additional trace generators that store values of parameters and kernel launch dimensions in a database. And so to use a trace generator, you implement the trace generator interface -- basically these four methods. Adding trace generators transparently can be done through Ocelot: since Ocelot is a library, a global constructor somewhere in the code linked against libocelot can call the Ocelot add-trace-generator method. This makes the trace generator live for the entire run of the application. Alternatively, if you only want to instrument a specific kernel, you could modify the program itself to call this API function. There are several different modes of using it. One is to just store traces of each instruction as it's executed, with all the memory addresses, and then analyze those later offline. Or do some kind of online analysis. Or maybe even couple it to a visualization tool if you're really interested. Most of our trace generators are in a separate subdirectory of the Ocelot project, and you'd have to explicitly link those with CUDA applications; they compile as the Ocelot trace library.
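For example, attaching a generator from inside the application might look roughly like this. This is a hypothetical sketch: I'm assuming the add-trace-generator entry point mentioned above, and the header path and exact signature should be checked against the Ocelot headers or the Doxygen:

    #include <ocelot/api/interface/ocelot.h>   // assumed location of the Ocelot API header

    int main() {
        InstructionCounter counter;            // the sketch generator shown earlier
        ocelot::addTraceGenerator(counter);    // instrument subsequent kernel launches
        // ... launch CUDA kernels as usual; they now report through the generator ...
        ocelot::removeTraceGenerator(counter); // assumed counterpart for detaching
        return 0;
    }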
And so to execute a program using the trace generators, you modify the configure.ocelot document. We have configuration objects for each of the trace generators that's implemented; for most of them you just need to enable them. Some of the additional trace generators have parameters -- the debugger, for example, allows you to always break into a kernel before executing, or to just choose a specific kernel. Set the device to the emulator if you want the detailed instruction-level trace analysis. If you just need a timer or something else where you only need callbacks at the beginning and end of a kernel, you can use the trace generator with the additional devices like the NVIDIA device, the multicore device, and the AMD device. And so here's the really, really simple source code for a trace generator that just measures load imbalance. The event method is really the only thing that matters: it sets up a set of counters for each thread, and then when an instruction triggers event, it iterates over all the possible threads within the CTA. It looks at the active data member, which is a bit vector indicating whether the thread was actually predicated on for that instruction, and increments that counter. There's also the output code that would store this to a file. When I ran this on the Mandelbrot application, you just see some variance, because it's the Mandelbrot set and not every thread has the same workload. But it's a very simple example. So that's it for the emulator. It's worth pointing out that we've used this for a number of different studies. We used it for architecture research to examine different thread reconvergence mechanisms. Other people have used it to drive their own simulators; MACSIM at Georgia Tech is one example. There's another example from the University of Texas: Mark Gebhart did a study and used Ocelot to drive it. Okay. So now I'll discuss some of the additional back ends. So that was the emulator. We have a multicore CPU back end, an NVIDIA GPU back end, and an AMD back end. The goal for the multicore CPU back end was to achieve portability by executing CUDA kernels as efficiently as possible on multicore CPUs. And our implementation uses LLVM. So we did instruction set translation from PTX to LLVM, and then execution model translation that takes this very large thread hierarchy with lots of expressed parallelism and maps it onto a smaller number of worker threads. To do this, we use basically a compiler transformation that serializes the execution of threads. We have an execution manager running in each worker thread that selects different threads to execute for different parts of the kernel. And we use LLVM as the virtual machine, which was originally designed to do just-in-time compilation to x86. And so the goal again is to map this thread hierarchy onto the different cores that might be present in the processor of interest. In most of our evaluations we were just looking at x86 multicore processors, but we're also using this to try to support ARM. The current status is that building LLVM as a cross-compiler for ARM is kind of an iffy proposition, but presumably it's possible. So in all of these cases we have a large number of threads that basically need to be serialized.
The point here is that each of these CUDA kernels is written with the assumption that threads are lightweight and can be created and destroyed very quickly. For most applications, kernel-level threads aren't that lightweight, and so we use a compiler transformation to iterate over different regions of the kernel, basically between barriers, and then on a barrier do a lightweight context switch. And so we use the compiler to insert instructions to implement the context switch. So each of these thread blocks corresponds to a CTA, each CTA is mapped onto one host thread, and then the threads within the CTA are serialized. And each worker thread is running an instance of an execution manager, which selects different threads to execute. So if most threads are waiting on a barrier, it has the option of doing some kind of intelligent scheduling to try to encourage all threads to be in the same part of the program at the same time. There are certain benefits to doing that. Namely, CUDA applications are written such that different threads are executing in lock step. And so if they access contiguous elements of memory, those accesses will be coalesced into maybe a few off-chip transactions -- sort of the equivalent of transferring whole cache lines at a time. In our case, if we serialized threads, there would be maybe tens or hundreds of cycles between a load in the first thread and a load in the second thread, even though they might hit the same cache line. And so if you can actually execute those threads in an interleaved fashion, switching very frequently between one thread and the other, the number of cycles between those accesses can be reduced and you can presumably achieve higher hit rates in the cache. The multicore back end is implemented as follows: we have the device implementation, we have a number of classes that interact with LLVM to manage the translation and to specialize the translation, and then there are several program transformations we need in order to support barriers. LLVM cooperative thread array is sort of the execution manager that chooses threads to execute. We've implemented all of the LLVM IR in our own C++ classes as opposed to including headers from the LLVM project. The reason for that was basically that LLVM is very fast moving; they don't change their instruction set very frequently, but they do change their implementation of their IR, and we didn't want to have to be as active as they are just to keep up. And so for our interaction with LLVM, Ocelot implements its own internal representation, emits that as a string, reparses that using LLVM's parser, and then uses LLVM's high-level functions to do the just-in-time compilation. We have some performance results that show that almost all the time spent in LLVM is actually in the code generator and not the parser and the emitter, so it's really not much of a performance overhead. The translator itself is implemented in the translator module, the PTX-to-LLVM translator. It's about 10,000 lines of code. It's kind of interesting. And then we have some additional transformations. We have partitioning, which tries to take a large kernel, partition it into certain regions, and then translate, compile, and execute these independently. One observation is that CUDA did not have the ability to call functions until quite recently, and they don't have a linker. So in many CUDA applications the kernels themselves are very large.
And in some libraries, like the FFT library, there are many specializations of kernels, particularly if you have a CUDA template library that's expanded in many different ways. So there's a lot of dead code. We found that partitioning the kernel into these regions and compiling them separately allowed us to avoid compiling a lot of dead code. The remove-barrier pass basically partitions the kernel on barrier instructions and treats those as context switches. So here's an example. At the beginning we have a CUDA kernel, and it's compiled to PTX -- this is PTX. And then we do a fairly standard ISA-level translation. Most PTX instructions correspond to a very small number of LLVM instructions; the arithmetic instructions basically correspond one to one. Since both LLVM and PTX have load-store architectures, there aren't too many addressing modes we have to support. The special instructions and special registers are handled either by function calls, in the case of the transcendental PTX instructions like cosine and sine, or, for special registers like the thread ID, by loading out of a context object that's passed to the translated function as its one parameter. This context contains pointers to local memory and shared memory, and it actually holds the thread ID values. And so this PTX kernel becomes a set of LLVM functions, and calling one of those LLVM functions is equivalent to executing one thread over a particular region. Some additional details. LLVM depends on SSA form, so Ocelot will first transform its own internal representation into SSA form and insert phi nodes. PTX has typed registers and typed instructions, but its typing isn't always very strict: a register could be declared as one type but loaded or stored as another -- a signed integer can become an unsigned integer, say -- whereas LLVM does strict typing, so we have some additional conversions for that. And maybe one final note is that every PTX instruction takes an optional predicate register. That's common in branch instructions, because that's how PTX expresses conditional branching, but it's a little more awkward if you have a load or a store or some other instruction that LLVM does not actually support predication for. LLVM has a conditional select instruction, but none of the other instructions can be predicated, so we have a transformation that reverses if-conversion and replaces it with control flow. And here's an example of how the translation of the add instruction is implemented. For each of the operands there's a subroutine that translates a PTX operand into an LLVM value. There's an LLVM add instruction object that basically holds the opcode and the two operands, and then it's added to the LLVM kernel. So there's roughly 10,000 lines of code to implement this translator, but the result is that the input is a PTX kernel and the output is an equivalent LLVM function. And it should be invertible. There aren't really any great, high-quality LLVM-to-PTX code generators available in the public domain -- presumably they're being developed in several places -- but the observation here is that Ocelot's translator should be an invertible process. If you wanted to use LLVM as an optimizer, you could translate PTX to LLVM, optimize it, do some additional transformations, and then use their code generator.
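(Roughly, the shape of that add translation is something like the following. This is not Ocelot's actual source; Translator::translateAdd, ir::LLVMAdd, and translateOperand are paraphrased names used to mirror the structure just described.)

    // One PTX add becomes one LLVM add: translate the operands into LLVM
    // values, fill in an add-instruction object, and append it to the kernel.
    void Translator::translateAdd(const ir::PTXInstruction& ptx,
                                  ir::LLVMKernel& llvmKernel) {
        ir::LLVMAdd add;                     // opcode is implied by the class
        add.d = destination(ptx.d);          // result register
        add.a = translateOperand(ptx.a);     // PTX operand -> LLVM value
        add.b = translateOperand(ptx.b);
        llvmKernel.push_back(add);           // append to the current block
    }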
So here's the thread serialization method. On the left side we have one large basic block in PTX with a barrier right here. For the barrier semantics to be honored, all threads have to execute up to this point before any thread proceeds. The way Ocelot's multicore back end handles this is to treat it as a context switch. The execution manager loops over all threads and calls each one. At the beginning, a scheduler block is inserted that receives an ID of which barrier the thread is waiting to enter. When the kernel is first called, threads need to enter the first block, so they execute it. Then the barrier point has been replaced by these store-local instructions, which write out all the live values and then exit. That return hands control back to the execution manager, which chooses the next thread and continues, and it repeats this process until all threads have reached the barrier. Then it reschedules the first thread, whose entry point has been updated to point to the basic block corresponding to this point in the program. There are load instructions which load the live values back into registers, and it continues. So essentially this implements a loop over the first part of the program, up to the barrier, and then a loop over the second part. In the beginning this was a fairly straightforward process -- we would only get a context switch at barriers -- but we generalized it to the concept of subkernels. As I mentioned before, this is a partitioning of the program, and each time threads exit one partition and enter another, that's an opportunity for the execution manager to schedule more threads. There are several reasons for wanting to do this. Number one is just to reduce the translation cost: if this part of the kernel is never executed, then you only have to translate these two. There are also performance benefits to trying to keep all threads within the same part of the program at the same time. You can improve the instruction cache hit rate if all threads execute in this one loop before they move on to the next. You can also get better data locality if all threads are stepping through the program at around the same rate and touching the same data, assuming there's spatial locality within the program. And there are opportunities to do more sophisticated code generation and specialization. For instance, you could compile a version of a subkernel that assumes all threads take the same control path and then replace scalar instructions with vector or SIMD instructions. And you can also implement more sophisticated scheduling. Here's an example of inserting spill instructions into basic blocks at the location of barriers. Here we just instantiate an IR PTX instruction and pass it an opcode for mov. The context here is that this is how Ocelot uses a PTX-to-PTX transformation to do most of the program transformations before it compiles to LLVM. Within PTX we've implemented dataflow analysis, so we have the ability to iterate over the set of live values. The goal here is to produce store instructions for each live value that write the value of that register, or that variable, into that thread's local memory.
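(To make the transformed kernel concrete, here is the conceptual shape of a kernel after the remove-barrier transformation, written as plain C++ purely for illustration; the real output is LLVM IR, and the entry-point constants and ThreadLocal layout here are invented. It pairs with the executeCTA sketch shown earlier: the return value is the next entry point.)

    // Live values are spilled to a per-thread local buffer at the barrier and
    // restored when the execution manager reschedules the thread.
    struct ThreadLocal { int entryPoint; float live0; int live1; };

    enum { kEntry = 0, kAfterBarrier = 1, kExit = -1 };

    int runThreadOnce(ThreadLocal* t) {
        switch (t->entryPoint) {
        case kEntry: {
            float x = 0.0f;          // ...first region of the kernel...
            int   i = 0;
            t->live0 = x;            // barrier: spill live values and return
            t->live1 = i;            // control to the execution manager
            return kAfterBarrier;
        }
        case kAfterBarrier: {
            float x = t->live0;      // restore live values
            int   i = t->live1;
            (void)x; (void)i;        // ...second region of the kernel...
            return kExit;            // thread has finished
        }
        }
        return kExit;
    }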
And so somewhere else in this transformation a local variable has been added by the remove-barrier pass -- basically a stack, a region of local memory that each thread has where it can store these kinds of things. The dataflow graph is an overlay on the control flow graph, which is the main internal representation, and it provides some additional methods to modify the program without disrupting the results of the dataflow analysis, so you don't have to redo the analysis every time you change something. So this basically creates a new instruction, sets its data type, and sets some identifiers for the source operand -- it's getting a pointer to the local memory declaration -- and its destination operand, which is the register. Sorry, these are the issued load instructions: assuming a barrier has just been encountered, they load values into registers from local memory; we set the addressing mode and data type and insert them into the kernel. Then this iterates over the set of live registers, constructs store instructions, and inserts them. So this is an example of modifying PTX using Ocelot's IR, and it happens to be in support of the multicore back end. To use the multicore back end you basically set the device to LLVM. There are some additional parameters to change, like the optimization level. These are read by the multicore back end when it's actually doing the JIT compilation from LLVM to x86, and it adds some additional LLVM optimization passes. The worker thread limit controls the number of worker threads. There are some additional transformations: you can specify the number of PTX instructions per subkernel, so you can basically choose whatever partitioning you want. Simplify CFG coalesces basic blocks connected by a single edge, and there are some other optimizations you can invoke. You can still use the trace generators, but only initialize and finish are actually called -- so if you wanted to do your own timing, that's how you'd do it. And here's just an example of how you might configure it. The NVIDIA back end is pretty straightforward. It basically involves re-emitting PTX kernels from the IR and then sending them to the GPU via the CUDA driver API. It's not very heavyweight, but it does allow Ocelot to sit between the application and the GPU and make transformations at runtime. Some simple applications are register reallocation, potentially modifying the application or changing the number of threads that are launched, and then finally instrumentation. We've done some work on instrumentation where Ocelot inserts instrumentation probes into kernels before they're executed and then transparently gets the results back. It's implemented in just these two classes, the NVIDIA GPU device and the executable kernel, which are wrappers around the driver API for managing the NVIDIA GPU, managing its device context, and issuing PTX to it. And once again, you just use the NVIDIA GPU device. You can still use trace generators; they can profile kernel launches, but you can't really interrupt the execution of a kernel when it's running on the NVIDIA GPU. And so here's the work on dynamic instrumentation that we did.
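(Backing up to the configuration example mentioned a moment ago: settings along these lines live in the configure.ocelot file the application reads at startup. The key names below are recalled from memory and may not match every Ocelot release exactly, so treat this as a sketch rather than a verbatim sample.)

    {
        executive: {
            devices: [ "llvm" ],          // run kernels on the multicore back end
            optimizationLevel: "full",    // extra LLVM passes during the JIT
            workerThreadLimit: 4          // number of host worker threads
        },
        optimizations: {
            subkernelSize: 1000,          // approx. PTX instructions per subkernel
            simplifyCFG: true             // coalesce single-edge basic blocks
        }
    }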
The goal was to allow Ocelot to transparently insert instrumentation tools into an application, allocate the data structures they might need, fetch the results from the instrumentation tool once the kernel is finished, and then make them available in some useful, composable way. It's in contrast with NVIDIA's own hardware performance counters: there are limits on how many performance counters can be active, and they always sample one SM at a time as opposed to the complete execution of the program. The goal is to be as flexible as possible by doing this in software rather than hardware and just accept whatever performance penalty you have to take -- but if it's on a GPU, it's still probably running quickly. It's largely inspired by Pin, but there are some points we have to make: since the execution model is different, there are constraints on which types of instrumentation you can actually implement. An example might be that the NVIDIA GPU is a SIMD processor executing multiple threads, but if your instrumentation tool has control flow -- like a loop -- that control flow could potentially cause the hardware to split the collection of threads into multiple WARPs, and that itself could distort the performance of the application. And if the programmer is relying on that implicit synchronization, it could actually break the program entirely. So this is implemented with those types of constraints in mind. It fits within Ocelot in the sense that the PTX transformations are applied on the fly to all PTX modules that are loaded. There is a set of callbacks that are called just before the application executes the kernel, so the instrumenter can construct its own data structures and analyze the kernel. For instance, it might set up counters for every basic block or for every thread, so it needs to know both the launch configuration and the program itself, and it has access to all of the analysis that Ocelot has already performed up to that point. One additional point is that Ocelot has the ability to remove the instrumentation. So if you want to run a program at full speed without instrumenting for a while, then suspend execution and insert an instrumentation tool on a particular part of the program -- to monitor maybe a specific kernel, or once some specific properties have been met, like the program has converged on some value and now you want to start instrumenting -- Ocelot allows you to do this at runtime. That's not really something we've seen in any of the other instrumentation tools out there; most of them tend to be source-to-source compilers, and you have to recompile the entire application if you want the instrumentation present. And we're actually looking at implementing the race detection in software that you mentioned, and also the memory checker that the emulator was doing -- do that as PTX instrumentation, insert it into the program, and run it on the GPU. >>: [inaudible]. >> Andrew Kerr: That's the [inaudible], so it depends on the instrumentation tool, and I'll cover that in greater detail in a second. I just wanted to walk through the workflow. When we originally started, we implemented the instrumenter as a PTX-to-PTX transformation, which meant interacting with the IR at the instruction level.
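(As a concrete picture of that callback workflow, here is a minimal host-side sketch. BlockCounterInstrumentor, initialize, and finish are illustrative names only -- not Lynx's actual interface -- but the CUDA runtime calls are standard; the PTX that increments the counters would be injected by the accompanying PTX pass.)

    #include <cuda_runtime.h>
    #include <vector>

    class BlockCounterInstrumentor {
    public:
        // Called just before the kernel launches: size the counter array from
        // the kernel's CFG and the launch configuration, then allocate it.
        void initialize(size_t basicBlocks, size_t threads) {
            _count = basicBlocks * threads;
            cudaMalloc(&_deviceCounters, _count * sizeof(unsigned long long));
            cudaMemset(_deviceCounters, 0, _count * sizeof(unsigned long long));
        }

        // Called after the kernel finishes: pull the counters back and free them.
        std::vector<unsigned long long> finish() {
            std::vector<unsigned long long> host(_count);
            cudaMemcpy(host.data(), _deviceCounters,
                       _count * sizeof(unsigned long long),
                       cudaMemcpyDeviceToHost);
            cudaFree(_deviceCounters);
            return host;   // caller reduces / visualizes these
        }

    private:
        unsigned long long* _deviceCounters = nullptr;
        size_t _count = 0;
    };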
If you've ever written any transformations for LLVM or Open64 -- or Visual Studio, I guess -- you're probably aware that it's kind of cumbersome if you have a large piece of code to implement in the instrumenter. So to get around this, we actually implemented a primitive C-to-PTX compiler, using a research compiler that someone else at Georgia Tech developed and building a PTX code generator for it. This is the actual implementation of an instrumentation tool that measures memory efficiency. It uses a set of labels to tell the tool where to insert the PTX that implements this block in the program -- in this case, on every memory instruction that reads or writes global memory. So it replicates that set of PTX. We have a few built-in functions that do useful things, like determining whether a particular thread is the least active thread in a WARP and computing a predicate based on that, and then, with predicated execution, implementing the rest of this code. The goal was to have just one thread in a WARP execute the instrumentation tool, but be able to observe the behaviors of all the other threads, and not create any thread-divergent conditions that would disturb the execution of the program. We have a reduction function across a buffer in shared memory. In this particular case the memory efficiency tool masks off the low-order bits of each of the addresses computed for a load or store instruction and then tries to coalesce those into as few cache lines as possible. If you mask the low-order bits, suddenly every thread just has the base address that corresponds to the start of its cache line; those are stored to a buffer in shared memory, and that one thread iterates over the buffer, reduces it into a set of unique cache lines, and stores the number of unique cache lines. Ideally, if all threads are accessing consecutive elements in memory, that corresponds to one cache line satisfying all the threads. If the threads are doing a random scatter-gather, it might correspond to as many cache lines as you have threads, and the memory performance would diminish. This tool just tries to measure that. And again, we tried to produce abstractions that are actually useful, so we implemented this unique-element reduction. The tool itself constructs a counter for every WARP and then stores both the number of times the memory instruction executed and the number of cache lines needed to satisfy it. All of this is wrapped up under the name Lynx, which includes the API for adding callbacks that construct data structures and for registering instrumentation tools, so the PTX is always transformed when the application is running, and Ocelot knows to apply these transformations as the program runs. So here are some performance results and some sample output. Over here we basically implemented a basic-block-level dynamic instruction counter.
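(To pin down the memory-efficiency metric just described, here is a small host-side C++ illustration of the reduction the tool performs on the GPU. The 128-byte line size is an assumption for illustration only.)

    #include <cstdint>
    #include <set>
    #include <vector>

    // Given the addresses touched by one warp for a single load/store, mask
    // each address down to its cache-line base and count the distinct lines.
    size_t uniqueCacheLines(const std::vector<uint64_t>& warpAddresses,
                            uint64_t lineSize = 128) {
        std::set<uint64_t> lines;
        for (uint64_t a : warpAddresses)
            lines.insert(a & ~(lineSize - 1));   // keep only the line base address
        return lines.size();  // 1 = fully coalesced; warp size = worst-case scatter
    }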
So when the kernel is launched, the instrumenter analyzes the program, counts the number of basic blocks, uses the CUDA malloc function to allocate device memory for the data structure the instrumentation tool is going to use, inserts global variables into the PTX module, and then inserts PTX instructions which count the number of instructions per basic block. Every time a block is entered by one or more threads, the counter is incremented -- so it's basically a dynamic instruction count on a per-thread basis. When the kernel ends, the instrumenter callback fetches all that data, does some reduction and some analysis, and then produces this nice visualization. This is the same kernel as before that was running on the emulator, only now it's running on the GPU. For some applications the performance impact is very low: if the application has many compute instructions per basic block, the ratio of instrumentation code to original application code is fairly low, so the performance impact is very small. In other cases, like binomial options, each basic block was about two or three instructions and the instrumentation code is probably five or six instructions, so its performance impact is pretty high. The instrumentation tools also have to update memory somewhere, and if the application is memory bound already and now suddenly there's even more memory bandwidth demand to increment the counters, that can slow it down even more. But it's really up to the instrumentation tool and the characteristics of the program what the performance impact is. And maybe one anecdote: we used some of the built-in registers to figure out which multiprocessors, or cores, were executing CTAs. We ran the Mandelbrot application through the instrumentation tool and discovered that some multiprocessors were sitting idle while other multiprocessors were oversubscribed, so the actual kernel runtime was about twice as long as it should have been. It turned out to be a hardware bug that the driver had not been able to work around. So that's just an anecdote of instrumentation being useful for observing the execution of the program in ways that NVIDIA or other hardware vendors might not have provided easy ways to do. The remote device is just sort of an experiment in RPC, where the notion is to take Ocelot device calls, serialize them, and send them to some remote server which executes them on an actual Ocelot device -- so you have your laptop running a CUDA application and your cluster actually executing it. It basically allows a multi-GPU application to suddenly become a multi-node distributed GPU application. And some of the work we did at Georgia Tech indicates that even an unreasonable amount of network latency doesn't really matter, because CUDA programs are fairly latency tolerant when it comes to kernel execution. And then switchable compute: I mentioned this before, but it basically allows you to recreate the state of a device on another device so you can do an on-the-fly context switch. It has all the same problems as serializing the state of any other application and then restarting it: if the data structure has lots of embedded pointers, you have to remap those somehow.
Ocelot is capable of remapping parameter values, and it provides a mapping table, but if the data structure itself has embedded pointers then it's up to the application to deal with it. That's basically as good as we can do without doing some kind of analysis to infer data structure types. The last element in this section is the AMD back end. This is work that Rodrigo Dominguez did at Northeastern. The goal is to translate PTX to IL, AMD's intermediate language, which is PTX-like, and execute CUDA kernels on AMD GPUs -- and to do this in a way that makes it just another Ocelot device, so if you have a system with both an AMD and an NVIDIA GPU in it, you just choose between them. One bit of motivation is that if you're trying to compare the performance of a GPU application running on two different GPU architectures, the current state of the tool chain world is that you have to rewrite the application -- you might have to rewrite it in OpenCL. And once you do that, even if you had an OpenCL implementation on both types of architectures, it's really up to how well each particular vendor implemented things. And I think NVIDIA might have an interest in making CUDA programs a little bit faster all the time, just as a matter of course. So our goal was to provide as much portability as possible, so you have one tool chain, one program representation, and then presumably efficient execution on each type of processor that's currently available to the mainstream. Working with AMD GPUs was fairly interesting. They have an unusual memory hierarchy. It's probably worth pointing out that IL has been deprecated as of the latest release; AMD is promising a new intermediate representation for GPU and heterogeneous programs, kind of like PTX and kind of like IL but better somehow. But this work targeted the previous generation of AMD Radeons. In their case, they were still using VLIW-style cores, so it was up to the compiler to make use of each of the ALUs that were available. This is opposed to NVIDIA's approach, where you have a collection of scalar threads executing on SIMD hardware. So in this case it puts a lot of pressure on the compiler to do efficient code generation. There's an arrangement of special-purpose functional units, regular ALUs, and load-store units. Another interesting characteristic is that there's a fast path and a slow path to global memory, and misaligned accesses have to be handled in software through the slow path. All of this is up to whoever is implementing a code generator, and Rodrigo dealt with it. There are also some additional issues, like the fact that you can't do a CUDA memcpy to anything except the base address of an allocation, which is complicated -- you have to set up these uniform addressing views. I don't really have too much more to say about the actual hardware. I do want to point out, though, that there's a big issue with supporting control flow. PTX has branch instructions; AMD GPUs don't. Instead they have assembly-level structured control flow instructions -- basically an if-else at the assembly level. So to get PTX kernels running on that, you actually have to do a structural transform and replace arbitrary control flow with these nested control structures. Some of you are probably familiar with that.
But it's kind of a heavyweight transformation -- essentially the equivalent of removing all the gotos in your program and replacing them with control structures. So there's an unstructured-to-structured control flow transformation within Ocelot that's implemented as a PTX-to-PTX pass. He ran several applications from the CUDA SDK. One of the interesting ideas was to see how a program that was heavily optimized for, say, NVIDIA GPUs would perform on his implementation on AMD GPUs. And he found that there's up to maybe two orders of magnitude of runtime difference for the same application, depending on whether it had been optimized for the NVIDIA GPU or not. The implementation is the structural analysis, the structural transform, and then the boilerplate necessary to interact with AMD's CAL API, which is the equivalent of NVIDIA's CUDA driver API. And here are some slides I'm probably not going to go into -- I'm running short of time -- but they basically show that a program with unstructured control flow is easy to write, particularly if you use things like short-circuit conditional evaluation, and then being able to run the PTX that corresponds to it on an AMD GPU requires lots of work. So that's truly all I wanted to say about the AMD GPU. So this is a dive into Ocelot's PTX IR implementation. PTX is sort of a virtual assembly language. This is maybe one of the useful pieces for doing compiler research, because we've implemented our own parser from PTX into the IR, and we've implemented lots of our analyses and transformations in terms of our own PTX IR. It's probably one of the better ones that I've seen available on the Internet, and for a while I think we were one of the only ones to have a fully featured PTX parser, IR, and emitter, along with analysis tools. The source is implemented in the IR directory: module, PTX instruction, operands, PTX kernel, control flow graph. We also have IRs for both AMD IL and LLVM; those are also in the IR directory. The internal representation is a collection of C++ classes that correspond to elements of PTX modules: PTX module, kernel, instruction, operand, global variables, local variables, parameters. The IR also implements the emitter, which takes the internal representation and produces a string representation that's parsable by other PTX tools. That's how we get from Ocelot to NVIDIA GPUs. There's also the ability to apply transformations and then re-emit PTX, which can be used stand-alone -- or, sorry, to do static transformations on PTX offline and then package that with the rest of the application. So if you wanted to do your own characterization studies or optimize someone else's programs, this is how you would do it. And there's also a valid method, which is our attempt to make sure that a resulting PTX instruction is actually correct. We have a shallow class hierarchy with just one PTX instruction class, so it's possible to specify parameters that just don't correspond to anything that would actually make sense as a PTX instruction, and this tries to catch that. It's always invoked whenever the emitter is running. There are translators from PTX to LLVM and to AMD's IL; these use the PTX IR as their source. It's how we implemented all of our analysis and transformation passes.
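(For a flavor of what working with that IR looks like, here is a sketch of building a single PTX instruction. The class names ir::PTXInstruction and ir::PTXOperand are real Ocelot classes, but the exact constructor signatures and header paths vary between versions, so treat the details as approximate.)

    #include <ocelot/ir/interface/PTXInstruction.h>   // path may differ by install

    // Build a 64-bit register-to-register mov: d <- a.
    ir::PTXInstruction makeMove(ir::PTXOperand::RegisterType dest,
                                ir::PTXOperand::RegisterType source) {
        ir::PTXInstruction mov(ir::PTXInstruction::Mov);
        mov.type = ir::PTXOperand::u64;
        mov.d = ir::PTXOperand(ir::PTXOperand::Register, ir::PTXOperand::u64, dest);
        mov.a = ir::PTXOperand(ir::PTXOperand::Register, ir::PTXOperand::u64, source);
        // PTXInstruction::valid() sanity-checks the result; the emitter calls
        // it automatically before the instruction is ever written out.
        return mov;
    }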
And then finally, the PTX IR is the form in which the PTX emulator actually executes. Inside the emulator there's a switch statement on the PTX instruction's opcode which calls the appropriate handler. So here's how a PTX module is structured. The text here corresponds to what you would actually get from NVCC, and these boxes correspond to how that text maps to objects within the IR. This is the module; global variables, which have a type and an identifier; a PTX kernel, which has a name, a set of parameters which themselves have types, and global variable declarations. The registers are handled as values instead of actual register declarations, so if you added additional instructions or performed a transformation to and from SSA, the actual set of registers visible in the PTX kernel might be different. There's a control flow graph, which itself is composed of basic blocks and edges. Here's the basic block. So basically the kernel owns the parameters and the control flow graph, and the control flow graph owns the basic blocks. The basic block contains a set of PTX instructions, and these have an opcode, address spaces, data types, and a collection of operands. The operand itself has various addressing modes -- since PTX is a virtual ISA, they're free to make it as complex as they want -- so an operand could actually be a vector of registers, and we have support for that; that's how they support texture sampling, where the return value is a vector instead of a single register. Then there are various addressing modes, and finally there's an implicit predicate operand, so each of these instructions could potentially have a predicate. Generally it's predicated on "always"; in the case of the conditional branch there's an explicit predicate. So that's that. Now the control and dataflow graphs. The dataflow graph is basically an overlay on the control flow graph that tracks values. It's in SSA form; it allows you to iterate over a definition and all its uses, and at any basic block you can determine the set of live values. We've provided some convenience functions to transform the control flow graph -- split a basic block at a certain point, split an edge with a new basic block, add new values and new instructions without invalidating the dataflow graph -- and to traverse the control flow graph in several different ways. And here's a very simple example that splits basic blocks at barrier instructions such that the barrier is always the last instruction of a basic block. This is one of the upstream passes that runs before replacing barriers with exits and inserting context switch points. You have an iterator that iterates over the blocks of the kernel's control flow graph, and then we iterate over the instructions within each basic block -- we have iterators for that. There's a PTX instruction; the control flow graph is meant to be ISA agnostic, so we have a base instruction class below PTX instruction, and we cast it. We examine the opcode, determine whether this basic block contains more instructions than just the barrier, and if it does, we call the control flow graph's split_block, pass the iterator that corresponds to the basic block, and give some additional information for the new edges to create. But basically the point of this is just to show that it's fairly easy to add new basic blocks.
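(Here is roughly what that barrier-splitting pass looks like. split_block is the method named in the talk, but its exact parameters differ between Ocelot versions, so the call below is illustrative rather than exact.)

    #include <iterator>

    void splitBlocksAtBarriers(ir::PTXKernel& kernel) {
        ir::ControlFlowGraph* cfg = kernel.cfg();
        for (auto block = cfg->begin(); block != cfg->end(); ++block) {
            auto& insts = block->instructions;
            for (auto inst = insts.begin(); inst != insts.end(); ++inst) {
                // The CFG is ISA-agnostic, so cast the base instruction to PTX.
                auto* ptx = static_cast<ir::PTXInstruction*>(*inst);
                if (ptx->opcode != ir::PTXInstruction::Bar) continue;
                // If the barrier is not already last, split so that it becomes
                // the final instruction; the new successor block gets the rest.
                if (std::next(inst) != insts.end()) {
                    cfg->split_block(block, std::next(inst) /*, new edge info */);
                }
                break;  // any remaining barriers now live in the new block
            }
        }
    }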
It's fairly easy to iterate over or traverse the internal representation of the program and make modifications to it. We tried to make it as convenient, sane, and conventional for C++ programmers as possible, so there are iterators for basically everything. And we've seen this spill code. It's also worth pointing out that we have internal representations for LLVM and AMD IL, and emitters for both of those. On top of the PTX IR, we've implemented a pass manager that orchestrates the application of PTX-to-PTX transformations as well as analyses. It is largely inspired by LLVM's implementation. And it's -- >>: It's largely inspired by [inaudible]. >> Andrew Kerr: It's largely inspired by [inaudible]. So thank you. But basically the concept, as you're probably all aware, is to avoid recomputing analyses unless the program itself actually changes, and if your transformation can update the analysis data structures, don't recompute them -- even though the program has changed, they're still valid. It's a structured way of doing that, so I probably won't tell you things you already know. And maybe one additional point is that there's a PTX pass manager built into the compilation pipeline for each of the devices, so you can always add transformations to a program before it's run. We have some example analyses like dataflow analysis, dominance analysis, and thread frontiers analysis, which is some architecture work that we did that tries to improve SIMD utilization for programs with unstructured control flow. These data structures are implemented in the IR for the control flow graph; the dataflow graph is an analysis, along with the dominator and post-dominator trees. We have hyperblock and superblock formation passes, so we're experimenting with if-conversion and predicating PTX instructions instead of using control flow. The divergence graph attempts to identify expressions that are uniform across all the threads within a program. If that's true, you can mark those expressions as thread-invariant, and if they're used as conditions in control flow, then suddenly you know that all threads are going to take the same paths, and you can mark regions of the kernel as uniform or convergent. And so we have an example of implementing dead code elimination as an Ocelot PTX pass. The goal is to use dataflow analysis to identify instructions that have no side effects and no users, remove them, and keep removing until no additional instructions can be removed. The dead code elimination pass is derived from the kernel pass and indicates that it needs dataflow and SSA form as dependent analyses. We just have an accessor -- fairly straightforward -- that obtains the dataflow graph and asserts that it's in static single assignment form. It iterates over the blocks of the dataflow graph, builds a work list, and then for each block on the work list identifies the instructions that are dead. The dead instructions are those that have no users or are not live-out; if an instruction has no side effects and no users, it can be eliminated. The source code is here for reference -- we don't necessarily have to walk through it, it's fairly straightforward -- but the point is that the dataflow graph is a usable data structure, there are iterators for it, and it should be fairly intuitive for someone with compiler experience.
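(For orientation, a skeleton of a pass in the spirit of that dead code elimination example is shown below. The base class and analysis flags follow the description in the talk, but the exact names and signatures vary between Ocelot versions, so this is a sketch rather than the shipped source.)

    class MyDeadCodeElimination : public transforms::KernelPass {
    public:
        MyDeadCodeElimination()
            : transforms::KernelPass(Analysis::DataflowGraphAnalysis
                                     | Analysis::StaticSingleAssignment,
                                     "MyDeadCodeElimination") {}

        void runOnKernel(ir::IRKernel& kernel) override {
            // 1. Fetch the dataflow graph (already in SSA thanks to the
            //    requirement declared above) via the pass's analysis accessor.
            // 2. Seed a work list with every block in the dataflow graph.
            // 3. For each block, erase instructions that have no users, are
            //    not live-out, and have no side effects (stores, barriers,
            //    atomics and the like must be kept).
            // 4. Re-add the predecessors of any changed block and repeat
            //    until the work list is empty.
        }
    };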
I'd also like to draw attention to this PTX optimizer program. This is a standalone utility for parsing PTX into the IR, applying custom optimizations, and then re-emitting it. It's a handy utility. And you can attach PTX transformation passes to Ocelot so they are always applied to a program. Okay. So these are just some of the other research topics that we and other people have been looking at with GPU Ocelot. Workload characterization was one of the original capabilities, and we've used it to define a set of metrics that loosely correspond to performance and then use those metrics. That led to this project called Eiger, where we basically store statistics about some set of applications running on some set of heterogeneous hardware: take measurements, store application characteristics in a database, and then create a statistical performance model so you can predict the performance of an application on new hardware or on whatever processor might be available. With performance prediction you can suddenly make scheduling decisions on heterogeneous hardware. You could also use it to do very coarse-grained analysis when doing design space exploration for, say, a cluster environment. This is meant to interact with work being done at Sandia, which has developed a simulation infrastructure for evaluating a cluster, or a set of different types of processors in the nodes of a cluster. If you wanted to do a detailed performance analysis of one particular program on a detailed simulator, you'd still need to simulate the rest of the cluster environment, and if you have a thousand nodes you don't want to do a detailed simulation of all thousand nodes -- you just need to know how quickly to replay messages coming in from the network, basically. So this is trying to do that coarse-grained and approximate but fairly fast: statistical simulation, basically. We've also been interested in doing automatic tuning. In one of our instrumentation papers we showed how the instrumentation tool itself could be used by a scheduler to adjust the issue rate of kernels coming from different applications to try to achieve fairness. That's just one example of monitoring the execution of the program through software and using Ocelot to make intelligent or acceptable scheduling decisions -- feedback-driven resource management. We've also been looking at compiling other types of programming models or execution models onto Ocelot for heterogeneous execution. This is called Red Fox, which is basically a GPU-accelerated database system. One of our sponsors is a company called LogicBlox, which has developed a high-performance database based on Datalog. Datalog is an alternative to SQL and can be used to express relational algebra. So we've implemented CUDA kernels that correspond to primitives, and the relationships between those primitives are specified by Datalog. There's a translator from Datalog into this relational algebra, which ultimately goes to a scheduler, and that scheduler ultimately invokes Ocelot, which executes the kernels. So basically, as you add different types of processors, the scheduler can schedule onto those new processors and performance improves. I'm also interested in new architectures as they're coming out.
We currently focus on just mainstream CPU architectures, but if, say, Intel's GPU had sufficient software tools that we could actually get code running on it, it would be very interesting to add it to a tool like Ocelot. We're also interested in supporting Intel's MIC and AMD's Fusion -- basically just new heterogeneous processors as they're emerging -- and in targeting some more research-oriented architectures like Rigel. Vanaheimr is just a side project; it would be interesting to see if we could get the CUDA execution model running on E2, for instance. >>: [inaudible]. >> Andrew Kerr: We were also sort of curious, if you have low-level code generation capabilities, what would change if the compiler itself were aware of all the threads that are running and of the relationships between threads. So it's just one opportunity. And I felt like citing Mark's work, because he actually used the Ocelot simulator to explore the power and performance impact of a small register cache close to the ALUs. And then thread frontiers is some work that we did. It's meant to avoid the penalties associated with unstructured control flow on SIMD architectures. Most prevailing techniques rely on reconverging at the post-dominator, or, if the program has structured control flow, just outside the control structure, basically. But if a program has unstructured control flow -- I had a good example of this -- the post-dominator of a control flow instruction can be far beyond other locations at which control flow could actually reconverge. So this work tries to come up with hardware techniques, which depend on some kind of compiler analysis and program layout, to improve the hardware's ability to reconverge threads by scheduling them differently. It basically assigns priorities to the basic blocks in the program layout such that the basic blocks are in decreasing priority, and then allows the hardware to always choose the threads that are at the highest-priority basic block. So you execute threads that are waiting near the beginning of the program before you execute threads that are near the end. The assumption is that those threads will ultimately catch up with the other threads and reconverge early. And then the interface to MACSIM is basically, as I mentioned before, just the trace generator interface that I described; MACSIM is a heterogeneous trace-driven simulator, and we use Ocelot to generate instruction traces from CUDA kernels. That work has been done by some other students at Georgia Tech under Hyesoon Kim and Rich Vuduc. And the last set of slides I have is a walk-through of how to add an additional instruction to Ocelot. In this case, NVIDIA's PTX spec defines PTX instructions for prefetching to various levels of the cache hierarchy, and this set of slides covers what was necessary to get those running on the emulator and the NVIDIA back end. The steps are: modify the PTX IR, add support to the parser and the emitter, and then implement the instruction for each of the devices -- in this case the emulator and the NVIDIA GPU back end. Since it doesn't produce any new values, it doesn't modify the dataflow graph at all.
So basically it just requires the emitter to be able to insert the instructions when Ocelot is JIT-compiling the kernel for the NVIDIA GPU back end. Here we modify the PTX instruction class with two new opcodes, add an additional enumeration that indicates which cache level the prefetch is going to, add two to-string methods, which are useful for the emitter and also just convenient, and then add the data member storing the cache level. So the PTX instruction can now represent the new opcodes and indicate which cache level it's prefetching to. Within the PTX instruction class, the functions are implemented in a fairly straightforward way. The to-string method for the cache level is pretty obvious. The PTX instruction's own to-string method is what actually implements the emitter. There's this large enumeration, and for each of the new opcodes we print the guard predicate, the opcode itself as a string literal, the address space -- one of the other modifiers of the PTX instruction, indicating which address space the pointer actually refers to -- and then the cache level. And then there's this data member called d that corresponds to the first operand of the PTX instruction. It's similar for prefetchu, which is just uniform prefetching. The valid method I mentioned earlier makes sure the PTX instruction corresponds to something that's meaningful. In this case: make sure that the cache level is either L1 or L2, that the address space is either thread-local or global, and that the operand is a meaningful memory address -- not, say, a register that you're trying to prefetch from. So it basically looks at the addressing mode of the destination operand and makes sure that it's indirect, or that it corresponds to an address, or that it's an immediate value. If all of these are true, then the valid method returns true. Once the IR can actually store the new instruction, we add parser support, which means modifying the lexer, modifying the grammar, and then modifying this PTX parser class, which actually builds up the instruction as it's being parsed. We add the new tokens .L1 and .L2, add the new opcodes to the PTX [inaudible], and modify the PTX parser to take the new tokens and translate them into the PTX IR. We do the same for the cache level; this also translates the string representation into the new opcodes. Pretty straightforward. Here we modify the grammar: we add the new opcodes, add declarations for the new tokens, and modify the instruction rule to include rules for prefetch and prefetchu. Here's the prefetch rule: it takes the prefetch token, the address space, the cache level, then an open bracket, a memory operand, and a close bracket. So it's fairly straightforward to modify the grammar. Now the parser is capable of receiving the PTX instruction and storing it in the IR, so we modify the actual emulator. Here we add handlers for the two new instructions to the cooperative thread array, which implements the emulator's execution of a CTA. It has an execute method, which is basically a while loop while the program is running: it fetches the instruction at the current PC and calls one of the handlers for each instruction. So you add new handlers for those instructions, and here's the handler for prefetch.
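(Condensing the IR and emitter changes above into one self-contained sketch -- the member and enum names here are paraphrased from the slides, not copied from Ocelot's source:)

    #include <string>

    class PTXInstructionSketch {
    public:
        enum Opcode { /* ...existing opcodes... */ Prefetch, Prefetchu };
        enum CacheLevel { L1, L2 };

        Opcode opcode;
        CacheLevel cacheLevel;   // new data member: which cache to prefetch into

        static std::string toString(CacheLevel level) {   // new to-string helper
            return level == L1 ? ".L1" : ".L2";
        }

        bool valid() const {
            if (opcode == Prefetch || opcode == Prefetchu) {
                if (cacheLevel != L1 && cacheLevel != L2) return false;
                // ...also check that the address space is local or global and
                // that the first operand is an indirect, address, or immediate
                // memory operand rather than a plain register.
            }
            return true;
        }
    };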
Since the goal here is actually to be able to drive a simulator, we need to take some action that determines which addresses are being accessed, even though the emulator itself doesn't need to do anything, because prefetch doesn't change the state. It still needs to be able to drive the addresses to the trace generators. So here we just iterate over all the threads within the CTA, determine whether each thread is actually executing -- whether its predicate operand is on and whether the thread is active at that particular location of the program -- and then evaluate the address that's being referenced, depending on which addressing mode is indicated, and add the address to this memory-addresses vector, which is a member of the trace event object that is ultimately dispatched to the set of registered trace handlers. And so if we run the program -- I have a very simple CUDA kernel that I compiled and manually modified to insert the new PTX instruction -- with a very simple trace generator that produces some output when prefetch is executed: it basically just waits for the prefetch opcode, and when that happens it prints the instruction and then iterates over the set of addresses and prints each one. And when you run it on some execution of the program, you just get this output. So that's a complete walk-through that hopefully provides a fairly detailed look into how Ocelot handles PTX and how you can add new instructions. For the NVIDIA back end, nothing else is necessary; it basically just depends on the emitter and the valid call. And so that's basically everything in the tutorial. I think we're going to have a demo shortly after lunch, and I'd be happy to answer any questions on any of these topics. >>: [inaudible] maybe you mentioned this, but if you want [inaudible] to compiler that [inaudible] this type of GPU or this type of [inaudible]. >> Andrew Kerr: Yes. Absolutely. So the question is, if you had some information about which processor is likely to execute it the fastest or which processor is available. Yes, Ocelot has the ability to switch devices on the fly. Before a kernel is executed, it has the opportunity to migrate all the device state from that device to another. And so -- >>: [inaudible]. >> Andrew Kerr: Yeah. So there's an Ocelot API call that will change devices, and the return is a mapping of old pointers in the old device's address space to new pointers -- new allocations -- on the target or destination device. It then tries to copy each of those blocks of memory as a binary blob. If the application is written in the OpenCL style, where the only pointers you actually have are parameters to the kernel, and every dereference takes that pointer plus an index, then that's sufficient, because Ocelot is sensible enough to remap the parameter values. But if you have a data structure that has lots of embedded pointers, which CUDA and PTX allow you to write, then the program itself needs to be modified to serialize those data structures. It's the same sort of problem as trying to terminate an application and then restart it at a certain location: you just have to save all the state and then rematerialize it.
And then if the pointer values aren't the same, you update those as well. But, yeah, that's definitely one of the design goals: being able to execute the same kernel within the same application on different devices depending on whether they're faster or not. >>: Sounds like [inaudible] how do you migrate [inaudible]. >> Andrew Kerr: Right. Right. So Ocelot is sort of the low-level way of doing it. In the Red Fox example, we had the Harmony runtime, which is a scheduler for scheduling kernels on the different devices that are available in the system. So if you had some sort of sophisticated performance model, Ocelot could allow you to change devices, and Ocelot also allows you to insert the instrumentation you might need to make that decision. >>: So if the applications have the same assumption about the device, then you take them and transfer them to [inaudible] back end to run [inaudible] do you have to run it in each iteration -- that actually breaks the code? >> Andrew Kerr: Yes. So the implicit synchronization among threads in the same WARP definitely breaks the multicore back end. There's a prototype multicore back end taking shape that will basically execute a kernel as if all threads take the same control path, and it vectorizes the scalar instructions. So if the WARP size must be at least, say, four threads for the program to execute correctly, it will execute correctly on the vectorized back end. I think it would be nice if the programming model itself allowed the programmer to tell the compiler and the runtime what minimum WARP size is necessary for the program to be correct. There have been several efforts that try to infer that by some kind of, say, pointer analysis of the program, but I think even in those cases they're not completely general, because you could have this sort of implicit synchronization in a code block that's divergent -- so not every thread reaches it -- and every method that I've seen that tries to allow the compiler to make this inference would break in that case. >> Aaron Smith: Okay. Let's thank our speaker and then we can have lunch and talk. [applause]. >> Andrew Kerr: So this directory is basically the Ocelot source, the main Ocelot code base. Everything is in the Ocelot directory. There are a number of unit tests; most of them are written with complete knowledge of Ocelot's internals, so there's basically a test for every PTX instruction. There are also some CUDA tests, which are not off the shelf -- they're handwritten by us -- but they're just regular CUDA programs that you should be able to run on any implementation. I'm going to run several of those. So here's the Ocelot configure file; I'll change the set of devices to include just the emulator. Here's the set of most of the applications, and here's this test CUDA sequence application. We tried to build a lot of debugging tools into Ocelot, just because it's really hard to build a compiler without having lots of observability into the state of the program. So if we change the device back end to LLVM and change the optimization level to report and run the program, then as the translator is producing an LLVM representation of the kernel, it basically inserts these function calls into every set of translated LLVM instructions, which maintain a pointer to the original PTX instruction that it's actually translating.
And you can configure the translator to produce debugging output for each of the threads. So here we just have a single thread's execution; we want to monitor the control path and make sure it's correct. Kind of a handy feature. Let me think. Okay. >>: Why are they all threads [inaudible]? >> Andrew Kerr: So this corresponds to the execution of the first thread. The thread IDs are a three-dimensional tuple -- they were trying to make it easier to write, let's say, dense linear algebra by giving a thread an ID with an X coordinate, a Y coordinate, and a Z coordinate. And this just outputs the instruction trace for a single thread; otherwise there would be quite a lot of output if you have hundreds of threads. In this directory we have a set of applications from the CUDA SDK. These are provided by NVIDIA and at other locations; it's just this repository of interesting applications that make use of CUDA in some interesting way. So I'm going to run the eigenvalues example, and hopefully you should see -- many of these have built-in quality assurance tests, so you can tell whether the implementation is actually correct by producing the correct results. They also have built-in performance monitoring, so you can see how long it takes the eigenvalues application to compute the eigenvalues of some matrix. And as you tweak various parameters, like running it on the emulator versus the LLVM back end, you can see different performance; if you add worker threads you should see performance scaling. So if we set the back end to LLVM, set the number of worker threads to one, and run the eigenvalues example, we get a certain runtime. If we change the number of worker threads to, say, four, we should see performance scaling. And there's a slight performance difference. One of these applications has a race condition, and we ran this with the race detector off. If we set it to true -- okay, I believed it had a race condition, but I guess not. If we turn the debugger on, every time a kernel runs it breaks into this debugger. The debugger has a text interface, a cool graphic, and some other things. You can set a watch point; you can print the PTX at a certain location; you should be able to single-step it so you can monitor the execution of the program; you can print the register values -- print registers for a certain set of threads that are active -- and you can determine which line of source code the corresponding PTX instruction corresponds to. Let me think. I'm just going to exit the program. So beforehand I invoked NVCC to compile the PTX for the scan program, and if you run the PTX optimizer there are a number of options; one of them dumps the control flow graph. The result is this dot file -- I can see it, one second -- and it just dumps a visualization of the control flow graph. You can zoom in on PTX instructions. This is sort of the foundation for a lot of our visualization tools that produce a dot file corresponding to the program. The PTX optimizer is kind of like LLVM's opt in the sense that you can specify some additional passes. You can also do things like reallocate all of the registers. >>: Why would you want to do that? >> Andrew Kerr: And so -- okay.
The register file is sort of a statically known quantity, and if your kernel makes heavy use of the register file, it impacts how many threads you can launch. If you have the same program targeting, say, a Tesla-class GPU, which has a certain register file -- I think it's like a 4K register file -- and then suddenly you run it on a GPU with, say, an eight-kilobyte register file like Fermi, you might want to make use of more registers. So being able to reallocate them allows you to launch more threads -- launch the right number of threads. >>: PTX [inaudible]. >> Andrew Kerr: It has a virtual register set, but the optimizing code generator doesn't try to do very aggressive register allocation, so it depends on the PTX to do spills, I believe. I've seen cases where programs will crash if you try to launch too many threads and the PTX implies too many registers. So Ocelot will basically insert the spill code for you. The PTX still reflects a large number of registers, but when OCG compiles and executes it, it will reallocate them, basically using the live ranges implied by the spill code. >>: So is that implied in NVIDIA allocators [inaudible]. >>: [inaudible]. >>: [inaudible]. >>: Well, because you're emulating the -- >> Andrew Kerr: Well, in this case you might actually want to do it for a program running on the NVIDIA GPU. If you want to launch a certain number of threads, that limits how many registers you can allocate per thread, so the register allocator will insert the spill code so you can have fewer registers. The kernel runs, and it's actually the driver that would catch the error when it sees that you're trying to allocate too many registers. And with four registers, it will basically insert spill code on every instruction. It's worth pointing out that this does insert lots of declarations, but these are basically values that come out of the SSA form, and it doesn't actually remove them -- each one of these loads just produces a new value, but its live range is basically only that long. So OCG's register allocator can deal with that pretty well. I was having some problems with my video driver, so I can't run programs that use OpenGL on this machine right now. I was waiting until the end to possibly install it and, if it actually works, show some programs with visual output. I'm not sure what you guys want to see. >>: I have a question. So you can take the transformed PTX and then pass it through NVIDIA [inaudible] you get the binary? >> Andrew Kerr: Yes. >>: Have you seen cases that, you know, [inaudible] register allocation [inaudible] PTX [inaudible] change the performance of the binary for them? Or is the binary kind of agnostic to [inaudible]. >> Andrew Kerr: I'd say for a lot of programs the programmer has already tried to do a lot of, say, performance tuning. We haven't really explored the space of doing low-level code generation and register-allocation-style optimizations to really see what performance options are available. Presumably we could do things like unrolling loops automatically. We have a register allocator; potentially their allocator is better than linear scan. But we just don't have our own results. >>: [inaudible] binary compiler [inaudible] or whatever it is can roll it back and forward? >> Andrew Kerr: Yeah. That's true. And occasionally I guess it probably does do heavyweight transformations like that.
We've been asking NVIDIA for insight into what their code generator is actually doing and for the ability to control it. But yeah, you're right, it's just something that's below what we can actually interact with. >>: [inaudible] compiler or is it just a simple task -- >> Andrew Kerr: I'm told it's heavyweight, but no details have been available to me that precisely describe what that means. >>: It's being released. >> Andrew Kerr: Well, it's being released according to the NVIDIA marketing team. >>: You could sign up for -- >> Andrew Kerr: Yeah. >>: You just have to [inaudible]. >> Andrew Kerr: You just register your interest. So let me think. I've sort of demonstrated the debugger. We could write a CUDA program and try to crash the emulator and see what its output is. I'm not really sure what everyone else wants to see. >>: Multi-targeting [inaudible]. >> Andrew Kerr: The multi-targeting? Okay. So here's a set of functions that interact with Ocelot that are outside of CUDA. For this context switch you basically just call this and give it the index of the destination device and the source device. CUDA allows you to enumerate devices, and we just express all of the Ocelot devices through that API, but this lets you actually invoke the context switch. The return value is a mapping from all of the old allocations in the old address space to the new allocations. Sorry, you can't see what I'm typing. And here's a simple example registering a PTX kernel. So this actually implements the context switch. Up until this point the kernel was just executing on whatever device was set; then the context switch is performed at runtime, the pointer map is provided, and it's up to the user to do something sensible with the pointer map. Generally you would probably just want to transform the base pointers of your data structures. And if you have a tree or a graph or something, you just use indices instead of actual pointers so you don't have to remap each of those. It's very standard. Here we're using the Ocelot launch method instead of the usual syntactic sugar that lets you pass the kernel the grid and block dimensions using the brackets. This is just a convenient way of writing C++ instead of CUDA and only using the CUDA compiler to compile the actual compute kernel. >>: So once you have this mapping, have you [inaudible] Ocelot to make some smart decision about [inaudible] to be mapped? >> Andrew Kerr: Okay. So Ocelot gives you the function that will do it, and what you need is some additional runtime. If the program itself is running with Ocelot, which generally it will be, you can use instrumentation or some of the other callbacks, like the trace generators, to install your own handler that measures kernel runtimes. Let's say you keep a list of how long each kernel took every time you ran it and use that to feed some kind of performance modeling tool you have. You might build a model where runtime is equal to some constant times problem size, where problem size is a parameter value you can observe. Then you might say, okay, if I adjust one of those parameters -- let's say I have a machine with twice as many SMs, or its clock frequency is higher -- I should expect that the new performance will be something else.
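Going back to the context-switch demo a moment ago, a rough sketch of that flow in code is below: register a PTX module, launch by name, then context-switch and remap. The Ocelot entry points shown here (registerPTXModule, launch, contextSwitch, the PointerMap return type, and the header path) are written from memory and should be checked against ocelot.h; the assumption that ocelot::launch picks up the grid configured through cudaConfigureCall/cudaSetupArgument is also just that, an assumption, and the module name, kernel name, and device indices are made up for illustration.

    #include <ocelot/api/interface/ocelot.h>   // header path from memory
    #include <cuda_runtime.h>
    #include <fstream>

    int main() {
        // Register a PTX module that was compiled separately by nvcc.
        std::ifstream ptx("scan.ptx");
        ocelot::registerPTXModule(ptx, "scanModule");      // module name chosen here

        float* data = 0;
        cudaMalloc((void**)&data, 1024 * sizeof(float));

        // Configure the launch the way the <<< >>> sugar would, then launch by name.
        cudaConfigureCall(dim3(4), dim3(256));
        cudaSetupArgument(&data, sizeof(data), 0);
        ocelot::launch("scanModule", "scanKernel");

        // Move all allocations from device 0 to device 1; the returned map says where
        // each old allocation now lives, so base pointers can be rewritten.
        ocelot::PointerMap remapped = ocelot::contextSwitch(1, 0);
        data = (float*)remapped[(void*)data];

        ocelot::launch("scanModule", "scanKernel");        // same kernel, now on the new device
        return 0;
    }

After the switch it's up to the application to rewrite its own pointers, which is why index-based trees and graphs avoid most of the remapping.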
And so your runtime tool, running as a trace generator, could actually call this context switch function and then choose a new device. The main goal, though, is to let Ocelot be the low-level handler of execution and then have some additional level of abstraction that manipulates it -- using Ocelot to insert measurement tools and then modify the execution of the program -- but Ocelot isn't necessarily the final orchestrator of execution. It basically lets you say: execute a kernel on this device. If you had a more sophisticated runtime, that might be more appropriate in, like, an OpenCL-style command scheduler, and you could use Ocelot to actually implement those commands. Or if your programming model wasn't C++ and CUDA but was more along the lines of Intel Array Building Blocks or C++ AMP, there might be room for a higher-level, more abstract scheduler. Then you'd have data structures partitioned and could execute some kernel over some subset of the data structure on a [inaudible] device. >>: Is there somewhere to get into Ocelot's scheduler? >> Andrew Kerr: The Ocelot scheduler is an imperative thing, but, yeah -- so it's basically -- let's see. So here's the device class. It has some subclasses describing properties, this abstract base class for memory allocation, and ultimately a launch function. So here. Each of the back ends implements this interface. So the NVIDIA back end's launch method will ultimately call the driver API's launch function. Conceivably you could have an additional layer of indirection where you have a scheduled GPU device of some sort that has awareness of all of the different threads of execution and is able to come up with an even more sophisticated schedule for executing those kernels. But so far Ocelot is mostly imperative. Even though the CUDA function calls can be asynchronous, many of the back ends just block until the kernel has actually returned. In the case of the multicore back end that kind of makes sense, because the worker threads are doing most of the work, and there's no point in letting the application use the main CPU when it could be running the kernels faster. But the short summary is: just override the launch method, and possibly override the memory allocation methods if you want. If you had a program representation that could track dependencies between kernels, that would probably be the right place to build a scheduler. We were wondering whether it was possible to use Ocelot and instrument just the runtime API to track which values are being passed to the various kernels and come up with a dataflow graph at the kernel level. That was kind of interesting, but nothing really became of it. Our main goal was: if you already had that information somehow, could a scheduler using Ocelot make things faster? And that work went into the Red Fox implementation of the Datalog execution manager.
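The measurement side of the earlier answer about choosing devices can be sketched as a trace generator that timestamps each kernel launch. The class and method names below (trace::TraceGenerator, initialize, event, finish, ocelot::addTraceGenerator) and the header paths are written from memory and should be checked against the headers in an Ocelot checkout; the accumulated-runtime table feeding a device-selection policy is purely illustrative.

    #include <ocelot/api/interface/ocelot.h>
    #include <ocelot/trace/interface/TraceGenerator.h>
    #include <chrono>
    #include <map>
    #include <string>

    // Hypothetical profiler: record how long each kernel takes so that a separate policy
    // (e.g. runtime ~ constant * problem size) can later pick a device and invoke the
    // context switch shown earlier.
    class KernelTimer : public trace::TraceGenerator {
    public:
        void initialize(const executive::ExecutableKernel& kernel) {
            _name = kernel.name;   // member/accessor name may differ in your Ocelot version
            _start = std::chrono::steady_clock::now();
        }
        void event(const trace::TraceEvent&) { /* per-instruction events ignored here */ }
        void finish() {
            runtimes[_name] += std::chrono::duration<double>(
                std::chrono::steady_clock::now() - _start).count();
        }
        std::map<std::string, double> runtimes;   // seconds per kernel, accumulated
    private:
        std::string _name;
        std::chrono::steady_clock::time_point _start;
    };

    // Usage: construct once and register it before the kernels run, e.g.
    //   KernelTimer timer;
    //   ocelot::addTraceGenerator(timer, true /* persist across launches */);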
>>: You said -- for example [inaudible] this is a data-parallel [inaudible]. And you [inaudible] OpenCL, and you have a back end per device but not a back end for, you know, a set of devices that [inaudible]. Are there any features in OpenCL that, if you [inaudible] OpenCL, would help you [inaudible]? >> Andrew Kerr: I think the features in OpenCL that would make it better are the ones that construct the command queues -- to create multiple queues and control the scheduling between each queue. We're actually quite interested in that problem in particular. There's an ongoing effort to build a very lightweight, imperative OpenCL API front end for Ocelot. It still requires you to use some other tool to compile the OpenCL kernel to PTX, but the API itself can go through Ocelot -- that is, the API is implemented within Ocelot, and it assumes another tool has compiled the OpenCL to PTX. We would like to expand that and look at opportunities to make more intelligent scheduling decisions based on what ultimately amounts to dataflow analysis at the kernel level, inferred from the contents of the command queues. But I don't think we would have to do any backtracking; it's just something we haven't implemented yet. >>: I assume that -- I mean, maybe this is [inaudible] but OpenCL and [inaudible] NVIDIA [inaudible] CUDA and OpenCL [inaudible]. I don't know if that [inaudible]. >> Andrew Kerr: I think -- >>: [inaudible]. >> Andrew Kerr: I think that's because at the time their Open64-based kernel compiler from CUDA to PTX was very mature, while their OpenCL implementation started basically from the ground up on LLVM. So at the same time they were developing driver support for OpenCL, they were also building their PTX code generator. And I think by the time the performance became comparable, they just switched and used their LLVM tool chain for both CUDA and OpenCL. You could probably make the comparison -- I think there's one command line switch that will use the Open64 back end instead of LLVM in the CUDA compiler, so you could write a CUDA program and just run the test. Is there anything else you'd like to see? >>: [inaudible]. >> Andrew Kerr: Okay. So, I don't know, maybe some conclusions. The source code has been freely available since 2009 under a new BSD license. We develop on it continuously, and most of the work is either in the main trunk, which we try to avoid breaking at all costs, or in a variety of branches. There's work to add vectorization to take advantage of SSE and AVX in the multicore back end. We try to remain as current as possible with respect to CUDA features, so we're in the process of adding new API support and new PTX instructions to Ocelot as they come out. We also have a large set of applications outside of the CUDA SDK that we've used for our own research studies and that are just available. These are from the Parboil benchmarks, Rodinia -- I forget where Rodinia is from. Oh, the University of Virginia. Yeah. And some other applications that just have unstructured control flow that we used for a study like [inaudible] GPU.
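Circling back to the command-queue answer above: the scheduling hook he is pointing at is that in standard OpenCL each kernel is enqueued on an explicit queue, so a runtime that owns several queues -- say one per device -- can decide per launch where work goes and order queues with events. A minimal host-side sketch using only the standard OpenCL C API (nothing Ocelot-specific, and with made-up sizing):

    #include <CL/cl.h>
    #include <stdio.h>

    int main() {
        cl_platform_id platform;
        cl_device_id devices[2];
        cl_uint count = 0;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, devices, &count);

        cl_int err;
        cl_context ctx = clCreateContext(NULL, count, devices, NULL, NULL, &err);

        // One in-order queue per device; a scheduler picks a queue for each kernel launch.
        cl_command_queue q0 = clCreateCommandQueue(ctx, devices[0], 0, &err);
        cl_command_queue q1 = clCreateCommandQueue(ctx, devices[count > 1 ? 1 : 0], 0, &err);

        printf("created queues for %u device(s)\n", count);
        // ... clEnqueueNDRangeKernel(q0 or q1, ...), with cl_events expressing
        //     dependencies between kernels on different queues.

        clReleaseCommandQueue(q1);
        clReleaseCommandQueue(q0);
        clReleaseContext(ctx);
        return 0;
    }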
We also noticed that some applications just don't use the CUDA runtime API -- like OptiX and some other tools that just use the driver API -- and so there's a prototype driver API implementation. But we also built a tool that's just a very thin layer wrapping the driver API, and it captures the device state before and after each kernel executes. So you can take an application binary, without even recompiling it, run it through this tool, and it will capture the state of the GPU. Then you can replay that through Ocelot. So if you have a full application but you just want to study the execution of a single kernel, you can, and it will tell you whether the results are correct. So if you wanted to evaluate an optimization, you could launch just that single kernel and initialize the GPU device with whatever state that kernel required. So. I think that's everything I wanted to show. So thank you for attending. [applause]