>> Aaron Smith: So I'm Aaron Smith. And I want to introduce Andrew Kerr to you from
Georgia Tech. So Andrew is currently a PhD student. And he's getting ready to
graduate. His advisor is Suda -- you'll have to say the last name.
>> Andrew Kerr: Suda Yalamanchili.
>> Aaron Smith: Thank you. So Andrew has been like one of the lead developers of
this dynamic compilation infrastructure for GPUs called GPU Ocelot. And so he's been
going around different conferences and giving this tutorial. So he was kind enough to
come today and give it to us.
So the plan is the tutorial will take about two hours. And then we're going to have some
sandwiches and coffee and stuff delivered. And we'll do like an hour of demos or an
hour and a half of demos. And if you want to skip out at that point or if you're interested,
feel free to stick around. Okay. So I'll let Andrew get started.
>> Andrew Kerr: Okay. Thanks. Thank you, Aaron. So as he said, I'm one of the lead developers of GPU Ocelot. These are some of the other numerous contributors. Gregory Diamos and I
started the project several years ago. He's now in NVIDIA research. We had several
other students join us.
Rodrigo Dominguez is actually from Northeastern University, and he developed an AMD
back end, which I'll provide some detail on.
So the structure of the talk is to describe sort of a motivation for developing a heterogeneous dynamic compilation infrastructure, like Ocelot. Present an overview of
Ocelot, what its purpose is. Describe how we implemented the CUDA Runtime API,
how we implemented a device -- and abstract device interface for basically extending
CUDA on to multiple types of processors. I'll describe what was necessary to support
each of these types of processors, taking maybe a deeper dive into our internal
representation of PTX, which we're using as sort of a device agnostic data parallel
internal representation for compute kernels. Describe some of the research topics that
we're currently looking at. Provide some examples for modifying Ocelot to maybe sort
of cover the tutorial aspect.
And then afterwards, I'm going to boot up Linux and show several demos of Ocelot
running, just off the shelf CUDA applications, how we can use it to do analysis, and
basically let you suggest any additional tasks or have a discussion at that point.
So the motivation here is heterogeneous computing is fairly mainstream. It has been for
quite a while. We've seen many systems with both CPUs and GPUs. They're currently
making their way into cluster environments.
And so they're ubiquitous, but there are still sort of programming challenges. And the challenges can be broken down into maybe two different areas. Number one is you just have multiple ISAs. Each of these processors has different instructions that expose different functionality. So -- not two,
but multiple.
And so a programming model or an application would really need to be written for each
of the processors available on a system to be able to take advantage of all of them.
And we also have execution model differences. So CPUs are optimized for maybe a
small collection of high-speed serial threads. GPUs are optimized for a large number of
threads and require that much concurrency to hide latencies.
So we have sort of this mismatch between each of the processors in a system and the
way that they're programmed. And so this is sort of the environment in which a
heterogeneous compilation framework can really excel. And so the GPU Ocelot project
tries to take a single representation of a program in the form of a data parallel internal
representation and then do program analysis on that, and translate it for execution on to
multiple different types of processor back ends.
And so this is sort of the overview slide of GPU Ocelot. We envision multiple language
front ends, device independent parallel program transformation and analysis and then
multiple back ends.
And we've tried to make this not just a research project but also sort of a useful vehicle
for other developers and other researchers. So we've tried to adopt off-the-shelf
programming languages and programming models. And so most of this work started
when CUDA was roughly the only thing available. So a lot of it is sort of informed by
design decisions that went into CUDA.
But as Rodrigo has shown, it's possible to run CUDA applications on AMD GPUs.
Some of our work has shown it's possible to run CUDA applications efficiently on CPUs.
And we've also seen some examples of languages other than CUDA being compiled to
PTX and run on each of these processors.
At Georgia Tech we're working on building a -- basically a database system that
compiles Datalog as opposed to SQL to an internal representation and then provides CUDA implementations of primitives. And then the whole thing runs through Ocelot and sort of achieves heterogeneous execution.
And it's also worth pointing out that we consider this a dynamic compiler because Ocelot is capable of monitoring the program as it's executing. It's able to record application behaviors, transform the code, Just-in-time compile it for the processor
and continue executing it.
And as a research tool we've used it for doing research in compilation and optimizations for heterogeneous processors, used it to drive GPU simulators, and to make contributions to GPU architectures in general. And we've also used it as sort of a
way of enhancing developer productivity, to provide basically correctness checks for
applications, to make sure that they're running correctly. And also feedback for doing
performance tuning.
So now I'll discuss some additional details. First, most of this work is intended to support NVIDIA's CUDA. And we've supported several versions over the years, beginning with CUDA version 2.2, all the way through the current version, which is CUDA 4.1. They rewrote their own compilers, so that sort of trickled some modifications
down to us.
The notion here is the core of an application's computation is expressed in data parallel
compute kernels. And the host application, using an API, launches these kernels on a
device. And so we're trying to support as many different devices using that one
programming model.
Here's sort of the structure of a typical CUDA application. There's a host thread which
calls just standard C API functions that allocate memory on a device, register compute kernels for execution on the device, launch them, and then receive the results.
And then on the accelerator side, in this case the GPU, we have this hierarchy of -- we
have a kernel that is launched over a hierarchy of threads and the threads have sort of
a two-level hierarchy in which a set of threads is grouped into a cooperative thread
array. This is basically a set of threads that are mapped to the same core, can
synchronize and exchange data through an on-chip scratch-pad. And then a collection of
these are distributed throughout the cores of the GPU.
So this is how they sort of map this programming model of massive amounts of
parallelism on to a processor that has hundreds or one day thousands of cores. And
another point here is CUDA's been around for a while. So there are a large number of
CUDA applications, many of which get fairly high speedups. And Ocelot is sort of meant to
be compatible with all of them.
The structure of a compute kernel itself is worth understanding in greater detail. The
notion is a C or C++ function is written. And it's fairly standard with a few exceptions.
There are some keywords to indicate that it's an actual kernel or a function
that's called from a kernel. There are some built-in variables to allow a thread to
determine what its location is within this hierarchy of threads. And so these
built-in variables like thread ID, block dimension, which is the number of threads per
block, the ID within the block -- I'm sorry, the ID of the block within the collection of
blocks. And then this kernel itself is replicated for each of the threads. And this thread
hierarchy then is meant to be executed on this abstract virtual machine. And the notion
is there's an arbitrary-width SIMD processor where each thread is mapped to one of the
lanes of the processor. There's a register file, local memory, some other explicitly
addressed memory spaces like texture memory, constant memory, parameters, and
also shared memory.
There's a heavily ported scratch-pad memory that allows threads to write to it, explicitly
synchronize and then fetch the results. So threads can have producer/consumer
relationships.
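As a rough illustration (a sketch, not an example from the slides), a kernel that uses these built-in variables and the shared-memory scratch-pad with a barrier might look like this:

```cpp
// Illustrative CUDA kernel (a sketch, not from the talk). Each thread computes
// its global index from the built-in variables, stages a value in the on-chip
// shared memory, and synchronizes so a neighboring thread's value can be read
// safely. Assumes at most 256 threads per block and, for simplicity, that n is
// a multiple of the block size.
__global__ void shiftKernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // per-CTA scratch-pad
    int tid = threadIdx.x;                      // thread ID within the block
    int gid = blockIdx.x * blockDim.x + tid;    // global thread index

    if (gid < n) {
        tile[tid] = in[gid];                    // producer: write to shared memory
    }
    __syncthreads();                            // barrier: make the writes visible

    if (gid < n) {
        int neighbor = (tid + 1) % blockDim.x;  // consumer: read another thread's value
        out[gid] = tile[neighbor];
    }
}
```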
So that's the execution model of CUDA. Here's just sort of an overview of NVIDIA's
GPU. This is sort of their GPUs that sort of evolved with the CUDA programming model
and execution model, so it's probably worth understanding. So it's just a collection of
processors. The processors themselves have numerous ALUs, which are meant to execute in both spatial and temporal SIMD -- the same instruction is issued for multiple clock cycles for different threads. Each streaming multiprocessor is the equivalent of a core. It executes one CTA.
And then finally there's a single shared last-level cache with a large cross-bar interconnect and then access to six memory controllers. So GPUs have fairly high bandwidth. And the whole point of all these threads basically is to hide latency. So the L2 is very high latency, off-chip memory is very high latency. But by launching a large number of threads and having some of them stalled on memory requests while others are doing computation, you can achieve fairly high throughput.
And so back to Ocelot. So Ocelot is meant to be able to execute CUDA applications.
And so CUDA programs themselves are written in C++ with compute kernels. They are
compiled with NVCC, which is NVIDIA's CUDA compiler. A source-to-source translator basically separates the program into the host code and the compute kernels. The compute kernels go through a separate compilation path and ultimately wind up in this virtual assembly language called PTX.
And so Ocelot sort of uses PTX as its own internal representation. So we have an
implementation of this CUDA runtime API for allowing the application to register
compute kernels, parses PTX into an internal representation, does analysis, program
transformations, then ultimately it issues it to several Ocelot devices that might be
present in the system. And these actually execute the program.
And so at runtime, the kernels are actually parsed, analyzed and translated. But you
can also extract some of the tools that we built, the stand-alone tools -- or, I'm sorry,
extract some of this functionality as stand-alone tools.
For instance, if you wanted just to do sort of a characterization study, you could use Ocelot's parser and pass manager to do optimizations or analyses of PTX kernels. And in one of the demos I'll sort of demonstrate this.
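For example, a tiny stand-alone characterization tool might use the parser roughly like this; this is a sketch based on Ocelot's public headers, and the exact class and method names may differ between releases:

```cpp
// Sketch of a stand-alone tool that loads a PTX file with Ocelot's parser and
// lists the kernels it contains. Based on Ocelot's ir::Module interface; exact
// names and signatures may vary between releases.
#include <ocelot/ir/interface/Module.h>
#include <iostream>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    ir::Module module(argv[1]);        // parse the .ptx file into Ocelot's IR
    module.loadNow();
    for (auto& kernel : module.kernels()) {
        std::cout << "kernel: " << kernel.first << "\n";   // kernel name
    }
    return 0;
}
```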
You can also use it to make basically online scheduling decisions. If you have multiple devices here, you can have an application that basically changes devices depending on what might be available. So in a virtualized environment, some other
application might need the GPU. And so Ocelot allows you to basically switch context
over to this multicore CPU Ocelot device and continue executing the CUDA application
as if nothing had happened.
If the CUDA application itself performs better on the multicore CPU than the GPU,
Ocelot provides sort of a seamless way of taking advantage of that.
>>: [inaudible].
>> Andrew Kerr: Yes, please.
>>: So [inaudible] I was wondering if you were going to talk about if there was any
support for isolation for the shared resources? Like, for example you talked about
[inaudible] cache. Is there isolation of --
>> Andrew Kerr: Okay. So --
>>: [inaudible].
>> Andrew Kerr: So on NVIDIA GPUs isolation is provided by this memory manager
and then this notion of a device context. And that includes things like the
page table and allocation of resources. But it's fairly coarse grain. And typically you
only get -- like different applications can utilize the GPU. But they utilize it by the driver partitioning certain SMs between them. And there's a virtual-to-physical mapping so they can't necessarily read each other's data. But they could evict other applications' data from the
last level cache. And they'll never share the same SMs.
And so we're not -- we're not really in a position to look at those types of problems
because we can't control the driver. So most of this work is sort of geared toward
making use of different types of processors by translating execution models and doing
efficient code generation. Fair enough? Okay.
So within a CUDA application there's host code that's compiled by the system's compiler. Then there are also the GPU kernels, which are compiled by NVCC and embedded in the application's data segment and registered when the application starts up. And this is all sort of transparent to the CUDA programmer. But these function calls are actually visible if you take a look at some of the intermediate compilation output.
There's a -- several CUDA functions that sort of bind these GPU kernels to objects or
variables. And then when a program runs, it uses the -- those bound objects to sort of
reference the function. And so when the host application calls the kernel, these are
used by Ocelot or by any other CUDA runtime implementation to pass the
representation of the GPU kernel to the driver or to the CUDA implementation which
then does Just-in-time compilation and executes it on the device. And that's just sort of
the architecture of CUDA.
So we tried to implement as much of this CUDA interface as possible within Ocelot to achieve transparency. So in this case this CUDA register fat binary call passes basically a string literal containing PTX into Ocelot, which parses PTX into its own IR. We do
control and dataflow analysis. We have support for doing additional PTX to PTX
transformations. And we've used those to explore things like on -- doing online
optimization as well as instrumentation of compute kernels.
And then these are ultimately translated on -- into the native ISA of the target device.
So if the application is executing on an NVIDIA GPU, we just re-emit the same PTX.
But this is -- it's not the string literal that came from the application, it's coming from
Ocelot. So any transformations are sort of present in that PTX.
On a multicore CPU, we translate PTX to LLVM's IR and use LLVM as sort of a virtual
machine for compiling back to x86, or whatever other ISA you're supporting. There are some additional transformations to map the PTX thread hierarchy on to the amount of concurrency that the processor actually supports.
And in the case of AMD's GPUs it's translated to sort of the equivalent of PTX from
AMD called IL. And there are some additional heavyweight transformations that I'll talk about later.
And then finally, we also have implemented a PTX emulator that directly executes PTX as if it were an instruction set and provides very fine-grain callbacks. So you
can do things like trace every instruction, see -- inspect the contents of the register file
before and after the instruction executes, intercept memory addresses.
You could do things like verify that those addresses are valid or just save them to a
trace file for driving some simulator down the line. So that's sort of the overarching
structure here.
And so some additional information before we go any further is Ocelot's available as an
open source project. It's hosted on Google Code. We also have a research project site
that's hosted at Georgia Tech, which has all of the publications that we've worked on.
If you go to slash news, you can see sort of a blog post describing this talk. And there's a
link to the presentation slides. So there might be a lot of technical content that I don't
necessarily cover in the greatest detail, like code snippets. But if you're really interested, you can sort of treat this set of slides as documentation. So go visit that.
And then we also have a mailing list. So if you have any questions or you can't get it to
compile or you want -- you want to test out some ideas, you can post to the mailing list,
and we'll try to get back to you as soon as possible.
A complete list of contributors. And our sponsors. So if you visit the installation page you can see sort of a list of the steps needed to compile Ocelot. And I'm just going to summarize them here. So dependencies: a C++ compiler. We use C++0x features, so it needs to be able to support those. Visual Studio 2010 didn't have any problem compiling Ocelot, which I thought was pretty impressive.
We depend on Flex and Bison to implement the parser. And that was something I
wasn't able to get to run on Windows without lots of hacking that I didn't get done before
this talk. But presumably it's possible to build a parser generator. We use Scons as the
build system. We use LLVM to support execution on multicore.
We use Boost, and OpenGL libraries to do visualization. So -- so for the source code, generally we try to do package releases about once a year. But there are a lot of bug fixes and modifications that happen in the meantime. So I generally recommend that anyone actually trying to use Ocelot just do an anonymous checkout.
The source code or the checkout directory contains some scripts, but the Ocelot directory has most of the source code, and each of these subdirectories sort of corresponds to a namespace in the C++ application. So we have an implementation of CUDA, which there are some additional slides on. We have some Ocelot-specific APIs so applications can interact with Ocelot. They can add trace generators, which are sort of analysis tools that each of the devices drives. We have IR, which defines our PTX internal representation, and then IRs for both LLVM and AMD IL. An implementation of the parser. Some stand-alone applications that can be useful. For
instance, it's possible to run Ocelot in sort of a server mode. And so if you have a high-end workstation running the Ocelot server, your local machine can run sort of an Ocelot client, and through RPC over sockets your laptop can be running a heavyweight CUDA application and use the CPU in the server. So it's kind of useful.
You have a set of trace generators that we've developed for doing specific studies.
These are just maybe one-off analysis tools. But we have a few that were actually
useful enough to include with the main project. So they're located here. These do
things like ensuring that memory addresses are valid, detecting race conditions and
then an interactive debugger.
We have translators from PTX to LLVM and PTX to AMD's IL. And then we have a set of program transformations and optimizations which do things like register allocation, splitting basic blocks at barriers, and a structural transformation for converting unstructured to structured control flow. And finally I just want to point out there's a link to Doxygen. So this is automatically generated code documentation. This set of presentations and the Doxygen output are basically the bulk of our documentation. We have some additional Wiki pages that try to help solve common problems.
For instance, some versions of Boost actually require manually patching. So we try to
record that. Maybe at some point we'll write an actual manual or a book. But this is it
so far.
To build Ocelot, just obtain the source code. We have a script that builds with Scons, so it's pretty straightforward. You could also just type scons and it should compile libocelot.so and OcelotConfig. And then if you pass the full test flag, it will compile a number of white-box unit tests. These evaluate Ocelot's ability to implement and run a set of CUDA programs that we wrote. It also runs unit tests on the emulator and the translator. So these are sort of like the
minimal set of applications that are just within Ocelot to make sure that it works.
We also distribute applications that we took from the CUDA software development kit. We sort of added these as a separate set of applications also distributed through our Google Code site. So if you want to run Ocelot with these other applications, you don't really have to do very much. There are Scons build scripts for all of those as well.
The installation directories are just the usual places: /usr/local/include, /usr/local/lib, and /usr/local/bin. And these directions are also available on the Wiki.
So when Ocelot -- or when a CUDA application starts up, the first CUDA call
instantiates Ocelot internally. And Ocelot looks in this application startup directory for
this JSON document called configure.ocelot. This is sort of a snippet from it. And this
configures both Ocelot's devices as well as any trace generators that might be present.
So if you look in executive's devices, this is just sort of a list of the devices that you actually want the application to see as CUDA devices. So we define NVIDIA, emulated, LLVM, and AMD.
Trace contains the trace generators that we've implemented. I'll go into greater detail later in this talk. But these are basically the correctness tools that I was referring to. And then there are some additional options -- certain devices support certain optimizations. For instance, the LLVM multicore back end will actually let you control the number of worker threads. And so those are also in executive. So basically this is how you control the startup configuration of Ocelot.
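A minimal configure.ocelot along those lines might look roughly like the following; the exact keys change between Ocelot versions, so treat this as a sketch rather than a reference:

```json
{
    "executive": {
        "devices": [ "nvidia", "emulated", "llvm", "amd" ],
        "workerThreadLimit": 4
    },
    "trace": {
        "memoryChecker": { "enabled": true },
        "raceDetector": { "enabled": false },
        "debugger": { "enabled": false, "alwaysAttach": false }
    }
}
```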
And we've implemented API functions for many of these as well. So the goal is that you basically never have to modify a CUDA application to be able to configure Ocelot. But there's an API for changing these values at runtime. So you can add additional worker threads, for instance.
So to build an application using Ocelot, you basically have to do very little. The only changes are that you still run NVCC on the source file and you link against libocelot instead of the CUDA runtime library. And so that sort of speaks to the notion that we're trying to be as transparent
as possible by reimplementing all of CUDA.
And then some additional libraries might be necessary if you need to compile with LLVM. And so we wrote this Ocelot config application -- if you're familiar with the LLVM project, it's like llvm-config. Basically this program has hard-coded paths to each of the important directories, like the library directory and the include directory. And you can use it to sort of generate additional inputs for the command string for a compiler.
And then the last point, libocelot.so replaces libcudart.so. And so that's it for the
overview. Next I'm going to move on to our implementation of the device interface and
the CUDA runtime API.
And so once again, the host application interacts with the accelerator by just the
standard C functions. Some examples are cudaMalloc, which allocates memory on the device and returns a pointer, and cudaMemcpy, which takes a pointer in system memory as well as a pointer on the device and copies data to or from that allocation. Then there's the syntax for launching a kernel, and it's not strictly C++. Some additional parameters are the grid dimensions and block dimensions, which specify the number of blocks within the kernel grid.
And then the CTA dimensions are the number of threads within the block. So basically
multiplying these two together gives the total number of threads. And then finally when
the application -- or when the compute kernel is finished, another cudaMemcpy copies the data in the GPU's memory back to the host.
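Putting those calls together, a simplified host-side sequence (error checking omitted, and the kernel itself assumed to be defined elsewhere) looks something like this:

```cpp
// Simplified host-side CUDA sequence: allocate, copy in, launch, copy out.
// The scale kernel is assumed to be defined elsewhere; error checking omitted.
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* data, float factor, int n);

void runScale(std::vector<float>& host, float factor) {
    int n = static_cast<int>(host.size());
    float* device = 0;
    cudaMalloc(&device, n * sizeof(float));                    // allocate on the device
    cudaMemcpy(device, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                        // copy host -> device

    dim3 grid((n + 255) / 256);      // number of blocks (CTAs) in the kernel grid
    dim3 block(256);                 // number of threads per block (CTA)
    scale<<<grid, block>>>(device, factor, n);                 // launch the kernel

    cudaMemcpy(host.data(), device, n * sizeof(float),
               cudaMemcpyDeviceToHost);                        // copy results back
    cudaFree(device);
}
```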
And so NVCC, NVIDIA's CUDA compiler, does source-to-source translation and replaces this with a set of additional function calls which set up the kernel and copy the parameter values into a block of memory which ultimately gets copied onto the device.
But the point is, these are CUDA runtime API functions. And this is a CUDA kernel.
And by the time Ocelot is actually running, this CUDA kernel has been compiled into the
PTX virtual ISA. And so for Ocelot to be able to run the CUDA application, it has to
implement each of these API function calls.
And so we tried to make this as modular as possible. So we have this abstract device interface. Our implementation of CUDA maps these CUDA runtime API calls onto this set of functions. And the CUDA runtime API is a C API. It's very verbose. There are probably 120 or 140
functions in it, depending on which version you're looking at.
And we tried to avoid as much redundancy as possible to simplify the interface as much
as possible. And so the goal here is to be able to support multiple devices without
going to the trouble of reimplementing CUDA for all of them. So instead of
implementing CUDA for each new architecture that we wanted to add to Ocelot, we just
implemented this device interface.
And then, conversely, we wanted to be able to add multiple API front ends. And so we
actually have a student working on OpenCL support. And to do that, they just have to
implement the OpenCL functions, which are sort of similar to the CUDA runtime API functions, in terms of the device interface. And then all of the existing devices actually work. And so we have base classes for the device in executive and then our CUDA runtime implementation in CUDA.
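To give a sense of the shape of that device interface, here is a heavily simplified sketch; Ocelot's real device class has many more methods for textures, streams, and modules, and the names here are illustrative only:

```cpp
// Heavily simplified sketch of an abstract device interface in the spirit of
// Ocelot's device base class; the real interface is much larger and the names
// here are illustrative only.
#include <string>
#include <cstddef>

class Device {
public:
    virtual ~Device() {}

    virtual void* allocate(size_t bytes) = 0;                            // device memory
    virtual void free(void* pointer) = 0;
    virtual void copyToDevice(void* dst, const void* src, size_t bytes) = 0;
    virtual void copyToHost(void* dst, const void* src, size_t bytes) = 0;

    // Register a PTX module and launch one of its kernels over a grid of CTAs.
    virtual void load(const std::string& moduleName, const std::string& ptx) = 0;
    virtual void launch(const std::string& moduleName,
                        const std::string& kernelName,
                        const int gridDim[3], const int blockDim[3],
                        const void* parameters, size_t parameterSize) = 0;
};

// Each back end (emulator, NVIDIA GPU, multicore LLVM, AMD, remote) derives
// from the base class and implements these methods for its own target.
```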
And we also have sort of a virtual interface that this CUDA runtime implements. So if
you wanted to add instrumentation just to the API, you could implement that, chain it to the CUDA runtime, and just record any important data within the API calls.
And then we've also extended the API with some additional functions, like one for registering a PTX module. Currently the CUDA runtime doesn't allow you to provide your own PTX; you basically always have to go through CUDA's compiler unless you use the lower-level driver API. And so this is sort of a simplification or a way around that.
Some additional APIs for registering and removing analysis tools. And then for
controlling the state of Ocelot. So here's sort of a visualization. So the CUDA runtime API and the Ocelot API sit on top of the device interface. And then each of these devices
implements that interface.
And I guess one additional point we're stressing here is each of these devices sort of
behaves as if it's a single Ocelot device. And you can always program it using CUDA.
And so you get kernel portability because suddenly this CUDA kernel or OpenCL kernel
or PTX kernel can now execute on each of these devices. And the only difference
might be the amount of performance that you get or in the case of the emulator you get
some additional callbacks or some additional hooks to monitor the execution of the
CUDA kernel.
And I also included this. This is the remote device interface that I mentioned. It's
basically a custom RPC layer implemented using this Ocelot device interface. And
since Ocelot is both the compiler, meaning you could modify the program, and also the execution manager, it has access to all of the memory allocations that are
present. And so it's actually able to reproduce state from one device on another device.
Actually this is kind of a useful feature if you're doing operating systems research.
It's also kind of useful if you are debugging an application. You might want -- you might
want to run the application for, you know, minutes or hours on a high-speed device like
an NVIDIA GPU until you get to a certain kernel that is only executed late in the
program, but you want very detailed performance results from -- or very detailed
application behaviors. You want to monitor the application behavior at that point. So
you could actually switch devices just for the execution of that kernel, run it on the emulator to capture an instruction or memory trace, and then resume execution on
the NVIDIA GPU.
The CUDA runtime API is implemented using this sort of base class that defines all the API functions, and then one particular implementation, the CUDA runtime. There are some additional data structures that maintain host-thread-local state, like parameter values. These are sort of implied by NVIDIA's design decisions when they implemented the CUDA runtime. Not much more to say about that.
And so the class here is CudaRuntime.cpp. It's implemented in Ocelot's CUDA implementation directory. If you wanted to add your own API front end, like OpenCL, you'd sort
of follow the same strategy here of mapping OpenCL API functions on to the device
interface functions.
There are some additional undocumented functions like the CUDA register module call. So if you're doing any CUDA hacking, you might notice that a CUDA application contains PTX, but it also contains a fat binary containing sort of a binary representation of the kernel for several different types of GPU architectures. People have been able to sort of reverse engineer that data structure. And so we try to parse as much of that as possible, so in some sense the source code here is documenting NVIDIA's undocumented implementation choices.
And it's also worth noting that the CUDA runtime implementation as it stands now from
NVIDIA prevents a host thread from changing devices. So if you want to -- if you have a
multi-GPU system and you want to use multiple GPUs, you basically have to launch a
new thread and kill the old one if you want to change devices at runtime.
And also you can't pass pointers from one device's memory -- or a pointer for, say, GPU 1 -- to another thread running on GPU 2. And the reason is sort of a quirk or an undesirable consequence of their choice of implementation of the CUDA runtime. And so
since Ocelot reimplements it, we try to get around that.
And so now you can -- the short story is Ocelot's runtime implementation allows you to change devices. So that's kind of useful.
Okay. So the device interface. As I said before, this is -- sits underneath the CUDA
runtime implementation. There's some abstract base class which defines a set of
methods for allocating memory, binding textures, configuring textures, registering PTX
modules, getting kernels from those, executing those kernels. And then there are
implementations for each of the back ends that we've been interested in: the emulator, NVIDIA GPU, multicore, ATI, and then some other sort of research vehicles like the remote device implementation.
And so the point of the device interface is just to simplify each of the API calls. So we've reduced maybe hundreds of CUDA runtime API functions into around 57 or so. And this also includes some additional data structures. So if you wanted to iterate over all the allocations in a GPU's global address space, the NVIDIA GPU device contains a vector of those, and it just uses a C++ iterator to do that.
And once again, the goal is to make it easy to add different devices and different APIs.
>>: So how complete is your [inaudible].
>> Andrew Kerr: Our implementation supports most of the features from CUDA 3.2.
NVIDIA released CUDA 4.0 and then 4.1; they added many new, let's say, addressing modes. They added surfaces as well as textures. And so we've tried to implement
most of the features that the applications that we have encountered actually use.
So that means textures but not surfaces. Some things have been added to support
OpenCL. And we've only been supporting those to the extent that we actually have to,
to get our applications to run and to satisfy the goals of getting OpenCL -- having
OpenCL support.
There are also features within CUDA that programs don't actually use. I was telling
Aaron about this. There are many PTX instructions added by, say, people within NVIDIA to accelerate their own, like, personal kernels, things like doing population counts to see how many threads at a certain point have evaluated a predicate variable as true or false, like doing a reduction across those, or allowing basically hardware performance counters to be copied into registers and used.
And so you can write PTX that expresses these things. But CUDA programs don't actually use them, because there's no, like, high-level language support for them. And
we've tried to support those when it makes sense to, but there's a very large spec, and
it's kind of a research project.
So the short story is many CUDA programs work. It's very easy to write a CUDA
program that exercises a control path that we don't actually support. But you'd probably be
the only person to try to do that, if that makes sense.
>>: [inaudible] CUDA 3.2 [inaudible] it should run on Ocelot.
>> Andrew Kerr: It should run on Ocelot, yes. Some additional things have sort of
broken in the last release. NVIDIA added sort of a back door between their driver and
the runtime. And you can only -- it's not something that any third party would actually be
able to use because they don't have any documentation about it. But since NVIDIA has
sort of taken over some, like, third party libraries and are distributing them as their own,
things like CUDA FFT and CUDA BLAS and the CUDA sparse primitives, some of them
actually make use of this back door.
And we haven't really tried to -- or haven't been able to successfully reverse engineer it.
So as of 3.2, we could run CUDA FFT programs through Ocelot and everything was
fine. As of CUDA 4.1, the FFT library uses this back door, and it's just broken. So that's one of the challenges. But presumably it's something that could be worked around given enough time.
And so now we're going to take a dive into the PTX emulator. This part we're going to
go through basically all of the devices. The PTX emulator is first. And the goal here is to execute PTX programs directly on an abstract virtual machine implied by the PTX specification and then do very detailed trace analysis.
And so again we have the thread hierarchy, and here's the abstract virtual machine. We tried to implement this in software as faithfully as possible. So we have an arbitrary-width SIMD processor, an arbitrarily sized register file, local memory, shared memory, and texture sampling through software support.
And so the PTX emulator just executes each instruction over all the threads and then
moves on. And with this, we were able to do things like generate actual traces of PTX instructions, monitor a kernel's interaction with the memory hierarchy, characterize the application, and then use it to drive some additional timing simulators.
There's another project at Georgia Tech that's developing MACSIM, which is sort of a heterogeneous trace-driven cycle-level simulator. So maybe one of the first applications of Ocelot was to provide instruction and memory address traces to it.
So we've sort of exploited some of the undefined properties of the PTX execution model,
one of which is different CTAs or different cooperative thread arrays within CUDA
kernels can be executed out of order. And they don't actually have to be executed
concurrently. So, for example, if this is one CTA, and this is another CTA, a thread here
doesn't have to be live at the same time as the thread over here is. So you could
actually execute those CTAs serially. However, within a CTA the threads do have to be live at the same time because they have to be able to execute barriers. And so what we do is
serialize the execution of one instruction over each of the threads that are present in the
CTA.
And when threads take arbitrary -- since it's a SIMD processor, if threads are taking the
same control path, the same instruction is issued for each of the threads before moving
on to the next. But in the case of control divergence, we sort of predicate off the
threads that haven't taken the same control path. And that predicated value is sort of
maintained within the emulator. The order in which you reconverge threads that have
diverged sort of affects the overall SIMD utilization of the processor. We're capable of
measuring that.
We've also done some research that evaluates different techniques for doing the thread
reconvergence. The emulator is a good sort of vehicle for seeing what the impact would
be on average SIMD utilization. And so we have this abstraction for the thread reconvergence mechanism that makes it very easy to develop and add your own
reconvergence policy.
PTX defines a number of special functions like texture sampling. That's a big one.
Since CUDA was sort of developed targeting GPUs, they wanted to make a lot of the
GPU's special-purpose hardware available to the programming model. So texture
sampling basically is a very large configuration space. You have one, two, or
three-dimensional textures as of CUDA 4.1, cube mapping as well. Different texture
filtering modes, different address clamping modes, different data types, different number
of channels, et cetera.
And so we actually developed a software texture sampling library that tries to implement
all of these as faithfully as possible to what the GPU actually produces. So with the
exception of rounding errors, it's very close. And the emulator is also sort of interesting as a way of adding new instructions to a GPU and testing them there, seeing how they might affect the application. For example, you could insert explicit reconvergence points within the application that sort of hint to the hardware about which control path to take first. The emulator's a very convenient way of doing that kind of research.
It's implemented in the following sort of header files and corresponding C++ files. So
there's the actual device interface. There's a kernel class which takes an arbitrary PTX kernel with its control flow graph and dataflow and then does register allocation, forms sort of a dense packing of the instructions, and replaces labels with PC offsets.
There's a call stack for actually supporting function calling within CUDA. Cooperative
thread array is sort of the heart of the emulator and has implementations for each of the
PTX instructions. And then texture operations implements each of the texture sampling
methods. And then finally, trace defines the trace generator and trace event classes. And
this is sort of the preferred way of getting instrumentation results out of the emulator.
So it allows you to define a set of trace generators which are attached to the emulator.
And then every time the emulator executes an instruction, it sends -- it constructs a
trace event object, sends it to each of the attached trace generators, and then those are
sort of user defined. They can take some kind of action, like record a trace to a file or
update some performance counters. Whatever you'd like.
Then after the instruction is committed, another call goes back to the trace generator.
There's maybe a better illustration. And so the goal here is to provide sort of like the
greatest level of detail to an analysis tool for observing execution of a CUDA kernel.
And so within tracegenerator.h, we have this class that has four methods, initialize,
event, post event, finish. It's worth pointing out that the trace generator is applied to
each of the back ends. So initialize and finish are called before and after the kernel has
actually executed. So at this point, parameter values are known. The state of the
device is made available to the trace generator so you could look at all the memory
allocations if you're interested.
You could see what parameters are being passed if you wanted to do like profiling and
see if specialization based on parameter value made sense. You can see how many
threads are going to be launched for the kernel. And then finish is called when the kernel
is finished executing. So these are called for each of the device back ends. So if you
wanted to do timing for NVIDIA -- or kernels running on the NVIDIA GPU, trace
generator would be a convenient way of doing that.
But on the emulator, it also will call event and post event. And these receive a trace
event object, which I'll go into detail on the next slide. These are sort of how you would
actually inspect the state of the device on every instruction. So here we have three PTX
instructions executing in the CTA. And so event and post event are called before and
after each one. And the event object contains roughly the same -- same information
before and after, except that the post event method can look in the GPU's memory
allocations and see which values are overwritten.
And so the trace -- the trace event object contains a device pointer, kernel grid
dimensions, the program counter, PTX instruction instance, which is the actual
instruction being executed, bit vectors indicating which threads are actually active, a
vector indicating which addresses have been generated by a load or store instruction.
And then there are PCs for the branch and branch targets in the case of a control flow instruction.
And so it's worth pointing out that imagine this were a load instruction. The event object
would see which addresses were about to be loaded before it actually performs the
load.
So if it was an invalid address, you -- the trace generator should presumably see that
before it actually happens. Here's the actual class. Here's the PTX instruction. And
I have a whole set of slides on this later in the tutorial that sort of explain how PTX
is represented. But the short story is the trace generator can see all the analysis that
the rest of Ocelot can.
There's the bit vector indicating which threads are active. These sort of correspond to which lanes of the SIMD processor are predicated on for execution. And then there's the vector of memory addresses.
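Putting those pieces together, a minimal trace generator that just counts dynamic instructions might look like the sketch below; it is based on the four-method interface described above, and the exact signatures may differ slightly between Ocelot releases:

```cpp
// Sketch of a trace generator that counts dynamic instructions. Based on the
// initialize / event / postEvent / finish interface described above; exact
// signatures may differ between Ocelot releases.
#include <ocelot/trace/interface/TraceGenerator.h>
#include <iostream>

class InstructionCounter : public trace::TraceGenerator {
public:
    long long count;

    InstructionCounter() : count(0) {}

    void initialize(const executive::ExecutableKernel& kernel) {
        count = 0;                               // called before the kernel launches
    }
    void event(const trace::TraceEvent& event) {
        ++count;                                 // called before each instruction commits
    }
    void finish() {
        std::cout << "dynamic instructions: " << count << std::endl;
    }
};
```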
So here's sort of a very simple example of race detection. So here we have a CUDA kernel that declares an array in shared memory, which means that all threads can write to it. But it's possible that some thread could read from shared memory expecting a value that another thread would have written, but because there's no synchronization barrier, that other thread hasn't actually executed that code yet. Because there could be, say, a hundred or 128 threads running on one of the multiprocessors of the GPU, but it's only a 32-wide SIMD machine, so it has to do some temporal interleaving as well.
So basically race conditions are possible. So in this case we've implemented a trace generator that will initialize an array of thread IDs, one for every byte
of shared memory. And then whenever it intercepts a store instruction, it will update
that table with the ID of the thread that wrote to that byte. And then when it sees a
synchronization instruction it clears that.
When it sees a load instruction -- a PTX load instruction corresponding to dereferencing the shared memory array and trying to load its value into a register -- the load will look at that annotation table, see that some other thread last wrote to it and there was no barrier that cleared it, so it must be a race condition. And so when executed, Ocelot will actually -- or can be configured to -- throw an exception. The exception object identifies the name of the kernel -- there's a mangled name -- the program counter, the thread ID, the CTA ID. The actual PTX instruction that's faulting tells you which address was being accessed and which thread last wrote to it. And it also provides the file name and line number. So when PTX is compiled by NVCC, it inserts some additional directives that sort of store the file name and the line number, and Ocelot preserves that. And there's generally a way to map the actual PTX instruction back to the original CUDA source file.
So this is what the user wrote, and this is what they see when they run it. And so they
find that line 9 contains the race condition. So it's just kind of a useful tool for
debugging applications.
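The bookkeeping behind that check is fairly simple; here is a sketch of the idea (not Ocelot's actual implementation): remember which thread last wrote each byte of shared memory, clear the table at barriers, and flag a load whose byte was last written by a different thread.

```cpp
// Sketch of the race-detection bookkeeping described above; not Ocelot's
// actual implementation.
#include <stdexcept>
#include <string>
#include <vector>

class SharedMemoryRaceChecker {
    std::vector<int> lastWriter;    // last writing thread for each byte, -1 if none
public:
    explicit SharedMemoryRaceChecker(size_t sharedBytes)
        : lastWriter(sharedBytes, -1) {}

    void onStore(int thread, size_t address, size_t bytes) {
        for (size_t i = 0; i < bytes; ++i) lastWriter[address + i] = thread;
    }
    void onBarrier() {
        lastWriter.assign(lastWriter.size(), -1);   // a barrier makes prior writes safe
    }
    void onLoad(int thread, size_t address, size_t bytes) {
        for (size_t i = 0; i < bytes; ++i) {
            int writer = lastWriter[address + i];
            if (writer != -1 && writer != thread)
                throw std::runtime_error(
                    "race condition: byte " + std::to_string(address + i) +
                    " last written by thread " + std::to_string(writer));
        }
    }
};
```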
One problem that we've discovered, though, is a lot of programs have intentional race
conditions. And the reason behind that is this -- the extra barrier instruction here would
take some additional cycles. It might be within the interloop of the kernel so the cycles
actually matter. And the programmer knows that only the threads that are reading and
writing to the same setup memory are likely to be packed into a WARP, which is threads
that are actually issued to the SIMD processor at the same time. So the programmer
knows it's locked -- it's executing a locked step. And so the race condition will never
actually happen or has a very defined result on today's hardware. And so as sort of a
performance optimization they remove the barrier. And it causes problems for us.
But in our opinion, those programs are actually incorrect. They just happen to work. So
occasionally you see these barriers even though the program would be correct without them, so they get rid of them. Since the emulator tries to preserve the SIMD execution semantics, those types of race conditions never happen, because threads are always executed in lock step. So the same instruction is broadcast over all the threads before moving
on. But it's still nice to be able to detect it. And there are many other programs that just
have real race conditions that you don't want. And this can be useful for finding them.
As I said before, it's also useful for catching invalid memory accesses. So here's a very simple program that tries to store values to global memory. The programmer is passing an invalid value. They also have a real allocation that they meant to pass, but unfortunately they passed the invalid pointer. And so when you run it, this address is being dereferenced, and it doesn't correspond to anything. So this trace generator just uses the ability to iterate over the allocation table. And on store instructions, it sees that this global address doesn't correspond to a real allocation. It lists the actual allocations that might be nearby, which might be the right size. And then, again, it tells you the file name and the line number. So this is kind of a useful feature. In the years since we first developed this, NVIDIA's debugger
has gotten a little bit better, so you can run a real debugger on the device. But you
need their latest hardware to do that, their latest tool chain to do that. You might not
want to upgrade. And this is also sort of composable. So it throws an exception.
exception.
And if you wanted your system to be able to respond to that exception without just
crashing the program or without running the program in a debugger, this is sort of a
useful way of sort of dealing with that. Yes?
>>: When you use the [inaudible] did they do that in software or hardware [inaudible].
>> Andrew Kerr: They have some hardware hooks. They have some back doors into
the driver, presumably that have hardware support. I don't know how they detect race
conditions without doing something heavy weight like this through software.
>>: Do they actually tell you like there's [inaudible] and this thread? They give you a lot
of information [inaudible].
>> Andrew Kerr: I would imagine that they do that with some kind of instrumentation or
like some kind of software instrumentation. So, I mean, that's the same information that
this provides. It's just -- this is running an emulator because we don't have the ability to.
I guess you could implement an instrumentation tool on PTX that does this kind of
tracking as well, and you just pay a performance penalty. I assume they do something
to that effect.
It's worth noting --
>>: [inaudible].
>> Andrew Kerr: Okay. Here's -- this is the interactive debugger. This is sort of
another trace generator that we implemented. And the idea is basically it takes similar
concepts that you might find in a traditional debugger and implements them in the
emulator. And so here the programmer is basically starting the application, setting a
watch point, which is just a break point that's triggered when a thread accesses a
particular location. So it is passing an address, data type, number of elements. And
then when the program runs, it sort of breaks on that store instruction that's trying to
access it. And it presents some useful information. Including this here. Just sort of as
sort of an example of what's possible to do with this trace generator framework.
Because none of this requires modifying Ocelot. It just requires implementing
that trace generator class and implementing each of the methods.
And we've included these within the main Ocelot project as opposed to an additional
library, just because we find them pretty useful. For instance, we have the memory
checker and the race detector launch the debugger automatically if configured. So you
can single-step the program.
So beyond correctness, it's also useful to understand the performance of the
application. And so since we're able to inspect the instruction trace, determine SIMD
utilization, count instructions and also see which addresses are being accessed, we've
implemented this performance bound trace generator, which computes all of these
statistics for each basic block. For instance, the amount of memory demand, how much
data was accessed from shared memory, whether there are any bank conflicts on the various ports of shared memory, the number of dynamic instructions, the number of
flops. It stores this for every basic block and then provides this nice visualization. This
is basically directly out of the tool.
And then it also provides sort of aggregate over the kernel. So if you wanted to know if
your program was memory bound or compute bound, this might give you sort of an
indication of what the theoretical limits might be. Tells you like number of flops per word
transferred off chip. It also lets you know where the hottest paths are. So this color
intensity is sort of logarithmic with respect to dynamic instructions.
And it also, again, includes the file name and the line number. So if you visualize this,
you actually know that there's a loop here near scannativekernel.CU, line 51 or 61. And
you can very quickly go back to the program and see the CUDA source -- CUDA lines of
source that correspond to each of these basic blocks. So we find that pretty useful.
This is just another analysis that we did to see whether there were real producer/consumer relationships in a lot of CUDA applications.
So if you have an FFT, basically some thread is producing a value, which ultimately is
loaded by another thread.
We found that a lot of applications were using a scratch pad and a synchronization but
were never actually producing values consumed by other threads. They were just using
the scratch pad to reorder their accesses to global memory, kind of like a cache. So we
built this tool to sort of filter out those types of applications from applications that really
did need the same threads to be running on the same SM, because they're actually
sharing data quite frequently.
Here's sort of a list of each of the trace generators that we've implemented. The memory checker, race detector, and interactive debugger are part of it. There's the performance bound generator, and some additional trace generators that store values of
parameters and kernel launch dimensions in a database.
And so to use a trace generator, you implement the trace generator interface. Basically
these four methods. Adding trace generators transparently can be done through Ocelot. Since Ocelot is a library, you can have a global constructor somewhere within a linked library that will call the Ocelot add trace generator method. This sort of
makes the trace generator live for the entire run of the application.
Alternatively if you only want to instrument a specific kernel, you could modify the
program itself to call this API function. And there are several different modes of using it. One is to just store traces of each instruction as it's executed, with all the memory addresses, and then analyze those later offline. Or do some kind of online analysis. Or maybe even couple it to a visualization tool if you're really interested.
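As a sketch of how that attachment works, a generator can be registered from a global constructor in a linked library; this assumes the Ocelot add-trace-generator API mentioned above (check the ocelot/api headers for the exact signature) and that the base class provides default implementations of the other callbacks:

```cpp
// Sketch: attach a trace generator for the lifetime of the application by
// registering it from a global constructor. Assumes the Ocelot API function
// for adding trace generators; check the ocelot/api headers for the exact
// signature in your release.
#include <ocelot/api/interface/ocelot.h>
#include <ocelot/trace/interface/TraceGenerator.h>
#include <iostream>

class KernelLogger : public trace::TraceGenerator {
public:
    void initialize(const executive::ExecutableKernel& kernel) {
        std::cout << "launching a kernel" << std::endl;   // runs before each launch
    }
};

static KernelLogger logger;

// The global constructor runs before main(), so the generator is attached
// before the application launches any kernels.
struct AttachLogger {
    AttachLogger() { ocelot::addTraceGenerator(logger, true /*persistent*/); }
} attachLogger;
```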
Most of our trace generators are in a separate sort of subdirectory of the Ocelot project.
And you'd have to explicitly link those with CUDA applications. But they compile as libocelotTrace. And so to execute a program using the trace generators, you modify the configure.ocelot document. We have sort of a configuration object for each of the trace generators that's implemented, so for most of them you just need to enable them.
Some of the additional trace generators have parameters; for example, the debugger allows you to always break into a kernel before executing or to just choose a specific kernel. Set the device to the emulator if you want sort of the detailed instruction-level trace analysis. If you just want a timer or something else where you only need callbacks at the beginning and end of a kernel, you can attach the trace generator and use it with the additional devices like the NVIDIA device, the multicore device, and the AMD device.
And so here's really, really simple source code for a trace generator that just measures load imbalance. The event method is really the only thing that's used: it sets up a set of counters for each thread, and then when an instruction triggers event, it iterates over all the possible threads within the CTA. It looks at the active data member, which is a bit vector indicating whether the thread was actually predicated on for that instruction, and increments that counter. It includes the output that would be stored to a file. But when I ran this on the Mandelbrot application, you just sort of see some variance, because it's the Mandelbrot set and not every thread has the same workload. But it's just a very simple example.
So that's it for the emulator. It's worth pointing out that we've used this for a number of different studies. We used it for architecture research to examine different thread reconvergence mechanisms. Other people have used it to drive their own simulators. MACSIM at Georgia Tech is one example. There's another example from the University of Texas: Mark Gebhart did a study and basically used Ocelot to drive it. Okay.
So now I'll discuss some of the additional back ends. So that was the emulator. We
have a multicore CPU back end, an NVIDIA GPU back end, and AMD. And so the goal for the multicore CPU back end was to sort of achieve portability by executing CUDA kernels
as efficiently as possible on multicore CPUs. And our implementation uses LLVM. So
we did instruction set translation from PTX to LLVM. And then execution model
translation that takes this very large thread hierarchy with lots of expressed parallelism
and maps it on to a smaller number of like worker threads.
And to do this, we use basically a compiler transformation that serializes the execution
of threads. We have an execution manager running in each like worker thread that
selects different threads to execute for different parts of the kernel. And we use LLVM
as sort of the virtual machine that was originally designed to do Just-in-time compilation to x86. And so the goal again is to map this thread hierarchy on to the different cores that might be present in the processor of interest.
So in most of our evaluations we were just looking at x86 multicore processors. But
we're also using this to try to support ARM. The current status is that building LLVM as
a cross-compiler for ARM is kind of an iffy proposition. But presumably it's possible.
So in all of these cases we have a large number of threads that basically need to be serialized. The point here is each of these CUDA kernels is sort of written with the assumption that threads are lightweight and can be created and destroyed very quickly. For most applications, kernel-level threads aren't that lightweight, and so we use a compiler transform to iterate over different regions of the kernel, basically between barriers, and then on a barrier do sort of a lightweight context switch. And so we use the compiler and sort of inserted instructions to implement the context switch. So each of these thread blocks corresponds to a CTA, each CTA maps on to one host thread, and then we serialize the threads within the CTA. And each worker thread is running an instance of an execution manager, which selects different threads to execute. So if most threads are waiting on a barrier, it sort of has the option of doing some kind of intelligent scheduling to try to encourage all threads to be in the same part of the program at the same time.
There are certain benefits to doing that. Namely, CUDA applications are written such that different threads are executing in lockstep. And so if they access contiguous elements of memory, those accesses will sort of be coalesced into maybe a few off-chip transactions, sort of the equivalent of transferring whole cache lines at a time.
In our case, if we serialized threads there would be maybe tens or hundreds of cycles between a load in the first thread and a load in the second thread, even though they might hit the same cache line. And so if you can actually execute those threads in an interleaved fashion, or with very frequent switching between one thread and the other, the number of cycles between those accesses can be reduced and you can presumably achieve higher hit rates in the cache.
The multicore back end is implemented as follows: we have a device implementation, we have a number of classes that interact with LLVM to manage the translation and to specialize the translation, and then, to implement several of the program transformations, we need to support barriers.
The LLVM cooperative thread array is sort of the execution manager that chooses threads to execute. We've implemented all of the LLVM IR in sort of our own C++ classes, as opposed to including headers from the LLVM project. The reason for that was basically that LLVM is very fast moving; they don't change their instruction set very frequently, but they do change their implementation of their IR. And we didn't want to have to be as active as they are just to keep up. And so for our interaction with LLVM, Ocelot implements its own internal representation, emits that as a string, reparses that using LLVM's parser, and then uses LLVM's high-level functions to do the just-in-time compilation.
We have some performance results that show that almost all the time spent in LLVM is actually in the code generator, not the parser and the emitter, so it's really not much of a performance overhead. The translator itself is implemented in the translator module, the PTX-to-LLVM translator. It's about 10,000 lines of code. It's kind of interesting.
And then we have some additional transformations. We have partitioning which tries to
sort of take a large kernel, partition it into certain regions and then translate, compile
these and execute them independently.
One observation is that CUDA did not have the ability to call functions until quite recently, and they don't have a linker. So in many CUDA applications the kernels themselves are very large. And in some libraries, like the FFT library, there are many specializations of kernels, particularly if you have a CUDA template library that's expanded in many different ways. So there's a lot of dead code. We found that partitioning the kernel into these regions and compiling them separately allowed us to avoid compiling a lot of dead code. The remove-barriers pass basically partitions the kernel at barrier instructions and treats those as context switches.
So here's sort of an example. At the beginning we have a CUDA kernel; it's compiled to PTX -- this is the PTX. And then we do just a fairly standard ISA-level translation. Most PTX instructions correspond to maybe a very small number of LLVM instructions; the arithmetic instructions basically correspond to one. Since both LLVM and PTX have load-store architectures, there aren't too many addressing modes that we have to support.
The special instructions and special registers are handled either by function calls, in the case of the transcendental PTX instructions like cosine and sine, or, for the special registers like thread ID, by loading out of this context object that's passed to the translated function as its one parameter.
This contains pointers to local memory and shared memory, and then actually has the values of what the thread ID is.
And so this PTX kernel becomes sort of a set of LLVM functions, and calling that LLVM
function is equivalent to executing one thread over a particular region. So maybe some
additional details. LLVM depends on SSA form, so Ocelot will first transform its own internal representation into SSA form and insert phi nodes.
PTX has typed registers and typed instructions, but its typing isn't always very strict. So a register could be declared as one type but loaded or stored as another: a signed integer can become an unsigned integer. LLVM does strict typing, so we have some additional conversions for that. And maybe one final note is that every PTX instruction takes an optional predicate register. That's common in branch instructions, because that's how PTX expresses conditional branching. But it's a little bit more awkward if you have a load or store or some other kind of instruction that LLVM does not actually support predication for.
So LLVM has a conditional select instruction, but none of the other instructions can be predicated. And so we have a transformation that reverses if-conversion and replaces it with control flow.
And so here's just sort of an example of how the translate-add instruction is implemented. There's a function for translating a PTX operand; it becomes an LLVM value. There's an LLVM instruction called add. It basically takes the opcode and then the two operands, and then it's added to the LLVM kernel. So there's roughly 10,000 lines of code to implement this translator. But the result is: the input is a PTX kernel, the output is an LLVM function that is equivalent. And it should be invertible. There aren't really any great or high-quality LLVM-to-PTX code generators that are available to the public. Presumably they're being developed in several places.
But the observation here is that Ocelot's translation should be sort of an invertible process. If you wanted to use LLVM as sort of an optimizer, you could translate PTX to LLVM, optimize it, do some additional transformations, and then use their code generator.
So here's the thread serialization method. On the left side we have just this one large basic block in PTX with a barrier right here. And for the barrier semantics to be honored, all threads have to execute up to this point before any thread proceeds. So the way Ocelot's multicore back end handles this is it treats this as a context switch. The execution manager will loop over all threads and, for each thread, call it. At the beginning, a scheduler block is inserted that basically receives an ID of which barrier the thread is waiting to enter. When the kernel is first called, threads need to enter sort of the first block, so they execute it. Then, at the barrier point, the barrier has sort of been replaced by these store-local instructions which write out all the live values, and then exit. Control then returns to the execution manager, which will choose the next thread and continue.
And it sort of repeats this process until all threads have reached the barrier. Then it sort of reschedules the first thread, whose entry point has been updated to point to this basic block corresponding to this point in the program. There are load instructions which load the live values back into registers, and it continues. So essentially this implements a loop over the first part of the program up to the barrier and then a loop over the second.
In the beginning this was a fairly straightforward process: we would only get a context switch at barriers. But we sort of extended this, sort of generalized it, to the concept of subkernels. So as I sort of mentioned before, this is a partitioning of a program, and each time threads exit one partition and enter another, that's sort of an opportunity for the execution manager to schedule more threads. And there are several reasons for wanting to do this. Number one is just to sort of reduce the translation cost.
If this part of the kernel is never executed, then you only have to translate these two.
There's also sort of -- there's some performance benefits to trying to keep all threads
within the same part of the program at the same time. So you can improve instruction
cache hit rate if all threads are only executing in this one loop before they move on to
the next.
You can also achieve some performance improvement if all threads are sort of stepping through the program at around the same rate, [inaudible] the same data, if there's spatial locality within the program. And there are also opportunities to do more sophisticated code generation and specialization. For instance, you could compile a version of this subkernel that assumes that all threads are going to take the same control path and then replace scalar instructions with vector instructions or SIMD instructions. And you can also just implement more sophisticated scheduling policies.
Here's an example of inserting spill instructions into basic blocks at the location of barriers. So here we just instantiate an IR PTX instruction and pass it an opcode for move. The context of this is that this is how Ocelot uses a PTX-to-PTX transformation to do most of the program transformations before it compiles to LLVM. Within PTX we've implemented dataflow analysis, so we have the ability to iterate over the set of live values. And the goal here is to produce store instructions for each live value that will write the value of that register, or that variable, into that thread's local memory.
And so somewhere else in this transformation a local variable has been added -- it's called the remove-barrier-pass stack -- and it's basically just a region of local memory that each thread has where it can store these kinds of things.
The dataflow graph is sort of an overlay on the control flow graph, which is the main internal representation. And so it provides some additional methods to modify the program without disrupting the results of the dataflow analysis, so you don't have to redo the analysis every time you change things.
So this basically just creates a new instruction, sets its data type, and sets some identifiers for the source operand -- it's basically getting a pointer to the local memory declaration -- and for its destination operand, which is the register. I'm sorry, this is the code that issues the load instructions: assuming a barrier has just been encountered, it loads values into registers from local memory, sets the address mode and the data type, and inserts it into the kernel.
And then this iterates over the set of live registers, constructs store instructions, and inserts them. And so this is sort of an example of modifying PTX using Ocelot's IR, and it happens to be in support of the multicore back end.
To use the multicore back end you basically set the device to LLVM. There are some additional parameters to change, like the optimization level. These are read by the multicore back end when it's actually doing the JIT compilation from LLVM to x86, so it adds some additional LLVM optimization passes. The worker thread limit controls the number of worker threads.
There are some additional transformations. You can specify the number of PTX instructions per subkernel, so you can basically choose whatever kind of partitioning you want here. Simplify-CFG basically coalesces basic blocks that are connected by a single edge. And there are some other optimizations that you can invoke.
You can still use the trace generators, however only initialize and finish are actually called. So if you wanted to do your own timing, that's how you would do it.
And so here's just sort of an example of how you might configure it. So the NVIDIA
back end is pretty straightforward. It basically involves reemitting PTX kernels from the
IR and then sending them to the GPU via the CUDA driver API. And so it's not very
heavyweight but it does allow Ocelot to sort of sit as -- sit between the application and
the GPU and make transformations at runtime.
And so some simple applications are just doing register reallocation, potentially modifying the application, or changing the number of threads that might be launched. And then finally instrumentation. We've done some work on instrumentation where Ocelot sort of inserts instrumentation probes into kernels before they're executed and then transparently gets the results back from the instrumentation.
It's implemented in just these two classes, the GPU device and the executable kernel. These are just sort of wrappers around the driver API for managing the NVIDIA GPU, managing its device context, and then issuing PTX to it. And once again, you just use the NVIDIA GPU device. You can still use trace generators; those can profile kernel launches, but you can't really interrupt the execution of a kernel when it's running on the NVIDIA GPU.
And so here's the work on dynamic instrumentation that we did. The goal was to allow Ocelot to transparently insert instrumentation tools into an application, allocate data structures that they might need, fetch the results from the instrumentation tool once the kernel is finished, and then make them available in some useful, composable way.
It's sort of in contrast with NVIDIA's own hardware performance counters. There are limits on how many performance counters can be active, and they sample one SM at a time as opposed to the complete execution of the program. Our goal is just to be as flexible as possible by doing this in software rather than hardware, and just take whatever performance hit you have to take. But if it's on a GPU, it's still probably running quickly.
And then it's largely inspired by PIN. But there are some points we have to make: basically, since the execution model is different, there are sort of constraints on which types of instrumentation you can actually implement.
An example of that might be that the NVIDIA GPU is a SIMD processor executing multiple threads. But if your instrumentation tool has control flow, like if you have a loop, that loop and the control flow that implements it could potentially cause the hardware to sort of split the collection of threads into multiple WARPs, and that itself could distort the performance of the application. And if the programmer is relying on that implicit synchronization, it would actually break the program entirely.
And so this is sort of implemented with those types of constraints in mind. It fits within Ocelot in the sense that the PTX transformations are applied on the fly to all PTX modules that are loaded. There is a set of callbacks that are called just before the application executes the kernel, so the instrumenter can construct its own data structures and analyze the kernel.
So, for instance, it might set up counters for every basic block or for every thread.
So it needs to know both the launch configuration and the program itself. And it has
access to all of the analysis that Ocelot has already performed up to that point.
And one additional point is that Ocelot has the ability to remove the instrumentation. So you can run a program at full speed without instrumenting for a while, then suspend execution and insert an instrumentation tool on a particular part of the program to monitor maybe a specific kernel, or once some specific property has been met -- like the program has converged on some value -- and now you want to start instrumenting. Ocelot allows you to do this at runtime.
That's not really something that we've seen in any of the other instrumentation tools out
there. Most of them tend to be like source-to-source compilers. You just have to
recompile the entire application if you want the instrumentation present.
And we're actually looking at implementing the race detection in software that you mentioned, and also the memory checker -- the memory checker that the emulator was doing -- as PTX instrumentation, inserting it into the program and running it on the GPU.
>>: [inaudible].
>> Andrew Kerr: That's the [inaudible], so it depends on the instrumentation tool. And I'll cover that in greater detail in a second. I just wanted to walk through the workflow. So when we originally started, we implemented the instrumenter as sort of a PTX-to-PTX transformation, which meant interacting with the IR at sort of the instruction level. If you've ever written any transformations for LLVM or Open64, or I guess VisualStudio, you're probably sort of aware that it's kind of cumbersome if you have a large piece of code to implement in the instrumenter.
And so to work around this, we actually implemented sort of a primitive C-to-PTX compiler, using a research compiler that someone else at Georgia Tech developed and building a PTX code generator for it.
So this is the actual implementation of an instrumentation tool that measures memory efficiency. And it uses sort of a set of labels to tell the tool where to insert the PTX that implements this block in the program.
So in this case, it basically tells it to do that on every memory instruction that reads or writes global memory. So it sort of replicates that set of PTX. And we have a few built-in functions that do useful things, like determining whether a particular thread is the least active thread in a WARP, computing a predicate based on that, and then having predicated execution implement the rest of this code.
And so the goal was to have just one thread in a WARP execute the instrumentation tool, but be able to observe the behavior of all the other threads, and not create any thread-divergent conditions that would destroy the execution of the program.
We have a reduction function across a buffer in shared memory. And so in this particular case, the memory efficiency tool basically masks off the low-order bits of each of the addresses that are computed for a load or store instruction and then tries to coalesce those into basically as few cache lines as possible. If you mask the low-order bits, suddenly every thread just has the base address that corresponds to the start of the cache line; it stores those to a buffer in shared memory. And then that one thread iterates over that buffer, reduces it into a set of unique cache lines, and then stores the number of unique cache lines.
And so ideally if all threads are accessing consecutive elements in memory, that
corresponds to one cache line satisfying all the threads.
If the threads are doing a random scatter gather, it might correspond to as many cache
lines as you have threads. And so the memory performance would diminish. And so
this tool just tries to measure that.
And again, we tried to produce abstractions that are actually useful, so we implemented this unique-element reduction. The tool itself just constructs a counter corresponding to every WARP and then stores both the number of times the memory instruction took place and the number of cache lines needed to satisfy it.
And then all of this is sort of wrapped up under the name Lynx, which includes the API for adding callbacks that construct data structures and for registering instrumentation tools so the PTX is always transformed when the application is running. And Ocelot sort of knows to apply these transformations when the program is running.
And so here's some performance results. And here's some sample output.
So over here we basically implemented a basic-block-level dynamic instruction counter. When the kernel is launched, the instrumenter analyzes the program, counts the number of basic blocks, uses the CUDA malloc function to allocate device memory for the data structure that the instrumentation tool is going to use, inserts global variables into the PTX module, and then actually inserts PTX instructions which count the number of instructions per basic block. Every time a block is entered by one or more threads, it increments the counter.
So it's basically a dynamic instruction count on a per-thread basis. And then when the kernel ends, the instrumenter callback fetches all that data, does some reduction, does some analysis, and then it produced this nice visualization. And this is sort of the same kernel as before that was running on the emulator, only now it's running on the GPU.
And so for some applications the performance impact is very low. If the application has, say, many compute instructions per basic block, then the ratio of instrumentation code to original application code is fairly low, so the performance impact is very small.
In other cases, like binomial options, each basic block was about two or three instructions, and the instrumentation code is probably like five or six instructions. So its performance impact is pretty high.
The instrumentation tools also have to update memory somewhere. And if the application is memory bound already and now suddenly there's even more memory bandwidth demand to increment the counters, then that can slow it down even more. But it's really up to the instrumentation tool and the characteristics of the program what the performance impact is.
And maybe one sort of anecdote: we used some of the built-in registers to figure out which multiprocessors or cores were executing CTAs. We ran the Mandelbrot application through the instrumentation tool and discovered that some multiprocessors were sitting idle while other multiprocessors were oversubscribed. And so the actual kernel runtime was about twice as long as it should have been.
It turns out it was a hardware bug that the driver had sort of not been able to work around. So it's just sort of an anecdote of the instrumentation being useful for observing the execution of the program in ways that NVIDIA or other hardware vendors might not have necessarily provided easy ways to do.
The remote device is sort of an experiment with RPC. The notion is to take Ocelot device calls, serialize them, and send them to some remote server, which then executes them on an actual Ocelot device, so you have your laptop running a CUDA application and your cluster actually executing it.
So it basically allows a multi-GPU application to suddenly become a multi-node distributed GPU application. And some of the work that we did at Georgia Tech sort of indicates that even unreasonable latency doesn't really matter, because CUDA programs are fairly latency tolerant when it comes to kernel execution.
And then switchable compute. I mentioned this before, but it basically allows you to recreate the state of a device on another device so you can just do an on-the-fly context switch. It has all the same problems as serializing the state of any other application and then restarting it.
So if the data structure has lots of embedded pointers, you have to remap those somehow. Ocelot is capable of remapping parameter values, and it provides a mapping table. But if the data structure itself has embedded pointers, then it's up to the application to deal with it. This is basically as good as we can do without doing some kind of analysis to infer data structure types.
And then the last element in this section is the AMD back end. This is work that Rodrigo Dominguez did at Northeastern. The goal is to translate PTX to IL, AMD's intermediate language, which is PTX-like, and execute CUDA kernels on AMD GPUs, and then do this in a way that makes it just another Ocelot device.
So if you have a system with an AMD GPU and an NVIDIA GPU in it, you can just sort of choose between them. One bit of motivation is that if you're trying to compare the performance of a GPU application running on two different GPU architectures, the current state of the tool chain world is that you have to rewrite the application.
So you might have to rewrite it in OpenCL. And once you do that, even if you had an OpenCL implementation on both types of architectures, it's really up to how well that particular vendor implemented things. And I think NVIDIA might have sort of an interest in making CUDA programs a little bit faster all the time, just as a matter of course.
So our goal was to provide as much portability as possible so you have one tool chain,
one program representation and then presumably efficient execution on each type of
processor that's, you know, currently available to the mainstream.
And working with AMD GPUs was fairly interesting. They have sort of an unusual memory hierarchy. It's probably worth pointing out that IL has been deprecated as of the latest release; AMD is sort of promising a new intermediate representation for GPU and heterogenous programs, kind of like PTX and kind of like IL but better somehow.
But this work targeted the previous generation of AMD Radeons. In their case, they were still using VLIW-style cores, and so it was sort of up to the compiler to make use of each of the ALUs that were available. This is sort of opposed to NVIDIA's approach, where you have just a collection of scalar threads executing on SIMD hardware.
So in this case it puts a lot of pressure on the compiler to do efficient code generation. There's just sort of an arrangement of special-purpose functional units, regular ALUs, and load-store units.
Another interesting characteristic is there's sort of a fast path and a slow path to global
memory. And misaligned accesses have to be handled in software through the slow
path. And so all of this is sort of up to someone implementing a code generator. So
Rodrigo dealt with this.
There are also some additional issues, like you can't do a CUDA memcpy to anything except sort of the base address of an allocation, which is complicated; you have to set up these uniform addressing views. So I don't really have too much more to say about the actual hardware.
I do want to point out, though, that there's sort of a big issue with supporting control
flow. And so PTX has branch instructions. AMD GPUs don't. Instead they have
assembly level structured control flow instructions. So basically an if else at the
assembly level. And so to get PTX kernels running on that, you actually have to do a
structural transform and replace this arbitrary control flow with these nested control
structures. And so some of you are probably familiar with that. But it's kind of a
heavyweight transformation. But it's sort of the equivalent of just removing all the gotos
in your program and replacing it with control structures.
And so there's an unstructured-to-structured control flow transformation within Ocelot that's implemented as a PTX-to-PTX pass. And so he ran several applications from the CUDA SDK. One of the interesting ideas was to see how a program that was heavily optimized for, say, NVIDIA GPUs would perform on his implementation on AMD GPUs. And he found that there's up to maybe two orders of magnitude difference in runtime for the same application, depending on whether it had been optimized for NVIDIA GPUs or not.
The implementation has structural analysis, a structural transform, and then just the boilerplate necessary to interact with AMD's CAL API, which is sort of the equivalent of the NVIDIA CUDA driver API.
And here are some slides I'm probably not going to go into -- I'm running short of time -- but they basically show that a program with unstructured control flow is easy to write, particularly if you use things like short-circuit conditional evaluation, and that being able to run the PTX that might correspond to this on an AMD GPU requires lots of work. So that's really all I wanted to say about the AMD GPU.
So this is sort of a dive into Ocelot's PTX IR implementation. PTX is sort of a virtual assembly language. This is maybe one of the useful things for doing compiler research, because we've implemented our own parser into the PTX IR and we've implemented lots of our analyses and transformations in terms of our own PTX IR. This is probably one of the better ones that I've seen available on the Internet, and for a while I think we were one of the only ones to have a fully featured PTX parser, IR, and emitter, and then analysis tools. The source is implemented in the IR directory: module, PTX instruction, operand, PTX kernel, control flow graph. We also have IRs for both AMD IL and LLVM, so those are also in the IR directory.
And so the internal representation is a collection of C++ classes that correspond to
elements of PTX modules. So these are PTX module, kernel, instruction, operand,
global variable, local variable, parameters.
The IR also implements the emitter, which takes the internal representation and produces a string representation that's parsable by other PTX tools. And so that's how we get from Ocelot to NVIDIA GPUs. There's also the ability to apply transformations and then re-emit PTX, which can be used sort of stand-alone -- or, sorry, to do static transformations on PTX offline and then sort of package that with the rest of the application.
So if you wanted to do your own characterization studies or optimize someone else's programs, this is how you would do it. And there's also a valid method, which is sort of our attempt to make sure that the resulting PTX instruction is actually correct. We have sort of a shallow class hierarchy of just one PTX instruction class, and so it's possible to specify parameters that just don't correspond to what would actually make sense as a PTX instruction.
And this tries to catch that. It's always invoked whenever the emitter is running. There are translators from PTX to LLVM and to AMD's IL; these use the PTX IR as sort of the source. It's how we implemented all of our analysis and transformation passes. And then finally it's the form in which the PTX emulator is actually executing: inside the emulator there's sort of a switch statement on the PTX instruction's opcode which calls the appropriate handler.
So here's how a PTX module is structured. There's the text here that corresponds to what you would actually get from NVCC, and these boxes sort of correspond to how that text maps to objects within the IR. This is the module. Global variables, which have a type and an identifier. The PTX kernel, which has a name, a set of parameters which themselves have types, and global variable declarations. The registers are handled as sort of values instead of actual register declarations, and so if you added additional instructions or performed a transformation to and from SSA, the actual set of registers that are visible in the PTX kernel might be different.
There's a control flow graph, which itself is composed of basic blocks and edges. So here's the basic block. Basically the kernel owns the parameters and the control flow graph, and the control flow graph owns the basic blocks. The basic block contains a set of PTX instructions, and these have an opcode, address spaces, data types, and then a collection of operands. The operand itself has various addressing modes.
Since PTX is a virtual ISA, they're free to make it as complex as they want. So an operand itself could actually be a vector of registers, and we have support for that; that's how they support texture sampling, where the return value is a vector instead of just a single register. Then there are various addressing modes. And finally there's an implicit predicate operand.
So basically each of these instructions could potentially have a predicate operand. Generally it's always predicated on; in the case of the conditional branch there's an explicit predicate. So that's that.
So the control flow and dataflow graphs: the dataflow graph is basically an overlay on the control flow graph, and it's in SSA form. It allows you to iterate over a definition and all of its uses, and at any basic block you can determine the set of live values.
We've provided some convenience functions to transform the control flow graph: split a basic block at a certain point, split an edge by inserting a new basic block, add new values and new instructions without invalidating the dataflow graph, and traverse the control flow graph in several different ways.
And so here's sort of a very simple example that splits basic blocks at barrier instructions such that the barrier is always the last instruction of a basic block. This is one of the upstream passes before replacing barriers with exits and inserting context switch points. So you have an iterator over the control flow graph that iterates over the blocks of the kernel's control flow graph, and then we iterate over the instructions within the basic block -- we have iterators for that. There's a PTX instruction; the control flow graph is meant to be sort of ISA agnostic, so we have a base class below PTX instruction, so we cast it. We examine the opcode, determine if this basic block contains more instructions than just the barrier, and if it does we call the control flow graph's split-block method, pass it the iterator that corresponds to the basic block, and give it some additional information for the new edges to create.
But basically the point of this is just to show that it's fairly easy to add new basic blocks, and it's fairly easy to iterate over or traverse the internal representation of the program and make modifications to it. We tried to make it as convenient and sane and conventional for C++ programmers as possible, so there are iterators for basically everything.
And we've already seen the spill code. It's also worth pointing out that we have internal representations for LLVM and AMD IL, and emitters for both of those.
So on top of the PTX IR, we've implemented a pass manager that orchestrates the application of PTX-to-PTX transformations as well as analyses. It is largely inspired by LLVM's implementation. And it's --
>>: It's largely inspired by [inaudible].
>> Andrew Kerr: It's largely inspired by [inaudible]. So thank you. But basically the concept, as you're probably all aware, is to avoid recomputing analyses unless the program itself actually changes. And if your transformation can update the analysis data structures, you don't recompute them, because even though the program has changed they're still valid. It's sort of a structured way of doing that. So I probably won't tell you things you already know.
And maybe one additional point is that there's a pass manager sort of built into the compilation pipeline for each of the devices, so you can always add transformations to a program before it's run. We have some example analyses, like dataflow analysis, dominance analysis, and thread frontiers analysis, which is some architecture work that we did that tries to improve SIMD utilization for programs with unstructured control flow.
So these data structures are implemented in the IR for the control flow graph; the dataflow graph is an analysis, as are dominator and post-dominator trees. We have hyperblock and superblock formation passes, so we're experimenting with if-conversion and predicating PTX instructions instead of using control flow.
The divergence graph attempts to identify expressions that are uniform across all the threads within a program. If that's true, you can mark those expressions as thread invariant. If they are used as conditions in control flow, then suddenly you know that all threads are going to take the same paths, and you can mark regions of the kernel as uniform or convergent.
And so we have sort of an example of implementing dead code elimination as an Ocelot PTX pass. The goal here is to use dataflow analysis to identify instructions that have no side effects and no users, remove them, and then keep removing until no additional instructions can be removed.
So the dead code elimination pass is derived from kernel pass and indicates that it needs dataflow analysis and SSA form as dependent analyses. We just have an accessor -- fairly straightforward -- that obtains the dataflow graph and asserts that it's in static single assignment form. It iterates over the blocks of the dataflow graph and inserts them into a work list. Then, for each block on the work list, it identifies the instructions that are dead. The dead instructions are those that have no users and are not live out. And if they can be removed -- basically, no side effects and no users -- then the instruction can be eliminated. The source code is here for reference; we don't necessarily have to walk through it. It's fairly straightforward.
But the point is dataflow analysis is -- or the dataflow graph is a usable data structure.
There are iterators for it. It should be fairly intuitive for someone with compiler
experience.
And I'd like to draw attention to this PTX optimizer program. This is sort of a standalone utility for parsing PTX into the IR, applying custom optimizations, and then re-emitting it. It's sort of a handy utility. And then you can attach PTX transformation passes to Ocelot, so they are always applied for a program.
Okay. So these are just some of the other research topics that we have been looking at, and other people have been looking at, with GPU Ocelot. Workload characterization was sort of one of the original capabilities, and so we've used that to define a set of metrics that loosely correspond to performance and then use those metrics. That sort of led to this project called Eiger, where we basically store statistics about some set of applications running on some set of heterogenous hardware: take measurements, store application characteristics in a database, then create a statistical performance model so that you can use that to predict the performance of the application on new hardware or a new processor that might be available.
And so with performance prediction you can suddenly make scheduling decisions on
heterogenous hardware. You could also use it to sort of do like very coarse grain
analysis when doing design space exploration for like a cluster environment.
This is meant to interact with work being done at Sandia, which sort of developed this simulation infrastructure for evaluating a cluster, or a set of different types of processors in the nodes of a cluster.
If you want to do a detailed performance analysis of one particular program on a detailed simulator, you still sort of need to simulate the rest of the cluster environment. And if you have a thousand nodes, you don't want to do a detailed simulation of all thousand nodes; you just need to know how quickly to replay messages coming in from the network, basically. And so this is trying to do that in a coarse-grained and inaccurate but fairly fast way -- statistical simulation, basically.
We've also been sort of interested in doing automatic tuning. In one of our instrumentation papers we showed how the instrumentation tool itself could be used by a scheduler to sort of adjust the issue rate of kernels coming from different applications to try to achieve fairness. It's just sort of one example of monitoring the execution of the program through software, used by Ocelot to make intelligent or acceptable scheduling decisions.
That's the feedback-driven resource management. We've also been looking at compiling other types of programming models or execution models onto Ocelot for heterogenous execution. And so this is called Red Fox, which is basically sort of a GPU-accelerated database model. One of our sponsors is a company called LogicBlox, which has developed sort of a high-performance database based on Datalog. Datalog is sort of an alternative to SQL and can be used to express relational algebra.
So we've implemented CUDA kernels that correspond to primitives, and then the relationships between those primitives are specified by Datalog. There's a translator from Datalog into this relational algebra, which ultimately goes to a scheduler, and then that scheduler ultimately invokes Ocelot, which executes the kernels. So basically, as you add different types of processors, the scheduler can schedule kernels on those new processors and performance improves.
I'm sort of interested in new architectures as they're coming out. We currently focus on just the mainstream CPU architectures. But if, say, Intel's GPUs had sufficient software tools that we could actually get code running on them, it would be very interesting to add them to a tool like Ocelot. We're also interested in supporting Intel's MIC and AMD's Fusion -- basically just new heterogenous processors as they're emerging. And we're also interested in targeting some more research-oriented architectures like Rigel. Vanaheimr is just sort of a side project; it would be interesting to see if we could get the CUDA execution model running on E2, for instance.
>>: [inaudible].
>> Andrew Kerr: We were also sort of curious, if you have low-level code generation capabilities, what might change if the compiler itself is aware of all the threads that are running and what the relationships between threads are. And so it's just sort of one opportunity.
And then I felt like citing Mark's work because he actually used the Ocelot simulator to explore the power and performance impact of a small register cache close to the ALUs.
And then thread frontiers is some work that we actually did. It's meant to avoid the penalties associated with unstructured control flow on SIMD architectures. Most prevailing techniques rely on reconverging at the post-dominator, or, if the program has structured control flow, just outside the control structure basically.
But if a program has unstructured control flow -- I had a good example -- the post-dominator of a control flow instruction is far beyond other locations at which control flow could actually reconverge. And so this work is trying to come up with hardware techniques, which depend on some kind of compiler analysis and program layout, to actually improve the hardware's ability to reconverge threads by scheduling them differently.
So it basically just assigns priorities to basic blocks in the program layout such that the basic blocks are in decreasing priority, and then allows the hardware to always choose threads that are at the highest-priority basic block. So you execute threads that are waiting near the beginning of the program before you execute threads that are near the end. The assumption is that those threads will ultimately catch up with the other threads and reconverge early.
And then the interface to MACSIM is basically as I mentioned before: it's a heterogenous trace-driven simulator, and we use Ocelot to generate instruction traces from CUDA kernels. The interface is just the trace generator interface that I described before. That work has been done by some other students at Georgia Tech under Hyesoon Kim and Rich Vuduc.
And the last set of slides I have is just sort of a walk-through of how to add an additional instruction to Ocelot. In this case, NVIDIA's PTX spec defines PTX instructions for prefetching to various levels of the cache hierarchy, and so this set of slides covers what was necessary to get those running on the emulator and the NVIDIA back end.
The set of steps is: modify the PTX IR, add support to the parser and the emitter, and then implement the instruction for each of the devices. So in this case we implement it for the emulator and the NVIDIA GPU back end. Since it doesn't produce any new values, it doesn't modify the dataflow graph at all, so it basically just requires the emitter to be able to emit the instruction when it's JIT compiling the kernel for the NVIDIA GPU back end.
And so here we just modify the PTX instruction class with two new opcodes, add an additional enumeration that indicates which cache level the prefetch is going to, add two toString methods which are useful for the emitter and also just convenient, and then add the data member storing the cache level. So the PTX instruction can now indicate the new opcodes and indicate which cache level it's targeting.
Within the PTX instruction class, the functions are implemented fairly straightforwardly. The toString method is pretty obvious. The PTX instruction toString method itself is what actually implements the emitter. So there's this large enumeration, and for each of the new opcodes we print the guard predicate, the opcode itself as a string literal, the address space -- which is sort of one of the other modifiers of the PTX instruction, indicating which address space the pointer actually refers to -- and then the cache level. And then there's this data member called d that corresponds to the first operand of the PTX instruction. It's similar for prefetchu, which is just uniform prefetching.
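As a sketch of the kind of additions involved, here is a simplified, hypothetical version of the new opcodes, the cache-level enumeration, and the toString used by the emitter; it is not Ocelot's actual PTXInstruction class.

    // Hypothetical sketch of the additions for prefetch/prefetchu; the real
    // PTXInstruction class has many more opcodes, modifiers, and operands.
    #include <string>

    struct PTXInstructionSketch {
        enum Opcode       { Prefetch, Prefetchu /*, ... existing opcodes ... */ };
        enum CacheLevel   { L1, L2 };
        enum AddressSpace { Global, Local };

        Opcode       opcode;
        CacheLevel   cacheLevel;
        AddressSpace addressSpace;
        std::string  d;  // first operand: the address being prefetched

        static std::string toString(CacheLevel c)   { return c == L1 ? ".L1" : ".L2"; }
        static std::string toString(AddressSpace s) { return s == Global ? ".global" : ".local"; }

        // The emitter: produce the textual PTX for the new instruction.
        std::string toString() const {
            if (opcode == Prefetch)
                return "prefetch" + toString(addressSpace) + toString(cacheLevel)
                     + " [" + d + "]";
            return "prefetchu.L1 [" + d + "]";  // prefetchu only takes the L1 level
        }
    };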
The valid method I mentioned earlier makes sure that the PTX instruction corresponds to something that's meaningful. So in this case, it makes sure that the cache level is either L1 or L2, that the address space is either thread-local or global, and that the operand is a meaningful memory address -- not, say, a register that you're trying to prefetch from. And so it basically looks at the address mode of the destination operand and makes sure that it's indirect, that it corresponds to an address, or that it's an immediate value.
And if all of these are true, then the valid method returns true. So once the IR can actually store the new instruction, we add parser support, which means modifying the lexer, modifying the grammar, and then modifying this PTX parser class, which actually sort of builds up the instruction as it's being parsed.
And so we add the new tokens .L1 and .L2, add the new opcodes to the PTX [inaudible], and modify the PTX parser to actually take the new tokens and translate them into the PTX IR. We do the same for the cache level.
And this also sort of translates the string representation to the new opcodes. Pretty straightforward. Here we modify the grammar: we add the new opcodes, add declarations for the new tokens, and modify the instruction rule to include rules for prefetch and prefetchu. Here's the prefetch rule: it takes the prefetch token, the address space, the cache level, then an open bracket, a memory operand, and a close bracket. So it's fairly straightforward to modify your grammar.
So now the parser is capable of actually receiving the PTX instruction and storing it to the IR, and we modify the actual emulator. Here we add handlers for the two new instructions to the cooperative thread array, which implements the emulator's execution of a CTA. It has an execute method which is basically a while loop while the program is running; it chooses the instruction at the current PC and calls one of the handlers for each of the instructions. So you add new handlers for those instructions. And here's the handler for prefetch. Since the goal here is actually to be able to maybe drive a simulator, we need to actually take some action that determines which addresses are being accessed, even though the emulator itself doesn't need to really do anything, because it's not changing the state. It still needs to be able to drive the addresses to the trace generators.
So here we just iterate over all the threads within the CTA, determine if the thread is actually executing -- whether its predicate operand is on and whether the thread is active at that particular location of the program -- and then just evaluate the address that's being referenced, depending on which addressing mode is indicated, and add the address to this memory-addresses vector, which is a member of the trace event object, which is ultimately dispatched to the set of registered trace handlers.
And so if we run the program, we have a very simple CUDA kernel that I compiled and then manually modified to insert the new PTX instruction, and a very simple trace generator that produces some output if prefetch is executed. So it basically just waits for the opcode prefetch, and when that happens it prints the instruction -- I'm sorry, it iterates over the set of addresses and prints each one out. And so then when you run it, for some particular execution of the program, you get just this output. So that's sort of a complete walk-through that hopefully provides a detailed look into how Ocelot handles PTX and how you can add new instructions.
And for the NVIDIA back end, nothing else is necessary. It basically just depends on
the emitter and the valid call.
And so that's basically everything in the tutorial. I think we're going to have a demo
shortly after lunch. And I'd be happy to take -- answer any questions on any of these
topics.
>>: [inaudible] maybe you mentioned this, but if you want [inaudible] to compiler that
[inaudible] this type of GPU or this type of [inaudible].
>> Andrew Kerr: Yes. Absolutely. So the question is if you had some information
about which processor's likely to execute it the fastest or which processor is available.
Yes, so Ocelot has the ability to switch devices on the fly. Before a kernel is executed,
it has the opportunity to migrate all the device state from that device to another. And so --
>>: [inaudible].
>> Andrew Kerr: Yeah. So there's an action -- so there's an Ocelot API call that will
change devices. And then the return is a mapping of basically old pointers in the old
device's address space to new pointers or new allocations in the target -- or the
destination device.
And then it tries to copy each of those blocks of memory, just sort of like a binary blob. And so if the application is written in sort of the OpenCL style, where the only pointers you actually have are parameters to the kernel, then every dereference must take that pointer and then an index, and that's sufficient, because Ocelot is sort of sensible enough to remap the parameter values. But if you have sort of a data structure that has lots of embedded pointers, which CUDA and PTX allow you to write, then the program itself needs to be modified to sort of serialize the data structures.
But it's the same sort of problem as if you tried to terminate an application and then restart it at a certain location: you just have to save all the state and then rematerialize it. And if the pointer values aren't the same, then update those as well.
But, yeah, that's definitely one of the design goals of being able to execute the same
kernel within the same application on different devices depending on whether they're
faster or not.
>>: Sounds like [inaudible] how do you migrate [inaudible].
>> Andrew Kerr: Right. Right. So Ocelot is sort of like the low level way of doing it. In
the Red Fox example, we had harmony runtime which is sort of a scheduler for
scheduling kernels on different devices that are available in the system.
So if you had some sort of sophisticated performance model, Ocelot could allow you to change devices. And Ocelot also allows you to insert the instrumentation which you might need to make that decision.
>>: So if the applications have assumptions about the device, and then you take them and transfer them to [inaudible] back end to run [inaudible], do you have to run it in each iteration, or does that actually break the code?
>> Andrew Kerr: Yes. So the implicit synchronization among threads in the same
WARP definitely breaks the multicore back end. There's a prototype multicore back end
taking shape that basically will execute a kernel as if all threads take the same control
path, and it vectorizes the scalar instructions.
And so if the WARP size must be at least say four threads for the program to execute
correctly, it will execute correctly on the vectorized back end. I think it would be nice if
the programming model itself allowed the programmer to tell the compiler in the runtime
what the minimum WARP size is necessary for the program to be correct. And there
have been several efforts that try to sort of infer that by some kind of say pointer
analysis of the program.
But I think even in those cases they're not completely general, because you could have this sort of implicit synchronization in a code block that's divergent -- so, like, not every thread reaches it. And every method that I've seen that tries to allow the compiler to make this inference would break in that case.
>> Aaron Smith: Okay. Let's thank our speaker and then we can have lunch and talk.
[applause].
>> Andrew Kerr: So this directory is basically the Ocelot source, the main Ocelot code base. So everything is in the Ocelot directory. There are a number of unit tests; most of them are written with complete knowledge of Ocelot's internals, so there's basically a test for every PTX instruction.
There are also just some CUDA tests, which are just off the shelf -- or not off the shelf, handwritten by us. But they're just regular CUDA programs that you should be able to run on any implementation.
So I'm going to run several of those. So here's the Ocelot configure file. So change the
set of devices just to include the emulator. So here's a set of most of the applications.
Here's this test CUDA sequence application. So there's -- we tried to build a lot of sort
of like debugging tools into Ocelot just because it's really hard to build a compiler
without having lots of observability into the state of the program. So if we change the device back end to LLVM and change the optimization level to report, then when we run the program, as the translator is producing an LLVM representation of the kernel, it basically inserts these function calls into every set of translated LLVM instructions, which maintain a pointer to the original PTX instruction that it's actually translating.
And you can sort of configure the translator to produce debugging output for each of the threads. So here we just have a single thread's execution; we want to monitor the control path and make sure it's correct. Kind of a handy feature.
Let me think. Okay.
>>: Why are they all threads [inaudible]?
>> Andrew Kerr: So this is -- corresponds to the execution of the first thread. And the
thread IDs are sort of a -- it's a three dimensional tuple. So that sort of tries -- they were
trying to make it easier to write let's say dense linear algebra by giving a thread an ID of
an X coordinate, a Y coordinate, and a Z coordinate. And this just sort of outputs the instruction trace for a single thread. Otherwise there would be quite a lot of output if you have hundreds of threads.
In this directory we have the -- a set of applications from the CUDA -- CUDA SDK. And
these are provided by NVIDIA at other locations. It's just sort of like this repository of
interesting applications that make use of CUDA in some interesting way. And so I'm
going to run the Eigenvalues example. And hopefully you should see -- many of these have built-in quality assurance tests, so you can tell whether the runtime, or the implementation, is actually correct by its producing the correct results.
And they also have built-in performance monitoring, so you can sort of see how long it takes the Eigenvalues application to compute the eigenvalues of some matrix. And as you tweak various parameters, like running it on the emulator versus the LLVM back end, you can see different performance. If you add worker threads you should see performance scaling.
And so if we set the back end to LLVM, set the number of worker threads to one, and run the Eigenvalues example, we get say a certain runtime. I think if we change the number of worker threads to say four, we should see the performance scale.
And there's a slight performance difference. One of these applications has a race condition, and so we ran this with the race detector off. If we set it to true -- okay, I believed it had a race condition, but I guess not.
If we turn the debugger on, every time a kernel runs it breaks into this debugger. The
debugger has a text interface, a cool graphic, and some other things. You can set a
watchpoint, you can print the PTX at a certain location, you should be able to
single-step so you can monitor the execution of the program, and you can print the
register values.
So you can print the registers for a certain set of threads that are active. You can also
determine which line of source code the corresponding PTX instruction comes from.
Let me think. I'm just going to exit the program. So beforehand I just invoked NVCC to
compile the PTX for the scan program.
And if you run the PTX optimizer, there are a number of options. One of them dumps
the control flow graph, and the result is this dot file. Let me see -- one second.
So it dumps a visualization of the control flow graph, and you can zoom in on the PTX
instructions. This is sort of the foundation for a lot of our visualization tools: they
produce a dot file corresponding to the program.
So the PTX optimizer is kind of like LLVM's opt, in the sense that you can specify
additional passes. You can also do things like reallocate all of the registers.
>>: Why would you want to do that?
>> Andrew Kerr: And so -- okay. The register file is a statically known concept, and if
your kernel makes heavy use of the register file, that impacts how many threads you can
launch. So if you have a program targeting, say, a Tesla-class GPU, which has a certain
register file -- I think it's like 4K -- and then suddenly you run it on a GPU with, say,
an eight-kilobyte register file like Fermi, you might want to make use of more registers.
So being able to reallocate them allows you to launch more threads -- to launch the right
number of threads.
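For comparison, in plain CUDA source (outside Ocelot's reallocation pass) the closest knob for this register-versus-thread-count trade-off is __launch_bounds__, which asks the code generator to cap registers per thread so a given block size can actually launch. A small sketch with illustrative, untuned numbers:

    // Ask the compiler to keep register usage low enough that blocks of 256
    // threads can launch, with at least 4 resident blocks per multiprocessor.
    __global__ void __launch_bounds__(256, 4)
    saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // spill code may appear if pressure is high
    }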
>>: PTX [inaudible].
>> Andrew Kerr: It has a virtual register set, but the optimizing code generator doesn't
try to do very aggressive register allocation, so it depends on the PTX to do spills, I
believe. I've seen cases where programs will crash if you try to launch too many
threads and the PTX implies too many registers. So Ocelot will basically insert the spill
code for you.
The PTX still declares a large number of registers, but when OCG compiles and
executes it, it will reallocate them, using the live ranges implied by the spill code.
>>: So is that implied in NVIDIA allocators [inaudible].
>>: [inaudible].
>>: [inaudible].
>>: Well, because you're emulating the --
>> Andrew Kerr: Well, in this case you might actually want to do it for a program
running on the NVIDIA GPU. If you want to launch a certain number of threads, then
that limits how many registers you can allocate per thread, and so the register allocator
will insert the spill code so you can get by with fewer registers.
Otherwise the kernel runs and it's actually the driver that catches the error when it sees
that you're trying to allocate too many registers. And with, say, four registers, it will
basically insert spill code on every instruction.
It's worth pointing out that it does insert lots of declarations, but these are basically
values that come out of the SSA form, so it doesn't actually remove them. Each one of
these loads just produces a new value, but its live range is basically only that long.
And so OCG's register allocator can deal with that pretty well.
I was having some problems with my video driver, so I can't run programs that use
OpenGL on this machine right now. I was sort of waiting until the end to possibly install
it, and if it actually works show some programs with visual output. I'm not sure what you
guys want to see.
>>: I have a question. So you can do a transformation on the PTX and then pass it
through NVIDIA's [inaudible] to get the binary?
>> Andrew Kerr: Yes.
>>: Have you seen cases that, you know, [inaudible] register allocation [inaudible] PTX
[inaudible] change the performance of the binary for them? Or the binary's kind of
agnostic to [inaudible].
>> Andrew Kerr: I'd say for a lot of programs the programmer has already tried to do a
lot of, say, performance tuning. We haven't really explored the space of doing low-level
code generation and register-allocation-style optimizations to see what performance
opportunities are available.
Presumably we could do things like unrolling loops automatically. We have a register
allocator; potentially their allocator is better than our linear scan. But we just don't
have our own results.
>>: [inaudible] binary compiler [inaudible] or whatever it is can roll it back and forward?
>> Andrew Kerr: Yeah, that's true. And occasionally I guess it probably does do
heavyweight transformations like that. We've been asking NVIDIA for insight into what
their code generator is actually doing and for an ability to control it. But, yeah, you're
right, it's just something that's below what we can actually interact with.
>>: [inaudible] compiler or it's just a simple task --
>> Andrew Kerr: I'm told it's heavyweight. But no details have been available to me
that precisely describe what that means.
>>: It's being released.
>> Andrew Kerr: Well, it's being released according to the NVIDIA marketing team.
>>: You could sign up for --
>> Andrew Kerr: Yeah.
>>: You just have to [inaudible].
>> Andrew Kerr: You just register your interest.
So let me think. Sort of demonstrated the debugger. We could write a CUDA program
and try to crash the emulator and see what its output is. I'm not really sure what
everyone else wants to see.
>>: Multi-targeting [inaudible].
>> Andrew Kerr: The multi-targeting? Okay. So here's a set of functions for interacting
with Ocelot that are outside of CUDA. For the context switch you basically just call this
and give it an index of the destination device and the source device. CUDA allows you
to enumerate devices, and we just express all the Ocelot devices using that API. But
this allows you to actually invoke the context switch.
And the return value is a mapping from all the old allocations in the old address space
to the new allocations.
Sorry, you can't see what I'm typing. So here's a simple example registering a PTX
kernel. This actually implements the context switch: up until this point the kernel was
just executing on whatever device was set, then the context switch is performed at
runtime, the pointer map is provided, and it's up to the user to do something sensible
with the pointer map.
Generally you would probably just want to transform the base pointers of some data
structure. And if you have a tree or a graph or something, you just use indices instead
of actual pointers so you don't have to remap each of those. It's very standard. Here
we're using the Ocelot launch method instead of the usual syntactic sugar that lets you
pass the grid and block dimensions using the triple angle brackets.
This is just a convenient way of writing C++ instead of CUDA and only using the
CUDA compiler to compile the actual compute kernel.
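Roughly, the host-side flow just described looks like the sketch below. The Ocelot entry points (registerPTXModule, launch, contextSwitch and its PointerMap return type) are written from memory and may differ between Ocelot revisions, and the module, kernel, and file names are placeholders:

    #include <ocelot/api/interface/ocelot.h>
    #include <cuda_runtime.h>
    #include <fstream>

    void demo(float* deviceBuffer)
    {
        // Register a PTX kernel that was compiled separately with nvcc.
        std::ifstream ptx("kernel.ptx");
        ocelot::registerPTXModule(ptx, "demoModule");

        // Set up the launch configuration explicitly instead of the <<< >>> sugar.
        cudaConfigureCall(dim3(32, 1, 1), dim3(256, 1, 1));
        cudaSetupArgument(&deviceBuffer, sizeof(deviceBuffer), 0);
        ocelot::launch("demoModule", "demoKernel");   // runs on the current device

        // Switch from device 0 to device 1; the returned map records where each
        // old allocation ended up in the new address space.
        ocelot::PointerMap remapped = ocelot::contextSwitch(1, 0);
        deviceBuffer = static_cast<float*>(remapped[deviceBuffer]);

        // Later launches now run on the new device.
        cudaConfigureCall(dim3(32, 1, 1), dim3(256, 1, 1));
        cudaSetupArgument(&deviceBuffer, sizeof(deviceBuffer), 0);
        ocelot::launch("demoModule", "demoKernel");
    }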
>>: So once you have this mapping have you [inaudible] Ocelot to make some smart
decision about [inaudible] to be mapped?
>> Andrew Kerr: Okay. So Ocelot gives you the function that will do it; what you need
is some additional runtime. And since the program itself is running with Ocelot, which
generally it will be, you can use instrumentation or some of the other callbacks, like the
trace generators, to install your own handler which will measure kernel runtimes.
Let's say you keep a list of how long each kernel took every time you ran it and use
that to feed some kind of performance modeling tool that you have. You might conclude
that, you know, runtime is equal to some constant times problem size, where problem
size is a parameter value which you can observe.
So then you might say, okay, if I adjust one of those parameters -- say I have a
machine with twice as many SMs, or its clock frequency is higher -- I should expect that
the new performance will be, you know, something else.
And so your runtime tool, running as a trace generator, could actually call this context
switch function and then choose a new device. The main goal, though, is to let Ocelot
be the low-level handler of execution and then have some additional level of abstraction
that manipulates Ocelot -- using Ocelot to insert measurement tools and then modify the
execution of the program. But Ocelot isn't necessarily the final orchestrator of execution;
it basically lets you say: execute a kernel on this device.
And so if you had a more sophisticated runtime, that might be more appropriate in an
OpenCL-type command scheduler; you could use Ocelot to actually implement those
commands. Or if your programming model wasn't C++ and CUDA but was more along
the lines of Intel Array Building Blocks or C++ AMP, there might be room for a higher
level, more abstract scheduler, where you have data structures partitioned and can
execute some kernel on some subset of the data structure on a [inaudible] device.
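As a sketch of that pattern: a trace generator that times each kernel so an outer policy could later call the context switch. The hook names (initialize/event/finish), the kernel name field, and addTraceGenerator are written from memory and may need adjusting against the actual Ocelot headers:

    #include <ocelot/api/interface/ocelot.h>
    #include <ocelot/trace/interface/TraceGenerator.h>
    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    class KernelTimer : public trace::TraceGenerator
    {
    public:
        void initialize(const executive::ExecutableKernel& kernel)
        {
            _name  = kernel.name;
            _start = std::chrono::steady_clock::now();
        }

        void event(const trace::TraceEvent&) { /* ignore per-instruction events */ }

        void finish()
        {
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - _start).count();
            runtimes[_name].push_back(us);
            // A performance model could inspect 'runtimes' here and decide to
            // call ocelot::contextSwitch() before the next launch.
        }

        std::map<std::string, std::vector<long long> > runtimes;

    private:
        std::string _name;
        std::chrono::steady_clock::time_point _start;
    };

    // Attached once at startup, e.g.:
    //   static KernelTimer timer;
    //   ocelot::addTraceGenerator(timer, true /* persist across launches */);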
>>: Is there some way to get into Ocelot's scheduler?
>> Andrew Kerr: The Ocelot scheduler is an imperative thing, but, yeah -- so it's
basically -- let's see. So here's the device class. It has some subclasses describing
properties, an abstract base class for memory allocations, and ultimately it has a launch
function. So, here.
And then each of the additional back ends implements this interface. So for the NVIDIA
back end, its launch method ultimately calls the kernel launch function in the driver API.
Conceivably you could have an additional layer of indirection where you have a
scheduled GPU device of some sort, which has awareness of all of the different threads
of execution and is able to come up with an even more sophisticated schedule for
executing those kernels.
But so far Ocelot is mostly imperative. Even though the CUDA function calls can be
asynchronous, many of the back ends just block until the kernel has actually returned.
In the case of the multicore back end that kind of makes sense to us, because the
worker threads are doing most of the work, and there's no point in letting the application
use the main CPU when it could be running the kernels faster. But the short summary
is: override the launch method, and possibly override the memory allocation methods if
you wanted.
If you had a program representation that could track dependencies between kernels,
that would probably be the right place to build a scheduler.
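To make the "override the launch method" point concrete, here is a purely schematic wrapper. Device, launch, and SchedulingDevice are hypothetical stand-ins for Ocelot's executive::Device interface and its real launch signature, not the actual declarations:

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    struct Device                       // stand-in for executive::Device
    {
        virtual ~Device() {}
        virtual void launch(const std::string& module,
                            const std::string& kernel) = 0;
    };

    // A wrapper device that records launches and decides when to run them,
    // delegating the actual execution to an underlying back end.
    class SchedulingDevice : public Device
    {
    public:
        explicit SchedulingDevice(std::unique_ptr<Device> backend)
            : _backend(std::move(backend)) {}

        void launch(const std::string& module, const std::string& kernel)
        {
            // A real scheduler could reorder or batch here based on
            // kernel-level dependencies before delegating.
            _pending.push_back(module + "::" + kernel);
            _backend->launch(module, kernel);   // block, as most back ends do
        }

    private:
        std::unique_ptr<Device> _backend;
        std::vector<std::string> _pending;
    };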
We were sort of wondering if it was possible to use Ocelot and instrument just the
runtime API to track which values are being passed to various kernels and come up
with a dataflow graph at the kernel level. That was kind of interesting, but nothing
really became of it. Our main goal was: if you already had that information somehow,
could a scheduler using Ocelot make things faster? And that work went into the Red Fox
implementation of the Datalog execution manager.
>>: You said the -- for example [inaudible] this is a data plan [inaudible]. And you
[inaudible] OpenCL and you have a back end device but not back end for, you know, a
set of devices that [inaudible]. Is there any or are there any features in OpenCL that if
you [inaudible] OpenCL would help you [inaudible]?
>> Andrew Kerr: I think the features in OpenCL that would make it better are the ones
around the command queues -- being able to create multiple queues and control the
scheduling between each queue.
We are actually interested in that problem in particular, and there's an ongoing effort to
build a very lightweight imperative OpenCL API front end for Ocelot. It still requires
you to use some other tool to compile the OpenCL kernel to PTX, but the API itself can
go through Ocelot -- that is, the API itself is implemented within Ocelot, and it assumes
another tool has compiled the OpenCL to PTX.
We would like to expand that and look at opportunities to make more intelligent
scheduling decisions based on what ultimately amounts to dataflow analysis at the
kernel level, inferred from the contents of the command queues. I don't think we would
have to do any backtracking; it's just something we haven't implemented yet.
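On the host side, the OpenCL feature being referred to looks roughly like this -- standard OpenCL command queues, not Ocelot's front end; error handling is omitted:

    #include <CL/cl.h>

    void twoQueues(cl_context ctx, cl_device_id dev)
    {
        cl_int err;
        // Two independent queues on one context: commands in different queues
        // may be scheduled concurrently, and ordering across queues is expressed
        // with cl_event waits, which is what a runtime could inspect to infer
        // kernel-level dataflow.
        cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue q1 = clCreateCommandQueue(ctx, dev,
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

        // ... enqueue kernels into q0 and q1, linking them with events ...

        clReleaseCommandQueue(q0);
        clReleaseCommandQueue(q1);
    }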
>>: I assume that -- I mean, maybe this is [inaudible] but OpenCL and [inaudible]
NVIDIA [inaudible] CUDA and OpenCL [inaudible]. I don't know if that [inaudible].
>> Andrew Kerr: I think --
>>: [inaudible].
>> Andrew Kerr: I think that's because at the time their Open64-based kernel compiler
from CUDA to PTX was very mature. Their OpenCL implementation started basically
from the ground up on LLVM, so at the same time they were developing driver support
for OpenCL, they were also building their PTX code generator.
And I think by the time performance became comparable, they just switched and used
their LLVM tool chain for both CUDA and OpenCL. You could probably make the
comparison: there's one command line switch, I think, that will use the Open64 back
end instead of LLVM in the CUDA compiler. So you could write a CUDA program and
just test it.
Is there anything else you'd like to see?
>>: [inaudible].
>> Andrew Kerr: Okay. So, I don't know, maybe some conclusions. The source code
has been freely available since 2009 under the new BSD license. We develop on it
continuously, and most of the work is either in the main trunk, which we try to avoid
breaking at all costs, or in a variety of branches. So there's work to add vectorization
to take advantage of SSE and AVX in the multicore back end.
We're trying to remain as current as possible with respect to CUDA features, so we're
in the process of adding new API support to Ocelot and new PTX instructions as they
come out.
We have a large set of applications outside of the CUDA SDK that we've used for our
own research studies, and those are also available. These are from the Parboil
benchmark, Rodinia -- I forget where Rodinia is from. Oh, University of Virginia. Yeah.
And some other applications that just have unstructured control flow that we used for a
study like [inaudible] GPU.
We also noticed that some applications just don't use the CUDA runtime API -- like
OptiX and some other tools that just use the driver API. So there's a prototype driver
API implementation. But we also built a tool that's just a very thin layer that wraps the
driver API and then captures the device state before and after the kernel is executed.
So you can take an application binary, without even recompiling it, run it through this
tool, and it will capture the state of the GPU. And then you can replay that through
Ocelot.
So if you have a full application but you just want to study the execution of a single
kernel, you can, and it will tell you if the results are correct. So if you wanted to
evaluate an optimization, you could just launch the single kernel and initialize the GPU
device with whatever state that kernel required. So, I think that's everything I wanted
to show. So thank you for attending.
[applause]