>> Shaz Qadeer: Okay. Let's get started. It's my pleasure to welcome Alastair Donaldson to
MSR today. He's here just for this afternoon. Alastair has collaborated with many of us over a
large number of years. He's a professor at Imperial College, London, and he has expertise in
compilers and program verification and GPUs and he has recently added compiler fuzzing to his
repertoire. Today he's going to tell us about some of his recent work in that direction.
>> Alastair Donaldson: Okay. Thank you, Shaz. So the work I'm going to present today is a
collaboration with Christopher Lidbury and Andrei Lascu who are PhD students in my group at
Imperial and also with Nathan Chong who used to be a PhD student in our group and is now a
postdoctoral researcher at UCL. What we've been doing is looking at the reliability of compilers
for the many-core programming language OpenCL, and more generally we are interested in the
reliability of compilers for GPUs. OpenCL is a programming language that targets multi-core
and many-core devices, of which GPUs are a major fraction of the target devices.
So I have a personal motivation for this which is that for the last few years I've been working
quite intensively on a tool called GPUVerify which is a static verification tool for GPU kernels.
This is something I started with Shaz as a visiting researcher in 2011, and GPUVerify is these
days pretty effective at finding bugs in GPU kernels and sometimes it’s capable of verifying
particular correctness properties of those kernels under a number of assumptions and caveats
and this has been a main focus of my group and I for a few years.
This raises an obvious question. So GPUVerify operates on source code. It actually operates on
LLVM bit code that's been obtained from source code from an OpenCL or [indiscernible] kernel.
So the question is can we trust the tools that are going to compile that source code to get you
something that runs on a GPU? So if you can't trust these compilers, if the compilers have bugs
in them, then you could use GPUVerify or a similar tool, maybe one of the tools from Ganesh
Gopalakrishnan’s group in Utah, to check correctness properties of your GPU kernel only to
have the kernel do something completely wrong on the hardware. So this is the personal
motivation for me for getting interested in this work.
So what we wanted to do in the project was look at two existing compiler testing techniques
that have been successful in finding bugs in C compilers and apply them in the domain of
OpenCL. Those techniques are random differential testing, which was popularized for C by the
Csmith tool from the University of Utah at PLDI 2011, and a more recent technique called
equivalence modulo inputs (EMI) testing, which appeared at last year's PLDI from researchers at
UC Davis. So we wanted to validate those techniques in a new domain, and in the process devise
novel OpenCL-targeted extensions to these techniques and use them to assess the quality of
industrial OpenCL compilers.
So we tested 21 different device compiler configurations in this work, and we discovered a
number of bugs from various vendors including Altera, AMD, NVIDIA and Intel as well as in
open source OpenCL compilers. A key thing here is that many of these bugs are now fixed, so
some of these vendors have been pretty active in fixing the bugs we’ve been reporting, which I
think suggests that these are bugs people do take notice of. We’ve now been doing
further testing on these implementations and are already noting bug rates going down,
partly as a result of these fixes.
So what I'm going to do is talk about random differential testing, give you some background on
that, and in the process take a quick detour into the problems of undefined behaviors in
compilers, and then I'll tell you how we lifted random differential testing to OpenCL. Then I'll
give you background on equivalence modulo inputs testing, the other technique we validated,
and show you how we lifted this to OpenCL and then I'll discuss some of our experimental
findings and show you a few examples of compilers bugs we found. And please do feel free to
interrupt me if you've got any questions as we go along.
Okay. So random differential testing is conceptually a very simple idea. Like I said, it was
popularized by this tool Csmith from the University of Utah. So the idea, in the context of C,
is just imagine we have got a tool to generate random C programs and Csmith is such a tool. So
you run Csmith and it produces a random C program by which I mean the program was
randomly generated. The program itself is deterministic and does not deploy any
randomization but its contents are random. We will then give this random C program to a
number of compilers. For example, we might give it to Microsoft and Intel's compilers, a couple
of open source compilers, compile the program, and let's assume that the program has been
designed so that it should print a number. That's all the program does. If we get different
numbers from these compilers then that indicates that at least one of the compilers has a bug.
In this case I've singled out Clang as being the compiler that gives a minority result and it will be
likely that Clang will be the compiler with the bug here, although of course theoretically, it
could be the only compiler that compiled the program correctly or they might all be wrong. In
practice typically the minority compiler is the compiler with a bug.
So question for the audience then, why might this actually not signify a compiler bug?
>>: Because of undefined behavior.
>> Alastair Donaldson: Right. So if the program random.c exhibits undefined behavior then
there will be no surprise that you would get different results from different compilers. So
mismatches in results indicate bugs if the compilers agree on implementation defined behavior,
but this depends upon the input program being free from undefined behavior. So if the input
program is free from undefined behavior but does exercise implementation defined behavior
the compilers must agree on that implementation defined behavior. But the undefined
behavior is the real key. So I think it's instructive to take a quick detour into the problems of
undefined behavior in the C programming language.
So if you take a look at this little program main, so what I'm doing here is reading X from the
command line and squaring X. And then if X squared is less than zero, printing X squared is
negative and then printing the value we computed for X squared, otherwise printing X squared
is nonnegative and printing the value computed for X squared. So if I take this program,
compile it with GCC, and run it, for example, on the number 10, then we see X squared is
nonnegative 100. So what happens then if we run this on a large number like 1 billion? What
do you think is going to happen?
>>: That's more than...
>>: One, two, three, four, five, six, seven, eight, nine.
>> Alastair Donaldson: What do you think would happen then if we try and square one billion?
>>: Overflow or something?
>> Alastair Donaldson: Overflow or something. So you might not be surprised to see X squared
is negative and then a very negative number. So now let me compile the program again but
this time with optimizations.
>>: That's not a bug because C isn't supposed to guard against overflow.
>> Alastair Donaldson: Right. So that might be what you would expect to see: an
overflow, and that seems like the right result in a sense. You might think that this is not an
acceptable result: X squared is nonnegative and then a negative number. So you can imagine this
very much confusing a programmer, and a programmer might think, oh dear, there's a bug in the
compiler. What has the compiler done wrong?
>>: [inaudible] nonnegative.
>> Alastair Donaldson: Yeah. So without optimizations we’ve got X squared is negative and a
negative number. With optimizations we get X squared is nonnegative and a negative number.
But this can actually be quite easily explained. So when you write X times X in a C program
where X is signed you are actually telling the compiler that it is allowed to assume at this
program point that X times X does not overflow because if X times X does overflow then not
just the result of that operation is undefined but actually the whole behavior of the program on
that execution is undefined.
So as a programmer, when you do an arithmetic operation on signed values, you’re telling the
compiler that this is not going to overflow, because if it would overflow the compiler can
generate anything it likes. So because the compiler knows this did not overflow it's able to do a
good optimization: it can optimize away that guard, and it can actually transform the program
into this program, which is a more efficient program, and that's the compiler’s job, and this more
efficient program prints the result we get. It says X squared is nonnegative, and it computes
whatever you get by squaring X in some register on an x86 machine; doing the multiplication
you are going to get overflow, you are going to get wraparound, you're going to get this
negative number. So it's quite a neat example of...
>>: So the compiler is smart enough to know that X squared is positive?
>> Alastair Donaldson: Yes.
>>: Interesting.
>> Alastair Donaldson: So the compiler is smart enough to know that.
>>: I think the compiler is smart enough to know that the semantics of this program is
undefined so it can do whatever it wants.
>> Alastair Donaldson: Well, the compiler is smart enough to know the semantics is only
defined if this does not overflow. Furthermore, the compiler knows that if it does not overflow
the result will be nonnegative and therefore it can apply this optimization. So it's applying dead
code elimination on the assumption the program is well-defined.
So for this random differential testing approach to work we need the program to be free from
undefined behaviors. So the way this works in a program generator like Csmith is you can use a
conservative effect analysis to, for instance, make sure that you don’t use some pointer unless
you're guaranteed you've initialized the pointer, for instance, as you generate the program be
conservative there. And the challenge in making a program generator is that it's easy to write
something that generates programs free from undefined behavior if you don't care whether
those programs are interesting; but writing something that generates interesting programs, in
the sense that they really explore the language and really give the compiler a hard time, and yet
are free from undefined behavior, that's the challenge.
So Csmith has this effect analysis and also uses safe math operations. So for instance, instead of
generating a division over any two random expressions, Csmith will generate a call to
a macro called safe_div that takes E1 and E2, and the macro uses the ternary operator to say
that if E2 is zero then the result evaluates to E1, the numerator, otherwise it
is the actual division. There's nothing special about E1 here. The point is that we need
to give some defined result and E1 is probably more interesting than just zero or three or
something because E1 is itself potentially a large and interesting randomly generated
expression.
Okay. So I'll briefly tell you how we lifted random differential testing to OpenCL. So the Csmith
tool generates random C programs that explore the richness of C and avoid undefined
behaviors. So what we did is build an extension called CLsmith that aims to generate OpenCL
kernels that explore the additional richness of OpenCL. Because OpenCL is based on C, we
already get a lot of the richness of C, and thus of OpenCL, from Csmith.
But examples of additional richness are vector types and operations in OpenCL and inter-thread
communication. So you have all these threads running a GPU kernel and communication
between those threads is something we were keen to explore. We need to avoid additional
undefined behaviors, so for instance data races and barrier divergence are two kinds of
undefined behaviors you have in this data parallel world, and also there are some vector
operations that come with undefined behavior constraints.
And furthermore, we need to guarantee determinism. So in the world of concurrency you can
have programs that are free from data races and yet behave non-deterministically. Such
programs do have well-defined behavior, but you get a set of possible results from them,
and it would be pretty difficult to apply random
differential testing if your program can compute some result from an unknown set, because then
you have no idea whether the same program running on two different
implementations is giving different results because they are both acceptable members of the
set or because one of them is wrong due to a compiler bug.
>>: Is the reason they could be nondeterministic is because of the presence of atomics?
>> Alastair Donaldson: That's right.
>>: I see.
>> Alastair Donaldson: Not from regular operations, because with those we would only get non-determinism from data races.
>>: [inaudible]?
>> Alastair Donaldson: With atomics you could. Okay. So a very basic thing then is instead of
making a random C program, make a random OpenCL kernel by having a kernel...
>>: [inaudible] not be able to test programs or generate programs that use atomics?
>> Alastair Donaldson: Well, we'll come to that but in general, yeah. We have to be careful.
We have to use atomics in carefully crafted ways.
So a simple thing to do then is have a program where every thread runs a function and writes
the result of that function [indiscernible] its thread ID. And this function we can generate using
a Csmith because Csmith can generate C functions. OpenCL is very similar to C in its core
language. So if we've got this func which is a C function we can make every thread run the
same function, write the result into an array, and then we can print the result, print the
contents of that array as our program result. So all threads would run the same function,
there's no communication between the threads, and the only engineering challenge here was
to work around the fact that we don't have globally scoped variables in OpenCL, and Csmith
generates a lot of global variables. We've got to find a way around them, and what we do is we
pack them into a struct. We have a struct that's got all the would-be global variables and we
pass that around by reference, so [indiscernible] func would take a reference to that struct,
and if func calls another function it passes in a reference to that struct.
So the reason we did this is that I think in research it is very important to start with simple
solutions before you go for more complicated solutions, and our hypothesis was that we
wouldn’t really find any bugs with this technique because it is so straightforward. In fact,
almost all the bugs we found were with this technique. So we found lots and lots and still
managed [indiscernible].
>>: [inaudible].
>> Alastair Donaldson: Well, I mean we did a big evaluation and scientifically showed blah,
blah, blah.
>>: Or some science.
>> Alastair Donaldson: Well, it's an empirical study, right? The question, so to make an
empirical study scientific you just need to be rigorous. So we found a lot of basic compiler bugs
in these compilers, so problems compiling the sequential code that the threads execute rather
than difficult optimizations to do with concurrency which is I suppose perhaps not that
surprising in a language like OpenCL with hindsight. So data parallel language, so the
programmer is actually controlling the parallel execution. It's not that the compiler is
controlled in the parallel execution. So to some extent the compiler is not going to be doing
sophisticated optimizations related to concurrency although I know that some compilers do
some optimizations to merge threads together, for example.
Okay. So I'll show you what one of these random kernels looks like. And if you've ever seen a
random program from Csmith you'll realize this looks quite similar. So I'm going to deliberately
flick through this quite quickly just to give you a flavor for what one of these things looks like.
So you’ve got loads of variable declarations inside these functions, you have all kinds of crazy
code like for loops with break statements in the middle of them, here's some OpenCL stuff I'll
come to later, a barrier there. So you can see this is not the kind of program you would like to
be understanding as a human, but these programs are good at inducing bugs in compilers.
So the next thing we wanted to do was to support vectors. Our hypothesis here was that
because OpenCL has a very rich set of vector types and operations, these may be under-tested
by compiler writers. So we extended CLsmith to exercise this rich vector language, and this
required some engineering effort because Csmith, the original tool, would slightly abuse the
fact that C is pretty liberal with types and you can implicitly coerce between most integer
types. So in generating expressions Csmith deliberately wasn't taking care to track the type of
the expression being generated. To make this work for vectors in OpenCL, which have stricter
typing rules, we had to do some engineering work, but actually this was
straightforward.
And we had to take care of some undefined behaviors. So for example, I thought I had found a bug
in Intel’s compiler because I ended up with a very small program. But my small program tried
to clamp a variable X into the range minimum one, maximum zero. And then I stared at the
small program and thought, what does it mean to clamp something into the range one, zero?
And then I looked at the spec, and it said that the behavior of clamp is undefined if min is larger
than max. In fact the small program was just an undefined program. So there were a few
cases where we had to guard these vector operations to make sure that they had well-defined
behavior under all inputs.
So just to give you a flavor of the vectors: clz, count leading zeros, is an example
of an OpenCL vector intrinsic. This gives you the number of leading zeros in the binary
representation of a number. Another one is rotate. So if you apply rotate you give it two
vectors, and for each
component of the first vector it takes the corresponding component of the second vector and
rotates the former by the number of bits specified by the latter. And that's well-defined even for
signed types because OpenCL demands two’s complement, so you can do these bit-level rotations.
>>: Quick question. Could you attempt to reduce these or is that already like some sort of
minimal state?
>> Alastair Donaldson: No, no. So these are the original programs and what we do is if we find
a mismatch between implementations what we would be doing is reducing the large programs
by hand until we get small test cases. And I'll show you later some examples of the reduced
programs. And something we are working on at the moment is trying to extend existing work
on reducing C programs to OpenCL which has some practical challenges.
>>: So the requirement is just Csmith and CLsmith?
>> Alastair Donaldson: Yeah.
>>: Was it necessary to run such large programs in the first place [inaudible] bounds?
>> Alastair Donaldson: In the Csmith work they did a study where they assessed the extent to
which they would find bugs if they were restricted to small programs, medium programs, or large
programs, and in that work they did indeed find that you had to have large programs if you
want to have a good chance of finding mis-compilations and compiler crashes. Now we did not
do an equivalent evaluation in the OpenCL context, but we actually do have all the data at our
fingertips to do that retrospectively. So that's something that, yeah.
>>: So was there any sign of [inaudible] or was it just [inaudible]?
>> Alastair Donaldson: I think it’s just a balance of probability. If there is enough code, if
there's more, more, more code, there's a higher chance the compiler will have an opportunity
to mis-compile. I think it's as simple as that.
>>: So if it’s the case that for every bug there exists a smaller program that induces the same
bug, then the random generator is just not efficient at generating the small programs that
find the same bug.
>> Alastair Donaldson: I think that's definitely true. So in the Csmith work, my memory of what they
said was that to find bugs they needed large programs, but they could also always reduce them
to small programs and still expose the bugs. It wasn't that the bugs needed large programs; it's just
that through large programs they find more bugs. But sure, if they found the right small programs
they would find the bugs.
>>: I think the fundamental question here is not so much the size of it. So you can ask directly
[inaudible]. It’s really the debuggability of it. If it's really huge and hurts our ability to debug, then
the chances of it getting fixed are reduced, right?
>> Alastair Donaldson: Right, but I think if you find that you’ve got a program that causes two
compilers to get different results, once you’ve found that you can then apply this automated
reduction technique which they have, and then you can get the small program. You can get a
small example that can be fixed. So I think the strategy with Csmith and C-Reduce is to use big
programs to find mismatches, but then you’ve got a reducer to give you small programs. And
we didn't evaluate whether we needed big programs here. Some of the implementations we tested
were quite weak, in which case we needed big programs, but some of them were pretty
strong as well. But we don't have automatic reduction yet, so reducing a big
program by hand does take longer, although it doesn't take proportionally longer, because often
you can prune away huge parts of the program immediately. You can just cut out some
function call that's calling most of the functions and...
>>: [inaudible] super-fast.
>> Alastair Donaldson: Exactly. Okay. So the next thing was inter-thread communications. So
in OpenCL you can use barriers to allow threads to communicate with one another,
synchronization barriers. And our hypothesis, based on having seen some LLVM bug reports
related to OpenCL barriers, was that compilers sometimes have a hard time optimizing around
barriers. So we figured if we could find a way to use barriers and have threads communicate in
a race-free manner this might give us more opportunity to find compiler bugs.
So what we implemented was roughly the following, imagine you’ve got a bunch of threads
executing a kernel and we’ve got a shared array and let's imagine we can give every thread a
randomly generated but unique index into this array at the beginning of time. So every thread
owns an element of this array. The threads can read and write that element freely as they
execute the kernel. So as the threads execute they’ve got unique access to their element and
then when they reach a barrier synchronization we can do an ownership redistribution so we
can change which thread owns which element and then the threads can carry on executing.
This allows the threads to communicate data values to each other but in a way that guarantees
freedom from data races because between barriers the threads have unique access and at the
barrier they then do a permutation of who owns what. So this is a fairly simple way to ensure
that we avoid data races.
We have to be careful about where we place the barriers though, because in GPU kernels you
can’t have barriers in thread-sensitive regions of code. So the way we dealt with this is we
restricted the use of thread ID in our random programs so that we could place barriers
wherever we wanted. So to give you a little flavor of this in the random code, here's an
example of a barrier, and then immediately after the barrier this TID variable gets updated to
this, which is a random permutation.
Then, Shaz, you’d asked about atomic operations. So indeed we wanted to try to investigate the
use of atomics in these random programs, but atomics are the one way in OpenCL you can have
race-free non-determinism. So atomic operations are not deemed to race with one another.
You can use atomics to, for example, see who’s the first to get to a certain point in the code. So
our hypothesis was that compilation might be sensitive to the use of atomics, so they'd be
worth exploring, so we aimed to find a way to use atomics in a deterministic manner.
So we had a couple of ideas, and very much our approach was that we would brainstorm ideas until
we came up with something we were sure was going to lead to determinism and well-definedness,
implement it, and see what would happen. So the first idea we had was atomic
sections. The idea was that we would give every workgroup a shared counter, call it C, and an
associated result variable, both initialized to zero, and then we would inject into
the code an atomic section. So this involves doing an atomic increment on the counter
and testing whether you get the constant value D, which is some constant chosen at generation
time. If you do, that means you are the (D+1)-st thread to execute this atomic increment. So
if D was 2, for instance, and the old value of the increment was equal to two, you would be the
third thread to increment it.
And in particular, unless so many threads increment this that the counter wraps around, only
one thread can get into this conditional. Agreed? So now that thread can call a function, which
is a randomly generated function, and then it can add the result of that function to the atomic
counter variable. Now the reason we made this be an atomic add is precisely to capture that
wraparound case: you may actually have an atomic increment in a loop, and you may have the
possibility of the counter wrapping around, and then in theory two threads could get in
here, and if we atomically add then we are still race-free.
So how do we make sure that this gives us determinism? Well, we have to be careful that this function
doesn't leak which thread executed it. If this function would somehow leak the thread’s ID
then we would actually get different results depending on which thread got into that function,
because that thread would then start to behave differently; and in particular, that thread may
now not reach barriers that it would have reached otherwise. So this was actually the most
challenging thing to implement by a long way. Almost all the bugs in our tool were related to
this mode. It was a pretty difficult thing to get right. In fact we
didn't find any compiler bugs related to atomics, but we kept thinking we had. We kept
thinking we’d found one; we would investigate, investigate, investigate, and we would find it
was a problem with our implementation, and we eventually didn't find a single bug that needed
an atomic. We found plenty of mismatches in programs that contained atomics, but we were
always eventually able to reduce them down to not need the atomic in the end.
So the key thing here is that the effects of the function should not be visible outside the section,
and that’s pretty difficult because it's going to be computing with all kinds of pointers, and
we had to restrict what could be done with those pointers. In particular, you shouldn't be
able to modify data outside the section.
So the other idea is simpler. So some operations in OpenCL that are atomic are suitable for
doing reductions, in particular add, min, and xor are three examples. You can do a reduction
operation on them, right? You can apply these operations to a bunch of data to get a result,
and because these operations are associative and commutative it doesn't matter in which order
you apply them; you’ll get the same result in the end. So the simple idea here is that we
randomly emit an atomic operation, so this is one of these associative commutative operations,
and we have every thread evaluating an expression E and atomically apply the result of E to a
shared variable S and then we can do a barrier synchronization and then we can, at the end of
the kernel, have the master thread from a thread group add the final result of this S to its final
result. So this is another way of using some atomic operations in practice.
All right. So to show you these atomics in one of our random kernels: here's an atomic
reduction using max. So the threads are doing an atomic max operation, and this
expression here, which is a fairly hefty expression, is the one they are reducing. You
might notice that we do use a thread’s ID in the reduction operation. We use a thread’s linear
global ID in the operand to the reduction, and that's actually safe because even though the
threads have got different IDs, they're all contributing to a commutative operation, so
that's fine. And then afterwards we have a barrier. If I look for atomic_inc, here's an
example of one of these atomic sections. So we atomically increment this counter variable; if
its value was one, we must be the second thread to have done this, and then we've got some code
which has got to be thread-insensitive.
Okay. So the second technique we investigated was equivalence modulo inputs testing, which
was proposed at last year's PLDI. I think this is a pretty cool technique that these
people proposed. So it works as follows. Let’s imagine you've got a program in some
programming language and let's suppose that this program is well-defined and it's
deterministic. So we can compile this program, and let's imagine we only have one compiler. We
don't have the liberty of multiple compilers to check against each other; we've only got one
compiler. There could be a few reasons for this. This could be a new language. Another reason
could be that in some domains, and in particular in the GPU domain some of the vendors we’ve
talked to have told us they're not allowed to have a benchmark against other GPU
implementations. I don't quite understand the reasons for that but I think it maybe to do with
the possibility of being [indiscernible] reverse engineering perhaps, but some of the vendors
have said they're not allowed to just get the latest tools from some of their competitors and
compare with them.
So if you are under those constraints, or indeed if it's a new language and you only have one
compiler, then you can't do random differential testing. You can do it with multiple
optimization levels, but you can’t do it with multiple tools. So the idea of this is, let's say we've
got a program P. We can compile it and run it through a profiler with respect to some input I.
And because the program is assumed to be well-defined and deterministic what this profiling
will do is partition the statements of the program P into two disjoint subsets. We've got those
statements that are covered by the input I and statements that are not covered by the input I
and I hope you'd agree with me that if we run the program again and again on I we would get
exactly the same partitioning because the program is deterministic.
>>: There is no undefined behavior.
>> Alastair Donaldson: No undefined behavior, no non-determinism. You get exactly the same
partitioning. So what we can do then is, if we call the statements that were not touched by I, D, we
can take the program P and we can manufacture from it as many programs as we like by
messing with D. So we could delete some statements from D, or we could mutate some
operators used in the statements, or we could add some fuzzed code into D, or we could
take some statements from the program and duplicate them into D. We can do anything we
like to this dead code to make lots of variants of P. And these programs, globally speaking, are
completely different programs that for the global space of inputs might give totally different
results, but they have the property, by construction, that for I they give the same result.
So we could then compile all of those programs with our one compiler and we could run
them, and if they give different results, let’s assume again that they print an integer, if they give
different results we know something must be wrong with the compiler. So this is a very smart
idea from these guys at last year’s PLDI.
We wanted to try to validate whether this would be effective at finding bugs in OpenCL.
However, there was a problem. So first of all, this equivalence modulo inputs technique requires the
existence of this input-dependent code. If you’ve got a program that doesn't have any code
that’s only reachable for certain inputs then you won't be able to do this manufacturing of the
multiple programs. Secondly
>>: I don’t understand that. You can always create a program which has a substantial amount.
>> Alastair Donaldson: So a second point of this technique is that you can apply it to existing
code. So you might have some test cases that were hand written that are already interesting
test cases and you can apply this technique to those existing programs. You don't have to be
using random programs. So we were looking at trying to apply this to existing OpenCL
programs, but our experience from working with these programs is that they don't typically
contain that much code that’s conditional on certain input values having particular properties; and the
second thing is you would need a profiler to implement this technique and there is no readily
available OpenCL profiler. So the problems we found were that typical OpenCL kernels just don't
contain that much input-dependent code and, second, there is not a readily available profiler for
OpenCL. Now of course we could build a profiler, so this was the fundamental problem: we
didn’t want to build this profiler only to find there was no input-dependent dead code to be found.
So our idea was, and you already sort of suggested this in your question, well, we can make
up this input-dependent code. We could make programs that have input-dependent code, but
if we've already got a program we could give that program extra code that’s input dependent in
the following way. This is a very simple idea but it was pretty effective in our work and there's
no reason why this idea couldn’t be applied to other languages. There's nothing OpenCL
specific about this idea. So let's imagine that we've got some piece of code
>>: I have a question about the previous technique. So I was just imagining that if I sat down
and typed for say for two hours I can come up with five different ways of compiler fuzzing. This
is one particular technique. So what is it that is ultimately going to distinguish one technique
from another? I'm sure that even companies that have the job of producing commercial
compilers there's a lot of compiler fuzzing going on already. What are we chasing here?
>> Alastair Donaldson: So I think the goal of compiler fuzzing is to find bugs. The way you find
bugs with fuzzing is by pursuing interesting inputs. So I think there are two things. If you can
fuzz in a way that allows you to get high throughput then it doesn't matter if your inputs aren't
interesting on average. If you can get through loads of inputs and there are some interesting
ones, then you will find bugs with them eventually. So the challenge is to have a method that
allows you to very quickly try lots of inputs and to make sure that there are going to be a
decent number of interesting inputs in that set.
So in Csmith what they were trying to do was to try to explore as much of the C programming
language as they could and try not to be restricted. For example, in Csmith, they didn't restrict
themselves to programs with guaranteed termination, because if they did, they could easily generate
terminating programs but they would have to place a bunch of restrictions to ensure
termination, and they feared that those restrictions would reduce the effectiveness of finding
compiler bugs. So instead they had programs that may not terminate and they just used a
timeout. So the idea with fuzzing really would be, of course there's not really any science in
coming up with ideas for making random programs; you can evaluate the thing scientifically, but
the evaluation would have to be guided by how effective you are at finding bugs in comparison
to other fuzzers: once you fix these bugs, do you then find any more, or do you basically have a
bunch of bugs that a given fuzzer can find, you fix those bugs, and that fuzzer just can't
find any more bugs even though there are plenty of bugs?
>>: Maybe I’ll disagree with the kind of overall goal that you pose because to my mind the goal
is not to find bugs. The goal is to find bugs that matter; and I think there's a big difference
between the two, which is why there is so much [inaudible] testing in just about every compiler
that frankly gets anywhere, because they declare that a certain number of programs, websites,
what have you, are of overriding importance and they matter to, say, the customers and developers or
whatever, and as such they really don't want to break the [inaudible]. I think just about every
compiler developer will acknowledge that there are dark and troubling corner cases and yes,
maybe sometimes they would like to know about these things, but I think at the same time they
will acknowledge that there are other problems; it's just that you have to prioritize your time as well.
>> Alastair Donaldson: Yeah. So I would say that anyone involved in the real world of software
would agree with you that in any large software project you have a priority list and you usually
have more bugs than you can fix.
>>: So the question is how do you mesh your [inaudible] results approach with [inaudible]?
>> Alastair Donaldson: I would say that if you are finding bugs that people are fixing then that's
a sign that you are doing something that people think is at least partly useful.
>>: [inaudible] fixing it.
>> Alastair Donaldson: I suppose. If you e-mail someone privately with a bug report and they say
thanks a lot and they say we’ve now fixed it, then you’ve certainly not done any harm. Well, of
course they could introduce more bugs by fixing the bug. But I think a way I would say that
fuzzers are useful is in trying to get some idea about whether your system is satisfying a
[indiscernible] of robustness. So if you run the fuzzer and every hundred thousand programs
you find a mismatch I think you can maybe be quite happy and you may not even bother
investigating those mismatches. But if you are finding a really large number of mismatches
then I think that might suggest that it’s just a basic quality problem and you need to improve
that quality. I think it's sort of difficult to give these questions definitive answers, right? Yes,
it's definitely true that what [indiscernible] finding bugs that matter, but defining what it means
for a bug to matter is another question.
>>: [inaudible].
>> Alastair Donaldson: So the idea here is suppose you’ve got an existing piece of code. This is
a little OpenCL kernel, actually not a little OpenCL kernel, this is a fragment of a real OpenCL
kernel that does breadth-first search. Let’s say we've got this piece of code; then what we can
do is we can add a new parameter to it. This is a parameter which is an array called dead and
this is something we’ve added. Then given that we've added this new parameter we can inject
code into the kernel that is dead by construction. So we can add a condition here saying if
element 43 of dead is less than element 21 of dead, then execute this arbitrary code. And if we make
sure this condition is going to be false at runtime then this code is dead by construction.
So in particular, if we fill the array up with increasing values this condition is guaranteed to be
false, this code will not be executed, and it should therefore not change the semantics of
the program. The key thing is that the compiler cannot know how we're going to invoke this
function at runtime, so the compiler must compile this function to work for whatever
inputs would give it well-defined behavior, and the compiler furthermore is going to try to
optimize all of the code including this code, and if the compiler gets this optimization wrong it
may cause some code that is actually reachable to be miscompiled. So this is a very simple
idea, injecting dead-by-construction code, and this is actually pretty effective and, like I said, there's
no reason why this could not be applied to other languages.
Okay. So what we did experimentally was we took 21 of these device and compiler
configurations, so a bunch of GPUs from NVIDIA, AMD, and Intel, some CPU implementations
from Intel, AMD, and
>>: In particular you can apply this thing, do the previous thing you were inspired by and maybe
it will start finding more bugs.
>> Alastair Donaldson: Right. You can and we did do that as well. And an FPGA
implementation from Altera, and also the Oclgrind open source emulator. What we did is we
applied random differential testing by generating 60,000 kernels. So we have six modes that
we can run CLsmith in, the basic mode, the mode with vectors, the mode with barriers, the
mode with atomic reductions, atomic sections, and the mode with everything on at once. So
we did 10K kernels in each of those modes. We had to discard some of them due to some bugs
we found in CLsmith later on in the study; so this is a very large study with a lot of
implementations and several times during this study we had to restart because we found
problems in our tools. And I should say right now that I don't believe that these 60,000 kernels
are all going to be good kernels. There will certainly be some bugs that we have not discovered
in our tools so I think any macroscopic result we give should be given as an indication of the
quality of the tools we are testing. Of course there will be some cases where we messed up
and have generated a nonsense kernel.
We applied EMI testing to ten real world applications. So we took applications from the
Rodinia and Parboil benchmark suites and we injected dead by construction code into those
kernels and then we also did what you were alluding to Shaz which is we did random EMI. So
we took a bunch of kernels generated by CLsmith and we equipped those kernels with dead by
construction code as well. Is that what you were thinking of? So what we did here is we made
180 base programs and then we tried 40 variants for each of those base programs to give us
7200. The reason for these slightly funny numbers is that actually we had 250 of these programs,
but as I told you, we had to discard some of these programs; we ended up having to discard 70
of our base programs and all the corresponding variants, so we have slightly smaller
numbers than we would have liked here.
So the first thing to say experimentally was that we discovered a lot of machine crashes. So
some of the configurations we tested we had to give up testing because we found that the
programs we were generating were causing not just the program to crash but the whole
machine to crash. So I'm going to show you an example of this but I'm going to save it till the
end of my talk.
>>: Is it because the GPU driver is buggy and it’s just overriding some memory in the OS?
>> Alastair Donaldson: That’s our hypothesis, but we don't know. It makes testing very
difficult. If you want to run 10,000 tests, run them overnight, what we found is we would set
this thing off, go for lunch, come back, machine would be dead. So then we would try to log
where we got to, because if the machine had rebooted that would be okay, because we could
have a script that would say where did you get to, carry on. But sometimes the machine was
frozen so this is pretty difficult. So it seems that a mis-compiled GPU kernel can wreak more
havoc than a mis-compiled C program in our experience.
And then it’s particularly interesting because you have these things WebGL and WebCL where
you can visit a website and it can make things happen on your GPU. So I was thinking about
trying to come up with a website of death that says if you've got an X, X, X card don't go to this
page because if you do your machine will just lock up. So this was something that made the
testing very difficult for us in certain configurations and we basically stopped testing those
configurations.
>>: A mis-compiled C program can also cause the machine to crash; it’s just that that
particular C program has to be running in kernel [inaudible], right?
>> Alastair Donaldson: That's right. But what interests me is that the memory, so I think it has
to do with buffer overflows but I don't quite understand it. If you imagine something running
on a GPU that’s an accelerator card, then the global memory, as I understand it, on that GPU
is not in the same memory space as your operating system, so I don't understand how
something like a buffer overflow could be the issue. And then the GPU driver is going to be, it
says go, execute the program, but it doesn’t interact with the program as the program executes
so I fail to see how the program doing weird things could cause the driver to do weird things. I
don't really know enough about operating systems to know but I found it very shocking.
So I'm going to show you some data about the effectiveness
>>: [inaudible] some of the GPU vendors?
>> Alastair Donaldson: Yeah. We’re going to do sections.
>>: And are the bugs deterministic?
>> Alastair Donaldson: Unfortunately no. So we found that it was not deterministic, and
actually for one of the vendors in question, their latest driver is much more reliable than
the ones we were testing.
>>: A buffer overflow will probably be deterministic.
>> Alastair Donaldson: A buffer overflow would be deterministic. Whether the buffer overflow would lead to a
crash is less clear to me.
>>: Actually the reputation around Microsoft is that these graphics drivers are among the most
buggy drivers.
>> Alastair Donaldson: So what I'm going to show you here is some data. So what I'm going to
look at is for a given configuration and a given mode of our tool I'm going to show you the
percentage of kernels that did produce a result that was not in agreement with the majority
result for that particular kernel. So this ignores timeouts, ignores compiler crashes, ignores
cases where the kernel crashed at runtime. It's cases where the program gracefully terminated
and gave an answer and the answer did not agree with the majority. Does that make sense?
So I think this is a reasonable way of assessing to what extent you're getting wrong code bugs
from these programs. There are some caveats though. A, some of our programs are likely not
well-defined despite our efforts. We don't know of cases where that's the case but it's certain
there will be some. The second caveat is that of course it may be that sometimes there's been
a compiler that's been the only one who got the answer right and we're assuming that doesn't
happen in these results. The second caveat is more of a theoretical concern, I think; for the first
one, I’m sure there will be some examples where the problem is with us.
So we did 10,000 per mode, except not as many in the atomic sections and all modes due to a bug
that I mentioned that we found in CLsmith. So I’ll show you four examples: an NVIDIA GTX Titan
GPU, an Intel CPU, a GPU from an anonymous vendor, and Oclgrind, which is an emulator.
I'm looking at optimizations either off or on, although in this emulator the optimization setting
doesn't do anything so there’s just one column here. And what I'm showing here is there are
these six modes: basic, vector, barrier, atomic sections, atomic reductions, and all, meaning
everything at once. So what I'm saying here is, for instance, that in basic mode on an NVIDIA GTX
Titan with no optimizations, 0.1 percent of programs that gave a result appeared to give the
wrong result.
So a couple of things we can see from here: first of all, NVIDIA’s implementation seems to be
doing well, but you can observe that in all cases but this one, where these two are a tie, in these
three cases you can see that there appear to be more bugs with optimizations on than with them off,
which is perhaps not surprising; you might expect that a compiler would misbehave more with
optimizations on than off, although some of the vendors we talked to said that they don't
really test the compiler with optimizations off because it's on by default in OpenCL, and if you
look at the Intel results you can see this plays out. They didn't tell us that, actually, but this
appears to be the case here. So with optimizations off you can see that there are quite some
problems, and interestingly you can see that with barrier mode on Intel we get a massive
increase in the wrong code bug rate, and I think this is an issue with fuzzing.
So the thing about fuzzing is that a relatively easily triggered bug shows up a lot, and we found a
particular bug to do with barriers on this version of Intel's compiler and it just appears that we
trigger that bug all the time. So I don't think this means that there's something terribly wrong
with their compiler or anything like that; it’s just that in this barrier mode the kinds of programs we
are generating are very likely showing this same bug over and over again. Of course we didn't
reduce that many of these 10K tests because we were doing the reduction by hand. You can
see in this anonymous configuration, you can see again, that no optimizations appears to be
giving a higher bug rate in general than optimizations enabled, although that's not true in
the basic mode.
And one thing to say about the Oclgrind emulator, so this is actually a really excellent piece of
software from the University of Bristol but it had a couple of very simple bugs. To give you one
example, they had a bug in the way the comma operator in C was interpreted. Now our fuzzing
tool based on Csmith generates the comma operator all over the place so you can see that this
appears, this does have a very high wrong code rate and you might think, I wouldn’t use
Oclgrind, but that would actually be grossly unfair because it's a very useful tool. We find it
very useful in debugging our own tool; it just had a bug with the comma operator, which they’ve
fixed now, and if we were to rerun these things again with Oclgrind you would see that rate
drop way down. So some basic problems can lead to a very high bug rate. And in
fact, we're using Oclgrind actively at the moment because we are building a reduction tool to
automate this reduction process and we're using Oclgrind as our undefined behavior checker
because it does lots of checks for undefined behaviors and it's a dynamic tool so it scales well
on the examples we are using.
So a couple of characteristics of the bugs we found. So we investigated thoroughly just over 50
problems and we focused mainly on investigating wrong code problems. So this is not
representative of the kinds of problems you see; our focus was biased towards finding wrong
code problems, but there are other things we found on the way. So we found a few frontend
bugs, so a couple of cases where certain operators in the language were just not supported, and
we did find a couple of spec ambiguities. So there were some weird interactions between vector
initializers and casts, and it's not clear to me what the right rules are from the OpenCL
specification, and NVIDIA’s implementation differs from Intel's implementation, for example, so I
thought that was pretty interesting. We actually found those because we were generating
some code where the code we were generating was sort of ambiguous. We knew what we
wanted when we investigated it but we could understand why there was disagreement from
the implementations.
We found loads of compiler internal errors, a couple of compiler timeouts, cases where the
compiler actually gets stuck compiling the program; a couple of runtime crashes, although I
suspect there are many more cases where the compiled program crashes, and then a number
of wrong code bugs and we investigated a bunch of further bugs arising from dead code
injection.
I wanted to talk a little bit about the nature of the wrong code bugs. So a lot of the bugs we
found related to incorrect handling of structs and this might go back to Ben's point about
whether a bug matters or not. So in OpenCL one does not tend to make excessive use of
structs, in particular, you would not typically have a struct with a struct inside it with a struct
inside with a struct inside it with a union inside it with an array of length 1024 inside it. These
are things you just would not do in OpenCL. Of course the compiler should still compile them
correctly, but you might be sympathetic to a compiler developer who puts that quite low in
their priority list.
We found a lot of very basic bugs related to struct handling, and the reason we found those is
that, as I mentioned earlier, there are no global variables in OpenCL and we mimic global
variables by putting them all in a struct, and then any bug with struct compilation was very, very
likely to show up in our fuzzing because structs were fundamental to our infrastructure. But
we found some very simple bugs to do with structs like the struct initializer is not matching up
with the way structs are indexed and we found these problems in most of the implementations
we tested and those are the ones that the vendors are very keen to fix.
We did find some bugs related to barriers but these were a little bit disappointing in that
although they were bugs that did require barriers to be present they actually didn’t require the
barriers to be doing something related to inter-thread communication, in particular we could
have cases where the threads would not be communicating at all, they would be using strictly
private memory, but the presence or absence of some barriers would be the difference between
successful and unsuccessful compilation. It was quite interesting. We don't know why because
these are closed source compilers. And we found a couple of bugs related to vectors and sadly
no bugs related to atomic operations despite the fact that that was the hardest thing to
implement.
So I'm going to finish the presentation by showing you a few examples of bugs. So the first one
I want to show you is pretty simple. This is a bug with a rotate vector instruction. This rotate
takes a vector of size 2 and a vector of size 2 and what it should do is take the first component
of this one and rotate it by the number of bits specified in the second component and the same
for the second component. So what do you think the result should be there? If we rotate 1,1
by 0,0 and take the X component of the result, the result should be 1, and we are running this
with one thread, X is component zero, so let's run this. So we run our launcher tool which
launches the kernel and I'm going to say rotate, so platform zero, device zero. This is an old CPU
compiler on my laptop. So you can see that we get the wrong result here. We get FFFFFFFF,
and if you try this with Intel's more recent compiler they’ve independently fixed the problem so
this is not something we reported. They'd already fixed this. And if we look at the assembly for
this we see that there appears to be some constant propagation gone wrong in the assembly
code. FFFFFFFF is just stored to memory; there’s no rotation going on. So that's an example of
a small bug, but I would say that it seems like a bug that could wreak havoc in an application
that needs rotate. I'm not quite sure what people do with this rotate, by the way, rotating bits,
but OpenCL has a whole load of intrinsics coming from the group who created the standard; lots
and lots of competing demands there.
>>: Sometimes we end up. [inaudible], not rotate but [inaudible].
>> Alastair Donaldson: So this is an example with barriers. So I have a kernel, it declares an int
X, passes the address of X to H, then it eventually writes out the value of X. So what H does is it
calls K and what K does is it does a barrier, no reason to do a barrier, no shared memory going
on here, then it calls F and stores the result of F through the pointer P, which is our X. And what F
does is it does another barrier and returns one. So we can see here that this should end up
printing the value one, because X should end up being one, and this is executed by two threads.
And it needs to be executed by two threads for the bug to show up, although the threads don't
communicate. So if I run this with optimizations enabled then we see the correct result
1,1, and if I disable optimizations then we get the wrong result 1,0. And we did look at the assembly
code produced for this but we didn't manage to work out what was going wrong here because it
was extremely long assembly code, so we didn't spend the time to dig into that. Okay. So
that's just a couple of the sorts of wrong code bugs we found. So you can see that they're not
terribly obscure bugs. There are relatively small programs that you feel should work correctly.
So some ongoing and future work. So some of the vendors we’ve talked to about this work
have been pretty interested in fixing bugs in their OpenCL compilers. And some of the vendors
we've talked to have been interested but have basically said to us that for them
OpenGL is a much bigger priority than OpenCL. So OpenGL is actual graphics as opposed to this
so-called general purpose graphics processing. I guess because I'm very much in the world of
[indiscernible] and OpenCL I sort of forget that there are people who actually do graphics. So
some companies said it would be great if you had an OpenGL version of this. To me that seems
very challenging in a way that really interests me because in OpenGL everything is floating-
point but it's floating-point with a distinct lack of precise rules for what is acceptable from a
given operator. What's acceptable ultimately is what would A, pass the conformance test suite
and B, what would be acceptable to gamers or to people who are interested in using the GPUs.
>>: [inaudible] conformance?
>> Alastair Donaldson: If you want to be labeled as an OpenGL compliant implementation
you’ve got to pass a set of conformance tests. So those conformance tests do provide some
implicit specification of what would be acceptable. So the specification of OpenGL, as far as I understand it,
places almost no obligation in formal text on what your floating-point operator has got to do.
So if you're a pedantic formal verification person you could say okay, I’m going to make them all
return zero but then you'd fail the conformance tests. So there clearly is some notion of what is
acceptable. There's the conformance test and there’s what would stop people buying your
GPU but there is no exact image a particular OpenGL shader should produce.
>>: [inaudible] I don’t understand why they won’t just let the market decide.
>> Alastair Donaldson: I suppose because they don’t, I don’t know, but my guess would be that
they don't want OpenGL to get a bad name.
>>: Maybe the market should decide that too.
>> Alastair Donaldson: Anyways, the point I'm making there is that there's no precise
specification, so you couldn't even do some kind of abstract interpretation or approximative
analysis.
>>: I would not advocate that even if there was a [inaudible] specification.
>> Alastair Donaldson: Okay. So that's something that we’re quite interested in working on.
Related to that is finding floating-point compiler bugs. So in C there is some notion of what
floating-point can and cannot produce if the compiler obeys the standard but compilers also
have optimizations you can turn on like fast math where they do more interesting things with
your floating-point if you want speed. And then there's this interesting distinction between
result difference due to an aggressive floating-point optimization that's correctly implemented
and the user asks for it and a compiler bug and how do you tell the difference. I find that quite
a challenging problem.
Also, compiler fuzzing on nondeterministic programs that use the rich set of OpenCL
atomics that are related to C++11 atomics. That seems like an interesting challenge.
And then more pragmatically being able to automatically reduce these bugs and rank them
would help us make more progress in this work and try more interesting strategies and be able
to evaluate more than we have done which things are working well and which things are not
working well.
So I want to conclude by showing you this little OpenCL kernel and this is doing a reduction.
This loop here is a reduction loop and it has a misplaced barrier. This barrier is supposed to be
there but it’s inside this conditional.
>>: [inaudible] but
>> Alastair Donaldson: This program has got no semantics because the barrier is in a divergent
location. So if you think that bugs in GPUs don't matter then
>>: You’re going to crash your machine now?
>>: He’s going to melt his laptop.
>>: Really? Step away from the podium.
>>: Are you going to show us that your computer froze?
>>: It’s not frozen yet.
>>: Is your cursor frozen?
>> Alastair Donaldson: It's frozen.
>>: Try typing.
>> Alastair Donaldson: Come up here and have a go if you want. So on that note, thank you
very much.
>>: Is that a realistic bug? This happens all the time.
>> Alastair Donaldson: So what happens all the time is that things freeze. So I
discovered this, unfortunately, when I was teaching at the UPMARC summer school two days
ago and I was trying to show that you get a weird result with barriers, and I crashed my machine
and had to start using the blackboard for the rest of the lecture because this machine, I find
that this machine doesn't reboot very well.
>>: [inaudible].
>> Alastair Donaldson: So then I gave a lecture on the blackboard, and I did manage to
reboot it, but then it wouldn't project properly. But then, actually, look, so I did find that in this case
it appeared the display had crashed, and then it appears Windows got the display back, so
that's what I thought the problem was. But then, actually, this morning I was trying to prepare
this crash to show you, and I waited for about 10 minutes and the display didn't come back. So I
don't know whether the machine had crashed or whether the display would have eventually
come back. So I maybe exaggerated when I said that this crashed my laptop. I’m not
exaggerating when I say it crashed these other machines; they definitely did crash. I suppose if
this would lead to something, a long-running computation on the GPU, if this barrier is in the
wrong place and it's maybe leading to the thing getting stuck in an infinite loop, you could
imagine that could hog up all the resources that are rendering. This is the same GPU that's
rendering things to my screen so you can imagine that could be the reason.
Okay. So anyway, that was a badly defined program but imagine your compiler mis-optimized
and gave you that program. All right.
>>: Thank you.