>> Shaz Qadeer: Okay. Let's get started. It's my pleasure to welcome Alastair Donaldson to MSR today. He's here just for this afternoon. Alastair has collaborated with many of us over many years. He's a professor at Imperial College London, and he has expertise in compilers, program verification, and GPUs, and he has recently added compiler fuzzing to his repertoire. Today he's going to tell us about some of his recent work in that direction. >> Alastair Donaldson: Okay. Thank you, Shaz. So the work I'm going to present today is a collaboration with Christopher Lidbury and Andrei Lascu, who are PhD students in my group at Imperial, and also with Nathan Chong, who used to be a PhD student in our group and is now a postdoctoral researcher at UCL. What we've been doing is looking at the reliability of compilers for the many-core programming language OpenCL, and more generally we are interested in the reliability of compilers for GPUs. OpenCL is a programming language that targets multi-core and many-core devices, of which GPUs are a major fraction of the target device audience. So I have a personal motivation for this, which is that for the last few years I've been working quite intensively on a tool called GPUVerify, which is a static verification tool for GPU kernels. This is something I started with Shaz as a visiting researcher in 2011, and GPUVerify is these days pretty effective at finding bugs in GPU kernels, and sometimes it's capable of verifying particular correctness properties of those kernels under a number of assumptions and caveats, and this has been a main focus of my group and me for a few years. This raises an obvious question: GPUVerify operates on source code. It actually operates on LLVM bitcode that's been obtained from source code from an OpenCL or [indiscernible] kernel. So the question is, can we trust the tools that are going to compile that source code to get you something that runs on a GPU?
So if you can't trust these compilers, if the compilers have bugs in them, then you could use GPUVerify or a similar tool, maybe one of the tools from Ganesh Gopalakrishnan's group in Utah, to check correctness properties of your GPU kernel, only to have the kernel do something completely wrong on the hardware. So this is the personal motivation for me for getting interested in this work. What we wanted to do in the project was take two existing compiler testing techniques that have been successful in finding bugs in C compilers and apply them in the domain of OpenCL. Those techniques are random differential testing, which was popularized in C by the Csmith tool from the University of Utah at PLDI 2011, and a more recent technique called equivalence modulo inputs (EMI) testing, which appeared at last year's PLDI from researchers at UC Davis. So we wanted to validate those techniques in a new domain and, in the process, devise novel OpenCL-targeted extensions to these techniques and use them to assess the quality of industrial OpenCL compilers. We tested 21 different device compiler configurations in this work, and we discovered a number of bugs from various vendors including Altera, AMD, NVIDIA, and Intel, as well as in open source OpenCL compilers. A key thing here is that many of these bugs are now fixed, so some of these vendors have been pretty active in fixing the bugs we've been reporting, which I think suggests that they're bugs people do take notice of, and we've now been doing further testing on these implementations and are already seeing bug rates go down, partly as a result of these fixes. So what I'm going to do is talk about random differential testing, give you some background on that, and in the process take a quick detour into the problem of undefined behavior in compilers, and then I'll tell you how we lifted random differential testing to OpenCL.
Then I'll give you some background on equivalence modulo inputs testing, the other technique we validated, and show you how we lifted this to OpenCL, and then I'll discuss some of our experimental findings and show you a few examples of compiler bugs we found. And please do feel free to interrupt me if you've got any questions as we go along. Okay. So random differential testing is conceptually a very simple idea. Like I said, it was popularized by this tool Csmith from the University of Utah. The idea, in the context of C, is this: imagine we have got a tool to generate random C programs, and Csmith is such a tool. So you run Csmith and it produces a random C program, by which I mean the program was randomly generated. The program itself is deterministic and does not use any randomization, but its contents are random. We then give this random C program to a number of compilers. For example, we might give it to Microsoft's and Intel's compilers and a couple of open source compilers, compile the program, and let's assume that the program has been designed so that it should print a number. That's all the program does. If we get different numbers from these compilers, then that indicates that at least one of the compilers has a bug. In this case I've singled out Clang as being the compiler that gives a minority result, and it is likely that Clang is the compiler with the bug here, although of course, theoretically, it could be the only compiler that compiled the program correctly, or they might all be wrong. In practice, typically the minority compiler is the compiler with the bug. So a question for the audience then: why might this actually not signify a compiler bug? >>: Because of undefined behavior. >> Alastair Donaldson: Right. So if the program random.c exhibits undefined behavior then it will be no surprise that you get different results from different compilers.
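To make the voting step concrete, here is a minimal sketch of how a differential tester might flag the compiler whose printed result disagrees with the majority. This is not part of Csmith; the function name and the harness shape are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch of the voting step in random differential testing.
   results[i] holds the integer printed by the binary from compiler i.
   Returns the index of a compiler whose result occurs least often, or -1
   if all compilers agree. */
static int minority_index(const int *results, size_t n) {
    int suspect = -1;
    size_t fewest = 0;
    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++)
            if (results[j] == results[i])
                count++;
        if (count == n)
            continue;                      /* this value is unanimous */
        if (suspect == -1 || count < fewest) {
            fewest = count;
            suspect = (int)i;
        }
    }
    return suspect;
}
```

With results {100, 100, 42, 100} this singles out compiler 2, mirroring the Clang example above; a unanimous vote yields -1. As the speaker notes, the minority compiler is only likely, not certain, to be the buggy one.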
So mismatches in results indicate bugs if the compilers agree on implementation-defined behavior, but this depends upon the input program being free from undefined behavior. So if the input program is free from undefined behavior but does exercise implementation-defined behavior, the compilers must agree on that implementation-defined behavior. But undefined behavior is the real key. So I think it's instructive to take a quick detour into the problem of undefined behavior in the C programming language. Take a look at this little program, main. What I'm doing here is reading X from the command line and squaring X. And then, if X squared is less than zero, printing "X squared is negative" and then the value we computed for X squared; otherwise printing "X squared is nonnegative" and the value computed for X squared. So if I take this program, compile it with GCC, and run it, for example, on the number 10, then we see "X squared is nonnegative: 100". So what happens then if we run this on a large number like 1 billion? What do you think is going to happen? >>: That's more than- >>: One, two, three, four, five, six, seven, eight, nine. >> Alastair Donaldson: What do you think would happen then if we try and square one billion? >>: Overflow or something? >> Alastair Donaldson: Overflow or something. So you might not be surprised to see "X squared is negative" and then a very negative number. So now let me compile the program again but this time with optimizations. >>: That's not a bug, because C is not supposed to guard against overflow. >> Alastair Donaldson: Right. So that might be what you would expect to see: an overflow, and that seems like the right result in a sense. But now you might think that this is not an acceptable result: "X squared is nonnegative" and then a negative number. So you can imagine this very much confusing a programmer, and a programmer might think, oh dear, there's a bug in the compiler.
What has the compiler done wrong? >>: [inaudible] nonnegative. >> Alastair Donaldson: Yeah. So without optimizations we've got "X squared is negative" and a negative number. With optimizations we get "X squared is nonnegative" and a negative number. But this can actually be quite easily explained. When you write X times X in a C program where X is signed, you are actually telling the compiler that it is allowed to assume, at this program point, that X times X does not overflow, because if X times X does overflow then not just the result of that operation is undefined but actually the whole behavior of the program on that execution is undefined. So as a programmer, when you do an arithmetic operation on signed integers, you're telling the compiler that it is not going to overflow, because if it would overflow the compiler can generate anything it likes. So because the compiler knows this does not overflow, it's able to do a good optimization: it can optimize away that guard, and it can transform the program into this program, which is a more efficient program, and that's the compiler's job. And this more efficient program prints the result we get. It says "X squared is nonnegative", and it computes whatever you get by squaring X in some register on an x86 machine; doing the multiplication you are going to get overflow, you are going to get wraparound, you're going to get this negative number. So it's quite a neat example of- >>: So the compiler is smart enough to know that X squared is positive? >> Alastair Donaldson: Yes. >>: Interesting. >> Alastair Donaldson: So the compiler is smart enough to know that. >>: I think the compiler is smart enough to know that the semantics of this program is undefined so it can do whatever it wants. >> Alastair Donaldson: Well, the compiler is smart enough to know the semantics is only defined if this does not overflow.
Furthermore, the compiler knows that if it does not overflow the result will be nonnegative, and therefore it can apply this optimization. So it's applying dead code elimination on the assumption that the program is well-defined. So for this random differential testing approach to work we need the program to be free from undefined behavior. The way this works in a program generator like Csmith is that you can use a conservative effect analysis to, for instance, make sure that you don't use some pointer unless you're guaranteed you've initialized the pointer; as you generate the program, be conservative there. And the challenge in making a program generator is this: it's easy to write something that generates programs that are free from undefined behavior if you don't need those programs to be interesting, but writing something that generates interesting programs, in the sense that they really explore the language and really give the compiler a hard time, and yet are free from undefined behavior, that's the challenge. So Csmith has this effect analysis and also uses safe math operations. For instance, instead of generating a divide of any two random expressions, Csmith will generate a call to a macro called safe_div that takes E1 and E2, and the macro uses the ternary operator to say that if E2 is zero then the result of the evaluation is E1, the numerator; otherwise it is the actual division. There's nothing special about E1 here. The point is that we need to give some defined result, and E1 is probably more interesting than just zero or three or something, because E1 is itself potentially a large and interesting randomly generated expression. Okay. So I'll briefly tell you how we lifted random differential testing to OpenCL. So the Csmith tool generates random C programs that explore the richness of C and avoid undefined behavior.
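A sketch of such a safe math macro, in the spirit of Csmith's wrappers. The exact macro in Csmith's runtime differs in detail; the version here also guards the INT_MIN / -1 case, which overflows.

```c
#include <limits.h>

/* Safe division in the spirit of Csmith's safe math wrappers: if the
   division would be undefined (divide by zero, or INT_MIN / -1, which
   overflows), fall back to the numerator E1, which is itself a large,
   interesting randomly generated expression. Note the macro arguments
   are evaluated more than once, so the generator must only pass
   side-effect-free expressions. */
#define SAFE_DIV_INT(e1, e2) \
    (((e2) == 0 || ((e1) == INT_MIN && (e2) == -1)) ? (e1) : (e1) / (e2))
```

Falling back to E1 rather than a constant keeps the fallback path interesting to the compiler, as the speaker notes.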
So what we did is build an extension called CLsmith that aims to generate OpenCL kernels that explore the additional richness of OpenCL. Because OpenCL is based on C, we get a lot of the richness of C, and hence of OpenCL, already from Csmith. But examples of additional richness are vector types and operations in OpenCL, and inter-thread communication. So you have all these threads running a GPU kernel, and communication between those threads is something we were keen to explore. We need to avoid additional undefined behaviors: for instance, data races and barrier divergence are two kinds of undefined behavior you have in this data-parallel world, and also there are some vector operations that come with undefined behavior constraints. And furthermore, we need to guarantee determinism. So in the world of concurrency you can have programs that are free from data races and yet behave non-deterministically. Such programs do have well-defined behavior, but you get a set of possible results from them, and it would be pretty difficult to apply random differential testing if your program can compute some result from an unknown set, because then you have no idea whether the same program running on two different implementations is giving different results because both results are acceptable members of the set, or because one of them is wrong due to a compiler bug. >>: Is the reason they could be nondeterministic the presence of atomics? >> Alastair Donaldson: That's right. >>: I see. >> Alastair Donaldson: Not because of regular operations, because from those we would only get nondeterminism via data races. >>: [inaudible]? >> Alastair Donaldson: With atomics you could. Okay. So a very basic thing then is, instead of making a random C program, make a random OpenCL kernel by having a kernel- >>: [inaudible] not be able to test programs or generate programs that use atomics?
>> Alastair Donaldson: Well, we'll come to that, but in general, yeah. We have to be careful. We have to use atomics in carefully crafted ways. So a simple thing to do then is to have a program where every thread runs a function and writes the result of that function [indiscernible] its thread ID. And this function we can generate using Csmith, because Csmith can generate C functions, and OpenCL is very similar to C in its core language. So if we've got this func, which is a C function, we can make every thread run the same function, write the result into an array, and then we can print the contents of that array as our program result. So all threads run the same function and there's no communication between the threads. The only engineering challenge here was to work around the fact that we don't have globally scoped variables in OpenCL, and Csmith generates a lot of global variables. We've got to find a way around them, and what we do is pack them into a struct. We have a struct that's got all the would-be global variables and we pass that around by reference, so [indiscernible] func would take a reference to that struct, and if func calls another function it would pass in a reference to that struct. The reason we did this is that I think in research it is very important to start with simple solutions before you go for more complicated solutions, and our hypothesis was that we wouldn't really find any bugs with this technique because it is so straightforward. In fact, almost all the bugs we found were with this technique. So we found lots and lots and still managed [indiscernible]. >>: [inaudible]. >> Alastair Donaldson: Well, I mean we did a big evaluation and scientifically showed blah, blah, blah. >>: Or some science. >> Alastair Donaldson: Well, it's an empirical study, right? To make an empirical study scientific you just need to be rigorous.
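The global-variable workaround can be sketched as follows; names are invented, and CLsmith's actual generated code differs in detail.

```c
/* Would-be global variables, packed into a struct and passed by
   reference, since OpenCL has no mutable program-scope globals. */
struct globals {
    int g_a;
    unsigned g_b;
};

/* Every generated function takes the struct pointer... */
static int helper(struct globals *g) {
    return g->g_a + (int)g->g_b;
}

/* ...and passes it along to any function it calls. */
static int func(struct globals *g) {
    g->g_a = 3;        /* was: an assignment to global g_a */
    g->g_b = 4u;       /* was: an assignment to global g_b */
    return helper(g);
}
```

In the generated kernel, each thread would call func with a pointer to its own instance of this struct and write the result into the output array at its thread ID.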
So we found a lot of basic compiler bugs in these compilers, that is, problems compiling the sequential code that the threads execute, rather than problems with difficult optimizations to do with concurrency, which is perhaps not that surprising in a language like OpenCL with hindsight. It's a data-parallel language, so the programmer is actually controlling the parallel execution; it's not the compiler that's in control of the parallel execution. So to some extent the compiler is not going to be doing sophisticated optimizations related to concurrency, although I know that some compilers do optimizations to merge threads together, for example. Okay. So I'll show you what one of these random kernels looks like. And if you've ever seen a random program from Csmith you'll realize this looks quite similar. I'm going to deliberately flick through this quite quickly just to give you a flavor of what one of these things looks like. So you've got loads of variable declarations inside these functions, you have all kinds of crazy code like for loops with break statements in the middle of them, here's some OpenCL stuff I'll come to later, a barrier there. So you can see this is not the kind of program you would like to be understanding as a human, but these programs are good at inducing bugs in compilers. So the next thing we wanted to do was to support vectors. Our hypothesis here was that because OpenCL has a very rich set of vector types and operations, these may be under-tested by compiler writers. So we extended CLsmith to exercise this rich vector language, and this required some engineering effort because Csmith, the original tool, would slightly abuse the fact that C is pretty liberal with types and you can implicitly coerce between most integer types. So in generating expressions, Csmith deliberately wasn't taking care to track the type of the expression being generated.
To make this work for vectors in OpenCL, which have stricter typing rules, we had to do some quite serious engineering work, but conceptually this was straightforward. And we had to take care of some undefined behaviors. For example, I thought I had found a bug in Intel's compiler because I ended up with a very small program. But my small program tried to clamp a variable X into the range of minimum one, maximum zero. And then I stared at the small program and thought, what does it mean to clamp something into the range one, zero? And then I looked at the spec, and it said that the behavior of clamp is undefined if the min is larger than the max. In fact, the small program was just an undefined program. So there were a few cases where we had to treat these vector operations carefully to make sure that they had well-defined behavior under all inputs. So just to give you a flavor of the vectors: clz, [indiscernible] leading zeros, is an example of an OpenCL vector intrinsic. This gives you the number of leading zeros in the binary representation of a number. Another one is rotate. So when you apply rotate you give it two vectors, and for each component of the first vector it takes the corresponding component of the second vector and rotates the former by the number of bits specified by the latter. And that's well-defined even for signed types because OpenCL demands two's complement, so you can do these bit-level rotations. >>: Quick question. Could you attempt to reduce these, or is that already some sort of minimal state? >> Alastair Donaldson: No, no. So these are the original programs, and what we do, if we find a mismatch between implementations, is reduce the large programs by hand until we get small test cases. And I'll show you later some examples of the reduced programs.
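Scalar versions of these two built-ins behave as follows. This is a sketch in plain C; the real OpenCL intrinsics additionally operate componentwise on vector types.

```c
#include <stdint.h>

/* clz: number of leading zero bits; OpenCL defines clz(0) as the
   bit width of the type, here 32. */
static uint32_t scalar_clz(uint32_t x) {
    uint32_t n = 0;
    for (uint32_t bit = 1u << 31; bit != 0 && !(x & bit); bit >>= 1)
        n++;
    return n;
}

/* rotate(v, i): rotate v left by i bits; the shift count is taken
   modulo the bit width, so the operation is defined for any i. */
static uint32_t scalar_rotate(uint32_t v, uint32_t i) {
    i &= 31;
    return i == 0 ? v : (v << i) | (v >> (32 - i));
}
```

The modulo on the shift count is what makes rotate total: unlike C's shift operators, there is no undefined shift amount to guard against.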
And something we are working on at the moment is trying to extend existing work on reducing C programs to OpenCL, which has some practical challenges. >>: So the requirement is just Csmith and CLsmith? >> Alastair Donaldson: Yeah. >>: Was it necessary to run such large programs in the first place [inaudible] bounds? >> Alastair Donaldson: In the Csmith work they did a study where they assessed the extent to which they would find bugs if they were restricted to small programs, medium programs, or large programs, and in that work they did indeed find that you had to have large programs if you want a good chance of finding mis-compilations and compiler crashes. Now, we did not do an equivalent evaluation in the OpenCL context, but we actually do have all the data at our fingertips to do that retrospectively. So that's something that, yeah. >>: So was there any sign of [inaudible] or was it just [inaudible]? >> Alastair Donaldson: I think it's just a balance of probability. If there is enough code, if there's more and more code, there's a higher chance the compiler will have an opportunity to mis-compile. I think it's as simple as that. >>: So if it's the case that for every bug there exists a smaller program that introduces the same bug, then the random number generator is not efficient in generating the small programs that find the same bug. >> Alastair Donaldson: I think that's definitely true. So in Csmith, my memory of what they said was that to find bugs they needed large programs, but they could also always reduce them to small programs and still expose the bugs. It wasn't that they needed large programs per se; it's just that through large programs they found more bugs. But sure, if they found the right small programs they would find the bugs. >>: I think the fundamental question here is not so much the size of it. So you can ask directly [inaudible]. It's really the debuggability of it.
If it's really huge and beyond our ability to debug, then the chances of it getting fixed are reduced, right? >> Alastair Donaldson: Right, but I think if you find that you've got a program that causes two compilers to give different results, once you've found that you can then apply this automated reduction technique which they have, and then you can get the small program. You can get a small example that can be fixed. So I think the strategy with Csmith and C-Reduce is: use big programs to find mismatches, but then you've got a reducer to give you small programs. And we didn't evaluate whether we needed big programs here. Some of the implementations we tested were quite weak, in which case we didn't need big programs, but some of them were pretty strong as well. But we don't have the automatic reduction yet, so reducing the big program by hand does take longer, although it doesn't take proportionally longer, because often you can prune away huge parts of the program immediately. You can just cut out some function call that's calling most of the functions and- >>: [inaudible] super-fast. >> Alastair Donaldson: Exactly. Okay. So the next thing was inter-thread communication. In OpenCL you can use barriers, synchronization barriers, to allow threads to communicate with one another. And our hypothesis, based on having seen some LLVM bug reports related to OpenCL barriers, was that compilers sometimes have a hard time optimizing around barriers. So we figured that if we could find a way to use barriers and have threads communicate in a race-free manner, this might give us more opportunity to find compiler bugs. What we implemented was roughly the following: imagine you've got a bunch of threads executing a kernel and we've got a shared array, and let's imagine we can give every thread a randomly generated but unique index into this array at the beginning of time. So every thread owns an element of this array. The threads can read and write that element freely as they execute the kernel.
So as the threads execute they've got unique access to their element, and then when they reach a barrier synchronization we can do an ownership redistribution, so we can change which thread owns which element, and then the threads can carry on executing. This allows the threads to communicate data values to each other, but in a way that guarantees freedom from data races, because between barriers the threads have unique access and at the barrier we do a permutation of who owns what. So this is a fairly simple way to ensure that we avoid data races. We have to be careful about where we place the barriers, though, because in GPU kernels you can't have barriers in thread-sensitive regions of code. The way we dealt with this is we restricted the use of thread IDs in our random programs so that we could place barriers wherever we wanted. So to give you a little flavor of this in the random code: here's an example of a barrier, and then immediately after the barrier this TID variable gets updated to this, which is a random permutation. Then, Shaz, you'd asked about atomic operations. Indeed we wanted to investigate the use of atomics in these random programs, but atomics are the one way in OpenCL you can have race-free non-determinism. Atomic operations are not deemed to race with one another. You can use atomics to, for example, see who's the first to get to a certain point in the code. So our hypothesis was that compilation might be sensitive to the use of atomics, so they'd be worth exploring, and we had to find a way to use atomics in a deterministic manner. We had a couple of ideas, and very much our modus operandi was that we would brainstorm ideas until we came up with something we were sure was going to lead to determinism and well-definedness, then implement it and see what would happen. So the first idea we had was atomic sections.
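The ownership scheme can be sketched single-threadedly as follows; this is an illustration with invented names, where the "threads" run round-robin rather than concurrently and a simple cyclic shift stands in for CLsmith's randomly chosen permutation. The point is the invariant: between barriers each thread touches only the cell it owns, and ownership changes only at the barrier.

```c
#define NTHREADS 4

static int shared_array[NTHREADS];

/* Ownership permutation applied at the barrier: here a cyclic shift
   stands in for the random permutation chosen at generation time. */
static int permute(int elem) {
    return (elem + 1) % NTHREADS;
}

/* Between barriers, thread tid may freely read and write the one cell
   it owns; no other thread owns that cell, so there is no data race. */
static void kernel_phase(int tid, int owned_elem) {
    shared_array[owned_elem] += tid;
}

static void run_kernel(void) {
    int owned[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        owned[t] = t;                  /* unique initial ownership */
    for (int t = 0; t < NTHREADS; t++)
        kernel_phase(t, owned[t]);     /* phase before the barrier */
    /* barrier(): all threads arrive, then ownership is redistributed */
    for (int t = 0; t < NTHREADS; t++)
        owned[t] = permute(owned[t]);
    for (int t = 0; t < NTHREADS; t++)
        kernel_phase(t, owned[t]);     /* phase after the barrier */
}
```

Because the permutation is a bijection, ownership stays unique after every barrier, which is exactly what rules out races in the real concurrent kernels.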
The idea was that we would give every workgroup a shared counter, call it c, and an associated result variable, call it result, both initialized to zero, and then we would inject into the code an atomic section. This would involve doing an atomic increment on the counter and testing whether you get the constant value D, which is some constant chosen at generation time. If you do, that means you are the (D+1)-th thread to execute this atomic increment. So if D was 2, for instance, and you get the old value 2 back from the increment, you would be the third thread to increment it. And in particular, unless so many threads increment this that the counter wraps around, only one thread can get into this conditional. Agreed? So now that thread can call a function, which is a randomly generated function, and then it can atomically add the result of that function to the result variable. Now, the reason we made this be an atomic add is precisely to capture that wraparound case: you may actually have an atomic section in a loop and you may have the possibility of the counter wrapping around, and then in theory two threads could get in here, and if we atomically add then we are still race-free. So how do we make sure that this gives us determinism? Well, we must be careful that this function doesn't leak which thread executed it. If this function would somehow leak the thread's ID then we would actually get different results depending on which thread got into that function, because that thread would then start to behave differently; and in particular, that thread may now not reach barriers that it would have reached otherwise. This was actually the most challenging thing to implement by a long way. Almost all the bugs in our tool were related to this mode. It was a pretty difficult thing to get right. In fact, we didn't find any compiler bugs related to atomics, but we kept thinking we had.
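A sketch of the atomic section pattern in C11 atomics; the names and the stand-in generated_func are invented, and the real CLsmith version uses OpenCL's atomic built-ins with a randomly generated function whose effects are confined to the section.

```c
#include <stdatomic.h>

static atomic_uint counter;          /* per-workgroup counter c */
static atomic_uint section_result;   /* associated result, initially 0 */

/* Stand-in for the randomly generated function; it must not leak
   which thread executed it. */
static unsigned generated_func(void) {
    return 41u;
}

/* Each thread calls this once; only the thread that observes the old
   value D enters the section. The result is folded in with an atomic
   add so that even if the counter wraps around and two threads enter,
   the program stays race-free. */
static void atomic_section(unsigned D) {
    if (atomic_fetch_add(&counter, 1u) == D)
        atomic_fetch_add(&section_result, generated_func());
}
```

Determinism then rests entirely on generated_func being thread-insensitive: whichever thread wins, the same value is added.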
So we kept thinking we'd found one; we would investigate, investigate, investigate, and we would find it was a problem with our implementation, and we eventually didn't find a single bug that needed an atomic. We found plenty of mismatches in programs that contained atomics, but we were always eventually able to reduce them down to not need the atomic in the end. So the key thing here is that the effects of the function should not be visible outside the section, and that's pretty difficult because the function is going to be computing with all kinds of pointers, and we had to restrict what could be done with those pointers. In particular, you shouldn't be able to modify data outside this section. So the other idea is simpler. Some operations in OpenCL that are atomic are suitable for doing reductions; in particular add, min, and xor are three examples. You can do a reduction operation with them, right? You can apply these operations to a bunch of data to get a result, and because these operations are associative and commutative it doesn't matter in which order you apply them; you'll get the same result in the end. So the simple idea here is that we randomly emit an atomic operation, one of these associative, commutative operations, and we have every thread evaluate an expression E and atomically apply the result of E to a shared variable S, and then we can do a barrier synchronization, and then at the end of the kernel the master thread of a workgroup can add the final value of this S to its final result. So this is another way of using some atomic operations in practice. All right. So let me show you these atomics in one of our random kernels. Here's an atomic reduction using max. The threads are doing an atomic max operation, and this expression here, which is a fairly hefty expression, is the expression they are reducing.
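The reduction mode can be sketched like this; C11 has no atomic max, so a compare-and-swap loop stands in for OpenCL's atomic_max built-in, and the names are invented.

```c
#include <stdatomic.h>

static atomic_int shared_s;   /* the shared reduction variable S */

/* Atomically fold v into *p with max, emulating OpenCL's atomic_max.
   Because max is associative and commutative, the final value of *p is
   the same regardless of the order in which threads' updates land. */
static void atomic_max_int(atomic_int *p, int v) {
    int old = atomic_load(p);
    while (old < v && !atomic_compare_exchange_weak(p, &old, v))
        ;   /* on failure, old is refreshed; retry */
}
```

Each thread would call atomic_max_int(&shared_s, E) on its own expression E, then synchronize at a barrier, after which the master thread folds S into its final result, which is why the kernel's output stays deterministic.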
You might notice that we do use a thread's ID in the reduction operation. We use a thread's linear global ID in the operand to the reduction, and that's actually safe because even though the threads have got different IDs, since they're all contributing to this commutative operation, that's fine. And then afterwards we have a barrier. If I look for atomic_inc, here's an example of one of these atomic sections. So we atomically increment this counter variable; if its old value was one we must be the second thread to have done this, and then we've got some code which has got to be thread-insensitive. Okay. So the second technique we investigated was equivalence modulo inputs testing, which was proposed at last year's PLDI. I think this is a pretty cool technique. So it works as follows. Let's imagine you've got a program in some programming language, and let's suppose that this program is well-defined and deterministic. So we can compile this program, and let's imagine we only have one compiler. We don't have the luxury of multiple compilers to check against each other; we've only got one compiler. There could be a few reasons for this. This could be a new language. Another reason could be that in some domains, and in particular in the GPU domain, some of the vendors we've talked to have told us they're not allowed to benchmark against other GPU implementations. I don't quite understand the reasons for that, but I think it may be to do with the possibility of being [indiscernible] reverse engineering perhaps, but some of the vendors have said they're not allowed to just get the latest tools from some of their competitors and compare with them. So if you are under those constraints, or indeed if it's a new language and you only have one compiler, then you can't do random differential testing across tools. You can do it with multiple optimization levels, but you can't do it with multiple tools.
So the idea of this is: let's say we've got a program P. We can compile it and run it through a profiler with respect to some input I. And because the program is assumed to be well-defined and deterministic, what this profiling will do is partition the statements of the program P into two disjoint subsets: those statements that are covered by the input I and those statements that are not covered by the input I. And I hope you'd agree with me that if we run the program again and again on I we would get exactly the same partitioning, because the program is deterministic. >>: There is no undefined behavior. >> Alastair Donaldson: No undefined behavior, no non-determinism. You get exactly the same partitioning. So what we can do then is, if we call the set of statements that were not touched by I, D, we can take the program P and we can manufacture from it as many programs as we like by messing with D. We could delete some statements from D, or we could mutate some operators used in those statements, or we could add some fuzzed code into D, or we could take some statements from the program and duplicate them into D. We could do anything we like to this I-dead code to make lots of variants of P. And these programs, globally speaking, are completely different programs that for the global space of inputs might give totally different results, but they have the property, by construction, that for I they would give the same result. So we could then compile all of those programs with our one compiler and run them, and if they give different results, let's assume again that they print an integer, if they give different results we know something must be wrong with the compiler. So this is a very smart idea from these guys at last year's PLDI. We wanted to validate whether this would be effective at finding bugs in OpenCL. However, there was a problem. So first of all, equivalence modulo inputs requires the existence of this input-dependent code.
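A tiny illustration of the EMI idea; this is an invented example, not taken from the paper. For the profiled input, the guarded block is dead, so mutating it freely yields a globally different program that must nevertheless agree with the original on that input.

```c
/* Original program P: for the profiled input x == 3, the branch body
   is never executed, so it is I-dead code. */
static int p_original(int x) {
    int r = x * 2;
    if (x > 100)          /* dead when x == 3 */
        r = r - 1;
    return r;
}

/* An EMI variant of P: the I-dead code has been mutated (it could
   equally be deleted, or have fuzzed code inserted). Globally this is
   a different program, but on the profiled input it must behave
   identically; if the compiled binaries disagree on that input, the
   compiler is buggy. */
static int p_variant(int x) {
    int r = x * 2;
    if (x > 100)
        r = r * r + 7;
    return r;
}
```

Compiling both with the single available compiler and comparing their outputs on the profiled input gives a differential test without needing a second compiler.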
If you’ve got a program that doesn't have any code that’s only reachable for certain inputs, then you won't be able to do this manufacturing of the multiple programs. Secondly>>: I don’t understand that. You can always create a program which has a substantial amount. >> Alastair Donaldson: So a second point of this technique is that you can apply it to existing code. So you might have some test cases that were hand written that are already interesting test cases and you can apply this technique to those existing programs. You don't have to be using random programs. So we were looking at trying to apply this to existing OpenCL programs, but our experience from working with these programs is that they don't typically contain that much code that’s conditional on certain input values having particular properties; and the second thing is you would need a profiler to implement this technique and there is no readily available OpenCL profiler. So the problems we found are that a typical OpenCL kernel just doesn't contain that much input-dependent code and, second, there is not a readily available profiler for OpenCL. Now of course we could build a profiler, so this was the fundamental problem: we didn’t want to build this profiler only to find there was no EMI dead code to be found. So our idea was, and you already sort of suggested this in your question, well, we can make up this input-dependent code. We could make programs that have input-dependent code, but if we've already got a program we could give that program extra code that’s input dependent, in the following way. This is a very simple idea but it was pretty effective in our work and there's no reason why this idea couldn’t be applied to other languages. There's nothing OpenCL specific about this idea. So let's imagine that we've got some piece of code>>: I have a question about the previous technique. So I was just imagining that if I sat down and typed for, say, two hours I could come up with five different ways of compiler fuzzing. 
This is one particular technique. So what is it that is ultimately going to distinguish one technique from another? I'm sure that even in companies that have the job of producing commercial compilers there's a lot of compiler fuzzing going on already. What are we chasing here? >> Alastair Donaldson: So I think the goal of compiler fuzzing is to find bugs. The way you find bugs with fuzzing is by pursuing interesting inputs. So I think there are two things. If you can fuzz in a way that allows you to get high bandwidth then it doesn't matter if your inputs are interesting on average. If you can get through loads of inputs, if there are some interesting ones, then you will find bugs with them eventually. So the challenge is to have a method that allows you to very quickly try lots of inputs and to make sure that there are going to be a decent number of interesting inputs in that set. So in Csmith what they were trying to do was to explore as much of the C programming language as they could and try to not be restricted. For example, in Csmith, they didn't restrict themselves to programs with guaranteed termination, because if they did they could easily generate terminating programs but they would have to place a bunch of restrictions to ensure termination, and they feared that those restrictions would reduce the effectiveness of finding compiler bugs. So instead they had programs that may not terminate and they just used a timeout. So the idea with fuzzing really would be, of course there's not really any science in coming up with ideas for making random programs, you can evaluate the thing scientifically, but the evaluation would have to be guided by how effective you are at finding bugs in comparison to other fuzzers: once you fix these bugs, do you then find any more, or do you basically have a bunch of bugs that a given fuzzer can find, and once you fix those bugs that fuzzer just can't find any more bugs even though there are plenty of bugs? 
>>: Maybe I’ll disagree with the kind of overall goal that you pose, because to my mind the goal is not to find bugs. The goal is to find bugs that matter; and I think there's a big difference between the two, which is why there is so much [inaudible] testing in just about every compiler that frankly gets anywhere, because a certain number of programs, websites, what have you are of overriding importance and they matter to, say, the customers and developers or whatever, and as such they really don't want to break the [inaudible]. I think just about every compiler developer will acknowledge that there are dark and troubling corner cases and yes, maybe sometimes they would like to know about these things, but I think at the same time they will acknowledge that there are known problems, it's just that you have to prioritize your time as well. >> Alastair Donaldson: Yeah. So I would say that anyone involved in the real world of software would agree with you that in any large software project you have a priority list and you usually have more bugs than you can fix. >>: So the question is how do you mesh your [inaudible] results approach with [inaudible]? >> Alastair Donaldson: I would say that if you are finding bugs that people are fixing then that's a sign that you are doing something that people think is at least partly useful. >>: [inaudible] fixing it. >> Alastair Donaldson: I suppose. If you e-mail someone privately with a bug report and they say thanks a lot and they say we’ve now fixed it, then you’ve certainly not done any harm. Well, of course they could introduce more bugs by fixing the bug. But I think a way I would say that fuzzers are useful is in trying to get some idea about whether your system is satisfying a [indiscernible] of robustness. So if you run the fuzzer and every hundred thousand programs you find a mismatch, I think you can maybe be quite happy and you may not even bother investigating those mismatches. 
But if you are finding a really large number of mismatches then I think that might suggest that it’s just a basic quality problem and you need to improve that quality. I think it's sort of difficult to give these questions definitive answers, right? Yes, it's definitely true that what [indiscernible] finding bugs that matter, but defining what it means for a bug to matter is another question. >>: [inaudible]. >> Alastair Donaldson: So the idea here is suppose you’ve got an existing piece of code. This is a little OpenCL kernel, actually not a little OpenCL kernel, this is a fragment of a real OpenCL kernel that does breadth-first search. Let’s say we've got this piece of code, then what we can do is add a new parameter to it. This is a parameter which is an array called dead, and this is something we’ve added. Then given that we've added this new parameter we can inject code into the kernel that is dead by construction. So we can add a condition here saying if dead[43] is less than dead[21] then execute this arbitrary code. And if we make sure this condition is going to be false at runtime then this code is dead by construction. So in particular, if we fill the array up with increasing values this condition is guaranteed to be false, this code will not be executed and it should therefore not change the semantics of the program. The key thing is that the compiler cannot know how we're going to invoke this function at runtime, so the compiler must compile this function to work for whatever inputs would give it well-defined behavior, and the compiler furthermore is going to try and optimize all of the code, including this code, and if the compiler gets this optimization wrong it may cause some code that is actually reachable to be mis-compiled. So this is a very simple idea, injecting dead-by-construction code, and this is actually pretty effective and, like I said, there's no reason why this could not be applied to other languages. Okay. 
So what we did experimentally was we took 21 of these device and compiler configurations, so a bunch of GPUs from NVIDIA, AMD, and Intel, some CPU implementations from Intel, AMD, and>>: In particular you can apply this thing, do the previous thing you were inspired by, and maybe it will start finding more bugs. >> Alastair Donaldson: Right. You can and we did do that as well. And an FPGA implementation from Altera, and also the Oclgrind open source emulator. What we did is we applied random differential testing by generating 60,000 kernels. So we have six modes that we can run CLsmith in: the basic mode, the mode with vectors, the mode with barriers, the mode with atomic reductions, the mode with atomic sections, and the mode with everything on at once. So we did 10K kernels in each of those modes. We had to discard some of them due to some bugs we found in CLsmith later on in the study; so this is a very large study with a lot of implementations and several times during this study we had to restart because we found problems in our tools. And I should say right now that I don't believe that these 60,000 kernels are all going to be good kernels. There will certainly be some bugs that we have not discovered in our tools, so I think any macroscopic result we give should be taken as an indication of the quality of the tools we are testing. Of course there will be some cases where we messed up and have generated a nonsense kernel. We applied EMI testing to ten real-world applications. So we took applications from the Rodinia and Parboil benchmark suites and we injected dead-by-construction code into those kernels, and then we also did what you were alluding to, Shaz, which is we did random EMI. So we took a bunch of kernels generated by CLsmith and we equipped those kernels with dead-by-construction code as well. Is that what you were thinking of? So what we did here is we made 180 base programs and then we tried 40 variants for each of those base programs to give us 7200. 
The reason for these slightly funny numbers is actually we had 250 of these programs, but I told you we had to discard some of these programs and we ended up having to discard 70 of our base programs and all the corresponding variants, so we have slightly smaller numbers than we would have liked here. So the first thing to say experimentally was that we discovered a lot of machine crashes. For some of the configurations we tested we had to give up testing, because we found that the programs we were generating were causing not just the program to crash but the whole machine to crash. So I'm going to show you an example of this but I'm going to save it till the end of my talk. >>: Is it because the GPU driver is buggy and it’s just overriding some memory in the OS? >> Alastair Donaldson: That’s our hypothesis, but we don't know. It makes testing very difficult. If you want to run 10,000 tests, run them overnight, what we found is we would set this thing off, go for lunch, come back, and the machine would be dead. So then we would try and log where we got to, because if the machine had rebooted that would be okay, because we can have a script that would say where did you get to, carry on. But sometimes the machine was frozen, so this is pretty difficult. So it seems that a mis-compiled GPU kernel can wreak more havoc than a mis-compiled C program, in our experience. And then it’s particularly interesting because you have these things WebGL and WebCL where you can visit a website and it can make things happen on your GPU. So I was thinking about trying to come up with a website of death that says if you've got an XXX card don't go to this page, because if you do your machine will just lock up. So this was something that made the testing very difficult for us in certain configurations and we basically stopped testing those configurations. 
>>: A mis-compiled C program can also cause the machine to crash, it’s just that that particular C program has to be running in kernel [inaudible], right? >> Alastair Donaldson: That's right. But what interests me is the memory, so I think it has to do with buffer overflows but I don't quite understand it. If you imagine something running on a GPU that’s an accelerator card, then the global memory as I understand it on that GPU is not in the same memory space as your operating system, so I don't understand how something like a buffer overflow could be the issue. And the GPU driver is going to say go, execute the program, but it doesn’t interact with the program as the program executes, so I fail to see how the program doing weird things could cause the driver to do weird things. I don't really know enough about operating systems to know, but I found it very shocking. So I'm going to show you some data about the effectiveness>>: [inaudible] some of the GPU vendors? >> Alastair Donaldson: Yeah. We’re going to do sections. >>: And are the bugs deterministic? >> Alastair Donaldson: Unfortunately no. So we found that it was not deterministic, and actually for one of the vendors in question their latest driver is much more reliable actually than the ones we were testing. >>: A buffer overflow would probably be deterministic. >> Alastair Donaldson: A buffer overflow would be deterministic. Whether the overflow would lead to a crash, that’s less clear to me. >>: Actually the reputation around Microsoft is that these graphics drivers are among the most buggy drivers. >> Alastair Donaldson: So what I'm going to show you here is some data. So what I'm going to look at is, for a given configuration and a given mode of our tool, I'm going to show you the percentage of kernels that produced a result that was not in agreement with the majority result for that particular kernel. So this ignores timeouts, ignores compiler crashes, ignores cases where the kernel crashed at runtime. 
It's cases where the program gracefully terminated and gave an answer and the answer did not agree with the majority. Does that make sense? So I think this is a reasonable way of assessing to what extent you're getting wrong-code bugs from these programs. There are some caveats though. First, some of our programs are likely not well-defined despite our efforts. We don't know of cases where that's the case but it's certain there will be some. The second caveat is that of course it may be that sometimes there's been a compiler that's been the only one who got the answer right, and we're assuming that doesn't happen in these results. The second caveat is more of a theoretical concern, I think. For the first one I’m sure there will be some examples where the problem is with us. So we did 10,000 per mode, except not as many in the atomic sections and all modes due to a bug that I mentioned that we found in CLsmith. So I’ll show you four examples: an NVIDIA GTX Titan GPU, an Intel CPU, a GPU from an anonymous vendor, and Oclgrind, which is an emulator. I'm looking at optimizations either off or on, although in the emulator the optimization setting doesn't do anything so there’s just one column here. And what I'm showing here is there are these six modes: basic, vector, barrier, atomic sections, atomic reductions and all, meaning everything at once. So what I'm saying here is, for instance, that in basic mode on an NVIDIA GTX Titan with no optimizations 0.1 percent of programs that gave a result appeared to give the wrong result. 
So a couple of things we can see from here. First of all, NVIDIA’s implementation seems to be doing well, but you can observe that in all cases but this one, and these two which are a tie, in these three cases you can see that there appear to be more bugs with optimizations on than off, which is perhaps not surprising; you might expect that a compiler would misbehave more with optimizations on than off, although some of the vendors we talked to said that they don't really test the compiler with optimizations off because it's on by default in OpenCL, and if you look at the Intel results you can see this plays out. They didn't tell us that actually, but this appears to be the case here. So with optimizations off you can see that there are quite some problems, and interestingly you can see that with barrier mode on Intel we get a massive increase in the wrong-code bug rate, and I think this is an issue with fuzzing. So the thing about fuzzing is that a relatively easy-to-trigger bug shows up a lot, and we found a particular bug to do with barriers on this version of Intel's compiler and it just appears that we trigger that bug all the time. So I don't think this means that there's something terribly wrong with their compiler or anything like that, it’s just that in this barrier mode the kinds of programs we are generating are very likely showing this same bug over and over again. Of course we didn't reduce that many of these 10K tests because we were doing the reduction by hand. In this anonymous configuration you can see again that no optimizations appears to be giving a higher bug rate in general than optimizations enabled, although that's not true in the basic mode. And one thing to say about the Oclgrind emulator: this is actually a really excellent piece of software from the University of Bristol, but it had a couple of very simple bugs. To give you one example, they had a bug in the way the comma operator in C was interpreted. 
Now our fuzzing tool, based on Csmith, generates the comma operator all over the place, so you can see that this does have a very high wrong-code rate and you might think I wouldn’t use Oclgrind, but that would actually be grossly unfair because it's a very useful tool. We find it very useful in debugging our own tool; it just had a bug with the comma operator, which they’ve fixed now, and if we would rerun these things again with Oclgrind you would see that rate drop way down. So some basic problems can lead to a very high bug rate. And in fact, we're using Oclgrind actively at the moment because we are building a reduction tool to automate this reduction process and we're using Oclgrind as our undefined behavior tester, because it does lots of checks for undefined behaviors and it's a dynamic tool so it scales well on the examples we are using. So a couple of characteristics of the bugs we found. We investigated thoroughly just over 50 problems and we focused mainly on investigating wrong-code problems. So this is not representative of the kinds of problems you see; our focus was biased towards finding wrong-code problems, but there are other things we found on the way. So we found a few frontend bugs, a couple of cases where certain operators in the language were just not supported, and we did find a couple of spec ambiguities. So there were some weird interactions between vector initializers and casts and it's not clear to me what the right rules are from the OpenCL specification, and NVIDIA’s implementation differs from Intel's implementation, for example, so I thought that was pretty interesting. We actually found those because we were generating some code where the code we were generating was sort of ambiguous. We knew what we wanted when we investigated it but we could understand why there was disagreement between the implementations. 
We found loads of compiler internal errors; a couple of compiler timeouts, cases where the compiler actually gets stuck compiling the program; a couple of runtime crashes, although I suspect there are many more cases where the compiled program crashes; and then a number of wrong-code bugs, and we investigated a bunch of further bugs arising from dead code injection. I wanted to talk a little bit about the nature of the wrong-code bugs. So a lot of the bugs we found related to incorrect handling of structs, and this might go back to Ben's point about whether a bug matters or not. So in OpenCL one does not tend to make excessive use of structs; in particular, you would not typically have a struct with a struct inside it with a struct inside it with a struct inside it with a union inside it with an array of length 1024 inside it. These are things you just would not do in OpenCL. Of course the compiler should still compile them correctly, but you might be sympathetic to a compiler developer who puts that quite low in their priority list. We found a lot of very basic bugs related to struct handling, and the reason we found those is, as I mentioned earlier, there are no global variables in OpenCL and we mimic global variables by putting them all in a struct, and then any bug with struct compilation was very, very likely to show up in our fuzzing because structs were fundamental to our infrastructure. But we found some very simple bugs to do with structs, like the struct initializer not matching up with the way structs are indexed, and we found these problems in most of the implementations we tested, and those are the ones that the vendors were very keen to fix. 
We did find some bugs related to barriers, but these were a little bit disappointing in that although they were bugs that did require barriers to be present, they actually didn’t require the barriers to be doing something related to inter-thread communication; in particular we could have cases where the threads would not be communicating at all, they would be using strictly private memory, but the presence or absence of some barriers would be the difference between successful or unsuccessful compilation. It was quite interesting. We don't know why, because these are closed-source compilers. And we found a couple of bugs related to vectors and sadly no bugs related to atomic operations, despite the fact that that was the hardest thing to implement. So I'm going to finish the presentation by showing you a few examples of bugs. The first one I want to show you is pretty simple. This is a bug with the rotate vector instruction. This rotate takes a vector of size 2 and a vector of size 2, and what it should do is take the first component of this one and rotate it by the number of bits specified in the first component of the second one, and the same for the second components. So what do you think the result should be here? If we rotate 1,1 by 0,0 and take the x component of the result, the result should be one, and we are running this with one thread, x is component zero, so let's run this. So we run our launcher tool which launches the kernel and I'm going to say rotate, platform zero, device zero. This is an old CPU compiler on my laptop. So you can see that we get the wrong result here. We get FFFFFFFF, and if you try this with Intel's more recent compiler they’ve independently fixed the problem, so this is not something we reported. They'd already fixed this. And if we look at the assembly for this we see that there appears to be some constant propagation gone wrong in the assembly code. FFFFFFFF is just stored to memory. There’s no rotation going on. 
So that's an example of a small bug. But I would say that it seems like a bug that could wreak havoc in an application that needs rotate. I'm not quite sure what people do with this rotate by the way, rotating bits, but OpenCL has a whole load of intrinsics that come from the group who created the standard, lots and lots of competing demands there. >>: Sometimes we end up. [inaudible], not rotate but [inaudible]. >> Alastair Donaldson: So this is an example with barriers. So I have a kernel, it declares an int X, passes the address of X to H, then it eventually writes out the value of X. What H does is it calls K, and what K does is it does a barrier, no reason to do a barrier, no shared memory going on here, then it calls F and stores the result of F through the pointer P, which points to our X. And what F does is it does another barrier and returns one. So we can see here that this should end up printing the value one, because X should end up being one, and this is executed by two threads. And it needs to be executed by two threads for the bug to show up, although the threads don't communicate. So if I run this with optimizations enabled then we see the correct result 1,1, and if I disable optimizations then we get the wrong result 1,0. And we did look at the assembly code produced for this, but we didn't manage to work out what was going wrong here because it was extremely long assembly code, so we didn't spend the time to dig into that. Okay. So that's just a couple of the sorts of wrong-code bugs we found. So you can see that they're not terribly obscure bugs. These are relatively small programs that you feel should work correctly. So, some ongoing and future work. Some of the vendors we’ve talked to about this work have been pretty interested in fixing bugs in their OpenCL compilers. And some of the vendors we've talked to have been interested enough but have basically said to us that for them OpenGL is a much bigger priority than OpenCL. 
So OpenGL is actual graphics as opposed to this so-called general purpose graphics processing. I guess because I'm very much in the world of [indiscernible] and OpenCL I sort of forget that there are people who actually do graphics. So some companies said it would be great if you had an OpenGL version of this. To me that seems very challenging in a way that really interests me, because in OpenGL everything is floating-point, but it's floating-point with a distinct lack of precise rules for what is acceptable from a given operator. What's acceptable ultimately is what would, A, pass the conformance test suite and, B, be acceptable to gamers or to people who are interested in using the GPUs. >>: [inaudible] conformance? >> Alastair Donaldson: If you want to be labeled as an OpenGL compliant implementation you’ve got to pass a set of conformance tests. So those conformance tests do provide some implicit specification, because the specification of OpenGL as far as I understand it places almost no obligation in formal text on what your floating-point operator has got to do. So if you're a pedantic formal verification person you could say okay, I’m going to make them all return zero, but then you'd fail the conformance tests. So there clearly is some notion of what is acceptable. There's the conformance test and there’s what would stop people buying your GPU, but there is no exact image a particular OpenGL shader should produce. >>: [inaudible] I don’t understand why they won’t just let the market decide. >> Alastair Donaldson: I suppose because they don’t, I don’t know, but my guess would be that they don't want OpenGL to get a bad name. >>: Maybe the market should decide that too. >> Alastair Donaldson: Anyway, the point I'm making there is that there's no precise specification, so you couldn't even do some kind of abstract interpretation or approximative analysis. >>: I would not advocate that even if there was a [inaudible] specification. 
>> Alastair Donaldson: Okay. So that's something that we’re quite interested in working on. Related to that is finding floating-point compiler bugs. So in C there is some notion of what floating-point can and cannot produce if the compiler obeys the standard, but compilers also have optimizations you can turn on, like fast math, where they do more interesting things with your floating-point if you want speed. And then there's this interesting distinction between a result difference due to an aggressive floating-point optimization that's correctly implemented and that the user asked for, and a compiler bug, and how do you tell the difference. I find that quite a challenging problem. Also compiler fuzzing on nondeterministic programs that use the rich set of OpenCL atomics that are related to C++11 atomics. That seems like an interesting challenge. And then, more pragmatically, being able to automatically reduce these bugs and rank them would help us make more progress in this work and try more interesting strategies, and be able to evaluate better than we have done which things are working well and which things are not working well. So I want to conclude by showing you this little OpenCL kernel, and this is doing a reduction. This loop here is a reduction loop and it has a misplaced barrier. This barrier is supposed to be there but it’s inside this conditional. >>: [inaudible] but>> Alastair Donaldson: This program has got no semantics because the barrier is in a divergent location. So if you think that bugs in GPUs don't matter then>>: You’re going to crash your machine now? >>: He’s going to melt his laptop. >>: Really? Step away from the podium. >>: Are you going to show us that your computer froze? >>: It’s not frozen yet. >>: Is your cursor frozen? >> Alastair Donaldson: It's frozen. >>: Try typing. >> Alastair Donaldson: Come up here and have a go if you want. So on that note, thank you very much. >>: Is that a realistic bug? This happens all the time. 
>> Alastair Donaldson: So what happens all the time is that these things freeze. So I discovered this unfortunately when I was teaching at the UPMARC summer school two days ago and I was trying to show that you get a weird result with barriers, and I crashed my machine and had to start using the blackboard for the rest of the lecture, because I find that this machine doesn't reboot very well. >>: [inaudible]. >> Alastair Donaldson: So then after I gave a lecture on the blackboard I did manage to reboot it, but then it wouldn't project properly. But then actually, look, so I did find that in this case it appeared the display had crashed, and then it appears Windows got the display back, so that's what I thought the problem was. But then actually this morning I was trying to prepare this crash to show you and I waited for about 10 minutes and the display didn't come back. So I don't know whether the machine had crashed or whether the display would have eventually come back. So I maybe exaggerated when I said that this crashed my laptop. I’m not exaggerating when I said it crashed these other machines. They definitely did crash. I suppose if this would lead to a long-running computation on the GPU, if this barrier is in the wrong place and it's maybe leading to the thing getting stuck in an infinite loop, you could imagine that could hog up all the resources that are doing rendering. This is the same GPU that's rendering things to my screen, so you can imagine that could be the reason. Okay. So anyway, that was a badly defined program, but imagine your compiler mis-optimized and gave you that program. All right. >>: Thank you.