>> Shaz Qadeer: It’s my great pleasure to welcome Alastair Donaldson again to MSR. Alastair is a
professor, I don’t know, that’s not the title actually, a lecturer, lecturer in the Department of Computing
at Imperial College, London. After doing his postdoc at Oxford University, he spent a few months as
a visiting researcher at Microsoft and he started the project called GPUVerify for verification of GPU
kernels, back at that time in collaboration with people here. And since then he has made a lot of
progress on that project, along with colleagues at Imperial and he is going to tell us about some of that
work today.
>> Alastair Donaldson: Thanks Shaz. Given it is such a small audience, let’s make it really informal and
I’ll just tell you about what we do and please ask as many questions and yeah, we can, we don’t have to
get through the whole deck of slides if you have more questions, [inaudible].
So, this is joint work with Shaz, initially just Shaz when we were collaborators here at Microsoft, and
then since then I recruited various PhD students and postdocs, none of whom have GPUVerify as their
principal thing, but all of whom have done some hacking on it. There's this long list of contributors, and
the project is supported by the CARP project, which is an EU funded project, Correct and Efficient
Accelerator Programming, which I'm coordinating at Imperial, and the project's kind of split between
performance optimizations and correctness checking, and we are focusing on the correctness checking
side of things.
So, these are the guys who contributed to the project. I have three postdocs, Jeroen, Adam, who is a bit
shy as you can see and John, and then some PhD students, Paul and Nathan who have worked a lot on
the project and Dan who just started recently, and a new student, Pantazis, who is starting in July. And
I’m actually looking for one more PhD student so if you know any bright, young candidates who would
like to live in London, please send them my way.
And generally the aim of the research in our group, the Multicore Programming Group is to design
automated techniques to help people write correct and efficient parallel software. So I’m particularly
interested in concurrent programming, partly because I think it’s cool to try to get things to go faster,
but partly because of the correctness challenges it raises. And actually I think that from a correctness
and verification point of view, it is very hard to verify general properties of arbitrary sequential
programs, whereas if you take a parallel program and look for things like data races or deadlock
freedom, those can be easier properties to give people useful tools for, rather than trying to tackle the
verification of more general properties.
Okay, so I’m going to tell you about our work on verification of GPU kernels, which I think is an
application of the idea that’s becoming quite popular these days of trying to analyze concurrent
programs by somehow converting the problem to a sequential program analysis task.
So first of all, let me tell you a bit about what a GPU is. A GPU is a graphics processing unit and I'm going
to give a schematic overview of what a typical GPU looks like. Nothing that I say here is going to be
completely true of all GPUs. So if you know about GPUs, then you know that it's not [inaudible] true,
but. A GPU generally consists of a number of processing elements which you might like to think of as
cores, although they’re typically a bit simpler than CPU cores. And every processing element has a small
amount of memory that it has exclusive access to. Then there are a number of these processing
elements on the GPU and the processing elements are arranged into groups, such that every group has a
portion of memory that’s shared among all the processing elements in the group. So, these guys can
communicate with each other through this group-shared memory but they cannot communicate with
those guys through each other's group-shared memory.
And then in addition there is a pool of global memory, which all the processing elements can share, and
to some extent processing elements in different groups can communicate through this global memory.
However, in typical GPU designs, there is a mechanism for processing elements in the same group to
synchronize with one another, but no mechanism for processing elements in different groups to
synchronize with one another. So this global memory is not really used for inter processing element
communication. It is more used to actually get data from the host device and give data back to the host
device. Make sense?
>>: I have one question.
>> Alastair Donaldson: Yeah
>>: Do GPUs typically provide interlocked operations?
>> Alastair Donaldson: Like, such as?
>>: Like [inaudible] those kinds of things?
>> Alastair Donaldson: Well yeah they do, yeah, and that's actually something which we want to look at
next in our work; it's quite challenging from a verification perspective. But the one problem is that
there's not really a consensus on which atomic operations the various GPU programming models provide.
The OpenCL spec, for instance, has a bunch of atomic operations that it specifies.
Okay, so a GPU accelerated system would typically consist of a host computer, like a multicore PC. Here
I am showing you maybe an eight core PC and a plug-in card, and what happens is the host is
responsible for copying both data and code into the global memory of the GPU, and the code is a
function called a kernel function, and this has nothing to do with OS kernels. I sometimes get people
asking me questions about [inaudible] we have different meaning of the word kernel, or the same
ultimate meaning, but a different specific meaning.
And the host then says to the GPU, go invoke the kernel and what this does is it lights up all of these
processing elements and what they do is they copy data from global memory into group-shared
memory, from group-shared memory into private memory. They crunch through it, eventually copy it
back to global memory, and when they’re done, the host is interrupted and it can copy back the results
for further processing. So in a typical GPU accelerated application you might just have some preparation
code on the host, one kernel invocation, and then some processing code, or you might have a sequence
of kernels in a pipeline, or you might have something like a loop with a kernel invocation inside it if
you’re doing an [inaudible] algorithm where you have a number of time steps. A common thing would
be to have a timing loop on the host, and then for every time step you do some calculation and then do
again and again and again. Okay.
So, a serious problem when programming GPUs is the problem of data races, which are well known
from regular concurrent programming. So a data race occurs when we have got two processing
elements or threads running on two processing elements in the same group that access a location in
group-shared memory and at least one of these accesses is a write. And this is called an intra-group
data race. And we can also have an inter-group data race where we have threads running on processing
elements in different groups and they access a global memory location, at least one access is a write,
and there's no synchronization operation separating them. This is an inter-group data race and we can
also have an intra-group data race on global memory, which I’m not showing you in this diagram.
So, data races lead to all kinds of problems, and I think there's something particularly interesting in the
GPU context. It's very well known that data races mainly bring problems of non-determinism, but in
GPU kernels there is, I think, a worse problem, which is that you may actually have device determinism.
So on a particular GPU architecture, if you know a bit about how GPUs work, you
know that they’re kind of deterministic, so actually threads don’t get scheduled by an OS and you don’t
have preemptions and that kind of thing. Threads get scheduled by a driver and they get scheduled in a
very deterministic way on a given GPU. So it may well be that you’ve got a kernel that could exhibit a
race but never does on [inaudible] architecture X.
And then if you port that kernel to another architecture, you may then discover there’s a problem, and
there are programming models such as OpenCL which aim to be portable, so that actually kernels get
compiled at run time for whatever architecture is available. So in that setting you don’t necessarily
know what your customers are going to be running your kernel on, so even if you've tested it on a range
of architectures and discovered no data races, it may be that you cannot discover data races by testing
alone, and yet on some architectures there would be data races.
And another thing to point out is that data races in GPU kernels are almost always accidental and
unwanted. So we don't have the case, as in systems code, where there are deliberate benign data races.
What I have seen are benign data races where for instance many threads write the same value to a
location. That happens. You sometimes have data races where, for strange but actually good reasons,
a thread is going to write something but is guaranteed to write what's already there while another thread
might be reading, in which case that doesn't matter. But I haven't seen examples where we have got,
for instance, synchronization primitives being implemented by busy-wait loops in GPU kernels.
That is not something that would be very efficient and it's not something that would be portable across
architectures either.
So what we’ve been doing in our work is looking at data race analysis for GPU kernels. But let me tell
you first of all how you would avoid data races in a GPU kernel. I’ll also show you an example kernel. So
this is a little kernel written in OpenCL. We have a regular C function which we prefix
with the keyword kernel. This is an entry point to the kernel. This is where threads commence
execution.
And the kernel is going to declare that it takes an array of [inaudible] as an argument and then this
array, the contents are going to reside in local memory, which is in OpenCL what group-shared memory
is called. So local means group-shared.
And then also this kernel is going to take an int offset as another argument. And what the kernel is
going to do is, every thread is going to write to its thread ID. And what it's going to
write is what is already at its thread ID, plus what's at its neighbor's thread ID, offset places away. So it's
going to write A[tid] + A[tid + offset].
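(A minimal sketch of the kind of kernel being described, in OpenCL C; the function and parameter names here are just illustrative:)

    kernel void add_neighbour(local int *A, int offset) {
        int tid = get_local_id(0);          // this thread's ID within its group
        A[tid] = A[tid] + A[tid + offset];  // reads A[tid + offset], which another thread may be writing
    }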
Okay, so spot the data race. I’ll grab my coffee while you think.
>>: Well, you mean everyone is running in parallel right?
>> Alastair Donaldson: Yeah
>>: Everyone is reading tid plus offset but that’s someone else’s tid right?
>> Alastair Donaldson: Exactly right. So if offset was one for instance and tid was zero, then thread zero
would be writing to zero and reading from one, potentially in parallel with thread one writing to one. So
this would be a read/write data race. Okay.
And we can avoid this data race by using a barrier so we can for instance read A[tid] + offset into a
temporary variable, then we can do our write using temp instead of A[tid] + offset, and then we can
have a barrier synchronization statement in between these statements. And what barrier says is that
every thread run in the kernel must get to the barrier before any thread leaves the barrier. And
furthermore, that all loads and stores to memory will have completed before any thread leaves the
barrier.
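(The race-free variant being described would look roughly like this, continuing the illustrative kernel from above:)

    kernel void add_neighbour(local int *A, int offset) {
        int tid = get_local_id(0);
        int temp = A[tid + offset];     // read the neighbour's element into a temporary
        barrier(CLK_LOCAL_MEM_FENCE);   // every thread reaches this point before any thread proceeds,
                                        // and all outstanding loads and stores have completed
        A[tid] = A[tid] + temp;         // the write can no longer race with the reads
    }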
Now actually, like I mentioned briefly earlier, it is only possible for threads within the same group to
synchronize with each other, so a barrier is a synchronization operation between threads in the same
group. For the rest of the talk, I’m just going to assume all threads are in the same group for
explanation purposes. Although in the tool, in theory we deal with the general case.
Okay, any questions at this point about GPU’s, kernels and [inaudible]? Oh yeah, so it stops the accesses
from being concurrent. Okay?
So, there's been a lot of interest over the last few years in verification and analysis for GPU kernels. The
leaders in this area were the group of [inaudible] at the University of Utah, who have a tool called PUG
for analyzing CUDA kernels, which was published at FSE a few years ago. These days they are more
focused on a tool called GKLEE, which uses dynamic symbolic execution. It's based on the KLEE
execution engine developed at Stanford and Imperial College. And an interesting thing they've done
recently is extend their GKLEE tool to handle atomic operations, which is a very nice piece of work. The
University of Trento, collaborators on our CARP project, are looking at using separation logic with
permissions to prove data race freedom of GPU kernels. The idea here is that you have a permission
logic and you prove a kernel by showing that a thread can only write somewhere if it has write
permission. So it’s a nice application of separation logic.
There is a nice paper from ESOP a couple of years ago about the SIMT model that's used for CUDA
kernels, and yeah, another partner in our CARP project is looking at doing symbolic execution of GPU
kernels. And finally, aside from our work, there was a paper about using test amplification at PLDI last
year. The idea of this work is that you actually dynamically run a kernel and check for one trace whether
there were data races, and then you use some static analysis to try to discover whether that trace was in
any way influenced by inputs to the kernel. And if it wasn’t, you can conclude that the kernel is free
from data races.
>>: I have a question [inaudible]. Do they actually run the kernel or do they have a simulator?
>> Alastair Donaldson: I believe they did it with a simulator because it’s quite difficult with a, ah, it’s
very difficult to do logging on a GPU. Yeah, okay. Yeah, we published our work at OOPSLA last year. That
was the main paper about GPUVerify, and a more technical paper about some of the recent
developments in the tool appeared at ESOP this year.
So GPUVerify is a tool for verifying data race freedom, which I have described to you, and another
property, barrier divergence freedom, which I will talk about briefly later, for OpenCL and CUDA
kernels. So CUDA is a GPU programming model from NVIDIA, who are the market leader in GPU devices,
and OpenCL is a more general programming model that's been put together by the Khronos Group, a
consortium consisting of a bunch of partners, including, I would say, pretty much every major player apart
from Microsoft, I think it's fair to say. Microsoft has C++ AMP, which is another accelerated massive
parallelism [inaudible] which is different again. And, yeah, we decided to focus on both of these programming
models because they are very similar. We'd rather just focus on OpenCL I suppose for simplicity, but
CUDA is more widely used still. Hopefully that won't be the case in a year or so, but it is at the moment.
So before I go into details, I’m going to give you a demo of GPUVerify to give you a feeling of what it
does. Please just interrupt me if you have any questions.
So I'm going to write a little kernel to perform a reduction operation. What this is going to do is it's
going to take an array of ints in local memory, and I'm going to have the threads sum their neighbors'
elements, doing a tree reduction. So a thread will sum with a neighbor: if there are N threads, the
neighbor N over two places away, then N over four places away, then N over eight places away, then N
over sixteen places away, and at every iteration of the reduction half the threads will drop out of the
computation. This is a common thing to do in a GPU kernel to collate results. So I'm going to say for int d
equals N over two, where N is the number of threads, while d is greater than zero, and I'm going to
divide d by two by shifting it right by one. I'm going to say if my ID is less than d, so if I'm still active,
then A at my ID is incremented by A at my ID plus d, my neighbor d places away. I'm going to put some
defines here because there is no tid actually in OpenCL; I'm going to define tid to be get_local_id(0).
This is a built-in function that gets a thread's local ID in the 0th dimension. So these kernels can be
multidimensional. I am not going to go into the details of that here. Let's
assume the kernels are one dimensional, and N is going to be the number of threads in dimension zero.
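(The kernel being typed in the demo is roughly the following sketch; the defines mirror what is described, with N standing for the number of threads in dimension zero:)

    #define tid get_local_id(0)
    #define N   get_local_size(0)

    kernel void reduce(local int *A) {
        for (int d = N / 2; d > 0; d >>= 1) {
            if (tid < d) {
                A[tid] = A[tid] + A[tid + d];   // sum with the neighbour d places away
            }
            // no barrier yet, which is the source of the races reported below
        }
    }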
Alright, so assuming I have not made any syntax errors, the tool should do something on this. So first of
all the tool will complain and it will say that the [inaudible] group size must be specified using local size.
So what we're not trying to do in this work is parameterized verification. We're not trying to prove that
these kernels are correct for any number of threads. That of course would be a nice thing to do, but
kernels are not usually correct for any number of threads. They are usually correct for, say, every
number of threads that's a power of two, or every number of threads with some property. And second,
the theorem provers that the [inaudible] work with don't deal very well with non-linear arithmetic, and
we very commonly do something like multiply a variable by the number of threads. So if the number of
threads is a constant, that's okay, but if it's not a constant, that would be very hard to reason about. So,
GPUVerify, local size, let's try it with 1,024 threads, and similarly you have to say how many groups
there are. I'm just going to consider one group here. So, the tool will think for a minute. I tend to find
that when I first run a C sharp application it takes a while. Okay, so it has reported a couple of possible
data races. So it's saying that in kernel.cl there is a possible read/write race on the array A, at byte
offset four. So we see A cast to a character pointer and then byte offset four, and this is a read at line 9,
column 25 by thread zero of group zero, and at line 9, column 15, a write by thread one of group zero. So
if I go back to the example, then line 9, column 25 I think is this read here and column 15 is this write here.
Okay and you can see that because I have not got any barrier synchronization, there’s actually nothing
to stop one thread skipping ahead to a further loop iteration and interfering with another thread in a
previous loop iteration. So, to eliminate this problem, I can put a barrier synchronization in here. So,
barrier and then I give a flag to say this CLK local [inaudible] this is a way of saying I want to do a barrier
on local memory, so because this array is a local pointer, then that’s the right thing to do. Okay? So
does this look good? Yeah? Really?
Okay, so now what the tool is going to say is that there's one error: line 10, column 13, barrier may be
reached by non-uniform control flow. So with this barrier statement here I made a deliberate error, and
what I did was I enclosed the barrier statement inside this conditional, which means that some threads
will reach the barrier, but not all threads. And this is illegal in the OpenCL programming model. If you
have a barrier in conditional code, either all threads or no threads must reach the barrier. In other words
the condition guarding the barrier must be uniform. Okay, so the tool has detected barrier divergence
here. When I say detected, I'm running the tool in verify mode so it's not actually detecting anything,
it's just not managing to prove the code; or you can run the tool [inaudible], it will unwind the program.
So if I fix this problem, then what we should see now is some success. So the tool will tell us that there
were no data races within work groups, no data races between work groups, no barrier divergence, no
assertion failures, because you can write your own assertions in GPUVerify. Of course, no warranty
provided because this is a research prototype. Okay, I [inaudible]
>>: [inaudible]
>> Alastair Donaldson: It is intended to be sound, yeah, we're trying to do sound verification, but there
are various ways in which we're not sound. For instance, we made the pragmatic assumption that the
pointer parameters to a kernel point to disjoint arrays. That, I actually have a feeling that that may be
required in the spec. I should look that up. But even if it's not required, that's what people do. I mean,
if we didn't make that assumption we would just report data races everywhere. So that's one thing.
We don't do bounds checking, so it's possible you could have a buffer followed by another buffer, and
then you could overflow the bounds of that buffer and have a data race as a result, whereas we would
see writes to different arrays and we would say that they were race free. So there are various ways in
which, you know, we're sound [inaudible] a whole bunch of provisos.
Okay, and you know we can write some assertions, so I might say here: assert is-power-of-two of d, and
I could say x is a power of two if x bitwise-ANDed with x minus one equals zero and x is not equal to
zero. So a power of two is a binary number with one bit set. You can test that by saying, if you just
subtract one you get all the lower bits set, and if you AND them together you should get zero. Zero
satisfies that property too, and zero is not a power of two, which is why we also require x not equal to
zero. Okay, so now I could run GPUVerify on this and, ah, you might notice that I made a slight tweak to
the kernel to make this not hold.
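(The assertion added in the demo is roughly the following; the macro name and the exact assertion syntax are illustrative:)

    // A power of two has exactly one bit set, so x & (x - 1) == 0;
    // zero also satisfies that, hence the extra x != 0 conjunct.
    #define is_pow2(x) ((((x) & ((x) - 1)) == 0) && ((x) != 0))

    for (int d = N / 2; d > 0; d >>= 1) {
        assert(is_pow2(d));               // checked by the verifier on every iteration
        if (tid < d) {
            A[tid] = A[tid] + A[tid + d];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }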
>>: [inaudible]. Was there a loop there?
>> Alastair Donaldson: There is a loop there, yeah. But we have to have loop invariants to do all that
proof right?
>>: Yeah.
>> Alastair Donaldson: Yeah, okay, so, assertion might not hold for thread 544, group zero. And then
you might think, hang on a minute, is that problem specific to thread 544? So we could say here
something like assume that tid is less than ten, for instance, which is just something you could do as a
programmer to rein in the verifier a bit. Okay, so it should complain there about thread 8. So it's using
a constraint solver behind the scenes and we have just told the constraint solver I want you to find me a
thread less than ten. Obviously you have to be very careful, because if we do something stupid like that,
then the kernel of course will become correct all of a sudden. Okay, but anyway, what I wanted to show
you was that now we could run the tool in bug finding mode, because as Shaz said, we need loop
invariants to prove correctness of these kernels and it may be that the kernel is correct, but we didn't
find the loop invariant.
>>: Can you sometimes [inaudible] loop invariant on the C level?
>> Alastair Donaldson: Yeah.
>>: C level [inaudible]?
>> Alastair Donaldson: Uh huh. So I could say find bugs and then loop unwind equals four for instance,
and oh, wow [laughter]. Okay, let’s not do that.
>>: [inaudible] So right, now if I said find bugs and I have 1024 threads, how many threads is it going to
use? Did you specify in the command line how many threads to use with [inaudible]?
>> Alastair Donaldson: Yeah, so I say 1,024 threads. But the tool is only ever going to consider two
threads, which I'll come to in a minute, and to make that sound we use this abstraction. So when I say
find bugs, it's going to be finding bugs not employing any kind of loop invariant abstraction, but it will be
employing some abstraction still, so these bugs could still be false positives due to that abstraction.
>>: But will it be, like, say, some kind of context bounding?
>> Alastair Donaldson: Ah, no it’s not. Let me get on to that. That’s the next part of the talk. Okay, well
I, anyway, I would have been able to show you something is a bit wrong. Now I would have been able to
show you that actually because I said greater than or equal to zero, this really is a bug and we’d find it if
we [inaudible] enough and if we, but what I can show you is that if we just say greater than zero, then
the tool should be able to verify this. Oh man, something is seriously wrong. Okay. I’ve been working
on this, adding new features, so never do that before a talk.
>>: [inaudible]
>> Alastair Donaldson: Oh right, oh right, oh right, right.
>>: You actually have found that bug, right? [inaudible] say ten or something to get all the way -
>> Alastair Donaldson: Yeah, it wouldn't hurt. But I have now fixed the bug and -
>>: So now it should be inductive, right?
>> Alastair Donaldson: Yeah.
>>: [inaudible]
>> Alastair Donaldson: Yeah but it should get this.
>>: [inaudible]. But that should be inductive, right? Because you can take a power of two and divide it
by two and it’s either zero or another power of two, right? So it should work.
>> Alastair Donaldson: Yeah.
>>: Can you, can you get out the actual variable values? When, when an induction fails?
>> Alastair Donaldson: Ah, so, no, we, yes in principle, because it just, yes in principle, but no not right
now. Okay, I’m now going to try and debug this. I know, it’s kind of frustrating because –
>>: We know how this goes.
>> Alastair Donaldson: Yeah. Alright, so let me tell you now about the verification strategy behind the
tool. So the first thing is, what we’re trying to do in this project is actually exploit the simplicity of the
GPU programming model to come up with a very efficient verification method. So the first thing we
exploit is the fact that data races always occur between a pair of barriers, so barrier one and barrier two,
alright? So we have a barrier-free region of code and we can have a race between statements of
different threads in this region, but we can't have a data race between, say, a thread doing something
up here and a thread doing something over there in a different region, because they can't be at these
places at the same time. So this immediately makes the program analysis problem easier, if we can restrict attention
to barrier-free regions of code. This is something that the PUG approach also exploits.
Okay, the next thing to observe, which is something that was kind of new to me, but actually turns out
to be a fairly well known thing, is that when you are doing data race analysis, you can cut down
[inaudible] the number of schedules you consider if you are guaranteed to abort on a data race. So,
actually between a pair of barriers A and B, we can pull the following trick. We can run thread zero all
the way from A to B and log all of the accesses that it makes. Then run thread one all the way from A to
B and log all its accesses and also check all of its accesses against those of thread zero. And if we find
that there is a problem between these we abort immediately. Otherwise you run thread two all the way
from A to B, log all of its accesses and check them against threads zero and one, and abort if we find a
race. And we keep doing this until we finally run the final thread from A to B. I guess we don’t need to
log all of its accesses, but we do need to check them against those of all the other threads and abort if
we find a race.
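(In pseudo-C, the serialized schedule just described looks roughly like this; run_thread_from_A_to_B and NUM_THREADS are placeholders:)

    // Between two barriers A and B, run the threads one after another.
    // run_thread_from_A_to_B(t) executes thread t's code for the region,
    // logs every access it makes, checks each access against the accesses
    // already logged by threads 0 .. t-1, and aborts on a conflict.
    for (int t = 0; t < NUM_THREADS; t++) {
        run_thread_from_A_to_B(t);
    }
    // If no race is found, the state at barrier B is the same as under any
    // other schedule, because threads can only influence each other by racing.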
So, the thing to observe here is that if you think about it, a data race always occurs between some pair
of threads. So with this schedule, if there could be a data race between some pair of threads, between
these two barriers, then this schedule will find the race because of the logging and checking. And if
there’s no race, then actually this schedule will lead to precisely the same state here as any other
schedule would have. If you think about it, you might think well, that’s not true because a different
interleaving might have resulted in different interactions between the threads, leading to a different
state. But the threads can only interact by racing with each other, and if they race, we abort. Does
anyone disbelieve that? Or do I need to clarify it further? I think it's -
>>: So essentially, no one can, between barriers, no one can read anything that's been written by anyone.
>> Alastair Donaldson: Since the barrier, yeah.
>>: Right, so they have to be independent.
>> Alastair Donaldson: Yeah, so it's exactly as you said: well, they can, but in that case we're going to abort.
>>: Well, right, so I mean so you don’t even have to know the order of reads and writes from the
threads when you check the logs, right, you just have to say, did anyone read some –
>> Alastair Donaldson: Yeah, that’s right. We can record these as sets, not sequences.
>>: [inaudible]
>> Alastair Donaldson: Yeah, ok, so, a pretty straightforward trick. And this immediately reduces us to
the problem of sequential program verification. We can rewrite a GPU kernel as a sequential program
where we serialize the threads in this schedule or in fact any other schedule we choose. We could take
a round robin schedule.
>>: Or in particular what it means is that you have a very simple sequential [inaudible].
>> Alastair Donaldson: Yeah.
>>: As opposed to the rather more complex case if you allow the threads to interact.
>> Alastair Donaldson: Yes, that’s right, yeah. Okay, so this avoids reasoning by interleavings, which is
good. Or we can actually do better and we can observe that data races occur between just pairs of
threads, so what we can do is we can pick an arbitrary pair of threads I and J, inside this region. Sorry,
inside the range of threads. And now we can consider barriers A and B and we can consider running
thread I from A to B and logging all of its accesses and then running thread J from A to B and checking all
of its accesses. Now because I and J are arbitrary, I am not, by the way, talking about choosing a specific
pair like thread one and two. I’m talking about taking, considering every possible pair. If we can show
for every possible pair, that between these barriers they can’t race, then there can be no race between
any threads for the barrier region. Right? So, in some feedback on our work, we’ve had comments that
this is quite similar to ideas in protocol verification, where you pick a pair of processes and you make the
other process abstract. I think that there are similarities but the key difference here is that we actually
have to consider all of the pairs because these kernels are not symmetric, even though the threads run
the same program, they do not have to do the same thing. They can follow different control flow.
>>: [inaudible] symbolic constants.
>> Alastair Donaldson: Yeah the symbolic constants, that’s right.
>>: So, the solver will consider all possible values.
>> Alastair Donaldson: Yeah, exactly. So, if a data race exists, then some choice of I and J will expose it
and if we can show for all I and J, that this little program fragment is free from data races, then we know
that there can be no data races if we did have all the threads executed.
>>: But on the other hand if you don’t, and if you have the data race, it can be a false one simply
because you started it in [inaudible] state.
>> Alastair Donaldson: So that's the, yeah, if this is the beginning of the kernel and we had a
precondition on the entry state, then up to the first barrier we could be precise, right? But this brings
me to my next slide, which is: is this actually a sound thing to do at all? So say we had barrier A, barrier
B. We pull this trick of running thread I and then thread J and checking for races between them. And
then we have barrier B to barrier C and we pull the same trick again. Well, if you think about it, this is
not a sound thing to do, because it would appear that the other threads just don't exist; these two
threads would never see the world changing around them. But the point of a barrier synchronization is
that now we can see what the other threads did and we can safely read it. If we don't model those
other threads and we just continue, then, you know, you might have an array that is all zeros, and the
threads are going to set it to ones, so it should be all ones when we reach the barrier. But we are only
going to see it become one in two places and carry on. And then the analysis is going to be nonsensical.
So this is not sound on its own. To make it sound we have to make the shared state somehow abstract,
to model the possible effects of all the other threads.
So there are two things that we have explored here in our initial work, and something more
sophisticated recently. The simplest idea is to make the shared state completely arbitrary. There are
two ways of doing that. One is you can havoc the shared state every time you reach a barrier. Another
thing you can do is you can actually just remove the shared state completely and treat all reads as
nondeterministic reads. So if you read into a variable, you just havoc the variable, and when you write
to the shared state you simply, you log where the write would have gone for race checking purposes,
but you just remove the actual write statement. So we have both options. The second option has the
advantage of, well, it avoids the need to give the theorem prover arrays to reason about, which can
lead to better efficiency. But it has some disadvantages as well, which I won't go into
right now. But I can tell you about them later if you're interested.
Alright, so the GPUVerify verification strategy is to exploit the "any schedule will do" trick, and the "two
threads will do" trick, with some abstraction to make the whole thing sound, and also with predicated
execution, which I will come to in a little while if I have time, to turn a massively parallel kernel K, so a
program that we want to consider for potentially thousands of threads, into a sequential program P that
is linear in the size of K, the text size of K, such that if P can be proved correct, by which I mean that no
assertions can fail in P, and I mean partially correct here, not totally correct, then K is free from data
races and barrier divergence. So this is the meta-theory behind our approach.
>>: But if you are going to consider only two threads, then you don’t really need predicated execution,
right? Because you can just copy the kernel twice.
>> Alastair Donaldson: Not if they step into procedures and that kind of thing.
>>: Oh yeah, I was forgetting about that. And there could be loops there also.
>> Alastair Donaldson: Exactly, yeah. So, I will come into predicated execution shortly and we can
describe things. How am I doing for time? I don’t have a watch on me.
>>: Oh, ah
>> Alastair Donaldson: Oh there, okay, cool. Okay, so let me tell you briefly about the tool chain
architecture. What we do is we take an OpenCL or a CUDA kernel, and in future we would like to
consider C++ AMP kernels, but they are more challenging because C++ AMP has this nice thing of
being a single source solution, where you write a C++ program and there is special use of templates to
describe that you want some piece of code to be accelerated, which is a bit like saying it should be run
as a GPU kernel, although AMP is in principle more general than that. This makes it quite difficult for an
academic tool to parse. So these are easier targets. We use the CLANG and LLVM compiler framework,
and particularly the CLANG front-end, to turn the kernel written in one of these programming models
into a Boogie program, a sequential Boogie program. So what we actually do is we parse the kernel
and turn it into one kind of Boogie program and then we have this kernel transformation engine that
applies all our tricks to produce a sequential program to be verified. And then we give this to the Boogie
verification engine developed here at Microsoft Research, and Boogie uses an SMT solver, principally the
Z3 SMT solver, although we're looking right now at support for the CVC4 solver as an alternative, which
has a more industry friendly license.
And then for verification to work we have to generate candidate loop invariants and procedure pre- and
post-conditions. Although, GPU kernels don't allow recursion currently, and they're not that big; the
programs tend to be hundreds or maybe a thousand lines of code rather than tens of thousands of lines
of code, so far we have found it's always better to do full inlining than to try to actually infer contracts.
Sometimes we can infer contracts, but actually it's more expensive to do the inference than to just
inline everything, and often the inference will fail. So really our effort has been on candidate loop
invariants. And a good thing about this setup is that CLANG is extremely widely used. It's being used by
almost everybody these days. Boogie is very widely used in the verification community, and Z3 I suppose
is even more widely used, because I would say that every verification tool at the moment seems to use
Z3, and then I think there are a whole bunch of other uses as well.
So, these things are being improved all the time. Other people fix bugs in them, and the only magic of
our approach, where we have to actually be really careful we don't introduce unsoundness, is in this
component here. So we can put all of our brainpower into making this correct and rely on those other
things being as correct as can be expected from complicated [inaudible] software.
So yeah, the soundness of our approach is much easier to argue than it would be had we built some
complete verifier for kernels where we actually did the verification condition generation in some smart
way ourselves.
Okay, so now what I want to do is show you an example of how we take a kernel that doesn’t have any
loops or conditionals and do this two threaded transformation. I’ll go through this reasonably quickly
because it’s fairly straightforward. So this is an OpenCL kernel. And what we do is we generate a
sequential program. I’m going to show you in C form here. It would really be a Boogie program.
[inaudible] And we have a precondition. We introduce two symbolic constants tid$1 and tid$2. And
we have a precondition saying that they are in the range of thread ID’s that are between zero and N.
And that they’re different from one another. So this is our way of considering two arbitrary distinct
threads. We have the symbolic constants tid$1 and tid$2. And in what follows, $1 and $2 are going to
be used to indicate the version of a variable for the first thread or the second thread. And just to
be clear, I’m not talking, I’m very much not talking about threads 1 and 2. I’m talking about the first and
second of the two arbitrary threads under consideration.
Okay, what we do is we take the parameter idx and we say that every, that each thread has its own copy
of this parameter. And we add a precondition saying that these copies are initially equal because the
kernel gets invoked with parameters being passed by value and every thread receives the same value for
parameters. So the threads could in principle change these parameters later, which is why they need
their own copy but initially the values will be the same.
We just remove this array A, because I’m going to show you this abstraction which we call the
adversarial abstraction where the shared state just disappears completely. Then X becomes X for thread
one, X for thread two, Y becomes Y for thread one, Y for thread two. Now, this read from A at tid plus idx
into X turns into the following. First of all we log that a read has occurred from A and we log that
thread one was the reader. Just thread one. We’re not going to consider thread two reading, only
thread one. So, tid$1 plus idx$1, that’s the offset for the read. And then we check that thread two
reading from A of this offset is okay with respect to any prior reads or writes that have happened, which
in this case would just be that read. Okay, but this is the general translation. So we do logging for
thread one and checking for thread two. And then to reflect the fact that X will be modified by reading
from the shared state, we havoc X for both of the threads. So both threads are doing the read, but
we’re logging it for the first thread and checking it for the second thread, but we have to reflect the fact
that the read happened for both of them so we havoc both copies of X. And this models the fact that A
could have been changed arbitrarily by other threads that may have even been having data races with
each other. Then we do the same for Y. So, a log read, check read, havoc Y. And then slightly more
interestingly, when we write into A at tid we write the value X plus Y. Then what we do is we log a write
by thread one at index tid$1 so at thread one’s ID. Then we check a write at index tid$2 for thread two
and then there is actually nothing more to do, so for the read case we have to havoc the receiving
variable, but for the write case we don’t have to do anything to reflect the effect of the write because
we have actually removed the shared state completely. So it's like the write disappears into the void,
and in return, reads can give back anything.
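(Putting the translation just described together, in the C-style form used on the slide; the original kernel body and the index of the second read are reconstructions, and nondet() stands in for havoc:)

    // Hypothetical original kernel (a reconstruction):
    //   kernel void foo(local int *A, int idx) {
    //     int x = A[tid + idx];
    //     int y = A[tid + idx];
    //     A[tid] = x + y;
    //   }
    //
    // Two-threaded sequential form under the adversarial abstraction.
    // Preconditions (written as comments):
    //   requires 0 <= tid$1 && tid$1 < N && 0 <= tid$2 && tid$2 < N && tid$1 != tid$2;
    //   requires idx$1 == idx$2;    // parameters are passed by value, same for every thread
    void foo(int idx$1, int idx$2) {
        int x$1, x$2, y$1, y$2;          // the array A itself has been removed

        LOG_READ_A(tid$1 + idx$1);       // thread one logs its read
        CHECK_READ_A(tid$2 + idx$2);     // thread two checks its read against what is logged
        x$1 = nondet(); x$2 = nondet();  // havoc: the value read is arbitrary under the abstraction

        LOG_READ_A(tid$1 + idx$1);       // the read into y is translated the same way
        CHECK_READ_A(tid$2 + idx$2);
        y$1 = nondet(); y$2 = nondet();

        LOG_WRITE_A(tid$1);              // thread one logs its write, at its own thread ID
        CHECK_WRITE_A(tid$2);            // thread two checks its write; the written value disappears
    }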
Alright, so now let me explain briefly the race checking instrumentation we use. This is a bit of an
implementation detail. We could have done something different here but this we found to work very
well. So for every array parameter, we introduce a bunch of global variables. We have a variable read
has occurred for the array, which is a Boolean and a variable write has occurred which is a Boolean. And
the idea here is that this Boolean will be false if we are currently not considering any read being in flight
for this array. And it will be true if we are considering some read.
And then we introduce a variable read offset A and a variable write offset A, which are integers. So the
idea is that if read has occurred is true, then read offset stores the offset corresponding to the read. If
read has occurred is false, then the value of read offset is irrelevant. So this allows us to track either
zero reads or one read, but no more than that.
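(Per array A, the instrumentation state just described amounts to something like this:)

    bool READ_HAS_OCCURRED_A;    // true iff some read from A is currently being tracked
    bool WRITE_HAS_OCCURRED_A;   // true iff some write to A is currently being tracked
    int  READ_OFFSET_A;          // offset of the tracked read; irrelevant if the flag is false
    int  WRITE_OFFSET_A;         // offset of the tracked write; irrelevant if the flag is false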
Does this make sense so far? And I’ll show you how we’re going to use it in a minute. Let me introduce
four procedures, a procedure called log read A, which takes an offset, log write A, which takes an offset,
check read A and check write A, each of which take an offset. And then there is the, the log procedures
will be invoked with respect to the first thread, and the check procedures with respect to the second
thread. So what we’re going to do is we’re going to consider just the first thread logging and just the
second thread checking, but because we’re going to consider all possible pairs of threads, this is okay.
So we’re exploiting symmetry here. Okay and we get rid of the array parameter. Alright.
>>: The idea is that the log read is now somehow going to be non-deterministic.
>> Alastair Donaldson: Yeah, okay. You’re one step ahead. Alright, so I think I have explained this,
yeah. This is for an undergraduate summer school, so I was stepping through a bit more slowly, but I
think we can skip on.
Okay, so log read takes in an offset, and an immediate thing you might think of would be to just say
that a read has occurred, and it occurred at this offset. But this wouldn't be good enough, because this
would mean that we would only be logging the latest read, and we won't be able to check for races
against any prior read that happened since the last barrier. So, what we do is we wrap it in an
[inaudible] star. So the theorem prover, or the verifier, should consider that the program either does log
this read, or that it just continues to log whatever it was logging, if anything. Alright?
So, yes, star is an expression that evaluates non-deterministically. So we either log this read, in which
case read has occurred A is set and the offset is overwritten, or we leave them alone, in which case they
keep whatever they were already recording. And log write is exactly the same. Now check read is very
do whatever they were already doing. And log write is exactly the same. Now check read is very
simple. We just assert that if a write has occurred to A by the other thread, then the offset written to by
the other thread must not be the same as this offset that I am checking. And that’s all we have to do.
And, yeah, this is what I just said. And check write is slightly more sophisticated because a
write can race with either a write or a read. So we check that if a write has occurred, then the offset
written to must be different from the offset that I’m writing. And if a read has occurred, the offset read
from must be different from the offset that I am writing. Okay.
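(The four procedures for array A, as described, look roughly like this; nondet() stands for the non-deterministic choice:)

    void LOG_READ_A(int offset) {
        if (nondet()) {                  // either start tracking this read ...
            READ_HAS_OCCURRED_A = true;
            READ_OFFSET_A = offset;
        }                                // ... or keep tracking whatever was tracked before, if anything
    }

    void LOG_WRITE_A(int offset) {
        if (nondet()) {
            WRITE_HAS_OCCURRED_A = true;
            WRITE_OFFSET_A = offset;
        }
    }

    void CHECK_READ_A(int offset) {
        // a read can only race with a write by the other thread
        assert(!(WRITE_HAS_OCCURRED_A && WRITE_OFFSET_A == offset));
    }

    void CHECK_WRITE_A(int offset) {
        // a write can race with a write or with a read by the other thread
        assert(!(WRITE_HAS_OCCURRED_A && WRITE_OFFSET_A == offset));
        assert(!(READ_HAS_OCCURRED_A && READ_OFFSET_A == offset));
    }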
And then finally, we have a precondition on the whole kernel, saying that initially there’s no read and no
write on the array A. And we do this for every array. In principle we could just have a single Boolean
and a single offset and we could store for a read, which array it was from, which offset. Okay? And
that's actually what I tried first; then Shaz suggested splitting it up across the multiple arrays, because I
think the single-variable version was giving the theorem prover a really hard time, and also it means
that, well, if you know about theorem prover based verification, there are these modified sets for loops,
and it would mean that every loop that does a read is going to kill this logging state for every array in
the kernel, and you have to have an invariant recovering where reads could have happened for, you
know, all the arrays, even if they weren't touched by that loop. So splitting things up has some advantages.
And a barrier is quite nice to implement. Well the first way I thought about doing it was that barrier
should set read has occurred and write has occurred to false by assigning to them. And then, I realized
that this was really inconvenient, because you might have a loop that does not read or write a particular
array, but it’s got a barrier in it. And then because we were assigning to these variables, they were in
the loop modified set. And then we had to have invariants saying that you know they’ve got the same
value they had before and yeah we had [inaudible] invariants about these variables whereas if we just
assume that they are false, this does the trick. Okay, so the intuition behind this is that, if you remember
the instrumentation, there was always a possible non-deterministic choice not to track a read, not to
track a write. So there is always one path that is very lazy. It doesn't do anything. It just goes no, no,
no, no, no, no. And then there is this tree of other paths that do track some reads and some writes
against each other, and they're the important paths that actually find data races. But when you hit a
barrier, you basically say: assume that we did the lazy thing. So all the paths that did track reads and
writes, they get snipped, and we only have the path from which we can do race logging and checking
afresh from there.
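(So the barrier, for a single array A, becomes roughly:)

    void BARRIER(void) {
        // assume rather than assign, so the flags do not end up in loop
        // modified sets; only the "lazy" paths that tracked nothing survive,
        // and race logging and checking start afresh from here
        assume(!READ_HAS_OCCURRED_A);
        assume(!WRITE_HAS_OCCURRED_A);
    }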
Okay, any questions about this? I mean it’s, it’s kind of lower level but I find it quite interesting. I find
this assuming business quite interesting. Yeah.
>>: Well, this [inaudible], so if you have a single thread which performs a read and a write, so the read
and the write are in the same thread [inaudible]. So will this [inaudible] say there is a data race inside a
thread? [inaudible] in sequential -
>> Alastair Donaldson: You mean like, so we have an array A and we do something like a read, so X
equals A of some offset?
>>: [inaudible]
>> Alastair Donaldson: [inaudible]. And then a write did you say, like A tid equals X plus one for
example?
>>: Yes. So will this be considered as a data race?
>> Alastair Donaldson: No, because there is no race checking performed within a thread. So, it’s, we’re
not going to check. Like say thread six, we’re not going to check does thread six’s read conflict with
thread six’s write. But for thread nine, we will consider does thread six’s read conflict with thread nine’s
write. But they won’t conflict because they use tid which is different for every thread.
>>: Okay. [inaudible]. As I see it, in the transformation you showed before, the first one will generate a
read.
>> Alastair Donaldson: So we do like log read, tid 1, yeah? And check read tid 2. That’s what we would
generate, basically.
>>: Oh, checks. Okay I see [inaudible]
>> Alastair Donaldson: And then we would do log, write tid 1 check write tid 2.
>>: Yes, when you log write tid. Oh you never check tid 1.
>> Alastair Donaldson: No, we only ever check tid 2. Yeah, so the first thread is the logger and the
second is the checker. So basically, we're going to consider that log read and that check write; we will
look for conflicts between those. And for conflicts between those, yeah, that case. And actually that case
there would check for a write, [inaudible] to the write. In both cases you can see it's tid 1 and tid 2,
which will be different. So, good question.
>>: [inaudible] only attempt the race between different -
>> Alastair Donaldson: Between different threads. That's right.
Okay, now let me tell you how we handle loops and conditionals. So, we use predicated execution. The
idea here is that we flatten the kernel, so that all threads execute the same sequence of statements.
We more or less eliminate any conditional code. We can’t eliminate loops though, so we have to do
something a little bit sophisticated for loops. So let me explain first of all independently from GPU
kernels how predicated execution works. Just consider this snippet of code. If X is less than 100,
increment X, otherwise increment Y. We can make this predicated by introducing two fresh Booleans, P
and Q, and we can say that P is assigned to the Boolean X less than 100, and Q assigned to the negation
of that, not X less than 100. And then we can have, if this problem is being executed, then both of these
things will be executed but in predicated form, so we will say that X becomes equal to X plus one, if P
holds. Otherwise it just gets assigned back to X. So effectively this is a no op if P is false. And then we
can execute this statement. So Y gets assigned Y plus one if Q is true, otherwise it gets reassigned Y.
Okay?
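(As plain code, the predicated form of that snippet is roughly:)

    // Original:
    //   if (x < 100) { x = x + 1; } else { y = y + 1; }
    //
    // Predicated form: no branches; both statements always "execute".
    bool p = (x < 100);
    bool q = !(x < 100);
    x = p ? x + 1 : x;   // a no-op when p is false
    y = q ? y + 1 : y;   // a no-op when q is false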
So, this is something that is used sometimes in compilers. If you have got, say, a processor where
branches are quite expensive, you might have a little bit of conditional code, [inaudible] a simple
if-then-else where you don't have many statements in either the then branch or the else branch. It
might be more efficient to flatten the whole thing using predicates than to actually have a branch, and
avoid the expense of that.
>>: For example, [inaudible] a question about this compiler you are talking about. How would a
compiler translate P ? X + 1 : X, [inaudible] use a branch?
>> Alastair Donaldson: Well it would use a special instruction, a special select instruction.
>>: Oh I see.
>> Alastair Donaldson: Yeah, so this works when the architecture supports that kind of instruction.
It's a very common thing: if you want to vectorize code automatically, you first eliminate these
conditionals, and then vector architectures [inaudible] have a select operation that can crunch through
things like this. So you make like a vector of Booleans; the select takes a vector of Booleans and then a
vector of then-values and a vector of else-values. But it doesn't work if you've got really deep
[inaudible] stuff, because if you flatten it all, then you've got selects with selects with selects with
selects inside them. That's fine for a theorem prover though. Okay.
So, what we do is we apply predication so that at every execution point there is a predicate determining
whether each of the two threads is enabled, and then we parameterize the log and check procedures
with a predicate recording whether the threads are actually logging or actually checking, or whether
really they're not supposed to be there because they didn't get into that part of the code.
So, yeah, for an assignment like X becomes equal to E, we turn that into the following for each of the
threads, if we've got a predicate P we're translating with respect to [inaudible] P: we do the select thing.
So we say for the first thread, X gets E$1, by which I mean the expression E dollarized, so I'm going to
take all the local variables and turn them into their $1 form. Okay, so if P holds then E$1, otherwise X
gets left alone. And then an array read
becomes, we log the read for thread one, but now we pass in P1 to record whether P was really alive or
not. And the same for checking for thread two. And we do the havocking, but now the havocking has to
become predicated so we don’t necessarily havoc X. We only havoc X if we are enabled. So we now
have to do something like X1 becomes equal to if P then arbitrary, otherwise X1. In fact, like we, we
have to do something a bit different from that in Boogie because you can’t have star in Boogie. And a
write is almost the same as before but we pass in these predicates.
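(Putting it together, the predicated two-threaded translation under a predicate P looks roughly like this, in the same $-notation; nondet() again stands for havoc:)

    // x = e, under predicate P:
    x$1 = P$1 ? e$1 : x$1;        // e$1 is e with every local variable replaced by its $1 copy
    x$2 = P$2 ? e$2 : x$2;

    // x = A[e], under predicate P:
    LOG_READ_A(P$1, e$1);         // the log and check procedures now take the predicate too
    CHECK_READ_A(P$2, e$2);
    x$1 = P$1 ? nondet() : x$1;   // predicated havoc: only an enabled thread's x changes
    x$2 = P$2 ? nondet() : x$2;

    // A[e1] = e2, under predicate P:
    LOG_WRITE_A(P$1, e1$1);
    CHECK_WRITE_A(P$2, e1$2);     // nothing more: the written value vanishes under the abstraction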
Alright, now I think I have some slides about loops coming up. Yeah, the [inaudible] is where the
predicates come from. So if we've got: if E do S otherwise do T, we introduce a new predicate Q, where P
is our current predicate. So we say Q for thread one is P for thread one and E$1. So we take the existing
predicate and we strengthen it with the conditional. And then we have another predicate R which is the
incoming predicate strengthened with the negation of the conditional. And in case you were wondering
before, by the way, why I had P and Q and didn't just use P and not P, maybe this will answer your
thought: Q and R are not necessarily going to be negations of one another, because they both involve P.
Right. [inaudible]
Okay, and then we translate the then side with respect to Q and then we translate the else side with
respect to R. Okay, and then the loop case I think is the most interesting. What we do here is we can’t
eliminate the loop, but what we do is we force both threads to keep executing the loop until they’re
both done with the loop. So, if neither of them wants to enter the loop, you skip the loop. If one of
them wants to enter the loop and one doesn’t, they both enter the loop and they both execute the loop,
but the one that didn’t really want to go in the loop just does nothing until that first [inaudible] is
finished and then they both leave. Okay?
So we turn this into the following: We evaluate the loop guard into a predicate Q, so it's the loop guard
strengthened by the incoming predicate. And then we loop while either Q1 is true or Q2 is true. So we
loop while at least one of the threads is enabled. We translate the body of the loop with respect to Q so
this means that a thread, when we translate [inaudible] the body with Q, this will make sure that if a
thread was not enabled, and its Q was false, it won’t do anything. And then we update Q, so we say that
Q becomes its old value strengthened by the loop guard, which may have changed, probably will have
changed, according to the loop body. Okay.
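(So a loop while (e) { S } under an incoming predicate P becomes, roughly:)

    q$1 = P$1 && e$1;             // per-thread loop predicate
    q$2 = P$2 && e$2;
    while (q$1 || q$2) {          // keep going while at least one thread is enabled
        // ... S translated under predicate q: a thread whose q is false does nothing ...
        q$1 = q$1 && e$1;         // re-evaluate the guard; once a thread leaves, it stays disabled
        q$2 = q$2 && e$2;
    }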
>>: And so in general, you handle non-structured code, right? LLVM
>> Alastair Donaldson: We do, yeah. That’s right. That’s what our ESOP paper was about. That was
actually in my opinion one of the biggest challenges of this project. So, we had all this figured out more
or less when I was at Microsoft. And then we started to build a front-end based on CLANG’s AST at the
structured level, because that was the easiest thing to do in our minds. And that worked okay, but it
meant that we couldn’t handle kernels that did switch statements or kernels with breaks and continues
or they were hard to handle. So yeah, so we thought [inaudible], then we thought, how do we do this
predicated execution at the unstructured program level? Because at the structured level it's very nice:
because of the hierarchical structure you have a predicate you can descend in and descend in. Anyway,
Shaz had a very
smart idea and we worked out the details.
>>: I'm trying to understand the purpose of this predication, because a lot of it seems like what Boogie
would already do for verification. So is it because of the loops that you have to introduce this explicitly,
that you just can't let Boogie -
>> Alastair Donaldson: What we're trying to do is just, so, we're trying to give Boogie a program whose
correctness implies race freedom of the GPU kernel. And this, at the moment, we’re just trying to
construct that program. Once we have that program, Boogie then has to do its usual thing on the
program. It has to do verification condition generation.
>>: You argued previously that we could just run thread one followed by thread two.
>> Alastair Donaldson: Yeah, yeah yeah.
>>: Right, and that was going to be our schedule. In fact, you have sort of –
>> Alastair Donaldson: Oh well I told you that you could do that schedule or maybe you could do a
round robin schedule where you run like thread one makes a set, thread two makes a set, thread one
makes a set, thread two makes a set.
>>: [inaudible] so the question is why in fact do you not just do the schedule where thread one executes
all the way to the barrier and then thread two? And I think the answer is like you said, which is that it’s
the loops because you want to get a joint invariant for those two threads at the loop heads. Correct?
>> Alastair Donaldson: What I would say is, like, more fundamental than that, well, take barriers for a minute: how do you even write down that program where, say, you’ve got a barrier in a loop for instance and –
>>: If you, I would argue that if you did not have any loops in the program, right?
>> Alastair Donaldson: Yeah.
>>: It was straight-line code. Then what you would do is you would just replicate that code from one barrier to another twice: one for tid 1, one for tid 2.
>> Alastair Donaldson: What if the barriers are nested inside conditionals?
>>: Well, so –
>> Alastair Donaldson: If you have no conditionals, then yes, you can do exactly that replication thing. If you’ve got conditionals, like a nest of conditionals with barriers here and barriers there, how would you get these barrier regions, these barrier-to-barrier regions? It becomes tricky. And then also, what would you do if you’ve got procedures, and you’ve got, like, some code where you go into a procedure and then there’s a barrier? Then you have to basically expand everything, and I know I said we do full [inaudible], but we don’t want to be limited to that. We would like to be able to do a modular analysis.
>>: Yeah, I mean, I guess if you know where the barriers are and you can go from one barrier to the next –
>> Alastair Donaldson: Yeah, okay, so yeah, I think it’s a really good question, and that was actually what I very first thought of, but the problem I found was that this notion of going from one barrier to the next becomes quite complicated with conditionals. The next barrier might be the same
barrier having gone around two iterations of a nested loop, or something. So I couldn’t work out how
you could write out this program without starting to expand all the possibilities. You know, you might
be able to kind of expand all the possible resolutions of the conditionals, but then I think the problem
would grow very large to [inaudible] the cases.
>>: Well, at least you have to consider all the barriers and barrier [inaudible].
>> Alastair Donaldson: Yeah
>>: Actually, which is [inaudible] and this is linear, right?
>> Alastair Donaldson: This is linear, that’s right.
>>: Okay.
>> Alastair Donaldson: And I think the thing that this brings is that, as I hope people will write more complex GPU kernels [inaudible], verification will become more important, and thus modular verification will become important. And then I think a strength of this predicated
approach is it means that actually both threads appear to be stepping into a procedure at the same
time, and then you really can verify procedures in isolation and use specifications.
>>: What’s the translation, okay, so you haven’t shown us the predicated barrier –
>> Alastair Donaldson: Yeah, so that’s the [inaudible]
>>: So we haven’t seen enough slides –
>> Alastair Donaldson: Yeah, no. So I’m running over time a bit now, so I’ll go on a few more minutes –
>> Shaz Qadeer: It’s alright, because there were lots of questions [inaudible]. You can continue up to
11:45.
>> Alastair Donaldson: Okay. If you need to go, I won’t be offended. Okay, so yeah, barrier, we turn
into barrier, giving it both the predicates, predicate one, predicate two. Alright. And now, so the log
read and log write, they get modified in the obvious way, so we now have an enabled parameter.
[inaudible] is the thread enabled. And now, if the thread is not enabled, we do not log its access, but if it
is enabled, we may log its access. So this is like before, but just with this predication. Okay, and the checks are predicated in the same way, so check write says: if I’m disabled, there is nothing to check, I’m not really here, I’m not really going to race; but if I’m enabled, then do the usual check. Okay, but note that –
>>: The barrier operation was just this assertion that said, ah, we haven’t logged anything.
>> Alastair Donaldson: Yeah
>>: Right, so now –
>>: Oh, sorry, never mind.
>> Alastair Donaldson: So let me show you here, so the barrier takes these enabled parameters. Now this is how we check barrier divergence, and actually that’s why we first introduced predicated execution: we wanted a way to check barrier divergence. Remember we talked to all these people about what this barrier divergence problem is. We finally narrowed it down, and that was the initial source of predicated execution. But then it had various other benefits which I now, I think of –
>>: Is barrier divergence when the two threads wind up at different barrier instructions?
>> Alastair Donaldson: So, it’s slightly more subtle than that. If they wind up at different barrier instructions, that is barrier divergence. However, if they get to the same barrier, but they executed different numbers of loop iterations, that is also barrier divergence. So for instance, if you’ve got an outer loop and an inner loop, and a barrier inside the inner loop, it is not permissible for one thread to basically go outer loop, inner loop, inner loop, inner loop, inner loop, inner loop, and the other thread to go outer loop, outer loop, outer loop, outer loop, inner loop, inner loop, you know, and hit the barrier the same number of times. That’s no good. They actually have to take the same paths through the loops.
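[As a small illustration of the rule just stated, assuming a barrier placed in the inner loop of a nested loop; the Python trace model is purely hypothetical:]

def barrier_trace(outer_iters, inner_iters):
    # Sequence of (outer, inner) positions at which a thread with these per-thread
    # loop bounds would execute a barrier placed inside the inner loop.
    trace = []
    for o in range(outer_iters):
        for i in range(inner_iters):
            trace.append((o, i))      # barrier() would be executed here
    return trace

t1 = barrier_trace(outer_iters=1, inner_iters=6)   # 1 outer iteration, 6 inner: 6 hits
t2 = barrier_trace(outer_iters=3, inner_iters=2)   # 3 outer iterations, 2 inner: 6 hits

assert len(t1) == len(t2)        # same number of barrier hits...
assert t1 != t2                  # ...but different paths through the loops: barrier divergence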
>>: Then are the OpenCL’s –
>> Alastair Donaldson: [inaudible] the OpenCL spec, yeah. And the reason for it is that it’s very
difficult to actually compile a barrier operation in that way. In something like OpenMP or MPI, you can have processors hit different barriers and you can implement barrier synchronization that way. In a GPU kernel it would be very difficult to do that, due to the way these threads actually work. So
the way we implement barrier is we assert that the threads are uniformly enabled: either they’re both disabled or they’re both enabled. If one is enabled and the other is disabled, it basically means one of them is not there, either because it would be going to a different barrier, or not going to any barrier ever, or would be, say, out of sync with respect to the loops they are executing. So this precisely captures the requirements for barrier divergence checking. I sometimes give talks where I really explain that carefully with some examples, which I haven’t done here.
And then, if the first thread is not enabled, if either of them is not enabled, then they’re both not
enabled, so we return, because the barrier is not really being hit. Otherwise we do the assumed thing as
before.
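[A small Python model of the predicated instrumentation and barrier just described; the names log_write, check_write, barrier and the single write log are illustrative assumptions, not GPUVerify's actual Boogie encoding:]

WRITE_HAS_OCCURRED = False     # has the first thread logged a write since the last barrier?
WRITE_OFFSET = None

def log_write(enabled, offset, nondet_choice):
    # The first thread's write may (non-deterministically) be logged, but only if enabled.
    global WRITE_HAS_OCCURRED, WRITE_OFFSET
    if enabled and nondet_choice:
        WRITE_HAS_OCCURRED, WRITE_OFFSET = True, offset

def check_write(enabled, offset):
    # The second thread's write: if disabled there is nothing to check, otherwise the usual check.
    if enabled:
        assert not (WRITE_HAS_OCCURRED and WRITE_OFFSET == offset), "possible data race"

def barrier(enabled1, enabled2):
    global WRITE_HAS_OCCURRED
    assert enabled1 == enabled2, "barrier divergence"   # threads must be uniformly enabled
    if not enabled1:
        return            # neither thread is really at the barrier
    # Otherwise do what the unpredicated barrier did before (modelled here by simply
    # clearing the log, so pre-barrier accesses cannot race with post-barrier ones).
    WRITE_HAS_OCCURRED = False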
>>: Do you want to say it’s not enabled one or oh, I see, because enabled one is equal to [inaudible]
>> Alastair Donaldson: Yeah, it might be clearer, it might be clearer not to do that optimization. Ah,
does that make sense?
>>: Yeah, I don’t understand the definition of barrier divergence for non-structured code. [inaudible]
>> Alastair Donaldson: Okay, yeah. Alright, well, there, I mean, now I just had a slide here where if there was time I would talk about some of these things. I don’t have slides on them. So, we spent a lot of time working on invariant inference, which I find very interesting. We used Houdini to do that, and now I’m working on, and I’ve been talking to Akash Lal quite a bit about, some ideas for trying to optimize Houdini to be a bit smarter in how it considers candidates. And for GPUVerify, my ideas work really well; that’s where they came from. But Akash gave me some [inaudible] Boogie programs, and my ideas didn’t work well on them, and I’m hoping to talk to Shaz about that over the next couple of days.
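[For context, a bare-bones sketch of the standard Houdini loop being referred to, with verifier_refutes standing in for a call out to Boogie/Z3; this is not a real API:]

def houdini(candidates, verifier_refutes):
    # Start from all candidate invariants and repeatedly drop any candidate the
    # verifier refutes, until a fixpoint is reached.
    invariants = set(candidates)
    changed = True
    while changed:
        changed = False
        for c in list(invariants):
            # Check the program annotated with the current candidate set.
            if verifier_refutes(c, invariants):
                invariants.discard(c)      # c is not inductive relative to the rest; drop it
                changed = True
    return invariants                      # the largest inductive subset of the candidates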
So, yeah, there were a lot of really interesting practical issues in building on the LLVM compiler framework, but the principal one was this predicated execution for unstructured control flow graphs. That was very challenging and interesting. And then doing source-level error reporting actually was very interesting,
because we don’t just report that some assertion might fail. We actually want to report that there might be a data race between a pair of statements, and the problem we have is that for one of the statements things are fine: that’s the statement that was reached second, and it is an actual statement, so we know that that statement is the potential culprit. But that statement will be interfering with something that may be another statement, or may be something that came from abstracting a loop or abstracting a procedure. And what we want to state to the user is, we want to give them a best-effort
guess at the program statement that caused that problem. So what we actually do is we carry source location information and write it in loop invariants. So for a read or a write, we log the offset. We also log the line number, and then we have loop invariants that say things like: if there’s a read, it’s from an offset satisfying this pattern, and it’s from one of those line numbers. Seriously, I know it
sounds crazy, but that means that when we get a model from Z3 we can ask which line number the first
error came from and that allows us to give a potential error.
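[A hypothetical sketch of the source-location trick just described: the instrumentation logs the source line alongside the offset, and loop invariants constrain both, so a counterexample model from Z3 pins the report to a line. The line numbers and the access pattern below are invented:]

READ_HAS_OCCURRED = False
READ_OFFSET = None
READ_SOURCE_LINE = None

def log_read(enabled, offset, source_line, nondet_choice):
    global READ_HAS_OCCURRED, READ_OFFSET, READ_SOURCE_LINE
    if enabled and nondet_choice:
        READ_HAS_OCCURRED, READ_OFFSET, READ_SOURCE_LINE = True, offset, source_line

def candidate_loop_invariant(tid, stride):
    # "If a read has been logged, it is from an offset matching this pattern and
    #  it came from one of these source lines."
    return (not READ_HAS_OCCURRED) or (
        READ_OFFSET % stride == tid and READ_SOURCE_LINE in {17, 42}
    )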
>>: [inaudible] the read or the write logging is using this [inaudible] choice.
>> Alastair Donaldson: Yeah
>>: So, just following that control information about which branch was taken, wouldn’t that give you –
>> Alastair Donaldson: No, because say you’ve got a read that’s inside a loop; then havocking the loop-modified variables can just say a read in the kernel occurred from location five billion, and that may just be a false [inaudible]. Maybe there is a write in the loop that could write to it at that location.
So, what we do additionally, well, we don’t have to do this, in fact Matthew Parkinson, when I told him, came up with a smarter idea, but what we do currently is we actually carry around source location information in those tracking variables, and then we havoc the source line, and then we have a global invariant saying that the source location variables can only be one of the possible locations they could be. And then we have smarter invariants that try to infer bounds on those line numbers. Matthew’s idea is smarter, right? Basically, you know a certain check failed, so you know immediately which array is the problem. So now what you can do is eliminate all of the other checks and eliminate all the logs on other arrays. Now you’ve got a bunch of logs on the array in question. Split them into two sets, disable half of them here and half of them there, and run the verifier on both. At least one will fail. As soon as one fails, kill the other one. Now divide those into two and keep going. You basically binary search until one log remains and one check remains. Ah, and that would avoid all this invariant stuff, because it’s very expensive to carry this stuff around in invariants.
[laughter]
>> Alastair Donaldson: Again, is it [inaudible]
>>: You’re just sort of saying take, take away, you know, keep taking away sources.
>> Alastair Donaldson: Okay, yeah.
>>: Anyway, so –
>> Alastair Donaldson: But since you [inaudible]
>>: Yeah, there might be a better way to do that. I don’t know, but um –
>>: Yeah, yeah
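[A rough, sequential sketch of the binary-search idea attributed to Matthew Parkinson above; the real scheme as described would run the two halves in parallel and kill the loser, and verification_fails stands in for a run of the verifier with only the given log statements enabled:]

def localise_culprit(logs, verification_fails):
    # Precondition: verification fails with the full set of logs enabled.
    candidates = list(logs)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # At least one half must still make verification fail, since the full set does.
        candidates = first if verification_fails(first) else second
    return candidates[0]    # best-effort guess at the logging site involved in the race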
>>: Did you say, you know, out of the various sources of [inaudible], what’s most commonly, you know,
the reason why you fail to prove race freedom? Is it because you [inaudible] havoc to all the reads from
the arrays?
>>: Yes, not having a strong enough loop invariant, and saying that –
>>: Okay, so is it because your abstraction allowed a loop invariant [inaudible], or is it because by abstracting so much, you know, by turning all the reads of the arrays into havocs –
>> Alastair Donaldson: Yes
>>: You know, you just lost all possibility of –
>> Alastair Donaldson: It’s almost always that, there are some examples where up-front abstraction
makes it actually impossible to verify the kernel, no matter what invariants we then got. But that’s
[inaudible] and there is an important class of kernels which we have a paper under review about dealing
with. We’re using this barrier invariants technique for more precise abstractions. But for the most part, kernels don’t fall into that category, and for those kernels it really comes down to finding loop invariants that characterize the access patterns of arrays. And we have a template-based approach for doing that. Template-based approaches have advantages, but their main disadvantage is that if the access pattern falls outside the template, or is obscured by syntax, then, you know, we die.
>>: It’s often because you have access to a certain stride, or something like that?
>> Alastair Donaldson: Yeah, a certain stride, or maybe, yeah, I mean, I can show you some examples we have. Yeah, we have a bunch of cases. And so the things I would like to explore are: first, being more aggressive with the invariant inference, using this technique that I alluded to about optimizing Houdini to make Houdini able to take larger candidate sets, because the more candidates, the more chance you have. The second thing is we’re looking at a Daikon-like technique for actually doing some dynamic invariant generation, to give Houdini candidates; that’s probably how we’re going to try it. And the third thing, well, we have a grant funded in which we said we would explore interpolation.
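[Purely to illustrate the kind of strided access pattern, and the kind of candidate invariants one might feed to Houdini, mentioned a moment ago; the kernel and the templates are hypothetical:]

def kernel_write_offsets(tid, num_threads, iterations):
    # Offsets written by thread tid in a typical strided loop:
    # A[tid + i * num_threads] for i = 0 .. iterations - 1.
    return [tid + i * num_threads for i in range(iterations)]

def candidate_templates(tid, num_threads):
    # Template instances a tool might propose for the logged write offset:
    return [
        lambda off: off % num_threads == tid,   # each thread stays in its own stride class
        lambda off: off >= tid,                 # a simple lower bound
    ]

# If every logged offset of every thread satisfies a candidate, Houdini keeps it; the
# distinct stride classes of distinct threads are what rule out write-write races here.
for tid in range(4):
    for off in kernel_write_offsets(tid, num_threads=4, iterations=8):
        assert all(check(off) for check in candidate_templates(tid, num_threads=4))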
>>: So what, [inaudible] you’ll do a simulation, right?
>> Alastair Donaldson: Yeah, actually simulate it.
>>: [inaudible]
>> Alastair Donaldson: Yeah, we need a, we either run the Boogie program, if that’s possible, using this as a tool. I can’t come up with the name, but someone has written a Boogie interpreter, so we might use that. Or we might work on an interpreter for LLVM bitcode so that we can run our kernels. And there’s this KLEE-CL tool at Imperial that does dynamic symbolic execution of OpenCL kernels. And the main problem is, once we find these invariants, these likely invariants, how we map them back to the actual Boogie variables; that’s the kind of practical problem. And that can be really tricky if you’re dealing with, yeah.
>>: I have a better suggestion, is actually to do that on the, on the logic, on the verification conditions,
rather than the Boogie [inaudible].
>> Alastair Donaldson: Okay, but the problem with the Daikon-style approach is that it is also a template-based approach. It has the potential to find things that are syntactically [inaudible] but occur dynamically, so you can see that they happen. But still, you have to have your predefined things you’re looking for. So, despite having seen a number of talks on interpolation, I still don’t get the general intuition about interpolation. With interpolation you may be able to discover things that are problem specific.
>>: Well, I was going to say, the other thing about Daikon is you can discover things that you can’t actually prove.
>> Alastair Donaldson: That’s true, right.
>>: [inaudible] to say they rely on some [inaudible]
>> Alastair Donaldson: Okay, so we’re really trying to actually push this technique to people in industry, and I really think that probably they’re more interested in bug finding and suchlike, so we might use the Daikon thing to take invariants and just trust them. You know, I mean, [inaudible] reviewers will cry if they hear you say that, but --
>>: If you want to find bugs why don’t you just unroll the damn thing. Wouldn’t you find enough bugs
that way?
>> Alastair Donaldson: So, there’s another problem. So often you have a loop that’s going to do an
extortionately large number of iterations. And it’s correct. And then you have another loop that’s bad,
and you’re never going to get past that with unwinding. So you may be able to get a hint it’s a problem
by a failed proof attempt. I have some ideas about trying to under approximate loops.
>>: [inaudible] under approximate acceleration.
>> Alastair Donaldson: Yeah, yeah, there’s a nice paper by [inaudible] and others at [inaudible] about loop, well, it’s not loop acceleration, it’s loop under-approximation, which –
>>: [inaudible] it’s something like under-approximation acceleration.
>> Alastair Donaldson: Yeah, so I look at that and I think, like, that was formalizing some things I’ve been thinking about, so –
>>: Which is why it would be cool to have [inaudible] at a lower level [inaudible]
>> Alastair Donaldson: Oh, so other people could benefit from it.
>>: So that everybody could benefit from it.
>> Alastair Donaldson: Yeah, that’s right, yeah.
>>: Yeah, I think that there are many advantages to what you’re proposing, because if nobody is really interested in all these templates and invariants, then the lower the level at which you can do it, the less engineering you would have to do.
[inaudible]
>> Alastair Donaldson: So, my plan for practical deployment of GPUVerify is as follows. There are three versions of the tool: one version which is eager to find bugs, one version that’s eager to verify, and one version that’s neutral. By eager to find bugs, I mean you do, like, unrolling for instance. By eager to verify, I mean for instance you turn off race checking but keep race logging on, so that you can quickly find your best invariant with Houdini without the expense of actually performing the race check, and once you’ve found the best invariant you then proceed to do the race check. And then in the middle you’ve got the one that just does everything at once, and it may quickly
discover that a proof won’t work, but doesn’t find a bug. [inaudible] is if the bug finder finds a problem,
you kill the others and say there’s a problem. If the verifier proves things, you say good and um, if either
of the two verifier approaches fails, you just ignore that, and after 90 seconds you say no problems were found. That’s the way I envision people getting use from the tool. I don’t really
think it’s going to be very useful if, I mean despite our efforts with invariant inference, there’s a very
high chance that the tool will report false positives.
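[A hedged sketch of the three-configuration deployment just described: a bug-finding run, a verification-eager run and a neutral run raced against each other under a 90-second budget. The command-line flags are invented for illustration and are not GPUVerify's actual options; cleanly killing the still-running configurations is omitted:]

import concurrent.futures, subprocess

CONFIGS = {
    "find-bugs": ["gpuverify", "--loop-unwind=8", "kernel.cl"],   # eager to find bugs
    "verify":    ["gpuverify", "--no-race-checks", "kernel.cl"],  # eager to verify
    "neutral":   ["gpuverify", "kernel.cl"],                      # everything at once
}

def run(name):
    # For this sketch, exit code 0 is taken to mean "no defect found / verified".
    proc = subprocess.run(CONFIGS[name], capture_output=True)
    return name, proc.returncode

pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(CONFIGS))
futures = [pool.submit(run, name) for name in CONFIGS]
verdict = "no problems were found"          # the default once the 90-second budget runs out
try:
    for fut in concurrent.futures.as_completed(futures, timeout=90):
        name, code = fut.result()
        if name == "find-bugs" and code != 0:
            verdict = "possible defect reported"    # the bug finder wins: report the problem
            break
        if name != "find-bugs" and code == 0:
            verdict = "verified"                    # a verifying run wins
            break
except concurrent.futures.TimeoutError:
    pass
pool.shutdown(wait=False)
print(verdict)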
>>: [inaudible] trying to take incomplete proofs and sort of use those as a way of, if you will, triaging the error reports and deciding which ones are most likely to be real error reports. Also, the thing that you’re doing with joint invariants for the two threads, for loops, actually seems pretty closely related to his differential [inaudible].
>> Alastair Donaldson: Really? Okay.
>>: Yeah, because he essentially has a construction, a Boogie-to-Boogie translation, that’s essentially giving you joint invariants for loops and –
>> Alastair Donaldson: Okay, so you’ve got two programs and it’s like you’re considering them –
>>: [inaudible] say two versions of the same program –
>> Alastair Donaldson: I see, right.
>>: And what he wants to know is [inaudible] to say one is safer than the other. You know that if the second version crashes on a given input, then the first version crashes on that input, and so that’s what he means by differential assertion checking. And so,
but in practice what that means is you’re looking for joint invariants for the loops so it’s like running, you
know, the loops in the two versions. [inaudible] It may give something [inaudible]
>> Alastair Donaldson: Yeah, okay. I’ll definitely talk to him about that. Okay, thanks for listening.
[applause]