>> Shaz Qadeer: It’s my great pleasure to welcome Alastair Donaldson again to MSR. Alastair is a
professor, I don’t know, that’s not the title actually, a lecturer, lecturer in the Department of Computing
at Imperial College, London. After doing his postdoc at Oxford University, he spent a few months as
a visiting researcher at Microsoft and he started the project called GPUVerify for verification of GPU
kernels, back at that time in collaboration with people here. And since then he has made a lot of
progress on that project, along with colleagues at Imperial and he is going to tell us about some of that
work today.
>> Alastair Donaldson: Thanks Shaz. Given it is such a small audience, let’s make it really informal and
I’ll just tell you about what we do and please ask as many questions and yeah, we can, we don’t have to
get through the whole deck of slides if you have more questions, [inaudible].
So, this is joint work with Shaz, initially just Shaz when we were collaborators here at Microsoft, and
then since then I recruited various PhD students and postdocs, none of whom have GPUVerify as their
principal thing, but all of whom have done some hacking on it. There's this long list of contributors, and
the project is supported by the CARP project, which is an EU funded project, Correct and Efficient
Accelerator Programming, which I'm coordinating at Imperial, and the project's kind of split between
performance optimizations and correctness checking, and we are focusing on the correctness checking
side of things.
So, these are the guys who contributed to the project. I have three postdocs, Jeroen, Adam, who is a bit
shy as you can see and John, and then some PhD students, Paul and Nathan who have worked a lot on
the project and Dan who just started recently, and a new student, Pantazis, who is starting in July. And
I’m actually looking for one more PhD student so if you know any bright, young candidates who would
like to live in London, please send them my way.
And generally the aim of the research in our group, the Multicore Programming Group is to design
automated techniques to help people write correct and efficient parallel software. So I’m particularly
interested in concurrent programming, partly because I think it’s cool to try to get things to go faster,
but partly because of the correctness challenges it raises. And actually I think that from a correctness
and verification point of view, it is very hard to verify general properties of arbitrary sequential
programs, whereas if you take a parallel program and look for things like data races or deadlock
freedom, those can be easier properties to give people useful tools for, rather than trying to tackle the
verification of more general properties.
Okay, so I’m going to tell you about our work on verification of GPU kernels, which I think is an
application of the idea that’s becoming quite popular these days of trying to analyze concurrent
programs by somehow converting the problem to a sequential program analysis task.
So first of all, let me tell you a bit about what a GPU is. A GPU is a graphics processing unit and I'm going
to give a schematic overview of what a typical GPU looks like. Nothing that I say here is going to be
completely true of all GPUs. So if you know about GPUs, then you know that it's not [inaudible] true,
but. A GPU generally consists of a number of processing elements which you might like to think of as
cores, although they’re typically a bit simpler than CPU cores. And every processing element has a small
amount of memory that it has exclusive access to. Then there are a number of these processing
elements on the GPU and the processing elements are arranged into groups, such that every group has a
portion of memory that’s shared among all the processing elements in the group. So, these guys can
communicate with each other through this group-shared memory but they cannot communicate with
those guys through each other's group-shared memory.
And then in addition there is a pool of global memory, which all the processing elements can share, and
to some extent processing elements in different groups can communicate through this global memory.
However, in typical GPU designs, there is a mechanism for processing elements in the same group to
synchronize with one another, but no mechanism for processing elements in different groups to
synchronize with one another. So this global memory is not really used for inter processing element
communication. It is more used to actually get data from the host device and give data back to the host
device. Make sense?
>>: I have one question.
>> Alastair Donaldson: Yeah
>>: Do GPUs typically provide interlocked operations?
>> Alastair Donaldson: Like, such as?
>>: Like [inaudible] those kinds of things?
>> Alastair Donaldson: Well yeah they do, yeah, and that's actually something which we want to look at
next in our work; it's quite challenging from a verification perspective. But the one problem is that
there's not really a consensus on which atomic operations the various GPU programming models provide.
The OpenCL spec, for instance, has a bunch of atomic operations that it specifies.
Okay, so a GPU accelerated system would typically consist of a host computer, like a multicore PC. Here
I am showing you maybe an eight core PC and a plug-in card, and what happens is the host is
responsible for copying both data and code into the global memory of the GPU, and the code is a
function called a kernel function, and this has nothing to do with OS kernels. I sometimes get people
asking me questions about [inaudible] we have different meaning of the word kernel, or the same
ultimate meaning, but a different specific meaning.
And the host then says to the GPU, go invoke the kernel and what this does is it lights up all of these
processing elements and what they do is they copy data from global memory into group-shared
memory, from group-shared memory into private memory. They crunch through it, eventually copy it
back to global memory, and when they’re done, the host is interrupted and it can copy back the results
for further processing. So in a typical GPU accelerated application you might just have some preparation
code on the host, one kernel invocation, and then some processing code, or you might have a sequence
of kernels in a pipeline, or you might have something like a loop with a kernel invocation inside it if
you’re doing an [inaudible] algorithm where you have a number of time steps. A common thing would
be to have a timing loop on the host, and then for every time step you do some calculation and then do
again and again and again. Okay.
So, a serious problem when programming GPUs is the problem of data races, which are well known
from regular concurrent programming. So a data race occurs when we have got two processing
elements or threads running on two processing elements in the same group that access a location in
group-shared memory and at least one of these accesses is a write. And this is called an intra-group
data race. And we can also have an inter-group data race where we have threads running on processing
elements in different groups and they access a global memory location, at least one access is a write,
and there's no synchronization operation separating them. This is an inter-group data race and we can
also have an intra-group data race on global memory, which I’m not showing you in this diagram.
So, data races lead to all kinds of problems, and I think there's something particularly interesting in the
GPU context. It's very well known that data races mainly bring problems of non-determinism, but in
GPU kernels there is, I think, a worse problem, which is that you may actually have device determinism.
So on a particular GPU architecture, if you know a bit about how GPUs work, you
know that they’re kind of deterministic, so actually threads don’t get scheduled by an OS and you don’t
have preemptions and that kind of thing. Threads get scheduled by a driver and they get scheduled in a
very deterministic way on a given GPU. So it may well be that you’ve got a kernel that could exhibit a
race but never does on [inaudible] architecture X.
And then if you port that kernel to another architecture, you may then discover there’s a problem, and
there are programming models such as OpenCL which aim to be portable, so that actually kernels get
compiled at run time for whatever architecture is available. So in that setting you don’t necessarily
know what your customers are going to be running your kernel on, so even if you've tested it on a range
of architectures and discovered no data races, it may be that you cannot discover data races by testing
alone, and yet on some architectures there would be data races.
And another thing to point out is that data races in GPU kernels are almost always accidental and
unwanted. So we don't have the case, as in systems code, where there are deliberate benign data races.
What I have seen are benign data races where for instance many threads write the same value to a
location. That happens. You sometimes have data races where, for strange but actually good reasons,
a thread is going to write something but is guaranteed to write what's already there while another thread
might be reading, in which case that doesn't matter. But I haven't seen examples where we have got,
for instance, synchronization primitives being implemented by busy-wait loops in GPU kernels.
That is not something that would be very efficient and it's not something that would be portable across
architectures either.
So what we’ve been doing in our work is looking at data race analysis for GPU kernels. But let me tell
you first of all how you would avoid data races in a GPU kernel. I’ll also show you an example kernel. So
this is a little kernel written in OpenCL. We have a regular C function which we prefix
with the keyword kernel. This is an entry point to the kernel. This is where threads commence
execution.
And the kernel is going to declare that it takes an array of [inaudible] as an argument and then this
array, the contents are going to reside in local memory, which is in OpenCL what group-shared memory
is called. So local means group-shared.
And then also this kernel is going to take an int offset as another argument. And what the kernel is
going to do is, every thread is going to write to its thread ID. And what it's going to
write is what is already at its thread ID, plus what's at its neighbor's thread ID, offset places away. So it's
going to write A[tid] + A[tid + offset].
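(A minimal sketch of the kind of kernel being described, in OpenCL C; the function and parameter names here are just illustrative:)

    kernel void add_neighbour(local int *A, int offset) {
        int tid = get_local_id(0);          // this thread's ID within its group
        A[tid] = A[tid] + A[tid + offset];  // reads A[tid + offset], which another thread may be writing
    }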
Okay, so spot the data race. I’ll grab my coffee while you think.
>>: Well, you mean everyone is running in parallel right?
>> Alastair Donaldson: Yeah
>>: Everyone is reading tid plus offset but that’s someone else’s tid right?
>> Alastair Donaldson: Exactly right. So if offset was one for instance and tid was zero, then thread zero
would be writing to zero and reading from one, potentially in parallel with thread one writing to one. So
this would be a read/write data race. Okay.
And we can avoid this data race by using a barrier so we can for instance read A[tid] + offset into a
temporary variable, then we can do our write using temp instead of A[tid] + offset, and then we can
have a barrier synchronization statement in between these statements. And what barrier says is that
every thread run in the kernel must get to the barrier before any thread leaves the barrier. And
furthermore, that all loads and stores to memory will have completed before any thread leaves the
barrier.
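(The race-free variant being described would look roughly like this, continuing the illustrative kernel from above:)

    kernel void add_neighbour(local int *A, int offset) {
        int tid = get_local_id(0);
        int temp = A[tid + offset];     // read the neighbour's element into a temporary
        barrier(CLK_LOCAL_MEM_FENCE);   // every thread reaches this point before any thread proceeds,
                                        // and all outstanding loads and stores have completed
        A[tid] = A[tid] + temp;         // the write can no longer race with the reads
    }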
Now actually, like I mentioned briefly earlier, it is only possible for threads within the same group to
synchronize with each other, so a barrier is a synchronization operation between threads in the same
group. For the rest of the talk, I’m just going to assume all threads are in the same group for
explanation purposes. Although in the tool, in theory we deal with the general case.
Okay, any questions at this point about GPU’s, kernels and [inaudible]? Oh yeah, so it stops the accesses
from being concurrent. Okay?
So, there's been a lot of interest over the last few years in verification and analysis for GPU kernels. The
leaders in this area were the group of [inaudible] at the University of Utah, who have a tool called PUG
for analyzing CUDA kernels, which was published at FSE a few years ago. These days they are more
focused on a tool called GKLEE, which uses dynamic symbolic execution. It's based on the KLEE
execution engine developed at Stanford and Imperial College. And an interesting thing they've done
recently is extend their GKLEE tool to handle atomic operations, which is a very nice piece of work. The
University of Trento, collaborators on our CARP project, are looking at using separation logic with
permissions to prove data race freedom of GPU kernels. The idea here is that you have a permission
logic and you prove a kernel by showing that a thread can only write somewhere if it has write
permission. So it’s a nice application of separation logic.
There is a nice paper from ESOP a couple of years ago about the SIMT model that's used for CUDA
kernels, and yeah, another partner in our CARP project is looking at doing symbolic execution of GPU
kernels. And finally, aside from our work, there was a paper about using test amplification at PLDI last
year. The idea of this work is that you actually dynamically run a kernel and check for one trace whether
there were data races, and then you use some static analysis to try to discover whether that trace was in
any way influenced by inputs to the kernel. And if it wasn’t, you can conclude that the kernel is free
from data races.
>>: I have a question [inaudible]. Do they actually run the kernel or do they have a simulator?
>> Alastair Donaldson: I believe they did it with a simulator because it’s quite difficult with a, ah, it’s
very difficult to do logging on a GPU. Yeah, okay. Yeah, we published our work at OOPSLA last year. That
was the main paper about GPUVerify, and a more technical paper about some of the recent
developments in the tool appeared at ESOP this year.
So GPUVerify is a tool for verifying data race freedom, which I have described to you, and another
property, barrier divergence freedom, which I will talk about briefly later, for OpenCL and CUDA
kernels. So CUDA is a GPU programming model from NVIDIA, who are the market leader in GPU devices,
and OpenCL is a more general programming model that's been put together by the Khronos Group, a
consortium consisting of a bunch of partners, including, I would say, pretty much every major player apart
from Microsoft, I think it's fair to say. Microsoft has C++ AMP, which is another accelerated massive
parallelism [inaudible] which is different again. And, yeah, we decided to focus on both of these programming
models because they are very similar. We'd rather just focus on OpenCL I suppose for simplicity, but
CUDA is more widely used still. Hopefully that won't be the case in a year or so, but it is at the moment.
So before I go into details, I’m going to give you a demo of GPUVerify to give you a feeling of what it
does. Please just interrupt me if you have any questions.
So I'm going to write a little kernel to perform a reduction operation. What this is going to do is it's
going to take an array of ints in local memory, and I'm going to have the threads sum their neighbors'
elements, doing a tree reduction. So a thread will sum with a neighbor: if there are N threads, the
neighbor N over two places away, then N over four places away, then N over eight places away, then N
over sixteen places away, and at every iteration of the reduction half the threads will drop out of the
computation. This is a common thing to do in a GPU kernel to collate results. So I'm going to say for int d
equals N over two, where N is the number of threads, while d is greater than zero, and I'm going to
divide d by two by shifting it right by one. I'm going to say if my ID is less than d, so if I'm still active,
then A at my ID is incremented by A at my ID plus d, my neighbor d places away. I'm going to put some
defines here because there is no tid actually in OpenCL; I'm going to define tid to be get_local_id(0).
This is a built-in function that gets a thread's local ID in the 0th dimension. So these kernels can be
multidimensional. I am not going to go into the details of that here. Let's
assume the kernels are one dimensional, and N is going to be the number of threads in dimension zero.
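(The kernel being typed in the demo is roughly the following sketch; the defines mirror what is described, with N standing for the number of threads in dimension zero:)

    #define tid get_local_id(0)
    #define N   get_local_size(0)

    kernel void reduce(local int *A) {
        for (int d = N / 2; d > 0; d >>= 1) {
            if (tid < d) {
                A[tid] = A[tid] + A[tid + d];   // sum with the neighbour d places away
            }
            // no barrier yet, which is the source of the races reported below
        }
    }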
Alright, so assuming I have not made any syntax errors, the tool should do something on this. So first of
all the tool will complain and it will say that the [inaudible] group size must be specified using local size.
So what we're not trying to do in this work is parameterized verification. We're not trying to prove that
these kernels are correct for any number of threads. That of course would be a nice thing to do, but
kernels are not usually correct for any number of threads. They are usually correct for, say, every
number of threads that's a power of two, or every number of threads with some property. And second,
the theorem provers that the [inaudible] work with don't deal very well with non-linear arithmetic, and
we very commonly do something like multiply a variable by the number of threads. So if the number of
threads is a constant, that's okay, but if it's not a constant, that would be very hard to reason about. So,
GPUVerify, local size, let's try it with 1,024 threads, and similarly you have to say how many groups
there are. I'm just going to consider one group here. So, the tool will think for a minute. I tend to find
that when I first run a C sharp application it takes a while. Okay, so it has reported a couple of possible
data races. So it's saying that in kernel.cl there is a possible read/write race on the array A, at byte
offset four. So we see A cast to a character pointer and then byte offset four, and this is a read at line 9,
column 25 by thread zero of group zero, and at line 9, column 15, a write by thread one of group zero. So
if I go back to the example, then line 9, column 25 I think is this read here and column 15 is this write here.
Okay and you can see that because I have not got any barrier synchronization, there’s actually nothing
to stop one thread skipping ahead to a further loop iteration and interfering with another thread in a
previous loop iteration. So, to eliminate this problem, I can put a barrier synchronization in here. So,
barrier and then I give a flag to say this CLK local [inaudible] this is a way of saying I want to do a barrier
on local memory, so because this array is a local pointer, then that’s the right thing to do. Okay? So
does this look good? Yeah? Really?
Okay, so now what the tool is going to say is that there's one error: line 10, column 13, barrier may be
reached by non-uniform control flow. So with this barrier statement here I made a deliberate error, and
what I did was I enclosed the barrier statement inside this conditional, which means that some threads
will reach the barrier, but not all threads. And this is illegal in the OpenCL programming model. If you
have a barrier in conditional code, either all threads or no threads must reach the barrier. In other words
the condition guarding the barrier must be uniform. Okay, so the tool has detected barrier divergence
here. When I say detected, I'm running the tool in verify mode so it's not actually detecting anything,
it's just not managing to prove the code; or you can run the tool [inaudible], it will unwind the program.
So if I fix this problem, then what we should see now is some success. So the tool will tell us that there
were no data races within work groups, no data races between work groups, no barrier divergence, no
assertion failures, because you can write your own assertions in GPUVerify. Of course, no warranty
provided because this is a research prototype. Okay, I [inaudible]
>>: [inaudible]
>> Alastair Donaldson: It is intended to be sound, yeah, we're trying to do sound verification, but there
are various ways in which we're not sound. For instance, we made the pragmatic assumption that the
pointer parameters to a kernel point to disjoint arrays. That, I actually have a feeling that that may be
required in the spec. I should look that up. But even if it's not required, that's what people do. I mean,
if we didn't make that assumption we would just report data races everywhere. So that's one thing.
We don't do bounds checking, so it's possible you could have a buffer followed by another buffer, and
then you could overflow the bounds of that buffer and have a data race as a result, whereas we would
see writes to different arrays and we would say that they were race free. So there are various ways in
which, you know, we're sound [inaudible] a whole bunch of provisos.
Okay, and you know we can write some assertions, so I might say here: assert is-power-of-two of d, and
I could say x is a power of two if x bitwise-ANDed with x minus one equals zero and x is not equal to
zero. So a power of two is a binary number with one bit set. You can test that by saying, if you just
subtract one you get all the lower bits set, and if you AND them together you should get zero. Zero
satisfies that property too, and zero is not a power of two, which is why we also require x not equal to
zero. Okay, so now I could run GPUVerify on this and, ah, you might notice that I made a slight tweak to
the kernel to make this not hold.
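(The assertion added in the demo is roughly the following; the macro name and the exact assertion syntax are illustrative:)

    // A power of two has exactly one bit set, so x & (x - 1) == 0;
    // zero also satisfies that, hence the extra x != 0 conjunct.
    #define is_pow2(x) ((((x) & ((x) - 1)) == 0) && ((x) != 0))

    for (int d = N / 2; d > 0; d >>= 1) {
        assert(is_pow2(d));               // checked by the verifier on every iteration
        if (tid < d) {
            A[tid] = A[tid] + A[tid + d];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }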
>>: [inaudible]. Was there a loop there?
>> Alastair Donaldson: There is a loop there, yeah. But we have to have loop invariants to do all that
proof right?
>>: Yeah.
>> Alastair Donaldson: Yeah, okay, so, assertion might not hold for thread 544, group zero. And then
you might think, hang on a minute, is that problem specific to thread 544? So we could say here
something like assume that tid is less than ten, for instance, which is just something you could do as a
programmer to rein in the verifier a bit. Okay, so it should complain there about thread 8. So it's using
a constraint solver behind the scenes and we have just told the constraint solver I want you to find me a
thread less than ten. Obviously you have to be very careful, because if we do something stupid like that,
then the kernel of course will become correct all of a sudden. Okay, but anyway, what I wanted to show
you was that now we could run the tool in bug finding mode, because as Shaz said, we need loop
invariants to prove correctness of these kernels and it may be that the kernel is correct, but we didn't
find the loop invariant.
>>: Can you sometimes [inaudible] loop invariant on the C level?
>> Alastair Donaldson: Yeah.
>>: C level [inaudible]?
>> Alastair Donaldson: Uh huh. So I could say find bugs and then loop unwind equals four for instance,
and oh, wow [laughter]. Okay, let’s not do that.
>>: [inaudible] So right, now if I said find bugs and I have 1024 threads, how many threads is it going to
use? Did you specify in the command line how many threads to use with [inaudible]?
>> Alastair Donaldson: Yeah, so I say 1,024 threads. But the tool is only ever going to consider two
threads, which I'll come to in a minute, and to make that sound we use this abstraction. So when I say
find bugs, it's going to be finding bugs not employing any kind of loop invariant abstraction, but it will be
employing some abstraction still, so these bugs could still be false positives due to that abstraction.
>>: But will it be, like, say, some kind of context bounding?
>> Alastair Donaldson: Ah, no it’s not. Let me get on to that. That’s the next part of the talk. Okay, well
I, anyway, I would have been able to show you something is a bit wrong. Now I would have been able to
show you that actually because I said greater than or equal to zero, this really is a bug and we’d find it if
we [inaudible] enough and if we, but what I can show you is that if we just say greater than zero, then
the tool should be able to verify this. Oh man, something is seriously wrong. Okay. I’ve been working
on this, adding new features, so never do that before a talk.
>>: [inaudible]
>> Alastair Donaldson: Oh right, oh right, oh right, right.
>>: You actually have found that bug, right? [inaudible] say ten or something to get all the way -
>> Alastair Donaldson: Yeah, it wouldn't hurt. But I have now fixed the bug and -
>>: So now it should be inductive, right?
>> Alastair Donaldson: Yeah.
>>: [inaudible]
>> Alastair Donaldson: Yeah but it should get this.
>>: [inaudible]. But that should be inductive, right? Because you can take a power of two and divide it
by two and it’s either zero or another power of two, right? So it should work.
>> Alastair Donaldson: Yeah.
>>: Can you, can you get out the actual variable values? When, when an induction fails?
>> Alastair Donaldson: Ah, so, no, we, yes in principle, because it just, yes in principle, but no not right
now. Okay, I’m now going to try and debug this. I know, it’s kind of frustrating because –
>>: We know how this goes.
>> Alastair Donaldson: Yeah. Alright, so let me tell you now about the verification strategy behind the
tool. So the first thing is, what we’re trying to do in this project is actually exploit the simplicity of the
GPU programming model to come up with a very efficient verification method. So the first thing we
exploit is the fact that data races always occur between a pair of barriers, so barrier one and barrier two,
alright? So we have a barrier-free region of code and we can have a race between statements of
different threads in this region, but we can't have a data race between, say, a thread doing something
up here and a thread doing something over there in a different region, because they can't be at these
places at the same time. So this immediately makes the program analysis problem easier, if we can restrict attention
to barrier-free regions of code. This is something that the PUG approach also exploits.
Okay, the next thing to observe, which is something that was kind of new to me, but actually turns out
to be a fairly well known thing, is that when you are doing data race analysis, you can cut down
[inaudible] the number of schedules you consider if you are guaranteed to abort on a data race. So,
actually between a pair of barriers A and B, we can pull the following trick. We can run thread zero all
the way from A to B and log all of the accesses that it makes. Then run thread one all the way from A to
B and log all its accesses and also check all of its accesses against those of thread zero. And if we find
that there is a problem between these we abort immediately. Otherwise you run thread two all the way
from A to B, log all of its accesses and check them against threads zero and one, and abort if we find a
race. And we keep doing this until we finally run the final thread from A to B. I guess we don’t need to
log all of its accesses, but we do need to check them against those of all the other threads and abort if
we find a race.
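(In pseudo-C, the serialized schedule just described looks roughly like this; run_thread_from_A_to_B and NUM_THREADS are placeholders:)

    // Between two barriers A and B, run the threads one after another.
    // run_thread_from_A_to_B(t) executes thread t's code for the region,
    // logs every access it makes, checks each access against the accesses
    // already logged by threads 0 .. t-1, and aborts on a conflict.
    for (int t = 0; t < NUM_THREADS; t++) {
        run_thread_from_A_to_B(t);
    }
    // If no race is found, the state at barrier B is the same as under any
    // other schedule, because threads can only influence each other by racing.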
So, the thing to observe here is that if you think about it, a data race always occurs between some pair
of threads. So with this schedule, if there could be a data race between some pair of threads, between
these two barriers, then this schedule will find the race because of the logging and checking. And if
there’s no race, then actually this schedule will lead to precisely the same state here as any other
schedule would have. If you think about it, you might think well, that’s not true because a different
interleaving might have resulted in different interactions between the threads, leading to a different
state. But the threads can only interact by racing with each other, and if they race, we abort. Does
anyone disbelieve that? Or do I need to clarify it further? I think it's -
>>: So essentially, no one can, between barriers, no one can read anything that's been written by anyone.
>> Alastair Donaldson: Since the barrier, yeah.
>>: Right, so they have to be independent.
>> Alastair Donaldson: Yeah, so it's exactly as you said: well, they can, but in that case we're going to abort.
>>: Well, right, so I mean so you don’t even have to know the order of reads and writes from the
threads when you check the logs, right, you just have to say, did anyone read some –
>> Alastair Donaldson: Yeah, that’s right. We can record these as sets, not sequences.
>>: [inaudible]
>> Alastair Donaldson: Yeah, ok, so, a pretty straightforward trick. And this immediately reduces us to
the problem of sequential program verification. We can rewrite a GPU kernel as a sequential program
where we serialize the threads in this schedule or in fact any other schedule we choose. We could take
a round robin schedule.
>>: Or in particular what it means is that you have a very simple sequential [inaudible].
>> Alastair Donaldson: Yeah.
>>: As opposed to the rather more complex case if you allow the threads to interact.
>> Alastair Donaldson: Yes, that’s right, yeah. Okay, so this avoids reasoning by interleavings, which is
good. Or we can actually do better and we can observe that data races occur between just pairs of
threads, so what we can do is we can pick an arbitrary pair of threads I and J, inside this region. Sorry,
inside the range of threads. And now we can consider barriers A and B and we can consider running
thread I from A to B and logging all of its accesses and then running thread J from A to B and checking all
of its accesses. Now because I and J are arbitrary, I am not, by the way, talking about choosing a specific
pair like thread one and two. I’m talking about taking, considering every possible pair. If we can show
for every possible pair, that between these barriers they can’t race, then there can be no race between
any threads for the barrier region. Right? So, in some feedback on our work, we’ve had comments that
this is quite similar to ideas in protocol verification, where you pick a pair of processes and you make the
other process abstract. I think that there are similarities but the key difference here is that we actually
have to consider all of the pairs because these kernels are not symmetric, even though the threads run
the same program, they do not have to do the same thing. They can follow different control flow.
>>: [inaudible] symbolic constants.
>> Alastair Donaldson: Yeah the symbolic constants, that’s right.
>>: So, the solver will consider all possible values.
>> Alastair Donaldson: Yeah, exactly. So, if a data race exists, then some choice of I and J will expose it
and if we can show for all I and J, that this little program fragment is free from data races, then we know
that there can be no data races if we did have all the threads executed.
>>: But on the other hand if you don’t, and if you have the data race, it can be a false one simply
because you started it in [inaudible] state.
>> Alastair Donaldson: So that's the, yeah, if this is the beginning of the kernel and we had a
precondition on the entry state, then up to the first barrier we could be precise, right? But this brings
me to my next slide, which is: is this actually a sound thing to do at all? So say we had barrier A, barrier
B. We pull this trick of running thread I and then thread J and checking for races between them. And
then we have barrier B to barrier C and we pull the same trick again. Well, if you think about it, this is
not a sound thing to do, because it would appear that the other threads just don't exist; these two
threads would never see the world changing around them. But the point of a barrier synchronization is
that now we can see what the other threads did and we can safely read it. If we don't model those
other threads and we just continue, then, you know, you might have an array that is all zeros, and the
threads are going to set it to ones, so it should be all ones when we reach the barrier. But we are only
going to see it become one in two places and carry on. And then the analysis is going to be nonsensical.
So this is not sound on its own. To make it sound we have to make the shared state somehow abstract,
to model the possible effects of all the other threads.
So there are two things that we have explored here in our initial work, and something more
sophisticated recently. The simplest idea is to make the shared state completely arbitrary. There are
two ways of doing that. One is you can havoc the shared state every time you reach a barrier. Another
thing you can do is you can actually just remove the shared state completely and treat all reads as
nondeterministic reads. So if you read into a variable, you just havoc the variable, and when you write
to the shared state you simply, you log where the write would have gone for race checking purposes,
but you just remove the actual write statement. So we have both options. The second option has the
advantage of, well, it avoids the need to give the theorem prover arrays to reason about, which can
lead to better efficiency. But it has some disadvantages as well, which I won't go into
right now. But I can tell you about them later if you're interested.
Alright, so the GPUVerify verification strategy is to exploit the "any schedule will do" trick, and the "two
threads will do" trick, with some abstraction to make the whole thing sound, and also with predicated
execution, which I will come to in a little while if I have time, to turn a massively parallel kernel K, so a
program that we want to consider for potentially thousands of threads, into a sequential program P that
is linear in the size of K, the text size of K, such that if P can be proved correct, by which I mean that no
assertions can fail in P, and I mean partially correct here, not totally correct, then K is free from data
races and barrier divergence. So this is the meta-theory behind our approach.
>>: But if you are going to consider only two threads, then you don’t really need predicated execution,
right? Because you can just copy the kernel twice.
>> Alastair Donaldson: Not if they step into procedures and that kind of thing.
>>: Oh yeah, I was forgetting about that. And there could be loops there also.
>> Alastair Donaldson: Exactly, yeah. So, I will come into predicated execution shortly and we can
describe things. How am I doing for time? I don’t have a watch on me.
>>: Oh, ah
>> Alastair Donaldson: Oh there, okay, cool. Okay, so let me tell you briefly about the tool chain
architecture. What we do is we take an OpenCL or a CUDA kernel, and in future we would like to
consider C++ AMP kernels, but they are more challenging because C++ AMP has this nice thing of
being a single source solution, where you write a C++ program and there is special use of templates to
describe that you want some piece of code to be accelerated, which is a bit like saying it should be run
as a GPU kernel, although AMP is in principle more general than that. This makes it quite difficult for an
academic tool to parse. So these are easier targets. We use the CLANG and LLVM compiler framework,
and particularly the CLANG front-end, to turn the kernel written in one of these programming models
into a Boogie program, a sequential Boogie program. So what we actually do is we parse the kernel
and turn it into one kind of Boogie program and then we have this kernel transformation engine that
applies all our tricks to produce a sequential program to be verified. And then we give this to the Boogie
verification engine developed here at Microsoft Research, and Boogie uses an SMT solver, principally the
Z3 SMT solver, although we're looking right now at support for the CVC4 solver as an alternative, which
has a more industry friendly license.
And then for verification to work we have to generate candidate loop invariants and procedure pre- and
post-conditions. Although, GPU kernels don't allow recursion currently, and they're not that big; the
programs tend to be hundreds or maybe a thousand lines of code rather than tens of thousands of lines
of code, so far we have found it's always better to do full inlining than to try to actually infer contracts.
Sometimes we can infer contracts, but actually it's more expensive to do the inference than to just
inline everything, and often the inference will fail. So really our effort has been on candidate loop
invariants. And a good thing about this setup is that CLANG is extremely widely used. It's being used by
almost everybody these days. Boogie is very widely used in the verification community, and Z3 I suppose
is even more widely used, because I would say that every verification tool at the moment seems to use
Z3, and then I think there are a whole bunch of other uses as well.
So, these things are being improved all the time. Other people fix bugs in them, and the only magic of
our approach, where we have to actually be really careful we don't introduce unsoundness, is in this
component here. So we can put all of our brainpower into making this correct and rely on those other
things being as correct as can be expected from complicated [inaudible] software.
So yeah, the soundness of our approach is much easier to argue than it would be had we built some
complete verifier for kernels where we actually did the verification condition generation in some smart
way ourselves.
Okay, so now what I want to do is show you an example of how we take a kernel that doesn’t have any
loops or conditionals and do this two threaded transformation. I’ll go through this reasonably quickly
because it’s fairly straightforward. So this is an OpenCL kernel. And what we do is we generate a
sequential program. I’m going to show you in C form here. It would really be a Boogie program.
[inaudible] And we have a precondition. We introduce two symbolic constants tid$1 and tid$2. And
we have a precondition saying that they are in the range of thread ID’s that are between zero and N.
And that they’re different from one another. So this is our way of considering two arbitrary distinct
threads. We have the symbolic constants tid$1 and tid$2. And in what follows, $1 and $2 are going to
be used to indicate the version of a variable for the first thread or the second thread. And just to
be clear, I’m not talking, I’m very much not talking about threads 1 and 2. I’m talking about the first and
second of the two arbitrary threads under consideration.
Okay, what we do is we take the parameter idx and we say that every, that each thread has its own copy
of this parameter. And we add a precondition saying that these copies are initially equal because the
kernel gets invoked with parameters being passed by value and every thread receives the same value for
parameters. So the threads could in principle change these parameters later, which is why they need
their own copy but initially the values will be the same.
We just remove this array A, because I’m going to show you this abstraction which we call the
adversarial abstraction where the shared state just disappears completely. Then X becomes X for thread
one, X for thread two, Y becomes Y for thread one, Y for thread two. Now, this read from A at tid plus idx
into X turns into the following. First of all we log that a read has occurred from A and we log that
thread one was the reader. Just thread one. We’re not going to consider thread two reading, only
thread one. So, tid$1 plus idx$1, that’s the offset for the read. And then we check that thread two
reading from A of this offset is okay with respect to any prior reads or writes that have happened, which
in this case would just be that read. Okay, but this is the general translation. So we do logging for
thread one and checking for thread two. And then to reflect the fact that X will be modified by reading
from the shared state, we havoc X for both of the threads. So both threads are doing the read, but
we’re logging it for the first thread and checking it for the second thread, but we have to reflect the fact
that the read happened for both of them so we havoc both copies of X. And this models the fact that A
could have been changed arbitrarily by other threads that may have even been having data races with
each other. Then we do the same for Y. So, a log read, check read, havoc Y. And then slightly more
interestingly, when we write into A at tid we write the value X plus Y. Then what we do is we log a write
by thread one at index tid$1 so at thread one’s ID. Then we check a write at index tid$2 for thread two
and then there is actually nothing more to do, so for the read case we have to havoc the receiving
variable, but for the write case we don’t have to do anything to reflect the effect of the write because
we have actually removed the shared state completely. So it's like the write disappears into the void,
and in return, reads can give back anything.
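(Putting the translation just described together, in the C-style form used on the slide; the original kernel body and the index of the second read are reconstructions, and nondet() stands in for havoc:)

    // Hypothetical original kernel (a reconstruction):
    //   kernel void foo(local int *A, int idx) {
    //     int x = A[tid + idx];
    //     int y = A[tid + idx];
    //     A[tid] = x + y;
    //   }
    //
    // Two-threaded sequential form under the adversarial abstraction.
    // Preconditions (written as comments):
    //   requires 0 <= tid$1 && tid$1 < N && 0 <= tid$2 && tid$2 < N && tid$1 != tid$2;
    //   requires idx$1 == idx$2;    // parameters are passed by value, same for every thread
    void foo(int idx$1, int idx$2) {
        int x$1, x$2, y$1, y$2;          // the array A itself has been removed

        LOG_READ_A(tid$1 + idx$1);       // thread one logs its read
        CHECK_READ_A(tid$2 + idx$2);     // thread two checks its read against what is logged
        x$1 = nondet(); x$2 = nondet();  // havoc: the value read is arbitrary under the abstraction

        LOG_READ_A(tid$1 + idx$1);       // the read into y is translated the same way
        CHECK_READ_A(tid$2 + idx$2);
        y$1 = nondet(); y$2 = nondet();

        LOG_WRITE_A(tid$1);              // thread one logs its write, at its own thread ID
        CHECK_WRITE_A(tid$2);            // thread two checks its write; the written value disappears
    }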
Alright, so now let me explain briefly the race checking instrumentation we use. This is a bit of an
implementation detail. We could have done something different here but this we found to work very
well. So for every array parameter, we introduce a bunch of global variables. We have a variable read
has occurred for the array, which is a Boolean and a variable write has occurred which is a Boolean. And
the idea here is that this Boolean will be false if we are currently not considering any read being in flight
for this array. And it will be true if we are considering some read.
And then we introduce a variable read offset A and a variable write offset A, which are integers. So the
idea is that if read has occurred is true, then read offset stores the offset corresponding to the read. If
read has occurred is false, then the value of read offset is irrelevant. So this allows us to track either
zero reads or one read, but no more than that.
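(Per array A, the instrumentation state just described amounts to something like this:)

    bool READ_HAS_OCCURRED_A;    // true iff some read from A is currently being tracked
    bool WRITE_HAS_OCCURRED_A;   // true iff some write to A is currently being tracked
    int  READ_OFFSET_A;          // offset of the tracked read; irrelevant if the flag is false
    int  WRITE_OFFSET_A;         // offset of the tracked write; irrelevant if the flag is false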
Does this make sense so far? And I’ll show you how we’re going to use it in a minute. Let me introduce
four procedures, a procedure called log read A, which takes an offset, log write A, which takes an offset,
check read A and check write A, each of which take an offset. And then there is the, the log procedures
will be invoked with respect to the first thread, and the check procedures with respect to the second
thread. So what we’re going to do is we’re going to consider just the first thread logging and just the
second thread checking, but because we’re going to consider all possible pairs of threads, this is okay.
So we’re exploiting symmetry here. Okay and we get rid of the array parameter. Alright.
>>: The idea is that the log read is now somehow going to be non-deterministic.
>> Alastair Donaldson: Yeah, okay. You’re one step ahead. Alright, so I think I have explained this,
yeah. This is for an undergraduate summer school, so I was stepping through a bit more slowly, but I
think we can skip on.
Okay, so log read takes in an offset, and an immediate thing you might think of would be to just say
that a read has occurred, and it occurred at this offset. But this wouldn't be good enough, because this
would mean that we would only be logging the latest read, and we won't be able to check for races
against any prior read that happened since the last barrier. So, what we do is we wrap it in an
[inaudible] star. So the theorem prover, or the verifier, should consider that the program either does log
this read, or that it just continues to log whatever it was logging, if anything. Alright?
So, yes, star is an expression that evaluates non-deterministically. So we either log this read, in which
case read has occurred A is set and the offset is overwritten, or we leave them alone, in which case they
keep whatever they were already recording. And log write is exactly the same. Now check read is very
do whatever they were already doing. And log write is exactly the same. Now check read is very
simple. We just assert that if a write has occurred to A by the other thread, then the offset written to by
the other thread must not be the same as this offset that I am checking. And that’s all we have to do.
And, yeah, this is what I just said. And check write is slightly more sophisticated because a
write can race with either a write or a read. So we check that if a write has occurred, then the offset
written to must be different from the offset that I’m writing. And if a read has occurred, the offset read
from must be different from the offset that I am writing. Okay.
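(The four procedures for array A, as described, look roughly like this; nondet() stands for the non-deterministic choice:)

    void LOG_READ_A(int offset) {
        if (nondet()) {                  // either start tracking this read ...
            READ_HAS_OCCURRED_A = true;
            READ_OFFSET_A = offset;
        }                                // ... or keep tracking whatever was tracked before, if anything
    }

    void LOG_WRITE_A(int offset) {
        if (nondet()) {
            WRITE_HAS_OCCURRED_A = true;
            WRITE_OFFSET_A = offset;
        }
    }

    void CHECK_READ_A(int offset) {
        // a read can only race with a write by the other thread
        assert(!(WRITE_HAS_OCCURRED_A && WRITE_OFFSET_A == offset));
    }

    void CHECK_WRITE_A(int offset) {
        // a write can race with a write or with a read by the other thread
        assert(!(WRITE_HAS_OCCURRED_A && WRITE_OFFSET_A == offset));
        assert(!(READ_HAS_OCCURRED_A && READ_OFFSET_A == offset));
    }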
And then finally, we have a precondition on the whole kernel, saying that initially there’s no read and no
write on the array A. And we do this for every array. In principle we could just have a single Boolean
and a single offset and we could store for a read, which array it was from, which offset. Okay? And
that's actually what I tried first; then Shaz suggested splitting it up across the multiple arrays, because I
think the single-variable version was giving the theorem prover a really hard time, and also it means
that, well, if you know about theorem prover based verification, there are these modified sets for loops,
and it would mean that every loop that does a read is going to kill this logging state for every array in
the kernel, and you have to have an invariant recovering where reads could have happened for, you
know, all the arrays, even if they weren't touched by that loop. So splitting things up has some advantages.
And a barrier is quite nice to implement. Well the first way I thought about doing it was that barrier
should set read has occurred and write has occurred to false by assigning to them. And then, I realized
that this was really inconvenient, because you might have a loop that does not read or write a particular
array, but it’s got a barrier in it. And then because we were assigning to these variables, they were in
the loop modified set. And then we had to have invariants saying that you know they’ve got the same
value they had before and yeah we had [inaudible] invariants about these variables whereas if we just
assume that they are false, this does the trick. Okay, so the intuition behind this is that, if you remember
the instrumentation, there was always a possible non-deterministic choice not to track a read, not to
track a write. So there is always one path that is very lazy. It doesn't do anything. It just goes no, no,
no, no, no, no. And then there is this tree of other paths that do track some reads and some writes
against each other, and they're the important paths that actually find data races. But when you hit a
barrier, you basically say: assume that we did the lazy thing. So all the paths that did track reads and
writes, they get snipped, and we only have the path from which we can do race logging and checking
afresh from there.
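(So the barrier, for a single array A, becomes roughly:)

    void BARRIER(void) {
        // assume rather than assign, so the flags do not end up in loop
        // modified sets; only the "lazy" paths that tracked nothing survive,
        // and race logging and checking start afresh from here
        assume(!READ_HAS_OCCURRED_A);
        assume(!WRITE_HAS_OCCURRED_A);
    }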
Okay, any questions about this? I mean it’s, it’s kind of lower level but I find it quite interesting. I find
this assuming business quite interesting. Yeah.
>>: Well, this [inaudible], so if you have a single thread which performs a read and a write, so the read
and the write are in the same thread [inaudible]. So will this [inaudible] say there is a data race inside a
thread? [inaudible] in sequential -
>> Alastair Donaldson: You mean like, so we have an array A and we do something like a read, so X
equals A of some offset?
>>: [inaudible]
>> Alastair Donaldson: [inaudible]. And then a write did you say, like A tid equals X plus one for
example?
>>: Yes. So will this be considered as a data race?
>> Alastair Donaldson: No, because there is no race checking performed within a thread. So, it’s, we’re
not going to check. Like say thread six, we’re not going to check does thread six’s read conflict with
thread six’s write. But for thread nine, we will consider does thread six’s read conflict with thread nine’s
write. But they won’t conflict because they use tid which is different for every thread.
>>: Okay. [inaudible]. As I see it, in the transformation you showed before, the first one will generate a
read.
>> Alastair Donaldson: So we do like log read, tid 1, yeah? And check read tid 2. That’s what we would
generate, basically.
>>: Oh, checks. Okay I see [inaudible]
>> Alastair Donaldson: And then we would do log, write tid 1 check write tid 2.
>>: Yes, when you log write tid. Oh you never check tid 1.
>> Alastair Donaldson: No, we only ever check tid 2. Yeah, so the first thread is the logger and the
second is the checker. So basically, we're going to consider that log read and that check write; we will
look for conflicts between those. And for conflicts between those, yeah, that case. And actually that case
there would check for a write, [inaudible] to the write. In both cases you can see it's tid 1 and tid 2,
which will be different. So, good question.
>>: [inaudible] only attempt the race between different -
>> Alastair Donaldson: Between different threads. That's right.
Okay, now let me tell you how we handle loops and conditionals. So, we use predicated execution. The
idea here is that we flatten the kernel, so that all threads execute the same sequence of statements.
We more or less eliminate any conditional code. We can’t eliminate loops though, so we have to do
something a little bit sophisticated for loops. So let me explain first of all independently from GPU
kernels how predicated execution works. Just consider this snippet of code. If X is less than 100,
increment X, otherwise increment Y. We can make this predicated by introducing two fresh Booleans, P
and Q, and we can say that P is assigned to the Boolean X less than 100, and Q assigned to the negation
of that, not X less than 100. And then we can have, if this problem is being executed, then both of these
things will be executed but in predicated form, so we will say that X becomes equal to X plus one, if P
holds. Otherwise it just gets assigned back to X. So effectively this is a no op if P is false. And then we
can execute this statement. So Y gets assigned Y plus one if Q is true, otherwise it gets reassigned Y.
Okay?
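(As plain code, the predicated form of that snippet is roughly:)

    // Original:
    //   if (x < 100) { x = x + 1; } else { y = y + 1; }
    //
    // Predicated form: no branches; both statements always "execute".
    bool p = (x < 100);
    bool q = !(x < 100);
    x = p ? x + 1 : x;   // a no-op when p is false
    y = q ? y + 1 : y;   // a no-op when q is false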
So, this is something that is used sometimes in compilers. If you have got, say, a processor where
branches are quite expensive, you might have a little bit of conditional code, [inaudible] a simple
if-then-else where you don't have many statements in either the then branch or the else branch. It
might be more efficient to flatten the whole thing using predicates than to actually have a branch, and
avoid the expense of that.
>>: For example, [inaudible] a question about this compiler you are talking about. How would a
compiler translate P ? X + 1 : X, [inaudible] use a branch?
>> Alastair Donaldson: Well it would use a special instruction, a special select instruction.
>>: Oh I see.
>> Alastair Donaldson: Yeah, so this works when the architecture supports that kind of instruction.
It's a very common thing: if you want to vectorize code automatically, you first eliminate these
conditionals, and then vector architectures [inaudible] have a select operation that can crunch through
things like this. So you make like a vector of Booleans; the select takes a vector of Booleans and then a
vector of then-values and a vector of else-values. But it doesn't work if you've got really deep
[inaudible] stuff, because if you flatten it all, then you've got selects with selects with selects with
selects inside them. That's fine for a theorem prover though. Okay.
So, what we do is we apply predication so that at every execution point there is a predicate determining
whether each of the two threads is enabled, and then we parameterize the log and check procedures
with a predicate recording whether the threads are actually logging or actually checking, or whether
really they're not supposed to be there because they didn't get into that part of the code.
So, yeah, for an assignment like X becomes equal to E, we turn that into the following for each of the
threads, if we've got a predicate P we're translating with respect to [inaudible] P: we do the select thing.
So we say for the first thread, X gets E$1, by which I mean the expression E dollarized, so I'm going to
take all the local variables and turn them into their $1 form. Okay, so if P holds then E$1, otherwise X
gets left alone. And then an array read
becomes, we log the read for thread one, but now we pass in P1 to record whether P was really alive or
not. And the same for checking for thread two. And we do the havocking, but now the havocking has to
become predicated so we don’t necessarily havoc X. We only havoc X if we are enabled. So we now
have to do something like X1 becomes equal to if P then arbitrary, otherwise X1. In fact, like we, we
have to do something a bit different from that in Boogie because you can’t have star in Boogie. And a
write is almost the same as before but we pass in these predicates.
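(Putting it together, the predicated two-threaded translation under a predicate P looks roughly like this, in the same $-notation; nondet() again stands for havoc:)

    // x = e, under predicate P:
    x$1 = P$1 ? e$1 : x$1;        // e$1 is e with every local variable replaced by its $1 copy
    x$2 = P$2 ? e$2 : x$2;

    // x = A[e], under predicate P:
    LOG_READ_A(P$1, e$1);         // the log and check procedures now take the predicate too
    CHECK_READ_A(P$2, e$2);
    x$1 = P$1 ? nondet() : x$1;   // predicated havoc: only an enabled thread's x changes
    x$2 = P$2 ? nondet() : x$2;

    // A[e1] = e2, under predicate P:
    LOG_WRITE_A(P$1, e1$1);
    CHECK_WRITE_A(P$2, e1$2);     // nothing more: the written value vanishes under the abstraction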
Alright, now I think I have some slides about loops coming up. Yeah, the [inaudible] is where the
predicates come from. So if we've got: if E do S otherwise do T, we introduce a new predicate Q, where P
is our current predicate. So we say Q for thread one is P for thread one and E$1. So we take the existing
predicate and we strengthen it with the conditional. And then we have another predicate R which is the
incoming predicate strengthened with the negation of the conditional. And in case you were wondering
before, by the way, why I had P and Q and didn't just use P and not P, maybe this will answer your
thought: Q and R are not necessarily going to be negations of one another, because they both involve P.
Right. [inaudible]
Okay, and then we translate the then side with respect to Q and then we translate the else side with
respect to R. Okay, and then the loop case I think is the most interesting. What we do here is we can’t
eliminate the loop, but what we do is we force both threads to keep executing the loop until they’re
both done with the loop. So, if neither of them wants to enter the loop, you skip the loop. If one of
them wants to enter the loop and one doesn’t, they both enter the loop and they both execute the loop,
but the one that didn’t really want to go in the loop just does nothing until that first [inaudible] is
finished and then they both leave. Okay?
So we turn this into the following: We evaluate the loop guard into a predicate Q, so it's the loop guard
strengthened by the incoming predicate. And then we loop while either Q1 is true or Q2 is true. So we
loop while at least one of the threads is enabled. We translate the body of the loop with respect to Q so
this means that a thread, when we translate [inaudible] the body with Q, this will make sure that if a
thread was not enabled, and its Q was false, it won’t do anything. And then we update Q, so we say that
Q becomes its old value strengthened by the loop guard, which may have changed, probably will have
changed, according to the loop body. Okay.
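(So a loop while (e) { S } under an incoming predicate P becomes, roughly:)

    q$1 = P$1 && e$1;             // per-thread loop predicate
    q$2 = P$2 && e$2;
    while (q$1 || q$2) {          // keep going while at least one thread is enabled
        // ... S translated under predicate q: a thread whose q is false does nothing ...
        q$1 = q$1 && e$1;         // re-evaluate the guard; once a thread leaves, it stays disabled
        q$2 = q$2 && e$2;
    }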
>>: And so in general, you handle non-structured code, right? LLVM
>> Alastair Donaldson: We do, yeah. That’s right. That’s what our ESOP paper was about. That was
actually in my opinion one of the biggest challenges of this project. So, we had all this figured out more
or less when I was at Microsoft. And then we started to build a front-end based on CLANG’s AST at the
structured level, because that was the easiest thing to do in our minds. And that worked okay, but it
meant that we couldn’t handle kernels that did switch statements or kernels with breaks and continues
or they were hard to handle. So yeah, so we thought [inaudible], then we thought, how do we do this
predicated execution at the unstructured program level? Because at the structured level it's very nice:
because of the hierarchical structure you have a predicate you can descend in and descend in. Anyway,
Shaz had a very
smart idea and we worked out the details.
>>: I'm trying to understand the purpose of this predication, because a lot of it seems like what Boogie
would already do for verification. So is it because of the loops that you have to introduce this explicitly,
that you just can't let Boogie -
>> Alastair Donaldson: What we're trying to do is just, so, we're trying to give Boogie a program whose
correctness implies race freedom of the GPU kernel. And this, at the moment, we’re just trying to
construct that program. Once we have that program, Boogie then has to do its usual thing on the
program. It has to do verification condition generation.
>>: You argued previously that we could just run thread one followed by thread two.
>> Alastair Donaldson: Yeah, yeah yeah.
>>: Right, and that was going to be our schedule. In fact, you have sort of –
>> Alastair Donaldson: Oh well I told you that you could do that schedule or maybe you could do a
round robin schedule where you run like thread one makes a set, thread two makes a set, thread one
makes a set, thread two makes a set.
>>: [inaudible] so the question is why in fact do you not just do the schedule where thread one executes
all the way to the barrier and then thread two? And I think the answer is like you said, which is that it’s
the loops because you want to get a joint invariant for those two threads at the loop heads. Correct?
>> Alastair Donaldson: What I would say is, like, more fundamental than that, well, take barriers for a minute: how do you even write down that program where, say, you’ve got a barrier in a loop for instance and –
>>: If you, I would argue that if you did not have any loops in the program, right?
>> Alastair Donaldson: Yeah.
>>: It was straight-line code. Then what you would do is you would just replicate that code from one barrier to another twice: one for tid 1, one for tid 2.
>> Alastair Donaldson: What if the barriers are nested inside conditionals?
>>: Well, so –
>> Alastair Donaldson: If you have no conditionals, then yes, you can do exactly that replication thing. If you’ve got conditionals, like a nest of conditionals with barriers here and barriers there, how would you get these barrier regions, these barrier-to-barrier regions? It becomes tricky. And then also, what would you do if you’ve got procedures, and you’ve got, like, some code where you go into a procedure and then there’s a barrier? Then you have to basically expand everything, and I know I said we do full [inaudible], but we don’t want to be limited to that. We would like to be able to do a modular analysis.
>>: Yeah, I mean, I guess if you know where the barriers are and you can go from one barrier to the next –
>> Alastair Donaldson: Yeah, okay, so yeah, I think it’s a really good question, and that was actually what I very first thought of, but the problem I found was that this notion of going from one barrier to the next becomes quite complicated with conditionals. The next barrier might be the same
barrier having gone around two iterations of a nested loop, or something. So I couldn’t work out how
you could write out this program without starting to expand all the possibilities. You know, you might
be able to kind of expand all the possible resolutions of the conditionals, but then I think the problem
would grow very large to [inaudible] the cases.
>>: Well, at least you have to consider all the barriers and barrier [inaudible].
>> Alastair Donaldson: Yeah
>>: Actually, which is [inaudible] and this is linear, right?
>> Alastair Donaldson: This is linear, that’s right.
>>: Okay.
>> Alastair Donaldson: And I think the thing that this brings is that, as I hope people will write more complex GPU kernels [inaudible], verification will become more important, and thus modular verification will become important. And then I think a strength of this predicated
approach is it means that actually both threads appear to be stepping into a procedure at the same
time, and then you really can verify procedures in isolation and use specifications.
>>: What’s the translation, okay, so you haven’t shown us the predicated barrier –
>> Alastair Donaldson: Yeah, so that’s the [inaudible]
>>: So we haven’t seen enough slides –
>> Alastair Donaldson: Yeah, no. So I’m running over time a bit now, so I’ll go on a few more minutes –
>> Shaz Qadeer: It’s alright, because there were lots of questions [inaudible]. You can continue up to
11:45.
>> Alastair Donaldson: Okay. If you need to go, I won’t be offended. Okay, so yeah, barrier, we turn
into barrier, giving it both the predicates, predicate one, predicate two. Alright. And now, so the log
read and log write, they get modified in the obvious way, so we now have an enabled parameter.
[inaudible] is the thread enabled. And now, if the thread is not enabled, we do not log its access, but if it
is enabled, we may log its access. So this is like before, but just with this predication. Okay, and the checks are predicated in the same way, so check write says: if I’m disabled, there is nothing to check, I’m not really here, I’m not really going to race; but if I’m enabled, then do the usual check. Okay, but note that –
>>: The barrier operation was just this assertion that said, ah, we haven’t logged anything.
>> Alastair Donaldson: Yeah
>>: Right, so now –
>>: Oh, sorry, never mind.
>> Alastair Donaldson: So let me show you here, so the barrier takes these enabled parameters. Now this is how we check barrier divergence, and actually that’s why we first introduced predicated execution: we wanted a way to check barrier divergence. Remember we talked to all these people about what this barrier divergence problem is. We finally narrowed it down, and that was the initial source of predicated execution. But then it had various other benefits which I now, I think of –
>>: Is barrier divergence when the two threads wind up at different barrier instructions?
>> Alastair Donaldson: So, it’s slightly more subtle than that. If they wind up at different barrier instructions, that is barrier divergence. However, if they get to the same barrier, but they executed different numbers of loop iterations, that is also barrier divergence. So for instance, if you’ve got an outer loop and an inner loop, and a barrier inside the inner loop, it is not permissible for one thread to basically go outer loop, inner loop, inner loop, inner loop, inner loop, inner loop, and the other thread to go outer loop, outer loop, outer loop, outer loop, inner loop, inner loop, you know, and hit the barrier the same number of times. That’s no good. They actually have to take the same paths through the loops.
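[As a small illustration of the rule just stated, assuming a barrier placed in the inner loop of a nested loop; the Python trace model is purely hypothetical:]

def barrier_trace(outer_iters, inner_iters):
    # Sequence of (outer, inner) positions at which a thread with these per-thread
    # loop bounds would execute a barrier placed inside the inner loop.
    trace = []
    for o in range(outer_iters):
        for i in range(inner_iters):
            trace.append((o, i))      # barrier() would be executed here
    return trace

t1 = barrier_trace(outer_iters=1, inner_iters=6)   # 1 outer iteration, 6 inner: 6 hits
t2 = barrier_trace(outer_iters=3, inner_iters=2)   # 3 outer iterations, 2 inner: 6 hits

assert len(t1) == len(t2)        # same number of barrier hits...
assert t1 != t2                  # ...but different paths through the loops: barrier divergence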
>>: Then are the OpenCL’s –
>> Alastair Donaldson: [inaudible] the OpenCL spec, yeah. And the reason for it is that it’s very
difficult to actually compile a barrier operation in that way. In something like OpenMP or MPI, you can have processors hit different barriers and you can implement barrier synchronization that way. In a GPU kernel it would be very difficult to do that, due to the way these threads actually work. So
the way we implement barrier is we assert that the threads are uniformly enabled: either they’re both disabled or they’re both enabled. If one is enabled and the other is disabled, it basically means one of them is not there, either because it would be going to a different barrier, or not going to any barrier ever, or would be, say, out of sync with respect to the loops they are executing. So this precisely captures the requirements for barrier divergence checking. I sometimes give talks where I really explain that carefully with some examples, which I haven’t done here.
And then, if the first thread is not enabled, if either of them is not enabled, then they’re both not
enabled, so we return, because the barrier is not really being hit. Otherwise we do the assumed thing as
before.
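[A small Python model of the predicated instrumentation and barrier just described; the names log_write, check_write, barrier and the single write log are illustrative assumptions, not GPUVerify's actual Boogie encoding:]

WRITE_HAS_OCCURRED = False     # has the first thread logged a write since the last barrier?
WRITE_OFFSET = None

def log_write(enabled, offset, nondet_choice):
    # The first thread's write may (non-deterministically) be logged, but only if enabled.
    global WRITE_HAS_OCCURRED, WRITE_OFFSET
    if enabled and nondet_choice:
        WRITE_HAS_OCCURRED, WRITE_OFFSET = True, offset

def check_write(enabled, offset):
    # The second thread's write: if disabled there is nothing to check, otherwise the usual check.
    if enabled:
        assert not (WRITE_HAS_OCCURRED and WRITE_OFFSET == offset), "possible data race"

def barrier(enabled1, enabled2):
    global WRITE_HAS_OCCURRED
    assert enabled1 == enabled2, "barrier divergence"   # threads must be uniformly enabled
    if not enabled1:
        return            # neither thread is really at the barrier
    # Otherwise do what the unpredicated barrier did before (modelled here by simply
    # clearing the log, so pre-barrier accesses cannot race with post-barrier ones).
    WRITE_HAS_OCCURRED = False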
>>: Do you want to say it’s not enabled one or oh, I see, because enabled one is equal to [inaudible]
>> Alastair Donaldson: Yeah, it might be clearer, it might be clearer not to do that optimization. Ah,
does that make sense?
>>: Yeah, I don’t understand the definition of barrier divergence for non-structured code. [inaudible]
>> Alastair Donaldson: Okay, yeah. Alright, well, there, I mean, now I just had a slide here where if there was time I would talk about some of these things. I don’t have slides on them. So, we spent a lot of time working on invariant inference, which I find very interesting. We used Houdini to do that, and now I’m working on, and I’ve been talking to Akash Lal quite a bit about, some ideas for trying to optimize Houdini to be a bit smarter in how it considers candidates. And for GPUVerify, my ideas work really well; that’s where they came from. But Akash gave me some [inaudible] Boogie programs, and my ideas didn’t work well on them, and I’m hoping to talk to Shaz about that over the next couple of days.
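[For context, a bare-bones sketch of the standard Houdini loop being referred to, with verifier_refutes standing in for a call out to Boogie/Z3; this is not a real API:]

def houdini(candidates, verifier_refutes):
    # Start from all candidate invariants and repeatedly drop any candidate the
    # verifier refutes, until a fixpoint is reached.
    invariants = set(candidates)
    changed = True
    while changed:
        changed = False
        for c in list(invariants):
            # Check the program annotated with the current candidate set.
            if verifier_refutes(c, invariants):
                invariants.discard(c)      # c is not inductive relative to the rest; drop it
                changed = True
    return invariants                      # the largest inductive subset of the candidates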
So, yeah, there were a lot of really interesting practical issues in building on the LLVM compiler framework, but the principal one was this predicated execution for unstructured control flow graphs. That was very challenging and interesting. And then doing source-level error reporting actually was very interesting,
because we don’t just report that some assertion might fail. We actually want to report that there might be a data race between a pair of statements, and the problem we have is that for one of the statements things are fine: that’s the statement that was reached second, and it is an actual statement, so we know that that statement is the potential culprit. But that statement will be interfering with something that may be another statement, or may be something that came from abstracting a loop or abstracting a procedure. And what we want to state to the user is, we want to give them a best-effort
guess at the program statement that caused that problem. So what we actually do is we carry source location information and write it in loop invariants. So for a read or a write, we log the offset. We also log the line number, and then we have loop invariants that say things like: if there’s a read, it’s from an offset satisfying this pattern, and it’s from one of those line numbers. Seriously, I know it
sounds crazy, but that means that when we get a model from Z3 we can ask which line number the first
error came from and that allows us to give a potential error.
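[A hypothetical sketch of the source-location trick just described: the instrumentation logs the source line alongside the offset, and loop invariants constrain both, so a counterexample model from Z3 pins the report to a line. The line numbers and the access pattern below are invented:]

READ_HAS_OCCURRED = False
READ_OFFSET = None
READ_SOURCE_LINE = None

def log_read(enabled, offset, source_line, nondet_choice):
    global READ_HAS_OCCURRED, READ_OFFSET, READ_SOURCE_LINE
    if enabled and nondet_choice:
        READ_HAS_OCCURRED, READ_OFFSET, READ_SOURCE_LINE = True, offset, source_line

def candidate_loop_invariant(tid, stride):
    # "If a read has been logged, it is from an offset matching this pattern and
    #  it came from one of these source lines."
    return (not READ_HAS_OCCURRED) or (
        READ_OFFSET % stride == tid and READ_SOURCE_LINE in {17, 42}
    )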
>>: [inaudible] the read or the write logging is using this [inaudible] choice.
>> Alastair Donaldson: Yeah
>>: So, just following that control information about which branch was taken, wouldn’t that give you –
>> Alastair Donaldson: No, because say you’ve got a read that’s inside a loop; then havocking the loop-modified variables can just say a read in the kernel occurred from location five billion, and that may just be a false [inaudible]. Maybe there is a write in the loop that could write to it at that location.
So, what we do additionally, well, we don’t have to do this, in fact Matthew Parkinson, when I told him, came up with a smarter idea, but what we do currently is we actually carry around source location information in those tracking variables, and then we havoc the source line, and then we have a global invariant saying that the source location variables can only be one of the possible locations they could be. And then we have smarter invariants that try to infer bounds on those line numbers. Matthew’s idea is smarter, right? Basically, you know a certain check failed, so you know immediately which array is the problem. So now what you can do is eliminate all of the other checks and eliminate all the logs on other arrays. Now you’ve got a bunch of logs on the array in question. Split them into two sets, disable half of them here and half of them there, and run the verifier on both. At least one will fail. As soon as one fails, kill the other one. Now divide those into two and keep going. You basically binary search until one log remains and one check remains. Ah, and that would avoid all this invariant stuff, because it’s very expensive to carry this stuff around in invariants.
[laughter]
>> Alastair Donaldson: Again, is it [inaudible]
>>: You’re just sort of saying take, take away, you know, keep taking away sources.
>> Alastair Donaldson: Okay, yeah.
>>: Anyway, so –
>> Alastair Donaldson: But since you [inaudible]
>>: Yeah, there might be a better way to do that. I don’t know, but um –
>>: Yeah, yeah
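[A rough, sequential sketch of the binary-search idea attributed to Matthew Parkinson above; the real scheme as described would run the two halves in parallel and kill the loser, and verification_fails stands in for a run of the verifier with only the given log statements enabled:]

def localise_culprit(logs, verification_fails):
    # Precondition: verification fails with the full set of logs enabled.
    candidates = list(logs)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # At least one half must still make verification fail, since the full set does.
        candidates = first if verification_fails(first) else second
    return candidates[0]    # best-effort guess at the logging site involved in the race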
>>: Did you say, you know, out of the various sources of [inaudible], what’s most commonly, you know,
the reason why you fail to prove race freedom? Is it because you [inaudible] havoc to all the reads from
the arrays?
>>: Yes, not having a strong enough loop invariant, and saying that –
>>: Okay, so is it because your abstraction allowed a loop invariant [inaudible], or is it because by abstracting so much, you know, by turning all the reads of the arrays into havocs –
>> Alastair Donaldson: Yes
>>: You know, you just lost all possibility of –
>> Alastair Donaldson: It’s almost always that, there are some examples where up-front abstraction
makes it actually impossible to verify the kernel, no matter what invariants we then got. But that’s
[inaudible] and there is an important class of kernels which we have a paper under review about dealing
with. We’re using this barrier invariants technique for more precise abstractions. But for the most part, kernels don’t fall into that category, and for those kernels it really comes down to finding loop invariants that characterize the access patterns of arrays. And we have a template-based approach for doing that. Template-based approaches have advantages, but their main disadvantage is that if the access pattern falls outside the template, or is obscured by syntax, then, you know, we die.
>>: It’s often because you have access to a certain stride, or something like that?
>> Alastair Donaldson: Yeah, a certain stride, or maybe, yeah, I mean, I can show you some examples we have. Yeah, we have a bunch of cases. And so the things I would like to explore are: first, being more aggressive with the invariant inference, using this technique that I alluded to about optimizing Houdini to make Houdini able to take larger candidate sets, because the more candidates, the more chance you have. The second thing is we’re looking at a Daikon-like technique for actually doing some dynamic invariant generation, to give Houdini candidates; that’s probably how we’re going to try it. And the third thing, well, we have a grant funded in which we said we would explore interpolation.
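[Purely to illustrate the kind of strided access pattern, and the kind of candidate invariants one might feed to Houdini, mentioned a moment ago; the kernel and the templates are hypothetical:]

def kernel_write_offsets(tid, num_threads, iterations):
    # Offsets written by thread tid in a typical strided loop:
    # A[tid + i * num_threads] for i = 0 .. iterations - 1.
    return [tid + i * num_threads for i in range(iterations)]

def candidate_templates(tid, num_threads):
    # Template instances a tool might propose for the logged write offset:
    return [
        lambda off: off % num_threads == tid,   # each thread stays in its own stride class
        lambda off: off >= tid,                 # a simple lower bound
    ]

# If every logged offset of every thread satisfies a candidate, Houdini keeps it; the
# distinct stride classes of distinct threads are what rule out write-write races here.
for tid in range(4):
    for off in kernel_write_offsets(tid, num_threads=4, iterations=8):
        assert all(check(off) for check in candidate_templates(tid, num_threads=4))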
>>: So what, [inaudible] you’ll do a simulation, right?
>> Alastair Donaldson: Yeah, actually simulate it.
>>: [inaudible]
>> Alastair Donaldson: Yeah, we need a, we either run the Boogie program, if that’s possible, using this as a tool. I can’t come up with the name, but someone has written a Boogie interpreter, so we might use that. Or we might work on an interpreter for LLVM bitcode so that we can run our kernels. And there’s this KLEE-CL tool at Imperial that does dynamic symbolic execution of OpenCL kernels. And the main problem is, once we find these invariants, these likely invariants, how we map them back to the actual Boogie variables; that’s the kind of practical problem. And that can be really tricky if you’re dealing with, yeah.
>>: I have a better suggestion, is actually to do that on the, on the logic, on the verification conditions,
rather than the Boogie [inaudible].
>> Alastair Donaldson: Okay, but the problem with the Daikon-style approach is that it is also a template-based approach. It has the potential to find things that are syntactically [inaudible] but occur dynamically, so you can see that they happen. But still, you have to have your predefined things you’re looking for. So, despite having seen a number of talks on interpolation, I still don’t get the general intuition about interpolation. With interpolation you may be able to discover things that are problem specific.
>>: Well, I was going to say, the other thing about Daikon is you can discover things that you can’t actually prove.
>> Alastair Donaldson: That’s true, right.
>>: [inaudible] to say they rely on some [inaudible]
>> Alastair Donaldson: Okay, so we’re really trying to actually push this technique to people in industry, and I really think that probably they’re more interested in bug finding and suchlike, so we might use the Daikon thing to take invariants and just trust them. You know, I mean, [inaudible] reviewers will cry if they hear you say that, but --
>>: If you want to find bugs why don’t you just unroll the damn thing. Wouldn’t you find enough bugs
that way?
>> Alastair Donaldson: So, there’s another problem. So often you have a loop that’s going to do an
extortionately large number of iterations. And it’s correct. And then you have another loop that’s bad,
and you’re never going to get past that with unwinding. So you may be able to get a hint it’s a problem
by a failed proof attempt. I have some ideas about trying to under approximate loops.
>>: [inaudible] under approximate acceleration.
>> Alastair Donaldson: Yeah, yeah, there’s a nice paper by [inaudible] and others at [inaudible] about loop, well, it’s not loop acceleration, it’s loop under-approximation, which –
>>: [inaudible] it’s something like under-approximation acceleration.
>> Alastair Donaldson: Yeah, so I look at that and I think, like, that was formalizing some things I’ve been thinking about, so –
>>: Which is why it would be cool to have [inaudible] at a lower level [inaudible]
>> Alastair Donaldson: Oh, so other people could benefit from it.
>>: So that everybody could benefit from it.
>> Alastair Donaldson: Yeah, that’s right, yeah.
>>: Yeah, I think that there are many advantages to what you’re proposing, because if nobody is really interested in all these templates and invariants, then the lower the level at which you can do it, the less engineering you would have to do.
[inaudible]
>> Alastair Donaldson: So, my plan for practical deployment of GPUVerify is as follows. There are three versions of the tool: one version which is eager to find bugs, one version that’s eager to verify, and one version that’s neutral. By eager to find bugs, I mean you do, like, unrolling for instance. By eager to verify, I mean for instance you turn off race checking but keep race logging on, so that you can quickly find your best invariant with Houdini without the expense of actually performing the race check, and once you’ve found the best invariant you then proceed to do the race check. And then in the middle you’ve got the one that just does everything at once, and it may quickly
discover that a proof won’t work, but doesn’t find a bug. [inaudible] is if the bug finder finds a problem,
you kill the others and say there’s a problem. If the verifier proves things, you say good and um, if either
of the two verifier approaches fails, you just ignore that, and after 90 seconds you say no problems were found. That’s the way I envision people getting use from the tool. I don’t really
think it’s going to be very useful if, I mean despite our efforts with invariant inference, there’s a very
high chance that the tool will report false positives.
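[A hedged sketch of the three-configuration deployment just described: a bug-finding run, a verification-eager run and a neutral run raced against each other under a 90-second budget. The command-line flags are invented for illustration and are not GPUVerify's actual options; cleanly killing the still-running configurations is omitted:]

import concurrent.futures, subprocess

CONFIGS = {
    "find-bugs": ["gpuverify", "--loop-unwind=8", "kernel.cl"],   # eager to find bugs
    "verify":    ["gpuverify", "--no-race-checks", "kernel.cl"],  # eager to verify
    "neutral":   ["gpuverify", "kernel.cl"],                      # everything at once
}

def run(name):
    # For this sketch, exit code 0 is taken to mean "no defect found / verified".
    proc = subprocess.run(CONFIGS[name], capture_output=True)
    return name, proc.returncode

pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(CONFIGS))
futures = [pool.submit(run, name) for name in CONFIGS]
verdict = "no problems were found"          # the default once the 90-second budget runs out
try:
    for fut in concurrent.futures.as_completed(futures, timeout=90):
        name, code = fut.result()
        if name == "find-bugs" and code != 0:
            verdict = "possible defect reported"    # the bug finder wins: report the problem
            break
        if name != "find-bugs" and code == 0:
            verdict = "verified"                    # a verifying run wins
            break
except concurrent.futures.TimeoutError:
    pass
pool.shutdown(wait=False)
print(verdict)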
>>: [inaudible] trying to take incomplete proofs and sort of use those as a way of, if you will, triaging the error reports and deciding which ones are most likely to be real error reports. Also, the thing that you’re doing with joint invariants for the two threads, for loops, actually seems pretty closely related to his differential [inaudible].
>> Alastair Donaldson: Really? Okay.
>>: Yeah, because he essentially has a construction, a Boogie-to-Boogie translation, that’s essentially giving you joint invariants for loops and –
>> Alastair Donaldson: Okay, so you’ve got two programs and it’s like you’re considering them –
>>: [inaudible] say two versions of the same program –
>> Alastair Donaldson: I see, right.
>>: And what he wants to know is [inaudible] to say one is safer than the other. You know that if the second version crashes on a given input, then the first version crashes on that input, and so that’s what he means by differential assertion checking. And so,
but in practice what that means is you’re looking for joint invariants for the loops so it’s like running, you
know, the loops in the two versions. [inaudible] It may give something [inaudible]
>> Alastair Donaldson: Yeah, okay. I’ll definitely talk to him about that. Okay, thanks for listening.
[applause]