
>> Shaz Qadeer: So it's my great pleasure to welcome Ganesh Gopalakrishnan from the
University of Utah to Microsoft Research.
Ganesh has visited MSR many times before. He works in the area of high-performance
computing, formal verification, and other related topics.
Today he's going to tell us about some work that -- at the intersection of these two topics.
Thank you.
>> Ganesh Gopalakrishnan: Thank you very much, Shaz, for the kind introduction.
And good to see all of you for my talk.
It's a great pleasure for me to be able to present some of the work that I've been doing in
the area of applying formal techniques, systematic techniques, to high-performance
computing.
So I give the initial title to be Correctness Checking Concepts and Tools for
High-Performance Computing. And if I really want to address the upcoming roadmap of
exascale computing, I would probably offer a prettier title, such as: this is really the
black ice that is [inaudible] for people who aspire to attain exascale rates of computing.
And let me say why and how it might be able to help some of those.
Okay. So today's high-performance computing research is driven by the mantra of
wanting to attain the maximum out of compute per watt, whatever compute means.
And we have enjoyed the growth thanks to two governing laws in this area. Moore's law
is familiar, which is roughly the doubling of the number of transistors. And Dennard's
law has been at work also, because as and when we scaled transistors, they remained rather
efficient. They operated in such a way that you could pack twice the number of
transistors and not exceed the thermal limits.
This was thanks to the voltage scaling that was possible as and when you scaled down
[inaudible]. But right now you cannot scale down the voltage anymore. It's almost at one
volt. And hence the alpha-C-V-squared-f dynamic power hits you and we are power capped.
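To make the power wall concrete, the governing relation is the standard CMOS dynamic power formula (this is textbook material, not something specific to any slide):

$$ P_{\text{dynamic}} \approx \alpha \, C \, V^2 f $$

where $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Dennard scaling let $V$ shrink along with feature size, so total power stayed roughly flat as transistor counts doubled; with $V$ stuck near one volt, doubling the active transistor count at the same frequency roughly doubles power, hence the power cap.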
So Dennard's law is over. And this is from Bob Colwell. And there are a number of
worries that this induces, because if we have to enjoy the same apparently unlimited way of
evolving in terms of computing advances, we have to be hiding several latencies and
gaining performance where we were initially not able to.
So never have any excess design or slack. So this is how we have to be proceeding.
Yeah. But what is now being mentioned at HPC amongst all this is that HPC is also
going to address very serious computational concerns and the challenges going forward.
So one is familiar with the various kinds of industrial simulations or weather modeling.
But even for surgical planning or studying the effect of introducing a stent, people are
able to use fluid-mechanics-based simulations to predict the [inaudible] so you can do
at least various experimental procedures offline.
Pedestrian accident avoidance systems are going to be deployed with the assistance of GPUs
which are extremely energy efficient and mobile.
So correctness is a concern. And HPC is no longer the province of big scale HPC
centers, but it is coming everywhere.
Yeah. So there are even verified high-performance potato chips. This is just for laughs,
but Pringles apparently uses aerodynamic simulation in flying Pringles, so they shouldn't
take off. So if you want planes to take off thanks to HPC, you don't want Pringles to take
off, again, thanks to HPC.
All right. So let's see how the correctness field is in this domain. I have previously
worked on cache coherence protocol verification, things like DPOR and preemption-bounded
search variations. And around 2003 or 2004 I started collaborating with
researchers in high-performance computing, just because that area
was largely virgin and yet they were writing concurrency codes pretty much all the time.
So the real pressure, the driving force for them to operate in a certain way, is to get the
maximum science per dollar.
Machines are bought and commissioned -- you do the commissioning and bring [inaudible] up --
and then the machine is useful for six years or so, and these are expensive machines. So they
really have to push on that front.
And many times HPC is used to explore unknown aspects of science. You are trying to
predict what happens when you travel close to relativistic speeds or what happens inside a
black hole, things like that. So you don't know what the right answer is; you're trying
to find out the right answer; i.e., you cannot write assertions easily in many cases.
Algorithmic approximations are made. Physics can never be modeled as such, so that is,
again, another variability.
With all the hardware performance concerns and Dennard's law being dead, we are
facing the prospect of dark silicon, which means you cannot really operate all the
transistors, so we really have to eke out every bit of performance, which means you need
to have different kinds of CPU and GPU elements and memory subsystems. It's
enormous heterogeneity; different core types are used.
When you look at number representations, you are thinking in real number space but you
are implementing floating point. And the floating point is like a noisy signal that rides
over your ideal real-number curve. You don't know where that noise is or where it builds
up and where it affects the computation, so -- and precision allocation is an important
consideration because if you can compute at lower precision, you are better off doing so.
There are even attempts to actually do half precision in terms of storage and bring it to
the chip and expand it to full precision. A lot of gains available in this area.
Bit flips are a worry because bits may actually not be reliable. And scientists
themselves in this area are busy doing science and also catering to HPC. So Ed Lazowska,
who visited our department, called them pi men: they need to have two deep strengths.
And it's hard to find people who can do both well, as opposed to T men.
All right. So this is an interesting --
>>: What are T men?
>> Ganesh Gopalakrishnan: T men -- it's just a single deep skill. Here you have to be
deep in two areas. All right. That is Ed Lazowska's picture. All right.
All right. So here's an example of what heterogeneity might cost. And this is an actual
bug story. The Uintah group has been pushing on their HPC simulations,
and they recently made their code run in a symmetric mode on Xeons and Xeon Phis. So
the code was working perfectly correctly in the Xeon regime, but some of the
processes were computing a certain ratio on the Xeon Phi -- in this case deciding the
number of messages to send, which came out as 162 -- while the other side calculated the
same ratio and got a different count, because of floating-point round-off issues, which are different.
And this actually caused a deadlock. So they were wondering why the code was
deadlocking; it finally came down to a [inaudible] precision issue. So there are ramifications of this
kind where continuous quantities may affect discrete decisions. And if it happens in a
sequence, you may not be preserving invariants that you had in mind, because the
decisions can be inconsistent.
And so we don't really know the answer. They just went ahead with double precision and
it seemed to work.
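To make this concrete, here is a minimal sketch, with made-up numbers and not the actual Uintah code, of how a floating-point ratio that feeds a discrete decision can deadlock MPI. The round-off difference between the two architectures is faked here with an explicit perturbation; rank 1 ends up posting one more receive than rank 0 ever sends and blocks forever.

    #include <mpi.h>

    /* Sketch only: a message count derived from a floating-point ratio that
       rounds differently on the two sides (perturbation is faked explicitly). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double work = 162.0;
        double per_msg = (rank == 0) ? 1.0 + 1e-15 : 1.0;  /* mimic differing round-off */
        int nmsgs = (int)(work / per_msg);                 /* 161 on rank 0, 162 on rank 1 */

        int buf = 0;
        if (rank == 0) {
            for (int i = 0; i < nmsgs; i++)                /* sends 161 messages */
                MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            for (int i = 0; i < nmsgs; i++)                /* expects 162: the last
                                                              MPI_Recv never returns */
                MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }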
And when I dug into it further and my colleague [inaudible] asked me why I think
this is really happening, I initially thought it was 80-bit precision available here and 64
available there. But apparently 80-bit precision is available on both; there are subtle
differences in terms of round-off, and there's a safe flag that you can apply to the compilers.
So these are yet to be actually tried out in this area.
But given the busy nature of the scientists, the CS people have to be the ones coming in and
answering these questions.
Let me talk a little bit about resilience. And finally my talk is going to drive towards
correctness issues of the familiar kind. But let me at least address some of the looming
issues.
If you look at some of the advanced GPUs, there are 7 billion transistors and counting. If
you do some math, there must easily be [inaudible] transistors on a large system.
And now they're all throbbing at gigahertz rates over a week, and at the end of that week you
get an answer.
Now, what's the probability that every such transition went as planned? You can do the
math and find that the per-device failure probability has to be so, so, so low for you to get a reliable
answer. So the moral is that failures are occurring all the time in a system of this scale.
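A back-of-the-envelope version of that math, with purely illustrative numbers: suppose a system has on the order of $N \approx 10^{13}$ transistors switching at $f \approx 10^9$ Hz for $T \approx 6 \times 10^5$ seconds (about a week), and each switching event fails independently with probability $p$. Then

$$ P(\text{no error}) = (1 - p)^{N f T} \approx e^{-p N f T}, $$

and with $N f T$ around $10^{27}$ events, keeping the whole run clean requires $p$ to be smaller than roughly $10^{-28}$ per event. That is the sense in which the per-device failure rate has to be "so, so low," and why at scale it is safer to assume faults are happening.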
But with HPC, the deal seems to be the number of the computations are additive in the
sense you are trying to compute a single number as opposed to a cloud transactional
system where things might be happening in a disjoint way. So the window of
vulnerability of each active computation is lower. That can have an effect.
And people are actually trying to discover whether this is actually happening on some of
the largest machines. So I had the luxury of seeing the Sequoia machine. I lovingly
touched it and so on at Livermore, one of the [inaudible] recently. They're trying to run
identical computations on two of the subsystems and vote, to see whether the
bit flips are actually occurring. So the verdict is out, yet to be fully announced, but they think
it is happening.
Bit flip is a loaded term. It covers a multiplicity of issues. I mean, the fabricated chips don't
have a single operating point; they can be off by 30 percent. So which supply voltage to
pick to operate the chips at -- you have to be guessing that.
Coupled with that, hot spots. The LSB is getting flipped exponentially more often than the
MSB. All kinds of thermal hot spots, you can imagine. And the transistors are physical
objects. They wear out under thermal stresses and crack. They have actual pictures of
the burnout.
So I'm just saying that, with all this, particle strikes can make the situation worse. So it's all an
additive effect. So with the scaling and the large number of parts, bit flips are a reality,
which yesterday we heard Michael Carbin's talk trying to address in
a sense.
And energy is the currency. I was just at Pacific Northwest National Lab before coming here, and
there is this very interesting project where they were trying to see which messages are
arriving at a barrier sufficiently early, and if it happens consistently and they can learn
that quickly, they just do dynamic voltage and frequency scaling, which gives them 18
kilowatts of savings, which is 18 toasters. And these are highly -- I mean, exascale attained
at 20 to 25 megawatts is the goal; right now it would be a gigawatt or so. So every kilowatt saved
is important.
So when you do these voltage and frequency games, it adds to the noise. The systems are
going to be noisier as a result.
Anyway, so a few more slides and then we'll get into the meat of the talk.
POWER6, they actually tried to measure bit rates. One of my former students was in this
study. This is a heavily engineered machine which has all kinds of error correction, and
they actually took it to an IBM facility and beamed [inaudible] on the chip or whatever, high-energy
particles. And then they had a very interesting triaging of where the errors are filtered.
So a lot of don't cares at the circuit level, at the software level, at the application level.
And finally, how many errors get through [inaudible] without hanging the machine or crashing the
software -- that's the silent data corruption rate.
And that was apparently .03 of the latch flips turning into silent data corruptions.
Which is a very good number. Very reassuring. Many of the errors are not getting
through.
But this is a heavily overengineered machine. And this group subsequently wrote another
paper reflecting on POWER7. And they make the remark that this hardware protecting
you is not going to go on; you really need to have software-level defenses, assertions
basically. And you can evaluate those assertions via triple modular redundancy or other
ways and somehow make the whole system bulletproof, a self-checking kind of thing. So
this is a reality now.
All right. Despite all of this, what are the most worrisome HPC bugs? Since many of
these things are controllable and many things you don't have control over, I would really
see that concurrency bugs are still a place where we can exert a lot of control. And those
are going to be noticed. And some of the engineers at NVIDIA who talked to me
recently, over the last year, said that if you really have any race condition and so on, it
tends to show up at larger scale. It's just simply the probability builds up. So you do run
into that. You cannot silently mask them away.
All right. With that, I'll tell you a little bit about what we are trying to do in terms of our
broad scope of research and then try to drill into some of the specifics.
So one of the most exciting projects we are involved in is called the PRUNERS project, which
is in collaboration with Livermore. And they are basically interested in trying to identify
when systems become non-reproducible -- you cannot replay a behavior -- and why that is, what
subsystem is acting up that way.
So: help build a harness to, say, deterministically execute up to a point and then schedule
around it. So these are similar ideas to what is being talked about here. And Alice talked to the -- addressed
some of the controlled-scheduling methods. Except we now have to do it at a larger scale in the
presence of multiple APIs.
And so there are data race detection challenges in MPI and OpenMP; MPI has message races as
opposed to data races.
Some of the failures they noticed at Livermore were failures at
delicate ranges of sizes, and it turned out to be a highly optimized MPI library causing it,
not the user code. So, again, there is no single place to look.
So we will slowly work towards this and give them a test harness. We have a replay
facility for message passing programs based on distributed execution of MPI with
piggybacked Lamport clocks. We have a special version of Lamport clocks that detects a
few more happened-concurrently relations, which we have engineered, and they are trying to use that
as a framework to then drive other schedules.
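As a rough illustration of the piggybacking idea -- and not the actual ISP/PRUNERS implementation, which intercepts MPI calls through the PMPI layer and packs the clock with the payload -- here is a sketch of carrying a Lamport clock on every message so that a replay tool can recover a happens-before order:

    #include <mpi.h>

    /* Sketch only: a Lamport clock carried alongside each payload message. */
    static long lamport_clock = 0;

    void pb_send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        long clk = ++lamport_clock;                        /* tick on send      */
        MPI_Send(&clk, 1, MPI_LONG, dest, tag, comm);      /* piggybacked clock */
        MPI_Send(buf, count, type, dest, tag, comm);       /* actual payload    */
    }

    void pb_recv(void *buf, int count, MPI_Datatype type,
                 int src, int tag, MPI_Comm comm) {
        long clk;
        MPI_Status st;
        MPI_Recv(&clk, 1, MPI_LONG, src, tag, comm, &st);  /* src may be ANY_SOURCE */
        MPI_Recv(buf, count, type, st.MPI_SOURCE, tag, comm, MPI_STATUS_IGNORE);
        if (clk > lamport_clock) lamport_clock = clk;      /* max(local, incoming) */
        lamport_clock++;                                   /* then tick            */
        /* Logging (st.MPI_SOURCE, clk) per receive is what a replay scheduler consumes. */
    }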
So we will be looking at OpenMP scheduling very soon. And then we also have to
isolate whether it is floating point determinism or nondeterminism. So we may have to
determinize the reduction operation to avoid that issue flaring up and then see what else
is causing it.
>>: What do you mean by denial of service?
>> Ganesh Gopalakrishnan: All right. Here it was an overly optimized MPI collective
which was preventing MPI point-to-points from finishing. So within the MPI ecosystem,
some message types were not allowed to finish just because of the other message type.
Because MPI has its own runtime, you can think of all the processes as issuing calls. So some
calls which had to be scheduled in a fair way were not scheduled. And so the
system deadlocked, and this happened only for specific sizes. And it turned out to be a
bug, which the vendor corrected and all that.
We are looking into floating point error estimation, which is another very nice computer
science-style problem. Despite the fact that [inaudible] has standardized floating point, it
doesn't mean that there are -- there is a good understanding of when round-off happens.
So if you write an expression involving floating point -- David Bailey and
others have shown that if you selectively assign higher precision to some of the operands
and keep all the others at lower precision, it still computes fine. So there are a lot of analysis
questions in this area. And, again, the floating point error that I mentioned earlier.
So people tend to slap on double precision, but double precision costs you double the
bandwidth, at least for GPUs, so we can't really afford to be indiscriminate in terms of
precision allocation.
So we have a search-based lower-bound determination [inaudible]: run two versions, one
being the native precision, one being an [inaudible] precision. Try to walk through the
input space, in basically a delta-debugging or generative-search way, seeing where the
outputs deviate.
So that gives you a good estimate of a lower bound. We had a PPoPP paper this year
on that.
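Here is a toy version of the two-execution idea; the expression, the naive random sampling, and the names are mine, and the actual PPoPP work uses a guided, delta-debugging-style search over the input space rather than blind sampling. The point is just to evaluate the same expression in native and higher precision and report the largest deviation seen, which is a lower bound on the true round-off error.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* (x+1)^2 - x^2 - 2x is exactly 1 for every x, but cancellation makes the
       single-precision version drift badly for large x. */
    static float  f_native(float x)  { return (x + 1.0f)*(x + 1.0f) - x*x - 2.0f*x; }
    static double f_higher(double x) { return (x + 1.0) *(x + 1.0)  - x*x  - 2.0*x; }

    int main(void) {
        double worst = 0.0, worst_x = 0.0;
        for (int i = 0; i < 1000000; i++) {
            double x = ((double)rand() / RAND_MAX) * 1.0e6;   /* naive sampling   */
            double err = fabs((double)f_native((float)x) - f_higher(x));
            if (err > worst) { worst = err; worst_x = x; }    /* track worst case */
        }
        printf("lower bound on error: %g at x = %g\n", worst, worst_x);
        return 0;
    }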
And for the upper bounding, which Alex is working on, we have several abstraction
techniques, [inaudible] methods, but sometimes they don't track dependencies among
variables, so we are trying to come up with a much tighter bound. We can then see
whether the bounds agree, in which case we are getting a reasonable estimate.
System resilience-wise, we are beginning to work on that. We have developed an
LLVM-level fault injector. And there's some collaboration with [inaudible] going on
here with this automator.
Okay. So let me now turn the attention to HPC and concurrency itself. And the real
problem with concurrency now -- where I'm setting aside all the other issues and looking at
the concurrency bugs -- is the explosion in the number of APIs. Not all of them
need to be present in a given setting, but many of them are. And each API sort of
thinks that it owns the machine, which is a resource issue that I'm not getting into, but
it's looming on the horizon. We don't know how the runtimes interact.
But the real worry, which I'll illustrate with some examples, is that everyday
programmers are exposed to low-level concurrency in some cases. And this is not the
desired path in the concurrency space, because we thought that the application programmer didn't
need to know about low-level concurrency, but apparently that's not quite true.
So then I'll talk a little bit about that and then go on to other aspects of concurrency.
Okay. So I thought I would also give some sense of reality to the talk, some realism,
so I will quickly demonstrate two tools, even though it's out of context,
because I don't want to interrupt the talk later. So I'll give you a feel for, all said and done,
what our current solutions look like. And then we'll come to the more slide-oriented part.
So we'll -- I'll be talking about correctness checking of GPU programs, and we have a
tool called GKLEE. I'll run that for a few minutes. And then I will also -- we have a
dynamic checker for MPI programs.
So I'll be talking about GKLEE, which is a -- which has been -- which evolved out of
KLEE, we just engineered KLEE to understand the different memory organization which
is underlying GPUs.
So if I were to present this, for the benefit of the people who are watching -- if I were to
present symbolic execution, I could even start with binary search: it puts in the
assumes and searches for a symbolic item. If the item is found, it prints a star; if the probe is to
the low side it prints an L; if to the high side, it prints an H.
So I can now compile it with LLVM and then [inaudible]. So it's running symbolic
execution and it found 11 ways to execute the program. It has generated 11 inputs. So
you can sort of look at this and -- so HHH not found means it went through the edge three
times, the high path. L not found, it found that path. It found that. LHH. So you get the
idea. So this is what symbolic execution does. It does input generation and drives the
program. This is the [inaudible] and smart kind of technology engineered and well
implemented in GKLEE.
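For readers who want the flavor of that demo, here is a sketch in the same spirit, not the demo's actual file: klee_make_symbolic marks the searched-for item as symbolic, and the symbolic executor then enumerates the feasible probe sequences, generating one concrete input per path. GKLEE extends this same style of analysis to CUDA.

    #include <klee/klee.h>
    #include <stdio.h>

    int main(void) {
        int a[8] = {1, 3, 5, 7, 9, 11, 13, 15};        /* sorted array          */
        int item;
        klee_make_symbolic(&item, sizeof(item), "item");

        int lo = 0, hi = 7;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (a[mid] == item)     { printf("*"); break; }        /* found      */
            else if (a[mid] < item) { printf("H"); lo = mid + 1; } /* went high  */
            else                    { printf("L"); hi = mid - 1; } /* went low   */
        }
        printf("\n");
        return 0;
    }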
So the point I'm making is sequential -- or good sequential symbolic analysis is going to
be central to us, and in fact we will turn the GPU verification into a sequentialized
examination of a schedule. So that's how that technology is going to play out. So let's
run some GPU programs just since we are here.
So if I were to run this GPU kernel and the syntax is kind of not entirely CUDA in this
version, but we have caught up with CUDA, so here this main program will call a single
kernel called race and this kernel will be run by the given number of threads, which I'm
not mentioning here.
But it's easy to see that if two threads, 0 and 1, are run, thread 0 would be writing
location 0 and reading location 1, and thread 1 would be writing location 1 and reading
location 2, so there's a race.
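The kernel in the demo is along these lines -- this is a reconstruction for illustration, not the exact demo file: each thread writes its own slot and reads its right neighbor's slot, so adjacent threads conflict on the same shared location.

    __global__ void race(int *out) {
        __shared__ int s[64];              /* assumes a launch of at most 64 threads */
        unsigned t = threadIdx.x;
        s[t] = t;                          /* thread t writes location t             */
        if (t + 1 < blockDim.x)
            out[t] = s[t + 1];             /* ...and reads location t+1, which races  */
                                           /* with thread t+1's write above           */
    }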
So we are going to schedule in a manner that I'll explain soon. But the bottom line is that
it actually finds races, and you can drill into the LLVM and see where the race is and
all that.
And to give you a feel for how well this might work, on a CUDA kernel which is a really
solid implementation -- again, I won't explain what it does or where the problem might be,
but at least it will give you a sense of the time it takes for a tool like this.
This is the first implementation of GKLEE, and I'll be talking about a few others.
So it found a race and you can drill in and identify that it is a race between thread 0 and
thread 4, and then you let -- scratch your head and look at the input and find out where it
raced and how.
Okay. So that's -- that's enough for this demo. What else is possible in the space of
message passing programs? This is not symbolic analysis; this is dynamic analysis.
So let me again run a quick demo of this illustration.
Now here the students put in a lot more effort and we have engineered an Eclipse
integration. We hope to have that available.
But, again, to give you a feel for what we might do: here is a message passing program
written in the MPI style, and we want to find out if there are races here. So here is rank 0,
rank 1, and rank 2 code. So rank 0 means process 0, which is going to send a
nonblocking send message to process 1; rank 1, process 1, is going to have a wildcard
receive -- any-source means from anywhere -- then a specific receive; and then rank 2 is going
to do a normal send.
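A reconstruction of that pattern -- the demo's file differs in details -- looks like the following. The danger is that rank 1's wildcard Irecv can match either sender; in the unlucky interleaving it consumes rank 2's message, leaving the specific receive from rank 2 with nothing to match.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 /* run with exactly 3 ranks */
        int rank, x = 7, y = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Request r;

        if (rank == 0) {
            MPI_Isend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &r);  /* nonblocking send to 1 */
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Wait(&r, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Irecv(&y, 1, MPI_INT, MPI_ANY_SOURCE, 0,          /* wildcard receive      */
                      MPI_COMM_WORLD, &r);
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Wait(&r, MPI_STATUS_IGNORE);
            MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD,        /* specific receive from */
                     MPI_STATUS_IGNORE);                          /* rank 2: may deadlock  */
        } else if (rank == 2) {
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }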
Okay. So let's see how we can test it. We can do mpicc -- this is how you compile it,
the standard way -- then mpirun, and no problem: this code seems to test fine. Okay.
Sometimes we get lucky. Okay, didn't get lucky. Whereas the replay debugger applies a
different compilation, ispcc, which is a slightly different way to compile it; now, under
ISP, it actually finds out if there are dependencies. And then -- so it actually replays
the code twice, and on the second run it found a problem.
And then I can spawn a user interface, and it tells you there are two ways of running
that program. We're in process 0. This was a nonblocking send which was picked up by
this nonblocking receive. There was this barrier, and then somebody put the send after
the barrier thinking that this send is thereby forced to be matched here. But if you
now go to the second interleaving, that's not the case. There was a racing communication,
because once these nonblocking calls reach the barrier, they can cross it, and that
enables this other send also to pop up its head, and it is an eligible matcher.
So this is -- we actually schedule based on an underlying happens-before graph, and since we have
a happens-before theory of MPI, we know how to schedule it.
>>: So this is an [inaudible] because the send on the left --
>> Ganesh Gopalakrishnan: Yeah.
>>: -- cannot have a receive matched in there because the barrier is barring it?
>> Ganesh Gopalakrishnan: Well, in this case, this receive is just a specific receive
[inaudible], and it wants another message from source 2, but there is no more message
to offer.
>>: I see.
>> Ganesh Gopalakrishnan: Yeah.
>>: So really the send to 1 should have been received by the receive from 2?
>> Ganesh Gopalakrishnan: That's right.
>> Okay.
>> Ganesh Gopalakrishnan: Yeah. So this was a [inaudible] match. Yeah. So at least
this gives us a feel for the kind of tools that we have built, but let's now go on with the
mainline of the talk.
And feel free to ask questions at any point.
So I have -- all right. So now let's us begin and try to revisit the GPU space, and let's
start from the beginning and look at how a user experiences concurrency and what errors
they might be subject to just because they don't understand or the documentation is not
clear.
All right. To remind you of GPU programming: GPUs are a model of computation where
you have a hierarchy of threads accessing shared memory and then global
memory. Barriers can be used within this.
So if you want to do a summation over arrays, A[i] plus B[i], you deploy multiple threads.
The details are not important. It's a SIMT model, single instruction multiple threads, and
typically threads are batched and scheduled.
Okay. So here's one piece of code that is contrived but it illustrates a certain point.
Because many people tend to assume that since there is this warp business, meaning a
bunch of the threads, in this case 32 threads, are scheduled as a batch, this single
instruction is going to finish before any thread engages in this execution, and, hence,
there is nothing -- no need to put anything here. Okay.
And so when you compile and run this, the expected answer shows up, because each
location is initialized to i, so it would have been 0, 2, 4, 6, 8 when you do this addition.
Except that this instruction came in and thread i wrote a flag value into the (i+1)-th location,
so it killed those values. So that's the right answer at that point.
Okay. But as soon as you apply an optimization, you're getting a different answer. Okay.
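A sketch of the kind of kernel being discussed (the slide's actual code differs): the programmer assumes the whole warp finishes the shared-memory write before any thread executes the read on the next line, so no __syncthreads() is inserted; once the compiler keeps the shared values in registers, or the lockstep assumption is relaxed, the read can see a stale value.

    __global__ void warp_assume(int *out) {
        __shared__ int s[32];
        unsigned t = threadIdx.x;          /* assumes a single warp of 32 threads */
        s[t] = t;                          /* every thread fills its slot         */
        /* missing: __syncthreads();  the code relies on warp lockstep instead    */
        if (t + 1 < 32)
            out[t] = s[t] + s[t + 1];      /* may read a stale s[t+1] once the    */
                                           /* compiler optimizes, or when the     */
                                           /* lockstep assumption no longer holds */
    }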
So these are issues that are widely discussed in CUDA forum. So I'm just saying things
that really tend to trouble users are at this level well before we come into races and all
that.
So what's going on here? We had to ask NVIDIA compiler expert, and he said that the
warp programming style is discouraged and compilers really understand the warp to be of
size 1. So really you must think as if this is asynchronously interleaving with the threads
being able to interleave in any way. There's no guarantee that this will be entirely done.
So this is -- and have dug into the code to see why this is happening and all that.
So the real -- so no guarantee is provided. Okay. So the real reason why this happened
was that this value was held in a register: this value was computed, but only the register got
updated. So it's bad code from that point of view.
So the recommended practice is to slap a volatile. So this is another thing that sort of
illustrates how this area works in terms of published documentation and what the user
says.
So what is C volatile? This goes back to the C volatile of the old device-programming
days, when C had to respond to device registers in the manner described. What does it have
to do with concurrency? There's no clear answer.
The concurrency semantics of C -- the C and C++ memory models -- make no mention of
volatile in this role. It is going to the C level [inaudible] and all that.
So the side effect that you're getting is the values are no longer held in a register. But
that's giving you this benefit and, hence, everybody's happy and going away.
So there has to be some user education and retraining. And the worst thing is that if you
really track the documentation chain that is going on in this area, the volatile practice was
recommended as practice in an earlier version of CUDA. The volatile practice has not
been mentioned in this version of the CUDA manual and it's not clear what's going on.
So the point here is there are errors of a, I want to say, low grade -- of a user-education and
documentation nature -- that really are important in this area.
This is a fast-moving area with many things happening and the users cannot be expected
to keep up.
>>: So [inaudible] satisfactory because in older versions of the document, it explicitly
encouraged this warp time [inaudible] if you want to get high performance you do not
need to insert barriers and gave an example. And then [inaudible] space where they said
the compiler might do something strange, let's use [inaudible] volatile and then they just
remove any mention of it [inaudible] it's just gone. So it's not like now it says this
[inaudible] it's just disappeared [inaudible].
>> Ganesh Gopalakrishnan: Yeah. I'm not entirely -- I mean, this is a -- computer
science has seen this before. If you go back to the '70s or early '80s, the MC68000 came
into the picture and it couldn't handle a page fault at the last location of a segment; MMUs
were new when they came. New things were happening, and hardware people really have
struggled with a lot of these breakages.
So I really would attribute it to growth pains and things just happening so fast.
Yeah.
And they're open to engaging with us, so Ali and I have had some engagements with the
OpenCL and NVIDIA teams. And we need to work on formalizing these.
But if you look at the forum threads, these are all issues that the threads are talking about --
this issue that I mentioned. And here is a user [inaudible]: I wrote this piece of code,
supposed to work under the warp semantics, but it doesn't work unless I slap a syncthreads in --
in this case, a big hammer, which is a barrier. So that's safe, though.
So, anyway, I won't go on this tangent very long except to say that some of our students
have been diligently looking at things out there. There's an
actual textbook with a published example of reduction. And it doesn't work. And we
have actually demonstrated it doesn't work. We have talked to NVIDIA experts. So I
can get into the details later.
So the basic idea is that in the different thread blocks, thread 0 is responsible for
doing the reduction unto itself, which is fine.
Having acquired the values in thread 0, they all engage in atomics on a global
[inaudible]. There they assumed memory visibility -- that the atomics have
fence semantics -- and they don't, so you can actually be picking up a stale value.
So this is -- so these are the levels of trouble there [inaudible]. They are real. I mean,
this is in a book, so why would people not use it? There's a virtual [inaudible] which is
broken, and we are documenting the error. We haven't contacted the authors yet. But again,
this is a store-store ordering issue where the two stores are assumed to happen in that order;
we need a fence between them.
So there aren't that many of these, but they need to be captured and thoroughly
documented. And people are relying on this happening reliably, especially in very
ambitious GPU codes where ghost zones are used for multiple overlapping [inaudible] and
then they compute locally. But they really need fence semantics there.
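To make the store-store problem concrete, here is a sketch in the shape of that published reduction, not the book's exact code: thread 0 of each block stores its block's partial sum and then bumps a global counter with an atomic, assuming the atomic also orders the earlier store. It does not, so the last block can read a stale partial value unless a __threadfence() separates the two.

    __global__ void reduce_then_publish(const int *in, int *partial,
                                        unsigned *done, int *out) {
        __shared__ int s[256];
        unsigned t = threadIdx.x, b = blockIdx.x;
        s[t] = in[b * blockDim.x + t];
        __syncthreads();
        for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (t < stride) s[t] += s[t + stride];   /* per-block tree reduction */
            __syncthreads();
        }
        if (t == 0) {
            partial[b] = s[0];                       /* store the block result   */
            /* missing: __threadfence();  needed so the store above is visible
               to other blocks before the atomic below is observed              */
            if (atomicAdd(done, 1u) == gridDim.x - 1) {
                int total = 0;                       /* last block sums partials */
                for (unsigned i = 0; i < gridDim.x; i++)
                    total += partial[i];             /* may see stale values     */
                *out = total;
            }
        }
    }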
Yeah, so CS has a role to play. This is just a little interlude. We need to build some cool
show pieces, so we built a cool show piece. They're planning to teach a concurrency
course based on it. So this is a little Raspberry Pi cluster. We actually -- since we are
getting up to teaching HPC and getting hands-on, we have some students -- my
two good students got [inaudible] a blank check, basically, and they bought the
Raspberry Pis. It runs MPI, so this is some -- we can -- it's fine. We can run it -- I can
show you offline; I can do the compilation and then [inaudible]. It's like a late-'90s
Linux machine, each core. And then [inaudible].
All right. So I have looked at the basics of GPUs, and now let me get into the more
serious aspects of GPU debugging, so having gotten past the first order errors, where do
we go next.
So this is a project where Ali and I and Shaz have inspired each other and published
many papers with a similar thrust. And the tool that we have is called GKLEE, like I said
before. So let me get a little bit into its operation.
So we can take advantage of the fact that, for much of what programmers write, a GPU
program is largely syncthreads barriers where the threads all align; they all compute
in their own private space, or some private portions of the global memory, and then
synchronize again.
So the general idea is that a CUDA program can be thought of, in this abstraction, as: sync,
then some reads and writes, control flow, whatever, which show up as linear sequences. So this is
already an embarrassingly parallel situation, so you don't need to run all possible
interleavings here. I.e., all these executions are [inaudible].
So if this is the only conflicting pair -- let us say this is a read and this is a write to the same
address, and let us say every other read and write fall into different locations -- then
you don't need to [inaudible] to discover this race. You can take this to the extreme:
You can actually run a single canonical schedule. This is also going back to [inaudible]
days of 1990 where you can sort of run this thread sequentially and then record every
access in that sequential run and then at the end of the run you call the SMT solver to see
if there is any overlapping access that you can solve. If so, there's a race.
So this is the style of race checking that we implemented initially. And of course, let's
also see -- I mean, this of course didn't scale that well, so we are coming to better
techniques. The reason this doesn't scale is obvious, because if you are given a 200,000-thread
CUDA program, the execution length is very long. And there is symmetry -- why are we
doing all that? So we'll come to that.
But let's even look at an actual CUDA program to see what races there might be in this
CUDA program. So obviously there is a race when threads 0 and 1, or 1 and 2, are
colliding. But then there's the barrier, and then there's a conditional: all the even threads
do a read, and all the odd threads do a write involving that.
Okay. Well, this is a conditional. So either this happens or that happens. Is that true?
Well, that's where we need to be a bit careful about what conditional it is. It is a
conditional on the threads divergence. So either all the odd threads execute first, then the
even threads, or vice versa. And the order is unspecified. So there is a race there.
So you do have a classical race. And then there is an unordered situation of whether you
do the read or the write along a divergent warp. So this is --
>>: And you're assuming there's only one warp here.
>> Ganesh Gopalakrishnan: This is assuming one warp in this case, yeah.
>>: One warp.
>> Ganesh Gopalakrishnan: Yeah, yeah. And this is not specified in the CUDA
documentation, so we give it a name: a porting race. If you ported the code to a different platform, you
might find a different behavior, that kind of thing. So we are looking for both kinds of races,
in a sense.
>>: But if you assume a warp stays at 1 --
>> Ganesh Gopalakrishnan: Oh, yeah. Yeah, yeah. Then -- yeah.
>>: [inaudible].
>> Ganesh Gopalakrishnan: That's right, that's right. Yeah.
>>: Okay.
>> Ganesh Gopalakrishnan: So I guess --
>>: So this is still kind of [inaudible].
>> Ganesh Gopalakrishnan: Yeah, but it's still -- the fact that it is getting sequentialized
in a certain way may give somebody the comfort that the results are looking deterministic
even though you have conflicting accesses, of course, [inaudible]. So the caution there is that if
you were to run it on a different platform, the write might happen before the read, and the result
changes, that kind of thing.
Yeah. So this is like the different arguments of a C function call being unordered:
there is a possible race. Same thing.
So we actually do a canonical schedule in this way of this kind. So just to give you a
point.
So we are going to run two threads and -- or as many threads as there are. And then we
notice the accessing pairs across them.
That's the -- that's what I said [inaudible]. Okay. Okay. So that was the original tool in
PPoPP '12, and there we built an LLVM-based flow. And then -- as I said, the
concrete witnesses are a nice handy feature of these tools. They tell you where the error
is with threads and all that.
Okay. So the next improvement was to try and avoid having to examine all the threads
concretely. How about if we look at how threads are diverging based on their IDs. Okay.
So that's the next idea.
So imagine an entire program where you have branching based on the block ID and
thread ID of the threads. So what we have is a situation where all the
threads come to a point where some of them flow this way, others flow that way, and
they are splitting into lanes.
So we have symmetry in each flow group. That's the observation. So within each flow
group, like here, the threads are executing essentially identical code.
So what we do in an actual implementation is to model two -- well, basically we run one
symbolic thread and then make a clone of it and give it another TID, which is different
from the original guys' TID. So we are running two symbolic threads per flow group.
And that is possible in a symbolic executor. You don't need to get concrete IDs. You can
pretty much run these flows. Go ahead.
>>: So I'd really like to understand this.
>> Ganesh Gopalakrishnan: Absolutely.
>>: I don't get this business of whether you have one or two threads per flow. So which is it?
>> Ganesh Gopalakrishnan: It is an actual implementation detail -- my student
[inaudible], which I don't fully understand -- but let's say that we are --
>>: [inaudible].
>> Ganesh Gopalakrishnan: In principle let us say that we have an assume TID different
from TID prime.
>>: Okay.
>> Ganesh Gopalakrishnan: And then we are executing each flow.
>>: [inaudible] execution you have one flow.
>> Ganesh Gopalakrishnan: Yeah.
>>: Two arbitrary threads with different --
>> Ganesh Gopalakrishnan: Arbitrary. Right? So --
>>: [inaudible] condition.
>> Ganesh Gopalakrishnan: Yes.
>>: And you may also [inaudible] is there only one solution in case they both go the
same way? If it's possible for them to go different ways [inaudible] four threads, right?
>> Ganesh Gopalakrishnan: That's right. So each time there will be an assume of this
condition or an assume of that condition.
>>: Okay. So these four threads all have distinct IDs?
>> Ganesh Gopalakrishnan: Yes.
>>: And in this flow the IDs are even.
>> Ganesh Gopalakrishnan: Right.
>>: In this one these are odd.
>> Ganesh Gopalakrishnan: That's right.
>>: Okay. And then --
>> Ganesh Gopalakrishnan: Yes. And that would be part of the assume -- assumed on
that execution path. So it's a very clean model that it is building: LLVM constraints, so for
each LLVM load and store it is an SSA kind of constraint, an equality between the new
and old values. And for each conditional, it is a path constraint.
>>: So once you -- once you pass one of these [inaudible] conditionals --
>> Ganesh Gopalakrishnan: Yes.
>>: Split. And in your initial work it's like there's no return from splitting.
>> Ganesh Gopalakrishnan: Initial?
>>: With this initial work.
>> Ganesh Gopalakrishnan: Yeah.
>>: Just like once you split into two flows, that's it. There's no way back.
>> Ganesh Gopalakrishnan: There's no way back here. That's right. Yeah. So for this
example itself, we'll be generating several race checking conditions. We don't do safety
assertions here because we don't have full assertions. People don't [inaudible] we are
assuming a world where we are not given an input condition or [inaudible] assertions.
And [inaudible] so far we have focused on [inaudible] race checking for this tool.
So if I were to lift out the skeleton of the flow: what we will be generating are
verification conditions for race checking, which amount to recording all the accesses in
this flow and asking, under that condition -- which is the flow condition plus the TIDs being
different -- whether it is possible to solve for an overlap between some conflicting pair in
that flow.
And we do that within every flow. And then we do it across flows also, because, like
I said, in a divergent warp you don't know the scheduling order between the odds and
evens. So we do the checking across flows as well.
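Roughly, and glossing over details of the actual encoding, the condition checked per pair of recorded accesses $i$ and $j$ -- taken within one barrier interval, from the same flow or from two different flows -- has the shape

$$ \exists\, t, t':\; t \neq t' \;\wedge\; \varphi_i(t) \wedge \varphi_j(t') \;\wedge\; \mathit{addr}_i(t) = \mathit{addr}_j(t') \;\wedge\; (\mathit{isWrite}_i \vee \mathit{isWrite}_j), $$

where $t, t'$ are the symbolic thread IDs and $\varphi_i, \varphi_j$ are the accumulated flow (path) conditions under which the accesses were recorded. A satisfying assignment from the SMT solver is exactly the kind of concrete witness the tool reports.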
>>: Does this assume some kind of independence of addresses on data or something like
that? I mean, you want to look for a race by looking at some finite number of threads.
>> Ganesh Gopalakrishnan: Yes.
>>: So I'm assuming that [inaudible] read at is not dependent on some value I read that
was written by some other thread that we're not [inaudible].
>> Ganesh Gopalakrishnan: We haven't written a theorem, to be honest, so I think it is
possible to write a proof that if you have an actual assertion for the starting state of the
system -- so let me make an example. A reduction kernel sort of works like a logarithmic
tree so each thread adds to its neighbor. And then it adds two away and then four away
and so on.
So after you do that, the behavior, the value of thread 0 is going to be dependent on all
the locations of the array and things like that.
>>: Where a given thread is reading and writing.
>> Ganesh Gopalakrishnan: That's right.
>>: So and it's the addresses that determine the races.
>> Ganesh Gopalakrishnan: That's true too. That's right.
>>: If you do the prefix sum --
>> Ganesh Gopalakrishnan: Yeah.
>>: And then it's quite common to do a prefix sum and then write to an index which is
like A or B [inaudible] prefix summary.
>> Ganesh Gopalakrishnan: That's right. Yeah. So if along the way it flows into a
sensitive spot, which is in this case an address access, that becomes germane -- the value
that you compute which turns into an address becomes germane. Yes.
>>: Right. So then -- so let's say -- so basically if I understand it is if you have your --
some set of concrete threads --
>> Ganesh Gopalakrishnan: Yes.
>>: And a bunch of the rest are abstract threads, so the concrete threads are
parameterized by their ID.
>> Ganesh Gopalakrishnan: The concrete threads basically give you a range for the
whole thread population. And what we execute --
>>: [inaudible] representative threads.
>> Ganesh Gopalakrishnan: Yes.
>>: Now, if I read some data out of the memory and I then use that as an index, right, or
as an address, right, then that value may have been written by some other thread.
>> Ganesh Gopalakrishnan: Absolutely.
>>: Some model [inaudible].
>> Ganesh Gopalakrishnan: Yeah.
>>: In that case doing [inaudible].
>> Ganesh Gopalakrishnan: In this case, there's a very interesting thing going on.
You cannot be reading and then using that value, because we are going to do this check per
barrier interval. So within each barrier interval, you basically cannot have any
communication between threads. So we are going to be just picking up the values and
then incorporating whatever value you have into the final state.
>>: [inaudible].
>> Ganesh Gopalakrishnan: But when you want to build information across threads, you
had to sort of be closing ->>: All right. I think I remember that.
>> Ganesh Gopalakrishnan: Values.
>>: [inaudible] so then that variable --
>> Ganesh Gopalakrishnan: Yes.
>>: -- you may need some kind of an annotation.
>> Ganesh Gopalakrishnan: That's right.
>>: [inaudible] so is that something that's user --
>> Ganesh Gopalakrishnan: Well, in this case, since we are computing, this is a
computational framework, the invariant is basically the strongest invariant. We basically
are computing forward.
I don't know whether you need a more relaxed invariant than --
>>: How do you compute -- how to -- that means you have to deal with computing the
strongest closed form for --
>> Ganesh Gopalakrishnan: Okay.
>>: -- arbitrary number of threads.
>> Ganesh Gopalakrishnan: Right.
>>: For a parameter of the system.
>> Ganesh Gopalakrishnan: Right.
>>: What I'm wondering [inaudible] the next question is: suppose you have a -- this is a
simple example. A thread does A of tid becomes equal to tid, okay, and then there's a
barrier. So this would not produce a new [inaudible].
>> Ganesh Gopalakrishnan: If equal to tid --
>>: No, control [inaudible] just a single state [inaudible] tid.
>> Ganesh Gopalakrishnan: I -- yeah, that's an interesting point. Depending on -- yeah.
I think you have to basically fork. I don't know whether that is
happening in the current system, but that assertion can be satisfied or not by different
instances of it. Right? Before even starting, the content of the array?
>>: I mean, you write your TID to A of TID and then you synchronize. Okay? And
then you say B of A of TID plus 1 becomes equal to 42.
>> Ganesh Gopalakrishnan: I see.
>>: Now, this would be data race free by construction because A is populated in a -- this
is all in a -- all [inaudible]. So GPUVerify would not be able to prove [inaudible]; you need a
barrier invariant in there to make this work.
>> Ganesh Gopalakrishnan: Okay.
>>: And when I was reading your paper, I was trying to understand whether -- where the
GPP -- I understand GPP would have no problem because it has all the [inaudible].
>> Ganesh Gopalakrishnan: GPP computes, yeah.
>>: So why are we trying to [inaudible] do you have this approximation of the shared
state.
>> Ganesh Gopalakrishnan: I have -- as I told you, we have yet to write a full formal
account of this work, but I believe that if you have a full assertion covering the entire set
of state variables, what you are doing is computing for two proxy threads. But if you
were to expand these two proxy threads to all the TIDs present in every flow, you have
the power to determine the full extent of the state when you get to the barrier. And then
you will be solving for a specific starting initial state. You are not covering the full
generality of the code. But given an initial state, you are computing essentially all the
symbolic next states for every location.
And then the user just -- they cannot -- well, let's not get into too many details here, but I
do understand the concerns here. But the usage of the calculated value, as you expressed
in the example, happens in a subsequent phase, I guess, because if you have a data race
within the interval, we would have reported it and found it. But the usual
usage of calculated values happens after the barrier.
So let me think about that and we need to drill into it a little bit more and --
>>: Seems like you could do a very compact -- very compact execution from one barrier
to the next.
>> Ganesh Gopalakrishnan: Yes.
>>: Then if you wanted precise information, you'd have to do quite an expensive
assumption across all [inaudible] of that.
>> Ganesh Gopalakrishnan: [inaudible] yeah. I'm yet to write a proof, but I don't -- I
haven't really found -- yeah. I guess the work is yet to be taken to that level.
>>: Okay.
>> Ganesh Gopalakrishnan: Yeah. So what we are doing actually -- the completeness
theorem would be interesting to write, saying that if there's a race, under what conditions
it is guaranteed to be found. That's something that a student who's finishing up soon has to write
up.
But I don't -- I don't see why this should be an issue because we are, again, starting from
an assertion. And I'll leave it at that.
>>: Okay.
>> Ganesh Gopalakrishnan: I'll come to a related story which is under construction
again. This is assuming that there isn't, let us say, a usage that breaks the symmetry of
address use; for many examples we were able to find all the races that we had found
before.
But there is another issue lurking here in what we addressed initially: you might have
this flow division happen in a loop. So we're back to exponential again, because one
flow may have to build up enough knowledge for the next flow.
So we have experimented with doing LLVM static analysis to see how much information
flows from one [inaudible] to the next, and whether the flow is [inaudible] for the sensitive
spots, which are precisely the address positions.
So we are able to take the flows in one barrier interval and do a forward analysis to
see whether some of the values computed under the different flow conditions flow into
the addresses or the conditionals. And if that doesn't happen, we do a big OR
of them. So for any flow that doesn't feed forward, we basically build an OR
constraint and go with a single flow.
So this technique is valuable. That's the most definite result -- barring, since we don't
have a theorem at hand -- that this is tremendously valuable. And we have covered
several of the benchmarks that we couldn't handle before. And we have gone after
several of the larger benchmarks: the Lonestar benchmarks from UT, which are a pretty
serious set of GPU benchmarks, and Parboil from UIUC.
And, yeah, this explosion goes away for several classes of examples. So look at
bitonic sort, which was one example where it exploded. If you look at the structure
of bitonic sort, each thread decides, with the data, where its
location will be at the end of one barrier interval. And then the
actions that certain threads will have to take in the next barrier interval depend on other
threads' actions in the previous barrier interval. But there's no flow into the address
position itself. So we were able to collapse the flows. So there seems to be a basis to
formalize that.
All right. So I would say this is the story for GPU verification as far as the tool goes.
The tools exist and the student is wrapping up and we'll discuss the later details when we
come to his dissertation.
The nice thing about the execution-based framework is also that we are able to detect
unexpected traces, which is a valuable thing. So if you have programs where an out-of-bounds
access happens, we are able, thanks to the symbolic execution, to find
the cases in which it occurs.
All right. What I'm planning to focus on now, in the remaining part of the talk, is how we
are able to take at least some of the formal ideas and maybe try to apply them at larger
scale. So this is just one project which has gone this way. But what it has allowed us to do
is situate ourselves in the context of a large-scale project going on here: Uintah.
So the project itself has been going on for a decade now. And they're building a
multi-physics simulation framework. And it's a very structured computational framework
with a million lines of code in it and the pretty serious effort of several Ph.D. students. And
they have achieved scalability of this codebase on national leadership machines.
So let me tell you a little bit about how the system is architected, tell you about
the one technique we have tried which has gained some traction, and see how we might
take that forward.
So the system itself is designed around a nice partitioning of concerns, where the designer
writes the computational intent in C++ as a task graph of sequential activities. So they
largely write sequential functions and they express dependencies.
And this might be various force calculations and other [inaudible] or pressure
calculations and how they are to be orchestrated.
And then the infrastructure takes this task graph and executes it in a maximally
parallel way; there's a scheduler that understands how to schedule these and arranges for all
the MPI message passing and all that.
And the system architecture, just to give you a glimpse of what the scheduler contains,
it's a fairly interesting system. We rely on the designers of the system to give us a handle
on understanding it. So a main part of this was the experience of seeing something
designed to a large extent. They have a data warehouse, which is a distributed hash table that
parks the data. They have various task graph nodes being evaluated on a node. There will
be [inaudible] of this evaluation engine elsewhere [inaudible]. And then, when there are
data dependencies, there are messages that are being waited on.
The overview of what I would like to convey is that this is a system that they have
engineered to include GPUs and Xeon Phis; they have very smoothly extended the system.
What can we do and how complex are these? These are all the examples of fast-moving
HPC frameworks that groups develop to get [inaudible] done. And we had this two-year
involvement with a postdoc. So a lot of what we tried to do was to understand the
system. Then we said, okay, these -- they have documented failures so how about trying
to find a cheap way to diff two versions. And that might be a good bug root-causing
facility that we can give them.
So by two versions I mean, for instance, two schedulers. In some cases they had an older
scheduler; when they migrated to a different scheduler, something broke. So they
want -- we want to identify where the fault was and help isolate it.
In some cases the two different inputs may have caused the system to behave differently.
Or in some cases there might have been a latent nondeterminism where they observe
something.
So when I said diff, it would be diff across any of these variations.
So I guess this is very similar to -- and, in fact, [inaudible] worked on similar graph
structures, call graphs and call path graphs and things like that, long ago. A lot of these call
graph methods are around, and we devised a particular version of them. And the name
that we give it is coalesced stack trace graphs, or coalesced call path graphs. What it is is a control
flow summary of the system: how does control reach a particular point.
And the usage modality is that the user identifies where the system broke
and fixes on a function call as being the vantage point of interest. In this case
it's a data warehouse call.
So then the recording is turned on for, say, one scheduler, and that records all
the calls that start from the system initialization and go through different function calls.
And each node here represents a call from a certain calling context of its parent. And
then you develop these big fat edges that tell you how many of this function's calls
reached this function, and things like that.
So this is a summary of how many calls went through various functions starting from the
top and reached a particular point.
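A minimal sketch of the underlying mechanism, with hypothetical names and not the actual Uintah/CSTG instrumentation: at the vantage point you capture the current call chain with glibc's backtrace facilities and count how many times each distinct chain arrives; dumping the counts from two runs gives you something you can diff.

    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CHAINS 1024
    static struct { char key[2048]; long count; } chains[MAX_CHAINS];
    static int nchains = 0;

    void cstg_collect(void) {                 /* call this at the vantage point */
        void *frames[32];
        int n = backtrace(frames, 32);
        char **syms = backtrace_symbols(frames, n);
        char key[2048] = "";
        for (int i = 1; i < n; i++) {         /* skip cstg_collect's own frame  */
            strncat(key, syms[i], sizeof(key) - strlen(key) - 2);
            strncat(key, ";", sizeof(key) - strlen(key) - 1);
        }
        free(syms);
        for (int i = 0; i < nchains; i++)
            if (strcmp(chains[i].key, key) == 0) { chains[i].count++; return; }
        if (nchains < MAX_CHAINS) {
            strcpy(chains[nchains].key, key);
            chains[nchains++].count = 1;
        }
    }

    void cstg_dump(FILE *f) {                 /* write counts out for diffing   */
        for (int i = 0; i < nchains; i++)
            fprintf(f, "%ld  %s\n", chains[i].count, chains[i].key);
    }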
And what we basically do is record it over two versions and then find a diff. And all we
can say at this point is it informs -- it has helped inform the designer and we have four or
five compiled examples of how this brought us very close to being able to debug further.
So this is a setting where the user interacts and might use printouts, might use a
debugger, but this brings you closer with less effort.
So the formalization of this, I don't know what to say, it is just a quick way to x-ray it.
And we have seen that this is something that you can turn on without too much overhead.
So what we do is we don't deploy it at a thousand cores, a million cores or anything. In fact,
in a real instance we take the full codebase and populate this -- do this instrumentation for
three, four processes.
This code is available also. You can actually slap collection calls into any sequential or parallel
program, and it records these things.
And, yeah, we even found an uninitialized-variable case this way, by recording over two
different frames of time. There was one instance where we found an extra call going
through, where we then had to use a debugger, and the rest was a surprising value that just sort
of indicated some garbage sitting there. So this is -- all right.
So it really is an interesting eye-opener for us because we want to help these kinds of
users. And we thought what is the next step. So the real eye-opener is that these
systems -- so this is a big picture graph, and I'm almost done. This is trying to [inaudible]
how you might make some of these large-scale systems observable.
So not only do we want these systems and these subsystems to be verified -- a simple
[inaudible], push a button -- but we really need a hierarchy of observers in any such
system. We can't really run these systems open loop. So, i.e., automata.
There seem to be ways to infer the protocols existing at various interfaces. And it's also
clear that these automata cannot be at one grain of time, because -- and we cannot use
some of these automata at the lowest level. If you put an automaton to observe the
actions of a nonblocking data structure, well, you're going to lose all your performance.
So we have to sort of situate these automata at a medium event size.
Because actions here span several orders of magnitude in time, from nanoseconds to
milliseconds. So we might have to define a hierarchy [inaudible].
And then, the way to -- we have tried to use automata learning for selected modules. So we
did passive learning on the data warehouse. We have some examples of automata
being able to be inferred. So the positive signatures are the observed behaviors; for the
negatives, the user has to write assertions saying these causalities are there, and
we try to use SAT-based automata learning methods.
So I think this is a really good way to imagine building a large-scale system where you
have the ability to actively test and observe. And then if a fault occurs, there must be a
quick way to summarize the fault, maybe using a coalesced stack trace graph or something
like that.
So hopefully, if this idea flies, I'll get some money in a few months. Otherwise -- this is an NSF
proposal and we'll submit it again. This is the proposal [inaudible] okay. Yeah.
With that I think I'm done. So correctness in HPC has a different ring to it, and it has
been an interesting experience of really going forward without full specifications. I think the
main pet projects are floating-point correctness -- that's a real attention gripper for us
because we think we can get somewhere with that -- and system resilience, where there's some effort.
And GPU verification we really want to push forward and get some tighter guarantees
and also release the tools.
And finally, the need for user training and community involvement is also clear.
So with that I'll wind down and take any questions.
[applause]
>> Ganesh Gopalakrishnan: Go ahead.
>>: So there's been, you know, people thinking about [inaudible] problem in hardware,
for example, where something goes wrong in the lab and you're trying to reconstruct
some [inaudible] happened in the hardware even though you couldn't observe it and you
have among other things this problem of what do you want to record, you know, given
that you have a limited ability to record a trace at runtime.
So I would suppose that the same kind of problem occurs in HPC in a big computation
[inaudible] obviously can't trace everything.
So are there ways that you can get at this and try to infer, you know, what is the -- what
do you -- you know, what do you need to know to reconstruct, you know, one of these
graphs [inaudible] useful trace of execution?
>> Ganesh Gopalakrishnan: Right. Yeah, well, we had to limit our ambitions, like Shaz
always says, and we had to sort of attack smaller problems to get more control.
So just even imagining textbook [inaudible] programs where the user does all the rank
division and lets the processes do something -- yeah, we could record some of the salient
events where the nondeterministic dependences happen: wildcard receives, nondeterministic probes.
[inaudible] deterministically, so there's a minimal recording possible. Micro-checkpoints
are possible, depending on the API and the system. But as soon as you cross -- now we have
two things, [inaudible] okay. Yeah, things start getting ugly.
>>: [inaudible] because in hardware [inaudible] you have an observability problem.
>> Ganesh Gopalakrishnan: Yes.
>>: But usually if something's going [inaudible] after a fairly small amount of time. And
so some billions of cycles. Still, that's pretty fast. And so you can do a lot of replay
trying to hit [inaudible].
[multiple people speaking at once].
>>: [inaudible] that's some gigantic computation that, you know, ran overnight or
something. It seems like you have a -- in a way a much larger problem.
>> Ganesh Gopalakrishnan: I think the real hope is that -- this system, Uintah, is very,
very beautifully architected, and there are other examples of that; the Charm++ system is
one. They have an architecture which reduces these task graphs according to causality.
So there may be some events that you can record at this level. So it gives you at least an
initial control skeleton of where things happened. And they're not -- and these can be
recorded distributed -- in a distributed way.
>>: Are the architects interested in questions of determinism and replay and so forth? Is
that on their radar?
>> Ganesh Gopalakrishnan: Yeah, this is an interesting reality check question.
>>: [inaudible] yeah.
>> Ganesh Gopalakrishnan: They were also having the same problem as [inaudible]
researchers. They had to [inaudible] and then look for the [inaudible] and there had to be
certain other objectives, like they had to solve the next physics problem or the next new
machine.
So I think this area has this feeding frenzy. There's a new machine, and it makes all the sense
in the world to jump onto the new machine, just because you run faster with
less energy. And then that's where things break also.
But all said and done, yeah, we are working in this controlled way of writing NSF
proposals. And if this gets funded, we will have a person on that, yes. Yeah.
And the other thing is this task graph framework is so nice that you can make some of
these tasks checker tasks. And the nice thing is then it can use the load balancer, it can
run on a [inaudible] core automatically.
So the framework is well engineered to stick in some of these checker tasks along with
the regular ones. So I really -- I'm really enthusiastic. As soon as, yeah -- well,
yeah, I had this experience and wrote this proposal. We think that the submission is not
bad. This is the right approach, because there are so many ways that you can get clues. So task
graph reduction gives you a partial order. This is a well-architected scheduler flow, the
scheduler queues, you know.
So the design intent is clear for many, many facets. So if we can somehow intersect them
and put the right minimal observers, so that might be a good way to replay up to that
point.
And then these systems also checkpoint after every millisecond or milliseconds, takes a
system checkpoint, although that's not going to be affordable in the long run. Yeah.
Yeah, apparently when I went to the Sequoia machine I learned a little more: apparently they have this
maintenance day. So five to six days a week they compute, and one day, whatever, the
AC has to be fixed. It's a huge facility. So all jobs get checkpointed. But of course we
can see that this needs more than that. Yeah. Yeah.
Well, thanks for all the questions. And, again, thanks for the opportunity.
[applause]