>> Madan Musvathi: Hi everyone. I'm Madan Musvathi, a researcher in
the Research in Software Engineering group. It's my pleasure to
introduce Ganesh Gopalakrishnan today. Ganesh is a professor at the
University of Utah where he's been for the past 12, 13 years.
He has a broad range of interests, and he's worked on
verification, cache [inaudible] protocols, memory models, and
whatnot. So today he's going to be talking about making ISP, which is a
verification tool for MPI programs, practical.
Ganesh.
>> Ganesh Gopalakrishnan: Thank you, Madan. Good afternoon everyone.
Thanks for inviting me. So this is a great pleasure to present the work
of my two PhD students, one former, one current. They did pretty much
all the initial work in this area and are continuing it.
So what I would like to take you through today is the rationale, the
big picture of why we started verification research in this domain that
was not very familiar for us.
The approach that we have taken for modeling the problems in this
domain. Then the actual tool that we have running. I'll do a tool
demo also for you. And then we'll actually take you through a
miniature development process of an MPI program and how our tool might
help in that area.
So four, five, six years ago we were looking for application domains
where software verification could be meaningfully applied, which meant
that we needed to have an abundant supply of case studies that we could
run our tests on.
So by some sheer coincidence or luck, one of the researchers in our
area, George, suggested we look at high performance computing software.
So we took his suggestion along.
And the big picture of what we have done since then has been trying to
understand a model of concurrency that exists. And every feature in
this world seems to be well meant and well thought out based on real
needs. And, yet, when all these features add up, they are giving us an
opportunity to explore a nuance of concurrency and combination of
features that are chock full of problems.
So I'll take you through how we have managed to step through some of
those problems and solve them using formal techniques. So one could
look at classical models such as linearizability and things like that.
And if you really go back and study fundamental models like that, there
will be assumptions made of the interface that each thread sees,
meaning each thread has one outstanding method at a given time and so
on.
Yet, one can look at modern programming research where one can see
research on asynchronous method calls where methods are detached and
they interact through shared memory and the main computation moves on.
Now, a lot of APIs are being proposed for multi-cores going into the
future. And there might even be an alphabet soup of APIs that an
application might deal with. The general sense we are getting is one
needs to be looking at different models where we might have
asynchronous method calls like are found in other systems. But now we
have these asynchronous method calls interacting through message
passing, not through shared memory. So this is a new twist.
And then there are multiple other features like nondeterminism
happening in terms of when receives can actually take their sent
values.
Then there are global operations like barriers that can regulate the
progress of all the processes. And the resource availability or lack
thereof affects forward progress. This is another feature.
There are oodles of libraries. They have various progress properties.
So it's an interesting mix of features that we are confronted with.
The hope was for us to deliver something like a dynamic verifier in this
context. A clear temptation is to go back to a pristine area of
computer science and not do anything here. But somehow it will be good
to push on and see how far we can go confronted with these problems.
So that will be the general feel for the story. And the initial work
in this area was laid down by some of my students. So we started
understanding this rather large API and trying to understand
each of its features. We even wrote a formal semantics for MPI and
made it even executable from a C front end and so on, to gain some
understanding.
But we actually felt that we cannot rely on formal semantics that are
built just on a study of the interface; we need to get into the
semantics and understand the real exploitable features of the API.
So that's where the story's going. So since I didn't know how many
people here knew about MPI or not, I have a standard set of slides.
So basically MPI is widely used for cluster computing at all scales.
It's the lingua franca of parallel computing. It's called from
multiple languages and multiple processes execute. Each process
computes part of the solution. And together they solve some real-world
problem, like explosion modeling or physics.
The execution of MPI programs happens on cluster machines of all sizes
and scales. And very mature MPI codes exist, so much so that people
are afraid to rewrite any of these codes using modern notation. So the
longevity of the standard is well assured. Although it has to evolve
and adapt to thread models and other models.
But clearly this seems to be a very worthwhile area for us to be
devoting significant years and student research time.
Very simple illustration of what an MPI program can do. One can look
at how one can integrate the area under a curve, whereby one can spawn
multiple processes. So in this illustration it allows us to integrate
under the different subregions and then send the partial integrals
using a send command to a matching receive, which is taken by the master
process. And it adds up the integrals. So that's the general feel for
how an MPI program is set up.
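To make the shape of such a program concrete, here is a minimal sketch in C of the master/worker integration idea, assuming standard MPI C bindings. The function f, the interval split, and the slice width are illustrative assumptions, not the program on the slide.

```c
/* Sketch of the numerical-integration example: each rank integrates its own
   subregion and sends the partial integral to rank 0, which adds them up.
   Illustrative only; f(), the interval, and the slice width are assumptions. */
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }          /* curve to integrate */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process integrates [rank, rank+1) with a simple Riemann sum. */
    double local = 0.0, dx = 0.001;
    for (double x = rank; x < rank + 1; x += dx)
        local += f(x) * dx;

    if (rank != 0) {
        /* Workers send their partial integral to the master, rank 0. */
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = local, part;
        /* Master receives one partial result from every other rank. */
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(&part, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += part;
        }
        printf("integral = %f\n", total);
    }
    MPI_Finalize();
    return 0;
}
```

The deadlock discussed next corresponds to the master's receive loop naming the wrong source ranks.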
But as soon as you write a program like this you can see, oops, this is
not in the pristine functional sequential world anymore. There are
dynamic communications happening. And these are communications which
name the processes for each of the messages to be sent. So here it is
trying to send to process 0.
But the programmer is starting the iteration from 0, so it's trying to
receive from source 0. But no sender has rank 0; the ranks of the
sending processes start from 1.
So there is a mismatch in terms of communication expectation. Nothing
is coming from 0. So there's a deadlock.
So even in a simple program of a few lines one can introduce a deadlock.
So this is the story. Okay. So how about taming the complexity by
trying to create a simple API of 10 or 20 functions, using only that,
and implementing all other functions using these entities? That would
be the computer science approach. Turns out the large size of MPI is
partly explained by the wide deployment base it has and the number of
applications it caters to.
So it offers a very broad interface of which many applications take a
tenth of the functions and it's a different tenth for different
applications. So it seems inevitable that we look at the large space.
Okay.
>>:
What do you mean by out-of-order completion?
>> Ganesh Gopalakrishnan: Out-of-order completion. It will be
introduced in a few seconds, but the idea is that when the MPI calls
are issued in program order, they need not finish in program order.
The idea is that the fundamental guarantee that MPI makes is that of
point-to-point message FIFO ordering, or non-overtaking. So if
you're sending messages to different destinations, you can afford to
finish a later send, which might be sending a smaller message, ahead of
the earlier send. This is a real aspect of these deployed asynchronous
communications that we need to model and handle.
So this is a fundamental aspect of our work and it will show up in many
places. So, like Madan was asking, it is not just the nonblocking
operations but the out-of-order finishing of the operations that we need
to be modeling.
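As a hedged illustration of that latitude, the following fragment (ranks, sizes, and tags are assumptions) posts two nonblocking sends to different destinations; MPI is free to complete the small one first, whereas two sends to the same destination would have to obey the non-overtaking order. It needs at least three ranks.

```c
/* Illustration of out-of-order completion: the second Isend (small message,
   different destination) may complete before the first (large message).
   The ranks and sizes are illustrative, not from the talk's slides.        */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double *big = malloc(1 << 24);          /* large payload to rank 1 */
        char small = 'x';                        /* tiny payload to rank 2  */
        MPI_Request r[2];
        MPI_Isend(big, (1 << 24) / 8, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &r[0]);
        MPI_Isend(&small, 1, MPI_CHAR, 2, 0, MPI_COMM_WORLD, &r[1]);
        /* MPI may finish r[1] before r[0]: different destinations, so
           non-overtaking imposes no order between them.                  */
        MPI_Waitall(2, r, MPI_STATUSES_IGNORE);
        free(big);
    } else if (rank == 1) {
        double *buf = malloc(1 << 24);
        MPI_Recv(buf, (1 << 24) / 8, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        free(buf);
    } else if (rank == 2) {
        char c;
        MPI_Recv(&c, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```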
And so in this space we have to make progress by pinning down some bug
classes, and that's what we did. We looked at being able to find
deadlocks such as might happen when you send to two different
processes and then try to receive. Here the deadlock might happen due
to buffer not being available. Or misplaced global synchronization
operations like barriers.
Or here's a feature where a nondeterministic receive is offered. And it
can be matched by either this send or this send, which are both sending
to the process P0. The data itself is not being shown here. So this
might be a viable situation where this send matches the wild card
receive, while the send from P1 matches the receive expectation from P1.
That's all going fine.
But on a different cluster machine, if you port this program, what you
might find is that because of speed the send from P1 is matched early by
the wild card, in which case a deadlock can happen when the other send
is attempted to be matched, because it's not coming from P1.
So this is called a communication race, although all that was there was
a nondeterministic receive. And you might ask why there's nondeterminism
in receive. I'll show you in an actual case study later that
nondeterminism in receive is a meaningful feature which we can actually
use.
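Here is a small hedged sketch of the communication race just described; since the slide program isn't reproduced in the transcript, the ranks and tags are assumptions. P0 posts a wild card receive followed by a receive that insists on P1, while P1 and P2 each send one message to P0.

```c
/* Sketch of the wild-card communication race: if P2's send wins the wild
   card, P0's second receive (from P1) is satisfied by P1's send; if P1's
   send wins the wild card, nothing is left for the second receive and the
   program deadlocks. Ranks, tags, and values are illustrative assumptions. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, data = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);               /* wild card receive   */
        MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);               /* must come from P1   */
    } else if (rank == 1 || rank == 2) {
        data = rank;
        MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```

Plain testing can run this many times and only ever see the benign matching; a verifier has to force both matchings of the wild card.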
We are interested in finding resource leaks that happen. Many objects
are allocated in an MPI program and are never deallocated. So that's
one thing. And we are looking for local assertions placed in the
program.
So we would like to run the actual code. Because the designers finally
get value when they can run real applications through a debugger which
offers good coverage guarantees. Testing methods are notorious here:
there are one- or two-page MPI codes where you can test the program many
times and never find the bad schedule that causes a deadlock. It's very
easy to show that.
Static analysis methods are not very widely usable here, just because
of the concurrency nuances. And the cost of sorting out a false alarm
can be, as reported in a paper by Nagappan [phonetic] and Patrice
Godefroid at a conference last year, several months of designer time, if
you don't want designers to be burdened by false alarms.
Static analysis can give you bugs in regions that cannot be accessed.
Model-based verification can be used if the finite-state models are
creatable. But for real deployed MPI programs nobody will give you a
pristine finite-state machine model that can be separately verified.
So we're left with nothing but the possibility of running dynamic
verification, which is to actually instrument the program and replay
the different schedules, and hopefully we can contain the schedule
growth.
Okay. So when you embark on verification there are a number of
exponential spaces waiting to kill you: exponentials in the data space,
the symmetry space, the interleaving space.
So the message here is that it's reasonably clear in many cases how to
downscale a model so that you present it with a smaller number of
processes and a smaller number of data items to work on. But it is less clear how
interleavings can be similarly shrunk because each interleaving seems
to be its own beast. And so we need the tool to automatically figure
out which interleavings matter, schedules matter. That seems to be one
place where human intuitions are less reliable.
So in a large execution, so even with five threads and five steps in
each thread, one can do some simple math and find out that the number
of schedules, or interleavings, can be more than 10 billion, if all
these actions are equally permutable with the other actions.
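The simple math here is a multinomial count. A small sketch, assuming 5 processes with 5 fully permutable steps each, computes it exactly:

```c
/* Counts the interleavings of P processes with S steps each, i.e. the
   multinomial (P*S)! / (S!)^P, computed as a product of binomials so it
   stays within 64 bits for P = S = 5.                                    */
#include <stdio.h>
#include <stdint.h>

static uint64_t binom(uint64_t n, uint64_t k) {
    uint64_t r = 1;
    for (uint64_t i = 1; i <= k; i++)
        r = r * (n - k + i) / i;   /* exact: r*(n-k+i) is divisible by i */
    return r;
}

int main(void) {
    const uint64_t P = 5, S = 5;
    uint64_t total = 1;
    for (uint64_t p = 1; p <= P; p++)
        total *= binom(p * S, S);      /* choose slots for each process  */
    printf("interleavings of %llu processes x %llu steps: %llu\n",
           (unsigned long long)P, (unsigned long long)S,
           (unsigned long long)total);
    return 0;
}
```

That comes to 623,360,743,125,120 interleavings, indeed well past 10 billion, so enumerating them blindly is hopeless.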
Okay. So, whereas, if you track dependencies in the actions we can
find that by just replaying the dependent actions in multiple orders we
can cover the kind of bugs that we are going after.
So this is the intuition that is present in several techniques. The
pioneering work in this area on dynamic partial order reduction was
done by Flanagan and Godefroid at POPL several years ago.
That's the kind of approach we were motivated by and we want to go
after such a technique.
We also wanted to keep in mind the kind of dependencies that arise in
MPI programs as opposed to what arise in thread programs. In thread
programs, dependencies are fairly pervasive in the sense almost all
interactions happen through shared memory. So the number of schedules
you can play out can be very large. But the interesting thing about MPI
programs is that for very large MPI programs sometimes you can play out
just one interleaving and be done. There is no dependency.
Sometimes these wild card nondeterministic constructs might be offered
but may not materialize during execution. In which case we don't need
to play out more than one interleaving. And when interleavings arise,
they sort of come in bursts, and all the processes may be involved in
that nondeterministic attack on a receive.
So with this kind of structure in place, we wanted -- we couldn't
really go after preemption bounding or heuristic searches like that.
We wanted to be more exact. And that's what we were after.
>>:
Did you also model out-of-order execution?
>> Ganesh Gopalakrishnan:
Absolutely.
>>: So in that case you can not run the program; you have to recombine
the program?
>> Ganesh Gopalakrishnan: Our scheduler has delaying techniques. Our
scheduler captures operations before sending them into the MPI runtime.
It exploits the latitude offered by the API semantics in being able to
execute certain commands out of order, and sends the commands in only
when they're sort of ready to go. That is coming in a second. So
the out-of-order execution semantics is a reality present in MPI and
causes surprises to people. But knowing the API's latitude and being
able to reorder, we can exploit that and show you that such behaviors
will not cause problems. That's the idea.
So in fact this simple example is going to tell us a lot about what we
need to do, even for a simple three-line, three-process example.
For instance, here is a simple example with a nonblocking send, called
MPI_Isend, to process P2. I'm not showing the data. And then a
barrier. A barrier is sort of like a point that all processes have to
arrive at before anyone can cross. Here's a barrier and here's a send
to another process.
And here's a nondeterministic receive from any process, and then the
barrier. So what can happen in this kind of a simple program? One might
even ask, okay, there's a barrier here. So can this send ever match
this receive, because this send is written after the barrier, and this
is before the barrier? In a normal shared memory sense, such a match
cannot occur. But in MPI there are delayed completions. So the time
line of execution is such that you issue this nonblocking send. It's
detached. The computation is going on. Now all processes have arrived
at the barrier. So they can cross.
And while these two are alive, this also becomes alive and now a match
can occur. So this is an aspect of combining nonblocking semantics and
the barrier not really requiring the operations to finish. And again
this is all efficiency-minded, because the data that may be shipped with
these operations may be several megabytes and you don't want that data
to be copied needlessly. So the weak guarantees offered by MPI are such that these
behaviors should be permitted for higher performance.
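Here is a hedged rendering of that three-process example in C. Since the slide's arguments aren't reproduced, the buffers, tags, and placement of the waits are assumptions, and the second receive in P2 is our addition so that both sends are consumed.

```c
/* Hedged rendering of the slide's three-process example: P0 does a
   nonblocking send to P2 and then a barrier; P1 does a barrier and then a
   send to P2; P2 posts a wild-card Irecv and then the barrier. Because the
   Isend and the Irecv are detached, either send can end up matching that
   wild-card receive, even though one send is textually after the barrier. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, a = 0, b = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Request req;

    if (rank == 0) {
        MPI_Isend(&a, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req);  /* detached */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Send(&a, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);  /* after the barrier */
    } else if (rank == 2) {
        MPI_Irecv(&a, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);   /* does not force the Irecv to match  */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);   /* our addition: consume the other send */
    }
    MPI_Finalize();
    return 0;
}
```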
Now, if you look at this behavior and try to sort of do execution
replay saying that let us issue P0 send and P2 receive first, make that
behavior go through, verify that, unwind the schedule, let us now issue
this send and this receive alone. And go forward. That's not going to
work here.
>>:
So with the semantics, barrier has no fence semantics?
>> Ganesh Gopalakrishnan: Barrier has the semantics of sometimes being
able to prevent certain communications across. Sometimes -- so what
barrier's trying to do is erect a constraint on what comes later. Any
operation coming later cannot begin unless the barrier has been crossed.
So it's like a weak fence in that sense.
>>: There's more than one send and receive, too. There's the
synchronous kind.
>> Ganesh Gopalakrishnan: Yes, yes, several models of send. Here
we're just looking at the nonblocking kind. But we're trying to
particularize it to this send, which can be detached. So the barrier is
a mild time alignment device and, of course, we have an algorithm in the
tool, which I'll show later, that finds functionally irrelevant
barriers. We can find out whether the barrier is creating any
orderings more than what's already present. And in some cases it doesn't.
>>: So without the barrier -- you had some other send after the barrier.
That send could have actually happened, could have been issued before
the first send in P0?
>> Ganesh Gopalakrishnan: Issuing happens in program order. But,
yeah, if this send happens to go to an unrelated process, even this send
can finish before the earlier one. But what you have to -- I'll come to
the detail in a second. These are like split transactions or split
operations, a known paradigm. The usual completion operation of a
nonblocking send is a wait, which I haven't shown here.
So when a wait successfully unblocks, that's when the send can be
deemed to have finished. So if you put the wait associated with this
send right here, then that makes the barrier a meaningful entity. This
is for illustration, showing that the fluidity the API offers prevents
certain tricks of common replay debugging from working. We cannot just
issue this piecemeal and that piecemeal and hope for anything
to happen.
The reason being that we cannot issue P2's receive unless we issue this
send also, because of the barrier. And when both sends are present in
the system, the MPI runtime can pick its own send and finish the match.
We have no control over which send the MPI runtime picks. We need some
other technique to match this alone or this alone.
So what we'll really do is this: we know that there is no completion
obligation between the send and the barrier. So we actually, in our
system, will issue the barrier first and delay issuing the send. We
collect the sends and keep them in the system.
So we're trying to discover the largest number of sends that can attack
the receive. And that's why we delay things until they cannot be
delayed any further. Once we know that, we figure out these two sends
can match. Then we dynamically rewrite instructions. So this receive is
going to be rewritten from "receive from any" to "receive from P0".
That is issued, and it forces a matching. And then you can unwind the
execution. This is a stateless search. And then coming forward we can
rewrite this to receive from P1 and issue that.
So this actually is, in a nutshell, what we do. We start the
execution. If the operation is not a blocking call, the scheduler
simply collects the nonblocking call and lets the execution continue
until the process comes to an ordering point; then it switches to
another process.
So once it has delayed things until it can delay no further, it knows
the full extent of the nondeterminism and computes all the possible
choices and then replays. So this is how it goes. But let's try to get
a formal understanding.
This is all very operational, so we'll go to a precise model. So the
main correctness guarantee that MPI gives is very weak. It's
non-overtaking. Between two endpoints, when you issue two messages from
A to B, they arrive in the same order. There's not much more of a
guarantee. So if you can
exploit that understanding and set up a suitable ordering that is to be
satisfied, that gives you then a good understanding of all the
schedules that are permitted between the MPI calls.
It's also assumed that the C code that you write between two MPI calls
does not exploit the particular ordering of completion of the MPI calls,
because the MPI calls, even though they appear in program order, may
not complete in that order. So you are making some assumptions about
not relying on the particular observation being made.
But that is unavoidable. So let us understand a little bit. So I'll
just look at four operations. To make life simple, let S stand for
nonblocking send. Let W stand for the wait operation for that send.
Let R stand for nonblocking receive; W also will be a wait for the receive.
And let B stand for barrier.
So then this is how the time line of send or receive tends to look
like. You initiate from the API the send and the call returns. But
then there are further events associated with the send and receive that
are still happening.
And these other events could be that the send or receive finally finds a
match with a complementary receive or send, and then a completion
event, which says: if I am a send, I've drained my memory into the
system memory, so I have completed my memory obligations as far as the
send goes.
So here's how -- the wait and the barrier -- I'll go through it
in some more detail. The wait and the barrier are blocking operations,
which means the API calls go in and when the call comes back the full
effect of the wait or barrier has occurred.
So send, this is going to get a little bit interesting, but let's go
through it and we'll soon abstract all this detail. A send where the
system provides ample buffering, meaning the system has enough buffer
pool allocated, sort of goes like this. The issue call
returns. Immediately the system has drained your memory so that your
memory is usable again.
So that's when this send is deemed to have completed according to your
memory observation effects. But the send may not have found a match.
The match may occur later.
But now if you run the same send in a system which is constrained, this
runtime doesn't have much memory, then you issue the send. The send
comes back, and then the execution sort of blocks because the send has
to find a match and a receive to drain the memory into, as a
process-to-process copy, not a process-to-system-to-process copy, which
is what was permitted in the earlier case. So the process might look like this.
So the general effect is that when you issue a send, you supply a
handle and then the wait call uses that handle. When you nest the time
lines of the send and wait, the send call goes in and returns. The
wait call happens soon thereafter. And at some point the send matches
and then both are complete.
So all these are good indications of what goes on and these events
exist. We don't know how to exploit all these events. We can't even
observe these events. So what we have created is a simple model called
a completes before model. A completes before relation can be defined,
as opposed to the program order relation, and the completes before
relation is a weakening of the program order.
So that's the way in which we're going to set up the computation of an
MPI program. So here's how the intracompletes before relation, which
is the completes before obligation within a process can be modeled.
So suppose you have two sends that are sending to the same target.
Then their intracompletes before relation follows program order,
because the non-overtaking guarantee is what we need to preserve.
They're shooting at the same target, so we cannot allow the memory
effects of this send to appear to have completed after the memory
effects of the second one. They have to appear to have completed in the
same order. So this is why intracompletes before here respects program
order, so as not to violate non-overtaking.
If you send to different targets, then there's no such ordering. This
is where the intracompletes before ordering can be weakened. So a good
example is a send of one gigabyte to P1 and a send of one byte to P2,
which allows the second one to appear to take effect earlier in an
efficient system.
Receives have a similar non-overtaking guarantee from the pull side: if
receives pull from the same source, they have to obey non-overtaking.
If they're pulling from different sources, you're not going to violate
non-overtaking if you allow the completion orders to differ. I won't go
into other details but basically what we
have done is to define completes before ordering for the particular
events that have been initiated in an execution.
And modulo that completes before relation, we can do analysis on which
points we can delay things and which we cannot.
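One way to picture the intra completes before relation is as a small edge-construction pass over one process's operation list, following the rules just described: same-destination sends keep program order, same-source receives keep program order, a wait follows the operation it completes, and a blocking call orders everything after it. The sketch below is one reading of those rules in C, not ISP's actual data structures.

```c
/* Sketch: build intra completes-before edges for one process's MPI calls,
   as one reading of the rules described in the talk. Not ISP's code.      */
#include <stdio.h>

typedef enum { SEND, RECV, WAIT, BARRIER } Kind;
typedef struct {
    Kind kind;
    int  peer;       /* destination/source rank, or -1                     */
    int  completes;  /* for WAIT: index of the Isend/Irecv it finishes     */
} Op;

static int cb_edge(const Op *ops, int i, int j) {   /* i before j in program order */
    if (ops[i].kind == BARRIER || ops[i].kind == WAIT) return 1;  /* blocking ops
                                                     order everything after them  */
    if (ops[i].kind == SEND && ops[j].kind == SEND &&
        ops[i].peer == ops[j].peer) return 1;        /* non-overtaking, same dest */
    if (ops[i].kind == RECV && ops[j].kind == RECV &&
        ops[i].peer == ops[j].peer) return 1;        /* non-overtaking, same src  */
    if (ops[j].kind == WAIT && ops[j].completes == i) return 1; /* wait pairs up  */
    return 0;
}

int main(void) {
    /* Example process: Isend to 2, Isend to 1, barrier, Isend to 2, wait(op 0).
       Note: no edge from op 0 into the barrier, matching the discussion.    */
    Op ops[] = { {SEND,2,-1}, {SEND,1,-1}, {BARRIER,-1,-1}, {SEND,2,-1}, {WAIT,-1,0} };
    int n = sizeof ops / sizeof ops[0];
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (cb_edge(ops, i, j))
                printf("op%d -> op%d\n", i, j);
    return 0;
}
```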
And if I'm appearing to hurry, please stop me. But I'm hoping to come
to a point where I have given a feel for how we have modeled the very
small programs and then I'll give a slightly larger program and its
animation in terms of the execution so that at that point hopefully you
can bridge the understanding.
Then I'll actually run a tool demo and show you other details. So
let's now take a simple -- go back to the original program that we were
baffled by, which was a send to P2 and then a barrier; a barrier and a
send to P2; and then a receive from any and a barrier. Why did that
program not execute the way we thought it should? If you plot the
completes before graph for this program, build the completes before
relation, there's no completes before obligation between the send and
the barrier, but there is an obligation for every operation after the
barrier.
So this is the situation. So clearly at this point we can see that
these two events are concurrent with this receive. So the completes
before is going to be our handle to define concurrency, when can things
be co-enabled. This notion of co-enabled is taken for granted in a
standard serial, sequentially consistent kind of execution
system. Whereas we need a better way to model co-enabled in our
system. Long story short, there's going to be an intracompletes
before and an intercompletes before. And we build this completes before
graph while we execute forward. And whenever we discover that two
events are not connected by a completes before path, they are
concurrent.
And you need to be aware of that. And our tool actually builds this
understanding and it will show it in the GUI how the completes before
graph looks so the scheduler should know that.
So that's what it is. So two actions are co-enabled if and only if
there's no completes before path between them.
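Since co-enabled is just the absence of a completes before path, checking it is graph reachability. A tiny hedged sketch, reusing the edges from the previous example; the adjacency matrix and DFS are our choices, not ISP's.

```c
/* Two events are co-enabled iff neither reaches the other along completes-
   before edges; a simple DFS over an adjacency matrix is enough to sketch it. */
#include <stdio.h>
#include <string.h>

#define N 5                            /* number of events in the CB graph */

static int reaches(const int cb[N][N], int from, int to, int seen[N]) {
    if (from == to) return 1;
    seen[from] = 1;
    for (int k = 0; k < N; k++)
        if (cb[from][k] && !seen[k] && reaches(cb, k, to, seen))
            return 1;
    return 0;
}

static int co_enabled(const int cb[N][N], int a, int b) {
    int seen[N];
    memset(seen, 0, sizeof seen);
    if (reaches(cb, a, b, seen)) return 0;
    memset(seen, 0, sizeof seen);
    return !reaches(cb, b, a, seen);
}

int main(void) {
    /* CB edges from the earlier sketch: 0->3, 0->4, 2->3, 2->4. */
    int cb[N][N] = {0};
    cb[0][3] = cb[0][4] = cb[2][3] = cb[2][4] = 1;
    printf("events 1 and 3 co-enabled? %d\n", co_enabled(cb, 1, 3)); /* 1 */
    printf("events 0 and 3 co-enabled? %d\n", co_enabled(cb, 0, 3)); /* 0 */
    return 0;
}
```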
>>: So how is this different from having a FIFO channel between any
two processes? So what you're giving is a nonoperational way to get to
this?
>> Ganesh Gopalakrishnan: Largely so, I would say. The implementation
is that of FIFO channels in a sense, yes. There are tag lanes which
allow you different priorities for that. But barring that, it's an
operational way of looking at a FIFO basis.
There's a FIFO -- the idea is that the blocking nature of a process, its
forward progress, is affected by the residual buffer space. So there's
a FIFO in the system that you can see for each process; the process
copies out into some buffer pool of the FIFO in the system. And
that FIFO can be arbitrarily small or large in the system. The real
brittleness comes because we need to verify in a controlled
environment and then we prepare to deploy in any environment.
So another feature of MPI which I'll come to, if I get time: you can
have deadlocks when the system buffer is not adequate, which is well
known. You are trying to push into a FIFO that's not there. MPI,
because of its combination of features, can add to that even if you add
buffer. This is a slack inelastic system, which is studied in some
circles, but there's another point of brittleness. Too much buffering
is also bad. So all this has to be modeled carefully. So it's
well known that in any language which uses channels and
nondeterministic commands you can add slack and get into trouble.
That's sort of easy to belabor. We have a good technique now to
enumerate all the slacks that are needed, a very efficient way to
enumerate all the slacks that we can model. So, again, long
story short, so that the message isn't lost: in the final algorithm that
we are in the act of writing out and getting ready, we do the
normal interleaving reduction and, nested with that, this slack
enumeration reduction; both phases are going on.
Okay. So back to where we are. That's why these are concurrent and
you can issue them.
So what we need is a good way to define all this. We need to define
an operational semantics that tells us what the full execution space of
MPI is. And then ideally it would be nice to have a specialization of
this operational semantics that tells us what the reduced execution is.
And that's easy to set up once we have these notions. So what we have
is four rules on the process side, which inject what we call
communication records into the runtime.
So whenever a process executes an MPI command, it makes an MPI
communication record which sits in the runtime. There are four rules:
process send, process receive, process wait, process barrier. And
then the runtime itself is a very active system that picks up the
communication records and does things with them, matches sends and
receives, finishes up a wait, and so on. There are five of them, the
runtime actions.
So we have a page -- this is the best it gets in this area at least.
We have two pages that sort of tell you everything about MPI as far as
buffering, nonbuffering, and progress go, through very precise
operational semantic rules.
And then you take these rules and execute them in a certain priority
order and that's our interleaving reduction. This is exactly what we
do for containing interleavings. And look what we do.
We run all the process actions that inject communication records with
the highest priority. So whenever we can fire this
process-inject-communication-record rule, we allow the process to inject
the communication record and keep moving forward, not blocking the process.
At some point all the process actions are finished. Now the runtime
has to do certain things. We constrain the runtime to do only
deterministic actions. Delay the nondeterministic actions until you
cannot.
At the very last gasp we allow the nondeterministic receive match, which
this rule shows. So this skewed priority execution which we started
with is our reduction. We can show that all this -- this is obeying
the classical [inaudible] conditions, or persistent set conditions, and
we get a sound reduction as a result of this.
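A very schematic sketch of that skewed-priority loop follows, with toy counters standing in for the runtime's communication records; the names and structure here are assumptions for illustration, not ISP's implementation. Process issue steps go first, then deterministic matches, and the wild card matches are delayed to the very end, where each choice becomes a separate interleaving to replay.

```c
/* Schematic sketch of the priority-ordered ("skewed") execution described in
   the talk. Toy structures only: "issue" steps are always taken first, then
   deterministic send/receive matches, and only when nothing else remains are
   the wild-card matches enumerated, each one becoming a replay.             */
#include <stdio.h>

typedef enum { ISSUE, DET_MATCH, WILD_MATCH, NONE } StepKind;

/* Placeholders standing in for the scheduler's real queries and actions. */
static int pending_issues   = 6;  /* process actions still to be fired       */
static int det_matches      = 3;  /* fully determined send/receive matches   */
static int wildcard_senders = 2;  /* senders that could match a wild card    */

static StepKind next_step(void) {
    if (pending_issues   > 0) return ISSUE;       /* highest priority        */
    if (det_matches      > 0) return DET_MATCH;   /* deterministic runtime   */
    if (wildcard_senders > 0) return WILD_MATCH;  /* delayed to the last     */
    return NONE;
}

int main(void) {
    for (;;) {
        switch (next_step()) {
        case ISSUE:     pending_issues--; puts("fire a process action");      break;
        case DET_MATCH: det_matches--;    puts("fire a deterministic match"); break;
        case WILD_MATCH:
            /* Here ISP would rewrite the wild card to one concrete sender,
               explore it, and later replay the execution for each other one. */
            printf("explore %d wild-card choices, one per replay\n",
                   wildcard_senders);
            wildcard_senders = 0;
            break;
        case NONE:      return 0;
        }
    }
}
```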
And so now two more -- I don't know how much I'm doing with the time.
>>:
10.
>> Ganesh Gopalakrishnan: 10, 15 minutes. I'll be okay. In a slightly
elaborated version of the earlier example, we find these actions, the
barrier actions, are concurrent and can be the highest in the priority
order. So we pick them up and fire them. Once the barrier actions are
picked up and fired, then the highest in the priority order is this
receive, which selects, hey, I like this send. So this is the matching
of the send that occurs. This send selects the receive, and all these
higher priority rules fire.
And then they are completed. And then that allows this action to
complete, et cetera. And then -- so this is how the tool looks. The
tool itself is an MPI program plus an interposition layer. And the
executables are compiled and it's a push-button tool, which I'll show in
a second, with a scheduler that injects messages.
So here's -- this is an example that's enough if you keep the mental
progression of what's happening here. This is what we do. So given
this program, how does our tool work?
It picks a process, say this process, and fires its operation into the
runtime. And the runtime notes it and says, you're not a blocking call,
so give me your next. So then it comes here and says, fire that. And
then the system says, ha, you have a blocking call, you are at an
ordering point. Anything later completes after it. So switch.
And at this point it gets an operation. It hasn't issued the receives
and so on yet. Next. I haven't issued it. And then it comes here. Then
again it notices this is a barrier, so an ordering point, so switch over.
So then it sends the barrier. So at this point it notices that the
barrier is full. All the processes are at the barrier. There's no
completion order obligation with what occurred before the barrier.
So let the barrier go first. So the runtime actually sees things
carefully rearranged according to an allowed semantics. So the
reordering freedom the completes before relation permits is exploited by
our scheduler. Then the execution moves on. All of them are let go.
And now the scheduler comes back here.
This is the wait for that send. Now, we are building the completes
before graphs. So we need that information for scheduling. And then
that is an ordering point. Wait is a blocking call. So that's the
large idea.
So since I have some time constraints, I will tell you what goes on
next. So while you execute, you will discover the full set of senders
that can match this receive by delaying the nondeterministic receive.
That is what happens. At that point we say, aha, the two senders can
match it. Who shall go first? The dynamic rewriter comes in and says
let that send go first. So it replaces that star with P2 and then
fires that. And then you let that execution go and you will soon find
that there is now an orphaned receive, which could not receive from
two. Nobody is willing to send to this process; its send got matched
there.
So you have to sort of -- this is an execution of the operational
semantics as applied to this program. Okay. So why don't I show you
some actual, look and feel of what this whole thing is going to do.
So we have implemented this tool for several platforms. It runs under
Linux and Mac and Windows. And it can actually work for several MPI
libraries. If you take one of these programs you can see it under
Visual Studio, that's the source file. So all you do is run ISP.
And in this case it's a matrix multiplication program, a parallel matrix
multiply. And it's a concrete execution. So it even runs an execution
here that multiplies the matrices. When it runs through it, it finds
errors and reports any. And there are no deadlocks here.
Now you can say show me what you did. So the source analyzer window
can pop up and it will allow you to step through the interleavings.
And then you can say walk me through the interleavings of how the
communication is matched, and it pops up enough windows to show you all
the process ranks.
You can lock ranks, process ranks, and step through that. So you can
execute until you find issues.
The debug button is partially implemented. It's only two days old.
I won't press it until the end of the talk. But if I press the debug
button it's able to cut into the Visual Studio debugger at that point
and engage the Visual Studio debugger at that point.
So this is how it will go. But now you can actually obtain more
information and insights. You can say show me where message matching
occurred, another GUI. It tells you, for that execution, all the
send/receive matches that occurred.
Now you can say show me the completes before relation that you
executed this scheduler under. So you can say view intraCB, view
interCB, or view both. So at this point it will tell you the completes
before relation that it has computed for those nodes.
What this will tell the designer is that you can expect anything
that is not connected with an arrow here to have been executed in
either order. And so if you don't understand your bugs, you can deeply
introspect the completes before relation and understand what happens.
There are a gazillion bells and whistles. I won't show you all of them.
It can show the source code and so on. This is what happens. Now, if I
take -- open a project solution. So this is an optimized version which now
allows wild cards to be meaningfully exploited. I'll tell you how this
program works.
But this is a program with wild card receives. And whenever wild card
receives are introduced you have to track the nondeterminism. And it
actually finds that this execution produced several interleavings, maybe
18 interleavings, if I remember the number.
I don't want that. So 18 interleavings. And you can step through the
source analyzer, et cetera. So you might ask whether these many
interleavings are important for this program.
Well, potentially, yes: all the send/receive matches might supply
different data. And based on what data you supply, the computation
can branch differently. But in this particular case there is a higher
level symmetry going on, you have multiplied the matrix 18 times and we
can probably go after that.
So a lot of students have done a lot of fine engineering in this
project. It's a huge effort of a scale I've never done before in an
academic institution. We can actually support multiple platforms and
multiple libraries. And the best result we have: we can take a
14,000-line ParMETIS hypergraph partitioner and ISP checks it in a few
seconds on an ordinary laptop, with no nondeterminism manifested in the
execution.
What else can we do? We have taken codes and found deadlocks in them
and some of them are benign deadlocks. What we are after now are several
case studies, and several tutorials are also in the cards.
I'll give you an illustration of how a programmer might program in MPI
and use this tool. So suppose you want to multiply two matrices. One
algorithm is to broadcast matrix B to all the nodes and send the rows
to different processes, and the processes do their own row-times-matrix
multiply. These are answers computed in parallel. And you receive the
answers.
If there are more rows than the number of processes then you do it
again and again. So a program will start with initialize and declare the
number of processes you have.
Then there will be a phase which broadcasts the matrix. Broadcast is
sending the same array to all the slaves. And there will be a matching
broadcast, so the match is picked up in all the different process ranks.
Then there will be a send/receive pipeline where these rows are being
sent and being received. So this could be blocking send,
blocking receive, blocking send, blocking receive. In the next iteration
we'll make these nonblocking calls so that you can pump these sends and
come back, and before the next send is pumped you can put a wait and
then pump it again, so it becomes a software-pipelined loop.
Finally, when you receive the rows back, who do you send work to again?
Is it the first thing that got computed, or is it whoever completed
first?
And if you follow the whoever-completed-first model, you get
nondeterministic receive as a natural programming method here. You can
say the first winner gets the next work.
So that's how the code looks. So you can receive from the first
processor, back from the first processor, send it back to the first
processor or you can say receive from anybody and send more work to
that processor.
And then you can use a nonblocking operation, saying that you just
need to wait for the previous send to finish and then issue the next
send. So you can set up delayed sending like that.
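A hedged sketch of that master loop in C: a wild card receive identifies whichever worker finished first, and a nonblocking send with a wait before the buffer is reused gives the delayed, software-pipelined sending just described. The worker side, MPI_Init/Finalize, sizes, tags, and the termination protocol are omitted or assumed.

```c
/* Sketch of the master loop: receive a finished row from whichever worker
   completes first (wild-card receive), then ship that worker the next row
   with a nonblocking send, waiting on the previous send before reusing the
   request. Sizes, tags, and shutdown handling are illustrative assumptions. */
#include <mpi.h>

#define COLS 64

static void master_loop(int nrows, int nworkers, double rows[][COLS]) {
    double result[COLS];
    MPI_Request send_req = MPI_REQUEST_NULL;
    MPI_Status st;
    int next = 0;

    /* Prime the pipeline: one row to each worker (workers are ranks 1..). */
    for (int w = 1; w <= nworkers && next < nrows; w++, next++)
        MPI_Send(rows[next], COLS, MPI_DOUBLE, w, next, MPI_COMM_WORLD);

    for (int done = 0; done < nrows; done++) {
        /* Whoever completed first delivers a result row.                  */
        MPI_Recv(result, COLS, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (next < nrows) {
            /* Finish the previous nonblocking send before reusing it.     */
            MPI_Wait(&send_req, MPI_STATUS_IGNORE);
            MPI_Isend(rows[next], COLS, MPI_DOUBLE, st.MPI_SOURCE, next,
                      MPI_COMM_WORLD, &send_req);
            next++;
        }
    }
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    /* A real program would now send each worker a "no more work" message. */
}
```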
I did this Visual Studio demo again. In the few minutes I have, what
I'll give you is I'll tell you why system buffering is so important to
model. This could happen in any distributed system, any message passing
system with nondeterminism and other things. The idea is that when you
have a program with send wait, send wait like that, it's normally
expected that this wait operation is a blocking call. So there is a
completes before ordering that it puts on this send.
So notice that even though this send is sending to P1 and this one is
sending to P2 -- by that alone you might have said these things could
happen in different orders, could complete in different orders --
because of the wait call the completion is forced to occur in program order.
Okay. Now let's focus on this edge. What happens if I add buffering?
Suppose the system has a lot of buffering so that it sort of takes this
send's message immediately into its memory.
So when you have that happening, this wait command, whose obligation is
only to notice whether the send has copied out its memory, is like a
[inaudible]. This wait says I'm done; I have noticed that this send
has sent its memory away, sent its data away. This completes before
edge vanishes when the system has plenty of buffering.
So this is the buffering-induced completes before weakening. So just to
summarize: in this example the completes before respects program order
when the system has no buffering. The completes before does not respect
program order when the system has plenty of buffering. This has
consequences in an interesting situation which is
portrayed here where initially the completes before graph orders this
send with respect to this receive. So while this receive, a wild card
receive occurring in P2 that can take data from anybody, is posted, and
this send is sending to P2, these guys could not have a match because
there is a completes before edge in force.
Without buffering, only that one match is allowed; the nondeterminism
doesn't matter as long as this completes before edge is in place. But
if you add some buffering here, this completes before edge breaks,
and now there are two attacks on that wild card.
So this is the nonlocal effect: adding buffering to one send has a
butterfly effect on the matching of some receive.
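A hedged sketch of the pattern behind that butterfly effect (ranks, tags, and data are assumptions): with zero buffering, P0's first wait holds back its send to P2, so the wild card in P2 has only one candidate; with ample buffering the wait returns immediately and two sends race for the wild card.

```c
/* Sketch of buffering-dependent matching: P0 posts Isend to P1, waits, then
   Isend to P2, waits. P2 posts a wild-card receive targeted by both P0's
   second send and P1's send. With no system buffering, P0's first wait
   blocks until P1 receives, so P1's send reaches the wild card alone; with
   ample buffering the wait returns immediately and both sends can race for
   the wild card. Ranks, tags, and values are illustrative assumptions.     */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, x = 0, y = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Request r;

    if (rank == 0) {
        MPI_Isend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &r);
        MPI_Wait(&r, MPI_STATUS_IGNORE);   /* returns at once with buffering;
                                              blocks until P1 receives without */
        MPI_Isend(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &r);
        MPI_Wait(&r, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Send(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);       /* races for the
                                                                  wild card   */
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
        MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* wild card   */
        MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```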
The thing is it's not as scary as it looks. We have handled this. So
the naive approach would be to take every send, simulate zero buffering
and ample buffering, and run all the exponential cases. And that's not
what we do.
What we do is we run one interleaving without any buffering. We build
the completes before graph on the side, analyze that graph, and find an
optimal allocation of buffering for the waits and sends, and then
replay. We continue like this until enough places have
been buffered to make this kind of deadlock show up.
So the zero-buffer execution initially is to evoke the head-to-head
deadlocks and the full-buffer execution is to evoke the too-much-buffering
deadlocks. And that is being written up.
Well, I think I dumped a lot of information on you. Hopefully you get
a sense of where this is going. So work continues -- the issue is that
we are wedded to dynamic verification just because it gets into where
designers like to be. And already we have a payoff. There's another
multi-core API called MCAPI for hand-held phones or PDAs; it's a
lightweight multi-core message passing API that's being standardized.
We're able to apply many of our techniques there, and a tool is under
construction.
And if you look at the plethora of tools available in this area, there
are the CHESS tool and the MODIST effort, and we have an effort for
thread programming verification, and we might branch into MCAPI
verification. So the idea is that we need to -- I don't know where to
publish all this work. That's a slight worry. Model checking is a
thriving area. But the engineering of these dynamic verifiers should
get a good outlet. It would be nice to have a meeting in [inaudible].
So engineering dynamic analyzers can be greatly facilitated if the API
designers are mindful of the constraints of building a dynamic
verification system. And there are several opportunities for user
interface design that are only beginning to be exploited.
So, okay, so I went maybe slightly over time. But thank you. This was
a pleasure to give the talk.
[applause]
Questions? [inaudible] is here. She would say, after Robert, she just
took over the API work and brought us into the MPI land.
>>:
What is the feedback you got from the MPI community?
>> Ganesh Gopalakrishnan: They like it. This is the best
part of our whole experience. So when we started verifying simple MPI
locking algorithms, we were modeling the problem, showing 20-year-old
kind of verification technology in action. Found some bugs. Won the
best paper award for the first paper. This is good.
We were working with engineers at Argonne and they're co-authoring many
papers. And that community is welcoming of this research, and PPoPP,
Principles and Practice of Parallel Programming, is a community that's
liking the applied nature of this work. I'm sure we've got a [inaudible] paper
out of it, which is good. It will be good to inform more communities
on both sides, yes.
The usability of this tool -- well, we have textbook examples worked
out as tutorials. So hopefully we'll influence the pedagogy
[phonetic].
But barring that, the highest, largest applications of MPI are
happening in situations where they have 16,000 processes, huge
datasets. And there they're running into bugs that are in the MPI
library and not in the user program.
So we need to find ways to get there.
Yes?
>> Now that you've done this, are there ways that if you were designing
MPI you would design it differently?
>> Ganesh Gopalakrishnan:
Well, you want to say something?
>>: I think there is a reason that it's designed that way because it's
for more performance critical applications. I think they have
[inaudible], so I don't think it would -- if I wanted to make
verification easier, I would have made barriers the stronger points.
That's the reason why they have done that.
>> Ganesh Gopalakrishnan: It's well thought out to an extent. They're
trying to think through deprecating the canceling of a message that has
been sent, and so on. But it's a large effort.
Our guys told me that there are MPI applications whose code is so
mature, getting six-digit precision on [inaudible] models and so on,
such that no one will really touch those codes. So longevity is
important. The
thing is MPI is spreading --
>>:
[Inaudible] bugs in the code.
>> Ganesh Gopalakrishnan: No, but people still try to port it to new
platforms. And then all the schedules that never showed up show up,
which throws them into big confusion land.
They're worried about message copying, about using up memory by creating
copies. So they're trying to marry threading into MPI. That's where I
would make a change. Because now if you have threading, threads trying
to call MPI, there's no notion of thread rank. So two different
threads could be making MPI calls. And, oops, the wrong thread picks
up the receive. This can lead to such brittle code that, if you look at
the code on the page, you won't know what's going on. But people are
trying to get some traction there.
The tool's available. So I would like to encourage -- this is my
serious effort in understanding MPI and I did several versions of this
matrix multiply. It's a lot of fun.
One student who I have hired recently is trying to measure the
performance of this code on large clusters and so on. And he's even
pushing the raw performance of ISP to large matrices and so on. So I
don't know whether that needs to be shown. But I have his write-up.
Thank you.
[applause]