>> Madan Musuvathi: Hi everyone. I'm Madan Musuvathi, a researcher in the Research in Software Engineering group. It's my pleasure to introduce Ganesh Gopalakrishnan today. Ganesh is a professor at the University of Utah, where he's been for the past 12, 13 years. He has a broad range of interests, and he's worked on verification, cache [inaudible] protocols, memory models, and whatnot. So today he's going to be talking about making ISP, which is a verification tool for MPI programs, practical. Ganesh. >> Ganesh Gopalakrishnan: Thank you, Madan. Good afternoon, everyone. Thanks for inviting me. It's a great pleasure to present the work of my two PhD students, one former and one current. They did pretty much all of the initial work in this area and are continuing it. What I would like to take you through today is the rationale, the big picture of why we started verification research in this domain, which was not very familiar to us; the approach we have taken for modeling the problems in this domain; and then the actual tool we have running. I'll do a tool demo for you as well. And then we'll take you through a miniature development process of an MPI program and how our tool might help there. So four, five, six years ago we were looking for application domains where software verification could be meaningfully applied, which meant that we needed an abundant supply of case studies we could run our techniques on. By some sheer coincidence or luck, one of the researchers in our area, George, suggested we look at high performance computing software. So we took his suggestion. The big picture of what we have done since then has been trying to understand the model of concurrency that exists there. Every feature in this world seems to be well meant and well thought out, based on real needs. And yet, when all these features add up, they give us an opportunity to explore a flavor of concurrency, a combination of features, that is chock full of problems. So I'll take you through how we have managed to step through some of those problems and solve them using formal techniques. One could look at classical models such as linearizability. But if you really go back and study fundamental models like that, there are assumptions made about the interface that each thread sees, meaning each thread has one outstanding method at a given time, and so on. Yet one can look at modern programming research and see work on asynchronous method calls, where methods are detached, they interact through shared memory, and the main computation moves on. Now, a lot of APIs are being proposed for multi-cores going into the future, and there might even be an alphabet soup of APIs that an application has to deal with. The general sense we are getting is that one needs to be looking at different models, where we might have asynchronous method calls like those found in other systems, but now these asynchronous method calls interact through message passing, not through shared memory. So this is a new twist. And then there are multiple other features: nondeterminism in terms of when receives can actually take their sent values; global operations like barriers that can regulate the progress of all the processes; and resource availability, or the lack thereof, affecting forward progress. That is another feature. There are oodles of libraries, and they have various progress properties.
So it's an interesting mix of features that we are confronted with. The hope was for us to deliver a practical dynamic verifier in this context. The clear temptation is to go back to a pristine area of computer science and not do anything here. But somehow it felt good to push on and see how far we could go confronted with these problems. So that will be the general feel for the story. The initial work in this area was laid down by some of my students. We started by understanding this rather large API and trying to understand each of its features. We even wrote a formal semantics for MPI and made it executable from a C front end and so on, to gain some understanding. But we actually felt that we cannot rely on formal semantics built just on an interface study; we need to get into the semantics and understand the real exploitable features of the API. So that's where the story is going. Since I didn't know how many people here knew about MPI, I have a standard set of slides. Basically, MPI is widely used for cluster computing at all scales; it's the lingua franca of parallel computing. It's called from multiple languages, and multiple processes execute, each computing part of the solution, and together they solve some real world problem, like explosion modeling or physics. The execution of MPI programs happens on cluster machines of all sizes and scales. And very mature MPI codes exist, so much so that people are afraid to rewrite any of these codes using modern notation. So the longevity of the standard is well assured, although it has to evolve and adapt to thread models and other models. Clearly this seems to be a very worthwhile area for us to be devoting significant years and student research time to. Here is a very simple illustration of what an MPI program can do. One can look at how to integrate the area under a curve, whereby one spawns multiple processes. In this illustration, the processes integrate under the different subregions and then send their integrals, using a send command, to a receive command posted by the master process, which adds up the integrals. So that's the general feel for how an MPI program is set up. But as soon as you write a program like this, you can see -- oops, we're not in the pristine functional, sequential world anymore. There are dynamic communications happening, and these communications name the processes to which each of the messages is to be sent. So here a worker is trying to send to process 0. But the master starts its receive iteration from 0, so it's trying to receive from source 0 -- and no worker has rank 0; the worker ranks start from 1. So there is a mismatch in communication expectations: the message is not coming from 0, so there's a deadlock. So even in a simple program of a few lines one can introduce a deadlock. That is the story. Okay. So how about taming the complexity by creating a simple core API of 10 or 20 functions, using only that, and implementing all the other functions in terms of these entities? That would be the computer science approach. It turns out the large size of MPI is partly explained by the wide deployment base it has and the number of applications it caters to. It offers a very broad interface, of which many applications use a tenth of the functions -- and it's a different tenth for different applications. So it seems inevitable that we look at the large space. Okay.
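Roughly, that example corresponds to a program like the following -- a hypothetical reconstruction, with the integration itself stubbed out, showing how a receive loop that starts at source 0 deadlocks because the worker ranks start at 1:

```c
/* Hypothetical reconstruction of the integration example: rank 0 is the
 * master, ranks 1..size-1 each integrate one subregion and send their
 * partial result to rank 0.  The receive loop below starts at source 0,
 * but no worker has rank 0, so the first MPI_Recv never matches and the
 * program deadlocks -- the off-by-one bug described above. */
#include <mpi.h>
#include <stdio.h>

static double integrate_subregion(int which) { return 0.1 * which; } /* stub */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        double total = 0.0, part;
        /* BUG: should be  for (int src = 1; src < size; src++)  */
        for (int src = 0; src < size - 1; src++) {
            MPI_Recv(&part, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += part;
        }
        printf("integral = %f\n", total);
    } else {
        double part = integrate_subregion(rank);
        MPI_Send(&part, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```

Starting the loop at source 1, or receiving from MPI_ANY_SOURCE, removes the mismatch.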
>>: What do you mean by out-of-order completion? >> Ganesh Gopalakrishnan: Out-of-order completion. It will be introduced in a few seconds, but the idea is that when MPI calls are issued in program order, they need not finish in program order. The fundamental guarantee that MPI makes is point-to-point FIFO message ordering, or non-overtaking. So if you're sending messages to different destinations, you can afford to finish a later send, which might be sending a smaller message, ahead of an earlier send. This is a real aspect of these deployed asynchronous communications that we need to model and handle. It is a fundamental aspect of our work and it will show up in many places. So, like Madan was asking, it's not just the nonblocking operations but the out-of-order finishing of the operations that we need to be modeling. And in this space we have to make progress by pinning down some bug classes, and that's what we did. We looked at being able to find deadlocks, such as might happen when you send to two different processes and then try to receive; here the deadlock might happen due to buffering not being available. Or misplaced global synchronization operations like barriers. Or here's a feature where a nondeterministic receive is offered. It can be matched by either this send or this send, which are both sending to process P0; the data itself is not shown here. So one possible scenario is that this send matches the wild card receive while the other send matches the receive expecting a message from P1. That's all going fine. But on a different cluster machine, if you ported this program, what you might find is that, because of relative speeds, the other send gets matched early, in which case a deadlock can happen when the remaining send is attempted to be matched, because it's not coming from P1. So this is called a communication race, even though the receive was written as nondeterministic. And you might ask why there's nondeterminism in receives at all; I'll show you in an actual case study later that nondeterminism in receives is a meaningful feature which we can actually use. We are also interested in finding resource leaks: many objects are allocated in an MPI program and never deallocated. So that's another thing. And we are looking at local assertions placed in the program. We would like to run the actual code, because designers finally get value when they can run real applications through a debugger which offers good coverage guarantees. Testing methods are notorious here: there are one or two page codes in MPI where you can test the program many times and never find the bad schedule that causes a deadlock; it's very easy to show that. Static analysis methods are not very widely usable here, just because of the concurrency nuances. And the cost of sorting out a false alarm can be, as reported in a paper by Nagappan [phonetic] and Patrice Godefroid at a conference last year, several months of designer time, if you don't want designers to be burdened by false alarms. Static analysis can report bugs in regions that cannot even be reached. Model-based verification can be used if finite state models are creatable, but for real deployed MPI programs nobody will give you a pristine finite state machine model that can be separately verified. So we're left with nothing but the possibility of dynamic verification, which is to actually instrument the program and replay the different schedules, and hopefully we can contain the schedule growth. Okay.
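The wild card race just described can be pictured with a three-process sketch like this one -- a hypothetical reconstruction, not the exact slide code:

```c
/* P0 posts a wild card receive followed by a receive that insists on a
 * message from P1; P1 and P2 each send one message to P0.  If P2's send
 * matches the wild card, everything completes.  If P1's send matches the
 * wild card, the second receive waits forever for a message from P1 that
 * will never come -- the communication race described above. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, buf0, buf1, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&buf0, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                 /* P1 or P2 may match */
        MPI_Recv(&buf1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                 /* only P1 will do    */
    } else if (rank == 1 || rank == 2) {
        MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```

Which matching happens depends on relative speeds and the MPI runtime, which is exactly why plain testing can run this many times and never hit the deadlocking schedule.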
So when you embark on verification there are a number of exponential spaces waiting to kill you: exponentials in the data space, in the symmetry space, in the interleaving space. The message here is that it's reasonably clear in many cases how to downscale a model so that you run with a smaller number of processes and a smaller number of data items. But it is less clear how interleavings can be similarly shrunk, because each interleaving seems to be its own beast. So we need the tool to automatically figure out which interleavings, which schedules, matter. That seems to be one place where human intuition is less reliable. In a large execution -- even with five threads and five steps in each thread -- one can do some simple math and find that the number of schedules, of interleavings, can be more than 10 billion if all these actions are permutable equally with the other actions. Whereas if you track dependencies among the actions, we find that by just replaying the dependent actions in multiple orders we can cover the kinds of bugs we are going after. This is the intuition present in several techniques; the pioneering work in this area on dynamic partial order reduction was done by Flanagan and Godefroid at POPL several years ago. That's the kind of approach we were motivated by, and we wanted to go after such a technique. We also wanted to keep in mind the kinds of dependencies that arise in MPI programs as opposed to those that arise in thread programs. In thread programs, dependencies are fairly pervasive, in the sense that almost all interactions happen through shared memory, so the number of schedules you can play out can be very large. But the interesting thing about MPI programs is that for very large MPI programs you can sometimes play out just one interleaving and be done; there is no dependency. Sometimes these wild card nondeterministic constructs are offered but do not materialize during execution, in which case we don't need to play out more than one interleaving. And when interleavings do arise, they come in bursts, and all the processes may be involved in that nondeterministic attack on a receive, many sends attacking one wild card. So with this kind of structure in place, we couldn't really go after preemption bounding or heuristic searches like that. We wanted to be more exact, and that's what we were after. >>: Did you also model out-of-order execution? >> Ganesh Gopalakrishnan: Absolutely. >>: So in that case you cannot just run the program; you have to reorder the program? >> Ganesh Gopalakrishnan: Our scheduler has delaying techniques. Our scheduler captures operations before they go into the MPI runtime; it exploits the latitude offered by the API semantics in being able to execute certain commands out of order, and sends the commands on only when they're ready to go. That is coming in a second. So the out-of-order execution semantics is a mystery in MPI and causes surprises to people. But knowing the API's latitude and being able to reorder, we can exploit that and show that such behaviors will not cause problems. That's the idea. So in fact this simple example is going to tell us a lot about what we need to do, even for a simple three-line, three-process example. For instance, here is a simple example with a nonblocking send, MPI_Isend, to process P2. I'm not showing the data. And then a barrier. A barrier is sort of like a point that all processes have to arrive at before anyone can cross.
Here's a barrier, and here's a send to another process. And here's a nondeterministic receive from any process, followed by the barrier. So what can happen in this kind of simple program? One might ask: okay, there's a barrier here, so can this send ever match this receive, given that this send is written after the barrier and this receive is before the barrier? In a normal shared memory sense, such a match cannot occur. But in MPI there are delayed completions. The time line of execution is such that this nonblocking send is issued and detached, and the computation goes on; this receive is issued and detached, and the computation goes on. Now all processes have arrived at the barrier, so they can cross. And while these two are still alive, this send also becomes alive, and now a match can occur. So this is an aspect of combining nonblocking semantics with a barrier that does not really require the operations to finish. And again, this is all efficiency-minded, because the data shipped with these operations may be several megabytes, and you don't want to force when the data is shipped. So the weak guarantees offered by MPI are such that these behaviors have to be permitted for higher performance. Now, if you look at this behavior and try to do execution replay -- say, let us issue P0's send and P2's receive first, make that behavior go through, verify it, unwind the schedule, and then issue this other send and this receive alone and go forward -- that's not going to work here. >>: So with these semantics, the barrier has no fence semantics? >> Ganesh Gopalakrishnan: The barrier has the semantics of sometimes being able to prevent certain communications from crossing it. What a barrier is trying to do is erect a constraint going forward: any operation issued after the barrier cannot begin until the barrier has been crossed. So it's like a weak fence in that sense. >>: There's more than one kind of send and receive, too. There's the synchronous kind. >> Ganesh Gopalakrishnan: Yes, yes, there are several modes of send. Here we're just looking at the nonblocking kind, and we're trying to canonicalize to this send, which can be detached. So the barrier is a mild time alignment device, and of course we have an algorithm in the tool, which I'll show later, that finds functionally irrelevant barriers: we can find out whether the barrier is creating any orderings beyond what's already present, and in some cases it doesn't. >>: So without the barrier -- you had some other send after the barrier. Could that send actually have happened, could it be issued before the first send in P0? >> Ganesh Gopalakrishnan: Issuing happens in program order. But, yeah, if this send happens to go to an unrelated process, even this later send can finish before the earlier one. But I'll come to the details in a second. These are like split transactions, split operations -- a known paradigm. The usual completion operation of a nonblocking send is a wait, which I haven't shown here. So when a wait successfully unblocks, that's when the send can be deemed to have finished. If you put the wait associated with this send right here, then that makes the barrier a meaningful entity. This illustration shows that the fluidity the API offers prevents certain tricks of common replay debugging from working. We cannot just issue this piecemeal and that piecemeal and hope for the right thing to happen, the reason being that we cannot get to the send after the barrier unless the earlier send has already been issued as well, because of the barrier.
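A minimal sketch of that slide's program, assuming exactly three processes (mpiexec -n 3); the point is that neither nonblocking call has to complete before the barrier, so both sends are alive afterwards and either can match the wild card:

```c
/* P0: nonblocking send to P2, then barrier, then wait.
 * P1: barrier, then send to P2.
 * P2: nonblocking wild card receive, then barrier, then wait.
 * Because the nonblocking operations need not complete before the barrier,
 * both sends can attack the wild card receive -- even the one written
 * after the barrier.  Whichever send loses is left unmatched, and on a
 * system with little buffering that process can block forever. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, out = 7, in;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req); /* detached */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* send completes only here */
    } else if (rank == 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Send(&out, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {
        MPI_Irecv(&in, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* matched by P0's or P1's send */
    }
    MPI_Finalize();
    return 0;
}
```

Whichever send loses the race is left unmatched, and on a platform with little buffering the corresponding process can block in its send or wait -- exactly the schedule- and platform-dependent behavior the tool has to explore.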
And when both sends are present in the system, the MPI runtime can pick whichever send it likes and finish. We have no control over that choice inside the MPI runtime. We need some other technique to match this send alone, or that send alone. So what we really do is this: we know that there is no completion obligation between the send and the barrier. So our system will actually issue the barrier first and delay issuing the send; we collect the sends and keep them back. We're trying to discover the maximal set of sends that can attack the receive, and that's why we delay things until they cannot be delayed any further. Once we know that, we figure out that these two sends can match. Then we dynamically rewrite instructions: this receive-from-any is rewritten into a receive from P0, and issuing that forces that matching. Then you can unwind the execution -- this is a stateless search -- and, coming back around, we can rewrite it into a receive from P1 and issue that. So this, in a nutshell, is what we do. We start the execution. If the operation is not a blocking call, the scheduler simply collects the nonblocking call and lets the execution continue until the process comes to an ordering point; then it switches to another process. Once it has delayed until it cannot delay any further, it knows the full extent of the nondeterminism, computes all the possible choices, and then replays. So this is how it goes. But let's try to get a formal understanding; this is all very operational, so we'll go to a precise model. The main correctness guarantee that MPI gives is very weak. It's non-overtaking: between two endpoints, when you issue two messages from A to B, they arrive in the same order. Beyond that, there's not much of a guarantee. So if you exploit that understanding and set up a suitable ordering that has to be satisfied, that gives you a good understanding of all the schedules that are permitted among the MPI calls. It's also assumed that the C code you write between two MPI calls does not exploit the particular order of completion of the MPI calls, because the MPI calls, even though they appear in program order, may not complete in that order. So you are making some assumptions about the program not relying on the particular ordering being observed. But that is unavoidable. So let us understand a little bit. I'll just look at four operations. To make life simple, let S stand for a nonblocking send, let W stand for the wait operation for that send, let R stand for a nonblocking receive -- W will also be the wait for the receive -- and let B stand for barrier. Then this is how the time line of a send or receive tends to look. You initiate the send from the API and the call returns. But then there are further events associated with the send and receive that are still happening. These other events are that the send or receive finally finds a match with a complementary receive or send, and then there is a complete event, a completion event, which says: if I am a send, I have drained my memory into the system memory, so I have completed my memory obligations as far as the send goes. The wait and the barrier -- I'll go through them in some more detail -- are blocking operations, which means the API calls go in, and when the call comes back, the full effect of the wait or barrier has occurred. So send -- this is going to get a little bit interesting, but let's go through it, and we'll soon abstract all this detail.
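The interposition and dynamic rewriting can be pictured with MPI's standard profiling (PMPI) interface, which lets a tool define the MPI entry points itself and forward to the real library. This is only a minimal sketch of the idea, not ISP's actual code; ask_scheduler_for_source is a made-up stand-in for the tool's centralized scheduler:

```c
/* Sketch of a PMPI interposition layer that rewrites a wild card receive:
 * the wrapper asks a (hypothetical) scheduler which sender should win in
 * the current interleaving and forwards a rewritten, deterministic call
 * to the underlying MPI library via PMPI_Recv. */
#include <mpi.h>

/* Stand-in for the verifier's scheduler query: a real tool would block
 * here, consult its scheduler, and return the rank chosen for this
 * interleaving. */
static int ask_scheduler_for_source(int tag, MPI_Comm comm) {
    (void)tag; (void)comm;
    return 0;                     /* placeholder choice */
}

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status) {
    if (source == MPI_ANY_SOURCE) {
        /* Replace the wild card with one specific rank, forcing this
         * particular match to be the one explored on this run. */
        source = ask_scheduler_for_source(tag, comm);
    }
    return PMPI_Recv(buf, count, datatype, source, tag, comm, status);
}
```

Linking such a wrapper ahead of the MPI library is the standard PMPI trick; the scheduling, delaying, and replay logic around it is where the real work is.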
A send on a system that provides ample buffering -- where the system has enough buffer pool allocated -- goes like this: the issue call returns, and immediately the system has drained your memory, so your memory is usable again. That's when the send is deemed to have completed as far as its memory effects go. But the send may not have found a match; the match may occur later. Now if you run the same send on a system which is constrained -- the runtime doesn't have much memory -- then you issue the send, the call comes back, and the execution blocks, because the send has to find a matching receive to drain the memory as a process-to-process copy, not the process-to-system-to-process copy that was permitted in the earlier case. So the picture might look like this. The general effect is that when you issue a send, you supply a handle, and then the wait call uses that handle. When you nest the time lines of the send and the wait, the send call goes in and returns, the wait call happens soon thereafter, and at some point the send matches, and then both complete. So all of these events exist and are good indications of what goes on. But we don't know how to exploit all these events; we can't even observe them. So what we have created is a simple model called the completes-before model. A completes-before relation can be defined, as opposed to the program order relation, and the completes-before relation is a weakening of program order. That's the way in which we're going to set up the computation of an MPI program. So here's how the intra-completes-before relation -- the completes-before obligation within a process -- can be modeled. Suppose you have two sends that are sending to the same target. Then they are ordered by intra-completes-before according to program order, because the non-overtaking guarantee is what we need to respect. They're shooting at the same target, so we cannot allow the memory effects of the first send to appear to have completed after the memory effects of the second one; they have to appear to have completed in program order. This is how intra-completes-before respects non-overtaking. If you send to different targets, then there is no such ordering; this is where the intra-completes-before ordering can be weakened. A good example is a send of one gigabyte to P1 followed by a send of one byte to P2; an efficient system is allowed to let the second one appear to take effect earlier. Receives have a similar non-overtaking guarantee on the pull side: receives that pull from the same source have to obey non-overtaking, and if they pull from different sources, you're not going to violate non-overtaking by allowing the completion orders to differ. I won't go into the other details, but basically what we have done is define the completes-before ordering for the events that have been initiated in an execution. And once we model that, we can do analysis on where we can delay things and where we cannot. If I appear to be hurrying, please stop me. But I'm hoping to come to a point where I have given a feel for how we have modeled these very small programs; then I'll give a slightly larger program and its animation in terms of the execution, so that at that point hopefully you can bridge the understanding. Then I'll actually run a tool demo and show you other details.
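As a concrete illustration of these intra-completes-before rules (a sketch, assuming standard-mode nonblocking sends):

```c
/* Two sends to the SAME destination must appear to complete in program
 * order (non-overtaking); a send to a DIFFERENT destination carries no
 * such obligation and may complete earlier, however large the earlier
 * message is. */
#include <mpi.h>

void sender(int *big, int big_count, int *small) {
    MPI_Request r1, r2, r3;

    /* Same destination (rank 1): completes-before edge r1 -> r2. */
    MPI_Isend(big,   big_count, MPI_INT, 1, 0, MPI_COMM_WORLD, &r1);
    MPI_Isend(small, 1,         MPI_INT, 1, 0, MPI_COMM_WORLD, &r2);

    /* Different destination (rank 2): no completes-before edge from r1,
     * so this one-element send may complete before the large one. */
    MPI_Isend(small, 1,         MPI_INT, 2, 0, MPI_COMM_WORLD, &r3);

    MPI_Wait(&r1, MPI_STATUS_IGNORE);
    MPI_Wait(&r2, MPI_STATUS_IGNORE);
    MPI_Wait(&r3, MPI_STATUS_IGNORE);
}
```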
So let's now go back to the original program that baffled us: a send to P2 and then a barrier; a barrier and then a send to P2; and a receive from any and then a barrier. Why did that program not execute the way we thought it should? If you plot the completes-before graph for this program -- build the completes-before relation -- there's no completes-before obligation between the send and the barrier, but there is an obligation from the barrier to every operation that follows it. So that is the situation, and clearly at this point we can see that these two sends are concurrent with this receive. So completes-before is going to be our handle for defining concurrency: when can things be co-enabled. The notion of co-enabled is taken for granted in a standard serial, sequentially consistent kind of execution system, whereas we need a better way to model co-enabledness in our system. Long story short, there's going to be intra-completes-before and inter-completes-before, and we build the completes-before graph while we execute forward. Whenever we discover that two events are not connected by a completes-before path, they are concurrent, and you need to be aware of that. Our tool actually builds this understanding, and it will show in the GUI how the completes-before graph looks, so the scheduler knows about it. So that's what it is: two actions are co-enabled if and only if there is no completes-before path between them. >>: So how is this different from having a FIFO channel between any two processes? So what you're giving is a non-operational way to get to this? >> Ganesh Gopalakrishnan: Largely so, I would say. The implementation is that of FIFO channels, in a sense, yes. There are tag lanes which give you different priorities within that. But barring that, it's a way of looking at things on a FIFO basis. The idea is that blocking -- a process's forward progress -- is affected by the residual buffer space. So for each process there is a FIFO into the system, which copies out into some buffer pool of the system, and that FIFO can be arbitrarily small or large in a given system. The real brittleness comes because we verify in one controlled environment and then we are expected to deploy in any environment. So another feature of MPI, which I'll come to if I get time: you can have deadlocks when the system buffer is not adequate, which is well known -- trying to push into a FIFO slot that isn't there. But MPI, because of its combination of features, can also deadlock when you add buffering. It is a slack-inelastic system, a notion studied in some circles, so there's another point of brittleness: too much buffering is also bad. So all this has to be modeled carefully. It's well known that in any language which uses channels and nondeterministic commands you can add slack and get into trouble, so I won't belabor it. We now have a good technique to enumerate all the slack settings that are needed, a very efficient way to enumerate all the slacks we have to model. So, again, long story short, so that the message isn't lost: the final algorithm that we are in the act of writing up does the normal interleaving reduction and, interleaved with it, this slack enumeration reduction; both phases are going on. Okay. So back to where we were: that's why these are concurrent and you can issue them in either order. So what we need is a good way to define all this.
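Operationally, the co-enabled test is just reachability over the completes-before edges; here is a minimal sketch with made-up data structures:

```c
/* Two events are co-enabled iff neither reaches the other along
 * completes-before edges. */
#include <stdbool.h>

#define MAX_EVENTS 256

typedef struct {
    int  n;                               /* number of events                 */
    bool edge[MAX_EVENTS][MAX_EVENTS];    /* edge[a][b]: a completes before b */
} CBGraph;

/* Depth-first search: is there a completes-before path from a to b? */
static bool reaches(const CBGraph *g, int a, int b, bool *visited) {
    if (a == b) return true;
    visited[a] = true;
    for (int next = 0; next < g->n; next++)
        if (g->edge[a][next] && !visited[next] && reaches(g, next, b, visited))
            return true;
    return false;
}

bool co_enabled(const CBGraph *g, int a, int b) {
    bool v1[MAX_EVENTS] = { false };
    bool v2[MAX_EVENTS] = { false };
    return !reaches(g, a, b, v1) && !reaches(g, b, a, v2);
}
```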
We need to define some operational semantics that tells us what the full execution space of MPI is. And then, ideally, it would be nice to have a specialization of that operational semantics that tells us what the reduced execution space is. That's easy to set up once we have these notions. So what we have is four rules on the process side, which inject what we call communication records into the runtime: whenever a process executes an MPI command, it makes a communication record which sits in the runtime. There are four rules: process send, process receive, process wait, process barrier. And then the runtime itself is a very active system that picks up the communication records and does things with them -- matches sends and receives, finishes up a wait, and so on. There are five of them, runtime actions. So we have two pages -- this is about the best it gets in this area, at least -- two pages that tell you everything about MPI as far as buffering, non-buffering and progress go, through very precise operational semantic rules. Then you take these rules and execute them in a certain priority order, and that is our interleaving reduction. This is exactly what we do for containing interleavings. And look what we do: we run all the process actions that inject communication records at the highest priority. Whenever we can do a process-inject-communication-record step, we allow the process to inject the communication record and keep moving forward, not blocking the process. At some point all the process actions are finished. Now the runtime has to do certain things. We constrain the runtime to do only deterministic actions, and delay the nondeterministic actions until we cannot any longer. At the very last gasp we allow the nondeterministic receive match, which this rule shows. So this skewed priority execution gives us our reduction. We can show that this obeys the classical [inaudible] conditions, the persistent set conditions, and we get a sound reduction as a result. And so now, two more things -- I don't know how I'm doing on time. >>: 10. >> Ganesh Gopalakrishnan: 10, 15 minutes. I'll be okay. In a slightly elaborated version of the earlier example, we find that these barrier actions are concurrent and are the highest in the priority order, so we pick them up and fire them. Once the barrier actions are picked up and fired, the highest in the priority order is this receive, which says, hey, I like this send. So this matching of the send occurs: the send and the receive are selected, all these higher priority rules fire, and then they are completed. And that allows this action to complete, et cetera. And this is how the tool looks. The tool itself is your MPI program plus an interposition layer. The executables are compiled, and it's a push-button tool, which I'll show in a second, with a scheduler that injects the messages. So here's an example; it's enough if you keep the mental progression of what's happening here in mind. Given this program, how does our tool work? It picks a process -- any process is a starting point -- and says, give me your next operation. The process files its operation with the scheduler; the scheduler notes it and says, you're not a blocking call, so give me your next. Then it comes here and files that. Now you have a blocking call, you're at an ordering point, so switch. At this point the scheduler has collected these operations; it hasn't issued the receives and so on into the runtime yet.
Next -- I haven't issued it yet. Then it comes here, and again notices this is a barrier, an ordering point, so switch over. Then it collects the barrier from the next process. At this point it notices that the barrier is full: all the processes are at the barrier, and there is no completion-order obligation with anything that occurred before the barrier. So let the barrier go first. So the runtime actually sees things carefully rearranged according to the allowed semantics; the reordering freedom in completes-before is exploited by our scheduler. Then the execution moves on, all of them are let go, and the scheduler comes back here. This is the wait for that send. Now, we are building the completes-before graph as we go, because we need that information for scheduling, and the wait is an ordering point, a blocking call. So that's the broad idea. Since I have some time constraints, I will tell you what goes on next. While you execute, you will discover the full set of senders that can match this receive, by delaying the nondeterministic receive; that is what happened. At that point we say, aha, two senders can match it. Who shall go first? The dynamic rewriter comes in and says, let that send go first. So it replaces that star with P2 and then fires it. Then let that execution go, and you will soon find a problem: an orphaned receive, a receive from P2 that cannot be matched -- nobody is willing to send to this process, because that send got matched elsewhere. So this is an execution of the operational semantics as applied to this program. Okay. So why don't I show you the actual look and feel of what this whole thing does. We have implemented this tool for several platforms. It runs under Linux and Mac and Windows, and it can work with several MPI libraries. If you take one of these programs, you can see it under Visual Studio; that's the source file. All you do is run ISP. In this case it's a matrix multiplication program, a parallel matrix multiply. And it's concrete execution, so it has actually multiplied the matrices here. When it runs through, it finds errors and reports them, if any; no deadlocks here. Now you can say, show me what you did. The source analyzer window pops up and allows you to step through the interleavings. Then you can say, walk me through how the communication was matched, and it pops up enough windows to show you all the process ranks. You can lock process ranks and step through them, so you can execute until you find issues. There's a debug button that's partially implemented -- it's only two days old, so I won't press it until the end of the talk. But if I press the debug button, it's able to cut in and engage the Visual Studio debugger at that point. So that's how it goes. But now you can obtain more insight. You can say, show me where the message matching occurred -- another GUI. It tells you, for that execution, all the send/receive matches that occurred. Now you can say, show me the completes-before relation that you executed this schedule under. You can say view intraCB, view interCB, or view both. At that point it shows you the completes-before relation it has computed for those nodes. What this tells the designer is that anything not connected by an arrow here can be expected to have executed in either order.
And so if you don't understand your bugs, you can deeply introspect the completes-before relation and understand what happened. There are a gazillion bells and whistles I won't show you; it can show the source code and so on. Now, if I open another project solution: this is an optimized version which now allows wild cards to be meaningfully exploited. I'll tell you how this program works in a moment. This is a program with wild card receives, and whenever wild card receives are introduced you have to track the nondeterminism. And it actually finds that this execution produced several interleavings -- maybe 18 interleavings, if I remember the number. So 18 interleavings, and you can step through the source analyzer, et cetera. You might ask whether these many interleavings are important for this program. Well, potentially, yes: all the send/receive matches might supply different data, and based on what data you supply, the computation can branch differently. But in this particular case there is a higher level symmetry going on -- you have multiplied the matrix 18 times -- and we can probably go after that. A lot of students have done a lot of fine engineering in this project. It's a huge effort, of a scale I've never managed before in an academic setting. We can actually support multiple platforms and multiple libraries. And the best result we have: we can take the 14,000-line ParMETIS hypergraph partitioner and check it in a few seconds on an ordinary laptop, with no nondeterminism manifesting in the execution. What else can we do? We have taken codes and found deadlocks in them, and some of them are benign deadlocks. Several more case studies and several tutorials are also in the cards now. I'll give you an illustration of how a programmer might program in MPI and use this tool. Suppose you want to multiply two matrices. One algorithm is to broadcast matrix B to all the nodes, send the rows of the other matrix to different processes, and have the processes do their own row-times-matrix multiply; these answers come back in parallel and you receive them. If there are more rows than the number of processes, you do it again and again. So the program will start with initialization and declaring the number of processes you have. Then there will be a phase which broadcasts the matrix: broadcast sends the same array to all the slaves, and there is a matching broadcast call to pick it up in all the different process ranks. Then there will be a send/receive pipeline where the rows are being sent and the results are being received. This could be blocking send, blocking receive, blocking send, blocking receive. In the next iteration we'll make these nonblocking calls, so that you can pump out a send and come back, and before the next send is pumped you put a wait and then pump it again, so it becomes a software-pipelined loop. Finally, when you receive the rows back, who do you send work to again? Is it the first one that got work, or is it whoever completed first? If you follow the whoever-completed-first model, you get the nondeterministic receive as a natural programming method here: the first winner gets the next work. So that's how the code looks. You can receive from the first processor and send the next row back to the first processor, or you can say receive from anybody and send more work to that processor -- a rough sketch of that pattern follows below.
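Here is that receive-from-anybody structure, reduced to the communication skeleton; the row contents and the actual arithmetic are stubbed out, and the nonblocking refinement is described next:

```c
/* Master/worker skeleton: rank 0 broadcasts B, hands out rows, and gives
 * the next row to whichever worker answers first (wild card receive).
 * Tag 0 carries work and results; tag 1 is a stop signal. */
#include <mpi.h>

#define NROWS 16
#define NCOLS 8

int main(int argc, char **argv) {
    int rank, size;
    double B[NCOLS] = {0}, row[NCOLS] = {0}, result;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(B, NCOLS, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* everyone gets B */

    if (rank == 0) {
        int next_row = 0, outstanding = 0;
        /* Prime the pipeline: one row to each worker. */
        for (int w = 1; w < size && next_row < NROWS; w++, next_row++) {
            MPI_Send(row, NCOLS, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
            outstanding++;
        }
        /* Collect results; whoever finishes first gets the next row. */
        while (outstanding > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            outstanding--;
            if (next_row < NROWS) {
                MPI_Send(row, NCOLS, MPI_DOUBLE, st.MPI_SOURCE, 0,
                         MPI_COMM_WORLD);
                next_row++; outstanding++;
            }
        }
        for (int w = 1; w < size; w++)                     /* stop signals */
            MPI_Send(row, 0, MPI_DOUBLE, w, 1, MPI_COMM_WORLD);
    } else {
        while (1) {
            MPI_Recv(row, NCOLS, MPI_DOUBLE, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) break;
            result = 0.0;              /* row-times-B dot products go here */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```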
And then you can use a nonblocking operation, saying you just need to wait for the previous send to finish before you issue the next send; so you can set up delayed sending like that. I did this Visual Studio demo over and over while building it. In the few minutes I have, I'll tell you why system buffering is so important to model. This could happen in any distributed, message passing system with nondeterminism and these other features. The idea is that when you have a program with send, wait, send, wait, like that, it's normally expected that the wait operation is a blocking call, so there is a completes-before ordering that it puts on the following send. Notice that even though this send is going to P1 and this one is going to P2 -- and by that alone you might have said they could complete in different orders -- because of the wait call, the completions are forced to occur in program order. Okay. Now let's focus on this edge. What happens if I add buffering? Suppose the system has a lot of buffering, so that it takes this send's message immediately into its own memory. When that happens, this wait command, whose only obligation is to notice whether the send has copied out its memory, is like a no-op. The wait says, I'm done; I have noticed that this send has shipped its data away. This completes-before edge vanishes when the system has plenty of buffering. So this is the buffering-induced weakening of completes-before. Just to summarize: in this example, completes-before respects program order when the system has no buffering, and it does not respect program order when the system has plenty of buffering. This has consequences in an interesting situation, portrayed here, where initially the completes-before graph orders this send with respect to this receive. So while this wild card receive in P2 could have taken data from anybody, and this send is sending to P2, these two could not match, because of the completes-before ordering; only the other match is allowed. The nondeterminism doesn't matter when this completes-before graph is built without buffering. But if you add some buffering here, this completes-before edge breaks, and now there are two attacks on that wild card. So this is the nonlocal effect: adding buffering to one send has a butterfly effect on the matching of some receive elsewhere. The thing is, it's not as scary as it looks; we have handled this. The naive approach would be to take every send, simulate zero buffering and ample buffering, and run all the exponentially many cases. That's not what we do. What we do is run one interleaving without any buffering, build the completes-before graph on the side, analyze that graph, find an optimal allocation of buffering for the waits and sends, and then replay. We continue like this until enough places have been buffered to make this kind of deadlock show up. So the zero-buffer execution initially is there to expose the head-to-head deadlocks, and the full-buffer analysis is there to expose the too-much-buffering deadlocks. That is being written up. Well, I think I dumped a lot of information on you. Hopefully you get a sense of where this is going. The point is that we are wedded to dynamic verification just because it cuts in where designers like to be. And already we have a payoff.
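To make the buffering sensitivity concrete, here is the classic zero-buffer case that the initial unbuffered execution is meant to expose -- a minimal sketch, to be run with two or more processes:

```c
/* Head-to-head exchange whose fate depends on how much buffering the MPI
 * runtime provides: with ample buffering both MPI_Send calls return
 * immediately and the exchange completes; with none, each process blocks
 * in its send waiting for the other's receive, and the program deadlocks.
 * (MPI_Sendrecv or nonblocking calls make the exchange safe.) */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, out, in;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    out = rank;

    if (rank == 0 || rank == 1) {
        int peer = 1 - rank;
        MPI_Send(&out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);   /* may block */
        MPI_Recv(&in,  1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```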
There's another multi-core API called MCAPI, for hand-held phones and PDAs; it's a lightweight message passing API that's being standardized. We're able to apply many of our techniques there as well, and a tool is under construction. And if you look at the plethora of tools available in this area -- there's the CHESS tool, the MODIST effort -- we have an effort for thread program verification, and we might branch into MCAPI verification. The issue is that I don't know where to publish all this work; that's a slight worry. Model checking is a thriving area, but the engineering of these dynamic verifiers should get a good outlet. It would be nice to have a meeting on [inaudible]. Engineering dynamic analyzers can be greatly facilitated if the API designers are mindful of the constraints of building a dynamic verification system. And there are several opportunities for user interface design that are only beginning to be exploited. So, okay, I went maybe slightly over time, but thank you. This was a pleasure. [applause] Questions? [inaudible] is here -- I should say that, after Robert, she took over the API work and brought us into the MPI land. >>: What is the feedback you got from the MPI community? >> Ganesh Gopalakrishnan: They like it. This has been the best part of our whole experience. When we started verifying a simple MPI locking algorithm, we were modeling the problem, showing twenty-year-old verification technology in action, and found some bugs; we won the best paper award for that first paper. That was good. We were working with engineers at Argonne, and they're co-authoring many papers. That community is welcoming of this research, and PPoPP, Principles and Practice of Parallel Programming, is a community that likes the applied nature of this work. We've got a [inaudible] paper out of it, which is good. It will be good to inform more communities on both sides, yes. The usability of this tool -- well, we have textbook examples worked out as tutorials, so hopefully we'll influence the pedagogy. But beyond that, the largest applications of MPI are happening in situations where they have 16,000 processes and huge datasets, and there they're running into bugs that are in the MPI library and not in the user program. So we need to find ways to get there. Yes? >>: Now that you've done this, are there ways that, if you were designing MPI, you would design it differently? >> Ganesh Gopalakrishnan: Well, do you want to say something? >>: I think there is a reason it's designed that way: it's for performance critical applications. I think they have [inaudible], so I don't think it would change. If I wanted to make verification easier, I would have made barriers stronger synchronization points; that's the reason they have done it this way. >> Ganesh Gopalakrishnan: It's well thought out, to an extent. They're thinking about deprecating things like canceling a message that has been sent, and so on. But it's a large effort. People have told me that there are MPI applications whose code is so mature -- getting six digit precision on some [inaudible] model and so on -- that no one will really touch those codes. So longevity is important. The thing is, MPI is spreading -- >>: [Inaudible] bugs in the code. >> Ganesh Gopalakrishnan: No, but people still try to port these codes to new platforms. And then all the schedules that never showed up before show up, which throws them into big confusion land. They're also worried about message copying, using up memory by creating copies.
So they're trying to marry threading into MPI. That's where I would make a change. Because now, if you have threading -- threads trying to call MPI -- there's no notion of a thread rank. So two different threads could be making MPI calls and, oops, the wrong thread picks up the receive. This can lead to such brittle code that, if you look at the code on the page, you won't know what's going on. But people are trying to get some traction there. The tool's available, so I would like to encourage you to try it. This was my own serious effort at understanding MPI, and I did several versions of this matrix multiply; it's a lot of fun. One student whom I hired recently is trying to measure the performance of this code on large clusters and so on, and he's even pushing the raw performance of ISP to large matrices and so on. I don't know whether that needs to be shown, but I have his write-up. Thank you. [applause]