>> Jim Larus: It's my pleasure today to introduce Edward Lee, who is a professor at UC Berkeley. He's been there for a number of years; I think he probably came about the same time I left. Ed has been here before. The last time he came to tell us why shared memory programming was a bad idea. I think this time he's going to tell us why message passing is a bad idea. I'm not sure what's left after that -- but if we invent a new model, we can invite him back and he can tell us why it's a bad idea. Thanks, Ed. >> Edward Lee: Okay, thanks, Jim. I'm going to try not to be quite so negative. You could think of this as equal opportunity griping. But, as Jim said, when I was here last year I basically argued that shared memory as a programming model -- which typically gets translated into threads using semaphores and mutexes -- is, in my opinion, not an adequate programming model. That's not to say it isn't a useful technique. It just shouldn't be exposed to application programmers. It should be somewhere under the hood, used by experts, as it was originally, by the way. This is 40-year-old technology, and it has been used for many years by operating systems people. It's a relatively recent phenomenon for it to be exposed to the programmer at the programming language level. So I'm not going to talk about that today. Some people have reacted to that case by saying, well, there's this classical debate between shared memory and message passing, so if you're opposed to shared memory you must be in favor of message passing. And the story is: it's not that simple. I think the real issue is that whatever concurrency model you use, you need more structure and more discipline than what is typically provided by today's programming infrastructure. So what I'm really going to try to make a case for is something that ought to be innately appealing to most computer scientists, which is that constraints on programmers are often even more important than expressiveness. Think about it: is assembly code more expressive than high-level programming languages? Well, yeah, sure. But it's much harder to use, in part because it's more expressive -- it gives you more direct control of the hardware. So this isn't a radical new idea, but I'm going to pick on some technologies that are pretty popular today in message passing, and in particular I'm going to focus on two problems that arise with message passing systems. One of them is a rather subtle one, which is buffer management in message passing systems. I'm going to try to convince you that the existing message passing libraries really do an abysmal job on this, and quite unnecessarily -- you can actually do much better. The second issue is a scalability issue for message passing programs. There are quite a few other things I could talk about with message passing programs, but given a limited amount of time, it's better to focus on a couple of specific issues. So my whipping boy today is going to be MPI, mostly because I think it is one of the more popular message passing libraries out there. It's pretty widely used in the scientific community, it's been around for a while, and it runs on quite a variety of different platforms.
So I'm going to use it to pick apart the flaws that I see, and they're, I think, illustrative of similar flaws that you see in a lot of comparable libraries. For those of you who aren't familiar with MPI, here is a snippet of a program in C, and it shows how you use MPI in C. You call an initialization method; no big deal there. You then find out which process is actually running. What happens is that the same code will get executed on a whole bunch of different machines. And you ask the infrastructure: who am I? What is my rank? This procedure will write into the variable rank an identity for the particular process that is running. You can also ask the infrastructure how many processes are running; it will write into the variable size the number of processes that are running. And then you can run the same code in multiple processes, or you can run separate code in distinct processes. The way you do the latter is to put the code you want to run inside an if block that's guarded by the rank that you would like to execute that code. So you could have code for one process in here, code for another process in here, and everything is collected in one source file. You compile it and it runs on multiple processors. That's the anatomy of an MPI program in C. You can do a similar thing in Fortran. Now, let's take a look at a fairly concrete process. Suppose I have a process that I've given an iconic picture of here. It's got two input streams, so these are sources of messages. Messages are going to come in from other processes on these input ports, and messages are also going to come in on this control input port. These control messages determine the routing of messages from the sources to the output. In particular, the control messages carry Boolean data. If you receive a message with value true, then you look for a message on this input port and route it to the output. If you receive a message with value false, you look for a message here and route it to the output port. So that's an example of a process that you might run. This would typically be a piece of a much bigger sorting algorithm. For example, you might have a program that is collecting transaction data from commodities markets around the world and sorting it by timestamp or something like that. You would have some logic that determines where you should read the next input from. That would be a typical application that you could scale up to very large scales. So here's how this select process could be written. First of all, this would sit inside an if block guarded by the rank of the process that you want to run the select. Inside that if block, it just runs in an infinite loop. It first reads from the control port with this first line, MPI receive. Then, if the value of the control is a Boolean true -- this is C, so it's a nonzero int -- you read from this port. Now, unfortunately, in MPI you don't actually do this in quite such a modular way. The reference here is not to a port but rather to the rank of the process that is the source of the data. So there's a lack of modularity here; there are layers you can put on top of MPI that separate those concerns.
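A minimal sketch of roughly what such a select process might look like in MPI C. The rank assignments (CONTROL_RANK, SOURCE1_RANK, SOURCE2_RANK, DEST_RANK, and rank 3 for the select itself), the tag, and the int payload are illustrative assumptions, not the code from the slides:

    #include <mpi.h>

    /* Hypothetical layout of the job: these would be other processes. */
    #define CONTROL_RANK 0
    #define SOURCE1_RANK 1
    #define SOURCE2_RANK 2
    #define DEST_RANK    4
    #define TAG          0

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */

        if (rank == 3) {                        /* the select process        */
            int control, data;
            MPI_Status status;
            while (1) {
                /* Read the Boolean control value (C: nonzero int is true). */
                MPI_Recv(&control, 1, MPI_INT, CONTROL_RANK, TAG,
                         MPI_COMM_WORLD, &status);
                if (control) {                  /* true: route from source 1 */
                    MPI_Recv(&data, 1, MPI_INT, SOURCE1_RANK, TAG,
                             MPI_COMM_WORLD, &status);
                } else {                        /* false: route from source 2 */
                    MPI_Recv(&data, 1, MPI_INT, SOURCE2_RANK, TAG,
                             MPI_COMM_WORLD, &status);
                }
                MPI_Send(&data, 1, MPI_INT, DEST_RANK, TAG, MPI_COMM_WORLD);
            }
        }
        /* ... other ranks would run the sources, the controller, and the sink ... */

        MPI_Finalize();
        return 0;
    }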
But this particular code is not written in a very modular way, because the process directly refers to the processes that are sourcing the data. In any case, this reads from data source 1, this reads from data source 2, then this sends to the output, and then we go back to reading from the control port. A couple of points about this. First, the reference to the rank of the source or destination process is a lack of modularity, because you can't take this code and use it somewhere else, in some other context, according to its logic, without changing it. And there's no particular reason why this code should have any reference to the data types of the messages that are being routed through it, but it does. There really isn't a convenient way in MPI to make this polymorphic, at least not in a type-safe way. So those are a couple of modularity problems that arise, but they are mostly not going to be my focus. Instead I'm going to talk about buffer management. MPI describes its send procedure as a blocking send. What they mean by that is that it does not return until the memory storing the value to be sent can be safely overwritten. So if I back up to this send here: I make a reference to the data value that I want to send by providing a pointer to the value that I read at the input. When that can be safely overwritten, this method returns, which enables me to go back through the loop and consequently overwrite that location. That's what they mean by a blocking send in this protocol. Now, the MPI standard allows implementations either to copy the data into some system buffer for later delivery to the receiver, or to rendezvous with the receiving process and return only after the receiver has begun receiving the data. So when you use MPI send, a particular implementation of MPI could deliver either of these two behaviors. And that's a problem, because it leads to programs that behave differently under different implementations of MPI. So let me show you such a program. This is fleshing out the select process into something that looks more like a sorting application. I've got two processes, source 1 and source 2, that are producing data. Those might be, for example, reading data from a database that has commodities exchange information in it. Then I've got a control process that decides how to merge these streams, so the logic of the sort happens in this control process, and the streams are getting merged. The control process could be written like this: I receive data from the top input port, then I receive data from the bottom input port, and then I drop into an infinite loop. If some condition on the two data values is true -- this is my sort criterion -- then I send a true value on my output and receive new data from that side. So I alternate which one of these I look at depending on which of the data values I'm going to route through the select. If my control logic decides to route a message from the top through, then it effectively discards that message and waits for another message on this input.
If it decides instead to send through the message on the bottom, then it discards that message and goes back and waits for a new message here. So that's the logic. You could scale this thing up. In fact, you could create a tree of these things that could be arbitrarily large, and you would be able to parallelize a very large sorting application in, I think, a pretty natural way in MPI. However, the MPI standard in fact doesn't tell us what the behavior of this program is going to be. First, there's one little trickiness here. The source data is getting routed to two different destinations, the selector and the controller. There are two ways you could do that in MPI. One way would be to change your source program and have it write to one of these and then write to the other. But MPI programs tend to reference the processes they're communicating with, which is a lack of modularity, so that would be inelegant. So instead let's assume that we are willing to dedicate a process -- which, of course, has its own overhead cost -- to doing the replication. These little diamonds here are processes that receive a message on the input and route it: send it to the top and then send it to the bottom. Now, the issue is that the process has to do that in some order; it's got to send to one and then send to the other. But the order matters. In particular, under a rendezvous semantics, which is allowed by the MPI standard, this program can deadlock, whereas with a buffered send it never deadlocks -- there's no way for it to deadlock. So there's a fundamental problem here. And the irony is that according to the MPI standard document, the reason they allow these two different implementations is to get portable programs -- yeah. >>: Is that the real reason, or was there some politics where one company insisted on its [inaudible]? >> Edward Lee: I think that this is a veiled statement that there was politics. So I suspect that's what was going on: there were two different implementations that people were very wedded to, and so they decided not to make the commitment. But they missed the fact that a consequence of this lack of commitment is that the programs are distinctly nonportable -- except in the sense that they run on both platforms; they just don't do the same thing. And I think that's a pretty serious flaw. All right, so that's a problem. How could we fix it? Well, you can force a buffered send. To avoid the deadlock, don't allow it to do the rendezvous. The way to do that is to use a different method, MPI_Bsend, for buffered send. But if you do this, then you have to worry about buffer space. So how do you worry about buffer space? Well, here's what it says in the MPI standard: a buffered send operation that cannot complete because of a lack of buffer space is erroneous. When such a situation is detected, an error is signaled that may cause the program to terminate abnormally. On the other hand, a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations.
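For concreteness, a minimal sketch of what forcing buffered sends looks like with MPI_Buffer_attach and MPI_Bsend. The buffer sizing, the value sent, and the assumption that this runs between MPI_Init and MPI_Finalize with a matching receive posted elsewhere are all illustrative, not recommendations from the talk:

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send_sketch(int dest, int tag) {
        int value = 42, pack_size;

        /* MPI_Pack_size reports how much space one message needs; the standard
           requires adding MPI_BSEND_OVERHEAD per pending message.             */
        MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &pack_size);
        int bufsize = 100 * (pack_size + MPI_BSEND_OVERHEAD);  /* ~100 pending sends */
        void *buffer = malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        /* Copies 'value' into the attached buffer and returns immediately;
           if the buffer is exhausted, that is an error, not a block.          */
        MPI_Bsend(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);

        /* Blocks until all buffered messages have actually been delivered.    */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);
    }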
Most of us would probably agree with that last sentence of the standard, but unfortunately we can't use the ordinary send, because if we do we have a nonportable program. So we're stuck using the buffered send, and we have to worry about whether overflow is going to occur. So how easy is it to manage buffer space in MPI? Well, unfortunately it's almost impossible. There are procedures in the standard: you can do MPI buffer attach, which associates a buffer with a process, but the MPI standard doesn't limit the buffering to the specified buffers. And if the MPI send procedure returns an error, what should you do? What should the process do, in fact? It's really hard to know what your options are. So a key problem I identify here is that MPI doesn't provide much in the way of mechanisms to exercise control over process scheduling. And process scheduling is intrinsically tied to buffer management: the way to avoid buffer overflow is to throttle back the execution of processes at appropriate times. The mechanisms that are provided in MPI are kind of a hodgepodge; in fact, they're drawn from a completely different community. The barrier synchronization mechanism is really a brute-force mechanism that is rather difficult for programmers to use in this context. So let me zero in on -- yeah, was there a question? >>: On the previous slide, when do you garbage collect? You showed us some snippets of code where the process is executing, calling the MPI [inaudible] receive call, right? But now if different processes can essentially block waiting for a message to arrive in the buffer, how do you manage that? There could be a strong dependency that processes [inaudible] they want to read that message. So when is a good time to garbage collect [inaudible]? >> Edward Lee: Well, first of all, keep in mind that MPI programs are implemented in C and Fortran, so garbage collection is done by the programmer. But in the case of buffer management, it's presumably done by the MPI infrastructure. And I'm going to address the subtleties that are involved in trying to do that right. Ideally the programmer wouldn't have to worry about when the buffer memory has to be garbage collected. The only thing the programmer should have to worry about is the statement I pointed out earlier, which is that a send is a blocking send: it doesn't return until the memory storing the value can be safely overwritten. The receive is similarly a blocking receive, in that it doesn't return until the memory that the pointer you've provided points to has been filled with received data. With those two pieces of information, you can figure out when to do your garbage collection locally. But hopefully you don't have to do the garbage collection for the buffering; in fact, MPI provides no mechanism to do that. There's no access to that structure. Okay. So what are the process scheduling subtleties? Well, there's a bunch of them: bounding the buffers, dealing with termination, deadlock or livelock, fairness issues, exploiting parallelism, sharing data across processes, determinism. I'm just going to focus on some of these, in particular the buffer management one. So if you recognize that you have to do something about process scheduling, what should you do? Well, one possibility is you could say, well, I should be fair.
Right? As long as my scheduling is fair, everyone will be happy. Who could argue with that? It's motherhood and apple pie to be fair. So suppose I have a model like this: this is a set of processes communicating. We can ignore this part for the moment and assume it's done in MPI -- in this case it's actually not. This is the same select process. Now, suppose that the process generating the control signal for the select happens to generate a constant stream of true messages for some reason. Then what will the consequence of fair scheduling be? Well, if you do fair scheduling, both of these source processes will be given equal opportunity to run. You can't really control buffer management: buffers are going to overflow and you're going to run out of memory, because this guy's data is never going to get consumed. So fair scheduling doesn't really work. Okay, people know that. So let's do data-driven execution. But if you do data-driven execution -- the data-driven principle is simply that a process is allowed to execute when it has input data -- these processes here don't have any sources of input data, so presumably they should always be allowed to execute. So that doesn't solve the problem. So we should turn it around and do demand-driven execution, because obviously data-driven execution didn't solve the problem. But unfortunately demand-driven execution doesn't solve the problem either, because I can just use the converse of this selector, which is a Boolean switch process. Here I have a control signal telling me whether a message from this process should be routed to the top process or the bottom process. Now, how do you manage demands? These guys are sinks of data, so presumably they should always be demanding inputs. But if this guy happens to again produce a sequence of constant trues, then this guy's demands can never be satisfied. So how do you manage that process? Let me show you that this problem cannot be solved, strictly speaking, by just saying, well, you let that demand go unsatisfied. Suppose I take my original program, which has a Boolean select here -- so I've got two sources of messages and a control process -- and it just so happens that I connect up an observer, a sink, to this guy. Now, what is the correct behavior of this program? This guy's demands can in fact be satisfied forever. However, if you satisfy them, you will overflow the buffers and your program will crash. So should you satisfy them? It's not obvious that you should. I think most programmers would probably prefer that the program not crash, but is not satisfying those demands acceptable? So this is the kind of subtlety that in the MPI world is just thrown over the fence to the application programmer, and the application programmer has to worry about how to deal with it. My basic message is that it shouldn't be the application programmer who has to deal with this; it should be part of the underlying semantics of the infrastructure. So in particular, in this implementation, which is a process network implementation, this demand will not be satisfied if satisfying it would result in unbounded growth of buffers. I'll explain that in a little more detail later. So let's look at a couple more subtle cases.
Suppose that I have some source processes and a control process, and I route messages to a guy who wants data from both of these, but my constant source happens to deliver a sequence of constant trues. What is a correct execution of this program? Arguably, is it correct for it to overflow the buffers? Because it's kind of intrinsic in the logic of the program that this overflows the buffers, whereas in the previous programs it was not intrinsic in the logic of the program. Sorry? >>: Or it can just do nothing, because we'll never get anything on the false [inaudible]. >> Edward Lee: Or it could just do -- >>: [inaudible]. >> Edward Lee: That's right. So it could just halt. When should it choose to halt? One criterion would be when it runs out of memory, which is the choice that's made here: basically, it runs until it runs out of memory. What about a program like this, where I've got a feedback that in fact will never be satisfied? This process begins by waiting for a message here, and it will never get that, because it won't get it until it produces its output. Now, you could have this in a local part of the program. Should that be able to prevent execution of other parts of the program? That's the livelock situation. Arguably, again, the solution in the infrastructure that I'm showing is to overflow the buffer: this guy is able to produce an infinite stream of data, and it will do so until you run out of memory. So the point here is that naive strategies for scheduling all fail here: fair, demand-driven, data-driven, and most mixtures of demand- and data-driven that have been out there in the literature for a long time. And programmers who are building nontrivial programs with MPI are having to rediscover all the problems with these strategies and try to solve them without actually being able to exercise any control over process scheduling. The games you have to play to get around the problems are kind of extraordinary. So my point is that these problems that I outlined have in fact been solved. You could quibble about whether the solution is correct or not -- in fact, there's been some debate in the literature about nuances of the choices and solutions -- but nonetheless, there are solutions that are defensible and that have well-thought-through semantics. So I'll give you a description of one, which comes from my group and is implemented in this Ptolemy II director that we call a process network director. In a process network director, every one of these boxes is backed by a thread; this is using threads under the hood. The messages pass along the arcs. It does support nondeterminism, so this particular process is happy to read from this upper path or from this lower path, and it will do so in a nondeterministic order: if a message becomes available on one and there's no message on the other, it will read the one that becomes available. And these are consumer processes. The key thing is that buffer management has been dealt with in a very particular way that gives a nice, clean model to the programmer. You could argue about whether it's the right model, but at least it's a clean model. Okay. So here's the model.
So we define a correct execution to be any execution in which, after any finite amount of time, every signal -- which is a sequence of messages -- is a prefix of the signal given by a denotational semantics of the program. In the denotational semantics of the program, the signal could be an infinite sequence of messages. We define a correct execution to be any execution that gives us a prefix of each of those potentially infinite sequences. We define a useful execution to be a correct execution that satisfies the following criteria. First, if you have a nonterminating program, then after any finite time a useful execution will extend at least one stream of messages by at least one message. In other words, it would be correct to stop immediately and do nothing, but it wouldn't be useful, because if you stop immediately and do nothing, your sequences are all empty, which are in fact prefixes of the correct sequences. Second, if a correct execution exists that satisfies criterion 1 and keeps buffers bounded, then a useful execution will execute with bounded buffers. This says that if you have a choice between overflowing buffers or not overflowing buffers, don't overflow the buffers -- but keep executing. Seems reasonable. >>: [inaudible] least fixed point semantics? >> Edward Lee: So this is due to Gilles Kahn, way back in 1974. He modeled these kinds of process networks as functions that map streams of inputs to streams of outputs. That would be for deterministic functions; he ruled out things like this nondeterministic merge. So there are functions that map streams into streams, and he showed that those functions could be characterized using a very nice topological framework as monotonic functions, meaning that if you provide two possible inputs to a process and one is a prefix of the other, then the corresponding outputs will be prefixes of one another -- the first will be a prefix of the second. So the prefix relation provides an ordering on the streams, and the function is monotonic in the same sense as ordinary monotonic functions. Then Kahn showed that any network of such processes could be described as a single monotonic function, and that that function has a unique least solution: a least sequence of messages on each signal that satisfies all of the behaviors specified by the functions. The least behavior could be infinite sequences; in fact, it typically is. For programs that are live, the least fixed point contains infinite sequences; for programs that deadlock, it contains finite sequences. But that's the background. The point is that it provides a clean denotational semantics. It's not an operational semantics, but it defines, for any such network, the sequence of messages that is defined by the program. And it's typically an infinite sequence, so an operational semantics can only approximate it with finite sequences. That's a bit of a technicality, but the point is there's a clean, well-defined semantics there, so this term correct execution can be made fully rigorous. Okay. Now, this second criterion -- if a correct execution exists that satisfies criterion 1 and executes with bounded buffers, then a useful execution will execute with bounded buffers -- is hard to satisfy, for a very fundamental reason.
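In symbols, the denotational machinery just described looks roughly like this (a reconstruction, not notation taken from the slides; Kahn's construction actually uses continuity, which is slightly stronger than monotonicity):

    s \sqsubseteq t \;\iff\; s \text{ is a prefix of } t,
    \qquad
    F \text{ is monotonic} \;\iff\; \big( s \sqsubseteq t \implies F(s) \sqsubseteq F(t) \big),

    x_{\min} \;=\; \bigsqcup_{k \ge 0} F^{k}(\bot),
    \qquad
    F(x_{\min}) = x_{\min}.

Here F is the single function describing the whole network, \bot is the tuple of empty sequences, and the least fixed point x_min is what is referred to above as the (possibly infinite) signals given by the denotational semantics; a correct execution produces only prefixes of x_min.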
The reason is that even for a trivial programming model of this type, it's undecidable whether a given program can execute with bounded buffers. In fact, this was shown, I believe first, by my Ph.D. student Joe Buck in his Ph.D. thesis. He showed that you just need the following processes, where the message data types are exclusively Boolean: every message is true or false, and there are no other messages. You have a NAND -- a simple Boolean function; in fact, a NAND is sufficient -- which receives a message here and a message here and outputs the NAND of the two. You have a delay, which outputs an initial Boolean and then subsequently behaves like an identity function; that's all it does. And then you have the select, the switch, and a fork -- so there are actually five processes; you need the fork as well. The fork is a function that has a single input port and two output ports, and it simply replicates the messages on both output ports. With those five processes, you can build a universal Turing machine with only Boolean data types. Notice that none of the processes actually uses any memory to speak of, so there's no issue of bounding the memory in the processes themselves; the only memory to bound is in the buffers. For Turing machines, it's undecidable whether a program uses bounded memory. Therefore, the bounded buffer problem is undecidable for these networks. So is the deadlock problem: it's undecidable whether a particular network of these things will produce infinite sequences or not. So that creates a bit of a conundrum, because now every MPI programmer has to solve an undecidable problem in order to guarantee that their program will not overflow the buffers at some point. So how do you solve an undecidable problem? Well, one solution was given by another Ph.D. student of mine, Tom Parks, who solves the undecidable problem in a very trivial way. He said in his thesis: start with an arbitrary bound on the capacity of all the buffers and execute as much as you can. A program will block under bounded buffers when a process tries to write to a buffer that is full; if all processes are blocked, either on full buffers or on empty buffers, then you've deadlocked. By the way, in an MPI program there's no way to tell when you've reached that state, except that an exception occurs and your program crashes. So what you do is: if a deadlock occurs and at least one process is blocked on a write, increase the capacity of at least one buffer to unblock at least one write. That was his strategy. The strategy can be improved a lot -- there are lots of things you might do that would be smarter -- but that strategy is sufficient. Then you just continue executing, repeatedly checking for deadlock. Tom Parks proved that this delivers a useful execution, in that for all programs that could execute with bounded buffers, this executes with bounded buffers. So is this a contradiction -- isn't solving an undecidable problem impossible? Well, not really, if you're willing to take forever to solve it, right? And that's what this does.
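A minimal sketch of that deadlock-handling rule in C-like form. The Channel type, the blocked flags, and the policy of growing the smallest full buffer first are illustrative assumptions about a runtime the scheduler would sit inside -- they are not Ptolemy II code, and the smallest-buffer-first choice is one of many refinements the thesis leaves open:

    /* One communication channel (arc) in the process network. */
    typedef struct {
        int capacity;        /* current bound on this buffer                      */
        int count;           /* messages currently queued                         */
        int writer_blocked;  /* nonzero if the producer is blocked on a full buffer */
    } Channel;

    /* Called only when the runtime has detected a global deadlock: every process
       is blocked, either on a write to a full buffer or on a read from an empty
       one.  If at least one process is blocked on a write, grow one full buffer
       (here: the smallest) and resume; otherwise the deadlock is real.
       Returns 1 if execution should continue, 0 on true deadlock.               */
    int resolve_artificial_deadlock(Channel *ch, int n) {
        int victim = -1;
        for (int i = 0; i < n; i++) {
            if (ch[i].writer_blocked &&
                (victim < 0 || ch[i].capacity < ch[victim].capacity)) {
                victim = i;
            }
        }
        if (victim < 0) {
            return 0;               /* no blocked writes: genuine deadlock */
        }
        ch[victim].capacity += 1;   /* unblock at least one write          */
        ch[victim].writer_blocked = 0;
        return 1;
    }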
This doesn't deliver an answer, in bounded time, to the question of whether the program executes with bounded buffers. So it's not a contradiction with the problem being undecidable. Yeah. >>: So it has one nondeterministic choice of which buffer capacity to [inaudible] increase. Is that material in terms of figuring out whether it's going to actually be following the bounded execution? >> Edward Lee: If there is one bounded execution, there are many bounded executions, so if you're not concerned about using the minimum amount of buffer memory, it's not an issue. In fact, Parks also gives a variant of this where he says: if you detect a deadlock, increase all the buffer sizes by one. That works too. There are many choices. But these don't minimize the size of the buffers; he wasn't worried about the optimization problem. There are people who have followed on -- Geilen and Basten, for example, are concerned about the optimization problem, and they're also concerned about behavior for programs that in fact require unbounded buffers -- and they've delivered some improvements on this. Okay. Now, this strategy is pretty simple, but if you try to implement it in MPI, it's actually extraordinarily difficult. You have to get under the hood in MPI and start messing with the MPI engine in order to implement it. Now, for some cases it might not be good enough to solve the problem in this way, where you don't get a conclusive answer about whether the program will, in fact, execute forever. If you have embedded applications or safety-critical applications, you're going to want to prove that bounded buffers are sufficient and that your particular buffer bounds are sufficient. You can do that. There are formalisms that have been developed over the years that do this -- actually quite a lot of them, with many variants. One of the simplest was one that I did in my Ph.D. thesis, way back in the previous millennium, that we called synchronous dataflow. The idea here is very simple: a process is now described in terms of finite chunks of computation that we call firings, a term borrowed from the dataflow world. And for each firing, a process produces a fixed and known number of messages: it produces P messages here, or it consumes C messages there. If you have a network of such processes, it is in fact decidable whether there is a bounded-buffer execution. And you can formulate and solve the buffer optimization problem to find the minimum buffer sizes. That turns out not to be trivial -- minimizing the buffer sizes is NP-hard, even with this rather trivial model -- but nonetheless, it's decidable. So the tradeoff here is that we limit expressiveness by constraining the number of tokens, or messages, that are produced and consumed, and this of course rules out the Boolean switch and the Boolean select, because this guy may or may not consume messages on these inputs depending on the Boolean values. So those are not in the class of synchronous dataflow programs. >>: Could it discard the token rather than consume it? >> Edward Lee: Well, you could create a variant of this that we call a multiplexer, which reads from both inputs and discards one. That one is synchronous dataflow. Yeah. So I think I'll skip over the formalism behind this because it's well documented.
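For reference, the core of that formalism is compact enough to state (a sketch in standard synchronous dataflow notation, not copied from the slides). If an actor a produces p tokens per firing on a channel from which actor b consumes c tokens per firing, then the repetition vector q of firing counts must satisfy the balance equations:

    p \, q_a \;=\; c \, q_b \quad \text{for every channel from } a \text{ to } b,
    \qquad\text{equivalently}\qquad
    \Gamma q = 0,

where \Gamma is the topology matrix holding the production rates (positive entries) and consumption rates (negative entries). Roughly, a connected graph admits a bounded-buffer periodic schedule only if \Gamma has rank one less than the number of actors, so that a smallest positive integer solution q exists; the remaining check is that one period can actually execute without deadlocking.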
The point I'd like to make is that it gives us a decidable model, and, moreover, not only can you bound the buffers, you can actually do load balancing statically as well, if you know something about the execution times of these firing functions. A bunch of work has been done on this, most of it quite old -- this stuff dates from the last time parallel computing was popular. In fact, here is a screen image from 1990 of a graphical dataflow programming environment that we built in my group, where these processes were being synthesized into parallel assembly code for multiprocessor DSP systems for embedded applications. There was a great deal of emphasis on optimization in this work. The horizontal axis is showing time in instruction cycles, and these are firings. The typical firings of these actors were about six instruction cycles, and the interprocessor communication was occurring in two instruction cycles. So a lot of really low-level optimization was being done. These were aimed at the high-throughput applications of the time, like video processing applications, which were using multiprocessor DSPs at that time. So that's 1990; that was quite a while ago. The point is that this synchronous dataflow model makes things decidable, but there are very interesting and very complex optimization problems there. In fact, it was a rich source, a mother lode, of Ph.D. thesis topics for quite some time. To give you a sense of this, this is from Shuvra Bhattacharyya's Ph.D. thesis, where he really focused on the buffer optimization problem. Here is a very simple model with a very real-world application. It has got only six processes, and all it's doing is converting compact disc data to digital audio tape data. Compact disc data is sampled at 44.1 kilosamples per second, and digital audio tape is sampled at 48 kilosamples per second -- which, by the way, was done deliberately to try to prevent copying of compact discs. The industry thought they could prevent people from pirating compact discs by making it very difficult to do this conversion. So this does the conversion in a sequence of stages using synchronous dataflow models. This guy will consume two messages and produce one. This guy will consume two and produce three. This guy will consume eight and produce seven. This one will consume five and produce seven. The schedule that minimizes the buffer sizes is shown here, and as near as I can tell it's completely chaotic -- you can't see any pattern in it. And Shuvra did some very nice work on the tradeoff: given that you would like to have compact representations of the schedule, what's the best you can do with buffer minimization? Okay. Synchronous dataflow by itself, however, is too restrictive; it's just a very limited model of computation. But fortunately there have been a whole bunch of things that have extended it with much more expressive structures, all of them striving to maintain decidability while enriching what you can describe in the programming model. The most recent of these is Bill Thies' and Saman Amarasinghe's work on StreamIt, which has done some very nice work enriching the expressiveness of synchronous dataflow models and mapping them onto parallel machines. Okay. So I think -- >>: You don't have the permission [inaudible].
>> Edward Lee: Okay, well, I don't want to try people's patience, and I in fact anticipated that I probably wouldn't have too much time to talk about the second part, but I think I can just allude to some of the key issues in it and not run over. Yes. >>: [inaudible] the complexity of your previous example due to the fact that they chose such crazy ratios of sampling rates? >> Edward Lee: Well, the fact that this schedule has this weird structure is a consequence of those wacky sample rates. And, as I said, they chose the ratio deliberately to give you wacky sample rates. It turns out that if you want to do this conversion efficiently, you have to do it in these stages -- it's four stages. These are polyphase, multirate, finite impulse response filters. If you try to do it in one step, it requires a vast amount of memory and it's extremely slow. You have to break it down into these stages if you want to be able to do it in a reasonable amount of memory. So in a way this is kind of an extreme example, but it illustrates the complexity intrinsic in the problem, even though most applications don't suffer from this complexity. Okay. So I'd like to just talk briefly about scaling, because I've done a bunch of criticizing of MPI, but there's one aspect of MPI that I think is really quite positive, that I think is a good starting point, and that can in fact be enriched considerably. And that is the collective operations in MPI. The idea of the collective operations in MPI is to codify patterns of design that are commonly used. There's a bunch of them -- in fact, I think this is a pretty comprehensive list of what MPI provides in the way of collective operations. They're scatter-gather kinds of operations, things of that nature. Just to illustrate some of these: they can be specified very compactly in MPI. If you have a piece of data existing on one process and you would like to broadcast it to all six processes, there's a broadcast operation; you don't have to send it individually to each one. That's a simple collective operation. In pictures, a broadcast in this visual syntax might look like this. This is the fork that replicates messages on the output. This is kind of an awkward and not very scalable representation, however. A nicer representation is something like this, where the number of destinations is parameterized and is represented in a scalable way by a single icon. A scatter/gather is a little bit more interesting than a broadcast. Here we have six data items accessible to or created by one process, and we want to scatter them to six processes. That's the scatter. The gather takes one data item produced by each process and collects them into an array accessible to a single process. These are, again, things that can be expressed very compactly in MPI. And here in pictures are similar mechanisms. The scatter/gather looks like this: we have a process that we call a distributor, which simply reads a message and then, in a round-robin fashion, for each message sends it to one of N output channels. Again, this can be compacted into a scalable representation. And then, correspondingly, there's a commutator that will in a round-robin fashion read from multiple input channels and gather the data.
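To show how compact those calls are, here is a minimal sketch of broadcast, scatter, and gather in MPI C; the array sizes, the payloads, and the use of rank 0 as the root are illustrative assumptions, and the function is assumed to be called between MPI_Init and MPI_Finalize:

    #include <mpi.h>

    /* Each collective is a single call, executed by every process in the communicator. */
    void collectives_sketch(void) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Broadcast: one value held by rank 0 becomes visible to every process. */
        int value = (rank == 0) ? 42 : 0;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Scatter: rank 0 holds one item per process; each process receives its own. */
        int items[64];              /* assumes size <= 64; filled on rank 0 */
        int mine;
        MPI_Scatter(items, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Gather: the reverse; rank 0 collects one item from every process. */
        int gathered[64];
        MPI_Gather(&mine, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }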
Gather-to-all is kind of an interesting one. Here you have six data items produced separately by six different processes, and you'd like all six data items accessible to all six processes. So I wanted to show a concrete example of this -- let's see -- let me show you in pictures first. This is a very simple problem: a 3D gravitational simulation of n bodies. These blue blobs are bodies in space that are exerting gravitational force on each other. A straightforward implementation of this looks at all pairwise combinations of bodies. Basically, for each body you want to find the net gravitational force. To do that, you find its distance to all the other bodies, use Newton's gravitational law to figure out what the force is, and then run a simple numerical integration scheme to implement F equals ma and move the body according to the force. A fairly straightforward thing to do conceptually. So the idea is that you have the positions of the n bodies; each position is a vector, XYZ. Say you have six bodies. Then one way to implement this in parallel is that each process has a copy of these positions, and what you want to do is calculate Euclidean distances for each of them. Once you do that, you get a net force, which you can then apply in F equals ma. So here's a very simple, naive, forward Euler numerical simulation that solves this problem. So let me show you an implementation of this in Ptolemy II. Whoops -- my mouse isn't working. Here I have a parameter that specifies initial positions of a set of bodies; each position is a vector with three elements. You can see there are three numbers here and then a curly brace, and then another three numbers, and this is a rather long list -- there are a bunch of initial bodies. I specify initial velocities for each of these, specify a number of bodies, which is actually calculated from these, and then we have a simple feedback loop that implements the gather-to-all pattern on a parameterized number of models of the body. If I look inside of here, this is an implementation, in a fairly low-level dataflow way, of this numerical simulation. It calculates the Euclidean distances, finds the average of the Euclidean distances, and then this is the forward Euler numerical integration scheme. That's the structure of this. And I can execute it, and it gives me a 3D animation from some starting position that I can rotate around to see these bodies exerting forces on each other. Occasionally you see one get whipped around by coming close to another one. So the idea here is that this is a scalable representation that codifies a particular pattern of interaction between processes. Codifying these patterns is useful, and I think MPI has done that for a set of patterns, but the problem is: what if the pattern you want isn't there? In particular, here is the pattern that I showed before of sorting data from multiple data sources. That's not an MPI collective operation. MapReduce, which is probably familiar to everybody, is not an MPI collective operation, but it's a pretty useful pattern. Recursive constructs: here's actually an implementation -- this is at least 15 years old, in Ptolemy Classic -- of an FFT using recursion.
So this process is a recursive reference to the network of processes that contains it; internally, the implementation of this is actually this same network. This is a complete description of a radix-2 FFT: this is the switch process, this is simply a repeater, this is a complex exponential constant generator -- the twiddle factors -- a complex multiplier, and a repeater. And that describes an FFT using recursion. Also not provided by MPI: dynamically instantiated processes, which have been around for a while in this world. So how do we get this idea of collective operations without being limited to a particular fixed library of them? Well, the idea is pretty simple. We can borrow the concept from functional programming and say we need higher-order components in these models, just like in functional languages, where these patterns are represented as combinators, which are easy to write and extend in the functional language itself. If you provide similar kinds of mechanisms for networks of processes -- this is one simple example of that. This is a particular combinator that is described internally as a higher-order component. It just takes whatever is inside, replicates it some number of times, and implements this gather-to-all communication pattern. But it's described as a higher-order component. I think there's a lot of work to be done on this front. I think we need better language support for these kinds of structures, much like the transition from C to C++ gave us really nice support for object-oriented programming. But the component architectures here are a little different from object-oriented ones: the interaction between components is by sending messages, not by calling procedures. And I think it ought to be possible to provide language support in a similar fashion to the way C++ provided support for object-oriented patterns. Et cetera. So just a couple of final comments. The message I want to convey is that message passing requires more discipline than what you see in today's popular message-passing libraries, and we shouldn't be asking application programmers to rediscover and resolve problems that have been around for 20 years. We should instead be providing infrastructure that solves those problems for the application programmers, and there are certainly cases where we've missed opportunities to do that. The second part is that I do believe this style of programming can scale up very nicely to very big problems, but I think we have some work to do, and the single best idea I have is to broaden this notion of collective operations, borrowing the notion of higher-order functions from functional languages and creating higher-order components to describe these collective operations. And then finally, I just wanted to acknowledge my group, who have made lots of contributions to the things I'm talking about; that's a snapshot of the people who were working in my group as of two months ago. So thanks very much. [applause] >> Jim Larus: Questions? >>: So you talked a lot about MPI, which is a very old and relatively specialized message-passing system. Are there other, more recent systems that you think are interesting, that you could talk about relative to [inaudible] properties? >> Edward Lee: Yes, there are. There's quite a lot out there, actually.
Most of them -- I mean, part of the reason I picked MPI is that it seems to have a great deal more visibility than almost anything else out there in the message-passing world. Some of my favorites in the message-passing world actually come from computer music applications. For example, there's a group of people in Lyon who have created a programming language called FAUST for doing computer music. It's a very domain-specific language, but it's an extremely elegant and efficient message-passing framework, and it has this notion of higher-order components in a rather nice way. So there are a few things out there. A lot of them are niche things; they're in domain-specific worlds. I'm open -- did you have any one particular in mind? >>: [inaudible] seems to be getting some visibility in general [inaudible]. >> Edward Lee: Yeah, those are two excellent examples. Erlang I think is also quite an elegant solution. E is pretty different. Its use of promises I think is really quite intriguing, but it's a real twist on message passing. I think it's a very interesting twist, but I'm not convinced that all of the problems have been worked out. I talked about the subtleties here of bounding buffers in simple message-passing schemes; as far as I know, in the world of promises they haven't even gotten to asking that question, much less answering it, for these kinds of boundedness problems. But I think there's probably good fodder for good Ph.D. theses there, because the idea is really quite intriguing. Yeah. >>: How do the higher-order components compare to, say, generative modeling? Do you think there's some advantage to looking at it from a functional approach? >> Edward Lee: Actually, no, I'm not convinced there is. Generative modeling I sort of put in the same category, as a promising technology. It simply tends to be a more imperative way of describing the structures you want, as opposed to the declarative way that someone coming at it from a functional programming view would use. But I'm agnostic about which is preferable. In fact, I've had students in my group who have done prototypes using both kinds of techniques -- the generative approach with an imperative style and the higher-order functions approach with a declarative style -- and I'm not convinced that one has a clear advantage over the other. >> Jim Larus: Any other questions? Let's thank Ed again. [applause]