>> Jim Larus: It's my pleasure today to introduce Edward Lee who is a professor at UC
Berkeley. He's been there for a number of years, I think probably came about the same
time I left.
Ed has been here before. The last time he came to tell us why shared memory
programming was a bad idea. I think this time he's going to tell us why message passing
is a bad idea. And I'm not sure what's left after -- but if we invent a new model, we can
invite him back and he can tell us why it's a bad idea.
Thanks, Ed.
>> Edward Lee: Okay, thanks Jim. I'm going to try not to be quite so negative. You
could think of this as sort of equal opportunity griping.
But, yeah, as Jim said, when I was last here, I basically stated that if you take shared
memory as a programming model, which typically gets translated into threads using
semaphores and mutexes, then in my opinion this is really not an adequate programming
model. That's not to say that it isn't a useful technique. It just
shouldn't be exposed to the application programmers. It should be somewhere under the
hood used by experts as it was originally, by the way.
You know, I mean, this is 40-year-old technology, and it was -- has been used for many
years by operating systems people. And it's really a relatively recent phenomenon for it
to get exposed to the programmer at the programming language level.
So I'm not going to talk about this today. I'm going to talk instead about how -- well,
some people have reacted to this case by saying, well, you know, there's this classical
debate between shared memory and message passing. So if you're opposed to shared
memory you must be in favor of message passing.
And the story is it's not that simple. Okay. So I think that the real issue is that whatever
concurrency model you use you need more structure and more discipline than what is
typically provided by the programming infrastructure today.
And so what I'm really going to try to make a case for is something that I think is -- you
know, ought to be innately appealing to most computer scientists, which is that
constraints on programmers are often even more important than expressiveness. That,
you know, if you think about it, you know, is assembly code more expressive than
high-level programming languages? Well, yeah, sure. But it's much harder to use in part
because it's more expressive. Right? Gives you more direct control of the hardware.
So this isn't a really radical new idea, but I'm going to try to pick on some technologies
that are pretty popular today in message passing, and in particular I'm going to focus on
two problems that arise with message passing systems.
One of them is a rather subtle one, which is buffer management in message passing
systems. And I'm going to try to convince you that, you know, the existing message
passing libraries really do an abysmal job on this and quite unnecessarily. That you could
actually -- you can actually do much better.
And then the second issue is a scalability issue for message passing programs. There's
quite a few other things that I could talk about with message passing programs, but given
a limited amount of time, it's better to focus on a couple of specific issues.
So my whipping boy today is going to be MPI, not, you know -- not because -- well,
mostly because I think it is one of the more popular message passing libraries out there.
It's pretty widely used in the scientific community. It's been around for a while. And
it's -- it runs on quite a variety of different platforms.
So I'm going to use it as a -- I'm going to pick apart the flaws that I see in this, and
they're, I think, illustrative of similar flaws that you see in a lot of comparable libraries.
So for those of you who aren't familiar with MPI, here is a snippet of a program in C.
And this kind of shows how you use MPI in C. You call an initialization method, no big
deal there. You then find out which process is actually running.
So what happens is that this code will actually get -- the same code will get executed on a
whole bunch of different machines. Okay? And you ask the infrastructure which -- who
am I, right? What is my rank.
So this procedure will write into this variable rank an identity for the particular process
that is running. Okay?
You can also ask the infrastructure how many processes are running. It will write into
this variable size the number of processes that are running.
And then you can run the same code in multiple processes, or you can run separate code
in the distinct processes. So the way that you do that is if you want to run separate code
in distinct processes, you just put the code you want to run inside of an if block that's
guarded by the rank that you would like to have execute that process.
Okay. So you could have code for one process in here, code for another process in here.
And everything is collected in one source file. You compile it and it runs on multiple
processors.
So that's kind of the anatomy of an MPI program in C. You can do a similar thing in
Fortran.
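The rank-based dispatch just described can be sketched without the MPI runtime at all. Below is a plain-C sketch of the SPMD pattern -- the role names and rank assignments are illustrative assumptions, not the talk's example; in a real program, MPI_Comm_rank fills in the rank and the dispatch is an if/else ladder in main:

```c
#include <assert.h>
#include <string.h>

/* In a real MPI program, MPI_Comm_rank fills in `rank` and this
   dispatch appears as an if/else ladder in main; here it is factored
   into a function so the pattern can be exercised without a runtime.
   The role names are illustrative, not from the talk's example. */
const char *code_for_rank(int rank) {
    if (rank == 0) return "control";     /* code for one process here */
    if (rank == 1) return "select";      /* code for another process here */
    return "source";                     /* all remaining ranks share this code */
}
```

Compiling one source file and running it at every rank, as MPI does, amounts to every process evaluating this dispatch with its own rank.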
Now, let's take a look at a fairly concrete process. Suppose that I have a process that I've
given an iconic picture of here. It's got two input streams, so these are sources of
messages. Messages are going to come in from other processes on these input ports.
And messages are going to come in also on this input port.
And these messages are going to determine the routing of messages from the sources to
the output. So in particular, these messages here will carry Boolean data. And if you
receive a message with value true, then you will look for a message on this message port
and route it to the output.
If you receive a message with value false, you would look for a message here and route it
to the output port. Okay? So that's an example of a process that you might run. This
would be typically a piece of a much bigger sorting algorithm.
For example, you might have a program that is, say, you know, collecting transaction
data from commodities markets around the world and sorting them by timestamp or
something like that. Right? And so you would have some logic that determines where
you should read the next input from. That would be a typical application that you could
scale up to very large scales.
Okay. So the way that this select process could be written -- first of all, this would sit
inside of an if block that would be guarded by the rank of the process that you want to
have run this select.
And then inside that if block, this is just going to run in an infinite loop here. And it first
reads from the control port with this first line, MPI receive. And then if the value of the
control is a Boolean true, or this is C, so it's a nonzero int, then you will read from the
port here.
Now, unfortunately in MPI you don't actually do this in quite such a modular way. The
reference here is not to a port but rather to the rank of the process that is the source of the
data.
So there's kind of a lack of modularity here. You can fix this with MPI; there are layers
that you can put on top of it that separate that out. But this particular code is not written in
a very modular way because this process is directly referring to the processes that are
sourcing the data.
So, in any case, this will read from data source 1, this will read from data source 2. And
then this will send to the output and then we go back to reading from the control port.
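The select logic just walked through can be sketched as a sequential loop over finite streams. This is a sketch of the routing behavior only, assuming the streams are given as arrays rather than MPI receives:

```c
#include <assert.h>

/* One run of the select process over finite streams: for each control
   token, route the next value from source 1 if the token is nonzero
   (C's "true"), otherwise the next value from source 2.  The caller
   guarantees each source holds enough tokens.  Returns output length. */
int select_run(const int *ctl, int n, const int *src1, const int *src2, int *out) {
    int i1 = 0, i2 = 0;              /* read positions in the two sources */
    for (int i = 0; i < n; i++)
        out[i] = ctl[i] ? src1[i1++] : src2[i2++];   /* route one message */
    return n;
}
```

With control stream {1,0,1}, source 1 holding {10,30} and source 2 holding {20}, the output is {10,20,30}.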
So a couple of points about this is that, you know, this reference to the rank of the source
or destination process is kind of a lack of modularity here because you can't take this
code now and, you know, use it somewhere else in some other context according to its
logic without changing it.
And also there's no particular reason why this code should have any reference to the data
types of the messages that are being routed through here, but it does. It, in fact -- there
really isn't a convenient way in MPI to make this polymorphic, at least not in a type-safe
way.
So those are a couple of modularity problems that arise. But those are not going to be my
focus. The modularity problems are mostly not going to be my focus.
Instead I'm going to talk about the buffer management. Okay? So MPI describes its send
procedure as a blocking send. So what they mean by that is that it does not return until
the memory storing the value to be sent can be safely overwritten. Okay? So if I back up
to this send process here, I make a reference to the data value that I want to send right
here by providing a pointer to the value that I read at the input. Okay?
When that can be safely overwritten, this method will return, which enables me then to
go back and redo this loop and consequently overwrite that location. So that's what they
mean by a blocking send in this protocol. Okay.
Now, the MPI standard allows implementations to either copy the data into some system
buffer for later delivery to the receiver or to rendezvous with the receiving process and
return only after the receiver has become -- begun receiving the data. Okay?
So when you use MPI send, a particular implementation of MPI could deliver either of
these two behaviors. Okay? Now, that's a problem. Because this leads to programs that
behave differently under different implementations of MPI. Okay? So in particular let
me show you such a program. Okay. So this is again -- this is fleshing out this select
process into something that actually looks more like a sorting application.
So I've got two processes, source 1 and source 2, that are producing data. So those might
be, for example, reading data from a database that has commodities exchange information
on it.
Then I've got a control process that is deciding how to merge these streams. So this -- the
logic of the sort happens in this control process. Okay? And these things are getting
merged.
So the control process could be written like this. I receive data from the top input port,
then I receive data from the bottom input port. And then I drop into an infinite loop.
Okay? If some condition on the two data values is true, this is my sort criterion, okay,
then I will send a true value on my output and receive data from the other guy. Okay?
So I swap -- I alternate which one of these I look at -- look on depending on which of the
data values I'm going to route through the select. Okay?
So if my control logic decides to route a message from the top through, then it will
effectively discard this control message and read -- and wait for another message on this
input. Okay? If it decides instead to send through the message on the bottom, then it will
discard this message and go back and wait for a new message here. Okay? So that's
the -- that's the logic.
So you could build up -- you could scale this thing up. In fact, you could create a tree of
these things that could be arbitrarily large. And you would have a -- would be able to
parallelize a very large sorting application in a pretty -- I think a pretty natural way in
MPI.
However, the MPI standard in fact doesn't tell us what the behavior of this program is
going to be. Okay? So there's a couple of -- first one little trickiness here. The source
data here is getting routed to two different destinations, the selector and the controller.
Okay?
There's two ways you could do that in MPI. One way would be to alter -- to change your
source program, right, and have it write to one of these and then write to this guy. But
MPI programs tend to reference the processes that they're communicating with, which is
a lack of modularity, so that would be inelegant. Okay?
So instead let's just assume that we actually are willing to dedicate a process which, of
course, has its own costs, overhead, to doing the replication. So it will receive -- so these
little diamonds here are processes that will receive a message on the input, route it, send
it to the top and then send it to the bottom.
Now, the issue is that the process has to do that in some order. It's got to send to one and
then send to the other. Okay? But the order matters. In particular, under a rendezvous
semantics, which is allowed by the MPI standard, this program could deadlock. Okay?
Whereas with a buffered send it never deadlocks. There's no way for it to deadlock.
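The order-of-sends hazard can be made concrete with a toy scheduler over three scripted processes -- the fork, the control, and the select. This is a sketch of the semantics, not MPI code: channel capacity 0 stands in for rendezvous, capacity 1 for a buffered send, and the scripts are an assumed minimal reading of the example.

```c
#include <assert.h>

enum { W, R };                       /* write or read as next action */
typedef struct { int op, ch; } Act;
typedef struct { const Act *script; int len, pc; } Proc;

/* Channels: 0 = fork -> select (data), 1 = fork -> control,
             2 = control -> select (control input).
   The fork sends to the select's data port BEFORE sending to the
   control process -- the problematic order from the example. */
static const Act FORK_P[]    = {{W,0},{W,1}};
static const Act CONTROL_P[] = {{R,1},{W,2}};
static const Act SELECT_P[]  = {{R,2},{R,0}};

/* Run the network with every channel given capacity `cap`.
   cap == 0 models rendezvous: a send completes only when some other
   process's next action is a matching receive.  Returns 1 if all
   processes finish their scripts, 0 on deadlock. */
int run_network(int cap) {
    Proc p[3] = {{FORK_P,2,0},{CONTROL_P,2,0},{SELECT_P,2,0}};
    int qlen[3] = {0,0,0};           /* tokens buffered per channel */
    for (;;) {
        int progressed = 0, alldone = 1;
        for (int i = 0; i < 3; i++) {
            if (p[i].pc >= p[i].len) continue;
            alldone = 0;
            Act a = p[i].script[p[i].pc];
            if (a.op == R && qlen[a.ch] > 0) {           /* buffered receive */
                qlen[a.ch]--; p[i].pc++; progressed = 1;
            } else if (a.op == W && qlen[a.ch] < cap) {  /* buffered send */
                qlen[a.ch]++; p[i].pc++; progressed = 1;
            } else if (a.op == W && cap == 0) {          /* try to rendezvous */
                for (int j = 0; j < 3; j++) {
                    if (j == i || p[j].pc >= p[j].len) continue;
                    Act b = p[j].script[p[j].pc];
                    if (b.op == R && b.ch == a.ch) {     /* matched: both move */
                        p[i].pc++; p[j].pc++; progressed = 1; break;
                    }
                }
            }
        }
        if (alldone) return 1;
        if (!progressed) return 0;   /* nobody can move: deadlock */
    }
}
```

Swapping the fork's two sends lets the rendezvous version complete as well, which is exactly the sense in which the order matters.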
So there's kind of a fundamental problem here. And the irony is that according to the
MPI standard document, the reason that they allow these two different implementations is
to get portable programs -- yeah.
>>: Is that the real reason or was there some politics where one company insisted on its
[inaudible]?
>> Edward Lee: I think that this is a veiled statement that there was politics. So I suspect
that that's what was going on, that, you know, there were two different implementations
that people were very wedded to, and so they decided not to make the commitment. But
they missed the fact that a consequence of this lack of commitment is that the programs
are distinctly nonportable, except, I mean, in the sense they run on both platforms.
Right? It's just they don't do the same thing.
And I think that's kind of a pretty serious flaw. All right? So that's kind of a problem.
But -- so how could we fix this problem? Well, you can force a buffered send. So to
avoid the deadlock, don't allow it to do the rendezvous. Okay? So the way to do that is
to use a different method, MPI_Bsend, for buffered send.
But if you do this, then you have to worry about buffer space. Okay? So how do you
worry about buffer space? Well, here's what it says in the MPI standard: a buffered
send operation that cannot complete because of a lack of buffer space is erroneous. When
such a situation is detected, an error is signaled that may cause the program to terminate
abnormally.
On the other hand, a standard send operation that cannot complete because of lack of
buffer space will merely block, waiting for buffer space to become available or for a
matching receive to be posted. This behavior is preferable in many situations.
Most of us would probably agree with that last statement, but unfortunately we can't use
the ordinary send because if we do we have a nonportable program.
So we're stuck using this and we have to worry about whether this overflow is going to
occur.
So how easy is it to manage buffer space in MPI? Well, unfortunately it's almost
impossible. There are procedures in the standard. You can do MPI_Buffer_attach,
which associates a buffer with a process, but the MPI standard doesn't limit the buffering
to the specified buffers, okay?
And if the MPI send procedure returns an error, what should you do? What should the
process do, in fact? It's really hard to know what your options are.
So a key problem that I identify here is that MPI doesn't provide much in the way of
mechanisms to exercise control over process scheduling. Okay? And process
scheduling is intrinsically tied to buffer management. The way to avoid buffer
overflow is to throttle back the execution of processes at the proper
times.
And the mechanisms that are provided in MPI are kind of a hodgepodge. In fact, they're
drawn from a completely different community. The barrier synchronization mechanism
is really kind of a brute-force mechanism that is rather difficult for programmers to use
in this context.
So let me zero in on -- yeah, was there a question?
>>: On the previous slide, when you do garbage collection -- essentially
you showed us some snippets of code where the process is executing, calling an MPI
[inaudible] receive call, right? But now if different processes can essentially block
waiting for a message to arrive in the buffer, how do you manage that there could be a
strong dependency that processes [inaudible] they want to read that message. So when is
a good time to garbage collect [inaudible]?
>> Edward Lee: Well, first of all, keep in mind, it's C and Fortran code that
you implement MPI programs in. So, yeah, I mean, garbage collection is done by the
programmer.
But in the case of buffer management, it's presumably done by the MPI infrastructure.
Okay? And, you know, that's in fact -- I'm going to address the subtleties that are
involved in trying to do that right. Okay?
But ideally the programmer wouldn't have to worry about when the buffer memory has to
be garbage collected. Right? So the only thing the programmer should have to worry
about is this statement that I pointed out earlier, which is that a send is a blocking send.
It doesn't return until the memory storing the value can be safely overwritten. Okay?
The receive is similarly a blocking receive in that it doesn't return until the memory that
you've -- the pointer that you've provided has been filled with received data.
So with those two pieces of information, you can figure out when to do your garbage
collection locally. But hopefully you don't have to do the garbage collection for the
buffering. In fact, MPI provides no mechanism to do that. There's no access to that
structure.
Okay. So what are the process scheduling subtleties? Well, there's a bunch of them.
Bounding the buffers, dealing with termination, deadlock or livelock, fairness issues,
exploiting parallelism, sharing data across processes, determinism. I'm just going to
focus on some of these, so in particular the buffer management one.
So if you want to do -- if you want to -- if you recognize that you have to do something
about process scheduling, what should you do. Okay? Well, one possibility is you could
say, well, I should be fair. Right? As long as my scheduling is fair, everyone will be
happy. Who could argue with that, right? I mean, it's motherhood and apple pie to be
fair.
So if I have a model like this, this is a set of processes communicating. And we can
ignore this for the moment and assume this is done in MPI. But in this case it's actually
not.
Now, suppose that -- this is the same select process. Now, suppose that the process that's
generating the control signal for the select happens to generate a constant stream of true
messages for some reason. Okay? Then what will the consequence of fair scheduling be.
Well, if you do fair scheduling, both of these processes will be given equal opportunity to
run. You can't really control buffer management. Buffers are going to overflow and
you're going to run out of memory. Okay? Because this guy's data is never going to get
consumed.
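The failure mode is easy to quantify with a sketch. Assuming a round-robin schedule where, each round, both sources fire once and the select consumes one always-true control token plus one data token, source 2's buffer grows linearly:

```c
#include <assert.h>

/* "Fair" round-robin schedule for the select example: each round both
   sources fire once and the select consumes one (always-true) control
   token plus one data token from source 1.  Returns how many tokens
   sit unread in source 2's buffer afterward -- it grows without bound. */
int unread_after(int rounds) {
    int q1 = 0, q2 = 0;              /* buffered tokens from each source */
    for (int r = 0; r < rounds; r++) {
        q1++; q2++;                  /* fairness: both sources get a turn */
        if (q1 > 0) q1--;            /* control is true: select reads source 1 */
    }
    return q2;                       /* nobody will ever consume these */
}
```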
Well, so fair scheduling doesn't really work. Okay? So, okay, people know that. So let's
do data-driven execution. Right? But if you do data-driven execution, right, here we
have processes -- and the data-driven principle is going to be simply that a process is
allowed to execute when it has input data. Okay?
These processes here, however, don't have any sources of input data. So presumably they
should always be allowed to execute. So that doesn't solve the problem.
Okay? So we should turn it around. And we should do demand-driven execution
because obviously data-driven execution didn't solve the problem. But unfortunately
demand-driven execution doesn't solve the problem either because I can just use the
converse of this selector, which is a Boolean switch process. And here I have a control
signal that's telling me whether a message from this process should be routed to the top
process or the bottom process. Okay?
Now, how do you manage demands? Well, you know, these guys are sinks of data. So
presumably they should be always demanding inputs. But if this guy happens to again
produce a sequence of constant trues, then this guy's demands can never be satisfied. So
how do you manage that process.
So let me show you that this problem cannot be solved strictly speaking by just saying,
well, you let that demand go unsatisfied. Right?
So suppose that I take my original program, which has a Boolean select here, so I've got
two sources of messages and a control process. And it just so happens that I connect up
an observer, a sink, to this guy. Okay?
Now, what is the correct behavior of this program, right? This guy's demands can in fact
be satisfied forever. Okay? However, if you satisfy them, you will overflow the buffers
and your program will crash. So should you satisfy them, okay?
It's not obvious that you should. I mean, I think most programmers would probably
prefer that the program not crash over having the program crash, but is not satisfying
those demands acceptable, okay?
So this is the kind of subtlety that in the MPI world is just thrown over the fence to the
application programmer and the application programmer has to worry about how to deal
with this. Okay? And my basic message is it shouldn't be the application programmer
that has to deal with this. This should be part of the underlying semantics of the
infrastructure.
Okay. So in particular in this implementation, which is a process network
implementation, this demand will not be satisfied, okay, if it will result in unbounded
growth of buffers.
Okay. And I'll explain that in a little more detail later.
So let's look at a couple of more subtle cases. Suppose that I have source processes
and a control process and I route messages to a guy who wants data from both of these,
but my constant source happens to deliver a sequence of constant trues. What is a correct
execution of this program?
So arguably it is correct for it to overflow the buffers, because it's kind of intrinsic in the
logic of the program that it overflows the buffers, whereas in the previous programs it
was not intrinsic in the logic of the program. Okay? Sorry?
>>: Or it can just do nothing because it doesn't -- we'll never get anything on the false
[inaudible].
>> Edward Lee: Or it could just do -->>: [inaudible].
>> Edward Lee: That's right. So it could just halt. When should it choose to halt?
Right? So one criterion would be when it runs out of memory, okay, which is the choice
that's made here is basically it runs until it runs out of memory.
What about a program like this where, you know, I've got a feedback that in fact will
never be satisfied. This process begins by waiting for a message here, okay, and it will
never get that because it won't get that until it produces its output.
Now, you could have this in a local part of the program, okay? Should that be able to
prevent execution of other parts of the program. All right. That's the livelock situation.
Arguably, I mean, again here, the solution in the infrastructure that I'm showing is to
overflow the buffer. This guy is able to produce an infinite stream of data. It will do so
until you run out of memory.
Okay. In fact -- yeah. So the point here is that naive strategies for scheduling all fail
here. Fair, demand driven, data driven, and most mixtures of demand and data driven
that have been out there in the literature for a long time.
And programmers that are building programs, nontrivial programs with MPI are having
to rediscover all the problems with these strategies and to try to solve them without
actually being able to exercise any control over process scheduling. So the games you
have to play are kind of extraordinary to try to get around the problems.
Okay. So my point is these problems that I outlined in fact have been solved. You could
argue -- you could quibble about whether the solution is correct or not. In fact, there's
been some debate in the literature about nuances of the choices and solutions.
But nonetheless, there are solutions that are defensible that have well-thought-through
semantics. Okay? So I'll give you a description of one which comes from my group
which is implemented in this Ptolemy II director that we call a process network director.
So in a process network director, every one of these boxes is backed by a thread, in fact.
This is using threads under the hood. Okay. These messages pass along the arcs. It does
support nondeterminism, so this particular process is happy to read from this upper path
or from this lower path.
Okay. And it will do so in a nondeterministic order. If a message becomes available on
one and there's no message on the other, it will read the one that becomes available.
Okay. And these are consumer processes.
The key thing is that buffer management has been dealt with in this way in a very
particular way that gives a nice, clean model to the programmer. You could argue about
whether it's the right model, but at least it's a clean model.
Okay. So here's the model. So we define a correct execution to be any execution in which,
after any finite amount of time, every signal, which is a sequence of messages, is a prefix of
the signal given by a denotational semantics of the program.
So in the denotational semantics of the program, the signal could be an infinite sequence
of messages. Okay?
We define a correct execution to be any execution that gives us a prefix, okay, of all of
those potentially infinite sequences.
We define a useful execution to be a correct execution that satisfies the following criteria:
That if you have a nonterminating program, then after any finite time a useful execution
will extend at least one stream of messages by at least one message.
Okay? So, in other words, it would be correct to stop immediately and do nothing, but it
won't be useful. Right? So because if you stop immediately and do nothing, your
sequences are all empty, which are in fact a prefix of the correct sequence, right?
Secondly, if a correct execution that satisfies criterion 1 exists and keeps buffers bounded,
then a useful execution will execute with bounded buffers. Okay?
So this says if you have a choice between overflowing buffers or not overflowing buffers,
don't overflow the buffers. Okay. But keep executing. Seems reasonable.
>>: [inaudible] least fixed points semantics?
>> Edward Lee: So this is due to Gilles Kahn way back in 1974. But he modeled these
kinds of process networks as functions that mapped streams of inputs to streams of
outputs. Okay? That would be for deterministic functions. He ruled out things like this
nondeterministic merge.
So there are functions that map streams into streams, okay, and he showed that those
functions, in fact, could be characterized using a very nice topological framework, okay,
as monotonic functions, meaning that if you provide two possible inputs to a process and
one is a prefix of the other, then the corresponding outputs will be a prefix of one
another. The first one will be a prefix of the second one. Okay.
So that -- the prefix relation provides an ordering on the streams. The function is
monotonic in the same sense as ordinary monotonic functions. Okay?
So then Kahn showed that any network of such processes could be described as a single
monotonic function and that that monotonic function had a single least solution, so a least
sequence of messages that satisfied all of the behaviors that were specified by the
functions.
Okay. So the least behavior could be infinite sequences. In fact, it typically is. For
programs that are live, the least fixed point contains infinite sequences.
For programs that deadlock, it's finite sequences. But that's the background. The point is
that it provides a clean denotational semantics. It's not an operational semantics, but it
defines for any such network what is the sequence of messages that is defined by the
program.
And it's an infinite sequence typically. So an operational semantics can only approximate
that with finite sequences.
So -- right. So that's a bit of a technicality. But the point is, you know, there's a clean,
well-defined semantics there. So this term correct execution is something that can be
made fully rigorous.
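The fixed-point construction can be illustrated with a tiny worked example. This is a sketch under an assumed network function f(s) = 1 : map(+1, s) -- a delay with initial token 1 feeding an increment in a loop -- whose least fixed point is the infinite stream 1, 2, 3, ..., approximated here by finite prefixes:

```c
#include <assert.h>

/* One application of the network function f(s) = 1 : map(+1, s):
   a delay with initial token 1 feeding an increment in a loop.
   `in` is a finite prefix of length n; writes f(in) to `out` and
   returns its length, always n + 1. */
int apply_f(const int *in, int n, int *out) {
    out[0] = 1;                      /* the delay's initial token */
    for (int i = 0; i < n; i++)
        out[i + 1] = in[i] + 1;      /* increment the fed-back prefix */
    return n + 1;
}

/* Kleene iteration from the empty stream: f(empty), f(f(empty)), ...
   Each approximation is a prefix of the next (monotonicity), and the
   limit is the least fixed point 1, 2, 3, ...  Fills out[0..k-1];
   assumes k <= 63. */
void lfp_prefix(int *out, int k) {
    int buf[64], n = 0;
    while (n < k) {
        n = apply_f(out, n, buf);    /* next, strictly longer approximation */
        for (int i = 0; i < n; i++) out[i] = buf[i];
    }
}
```

Each iterate is a prefix of the next, which is exactly the sense in which an operational semantics can only approximate the infinite least fixed point with finite sequences.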
Okay. Now, this second criterion, right, says that if a correct
execution exists that satisfies criterion 1 and executes with bounded buffers, then a
useful execution will execute with bounded buffers.
This is hard to satisfy, okay, for a very fundamental reason. Because it turns out that
even for a trivial programming model of this type, it's
undecidable whether a given program can execute with bounded buffers.
Okay. So, in fact, this was shown I believe first by my Ph.D. student Joe Buck in his
Ph.D. thesis. And he showed that if you just had the following four processes, so this is
simply a Boolean function, in fact, a NAND is sufficient.
So all the message data types are Boolean exclusively. So every message
is true or false. There are no other messages. So you can have a NAND that will receive a
message here and a message here and output the NAND of the two.
You have a delay which will output an initial Boolean and then subsequently behaves like
an identity function. So that's all it does. Output an initial Boolean, subsequently behave
like an identity.
And then you have a select and the switch and a fork. So there's actually five processes.
You need the fork as well. The fork is a function that has a single input port, two output
ports, and it simply replicates the messages on both output ports.
Okay. So with those five processes, you can build a universal Turing machine with only
Boolean data types. As a consequence, notice none of the processes actually use any
memory to speak of. So there's no issue of bounding the memory in the processes
themselves. The only issue of bounding memory has to be in the buffers.
For Turing machines, it's undecidable whether a program uses bounded memory.
Therefore, the bounded buffer problem is undecidable for these networks. So is the
deadlock problem. Right? It's undecidable whether a particular network of these things
will provide infinite sequences or not.
So that creates a bit of a conundrum, right? Because now every MPI programmer has to
solve an undecidable problem, okay, in order to guarantee that their program will not
overflow the buffers at some point.
So how do you satisfy an undecidable problem? Well, solution -- one solution was given
by another Ph.D. student of mine, Tom Parks, who solves the undecidable problem in a
very trivial way. Okay? He says -- he said in his thesis start with an arbitrary bound on
the capacity of all the buffers and execute as much as you can. Okay?
If your program deadlocks because of the bounded buffers -- so a program will deadlock
when a process tries to write to a buffer and it's full, okay, and if all the buffers are
either full or empty and processes are blocked on either full or empty buffers,
then you've deadlocked.
By the way, in an MPI program, there's no way to tell when you've reached that state,
okay, except that an exception occurs, right, and your program crashes.
All right. So what you do is if a deadlock occurs and at least one actor -- one process is
blocked on a write, increase the capacity of at least one buffer to unblock at least one
write. Okay? That was his strategy. That strategy can be improved a lot. Right?
There's lots of things you might do that would be smarter, but that strategy is sufficient.
And then you just continue executing, repeatedly checking for deadlock.
So Tom Parks proved that this delivered a useful execution in that for all programs that
could execute with bounded buffers this would execute with bounded buffers.
So is this a contradiction? Is solving an undecidable problem impossible? Well, not
really if you're willing to take forever to solve it, right? And that's what this does. Okay?
This doesn't deliver an answer of whether the program executes with bounded buffers in
bounded time. Okay? So it doesn't -- it's not a contradiction with this being undecidable.
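Parks' policy can be sketched on a toy two-process network: a writer that must push two tokens into one channel before touching another, and a reader that looks at the second channel first. Under a one-deep bound this reaches an artificial deadlock, which the policy resolves by growing the full buffer. This is a sketch of the policy on assumed scripts, not Parks' implementation:

```c
#include <assert.h>

enum { WR, RD };                     /* write or read as next action */
typedef struct { int op, ch; } Step;
typedef struct { const Step *script; int len, pc; } Task;

/* Writer must push two tokens into channel 0 before touching channel 1;
   reader looks at channel 1 first.  With one-deep buffers this reaches
   an artificial deadlock that Parks' policy resolves by growing ch 0. */
static const Step WRITER[] = {{WR,0},{WR,0},{WR,1}};
static const Step READER[] = {{RD,1},{RD,0},{RD,0}};

/* Run under the current bounds; on a deadlock where some process is
   blocked writing a full buffer, grow that buffer by one and continue.
   Returns 1 on completion, 0 on a true (read-only) deadlock; final
   capacities are left in cap[]. */
int parks_run(int cap[2]) {
    Task t[2] = {{WRITER,3,0},{READER,3,0}};
    int qlen[2] = {0,0};
    for (;;) {
        int progressed = 0, alldone = 1, wblocked = -1;
        for (int i = 0; i < 2; i++) {
            if (t[i].pc >= t[i].len) continue;
            alldone = 0;
            Step s = t[i].script[t[i].pc];
            if (s.op == RD && qlen[s.ch] > 0) {
                qlen[s.ch]--; t[i].pc++; progressed = 1;
            } else if (s.op == WR && qlen[s.ch] < cap[s.ch]) {
                qlen[s.ch]++; t[i].pc++; progressed = 1;
            } else if (s.op == WR) {
                wblocked = s.ch;             /* blocked on a full buffer */
            }
        }
        if (alldone) return 1;
        if (!progressed) {
            if (wblocked < 0) return 0;      /* real deadlock: nobody write-blocked */
            cap[wblocked]++;                 /* Parks: relax one bound, retry */
        }
    }
}
```

Starting from capacities {1,1}, the run completes after the policy has grown channel 0's bound to 2, and the bounds never grow again.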
Yeah.
>>: So it has one nondeterministic choice on which buffer can [inaudible] increase. Is
that material in terms of figuring out whether it's going to actually be following the
bounded execution?
>> Edward Lee: There -- if there is one bounded execution, there are many bounded
executions. Right? Because if you're not concerned about using the minimum number of
buffers, okay, then it's not an issue.
In fact, Parks also gives a variant of this where he says, you know, if you detect a
deadlock, increase all the buffer sizes by one. That works too. Okay. There's many
choices. But, you know, these don't minimize the size of the buffers. So he wasn't
worried about the optimization problem.
There are people who have followed on, so Geilen and Basten, for example, are
concerned about the optimization problem. And they're also concerned about behavior
for programs that in fact require unbounded buffers. And so they've delivered some
improvements on these.
Okay. Now, this strategy is pretty simple, but if you try to implement this in MPI, it's
actually extraordinarily difficult. Okay. You have to get under the hood in MPI and start
messing with the MPI engine in order to implement this.
Okay. Now, for some cases it might not be good enough to solve this problem in this
way when you don't get a conclusive answer, right, about whether the program will, in
fact, execute forever. So if you have embedded applications or safety critical
applications, you're going to want to prove that bounded buffers are sufficient and that
your particular buffer bounds are sufficient.
So you can do that. There's formalisms that have been developed over the years that do
this. There's actually quite a lot of them, many variants of this. One of the simplest ones
was one that I did in my Ph.D. thesis way back in the previous millennium that we called
synchronous dataflow.
And the idea here was very simple, which was that a process now was described in terms
of finite chunks of computation that we called firings, the term borrowed from the
dataflow world. And for each firing a process would produce a fixed and known number
of messages. So it would produce P messages or it would consume C messages. Okay?
So if you have a network of such processes, it is in fact decidable
whether there is a bounded buffer execution. And in fact you can formulate and solve the
buffer optimization problem to find the minimum buffer sizes.
It turns out not to be trivial. In fact, minimizing the buffer size is NP-hard, even with this
rather trivial model. Okay? But nonetheless, it's decidable.
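The decision procedure amounts to solving the so-called balance equations over the production and consumption rates. A minimal sketch, assuming a connected graph (the function name and encoding are mine, not from any particular SDF tool):

```python
from fractions import Fraction
from math import lcm

def repetition_vector(n, edges):
    """Solve r[src]*produced == r[dst]*consumed for the smallest positive
    integer firing counts of a connected SDF graph with n actors.
    Each edge is (src, dst, produced, consumed).  Returns None when the
    rates are inconsistent, i.e. no bounded-buffer execution exists.
    """
    r = [None] * n
    r[0] = Fraction(1)            # fix one actor's rate; propagate the rest
    changed = True
    while changed:
        changed = False
        for s, d, p, c in edges:
            if r[s] is not None and r[d] is None:
                r[d] = r[s] * p / c
                changed = True
            elif r[d] is not None and r[s] is None:
                r[s] = r[d] * c / p
                changed = True
    for s, d, p, c in edges:      # every arc must balance exactly
        if r[s] * p != r[d] * c:
            return None
    scale = lcm(*(x.denominator for x in r))
    return [int(x * scale) for x in r]

# Two actors: the first produces 2 tokens per firing, the second consumes 3.
print(repetition_vector(2, [(0, 1, 2, 3)]))   # [3, 2]
```

If the equations have a positive solution, firing each actor that many times returns every buffer to its initial state, so the schedule can repeat forever in bounded memory.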
So the tradeoff here is that we limit expressiveness by constraining the number of tokens
or messages that are produced and consumed, and this of course rules out the Boolean
switch and the Boolean select. Because this guy will -- may or may not consume
messages on these inputs depending on the Boolean values. So those are not in the class
of synchronous dataflow programs.
>>: Discard the token but it didn't consume it?
>> Edward Lee: Well, that would be -- you could create a variant of this that we call a
multiplexer that reads from both inputs and discards one. And that one is synchronous
dataflow. Yeah.
So I think I'll skip over the formalism behind this because it's well documented. The
point I'd like to make is that it gives us a decidable model, okay, and, moreover, not only
can you bound the buffers, you can actually statically do load balancing as well if you
know something about execution times of these firing functions.
And there's been a bunch of work that has been done on this, most of which is quite old.
This stuff dates from the last time parallel computing was popular. Okay? So it's, you
know -- in fact, here is a screen image from 1990 of a graphical dataflow programming
environment that we built in my group where these were processes that were being
synthesized into parallel assembly code for multiprocessor DSP systems for embedded
applications.
Okay. These are -- there was a great deal of emphasis on optimization in this work. So
the horizontal axis is showing time in instruction cycles. And these are firings. So the
typical firings of these actors were like six instruction cycles. Okay? The interprocessor
communication was occurring in two instruction cycles.
Okay. So a lot of really low-level optimization being done. These were aimed at high
throughput applications at the time, like video processing applications which were using
multiprocessor DSPs at that time. So that's 1990. That was quite a while ago.
So the point is that this synchronous dataflow model makes things decidable but there's
very interesting and very complex optimization problems there. In fact, it was a rich
source, a mother lode of Ph.D. thesis topics for people for quite some time.
And so to give you a sense of this, this is from Shuvra Bhattacharyya's Ph.D. thesis where
he really focused on the buffer optimization problem. Here is a very simple model with a
very real-world application. This has got only six processes, and all it's doing is
converting compact disc data to digital audiotape data.
So compact disc data is sampled at 44.1 kilosamples per second. And digital audiotape is
sampled at 48 kilosamples per second, which, by the way, was done deliberately to try to
prevent copying of compact discs, right? The industry thought they could prevent people
from pirating compact discs by making it very difficult to do this conversion. Okay?
So this does this conversion in a sequence of stages using finite -- using synchronous
dataflow models. So this guy will consume two messages and produce one. This guy
will consume two and produce three. This guy will consume eight and produce seven.
This will produce -- consume five and produce seven.
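Taking the rates as quoted, the balance equations pin down how often each stage must fire in one cycle of the schedule. A quick check (assuming the source and sink move one sample per firing):

```python
from fractions import Fraction
from math import lcm

# Arc rates along the chain: (tokens produced upstream, consumed downstream).
# source -> stage1 -> stage2 -> stage3 -> stage4 -> sink, rates as quoted.
arcs = [(1, 2), (1, 2), (3, 8), (7, 5), (7, 1)]

r = [Fraction(1)]                 # firing count of the source, up to scale
for p, c in arcs:
    r.append(r[-1] * p / c)       # balance equation: r_dst = r_src * p / c
scale = lcm(*(x.denominator for x in r))
counts = [int(x * scale) for x in r]
print(counts)   # [160, 80, 40, 15, 21, 147]
```

The source fires 160 times per 147 sink firings -- exactly the 160:147 ratio between the 48 and 44.1 kilosample rates, which is what makes the minimal schedule so long and irregular.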
The schedule that minimizes the buffer sizes is shown here. Okay? And it -- as near as I
can tell, it's completely chaotic. There's -- I can't -- you can't see any pattern in this.
And Shuvra did some very nice work, where he did a tradeoff of, you know -- given that
you would like to have compact representations of the schedule, what's the best
you can do with buffer minimization.
Okay. So synchronous dataflow, however, by itself is too restrictive. It's just a very limited
model of computation. But fortunately there's been a whole bunch of things that have
extended it with much more expressive structures, all of them striving to maintain
decidability but enriching what you can describe in the programming model.
The most recent of these is Bill Thies' and Saman Amarasinghe's work on StreamIt,
which has done some very nice work enriching the expressiveness of synchronous
dataflow models and mapping them onto parallel machines.
Okay. So I think --
>>: You don't have the permission [inaudible].
>> Edward Lee: Okay, well I don't want to try people's patience, and I in fact anticipated
that I probably wouldn't have too much time to talk about this second part, but I think I
can just sort of allude to some of the key issues in the second part and not run over. Yes.
>>: [inaudible] the complexity of your previous example due to the fact that they chose
such crazy ratios of sampling rates?
>> Edward Lee: Well, this is -- I mean, the fact that this schedule has this weird structure
is a consequence of these, you know, wacky sample rates. And, in fact, you know, as I
said, they chose this ratio deliberately to give you wacky sample rates.
And it turns out that you really -- if you want to do this conversion efficiently, you have
to do it in these seven stages -- or it's four stages. These are polyphase, multirate, finite
impulse response filters. And if you try to do it in one step, it requires a vast amount of
memory. And it's extremely slow. Right? You have to break it down into these stages if
you want to be able to do it in a reasonable amount of memory.
So, yeah, so in a way, I mean, this is kind of an extreme example, right? But it illustrates
the complexity intrinsic in the problem even though most applications don't suffer from
this complexity.
Okay. So I'd like to just talk briefly about scaling, because when I -- you know, I've done
a bunch of criticizing of MPI, but there's one aspect of MPI that I think is really quite
positive and I think is a good starting point and, in fact, you know, can be enriched
considerably. And this is the collective operations in MPI.
So the idea of the collective operations in MPI is to codify patterns of design that are
commonly used. So there's a bunch of them. In fact, I think this is a pretty
comprehensive list of what MPI provides in the way of collective operations. So they're,
you know, scatter-gather kinds of operations, things of that nature.
So, you know, just to illustrate some of these, these can be specified very compactly in
MPI. So if you have a piece of data existing on one process and you would like to
broadcast it to all six processes, there's a broadcast operation. You don't have to send it
individually to each one. Okay. That's a simple collective operation.
So in pictures, a broadcast in this visual syntax might look like this. This is this fork that
will replicate messages on the output. Okay?
This is kind of an awkward and not very scalable representation, however. So a nicer
representation is something like this where the number of destinations is parameterized,
okay, and is represented in a scalable way by a single icon.
A gather/scatter is kind of a little bit more interesting than a broadcast. So here we have
six data items on one accessible or created by one process and we want to scatter them to
six processes. Okay. So that's the scatter.
The gather takes one data item produced by each process and collects them into an array
on a single -- accessible to a single process.
So these are, again, things that can be expressed very compactly in MPI. And here's in
pictures similar mechanisms. So the scatter/gather looks like this: we have this
process that we call a distributor, which simply reads a message and then in a round-robin
fashion for each message sends it to one of N output channels. Okay.
And, again, this can be compacted into a scalable representation. And then
correspondingly there's a commutator there that will in a round-robin fashion read from
multiple input channels and gather the data.
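As a sketch, the round-robin distributor and commutator are easy to state over finite streams. Here lists stand in for channels and the function names are mine; a real implementation would stream messages rather than materialize them:

```python
from itertools import cycle

def distribute(stream, n):
    """Round-robin scatter: deal messages from one stream to n channels."""
    channels = [[] for _ in range(n)]
    for ch, msg in zip(cycle(range(n)), stream):
        channels[ch].append(msg)
    return channels

def commutate(channels):
    """Round-robin gather: interleave the channels back into one stream."""
    out = []
    for group in zip(*channels):   # one message from each channel per round
        out.extend(group)
    return out

parts = distribute(list(range(6)), 3)
print(parts)               # [[0, 3], [1, 4], [2, 5]]
print(commutate(parts))    # [0, 1, 2, 3, 4, 5]
```

Because both ends use the same fixed round-robin order, commutate is the exact inverse of distribute, which is what lets the pair be collapsed into a single scalable icon.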
Gather to all is kind of an interesting one. Here you have six data items produced
separately by six different processes, and you'd like all six data items accessible to all six
processes.
So I wanted to show a concrete example of this which -- let's see. I didn't -- so this is --
let me show you in pictures first. This is a very simple problem, a 3D gravitational
simulation of n-bodies. Okay? So these blue blobs are bodies in space that are exerting
gravitational force on each other. Okay?
And so a straightforward implementation of this looks at all pair-wise combinations of
bodies. Basically what you want to do for each body, you want to find the net
gravitational force.
So to do that you want to find its distance to all the other bodies, okay, and then scale that
distance by Newton's -- use Newton's gravitational law to figure out what the force is,
and then you run a simple numerical integration scheme to implement F equals MA to
move the body according to the force. Right? Fairly straightforward thing to do
conceptually. Okay.
So the idea here is that if you have the positions of the n-bodies, okay, where each
position is a vector, XYZ. And say you have six bodies, okay? Then one way to
implement this in parallel is you have a copy of these positions and what you want to do
is calculate Euclidean distances, okay, for each of these.
Then once you do that, you get a net force which you can then apply into F equals MA,
so here's a very simple, naive numerical simulation, simple forward Euler numerical
simulation that solves this problem.
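A minimal sketch of that scheme follows. The constants G and dt and the data layout are illustrative choices, and there's no softening term, so this is strictly a toy:

```python
def nbody_step(pos, vel, mass, dt=0.01, G=1.0):
    """One forward-Euler step: pairwise Newtonian gravity, then F = m*a."""
    n = len(pos)
    acc = []
    for i in range(n):
        ax = ay = az = 0.0
        for j in range(n):                      # all pairwise combinations
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dz = pos[j][2] - pos[i][2]
            r = (dx * dx + dy * dy + dz * dz) ** 0.5
            a = G * mass[j] / (r * r)           # Newton's law, per unit mass
            ax += a * dx / r                    # project onto the unit vector
            ay += a * dy / r
            az += a * dz / r
        acc.append((ax, ay, az))
    # forward Euler: positions advance on the old velocities
    new_pos = [[p[k] + dt * v[k] for k in range(3)] for p, v in zip(pos, vel)]
    new_vel = [[v[k] + dt * a[k] for k in range(3)] for v, a in zip(vel, acc)]
    return new_pos, new_vel

# Two unit masses at rest: they accelerate symmetrically toward each other.
pos, vel = nbody_step([[-1, 0, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]], [1, 1])
```

The inner double loop is the all-pairs structure that the gather-to-all pattern parallelizes: each process owns some bodies but needs every body's position each step.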
So let me show you -- so here's an implementation of this in Ptolemy II. Whoops. My
mouse isn't working. So here I have a parameter that specifies initial positions of a set of
bodies, each position is a vector with three elements. Whoops. So you can see there's
three numbers here and then a curly brace, and there's another three numbers, and this is a
rather long list. There's a bunch of initial bodies.
Specify initial velocities for each of these. Specify a number of bodies, which is actually
calculated from these guys, and then we have a simple feedback loop that implements the
gather-to-all pattern on a parameterized number of models of the body. Okay.
So if I look inside of here, this is an implementation at a fairly low-level dataflow way of
this numerical simulation. So it calculates the Euclidean distances, finds the average of
the Euclidean distances, and then this is the forward Euler numerical integration scheme.
Okay. So that's the structure of this. And I can execute this and it gives me a 3D
animation from some starting position that I can rotate around to see these bodies exerting
forces on each other. Occasionally you see one will get whipped around by coming close
to another guy. Okay.
So the idea here is that this is, you know, kind of a scalable representation that codifies
a particular pattern of interaction between processes. Codifying these patterns is
useful. Okay. And I think that MPI has done that for a set of patterns, but the problem is
what if the pattern you want isn't there. Okay.
So in particular here is this pattern that I showed before of, you know, sorting data from
multiple data sources. That's not an MPI collective operation. MapReduce, which is I
think probably familiar to everybody, is not an MPI collective operation. But it's a pretty
useful pattern.
Recursive constructs, here's actually an implementation -- this is 15 years old at least in
Ptolemy Classic of an FFT using recursion. So this process is a recursive reference to the
network of processes that contains it. Okay. So internally the implementation of this is
actually this same network.
Okay. So this is a complete description of a radix-2 FFT, where this is the switch process,
this is simply a repeater, this is a complex exponential constant generator, the twiddle
factors. Okay. A complex multiplier and a repeater. And that describes an FFT using
recursion.
Also not provided by MPI, dynamically instantiated processes, which have been around
for a while in this world.
So how do we get this idea, these collective operations, but without being limited to a
particular fixed library of them? Well, the idea is pretty simple. We can borrow the
concept from functional programming and just say, well, we need higher-order components
in these models. Okay. Just like in functional languages, these patterns are represented
as combinators, which are easy to write and extend in the functional language itself.
Okay. If you provide similar kinds of mechanisms for networks of processes, this is one
simple example of that. This is a particular combinator that is in fact -- is described
internally as a higher-order component. It just takes whatever is inside and replicates it
some number of times and implements this gather-to-all communication pattern. Okay.
But it's described as a higher-order component.
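In a functional host language the idea is nearly a one-liner. A toy sketch, where the names are mine and plain functions stand in for processes:

```python
def gather_to_all(component, n):
    """Higher-order component: replicate `component` n ways, feeding every
    replica the gathered outputs of all n replicas from the previous round."""
    def network(state):
        # every replica receives the full gathered state (gather-to-all)
        return [component(i, state) for i in range(n)]
    return network

# Each replica sums everyone's previous output and adds its own index.
net = gather_to_all(lambda i, xs: sum(xs) + i, 3)
print(net([1, 1, 1]))   # [3, 4, 5]
```

The point is that the communication pattern is an ordinary value built by a combinator, so users can write new "collective operations" without touching a fixed library, in the same way functional programmers write new combinators.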
I think there's a lot of work to be done on this front. I think we need better language
support for these kinds of structures, right, much like, you know, the transition from C to
C++ gave us really nice support for object-oriented programming. But the component
architectures here are a little different from object-oriented, right?
The interaction between components is by sending messages, not by calling procedures.
And I think that it ought to be possible to provide language support in the similar fashion
to the way C++ provided support for object-oriented patterns. Et cetera.
So just a couple of final comments. So I think that the message I want to convey really is
that message passing requires more discipline than what you see in today's popular
message-passing libraries. And we shouldn't be asking application programmers to
rediscover and resolve problems that have been around for 20 years.
We should instead be providing infrastructure that solves those problems for the
application programmers and that there are certainly cases where we've missed
opportunities to do that.
And then the second part is that I do believe that this style of programming can scale up
very nicely to very big problems, but I think we have some work to do and the single best
idea that I have is to broaden this idea of collective operations borrowing the notion of
higher-order functions from functional languages and creating these higher-order
components to describe these collective operations.
And then finally I just wanted to acknowledge my group who has made lots of
contributions to the things that I'm talking about, and so that's a snapshot of the people
who were working in my group as of two months ago.
So thanks very much.
[applause]
>> Jim Larus: Questions?
>>: So you talked a lot about MPI, which is a very old and relatively specialized
message-passing system. Are there other more recent systems that you think were
interesting you could talk about relative to [inaudible] properties?
>> Edward Lee: Yes, there are. There's quite a lot out there, actually. Most of them -- I
mean, part of the reason I picked MPI is that it seems to have a great deal more visibility
than almost anything else out there in the message-passing world.
Some of my favorites in the message-passing world actually come from computer music
applications. So, for example, there's a group of people in Lyon that have
created a programming language called FAUST for doing computer music.
And it's basically -- it's a very domain-specific language, but it's an extremely elegant and
efficient message-passing framework. And it also -- it has this notion of higher-order
components in a rather nice way.
So there's a few things out there. They tend to be -- I mean, a lot of them are niche
things. They're in domain-specific worlds.
I'm -- you know, I'm open -- did you have any one particular in mind?
>>: [inaudible] seems to be getting some visibility in general [inaudible].
>> Edward Lee: Yeah. Those are two excellent examples. Yeah, Erlang I think is also
quite an elegant solution. E is pretty different. I mean, this use of promises I think is
really quite intriguing, but it's a real twist on message passing. I think it's a very
interesting twist, but I'm not convinced that all of the problems have been worked out.
You know, I mean, I talked about the subtleties here of bounding buffers on simple
message-passing schemes. I don't think they -- as far as I know, in the world of promises
they haven't even gotten to asking the question much less answering it about this kind
of -- these kind of boundedness problems.
But I think there's probably good fodder for good Ph.D. theses there, because the idea is
really quite intriguing. Yeah.
>>: How do you do the higher-order components when you compare them to, say,
generative modeling? Do you see -- do you think there's some advantage of looking at it
from a functional approach?
>> Edward Lee: Actually, no. I'm not convinced there is. I think that generative
modeling -- I sort of put in the same category as a promising technology. It simply tends
to be a more imperative way of describing the structures you want as opposed to the
declarative way that someone coming at it from a functional programming view would
use.
But I'm agnostic about which is preferable. In fact, I've had students in my group who
have done prototypes using both kinds of techniques, both the generative approach with
an imperative style and the higher-order functions approach with a declarative style. And
I'm not convinced that one has a clear advantage over the other.
>> Jim Larus: Any other questions? Let's thank Ed again.
[applause]