>> Jim Larus: It's my pleasure today to introduce Guy Blelloch from Carnegie Mellon. Guy has
been doing parallel computing for a long time, at least as long as I've been in the field.
And he's told me that he never gave it up. Even when the rest of us went off and did something
different for a decade or so, he kept on working on it at least as a sideline.
So it's back, we're fortunate that he's back, and he's going to tell us about parallel
thinking. Guy.
>> Guy Blelloch: Okay. Is the microphone on? Can you hear me?
>> Jim Larus: Yes.
>> Guy Blelloch: Yes, so this work on parallel thinking is actually part of a Microsoft-sponsored
center for computational thinking. You might not know what this is, but Microsoft gave
Carnegie Mellon some money back when Jeannette Wing was department head. She was very
interested in this notion of computational thinking, which I'll describe in a moment.
But now Jeannette has actually gone off to head up CISE at NSF, and she's been putting a lot
of these ideas from computational thinking into CISE. And so a lot of the calls for proposals
coming out of NSF have that term in them, computational thinking.
So, background: what is computational thinking? Well, Jeannette's idea was that there are a lot of
core foundational ideas in computer science, in the way we think, which are different from the way
that people think in other fields, in other sciences.
And it's important to identify what those ideas are, for a few reasons. So these ideas are
things like abstraction: we somehow use abstraction more than most other fields do. We
develop our own abstractions all the time, whether it be the interface for a data structure or
whatever; we're developing them all the time.
And they're really a core part of the way we think in computer science.
Another one is recursion. This is an idea that's completely embedded in us. We don't even think
twice about it. In a lot of other fields, when you first see recursion, it's sort of baffling.
There are various other ideas, like reductions: that you can prove a problem hard by
reducing one problem to another.
There are lots of core ideas in computer science. The reason it's important to understand
them is, one, it's an educational thing: to develop curricula you want to understand what the
core ideas are so that you can emphasize those in your courses.
And a lot of the motivation for this came out of a general feeling, about five or ten years ago, not
just at Carnegie Mellon but in a lot of places, that what we were teaching, especially in high
schools but also in intro courses in colleges, was emphasizing Java programming or some
low-level hacking of some kind, instead of emphasizing what the core ideas of computer science are.
Another aspect of it is just a PR thing: we should make it clear to people outside of computer
science that computer science is not just Java programming; there are deep ideas in it, it's a real
science, and we should say what the ideas in that real science are, what the science behind
computer science is. That's what computational thinking is.
The third idea is that maybe you can take the way we think about problems and apply it to other
areas, and actually get innovations by doing that. I think Chris [inaudible] at Berkeley calls this the
computational lens. So if you look at certain problems through the computational lens, whether it
be a problem in economics, a problem in physics or whatever, you might have a different way of
looking at that problem and therefore actually get a better way of approaching it.
So anyway, that's what computational thinking is, and Microsoft was generous and gave us
money for that, and the project I was doing on parallel thinking actually came out of that.
So the idea, obviously, is that there's some core set of ideas in thinking about parallel
programming and developing parallel code, and the question is what those core ideas are and why
they might be useful.
Okay. You've all seen charts like this, I'm sure. This particular one, many of you have probably
seen this exact one is from Andrew Chen at Intel. It's showing the increase in cores over time. I
guess we're at 2009 now.
Nowadays when you buy a multi-core it's actually four or six cores; some of them have eight. In a
few years we're going to have 12 to 32. And I guess the most interesting point here is that 2015
is not that far away. It's six years away, and Intel at least is predicting you'll have 100 cores on
every chip in a laptop you buy.
I'm not sure how precise they're trying to be; it's sort of nebulous where it's placed. But that's the
idea. So the question is: How are we going to use all these cores and what are we going to do
with them?
And what are some of the problems in moving into a world where instead of a single core we have
multiple cores? Well, one is building the hardware. I actually think this is the easiest to deal with.
Intel, or the various chip manufacturers, are going to do it for us. Exactly what it's going to look
like still has to be worked out, and I haven't signed the appropriate nondisclosure, so I can't say
what the next generation is going to look like.
But that's moving ahead. Then there's developing good programming languages and parallel
extensions. I think there are a lot of efforts nowadays to do this, and there's some very nice
work, whether it be Cilk++ or various other language extensions.
I think it will be a while before these play out; there will be several of them out there, OpenMP and
others, and some of them will work better than others. Then there's developing good runtimes,
compilers and debuggers (I think developing good debuggers is an interesting field), and
developing good OS support.
Then, of course, there's just the fact that a lot of companies have a huge code base. I don't know
how many lines of code Microsoft has, but I assume it's very large. Even if only a small
fraction of it needs to be changed to deal with multi-core (because we can't just rely on the
frequency speeding up, as in the past; we actually now have to make use of those multiple
cores), clearly there will be a huge amount of work here. And that ties into the part that I actually
think is the hardest of all of this. The most likely thing that's going to create a failure here is not
any of these technologies.
I think a lot of good ideas were developed over the last 20 years in all of these areas. I think the
biggest problem right now is that most people out there, people that we've graduated, know
almost nothing about parallelism. And part of this is that, certainly at Carnegie Mellon,
we haven't been teaching parallelism for many years; only recently have we started teaching
some courses in parallelism. So all the students we've educated in the last 10 years know very
little about parallelism.
And this is changing, but it's going to take a while. If you don't have people who understand
parallelism, then it's going to be a problem, because they're not going to be able to use these
things, or these things won't be designed very well, because the people coding them or writing
the compilers maybe don't fully understand.
So I think there's a real issue here in getting people to understand all the issues of parallelism. A
lot of these things are things that as a community we understand, but we haven't educated our
students and our workforce.
So what I want to do is take an approach to parallel thinking that's similar to the idea of
computational thinking: it's not a bunch of library interfaces for parallelism. It's not the
OpenMP interface for parallelism, or the Cilk++ syntax for writing a parallel program, or the Intel
Threading Building Blocks. That's not what parallel thinking is. It's not a bunch of libraries or
interfaces.
It's really a set of core ideas. And the question is what those core ideas are, and then identifying
them and educating our students, and even people beyond students, on what these core ideas
are. So part of it is just identifying what they are.
Then I guess the other side of this is that I think if you really go back to the basics, parallelism
isn't that hard. And I'm going to put a little caveat on this on the next slide.
But parallelism, where parallelism means making use of multiple processors, if done right, isn't
hard: a lot of algorithms are naturally parallel. We just brainwash our students into thinking
sequentially; at least at Carnegie Mellon we do, and I think most other universities do too. We've
been doing this for 40 years.
And if we just brought in parallelism at a much more fundamental level, at an early part of
the curricula, we could actually avoid this problem.
Furthermore, if done right, with people who have an appropriate knowledge of parallel
programming, parallelism is really not that hard. At least certain aspects of it.
I'm going to put one caveat on the next slide here, which is this: I want to make a very clear
distinction between what I call parallelism and concurrency. These particular terms don't matter
that much; I think Jim was at a workshop last summer where we spent about five hours fighting
over whether these are the right terms or not.
What I want to do is make a distinction between parallelism being a property of the machine,
having multiple processors running, where the only purpose is to get speedup, since
you're now going to be balancing the load across multiple processors, versus
concurrency, which is a programming issue where you've actually got multiple interleaved
threads, and you're making use of these things which are running concurrently, with
nondeterministic interleavings.
So, for example, most operating systems have always been concurrent. We've been teaching
the bakery algorithm and locks for the last 40 years, and on machines which are completely
sequential. And it's not because we want to speed things up. It's because we have
multi-threaded processes: we've got multiple threads running, multiple processes running on our
machine, our sequential machine. They're interleaved. So these interleaved processes can
interact with each other.
We need locks, and we need concurrency, and it's running on a single processor. So that
concurrency has existed for many years; it has nothing to do with parallelism. It has to do with
interleaving of threads, nothing to do with hardware parallelism. And then we've got what I
call deterministic parallelism. I'm actually going to focus on this: the idea that you can have
sequential semantics, no nondeterminism, and have it run in parallel.
I claim this is actually the part that's not understood; if you look at curricula today, this is the
part that's least understood. Like I said, we teach the other part. I went through our curricula and
we cover the bakery algorithm in four different courses. We cover interleaving of instructions in
three different courses: the OS course, the distributed systems course and the introduction to
systems course.
This sort of stuff we don't cover anywhere. And then, of course, you can have both. You can
have things which are concurrent, where your environment might be nondeterministic (you've
got demands coming in on the fly and you have to process them), and you're also running on
parallel machines.
But I want to make a clear distinction here between concurrency and parallelism. My claim
is that parallelism, especially this deterministic parallelism, is much easier than people
think, if you're just trained properly. I want to go through this talk and give some examples
of why I think this.
Okay. So here I'm starting with Quicksort. This is the code from the book "Algorithms in Java" by
Robert Sedgewick, and it's a book that's used in a reasonable number of courses, introductory
courses and data structures.
And we can look at this, and it's what you might expect. It's got a while loop. It's got a couple
more while loops in here. So this is clearly extremely sequential. It does some swaps. We pick
a pivot, we find all the elements less than and greater than that pivot, swap between them, and do
a final swap to put the pivot in the right place.
And then we make a recursive call to Quicksort, and after that we make another call to
Quicksort. And so that's what you would expect. We teach Quicksort as a
sequential algorithm. And in fact this is extremely sequential, this part here.
Well, you know, if you're a computer scientist with experience, you'll probably realize that you
can call these two things in parallel. But that's not how we teach it. We just say you call one and
then you call the other.
So that's an extremely sequential program. Now, if we go back a few years, here's the
exact same Quicksort algorithm from Aho, Hopcroft and Ullman. This is a book
from the '70s.
I actually claim in some sense this is a much better piece of code, because it's at the right level
of abstraction: this is a parallel piece of code. My point being here is that many algorithms are
just inherently parallel, and if we just raised the level of abstraction, we would realize they're
parallel. So in this case there's a huge amount of parallelism in Quicksort, the fact that we can
make the two recursive calls.
What this actually says here is how we put the results together. It doesn't say we have to call
these sequentially. It doesn't say we first call one Quicksort and then the other. It says when we
put these things together, we follow the Quicksort of S1 by S2 and then the Quicksort of S3.
Furthermore, this here can be done in parallel. When we're picking the elements of
S1, all the elements less than the pivot, that can be done in parallel. We can compare our keys
to the pivot in parallel.
So the input set here might be of size N. We've got N of those comparisons to the pivot, and
they can be done in parallel, and we can collect the results up. So there's actually a huge
amount of parallelism here. Expressed at this level, it's already in the code. And if you
go back in time, you actually find that the earlier books described algorithms more in parallel
than they do now; over the years we've gone into more and more sequential thinking.
So I guess there are a couple of observations here. One is that there's a lot of natural parallelism
in algorithms, and if we describe them at the right level of abstraction, we would notice that it's
there. And furthermore, when we do describe them in courses, we really should point out that
it's there. We should show the students that this is parallel. We should tell them that they
can make the two recursive calls to Quicksort in parallel. We can tell them they can compare the
pivot to the keys in parallel.
So that's sort of a lost opportunity. And so, yes, the point being that there's a lot of inherent
parallelism. So actually this is Quicksort in a language that I worked on back in the early '90s
called NESL. The particular syntax is actually motivated by
Aho, Hopcroft and Ullman's code. It's a sort of set notation that says pick all the elements less
than the pivot: for every element E in S, in set notation, pick the elements less than A, pick the
ones equal to A, and the ones greater than A. These are comprehensions;
various other languages such as Haskell have similar sorts of constructs. The original motivation
for this comes from SETL [phonetic], a language from the '70s.
Then it says apply Quicksort to the elements that are less and the elements that are greater,
S1 and S3, and then you put together the results here:
you take the first result returned here and the second, and put S2 in between. The
point is that in NESL, wherever you see curly brackets, it means there's potential parallelism.
It means you can do this in parallel: you compare all the keys to the pivot in parallel, and it also
says you can make the two recursive calls to Quicksort in parallel. This forms what's called
nested parallelism, and the name NESL comes from that. It turns out NESL is functional, and it
allows you to get this deterministic parallelism: you're guaranteed there are not going to be any
race conditions from the recursive calls to Quicksort, because this is purely functional.
This is something I presume you can do with F#. And so you're guaranteed that
this is safe, in the sense that you'll always get deterministic results however you execute it,
whether you execute the first Quicksort first or the second one first or both in parallel.
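The NESL slide itself isn't reproduced in this transcript, but a rough OCaml rendering of the same
structure (a reconstruction, not the slide's code; the pivot choice here is arbitrary) looks like this:

```ocaml
(* Quicksort at the comprehension level of abstraction.  The three
   filters and the two recursive calls are all independent of each
   other, so a parallel implementation may evaluate them
   simultaneously; since nothing is mutated, the result is
   deterministic no matter how they are scheduled. *)
let rec quicksort s =
  match s with
  | [] -> []
  | _ ->
    let a = List.nth s (List.length s / 2) in    (* pick a pivot *)
    let s1 = List.filter (fun e -> e < a) s in   (* doable in parallel *)
    let s2 = List.filter (fun e -> e = a) s in   (* doable in parallel *)
    let s3 = List.filter (fun e -> e > a) s in   (* doable in parallel *)
    quicksort s1 @ s2 @ quicksort s3             (* two parallel calls *)
```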
Okay. So I want to go on. This is supposed to express the parallelism at a very high
level: we say that we can do the two Quicksorts in parallel, and we can pick out the elements
greater than, equal to, and less than the pivot, each of those in parallel.
But the question is how we actually implement that parallel selection. It's actually not quite
obvious. Yes, we can compare the keys in S to the pivot in parallel, but then we have to
somehow pack them down. We have to start with this array of keys (let's say S is actually an
array) and pack it down into just the ones which are less than the pivot. So here let's say the
pivot is four; we've packed out the elements two, one, zero, three and one. The question is how
we do this, and it actually turns out to be a relatively straightforward thing to do.
But it does require a little, well, parallel thinking; we actually have to realize that this can be
done. So here's what we do. We pick out all the elements which are less than the pivot, which
are the ones marked with ones here; we've got zeros where the element is not less than the pivot.
Then what we do is a so-called parallel prefix, where every position gets the sum of everything
before it. So this one is the sum of everything before it, the three is the sum of those three
elements, this four is the sum of those four, et cetera.
So we've got this array; this is called the prefix sum or the scan operation. And then we just
write each element to its position, if its flag is one. So we write the two to location zero, the one
to location one, the zero to location two, the three to location three, and the one to location four.
And that gives us the final result. Okay. So the interesting thing here is that, unlike Quicksort,
which was inherently parallel, it's not obvious that this is a parallel step; the prefix sum would
seem completely sequential. Now, some of you have seen this before; I know many of you know
this well.
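Putting those three steps together, here's a minimal OCaml sketch of pack (my own illustration,
not code from the talk). The flag and scatter steps are each trivially parallel across positions; the
prefix sum is written as a sequential loop here, which is exactly the step the parallel scan
algorithm described next replaces:

```ocaml
(* pack p s: keep the elements of s satisfying p, preserving order.
   Step 1: flags, 1 where p holds (parallel across positions).
   Step 2: exclusive prefix sum of the flags, giving each kept
           element its output position (sequential here; see the
           parallel scan sketched later).
   Step 3: scatter each flagged element to its position (parallel). *)
let pack p s =
  let n = Array.length s in
  let flags = Array.map (fun e -> if p e then 1 else 0) s in
  let pos = Array.make n 0 in
  for i = 1 to n - 1 do pos.(i) <- pos.(i - 1) + flags.(i - 1) done;
  let m = if n = 0 then 0 else pos.(n - 1) + flags.(n - 1) in
  if m = 0 then [||]
  else begin
    (* s.(0) is only a placeholder; every slot gets overwritten *)
    let out = Array.make m s.(0) in
    Array.iteri (fun i e -> if flags.(i) = 1 then out.(pos.(i)) <- e) s;
    out
  end
```

For instance, with a made-up input array, `pack (fun e -> e < 4) [|6; 2; 9; 1; 7; 0; 3; 5; 1|]`
yields `[|2; 1; 0; 3; 1|]`, the two, one, zero, three and one of the example above.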
But there's actually a very easy parallel algorithm to do this. And this would be something that
would be a core idea if you are going to be teaching parallel thinking.
I think a lot of people we've taught in the last 10 years that have come out of Carnegie Mellon
would be clueless on how to do this in parallel. But it's actually a very easy idea. In fact, it's sort
of a fundamental idea. So I just want to go through it to convince you it's doable, for those who
haven't seen it.
So here's an array. Let's say I want to find the prefix sum of this; we want, at each position, the
sum of everything before it. So in this position I'm going to want a 3 and in that position I'm going
to want a 7, I guess.
So how are we going to do this in parallel? Sequentially we'd go through the array one
by one: write a loop and add things up.
Well, it turns out it's very easy to do this in a recursive way. What we do is pair-wise add
these up: the 2 and the 1 become 3, the 4 and the 2 become 6, et cetera.
And now let's say I just recursively apply a prefix sum to that array. It should be clear I can
do this pair-wise adding in parallel, so that's easy to do in parallel. I make a recursive call, which
is, of course, as I mentioned, one of our tools of computational thinking: recursion.
So now I've got the result of the prefix sum on that half-length array: 0, 3, 9, 13.
Some of you might recognize, if you calculate this in your head, that these are basically every
other result of the actual prefix sum: the prefix sum at this position is 0, the prefix sum at this
position is 3, the prefix sum at that position is 9, and the prefix sum at that position is 13. We've
got half of our answers just by adding pairs up and doing a recursive call to prefix sum.
And furthermore, given half of our answers, if I look at the answer at any position, I can
easily calculate the answer at the next position by adding the key. If I know, for example, at this
position it's 9 and I ask what it is at the next position, well, it's 9 plus 3; that's the definition of
prefix sum. It's the sum of everything before it.
So we've got a simple way: we take these results and add in the corresponding input elements.
So 0 plus 2, 3 plus 4, 9 plus 3, 13 plus 5 gives us these values, and we just interleave
those.
And that's our prefix sum. So what I'm saying is, this is a very easy algorithm to describe. It's as
easy as, or easier than, most of the algorithms we describe in an introductory data structures
course.
And it is a fundamental operation in parallel computing; this sort of thing gets done all the time.
But yet I would say most of our students, if we gave them this on an exit exam as they leave the
university, wouldn't know how to do it.
Even though I was able to describe it in a few minutes. And this idea actually goes further: you
can do a lot more than just prefix sums here. If you have a state machine that's transitioning
through states on a bunch of inputs, you can calculate the states using this.
There are all sorts of things which normally you would think of as being sequential but are
actually parallel.
Okay. So here's actually the NESL code for it. It doesn't really matter; you might not like the
syntax, or you might. But it's basically saying: this pair-wise adds up the elements,
this is the recursive call, and this calculates the recursive result plus the appropriate
input elements and interleaves them together. You get what you want here.
So it's only a few lines of code. It's not a complicated idea. It's something that I think every
undergraduate should know by their freshman year. And it's a core idea. It's not what's
the syntax for OpenMP and nested loops. I'm a little worried that it's more likely undergraduates
know that than that they know this basic algorithm.
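Since the NESL slide isn't captured here, a direct OCaml transcription of the algorithm just
described might look like the following (my reconstruction; it computes the exclusive plus-scan,
and both `Array.init` steps are parallel across indices even though OCaml evaluates them
sequentially):

```ocaml
(* Exclusive prefix sum ("plus-scan") by contraction:
   1. pair-wise add adjacent elements (parallel across the pairs);
   2. recursively scan the half-length array;
   3. even output positions come straight from the recursive result,
      and each odd position adds one input element to the result just
      before it (again parallel across positions).
   log n levels of recursion, constant depth per level: O(log n) span. *)
let rec scan a =
  let n = Array.length a in
  if n = 0 then [||]
  else if n = 1 then [| 0 |]
  else
    let sums =
      Array.init ((n + 1) / 2) (fun i ->
        if 2 * i + 1 < n then a.(2 * i) + a.(2 * i + 1) else a.(2 * i))
    in
    let r = scan sums in
    Array.init n (fun i ->
      if i mod 2 = 0 then r.(i / 2) else r.(i / 2) + a.(i - 1))
```

On the example above (the last element of the slide's array isn't legible in the transcript, so I've
used 3), `scan [|2; 1; 4; 2; 3; 1; 5; 3|]` pair-wise adds to `[|3; 6; 4; 8|]`, recursively scans that to
`[|0; 3; 9; 13|]`, and interleaves to give `[|0; 2; 3; 7; 9; 12; 13; 18|]`. Replacing `+` and `0` by any
associative operator and its identity gives the general scan; composing state-transition functions
is exactly such an operator.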
Okay. So a couple of observations here. One is that just because something seems sequential
doesn't mean it is; there might be a simple parallel algorithm for it.
And there's actually a very common theme that comes up in the design of parallel algorithms:
if you can somehow reduce the problem to a smaller problem, solve the smaller problem
recursively and use that to solve the big one, that's very helpful.
This is not exactly divide and conquer. In divide and conquer we solve two or more smaller
problems and put them together. Here I'm just solving one: I only made one
recursive call to the plus-scan, I didn't do multiple. So it's not divide and conquer, but it's this
idea of solving a smaller problem and using it to solve the bigger one. I'd say this comes up a lot
more when thinking about problems in parallel than it does sequentially. Because
sequentially you're often thinking of going through a loop one at a time; especially if you think
about the scan, you basically think about starting at the beginning and going to the end. When
you want to think in parallel, you should be thinking: how do I take this really big problem, make it
into something smaller, solve that smaller thing, and then have it help me solve the big one?
That's typically more useful in parallelism than the one-at-a-time approach, obviously. The last
thing is the generalization of the plus-scan: you can generalize it to a
bunch of state transitions, and really the plus-scan is a special case of composing state
transitions.
In fact, Guy Steele gives a nice talk on his version of parallel thinking, and he really emphasizes
this idea that you can compose state transitions.
>>: So even the algorithm you described has some sequential [inaudible] I do this then I do this.
>> Guy Blelloch: That's right.
>>: So is there a notion of optimal parallelism, where with an algorithm you cannot parallelize
more than --
>> Guy Blelloch: Yes, there's a lot of theory in parallel algorithms about optimality and things you
can't do better than.
In fact, you can't do better than logarithmic depth for a scan because you basically have to do a
tree. You have to pair-wise add things up. So the best you can possibly do is logarithmic depth.
So, yes, there is a nice theory and in fact this algorithm I described is optimal in that sense. But
you can't do better than that.
I guess this is -- you can write Quicksort in lots of different languages. This is Quicksort in
parallel in Cilk++. The point is that the parallelism structure is the same in all these languages;
we shouldn't over-emphasize the particular constructs. I think there's been a lot of emphasis
recently in parallelism on just looking at what the OpenMP constructs are, or what the Cilk++
constructs are, or what FUBA's [phonetic] constructs are, without thinking about what the
high-level idea is here.
One thing I should point out is that, as I mentioned, there are two types of parallelism here, right?
There's the parallelism from making parallel calls, the fact that we can make the two recursive
calls to Quicksort in parallel, and there's the parallelism in the actual partitioning, which is what
we did with the scan operation, the plus-scan: we packed the keys in parallel by doing that
plus-scan.
It turns out that if you just take advantage of the recursive calls (and if you go looking, you'll find
parallel Quicksorts like that all over the place: it's in the set of examples in Cilk++, in the set of
examples of Threading Building Blocks, in a bunch of different ones),
they all do the partition sequentially, and there's actually very limited parallelism, because
you basically have to do linear work for the partitioning before you can make the two parallel
recursive calls. The best you can do is still going to take you order N time,
however many processors you have, because the very first partition is done sequentially. It's sort
of an Amdahl's law effect; we can't do better than that. In fact, a nice way to look at this: if you
look at the span, sort of the critical path of dependencies in this computation, it's going to be
order N. The total work, the total number of comparisons I do, is N log N, and so the very best I
can do is divide the work N log N by the span N, which gives me the number of processors I can
make use of: I can make use of log N processors.
On the second slide I showed that curve of the multi-cores. Where we're at now, this is
perfectly adequate: we've got eight cores, and if we're sorting a million things, log N is 20. We
would be perfectly happy with this.
But if you go to 2015, that was 100 cores. This is not going to do: it will work up to 10 or 20
cores, but it's not going to go beyond there. Now, you could say, well, what if we did the other
thing, and we only took advantage of the parallelism from sub-selecting the keys greater than,
equal to, and less than the pivot?
I actually took this piece of Quicksort code off of the web somewhere (it's an example, I might
have typed it in wrong), but it's High Performance Fortran, a parallel language that was developed
in the early '90s.
It allows you to do this pack operation, the thing that packs all the elements less than the
pivot; inside of the pack there's a prefix sum, the scan operation. And so here we're picking the
elements less than the pivot and greater than the pivot. I guess here we don't do the equal
ones.
But we call two recursive Quicksorts, and I guess this is an in-place version, so it leaves the
results in the array. And the point is that these two recursive calls are done sequentially, but
picking out the elements is done in parallel.
And we can look at what the span, sort of the critical path, of this is. Because the recursive calls
are done sequentially, there's going to be a linear number of them: the number of leaves in the
recursion tree is basically one per key, so it's basically going to be N total function calls made to
Quicksort.
And if they're done sequentially, it's going to take N time. In fact it takes exactly order N time if
you do the analysis. You get the same span as before, and if you look at the parallelism, it's
log N; that's not very much, and we can make use of maybe eight cores and not many more.
On the other hand, if you take advantage of both forms of parallelism, the parallel function
calls and also the parallel partitioning (the sub-selection of the keys less than, equal to and
greater than the pivot), you actually get a lot of parallelism. Each block here is supposed to
represent the parallel work: there's log N span here to do the scan operation.
So if we look at the scan, it made a bunch of recursive calls, each half as big, so the total
depth there was log N; the critical path is log N here. And then
you can say there's about log N depth of recursion here. You can prove that
for Quicksort: it's about log N deep recursion, because it's approximately balanced in expectation.
And so the total work is still N log N, but now the span, the critical path through the computation,
is log squared N. And if you look at the parallelism, again work divided by span, we can
make use of basically N over log N processors. That's a huge amount of parallelism: if we have a
million elements, we're talking about 50,000-fold parallelism here. Here we've got a good parallel
algorithm.
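In work-span terms, the comparison just made can be summarized as follows (this is the standard
analysis, restated to match the numbers quoted above):

$$
W(n) = O(n \log n) \text{ in all three cases, while } S(n) =
\begin{cases}
O(n) & \text{parallel recursive calls only (sequential partition)}\\
O(n) & \text{parallel partition only (sequential recursive calls)}\\
O(\log^2 n) & \text{both,}
\end{cases}
$$

giving parallelism $W/S$ of $O(\log n)$, $O(\log n)$ and $O(n/\log n)$ respectively.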
But I guess what's important here is that we have to have a way to think about this. We have to
have a way to analyze it. As a person who's designing a parallel algorithm, I want to
compare two algorithms: I want to decide whether it's useful to have the first form of
parallelism, the recursive calls, or the second form, where I do the sub-selection, or both, and
how useful it is to have them.
I need a way to think about this. I need a way to analyze it. I need a way to compare.
So, again, this is part of the computational thinking: what NESL does is give
you a cost model which allows you to calculate the work of a computation when you make a
bunch of parallel calls. Let's say these were two calls to Quicksort here; the work is just the
sum of the two.
That's what it would take; basically it's exactly the sequential time. If I make two calls
sequentially, the total time is the sum of them. If I make two parallel calls, the total work is still
the sum: I take the work from one call and the work from the second, and even though they're
running in parallel I still have to add up the work.
The span of them is the maximum: if I make two parallel calls, I have to wait for the longer of
the two, so I take the maximum of those two spans.
And so I can calculate the span. In fact, if you do this formally for Quicksort, you get exactly what
I had on the previous slide: a span of log squared N and work of order N log N.
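Spelled out (these are the standard work-span composition rules, with schematic constants),
evaluating $e_1$ and $e_2$ in parallel costs

$$
W(e_1 \,\|\, e_2) = W(e_1) + W(e_2) + O(1), \qquad
S(e_1 \,\|\, e_2) = \max\bigl(S(e_1), S(e_2)\bigr) + O(1),
$$

and applying these rules to Quicksort, whose partition has $O(n)$ work and $O(\log n)$ span,
gives the expected-case recurrences

$$
W(n) = W(n_L) + W(n_R) + O(n) = O(n \log n), \qquad
S(n) = \max\bigl(S(n_L), S(n_R)\bigr) + O(\log n) = O(\log^2 n).
$$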
This gives you a formal way to determine what the work and span are, and then you might say:
well, why is that useful?
Well, it turns out that if you know the work and span, that's actually very useful for telling you
how fast it will run on any given number of processors. Now, there are certain assumptions here
about locality and caches and stuff, which I'll get back to.
But if you have a so-called greedy schedule -- so in the Quicksort with all these recursive
calls, you basically have a whole lot of parallelism, and you have to schedule that parallelism onto
the processors. I didn't talk about that; the point is that it can be done by a scheduler under
the hood.
In fact, most parallel systems like OpenMP have a built-in scheduler under the hood
nowadays. Some are better than others; in fact, the only decent one I've seen is the Cilk++ one.
So if you have a so-called greedy schedule, then you can show that the time is always at most
the work divided by the number of processors, plus the depth. So if I know the work and I know
the depth, I know quite a bit about the time.
You can also show that the time is going to be at least the maximum of
these two terms. And if you think about it, the difference between the sum and the maximum is
not that much: the most the ratio could be is two, when the two terms are equal, and if they're not
equal it's going to be less than two.
The argument for why the time is at least the maximum is very easy. If I've got so much work to
do and so many processors, the best I can do is evenly load balance that work across the
processors, so it's going to take me at least the work divided by the number of processors.
That's assuming I perfectly load balance it, and assuming I can't get super-linear speedup
because of cache effects or something.
Then there's the depth. The depth is the critical path: if I take the longest path of dependencies
through my computation, that's what this is, or the span, I should say.
D is the span. So it's going to take at least the span, and therefore at least the maximum of
these two. And you can actually show the upper bound if you use a so-called greedy schedule,
where a greedy schedule is just a schedule that does work whenever there's work to be done;
the normal notion of greedy.
So this tells me a lot about the time on a fixed number of processors. That's an argument for why
abstracting in terms of work and span is good.
Finally, as I've been saying all along, if I divide the work by the span, it gives me a
very good sense of what the parallelism is. Because it turns out this ratio tells me which of these
two terms will dominate, right? If I have P equal to W over D, that many processors, then the
terms are equal. I'd like the first term to dominate, because then I'm making good use of my
processors: I've got so much work and I'm getting that work divided by P.
So ideally I always want that term to dominate, and having P less
than the parallelism is exactly when that term dominates.
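In symbols, this is the classic greedy-scheduling (Brent-style) bound: with work $W$, span $D$
and $P$ processors, the running time $T_P$ of any greedy schedule satisfies

$$
\max\!\left(\frac{W}{P},\, D\right) \;\le\; T_P \;\le\; \frac{W}{P} + D,
$$

so whenever $P$ is well below the parallelism $W/D$, the $W/P$ term dominates and you get
near-linear speedup.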
>>: We're seeing a lot of new architectures that have SIMD lanes but their own sort of
program counters, in GPUs.
So in this case, within the SIMD lane, it might be better, especially for scan, to use a
different algorithm, not one that does this [inaudible] phase but the [inaudible] algorithm,
within the SIMD lane, because the constant involved is less. So even
though the work is more, you use that much work anyway.
>> Guy Blelloch: I agree. All these things are important, right? But I think the idea is, what I'm
describing now is the fundamental idea you teach a first-year student: they have to
understand the basic idea of a scan, right?
And then in our curriculum, for example, we have an introduction to data structures course for
freshmen. As sophomores we have the systems course, where we say actually the real machines
aren't RAMs: they have caches, they have these other effects. And that's the sort of thing you're
saying. The machine model is an ideal; like I'm saying, yes, there would be trade-offs, and these
are the next level of things that should be taught.
Okay. So more generally, your comment about trade-offs in work comes up in a lot of
different places. That's true. So a few more observations. One is that it's often important to take
advantage of both the data parallelism, which is sub-selecting elements less than or greater than
the pivot (something that's easy to do on a GPU), and also the function parallelism, the recursive
calls to Quicksort, which are more difficult to do on a GPU. In order to get good speedup you
have to do both of those.
The other is that it's important to have some way for the user to get some sense of the costs, to
compare. Whenever we do algorithms, we want users to be able, at a rough level, to
compare insertion sort with Quicksort: one's N squared, one's N log N.
It's not perfect, because it doesn't take account of caches, and the work and depth that I've
described are not perfect in the same sense; they don't take account of secondary things.
But they give you at least a high-level sense of how parallel an algorithm is. And I think it's very
important for people in general, programmers in general, to be able to quickly look at two
algorithms and have a rough sense of how parallel one is versus the other, and then they can
proceed from there.
A topic I didn't really talk about is that there are a lot of different ways to schedule that
parallelism, and there's actually been a lot of very nice work in the last 15 years on scheduling
parallelism.
But that's more of an advanced topic. Okay. I just want to go through a few more quick examples
here to show that it's relatively easy to think about parallelism. This is a recursive block matrix
multiply. In order to multiply two matrices A and B, it blocks each of them into a two-by-two grid
of four blocks, recursively, and makes a bunch of recursive calls to matrix multiply on these
smaller blocks: I multiply, I guess, this one with that one and this one with that one, add those two
together, and it gives me the upper left corner of my result.
So there are actually eight recursive calls here. And if you analyze the work: you make eight
recursive calls, each half as big, plus N squared work for the matrix add here (add
this result to there: N squared elements). The total work is N cubed, which is exactly the
sequential work we would expect.
If you analyze the depth: all these recursive calls can be done in parallel, every single one of
them. There are eight recursive calls and they can be done in parallel, and if I take the maximum
of eight things which are all equal, well, it's just one of them.
And then there's a little more depth because of the matrix additions, but I can do those in
constant depth in this model. So the depth here is only log N. I've got a very shallow
computation here.
This is an extremely parallel algorithm. Look at the parallelism: N cubed over log N. For a
thousand by thousand matrix, I can make use of millions of processors; much more parallelism
than I need. You could argue there's too much parallelism, and then it's really up to the scheduler
to figure out which part of that parallelism to take advantage of.
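As recurrences, matching the numbers above:

$$
W(n) = 8\,W(n/2) + O(n^2) = O(n^3), \qquad
S(n) = S(n/2) + O(1) = O(\log n),
$$

so the parallelism is $W/S = O(n^3/\log n)$.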
You can do matrix inversion. This is the so-called Schur complement method for doing matrix
inversion: you can do a block method, and if you do the recursion you get N cubed work here.
What's interesting here is that there's actually a dependence; there are sequential steps. In
particular, there's an inversion here: you have to compute this D inverse first before you
can generate the S inverse.
You have to do this inversion before you can do that inversion, so there's a sequential
dependence, and that's why there's a 2 in the recurrence here. If you solve that, the actual span
of this algorithm is order N instead of log N.
But it's still extremely parallel. The parallelism here, the work divided by the span, is N squared,
so I can still take advantage of N squared processors in sort of the best case.
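For reference, since the slide isn't reproduced here: the textbook block-inversion identity behind
this (with block names that may differ from the slide's) inverts
$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ via the Schur complement
$S = A - B D^{-1} C$ as

$$
M^{-1} =
\begin{pmatrix}
S^{-1} & -S^{-1} B D^{-1} \\
-D^{-1} C S^{-1} & D^{-1} + D^{-1} C S^{-1} B D^{-1}
\end{pmatrix},
$$

and the two inversions, $D$ first and then $S$, are what put the factor of 2 into the span
recurrence $T_\infty(n) = 2\,T_\infty(n/2) + O(\log n) = O(n)$.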
Okay. You can also do a Fourier transform. By the way, all the examples I'm showing now are
standard sequential code that you might have seen in an algorithms course; they didn't tell you
it's parallel, but it's just naturally parallel. All you have to do is analyze the depth. In the
examples I've shown, except for the scan example, these are all naturally parallel.
So here you just make two recursive calls (in fact, this is actually the complete NESL code for
an FFT), and do some parallel multiplies and adds here; it's the standard fast Fourier transform,
with the standard work. And the depth is actually very shallow: it's log N here, and the
parallelism you can take advantage of is order N. This is actually similar to the Quicksort code in
style, in the sense that you've got the function-call parallelism from making two calls to FFT in
parallel, and also the data parallelism, because this statement here is a data-parallel statement
over N elements.
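In the same notation, the standard FFT analysis matches the numbers just quoted:

$$
W(n) = 2\,W(n/2) + O(n) = O(n \log n), \qquad
S(n) = S(n/2) + O(1) = O(\log n),
$$

for parallelism $W/S = O(n)$.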
Okay. I wanted to show one more example that requires something more than just reanalyzing
sequential algorithms. This is more like the scan, where we actually rethink the algorithm a little
bit; we have to do a little bit of parallel thinking in order to do a merge here.
Okay. So this code here is actually sequential code for merge. It happens to be ML code; it
doesn't really matter what it is. It's just the standard algorithm where you start at the
beginning of two lists (these are linked lists in this case), and you look at the two front elements;
whichever one is less, you pull it off the front of its list, go to the next one, and continue down the
lists.
So that's the complete code for it, until you get to the end; this is just the base case: when you
get to the end of one list you return the other list.
What about doing this in parallel? This is a very sequential thing; like the scan operation,
we would think about it very sequentially. The goal would be that if you're experienced in
parallel thinking, you should just know the answer to this immediately. But again, I don't think, if
we gave this quiz to most of our exiting students, that they would know how to do merge in
parallel.
And I don't know what the experience of people in the room is, but I would say most
students likely wouldn't. It turns out there are actually very easy algorithms for doing merging in
parallel, and everyone should understand these; everyone should have been trained in these
when they were undergraduates.
Let me just start. The algorithm I'm going to show (and there are different
algorithms) is based on assuming that our inputs are balanced trees, in order. So my input is
going to be a tree with some key at the root, all the keys less than it on the left and all the keys
greater on the right. I'm going to take two of these trees and I'm going to merge them.
Because the first thing you should notice is that if I start out with a linked list, I'm never going to
be able to do better than linear span: to get the next element in the linked list, I'm
going to have to basically follow that linked list.
The first thing you should say is, if I start with the linked list I'm screwed. So let me start with a
tree. It turns out there are versions that do it on an array, but the particular version I'm going
to do works on a tree.
I take two trees which both have their keys in order, and I'm going to merge those two. And
here's just a simple piece of code (this is actually ML code again) that's given a key, a pivot here,
and a tree; the second argument is a tree, which is either an empty tree or a tree with a node in
it. You basically go down and split it: you follow down a bunch of nodes
until you get to a leaf, breaking off the elements less than the pivot and the ones greater than the
pivot.
So this basically splits into the keys less than and greater than the pivot. It's different from what
we did in Quicksort, because our elements are already sorted; this actually only takes logarithmic
time, because I just go down the tree. In logarithmic time, assuming this tree's balanced, I can
split this tree into two.
Because it's already sorted, I am just going down one path. If this were an array, this would
correspond to a binary search: if you gave me a pivot and an array, in order to figure out where
that pivot goes in the array I would do a binary search on it. That's the array version of this step.
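The ML from the slide isn't in the transcript, so here's a minimal OCaml sketch of the split (my
reconstruction; it returns unbalanced results, whereas a real implementation would rebalance on
the way back up to keep the logarithmic bounds):

```ocaml
(* Search trees over integers: empty, or Node (left, key, right),
   with everything in [left] less than [key] and everything in
   [right] greater or equal. *)
type tree = Empty | Node of tree * int * tree

(* split pivot t: one root-to-leaf walk that breaks t into the
   elements less than the pivot and those greater or equal.
   O(log n) span if t is balanced. *)
let rec split pivot t =
  match t with
  | Empty -> (Empty, Empty)
  | Node (l, k, r) ->
    if k < pivot then
      let rl, rr = split pivot r in
      (Node (l, k, rl), rr)      (* l and k are all below the pivot *)
    else
      let ll, lr = split pivot l in
      (ll, Node (lr, k, r))      (* k and r are all at or above it *)
```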
And that's basically two-thirds of the whole code for merging, right there. This is the rest of it; I
guess I'm missing the base case, actually.
But I'm done now; here's a parallel, extremely parallel, piece of code for merging. Let
me just go through it. Like I said, I start out with two trees, A and B. This is my tree
A; that's my tree B. I assume that they're both balanced, and I'm going to return a tree as output
which is itself also balanced. And how am I going to do that? I'm going to take my first
tree and pull out the root. That's some element that's greater than all these elements and less
than all those elements.
Now I'm going to use that root to split the other tree, using the code
on the previous page. That gives me two trees: the elements of B less than the pivot
and the ones greater. And I just recursively merge the AL with the BL, all the elements
less than the pivot, and the AR with the BR, all the elements greater than the pivot, and I'm
done.
Then I just put my pivot in the middle, the thing I pulled out at the beginning. So that's a piece of
code. Now you might say, well, I haven't analyzed this: is this an efficient piece of code, if I wrote
it sequentially, if I made these two recursive calls one after the other? The parallelism comes
from the fact that I can make the two calls to merge in parallel.
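Using the `tree` type and `split` from the sketch above, the whole merge (again my
reconstruction of the slide's ML, with the base case the speaker mentions put back in) is just:

```ocaml
(* Merge two search trees: take A's root as the pivot, split B with
   it, and merge the matching sides.  The two recursive calls are
   independent, so they can run in parallel; with balancing (omitted
   here) this is O(n) work and O(log^2 n) span. *)
let rec merge a b =
  match a with
  | Empty -> b                               (* the missing base case *)
  | Node (al, m, ar) ->
    let bl, br = split m b in
    Node (merge al bl, m, merge ar br)       (* two parallel calls *)
```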
If I analyze the work, what is it? It turns out that if I do the
merging this way, the work is actually linear. You get a recurrence that basically says it's 2 times
the work on N over 2, plus log N, the time for doing the split. That just solves to linear.
This is a linear-work algorithm. If I ran it sequentially, in big-O terms it's just as fast as the
version that walks in from the heads of the lists.
In practice it's probably a factor of 2 slower because of various overheads, dealing with the data
structures, et cetera. But it's at least within a constant factor.
So even if I did this sequentially it would be reasonably good. If I analyze the depth, well, there's
this log N time here for doing the split, and the depth of the recursion is log N. So each level of
my recursion has log N span, and the whole recursion has log N levels, so the whole depth is the
product of those two: it's log squared N.
And this is just using those rules, where I take the maximum of the span of my [inaudible]
recursive calls and then add in the span for the split.
And that gives me these bounds. Anyway, it's a reasonable piece of code; it fits in hardly any
space. I guess I cheated a little bit and left the base case off here, but it's not that much more
complicated. This is something that everyone should understand.
Again, if you want to write an optimal version, I'm sure you'd do something fancier, but at a
minimum you should understand that this is a way to do merging.
Now, actually, just an interesting comment here: it turns out you can actually improve the span
here from log squared N to log N using an interesting old concept from parallel computing called
a future.
This is actually an advanced topic; I wouldn't teach it to a freshman. But it's sort of
interesting to realize that as I take my pivot M and start splitting this tree,
I can actually start feeding the results from my BL and BR to my recursive
calls before the split is even done. That gives you sort of a pipelining effect.
You can do this with futures, and I think futures are something a few people in this room have
worked with before. Basically, all you do is, when you call this split operation here, you wrap it in
a future: whenever I do my recursive split, I wrap it in a future. What a future is, basically, is it
says go off and do this computation and return me a handle immediately, which I can pass
around to other people; and when I want the actual value, I ask that handle for the value. If it's
ready, it gives it to me immediately; if it's not, it suspends me until it's ready and then gives it to
me.
That's what a future is. And if you just take the split operation and wrap futures
around those two recursive calls and then use this, it turns out you can show that the depth here
reduces to a log N span; you actually get more parallelism. Now, analyzing this is more
complicated and, again, I wouldn't make it the first thing you teach in parallelism. But it goes to
show that you can have a formal way of analyzing what a future, what this pipeline parallelism,
will give you.
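Just to illustrate the handle/force pattern (not the full pipelined merge), here's a minimal future
on top of OCaml 5's Domain module; an assumption on my part, since the talk predates it, and a
real implementation would use a thread pool rather than one domain per future:

```ocaml
(* A future: start the computation immediately and hand back a
   handle; forcing the handle blocks only if the value isn't ready. *)
let future f = Domain.spawn f      (* returns an 'a Domain.t handle *)
let force h = Domain.join h        (* waits, if needed, for the value *)

let example () =
  let h = future (fun () -> 21 * 2) in  (* runs concurrently *)
  (* ... do other work here while the future computes ... *)
  force h                               (* 42, once it's ready *)
```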
Okay. So more observations. One is that divide and conquer is even more useful in parallel than
sequentially. Of course, we've all been using divide and conquer algorithms forever, but when
you did merging you didn't use divide and conquer; you used something that started at the
beginning of the list. So even for merging and other cases like this, things you haven't used
divide and conquer for, when you start thinking about how to do them in
parallel, you should be thinking about how to do them with divide and conquer.
The other thing is that trees are better than lists for parallelism. In fact, lists are a real drag, not a
good data structure for parallelism. If you start thinking in parallel, you're basically going to stop
building data structures out of linked lists. You're going to build them out of trees, or arrays,
which are okay; lots of things are okay, basically anything where you can access any element
in time less than the size of the data structure. If I have a linked list, I have to go all the way
down to the end of the list to access that last element.
And the other, more advanced, point here is that pipelining can
actually asymptotically improve the depth of the algorithm; the span of the algorithm, I should
say.
Okay. So, observations. Here's a summary of the observations in the talk.
There was a general observation that natural parallelism is often lost in low-level
implementation. That was the Quicksort example, where we should be teaching people that this
is a parallel algorithm when we teach them Quicksort, and we should do that from day one.
And people should just understand that. And not just that the two recursive calls are parallel, but
the fact that we can pick out the elements greater than, equal to, and less than the pivot in
parallel.
There are a lot of lost opportunities in describing parallelism to undergraduates, in general. And
another observation is that lots of things which seem sequential are actually not that hard to
parallelize. Two examples were merging and the scan operation.
I hope I convinced you those were quite simple algorithms; I could write the pretty much
complete code for each in 10 or 15 lines, right?
Then some model and language issues: taking advantage of both data and function parallelism,
and the need for some abstract cost model so that at least you can get some grasp on the sort
of parallelism that's available. So we talked about work and span; divide work by span and it
gives you the number of processors you can use.
And the work also tells you whether it's going to be as efficient as the sequential algorithm in
terms of the total work it does.
I didn't really talk about scheduling. And another comment, which I didn't really say much about
except on one of those early slides, is that all the examples I've shown are deterministic parallel.
However you schedule any one of those bits of code from this talk, it won't give you a
different answer. A lot of the complication people have in designing parallel code
is just debugging: if you get different answers each time you run it, that nondeterminism
is going to give you a huge headache. Everything I showed you will return the same result; it's
got sequential semantics. I think that's very important if all you're doing is designing an
algorithm.
Now, like I said, there are a lot of applications where you need concurrency. In fact, there are a
lot of applications which run on one processor where you need concurrency, like an operating
system. So we do have to understand interleaving and coherency protocols and whatever. But if
that's not what you're doing, if you're designing an application to run in parallel where the
application does not interact with the environment around it, or interacts at a coarse level instead
of a very fine-grained level, then you should try to be deterministic.
I talked about a bunch of different algorithmic techniques. I talked about recursing on smaller
problems: you take a big problem, as in the scan, solve a smaller instance, and use it to help you
solve the big one. This comes up over and over again. I didn't talk much about how state
transitions can be aggregated. And I talked about divide and conquer, and trees, and pipelining.
I teach a course in parallel algorithms, and there are actually a bunch of other
core ideas in parallel thinking which I didn't even get to today. There are ideas in graph
contraction for designing graph algorithms, and in identifying independent sets, which is useful in
a bunch of different algorithms, some in computational geometry like Delaunay triangulation, et
cetera; symmetry breaking. These are some of the other core ideas in parallel algorithms, in
thinking about solving deterministic parallel problems.
Okay. So I'm short on time. I also had some slides on locality; I'll just quickly mention that one
issue with everything I've talked about so far is that I haven't talked about locality at all. And I
think locality is actually a very interesting problem in parallel computing, because we're now at
least somewhat used to analyzing algorithms for caches sequentially.
When you get to parallel computation, dealing with locality is a whole other bag of worms. But I
actually think there are very nice ways of dealing with locality in an abstract way, in the same
sense that what I've been showing is an abstract way to think about parallelism. There are
abstract ways to think about locality where the locality is tied much more to the application, to
properties of the application or the algorithm, than to particular hardware.
So you can define locality in terms of those properties: we can come up with some measure of
locality, like we did for work and span, based on just the algorithm itself, and that will help you
map it onto a machine that has multiple levels of caches, or local and nonlocal memory, et
cetera.
Okay. So to finish up on the idea of parallel thinking: I've really concentrated on just one
piece of what I call parallel thinking, which is deterministic parallel algorithms, without
any concurrency; there was no nondeterminism in what I just showed. I think it's still
important to do concurrency: there are applications where there's just inherent
nondeterminism in the environment; you're getting requests online and you have to deal with
them. So I think it's important, like I tried to do in the case of deterministic
parallel algorithms, to identify what the core ideas are, what
I think everyone should know about parallel thinking in that context.
There's a similar set of ideas that you'd want to know about the concurrency context. And I've
listed some of it here.
And also similar sorts of ideas that you'd want to know sort of in terms of architectures. And
cache coherency protocols, et cetera.
Okay. So I guess what I've talked about today is actually based on 30 years of research by lots
of different people. So a lot of the algorithms aren't my algorithms. And a lot of it's come out of
the theory, algorithms community and also the programming language community, although I've
often twisted it a little bit.
So my conclusions. Educating programmers for parallelism is key. And, again, I think
there's been much too much emphasis on teaching a particular parallel programming paradigm:
OpenMP, or some language like CUDA for graphics processors, or whatever.
You give someone a book on one of these topics and they go off and learn how to use it, but
they don't understand the high-level concepts of parallelism. I think there's too much emphasis
on that, and I think this is very important.
I also think it's very important for people to clearly separate what I call parallelism and
concurrency; like I said, people have argued about whether these are the right terms. But clearly
separate when you're trying to take advantage of multiple processors in your hardware
versus when your application actually has concurrency in it:
it's a real-time application, you're getting demands from the outside world and
you have to deal with them, and they're coming in a nondeterministic order.
Another comment -- this is more for me at a university than for a company like Microsoft -- is
that it's important to teach these ideas from day one. When we first start teaching computing, in
our very first data structures course, our very first programming course, we should be teaching
parallelism. In fact, both parallelism and concurrency should be taught from day one.
I think concurrency is a little bit more difficult -- it's a more advanced topic -- so probably
only the simplest things should be brought in on concurrency, but some of it should be introduced
from day one. I know MIT has an introductory course where they start introducing concurrency
from day one.
Then I guess the last thing is, you know, I tried to go through what I think are some of the core
ideas in parallel algorithms, deterministic parallel algorithms. I think it's reasonably important
for the community to understand what the core ideas in all these different areas are, so that
everyone is properly trained in how to think about parallel algorithms and how to think about
concurrency, so that we can move forward on making use of these parallel machines.
Okay. That's it. Questions.
>>: Does this depend on using a functional language?
>> Guy Blelloch: Well, basically what the functional language does is give you guarantees that
it's correct. If you don't use a functional language, then you have to use a race detector --
that's the Cilk approach: you write in a C-like language, you run your race detector, and it
tells you if there are any races. Typically the race detector slows the program down by a factor
of 10, so you can't have it running all the time. But you run it on some test data and hope that
it detects the races on that test data. What the race detector guarantees is that when you make
parallel calls there are no races. If it passes the race detector, then it will be deterministic.
And the other approach is just to cross your fingers, I guess.
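To make the race point concrete, here is a minimal sketch in C++ threads (not Cilk, and not code from the talk): the first version has a determinacy race that a detector would flag, while the second confines each parallel call to its own location, so the result is deterministic.

    #include <numeric>
    #include <thread>
    #include <vector>

    // Racy version: the spawned thread and the parent both update the
    // same variable 'out' with no synchronization -- a determinacy race.
    // The final value depends on scheduling; a race detector flags this.
    long racy_sum(const std::vector<long>& a) {
        long out = 0;
        std::size_t mid = a.size() / 2;
        std::thread t([&] { for (std::size_t i = 0; i < mid; ++i) out += a[i]; });
        for (std::size_t i = mid; i < a.size(); ++i) out += a[i];  // races with t
        t.join();
        return out;
    }

    // Race-free version: each parallel call writes only its own result,
    // and the two are combined after the join, so the answer is
    // deterministic regardless of scheduling.
    long safe_sum(const std::vector<long>& a) {
        std::size_t mid = a.size() / 2;
        long left = 0;
        std::thread t([&] { left = std::accumulate(a.begin(), a.begin() + mid, 0L); });
        long right = std::accumulate(a.begin() + mid, a.end(), 0L);
        t.join();
        return left + right;
    }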
>>: So you've talked about ideas for thinking about parallel algorithms and for how to present
them to, say, undergrads. Today, when we were talking about scan or matrix multiply and such, we
usually look at a PRAM model, where we talk about actual data structures and how the computation
is structured. And then there's what I'd call a model of circuit families, where you show the
data flow -- in that model of parallel thinking you'd talk about size and depth complexity; you
talked about it in terms of work and span. And futures can be represented in circuit families by
the normal data flow: a wire coming down to the other part.
So I was wondering: the circuit-family, data-flow style of thinking about parallel algorithms is
very nice because it's visual, and you don't get lost in how to map your trees into cells of
memory or anything that's very sequential, because it's all connected by rubber bands at the wyes
and such. Is the model of circuit families for parallel computation something we should be
teaching to undergrads, along with how to take such a circuit and schedule it onto real hardware
or onto data-flow computing? It seems like a more intuitive way of thinking about it.
>> Guy Blelloch: It's very much that, I feel. If you noticed, I never used the term PRAM here
until the very last slide, because this isn't the PRAM model. In the PRAM model you have
processors and you do your own scheduling.
This is a very different model. In fact, if you look at the merge code or the quicksort code, you
don't see processors. You might or might not have noticed that when I did quicksort and analyzed
its complexity, I didn't mention processors a single time. The term processor never showed up.
I believe that's very important: this piece of code is a parallel piece of code, and it doesn't
say anything about processors.
>>: It also abstracts into structures as opposed to indices and arrays, which I thought was nice.
>> Guy Blelloch: So I would say that's definitely the right approach, because it's more abstract.
You can basically take this code, work out the work and the depth, and only at the last step say:
if I had P processors, how fast would that run?
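The talk's code isn't reproduced here, but a minimal C++ sketch of that style might look like the following (std::async stands in for a fork-join construct; a runtime like Cilk would schedule far more cheaply). Note that nothing in it mentions processors:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Fork-join quicksort over the half-open range [lo, hi). The code
    // says only which calls may run in parallel; a scheduler maps the
    // calls onto whatever cores exist.
    void quicksort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
        if (hi - lo < 2048) {                 // small ranges: sort sequentially
            std::sort(a.begin() + lo, a.begin() + hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        auto first = a.begin() + lo, last = a.begin() + hi;
        // Three-way partition guarantees progress even with repeated keys.
        auto m1 = std::partition(first, last, [&](int x) { return x < pivot; });
        auto m2 = std::partition(m1, last, [&](int x) { return x == pivot; });
        std::size_t i1 = m1 - a.begin(), i2 = m2 - a.begin();
        // Fork: the two recursive calls are independent, so their work
        // adds and the span is the max of the two sides.
        auto left = std::async(std::launch::async,
                               [&a, lo, i1] { quicksort(a, lo, i1); });
        quicksort(a, i2, hi);
        left.get();                           // join
    }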
>>: So my question is: should we be presenting data-flow computing or circuit families to
undergrads? Because you can show me the equations all hour and I still don't really get it until
I see your example for eight or 16 elements or whatever, visually showing me the flow of the
computation on the screen. So this is --
>> Guy Blelloch: Right, this is very much that -- this data-flow style of programming, basically.
Like I said, it really only comes down to two trivial ideas -- sorry for flipping through the
slides. If you make two parallel calls, this is really everything; the whole thing is
encapsulated in the equations: the work is the sum of the work of the calls, and the span is the
max of their spans.
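Written out -- this is a reconstruction of those two equations, with the convention that the composition itself costs a constant -- parallel composition gives:

    W(e_1 \parallel e_2) = W(e_1) + W(e_2) + 1
    S(e_1 \parallel e_2) = \max\{S(e_1),\, S(e_2)\} + 1

and for sequential composition both the work and the span simply add.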
By the way, about the term span: I've used depth instead of span in my own work, and Charles
[inaudible] and the Cilk work have always used critical path. About two years ago we got into a
fistfight and agreed on using span as a compromise.
But it's not a standard term. Some people use critical path, some people use span, some people
use depth.
>>: From a conceptual point, visually seeing that data-flow graph really illustrates it: the work
is the number of boxes and the span is the depth, or whatever.
>> Guy Blelloch: Yeah, I think as well as showing the equations, it's very useful educationally to
show these sorts of boxes.
>>: I just wanted to say, as a practical thing to get programmers to do this, based on my group's
experience: the key thing was getting us eight-core machines. All of a sudden it became
worthwhile for us to put in the effort to learn to do the simple stuff, and if you're in a
language that allows functional programming but doesn't require it, it motivated us to get rid of
those side effects.
>> Guy Blelloch: Yeah, I think this eight-core point -- I think there's going to be a huge jump.
Somehow four cores was not enough parallelism to motivate anything. Most of what I showed today,
if you actually code it up, loses a factor of two or three just from going to parallel
algorithms. The scan is a good example: there are twice as many operations in the parallel scan
as in the sequential scan, so you lose a factor of two off the bat.
If you've got four cores you're not going to be satisfied. So I think you really need at minimum
eight, and I think it will make a huge difference when we get to 16, 32, 64. That's when people
will get it; they'll say, wow, parallelism is useful. Somehow four is not enough, because with
the overhead of parallelism you lose more than half of it right off the bat.
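To make that factor of two concrete, here is a sketch (assumed, not the talk's code) of the standard work-efficient scan: an upsweep and a downsweep, each doing about n additions, against the single pass of n additions a sequential scan needs.

    #include <cstddef>
    #include <vector>

    // Work-efficient exclusive scan (upsweep/downsweep), assuming the
    // length n is a power of two. It does roughly 2n additions where the
    // sequential scan does n. Within each level the iterations touch
    // disjoint elements, so every level could run fully in parallel,
    // giving O(log n) span; plain loops are used here for clarity.
    void scan_exclusive(std::vector<long>& a) {
        std::size_t n = a.size();
        if (n == 0) return;
        // Upsweep: build a tree of partial sums in place (~n additions).
        for (std::size_t d = 1; d < n; d *= 2)
            for (std::size_t i = 2*d - 1; i < n; i += 2*d)   // independent
                a[i] += a[i - d];
        a[n - 1] = 0;
        // Downsweep: push prefixes back down the tree (~n more additions).
        for (std::size_t d = n / 2; d >= 1; d /= 2)
            for (std::size_t i = 2*d - 1; i < n; i += 2*d) { // independent
                long t = a[i - d];
                a[i - d] = a[i];
                a[i] += t;
            }
    }
    // e.g. {1, 2, 3, 4} becomes {0, 1, 3, 6}.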
>>: The people that are writing the tools and libraries now have those eight cores, so they're
motivated to do it for themselves and not just for some abstract machine that will be good in the
future.
>> Guy Blelloch: Yeah. It's true that in the course I teach, I had a four-core machine last year
and an eight-core machine this year. It made a difference: people felt more satisfied getting a
factor of five speedup as opposed to a factor of two or two and a half.
>>: Playing the devil's advocate here: you used as justification those comments about 100
processors by 2015, I think it was. A lot of people don't believe that. In fact, it could stop as
early as eight. If it does stop as early as eight, can we forget all this?
>> Guy Blelloch: It would become more of a niche field -- people will still have much bigger
parallel machines. But if you don't have it on your laptop, then it wouldn't be something you'd
use on your laptop.
>>: This is really a hard sell, right? People have been saying for a long time we should be
teaching parallel thinking. And as you said Carnegie Mellon doesn't do it very well yet.
>> Guy Blelloch: Well, it is based a bit on this belief that we will have more than -- like I
say, eight cores might be marginal; it might become useful at 16 or 32. But I think we only have
to get up to 32 before it becomes clearly useful. At 8 and 16 it's sort of like, well, it's quite
a bit of work for this, and there are probably other ways I can take advantage of 8 cores. Once
you get to 32, I think you have to start thinking this way.
And personally -- I'm not sure we're going to have 100 cores in 2015. I certainly wouldn't put a
big bet on that. But I'm quite sure we'll have at least 32 cores by 2015, and more beyond that.
I mean, maybe I've been misled by Intel. But I think so.
>>: They're already integrating GPUs on the northbridge, right? So you already have literally
tens of thousands of logical processing elements that are now sharing the same physical address
space. So scan and some of these other things for large datasets could now be used by the
operating system, instead of having [inaudible] across PCI Express to a separate memory. So I
think we already have that now. We have tens of thousands of elements, basically doing latency
hiding -- we have this bandwidth for pushing instructions through that we didn't have before, on
the same physical memory.
>> Guy Blelloch: Dave?
>>: So are we making a mistake by not emphasizing communication costs as a first-order concern,
even in our introductory courses?
>> Guy Blelloch: Yeah, that's why I brought up the locality issue at the end: I don't think you
can forget about communication costs completely.
I do actually think you could ignore them at the freshman level. But even by the sophomore level,
I think you probably have to start talking about locality. So at the end I had a couple of slides
on locality. But I think even in sequential computing it's questionable, right? There's a
100-fold difference between a level-one cache hit and a complete cache miss. Even sequentially --
and yet we still teach quicksort without worrying about the cache. It turns out quicksort is okay
for cache locality, because it scans the memory in order.
A lot of these algorithms are okay -- in fact, a lot of the algorithms I showed are reasonably
cache friendly. The point is that if we are going to talk about locality, we want to do it in a
way that isn't at a low level -- that isn't talking about processors and communication -- but
that somehow captures a property of the program. Take recursive matrix multiply: there's a huge
amount of cache locality there. We all know matrix multiply is about the best possible thing you
can do for a cache, right? The reason is that if you look at all the cross terms, they're all
sharing the same matrices, just doing the same thing over and over again.
If we can somehow capture that at a high level, I think it would be helpful; otherwise locality
on parallel machines is a nightmare. There are so many different ways you can have locality on
parallel machines, and it's much more complicated than in the sequential case.
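For reference, a minimal C++ sketch of that recursive structure (my reconstruction, not code from the talk; it assumes row-major storage and n a power of two). The point is that the recursion automatically reaches a block size where the subproblems fit in cache, without the code ever naming the cache parameters:

    #include <cstddef>

    // C += A * B for an n x n block inside matrices of leading dimension
    // ld. Divide and conquer: each level splits the problem into eight
    // half-size multiplies. Once a subproblem's three blocks fit in some
    // level of cache, everything below it runs out of that cache.
    void matmul(const double* A, const double* B, double* C,
                std::size_t n, std::size_t ld) {
        if (n <= 16) {  // base case: ordinary triple loop on a small block
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t k = 0; k < n; ++k)
                    for (std::size_t j = 0; j < n; ++j)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        std::size_t h = n / 2;
        const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
        const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
        double       *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
        // Eight half-size multiplies. Each pair below writes a distinct
        // quadrant of C, so the four pairs are independent and could be
        // forked in parallel.
        matmul(A11, B11, C11, h, ld); matmul(A12, B21, C11, h, ld);
        matmul(A11, B12, C12, h, ld); matmul(A12, B22, C12, h, ld);
        matmul(A21, B11, C21, h, ld); matmul(A22, B21, C21, h, ld);
        matmul(A21, B12, C22, h, ld); matmul(A22, B22, C22, h, ld);
    }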
>>: What do you think is the future of transparent caching as we scale up to 100 or 1,000 nodes?
For primitives like scan and matrix multiply, are we going to continue with the same sort of
caching paradigm that we have for [inaudible], or are we just going to have to manually manage
that as part of our algorithm design?
>> Guy Blelloch: Well, I hope you wouldn't have to. Certainly no one would want to manually
manage the cache. It's hard enough to manage the cache manually in the sequential case -- that's
why we went to automatic caching -- and it would be much harder in parallel.
So yes, I would certainly hope that it's done under the hood for you, ideally in hardware. As for
cache coherency schemes, whether you need the strictest schemes or can get away with relaxed
ones -- maybe you can get away with relaxed consistency if the system on top knows what to do
with it. But I think we do need it to be automatic. That's my opinion.
Just one more.
>>: One of the trickier things we get taught in sequential programming is how to do everything in
place. And from the pictures of the algorithms you showed, they seem to be less in place than the
equivalent sequential algorithms. Is that a characteristic across these algorithms?
>> Guy Blelloch: It is a characteristic. Like I said, you sort of lose a factor of two with these
algorithms out of the box. For quicksort, in fact, the factor of 2 comes from the fact that the
parallel version, where you do a subselection, is not an in-place version. That means you're
going to touch twice as much memory, probably, and that's going to give you a factor of 2 -- it's
not more than a factor of 2. But in a lot of these cases, yes, because you're doing things in
parallel you have to do them in separate spaces. That's a comment about quicksort specifically,
but generally you often have to work in separate spaces, and that means touching more memory.
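As a sketch of where that extra memory traffic comes from (my construction, not code from the talk): the parallel subselection writes into a fresh array via a flag/scan/scatter pattern, touching roughly twice the memory of an in-place partition.

    #include <cstddef>
    #include <vector>

    // Out-of-place parallel partition: flag, scan, scatter. Every element
    // is written into a second array instead of being swapped in place,
    // so it touches roughly twice the memory of the sequential in-place
    // partition. Each pass is a map, a scan, or a scatter, all of which
    // have low span; plain loops are used here for clarity.
    std::vector<int> partition_out(const std::vector<int>& a, int pivot) {
        std::size_t n = a.size();
        std::vector<std::size_t> flag(n), pos(n);
        for (std::size_t i = 0; i < n; ++i)       // map: mark the "less" side
            flag[i] = a[i] < pivot ? 1 : 0;
        std::size_t sum = 0;
        for (std::size_t i = 0; i < n; ++i) {     // exclusive scan of flags
            pos[i] = sum;
            sum += flag[i];
        }
        std::size_t nless = sum;                  // count of elements < pivot
        std::vector<int> out(n);
        for (std::size_t i = 0; i < n; ++i)       // scatter to fresh array
            out[flag[i] ? pos[i] : nless + (i - pos[i])] = a[i];
        return out;
    }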
>>: Do you think that will inform computer architecture? Is there something you've thought of
that can be done with the way things are cached to be more friendly to that pattern? Because
caches today are very friendly to this in-place sort of stuff.
>> Guy Blelloch: It could be. If you think about functional languages, they don't do things in
place, right? Because unless you optimize, you're always copying to a new location.
And there has been work showing that the sort of caching schemes you'd want for a functional
language are different. So maybe the same sort of thought process would say the same thing here.
>>: Great. Well, let's thank Guy.