>> Jim Larus: It's my pleasure today to introduce Guy Blelloch from Carnegie Mellon. Guy has
been doing parallel computing for a long time, at least as long as I've been in the field.
And he's told me that he never gave it up. Even when the rest of us went off and did something
different for a decade or so, he kept on working on it at least as a sideline.
So it's back, we're fortunate that he's back, and he's going to tell us about parallel
thinking. Guy.
>> Guy Blelloch: Okay. Is the microphone on? Can you hear me?
>> Jim Larus: Yes.
>> Guy Blelloch: Yes, so this work on parallel thinking is actually part of a Microsoft-sponsored
center for computational thinking. You might not know what this is, but Microsoft gave
Carnegie Mellon some money back when Jeannette Wing was department head. She was very
interested in this notion of computational thinking, which I'll describe in a moment.
But now Jeannette has actually gone off to head up CISE at NSF, and she's been putting a lot
of these ideas from computational thinking into CISE. And so a lot of the calls for proposals
coming out of NSF have that term in them, computational thinking.
So, background: what is computational thinking? Well, Jeannette's idea was that there are a lot of
core foundational ideas in computer science, in the way we think, which are different from the way
that people think in other fields, in other sciences.
And it's important to identify what those ideas are, for a few reasons. So these ideas are
things like abstraction: we somehow use abstraction more than most other fields do. We
develop our own abstractions all the time, whether it be the interface for a data structure or
whatever; we're developing them all the time.
And they're really a core part of the way we think in computer science.
Another one is recursion. This is an idea that's completely embedded in us. We don't even think
twice about it. In a lot of other fields, when you first see recursion, it's sort of baffling.
There are various other ideas, like reductions: that you can prove a problem hard by
reducing one problem to another.
There are lots of core ideas in computer science. The reason it's important to understand
them is, one, it's an educational thing: to develop curricula you want to understand what the
core ideas are so that you can emphasize those in your courses.
And a lot of the motivation for this came out of a general feeling, about five or ten years ago, not
just at Carnegie Mellon but in a lot of places, that what we were teaching, especially in high
schools but also in intro courses in colleges, was emphasizing Java programming or some
low-level hacking of some kind, instead of emphasizing what the core ideas of computer science are.
Another aspect of it is just a PR thing: we should make it clear to people outside of computer
science that computer science is not just Java programming; there are deep ideas in it, it's a real
science, and we should say what the ideas in that real science are, what the science behind
computer science is. That's what computational thinking is.
The third idea is that maybe you can take the way we think about problems and apply it to other
areas, and actually get innovations by doing that. I think Chris [inaudible] at Berkeley calls this the
computational lens. So if you look at certain problems through the computational lens, whether it
be a problem in economics, a problem in physics or whatever, you might have a different way of
looking at that problem and therefore actually get a better way of approaching it.
So anyway, that's what computational thinking is, and Microsoft was generous and gave us
money for that, and the project I was doing on parallel thinking actually came out of that.
So the idea, obviously, is that there's some core set of ideas in thinking about parallel
programming and developing parallel code, and the question is what those core ideas are and why
they might be useful.
Okay. You've all seen charts like this, I'm sure. This particular one, many of you have probably
seen this exact one is from Andrew Chen at Intel. It's showing the increase in cores over time. I
guess we're at 2009 now.
Nowadays when you buy a multi-core it's actually four or six cores; some of them have eight. In a
few years we're going to have 12 to 32. And I guess the most interesting point here is that 2015
is not that far away. It's six years away, and Intel at least is predicting you'll have 100 cores on
every chip in a laptop you buy.
I'm not sure how precise they're trying to be; it's sort of nebulous where it's placed. But that's the
idea. So the question is: How are we going to use all these cores and what are we going to do
with them?
And what are some of the problems in moving into a world where instead of a single core we have
multiple cores? Well, one is building the hardware. I actually think this is the easiest to deal with.
Intel, or the various chip manufacturers, are going to do it for us. Exactly what it's going to look
like still has to be worked out, and I haven't signed the appropriate nondisclosure, so I can't say
what the next generation is going to look like.
But that's moving ahead. Then there's developing good programming languages and parallel
extensions. I think there are a lot of efforts nowadays to do this, and there's some very nice
work, whether it be Cilk++ or various other language extensions.
I think it will be a while before these play out; there will be several of them out there, OpenMP and
others, and some of them will work better than others. Then there's developing good runtimes,
compilers and debuggers (I think developing good debuggers is an interesting field), and
developing good OS support.
Then, of course, there's just the fact that a lot of companies have a huge code base. I don't know
how many lines of code Microsoft has, but I assume it's very large. Even if only a small
fraction of it needs to be changed to deal with multi-core (because we can't just rely on the
frequency speeding up, as in the past; we actually now have to make use of those multiple
cores), clearly there will be a huge amount of work here. And that ties into the part that I actually
think is the hardest of all of this. The most likely thing that's going to create a failure here is not
any of these technologies.
I think a lot of good ideas were developed over the last 20 years in all of these areas. I think the
biggest problem right now is that most people out there, people that we've graduated, know
almost nothing about parallelism. And part of this is that, certainly at Carnegie Mellon,
we haven't been teaching parallelism for many years; only recently have we started teaching
some courses in parallelism. So all the students we've educated in the last 10 years know very
little about parallelism.
And this is changing, but it's going to take a while. If you don't have people who understand
parallelism, then it's going to be a problem, because they're not going to be able to use these
things, or these things won't be designed very well, because the people coding them or writing
the compilers maybe don't fully understand.
So I think there's a real issue here in getting people to understand all the issues of parallelism. A
lot of these things are things that as a community we understand, but we haven't educated our
students and our workforce.
So what I want to do is take an approach to parallel thinking that's similar to the idea of
computational thinking: it's not a bunch of library interfaces for parallelism. It's not the
OpenMP interface for parallelism, or the Cilk++ syntax for writing a parallel program, or the Intel
Threading Building Blocks. That's not what parallel thinking is. It's not a bunch of libraries or
interfaces.
It's really a set of core ideas. And the question is what those core ideas are, and then identifying
them and educating our students, and even people beyond students, on what these core ideas
are. So part of it is just identifying what they are.
Then I guess the other side of this is that I think if you really go back to the basics, parallelism
isn't that hard. And I'm going to put a little caveat on this on the next slide.
But parallelism, where parallelism means making use of multiple processors, if done right, isn't
hard: a lot of algorithms are naturally parallel. We just brainwash our students into thinking
sequentially; at least at Carnegie Mellon we do, and I think most other universities do too. We've
been doing this for 40 years.
And if we just brought in parallelism at a much more fundamental level, at an early part of
the curricula, we could actually avoid this problem.
Furthermore, if done right, with people who have an appropriate knowledge of parallel
programming, parallelism is really not that hard. At least certain aspects of it.
I'm going to put one caveat on the next slide here, which is this: I want to make a very clear
distinction between what I call parallelism and concurrency. These particular terms don't matter
that much; I think Jim was at a workshop last summer where we spent about five hours fighting
over whether these are the right terms or not.
What I want to do is make a distinction between parallelism being a property of the machine,
having multiple processors running, where the only purpose is to get speedup, since
you're now going to be balancing the load across multiple processors, versus
concurrency, which is a programming issue where you've actually got multiple interleaved
threads, and you're making use of these things which are running concurrently, with
nondeterministic interleavings.
So, for example, most operating systems have always been concurrent. We've been teaching
the bakery algorithm and locks for the last 40 years, and on machines which are completely
sequential. And it's not because we want to speed things up. It's because we have
multi-threaded processes: we've got multiple threads running, multiple processes running on our
machine, our sequential machine. They're interleaved. So these interleaved processes can
interact with each other.
We need locks, and we need concurrency, and it's running on a single processor. So that
concurrency has existed for many years; it has nothing to do with parallelism. It has to do with
interleaving of threads, nothing to do with hardware parallelism. And then we've got what I
call deterministic parallelism. I'm actually going to focus on this: the idea that you can have
sequential semantics, no nondeterminism, and have it run in parallel.
I claim this is actually the part that's not understood; if you look at curricula today, this is the
part that's least understood. Like I said, we teach the other part. I went through our curricula and
we cover the bakery algorithm in four different courses. We cover interleaving of instructions in
three different courses: the OS course, the distributed systems course and the introduction to
systems course.
This sort of stuff we don't cover anywhere. And then, of course, you can have both. You can
have things which are concurrent, where your environment might be nondeterministic (you've
got demands coming in on the fly and you have to process them), and you're also running on
parallel machines.
But I want to make a clear distinction here between concurrency and parallelism. My claim
is that parallelism, especially this deterministic parallelism, is much easier than people
think, if you're just trained properly. I want to go through this talk and give some examples
of why I think this.
Okay. So here I'm starting with Quicksort. This is the code from the book "Algorithms in Java" by
Robert Sedgewick, and it's a book that's used in a reasonable number of courses, introductory
courses and data structures.
And we can look at this, and it's what you might expect. It's got a while loop. It's got a couple
more while loops in here. So this is clearly extremely sequential. It does some swaps. We pick
a pivot, we find all the elements less than and greater than that pivot, swap between them, and do
a final swap to put the pivot in the right place.
And then we make a recursive call to Quicksort, and after that we make another call to
Quicksort. And so that's what you would expect. We teach Quicksort as a
sequential algorithm. And in fact this is extremely sequential, this part here.
Well, you know, if you're a computer scientist with experience, you'll probably realize that you
can call these two things in parallel. But that's not how we teach it. We just say you call one and
then you call the other.
So that's an extremely sequential program. Now, if we go back a few years, here's the
exact same Quicksort algorithm from Aho, Hopcroft and Ullman. This is a book
from the '70s.
I actually claim in some sense this is a much better piece of code, because it's at the right level
of abstraction: this is a parallel piece of code. My point being here is that many algorithms are
just inherently parallel, and if we just raised the level of abstraction, we would realize they're
parallel. So in this case there's a huge amount of parallelism in Quicksort, the fact that we can
make the two recursive calls.
What this actually says here is how we put the results together. It doesn't say we have to call
these sequentially. It doesn't say we first call one Quicksort and then the other. It says when we
put these things together, we follow the Quicksort of S1 by S2 and then the Quicksort of S3.
Furthermore, this here can be done in parallel. When we're picking the elements of
S1, all the elements less than the pivot, that can be done in parallel. We can compare our keys
to the pivot in parallel.
So the input set here might be of size N. We've got N of those comparisons to the pivot, and
they can be done in parallel, and we can collect the results up. So there's actually a huge
amount of parallelism here. Expressed at this level, it's already in the code. And if you
go back in time, you actually find that the earlier books described algorithms more in parallel
than they do now; over the years we've gone into more and more sequential thinking.
So I guess there are a couple of observations here. One is that there's a lot of natural parallelism
in algorithms, and if we describe them at the right level of abstraction, we would notice that it's
there. And furthermore, when we do describe them in courses, we really should point out that
it's there. We should show the students that this is parallel. We should tell them that they
can make the two recursive calls to Quicksort in parallel. We can tell them they can compare the
pivot to the keys in parallel.
So that's sort of a lost opportunity. And so, yes, the point being that there's a lot of inherent
parallelism. So actually this is Quicksort in a language that I worked on back in the early '90s
called NESL. The particular syntax is actually motivated by
Aho, Hopcroft and Ullman's code. It's a sort of set notation that says pick all the elements less
than the pivot: for every element E in S, in set notation, pick the elements less than A, pick the
ones equal to A, and the ones greater than A. These are comprehensions;
various other languages such as Haskell have similar sorts of constructs. The original motivation
for this comes from SETL [phonetic], a language from the '70s.
Then it says apply Quicksort to the elements that are less and the elements that are greater,
S1 and S3, and then you put together the results here:
you take the first result returned here and the second, and put S2 in between. The
point is that in NESL, wherever you see curly brackets, it means there's potential parallelism.
It means you can do this in parallel: you compare all the keys to the pivot in parallel, and it also
says you can make the two recursive calls to Quicksort in parallel. This forms what's called
nested parallelism, and the name NESL comes from that. It turns out NESL is functional, and it
allows you to get this deterministic parallelism: you're guaranteed there are not going to be any
race conditions from the recursive calls to Quicksort, because this is purely functional.
This is something I presume you can do with F#. And so you're guaranteed that
this is safe, in the sense that you'll always get deterministic results however you execute it,
whether you execute the first Quicksort first or the second one first or both in parallel.
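The NESL slide itself isn't reproduced in this transcript, but a rough OCaml rendering of the same
structure (a reconstruction, not the slide's code; the pivot choice here is arbitrary) looks like this:

```ocaml
(* Quicksort at the comprehension level of abstraction.  The three
   filters and the two recursive calls are all independent of each
   other, so a parallel implementation may evaluate them
   simultaneously; since nothing is mutated, the result is
   deterministic no matter how they are scheduled. *)
let rec quicksort s =
  match s with
  | [] -> []
  | _ ->
    let a = List.nth s (List.length s / 2) in    (* pick a pivot *)
    let s1 = List.filter (fun e -> e < a) s in   (* doable in parallel *)
    let s2 = List.filter (fun e -> e = a) s in   (* doable in parallel *)
    let s3 = List.filter (fun e -> e > a) s in   (* doable in parallel *)
    quicksort s1 @ s2 @ quicksort s3             (* two parallel calls *)
```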
Okay. So I want to go on. This is supposed to express the parallelism at a very high
level: we say that we can do the two Quicksorts in parallel, and we can pick out the elements
greater than, equal to, and less than the pivot, each of those in parallel.
But the question is how we actually implement that parallel selection. It's actually not quite
obvious. Yes, we can compare the keys in S to the pivot in parallel, but then we have to
somehow pack them down. We have to start with this array of keys (let's say S is actually an
array) and pack it down into just the ones which are less than the pivot. So here let's say the
pivot is four; we've packed out the elements two, one, zero, three and one. The question is how
we do this, and it actually turns out to be a relatively straightforward thing to do.
But it does require a little, well, parallel thinking; we actually have to realize that this can be
done. So here's what we do. We pick out all the elements which are less than the pivot, which
are the ones marked with ones here; we've got zeros where the element is not less than the pivot.
Then what we do is a so-called parallel prefix, where every position gets the sum of everything
before it. So this one is the sum of everything before it, the three is the sum of those three
elements, this four is the sum of those four, et cetera.
So we've got this array; this is called the prefix sum or the scan operation. And then we just
write each element to its position, if its flag is one. So we write the two to location zero, the one
to location one, the zero to location two, the three to location three, and the one to location four.
And that gives us the final result. Okay. So the interesting thing here is that, unlike Quicksort,
which was inherently parallel, it's not obvious that this is a parallel step; the prefix sum would
seem completely sequential. Now, some of you have seen this before; I know many of you know
this well.
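Putting those three steps together, here's a minimal OCaml sketch of pack (my own illustration,
not code from the talk). The flag and scatter steps are each trivially parallel across positions; the
prefix sum is written as a sequential loop here, which is exactly the step the parallel scan
algorithm described next replaces:

```ocaml
(* pack p s: keep the elements of s satisfying p, preserving order.
   Step 1: flags, 1 where p holds (parallel across positions).
   Step 2: exclusive prefix sum of the flags, giving each kept
           element its output position (sequential here; see the
           parallel scan sketched later).
   Step 3: scatter each flagged element to its position (parallel). *)
let pack p s =
  let n = Array.length s in
  let flags = Array.map (fun e -> if p e then 1 else 0) s in
  let pos = Array.make n 0 in
  for i = 1 to n - 1 do pos.(i) <- pos.(i - 1) + flags.(i - 1) done;
  let m = if n = 0 then 0 else pos.(n - 1) + flags.(n - 1) in
  if m = 0 then [||]
  else begin
    (* s.(0) is only a placeholder; every slot gets overwritten *)
    let out = Array.make m s.(0) in
    Array.iteri (fun i e -> if flags.(i) = 1 then out.(pos.(i)) <- e) s;
    out
  end
```

For instance, with a made-up input array, `pack (fun e -> e < 4) [|6; 2; 9; 1; 7; 0; 3; 5; 1|]`
yields `[|2; 1; 0; 3; 1|]`, the two, one, zero, three and one of the example above.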
But there's actually a very easy parallel algorithm to do this. And this would be something that
would be a core idea if you are going to be teaching parallel thinking.
I think a lot of people we've taught in the last 10 years that have come out of Carnegie Mellon
would be clueless on how to do this in parallel. But it's actually a very easy idea. In fact, it's sort
of a fundamental idea. So I just want to go through it to convince you it's doable, for those who
haven't seen it.
So here's an array. Let's say I want to find the prefix sum of this; we want, at each position, the
sum of everything before it. So in this position I'm going to want a 3 and in that position I'm going
to want a 7, I guess.
So how are we going to do this in parallel? Sequentially we'd go through the array one
by one: write a loop and add things up.
Well, it turns out it's very easy to do this in a recursive way. What we do is pair-wise add
these up: the 2 and the 1 become 3, the 4 and the 2 become 6, et cetera.
And now let's say I just recursively apply a prefix sum to that array. It should be clear I can
do this pair-wise adding in parallel, so that's easy to do in parallel. I make a recursive call, which
is, of course, as I mentioned, one of our tools of computational thinking: recursion.
So now I've got the result of the prefix sum on that half-length array: 0, 3, 9, 13.
Some of you might recognize, if you calculate this in your head, that these are basically every
other result of the actual prefix sum: the prefix sum at this position is 0, the prefix sum at this
position is 3, the prefix sum at that position is 9, and the prefix sum at that position is 13. We've
got half of our answers just by adding pairs up and doing a recursive call to prefix sum.
And furthermore, given half of our answers, if I look at the answer at any position, I can
easily calculate the answer at the next position by adding the key. If I know, for example, at this
position it's 9 and I ask what it is at the next position, well, it's 9 plus 3; that's the definition of
prefix sum. It's the sum of everything before it.
So we've got a simple way: we take these results and add in the corresponding input elements.
So 0 plus 2, 3 plus 4, 9 plus 3, 13 plus 5 gives us these values, and we just interleave
those.
And that's our prefix sum. So what I'm saying is, this is a very easy algorithm to describe. It's as
easy as, or easier than, most of the algorithms we describe in an introductory data structures
course.
And it is a fundamental operation in parallel computing; this sort of thing gets done all the time.
But yet I would say most of our students, if we gave them this on an exit exam as they leave the
university, wouldn't know how to do it.
Even though I was able to describe it in a few minutes. And this idea actually goes further: you
can do a lot more than just prefix sums here. If you have a state machine that's transitioning
through states on a bunch of inputs, you can calculate the states using this.
There are all sorts of things which normally you would think of as being sequential but are
actually parallel.
Okay. So here's actually the NESL code for it. It doesn't really matter; you might not like the
syntax, or you might. But it's basically saying: this pair-wise adds up the elements,
this is the recursive call, and this calculates the recursive result plus the appropriate
input elements and interleaves them together. You get what you want here.
So it's only a few lines of code. It's not a complicated idea. It's something that I think every
undergraduate should know by their freshman year. And it's a core idea. It's not what's
the syntax for OpenMP and nested loops. I'm a little worried that it's more likely undergraduates
know that than that they know this basic algorithm.
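Since the NESL slide isn't captured here, a direct OCaml transcription of the algorithm just
described might look like the following (my reconstruction; it computes the exclusive plus-scan,
and both `Array.init` steps are parallel across indices even though OCaml evaluates them
sequentially):

```ocaml
(* Exclusive prefix sum ("plus-scan") by contraction:
   1. pair-wise add adjacent elements (parallel across the pairs);
   2. recursively scan the half-length array;
   3. even output positions come straight from the recursive result,
      and each odd position adds one input element to the result just
      before it (again parallel across positions).
   log n levels of recursion, constant depth per level: O(log n) span. *)
let rec scan a =
  let n = Array.length a in
  if n = 0 then [||]
  else if n = 1 then [| 0 |]
  else
    let sums =
      Array.init ((n + 1) / 2) (fun i ->
        if 2 * i + 1 < n then a.(2 * i) + a.(2 * i + 1) else a.(2 * i))
    in
    let r = scan sums in
    Array.init n (fun i ->
      if i mod 2 = 0 then r.(i / 2) else r.(i / 2) + a.(i - 1))
```

On the example above (the last element of the slide's array isn't legible in the transcript, so I've
used 3), `scan [|2; 1; 4; 2; 3; 1; 5; 3|]` pair-wise adds to `[|3; 6; 4; 8|]`, recursively scans that to
`[|0; 3; 9; 13|]`, and interleaves to give `[|0; 2; 3; 7; 9; 12; 13; 18|]`. Replacing `+` and `0` by any
associative operator and its identity gives the general scan; composing state-transition functions
is exactly such an operator.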
Okay. So a couple of observations here. One is that just because something seems sequential
doesn't mean it is; there might be a simple parallel algorithm for it.
And there's actually a very common theme that comes up in the design of parallel algorithms:
if you can somehow reduce the problem to a smaller problem, solve the smaller problem
recursively and use that to solve the big one, that's very helpful.
This is not exactly divide and conquer. In divide and conquer we solve two or more smaller
problems and put them together. Here I'm just solving one: I only made one
recursive call to the plus-scan, I didn't do multiple. So it's not divide and conquer, but it's this
idea of solving a smaller problem and using it to solve the bigger one. I'd say this comes up a lot
more when thinking about problems in parallel than it does sequentially. Because
sequentially you're often thinking of going through a loop one at a time; especially if you think
about the scan, you basically think about starting at the beginning and going to the end. When
you want to think in parallel, you should be thinking: how do I take this really big problem, make it
into something smaller, solve that smaller thing, and then have it help me solve the big one?
That's typically more useful in parallelism than the one-at-a-time approach, obviously. The last
thing is the generalization of the plus-scan: you can generalize it to a
bunch of state transitions, and really the plus-scan is a special case of composing state
transitions.
In fact, Guy Steele gives a nice talk on his version of parallel thinking, and he really emphasizes
this idea that you can compose state transitions.
>>: So even the algorithm you described has some sequential [inaudible] I do this then I do this.
>> Guy Blelloch: That's right.
>>: So is there a notion of optimal parallelism, where with an algorithm you cannot parallelize
more than --
>> Guy Blelloch: Yes, there's a lot of theory in parallel algorithms about optimality and things you
can't do better than.
In fact, you can't do better than logarithmic depth for a scan because you basically have to do a
tree. You have to pair-wise add things up. So the best you can possibly do is logarithmic depth.
So, yes, there is a nice theory and in fact this algorithm I described is optimal in that sense. But
you can't do better than that.
I guess this is -- you can write Quicksort in lots of different languages. This is Quicksort in
parallel in Cilk++. The point is that the parallelism structure is the same in all these languages;
we shouldn't over-emphasize the particular constructs. I think there's been a lot of emphasis
recently in parallelism on just looking at what the OpenMP constructs are, or what the Cilk++
constructs are, or what FUBA's [phonetic] constructs are, without thinking about what the
high-level idea is here.
One thing I should point out is that, as I mentioned, there are two types of parallelism here, right?
There's the parallelism from making parallel calls, the fact that we can make the two recursive
calls to Quicksort in parallel, and there's the parallelism in the actual partitioning, which is what
we did with the scan operation, the plus-scan: we packed the keys in parallel by doing that
plus-scan.
It turns out that if you just take advantage of the recursive calls (and if you go looking, you'll find
parallel Quicksorts like that all over the place: it's in the set of examples in Cilk++, in the set of
examples of Threading Building Blocks, in a bunch of different ones),
they all do the partition sequentially, and there's actually very limited parallelism, because
you basically have to do linear work for the partitioning before you can make the two parallel
recursive calls. The best you can do is still going to take you order N time,
however many processors you have, because the very first partition is done sequentially. It's sort
of an Amdahl's law effect; we can't do better than that. In fact, a nice way to look at this: if you
look at the span, sort of the critical path of dependencies in this computation, it's going to be
order N. The total work, the total number of comparisons I do, is N log N, and so the very best I
can do is divide the work N log N by the span N, which gives me the number of processors I can
make use of: I can make use of log N processors.
On the second slide I showed that curve of the multi-cores. Where we're at now, this is
perfectly adequate: we've got eight cores, and if we're sorting a million things, log N is 20. We
would be perfectly happy with this.
But if you go to 2015, that was 100 cores. This is not going to do: it will work up to 10 or 20
cores, but it's not going to go beyond there. Now, you could say, well, what if we did the other
thing, and we only took advantage of the parallelism from sub-selecting the keys greater than,
equal to, and less than the pivot?
I actually took this piece of Quicksort code off of the web somewhere (it's an example, I might
have typed it in wrong), but it's High Performance Fortran, a parallel language that was developed
in the early '90s.
It allows you to do this pack operation, the thing that packs all the elements less than the
pivot; inside of the pack there's a prefix sum, the scan operation. And so here we're picking the
elements less than the pivot and greater than the pivot. I guess here we don't do the equal
ones.
But we call two recursive Quicksorts, and I guess this is an in-place version, so it leaves the
results in the array. And the point is that these two recursive calls are done sequentially, but
picking out the elements is done in parallel.
And we can look at what the span, sort of the critical path, of this is. Because the recursive calls
are done sequentially, there's going to be a linear number of them: the number of leaves in the
recursion tree is basically one per key, so it's basically going to be N total function calls made to
Quicksort.
And if they're done sequentially, it's going to take N time. In fact it takes exactly order N time if
you do the analysis. You get the same span as before, and if you look at the parallelism, it's
log N; that's not very much, and we can make use of maybe eight cores and not many more.
On the other hand, if you take advantage of both forms of parallelism, the parallel function
calls and also the parallel partitioning (the sub-selection of the keys less than, equal to and
greater than the pivot), you actually get a lot of parallelism. Each block here is supposed to
represent the parallel work: there's log N span here to do the scan operation.
So if we look at the scan, it made a bunch of recursive calls, each half as big, so the total
depth there was log N; the critical path is log N here. And then
you can say there's about log N depth of recursion here. You can prove that
for Quicksort: it's about log N deep recursion, because it's approximately balanced in expectation.
And so the total work is still N log N, but now the span, the critical path through the computation,
is log squared N. And if you look at the parallelism, again work divided by span, we can
make use of basically N over log N processors. That's a huge amount of parallelism: if we have a
million elements, we're talking about 50,000-fold parallelism here. Here we've got a good parallel
algorithm.
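In work-span terms, the comparison just made can be summarized as follows (this is the standard
analysis, restated to match the numbers quoted above):

$$
W(n) = O(n \log n) \text{ in all three cases, while } S(n) =
\begin{cases}
O(n) & \text{parallel recursive calls only (sequential partition)}\\
O(n) & \text{parallel partition only (sequential recursive calls)}\\
O(\log^2 n) & \text{both,}
\end{cases}
$$

giving parallelism $W/S$ of $O(\log n)$, $O(\log n)$ and $O(n/\log n)$ respectively.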
But I guess what's important here is that we have to have a way to think about this. We have to
have a way to analyze it. As a person who's designing a parallel algorithm, I want to
compare two algorithms: I want to decide whether it's useful to have the first form of
parallelism, the recursive calls, or the second form, where I do the sub-selection, or both, and
how useful it is to have them.
I need a way to think about this. I need a way to analyze it. I need a way to compare.
So, again, this is part of the computational thinking: what NESL does is give
you a cost model which allows you to calculate the work of a computation when you make a
bunch of parallel calls. Let's say these were two calls to Quicksort here; the work is just the
sum of the two.
That's what it would take; basically it's exactly the sequential time. If I make two calls
sequentially, the total time is the sum of them. If I make two parallel calls, the total work is still
the sum: I take the work from one call and the work from the second, and even though they're
running in parallel I still have to add up the work.
The span of them is the maximum: if I make two parallel calls, I have to wait for the longer of
the two, so I take the maximum of those two spans.
And so I can calculate the span. In fact, if you do this formally for Quicksort, you get exactly what
I had on the previous slide: a span of log squared N and work of order N log N.
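Spelled out (these are the standard work-span composition rules, with schematic constants),
evaluating $e_1$ and $e_2$ in parallel costs

$$
W(e_1 \,\|\, e_2) = W(e_1) + W(e_2) + O(1), \qquad
S(e_1 \,\|\, e_2) = \max\bigl(S(e_1), S(e_2)\bigr) + O(1),
$$

and applying these rules to Quicksort, whose partition has $O(n)$ work and $O(\log n)$ span,
gives the expected-case recurrences

$$
W(n) = W(n_L) + W(n_R) + O(n) = O(n \log n), \qquad
S(n) = \max\bigl(S(n_L), S(n_R)\bigr) + O(\log n) = O(\log^2 n).
$$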
This gives you a formal way to determine what the work and span are, and then you might say:
well, why is that useful?
Well, it turns out that if you know the work and span, that's actually very useful for telling you
how fast it will run on any given number of processors. Now, there are certain assumptions here
about locality and caches and stuff, which I'll get back to.
But if you have a so-called greedy schedule -- so in the Quicksort with all these recursive
calls, you basically have a whole lot of parallelism, and you have to schedule that parallelism onto
the processors. I didn't talk about that; the point is that it can be done by a scheduler under
the hood.
In fact, most parallel systems like OpenMP have a built-in scheduler under the hood
nowadays. Some are better than others; in fact, the only decent one I've seen is the Cilk++ one.
So if you have a so-called greedy schedule, then you can show that the time is always at most
the work divided by the number of processors, plus the depth. So if I know the work and I know
the depth, I know quite a bit about the time.
You can also show that the time is going to be at least the maximum of
these two terms. And if you think about it, the difference between the sum and the maximum is
not that much: the most the ratio could be is two, when the two terms are equal, and if they're not
equal it's going to be less than two.
The argument for why the time is at least the maximum is very easy. If I've got so much work to
do and so many processors, the best I can do is evenly load balance that work across the
processors, so it's going to take me at least the work divided by the number of processors.
That's assuming I perfectly load balance it, and assuming I can't get super-linear speedup
because of cache effects or something.
Then there's the depth. The depth is the critical path: if I take the longest path of dependencies
through my computation, that's what this is, or the span, I should say.
D is the span. So it's going to take at least the span, and therefore at least the maximum of
these two. And you can actually show the upper bound if you use a so-called greedy schedule,
where a greedy schedule is just a schedule that does work whenever there's work to be done;
the normal notion of greedy.
So this tells me a lot about the time on a fixed number of processors. That's an argument for why
abstracting in terms of work and span is good.
Finally, as I've been saying all along, if I divide the work by the span, it gives me a
very good sense of what the parallelism is. Because it turns out this ratio tells me which of these
two terms will dominate, right? If I have P equal to W over D, that many processors, then the
terms are equal. I'd like the first term to dominate, because then I'm making good use of my
processors: I've got so much work and I'm getting that work divided by P.
So ideally I always want that term to dominate, and having P less
than the parallelism is exactly when that term dominates.
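In symbols, this is the classic greedy-scheduling (Brent-style) bound: with work $W$, span $D$
and $P$ processors, the running time $T_P$ of any greedy schedule satisfies

$$
\max\!\left(\frac{W}{P},\, D\right) \;\le\; T_P \;\le\; \frac{W}{P} + D,
$$

so whenever $P$ is well below the parallelism $W/D$, the $W/P$ term dominates and you get
near-linear speedup.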
>>: We're seeing a lot of new architectures that have SIMD lanes but their own sort of
program counters, in GPUs.
So in this case, within the SIMD lane, it might be better, especially for scan, to use a
different algorithm, not one that does this [inaudible] phase but the [inaudible] algorithm,
within the SIMD lane, because the constant involved is less. So even
though the work is more, you use that much work anyway.
>> Guy Blelloch: I agree. All these things are important, right? But I think the idea is, what I'm
describing now is the fundamental idea you teach a first-year student: they have to
understand the basic idea of a scan, right?
And then in our curriculum, for example, we have an introduction to data structures course for
freshmen. As sophomores we have the systems course, where we say actually the real machines
aren't RAMs: they have caches, they have these other effects. And that's the sort of thing you're
saying. The machine model is an ideal; like I'm saying, yes, there would be trade-offs, and these
are the next level of things that should be taught.
Okay. So more generally, your comment about trade-offs in work comes up in a lot of
different places. That's true. So a few more observations. One is that it's often important to take
advantage of both the data parallelism, which is sub-selecting elements less than or greater than
the pivot (something that's easy to do on a GPU), and also the function parallelism, the recursive
calls to Quicksort, which are more difficult to do on a GPU. In order to get good speedup you
have to do both of those.
The other is that it's important to have some way for the user to get some sense of the costs, to
compare. Whenever we do algorithms, we want users to be able, at a rough level, to
compare insertion sort with Quicksort: one's N squared, one's N log N.
It's not perfect, because it doesn't take account of caches, and the work and depth that I've
described are not perfect in the same sense; they don't take account of secondary things.
But they give you at least a high-level sense of how parallel an algorithm is. And I think it's very
important for people in general, programmers in general, to be able to quickly look at two
algorithms and have a rough sense of how parallel one is versus the other, and then they can
proceed from there.
A topic I didn't really talk about is that there are a lot of different ways to schedule that
parallelism, and there's actually been a lot of very nice work in the last 15 years on scheduling
parallelism.
But that's more of an advanced topic. Okay. I just want to go through a few more quick examples
here to show that it's relatively easy to think about parallelism. This is a recursive block matrix
multiply. In order to multiply two matrices A and B, it blocks each of them into a two-by-two grid
of four blocks, recursively, and makes a bunch of recursive calls to matrix multiply on these
smaller blocks: I multiply, I guess, this one with that one and this one with that one, add those two
together, and it gives me the upper left corner of my result.
So there are actually eight recursive calls here. And if you analyze the work: you make eight
recursive calls, each half as big, plus N squared work for the matrix add here (add
this result to there: N squared elements). The total work is N cubed, which is exactly the
sequential work we would expect.
If you analyze the depth: all these recursive calls can be done in parallel, every single one of
them. There are eight recursive calls and they can be done in parallel, and if I take the maximum
of eight things which are all equal, well, it's just one of them.
And then there's a little more depth because of the matrix additions, but I can do those in
constant depth in this model. So the depth here is only log N. I've got a very shallow
computation here.
This is an extremely parallel algorithm. Look at the parallelism: N cubed over log N. For a
thousand by thousand matrix, I can make use of millions of processors; much more parallelism
than I need. You could argue there's too much parallelism, and then it's really up to the scheduler
to figure out which part of that parallelism to take advantage of.
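As recurrences, matching the numbers above:

$$
W(n) = 8\,W(n/2) + O(n^2) = O(n^3), \qquad
S(n) = S(n/2) + O(1) = O(\log n),
$$

so the parallelism is $W/S = O(n^3/\log n)$.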
You can do matrix inversion. This is the so-called Schur complement method for doing matrix
inversion: you can do a block method, and if you do the recursion you get N cubed work here.
What's interesting here is that there's actually a dependence; there are sequential steps. In
particular, there's an inversion here: you have to compute this D inverse first before you
can generate the S inverse.
You have to do this inversion before you can do that inversion, so there's a sequential
dependence, and that's why there's a 2 in the recurrence here. If you solve that, the actual span
of this algorithm is order N instead of log N.
But it's still extremely parallel. The parallelism here, the work divided by the span, is N squared,
so I can still take advantage of N squared processors in sort of the best case.
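For reference, since the slide isn't reproduced here: the textbook block-inversion identity behind
this (with block names that may differ from the slide's) inverts
$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ via the Schur complement
$S = A - B D^{-1} C$ as

$$
M^{-1} =
\begin{pmatrix}
S^{-1} & -S^{-1} B D^{-1} \\
-D^{-1} C S^{-1} & D^{-1} + D^{-1} C S^{-1} B D^{-1}
\end{pmatrix},
$$

and the two inversions, $D$ first and then $S$, are what put the factor of 2 into the span
recurrence $T_\infty(n) = 2\,T_\infty(n/2) + O(\log n) = O(n)$.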
Okay. You can also do a Fourier transform. By the way, all the examples I'm showing now are
standard sequential code that you might have seen in an algorithms course; they didn't tell you
it's parallel, but it's just naturally parallel. All you have to do is analyze the depth. In the
examples I've shown, except for the scan example, these are all naturally parallel.
So here you just make two recursive calls (in fact, this is actually the complete NESL code for
an FFT), and do some parallel multiplies and adds here; it's the standard fast Fourier transform,
with the standard work. And the depth is actually very shallow: it's log N here, and the
parallelism you can take advantage of is order N. This is actually similar to the Quicksort code in
style, in the sense that you've got the function-call parallelism from making two calls to FFT in
parallel, and also the data parallelism, because this statement here is a data-parallel statement
over N elements.
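In the same notation, the standard FFT analysis matches the numbers just quoted:

$$
W(n) = 2\,W(n/2) + O(n) = O(n \log n), \qquad
S(n) = S(n/2) + O(1) = O(\log n),
$$

for parallelism $W/S = O(n)$.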
Okay. I wanted to show one more example that requires something more than just reanalyzing
sequential algorithms. This is more like the scan, where we actually rethink the algorithm a little
bit; we have to do a little bit of parallel thinking in order to do a merge here.
Okay. So this code here is actually sequential code for merge. It happens to be ML code; it
doesn't really matter what it is. It's just the standard algorithm where you start at the
beginning of two lists (these are linked lists in this case), and you look at the two front elements;
whichever one is less, you pull it off the front of its list, go to the next one, and continue down the
lists.
So that's the complete code for it, until you get to the end; this is just the base case: when you
get to the end of one list you return the other list.
What about doing this in parallel? This is a very sequential thing; like the scan operation,
we would think about it very sequentially. The goal would be that if you're experienced in
parallel thinking, you should just know the answer to this immediately. But again, I don't think, if
we gave this quiz to most of our exiting students, that they would know how to do merge in
parallel.
And I don't know what the experience of people in the room is, but I would say most
students likely wouldn't. It turns out there are actually very easy algorithms for doing merging in
parallel, and everyone should understand these; everyone should have been trained in these
when they were undergraduates.
Let me just start. The algorithm I'm going to show (and there are different
algorithms) is based on assuming that our inputs are balanced trees, in order. So my input is
going to be a tree with some key at the root, all the keys less than it on the left and all the keys
greater on the right. I'm going to take two of these trees and I'm going to merge them.
Because the first thing you should notice is that if I start out with a linked list, I'm never going to
be able to do better than linear span: to get the next element in the linked list, I'm
going to have to basically follow that linked list.
The first thing you should say is, if I start with the linked list I'm screwed. So let me start with a
tree. It turns out there are versions that do it on an array, but the particular version I'm going
to do works on a tree.
I take two trees which both have their keys in order, and I'm going to merge those two. And
here's just a simple piece of code (this is actually ML code again) that's given a key, a pivot here,
and a tree; the second argument is a tree, which is either an empty tree or a tree with a node in
it. You basically go down and split it: you follow down a bunch of nodes
until you get to a leaf, breaking off the elements less than the pivot and the ones greater than the
pivot.
So this basically splits into the keys less than and greater than the pivot. It's different from what
we did in Quicksort, because our elements are already sorted; this actually only takes logarithmic
time, because I just go down the tree. In logarithmic time, assuming this tree's balanced, I can
split this tree into two.
Because it's already sorted, I am just going down one path. If this were an array, this would
correspond to a binary search: if you gave me a pivot and an array, in order to figure out where
that pivot goes in the array I would do a binary search on it. That's the array version of this step.
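The ML from the slide isn't in the transcript, so here's a minimal OCaml sketch of the split (my
reconstruction; it returns unbalanced results, whereas a real implementation would rebalance on
the way back up to keep the logarithmic bounds):

```ocaml
(* Search trees over integers: empty, or Node (left, key, right),
   with everything in [left] less than [key] and everything in
   [right] greater or equal. *)
type tree = Empty | Node of tree * int * tree

(* split pivot t: one root-to-leaf walk that breaks t into the
   elements less than the pivot and those greater or equal.
   O(log n) span if t is balanced. *)
let rec split pivot t =
  match t with
  | Empty -> (Empty, Empty)
  | Node (l, k, r) ->
    if k < pivot then
      let rl, rr = split pivot r in
      (Node (l, k, rl), rr)      (* l and k are all below the pivot *)
    else
      let ll, lr = split pivot l in
      (ll, Node (lr, k, r))      (* k and r are all at or above it *)
```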
And that's basically two-thirds of the whole code for merging, right there. This is the rest of it; I
guess I'm missing the base case, actually.
But I'm done now; here's a parallel, extremely parallel, piece of code for merging. Let
me just go through it. Like I said, I start out with two trees, A and B. This is my tree
A; that's my tree B. I assume that they're both balanced, and I'm going to return a tree as output
which is itself also balanced. And how am I going to do that? I'm going to take my first
tree and pull out the root. That's some element that's greater than all these elements and less
than all those elements.
Now I'm going to use that root to split the other tree, using the code
on the previous page. That gives me two trees: the elements of B less than the pivot
and the ones greater. And I just recursively merge the AL with the BL, all the elements
less than the pivot, and the AR with the BR, all the elements greater than the pivot, and I'm
done.
Then I just put my pivot in the middle, the thing I pulled out at the beginning. So that's a piece of
code. Now you might say, well, I haven't analyzed this: is this an efficient piece of code, if I wrote
it sequentially, if I made these two recursive calls one after the other? The parallelism comes
from the fact that I can make the two calls to merge in parallel.
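Using the `tree` type and `split` from the sketch above, the whole merge (again my
reconstruction of the slide's ML, with the base case the speaker mentions put back in) is just:

```ocaml
(* Merge two search trees: take A's root as the pivot, split B with
   it, and merge the matching sides.  The two recursive calls are
   independent, so they can run in parallel; with balancing (omitted
   here) this is O(n) work and O(log^2 n) span. *)
let rec merge a b =
  match a with
  | Empty -> b                               (* the missing base case *)
  | Node (al, m, ar) ->
    let bl, br = split m b in
    Node (merge al bl, m, merge ar br)       (* two parallel calls *)
```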
If I analyze the work, what is it? It turns out that if I do the
merging this way, the work is actually linear. You get a recurrence that basically says it's 2 times
the work on N over 2, plus log N, the time for doing the split. That just solves to linear.
This is a linear-work algorithm. If I ran it sequentially, in big-O terms it's just as fast as the
version that walks in from the heads of the lists.
In practice it's probably a factor of 2 slower because of various overheads, dealing with the data
structures, et cetera. But it's at least within a constant factor.
So even if I did this sequentially it would be reasonably good. If I analyze the depth, well, there's
this log N time here for doing the split, and the depth of the recursion is log N. So each level of
my recursion has log N span, and the whole recursion has log N levels, so the whole depth is the
product of those two: it's log squared N.
And this is just using those rules, where I take the maximum of the span of my [inaudible]
recursive calls and then add in the span for the split.
And that gives me these bounds. Anyway, it's a reasonable piece of code; it fits in hardly any
space. I guess I cheated a little bit and left the base case off here, but it's not that much more
complicated. This is something that everyone should understand.
Again, if you want to write an optimal version, I'm sure you'd do something fancier, but at a
minimum you should understand that this is a way to do merging.
Now, actually, just an interesting comment here: it turns out you can actually improve the span
here from log squared N to log N using an interesting old concept from parallel computing called
a future.
This is actually an advanced topic; I wouldn't teach it to a freshman. But it's sort of
interesting to realize that as I take my pivot M and start splitting this tree,
I can actually start feeding the results from my BL and BR to my recursive
calls before the split is even done. That gives you sort of a pipelining effect.
You can do this with futures, and I think futures are something a few people in this room have
worked with before. Basically, all you do is, when you call this split operation here, you wrap it in
a future: whenever I do my recursive split, I wrap it in a future. What a future is, basically, is it
says go off and do this computation and return me a handle immediately, which I can pass
around to other people; and when I want the actual value, I ask that handle for the value. If it's
ready, it gives it to me immediately; if it's not, it suspends me until it's ready and then gives it to
me.
That's what a future is. And if you just take the split operation and wrap futures
around those two recursive calls and then use this, it turns out you can show that the depth here
reduces to a log N span; you actually get more parallelism. Now, analyzing this is more
complicated and, again, I wouldn't make it the first thing you teach in parallelism. But it goes to
show that you can have a formal way of analyzing what a future, what this pipeline parallelism,
will give you.
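Just to illustrate the handle/force pattern (not the full pipelined merge), here's a minimal future
on top of OCaml 5's Domain module; an assumption on my part, since the talk predates it, and a
real implementation would use a thread pool rather than one domain per future:

```ocaml
(* A future: start the computation immediately and hand back a
   handle; forcing the handle blocks only if the value isn't ready. *)
let future f = Domain.spawn f      (* returns an 'a Domain.t handle *)
let force h = Domain.join h        (* waits, if needed, for the value *)

let example () =
  let h = future (fun () -> 21 * 2) in  (* runs concurrently *)
  (* ... do other work here while the future computes ... *)
  force h                               (* 42, once it's ready *)
```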
Okay. So more observations. One is that divide and conquer is even more useful in parallel than
sequentially. Of course, we've all been using divide and conquer algorithms forever, but when
you did merging you didn't use divide and conquer; you used something that started at the
beginning of the list. So even for merging and other cases like this, things you haven't used
divide and conquer for, when you start thinking about how to do them in
parallel, you should be thinking about how to do them with divide and conquer.
The other thing is that trees are better than lists for parallelism. In fact, lists are a real drag, not a
good data structure for parallelism. If you start thinking in parallel, you're basically going to stop
building data structures out of linked lists. You're going to build them out of trees, or arrays,
which are okay; lots of things are okay, basically anything where you can access any element
in time less than the size of the data structure. If I have a linked list, I have to go all the way
down to the end of the list to access that last element.
And the other, more advanced, point here is that pipelining can
actually asymptotically improve the depth of the algorithm; the span of the algorithm, I should
say.
Okay. So, observations. Here's a summary of the observations in the talk.
There was a general observation that natural parallelism is often lost in low-level
implementation. That was the Quicksort example, where we should be teaching people that this
is a parallel algorithm when we teach them Quicksort, and we should do that from day one.
And people should just understand that. And not just that the two recursive calls are parallel, but
the fact that we can pick out the elements greater than, equal to, and less than the pivot in
parallel.
There are a lot of lost opportunities in describing parallelism to undergraduates, in general. And
another observation is that lots of things which seem sequential are actually not that hard to
parallelize. Two examples were merging and the scan operation.
I hope I convinced you those were quite simple algorithms; I could write the pretty much
complete code for each in 10 or 15 lines, right?
Then some model and language issues: taking advantage of both data and function parallelism,
and the need for some abstract cost model so that at least you can get some grasp on the sort
of parallelism that's available. So we talked about work and span; divide work by span and it
gives you the number of processors you can use.
And the work also tells you whether it's going to be as efficient as the sequential algorithm in
terms of the total work it does.
I didn't really talk about scheduling. And another comment, which I didn't really say much about
except on one of those early slides, is that all the examples I've shown are deterministic parallel.
However you schedule any one of those bits of code from this talk, it won't give you a
different answer. A lot of the complication people have in designing parallel code
is just debugging: if you get different answers each time you run it, that nondeterminism
is going to give you a huge headache. Everything I showed you will return the same result; it's
got sequential semantics. I think that's very important if all you're doing is designing an
algorithm.
Now, like I said, there are a lot of applications where you need concurrency. In fact, there are a
lot of applications which run on one processor where you need concurrency, like an operating
system. So we do have to understand interleaving and coherency protocols and whatever. But if
that's not what you're doing, if you're designing an application to run in parallel where the
application does not interact with the environment around it, or interacts at a coarse level instead
of a very fine-grained level, then you should try to be deterministic.
I talked about a bunch of different algorithmic techniques. I talked about recursing on smaller
problems: you take a big problem, as in the scan, solve a smaller instance, and use it to help you
solve the big one. This comes up over and over again. I didn't talk much about how state
transitions can be aggregated. And I talked about divide and conquer, and trees, and pipelining.
I teach a course in parallel algorithms, and there are actually a bunch of other
core ideas in parallel thinking which I didn't even get to today. There are ideas in graph
contraction for designing graph algorithms, and in identifying independent sets, which is useful in
a bunch of different algorithms, some in computational geometry like Delaunay triangulation, et
cetera; symmetry breaking. These are some of the other core ideas in parallel algorithms, in
thinking about solving deterministic parallel problems.
Okay. So I'm short on time. I also had some slides on locality; I'll just quickly mention that one
issue with everything I've talked about so far is that I haven't talked about locality at all. And I
think locality is actually a very interesting problem in parallel computing, because we're now at
least somewhat used to analyzing algorithms for caches sequentially.
When you get to parallel computation, dealing with locality is a whole other bag of worms. But I
actually think there are very nice ways of dealing with locality in an abstract way, in the same
sense that what I've been showing is an abstract way to think about parallelism. There are
abstract ways to think about locality where the locality is tied much more to the application, to
properties of the application or the algorithm, than to particular hardware.
So you can define locality in terms of those properties: we can come up with some measure of
locality, like we did for work and span, based on just the algorithm itself, and that will help you
map it onto a machine that has multiple levels of caches, or local and nonlocal memory, et
cetera.
Okay. So to finish up on the idea of parallel thinking: I've really concentrated on just one
piece of what I call parallel thinking, which is deterministic parallel algorithms, without
any concurrency; there was no nondeterminism in what I just showed. I think it's still
important to do concurrency: there are applications where there's just inherent
nondeterminism in the environment; you're getting requests online and you have to deal with
them. So I think it's important, like I tried to do in the case of deterministic
parallel algorithms, to identify what the core ideas are, what
I think everyone should know about parallel thinking in that context.
There's a similar set of ideas that you'd want to know about the concurrency context. And I've
listed some of it here.
And also similar sorts of ideas that you'd want to know sort of in terms of architectures. And
cache coherency protocols, et cetera.
Okay. So I guess what I've talked about today is actually based on 30 years of research by lots
of different people. So a lot of the algorithms aren't my algorithms. And a lot of it's come out of
the theory, algorithms community and also the programming language community, although I've
often twisted it a little bit.
So my conclusions. Educating programmers for parallelism is key. And, again, I think
there's been much too much emphasis on teaching a particular parallel programming paradigm:
OpenMP, or some language like CUDA for graphics processors, or whatever.
You give someone a book on one of these topics and they go off and learn how to use it, but
they don't understand the high-level concepts of parallelism. I think there's too much emphasis
on that, and I think this is very important.
I also think it's very important for people to clearly separate what I call parallelism and
concurrency; like I said, people have argued about whether these are the right terms. But clearly
separate when you're trying to take advantage of multiple processors in your hardware
versus when your application actually has concurrency in it:
it's a real-time application, you're getting demands from the outside world and
you have to deal with them, and they're coming in a nondeterministic order.
Another comment -- this is more for me at a university than for a company like Microsoft -- is
that it's important to teach these ideas from day one. When we first start teaching computing, in
our very first data structures course, our very first programming course, we should be teaching
parallelism. In fact, both parallelism and concurrency should be taught from day one.
I think concurrency is a little bit more difficult -- it's a more advanced topic -- so probably
only the simplest things should be brought in on concurrency, but some of it should be introduced
from day one. I know MIT has an introductory course where they start introducing concurrency
from day one.
Then I guess the last thing is, you know, I tried to go through what I think are some of the core
ideas in parallel algorithms, deterministic parallel algorithms. I think it's reasonably important
for the community to understand what the core ideas in all these different areas are, so that
everyone is properly trained in how to think about parallel algorithms and how to think about
concurrency, so that we can move forward on making use of these parallel machines.
Okay. That's it. Questions.
>>: Does this depend on using a functional language?
>> Guy Blelloch: Well, basically what the functional language does is give you guarantees that
it's correct. If you don't use a functional language, then you have to use a race detector --
that's the Cilk approach: you write in a C-like language, you run your race detector, and it
tells you if there are any races. Typically the race detector slows the program down by a factor
of 10, so you can't have it running all the time. But you run it on some test data and hope that
it detects the races on that test data. What the race detector guarantees is that when you make
parallel calls there are no races. If it passes the race detector, then it will be deterministic.
And the other approach is just to cross your fingers, I guess.
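To make the race point concrete, here is a minimal sketch in C++ threads (not Cilk, and not code from the talk): the first version has a determinacy race that a detector would flag, while the second confines each parallel call to its own location, so the result is deterministic.

    #include <numeric>
    #include <thread>
    #include <vector>

    // Racy version: the spawned thread and the parent both update the
    // same variable 'out' with no synchronization -- a determinacy race.
    // The final value depends on scheduling; a race detector flags this.
    long racy_sum(const std::vector<long>& a) {
        long out = 0;
        std::size_t mid = a.size() / 2;
        std::thread t([&] { for (std::size_t i = 0; i < mid; ++i) out += a[i]; });
        for (std::size_t i = mid; i < a.size(); ++i) out += a[i];  // races with t
        t.join();
        return out;
    }

    // Race-free version: each parallel call writes only its own result,
    // and the two are combined after the join, so the answer is
    // deterministic regardless of scheduling.
    long safe_sum(const std::vector<long>& a) {
        std::size_t mid = a.size() / 2;
        long left = 0;
        std::thread t([&] { left = std::accumulate(a.begin(), a.begin() + mid, 0L); });
        long right = std::accumulate(a.begin() + mid, a.end(), 0L);
        t.join();
        return left + right;
    }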
>>: So you've talked about ideas for thinking about parallel algorithms and for how to present
them to, say, undergrads. Today, when we were talking about scan or matrix multiply and such, we
usually look at a PRAM model, where we talk about actual data structures and how the computation
is structured. And then there's what I'd call a model of circuit families, where you show the
data flow -- in that model of parallel thinking you'd talk about size and depth complexity; you
talked about it in terms of work and span. And futures can be represented in circuit families by
the normal data flow: a wire coming down to the other part.
So I was wondering: the circuit-family, data-flow style of thinking about parallel algorithms is
very nice because it's visual, and you don't get lost in how to map your trees into cells of
memory or anything that's very sequential, because it's all connected by rubber bands at the wyes
and such. Is the model of circuit families for parallel computation something we should be
teaching to undergrads, along with how to take such a circuit and schedule it onto real hardware
or onto data-flow computing? It seems like a more intuitive way of thinking about it.
>> Guy Blelloch: It's very much that, I feel. If you noticed, I never used the term PRAM here
until the very last slide, because this isn't the PRAM model. In the PRAM model you have
processors and you do your own scheduling.
This is a very different model. In fact, if you look at the merge code or the quicksort code, you
don't see processors. You might or might not have noticed that when I did quicksort and analyzed
its complexity, I didn't mention processors a single time. The term processor never showed up.
I believe that's very important: this piece of code is a parallel piece of code, and it doesn't
say anything about processors.
>>: It also abstracts into structures as opposed to indices and arrays, which I thought was nice.
>> Guy Blelloch: So I would say that's definitely the right approach, because it's more abstract.
You can basically take this code, work out the work and the depth, and only at the last step say:
if I had P processors, how fast would that run?
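The talk's code isn't reproduced here, but a minimal C++ sketch of that style might look like the following (std::async stands in for a fork-join construct; a runtime like Cilk would schedule far more cheaply). Note that nothing in it mentions processors:

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Fork-join quicksort over the half-open range [lo, hi). The code
    // says only which calls may run in parallel; a scheduler maps the
    // calls onto whatever cores exist.
    void quicksort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
        if (hi - lo < 2048) {                 // small ranges: sort sequentially
            std::sort(a.begin() + lo, a.begin() + hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2];
        auto first = a.begin() + lo, last = a.begin() + hi;
        // Three-way partition guarantees progress even with repeated keys.
        auto m1 = std::partition(first, last, [&](int x) { return x < pivot; });
        auto m2 = std::partition(m1, last, [&](int x) { return x == pivot; });
        std::size_t i1 = m1 - a.begin(), i2 = m2 - a.begin();
        // Fork: the two recursive calls are independent, so their work
        // adds and the span is the max of the two sides.
        auto left = std::async(std::launch::async,
                               [&a, lo, i1] { quicksort(a, lo, i1); });
        quicksort(a, i2, hi);
        left.get();                           // join
    }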
>>: So my question is: should we be presenting data-flow computing or circuit families to
undergrads? Because you can show me the equations all hour and I still don't really get it until
I see your example for eight or 16 elements or whatever, visually showing me the flow of the
computation on the screen. So this is --
>> Guy Blelloch: Right, this is very much that -- this data-flow style of programming, basically.
Like I said, it really only comes down to two trivial ideas -- sorry for flipping through the
slides. If you make two parallel calls, this is really everything; the whole thing is
encapsulated in the equations: the work is the sum of the work of the calls, and the span is the
max of their spans.
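Written out -- this is a reconstruction of those two equations, with the convention that the composition itself costs a constant -- parallel composition gives:

    W(e_1 \parallel e_2) = W(e_1) + W(e_2) + 1
    S(e_1 \parallel e_2) = \max\{S(e_1),\, S(e_2)\} + 1

and for sequential composition both the work and the span simply add.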
By the way, about the term span: I've used depth instead of span in my own work, and Charles
[inaudible] and the Cilk work have always used critical path. About two years ago we got into a
fistfight and agreed on using span as a compromise.
But it's not a standard term. Some people use critical path, some people use span, some people
use depth.
>>: From a conceptual point, visually seeing that data-flow graph really illustrates it: the work
is the number of boxes and the span is the depth, or whatever.
>> Guy Blelloch: Yeah, I think as well as showing the equations, it's very useful educationally to
show these sorts of boxes.
>>: I just wanted to say, as a practical thing to get programmers to do this, based on my group's
experience: the key thing was getting us eight-core machines. All of a sudden it became
worthwhile for us to put in the effort to learn to do the simple stuff, and if you're in a
language that allows functional programming but doesn't require it, it motivated us to get rid of
those side effects.
>> Guy Blelloch: Yeah, I think this eight-core point -- I think there's going to be a huge jump.
Somehow four cores was not enough parallelism to motivate anything. Most of what I showed today,
if you actually code it up, loses a factor of two or three just from going to parallel
algorithms. The scan is a good example: there are twice as many operations in the parallel scan
as in the sequential scan, so you lose a factor of two off the bat.
If you've got four cores you're not going to be satisfied. So I think you really need at minimum
eight, and I think it will make a huge difference when we get to 16, 32, 64. That's when people
will get it; they'll say, wow, parallelism is useful. Somehow four is not enough, because with
the overhead of parallelism you lose more than half of it right off the bat.
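To make that factor of two concrete, here is a sketch (assumed, not the talk's code) of the standard work-efficient scan: an upsweep and a downsweep, each doing about n additions, against the single pass of n additions a sequential scan needs.

    #include <cstddef>
    #include <vector>

    // Work-efficient exclusive scan (upsweep/downsweep), assuming the
    // length n is a power of two. It does roughly 2n additions where the
    // sequential scan does n. Within each level the iterations touch
    // disjoint elements, so every level could run fully in parallel,
    // giving O(log n) span; plain loops are used here for clarity.
    void scan_exclusive(std::vector<long>& a) {
        std::size_t n = a.size();
        if (n == 0) return;
        // Upsweep: build a tree of partial sums in place (~n additions).
        for (std::size_t d = 1; d < n; d *= 2)
            for (std::size_t i = 2*d - 1; i < n; i += 2*d)   // independent
                a[i] += a[i - d];
        a[n - 1] = 0;
        // Downsweep: push prefixes back down the tree (~n more additions).
        for (std::size_t d = n / 2; d >= 1; d /= 2)
            for (std::size_t i = 2*d - 1; i < n; i += 2*d) { // independent
                long t = a[i - d];
                a[i - d] = a[i];
                a[i] += t;
            }
    }
    // e.g. {1, 2, 3, 4} becomes {0, 1, 3, 6}.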
>>: The people that are writing the tools and libraries now have those eight cores, so they're
motivated to do it for themselves and not just for some abstract machine that will be good in the
future.
>> Guy Blelloch: Yeah. It's true that in the course I teach, I had a four-core machine last year
and an eight-core machine this year. It made a difference: people felt more satisfied getting a
factor of five speedup as opposed to a factor of two or two and a half.
>>: Playing the devil's advocate here: you used as justification those comments about 100
processors by 2015, I think it was. A lot of people don't believe that. In fact, it could stop as
early as eight. If it does stop as early as eight, can we forget all this?
>> Guy Blelloch: It would become more of a niche field -- people will still have much bigger
parallel machines. But if you don't have it on your laptop, then it wouldn't be something you'd
use on your laptop.
>>: This is really a hard sell, right? People have been saying for a long time we should be
teaching parallel thinking. And as you said Carnegie Mellon doesn't do it very well yet.
>> Guy Blelloch: Well, it is based a bit on this belief that we will have more than -- like I
say, eight cores might be marginal; it might become useful at 16 or 32. But I think we only have
to get up to 32 before it becomes clearly useful. At 8 and 16 it's sort of like, well, it's quite
a bit of work for this, and there are probably other ways I can take advantage of 8 cores. Once
you get to 32, I think you have to start thinking this way.
And personally -- I'm not sure we're going to have 100 cores in 2015. I certainly wouldn't put a
big bet on that. But I'm quite sure we'll have at least 32 cores by 2015, and more beyond that.
I mean, maybe I've been misled by Intel. But I think so.
>>: They're already integrating GPUs on the northbridge, right? So you already have literally
tens of thousands of logical processing elements that are now sharing the same physical address
space. So scan and some of these other things for large datasets could now be used by the
operating system, instead of having [inaudible] across PCI Express to a separate memory. So I
think we already have that now. We have tens of thousands of elements, basically doing latency
hiding -- we have this bandwidth for pushing instructions through that we didn't have before, on
the same physical memory.
>> Guy Blelloch: Dave?
>>: So are we making a mistake by not emphasizing communication costs as a first-order concern,
even in our introductory courses?
>> Guy Blelloch: Yeah, that's why I brought up the locality issue at the end: I don't think you
can forget about communication costs completely.
I do actually think you could ignore them at the freshman level. But even by the sophomore level,
I think you probably have to start talking about locality. So at the end I had a couple of slides
on locality. But I think even in sequential computing it's questionable, right? There's a
100-fold difference between a level-one cache hit and a complete cache miss. Even sequentially --
and yet we still teach quicksort without worrying about the cache. It turns out quicksort is okay
for cache locality, because it scans the memory in order.
A lot of these algorithms are okay -- in fact, a lot of the algorithms I showed are reasonably
cache friendly. The point is that if we are going to talk about locality, we want to do it in a
way that isn't at a low level -- that isn't talking about processors and communication -- but
that somehow captures a property of the program. Take recursive matrix multiply: there's a huge
amount of cache locality there. We all know matrix multiply is about the best possible thing you
can do for a cache, right? The reason is that if you look at all the cross terms, they're all
sharing the same matrices, just doing the same thing over and over again.
If we can somehow capture that at a high level, I think it would be helpful; otherwise locality
on parallel machines is a nightmare. There are so many different ways you can have locality on
parallel machines, and it's much more complicated than in the sequential case.
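For reference, a minimal C++ sketch of that recursive structure (my reconstruction, not code from the talk; it assumes row-major storage and n a power of two). The point is that the recursion automatically reaches a block size where the subproblems fit in cache, without the code ever naming the cache parameters:

    #include <cstddef>

    // C += A * B for an n x n block inside matrices of leading dimension
    // ld. Divide and conquer: each level splits the problem into eight
    // half-size multiplies. Once a subproblem's three blocks fit in some
    // level of cache, everything below it runs out of that cache.
    void matmul(const double* A, const double* B, double* C,
                std::size_t n, std::size_t ld) {
        if (n <= 16) {  // base case: ordinary triple loop on a small block
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t k = 0; k < n; ++k)
                    for (std::size_t j = 0; j < n; ++j)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        std::size_t h = n / 2;
        const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
        const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
        double       *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
        // Eight half-size multiplies. Each pair below writes a distinct
        // quadrant of C, so the four pairs are independent and could be
        // forked in parallel.
        matmul(A11, B11, C11, h, ld); matmul(A12, B21, C11, h, ld);
        matmul(A11, B12, C12, h, ld); matmul(A12, B22, C12, h, ld);
        matmul(A21, B11, C21, h, ld); matmul(A22, B21, C21, h, ld);
        matmul(A21, B12, C22, h, ld); matmul(A22, B22, C22, h, ld);
    }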
>>: What do you think is the future of transparent caching as we scale up to 100 or 1,000 nodes?
For primitives like scan and matrix multiply, are we going to continue with the same sort of
caching paradigm that we have for [inaudible], or are we just going to have to manually manage
that as part of our algorithm design?
>> Guy Blelloch: Well, I hope you wouldn't have to. Certainly no one would want to manually
manage the cache. It's hard enough to manage the cache manually in the sequential case -- that's
why we went to automatic caching -- and it would be much harder in parallel.
So yes, I would certainly hope that it's done under the hood for you, ideally in hardware. As for
cache coherency schemes, whether you need the strictest schemes or can get away with relaxed
ones -- maybe you can get away with relaxed consistency if the system on top knows what to do
with it. But I think we do need it to be automatic. That's my opinion.
Just one more.
>>: One of the trickier things we get taught in sequential programming is how to do everything in
place. And from the pictures of the algorithms you showed, they seem to be less in place than the
equivalent sequential algorithms. Is that a characteristic across these algorithms?
>> Guy Blelloch: It is a characteristic. Like I said, you sort of lose a factor of two with these
algorithms out of the box. For quicksort, in fact, the factor of 2 comes from the fact that the
parallel version, where you do a subselection, is not an in-place version. That means you're
going to touch twice as much memory, probably, and that's going to give you a factor of 2 -- it's
not more than a factor of 2. But in a lot of these cases, yes, because you're doing things in
parallel you have to do them in separate spaces. That's a comment about quicksort specifically,
but generally you often have to work in separate spaces, and that means touching more memory.
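As a sketch of where that extra memory traffic comes from (my construction, not code from the talk): the parallel subselection writes into a fresh array via a flag/scan/scatter pattern, touching roughly twice the memory of an in-place partition.

    #include <cstddef>
    #include <vector>

    // Out-of-place parallel partition: flag, scan, scatter. Every element
    // is written into a second array instead of being swapped in place,
    // so it touches roughly twice the memory of the sequential in-place
    // partition. Each pass is a map, a scan, or a scatter, all of which
    // have low span; plain loops are used here for clarity.
    std::vector<int> partition_out(const std::vector<int>& a, int pivot) {
        std::size_t n = a.size();
        std::vector<std::size_t> flag(n), pos(n);
        for (std::size_t i = 0; i < n; ++i)       // map: mark the "less" side
            flag[i] = a[i] < pivot ? 1 : 0;
        std::size_t sum = 0;
        for (std::size_t i = 0; i < n; ++i) {     // exclusive scan of flags
            pos[i] = sum;
            sum += flag[i];
        }
        std::size_t nless = sum;                  // count of elements < pivot
        std::vector<int> out(n);
        for (std::size_t i = 0; i < n; ++i)       // scatter to fresh array
            out[flag[i] ? pos[i] : nless + (i - pos[i])] = a[i];
        return out;
    }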
>>: Do you think that will inform computer architecture? Is there something you've thought of
that can be done with the way things are cached to be more friendly to that pattern? Because
caches today are very friendly to this in-place sort of stuff.
>> Guy Blelloch: It could be. If you think about functional languages, they don't do things in
place, right? Because unless you optimize, you're always copying to a new location.
And there has been work showing that the sort of caching schemes you'd want for a functional
language are different. So maybe the same sort of thought process would say the same thing here.
>>: Great. Well, let's thank Guy.