>> Galen Hunt: Good morning. It's my pleasure to introduce Rebecca Isaacs, who will be speaking on efficient data parallel computing on small heterogeneous clusters. Actually, I know what she's talking about. Rebecca is, of course, from the Cambridge lab. Been there 10 years. >> Rebecca Isaacs: Eight. >> Galen Hunt: Eight years. My illustrious colleague. Served on a number of committees with her. And it's my pleasure to have her here. >> Rebecca Isaacs: Thank you, Galen. Thank you for having me. And thank you all for coming to my talk. I was actually just remembering that it was about seven and a half years since I last gave a talk here. So maybe it's time to give another one. Okay. So the work I'm going to talk about was done jointly with Paul, who's sitting here, Richard Black, and Simon Peter from ETH Zurich, who was our intern last summer and whose advisor is Timothy Roscoe. So the motivation for this work: sometimes when you're trying to run an application, it actually requires a lot more resource than you have on your desktop computer or laptop or whatever, and basically it pegs the CPU or pegs the disk or whatever. A feeling that I have, and I'm sure many of you do, too, is that there are all these other computers in my office or in my house, and why is it so hard to use those computers as well? Basically we lack tools to be able to have this sort of spontaneous use of these other computers. We actually have a sort of longer term vision of the disaggregated PC. I'm sure that everyone's heard other talks about this idea. In the disaggregated PC all those computers will be managed by one operating system, and that operating system will understand how they're all configured and the topology of the network between them. It will monitor performance. It will learn models. It will schedule these programs on all these computers. It will just be really amazing. But that, of course, is a long way in the future. So in the short term, we thought, well, can we apply some of the techniques that are in use now for doing large scale data parallel programming to these smaller scale clusters? So here's a more visual representation of that idea. We have a whole spectrum of parallelism. We've got the single machine on the left going off to homogeneous clusters and large data centers. So what's in the middle? Well, this is the domain of the small, usually heterogeneous clusters, which, if they're in this environment that we imagine in the home or the workplace, are also going to be pulled together on demand. So they're not just sitting there waiting for your job. You actually want to make use of them at the time that you want to run the program. So stepping back from this large vision of a general purpose operating system that will manage these computers, we're looking here at just running very specific types of applications on the ad hoc cluster. But there still seems to be a whole class of programs that it will be pretty useful for. Data mining, video editing, scientific programming in particular -- these are easily parallelizable programs, and they're the kind of thing that at the moment you would run with Dryad or MapReduce or some kind of programming environment like that on a datacenter. But often it would be perfectly tractable to run those programs on your smaller cluster of, say, up to 10 machines.
So that's the goal: in this work in particular we're looking at DryadLINQ programs, and we'd like to run DryadLINQ parallel programs on these small scale ad hoc clusters. Just to give some context about data parallel programming, the idea here is that you have a very large dataset, and the pieces of data are partitioned onto different machines and processed in parallel. As I said before, these execution environments like Dryad and, in the open source world, Hadoop make this actually quite easy in terms of placing the parts of the program onto the computers, doing the scheduling, moving the data around to the machine that it needs to be on, and they also deal with fault tolerance. Because they're designed for the large data centers there's an inbuilt assumption about failure, and so they have all the mechanisms to restart jobs, to monitor, and when pieces of the job fail they restart them and so on. Associated with the execution environment is a high level language like DryadLINQ or Pig Latin, and these declarative languages make it quite easy for the programmer to actually express the parallelism that they need. So I would claim that these frameworks have made data parallel programming much, much simpler than it has been in the past. So it seems like a great idea to just take DryadLINQ and Dryad and run them on your ad hoc cluster. In particular, because the framework is very lightweight. It requires a daemon service to be running on each of the computers, but other than that there are no assumptions in the program about what hardware is available or how many machines are available, and as I said before, you can write this parallelism declaratively and the framework will deal with scaling it: maybe you've only got one machine available, and it will run on one machine, and it will scale up to all 10 or all 100 or 1,000 or whatever it is. What's the problem then? Well, the big problem is the diversity of the hardware of these kinds of clusters. The data parallel programming frameworks, because they're designed for the data center, have assumptions about how homogeneous the available hardware is. And this means that the schedulers are, well, basically greedy. They'll just take the next idle machine and run the task there, and it doesn't really matter. When you've got really quite diverse machines -- you know, laptops versus desktops -- in fact running the wrong thing in the wrong place can make the performance actually pretty poor. And if you're running something that would maybe take an hour if it was optimally scheduled, and it takes five hours, and it's on your home network, this might really matter to you. Also these built in assumptions about failure don't apply. This is less of an issue because we can just make them go away. They use techniques like speculatively executing a new task when the currently running one is observed to be running quite slowly, and we can just turn that off in this case. So our goal is to sort out this scheduling problem for DryadLINQ programs on a small scale ad hoc cluster. And to do that we've basically done two things. We've developed performance models for the computational vertices of the parallel program, and we've also used a constraint solver to find -- I wouldn't say the optimal schedule, but a reasonable schedule for actually executing that program, for assigning those computational vertices to the physical computers. Okay.
So, first of all, I'm just going to do a very brief overview of Dryad and DryadLINQ. Dryad came out of Microsoft Research in Silicon Valley. And the way that Michael Isard, who is largely responsible for it, likes to describe it is generalized MapReduce. So it's the MapReduce model, but it's much more general. Programs are proper data flow graphs, so you can have multiple inputs and multiple outputs on every node. The vertices, the nodes, are connected by channels, and channels are in fact files or FIFOs or TCP streams. And a program is run by dispatching these vertices onto machines by a process called the job manager. So each vertex is a process. The DryadLINQ compiler will produce some C# code from the original program that the programmer writes; it compiles it into some C# code which is then compiled again into an executable handed to Dryad. The job manager will take these executables and push them on demand out to the physical machine where they are to be executed. And Dryad, although it has been developed with Cosmos, doesn't actually require a specialized file system. So there's no sort of fundamental technical reason why you can't be running Dryad on your cluster of Windows boxes at home. So LINQ is a set of .NET constructs for manipulating data. It's designed to work with relational databases or XML data. And it's supported by features such as anonymous types and lambda expressions in C# and also, I believe, Visual Basic and F#. DryadLINQ also has some extra operators. So LINQ has the usual query operators on data like Select and Join and so on, and DryadLINQ extends that with some operators that are particular to Dryad. But actually it's a very small number. And DryadLINQ is only implemented for C#. So what DryadLINQ does is it takes these expressions, these LINQ expressions, and produces effectively the data flow graph that Dryad can then go and do its thing with. And I'll show you an example, just to clarify that. So this is a very simple join. And if you've looked at any of the DryadLINQ papers, you've probably seen it already. What it's doing is it's just taking a file called keywords, which is a list of words, and it's looking at a second file of text, and if the first word of a line matches one of those keywords then it keeps that line and returns it. So it's a join operation. As a Dryad data flow graph, this is what that operation turns into. In this particular example of the graph, I'm showing it with two partitions for the input data file. And then there's the keywords file, which has only one partition. What happens is that a hash partition operation is run on each of the partitions, and that just hashes the first word of each line in the data file and sends it off to the appropriate physical machine, so that words with the same hash value all end up being dealt with by the same merge operator. And then, after having done that, the actual join happens in stage three. So this is a notional picture of how that -- >>: What did you mean by the -- the original data file has two partitions? That means it's divided in half and there are half on each machine? >> Rebecca Isaacs: Yep. >>: The fact you're going to produce the output into two partitions was weighted by the fact that there's two partitions, or are those independent decisions? >> Rebecca Isaacs: Uhm, I think it's related.
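For reference, the keyword join described on this slide might look roughly like the following in ordinary LINQ; in DryadLINQ the same query shape runs over partitioned tables rather than local files, and the file names and helper code here are illustrative guesses rather than the actual sample code.

```csharp
// A minimal LINQ-to-objects sketch of the keyword join described in the talk.
// Assumption: "keywords.txt" holds one keyword per line and "lines.txt" holds
// lines of text; we keep each line whose first word matches some keyword.
using System;
using System.IO;
using System.Linq;

class KeywordJoin
{
    static void Main()
    {
        var keywords = File.ReadLines("keywords.txt");
        var lines = File.ReadLines("lines.txt");

        // Join on the first word of each line; this is the query shape that
        // DryadLINQ turns into the hash partition / merge / join stages on the slide.
        var matches =
            from line in lines
            join kw in keywords on line.Split(' ')[0] equals kw
            select line;

        foreach (var match in matches)
            Console.WriteLine(match);
    }
}
```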
>>: So it's going to, unless you use partition exclusivity. >> Rebecca Isaacs: Yeah, you can change that. But, yeah. >>: This is sort of like, if I had more than two machines, the number of partitions is unrelated to the number of machines. So it might be that those merge and join sets and partition steps should be one and four? >> Rebecca Isaacs: Yeah. So this is a notional runtime schedule, a Gantt chart-like visualization where each horizontal line represents a machine and shows which of these vertices are actually executing on it. So this bottom machine gets one of the hash operations, one of the merges and one of the joins. So that's -- don't read too much into that. I'm just showing you the picture, because I use pictures like this later in the talk. So Dryad has the data flow graph of the program. How does it actually take each of the nodes and decide where to run them? It uses a greedy scheduling approach: it just looks at the next available machine and the next runnable node and schedules it there. However, the programmer can provide hints that a particular node should be run on a particular machine. This is indicated by annotating the XML description of the graph, which is generated automatically by the DryadLINQ compiler, and we use this XML file that the compiler spits out to impose our scheduling regime. Okay. So as I said earlier, the heterogeneity can cause problems for scheduling on ad hoc clusters, and here's a contrived picture showing you exactly why. So in terms of normalized runtime on a 1 gigahertz machine, let's say that the green node will take two minutes and this light blue node will take six. A reasonable greedy solution would schedule that longer running node on the 2-gigahertz machine, where it takes three minutes, and the shorter one on the slower machine, where it takes two minutes. But then the slower machine frees up first, so the scheduler ends up putting that 10 minute node there, and the whole job actually takes much longer than it would have done had it scheduled the green node on the faster machine and the light blue node on the slower machine. It's very contrived, but it can easily happen. Okay. So I've given you the background of this work. I'm now going to talk about how we take those vertices, the nodes of the graph, and try and predict their performance, which we need to do in order to improve on the greedy scheduling algorithm. We need to know how long something is going to run if we place it on Machine A as opposed to Machine B. So to start with, we had to look at how Dryad vertices actually run. We used ETW, Event Tracing for Windows, and I'm going to show you a screen shot from the Xperf tool showing what's going on with a single Select operation. So Select -- it's a bit of a misnomer. It's actually like Map. It takes every element and does something to it. In this case it just reads and then writes a million records on the local disk. And this was run on an eight processor machine with two disks. So this is the Xperf visualization of the running of that vertex. The top graph is showing utilization: I've overlaid CPU and both disks. It's utilization by the one process that's responsible for running the vertex. And the bottom graph is showing IO counts by that process. The reads are shown in red and the writes are shown in blue. And this is the entire running time of the thing. So what you can see straightaway is that the IO is batched quite a lot. There are basically four batches of reads and four batches of writes.
The next thing to note is that during the read phase, both disks are almost 100 percent utilized. Unfortunately, Xperf chose blue and blue to show those lines. But the blue ones are the disks. We've also got CPU, which you can maybe just see as this sort of grainy line down at the bottom. So during the read batch, CPU utilization is about 10 percent. During the write, it actually drops off to significantly lower than that. Okay. That's very interesting, but what happens when we run the same thing on different hardware? It's a similar picture. Similar but different. In this case we've got a slower processor, and actually a much, much faster disk. And the consequences here are that again we have the batching of IO that we saw before. And now the CPU, that sort of yellow green line, is at about 25 percent during the reading, and the disk, instead of being pegged on the read, is now pegged on the write. This is to sort of emphasize or describe what's really happening. This is the performance that we want to model. >>: Is it worth mentioning that 25 percent in Xperf is [inaudible] persons competing and the other 25 are [inaudible]. >> Rebecca Isaacs: Yes. 25 percent of the total for the processes. Another aspect of how things execute is the threads. This is actually just a visualization of the same trace, the ETW trace. This is some stuff we've written ourselves. And here these horizontal lines on the bottom are the threads executing within the process, and the top two, again, the green and the red, are the two disks. And you can see the batching very clearly. Those are disk request events, in fact. And what we're doing down here is we've pulled out all the context switches, and we're basically filling in the color when the thread is executing, and when it's not running the line is blank. So there's nothing sort of surprising here. But it's worth noting that there's a whole bunch of threads. There's one of them that seems to be doing all of the reading, and then these other threads are picking up IO completions. And sometimes they're also issuing write requests to the disks. So there's a nontrivial amount of concurrency in this process. So the observations from that are that the bottleneck resource, which is what we need to understand in order to predict how this vertex is going to run on different hardware, actually changes. Not only does it change, but it won't even be 100 percent utilized. Vertices consume multiple resources simultaneously. We already saw that with the disk reads on one machine consuming a reasonable amount of CPU at the same time. However, fortunately for us, because Dryad is engineered for throughput, we get this really nice batching of IO. It's actually batched in 256 megabyte chunks for almost all of the vertex types, almost all of the operators in DryadLINQ. And these requests are actually pretty aggressively pipelined as well. DryadLINQ has a standard set of operators, and most of them behave pretty predictably. There's some like Apply, which can execute arbitrary code. But most of them -- one Select -- well, maybe not. Maybe Select is a bad example. But one Join vertex is going to look somewhat like another. So we want to predict vertex execution times. What do we need to know in order to do this? Obviously we need to know the hardware that the vertex is going to run on. And it also varies according to workload, so the size of the IO. In fact, it will vary depending on the record size.
But for the purposes of this work we assumed that the access patterns of the IO will actually stay the same from one vertex execution to another, even though the absolute amount changes. If you recall, we're actually talking about trying to schedule these vertices in the context of a data flow graph. So a vertex is going to be reading its input from its parent node. So the placement of the vertex relative to its parent node is going to affect how quickly it can do that. If we have the vertex running on the same node as the parent, it's probably going to be the disk which is the bottleneck during the reading, perhaps, and if we place it across a network then perhaps that network is going to be the bottleneck. We've also got complications like, for example, sometimes it goes through the SMB subsystem to read a file on the local disk. And that means it will also read or write that file at a different rate than if it's reading or writing to disk directly. However, the prediction of the running time of that vertex doesn't need to be really, really accurate. That's not the goal here. The goal is just to find a reasonable schedule for our programs so we can run them on our small cluster, and it won't be as bad as it might be if we just used greedy scheduling. So how do we do this? >>: So the object [inaudible] only be doing this when the other machines are lightly loaded, so we don't have to take it into consideration as part of the job. >> Rebecca Isaacs: Yes. Paul's nodding his head vigorously. Yes, it would be nice to be monitoring what's happening on other machines and feed that back in. But we're not doing that. >>: [inaudible]. >> Rebecca Isaacs: Yeah. So the way that we do this is we take advantage of the batched IO to divide the execution of the vertex into what we call phases. And within each of these phases we have consistent resource demands that are amenable to sort of simple queueing analysis, to figure out how long that phase is going to take to run on the target hardware. So we identify these phases by whether IO is taking place. Also, as you saw in the visualization of the execution of the Select vertex, all the times that the vertex was reading, for example, actually looked pretty similar. They were all bottlenecked on the disk in the case of the first picture. So we can also take advantage of that and sort of group these things together and say, okay, the total time in a reading phase will be the sum of all of these. And I'll show you that visually. So here the top graph is showing the cumulative IO performed by that Select vertex in terms of gigabytes, and along the bottom are the seconds that it was running for. So you can see that batching: it reads for a while and the red line goes up for a while, then it writes for a while and the green goes up. In the middle graph we're showing CPU seconds consumed by that vertex. Again there are very clear jumps during the read phases. I've also plotted on the bottom graph the concurrency, in this case in terms of how many threads are runnable at any time during the execution of the vertex. So, by looking at the gradient of those IO lines, we can identify phases. And in this case I've just put lines over all the read phases. We can do the same thing with the write phases. In fact, you can see from this concurrency graph that the number of runnable threads actually looks pretty similar in each of the read phases and each of the write phases.
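As a concrete illustration of that per-phase idea, here is a minimal sketch, my own simplification rather than the authors' code, of how a single read or write phase's duration might be estimated on target hardware from the demands extracted from the reference trace; the field names and scaling factors are assumptions for illustration.

```csharp
// A rough sketch of per-phase prediction, assuming each phase has a measured
// CPU demand and an IO demand taken from the reference trace. Because the IO
// is batched and heavily pipelined, the phase is treated as bound by whichever
// resource is the bottleneck, so we take the max rather than the sum.
using System;

class Phase
{
    public double CpuSeconds;   // CPU demand observed in the reference trace
    public double IoBytes;      // bytes read or written in this phase

    public double PredictSeconds(double refCpuGHz,
                                 double targetCpuGHz,
                                 double targetIoBytesPerSec,
                                 double inputScale)   // e.g. 2.0 if the input doubles
    {
        // Scale CPU demand by relative clock speed and by the change in input size.
        double cpuTime = CpuSeconds * (refCpuGHz / targetCpuGHz) * inputScale;

        // IO time is the scaled demand divided by the throughput of the target
        // disk or network link, whichever the channel placement implies.
        double ioTime = (IoBytes * inputScale) / targetIoBytesPerSec;

        return Math.Max(cpuTime, ioTime);
    }
}

// The vertex prediction is then roughly the sum over its grouped phases,
// plus the measured overhead and initialization time.
```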
And then we also have what we call overhead and initialization phases, and this is when other stuff is happening that will take a certain amount of time. If you think about it, the binary has to be loaded, things like that. So we identify these phases, and then, magic animation, we group all the similar ones. And now we have a demand for CPU and for disk in each of these phases, and this gives us a very easy way of determining the running time of each phase on different hardware. So for each phase we know its type, whether we're reading or writing. We have something called a concurrency histogram that I'll explain in a minute. We know what file is being read or written, and how much of that file, and as I said before, we can now apply operational laws to determine how long it's going to run. The concurrency is interesting, because without looking at the source code there's not really any way to tell, if we measure how long something runs for when we've got two processors, how long it is going to run for when we have it on eight. We estimate this, and this isn't a perfect way of doing it by any means. But we can estimate this because we are using ETW events, quite low-level events; we can see whenever threads become runnable by looking at the ready-thread events, and so this gives us an idea of whether, if we had more processors, this vertex would be able to have those threads actually running rather than waiting in the queue. So we store this count of runnable threads in a histogram, where each bucket indicates how many threads were runnable, and holds the proportion of CPU time that that many threads were runnable for. So we can use this to figure out whether we could take advantage of more processors or whether we're going to have to adjust because there are fewer. So just to recap, what we're doing here is we're developing a performance model for each vertex in our program, and we're doing this by taking a reference trace on one machine. So we run it once with the input data, or some fraction of the input data, that we're going to use when we run it in the future. And then from this reference trace we build a model by extracting these phases. And then we're just going to predict how long that vertex will run if we change the size of the input, if we run it on a different computer, and if we change the channels -- whether it's getting its input from a file locally or a file remotely. Okay. That seems great. But actually there are a lot of issues in reality, and we don't have high expectations of accuracy. To start with, even where the file is on disk can really change how long it takes to read 256 megabytes of that file. Fragmentation can really mess things up. This vision of using the machines you've got lying about the house means that they're quite likely going to be normal Windows machines. And although we say, well, they're lightly loaded and we're not running other stuff on them, there's still going to be the search index, the virus scanner, and so on. And similarly on the network. And then there are just deficiencies in our model, which we try to keep as simple as we possibly can. So we're not even looking at caching or contention for memory. So within 30 percent of the actual is the target.
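To make the concurrency histogram mentioned above concrete, here is a minimal sketch, again my own reading of the approach rather than the authors' code, of how the measured CPU time might be rescaled when the target machine has a different number of processors; the names and the exact scaling rule are assumptions.

```csharp
// Sketch of scaling CPU time using a runnable-thread histogram. Bucket i holds
// the fraction of CPU demand observed while (i + 1) threads were runnable.
// Only min(runnable, cores) of those threads can actually run at once, so the
// effective parallelism, and hence the elapsed time, changes with the core count.
using System;

class ConcurrencyHistogram
{
    public double[] Fractions;   // Fractions[i] = share of CPU demand with (i + 1) runnable threads

    public double ScaleCpuTime(double cpuSecondsOnReference, int referenceCores, int targetCores)
    {
        double scaled = 0.0;
        for (int i = 0; i < Fractions.Length; i++)
        {
            int runnable = i + 1;
            double refParallelism = Math.Min(runnable, referenceCores);
            double targetParallelism = Math.Min(runnable, targetCores);
            // Elapsed time in this bucket shrinks if the target can run more of the
            // runnable threads at once, and stretches if it can run fewer.
            scaled += Fractions[i] * cpuSecondsOnReference * (refParallelism / targetParallelism);
        }
        return scaled;
    }
}
```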
So what I've got here are the results of a larger evaluation that we did, and this is just showing one vertex, the merge vertex. And in this case the merge vertex has only got one input and one output. One thing that we did, the top line labeled reference, is basically showing the results when you take the ETW trace, do the phase extraction and produce the model, and then from that model predict the running time of the identical vertex, the one that the model was generated from. And the average error there over 10 runs was about 10 percent. The other rows vary things: changing the size of the input, running it on a completely different machine. This one labeled "remote" was pretty terrible: average 40 percent error. In this case it's doing a read from a remote machine, and it's actually bottlenecked by the 100 megabit network link there. That, for instance, can be explained by a lot of different things. By and large, this is about as bad as it gets. For the other vertices that we've evaluated, by and large we're reasonably happy with this modeling technique. We think it's good enough for what we want to do. Okay. So I've talked about how we can determine how long it's going to take to run one of these vertices when we place it on an arbitrary machine in our cluster. Now we need to decide, looking at the data flow graph as a whole, how we are going to take this entire graph and map it onto these physical computers. This is just sort of a picture of the end-to-end situation, what's actually going on. So we've got the code. The DryadLINQ compiler turns it into a data flow graph, which through the Dryad job manager will get executed on the cluster. When we're taking our first reference trace, the one from which we build the models, we actually do that by running a logging service, an ETW consumer, on each of the nodes in our cluster. From that, we extract the phases, and then we have a model which we can give to the performance planner. We also have, from the DryadLINQ compiler, the XML graph that represents this data flow graph. And so that XML file can be updated using the model: once a schedule has been found for this program, we can annotate the XML graph using the hints that Dryad understands to tell it where to run the vertices. And then subsequent executions of this program just take our updated XML graph as input. So the way that we try and find a schedule is we've actually been looking at using constraint logic programming. I think in hindsight that maybe wasn't such a good idea. It's a very subtle and complicated business. But we were interested in sort of exploring how well these things would work. So the idea is that you've got a search tree, and constraints such as one vertex must finish before another one starts can help to prune the search space. You can also use heuristics to speed it up, such as looking at the vertices that are going to take the longest time to run and trying to place them first. And another little trick we did was to first of all produce a greedy schedule and use the total running time of that to give us an upper bound. So there's no point continuing to explore a branch in the search space if it's already going to take longer than the greedy schedule. So I'm not actually going to say any more about that. This is -- I mean, we'd be very happy to talk about it offline, but it's too gory to stand up and talk about. One important aspect of scheduling the whole data flow graph that I haven't mentioned so far is that there's another problem, which is contention between vertices. So this is the chart showing the join example being scheduled. Without the contention model, we have a certain runtime prediction for that merge vertex.
And for that one. But what you notice is that they're both reading their input from the same upstream disk. And so those two vertices will interfere with each other, and in practice they'll actually take a lot longer to run. So this is just another thing that really has to be taken into consideration. And we do that, and that's fine. We've done an experimental evaluation for the paper that we wrote on this, where we just took three of the workloads that the DryadLINQ people use in their examples: TeraSort, join and algebra. In the results I'm going to show in a minute we were actually using a very small cluster of just three machines, with a laptop, a desktop and a server. And they're reasonably diverse. Not massively, but it's hard to get a really massively heterogeneous cluster. But I think this is probably representative of the kind of thing you might want to use here. So how did we do? What was the overall speed-up eventually versus just using a greedy schedule? It really depended on the program and the workload. The algebra program, which as you saw had lots and lots of nodes in it, is just a very simple one -- it actually has a small amount of input and it does some very simple things, I know it does things like norm and standard deviation and so on. It just produces a few numbers. It's very much a toy example. But we actually got a really good speed up there of almost 40 percent. I'm showing -- we did quite a lot of runs of each one, and we've got a min, median and max for the greedy schedule, and the achieved is our schedule, and we also did an exhaustive search -- not quite exhaustive, but it sort of gives a rough lower bound. So join was also pretty good. TeraSort, we haven't finished the experiment. We've only got one number. So in this case we got nine percent speedup. But who knows, if we run more tests we'll probably improve on that. So in conclusion, given the inaccuracy of prediction, this seems a reasonable gain. >>: Given the machine setup you used, it seems almost like you would be better off running everything off the server. Did you do a comparison if you just gave up on distributing computation and just stuck with the server, whether it was useful to have two machines? >> Rebecca Isaacs: So you're right. Oftentimes it would actually make more sense to just run on the server. But it depends where your input is, to a large extent. So, for example, in that join, at each stage of the data flow graph you're actually reducing the amount of data that's being shipped quite drastically. So if the data is on the laptop, or a large portion of it is on the laptop, and you've got a really slow link to the server, then it may or may not make sense to ship that data and do the filtering on the server; it may be preferable not to ship it, but to do the filtering on the laptop and ship the smaller amount of data. And we didn't do the comparison because it's easy to contrive examples that will work one way or the other, and it doesn't seem very relevant. >>: It seemed to consider a case running vertex, [inaudible]. >> Rebecca Isaacs: Well, it did in this case. But we could have had cases where it didn't. Does that answer your question? Okay. So in conclusion, large compute jobs shouldn't have to be run in a data center. There's definitely scope to be running these things locally. And it's a first step towards this much longer term vision of the disaggregated PC: how are we going to, in general, schedule jobs on this collection of computers that we have to hand?
The academic release version of Dryad and DryadLINQ is somewhat different to the version that we made these modifications to, so we are forward porting our stuff. We're also looking at using the Microsoft Solver Foundation libraries for the constraint solving, for the search. And it would be really nice to have some kind of feedback, when we produce the schedule and the program's being executed, to monitor how long vertices are actually taking to run, and feed that back and adjust the model accordingly. And that is the end of the talk. [applause] >>: What about disk space available? I can't imagine that laptop [inaudible] process disk space [inaudible] server. >> Rebecca Isaacs: That's a good point. I hadn't thought about that. Although, we did fill up a disk at one point in our experiments. Yeah, it's certainly something -- it would make sense to take it into account, because Dryad generates an awful lot of intermediate data, and it can be aggressively garbage collected, but while the thing is running it's conceivable it could fill up a disk, a small disk, yeah. >>: How much information do you know about each of the nodes? I mean, when you put it in scheduling do you know its [inaudible] memory? >> Rebecca Isaacs: Yeah, we assume we know everything. >>: Do you actually monitor any of the nodes to see what their usage is, like the disk space usage or CPU? >> Rebecca Isaacs: No. No. But that would be a sensible thing to have, a nice thing to have. Yeah. [applause]