>> Surajit Chaudhuri: It's a great honor to introduce Magdalena Balazinska, from the University of Washington, Seattle. Most of you know her. She did her Ph.D. at MIT. She works in the area of distributed systems and databases, and has a number of awards; it's hard to name them all. But she won the Microsoft New Faculty Award, the [inaudible] award, and the ten-year Most Influential Paper award. In addition, she has a number of best paper awards. She also works in the area of scientific databases, which I forgot to mention. Today she's going to talk about big data analytics. So without further ado. >> Magdalena Balazinska: Thank you, Surajit, and thank you everyone for coming. Before I start, I'd like to acknowledge that the core technical pieces I'll be talking about are actually the hard work of two of my students, Prasang and YongChul, and also that this work is sponsored by NSF and, interestingly, by Microsoft. So thank you. So what's the motivation? The motivation, of course, is that everyone nowadays has a lot of data. Being on a university campus, we talk a lot more to domain or natural scientists than to industry, and all these natural scientists are generating data at just an amazing scale and rate. For example, astronomers are building increasingly large telescopes that will systematically survey the sky, accumulating very large collections of images, and we want to analyze those images; for the upcoming LSST survey we're talking petabytes of data. Similarly, in other domains, such as oceanography, they're collecting data from sensors and also simulating the environment, simulating the oceans, producing large amounts of data. In other areas, such as biology, we have new sequencing and new lab automation techniques. So the bottom line is that scientists are just producing a lot more data, and therefore they need tools, they need help, to analyze all this data. And of course, as you know perfectly well, better than I do, this need extends beyond science.
So if someone has a large amount of data, what can they do with it? There are several solutions that already exist. I could use a parallel relational database management system, perhaps Greenplum; a MapReduce-type system, such as Hadoop or Scope; or one of the new prototype parallel data processing systems, such as SciDB. And there are many, many interesting challenges, and the one challenge I would like to talk about today is fault tolerance. So what is it about fault tolerance? If I'm going to analyze a large amount of data, clearly failures are going to occur while I'm processing that data. In a recent publication, Google reports that they see approximately five worker deaths, meaning task deaths, for each MapReduce job. So definitely failures are occurring. The question is what we should do about these failures. In the community we really have two standard techniques that sit at opposite extremes. One approach, which is characteristic of traditional parallel database management systems, is to use a pipelined, or if you want, streaming execution. So let's say we have a query that starts with a dataset, maybe selects some values, joins with another dataset, joins with yet a third dataset, and performs some aggregation at the end. If I run this in a parallel database, I will typically start reading the data from disk and then stream the data directly from one operator to the next as I do my processing. I'm not going to write any data to disk; I'll just stream the data right through all these operators, which can be spread across several machines in the cluster. So this is great. But if something crashes during the execution of my query, I don't really have a choice: I have to restart the whole query. So there are several advantages.
One advantage is that I don't pay any overhead of materializing, checkpointing, et cetera; I just stream the data right through. The second advantage is that if my operators are not blocking, I can do what is called online aggregation, or online query processing, where I have the ability to produce results incrementally. That can be very nice if I'm analyzing a large dataset: I see the results incrementally and can maybe stop the computation if I have seen enough, whether or not I'm happy with it. On the negative side, clearly failures are going to be costly, which at small scale is not a problem, but if we scale this up it might become one. So this is one strategy commonly used today. At the opposite extreme we have MapReduce types of systems, where the idea is to be extremely cautious, if you want, and use blocking query execution. We start from the data in the cluster on disk, and we read the data, process it, and write the intermediate results back to disk. Then we're done with the first operator. We schedule the next operator, which reads data from disk, processes it, writes it to disk, and so on. The beauty of this technique is that if something crashes, we don't have to redo all the work: we can just read the data from disk again and reprocess only the operator that failed. So the advantage of this technique is that recovering from failures is much less expensive than in the previous case, because we only reprocess the one failed operator. On the other hand, there are two negatives. One is that we do have to pay the overhead of writing all this data to disk. The second is that I can't see any results incrementally: because each operator fully processes its input and blocks, I really have to wait for the whole query to be done before I can see any results.
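As an illustration (not from the talk itself), the two execution styles just described can be sketched in a few lines of Python: `pipelined` streams each tuple through all operators and yields results incrementally, while `blocking` fully materializes each stage before the next starts, with a plain list standing in for the on-disk intermediate results. The operator functions are invented for the example.

```python
def pipelined(data, operators):
    """Stream each tuple through the whole operator chain; results appear
    incrementally, but a crash means restarting the entire query."""
    for t in data:
        for op in operators:
            t = op(t)
            if t is None:
                break          # tuple filtered out mid-pipeline
        else:
            yield t            # streams out as soon as it is produced

def blocking(data, operators):
    """MapReduce-style: fully materialize each operator's output ("write to
    disk") before scheduling the next; a failed stage reruns from the
    previous stage's output, but no results appear until the end."""
    stage = list(data)
    for op in operators:
        stage = [r for r in (op(t) for t in stage) if r is not None]
    return stage

# Toy plan: keep even values, then scale them.
ops = [lambda t: t if t % 2 == 0 else None, lambda t: t * 10]
assert list(pipelined(range(6), ops)) == blocking(range(6), ops) == [0, 20, 40]
```

Both produce the same answer; they differ only in when results become visible and how much work a failure wastes.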
So the question that we asked ourselves is: what would be a nice compromise between these two techniques? First, we would like to preserve the ability to produce results incrementally; we don't want to add any extra blocking just for the purpose of fault tolerance. And second, we would like to have a fast execution time assuming that failures are going to occur. So can we achieve this? The answer is yes, because there exist several fault tolerance techniques that we can use in a non-blocking fashion. In fact, there are many ways in which we can achieve this. There are many techniques; I'm listing three here, and I'm going to describe them in detail later. But the key question that we asked is: okay, great, so I can take a parallel data processing system and make it fault tolerant using one of several techniques. Which one should I use? And does it actually matter? So as a first set of experiments to drive this research, we started with a small-scale cluster of 17 machines, ran some simple queries, and observed what happens when failures occur and we use these different techniques for fault tolerance. In these experiments we run simple queries, inject exactly one failure about halfway through the execution, fail each of the operators separately in different executions, and average out the total runtime. And we try different fault tolerance strategies. The first fault tolerance strategy is simply to do almost nothing: if I have a stateful operator and the operator crashes, we restart the operator from the beginning.
But if I have some stateless operators, what I can do, let's say I have a selection operator in my query plan and that operator crashes, is that when I restart the operator, I don't have to reprocess everything: I can simply skip over all the data that I already processed and continue processing from there. That's very similar to restarting the whole query plan, except we restart only one operator at a time, the operator that crashed. So now, on this small-scale cluster, we're going to compare. Here we have the different fault tolerance strategies, and on the Y axis we have the runtime. The blue part is the runtime without any failures, and the red part is the extra time we spent due to failure and recovery. So if I just run the query without any failure, this is the runtime we get. If I run the query and inject a failure roughly in the middle of the execution, you can see the expected wasted time due to the failure as the red part. In the case of something like a parallel database, if I just restart, I'm going to add 50 percent, because in expectation I have to redo half of the work. So what if instead I use this technique of skipping over the input data whenever possible? Here what happens is that the basic runtime without failures remains the same, but in terms of recovery I can speed up the recovery of some of my operators, so I already do a little bit better. >>: Are we talking -- are you using the -- >> Magdalena Balazinska: It's all the operators, but the stateful operators can't skip over anything. They have to go back and actually reprocess everything, because they have to rebuild their state. As you can see, I have a little bit of improvement only for that operator. But that's still one strategy that we can use, yes. >>: Does this also use indexes -- does it have any indexes?
>> Magdalena Balazinska: In this case, we were not producing indexes. The main thing here, and I'll get to the details of how we implement it in our context later, is that if an operator crashes, it needs the ability to tell whatever is upstream of that operator to resend a certain amount of the data, not the whole data, in order to be able to skip over some of the input. >>: Where does the state start, this operator -- >> Magdalena Balazinska: There's no state. Also, in this case, how much to skip over? I'll get into the details later; I'll basically show you how we implement it in our framework. At this point let's just keep it a little bit vague and get a sense of whether it even matters that we pick the right strategy. So this is just one simple strategy. What if we also used the MapReduce type of strategy? What I'm going to do is write data to disk at the output of each operator, the way MapReduce does it. But I don't have to do it in a blocking way: instead of writing all the data to disk and then sending it downstream, I can write a little bit of data to disk and send that data downstream, write a little more data to disk and send that downstream, and so on. If an operator crashes, the advantage is that all the data the operator consumed is available somewhere on disk, so the upstream can simply replay that data when the operator wants to recover, instead of propagating the failure back and having to regenerate any of the missing data. So how well does this perform? Here we go back to our query -- and actually, I forgot to say that this is a simple query that selects data, joins it with another dataset, and aggregates. This is not a real query, just a synthetic one, and I think it was processing about 160 million tuples. We have the restart strategy.
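The non-blocking materialization just described can be sketched as follows. This is an illustrative toy (class and method names are my own, and an in-memory list stands in for the on-disk output file): the operator appends each output tuple to a log as it streams it downstream, so a crashed downstream operator can ask for a replay from the log instead of forcing recomputation upstream.

```python
class MaterializingOperator:
    """Streams output downstream while incrementally logging it, so a
    downstream failure can be recovered by replaying the log."""

    def __init__(self, fn):
        self.fn = fn       # per-tuple transformation this operator applies
        self.log = []      # stands in for an on-disk output file

    def produce(self, input_tuples):
        for t in input_tuples:
            out = self.fn(t)
            self.log.append(out)   # materialize incrementally...
            yield out              # ...while still streaming downstream

    def replay(self, from_index=0):
        """After a downstream crash, resend a suffix of the logged output;
        no recomputation of this operator is needed."""
        return self.log[from_index:]

op = MaterializingOperator(lambda t: t * 2)
downstream = list(op.produce([1, 2, 3]))   # streamed downstream: [2, 4, 6]
recovered = op.replay(1)                   # downstream crash: replay [4, 6]
```

The point of the sketch is that nothing blocks: logging and streaming happen tuple by tuple.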
In restart, if any operator crashes, we simply restart the whole query. In the skip strategy, if my select crashes, I can restart the select and skip over the input data -- not using indexes, just skipping to wherever the operator left off. If the join crashes, the select will have to restart itself to reprocess, but again it can skip over the data, although the join will have to rebuild all of its state from the beginning; same thing for the aggregate. So with skip, if the select crashes we can skip over some input data; if any of the other operators crashes, we pretty much start from the beginning. In the case of materialize, we write the select's output to disk, and we write the output of the join. Now if the join crashes, we don't have to reprocess the select; the select will just resend the data from disk, and the join will rebuild its state from the beginning and continue -- kind of MapReduce style. How helpful is it? We actually executed it, and in this case, for this query, it's pretty bad, because having to write the intermediate data to disk really increases our runtime in the absence of failures. Once a failure occurs, we can recover much faster, because we only have to reprocess the operators that failed. But overall, for this specific query, this is not the best strategy. There's one last very well known strategy -- and these are all common strategies; we did not invent any of these fault tolerance techniques. The last well-known strategy from the community is checkpointing. This is great for stateful operators: we run the operator, and every so often we write a copy of the operator state to disk. If the operator crashes, we can restart from that last checkpoint, which includes the state of the operator and exactly how far along the operator was. >>: I'm surprised these are so bad in the previous slide.
On this scheme you have to seek to that block and wait for rotational latency. Disk writing, in principle, can find any blank spot on the disk, write there, and make a note of where you wrote it. Your setup may not allow you that option. >> Magdalena Balazinska: Sure, it could also be that we didn't optimize all these techniques as well as one optimally could. This is also something -- >>: Basically, on an empty disk, just start writing where you are. >> Magdalena Balazinska: Sure. >>: The next time around it's harder. >> Magdalena Balazinska: That's right. In this case we don't do these types of optimizations. We have an open file. The one thing we did in the experiments is that each operator partition has its own disk; we do sequential reads and sequential writes, but that's as much as we do. So if an operator fails, we restart the operator from scratch, read the state from disk, and then process. So how well does this do? We go back to our query with the three techniques so far. In this case, if we checkpoint, we actually don't add that much overhead, because the state is not that large, and now recovering is extremely fast because we only have to reprocess from the latest checkpoint. What is interesting, and this is the bottom line of our motivation, is that given a simple parallel query and even a single failure, we can get as much as a 70 percent difference in execution time between choosing the wrong fault tolerance strategy and the right fault tolerance strategy. So the second question might be: is checkpointing always the strategy that we should use? Should I always simply checkpoint all the operators? Yes? >>: When you do skip, you need to be efficient about it. Do you use an index for it? How is this implemented? Are you rescanning? >> Magdalena Balazinska: You want to be able to seek to the right location; you don't want to scan the whole input data. We use an offset, exactly.
We're scanning from the disk beginning and we remember the offset, and we just restart from that, yes. >>: Does this apply to a MapReduce setting too? >> Magdalena Balazinska: You could apply it. Actually, for these experiments -- it's on the slide, I forgot to mention it -- we're using our own skeleton processing engine that has nothing but operators and fault tolerance strategies. >>: Because MapReduce assumes user-defined functions; in a system like Scope, user-defined functions are very important. >> Magdalena Balazinska: Yes. >>: That checkpoint strategy mainly implies an additional contract for the programmer about what is even the right notion of -- >> Magdalena Balazinska: Yes, exactly; I'll get to this. What we did initially is assume relational operators, whose insides we know. At the end I'll show you that if we have user-defined operators, we can make the assumption that they can't do much in terms of fault tolerance, and you can still use our technique and get an improvement. I'll get back to user-defined operators at the end. This is a very good question, yes. >>: For checkpointing, are you saying the operators are checkpointing? >> Magdalena Balazinska: Yes. >>: Because one operator might prefer checkpointing, versus another operator which has much more state, for which it would be much more expensive? >> Magdalena Balazinska: That's exactly where we are headed. So the question that we asked -- and this is a purely academic question; what I'm going to present is not something you could directly apply, but we were curious from a science perspective -- was: all right, do I always checkpoint? Is there one strategy that's always the best? And the answer is no. In the next experiment, what we did is simply replace the last operator: exactly the same query, processing the same number of input tuples, but with a join at the end, and here the bars are completely different.
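The offset-based skip recovery discussed in this exchange might look like the following sketch. The function name, the crash simulation, and the predicate are illustrative assumptions, not the speaker's actual implementation; the idea is simply that a stateless operator remembers how many input tuples it consumed and, on restart, seeks past that prefix instead of reprocessing it.

```python
def run_select(tuples, predicate, start=0):
    """Stateless selection that can resume: process input from `start` and
    return (output, number_of_input_tuples_consumed_so_far)."""
    out = [t for t in tuples[start:] if predicate(t)]
    return out, len(tuples)

data = list(range(10))
is_even = lambda t: t % 2 == 0

# Simulate a crash after 6 input tuples were consumed...
out1, consumed = run_select(data[:6], is_even)
# ...then restart, seeking past the already-processed prefix (no index,
# just the remembered offset):
out2, _ = run_select(data, is_even, start=consumed)

# out1 + out2 equals the full result, without redoing the first 6 tuples.
```

A stateful operator could not recover this way, which matches the earlier point that only stateless operators benefit from skipping.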
In this case, the strategy of every operator simply skipping over the input data when it can, with no checkpointing and no materialization, gives me the best runtime. So the observation here is that for a parallel query, the choice of fault tolerance technique can have a significant, visible impact on runtime, even in the presence of a single failure. Imagine with more failures: the differences will amplify. At the same time, there isn't a single well-known technique that I can just apply uniformly. So what we want is exactly what you suggested. We asked: can I, should I, pick the right fault tolerance technique at the granularity of each operator? We actually went further, and the question we asked was: can we build an optimizer that takes my query plan and my resources as input and automatically tells me what fault tolerance strategy to use at each point in my query plan -- and does it actually matter? Would such a fault tolerance optimizer make a difference? That's the question we set out to answer, and in the rest of the talk we'll see how we went about answering it. Does that make sense so far? Yes? >>: What is the objective function for the optimization? Is it just runtime, or is there latency, any other factors? >> Magdalena Balazinska: I'll get to this in a second, because that's exactly the point. We can use the standard cost functions, and one of the challenges here is that we really tried to get as close as possible to estimating something equivalent to runtime. The difference between different fault tolerance plans is not orders of magnitude; with one failure it's on the order of maybe 40, 50, 70 percent, so we have to be a little more accurate in order to select a good plan. One challenge is that these operators interact with each other when they're running in a pipeline. For an operator like a symmetric hash join, initially the bottleneck will be how fast the input data is arriving.
But soon it's going to produce so many tuples that the bottleneck becomes the CPU of the operator, so we had challenges in modeling these dynamics. I'll come back to this in a few slides. The bottom line so far is that we want to see if we can choose fault tolerance at the granularity of individual operators, and whether we can make this choice automatically. This is what we did in our optimizer, which we called FTOpt. The reason the name is so short is that the paper was over the page limit, so we needed a short name to squeeze the paper in. This actually appeared in last year's SIGMOD conference. The goal was two things. First, I need the ability to mix and match fault tolerance strategies: a single query plan with one operator using one strategy and another operator using a different strategy. So the first thing we had to develop is some sort of protocol that hides all the details of the fault tolerance technique and provides a high-level agreement between operators, such that if one operator crashes, we only restart that one operator, and we don't care what anyone else is doing for fault tolerance. That's the first contribution. The second was to have this cost-based optimizer, which we call the fault tolerance optimizer. The input to the optimizer is the query plan: we first run the regular optimizer to get a query plan, and then we pick the right fault tolerance strategy for the query. And of course the question then is also how much we gain. Once we actually execute in parallel, the way this works is that we encapsulate these operators, we have our optimizer on the side, and we have this encapsulation with the protocol between operators; the optimizer selects the strategies, and different operators end up using different strategies. So the granularity is really one operator in the query plan. So let's look at the actual protocol that we use.
So the goal is that we're going to have pipelining; we don't want to add any kind of blocking. I want the data to flow continuously, because, if at all possible, we would like to be able to show results incrementally. We don't want fault tolerance to add any kind of blocking. That's the first goal of the protocol. At the same time, we need the ability, as people raised in their questions, to figure out where to restart from and what to reprocess. The protocol has to capture this. This is basically illustrated with an example. We have three operators: O1, O2, and O3. Our protocol requires that each operator obey four different rules. The first rule is that all tuples, or records if you want, that we send between operators must have some sort of unique identifier, like a primary key or a hidden record ID. The reason is that when we restart an operator, we will want to tell the upstream operator where to resend data from, so we need the ability to identify these tuples uniquely. Second, we require that an operator, at any point in time, be able to go back and ask the upstream operator to resend any suffix of its output data. So O2 can say: please restart and send me everything from the beginning, or please restart and send me everything since tuple number 225, and the upstream operator has to support this feature. The easiest way to implement this is that when O2 says "resend everything from tuple 225," the upstream operator can simply kill itself, restart, reprocess, and discard all the output data until it sees the right tuple, then continue. Of course, this does assume that the operator is deterministic. We do not support operators that have some sort of randomization in them, because we assume that we can restart an operator and it will produce the same output.
>>: Pseudo-randomizations are okay? >> Magdalena Balazinska: That's right. If you have some sort of random seed, I can just give you the seed again; as long as there's some way to restart and regenerate the same sequence, we're fine. We discuss this in the paper; I'm not going to go into the details of that aspect. All right. In many cases, requiring that an operator remember all its output, or have the ability to reproduce its output, is fine, but it can actually be quite costly. In many cases, once an operator knows it has made a certain amount of progress -- let's say this operator chose to checkpoint -- then it knows that it will never ask for some old tuples again. As an optimization, such an operator can acknowledge to the upstream operator and say: by the way, I will never ask you to replay anything before tuple number 153. How I know is my problem, but I know I will never ask you for it. And if the upstream operator uses something like materializing its output to disk, then it can truncate its logs and do whatever optimizations it wants. Finally, the last rule is that we ask all operators to remember the last tuple they received. So if an operator crashes, it can ask downstream: I just crashed; tell me the last item that I sent you and that you received. The operator can then eliminate duplicates and send only the new data. Yes? >>: You always require the sender to keep the output; why don't you have the option for the operator itself to save its input somewhere? >> Magdalena Balazinska: It can do whatever it wants. All I'm doing is stating four rules; I don't tell the operators how to implement them. I just say that the operator needs the ability. And the simplest implementation is that the operator does nothing, except that when someone requests a replay, it crashes itself, restarts, and regenerates the output.
We don't put constraints on how the operator implements those rules. Does that make sense? Or was the question different? >>: Like, if it's a [inaudible], if you store the input locally, you can get the data locally, right? >> Magdalena Balazinska: Yes, yes, you'll see it on the next slide exactly. The main observation is that we have these four rules, but they don't say anything about checkpointing or materializing the output. They don't say anything about how to implement them, which means that as long as the operators play by these rules, they can do whatever they want. And here's one example, with checkpointing. Let's go through this example: the middle operator, O2, crashes and restarts. What the operator does first is ask the downstream operator, O3: what is the last tuple that I sent you? And O3 is able to reply because of the rules we have about identifying tuples uniquely and remembering the last tuple received. Now O2 knows where it crashed with respect to the downstream operator. At that point O2 can do the following. If it happened to save a checkpoint to disk, it can recover that checkpoint from disk; if it didn't have a checkpoint, it will restart from the beginning. So it's going to either recover state or start from scratch. And depending on which of these alternatives the operator used, it's going to ask O1 to replay a certain amount of data. If O2 checkpointed, maybe it will ask for only the most recent few tuples, and maybe that will be efficient because those few tuples are still in memory at the preceding operator; if it didn't checkpoint and has to restart from the beginning, maybe it will ask to replay all the data from the beginning. And finally, once O2 gets the replayed data from O1, it processes the data, waits until it catches up to the last tuple, and sends only the new data to the downstream operator. So what is nice is that we have several nice properties.
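The recovery walkthrough above can be sketched in code. The class and function names here are hypothetical, tuples are reduced to integer ids, and O2 is treated as an identity operator so that ids line up across the whole pipeline, which is a simplification of the real protocol; the shape of the handshake is what matters: query downstream for the last tuple it received, restore the crashed operator's private checkpoint (or start from scratch), ask upstream to replay the needed suffix, and forward only genuinely new tuples.

```python
class Op:
    """Toy operator obeying the rules: unique tuple ids, a replayable
    output suffix, and remembering the last tuple id it received."""

    def __init__(self):
        self.last_received = -1   # rule 4: remember last tuple id seen
        self.out_log = []         # supports rule 2: replay any suffix
        self.checkpoint = None    # private strategy; None = no checkpoint

    def receive(self, tid):
        self.last_received = tid

def recover_o2(o1, o2, o3):
    """Restart crashed O2 without touching O1 or O3."""
    last_delivered = o3.last_received            # ask downstream (rule 4)
    if o2.checkpoint is not None:
        o2.last_received = o2.checkpoint         # restore private checkpoint
    else:
        o2.last_received = -1                    # or start from scratch
    # Ask upstream to replay the suffix O2 still needs (rule 2):
    for tid in range(o2.last_received + 1, len(o1.out_log)):
        o2.receive(tid)
        if tid > last_delivered:                 # eliminate duplicates;
            o3.receive(tid)                      # forward only new tuples

# 10 tuples flowed through; O3 had received up to tuple 5 when O2 crashed,
# and O2 had privately checkpointed after tuple 2.
o1, o2, o3 = Op(), Op(), Op()
o1.out_log = list(range(10))
o3.last_received = 5
o2.checkpoint = 2
recover_o2(o1, o2, o3)
```

Note that nothing in `recover_o2` depends on *how* O2 implements its checkpoint, which is the decoupling the four rules are meant to provide.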
First, the data can flow continuously in the absence of failures. We don't put any constraints on when data has to be written to disk; there's no blocking. Second, if an operator crashes, we can restart only that operator, individually. And finally, we don't put any constraints on the strategies: as long as my own fault tolerance strategy can be made to work within this framework, I can use it for any operator. So there are many benefits. And we actually show in the paper how to implement the standard strategies, like checkpointing, materialization, and skipping over data, within this framework. Yes? >>: The comment you made about blocking the operator, so [inaudible] when one recovers, one has to [inaudible]. >> Magdalena Balazinska: Yes. So actually, what happens in our framework is that when a failure occurs, we do block; we block the whole query plan. >>: But without a failure there's no blocking. >> Magdalena Balazinska: Exactly. When a failure occurs, we indeed block the whole query plan, which is why we can often take the failure-free runtime and add the recovery time on top, because no one is doing anything during that time. Yes? >>: [inaudible] the last tuple that's received, in the event of an upcoming crash? >> Magdalena Balazinska: So, for example, the operator could simply keep it in memory. If, let's say, O2 crashed but O3 didn't crash, O3 could still have that in memory. If O3 crashes -- >>: What about when O2 has crashed? When you try to resurrect O2, there's an awful lot of work it doesn't want to repeat. >> Magdalena Balazinska: It depends on what O2 does. If it has checkpointed the work, it can recover from the checkpoint; if it crashes without a checkpoint, it pretty much has to restart from the beginning, unless it's a stateless operator. If O2 is a selection, then it can just ask O3. >>: Go back to the previous slide. >> Magdalena Balazinska: This one? >>: So I see.
So if I ack a tuple, at that point I need to have done something to persist it? >> Magdalena Balazinska: Yes, exactly. And about the failure model: in this paper, just for the purposes of exploring the idea of fault tolerance optimization, we restricted ourselves to process failures. We did not assume that disks will fail, as a way to explore this idea without too much complexity. If you assume that disks can fail, then perhaps O2, before it can acknowledge, not only has to write the data locally but maybe also has to replicate it somewhere else, and only then can it acknowledge. But that would add overhead to the fault tolerance strategy and maybe tip the scale toward not using any fault tolerance, because otherwise you may have too much overhead at runtime. >>: [inaudible], it's these kinds of considerations that killed them -- it's a very vague fault tolerance kind of model. The overheads of introducing even moderate granularity [inaudible]. >> Magdalena Balazinska: It can become complicated, that's true. This is the first step toward saying whether it's worth it, because if it doesn't work at this level, there's no point going forward. I'm not saying it's directly practical; it's an interesting scientific question on that path. All right. Other questions? So this is for the -- yes? >>: So there's database work on suspending and resuming queries, where operators [inaudible] with upstream operators, so they can either choose to checkpoint data at that point and invoke the contract, essentially allowing a signal to replay parts of [inaudible] the operator. How does this compare to that kind of -- >> Magdalena Balazinska: The main difference compared to a lot of the work, including that type of work, is that ours is really meant to enable heterogeneity. There are a lot of protocols that are similar in spirit, but they assume homogeneity: everyone is checkpointing, or everyone is skipping over.
Here, in contrast, we really thought about what is the minimal set of primitives that we need in order to decouple the choices of the strategies. A lot of the principles, such as acknowledging a tuple or replaying tuples, come up in different protocols, but typically everyone follows some uniform strategy, or the contract is specific to a particular pair of strategies. Here the contract is independent of the pair of strategies, and that is the main goal. Other questions? All right. So we've seen the protocol; the second part, which goes back to the other question, concerns the cost model. We want to make the decision on the fault tolerance strategy to use in an automated fashion, and therefore we need some way to compare the cost of the different strategies. The cost function that we chose to optimize for (we could have picked other cost functions) is the execution time with failures. The cost formula is that we try to minimize the time spent in regular processing, shown in blue, which is the time for the first tuple to propagate through the whole pipeline plus the time for the remaining tuples to go through. Remember, we assume nonblocking execution, although if an operator blocks, we model it as delaying the first tuple until it is done processing. To that, since we block during failure recovery, we add the expected number of failures times the time we spend recovering from each failure. This is a reasonably standard cost function to optimize for. The challenge, though, is that when we started with standard cost formulas, we were getting highly inaccurate results. And like I mentioned, in a query optimizer the differences between different query plans can often be orders of magnitude, which is huge.
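To make the cost function concrete, here is a back-of-the-envelope sketch with entirely invented numbers: expected runtime is modeled as the failure-free runtime (inflated by any per-strategy fault tolerance overhead) plus the expected number of failures times the per-failure recovery time. The overhead and recovery figures are illustrative assumptions, chosen only to show how the cheapest strategy flips with the failure count.

```python
def expected_runtime(base, ft_overhead, n_failures, recovery_time):
    # cost = failure-free runtime (with FT overhead) + E[failures] * recovery
    return base * (1 + ft_overhead) + n_failures * recovery_time

base = 100.0  # failure-free runtime of the plan, arbitrary units

def costs(n_failures):
    return {
        # restart: no overhead; recovery redoes ~half the work in expectation
        "restart": expected_runtime(base, 0.00, n_failures, 0.5 * base),
        # materialize: pay disk-write overhead; recovery replays one operator
        "materialize": expected_runtime(base, 0.45, n_failures, 0.1 * base),
        # checkpoint: small snapshot overhead; near-instant recovery
        "checkpoint": expected_runtime(base, 0.05, n_failures, 0.02 * base),
    }

no_fail = costs(0)   # restart wins: nothing to recover, no overhead paid
one_fail = costs(1)  # checkpoint wins: cheap recovery dominates
```

With these made-up numbers, `restart` is cheapest with zero failures and `checkpoint` with one, which mirrors why no single strategy is uniformly best.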
So even if the cost functions were not perfectly accurate, that was fine -- sufficiently good to just pick reasonably good plans. Here we are talking about 40 or 50 percent differences in runtimes between the different fault tolerance plans. So we needed to be more accurate in terms of the cost models for the operators. And the key thing we observed when executing our pipelines is that there's a lot of dynamic behavior in the operators. In particular, an operator will not necessarily produce output at a steady rate. A selection operator will: it's a filter, so I get an input tuple, apply the filter, and output the tuple. It's a linear operator producing data at a steady rate. But many operators start empty and then accumulate state, and they have a nonlinear behavior that, when we were not capturing it, gave us too much of a difference between the predicted and the actual runtime. So the way we chose to solve this is by using a convex optimization framework and modeling the operators with constraints of the form: if I put two operators together, the input rate for this operator cannot be any higher than the output rate of the preceding operator. And the rate at which this operator is producing tuples cannot be any higher than, say, how much the CPU allows me to produce. So we're going to model operators with all these kinds of bounds. And here's one example, for the symmetric hash join operator. The idea behind this operator is that we're going to read data as the data appears: when we have an input tuple, we put it in an in-memory hash table. When I have a tuple from the other relation, I probe the first hash table and I store the tuple in its own relation's in-memory hash table. The next tuple comes, I probe the other relation's table, and I store the tuple.
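The symmetric hash join just described can be sketched in a few lines. This is a minimal single-machine illustration (class and method names are mine, not the system's): as each tuple arrives from either input, it is inserted into its own side's hash table and probed against the other side's, so join results stream out without waiting for either input to finish:

```python
# Minimal sketch of a symmetric hash join: one in-memory hash table per input,
# keyed on the join attribute; every arriving tuple is stored on its own side
# and probed against the other side, producing results incrementally.
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, tup):
        """Store tup on its side, probe the other side, and return the
        (left, right) result pairs produced by this one arrival."""
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(tup)
        return [(tup, match) if side == "left" else (match, tup)
                for match in self.tables[other][key]]
```

Note how the per-tuple probe cost grows as the tables fill up -- this is exactly the state-dependent, nonlinear behavior the cost model has to capture.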
So at the beginning, when the operator starts running -- actually, no. If I assume that all the data is available to me, there's a certain maximum rate at which we can produce tuples, which is determined by how much CPU we have available to us. Yes? >>: I have a question, so you have an optimal -- >> Magdalena Balazinska: Yes. >>: For each operator -- what are you trying to do? >> Magdalena Balazinska: So this is a good question. I should be more clear. >>: Whether to materialize? >> Magdalena Balazinska: Exactly. What we want to do is, for each operator, decide which of our N available fault tolerance strategies to use. In our experiments we looked at skipping, checkpointing, and materializing. And if I pick checkpointing, how frequently should I be checkpointing? When you materialize, you only materialize the output. When you checkpoint, you checkpoint your internal state. So the overall optimization is that I have my cost function, and each operator has a variety of fault tolerance strategies available to it. For each fault tolerance strategy, we need a model that will help us assess what the cost of that strategy will be. >>: An operator checkpoints all of its state, or none of it? >> Magdalena Balazinska: That's correct. >>: There's no option -- >> Magdalena Balazinska: No, we don't do that. Either you checkpoint or not. If you checkpoint, you still need to pick the frequency. >>: You gave an example saying the difference between two strategies could be as much as 70 or 80 percent. >> Magdalena Balazinska: Yes. >>: That alone doesn't justify the complete optimization. Perhaps there exists one strategy that comes within 20 percent on all inputs. >> Magdalena Balazinska: Yes. >>: In which case you probably don't need such a -- >> Magdalena Balazinska: I agree. I totally agree. In this case we really wanted to explore the idea of the whole optimization.
But I agree that in practice you might want to take one of those simpler strategies, because you might get within 20 or 30 percent of optimal. I completely agree, yes. >>: This may be off subject, but to quote John Mashey speaking about 10 percent: we do unspeakable things for 3 percent. >> Magdalena Balazinska: I see. Depends on the perspective. >>: The product is a big number. >> Magdalena Balazinska: A big number. But I agree -- that's why I say it's more of a scientific interest; we wanted to see how well it works. >>: In theory the spread is decent -- maybe there's one strategy that's consistently in the middle [inaudible]. Or it may be that any given strategy is lousy for some inputs, in which case it's completely out of contention. >> Magdalena Balazinska: That's right. At least we want to eliminate the bad plans. We did a lot of sensitivity analysis to see how sensitive we were to different bad inputs. But the bottom line, going back to the cost function: if I take an operator such as the symmetric hash join, if all the input data is available, then the output rate is limited by how fast I can process this input. But typically the input is coming in at a certain pace, and as the operator goes, it accumulates more state, so I'm finding more join matches. So initially the output is limited by how fast the input is arriving. When we combine this, what eventually happens is that the operator initially processes as fast as the input is arriving, but eventually it becomes CPU-bound and cannot process any faster once it has accumulated a certain amount of state. And what we found is that because this effect is something like 25 percent, and we're talking about differences on the order of 40 percent between plans, we had to account for this type of effect in order for the optimizer to actually select a good fault tolerance plan.
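The shift from input-bound to CPU-bound behavior described above can be captured with a single bound: the operator's output rate is the minimum of the input arrival rate and a CPU-limited rate that falls as state accumulates. A toy illustration, with made-up parameter names and a simple linear state-cost assumption:

```python
# Toy model of the operator throughput bound from the talk: output rate is
# capped by both input arrival rate and CPU capacity, and per-tuple CPU work
# grows with accumulated state (e.g. more hash-table matches to probe),
# so an operator starts input-bound and eventually becomes CPU-bound.
def output_rate(input_rate, cpu_budget, state_size, cost_per_tuple):
    # CPU-limited rate falls as state accumulates (assumed linear here).
    cpu_rate = cpu_budget / (cost_per_tuple * (1 + state_size))
    return min(input_rate, cpu_rate)
```

Early in execution (small state) the input rate is the binding constraint; once the state grows enough that the CPU-limited rate drops below the arrival rate, the operator becomes CPU-bound -- the nonlinearity the cost model had to account for.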
>>: Are you assuming a fixed query plan, rather than looking at the fault tolerance options across all the plans? >> Magdalena Balazinska: I agree, yes. And that's possible; we just didn't do it. But it would be possible to expand the search and basically integrate it with the query optimizer -- that might be interesting. Yes? >>: There are natural points where it makes perfect sense to checkpoint -- minimal-state points. For example, in a [inaudible] join, you know that at certain points your state is essentially negligible. Those are optimal points for a checkpoint strategy. >> Magdalena Balazinska: Yes. >>: So I'm trying to understand how those specific choices are related to the standard -- >> Magdalena Balazinska: That may be one way, too, when we actually look at the optimizer. Here we asked what the frequency should be, and if I just pick a frequency without the specifics of the operators, then I'm going to checkpoint an average state size at that frequency. But it is in fact the case that for some operators there are certain points that make a lot of sense. We could restrict the optimizer to look at only these points -- say that for these operators the frequency is limited to being only one of the following -- and get those best points, indeed. Exactly. So how well does this model work? This is actually a good case: when we have different operators, what we did is we looked at different pairs of operators with different combinations of fault tolerance strategies -- skipping plus materializing, skipping plus skipping -- and this is the comparison of the real runtime with the one that we predicted. As you can see, we're not always completely spot on; there are differences in the area of 10 percent, but at least we can distinguish between all the main plans. And this is as far as we wanted to get.
So the final question is: now that we have this optimizer, we have the ability to use different fault tolerance strategies, and we have the ability to pick these different strategies -- are we able to find these good plans, and how much faster do these plans actually execute? This is one example from our paper: a standard select, join, join, aggregate query, processing 160 million input tuples; I think the select outputs eight million, the joins another eight million and another 40 million, and the aggregate has a state of 8,000 tuples. So that's the query plan. As you can see, with the different strategies -- restarting, skipping, materialization, checkpointing -- this is one of the examples where a hybrid strategy was really helpful. The hybrid strategy ended up being to materialize the output of the select, do nothing for the joins, and then checkpoint the state of the aggregate, which is a little bit what we could anticipate. And the difference was significant. It doesn't look as impressive on the graph, because we're not talking about orders of magnitude, but still, the hybrid plan was 20 percent better than any of the uniform strategies, and it is 33 percent better than just restarting. If you look at the choice between the best and the worst possible plans, in this case it is also on the order of 33 percent, which is significant -- it's like a third of your query runtime, which is nice. So let me quickly show you a couple more results from the paper. This is another example query; the slide shows how many million tuples we are processing. "None" here is actually the same as skipping.
And this is another example where we show what we predict and what we observe, and at least it shows, versus restarting, that in this case the best strategy was to uniformly do nothing, and our optimizer was able to select that strategy. Another example -- this is also important -- is the same query, select, join, join, join, with slightly lower selectivities, and suddenly the best strategy is to materialize. Given the same operators, change the input data, and the best strategy changes; again, the optimizer is able to find that strategy. Although in this case, because there is a little bit higher network consumption than we expected -- actually, not this one. We then changed the query plan a little bit to have an aggregate operator, and this is where the best strategy was to checkpoint, and we were able to find it. But this is the one where -- oh, that's right. This is the one where we're producing quite a lot of tuples: some operators were network bound, and our cost model was not as accurate for network-bound operators; it is more accurate for CPU-bound operators. So our estimates were not as accurate, but we still found the right plan. Finally, this is where we have the hybrid plan. Actually, this is the same as I showed earlier: the hybrid plan where we materialize, skip, skip, and checkpoint the aggregate at the end. But this one also shows our estimates. >>: How do you model the number of failures [inaudible]? >> Magdalena Balazinska: This is very important. How we model it: typically, from what one reads in papers, people often know that in their specific cluster there's a certain average number of failures per job, or an average meantime to failure, or some statistics of that kind.
So what we do is we say: given that I have a query, I can estimate roughly how long it will run for and how many nodes I want to use for it, and from that I will estimate how many failures to expect. I expect this query to run with five failures, six failures, and we optimize for the runtime with that number of failures. >>: Do you find the strategy is sensitive to that [inaudible]? >> Magdalena Balazinska: Actually, no. We did a sensitivity analysis -- do I have that later? We have the exact results in the paper. We found that we were not sensitive to small errors in the estimated number of failures, and we were not sensitive to errors in the performance estimates, like the cost of reading and writing data from disk. The one point we were sensitive to, of course, was cardinality estimation errors. Within 10 percent we can do quite well, but when we were off by more than 10 percent we started to choose the wrong plans, because we were thinking, hey, it's great to materialize, and then discovered there's actually much more data than expected. So that's the sensitivity. And finally, to go back to the idea of user-defined operators: this is one example plan where we replaced the final aggregate with a user-defined operator. The best plan in this case, the hybrid one, had the last operator checkpoint, and that was the runtime we would get. But here we replaced that operator with a user-defined operator. What we do with those is assume that they're unable to do anything in terms of fault tolerance. So now we have a query plan where we can pick fault tolerance strategies for all operators, but some of them don't know anything about fault tolerance. And we showed it's still useful: even if you have some of those operators, we picked different plans, and you can still do better than some of the uniform strategies, such as, for example, materializing on behalf of the operator.
So this is just to go back to the user-defined operator question. So how much time do we have? 1:30, 2:30? We have time? So this is kind of ongoing work. Like I said, this was interesting. What we noticed is that it is possible to use different fault tolerance strategies in the same query plan, we can pick them automatically, and it does make a difference that's visible in the actual runtime. The key challenge, though, is that we need specific models for the operators. And that's really painful, because it's hard to get accurate models -- like in the one example that was really network bound, our models were not that accurate -- and if I want to write new operators, it's hard to develop these models. So where we are headed right now is to say: let me forget about these models. I'm going to start and run a query in parallel, and I'm going to use some sort of fault tolerance strategy, maybe like materializing all the output. And as I run it, I'll simply observe what is going on. If I find that I really have some bottlenecks, I will try to remove or add some fault tolerance, maybe at runtime, and that will allow me to do better. So that's what we're exploring right now. In terms of a summary of the whole fault tolerance work: we looked at the problem of running queries in parallel, and whether it's worth it to materialize, checkpoint, and apply these different strategies. Our contribution was to say: look, if I have a parallel query, I can actually run the query and make it fault tolerant without blocking, I can pick and choose the technique, and I can do this automatically. So that was the main point of the whole fault tolerance work. And where we're headed, like I said, is more towards the dynamic scenarios. So any questions at this point? Yes? >>: You said that when you replay after a failure, you have learned something about the actual cardinalities, so a different plan might be better.
Do you have a chance to change the [inaudible] then, or -- >> Magdalena Balazinska: That's a good point. We actually do not change the shape of the plan at that point. But we could. This would be something also interesting, especially since what happens is, if we crash, while one node is restarting from a failure no one else is doing any work. So we could take that opportunity to make some changes at that time, since the other nodes could be doing useful work while we are replaying. So yes, if you're working in that space, that might be interesting. Yes? >>: On the analysis, did you find any patterns that say some strategies work for some operators irrespective of -- like, for example, materialization is good for [inaudible] operators, or -- >> Magdalena Balazinska: So we did actually find some patterns. In particular, if I have a stateful operator, then it's frequently the case that the best strategy is to materialize right before that operator. Or operators that have small state can definitely checkpoint. But we often found it's not about one specific operator; it's about combinations of pairs or sequences of operators, and also the input data sizes. But yes, we did find some patterns: if I have a sequence of stateless operators followed by a big stateful operator, it often works well to materialize the output of those stateless operators, but it's hard to make that general. These are still things we see, but the cost-based model allows you to formalize that intuition. Other questions? Yes? >>: The user-defined operator has a recovery strategy of none. >> Magdalena Balazinska: That's right. >>: But given the Google jobs that have an average of five crashes, would the job ever get done?
>> Magdalena Balazinska: It does, because we still restart that user-defined operator from the beginning, but we restart only that operator, because the operators around it can provide fault tolerance and things like replay -- we definitely don't want to reprocess the whole query. >>: Any other questions? >> Magdalena Balazinska: Actually, if you have a few minutes -- I don't know how many minutes we have, that's what I'm asking: do we have time? Can we go over 2:30? I can't remember how much time we allocated; that's why I'm hesitating. What I wanted to do is give you a broad overview of the type of research we're doing in our group besides fault tolerance; it's something I'd love to talk to people about offline. We're doing work along three broad directions. We are on campus, there are a lot of scientists, we talk to them, and all of them have a lot of data. We find there are really challenges along three lines of work. First, how can I get someone like a scientist -- who is technically skilled and capable, and wants to run complex machine learning, complex user-defined operations, and other analysis -- to do this efficiently? Of course we can leverage cloud computing environments, and this is our new cloud project; what I talked about today is an example project from within this umbrella. The second is to look at not only making things go fast, but also making them easier to use. We're not HCI people, but we're still looking at things such as: how can I help someone articulate a SQL query? Can I use past executions -- the past usage history from the cluster -- to learn what kinds of queries people ask and help newcomers articulate their own queries? Finally, the last one we're looking at, which is related to the Windows Azure data market, is to figure out how to price data. Now a lot of people have lots of data, and they all want to buy and sell this data online.
What kind of database support do we provide to make it easier for them to articulate prices, reason about prices, price only a couple of views, and automatically be able to sell arbitrary queries, and so on? And just to quickly give you some examples of projects in these different categories, let me give a couple of projects in the efficiency category. So we worked with a lot of scientists, and one thing we discovered time after time is that often when jobs execute, there's a huge skew. This is one example -- a debug run, smaller scale. On the X axis I have time; on the Y axis I have all the tasks I'm running in my cluster. There are different colors of tasks; they don't matter here. But I can see the duration of all the tasks is very, very small, except for this one crazy task at the top, which takes forever. So we actually have a couple of systems that allow us to smooth out this kind of skew automatically. The top line is 1.5 hours, versus most of the other lines in this job, which are on the order of a couple of minutes. So we built a system called SkewReduce, and the details are in a SoCC 2010 paper. It is also an optimizer: it takes a sample of the input data, cost functions describing the actual processing that we want to do, and information about the cluster, and figures out how to partition and allocate the data in a way that automatically gets rid of this type of skew. And what is interesting is that on real scientific workloads, we found runtime improvements of a factor of 6 to 8 compared to a standard MapReduce implementation. So this is really significant. But that was specific to one type of application that we find in a science domain. We also have more recent skew work where we said: here I have these cost functions, but in the new work I don't want to require any information from the user. What we do is simply observe what is going on in a Hadoop cluster.
We notice when skew is occurring and find a good way to reallocate the data in a manner that introduces low overhead and looks ahead to optimize for total runtime. Without knowing anything about the application, we can reduce runtimes by a factor of four. So this is significant; these are large factors. These are real systems, actually built on top of Hadoop and publicly available now. Actually, my student who worked on this is going to join Microsoft in September, on the Bing search team, so you guys will get those skills now. We also looked at iterative processing. A lot of people want Hadoop clusters, but they need to run machine learning and PageRank-style algorithms that iterate. By simply being smart about caching and scheduling, we can cut runtimes in half. This is also a system that's now publicly available that we built at UW. We're also collaborating with the [inaudible] team, where the goal is to build a parallel system that is different because it operates on multi-dimensional arrays. Scientists often simulate the universe or the atmosphere, and they operate on these multi-dimensional arrays. It's a new system with some 500,000 downloads. It's not solely our work; it's joint work with MIT, Brown University, Portland State University, and so on. So those are examples from this category. In terms of making big data analytics easier, I mentioned that we do different things. We looked at SQL auto-complete, and the idea is that as a user types a SQL query, the user can ask for recommendations. What we do is say: okay, the user has put certain tables and certain predicates in the query. We look at that query, we look at past usage of the cluster, we ask who issued similar types of queries and what other predicates and tables they used in their queries, and we recommend the most popular of those. We tested this on the Sloan survey query log and had very good results in terms of the quality of the recommendations.
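The auto-complete idea just described -- find past queries that contain everything the user has typed so far, then recommend the most popular additions -- can be sketched as follows. This is a deliberately simplified illustration (representing each query as a set of snippets is my assumption, not the system's actual representation):

```python
# Popularity-based query recommendation sketch: given a partially typed query
# (a set of tables/predicates), find past queries in the log that contain it,
# and recommend the k most popular snippets those queries add on top of it.
from collections import Counter

def recommend(partial, query_log, k=3):
    partial = set(partial)
    counts = Counter()
    for past in query_log:        # each past query: a set of snippets
        if partial <= past:       # this past query extends the current draft
            counts.update(past - partial)
    return [snippet for snippet, _ in counts.most_common(k)]
```

For example, if most past queries that mention a given table also filter on a particular predicate, that predicate surfaces as the top recommendation as soon as the user types the table name.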
We also built another tool that allows people to browse through queries, but I'm not going to discuss it. We looked at other things that try to help people understand their cluster better once they have articulated their query and run it in a cluster. We looked at things such as better progress estimates that allow people to have a better sense of how long their queries are going to run for. One of our key contributions was to say: well, failures will occur, skew will occur, so instead of giving one best guess, we'll give people ranges, where we try to make the range as small as possible so that it's still useful, and say: this is the best-case runtime estimate, but assuming a single worst-case failure, or if skew at the following operator is quite likely, then here is the other end of the range. So users can have a better sense of how long their queries will take. Most recently we built a system that I'm actually pretty proud of. What we can do is run a Hadoop job and then ask questions such as: why did my Hadoop job run slower than this other job, even though it processed less data and was running using the same number of instances? And our system uses machine learning to say: well, because between these two runs someone, for example, changed a cluster configuration setting and increased the block size, and therefore you're actually not using all those machines, perhaps. So this is something that will appear in PVLDB. So those are three examples from this area; again, I'm really happy to talk to you about this offline. Finally, pricing -- this is the most recent project, and it is actually quite fun. We're looking at this from several perspectives. The first idea is that people buy and sell their data online.
Currently, if I use something like the Windows Azure data market, I can price my data, but I can only price it using very simple selection queries: I define a view, people enter parameters into this view, and they pay based on the number of output records that they get, or they pay monthly based on the total amount of data they will be seeing. What we asked was: well, if I produce a certain number of views, then my customers, when they come, will have to pick among one of those views to satisfy their needs. But if none of the views is good, they will have to buy some superset, and if they're not willing to pay for the superset, they might not be really happy. So what we did -- this is more on the theoretical side -- is we developed theory that allows the seller to specify only some views, and we use those price points to automatically derive the price of any arbitrary query that a buyer wants to purchase. We automatically figure out, given the query that the user is interested in, the minimum-cost combination of views that they have to purchase to get the best price. And we can do this automatically for a pretty large class of selection queries. And this kind of pricing goes even much further than pricing data. We're looking at things such as: if I'm going to buy and sell data, I also want to protect the data. I will sell the data and say: you're allowed to use this data, but you're not allowed to join it with other datasets. So we're now looking at building databases that help us enforce digital rights on the data without too much overhead. Or, if I'm going to use cloud computing resources, many people share the same cloud, and I need a way to figure out what optimizations to implement, how to price these optimizations, and how to allocate the cost among the users -- and we also have automated techniques to do this.
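The view-based pricing idea described here -- derive an arbitrary query's price as the cheapest combination of seller-priced views that together answer it -- can be sketched with brute force. This is a heavily simplified illustration (treating a query as a set of columns and "answerable" as column coverage is my assumption; the actual theory handles arbitrage-free pricing of selection queries):

```python
# Brute-force sketch of view-based query pricing: given a few priced views,
# the price of an arbitrary query is the minimum total price over all view
# combinations that cover it (here, cover = union of columns contains the
# query's columns, a simplifying assumption).
from itertools import combinations

def query_price(query_cols, views):
    """views: list of (column_set, price). Returns the cheapest covering
    price, or None if no combination of views can answer the query."""
    query_cols = set(query_cols)
    best = None
    for r in range(1, len(views) + 1):
        for combo in combinations(views, r):
            covered = set().union(*(cols for cols, _ in combo))
            if query_cols <= covered:
                total = sum(price for _, price in combo)
                if best is None or total < best:
                    best = total
    return best
```

So a buyer who needs columns spread across two cheap views automatically pays their combined price rather than the price of one expensive superset view, which is exactly the fairness problem the work addresses.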
So this is, in conclusion, the broad scope of the kind of work that we do in the database group at UW on the systems side, related to big data analysis, cloud computing, and data pricing. So, okay, at this point that's it. Thank you very much. >> Surajit Chaudhuri: Any final questions? I think we had a fair number during the talk. [applause] >> Magdalena Balazinska: Good. All right. Thank you.