>> Surajit Chaudhuri: It's a great honor to introduce Magdalena
Balazinska, from the University of Washington, Seattle. Most of you
know her. She did
her Ph.D. at MIT. She works in the area of distributed systems and
databases, has a number of awards. Hard to name all. But she won the
Microsoft New Faculty Award [inaudible] award and the ten-year Most
Influential Paper. In addition, she has a number of best paper awards.
She works in the area also of scientific databases, which I forgot to
mention. Today she's going to talk about big data analytics. So
without further ado.
>> Magdalena Balazinska: Thank you, Surajit, and thank you everyone
for coming. Before I start, I'd like to acknowledge that the core
technical pieces that I'll be talking about are actually the hard work
of two of my students, Prasang and YongChul.
And also that this work is sponsored, actually interestingly, by NSF
and Microsoft. So thank you. So what's the motivation? The
motivation, of course, is that everyone nowadays has a lot of data.
Being on a university campus, we talk a lot more to domain or natural
scientists than to industry. And all these natural scientists are
generating data at just an amazing scale and rate.
For example, astronomers are building increasingly large telescopes
that will simply survey the sky, accumulating very large collections
of images that we want to analyze, and here, for the upcoming LSST
survey, we're talking petabytes of data.
Similarly, in other domains, such as oceanography, they're collecting
data from sensors and also simulating the environment, simulating the
oceans, producing large amounts of data.
In other areas, such as biology, we have new sequencing and new lab
automation techniques. So the bottom line is that scientists are just
producing a lot more data. And therefore they need tools; they need
help to analyze all this data.
And, of course, as you know perfectly well, better than I do, this
need extends beyond science. So if someone has a large amount of
data, what can they do with this data?
So there are several solutions that already exist. I could use a
parallel relational database management system, perhaps Greenplum, a
MapReduce-type system, such as Hadoop or Scope, or one of those
systems. Maybe I could use one of the new prototype parallel data
processing systems such as SciDB. And there are many, many challenges.
Many interesting challenges, and the one challenge I would like to
talk about today is really fault tolerance. So what is it about fault
tolerance? If I'm going to analyze a large amount of data, clearly
failures are going to occur while I'm processing that data.
In a recent publication, Google reports that they see approximately
five worker deaths which means kind of task deaths for each MapReduce
job. So definitely failures are occurring.
The question is what should we do about these failures. In the
community we really have two standard techniques that are pretty
extreme opposites of each other.
So one approach, which is characteristic of traditional parallel
database management systems, is to use a pipelined or, if you want,
streaming execution.
So here let's say we're going to start with a query that reads a
dataset, maybe selects some values, joins with another dataset, joins
with yet a third dataset, and performs some aggregation at the end.
If I run this in a parallel database, I will typically start reading
the data from disk and then stream the data directly from one operator
to the next as I do my processing.
I'm not going to write any data to disk. I'll just kind of stream the
data right through all these operators, which can be spread across kind
of several machines in the cluster.
So this is great. But if something crashes during the execution of my
query, well, I don't really have a choice. I have to restart the
whole query. So there are several advantages. One advantage is that
I don't actually pay any overhead of materializing, checkpointing, et
cetera; I just stream the data right through.
The second advantage is that if my operators are not blocking, I can
do what is called, for example, online aggregation or online query
processing, where I have the possibility to produce results
incrementally, which can be very nice if I'm going to analyze a large
dataset: I can see the results incrementally and maybe stop the
computation if I have seen enough, whether I'm happy or not.
On the other hand, on the negative side, clearly failures are going to
be costly, which at small scale is not a problem but if we're going to
scale this up this might become a problem.
So this is one strategy commonly used today. On the opposite side we
have MapReduce types of systems, where the idea is to use blocking, to
be extremely cautious, if you want: to use blocking query execution
where we'll start from the data as it sits on disk in the cluster, and
we're going to read the data, process it, and write the intermediate
results back to disk.
Then, when we're done with the first operator, we're going to schedule
the next operator, which will read data from disk, process it, write
it to disk, and so on.
The beauty of this technique is that if something crashes, we don't
have to redo all the work. We can just read the data from disk again
and reprocess only the operator that failed. So the advantage of this
technique is that recovering from failures is much less expensive than
in the previous case, because we only reprocess the one failed
operator.
On the other hand, there are two negatives. One is that we do have to
pay the overhead of writing all this data to disk. The second
negative aspect is that I can't see any results incrementally: because
each operator fully processes its input and blocks, I really have to
wait for the whole query to be done before I can see any results.
So the question that we asked ourselves is: what would be a nice
compromise between these two techniques? What we would like is to
preserve the ability to produce results incrementally. We don't want
to add any extra blocking just for the purpose of fault tolerance.
And second, what we would like is to have a fast execution time,
assuming that failures are going to occur.
So how can we do this? Can we actually achieve this? And the answer
is that, yes, we can, because there exist several fault tolerance
techniques and we can use them in a non-blocking fashion. In fact,
there are many ways in which we can achieve this. So there are many
techniques. I'm listing three here, and I'm going to describe them in
detail later. But the key question that we asked is: okay, great, so
I can take a parallel data processing system, I can make it fault
tolerant using one of several techniques. So which one should I use?
And does it actually matter? So as a first set of experiments to kind
of drive this research, we actually started with a small scale cluster,
kind of 17 machines, and we're going to run some simple queries and
actually try to observe what happens when failures occur and we use
these different techniques for fault tolerance.
So in these experiments we're going to run simple queries. We're going
to inject exactly one failure about halfway through the execution and
we're going to fail each of the operators separately in different
executions and we will average out the total runtime.
And we're going to try to use different fault tolerant strategies. So
the first fault tolerant strategy would be simply to do almost nothing.
And if I have a stateful operator and the operator crashes we're going
to restart the operator from the beginning. But if I have some
stateless operators what I can do, let's say I have a selection
operator in my query plan, that operator crashes, well, if I restart
the operator, I don't have to reprocess everything, I can simply skip
over all the data that I already processed, and I'm going to continue
processing.
So that's very similar to restarting the whole query plan, except
we're going to restart only one operator at a time: the operator that
crashed.
So now, on the small-scale cluster, we're going to compare. Here we're
going to have the different fault tolerance strategies, and on the Y
axis we have the runtime. The blue is the runtime without any
failures, and the red part is the extra time we spent due to failure
and recovery.
So if I just run the query without any failure, this is the runtime we
get. If I run the query and inject a failure roughly in the middle of
the execution, you can see the expected wasted time due to the failure
as the red part.
In the case of something like a parallel database if I just restart
then I'm going to add 50 percent because in expectation I have to redo
half of the work.
So what if instead I use this kind of technique of skipping over the
input data whenever this is possible.
So here indeed what will happen is the runtime without failures kind of
the basic runtime remains the same but in terms of recovery I can speed
up the recovery of some of my operators. I'm already going to do a
little bit better.
>>: Are we talking -- are you using the [inaudible]?
>> Magdalena Balazinska: It's all the operators, but the stateful
operators can't skip over anything. They have to go back and actually
reprocess everything, because they have to rebuild their state.
You can see that I have a little bit of improvement only for that
operator. But that's still one strategy that we can use, yes.
>>: Does this also use an index, does it have any indexes?
>> Magdalena Balazinska: In this case, we were not producing indexes.
The main thing here, and I'll get to the details later of how we
implement it in our context, is that if an operator crashes, it needs
the ability to tell whatever is upstream of that operator to resend a
certain amount of the data, not the whole data, in order to be able to
skip over some of the input.
>>: Where does the state start, this operator --
>> Magdalena Balazinska: There's no state. Also, in this case, how
much to skip over? So I'll get into the details later. I'll
basically show you how we implement it in our framework.
So at this point let's just keep it a little bit vague and get a sense
of whether it even matters that we choose the right strategies.
So this is just one simple strategy. What if we also used the
MapReduce type of strategy? What I'm going to do is write data to
disk at the output of each operator, the way MapReduce does it. But I
don't have to actually do it in a blocking way.
What I can do, instead of writing all the data to disk and then
sending it downstream, is write a little bit of data to disk and send
that data downstream, write a little more data to disk and send that
data downstream.
If an operator crashes, the advantage is that all the data the
operator consumed is available somewhere on disk. So we can simply
replay that data when the operator wants to recover, instead of
propagating the failure back and having to regenerate any of the
missing data.
So how well does this perform? Here we're going to go back. We have
again our query, and actually I forgot to say this is a simple query
that selects data, joins with another dataset, and aggregates. This
is not a real query, just a synthetic query.
And it was about, I think, 160 million tuples that we were processing.
We have the restart strategy. In restart, what we're doing is simply:
if any operator crashes, we restart the whole query.
In the skip strategy, let's say my select crashes: I can restart the
select and skip over the input data. Not using indexes; it's just
going to skip ahead to where the operator was.
If the join crashes, then in general the select will have to restart
itself to reprocess, but it can again skip over all the data it wants,
although the join will have to rebuild all of its state from the
beginning. Same thing for the aggregate. So with skip, if the select
crashes, we can skip over some input data; if any of the other
operators crashes, we pretty much start from the beginning. In the
case of materialize, we'll write the select's output to disk, and the
output of the join.
Now, if this join crashes, we don't have to reprocess the select; the
select will just resend the data from disk, and the join will be able
to rebuild its state from the beginning and continue. Kind of
MapReduce style.
How helpful is it? We actually executed it. In this case, for this
query, it's pretty bad, because the fact that we have to write the
intermediate data to disk really increases our runtime in the absence
of failures.
Once a failure occurs, we can recover much faster because we only have
to reproduce the operators that failed. But overall for this specific
query, this is not the best strategy.
There's one last very well-known strategy, and these are all common
strategies; we did not invent any of these fault tolerance techniques.
The last well-known strategy from the community is checkpointing.
So this is great for stateful operators: we run the operator, and
every so often we're going to write a copy of the operator's state to
disk. And if the operator crashes, we can restart from that last
checkpoint, which will include the state of the operator and exactly
how far along the operator was.
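As a rough illustration of checkpointing, here is a minimal sketch of
a stateful count aggregate that periodically writes its state and
input position to disk; the class, file layout, and checkpoint
interval are assumptions for illustration, not details of any of the
systems discussed.

```python
import os
import pickle
import tempfile

class CheckpointingAggregate:
    """Illustrative stateful operator (a count aggregate) that
    periodically checkpoints its state plus its input position."""

    def __init__(self, path, every=1000):
        self.path = path      # where checkpoints are written (assumed)
        self.every = every    # checkpoint every N input tuples
        self.counts = {}      # operator state: key -> count
        self.pos = 0          # how far along the input we are

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.pos += 1
        if self.pos % self.every == 0:
            self.checkpoint()

    def checkpoint(self):
        # Write atomically so a crash mid-write can't corrupt the
        # previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump((self.pos, self.counts), f)
        os.replace(tmp, self.path)

    def recover(self):
        # Restart from the last checkpoint: restore the state and learn
        # how far along the input the operator was, so only the suffix
        # of the input needs to be replayed.
        with open(self.path, "rb") as f:
            self.pos, self.counts = pickle.load(f)
        return self.pos
```

The point of the sketch is the trade-off the talk describes: paying a
small periodic write cost during normal processing in exchange for
resuming from the last checkpoint rather than from the beginning.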
>>: I'm surprised these are so bad in the previous slide. On this
scheme you have to seek to that block and wait for rotational latency.
Disk writing, in principle: find any blank spot on the disk, write
there, and make a note of where you wrote it. It may not allow you to
add that option.
>> Magdalena Balazinska: Sure, it could also be that we didn't
optimize all these techniques as optimally as you perhaps could. This
is also something --
>>: Basically on an empty disk, just start writing where you are.
>> Magdalena Balazinska: Sure.
>>: Next time around it's harder.
>> Magdalena Balazinska: That's right. In this case we don't do those
types of optimizations. We have an open file. The one thing we did
in the experiment is that each operator partition has its own disk.
We do sequential reads and sequential writes, but that's as much as we
do.
So if an operator fails, we restart the operator from scratch, read
the state from disk, and then process. So how well does this do?
We're going back to our query with the three techniques so far. And
in this case, if we checkpoint, we actually don't add that much
overhead, because the state is not that large, and now recovering is
extremely fast because we only have to process from the latest
checkpoint.
What is interesting, and this is the bottom line of our motivation, is
that given a simple parallel query and even a single failure, we can
get as much as a 70 percent difference in execution time between
choosing the wrong, if you want, fault tolerance strategy and the
right fault tolerance strategy.
So the second question might be: is checkpointing always the strategy
that we should use? Is it always the case that I should simply
checkpoint all the operators? Yes?
>>: When you do skip, you need to kind of be efficient about it. Do
you use an index for it? How is this implemented? Are you
rescanning?
>> Magdalena Balazinska: You want to be able to seek to the right
location; you don't want to scan the whole input data. An offset,
exactly. We're scanning from the beginning, and we remember the
offset, and we just restart from that, yes.
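The offset-based skip just described can be sketched as follows: a
stateless scan-plus-filter that remembers a byte offset, so after a
crash it seeks straight to where it left off instead of rescanning the
whole input. The class and member names are illustrative assumptions,
and a real system would persist the offset somewhere crash-safe.

```python
class RestartableScan:
    """Illustrative stateless scan + filter that remembers a byte
    offset so a restart can skip already-processed input."""

    def __init__(self, path, predicate):
        self.path = path
        self.predicate = predicate
        self.offset = 0  # persisted out-of-band in a real system

    def run(self, emit):
        with open(self.path, "rb") as f:
            f.seek(self.offset)            # skip data already processed
            while True:
                line = f.readline()
                if not line:
                    break
                self.offset = f.tell()     # remember progress
                value = line.decode().strip()
                if self.predicate(value):
                    emit(value)
```

On restart, constructing the operator with the saved offset makes it
reprocess only the suffix of the input, which is exactly the behavior
the skip strategy relies on for stateless operators.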
>>: Is this for a MapReduce setting too?
>> Magdalena Balazinska: You could. So this is actually the
experiments. This is the slide where I forgot to mention it: we're
using kind of our own skeleton processing engine, and all it has is
operators and fault tolerance strategies.
>>: Because MapReduce assumes something different. In a functions
system like Scope, user functions are very important.
>> Magdalena Balazinska: Yes.
>>: That checkpoint strategy mainly means an additional contract for
the programmer -- what is even the right notion of --
>> Magdalena Balazinska: Yes. Exactly. I'll get to this. What we
did initially is assume relational operators; we know their insides.
At the end I'll show you that if we have user-defined operators, we
can make the assumption that they can't do much in terms of fault
tolerance; you can use our technique and still get an improvement.
I'll get back to user-defined operators at the end. So this is a very
good question, yes.
>>: Checkpointing -- are you saying individual operators are
checkpointing?
>> Magdalena Balazinska: Yes.
>>: Because one operator might prefer checkpointing, versus another
operator which has much more state, where it would be much more
expensive?
>> Magdalena Balazinska: That's exactly where we are headed. So the
question that we asked -- and this is a purely academic question; what
I'm going to present is not something you could directly apply, but
something we were just curious about from the science perspective --
was: all right, do I always checkpoint? Is there one strategy that's
always the best? And the answer is no.
And this is the next experiment. What we did is simply replace the
last operator: exactly the same query, processing the same number of
input tuples, but with a join at the end, and here the bars are
completely different. In this case, the strategy where every operator
just skips over the input data when it can, with no checkpointing and
no materialization, gives me the best runtime.
So the observation here is if I have a parallel query, the choice of
fault tolerance technique can have a significant, visible impact on
runtime, even in the presence of a single failure.
So imagine with more failures, the differences will amplify. At the
same time, there isn't like a single well-known technique that I can
just apply uniformly.
So what we want is exactly what you suggested. We asked: can I,
should I, pick the right fault tolerance technique at the granularity
of each operator? We actually went further, and the question we asked
was: can we build an optimizer that will take my query plan and my
resources as input and automatically tell me what fault tolerance
strategy to use at each point in my query plan? And does it actually
matter? Would such a fault tolerance optimizer make a difference?
That's the question we're set up to answer. And the rest of the talk
we'll kind of see how we went about answering this question.
Does that make sense so far?
Yes?
>>: What is the cost function for the optimization? Is it just
runtime, or is there latency, any other factors?
>> Magdalena Balazinska: I'll get to this in a second, exactly,
because that's somewhere we can use standard cost functions, and one
of the challenges here is that we really tried to get as close as
possible to estimating something equivalent to runtime.
And because the difference between different fault tolerance plans is
not orders of magnitude -- with one failure it's on the order of maybe
40, 50, 70 percent -- we have to be a little more accurate in order to
select a good plan.
One challenge is that these operators interact with each other when
they're running in a pipeline. If an operator is, say, a symmetric
hash join, initially the bottleneck will be how fast the input data is
arriving. But soon it's going to produce so many tuples that the
bottleneck becomes the operator's CPU. So we have challenges in
modeling these dynamics. I'll come back to this in a few slides.
So the bottom line so far is that we want to see if we can choose
fault tolerance at the granularity of individual operators, and
whether we can make this choice automatically.
This is what we did in our optimizer, which we called FTOpt. The
reason the name is so short is that the paper was over the page limit,
so a short name helped squeeze the paper in. And this is actually in
last year's SIGMOD conference. The goal was two things. First of
all, I need the ability to mix and match fault tolerance strategies.
I'm going to have a single query plan with one operator using one
strategy, another operator using a different strategy. So the first
thing we had to develop is some sort of protocol that will hide all the
details of the fault tolerant technique and have some sort of high
level agreement between operators such that if one operator crashes, we
only restart that one operator and we don't care what anyone else is
doing for fault tolerance. So that's the first contribution.
And the second one was really to have this cost-based optimizer that
we call the fault tolerance optimizer. The input to the optimizer is
the query plan: we first run the regular optimizer to get a query
plan, and then we pick the right fault tolerance strategy for the
query.
And of course the question then is also kind of how much do we gain?
So kind of once we actually start to execute it in parallel, the way
this is going to work is we're going to kind of encapsulate these
operators. We're going to have our optimizer on the side, and we'll
have this encapsulation with the protocol between operators and the
optimizer will select the strategies and then different operators will
end up using different strategies.
So the granularity will be really kind of one operator in the query
plan. So let's look at the actual protocol that we use. The goal is
that we're going to have pipelining. We don't want to add any kind of
blocking; I want the data to flow continuously, because if at all
possible we would like to be able to show results incrementally. We
don't want fault tolerance to add any kind of blocking. That's the
first goal.
That will be the goal of the protocol. At the same time, we need the
ability, as people's questions raised, to figure out where do I
restart from, what do I reprocess. So the protocol has to capture
this.
This is basically illustrated with an example. Let's have three
operators, O1, O2, and O3. And our protocol really requires that each
operator obeys four different rules.
The first rule that we require is that all tuples that we send, all
records if you want, between operators must have some sort of unique
identifier, like a primary key, or it could be a hidden record ID.
The reason is that when we restart an operator, we will want to tell
the upstream operator where to resend data from. So we need the
ability to identify these tuples uniquely; we need those unique
identifiers. Second, we will require that an operator, at any point
in time, can go back and ask its upstream operator to resend any
suffix of its output data. So O3 can go back to O2 and say: please
restart and send me everything from the beginning, or please restart
and send me everything since tuple number 225. And the upstream
operator has to support this feature. The easiest way to implement it
is that if O3 says resend everything from tuple 225, O2 can simply
kill itself, restart, reprocess, and just discard all the output data
until it sees the right tuple, then continue. Of course, this does
assume that we have determinism in the operator. We do not support
operators that have some sort of nondeterminism in them, because we
assume that we can restart an operator and it will produce the same
output.
>>: So randomizations are okay?
>> Magdalena Balazinska: That's right. If you have some sort of
random seed, I can just give you the seed again; as long as there's
some way to restart and regenerate the same sequence, we're fine. We
discuss this in the paper; I'm not going to go into the details of
that aspect.
All right. In many cases, if I require that an operator remembers all
its output, that it has the ability to reproduce its output, that's
fine, but it can actually be quite costly. In many cases, once an
operator knows it has made a certain amount of progress -- let's say
this operator chose to checkpoint -- then it knows that it will
actually never ask for some old tuples again.
So, as an optimization, this operator can acknowledge to the upstream
operator and say: by the way, I will never ask you to replay anything
before tuple number 153. How I know is my problem, but I know that I
will never ask you for this.
And if the upstream operator uses something like materializing its
output to disk, then it can truncate its logs and do whatever
optimizations it wants.
And finally, the last rule: we're going to ask all the operators to
remember the last tuple that they received. So if the upstream
operator crashes, it can ask and say: okay, I just crashed; tell me
what is the last item that I sent you and that you received. And then
that operator will eliminate duplicates and only send the new data.
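The four rules above might be captured as an operator interface
roughly like the following sketch. The method names are my own
shorthand, not the paper's API, and the concrete class is a toy
implementation that keeps its output in an in-memory log.

```python
from abc import ABC, abstractmethod

class FTOperator(ABC):
    """Sketch of the four protocol rules as an abstract interface.
    Rule 1 is a convention: every emitted tuple carries a unique id,
    e.g. a monotonically increasing sequence number."""

    @abstractmethod
    def replay_from(self, tuple_id):
        """Rule 2: resend the suffix of output starting at tuple_id.
        How (re-scan, checkpoint, log) is the operator's business."""

    @abstractmethod
    def ack(self, tuple_id):
        """Rule 3: downstream promises never to request anything
        before tuple_id again, so older output may be discarded."""

    @abstractmethod
    def last_received(self):
        """Rule 4: report the id of the last input tuple received,
        so a restarted upstream can eliminate duplicates."""

class LoggingMap(FTOperator):
    """Toy concrete operator: a map that logs its output in memory."""

    def __init__(self, fn):
        self.fn, self.log, self._last_in = fn, [], None

    def process(self, tuple_id, value):
        self._last_in = tuple_id
        self.log.append((tuple_id, self.fn(value)))

    def replay_from(self, tuple_id):
        return [t for t in self.log if t[0] >= tuple_id]

    def ack(self, tuple_id):
        # Truncate the log: nothing before tuple_id will be requested.
        self.log = [t for t in self.log if t[0] >= tuple_id]

    def last_received(self):
        return self._last_in
```

The point is that the interface says nothing about checkpointing or
materialization, which is what lets each operator pick its own
strategy behind the same contract.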
Yes?
>>: You always require the sender to keep the output; why don't you
have the option of the operator itself saving its input somewhere?
>> Magdalena Balazinska: It can do whatever it wants. All I'm doing
is giving four rules; I don't tell the operators how to implement
them. I just say that the operator needs the ability. And the
easiest way would be that this operator does nothing, except that when
someone requests a replay, it crashes itself, restarts, and
regenerates the data. We don't put constraints on how the operator
implements those rules. Does that make sense? Or was the question
different?
>>: Like, if it's a [inaudible], if you store the input locally, you
can get the data locally, right?
>> Magdalena Balazinska: Yes, yes, yes you'll see it on the next slide
exactly. Kind of the main observation is we have these four rules but
they don't say anything about checkpointing, materializing the output.
They don't say anything about how to implement it, which means as long
as the operators can play by these rules, they can do whatever they
want. And here's one example actually with let's say checkpointing.
Let's go through this example: the middle operator, O2, crashes and
restarts. What the operator will do first is ask the downstream
operator, O3: what is the last tuple that I sent you? And O3 should
be able to reply, given the rules we have about identifying tuples
uniquely and remembering the last tuple received.
Now O2 knows where it crashed with respect to the downstream operator.
At that point O2 can do the following. If it happened to save state
to disk, it can recover that checkpoint from disk. If it didn't have
a checkpoint, it will restart from the beginning. Either way, it's
going to recover state or start from scratch.
And depending on which of these alternatives the operator used, it's
going to ask O2 -- sorry, O1 -- to replay a certain amount of data.
If O2 checkpointed, maybe it will ask for only the most recent few
tuples, and maybe that will be efficient because those few tuples are
still in memory at the preceding operator. If it didn't checkpoint
and has to restart from the beginning, maybe it will ask to replay all
the data from the beginning.
And finally, once O2 gets the replayed data from O1, it processes the
data, waits until it catches up to the last tuple, and sends only the
new data to the downstream operator.
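The recovery walkthrough above can be sketched end to end. This is a
simplified model with made-up class and method names: O1 remembers its
output, O2 is a pass-through map that may hold a checkpoint, and O3
remembers the last tuple it received.

```python
class Upstream:
    """O1: remembers its output so it can replay any suffix."""
    def __init__(self, data):
        self.out = list(enumerate(data))  # (tuple_id, value)
    def replay_from(self, tid):
        return [t for t in self.out if t[0] >= tid]

class Middle:
    """O2: a map (here: multiply by 10) that may hold a checkpoint,
    stored as the input position to resume from."""
    def __init__(self):
        self.pos, self.ckpt = 0, None
    def has_checkpoint(self):
        return self.ckpt is not None
    def restore_checkpoint(self):
        self.pos = self.ckpt
        return self.pos
    def reset(self):
        self.pos = 0
    def process(self, tid, value):
        self.pos = tid + 1
        return tid, value * 10

class Downstream:
    """O3: remembers the id of the last tuple it received (rule 4)."""
    def __init__(self):
        self.last, self.received = None, []
    def last_received(self):
        return self.last
    def receive(self, tid, value):
        self.last = tid
        self.received.append(value)

def recover_middle(o1, o2, o3):
    last = o3.last_received()              # step 1: ask downstream
    if o2.has_checkpoint():
        start = o2.restore_checkpoint()    # step 2a: recover saved state
    else:
        o2.reset()                         # step 2b: start from scratch
        start = 0
    for tid, v in o1.replay_from(start):   # step 3: upstream replays suffix
        out_id, out = o2.process(tid, v)
        if last is None or out_id > last:  # step 4: send only new data
            o3.receive(out_id, out)
```

With a checkpoint, only the suffix replays and already-delivered
tuples are filtered out as duplicates; without one, the same code path
falls back to replaying from the beginning.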
So we have several nice properties. First, the data can flow
continuously in the absence of failures. We don't put any constraints
on when data has to be written to disk. There's no blocking; the data
can just flow continuously.
At the same time, if an operator crashes, we can restart only that
operator, individually. And finally, we don't put any constraints on
the strategies: if I come up with my own fault tolerance strategy, and
it can be made to work within this framework, then I can use that
strategy for any operator.
So kind of many benefits.
And we actually show in the paper how to implement the standard
strategies, like checkpointing, materialization, and skipping over
data, within this framework.
Yes?
>>: The comment you made about blocking the operator, so [inaudible]
one or recovers one has to [inaudible].
>> Magdalena Balazinska: Yes. So actually what happens in our
framework is that when a failure occurs we do block; we block the
whole query plan.
>>: Without a failure, I think -- without a failure then there's no
blocking.
>> Magdalena Balazinska: Exactly. When a failure occurs, we indeed
block the whole query plan, which is why the cost of failures adds up:
no one is doing anything during that time.
Yes?
>>: [inaudible] the last tuple that's received in the event of an
upcoming crash?
>> Magdalena Balazinska: So one example: the operator could simply
hold it in memory. So if, let's say, O2 crashed but O3 didn't crash,
O3 could still have that in memory. If O3 crashes --
>>: What about if O2 has crashed? When you try to resurrect O2,
there's an awful lot of work it doesn't want to repeat?
>> Magdalena Balazinska: It depends on what O2 is, and whether it's
checkpointed the work. If it's a stateful operator and crashes, it
pretty much has to restart from the beginning, unless it has a
checkpoint. If O2 is a stateless operator, like a selection, then it
can just ask O3.
>>: Go back to the previous slide.
>> Magdalena Balazinska: This one?
>>: So I see. So if I ack a tuple, at that point I need to have done
something so that it's persisted?
>> Magdalena Balazinska: Yes, exactly. And the failure model: in
this paper, just for the purposes of exploring the idea of fault
tolerance optimization, we restricted ourselves to process failures.
We just did not assume that disks will fail, as a way to explore this
idea without too much complexity. If you assume the disks can fail,
then perhaps O2, before it can acknowledge, not only has to write the
data locally but maybe has to replicate it somewhere else; only then
can it acknowledge. But that will add overhead to the fault tolerance
strategy, and maybe tip the scale towards not using any fault
tolerance, because otherwise you have too much overhead at runtime.
>>: [inaudible], it's these kinds of considerations that killed them
to -- it's a very vague fault tolerant kind of model.
The overheads of introducing even moderate granularity [inaudible].
>> Magdalena Balazinska: It can become complicated, that's true. But
this is the first step towards saying: is it worth it? Because if it
doesn't work at this level, there's no point going forward. I'm not
saying it's exactly practical directly; it's kind of an interesting
scientific question on that path.
All right.
Other questions?
So this is for the -- yes?
>>: So there's database work on suspending and resuming queries, the
idea of operators [inaudible] with upstream operators so they can
either choose to checkpoint data at that point and involve the
contract, essentially allowing a signal to replace the parts
[inaudible] the operator. So how does this fit in comparison to this
kind of --
>> Magdalena Balazinska: So the main difference compared to a lot of
the work, including this type of work, is that this is really meant
towards enabling heterogeneity. There are a lot of protocols that
were similar in spirit, but they assume homogeneity: everyone is
checkpointing, or everyone is skipping over. Versus here, we really
thought about what is the minimal set of primitives that we need in
order to decouple the choices of the strategies.
But a lot of the principles, such as acknowledging a tuple or
replaying tuples, come up in different protocols; it's just that
typically everyone is doing some uniform strategy.
Or the contract will be specific to a particular pair of strategies,
versus here the contract is independent of the pair of strategies, and
that is the main goal.
Other questions? All right. So we've seen the protocol. The second
part, which goes back to the other question, concerns the cost model.
Here we want to make the decision on the fault tolerance strategy to
use in an automated fashion. Therefore, we need some way to compare
the cost of these different strategies. The cost function that we
chose to optimize for, and we could have picked other cost functions,
was the execution time with failures.
So the cost formula is: we're going to try to minimize the time spent in regular processing (shown in blue), which means the time for the first tuple to propagate through the whole pipeline -- remember, we assume nonblocking operators, although if an operator blocks, it simply delays the first tuple until it's done processing -- plus the time for the remaining tuples to go through. And, as has been mentioned, since we block during failure recovery, we add the expected number of failures times the time we spend recovering from each failure. This is a reasonably standard cost function to optimize for.
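As a toy illustration, the cost formula above can be sketched like this (the function name, the linear pipeline shape, and all numbers are assumptions for illustration, not the system's actual model):

```python
def query_cost(op_latencies, num_tuples, throughput,
               expected_failures, recovery_time):
    """Estimated execution time with failures for a linear pipeline.

    op_latencies: per-operator latency (s) seen by the first tuple
    throughput:   end-to-end tuples/s once the pipeline is full
    """
    first_tuple = sum(op_latencies)            # first tuple traverses the whole pipeline
    remaining = (num_tuples - 1) / throughput  # steady-state drain of the rest
    failures = expected_failures * recovery_time
    return first_tuple + remaining + failures

# e.g. a 3-operator pipeline, 1M tuples, 5 expected failures
t = query_cost([0.5, 2.0, 1.0], num_tuples=1_000_000,
               throughput=50_000, expected_failures=5, recovery_time=30.0)
```

The failure term dominates here, which is exactly why the choice of recovery strategy matters.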
The challenge, though, is that when we started with standard cost formulas, we were getting highly inaccurate results. Like I mentioned, in a query optimizer the differences between different query plans can often be orders of magnitude. So even if the cost functions are not perfectly accurate, that's fine; they're sufficiently good to pick reasonably good plans. Here we are talking about 40 or 50 percent differences in runtimes between the different fault-tolerance plans, so we needed the operator cost models to be more accurate.
And the key thing we observed when executing our pipelines is that there are a lot of dynamics between the operators. In particular, an operator will not necessarily produce output at a steady rate. A selection operator is a filter: I get an input tuple, apply the filter, and output the tuple. It's a linear operator producing data at a steady rate. But many operators start empty and accumulate state, and they have a nonlinear behavior that, when we were not capturing it, produced too much of a difference between the predicted and the actual runtime.
The way we chose to solve this is by using a convex optimization framework and modeling the operators with constraints of the form: if I put two operators together, then the input rate for this operator cannot be any higher than the output rate of the preceding operator. And the rate at which this operator produces tuples cannot be any higher than, say, what the CPU allows me to produce. So we model operators with all these kinds of bounds.
Here's one example, for the symmetric hash join operator. The idea behind this operator is that we read data as it appears: when we have an input tuple, we put it in an in-memory hash table for its relation and probe the hash table of the other relation. When the next tuple comes from the other relation, we store it in that relation's hash table and probe the first one. So if I assume that all the data is available to me, there's a certain maximum rate at which we can produce tuples, which is determined by how much CPU we have available.
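A minimal sketch of a symmetric hash join over one interleaved stream of tagged tuples (the interface and names are invented for illustration):

```python
from collections import defaultdict

def symmetric_hash_join(stream, key):
    """Join tuples tagged 'R' or 'S' on key(row), producing results as data arrives."""
    tables = {'R': defaultdict(list), 'S': defaultdict(list)}
    for side, row in stream:
        other = 'S' if side == 'R' else 'R'
        k = key(row)
        tables[side][k].append(row)             # store in this relation's hash table
        for match in tables[other].get(k, []):  # probe the other relation's table
            yield (row, match) if side == 'R' else (match, row)

stream = [('R', (1, 'a')), ('S', (1, 'x')), ('S', (2, 'y')), ('R', (2, 'b'))]
results = list(symmetric_hash_join(stream, key=lambda row: row[0]))
```

Because every tuple is both stored and probed on arrival, the operator is nonblocking, which is the behavior the cost model has to capture.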
Yes?
>>: I have a question. So you have an optimizer --
>> Magdalena Balazinska: Yes.
>>: -- for each operator? What are you trying to do?
>> Magdalena Balazinska: So this is a good question. I should be more clear.
>>: Would that materialize?
>> Magdalena Balazinska: Exactly. What we want to do is, for each operator, decide which of our N available fault-tolerance strategies to use. Should the operator skip? In our experiments we looked at skipping, checkpointing, and materializing. And if I pick checkpointing, how frequently should I checkpoint? When you materialize, you only materialize the output; when you checkpoint, you checkpoint your internal state. So the overall optimization is that I have my cost function, each operator has a variety of fault-tolerance strategies available to it, and for each strategy we need a model that will help us assess its cost.
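For instance, if per-operator costs were independent (they are not in general, which is exactly why real cost models are needed), picking a plan would reduce to a search like the following; all names and numbers are invented for illustration:

```python
from itertools import product

# Invented per-operator costs (s): normal-processing overhead + expected recovery cost
STRATEGY_COSTS = {
    'select':    {'skip': 40, 'materialize': 55, 'checkpoint': 60},
    'join':      {'skip': 70, 'materialize': 90, 'checkpoint': 80},
    'aggregate': {'skip': 95, 'materialize': 85, 'checkpoint': 50},
}

def best_plan(ops):
    """Exhaustively choose one fault-tolerance strategy per operator."""
    choices = [list(STRATEGY_COSTS[op]) for op in ops]
    return min(product(*choices),
               key=lambda plan: sum(STRATEGY_COSTS[op][s]
                                    for op, s in zip(ops, plan)))

plan = best_plan(['select', 'join', 'aggregate'])
```

With independent costs the search is trivial; the interesting part is modeling how one operator's choice changes its neighbors' costs.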
>>: An operator either checkpoints all of its output or doesn't checkpoint any of it?
>> Magdalena Balazinska: That's correct.
>>: There's no option --
>> Magdalena Balazinska: No, we don't do that. You either checkpoint or not. If you checkpoint, you still need to pick the frequency.
>>: You gave an example saying the difference between two strategies could be as much as 70 or 80 percent.
>> Magdalena Balazinska: Yes.
>>: That alone doesn't justify complete optimization. Perhaps there exists one strategy that comes within 20 percent across all queries.
>> Magdalena Balazinska: Yes.
>>: In which case you probably don't need the -- such a strategy --
>> Magdalena Balazinska: I agree. I totally agree. In this case we really wanted to explore the idea of the whole optimization. But I agree that in practice you might want to take one of those simpler strategies, because you might get within 20 or 30 percent of optimal. I completely agree, yes.
>>: This may be off subject, but I'm reminded of a John Mashey quote: we do speakable things for 10 percent; we do unspeakable things for 3 percent.
>> Magdalena Balazinska: I see. Depends on the perspective.
>>: The product is a big number.
>> Magdalena Balazinska: Big number. But I agree this is kind of -- that's why I say it's more of a scientific interest: we wanted to see how well it works.
>>: In theory the spread is decent -- maybe there's one strategy that's consistently in the middle [inaudible]. Or it may be that any given strategy is lousy for some inputs, in which case it's completely out of contention.
>> Magdalena Balazinska: That's right. At least we want to eliminate
the bad plans. We did a lot of sensitivity analysis to see how
sensitive we were to different bad inputs.
But the bottom line, going back to the cost function: if I take an operator such as symmetric hash join, when all the input data is available, the output rate is limited by how fast I can process that input. But typically the input arrives at a certain pace, and as it does, the operator accumulates more state, so it finds more join matches. So initially the output is limited by how fast the input is arriving. When we combine this, what eventually happens is that the operator initially processes as fast as the input arrives, but it eventually becomes CPU-bound and cannot process any faster once it has accumulated a certain amount of state.
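A toy version of that effect: the effective rate is bounded both by the input arrival rate and by a CPU bound that shrinks as accumulated state grows (function name and all numbers are invented for illustration):

```python
def processing_rate(input_rate, cpu_tuples_per_sec, state_size, state_penalty):
    """Effective tuples/s: no faster than input arrives, and no faster than
    the CPU allows once per-tuple work grows with accumulated state."""
    cpu_bound = cpu_tuples_per_sec / (1.0 + state_penalty * state_size)
    return min(input_rate, cpu_bound)

early = processing_rate(1000, 5000, state_size=0, state_penalty=1e-4)       # input-bound
late = processing_rate(1000, 5000, state_size=100_000, state_penalty=1e-4)  # CPU-bound
```

Early on the operator runs at the input rate; once enough state accumulates, the CPU bound takes over, which is the nonlinear behavior the model needs to capture.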
And what we found is that because this effect is something like 25 percent, and we're talking about differences on the order of 40 percent, we had to account for it in order for the optimizer to actually select a good fault-tolerance plan.
>>: Are you searching for a certain optimal plan, looking at all the plans for each?
>> Magdalena Balazinska: I agree, yes. And that's possible. We just
didn't do it. But it would be possible to expand the search and
basically look at it with the optimizer, that might be interesting.
Yes?
>>: So there are natural points where it makes perfect sense to checkpoint -- points of minimal heap state. For example, in a [inaudible] join, you know the computation runs out of a page and basically your [inaudible] is essentially negligible. Those are optimal points for a checkpointing strategy.
>> Magdalena Balazinska: Yes.
>>: So I'm trying to understand how those specific choices relate to the standard --
>> Magdalena Balazinska: That may be one way, too, when we actually look at the optimizer. Here we just pick a frequency, and without operator-specific knowledge we end up checkpointing an average amount of state at that frequency. But it is indeed the case that for some operators there are certain points that make a lot of sense. We could restrict the optimizer to look at only these points -- say, for these operators the checkpoint frequency is limited to one of the following points -- and get only those best points, indeed. Exactly.
So how well does this model work? This is actually a good case: when we have different operators, what we did is look at different pairs of operators with different combinations of fault-tolerance strategies -- skipping and materializing, skipping and skipping -- and this is the comparison of the real runtime with the one we predicted. As you can see, we're not always completely on the spot; there are differences in the area of 10 percent, but at least we can distinguish between all the main plans, and that's as far as we wanted to get. So the final question is: now that we have this optimizer, the ability to use different fault-tolerance strategies, and the ability to pick among them, are we able to find these good plans, and how much faster do these plans actually execute?
This is one example from our paper, where we have a standard select, join, join, aggregate query. This query was processing 160 million input tuples; one of the selects outputs eight million, the joins add another eight million and another 40 million, and the aggregate has a state of 8,000 tuples. So that's the query plan. As you can see, we tried different strategies: restarting, skipping, materialization, checkpointing. And this is one of the examples where a hybrid strategy was really helpful. The hybrid strategy ended up being to materialize the output of the select, do nothing for the joins, and then checkpoint the state of the aggregate, which is a little bit what we could anticipate.
And the difference was significant. It doesn't look as impressive on the graph because we're not talking about orders of magnitude, but the hybrid plan was still 20 percent better than any of the uniform strategies, and 33 percent better than just restarting. If you look at the gap between the best and the worst possible plans, in this case it is also on the order of 33 percent, which is significant -- it's like a third of your query runtime, which is nice.
So let me quickly show you a couple more results from the paper. This is another example query; the label shows how many million tuples we are processing. "None" is the same as skipping. This is another example where we show what we predict and what we observe, and it shows that, versus restarting, in this case the best strategy was to uniformly do nothing, and our optimizer was able to select that strategy.
Another example, and this is also important: this is select, join, join, join. It's the same query shape with slightly lower selectivities, and suddenly the best strategy is to materialize. Given the same operators, change the input data and the best strategy changes; again, the optimizer is able to find that strategy. Although in this case there was somewhat higher network consumption -- actually, not this one. Our estimates were not as accurate, but still accurate enough to find the right plan.
We changed the query plan a little bit again: we added an aggregate operator, and here again the best strategy was to checkpoint, and we were able to find it. But this is the one where -- oh, that's right, this is the one where we're producing quite a lot of tuples, some operators were network-bound, and our cost model was not as accurate for network-bound operators; it's more accurate for CPU-bound operators. So our estimates were not as accurate, but we still found the right plan. Finally, this is where we have the hybrid plan -- actually, this is the same one I showed earlier, where we materialize, skip, skip, and aggregate at the end. But this one also shows our estimates.
>>: How do you model the number of failures [inaudible]?
>> Magdalena Balazinska: This is very important. How we model it: typically, from what's reported in papers, people often know that in their specific cluster there's a certain average number of failures per job, or an average mean time to failure, or some of those statistics. So what we do is say: given that I have a query, I can estimate roughly how long it will run for and how many nodes I want to use for it, and I will estimate how many failures to expect.
I expect this query to run with five failures, six failures, and we
optimize for the runtime with that number of failures.
>>: Do you find the strategy is sensitive to that [inaudible].
>> Magdalena Balazinska: Actually, no. We did a sensitivity analysis -- do I have that later? We have the exact results in the paper. We found that we were not sensitive to small errors in the runtime estimates or in the number of failures, and we were not sensitive to errors in performance estimates such as the cost of reading and writing data from disk. The one point where we were sensitive, of course, was cardinality estimation errors. Within 10 percent we can do quite well, but when we were off by more than 10 percent we started to choose the wrong plans, because we would think, hey, it's great to materialize, and then discover there's actually much more data. So that's the sensitivity.
And finally, to go back to the idea of user-defined operators: this is one example plan where we replaced the final aggregate with a user-defined operator. The best plan in this case, the hybrid one, had the last operator checkpoint, and that was the runtime we would get; but here we replaced that operator with a user-defined operator. What we assume about those is that they're unable to do anything in terms of fault tolerance. So now we have a query plan where we can pick fault-tolerance strategies for all operators, but some of them don't know anything about fault tolerance. We showed it's still useful: even with some of those operators, we picked different plans and you can still do better than some of the uniform strategies, such as materializing on behalf of the operator.
So this just goes back to the user-defined operator question. How much time do we have? 1:30? 2:30? We have time? So this is ongoing work. Like I said, this was interesting. What we noticed is that it is possible to use different fault-tolerance strategies in the same query plan, we can pick them automatically, and it does make a difference that's visible in the actual runtime.
The key challenge, though, is that we need specific models for the operators. And that's really painful, because it's hard to get accurate models -- in the one example that was really network-bound, our models were not that accurate -- and if I want to write new operators, it's hard to develop these models.
So where we are headed right now is to say: let me forget about these models. I'm going to start and run a query in parallel, and I'm going to use maybe some sort of fault-tolerance strategy like materializing all the output.
And as I run it, I'll simply observe what is going on. If I find that I really have some bottlenecks, I will try to add or remove some fault-tolerance points, maybe at runtime, that will allow me to do better.
So that's kind of what we're exploring right now.
So, in summary of the whole fault-tolerance work: we looked at the problem of running queries in parallel and whether it's worth it to materialize, checkpoint, and apply these different strategies. Some of our contributions were to say: if I have a parallel query, I can actually run the query and make it fault tolerant without blocking, I can pick and choose the technique, and I can do this automatically.
So that was kind of the main point of the whole fault tolerant work.
And where we're headed like I said is more towards the dynamic
scenarios.
So any questions at this point?
Yes?
>>: You said that when you learn the model's cardinality estimates were off, the plan you restart with would have the better cardinality. Do you have a chance to change the [inaudible] then, or --
>> Magdalena Balazinska: That's a good point. We actually do not change the shape of the plan at that point. But we could. This would also be interesting, especially because what happens is that if we crash and one node is restarting from a failure, no one else is doing any work. So we could take that opportunity to make some changes at that time, since the other nodes could be doing some other work while we are replaying.
So yes, if you're working in that space, that might be interesting. Yes?
>>: So in your analysis, did you find any patterns saying that some strategies work for some operators irrespective of the query -- for example, materialization is good for [inaudible] operators, or --
>> Magdalena Balazinska: So we actually did find some patterns. In particular, if I have a stateful operator, then it's often the case that for the operators right before it, the best strategy is frequently to materialize.
Or operators that have small state can definitely checkpoint. But we often found it's not about one specific operator; it's about combinations of pairs or sequences of operators, and also the input data sizes. In that case, yes, we did find some patterns: if I have a sequence with stateless operators followed by a big stateful operator, it often pays off to materialize the output of those stateless operators. But it's hard to make that general. These are things we see, and the cost-based model allows you to formalize that intuition.
Other questions?
Yes?
>>: The user-defined operator had a recovery strategy of none.
>> Magdalena Balazinska: That's right.
>>: If you put it in a Google-scale job that has an average of five crashes, would the job ever get done?
>> Magdalena Balazinska: It does, because we still restart that user-defined operator from the beginning, but we only restart that operator, because the operators around it can provide fault tolerance and things like replay, so we definitely don't have to reprocess the whole query.
>>: Any other questions?
>> Magdalena Balazinska: Actually, if you have a few minutes I want to give -- I don't know how many minutes we have, that's what I'm asking. Do we have time? Can we go over 2:30? I can't remember how much time we allocated; that's why I'm hesitating. I wanted to give you a broad overview of the type of research we're doing in our group besides fault tolerance; it's something I'd love to talk to people about off line. We're doing work along three broad aspects. So we
are on campus. There are a lot of scientists. We talk to them, and all of them have a lot of data. We find there are real challenges along three lines of work. First, how can we take someone like a scientist who is technically skilled and capable and wants to run complex machine learning, complex user-defined operations, and other analyses, and get them to do this efficiently?
And of course we can leverage cloud computing environments; this is our Nuage project, and what I talked about today is an example project from under this umbrella.
The second is to look at not only making things go fast, but also making them easier to use. We're not HCI people, but we're still looking at things such as: how can I help someone articulate a SQL query? Can I use past executions -- past usage history from the cluster -- to learn what kinds of queries people ask and help newcomers articulate their own queries?
Finally, the last one we're looking at, which is related to the Windows Azure data market, is to figure out how to price data. A lot of people have lots of data, and they all want to buy and sell this data online. What kind of database support can we provide to make it easier for them to articulate prices, reason about prices, price only a couple of views and automatically sell arbitrary queries, and so on?
Just to quickly give you some examples of projects in these different categories, let me start with a couple of projects in the efficiency category. So we worked with a lot of scientists.
And one thing we discovered time after time is that often when jobs execute, there's a huge skew. This is one example from a debug run at smaller scale: on the X axis I have time, and on the Y axis I have all the tasks I'm running in my cluster. There are different-colored tasks, but they don't matter here; what you can see is that the duration of all the tasks is very, very small, except for this one crazy task at the top which takes forever.
So we actually built a couple of systems that allow us to smooth out this kind of skew automatically. The top line is 1.5 hours, versus most of the other lines in this job being on the order of a couple of minutes. We built a system called SkewReduce -- the details are in a SoCC 2010 paper -- which is also an optimizer: it takes a sample of the input data, cost functions describing the actual processing we want to do, and information about the cluster, and figures out how to partition and allocate the data in a way that automatically gets rid of this type of skew. What is interesting is that on real scientific workloads, we found runtime improvements of a factor of 6 to 8 compared to a standard MapReduce implementation. So this is really significant.
But that was specific to one type of application that we find in a science domain. We also have more recent skew work where we said: here I needed these cost functions; in the new work I don't want any extra information from the user. What we do is simply observe what is going on in a Hadoop cluster, notice when skew is occurring, and find a good way to reallocate the data in a manner that introduces low overhead and looks ahead to optimize for total runtime. Without knowing anything about the application, we can reduce runtimes by a factor of four.
So this is significant. These are, you know, large factors. Real
systems that are actually built on top of Hadoop, publicly available
now. Actually my student who worked on this is going to join Microsoft
in September, on the Bing search team. So you guys will get those
skills now.
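The runtime idea above, reduced to a cartoon (this is only a sketch of the general approach, not the actual system's algorithm; all names and numbers are invented):

```python
def mitigate_skew(remaining_work, idle_workers):
    """Find the straggler task and split its remaining input evenly across
    itself plus the idle workers."""
    straggler = max(remaining_work, key=remaining_work.get)
    share = remaining_work[straggler] / (idle_workers + 1)  # straggler keeps one share
    return straggler, share

# one task holds 90 of 100 remaining tuples; two workers sit idle
task, share = mitigate_skew({'t1': 90, 't2': 5, 't3': 5}, idle_workers=2)
```

The point is that nothing here needs application-specific cost functions: observing remaining work is enough to decide what to repartition.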
We also looked at iterative processing. A lot of people want to use Hadoop clusters, but they need to run machine learning and PageRank-style algorithms that iterate. By simply being smart about caching and scheduling, we can cut runtimes in half. This is also a system, built at UW, that's now publicly available.
We're also collaborating with the [inaudible] team; the goal here is to build a parallel system that is different because it operates on multi-dimensional arrays. Scientists often have to simulate the universe or the atmosphere, and they operate on these multi-dimensional arrays. It's a new system with some 500,000 downloads. It's not only our work; it's joint work with MIT, Brown University, Portland State University, and so on.
So those are examples from this category. In terms of making big data analytics easier, I mentioned that we do different things. We looked at SQL auto-complete: the idea is that as a user types a SQL query, the user can ask for recommendations. What we do is say: okay, the user has put certain tables and certain predicates in the query. We look at that query, we look at past usage of the cluster, ask who issued similar types of queries and what other predicates and tables they used, and we recommend the most popular of those. We tested this on the Sloan [inaudible] survey query log and had very good results in terms of the quality of the recommendations.
We also built another tool that allows people to browse through queries, but I'm not going to describe it here. And we looked at other things that try to help people understand their cluster better once they have articulated their query and run it in a cluster. For example, we looked at better progress estimates that give people a better sense of how long their queries are going to run.
One of our key contributions was to say: well, failures will occur and skew will occur. Instead of giving one best guess, we give people ranges, where we try to make the range as small as possible so that it stays useful: this is the best-case runtime estimate, but assuming a single worst-case failure, or if skew at the following operator is quite likely, then we also give users a range. So they can have a better sense of how long their queries will take.
Most recently we actually built a system that I'm pretty proud of. What we can do is run a Hadoop job and then ask questions such as: why did my Hadoop job just run slower than this other job, even though it processed less data and it was running using the same number of instances?
And our system uses machine learning to then say: well, because between these two runs someone, for example, changed a cluster configuration and increased the block size, and therefore you're perhaps not actually using all those machines.
So this is something that will appear in PVLDB. Those are three examples from this area; again, I'm really happy to talk to you about this off line. Finally, the pricing: this is the most recent project, and it's actually quite fun.
We're looking at this from several perspectives. The first observation is that people buy and sell their data online. Currently, if I use something like the Windows Azure data market, I can price my data, but I can only price it using very simple selection queries: I define a view, people enter parameters into this view, and they pay based on the number of output records they get -- typically a monthly payment based on the total amount of data they will be seeing.
What we asked was: well, if I offer a certain number of views, then when my customers come, they have to pick one of those views to satisfy their needs. But if none of the views is a good fit, they have to buy some superset, and if they're not willing to pay for the superset, they might not be happy.
What we did is actually build -- this is more on the theoretical side -- we developed theory that allows the seller to specify prices on only some views, and we use these price points to automatically derive the price of any arbitrary query that the buyer wants to purchase. We basically figure out automatically, given the query the user is interested in, the minimum number of views they have to purchase, and in what combination, to get the best price. And we can do this automatically for a pretty large class of queries.
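As a toy model of that search -- brute-force weighted set cover over invented views and prices; the actual theory handles arbitrary selection queries and guarantees consistent prices:

```python
from itertools import combinations

# view -> (data it covers, seller's price); all invented for illustration
VIEWS = {
    'v1': ({'region', 'sales'}, 10),
    'v2': ({'sales', 'year'}, 8),
    'v3': ({'region', 'sales', 'year'}, 25),
}

def cheapest_bundle(needed):
    """Cheapest combination of views that together cover the buyer's query."""
    best = None
    for r in range(1, len(VIEWS) + 1):
        for combo in combinations(VIEWS, r):
            covered = set().union(*(VIEWS[v][0] for v in combo))
            price = sum(VIEWS[v][1] for v in combo)
            if needed <= covered and (best is None or price < best[1]):
                best = (combo, price)
    return best

bundle = cheapest_bundle({'region', 'sales', 'year'})
```

Here buying v1 and v2 together is cheaper than the single all-covering view v3, which is exactly the kind of combination the buyer shouldn't have to discover by hand.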
And this kind of pricing goes even further than pricing data. We're looking at things such as: if I'm going to buy and sell data, I also want to protect it. I might sell the data and say: you're allowed to use this data, but you're not allowed to join it with other datasets. So we're now looking at building databases that help us enforce digital rights on the data without too much overhead.
Or if I'm going to use cloud computing resources, many people share the
same cloud. I need a way to figure out what optimizations to
implement, how to price these optimizations, how to allocate the cost,
how to give it to the users, and we also have automated techniques to
do this.
So, in conclusion, that's a broad scope of the kind of work we do in the database group at UW on the systems side, related to big data analysis, cloud computing, and data pricing.
So, okay, at this point that's it.
Thank you very much.
>> Surajit Chaudhuri: Any final questions? I think we had a fair number during the talk.
[applause]
>> Magdalena Balazinska: Good. All right. Thank you.