>> Rich Draves: All right. So pleased to see a big crowd. Here to welcome
Matei from Berkeley. So Matei, of course, is a candidate, interviewing with a
number of different groups here today and tomorrow at Microsoft. Matei is
well-known for his work in sort of big data or cloud computing or cluster
computing, depending on how you look at it. We're building systems, ranging
from ESOS to spark and systems which have actually achieved, already achieved a
significant adoption industry, which is unusual for many graduate students.
Matei has also had a lot of success on the academic side. Last year, he was
fortunate to win two different best paper awards, Sigcomm and NSDI. His
interests are wide-ranging. In addition to the sort of big data, cloud
computing area, for example, he's also worked with folks here at Microsoft in
things like gene sequencing, made some significant advances there.
So let's welcome Matei.
Look forward to a good talk.
>> Matei Zaharia: Thanks for the introduction, Rich. So I'm going to talk
today about big data analytics, and basically bringing that to new types of
applications that need to process data faster than before.
And so feel free to ask questions and stuff throughout the talk. I think I
don't need to talk really about the big data problem at Microsoft. Everyone
here is probably very familiar with it. But basically, the problem is in a lot
of domains, not just web applications, but also things now like scientific
instruments or gene sequencing or things like that, data is growing faster than
computation speeds.
And basically what we have are these data sources that keep producing it.
Either you have many more users using your application on mobile devices or on
the web, or you have scientific instruments like gene sequencing instruments or
telescopes that are speeding up faster than [indiscernible] allow. And you
have cheap storage so you don't ever throw away the data. You just buy more
disks and store it. But you have stalling clock rates and it's getting harder
and harder to work with it.
So as a result, people are now running on a very different type of
infrastructure. They're running applications on these very large clusters.
And that's kind of the only way to actually deal with this data. And because
of that, people are also adopting a new class of systems to work with these
clusters. So just as an example, Hadoop MapReduce, the open source
implementation of MapReduce started out at a bunch of web companies, but it's
now used at places like Visa and Bank of America and a whole bunch of places in
more traditional enterprises. And it's a growing market. It's projected to
reach $1 billion by 2016.
So for people doing research in systems, this space is both exciting and
challenging. So first of all, it's exciting because these large clusters are a
new kind of hardware platform with new requirements and basically whole new
software stack is emerging for them. And there's a chance to actually
influence this software stack.
Often in most systems, maybe less so at a company like Microsoft, but in
general in many areas of systems, it's hard to go from, you know, doing some
research to seeing it actually being out there and seeing what happens if
people try to use it, how it actually works in practice.
But in this space, this is a space where people are adopting new algorithms,
new systems, even new programming languages to deal with this kind of data. So
as a researcher, there's a chance I can actually get things tried out and see how
they work.
At the same time, though, it's also challenging. This is because apart from
just the large scale of these systems, which has its own problems, the demands
on them are growing. So in particular, users have growing demands about
performance, flexibility, and availability.
By performance, I just mean they want to get answers faster, even though the
amount of data is growing, or they want to move from a batch computing model
to streaming and closer to real time. By flexibility, I just mean more types
of applications that they want to run in parallel with different requirements.
And by availability, I mean basically high availability. So when you're
running MapReduce every night to build a web index, it's okay if you miss it
one night. But when you're running it to do fraud detection in close to real
time and that breaks, you're actually losing money.
So people want all these things out of big data systems. So my work has been on a
software stack that addresses two of the core problems in this field. And
these are programming models and multi-tenancy. So for programming models,
there's been a lot of work out there on batch systems, which are really great
for making this data accessible. But users also wanted to support interactive
[indiscernible], complex applications, things like machine learning algorithms
and streaming computation. And these are the programming models I worked on.
For multi-tenancy, one of the major problems that happens when you have these
clusters is they're larger and they're shared across many users. So many of
the problems that happen in a traditional operating system, with multi-tenancy,
you know, kind of the mainframe type operating system are happening again here
and you need algorithms to share these clusters efficiently between users.
So this is the stack of systems I worked on. Basically, at the top, the things
in blue are parallel execution engines. I'm going to go into more detail, but
Spark is the underlying engine. And on top of that, we built a streaming
engine, Spark Streaming, and also something called Shark, which does SQL.
In the middle, Mesos and Orchestra are two systems for resource sharing. Mesos
is for sharing resources on the machines, like CPU and memory. Orchestra is for
sharing the network among parallel applications. And the ones at the bottom,
these are a bunch of scheduling algorithms I worked on that tackle different
problems that happen in these datacenters, like fairness for multiple resource
types, or data locality or stragglers. And these are both about determining
the right policy and also coming up with efficient algorithms to deal with
these things.
So the work at the top addresses the first problem of programming models. The
work at the bottom is about multi-tenancy.
In this talk, I'm going to focus mostly on the top part, but then I'm going to
also come back at the end and talk a little about one problem here, because I
think there are some cool problems with algorithms and policies as well.
So let me just start with some really basic background on this. I think people
here probably mostly know this. But basically, when you're running in these
large datacenters, there are really two things that make it hard and that are
different from the previous parallel environments people have considered. And
these are failures and stragglers.
So the problem with failures is that anything that can go wrong, you know,
fairly rarely on a single machine will start happening a lot more often on
a thousand machines or ten thousand. So if you have a server, for example, the
mean time between failures on a typical server might be three years, and you
put a thousand of those, now your mean time between failures is a day. And if
you put, you know, 10,000, something's going to fail every couple of hours. So
that's one problem.
Stragglers are actually an even more common thing, which is a machine hasn't
just outright failed but for some reason it's slow. Maybe there's a component
that's dying, but it's not actually failed yet, like a disk, and it's reading
really slowly. Maybe there's contention with other processes. Maybe there's a
bug in the operating system or the application.
And the problem is you're doing this parallel computation on a thousand nodes,
but if one of them is slow and everyone waits for that, you know, you're losing
all the benefits of that.
So Google's MapReduce is one of the first systems to handle this automatically.
And the thing that was interesting about MapReduce was not really the
programming model, but just the fact that it did these things automatically.
And basically, you know, if you haven't seen it, the point of MapReduce is just
that there are these two phases of map and reduce tasks. You read from a
replicated file system, and you build up this graph of processes that talk to
each other. And if any of them goes away, the system knows how to launch a new
copy of that and splice it into the graph. Or if one of them is slow, it can
launch a second copy and splice it in.
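[Editor's note: as a concrete illustration of the two-phase model being described, here is a minimal word-count sketch written with plain Scala collections rather than an actual MapReduce framework; the data is made up, and a real Hadoop or Google MapReduce job would use that framework's mapper and reducer APIs instead.]

    // Map phase: emit (word, 1) pairs from each input line.
    val lines = Seq("to be or not to be", "to do or not to do")
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // Shuffle + reduce phase: group by word and sum the counts.
    val counts = mapped.groupBy(_._1).map { case (w, ones) => (w, ones.map(_._2).sum) }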
So MapReduce was really great for batch computations, but it's only this one
pass batch computing model. And what I found in talking to users is that users who
put their data into these systems very quickly need it to do more. And in particular,
users need it to do three things. They need it to run more complex algorithms
that are multi-pass. Things like machine learning or graph computation that go
over the data multiple times.
They need it to do more interactive queries. So it's great that you can, you
know, sort of cull the web and build an index in a few hours every night, but
now if I have a new question, can you answer that question in two seconds or do
I have to wait two hours again, or do I have to build something? They also
want it to do real time stream processing. So you build, say, a spam detection
classifier and you train that every night; now can you train it, you know, in
real time as new spam messages come up? Move that application into something
you can run in real time.
So one reaction to these needs is that people build specialized models for some
of these applications. So, for example, Google's Pregel is a model for graph
processing. There have been a lot of iterative MapReduce systems. There's a system
from Twitter called Storm that's very popular for streaming.
But there are two problems with this. First of all, these specialized systems
only cover one use case at a time. So if you have a use case that is still a
complex application, but maybe it's not graph processing, the systems out there
might not be good for it. And the second one is that even if you have models
for all the things you want to do, it's hard to compose them into a single
application. And in sort of the real world, many users want to start by doing a
SQL-like query. Now you want a graph algorithm on the result. Now you want a
MapReduce or, sorry, machine learning
algorithm on the result of that. And with separate systems, it becomes hard.
So our observation behind all this work is that these complex, streaming and
interactive apps actually all have a common need, and there's one thing they
need that MapReduce lacks, and that thing is efficient primitives for data
sharing. So these are all applications that actually perform data sharing
between different parallel steps, and that's the thing that would make them
work better.
I'll show a couple of examples.
This one here is an iterative algorithm, and this is a common pattern in many
algorithms. You can take that. That's actually my coat. Yeah. So okay. So
yeah, so this is a common thing. For example, if you imagine something like
page rank, page rank is basically a sequence of MapReduce jobs. If you run
this on something like just Google's MapReduce or Hadoop, the problem is that
between each job, you're storing your state in the distributed file system.
And just reading and writing to the file system is slow, because of data
replication across the network and also because of disk IO. So it just makes
it slow.
Another case is interactive queries. So interactive queries, you often select
some subset of the input that you're going to ask a bunch of questions about,
and so all these queries share a common data source. And I'm going to come
back to streaming later, but streaming also involves a lot of state sharing as
well, because you maintain state across time in the computation.
So these things, if you just run them with MapReduce and sort of the Google
like stack, they're slow because of the replication and disk IO that happens in
the storage system. But those two aspects are also necessary for fault
tolerance. So that's why the file system replicates data.
>>: [indiscernible] isn't a constant. You see systems that use single
replicated files.
>> Matei Zaharia: Yeah, that is true. You can have intermediate files that
are single replicated, but if you have -- so, for example, in this case, it's
hard to do that because you don't even know which queries you're going to ask
in the future. So if you have a thing like [indiscernible] where you submit a
whole graph at once, you can do it. But here, the abstraction that you see as
a user is just, you know, I can make files and then I can run MapReduce on
them.
>>: So you're saying the system doesn't know if it's a [indiscernible] file or
not?
>> Matei Zaharia: Yeah, exactly. It's just that there's no -- yeah, there's
no explicit abstraction across parallel jobs, yeah, for data sharing.
>>:
Okay.
>> Matei Zaharia: But what I'll talk about is definitely, you know, based on
what people do when they do know the future graph. Okay. So our goal with
this was to -- can we do this sharing at the speed of memory, and the reason to
do it is really simple, is because memory is easily 10 to 100 times faster than
the network or the disk. And if you think about it, even if you have a very
fast, full bisection kind of network, say you have 10 gigabit or 20 gigabit
ethernet such as in [indiscernible] datacenter storage, that's actually still
about a factor of 20 or 30 slower than the memory bandwidth in a machine. In a
machine, you can easily get about 400 gigabits per second of memory bandwidth.
So that's why we wanted to do this at the speed of memory. But the challenge
there is how do we actually make it fault tolerant if we just said that the
disk and the network are things we can't push data over.
So there have been a bunch of existing storage systems that put data in memory,
but the problem is neither of -- none of them actually have this property of
doing everything at memory speed. And the reason why is because they're based
on this very general shared memory abstraction. So basically, these systems
give you abstraction of a mutable state that's sitting out there and that you
can do fine-grained operations on, like reads and writes to a cell in a table.
And these include things like databases, key value stores, ram cloud, file
systems, all these kinds of things. That's the abstraction they provide.
And these all require replicating the data or, you know, things like update
logs about what you did over the network for fault tolerance. And we just said
that replicating it is much slower than the speed of writing to memory.
So the problem we looked at, then, to deal with these is can we provide fault
tolerance without doing replication? And we came up with a solution to this
called resilient distributed data sets or RDDs. And basically, RDDs are a
restricted form of shared memory that makes this possible. So RDDs are
restricted in two ways. First of all, they're immutable once you create them.
So they're just partitioned collections of records; once you write them,
they're [indiscernible] immutable.
And second, you can only build them through coarse-grained, deterministic
operations. So instead of building these by reading and writing cells in a
table, you do something like apply a map function to a dataset or apply a
filter or do a join. And there's all kinds of operations you can do in that.
Now, what this enables is to do fault recovery using lineage instead of
replication. So instead of logging the data to another machine, we're going to
just log the operation we did on it, and then if something fails, we're going
to recompute just the lost partitions of the dataset.
So just to give you an example of what this looks like, here's some operations
you might do with RDDs. So maybe you start with an input file that's spread
across three blocks, three different machines. And maybe you start by doing a
map function. So you give a function F that you're going to apply to every
element.
So now you're going to build a dataset, this is an RDD. And basically, the
circles there are partitions and the whole thing, you know, is an RDD, is a
dataset. And this is not going to be replicated. There's just one partition
sitting on each machine.
You might then do, for example, a group-by. So [indiscernible] function G and
you do another operation with that function on this data. And we might do a
filter where you pass it to function H. This is how you've built your dataset.
You've done these parallel operations.

>>: [indiscernible] deterministic?
>> Matei Zaharia: Yeah, we do require them to be deterministic. That's an
assumption we're making, yeah. Okay. So that's what you get. And now, if
something goes missing, you can look at this dependency graph to rebuild
things. So, for example, if this guy goes missing here, we can rebuild it by
just applying H to this partition of the parent dataset. And we can get it
back.
Even if multiple chunks go missing, you can go ahead and build them again in a
topological order. The other thing this doesn't show but it actually matters a
lot, in practice, is in practice, on each machine, you're going to have many
different data partitions. And when a machine fails, you can rebuild the
different partitions in parallel. So the recovery process can often be a lot
faster than the initial process of computing this thing. And that's what makes
this recovery quick as well.
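[Editor's note: to make the graph just described concrete, here is a hedged sketch in Spark-style Scala; F, G, and H are placeholder functions invented for the example, and it assumes a SparkContext named sc. Recovery itself is done by the engine from the recorded lineage, not by user code.]

    def F(line: String): String = line.toUpperCase                   // stand-in for the map function F
    def G(rec: String): String = rec.take(1)                         // stand-in for the grouping key G
    def H(kv: (String, Iterable[String])): Boolean = kv._2.size > 1  // stand-in for the filter H

    val input   = sc.textFile("hdfs://...")   // input file spread across three blocks
    val mapped  = input.map(F)                // apply F to every element
    val grouped = mapped.groupBy(G)           // shuffle records by key G
    val result  = grouped.filter(H)           // keep only the groups passing H
    // None of these datasets is replicated. If a partition of `result` is lost,
    // the engine reapplies H (and upstream steps if needed) to just the parent
    // partitions required, in parallel across the surviving machines.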
So next question with this is how general is it? So we just said we're going
to limit shared memory to these coarse-grained operations. And we found out
that actually, despite the restrictions, RDDs can express a lot of different
parallel algorithms that people want to do in practice. And this is because
just by nature, data parallel algorithms, the algorithms apply the same
operation to many data items at the same time.
So this strategy of logging the one operation that you're going to do to, you
know, a billion items, instead of logging the billion results, makes a lot of
sense in that setting.
And, in fact, we showed that using RDDs, we can express and unify many of the
existing programming models out there. So we can express the data flow models
like MapReduce and Dryad that kind of build a single graph like this, but we
also found some of these specialized models people propose, such as Pregel or
PowerGraph from CMU or iterative MapReduce can be expressed using RDD
operations. By this I don't mean just kind of the incompleteness argument of
like we'll get the same result, but we're also expressing the same data
partitioning across nodes and controlling what's in memory and what isn't. So
it's really going to execute in the same way with the same optimizations.
And we also found that we could do new applications that some of these models
couldn't. So if you look at this in kind of a trade-off space of parallel
storage abstractions, this is what it would look like. So you can have the
trade-off between the granularity of updates the system allows and the write
throughput.
And basically, things like key-value stores, in-memory databases, RAMCloud, allow
very fine-grained updates, but their throughput is limited by network
bandwidth, because they replicate the data. Things like the Google File System
actually, they're not really designed for fine-grained updates. And despite
that, people have run a lot of algorithms on them because of the data parallel
nature of the algorithms. But they're still limited by network throughput.
And RDDs are instead limited by memory bandwidth.
>>: So one thing that this is not showing is the cost -- or the speed of
recovery, right?
>> Matei Zaharia: The speed of recovery, yes.

>>: So if you had [indiscernible] data, you would recover instantaneous?
>> Matei Zaharia: That is true. So here, there will be a cost to recovery.
I'll talk a bunch about speed of recovery later on, though, yeah.
>>: [indiscernible].
>> Matei Zaharia: So you don't have to have RDDs in memory if you don't want.
You can have them on disk. It's still -- or on SSDs, you know, and it will
still save you from sending stuff over the network. So in our system,
actually, the system's designed to spill gracefully to disk and to keep doing
sequential operations if you do that.
>>: I'm a little confused by [indiscernible] by keeping things, it seems like
you're [indiscernible] disk versus keeping things in memory. That's one thing,
versus doing things locally on a single machine versus reducing from many
machines.
>> Matei Zaharia: Yeah.

>>: So you're limited by the network, not by the speed.

>> Matei Zaharia: That is true, yes. You're right. It's actually, it's
mainly for the network that we're doing this. Yeah. I mean --
>>: Seems like that significantly changes the semantics of the computation.
In other words, you're saying it would be faster by not running a distributed
algorithm, just by running an algorithm that's --

>> Matei Zaharia: But it's not going to be local, so we still have operations
across nodes. If you go back to this guy, so [indiscernible] the group-by is an
operation across nodes. What I'm saying is just when you write this, like if you created
this dataset with a MapReduce, for example, you would be writing this to a
distributed file system, and then the next job, like say you didn't know you
want to do a group-by next. You save this result, you wrote it out to the file
system and then you come in and do a group-by.
>>: In the graphics in the next slide, you showed that the network bandwidth was
limiting where your [indiscernible] if you're doing a group-by or something --

>> Matei Zaharia: Then you are, that's absolutely right. This is after you've
got the data grouped the way you want when you actually are doing the write to
this storage system, yeah. So the applications, of course, so yeah, this is
just, you know, after they've computed the reduce function or whatever, they
are writing it out. Applications can definitely still be network bound if they
can communicate. Yeah.
>>: [indiscernible] it would be reconstructing a failed node. Seems like I'm
trying to understand what MapReduce does. Does MapReduce, as you play it out,
also --

>> Matei Zaharia: Yeah, yeah, it's similar. So basically, what we took is we
took that kind of reconstruction and put it in a storage abstraction. So it
persists across things -- across parallel jobs you do on that, rather than
being just on one job. But it's definitely inspired by the same thing.
And I think the cool thing is actually like how we do this with streaming,
which I'll show later, which is -- yeah.
>>: So for jobs that are actually network-bound, does this approach only
produce a latency? Seems like if a job is network-bound, you're reading to a
disk, then you get to a point where you're saturating the network, say.
>> Matei Zaharia: Yeah, it depends on how much intermediate data you have. So
if your network -- if you don't have a lot of intermediate data, you're just
doing a big shuffle, then this isn't going to matter. But we found in a lot of
jobs, even in things like page rank that do a significant amount of shuffling,
and have a lot of state, this can help. Yeah.
>>: So if I understand this correctly, where you're really saving on network
[indiscernible] for intermediate state is by not replicating it [indiscernible]
and over three times?
>> Matei Zaharia: Exactly, yeah. And yeah, and also that can be -- yeah, it
can be a significant fraction of the job running time, yeah.
>>: Can't you, with systems like Hadoop, control the [indiscernible] at the
end? You can say that you only want --

>> Matei Zaharia: You can definitely control it, but then the problem is if
something fails, you're lost. Like Hadoop doesn't keep track of oh, I did this
map function before you rebuild it. So what we're doing is pushing that
information in the storage abstraction, yeah.
>>: You're basically using some form of -- a form of logging to rebuild as
opposed to --

>> Matei Zaharia: Yes, exactly.

>>: And paying more for it [indiscernible].
>> Matei Zaharia: Yeah, yeah, you may have -- although it actually, well, it
depends on what you're computing. But if you compare it to the cost of like
having to always replicate the thing, that's a fixed cost. So yeah, depends a
lot on your failure assumptions, yeah.
>>: A really quick note. There's another metric there, which is failure,
things like mean time to failure. [indiscernible].
>> Matei Zaharia: So there are cases where you want to do replication instead.
Yeah, it's very true. And actually, I'll talk a bit about this in streaming
also. You can still combine this with replication sometimes if you want. One
of the cool things is you can also [indiscernible] replication asynchronously,
because now if you didn't do it right away, you have some way out to recover.
So there's different ways, yeah.
>>: What are the scenarios that you cannot recover from? Give us a scenario
where you cannot recover.
>> Matei Zaharia: Actually, we can lose all the nodes in our system. As long
as the input data on the original file system, like the ones I showed, you
know, this file here is still available, we can recompute everything. So it's
designed so any subset of the nodes can fail.
>>: So data [indiscernible] I see this as incremental computation.
>> Matei Zaharia: Uh-huh, yeah. There's a lot of -- it's definitely, it's
definitely inspired by lots of systems have done this kind of logging. I think
the thing -- I mean, honestly, the thing that's interesting about this is the
applications we applied it to. So when Google wrote the Pregel paper -- there's
a whole paper -- it says they actually don't even have this kind of
fine-grained recovery. They just take checkpoints and they say we're working
on a thing where we think we can do fine-grained recovery.
We implemented Pregel with fine-grained recovery in 200 lines of code. So it's
just, yeah.
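[Editor's note: a hedged sketch of why that can be compact -- one Pregel-style superstep can be written as ordinary RDD operations. The types, vertex program, and import path here are invented or modernized for illustration; this is not the actual 200-line implementation.]

    import org.apache.spark.rdd.RDD

    // vertices: (vertexId, state); messages: (vertexId, incoming value)
    def superstep(vertices: RDD[(Long, Double)],
                  messages: RDD[(Long, Double)]): RDD[(Long, Double)] = {
      vertices
        .leftOuterJoin(messages.reduceByKey(_ + _))                     // combine and deliver messages
        .mapValues { case (state, msg) => state + msg.getOrElse(0.0) }  // run the vertex program
    }
    // Each superstep is a deterministic, coarse-grained transformation, so a
    // lost partition of the new vertex RDD can be recomputed from its lineage.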
>>: But the obvious follow-on is once you have incremental computation, you
can use it for lots of other --

>> Matei Zaharia: That's true, yeah.

>>: So you can change part of the data and --
>> Matei Zaharia: Yeah, we actually haven't done that in our system, but it's
an interesting thing to try to do next, definitely, yeah.
>>: You can't do that [indiscernible].
>> Matei Zaharia: In databases, there are lots of things they can do, okay.
You should never tell database people they can't do something. That's a lesson
I've learned, with all due respect.
Okay. So that's kind of the abstraction. Let me also tell you a little bit
about the system, and then I'll go to some of the things we did next with it.
So we built this system called Spark that implements this, and I just wanted to
show a little bit of how it works. Basically, so Spark exposes RDDs through
this nice and simple interface in the Scala language, which is kind of Java
with functional programming. As Bill describes it, it's kind of like C-Sharp.
So there it is. It's like C-Sharp with stranger syntax. And we didn't -- I'm
not saying we invented this model. So the model is very much inspired by the
API of DryadLINQ, but it lets you write applications in a very concise way.
And one of the cool things we did that I think is unique to our system is we
also allow you to use it interactively from the Scala shell, and it makes for
like, you know, it makes it very easy to explore data.
So this is, you know, kind of some of the syntax where basically, you create
your dataset, you apply transformations like filter, and this funny looking
stuff in red here is Scala's syntax for a function literal or closure, so it's
like lambda X [indiscernible] and then you can keep doing operations on it and
keep building a lineage graph and computing things.
So I wanted to show you this on an actual learning system just so you can see
the kind of things it does. So basically, in this -- I've set up Spark cluster
on Amazon EC2. Let's check that everything is still there. It is. Okay. And
I have 20 nodes and I have a Wikipedia dataset I loaded on this. It's just a
plain text dump of all of Wikipedia, that's 60 gigabytes. So it's not huge,
but it's a thing that would take a while to actually look at on a single
machine.
And I'm just going to show you how you can use this interactively to do things.
So this is the Spark shell. You can do your standard Scala stuff in there.
And you have this special variable, SC, or Spark [indiscernible] that lets you
access the cluster functionality.
So first thing I'm going to do is represent the text file I have sitting in the
Hadoop file system. And this is going to give us back an RDD of strings so
it's a distributed collection of strings. And so we can actually like start
looking at it even without doing stuff in parallel. So there's a few
operations you can do that will just peek at the beginning of the file. So if
I do file dot first, that gives me the first string, and you can see what the
format is like.
So this is a tab separated file. You have article ID. You have the title, and
yet is maybe the first thing alphabetically in this Wikipedia. You have date
modified. You have an XML version, and you don't see the last field, but
there's a plain text field at the end as well. So what you can do is you can
take this and convert it into a form that's easier to work with.
So, for example, I'm going to define a class to represent articles and I'm
going to pull out the title and the text from it. And now I'm going to do some
map functions to turn these lines of text into article objects. So first I'm
going to take [indiscernible] and split it by tabs, and that syntax again is
the same as doing this. So it's like a lambda syntax, basically, it's the
shorthand form.
And then I'm going to filter. So some of these things actually don't have the
last field, the plain text, because they're things like images, so I'm going to
just filter out the ones with exactly five fields, and I'm going to map, I have
this array of fields. New article. And I'll take F0. Say F1 is the title,
and F4. You can't see that, but it exists somewhere there.
So now I have this article object. So all these things happen lazily. It
doesn't actually compute it until it needs to. But I can do stuff like this to
see articles, you know, first article is still and yet and yet. So the last
thing I'm going to do is tell it I want the articles to persist in memory
across the cluster. You can choose which data sits in memory, which one is
just computed ephemerally as you go along. So I'll mark it that way.
So now I'm going to do a question on the whole dataset, and I'm going to count,
for example, how many of these contain Berkeley. And so actually, this needs
to be that the text contains Berkeley, the plain text of the article. And so now
it's actually submitting these tasks to the cluster, and it's going to go on
HDFS. So scheduling the tasks according to where the data is placed and doing
that stuff.
And it goes along, so basically the class article I typed in, the functions I
typed in get shipped to the worker nodes and they got [indiscernible] and this
is kind of the straggler problem, but hopefully it will finish. Yeah, there
you go. So there it is. It's live, it's happening on Amazon. I'm sure the
Microsoft cloud never has stragglers.
Okay. So we scanned this thing and there were 15,000 articles. But it took 27
seconds. Not exactly interactive. So let's try to do it again now. And now
the data will be in memory, because we called persist on it. So if we do it again,
we get back the same thing in 0.6 seconds. And we can ask other questions
now.
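[Editor's note: for reference, a hedged reconstruction of roughly what was typed in the shell; the class name, field positions, and file path are inferred from the narration and may not match the demo exactly.]

    case class Article(id: String, title: String, text: String)

    val file = sc.textFile("hdfs://.../wikipedia.txt")        // tab-separated dump, ~60 GB
    file.first                                                 // peek at the first line

    val articles = file.map(line => line.split("\t"))
                       .filter(fields => fields.length == 5)   // drop rows missing the plain text
                       .map(fields => Article(fields(0), fields(1), fields(4)))
    articles.persist()                                         // keep the parsed articles in memory

    articles.filter(a => a.text.contains("Berkeley")).count()  // first run scans HDFS: ~27 s
    articles.filter(a => a.text.contains("Berkeley")).count()  // second run hits memory: ~0.6 s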
So, for example, the one I like to ask is Stanford. So Berkeley was 15,000.
Let's try this one. And this is only 13,000. There you go. Okay. And so now
I hope no one's from Stanford. Last thing I want to show, this is sort of the
risky part of the demo. So we have 20 nodes in this cluster. So let's try to
get rid of one of them.
So these are the ones. I'm just going to pick a random one, and see a weird
Firefox menu and just kill it. So there you go. So it takes a little bit of
time to shut down, but once we look at it here, eventually it will drop out.
So you can see now there are only 19. And you can see this guy was also
notified that it's lost, and we lost this one. So let's try to do this again
and see whether we get the same answer. And now, at the end, you know, there
are a few -- yeah so you can see at the end it went kind of quickly, but there
are a few -- like the last 30 tasks or whatever on that node were lost, and
those were rebuilt across the cluster.
So this is where I'm saying you can recover pretty quickly, even if a couple of
failures happen, because you do this in parallel. So that's kind of it. And,
of course, now that it's actually in memory again so if we do this again, it's
back to its usual self.
So that's kind of what the system looks like. Okay. So let me see. So apart
from doing kind of searching of Wikipedia, this is also good for things like
machine learning algorithms. So we took a couple of, you know, very simple
algorithms that we run on this, and basically these iterative algorithms
are running a bunch of MapReduce jobs on the same data, and if you share that data
using RDDs, you can go a lot faster.
So depending on how much computing it does -- a bit more computing, that was 30
times faster and this is about a hundred times faster. And other people have
built in memory engines for these algorithms, Piccolo is one, but most of the
engines out there either don't provide fault tolerance or do it using
checkpointing, where you have to periodically save your state out, and that
costs something. So we got similar speedups to what they got, but we have this
fine-grained fault tolerance model as well.
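[Editor's note: the iterative pattern being described is the shape of Spark's own logistic regression example; below is a hedged sketch of it, with an invented data path, parser, and a fixed number of iterations.]

    case class Point(x: Array[Double], y: Double)
    def parse(line: String): Point = {
      val nums = line.split(' ').map(_.toDouble)
      Point(nums.tail, nums.head)
    }
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (u, v) => u * v }.sum

    val points = sc.textFile("hdfs://.../points").map(parse).persist()  // cached once, reused every pass
    var w = Array.fill(points.first().x.length)(0.0)
    for (_ <- 1 to 10) {
      // Each iteration is one MapReduce over the same in-memory data.
      val gradient = points.map { p =>
        val s = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }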
>>: Do you have some results where you have failures during the run?

>> Matei Zaharia: No. That would actually be cool, yeah. I mean, we did -- in
our paper, we have some results with failures. And the same thing happens.
Like the iteration or something fails, takes longer to recover. But I don't
have them on the slide here.
>>: So there's a dissonance between this slide and part of the motivation for
your talk. When you were reviewing your motivation, you were saying that
computation speed, CPU speeds are slowing down and can't keep up with big data.
What this slide would seem to say is the CPU speeds are just fine. It's the
communication and the algorithm that --

>> Matei Zaharia: That's true, yeah. I guess what I meant to say there is
the capabilities of a single machine, yeah. But it's true, many of these
things are not CPU bound. Actually, the thing that is really causing clusters
to become bigger is actually disk bandwidth. Disk bandwidth hasn't gotten very
fast, and so you can buy these disks, they're huge, but it takes like, you
know, to read a terabyte off a disk, it will take you many hours. So you need
to put thousands of disks in parallel. That's actually the thing I think that
really causes this.
So that's kind of the system, and one other thing I want to say is as I said at
the beginning, we wanted to show this as pretty general so we implemented a
bunch of these other models on it as well that people have proposed. We have
these iterative ones that we implemented. Graphlab, if you're familiar with
it, we can only do the synchronous version because that version is
deterministic. And another cool thing we implemented, which actually appears at
this year's SIGMOD, is a SQL engine called Shark. And the story there is, at least in
the database community, there was this kind of debate between databases and
MapReduce. People thought that, okay, well, MapReduce adds fault tolerance
during query execution. Most parallel databases don't have that. But the cost
of the fault tolerance is so high that it's not worth it.
So Shark actually gets similar speedups over Hadoop to what the parallel databases
do. So it can run these queries ten, a hundred times faster, and it
simultaneously has the fault tolerance that you saw before. And the thing
about this also is it's not just a matter of saying, you know, our thing is
more general. It also means applications can now intermix these models.
So, for example, one of the things we're doing in Shark is letting you call
into machine learning algorithms that are written in Spark, and data never has to
be written to some intermediate file system in between. It just runs in the
same engine.
And the final thing, we've been lucky in doing this to also have a growing community
of actual users. So we open sourced Spark in 2010 and in the past few years,
we've really seen a lot of growth in who's doing things with it. So just some
quick stats on that. We held a training camp on Spark in August, and 3,000
people watched online to learn how to use it. We have an in-person meetup in
the Bay Area and we have over 500 members that come there. And in the past
year, 14 companies have contributed code to Spark.
There's some of the companies and universities that have done things with it at
the bottom. So you can find more about that on the website.
>>: Stanford doesn't seem --

>> Matei Zaharia: Yeah, I've shown the demo to some Stanford people in the
past. So that was kind of the Spark part. I also want to talk a little bit --
so that covered the interactive queries and iterative algorithms.
I also want to talk about streaming and this is a systems bit that we did next.
I think we're actually still working on it a bit now. So the question here was
just how do we perform fault tolerant streaming computation at scale. And the
motivation for this is that a lot of big data applications we have today
receive data in real time and see sort of real value from acting on it quickly.
So things like fraud detection, spam filtering, even understanding statistics
about what's happening on a website after you make a change to it or you launch
an ad campaign. And for example, Twitter and Google have hundreds of nodes
that are doing streaming computations in various ways to try to deal with this
data.
So our goal was to look at applications with latency needs between half a
second and two seconds. So we're not looking at, like, millisecond quantitative
trading stuff, but we think this is still pretty good. But we want to be able
to run them on hundreds of nodes.
The problem, though, is that stream processing at scale is pretty hard. It's
harder than batch processing, because these issues of failures and stragglers
can really break the application. So the fault recovery, fast fault recovery
was kind of a nice thing to have in the interactive case. But here, if you
don't do it quickly, you might just fall behind and you've suddenly lost the
whole point of doing a real time computation.
Same thing with stragglers. If a node goes slowly, rather than making a joke
about it in the talk, you're now seven seconds behind where you were supposed
to be in this stream. So there's been a lot of work on streaming systems, but
traditional streaming system designs don't deal well with these problems.
So traditional streaming systems are based on this -- we're calling it the continuous
processing model, and it's a very natural one. But it becomes tricky to scale.
So in this model, you have a bunch of nodes in a graph, and each node has a
long-lived mutable state. And for each record, you update your state and you
push out new records to the other nodes.
So state, by the way, is the main thing that makes streaming tricky. So an
example of state is you want to count clicks by URL to see what percent clicked. So
you have this big table, maybe you partition it across nodes, and everyone keeps
track of counts for a slice of the URLs. And this is how these systems are
set up.
So when you have this model and you want to add fault tolerance, there's two
ways that people have explored, replication and upstream backup. So the most
common one that's done in basically most of the parallel database work and
systems like Borealis and Flux, is replication.
In replication, you send a copy of the input to -- you send the input to two
copies of the processing graph, and each copy does the message passing and
state updating in parallel. There's also a subtle thing in replication,
though, which is that you need to synchronize the copies. And that's due to
non-determinism in a message [indiscernible] across the network.
So, for example, in this one, imagine node one and node two are both sending a
message to node three. Now, at roughly the same time. Now, which of those
gets there first will depend on what happens on the network. But node 3's
state might be different if it got this one before that one. So the copy here
of node 3 needs to know that. And so these protocols, Borealis and Flux, do a
lot of fairly complicated stuff to actually keep these in synch and keep them
in synch even if a node fails and another one comes back and stuff like that.
But even discounting the cost of that, you know, basically, this model gives
you fast recovery from faults, near instantaneous, but you pay at least 2X the
hardware cost.
Okay. The upstream backup model is another one that's been proposed. In that
one, you don't have extra copies. Instead, nodes checkpoint periodically and
they buffer the messages they send since the checkpoint. If a node fails, you have
to bring up another copy of it and splice it into the graph. And this model
has less hardware cost. But it's also slower to recover. In particular, at
high load, the new node needs to not just recover from a checkpoint but also
keep up with the arriving stream. So it can take a pretty long time to
actually catch up with the rest of the system.
And a bigger problem is that neither of these approaches handles stragglers
very well. So in the replication approach, because of the need to keep the
replicas in synch, if one of the nodes is slow, you end up slowing both replicas.
And in the upstream backup approach, you don't really have anything you could
do, except maybe treat the slow node as a failure and then it's expensive to
recover.
So we wanted to design a streaming system that met a set of fairly ambitious
goals to actually be able to do this at scale. So we wanted to have a system
that can scale to hundreds of nodes and has minimal cost beyond just the basic
processing. So no 2X replication or anything like that.
We wanted to tolerate both crashes and stragglers. And we wanted to be able to
attain sub-second latency and sub-second fault recovery.
So the way we did this is by starting with an observation about the batch
processing models, like MapReduce and Spark. So these models actually manage
to provide fault tolerance in a very efficient way without replicating a lot of
stuff because of this deterministic recomputation that we saw. So they divide
the work into small, deterministic tasks, and then if a node fails, they rerun
those in parallel on other nodes.
So our idea was can we just run streaming computations as a series of very
short but deterministic batch-like jobs and then we can apply the same recovery
models but at a smaller time scale. So, you know, our tasks, instead of being
several seconds in length, might be like several hundred milliseconds.
And so it kind of becomes this system optimization problem of just making a
system that does that quickly. And to store this state between time steps,
we're going to use RDDs, which we came up with as a way to store state in
memory.
So that's what we ended up doing, and we call this model discretized stream
processing. And basically, the idea of the model is we'll divide time into
small steps, and the data that arrives in each step is put into a dataset and
basically it's an immutable dataset. And the input data, we have to store
reliably, because if we lose that, we can't go back and recompute stuff. So
this will be replicated. But you probably want to store that data anyway.
After that, you do a batch operation that's deterministic, and you produce new
datasets, and these can either be output or they can be state that you're
going to use on your next time step. And these are stored in memory,
unreplicated, as RDDs, and we can reconstruct them with lineage.
On the next time step, you take your new data, you take your old state, and you
do a MapReduce like computation.
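[Editor's note: a hedged sketch of this model using the Spark Streaming API that grew out of this work; the operator names shown are from the released system, and the source and host names are invented.]

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))         // one-second batches
    ssc.checkpoint("hdfs://.../checkpoints")               // periodic, asynchronous state checkpoints

    val pageViews = ssc.socketTextStream("collector-host", 9999)   // input is replicated on arrival
    val counts = pageViews.map(url => (url, 1L))
      .updateStateByKey[Long] { (newHits: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newHits.sum)            // old state RDD + new batch = new state RDD
      }
    counts.print()
    ssc.start()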
>>: Seems to me like this state either becomes -- if you lose this state, you
have to reconstruct it, you really haven't gained anything, because you're
either going to have to go back -- suppose it's a one-week window. You're
going to have to construct it from a week's worth of activity, or you'll have
had to squirrel it away somewhere so you can reload it.
>> Matei Zaharia: Yeah. So the way you do it is you do periodic
checkpointing. But the key is you don't need to checkpoint everything. Like,
for example, say you're doing this every second and then every ten seconds, you
store that dataset reliably. You just like asynchronously write it out to
another copy. So that's what we're going to do.

>>: Using the checkpoint to truncate --

>> Matei Zaharia: Yeah, to truncate the lineage.

>>: Is that one of the strategies that you talked about earlier?
>> Matei Zaharia: Yeah, it is, but the difference -- yeah, I mean, the
difference is you don't hold back the whole application when it fails. So
checkpointing in the parallel computing systems usually means if a node fails,
I just kill everything to recover and I go back to the previous
checkpoint. Here, I just recompute the lost stuff and that can be
significantly faster.
>>: So it's almost like you're applying the technique from a streaming setting
[indiscernible].
>> Matei Zaharia: Yeah.
>>: Okay. I'm sorry. I misunderstood. I thought you were doing something
different than streaming recovery. But you're applying streaming recovery to a
batch setting in order to improve the batch recovery?
>> Matei Zaharia: You mean I'm applying the batch recovery to a streaming
setting?

>>: No, the other way, right?

>> Matei Zaharia: I'm not doing -- you mean the application way. Yeah,
well --
>>: You're basically doing a checkpoint to truncate how far back in the
logging --

>> Matei Zaharia: Yeah, absolutely. But checkpoint is different from what I
showed as replication, because checkpointing just means -- so the replication
approach has to keep the state in synch, had to do this synchronization
protocol. Checkpointing just means I have a copy in memory, I'm going to also
send it to another guy and maybe I'm also going to write it to disk. So it's
an asynchronous thing and doesn't require them to be in any way, you know --
yeah.
>>: I had a quick question, it's sort of [indiscernible] the size of the
dataset is going to be for iteration. So can you do something --

>> Matei Zaharia: The size of the state --
>>: Distributed parallelism if the [indiscernible] is small in most cases and
deal with local parallelism in [indiscernible].
>> Matei Zaharia: Oh, I see. So if the stream fits on a single machine, you
can do it. We're specifically targeting things that need to have high degrees
of parallelism, and there's two reasons why. Like either the stream might be
big, for example, you're collecting logs from all the machines in your
datacenter, and each one's logging many, whatever kilobytes per second or
something.
Or the computation might be big. So one of the applications we did was online
machine learning algorithm. It's very CPU-heavy, and it needs to run on many
nodes to actually handle real data. So that's what we're targeting.
>>: What if the computation depends on a series of time steps
[indiscernible] T1 and T2?
>> Matei Zaharia: Yeah, so we don't have -- so I don't have slides on this,
but we do have operators that do that, and basically an operator can go back
and take data from farther back time steps. It's not just the previous one,
yeah. And we do incremental sliding windows where you add current data and
subtract stuff from ten seconds ago. So we do stuff like that. We implemented
-- basically, a lot of the optimizations people did for stream processing and
databases, you can express them this way, because all you're doing is doing the
same thing, but in these little batches, right. Like it's an algorithmic
optimization you can still use, yeah.
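[Editor's note: the incremental sliding window just mentioned looks roughly like this in the released API; a hedged sketch reusing the pageViews stream from the earlier note, where the second function undoes counts that fall out of the window.]

    // 10-second sliding count, advancing every second.
    val windowedCounts = pageViews.map(url => (url, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,   // add counts entering the window
        (a: Long, b: Long) => a - b,   // subtract counts leaving the window
        Seconds(10),                   // window length
        Seconds(1))                    // slide interval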
Okay. So that's -- thanks, these are good questions about the model. So
basically, I talked about this already. So the way we do fault recovery is we
have to do this checkpoint periodically, but it's asynchronous. It just means
that everyone has to write their stuff out to another node, and you don't need
to do it that often because recovery is parallel. We have the same story
as before. Something goes away, now other nodes can work in parallel to
rebuild that.
And so if you look at it compared to previous recovery approaches, we have
faster recovery than upstream backup, but without the 2X cost of replication.
So that's kind of the model.
So the question is we broke this thing into these little batch jobs. How fast
can we actually make it go? And we found that compared to other systems out
there, it can actually go pretty fast. So we were able to process up to 60
million records per second, or six gigabytes per second of data, on 100 nodes at
sub-second latency. So this graph here is showing two applications. This is
just like searching for regular expression, and top K is a sliding window word
count followed by top K.
And they both scale pretty much linearly to 100 nodes and the lines here are
showing if we allow a latency target of one second versus two seconds, how --
you know, how much throughput can we get. And even with sort of sub-second
latency, I think these were around 5 or 6 hundred milliseconds, you can still
get pretty good throughput.
We compared it with a few existing systems.
>>: So in those -- [indiscernible] checkpoint?
>> Matei Zaharia: It's impacted by the size of the windows, the little
windows, and that -- because there's some communication and scheduling costs to
launch these things, yeah.
>>: Checkpoint frequency?
>> Matei Zaharia: The checkpoint frequency is not a huge deal. That affects
recovery time, but the checkpoint frequency can be pretty low, like even if you
checkpoint every ten seconds, like only one-tenth of the windows, you can
recover quickly. I'll show about that later.
>>: Do you have an idea like what the difference would be compared to
[indiscernible] ten seconds?
>> Matei Zaharia: Actually, I don't know. I think we tried a few bigger
windows, but it was a while back so I'm not sure with the current system. I
don't think it's huge. I think maybe there's a difference like maybe up to 50
percent, something like that, but it's not huge. Because when you're getting
to a ten-second window, you're getting into the realm of normal Spark jobs.
Like the machine learning ones I showed, you know, they were doing one
iteration in like one second or four seconds. So it's not a big deal.
Where this is more interesting is where we push it to, like, a multi -- like
this is a three-stage MapReduce job that's happening in 600 milliseconds.
That's where it becomes really interesting.
>>: Once you bring [indiscernible], you tend to need more resources to finish
the same job because your breakdown becomes less efficient?
>> Matei Zaharia: Yeah, exactly. That's what it is. There is this
[indiscernible]. But it becomes an engineering kind of problem, like let's
make a fast scheduler for these things, which is a thing we're happy to deal
with, yeah.
So just to get through what we did here, we compared this to a few existing
systems. So in the open source world, probably the most commonly used system
is Storm from Twitter, which is a message-passing type of system. In
general, we didn't expect to be faster for any reason, but we just wanted to
show that we're in the same ballpark. And compared to Storm, we were actually
faster, depending on what we're doing, between two and four times. But it's
mostly, I think, slightly better engineering than what we did. The point
is it's comparable performance to sort of this real system.
And commercial streaming database systems, they don't have very specific
numbers, but they say, you know, we can do maybe 500,000 or a million records
per second in total for the whole system. And this is -- usually, they don't
really scale out across nodes. But what we did is we did about this many
records per second per node and the results I showed before. But we also
scaled linearly to 100 nodes.
And the other thing is the way these guys do fault tolerance. So Storm doesn't
actually have fault tolerance for state the way we do. It only just ensures
each message will be seen at least once, and these systems use replication. So
we are able to do this while also providing nicer recovery mechanisms.
And apart from speed of computation, there's also speed of recovery, and we
found that even with pretty big checkpoint intervals, we were able to recover
quite quickly. So often, we could do it in less than a second. So this one
here is showing the sliding, like, word count with ten-second checkpoint
interval, and this is just the processing time of each batch of data. And when
we get rid of a node here, that little chunk of data takes about another extra
second to process.
And then there's this window here that because we're doing a sliding window, we
keep going along and taking new data and subtracting data from ten seconds ago,
so there's this window of vulnerability, where we may have to recompute other
stuff. And after that, we're back to normal operation, basically.
One other thing I want to show here is how this varies with the checkpoint
interval and cluster size. So one of the things -- so, for example, we tried
doing 30-second checkpoints on 20 nodes instead of the ten-second that I showed
before, and even with 30-second checkpoints, you can recover in about three or
four extra seconds from what you were normally doing.
And the other cool thing is as you add more nodes to the system, recovery gets
faster. So when we add 40 nodes instead of 20, we're actually recovering about
twice as fast.
So what's cool about this is this is a recovery mechanism where scale is an
advantage, whereas in the previous ones, the synchronization, all that stuff
gets harder with scale.
>>: [indiscernible].
>> Matei Zaharia: It does go up, yes. So that's true. Actually, it's a good
question, on average will it help or not. Maybe on average, you're still
breaking even, actually. Because twice as many faults, but you recover twice
as quickly. That's a good point. Okay. Cool.
Okay. So I have -- so I've had, you know, a decent amount of questions, but if
you guys want to stick around for five minutes, I can talk about this stuff
too. What do you think, Rich? Do you think it's a good idea?
>> Rich Draves: Yes.
>> Matei Zaharia: Okay. So I wanted to also talk very briefly about this, and
we can chat about it after in person. One of the things we did here. So apart
from looking at systems, I do like to look at scheduling algorithms and
policies and try to analyze things there.
So one of the cool problems we looked at here is multi-resource fairness. So
let me just set that up real quick. So basically, lots of computer systems
need to do -- need to divide resources across users. And the most common way
they've done it is weighted sharing, proportional sharing in the operating
system world.
Examples of that are fair queueing on network links or lottery scheduling for
the CPU.
So fair sharing basically divides one resource, like the CPU cycles you have or
the link bandwidth, according to the weights for each user. So, for example,
if the users all have equal weights, it's going to split this -- you know, they
each get a third of it. But the problem we saw, as we were building these
cluster applications, is that cluster applications have very different demands
in terms of multiple types of resources.
So some applications might compete on CPU. Other applications -- I've just been
talking about how it's important to use memory. Some applications might be
bottlenecked on IO bandwidth and so on.
So you can't really do a scheduler for these systems that only looks at one of
these resources or tries to split them up in a fixed ratio. So the question we
had is how can we generalize fair sharing to multiple resources?
Just as an example of this, what you're going to see is say you have a cluster
and you have equal amounts of CPU and memory, 100 CPUs, 100 gigabytes, and you
have -- one user has a bag of tasks they want to run that each use six CPUs and
one gigabyte of RAM. The other user needs three CPUs and four gigabytes. How
much should you give to each one?
So we tried a few policies here, a few of the natural ones, and found a bunch
of interesting problems that don't happen with a single resource. So the
first thing we tried is something we called asset fairness, and the idea there
was just to treat the resources as kind of the same, let's say, currency.
So having one percent of the CPU is the same value as one percent of memory.
And let's try to equalize the users' overall shares.
So with the numbers in the example we had before, you end up with this. The first
user gets six-ninths of the CPU and one-ninth of the memory, the second user gets
three-ninths of the CPU and four-ninths of the RAM, and in total, they each have
seven-ninths.
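As a quick check of those numbers (the algebra here is filled in for illustration; only the demands and capacities come from the example above): if user 1 runs $x$ tasks and user 2 runs $y$ tasks, asset fairness equalizes their combined shares,

$$ \frac{6x}{100} + \frac{1x}{100} = \frac{3y}{100} + \frac{4y}{100} \;\Rightarrow\; x = y, $$

and the CPU capacity constraint $6x + 3y \le 100$ then gives $x = y \approx 11.1$ tasks each, which is the $6/9$ of CPU and $1/9$ of memory for user 1, the $3/9$ of CPU and $4/9$ of memory for user 2, and $7/9$ of total assets apiece.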
But with this policy, even though it's really natural to do it, there's
actually a problem. So the problem is that one of the users gets less than
half of both resources. And we say that this violates what we call the sharing
incentive property. And by sharing incentive, we mean
that, you know, if users contribute, say, equal amounts to the cluster so they
have equal weights, one user should be able to get at least half of -- you
know, of one resource. So they should be at least as well off as if they had
just gone off and built two separate smaller clusters.
And now this guy here, because, you know, the top guy is not using memory very
much, the one at the bottom is getting less than half of both.
One thing we tried in order to fix this, which has its own problems, is called bottleneck
fairness. This is another really natural thing. So you might say let's take
the resource that's most contended and split that equally. So here you're not
contending for memory, so we'll give the second person, we'll let them get more
memory. This, again, looks like a pretty natural thing to do, but
there's actually another problem here, which is that users can start to game
the system. So it's not strategy-proof.
So for example, here the bottleneck was CPU, and user one only got half the
CPU, but he really wants to use CPUs. What that user can do is change the
demand he gave to the system: instead of saying I need six CPUs and one
gigabyte of RAM, he says I actually need five gigabytes of RAM, and he'll get them and
not use them. That shifts the bottleneck to memory, and now this user gets
more of the CPU they actually wanted as well.
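To make the gaming concrete with the example's numbers (this particular arithmetic is filled in to illustrate the claim; it is one plausible way the allocation plays out, not a calculation from the talk): with truthful demands, splitting the contended CPU equally gives each user 50 CPUs, so user 1 runs about $50/6 \approx 8.3$ tasks. If user 1 instead claims $(6\ \text{CPU},\ 5\ \text{GB})$ per task, the claimed memory demands can no longer all be satisfied, so memory becomes the most contended resource and is split equally:

$$ \text{user 1: } \frac{50\ \text{GB}}{5\ \text{GB/task}} = 10 \text{ tasks} \Rightarrow 60\ \text{CPUs}, \qquad \text{user 2: } \frac{50\ \text{GB}}{4\ \text{GB/task}} = 12.5 \text{ tasks} \Rightarrow 37.5\ \text{CPUs}, $$

so by lying, user 1's CPU allocation goes from 50 up to 60 while the extra memory just sits idle.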
So these are problems that just don't happen with one resource. So our
approach was basically to characterize the properties of single-resource fair
sharing that make it nice, and try to come up with a multi-resource policy that
has the same properties.
In a nutshell, the one we came up with is called dominant resource fairness and
the idea is to equalize each user's share of the resource they use the most. So this
user's share of CPU and that user's share of memory are going to be equal. And
we showed that this always has the sharing incentive property above. It's
strategy-proof. There's no benefit to lying about your consumption. And it
has a bunch of other properties as well.
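For readers who want something concrete, here is a minimal sketch of a DRF-style allocator for the fixed per-task-demand setting described above. It follows the progressive-filling idea (repeatedly give a task to the user with the lowest dominant share), but the names, data structures, and example numbers are illustrative rather than code from Mesos or Hadoop.

```scala
// Minimal DRF-style allocator sketch (illustrative only).
// Assumes every task consumes a positive amount of some resource.
case class User(name: String, demand: Map[String, Double]) // per-task demand, e.g. Map("cpu" -> 6, "mem" -> 1)

def drfAllocate(capacity: Map[String, Double], users: Seq[User]): Map[String, Int] = {
  var free = capacity
  val tasks = scala.collection.mutable.Map(users.map(_.name -> 0): _*)

  // A user's dominant share is their share of the resource they use the most.
  def dominantShare(u: User): Double =
    u.demand.map { case (r, d) => tasks(u.name) * d / capacity(r) }.max

  var progress = true
  while (progress) {
    // Among users whose next task still fits, pick the one with the lowest dominant share.
    val runnable = users.filter(u => u.demand.forall { case (r, d) => free.getOrElse(r, 0.0) >= d })
    if (runnable.isEmpty) progress = false
    else {
      val u = runnable.minBy(dominantShare)
      free = free.map { case (r, c) => r -> (c - u.demand.getOrElse(r, 0.0)) }
      tasks(u.name) += 1
    }
  }
  tasks.toMap
}

// With the example from the talk: 100 CPUs, 100 GB; demands (6 CPU, 1 GB) and (3 CPU, 4 GB).
val result = drfAllocate(
  Map("cpu" -> 100.0, "mem" -> 100.0),
  Seq(User("A", Map("cpu" -> 6.0, "mem" -> 1.0)),   // CPU-heavy user
      User("B", Map("cpu" -> 3.0, "mem" -> 4.0))))  // memory-heavy user
// Ends up around 9 tasks for A and 15 for B, so A's CPU share (54%) and B's memory
// share (60%) come out about as equal as whole tasks allow.
```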
And we compared it with a few other policies. One of the things we compared with
is the preferred sharing policy in economics, competitive equilibrium, which
is basically a perfect market. And we found that that actually lacks some of
the properties that DRF has. Yeah?
>>:
Do these properties [indiscernible] for more than two users?
>> Matei Zaharia: Yes, this is for any number of users and, like, n resources. Yeah.
>>: So is the assumption that the user has to specify the resources to one another? I mean, so you say --
>> Matei Zaharia: Yeah.
>>: I need six CPUs and a gigabyte of memory.
>> Matei Zaharia: It's assuming the user's demand is in a fixed ratio. So it's not
like -- it's definitely not a general thing. If a user has two types of
things, it's not going to cover it. But we started with this case where they
have a fixed ratio, and we wanted to come up with something for that. Yeah.
>>:
How does this work compared to [indiscernible].
>> Matei Zaharia: Good question. Actually, I'm not sure exactly what she's
done lately. I saw what she was doing in the past, but I think one of the
differences in her work was that she did mostly like cache resources on, like,
buffer cache within applications and looking at like hardware mechanisms to
even enable that to happen. I think the policy is something you might use
there. I don't think they looked at this problem of just what [indiscernible]
policy is. They looked at what's the hardware mechanism. I might be wrong,
though, because I haven't looked at it in a while.
>>: What about the resilience of [indiscernible].
>> Matei Zaharia: Yeah, so actually, yeah. Right. So this is actually
resistant to collusion as long as each user uses some of each resource. I
think if some users have zero demand for a resource, then you can cheat.
So we didn't prove that, but one of the interesting things that happened is that
a bunch of economists actually looked at this afterward and tried to look at
other properties, and that's one of the ones they pointed out.
>>: So this seems to assume that there is a dominant resource. That --
>> Matei Zaharia: Oh, yeah. Per user.
>>: Which if you've got a large enough memory -- if you can give me the
full memory on a machine, then I may not care about network bandwidth, and so
memory is my dominant resource. But assuming that's never going to happen, then network
bandwidth may be my dominant resource.
>> Matei Zaharia: Oh, so you're saying your application might -- yes, your
application has different ways of running based on the resources.
>>:
It has different efficiencies based on different --
>> Matei Zaharia: Yeah, we didn't deal with that yet. So there's lots of ways
to try to generalize this, and actually I'm interested in doing some of it.
We've done a bit already. But yeah, we didn't deal with that.
>>: Where do you say this thing applies? I mean, is this at the level of a single machine?
>> Matei Zaharia: Yeah, so we applied it in a cluster scheduler, Mesos, which
I didn't talk about. One of the cool things is actually the Hadoop team is now
applying this in Hadoop, independently implementing this. And we had this
paper at Sigcomm last year where we applied it in software defined
[indiscernible] and middleboxes also. So where you have flows that go through
different modules, like intrusion detection, they might stress different
resources.
>>: So this is after you've already [indiscernible] resources and now you want
to share --
>> Matei Zaharia: Exactly, the model is you have users within an organization.
It's not an economic thing where you're paying for resources.
>>: Then perhaps some of these issues about cheating and lying, some of those things probably --
>> Matei Zaharia: Yeah, so actually the cheating thing came because we saw
people doing this in real clusters. So for example, one interesting story
there from Google, they used to have this policy that if you have utilization
above a certain level, they'll give you dedicated machines. They found users
would actually add like spin loops and things.
So there was a similar thing with Hadoop users. So users get very creative.
One of the Hadoop things, like, Yahoo built these 3,000-node Hadoop
clusters, and users wanted to run MPI. So people would, like, write a map function
that runs MPI in it, and you can imagine that really messed up the networking
and the data locality for the other jobs.
>>: How does this compare to a natural market, since you assign a price per
unit to each resource.
>> Matei Zaharia: That's a good question. So this is kind of what the
competitive equilibrium does. So competitive equilibrium is if we had a
perfectly competitive market, what would it allocate? Now, it's not the same
as users actually bidding, but the problem is with users bidding, it becomes
very complex for the users to do stuff.
So, but competitive equilibrium is kind of the outcome if they did bid in a
perfectly competitive market. The problem is the assumption that the market
is perfectly competitive, which begs some questions.
Okay. Cool. So I'd like to talk more with people about this after. So let me
wrap up. So just one other thing I wanted to say. So I do like to build sort
of real applications in real systems. And working in this space, especially
because it's such a new space, I've tried to open source things and I've been
lucky to have people actually try to use some of these.
So I talked about Spark and Shark, but some of the other systems that I've
worked on have also been used outside. So Mesos cluster manager is actually
used at Twitter to manage their nodes. They have over 3,000 nodes now that
they're managing. DRF is being independently implemented in the Hadoop 2.0
design. LATE, an algorithm for straggler handling, is also in Hadoop. Delay
scheduling, which is a thing I did for data locality -- I actually wrote one of
the most popular schedulers for Hadoop, called the Hadoop Fair Scheduler, as
part of that work. And it's still being used today at places like Facebook and
eBay and so on.
And finally, the thing I've been working on with folks here, the SNAP sequence aligner --
which is a really fun but totally different thing, looking at gene sequencing --
has actually started to be used. There's a group at UCSF that's been using it to
try to build a pipeline to find viruses faster.
So basically, one of the things I really like about this space is that
there is room to actually build things that people will use. Just to
summarize, you guys already know the big data systems problems, but I
hope I've shown you some of the problems that can happen and some of the
research challenges.
I've talked about these two things. I've talked about this way of dealing with
faults when you have coarse-grained operations, which is very common in data-parallel
algorithms, and we've applied this to both batch and streaming. And
I've also talked about the multi-resource sharing problem, which is one of the
things you can look at in these clusters. So that's it. I'd be glad to take any
more questions.
>>:
[indiscernible].
How did you make sure that there was [indiscernible].
>> Matei Zaharia: So the user -- basically, so we don't enforce the
determinism. If you happen to call a library that's not deterministic, like
you call [indiscernible], but we do provide ways -- so the operations we
provide, as long as you pass a deterministic function, it will produce the same
result.
And, for example, for things like random number generation or sampling, we have
a sampling operation, we just seed that with this sort of task ID so that it's
always the same ID --
>>:
So you [indiscernible].
>> Matei Zaharia: We just -- it's a bit simpler. We just seed the random number
generators, things like that. But if users are doing stuff that's
nondeterministic, we don't catch it. It would be interesting to try to enforce
that or even to detect it. We do have some work now that will detect it by
just checksumming the output and seeing whether it's the same. But it won't
tell you until you've already messed up.
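As a rough illustration of the seeding idea (my own example; Spark's built-in sampling operator handles its seeding internally, and the app setup, dataset, seed, and fraction here are made up): deriving each task's random seed from the partition index makes the output reproducible if a partition is lost and recomputed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SeededSampling {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seeded-sampling").setMaster("local[*]"))
    val data = sc.parallelize(1 to 1000000, 8)
    val baseSeed = 42L

    val sample = data.mapPartitionsWithIndex { (partitionId, records) =>
      // Same partition index => same seed => same "random" choices on recomputation.
      val rng = new Random(baseSeed + partitionId)
      records.filter(_ => rng.nextDouble() < 0.01)   // keep roughly 1% of records, deterministically
    }

    println(sample.count())   // same answer even if partitions are recomputed after a failure
    sc.stop()
  }
}
```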
>>: I didn't catch, when does an RDD die? When do you not use it anymore? Do you just get rid of it?
>> Matei Zaharia: Good question. We have -- you can actually set what happens.
There's different storage levels. So one is just drop it and if I need it,
I'll recompute it again. The other one is [indiscernible]. And we actually
have like an LRU cache in there. So each node keeps accumulating data while it
can, and then things drop out at the end.
>>: So if you are keeping it stored, then you have to assume that writing to memory
is effectively at the cost of writing to disk, because in the long term, you're
going to have to build something up.
>> Matei Zaharia: Well, that's only true if you use the storage level where
they spill out to disk. So by default, actually, we just drop it. And if we
ever come back to it, recompute it.
>>:
When you're done with it, you have the persistent number that you know.
>> Matei Zaharia: Yes, but all that means is that normally, like in my demo, I
made all these intermediate data sets. Each time I do a map or filter, it's
another RDD. But by default, they're not even saved to memory. They're just,
you compute it in kind of a streaming fashion and then you drop it out once
you've got the result.
>>:
[indiscernible].
>> Matei Zaharia: So on the ones you call persistent. So basically, RDD is
just a recipe for computing a dataset. And if you mark it as a thing you want
to keep around, it stays. It's kind of a weird thing. But it's -- that's just
the programming model that made sense, because people build up a dataset out of
many transformations, and they don't want to save the intermediate result.
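A small sketch of that model as it looks to a user in the Spark shell (the persist/unpersist calls and storage levels are the standard API; the dataset path and the pipeline itself are made up for illustration):

```scala
import org.apache.spark.storage.StorageLevel

// In the Spark shell, `sc` is the already-created SparkContext.
val errors = sc.textFile("hdfs://example/logs").filter(_.contains("ERROR"))   // just a recipe so far
val fields = errors.map(_.split("\t")).persist(StorageLevel.MEMORY_ONLY)      // mark it to keep in RAM
// StorageLevel.MEMORY_AND_DISK would spill to disk under memory pressure instead of dropping.

fields.count()      // first action materializes the RDD and caches its partitions
fields.count()      // reuses whatever cached partitions are still in memory; the rest get recomputed
fields.unpersist()  // done with it, so release the space explicitly
```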
>>:
[indiscernible].
>> Matei Zaharia: To some extent, but it's still expensive to just allocate
space and write it. So, for example, if you do a map and then another map and
neither of those is being persisted, we do it one record at a time, so we do it in
a pipelined fashion, and it's just better for cache usage and memory bandwidth
and stuff like that.
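The "one record at a time" point can be seen with plain Scala iterators, which is roughly the mechanism inside a task (a toy analogy, not Spark's actual internals):

```scala
// Two chained maps over an iterator are evaluated lazily, one record at a time,
// with no intermediate collection allocated in between.
val records   = Iterator.range(0, 5)
val pipelined = records.map(_ * 2).map(_ + 1)   // nothing is computed yet
pipelined.foreach(println)                      // each record flows through both maps in turn
```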
>>:
Is this meant to replace intermediate data between the mapper and reducer?
>> Matei Zaharia: It doesn't actually change the way that works. It's really
more between MapReduce jobs. So between a mapper and reducer, we still -- in
our system, we still have the maps actually write out a block. And again, it's
in memory first and then it goes to disk. And we don't actually push -- yeah.
>>: [indiscernible].
>> Matei Zaharia: It actually wasn't. Mostly, it was -- so one of the things that
can be a problem is network bandwidth. So, for example, when we were
receiving, I said we received like I don't know how many gigabytes per second.
Part of the problem is you have to replicate the input data also. And that can
actually really be a bottleneck. So that was one thing.
The other thing that will happen in large versions of this now is eventually,
the scheduling will become a bottleneck. So I think a cool, like, future topic
would be how can we make this scheduling faster, and can we even do things like
decentralized scheduling or work stealing, where it doesn't need to be done by
one node.
>> HOST: Any other questions? Okay. Thanks.