>> Arvind Arasu: I am very happy to introduce Shivnath Babu. Shivnath is on the
faculty at Duke University, and he is an expert on streams and analytics systems. He
has authored many papers which are heavily cited in these areas. He will be speaking
now about self-tuning systems for big data. So, Shivnath.
>> Shivnath Babu: Thanks, Arvind. I am going to be talking about a system that we have
been building at Duke for the last couple of years called Starfish. You will see in the
title that there are two terms, one of which might make sense, self-tuning, and one of which
will not, MADDER. I am going to introduce these as we move along. At any
conference, any place you actually go to, you keep hearing the term big data, big data
analytics, right? And often when we hear about that we also hear about the giants in the
space, like the Googles and the Facebooks and the Microsofts, who have petabyte-scale
warehouses. But there are also many little players in the big data space. One
example I would like to give is journalism and this whole field of computational
journalism, which is something that we are very interested in at Duke, to the point that
there is a center for computational journalism. Many of these newspapers, the ones in
print, are actually losing money big time, so funds are getting cut for things like
investigative reporting. Reporting of that kind is very human intensive, so the whole
spirit of computational journalism is to replace some of those manual sorts of tasks
with computing.
And the best example I can give takes us to something like WikiLeaks. WikiLeaks is a
lot of data. The administration releases data on funding and where things are actually
going and things like that. So now there are a whole bunch of journalists who are
scrambling to extract leads out of that data from which they can build stories. These are
places where these guys don't have petabytes of data; they have probably a couple of
gigabytes of data. So what makes data big? It is not only the size of the data, and a
petabyte is definitely big. What also makes it big is the type of analysis that you do on
the data. Many people don't just want to do counting and aggregation and things like
that; they might want to treat the data as a matrix, build matrices out of it, or do a whole
bunch of linear algebra operations or machine learning operations and things like that.
So we have been interested in big data and this space, and around two years back we were
thinking: we would like to build a system for big data analytics, so what features should
that system have? One of the things we spent a lot of time on is understanding what
features people need from such a system, what the analysis practices are for big data,
and that is how we came up with these MADDER principles that I will introduce in a
second. To establish context, I want to tell you how the Starfish system actually looks.
It is built on the MapReduce execution framework, and very specifically our
implementation is on the Hadoop MapReduce execution engine. The MapReduce
execution framework consists of a distributed file system; this would be the Hadoop
Distributed File System in the Hadoop world.
In the Google world the distributed file system is the Google File System, and on top of
it is a MapReduce execution engine which has a scheduler that takes care of where tasks
[inaudible] run, if there's a failure, if there is a task that is running slow, things like that.
So Starfish is going to fit into the Hadoop ecosystem, and the main thing I want to
emphasize is that when you say Hadoop it's not just the MapReduce execution engine and
the distributed file system. It's a whole bunch of tools that are layered on top. A lot of
people, like a lot of machine learning researchers at Yahoo, for instance, like to write
their MapReduce programs in Python with a nice API called the Hadoop Streaming API.
A lot of people actually like to use more declarative languages like HiveQL, which is an
SQL-like language, or Pig Latin, a language somewhere between SQL and a more
scripting-style interface. There are workflow engines, because there's not much you can
do with a single MapReduce job; you often have to tie together a whole bunch of
MapReduce jobs into a workflow.
There is a system called Oozie, which is a workflow manager. You know that there is a
cloud, and people have actually moved into that space, making it very easy to provision a
hundred-node Hadoop cluster in minutes; that is the Elastic MapReduce service that
Amazon provides. So this is the context in which Starfish actually fits, and I hope you
will keep this in mind as we keep moving along. In reality we could replace Hadoop
with another, you know, big data system, maybe a [inaudible] database, but a lot of our
work actually focuses on interesting aspects of MapReduce.
So what are these MADDER principles that I keep mentioning? There are two ways to
understand them. One way is: what are the analysis practices when people are dealing
with big data; what does that analysis look like; what are its features? Or, building on
that: if you are a big data practitioner and somebody gives you a big data analysis
system, what features do you want in that system? We were spending a lot of time in
2009 thinking about this, and at that time there was a paper published in VLDB by
Joe Hellerstein and his colleagues, who were collaborators from Greenplum and from
Fox Interactive Media. They were attacking a similar problem, trying to characterize
these things. They came up with this cute acronym, MAD, and then we realized that
some of our thoughts were similar, although I will say that there are differences which I
will emphasize, so we copied that and came up with MADDER. The M and the A of the
first MAD are similar; the first D is slightly different. So what are these MADDER
principles?
The first M stands for magnetic. Again, going back to the computational journalism
example: WikiLeaks has been posted, right? Now as a journalist I probably have no idea
what the data actually looks like, so without working with the data I probably cannot
even define the schema or any properties of the data. So the big data system should be
magnetic: it should be very easy to get your data into the system and start working with
it. Okay, that is what the M stands for.
The A comes from agility, agile. In many of these big data processes it is very hard to
predict, to have a good idea of, how the data will actually get processed, what the data
formats and the requirements will be, and things like that. So the system should be agile
and should be able to adapt to whatever changes you have, and the sort of
counterintuitive thing here is that the less the system assumes about the data or the
workload, the more agile it can be.
The first D stands for deep, and here is where we differ significantly from the version
that Hellerstein and his collaborators had. Deep means you should be able to do deep
analytics, you know, machine learning and linear algebra and things like that, and they
were doing all of this in SQL, which actually was very heavy. Now for us deep also has
a sense of broad, as in the ocean being deep and broad. For us deep really means we
don't want to go and tell these statisticians or the journalists, look, write all your stuff in
SQL because we know how to handle SQL; rather we want to tell them, you can write it
in SQL if you like. If you like Python and that's what you want to use, write your stuff
in Python. Remember that I am assuming an underlying MapReduce execution
framework, so it is not just [inaudible]; it's Python with an API that is going to generate
MapReduce programs that will be sent to the processing system. Or if they like other
languages, do that. So that is what we want with deep: a system that is not making any
assumptions about how people are actually working or the interfaces [inaudible] that they
are using on top, like the Hadoop stack I actually showed just a few slides back.
The next D, and in fact the whole DER, is very specific to our thinking, things which
were not in the VLDB paper. Data-lifecycle-aware is a little bit hard to get at first, but
hopefully it will come out through an example. This is actually a Yahoo web page; on
any of these social network sort of pages that you go to, you can see that there are a lot
of things in there such as ads, news, articles being recommended; in fact, the whole
content that has been generated has a whole bunch of analytics that have gone into it.
And this is an architectural picture taken from LinkedIn, and Facebook and Yahoo, I bet
even Microsoft, probably have something like this, you know, driving the content. This
whole system architecture can be broken up into three subsystems, roughly. There is a
data serving subsystem, which has the three-tier architectures and everything for the
content to be delivered. NoSQL engines play in that space; LinkedIn's NoSQL engine is
called Voldemort. So those are the data serving systems, the ones that can actually
capture a lot of writes. Then there is the log aggregation subsystem, and by logs I don't
mean the database transaction logs; I mean the click stream and the user activity
information. That information that is getting generated in the data serving system gets
moved, in some sort of real time, into an analytics platform. It could be Hadoop; it could
be HBase; there is a lot of innovation happening in that space. And what is really
interesting is that the results of that get pushed back, so there is a cycle happening, and
that is why LinkedIn calls it a data cycle; they estimate terabytes of data are actually
moving through the cycle.
So notice that data can actually take these interesting cycles, lifecycles, and the big data
practitioners have to support them. So whatever system we come up with should not
make this hard, okay, especially when there are multiple systems here trying to work
together. One example: if I have an analytics system that takes 30 minutes to load data,
even if it can process the [inaudible] and things like that in 1 minute, that is probably a
nonstarter if you actually wanted a 15-minute cycle latency. Depending on the
application you might want latencies of seconds; LinkedIn was actually using this for
their People You May Know feature, that sort of feature that most social networks have,
and they were having a six-hour cycle latency, so depending on that you have to do
things differently. So that is the sort of requirement, data lifecycle awareness, that comes
up in big data analytics.
The E stands for elasticity. There is the world of the cloud, and you may or may not
believe in the cloud, but one feature people like a lot about the cloud, and a feature you
want to have especially in journalism because you get these big data sets only every now
and then, is that you expect the resources used in the system, as well as the costs
incurred, to be proportional to your actual workload. You don't want the system to just
be lying idle all of the time. So this is the feature that we bundle under elasticity, the
pay-as-you-go nature.
The last R, which you have heard a lot but I want to emphasize again, is robustness.
Anything that can go wrong is going to go wrong when you actually begin working with
big systems and big data. One of the interesting sorts of problems that we have been
running into, and have some work on that I am not going to be talking about, is data
corruption. Things will crash and data can actually become corrupted. These are all
things the system should be able to deal with gracefully, not just fall over.
So these are the MADDER principles, and we wanted to build a system that sticks to
these MADDER principles. As we started thinking more about the system design and
the system architecture, frankly it was these MADDER principles that brought us more
to the Hadoop world, the MapReduce way of thinking, compared to a parallel database
system. You realize that there isn't really a focus on ease of use there. People definitely
want good performance, and what is unique about these MADDER principles is that
each and every one of them makes it much harder for the system to give good
performance out of the box. If you take the M and the A, the magnetic and the agility,
often what that means is that the system knows nothing about the data when it is loaded,
up until the point where somebody is going to analyze or run something on that data. So
data can actually be opaque. With the deep part, you know, the people who are writing
these programs are sticking to their workflows that connect modular subsystems
together. What that means is that we are not dealing with declarative languages like
SQL; we are dealing with Python programs, we are dealing with these multi-system
workflows, and things like that. For elasticity and robustness, you know, systems have
actually built very nice mechanisms to provide elasticity and robustness, but what is
often hard is not the mechanism -- you can add more nodes so the cluster will grow and
shrink -- but when you eventually do that, how many nodes should you add? How many
nodes should you take off? Those are the things that actually become very hard, the
policies. So the goal of Starfish was that we want the MADDER features, but we also
want the system to perform well without the user -- you know, often in this context there
is no administrator, okay? The system should be able to give good performance
automatically. That is where the self-tuning aspect comes in. Starfish is about bringing
these two things together.
So what I am going to do in this talk is first talk about the tuning problems that actually
arise in a big data analytics system, especially one built on a MapReduce framework.
I'm going to spend some time on that and give you some examples. Then I am going to
drill down into two specific problems, at the level of individual MapReduce jobs and all
the way up to the level of sizing the entire cluster. And then I am going to show you an
experimental evaluation of the system. It's available; we are actively working on it.
Version 0.1 is out there; version 0.2 -- we are testing it and everything -- should be out
by the end of the month; and there is a version 0.3 coming out of the collaboration with
Yahoo and so on. It's basically a real system, and when I talk about the evaluation, it is
a rapidly moving sort of target. I will also talk about some of the ongoing work that we
have in Starfish, some of the things on the wish list.
What are these tuning problems? One thing I hope you will take away from this talk is
that tuning problems arise at multiple different levels, and you don't want different tools
that each work on a different part; you actually want something that has an integrated
view of all these different parts. And what are those parts? There are tuning challenges
in the configuration space at the individual MapReduce job level. There are challenges
at the level where you are stitching together multiple jobs into workflows, and some of
these workflows can have 80 to 200 jobs; those are big workflows. Nobody runs single
jobs one at a time; even if you are writing an SQL query or a Pig script, it still translates
into multiple jobs, a workload of multiple jobs.
And then there are challenges at the level of the entire workload, for instance the
challenge I was talking about in the data cycle sort of context, or if you have multiple
users and you want to provide some sort of guarantees across the workloads that
different users are running. There are challenges at the level of the data layout. People
keep saying that in the MapReduce world you have no schema, that you don't know
anything about the data. That is far from the truth. Any time you are actually running a
workload on it, you can learn a lot about the data, and there are interesting opportunities
in automatically partitioning data to help with future workloads. And finally, sizing the
cluster. I will actually give you examples of at least three of these in the next couple of
slides. Before that, a quick primer on MapReduce job execution so that all of us are on
the same page. A MapReduce job begins its life as a MapReduce program that a user
has written. I know you can't read this; this is actually a WordCount MapReduce
program written in Java, sticking to the Java API that is provided by Hadoop. The user
writes map functions and reduce functions and stitches them together using some
boilerplate code into a job. So a MapReduce job connects the program the user wants to
run -- and this program could also have been generated automatically by a compiler for
a system like [inaudible] or Pig Latin; it's a program -- with the data and some cluster
resources on which the job is actually going to run, and then there is a configuration. I
will elaborate on the configuration in just a moment.
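To make that "program plus data plus resources plus configuration" framing concrete,
here is a minimal WordCount-style driver against the Hadoop Java API. This is only a
sketch of the kind of program being described, written against the Hadoop 1.x
mapreduce API, and not the exact code on the slide:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);            // emit (word, 1)
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));   // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // the configuration
        Job job = new Job(conf, "word count");        // boilerplate stitching it together
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);    // optional map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // run on the cluster resources
      }
    }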
The first thing that happens is that the data is taken and logical partitions of the data are
defined; in Hadoop these are of course called splits. Then a map task is started for each
split on those cluster resources. The map task runs the map function, does some sorting,
compression, and all of that, generating some intermediate data. That intermediate data,
which is actually partitioned, gets shuffled to the reducers that are running. And often, if
you are processing anywhere from many gigabytes to terabytes of data, the number of
execution slots in the cluster to run these tasks will not be enough; you might have to
run multiple waves of these tasks. For example, the way Hadoop is set up, for every
slave node in the cluster you can say it can run three map tasks and maybe two reduce
tasks. So if we have 10 such nodes, then you have 30 map task slots, and if the job runs
1000 maps you can think of it as actually running many waves. Similarly, reducers can
also run in many waves.
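To make the waves concrete with those numbers: 10 slave nodes with 3 map slots each
give 30 map slots across the cluster, so a job with 1000 map tasks runs in roughly
1000 / 30, around 34, waves of maps.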
So let's drill down a little bit into individual tasks. If you take map task execution, what
I am suggesting is that there are many phases, what we call the phases of task execution.
It starts with the split being read from the distributed file system. Then the map function
provided by the user is applied to each key-value pair; this is where that WordCount-like
Java program would actually be running. The output key-value pairs from the map get
put into a memory buffer, serialized, along with the partition that each one will go to on
the reduce side, and when the buffer fills up to some threshold it gets sorted. There is a
combine operation that could be done, which is just pre-aggregation on the map side, so
that the data getting shuffled to the reducers is reduced. You could then compress it, and
this is spilled to the local disk, the local file system. If there is more data coming, then
things can get spilled again; there is an external-sort kind of thing that runs here with the
merging and everything, and then finally the intermediate output of the map is created.
This is the thing that actually gets shuffled when the reducers ask for it.
That is a rough idea of how a MapReduce job executes; I am not going to go into the
phases of the reduce in the same detail. Now when you look at this [inaudible], the user
wrote a MapReduce program, and that translated into some execution strategy with
some number of map tasks, some number of reduce tasks, some way of partitioning the
data, some choice of whether the combine, the pre-aggregation, should be done, and
some choices of whether the intermediate output of the maps should be compressed and
uncompressed on the other side, and whether the final output should be compressed. So
a lot of choices are made.
So who makes these choices? These choices, in the Hadoop world for instance, are
specified by the configuration that is given when the job is submitted -- today that is just
a fact of Hadoop engineering -- and that configuration determines the actual execution.
I could choose to have 20 reduce tasks or 100 reduce tasks. I could choose to size the
map output buffer differently. Now, do these parameters actually make a difference?
You bet. There can be order-of-magnitude differences, and we have spent many
thousands, tens of thousands, of dollars on Amazon EC2 trying to understand the effect
of these parameters. There are actually around 190 parameters in Hadoop, although
many of them are things like where to write a task's log file; there are around 20 to 25
parameters that can actually have a significant impact on job performance.
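For concreteness, here is how a few of the job-level knobs being discussed are set; the
parameter names below are the Hadoop 1.x ones, and this is only a small illustrative
subset of the 190-odd parameters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ConfiguredJob {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // The kinds of choices mentioned above: how many reduce tasks,
        // how big the map-side output buffer is, and whether intermediate
        // and final outputs are compressed.
        conf.setInt("mapred.reduce.tasks", 100);
        conf.setInt("io.sort.mb", 140);
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setBoolean("mapred.output.compress", false);
        // The same settings can also be passed at submission time,
        // e.g. -D io.sort.mb=140 when the driver uses ToolRunner.
        return new Job(conf, "configured job");
      }
    }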
And again, we are talking about the same program, the same cluster resources, and the
same data. What this graph is showing you, and this will give you a quick idea, is a
program called Word Co-occurrence, which is used in natural language processing to
find which pairs of words occur together and to count the occurrences of those pairs.
We have taken Word Co-occurrence, the same data, the same cluster resources, and all
we have done, as you can see on the X and Y axes, is change two parameters. io.sort.mb
basically specifies the size of that map-side buffer I showed you on the map task slide.
This other one, most people who have used Hadoop have not even heard of; it controls
the amount of space in that map buffer which is used to store metadata, because we are
dealing with keys and values that could be of arbitrary [inaudible], so Hadoop wants to
store some metadata for every key-value pair that is put in.
Now, what the z-axis shows is the running time: we ran the program a number of times
and are actually showing a [inaudible] surface of how performance varies. Here the
variation is not that high, but I can show you other surfaces where the variation is easily
an order of magnitude and even higher. Ideally we would like a choice of these
parameters such that you hit the blue region, but in this example the typical strategies
that people use to select these parameters land you in the red region, the really bad
region. So essentially that is the kind of problem: especially if you want to do this job
[inaudible] somehow automatically, you must pick parameters from this blue region.
So what do people do today? I have actually cut and pasted two excerpts from blogs
here. Yes?
>>: On the previous slide, can you explain why in this type of environment [inaudible]
more memory would always improve performance?
>> Shivnath Babu: Very good question. Basically I chose this example for a very
specific reason, and it will become clear in the talk; it will become apparent in the very
next slide. So let me take that question gradually; let me answer it in the talk? That is
part of the reason why I chose this one. Any other questions?
We will start getting part of the answer right on this slide. What I am showing you here
is that I went to the Pig site, the apache.org Pig pages, the Pig Journal, and cut and
pasted two sentences from the Pig to-do list, their roadmap. I know you can't read this
as well as I can. They are saying that people are aware of this problem, that there are all
of these configuration parameters, and they are also aware of the fact that one single
configuration, one size, does not fit all: for different jobs you want to have different
configurations. If one configuration, one set of parameters, was actually best for
everything, then we wouldn't have a problem to solve here. Focus on the last three lines
of that first blog. That shows how people go about attacking this problem today: rules
of thumb. What that rule of thumb is actually saying is, if you have a MapReduce
program that doesn't have any memory-intensive operations -- not building any big hash
tables or anything in the map phase -- and it has a combine phase, the combiner being
the pre-aggregation done on the map side so that the data that is shuffled to the reduce
side gets compacted, is smaller; if those two conditions hold, you want to set that
map-side buffer as high as possible. And this is very I/O-oriented thinking: the idea is
that if you do that, the combiner will be more effective because it can group many more
items together, and the number of spills will be reduced. So, do that.
And that rule exactly gets you into that red region [inaudible]. A quick insight will show
you perhaps why this happens. The insight is the setting that we're running in: this is on
Amazon EC2, and Amazon EC2 gives you a whole set of different types of nodes that I
will get to in the next slide. There are nodes which do not have that much CPU, not
many compute cycles, so running all of this combining and everything causes CPU
contention. So you are much better off not having large buffers: smaller buffers but
better utilization of the CPU and I/O, even at the cost of shuffling more data, and that
gets you into the blue region.
So those are the trade-offs, and any time you have this sort of trade-off, that is when the
tuning problem becomes interesting; otherwise it's probably a monotonic surface.
Notice that there is a dip, right? Finding that point is what becomes a tuning problem; if
this were all monotonic, then things would be very easy. The second point I want to get
to is the second blog here. The way many people are thinking -- even at Yahoo and
Facebook, and I'm sure at Microsoft you have [inaudible] and everything -- is that you
would rather people not think at the level of MapReduce jobs; not everybody can write
MapReduce jobs, right? You would rather have them write in Pig Latin or SCOPE or
something like that. But when I am writing a Pig Latin script, I have no idea how many
MapReduce jobs or which MapReduce jobs get generated. If I don't even have an idea
of that, then how do I go about tuning those parameters for all of those jobs? That is
basically what the blog is saying.
Great. So that was the job level, and slowly I got into the workflow part. Now let me
move up to the workloads and to cluster sizing. One very nice thing the Amazon cloud,
and cloud computing in general, has made possible is that if I want to allocate a 100-node
cluster, I can simply go [inaudible] a hundred-node Hadoop cluster on Amazon.
Probably three or four years back I would have had to go talk to my system administrator,
go through some sort of [inaudible] and sizing and all of these things; it could be days to
months of delay before something like this could happen. But as the cliché goes, with
great power comes great responsibility, right?
So where is the responsibility that comes in here? Amazon actually has different types of
nodes; I am just showing 6 of them here, but there are 11 types of nodes, many different
pricing schemes, and things like that. In this picture, each row shows a node type, and
the second, third, and fourth columns show how these are characterized in terms of
resources, often in abstract units. The c1.medium node here has 5 EC2 compute units,
for some definition of a compute unit, and its I/O performance is moderate, while the
cc1.4xlarge has 33.5 compute units and high I/O performance. And there is some
per-hour cost that you incur for each of these node types.
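A rough way to reason about the trade-off that follows: since these nodes are billed per
node per hour, the dollar cost of running a workload works out to approximately
(number of nodes) x (per-hour price of the node type) x (completion time rounded up to
whole hours), which is why a node type that is individually slower can still win on total
cost even though the workload takes longer.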
So if I am a user -- and when I have gone and given this Starfish talk to Hadoop users in
the Triangle area where I am from, one of the types of questions I hear is: I have this
workload that runs in 5 hours, and I would like to cut it down to 1 hour, so what should I
do? -- I am trying to illustrate some of the choices that you might have in such a setting.
What you see here are two graphs. The x-axis is the same on both: it is the node type on
Amazon EC2, 5 of those node types that I just showed on the previous slide. We are
taking a Hadoop workload that consists of, I think, 6 applications, something like 10
jobs, and so on. The Y-axis on the first graph shows the completion time of that
workload; the second one shows the cost incurred to run the workload. If you just focus
on what I have shown here: if you use the m1.xlarge nodes, you can complete the
workload in about two hours, but you pay somewhat more.
Using c1.medium nodes, the workload will complete in four hours; it takes more time,
but at a correspondingly 40% lower cost. And many of these applications in the big data
world are batch sorts of jobs, or [inaudible] was calling, you know, the [inaudible].
>>: [inaudible].
>> Shivnath Babu: Right, these are all run with the same configuration. The
configuration is based on the rule-based settings for Hadoop, as I will explain when I
show my experimental results. So this is a simple example of what people can do today
with different types of nodes and different clusters, and the costs that these will incur.
>>: [inaudible] what is the source size of the example [inaudible] large [inaudible]
available?
>> Shivnath Babu: I was going to get to that in the experimental results section, but
since you just asked it -- I purposely hid those details just because not everyone would be
familiar enough for me to introduce them at this point. You can see basically what these
numbers were for. On Amazon EC2 they have the Elastic MapReduce service, and for
each of these node types they come up with some recommended numbers if you are
starting a Hadoop cluster with Elastic MapReduce. We don't use Elastic MapReduce
for these experiments; this is basically our own harness that we have set up for
provisioning a Hadoop cluster on Amazon EC2, and these are the numbers that we have
empirically determined to be good for the sort of workloads that we are running. For
instance for c1.medium -- and again taking that question -- as I mentioned earlier, when
you are setting up a Hadoop cluster, every slave node is set up with some number of map
slots and some number of reduce slots; that is the number of concurrent tasks. For
c1.medium I set it up with two map slots and two reduce slots, which means that node
can run two map tasks and two reduce tasks concurrently. So if you had 10 nodes, the
map wave size is really 10 times two.
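For reference, in Hadoop 1.x these per-slave slot counts are controlled by the
mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
settings in each slave's mapred-site.xml; the c1.medium setup just described would set
both of them to 2.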
If the user's requirement is really that the workload has to finish by 6:00 a.m. Eastern
time, then we can recommend different choices to them; we can take the cost and
running-time requirements from them and come up with different options. Today people
solve this problem manually. So this would be a tuning problem at the level of the entire
cluster. And the question that was asked -- how many map slots and reduce slots can
actually run on one node -- is also a tuning parameter. If we have light jobs we would
probably want to run many of them together; if we have very heavy jobs we probably
want to run few of them, because they are all going to contend for the same set of
resources on the node.
Hopefully I have convinced you that there are challenges at different levels: at the level
of MapReduce jobs, workflows, and workloads. I didn't expand on data layouts, but you
can probably believe me there, and on the cluster sizing. So on this one slide I am trying
to illustrate, in general, how we think about approaching these tuning problems in one
unified fashion, and more importantly what makes up the architecture of the Starfish
system. There are three components. The first one is the Profiler, and the job
description of the Profiler is to collect concise summaries of execution: at the level of
jobs, at the level of workflows, at the level of workloads. Notice that this is about an
actual execution happening and then observing it; what it collects are the profiles.
These profiles then get fed into what we call the What-if Engine, and the What-if
Engine's role is to estimate the impact of hypothetical changes, the impact on execution
if certain changes were made. So this guy will be given a profile and asked: look, that
profile was collected with 20 reduce tasks; what would happen if I had 40 reduce tasks
and turned on map output compression, or if I had 40 reduce tasks and the data size were
to increase by 20%? So this is what-if questioning and answering: given an execution,
you ask, if a change were to happen, how would performance change? And there can be
different ways in which we characterize performance: maybe it's completion time,
maybe it's resource utilization, because if we are running multiple jobs together we have
more of a throughput focus. So that is the job of the What-if Engine.
And then we have a whole suite of optimizers on top. The job of an optimizer is, for
whatever tuning problem you have, to enumerate the space you are considering, search
that space, make the appropriate calls to the What-if Engine, or to the Profiler to generate
profiles, and then come up with a good answer to recommend. This leads naturally into
the Starfish architecture. As you see, there is a data layer, there is a job layer, there is a
workflow layer, and there is an entire-workload layer which has the workload itself and
the cluster sizing components, and the Profiler and the What-if Engine cut across all of
these things.
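As a way to picture how the three components connect, here is a sketch of the interfaces
involved. These are hypothetical Java interfaces written only for illustration; they are
not the actual Starfish API, and all of the type and method names are made up:

    import java.util.List;

    // Opaque placeholder types so the sketch is self-contained.
    class Program {}
    class Data {}
    class Resources {}
    class Config {}
    class JobProfile {}

    interface Profiler {
      // Observe an actual execution of a program on some data, resources,
      // and configuration, and collect a concise profile of it.
      JobProfile profileByMeasurement(Program p, Data d, Resources r, Config c);
    }

    interface WhatIfEngine {
      // Given a measured profile, estimate how the same program would behave
      // under hypothetical data properties, resources, and configuration.
      JobProfile virtualProfile(JobProfile measured, Data d2, Resources r2, Config c2);
      double estimateCompletionTime(JobProfile virtualProfile, Resources r2);
    }

    interface Optimizer {
      // Enumerate a space of candidate choices (here, configurations), ask the
      // What-if Engine about each, and recommend a good one.
      Config recommendConfiguration(WhatIfEngine whatIf, JobProfile measured,
                                    Data d2, Resources r2, List<Config> candidates);
    }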
So what I am going to do next is drill down a little bit into the job tuning problem. Not
because I think that is the most important problem, but because I want to give you an
internal view of how these three components interplay so that they can actually solve a
real problem. Here is an abstract description of the problem: you are given a job to be
run, specified by a program, resources, and data, and what has to be chosen is the
configuration that optimizes some [inaudible] metric of performance; for now just
assume that it is completion time.
The job of the Profiler in this context is to collect a profile from that same program,
possibly run on a different data set, different resources, or a different configuration. And
we actually support multiple modes here. One very commonly used case at companies
like Yahoo and Facebook, which are big users of Hadoop, is to run these so-called
periodic MapReduce jobs on their daily data, their hourly data, and things like that.
Facebook runs 1000 of these, I think that is the number I heard; Yahoo runs many more.
So you could collect a profile where d1 is yesterday's data, and we are trying to ask what
the best configuration is for today's data; often things don't change much from one to
the other. Another mode we support is that we can run a sample job to collect a profile,
so d1 will actually be a sample of d, and then you collect the profile and use that for
optimizing the [inaudible] in the job. The reason we can do this is that the What-if
Engine, given a profile from an actual execution, can answer questions for any different
d2, r2, and c2. And that is used by the optimizer: it searches through the space S I
mentioned to find a good configuration to recommend.
There are multiple different types of optimizers. We have focused on writing optimizers
that minimize completion time, but we could instead ask for some sort of robust
optimizer, one that ensures that no matter how the profile was collected -- maybe when
the cluster was heavily used, or collected on a [inaudible] cluster -- the recommendation
is still a robust strategy. There are different options here, but I am going to talk about
the completion-time optimizer.
Hopefully, this gives you a feel for the different types of challenges that arise. Let's
move along a little bit further. What is this profile I am talking about? A profile, as I
mentioned, is a concise description of a job's, a program's, execution. This is the sort of
picture I showed you earlier of an actual job execution; let's drill down again on this
slide. The profile captures information at the level of the individual phases of task
execution. The map task has these five phases: the read, the map, the collect, the spill,
and the merge. The map phase is where the actual map function is being invoked; the
rest is mostly framework code that runs. You can think of the profile as collecting two
types of information: one is data-flow information and the other is cost information.
Data flow is the same kind of thing you might be used to if you use a database system,
right? For data flow: how many records and bytes are actually flowing through, and we
can also collect some things specific to the MapReduce setting, like how many rounds of
merging are happening and how much spilling is happening, but everything specific to
data flow; if you change the resources, these things will not actually change. Cost is
more like how much time is being spent in each of these phases and tasks, and here I am
showing examples of the time spent in the read phase, the map phase, and so on.
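To give a feel for what those two kinds of information look like side by side, here is a
rough sketch of the fields a map-task profile might carry. The field names are
hypothetical, chosen only to illustrate the split between data flow and cost, and are not
the actual Starfish profile schema:

    public class MapTaskProfile {
      // Data-flow fields: record and byte counts per phase. If only the
      // cluster resources change, these stay the same.
      long mapInputRecords;
      long mapInputBytes;
      long mapOutputRecords;
      long mapOutputBytes;
      long spilledRecords;
      int  numberOfSpills;
      int  numberOfMergeRounds;

      // Cost fields: time spent in each phase of the task, in milliseconds.
      long readPhaseMillis;     // reading the split from the distributed file system
      long mapPhaseMillis;      // invoking the user's map function
      long collectPhaseMillis;  // serializing and partitioning into the output buffer
      long spillPhaseMillis;    // sorting, combining, compressing, writing spill files
      long mergePhaseMillis;    // merging spill files into the final map output
    }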
There is some more stuff coming; let me give you some intuition for it. This is a
screenshot I am showing from the Starfish optimizer. What you're seeing here is a
visualization of a profile collected for a MapReduce job with, I don't think all of you can
read it, about 326 map tasks and 27 reduce tasks. What you see in the sort of blue
rectangles is the running time; each shows the running of a map task. There are many
map tasks and some reduce tasks that actually run. So based on the profile, this is a
simple type of cost data that you can be collecting. You can also be collecting data-flow
data, so this again is a screenshot from the optimizer where we can visualize how much
data was actually sent from one map task to the reduce tasks, or from one node in the
cluster to other nodes, and there are some interesting things: you can play it like a video
of how the job kept going, what really happened, maybe there was a time when a lot of
data was getting sent. There is a demo coming up at VLDB at the end of the month;
please check it out.
So cost and data flow -- but quickly you realize that that was 300 map tasks and 26
reduce tasks; if we're dealing with terabytes of data, we are dealing with 10,000 map
tasks and 500 reduce tasks, and quickly that information can become overwhelming.
You have to go beyond that sort of visualization, and we probably want to show some
summary or some statistics about the data. So what I am showing here is the skew of the
map outputs. I know you can't read this because it's a screenshot from the visualizer, but
that is fine; you can just follow what I am saying to get the information. What this is
showing is that a bunch of map tasks send around 22 MB of data, some of them send
around 46 MB of data, and so on; this is actually a histogram. So maybe we can take all
of that raw data and start extracting some sensible information from it, because we get
these samples from every map task and every reduce task, and similarly for cost.
So what I am showing here is, for all the phases of map task execution and reduce task
execution, we have taken the [inaudible] job and aggregated it into one representative
map profile and one representative reduce profile. For this job, on average a map task
spends around 10 seconds in the read phase, 35 seconds in the map phase, and then there
are the spills, and similarly for the reduce tasks. So what we have done is take all of that
raw data and, starting from the top, extract sensible information and properties from the
data. And that is what completes the profile; in fact, these are the most important fields
of the profile that we are going to use as the inputs to the [inaudible] as it makes
decisions about what happens when the data size changes, or what happens if we go
from medium nodes to large nodes, or things like that.
As an example, take the selectivity of the map function: every map task reads some
input data and outputs some [inaudible] of the data, so we can come up with a selectivity
for every map task, and we have a distribution across all the map tasks. What the
previous slide was showing was taking all of that and representing it by a single value,
the average. But if you want some sort of robust optimizer and things like that, then you
probably want to take that entire distribution into account; all of that is preserved in the
job profile. It is an exact description of the entire job execution that can be used for
reasoning and [inaudible]. Where do these profiles come from? In Starfish they come in
two ways. One is by measurement, which is the one I was showing earlier, where the
job [inaudible] and we actually go and collect the profile information. The other, which
I will expand on a little bit later, is by estimation, and the biggest entity making that
estimation is the What-if Engine.
So let's get into how profiles can actually be collected by measurement, because this is a
challenge that we had to spend a significant amount of time on: we imposed two
practical restrictions on what we can do to collect the profile. The first restriction is that
we want to make no modifications to the actual Hadoop code. This is really for a
practical reason, in the sense that the more changes you propose to the code -- if you say
that to run Starfish you have to install these five patches -- people are probably not going
to go there. Rather, the approach we wanted to take was that you can use Starfish
without making any changes to the Hadoop code; Hadoop stays exactly the same. And
from the user's perspective, we don't want to tell users that you have to implement these
five things in your code before we can profile it: zero changes to user code.
How we manage to pull this off is by using a trick called dynamic instrumentation. The
idea is that there are some event-condition-action rules that are specified. The events I
am talking about here are events that occur during program execution, like entry into a
function, exit from a function, an exception, or a timer event, things like that. You can
have conditions associated with these events, and if the conditions hold, some actions get
taken: maybe recording all the bytes flowing through, or recording function parameters,
or the time it takes to complete a function, things like that. These event-condition-action
rules are themselves written in Java, because we are using a tool called BTrace. They
get compiled into bytecode and then dynamically added to the Hadoop code that actually
runs, like I showed in this particular animation.
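To show what such a rule looks like, here is a small BTrace-style script that times one
Hadoop method. The class and method names are taken from the Hadoop 1.x MapTask
source and are only meant to illustrate the idea; the actual Starfish rules are more
extensive than this sketch:

    import com.sun.btrace.annotations.BTrace;
    import com.sun.btrace.annotations.Kind;
    import com.sun.btrace.annotations.Location;
    import com.sun.btrace.annotations.OnMethod;
    import com.sun.btrace.annotations.TLS;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class SpillTimer {
      @TLS private static long start;

      // Event: entry into the map-side sort-and-spill; action: note the time.
      @OnMethod(clazz = "org.apache.hadoop.mapred.MapTask$MapOutputBuffer",
                method = "sortAndSpill",
                location = @Location(Kind.ENTRY))
      public static void onSpillEntry() {
        start = timeMillis();
      }

      // Event: return from the same method; action: record the elapsed time.
      @OnMethod(clazz = "org.apache.hadoop.mapred.MapTask$MapOutputBuffer",
                method = "sortAndSpill",
                location = @Location(Kind.RETURN))
      public static void onSpillReturn() {
        println(strcat("spill time (ms): ", str(timeMillis() - start)));
      }
    }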
They get in there, and when we want to take them off, they are gone. So the profiling
overhead when we are not turning this on is zero; when we turn it on there is some
overhead to collect the information, and when we take it off again it goes back to zero.
We use this tool called BTrace because it is open source, not because it's the best, and I
will actually show you some examples of why we think we might want to come up with
a slightly different tool to replace BTrace. But first I want to show you this pictorially
so that it makes sense. We start off with all of the tasks that you are running; Hadoop
being in Java, the way your tasks actually get run is that a JVM [inaudible] is associated
with each one, and the task runs in that JVM -- the map function does, and the sorting
and the rest of the framework, everything happens in that JVM. We write the
event-condition-action rules -- suppose we want to measure the time being spent in the
map function as opposed to the time being spent in the spills -- and the bytecodes
generated from our rules get put into the JVMs using some standard JVM
instrumentation APIs. We collect the raw profile data, and it goes through the sequence
of aggregations and [inaudible] property estimation to come up with the job profile.
When you see something like this, what you might be concerned about is overhead:
when I run the job with profiling turned on versus without, what is the overhead? It
turns out, and I will show you some real numbers, that profiling overhead can be
significant. There are two types of overhead that come in: compute cycles being eaten
up by the code we added, and there is also I/O. What I am going to show you here is
how we propose to deal with that problem. We support two modes of sampling for
profiling in Starfish. The first mode of sampling is that you can turn on profiling only for
a random subset of the tasks, maybe 10% of them, so the other tasks are not getting
profiled. And even better, we can actually run only the tasks we need to profile; this is
that sample-job notion I described, so we can run only a sample of the map tasks and
some number of reduce tasks.
Of course when you're using sampling you expect that the fidelity of the information
will be lower, and I am going to show you results that, for all of the workloads we ran,
around 10% task sampling was good enough to actually get to the best sort of
recommendation that we could get.
>>: You mentioned the purpose of profiling was to [inaudible]. But if that [inaudible]
Hadoop [inaudible] dependent on a particular set of function [inaudible].
>> Shivnath Babu: Great point. Yes, basically these event-condition-action rules are
dependent on the actual Hadoop function calls themselves, because we are saying: look,
when the spill function is entered, and this is a function with a specific name, then start
the timer, right? So it's true that our rules will break, and we will have to update those
rules, when the Hadoop code itself changes. But what we are really talking about is very
core Hadoop code, which over the last couple of releases has not had to change. But
that is a problem. So when I went and gave the same talk at Yahoo and showed them
what we can do with the extent of information that we bring into the profiles, they are
thinking maybe it's a good idea to put some of these hooks into Hadoop itself. I think
that is a path to actually making an impact via the other route; that is why we went this
route for now. Yes?
>>: [inaudible] information why didn't you just parse the logs?
>> Shivnath Babu: Great point. I showed the information in the profiles: the data-flow
statistics and the cost statistics, the counts and the times. Probably around half of that,
especially on the data-flow side, comes from the logs, and for what is in the logs we
don't add an extra step to collect it. But the timings and all of those things at the level of
individual task phases are not something we get from the logs, so that is what we had to
add.
Okay. So that was the Profiler: the sort of profiles that we have and how we collect
them. Now coming to the What-if Engine: given a profile for the particular program that
we want a good recommendation for, and given some hypothetical data properties,
cluster resources, or configuration, the goal of the What-if Engine is to estimate the
properties of that hypothetical job execution without actually running it. That consists of
two parts. One is what we call the job oracle, which can take this information and, with
a set of models and simulations that I am going to show you, come up with a virtual job
profile for that job as it would run. Ideally this virtual job profile should be exactly the
one we would have collected if the job -- that p, d2, r2, c2 -- had actually been run and
we had measured it using our measurement-based approach. Yes?
>>: [inaudible] just one [inaudible] job [inaudible] the next one?
>> Shivnath Babu: Great. That is basically something we are doing: one of the things
we are pushing for is continuous profiling. Every time a job gets run on the cluster, we
will be measuring 10% or some small portion of the tasks, and we're going to put all of
that into some sort of profile warehouse, in Hadoop itself. And then this input is actually
being chosen by Starfish itself: for every job we have a tagged profile, and if you want to
make a what-if call or an optimization call, you give the tag for it. But it's a very
interesting problem how to keep a whole bunch of profiles and, say, find a profile where
the data properties were very similar to what the data properties are now -- then it makes
sense to give that one -- or give a whole bunch of profiles, which might lead us to
making some changes here. So that is something to think about. We aren't there yet;
that is part of the future work that I will be telling you about. Yes?
>>: In the What-if Engine you never change p, but the [inaudible] consists of one
MapReduce job. If you choose the map key or partition key, your performance can
change by an order of magnitude. So how are you going to contend with that?
>> Shivnath Babu: Great. I must clarify that, because I want it to be end to end, I am
currently talking about an individual job. This is like version 0.1 of Starfish; we now
support profiling and what-if analysis and everything on whole workflows, and what will
happen is that each of the jobs can get a different configuration. So there is an
optimization layer that will do those workflow optimizations, and eventually I will show
you a slide on that, where we can choose different partitioning keys, [inaudible] jobs, and
then make the appropriate calls to [inaudible]. We do that, but this part is very specific
to a single job. The same ideas apply [inaudible]; only now we are talking about the
workflow level, larger optimization spaces and more different ways to generate those
programs and make the calls to the What-if Engine. Did that answer your question?
>>: Kind of. Because even if you find an optimum for these individual jobs, you can
point to [inaudible].
>> Shivnath Babu: We are dealing with that problem. I am not actually showing it in
the slides, but the short answer is: remember when I showed the slide on the job and the
data and those sorts of things, the Starfish architecture picture where we had the Profiler,
the What-if Engine, and the layer of different optimizers? What I described was the
optimizer for a single job, but we also have a workflow optimizer, and that workflow
optimizer deals with those things. It picks different configurations for different jobs, and
it looks at two jobs together along with the data in between -- that is the unit at which
it's looking -- and makes choices at that level.
>>: [inaudible] you showed the data to be skewed based on [inaudible]. Even within a
single job you can choose different partitions, which means you are changing....
>> Shivnath Babu: Right. In this example I am assuming that the MapReduce program
is fixed, so the partitioning keys are also fixed. We do have an optimizer that can deal
with that; one concrete example we handle is where the partition key is actually a and b
together. I can get that in multiple ways: I can partition on a and then sort the partitions
on b, which will bring the ab combinations together, and there are many logical choices
there. Here I am really talking about the physical choices: the program is the same. But
I will be happy to talk about this offline; it is something we are actually working on.
Once we start moving up the stack to the Pig level and the workflow level, these are
things you have to deal with.
So the main task is to come up with this virtual profile, and once we have the virtual
profile, we have chosen to decouple estimating the actual properties, like completion
time and resource utilization and things like that, because maybe we are dealing with
many jobs and have a throughput focus; so we decouple these two steps. I am going to
elaborate a little bit on how the virtual profiles actually get computed. Again, the job is:
we are given a d2, r2, c2, and we have to estimate the virtual profile for that. On this one
slide I am going to show what is really a couple of slides' worth of detail, and hopefully
the earlier questions get resolved. There are a whole bunch of details here; it actually
took us a major part of a year to come up with this. The problem is: we are given a
profile for the job j -- and as I said, it's really the data-flow statistics and the cost
statistics that we use; the rest we use primarily for visualization and things like that --
and, given the properties of the data, the resources, and the configuration, we have to
estimate the corresponding fields in the virtual profile.
First, for the data-flow statistics, there is not too much innovation that we have been able
to come up with here; in databases we have focused a lot on this part. The challenge that
arises in the MapReduce setting is that sometimes you just don't have the information:
you don't know what p is; you don't know whether it's actually a join that is returned by
some [inaudible]; we don't know that. So in those contexts we use simple assumptions,
like a dataflow-proportional-to-input assumption: the dataflow is roughly proportional to
the input data. If you have more information you can do better, and you can have more
information as you go up the stack; in the work that we're doing with the Pig folks at
Yahoo, we are adding statistics and coming up with a way to estimate the data-flow
statistics. So I don't have too much to say there beyond the standard stuff. Where we do
have some interesting stuff to say is in coming up with the cost statistics: the cost
statistics are things like the time for doing I/O, or the time for shuffling data over the
network per [inaudible] record. That is what goes into the cost statistics, and it is
primarily determined by the cluster resources, like whether these are c1.medium nodes
or m1.large nodes. We tried more white-box sorts of models, but we have come up with
a different technique that seems to work best. We call it relative black-box models, and
the idea is pretty simple. The high-level problem is: if you give me an observation of
how a job actually runs on one cluster, you are then going to ask me how it will run on a
different cluster. The way we are going to do it is to build a relative model. We take a
whole suite, a benchmark, of jobs -- and there is a lot of intuition about what has to go
into how you pick the benchmark -- run it on both clusters, maybe a test cluster and a
production cluster, or it might be the medium nodes and the large nodes. We run that,
we collect training data, and we build a model [inaudible]. That is not going to be the
main point, because frankly if you build any type of reasonable regression sort of model,
that is good enough. What is more interesting is how you pick these jobs.
Unfortunately I don't have the time to go into it, but this work is going to be presented at the
ACM [inaudible] computing conference at the end of October. The main idea is you
can think of taking a sample of jobs from the workload, or you can think of hand-picking
jobs, like a sort, which is heavily I/O intensive, or a text-analytics job that is mostly CPU
intensive. That is one approach you could go with. We have come up with a better way, which
involves running a synthetic job which internally will create different sorts of resource-usage
patterns, so you can come up with this training data very efficiently, without any understanding
of what the actual workload on the cluster is. That is actually an ambitious statement,
but we found we are able to pull off something like that.
Once we have that, given the profile on a source cluster, I can predict for the
target cluster. That is the idea of this relative black-box model.
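A minimal sketch of that relative black-box idea, assuming an ordinary linear regression as the model family (the talk only says any reasonable regression model is good enough); the cost-statistic values here are invented for illustration:

    # Relative black-box model sketch: map cost statistics measured on a source
    # cluster to predicted cost statistics on a target cluster. Values are made up.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # One row per benchmark job: [io_cost, cpu_cost, network_cost] on the source cluster...
    source = np.array([
        [1.2e-8, 4.0e-7, 2.1e-6],
        [0.9e-8, 6.5e-7, 1.8e-6],
        [1.5e-8, 3.2e-7, 2.4e-6],
        [1.1e-8, 5.1e-7, 2.0e-6],
    ])
    # ...and the same statistics for the same jobs measured on the target cluster.
    target = np.array([
        [0.8e-8, 2.5e-7, 1.6e-6],
        [0.6e-8, 4.0e-7, 1.4e-6],
        [1.0e-8, 2.0e-7, 1.9e-6],
        [0.7e-8, 3.1e-7, 1.5e-6],
    ])

    model = LinearRegression().fit(source, target)

    # Given a new job profiled only on the source cluster, predict its cost
    # statistics on the target cluster.
    new_job = np.array([[1.3e-8, 4.4e-7, 2.2e-6]])
    print(model.predict(new_job))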
So that was the story for the cost statistics, and then we have actually done a whole bunch
of studying and modeling of MapReduce execution to take the dataflow statistics and
generate the actual dataflow. The actual dataflow, like the number of spills that happen and
the number of bytes that actually go through, depends on the configuration, like
the I/O buffer sizes and things like that. Right now these are
all models very specific to Hadoop, because Hadoop is our underlying execution engine;
if the execution engine changes something, we have to update them. This
might seem intuitive in retrospect, but what has taken us a lot of time is
finding the right modeling technique for each of these profile fields.
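As one concrete (and deliberately oversimplified) example of what such a Hadoop-specific model does, here is a sketch that estimates the number of map-side spills from the sort-buffer configuration (io.sort.mb and io.sort.spill.percent in Hadoop 0.20); it ignores record metadata and other details, so treat it as the shape of such a model, not Starfish's actual one:

    import math

    def estimate_map_spills(map_output_bytes, conf):
        """Very simplified: roughly one spill each time the sort buffer fills."""
        buffer_bytes = conf["io.sort.mb"] * 1024 * 1024 * conf["io.sort.spill.percent"]
        return max(1, math.ceil(map_output_bytes / buffer_bytes))

    conf = {"io.sort.mb": 100, "io.sort.spill.percent": 0.80}
    # Halving io.sort.mb roughly doubles the number of spills (and the spill I/O).
    print(estimate_map_spills(600 * 1024 * 1024, conf))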
So that was the profiling and a very quick overview of the What-if analysis. Now the job
optimizer: given the new data, the resources, and the program, it has to find the best
configuration. This is basically a search in a high-dimensional space. There are 20 or 25
parameters in Hadoop that can have a major impact. We currently work with 14 of them,
which we think are the most important. You could potentially grow that to a larger number,
but we are focused on the ones that are very specific to jobs, as opposed to tuning parameters
of the cluster, like the number of threads for reading from the file system and things like
that. So this is basically a search problem, and a naive way of doing it, applying a simple
technique like a grid search and then making calls to the What-if Engine based on that, is going
to take a long time to run. There are two insights we have used to bring optimization times down
to the order of seconds or even smaller. The first is to exploit the intuition that we can take
all of these parameters and partition them into subspaces: we can come up with the optimal
configuration for one set of parameters, and it does not change for whatever setting you have
for the other parameters. The intuition basically comes from the fact that MapReduce execution
goes in phases, where the map tasks run, then there is a shuffle, and then the reduce tasks run.
We can come up with an almost optimal configuration for the map tasks which remains
fine for whatever configuration you come up with for the reduce tasks as well. That is the
first part, and then there is a similar search technique that works within each
subspace. We basically use one technique called recursive random search, which I would be
happy to talk about off-line. We have tried a whole bunch of them; not much new
engineering and not many new insights there.
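For concreteness, here is a toy version of that search strategy: one parameter subspace (map-side settings) optimized with a recursive-random-search-style loop, where the cost function stands in for a call to the What-if Engine. The parameter ranges and the cost function are placeholders, not the ones Starfish uses.

    import random

    def recursive_random_search(cost_fn, bounds, samples=20, rounds=4, shrink=0.5):
        """Toy recursive random search: sample uniformly, then re-sample inside a
        shrinking box around the best point found so far."""
        best_x, best_cost = None, float("inf")
        box = dict(bounds)
        for _ in range(rounds):
            for _ in range(samples):
                x = {p: random.uniform(lo, hi) for p, (lo, hi) in box.items()}
                c = cost_fn(x)                       # would be a What-if Engine call
                if c < best_cost:
                    best_x, best_cost = x, c
            box = {p: (max(lo, best_x[p] - (hi - lo) * shrink / 2),
                       min(hi, best_x[p] + (hi - lo) * shrink / 2))
                   for p, (lo, hi) in box.items()}
        return best_x, best_cost

    # Map-side subspace, optimized independently of the reduce-side subspace.
    map_side = {"io.sort.mb": (50, 500), "io.sort.spill.percent": (0.5, 0.95)}
    toy_cost = lambda x: abs(x["io.sort.mb"] - 120) + abs(x["io.sort.spill.percent"] - 0.8)
    print(recursive_random_search(toy_cost, map_side))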
So that brings me to the evaluation. We have a system that we have actually been releasing:
Starfish version 0.1 was released early this year. We have a 0.2 that is going to be released;
it is built already, but we are debugging it and working with the Yahoo folks to get some
insights from their big clusters before we actually release it. And there is a version 0.3 to
support Pig Latin that is coming sometime later this year. In evaluating a system like Starfish,
to come up with some of the insights that we have, things like showing that Starfish actually
performs much better than any sort of rule-based technique you can come up with, the
experimental evaluation involves making a choice of which cluster you're going to run it on.
And on Amazon EC2 there are so many
choices that we have. What sort of workload are you going to use and what sort of data
are you going to use? I am going to actually focus on two kinds of experiments, one with
c1.medium nodes, and I want to emphasize these are nodes where every node has
around five EC2 compute units: there are two cores, each core having
2.5 compute units. I am also going to show you results on beefier
nodes; these are the nodes Amazon offers for high-performance
computing, and each node has about 33.5 compute units. Those last few numbers relate to what I
explained earlier on about how you configure Hadoop: every task slot is configured with
a maximum amount of memory that the task can use, and that becomes the JVM heap
size setting.
So, the workloads and data: we have been trying to choose workloads, or jobs, from
different domains, like natural language parsing, text analytics, graph processing, and things
like that. And we work with different amounts of data as well. Thirty gigs is basically
one data set size that we work with; I will also show you some results with one terabyte of
data. In these plots the x-axis is showing different jobs: Word Co-occurrence, WordCount,
TeraSort, LinkGraph, and Join. Of the three sorts of bars you see here, the blue bar shows the
time with the default settings that Hadoop comes up with for these jobs. The data sets we are
using are around 30 gigs, and the cluster we are using is 15 c1.medium nodes, which is like 75
EC2 compute units.
The default is basically the blue one, and what the other bars show is the speedup over
the default. For the rule-based optimizer: there is no real rule-based optimizer for Hadoop, so
what we have done, the way people do these things today, is to use some rules of thumb.
Cloudera and the other companies that compete in the Hadoop space have actually come up with
rules of thumb; Yahoo has come up with rules of thumb. We have
taken those and, as you saw in the example I gave, often they are guidelines: they
say if you are actually doing a job that has these [inaudible], set this parameter to this value.
We have come up with some numbers based on them; most of them involve memory, so we
set those numbers based on the Java heap size settings, the maximum Java heap size
setting and things like that. The main reason for including the rule-based optimizer is that I
want to give you a feel for what cost-based optimization can actually do.
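To give a flavor of what such rules of thumb look like, here is an illustrative sketch; the specific parameters are real Hadoop 0.20 settings, but the fractions are examples of the kind of published guidance, not the exact rules used in these experiments:

    # Illustrative rule-of-thumb settings derived from the task's JVM heap size;
    # the fractions are examples, not the exact rules used in the talk.
    def rule_of_thumb_conf(task_heap_mb, reduce_slots_in_cluster):
        return {
            "io.sort.mb": int(task_heap_mb * 0.25),             # memory-based rule
            "io.sort.spill.percent": 0.80,
            "mapred.job.shuffle.input.buffer.percent": 0.70,
            "mapred.reduce.tasks": int(reduce_slots_in_cluster * 0.9),
            "mapred.compress.map.output": True,                 # compress when shuffling a lot
        }

    print(rule_of_thumb_conf(task_heap_mb=400, reduce_slots_in_cluster=30))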
And the green bar here is basically: we run the job using the rule-based optimizer setting, take
the profile from that, feed it into Starfish, and it comes up with a suggestion. Then we
run the job with that particular configuration setting. As you can see here, the default is
actually really bad, and at terabyte scale it would not even have been reasonable to show you
those defaults. Against the rule-based setting, the real surprise here is that the cost-based
optimizer is actually able to come up with settings roughly twice as good as the rule-based
setting, which is kind of surprising to us; we didn't really expect that. So why was that
happening?
I am going to show you some insights based on the actual performance, and this will also
answer the question that came up earlier. This is for the WordCount job from
the previous slide. Setting A is the configuration setting given by the rule-based
optimizer; B is the setting given by the cost-based optimizer, which as you saw was about twice
as fast. What we are visualizing here is the profile from those runs: these are the map
profiles and these are the reduce profiles. You can actually see that for the map tasks,
the rule-based setting took around 330 seconds, and with the cost-based setting the
time was reduced by half. So this is where the benefit comes from, and interestingly, for the
reducers the running time is higher. But notice the scale of these axes: the reducers are
running in maybe 10 or 12 seconds, while the maps take like 5 minutes to run. When we
explored further, what was happening was that the rule-based setting, and again this is the same
stuff I showed you earlier, basically said: have a large [inaudible], minimize spills, use the
combiner, and all of these things. What ends up happening on those c1.medium nodes is that this
created a big bottleneck in the CPU. We actually saw that based on the profile that had been
collected. Essentially we came up with a smaller setting, and when you reduce the size of the
[inaudible], okay, the buffer, you're going to write more data out and shuffle more data, but
that turned out to be the smaller problem here. So these are the sorts of settings that a
cost-based approach can come up with, which sometimes even humans, and especially humans who
have no real understanding of MapReduce execution, cannot.
For the next one we are showing results for much larger data sets, and correspondingly we
moved to the beefier nodes with more compute units. Yes?
>>: On that previous slide how many variables of those 20 some odd ones do you
actually end up changing to produce this configuration?
>> Shivnath Babu: I think probably all of them. It is actually searching in that space
and outputting the best setting that it found. If you compare the settings one to one,
probably many of them look different, but that is not the way in which the cost-based optimizer
is thinking: it is going into this space and finding the best one. You can try this out. In the
paper that is going to be presented at VLDB at the end of the month, we actually show the
actual settings for each of these. I think we actually show it for Word Co-occurrence, which is
even more interesting, but it was harder to explain; I think this one is clearer.
>>: [inaudible] it looks like you reduced the buffer size used for sorting. How is
the network traffic affected by this? It seems like it would make the shuffle phase longer and
more intense.
>> Shivnath Babu: Shuffles took much more time; however, we still have the combiner.
So really I think that is a very good insight. Even if you forget MapReduce and
think of external two-phase sorting in database systems, there are two kinds of
phases that happen. One is reading a whole chunk of data, sorting it using an n
log n in-memory strategy, and writing it out; the other is merging, really just an ordered
merge. That is one strategy. Now, when we change the buffer size, what we are really
playing with is that we reduce the work done in that n log n part and then we do
more in the merging. And it turns out that in certain cases making that trade-off,
because you better utilize the resources, can actually make things better. And
clearly here the [inaudible] has not been taken out; the [inaudible] is still there. So it's
just that you are shuffling more data, but in the grand scheme of things that is just a small
fraction of the cost.
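A toy cost model of that trade-off (entirely illustrative formulas and numbers): a smaller buffer means less n log n in-memory work per run but more runs and more merge passes, and for some inputs the smaller buffer can still win overall.

    import math

    def two_phase_sort_cost(data_mb, buffer_mb, merge_factor=10):
        """Toy external-sort cost: in-memory n log n runs plus merge passes."""
        runs = math.ceil(data_mb / buffer_mb)
        in_memory = data_mb * math.log2(max(2, buffer_mb))         # sort each run
        passes = max(1, math.ceil(math.log(max(runs, 2), merge_factor)))
        return in_memory + data_mb * passes                        # re-read/write per pass

    for buffer_mb in (50, 100, 200, 400):
        print(buffer_mb, round(two_phase_sort_cost(600, buffer_mb)))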
>>: Back to my point: the network utilization of the links would have gone up because of the
change in strategy.
>> Shivnath Babu: It would have gone up, but here, for things like
WordCount, the amount of data that is actually being shuffled, just based on these numbers, you
can see is small; the reducers can run really fast because they are processing much less data.
So I would not expect the network [inaudible] traffic to increase significantly. However, that
is another metric that you might actually want to use for optimization when it comes to large
jobs; here I am focusing on [inaudible].
>>: What version is it that you are using?
>> Shivnath Babu: It's 0.20.2. We now also support 0.20.203, which is the version that Yahoo
is actually using. So we support these two versions. Okay. So, terabyte-scale data and
a much beefier cluster. Here the results are much more as expected, though there are
some surprises. Here I have taken out the default; it's just rule-based and us. In some
cases we are actually improving, but there are two interesting cases, Word Co-occurrence
and LinkGraph, where the speedup, and lower is worse here, was actually worse,
so let me talk about Word Co-occurrence. Word Co-occurrence is a
program where you have order-n input data and the shuffled data is order n squared. It
turned out that the best setting our optimizer actually managed to find for this doesn't use the
combiner and doesn't turn on map-side compression, or actually any kind of compression
for that matter.
What the rules of thumb said was: turn on the combiner and turn on compression,
because a lot of data is actually being sent, and that is probably somewhere between the worst
and the best here. Because of some issues in the profiling, which we
are trying to fix, in how we profile when there is combining and compression,
our optimizer mistakenly thought that not having the combiner and not having the
compression was the best. So it was, in fact, about twice worse than the
optimal setting it could have found. Things are not perfect, and the reason for that,
we think, because we haven't fixed it yet, is the profiling we do using BTrace.
BTrace actually looks very good on paper, but it turns out it has significant overhead
because it's the sort of thing that people actually work with on one main
[inaudible] and whatnot. It is not a commercial-grade tracing tool, which we
think is the problem, but that is what we have now.
What I am showing on the x-axis, and I am varying this for Word Co-occurrence and
[inaudible], is the [inaudible] of tasks that are being profiled. Remember, Starfish can
actually choose to profile only a fraction of the tasks. And the y-axis shows the slowdown with
profiling versus no profiling. You can see that one hundred percent profiling slows down the
job around 30%. There are some interesting nonlinear behaviors which I can expand on,
some really interesting stuff in profiling, but you can see that if we're going to
live with 5 to 10% profiling with BTrace, that is going to slow things down only 5 to 10%,
and with 1% there is hardly any slowdown. And what we did is use those
profiles in these cases, feed them into the Starfish optimizer and What-if Engine, see what
setting it comes up with, and run it to see what speedup we can get against rule-based.
So even with just 1% profiling we are able to give a better suggestion than the rule-based
one, by around 1.5x. This chart shows the speedup. Around 10% you're actually
seeing about the best we can get. It could be that we are not finding the real
best one, or it could be something else, but at least in our context 10% profiling is able to get
to the best, and 10% profiling has about 10% overhead. I truly believe that there are
alternatives to BTrace. There are companies in the Java profiling space, and I have seen
benchmarking results that show a tool that reduces the overhead of profiling relative to BTrace
by around 600x, so maybe that will take out the problem, but it's commercial, so we will
have to do some things to try it out. But that is the story on profiling.
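The task-sampling idea itself is simple; here is a minimal sketch. The wrapper and names are invented for illustration, and the real mechanism hooks a dynamic-instrumentation agent such as BTrace into Hadoop tasks rather than wrapping a Python function.

    import random

    PROFILE_FRACTION = 0.05   # profile roughly 5% of tasks

    def run_task(task_fn, *args):
        """Each task independently decides whether to run under the profiler,
        so most tasks pay no instrumentation overhead."""
        if random.random() < PROFILE_FRACTION:
            return run_with_profiler(task_fn, *args)
        return task_fn(*args)

    def run_with_profiler(task_fn, *args):
        # stand-in for attaching a BTrace-style instrumentation agent
        return task_fn(*args)

    print(run_task(lambda x: x * 2, 21))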
Now quickly moving on to the What-if Engine. This is the slide I showed you earlier; this was
actually for Word Co-occurrence with all of these hills and valleys and whatnot. By
doing one run, getting the profile, and putting it into the What-if Engine, this is the best
surface I have been able to come up with. The main thing to notice is that it's able to capture
the trends, the hills and valleys, reasonably well. The surface is actually shifted up slightly
because we are measuring function calls and times and those sorts
of things, and BTrace has some overhead. So these are all estimates, but the trend
remains the same. The important thing for finding a good setting is the fact that you
preserve those trends; the shape has to remain the same. And so there is work to
be done to improve things over here.
So coming to the other problem that I talked about, cluster sizing: remember I showed you this
picture earlier on for the different types of nodes, and the
blue bars were what I showed you earlier. There are interesting trade-offs to be made
between cost and time and things like that, and we have the system
[inaudible], which is really a decision engine that is making the appropriate calls to the
profiling and the What-if Engine. Here, remember, we collect the relative black-box models
and those sorts of things. We collect a profile on the m1.large cluster, one run, and then
you can basically estimate how things will actually perform for different types of nodes,
different numbers of nodes, different Hadoop configurations on those nodes, and things like
that. Again I am sure there is room for improvement, but if you are able to capture the
trade-offs we have here, then for instance we should be able to say that if you are
willing to trade off on the completion time but you want to minimize cost, maybe you
actually want to go with the medium or the extra-large, or the c1.medium in this
case.
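In other words, the cluster-sizing question can be posed as a small enumeration over candidate clusters, with the What-if machinery supplying predicted completion times. The sketch below uses made-up prices, node specs, and a placeholder prediction function, so it only shows the shape of the decision, not Starfish's actual logic.

    # Illustrative only: prices, speed factors, and the prediction are placeholders.
    NODE_TYPES = {"m1.large": (0.34, 1.0), "c1.medium": (0.17, 0.7), "c1.xlarge": (0.68, 2.2)}

    def predicted_hours(node_type, n_nodes, base_hours=4.0, base_nodes=10):
        """Stand-in for a What-if prediction built on relative black-box models."""
        _, speed = NODE_TYPES[node_type]
        return base_hours * base_nodes / (n_nodes * speed)

    def cheapest_cluster(deadline_hours):
        best = None
        for node_type, (price, _) in NODE_TYPES.items():
            for n in (5, 10, 20, 40):
                hours = predicted_hours(node_type, n)
                cost = hours * n * price
                if hours <= deadline_hours and (best is None or cost < best[0]):
                    best = (cost, node_type, n, hours)
        return best

    print(cheapest_cluster(deadline_hours=3.0))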
This is one nice use case that we are able to support. If you have a program that is going to
run on big data, people don't just go out, take the program, and unleash it on a big cluster.
They often try it on small data first, because if you get things wrong on a big cluster, many
other users might become mad at you for what you have done; you may even have brought down
clusters, like what happened at Yahoo. So what we support
is essentially that those two clusters could be a test cluster and a production cluster. This is
just a small-scale version of what we want to do, but imagine you had a test
cluster which held some sample of the data, maybe even the full data. Here we had a four
node test cluster and a 16 node production cluster. So what we are
doing is going to the test cluster, collecting profiles for the job there, and
then making a prediction for the production cluster and running the job with that suggested
configuration on the production cluster. You can see the rule-based bar, and here I am showing
real times, so lower is better. This is the rule-based setting. The other one is what you
can do if you actually manage to collect the profile on the production cluster itself: the green
one uses the profile from the test cluster and the red one uses the profile from the
production cluster.
Both were able to do a reasonable job. And this is, I think, a very important use case for
companies like Yahoo and Facebook, and I am sure for everybody, because nobody
likes to turn on profiling on the production cluster, and many people are very
finicky about those production clusters, because their job is probably on the line if
something goes wrong with them.
So that was a very quick overview of Starfish. Now what I would like to do is take a
couple more minutes to give you a very quick feel for some of the ongoing stuff; shall I do
that? Okay. A quick reminder of the Starfish architecture: in each and every one of these
components there are some interesting problems, at
least three more years of work to be done. One of the things that we are focusing on a lot
now is the workflow level. We started off with the focus on jobs because we realized that
unless you have a good idea of the low level of execution, you probably can't do anything
smart at the higher levels, where things could go wrong. So, workflow optimization;
let me illustrate, using one slide, some of the things you could potentially do. This
is like an abstraction of the MapReduce workflow for term
frequency and inverse document frequency. This is used in information retrieval, so it's natural
for any user to write it in terms of maybe three or four MapReduce jobs.
There is some map function in each one; D and W here are like doc ID and word ID. It's easy for
users to specify these long, thin workflows, but performance is often much better if you can
actually get shorter and much fatter workflows. That's why we call this
project Stubby; we actually want stubby workflows. So, the things we can do:
one thing is basically to take two MapReduce jobs and put them together. That has to
do with how you choose the partitioning function; if you choose the partitioning
function properly, then you can literally pipeline the records being output by a
reduce to exactly the next map and [inaudible] the MapReduce, because the properties
that the partitioning has to enforce are already enforced within the single job. Note that the
optimizer is not removing any functions, because these are black-box functions; it probably
doesn't know what they actually do. So we take those two MapReduce jobs and
run them as one single MapReduce job. Similarly you can collapse things
horizontally, with scan sharing and shuffle sharing and things like that.
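Here is a rough sketch of the kind of check behind that vertical collapsing; the job representation and the condition are simplifications invented for illustration, not Stubby's actual rules:

    # Illustrative only: the job representation and merge condition are made up.
    def can_pack_vertically(job1, job2):
        """In this simplified view, job2 can be folded into job1 when job2's map is
        per-record (needs no grouping first) and job2 partitions on a key that
        job1's output is already partitioned on, so no extra shuffle is needed."""
        return job2["map_is_per_record"] and job2["partition_key"] == job1["partition_key"]

    tf = {"name": "term-frequency", "partition_key": ("doc_id", "word_id")}
    norm = {"name": "normalize", "partition_key": ("doc_id", "word_id"),
            "map_is_per_record": True}
    print(can_pack_vertically(tf, norm))   # True -> run the two as one MapReduce job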
And then, of course, all of the job-level parameter selection, all of those choices, remain.
So this is still at the level of black-box MapReduce workflows, with some properties
assumed about the MapReduce API itself. Then you can do even better things,
like logical optimization, if you move up to the Scope or the Pig or the
Hive layers. Those are things that we are actually focusing on, especially in some
collaboration with the people at Yahoo.
So this is basically the wish list, indicated in red, and what we are actually doing is
probably going to come out by the end of the year or maybe sooner. For workflows we already have
the profiling implemented; optimization is
something that we are focusing on. Robust, adaptive tuning looks at questions like: if I
collect the profile while the cluster was heavily loaded, is that configuration still the
best when the cluster is lightly loaded, or vice versa? Those are things we are actually trying
to study. It's hard for us to study that in isolation running things on Amazon EC2. We
really need help from someone like Yahoo, or hopefully even Microsoft if you guys are
interested in something like this.
Lowering the profiling overhead remains high on our list, although I think it's more
engineering than research, and perhaps the commercial tools have
already solved that problem. There are new ideas around cluster sizing that we
are thinking about: [inaudible], [inaudible]-based sizing is definitely something
we want to do, but it is still a little bit away. The other thing we
are really interested in, especially for linear-algebra-type workloads on the
cloud, is that nobody is telling you that you have to run a Hadoop cluster, or even one
single Hadoop cluster, on the same type of node. Instead of homogeneous nodes you could actually
choose to use some smaller [inaudible] nodes to run through a lot
of data, to parse the text and generate the [inaudible] and those sorts of things, and then
probably use beefier nodes with more memory to crunch through those matrices, and even
nodes with GPUs; Amazon now gives you nodes with GPUs. So there is an interesting
challenge to be solved there; we are actually looking at that.
At the workload management layer, we are looking at those data life-cycle workloads that
involve computation on the data serving system, log aggregation, and the backend analytics.
And there is a point that was brought up earlier:
with the folks at Waterloo we are thinking about turning on profiling all the time,
collecting a warehouse of these profiles, and then applying
mining and things like that to them, like picking the right
profiles, the [inaudible] profiles, to input to the What-if Engine. That is it. Thank you.
So this is a quick summary: the MADDER principles and Starfish self-tuning; we want to bring
both together. There are some techniques we have come up with, and some that are borrowed from
other fields. Version 0.1 you can download and use. Version 0.2 is really right there, probably
the end of the month; I really want to have an announcement for VLDB, so let's see if that
happens. Version 0.3 is probably with the Pig Latin and the [inaudible] support. We actually
support Pig 0.9, which is not officially released yet. That might happen in October, so we might
delay until then. That is the story on Starfish. Thank you. [applause].
>>: [inaudible] the one that was on the previous one. How do you explain that the
[inaudible] is small and the…
>> Shivnath Babu: These are like really beefy nodes, so the point is that on these nodes the
[inaudible] are thinking about the [inaudible], right? And most of the rules of thumb are
actually based on what people observed on Yahoo's clusters and what they observed on
Cloudera's clusters. When we go to these beefier nodes, some of those [inaudible]
assumptions didn't actually match anymore. So it just means that rule-based
thinking works in contexts, especially when we are dealing with large data and
enough CPU cycles, where those [inaudible] assumptions actually become valid. So rules of
thumb are--and my point is not to say we have something that is always much better
than rules of thumb. Somebody who understands everything really well will be able to
come up with good settings; our hope is that there are many contexts where the
[inaudible], right, they have no idea how to do these things. The hope of the [inaudible] is to
bridge that gap, at least to the level of the job parameters. Yes?
>>: [inaudible] does your optimizer optimize [inaudible]?
>> Shivnath Babu: We are only at the job level; we are only dealing with the jobs. But
at the workflow level we are actually dealing with properties of the data. In fact, we are
even looking at how you could co-locate different data
sets so that a join can actually become faster. So when we bring in partitioning, we are
looking at the logical aspect of partitioning along with how to [inaudible]. All of that we are
looking at there, but the jobs here did not include that; the data is given to us.
>>: [inaudible] comparison that shows [inaudible] a lot of those rules at the bottom
seemed like they are agnostic to which jobs they are running. Did you use the same rules
of thumb for every job, and then did you use different cost-based optimizer settings for
each job, so that the cost-based optimizer got the [inaudible] while they all got the same
configurations?
>> Shivnath Babu: So there are two questions. The first is that the rule-of-thumb settings seem
to be agnostic to the job. That is not true, because the rules of thumb I showed you look at
whether the map [inaudible] has memory-intensive operations and whether there is a combiner;
they are not agnostic, and the rule-of-thumb settings are different for different jobs. The
second question was whether the cost-based [inaudible] remains the same: we collect the profile
based on the rule-of-thumb settings, feed it in, and see what comes out.
[applause].