>> Vivek R. Narasayya: Hi, everyone. It's my pleasure to introduce Herodotos Herodotou, who's a Ph.D. student who's wrapping up his Ph.D. at Duke University. Hero has done some really good work in the area of query optimization and tuning SQL queries, and now more recently on tuning MapReduce programs. So today he's going to talk about his work on a self-tuning system for big data analytics.
>> Herodotos Herodotou: Thank you. And thank you, everyone, for coming.
So today I'll be talking about Starfish. That's a project that we started at Duke, I guess I started, about two years ago. Since then it has grown. I'm now leading a team of four or five students that are actually working on enhancing Starfish in different directions, and we've even started exploring some opportunities and collaborations with some other universities. So clearly it's a very big project, but for today's talk I'm only going to focus on the work that I have done related to Starfish.
It's a very good area -- it's a very good time to be working in the area of data analytics. I've been to all the talks over the last six months or so with people from architecture, systems, machine learning, and they're all motivating their work with how the data sizes just keep growing and growing exponentially. And based on some IDC reports, a very recent report, the amount of data that's going to get managed in the various data warehouses over the next decade is going to grow
50 times, and the amount of hardware itself, the cluster sizes, will also grow 10 times. So data sizes are really, really growing.
However, the number of people, the IT staff, the administrators who are actually managing and tuning these particular clusters, is only going to grow slightly, by 1.5x. So clearly, you know, roughly the same people that we have today are going to be asked to manage and tune much, much larger clusters and a lot, lot more data than they do today.
So it's very timely that we are considering ways of automating the process of managing and tuning the cluster, making it very easy for the users to actually tune the clusters. And this is exactly where Starfish comes into play.
The idea behind Starfish is to provide this ease of management and automatic good performance out of a system for big data analytics.
Now, when I started working on Starfish a couple years back I wanted to figure out and really think about what are the kinds of features that data practitioners would like or even require from a system for big data analytics. Users would like a system to be magnetic, that is, they would like to be able to just simply load their data in whatever format -- structured, semi-structured, unstructured data -- and the system should be able to handle that, and they should be able to analyze any kind of data.
Of course, the system also should be agile, right? Data changes. The type of analysis the users want to run on top of those data also changes, so the system should be able to handle changes very easily.
The type of analysis that people would like to do can range from very simple to very, very complex. You might want to do some machine learning, some statistical analysis, some text analytics, et cetera, and, of course, different people for different analyses would like to use different kinds of interfaces. And, of course, those kinds of interfaces usually make more sense for the task. If you're going to do some statistical analysis, you might want to use something like R, which is a statistical software package. Or if you're going to do some business analytics, then your users are more comfortable with something like SQL. So it would be nice if the system can actually support all these different types of interfaces.
Data lifecycle awareness is also another very important aspect today. People are starting to put together these big systems that consist of various components. We have the user-facing systems that are serving the users, then you have some log aggregation systems, and then finally some analysis systems, so the data will flow through the whole pipeline. The system should be aware of that pipeline to make sure that everything goes smoothly and the system behaves with good performance.

Elasticity, of course, nowadays is also very important, especially with how the cloud is becoming very popular. People can now run a system on the cloud, and they would like the system to be elastic in the sense that you should be able to add resources to or take resources out of the system itself.
And then robustness, of course. It's almost a given, the system should be robust, it should degrade gracefully in case any types of different failures would happen.
So these six main features in some ways form what we call the MADDER principles. So these are the principles that govern data analytics systems today.
If we take a look into these MADDER principles, in some ways they're really targeted at making the systems easier to use. Users should be able to just load their data and run their analytics; they shouldn't need to do any complicated ETL or any crazy transformations to get things to work. So it's very easy to use.

However, if you have a system that's really, really easy to use, it's very hard to get good performance out of it automatically. And that in some ways is a consequence of the principles themselves. If you have a system that can be used to analyze structured and unstructured data, that creates challenges: for example, the users now have to rely on high-level languages in order to do all the complicated analyses that they would like to do over this semi-structured and unstructured data. They want to write their UDFs to custom-parse the data and then custom-analyze the data, et cetera.
Now, like I said, I keep saying this semi-structured and unstructured data. That means that the system doesn't really know what this data is. We're moving away
from the good old relational world where the system knows, okay, this is the schema, these are the statistics. Not anymore. The system doesn't know this information. So it's very hard for the system to figure out the best thing to do in order to have good performance.
And then, finally, if we look at some of the properties like elasticity, if you take a system like Hadoop -- and I'm going to talk about Hadoop in a little bit -- it is elastic. You could add five nodes to the system and the system will pick it up and things will work. That's wonderful.
However, how do you decide whether you should add five nodes or not? Maybe you should add 10. Maybe you should not add any. How do you come up with the right kinds of policies in order to automate this process and make sure you're doing the right thing?
So there's a lot of challenges that are involved here. So our goal for Starfish from the get-go was to try and get a good balance between ease of use and getting good performance automatically. And this is something that's going to come out in the talk as I go on. We're not interested in getting the actually best performance out of the system. It's all about getting good performance in an automated way.
So currently we have -- well, actually two years ago we chose to build the Starfish system on top of Hadoop, mainly because it already satisfies a lot of those MADDER principles. It can handle unstructured data, it can run all sorts of different types of analysis, it's elastic, et cetera, so we decided to build Starfish on top of Hadoop.

Now, analyzing data in Hadoop usually involves loading your data, a lot of times simple files, into the distributed file system and then running parallel MapReduce computations on top of that to interpret and process the data itself.
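Just to make that concrete, here is what a minimal MapReduce program looks like in Java -- a toy word count, not one of the programs from my experiments; the class names are just for illustration, and depending on your Hadoop version the job-creation call may look slightly different:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word count: map tasks emit (word, 1), reduce tasks sum the counts.
public class WordCount {

  public static class TokenizeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // intermediate output, shuffled to the reducers
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");      // on newer Hadoop: Job.getInstance(conf, ...)
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizeMapper.class);
    job.setCombinerClass(SumReducer.class);     // local pre-aggregation on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // data already loaded into HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```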
Now, over the years there's a lot of systems that have been built on top and around Hadoop creating an entire ecosystem, mainly to satisfy different users' needs and preferences. So there's a lot of systems on the side. There's Flume.
This is a log aggregation system. There's HBase. It's like a key-value store on top of Hadoop. On the top there's a lot of systems like Pig and Hive that offer more declarative, higher-level languages that users can use for expressing the computations they would like to do, for the type of analysis they would like to do. Amazon is offering Hadoop now as a service on their cloud. So customers could say, hey, I would like to get a 20-node Hadoop cluster to do something.
Now, if we look at this entire ecosystem, there's actually a lot of what I'm going to call tuning problems, tuning challenges, that we can address, ranging all the way down from the cluster itself: how do we decide what kind of machines we want, how many nodes we want to put together to put this cluster together.
Then, going up, we have a distributed file system. We want to lay down the data.
How do we decide how to lay out the data, both the original data and the derived data that's going to be generated from the different computations we're doing.
Then, moving up, we're going to be running MapReduce jobs. MapReduce jobs, their execution is influenced by a lot of different configuration parameters, so how do we tune all those parameters to get good performance out of it?
Now, moving on to some of the higher-level systems like Pig or Hive, those in some ways generate work flows of MapReduce jobs. Now you don't have individual MapReduce jobs anymore, you have graphs of MapReduce jobs that are running together to achieve a particular goal. And then, finally, you have overall workloads you're running on the cloud, so you have a new set of challenges that you need to address.
All right. So in this talk, first I'm going to talk about what Starfish can actually do today, what are the kinds of features and what are the capabilities of the Starfish system. So this is going to be the high-level view, the what Starfish can do.

And then I'm going to go into more details on the how Starfish can actually do all the things that it can do today, and in the interest of time I'm only going to focus on tuning individual MapReduce jobs, but, of course, more challenges about work flows and workloads and cluster sizing problems will come through, and I can talk more about it, if you guys like, later on. And then I'm going to finish off with some experimental evaluation and some brief mention of some related work.

So let me start with what Starfish can do today. So what I'm going to present next is a set of experiments that I've run that in some ways will showcase how Starfish can be used to manage MapReduce workloads on the
Amazon EC2 cloud.
Now, there's nothing specific about Amazon EC2 here. Everything could work on in-house clusters as well. Amazon EC2 was simply a convenience for us to create different kinds of scenarios in varying the cluster that we're using.
Now, for people who are not very familiar with Amazon EC2, what Amazon has is a set of different machines, different instance types, that users can rent -- I think about 11 of them. And I have a table outlining some of the more popular ones. So the users could rent machines of any type to put together a cluster.
Now, if you look at this table, different machine types, of course, have different machine specifications. If I were to focus on the c1.medium, a machine of the c1.medium type has five EC2 units, whatever that means, has 1.7 gigs of memory, it can offer moderate IO performance, and it will cost 17 cents per hour to rent one of these machines.
Yes?
>>: [inaudible]
>> Herodotos Herodotou: No, they do not want to identify what it means, and also they don't really quantify the EC2 units, which in some ways complicates things for users, because now I'm a user, I know my workload, but if I don't know exactly what the machines are, how do I decide which machine I should be using? Again, that stresses the need for some tool like Starfish to actually tell us, hey, this is what you should do.

All right. And then apart from the clusters, we also varied the workload.
Now, I tried to select MapReduce programs that come from all sorts of different domains ranging from business analytics to graph processing to information retrieval, and then the data sets that I used were both real and synthetic with sizes ranging from tens of gigabytes up to terabytes.
And, of course, adding to this table are more traditional benchmarks out there like the TPC-H benchmark and Yahoo's PigMix benchmark. Those are high-performance benchmarks, and I've played with those as well.

So starting from the basics, you're running -- you want to run a single MapReduce job on the cluster, on your Hadoop cluster. If you take Hadoop as-is and that job as-is and you run it, you're going to get some performance.

Now, like I said earlier, there's a lot of different configuration parameters that you can use that will actually affect the performance of that particular job. So what this graph shows is actually the performance when running these different MapReduce jobs that I've shown in the table before, outlined on the X axis of the graph, as we run them using three different configuration settings.
Now, the blue bar shows the speed-up -- the blue bar represents the jobs running using the default MapReduce settings. And the graph is normalized on actually the default settings, so, of course, everything here is 1.
Now, Hadoop experts and very expert users, mainly from Yahoo or Cloudera, who have a lot of expertise in using Hadoop, have published a lot of what we call rules of thumb. Think of these as best practices on how to actually set different configuration parameters in order to get good performance out of the system.

Now, I went through all those blogs and the different tutorials, et cetera, figured out which different rules I should apply to different jobs, essentially manually tuned every single job. And the red bars here show the speed-up that I was able to achieve after following all these different rules of thumb compared to the default settings.
Now, as you can see, we can clearly get really, really good performance out of it.
We can get speed-ups up to, you know, 30 -- 20 to 30x speed-ups by following these rules, which is wonderful. It's great.
However, the one thing that is not really shown in this graph is all the effort and the time that I put into figuring out which rules should be used for different programs. A lot of those rules are also kind of [inaudible]. Things like, okay, if your MapReduce program is not memory intensive and you can do partial
aggregation, then you should set some memory buffers to be high. That's a rule of thumb.
As you see, it really depends on the program. That means I need to understand what the program is doing. I also need to understand what the different settings mean in order to be able to set them, et cetera, et cetera. So it was a very labor-intensive, time-intensive exercise to figure out how to set these rules of thumb in order to get these good speed-ups.
The final bar, the green bar here, shows the speed-up that we were able to achieve after using Starfish to automatically tell us what configuration settings to use. And as you can see, in all the cases Starfish was able to automatically give us settings that would match all the rules of thumb and in some cases even surpass the performance that we were able to get with all these rules of thumb.
And, again, I didn't have to spend any time in figuring out what rules to use or how to use them.
Now, of course, a lot of times people don't just submit individual MapReduce jobs. They'll submit entire work flows of MapReduce jobs, either themselves or they're going to be generated by higher-level systems like Pig or Hive. At this point Starfish is also able to support tuning Pig Latin scripts, which are the scripts used by the Pig system.

So what you see here is two graphs. On the X axis there are different queries from two different benchmarks, the TPC-H and the PigMix benchmark.

And these graphs now show the speed-up with regard to the rules of thumb. I stopped using the default settings. Clearly, you know, they're performing pretty poorly. In some cases we can't even use them anymore because some of the programs will not even run using the default settings.
And, again, in all cases, as you can see from the green bar, the Starfish optimizer was able to give settings that will match or surpass the manually tuned settings that we're able to achieve for the different MapReduce jobs here.
Now, another very interesting scenario that arises in a lot of companies is the presence of two types of clusters. Usually companies will have development clusters and they'll have production clusters. All the mission-critical workloads will run on the production cluster, which is usually isolated. And then there are a bunch of development clusters where the developers can actually test the different MapReduce jobs they're developing, figure out how they work, figure out that they work correctly, and then they want to move them onto the production clusters when the jobs are ready.

Now, there are two questions that arise naturally in this setting. First is, okay, I have seen how my jobs behave on the development cluster, so now how will these jobs behave once I stage them over on the production cluster?
And the second even more critical question is what settings should I be using for running these jobs on the production cluster?
Now, one of the capabilities that Starfish has today is after observing how the jobs behave on the development cluster, Starfish can actually answer both of these questions without ever running the jobs on the production clusters themselves.
So this graph shows the running time of, again, our different MapReduce programs. The blue bar shows the actual running time. This is how much time it took to run these MapReduce programs on the production cluster itself, and then the red bar shows the predicted running time. This is how much time Starfish predicted that these jobs will take when we run them on the production cluster.
Now, a few things to note here. Starfish was able to make these predictions without ever running any of these jobs on the production cluster. It observed how the jobs behaved on the development cluster, which in this particular case consisted of 10 large nodes, and the jobs were processing about 60 gigabytes of input each. Whereas the production cluster was actually much bigger, a much beefier cluster consisting of 30 extra large nodes, and the jobs were processing much larger data.
So Starfish, after observing how jobs behave on a small cluster, on small amounts of data, was actually able to predict how the same jobs would behave on a much bigger cluster processing a lot more data.
Now, the other thing to notice here, one of the important things that I want to bring out, is the accuracy of these predictions. You know, some of them are pretty close, some of them not so much, but the important thing is that we're able to capture the different trends. And once we're able to do that, then we can actually optimize the individual MapReduce jobs later on.
Yes?
>>: How strongly did the run times on the production cluster relate to the run times that you saw on the development cluster?
>> Herodotos Herodotou: What do you mean?
>>: Basically I'm asking about correlation. Is it just a question of finding -- so the simplest model could be, okay, I take a simple linear factor, I scale everything --
>> Herodotos Herodotou: Ah, yes. Okay. So the answer is -- I see what the question is.
Actually, let me simplify that question. So let's say we run two jobs on the development cluster and we saw job 1 was running much faster than job 2.
Would that still be the case on the production cluster? Or could I just say, okay, this is a three times bigger cluster, so scale the running time by a factor of 3 -- would that still hold? The answer is no.
There's a lot more things in play here. Yeah, you have more nodes to run on. You're processing a little more data. Things are just too complicated to be able to do that linear scale-up. There are even cases that I've seen where one job might be running a little faster on the development cluster, but then for one reason or the other it would run worse on the production cluster, et cetera.
And then the final scenario that really arises with the cloud is the following:
Suppose we have a user that wants to execute a MapReduce workload, and, of course, there are some performance requirements that come along with it. For example, they might want to do some report generation and they want to finish within two hours before the morning or something. And a lot of times they might have also some actual dollar amount because, you know, we're using the cloud, we're actually paying money to use the cloud, so there might be some monetary goal there as well.
Now, in this setting there is a range -- a different range of decisions that the user has to make, ranging all the way down from the cluster itself, right? The user now needs to figure out, okay, how many nodes should I use, what type of nodes should I use to put my cluster together, et cetera. And then once I put the cluster together and I run Hadoop, there's a bunch of different settings that I need to figure out, you know, how many slots should I use, what different memory settings should I use for Hadoop, et cetera. And then, finally, these workloads are going to run. We need to figure out, okay, how many map and reduce tasks should we run. Again, there are different memory-related parameters. Should I use compression? All sorts of different parameters that we need to set.

So as we see now, the range of decisions grows even larger. There's a lot more decisions to make. Again, Starfish can actually help in making these kinds of decisions for the users themselves.

So to summarize, there are tuning problems at all sorts of different levels -- job level, workload level, cluster level, et cetera -- so these are the different kinds of tuning challenges and tuning problems that Starfish is able to resolve today.
Now, a lot of these tuning challenges or tuning problems, if you like, they have different flavors in some ways, but when we started putting Starfish together we really wanted to figure out some sort of a universal approach that would allow us to actually solve all these problems in a nice unified way.
Now, the way I was actually able to achieve that was to take a unified approach to tuning. And this unified approach is summarized in this particular slide.

So first what we wanted to do was to observe how the job behaves and learn from it. And that's the responsibility of a component that we call the profiler.

So the profiler is responsible for collecting concise summaries of how different MapReduce jobs executed on different clusters and learning from them. Then this information is used by a different component, which we call the what-if engine, that can actually estimate the impact that different hypothetical changes can have on the job execution itself.
So we can now phrase questions to the system of the form, okay, I have observed how this job behaves running 10 reduce tasks on a 20-node cluster. How will it change -- how will the performance change if I were to run 20 reduce tasks instead of 10? How will it change if I were to add three more nodes to my cluster? What if I now have a bigger data set? How will that affect the performance? So the what-if engine is responsible for answering these kinds of questions.
And then, finally, for the different problems, you can think about -- you know, there's a lot of different decisions that we can make, so there's a search space to search through. And for that we have a set of optimizers. The optimizers are responsible for enumerating and searching through the decision space to figure out what's the best decision to make to get good performance and to satisfy the user's performance requirements.
So now I'm going to go a lot into -- into a lot more details regarding the profiler, and what-if engine, and the optimizer from the aspect of tuning individual
MapReduce jobs.
Now, the way that we view a MapReduce job is the following. We view it as a quadruplet consisting of the actual MapReduce program that's going to run on the cluster, the data properties of the data that's going to get processed by this program, the particular cluster resources that we have, and then, finally, the different configuration settings.
Now, running a MapReduce job on a Hadoop cluster essentially involves running individual map and reduce tasks. The map tasks are going to process partitions of the data, which are called splits, to produce some intermediate output that's going to get transferred or shuffled to the reduce tasks. The reduce tasks are then going to process the intermediate data to produce the final output, and depending on the resources and how big the cluster is, these map or reduce tasks could actually run in multiple waves.

Now, what affects the execution of each map and reduce task is a set of configuration parameters. These configurations determine how many map and reduce tasks I'm running, how I want to partition the data, how I set different memory-related parameters to buffer the map output or the reduce input, whether I should use compression for the intermediate data or the output data, whether I should use the combiner -- that's a local pre-aggregation function that can be applied to the map output -- et cetera, et cetera. So we have a fairly large space of different configuration settings that can actually affect the performance of MapReduce jobs.
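To give a flavor of what these knobs look like in practice, here is a small sketch of setting a few of them. The property names follow Hadoop 1.x (they were renamed in later versions), and the values are arbitrary examples, not tuning recommendations:

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: a handful of the job-level knobs discussed in the talk.
// Property names follow Hadoop 1.x and may differ in other versions; the values
// here are arbitrary examples, not recommendations.
public class ExampleJobSettings {
  public static Configuration exampleSettings() {
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 20);               // number of reduce tasks
    conf.setInt("io.sort.mb", 200);                       // map-side sort buffer size, in MB
    conf.setFloat("io.sort.record.percent", 0.15f);       // share of that buffer for record metadata
    conf.setFloat("io.sort.spill.percent", 0.80f);        // fill level that triggers a spill to disk
    conf.setBoolean("mapred.compress.map.output", true);  // compress the intermediate (shuffled) data
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // reduce-side shuffle buffer share
    return conf;
  }
}
```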
Now, at this point a typical question that might arise is how much do these different configuration settings actually affect the performance of individual
MapReduce jobs?
Now, the answer is -- the true answer is it really depends on the program itself and the cluster and the data, but they can actually have a significant impact on performance in different ways for different programs.
So now what you see in this graph is on the X and the Y axis we see two particular configuration parameters, that's the io.sort.mb and the io.sort.record.percent. These are two parameters that affect the memory buffer that you use for the map output.
So on the Z axis we see the actual running time of a particular program -- in particular it's a co-occurrence MapReduce program that was run on a 60-node cluster on EC2 -- so we see how this performance varies as we vary these two configuration parameters.

So we see there are some areas, the blue areas, where the job performs very well. We get really good performance out of it. Then we see some other areas, the red areas, where we get some really bad performance out of the job itself.
So it is our job to figure out how do we get settings that are in the blue region and definitely far, far away from the red region itself.
So this is just a two-dimensional projection of the surface. Back when I actually generated this graph, I generated a lot of these graphs with different configuration parameter settings, got all sorts of crazy different shapes and forms. But all these surfaces in some ways can be -- mathematically can be represented as a function, and it is a function of the program, the data, the resources and the configurations. So all these four things together are affecting the performance of the MapReduce job.
Then you can use essentially pretty much any performance metric that you would like, from resource consumption to total execution time, et cetera. For simplicity, let's now think about performance as the total running time of the job. So when we are asked to optimize, that is, to find the optimal settings that will minimize the running time of the MapReduce job, it essentially involves figuring out which settings I should use to minimize this function F.
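In code terms, you can think of F as nothing more than a black-box cost function over those four inputs, and job tuning as an argmin over candidate configurations. This is just a conceptual sketch of that framing with placeholder generic types -- it is not the Starfish API:

```java
import java.util.Comparator;
import java.util.List;

// Conceptual framing only: perf = F(program, data, resources, configuration),
// and job tuning = argmin over configurations. The four inputs are opaque here.
public final class JobTuning {

  @FunctionalInterface
  public interface PerfFunction<P, D, R, C> {
    double runningTime(P program, D data, R resources, C configuration);
  }

  // Pick the configuration that minimizes F for a fixed program, data set and cluster.
  public static <P, D, R, C> C argmin(PerfFunction<P, D, R, C> f,
                                      P program, D data, R resources,
                                      List<C> candidateConfigs) {
    return candidateConfigs.stream()
        .min(Comparator.comparingDouble(c -> f.runningTime(program, data, resources, c)))
        .orElseThrow(() -> new IllegalArgumentException("no candidate configurations"));
  }

  private JobTuning() {}
}
```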
In theory, all is well. However, the main challenge here is that these MapReduce jobs in many cases are actually expressed as MapReduce programs written in high-level languages like Java or Python, or generated by systems like Pig or Hive. So how do we capture the execution of the job? How do we represent these different programs written in these arbitrary languages?

And the answer to that is profiles, job profiles. So a job profile is the abstraction that I'm using to represent the execution of a MapReduce job. So think about it this way:
At the end of the day this is a vector of features that characterize how individual map and reduce tasks executed on the cluster itself.
And it records very detailed information at the level of task phases. And when I say task phases, I mean really sub-task phases, right?

Like I said earlier, a MapReduce job will run as map tasks and reduce tasks. Now, within every map or reduce task we can actually break down the execution into various smaller steps.

For example, for a map task, first we need to read the input split from the distributed file system, then we need to execute the user-defined map function to produce some output. That output is going to get serialized and partitioned into a memory buffer; when the memory buffer gets filled up, then we actually need to spill it to disk; before spilling we're going to sort it, if there's a combiner we're going to use it, if compression is enabled, again, we're going to use it. So that might actually lead to the creation of multiple spills. And then finally we need to merge them to produce the final output that's going to get shuffled to the reduce tasks, and a similar thing goes on for the reduce tasks.
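Just to pin down that vocabulary, here is the phase breakdown written out as two simple enums. This mirrors the standard Hadoop execution path I just described; it is an illustration, not Starfish code:

```java
// The sub-task phases of a Hadoop map task and reduce task, as described above.
// A job profile records data flow and timing information at this granularity.
public final class TaskPhases {

  public enum MapPhase {
    READ,     // read the input split from the distributed file system
    MAP,      // run the user-defined map function
    COLLECT,  // serialize and partition the map output into the in-memory buffer
    SPILL,    // sort the buffer (applying combiner/compression if enabled) and write it to disk
    MERGE     // merge the spill files into the final map output that gets shuffled
  }

  public enum ReducePhase {
    SHUFFLE,  // copy map output partitions from the map tasks
    MERGE,    // merge and sort the copied intermediate data
    REDUCE,   // run the user-defined reduce function
    WRITE     // write the final output to the distributed file system
  }

  private TaskPhases() {}
}
```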
Now, by using this information in the job profiles, we can actually analyze the execution of MapReduce jobs and understand how they actually behaved. So what you see here is a screen shot from the Starfish visualizer, a graphical user interface that we have that the user can use to understand how the jobs behave.

So what you see here is, for every different node, the black boxes represent individual map tasks that ran and the purple boxes represent reduce tasks. So we can quickly visualize all the map and reduce waves and how they executed on the cluster. Now, all that information is available to us from the job profile itself.

Now, there are two main dimensions in the job profile. There is one set of data flow fields and there is one set of cost fields. The data flow fields represent the flow of data through the different map and reduce task phases. So these are things like how many records and how many bytes went into the map task and came out of the map task, how many spills we did, how many merge rounds we did, et cetera. For the people who are familiar with Hadoop, this is a super-set of the Hadoop counters. So it's a lot of information that represents how the data went through the different sub-task phases.
And that information -- actually, before I go there -- and then the second dimension is the costs. This represents, for the most part, timings of how much time was spent executing the different task phases within individual map and reduce tasks, like how much time did we spend performing IO versus how much time did we spend doing CPU processing of the user-defined map function.
Now, all this information -- if we collect this information from all the map and reduce tasks, we don't just have, you know, single numbers here, we actually have entire distributions of these values and then we can use those distributions to actually build more meaningful and more insightful representations like histograms, for example.
By looking at the output produced by the different map and reduce tasks, we can build a histogram that will tell us how much data was output by the different map and reduce tasks. So we can see if there was any skew involved or anything that would potentially create bottlenecks in the execution, et cetera.

And then with the cost information we can actually break down the execution of individual map and reduce tasks so we can see how much time was spent in the different phases, and, again, perhaps identify that, okay, in this particular case the spill phase -- that's the phase where we actually, you know, sort and spill to disk -- took a significantly large chunk of this map task's execution, so perhaps there's something we could do there, maybe there's some configuration setting we can change in order to decrease the execution time of the map task.
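As a rough picture of what such a profile boils down to as a data structure, here is a sketch; the field names are my own shorthand rather than Starfish's actual classes, and a real profile also keeps per-task distributions rather than single numbers:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative shape of a job profile: data flow fields plus cost (timing) fields,
// aggregated over the profiled map and reduce tasks. Not Starfish's real classes.
public class SketchJobProfile {

  // Data flow fields: how records and bytes moved through the sub-task phases
  // (a superset of the standard Hadoop counters).
  public long mapInputRecords, mapInputBytes;
  public long mapOutputRecords, mapOutputBytes;
  public long numSpills, numMergeRounds;
  public long reduceInputRecords, reduceOutputRecords, reduceOutputBytes;

  // Cost fields: time spent in each sub-task phase, keyed by a phase name such as
  // "MAP.READ", "MAP.SPILL" or "REDUCE.SHUFFLE", averaged over the profiled tasks.
  public final Map<String, Long> phaseTimeNanos = new LinkedHashMap<>();
}
```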
Now, generating these profiles comes in actually two ways. One is we can actually measure these fields, and this is done by the profiler itself -- this is how we can generate the job profile -- or we can estimate a job profile, and this is actually done by the what-if engine. And I'm going to go into both approaches in a little bit.
Now, when it comes to the measurement-based profile, this is where the
MapReduce job is actually running on the cluster, and now we want to observe what happened.
When I started putting together the profiler, I had some goals in mind. First of all, I should be able to turn profiling on and off dynamically. So when I turn it off, there should be zero overhead on the cluster itself, which is something that is essentially required, especially for running on the production cluster, and then on demand I can turn it on and observe and learn from how the job behaved with some low overhead.

The second requirement -- well, second and third in some ways -- had to do with how usable Starfish was going to be. The first one is we didn't want to modify Hadoop in any way, and we also did not want to require the users to modify their MapReduce programs in any way. So in some ways we wanted to increase the chances for Starfish to be adopted, and that's definitely something that was proven to be very wise, I guess, at this point, because people have actually started looking into Starfish and started using Starfish.
Now, the way that I was able to achieve all these goals in terms of profiling is by using a technique called dynamic instrumentation. So dynamic instrumentation is a technique that has become very popular in programming languages and in the compiler world to monitor and debug in some ways complex systems, so we took this approach here as well.
Now, individual map and reduce tasks are running in Java virtual machines, JVMs, right? Now, what dynamic instrumentation allows us to do, once we enable it, is to actually tap into these Java virtual machines and dynamically inject byte code based on some event-condition-action rules; that is, if an event happens, say a method gets called, and a condition is met, then a particular action needs to be taken.
Now, that action in most cases has to do with recording the content of some variable or doing some fine-grained timings to collect all that information.
So when we enable profiling, we dynamically instrument the Hadoop Java classes to collect this raw monitoring data. That raw monitoring data will then get collected and post-processed in some ways to build these individual map and reduce profiles that are then going to get combined together to build job profiles, which is what we're going to use later on for analyzing and optimizing the execution of MapReduce jobs.
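The actual injection is done through an instrumentation tool (BTrace, which I'll mention again later), but the event-condition-action idea itself is easy to sketch without any framework. The wrapper below is purely illustrative -- it shows the kind of logic that gets injected, not how the injection happens:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Event-condition-action in miniature: event = a phase method is entered/exited,
// condition = profiling is enabled, action = record a fine-grained timing.
// In Starfish this logic is injected into the running task JVMs as byte code;
// here it is just a hand-written wrapper.
public class PhaseTimer {
  private static volatile boolean profilingEnabled = true;                 // the "condition"
  private static final ConcurrentMap<String, LongAdder> phaseNanos = new ConcurrentHashMap<>();

  public static <T> T timed(String phase, Callable<T> body) throws Exception {
    if (!profilingEnabled) {
      return body.call();                                                  // zero bookkeeping when off
    }
    long start = System.nanoTime();                                        // the "event": phase entry
    try {
      return body.call();
    } finally {                                                            // the "action": record the timing
      phaseNanos.computeIfAbsent(phase, p -> new LongAdder()).add(System.nanoTime() - start);
    }
  }

  public static long nanosSpentIn(String phase) {
    LongAdder t = phaseNanos.get(phase);
    return t == null ? 0L : t.sum();
  }
}
```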
Now, of course, when you're running MapReduce at large scale, you're running a MapReduce job that could actually have hundreds or even thousands of map and reduce tasks, and there's really no need to profile every single one of them. So, therefore, we have enabled some sampling features in Starfish.

The first one is to profile only a fraction of the tasks. If we have 1,000 map tasks running, we could profile 10 percent of those map tasks and we're going to collect enough information to build a very accurate job profile. And the reason, of course, behind sampling is to decrease the amount of overhead on the job execution even further.
Now, if I only want to run this job to collect some profiling information very quickly because let's say I want to run this large ad hoc job and I just want to get some information very quickly, then we don't even really need to run all those 1,000 tasks. We can actually execute fewer tasks so we can get -- build approximate job profiles very, very quickly that we can then use to optimize the execution of the job.
So what this abstraction, this job profile, enables us to achieve is that in some ways we're able to represent any arbitrary MapReduce job by using this set of fields, these data flow and cost fields. And in all reality we use these data flow and cost fields to actually compute a set of data flow statistics and a set of cost statistics, which are in some ways independent of the actual execution itself. So data flow statistics represent statistics over the data, like what was the selectivity of the user-defined map function in terms of records, or what was the compression ratio that we were able to achieve. And then cost statistics are things like what was the average IO cost for reading from local disk, or what was the average CPU cost for processing the user-defined map function.
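For instance, a few of those statistics would be derived from the raw profile fields roughly like this; the formulas are the obvious ratios, and the exact definitions in Starfish may differ:

```java
// Examples of deriving data flow statistics and cost statistics from raw profile fields.
// The formulas are the straightforward ratios one would expect; exact definitions in
// Starfish may differ.
public class ProfileStatistics {

  // Data flow statistic: selectivity of the user-defined map function, in records.
  public static double mapRecordSelectivity(long mapOutputRecords, long mapInputRecords) {
    return (double) mapOutputRecords / mapInputRecords;
  }

  // Data flow statistic: compression ratio achieved on the intermediate map output.
  public static double mapOutputCompressionRatio(long compressedBytes, long uncompressedBytes) {
    return (double) compressedBytes / uncompressedBytes;
  }

  // Cost statistic: average I/O cost of reading from local disk, in nanoseconds per byte.
  public static double localReadCostPerByte(long readNanos, long bytesRead) {
    return (double) readNanos / bytesRead;
  }

  // Cost statistic: average CPU cost of the user-defined map function, per input record.
  public static double mapCpuCostPerRecord(long mapFunctionNanos, long mapInputRecords) {
    return (double) mapFunctionNanos / mapInputRecords;
  }
}
```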
Now, by using these data flow and cost statistics, we can actually predict how MapReduce jobs are going to behave in different settings. And, like I said, that is done by the what-if engine. So the what-if engine can actually predict the impact of hypothetical changes to the data, the cluster resources, and the configuration settings on the program execution.
So the job profile can be given to the what-if engine to estimate how the job will behave, and that's all it needs in terms of the program itself. So we're able to
completely abstract away the user-defined map and reduce program into this job profile.
Now, when we want to ask a hypothetical question to the what-if engine, there are three more inputs that are required. So we need to specify the full, essentially, scenario. I want to run this particular job represented by this job profile over a particular -- over a new input data set on perhaps different cluster resources and use perhaps different configuration settings. Now, any of these three could actually be hypothetical.
So these four inputs will go through the what-if engine and out will come the properties of the hypothetical job.
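Written as an API, a what-if call would look something like the sketch below. The interface and all the type names here are hypothetical, meant only to show the shape of the question and the answer, not the actual Starfish interfaces:

```java
import java.util.Map;

// Hypothetical interface, only to show the shape of a what-if call: a measured job
// profile plus (possibly hypothetical) data, cluster and configuration descriptions
// go in; an estimated execution comes out. Not the actual Starfish API.
public interface WhatIfEngine {

  // Condensed stand-in for a measured job profile (data flow + cost statistics).
  class JobProfileSummary {
    public Map<String, Double> dataFlowStatistics;   // e.g. "mapRecordSelectivity" -> 0.8
    public Map<String, Double> costStatistics;       // e.g. "localReadCostPerByte" -> 3.1e-9
  }

  class DataProperties {
    public long inputSizeBytes;
  }

  class ClusterResources {
    public int numNodes;
    public int mapSlotsPerNode, reduceSlotsPerNode;
  }

  // Estimated execution of the hypothetical job: a virtual job profile plus a
  // simulated task schedule, from which the running time can be read off.
  interface EstimatedExecution {
    double estimatedRunningTimeSeconds();
  }

  EstimatedExecution predict(JobProfileSummary measuredProfile,
                             DataProperties hypotheticalData,
                             ClusterResources hypotheticalCluster,
                             Map<String, String> hypotheticalConfiguration);
}
```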
Now, the what-if engine works in two steps. The first step is taken by the profile estimator. Now, this component is responsible for estimating what we call the virtual job profile. This is the profile that we would have observed if we were to run this job with these particular settings.
Then this virtual job profile is going to be used by our task scheduler simulator that will simulate the scheduling and execution of individual map and reduce tasks on the cluster in order to get an understanding of the full execution of the
MapReduce job.
And when I say the full execution, I really mean a description of how each and every map and reduce task is going to behave, and then we can actually use that information to visualize any of the screen shots that I showed earlier about the histograms, the timeline of execution, et cetera, we can actually visualize all of that by the output of the what-if engine.
Now, of course, the most critical component of all here is this virtual profile estimator. So the goal of the virtual profile estimator is the following: Given a profile that we have observed for a particular program on some input data, some cluster resources and some configuration settings, we want to estimate what the virtual profile will be for the hypothetical job with different data properties, different cluster resources and different configuration settings.
Now, what you're going to see in this one slide here, it's actually about one year worth of work, give or take, that I tried to summarize in a single slide.
So estimating -- you know, making this estimation essentially involves estimating the individual categories that comprise these MapReduce job profiles. We want to estimate the data flow statistics, the cost statistics, data flow, and cost information for these virtual profiles.
Now, these categories represent, of course, very different information, and therefore I decided to use different modeling techniques to model these different categories.
For estimating the data flow statistics of the virtual profile, all we need to do is look at the data flow statistics that we have observed and combine them with the new input data that we're asking the hypothetical question about, and then we use some traditional cardinality models, good old database-style cardinality models, to estimate the data flow statistics of the virtual profile.
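A toy version of such a cardinality-style calculation: scale the observed input to the new data size and push the observed selectivity through. This is the textbook form of the idea, deliberately much simpler than the actual Starfish models:

```java
// Toy cardinality-style estimation: assume the map selectivity observed in the profile
// carries over to the new input, and scale record counts by the change in input size.
// Deliberately much simpler than the real models.
public class DataFlowEstimator {

  public static long estimateMapOutputRecords(long observedMapInputRecords,
                                              long observedMapOutputRecords,
                                              long observedInputBytes,
                                              long newInputBytes) {
    double selectivity = (double) observedMapOutputRecords / observedMapInputRecords;
    double scale = (double) newInputBytes / observedInputBytes;
    long newMapInputRecords = (long) (observedMapInputRecords * scale);
    return (long) (newMapInputRecords * selectivity);
  }
}
```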
Now, when it comes to cost statistics, these are -- like I said earlier, this is information that's very well connected to the machine specifications. We're talking about IO costs or CPU costs. So if the cluster resources -- the hypothetical or the new cluster resources are different, now we need to translate these cost statistics that we have observed on one cluster into the cost statistics that we expect to observe on the new cluster.
So for this purpose I use a set of relative black-box models -- machine learning models, essentially -- to make this translation. And then when it comes to data flow and costs, these are specific, in some ways, to how Hadoop behaves, so for this I came up with a set of white-box models; that is, these are analytical models, literally a set of mathematical equations that explain how individual map and reduce tasks will behave in terms of data flow and costs.
This information here, the analytical white-box models, is about 16 pages of math that are available in a technical report, and a lot of the other things are described in a paper that I presented at SoCC earlier. So there are a lot of details hidden behind this slide. Of course, I definitely don't have time to go into any of this.
So I decided to only go a little bit into these relative black box models, these machine learning models, that we used to make estimations across different clusters.
So imagine that we have a blue cluster and a red cluster that have perhaps different types of nodes, different numbers of nodes. You know, think about it -- the blue could be the development cluster and the red would be the production cluster, for example.
So we run and profile the job on the blue cluster, we collect the job profiling information, and then we get the cost statistics of this particular job running on the blue cluster, right? So we have this set of numbers, essentially, to represent the cost statistics.
Now, our goal is to figure out what the cost statistics will be for this job once we run it on the red cluster.
Now, in order to be able to make this translation, if you like, or this prediction, we essentially need to collect information about how jobs behave on these two different clusters. As in many typical machine learning algorithms, we need some training data. And this training data will come from running a set of training jobs, a very carefully chosen, custom-implemented set of training jobs that will run on both of these clusters to collect, essentially, cost statistics, right?
Different jobs will behave differently. We have CPU-intensive jobs, IO-intensive jobs, et cetera. We run them on both clusters.
So we collect these cost statistics for different scenarios and then we're going to use this information to build a machine learning model.
Here pretty much any supervised learning algorithm would work. After playing with a few, I decided to use something called an M5 model tree. The thing about this is it's a decision tree where at the leaves you have small regression models to capture the trends.
So these models can then be used to translate cost statistics from the blue cluster into cost statistics in the red cluster.
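As a very rough sketch of the idea -- I used an M5 model tree, but here I substitute a simple per-statistic least-squares line just to keep the example short -- the training pairs map each cost statistic observed on the source cluster to the value observed on the target cluster:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a "relative" black-box model: learn, per cost statistic, a mapping from
// the value observed on the source (development) cluster to the value expected on the
// target (production) cluster. Starfish uses M5 model trees; a per-statistic
// least-squares line is substituted here purely to keep the example short.
public class RelativeCostModel {

  private final Map<String, double[]> lineByStatistic = new HashMap<>();  // name -> {slope, intercept}

  /** Training pair: the same statistic measured for the same training job on both clusters. */
  public static class Sample {
    public final String statistic;
    public final double sourceValue, targetValue;
    public Sample(String statistic, double sourceValue, double targetValue) {
      this.statistic = statistic;
      this.sourceValue = sourceValue;
      this.targetValue = targetValue;
    }
  }

  public void train(List<Sample> samples) {
    Map<String, List<Sample>> byStat = new HashMap<>();
    for (Sample s : samples) {
      byStat.computeIfAbsent(s.statistic, k -> new ArrayList<>()).add(s);
    }
    for (Map.Entry<String, List<Sample>> e : byStat.entrySet()) {
      lineByStatistic.put(e.getKey(), fitLine(e.getValue()));
    }
  }

  /** Translate a cost statistic observed on the source cluster to the target cluster. */
  public double translate(String statistic, double sourceValue) {
    double[] line = lineByStatistic.get(statistic);
    if (line == null) {
      throw new IllegalStateException("no training data for statistic " + statistic);
    }
    return line[0] * sourceValue + line[1];
  }

  // Ordinary least squares for targetValue = slope * sourceValue + intercept.
  private static double[] fitLine(List<Sample> samples) {
    double n = samples.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (Sample s : samples) {
      sx += s.sourceValue; sy += s.targetValue;
      sxx += s.sourceValue * s.sourceValue; sxy += s.sourceValue * s.targetValue;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] {slope, intercept};
  }
}
```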
All right. And then the final component, the job optimizer, can be used to figure out what's the best configuration settings to use for running a particular
MapReduce job over perhaps different input data and different cluster resources.
Again, the job optimizer follows two steps. The first is to enumerate the entire search space and try to figure out if there are any independent subspaces that we can cut our space into, and then for each one of those independent subspaces, now we need to search through this space. Now, this space is essentially represented by that F function that I showed very, very early on and implemented by the what-if engine.
So for this, of course, the function itself is a black box; it's not a nice closed-form model, it doesn't behave in any well-known way. So I ended up implementing this method called recursive random search that's responsible for searching through this space, and as it searches through the space, for the different points in the space it will make appropriate what-if calls to figure out what the cost will be for executing the job on the cluster at that particular point.
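A compact sketch of what recursive random search looks like over a purely numeric space; each evaluation of the cost function stands in for a what-if call. The shrink factor and loop counts are arbitrary, and real Hadoop parameters are a mix of types, so treat this as the idea only:

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Sketch of recursive random search: sample the space uniformly, then repeatedly shrink
// the search region around the best point found so far and re-sample inside it. Each
// call to cost.applyAsDouble(point) stands in for a what-if call. Constants are arbitrary;
// assumes samplesPerRound >= 1.
public class RecursiveRandomSearch {

  public static double[] search(double[] lower, double[] upper,
                                ToDoubleFunction<double[]> cost,
                                int samplesPerRound, int rounds, long seed) {
    Random rnd = new Random(seed);
    int dims = lower.length;
    double[] lo = lower.clone(), hi = upper.clone();
    double[] best = null;
    double bestCost = Double.POSITIVE_INFINITY;

    for (int round = 0; round < rounds; round++) {
      for (int s = 0; s < samplesPerRound; s++) {
        double[] point = new double[dims];
        for (int d = 0; d < dims; d++) {
          point[d] = lo[d] + rnd.nextDouble() * (hi[d] - lo[d]);   // uniform sample in the region
        }
        double c = cost.applyAsDouble(point);                      // "what-if" evaluation
        if (c < bestCost) {
          bestCost = c;
          best = point;
        }
      }
      // Recurse: shrink the region to half its width around the best point found so far,
      // clipped to the original bounds, and keep sampling inside it.
      for (int d = 0; d < dims; d++) {
        double halfWidth = (hi[d] - lo[d]) / 4;
        lo[d] = Math.max(lower[d], best[d] - halfWidth);
        hi[d] = Math.min(upper[d], best[d] + halfWidth);
      }
    }
    return best;
  }
}
```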
Yes?
>>: [inaudible]
>> Herodotos Herodotou: Uh-huh.
>>: [inaudible]
>> Herodotos Herodotou: Right, right. Right. That's a very good question. And there's actually a lot of details that go into how do we come up with these training jobs.
One thing you could do is say, okay, so you know what your workload looks like, so we could take a sample of that workload or even the entire workload that you know and then run it on both clusters, observe, and then build that information.
That would definitely work. The problem with that is now if you have a new job that comes in that behaves differently, if you don't have the training data, you haven't observed the training data, and then the machine learning model is not going to work very well. So for that reason we did not follow that approach.
So if we take a step back and think about what do we need here, we need to predict how cost statistics will behave, so how these numbers relate going from the blue cluster to the red cluster.
So the goal of these training jobs is in some ways to generate data that will cover this prediction space, which consists of the whole range of values that the different cost statistics can take. So for this purpose I actually implemented a set of custom MapReduce jobs that don't do anything useful for the user, but they exercise -- they actually run map and reduce tasks that behave very differently, in order to capture as much of that prediction space as possible, so that any job that comes in, an existing job or a new job, will be very likely to have cost statistics that fall within the training data that we have collected.
All right. Now, you said until 11:40?
>>: [inaudible]
>> Herodotos Herodotou: Ten minutes? Okay. There's plenty of time for that.
Okay.
All right. I'm only left now with the experimental evaluation, the main section. So for the experimental evaluation, like I said, it's essentially evaluating the different functionalities of Starfish that I have presented so far.

So let me start with the profiling itself. Like I said earlier, if we turn it off, we have zero overhead, but if we turn it on, we do get some overhead. And the tool that I ended up using turned out to have a little more overhead than I would like, I guess. I didn't mention this earlier -- I used this tool called BTrace to actually implement the profiler.
So what you see in these two graphs is -- I wanted to show the cost-benefit tradeoffs between the overhead and the benefit we can get from the profiling.

So the first graph shows the percent overhead that is added to the job as we profile more and more tasks. As you see, if we profile every single MapReduce task -- in this case, running this word co-occurrence over 30 gigabytes of data, we have a few hundred map tasks that are run -- we can get up to almost 30 percent overhead, which is pretty significant.

But when we start profiling only 5, 10, 20 percent, then the overhead that we observe is much, much lower, definitely more appropriate. And if we look at the graph on the right, which shows the speed-up that we're able to achieve by using approximate job profiles that were built from the different percentages of tasks profiled, we see that profiling up to 10 percent of the tasks will give us enough information to build a profile that's representative -- that's pretty much the same profile as if we were to profile 100 percent of the tasks.
Now, the next brief evaluation of the what-if engine -- of course, I did a lot more to evaluate the what-if engine, but visually the easiest way to observe is actually to compare essentially the different surfaces that we can get.
So on the left here you see the actual surface that you saw earlier that shows the running time of this word co-occurrence program as it was run with different settings for these two configuration parameters. So I literally ran this job -- it was 5 times 4 -- 20 different times to get this particular graph.

Now, I profiled the job in one of those settings and then used that job profile to make 20 what-if calls to the what-if engine, and that generated the estimated surface on the right.

So as you can see -- I'm not going to claim that the two surfaces are identical, of course, but what I will point out is how the what-if engine is able to clearly capture all the execution trends that result from varying these two configuration parameters. We can clearly identify the blue regions and the red regions here and the different trends, and all this is based on a single execution of the job with a single setting.
Then we have -- let's see. This graph shows the running time of running these individual MapReduce jobs on the production cluster using settings -- three different settings. One was using the rule of thumb settings, the second one, the red bar, was using the settings suggested by the Starfish optimizer based on the job profiles that we have observed on the development cluster before running on the production cluster at all.
So as we can see, in all cases we're able to match or improve upon the execution running time of the rules of thumb settings.
Now, to see how much better we could have got if instead of profiling on the development cluster we were to profile on the production cluster, I put there this green bar. So for the green bar we actually profiled on the production cluster and then used that job profile to get the configuration settings to use on the production cluster.
Now, as you can see, in almost every single case it didn't really matter whether the profile came from the development cluster or the production cluster. We were still able to get very, very good performance in both cases.
And then, finally, this [inaudible] to the last scenario where we can actually make predictions and recommendations with regard to the number of nodes to use or the type of nodes to use. So what you see in these two graphs is the running time on the top and the monetary cost on the bottom of running our MapReduce workload, which consists of all those MapReduce jobs. So the blue bar shows the actual running time of the workload run on -- let's see, these were 20-node clusters of different node types.

Now, the red bar shows the predicted running time of the entire workload, again, on the different node types. And to get these predictions we used the job profiles that were collected on a 10-node cluster of m1.large instances.
And, again, as you can see, we're still fairly accurate in predicting the running time as well as the monetary cost. And, again, the important thing here was to actually capture the trends and not the absolute predictions.
So, finally, clearly there's a lot of related work around Starfish. I just wanted to, I guess, briefly mention some of it. There's definitely a lot of work being done on self-tuning of database systems, with projects like LEO from IBM or AutoAdmin here at Microsoft Research, and similar work on adaptive query processing, where you change your execution plan at run time.

Now, in terms of optimizations in MapReduce systems, there's a lot of rule-based techniques that have been implemented, especially in the higher-level systems like Hive or Pig, and work on how to select good data layouts -- there are columnar storage papers out there, et cetera. And then, finally, with regard to performance modeling, there's a huge variety of papers on black-box models, analytical models, and simulations for MapReduce systems and other systems, et cetera.
So to summarize, Starfish is a system that is able to achieve good balance between ease of use and getting good performance automatically out of a system, actually a whole ecosystem, like Hadoop and all the different systems around it, and it focuses on tuning at all sorts of different levels from job level to work flow level and to cluster level.
We've also released part of the Starfish code, so you can actually go on the website and you can download Starfish. You can play with it, you can use it to visualize the execution of MapReduce jobs, you can ask what-if questions, and you can optimize MapReduce jobs. And then in the next version we're now adding -- actually we've already added support for Pig and we're in the process of adding support for other higher-level systems like Hive and Cascading.
All right. And I guess at this point I'll take any questions for the time we have left.
[applause].
>>: This might be a question that takes a little longer, but the one interesting thing is your choice between using -- so sort of using machine learning for the cost models and then using white box models for other parts, what motivated that?
>> Herodotos Herodotou: I could summarize that in one line: I guess no one size fits all would be the answer here. The different things that I wanted to estimate were so different that using -- let's see. Where was that? Using the same model for everything just didn't seem to be right.
When we're trying to make predictions with regards to memory specification -- to hardware specification, like we have different memory, different CPU architectures, you know, representing that using white box analytical models would just almost be impossible, or you would not be able to get good predictions out of it.
Now, the other way around, if I were to use -- I could have used machine learning models to capture some of the -- to predict data flow and cost information, for example.
However, these are far more complicated. There's a lot of interactions between them, so I would have to figure out what are all the different interactions, and the prediction space is actually much larger, and then at the same time there was no reason to do that because we actually understand how Hadoop behaves. We can see how it behaves. We know exactly what's going on. We know the different rules that they follow for spilling and for sorting, so we can actually model all those things analytically. And if we can model things analytically, then we can actually get much better predictions out of it. So the clear motivation was figure out what's the best model that will give us the best predictions.
>>: [inaudible] talk that people are actually using Starfish or thinking about using it. Can you say more about who's using it and what has been the feedback?
>> Herodotos Herodotou: Yes. So within academia itself, Starfish has actually inspired a couple of new projects. There's a project that I know of at Waterloo and another one in Hong Kong University that they're starting to look at different ways to extend and use Starfish.
Now, in terms of the industry itself, I have been contacted -- there are a couple of local companies in the research triangle area, and then I guess our biggest connection right now is [inaudible]. It's a company that commoditizes Hadoop and offers support for Hadoop -- it's a competitor of Cloudera -- and they've started to look into how to use Starfish, and we're actually looking into some possible collaborations on how they can actually use Starfish in their product and how we can actually run Starfish on some real production clusters and production -- real-world workloads and data.

Of course, Yahoo as well. Now that we have support for Pig, the next thing to do is actually give them Starfish so we can see how it behaves when it analyzes, again, real workloads expressed in Pig Latin queries. There's one more that I'm missing -- oh, and [inaudible], of course. Came out of Yahoo. Right.
Yes?
>>: [inaudible] let's say adaptive query processing or other techniques, how much does this address?
>> Herodotos Herodotou: Right. So as far as configuration settings is concerned, I really don't think there is much speed-up left, but this is only one slice in some ways of the optimization space.
Now, like I said on the very first slide, I wanted to present the work that I did by myself. So this is all work that I have done.
Now, at Duke we're actually looking into a lot -- you know, much larger picture of
Starfish. In particular, right now we're looking into how to make different
optimizations at the different levels of workloads. So we have a workload that's expressed, let's say, in Hive or Pig Latin, right?
Now, there's a lot of different optimizations we could do there. We actually broke it down into four different categories. So we could do logical optimizations, you know, traditional join rearrangements, selection push-down, those kinds of optimizations. Of course, now we could do them in a cost-based fashion by using some of the Starfish capabilities here.
The second layer is physical operator selection. Now, if you can implement the same logical operation in different ways, how do you select which one is better, right? Should I use my typical reduce side join versus a fragment replicate join, et cetera?
The third layer is how do we pack now these logical and physical operations into individual MapReduce jobs?
And there there's a lot of possibilities for merging jobs together, sharing scans, sharing data transfers, figuring out how to set up data layouts or partition the data in particular ways such that jobs that come later on to read that data actually benefit from that. So there's a lot of work there. Actually, this is work I'm working on right now with another Ph.D. student back at Duke.
And then the fourth level is the parameter space level, which is what this presentation was focused on.
So if you look at the entire stack, there's a lot of optimization opportunities and a lot of speed-up that you can get.
>> Vivek R. Narasayya: Any other questions? Let's thank the speaker.
[applause]