>> Arvind Arasu: I am very happy to introduce Shivnath Babu. Shivnath is on the faculty at Duke University and is an expert on streams and analytic systems. He has authored many papers which are heavily cited in these areas. He will be speaking now about self-tuning systems for big data. So, Shivnath. >> Shivnath Babu: Thanks, Arvind. I am going to be talking about a system that we have been building at Duke for the last couple of years called Starfish. You will see in the title that there are two terms, one of which might make sense, self-tuning, and one of which will not, MADDER. I am going to introduce these as we move along. At any conference, any place you go to, you keep hearing the terms big data and big data analytics, right? And often when we hear about that we also hear about the giants in the space, like the Googles and the Facebooks and the Microsofts, with petabyte-scale warehouses. But there are also many little players in the big data space. One example I would like to give is journalism and this whole field of computational journalism, which is something we are very interested in at Duke, to the point that there is a center for computational journalism. The idea is the following: many of these newspapers, the ones in print, are losing money big time, so funds are getting cut in things like investigative reporting. Investigative reporting is very human intensive, so the whole spirit of computational journalism is that computing can replace some of those manual tasks. The best example I can give takes us to something like WikiLeaks. WikiLeaks is a lot of data. The administration also releases data on funding and where money is actually going, things like that. So now there are a whole bunch of journalists scrambling to extract leads out of that data from which they can build stories. These are places where people don't have petabytes of data; they probably have a couple of gigabytes of data. So what makes data big? It is not only the size of the data, and a petabyte is definitely big. What also makes it big is the type of analysis you do on the data. Many people don't just want to do counting and aggregation and things like that; they might want to treat the data as a matrix and do a whole bunch of linear algebra operations, or do some machine learning, things like that. So we have been interested in big data and this space, and around two years back we were thinking: we would like to build a system for big data analytics, so what features should that system have? One of the things we spent a lot of time on is understanding what features people need from such a system, what the analysis practices for big data look like, and that is how we came up with these MADDER principles that I will introduce in a second. To establish context I want to tell you how the Starfish system actually looks. It is built on the MapReduce execution framework, and very specifically our implementation is on the Hadoop MapReduce execution engine. The MapReduce framework consists of a distributed file system; this would be the Hadoop Distributed File System in the Hadoop world.
In the Google world the distributed file system is called the Google File System, and on top of it is a MapReduce execution engine, which has a scheduler that takes care of [inaudible] run: if there is a failure, if there is a task that is running slow, things like that. Starfish is going to fit into the Hadoop ecosystem, and the main thing I want to emphasize is that when you say Hadoop it's not just the MapReduce execution engine and the distributed file system; it's a whole bunch of tools that are layered on top. A lot of people, like a lot of machine learning researchers at Yahoo for instance, like to write their MapReduce programs in Python with a nice API called the Hadoop Streaming API. A lot of people like to use more declarative languages, like Hive with HiveQL, which is an SQL-like language, or Pig, whose language sits somewhere between SQL and a MapReduce-style scripting interface. There are workflow engines, because there's not much you can do with a single MapReduce job; you usually have to tie together a whole bunch of MapReduce jobs into a workflow. There is a system called Oozie, which is a workflow manager. And there is the cloud: people have moved into that space, making it very easy to provision a hundred-node Hadoop cluster in minutes; that is the Elastic MapReduce service that Amazon provides. So this is the context in which Starfish fits, and I hope you keep it in mind as we move along. In principle we could replace Hadoop with another big data system, maybe a [inaudible] database, but a lot of the work focuses on interesting aspects of MapReduce. So what are these MADDER principles I keep mentioning? There are two ways to understand them. One way is: what are the analysis practices when people are dealing with big data, what does this analysis look like, what are its features? Or, building on that: if you are a big data practitioner and somebody gives you a big data analysis system, what features do you want in that system? We were spending a lot of time in 2009 thinking about this, and at that time there was a paper published in VLDB by Joe Hellerstein and his collaborators from Greenplum and from Fox Interactive Media. They were attacking a similar problem, trying to characterize these things. They came up with this cute acronym called MAD, and we realized that some of our thoughts were similar, although I will say there are differences which I will emphasize, so we copied that and came up with MADDER. The M and the A are similar; the first D is slightly different. So what are these MADDER principles? The first M stands for magnetic. Going back to the computational journalism example: WikiLeaks has been posted, right? As a journalist I probably have no idea what the data actually looks like, so without working with the data, I probably cannot even define the schema or any properties of the data. So the big data system should be magnetic: it should be very easy to get your data into the system and start working with it. Okay, that is what the M stands for. The A comes from agility, agile. In many of these big data settings it is very hard to predict, to have a good idea of, how the data will actually get processed, what the data formats and the requirements will be, things like that.
So systems should be agile and able to adapt to whatever changes you have, and the somewhat counterintuitive thing here is that the less the system assumes about the data or the workload, the more agile it can be. The first D stands for deep, and here is where we differ significantly from the version that Hellerstein and his collaborators had. Deep means you should be able to do deep analytics — machine learning and linear algebra and things like that — and they were doing all of this in SQL, which is actually very heavy. For us, deep also has a sense of broad, as in the ocean being deep and broad. For us deep really means: we don't want to go and tell these statisticians or journalists, look, write all your stuff in SQL because we know how to handle SQL. Rather we want to tell them: you can write it in SQL if you like; if you like Python and that's what you want to use, write your stuff in Python. Remember that I am assuming an underlying MapReduce execution framework, so it is not just [inaudible]; it's Python with an API that is going to generate MapReduce programs that will be sent to the processing system. Or if they like other languages, they can use those. So that is what we want with deep: a system that is not making assumptions about the interfaces people are using on top, like the Hadoop stack I showed just a few slides back. The next D — and this D-E-R is very specific to our thoughts, something which was not in the VLDB paper — is data-lifecycle-aware. Data lifecycle awareness is a little bit hard to get across, but hopefully it will come out through an example. This is a Yahoo web page; on any of these social network sort of pages you go to, you can see there are a lot of things in there, such as ads, news, articles being recommended; in fact, the whole content that gets generated has a whole bunch of analytics that have gone into it. And this is an architectural picture drawn from LinkedIn and Facebook and Yahoo — I bet even Microsoft probably has something like this — that is driving the content. This whole system architecture can be broken up into roughly three subsystems. There is a data serving subsystem, which has the three-tier architecture from which everything gets delivered, and NoSQL engines play in that space; LinkedIn's NoSQL engine is called Voldemort. So that is the data serving part, the one that can capture a lot of writes. Then there is the log aggregation subsystem, and by logs I don't mean the database transaction logs; I mean the clickstream and the user activity information. That information, which is generated in the data serving system, gets moved in some sort of real time into an analytics platform. It could be Hadoop; it could be HBase; there is a lot of innovation happening in that space. And what is really interesting is that the results of that analysis get pushed back, so there is a cycle happening, and that is why LinkedIn calls this a data cycle; they estimate terabytes of data are moving through the cycle. So notice that data can take these interesting lifecycles, and big data practitioners have to support them. So whatever system we come up with should not make this hard.
Okay, especially when there are multiple systems here trying to work together. One example: if I have an analytic system that takes 30 minutes to load data, even if it can process [inaudible] in 1 minute, that is probably a nonstarter if you want a 15-minute cycle latency. Depending on the application you might want cycle latencies of seconds; LinkedIn was using this for their People You May Know feature, the sort of feature most social networks have, and they were running with a six-hour cycle latency. So depending on that you have to do things differently. That is the sort of requirement, data lifecycle awareness, that comes up in big data analytics. The E stands for elasticity. There is the world of the cloud, and you may or may not believe in the cloud, but one feature people like a lot about the cloud, and one they want to have, is especially true in journalism, because you only get these big data sets every now and then: you expect the resources used by the system, as well as the costs incurred, to be roughly proportional to your actual workload. You don't want the system to just be lying idle all of the time. So this is the feature we bundle under elasticity, the pay-as-you-go nature. The last R, which you have heard a lot about but I want to emphasize again, is robustness. Anything that can go wrong is going to go wrong when you begin working with big systems and big data. One of the interesting problems we have been running into, and have some work on that I am not going to talk about, is data corruption. Things will crash and data can become corrupted. These are all things you should be able to deal with gracefully, not just fall over. So those are the MADDER principles, and we want to build a system that sticks to them. As we started thinking more about the system design and the system architecture, frankly it was these MADDER principles that brought us more to the Hadoop world, the MapReduce way of thinking, compared to a parallel database system. You realize that the focus in all of this is really on ease of use. People definitely want good performance, but what is striking about these MADDER principles is that each and every one of them makes it much harder for the system to give good performance out of the box. If you take the M and the A, the magnetic and the agility, what that often means is that the system knows nothing about the data when it is loaded, up until the point where somebody is going to analyze or run something on that data; the data can be opaque. With deep, the people who are writing these programs are sticking to workflows that connect modular subsystems together. What that means is that we are not dealing with declarative programs, declarative languages like SQL; we are dealing with Python programs, we are dealing with multi-system workflows, and things like that. For elasticity and robustness, systems have built very nice mechanisms to provide them, but what is often hard is not the mechanism — you can add more nodes so the cluster will grow and shrink — but the policy: when you eventually do that, how many nodes should you add? How many nodes should you take off?
So those are the things that become very hard: the policies. The goal of Starfish is that we want the MADDER features, but we also want the system to perform well without the user — and often in this context there is no administrator, okay? The system should be able to give good performance automatically. That is where the self-tuning aspect comes in. Starfish is about bringing these two things together. So what I am going to do in this talk is first talk about the tuning problems that arise in a big data analytics system, especially one built on a MapReduce framework; I'm going to spend some time on that and give you some examples. Then I am going to drill down into two specific problems, at the level of MapReduce jobs and all the way up to the level of sizing the entire cluster. And then I am going to show you an experimental evaluation of the system. It's available, and we are actively working on it: version 0.1 is out there; version 0.2 is being tested and should be out by the end of the month; and there is a version 0.3 coming, in collaboration with what we are doing with Yahoo and so on. It is a rapidly moving target, and I will also talk about some of the ongoing work we have in Starfish, some of the things on the wish list. So, what are the tuning problems? One thing I hope you will take away from this talk is that tuning problems arise at multiple different levels, and you don't want different tools each working on a different part; you want something that has an integrated view of all these parts. And what are those parts? There are tuning challenges in the configuration space at the individual MapReduce job level. There are challenges at the level where you stitch together multiple jobs into workflows, and some of these workflows can have 80 to 200 jobs; they are big workflows. Nobody runs single jobs. Even if you are writing an SQL query or a Pig script, it still turns into multiple jobs, a workload of multiple jobs. Then there are challenges at the level of the entire workload, for instance the challenge I was talking about in the data cycle context, or if you have multiple users and you want to preserve some sort of guarantees across the workloads that different users are running. There are challenges at the level of the data layout. People keep saying that in the MapReduce world you have no schema, you don't know anything about the data. That is far from the truth. Any time you are running a workload, you can learn a lot about the data, and there are interesting opportunities in automatically partitioning data to help with future workloads. And finally, sizing the cluster. I will give you examples of at least three of these in the next couple of slides. Before that, a quick primer on MapReduce job execution so that all of us are on the same page. A MapReduce job might begin its life as a MapReduce program that a user has written. I know you can't read this; it is a WordCount MapReduce program written in Java, sticking to the Java API provided by Hadoop. The user writes map functions and reduce functions and stitches them together using some boilerplate code into a job.
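For readers who have not seen one, the kind of program being described looks roughly like the following. This is essentially the canonical WordCount example that ships with Hadoop's Java MapReduce API, reproduced here only as an illustration, not the speaker's exact slide:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Boilerplate "driver" that stitches the map and reduce functions into a job.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");     // Job.getInstance(conf) in newer releases
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }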
So a MapReduce job consists of the program the user wants to run — and this program could also have been generated automatically by a compiler for a system like Hive or Pig Latin; it's a program — together with the data and some cluster resources on which the job is going to run, and then a configuration. I will elaborate on the configuration in just a moment. The first thing that happens is that the data is taken and logical partitions of it are defined; Hadoop calls these splits. A map task is started for each split on those cluster resources. The map task reads its split, runs the map function, does some sorting, compression and so on, generating intermediate data. That intermediate data, which is partitioned, gets shuffled to the reducers that are running. Often, if you are processing anywhere from many gigabytes to terabytes of data, the number of execution slots in the cluster will not be enough to run all of these tasks at once; you might have to run multiple waves of tasks. For example, the way Hadoop is set up, for every slave node in the cluster you can say it can run three map tasks and maybe two reduce tasks. So if you have 10 such nodes, you have 30 map task slots, and if the job runs 1000 maps you are running many waves. Similarly, the reducers can run in many waves. Let's drill down a little into individual tasks. If you take map task execution, there are many phases, what we call the task phases of map execution. It starts with the split being read from the distributed file system. Then key-value pairs are extracted and handed to the map function; this is where that WordCount-like Java program is actually running — you could write those map functions in Java or Python or whatnot. The output key-value pairs of the map then get put into a memory buffer, serialized, with an associated partition that determines which reducer they go to. When the buffer fills up to some threshold, it gets sorted, and a combine operation can be applied — pre-aggregation on the map side — so that the data shuffled to the reducers is smaller. You can then compress it, and it is spilled to the local disk [inaudible] file system. If more data keeps coming, things get spilled again; there is an external-sort kind of process that runs here, with merging and everything, and finally the intermediate output of the map is created. This is what gets shuffled when the reducers ask for it. That is a rough idea of how a MapReduce job executes, and similarly there are phases on the reduce side. So now, when you see these slides: the user wrote a MapReduce program, and that translated into some execution strategy, with some number of map tasks, some number of reduce tasks, some way of partitioning data, choices of whether the combiner, the pre-aggregation, should be applied, whether the intermediate output of the maps should be compressed and then uncompressed on the reduce side, and whether the final output should be compressed. So a lot of choices are made. Who makes these choices?
These choices, in the Hadoop world, are specified by the configuration c, which today is just an artifact of how Hadoop is engineered: it is supplied when the job is submitted, and that configuration determines the actual execution. I could choose to have 20 reduce tasks or 100 reduce tasks; I could choose to size the map output buffer differently. Do these parameters actually make a difference? You bet. There can be order-of-magnitude differences, and we have spent many thousands, tens of thousands, of dollars on Amazon EC2 trying to understand the effect of these parameters. There are around 190 parameters in Hadoop, although many of them are things like where to write a task log file; there are around 20 to 25 that can have a significant impact on job performance. And again, we are talking about the same program, same cluster resources and same data. What this graph shows — and this will give you a quick idea — is a program called Word Co-occurrence, which is used in natural language processing to find which pairs of words occur together and count those occurrences. We have taken Word Co-occurrence, same data, same cluster resources, and all we have done, as you can see on the x and y axes, is change two parameters. io.sort.mb specifies the size of that map-side buffer I showed you on the map task side. The other one, most people who have used Hadoop have not even heard of; it controls the amount of space in that map buffer used to store metadata, because we are dealing with keys and values that could be of arbitrary [inaudible], and we want to store some metadata for every key-value pair that is put in. (A sketch of how such parameters are set at job submission appears below.) The z-axis shows running time; we ran the program many times and are showing a [inaudible] surface of how performance varies. Here the variation is not that high, but I can show you other surfaces where the variation is an order of magnitude or even higher. Ideally we would like a choice of these parameters such that you hit the blue region, but in this example the typical strategies people use to select these parameters land in the red region, the really bad region. Essentially, that is the kind of problem: if you want to tune a job [inaudible] somehow automatically, you must pick parameters from that blue region. So what do people do today? I have cut and pasted two quotes — yes? >>: On the previous slide, can you explain why in this type of environment [inaudible] more memory would always improve performance? >> Shivnath Babu: Very good question. I chose this example for a very specific reason, and it will become apparent in the very next slide. Let me take that question gradually; let me answer it in the talk? That is part of the reason why I chose this one. Any other questions? We will start getting part of the answer right on the next slide. So what I am showing you here: I went to the Pig site, apache.org/pig, the Pig Journal, and cut and pasted two sentences from the Pig to-do list, their roadmap. I know you can't read this as well as I can.
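Just to make concrete where these knobs live: they are ordinary key-value properties on the job's configuration object, set before submission. A hedged sketch follows; the property names are the old, pre-0.21 Hadoop names of the era discussed in this talk, the values are purely illustrative rather than recommendations, and the metadata-fraction parameter on the plot's second axis is presumably io.sort.record.percent:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitWithConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job-level knobs of the kind discussed above (old, pre-0.21 property names).
        // The values here are illustrative only, not tuning recommendations.
        conf.setInt("io.sort.mb", 100);                       // map-side sort buffer size, in MB
        conf.setFloat("io.sort.record.percent", 0.05f);       // fraction of that buffer kept for per-record metadata
        conf.setBoolean("mapred.compress.map.output", true);  // compress map output before the shuffle
        conf.setInt("mapred.reduce.tasks", 40);               // number of reduce tasks

        Job job = new Job(conf, "word co-occurrence");
        job.setJarByClass(SubmitWithConfig.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc., as in the WordCount sketch above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }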
What they are saying is that people are aware of this problem: there are all of these configuration parameters, and they are also aware that one single configuration, one size, does not fit all. For different jobs you want different configurations. If one configuration, one set of parameters, were the best for everything, then we wouldn't have a problem to solve here. Focus on the last three lines of that first quote. That shows how people go about attacking this problem today: rules of thumb. What that rule of thumb says is: if you have a MapReduce program that doesn't have any memory-intensive operations — it's not caching any big tables or anything in the map phase — and it has a combiner, the pre-aggregation done on the map side so that the data shuffled to the reduce side gets smaller, then, if those two conditions hold, set that map-side buffer as high as possible. This is very I/O-oriented thinking, and the idea is that if you do that, the combiner will be more effective, because it can group many more items together, and the number of spills to disk will be reduced. So, do that. And this setting gets you exactly into that red region [inaudible]. A quick insight will show you why this happens. The setting we are running on is Amazon EC2, and Amazon EC2 gives you a whole set of different node types, which I will get to on the next slide. There are nodes which do not have that much CPU, not many compute cycles, so running all of this combiner work causes CPU contention. You are much better off not having large buffers: smaller buffers but better utilization of the CPU and I/O, even at the cost of shuffling more data, and that gets you into the blue region. Those are the trade-offs, and any time you have these kinds of trade-offs, that is when the tuning problem becomes interesting; otherwise it's probably a monotonic surface. Notice that there is a dip, right? Finding that point becomes a tuning problem; if the surface were monotonic, things would be very easy. The second point I want to get to is the second quote here. The way many people are thinking, even at Yahoo and Facebook — I'm sure at Microsoft you have [inaudible] and everything — is that you would rather people not think at the level of MapReduce jobs; not everybody can write MapReduce jobs, right? You would rather have them write in Pig Latin or SCOPE or something like that. But when I am writing a Pig Latin script, I have no idea how many MapReduce jobs, or what MapReduce jobs, get generated. If I don't even know that, how do I go about tuning those parameters for all of those jobs? That is basically what the quote is saying. Great. So that was the job level, and I have slowly gotten into the workflow part. Now let me move up to workloads and cluster sizing. One very nice thing that the Amazon cloud, and cloud computing in general, has made possible is that if I want to allocate a 100-node cluster, I can simply go [inaudible] a hundred-node Hadoop cluster on Amazon. Three or four years back I would have had to talk to my system administrator, go through some sort of [inaudible] and sizing, all of those things; it could have been days to months of delay before something like this could happen.
But as the cliché goes, with great power comes great responsibility, right? So where is the responsibility here? Amazon actually has many different types of nodes; I am showing 6 of them here, and there are 11 node types and many different pricing schemes. In this picture, each row shows a node type, and the second, third, and fourth columns show how it is characterized in terms of resources, often in abstract units. The c1.medium node has 5 EC2 compute units, for some definition of a compute unit, and its I/O performance is moderate; the cc1.4xlarge has 33.5 compute units and high I/O performance. And there is a per-hour cost you incur for each of these node types. When I have gone and given this Starfish talk to Hadoop users in the Triangle area where I am from, one of the kinds of problems I hear is: I have this workload that runs in 5 hours; I would like to cut it down to 1 hour, so what should I do? I am trying to illustrate some of the choices you might have in such a setting. What you see here are two graphs. The x-axis is the same on both: the Amazon EC2 node type, 5 of the node types I just showed on the previous slide. We are taking a Hadoop workload that consists of, I think, 6 applications, something like 10 jobs. The y-axis on the first graph shows the completion time of that workload; the second shows the cost incurred to run it. If you focus on what I have shown here: if you use m1.xlarge nodes, you can complete the workload in about two hours, but you pay more. Using c1.medium nodes the workload completes in four hours; it takes more time, but at a correspondingly 40% lower cost. And many of these applications in the big data world are batch sorts of jobs. Or [inaudible] was calling, you know, the [inaudible]. >>: [inaudible]. >> Shivnath Babu: I think these are all run with the same configuration. The configuration is based on the rule-based settings for Hadoop, as I will explain when I show the experimental results. So this is a simple example of what people can do today with different types of nodes and different clusters, and the costs these will incur. >>: [inaudible] what is the source size of the example [inaudible] large [inaudible] available? >> Shivnath Babu: I was going to get to that in the experimental results section, but since you asked: I purposely hid those details, just because not everyone would be familiar enough for me to introduce them at this point. You can basically see the numbers here. Amazon EC2 has the Elastic MapReduce service, and for each of these node types they come up with some recommended numbers for starting a Hadoop cluster. We don't use Elastic MapReduce for these experiments; this is our own harness that we have set up for provisioning a cluster on Amazon EC2, and these are the numbers we have empirically determined are good for the sort of workloads we are running. For instance, for c1.medium — and again, going back to that question — as I mentioned earlier, when you are setting up a Hadoop cluster, you configure every slave node with some number of map slots and some number of reduce slots.
That is the number of concurrent tasks. For c1.medium I set it up with two map slots and two reduce slots, which means that node can run two map tasks and two reduce tasks concurrently. So if you have 10 nodes, the wave size is really 10 times the per-node map slot count. If the user's requirement were really that the workload has to finish by 6:00 a.m. Eastern time, then we could recommend different choices to them: we can take the cost and running-time requirements from them and come up with different options. Today people solve this problem manually. So this would be a tuning problem at the level of the entire cluster. And the question that was asked — how many map slots and reduce slots should run on one node — is also a tuning parameter. If you have light jobs, you probably want to run many of them together; if you have very heavy jobs, you probably want to run few of them, because they are all contending for the same set of resources on the node. Hopefully I have convinced you that there are challenges at the level of MapReduce jobs, workflows, and workloads — I didn't expand on data layouts, but you can probably believe me there — and at the level of the cluster. On this one slide I am trying to illustrate, in general, how we think about approaching these tuning problems in one unified fashion, and more importantly what motivates the architecture of the Starfish system. There are three components. The first one is the Profiler, and the job description of the Profiler is to collect concise summaries of execution, at the level of jobs, of workflows, of workloads; notice that this is about actual executions happening and being observed. What it collects are the profiles. These profiles then get fed into what we call a What-if Engine, whose role is to estimate the impact of hypothetical changes — the impact on execution if certain changes were made. So this guy will be given a profile and asked: look, that profile was collected with 20 reduce tasks; what would happen if I had 40 reduce tasks and turned on map output compression, or if I had 40 reduce tasks and the data size were to increase by 20%? So it is a form of what-if questioning and answering: given an execution, you ask, if a change were to happen, how would performance change? And there are different ways to characterize performance — maybe it's completion time, maybe it's resource utilization, because if we are running multiple jobs together we have more of a throughput focus. That is the job of the What-if Engine. And then we have a whole suite of optimizers on top. The job of an optimizer, for whatever tuning problem you have, is to enumerate the space being considered, make appropriate calls to the What-if Engine, or to the Profiler to generate profiles, and then come up with a good answer to recommend. (A sketch of this division of labor, as hypothetical interfaces, follows below.) This leads naturally into the Starfish architecture. As you see, there is a data layer, there is a job layer, there is a workflow layer, and there is a workload layer which has the workload itself and the cluster sizing components; the Profiler and the What-if Engine cut across all of these. What I am going to do next is drill down a little bit into the job tuning problem.
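One way to picture that division of labor is as a handful of Java interfaces. These names and signatures are illustrative only, not Starfish's actual API; they are just a sketch of who asks what of whom:

    // Hypothetical interfaces sketching the division of labor just described.
    // None of these names come from the actual Starfish code base.

    /** Concise summary of one observed execution: data flow plus cost statistics. */
    interface JobProfile {
      double mapSelectivity();                  // e.g., map output bytes / map input bytes
      double avgMapPhaseSeconds(String phase);  // read, map, collect, spill, merge
    }

    /** Hypothetical descriptions of input data, cluster resources, and a configuration. */
    interface DataSpec { long inputBytes(); }
    interface ClusterSpec { int nodes(); int mapSlotsPerNode(); int reduceSlotsPerNode(); }
    interface ConfigSpec { int reduceTasks(); boolean compressMapOutput(); }

    /** Profiler: observe a real run (possibly sampling tasks) and produce a profile. */
    interface Profiler {
      JobProfile measure(String jobName, double taskSampleFraction);
    }

    /** What-if Engine: given a measured profile, predict performance under
     *  hypothetical data d2, resources r2, and configuration c2. */
    interface WhatIfEngine {
      double estimateCompletionSeconds(JobProfile measured,
                                       DataSpec d2, ClusterSpec r2, ConfigSpec c2);
    }

    /** Optimizer: enumerate a space of candidate configurations, query the
     *  What-if Engine for each, and recommend the best one found. */
    interface Optimizer {
      ConfigSpec recommend(JobProfile measured, DataSpec d2, ClusterSpec r2,
                           Iterable<ConfigSpec> candidates, WhatIfEngine whatIf);
    }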
Not because I think that is the most important problem, but because I want to give you an internal view of how these three components work together to solve a real problem. Here is an abstract description of the problem: you are given a job to be run — a program, resources, and data — and the choice to make is a configuration that optimizes some [inaudible] metric of performance; for now just assume it is completion time. The job of the Profiler in this context is to collect a profile from that same program, possibly run on a different data set, different resources, or a different configuration. We support multiple modes here. One very common case at companies like Yahoo and Facebook, which are big users of Hadoop, is running so-called periodic MapReduce jobs on their daily data, their hourly data, and things like that. Facebook runs around 1000 of them, I think that is the number I heard; Yahoo runs many more. So you could collect a profile where d1 is yesterday's data, and we are asking what the best configuration is for today's data; often things don't change much from one day to the other. Another mode we support is running a sample job to collect a profile, so d1 would be a sample of d. Then you collect the profile and use it for optimizing [inaudible] the job. And the reason we can do this is that the What-if Engine, given an actual execution profile, can answer questions for any different d2, r2, and c2. That is used by the optimizer: it searches through the space S I mentioned to find a good configuration to recommend. There are multiple different types of optimizers. We have focused on writing optimizers that minimize completion time, but we could also build some sort of robust optimizer, one that ensures that no matter how the profile was collected — maybe when the cluster was heavily used, or collected on a [inaudible] cluster — the recommendation is still a robust strategy. There are different options here, but I am going to talk about the completion-time optimizer. Hopefully this gives you a feel for the different types of challenges that arise. Let's move along a little further. What is this profile I keep talking about? A profile, as I mentioned, is a concise description of a job or program execution. This is the picture I showed you earlier of an actual job execution; let me drill down on this slide again. The profile captures information at the level of individual phases of task execution. The map task has these five phases: read, map, collect, spill, and merge. The map phase is where the actual map function is invoked; the rest is framework code that runs around it. You can think of the Profiler as collecting two types of information: data flow information and cost information. Data flow is the same kind of thing you might be used to if you use a database system, right? For data flow: how many records and bytes are flowing through.
And we can collect some more things specific to the MapReduce setting — how many rounds of merges are happening, how much spill is happening — but all specific to data flow. If you change the resources, these things will not change. Cost is more like how much time is being spent in each of these phases and tasks, and here I am showing examples like the time spent reading in the map phase. There is some more to it; let me give you some intuition. This is a screenshot from the Starfish visualizer. What you are seeing is a visualization of a profile collected for a MapReduce job with — I don't think all of you can read it — about 326 map tasks and 27 reduce tasks. Each blue rectangle shows the running time of a map task. There are many map tasks and some reduce tasks that run. So this is a simple kind of cost data you can collect based on the profile. It can also collect data flow data; this again is a screenshot from the visualizer, where we can see how much data was sent from one map task to the different reduce tasks, or from one node in the cluster to other nodes. There are some interesting things here: you can play a video of how the job progressed, what really happened — maybe there was a time when a lot of data was getting sent. There is a demo coming up at VLDB at the end of the month; please check it out. So, cost and data flow. But quickly you realize: that was around 300 map tasks and 26 reduce tasks; if we're dealing with terabytes of data, we are dealing with 10,000 map tasks and 500 reduce tasks. That information quickly becomes overwhelming; visualizations like this stop being enough on their own, and we probably want to show summaries or statistics about the data. So what I am showing here is the skew of the map outputs. I know you can't read this because it's a screenshot from the visualizer, but that is fine; you can just follow what I am saying. What it shows is that a bunch of map tasks each send around 22 MB of data, and some of them send around 46 MB, so this is a histogram. We can take all of that raw data and start extracting sensible information from it, because we get these samples from every map task and every reduce task — and similarly for cost. What I am showing here is, for all the phases of map task execution and reduce task execution, we have taken the [inaudible] job and aggregated it into one representative map profile and one representative reduce profile. So for this job, on average a map task spends around 10 seconds reading, around 35 seconds in the map function, then the spills, and similarly for the reduce tasks. So we have taken all of that raw data and, starting from the top, extracted sensible information and properties from it. That is what completes the profile; in fact, these are the most important fields of the profile that we are going to use as inputs to the [inaudible] as it makes decisions about what happens when the data size changes, or what happens if we go from medium nodes to large nodes, things like that. As an example, the selectivity of the map function: every map task reads some input data and [inaudible] of the data.
So we can come up with a selectivity for every map task, and we have a distribution across all map tasks. The previous slide was taking all of that and representing it by a single average value, but if you want some sort of robust optimizer, then you probably want to take the entire distribution into account. All of that is preserved in the job profile. It is a concise description of the entire job execution that can be used for reasoning and [inaudible]. Where do these profiles come from? In Starfish they come in two ways. One is by measurement, which is what I was showing earlier: the job [inaudible] actually runs and we go and collect the profile information. The other, which I will expand on a little later, is by estimation, and the entity making the estimates is the What-if Engine. So let's get into how profiles are collected by measurement, because this is a challenge we had to spend a significant amount of time on: we imposed two practical restrictions on what we can do to collect the profile. The first restriction is that we want to make no modifications to the Hadoop code. This is for a purely practical reason: the more changes you propose to the code — if you say that to run Starfish you have to install these five patches — people are probably not going to go there. Rather, the approach we wanted was that you can use Starfish without making any changes to the Hadoop code; Hadoop stays exactly the same. And from the user's perspective, we don't want to tell users that they have to implement these five things in their code before we can profile it: zero changes to user code. The way we manage to pull this off is by using a technique called dynamic instrumentation. The idea is that some event-condition-action rules are specified. The events I am talking about are events that occur during program execution: entering a function, exiting a function, an exception, a timer event, things like that. You can have conditions associated with these events, and if the conditions hold, some actions get taken — maybe recording all the bytes flowing through, or recording function parameters, or the time it takes to complete a function, things like that. These event-condition-action rules are themselves written in Java, because we are using a tool called BTrace. They get compiled to bytecode and then dynamically added to the Hadoop code that actually runs, as I showed in this animation. They get in there, and when we want, we take them off again; then they are gone. So the profiling overhead when this is not turned on is zero; when it is turned on, there is some overhead to collect the information; and when we take it off, it goes back to zero. We use BTrace because it is open source, not because it is the best, and I will show you some examples of why we might want to go with a slightly different tool to replace BTrace. But first I want to show you this picture so that all of this makes sense.
We start off with the tasks that are running. Hadoop being in Java, the way your tasks get run is that a JVM gets spawned and the task runs in that JVM — the map function, the sorting, all of the framework code happens in that JVM. We write the event-condition-action rules; suppose we want to measure the time being spent in the map function as opposed to the time being spent in the spills. The bytecode generated from our rules gets put into the JVMs using standard JVM instrumentation APIs. We collect the raw profile data, and it goes through the sequence of aggregations and [inaudible] property estimation to come up with the job profile. When you see something like this, what you might be concerned about is overhead: when I run the job without profiling versus with profiling turned on, what is the overhead? It turns out — and I will show you some real numbers — that profiling overhead can be significant. There are two types of overhead: compute cycles being eaten up by the code we added, and also I/O. What I am going to show you here is how we propose to deal with that problem. We support two modes of sampling for profiling in Starfish. The first mode is that you can turn on profiling for only a random subset of the tasks, maybe 10% of them, so the other tasks are not profiled at all. Even better, we can run only the tasks we need to profile: this is the sample-job notion I described, where we run only a sample of the map tasks and some number of reduce tasks. Of course, when you use sampling, you expect the fidelity of the information to be lower, but I am going to show you results indicating that, for all the workloads we ran, around 10% task sampling was good enough to get to the best recommendation we could get. >>: You mentioned the purpose of profiling was to [inaudible]. But if that [inaudible] Hadoop [inaudible] dependent on a particular set of function [inaudible]. >> Shivnath Babu: Great point. Yes, these event-condition-action rules are dependent on the actual Hadoop function calls themselves, because we are saying, look, when the spill function is entered — and this is a specific function name — start the timer, right? So it's true that our rules will have to be updated when the Hadoop code itself changes. But what we are really talking about is core Hadoop code, and over the last couple of releases that hasn't been changing much. Still, that is the trade-off. When I gave this same talk at Yahoo and showed them what we can do with the extent of information we bring into the profiles, they were thinking maybe it's a good idea to put some of these hooks into Hadoop itself. I think that is a path for making impact by the other route, which is why we went this way first. Yes? >>: [inaudible] information, why didn't you just parse the logs? >> Shivnath Babu: Great point. I showed the information in the profiles, the data flow statistics and the timings. Probably around half of it, especially the data flow part, comes from the logs; the stuff that is in the logs we don't add extra instrumentation to collect. But the timings at the task-phase level we had to add. Okay.
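For flavor, an event-condition-action rule of the kind just described might look roughly like this in BTrace. This is a sketch, not a rule taken from Starfish: the probed class and method names are placeholders for whatever internal Hadoop methods one chooses to time, and the script obeys BTrace's usual restrictions (static methods and fields only):

    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    // An event-condition-action rule in the BTrace style, measuring the time spent
    // inside one (hypothetical) Hadoop method without modifying Hadoop itself.
    @BTrace
    public class SpillTimer {

      // Running totals accumulated by the action.
      private static long totalSpillNanos;
      private static long spillCount;

      // Event: returning from a spill-related method (placeholder probe point).
      // Action: add the time the call took; @Duration is supplied by BTrace at RETURN.
      @OnMethod(
          clazz = "org.apache.hadoop.mapred.MapTask$MapOutputBuffer",  // placeholder class
          method = "sortAndSpill",                                     // placeholder method
          location = @Location(Kind.RETURN))
      public static void onSpillReturn(@Duration long durationNanos) {
        totalSpillNanos += durationNanos;
        spillCount++;
      }

      // Timer event: periodically report what has been collected so far.
      @OnTimer(5000)  // every 5 seconds
      public static void report() {
        println(strcat("spills so far: ", str(spillCount)));
        println(strcat("total spill time (ms): ", str(totalSpillNanos / 1000000)));
      }
    }

When such a script is attached to a running task JVM, the generated bytecode is woven into the probed methods, and detaching it removes the instrumentation again, which is the zero-overhead-when-off property described above.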
So that was the Profiler: the sorts of profiles we have and how we collect them. Now, coming to the What-if Engine: given a profile for the program we want a good recommendation for, and given some hypothetical data properties, cluster resources, or configuration, the goal of the What-if Engine is to estimate the properties of that hypothetical job execution without actually running it. That consists of two parts. One is what we call the Job Oracle, which takes this information and — I am going to show you how, with a set of models and simulations — comes up with a virtual job profile for the job as it would run. Ideally, this virtual job profile should be exactly the one we would have collected if the job with p, d2, r2, c2 had actually been run and measured using our measurement-based approach. Yes? >>: [inaudible] just one [inaudible] job [inaudible] the next one? >> Shivnath Babu: Great. That is something we are doing; one of the things we are pushing for is continuous profiling. Every time a job gets run on the cluster, we will measure 10% or some small portion of the tasks, and we are going to put all of that into some sort of profile warehouse. Then this input is chosen by Starfish itself: for every job we have a tagged profile, and if you want to make a what-if call or an optimization call, you give the tag for it. But it's a very interesting problem: how do I keep a whole bunch of profiles, and maybe pick a profile where the data properties were very similar to what the data properties are now? Then it makes sense to use that one, or to give a whole bunch of profiles, which might lead to some changes here. So that is something to think about; we aren't there yet. That is part of the future work I will tell you about. Yes? >>: In the What-if Engine you never change p, but the [inaudible] consists of one MapReduce job. If you choose the map key or partition key, your performance changes by an order of magnitude. So how are you going to contend with that? >> Shivnath Babu: Great. I must clarify that, because I want this to be end to end, I am currently talking about an individual job; this is like version 0.1 of Starfish. We now support profiling and what-if analysis and everything for workflows as well, and what will happen is that each of these will be different: there is an optimization layer that will deal with those logical optimizations — eventually I will show you a slide on that — where we can choose different partitioning keys, [inaudible] jobs, and then make the appropriate calls to [inaudible]. We do that, but here I am being very specific to a single job. The same ideas are [inaudible]; only now we are talking about the workflow level, a larger optimization space, and more ways to generate those programs and make calls to the What-if Engine. Did that answer your question? >>: Kind of. Because even if you find an optimum for these individual jobs, you can point to [inaudible]. >> Shivnath Babu: We are dealing with that problem.
I am not showing it on this slide, but the short answer is: remember the slide with the Starfish architecture, the picture with the Profiler, the What-if Engine, and the layer of different optimizers? The optimizer here was for a single job, but we also have a workflow optimizer, and that workflow optimizer deals with those things: it picks different configurations for different jobs, and it looks at two jobs together along with the data in between, so that is the unit at which it makes its choices. >>: [inaudible] you show the data to be skewed based on [inaudible]. Even within a single job you can choose different partitions, which means you are changing… >> Shivnath Babu: Right. In this example I am assuming that the MapReduce program is fixed, so the partitioning keys are also fixed. We do have an optimizer that can deal with that — one concrete example we handle is where the partition key is really a,b together; I can get that in multiple ways, for instance partition on a and then sort the partitions on b, which brings the a,b combinations together, and there are many logical choices there. Here I am really talking about the physical choices; the program is the same. But I will be happy to talk about this offline; it is something we are actively working on. These are things you have to deal with once you start moving up the stack to the Pig level and the workflow level. So the main task is to come up with this virtual profile, and once we have the virtual profile, we have chosen to decouple the step of estimating the actual properties — like completion time and resource utilization — because maybe we are dealing with many jobs and have a throughput focus; so we decouple these two steps. Let me elaborate a little on how the virtual profiles get computed. Again, the job is: we are given a d2, r2, c2, and we have to estimate the virtual profile for that. On this one slide I am going to show what could take a couple of slides, and hopefully the earlier questions get resolved. There are a whole bunch of details; it took us a major part of a year to come up with this. The problem is: we are given a profile for the job j — and as I said, it is really the data flow statistics and the cost statistics that we use; the raw data flow and cost fields are primarily for visualization and things like that — and, given the properties of the data, resources, and configuration, we have to estimate the corresponding fields in the virtual profile. First, the data flow statistics. There is not too much innovation we have been able to bring here; in databases we have focused a lot on this part. The challenge in this setting is that sometimes you just don't know what p is; you don't know if it is actually a join that is [inaudible]; we don't know that. So in those contexts we use simple assumptions, like a dataflow-proportional-to-input assumption: the data flow is proportional to the input data. If you have more information you can do better, and you can have more information as you go up the stack.
In the work we are doing with the Pig folks at Yahoo, we are adding statistics and coming up with better ways to estimate the data flow statistics. So I don't have too much to say there beyond the standard stuff, but we do have some interesting things to say about coming up with the cost statistics — the time for doing I/O, the time for shuffling data over the network, per byte or per record; that is what goes into the cost statistics. They are primarily determined by the cluster resources, whether it is c1.medium nodes or m1.large nodes. We tried more white-box, analytical models, but we have come up with a different technique that seems to work best; we call it relative black-box models, and the idea is pretty simple. The high-level problem is: if you give me an observation of how a job runs on one cluster, you are going to ask me how it will run on a different cluster. The way we do it is to build a relative model. We take a whole suite, a benchmark, of jobs — and there is a lot of intuition in how you pick that benchmark — and run it on both clusters, maybe a test cluster and a production cluster, or the medium nodes and the large nodes. We run that, collect training data, and build a model; that is the main mechanism, but frankly, if you build any reasonable regression-style model, that is good enough. What is more interesting is how you pick these jobs. I don't have time to go into it, but this work is going to be presented at the ACM Symposium on Cloud Computing at the end of October. The main idea is that you could take a sample of jobs from the workload, or you could pick benchmark jobs by hand — say, sort, which is heavily I/O intensive, or text analytics, which is CPU intensive. That is one approach. We have come up with a better way, which involves running a synthetic job that internally creates different resource-usage patterns, so you can generate this training data very efficiently, without any understanding of what the actual workload on the cluster is. That sounds like an ambitious claim, but we are able to pull off something like that. Once we have the model, given a profile on a source cluster, I can predict the cost statistics for the target cluster. That is the idea of the relative black-box models. So that was the story for the cost statistics. And then we have done a whole bunch of studying and modeling of MapReduce execution to take the data flow statistics and generate the actual data flow — the number of spills that happen, the number of bytes that go through — which depends on the configuration, like the I/O buffer sizes and things like that. Right now these are all models very specific to Hadoop, because Hadoop is our underlying execution engine; if the execution engine changes, we have to update them. This might seem simple and intuitive in retrospect, but it has taken us a lot of time to find the right modeling technique for each of these profile fields. So that was the profiling and a very quick overview of the what-if analysis.
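To give a feel for the flavor of this reasoning — and this is only a toy illustration, not the models Starfish actually uses — here is a back-of-the-envelope what-if calculation: scale the measured data flow under the dataflow-proportional-to-input assumption, then combine it with made-up per-byte cost statistics to predict a completion time for a hypothetical input size and slot count. All numbers below are invented for illustration:

    // A deliberately simplified what-if calculation, for intuition only; the real
    // What-if Engine uses detailed per-phase models plus relative black-box models
    // across cluster types.
    public class ToyWhatIf {

      /** A few fields of a measured profile (all values made up for illustration). */
      static class Profile {
        double mapSelectivity;     // map output bytes / map input bytes
        double readCostPerByte;    // seconds of read cost per input byte
        double mapCostPerByte;     // seconds of map-function cost per input byte
        double shuffleCostPerByte; // seconds of shuffle cost per map-output byte
        double reduceCostPerByte;  // seconds of reduce cost per map-output byte
      }

      /** Predict completion time (seconds) for a hypothetical input size and slot counts,
          under the "data flow proportional to input" assumption. */
      static double predictSeconds(Profile p, long newInputBytes,
                                   int mapSlots, int reduceSlots) {
        double mapOutputBytes = newInputBytes * p.mapSelectivity;
        // Total map-side and reduce-side work, spread evenly over the available slots.
        double mapTime = (p.readCostPerByte + p.mapCostPerByte) * newInputBytes / mapSlots;
        double reduceTime = (p.shuffleCostPerByte + p.reduceCostPerByte) * mapOutputBytes / reduceSlots;
        return mapTime + reduceTime;
      }

      public static void main(String[] args) {
        Profile p = new Profile();
        p.mapSelectivity = 0.4;
        p.readCostPerByte = 2e-9;
        p.mapCostPerByte = 6e-9;
        p.shuffleCostPerByte = 8e-9;
        p.reduceCostPerByte = 4e-9;

        // What if the input grows to 60 GB and we run on 30 map slots and 20 reduce slots?
        long newInput = 60L * 1024 * 1024 * 1024;
        System.out.printf("predicted completion: %.0f s%n",
                          predictSeconds(p, newInput, 30, 20));
      }
    }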
Now the job optimizer: given the program, the new data, and the resources, it has to find the best configuration. This is basically a search in a high-dimensional space. There are 20 or 25 parameters in Hadoop that can have a major impact; we currently work with 14 of them, which we think are the most important. You could potentially grow that to a larger number, but we focus on the ones that are specific to jobs, as opposed to cluster-wide parameters like the number of threads for reading from the file system. So this is a search problem, and a naive way of doing it, applying a simple technique like gridding and making a What-if call for every point, is going to take a long time to run. There are two insights that got us to optimization times on the order of seconds, or even smaller. First, we exploit the intuition that you can take all of these parameters and partition them into subspaces such that we can come up with a near-optimal configuration for one set of parameters that does not change for whatever setting you have for the other parameters. The intuition comes from the fact that MapReduce execution goes in phases: the map tasks run, then there is a shuffle, and the reduce tasks run. We can come up with an almost optimal configuration for the map tasks which remains fine for whatever configuration you come up with for the reduce tasks as well. That is the first part, and then within each subspace we use a technique called recursive random search, which I would be happy to talk about offline. We have tried a whole bunch of techniques; there is not much new engineering and not many new insights there. So that brings me to the evaluation. We have been releasing the system: Starfish version 0.1 was released early this year. Version 0.2 is essentially out already, but we are debugging it and working with the Yahoo folks to get some insights from their big clusters before we announce it. And there is a version 0.3 to support Pig Latin that is coming sometime later this year. In evaluating a system like Starfish we want to back up some of the claims we have made, for example that Starfish actually performs much better than other techniques or rule-based approaches you can come up with, and I will show you how. An experimental evaluation involves choosing which cluster you are going to run on, and on Amazon EC2 there are so many choices; what sort of workload you are going to use; and what sort of data you are going to use. I am going to focus on two kinds of experiments. One uses c1.medium nodes, and I want to emphasize that each of these nodes has around five EC2 compute units: two cores, each core having about 2.5 compute units. I am also going to show you results on beefier nodes, the ones Amazon targets at high-performance computing, where each node has about 33.5 compute units. Those last few numbers relate to how you configure Hadoop, as I explained earlier: every task slot is configured with a maximum amount of memory that the task can use, and that becomes the JVM heap-size setting.
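Here is a minimal sketch of the kind of search loop being described: recursive random search within one parameter subspace, where whatif_cost stands in for calls to the What-if Engine. The parameter ranges, shrink schedule, and toy cost surface are illustrative assumptions, not Starfish's implementation.

```python
import random

def recursive_random_search(param_ranges, whatif_cost, samples=20, rounds=5, shrink=0.5):
    """Sample configurations uniformly, then repeatedly re-center a shrinking
    search box around the best configuration found so far.
    param_ranges maps parameter name -> (low, high); whatif_cost returns an
    estimated running time for a configuration (a stand-in for the What-if Engine)."""
    def sample(ranges):
        return {p: random.uniform(lo, hi) for p, (lo, hi) in ranges.items()}

    best = min((sample(param_ranges) for _ in range(samples)), key=whatif_cost)
    ranges = dict(param_ranges)
    for _ in range(rounds):
        # Shrink each range around the current best point and re-sample.
        ranges = {p: (max(lo, best[p] - shrink * (hi - lo) / 2),
                      min(hi, best[p] + shrink * (hi - lo) / 2))
                  for p, (lo, hi) in ranges.items()}
        candidate = min((sample(ranges) for _ in range(samples)), key=whatif_cost)
        if whatif_cost(candidate) < whatif_cost(best):
            best = candidate
    return best

# Toy usage: two map-side parameters and a fake cost surface in place of the What-if Engine.
ranges = {"io.sort.mb": (50, 500), "io.sort.spill.percent": (0.50, 0.95)}
fake_cost = lambda cfg: abs(cfg["io.sort.mb"] - 200) + 100 * abs(cfg["io.sort.spill.percent"] - 0.8)
print(recursive_random_search(ranges, fake_cost))
```

The subspace decomposition matters because it lets a loop like this run over a handful of map-side parameters and, separately, over the reduce-side ones, instead of over all 14 dimensions at once.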
For the workloads and data, we have tried to choose jobs from different domains: natural language parsing, text analytics, graph processing, and things like that. We also work with different amounts of data; thirty gigabytes is one setting we work with, and I will also show you some results with one terabyte of data. In this chart the x-axis shows different jobs: Word Co-occurrence, WordCount, TeraSort, LinkGraph, and Join. Of the three bars you see for each job, the blue bar shows the time with the default settings that Hadoop comes with. Here we are using around 30 gigabytes of data, and the cluster is 15 c1.medium nodes, which is about 75 EC2 compute units. The default is the blue bar, and the y-axis shows the speedup over that default. For the rule-based optimizer: there is no real rule-based optimizer for Hadoop, so what we have done is capture the way people do these things today, which is by using rules of thumb. Companies in the Hadoop space such as Cloudera have published rules of thumb, and Yahoo has come up with rules of thumb. We have taken those, and as you saw in the example I gave, they are often guidelines: they say that if you have a job with these [inaudible], set this parameter to this value. We turned them into concrete numbers; most of them involve memory, so we set those numbers based on the maximum Java heap-size setting and things like that. The main reason for including the rule-based optimizer is to give you a feel for what cost-based optimization can add. The green bar is obtained as follows: we run the job using the rule-based optimizer's settings, take the profile from that run, feed it into Starfish, and it comes up with a suggestion; then we run the job with that configuration. As you can see, the default is really bad, and at terabyte scale it would not even have been reasonable to show you the defaults. The real surprise is that the cost-based optimizer is able to come up with settings that are as much as twice as good as the rule-based settings, which we didn't really expect. So why was that happening? I am going to show you some insights based on the actual profiles, which will also answer the question that came up earlier. This is for the WordCount job from the previous slide. Setting A is the configuration given by the rule-based optimizer; B is the setting given by the cost-based optimizer, which as you saw was about twice as fast. What we are visualizing here are the profiles from those runs: these are the map profiles and these are the reduce profiles. You can see that the map tasks took around 330 seconds with the rule-based setting, and with the cost-based setting that time was cut in half. That is where the benefits came from. Interestingly, for the reducers the running time is higher, but notice the scale of these axes: the reducers run in maybe 10 or 12 seconds, while the maps take something like 5 minutes to run. When we explored further, what was happening was that the rule-based settings, the same ones I showed you earlier, basically said: use a large [inaudible], minimize spills, use the combiner, and so on. What ends up happening on those c1.medium nodes is that this creates a big bottleneck in the CPU.
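For a feel of what such memory-driven rules of thumb look like, here is an illustrative sketch. The parameter names are real Hadoop 0.20 settings, but the specific values and the rule structure are made up for illustration; they are not the exact rules Starfish was compared against.

```python
def rule_of_thumb_config(has_combiner, task_heap_mb=1024):
    """Illustrative rule-of-thumb style configuration driven mostly by the task
    JVM heap size, in the spirit of published Hadoop tuning guidelines."""
    config = {
        "mapred.child.java.opts": "-Xmx%dm" % task_heap_mb,
        # Give the map-side sort buffer a large share of the heap to minimize spills.
        "io.sort.mb": int(task_heap_mb * 0.4),
        "io.sort.spill.percent": 0.80,
        # Compress intermediate data to reduce shuffle volume.
        "mapred.compress.map.output": True,
    }
    if has_combiner:
        # Accumulate more in memory when a combiner can shrink the data before spilling.
        config["io.sort.spill.percent"] = 0.90
    return config

print(rule_of_thumb_config(has_combiner=True))
```

The point made in the talk is that rules like these encode generic advice (large buffers, fewer spills, always combine and compress) that can backfire on CPU-constrained nodes such as c1.medium.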
We actually saw that from the profiles. Essentially the cost-based optimizer came up with a smaller setting, and when you reduce the size of the [inaudible] buffer you are going to write more data out and shuffle more data, but that turns out to be the smaller problem here. These are the sort of things a cost-based approach can come up with that humans, especially humans without a deep understanding of MapReduce execution, often cannot. In the next chart we show results for much larger data sets, and correspondingly the nodes move to the beefy ones. Yes? >>: On that previous slide, how many of those twenty-some variables do you actually end up changing to produce this configuration? >> Shivnath Babu: I think probably all of them. We are searching in that space and outputting the best setting we found, but here I am focusing on the most crucial ones so you can compare them one to one. Probably many of them look different, but that is not the way in which the cost-based optimizer is thinking; it goes into this space and finds the best point. In the paper that is going to be presented at VLDB at the end of the month, we show the actual settings for each of these jobs. I think we show it for Word Co-occurrence, which is even more interesting, but it was harder to explain, so I chose this one here. >>: [inaudible] it looks like you traded in-memory sorting for more merging. How is the network traffic affected by this? It seems like it would make that phase longer and more intense. >> Shivnath Babu: The shuffles took more time; however, we still have the combiner. That is a very good insight, though: even if you forget MapReduce and think of external two-phase sorting in database systems, there are two kinds of phases. One is reading a whole chunk of data, sorting it with an n log n in-memory sort, and writing it out; the other is merging the sorted runs, really just an ordered merge. When we change the buffer size, we are really playing with that trade-off: we do less of the n log n sorting per run and more of the merging. It turns out that in certain cases making that trade-off, because you utilize the resources better, can actually do better overall. And to be clear, the combiner has not been taken out here; it is still there. So yes, you are shuffling more data, but in the grand scheme of things that is just a small fraction of the cost. >>: Back to my point: the network utilization of the links would have gone up because of the change in strategy. >> Shivnath Babu: It would have gone up, but here, for things like WordCount, the amount of data that actually goes over the network, as you can see just from these numbers, is small; the reducers run really fast because they are processing much less data. So I would not expect the network traffic to increase significantly. However, that is another metric you might want to optimize for when it comes to large jobs; here I am focusing on [inaudible]. >>: Which version of Hadoop are you using? >> Shivnath Babu: It is 0.20.2. We now support 0.20.2 and 0.20.203, which is the version that Yahoo is actually using; so we support these two versions. Okay. So, terabyte-scale data and much beefier clusters. Here the results are much more in line with what we expected.
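A back-of-the-envelope sketch of that buffer-size trade-off (a simplified model, not Hadoop's exact spill accounting; record metadata, the combiner, and compression are ignored):

```python
import math

def map_side_sort_sketch(map_output_mb, sort_buffer_mb, spill_percent=0.8, merge_factor=10):
    """Rough view of the trade-off: a smaller sort buffer means more, smaller spills
    (cheaper per-spill in-memory n log n sorting, so less CPU pressure) at the price
    of more merge passes and more local I/O; a larger buffer does the opposite."""
    spill_size_mb = sort_buffer_mb * spill_percent
    num_spills = max(1, math.ceil(map_output_mb / spill_size_mb))
    merge_passes = math.ceil(math.log(num_spills, merge_factor)) if num_spills > 1 else 0
    approx_local_io_mb = map_output_mb * (1 + merge_passes)  # each pass re-reads and re-writes
    return {"spills": num_spills, "merge_passes": merge_passes,
            "approx_local_io_mb": approx_local_io_mb}

print(map_side_sort_sketch(map_output_mb=2000, sort_buffer_mb=400))  # bigger buffer, fewer spills
print(map_side_sort_sketch(map_output_mb=2000, sort_buffer_mb=100))  # smaller buffer, more merging
```

Which side of the trade-off wins depends on whether the node is CPU-bound or I/O-bound, which is exactly what the profiles revealed on the c1.medium nodes.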
There are some surprises. Here I have taken out the default; it is just rule-based versus us. In some cases we are improving, but there are two interesting cases, Word Co-occurrence and LinkGraph, where the speedup is below one, meaning we were actually worse, so let me focus on Word Co-occurrence. Word Co-occurrence is a program where you have order-n input data and the shuffle is order n squared. It turned out that the setting our optimizer found for this job does not use the combiner and does not turn on map-side compression, or any kind of compression for that matter. What the rules of thumb said was: turn on the combiner and turn on compression, because a lot of data is being sent, and that was the better choice here. Because of some issues in the profiling, which we are trying to fix, specifically in how we profile when there is combining and compression, our optimizer mistakenly concluded that not having the combiner and not having compression was best. In fact, it was about twice as slow as the optimal setting it could have found. So things are not perfect, and the reason, we think (we have not fixed it yet), is the profiling we do using BTrace. BTrace looks very good on paper, but it turns out to have significant overhead, because it is the sort of tool maintained by essentially one main [inaudible]; it is not a commercial tracing offering, which we think is part of the problem, but that is what we have now. What I am showing on the x-axis, for Word Co-occurrence and [inaudible], is the percentage of tasks that are being profiled. Remember, Starfish can choose to profile only a fraction of the tasks. The y-axis shows the slowdown with profiling relative to no profiling. You can see that 100% profiling slows the job down by around 30%. There are some interesting nonlinear behaviors which I could expand on, some really interesting stuff in profiling, but you can see that if we live with 5 to 10% profiling with BTrace, the slowdown is only 5 to 10%, and at 1% there is hardly any slowdown. What we then did was use the profiles collected in each of these cases, feed them into the Starfish What-if Engine and optimizer, see what setting it comes up with, and run that to see what speedup we get against rule-based. Even with just 1% profiling we are able to give a suggestion that is better than rule-based by around 1.5x. This chart shows that speedup. Around 10% profiling you are seeing about the best we can get; it could be that we are still not finding the true best setting, but at least in our context 10% profiling gets to the best we find, and 10% profiling has about a 10% overhead. I truly believe there are alternatives to BTrace. There are companies in the Java profiling space, and I have seen benchmarking results showing a tool that reduces the profiling overhead relative to BTrace by around 600x, so maybe that will take the problem away; but it is commercial, so we will have to do some work to try it out. That is the story on profiling. Now, quickly moving on to the What-if analysis: this is the slide I showed you earlier, for Word Co-occurrence, with all of these hills and valleys.
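Task-level sampling of this kind could look something like the following minimal sketch; should_profile is a hypothetical helper, not Starfish's actual selection logic.

```python
import random

def should_profile(task_id, fraction=0.05, seed=42):
    """Profile each task independently with probability `fraction`, so the
    measurement overhead scales roughly with the fraction of tasks profiled."""
    rng = random.Random(seed * 1_000_003 + task_id)  # deterministic per-task decision
    return rng.random() < fraction

profiled = [t for t in range(200) if should_profile(t, fraction=0.05)]
print(len(profiled), "of 200 map tasks would run with the profiling agent attached")
```

The profiles from the sampled tasks are then aggregated into one job profile, which is why 1% sampling can already give the What-if Engine enough signal to beat the rule-of-thumb settings.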
By doing one run, collecting the profile, and feeding it into the What-if Engine, this estimated surface is the best we have been able to come up with. The main thing to notice is that it captures the trends, the hills and valleys, reasonably well. The surface is shifted up slightly because we are measuring function calls and times and those sorts of things, and BTrace has some overhead; so these are all estimates, but the trend remains the same. The important thing for finding a good setting is that you preserve those trends; the shape has to remain the same. And there is still work to be done to improve things here. Coming to the other problem I talked about, cluster sizing: remember the picture I showed you earlier for the different types of nodes; the blue bars are what I showed you then. There are interesting trade-offs to be made between cost and time and things like that, and we have a system component, [inaudible], that is really a decision engine making the appropriate calls to the Profiler and the What-if Engine. Here, remember the relative black-box models and those sorts of things: we collect a profile on the m1.large cluster from one run, and then we can estimate how things will perform for different types of nodes, different numbers of nodes, different Hadoop configurations, and things like that. Again, I am sure there is room for improvement, but if you are able to capture the trade-off surface we have here, then, for instance, if you are willing to trade off on completion time but want to minimize cost, maybe you actually want to go with the c1.medium nodes in this case rather than the large or extra-large ones. That is one nice use case we are able to support. Here is another: if you have a program that is going to run on big data, people don't just go out, take the program, and unleash it on a big cluster. They often try it first on small data, because if you get things wrong on a big cluster, many other users might become mad at you; you might even bring down the cluster, like what happened at Yahoo. So what we support is essentially that those two clusters could be a test cluster and a production cluster. What I show here is just a small-scale version of what we want to do, but imagine you had a test cluster holding some sample of the data, maybe even the full data; here it is a four-node test cluster and a 16-node production cluster. What we do is go to the test cluster, collect profiles for the job there, make a prediction for the production cluster, and then run the job with the suggested configuration on the production cluster. Here I am showing real running times, so lower is better. This is the rule-based setting. The other bars compare using the profile collected on the test cluster (the green one) against a profile collected directly on the production cluster (the red one), and the test-cluster profiles do a reasonable job. This is, I think, a very important use case for companies like Yahoo and Facebook, and I am sure for everybody.
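A minimal sketch of that cluster-sizing decision follows; the node types, prices, capacities, and the runtime predictor are illustrative stand-ins for the What-if-based estimates, not current EC2 rates or Starfish code.

```python
# Enumerate candidate cluster configurations, ask a What-if style predictor for the
# estimated completion time, and pick the cheapest configuration that meets a deadline.
CANDIDATE_NODE_TYPES = {"c1.medium": 0.17, "m1.large": 0.34, "m1.xlarge": 0.68}  # $/node-hour (illustrative)

def size_cluster(predict_runtime_hours, deadline_hours, node_counts=(5, 10, 15, 20)):
    best = None
    for node_type, price in CANDIDATE_NODE_TYPES.items():
        for n in node_counts:
            t = predict_runtime_hours(node_type, n)      # What-if style prediction
            cost = t * n * price
            if t <= deadline_hours and (best is None or cost < best[0]):
                best = (cost, node_type, n, t)
    return best  # (dollar cost, node type, node count, estimated hours)

# Toy predictor: pretend the job scales inversely with total compute capacity.
capacity = {"c1.medium": 5.0, "m1.large": 4.0, "m1.xlarge": 8.0}  # compute units (illustrative)
fake_predict = lambda node_type, n: 400.0 / (capacity[node_type] * n)
print(size_cluster(fake_predict, deadline_hours=8))
```

Swapping the deadline or the objective (minimize time subject to a budget, instead of cost subject to a deadline) changes only the filter and the comparison, which is what makes the trade-off surface useful.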
Nobody likes to turn on profiling and such things on the production cluster, and people are very finicky about those production clusters, because their job is probably on the line if something goes wrong with them. So that was a very quick overview of Starfish. What I would like to do now is take a couple more minutes to give you a quick feel for some of the ongoing work; shall I do that? Okay. A quick reminder of the Starfish architecture: in each and every one of these components there are interesting research problems, with at least three more years of work to be done. One of the things we are focusing on a lot now is workflows. We started off focusing on jobs because we realized that unless you have a good handle on the low-level execution, you probably cannot do anything smart at the higher levels, where things could go wrong. Let me illustrate workflow optimization with one slide showing some of the things you could potentially do. This is an abstraction of the MapReduce workflow for term frequency-inverse document frequency, which is used in information retrieval, so it is natural for any user to write it in terms of three or four MapReduce jobs. There is some map function in each one; d and w are things like document ID and word ID. It is easy for users to specify these long, thin workflows, but performance is often much better if you can get shorter and much fatter workflows. That is why we use the word stubby: we actually want stubby workflows. So what can we do? One thing is to take two MapReduce jobs and put them together into one, as in the sketch below. That has to do with how you choose the partitioning function: if you choose it properly, then you can literally pipeline the records output by one reduce into the next map and reduce, because the properties that the partitioning has to enforce are already enforced within the single job. Note that we are not rewriting any of the user's functions; these are black-box functions, and we probably don't know what they actually do. We take those two MapReduce jobs and run them as one single MapReduce job. Similarly, we can collapse things horizontally, with scan sharing and shuffle sharing and things like that. And then all of the job-level parameter selection, all of those choices, still remain on top. This is still at the level of black-box MapReduce workflows, with some properties assumed about the MapReduce API itself. You can do even better things, like logical optimization, if you move up to the Scope or Pig or Hive layers. Those are things we are focusing on, especially in collaboration with the people at Yahoo. This is basically the wish list, indicated in red, and what we are doing will probably come out by the end of the year or maybe sooner. We already have the profiling implemented; the optimization is what we are focusing on. Robust, adaptive tuning asks questions like: if you collected the profile while the cluster was heavily loaded, is that configuration still the best when the cluster is lightly loaded, or vice versa? Those are things we are trying to study, and it is hard for us to study them in isolation running on Amazon EC2.
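One way to picture the vertical packing condition just described is the following simplified check; the key model is an assumption for illustration, and the real workflow optimizer's rules are richer.

```python
def can_pack_vertically(j1_sort_key, j2_partition_key):
    """Two consecutive jobs J1 -> J2 can be collapsed into one MapReduce job when
    J2's partition key is a prefix of J1's sort key: J1 can then partition on that
    prefix while sorting on its full key, so the grouping that J2's shuffle would
    enforce is already enforced inside the single job and the extra shuffle is free
    to be removed."""
    return tuple(j2_partition_key) == tuple(j1_sort_key)[:len(j2_partition_key)]

print(can_pack_vertically(("doc_id", "word_id"), ("doc_id",)))   # True: chain J2 after J1's reduce
print(can_pack_vertically(("doc_id", "word_id"), ("word_id",)))  # False: a new shuffle is needed
```

The user-supplied map and reduce functions stay black boxes throughout; only the shuffle structure between them is inspected.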
We really need help from someone like Yahoo, or hopefully even Microsoft if you are interested in something like this. Lowering the profiling overhead remains high on our list, although I think it is more engineering than research, and people in the commercial space have probably solved that problem. There are new features around cluster sizing that we are thinking about; [inaudible]-based sizing is definitely something we want to do, but it is still a little bit away. The other thing we are really interested in, especially for linear-algebra-type workloads on the cloud: nobody says you have to run a Hadoop cluster, or even a single Hadoop cluster, on one type of node. Instead of homogeneous nodes, you could choose to use small [inaudible] nodes to read a lot of data and parse text to generate the matrices and those sorts of things, and then use beefier nodes with more memory to crunch through those matrices, maybe even nodes with GPUs; Amazon now gives you nodes with GPUs. There is an interesting challenge to be solved there, and we are looking at it. At the workload-management layer, we are looking at data lifecycle workloads that interleave computation: some computation on the data-serving system, log aggregation in the backend, analytics. And this is a point that was brought up earlier: the folks at Waterloo are thinking about turning on profiling all the time, collecting a warehouse of these profiles, and then applying mining and similar techniques to them, for example to pick the right profiles, the [inaudible] profiles, to feed into the What-if Engine. That is it. Thank you. So this is a quick summary: the MADDER principles and Starfish self-tuning, and we want to bring both together. There are techniques we have come up with ourselves and some that are borrowed from other fields. Version 0.1 you can find, download, and use. Version 0.2 is really right there, probably by the end of the month; I really want to have an announcement for VLDB, so let's see if that happens. Version 0.3 will probably come with the Pig Latin support and the [inaudible]. We actually support Pig 0.9, which is not officially released yet; that might happen in October, so we might delay until then. That is the story on Starfish. Thank you. [applause]. >>: [inaudible] the one that was on the previous slide. How do you explain that the [inaudible] is small and the... >> Shivnath Babu: These are really beefy nodes, so on these nodes the [inaudible] are thinking about the [inaudible], right? Most of the rules of thumb are based on what people observed on Yahoo's clusters and on Cloudera's customers' clusters. When we go to these beefier nodes, some of those [inaudible] assumptions did not really hold anymore. It just means that in this sort of context the rule-based thinking applies best when you are dealing with large data and enough CPU cycles, so that those [inaudible] assumptions actually become valid. My point is not just to say that we can have something much better than rules of thumb; somebody who understands everything really well will be able to come up with good settings. Our hope is that in the many contexts where the [inaudible] have no idea about these things, the [inaudible] can bridge that gap, at least at the level of the job parameters. Yes?
>>: [inaudible] your optimizer optimize [inaudible]. >> Shivnath Babu: Here we are only at the job level; we are only dealing with the jobs. But at that level we are actually dealing with properties of the data: in fact, we are even looking at how you could co-locate different data sets so that a join can become faster. When we bring in partitioning, we look at the logical aspects of the partitions along with how to [inaudible]. All of that we are looking at, but the jobs shown here did not include it; the data is given to us. >>: [inaudible] comparison that shows [inaudible]. A lot of those rules at the bottom seemed like they are agnostic to which job is running. Did you use the same rules of thumb for every job, and did you use different cost-based optimizer settings for each job, so that the CBO got per-job configurations while the rules of thumb all got the same configuration? >> Shivnath Babu: There are two questions there. First, do the rule-of-thumb settings look agnostic to the job? That is not true, because the rules of thumb I showed you depend on things like whether the map [inaudible] has memory-intensive operations and whether there is a combiner, so they are not agnostic, and the rule-of-thumb settings are different for different jobs. The second question was whether the cost-based [inaudible] stays the same: we collect the profile from a run with the rule-of-thumb settings, feed it in, and see what comes out. [applause].