>> Emre Kiciman: Hi, everybody. It's my pleasure to welcome Peter Bodik to give us a talk today. Peter is a Ph.D. student at UC Berkeley, where he's been working on the RAD Lab project. He's going to tell us today about his work there on automating data center operations using machine learning. Thanks very much. >> Peter Bodik: Thank you. Hi. So I'm Peter from the RAD Lab at UC Berkeley. And I'll talk about using machine learning to operate data centers. So when the RAD Lab project was started about five years ago, the goal was to build tools that would let a single person deploy and operate a large-scale Web application. And the main bet was that we can use machine learning to help us get there. And I spent five years applying machine learning to these problems, and today I will talk about some of the best results we got. So before I get into the details, let me start by talking about why it's difficult to operate these large-scale Web applications. The main reason is that they are really complex and very few people actually understand how they work and so on. In particular, some of the popular Web apps like Amazon, Hotmail and Facebook could serve tens of millions or hundreds of millions of users. They run in tens or hundreds of data centers, and each data center could have tens of thousands of servers or could run hundreds of services. This picture shows what Amazon looked like about four years ago. This is from our paper from 2006. And each dot on this graph shows a single service that they're running, like the recommendation service, shopping cart and so on. There were about a hundred of them back in the day, and now I think it's up to 200. The arrows show dependencies between services. And what happens when you click on Amazon's website is that the request gets translated into often 50 or more requests to all these individual services, and the responses get merged together to form a single HTML page that gets sent back to the user. So there's a lot of dependencies and a lot of complex interactions in these systems. Here you have again a schematic visualization of a data center with many applications. And the really complex part is that you might have different applications, and different requests for these applications, that would share the same service, they would share the same physical machine, they would share the network or the databases, and understanding these interactions is really difficult. Next, the software that's running in these data centers is constantly changing. EBay published statistics saying they add 300 features per quarter and they add 100,000 new lines of code every two weeks to their system. And finally, the hardware running in these data centers is constantly failing. Google says that on their 2000-node cluster, which is their unit of deployment, during the first year they see 20 rack failures, 1,000 machine failures and thousands of disk failures. And there are many other types of failures that are rarer but more significant. So in this environment, it's really complex and difficult to operate these applications. So what are the three main challenges of data center operations? These are the three main challenges I'll be talking about in this talk. The first one is quickly detecting and identifying performance crises caused by the various failures that happen in a data center. 
These crises could affect the uptime of the application and reduce the performance of these applications, and that's why the operators have to quickly respond and fix them. For example, at Amazon, operators would wake up in the middle of the night or on the weekends, and they would have 15 minutes to start working on the particular problem and do it quickly. So this is really annoying for them. The second main challenge is understanding the workloads of these applications. There are both daily patterns in workload that are crucial for correctly provisioning these applications, but there are also these unexpected spikes and surges in workload, like the one that happened after Michael Jackson died, where 13 percent of all the Wikipedia traffic was directed to the single article about Michael Jackson, right? So handling these workload spikes is crucial, and we would like to stress-test our applications before we deploy them to production to make sure that we can handle them. And finally, we would like to also understand the performance of these systems in different configurations. The problem right now is that most of the data centers are very static in terms of deployment of resources. And the reason is that the operators don't understand what would happen if we add more servers, remove a few servers, or even use different kinds of servers. So it's very difficult to efficiently utilize these data center resources. So these are the three main things that I'll talk about today. So why machine learning, you might ask. As I mentioned, data center operations are manual, slow, and static, mainly because the systems are very complex. Having models of the workload and the performance of these systems would be really useful for addressing these three challenges. But using, for example, analytical models like queueing networks is not very practical, because designing these models is manual, and because these systems change too frequently we would have to update these models manually. Fortunately, there's a lot of monitoring data being routinely collected in the data centers that could provide insights into workload, performance, and different failures. But there's so much data that analyzing it manually isn't practical. The operators are not trained to do this, and doing it manually would be too slow. But that's where machine learning could help a lot, where we can use various methods and techniques that allow us to analyze large quantities of data accurately, and also combine these machine learning methods with non-machine-learning techniques like control theory or optimization to create even more powerful tools. And I'll show you examples of that later in the talk. All right. So the main contribution of my work is using machine learning techniques to take the data center monitoring data and build accurate models of workload, performance, and failures in these systems, and use them to address the three challenges I described at the beginning. I worked on four main projects. Let me just quickly go over them and highlight what the different challenges were and what modeling we used in each of them. So in the first project the goal was to detect performance problems or different failures in Web applications that might be hard to detect automatically or by operators. The input was the workload trace, or the behavior of the users on the website. We used anomaly detection techniques to spot unexpected behavior of the users. 
And these would often point to performance problems or failures in these Web applications. The second project was on identifying these problems. So the input was all the performance metrics being collected on all the servers running an application, and we used feature selection techniques to pick a smaller set of these performance metrics and then create a crisis fingerprint that will uniquely identify a particular performance problem, so that we can then automatically recognize whether a particular problem is happening again and again, and the operator doesn't have to do this manually. The third project was on modeling workload spikes. The goal was to create a workload generator that we could use for stress testing Web applications. So we created a generative model with a few parameters that lets us synthesize new workload spikes, and we used that internally to run various experiments. And finally, the last project was on performance modeling. We designed both black box models of these Web applications, but we also brought in the structure of these systems, or the structure of the queries that run inside these applications, to provide accurate predictions of latency that extrapolate to new workloads and new configurations, so that you can use them to answer various what-if questions if you're the developer or the operator. We also combined these performance models with control theory to do dynamic resource allocation in both stateless systems like Web servers and storage systems, where you actually have to copy data over to the new machines. So -- yeah? >>: [inaudible] all based on actual traces from a company like Microsoft, Amazon? >> Peter Bodik: You mean all these projects? >>: [inaudible]. >> Peter Bodik: So -- let's see. The first one was based on data from Ebates.com. So this was an actual company that gave us traces for some of their failures that happened. This one was in collaboration with Microsoft during my internship. The spike modeling -- we had some traces that are publicly available, and I'll go into that in more detail later. And our performance modeling was done internally on the applications that we had available, using benchmarks and workload generators that we had available. So these are not real production systems, I should say. Yeah. So in the rest of the talk I'll talk about the crisis identification project and the characterization and synthesis of workload spikes, and then I'll summarize. So the crisis identification project is the project I worked on during my internship at Microsoft Research Silicon Valley with Moises Goldszmidt. The main use of machine learning here was taking all the performance metrics being collected in a data center for one particular application and using feature selection methods to select a small, relevant set of these metrics to create an accurate crisis fingerprint that you can use to uniquely identify these various performance problems. So as I mentioned, there are a lot of failures happening in data centers. They could be hardware failures, but could also be software bugs, misconfigurations and so on that cause either downtime or reduced performance of these applications. And this costs a lot of money. That's why there are operators who have to really quickly respond when they happen. So here let me first show you what happens during a typical crisis in a data center. 
On the left side you see a timeline where at first the application was okay. And then at 3 in the morning, there's a crisis that's first detected, right? The detection part is the easy part. It's usually automatic. And you detect it by violations of performance SLOs, right? For example, you notice that the latency's too high or the throughput is too low. So that's relatively easy. When the operator wakes up and opens his laptop, he has to perform crisis identification. So he starts looking at performance metrics, logs, logging into machines and so on to identify what the problem is. He doesn't necessarily have to do root cause diagnosis here, he just needs to understand enough about the problem to fix it. Maybe rebooting a machine would do it, and so on. And this part, the identification part, could take minutes, maybe hours sometimes. And it's really manual and difficult. After they figure out what's going on, they perform resolution steps, which depend on the type of the crisis. And once the crisis is over, that's when they perform root cause diagnosis and add the description of the crisis into a trouble-ticket database. >>: [inaudible] crisis like [inaudible]. >> Peter Bodik: No. We were not considering that. We were mostly interested in performance -- performance problems. We were interested in, you know, problems where you can fix them by maybe rebooting machines, restarting processes and so on -- but some of the techniques might apply. And you can ask later to see if some of that would apply. So our approach of fingerprinting was evaluated on data from Exchange Hosted Services, which was data from 400 machines where every machine was collecting about 100 different performance metrics. To give you an idea of how we actually want to help the operators, imagine that you have a system where you already observed four different performance problems -- say an application configuration error that maybe happened twice. And so what operators already have is this trouble-ticket database, right? For every type of a problem they have a description of the problem, how they detected it, and the resolution steps that they take to resolve this problem. What we're going to add is a fingerprint. For every type of a problem we'll add a vector describing that problem that we can automatically compare. And when a new problem happens, what we'd like to answer is: is this crisis that's happening right now a crisis we've seen before, and if it is, we'd like to know which one it is. And if it's not, we'd like to tell the operator that this is a new type of a problem. And even in that case this helps the operator, because he doesn't have to manually go over all the past problems and try to find the one that's similar. And so what we do is we create a fingerprint of the current problem and we compare it to all the past fingerprints in our database, and if we find one that matches, this is our answer and this is what we tell the operator. So this is the overview of our technique. There are two main insights here that lead to our solution. The first one is that these performance crises actually repeat often. There are various reasons for this. For example, the root cause diagnosis could be incorrect, so the developers actually fix the wrong problem and the crisis keeps happening again. Or just deploying the fix to production might take time because of testing, and it might be weeks before it is actually deployed, again causing the crisis to reoccur. 
And we've seen this both in the EHS data and also at Amazon. So during my internship at Amazon, some of the less severe crises happened really frequently. They were not significant enough that the developers would fix them quickly, but the operators still would have to wake up in the middle of the night to deal with them. >>: [inaudible] makes you wake up in the middle of the night? I mean how many are recurring and how many are new? >> Peter Bodik: You'll see some statistics later for the EHS data. For Amazon, the way they classified the crises was they called them sev1 and sev2 in terms of severity. The sev1s were the ones where you absolutely have to deal with it right now because you're losing money right now. The sev2s were about 100 times more frequent. It depends on what service you're talking about, right? But for this particular application that we looked at they were about 100 times more frequent. They were not ones that are directly affecting performance right now, but if you don't fix the problem within an hour maybe they could lead to a lot more serious problems down the road. >>: [inaudible] in other words, how many [inaudible]. >> Peter Bodik: I don't have statistics, exact numbers on that. But for the EHS data, we had about 19 labeled problems. And out of that, there is one crisis type that repeated 10 times, and there was one that repeated twice. And the other ones just happened once in the data set that we had. Okay. The second -- second insight is that the state of the system, in terms of the performance metrics that you're collecting, is similar during identical performance problems, right? If the same thing happens twice the performance metrics will not be exactly the same, but the CPU might go up and network traffic might drop and so on. And this is exactly what the operators use when they are manually trying to identify these problems. But capturing this system state with a few metrics is difficult, because for different problems you need different metrics. And often ahead of time you don't know which are the right metrics. For example, the operators from EHS gave us three latency metrics which they said were the most important metrics for that application. This is what actually defines the performance SLO for this application. And when we tried to use just these three metrics for creating the fingerprints, this was not enough information to actually identify the problem. So you need to look into a lot more other metrics of the system. >>: When you say system state, do you necessarily mean an observable system state, or is it just the thing that you are not measuring against? >> Peter Bodik: So I don't mean application state like user cookies and so on. I mean just the performance metrics that the operators would use to identify the problems. And so these would be CPU utilization, workloads, latencies, throughputs, queue lengths -- metrics that you can measure in your system. >>: The system state and metrics are actually -- >> Peter Bodik: That's what -- yes, that's what I mean by that, yes. So in this particular case the system state was exactly the performance metrics, but you could add other metrics into it that are based on logs and other things that might be useful for crisis identification. All right. There are three contributions in this project. The first one is the creation of the fingerprints. A fingerprint is a compact representation of the state. 
The main goal of this representation is that it should uniquely identify a performance problem. It should be robust to noise, because even though the same crisis happens twice, the performance metrics will not be exactly the same. So we would like to make sure that the fingerprint is still similar. And finally, we also wanted to have an intuitive visualization of the fingerprint. So when the operators use these fingerprints, we'd like them to understand what's captured in there, even if they don't understand the exact process of how we created the fingerprint. The second contribution is the realtime process of how you use these fingerprints for crisis identification. As I explained before, the goal is to maybe send an e-mail to the operator so that when he wakes up and opens his laptop it says: the crisis that you're looking at right now is very similar to the one we saw two weeks ago, and here is the set of steps you need to take to fix it; or we could also say this is a problem that we haven't seen before and you should start debugging yourself and start looking into the details of the problem. And finally, we also did a rigorous evaluation on data from EHS. >>: How long was that data for? You said [inaudible]. >> Peter Bodik: We have four months of data. And so that was the main data set we used. >>: [inaudible]. >> Peter Bodik: These are the majority of the main problems they had. There were some that I think we didn't use because they couldn't give us a label for them. But I think these are definitely the vast majority of the big problems that happened during the four-month period. >>: Do you have a sense of whether there were any that went undetected? >> Peter Bodik: There were things that we saw in the data in terms of high latencies where we talked to the operators and they didn't have any record of that. So they didn't see any e-mails going around describing the problem, even though we clearly saw that latency was up. And there were a few instances of that. So those are the ones where we thought that something happened, but they either didn't know or there were no e-mails at least, so they couldn't give us a label for the problem. Yes? >>: Do you only use a snapshot of the system state as features, or does that also include some accumulated stats? >> Peter Bodik: So the data we used were aggregated into 15-minute periods. So they might have been, say, average CPU utilization during a 15-minute window. And the reason we used 15 minutes and not, say, one minute is that this was data from some of their historic metric databases. So I think in realtime they are collecting the data at much finer granularity; it's just that the data we got from them were over a year old in some cases, so they didn't have all the detailed statistics in this case. Okay? So in the rest of this section I'll talk about how we define these performance crises, define the fingerprint, and go over the evaluation. The definition of a crisis in general is that it's a violation of a service-level objective. It's usually a business-level metric, such as latency of the requests, measured on the whole cluster. An example of an SLO that we used, or that EHS used, is that at least 90 percent of the servers need to have latency below a hundred milliseconds during a 15-minute epoch. And so you can check this condition very easily, and if it's not true, that's when a crisis is happening. 
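For concreteness, here is a minimal sketch of that kind of SLO check, assuming one latency value per server per 15-minute epoch; the threshold values and function names are illustrative assumptions, not the actual EHS implementation.

```python
# Minimal sketch of the crisis-detection SLO check described above.
# Assumption: for each 15-minute epoch we have one latency value per server.
import numpy as np

LATENCY_THRESHOLD_MS = 100.0   # per-server latency threshold from the SLO
MAX_VIOLATING_FRACTION = 0.10  # crisis if more than 10% of servers violate it

def epoch_violates_slo(latencies_ms: np.ndarray) -> bool:
    """Return True if this 15-minute epoch counts as an SLO violation."""
    violating_fraction = np.mean(latencies_ms > LATENCY_THRESHOLD_MS)
    return violating_fraction > MAX_VIOLATING_FRACTION

# Example: 60 of 400 servers (15%) are slow, so this epoch counts as a crisis.
latencies = np.concatenate([np.full(340, 40.0), np.full(60, 250.0)])
print(epoch_violates_slo(latencies))  # True
```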
The crises we looked at were application configuration errors, database configuration errors, request routing errors, and overloaded front ends and back ends. And all of these were easily detected, but the identification was the difficult task. Yes? >>: Why -- is it common that you would specify these things over whole clusters? It seems like oftentimes you would want to say something about some end user, to some level, that the end user only sees a certain amount of latency, as opposed to the second bullet which says that it's a property of the entire -- >> Peter Bodik: I think you care about the properties of all the requests coming into the system, and you want to make sure that, say, the slowest requests in the system are not too slow, right? Or that the 99th percentile of latency of all the requests is not too high. So, you know, you look at all the users in your system and make sure that the slow ones are not too affected. >>: [inaudible] crisis prediction. Imagine that [inaudible]. >> Peter Bodik: So we considered using the fingerprints for predicting that a failure will happen. But it turned out that if you just make the SLO stricter, for example you decrease the threshold from a hundred milliseconds to 50 milliseconds, that also would give you a relatively accurate predictor -- at least more accurate, or roughly as accurate, as using the fingerprints themselves. So we tried that. And in this particular case it wasn't really useful. Okay. So now how do we actually create these fingerprints? The goal of the fingerprint is to take the system state, the performance metrics being collected, and create this compact representation. The performance metrics we treat as arbitrary time series, so we don't make any assumptions about what they represent, what they are. And very often there are these application-specific metrics that we have no idea what they really mean, right? There are some of the metrics we understood, like workloads and resource utilizations, but the majority of these metrics were specific to EHS and we didn't know what they are. You can think of the system state as follows: if you think of one server, that server is collecting a hundred different metrics over time, so that's the state on that machine. But you have hundreds or thousands of these servers, and that's the system state. Now, I have the definition of the crisis. I know that there are periods that are okay and there are periods where the crisis was happening. So our goal is to take the data from the crisis and create this crisis fingerprint from that. And we do this in four steps. In the first step we select the relevant metrics, right? This is where we use the feature selection step, because it turned out that using all hundred metrics is just too much data, and we select just 20 or 30 metrics. In the second step we summarize each of the selected relevant metrics across the whole cluster using quantiles, because we don't want to remember, say, a thousand different values of CPU utilization. In the third step we map each of the metric quantiles that we computed into three states -- it's either hot, normal, or cold, depending on whether the current value is too high or too low compared to some historic data. And finally, because each crisis could last for a different amount of time, we average over time to compute this single fingerprint. So in the next four slides I'll talk about these four steps in detail. So the first step was selecting the relevant metrics. 
The reason we did this is we tried creating fingerprints using all the hundred metrics. And we also tried creating fingerprints using the three operator-selected latency metrics. And none of these really provided us with high identification accuracy. With a hundred metrics it was just too much data and the fingerprints were too noisy. With just the three latency metrics we were not capturing enough information. So what we did is use a feature selection technique from machine learning, called logistic regression with an L1 constraint, to select a smaller subset of relevant metrics that are correlated with the occurrence of the crisis. The way you can think about it is that we're building a linear model where the inputs are all the metrics being collected on the system and the output is binary, and we try to predict whether there's a crisis or not, and we try to use only a small subset of the metrics but still get an accurate model. >>: [inaudible] 400 times 100 -- >> Peter Bodik: This was a hundred features. One metric was one feature. >>: [inaudible] whole citation or -- >> Peter Bodik: So the way we actually did this was that the crises -- see, if I go back -- we had an SLO per machine, which were the latency metrics and the thresholds. And this is what we used to create the two classes. >>: [inaudible] 400 separate labels [inaudible] every 15 minutes [inaudible]. >> Peter Bodik: Right. Right. Yes. >>: [inaudible] have a lot of structure, both in individual features as time goes by [inaudible] and so on, as well as it could have correlation between features where say some servers become overloaded and others have lower loads? Did you try to take into account the temporal structure as well as the cross-feature structure somehow? >> Peter Bodik: So I'll get to the temporal structure later. But in terms of dependencies, no, we haven't tried to explicitly capture that, because all we did was select these metrics. And so what you actually have is many instances of crises, and we run this per crisis. For each crisis we get maybe 10 or 20 metrics, but we have many instances of crises, so we actually take the most often selected metrics as the set of metrics that will be the input to the fingerprint. >>: [inaudible] to do in one single method feature selection? >> Peter Bodik: The reason we did it this way is that we wanted to also identify which metrics are useful for each individual crisis, just so that we could tell the operators that here, for this crisis, these five metrics were the crucial ones, and so on. And one thing that I'd like to point out is that some of the metrics we selected were these application metrics that the operators knew they were collecting, but they didn't really realize that they're actually important for crisis identification, right? And this highlights that there's so much data that even the people who work with the system every day didn't realize that some of these metrics would help them. And so once we told them, look, these two metrics really help us, so you should start looking at them as well, they started using them. And this is again -- we didn't really assume anything about many of these metrics, so that's what makes it powerful. >>: [inaudible] reevaluate the -- this -- I can easily imagine a situation where you see a new type of crisis show up where the relevant metric was something that didn't matter before because it had been -- 
>> Peter Bodik: Right. Right. So what we do is every time a new crisis happens, after it's over we run this feature selection on that new crisis, and we potentially add new metrics in and we might remove some other metrics from the set. Yeah. All right. And so I'd like to point out that we're using this classification setting here only for the feature selection. We're not actually using these models later to create the fingerprints. Now, in the second step I would like to take all the selected metrics, all the relevant metrics, and summarize them across the whole cluster. The reason is that if you have a CPU utilization metric, for example, you have 400 measurements and you don't want to remember all 400 values of this. So at the top you see a histogram or distribution of observed values of this metric. And we characterize the distribution using three quantiles: the 25th percentile, the median and the 95th percentile. We selected these ones by visually inspecting the data. But you could also automate this process once you have some labeled data and automatically pick the quantiles that actually give you the best accuracy. And the advantage of this is that this representation is robust to outliers: even if one or two servers report extremely high values of a metric, it doesn't affect these quantiles that we're computing. And it also scales to large clusters. Both in terms of -- if you're adding more machines we're still representing the fingerprint using just these three quantiles. And also there are techniques to efficiently compute quantiles given large input sizes. >>: [inaudible] in the previous slide when you select the features, when you train the logistic classifier, so for each crisis you're assuming you get 400 instance examples of crisis behavior for all the servers. You're assuming that each crisis involves all of the servers? >> Peter Bodik: So when a crisis happens, we take some time around the crisis, say, you know, 12 hours before, 12 hours after. So it covers both the normal periods when the system was okay, and it covers also the problematic periods. And so the number of data inputs is, you know, 100 metrics. But then the number of data points is 400 times the duration of the crisis plus some buffer on both sides. >>: But you are assuming that all 400 servers are exhibiting some kind of crisis? >> Peter Bodik: If not, then they just fall into the normal class. >>: So how do you label the crisis? >> Peter Bodik: So for each crisis, in this particular case, each server is reporting the three latencies. And we know the thresholds on each latency that actually mattered, that are specified by the operators. For each server and each point in time we know whether that server is okay or not. >>: [inaudible] the servers doing the same thing for the Amazon model [inaudible]. >> Peter Bodik: Right. So we did this just for, say, a single application -- for application servers belonging to a single application. Right. Yes. So what we tried, and what wouldn't really work, is using, say, the mean and variance of these metrics, because these are really sensitive to outliers and a single server could significantly affect the values of these. And we also tried using just the median of these values. But in some cases that didn't give us enough information about the distribution of these values, and we had to add a few more quantiles. >>: [inaudible]. >> Peter Bodik: Because we were interested in the crises that affect a large fraction of the cluster. 
In this case, a crisis was only happening when at least 10 percent of the machines, which in this case was at least 40 machines, were affected. >>: [inaudible] asymmetric? >> Peter Bodik: Right. >>: Quantiles. 25 and 95. So is there an implicit assumption that you're more likely going to -- because you're quantizing this way, is there [inaudible] that high is more like the [inaudible]. >> Peter Bodik: In most of the -- yeah, I mean we did this by visually inspecting the data, and, you know, we noticed that for a lot of the metrics, when something breaks, the metrics usually go up. Or they don't drop below zero, say, right. So we wanted to be more sensitive on the high end. But again, this is something that you could automate. And once you have some labeled data you could try different quantiles and, you know, pick the ones that actually work better. >>: [inaudible] opposed to latency metrics but [inaudible] much more interesting, right? >> Peter Bodik: Right. >>: All right. >>: So these are numbers that you use across all the metrics, not just in the example of CPU? >> Peter Bodik: Yes. Yeah. >>: Okay. So you use exactly the same -- >> Peter Bodik: Right. Right. Okay? So now in the third step, the goal is to take the observed raw values of the metric quantiles, like the median of CPU utilization, and map them into hot, normal, or cold values -- so to kind of capture automatically whether the value is too high or too low. So what you see here -- you have four fingerprints on the right. We call them epoch fingerprints, because now for every single time epoch we have one row in this rectangle that represents that particular time step, and the gray represents a normal value, the red represents too high, and the blue represents too low. So what you see here is that we used about 11 metrics, each one captured using three quantiles, so there are about 33 columns. And so we achieved three things here. First, we can differentiate among different crises. There are two instances of the same crisis at the top, and visually the fingerprints look very similar. While there are two different crises at the bottom and the fingerprints are different. The second thing is that the representation is very compact, right? Remember that we started with 400 machines, each one reporting 100 metrics over a period of time, and we can really compactly represent it here. And on top of that, it's intuitive, because you can quickly see that there is this one metric that first dropped at the beginning of the crisis and then got back to normal, and then maybe the high quantile increased. And when we actually showed these fingerprints to the operators, they were able to identify some of these problems just by looking at this heat map. Right? So we thought that's really cool that they can do that. And they could verify that whatever is captured here actually represents the problem they were looking at. What we also tried is using the raw metric values. The problem there is that some of these metrics could reach really high values during the crisis, so comparing the vectors -- these extreme values would really skew the comparison. And we also tried fitting a time series model here, because we noticed that, you know, there are these daily patterns in workloads and CPU utilization. We tried fitting a time series model, but for many other metrics there were no significant patterns. So doing the discretization based on this model didn't really work very well. 
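The four steps just described can be summarized in a rough sketch. This is only an approximation under stated assumptions: the data layout, the per-server SLO labels used as the classification target, the way the top metrics are kept, and the hot/cold boundaries derived from historical data are all simplified, and the helper names are hypothetical rather than the code used on the EHS data.

```python
# Rough sketch of the four-step fingerprint construction described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

QUANTILES = [0.25, 0.50, 0.95]   # summarize each metric across the cluster

def select_relevant_metrics(X, y, n_keep=30):
    """Step 1: L1-regularized logistic regression as a feature selector.

    X: (samples, metrics) -- one row per (epoch, server) pair
    y: (samples,)          -- 1 if that server violated its SLO, else 0
    """
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X, y)
    weights = np.abs(model.coef_).ravel()
    # In the actual approach, selection is run per crisis and the most often
    # selected metrics are kept; here we simply take the top-weighted ones.
    return np.argsort(weights)[::-1][:n_keep]

def summarize_epoch(epoch_metrics, selected):
    """Step 2: reduce each selected metric to a few cluster-wide quantiles.

    epoch_metrics: (servers, metrics) for one 15-minute epoch.
    Returns a vector of length len(selected) * len(QUANTILES).
    """
    return np.quantile(epoch_metrics[:, selected], QUANTILES, axis=0).T.ravel()

def discretize(quantile_values, normal_low, normal_high):
    """Step 3: map each quantile value to cold (-1), normal (0), or hot (+1),
    relative to historical ranges for that same quantile."""
    return np.where(quantile_values > normal_high, 1,
                    np.where(quantile_values < normal_low, -1, 0))

def crisis_fingerprint(epoch_vectors):
    """Step 4: average the discretized epoch vectors over the crisis
    (including some epochs before detection) into a single fingerprint."""
    return np.mean(np.stack(epoch_vectors), axis=0)
```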
>>: [inaudible]. >> Peter Bodik: We had about 10 types total and 19 instances total. Yeah. And finally, the last thing we have to deal with is that different crises have different durations. So you can't directly compare the rectangles you saw in the previous slide. And that's one thing that wouldn't have worked. The other thing that wouldn't have worked is just using the first epoch of the crisis, because often the crisis evolves over time and you want to capture that information. So the simple thing we did is simply average over time: take all the epoch fingerprints during the crisis and create a single vector that represents that crisis, and we would compare different crisis fingerprints by computing the Euclidean distance. So these are the four steps of how we create the fingerprints. But now we also need to define the method of how we use the fingerprints to identify the problems in realtime. Again, you see the timeline on the left. First the crisis is detected by a violation of an SLO. And what's important here is that the operators of EHS told us that it's useful for them to get a correct label for a crisis during the first hour after the crisis was detected. The reason is that it often takes them more than an hour to figure out what's happening. And so during the first hour after the crisis was detected, we perform this identification step. What we do is we update the fingerprint first, given all the data that we have, and we compare the fingerprint to all the past crises we have in our trouble-ticket database. And we either emit the label of the single crisis that we find, or we emit a question mark, which means that we haven't really seen this crisis before. So over time what you get is these five labels: you first try to do identification when the crisis is detected, and then every 15 minutes later during the first hour. >>: [inaudible] average of the whole crisis, right? >> Peter Bodik: Or during the first hour of the crisis. >>: Oh, okay. That makes sense. And are you comparing always versus the first hour average, or after 15 minutes to like a 15-minute average? >> Peter Bodik: We would compare the first, say, 15 minutes or half an hour to the fingerprint of the whole hour for the past crisis. And what actually turned out to be crucial also was to include the first 30 minutes before the crisis in the fingerprint. So in this case, actually, here the first two rows are the two 15-minute intervals before the crisis was detected. And in this case everything is normal, but in some of the other crises you see that these metrics were high even before the crisis started. And so using those first 30 minutes and adding them to the fingerprint was crucial -- or, it increased the accuracy. And once the crisis is over, we update the relevant metrics, we run the feature selection step, and we update all the fingerprints to use the same set of metrics. And finally, the operators would add this crisis to the trouble-ticket database with the right label. >>: So say the operator didn't notice the crisis, would your system be useful in detecting it? Or would you find similarities to normal -- >> Peter Bodik: So we didn't deal with the detection part. We took the definition of the crisis that they gave us, which was defined in terms of the latency metrics. And so we figured that's the input. And we didn't really look at whether we could do a better job. 
We tried using these fingerprints, but we couldn't really do a better job in terms of prediction. >>: So [inaudible] talking about prediction in advance of the crisis happening, but -- >> Peter Bodik: Yeah, we assume they know. Somebody detected it, right, or you detected it using the SLO violation, and then this is when we trigger the fingerprinting, okay? Yes? >>: [inaudible] some failures can be due to the [inaudible] in the program, like -- each other and like if they are not totally consistent. >> Peter Bodik: So you mean that the problem would not really manifest as increased latency? >>: Right. I mean the root cause -- the root cause of some failures may not be that the system [inaudible] is overloaded, but that there are some problems in the code? >> Peter Bodik: Right. If this translates into increased latency, then we would detect it by checking the SLO, and then you could start the fingerprinting process on top of it. So, yes. >>: So if the problem doesn't lead to a performance crisis then you are not dealing with it? >> Peter Bodik: We were not dealing with that, yes. >>: There's no way they [inaudible]. >> Peter Bodik: All right. So next let's look at evaluation. As I mentioned, we used data from EHS from about 400 machines, 100 metrics per machine, 15-minute epochs. And the operators told us that one hour into the crisis is still useful in terms of correct identification. We had these three latency metrics, and the thresholds on each of them that were given to us by the operators. And if more than 10 percent of the servers had latency higher than the threshold during one of the epochs, that constituted a violation and the definition of a crisis. We had 10 crisis types total and 19 instances of a crisis. There was one crisis type where we had nine instances, another one where we had two, and the other eight where we had only one each. >>: [inaudible]. >> Peter Bodik: So it's like the same crisis happens 10 times. Those were the 10 instances, occurrences, of the crisis. And these happened during a four-month period. The main evaluation metric that we care about is the identification accuracy, right? How accurately can you label the problem? But before I get to that, we actually defined what we call identification stability, because every time a crisis happens we actually emit five labels. You might start with saying you don't know what this crisis is and then you start labeling it correctly. And we require that we stick to the same label. Once you assign a particular label to a crisis you will not change your mind later, so that the operator is not confused. So here at the top, you see two instances of unstable identification, where we first change our mind from crisis A to crisis B, and in the second instance we change our mind from crisis A to I don't know, right? So both of these would be unstable. And even if A is the right crisis, we wouldn't count them as correct. So in terms of crises that were previously seen -- if we were identifying a crisis that was previously seen -- the accuracy was about 77 percent. This means that for about three-quarters of the crises we can assign the right label, and we can do this on average 10 minutes after the crisis is detected. So potentially it could save up to 50 minutes of the identification performed by the operators, which seems pretty good. 
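As a concrete illustration of the identification step just described: compare the fingerprint of the ongoing crisis to the fingerprints of labeled past crises and emit either the closest label or a question mark, without changing a label once one has been committed. The distance threshold, the stickiness rule, and the function signature below are illustrative assumptions, not the exact procedure used on the EHS data.

```python
# Minimal sketch of the realtime identification step described above.
import numpy as np

def identify(current_fp, past_fps, past_labels, max_distance, sticky_label=None):
    """past_fps: fingerprints from the trouble-ticket database.
    sticky_label: label already emitted earlier in this crisis, if any;
    once a definite label is assigned we keep it, for stability."""
    if sticky_label is not None and sticky_label != "?":
        return sticky_label
    distances = [np.linalg.norm(current_fp - fp) for fp in past_fps]
    best = int(np.argmin(distances))
    if distances[best] <= max_distance:
        return past_labels[best]      # looks like a crisis we've seen before
    return "?"                        # previously unseen type of crisis
```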
And a hypothesis, which we can't really verify, is that if we had shorter time epochs, we could probably do it even faster. But we had the 15-minute time intervals. For the crises that were previously unseen, the best we can do is say five times: I don't know what this crisis is, right? That's the definition of accuracy here. And for 82 percent of these crises we're able to do this correctly. >>: Is that statistically relevant or just a number of [inaudible]. >> Peter Bodik: You mean why isn't it higher? >>: [inaudible]. >> Peter Bodik: Oh, I mean these are almost completely independent problems, right? So there's no reason why that should be higher than this one. >>: Are the previously seen crises [inaudible]. The way you described it was that you don't [inaudible]. >> Peter Bodik: Right. >>: [inaudible] feature vectors. >> Peter Bodik: Right. I mean, as you get more data the accuracies actually do increase over time, because the more data you have, the more samples of different crises you have and you can tune your identification thresholds better. So there's definitely that effect. >>: These accuracies were computed causally, right? It's not like you did a whole [inaudible] you were -- you sort of described your temporal data -- did your temporal thing and then [inaudible] accuracy and measure yourself, do it again, do it again. >> Peter Bodik: So the way we did it is that we only had this one sequence of 19, and we reshuffled them many times, I think like 40 times. So we created 40 new sequences, so we can test the identification on different orderings of these crises. But for each sequence, what we did was we started with one or two crises and then we went chronologically in that order. >>: [inaudible]. >> Peter Bodik: Yes, across the 40. >>: Is this a [inaudible] about 30 percent of the time -- >> Peter Bodik: I think so. I mean, yes. Yes. You mean 30 percent of the time you still give them an incorrect label, but they would still be checking what this crisis is anyway. So it should still help them. And also in practice, the way we did this evaluation was we only considered the closest similar failure or crisis in the past. But in practice you might look at the top three crises, for example, right? And the operator might decide himself which one actually looks best. >>: So I'm curious about this reshuffling now, because it seems like in these systems oftentimes the systems change and get updated over time. And by reshuffling it seems like you're using future knowledge of what happened after the system's been updated to make predictions about crises which happened in the past. Do you find that if you didn't do this, you just ran it -- >> Peter Bodik: Well, we definitely wanted to do some reshuffling, because otherwise you would only have a sequence of 19 crises, and we could compute the accuracy, but it would really be based on very few samples. So that's the reason why we did that. And you know the time period was relatively short. It was just four months. So I don't think the system would have changed that significantly. >>: [inaudible] right? So you're not doing -- the reshuffling, it started completely clean for each of the 40 runs. So if there's any changes over time [inaudible]. >> Peter Bodik: Right. But you know, crisis five might tell me something about a crisis that happened earlier in reality but after reshuffling happened later. 
But yeah, I don't think -- obviously, you know, if you have a longer time period, like a year or longer, you might want to even start forgetting some of the fingerprints, because as crises are fixed you want to maybe delete them from your database. >>: [inaudible]. >> Peter Bodik: So we assume that there is a database of all the past crises that are labeled, and all of them have fingerprints assigned to them. We just compute the Euclidean distance and that's the similarity measure. >>: Out of the 23 percent where you weren't accurate for previously seen, do you know what percentage were like you gave back a question mark versus you gave back the wrong solution to [inaudible]. >> Peter Bodik: That's a good question. But we haven't quantified that. >>: Do you think the distribution of sort of a couple of them being common and the remainder being one-offs will generalize to longer time periods [inaudible] and if that's the case wouldn't it be more effective to give them some sort of lower level information other than just the label, since most of -- a lot of the time they would just say this is new? That's not as useful as telling them here are some more salient features that are -- >> Peter Bodik: Right. Well, so in the data from Amazon I don't have raw counts, but there's a lot of these failures that -- I think in that system we saw a lot more things happening, at least as I remember people running around and responding to e-mails. You know, they had kind of this Wiki page almost of things that break, and they kept track of them, right, and many of these things were recurring again and again. And I think this is more likely to happen for crises that are not that significant. Because if this is a problem that really takes down the website, they really make sure that it's not going to happen again. Right? For the less severe problems it's very likely that it will happen again. >>: [inaudible] the accuracy for previously unseen crises. So what cases do you deem as predicting [inaudible] and what are not. >> Peter Bodik: So in this case I would only say that it's accurate if we have five question marks. That's the only case when we actually say that it's accurate. >>: So by using the question mark you mean -- >> Peter Bodik: I don't know what it is. There's no similar crisis in the past. >>: But the definition of the crisis should be 100 percent right, because every time they [inaudible] it will be deemed as a crisis. >> Peter Bodik: So the crisis is happening. It's just that the same type of crisis hasn't really happened before. >>: So this 82 percent means that for 82 percent of the cases you didn't assign a definite label for this crisis? >> Peter Bodik: Right. >>: Okay. >> Peter Bodik: And this is the first time we saw the crisis, so this is the best we can do, right? >>: The accuracy for previously seen crises, can I say it's like 50 percent as the baseline, because you have 10 of the same kind, so you can always predict that it's type A. >> Peter Bodik: The baseline is more like 10 percent. If I have 10 crisis types, I can pick one of the 10. >>: No, no. I mean -- if there was some prior knowledge saying type A is more likely to be [inaudible] based on previous data, say. >> Peter Bodik: Right. >>: Than based on 15 percent? >> Peter Bodik: Right. >>: Okay. >> Peter Bodik: So yeah, it depends on what the distribution of these crises is and how often they occur. >>: [inaudible]. >> Peter Bodik: You could. But we didn't get to experiment with that. 
Moises actually -- after I left, he continued working on this, and he developed more advanced models, and so he has some results on that. But the accuracy wasn't higher; it was just a more flexible model and didn't need maybe as much training data. >>: If you didn't use this really harsh metric but you said the majority of the labels matched the true one, how much higher would the [inaudible]. >> Peter Bodik: Again, we wanted to make sure that we were using the strictest criterion, so that -- yeah. >>: So when you say -- so you reshuffle in [inaudible] 19 and you start [inaudible] from when does this start to count? After the second example or -- >> Peter Bodik: So that's a good point. I mean that's another thing that we experimented with. In the worst case, you know, you start with two crises, right, so that kind of gives you at least some idea, and you count the remaining 17 for accuracy. That gives us a little bit lower accuracy, maybe 70 percent. But when we started counting after 10, that's when we achieved this. So once you have enough training data, you can adjust your thresholds correctly based on the data you've seen in the past. All right. Next slide. So the closest related work is this paper from SOSP '05. It has some overlapping authors; Moises and Armando Fox were on that paper. What they designed there is what they called failure signatures. The main differences were that each signature was for an individual server, not for a whole large-scale application, so it doesn't directly apply to this scenario. And the process was also a lot more complex, because they were creating these classification models per crisis and they were maintaining a set of them, they were removing some of them, adding new ones, and they were classifying the current crisis according to all these models and using that to decide what the label of the current problem is. And we showed that we're doing much better by using our fingerprints. So to summarize fingerprinting, I presented crisis fingerprints that compactly represent the state of the system. They scale to large clusters because we use only a few quantiles to capture the metric distribution. And they also have an intuitive visualization, so that both we, as the designers, and the data center operators could understand what these fingerprints really capture. The feature selection technique from machine learning was crucial for selecting the right set of metrics here. And we achieved an identification accuracy of approximately 80 percent, on average 10 minutes after the crisis was detected. In terms of impact at Microsoft, the metric selection part is now being used by one of the Azure teams to diagnose their storage systems. And we also submitted two patents, one on the metric selection, and the other one on the overall fingerprinting process. If there are no questions, I'll move to the second part. I'll briefly go over the modeling of workload spikes. So in this project, the goal was to create a generative model of a workload spike that we can use to synthesize new workloads and stress test Web applications against these workloads. So spikes happen a lot, right? Like the one that happened after the death of Michael Jackson, where they saw on Wikipedia about a five percent increase in aggregate workload volume, but 13 percent of all the requests to Wikipedia were directed to Jackson's article. >>: [inaudible]. 
>> Peter Bodik: He was one of the hottest pages, but still, you know, out of the million Wikipedia pages it was a small fraction, less than one percent, say. So there was a significant increase in traffic here. So what we mean by a spike is an event where you either see an increase in aggregate workload to the website or you have significant data hotspots, where there is a small set of objects that sees this spike in workload. And we started working on this project because there was little prior work on characterization of spikes. There's a lot of work on characterization of Web server traffic or network traffic, but little on spikes and hotspots. And also there are a lot of workload generators that people use for stress testing, but very few of them actually let you specify spikes in the traffic that they generate, and they're not very flexible. So there are two contributions here. One is the characterization of a few workload spikes, where we realized that these spikes vary significantly in many important characteristics. And then we propose a simple and realistic spike model that can capture this wide variance of spike characteristics and that we can use to synthesize new workloads. So first let's look at the different workload or spike characteristics that are important. For spikes in the workload volume we looked at the time to peak, from the start of the spike to the peak of the spike, the duration, the magnitude, and the maximum slope in the aggregate workload. And for data hotspots we looked at the number of hotspots in the particular spike, the spatial locality, or whether the hotspots are on the same servers or not, and the entropy of the hotspot distribution, whether the hotspots are equally hot or there are some hotspots that are hotter than others. So in the next two slides I'll look at four of these characteristics and show you that the spikes we looked at varied significantly in each of them. Now, on this slide you see the first two spikes. The one on the left is the start of a soccer match during the World Cup in 1998. The second one is a trace from our departmental Web server when 50 photos of one of the students got really popular. And in the graphs on the top you see the relative workload volume coming to the website. And as expected, the workload volume increased significantly during the spike. We also wanted to look at the data hotspots. So first, an object in these traces is either a path to a file or a Wikipedia article, and so on. And a hotspot is an object with a significant increase in traffic during the spike. So in these particular spikes we identified about 150 hotspots for the first spike and about 50 hotspots for the second spike. And what I'm showing on the graphs on the bottom is the fraction of traffic going to a particular hotspot. And so that one page over there received about three percent of all the traffic during this particular spike. Now, there are these three other spikes we looked at: again from our departmental Web server when the Above the Clouds paper from RAD Lab got slashdotted, the start of a campaign on Ebates.com, and the Michael Jackson spike. And here is the first big difference. On the spike on the left there's a significant increase in workload during the spike, but on the spikes on the right the aggregate workload volume is almost flat. But these are still spikes, because there are significant data hotspots here. 
During the Ebates campaign there was one page that received about 17 percent of the traffic, and the Michael Jackson page again received about 13 percent of all the traffic. So there is the first characteristic: what is the magnitude of these spikes. The second characteristic that we looked at is the number of hotspots. On the left again you see 50 or 150 different hotspots, on the right only one or two, right? So there's a big difference there. And finally, we wanted to answer the question whether all the hotspots are equally hot or some hotspots are a lot hotter. So in the spike on the left, these 50 photos were almost equally hot during the spike, while Michael Jackson's page was significantly hotter than all the other hotspots in the system. So those are the three characteristics we looked at here, and the spikes vary significantly. The other question we wanted to answer is whether the hotspots are co-located on the same servers or not. Here we don't really know the locations of the objects from the workload traces, because we only know which objects were being requested. So what we did was we sorted all the objects alphabetically and assigned them to 20 servers. And this is something that might not make sense for some storage systems, but we were interested in this in the context of the SCADS storage system that we were working on at Berkeley. And in that system we're actually storing objects in this alphabetical order to enable efficient range queries on these objects. So what I'm showing here at the bottom are these 20 servers that have a bunch of objects allocated to them. And I'm showing how many hotspots were actually on each particular server. So what you see on the spikes on the left is that there is high spatial locality. On the very left there is one server that has almost all of the 50 popular photos. In the World Cup spike there are three clusters of hotspots, whereas in the Michael Jackson spike the hotspots were distributed across most of the servers. >>: [inaudible] traces from? >> Peter Bodik: So they are actually public. There are two sources of data. Wikipedia publishes hourly snapshots of the number of hits to every page. And there's also recently a new couple-month trace of individual requests, like a 10 percent sample, that is available out there. But what we used were the hourly snapshots that they publish. >>: I don't quite understand about the rest of them. Isn't there just a single document? >> Peter Bodik: Right. But we found that there were about 60 other hotspots which, when we looked at them, were the articles being linked from Jackson's page, which were not nearly as popular but still pretty popular. So the point of this part was to characterize what real spikes look like. We only looked at five, but even in these five we found significant differences. So the conclusion is that there is no typical spike. If you want to stress test your system against these spikes you can't just use one, you should really explore a larger space of these spikes. And that was the goal of the second part of this project, where we wanted to design a workload model to synthesize new spikes. You can think of this as there being a space of spikes defined by the seven characteristics, so it's a seven-dimensional space, and we saw five examples of spikes in here. What we want to do is create a generative model that has a few parameters that control the characteristics of the spikes that you synthesize. 
I'd like to point out that we're not interested in inference, so I don't care about inferring which actual parameters would yield one particular spike; all I care about is that for any combination of the seven characteristics there exists some set of parameters that could give me those characteristics. So in the next slides I'll tell you how we designed this generative model. There are two components to the model, two things you need to synthesize a spike. The first one is the workload volume during the spike, which determines how many requests per second you're generating. The second part is the object popularity during the spike, which determines the probability of picking a particular object and generating a request against it. To generate the workload volume of the spike, we simply take some workload trace that you might have in your system, like the curve you see in the top left, and multiply it by a piecewise linear curve. So there are a few parameters here that affect characteristics like the slope, duration, and magnitude of the spike. That's the first part. The more interesting part is how you generate the object popularity and add hotspots to that popularity. Again we start with object popularity without hotspots. These could be, for example, samples from a Dirichlet distribution, or you could use the actual object popularity from your system. Then we pick which objects are the hotspots, and once we have that, we adjust the popularity of these hotspots. So in the next two slides I'll talk about these two parts of the model. First we wanted to pick which of all the objects in the system are hotspots. We wanted two parameters to control both the number of hotspots and the clustering of these hotspots. So we have this clustering process with two parameters: N, the number of hotspots, and L, the clustering parameter. And we perform the following. We start with the first hotspot and add it to the first cluster. For each of the remaining hotspots we either put it in a new cluster with probability proportional to L, or into an existing cluster with probability proportional to the number of hotspots already in that cluster. After you're done with that, you have a set of hotspot clusters, and you assign these clusters randomly across the space of all the objects in your system. This process is also known as the Chinese restaurant process. So here on the right you see a few samples from this process. For large values of L you get an almost uniform distribution of hotspots across the servers, whereas as you decrease the value of L you get more clustering and more hotspots are concentrated on a few servers. So that shows how a few simple parameters let us control what the hotspots look like. In the second part, we wanted to adjust the popularity of these hotspots. We sample from a Dirichlet distribution, which is parameterized by a single parameter K. The K determines the variance, and thus the entropy, of the popularity of these hotspots. So if you pick a small K, you get a probability distribution like the red curve over there, where you have a few hotspots that are very popular, which might match the Michael Jackson case; or you pick a large K, and you have hotspots with very similar popularity, which might match the 50 popular photos at our department.
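Here is a minimal sketch of the pieces just described, under my own reading of them: a Chinese-restaurant-process-style clustering of the hotspots controlled by L, hotspot popularities drawn from a symmetric Dirichlet with concentration K, and a piecewise-linear multiplier for the workload volume. The parameters L and K follow the talk; everything else (function names, the exact shape of the multiplier, the random seed) is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)

def cluster_hotspots(num_hotspots, L):
    # Chinese-restaurant-process-style clustering: each new hotspot opens a
    # new cluster with probability proportional to L, or joins an existing
    # cluster with probability proportional to that cluster's current size.
    clusters = [[0]]
    for i in range(1, num_hotspots):
        sizes = np.array([len(c) for c in clusters], dtype=float)
        probs = np.append(sizes, L) / (sizes.sum() + L)
        pick = rng.choice(len(clusters) + 1, p=probs)
        if pick == len(clusters):
            clusters.append([i])        # open a new cluster
        else:
            clusters[pick].append(i)    # join an existing cluster
    return clusters  # small L -> few large clusters, i.e. high spatial locality

def hotspot_popularity(num_hotspots, K):
    # Symmetric Dirichlet(K): small K -> a few very hot objects (the Michael
    # Jackson case), large K -> roughly equal popularity (the 50 photos case).
    return rng.dirichlet([K] * num_hotspots)

def spiked_volume(base_volume, start, time_to_peak, duration, magnitude):
    # Multiply a baseline volume trace by a piecewise-linear bump: ramp from
    # 1 up to `magnitude` over `time_to_peak` intervals, then back down to 1.
    mult = np.ones(len(base_volume))
    up = np.linspace(1.0, magnitude, time_to_peak)
    down = np.linspace(magnitude, 1.0, duration - time_to_peak)
    bump = np.concatenate([up, down])
    mult[start:start + len(bump)] = bump
    return np.asarray(base_volume) * mult

To synthesize a spike you would then place the clusters at random positions in the object space, boost the chosen hotspots' probabilities by the Dirichlet weights (renormalizing the overall popularity distribution), and for each interval draw requests according to the spiked volume and the adjusted popularity. For stress testing, one plausible use is to sweep these parameters, for example increasingly steep ramps or smaller K, and replay each synthesized workload against the system under test.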
To summarize this part, we first performed spike characterization, where we looked at five different spikes and concluded that there's no typical spike and that they vary significantly. In the second part we designed a workload generation tool that lets us flexibly adjust the spike parameters, and this is much better than the workload generation tools out there that don't really let you add spikes to workloads. In terms of model validation, here I'm showing you a comparison of a real workload and a synthesized workload. This is data from the 50 popular photos spike. On the top graphs you see the aggregate workload to the system, and on the bottom graphs you see the workload to the most popular photo in the system. And they look pretty similar. There are of course some dips over here that we could model, but that would make the model more complex than we want. So in the rest of the talk I just wanted to go over some of the other projects I worked on and summarize. Yeah? >>: I have a question about the piecewise linear function. How do you determine that piecewise linear function? >> Peter Bodik: These are the parameters that you can pick yourself. >>: [inaudible] you find those piecewise linear functions? >> Peter Bodik: So the goal was not to start with a spike and get parameters that would actually match that spike. The way you would use it is that I'm giving you a tool with a few parameters that you might stress test your system against. Maybe you start with spikes where the slope isn't too steep, right? And then you keep increasing that parameter and you keep testing your system to check whether your system can handle it or not. That's the goal. So we don't want to infer the exact values of the parameters for a particular spike, but instead give you the flexibility to tweak them. >>: [inaudible] there will be some parameters, like how many segments and what's the biggest slope for each piecewise linear part, and then the model automatically generates those and synthesizes the data. >> Peter Bodik: Yes, right. >>: Because you might also want to assign a likelihood to the existing data. [inaudible] then you just look at the perplexity of the data you already have and see if it's actually predicting the ones you have. Or is that just -- I mean, it seems like you have a more modular approach where you just had the piecewise part. But in the end, if you could actually look at the -- >> Peter Bodik: You mean the likelihood of -- >>: Of the existing spikes that you have. >> Peter Bodik: You mean in terms of trying to predict that they -- >>: [inaudible] how likely is it that the model would have -- could have produced these spikes, how much do they agree with the model? >> Peter Bodik: Right. So this is ongoing work. Here I'm just showing you, for this one particular spike, how closely we matched it, and we're still trying to figure out how you really compare how realistic our spikes are compared to the ones we see. Because it's not trivial, right? These curves look pretty similar, but how do you really compare them, given all the noise in there? You can't just directly compare the values. So we're still working on exactly how to compare them. Okay? So let me now quickly run over some of the other projects I worked on.
The first one was on performance modeling, where we designed both black-box models of various systems and also combined query execution plans with simple models of database operators to get flexible models that can extrapolate well to unseen workloads or unseen configurations of the system, which could help developers understand the impact of their designs on future workloads. The second project was on dynamic resource allocation, where we're using these performance models together with control theory techniques to do dynamic resource allocation, both for stateless systems like app servers and also for stateful systems like the SCADS storage system I mentioned a while ago, where you not only decide whether you're adding or removing machines but also have to figure out how you're going to partition or replicate the data onto the new machines. The other project was on detection of performance crises. I mentioned we're using anomaly detection techniques to notice changes in user behavior. And finally, for both the crisis detection project and the fingerprinting project, we designed intuitive visualizations of the state of the system. Here we were showing the patterns of user workload on the website so that the operators could verify whether our anomaly detection technique is actually detecting anomalies or it's a false alarm. So the big picture is that all the projects I worked on were applying machine learning to the monitoring data being collected in the data center and creating accurate models of workload, performance, and failures. We were able to use various techniques like classification, feature selection, regression, sampling, and anomaly detection. But on top of that, we were able to combine these techniques with non-machine-learning techniques like visualization, control theory, or optimization to get even more powerful results. In terms of future work, there are just two projects I'd like to highlight. The holy grail of data center management is really doing automatic resource management simply based on performance SLOs for all these applications. Ideally, I just take all my applications running in a data center, describe how fast each one should be running, and the data center should automatically take care of assigning resources and efficiently provisioning all these applications. Here we really need to combine all the models of workload, resources, and power together and use optimization techniques to get there. The other project is on failure diagnosis. While I mentioned crisis fingerprinting, that really only helps you when the crises are repeating. Some of the more significant crises can take hours to diagnose. You really need to combine all these various models that we're building with traces of requests, dependencies, and configuration, and ideally put all the information together to create a single, intuitive tool that the operators could use. So to conclude, what I presented here is a case for using machine learning as a fundamental technology to automate data center operations. The main reason we need it is that the systems we're building are too complex to understand. Fortunately, there are a lot of good sources of data in the data center that could shed light on the workload, failures, and performance of these systems. And it's just too much data to analyze manually, and that's why we need to use machine learning and statistical methods to understand this data.
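As a toy illustration of the performance-model-plus-control-theory combination mentioned above (not the actual controller from these projects), a single control step might pick the smallest server count whose predicted latency still meets the SLO; perf_model stands in for a regression model fit to monitoring data, and all names here are hypothetical.

def control_step(predicted_workload, perf_model, slo_latency,
                 min_servers=1, max_servers=100):
    # One step of a model-driven resource controller: choose the smallest
    # number of servers whose predicted latency still meets the SLO.
    # perf_model(workload, servers) -> predicted latency for that allocation.
    for servers in range(min_servers, max_servers + 1):
        if perf_model(predicted_workload, servers) <= slo_latency:
            return servers
    return max_servers  # SLO unreachable within the allowed range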
Thank you, and this work is in collaboration with other grad students, post-docs, my advisors, and Moises Goldszmidt from Silicon Valley. [applause]. >> Emre Kiciman: Questions?