>> Emre Kiciman: Hi, everybody. It's my pleasure to welcome Peter Bodik to
give us a talk today. Peter is a Ph.D. student at UC Berkeley, where he's been
working on the RAD Lab project. He's going to tell us today about his work there
on automating data center operations using machine learning. Thanks very
much.
>> Peter Bodik: Thank you. Hi. So I'm Peter from RAD Lab at UC Berkeley.
And I'll talk about using machine learning to operate data centers. So when the
RAD Lab project was started about five years ago, the goal was to build tools that
would let a single person deploy and operate large-scale Web applications.
And the main bet was that we can use machine learning to help us get there.
And I spent five years applying machine learning to these problems, and today I
will talk about some of the best results we got. So before I get into the details, let
me start by talking about why it's difficult to operate these large-scale Web
applications. The main reason is that they are really complex and very few
people actually understand how they work and so on.
In particular, some of the popular Web apps like Amazon, Hotmail and Facebook,
you know, they could serve tens of millions or hundreds of millions of users.
They run in tens or hundreds of data centers, and each data
center could have tens of thousands of servers or could run hundreds of
services.
This picture shows what Amazon looked like about four years ago. This is from
our paper from 2006. And each dot on this graph shows a single service that
they're running, like the recommendation service, shopping cart and so on. There
were a hundred of them back in the day, and now I think it's up to 200.
The arrows show dependencies between services. And what happens when
you click on Amazon's website is that request gets translated into often 50 or more
requests to all these individual services, and the responses get merged together
to form a single HTML page that gets sent back to the user.
So there are a lot of dependencies and a lot of complex interactions in these systems.
Here you have again a schematic visualization of a data center with many
applications. And the really complex part is that you might have different
applications, and different requests for these applications, that would share the
same service, they would share the same physical machine, they would share
the network or the databases, and understanding these interactions is really
difficult.
Next, the software that's running in these data centers is constantly changing.
eBay published statistics saying that they add 300 features per quarter
and they add 100,000 new lines of code every two weeks to their system.
And finally, the hardware running in these data centers is constantly failing. Google
says that on their 2,000-node cluster, which is their unit of deployment, during the
first year they see 20 rack failures, 1,000 machine failures and thousands of disk
failures. And there are many other types of failures that are rarer, but more
significant.
So in this environment, it's really complex and difficult to operate these
applications. So what are the three main challenges of data center operations?
And these are the three main challenges I'll be talking about in this talk.
The first one is quickly detecting and identifying performance crises caused by
the various failures that happen in a data center.
These crises could affect the uptime of the application, can reduce the
performance of these applications, and that's why the operators have to quickly
respond and fix them.
And for example, at Amazon operators would wake up in the middle of the night, on
the weekends, and they would have 15 minutes to start working on the particular
problem and do it quickly. So this is really annoying for them.
The second main challenge is understanding the workloads of these applications.
There are both daily patterns in workload that are crucial for provisioning
these applications correctly. But there are also these unexpected spikes and surges in
workloads, like the one that happened after Michael Jackson died, where 13
percent of all the Wikipedia traffic was directed to the single article about Michael
Jackson, right? So handling these workload spikes is crucial, and we would like
to stress test our applications before we deploy them to production to make sure
that we can handle them.
And finally, we would like to also understand performance of these systems in
differing configurations. The problem right now is that most of the data centers
are very static in terms of deployment of resources. And the reason is that the
operators don't understand what would happen if we add more servers, remove a
few servers, or even use different kinds of servers. And so it's very difficult to efficiently
utilize these data center resources.
So these are the three main things that I'll talk about today.
So why machine learning, right, you might ask. And so, as I mentioned, the data
center operations are manual, slow, and static, and mainly because the systems
are very complex.
Having models of workload and performance of these systems would be really
useful for addressing these three challenges. But using, for example, analytical
models like queueing networks is not very practical, because designing these
models is manual and because these systems change too frequently, so we would
have to update these models manually.
Fortunately there's a lot of monitoring data being routinely collected in the data
centers that could provide insights into workload, performance, and different
failures. But there's so much data that analyzing it manually isn't practical.
These operators are not trained to do this, and doing it manually would be too
slow.
But that's where machine learning would help a lot, where we could use various
methods and techniques that allow us to analyze large quantities of data
accurately and also combine these machine learning methods with non machine
learning techniques like control theory or optimization to create even more
powerful tools. And I'll show you examples of that later in the talk.
All right. So the main contribution of my work is using machine learning
techniques to take the data center monitoring data and build accurate models of
workload, performance, and failures in these systems and use them to address
the three challenges I described at the beginning.
I worked on four main projects. And let me just quickly go over them and
highlight what the different challenges were and what modeling we used in each
of them. So the -- in the first project the goal was to detect performance
problems or different failures in Web applications. These might be hard to detect
automatically or by using -- by operators.
The input was the workload trace or the behavior of the users on the website.
We used anomaly detection techniques to spot unexpected behavior of the
users. And these would often point to performance problems or failures in
these Web applications.
The second project was on identifying these problems. So the input was
all the performance metrics being collected on all the servers running an
application, and we used feature selection techniques to pick a smaller set of these
performance metrics and then create a crisis fingerprint that will uniquely
identify a particular performance problem, so that we can then later automatically
recognize whether a particular problem is happening again and again, and the
operator doesn't have to do this manually.
The third project was on modeling workload spikes. The goal was to create a
workload generator that we could use for stress testing Web applications. So we
created a generative model with a few parameters that lets us synthesize new
workload spikes and we use that internally to run various experiments.
And finally the last project was on performance modeling. We designed both
black box models of these Web applications but also we combined the structure
of these systems or structure of these queries that you run on top -- inside these
applications to provide accurate predictions of latency that extrapolate to new
workloads and new configurations so that you can use them to answer various
what-if questions that you have if you're the developer or the operator.
We also combined these performance models with control theory to do dynamic
resource allocation in both stateless systems like Web servers and storage
systems where you actually have to copy data over to the new machines. So -- yeah?
>>: [inaudible] all based on actual traces from company like Microsoft, Amazon?
>> Peter Bodik: You mean all these projects?
>>: [inaudible].
>> Peter Bodik: So the -- let's see. So the first one, this one was based on data
from Ebates.com. So this was an actual company that gave us traces from one
of the -- for some of their failures that happened. This one was in collaboration
with Microsoft during my internship.
The work -- the spike modeling, we had some traces, they are publically
available, and I'll go into that in more detail later. And our performance modeling
was done internally on the applications that we had available -- using
benchmarks and workload generators that we had available. So these are not
real, real systems -- not production systems, I should say. Yeah.
So in the rest of the talk I'll talk about the crisis identification project and the
characterization and synthesis of workload spikes, and then I'll summarize.
So the crisis identification project is the project I worked on during my internship
at Microsoft Research-Silicon Valley with Moises Goldszmidt. The main use of
machine learning here was taking all the performance metrics being collected in a
data center for one particular application and using feature selection methods to
select the small relevant set of these metrics to create an accurate crisis
fingerprint that you can use to uniquely identify these various performance
problems.
So as I mentioned, there is a lot of failures happening in data centers. It could be
hardware failures but could also be software bugs, misconfigurations and so on
that cause either downtime or reduce performance of these applications.
And this costs a lot of money. That's why there is -- there are operators who
have to really quickly respond when they happen.
So here let me first show you what happens during a typical crisis in a data
center. On the left side you see a timeline where at first the application was
okay. And then at 3 in the morning, there's a crisis that's first detected, right?
The detection part is the easy part. It's usually automatic. And you detect it by
violations of performance SLOs, right? For example, you notice that the latency's
too high or the throughput is too low. So that's relatively easy.
When the operator wakes up, opens his laptop, he has to perform crisis
identification. So he starts looking at performance metrics, logs, logging into
machines and so on to identify what the problem is. He doesn't have to do root
cause diagnosis necessarily here, just needs to understand enough about the
problem to fix it. Maybe rebooting a machine would do it, and so on.
And this part, the identification part, could take minutes, maybe hours
sometimes. And it's really manual and difficult.
After they figure out what's going on, they perform resolution steps which depend
on the type of the crisis. And once the crisis is over, that's when they perform
root cause diagnosis and add the description of the crisis into a trouble-ticket
database.
>>: [inaudible] crisis like [inaudible].
>> Peter Bodik: No. We were not considering that. We were mostly interested
in performance -- performance problems. But -- and we were interested in, you
know, problems where you can -- you can fix them by maybe rebooting
machines, restarting processes and so on and so on, so in that case -- but some
of the techniques might apply. And you can ask later to see if some of that would
apply.
So our approach of fingerprinting was evaluated on data from Exchange Hosted
Services, which was data from 400 machines where every machine was collecting
about 100 different performance metrics. To give you an idea of how we actually
want to help the operators, imagine that you have a system where you already
observed four different performance problems, right? The application
configuration error maybe happened twice, and so what operators already have
is this trouble-ticket database, right? For every type of a problem they have a
description of the problem, how they detected it, and the resolution steps that they
take to resolve this problem.
What we're going to add is a fingerprint. For every type of a problem we'll add a
vector describing that problem that we can automatically compare. And when
a new problem happens, what we'd like to answer is: is this crisis that's
happening right now a crisis we've seen before? And if it is, we'd like to know
which one it is. And if it's not, we'd like to tell the operator that this is a new type
of problem.
And even in that case this helps the operator because he doesn't have to
manually go over all the past problems and try to find the one that's similar.
And so what we do is we create a fingerprint of the current problem and we
compare it to all the past fingerprints in our database and if we find one that
matches this is -- this is our answer and this is what we tell the operator. So this
is the overview of our technique.
There are two main insights here that lead to our solution. The first one is that
these performance crises actually repeat often. There are various reasons for
this. For example, the root cause diagnosis could be incorrect, so the developers
actually fix the wrong problem and the crisis keeps happening again. Or just
deploying the fix to production might take time because of testing and might be
weeks before it is actually deployed, again causing the crisis to reoccur.
And we've seen this both in the EHS data and also at Amazon. So during my
internship at Amazon, some of the less severe crises happened really
frequently. And they were not significant enough that the developers would fix
them quickly, but the operators would still have to wake up in the middle of the
night to deal with them.
>>: [inaudible] makes you wake up in the middle of the night? I mean how many
are recurring and how many are new?
>> Peter Bodik: You'll see some statistics later for the EHS. For Amazon, the
way they classified the crises was they called them sev1 and sev2 in terms of
severity. The sev1s were the ones where you absolutely have to deal with it right
now because you're losing money right now. The sev2s were about 100 times
more frequent. It's -- it depends on what service you're talking about, right? But
for this particular application that we looked at they were about 100 times more
frequent. They were not ones that directly affect performance right now.
But if you don't deal with -- fix the problem within an hour maybe they could lead
to a lot more serious problems down the road.
>>: [inaudible] in other words, how many [inaudible].
>> Peter Bodik: I don't have statistics, exact numbers on that. But for the EHS
data, we had about 19 labeled problems. And out of that, there is one crisis type
that repeated 10 times, and there was one that repeated twice. And the other
ones just happened once in the data set that we had. Okay. The second insight is
that the state of the system, in terms of the performance metrics that
you're collecting, is similar during identical performance problems, right? If the
same thing happens twice, the performance metrics will not be exactly the same,
but, say, CPU might go up and network traffic might drop and so on. And this is
exactly what the operators use when they are personally trying to identify these
problems.
But capturing this system state with a few metrics is difficult because for different
problems you need different metrics. And often ahead of time you don't know
which are the right metrics. For example, the operators from EHS gave us three
latency metrics which they said were the most important metrics for that
application. This is actually what defines the performance SLO for this
application. And when we tried to use just these three metrics for creating the
fingerprints, this was not enough information to actually identify the problem.
So you need to look into a lot more other metrics of the system.
>>: When you say system state, do you necessarily mean on observable system
state or it's just the thing that you are not measuring against?
>> Peter Bodik: So I don't mean application state like user cookies and so on. I
mean just the performance metrics that the operators would use to identify the
problems. And so these would be CPU utilization, workloads, latencies,
throughputs, queue lengths. Some things that -- metrics that you can measure in
your system.
>>: The system state and metrics are actually --
>> Peter Bodik: That's what -- yes, that's what I mean by that, yes.
So in this particular case, you know, the system state we used was exactly -- was
exactly the performance metrics, but you could add other metrics into it that are
based on logs and other things that you might -- that might be useful for crisis
identification.
All right. There are three contributions in this project. The first one is the
creation of the fingerprints. A fingerprint is a compact representation of the state.
The main goal of this representation is that it should uniquely identify a
performance problem. It should be robust to noise, because even when the
same crisis happens twice, the performance metrics will not be exactly the same.
So we would like to make sure that the fingerprint is still similar.
And finally we also wanted to have an intuitive visualization of the fingerprint. So
the -- when the operators use these fingerprints, we'd like them to understand
what's captured in there, even if they don't understand the exact process of how
we created this fingerprint.
The second contribution is the realtime process of how you use these fingerprints
for crisis identification. As I explained before, the goal is to maybe send an
e-mail to the operator so that when he wakes up and opens his laptop it says
the crisis that you're looking at right now is very similar to the one we saw two
weeks ago, and here is the set of steps you need to take to fix it, or we could also
say this is the problem that we haven't seen before and you should start
debugging yourself and start looking into the details of the problem.
And finally we also did a rigorous evaluation on data from EHS.
>>: How long was that data for? You said [inaudible].
>> Peter Bodik: We have four months of data. And so that was the main data
set we used.
>>: [inaudible].
>> Peter Bodik: These are the majority of the main problems they had. There
were some that I think we didn't use because they didn't -- couldn't give us a
label for it. But I think these are definitely the vast majority of things that -- the
big problems that happened during the four-month period.
>>: Do you have a sense of whether there were any that went undetected?
>> Peter Bodik: There were things that we saw in the data in terms of high
latencies where we talked to the operators and they didn't have any record for
that. So they didn't see any e-mails going on describing this problem, even
though we clearly saw that latency was up. And there were a few instances of
that. So those are the ones where we thought that something happened but they
didn't -- either didn't know or there were no e-mails at least, so they couldn't give
us a label for the problem. Yes?
>>: You only use snapshot of the system state as features or does that also
include some accumulated stats?
>> Peter Bodik: So the data we use were aggregated into 15-minute periods.
So they were -- they might have been average CPU utilization during a
15-minute window. And the reason we use -- it's 15 minutes and not say one
minute is that this was data from some of their historic databases, metric
databases. So I think in realtime they are collecting the data in much finer
granularity, it's just the data we got from them were over a year old in some
cases, so they didn't have all the detailed statistics in this case. Okay?
So in the rest of this section I'll talk about how we define these performance
crises, define the fingerprint, and go over the evaluation. The definition of a crisis
in general is that it's a violation of a service-level objective. It's usually a business-
level metric, such as latency of the requests, happening on the whole cluster. An
example of an SLO that we used, or that EHS uses, is that at least 90 percent of
the servers need to have latency below a hundred milliseconds during a
15-minute epoch.
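Just to make that condition concrete, here is a minimal sketch of the SLO check in Python (an illustration, not the EHS code); it assumes one aggregated latency value per server per 15-minute epoch and uses the threshold and 90-percent condition from the example:

```python
import numpy as np

def slo_violated(epoch_latencies_ms, threshold_ms=100.0, max_violating_fraction=0.10):
    """Return True if this 15-minute epoch violates the SLO described above.

    epoch_latencies_ms: one aggregated latency value per server.  The SLO
    holds if at least 90 percent of servers stay below the threshold,
    i.e. at most 10 percent of servers may exceed it.
    """
    latencies = np.asarray(epoch_latencies_ms, dtype=float)
    violating_fraction = float(np.mean(latencies > threshold_ms))
    return violating_fraction > max_violating_fraction

# Example: 400 servers, 50 of them slow -> 12.5 percent violating -> crisis detected.
latencies = np.concatenate([np.full(350, 60.0), np.full(50, 250.0)])
print(slo_violated(latencies))  # True
```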
And so you can check this condition very easily, and if it's not true, that's
when the crisis is happening. The crises we looked at were application
configuration errors, database configuration errors, request routing, and
overloaded front ends and back ends. And all of these were easily detected, but
the identification was the difficult task. Yes?
>>: Why -- is it common that you would specify these things over whole
clusters? It seems like oftentimes you would want to say that -- about some end
user to some level that the end user only sees a certain amount of latency as
opposed to the second bullet which says that it's a property of the entire --
>> Peter Bodik: I think you care about the property -- properties of all the
requests coming into the system and you want to make sure that say the slowest
requests in the system are not too slow, right? Or that the 99 percentile of
latency of all the requests is not too high. So, you know, you look at all the users
in your system and make sure that the slow ones are not too affected.
>>: [inaudible] crisis prediction. Imagine that [inaudible].
>> Peter Bodik: So we considered using the fingerprints for predicting that a
failure will happen. But it turned out that if you just make the SLO stricter, for
example you decrease the threshold from a hundred milliseconds to 50
milliseconds, that also would give you an accurate -- a relatively accurate
predictor. At least more accurate, or roughly as accurate, as using the fingerprints
themselves. So we tried that. And in this particular case it wasn't really useful.
Okay. So now how do we actually create these fingerprints? The goal of the
fingerprint is to take the system state, the performance metrics being collected, and
create this compact representation.
The performance metrics we treat as arbitrary time series, so we don't
make any assumptions about what they represent, what they are. And very often
there are these application specific metrics that we have no idea what they really
mean, right? There are some of the metrics we understood like workloads and
resource utilizations, but the majority of these metrics were specific to EHS and
we didn't know what they are.
You can think of the system state as: if you think of one server, that server is
collecting a hundred different metrics over time, so that's the state on that machine.
But you have hundreds or thousands of these servers and that's the system
state.
Now I have the definition of the crisis. I know that there are periods that are okay
and there are periods where the crisis was happening. So our goal is to take the
data from the crisis and create this crisis fingerprint from that. And we do this in
four steps. In the first step we select the relevant metrics, right? This is where we
use the feature selection step because it turned out that using all hundred
metrics is just too much data and we select just 20 or 30 metrics.
In the second step we summarize each of the selected relevant metrics across
the whole cluster using quantiles because we don't want to remember, say, a
thousand different values of a CPU utilization.
In the third step we map each of the metric quantiles that we computed here into
three states: it's either hot, normal, or cold, depending on whether the current
value is too high or too low compared to some historic data. And finally, because
each crisis could last for a different amount of time, we average over time to
compute this single fingerprint.
So in the next four slides I'll talk about these four steps in detail. So first step
was selecting the relevant metrics. The reason we did this was we tried creating
fingerprints using just all the hundred metrics. And we also tried creating
fingerprints using the three operator selected latency metrics. And none of these
really provided us with high identification accuracy. With hundred metrics it was
just too much data and the fingerprints were too noisy. With just the three
latency metrics we were not -- we were not capturing enough information.
So what we did is use a feature selection technique from machine learning called
logistic regression with an L1 constraint to select a smaller subset of
relevant metrics that are correlated with the occurrence of the crisis. The way
you can think about it is that we're building a linear model where the inputs are all the
metrics being collected on the system and the output is binary, and we try to
predict whether there's a crisis or not, and we try to use only a small subset
of the metrics but still get an accurate model.
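To make this step concrete, here is a rough sketch using scikit-learn's L1-penalized logistic regression on synthetic data; the real EHS metrics aren't available, and the particular solver and regularization strength are assumptions, not the ones used in the project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: rows are (server, epoch) samples,
# columns are the ~100 performance metrics, y marks whether that server
# was violating its SLO (crisis) in that epoch.
n_samples, n_metrics = 5000, 100
X = rng.normal(size=(n_samples, n_metrics))
y = (X[:, 3] + 2 * X[:, 17] - X[:, 42] + 0.5 * rng.normal(size=n_samples)) > 1.0

# The L1 penalty drives most coefficients to exactly zero, leaving a small
# subset of metrics correlated with the occurrence of the crisis.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
model.fit(X, y.astype(int))

selected = np.flatnonzero(model.coef_[0])
print(len(selected), "metrics selected:", selected)
```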
>>: [inaudible] 400 times 100 --
>> Peter Bodik: This was hundred features. One metric was one feature.
>>: [inaudible] whole citation or --
>> Peter Bodik: We were -- so the way we actually did this was that we -- the
crises -- see if I go back -- so we had like an SLO per machine, which were the
latencies, the latency metrics and the thresholds. And this is what we used to
create the two classes.
>>: [inaudible] 400 separate labels [inaudible] every 15 minutes [inaudible].
>> Peter Bodik: Right. Right. Yes.
>>: [inaudible] have a lot of structure both in individual features as time goes by
[inaudible] and so on as well as it could have correlation between features where
say some servers become overloaded and others have lower loads? Did you try
to take into account the temporal structure as well as the cross-feature structure
somehow?
>> Peter Bodik: So I'll get to the temporal structure later. But the -- in terms of
dependencies, no, we haven't tried to explicitly capture that because all we did
was select these metrics and then -- so what you actually have is we have many
instances of crises. And we run this per crisis. And so after each -- for each
crisis we get maybe 10 or 20 metrics, but we have many instances of crises so
we actually take the most often selected metrics as the set of metrics that will be
the input to the fingerprint.
>>: [inaudible] to do in one single method feature selection?
>> Peter Bodik: The reason we did it this way is that we wanted to actually also
identify which metrics are useful for each individual crisis, just so that we could
tell the operators that here, for this crisis, these five metrics were the crucial ones
and so on. And the one thing that I'd like to point out is that some of the metrics
we selected were these application metrics that the operators knew they were
collecting, but they didn't really realize that they're actually important for
crisis identification, right? And this highlights that there's so much data that even
the people who work with the system every day they didn't realize that some of
these metrics would help them. And so once we told them, look, these two
metrics really helped us so you should start looking at them as well, they started
using them. And this is again -- we didn't really assume anything about many of
these metrics, so that's what makes it powerful.
>>: [inaudible] reevaluate the -- this -- easily imagine the situation where you see
a new type of crisis show up where the relevant metric was something that didn't
matter before because it had been.
>> Peter Bodik: Right. Right. So what we do is every time a new crisis
happens, after it's over we run this feature selection on that new crisis and we add --
potentially add new metrics in and we might remove some other metrics from the
set. Yeah. All right. And so I'd like to point out that we're using this classification
setting here only for the feature selection. We're not actually using these models
later to create the fingerprints.
Now, in the second step I would like to take all the selected metrics, all the
relevant metrics, and summarize them across the whole cluster. The reason is that
if you have a CPU utilization metric, for example, you have 400 measurements
and you don't want to remember all 400 values of this.
So at the top you see a histogram, or distribution, of observed values of this metric.
And we characterize the distribution using three quantiles, the 25th percentile,
the median and the 95th percentile. We selected these ones by visually
inspecting the data. But you could also automate this process once you have
some labeled data and automatically pick the quantiles that actually give you the
best accuracy.
And the advantage of this is that this representation is robust to outliers, even if
one or two servers report extremely high values of metrics it doesn't affect these
quantiles that we're computing. And it also scales to large clusters. Both in
terms of -- if you're adding more machines we're still representing the fingerprint
using just these three quantiles. And also there are techniques to efficiently
compute quantiles given large input sizes.
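A small sketch of this summarization step, assuming the raw per-server values of one metric for one epoch; the three quantiles are the ones named above:

```python
import numpy as np

QUANTILES = (0.25, 0.50, 0.95)  # 25th percentile, median, 95th percentile

def summarize_metric_across_cluster(values_per_server, quantiles=QUANTILES):
    """Collapse one metric's per-server values for one epoch into a few
    quantiles, so we never have to remember all 400 raw values."""
    return np.quantile(np.asarray(values_per_server, dtype=float), quantiles)

# Example: 400 servers, two of them reporting extreme CPU utilization;
# the quantiles barely move, which is the robustness-to-outliers point.
rng = np.random.default_rng(1)
cpu = np.concatenate([rng.uniform(20, 40, 398), [99.0, 100.0]])
print(summarize_metric_across_cluster(cpu))
```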
>>: [inaudible] in the previous slide when you select the features, when you train
the logistic classifier, so for each crisis you're assuming you get 400 instance
examples of failing of crisis behavior for all the servers. You're assuming that
each crisis involves all of the servers?
>> Peter Bodik: So we take -- when a crisis happens, we take some time around
the crisis, say, you know, 12 hours before, 12 hours after. So it covers both the
normal periods when the system was okay. And it covers also the problematic
periods. And so the number of data inputs is, you know, 100 metrics. But then
the number of data points is 400 times the duration of the crisis plus some buffer
on both sides.
>>: But you are assuming that all 400 servers are exhibiting some kind of crisis?
>> Peter Bodik: If not, then they just fall into the normal class.
>>: So how do you label the crisis?
>> Peter Bodik: So for each crisis in this particular case each server is reporting
the three latencies. And we know the thresholds on each latency that actually
mattered that are specified by the operators. For each server and each point in
time we know whether that server is okay or not.
>>: [inaudible] the servers doing the same thing for the Amazon model
[inaudible].
>> Peter Bodik: Right. So this works kind of best for say a single
application, for like application servers belonging to a single application. Right.
Yes.
So what we tried and that wouldn't really work is using, say, mean and variance of
these metrics, because these are really sensitive to outliers and a single server
could significantly affect the values of these. And we also tried using just the
median of these values. But that didn't give -- in some cases that didn't give us
enough information about the distribution of these values. And we had to add a
few more quantiles.
>>: [inaudible].
>> Peter Bodik: Because we were interested in the crises that affect a large
fraction of the cluster. In this case the crisis was only happening when at least 10
percent of the machines, in this case at least 40 machines, were affected.
>>: [inaudible] asymmetric?
>> Peter Bodik: Right.
>>: Quantiles. 25 and 95. So is there an implicit assumption that you're more
likely going to -- because you're quantifying this way, is there [inaudible] that high
is more like the [inaudible].
>> Peter Bodik: In most of the -- yeah, I mean we did this by visually inspecting
the data and, you know, we noticed that a lot of the metrics -- you know, when
something breaks, it's usually the metrics go up. Or they don't drop below zero
say, right. So we wanted to be more sensitive on the high end. But again, this is
something that you could automate. And once you have some labeled data you
could try different quantiles and, you know, pick the ones that actually work
better.
>>: [inaudible] opposed to latency metrics but [inaudible] much more interesting,
right?
>> Peter Bodik: Right.
>>: All right.
>>: So these are numbers that you use across all the metrics, not just an
example of CPU?
>> Peter Bodik: Yes. Yeah.
>>: Okay. So you use exactly the same --
>> Peter Bodik: Right. Right. Okay? So now in the third step, the goal is to
take -- observe the raw values of the metric quantiles, like the median of CPU
utilization, and map it into hot, normal, or cold values. So to, say, kind of capture
automatically whether the current value is too high or too low.
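A sketch of that discretization; comparing against the 2nd and 98th percentiles of the metric's own history is an illustrative choice, not necessarily the cutoffs that were actually used:

```python
import numpy as np

HOT, NORMAL, COLD = 1.0, 0.0, -1.0

def discretize(current_value, historic_values, low_pct=2.0, high_pct=98.0):
    """Map one metric quantile (e.g. the median CPU utilization for this
    epoch) to hot/normal/cold relative to its own historic distribution."""
    lo, hi = np.percentile(np.asarray(historic_values, dtype=float),
                           [low_pct, high_pct])
    if current_value > hi:
        return HOT
    if current_value < lo:
        return COLD
    return NORMAL

# One row of an epoch fingerprint is this applied to ~11 metrics x 3 quantiles.
history = np.random.default_rng(2).normal(50, 5, size=1000)
print(discretize(75.0, history), discretize(50.0, history), discretize(30.0, history))
```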
So what you see on the right, you have four fingerprints; we call them
epoch fingerprints because now for every single time epoch we have one row in
this rectangle that represents that particular time step, and the gray represents a
normal value, the red represents too high and blue represents too low.
So what you see here is that there's about -- here we use about 11 metrics.
Each one captured using three quantiles. So there's about 33 columns. And so
we achieved three things here. First we can differentiate among different crises.
The crises -- there are two instances of the same crisis on the top. And visually
the fingerprints look very similar. Whereas there are two different crises at the
bottom, and the fingerprints are different.
The second thing is that the representation is very compact, right? Remember
that we started with 400 machines, each one reporting 100 metrics over a period
of time and we can really compactly represent it here. And on top of that it's
intuitive because you can quickly see that there is this one metric that first dropped
at the beginning of the crisis and then got back to normal and then maybe the high
quantile increased.
And when we actually showed these fingerprints to the operators, they were able
to identify some of these problems just by looking at these -- this heat map.
Right? So we thought that's really cool that they can do that. And they could
verify that whatever is captured here actually represents the problem they were
looking at.
What we tried is using the raw metric values. The problem there is that, say,
some of these metrics could achieve really high values during the crisis, so
comparing them -- comparing the vectors would really skew -- or these extreme
values would really skew the comparison. And we also tried fitting a time series
model here. Because we noticed that, you know, there are these daily patterns
of workloads and CPU utilization, and we tried fitting a time series model, but for
many other metrics there were no significant patterns. So doing this
discretization based on this model didn't really work very well.
>>: [inaudible].
>> Peter Bodik: We had about 10 types, 10 crisis types total, and had 19 instances
total. Yeah.
And finally, the last thing we have to deal with is that different crises have different
durations. So you can't directly compare the rectangles you saw in the previous
slide. And that's one thing that wouldn't have worked. The other thing that
wouldn't have worked is just using the first epoch of the crisis because often the
crisis evolves over time and you want to capture that information.
So the simple thing we did is simply average over time. Take all the fingerprints
during the crisis and create a single vector that represents that crisis and we
would compare different crisis fingerprints by computing the Euclidean distance.
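A sketch of the time averaging and the comparison; the 33-column layout (11 metrics times 3 quantiles) just follows the example above:

```python
import numpy as np

def crisis_fingerprint(epoch_fingerprints):
    """Average the per-epoch rows (hot/normal/cold values) over the duration
    of the crisis, so crises of different lengths become comparable vectors."""
    return np.mean(np.asarray(epoch_fingerprints, dtype=float), axis=0)

def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two crisis fingerprints; smaller means more similar."""
    return float(np.linalg.norm(np.asarray(fp_a) - np.asarray(fp_b)))

# Two crises of different durations over the same 33 columns (11 metrics x 3 quantiles).
rng = np.random.default_rng(3)
crisis_a = rng.choice([-1.0, 0.0, 1.0], size=(4, 33))  # 4 epochs = 1 hour
crisis_b = rng.choice([-1.0, 0.0, 1.0], size=(6, 33))  # 6 epochs = 1.5 hours
print(fingerprint_distance(crisis_fingerprint(crisis_a), crisis_fingerprint(crisis_b)))
```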
So those are the four steps of how we create the fingerprints. But now we
also need to define the method of how we use the fingerprints to identify the
problems in realtime. Again, you see the timeline on the left.
First the crisis is detected by violation of an SLO. And what's important here is
that the operators of EHS told us that it's useful for them to get a correct label for
a crisis during the first hour after the crisis was detected. The reason is that it
often takes them more than an hour to figure out what's happening. And so
during the first hour after the crisis was detected, we performed this identification
step.
What we do is we update the fingerprints first, given all the data that we have,
and we compare the fingerprint to all the past crises we have in our trouble-ticket
database. And we either emit a label of the -- of a single crisis that we find,
or we emit a label -- a question mark -- that means that we haven't really
seen this crisis before. So over time what you get is these five labels: you
first try to do identification when the crisis is detected and then every 15 minutes
later during the first hour.
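A sketch of that identification step; the distance threshold that separates a known crisis from a previously unseen one is a placeholder that would have to be tuned on past labeled crises:

```python
import numpy as np

def identify_crisis(current_fp, past_fingerprints, unknown_threshold):
    """Return the label of the closest past crisis, or '?' if nothing is close.

    past_fingerprints: dict mapping crisis label -> stored fingerprint vector.
    Called at detection time and then every 15 minutes during the first hour,
    each time with the fingerprint updated from the newly available epochs.
    """
    if not past_fingerprints:
        return "?"
    distances = {label: float(np.linalg.norm(np.asarray(current_fp) - np.asarray(fp)))
                 for label, fp in past_fingerprints.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= unknown_threshold else "?"

past = {"crisis_A": np.zeros(33), "crisis_B": np.ones(33)}
print(identify_crisis(np.full(33, 0.1), past, unknown_threshold=2.0))  # crisis_A
```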
>>: [inaudible] average of the whole crisis, right?
>> Peter Bodik: Or during the first hour of the crisis.
>>: Oh, okay. That makes sense. And are you comparing always versus the
first hour average or after 15 minutes to like 15 minute average?
>> Peter Bodik: We would compare the first say 15 minutes or half an hour to
the fingerprint of the whole hour for the past crisis. And actually, what turned out to
be crucial also was to include the first 30 minutes before the crisis in the
fingerprint. So in this case actually here the first two rows are the two 15-minute
intervals before the crisis was detected. And in this case everything is normal.
But in some of the other crises you see that these metrics were high even before
the crisis started.
And so using those first 30 minutes and adding them to the fingerprint was crucial,
or increased the accuracy.
And once the crisis is over, we update the relevant metrics, we run the feature
selection step, and we update all the fingerprints to use the same set of metrics.
And finally the operators would add this crisis to the trouble-ticket database with
the right label.
>>: So say the operator didn't notice the crisis, would your system be useful in
detecting it? Or would you find similarities to normal --
>> Peter Bodik: So we didn't deal with the detection part. We all -- we took that
definition of the crises that they gave us which was defined in terms of the
latency metrics. And so we figured that that's the input. And we didn't really look at
where we could do a better job. We tried using these fingerprints, but we didn't
really -- couldn't really do a better job in terms of prediction.
>>: So [inaudible] talking about prediction in advance of the crisis happening, but --
>> Peter Bodik: Yeah, we assume they know. Somebody detected it, right, or you
detected using the SLO violation and then this is when we triggered the
fingerprinting okay? Yes?
>>: [inaudible] some failures can be due to the [inaudible] in the program, like --
each other and like if they are not totally consistent.
>> Peter Bodik: So you mean that the problem would not really manifest as an
increased latency?
>>: Right. I mean now the root cause -- the root cause of some failures may be not like
the system [inaudible] overloaded but it is because there are some problems in the
code?
>> Peter Bodik: Right. If this translates into increased latency, right, then we
would detect it by checking the SLO and then you could start the fingerprinting
process on top of it. So, yes.
>>: So if the problem doesn't lead to performance crisis then you are not dealing
with it?
>> Peter Bodik: We were not dealing with that, yes.
>>: There's no way they [inaudible].
>> Peter Bodik: All right. So next let's look at evaluation. As I mentioned, we
used data from EHS from about 400 machines, 100 metrics per machine,
15-minute epochs. And the operators told us that one hour into the crisis is
still useful in terms of correct identification.
We were -- we had these three latency metrics, and the thresholds on each of
them that were given to us by the operators. And if more than 10 percent of the
servers had latency higher than the threshold during one of the epochs, that
constituted a violation and the definition of a crisis.
We had 10 crisis types total, 19 instances of a crisis. There was one instance --
one crisis type where we had nine instances, another one where we had two,
and the other eight where we had only one each.
>>: [inaudible].
>> Peter Bodik: So it's like the same instance -- same crisis happens 10 times.
Those were the 10 instances, occurrences of the crisis. And so -- and these
happened during a four-month period.
The main evaluation metric that we care about is the identification accuracy,
right? How accurately can you label the problem? But before I get to that, we
actually defined what we call identification stability. Because every time a crisis
happens we actually emit five labels. You might start with saying you don't know
what this crisis is and then you start labeling it correctly. And we require that we
stick to the same label. Once you assign a particular label to a crisis you will not
change your mind later, so that the operator is not confused.
So here at the top, you see two instances of unstable identification where we
first change our mind from crisis A to crisis B and in the second instance we
change our mind from crisis A to I don't know, right? So both of these would be
unstable. And even if A is the right crisis, we wouldn't count them as correct.
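A sketch of this strict correctness criterion applied to the five emitted labels, with '?' standing for 'previously unseen crisis':

```python
def identification_is_correct(emitted_labels, true_label):
    """The sequence may start with '?', but once a concrete label appears it
    must never change or revert to '?', and it must match the true label.
    For a previously unseen crisis (true_label == '?') all answers must be '?'."""
    first = next((i for i, lab in enumerate(emitted_labels) if lab != "?"), None)
    if first is None:                       # said "I don't know" every time
        return true_label == "?"
    tail = emitted_labels[first:]
    stable = all(lab == tail[0] for lab in tail)
    return stable and tail[0] == true_label

print(identification_is_correct(["?", "A", "A", "A", "A"], "A"))  # True
print(identification_is_correct(["A", "B", "B", "B", "B"], "B"))  # False: label changed
print(identification_is_correct(["A", "?", "?", "?", "?"], "A"))  # False: reverted to '?'
print(identification_is_correct(["?", "?", "?", "?", "?"], "?"))  # True: correctly unseen
```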
So in terms for -- in terms of crises that were previously seen, if we were
identifying a crisis that was previously seen, the accuracy was about 77 percent.
This means that for about three-quarters of the crises we can assign the right label,
and we can do this on average about 10 minutes after the crisis is detected. So
potentially it could save up to 50 minutes of the identification performed by the
operators, which seems pretty good.
And a hypothesis that we can't really verify is that if we had shorter time epochs,
we could probably do it even faster. But we had the 15-minute time intervals.
For the crises that were previously unseen, the best we can do is say five times I
don't know what this crisis is, right? And that's the definition of accuracy here.
And for 82 percent of these crises we're able to do this correctly.
>>: Is that statistically relevant or just a number of [inaudible].
>> Peter Bodik: You mean why isn't it higher?
>>: [inaudible].
>> Peter Bodik: Oh, I mean that's -- these are almost completely independent
problems, right? So there's no reason why that should be higher than this one.
>>: Are the previously seen crises [inaudible]. The way you described it was
that you don't [inaudible].
>> Peter Bodik: Right.
>>: [inaudible] feature vectors.
>> Peter Bodik: Right. I mean as you get more data the accuracies actually do
increase over time. Because the more data you have the more samples of
different crises you have and you can tune your threshold, identification
thresholds better. So there's definitely that effect.
>>: These accuracies were computed causally, right? It's not like you did a whole
[inaudible] you were -- you sort of described your temporal data -- did your
temporal thing and then [inaudible] accuracy and measure yourself, do it again,
do it again.
>> Peter Bodik: So the way we did it is that -- we only had this one sequence of
19, we actually -- we reshuffled them many times, I think like 40 times. So we
created 40 new sequences, so we could test the identification in different orderings of
these crises. But for each sequence, what we did was we started with one or two
crises and then we went chronologically in that order.
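A sketch of that evaluation protocol; the run_identification hook is hypothetical and stands in for the whole fingerprint-matching pipeline, returning whether the emitted labels met the strict criterion:

```python
import random

def evaluate_reshuffled(crises, run_identification, num_shuffles=40, warmup=2, seed=0):
    """Build reshuffled orderings of the labeled crises, replay each one
    chronologically, and identify every crisis using only the crises that
    came earlier in that ordering.  The first `warmup` crises just seed the
    fingerprint database and are not counted toward accuracy."""
    rng = random.Random(seed)
    per_shuffle = []
    for _ in range(num_shuffles):
        order = list(crises)
        rng.shuffle(order)
        history, correct, counted = [], 0, 0
        for i, crisis in enumerate(order):
            if i >= warmup:
                correct += bool(run_identification(history, crisis))
                counted += 1
            history.append(crisis)          # the fingerprint database grows over time
        per_shuffle.append(correct / counted)
    return sum(per_shuffle) / len(per_shuffle)
```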
>>: [inaudible].
>> Peter Bodik: Yes, across 40.
>>: Is this a [inaudible] about 30 percent of the time --
>> Peter Bodik: I think so. I mean, yes. Yes. You mean 30 percent of the time
you still -- give them an incorrect label but they would still be checking what this
crisis is anyway. So it should still give them a -- it should still help them. And
also in practice, the way we did this evaluation was we only considered the
closest similar failure or crisis in the past. But in practice you might look at the
top three, for example, crises, right? And the operator might decide himself
which one actually looks best.
>>: So I'm curious about this reshuffling now because it seems like in these
systems oftentimes the systems change and get updated over time. And by
reshuffling it seems like you're using future knowledge of what this -- of what
happened after the system's been updated to make predictions about crises
which happened in the past. Do you find that if you didn't do this you just ran it --
>> Peter Bodik: Well, we definitely wanted to do some reshuffling because
otherwise you would only have a sequence of 19 crises, and we could compute
the accuracy. But it would really be based on very few samples. So that's the reason why
we did that. And you know the time period was relatively short. It was just four
months. So I don't think the system would have changed that significantly.
>>: [inaudible] right? So you're not doing -- the reshuffling, it started completely
clean for each of the 40 runs. So if there's any changes over time [inaudible].
>> Peter Bodik: Right. But you know crisis five might tell me something
about a crisis that happened earlier in reality but after reshuffling happened later. But
yeah, I don't think -- obviously, you know, if you have a longer time period like a
year or longer you might want to even start forgetting some of the fingerprints.
Because as crises are fixed you want to maybe delete them from your database.
>>: [inaudible].
>> Peter Bodik: So we assume that there is a database that -- of all the past
crises that are labeled and all of them have fingerprints assigned to them. We
just compute the Euclidean distance and that's the similarity measure.
>>: Out of the 23 percent where you weren't accurate for previously seen, do
you know what percentage were like you gave back question mark versus you
gave back wrong solution to [inaudible].
>> Peter Bodik: That's a good question. But we haven't quantified that.
>>: Do you think the distribution of sort of the couple of them being common and
the remainder as being one offs will generalize to longer time periods [inaudible]
and if that's the case wouldn't it be more effective to give them some sort of lower
level information other than just the label, since most of -- a lot of the time they
would just say this is new? That's not as useful as telling them here is some
more salient features that are --
>> Peter Bodik: Right. Well, so in the data from Amazon I don't have raw
counts, but there's a lot of these failures that -- I think in that system we saw a lot
more things happening, at least as I remember people running around and
responding to e-mails; you know, they had kind of this Wiki page almost of
things that break and they kept track of them, right, and many of these things
were recurring again and again.
And I think it's more likely to happen for crises that are not that significant.
Because if this is a problem that really takes down the website, they really make
sure that it's not going to happen again. Right? For the less severe problems it's
very likely that it will happen again.
>>: [inaudible] the accuracy for previously unseen crisis. So for what cases do
you deem as predicting [inaudible] and what are not.
>> Peter Bodik: So in this case I would only say that it's accurate if we have five
-- five question marks. That's the only case when we actually say that it's
accurate.
>>: So by using the question mark you mean --
>> Peter Bodik: I don't know what it is. There's no similar crisis in the past.
>>: But the definition of the crisis should be 100 percent right because every
time they [inaudible] it will be deemed as a crisis.
>> Peter Bodik: So the crisis is happening. It's just that the same type of crisis
hasn't really happened before.
>>: So this 82 percent means that for 82 percent of the cases you didn't assign a
definite label for this crisis?
>> Peter Bodik: Right.
>>: Okay.
>> Peter Bodik: And this is the first time we saw the crisis, so this is the best we
can do, right?
>>: The accuracy for previously seen crisis, can I say it's like 50 percent as the
baseline because you have 10 of the same kind, so you can always predict that's
type A.
>> Peter Bodik: The baseline is more like 10 percent. If I have 10 crisis types, I
can pick one of the 10.
>>: No, no. I mean -- if it was -- there are some prior knowledge saying type A is
more likely to be [inaudible] based on previous data say.
>> Peter Bodik: Right.
>>: Than based on 15 percent?
>> Peter Bodik: Right.
>>: Okay.
>> Peter Bodik: So yeah, it depends on what is the different distribution so these
crises and how often they occur.
>>: [inaudible].
>> Peter Bodik: You could. But we didn't get to experiment with that. Moises
actually -- after I left, he continued working on this and he developed the more
advanced models, and so he has some results on that. But the accuracy wasn't
higher, it was just -- it was a more flexible model and didn't need maybe as much
training data.
>>: If you didn't use this really harsh metric but you said the majority of the
labels matched the true one, how much higher would the [inaudible].
>> Peter Bodik: Again, we wanted to make sure that we were like using the
strictest criterion, so that -- yeah.
>>: So when you say -- so you reshuffle in [inaudible] 19 and you start
[inaudible] from when does this start to count? After the second example or --
>> Peter Bodik: So that's a good point. I mean that's another thing that we
experimented with in -- the worst case is you know you start with two crises,
right, so that kind of gives you at least some idea. And you count the remaining
17 for accuracy. That gives us a little bit lower accuracy, maybe 70 percent.
But when we started counting after 10, that's when we achieved this. So if you --
once you have enough training data, you can adjust your thresholds correctly
based on the data you've seen in the past.
All right. Next slide. So the closest related work is this paper from SOSP '05. It
has some overlapping authors. Moises and Armando Fox were in that paper.
What they designed there is they called failure signatures. The main differences
were that each signature was for an individual server not for a whole large scale
application, so it doesn't directly apply to this scenario.
And it was also -- the process was also a lot more complex because they were
creating these classification models per crisis and they were maintaining a set of
them, they were removing some of them, adding new, and they were actually
using -- they were classifying the current crisis according to all these models and
using that to decide what is the -- what is the label of the current problem? And
we show that we're doing much better by using our fingerprints.
So to summarize fingerprinting, we -- I presented crisis fingerprints that
compactly represent the state of the system. They scale to large clusters
because we use only a few quantiles to capture the metric distribution. And they
also have an intuitive visualization, so that both we, as the designers, and the data
center operators could understand what these fingerprints are really capturing.
The feature selection technique from machine learning was crucial for selecting
the right set of metrics here. And we achieved an identification accuracy of
approximately 80 percent, on average 10 minutes after the crisis was
detected.
In terms of impact at Microsoft, the metric selection part is now being used by one
of the Azure teams to diagnose their storage systems. And we also submitted
two patents, one on the metric selection, and the other one on the overall
fingerprinting process.
If there are no questions, I'll move to the second part. I'll briefly go over workload
modeling -- modeling of workload spikes. So in this project, the goal was to create
a generative model of a workload spike that we can use to synthesize new
workloads and stress test Web applications against these workloads.
So spikes happen a lot, right? Like the one that happened after the death of
Michael Jackson where they saw on Wikipedia about five percent increase in
aggregate workload volume, but 13 percent of all the requests to Wikipedia were
directed to Jackson's article.
>>: [inaudible].
>> Peter Bodik: His was one of the hottest pages, but still, you know, out of the
million Wikipedia pages it was a small fraction, less than one percent say. So
there was a significant increase in traffic here.
So what we mean by a spike, it's an event where you either see an increase in
aggregate workload to the website or you have significant data hotspots, where
there is a small set of objects that see this spike in workload.
And we started working on this project because there was little prior work on
characterization of spikes. There's a lot of work on characterization of Web
server traffic or network traffic but little on spikes and hotspots. And also there's
a lot of workload generators that people use for stress testing but very few of
them actually let you specify spikes in the traffic that they generate, and they're
not very flexible.
So there are two contributions here. One is a characterization of a few workload
spikes, where we realized that these spikes vary significantly in many important
characteristics. And then we propose a simple and realistic spike model that can
capture this wide variance of spike characteristics and that we can use to synthesize
new workloads.
So first let's look at different workload characteristics that are important or spike
characteristics.
For spikes in the workload volume we looked at time to peak from start of the
spike to the peak of the spike, the duration, the magnitude and the maximum
slope in the aggregate workload. And for data hotspots we looked at the
number of hotspots in this particular spike, the spatial locality, or whether the
hotspots are on the same servers or not, and the entropy of the hotspot
distribution, whether the hotspots are equally hot or there are some hotspots that
are hotter than others.
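As a rough illustration, these characteristics can be read directly off a trace; the exact definitions below (for instance the normalized entropy) are assumptions for illustration, not necessarily the ones in the paper:

```python
import numpy as np

def volume_characteristics(workload, baseline):
    """Characterize the aggregate-volume side of a spike: time to peak,
    duration, magnitude relative to the baseline, and maximum slope.
    workload: requests per epoch during the spike window."""
    workload = np.asarray(workload, dtype=float)
    peak = int(np.argmax(workload))
    return {
        "time_to_peak_epochs": peak,
        "duration_epochs": len(workload),
        "magnitude": workload[peak] / baseline,
        "max_slope": float(np.max(np.diff(workload))),
    }

def hotspot_characteristics(hotspot_traffic_fractions):
    """Number of hotspots and the normalized entropy of the traffic over them:
    close to 1 means equally hot, close to 0 means one dominant hotspot.
    All fractions must be positive."""
    p = np.asarray(hotspot_traffic_fractions, dtype=float)
    p = p / p.sum()
    entropy = float(-np.sum(p * np.log(p)) / np.log(len(p))) if len(p) > 1 else 0.0
    return {"num_hotspots": len(p), "normalized_entropy": entropy}

print(volume_characteristics([100, 150, 400, 300, 120], baseline=100))
print(hotspot_characteristics([0.13, 0.002, 0.001, 0.001]))  # one dominant hotspot
```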
So in the next two slides we'll look at these four characteristics and show you that
the spikes we looked at varied significantly in each of these characteristics.
Now, on this slide you see the first two spikes. The one on the left is the start of
a soccer match during the World Cup in 1998. The second one, this is a trace
from our departmental Web server when 50 photos of one of the students got
really popular.
And in the graphs on the top you see relative workload volume coming to the
website. And as expected, the workload volume increased significantly during
the spike.
We also wanted to look at our data hotspots. So first, an object in these traces
is either a path to a file or a Wikipedia article, and so on. And a hotspot is an
object with a significant increase in traffic during the spike.
So in these particular spikes we identified about 150 hotspots for the first spike
and about 50 hotspots for the second spike. And what I'm showing in the -- on
the graphs on the bottom is the fraction of traffic going to a particular hot spot.
And so that one page over there received about three percent of all the traffic
during this particular spike.
Now, there are these three other spikes we looked at: again from our
departmental Web server when the Above the Clouds paper from RAD Lab got
slashdotted, the start of a campaign on Ebates.com, and the Michael Jackson spike.
And here is the first big difference. On the spike on the left there's a significant
increase in workload during the spike, but on the spikes on the right the
aggregate workload volume is almost flat. But the -- these are still spikes
because there are significant data hotspots here. During the Ebates campaign
there was one page that received about 17 percent of traffic and the Michael
Jackson article again received about 13 percent of all the traffic. So that is the first
characteristic: what is the magnitude of these spikes.
The second characteristic that we looked at is the number of hotspots. On the
left again you see 50 or 150 different hotspots, on the right one or two only, right?
So there's a big difference there.
And finally we wanted to answer the question whether all the hotspots are equally
hot or some hotspots are a lot hotter. So on the spike on the left, these 50
photos were almost equally hot during this spike while Michael Jackson's page
was significantly hotter than all the other hotspots in the system. So those are
the three characteristics we looked at here, and the spikes vary significantly.
The other question we wanted to answer is whether the hotspots are co-located
on the same servers or not. Here we don't really know the locations of the
objects from the workload traces because we only know which requests -- which
objects were being requested. So what we did was we sorted all the objects
alphabetically and assigned them to 20 servers. And this is something that might
not make sense for some storage systems, but we were interested in this in the
context of the SCADS storage system that we were working on at
Berkeley.
And in that system we're actually storing objects in this alphabetical order to enable
efficient range queries on these objects. So what I'm showing here at the bottom
are these 20 servers that have a bunch of objects assigned to them. And I'm
showing how many hotspots were actually on each particular server.
So what you see on the spikes on the left is high spatial locality. On the
very left there is one server that has almost all of the 50 popular
photos. In the World Cup spike there are three clusters of hotspots, whereas in
the Michael Jackson spike the hotspots were distributed across most of the
servers.
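A hypothetical sketch of that alphabetical assignment, assuming the full object list and the hotspot list as inputs and 20 servers, each holding one contiguous range of keys:

def hotspots_per_server(all_objects, hotspots, n_servers=20):
    ordered = sorted(all_objects)                      # alphabetical order, as in range partitioning
    per_server = -(-len(ordered) // n_servers)         # ceiling division: objects per server
    server_of = {obj: i // per_server for i, obj in enumerate(ordered)}
    counts = [0] * n_servers
    for obj in hotspots:
        counts[server_of[obj]] += 1                    # number of hotspots landing on each server
    return counts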
>>: [inaudible] traces from?
>> Peter Bodik: So there are actually two sources of data. Wikipedia publishes
hourly snapshots of the number of hits to every page. And recently there's also a
new couple-month trace of individual requests, something like a 10 percent sample,
that is available out there. But what we used were the hourly snapshots that they
publish.
>>: I don't quite understand about the rest of them. Isn't there just a single
document?
>> Peter Bodik: Right. But we found that there were about 60 other hotspots,
and when we looked at them, these were the articles being linked from Jackson's
page, which were not nearly as popular but still pretty popular.
So the point of this part was to characterize what the real spikes looked like.
We only looked at five, but even in these five we found significant differences.
So the conclusion is that there is no typical spike. If you want to stress test your
system against these spikes you can't just use one, you should really explore a
larger space of these spikes.
And that was the goal of the second part of this project where we wanted to
design a workload model to synthesize new spikes. You can think of this as
there's this space of spikes defined by the seven characteristics, so it's a
seven-dimensional space. And we saw five examples of spikes in it.
What we want to do is create a generative model that has a few parameters
that control the characteristics of the spikes that you synthesize. I'd like to point
out that we're not interested in inference, so I don't care about inferring which
actual parameters would yield one particular spike; all I care about is that for any
combination of the seven characteristics there exists some set of parameters that
could give me those characteristics.
So in the next slide I'll tell you how we designed this generative model. There
are two components to this model, because there are two things that you need to
synthesize a spike. The first one is the workload volume during the spike, which
determines how many requests per second you're generating.
The second one is the object popularity during the spike, which determines
the probability of picking a particular object and generating a request against
it.
So to generate the workload volume with the spike, we simply take some
workload trace that you might have in your system, like the curve you see in the
top left, and we multiply it by this piecewise linear curve. And there are a few
parameters here that actually affect characteristics like the slope,
duration and magnitude of the spike. So this is the first part.
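A minimal sketch of this volume-generation step, with illustrative parameter names (onset, time to peak, duration, magnitude) rather than the exact parameterization from the talk:

import numpy as np

def add_volume_spike(trace, onset, time_to_peak, duration, magnitude):
    # Multiply a baseline trace by a piecewise linear envelope that ramps from 1
    # up to `magnitude` over `time_to_peak` steps and decays back to 1 by `duration`.
    assert 0 < onset and 0 < time_to_peak < duration and onset + duration < len(trace) - 1
    t = np.arange(len(trace))
    xp = [0, onset, onset + time_to_peak, onset + duration, len(trace) - 1]
    fp = [1.0, 1.0, magnitude, 1.0, 1.0]
    envelope = np.interp(t, xp, fp)                    # the piecewise linear multiplier
    return np.asarray(trace, dtype=float) * envelope

The maximum slope of the generated spike then falls out of the ratio of magnitude to time to peak, so sweeping those two parameters also covers the slope characteristic.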
The more interesting part is how you generate the object popularity and add
hotspots to that popularity. So again we start with object popularity without
hotspots. These could be, for example, samples from a Dirichlet distribution, or
you could use the actual object popularity from your system. Then we pick
which objects are the hotspots, and once we have that, we
adjust the popularity of these hotspots.
So in the next two slides I'll talk about these two parts of the model. First we
want to pick which of all the objects in the system are hotspots. We
wanted two parameters to control both the number of hotspots and the
clustering of these hotspots. So we have this clustering process with two
parameters, the number of hotspots and L, which is the clustering parameter. And we
perform the following.
We start with the first hotspot and add it to the first cluster. And for each of the
remaining hotspots, we either put it in a new cluster with probability
proportional to L, or into an existing cluster with probability proportional to the number
of hotspots already in that cluster.
After you're done with that, you have a set of hotspot clusters, and you assign
these clusters randomly across the space of all the objects in your system.
This process is also known as the Chinese restaurant process.
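A small sketch of this clustering step, which is just a standard Chinese restaurant process; the function and parameter names are illustrative.

import random

def cluster_hotspots(n_hotspots, L, seed=0):
    # A new hotspot opens a new cluster with weight L, or joins an existing
    # cluster with weight equal to that cluster's current size (rich-get-richer).
    rng = random.Random(seed)
    clusters = [[0]]                                   # the first hotspot starts the first cluster
    for h in range(1, n_hotspots):
        weights = [len(c) for c in clusters] + [L]
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick == len(clusters):
            clusters.append([h])                       # open a new cluster
        else:
            clusters[pick].append(h)                   # join an existing cluster
    return clusters

Each resulting cluster would then be placed at a random position in the ordered object space, so clustered hotspots end up on the same servers.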
And so here on the right you see a few samples of this process. For large values
of L, you get an almost uniform distribution of hotspots across your clusters,
whereas as you decrease the value of L you get more clustering, and more hotspots
are clustered on a few servers.
So that shows how a few simple parameters let us control what the
hotspots look like.
And in the second part, we wanted to adjust the popularity of these hotspots.
We sample from a Dirichlet distribution, which is parameterized by a single
parameter K. And again, K determines the variance, and thus the entropy, of
the popularity of these hotspots. So if you pick a small K you get a probability
distribution like the red curve over there, where you have a few hotspots that are
very popular, which might match the Michael Jackson case; or you pick a large
K, and you have hotspots with very similar popularity, which might
match the 50 popular photos at our department.
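A minimal sketch of this popularity-adjustment step, assuming a hot_fraction parameter for the total share of traffic that goes to hotspots; the names are illustrative.

import numpy as np

def hotspot_popularity(n_hotspots, K, hot_fraction, seed=0):
    # Symmetric Dirichlet with concentration K: small K gives a few dominant
    # hotspots (the Michael Jackson case), large K gives roughly equal hotspots
    # (the 50-photos case).
    rng = np.random.default_rng(seed)
    shares = rng.dirichlet([K] * n_hotspots)
    return hot_fraction * shares                       # fraction of all spike traffic per hotspot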
To summarize this part, we first performed spike characterization, where we looked
at five different spikes and concluded that there is no typical spike and they vary
significantly. In the second part we designed a workload generation tool that lets us
flexibly adjust the spike parameters, which is much better than the workload
generation tools out there that don't really let you add spikes to workloads.
In terms of model validation, here I'm showing you a comparison of a
real workload and a synthesized workload. This is data
from the spike with the 50 popular photos. On the top graphs you see the aggregate
workload to the system, and on the bottom graphs you see the workload to the most
popular photo in the system. And they look pretty similar. There are of course some
dips over here that we could model, but that would make the model more
complex than we want.
So in the rest of the talk I just wanted to go over some of the other projects I
worked on and summarize. Yeah?
>>: I have a question about the piecewise linear function. How do you
determine that piecewise linear function?
>> Peter Bodik: These are the parameters that you can pick yourself.
>>: [inaudible] you find those piecewise linear functions?
>> Peter Bodik: So the goal of this was not to start with this spike and get
parameters that would actually match that spike. The way you would use it is
that, you know, I'm giving you a tool that has a few parameters that you might
stress test your system against. Maybe you start with spikes where
the slope isn't too steep, right? And then you keep increasing that
parameter and you keep testing your system to check whether your system can
handle it or not. That's the goal, right?
So we don't want to infer the exact values of parameters for a particular spike,
but instead give you the flexibility to tweak them.
>>: [inaudible] there will be some parameters like how many segments and what's
the biggest slope for each piecewise linear segment, and then the model automatically
generates those and synthesizes the data.
>> Peter Bodik: Yes, right.
>>: Because you might allow us to assign a likelihood to the existing data.
[inaudible] then you just look at the perplexity of the data you already have and
then just see if it's -- if it's actually predicting the ones you have. Or is that just --
I mean, it seems like you have a more modular approach where you just had
piecewise pieces. But in the end if you look at -- if you could just actually look at the --
>> Peter Bodik: You mean the likelihood of --
>>: Of the existing spikes that you have.
>> Peter Bodik: You mean that in terms of trying to predict that they --
>>: [inaudible] likely that the model would have -- could have produced these
spikes, how much do they agree with the model?
>> Peter Bodik: Right. I mean, this is some kind of ongoing work. Here
I'm just showing you, for this one particular spike, how closely we matched it,
and we're kind of still trying to figure out how you really compare them, how
realistic our spikes are compared to the ones we see. Right? Because it's --
it's not trivial, right? These curves look pretty similar, but how do you
really compare them, right, because of all the noise in there? You can't just
directly compare the values, right? So we're still kind of working on that, on how
exactly we compare them. Okay?
So let me now quickly run over some of the other projects I worked on. The first
one was on performance modeling, where we designed black-box models of various
systems, but we also combined query execution plans with simple models of database
operators to get flexible models that can extrapolate well to unseen workloads or
unseen configurations of the system, and that could help developers understand
the impact of their designs on future workloads.
The second project was on dynamic resource allocation, where we used these
performance models together with control theory techniques to do dynamic
resource allocation both for stateless systems like app servers and also for
stateful systems like the SCADS storage system I mentioned a while ago, where
you [inaudible] whether you're adding or removing machines, but you also have to
figure out how you're going to partition or replicate the data onto the
new machines.
The other project was on detection of performance crises. I mentioned we used
anomaly detection techniques to notice changes in user behavior. And
finally, for both this crisis detection project and the fingerprinting project, we designed
intuitive visualizations of the state of the system. Here we were showing the
patterns of user workload on the website, where the operators
could verify whether our anomaly detection technique was actually detecting
anomalies or it was a false alarm.
So the big picture was that all the projects I worked on were applying machine
learning to the monitoring data being collected in the data center and creating
accurate models of workload, performance and failures. And we were able to use
various techniques like classification, feature selection, regression,
sampling and anomaly detection. But on top of that, we were able to combine
these techniques with non-machine-learning techniques like visualization, control
theory, or optimization to get even more powerful results.
In terms of future work, there are just two projects I'd like to highlight. The holy grail
of data center management is really doing automatic resource management
simply based on performance SLOs for all these applications. Ideally I just take
all my applications running in a data center, describe how fast each one should
be running, and the data center should automatically take care of assigning
resources and efficiently provisioning all these applications. And here we really
need to combine all the models of workload, resources, and power together and use
optimization techniques to get there.
And the other project is on failure diagnosis. While I mentioned the crisis
fingerprinting, that really only helps you when the crises are repeating. Some
of the more significant crises can take hours to diagnose. You really need
to combine all these various models that we're building with traces of requests,
dependencies and configuration, and ideally put all that information together to
create a single, intuitive tool that the operators could use.
So to conclude, what I presented here is a case for using machine learning as a
fundamental technology to automate data center operations. The main reason
we need to use it is that the systems that we're building are too complex to
understand. Fortunately there are a lot of good sources of data in the data center
that could shed light on the workload, failures and performance of these
systems. And there's just too much data to analyze manually, and that's why we
need to use machine learning and statistical methods to understand this
data.
Thank you, and this work is in collaboration with other grad students, post-docs,
my advisors, and Moises Goldszmidt from Silicon Valley.
[applause].
>> Emre Kiciman: Questions?