>> Nikolaj Bjorner: Thanks. It's my pleasure to introduce Moshe Gabel from Technion. Moshe
was an intern here a few years ago and worked on latent fault detection. Since then he has
taken the research in an interesting new direction and has also done research on distributed
monitoring. So without further ado.
>> Moshe Gabel: Hi. Thank you, Nikolaj. I'm Moshe Gabel from Technion. In this talk I'm just
going to review some of the work we did in the past I think five years or so which started here
at Microsoft with [indiscernible] who's sitting over there. And basically we wanted to use
performance counters that Microsoft collects in their data centers to detect anomalies and
even perhaps predict certain problems.
So to start with, I hope all of you know that data centers are getting larger and more complex. If you look at the world of high-performance computing, they are increasingly using more complex components: more GPUs per machine, more cores per CPU, and so on. And if you look at cloud services, when you get failures you sometimes get data loss, or maybe you just lose some dollars; if your system is really doing well, you're just losing electricity on a failing machine, and that's the best case you could have. In the case of high-performance computing, the entire application runs for hours as one distributed computation, so failures mean they have to do a lot of check-pointing, and the more failures they have, the more frequent the check-pointing, which means that the utilization of the system is very low. And if they have any failure they have to restart from the last checkpoint, which might have been 30 minutes ago, so they're losing a lot of time over this.
So you get all these machines and all these components, so obviously you get more and more failures, because the mean time between failures is decreasing. And in many data centers you go to people and ask, okay, how do you monitor your machines? It's getting a little bit better, but still many people simply use some sort of manually set thresholds. They say, we know our system, the CPU usage should be between 60 and 80, and the latency of this [indiscernible] should be above something or there's a problem, and so on and so forth.
This leads to a sort of cycle of frustration. You have your support engineer, who we will call Bob; Bob is very happy, his life is running very well. Then he starts getting false alarms at 3 AM: paging and e-mails and phone calls at 3 AM, the system is coming down, we have all these alerts that we've set with the manual thresholds, and the latency is too low or the CPU usage is too high. And Bob goes to the office or logs in remotely and checks, and it's just some backup process that no one told Bob about, and it's fine, the system is running really well. It's just some noise, or something they haven't configured, or maybe a new version was pushed and the threshold shouldn't be set to 80 percent, it should be set to 82 percent.
Okay. So this repeats itself a few times, and then Bob is really tired and he's not happy. So what Bob does is relax the thresholds, right? He says okay, CPU should be between 60 percent and 85 percent. It's the new system; it's fine. He's happy again, he can sleep, and everything is good, except that a few days later there should have been an alert, and what you actually get is a service outage. Why? Because the CPU usage was 83 percent and the threshold should have been set to 82 percent, because there was no backup, and one machine had a low latency. Basically you relaxed the thresholds too much and now you are missing true alarms.
So you lose a lot of money, and Bob's boss asks: why? We had these alarms. I don't understand. And Bob says, well, we had all these false alarms. Well, now I want you to tighten these thresholds. So this is how it goes, and Bob is very unhappy, and we want to solve it. Usually when we ask people, they tell us: with all this machine learning stuff that people are doing these days, supervised learning and so on, why not learn from past failures? You have a failure, you have some log, and maybe you can predict this failure. Maybe you can learn what the threshold is.
And sometimes it works, but there are some inherent problems with this idea, because in the world of cloud services, which is where our first work was done, you have a changing workload and changing software. Sometimes you even have changing hardware, though not often; but some companies push a new version of their software a few times every day, and even if you do it once every week you push a new version of your software, or your workload changes because some event happened and everyone is going to your service, or you're scaling up because a newspaper ran an article on you. So now you look at the logs, and the logs used to say the threshold should be this and that, and that's no longer true, because the software has changed, or your service has changed, or your workload has changed.
Another problem: you say, let's go to the experts and they will tell us from the logs what the threshold should be. That's really hard. If you go to support engineers and ask them to please look at all these giant logs and tell me whether this system behavior is correct or incorrect, and you have to do this for the logs of a few hundred machines over a few days, it will take them many days, and it's precious time. It's not very interesting and it's precious time of their busy lives. And then you tell them, you know what, the system has changed and we want you to do this again, because those old logs are not for the new system, but don't worry, we have these new logs from the new system. No, they are not going to do it. So you want something very flexible and adaptive.
So there are several challenges. As I said, the system evolves, and that gets you into trouble if you try to model from historical data. I'm not going to talk a lot about this, but some systems are very big and very distributed, and communication there has major costs: even though in theory it's a data center, some systems don't run in highly connected data centers, and in other systems you don't want to overload the network with lots of data, so some centralized methods are limited. I'm not going to discuss this a lot; we don't have time. And also, like I said, your data is sometimes unstructured, you have a lot of missing labels, and you need to be very careful with supervised methods.
What people are now doing, and this includes some people who did this at Yahoo, a group at Microsoft that started doing this, and the work that we did a while ago (although no one's using it as far as I know), is starting to use unsupervised approaches. They're saying: we don't want labels, we don't know labels. We're basically going to collect a lot of performance metrics, like CPU usage and memory usage and disk latency and maybe temperature or queue length of requests for whatever service we are monitoring, we're going to do some clever outlier detection, and we're going to really, really hope that if we see an outlier machine then it has some true anomaly, it has a problem. Spoiler alert: usually no. You have a lot of spurious anomalies, as I'm sure anyone who's tried it on a real system knows, and you want to control this problem.
So we might call these anomalies latent faults. These are subtle performance anomalies; they indicate either some subtle problem right now, maybe a misconfiguration, a slow drive, some sort of database corruption, or someone stepped on a network cable so you get more errors on it, or something that could eventually result in a fault. Maybe there's a memory leak: the system is still running, but the leak is growing, and in the end the system will crash. This can be caused by hardware failure, software bugs, or misconfigurations. And they usually fly under the radar. No one monitors for this; usually they are too subtle to monitor, and you either catch them when it's too late or when the effect is really, really large. If you have a monitor on memory and you have a memory leak, you might say okay, memory has to be between this and that, but the leak is small so it takes a long while. It would be nice if we could detect the leak before.
And of course, like I said, not every outlier is actually a bad anomaly. Sometimes you just get noise. So the question is how to do the anomaly detection: what kind of outliers are we looking for? How are we going to set the threshold for the anomaly detector? How am I going to verify that the outliers the anomaly detector finds are true anomalies, and so on? There are a lot of questions in there. And this leads me to our first work, with Nikolaj and Ran and my advisor Assaf Schuster. This is work we published in DSN 2012, Latent Fault Detection in Large Scale Services.
So to break it down, this is the end result. We wanted to find latent faults in cloud services which are load balanced and have uniform hardware and uniform software. We managed to, I'll say, predict failures: basically we managed to find outliers, and it turned out that these tend to become failures with fairly high precision and up to 14 days in advance. We just find an outlier, look ahead using historical logs, and see what's going on. We saw that if you look at true failures from the existing health monitoring system and you go back a little, maybe a day, maybe a couple of days, at least 20 percent of these failures could have been predicted beforehand with very high precision. And our latent fault detector is basically a kind of outlier detector, and it's very easy to use. It's designed to have no tuning; it doesn't need domain knowledge. We don't care what you're monitoring or what the system does, as long as it satisfies certain assumptions, and it's fairly practical. It's easy to use and it limits the rate of false positives, so you can control it: you can say I want two percent false positives, or I want a 10 percent rate of false positives. And it's adaptive; it adapts to service changes automatically.
So what do we want to do? We have M multivariate time series; I say M machines. Each machine gives us C measurements. I mentioned some examples before: hardware and software counters, temperature, number of threads, and so on. And I'm going to assume that this is a scale-out, load-balanced system, some sort of web service perhaps, which handles lots of requests per second or per minute. So these are the measurements from machine one. We have T times: let's say I collect measurements and I'm looking at the last T measurements, or T measurements per day. And we have C types of measurements, like I said, CPU or [indiscernible] and so on.
This is machine 2, and we have M such machines, and maybe there's an outlier machine where some of these measurements are too low or too high; there's something there that is not correct. We want to find this outlier. That's what we want to do: we want to find this bad machine among all these machines, and it's outlier detection. We are not going to do any labeled learning, we are not going to learn some model of the correct behavior; we are going to do outlier detection for these multivariate time series.
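To make the setup concrete, here is a minimal sketch of the data layout and a naive per-time-point comparison, purely as an illustration; the array shapes and the median-based score are my own assumptions, not the method itself, which is described next.

```python
import numpy as np

# Illustrative shapes: M machines, T time points, C counters per machine.
M, T, C = 100, 288, 50                     # e.g. 5-minute samples over one day
rng = np.random.default_rng(0)
data = rng.normal(size=(M, T, C))          # data[m, t, c]: counter c on machine m at time t

# Naive baseline: at each time point, measure how far each machine is from the
# cross-machine median, then average over the day. The tests described below
# (sign test, Tukey, LOF, PCA) replace this score with statistically grounded
# ones that also bound the false positive rate.
median = np.median(data, axis=0, keepdims=True)             # (1, T, C)
per_time_deviation = np.linalg.norm(data - median, axis=2)  # (M, T)
score = per_time_deviation.mean(axis=1)                     # one score per machine
suspects = np.argsort(score)[::-1][:5]                      # five most deviant machines
```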
So let's talk about explicit assumptions. In the systems that we were testing we had homogeneous hardware. Basically, today you buy a container of machines; they all have mostly the same hardware configuration. And in any data center in general, and it's still true today for people who still use physical machines, you don't just take any machine you find on the street and put it in your data center. You have logistics, you cannot just buy whatever you want, so you tend to have very large groups of machines that are very similar. I'm also going to assume that the majority of machines are okay, because if 50 percent of the machines in your data center are not okay, you have bigger fish to fry, right? So let's assume most machines are fine: the software is fine, the hardware is mostly fine. And let's assume that the workload is scaled out and load balanced across machines on average. That is, there is a load balancer, and maybe if you look at every specific second the load is not balanced, but if you look across an entire day the load balancer does a decent job. If a machine was too busy in the last hour, maybe next time the load balancer will allocate the job to the least busy machine, so on average it achieves this sort of load balancing.
This leads to the central assumption that we have in this entire method: you can compare the raw metrics, if you do it smartly. If you look at the metrics of the healthy machines, which are the majority, they will have similar behavior. And if you look at a faulty machine, it is consistently different. What do I mean? First of all, it's different: it's faulty, so its measurements are somehow different, maybe it has a high CPU usage or a low disk rate because the disk is failing. And it's also consistently different, because there is a fault there and the fault is always the same. It's not that one machine has a high temperature in one minute and then the next minute the disk is slow instead. You might have one fault or more than one, but the faults are consistent. So the idea is simply to compare raw metrics in a smart way and aggregate across time to reduce noise.
So this is one example of one way to do it. Let's say I have a healthy machine and I look at all its neighbors, basically all the other machines. We do some scaling and selection beforehand, but I don't have time to get into it. I look at one point in time and say: this is the measurement for the machine of interest, these are all the other machines, and I look at the average direction from the machine of interest to the others. It's a C-dimensional direction, but here we have two dimensions, and this is the average direction. Then you wait five minutes and do the same thing again, and now the direction is up. Why? Because there's a load balancer and there's noise, so maybe this machine was busy before and now it's a bit less busy; maybe before it had a hard request and now it has an easy request. You wait five more minutes, do the same thing again, and the direction is like this. All these variations are random variations due to noise, due to whatever the load balancer is doing; it's doing its job. So these random noise variations will cancel each other out. If you add this average and this average and this average together, you get a short vector, because sometimes it's up, sometimes it's down, sometimes left, sometimes right. You average it, you get a short vector.
Now let’s look at a faulty machine. It's consistently different because, like I said, maybe the
hard drive is always slow because it's failing. So across some of the measurements that we
have it will always be let's say lower. So it’s consistently lower, the direction is always like this,
maybe it's a little to the side. There is still some noise but there will be some difference that is
reinforced as you average across the time because it's always low or always high.
We call it a sign test because it's basically an extension of the classical sign test, but we average the directions. I'm looking at machine i and I want to tell you whether it is an outlier or not. We take the data, compute the average direction now, then do the same thing across T times and add all these directions together, or average them out. Based on the length of the vector that we get, we can bound the probability that machine i is an outlier or not. This helps us limit the false positive rate: if the probability that machine i is not an outlier is very low, say below one percent, I declare it an outlier. This is how I can test for outliers. I'm going to skip over the details: basically we use concentration bounds to derive these probabilities, because the test statistics are bounded. We also have different tests. That was the sign test; we have tests based on the [indiscernible] and on LOF, the local outlier factor, but I'm not going to get into this. All I want to say is that depending on your system we can adjust the test, and we will use this flexibility later.
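As a rough illustration, here is a minimal sketch of the sign-test idea as described above; the preliminary scaling and selection step is omitted, and the function name and thresholding are mine, so treat this as a sketch of the idea rather than the published algorithm.

```python
import numpy as np

def sign_test_scores(data):
    """Sign-test style outlier scores.

    data: array of shape (M, T, C) with M machines, T time points, C counters,
          assumed pre-scaled so counters are comparable across machines.
    Returns one score per machine: the length of the time-averaged direction
    from that machine to the mean of the other machines. For healthy machines
    the per-time directions are random and cancel out (short vector); for a
    consistently different machine they reinforce (long vector).
    """
    M, T, C = data.shape
    scores = np.empty(M)
    for m in range(M):
        others = np.delete(data, m, axis=0)           # (M-1, T, C)
        diff = others.mean(axis=0) - data[m]          # direction at each time t
        norms = np.linalg.norm(diff, axis=1, keepdims=True)
        unit = np.divide(diff, norms, out=np.zeros_like(diff), where=norms > 0)
        scores[m] = np.linalg.norm(unit.mean(axis=0)) # length of averaged direction
    return scores

# A machine is flagged when its score exceeds a bound chosen (for example via a
# concentration inequality on the bounded statistic) so that the false positive
# rate stays below a target such as one percent.
```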
We evaluated this on a bunch of services at Microsoft. I'm going to talk about the main one, the largest data set that we tried, and it's fairly big: roughly 4500 machines. We looked at their data for 60 days; these are historical logs that Microsoft already collects, on Autopilot, for those who know the system. We said we want a low false positive rate, we did the preprocessing [indiscernible] and so on, and the main thing is that we validated these results using Autopilot's existing health logs. Autopilot has a system based on static thresholds, at least it used to; I don't know what it has now, but it used to have a system based on, as I mentioned, static thresholds that you might adjust, and it would issue alerts for the service. Again, with examples like something is too low or something is too high, it would issue an action and write a log. So we have 60 days of historical measurement data and also 60 days of historical health logs; it turns out some spots were missing, and we just compare the outputs of our outlier detection to the health logs. We know these health logs are not perfect, but this is what we have as [indiscernible].
So here are example results of the algorithm. What you see on the X axis of each of these squares is just time across the day. The Y axis is the counter value, and it doesn't really matter. We've got four different counters, each one a different metric, and I drew 8 of the machines that we compared, each as a line, one on top of the other, with the machine suspected to be faulty drawn as a black line. You can see that for this counter, whatever it is, the suspect machine consistently has higher values, while the other machines sit in this narrow band. So most machines, whatever this value is, have it concentrated in the same narrow band, but this one is higher. Maybe it's memory; maybe this machine has a memory leak or high latency. Here you see again that many machines concentrate in this narrow band, but the faulty machine's values really behave differently; they don't have this shape, and so on.
So first we did this. We said okay, let's look at machines that are known to fail. We selected a bunch of failures, and also a bunch of machines that we estimate contained no failures, because they had no warning at all in the health logs, so they were probably really fine. We went back in time from the point where the failure occurred and ran our detector before a failure that we knew occurred. It turns out that you can detect at least 20 percent of failures in advance with basically no false positives, just by doing the detection. So even if you don't like our detector or don't know how it works, it does mean there is a potential for predicting failures here, which is at least 20 percent. We were very conservative; we probably could have set the false positive rate higher than the two percent we used.
We also looked at the detection performance. We said okay, given that we've detected that a machine is an outlier, what is the chance that there will be a software or hardware failure there? Obviously, the further you look ahead, the higher the chance of some failure on that machine. So I'm going to show you: I do the test today, and one day after the test, what fraction of the machines that I flagged as outliers have failed? That is precision at one day; then precision at two days or less, and so on and so forth, up to 15 days or less. The line below with the squares, the lower curve, is basically random: if I take any machine whatsoever today, what is the chance that it will fail in the next two weeks? Around 20 percent. When I say machine failure I don't necessarily mean a hardware failure; it might just be a software failure, and then you don't have to do anything except restart the service. So this is why there appear to be many failures there. Yeah.
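For clarity, here is a minimal sketch of how such a precision-at-horizon number can be computed from the flagged machines and the health logs; the function and data names are hypothetical, not from the paper.

```python
from datetime import datetime, timedelta

def precision_at_horizon(flagged, failure_times, test_day, horizon_days):
    """Fraction of flagged machines that fail within horizon_days of the test.

    flagged: set of machine ids the detector marked as outliers on test_day.
    failure_times: dict mapping machine id -> list of datetimes of known
                   failures (taken from the health logs).
    """
    if not flagged:
        return float("nan")
    deadline = test_day + timedelta(days=horizon_days)
    hits = sum(
        1 for m in flagged
        if any(test_day < t <= deadline for t in failure_times.get(m, []))
    )
    return hits / len(flagged)

# Example usage with made-up data:
failures = {"m1": [datetime(2011, 3, 4)], "m2": []}
print(precision_at_horizon({"m1", "m2"}, failures, datetime(2011, 3, 1), 14))  # 0.5
```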
>>: [inaudible]? In your experiment it looks like you're not taking into consideration that
workloads will change over time, right? [inaudible] because your workload is the same
[inaudible].
>> Moshe Gabel: So our assumption is that the workload is the same for all machines at the same point in time, but if you wait a day, or wait even 10 minutes, there might be a different workload; it's still balanced across machines.
>>: But your experiment [inaudible] the index itself, the whole architecture is due to [inaudible] hardware. So they do precisely that. They might have the same workloads with the exact same hardware all the time.
>> Moshe Gabel: But it's not necessarily the same workload though.
>>: If it’s that same part of an index you should all start on that same hardware.
>> Moshe Gabel: So I don't really know if that's what they really did.
>>: We took a single type, so all these machines are the same type of hardware and we tracked
just this type of hardware [inaudible]. And the idea is we don't make the assumption that the
workload is consistent across time or even that the software is consistent because it could be a
deployment of a new [inaudible]. So the assumption is that at this point in time when I’m
looking today at 12 o'clock all machines should be serving the same kind of workload with the
same version of software.
>>: But I'm saying, like in the index example [inaudible], different machines in the index cluster will serve different workloads at the same point in time, but the same machine will serve similar workloads over the course of several days.
>>: That's exactly the assumption that we make, and this is why we aggregate [inaudible]. It could be that at this point in time this machine is serving this request and that machine is serving a different request, and some requests, for example, will use only the small index and another request will require going to a bigger index, and therefore the signature that you'll get is going to be different; but this will cancel out when you look across time, because the fact that this machine got the easier query this time means next time it might get the harder one.
>>: That's what I'm saying. You don't have [inaudible] the same machine [inaudible].
>> Moshe Gabel: So let me answer that. You might have a static load balancer; we did have another system that we tested where the load balancing was based on a hash of the client, so what you got was that some machines always had a huge workload and some machines had a very small workload. I have a different extension of this work for that case, but here we assume dynamic load balancing. So the case that you mention will not happen, and in the service that we tested we didn't see it; it didn't happen, it was dynamically balanced. I don't know whether it was the index or the reverse index, it was a while ago, but that didn't happen. Now I get your question; wait five minutes and I'll get to the other one.
>>: So by taking the mean you basically cancel out, does the variance of this thing have any
effect in your confidence, looking at the mean right now you can say the variance [inaudible] .
>> Moshe Gabel: Do I consider the variance in the confidence? Let's have a look.
>>: We use first order statistics [inaudible].
>> Moshe Gabel: Basically these functions have a limited range, so we don't care; we don't look at it. We also had a look at the calendar: this looks at performance per day across
60 days. One graph is precision, the other is recall and so on, and what we wanted to see is
what happened if there's a variation in the service. So, for example, we looked at precision. It
was between 60 and 80 percent. That's good except we had these strange spikes where the
precision [indiscernible] suddenly dropped and it turned out later as we looked for different
types of logs in autopilot that there were rolling service updates there and we didn't know. So
when you do rolling service updates what you have is some machines are being taken down,
some machines are being upgraded to a different version of the service, and it can take a while.
So what you can see is that performance dropped for a couple of days and then it went back up
again until the next update. So the system, we didn't even know, but the system automatically
adjusted to a new version of the software without having to tune any parameter. We also saw
that there's no strong variation in the performance based on any weekly variations in workload
or any changes at all.
>>: Can you explain a little bit more about what you mean by precision dropping? Is that
[inaudible] outliers were not really outliers?
>> Moshe Gabel: Yes. So what is precision? I'm testing a machine and my detector says it's an outlier. What I'm going to do is look at what happened to this machine in the next, let's say, 14 days. If I have some information in the health logs that there was some problem, I say this is a true positive: I found a problem and it was a problem, right? If not, it's a false positive. Precision is simply the probability that there really was a problem, given that I said there was. And I tested this every day.
>>: So just to be clear, the service updates are not run in a homogenous fashion or at the same
time across the machines you were testing because it's rolling.
>> Moshe Gabel: Yeah. But it's rolling for two days; it's rolled out across the entire-
>>: Yeah, but that's exactly what you're saying. So the main assumption is that all the machines are homogenous; they're doing the same task with the same version. When you change the service [inaudible] this breaks, and therefore you see this, because now [inaudible].
>> Moshe Gabel: We're finding outliers.
>>: [inaudible] tell you what role this machine says. I couldn't imagine that the user has a
whole bunch of machines that are not [inaudible].
>> Moshe Gabel: Of course. In the real system you would not test machines that you know are
being upgraded. We didn't have that.
>>: We assume that all of them are the same [inaudible] periods of time they were not.
>> Moshe Gabel: This is taking too much time; I'm here later, so I'll answer any questions. I'm going to skip over that. Basically the results here were pretty good. For the 14-days-ahead prediction the precision was 70 percent; the recall was lower, around 20 to 25 percent, which just means that you cannot predict every failure in advance, but it does mean that you can be fairly confident that a machine we flagged as an outlier really does have a problem.
One problem with that method was that it was not suitable if communication is a problem: if you have a geographically distributed system, you want to reduce communication. So there is a follow-up, which I will not talk about in detail, where we used a bunch of techniques from the world of streaming and distributed streaming: sketching, which most people have probably heard of, and something called geometric monitoring, or safe zones, which allows us to monitor without actually sending any data. Basically, what we wanted was a 10 times reduction in communication, in the size and number of messages, without harming the detection error. This is work we presented at IPDPS; it also has an interesting monitor for distributed variance, which is something new. I'm not going to get too much into it. We used random hyperplane projection, and we used [inaudible] to limit changes in the direction: I mentioned before that we have this average direction, and you can reduce the data size without harming that direction.
Like I said, safe zones, I'm not going to get into it. Basically we achieved our goal, which was to reduce communication by 90 percent without harming the detection too much. It depends on the data, of course, but you can get a 90 percent communication reduction with only a very small change in classification, around one percent. I'm going to skip over this.
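To give the flavor of the sketching side, here is a minimal sketch of a random hyperplane (sign) sketch, which lets machines compare directions by exchanging a few bits instead of all C counter values; this is a generic illustration of the technique, not the exact protocol from the IPDPS paper.

```python
import numpy as np

def hyperplane_sketch(x, planes):
    """Compress a C-dimensional vector into k sign bits.

    planes: (k, C) matrix of random hyperplane normals shared by all machines
    (for example generated from a common seed, so it never has to be sent).
    Two vectors' sketches agree on roughly (1 - angle/pi) of their bits, so
    comparing k bits approximates comparing directions while transmitting far
    less than the full C counters.
    """
    return (planes @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(42)
C, k = 200, 32                     # 200 counters compressed to 32 bits
planes = rng.normal(size=(k, C))
a, b = rng.normal(size=C), rng.normal(size=C)
agreement = np.mean(hyperplane_sketch(a, planes) == hyperplane_sketch(b, planes))
print(agreement)                   # about 0.5 for unrelated random directions
```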
This was a question that you asked: what happens if your load balancer is not dynamic? What happens if the machine that gets a query is determined by, say, the client ID, and some clients have very heavy queries and some have very light queries? Then some machines get a heavy load and some machines get a light load, even though they are all doing basically the same thing. What do you do?
This also happens with computation: if you're running Hadoop, it happens due to skewness in the key distribution. So I still want to assume homogeneous machines, and I still want to assume that the majority of machines are okay; but, like I said, I cannot assume dynamic load balancing. We are sort of running out of time. The previous work relied on comparing raw metrics: we had a dynamic load balancer, which meant that on average we could simply take the raw metrics, maybe scale them or do some selection which I didn't get into, and compare them between machines. Now we can't do that anymore, but we still believe that outlier detection is the way to go; we just need to make it smarter.
So the idea here is that instead of assuming different machines have the same metrics, we are going to assume that, because these machines run the same kind of code on the same hardware, and they might get more or less work to do but still do the same kind of work, there is some sort of dependency, let's say a linear correlation (or not so linear, as we will see later), between all the metrics that we are measuring on a machine. For example, let's say we have a service where each client request needs 10 megabytes of memory, three database transactions, and two percent of CPU, and we measure these things. This machine got three requests, this machine got four requests, and this machine got five requests. You look at the memory, the [indiscernible], and the CPU, and you can establish a simple linear rule: if I take the memory and divide it by 10, it should equal the number of requests, and the same for the database transactions divided by three and the CPU divided by two, so the difference is zero. This is a rule that we have, and we are going to assume that faults break the rule.
If you have a memory leak, you're using too much memory, so it's going to break the rule. If you have a CPU hog somewhere on your system, you have too much CPU usage. For example, one of these machines is an outlier: this machine. It only has three requests, but it uses way more memory than it should and way more CPU than it should. So this is the basic idea. If I plot the right thing, if I plot requests versus CPU, it's easy to see that this is an outlier, if I just knew what to plot. So the new assumption is that similar machines doing similar work don't have exactly similar behavior, but they have similar correlations or similar relationships, not necessarily [inaudible] correlations.
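As a toy illustration of this rule-breaking idea (the per-request costs and the numbers are made up for the example), the machine whose residuals are far from zero is the outlier:

```python
import numpy as np

# Toy example: each request should cost roughly 10 MB of memory,
# 3 DB transactions, and 2% CPU. Columns: requests, memory, transactions, cpu.
machines = np.array([
    [3, 30.0,  9,  6.0],   # healthy
    [4, 41.0, 12,  8.2],   # healthy (a little noise is fine)
    [5, 50.0, 15, 10.1],   # healthy
    [3, 95.0,  9, 14.0],   # outlier: too much memory and CPU for 3 requests
])

requests = machines[:, 0]
# Residuals of the assumed linear relationships; healthy machines are near 0.
residuals = np.stack([
    machines[:, 1] / 10 - requests,
    machines[:, 2] / 3  - requests,
    machines[:, 3] / 2  - requests,
], axis=1)
print(np.linalg.norm(residuals, axis=1))  # the largest norm points at the outlier
```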
So basically we're going to use principal component analysis. We're going to establish these linear correlations, these linear relationships, at every point in time, and we're going to use the same idea that we used before: we do this independently now, independently five minutes from now, independently 10 minutes from now. If a machine consistently breaks the rules, then it's an outlier; one machine breaking the rules a little bit once is just noise. So we use the same idea, and there are a couple of statistical approaches to limit the false positives.
We can use either the same method that we used before or something else. Those who know this, great; those who don't, it's a [indiscernible] decomposition. The idea is that we're going to represent the metrics of a machine as a combination of, let's say, a normal linear subspace and an abnormal subspace. I'm going to take a few dimensions; the data lies on a manifold, let's not get into the math. The majority of machines are fine, so I'm going to learn what is normal from the majority of machines, then take the complement and say this is the abnormal subspace. If a machine's data, projected onto the abnormal subspace, has a large presence there, it will be an outlier. And we use something called HR-PCA. Those who know PCA know that it's very sensitive to outliers, but there are newer robust approaches that are fairly fast and not so sensitive to outliers.
Illustrating this, this would be our normal subspace and this would be our abnormal subspace. If we project normal data on the abnormal subspace we get a very low projection, very low variance. If we take the abnormal data and project it on the abnormal subspace, we get high values, a very large projection. So how does this help us? You mentioned high load and low load. The idea is that there is this manifold or plane, or in the 2-D case a line, and depending on the number of requests the machine will sit on different parts of that line: if there's a low load it has few requests and sits on this part of the line, and if it has a high load it sits on this part, because it has many requests and so obviously more CPU usage, but it's still in the same normal subspace. So this handles the case where you have low loads and high loads.
No time for the math. Basically, again, it's the same approach. At every point in time I do the PCA separately, independently from other points in time, I project the data, and, very similar to the length of the vector we had previously, I get some sort of score. I check whether this score is consistently large: if the data is consistently abnormal, then the machine is abnormal. Same idea. You can either use the same scheme as in the latent fault detection, where we get a p-value and so on, or, and I'm not going to get into it, there is some work by Jackson and Mudholkar, for those who know it, that bounds this projection on the abnormal subspace. It's fairly well known.
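Here is a minimal sketch of that subspace scoring step using plain PCA; the actual method uses a robust variant (HR-PCA) and the Jackson and Mudholkar bound, so treat the simple squared residual below as a stand-in for those pieces.

```python
import numpy as np

def abnormal_scores(X, k):
    """Score machines by their residual outside the normal PCA subspace.

    X: (M, C) matrix, one row of (scaled) counters per machine at one time point.
    k: number of principal components treated as the 'normal' subspace.
    Plain PCA is used here for simplicity; a robust variant such as HR-PCA
    would be less sensitive to the very outliers we are trying to find.
    """
    Xc = X - X.mean(axis=0)
    # Principal directions from the SVD of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    normal = Vt[:k]                          # (k, C) basis of the normal subspace
    residual = Xc - Xc @ normal.T @ normal   # component in the abnormal subspace
    return np.linalg.norm(residual, axis=1) ** 2  # squared prediction error

# Per the method, these per-time-point scores are then aggregated across the
# day: a machine that is consistently abnormal is flagged as a latent fault.
```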
Here we didn't have the Microsoft logs. We only had logs from a supercomputer, and we didn't have schedule information, so we didn't know which jobs were running when. So we did some basically ad hoc grouping, trying to figure out from the load of a machine whether it was running, and we said, if two machines are running together they probably belong to the same job. It's a bit problematic. And we compared to the failure logs that we do get for the TSUBAME2 computer. So we started with the existing latent fault detector that we had, and those who read receiver operating characteristic curves know that this means our results are no better than a random guess. Outlier or no outlier, you might as well flip a coin on that kind of job.
Now, if we use the PCA variant, one of the PCA approaches, what we see is that while it's not so good, the closer this line gets to this point the better we are; so this is not great, but it's still much better than a random guess. We do know we had lots of problems with grouping: we didn't have schedule information. Normally if you have a supercomputer you know what's running on it, but due to confidentiality issues we couldn't get this information. So we know there are errors, not all machines in our comparisons do the same job, because we just didn't know which machines were doing what. So that's it for that.
However, recently I had the chance to try this on virtual machines, where we took a bunch of virtual machines and injected artificial faults, like injecting high CPU usage, and we tried it on several types of workloads, Cassandra and building the Linux kernel; these machines did similar or different workloads and we just compared them to each other. And what we saw is that, as before, our previous latent fault detectors, the Tukey, LOF, and sign tests, were okay or not okay, but the kernel PCA approach was much better. Again, it's sometimes much better depending on the workload and also on the false positive rate. These are just preliminary results, because we then switched to a different approach based on sparse coding, which basically drops the assumption that you have uniform hardware and is more tailored to the VM setting. Sadly, I think we are out of time and I don't want to belabor the point; I promised people that I would let them leave at eleven thirty. So thank you very much. We have a few minutes for questions, or more, because I'm here; I'm going to be here all week. You had some questions, I think. This is more recent work with a sparse decomposition approach; no time to get into it now.
>>: So the general approach, the theme you have taken in all this, is basically: at each point in time you run unconditional anomaly detection between the points of each machine, which is multi-dimensional, and then you average this anomaly score that you get over time.
>> Moshe Gabel: Yes. Exactly. And in the original work the idea was that the anomaly score is bounded, so we could use certain concentration inequalities. But it was also very important to us to use multidimensional approaches to anomaly detection, rather than what was more common, the single-dimensional approach.
>>: So potentially you could, for example, at each time slice just model the distribution of this multidimensional space and see which one falls outside it.
>> Moshe Gabel: Yes. But if you want to model it you have to make some parametric assumptions, like what distribution-
>>: For nonparametric distribution. [inaudible].
>> Moshe Gabel: I could do that, for example, yes. We did something even a little stronger, perhaps: the local outlier factor, which is kind of like a local density estimator. In our results it somehow doesn't give good results; we don't know why yet. Yeah.
>>: So [inaudible] correlation metric where you say if this request is going to use this amount of
memory how accurate is that?
>> Moshe Gabel: How accurate is that in terms of whether are you interested in finding the
rule or just interested in finding anomalies?
>>: Well, I mean, the thing is, accurate means: are you predicting the memory used for any particular [inaudible]?
>> Moshe Gabel: No, that's not even what I'm trying to do. I'm simply trying to establish some correlation between different variables and find the machines that break it; I'm not trying to use it for prediction. I'm not even sure, I guess I could try it, but we never thought about, let's say, hiding this one variable and trying to predict it. It's probably going to be very noisy and not very good.
>>: Right. I'm just wondering-
>> Moshe Gabel: It's like linear regression.
>>: Because the fact that two requests are different sizes doesn't necessarily mean that when you look at the system the memory usage will be different. And one of the reasons is that the system itself allocates memory in chunks.
>> Moshe Gabel: Of course. In this case it means that this variable doesn't have much presence, so the PCA will just not assign it a high loading, so it's no problem. One problem that you do have is if you have two different types of requests and it's not one manifold but two; then PCA cannot do much with it, and you need to use some sort of subspace clustering, but we haven't done this.
>>: And then also a more general question: it seems like another fundamental assumption is that you have counters that can measure a component that matters.
>> Moshe Gabel: Yes. So I glossed over this. The idea was: we always said measure as much as you can, with the fairly reasonable assumption that when your programmers design a system and decide what should be measured, they do know something about it. But I did skip over a small part that tries to automatically discard counters that don't matter, because there are counters that don't matter; they don't tell us anything. For example, times [indiscernible]; apparently some people report it periodically for some reason. So we had a small procedure to remove those.
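As one possible illustration of such a filtering step, here is a minimal sketch that simply drops counters with essentially no variation across machines; the actual selection procedure used in the work may differ.

```python
import numpy as np

def drop_uninformative_counters(data, rel_tol=1e-6):
    """Drop counters that carry no information about differences between machines.

    data: (M, T, C) array of counters. A counter whose variance across machines
    is essentially zero at every time point (for example a constant that is
    reported periodically) cannot help distinguish an outlier machine, so we
    remove it. Returns the filtered data and the indices of the kept counters.
    """
    var_across_machines = data.var(axis=0)          # (T, C)
    scale = np.abs(data).mean(axis=(0, 1)) + 1e-12  # per-counter magnitude
    informative = var_across_machines.max(axis=0) > rel_tol * scale
    return data[:, :, informative], np.nonzero(informative)[0]
```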
>>: Maybe there's a slight twist to that, and it's something I'm struggling with now, which is basically incorrectly implemented counters: counters that give you bogus results.
>> Moshe Gabel: Interesting. Actually I never-
>>: We sort of assume that counters are correct, but they're implemented incorrectly.
>> Moshe Gabel: So our methods, I didn't talk about this, but our methods can handle missing data. Whether they can handle incorrect data: the HR-PCA procedure is somewhat robust, in theory it should be able to handle up to 10 percent outliers of any kind, it's a very robust procedure at least in theory. But if you have a counter that is always lying, I don't know. It would be very interesting. Possibly, if it's always lying and it's just noise, you could just discard it. But it's interesting; I never considered it.
>>: It's just something that I run into.
>> Moshe Gabel: That sounds very interesting, actually. You had some questions.
>>: It's not really a question, but I'm just trying to [inaudible] apply all this to a system that actually changes over time and is this [inaudible]. You have multiple tenants on the same [inaudible]. Should they apply a different [inaudible]?
>> Moshe Gabel: Are you measuring, is it a VM system based on multi-tenant-
>>: [inaudible] basically trying to see if you can actually take it and [inaudible].
>> Moshe Gabel: So for a multitenant system, I think, again, this is super early. We only tested it a bit on virtual machines, which also have interesting cases like noisy neighbors, for example, which is somewhat similar. The idea was that if we use a sparse decomposition approach, then you might be able to separate this data comes from this tenant and that data comes from that tenant, and then you would be able to say this is the dictionary for tenant A, this is the dictionary for tenant B, and I can still find outliers. That was one of the driving forces behind it, which I didn't get into: the idea that you no longer have all machines doing the same thing, but maybe machines are doing three different things, heavy requests and easy requests, or maybe tenant A is doing one thing and tenant B another. The idea was for it to still work in that case. Some early results say it probably could work: the same idea where you look for consistent outliers, but the independent per-time test that you do is a little bit different. We can talk about this more if you want. As is, it's probably not so good.
>>: [inaudible] looking for. So one thing you said precision drops significantly [inaudible]?
>> Moshe Gabel: If we knew in advance which machines were being rolled up and down and what was going on, then we could just remove them from the test. We just didn't know. So the problem is that we were entering machines into the test which all have the new version, and then we detect them as outliers; we were comparing apples to oranges.
>>: [inaudible] interesting questions and interesting challenges for us; maybe we can think about them. So how about taking it offline? Maybe we can meet and you can give more details on the setting, and we can think about whether this works, or maybe it's a new challenge for us. I think that would be [inaudible].
>>: We might be able to come up with some details [inaudible].
>>: Sure. We'll be happy to hear it and see if we can do anything. It's either going to be something that we can help you with, or you're going to help us with new questions.
>> Moshe Gabel: All right. Thank you. And like I said, all week I would love to meet with
people who are interested.
>>: So let's thank him.