>> Nikolaj Bjorner: Thanks. It's my pleasure to introduce Moshe Gabel from the Technion. Moshe was an intern here a few years ago and worked on latent fault detection. Since then he has taken the research in an interesting new direction and also done research on distributed monitoring. So without further ado.

>> Moshe Gabel: Hi. Thank you, Nikolaj. I'm Moshe Gabel from the Technion. In this talk I'm going to review some of the work we did in the past five years or so, which started here at Microsoft with [indiscernible], who's sitting over there. Basically we wanted to use the performance counters that Microsoft collects in its data centers to detect anomalies and perhaps even predict certain problems. So to start with, I hope all of you know that data centers are getting larger and more complex. If you look at the world of high-performance computing, they are increasingly using more complex components: more GPUs per machine, more cores per CPU and so on. And if you look at cloud services, if you get failures you sometimes get data loss, or maybe you just lose some dollars, or if your system is really doing well you're just losing electricity because you have a machine that's failing, but that's the best case you could have. In the case of high-performance computing, their applications typically run for hours; it's one big distributed computation, so failures mean they have to do a lot of check-pointing, and the more failures they have the more frequent the check-pointing, which means the utilization of the system is very low. And if they do have a failure they have to restart from the last checkpoint, which might have been 30 minutes ago, so they're losing a lot of time over this. So you get all these machines and all these components, and obviously you get more and more failures because the mean time between failures is decreasing. And in many data centers, if you go to people and ask how they monitor their machines, it's getting a little bit better, but still many people simply use some sort of manually set threshold. They say we know our system, the CPU usage should be between 60 and 80 and the latency of this [indiscernible] should be above something or there's a problem, and so on and so forth.

This leads to a sort of cycle of frustration. You have your support engineer, whom we will call Bob. Bob is very happy, his life is running very well, and then he starts getting false alarms at 3 AM: paging and e-mails and phone calls at 3 AM saying the system is coming down, we have all these alerts that we set with the manual thresholds, the latency is too low or the CPU usage is too high. And Bob gets up, logs in remotely and checks, and it's just some backup process that no one told Bob about, and it's fine, the system is running really well; it's just some noise or something that wasn't configured, maybe a new version was pushed and then the threshold shouldn't be set to 80 percent, it should be set to 82 percent. Okay. So this repeats itself a few times and then Bob is really tired and he's not happy. So what Bob does is relax the thresholds, right? He says okay, CPU should be between 60 percent and 85 percent. It's the new system; it's fine. He's happy again, he can sleep, and everything is good, except that a few days later there would have been an alert, but what you actually have is a service outage. Why?
Because the CPU usage was 83 percent and the threshold should have been set to 82 percent, because there was no backup and one machine has a low latency. Basically you relaxed the threshold too much and now you are missing true alarms. So you lose a lot of money, and Bob's boss is like, why? We had these alarms, I don't understand. And Bob says, well, we had all these false alarms. Well, now I want you to tighten these thresholds. So this is how it goes, and Bob is very unhappy, and we want to solve it.

So usually we ask people and they tell us, well, what about all this machine learning stuff that people are doing these days, supervised learning and so on, why not learn from past failures? You have a failure, you have some log, and maybe you can predict this failure. Maybe you can learn what the threshold is. And sometimes it works, but there are some inherent problems with this idea, because if you're in the world of cloud services, which is where our first work was done, you have a changing workload and you have changing software. Sometimes you even have changing hardware, though not often, but the software: some companies push a new version of the software a few times every day, and even if you do it once every week you push a new version of your software, or your workload has changed because some event happened and everyone is going to your service, or you're scaling up because you had an article about you in some newspaper. Basically, now you look at the logs, and the logs used to say the threshold should be this and that, and that's no longer true because the software has changed or your service has changed or your workload has changed. And another problem: you say, let's go to the experts and they will tell us from the logs what the threshold should be. That's really hard. If you go to support engineers and ask them to please look at all these giant logs and tell you whether the system behavior is correct or incorrect (and you have to do this for logs from a few hundred machines over a few days), it will take them many days, it's not very interesting, and it's precious time of their busy life. And then you tell them, you know what, the system has changed and we want you to do this again, because these old logs are not for the new system, but don't worry, we have these new logs from the new system; and no, they are not going to do it. So you want something very flexible and adaptive.

So there are several challenges. As I said, the system evolves, and that gets you into trouble if you try to model from historical data. I'm not going to talk a lot about this, but some systems are very big and very distributed, and communication can have major costs there, because even though in theory it's a data center, some systems don't run in highly connected data centers, and in other systems you don't want to overload the network with lots of data, so centralized methods are limited. I'm not going to discuss this a lot, we don't have time. And also, like I said, your data is sometimes unstructured, you have a lot of missing labels, you need to be very careful with supervised methods. And what people are now doing, and some people did this at Yahoo, there's a group at Microsoft who started doing this, and there's this work that we did a while ago although no one's using it as far as I know, is basically starting to use unsupervised approaches.
They're starting to say: we don't want labels, we don't know labels. We're basically going to collect a lot of performance metrics like CPU usage and memory usage and disk latency and maybe temperature or the queue length of requests for whatever service we are monitoring, we're going to do some clever outlier detection, and we are going to really, really hope that if we see an outlier machine then that means it has a true anomaly, it has a problem. Spoiler alert: usually no. You have a lot of spurious anomalies, as I'm sure anyone who's tried it on a real system knows, and you want to control this problem.

So we might call these anomalies latent faults. These are subtle performance anomalies; they indicate either some subtle problem right now, maybe a misconfiguration, a slow drive, some sort of database corruption, or someone stepped on the network cable so you get more errors on it; or something that could eventually result in a fault. Maybe there's a memory leak: the system is still running, but the leak is growing and in the end the system will crash. These can be caused by hardware failures, software bugs, misconfigurations. And they usually fly under the radar. No one monitors for this. Usually they are too subtle to monitor, and you catch them either when it's too late or when the signal is really, really strong. Like, if you have a monitor on memory and you have a memory leak, you might say okay, memory has to be between this and that, but the leak is small, so it takes a long while. It would be nice if we could detect the leak before. And of course, like I said, not every outlier is actually a bad anomaly. Sometimes you just get noise. So you say anomaly detection, but the questions are: how, when, what kind of outliers are we looking for? How are we going to set the threshold for the anomaly detector? How am I going to verify that the outliers the anomaly detector finds are true anomalies? And so on. There are a lot of questions in there.

And this leads me to our first work with Nikolaj and Ran and my advisor Assaf Schuster. This is work we published at DSN 2012, Latent Fault Detection in Large Scale Services. So to break it down, this is the end result. We wanted to find latent faults in cloud services which are load balanced and have uniform hardware and uniform software. We managed to, I say predict failures, basically we managed to find outliers, and it turned out that these tend to become failures, with fairly high precision and up to 14 days in advance. We just find an outlier and look ahead, using historical logs, to see what's going on. We saw that if you look at true failures from the existing health monitoring system and you go back a little, maybe a day, maybe a couple of days, then at least 20 percent of these failures you could have predicted before, with very high precision. And our latent fault detector is basically a kind of outlier detector and it's very easy to use. It's designed to need no tuning and no domain knowledge. We don't care what you're monitoring or what the system does, as long as certain assumptions hold. And it's fairly practical: it's easy to use, it limits the rate of false positives, so you can control it; you can say I want a two percent rate of false positives, or a 10 percent rate. And it's adaptive. It adapts to service changes automatically.
So what do we want to do? We have M multivariate time series from M machines. Each machine gives us C measurements; I mentioned some examples before, hardware, software, temperature, number of threads and so on. And I'm going to assume that this is a scale-out, load-balanced system, some sort of web service perhaps, which handles lots of requests per second or per minute. So these are the measurements from machine one. We have T time points, because let's say I collect measurements and I'm looking at the last T measurements, or T measurements per day. And we have C types of measurements, like I said, CPU or [indiscernible] and so on. This is machine 2, and we have M such machines, and maybe there's an outlier machine where some of these measurements are too low or too high; there's something there that is not correct. We want to find this outlier. That is what we want to do: find this bad machine among all these machines, and it's outlier detection. We are not going to do any labeled learning, we are not going to learn some model of the correct behavior; we are going to do outlier detection on these multivariate time series.

So let's talk about explicit assumptions. In the systems we were testing we had homogeneous hardware. Basically you buy a container of machines (today you literally buy a container of machines) and they mostly all have the same hardware configuration. And in any data center in general, and it's still true today for people who use physical machines, you don't just take any machine that you find on the street and put it in your data center. You have logistics, you cannot just buy whatever you want, so you tend to have very large groups of machines that are very similar. I'm going to assume that the majority of machines are okay, because if 50 percent of your machines in the data center are not okay, you have bigger fish to fry, right? So let's assume most machines are fine: the software is fine, the hardware is mostly fine. And let's assume that the workload is scaled out and load balanced across machines on average. That is, there is a load balancer, and if you look at every specific second of course the load is not balanced, but if you look across an entire day the load balancer does a decent job. If a machine was too busy in the last hour, maybe next time the load balancer will allocate the job to the least busy machine, so on average it's doable to achieve this sort of load balancing.

This leads to the central assumption we have in this entire method: you can compare the raw metrics, if you do it smartly. If you look at healthy machines, which are the majority, their metrics will have similar behavior; and if you look at a faulty machine, its metrics are consistently different. What do I mean? First of all they're different: the machine is faulty, so its measurements are somehow different, maybe high CPU usage or a low disk rate because the disk is failing. And it's also consistently different, because there is a fault there and the fault is always the same. It's not that one machine has a high temperature in one minute and the next minute actually the disk is slow. You have one fault, maybe more than one, but you have consistent faults. So the idea is simply to compare raw metrics in a smart way and aggregate across time to reduce noise. So this is one example of one way to do it.
Let's say I have a healthy machine m and I look at all its neighbors, basically all the other machines. We do some scaling and selection beforehand which I don't have time to get into, and I'm going to say: let's look at this point in time. This is the measurement for the machine of interest, these are all the other machines, and I look at the average direction. It's a C-dimensional direction, but here we have two dimensions, and this is the average direction. And then you say okay, let's wait five minutes and do the same thing again. Now the direction is up. Why? Because there's a load balancer and there's noise, so maybe this machine was busy before and now it's a bit less busy; maybe before it had a hard request and now it has an easy request. You wait five more minutes and do the same thing again, and the direction is different again. Basically, all these variations are random variations due to noise, due to whatever the load balancer is doing. It's doing its job. So you will have these random noise variations which cancel each other out: if you add this average and this average and this average together you get a short vector, because sometimes it's up, sometimes it's down, sometimes left, sometimes right. You average it, you get a short vector.

Now let's look at a faulty machine. It's consistently different because, like I said, maybe the hard drive is always slow because it's failing. So across some of the measurements that we have it will always be, let's say, lower. It's consistently lower, the direction is always like this, maybe a little to the side. There is still some noise, but there will be some difference that is reinforced as you average across time, because it's always low or always high. We call it a sign test because it's basically an extension of the classical sign test. We average the direction: I'm looking at machine i and I want to tell whether it is an outlier or not, so we take the data, compute the average direction now, do the same thing across T times, and add all these directions together, or average them. Based on the length of the vector that we get, we can bound the probability that machine i is an outlier or not. This helps us limit the false positive rate: if the probability that series i is not an outlier is too low, say below one percent, I declare it an outlier. This is how I can test for outliers. I'm going to skip over the details. Basically we use concentration bounds to derive these probabilities, because the test statistics are bounded, and we also have different tests. That was the sign test; we have tests based on the [indiscernible] and on LOF, the local outlier factor, but I'm not going to get into this. All I want to say is that depending on your system we can adjust the test, and we will use this possibility later. We evaluated this on a bunch of services that exist at Microsoft. I'm going to talk about the main one, the largest data set that we tried, and it's fairly big: roughly 4,500 machines. We looked at their data for 60 days; these are historical logs that Microsoft already collects. It's Autopilot, for those who know the system.
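To make the averaged-direction idea above concrete, here is a minimal sketch in Python. It assumes the counters are already arranged as an array of shape (T, M, C) and already scaled to comparable ranges (the talk mentions a scaling and counter-selection step that is skipped here); the exact statistic and the concentration bound from the paper are not reproduced, only the idea of averaging per-time directions over time so that random load-balancer noise cancels and a consistent difference does not.

```python
import numpy as np

def sign_test_scores(X):
    """Average-direction ("sign test" style) outlier score per machine.

    X: array of shape (T, M, C): T time points, M machines, C counters,
    assumed already scaled to comparable ranges.
    Returns M scores; a longer averaged vector means the machine is more
    consistently different from its peers.
    """
    T, M, C = X.shape
    scores = np.zeros(M)
    for m in range(M):
        others = np.delete(X, m, axis=1)               # (T, M-1, C)
        # Unit direction from each other machine to machine m at each time
        diffs = X[:, m:m+1, :] - others                # (T, M-1, C)
        norms = np.linalg.norm(diffs, axis=2, keepdims=True)
        units = np.divide(diffs, norms, out=np.zeros_like(diffs), where=norms > 0)
        per_time = units.mean(axis=1)                  # average direction at each time
        scores[m] = np.linalg.norm(per_time.mean(axis=0))  # average over time, take length
    return scores

# Toy example: 288 five-minute samples, 50 machines, 12 counters;
# machine 7 is consistently high on counter 3, so it should rank highest.
rng = np.random.default_rng(0)
X = rng.normal(size=(288, 50, 12))
X[:, 7, 3] += 1.5
print(np.argsort(sign_test_scores(X))[-3:])
```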
And we said okay, we want to have a low false positive rate, and we did the processing [indiscernible] and so on, and the main thing is we validated these results using Autopilot's existing health logs. Autopilot has a system based on static thresholds, at least it used to (I don't know what it has now): thresholds that you might adjust, and it would issue alerts for the service. Again, with examples like something is too low or something is too high, it would issue an action and it would write a log. So we said okay, we have 60 days of historical measurement data and also 60 days of historical health logs (it turns out some spots were missing), and we'll just compare the output of our outlier detection to the health logs. We know these health logs are not perfect, but this is what we have as [indiscernible].

So here are example results of the algorithm. What you see on the X axis of each of these squares is just time across the day. The Y axis is the counter value, and it doesn't really matter. We've got four different counters, each one a different metric, and I drew 8 of the machines that we compared, each as a line one on top of the other, but the faulty machine I drew with a black line. You can see the machine suspected to be faulty: for example, whatever this counter is, this machine consistently has higher values, while the other machines are in this narrow band. Most machines, whatever this value is, have it concentrated in the same narrow band, but this one is higher. Maybe it's memory, maybe this machine has a memory leak or high latency. Here you see it again: many machines are concentrated in this narrow band, but the faulty machine's values really behave differently, they don't have this shape, and so on.

So first we did this: we said okay, let's look at machines that are known to fail. We selected a bunch of failures and also a bunch of machines that we estimate contained no failures, because they had no warning at all in the health logs, so we said they were probably really good. We went back in time from the point where the failure occurred and ran our detector before a failure that we knew occurred. It turns out that you can detect, with basically no false positives, at least 20 percent of failures in advance just by doing the detection. So that means that even if you don't like our detector or don't know how it works, there is a potential for predicting failures here, which is at least 20 percent. We were very conservative; we probably could have set the false positive rate higher, at two percent. Then we looked at the detection performance. We said okay, given that we've detected that a machine is an outlier, what is the chance that there will be a software or hardware failure there? Obviously the further you look ahead, the higher the chance you will have some failure on that machine. So I'm going to show you: I do a test today and look one day after the test, and ask what fraction of the machines I said are outliers have failed. That is basically precision at one day or less, at two days or less, and so on and so forth up to 15 days or less. The lower line, with the squares, is basically random: let's say I take any machine whatsoever today, what is the chance that it will fail in the next two weeks?
Around 20 percent. And when I say a machine failure I don't necessarily mean a hardware failure. It might just be a software failure, and then you don't have to do anything except restart the service. That is why there appear to be many failures there. Yeah.

>>: [inaudible]? In your experiment it looks like you're not taking into consideration that workloads will change over time, right? [inaudible] because your workload is the same [inaudible].

>> Moshe Gabel: So our assumption is that the workload is the same for all machines at the same point in time, but if you wait a day, or even 10 minutes, there might be a different workload; it's still balanced across...

>>: But your experiment [inaudible] the index itself, the whole architecture is due to [inaudible] hardware. So they do precisely that. They might have the same workloads with the exact same hardware all the time.

>> Moshe Gabel: But it's not necessarily the same workload, though.

>>: If it's the same part of an index it should always start on the same hardware.

>> Moshe Gabel: So I don't really know if that's what they really did.

>>: We took a single type, so all these machines are the same type of hardware, and we tracked just this type of hardware [inaudible]. And the idea is we don't make the assumption that the workload is consistent across time, or even that the software is consistent, because it could be a deployment of a new [inaudible]. The assumption is that at this point in time, when I'm looking today at 12 o'clock, all machines should be serving the same kind of workload with the same version of the software.

>>: But I'm saying, like in the index example [inaudible], different machines in the index cluster will serve different workloads at the same point in time, but the same machine will serve similar workloads over the course of several days.

>>: That's exactly the assumption that we make, and this is why we aggregate [inaudible]. It could be that at this point in time this machine is serving this request and that machine is serving a different request; some request, for example, will use only the small index and another request will require going to a bigger index, and therefore the signature that you'll get is going to be different. But this will cancel out when you look across time, because the machine that got the easier query this time might get the harder one next time.

>>: That's what I'm saying. You don't have [inaudible] the same machine [inaudible].

>> Moshe Gabel: So let me answer that. You might have a static load balancer; we did have another system that we tested where the load-balancing was based on a hash of the client, so some machines always had a huge workload and some machines had a very small workload. I have a different extension of this work for that case, but here we assume dynamic load-balancing. So the case you mention will not happen, and in the service that we tested we didn't see it; it was dynamically balanced. I don't know whether it was the index or the reverse index, it was a while ago, but that didn't happen. Now I get your question; wait five minutes and I'll get to the other one.

>>: So by taking the mean you basically cancel out; does the variance of this thing have any effect on your confidence? Looking at the mean right now, you can say the variance [inaudible].

>> Moshe Gabel: Do I consider the variance in the confidence?
Let's have a look.

>>: We use first order statistics [inaudible].

>> Moshe Gabel: Basically these functions have a limited range and we don't care; we don't look at it. We also looked at a calendar view like this, which shows performance per day across the 60 days. One graph is precision, the other is recall and so on, and what we wanted to see is what happens if there's a variation in the service. So, for example, we looked at precision. It was between 60 and 80 percent, which is good, except we had these strange spikes where the precision suddenly dropped, and it turned out later, as we looked at different types of logs in Autopilot, that there were rolling service updates there and we didn't know. When you do rolling service updates, some machines are being taken down, some machines are being upgraded to a different version of the service, and it can take a while. So what you can see is that performance dropped for a couple of days and then it went back up again until the next update. So the system, without us even knowing, automatically adjusted to a new version of the software without having to tune any parameter. We also saw that there's no strong variation in performance based on weekly variations in workload or any changes at all.

>>: Can you explain a little bit more about what you mean by precision dropping? Is that [inaudible] outliers were not really outliers?

>> Moshe Gabel: Yes. So what is precision? I'm testing a machine and my detector says it's an outlier. What I'm going to do is look at what happened to this machine in the next, let's say, 14 days. If I have some information in the health logs that there was some problem, I say this is a true positive: I found a problem and it was a problem, right? If not, then it's a false positive. Precision is simply the probability that there really was a problem, given that I flagged the machine. And I tested this every day.

>>: So just to be clear, the service updates are not run in a homogeneous fashion or at the same time across the machines you were testing, because it's rolling.

>> Moshe Gabel: Yeah. But it's rolling for two days; it's rolled for the entire...

>>: Yeah, but that's exactly what you're saying. The main assumption is that all the machines are homogeneous; they're doing the same task with the same version. When you change the service [inaudible] this breaks, and therefore you see this, because now [inaudible].

>> Moshe Gabel: We're finding outliers.

>>: [inaudible] tell you what role this machine serves. I couldn't imagine that the user has a whole bunch of machines that are not [inaudible].

>> Moshe Gabel: Of course. In a real system you would not test machines that you know are being upgraded. We just didn't have that information.

>>: We assumed that all of them are the same, and for [inaudible] periods of time they were not.

>> Moshe Gabel: This is too much; I'm here later, so I'll answer any questions. I'm going to skip over that. Basically the results here were pretty good. For the 14-days-ahead prediction the precision was 70 percent; the recall was lower, around 20 or 25 percent, which just means that you cannot predict every failure in advance, but it does mean that you can be fairly confident that a machine where we found an outlier really does have a problem. And one problem with this method was that it was not suitable if you have problems with communication; if you have a geographically distributed system you want to reduce communication.
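As a concrete illustration of the precision measure Moshe describes above (a flagged machine counts as a true positive if the health logs show a problem within some horizon after the test day), here is a small sketch. The data format and function name are illustrative assumptions, not from the talk or the paper.

```python
from datetime import date, timedelta

def precision_at_horizon(flagged, failures, horizon_days=14):
    """Precision of an outlier detector against health logs.

    flagged:  list of (machine_id, test_date) pairs the detector marked as outliers
    failures: dict machine_id -> list of failure dates from the health logs
    A flag is a true positive if that machine has any logged failure within
    `horizon_days` after the test date, as in the definition above.
    """
    if not flagged:
        return float("nan")
    hits = 0
    for machine, test_date in flagged:
        window_end = test_date + timedelta(days=horizon_days)
        hits += any(test_date < d <= window_end for d in failures.get(machine, []))
    return hits / len(flagged)

# Hypothetical example, and the precision-vs-horizon curve described in the talk:
# flagged  = [("m17", date(2012, 3, 1)), ("m42", date(2012, 3, 1))]
# failures = {"m17": [date(2012, 3, 9)]}
# precision_at_horizon(flagged, failures, 14)                    # -> 0.5
# curve = [precision_at_horizon(flagged, failures, k) for k in range(1, 16)]
```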
So there is follow-up work which I will not talk about, where we used a bunch of techniques from the world of streaming and distributed streams: sketching, which most people have probably heard of, and something called geometric monitoring, or safe zones, which allows us to monitor without actually sending any data. Basically what we wanted was a 10-times communication reduction, in the size of messages or the number of messages, without harming the detection error. This is work that we presented at IPDPS; it also has an interesting monitor for distributed variance, which is something new. I'm not going to get too much into it. We used random hyperplane projections, and we used [inaudible] to limit changes in the direction, because I mentioned before that we have this average direction, and you can reduce the data size without harming this direction. And, like I said, safe zones, which I'm not going to get into. Basically we achieved our goal, which was to reduce communication by 90 percent without harming the detection too much. It depends on the data, of course, but you can get a 90 percent communication reduction with only a very small change in classification, around one percent. I'm going to skip over this.

Now, this was a question that you asked: what happens if your load balancer is not dynamic? What happens if the machine that gets a query is determined by, let's say, the client ID, and some clients have very heavy queries and some clients have very light queries, so some machines get a heavy load and some machines get a light load, even though they are all doing basically the same thing? What do you do? This also happens with computation; if you're running Hadoop this happens due to skew in the key distribution. So I still want to assume homogeneous machines, and I still want to assume that the majority of machines are okay, but, like I said, I cannot assume dynamic load-balancing. We are sort of running out of time. So the previous work relied on comparing raw metrics. We had a dynamic load balancer, which meant that on average we could simply take the raw metrics, maybe scale them or do some selection which I didn't get into, and compare them between machines. Now we can't do that anymore, but we still believe that outlier detection is the way to go; we just need to make it smarter. The idea here is that instead of assuming different machines have the same metrics, we are going to assume that, because these machines run the same kind of code (they might get more or less data, or more or less work to do, but they still do the same kind of work and they still have the same hardware), there is some sort of dependency, maybe even a linear correlation (or not so linear, as we'll see later), let's say linear but not necessarily pairwise: there are relationships between all the metrics that we are measuring for that machine. So, for example, let's say we have a service where each client request needs 10 megabytes of memory, three database transactions, and two percent of CPU. We measure these things. This machine got three requests, this machine got four requests, and this machine got five requests.
You look at the memory and you look at the [indiscernible] and CPU, and basically you can establish a simple linear rule: take the memory divided by 10, the database transactions divided by three, the CPU divided by two, each minus the number of requests, and it should come out to zero. This is the rule that we have, and we are going to assume that faults break the rule. If you have a memory leak, you're using too much memory, so it's going to break the rule. If you have a CPU hog somewhere on your system, you have too much CPU usage. For example, one of these machines is an outlier, and it's this one: it only has three requests, but it uses way too much memory and way too much CPU than it should. So this is the basic idea. If I plot the right thing, if I plot requests versus CPU, it's easy to see that this is an outlier, if I just knew what to plot. So the new assumption is that similar machines doing similar work don't have exactly similar behavior, but they have similar correlations, similar relationships; not [inaudible] correlations necessarily, but similar.

So basically we're going to use principal component analysis. We're going to establish these linear correlations, or linear relationships, at every point in time, and we're going to use the same idea as before: do this independently now, independently five minutes from now, independently 10 minutes from now. If a machine consistently breaks the rules, then it's an outlier; if a machine breaks the rules just a little bit once, that's noise. And there are a couple of statistical approaches to limit the false positives: we can use either the same method that we used before or something else. Those who know this, great; those who don't, it's the [indiscernible] decomposition. The idea is that we're going to represent the metrics of a machine as a combination of, let's say, a normal linear subspace and an abnormal subspace. I'm going to take a few dimensions; the data lies on a manifold, let's not get into the math. The majority of machines are fine, so I'm going to learn what is normal from the majority of machines, and then take the other part, the complement, and say this is the abnormal subspace. If a machine's data has a large presence in the abnormal subspace (I project its data onto the abnormal subspace), then it is an outlier. And we use something called HR-PCA. Those who know PCA know that it's very sensitive to outliers, but there are newer robust approaches that are fairly fast and not so sensitive to outliers. So, illustrating this: this would be our normal subspace and this would be our abnormal subspace. If we project normal data onto the abnormal subspace we get a very low projection, very low variance. If we take the abnormal data and project it onto the abnormal subspace we get high values, a very large projection. So how does this help us? You mentioned high load and low load. The idea is that there is this manifold, or plane, or in the 2-D case a line, and depending on the number of requests the machine will be on one part of the line or another: if there's a low load it has few requests and is on this part of the line, and if it has a high load it will be on that part of the line because it has many requests.
The machine gets a large load, so it will have more CPU usage obviously, but it's still in the same normal subspace. So this helps us handle the case where you have low loads and high loads. No time for the math. Basically, again, it's the same approach: at every point in time I do the PCA separately, independently from other points in time, I project the data, and, very similar to the length of the vector we had previously, I get some sort of score. I check whether this score is consistently large; if the data is consistently abnormal then the machine is abnormal. Same idea. You can either use the same approach as in the latent fault detection work, where we get a p-value and so on, or, and I'm not going to get into it, there is some work by Jackson and Mudholkar, which those who know this will recognize, that bounds this projection onto the abnormal subspace. It's fairly well known.

Here we didn't have the Microsoft logs; we only had logs for the supercomputer, and we didn't have schedule information, so we didn't know what jobs were running when. So we basically tried, ad hoc, to figure out from the load of a machine whether it's running a job, and we said if two machines are running together they probably belong to the same job. It's a bit problematic. And we compared to the failure logs that we do get for the TSUBAME2 computer. So we started with the existing latent fault detector that we had, and those who can read receiver operating characteristic curves know that this means our results are no better than a random guess: outlier or no outlier, you might as well flip a coin on that kind of data. Now if we use the PCA variant, one of the PCA approaches, what we see is that, while it's not so good (the closer this line gets to this point the better we are), it's still much better than a random guess. We do know we had lots of problems with the grouping. We didn't have schedule information. Normally if you have a supercomputer you know what's running on it, but due to issues of confidentiality we couldn't get this information. So we know there are errors: not all machines in our comparisons do the same job, because we just didn't know which machines are doing what, so we know there are problems there. So that's it for that.

However, recently I had the chance to try this on virtual machines, where we basically took a bunch of virtual machines and injected artificial faults, like high CPU usage, and we tried it on several types of workloads, like Cassandra and building the Linux kernel, and all of these machines did similar or different workloads and we just compared them to each other. What we saw is that, as before, our previous latent fault detectors, the Tukey, LOF and sign tests, were okay or not okay, but the kernel PCA approach was much better. Again, it depends on the workload, and also on the false positive rate, but it's sometimes much better. These are just preliminary results, because we have since switched to a different approach based on sparse coding, which basically eliminates the assumption that you have uniform hardware and is more tailored to the VM case. Sadly, I think we are out of time and I don't want to belabor the point; I promised people that I would let them leave at eleven thirty. So thank you very much. We have a few minutes for questions, or more, because I'm here; I'm going to be here all week. You had some questions, I think. This is more recent work with a sparse decomposition approach; no time to get into it now.
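Here is a minimal sketch of the normal/abnormal subspace idea described above. It uses plain PCA via SVD for brevity; the talk uses a robust variant (HR-PCA) precisely because ordinary PCA is sensitive to the outliers it is trying to find, and it thresholds the score with methods such as the Jackson and Mudholkar result mentioned above rather than just ranking machines. Shapes and parameter names are assumptions.

```python
import numpy as np

def abnormal_subspace_scores(X, k=3):
    """Residual ("abnormal subspace") scores, averaged over time.

    X: array of shape (T, M, C): T time points, M machines, C counters.
    k: number of principal components kept as the "normal" subspace (k <= C).
    Plain PCA is used here for brevity; the talk uses a robust variant (HR-PCA)
    so that the outliers themselves don't distort the learned subspace.
    Returns one score per machine; a consistently large score is suspicious.
    """
    T, M, C = X.shape
    scores = np.zeros(M)
    for t in range(T):
        Z = X[t] - X[t].mean(axis=0)            # center across machines at time t
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        normal = Vt[:k]                         # (k, C) basis of the normal subspace
        residual = Z - Z @ normal.T @ normal    # part of the data in the abnormal subspace
        scores += np.linalg.norm(residual, axis=1) ** 2
    return scores / T                           # average residual energy over time
```

In practice you would compare these averaged scores against a threshold derived from the desired false positive rate, as in the sign-test case, instead of simply ranking the machines.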
>>: So the general approach, the theme you have taken in all this, is basically that at each point in time you run unconditional anomaly detection between the points of each machine, which is multi-dimensional, and then you average this anomaly score that you get over time to...

>> Moshe Gabel: Yes, exactly. And in the original work the idea was that the anomaly score is bounded, so we could use certain concentration inequalities. But it was also very important to us to use multidimensional approaches to anomaly detection rather than what was more common, the single-dimensional approach.

>>: So potentially, for example, at each time slice you could just model the distribution of this multidimensional space and see which one of them is falling out.

>> Moshe Gabel: Yes, but if you want to model it you have to make some parametric assumptions, like what distribution...

>>: A nonparametric distribution. [inaudible].

>> Moshe Gabel: I could do that, for example, yes. We did something even a little stronger perhaps: the local outlier factor, which is kind of like a local density estimator. In our results it somehow doesn't give good results, and we don't understand why yet. Yeah.

>>: So [inaudible] correlation metric where you say this request is going to use this amount of memory: how accurate is that?

>> Moshe Gabel: Accurate in what sense? Are you interested in finding the rule or just in finding anomalies?

>>: Well, I mean, accurate means are you predicting the memory used for any particular [inaudible]?

>> Moshe Gabel: No, that's not even what I'm trying to do. I'm simply trying to establish some correlation between different variables and find the machines that break it; I'm not trying to use it for prediction. I'm not even sure, I guess I could try it, but we never thought about, let's say, hiding this one variable and trying to predict it. It's probably going to be very noisy and not very good.

>>: Right. I'm just wondering...

>> Moshe Gabel: It's like linear regression.

>>: Because the fact that two requests are different sizes doesn't necessarily mean that when you look at the system the memory usage will be different. And one of the reasons is that the system itself allocates memory in chunks.

>> Moshe Gabel: Of course. In that case it means that this variable doesn't have much presence, so the PCA will just not assign it a high loading, so it's no problem. One problem you do have is if there are two different types of requests, so it's not one manifold but two, and then PCA cannot do much with it and you need some sort of subspace clustering, but we haven't done this.

>>: And then a more general question: it seems like another fundamental assumption is that you have counters that can measure a component that matters.

>> Moshe Gabel: Yes. So I glossed over this. The idea was: we always said measure as much as you can, with the fairly reasonable assumption that when you design a system, your programmers, when they decide what to measure, do know something about it. But I did skip over a small part that tries to automatically discard the counters that don't matter, because there are counters that don't tell us anything. For example, times [indiscernible]; apparently some people report it periodically for some reason. So we had a small procedure to remove those.
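Moshe mentions a small procedure to automatically discard counters that carry no information, but the talk does not spell out how it works, so the filter below is only an illustrative guess: it drops counters that are nearly constant or mostly missing. The thresholds and function name are assumptions.

```python
import numpy as np

def select_counters(X, min_std=1e-6, max_missing=0.2):
    """Drop counters that carry no information.

    X: array of shape (T, M, C), possibly with NaNs for missing samples.
    Discards counters that are (nearly) constant across all machines and
    times, or that are missing too often. Illustrative only; the talk does
    not specify the actual selection procedure.
    Returns the indices of the counters kept.
    """
    T, M, C = X.shape
    keep = []
    for c in range(C):
        col = X[:, :, c]
        if np.isnan(col).mean() > max_missing:   # reported too rarely
            continue
        if np.nanstd(col) < min_std:             # essentially constant
            continue
        keep.append(c)
    return keep
```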
>>: Maybe there's a slight twist to that, and it's something I'm struggling with now, which is basically incorrectly implemented counters: counters that give you bogus results.

>> Moshe Gabel: Interesting. Actually I never...

>>: We always assume that counters are correct, but the way they're implemented they're not.

>> Moshe Gabel: So our methods, I didn't talk about this, but our methods can handle missing data; whether they can handle incorrect data, in theory the robust HR-PCA procedure should be able to handle up to 10 percent outliers of any kind, it's a very robust procedure, at least in theory, so okay. But if you have a counter that is always lying, I don't know. It would be very interesting. Possibly, if it's always lying and it's just noise, you could just discard it. But it's interesting. I never considered...

>>: It's just something that I ran into.

>> Moshe Gabel: That sounds very interesting actually. You had some questions.

>>: It's not really a question, but I'm just trying to [inaudible] apply all this to a system that actually changes over time and is this [inaudible]. You have, like, multiple tenants on the same [inaudible]. Should they apply a different [inaudible]?

>> Moshe Gabel: Are you measuring... it's a VM system based on multi-tenant...

>>: [inaudible] basically trying to see if you can actually take it and [inaudible].

>> Moshe Gabel: So for a multitenant system, I think, and again this is super early, we only tested it a bit on virtual machines, which also have interesting cases like noisy neighbors, which is somewhat similar. The idea was that if we use the sparse decomposition approach then you might be able to separate this: this data comes from this tenant, and that data comes from that tenant, and then you would be able to say this is the dictionary for tenant A, this is the dictionary for tenant B, and I can still find outliers. That was one of the driving forces behind it, which I didn't get into: the idea that you no longer have all machines doing the same thing, but maybe a few different things, machines serving heavy requests and easy requests, or maybe tenant A doing one thing and tenant B doing another. The idea was to still be able to work in that setting. Some early results say it could probably work. It's the same idea where you look for consistent outliers, but the test that you do at each point in time is a little bit different. We can talk about this more if you want. The method as is, probably not so good.

>>: [inaudible] looking for. So one thing you said: precision drops significantly [inaudible]?

>> Moshe Gabel: If we knew in advance which machines are being rolled up and down and what's going on, then we could just remove them from the test. We just didn't know. So the problem is that we are entering machines into the test which have the new version and then we detect them as outliers; we are comparing apples to oranges.

>>: [inaudible] interesting questions and interesting challenges for us, maybe we can think about. So how about taking it offline? Maybe we can meet and you can go into more details, the kind of setting, and we can think about whether this works, or maybe it's a new challenge for us. I think that would be [inaudible].

>>: We might be able to come up with some details [inaudible].

>>: Sure. We'd be happy to hear and see if we can have anything. Either it's going to be something that we can help you with, or you're going to help us with new questions.

>> Moshe Gabel: All right. Thank you.
And like I said, all week I would love to meet with people who are interested. >>: So let's thank him.