>> Misha Bilenko: Today we're hosting Kanishka Bhaduri from NASA Ames. Kanishka has been working in the area of distributed learning and data mining for quite some time, since his Ph.D. at the University of Maryland, Baltimore County, after which he was a researcher at TU Dortmund for some time. Then he moved to NASA Ames, where he's been working on interesting things since then. Today he'll tell us about algorithms for outlier detection. >> Kanishka Bhaduri: Thank you, Misha. So my talk today is about two separate topics. The first one is outlier detection from large databases. The second one is about model fidelity and model checking in large distributed systems. So the first part of the talk will focus on outlier detection and how you can speed it up using both an indexing technique on the database side and parallel programming techniques on the systems side. This is joint work with my colleagues Bryan Matthews and Chris Giannella. Bryan is at NASA Ames; Chris is at the MITRE Corporation. This is work we presented at last week's security conference. Here's a brief roadmap of the presentation. First I'll introduce the subject, then I'll give a little bit of background on what the state of the art is and what we're trying to do, which falls into the contributions part. Then come the algorithms that we are interested in and have developed, how they perform in the experimental results, and finally a conclusion to end the talk. So this is the setup of the problem. We're given a dataset where each row is an instance, and for the time being we're assuming they are IID samples. Each column is one of the variables. Here I have an example from the aviation domain, in which I work, so each column can be, say, speed, [inaudible], roll angle, pitch angle of the aircraft, and many other parameters, so we have 500 or 600 parameters being measured at a frequency of one hertz. The idea is to apply outlier detection techniques to find rows or instances of this dataset which are abnormal. There are several ways you can define abnormal, and I'll propose and use a particular one of them. So the natural question is: where can this be used? We all know outlier detection can be used in many places. Here are a few examples. Abnormal user behavior in social and information networks: you have a huge network and you want to find a user whose behavior is abnormal compared to the rest. Typically what you can do is define a feature space for each user in a high dimensional space, embed each user in that space, and then run these typical outlier detection algorithms to find the abnormalities. That's one way of doing it. Misconfigured machines in cloud infrastructure is another interesting example, where what you do is take all the machines in the system, run a typical distributed or parallel outlier detection technique that runs on the cores of each of these machines in the distributed scenario, and come up with a set or collection of machines which are misconfigured or behaving abnormally compared to the rest. Fraud detection in large transaction databases is a very interesting application, and much work has been going on there for the last 10 or 15 years. Lastly, the one that we work on is finding abnormal aircraft using operational data. Whenever a commercial airliner flies from point A to point B there's a huge amount of data being collected, on the order of several gigabytes.
The idea is to leverage all of this data to figure out which aircraft is most abnormal with respect to the other aircraft. That's the idea, and we are working on that problem. So here's the definition of outlier that I'm going to use. I'm going to use the distance-based outlier. The idea is very simple: an outlier is a point which is very far from its nearest neighbor. Here the red point is an outlier, the reason being that if you look at its nearest neighbor, one of the green points, it's quite far from it. So that's the idea. The naive approach to outlier detection is very simple. For every point you find its nearest neighbors, rank the points according to the distance to their nearest neighbor, and you have the entire ranking of the outliers. Here's one example. The red ones are the outliers, and bigger means more of an outlier; the green ones are nearest neighbors, and the line in between shows how far each point is from its nearest neighbor. The farther it is from its nearest neighbor, the more of an outlier it is, and that gives the ranking in which you can read off outliers from top to bottom. As you can see, if you apply the brute force algorithm you have to compute the entire distance matrix, which is expensive for large databases; it doesn't scale at all. There are several relaxations to this problem which have been proposed in the literature. The one I'm going to focus on guarantees that I get the exact same outliers that are found by the brute force approach. There are approximation techniques which work on the same problem, but I'm not going to talk about them. So here is one relaxation, which was proposed at KDD 2003 by Bay and Schwabacher. What this says is: instead of finding all outliers, let's just find the top t outliers, where t is a user-defined parameter. And why is this important? Because of the following reason: you can actually reduce the computational complexity by a huge margin. The way the algorithm works is you first set up a buffer of size t, and you pay a price by looking at the first t points of your database and finding their actual nearest neighbors. So here's the pictorial representation. On the left I have some points, and let's say I know which points are the biggest outliers; in the true scenario you won't know that, but for illustration it's easier to see. I pick the first three points, let's say t equals 3, put them in the buffer, and find the true nearest neighbors of those points. Along with that, I also find a cutoff threshold, which I call c, which is the nearest-neighbor distance of the weakest of these outliers. So now, whenever I pick the fourth point from the database, I keep scanning the database again, and as soon as I find a neighbor of that test point which is closer than c, I can immediately throw that point out, because it cannot be an outlier: I have already found three outliers which are stronger than this guy. On the other hand, if you find another point which is more of an outlier than these three points, then you actually have to go through all of the points in the database, and only after that can you be sure it is more of an outlier than the three you've already found. What this means is that they have shown that if you randomize the dataset well, on average this algorithm runs in linear time, the assumption being that for each point you can find the nearest neighbor in constant time. But that doesn't really happen in practice. In most cases what happens is you have a cluster of points which are outliers; you spend a lot of time in that, and the complexity is somewhere in between linear and quadratic. In most interesting cases it's N raised to the 1.3 or 1.4. That's what the complexity is.
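To make the pruning step concrete, here is a minimal Python sketch of a Bay and Schwabacher style top-t scan with a rising cutoff. It is an illustrative reconstruction under the assumptions just described, not the speakers' implementation: the function and variable names are made up, and it keeps the data in memory rather than on disk.

    import numpy as np

    def top_t_outliers(data, t=3, k=1):
        # Nested-loop idea: keep a buffer of the current top-t outliers and
        # prune a test point as soon as its k-NN distance is known to fall
        # below the cutoff (the weakest score in the buffer).
        n = len(data)
        top = []            # (k-NN distance, index) of the current top-t outliers
        cutoff = -np.inf    # score of the weakest outlier in the buffer
        for i in range(n):
            nn_dists = []   # k smallest distances from point i seen so far
            pruned = False
            for j in range(n):
                if i == j:
                    continue
                d = np.linalg.norm(data[i] - data[j])
                nn_dists = sorted(nn_dists + [d])[:k]
                # The k-th smallest distance so far upper-bounds the true k-NN
                # distance, so dropping below the cutoff is a safe prune.
                if len(nn_dists) == k and nn_dists[-1] < cutoff:
                    pruned = True
                    break
            if not pruned:
                top = sorted(top + [(nn_dists[-1], i)], reverse=True)[:t]
                if len(top) == t:
                    cutoff = top[-1][0]   # smallest score among the kept top-t
        return top

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pts = np.vstack([rng.normal(0, 1, (200, 2)), [[8, 8], [9, -7], [-10, 5]]])
        print(top_t_outliers(pts, t=3, k=1))   # the three injected points dominate

The point of the cutoff is that once the buffer is full, most inliers are discarded after examining only a handful of neighbors, which is where the near-linear behavior in the friendly case comes from.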
So the contribution from us is three-fold. The first is that we are trying to speed up this existing technique, which is considered state of the art, using an indexing scheme. While doing that, we want to make sure that the implementation is disk-based, because there are several ways you could speed this up: you could load the entire data in memory, do some efficient processing, and say you found the outliers. But that's not the goal here, because the size of the datasets we're looking at is roughly on the order of 10 to 100 terabytes, and you really cannot run anything close to that in memory. The second contribution is that we have figured out that, using our index, you can define an early termination criterion. By that I mean the algorithm can stop even before processing all of the data points. This is interesting because as you keep processing more and more points, there will be a stopping condition at which you can say, okay, I have found the outliers and I don't need to look at anything else, no matter what size the database is. The third is that we have developed a parallel algorithm which runs on multicore machines and also on parallel frameworks, and that gives you better scalability just by splitting the data and doing some intelligent processing. So here is the first advantage of using the index. Note that a crucial part of Orca was how the threshold was determined. If the threshold increases very slowly, your prune rate will increase very slowly, because you cannot throw the inlier points out. So one of the interesting questions is: is there a way of cleverly making sure that your cutoff increases very fast? The intuition being that if you process the outliers first, you can make sure that your cutoff increases very, very fast. But the problem is you don't know what the outliers are. It's a chicken and egg problem: if you don't know what the outliers are, how can you process them first? Here's one solution to the problem. We pick a random point from the dataset; we call it the reference point. And we order all the other points by their distance from this fixed point, with the highest distance from the reference point at the top. So here is the example. The blue points are, let's say, my normal points and the red points are a bunch of outliers. If I pick a random point from the dataset, more often than not I'll pick one from the blue. I may pick from the red, but let's assume I'm picking from the blue. If I order all the points in the database by distance to this green reference point, with the farthest at the top, then the red ones will come first, then the blue, and then the green one itself. Note that this is just a one-dimensional projection of your points based on distance to a fixed reference point. And instead of processing the dataset in the order it's given to me, I will rather process it in the order defined by this reference point. By doing this there's a better chance I'll process the red ones first, and that would increase the cutoff faster.
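Here is a similarly minimal sketch of that indexing step: pick a random reference point, compute every point's distance to it, and sort in decreasing order so that the far-away, outlier-like points are visited first. The names are again illustrative; this is only a sketch of the ordering idea, not the actual iOrca index.

    import numpy as np

    def build_reference_index(data, rng=None):
        # Order points by decreasing distance to a randomly chosen reference point.
        # If the reference lands in a normal cluster (the likely case), the far
        # points, i.e. the probable outliers, come first and push the cutoff up fast.
        rng = rng or np.random.default_rng()
        ref = data[rng.integers(len(data))]
        dists = np.linalg.norm(data - ref, axis=1)
        order = np.argsort(-dists)            # farthest-first processing order
        return ref, order, dists[order]

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        pts = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(12, 0.5, (5, 2))])
        ref, order, sorted_dists = build_reference_index(pts, rng)
        print(order[:10])   # the small far-away cluster shows up at the front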
So these are the experimental results, on three different datasets. The red curves are for the traditional algorithm, which is Orca, and they are roughly linear: you can see how the cutoff increases as more and more data points are processed. The blue curve is the algorithm we're proposing, which is iOrca. In most cases iOrca has a far higher rate of cutoff increase than Orca. In fact, in the last dataset the reference point was chosen so brilliantly that in the first iteration it got almost all of the outliers, and the cutoff is almost at the very top. The second advantage of this indexing scheme is in how you find the k nearest neighbors faster. On the one hand you need the cutoff to increase fast, and on the other hand you also need to find the kNN very, very fast, because that's the bottleneck of the computation. It turns out that you can exploit spatial locality of the points using this index to find the kNN faster, and it happens this way. When I want to find the nearest neighbors of, let's say, the purple point, which is in the index -- I don't know if you can see it, but it's number one -- instead of processing the points again from top to bottom to find its nearest neighbor, I would rather do a spiral-out approach: I go above it, below it, above, below. What that means, essentially, is that I'm exploiting the spatial locality, because these points are grouped together in that specific order according to distance from a fixed point. The assumption is that even in the higher dimensional space you would hopefully still have this locality, but we don't know if it's guaranteed or not. So that's the other advantage of using this technique. And the third one, which I think is more interesting, is that you can actually stop the computation early, after a number of iterations that is dataset-dependent. The intuition is this. Let's say we are given the index L and the reference point R, the green one, and x_t is any point you're currently testing to find its nearest neighbors. Even before you do a disk access, because you have this index in memory, what you can do is simply compute the distance between x_t and R, which is the term on the right, and the distance between R and x_{N-k+1}, where x_{N-k+1} is nothing but the true k-th nearest neighbor of R -- there are k points from it to the end of the index. If you add those two up and the sum is less than the cutoff, then x_t cannot be an outlier, and this is very simple to verify. So this is the proof sketch. Why can we prune x_t? Because the distance between x_t and x_{N-k+1} is less than that sum, by the triangle inequality. And since I know that d(x_t, R) + d(R, x_{N-k+1}) is already less than the cutoff threshold, the distance between x_t and the true k-th nearest neighbor of R will also be less than the cutoff threshold. In fact, if you consider any point below x_{N-k+1} in the index, its distance to x_t will also be less than c, because those points are even closer to R. So x_t has at least k neighbors within distance c, and you can prune x_t out because it cannot be an outlier. That's the first thing. The next thing is that anything below x_t -- if you take x_{t+i} -- cannot be an outlier either, for the exact same reason: if you write the same bound out, the quantity you end up with is by definition smaller, because x_{t+i} is even closer to R. So anything below x_t can also not be an outlier. That's the third advantage of this indexing scheme, and these three together make our algorithm run much faster.
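The early-termination rule itself fits in a few lines. The sketch below assumes the sorted distances to the reference point from the previous snippet and is only meant to illustrate the triangle-inequality argument, with made-up names and numbers.

    import numpy as np

    def can_stop(dist_test_to_ref, kth_nn_dist_of_ref, cutoff):
        # If d(x_t, R) + d(R, k-th NN of R) < cutoff, the triangle inequality gives
        # x_t at least k neighbors within the cutoff, so x_t -- and every later
        # point in the index, which is even closer to R -- cannot be a top outlier.
        return dist_test_to_ref + kth_nn_dist_of_ref < cutoff

    if __name__ == "__main__":
        # Distances to the reference point in decreasing order (the index).
        sorted_dists = np.array([9.0, 7.5, 3.2, 3.0, 2.9, 2.5, 1.0, 0.4])
        k, cutoff = 2, 4.0
        kth_nn_dist_of_ref = sorted_dists[-k]       # k-th nearest neighbor of R
        print(can_stop(sorted_dists[4], kth_nn_dist_of_ref, cutoff))  # True: stop the scan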
So once we have that, we were thinking about whether we can distribute this workload amongst a network of machines and still get the same result, and how we would design that algorithm. This is what we call iDOR, distributed outlier detection on a ring. The idea here is to leverage a distributed computing framework to speed up distance-based outlier detection. We're assuming that the machines are arranged in a ring for the sole purpose that there is a particular order in which you can send messages. So every node does some computation, sends a message to the next node, that node refines the computation and sends a message in turn, and this keeps happening until the message comes back around. The first step is, of course, that you split the data and send it to the different nodes, and the second is that each machine builds an index on its own local data. They then apply the iOrca framework, and you start with a local threshold c_i of minus infinity, because you don't know anything about the rest of the data. There are two modes of operation of the algorithm: one is the push and the other is the pull. So here is the scenario. These are the nodes, and we are assuming the message goes around in this order and comes back to whichever node initiated the computation. In the push mode, you read a block of test data and you apply the iOrca technique to find its approximate nearest neighbors based on just your local dataset. If there are any points in the test set whose nearest-neighbor distance is less than the cutoff, you simply throw those points out, because they cannot be outliers. But for the other points, the ones above the threshold, you still cannot be sure whether they are truly outliers, because somebody else might have their nearest neighbor. That's the issue here. So what we do is prune as much as we can with the local dataset and then pass the rest to the next guy and let that guy prune it further. So in the push mode that's what happens: you prune the points whose kNN distance is less than c_i, and the surviving points are sent to the next machine for evaluation. When the next machine gets the set of points sent by the first machine, what it does is update the k nearest neighbors of the test points, and by that you can throw some more points out, because some points can go below the threshold. But for some points you are still not sure, so you have to pass them to the next machine, and this keeps happening around the ring. So if you started with, let's say, this size of test block, as you keep going around the ring the size of the test block keeps shrinking more and more. The faster you can shrink it, the better, because then you're doing the processing and pruning locally. And the idea is that if some points survive all the machines in the ring, then those points are potential outliers and we have to do further processing, and that's done by a master machine which sits on top of everything. Here's what the master machine does: it updates the list of the current outliers -- it gets from everyone the points which survived the entire ring -- and it updates the cutoff based on this list. The cutoff is set to the lowest score in the list of top t, and it is broadcast to all machines for efficient pruning. So that's when the cutoff is updated; in the next round your cutoff is evidently higher, and you can prune more points. And the correctness is very simple: because the cutoff increases monotonically, you know you're not missing any outliers. That's the basic proof.
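A rough sketch of one push step on the ring might look like the following. It is my own simplification under the assumptions above, not the authors' code: each candidate carries an upper bound on its k-NN distance, the local data tightens that bound, and anything that falls below the cutoff is dropped before the survivors are forwarded.

    import numpy as np

    def push_step(local_data, candidates, knn_bounds, cutoff, k=1):
        # Tighten each candidate's k-NN distance bound with the local data and
        # drop the ones that are now provably below the cutoff (exact for k=1).
        survivors, bounds = [], []
        for x, b in zip(candidates, knn_bounds):
            local_kth = np.sort(np.linalg.norm(local_data - x, axis=1))[k - 1]
            b = min(b, local_kth)          # still an upper bound on the true k-NN distance
            if b >= cutoff:                # cannot rule it out yet: forward to the next node
                survivors.append(x)
                bounds.append(b)
        return np.array(survivors), np.array(bounds)

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        local = rng.normal(0, 1, (100, 2))
        cands = np.array([[0.1, 0.2], [6.0, 6.0]])
        kept, b = push_step(local, cands, np.array([np.inf, np.inf]), cutoff=1.0)
        print(kept, b)   # the near-origin candidate is likely pruned; the far one survives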
In the experimental results, we tried this on several datasets and I'm showing only four of them. The first one has 580,000 examples and ten features; the Landsat dataset has 275,000. These are from the UCI repository. MODIS is an Earth science dataset; it's very big, and we took one year's worth of data, which is 50 million points. The last one is the aviation dataset we are working on. Again, we took one year's worth of data, which is 100 million points; the full dataset is extremely large, so we took just a very small subset. If you look at the experimental results, these are the four different datasets; on the X axis I have the data size and on the Y axis the running time. The black curve -- I'm not sure if you can see it -- is the time required to build the index; that's the pre-processing for our algorithm. The blue curve is the actual running time of our algorithm, and it includes the black time, so it's the total time. And the red curve is the time required for Orca or any other disk-based algorithm to run. In most cases we see a speedup of 5 to 15 or 17 times. That's the typical speedup we see just by using our technique, without any parallel processing. Now, if you apply the parallel processing, it speeds up anywhere between 1.2 to, I think, 3.5 times on various numbers of machines, like three to seven. But to me it seems that this is really below what we were expecting, and one of the reasons is that our implementation was probably not done in the most efficient manner: we used MPI, which probably had a bigger communication overhead. The other problem is that in the centralized scheme there is a stopping condition by which the algorithm just stops and doesn't do anything more. What was happening here is that for the first few iterations the distributed algorithm has a huge advantage, because you are distributing the workload, but for the later iterations the centralized algorithm was beating it, because all it needed to do was check that in-memory condition, while this one had the overhead of message passing, sending the termination criteria, and those things. So that is probably one of the reasons why the speedup was not that great. So in conclusion for this work, we developed a sequential and a parallel outlier detection method which significantly beats the other existing methods. We have tried several other methods -- some use clustering as a preprocessing step, some use PCA as preprocessing -- but ours is much faster and very simple to implement. The algorithms are provably correct. And the way we thought about it was to always go for a disk-based implementation because of the sheer size of the datasets that we have; if you think about the clustering-based approaches, they were all in memory, because you do the clustering by loading the data into memory, which we avoided. And the third point is that both algorithms can terminate even before looking at all the test points, and that's a huge advantage considering the size of the datasets we plan to apply this to. So this concludes one part of my talk.
So the next part of my talk is about distributed model fitting and fidelity checking in large networks. So what I'm talking about here is, let's say that if -- >>: Can I -- >> Kanishka Bhaduri: Sure. >>: How sensitive is it to the choice of reference point? >> Kanishka Bhaduri: It's hugely sensitive. So if you choose a reference point -- let's go back. If you choose a reference point from one of the red ones, you would be back to Orca, essentially. >>: Yeah. >> Kanishka Bhaduri: So it's not worse than Orca, but, of course, it's not as good as what I'm showing here. So the advantage -- one of the things you can do is choose multiple reference points. >>: Okay. >> Kanishka Bhaduri: And then take an intersection of the candidate sets you get for each of them, and that really helps. We did different experiments on different datasets and that really helps. >>: So the results here are -- >> Kanishka Bhaduri: Just one. >>: [inaudible]. >> Kanishka Bhaduri: One, and randomly picked. >>: Because you could also kind of, as you collect the data, have a running [inaudible]. >> Kanishka Bhaduri: Absolutely. >>: And use that, and that would be the best. >> Kanishka Bhaduri: Correct. Or maintain the median or mean of the data. I think those are very good open questions that we are trying to think about and solve. Even, I think, maybe take a sample, do a PCA, and then project -- that would be even better. So we are trying to do that. But some of the datasets are quite big in the number of features, so doing PCA is a challenge. We're trying to see if there is a sweet spot, which is interesting. Okay. The second part of the talk is about large scale model fitting and monitoring that model in a dynamic environment. Imagine that you have a huge number of nodes in the network and the data in those nodes is constantly changing. You have prior knowledge of a model; let it be just a regression model, which is what I'm going to talk about here. How do you asynchronously track whether the model has changed or not, without always gathering all the data? That's the focus of this part. This again was joint work with my colleagues, and we presented it at the SIAM data mining conference this year. This is the same kind of roadmap: introduction, problem definition and what we're trying to solve, then the algorithm, some experimental results and the conclusion. So this is the problem setup. We are given an input dataset which consists of two parts: the first is the set of inputs X and the second is the output, which is typically called the target, and there's an underlying function which maps your inputs to the output. It's a function from R^d to R. For linear models, you assume that f(x) and y are well behaved and follow a linear relationship, and the goal of regression is simply to estimate the weights a_0, a_1, and so on. It's well known how to do that: if you just take (X transpose X) inverse times X transpose Y, you can always do that. Here I'm not going to talk about the scenario where X transpose X is huge so you can't invert it; let's assume for the time being that you can invert it and that it gives you a good model. So you can always find A using this formula. Let's say we have found A; now we want to see how good A is. There are several goodness-of-fit measures, one of them being what is known as the coefficient of determination, typically known as R squared. So this is the definition of R squared. It's one minus a ratio, where the numerator is the sum squared error of the true versus the predicted values,
that is, the sum over j of (y_j − ŷ_j)², and the denominator is simply the variance of the actual values, the sum over j of (y_j − ȳ)², where ȳ is the mean of the y_j. You can see that for perfect models -- models which are very well behaved -- y_j and ŷ_j are equal, and so R squared equals 1. On the other hand, if the model is very bad, your R squared can actually be 0, and that happens when you're always spitting out the average answer for whatever instance you're seeing: if your ŷ is (1/m) times the sum of the y_j, you will always get R squared equal to 0. But note that R squared can actually range from minus infinity to plus 1. You can always get minus infinity because you can choose an arbitrarily bad model, and there is no lower bound on the R squared value. For most practical cases, though, we will assume it's between 0 and 1, because you can always say, I'll approximate the model with the average model. So what's new? All of this is known and very well studied in the statistics community. What we're trying to do is a little different. We are saying that you have a large number of nodes in the network -- let it be a datacenter connected in a peer-to-peer or any other setup, it can even be any other routing topology -- and you have a dataset at each node, X_1 through X_N. X is also time variant, and X is the union of all the data. Here we're assuming that the datasets are disjoint and there is no overlap between them, and the goal for us is to compute regression models on X, not just on X_1 through X_N separately. So where can this be used? It can be used in real-time fault detection or health assessment in large scale enterprise servers on the cloud, and we have some data where we are trying to apply this. The second one is smart grid applications; that's something we have experimented with here, using smart grid data from one of the open datasets. The third is that you can actually track evolving topics in a text domain using sparse regression models. I'm not going to talk about sparse models here, but it's very easy to see how you can apply the exact same technique to sparse learning. So here is the formal problem statement. We are given a dataset X_1 through X_N, one part at each node. We're also given a threshold epsilon, which is between 0 and 1, and the threshold is the same at all nodes. We also assume that the communication channel between node P_i and node P_j is ordered, in the sense that the first message sent is delivered first and not later, because otherwise it would alter your updating of the system. And we are also given a set of weights precomputed at an earlier timestamp. This is essential because I'm essentially talking about a monitoring scenario, where you have computed something and you are trying to track whether it's still correct or not. And the goal for us is to maintain a regression model A at any given time such that R squared computed on the global data -- not just the local data of a node, but the global data -- is greater than the epsilon that you have selected.
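For reference, here is a minimal centralized sketch of the two quantities just described -- the least squares fit A = (X^T X)^{-1} X^T Y and the R squared statistic. Names and data are illustrative, and it assumes X^T X is small enough to solve directly, exactly as in the talk.

    import numpy as np

    def fit_least_squares(X, y):
        # Ordinary least squares: solve (X^T X) A = X^T y (assumes X^T X is invertible).
        return np.linalg.solve(X.T @ X, X.T @ y)

    def r_squared(X, y, A):
        # Coefficient of determination: 1 - SSE / variance of the actual values.
        y_hat = X @ A
        sse = np.sum((y - y_hat) ** 2)
        var = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - sse / var

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])  # intercept + 3 inputs
        y = X @ np.array([2.0, 1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=1000)
        A = fit_least_squares(X, y)
        print(r_squared(X, y, A))   # close to 1 for a well-fit model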
So there are three kinds of solutions to this problem. The first is what are known as periodic algorithms, the second is incremental algorithms, and the third is reactive algorithms. In a periodic algorithm, you periodically sample data from the network, rebuild the model A, and check whether the last model and the new model agree well enough, using some statistical test of model fit. In an incremental algorithm, you have an old model, and whenever there is a change you update the old model into a new model using some clever algorithm. The third, which is what we are suggesting, is that you track the model with respect to the data, and whenever there is an alarm that this model doesn't fit, you resample data from the network, rebuild the model, and then keep checking again. So that's the kind of two-phase, reactive, closed-loop solution we are proposing. Of course, the periodic algorithms waste a lot of resources. The incremental algorithms, as we know, are extremely efficient, but they are very difficult to develop and have to be hand tailored for each of the problems you're trying to solve. The reactive algorithms are very general -- there are solutions for almost all of the data mining problems that we know of -- they are very simple to develop, and they are very, very efficient in terms of communication complexity. So here are the details of the work. Let's say for the time being that each node has a vector v_i. I haven't presented the exact definition of the vector here, because it's complicated to write out, but it is defined in terms of the local dataset -- the true values y_i and the predicted values ŷ_i -- and based on that v_i we want to do the model checking. We also define a parabola g, which is parameterized by the user-defined epsilon given to the system. So here are the details. Remember the R squared formula I presented earlier: it's basically one minus the sum squared error divided by the variance. What I've done here is, instead of writing each of these as a single sum, I've written it as two sums. The inner sum runs over one node's data -- the sum over j from 1 to m_i, where m_i is the number of data points at node i -- and then you sum over all nodes. So instead of having one sum, we have two sums now, and the denominator is treated the same way: instead of one sum you have two sums. So the R squared you get now is the same as the R squared you got earlier. It turns out that checking whether R squared is greater than epsilon is the same thing, in a geometric setting, as checking whether the convex combination of these vectors v_i -- the local vectors at the nodes -- lies inside the parabola. What that means is: if you gather all of these v_i's, take a convex combination in which each one is weighted by the number of points that peer holds, and sum them up, you get a vector v_G, and checking whether v_G is inside the parabola is the same thing as checking whether R squared is greater than epsilon. Still, it's inefficient to compute v_G at each time step, because all I have proposed so far is that at every time step you compute these v_i's, gather them, do this sum, and see whether the result is inside the parabola -- but that's no different from gathering the whole data, essentially.
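The "two sums" observation is easy to see in code: each node only needs to report a handful of partial sums, and a coordinator can reassemble the exact global R squared from them. This naive gather-everything-at-every-step version is exactly what the geometric machinery described next is designed to avoid; the node split and names here are hypothetical.

    import numpy as np

    def local_sums(y, y_hat):
        # Everything one node must contribute to the global R squared.
        return np.array([len(y),                    # m_i: number of local points
                         np.sum((y - y_hat) ** 2),  # local sum of squared errors
                         np.sum(y),                 # local sum of targets
                         np.sum(y ** 2)])           # local sum of squared targets

    def global_r_squared(partials):
        # R^2 = 1 - SSE / (sum y^2 - (sum y)^2 / m), combining the per-node sums.
        m, sse, sy, syy = np.sum(partials, axis=0)
        return 1.0 - sse / (syy - sy ** 2 / m)

    if __name__ == "__main__":
        rng = np.random.default_rng(4)
        y = rng.normal(size=900)
        y_hat = y + rng.normal(scale=0.2, size=900)
        nodes = np.array_split(np.arange(900), 3)   # three hypothetical nodes
        partials = np.vstack([local_sums(y[i], y_hat[i]) for i in nodes])
        print(global_r_squared(partials))           # identical to the centralized R squared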
So here is the geometric interpretation. This is the contour of the parabola. If v_i is inside the parabola for all of the peers, then v_G will be inside too, because it's just a convex combination, and a convex combination always lies in the convex region. But this is not true for points outside the parabola, because the outside of the parabola is by definition not convex. So what you can do is draw half spaces: you can cover the outside of the parabola using some of these half spaces, and then you can apply the exact same argument -- if all of the v_i's lie in a particular half space, then v_G will also lie in that same half space. So that's the geometric interpretation of the problem. Now the question is, how do we check this global condition more efficiently, without gathering all of the peers' data or the machine data at every single time step? For that we need some local statistics vectors. These are just mathematical definitions, but the intuition is this: you have a knowledge vector, which is essentially your local data, v_i, plus all of the information that you have received from your neighbors -- it's only the neighbors that we're considering. S_ji is all the information that node i has received from node j. The agreement is whatever the two of them have shared. The withheld is whatever they have not shared -- it's the difference. And the message is essentially the difference between whatever you have and whatever you have received; taking that difference is what prevents double counting of the same information. Notice this is a cyclic definition, because you have S_ji here and then everything is defined with respect to everything else. But when the algorithm starts, your S_ji is 0, so your knowledge is set to v_i; the agreement is 0, because you have not sent or received anything; your withheld is essentially your knowledge K_i, because you have not sent anything, so everything is withheld; and similarly your message is equal to the knowledge, because that's what you're going to send the first time. As the algorithm progresses, you start collecting more and more information in the S_ji and your knowledge starts to differ from your v_i, and that's when the interesting stuff starts happening. What we essentially want to show is that if I apply the parabola condition to this K_i -- if just by checking that g(K_i) is greater than 0, in other words that this vector lies inside the parabola, I can guarantee that the global vector is also inside the parabola -- that is a very interesting result, because then you don't need to gather all the data; you have essentially gotten rid of that summation. And that's what's given by this global condition. What it says is: if all of these vectors -- the sufficient statistics, the knowledge, agreement and withheld -- are inside the same convex region, and that convex region can be the inside of the parabola or any of the half spaces or any other convex region, then it's guaranteed that v_G, the global vector, is also in that convex region. And it's pretty easy to show: all you do is pick any two random peers, let them exchange all of their data, and you can show mathematically, just by applying the convexity argument, that the result always remains in the convex region. So you can now define this global condition and detect it based only on local conditions. That's what we are trying to do. Back to the convex regions, it's nothing different: the inside of the parabola, where g is greater than 0, is convex by definition, so there's no problem applying the rule there. On the outside you have to draw these bounding half spaces, because you don't have the convexity, and that's how we approximate the outside of the parabola to apply the theorem. So what the local criterion allows you to do is it allows a node to terminate and stop sending messages whenever its local condition holds.
It doesn't matter whether it has received a message, whether its local data has changed, whether a new node has joined, a node has dropped, or anything else; that's all the node needs to check, and it will still guarantee eventual correctness. By eventual correctness, what we mean is that when the computation terminates you get the exact same answer as if you had gathered all of the data and computed the R squared. That's the beauty of it. It's quite good at pruning messages, as I'll show you in the experimental section, and it allows a node to sit idle and not do anything unless there's an event. The events are these scenarios: sending or receiving a message, a change in its local data, or a change in the neighborhood. Unless one of these happens, you don't need to do anything. So this is the flow chart; there's nothing much in there, basically. What it says is: you take a local dataset and an error threshold epsilon, and you want to monitor whether R squared is greater than or less than epsilon. You initialize your v_i, which I have not shown here for simplicity, compute the sufficient statistics -- the knowledge and all those things -- and define the convex regions, and those convex regions are defined uniformly across all of the machines in the network. Then you keep applying this convex rule: you start, you do the initialization, you apply the convex rule; if it's satisfied you don't do anything else, otherwise you send messages, and then you wait for events to happen. That's essentially what you are doing. So up until now, what I have described is how you track whether the model has changed. The next part essentially builds on top of that: let's say we have detected that there is a change; how do we recompute and update the model? The solution we have is a very simple one, based on a convergecast tree. Every peer or node in the network simply sends its data, or a sample of its data, up the tree, and when all of the information arrives at the root you rebuild the regression model and then percolate it back down the tree. For that, what you need are two quantities: X transpose X and X transpose Y. And note that X transpose X is nothing but a summation over all of the local peers: each local peer can compute its own X_i transpose X_i and X_i transpose Y_i, send them up the tree, and the next peer can simply sum them over all of its children and pass the result up. That's how you can compute the true A, because it's nothing but a summation over these quantities.
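A minimal sketch of that aggregation step, with a flat list standing in for the convergecast tree and illustrative names: each node ships only its X_i^T X_i matrix and X_i^T y_i vector, and the root sums them and solves for the global weights.

    import numpy as np

    def local_moments(X, y):
        # The sufficient statistics each node sends up the tree.
        return X.T @ X, X.T @ y

    def combine_and_solve(moments):
        # Sum the children's (X^T X, X^T y) pairs and solve for the global weights.
        XtX = sum(m[0] for m in moments)
        Xty = sum(m[1] for m in moments)
        return np.linalg.solve(XtX, Xty)   # assumes the summed X^T X is invertible

    if __name__ == "__main__":
        rng = np.random.default_rng(5)
        true_A = np.array([1.0, -2.0, 0.5])
        parts = []
        for _ in range(3):                 # three hypothetical nodes, each with its own slice
            X = rng.normal(size=(400, 3))
            y = X @ true_A + rng.normal(scale=0.1, size=400)
            parts.append(local_moments(X, y))
        print(combine_and_solve(parts))    # recovers approximately [1.0, -2.0, 0.5]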
So here is the experimental setup. We did an experiment -- you had a question? Oh, sorry. So here's the experimental setup. We did a dynamic experiment where we were changing the data at different levels. This is the timing diagram; on the X axis you have time. What we did was, every 20,000 clock ticks we changed a fixed percentage of the data at the nodes -- I think we changed 20 percent of the data at each node. We assume each node has a fixed-length data buffer, so you retire some old data and you put in some fresh data. On top of that, every 500,000 clock ticks, which is a bigger time period, we changed the distribution from which the data is sampled. Here, changing the distribution simply means changing the coefficients of your linear model. So here is the data description and how the peers react to the change. This is the data: on the X axis you have time and on the Y axis you have the actual value of R squared as seen by all of the peers, overlaid over all the peers' data. If you look at the box, what happens is that the average R squared is close to 0.7, the reason being that we gave each peer a set of model coefficients similar to the ones from which we were generating the data. Those two sets of coefficients were similar, and that's why there's high model fidelity with respect to this R squared value. On the other hand, for all the odd epochs, we changed the coefficients of the data generator while keeping the model coefficients the same, and what that means is that there's low fidelity of the model. So that's what is happening. >>: The peers don't re-estimate? So why -- >> Kanishka Bhaduri: This is only for detection purposes. >>: Okay. >> Kanishka Bhaduri: So I'm not allowing them to recompute the model; that's the next experiment I'll show. And the red line is the epsilon we chose, which is .5, such that it's an exact half -- the probabilities of true positives and true negatives are exactly the same. This is how the peers behave on the accuracy side: the X axis is time and the Y axis is accuracy. Between 0 and .5, which is this region, you see almost 100 percent agreement: everybody is saying that the model is correct. Then whenever the data distribution changes, that's when they start communicating, and this is reflected here -- that's time versus messages per node -- and there's a huge drop in the accuracy, because that's when they're unsure about what's happening. They need to gather some more evidence, and then you again see a jump in the accuracy: that's when they all say the model doesn't fit, whereas here they're saying the model fits the data. If you look at the messages, that's interesting too: whenever there's a change in the distribution you see a sharp jump in the messages, but in the other intervals there are very few messages going around. And this is still dynamic -- it's not that I'm changing the data only here and here; I'm changing the data throughout -- but because the monitoring is tied to the model fitting criterion, the algorithm is doing what it's supposed to do efficiently. So one of the advantages of using this kind of algorithm is what is known as local behavior. What that means is that these rules -- the rule that I talked about -- are very data-dependent, and whenever the data changes, there is some action going on; you can't really characterize analytically, by writing equations, when the data will change or what the behavior will be. So what we did was change the network size from 500 nodes to 4,000 nodes and plot the accuracy of the peers and the number of messages. What happens is that as the number of nodes increases, the accuracy remains almost constant, which is the left plot, and the messages per peer also remain constant, which means you can keep increasing the number of nodes in the network and the scalability will not be hampered. The reason this happens is that these computations are very local to a peer: if you change the data at a peer, the effect propagates to only a few nodes in the neighborhood before it subsides. That's what is being reflected in these experiments, and this is true for other local algorithms as well. So this is the second part of the experiment, where I'm essentially closing the loop and allowing the peers to recompute the model. I'm starting off with a very bad estimate of the model at each peer -- a random guess -- so you have a very bad value of R squared.
The peers see that this is indeed the case, so they recompute the model, and for the later half, when the data is in line with the model, you see a very high value of R squared. Then again I randomly change the data distribution to a different one, there's a drop in R squared, and this keeps happening. That's what is reflected in the monitoring messages, and this is the number of times recomputation happens. So every time the data changes, there's a burst of messages being passed to collect the data and rebuild the model, and that's what I'm showing here. The last experiment was an application of this algorithm to smart grid operation. What we were trying to do was predict how much carbon dioxide would be released based on two factors: how much wind energy is being generated, and what the demand is. The assumption is that higher CO2 means more fossil fuel is burned and less renewable energy is being used. We collected about nine months of data, January through September, from one of the grid operators in Ireland, and we treated each month of data as one epoch. We built a model on the data between January and February, and we wanted to see if the algorithm would detect the change, because it is known that there was a drop in the carbon dioxide released between May and June. So the idea was: can the algorithm detect that change? We built the model from this period, then piped the data in, changing the data for each month, and the algorithm did actually pick this up and say that the model doesn't fit the data. This again is in the open loop mode, where I'm not allowing the peers to rebuild the model; I'm just using it as a detector. And if you look at the messages, that's also reflected there. So in conclusion, this is, as far as we know, the first work on monitoring R squared in really large distributed systems which are asynchronous in nature. The algorithm is provably correct, it's highly efficient, and it converges to the correct result. And R squared is scale-free in the sense that it lies between 0 and 1, so you can just pick a number and that's it -- unlike many of the other works on thresholding, where you have to know what epsilon is, and that's very dataset dependent; even some of my earlier work was like that. Lastly, I want to acknowledge the NASA System [inaudible] Technology project -- this is under the Aviation Safety Program, of which I'm a part -- and for many of the papers and some of the open source code, this is where you can download them. That's pretty much what I had. >>: I have a question. It seems like there are kind of two fundamental modes: one where the distribution drifts naturally and you want to adapt, and another where a critical event happens, things go out of whack, and you get the alarm -- there you don't want to screw up the model, you want the model to remain. >> Kanishka Bhaduri: Exactly. >>: Do you think this is something you'd have to pre-decide, or can you have some natural mix of the two, where the model would know when to adjust and when to just look at the size of the alarm and then go back to the previously learned state? >> Kanishka Bhaduri: So the way the algorithm works is that you can basically set, at the time you want it, the type of detection you want to do, essentially. That's not something that you have to decide before the algorithm starts.
So the algorithm would still decide: okay, this is the kind of alarm for which I'll rebuild the model, and this is the kind of alarm for which I won't rebuild the model. I think that kind of thing would be very important. >>: It would be like a second layer of -- >> Kanishka Bhaduri: Exactly. Right. And it would depend on some hypothesis tests and on what the significance of the alarm is, essentially. This is a very simple model, where I'm just saying, okay, it's out -- rebuild the model, yeah. >> Misha Bilenko: Questions? Thank you very much. >> Kanishka Bhaduri: Thank you. [applause]