>> Misha Bilenko: Today we're hosting Kanishka Bhaduri from NASA Ames.
And Kanishka has been working in the area of distributed learning and data
mining for quite some time, since his Ph.D. at University of Maryland, Baltimore
County, after which he was a researcher at TU Dortmund for some time. Then
he moved to NASA Ames where he's been working on interesting things since
then. Today he'll tell us about algorithms for outlier detection.
>> Kanishka Bhaduri: Thank you, Misha. So my talk today is about two
separate topics. The first one is outlier detection from large databases. The
second one is about model fidelity and model checking in large distributed
systems. So the first part of the talk will focus on outlier detection, how you can
speed it up using both an indexing technique on the database side and also
parallel programming techniques from the systems side.
This is joint work with my colleagues Bryan Matthews and Chris Giannella. Bryan is at NASA Ames; Chris is at the MITRE Corporation. This is work we presented at last week's security conference.
Here's a brief roadmap of the presentation. First I'll introduce the subject and
then I'll give a little bit of background on what the state of the art is and what
we're trying to do that will fall into the contributions part. Then some of the
algorithms that we are interested in and we have developed and how they
perform in the experimental results, and finally a conclusion to end the talk.
So this is the setup of the problem. We're given a dataset where each row is an instance, and for the time being we're assuming the instances are i.i.d. samples. Each column is one of the variables.
Here I have an example from the aviation domain in which I work, so each column can be, say, speed [inaudible], roll angle, pitch angle of the aircraft, and many other parameters. We have 500 or 600 parameters being measured at a frequency of one hertz.
The idea is to apply outlier detection techniques to find rows or instances of this dataset which are abnormal. There are several ways you can define abnormal, and I'll propose and use a particular one of them.
So the natural question is: where can this be used? We all know outlier detection can be used in many places. Here are a few examples.
Abnormal user behavior in social and information networks: you have a huge network and you want to find a user whose behavior is abnormal compared to the rest. Typically what you can do is define a feature space, embed each user in that high-dimensional space, and then run the typical outlier detection algorithms to find the abnormalities. This is one way of doing it.
Misconfigured machines in cloud infrastructure is another interesting example, where you take all the machines in the system, run a typical distributed or parallel outlier detection technique on the cores of these machines in the distributed scenario, and come up with a set or collection of machines which are misconfigured or behaving abnormally compared to the rest.
Fraud detection in large transaction databases is a very interesting application, and much work has been going on there over the last 10 or 15 years.
Lastly, the one that we work on is finding abnormal aircraft using operational data. Whenever a commercial airliner flies from point A to point B, a huge amount of data is collected, on the order of several gigabytes. The idea is to leverage all of this data to figure out which aircraft is most abnormal with respect to the other aircraft.
That's the idea that we have. And we are working on that problem. So here's
the definition of outlier that I'm going to use. I'm going to use the distance-based
outlier. The idea is very simple. An outlier is a point which is very far from its
nearest neighbor. Here is the red point which is an outlier, and the reason being
if you look at its nearest neighbor, one of the green points, it's quite far from its
nearest neighbor. So that's the idea.
So the naive approach to outlier detection is very simple: for every point you find its nearest neighbors, rank the points according to the distance to their nearest neighbor, and you have the entire ranking of the outliers.
Here's one example. The red points are the outliers, and bigger means more outlying; the green points are nearest neighbors. The line in between shows how far a point is from its nearest neighbor. The farther it is from its nearest neighbor, the more of an outlier it is, and that gives the ranking in which you can read off outliers from top to bottom.
As you can see, if you apply the brute-force algorithm you have to compute the entire distance matrix, which is expensive for large databases. It doesn't scale at all.
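[For illustration, a minimal Python sketch of this brute-force ranking; the function name and the use of NumPy are illustrative assumptions, not the speaker's implementation:]

    import numpy as np

    def brute_force_outliers(X, k=1, top_t=3):
        """Rank points by the distance to their k-th nearest neighbor."""
        # Full pairwise distance matrix: O(n^2) time and space -- the part
        # that does not scale for large databases.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
        knn_dist = np.sort(d, axis=1)[:, k - 1]  # distance to k-th nearest neighbor
        order = np.argsort(-knn_dist)            # most outlying points first
        return order[:top_t], knn_dist[order[:top_t]]

    # Example: top 3 outliers by 5-NN distance on random data.
    X = np.random.rand(1000, 5)
    top_idx, top_scores = brute_force_outliers(X, k=5, top_t=3)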
There are several relaxations to this problem which have been proposed in the literature. The one I'm going to focus on will guarantee that I get the exact same outliers that are found by the brute-force approach. There are approximation techniques which work on the same problem, but I'm not going to talk about them.
So here is one relaxation, proposed at KDD 2003 by Bay and Schwabacher. What it says is: instead of finding all outliers, let's just find the top t outliers, where t is a user-defined parameter. And why is this important? Because you can actually reduce the computational complexity by a huge margin.
The way the algorithm works is you first set up a buffer of size t, and you pay a price by looking at the first t points of your database and finding their actual nearest neighbors. Here's the pictorial representation. On the left axis I have some points, and let's say I know which points are the most outlying; in the real scenario you wouldn't know that, but for illustration it's easier to see.
I pick the first three points, say t equals 3, put them in the buffer, and find the true nearest neighbors of those points. Along with that, I also maintain a cutoff threshold, which I call c, which is the nearest-neighbor score of the weakest of these outliers. Now, whenever I pick the fourth point from the database, I scan the database again, and as soon as the running nearest-neighbor distance of that test point drops below c, I can immediately throw the point out: it cannot be an outlier, because I have already found three outliers which are worse than this guy.
On the other hand, if a point is more of an outlier than these three, then you actually have to go through all of the points in the database.
And only after that can you be sure that it is more of an outlier than the three you've already found. They have shown that if you randomize the dataset well, on average this algorithm runs in linear time, the assumption being that for each point you can find the nearest neighbor in constant time. But that doesn't really happen in practice. In most cases you have a cluster of points which are outliers, you spend a lot of time in there, and the running time is somewhere between linear and quadratic. In most interesting cases it's N raised to the 1.3 or 1.4. That's what the complexity is.
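[A rough in-memory Python sketch of this top-t pruning idea; the structure and names are illustrative, not the actual Orca implementation:]

    import numpy as np

    def top_t_outliers(X, k=1, t=3):
        """Bay & Schwabacher-style search for the top-t distance-based outliers."""
        outliers = []   # up to t (score, index) pairs, best first
        c = -np.inf     # cutoff: score of the weakest outlier found so far
        for i in range(len(X)):
            neigh = np.full(k, np.inf)   # k smallest distances from X[i] so far
            pruned = False
            for j in range(len(X)):
                if i == j:
                    continue
                dij = np.linalg.norm(X[i] - X[j])
                if dij < neigh.max():
                    neigh[neigh.argmax()] = dij
                if neigh.max() <= c:     # running k-NN score fell below the cutoff:
                    pruned = True        # X[i] cannot beat the current top t
                    break
            if not pruned:
                outliers = sorted(outliers + [(neigh.max(), i)], reverse=True)[:t]
                if len(outliers) == t:
                    c = outliers[-1][0]  # the cutoff only ever increases
        return outliers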
So our contribution is three-fold. The first thing is we speed up this existing technique, which is considered state of the art, using an indexing scheme.
While doing that, we want to make sure that the implementation is disk-based, because there are several ways you can speed this up: you could load the entire data in memory, do some efficient processing, and say you found the outliers. But that's not the goal here, because the size of the datasets we're looking at is roughly on the order of 10 to 100 terabytes, and you really cannot run anything close to that in memory.
The second contribution is we have figured out that, using our index, you can define an early termination criterion correctly. By that I mean the algorithm can stop even before processing all of the data points. And this is interesting because, as you keep processing more and more points, there will be a stopping condition at which you can say: okay, I found the outliers and I don't need to look at anything else, no matter what size the database is.
The third is we have developed a parallel algorithm which runs on multicore
machines and also parallel frameworks and that gives you better scalability just
by splitting the data and doing some intelligent processing.
So here is the first advantage of using the index. Note that a crucial part of Orca was how the threshold was determined. If the threshold increases very slowly, your prune rate will increase very slowly, because you cannot throw the inlier points out.
So one of the interesting questions is: is there a way of cleverly making sure that your cutoff increases very fast?
The intuition is that if you process the outliers first, you can make sure that your cutoff increases very, very fast. But the problem is you don't know what the outliers are. It's a chicken-and-egg problem: if you don't know what the outliers are, how can you process them first?
Here's one solution to the problem. We pick a random point from the dataset; we call it the reference point. And we order all the other points by their distance from this fixed point, with the highest distance from the reference point at the top.
So here is the example. So the blue points are let's say my normal points and
the red points are the bunch of outliers that I have. If I pick a random point from
the dataset more often than not I'll pick one from the blue. I may pick from the
red, but let's assume that I'm picking from the blue. If I order all the points in the database by distance to this green point, with the farthest at the top, then the red ones will come first, then the blue, and then the green.
So note that this is just a one-dimensional projection of your points, based on distance to the fixed reference point. And instead of processing the dataset in the order it's given to me, I will rather process it in the order induced by this reference point.
And by this there's a better chance I'll process the red ones first, and that will increase the cutoff faster.
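[A minimal sketch of this reference-point ordering, with illustrative names:]

    import numpy as np

    def build_reference_index(X, seed=0):
        """Order points by decreasing distance to a random reference point,
        so likely outliers are processed first and the cutoff rises quickly."""
        rng = np.random.default_rng(seed)
        r = X[rng.integers(len(X))]            # the random reference point
        dist_to_r = np.linalg.norm(X - r, axis=1)
        order = np.argsort(-dist_to_r)         # farthest-from-r first
        return r, order, dist_to_r[order]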
So here are the experimental results, on three different datasets. The red curves are for the traditional algorithm, Orca, which is roughly linear: this is how the cutoff increases as the number of data points processed grows. And the blue one is the algorithm we're proposing, iOrca. In most cases iOrca has a far higher rate of cutoff increase than Orca. In fact, for the last dataset, the reference point was chosen so brilliantly that in the first iteration it got almost all of the outliers, and the cutoff is almost at the very top.
The second advantage of this indexing scheme concerns how you find the k nearest neighbors faster. On the one hand you need the cutoff to increase fast, and on the other hand you also need to find the kNN very, very fast, because that's the bottleneck of the computation.
It turns out that you can exploit the spatial locality of the points using this index to find the kNN faster. And it happens this way. When I want to find the nearest neighbor of, let's say, the purple point, which is in the index -- I don't know if you can see it -- but it's number one. Instead of processing the points from top to bottom to find its nearest neighbor, I would rather do a spiral-out approach: one above, one below, one above, one below. What that means, essentially, is that I'm exploiting the spatial locality, because these points are grouped together in that specific order according to distance from a fixed point.
The assumption is that even in the higher-dimensional space you would hopefully have this ordering, but we don't know if that's guaranteed or not. So that's the other advantage of using this technique.
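[The spiral-out scan can be sketched like this; the exact procedure is not spelled out in the talk, so this is a hypothetical helper:]

    def spiral_out(position, n):
        """Visit index slots outward from a test point's own slot in the
        reference-point ordering: adjacent slots first, since points at
        nearby slots are the likeliest spatial neighbors."""
        yield position
        step = 1
        while position - step >= 0 or position + step < n:
            if position - step >= 0:
                yield position - step
            if position + step < n:
                yield position + step
            step += 1

    # e.g. list(spiral_out(2, 6)) -> [2, 1, 3, 0, 4, 5]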
And the third one, which I think is more interesting, is that you can actually stop the computation after a certain number of iterations, and that's dataset-dependent. The intuition is this. Let's say we are given the index L and the reference point r, the green one, and x_t is any point which you're currently testing to find its nearest neighbors. Even before you do a disk access, because you have this index in memory, what you can do is simply compute the distance between x_t and r, which is the one on the right, and the distance between r and x_{n-k+1}, where x_{n-k+1} is nothing but the true k-th nearest neighbor of r; so there are k points there.
And if you add those two up, d(x_t, r) + d(r, x_{n-k+1}), and it's less than the cutoff c, then x_t cannot be an outlier. This is very simple to verify, so here is the proof sketch.
So why can we prune x_t? Because d(x_t, x_{n-k+1}) is less than d(x_t, r) + d(r, x_{n-k+1}) by the triangle inequality. And since d(x_t, r) + d(r, x_{n-k+1}) is already less than the cutoff threshold, the distance between x_t and the true k-th nearest neighbor of r is also less than the cutoff threshold. And x_{n-k+1} is the true k-th nearest neighbor of r, so if you consider any point below it in the ordering, those are also within distance c of x_t. So x_t has at least k neighbors within distance c, and you can prune it out because it cannot be an outlier. That's the first thing.
The next thing is that anything below x_t, any point x_{t+i}, cannot be an outlier either, for the exact same reason: if you write it out, d(x_{t+i}, r) + d(r, x_{n-k+1}) is by definition no bigger than d(x_t, r) + d(r, x_{n-k+1}), because x_{t+i} is closer to r. So anything below x_t can also not be an outlier. That's the third advantage of this indexing scheme, and these three together make our algorithm run much faster.
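[In code, the early-termination test is a one-liner; here d_k_of_r is assumed to be the precomputed distance from the reference point to its k-th nearest neighbor, d(r, x_{n-k+1}):]

    import numpy as np

    def can_terminate(x_t, r, d_k_of_r, c):
        """If d(x_t, r) + d(r, x_{n-k+1}) < c, the triangle inequality gives
        x_t at least k neighbors within distance c, so x_t -- and every later
        point in the ordering, which is even closer to r -- can be pruned and
        the whole scan can stop."""
        return np.linalg.norm(x_t - r) + d_k_of_r < c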
Once we had that, we wondered whether we could distribute this workload among a network of machines and still get the same result, and how we would design that algorithm. This is what we call IDOR: distributed outlier detection on a ring.
The idea here is to leverage a distributed computing framework to speed up the distance-based outlier detection. We're assuming that the machines are arranged in a ring, for the sole purpose that there is a particular order in which you can send messages. Every node does some computation, then sends a message; the next node refines that computation and sends a message in turn, and this keeps happening until the message comes back around.
The first step is, of course, to split the data and send it to the different nodes, and the second is that each machine builds the index on its own local data. They apply the iOrca framework, and each starts with a local threshold c_i of minus infinity, because it doesn't know anything about the rest of the data. There are two modes of operation of the algorithm: one is the push and the other is the pull. So here is the scenario.
So these are the nodes and we are assuming that the message goes in this
order and comes back to this guy, or this guy or this guy, depending on who
initiated the computation.
In the push mode, you read a block of data from memory for testing. You apply the iOrca technique to find its approximate nearest neighbors based on just your local dataset. If any points in the test set score less than the cutoff, you simply throw those points out, because they cannot be outliers. But for the other points, the ones greater than the threshold, you still cannot be sure whether they are truly outliers, because somebody else might have their nearest neighbors. That's the issue here. So what we do is prune as much as we can against the local dataset and then pass the rest to the next guy and let him prune it further.
So in the push mode, that's what happens: you prune the points with kNN distance less than c_i, and the surviving points are sent to the next machine for evaluation.
When the next machine gets the set of points sent by the first machine, what it simply does is update the k nearest neighbors of those test points. By doing that you can throw some more points out, because some points can drop below the threshold.
But for some points you still are not sure, so you have to pass them to the next machine. And this keeps happening around the ring. So if you started with, say, this size of test block, as you keep going around the ring the size of the test block keeps shrinking. The faster you can shrink it, the better, because you're doing local processing and pruning.
And the idea is that if some points survive all the machines in the ring, then those points are potential outliers and we have to do further processing, and that's done by a master machine which sits on top of everything.
So here's what the master machine does. It updates the list of current outliers: it gets from everyone the points which survived the entire ring. It updates the cutoff based on this list: the cutoff is set to the lowest score in the list of t, and this is broadcast to all machines for efficient pruning. That's when the cutoff is updated; in the next round your cutoff is evidently higher, and you can prune more points. And the correctness is very simple: because the cutoff increases monotonically, you know you're not missing any outliers. That's the basic proof.
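[A simplified sketch of one trip of a test block around the ring, assuming for illustration that the test points are not duplicated inside the partitions:]

    import numpy as np

    def ring_pass(test_points, partitions, k, cutoff):
        """Each machine's partition refines the k-NN distance estimates of
        the surviving test points; a point whose estimate drops to the cutoff
        or below cannot be a top outlier and is pruned. Survivors of every
        machine go to the master for final processing."""
        knn = {i: np.full(k, np.inf) for i in range(len(test_points))}
        alive = set(knn)
        for part in partitions:                 # the block travels around the ring
            for i in list(alive):
                d = np.linalg.norm(part - test_points[i], axis=1)
                knn[i] = np.sort(np.concatenate([knn[i], d]))[:k]
                if knn[i][-1] <= cutoff:        # k-th NN already within the cutoff
                    alive.discard(i)
        return alive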
In the experimental results, we tried several datasets and I'm showing only four of them. The first one has 580,000 examples and ten features; the Landsat set has 275,000. These are from the UCI repository. MODIS is an Earth science dataset; it's very big, and we took one year's worth of data, which is 50 million points. The last one is the aviation dataset we are working on. Again, we took one year's worth of data, which is 100 million points; the full data is extremely large, so we took just a very small subset.
If you look at the experimental results, these are four different datasets; on the X axis I have the data size and on the Y axis the running time.
The black curve -- I'm not sure if you can see it -- is the time required to build the index; that's the pre-processing for our algorithm. The blue curve is the actual running time of our algorithm, and that includes the black time, so it's the total time. And the red curve is the time required for Orca or any other disk-based algorithm to run.
In most cases we see a speedup of 5 up to 15 or 17 times. That's the typical speedup we get just by using our technique, without any parallel processing.
Now, if you apply the parallel processing, it speeds up anywhere between 1.2 to, I think, 3.5 times on various numbers of machines, like three to seven. But to me it seems that this is really under what we were expecting, and one of the reasons is that our implementation was probably not done in the most efficient manner.
We used MPI, which probably had a bigger communication overhead. The other problem is that in the centralized scheme there is a stopping condition by which the algorithm just stops and doesn't do anything more. What was happening here is that for the first few iterations, when both algorithms performed equally, the distributed algorithm has a huge advantage, because you are distributing the workload.
But for the later iterations the centralized algorithm was beating it, because all it needed to do was check that in-memory condition, while this one had the overhead of passing messages with the termination criteria and so on.
So that probably is one of the reasons as to why the speedup was not that great.
So in conclusion, for this work we developed a sequential and a parallel outlier detection method which significantly beat the other existing methods. We have tried several other methods, some using clustering as a preprocessing step, some using PCA as preprocessing, but ours is much faster and very simple to implement.
The algorithms are provably correct. And the way we thought about it was to always go for a disk-based implementation, because of the sheer size of the data that we have.
If you think about the clustering-based approaches, they were all in-memory, because you do the clustering by loading the data into memory, which we avoided. And the third point is that both algorithms can terminate even before looking at all the test points, and that's a huge advantage considering the size of the datasets we plan to apply this to.
So this concludes one part of my talk. So the next part of my talk is about distributed model fitting and fidelity checking in large networks.
So what I'm talking about here is, let's say that if --
>>: Can I --
>> Kanishka Bhaduri: Sure.
>>: How sensitive is it to the choice of reference point?
>> Kanishka Bhaduri: It's hugely sensitive. So if you choose a reference point -- let's go back. So if you choose a reference point from one of the red ones, you would be back to Orca, essentially.
>>: Yeah.
>> Kanishka Bhaduri: So it's not worse than Orca, but, of course, it's not as good
as I'm showing here. So the advantage -- one of the ways you can do is choose
multiple reference points.
>>: Okay.
>> Kanishka Bhaduri: And then take an intersection of the candidate sets you find for each of them; that really helps. We did different experiments on different datasets, and that really helps.
>>: So the results here are --
>> Kanishka Bhaduri: Just one.
>>: [inaudible].
>> Kanishka Bhaduri: One and randomly picked.
>>: Because you could also kind of, as you collect the data, have a running
[inaudible].
>> Kanishka Bhaduri: Absolutely.
>>: And use that and that would be the best.
>> Kanishka Bhaduri: Correct. Or maybe take the median or mean of the data. So I think those are very good open questions that we are trying to think about and solve. Even, I think, maybe take a sample, do a PCA, and then the projection would be even better.
So we are trying to do that. But some of the datasets are quite big in the number
of features. So doing PCA is a challenge. We're trying to see if there is a sweet
spot which is interesting.
Okay. The second part of the talk is about large scale model fitting and
monitoring that model in a dynamic environment. So imagine that you have a
huge number of nodes in the network and the data in those nodes is constantly
changing.
You have prior knowledge of a model; let it be just a regression model, which is what I'm going to talk about here. And how do you asynchronously track whether the model has changed or not, without always gathering all the data? That's the focus of this part. This again was joint work with my colleagues, and we presented it at the SDM conference this year.
So this is the same kind of roadmap, introduction, problem definition and what
we're trying to solve here. Then the algorithm. And some experimental results
and the conclusion.
So this is the problem setup. We are given an input dataset which consists of two parts: the first is the set of inputs X, and the second is the output, which is typically the target. There's an underlying linear function which maps your input to the output; it's a function from R^d to R. For linear models, you assume that f(x) and y are well behaved and follow a linear relationship, and the goal of regression is simply to estimate the weights a_0, a_1, and so on. It's well known how to do that: you can always take A = (X^T X)^{-1} X^T Y.
Here I'm not going to talk about the scenario where X^T X is huge, so that you can't invert it. Let's assume for the time being that you can invert it and that it gives you a good model.
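[In code, the fit is the standard least-squares solve; a solver such as lstsq is generally preferred to forming the explicit inverse, for numerical stability:]

    import numpy as np

    def fit_linear_model(X, y):
        """Ordinary least squares: A = (X^T X)^{-1} X^T y."""
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a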
So you can always find A using this formula. Let's say we have found A; now we want to see how good A is. There are several goodness-of-fit measures, one of them being what is known as the coefficient of determination, typically written R squared. This is the definition:

R^2 = 1 - [sum over j of (y_j - yhat_j)^2] / [sum over j of (y_j - ybar)^2]

The numerator is the sum squared error of the true values versus the predicted ones, y_j minus yhat_j, squared, and the denominator is simply the variance of the actual values, y_j minus the mean of the y's, squared. So you can see that for perfect models, models which are very well behaved, y_j and yhat_j are equal, and so R squared equals 1. On the other hand, if the model is very bad, your R squared can actually be 0, and that happens when you're always spitting out the average answer for whatever instance you are seeing.
So if your yhat is (1/m) times the sum of the y_j's, you will always get R squared to be 0. But note that R squared can actually range from minus infinity to plus 1. You can always get minus infinity, because you can choose an arbitrarily bad model; there is no lower bound on the R squared value.
But for most practical cases we will assume it's between 0 and 1, because you can always say: I'll approximate the model with the average model.
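[The R squared computation, as a minimal sketch:]

    import numpy as np

    def r_squared(y, y_hat):
        """1 - SSE / variance: 1 for a perfect model, 0 for the constant
        mean predictor, unboundedly negative for arbitrarily bad models."""
        sse = np.sum((y - y_hat) ** 2)
        sst = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - sse / sst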
So what's new? All this is known and very well studied in the statistics community. What we're trying to do is a little different. We have a large number of nodes in a network; let it be a datacenter connected in a peer-to-peer or any other setup, it can even be any other routing topology. You have a dataset at each node, X_1 through X_n, and each dataset is also time-varying. X is the union of all the data; we're assuming that the datasets are disjoint, with no overlap between them. And the goal for us is to compute regression models on X, not just on X_1 through X_n separately.
So where can it be used? It can be used in real-time fault detection or health assessment in large-scale enterprise servers on the cloud, and we have some data where we are trying to apply this.
The second one is smart grid applications; that's something we have experimented with here, using smart grid data from one of the open datasets. The third one is that you can actually track evolving topics in a text domain using sparse regression models. I'm not going to talk about sparse models here, but it's very easy to see how you can apply the exact same technique to sparse learning.
So here is the formal problem statement. We are given a dataset X_1 through X_n, one at each node. We're also given a threshold epsilon, between 0 and 1, and the threshold is the same at all nodes. We also assume there is a communication infrastructure between any nodes P_i and P_j, and that it is ordered, in the sense that the first message sent is the first delivered, because otherwise it alters your updating of the system. And we are also given a set of weights precomputed at an earlier timestamp. This is essential because I'm essentially talking about a monitoring scenario, where you have built something and you are trying to track whether it's still correct or not.
And the goal for us is to maintain a regression model A at any given time such that R squared, computed on the global data, not just the local data of a node, is greater than the epsilon you have selected.
There are three kinds of solutions to this problem. The first is what are known as periodic algorithms, the second is incremental algorithms, and the third is reactive algorithms. In a periodic algorithm, you periodically sample data from the nodes of the network, rebuild the model A, and check whether the last model and the new model agree well enough, using some statistical test of model fit. In an incremental algorithm, you have an old model, and whenever there is a change you update the old model into a new model using some clever update rule.
The third, which is what we are suggesting, is that you track the model with respect to the data, and whenever there is an alarm that the model doesn't fit, you resample data from the network, rebuild the model, and then keep checking again. So it's a two-phase, reactive, closed-loop kind of solution that we are proposing.
Of course, the periodic algorithms waste a lot of resources. The incremental algorithms, as we know, are extremely efficient, but they're very difficult to develop and have to be hand-tailored for each problem you're trying to solve.
The reactive algorithms are very general, so there are solutions for almost all of the data mining problems that we know of. They're very simple to develop. And they're very, very efficient in terms of communication complexity.
So here are the details of the work. Let's assume for the time being that each node has a vector v_i. I won't present the exact definition of the vector here, because it's complicated to write; it's defined in terms of the local dataset, the true values y_i and the predicted values yhat_i.
Based on that v_i we want to do the model checking. We also define a parabola g, and the way it's defined depends on the user-defined epsilon given to the system.
So here are the details. Remember the R squared formula that I presented earlier: it's basically one minus the sum squared error divided by the variance. What I've done here is, instead of writing it as a single sum, I've written it as two sums: the inner sum runs over one node's data, summation over j equals 1 to m_i, where m_i is the number of data points at node i, and then you sum over all nodes.
So instead of having one sum, we have two sums now, and the denominator is treated the same way. The R squared that you got earlier is exactly the same as the R squared you get now.
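[A small sketch of this decomposition, showing that the per-node double sum reproduces the single-sum R squared; array names are illustrative:]

    import numpy as np

    def r_squared_from_nodes(parts):
        """parts: one (y_i, y_hat_i) pair of arrays per node. The inner sums
        run over each node's m_i points; the outer sums run over the nodes."""
        m = sum(len(y) for y, _ in parts)
        mean = sum(np.sum(y) for y, _ in parts) / m          # global mean of y
        sse = sum(np.sum((y - yh) ** 2) for y, yh in parts)  # numerator
        sst = sum(np.sum((y - mean) ** 2) for y, _ in parts) # denominator
        return 1.0 - sse / sst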
It turns out that checking whether R squared is greater than epsilon is the same thing, in a geometric setting, as checking whether a convex combination of these vectors v_i, the local vectors at the nodes, lies inside the parabola.
What it means is: if you gather all of these v_i's, take a convex combination weighted by the number of points each peer holds, sum them up to get this global vector v_g, and check whether it's inside the parabola, that is the same thing as checking whether R squared is greater than epsilon.
Still, it's inefficient to compute v_g at each time step. You could, at every time step, compute these v_i's, gather them, do the sum, and check whether it's inside the parabola, but that's no different from gathering the whole data, essentially.
So here is the geometric interpretation. This is the contour of the parabola. If v_i is inside the parabola for all of the peers, then v_g will be inside too, because it's just a convex combination, and a convex combination always lies inside a convex region. But this is not true for points outside the parabola, because the outside of the parabola is by definition not convex. So what you can do is draw half-spaces.
You can cover the region outside the parabola using some of these half-spaces, which are shown here, and then apply the exact same argument: if all of the v_i's lie in a particular half-space, then v_g will also lie in that same half-space. So that's the geometric interpretation of the problem.
Now, the question is how we check this global condition more efficiently, without gathering all of the peers' or machines' data at every single time step. For that we need some local statistics vectors. These are just mathematical definitions, but the intuition is: you have a knowledge vector, which is essentially your local data v_i plus all of the information that you have gotten from your neighbors. It's only the neighbors that we're considering.
So S_j,i is all the information that node i has received from node j. The agreement is whatever the two have shared. The withheld is whatever has not been shared, so it's the difference. And the message is typically whatever you have minus whatever you have received, the difference being there to prevent double counting of the same information. Notice this is a cyclic definition, because you have S_j,i here and then everything is defined with respect to everything else. But when the algorithm starts, your S_j,i is 0, so your knowledge is set to v_i. The agreement is 0, because you have not sent or received anything. Your withheld is essentially your knowledge K_i, because you have not sent anything, so everything is withheld. And similarly your message is equal to the knowledge, because that's what you're going to send the first time.
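[These definitions can be sketched as follows; the names follow the talk's terms, but the paper's vectors also carry weights, so treat this as illustrative only:]

    import numpy as np

    class NodeState:
        """Local statistics of one node in the monitoring algorithm."""
        def __init__(self, v_i, neighbors):
            self.v = v_i                       # local statistics vector v_i
            self.sent = {j: np.zeros_like(v_i) for j in neighbors}
            self.recv = {j: np.zeros_like(v_i) for j in neighbors}

        def knowledge(self):                   # K_i: own data plus all received
            return self.v + sum(self.recv.values())

        def agreement(self, j):                # what i and j have exchanged
            return self.sent[j] + self.recv[j]

        def withheld(self, j):                 # knowledge minus agreement
            return self.knowledge() - self.agreement(j)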
But as the algorithm progresses, you start collecting more and more information in S_j,i, and your knowledge starts to differ from your v_i. And that's when the interesting stuff starts happening.
What we essentially want to show is: if I apply the parabola condition to this K_i, and just by checking that g(K_i) is greater than 0, in other words that this vector lies inside the parabola, I can guarantee that the global vector is also inside the parabola, for all the peers, that's a very interesting point. Because then you don't need to gather all the data; you have essentially gotten rid of this summation here. And that's what's given by this global condition.
What it says is: if all three of these sufficient-statistics vectors are inside a convex region, and that convex region can be the inside of the parabola or any of the half-spaces or any other convex region, then it's guaranteed that v_g, the global vector, is also in that convex region. And it's pretty easy to show this: all you do is pick any two random peers, let them exchange all of their data, and you can show mathematically, just by applying the convexity rule, that the result always remains in the convex region.
So you can now define this global condition and detect it based only on the local
conditions. That's what we are trying to do.
Back to the convex regions: there's nothing different here. The inside of the parabola, g(a) greater than 0, is convex by definition, so there's no problem applying the rule. On the outside you have to draw the tangent lines, because there you don't have convexity.
And that's how we approximate the outside of the parabola so as to apply the theorem. What the local criterion allows you to do is let a node terminate and stop sending messages whenever its local condition holds, no matter whether it has received a message, whether its local data has changed, whether a new node has joined or a node has dropped, or anything else. That's all the node needs to guarantee.
And it will still guarantee eventual correctness. By eventual correctness we mean that when the computation terminates, you'll get the exact same answer as if you had gathered all of the data and computed the R squared.
So that's the beauty of it. It's quite good at pruning messages, as I'll show in the experimental section. And it allows a node to sit idle and not do anything unless there's an event. The events are these scenarios: sending or receiving a message, a change in the local data, or a change in the neighborhood.
Unless one of these happens, you don't need to do anything. So this is the flow chart; there's nothing surprising in there, basically. You take a local dataset and an error threshold epsilon, and you want to monitor whether R squared is greater than or less than epsilon.
So you initialize your v_i, which I have not shown here for simplicity; compute the sufficient statistics, the knowledge and all those things. You define the convex regions, and those convex regions are defined uniformly across all of the machines in the network, and then you keep applying this convexity rule.
So you start, you do the initialization, you apply the convexity rule. If it's satisfied, you don't do anything; otherwise you go back and send messages, and then you wait for events to happen. That's essentially what you are doing.
Up until now I have described how you track whether the model has changed. The next part essentially builds on top of that: once we have detected that there is a change, how do we recompute and update the model? The solution we have is a very simple one; it's based on a convergecast tree.
Every peer or node in the network simply sends its data, or a sample of its data, up the tree, and when all of the information arrives at the root node you rebuild the regression model and then percolate it down the tree.
And for that, what you need are two quantities, X^T X and X^T Y. Note that X^T X is nothing but a summation of the local contributions: each of the local peers can compute X_i^T X_i and X_i^T Y_i locally and send them up the tree, and the next peer can simply sum them over all of its children and pass the sums up.
And that's how you can essentially compute the true A, because it's nothing but a summation of these per-node quantities.
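[A recursive sketch of the convergecast; the tree interface (X, y, children) is hypothetical:]

    import numpy as np

    def convergecast(node):
        """Each node sums its local X^T X and X^T y with its children's
        contributions and passes the pair up; the root can then solve
        (X^T X) a = X^T y for the global weights."""
        xtx = node.X.T @ node.X
        xty = node.X.T @ node.y
        for child in node.children:
            c_xtx, c_xty = convergecast(child)
            xtx += c_xtx
            xty += c_xty
        return xtx, xty

    # At the root: a = np.linalg.solve(*convergecast(root))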
So here is the experimental setup. So we did an experiment -- you had a question? Oh, sorry.
So here's the experimental setup. We did a dynamic experiment where we were changing the data at different levels. This is the timing diagram: on the X axis you have time. Every 20,000 simulator ticks we changed a fixed percentage of the data at the nodes; I think we changed 20 percent of the data at each node. And we assume each node has a fixed-length data buffer, so you retire some old data and you put in some fresh data.
On top of that, every 500,000 clock ticks, which is a bigger time period, we changed the distribution from which the data is sampled. Here, changing the distribution simply means changing the coefficients of your linear model.
So here is the data description and how the peers react to the change. In this plot, on the X axis you have time and on the Y axis you have the actual value of R squared as seen by all of the peers; it's an overlay over all the peers' data. If you look at the box, the average R squared is close to 0.7, and the reason is that we gave each peer the same set of coefficients from which we were also generating the data. Those two sets of coefficients matched, and that's why there's high model fidelity with respect to the R squared value.
On the other hand, for all the odd epochs, we changed the coefficients of the data generator while keeping the model's coefficients the same, which means there's low fidelity of the model.
So that's what is happening.
>>: The peers don't reestimate? So why --
>> Kanishka Bhaduri: This is only for detection purposes.
>>: Okay.
>> Kanishka Bhaduri: So I'm not allowing it to recompute the model; that's the next experiment I'll show. And the red line is our chosen epsilon, which is 0.5, such that the probabilities of true positives and true negatives are exactly the same.
And this is how the peers behave on the accuracy side: the X axis is time and the Y axis is accuracy. Between 0 and 0.5 million ticks, which is this region, you see almost 100 percent agreement; everybody is saying that the model is correct. Then whenever the data changes, that's when they start communicating, and this is reflected here, in the time-versus-messages-per-node plot, and there's a huge drop in the accuracy, because that's when they're unsure about what's happening.
They need to gather some more evidence, and then you again see a jump in the accuracy. That's when they all agree: there everybody's saying that the model doesn't fit, and here they're saying the model fits the data.
If you look at the messages, that's interesting too: whenever there's a change in the distribution you get a sharp jump in the messages, but in these other periods very few messages are going around.
And the data is still dynamic: it's not that I'm changing the data only here and here; I'm changing the data throughout. But because of the model-fitting criterion, the algorithm is doing what it's supposed to do, efficiently.
One of the advantages of using this kind of algorithm is what is known as local behavior. What it means is that the rule I talked about is very data-dependent: whenever the data changes, some action goes on, and you can't really characterize it analytically by writing equations for when the data will change or what results you'll get.
So what we did was scale the network size from 500 nodes to 4,000 nodes, and we plotted the accuracy of the peers and the number of messages. What's happening is that as the number of nodes increases, the accuracy remains almost constant, which is the left plot, and the messages per peer also remain constant, which means that you can keep increasing the number of nodes in the network and the scalability will not be hampered.
And the reason this happens is that these computations are very local to a peer. If you change the data at a peer, it [inaudible] to only a few nodes in the neighborhood before it subsides.
And that's what is being reflected in these experiments. And this is true for other
local algorithms as well. So this is the second part of the experiment where I'm
essentially closing the loop and allowing the peers to recompute the model.
So I'm starting off with a very bad estimate of the model at each peer, a random guess, so you have a very bad value of R squared. The peers see that this is indeed the case, so they recompute the model, and for the later half, when the data is in line with the model, you see a very high value of R squared. And again, I randomly change the data distribution to a different one, there's a drop in the R squared, and this keeps happening.
And that's what is reflected in the monitoring messages, and this is the number of times recomputation happens. So every time the data changes, there are a number of messages being passed to collect the data and rebuild the model. And that's what I'm showing here.
The last experiment was an application of this algorithm to a smart grid operation. What we were trying to do was predict how much carbon dioxide would be released based on two factors: how much wind energy is being generated, and what the demand is. The assumption is that higher CO2 means more fossil fuel is being burned and less renewable energy is being used.
We collected about nine months of data from one of the grid operators in Ireland, January through September, and we treated each month of data as one epoch.
We built a model on the data between January and February, and we wanted to see if the algorithm would detect a known event: there was a drop in the carbon dioxide released between May and June. So the idea was, can the algorithm detect that change? We built the model from this period, then piped the data in, changing the data for each month, and the algorithm did actually pick this up and say that the model doesn't fit the data. This again is in the open-loop mode, where I'm not allowing the peers to rebuild the model; it's just being used as a detector. And if you look at the messages, that's also reflected there.
So in conclusion: this is, as far as we know, the first work on monitoring R squared in really large distributed systems which are asynchronous in nature. The algorithm is, of course, provably correct; it's highly efficient and converges to the correct result. And R squared is scale-free, in the sense that it lies between 0 and 1, so you can just pick a number and that's it, unlike many other works on thresholding, where you have to know what epsilon is and that's very dataset-dependent; even some of my earlier work was like that. Lastly, I want to acknowledge the NASA system [inaudible] technology project; this is under the Aviation Safety Program, of which I'm a part. And for many of the papers and some of the open source code, this is where you can download them. That's pretty much what I had.
>>: I have a question. It seems like there are kind of two fundamental modes: one where the distribution drifts naturally and you want to adapt, and another where you have a critical event happen, things go out of whack, and you get the alarm. There you don't want to screw up the model; you want the model to remain.
>> Kanishka Bhaduri: Exactly.
>>: Do you think this is something you'd have to pre-decide, or can you have some sort of natural mix of the two, where the model would know whether to adjust, just look at the size of the alarm, and then go back to the previous learned state?
>> Kanishka Bhaduri: So the way the algorithm works, you can basically set at run time the type of detection you want to do; that's not something that you have to decide before the algorithm starts. So the algorithm could still decide: okay, this is the kind of alarm for which I'll rebuild the model, and this is the kind of alarm for which I won't. I think that kind of thing would be very important.
>>: It would be like a second layer of --
>> Kanishka Bhaduri: Exactly. Right. And it would depend on some hypothesis tests, on what the significance of the alarm is, essentially. This is a very simple model, where I'm just saying: okay, it's out, rebuild the model, yeah.
>> Misha Bilenko: Questions? Thank you very much.
>> Kanishka Bhaduri: Thank you.
[applause]