>> Peter Bodik: Hello, everybody. It's my pleasure to introduce Michal Valko
from the University of Pittsburgh. He's been working on semi-supervised
learning and conditional anomaly detection, and he's graduating this summer,
and he's starting as a post-doc at [inaudible] in France. He's here the whole day,
so if you have any more questions, feel free to stop by and talk to him.
>> Michal Valko: Thank you. I'm pleased to talk about my research on adaptive
graph-based learning, which is my thesis topic.
So in my research I've been focusing on learning with minimal feedback. And
why is that? If you want people to enjoy machine learning, we need to give them
systems that they don't need to spend much time training before they can
actually use them.
The other reason -- the other desired feature of our systems -- is that they
can adapt to ever-changing environments. Take, for example, the problem of
online face recognition. We don't train it much, so we only want to give it one
labeled image, one labeled face, and we want to have a system that is able to
recognize the faces on the fly.
This problem can become challenging, especially when the lighting conditions
change from our labeled examples to the new environment. For example, the
background can change, and that can make the problem even harder. And what if
we now change our hair, grow a beard, or start wearing glasses?
And the problem becomes even more challenging in the presence of outliers,
which are people who are not in our labeled set of images.
So in this talk I will use graph-based learning as a basic approach, which can
model the complex similarities between the examples in our data.
So in this talk I will first introduce graph-based learning. After that I will
spend some time talking about semi-supervised learning. After that I will
present our contribution in the field of online semi-supervised learning as
well as our theoretical results, and after that I will showcase the algorithm
on the problem of online face recognition, which you could see on the first
slide. In the remainder of the talk I extend graph-based learning to the
problem of conditional anomaly detection, which we applied to detecting errors
in medical decisions such as the prescription of heparin.
So, first, what is graph-based learning? Graph-based learning is widely used in
machine learning for problems such as clustering or semi-supervised learning.
The basic idea is that every data point we have in our data set will be
represented in the graph as a node of the graph. We will use face recognition as
a running example, and here we have every node -- every face assigned to a
node in the graph.
In a graph we not only need nodes, but we also need edges, and the edges in our
graph will represent similarities between our examples. In this case they could
be the similarities between the faces, computed with some similarity metric
that we need to define. Here these will be just pixel similarities.
Such a graph can help us explore the structure in the data, and we can use it for
inference.
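To make this concrete, here is a minimal sketch of what building such a pixel-similarity graph might look like. The Gaussian (RBF) kernel and the bandwidth sigma are illustrative assumptions on my part, not necessarily the exact metric used in the talk.

```python
import numpy as np

def build_similarity_graph(faces, sigma=1.0):
    """Build a weighted similarity graph over face images (a sketch).

    faces : array of shape (n, d), each row a flattened (pixel) face image.
    Returns an (n, n) weight matrix W where W[i, j] is the similarity between
    face i and face j, computed with a Gaussian kernel on pixel distances.
    """
    sq_norms = np.sum(faces ** 2, axis=1)
    # Squared Euclidean distances between all pairs of flattened images.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * faces @ faces.T
    d2 = np.maximum(d2, 0.0)                 # guard against tiny negative values
    W = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian (RBF) similarity
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return W
```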
Semi-supervised learning is a machine learning paradigm for learning from both
labeled and unlabeled examples. In this case we will have only two labeled
examples, Nick and Sumeet, and our goal is to figure out the label or identity of
all the other unlabeled faces which we will call unlabeled data.
Semi-supervised learning can take this unlabeled data into account and come up
with a classifier that reasons with it.
So in the following we will assign the label 1 to the node of one person and
the label minus 1 to the node of the other person. Our goal will be to infer
the labels of the remaining faces, and we will do it with some kind of label
propagation: the label 1 will spread around one labeled example and minus 1
around the other labeled example. At the end we'll come up with some soft
label, which will be a real number from minus 1 to 1, and when we get a number
that's positive we'll say, well, that was Nick, and if we get a number that's
negative we'll say that was Sumeet.
One approach we can take uses the intuition of a random walk. Say we want to
figure out the label of this face; we can calculate probabilities of a random
walk on the graph. The random walk starts in some node and jumps between
vertices along the edges, with probability proportional to the edge weights.
The label we get at the end is the difference between the probability that the
walk ends up in 1 and the probability that it ends up in minus 1.
These random walks do not need to be simulated; the solution can be expressed
as an optimization problem where we minimize the weighted differences between
the soft labels of neighboring vertices, with weights given by the
similarities, such that we get the training examples, the labeled examples,
correct.
This objective function can be rewritten in terms of the graph Laplacian, and
this specific one is called the harmonic function solution. Its property is
that the resulting soft labels, these numbers between minus 1 and 1, are
smooth: for every unlabeled vertex the soft label is a weighted average of its
neighbors.
It can be computed as a closed-form solution, and the solution can be
interpreted as the random walk on the data similarity graph that we just talked
about.
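As a sketch of that closed form: if we split the Laplacian into blocks for labeled and unlabeled vertices, the soft labels on the unlabeled part satisfy f_u = -L_uu^{-1} L_ul y_l. The code below is a minimal illustration of the harmonic function solution, assuming labels in {-1, +1}; the variable names are mine.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, y_labeled):
    """Harmonic function solution on a similarity graph (a sketch).

    W           : (n, n) symmetric weight matrix.
    labeled_idx : indices of labeled vertices.
    y_labeled   : their labels, +1 or -1.
    Returns soft labels in [-1, 1] for all vertices; each unlabeled soft label
    is a weighted average of its neighbors' soft labels.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                     # graph Laplacian
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    # Closed-form solution: f_u = -L_uu^{-1} L_ul y_l
    f[unlabeled_idx] = np.linalg.solve(L_uu, -L_ul @ np.asarray(y_labeled, float))
    return f
```

Predicting then amounts to taking the sign of the soft label, exactly as described above for Nick and Sumeet.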
The advantage of such an approach is that it can track non-linear patterns in
the data. The optimization function is convex, and therefore the solution we
find is globally optimal. The disadvantage is that, as with other methods, it's
sensitive to the similarity structure of the data.
We have some rules of thumb which can help us define the metric, but it's still
in some sense an art, and in some problems it needs to be tuned by calibration.
So traditionally semi-supervised learning is an offline method: all the
examples are given in advance, and we make the inference then. Say that now the
data arrives in a stream, such as in our video example, and we need to make
predictions on the fly.
The most straightforward approach is shown on this slide. We have some
similarity graph, and when we get a new example whose label we want to predict,
we add the example to the graph, recompute the graph Laplacian, infer the
labels, and then predict based on what we inferred for the new example. We then
output the prediction and the updated graph with the new node.
So what's the problem with this algorithm? The problem is that as we get more
examples, our graph grows, and the storage and inference become infeasible for
millions or maybe even tens of thousands of vertices.
So one of the solutions is to reduce the number of nodes of the graph and
thereby make the inference feasible.
My solution combines two ideas: online clustering and semi-supervised
inference. In the online clustering, we incrementally cover the space with a
set of R-balls, which are balls of radius R. Moreover, we also remember how
many nodes each of the balls covers.
These algorithms come with some guarantees, which we extend to bound the
approximation error of the graph Laplacian. But let's now see how this
algorithm works.
Now, let's say that this is the representation that we have. These are our
labeled examples, these are our unlabeled examples, and these are the R-balls,
each centered at some face, and these little numbers represent how many other
faces, seen in the past and discarded, each representative covers or
represents.
So say that we get a new example, and it is within distance R of some
representative vertex, in this case this one. Then we may discard this vertex
and update the count from four to five.
Well, at some point it will happen that the new example is far enough away that
it is not within R of any of the previously assigned centers. In that case we
double R and reassign the vertices to new centers such that these guarantees
hold: no two representative vertices are closer than R, and every vertex that
we've seen so far, even the new one, is covered by at least one representative
vertex.
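Here is a rough sketch of that cover-and-double step, in the spirit of the doubling-style online clustering described above (the Charikar algorithm is mentioned later in the Q&A). The class name, the Euclidean distance, and the merge details are illustrative assumptions, not the exact algorithm from the thesis.

```python
import numpy as np

class OnlineCover:
    """Incrementally cover a stream with R-balls (a doubling-style sketch).

    Keeps at most k representative vertices ("centers"), each with a count of
    how many discarded examples it represents. When a new example is not
    covered and the budget is full, R is doubled and centers are greedily
    merged so that no two kept centers are closer than R.
    """
    def __init__(self, k, R=1.0):
        self.k, self.R = k, R
        self.centers, self.counts = [], []

    def add(self, x):
        if self.centers:
            d = [np.linalg.norm(x - c) for c in self.centers]
            j = int(np.argmin(d))
            if d[j] <= self.R:                # covered: absorb into center j
                self.counts[j] += 1
                return
        if len(self.centers) < self.k:        # room left: x becomes a center
            self.centers.append(x)
            self.counts.append(1)
            return
        # Budget exceeded: double R and greedily merge centers closer than R.
        # (In the full algorithm this step repeats until the budget is met.)
        self.R *= 2.0
        new_centers, new_counts = [], []
        for c, m in zip(self.centers + [x], self.counts + [1]):
            for i, nc in enumerate(new_centers):
                if np.linalg.norm(c - nc) <= self.R:
                    new_counts[i] += m        # fold this center into a kept one
                    break
            else:
                new_centers.append(c)
                new_counts.append(m)
        self.centers, self.counts = new_centers, new_counts
```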
So this is how the algorithm looks after the change we made. Now the inputs are
not only the example and the similarity graph; we also have the K
representative vertices, so our graph will only be up to K large, and we also
have the counts of how many vertices each node represents.
So now the algorithm changes as follows. At every step we add the example to
the graph, but if it exceeds K, the number of vertices we allow ourselves to
remember, we quantize it. So we do this kind of compression, the online
clustering, and update the vertex multiplicities, that is, how many previously
seen faces or examples every node represents.
And then we compute the Laplacian of that compact representation, which is
essentially the graph where every centroid is counted as many times as the
count we remember. This is our approximation of the full graph that we don't
want to represent, because we don't want to use that much memory or computation
time.
And after that we again make an inference and a prediction. Computing the
metric still has the same complexity, and inference is still roughly cubic, but
now it's in a constant number of vertices. So the cost of each inference step
is constant, and that's something we can use for prediction over very long
streams.
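A minimal sketch of inference on the compact graph is below. "Counting each centroid multiplicity-many times" is approximated here by scaling each pairwise weight by the product of the two multiplicities; the exact scaling used in the actual algorithm may differ, so treat this as an assumption.

```python
import numpy as np

def quantized_inference(centers, counts, labeled_idx, y_labeled, sigma=1.0):
    """Harmonic inference on the compact (quantized) graph (a sketch).

    centers : (k, d) representative vertices, counts : their multiplicities.
    The approximate Laplacian is built from W_q = M W M with M = diag(counts),
    one possible reading of "every centroid multiplied by its count".
    """
    C = np.asarray(centers, float)
    m = np.asarray(counts, float)
    sq = np.sum(C ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * C @ C.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Wq = np.outer(m, m) * W                  # each edge counted m_i * m_j times
    Lq = np.diag(Wq.sum(axis=1)) - Wq        # Laplacian of the compact graph
    k = len(counts)
    unlabeled = np.setdiff1d(np.arange(k), labeled_idx)
    f = np.zeros(k)
    f[labeled_idx] = y_labeled
    f[unlabeled] = np.linalg.solve(
        Lq[np.ix_(unlabeled, unlabeled)],
        -Lq[np.ix_(unlabeled, labeled_idx)] @ np.asarray(y_labeled, float))
    return f
```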
>>: I have a question.
>> Michal Valko: Yes.
>>: So for the [inaudible]
>> Michal Valko: So we do it -- so we always have some similarity graph, and
we add new vertex, we need to calculate the similarity edges to the previously
seen.
>>: Okay [inaudible]
>> Michal Valko: Yeah, that's what's here. It's dynamic. We need to create --
extend -- the graph, add the node, and then maybe at the end discard it because
we don't want to keep it in the end. But at every point we actually need to add
the node to the graph because we want to make a prediction for every new node.
So for every new node we need to calculate the similarity with every example
that we have, but that's also only up to K; it's not growing with time.
Are there any other questions?
Okay. So real world problems involve outliers, and if we set up this algorithm as I
showed, it will work for a while, but it will start to break down if we have new
people.
Why is that? It is because this online clustering approach that we use here
optimizes the worst case. So when we have a lot of outliers, those want to be
covered, and as such, use up the precious space that we want to only use for our
labeled examples because we don't want to make a prediction when we see the
outlier at all.
So how do we deal with that problem? We need to control the extrapolation to
those unlabeled examples so that we don't extrapolate to outliers. And this is
how we can do it. We create a special node which we'll call the sink, and we'll
assign the label 0 to it. So besides the labels 1 and minus 1, we'll have a
special label 0. We'll connect every vertex in the graph to the sink with some
weight gamma g, and this will be our regularizer for the graph.
And now, if you remember, we can think of all this as a random walk. If we
randomly jump, there will always be some chance that the walk jumps to the
sink, to 0. So what will happen with the soft labels if we have this zero?
Essentially all of them will become closer to 0, but something different will
happen for outliers, because for them 0 will be the closest label. And in the
case where we use some exponentially decaying metric for our similarity graph,
the labels of these outliers will be driven toward 0 fast.
What we actually do is that when we have some inferred label close to 0, we
decide not to predict for that node and discard it from the graph. We make the
choice that, okay, this is an outlier, we don't want to predict it, and we
don't even want to keep a representative for it, and in such a case we control
the influence of outliers. So these gamma g's are a parameter in our algorithm
which essentially says how much we trust our unlabeled data. Yeah, that's what
I said: if we cannot infer the example with sufficiently high confidence, we'll
just discard it.
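In matrix form, connecting every vertex to a label-0 sink with weight gamma g simply adds gamma g to the diagonal of the unlabeled block of the Laplacian. A minimal sketch, with an assumed confidence threshold tau for discarding low-confidence nodes:

```python
import numpy as np

def soft_harmonic_with_sink(W, labeled_idx, y_labeled, gamma_g=1e-2, tau=0.1):
    """Harmonic solution with a sink node of label 0 (a sketch).

    Every vertex is connected to the sink with weight gamma_g, which acts as a
    regularizer pulling all soft labels toward 0; tau is an assumed threshold
    below which a node is treated as an outlier and not predicted.
    Returns soft labels f and a boolean mask of vertices treated as outliers.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    # The sink adds gamma_g to every unlabeled degree and contributes label 0,
    # so the unlabeled block becomes (L_uu + gamma_g * I).
    A = L[np.ix_(unlabeled, unlabeled)] + gamma_g * np.eye(len(unlabeled))
    b = -L[np.ix_(unlabeled, labeled_idx)] @ np.asarray(y_labeled, float)
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    f[unlabeled] = np.linalg.solve(A, b)
    is_outlier = np.abs(f) < tau      # too close to the sink's label 0: discard
    is_outlier[labeled_idx] = False
    return f, is_outlier
```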
So now I want to state the theoretical result that we proved for our algorithm.
Essentially what we want to show is that as our online algorithm keeps running,
as we get more and more faces or unlabeled examples, the error of our solution
doesn't differ much from the training error, the error on the labeled examples.
So the idea of the proof is that we can use this regularization coefficient and
set it such that as we get more examples, the extra error terms vanish.
I'm not going to go into much detail, but I'll say that we did it by splitting or
decomposing the error into three parts and bounding each of them separately.
So one is the offline learning error. Even if we did not do any quantization,
any approximation, and we could store all of our vertices, we would incur some
error just by label propagation. That's what we call the offline learning
error.
The other is the online learning error. Even without any approximation in the
algorithm, we don't see the future. At time step T we can only use the first T
examples to make our predictions. So we incur the error of not seeing the
future, which we call the online learning error.
And the third error is the quantization error. Because we don't have enough
memory or time, we don't use the full Laplacian that we could, in principle,
compute; we only use the approximate version, the one that we got with the
quantization.
And in that case we bound the error of our Laplacian by extending the
guarantees that come from this online clustering algorithm.
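Schematically, the decomposition described above can be written as follows; the notation is illustrative, not the paper's:

```latex
\mathrm{err}_T(\text{online, quantized})
  \;\le\;
  \underbrace{\mathrm{err}(\text{offline})}_{\text{label propagation itself}}
  \;+\;
  \underbrace{\mathrm{err}_T(\text{online})}_{\text{not seeing the future}}
  \;+\;
  \underbrace{\mathrm{err}_T(\text{quantization})}_{\text{approximate Laplacian}}
```

with the last two terms vanishing as T grows for a suitable setting of the regularization coefficient gamma g.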
So let me now present a couple of experiments that we did with this, and we'll
show it on the problem of online face recognition.
In this case we actually only wanted to test the robustness to outliers, so we
will recognize only one person. We get four labeled images of the person in a
cubicle and then many unlabeled images: a video stream of the person at
different locations. So it will not only be the cubicle, it will be an office,
a presentation room, a cafeteria, and so on.
Then, to test the robustness of our method, we extend the stream with some
random faces that we do not want to recognize and inject them into the video
stream.
In the first plot we show the performance of different methods. We measure
recall and precision in this experiment. Our method is shown in red, and the
nearest neighbor approach, which is the simplest thing you could do, is shown
in blue: in this case there is just one labeled person, but in general you just
look in your data set, figure out which labeled face is the closest to the one
you see now, and predict that one.
The gray one is a state-of-the-art online semi-supervised learning method
called online semi-supervised boosting, which has a set of weak learners and
uses boosting to update them on the fly. This is the result of that method, and
here we actually tried to help online semi-supervised boosting and allowed it
to use future data to construct the weak learners. Even in that case we were
able to outperform it.
In this other experiment we compare it to the commercial solution, actually, by a
company based in Pittsburgh, and as opposed to our method, it uses really
hand-crafted features designed for face recognition.
In our algorithm we use a very simple metric: we actually don't even use color
and only use pixel-by-pixel similarities. In that way the computation of our
metric is really fast, so we can make a [inaudible] much faster; we only use,
empirically, 20 percent of the computational time of this commercial solution,
and we're still able to outperform it.
In the other example we are trying to test multi-class prediction. We gathered
eight people, and we asked them to walk in front of a camera and make funny
faces. The first time they appeared on the camera we labeled the first four of
those faces, and later, when we asked them to come back again, we measured how
accurate our predictions were.
Because this other method doesn't work for multi-class prediction, here we only
compare with the nearest neighbor approach. What we can see is that as a person
interacts more with the camera, the algorithm learns more about them, because
it can represent more of the space, more of the manifold of what their face can
be.
So before I move on to the second part, are there any questions about this
part?
Yes? Please.
>>: Since you have a video stream, you can use kind of the tracking feature to
generate more labels. So if I recognize someone in the image that says that this
is Mike, if an image [inaudible] is still Mike, so I can treat it as fully supervised as
opposed to semi-supervised in that sense. Have you tried -- do you have any
indication on what would work better?
>> Michal Valko: Yeah, we tried it. So tracking usually helps for all of the
methods we can use: it helps recall but can hurt accuracy. It actually depends
on the video. These are just simple videos of people walking, but the other
thing we tried was movies or TV shows, and there you really often have cuts in
the scene, and tracking can be confused by those. Now we have a face here, and
there's a cut, and there's a different face here, and if you just use tracking,
sometimes it just takes the label of the previous face and then you have an
error. So if you get [inaudible] it helps your recall, you can recognize more
faces, but sometimes it hurts prediction. So I guess for different kinds of
streams this might be helpful, yeah.
>>: I have another question. Your clustering method seems to me [inaudible]
quantization? Are there differences, or did you try to convert the two or
something like that?
>> Michal Valko: So our method is based on the Charikar algorithm, which is
just online clustering. There are many quantization algorithms that you could
take. The problem is that not all of them can actually be used here. For
example, Nystrom [phonetic] is very often used. So why can't you use them?
Because a lot of these methods require that your data is IID, and in these
streams your data is not IID: this frame and the next frame are almost the
same, so you cannot think of it as a random sample you're getting, and you
cannot just randomly decide to discard it. So that's why we used this one.
The other reason is that it comes with a guarantee, so we could actually prove
something about the method, whereas for the other competing method, the online
semi-supervised one, there are no guarantees, and that's why. We make no
distributional [inaudible], whereas these vector quantization methods usually
do, or at least I don't know about any that has no distributional assumption
and could be used online. So that's why.
Other questions?
Okay. So in the second part I will extend the graph [inaudible] to the problem
of conditional anomaly detection. The running example will be detecting errors
in medical decisions such as performing a surgery, sending a patient home,
prescribing a medication, and so on.
So patients' health records already have a lot of information about a patient
encoded in computers, such as demographics, conditions, medications,
procedures, bills, progress notes, X-rays, and so on. In this simple slide the
pluses represent patients that got some medication, say aspirin, and the
minuses those that didn't. The patients are grouped by the similarities in
their symptoms.
Traditional anomaly detection methods look for data points that are far from
the rest, from most of the data, which we just call outliers or anomalies.
In our case these are patients with atypical symptoms. These are not our
concern, because there is no decision there that we could change. What is our
concern is this patient that did not get aspirin while a lot of patients with a
similar condition did. So there is a belief that this decision could be
changed.
Our assumption is that these conditional anomalies correspond to medical
errors, and such errors are very bad, so it's very desirable to discover and
prevent them. Medical errors in the U.S. are the eighth leading cause of death,
so this is a really serious problem. Hospitals already recognize this problem
and design rules that try to discover when some problem is encountered.
For example, I go to school at the University of Pittsburgh, and we have the
big hospital [inaudible] called the university medical center. It has a whole
department of people whose job is just to design rules, you know, if heparin is
high and hemoglobin is low, then do something. It usually takes months to tune
these rules: they test them on some patients, the results come back, and they
try again. So we believe that we can use past data instead of trying to encode
pretty much the whole of medical knowledge into these rules.
Traditional anomaly detection methods usually define anomaly as something that
you can search for algorithmically. For example, a nearest neighbor method
would say, well, anything that's far from the rest of the data is anomalous,
and a density-based method would say anything that lies in a low-density region
is anomalous.
The other option is a classification-based method, which would say, well, let's
classify all the data into anomalous and non-anomalous. And we all know
statistical methods, [inaudible] where everything that deviates far from the
mean is an anomaly.
There are three different ways, or we recognize three different ways, how one
can go about conditional anomaly detection. I will briefly describe all of
them, argue that this last one, the regularized discriminative approach, is a
good way to go, and propose a method that builds on the first part of the talk
and is able to regularize these outliers away.
So these are the specific challenges of conditional anomaly detection. One is
the isolated points, or just traditional outliers, that we were talking about
on one of the earlier slides; these are the points that are far from the rest
of the data. They're not surrounded by any other points, so we should not be
confident about saying what their labels should be.
I should say that in this case we have access to all the labels, so there's nothing
unlabeled here. And our goal is to say how likely is it that the labels should be
different. And so in a medical example, these labels are, okay, this physician did
a surgery or this physician ordered this medication.
The other problematic data points for many of these methods would be fringe
points. These are the points that lie on the outer boundary of the
distribution's support. These are especially problematic for nearest neighbor
approaches such as [inaudible], because usually these methods look at the
neighborhood, and the neighborhood of such a point is very different from the
neighborhood of a typical point. So a lot of these methods would output these
fringe points just because their neighborhoods are different, and this is
something we also don't want.
What we actually want to output are points that are surrounded by points of a
different label, said simplistically.
One of the simplest methods you could take is to try to build on standard
anomaly detection methods, because there's a huge amount of research on that.
So what we can do, we can say, well, we have a new example with a label. You
look at your data set, find all the examples with the same label, and then use
a standard anomaly detection method to see if this point is far away from the
other points with the same label.
The problem is that this method ignores the other classes. It discovers this
one, which is desirable because it's surrounded by the other class, but it will
also think that this one is really anomalous because it's also far away, even
though it's not surrounded by points of the other class. From that we can see
we really need to take the other classes into account, and we cannot just
straightforwardly apply traditional anomaly detection methods.
So the other way is, well, let's take these classes into account and use some
kind of classification-based approach. Ideally, if we had some probabilistic
model of our data, we could say, well, if the probability of the label given
the data is small, then we have a conditional anomaly. So we can try to learn
some probabilistic model; this is something that we did at the very beginning,
and now we compare to it as our baseline. Or we can just design a method that
outputs some kind of score, where the bigger the score, the more anomalous the
label.
One other thing you can do, because you really don't need a probabilistic
method: the only thing you actually need in this problem is to rank all the
examples by how anomalous you think their label is. So one other thing we could
do is just use support vector machines, and once we learn a classifier, we can
define our anomaly score as the distance from the hyperplane on the other side.
So if we have a point on the other side, we'll use that distance as its anomaly
score, because ideally we don't want to just say these points are anomalous and
these points are non-anomalous. We actually would like to have some soft score
that enables us to rank all these anomalies.
So, for example, in our practical application we could say, well, I only want
to look at the top 10, the ones we are most confident about.
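A minimal sketch of that classification-based baseline is below, assuming labels in {-1, +1} and using a linear SVM from scikit-learn; this is an illustration of the idea, not necessarily the exact baseline used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def svm_anomaly_scores(X, y, C=1.0):
    """Classification-based conditional anomaly scores (a sketch).

    X : (n, d) feature matrix, y : observed labels in {-1, +1}.
    Fit a linear SVM on all labeled data, then score each example by how far
    it sits on the *wrong* side of the learned hyperplane. Positive scores
    mean the observed label disagrees with the classifier, and larger scores
    mean stronger disagreement (a soft ranking, not a probability).
    """
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = clf.decision_function(X)        # signed distance to the hyperplane
    return -np.asarray(y, float) * margins    # wrong-side distance as the score
```

Taking the highest-scoring examples then gives, for instance, the "top 10" cases mentioned above.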
The problem of this method is, again, similar: once you have some classifier
here, you can again become overly confident that some points are anomalous. Say
that, you know, your linear classifier goes somewhere here; then you would say,
well, this is my anomaly score, and this one has an even bigger anomaly score.
Again, if you use some kind of nearest neighbors graph-based method, well, you
could say that, well, this is -- these are my closest neighbors so I should have
these labels.
And even more, if the metric is exponentially decreasing, which it oftentimes
is -- in all our examples we just use a Gaussian kernel -- then this
[inaudible] becomes really confident that these labels should be different,
should be reverted.
So this is how we actually apply this soft harmonic solution that takes advantage
of the data manifold to solve this problem and we will be able to regularize it.
So as you recall, our soft label from the first part of the talk was just the
difference between the probability of reaching 1 and the probability of
reaching minus 1. If it was closer to 0, we were not sure, and if it was closer
to 1 in absolute value, we were sure about the label.
We can rewrite this soft label, as we can do with any real number, as a product
of its absolute value and its sign. Semi-supervised learning usually cares
about the sign: positive is one class and negative is the other in the binary
case. So in this case minus 0.9 is equal to 0.9 times minus 1, and we would use
the sign as the label.
But what we can use for anomaly detection, or conditional anomaly detection, is
to interpret that absolute value as a confidence. The closer to 1, the more
confident we are in the label.
And now we can say, well, if we're really confident, so the absolute value is
really close to 1, and the sign is different from what we see in the data --
because, again, you know, we see all the labels -- then we're very confident
that we found a conditional anomaly.
And the regularization that we used before, if you recall the trick with the
sink, can diminish the effect of these outliers by connecting all the nodes in
the graph to the sink with some small weight gamma g.
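Putting the two pieces together, a conditional anomaly score can be read off the sink-regularized soft labels: high confidence combined with a sign that disagrees with the observed label. The sketch below uses a simple leave-one-out reading of that idea, where each example's soft label is inferred from all the other (observed) labels; this is my illustration of the scoring rule, not necessarily the exact computation in the thesis.

```python
import numpy as np

def conditional_anomaly_scores(W, y_observed, gamma_g=1e-2):
    """Soft-harmonic conditional anomaly scores (a sketch).

    W : (n, n) similarity matrix, y_observed : observed labels in {-1, +1}.
    For each example, propagate all *other* labels through the
    sink-regularized graph; split the resulting soft label f into confidence
    |f| and sign(f). The score is high when the confidence is high and the
    sign disagrees with the observed label.
    """
    n = W.shape[0]
    y = np.asarray(y_observed, float)
    L = np.diag(W.sum(axis=1)) - W
    scores = np.zeros(n)
    for i in range(n):
        others = np.setdiff1d(np.arange(n), [i])
        # Sink regularization: gamma_g on the diagonal pulls f toward 0.
        A = L[i, i] + gamma_g
        b = -L[i, others] @ y[others]
        f_i = b / A                           # soft label inferred for example i
        scores[i] = max(0.0, -y[i] * f_i)     # confident disagreement => anomaly
    return scores
```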
So how do we evaluate such a method? It's very difficult to evaluate an anomaly
detection method because we usually don't know what our anomalies are; they
are, by definition, the points that are different. So for this evaluation I'm
showing points for which we know the true anomaly score, because we generated
the data sets from a mixture of multivariate Gaussians, so we can calculate the
true anomaly score. And the true anomaly score is the probability of a
different label. In this case we have access to the probabilistic model, and we
can say what the probability of a different label is.
A lot of research in anomaly detection evaluates by inverting labels, and in
that case one just says, well, every label is inverted with some probability --
you know, IID -- and sometimes this is not the case.
Well, in this case we can calculate, you know, how much is that label different or
supposed to be different. So in this case we generated the data set. We do it
many times so we can calculate the average. And then we swap some of these.
So some of these red pluses become squares and some of these squares
become red pluses.
Then, rather than just checking how our method classifies a point as anomalous
or non-anomalous, we actually ask all the methods that we compare to come up
with a ranking: for all of these thousand points, give me the scores for how
much you think these labels should be different.
And because we know the true probabilistic model, we can say how much this
score, meaning the ordering of this list, agrees with the true list of what the
labels should be. And this is what we use.
So we can see that our method is competitive with other approaches and usually
outperforms them. We generated many data sets that vary in shape, position, and
difficulty.
And, again, our agreement metric was pretty much what you could call AUC: we
have two lists, the list of anomalies ranked by the true anomaly score and the
one the method proposed, and we calculate the number of swaps needed to get
from the method's list to the true list.
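A small sketch of such a pairwise agreement measure is below; it counts concordant pairs, which is equivalent (up to normalization) to counting the swaps between the two rankings. The exact metric used in the experiments may differ in its handling of ties.

```python
import numpy as np

def ranking_agreement(true_scores, method_scores):
    """Pairwise agreement between two anomaly rankings (an AUC-like sketch).

    Counts, over all pairs of examples with distinct true scores, how often
    the method orders the pair the same way as the true anomaly score; this
    is 1 minus the normalized number of swaps between the two rankings.
    """
    t = np.asarray(true_scores, float)
    s = np.asarray(method_scores, float)
    n = len(t)
    concordant, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if t[i] == t[j]:
                continue                      # ties in the true score are skipped
            total += 1
            if (t[i] - t[j]) * (s[i] - s[j]) > 0:
                concordant += 1
    return concordant / total if total else 1.0
```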
The other thing we can do, because we have a soft score, is to plot just the
top five, and we see that our method was able to successfully detect these top
five. But all the synthetic data was to prove the concept; the real application
was the medical data.
In this specific experiment we used about 4,000 patients from the hospitals
that underwent cardiac surgery in these years. And we looked at every patient
during his or her hospital visit at 8:00 a.m., which corresponds to the regular
doctor's visit.
So for each day we have one patient case, and each of these 45,000 cases is
some patient at 8:00 a.m. We summarize the whole patient history in 9,000
attributes, which are lab tests, medications, visit features, [inaudible] and
procedures done to the patient such as surgery, and then heart support devices,
because this is very important for cardiac surgery patients.
These features were designed by talking to experts in clinical care, so this is
the knowledge we had to put into the similarity metric, and we used these
features to create a graph. Out of these 45,000 patient days we asked 15
experts in clinical care to evaluate 222 cases and to say how anomalous they
think each case is. And we did it in such a way that every case was seen by at
least three of the experts.
And, again, our metric was: we asked the methods to give us the anomaly score
for each of these cases, in this case these 222, and we compared how much this
list, this ranking, agrees with the true ranking that the experts gave us. This
was our evaluation, because that's the only [inaudible] evaluation of the
method that we could come up with.
Before I show the result, let me say again how we handled the data. This is one
case: one patient that was in the hospital for about four days, and each day we
look at the patient at 8:00 a.m., so we split the case into these three
subcases, or more if needed, and the features used for creating the metric were
computed from the data from when the patient came to the hospital until that
point at 8:00 a.m.
The next point is the summary of the features until the next day at 8:00 a.m.,
and so on. So we used this data to create the features, and then we looked at
the decisions made in the next 24 hours. We looked at about 700 decisions: 400
of them were different lab tests that you could order for a patient, and the
remaining were different medications that you could order or not order.
The important thing in this case is also when the physician forgets to order
something. If a physician orders a lab test that maybe was just expensive, that
usually -- not always, but usually -- doesn't hurt the patient. It's more
problematic if the physician forgets to order something. So not ordering
something was also a decision that we wanted to check for being anomalous or
not.
And this is our [inaudible] showing on one example how we come up with these
features, and these are just some of them, because we have many of them, and we
came up with them by talking to doctors, pharmacists, and experts in clinical
care.
In this plot, this is just one lab that's done to the patient every so often,
the platelet count. This is where we look at the patient at some current time,
and these are different readings in time. These are different values, A, B, C,
D, E, F, and these are the features.
So, for example, one of the features we use is the last value, one is the
difference between the first value ever and the last value, one would be just
the slope if we linearly approximate the trend, and so on.
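As a small illustration of the kind of temporal features just described (last value, first-to-last difference, linear-trend slope), here is a sketch; the function and feature names are mine, and the full feature set used in the study is much larger.

```python
import numpy as np

def lab_series_features(times, values):
    """Summarize one lab-test time series (e.g. platelet counts) into features.

    A sketch of the kinds of features mentioned in the talk: the last value,
    the difference between the first and last value, and the slope of a
    linear-trend fit. The remaining features from the study are not
    reproduced here.
    """
    t = np.asarray(times, float)
    v = np.asarray(values, float)
    if len(v) == 0:
        return {"last": np.nan, "first_to_last_diff": np.nan, "slope": np.nan}
    return {
        "last": v[-1],
        "first_to_last_diff": v[-1] - v[0],
        # Slope of a degree-1 polynomial fit, i.e. the linear trend.
        "slope": np.polyfit(t, v, 1)[0] if len(v) > 1 else 0.0,
    }
```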
So this is the result. In this one we compare how the scores of our method
agree with the scores of the competing [inaudible]; the second best method was
the SVM-based one, where we just learned the boundary and calculated the
distance from the hyperplane.
And we see that across this range of regularization parameters, which in our
case is the weight to the sink and for the SVM is the cost, we outperform this
method for a wide range of the regularizer.
So in summary, I talked about how we can use graph-based methods to approach
the problem of learning with minimal feedback. I talked about two parts. One
was our contribution to online semi-supervised learning, in which we use
quantization, and the other was our contribution to the field of conditional
anomaly detection, which we applied to medical errors.
So one of the problems we tried to target is medical errors: if the physician
makes an error and we can catch it, for example, before the decision is
executed, we can prevent it.
The other thing is that these days everybody's talking about healthcare reform
and how we can just limit the spending. So one of the use cases for that is, well,
if we see that a physician orders a test that is maybe not needed but very
expensive, we can alert on that and see, well, maybe you want to use these
resources somehow better.
As for future work in this area, one direction is how to scale this solution
further. The first thing we can do is smarter quantization. In the quantization
that we use now, we're just trying to somehow cover the space. There may be a
way to cover the space with an end goal in mind: since we want to, at the end,
do semi-supervised learning, or maybe some other task, we could do the
quantization so that it takes that into account.
The other option that we have to scale the solution, to represent more nodes or
to go further in time, is to compute this in parallel. Now we calculate all
this label propagation on one graph. What we can do is split the graph into
different subgraphs and calculate the harmonic solution on these subgraphs in
parallel. The algorithms are cubic, so this could give us some speed-up; on the
other hand, we lose some accuracy.
But, for example, these subgraphs could be tracking the manifold around each
labeled example. So we could have many graphs even for one class, and in such a
way we would actually do some kind of multi-manifold learning, which could
allow us to scale to even more nodes than we have now.
The other direction of this work is how to address concept drift. Let me be
more specific. In the face examples, we can [inaudible]: when people are five
years old they look different than when they're 50, so how can we adapt to
changes like that and maybe not even try to remember all the vertices we saw in
the past?
In the medical application there's concept drift in how physicians treat
patients: medical practices change, medications change, so how can we adapt to
those changes? One possible solution is that we can just forget some of the
vertices that were not used for prediction, or the ones that don't change the
prediction much. So these are some ideas for how we could address concept drift
with these methods.
The last extension for future work is how to do this conditional anomaly
detection in a structured way. In the medical example, I was talking about how
we look at every decision and decide whether the order of a medication or a lab
test was unusual or not, but we do it all independently.
If you have a medication that increases blood pressure and one that decreases
blood pressure, you would probably not order them at the same time, so there
are correlations between those [inaudible] that you could take into account.
You could actually look at the whole vector of all possible decisions and see
if the whole vector is anomalous, or maybe which part of the vector, but take
into account that these decisions have some relationships to one another. So we
would like to scale this to that structured setting.
And that's it. Thank you.
[applause].
>>: [inaudible]
>> Michal Valko: Yeah.
>>: I'm curious about the result [inaudible]
>> Michal Valko: So these are the results I was trying to show here. In general
we can say we have about, you know, 90 percent accuracy about 90 percent of the
time. So this is, in loose words, our result, and we obviously take advantage
of unlabeled examples and do better than nearest neighbors. We don't have the
online semi-supervised boosting here because, as we have it now, it is just a
binary classifier; perhaps it could be adapted to multi-class, but it hasn't
been yet. So these are the results for the multi-class case.
>>: [inaudible]
>> Michal Valko: Yeah. So the other competing approach, as I just said, is
online semi-supervised boosting, which is shown here, and this works for the
binary case. So in this plot we also compared to this other approach.
The other [inaudible] is, you know, not an empirical result: we actually can
prove something about our solution, namely that we get better as time goes on,
or, to be more precise, not much different from the result on the training set.
Any other questions?
>>: I'm curious about part of the theory. So what about to get consistency with
your scaled-back method, what kind of assumptions do you need to put on the
distance method? It seems when you're creating those balls and stuff from the
manifold you're putting some assumptions onto the distance measure between
things.
>> Michal Valko: Well, it needs to be a metric, it needs to be bounded, but,
you know, it's bounded by 1 in this case. All positive. But, yeah, you know, it
needs to be a metric bounded by 1. Yeah, that's important.
>>: [inaudible]
>> Michal Valko: So the reason we can use just those assumptions and not more
is that this online clustering optimizes the worst case. That's why we don't
need to have, you know, [inaudible] samples from some distribution, which would
be the typical assumption for that.
>>: Okay.
>> Michal Valko: Well, the other assumption is that also the Laplacian, you use
normalized Laplacian. So also the entries of the Laplacian are bounded too.
That's important too. So in our results it would not work for un-normalized
Laplacian, for example. But I think that's it.
>>: Okay.
>> Peter Bodik: Let's thank our speaker again.
[applause]