>> Peter Bodik: Hello, everybody. It's my pleasure to introduce Michal Valko from the University of Pittsburgh. He's been working on semi-supervised learning and conditional anomaly detection, he's graduating this summer, and he's starting as a post-doc at [inaudible] in France. He's here the whole day, so if you have any more questions, feel free to stop by and talk to him. >> Michal Valko: Thank you. I'm pleased to talk about my research on adaptive graph-based learning, which is my thesis topic. In my research I've been focusing on learning with minimal feedback. And why is that? If we want people to enjoy machine learning, we need to give them systems that they don't need to spend much time training before they can actually use them. The other desired feature of our systems should be that they can adapt to ever-changing environments. Take, for example, the problem of online face recognition. We don't want to train it much, so we only give it one labeled image, one labeled face, and we want a system that is able to recognize the faces on the fly. This problem can become challenging, especially when the lighting conditions change from our labeled examples to the new environment. For example, the background can change, and that can make the problem even harder; and what if we now change our hair, grow a beard, or start wearing glasses? The problem becomes even more challenging in the presence of outliers, which are the people who are not in our labeled set of images. So in this talk I will use graph-based learning as the basic approach, which can model the complex similarities between the examples in our data. I will first introduce graph-based learning. After that I will spend some time talking about semi-supervised learning. Then I will present our contribution to the field of online semi-supervised learning as well as our theoretical results, and after that I will showcase the algorithm on the problem of online face recognition, which you could see on the first slide. In the remainder of the talk I extend graph-based learning to the problem of conditional anomaly detection and apply it to the problem of detecting errors in medical actions, such as the prescription of heparin. So, first, what is graph-based learning? Graph-based learning is widely used in machine learning for problems such as clustering or semi-supervised learning. The basic idea is that every data point we have in our data set will be represented as a node of a graph. We will use face recognition as a running example, and here we have every face assigned to a node in the graph. In a graph we not only need nodes, we also need edges, and the edges in our graph will represent similarities between our examples, such as, in this case, the similarities between the faces. That requires some similarity metric that we need to define; in this case these will just be pixel similarities. Such a graph can help us explore the structure in the data, and we can use it for inference. Semi-supervised learning is a machine learning paradigm for learning from both labeled and unlabeled examples. In this case we will have only two labeled examples, Nick and Sumeet, and our goal is to figure out the label, or identity, of all the other unlabeled faces, which we will call unlabeled data.
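To make the graph construction concrete, here is a minimal sketch of building such a similarity graph with a Gaussian kernel on raw pixel vectors. It is an illustration of the general idea rather than the exact metric used in the talk; the bandwidth sigma and the k-nearest-neighbor sparsification are assumptions.

```python
import numpy as np

def similarity_graph(X, sigma=1.0, k=10):
    """Build a weighted similarity graph from data points.

    X     : (n, d) array, one flattened (e.g. pixel) vector per example
    sigma : Gaussian kernel bandwidth (assumed; needs tuning per problem)
    k     : keep only each node's k nearest neighbors to sparsify the graph
    Returns the symmetric weight matrix W of the graph.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-d2 / (2.0 * sigma**2))        # Gaussian (RBF) similarities
    np.fill_diagonal(W, 0.0)                  # no self-loops

    # keep only the k strongest edges per node, then symmetrize
    keep = np.zeros_like(W, dtype=bool)
    for i in range(n):
        nn = np.argsort(W[i])[-k:]
        keep[i, nn] = True
    return W * (keep | keep.T)
```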
Semi-supervised learning can take this unlabeled data into account and come up with a classifier that reasons with it. In the following we will assign the number 1 to the node of one person and the number minus 1 to the node of the other person. Our goal will be to infer the labels of the remaining faces, and we will do it with some kind of label propagation: the label 1 will spread around one labeled example and minus 1 around the other. At the end we will come up with a soft label, which will be a real number between minus 1 and 1, and when we get a number that's positive we'll say, well, that was Nick, and if we get a number that's negative we'll say that was Sumeet. One approach we can take is to use the intuition of a random walk. For example, if we want to figure out the label of this face, we can calculate the probability of a random walk on the graph; the random walk starts in that node and jumps around the vertices respecting the edges, with jump probabilities proportional to the weights of the edges. The label we get at the end is the difference between the probability that the walk ends up in 1 and the probability that it ends up in minus 1. These random walks do not need to be simulated; they can be expressed as an optimization problem where we minimize the weighted differences between the soft labels of connected vertices, with weights given by the similarities, subject to getting the labeled examples correct. This objective function can be rewritten in terms of the graph Laplacian, and this specific one is called the harmonic function solution. Its properties are that the resulting soft labels, these numbers between minus 1 and 1, are smooth, so for an unlabeled vertex the soft label is a weighted average of its neighbors. It can be [inaudible] as a closed-form solution, and the solution can be interpreted as the random walk on the data similarity graph that we just talked about. The advantage of such an approach is that it can capture non-linear patterns in the data. The optimization problem is convex, and therefore the minimum we find is globally optimal. The disadvantage is that, as with other methods, it's sensitive to the structure of the data. There are some rules of thumb which can help us define the metric, but it's still something of an art and in some problems needs to be tuned by calibration. Traditionally, semi-supervised learning is an offline method: all the examples are given in advance, and we make the inference once. But say that the data now arrives in a stream, as in our video example, and we need to make predictions on the fly. The most straightforward approach is shown on this slide. We have some similarity graph and a new example whose label we want to predict; we add the example to the graph, recompute the graph Laplacian, infer the labels, and then predict based on what we inferred for the new example. We then output the prediction and the updated graph with the new node. So what's the problem with this algorithm? The problem is that as we get more examples, our graph grows, and storage and inference become infeasible for millions or maybe even tens of thousands of vertices.
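For reference, here is a minimal sketch of the harmonic function solution in the closed form just described: soft labels on the unlabeled vertices obtained from the graph Laplacian with the labeled vertices clamped to plus or minus 1. This is the standard formulation of the idea; the actual system uses a normalized Laplacian (as noted in the Q&A at the end), so treat this only as an illustration. The cubic-cost linear solve below is exactly the step whose growth motivates the compression described next.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, labels):
    """Harmonic function solution on a similarity graph.

    W           : (n, n) symmetric weight matrix of the graph
    labeled_idx : indices of the labeled vertices
    labels      : their labels, +1 or -1
    Returns soft labels f in [-1, 1] for all n vertices.

    Minimizes sum_ij W_ij (f_i - f_j)^2 subject to f = y on labeled vertices,
    which gives f_u = -L_uu^{-1} L_ul y_l with the graph Laplacian L = D - W.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W              # unnormalized graph Laplacian

    l = np.asarray(labeled_idx)
    u = np.setdiff1d(np.arange(n), l)           # unlabeled vertices

    f = np.zeros(n)
    f[l] = labels
    f[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, l)] @ f[l])
    return f
```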
So one of the solutions: we could reduce the number of nodes of the graph and make the inference feasible again. My solution combines two ideas: online clustering and semi-supervised inference. In the online clustering part, we incrementally cover the space by a set of R-balls, which are balls with radius R. Not only that, we also remember how many nodes each of the balls covers. These algorithms come with some guarantees, which we extend to bound the approximation error of the graph Laplacian. But let's now see how this algorithm works. Say that this is the representation we have: these are our labeled examples, these are our unlabeled examples, and these are the R-balls, each centered at some face; the little numbers represent how many other faces -- faces we discarded but have seen in the past -- each [inaudible] covers or represents. Say that we get a new example, and it is within distance R of some representative vertex, in this case this one. Then we may discard this vertex and update the count from four to five. At some point it will happen that a new example is far enough away that it is not within R of any of the previously assigned centers. In that case we double R and reassign the vertices to the new [inaudible] such that these guarantees hold: no two representative vertices are closer than R, and every vertex that we've seen so far, even the new one, is covered by at least one representative vertex. So this is how the algorithm looks after this change. The inputs are not only the example and the similarity graph, but also the K representative vertices, so our graph will only be up to K large, and also the counts of how many vertices each node represents. The algorithm changes as follows. At every step we add the example to the graph, but if the graph exceeds K, the number of vertices we allow ourselves to remember, we quantize it. So we do this kind of compression, the online clustering, and update the vertex multiplicities, that is, how many previously seen faces or examples every node represents. Then we compute the Laplacian of that compact representation, which is essentially the graph where every centroid is counted as many times as the multiplicity we remember. This is our approximation of the full graph, which we don't want to represent because we don't want to use so much memory or computation time. After that we again make an inference and a prediction. Computing the metric still has the same complexity, and inference is still roughly cubic, but now in a constant number of vertices, K. So every inference step takes constant time, and that's something we can use for prediction over very long streams. >>: I have a question. >> Michal Valko: Yes. >>: So for the [inaudible] >> Michal Valko: So we always have some similarity graph, and when we add a new vertex, we need to calculate the similarity edges to the previously seen ones. >>: Okay [inaudible] >> Michal Valko: Yeah, that's what's here. It's dynamic. We need to extend the graph, add the node, and then maybe at the end discard it, because we don't want to keep it in the end, but at every point we actually need to add the node to the graph because we want to make a prediction for every new node. So for every new node we need to calculate the similarity with every example that we have, but that's also only up to K; it's not growing with time. Are there any other questions? Okay.
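Here is a minimal sketch of the doubling-based online quantization described above: keep at most K representative vertices with multiplicities, fold each new point into a nearby representative when possible, and double R and re-cluster the representatives when the budget is exceeded. It is a simplified illustration in the spirit of Charikar-style doubling algorithms, not the exact procedure from the thesis; the names and the Euclidean distance are assumptions.

```python
import numpy as np

class DoublingQuantizer:
    """Maintain at most K representatives ("centroids") with multiplicities."""

    def __init__(self, K, R_init=1.0):
        self.K = K
        self.R = R_init
        self.centers = []        # representative points
        self.counts = []         # how many seen points each one represents

    def _nearest(self, x):
        d = [np.linalg.norm(x - c) for c in self.centers]
        i = int(np.argmin(d))
        return i, d[i]

    def add(self, x):
        """Absorb a new point; returns index of the representative covering it."""
        if self.centers:
            i, d = self._nearest(x)
            if d <= self.R:                      # covered by an existing R-ball
                self.counts[i] += 1
                return i
        if len(self.centers) < self.K:           # room for a new representative
            self.centers.append(np.asarray(x, dtype=float))
            self.counts.append(1)
            return len(self.centers) - 1
        # budget exceeded: double R and greedily re-cluster the representatives
        self.R *= 2.0
        old = list(zip(self.centers, self.counts))
        self.centers, self.counts = [], []
        for c, m in old:
            if self.centers:
                i, d = self._nearest(c)
                if d <= self.R:                  # merge into an existing ball
                    self.counts[i] += m
                    continue
            self.centers.append(c)
            self.counts.append(m)
        return self.add(x)                       # retry with the larger radius
```

The compressed Laplacian is then built on these (at most K) representatives, with each one weighted by its multiplicity, and the harmonic solution from the previous sketch is run on that small graph.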
So, real-world problems involve outliers, and if we set up this algorithm as I showed, it will work for a while, but it will start to break down when we have new people. Why is that? It is because the online clustering approach that we use here optimizes the worst case. So when we have a lot of outliers, they want to be covered too, and as such they use up the precious space that we want to use only for the examples we care about, because we don't want to make a prediction when we see an outlier at all. So how do we deal with that problem? We need to control the extrapolation to those unlabeled examples so that we don't extrapolate to outliers. And this is how we can do it. We create a special node which we'll call the sink, and we'll assign the label 0 to it. So besides the labels 1 and minus 1, we'll have a special label 0. We connect every vertex in the graph to the sink with some weight gamma_g, and this will be our regularizer for the graph. If you remember, we can think of all of this as a random walk. Now, when we randomly jump, there will always be some chance that the walk jumps to the sink, to 0. So what will happen with the soft labels if we have zeros? Essentially all of them will become closer to 0, but something different will happen for outliers, because for them 0 will be the closest label. And since we use an exponential metric for our similarity graph, the labels of these outliers will be driven toward 0 fast. What we actually do is that when an inferred label is close to 0, we decide not to predict and we discard the node from the graph. We make the choice that, okay, this is an outlier, we don't want to predict it, and we don't even want to keep it as a representative, and in such a case we control the influence of outliers. So these gamma_g's are a parameter of our algorithm which essentially says how much we trust our unlabeled data. Yeah, that's what I said: if we cannot infer the label of an example with sufficiently high confidence, we'll just discard it. Now I want to state the theoretical result that we proved for our algorithm. Essentially what we want to show is that as our online algorithm runs in time, as we get more and more faces or unlabeled examples in general, the error of our solution doesn't differ much from the training error, the error on the labeled examples. The idea of the proof is that we can take this regularization coefficient and set it such that as we get more examples, the error term vanishes. I'm not going to go into much detail, but I'll say that we did it by splitting, or decomposing, the error into three parts and bounding each of them separately. One is the offline learning error: even if we did not do any quantization, any approximation, and we could store all of our vertices, we would incur some error just by label propagation. That's what we call the offline learning error. The second is the online learning error: even without any approximation in the algorithm, we don't see the future. At time step T we can only use the first T examples to make our prediction, and in that case we incur the error of not seeing the future, which we call the online learning error. And the third error is the quantization error: because we don't have enough memory or time, we don't use the full Laplacian that we could in principle compute, but only the approximate version, the one we got from the quantization. In that case we bound the error of our Laplacian by extending the guarantees that come from the online clustering algorithm.
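A minimal sketch of the sink regularization just described: every vertex gets an extra edge of weight gamma_g to a virtual node clamped to 0, the harmonic solution is computed on the augmented graph, and examples whose inferred soft label falls below a confidence threshold are treated as outliers and dropped. The threshold tau and the unnormalized-Laplacian solve are assumptions made for illustration; the real algorithm combines this with the quantized graph from the previous sketch.

```python
import numpy as np

def soft_harmonic_with_sink(W, labeled_idx, labels, gamma_g=0.1, tau=0.2):
    """Harmonic solution with a 0-labeled sink node.

    W       : (n, n) similarity matrix of the (possibly quantized) graph
    gamma_g : weight of the edge from every vertex to the sink
              (how much we trust the unlabeled data; illustrative default)
    tau     : confidence threshold below which a point is declared an outlier
    Returns (soft_labels, is_outlier) for the n original vertices.
    """
    n = W.shape[0]
    # augment the graph with one extra node (the sink), connected everywhere
    Wa = np.zeros((n + 1, n + 1))
    Wa[:n, :n] = W
    Wa[:n, n] = Wa[n, :n] = gamma_g

    La = np.diag(Wa.sum(axis=1)) - Wa            # Laplacian of augmented graph

    # clamped vertices: the given labels plus the sink clamped to 0
    l = np.concatenate([np.asarray(labeled_idx), [n]])
    y = np.concatenate([np.asarray(labels, dtype=float), [0.0]])
    u = np.setdiff1d(np.arange(n + 1), l)

    f = np.zeros(n + 1)
    f[l] = y
    f[u] = np.linalg.solve(La[np.ix_(u, u)], -La[np.ix_(u, l)] @ y)

    soft = f[:n]
    return soft, np.abs(soft) < tau              # low confidence -> outlier
```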
So let me now present a couple of experiments that we did with this, on the problem of online face recognition. In this case we actually only wanted to test the robustness to outliers, so we have only one person to recognize. We get four labeled images of the person in a cubicle and then many unlabeled images from a video stream of the person at different locations. So it will be not only the cubicle but also an office, a presentation room, a cafeteria and so on. And then, to test the robustness of our method, we extend the stream with some random faces that we do not want to recognize and inject them into the video stream. In the first plot we show the performance of the different methods; we measure recall and precision in this experiment. Our method is shown in red, and the nearest neighbor approach, which is the simplest thing you could do, is shown in blue. In this case it's just one nearest neighbor, but in general you look at your labeled set, figure out which labeled face is closest to the one you see now, and predict that one. In gray we used a state-of-the-art online semi-supervised learning method called online semi-supervised boosting, which has a set of weak learners and uses boosting to update them on the fly. This is the result of that method, and this is the variant where we actually tried to help online semi-supervised boosting and allowed it to use future data to construct the weak learners. Even in that case we were able to outperform it. In the other experiment we compare to a commercial solution, actually by a company based in Pittsburgh, and as opposed to our method, it uses hand-crafted features designed specifically for face recognition. In our algorithm we use a very simple metric: we don't even use color, we only use pixel-by-pixel similarities. In that way the computation of our metric is really fast, so we are able to make a [inaudible] much faster -- empirically we only use 20 percent of the computational time of this commercial solution -- and we're still able to outperform it. In the other example we test multi-class prediction. We gathered eight people and asked them to walk in front of a camera and make funny faces. The first time they appeared on the camera we labeled the first four of those faces, and then, later, when we asked them to come back again, we measured how accurate our predictions were. Because the other method doesn't work for multi-class prediction, here we only compare with the nearest neighbor approach. What we can see is that as a person interacts more with the camera, the algorithm learns more about them, because it can represent more of the space, more of the manifold of what their face can be. So before I move to the second part, are there any questions about this part? Yes? Please. >>: Since you have a video stream, you can use kind of the tracking feature to generate more labels.
So if I recognize someone in the image and it says this is Mike, and in the next image [inaudible] it's still Mike, I can treat it as fully supervised as opposed to semi-supervised in that sense. Have you tried it -- do you have any indication of what would work better? >> Michal Valko: Yeah, we tried it. Tracking usually helps for all of the methods we can use; it helps recall but can hurt accuracy. It actually depends on the video. These are just simple videos of people walking, but the other thing we tried was movies or TV shows, and there you very often have cuts in the scene, and those can mislead the tracking. You have a face here, there's a cut, and there's a different face here, and if you just use tracking, sometimes it just takes the label of the previous face and then you have an error. So it helps your recall -- you can recognize more faces -- but sometimes it hurts prediction. So I guess for different kinds of streams this might be helpful, yeah. >>: I have another question. Your clustering method seems to me [inaudible] quantization? Are there differences, or did you try to compare the two or something like that? >> Michal Valko: So our method is based on the Charikar algorithm, which is just online clustering. There are many quantization algorithms that you could take; the problem is that not all of them can actually be used here. For example, Nystrom methods are very often used. Why can you not use them? Because a lot of these methods require that your data is IID, and in these streams your data is not IID: this frame and the next frame are almost the same, so you cannot think of this as a random sample you're getting, and you cannot just randomly decide to discard a frame. That's why we used this one. The other reason is that it comes with guarantees, so we could actually prove something about the method, whereas for the other competing method, online semi-supervised boosting, there are no guarantees. We have no distributional [inaudible], whereas these vector quantization methods usually do, or at least I don't know of any that has no distributional assumptions and can be used online. So that's why. Other questions? Okay. So in the second part I will extend the graph [inaudible] to the problem of conditional anomaly detection. A running example will be detecting medical errors, such as performing surgery, sending a patient home, prescribing a medication and so on. Patients' health records already have a lot of information about a patient encoded in computers, such as demographics, conditions, medications, procedures, bills, progress notes, X-rays, and so on. In this simple slide the pluses represent patients that got some medication, say aspirin, and the minuses patients that didn't, and patients are grouped by the similarity of their symptoms. Traditional anomaly detection methods look for data points that are far from the rest -- from most of the data -- which we just call outliers or anomalies. In our case these are patients with atypical symptoms. These are not our concern, because there is no decision there that we could change. What is our concern is a patient like this one, who did not get aspirin even though a lot of the patients with a similar condition did. So there is a reason to believe that this decision could be changed.
And our assumption is that these conditional anomalies correspond to medical errors; it's very bad if we make such an error, so it's very desirable to discover it and prevent it. Medical errors in the U.S. are the eighth leading cause of death, so this is a really serious problem. Hospitals already recognize this problem and design rules that try to discover when some problem is encountered. For example, I go to school at the University of Pittsburgh, and we have a big hospital there, [inaudible] called the university medical center. It has a whole department of people who just design rules -- you know, if heparin is high and hemoglobin is low, then do something. It usually takes months to tune these rules: they test them on some patients, the results come back, and they try again. We believe that we can use past data instead of trying to encode pretty much the whole of medical knowledge into such rules. Traditional anomaly detection methods usually define an anomaly as something that you can search for algorithmically. For example, a nearest neighbor method would say that anything far from the rest of the data is anomalous, or a density-based method would say that anything lying in a low-density region is anomalous. Another option is a classification-based method, which would say: let's separate the data, let's classify all the data as anomalous or non-anomalous. And we all know statistical methods [inaudible]: anything that deviates far from the mean is an anomaly. So we recognize three different ways to go about conditional anomaly detection. I will briefly cover all of them, argue that the last one, the regularized discriminative approach, is a good way to go, and propose a method that builds on the first part of the talk and is able to regularize these outliers away. These are the specific challenges that come with conditional anomaly detection. One is the isolated points, the traditional outliers we were talking about on one of the earlier slides; these are the points that are far from the rest of the data. They're not surrounded by any other points, and we should not be confident about saying what their labels should be. I should say that in this setting we have access to all the labels, so there's nothing unlabeled here, and our goal is to say how likely it is that a label should be different. In the medical example, these labels are: this physician did a surgery, or this physician ordered this medication. The other problematic data points for many of these methods are the fringe points. These are the points that lie on the outer boundary of the distribution's support. They are especially problematic for nearest neighbor approaches such as [inaudible], because these methods usually look at the neighborhood, and the neighborhood of a fringe point is very different from the neighborhood of a typical point. So a lot of these methods will output these fringe points just because their neighborhoods are different, and this is also something we don't want. What we actually want to output, said simply, are points that are surrounded by points of a different label. So one of the simplest methods you could take is to try to build on standard anomaly detection methods, because there's a huge amount of research on that.
So what we can do is say: we have a new example with a label. We look at our data set, find all the examples with the same label, and then use a standard anomaly detection method to see if this point is far away from the other points with the same label. The problem is that this approach ignores the other classes. It discovers this one, which is desirable because it's surrounded by points of the different class, but it will also think that this one is really anomalous, because it's also far away, even though it's not surrounded by other points. From that we can see that we really need to take the other classes into account, and we cannot just straightforwardly apply traditional anomaly detection methods. So the other way is: let's take these classes into account and use some kind of classification-based approach. Ideally, if we had a probabilistic model of our data, we could say: if the probability of the label given the data is small, then we have a conditional anomaly. So we can try to learn some probabilistic model. This is something that we did at the very beginning, and now we compare to it as our baseline; instead, let's design a method that outputs some kind of score, where the bigger the score, the more anomalous the label. The point is that you really don't need a probabilistic model; the only thing you actually need in this problem is to rank all the examples by how anomalous you think their labels are. So one other thing we could do is use support vector machines: once we learn a classifier, we can define our anomaly score as the distance from the hyperplane on the other side. If a point is on the wrong side of the hyperplane, that distance is our anomaly score, because ideally we don't want to just say these points have anomalous labels and these points have non-anomalous labels; we would like a soft score that enables us to rank all the anomalies. For example, in a practical application we could say: I only want to look at the top 10 that you are most confident about. The problem with this method is, again, similar: once you have a classifier, you can become overly confident that some points are anomalous. Say that your classifier boundary goes somewhere here; you would say this is my anomaly score, and this point would get an even bigger anomaly score. Again, if you use some kind of nearest-neighbor, graph-based method, you could say: these are my closest neighbors, so I should have their label. Even more, if the metric is exponentially decreasing, which it often is -- in all our examples we just use a Gaussian kernel -- then this [inaudible] really confident that these labels should be different, should be reverted. So this is how we actually apply the soft harmonic solution that takes advantage of the data manifold to solve this problem, and we will be able to regularize it.
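Before spelling out the graph-based score, here is a minimal sketch of the SVM-style baseline mentioned above: train a classifier on the labeled data and score each example by how far it sits on the wrong side of the decision boundary relative to its recorded label. The linear kernel and the use of scikit-learn are assumptions for illustration, not the exact setup from the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def svm_conditional_anomaly_scores(X, y):
    """Score how anomalous each recorded label looks under an SVM classifier.

    X : (n, d) feature matrix
    y : recorded labels in {+1, -1} (the decisions we want to audit)
    Returns one score per example: large positive means the example lies far
    on the wrong side of the learned boundary for its own label.
    """
    clf = SVC(kernel="linear")            # assumed kernel choice
    clf.fit(X, y)
    margin = clf.decision_function(X)     # signed margin value per example
    # y * margin > 0 means the point agrees with its label, so its negation
    # is a natural "conditional anomaly" score
    return -(np.asarray(y) * margin)
```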
As you recall, our soft label from the first part of the talk was the difference between the probability of the walk reaching 1 and the probability of reaching minus 1. If that was close to 0, we were not sure, and if it was close to 1 in absolute value, we were sure about the label. And we can rewrite this soft label, as we can do with any real number, as a product of its absolute value and its sign. Semi-supervised learning usually cares about the sign: positive is one class, negative the other in the binary case. So in this case, minus 0.9 is equal to 0.9 times minus 1, and we would use the minus 1 as the label. But for anomaly detection, or conditional anomaly detection, we can interpret the absolute value as a confidence: the closer to 1, the more confident we are about the label. And now we can say: if we're really confident, so the absolute value is really close to 1, and the sign is different from what we see in the data -- because, again, here we see all the labels -- then we're very confident that we found a conditional anomaly. And the regularization that we used in the first part, if you recall the sink, can diminish the effect of outliers by connecting all the nodes in the graph to the sink with some small weight gamma_g. So how do we evaluate such a method? It's very difficult to evaluate an anomaly detection method, because we usually don't know what the anomalies are -- they are, by definition, the points that are different. So for this version I'm showing points for which we know the true anomaly score, because we generated the data sets from a mixture of multivariate Gaussians, so we can calculate the true anomaly score, which is the probability of a different label. In this case we have access to the probabilistic model, so we can say what the probability of a different label is. A lot of research in anomaly detection just inverts some labels, and in that case you assume every label is inverted with some probability, IID, and sometimes this is not the case. Here we can calculate how much each label is supposed to be different. So we generated the data set -- we do it many times so we can calculate averages -- and then we swap some of the labels, so some of these red pluses become squares and some of the squares become red pluses. And then, rather than just checking how a method classifies a point as anomalous or non-anomalous, we ask all the methods we compare to come up with a ranking: for all of these thousand points, give me the scores of how much you think these labels should be different. Because we know the true probabilistic model, we can say how much this score, meaning this ranked list, agrees with the true list of what the labels should be, and this is what we use. We can see that our method is competitive with other approaches and usually outperforms them. We generated many data sets that vary in shape, position and difficulty. And, again, our agreement metric was pretty much what you could call AUC: we have two lists, the list of anomalies from the true anomaly score and the one the method proposed, and we calculated the number of swaps needed to get from the method's list to the true list. The other thing we can do is just look at the top five, because we have a soft score, and we see that our method was able to successfully detect these.
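Here is a minimal sketch of how the confidence-times-sign reading of the soft labels turns into a conditional anomaly score: a recorded label is suspicious to the extent that the propagated soft label is both confident and of the opposite sign. The soft labels are assumed to come from a sink-regularized solution like the one sketched earlier, computed with each query point left unclamped; the exact protocol in the thesis may differ.

```python
import numpy as np

def conditional_anomaly_scores(soft_labels, observed_labels):
    """Turn soft harmonic labels into conditional anomaly scores.

    soft_labels     : f in [-1, 1], e.g. from a sink-regularized harmonic
                      solution with the query point left unclamped
    observed_labels : the recorded labels y in {+1, -1}
    Returns scores in [0, 1]: the confidence |f| when the sign of f disagrees
    with y, and 0 otherwise.  Points pushed toward the sink (f close to 0)
    automatically get a low score, which is how the regularization keeps
    isolated and fringe points from being flagged.
    """
    f = np.asarray(soft_labels, dtype=float)
    y = np.asarray(observed_labels, dtype=float)
    disagree = np.sign(f) != np.sign(y)
    return np.where(disagree, np.abs(f), 0.0)
```

Ranking the examples by this score gives the ordered list that is then compared against the ranking from the true generative model (or, later, from the clinical experts).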
But all the synthetic data was a proof of concept; the real application was the medical data. In this specific experiment we used about 4,000 patients from the hospitals who underwent cardiac surgery in those years, and we looked at every patient during his or her hospital stay at 8:00 a.m. each morning, which corresponds to the regular doctor's visit. So for each day we have one patient case, and each of these 45,000 cases is some patient at 8:00 a.m. in the morning. We summarize the whole patient history in about 9,000 attributes, which were designed for this task and which cover lab tests, medications, visit features, [inaudible] and procedures done to the patient, such as surgery, and heart support devices, because those are very important for cardiac surgery patients. The features we created were designed by talking to experts in clinical care -- this is the knowledge we had to put into the similarity metric -- and we used these features to create a graph. Out of these 45,000 patient days, we asked 15 experts in clinical care to evaluate 222 cases and say how anomalous they think each case is, and we did it in such a way that every case was seen by at least three of the experts. Again, our metric was: we asked the methods to give us an anomaly score for each of these 222 cases, and we compared how much this list, this ranking, agrees with the true ranking that the experts gave us. That was our metric, because it's the only [inaudible] evaluation of the method that we could come up with. Before I show the result, let me say again how we handled the data. This is one case: one patient who was in the hospital for about four days, and each day we look at the patient at 8:00 a.m. in the morning, so we split the case into these three subcases, or more if needed. The features used for creating the metric were computed from the data from the time the patient came to the hospital until that point at 8:00 a.m.; the next case used the summary of the features until the next day at 8:00 a.m., and so on. So we used this data to create the features, and then we looked at the decisions made in the next 24 hours. We looked at about 700 decisions: 400 of them were different lab tests that you could order for a patient, and the remaining were different medications that you could order or not order. The important thing in this case is also when the physician forgets to order something. If a physician orders a lab test that maybe was just expensive, that usually -- not always, but usually -- doesn't hurt the patient, but it's more problematic if the physician forgets to order something. So not ordering something was also a decision that we want to check for being anomalous or not. And this is our [inaudible] showing on one example how we come up with these features -- these are just some of them, because we have many -- and we came up with them just by talking to doctors, pharmacists and experts in clinical care. In this plot, this is just one lab that's done to the patient every so often, in this case the platelet count. This is where we look at the patient at some current time, and these are different readings in time, different values A, B, C, D, E, F, and from these we compute the features. For example, one of the features we use is the last value, one is the difference between the first value ever and the last value, one is just the slope if we linearly approximate the trend, and so on.
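A minimal sketch of this kind of temporal feature extraction for a single lab series (last value, first-to-last difference, slope of a linear fit). The names and the exact feature set are illustrative assumptions; the real system uses thousands of such features designed with clinical experts.

```python
import numpy as np

def lab_series_features(times, values):
    """Summarize one lab-test time series (e.g. platelet counts) up to "now".

    times  : increasing measurement times (e.g. hours since admission)
    values : the corresponding lab values
    Returns a small dict of features of the kind described in the talk.
    """
    t = np.asarray(times, dtype=float)
    v = np.asarray(values, dtype=float)
    feats = {
        "last_value": v[-1],
        "first_minus_last": v[0] - v[-1],
        "delta_last_two": v[-1] - v[-2] if len(v) > 1 else 0.0,  # assumed extra
    }
    # slope of a least-squares linear fit -- the "trend" feature
    if len(v) > 1:
        slope, _ = np.polyfit(t, v, deg=1)
        feats["slope"] = slope
    else:
        feats["slope"] = 0.0
    return feats
```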
So this is the result. Here we compare how the scores of our method agree with the scores of the competing [inaudible]; the second best method was SVM-based, where we just learn the boundary and calculate the distance from the hyperplane. And we see that across this range of regularization parameters -- which in our case is the weight to the sink, and for the SVM is the cost parameter -- we outperform that method for a wide range of regularizers. So, in summary, I talked about how we can use graph-based methods to approach the problem of learning with minimal feedback. I talked about two parts: one was our contribution to online semi-supervised learning, which uses quantization, and the other was our contribution to the field of conditional anomaly detection, which we applied to medical errors. One of the problems we tried to target is medical errors: if a physician makes an error and we can catch it before the decision is executed, we can prevent it. The other thing is that these days everybody is talking about healthcare reform and how to limit spending. So one of the use cases is: if we see that a physician orders a test that is maybe not needed but very expensive, we can alert on that and say, well, maybe you want to use these resources better. As future work in this area, one direction is how to scale this solution further. First, we can do smarter quantization. The quantization we use now just tries to cover the space somehow; there may be a way to cover the space with the end goal in mind -- we want, at the end, to do semi-supervised learning, or maybe some other task, so the quantization could take that into account. The other option we have to scale the solution, to represent more nodes or go further in time, is to compute in parallel. Right now we calculate all the label propagation on one graph. What we could do is split the graph into different subgraphs and calculate the harmonic solution on these subgraphs in parallel. The algorithms are cubic, so this could give us some speed-up; on the other hand, we lose some accuracy. For example, these subgraphs could be tracking the manifold around each labeled example. So we could have many graphs, even for one class, and in such a way we would actually do a kind of multi-manifold learning, which could allow us to scale to even more nodes than we have now. The other direction of this work is how to address concept drift. Let me be more specific. In the face example, people [inaudible] -- when people are five years old they look different than when they're 50 -- so how can we adapt to changes like that, and maybe not even try to remember all the vertices we saw in the past? In the medical application, there's concept drift in how physicians treat patients: medical practices change, medications change, so how can we adapt to those changes? One possible solution: we can just forget some of the vertices that were not used for prediction, or the ones that don't change the prediction much.
And so these are the ideas for how we could address concept drift using these methods. The last extension I will talk about in the future work is how to do this conditional anomaly detection in a structured way. In the medical example I was talking about, we look at every decision and decide whether the order of a medication or a lab test was unusual or not, but we do it all independently. If you have a medication that increases blood pressure and one that decreases blood pressure, you would probably not order them at the same time, so there are some correlations between those [inaudible] that you could take into account: actually look at the whole vector of all possible decisions and see if the whole vector is anomalous, or maybe which part of the vector, but take into account that these decisions have some relationships to one another. So we would like to extend the method in that structured direction. And that's it. Thank you. [applause]. >>: [inaudible] >> Michal Valko: Yeah. >>: I'm curious about the result [inaudible] >> Michal Valko: So these are the results I was trying to show here. In general we can say that we get about 90 percent accuracy about 90 percent of the time. That is, in loose words, our result, and we clearly take advantage of the unlabeled examples and do better than nearest neighbors. We don't have the online semi-supervised boosting here because it is just a binary classifier as it exists now; perhaps it could be adapted to multi-class, but it hasn't been yet. So these are the results for the multi-class case. >>: [inaudible] >> Michal Valko: Yeah. So the other competing approach, as I just said, is online semi-supervised boosting, which is shown here, and this works for the binary case. In this plot we also compared to this other approach. The other [inaudible] is that this is not just an empirical result: we can actually prove something about our solution, namely that as time goes on, our result is not much different from the result on the training set. I should be more precise, yeah. Any other questions? >>: I'm curious about part of the theory. To get consistency with your scaled-back method, what kind of assumptions do you need to put on the distance measure? It seems that when you're creating those balls and so on from the manifold, you're putting some assumptions on the distance measure between things. >> Michal Valko: Well, it needs to be a metric, and it needs to be bounded -- it's bounded by 1 in this case -- and all positive. Yeah, it needs to be a metric bounded by 1; that's important. >>: [inaudible] >> Michal Valko: The reason we can use just those assumptions and not more is that this online clustering optimizes the worst case. That's why we don't need, you know, [inaudible] examples from some distribution, which would be the typical assumption. >>: Okay. >> Michal Valko: Well, the other assumption is about the Laplacian: we use the normalized Laplacian, so the entries of the Laplacian are bounded too. That's important too. Our results would not work for the unnormalized Laplacian, for example. But I think that's it. >>: Okay. >> Peter Bodik: Let's thank our speaker again. [applause]