>> Yuval Peres: We are ready for the talk. If any of you taught the class and had to submit joint homework, you will understand what I was feeling when I saw my speakers suddenly starting to collaborate with each other. We are very happy to have Yishay tell us about robust inference and local algorithms. >> Yishay Mansour: Thank you very much, Yuval. Many thanks to my co-authors, Uriel Feige, Aviad Rubinstein, Robert Schapire, Moshe Tennenholtz and Shai Vardi. They are more than happy to answer any questions. This is the outline of the talk. Mainly I would like to convince you that robust inference is an interesting idea that brings up questions. We will start by giving a few motivating examples and the model, then describe the main results and, hopefully, we will have enough time to give a sample of our results. As a warning for those of you who report to Jennifer, I did give a similar talk in both places, but it's only similar up to the model. After the model the results are different, so do not tune out. [laughter]. >>: [indiscernible] >> Yishay Mansour: They probably predicted [indiscernible]. Okay. When I say robust probabilistic inference, what do I mean? Like in Hebrew, let's read it from right to left. An inference problem is a [indiscernible] to an event: you see something about the world and you want to deduce something about the true state of the world. When I say probabilistic, I am saying that the underlying events have a joint probability distribution. Those of you from the machine learning side, just think about a Bayesian network: a Bayesian network is a way to describe a joint distribution, and you observe the values of some nodes and want to predict the values at other nodes. What's new here is robustness, because the other two are well known. Robustness, the way to think about it, is the possibility that what you observe is not what really happened.
It's the possibility that what you observe has been corrupted by someone. Let me give two motivating examples. The first one is spam detection. When you think about spam detection, there is a huge number of filters. Each filter works on a tiny little bit of information, trying to find out maybe who sent the e-mail, how the e-mail is structured, keywords like Viagra, key phrases like please enter your login, et cetera. Given all this, the idea of spam detection is to collect all this information, have those various filters run, and then deduce from them whether the e-mail is spam or not. But when you think about spam detection you realize it's more like a game, because on the one hand there is me, I guess the good-guy side, the one trying to detect whether it's spam or not. But on the other hand the spammers are also playing the game. In some sense, if you set up a spam detector and leave it running by itself for a very long time, it's very likely that spam will get through it, because spammers will be able to overcome more and more of the filters. Since the spammer can adjust the content to evade the detectors, we would like to think that in the short run they will probably be able to fool a few detectors, and in the long run they will probably be able to fool a huge number of detectors. So what is our goal in robustness? The goal is to classify the e-mail correctly even if the spammer can adversarially corrupt a few of the detectors. By corrupt, think of writing a word with a spelling error; they have all of those tricks to get around the filters you put in. So let's assume they can fool a small number of the detectors, and we would like to be able to overcome that. Yes, question? >>: Do the spammers here know that they are spammers? Because most spammers are people who don't realize it. [laughter]. >> Yishay Mansour: So you mean people that send us, like, the weekly MSR. [laughter]. No comment.
Another setting, like any real setting: I like to think about network failures, failure detection, because I'm familiar with networks. A network is something very complex, and in every network you have a huge number of failures, even though we like to think in theory that everything works. In order to manage the failures, what really happens is you put in detectors; in order to solve the failure problems you are putting more hardware into the network. For example, you are trying to detect whether a link is operational or not, whether the bit error rate on the link is high or low. But those detectors can fail. When we say fail, most of you probably think it stops working, but if it stops working that is the easy case. The hard case is if you have something that is supposed to measure the bit error rate and it tells you the bit error rate is 0.01. It could really be 0.01, in which case you have a bad link, or it could be that the detector went bad. How can you model such a thing? Things are going to fail, especially if you have a huge network like Microsoft's. You could try to model it in a Bayesian sense, but what would that mean? It means you have to assume a distribution over what happens in a failure, and this is what you would really need to do. On the other hand, you can try to solve it in the worst case, and this is what I will advocate now. Modeling the worst case, you are really saying that things can fail, and once something fails I have no idea what it will say: given this reported bit error rate, if the detector failed, I have no idea what is going to happen. The goal is to perform good failure detection; in the networking literature you can think of this as overcoming, say, k points of failure. It's k arbitrary failures, and once you say arbitrary, it is implicitly adversarial. Now I can start describing the model, first in a picture and then in a mathematical way.
We have a true state; think of y as being Boolean, either spam or not spam, failed or not failed. Given the true state of the world we see some kind of observation: these are our spam detectors or our network detectors. The new feature we are going to add is an adversary. The adversary can modify the things that we observe and output something different. We'll discuss two models. In the first model, which is the simpler one, the adversary picks a modification rule in advance; the main difference between the models is when the adversary commits to the corruption. The adversary is going to be restricted: it is restricted to a finite set of modification rules from which it can select how to create the corruption. In the static model it has to select one modification rule before seeing anything. In the adaptive model it can see the observation and, given the observation, select a modification rule. >>: So what is the point, in the adaptive model, of saying that it's a rule? He might as well just select a [indiscernible] >> Yishay Mansour: Because I wanted a unified notation. >>: But generally [indiscernible] looks at the [indiscernible] >> Yishay Mansour: Yes, I agree. We have a binary state of nature. We have a set of signals, say [indiscernible] vectors, and then we have a joint distribution over the state of nature and the [indiscernible]. This is a joint distribution with [indiscernible]. What we really need is for it to be computable: given the x's and y, we can compute those probabilities. We also need it to be sampleable; this is a minor issue. When I say that this is computable, I am in a sense saying I can solve the inference problem, because what is the inference problem? Given the x's, I plug in y equals zero and y equals one, get the probabilities, and then I can find the conditional.
By saying that things are computable I am hiding the fact that I'm assuming the problem without corruption is solvable, which naturally I should assume. I want to solve something stronger. We have another set of observable signals, z. I like to stay with modification rules, so a modification rule maps the state of the world and an input to an observed value. We assume that the rules run in polynomial time, because we would like to simulate them, and we assume that their number is bounded: m will be the number of modification rules. Here are two examples. One example is that you can flip a single bit: rule i flips the i-th observation. This is like fooling one filter in spam detection. But it doesn't have to be a small change; you can also think of a modification rule that flips all the odd signals. In the [indiscernible] setting, with one rule you can have a huge cascading effect. What is the goal? The goal is: observe z, predict y. >>: So the adversary picks one of the m modification rules and applies it? >> Yishay Mansour: Yes. We will get to this slide again. The predictor is given z to predict y, and implicitly defines a policy: given z, output a prediction. It could be probabilistic, and it will be. The adversary selects a modification rule; in the static case it is selected before the realization, in the adaptive case after the realization. We can think of this as a zero-sum game between the predictor and the adversary. Once we fix the policy and the modification rule, we can compute the error very simply. We would like to look at the min-max error, which is the error of the best policy: let the predictor move first and choose a policy, which can be probabilistic, and then the adversary, given the policy, selects the worst-case modification rule. What will also be interesting is epsilon-optimality, meaning that the error rate is at most epsilon above the optimal error rate.
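The two example modification rules just mentioned (flip a single bit; flip all odd signals) can be sketched as follows. This is a minimal illustration under my own encoding of signals as bit tuples; none of the names come from the talk, and for simplicity the rules here depend only on the observation x, not on the state y:

```python
# Signals are bit tuples; a modification rule maps an observation x
# to a corrupted observation z. (Encoding and names are mine.)

def flip_bit(i):
    """Rule that flips the i-th observed signal (fools one filter)."""
    def rule(x):
        z = list(x)
        z[i] ^= 1
        return tuple(z)
    return rule

def flip_odd(x):
    """Rule that flips every odd-indexed signal (a cascading effect)."""
    return tuple(b ^ (j % 2) for j, b in enumerate(x))

rules = [flip_bit(0), flip_bit(1), flip_odd]   # a finite set of m rules

x = (1, 0, 1, 0)
print([r(x) for r in rules])
# → [(0, 0, 1, 0), (1, 1, 1, 0), (1, 1, 1, 1)]
```

The adaptive adversary discussed below sees the realized x (and y) before choosing which of the m rules to apply; the static adversary must commit to one rule in advance.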
Our main result for the inference problem is the following. Given the observable signals, we can compute a prediction for y which is near-optimal and efficient, both in the static case and in the adaptive case. There is going to be a dependency on the number of modification rules: in one case it is polynomial, in the other exponential. I'm going to talk today only about the adaptive case. We also have a bunch of results regarding learning. What do I mean by learning? In the inference setting we don't restrict how we make a prediction. In learning you usually have a finite class of hypotheses and need to select one of them. In this learning setting we start with clean training data and would like to build a good predictor for corrupted test data. We can show that, given a sample, we can find a near-optimal predictor in polynomial time given an oracle for risk minimization, and we show a generalization bound, which shows this is good. In the rest of the talk I will talk about the adaptive case, which I think is interesting and which gives a very interesting relationship to local algorithms. Before I go there: when does it make a difference to think about robustness? What should we expect of a robust predictor even before looking at this slide? We should expect it not to put too much weight on anything specific. Here is how it turns out in the regular learning setting. Think about learning a hyperplane. Usually we like to minimize the error, so we find the hyperplane that minimizes the number of mistakes; you can add a margin like in the SVM setting. Now let's think about a simple adversary that can fix one of the attributes to zero. For simplicity, think of the uniform distribution. A static adversary is going to zero out the attribute with the highest weight; this is the thing that would cause the most error.
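To make the static-versus-adaptive distinction concrete, here is a hypothetical linear-predictor example (the toy weights and point are mine, not from the talk): the static adversary commits in advance to zeroing the highest-magnitude coordinate, while the adaptive one, after seeing (x, y), zeroes the coordinate contributing most strongly toward the correct label:

```python
# Sketch: a linear predictor sign(w . x) facing an adversary that may
# zero out one attribute. All numbers here are a made-up illustration.

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def static_attack(w):
    # Static: commit to one coordinate in advance -- the largest |weight|.
    return max(range(len(w)), key=lambda i: abs(w[i]))

def adaptive_attack(w, x, y):
    # Adaptive: after seeing (x, y), zero the coordinate that pushes
    # the score most strongly in the *correct* direction.
    return max(range(len(w)), key=lambda i: y * w[i] * x[i])

w = [3.0, -2.0, 0.5]
x = [0.0, 1.0, 1.0]
y = -1                        # true label
i = adaptive_attack(w, x, y)  # coordinate 1, not the globally largest one
x2 = list(x)
x2[i] = 0.0
print(static_attack(w), i, predict(w, x), predict(w, x2))
# → 0 1 -1 1   (the adaptive attack flips a correct prediction)
```

Note how the static attack wastes its corruption here (coordinate 0 is zero in x anyway), while the adaptive attack causes an error, matching the "adaptive margin" intuition above.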
An adaptive adversary is going to zero out the highest-weight attribute that is predicting in the right direction. When you compare the two, it almost looks like a margin, but it's an adaptive margin: the margin depends on the weights. This is in line with our intuition that we don't want to put too much weight on any single attribute. Now I can talk about the adaptive adversary. Let's try to model everything using graphs, and then the question becomes much more combinatorial. We have two graphs. The first graph is a corruption graph. Here we don't need the modification rules anymore; we just need to say how x maps to z. Given a point x, we have the mapping from x to the possible z's that the adversary can select. This is going to be a bipartite graph from x to z, with bounded degree. The bounded degree on the x side comes from the fact that we have a bounded number of modification rules, and we will need to assume the same bound on the in-degree on the other side. The second graph is an interference graph, and here we also put weights on the nodes according to the distribution. Since we are talking about Boolean functions, we can split the inputs into two parts, the positives and the negatives. Now we build a different bipartite graph, which connects a negative point to a positive point if there exists some observable signal z that they can both be mapped into. So we get a bipartite graph, and we will also have weights on the nodes. Note that the weights are on both sides, because those are the x's, and the weight of a node is just its probability according to the joint distribution. Let's try to think what an adversary policy would look like. For simplicity let's assume that it is unweighted, and then we will get back to the weighted case. What is a possible adversary policy?
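The two graphs can be built explicitly on a toy instance. This sketch is my own illustration (the encoding, the toy rules, and the labeling function are assumptions, not the talk's): it first computes the corruption graph as a map from each x to its possible observations z, then connects a negative x to a positive x' whenever their image sets intersect:

```python
from itertools import product

def interference_graph(n, rules, label):
    """Connect a negative x to a positive x' iff some rules map both
    to the same observable z (so they 'interfere' at that z)."""
    points = list(product([0, 1], repeat=n))
    # corruption graph: x -> set of z's the adversary can show for x
    images = {x: {r(x) for r in rules} for x in points}
    edges = set()
    for x in points:
        for xp in points:
            if label(x) == 0 and label(xp) == 1 and images[x] & images[xp]:
                edges.add((x, xp))
    return edges

# Toy example: the label is the first bit, and the adversary may
# either do nothing or flip bit 0.
rules = [lambda x: x, lambda x: (1 - x[0],) + x[1:]]
E = interference_graph(2, rules, label=lambda x: x[0])
print(sorted(E))
# → [((0, 0), (1, 0)), ((0, 1), (1, 1))]
```

Here every negative point interferes with exactly one positive point, so the interference graph is a perfect matching of the state space, the worst case for the predictor.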
Let's do a matching. We build a matching in the interference graph. A possible policy: when x is realized, we map it toward x prime if the edge (x, x prime) is part of the matching. This implies that every matched edge guarantees exactly one error: regardless of how we predict, we will have one error per matched edge, which means that the error rate is at least the size of the maximum matching over n, where n is the number of points. This is just one possible adversary policy. In the nonuniform case we can do something very similar: we basically do a fractional matching, and we get something very similar, just probabilistic. I will skip over this slide; just trust me that you can do it. Now I want to talk about the predictor policy. What the predictor does is build a vertex cover. In this example the yellow nodes are the vertex cover. Given the vertex cover, let's see how we predict. Given some z, we go back to the corruption graph and look at the pre-images of z; if this is z, we look at those three nodes. We ignore the nodes which are in the vertex cover, pick any one of the remaining nodes, and predict according to it. The important point is that the pre-images of any z, once you eliminate the nodes in the vertex cover, cannot contain both a positive and a negative point. This means that the error rate is at most the size of the vertex cover over n. So now we have an optimal policy, at least in the uniform distribution case, because in a bipartite graph the minimum size of a vertex cover equals the maximum size of a matching, and therefore the two bounds are identical. This implies that the optimal predictor policy is deterministic. I didn't do the nonuniform case, but it would be standard [indiscernible]; nonuniformly, the adversary policy would be randomized. This gives us a polynomial-time algorithm, but don't be confused.
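In the uniform case, the adversary's lower bound (maximum matching size over n) can be computed with a standard augmenting-path matching on the interference graph. A sketch with a toy graph and names of my own; by König's theorem the same number equals the minimum vertex cover size, which is what makes the predictor's upper bound match it:

```python
# Maximum bipartite matching via augmenting paths (toy instance mine).

def max_matching(neg, pos, adj):
    """adj: negative node -> list of positive neighbors.
    Returns a dict mapping each matched positive node to its partner."""
    match = {}                           # pos node -> matched neg node
    def augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its partner can be re-matched elsewhere
            if v not in match or augment(match[v], seen):
                match[v] = u
                return True
        return False
    for u in neg:
        augment(u, set())
    return match

adj = {"n1": ["p1"], "n2": ["p1", "p2"]}
M = max_matching(["n1", "n2"], ["p1", "p2"], adj)
n = 4                                    # total number of points
print(len(M), len(M) / n)               # adversary forces error rate |M|/n
# → 2 0.5
```

With this matching the adversary guarantees one error per matched edge, and a vertex cover of the same size gives the predictor a matching guarantee from above.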
Polynomial time here is not good, because it is polynomial in the size of the state space, and that is not what we want. We want something that runs polynomially in the dimension. Think of x in {0,1} to the n; we want to run in time polynomial in n. If we want to run in time polynomial in n, clearly we cannot even look at all of the x's. This is where local algorithms come in very nicely, and it gives an incredibly interesting motivation to look at local computations. First of all, what is a local algorithm? Here is the setting that local algorithms usually look at. You are given a problem, let's say matching in a graph. With the problem comes a set of queries; for matching, you can think of being given an edge and asked whether the edge is in the matching or not. The idea is that you would like to reply to specific queries fast, in polylogarithmic time, and you would like no pre-computation or storage, or at least I will say later how I am cheating on storage. But this is what you would really like. Eventually you could query all of the edges, so you would like the outputs to form a feasible solution. That alone is easy, because I can always say no edge is in the matching, but we would like a near-optimal one. You can get a local matching algorithm, which I will try to sketch later; it runs in polylogarithmic time, where the notation hides the dependency on epsilon and [indiscernible], and that's very good. We will need some kind of randomness, so I'm going to use some fixed storage to store a seed of randomness, but I would claim this is not really pre-computation; it's more like a pseudorandom seed that I need later in order to get the randomness while working. This takes me to the local matching algorithm. How can you get a local matching algorithm without inspecting the entire graph and still get a near-optimal matching? This is the main observation, which is [indiscernible].
When you look at matching you look at the augmenting paths. The augmenting paths if they are known edges not in the matching and augment between the in and the and another edge not in the matching. Hope and Kraft [phonetic] in '73, with the show it is if there is no augmenting in path then it is an optimal matching. This is why not. But they showed that if there is no short augmentation path then you are near optimal. What does it imply locally? Think that if you sort of fix this k to be 1 over epsilon you would get a 1 minus epsilon matching if you can check that there are no local augmentation paths. There is a distributed algorithm based on this idea and gets a near optimal matching and what it really does is walk in phases and sort of checks whether there are augmentation paths. It basically takes the maximum set of augmentation paths edit two of the matching and the new matching is guaranteed that it doesn't have any short augmentation path. When you get a local matching you can simulate a distributed algorithm but you need to be careful. You recursively check whether an edge in the matching of k and in the augmentation of k and if it is in one or the other it will remand to the next state. The main thing is to go and how do you build this set Ak which is jumping over. This gives us the, by doing this building in the correct way and you are careful you can get the polylogarithmic running time. >>: How does the consistency work? >> Yishay Mansour: You are really simulating. You have like 1 over epsilon phases. In each phase you are saying phase 1 is augmentation path of length one and [indiscernible] of length three, length five. And each phase you are really generating a matching. What you need now to unravel this is given an edge you want to say is it in the matching of number five. 
An edge is in the matching at phase five if it was in the matching at phase three and was not augmented, or if it was augmented at phase three and so on; you need to recurse down, and the blow-up is exponential, but the exponent is only 1 over epsilon. >>: A distributed algorithm that runs for k steps is k-local because it can be simulated, right? You gather everything from distance k and that is exactly the k-locality [indiscernible] locality, right? >> Yishay Mansour: The point is that algorithms simulated in this distributed way use much more computation per node; the question is why I am carrying very long histories, because they are carrying the histories. I hope to get to the finish in the 1 or 2 minutes that I have. This was matching, and matching is really great if you want to be the adversary; but we want to be the good guys, we want to be the predictors, so we need to go from matching to vertex cover. There are standard ways to go from matching to vertex cover. The standard way is: given a maximum matching, you can define a vertex cover in the following way. Take the unmatched nodes, these are level 0; for each node compute its level, the alternating-path distance from the unmatched nodes, and take the odd levels. This is going to be a minimum vertex cover. The problem is we don't have a real maximum matching; we have an approximate matching, and an approximate matching does not translate well under this vanilla procedure. We need to do a tiny perturbation of it, and for the perturbation we really need two randomizations. One is to take a random cutoff point: rather than going all the way, rather than counting up to 1 over epsilon, take a cutoff somewhere in between. The other thing we need is to select one of the two sides at random. Given this, we will be able to do it.
If the alternating-path distance to v is less than r, we use the level rule, and if it is more than r, we put v in the vertex cover only if it is on the side that we selected. >>: [indiscernible] >> Yishay Mansour: Right. It's always a vertex cover, and in expectation it is going to be near-optimal. So let me put everything together before I wrap up. What is the algorithm? We want to predict given a certain observation. We compute the pre-images of the [indiscernible]; those are the x's with [indiscernible]. Now we want to find some x that is not in the vertex cover and predict according to it. How do we know which of those x's is in the vertex cover? For each of them we need to compute whether it is in the vertex cover or not. Basically, we test whether it is part of the matching; if it is part of the matching, we need to look at the alternating path from it to a free node and see whether it is part of the vertex cover. I am over time, so let me just conclude. The main objective was to convince you that robust probabilistic inference is an interesting question. I didn't talk about the static adversary; I did talk about the adaptive adversary, and what was, at least for me, very interesting and surprising about the adaptive adversary is that it connects in a very natural way to local algorithms. [indiscernible] I suspect many inference problems can be cast as local algorithm questions. What is very nice for me is that we got very natural graph-algorithmic problems, matching and vertex cover in bipartite graphs, which is very good because we know how to solve them; given those, we can derive the inference algorithm. I didn't talk about the learning; I will just leave it here. And I am done. >> Yuval Peres: Questions, comments? >>: [indiscernible] people like [indiscernible] on these local algorithms and matching and the vertex cover, how are they related?
>> Yishay Mansour: The algorithm that I mentioned here [indiscernible] sort of generates a maximum matching. I think what they did is that they didn't generate something that is maximal, but they did generate something which is still near-optimal. There is also follow-up work by other people that gets the matching part to be deterministic rather than randomized. But there is one thing: when I worked on it, it took some time to realize that it doesn't work. While you can get a good approximation on the matching side, I wasn't able to get a direct approximation on the vertex cover. You can write greedy algorithms for the vertex cover that will get the optimum if you run them to the end, but somehow stopping them [indiscernible] looks like a very bad solution. >> Yuval Peres: Any other questions? If not, thank Yishay again. [applause]