>> Yuval Peres: We are ready for the talk. If any of you have taught a class where students had to submit joint homework, you will understand what I was feeling when I saw my speakers suddenly starting to collaborate with each other. We are very happy to have Yishay tell us about robust inference and local algorithms.
>> Yishay Mansour: Thank you very much, Yuval. Many thanks to my co-authors, Uriel Feige, Aviad Rubinstein, Robert Schapire, Moshe Tennenholtz and Shai Vardi. They are more than happy to answer any questions; this is why you bring co-authors to a talk. Here is the outline of the talk. Mainly I would like to convince you that robust inference is an interesting idea that raises good questions. We will start by giving a few motivating examples and the model, then describe the main results, and hopefully we will have enough time to give a sample of our results. As a warning for those of you who report to Jennifer: I did give a similar talk in both places, but it's only similar up to the model. After the model the results are different, so do not tune out. [laughter].
>>: [indiscernible]
>> Yishay Mansour: They probably predicted [indiscernible]. Okay. When I say robust probabilistic inference, what do I mean? Like in Hebrew, let's read it from right to left. When we talk about an inference problem, it's a [indiscernible] to an event. You see something about the world and you want to deduce something about what the state of the world was. When I say probabilistic, I am saying that the underlying events have a joint probability distribution. Those of you from the machine learning side, just think about a Bayesian network. A Bayesian network is a way to describe a joint distribution, and what you are doing is observing the values of some nodes and predicting the values at other nodes. What's new here is robustness, because the other two are well known. The way to think about robustness is as the possibility that what you observe is not what really happened, that what you observe has been corrupted by someone. Let me give two motivating examples. The first one is spam detection. When you think about spam detection, there is a huge number of filters. Each filter
detection. When you think about spam detection there is a huge number of filters. Each filter
works on a tiny little bit of information trying to find out maybe who sent it, how this e-mail
structured, keywords like Viagra, key things like please enter your login et cetera. Given all this,
the idea of spam detection is to collect all this information, have those various filters work and
then deduce from them whether the e-mail is a spam or not. But when you think about spam
detection you realize it's more like a game because I am the one hand side, I guess the good guy
side, the one that is trying to detect if it's spam or not. But on the other hand the spammers
are also playing the game. In some sense when you set up a spam detector you leave it running
by itself for a very long time, it's very likely that spam will get through it because spam will be
able to overcome more and more of the filters. If you think about the spammer can adjust the
content to detect those we would like to think that in the short run they probably would be
able to fool a few detectors. In the long run, they will probably be able to fool a huge number
of detectors. So what is our goal in robustness? The goal is to classify the e-mail correctly even
if the spammer can adversarially corrupt a few of the detectors. By corrupt, think of he wrote
the word with a spelling error so they have all of those tricks to get around the filters that you
did put, so let's think that they can fool a few numbers of the detectors and we would like to be
able to overcome it. Yes, question?
>>: Do the spammers here know that they are spammers? Because most spammers are people who don't realize it. [laughter].
>> Yishay Mansour: So you mean people that send us, like, the weekly MSR. [laughter]. No comment. Another setting, a real setting I like to think about, is network failures, because I'm familiar with networks. A network is something very complex, and in every network you have a huge number of failures, even though in theory we like to think that everything works. In order to manage the failures, what really happens is that you put in detectors; in order to solve the failure problems you are putting more hardware into the network. For example, you are trying to detect whether a link is operational or not, whether the bit error rate on a link is high or low. But those detectors can fail. When we say fail, most of you probably think it stops working, but if it stops working, that is the easy case. The hard case is when you have something that is supposed to measure the bit error rate and it reports that the bit error rate is 0.01. It could really be 0.01, in which case you have a slightly faulty link, or it could be that the detector went bad. How can you model such a thing? Things are going to fail, especially if you have a huge network like Microsoft's, and you can't really model it in a Bayesian sense. What would it mean to model it in a Bayesian sense? It would mean that you assume a model of what happens in a failure. On the other hand, you can try to solve it in the worst case, and this is what I would advocate now. Modeling the worst case, you are really saying that things can fail, and once a component fails, I have no idea what it will say. Given this sort of bit error rate detector, if it failed, I have no idea what it is going to report. The goal is to perform good failure detection; in the networking literature you can think of this as overcoming, say, k points of failure. If k arbitrary components fail, then "arbitrary" is implicitly adversarial. Now I can start describing the
model, first in a picture and then in a mathematical way. We have a true state: think of y as being Boolean, either spam or not spam, failed or not failed. Given the true state of the world, we see some kind of observation; these are our spam detectors or our network detectors. The new feature which we are going to add is an adversary. The adversary can modify the things that we observe and output something different. We'll discuss two models; the main difference between them is when the adversary gets to see the realization. The adversary is going to be restricted: restricted by having a finite set of modification rules from which he can select how to create the corruption. In the static model he has to select one modification rule in advance, before the realization. In the adaptive model he can see the observation and, given the observation, select a modification rule.
>>: So what is the point, in the adaptive model, of saying that it's a rule? He might as well just select a [indiscernible].
>> Yishay Mansour: Because I wanted a unified notation.
>>: But generally [indiscernible] looks at the [indiscernible]
>> Yishay Mansour: Yes. I agree. We have a binary state of nature. We have a state of signals
just like [indiscernible] vectors and then we have a joint distribution over the state of nature
and the [indiscernible]. The sequence is a trial distribution with [indiscernible]. What we really
need it to be computable basically given the x's and y we can compute those probabilities. We
also need it to be sampleable. This is a minor issue. When I say that this is computable in a
sense I'm saying I can solve the inference problem, because what is the inference problem?
The inference problem is given the x's I need to plug y zero and y one, get the probability and
now I can find the condition. By saying that things are computable I am hiding the fact that I'm
assuming that the problem without co option is solvable which naturally I should have. I want
to solve something stronger, another set of observable signals. I like to stay with modification,
so modification will sort of map the state of the world and an input will give us an observed
value. We assume that it's done in polynomial time because we would like to simulate them
and we assume that it's bounded [indiscernible]. m would be modification rules. Here are two
examples. One example is that you can flip a single bit, i flips the I observation. This is like in
spam detection putting one filter. It doesn't have to be a small number. You can also think of a
modification rule that flips all the odd signals. In the [indiscernible] you can with one filter you
can have a huge cascading effect. What is the goal? The goal is observed z predict y.
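The two example modification rules just mentioned can be sketched in a few lines of Python; the function names and the 0/1-list encoding of signals are illustrative, not from the talk:

```python
# Two example modification rules from the talk: flipping a single
# observation, and flipping all odd-indexed observations.
# Signals are encoded as 0/1 lists; the names are illustrative.

def flip_bit(i):
    """Rule that flips only the i-th observation (one filter fooled)."""
    def rule(x):
        z = list(x)
        z[i] ^= 1
        return z
    return rule

def flip_odd(x):
    """Rule that flips every odd-indexed observation (cascading effect)."""
    return [b ^ 1 if j % 2 == 1 else b for j, b in enumerate(x)]

x = [1, 0, 1, 1, 0]        # true signals
print(flip_bit(2)(x))      # -> [1, 0, 0, 1, 0]
print(flip_odd(x))         # -> [1, 1, 1, 0, 0]
```

With m such rules, the adversary's power is exactly the choice of which rule to apply, before or after seeing the realization.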
>>: So the adversary picks one of the m modification rules and applies it?
>> Yishay Mansour: Yes. We will come back to this slide again. The predictor is given z to predict y, and implicitly defines a policy: given z, output a prediction. It could be probabilistic; in fact it will be. The adversary selects a modification rule: in the static case it is selected before the realization, in the adaptive case after the realization. We can think of this as a zero-sum game between the predictor and the adversary. Once we fix the policy and the modification rule, we can compute the error very simply. We would like to look at the min-max error, the error of the best policy: let the predictor move first and choose a policy, which can be probabilistic, and then the adversary, given the policy, selects the worst-case modification rule. What will also be interesting is epsilon-optimality, meaning that the error rate is at most epsilon above the optimal error rate. For the inference problem, our main result is the following. Given the observed signals, we can compute a prediction for y which is near optimal, and do so efficiently, both in the static case and in the adaptive case. Really, there is going to be a dependency on the number of modification rules; in one case it's going to be polynomial and in the other it's going to be exponential. I'm going to talk today only about the adaptive case. We do have
a bunch of results regarding learning. What do I mean by learning? In the inference setting we don't restrict how we make a prediction. In learning you usually have a finite class of hypotheses and you need to select one of them. In this learning setting we start with clean training data and would like to build a good predictor for corrupted test data. We can show that, given a sample, we can find an optimal or near-optimal hypothesis in polynomial time given an oracle for risk minimization, and then show a generalization bound.
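As a toy illustration of the min-max error defined earlier, both adversary types can be brute-forced on a tiny instance. Everything here (the distribution, the majority-vote policy, the rule set) is made up for illustration; the check only shows that the adaptive adversary, who sees the realization first, is at least as strong as the static one:

```python
# Brute-force check on a toy instance: an adaptive adversary (picks a
# rule after seeing the realization) does at least as well as a static
# one (commits to a single rule in advance). All names are illustrative.
from itertools import product

n = 3
points = [list(x) for x in product([0, 1], repeat=n)]  # uniform over {0,1}^3
label = lambda x: int(sum(x) >= 2)                     # y = majority of x

# Modification rules: identity, or flip one coordinate.
rules = [lambda x: x] + [
    (lambda i: lambda x: [b ^ 1 if j == i else b for j, b in enumerate(x)])(i)
    for i in range(n)
]

policy = lambda z: int(sum(z) >= 2)   # predict majority of the observed z

def err(x, rule):                     # 1 if the prediction is wrong
    return int(policy(rule(x)) != label(x))

# Static: one rule for the whole distribution.  Adaptive: best rule per x.
static_err = max(sum(err(x, r) for x in points) / len(points) for r in rules)
adaptive_err = sum(max(err(x, r) for r in rules) for x in points) / len(points)
print(static_err, adaptive_err)
```

On this instance the adaptive error is strictly larger, matching the intuition that the adaptive case is the harder one.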
Basically, in the rest of the talk I will talk about the adaptive case, which I think is interesting, and it gives a very interesting relationship to local algorithms. Before I go there: when does it make a difference to think about robustness? What should we expect of robustness even before we look at this slide? We should expect not to put too much weight on anything specific. Here's how it turns out in the regular learning setting. Think about learning with a hyperplane. Usually we minimize the error, so we find the hyperplane that minimizes the number of mistakes; you can add a margin like in the SVM setting. Now let's think about a simple adversary that can fix one of the attributes to zero. For simplicity, think of the uniform distribution. A static adversary is going to zero out the attribute with the highest weight; this is the change that would cause the most error. An adaptive adversary is going to zero out the highest-weight attribute which is pointing in the right direction. When you compare the two, it almost looks like a margin, but an adaptive margin: the margin depends on the weights. This is in line with our intuition that we don't want to put too much weight on any single attribute. Now I can talk about the adaptive adversary. Let's try to
model everything using graphs, and then the question becomes much more combinatorial. We have two graphs. The first graph is a corruption graph. Really, we don't need the modification rules anymore; we just need to say how x maps to z. Given a point x, we have the mapping from x to the possible z's that the adversary can select. This is going to be a bipartite graph from x to z, with bounded degree. The bounded degree on the x side comes from the fact that we have a bounded number of modification rules, and we will need to assume the same bound on the degree on the z side. The second graph is an interference graph, and here we also need to put weights on the nodes according to the distribution. We are talking about Boolean functions, so we can split the inputs into two parts, the positives and the negatives. Now we build a different bipartite graph, which connects a negative point to a positive point if there exists some observable signal z which they can both be mapped into. So we get a bipartite graph, and we will also have weights on the nodes. Note that here the x's are on both sides, and the weight of a node is just its probability according to the joint distribution. Let's try to think what would
an adversary policy look like. For simplicity, let's assume the distribution is uniform, and then we will get back to the weighted case. What is a possible adversary policy? Let's build a matching in the interference graph. A possible policy is: when x is realized, map it toward x prime if the pair is part of the matching. This implies that every matched edge guarantees exactly one error: regardless of how we predict, we will get one error per matched edge, which means that the error rate is at least the size of the maximum matching over n, where n is the number of points. This is just one possible adversary policy. In the nonuniform case we can do something very similar: we basically do a fractional matching and get a very similar, just probabilistic, bound. I will probably skip over this slide; just trust me that you can do it. Now I want to talk about
the predictor policy. What the predictor will do is build a vertex cover. In this example the yellow nodes are the vertex cover. Given the vertex cover, let's see how we predict. Given some z, we go back to the corruption graph and look at the pre-images of z; if this is z, we look at those three nodes. The predictor ignores the nodes which are in the vertex cover, picks any one of the remaining nodes, and predicts according to it. The important point is that the pre-images of any z, once you eliminate the nodes in the vertex cover, cannot contain both a positive and a negative point. This means that the error rate is at most the size of the vertex cover over n. So now we have an optimal policy, at least in the uniform-distribution case, because in a bipartite graph the minimum size of a vertex cover equals the maximum size of a matching, and therefore the two bounds coincide. This implies that the optimal predictor policy is deterministic. I didn't do the nonuniform case, but it is standard [indiscernible]; also, nonuniformly the adversary policy would be sort of randomized. It does give us a polynomial-time algorithm, but don't be confused.
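The equality just used, that in a bipartite graph the minimum vertex cover has the same size as the maximum matching (König's theorem), can be checked by brute force on a toy interference graph; the graph below is made up for illustration:

```python
# Brute-force check of Konig's theorem on a toy bipartite interference
# graph: min vertex cover size == max matching size, so the adversary's
# lower bound and the predictor's upper bound on the error rate meet.
from itertools import combinations

# Negatives {0, 1, 2} on one side, positives {3, 4} on the other.
edges = [(0, 3), (1, 3), (1, 4), (2, 4)]
nodes = {u for e in edges for u in e}

def is_matching(m):
    used = [u for e in m for u in e]
    return len(used) == len(set(used))      # no shared endpoints

def is_cover(c):
    return all(u in c or v in c for u, v in edges)

max_matching = max(len(m) for k in range(len(edges) + 1)
                   for m in combinations(edges, k) if is_matching(m))
min_cover = min(len(c) for k in range(len(nodes) + 1)
                for c in combinations(sorted(nodes), k) if is_cover(c))
print(max_matching, min_cover)              # equal, by Konig's theorem
```

Enumeration is of course exponential; the point of the rest of the talk is precisely how to avoid touching the whole graph.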
Polynomial time here is not good, because it is polynomial in the size of the state space, and this is not what we want. We want something which really runs polynomially in the dimension. So think of x in {0,1}^n; we want to run in time polynomial in n. If we want to run in time polynomial in n, clearly we cannot read all of the x's. This is where local algorithms come in very nicely, and it gives an incredibly interesting motivation to look at local computations. First of all, what is a local algorithm? Here is a setting
that local algorithms usually look at. You are given a problem, say matching in a graph. With the problem comes a set of queries; for matching, you can think of being given an edge and asked whether the edge is in the matching or not. The idea is that you would like to reply to specific queries fast, in polylogarithmic time, and you would like no pre-computation or storage, or at least very little storage; I will say later how I am cheating. Eventually someone can query all of the edges, so you would like the outputs to form a feasible solution. Feasibility alone is very easy, because I can always say no edge is in the matching, but we would like a near-optimal one. You can get a local matching algorithm, and I will try to sketch it later; it runs in polylogarithmic time. The notation here hides [indiscernible] dependency on epsilon and [indiscernible], and that's very good. We will need some kind of randomness, so I'm going to use a fixed amount of storage to store a seed of randomness, but this, I would claim, is not really pre-computation. It's more like a pseudorandom seed that I need later in order to get the randomness as I walk. This takes me to the local
matching algorithm. How can you get a local matching algorithm without inspecting the entire graph, and still get a near-optimal matching? The main observation is [indiscernible]. When you look at matching, you look at augmenting paths. An augmenting path starts and ends at unmatched nodes and alternates between edges not in the matching and edges in the matching. Hopcroft and Karp, in '73, showed that if there is no augmenting path then the matching is optimal; more importantly, they showed that if there is no short augmenting path then you are near optimal. What does this imply locally? If you fix this k to be about 1 over epsilon, you get a (1 minus epsilon)-approximate matching if you can check that there are no short augmenting paths. There is a distributed algorithm based on this idea that gets a near-optimal matching. It works in phases and checks in each phase whether there are augmenting paths: it takes a maximal set of augmenting paths, adds them to the matching, and the new matching is guaranteed not to have any augmenting path of that length. Once you have the distributed algorithm, you can simulate it locally, but you need to be careful. You recursively check whether an edge is in the matching of phase k or in the augmenting set of phase k, and depending on the answer it carries over to the next phase. The main difficulty is how to build this set A_k, which I am jumping over. By doing this construction in the correct way, if you are careful, you can get the polylogarithmic running time.
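The Hopcroft–Karp fact driving all of this, that a matching with no augmenting path of length at most 2k-1 is within a factor k/(k+1) of the maximum, can be sanity-checked at k = 1: a maximal matching (no length-1 augmenting paths, i.e. no free edge can be added) is at least half of the maximum. The toy graph and insertion order below are chosen to hit the worst case:

```python
# Sanity check of the near-optimality fact behind the local algorithm:
# with no augmenting path of length <= 2k-1, the matching is within
# k/(k+1) of maximum.  For k = 1 this is a maximal matching, which is
# at least half of the maximum.  Toy path graph on 4 nodes:
edges = [(0, 1), (1, 2), (2, 3)]

def greedy_maximal(edge_order):
    """Maximal matching: afterwards no length-1 augmenting path remains."""
    matched, m = set(), []
    for u, v in edge_order:
        if u not in matched and v not in matched:
            m.append((u, v))
            matched |= {u, v}
    return m

m = greedy_maximal([(1, 2), (0, 1), (2, 3)])   # worst insertion order
opt = [(0, 1), (2, 3)]                          # the maximum matching
print(len(m), len(opt))                         # 1 vs 2: exactly half
```

Taking the middle edge first blocks both outer edges, so the k = 1 bound of one half is tight on this graph; larger k (longer forbidden augmenting paths) tightens the ratio toward 1.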
>>: How does the consistency work?
>> Yishay Mansour: You are really simulating. You have about 1 over epsilon phases. In phase 1 you take augmenting paths of length one, then [indiscernible] of length three, length five, and each phase generates a matching. To unravel this, given an edge, you want to say: is it in the matching of phase five? It's in the matching of phase five if it was in phase three and wasn't augmented, or if it was augmented in phase three, and so on, so you need to go down a recursion whose size is exponential, but the exponent is in 1 over epsilon.
>>: A specific algorithm that runs for k steps is k-local because it is simulated, right? [indiscernible] from distance k, and that is exactly the k-locality [indiscernible] locality, right?
>> Yishay Mansour: The point is that algorithms going this distributed way are using much more computation per node. The question is why I am carrying very long histories: because they are carrying the histories. I hope, in the 1 or 2 minutes that I have, to get to the finish. This is matching, and matching is really great if you want to be the adversary, but we want to be the good guys; we want to be the predictors. We need to go from matching to vertex cover. There are standard ways to go from matching to vertex cover. Given a maximum matching, you can define a vertex cover in the following way: take the unmatched nodes as level 0, then for each node compute its level of alternating distance from the unmatched nodes, and take the odd levels. This is going to be a minimum vertex cover. The problem is we don't have a real maximum matching; we have an approximate matching, and an approximate matching does not translate well under this vanilla procedure. We need a tiny perturbation of it, and for the perturbation we really need two randomizations. One is that we take a random cut-off point: rather than going until the end, rather than counting up to one over epsilon, we take a cut somewhere in between. The other is that we select one of the two sides at random. Given this, we will be able to do it. If the alternating path to v is shorter than r, we proceed as in the vanilla construction, and if it is longer than r, we put v in the vertex cover only if it is on the side that we selected.
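The vanilla matching-to-cover step just described, before the random perturbation, is the classical König construction: alternating BFS levels from the unmatched nodes, cover taken from where the search does and does not reach. A minimal sketch, with a hand-built toy instance and an adjacency-dict encoding that is my own choice:

```python
# Sketch of the vanilla matching-to-vertex-cover conversion (Konig):
# run an alternating BFS from the unmatched left nodes; the cover is
# the unreached left nodes plus the reached right nodes.
from collections import deque

def konig_cover(left, right, adj, match):
    """adj: left node -> list of right neighbors; match: dict both ways."""
    reached_l = {u for u in left if u not in match}   # level 0: unmatched
    reached_r, queue = set(), deque(reached_l)
    while queue:
        u = queue.popleft()
        for v in adj[u]:                  # non-matching edge to the right
            if v not in reached_r:
                reached_r.add(v)
                w = match.get(v)          # matching edge back to the left
                if w is not None and w not in reached_l:
                    reached_l.add(w)
                    queue.append(w)
    return (set(left) - reached_l) | reached_r

# Toy instance with a maximum matching given by hand (size 2).
left, right = [0, 1, 2], ['a', 'b']
adj = {0: ['a'], 1: ['a', 'b'], 2: ['b']}
match = {0: 'a', 'a': 0, 1: 'b', 'b': 1}
cover = konig_cover(left, right, adj, match)
print(cover)
```

With a merely approximate matching this construction can fail to cover some edges, which is exactly why the talk adds the random cut-off radius r and the random choice of side.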
>>: [indiscernible]
>> Yishay Mansour: Right. It's always a vertex cover, and in expectation it is going to be near optimal. So let me put everything together before I wrap up. What is the algorithm? We want to predict given a certain observed value. We compute the pre-images of the [indiscernible]; those are the x's with [indiscernible]. Now we want to find some x that is not in the vertex cover and predict according to it. How do we know which of those x's are in the vertex cover? For each of them we need to compute whether it is in the vertex cover or not. Basically, we test whether it is part of the matching; if it is part of the matching, we need to look at its alternating path to a free node and see whether it is in the vertex cover. I am over time, so let me just conclude. The main objective was to
argue that robust probabilistic inference is an interesting question. I didn't talk about the static adversary; I did talk about the adaptive adversary, and what was, at least for me, very interesting and surprising about the adaptive adversary is that it connects in a very natural way to local algorithms. [indiscernible] many inference problems can be cast as local algorithm questions. What is very nice for me is that we got very natural graph-algorithmic problems, matching and vertex cover in bipartite graphs, which is very good because we know how to solve them. Given those, we can derive the inference algorithm. I didn't talk about the learning; I will just leave it here. And I am done.
>> Yuval Peres: Questions, comments?
>>: [indiscernible] people like [indiscernible] worked on these local algorithms and matching and vertex cover. How are they related?
>> Yishay Mansour: The algorithm that I mentioned here [indiscernible] sort of generates a maximum matching. I think what they did is that they didn't generate something maximal, but they did generate something which is still near optimal. There is also follow-up work of other people which gets the matching part to be deterministic rather than randomized. But there is one thing: when I worked on it, it took some time to realize that it doesn't work directly. While you can get a good approximation on the matching side, I wasn't able to get a direct approximation on the vertex cover. You can write greedy algorithms for the vertex cover that will get the optimum if you run them to the end, but somehow stopping them [indiscernible] looks like a very bad solution.
>> Yuval Peres: Any other questions? If not, thank Yishay again. [applause]