>> Jin Li: Okay. Thank you very much for coming. Today we have an exciting talk by Professor
Devavrat Shah, currently Associate Professor in the EECS Department at MIT. His research interests are
statistical inference and network algorithms. He has received a number of awards, including the
IEEE INFOCOM Best Paper Award in 2004 and the ACM SIGMETRICS Best Paper Award in 2006. He received the
2005 George B. Dantzig Best Presentation Award from INFORMS. He is also the recipient of the
first ACM SIGMETRICS Rising Star Award in 2008 for his work on network scheduling algorithms.
Without further ado, let's hear what he has to say about inferring rankings under constrained
sensing.
>> Devavrat Shah: Thanks, Jin. Thank you for hosting and [inaudible]. Very happy to be here.
So what I'm going to talk about today is inferring popular rankings from partial or constrained
information.
The talk is going to be about learning distributions, say, over permutations, or rankings. But as the
talk proceeds through a few examples, I'll mention that the type of results we are
finding here is likely to extend to the more general setting of learning full information from
available partial information using some additional structure.
This is joint work with my student Srikanth Jagabathula at MIT. Again, feel free to stop me during
the talk, because I might be using a notation that may not be clear, and questions and answers
will be very helpful.
Okay. So to set up the context, let me start with the very basic motivation and very simple
examples. So here's the first example: an election scenario. Think of the presidential primaries, which were
not too far in the past, or maybe not too far in the future now.
Let's say we have many candidates in the race, and we want to figure out what the
popular rankings of these candidates are among voters. So think of candidates Obama, Clinton,
Richardson, and so on.
These are the candidates likely to be in the race. And every voter has, implicitly or
consciously or subconsciously, some form of ranking of these candidates in mind.
And we want to figure out what fraction of voters believe in each such ranking. What fraction of voters
rank Obama first and Richardson second and Clinton third. And so on.
If there are N candidates, the total number of such rankings will be N factorial. So for N equal to 4, this
is already a large number, 24, right?
And in general there are lots of candidates, and if you have lots of candidates, then there are far too many
possible rankings. If you went and asked people, could you please tell me whether you prefer this
ranking or that ranking or that ranking, it's impossible, and it's impossible because, A, I as an
individual may not be able to consciously put down those rankings, or I might not have an
incentive to tell you.
Or if you're standing outside Times Square and taking a poll, then the person coming out of the
subway has just three seconds to answer your question.
So in that situation the likely answers that you might obtain from people, say through polls, would
be of the following type. Do you prefer Obama or Hillary? Or who is your
favorite candidate? Or, a generalization of this would be, do you like Obama as your
third candidate? And so on.
So these are the types of partial information that will be available. And then, based on this, you
want to figure out what fraction of people believe in what rankings.
Let's take another example where a similar thing happens. Think of ranking teams in
sports leagues, for example a football league. There are N teams. Not all N teams are going
to play each other; some of them will play each other, and over time we'll collect the
different results.
And based on this we want to come up with a ranking. Now, first I want to argue that there may not
be one global ranking, because a team may be very strong, but let's say on one of the days its
quarterback is ill or injured; then the team may not perform as well as its
native strength.
So there may be some form of a distribution over rankings that is represented in the results
of the various games.
So what we want to do is, looking at the results of the games, that is, what fraction of the time Team X
defeated Team Y, figure out the distribution over rankings of the teams.
Any questions? I'm just going through simple examples, and hopefully the examples are making
sense. Okay. All right. So let's take a third example, and this is different from ranking,
and this is where my earlier comment applies, that some of the things we'd like to do are likely to extend to
a more general set of problems rather than just ranking.
This talk will only be about rankings. So, the network measurement case. Let's say you have a
network and the network has, let's say, N flows active. Flows are source-destination pairs, and
they're sending data between sources and destinations. The type of measurement that you're
gathering is aggregate information.
So for a given link you would know how much demand is passing through it, but you may not know
which flows are sending what demands, and this is primarily because there are quite a few flows
going through one link; you may not be able to measure them individually, it's too much to measure. So given this
kind of aggregate bandwidth information, you want to figure out which flows
are active and how much demand they're sending.
So all these questions, in a nutshell, are of the following type: there is some partial
information that's available, and it's constrained. Constrained because in the case of polling you're
getting answers of the form candidate X is preferred to candidate Y, and so on; and in network measurement,
the links only allow you to collect aggregate information. It's a constraint in that sense. But what we'd like
to do is infer the complete information, that is, the distribution over all rankings, based on such
comparison data and so on. Yes.
>>: Do something -- it's possible, it's a matter of situation and solution, you find the most, least
unstable [inaudible].
>> Devavrat Shah: Maybe there is some connection, but, yes.
>>: Let's say in your ranking thing there's like A, B and C, and A beats B, B beats C, and C beats
A, or something.
>> Devavrat Shah: The answer there is a paradox. The question is: if that type of situation
happens, then the following is not possible.
So the question is, what if the situation is A beats B, B beats C and C beats A? If you have
just that type of information, then there's no single ranking; a distribution over rankings with support
on one ranking is not possible.
And that's exactly what I want to argue: those types of situations arise because the hypothesis is
that the support is just one ranking. Here what I want to say is, when you relax
that requirement, get out of that mode of thinking, and now look for a more general answer where I'm
looking for a distribution over the space of rankings, then there will be a reasonable way to answer
that question, and that will actually alleviate some of those classical paradoxes that are in the
[inaudible] literature. That's a great question.
Any other questions? Okay. So the bottom line is that in all of these questions we want to learn
something from partial information. And of course, if I have a distribution which
is uniform over N factorial permutations and I'm given only comparison information, there's
not much I can do about it; I just have too little information. So the right question to ask is: given
this type of partial information, when is recovery possible? And if recovery is possible, how
can we recover it?
All right. Now let's put a little bit of formalism around it. Let's start with very simple examples.
Okay. So I'm thinking of, say, the case of an election with two candidates. Now two candidates can
be ranked in two ways. Candidate 1 is ranked in position one and candidate 2 is ranked in
position two; this is my notation, my graph, and this is how I'll use it throughout the talk. This is
one ranking. The other ranking is candidate 1 is ranked in position two and candidate 2 is ranked
in position one. It's a very simple thing.
Let's say a P1 fraction of people believe in this ranking, and a P2 = 1 minus P1 fraction of people believe in
this ranking.
So then, in the case of comparison information, the type of information we will get through polling will
be of the following type: what fraction of people rank 1 over 2? In this simple setup it would
be, assuming that we are choosing people at random, the probability that 1 is ranked
better than 2, which is P1 in this case.
And, similarly, here the fraction of people who believe 2 is better than 1, which is P2. Or, if I write it
in matrix form, this is the information that I have available.
This is the underlying information: a P1 fraction believes in this first permutation or ranking, and a P2 fraction
in the second ranking. This column corresponds to the first one, and this column corresponds to
the second one.
All right. So this is just the setup. More generally, if there are N candidates, you have N times N
minus 1 possible ordered pairs. Of course, half of them are redundant, because we're thinking of a
distribution: if I know P(i > j), then P(j > i) is 1 minus P(i > j),
right?
So this is the type of information we have available. And the total number of possible rankings,
the space over which this distribution lives, is N factorial.
So this is usually very, very long, and this is very, very short. Okay. So the matrix that you're
going to see is very thin. Okay. Let's see another example.
So here is a little more detail, and this is the type of example that I'm going to use throughout the talk
to explain the results. So, again, going back to the same situation: this is
ranking one, this is ranking two; a P1 fraction of people believe in this one and a P2 fraction of people believe in
this one. I'm going to generate this bipartite graph where every edge has a weight.
So, for example, this is the fraction of people who believe that candidate 1 should be ranked in
position 1, and in this case it's just this edge, because it's only this ranking that contains it.
So Y11 is nothing but P1, and so on.
And in this situation the information Y_ij is given by this matrix times
the original data. Again, as before, the columns of this matrix correspond to
the different permutations. Questions?
Okay. So here again you have N squared pieces of information available, and the total number of possible
points in this distribution, or its support, is N factorial. So we have this information
available and we are trying to recover this. It's a humongous task.
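To make the Y = AX structure concrete, here is a minimal sketch (my own illustration, not code from the talk) that builds the first-order marginal matrix from a hypothetical sparse distribution over rankings; the particular support and probabilities are placeholders.

```python
import numpy as np
from itertools import permutations

# Minimal sketch of the first-order marginal setup: a sparse distribution over
# rankings of N candidates, and the N x N marginal matrix Y with
# Y[c, p] = fraction of people who rank candidate c in position p.
N = 3
perms = list(permutations(range(N)))                  # all N! = 6 rankings
x = {perms[0]: 0.1, perms[2]: 0.2, perms[4]: 0.7}     # hypothetical sparse support

Y = np.zeros((N, N))
for sigma, p in x.items():
    for pos, cand in enumerate(sigma):                # sigma[pos] = candidate in that position
        Y[cand, pos] += p

print(Y)   # N^2 = 9 observations, versus N! = 6 unknowns (and N! >> N^2 for larger N)
```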
One more example quickly. This is the network measurement situation, where let's say we have
two links, and the possible sources and destinations are any of these N nodes: any node can be a source and
any of the other nodes can be a destination, so there are that many possible flows.
Each link provides you the aggregate flow that's passing through it. So, for example, here
through this link only flow one is passing, with amount P1; through this link it's P1 plus P2
that's passing. Those are the observations you're going to make, as Y1, Y2, and so on. And
again this is related to a matrix, whose columns correspond to flows and rows to links, times the flow rates, as follows.
Any questions? So, again, we have a situation where this is tall, this is thin, this is tiny, and
we want to recover this out of that.
Okay. So in a nutshell we have Y = AX, where X is, let's say, capital N dimensional; in the
case of rankings, N is N factorial. The observations are M dimensional; in the two examples I
gave you they are N squared dimensional. A is a 0-1 matrix, and A_mi is one if the m-th
component of the observation is contributed to by the i-th permutation, in a nutshell.
The question is: recover X based on the observation Y and the knowledge of A. It's like inverting. So
if this were big and this were small, it would probably be possible, but here it's the reverse. So clearly we
need to constrain the question further, as in general this is hard to answer.
So here's a classical approach. It's a philosophy [inaudible]: you have observations Y.
You can find various explanations, that is, Xs that are consistent with the observations. The one
you want to find is the simplest. One way to think of simplest is sparsest. That is, find an X that has the
smallest support; that is, find an X for which the number of positions with X_i not equal to 0, called the L0
norm, is smallest.
And why would this type of approach be reasonable? In many of the examples that I discussed; well, for
example, if you think of the election situation, you as a voter care about a few issues, and those
few issues define your ranking. Irrespective of the number of candidates, there are a few rankings that are
likely to dominate.
Similarly, in a sports league, you have teams, teams have native strengths, and most of the
time the teams are going to be ranked according to their native strengths. But because of some
uncertainties, like the quarterback example, the rankings that you'll observe, at least in the
results, will be some perturbation of these native rankings. Again, likely to be sparse.
In the network situation, it's likely there are only a few flows which are active, though there are
many possible flows. Okay.
So given this, the problem left is: we want to find the sparsest solution consistent with our
observations. And the question is: when does that sparsest solution represent the real situation?
Now, putting this in formalism: find a Z that is consistent with your observations, in the
sense that AZ = Y, and that has minimum support. Okay. This is the L0 optimization.
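Written out, the optimization just described looks as follows (this is my notation, not a verbatim statement from the slides; the nonnegativity constraint is natural here because Z represents a probability distribution, though the talk states only consistency with the observations):

```latex
\begin{aligned}
\underset{z}{\text{minimize}}\quad & \|z\|_0 \;=\; |\{\, i : z_i \neq 0 \,\}| \\
\text{subject to}\quad & A z = Y, \qquad z \ge 0 .
\end{aligned}
```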
This is a very classical approach, and there are many situations where people have used it. I'll
allude to two of them. One of them is the communication setup, the case of decoding.
So think of a code word or a message that you transmit over a noisy channel,
and at the other end you receive a noisy signal. So, let's say you transmitted N
bits and a few of the bits were flipped. So you receive a sequence of 0s and 1s, some of which are flipped,
and you want to figure out which ones are flipped.
Now, in the case of linear coding, essentially this corresponds to finding a vector, think of a
vector of 0s and 1s where 1s represent the positions where bit flips happened, and 0s otherwise,
that is consistent with the syndrome, that is, with the observation that you received. And this type of L0
decoding would correspond to taking the received signal, finding the closest code word to it, and
seeing how far away it is.
That's a quick indication of why these types of problems have been well studied in coding. More
recently, this has become central in the topic popularly known as compressed sensing,
which many of you will know. Here the question is, again: you have observations and you want to
figure out an X which has sparse support.
And in all of these situations, the primary question is of the following type: you want to design the
matrix A, in the context of coding you want to design the code, so that you will be able to recover X,
the original thing, based on the observations, as long as X is sparse, that is, the original X had few
positions where it was nonzero.
And in many of these cases, because of the way you design A, solving the actual L0 optimization
becomes easy, because in general this optimization is a hard problem.
So this has been the general philosophy. Now we have the same situation, Y = AX, but you
can't design A; A is given to you. And so those approaches are unlikely to work directly.
Let me just quickly give a small summary of the type of approach that's popular here, because it
can be informative for our purposes: even though we're not going to design A, the way you prove that
an A is good is by showing that A has a good property, and that good property is called, roughly
speaking, the restricted isometry property, or RIP. What it says is: you have a matrix A; if
you look at any K-subset of columns of this A, then they're essentially orthonormal, that is, they're
linearly independent and they preserve norms. It's like an identity.
And if such is the case, then you can recover X using this optimization very quickly. In our
case we cannot design A, but maybe we can ask the question: does A have this property?
Because if A has this property, then at least our quest is over.
Did I lose half of you?
>>: I thought compressed sensing uses random projections. So do random matrices naturally have
some property similar to this, or --
>> Devavrat Shah: Right. So one of the approaches to prove -- okay. So in compressed sensing,
or say linear coding, you do random linear coding or random projections. You come up with a
matrix A and then you prove that this matrix is going to be good because it has this
property.
So while we cannot design the matrix A, maybe our matrix, by its nature, has some such property and,
hence, it may lead to good answers.
Well, as it happens, the choices of A that we have don't have this RIP property, and to show that I'll
give you a simple counterexample that will explain it.
So let's look at this simple counterexample. There are four candidates, and we're thinking of marginal
information; that is, I have four candidates on the left and these are positions. This edge represents
candidate 1 ranked in position 1, and this number .5 says that a .5 fraction of people believe that
candidate 1 should be ranked in position 1, and this is the whole information that I have.
So this is my Y. Now, here's a simple counterexample, which says there are two ways to produce
this Y in which the support is 2, which is the minimum possible, and with equal probabilities.
So this ranking says that candidate 1 is ranked 1, candidate 2 is ranked 2, and so on. Here it
says 1 is ranked in position 2, 2 is ranked in position 1, and so on. Or the flipped one, where I take this on
this side and this on this side.
So now I've got two possible decompositions of this thing with minimal support, and L0 doesn't
have a unique solution and, hence, it's not going to be able to recover what you want.
So it says that A doesn't even have the 2-RIP property. So in some sense the best recovery that
we could hope for this way would be K equal to 1, which is rather a tautology, or trivial.
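A quick numerical check of this kind of counterexample (my own reconstruction with assumed rankings, not taken verbatim from the slide): two different support-2 distributions over rankings of four candidates that produce exactly the same first-order marginal matrix Y.

```python
import numpy as np

def marginals(dist, N=4):
    """First-order marginal matrix: Y[c, p] = P(candidate c is ranked in position p)."""
    Y = np.zeros((N, N))
    for sigma, prob in dist:              # sigma[p] = candidate placed in position p
        for p, c in enumerate(sigma):
            Y[c, p] += prob
    return Y

# Two distinct support-2 distributions with identical observations:
dist_a = [((0, 1, 2, 3), 0.5), ((1, 0, 3, 2), 0.5)]
dist_b = [((1, 0, 2, 3), 0.5), ((0, 1, 3, 2), 0.5)]

assert np.allclose(marginals(dist_a), marginals(dist_b))
print(marginals(dist_a))   # L0 cannot distinguish dist_a from dist_b given Y alone
```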
Okay. So that's one situation, just to drive the point home. Here is a situation in the network
measurement case: a three-link network, and you're going to measure load two on this
link, load one on this, and load one on this.
This can be produced two ways: you can have one flow going like this, across, and one flow here,
which will give you the same loading, or this is another assignment where it will happen.
So, again, it says that in the type of situation that we're going to see, it's impossible to expect some
good RIP-style property. So now the question is: well, we have this kind of situation, the
RIP property is not going to work; does that mean we won't be able to recover anything interesting?
So what can we do? What I'm going to do is give you a sufficient condition
under which it's possible to recover X using L0 optimization as the unique solution, and I'll give you,
under that condition, a very simple algorithm to recover that X. That's one thing. Now, that condition
could be totally bizarre and useless. What I'll try to argue in the second part is that, indeed,
that condition is not bizarre: if you take a natural random model, then
the conditions will be satisfied, and they will be satisfied in an optimal sense.
And I will explain that optimality later. But what this suggests is that, the way the RIP approach
goes, adversarially recovery may not be possible, but for most sparse situations recovery will be
possible. The counterexamples I showed you are actually very special.
If you lay down a uniform distribution over the space of sparse supports, then you will not
see those counterexamples, with high probability.
This is like in coding: there's a dichotomy between adversarial errors versus probabilistic
errors, where, for adversarial errors, for graph codes you need expansion [inaudible], but for
random errors you don't need expansion [inaudible]. It's roughly analogous to this.
All right. And then, finally, I'll discuss some further things. Here I will talk about specific types
of information, like comparisons or favorite candidates. But then what is a taxonomy of these types of
information? Is there a natural way to ask for all sorts of information? And depending on the amount
of information you get, how does recoverability increase? What's the right trade-off between information
and recoverability? I'll discuss those connections here.
>>: Question.
>> Devavrat Shah: Yes.
>>: So can you go back to the previous slide, just the example. So here, right, I mean the answer
would be: you will give a .25 probability to each of these rankings, or even 0 probability to those
rankings, because that would be the correct answer?
>> Devavrat Shah: Yes .25 to this. .25 to this and the rest 24 minus 2. 0.
>>: Point two five to each of the four things, right?
>> Devavrat Shah: So one answer could have support 4. Answer one, with support 2, would be .5
probability to this, .5 probability to this, and 0 probability to the rest of the four factorial minus 2,
that is, 22 of them. Answer two would be the other one, similarly with support 2. And answer three could be
support 4: .25, .25, .25, .25, or other reasonable combinations.
>>: So you're trying to find an answer with support two?
>> Devavrat Shah: I'm trying to find an answer with the sparsest support, because that is the L0
minimization.
If the L0 minimization has a unique solution, and Y was generated by the thing I'm looking for, then I will have
proved that L0 minimization recovers what I'm looking for.
This is good. Is that helpful? Okay. So now let's move on. I'm going to tell you the sufficient
conditions; at first they may look bizarre, but hold on, there's a promise: they are actually satisfied by a
natural model.
So let me walk you through the sufficient conditions. The sufficient condition has two parts. This is the
sufficient condition for proving that the L0 optimization problem that I stated has a unique solution,
which happens to be the original X that I want, and there's an algorithm that will recover it very
easily.
The first condition is called unique signature. So let me just use an example to say what unique
signature is, and then we'll go back and read it again.
Okay. So let's do the following. Going back to the marginal setting with three
candidates: this is N squared, nine pieces of information. This edge says candidate 1 is ranked in
position 1, by a .9 fraction of people. And let's say the original underlying data is this.
Out of three factorial, which is six possible permutations, these three permutations are
popular, these three rankings occur among people. A .1 fraction of people believe in this, a .2
fraction of people believe in this, and a .7 fraction of people believe in this. The rest of the rankings
are not popular.
Now, unique signature says the following: the rankings in the support of this X, these
three rankings, have unique edges, and those unique edges are these pink edges. What do I
mean by unique edges? This edge is present, or this type of ranking occurs, only in this one and
not in the others.
What that means is that candidate 1 is ranked in position 2 only by this one, and not by the
other two nonzero-mass ones. Of course there will be other permutations, other rankings,
which rank 1 in position 2, but they have 0 mass, so essentially they're not visible.
Similarly, the second one ranks candidate 2 in position 2, and this one doesn't rank it that way and this
one doesn't rank it that way.
And this is the third one. That's what it means. This is unique signature. Or, in the more
general Y = AX setup, it says that for every i that's in the support of X, there
exists a row m in which A_mi equals 1 for this i, but for all other j with nonzero
support, A_mj equals 0.
In a nutshell, what it means is that when I apply A to this X, Y_m will be equal to X_i. So on one hand it
says, well, I just told you X_i, right? On the other hand, I haven't really told you which X_i it
is, because you don't know which Y_m it is, and you don't know which j's are in the support.
So all it says is there's a promise to be able to recover the information, but it's not obvious how to do it.
The other condition is linear independence, and it's essentially what we think of
as linear independence, but in a restricted sense.
That is, you look at all these X_i values and take any combination of them with coefficients c_i coming
from the integers -- sorry, rather, positive and negative integers -- with values between minus K naught and
plus K naught; no such combination should come out to zero. As a special case, an implication of this would
be the following: look at all the X_i that are not equal to 0, let's say I'm ordering them as the first 1, 2, 3
up to K naught, and take any partial sum over a subset of them. All of these partial subset sums are
distinct.
So, going back to this example, what are the possible partial subset sums? It's just this
element, or this element, or this element.
So these are .1, .2 and .7. Another subset is these two guys: .1 plus .2 is .3. Another subset is .7
plus .2, which is .9. A third subset is .7 plus .1, which is .8. And then there is 1, and all of these are
distinct numbers, right?
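As a concrete illustration of the two conditions, here is a minimal sketch (my own, assuming rankings are represented as tuples mapping positions to candidates) that checks them for a hypothetical sparse support:

```python
from itertools import combinations

def has_unique_signature(support):
    """Each ranking in the support places some candidate in a position that no
    other ranking in the support uses (a unique edge in the bipartite graph)."""
    all_ok = True
    for sigma in support:
        edges = {(c, p) for p, c in enumerate(sigma)}
        others = {(c, p) for tau in support if tau != sigma for p, c in enumerate(tau)}
        all_ok &= len(edges - others) > 0
    return all_ok

def has_distinct_subset_sums(probs, tol=1e-12):
    """All partial subset sums of the nonzero probabilities are distinct."""
    sums = sorted(sum(s) for k in range(1, len(probs) + 1)
                  for s in combinations(probs, k))
    return all(b - a > tol for a, b in zip(sums, sums[1:]))

support = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]   # hypothetical rankings of 3 candidates
probs = [0.1, 0.2, 0.7]
print(has_unique_signature(support), has_distinct_subset_sums(probs))   # True True
```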
So this is the situation, and when you have these conditions satisfied, then we will have an
algorithm and also these conditions will be satisfied under the random model that I'll describe.
Now I will tell you the algorithm. So let's go back to this example. Here's what I will do. First I
will look at the data I have. So, remember, we have this data. We don't know how many
underlying permutations there are that have generated it, we don't know their values, and
we don't know which ones they are. But we want to figure out all of that: there are three
permutations generating this, these permutations are precisely these, and their probabilities are
.1, .2 and .7.
To do that, I'll take this data and I will sort it. So .1 is the smallest number. I'll take the
smallest number first and look at the edges with the smallest number. Now, based on the
unique signature property and the distinct subset-sum, or linear independence, property, I will argue that
.1 must be the probability of the permutation with the smallest probability. Because, okay, let's
think: since we know that Y is of the Y = AX type, and A is 0-1, each Y value, the value on each edge, is
a sum of a few probabilities, P_i1 plus P_i2 up to P_ik.
Since there is a unique signature, what that means is there exists an edge whose value is exactly
equal to P1, which is the smallest.
So that means that if I sort this whole thing, then the smallest numbers will be the ones which
are equal to P1, and those are the edges which belong to the permutation with the smallest probability.
So far so good. We just got started, right?
Is this clear? I really hope that I will be able to convey this algorithm. Because if I don't then I
think it will be a failure of the talk.
So please feel free to ask me questions.
>>: Since it's unique, you can [inaudible].
>> Devavrat Shah: Correct. Good. So now what you'll do, as you said, is the following. We know that
these two guys here, these two edges with .1 probability, must belong to the first
permutation. So here's what we do: we generate the first permutation -- of course, here it's fully
written out, but in general we only put down the edges that we have learned -- and we assign it the
probability.
So we just got started. Now let's try to repeat this type of step. We remove those two
edges from the graph and look at the rest of the edges with the rest of the weights. Look at the
second smallest now. The second smallest here is .2.
So .2 -- there are two possibilities. It could be from P1, but of course this is not P1. Or it can be P1
plus P2, or P2 itself. Of course P1 plus P2 would be bigger than P2, so this must be coming from P2.
Okay. So by the same argument we create another permutation with probability .2. Let's sort
again. The third value is .3. Now, .3 is P1 plus P2, and we know the unique subset-sum property: if it's
P1 plus P2, it can't be equal to any single P_i. That means it has to come from P1 plus P2, so we put these
edges into those two permutations. In a sense we're almost done with this one.
Now we look at the rest of the edges, and we come to this edge. Now, what are the options?
Well, P1 plus P2 we already learned, so either it's a sum of things we've already seen, or it's P3. In this case
it's trivial, but in general what can happen is the following: at any point in time in the algorithm you
have discovered P1, P2, up to PK.
The number you're looking at can be either one of the subset sums of the first K elements or
P_{K+1}. Two options: either you are discovering edges of some combination of permutations
you've already seen, or you're discovering a new permutation.
Okay. So that's it. In this case .7 cannot be P1 plus P2, so it has to be P3, and you discover
it. And then the last one, you know it's P2 plus P3, and that's it. That's the end of the
algorithm.
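Here is a minimal sketch of the peeling idea just described (my own reconstruction, not the speaker's code): given the multiset of observed edge values, it recovers the underlying probabilities by scanning values in increasing order and asking whether each one is already explained as a subset sum. It assumes exact data and the two conditions above; noisy data would need the approximate subset-sum variant mentioned in the next question. Assigning the corresponding edges to old or new permutations follows the same bookkeeping as in the example.

```python
from itertools import combinations

def recover_probabilities(edge_values, tol=1e-9):
    """Recover the underlying probabilities P1 < P2 < ... from observed edge values."""
    discovered = []                                   # probabilities found so far
    for v in sorted(set(edge_values)):
        # Is v explained as a subset sum of already-discovered probabilities?
        explained = any(abs(sum(s) - v) < tol
                        for k in range(1, len(discovered) + 1)
                        for s in combinations(discovered, k))
        if not explained:                             # otherwise v reveals a new permutation
            discovered.append(v)
    return discovered

# Edge values from the running example (.1, .2, .7, plus the sums .3 and .9):
print(recover_probabilities([0.1, 0.1, 0.2, 0.2, 0.3, 0.7, 0.7, 0.9]))   # [0.1, 0.2, 0.7]
```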
Any questions?
>>: So one concern is the complexity of keeping these running subset sums and stuff like that, right? Is there
some efficient way to think about this, to do it in polynomial time?
>> Devavrat Shah: Subset sum by itself is a very hard problem, but for this you only need to do it
approximately, and for the approximate version there are fully polynomial-time approximation schemes.
Good question.
>>: [inaudible].
>> Devavrat Shah: Sorry?
>>: [inaudible] variables.
>> Devavrat Shah: Exactly. That's right.
>>: Thank you.
>> Devavrat Shah: Okay. So that was the algorithm. Now, if you notice, I haven't used much of the
structure of permutations here. The fact is that as long as you have these two
properties, the unique signature property and the subset-sum, or linear independence,
property, this algorithm is going to work.
Okay. And the theorem is that as long as those two conditions hold, the L0 optimization has a
unique solution, that unique solution is equal to the original solution, and this algorithm will
recover it.
Okay. So now the question is how good or how bad this condition is. We know
that these types of conditions are not going to be true adversarially; I just presented a
counterexample. So let's look at the next version, that is, randomly. The random model goes like this. Let's
say I want to generate a model with sparsity K, that is, K nonzero elements.
In the case of permutations, here's what I will do. We have N factorial possible permutations. I'm going
to choose K permutations randomly out of them, and to each one of them I want to assign a
probability, P1, P2, up to PK. Here's how I'll generate them.
First I'll generate K numbers uniformly at random from an interval [a, b]. Then I'll normalize
those numbers, that is, divide each number by their sum. That gives me the
probabilities, and that will be my model. The question is: when are these sufficient
conditions satisfied?
More precisely, up to what values of K are the sufficient conditions satisfied with
high probability, so that we can recover the original X using the algorithm?
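For concreteness, here is a minimal sketch of this random sparsity-K model as I understand it from the talk (the interval endpoints a and b are placeholders, not values given in the talk; enumerating all N! permutations is only feasible for small N):

```python
import numpy as np
from itertools import permutations

def random_sparse_model(N, K, a=0.5, b=1.5, seed=0):
    """Pick K rankings uniformly at random and give them normalized uniform weights."""
    rng = np.random.default_rng(seed)
    all_perms = list(permutations(range(N)))
    idx = rng.choice(len(all_perms), size=K, replace=False)   # K random permutations
    weights = rng.uniform(a, b, size=K)
    probs = weights / weights.sum()                           # normalize to a distribution
    return [(all_perms[i], p) for i, p in zip(idx, probs)]

support = random_sparse_model(N=4, K=3)
print(support)
```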
Okay. Now, of course, we know that adversarially this is not going to be satisfied even for K equal to 3.
But for the random model, here's what's going to happen. I told you the marginal information
setup: candidate i ranked in position j, and this is known for all i, j. In that situation, up to
sparsity N log N, the sufficient conditions are satisfied with high
probability, and hence the algorithm will be able to recover with high probability.
For comparison data, this works up to log N. Now, that's a little bit depressing in some sense,
because the amount of information in both cases is of order N squared, and here you're able
to recover up to N log N, while here you're recovering only up to log N.
Now, a small note about comparisons. For K equal to 1, comparison data gives
you all pairwise comparisons, so in some sense it's the classical sorting problem.
What we're saying here is that up to log N permutations can be sorted simultaneously using this
comparison information. And in a second I'll tell you why I believe these are the right answers.
Okay. And here is an interesting fact. In addition to comparisons, suppose I also give you
the following: as a polling person, I'm asking people to compare
candidates, and I'll also ask some people to tell me their favorite candidate.
Okay. So just with that extra bit of information, you now go up to square root of
N.
All right. So now there are two questions. One is: the thresholds that I just gave you, are they
really tight, or are they an artifact of my sufficient conditions? That's one.
And the second is: what is the generic way of asking for this information, and what are the trade-offs
depending on what questions you ask? So first let me answer the first question. That is, A,
what I'm trying to show here is that the results that we have are not just artifacts of
our sufficient conditions; they are actually necessary.
Specifically, under the random model, if your sparsity is scaling faster than N log N, then no algorithm,
whether it's polynomial or exponential -- it doesn't matter -- will be able to recover the
X with high probability.
So it's not just L0; no algorithm will be able to do it.
All right. So that's the first part. Sorry, this is a typo. Oh, no, sorry, it's not a typo; I was
reading this one. Okay. So what this says is that for marginal information, N log N is the
recoverability threshold in this random sense. Below it you can recover, above it you can't, and
achievability is with the algorithm I told you. For comparisons, I believe log N is the sparsity threshold. If
you're interested, I can talk to you offline about why I believe this conjecture should be true.
Now the question is: what about the various types of information, and what is the trade-off
between the information and the recoverability threshold? So let's just think about what other
types of questions you can ask in the case of elections. You can say, well, first I asked you
whether candidate i is ranked in position j, answer yes or no. Maybe now I can ask you more. That was
first-order information; now I can ask you second-order information: candidates i and i prime
ranked in positions j and j prime -- now tell me what fraction of people believe that.
Now it's more information, the second-order marginal. Or the K-th order marginal would be i1 through iK
ranked in positions j1, j2, up to jK. In these cases, the question is how the recoverability
threshold would change. Yes, please.
>>: So the recoverability threshold you stated on the previous slide, you said it's independent of your
recovery condition, whether it's the L0 norm or --
>> Devavrat Shah: The recoverability threshold has two parts, right? One is when you can recover
and the other is when you cannot recover. For when you can recover, the answer is that up to K equal to
N log N the sufficient conditions will be satisfied and hence our algorithm will recover it.
When you cannot recover, that's independent of the algorithm; it's like the converse in the
information-theoretic sense.
>>: For example, in the counterexample you gave, if you used the L2 norm instead of the L0 norm, the fourth
thing would be the right answer, right, giving them each probability .25?
>> Devavrat Shah: No. It may not be the right answer if I started with, if I generated that
information using an X which puts .5, .5 support on those two rankings.
>>: Got it. So basically there are two unique -- good. Good.
>> Devavrat Shah: Okay. So in this situation, since it's an underdetermined situation, there will be lots of
solutions. So the question is whether the solution that generated the data is the unique sparsest one. The
recovery part says, yes, that is the case, and the non-recovery part says, well, beyond the threshold you
can't do anything.
All right. So here is the situation. Now, the question is how the threshold will change here. Here
the threshold becomes N squared log N; here the threshold becomes N to the power K, log N. So there's
some pattern. But, again, the question is: are these all the types of questions you can ask? Can't
I ask some more complex questions?
So here's a class of questions which, I'm going to argue, has a natural association, and these are
essentially all classes of questions in some sense. Now, this type of question comes from
looking at the representation theory of permutations.
But let me explain what these types of questions are and what type of recoverability threshold
comes up. So let's take the number N; N is the number of candidates. Now you can take
different partitions of the number N. By a partition I mean lambda 1, lambda 2, up to lambda r,
let's say decreasing,
and summing up to N. So, for example, one partition would be N minus 1 comma 1: lambda 1 is N minus 1,
lambda 2 is 1, and they sum to N. Or, more generally, it's this type of partition, and so on --
different types of partitions.
Now, my claim is that this partition is associated with the first-order partial
information, that is, candidate i ranked in position j, and this type of partition is related to the K-th order
marginal information, i.e., this type of information.
More generally, there are different ways to associate them. I'm not going to go into detail, but if
you're interested I'll be happy to talk to you at the end. In general, for every lambda there's a
class of information that's associated with it. It's like you take a function and write down its
transform; then different lambdas represent different bases of that function space. That's
roughly the idea. Okay.
Now, okay. So this classifies the types of partial information that you can get: each lambda
is associated to a type of partial information. The question is: how does the recoverability threshold
depend on this lambda?
Okay. So for that, let me define one more thing. Given lambda, a given type of partition
of N, the number of ways you can partition the N items according to it is N factorial
divided by lambda 1 factorial times lambda 2 factorial, and so on; call it D lambda. This is, if you remember,
the multinomial coefficient from a classical combinatorics class.
So, for example, if you have lambda equal to N minus K, 1, 1, and so on, this formula gives N factorial over
N minus K factorial, which is essentially N to the power K. You see the pattern: I told you that
for K-th order marginal information the threshold was N to the power K, log N.
So now, in terms of D lambda, what should the threshold be? Think of K constant; what would
the threshold be?
D lambda log D lambda. So for a large class of lambdas, of types of partial information, this will be the
recoverability threshold. If you have a lambda that allows more distinct partitions, you
have more information and can recover a lot more; otherwise you can recover less.
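To keep the quantities straight, here is a compact restatement of what was just said, in my own notation (the talk gives this verbally); here K denotes the order of the marginal, not the sparsity:

```latex
D_\lambda = \frac{N!}{\lambda_1!\,\lambda_2!\cdots\lambda_r!},
\qquad \lambda_1 \ge \cdots \ge \lambda_r,\quad \textstyle\sum_i \lambda_i = N,
\qquad \text{recoverability threshold} \approx D_\lambda \log D_\lambda ;
\qquad
\lambda = (N-K,1,\ldots,1) \;\Rightarrow\; D_\lambda = \frac{N!}{(N-K)!} \approx N^K .
```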
>>: What would be a lambda for comparison?
>> Devavrat Shah: Good. Comparison won't fall into this [inaudible], but it will be implied and,
hence, the log N threshold is a conjecture. And because this is -- okay, so it is a theorem that the threshold
is D lambda log D lambda. For comparison you can take lambda equal to N minus 2 comma 1 comma 1, that
is, K equal to 2 here.
And then you can aggregate that information to obtain the comparisons. But it won't fit naturally
into this classification; it's not one of those canonical basis functions.
So that essentially brings me to the end of the talk. What I did today is the following:
I tried to understand when this sparsest-solution-based approach will work for the question of
inferring rankings, where rankings are to be inferred based on partial information that's not
engineered -- that is, the measurement matrix is not designed, it's given to you. The question is when
you can recover, and what we did is study this approach under a natural random model,
and for a large class of setups it provides the optimal recovery threshold. So
there are simple sufficient conditions which tell you how you can recover, and beyond the threshold
there's no way you can recover the support, or the exact support.
Going forward, what I would like to do is look at the examples I pointed out, such as the network
measurement case; there, clearly, the network topology should play a role in when the
sufficient conditions are satisfied and when they are not.
And these simple sufficient conditions, are they really optimal, or do they only bound the
recovery threshold? That would be interesting to know, and I don't know the answers.
Okay. That's the end of my talk. Thank you.
[applause].
>>: I think your condition is different from the RIP conditions. [inaudible] satisfied for all K
columns, [inaudible] essentially for the particular subset.
>> Devavrat Shah: Yes.
>>: Your particular subset is [inaudible].
>> Devavrat Shah: Correct.
>>: And on top of that the thing works is [inaudible] is one, coefficients created once. Once.
>> Devavrat Shah: Yes.
>>: Is that a [inaudible] to work. Certain condition, very dependent. I'm not sure where that --
>> Devavrat Shah: Okay. So there's one condition, which is the unique signature condition, and within
it there are two pieces. One is, as you said, the unique signature part, a witness part.
The second one is A being 0 and 1. Is A being 0-1 necessary? That's the first question.
The second question is where the subset-sum property, or linear independence, comes in.
So linear independence comes in because it allows you to undeniably argue that as you're
going from smallest to biggest you're doing the right things. So that's where it comes in.
Now, as far as A being 0 and 1 is concerned, I don't think it will really matter, because one can
modify things similarly; for example, if A is not 0-1 but, let's say, takes values 0, 1, or 2,
then you can add some conditions and make a similar argument work. But as it is, it may not
work.
>>: But they may not work for -- it worked for -- might be generalized work, 0 [inaudible] but may
not be general [inaudible].
>> Devavrat Shah: Yeah, I think so. It may not. You're right.
Yes.
>>: So the natural model that you have is a property over rankings, right? How does it
translate to individual candidates? Does it translate into some heavy [inaudible] property -- the first
one's probability, or the property that candidate 1 is the highest ranked with probability X, candidate 2 is
highest ranked with probability Y? Is there some [inaudible] property for those?
>> Devavrat Shah: Because here the model is uniform, for most of the questions that you asked --
what is the probability candidate 1 is ranked in position 1, candidate 1 is ranked in position 2 -- they'll be
uniform under this. But you can think of other models; say my random model
is like this: here is the identity permutation, and I'm going to perturb around it.
So then you can define, let me call it a perturbation model, saying that everybody naturally
maps i to i, or maps i uniformly to within plus or minus D of i.
Now, under this random model, again, you can define appropriate thresholds. So, for example,
for the marginal one, the recoverability threshold would be D log D, where D is the perturbation, and
actually that will be tight. So for that model things will work out. So in some sense we've
been in the situation of asking when this type of simple condition would fail to work, because
it's somewhat counterintuitive that these types of things work. So there must be a certain
symmetry that we're using here that is giving the exact answers.
But we haven't been able to find it.
>> Jin Li: Any other questions? Thank the speaker for his talk.
[applause]