>> Yuval Peres: Good afternoon everyone. I am very happy to have Andrea Montanari
from Stanford here. I feel some responsibility for Andrea being in the US, because when
I was running a program at MSRI in Berkeley in 2005 he was a postdoc in the program,
and people at Stanford saw him, and the rest is history. I am always amazed by the
variety and depth of the results that Andrea obtains, and he'll tell us today about
large matrices beyond singular value decomposition.
>> Andrea Montanari: Thanks Yuval, for the introduction and the invitation. So what I
will try to do is to show you a bunch of things that we and other people have been
working on in the last couple of years. I will start with two motivating examples.
One is graph localization. What is it? Say you have a cloud of points in a low
dimensional space, say in the plane, and you want to find the points' positions from
some measurements. This has a number of applications. One is manifold learning,
and I mentioned already the reconstruction problem. One is positioning. There are
applications in NMR spectroscopy. But in our [inaudible] simple model, you have the
unit square in the plane and you throw in random points, uniformly and independently.
And now the question is, what do you measure? You measure the distances between
pairs of points if the two points are closer than a certain range. So in this case I took
all of the pairs of points that are closer than something like that, connected them by an
edge, and I measured the lengths of all these edges, okay? And I measured them not
exactly but with some error, so I get this matrix of measurements. And now the
question is to make sense of this big matrix of measurements; in particular, can you
find the point positions, how accurately can you find them, how quickly can you do it,
etcetera?
Okay. So this is one case in which you have a matrix with an underlying low rank
structure, coming from the low dimensionality of the geometry.
The second example is collaborative filtering. Here you have a big matrix whose entries
contain ratings given by a bunch of users to a bunch of, say, movies. And the matrix is
very sparsely populated, in the sense that each user only watched a small subset of the
movies. Just to make things concrete, these are the numbers that came up in the Netflix
challenge: you had 500,000 users, 20,000 movies, and about 1% of the entries were
revealed. These are numbers between one and five and they tell us, for instance, how
much did you like, I don't know, The Devil Wears Prada.
And then with it came a bunch of queries, question marks, asking us how much we
like, I don't know, The Lord of the Rings. So people were challenged to predict these
missing ratings with a root mean square error below 0.8563. And you can also try to
win the challenge; I actually tried for a couple of weeks. I thought it was easy
[laughter]. But then I decided that it was very hard, and even the question of whether
we can make sense of this is interesting. For instance, people spent three years trying
to beat this number. And how did they pick this number? What is the reasonable
precision that one can hope to achieve?
So I will start by describing a model which is an okay model for the last problem, an
algorithm and its analysis, some numerical experiments, and then I will switch to this
graph localization problem and give some results on it. Okay? So what is the model?
Of course, if you want to predict the ratings in this user-movie matrix, you have 1% of
the entries; if you want to say something about the entries that are missing, you need
some structure. So what people do in this field, one common model, is to assume that
the matrix is actually low rank or approximately low rank. So let's say that there is an
underlying matrix M of actual ratings that is rank r; this is U Sigma V transposed. And
our model will be N = M + Z. And then out of this matrix N, this noisy matrix, we
observe a sample, and the sample is a uniformly random sample of some size, okay?
You can think of this in two ways: either you observe each entry with some probability,
or you observe a uniformly random subset of entries of a given size. These are
essentially equivalent.
And this is the sample here, okay? I will denote the sampled matrix by N^E; this is
the matrix where I set to zero all of the entries that I did not observe. And now the
objective, and this is stated very clearly in the Netflix challenge, is to get an estimate
of the whole matrix that minimizes this metric, the root mean square error: the sum
over all entries of the squared difference between your prediction and the actual value,
square-rooted and normalized, okay? So this is just the normalized Frobenius distance
between M and the estimate.
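To make the setup concrete, here is a minimal numpy sketch of the sampling model just described (toy sizes, i.i.d. Gaussian factors and noise assumed for illustration; not code from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, r = 200, 300, 4                 # toy dimensions and rank
    p = 0.1                               # probability that an entry is revealed

    U = rng.normal(size=(m, r))           # low rank "true ratings" matrix M = U V^T
    V = rng.normal(size=(n, r))
    M = U @ V.T
    N = M + 0.1 * rng.normal(size=(m, n))         # noisy matrix N = M + Z

    mask = rng.random((m, n)) < p                 # reveal each entry independently
    N_E = np.where(mask, N, 0.0)                  # observed matrix, zeros elsewhere

    def rmse(M_hat, M_true):
        """Normalized Frobenius distance between an estimate and the truth."""
        return np.sqrt(np.mean((M_hat - M_true) ** 2))

    print(rmse(np.zeros_like(M), M))              # error of the trivial all-zeros estimate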
Yeah?
>>: Where is the [inaudible]?
>> Andrea Montanari: Okay. I will show you the singular value decomposition of this
matrix in a few slides. You can see that there are some big singular values, so it's not
[inaudible]. As a matter of fact, for the people that won the challenge, a big part of
their solution was having this kind of model in it. There is a hand-waving justification,
which I don't know how much you should believe, which is that every person can be
characterized by a low dimensional vector, every movie can be characterized by a low
dimensional vector, and then the rating is just an inner product of these two vectors.
Okay, but this is a good question.
So the parameters of the problem: the data size is m x n, the rank is r, the sample size
is |E|; I think of E as a set, and this is the size of the set. And then there is the noise
matrix Z. I call it noise, but one should think of it as whatever is not explained by the
low rank model. And as a running example we will use this example; it is very
friendly: i.i.d. Gaussian noise.
Okay. So what I want to understand is the trade-off between these quantities, in
particular how many entries do I need to learn a certain matrix of given dimensions
and rank r within some accuracy. A little bit before we started working on this,
Candes and Recht came up with a very nice result. They looked at classes of matrices
that satisfy an incoherence condition. So U and V are the factors, okay? These are
skinny matrices, and the condition is telling you that, with this normalization, the
average squared row norm of the factors is r and the maximum squared row norm is
not much larger than the average. Another way of saying this is that these factors are
not aligned with the coordinate axes.
So this is the condition, and they proved under this hypothesis that if the matrix
satisfies the condition then you can learn it from n^(6/5) times r times log n entries.
And you can learn it in the sense that it is uniquely determined by the observed entries,
and it is the unique minimum of a semidefinite program that is easy to solve. Here I
am thinking, for simplicity, of the case where you have no noise. So the matrix is
really of rank, say, five, and they are telling you that five times n^(6/5) times log n
entries are sufficient, okay?
>>: If the condition was about the matrix, you wouldn't need the [inaudible].
>> Andrea Montanari: Right, yes. So the condition is about the factors, saying that the
factors are not aligned with the coordinate axes. The reason why you need this
condition to learn the matrix exactly is easy to see if you think of the following
rank-one matrix: take a matrix that has entry (1,1) equal to one and zero everywhere
else. Unless you sample entry (1,1), you will never learn it, okay? So if you want
exact reconstruction, you need something like this condition; if you want exact
reconstruction, there is a lower bound.
So this is a nice result. The SDP itself was actually introduced in this earlier paper by
[inaudible], who was a student at Stanford. So it's a great result, but it has a couple of
inconveniences. First of all, think of how well you can hope to do: the number of
degrees of freedom in such a matrix is about the rank times n, because I just need the
two factors, right? And here they are requiring the rank times n^(6/5) entries. So this
means n^(1/5) observations for each bit of information. This can be made precise:
there is a lower bound, which is r times n times log n, and this upper bound is a factor
n^(1/5) above the lower bound.
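A quick back-of-envelope check of this gap, for a rank five matrix with n of order one million:

    import math

    n, r = 10**6, 5
    dof = r * n                              # degrees of freedom: roughly r*n for the two factors
    need = r * n**(6 / 5) * math.log(n)      # entries required by the n^(6/5) r log n bound
    print(dof, need, need / dof)             # the ratio is about n^(1/5) * log n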
Second, this is only for the noiseless case, while in reality the matrix is noisy and I
would be very satisfied with a root mean square error of, say, 0.8. And finally, this
requires solving a semidefinite program, and the complexity of a semidefinite program
is polynomial, but it is a rather large polynomial, something like n^4, and you are
interested in solving it for huge matrices; in practice this can be done only for matrices
that are a hundred by a hundred.
Okay. So now we will try to do something. I don't know if I will succeed in it. I did
not succeed in it. Okay, too bad. I realized just a few hours ago that this wasn't
working; it used to work. So this was supposed to show a big matrix, actually a matrix
that is a few thousand by a few thousand, and the error between the reconstruction and
the actual matrix. It is color-coded: red means bigger and white means smaller. And
supposedly we should have a movie showing that as you reveal more and more entries,
you actually are able to reconstruct it.
>>: Can you restart the browser?
>> Andrea Montanari: Sorry, well, then I will have to point at it. We started from
scratch. Okay, it is a pity. It was a fun movie. I can't, it used to work.
>>: I'm kidding.
[laughter]
>> Andrea Montanari: Sorry?
>>: I'm kidding.
[laughter]
>> Andrea Montanari: Fine. So what's very nice, I can mimic it for you. [laughter]
What you would see in this movie is this red and white stuff floating around, and then
you reveal entries; entries are little blue spots. And at a certain point all of the red goes
away, very rapidly, meaning that you cross a certain number of revealed entries and
the reconstruction error very sharply goes down. If you plot it, the reconstruction error
is approximately constant until a constant times n times r entries, and then it goes
down exponentially: with each entry that is revealed, it goes down very rapidly.
Okay?
So let's stop with the fun and talk about the mathematics. What is the naive approach,
the very naive approach, for trying to reconstruct this matrix? Just use the singular
value decomposition. What is the simplest thing? You take the matrix in which you
set to zero all the entries that you did not observe; this is N^E. You compute its
singular value decomposition. Then you keep only the largest r singular values and
singular vectors and set all the rest to zero. And you have to rescale this by some factor
in such a way that in expectation you get back the original matrix: by subsampling you
removed part of the norm, so you have to boost it up, and the factor is exactly 1 over p,
one over the probability that an entry is sampled. So this is just projecting: finding the
rank r matrix that is essentially closest in Frobenius norm to what we observe. This
naive approach fails if the number of entries is very small, if the number of entries is
of order n. Why? Well, call the degree of a row the number of entries that are revealed
in that row; there will be some rows that have degree log n over log log n, which is just
the behavior of the maximum degree. And corresponding to these rows there are
singular values of order the square root of the degree, so square root of log n over
log log n. This is like taking a random G(n,p) graph with p a constant over n: the
largest eigenvalue will be of order square root of log n over log log n, and it is
completely concentrated on the highest degree node. So it's not something really
faithful to the underlying structure.
Okay, so we have to take care of this, and there is actually a trick for it: every time I
see a row or column that has more than twice the expected number of entries, I just
trim it; I identify it and I throw away all those entries. This is a bit crude and a bit
harsh, because I'm throwing away data, but it's good enough for proving theorems; in
reality you could do better things than this. So this is the graph of observed entries,
and this is not going to be the whole algorithm, but the first part of the algorithm is just
this: I trim, so I set to zero the over-represented rows and columns, and then I compute
the rank r projection. And what is true is that the root mean square error that you
achieve is the sum of two terms. One term is the error that you would make without
noise, and this has the correct behavior: the maximum entry times the square root of
the number of degrees of freedom divided by the number of observations. And then
there is a term that has to do with the noise, okay? And what matters in the noise term
is the operator norm of the observed noise, of the subsampled noise matrix. These two
terms are roughly of the same size in my running example.
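Here is a minimal numpy sketch of this first, spectral phase (trim over-represented rows and columns, rescale by 1/p, keep the top r singular values) under the sampling model above; the variable names are mine, not from the talk:

    import numpy as np

    def trim_and_project(N_E, mask, r):
        """Zero out rows/columns with more than twice the expected number of revealed
        entries, rescale by 1/p, and keep the top-r part of the SVD."""
        m, n = N_E.shape
        p = mask.mean()                          # empirical fraction of revealed entries
        A = N_E.copy()
        row_deg, col_deg = mask.sum(1), mask.sum(0)
        A[row_deg > 2 * p * n, :] = 0.0          # trim over-represented rows
        A[:, col_deg > 2 * p * m] = 0.0          # trim over-represented columns
        U, s, Vt = np.linalg.svd(A / p, full_matrices=False)
        return (U[:, :r] * s[:r]) @ Vt[:r]       # rank-r projection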
Okay, so this is actually not something that people just started looking at; computing
the singular value decomposition of a subsampled matrix is something that has been
going on for a while. For instance, there is a nice theorem by Achlioptas and McSherry.
Here the motivation was a bit different, namely fast linear algebra algorithms. They
prove the following: if the number of entries is n times (8 log n)^4, then the error in
the noiseless case is of the same type as we prove. So this is essentially the same
theorem except for this (8 log n)^4. Now, since Dimitris is here, I want to provoke
him. He will tell you that (8 log n)^4 is not a big deal [laughter] and what are a few
logs among friends? But if n is 10^6, which is a reasonable number, (8 log n)^4 is
bigger than n, right? So this is essentially the same theorem; we are just a bit more
careful about proving it.
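The arithmetic is easy to check:

    import math

    n = 10**6
    print((8 * math.log(n)) ** 4, n)   # about 1.5e8 versus 1e6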
[laughter]
>> Andrea Montanari: No, no. It is not that he's not careful it...
>>: You have the advantage that you could look at his theorem.
>> Andrea Montanari: Yeah, that's what I wanted to imply. So this is how it looks for
a random matrix: if you take a random rank 4 matrix and then you sparsify it, you will
see four big singular values and then a continuous bulk of singular values; the precise
spectrum is here. And this is how it looks for the real Netflix matrix. Unfortunately
the picture is not as nice as before: you don't have this nice gap between the big
singular values and the smaller singular values.
So now, this is not the whole algorithm. What we do is we do that, and then we take
the root mean square error and try to minimize it greedily. This is nothing fancy; I
will not be precise about what I mean by minimizing greedily, but if you want to hear,
we do some sort of gradient descent to minimize the root mean square error, starting
from this factorization. And precisely the...
>>: [inaudible]
>> Andrea Montanari: Typically, people do between 20 and 50 iterations of this. So
20 already gives 0.91 something; the objective was 0.85 and the baseline is 0.95. So it
matters greatly what you are minimizing: the error on the observed entries. So this is
the sum over all observed entries of the squared difference between the real value of
the matrix, N, and your candidate low rank factorization, and you just try to minimize
it by some gradient or coordinate descent.
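The talk is deliberately vague about the greedy minimization; as one plausible reading, here is a plain gradient descent sketch on the observed-entries objective, starting from the factors produced by the spectral step. This is only an illustration of the kind of iteration meant, not the exact procedure analyzed in the theorems:

    import numpy as np

    def descend(N_E, mask, X, Y, steps=200, lr=1e-3):
        """Minimize F(X, Y) = sum over observed (i, j) of ((X Y^T)_ij - N_ij)^2
        by plain gradient descent, starting from the spectral initialization."""
        for _ in range(steps):
            R = mask * (X @ Y.T - N_E)              # residual, restricted to observed entries
            gX, gY = 2 * (R @ Y), 2 * (R.T @ X)     # gradients in X and Y
            X, Y = X - lr * gX, Y - lr * gY
        return X, Y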
And so, what we prove here (this was the theorem before): if you take an incoherent
matrix that also has bounded condition number, the ratio of the largest singular value
to the smallest singular value, and the number of observed entries is a constant times
n r, times a small overhead that shouldn't really be here, the minimum of r and log n,
so think of the extra factor as something like log n squared, then the algorithm that I
described achieves a root mean square error which, if you look at it, is just the second
term of the original bound. The singular value decomposition bound was the sum of
two error terms, one due to subsampling and one due to the noise, or to whatever is
beyond rank r. Now the subsampling part has disappeared: after you do this
minimization, you are left only with the noise contribution. And the complexity: in
the original theorem we didn't have any bound on the complexity, and we had this
worse sampling condition; now we have a paper that we are about to finish where we
have a bound on the complexity and the number of iterations, so this is linear in the
dimension and polynomial in the rank.
Again, for my nice running example with i.i.d. Gaussian noise, this operator norm term
is essentially the square root of r n divided by the number of observations, so it's the
right behavior that you should have: the number of degrees of freedom divided by the
number of observations. So there are two interesting things about this phenomenon,
for us and for other people. First, the singular value decomposition is a cornerstone of
solving many of these problems and it has been studied in great detail, and the nice
thing is that you can do much better than it, qualitatively better. Not for all matrices,
but for a good class of matrices, and this is something that practitioners have known
for a while; people in machine learning have been running this type of gradient
algorithm, but now we are starting to understand it a bit better. What is the difference?
With the singular value decomposition you minimize the Frobenius norm of your
observed matrix minus a rank r matrix. What I minimize instead is the Frobenius norm
of the projection onto the observed entries of the difference between these two, okay?
So P_E is just the operator that takes a matrix and sparsifies it by setting to zero all of
the entries that you don't observe. The difference between the two objectives is that
the first can be solved in closed form, essentially by the singular value decomposition,
while the second cannot be solved in closed form. On the other hand, if you think of
what you should do, you should do the second, right? Because there you are
down-weighting the entries that you don't observe, while with the other one I am also
penalizing the unobserved entries. And the second surprise is that for the singular
value decomposition the error is the sum of a noise term plus a sampling term, while
here you have the noise divided by a sampling factor. Okay, so we are not the only
ones who proved this kind of theorem. After the exact reconstruction result of Candes
and Recht, Candes and Tao sharpened the same result, and then Candes and Plan
looked at the noisy case. If you put these two together, Candes-Tao and Candes-Plan,
you get this kind of result if the matrix is strongly incoherent. I will not explain what
this means; it is a hypothesis stronger than incoherence, but they don't have any
assumption on the condition number, and the number of observed entries is of order
n r log n to the sixth.
Then the same SDP achieves this error bound. Again, if there is no noise this means
exact reconstruction: if Z is zero, this is exact reconstruction. If Z is nonzero then you
have some error, and the difference is that the error here scales with the Frobenius
norm of the noise matrix, while in our bound it scales with its operator norm. So
between the two there is a factor of square root of n, and the square root of n shows up
here and here. And in the last couple of years many papers have come up on the same
topic. There has been a beautiful couple of papers by David Gross and his coworkers.
This n r log n to the sixth in Candes-Tao was really forty pages of moment
calculations; it's really scary. David Gross realized that you can achieve a tighter
result if you use these very nice martingale inequalities, concentration inequalities for
sums of random matrices. There is a classical inequality that tells you something about
the tail of a sum of i.i.d. random variables; there is an inequality by Ahlswede and
Winter that tells you the same about a sum of i.i.d. random matrices. It's beautiful and
very powerful. And these people, Martin [inaudible] and coworkers, looked at slightly
different settings and proved other bounds.
Let me skip this. So, just a couple of numerical experiments. This is a case that is not
particularly realistic but still shows the behavior; a lot of algorithms have been
proposed. These are matrices that are 1000 x 1000 of rank ten and I generate them
randomly: I take a factor that is 1000 by 10 with i.i.d. Gaussian entries, I take another
factor like this, I take the product to get a 1000 x 1000 matrix, I subsample it, and then
I try to reconstruct it with the various algorithms. On this axis is the ratio of the
number of revealed entries to the dimension, and you see that the probability of
success for all of these algorithms has a sharp threshold phenomenon. And there is a
lower bound that you can compute using rigidity theory (this is Singer and coworkers
from Princeton), a lower bound meaning that below this number of revealed entries
you cannot reconstruct: there are multiple solutions. This is a group at Columbia, this
is a group of people at UIUC. These are various approximate methods to get a
solution, and this is our algorithm, so you can get fairly close to the lower bound for
these random matrices.
Okay, and this is a noisy example, actually from this paper by Candes and Plan: a
500 x 500 matrix of rank 4, with i.i.d. Gaussian noise of variance one, and as you can
see, as the number of entries increases, the error goes down. There is a kind of oracle
bound, an obvious lower bound, that you can compute as follows: you assume that one
of the two factors has been given to you, and you just try to reconstruct the other
factor. Now this becomes a least squares problem and you can tell pretty much
everything about it, and this gives a lower bound. This is our algorithm; this is what
the singular value decomposition does. So this is kind of nice, because you see that
the singular value decomposition reaches the ideal point when all the entries have been
revealed, and as fewer entries are revealed it diverges. So this is really the gap, what
you lose by doing singular value decomposition.
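The oracle bound just mentioned is easy to compute; here is a sketch, assuming the left factor U is revealed and fitting the right factor column by column by least squares on the observed entries (names are mine):

    import numpy as np

    def oracle_rmse(U, M, N_E, mask):
        """Oracle lower bound: pretend the left factor U is known and fit the right
        factor column by column by least squares on the observed entries only."""
        n, r = M.shape[1], U.shape[1]
        V_hat = np.zeros((n, r))
        for j in range(n):
            obs = mask[:, j]                                  # observed rows in column j
            V_hat[j], *_ = np.linalg.lstsq(U[obs], N_E[obs, j], rcond=None)
        return np.sqrt(np.mean((U @ V_hat.T - M) ** 2))       # RMSE against the true matrix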
Okay, unfortunately things are not as nice if you go back and try to solve the Netflix
challenge. These are three algorithms. The best one is this one; the blue-black one is
the one for which we have a proof. And of the other two, this is alternating
minimization, where the idea is that you minimize this same function over X, then
over Y, then over X, then over Y, and so on; this was actually first studied by a group
at Microsoft in India. And so you get this curve, and this is another algorithm that is
better, for which we have no proof. The prize was kind of here; we are not there. But
one interesting thing is that this is the error on the entries that you did not observe, and
this is the error on the entries that you observed. So you see this gap is due to
overfitting somehow: on the entries that you observed, you achieve a root mean square
error that is smaller than the prediction error.
>>: [inaudible]
>> Andrea Montanari: Sorry?
>>: [inaudible] long reconstruction [inaudible]?
>> Andrea Montanari: Yeah. If you just want to come up with a matrix you are not
required to... if the matrix is not exactly rank r, then what is good or bad depends a lot
on your matrix, right? So you might...
>>: [inaudible]
>> Andrea Montanari: Sorry?
>>: What is [inaudible] algorithm [inaudible]?
>> Andrea Montanari: Yeah, I don't understand what it means, no unique
reconstruction. You can come up with a guess for each value of each entry, and then
you can decide on the basis of that; this is your prediction.
Okay, so I have one slide about the proof. Again, the algorithm that I described is that
you do the rank r projection by singular value decomposition and then you minimize
the cost function greedily, starting from whatever the singular value decomposition
gave you, and the cost function is (think of the noiseless case) U V transpose, your
factorization, minus X Y transpose; this is what you are minimizing, and you sum it
over all of the observed entries. So what is the strategy of the proof that we have?
First you prove that the singular value decomposition is not too bad; this is the first
theorem. And the way we do it, we use a technique that people use in random graph
theory; this originally dates back, I think, to Friedman, Kahn and Szemeredi and to
Feige and Ofek. What these people did is they took a random G(n,p) graph with p a
constant over n, they trim the high degree nodes, and they prove that once you trim the
high degree nodes the largest eigenvalue is bounded; it is not of order square root of
log n over log log n, okay? And how do you prove it? There are two ingredients. One
is what you would do also for a dense matrix: you take an epsilon-net over the unit
ball, you look at the Rayleigh quotient, and the largest eigenvalue is the maximum of
this quotient, so you have to maximize a random process over the sphere. And what is
nice here is that you have to deal with the fact that the graph is sparse, and there they
do it with a combinatorial argument. And essentially we extend this proof to rank r.
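The effect of trimming is easy to see numerically; here is a small sketch with a sparse G(n, p) adjacency matrix, p = c/n (illustrative only, and the dense SVD limits it to modest n):

    import numpy as np

    rng = np.random.default_rng(1)
    n, c = 2000, 1                                    # G(n, p) with p = c/n
    A = np.triu(rng.random((n, n)) < c / n, 1).astype(float)
    A = A + A.T                                       # symmetric adjacency matrix
    deg = A.sum(1)

    B = A.copy()
    heavy = deg > 2 * c                               # nodes with twice the expected degree
    B[heavy, :] = 0.0
    B[:, heavy] = 0.0

    print(np.linalg.norm(A, 2), np.sqrt(deg.max()), np.linalg.norm(B, 2))
    # before trimming the norm tracks sqrt(max degree); after trimming it is of order sqrt(c)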
And then the next thing to prove is that when you are close enough, this cost function
is nice, in the sense that it is lower and upper bounded by parabolas whose unique
minimum is at the correct reconstruction. So it is locally convex, somehow. The point
is that this optimization problem is a non-convex problem, but in a neighborhood of
this minimum it behaves like a convex one. So we prove this.
>>: The bound you have here is not a [inaudible].
>> Andrea Montanari: Yeah, yeah. So "locally convex" was a loose way of saying it.
What we prove is this, and we also prove that the gradient does not have any zero
inside the neighborhood. The two things together are sufficient to show that gradient
descent converges to the minimum. Now, this was the proof technique until one year
ago. We now have a better proof technique, because this one does not give a good
bound on the number of iterations; the new one gives a bound on the number of
iterations, and it uses [inaudible] inequality.
Okay, so in the last 20 minutes I want to talk about graph localization. The way I got
into graph localization is that I was thinking about this collaborative filtering model,
and I thought that the model has a big lie in it. The big lie is that in the model each
user rates uniformly random movies. Now, as Dimitris pointed out a few days ago,
earlier work that uses the singular value decomposition actually looks at richer models
in which the entries that are revealed also depend on the value of the entry. But this is
just for the singular value decomposition, and also those kinds of models have very
strong assumptions, like knowing the conditional distribution of observing an entry
given its value.
But here there is a big stumbling block: it is not true that you go to whatever
Blockbuster or Netflix and pick a uniformly random movie. At least I don't do it. And
somehow graph localization is a model that has the same feature. What you rent is
probably movies that you expect to like a lot, so you expect to sample entries that are
large. So you should think of a matrix that is approximately low rank where the subset
of entries that you observe are those that are large. And graph localization is similar
and simpler to think about: you have points, and the entries that you observe, the
pairwise distances, are those that are small. So it is very much related.
What is the underlying low rank structure? Well, suppose that I measure all of the
pairwise distances, the distances between all the pairs, and I measure them exactly. I
can construct the square matrix that has as entry (i, j) the square of the distance
between point i and point j. So this is an n by n matrix whose (i, j) entry is the distance
between x_i and x_j, squared. I claim that the rank of this matrix is the ambient
dimension plus 2; since my drawing is in the plane, the rank is 2 + 2 = 4. And this
takes one minute to see, because D_ij is just ||x_i||^2, which gives a rank one
contribution, plus ||x_j||^2, which gives a rank one contribution, minus 2 <x_i, x_j>,
which gives a rank d contribution. So you can think of this problem as essentially the
same as before: you have a rank d+2, here rank 4, matrix of which I subsample entries,
and I want to reconstruct the whole matrix. Now the interesting thing is that all papers
on this topic start by saying that graph localization is a nice application of these
general ideas. But if you run any of the algorithms that have been tried before, they
don't work, precisely because of this fact: the entries that you are sampling are only
the small entries. Of this matrix you only observe those entries that are smaller than
some range.
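This rank claim is easy to check numerically; here is a small sketch with random points in the unit square (d = 2):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 50, 2
    X = rng.random((n, d))                            # points in the unit square
    G = X @ X.T                                       # Gram matrix of the positions
    sq = np.diag(G)
    D = sq[:, None] + sq[None, :] - 2 * G             # D_ij = |x_i - x_j|^2
    print(np.linalg.matrix_rank(D))                   # prints d + 2 = 4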
Okay, so here is an algorithm that instead kind of works. This is not the best algorithm
I can propose, but it is what we can analyze so far. The input, again, is the perturbed
distance measurements, and I want to come up with a set of two-dimensional
coordinates that are accurate up to a rigid motion; of course I cannot do better, because
rigid motions do not change distances. So I solve this SDP problem: it is a problem in
a matrix Q, which is an n by n matrix, and I minimize the trace of Q subject to Q being
positive semidefinite and, essentially, a bunch of linear constraints on Q that tell me
that Q reproduces the measured distances within accuracy delta. I will come back to
this in a minute. Once I have the solution, I compute the eigendecomposition of it
(this will be a symmetric matrix), I take the eigenvectors corresponding to the largest
eigenvalues, and I give you back this as my embedding of the points.
>>: So is there noise here? There's no noise.
>> Andrea Montanari: Yeah, there is noise, I am assuming. And in the algorithm this
is reflected in the fact that I am not trying to reproduce the measurements exactly; I
only try to reproduce them up to some accuracy delta. Okay?
>>: [inaudible]
>> Andrea Montanari: So let me spell out one piece of notation. What is this M_ij? It
is just a matrix that has all entries zero except entry (i, i), which is one, entry (j, j),
which is one, and entries (i, j) and (j, i), which are -1. So this is the semidefinite
program that I try to solve. What is the rationale for it? Suppose you decompose Q in
this form: it is a positive semidefinite matrix, so you can always write it as X times X
transpose, and therefore the inner product of Q with M_ij is the squared distance
between the vector x_i and the vector x_j. So this program is equivalent to: minimize
the trace, which is the sum of all the products <x_i, x_i>, so the sum of the squared
norms of the positions, subject to this constraint, which if you write it down just says
that ||x_i - x_j||^2 is within delta of the measurement. So this Q really is the Gram
matrix of my points. And this is the trace inner product between matrices, trace of
A B transpose, and this inner product extracts exactly this quantity.
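Putting the pieces together, here is a sketch of this SDP in cvxpy, an assumed off-the-shelf solver interface usable only at toy sizes; edges is a list of measured pairs and d2 holds the noisy squared distances:

    import numpy as np
    import cvxpy as cp

    def sdp_localize(edges, d2, n, delta, dim=2):
        """Trace-minimization SDP sketch: find a PSD matrix Q whose entries reproduce
        the measured squared distances within delta, then read coordinates off the
        top eigenvectors (the result is defined only up to a rigid motion)."""
        Q = cp.Variable((n, n), PSD=True)
        cons = [cp.abs(Q[i, i] + Q[j, j] - 2 * Q[i, j] - d2[i, j]) <= delta
                for (i, j) in edges]                  # <Q, M_ij> = |x_i - x_j|^2
        cp.Problem(cp.Minimize(cp.trace(Q)), cons).solve()
        w, V = np.linalg.eigh(Q.value)                # eigendecomposition of the Gram matrix
        top = np.argsort(w)[::-1][:dim]
        return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))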
>>: So why are you trying to minimize this?
>> Andrea Montanari: Excellent question. I minimize the trace, first, because I need to
break some degeneracy, and second because the problem is invariant under
translations. Ideally I would just like to find something that satisfies the constraints; if
I could solve the feasibility problem, I would be fine. What the trace minimization
really does is center the points: it takes out the center of mass, so whatever solution
you get will have its center of mass at the origin. In the end, the trace does not play
any real role in the proof. So what is known? People have looked at this SDP
relaxation, mainly motivated by positioning, so mainly [inaudible] and coworkers, and
they prove this kind of result: if the measurements are exact and the graph is a
(d+1)-lateration graph, then this algorithm finds the correct positions. What does
(d+1)-lateration mean? That there exists an ordering of the points, of the vertices, such
that vertex i has at least d+1 neighbors among vertices 1 through i-1. Where does this
come from? Imagine the simplest algorithm for finding positions: it is just
triangulation. You first position three points, then you find a fourth point that has
three neighbors among the first three, then a fifth, and so on. And so this theorem says
that if the measurements are exact and triangulation works, then the SDP works. That
is nice to know.
>>: [inaudible] the SDP seems like you have no [inaudible].
>> Andrea Montanari: But I don't observe all the entries.
>>: Are you saying that is only true if I have all the distances?
>> Andrea Montanari: If you have all of the distances, then you don't have this
problem. But I only have a subset. With a subset, the simplest method is just
triangulation. And people who do this kind of thing, many applications people,
complain about it because it is not robust. For instance, I was reading that the way
they first measured the height of K2 in the Himalaya was by triangulation, and for a
while they were convinced that K2 was the highest, because along the way somebody
got a wrong measurement, and this propagates. But of course the theorem does not
show what the advantage of the SDP is with respect to triangulation.
The noiseless case has a long history, because it is related to rigidity theory, and there
is this long-standing question: given a graph and a set of distances, is there a unique
set of points that realizes these distances? Rigidity theory actually knows very little
about this. I think it was settled in two dimensions by [inaudible] and others, and was
settled in general dimension only recently. And there is a notion of generic rigidity
here that I am not going to talk about. Okay?
So here is the model I am going to look at: a [inaudible] noise model where my
measurements differ from the actual distances by at most delta. And for simplicity
let's say that the range within which I measure things is of order (log n / n) to the
power 1/d. Why this funny function? Because this is essentially the first radius at
which the graph becomes connected. If the range is much smaller, the graph has
multiple connected components and you cannot position the points.
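A quick numerical check of this connectivity radius, assuming scipy for the graph components (illustration only):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(3)
    n, d = 2000, 2
    X = rng.random((n, d))                           # uniform points in the unit square
    radius = (np.log(n) / n) ** (1.0 / d)            # the claimed connectivity scale
    A = csr_matrix(squareform(pdist(X)) < radius)    # geometric graph at that range
    print(connected_components(A, directed=False)[0])   # typically a single component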
So what happens? This is the theorem that we have. Under these hypotheses, with
high probability, the error in the estimated positions is upper bounded by n^(4/d) times
delta (okay, here there is a hat missing on the estimate). This is a bit annoying,
because there is a factor that diverges with the number of points, so you wonder
whether this is the best factor that you can get, at least from the analysis. And at least
within this approach, we did things correctly, in the sense that there exists another
configuration of points and a set of measurements for which the error is at least of this
order. And here are the exponents; on this slide I am forgetting log terms [laughter],
but somebody might be more careful than me. So I am not only forgetting log terms, I
am really very sloppy about any constant factors.
Okay, so this is what we were able to prove, and actually it is quite interesting: the
upper bound is very much related to Markov chain analysis, to the analysis of
[inaudible], but I will not talk about this. About the lower bound I will just say one
thing: what is the worst-case configuration? For a while we struggled trying to
improve our upper bound, because we thought that the worst-case configuration was
this one: you have a set of points, and what the adversary does is blow up all the
positions by a constant factor and then return to you the measurements corresponding
to these scaled distances. Whatever algorithm you use, you will reconstruct the
blown-up positions, so you will get an error between the actual positions and the
reconstructed positions. Now you do the math and compute the size of the error that
you get for this kind of noise, a very benign noise, and you get n^(2/d) times delta. So
for a while we tried to push the upper bound down to this, and we didn't succeed.
We didn't succeed for a reason: this is not the worst-case configuration. So what is
the worst-case configuration? It is actually this one. The SDP doesn't know the
dimension in which you are embedding the problem, so when it returns a Gram matrix,
it is a Gram matrix in many dimensions, not just in two dimensions, right? So what
the adversary does is take your points and (I struggle to draw this picture) bend them
into the third dimension. It computes all the distances of this bent point configuration
and gives these distances back to you. The SDP will reconstruct this bent
configuration; then you do the singular value decomposition, so you flatten it, you do
your math, you compute your error, and you get something of order n^(4/d) times
delta. This has nothing to do with the scaling construction. Of course, once you
understand that this is the worst case, there are ways to improve the algorithm, but I
will not talk about it.
>>: If you go back to what was the [inaudible] upper bound [inaudible]
>> Andrea Montanari: It's the same up to poly log factors.
>>: And you won't reveal to us what is the power [inaudible]?
>> Andrea Montanari: I don't. It is unknown; I don't know. Yeah, shame on me. So,
conclusions. Singular value decomposition for analyzing large matrices is something
that is relatively well understood. All of this non-[inaudible] work, I think, is much
less understood; it is very interesting, there is some nontrivial probability that perhaps
people here could contribute to, and it is still open in the end: there have been all these
papers, but precise sharp characterizations are open. There has been some work, and
this is very interesting, on other matrices. In reality, matrices are not of finite rank; the
right way to think of them is that they are of rank n, so full rank, but then how do you
want to discount things where all the singular values are of the same order? You are
trying to reconstruct the matrix, and the real structure that you are trying to pull out is
that there are a few singular values that are big and many more that are smaller. So
what is the correct way of pulling out this structure? One is to cook up the right
function to be optimized; this is related to non-parametric methods. And finally, there
is an interesting question, even for the singular value decomposition: what happens if
the matrix is really huge, in the sense that it doesn't even fit on a single machine, and
what are methods to solve this problem in that case? All right, so these are interesting
research directions. Thanks.
[applause]
>> Yuval Peres: Are there questions?
>>: For the worst-case scenario you just drew, I think we can fix that SDP:
[inaudible] you can maximize the [inaudible]?
>> Andrea Montanari: There is this idea of maximizing the [inaudible]. But then you
have to put in the linear constraints, and the point is, if somebody gave you... it's not
clear that it will work, but it is a possible fix. Another fix is the following: if two
points are not connected by an edge, you can still compute an upper bound on their
distance just by taking the shortest path in the graph, and you can put that upper bound
into your SDP. And this upper bound is actually pretty good if there are enough
points, within 90% or so. So you can actually also put a lower bound into your SDP.
This would be like adding all the edges to your graph, but with some bigger error. We
are actually trying to do the analysis of that right now. But yes, once you understand
what the problem is, there are fixes that one can try.
>>: Could you go ahead and use standard multidimensional scaling [inaudible], not too much of a [inaudible]?
>> Andrea Montanari: Yeah, good question. So there is this method that we didn't
talk about, multidimensional scaling, which is exactly what I was saying: what you do
is fill in your matrix by putting, for all the entries that you didn't observe, the length of
the shortest path. But what is worse about this method is that even if the
measurements are exact, it will not return the exact positions, right? It will return
something with some error, and that error is not there with the SDP. So it is like my
first example, it is like plain singular value decomposition: even if the matrix is
exactly low rank and you observe it without noise, you will not be able to reconstruct
it exactly. So there is this extra guarantee that you get from the SDP that you don't get
with MDS.
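For comparison, here is a sketch of that multidimensional scaling baseline: fill the missing pairwise distances with shortest-path distances, double-center the squared distances, and embed with the top eigenvectors (assuming scipy; missing entries marked with np.inf):

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    def mds_embed(D_obs, dim=2):
        """Classical MDS baseline: fill the unobserved pairwise distances (np.inf)
        with shortest-path distances, double-center the squared distances, and
        embed with the top eigenvectors. Assumes the observation graph is connected."""
        D = shortest_path(D_obs, directed=False)     # shortest-path fill-in
        m = D.shape[0]
        J = np.eye(m) - np.ones((m, m)) / m          # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                  # doubly centered squared distances
        w, V = np.linalg.eigh(B)
        top = np.argsort(w)[::-1][:dim]
        return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))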
>> Yuval Peres: Anymore questions? Thank you.
[applause]