>> Yuval Peres: Good afternoon everyone. I am very happy to have Andrea Montanari from Stanford here. I feel some responsibility for Andrea being in the US, because when I was running a program at MSRI in Berkeley in 2005 he was a postdoc in the program, and people at Stanford saw him, and the rest is history. I am always amazed by the variety and depth of results that Andrea obtains, and he'll tell us today about large matrices beyond singular value decomposition. >> Andrea Montanari: Thanks, Yuval, for the introduction and the invitation. So what I will try to do is show you a bunch of things that we and other people have been working on in the last couple of years. I will start with two motivating examples. One is graph localization. What is this? Say you have a cloud of points in a low-dimensional space, say in the plane, and you want to find the points' positions from some measurements. This has a number of applications. One is manifold learning, which I mentioned already. One is positioning. There are applications in NMR spectroscopy. But in our simple model here, you have the unit square in the plane and you throw in random points uniformly and independently. And now, what do you measure? You measure the distances between pairs of points if the pair is closer than a certain range. So in this case I took all of the pairs of points that are closer than some threshold, connected them by an edge, and measured the lengths of all these edges, okay? And I measured them not exactly but with some error, so I get this matrix of measurements, the d-tilde's. And now the question is to make sense of this big matrix of measurements. In particular, can you find the point positions, how accurately can you find them, how quickly can you do it, etcetera? Okay.
So this is one case in which you have a matrix with an underlying low rank structure, coming from the low dimensionality of the geometry. The second example is collaborative filtering. Here you have a big matrix whose entries are ratings given by a bunch of users to a bunch of, say, movies. And the matrix is very sparsely populated, in the sense that each user only watched a small subset of the movies. Just to make things concrete, these are the numbers that came up in the Netflix challenge. You had 500,000 users, 20,000 movies, and about 1% of the entries were revealed. These are numbers between one and five, and they tell us for instance how much you liked, I don't know, The Devil Wears Prada. And with it came a bunch of queries, question marks asking how much we would like, I don't know, The Lord of the Rings. So people were challenged to predict these missing ratings with root mean square error below 0.8563. And you can also try to win the challenge; I actually tried for a couple of weeks. I thought it was easy [laughter]. But then I decided that it was very hard, and so was even the question of whether we can make sense of this. For instance, people spent three years trying to beat this number, and how did they pick this number? What is the reasonable precision that one can hope to achieve? So I will start by describing a model which is an okay model for the last problem, an algorithm and its analysis, some numerical experiments, and then I will switch to graph localization and some results on that. Okay? So what is the model then? Of course, if you want to predict the missing ratings of this user-movie matrix, and you have 1% of the entries, then to say something about the entries that are missing you need some structure.
So what people in this field normally assume, one common model, is that the matrix is actually low rank or approximately low rank. So let's say that there is an underlying matrix of actual ratings that is rank r: this is U Sigma V transposed. And our model would be N = M + Z. And then out of this matrix N, this noisy matrix, we observe a sample, and the sample is a uniformly random sample of some size, okay? So you can think of this in two ways. Either you observe each entry with some probability, or you observe a uniformly random subset of entries of a given size. These are essentially equivalent. And this is the sample here, okay? I will write the sampled matrix as N^E; this is the matrix where I set to zero all of the entries that I did not observe. And now the objective, which is stated very clearly in the Netflix challenge, is to produce an estimate of the matrix M that minimizes this error. This is the root mean square error: the sum over all entries of the prediction minus the actual value, squared, and then a square root, okay? So this is just the normalized Frobenius distance between M and the estimate. Yeah? >>: Where is the [inaudible]? >> Andrea Montanari: Okay. So I will show you the singular value decomposition of these matrices in a few slides. You can see that there are some big singular values, so it's not unreasonable. As a matter of fact, for the people that won the challenge, a big part of their method was having this kind of model in it. There is a hand-waving justification that I don't know how much you should believe, which is that every person can be characterized by a low-dimensional vector, every movie can be characterized by a low-dimensional vector, and then the rating is just an inner product of these two vectors. Okay, but this is a good question. So the parameters of the problem: the data size is M x N, the rank is R, the sample is E.
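The setup so far (a low-rank matrix plus noise, a uniformly random sample of entries, RMSE as the score) can be sketched in a few lines. The sizes and noise level below are made-up illustration values, not the Netflix numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 200, 200, 4      # illustration sizes, not the Netflix dimensions
sigma = 0.1                # noise level (an arbitrary choice)

# Rank-r "true" matrix M = U V^T, plus i.i.d. Gaussian noise Z: N = M + Z.
U = rng.normal(size=(n, r))
V = rng.normal(size=(m, r))
M = U @ V.T
N = M + sigma * rng.normal(size=(n, m))

# Reveal each entry independently with probability p (essentially equivalent
# to drawing a uniformly random subset E of a fixed size).
p = 0.3
E = rng.random((n, m)) < p
N_E = np.where(E, N, 0.0)  # observed matrix, zero on unrevealed entries

def rmse(M_hat, M):
    """Normalized Frobenius distance between an estimate and the truth."""
    return np.sqrt(np.mean((M_hat - M) ** 2))
```

Revealing each entry with probability p and drawing a fixed-size uniform subset give the same results up to lower-order fluctuations, which is why the talk treats them as interchangeable.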
So I think of E as a set, and this is the size of the set. And then there is the noise matrix. I call it noise, but one should think of it as whatever is not explained by the low rank model. As a running example we will use i.i.d. noise; it is very friendly. Okay. So what I want to understand is the trade-off between these things. In particular, how many entries do I need to learn a matrix of given rank R within some accuracy? So, a little bit before we started working on this, Candes and Recht came up with this very nice condition. They looked at classes of matrices that satisfy this incoherence condition. So U and V are the factors, okay? These are skinny matrices, and this condition is telling you that the average squared norm of the rows of these factors is what it has to be, and the maximum is not much larger than the average. Another way of saying this is that the factors are not aligned with the coordinate axes. So this is the condition, and they proved that if a matrix satisfies this condition then you can learn it from N to the 6/5 times R times log N entries. You can learn it in the sense that it is uniquely determined by the observed entries, and it is the unique minimum of a semidefinite program that is reasonably easy to solve. Here I am thinking, for simplicity, of the case where you have no noise. So the matrix is really of rank, say, five, and they are telling you that five times N to the 6/5, times log N, entries are sufficient, okay? >>: If the condition was about the matrix, you wouldn't need the [inaudible]. >> Andrea Montanari: Right, yes.
So the condition is about the factors, saying that the factors are not aligned with the coordinate axes. The reason why you need this condition to learn the matrix exactly you understand very easily if you think of the following low rank matrix, a rank one matrix: take a matrix that has one entry equal to one at position (1,1) and zero everywhere else. Unless you sample the entry (1,1) you will never learn it, okay? So if you want exact reconstruction, you need something like this condition. And for exact reconstruction there is a lower bound. So this is a nice result. The SDP itself was actually introduced in an earlier paper by [inaudible], a student at Stanford. So it's a great result, with a couple of inconveniences. First of all, think of how well I can hope to do. The number of degrees of freedom in such a matrix is about the rank times N, because I just need the two rank-R factors, right? And here you are requiring rank times N to the 6/5 entries. This means N to the 1/5 observations for each bit of information. This can be made precise: there is a lower bound which is R times N times log N, and this upper bound is a factor N to the 1/5 above the lower bound. Second, this was only for the noiseless case, whereas I would be very happy with a root mean square error of, say, 0.8. And finally, this proposes a semidefinite program, and the complexity of the semidefinite program is polynomial, but, you know, a hefty polynomial, N to the 4th, and you are interested in solving it for huge matrices; this can be done only for matrices that are a hundred by a hundred. Okay. So now we will try to do something. I don't know if I will succeed in it. I did not succeed in it. Okay, too bad. I realized just a few hours ago that this wasn't working. It used to work. Okay. So this was supposed to show — this is a big matrix, actually a matrix that is a few thousand by a few thousand.
And this is the error between the reconstruction and the actual matrix. It is color-coded; red means bigger and white means smaller. And supposedly we should have a movie showing that as you reveal more and more entries, you are actually able to reconstruct it. >>: Can you restart the browser? >> Andrea Montanari: Sorry, well then I will have to point it. We started from scratch. Okay. It is a pity. It was a fun movie. I can't; it used to work. >>: I'm kidding. [laughter] >> Andrea Montanari: Sorry? >>: I'm kidding. [laughter] >> Andrea Montanari: What's very nice--I can mimic it for you. [laughter] What you'd see in this movie is this red and white stuff floating around, and then you reveal entries; entries are little blue spots. And at a certain point all of the red goes away, very rapidly. Meaning that you cross a certain number of revealed entries and the reconstruction error very sharply goes down. If you plot it, the reconstruction error is approximately constant until a constant times N times R entries, and then it goes down exponentially: with each entry that is revealed, it goes down very rapidly. Okay? So let's stop with the fun and talk about the mathematics. What is the naive approach, the very naive approach, for trying to reconstruct this matrix? Just use the singular value decomposition. What is the simplest thing? You take the matrix in which you set to zero all the entries that you did not observe; this is N^E. You compute its singular value decomposition. Then you keep only the largest R singular values and singular vectors, and set to zero all the rest. And you have to rescale this by some factor in such a way that the expectation matches the original matrix: by subsampling you diminished the norm, so you have to boost it up by exactly 1 over p, one over the probability that an entry is sampled.
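The naive spectral step just described — zero out unobserved entries, rescale by one over the sampling probability, keep the top R terms of the SVD — can be sketched like this (function and variable names are mine, not from the talk):

```python
import numpy as np

def rank_r_projection(N_E, E, r):
    """Naive estimator: rescale the zero-filled observed matrix by 1/p-hat,
    then keep only the r largest singular values and vectors."""
    p_hat = E.mean()  # empirical sampling probability
    U, s, Vt = np.linalg.svd(N_E / p_hat, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Tiny demo on a random rank-2 matrix, ~60% of entries revealed, no noise.
rng = np.random.default_rng(0)
n, r = 300, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
E = rng.random((n, n)) < 0.6
M_hat = rank_r_projection(np.where(E, M, 0.0), E, r)
```

The 1/p rescaling makes the observed matrix an unbiased estimate of the full matrix entry by entry, so the top singular directions line up with the signal.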
So this is just projecting: finding the rank R matrix that is essentially closest in Frobenius norm to what we are observing. Now, this naive approach fails if the number of entries is very small, if the number of entries is of order N. Why? Well, call the degree of a row the number of entries that are revealed in that row; there will be some rows that have degree log N over log log N — this is the maximum over the rows. And corresponding to these rows there are singular values of order the square root of the degree, so square root of log N over log log N. This is like taking a random G(n,p) graph with p a constant over N: the largest eigenvalue will be of order square root of log N over log log N, and it is completely concentrated on the highest degree node. So it's not something really faithful. Okay, so we have to take care of this, and there is a trick for it. Every time I see a row or column that has more than twice the expected number of entries, I just trim it: I identify it and I throw away all of its entries. This is a bit harsh, because I'm throwing away data, but it's good enough for proving theorems. In reality you could do better things than this, but it's good enough. And so this is the algorithm — this is not going to be the whole of it, but the first part is just this: I trim, setting the over-represented rows and columns to zero, and then I compute the rank R projection. And what is true for this is that the root mean square error that you achieve is the sum of two terms. One term is the error that you would make without noise. This has the correct behavior: the size of the maximum entry, times the square root of the number of degrees of freedom divided by the number of observations. And then there is a term that has to do with the noise, okay?
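The trimming trick — throwing away any row or column with more than twice the expected number of revealed entries — is short to write down. A sketch, again with made-up names:

```python
import numpy as np

def trim(N_E, E):
    """Zero out every row/column whose number of revealed entries exceeds
    twice the average degree, as in the trimming step described above."""
    T = N_E.copy()
    row_deg = E.sum(axis=1)
    col_deg = E.sum(axis=0)
    T[row_deg > 2 * row_deg.mean(), :] = 0.0
    T[:, col_deg > 2 * col_deg.mean()] = 0.0
    return T

# Demo: a sparse sample plus one over-observed row that gets trimmed away.
rng = np.random.default_rng(0)
n = 100
E = rng.random((n, n)) < 0.2
E[0, :] = True                      # row 0 is observed everywhere
N_E = np.where(E, rng.normal(size=(n, n)), 0.0)
T = trim(N_E, E)
```

After trimming, the spurious large singular values caused by a few high-degree rows disappear, which is what makes the rank-R projection behave.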
And what matters in the noise term is the operator norm of the observed noise, so just this subsampled, trimmed noise matrix. These two terms are roughly of the same size in my running example. Okay, so this is actually not something that people just started looking at; computing this singular value decomposition is something that has been going on for a while. This is, for instance, a nice theorem by Achlioptas and McSherry. Here the motivation was a bit different: it was fast linear algebra algorithms. They prove the following: if the number of entries is N times (8 log N) to the 4th, then the error in the noiseless case is of the same type as we prove there. So this is essentially the same theorem except for this (8 log N) to the 4th. Now, since Dimitris is here I want to provoke him. He will tell you that (8 log N) to the 4th is not a big deal [laughter], and what are a few logs among friends? But if N is 10 to the 6th, which is a good number, (8 log N) to the 4th is bigger than N, right? So this is essentially the same theorem; we are just a bit more careful about proving it. [laughter] >> Andrea Montanari: No, no. It is not that he's not careful... >>: You have the advantage that you could look at his theorem. >> Andrea Montanari: Yeah, that's what I wanted to imply. So this is how it looks for a random matrix. If you take a random rank 4 matrix and then you subsample it, you will see 4 big singular values and then a continuous bulk of singular values. And this is how it looks for the real Netflix matrix. Unfortunately the picture is not as nice as before: you don't have this nice gap between the big singular values and the small singular values. So this is not our whole algorithm. What we do is that, and then we take the root mean square error and try to minimize it greedily. There is nothing fancy here, and I will not be too precise about what I mean by minimizing greedily.
If you want to know, we do some sort of gradient descent to minimize the root mean square error, starting from this factorization. And precisely the... >>: [inaudible] >> Andrea Montanari: Typically people do between 20 and 50 iterations of this. So 20 already gives 0.91 something; the objective was 0.85 and the baseline is 0.95. So what you are minimizing matters greatly. It is the error on the observed entries: the sum over all observed entries of the real value of the matrix minus your candidate low-rank factorization, and you just try to minimize it by some coordinate descent. And so, what we prove here — this is the theorem from before; what we prove is the following. Take an incoherent matrix; it has to have bounded condition number, the ratio of the largest singular value to the smallest singular value. And take a number of observed entries that is a constant times N R, times a small overhead that shouldn't really be here — the minimum of R and log N. Then the algorithm that I described achieves a root mean square error which, if you look at it, is just the noise term from the original bound. So the singular value decomposition had the sum of two error terms, one due to subsampling and one due to noise; now the subsampling part has disappeared: after you do this minimization you are left only with the noise contribution. And the complexity: in the original theorem we didn't have any bound on the complexity, and we had this worse sampling condition; now we have a paper that we are about to finish where we have a bound on the complexity and on the number of iterations, and this is linear in the dimension and cubic in the rank.
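Here is a minimal sketch of the two-stage scheme as described: a spectral start followed by plain gradient descent on the observed-entry squared error. The step size rule is my own crude heuristic, not the talk's algorithm:

```python
import numpy as np

def objective(X, Y, N_E, E):
    """F(X, Y) = sum over observed (i, j) of (N_ij - (X Y^T)_ij)^2."""
    return np.sum((E * (X @ Y.T - N_E)) ** 2)

def spectral_then_descend(N_E, E, r, iters=200):
    # Spectral start: top-r SVD of the rescaled observed matrix.
    U, s, Vt = np.linalg.svd(N_E / E.mean(), full_matrices=False)
    X = U[:, :r] * np.sqrt(s[:r])
    Y = Vt[:r, :].T * np.sqrt(s[:r])
    lr = 0.1 / s[0]                 # conservative step size (heuristic)
    for _ in range(iters):
        R = E * (X @ Y.T - N_E)     # residual on observed entries only
        X, Y = X - lr * (R @ Y), Y - lr * (R.T @ X)
    return X, Y

rng = np.random.default_rng(0)
n, r = 100, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
E = rng.random((n, n)) < 0.5
N_E = np.where(E, M, 0.0)
X, Y = spectral_then_descend(N_E, E, r)
```

Note that the cost only looks at observed entries, exactly the point made below about minimizing the projected Frobenius norm rather than the full one.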
Again, for my nice running example with i.i.d. noise, this operator norm is essentially square root of R N divided by the number of observations — so it's the right behavior: the number of degrees of freedom divided by the number of observations. So there are two interesting things about this phenomenon, for us and other people. First, the singular value decomposition is a cornerstone of solving many of these problems and has been studied in great detail. And the nice thing is that you can do much better than it — qualitatively better. Not for all matrices, but for a good class of matrices. This is something that practitioners have known for a while; people in machine learning have been running this type of gradient algorithm, but now we are starting to understand it a bit more. What is the difference? With singular value decomposition you minimize the Frobenius norm of your observed matrix minus a rank R matrix. What I minimize instead is the Frobenius norm of the projection onto the observed entries of the difference between these two, okay? So P_E is just the operator that takes a matrix and sparsifies it by setting to zero all of the entries that you don't observe. The difference between this and this is that the first can be solved in closed form, essentially by singular value decomposition, while the second cannot. On the other hand, if you think about what you should do, you should do the second, right? Because the first is effectively penalizing you on entries that you don't observe, while the second only penalizes the observed entries. And the second surprise is that the singular value decomposition error is the sum of a noise term plus a sampling term, while here you have only the noise term, divided by the appropriate factor. Okay, so we were not the only ones who proved this kind of theorem. After the exact reconstruction result of Candes and Recht, Candes and Tao improved it, and then Candes and Plan looked at the noisy case.
So if you put these two together, Candes-Tao and Candes-Plan, you get this kind of result if the matrix is strongly incoherent. I will not explain what that is; it is just a hypothesis that is stronger than incoherence. But they don't have any assumption on the condition number, and the number of observed entries is R N (log N) to the 6th. Then the same SDP achieves this error bound. Again, if there is no noise this means exact reconstruction: the error is zero. If the noise is nonzero then you have some error, and the difference is that the error here scales like the Frobenius norm of the noise matrix, while in our bound it scales like the operator norm. So between the two there is a factor of square root of N, and that square root of N shows up here and here. And in the last couple of years many papers have come out on the same topic. There has been a beautiful couple of papers by David Gross and his coworkers. This R N (log N) to the 6th result of Candes and Tao was really 40 pages of moment calculations; it's really scary. David Gross realized that you can achieve a tighter result if you use these very nice martingale inequalities, inequalities for sums of random matrices. There are inequalities that tell you something about the tail of a sum of i.i.d. random variables; there is an inequality by Ahlswede and Winter that tells you the same about a sum of i.i.d. random matrices. It's very powerful. And these people, Wainwright and coworkers, looked at slightly different settings and other bounds. Let's skip this. So, just a couple of numerical experiments. This is a case that is not particularly realistic but still shows that a lot of algorithms have been proposed. These are matrices that are 1000 x 1000 of rank ten, and I generate them randomly.
What I do is just take a factor that is 1000 by 10 with i.i.d. Gaussian entries, take another factor like this, take the outer product to get a 1000 x 1000 matrix, subsample it, and then try to reconstruct it with various algorithms. On the horizontal axis is the ratio of the number of revealed entries to the dimension, and you see that the probability of success for all of these algorithms shows a sharp threshold phenomenon. And there is a lower bound that you can compute using rigidity theory — this is Singer and coworkers from Princeton — meaning that below this number of revealed entries you cannot reconstruct: there are multiple solutions. Okay, so this is a group at Columbia, and this is a group of people at UIUC. These are various approximate methods to get a solution; this is our algorithm, so you can get fairly near the lower bound for these random matrices. And this is a noisy example, actually from the paper by Candes and Plan. This is a 500 x 500 matrix of rank 4. I add Gaussian noise with variance one, and as you can see, as the number of revealed entries increases, the errors go down. There is a kind of oracle lower bound that you can compute as follows: you assume that one of the two factors has been given to you, and you just try to reconstruct the other factor. This becomes a least squares problem, and you can tell pretty much everything about it. This gives a lower bound. This is our algorithm, and this is what the singular value decomposition does. So this is kind of nice, because you see that the singular value decomposition meets the ideal point where all the entries have been revealed, and as fewer entries are revealed they diverge. So this is really the gap: what you lose by doing the singular value decomposition.
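The oracle bound just mentioned — hand the estimator one of the two true factors and reduce to least squares — is easy to simulate. A sketch; `oracle_estimate` and the sizes are my own choices:

```python
import numpy as np

def oracle_estimate(N_E, E, U):
    """Given the true left factor U, fit each column of the right factor by
    least squares on the observed entries of that column."""
    m, r = N_E.shape[1], U.shape[1]
    V = np.zeros((m, r))
    for j in range(m):
        obs = E[:, j]               # which entries of column j were observed
        V[j], *_ = np.linalg.lstsq(U[obs], N_E[obs, j], rcond=None)
    return U @ V.T

# In the noiseless case the oracle recovers the matrix exactly (generically),
# which is why it serves as a lower bound for any real algorithm.
rng = np.random.default_rng(0)
n, r = 60, 2
U = rng.normal(size=(n, r))
M = U @ rng.normal(size=(r, n))
E = rng.random((n, n)) < 0.5
M_hat = oracle_estimate(np.where(E, M, 0.0), E, U)
```

With noise added, the per-column least squares error is fully understood, which is what makes this a clean reference curve.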
Okay, unfortunately things are not as nice if you go back and try to solve the Netflix challenge. So these are three algorithms. The best one is this one; the blue-black one is the one for which we have a proof. And for the other two: this is alternating minimization — the idea is you minimize the same cost function, but you minimize over X, then over Y, then over X, then over Y, etc. This was actually first studied by a group at Microsoft in India. So you get this curve, and this is another algorithm that is better, for which we have no proof. The prize was kind of here; we are not there. But one interesting thing: this is the error on the entries that you didn't observe, and this is the error on the entries that you did observe. So you see this gap is due to overfitting somehow: on the entries that you observed you achieve a root mean square error that is smaller, and the prediction error is larger. >>: [inaudible] >> Andrea Montanari: Sorry? >>: [inaudible] long reconstruction [inaudible]? >> Andrea Montanari: Yeah. If you just want to come up with a matrix you are not required to-- if the matrix is not exactly rank R, then what is good or bad depends a lot on your matrix, right? So you might... >>: [inaudible] >> Andrea Montanari: Sorry? >>: What is [inaudible] algorithm [inaudible]? >> Andrea Montanari: Yeah, I don't understand what "no unique reconstruction" means here. You can come up with a guess for each value of each entry, and you can decide on that basis; this is your prediction. Okay, so I have one slide about proofs. Again, the algorithm that I described is: you do the rank R projection by singular value decomposition, and then you minimize the cost function greedily starting from whatever the singular value decomposition gave you. And the cost function — think of the noiseless case — is U V transpose, your factorization, minus X Y transpose.
This is what you are minimizing, and you sum this over all of the observed entries. So what is the strategy of the proof that we have? First you prove that the singular value decomposition is not too bad; this is the first theorem. And the way we do it, we use a technique that people use in random graph theory — this originally dates back, I think, to Friedman, Kahn and Szemeredi, and then Feige and Ofek. What these people did is they took a random G(n,p) graph with p a constant over N, they trimmed the high degree nodes, and they proved that once you trim the high degree nodes the largest eigenvalue is bounded — it is not square root of log N over log log N, okay? And how do you prove it? There are two ingredients. One is what you would also do for a dense matrix: you take an epsilon-net over the unit ball and look at the Rayleigh quotient. The largest eigenvalue is the maximum of the Rayleigh quotient, so you have to maximize a process over the sphere. What is delicate here is that you have to deal with the fact that the graph is sparse, and they do this with a careful counting argument. Essentially we extend this proof to rank R. And then the next thing to prove is that when you are close enough, this cost function is nice, in the sense that it is lower and upper bounded by a parabola which has its unique minimum at the correct reconstruction. So it is locally convex, somehow: the optimization problem is a non-convex problem, but in a neighborhood of this minimum it is convex. So we prove this. >>: The bound you have here is not a [inaudible]. >> Andrea Montanari: Yeah, yeah. So "locally convex" was one way of saying it. What we prove is this, and we prove that the gradient doesn't have any zero inside the neighborhood. These two things are sufficient to show that gradient descent converges to the minimum.
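The alternating minimization mentioned earlier (minimize the same cost over X with Y fixed, then over Y with X fixed, and repeat) has a pleasant property: each half-step is an exact least squares solve, so the cost can never increase. A sketch, with hypothetical names; this is the generic scheme, not the exact algorithm from any of the papers cited:

```python
import numpy as np

def als(N_E, E, r, iters=5, seed=0):
    """Alternate exact least-squares updates of X and Y to reduce
    F(X, Y) = sum over observed (i, j) of (N_ij - (X Y^T)_ij)^2."""
    rng = np.random.default_rng(seed)
    n, m = N_E.shape
    X = np.zeros((n, r))
    Y = rng.normal(size=(m, r))
    costs = []
    for _ in range(iters):
        for i in range(n):          # each row of X is a small least squares
            obs = E[i]
            X[i], *_ = np.linalg.lstsq(Y[obs], N_E[i, obs], rcond=None)
        for j in range(m):          # likewise for each row of Y
            obs = E[:, j]
            Y[j], *_ = np.linalg.lstsq(X[obs], N_E[obs, j], rcond=None)
        costs.append(np.sum((E * (X @ Y.T - N_E)) ** 2))
    return X, Y, costs

rng = np.random.default_rng(1)
n, r = 60, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
E = rng.random((n, n)) < 0.5
X, Y, costs = als(np.where(E, M, 0.0), E, r)
```

Monotone decrease of the cost is guaranteed by construction; convergence to the global minimum is not, which is part of why proofs for these schemes are delicate.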
Now, this was the proof technique until one year ago. That proof technique doesn't give a good bound on the number of iterations; now we have a better one that does, and it uses [inaudible]. Okay, so in the last 20 minutes I want to talk about graph localization. The way I got into graph localization is that I was thinking about this collaborative filtering model, and I thought: that model has a big lie in it. And the big lie is that in the model each user rates uniformly random movies. Now, Dimitris a few days ago pointed out that earlier work using the singular value decomposition actually looks at a richer model in which the entries revealed also depend on the value of the entry. But this is just for the singular value decomposition, and also those kinds of models have very strong assumptions, like knowing the conditional distribution of observing an entry given its value. But here there is a big stumbling block: it's not true that you go to Blockbuster or Netflix and pick a uniformly random movie. At least I don't do it. What you rent is probably movies that you expect to like a lot. So you expect to sample entries that are large. So you should think of a matrix that is approximately low rank, where the subset of entries that you observe are those that are large. And localization is a problem with a similar feature, and it is simpler to think about: you have points, and the entries that you observe are the pair distances that are small, okay? So it's very much related. What is the underlying low rank structure?
Well, suppose that I measure all of the pairwise distances, and I measure them exactly, and I construct this matrix of squared distances, which has the square of the distance between i and j at entry (i, j). So this is an N by N matrix with entry (i, j) equal to the distance between i and j, squared. I claim that the rank of this matrix is the ambient dimension plus 2 — so in my drawing it is 2 + 2. And this takes one minute to see, because D_ij is just x_i squared, which is rank one, plus x_j squared, which is rank one, minus 2 x_i dot x_j, which is rank D. So you can think of this problem as essentially the same as before: you have a rank D+2 — here, rank 4 — matrix of which I subsample the entries, and then I want to reconstruct the whole matrix. Now, the interesting thing is that all papers on this topic start by saying that graph localization is a nice application of these general ideas. But if you run any of the algorithms that were tried before, they don't work, exactly because of this fact: the entries that you are sampling are only the small entries. Of this matrix you only observe those entries that are smaller than some range. Okay, so here is an algorithm that instead kind of works. This is not the best algorithm I can propose, but it is what we can analyze so far. The input, again, is the perturbed distance measurements, and I want to come up with a set of two-dimensional coordinates that are accurate up to a rigid motion. Of course I cannot do better than that, because rigid motions do not change distances. So I solve this SDP problem in a matrix Q that is N by N: I minimize the trace of Q subject to Q positive semidefinite.
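The rank claim above is easy to check numerically: the matrix of squared distances of random points in the plane is the sum of two rank-one terms and one rank-d term, hence has rank at most d + 2 (and generically exactly d + 2):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 30
X = rng.random((n, d))                        # points in the unit square
sq = (X ** 2).sum(axis=1)

# D_ij = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j> = ||x_i - x_j||^2
# rank 1   +   rank 1   +   rank d   =>   rank(D) <= d + 2
D = sq[:, None] + sq[None, :] - 2 * X @ X.T
```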
And, essentially, a bunch of linear constraints on Q that tell me that Q reproduces the measured distances within accuracy delta. I will come back to this in a minute. Once I have the solution of this, I compute an eigendecomposition of it — this is a symmetric matrix. I take the eigenvectors corresponding to the largest eigenvalues, and I give you back this as my embedding of the points. >>: So is there noise here? There's no noise. >> Andrea Montanari: Yeah, there is noise, I am assuming. So there is noise, and in the algorithm this is reflected in the fact that I am not trying to reproduce the measurements, the d-tilde's, exactly. I only try to reproduce them up to some accuracy delta. Okay? >>: [inaudible] >> Andrea Montanari: So let me unpack one piece of notation: what is this M^{ij}? It is just a matrix with all entries zero except entry (i,i), which is one, entry (j,j), which is one, and entries (i,j) and (j,i), which are minus one. Then, if you think of this semidefinite program that I am trying to solve: suppose you decompose Q in this form. Q is a positive semidefinite matrix, so you can always write it as X times X transpose, and therefore the scalar product of M^{ij} with Q is the squared distance between the vector x_i and the vector x_j. So this program is equivalent to: minimize the trace, which is the sum of the products x_i dot x_i, the sum of the squared norms of the positions, subject to this constraint, which if you write it down is just telling me that x_i minus x_j squared is far from the measured squared distance by at most delta. So this Q really is the Gram matrix of my points. And this is a scalar product between matrices — trace of A B transpose — and this scalar product is extracting exactly this quantity. >>: So why are you trying to minimize this? >> Andrea Montanari: Excellent.
I am trying to minimize, first, because otherwise there are too many solutions, and second, because the problem is invariant under translations. Ideally I would like to find anything that satisfies the constraints: if I can solve the feasibility problem, I am fine. What the trace minimization really does is center the points, so it takes out the center of mass. Whatever solution you get will have its center of mass at the origin; in the end the trace doesn't play any other role in the proof. So what is known? People have looked at this SDP relaxation, mainly motivated by positioning -- so mainly [inaudible] and coworkers -- and they prove this kind of result: if the measurements are exact and the graph is (d+1)-laterated, then this algorithm finds the correct positions. What does (d+1)-laterated mean? That there exists an ordering of the points, of the vertices, such that vertex i has at least d+1 neighbors among vertices 1 through i-1. Where does this come from? Imagine: what is the simplest algorithm for finding positions? Just triangulation. You first position three points, then you find a fourth point that has three neighbors among the first three, then a fifth, and so on. So this theorem says that if the measurements are exact and triangulation works, then the SDP works. That is nice to know. >>: [inaudible] SDP seems like you have no [inaudible]. >> Andrea Montanari: But I don't observe all the entries. >>: Are you saying that is only true if I have all the distances? >> Andrea Montanari: If you have all of the distances, then you don't have this problem; it is only interesting for a subset. And if I have a subset, the simplest method is what I said: just triangulation. Now, people who do these kinds of things -- many applications people -- complain about triangulation because it is not robust. For instance, I was reading that the way they first measured the height of K2 in the Himalayas was through triangulation.
And for a while they were convinced that K2 was higher, because somewhere along the way somebody got a wrong measurement, right? And then this propagates. So triangulation is not robust, but of course the theorem doesn't show what the advantage of the SDP is with respect to it. Then there is the noiseless case, which has a long history, because it is related to rigidity theory, and there is this long-standing question: given a graph and a set of distances, is there a unique set of points that realizes these distances? Surprisingly little is known about this. I think it was settled in two dimensions by [inaudible] and others, and was settled in general dimension only recently. And there is a notion of generic rigidity here that I am not going to talk about. Okay? So the model that I am going to look at is this noisy model: my measurements differ from the actual distances by at most delta. And for simplicity let's say that the range within which I measure things is (log n / n)^(1/d). Why this funny function? Because this is essentially the first radius at which the graph becomes connected. If the range is much smaller, the graph has multiple components and you cannot position them relative to one another. So what happens? This is the theorem that we have: under these hypotheses, with high probability, the error in the reconstructed positions is upper bounded by n^(4/d) times delta (okay, there is a factor missing here on the slide). So this is a bit annoying, because there is a factor that diverges with the number of points, and you wonder whether this is the best factor you can get, at least in the analysis. And it is not that we were sloppy; we did things correctly, in the sense that there exists another configuration of points and a set of measurements for which the error is at least of this order. And here are the rates for this slide, forgetting log terms.
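[Editor's aside: the connectivity claim for this radius is easy to see in simulation. A sketch assuming numpy; the safety factor of 2 in front of (log n / n)^(1/d) is an arbitrary constant chosen for the demo, not a quantity from the talk.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 2
X = rng.random((n, d))  # random points in the unit square

# a constant multiple of the connectivity-threshold radius (log n / n)^(1/d)
r = 2 * (np.log(n) / n) ** (1.0 / d)

# adjacency: connect pairs of points closer than r
diff = X[:, None, :] - X[None, :, :]
adj = np.sum(diff**2, axis=2) <= r**2

# depth-first search from node 0 to find the reachable component
seen = np.zeros(n, dtype=bool)
stack, seen[0] = [0], True
while stack:
    u = stack.pop()
    for v in np.flatnonzero(adj[u] & ~seen):
        seen[v] = True
        stack.append(v)
print(seen.all())  # True: the graph is connected at this radius (w.h.p.)
```

[Shrinking r well below the threshold makes `seen.all()` fail: the graph splits into several components and no algorithm can position them relative to one another.]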
[laughter]. But somebody might be more careful than me. So I am not only forgetting log terms, I am really being sloppy about constants. Okay, so this is what we were able to prove, and actually it is quite interesting: the upper bound, a lot of it is closely related to Markov chain analysis, to the analysis of [inaudible]. But I will not talk about this. About the lower bound I will just say one thing: what is the worst-case configuration? For a while we struggled to improve our upper bound, because we thought that the worst-case configuration was this one. You have a set of points, and what the adversary does is blow up all the positions by a constant factor, and then it returns to you the measurements corresponding to these dilated distances, okay? And whatever algorithm you use, you will reconstruct the dilated positions, so you will get an error between the actual positions and the reconstructed ones. Now you do the math and compute the size of the error that you get for this kind of noise -- a very innocuous noise -- and you get n^(2/d) times delta. So for a while we tried to push the upper bound down to this, and we didn't succeed. We didn't succeed for a reason: this is not the worst-case configuration. So what is the worst-case configuration? It is this one. The SDP doesn't know the dimension in which you are embedding the problem, so when it returns a Gram matrix, it can be a Gram matrix in many dimensions, not just in two, right? So what the adversary does is take your points and -- okay, I struggled to do this picture -- bend them into the third dimension. It takes this bent point configuration, computes all the distances, and gives these distances back to you.
What will you do? The SDP will reconstruct this bent point configuration; then you do the singular value decomposition, so you flatten it. You do your math, you compute the error, and you get n^(4/d) times delta -- the dilation has nothing to do with it. Of course, once you understand that this is the worst case, there are ways to improve the algorithm, okay, but I will not talk about that. >>: If you go back, what was the [inaudible] upper bound [inaudible]? >> Andrea Montanari: It's the same up to poly-log factors. >>: And you won't reveal to us what is the power [inaudible]? >> Andrea Montanari: I don't. It is unknown. I don't know. Yeah, shame on me. So, conclusions. Singular value decomposition for analyzing large matrices is something that is relatively well understood. All these non-[inaudible] approaches, I think, are much less understood; they are very interesting, and there is some non-trivial probability theory there that perhaps people here could contribute to, and it is still open in the end. There have been all these papers, but precise sharp characterizations are open. There has been some work -- and this is very interesting -- about other models of matrices. In reality, matrices are not of finite rank. The right way to think of them is that they are of rank n, so full rank, but you do not want all the singular values to be of the same order. When you try to reconstruct the matrix, the real structure that you are trying to pull out is something like this: there are a few singular values that are big, and many more that are much smaller. So what is the correct way of pulling out this structure? One way is to cook up the right function to be optimized. Okay, so this is related to non-parametric methods.
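[Editor's aside: the bending adversary can be illustrated in a few lines. A sketch assuming numpy: lift planar points onto a shallow paraboloid; the pairwise distances barely change, yet the Gram matrix realizing the reported distances has rank 3, which a dimension-agnostic SDP may happily return.]

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps = 200, 0.1
X = rng.random((n, 2)) - 0.5  # true planar points

# adversary: bend the configuration into a third dimension
z = eps * np.sum(X**2, axis=1)
X3 = np.column_stack([X, z])
X3 -= X3.mean(axis=0)  # center (positions are defined up to rigid motions)

# the reported distances differ from the planar ones by at most max |z_i - z_j|
D_true = np.linalg.norm(X[:, None] - X[None, :], axis=2)
D_bent = np.linalg.norm(X3[:, None] - X3[None, :], axis=2)
delta = np.max(np.abs(D_bent - D_true))

# ... but the Gram matrix realizing the bent distances has rank 3, not 2
rank = np.linalg.matrix_rank(X3 @ X3.T, tol=1e-8)
print(delta, rank)  # delta is small; rank is 3
```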
And finally there is an interesting question, even for the singular value decomposition: what happens if the matrix is really huge, in the sense that it doesn't even fit on a single machine, and what are the methods to solve these problems in that case? All right, so these are interesting research [inaudible]. Thanks. [applause] >> Yuval Peres: Are there questions? >>: For the worst-case scenario you just drew, I think we can fix the SDP, [inaudible] you can maximize the [inaudible]? >> Andrea Montanari: There is this idea of maximizing the [inaudible], but then you have to put in the linear constraints, and the point is -- it's not clear that it will work, but it is a possible trick. Another trick is the following. If two points are not connected by an edge, you can still compute an upper bound on their distance, just by taking the shortest path in the graph. You can put that upper bound into your SDP, and this upper bound is actually pretty good if there are enough points. So you add further constraints to your SDP: it is like adding all the missing edges to your graph, but with some bigger error. We are actually trying to do the analysis of that right now. But yeah, once you understand what the problem is, there are fixes that one can try. >>: Why not just use standard multidimensional scaling [inaudible], not too much of a scale [inaudible]? >> Andrea Montanari: Yeah, good question. So there is this method that we didn't talk about, which does exactly what you are saying: you fill in your matrix by putting, in all the entries that you didn't observe, the length of the shortest path. But here is what is worse about that method: even if the measurements are exact, it will not return the exact positions, right? It will return something with some error, and that error is not there with the SDP.
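[Editor's aside: the shortest-path filling plus MDS pipeline discussed in this exchange can be sketched briefly. A sketch assuming numpy; the noiseless, well-connected setting is chosen only to keep the example short.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 120, 2, 0.35
X = rng.random((n, d))

# observe exact distances only within range r; fill the rest by shortest paths
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
G = np.where(D <= r, D, np.inf)
np.fill_diagonal(G, 0.0)
for k in range(n):  # Floyd-Warshall, O(n^3)
    G = np.minimum(G, G[:, k, None] + G[None, k, :])

# classical MDS: double-center the squared distances, take top-d eigenvectors
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (G**2) @ J
w, V = np.linalg.eigh(B)  # eigenvalues in ascending order
Y = V[:, -d:] * np.sqrt(np.maximum(w[-d:], 0.0))  # embedding, up to rigid motion
```

[Shortest paths overestimate straight-line distances between far-apart points, which is exactly the intrinsic error mentioned above: even with exact measurements, Y only approximates X up to a rigid motion.]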
So it is like in my first example, like simple singular value decomposition: even if the matrix is exactly low rank and you observe it without noise, you will not be able to reconstruct it exactly. So there is this extra precision that you get from the SDP that you don't get with MDS. >> Yuval Peres: Any more questions? Thank you. [applause]