>> Ran Gilad-Bachrach: So it's a great pleasure to have Guy Lebanon with us. He came all the
way from across the lake to join us today.
Guy is a senior manager at Amazon where he leads the Machine Learning Science Group. Prior
to that he was a tenured professor at Georgia Institute of Technology and a scientist at Google
and Yahoo!. And even in his history there is an internship at Microsoft.
His main research areas are machine learning and data science. He received his Ph.D. from CMU
and his BA and MS from the Technion, the Israel Institute of Technology.
He has authored more than 60 refereed papers. He's an action editor of JMLR, was the program
chair of CIKM, and will be the conference chair of AISTATS in 2015, so be nice to him if you
want your paper accepted.
He received the NSF CAREER Award, the ICML best paper runner-up award in 2013, and the
WWW best paper in 2014. He had the Yahoo! Faculty Research and Engagement Award and is
a Siebel Scholar. So he's going to talk about local low rank matrix approximation.
>> Guy Lebanon: Thank you very much. I wanted to say a few words first. This is work I did
while I was at Google. And the collaborator is Yoram Singer, and Seungyeon Kim and Joonseok
Lee are two of my Ph.D. students at Georgia Tech. And they're still my Ph.D. students, although
I'll probably be handing them off soon and become a coadvisor instead of a regular advisor.
I've been one year so far at Amazon. But we're doing follow-up work on this. And, in fact, the
follow-up work, like Ranny mentioned, won an award at WWW.
And I'm happy to take questions during the presentation, so please feel free to ask questions.
And, finally, thank you for hosting me. I think I know almost everyone here. I'm happy to meet
old friends.
Okay. So the matrix completion problem -- one way to describe it is as
follows. We have a matrix. Let's think of it this way. And then we have some observations.
Some entries in the matrix marked by red are actually available to us. The numbers there are
known. The black dots are unknown. And the task is to predict what's the value of the black
dots from the red dots. That's matrix completion.
So formally we have these observations, a few entries of the matrix, and we want to predict the
unobserved entries of the matrix. And this is very hard to do if we don't say anything else,
because of course there are a lot of possible completions that are consistent with the training
data, and how should we favor one over the other?
In machine learning we have this principle of trying to favor a simple model rather than a
complicated model that agrees with the data to try to prevent overfitting.
But what is a simple matrix? If it's a vector, we would probably put regularization on it, favor
small entries or sparse entries, favor sparsity. But for a matrix, what does it mean? Probably the
most popular approach to define what is a simple matrix, or to favor simpler matrices, is to define
that as the rank of the matrix. So we want to favor low rank matrices.
And a low rank matrix, those of you that don't remember linear algebra, is a matrix that can be
written as a product of two other matrices. One of them is tall and narrow; the other one is short
and broad. And the dimensionality here and here is the rank of the matrix. And the idea is that
that rank is much smaller than the minimum of this and that.
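As an illustration of the factorized form just described, here is a minimal numpy sketch with made-up dimensions (not code from the talk): a rank-r matrix written as a tall-and-narrow factor times a short-and-broad factor.

    import numpy as np

    n1, n2, r = 1000, 500, 10    # hypothetical sizes, with r much smaller than min(n1, n2)
    U = np.random.randn(n1, r)   # tall and narrow
    V = np.random.randn(n2, r)   # V.T is short and broad
    X = U @ V.T                  # an n1 x n2 matrix whose rank is at most r
    assert np.linalg.matrix_rank(X) <= r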
Any questions so far? Okay.
Okay. So how does it -- how do we use that or what's an application of this matrix completion
problem? This is probably the most popular application these days, although there are other
applications as well: recommendation systems. Specifically, these entries are ratings. And here
is how we interpret it. The row denotes the user and the column denotes an item. And the entry
denotes the number of stars given to that item by that user.
>>: Is this really the best model for a recommendation system? Because a lot of
recommendation systems that have values [inaudible] random people, it's not truly Gaussianly
distributed [inaudible].
>> Guy Lebanon: I haven't gotten to the loss function or the specific [inaudible] yet, but you're
right that in many cases this is assumed.
And I can tell you -- so this paper is an academic paper and I've worked on it while I was at
Google. But now with Amazon I'm doing recommendation system in practice. And things are
very different.
So RMSE performance does not correlate well to increases in revenue. It's not the same thing.
And if you look at publications and there's ten papers and one of them is the winner in RMSE,
it's not necessarily the approach you want to take in industry in order to get better user
engagement or better sales revenue increase or some other metric you're interested in.
The whole concept, in fact, I would go further -- beyond what you said, the whole concept of
offline evaluation is not working very well. We need to innovate how we evaluate, not just a
particular loss function but the whole process of training on a training set and testing on a
testing set. I don't think that's working well for real-world products. It doesn't
get us beyond a certain point. That's a whole big topic.
Okay. So anything else? Yeah. So the training data is some observations that we have. Think
of Netflix example in your mind if you want. And the unobserved are the ratings of the users for
unseen items. And if we can complete a matrix, we can recommend items to users based upon
these values.
Now, again, I should say: if you want to build a recommendation system in industry, this may
not be the best strategy. But that's the kind of data that we have in academia
or public datasets. I cannot publish results on Amazon datasets. So this is results on -- I'm going
to show results on Netflix datasets, other public datasets, and that's the kind of things you can do
with it. Yeah.
>>: [inaudible] abstraction [inaudible] like all this like datasets are kind of static, like kind of
take [inaudible] the most present problem, the recommendations [inaudible]. So is it -- do you
think it is worth like abstracting [inaudible]?
>> Guy Lebanon: Absolutely. So I don't want to talk about my work at Amazon, but I can just
say very briefly a lot of work needs to be devoted to what is -- what do you usually watch in the
evening or in the morning or in the weekends versus the week, is it the kids watching or the wife
watching or the husband.
And is your taste changing? What about new movies that just came out? Everyone wants to
watch them. There's a lot of subtleties. So I absolutely agree with you. This is kind of a
well-contained talk within a standard kind of static offline dataset.
And I think still it has a lot of interesting ideas. I don't think you'll be disappointed. But I agree
with you that this is not necessarily the right setup for industry. Yeah. Good point.
Okay. So here is the standard or the most popular recommendation system. If you look at
academic papers and if you think about industry, probably this is one of the main baselines, and
variations of this are still being used.
Okay. So we have -- the thing to pay attention to here is we have the observations, which we
denote by M. That's the matrix of observations. And we take the squared error between the
observed entries and the values of the reconstructed matrix. Then we sum over all observed
entries. A denotes the training set, the set of observed entries. And then we find the matrix that
minimizes the reconstruction error on the training set subject to the rank of the matrix being a
specific value.
This is hard to optimize because this constraint is not very easy to work with. So what we do
instead, we cast it in the following way, which is an equivalent optimization problem. We
optimize instead of over X, we write X as U times V transposed in that kind of factorized form
that I showed you earlier. And then we optimize over U and V. And there are no constraints,
just an unconstrained optimization problem. If you want you can add regularization.
But this is much more convenient to work with.
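A rough sketch of solving this factorized problem by stochastic gradient descent, with an optional L2 regularizer of the kind he mentions (my own illustration, not the authors' code; the function name and hyperparameters are made up):

    import numpy as np

    def incomplete_svd(rows, cols, vals, n1, n2, rank, lr=0.01, reg=0.02, epochs=50):
        """Minimize sum over observed (i, j) of (M_ij - u_i . v_j)^2 (+ L2 penalty) by SGD."""
        rng = np.random.default_rng(0)
        U = 0.1 * rng.standard_normal((n1, rank))
        V = 0.1 * rng.standard_normal((n2, rank))
        for _ in range(epochs):
            for i, j, m in zip(rows, cols, vals):
                err = m - U[i] @ V[j]        # residual on one observed entry
                ui = U[i].copy()             # keep the old u_i for the v_j update
                U[i] += lr * (err * V[j] - reg * U[i])
                V[j] += lr * (err * ui - reg * V[j])
        return U, V                          # reconstructed matrix is U @ V.T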
This is a very popular matrix completion method. It works reasonably well if you measure your
performance by reconstruction error on test set.
One issue is you need to know the rank R in advance. And another issue, it's not convex. And I
want to say one more word about this because if you remember linear algebra, SVD, there is a
single SVD decomposition for any matrix. So why can't we find the right or the single SVD
decomposition? The reason is that we sum here only over a subset. We don't sum over all
entries in the matrix. If this sum was over the entire matrix, there would be a single SVD
decomposition that minimizes the reconstruction error. And you can get that decomposition by
looking at singular values and singular vectors of the matrix. But we don't sum over the entire
matrix, and that introduces lots of local minima.
Questions? Okay. Here is what happens if you run this kind of procedure. The Y axis is the
mean squared error or root mean squared error on a test set. And the X axis is the rank R of the
decomposed matrix. And you get better test set performance as you increase the rank, but at
some point you start hitting diminishing returns. You cannot increase the performance much
further.
>>: What are the units [inaudible] stars?
>> Guy Lebanon: Yeah. That's right. That's right.
>>: So there's no [inaudible] I can increase the rank?
>> Guy Lebanon: At some point there would be overfitting. And if you add regularization,
you can probably do better. Yeah.
>>: It goes to .95. What's 1? Is 1 total ignorance, or --
>> Guy Lebanon: No, so 1 would be an average error of one star away from the actual
prediction on a 1 to 5 star scale. 1 to 5 stars, yeah.
>>: Okay.
>> Guy Lebanon: So it means an average error of one, but in the sense of root mean squared
error.
More questions? Okay.
So here is the observation. As the rank increases, the accuracy quickly hits the law of
diminishing returns. And let's consider the following two hypotheses, okay? Hypothesis one: the
matrix that we are trying to complete has a low rank, and the law of diminishing returns reflects
the best possible prediction. Okay? The data just has some noise in it; we cannot improve the
prediction beyond that. We've correctly estimated the rank and we're using the right rank.
The second hypothesis is that the matrix that we're trying to complete has a high rank, so the
low rank assumption is incorrect. And in this case, why doesn't the accuracy keep improving as
we increase the rank? The diminishing returns are due to overfitting, and the reason for the
overfitting is that as you increase the rank, you're going to have more parameters. The matrices
that you use in the factorized form have more and more parameters. And you would overfit
because the data doesn't grow while the number of parameters can grow to infinity as you
increase the rank.
The second reason could be convergence to a poor local minimum. And -- yeah.
>>: How do you formalize the fact that it's neither low or -- is it like a binary distinction while in
reality there's going to be a sliding scale of as you increase the rank you're able to fit better but at
the same time there's like lower fitting stability? Is there a way to say like there's a mean in
some metric which indicates whether it's a --
>> Guy Lebanon: No, so this assumption is on the original matrix --
>>: Right.
>> Guy Lebanon: -- that we don't observe fully. And this is either a low rank or a high rank.
>>: So you're saying that the [inaudible] assume that there is a fixed rank that --
>> Guy Lebanon: We assume there is a truth. We assume there is some matrix that we're trying
to complete. We don't know what it is precisely, but it exists. And it either has a low rank or a
high rank.
>>: But even [inaudible] probabilistic assumption that [inaudible] some distribution over
possible ranks --
>> Guy Lebanon: No. No. It's either --
>>: Or an eigen [inaudible].
>>: Or [inaudible] approximately.
>>: Yeah, approximate rank.
>> Guy Lebanon: Yeah. Okay. So in the -- so what we're trying to -- what I'm going to try and
argue with this presentation is that in the context of recommendation systems and standard
datasets like Netflix and other datasets, the second hypothesis is the correct one rather than the
first. But also the first hypothesis is incorrect globally but is correct locally. And I'll explain
what that means.
The third thing is combining SVD with nonparametric smoothing significantly postpones the
diminishing returns, and it also reduces the computation. So we have a win-win situation here.
And, in fact, the computation is a big problem because datasets are big in industry, and in many
cases we have a real computational challenge trying to optimize that function that I showed
before.
Okay. So I'm going to describe the model first informally and then there will be a formal
definition later. Okay. So this is the informal description. A matrix -- so here is the assumption,
okay? We're assuming two things. We're assuming that the matrix completion that is accurate
across all users and items should be of high rank. The second assumption is that a matrix
completion that is accurate for one specific type of users and one specific type of items -- for
example, if we only look at the population of teenagers and the population of old movies, that
kind of submatrix, teenagers and old movies -- will have a low rank. And the rank is probably
very low: teenagers don't like old movies, the rank is 0. Right?
So it becomes a very simple problem once you narrow it down to a very coherent population of
users and a very coherent population of movies. The problem comes because you have a lot of
these coherent subpopulations, and if you want a matrix that is correct globally, then the rank
will be high.
>>: [inaudible].
>> Guy Lebanon: Sorry?
>>: Yeah, the matrix M is composed of [inaudible].
>> Guy Lebanon: It's not -- so it's more complicated than that because the -- these populations
may be overlapping. And it's not necessarily a block structure. But that is a correct intuition.
Okay. So here is the algorithm. Okay. Identify Q neighborhoods in the set of matrix entries. So
first thing, please parse the sentence here. Okay? We have a set of all possible matrix entries.
Okay? This is a set, but it's -- we can also think of it as a space with a matrix structure that is
discrete. It's a discrete space, not continuous, but we can still think of it as a metric space. And
we can still define the notion of neighborhoods if we have an appropriate distance function.
How do we get the distance function? I can talk about it a little bit later if you're interested. For
now, let's just assume we have the distance function that measures similarity of items and
similarity of users. Let's assume this is given. If we have that measure, we can identify the
neighborhoods.
>>: [inaudible] metadata.
>> Guy Lebanon: Not necessarily, actually. We -- in the experiments that I'll show, we get the
neighborhoods from the training set that we use to construct the matrix factorization later on. So
we don't use additional data. We use the same data to construct the neighborhoods as well.
>>: Are you using the neighborhoods because that particular [inaudible] will result in lowering
approximation for that piece, or is that [inaudible] define the subsets here?
>> Guy Lebanon: This is a pretty big issue how to find the neighborhoods. In the work that
we've done here, this would -- we're using a pretty elementary technique. And I can discuss how
we do that.
>>: Is it a kind of [inaudible] of the problem, or is it something like [inaudible] that are given to
me by an [inaudible] way --
>> Guy Lebanon: Right. So okay. Okay. I'm happy to describe. So what we did to define the
neighborhoods -- I prefer not to get into that right now, but I'm happy to answer since both of
you are interested. What we did is we sampled entries from the training set. Okay? And then
we defined neighborhoods that are centered at these samples.
So say we sampled ten -- for example, we want to have ten neighborhoods. We sample ten
entries from the training set's set of entries, and then we form ten balls, in the distance metric
that we defined, centered around these ten samples.
We have a follow-up paper where we do this in a more adaptive manner -- yeah.
>>: [inaudible] the neighborhoods are not necessarily [inaudible].
>> Guy Lebanon: Exactly. Yes. Absolutely. Absolutely. And, in fact, they don't even have to
have -- it's not like they have to be block -- even if you permute the rows and the columns, you
can still not get blocks. But absolutely. Yeah.
>>: [inaudible] the neighborhood don't have the [inaudible] structure.
>> Guy Lebanon: Not necessarily. In the general -- I mean, we are using a product distance in
the experiments, but not necessarily.
Okay. So we identified these Q neighborhoods. And the way we do that is we just sample Q
points from the training set's observed entries and we create balls around these points.
And then for each of the neighborhoods we construct a separate low rank matrix that is
especially accurate within it. We don't care so much about being accurate outside of it. And
then we patch up these local models to get a matrix approximation that is global and that is high
rank. But the local patches are low rank.
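A hypothetical end-to-end sketch of those three steps -- sample anchor entries, fit a weighted local low rank model per anchor, patch the local models together -- where dist, kernel, and weighted_fit are assumed placeholders rather than the authors' actual code:

    import numpy as np

    def llorma_sketch(train, q, rank, dist, kernel, weighted_fit):
        """train is a list of (i, j, rating) triples; dist is a distance on matrix
        entries, kernel a smoothing kernel, and weighted_fit solves a weighted
        rank-`rank` completion (for example, weighted SGD) returning factors U, V."""
        rng = np.random.default_rng(0)
        anchors = [train[k][:2] for k in rng.choice(len(train), size=q, replace=False)]
        local_models = []
        for s in anchors:                            # steps 1 and 2: one local model per anchor
            w = [kernel(dist(s, (i, j))) for (i, j, _) in train]
            local_models.append((s, weighted_fit(train, w, rank)))

        def predict(i, j):                           # step 3: patch the local models together
            ks = np.array([kernel(dist(s, (i, j))) for s, _ in local_models])
            preds = np.array([U[i] @ V[j] for _, (U, V) in local_models])
            return preds @ ks / max(ks.sum(), 1e-12)

        return predict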
>>: But you're doing no refinement of the neighborhood to improve the fit of the low rank at
that point? Like [inaudible]?
>> Guy Lebanon: That's right. That's right. Like I said, we have a follow -- we have a journal
paper where we generalize this. But right now this is -- we just find the neighborhoods in a very
naive way, we compute a local model, and then we patch up the local models.
Okay. So here is a picture of the neighborhoods. But like Chris said, and I always say this
caveat, this neighborhood isn't normally adjacent rows and adjacent columns. Okay? Because,
for example, this movie may be similar to this movie and to this movie and this user may be
similar to that user. But I cannot visualize it in a different way. So this is just for the sake of
visualization.
Assuming that the distance function just is the distance between the indices, okay, so these two
are necessarily similar items, these two a little bit further, a little bit further, same thing with the
rows, you could get something like this kind of neighborhood structure.
That's one thing I wanted to mention.
The second thing I want to mention is that this is not exactly what we do. We do something a little bit
different. Because here we assume the neighborhoods are precisely defined and they have a
precise boundary and everything within the neighborhood counts in the same way. We use
something called kernel smoothing where there's a weighting function that gives the weight
corresponding to the distance from the center of the neighborhood.
And that means we can have neighborhoods that even cover the entire matrix, or cover a subset
of the matrix. But as you get closer and closer to the center, the weight becomes higher.
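Two standard smoothing kernels that could serve as this weighting function; these are illustrative choices on my part, since the talk only says that the weight decays with distance from the neighborhood center:

    import numpy as np

    def epanechnikov(d, h):
        """Weight decays with distance d and vanishes outside bandwidth h (finite support)."""
        u = d / h
        return 0.75 * (1.0 - u * u) * (np.abs(u) <= 1.0)

    def gaussian(d, h):
        """Smooth weight that never quite reaches zero."""
        return np.exp(-0.5 * (d / h) ** 2)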
>>: So just to make sure that I understand. So at the end of the day what you're getting is
basically a model that can be represented by a set of [inaudible] and a set of matrix completions
which were trained to be low rank.
>> Guy Lebanon: That's right.
>>: So when you want to do a prediction, how would the prediction work?
>> Guy Lebanon: So I haven't gotten to this. I said patch up; that's the third step. I
haven't gotten to it yet. I will show you in the next slide or two how we do that. Yeah.
>>: I think so if -- let's set aside the low rank stuff. If you have a distance metric [inaudible]
even defining Gaussian process if there were constants and just made predictions, so you're
saying is that you could always -- if you have a distance metric or a kernel on your sort of item
or your pairs, you could always just do Gaussian process reduction, right? So is what you're
doing -- and that's kind of like if you imagine a Gaussian -- I mean, this is weird, it's [inaudible]
Gaussian with different sized bumps that are [inaudible].
>> Guy Lebanon: Yeah. Yeah, yeah.
>>: Now, are you doing -- there used to be an old technique where you took the Gaussians and
then had like a little local linear model that's sort of turned on by each Gaussian. Is that what
you're doing? [inaudible] but it also goes back to [inaudible].
>> Guy Lebanon: I think it's -- it would be similar because that would correspond to a low rank
structure, right, like a local low rank. So I think it's similar. Not exactly. But there is some
relationship.
>>: Okay. So I don't know if you've also tested versus just the plain old Gaussian process where
you take your neighborhoods and just a constant -- everything is rank 0. I don't know if you
tested that.
>> Guy Lebanon: That performs very badly.
>>: Oh, okay.
>> Guy Lebanon: Yeah. The local low rank is crucial because it kind of discovers this local
semantic structure in it. I'll show you later graphs where you can see the performance of the
local models as a function of the rank of the local models. And you can see that we really
postpone the law of diminishing returns, and there is [inaudible] significant improvement as you
increase the rank of the local model.
Okay. So here is the -- so this is the first step, identify the neighborhoods. Okay. Second step is
for each neighborhood construct a local low rank matrix completion. Suppose we look at this
neighborhood, and remember we have this weighting function. So we want to get this low rank
matrix that approximates the entire big matrix but is especially accurate here, a little bit less
accurate here, a little bit less accurate here, and here we don't care.
And that would be the local model that correctly describes the situation here. If we want to think
of it intuitively, we can think of teenagers and old movies. But like you said, we can do that
even without metadata, just by figuring out which users have similar viewing patterns and which
items have similar viewing patterns. Yeah.
>>: [inaudible] are not.
>> Guy Lebanon: Depends -- depends. Yeah. I mean, in the experiments we did, we did use a
product structure. This is -- like I said, the neighborhood will not be kind of especially similar to
each other. But in our experiments the neighborhoods do have a product structure. But you can
think of whatever distance you want more generally. If you have metadata, you can use the
metadata. And conceivably one would want to use nonproduct distances.
Okay. And then once you -- the third step, once you have these local models, you have a local
model for here, for here, and for here, you would patch them up.
Okay. So we are ready for the formal description. I think. So here are the -- what we have. So
we have a distance function on the set of matrix entries, and this is in our case computed based
on observed matrix entries. But it could also be computed based on additional data like if we
know the directors of the movies or the actors or the genres or whatever, or we know something
about users, we can have that help us.
Okay. And here is what we assume to be the truth that we're trying to discover.
We assume that there's an operator that maps the set of matrix entries to matrices. So for each
matrix entry, we get a specific matrix. Okay? This is the set of all matrices and these are the
matrix entries. So for each matrix entry, for example, the entry row 2 and column 3 we would
get a brand-new matrix corresponding to that.
And that matrix has two properties. One property it has rank R, low rank, a second property is
that that matrix evaluated at the specific matrix entry that we picked here would give us the same
as the observed value.
The second property of the operator is that it's somewhat continuous in just a sense -- just a
second -- it's continuous in the sense that it's slowly changing. Because we don't want the
operator to jump wildly. The whole point in defining the distance function is that the distance
function means something. What it means is that operator, if it maps two points that are close to
each other, it will give us similar matrices.
Now, this is not a continuous space, so slowly varying, or smoothness, we define in this way,
which is called Hölder continuity. The distance between -- this is one matrix evaluated -- this is
the operator evaluated at one point, the operator evaluated at another point, these are two
matrices. The Frobenius norm between them is less than the distance between the two matrix
entries.
And we have to have that inequality, meaning it cannot change too much.
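Putting the assumptions just stated into symbols (my reconstruction in LaTeX; the constant $C$ and exponent $\alpha$ of the Hölder condition are generic, not values given in the talk):

    T : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}, \qquad
    \operatorname{rank}\bigl(T(s)\bigr) \le r, \qquad
    \bigl[T(s)\bigr]_{s} = M_{s} \ \text{for observed entries } s,

    \bigl\| T(s) - T(s') \bigr\|_F \;\le\; C \, d(s, s')^{\alpha}
    \quad \text{for all matrix entries } s, s'.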
Wait with your question because I have a diagram that may make it more clear, and then I'll take
questions.
Here is a picture of what we have. This is the matrix that we're working with. Okay? And each
point in it is mapped to another matrix by this operator. And these matrices all have low rank.
And as we move from one point here to a neighboring point, neighboring defined by the distance
that we defined, if we move a little bit here, these matrices should move also slowly. Should not
change too much.
And the value of this matrix here is the same as the value here. Okay? And the value of this one
here is the same as the value here.
So these matrices have to agree with the values in this matrix. That's one. The second thing is
they have to have a low rank, these ones. And the third thing is they have to change slowly. Yeah.
>>: [inaudible] matrices defined for neighborhoods to essentially [inaudible].
>> Guy Lebanon: For every single -- every single matrix entry we have a brand-new matrix.
Yes. And this is not something we have to find, but this is -- this is the model, what we assume is
happening. We have these matrices. Now, these are low rank. Okay. And they describe what's
going on at this point and they change slowly.
>>: With this slow changing property, you're only requiring within [inaudible] right, not across
[inaudible]?
>> Guy Lebanon: No, it's defined -- it's defined this way. Yeah.
>>: [inaudible] kind of global --
>> Guy Lebanon: Yeah, these [inaudible] measured here. So, yeah, this is the distance measure
here. We define. It's up to us to define. If we do a bad job with defining D, then obviously this
assumption will not be correct.
>>: But you have a T --
>>: Constants will be large.
>>: But you look at T across -- so your different patches will have different ranked sizes, right?
They can -- because you have neighborhoods of different sizes. So T for one is not the same size
as T for another. Or is it a global T?
>> Guy Lebanon: So T -- the size of T -- T maps into the same size. All of them are matrices
over the entire set --
>>: Over the entire space.
>> Guy Lebanon: Yeah. So those are the number of users and number of items.
>>: So [inaudible] exponents bigger depending on the [inaudible]?
>> Guy Lebanon: Probably. Yeah.
>>: So to one in R, the rank is fixed across the entire [inaudible]?
>> Guy Lebanon: Yeah. This is also something that can be generalized. But in our experiments
we use a single constant for the rank of all these matrices.
>>: And the second question, is the distance metric [inaudible].
>> Guy Lebanon: That's right. That's right. Yeah.
>>: So have you thought about bridges this? Because right now D is completely separate, right?
So ideally you'd want to sort of [inaudible] and learn a good D that would give you good
[inaudible] as opposed to trying to fit some D that may or may not be right [inaudible].
>> Guy Lebanon: Yeah. Yeah. Yeah. It is a good point. I mean, we haven't done any work on
that. Our first extension was to -- I haven't even showed you what we're doing formally. So I
want to talk about it very quickly. The first extension is to work with ranked loss functions,
which perform better in real-world cases than squared loss. The second extension was to work
with -- to try to identify the neighborhoods in an adaptive manner.
And what you're talking about is another level, which is adaptively modifying D. That's another
reasonable approach.
>>: [inaudible].
>> Guy Lebanon: Right. So right. So right now this work we assume -- if you want, I'll tell you
how we get D, but we get it from the training observations based on similarities between how
users -- what users view and what items -- how items are being viewed.
Okay. So I -- this is another -- I'm just going to skip that I think because I don't have that much
time. So here is -- we have two estimators. Now I need to show you what exactly we do. I
haven't yet showed you formally. There are two estimators. One is called the expensive
estimator. And this estimates the operator T. Okay? So we're going to estimate the -- we're
going to estimate each one of these matrices. Okay? And each one of these, if we estimate the
matrix, this one, we can evaluate it here. And this can serve as an estimate for this. Okay?
So we're going to do that for every single point. And that's why we're going to call it the
expensive estimator. And then we're going to have a cheap estimator that I'll show you how we
do. Yeah.
>>: Before you jump in, how many neighbors are you talking about? About five? About 5
million? What sort of numbers are in your head?
>> Guy Lebanon: Some -- actually, I show -- I think I show you a graph -- I think I may have a
graph as a function of the number of models. But I would say in the tens or in the hundreds, I
would say. Depends on the size of your matrix also.
So this is the expensive estimator. So here is the -- what we formally do, how we get our model.
We look at the training set and we learn the low rank matrix that minimizes the squared
reconstruction error, but we give weights. The weights correspond to the distance of the point
from the center of the neighborhood. Okay?
Because I want my local model to be particularly accurate in the area of the neighborhood that
I'm choosing, I want it more and more accurate as I get closer to the center, subject to a low
rank. I showed you earlier how we do that without the weight. Sticking in the weight is
relatively simple; it's the same difficulty level.
And then I just evaluate this estimate that I've computed at the point I want to predict, and I get
my matrix completion or approximation.
K is a kernel function defined in one of these ways or a different way, just something that
measures -- is higher and higher as you get closer to the center.
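In symbols, the weighted local fit and the pointwise prediction he describes look roughly like this (my notation; $K_h$ is a smoothing kernel with bandwidth $h$, $A$ the set of observed entries):

    \hat{T}(s) \;=\; \arg\min_{X \,:\, \operatorname{rank}(X) \le r}
      \sum_{(i,j) \in A} K_h\bigl(d(s, (i,j))\bigr)\,\bigl(X_{ij} - M_{ij}\bigr)^2,
    \qquad
    \hat{M}_{ij} \;=\; \bigl[\hat{T}\bigl((i,j)\bigr)\bigr]_{ij}.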
What's the difficulty here? The difficulty is that for each entry that we want to predict, we have
to do a separate SVD. We have to run a separate SVD. When we need to estimate a rating for a
movie by a user, we sometimes get that request at runtime.
We don't want to solve an SVD at runtime and have the user wait for the answer. And even if it
was offline, it's still too much to compute.
So this is not practical. And we're going to get to the cheap estimator which is T double hat.
And this is going to be very practical, actually. In fact, it's going to be so practical that it's going
to be faster to compute than the nonlocal version.
So here is the cheap estimator. The first step is to compute the approximation T hat at Q anchor
points, S1 through SQ, offline. These are the neighborhoods I said we identify. And the way we
construct them is we just sample the centroids from the training set and we put a certain ball
around each one.
Okay. So we compute Q local models. Okay? We compute a low rank model that is especially
accurate as you get closer to these centroids. And then you patch up these local models, these Q
local models. And how do you do that? So you have different values you want to patch up, we
do that using a technique called kernel regression. This formula specifically is called
Nadaraya-Watson local regression or kernel regression, locally constant kernel regression.
And the thing we do here, though, is kernel regression in the space of matrices. The standard
way of talking about kernel regression is kernel regression on the real line, but here, instead of
doing kernel regression over scalars or over vectors, we do it over matrices.
So in essence this is just a weighted average of the local models. And the weights reflect the
similarity of the local model to the point that you want to predict. And then you
can form your completion this way.
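The patching step in formula form, roughly the Nadaraya-Watson estimator over matrices that he names (again my reconstruction): given anchor points $s_1, \dots, s_Q$ with local models $\hat T(s_1), \dots, \hat T(s_Q)$,

    \hat{\hat{T}}(s) \;=\;
      \frac{\sum_{t=1}^{Q} K_h\bigl(d(s, s_t)\bigr)\, \hat{T}(s_t)}
           {\sum_{t=1}^{Q} K_h\bigl(d(s, s_t)\bigr)},
    \qquad
    \hat{M}_{ij} \;=\; \bigl[\hat{\hat{T}}\bigl((i,j)\bigr)\bigr]_{ij}.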
So what do we have here? We have step one, offline computation. You can do that offline
before you need to serve anything to users: solving Q low rank SVD problems in parallel.
Step two is this patching up. It can be done very fast, a linear combination of local models. I'm
going to show you some graphs soon that show improved prediction accuracy as well as reduced
computational cost. Now, why would you have reduced computational cost? After all, you
replace one SVD with Q SVDs. So how do we save computation? Any ideas?
>>: [inaudible].
>> Guy Lebanon: That's right. So there are two ways that you gain. One is that the kernels may
have a finite support, and that means you can solve a smaller matrix approximation or a smaller
low rank SVD. And it could be much smaller. That's one gain.
A second gain is that the rank of the local models can be much smaller than the rank of the
global model that you do alternatively. And solving, say, 100 SVDs of rank 50 is much easier
than solving one SVD of rank 5,000.
>>: [inaudible] support they actually go to zero, though, then you need to check to make sure
there aren't places in the matrix [inaudible].
>> Guy Lebanon: That's right. Yep. That's right. That's right. So you have to be careful there.
But there's these two gains. One is the support and the other one is you can get really good
results by having a really low rank, much lower rank than the original low rank for the entire
matrix. And it's easier to solve many problems with low rank than one problem with bigger
rank. Yeah.
>>: [inaudible] SVD, correct?
>> Guy Lebanon: So I mean this problem right here. I call this incomplete SVD because we
sum only over a training set. We don't sum over all -- we don't sum over all the entries, which
would be the algebraic SVD. All right. So this requires gradient descent. You cannot do it with
singular values and singular vectors.
Okay. I'll show you some results. So the X axis is the number of anchor points which is the
number of local models or centroids or patches. The Y axis is the RMSE on test set.
These dashed lines show the performance of SVD, which is the standard baseline, still very
popular. And as you increase the rank from 5 to 7 to 10 to 15, you see the law of diminishing
returns. Right? Very quickly you get to a point that you cannot improve beyond. And the reason
you cannot improve beyond it is what we said earlier: that issue of overfitting and convergence to
local minima.
The local models: the performance increases as the number of anchor points increases. And
they also get better as you increase the rank of the local model. This is
rank 5. Every local model is of rank 5 here. Okay? But we have 45 models of rank 5 each.
Okay? So the total rank would be 45 times 5. Okay. The rank of the big matrix, that will be
approximately 45 times 5, here it will be 45 times 15.
But you see that if you would keep increasing it, you're not going to -- you're basically going to
be stuck here if you just do a rank 500 SVD. It's not -- it's not going to get much better. Sorry?
>>: What's DFC?
>> Guy Lebanon: DFC is something called Divide-and-Conquer Matrix Factorization, a paper
by Michael Jordan, Lester Mackey, which does a similar trick of dividing the original matrix into
patches and learning local models and then patching together.
So it's a relatively recent paper that takes a similar approach. And this is the Netflix winner.
Although we used -- it's a long story -- we used a slightly different dataset than the Netflix winner.
So this line should be -- it's the same size, but this line should be taken with a grain of salt. It's
not a formal comparison.
>>: Do you know where the limit is? I mean, because you -- you know, you're showing up to
15, right? Then what? [inaudible].
>> Guy Lebanon: Yeah. So -- right. So the -- you're [inaudible] the limit pretty soon after.
>>: All right.
>>: So all the way on the left, so if you have, say, one anchor point, you're still not doing a
global --
>> Guy Lebanon: That's right.
>>: [inaudible].
>> Guy Lebanon: That's right. So this is very -- that's a very good observation actually. Look
how much better we do with even a rank 1 local -- I'm sorry, with one anchor point. Why do we do so
much better than the global SVD? Because we sample the anchor point from the training set and
the learned model is more accurate in the vicinity of the training set than in remote corners of the
matrix that you don't care about.
>>: So you patch it on the rest [inaudible].
>> Guy Lebanon: So we focus -- we particularly focus our attention on what we've seen in the
training set, which is a good proxy for what we're going to see in the test set.
>>: So what does the training RMSE look like? So what you said is that basically you can refactor
the model [inaudible] say ten anchor points into the single matrix of higher dimension. So does
it indeed have -- how does its training -- so if you want to show that it's a local -- it's a form of
the optimization, you want to compare the training error, not the test error.
>> Guy Lebanon: It's not a problem of the optimization. It's a problem of the generalization of
the learned model on the test set.
>>: So in terms of -- in terms of its train error, actually its train error is going to be worse than
the train error if I would have done a model -- an SVD model with appropriate [inaudible]?
>> Guy Lebanon: No, I think -- no, I think the training error would also probably be better. But
the real comparison is -- I mean, you really care more about the test error, I think.
>>: It's true. But if we try to understand what is -- you know, what is the source, what is the
bottleneck, then it's interesting to look at whether it's just the optimization method. So it's better
off if I have a -- so in a sense it would say if I have a large nonconvex --
>> Guy Lebanon: I see what you're saying. Okay. So you're saying how much of the problem
is convergence to local minima. That's what you're saying. It's difficult for me to say because
the search space is so big and there are a lot of local minima. It's difficult to do an exhaustive
search and say with certainty here is the global minimum.
But that's a good point. Maybe we can look at this further to kind of quantify different -- let
me -- I just have a few more slides and I'm almost out of time, so --
>>: Do you have a --
>> Guy Lebanon: Quickly, though.
>>: [inaudible] why you should not approximate one or where you center [inaudible] on five of
them, so that's then cheaper than Q of them, right?
>> Guy Lebanon: Yeah. So we have that comparison in the follow-up paper, journal paper,
because there we adaptively select the anchor points, the centroids. Yeah.
Let me move ahead because there are a couple more things I want to say and we're almost out
of time. This is not something that we did, but I think it is interesting -- a couple of slides that I
think are interesting to learn about, and then something that we did. But it's all related to this.
So the nuclear norm of a matrix is the sum of the singular values. This is the definition. And
that's another way of defining simplicity of a matrix. Remember I said simplicity is the rank of
the matrix. This is one way to define simplicity. An alternative way is perhaps the nuclear
norm. So we want to favor matrices with low nuclear norm.
One of the motivations is that minimizing the nuclear norm subject to some constraints is a good
surrogate for minimizing the rank of the matrix.
Just in the same way that minimizing the L1 norm is a good surrogate for minimizing the L0
norm. There's papers showing both, both results. So there's some kind of parallelism between
these two problems.
So that kind of leads us to the following matrix completion problem. Given a dataset, you
minimize the nuclear norm over all possible matrices subject to the matrix being not too wrong
on the training data. That's an alternative matrix approximation or matrix completion problem.
And it has gotten a lot of attention recently in the academic community. Industry not so much --
I can say why -- but lots of papers in the academic community. And I think in large part it's
because of this and because it has some compressed sensing properties, et cetera.
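Written out (my reconstruction of the standard formulation he refers to; $\sigma_i(X)$ are the singular values, $A$ the set of observed entries, and $\alpha$ the tolerance mentioned a bit later):

    \|X\|_{*} \;=\; \sum_{i} \sigma_i(X), \qquad
    \min_{X} \ \|X\|_{*}
    \quad \text{s.t.} \quad
    \sum_{(i,j) \in A} \bigl(X_{ij} - M_{ij}\bigr)^2 \;\le\; \alpha .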
So we thought maybe we could do a local version of that as well. And here is the local version.
Minimize the nuclear -- so the estimate of the operator at a particular point would be the
minimizer of the nuclear norm over all possible matrices subject to the reconstruction error being
not too large, where the reconstruction error is the same except we have these weights, kernel
weights, to tell us how much we care about different reconstructions.
So in general this has an advantage in that it's a convex problem, and there's a single global
minimum. It has the advantage that we don't need to specify the rank in advance. However, you
need to specify alpha. Okay? So we don't quite get away with that.
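And the local version just described, with the kernel weights entering the constraint (again my notation):

    \hat{T}(s) \;=\; \arg\min_{X} \ \|X\|_{*}
    \quad \text{s.t.} \quad
    \sum_{(i,j) \in A} K_h\bigl(d(s, (i,j))\bigr)\,\bigl(X_{ij} - M_{ij}\bigr)^2 \;\le\; \alpha .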
>>: So it's weird, is MCD [inaudible] CD also was [inaudible] so you're kind of reweighting the
[inaudible]?
>> Guy Lebanon: So this would be the training set.
>>: Oh, the local.
>> Guy Lebanon: Yeah. So --
>>: [inaudible].
>> Guy Lebanon: This is a local at one specific point. This would be the cheap estimator.
>>: I see, I see.
>> Guy Lebanon: And we have to patch up multiple. With local regression we will patch them
up to get an estimation everywhere, wherever we want, we will patch them up.
One difficulty is that this is hard to compute at large scale because it requires semidefinite
programming optimization. And for the large-scale problems that we have in industry, this is not
very practical.
Okay. So there's this -- do I have another five minutes or -- okay. So there's this really
interesting result by Candès and Tao and some other people, and it goes like this. And I think
it's really interesting to be aware of it. Completion of a rank R matrix is possible with high
probability if the number of observed entries satisfies this inequality. So N is the minimum of
N1 and N2. And for simplicity, let's just think about completing square matrices.
So the number of observed entries needs to be larger than N log N times R, only the log N is
raised to the power of 6, and R is the rank of the original matrix.
So if we fix the rank, we need the number of observed entries to grow at the rate of almost
N log N -- N times log N to the power of 6.
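As an equation (my paraphrase of the rate he states, with $C$ an unspecified constant, $n = \min(n_1, n_2)$, and $m$ the number of observed entries):

    m \;\ge\; C \, n \, r \, \log^{6} n
    \quad \Longrightarrow \quad
    \text{exact completion of a rank-}r\text{ matrix succeeds with high probability.}

The matching lower bound he mentions next is the same expression without the sixth power on the logarithm: below roughly $C\, n\, r \log n$ observed entries, no algorithm can succeed.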
>>: [inaudible].
>> Guy Lebanon: I'm sorry?
>>: What kind of [inaudible] C1?
>> Guy Lebanon: I don't know. I'm not sure. The even more interesting direction is that if this
is not true -- if this is not true without the 6th power -- then it's impossible to reconstruct the
matrix. Okay? So we have almost exactly the rate at which the number of observed entries
needs to grow.
So if you don't have the 6th power, this is the lower bound. You cannot succeed with any
algorithm whatsoever -- similar to the result in compression: you cannot compress better than the
entropy rate with any algorithm whatsoever. Just a sec. So it's the same thing here. So this is
the first interesting result.
The second result is that this method that I showed, this thing right here, actually works at this
rate. There are some error rates, but basically it shows that the method that I showed you on a
slide before achieves almost the lower bound. It's almost like the best optimal compression, the
best optimal reconstruction, except that you need the 6th power here.
So we almost accomplished that optimal rate.
And the way this relates to the talk I just gave is that we've kind of generalized these results to
the local variation. So we get kind of bounds on the reconstruction error that is similar to what
they got in that paper.
I think I'm out of time, so I'll just talk a little bit about next steps. Matrix completion with
nuclear norm minimization, extend framework to ranked loss function, adaptive selection of
anchor points, large-scale massively parallel implementation. We've actually done all of these,
but there's other future steps. Maybe -- I haven't told you how to get the distance function. I can
tell you what we did here. What we did is we did an initial SVD, global SVD, and then we
looked in the latent space what is the distance between the users and the distance between the
items, and that gave us the distance that we use in our experiments.
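A sketch of the distance construction he just outlined: run a global factorization first, then measure distances between users and between items in the latent space. The arc-cosine of cosine similarity and the additive combination across users and items are my assumptions of one reasonable choice, not details stated in the talk.

    import numpy as np

    def latent_distances(U, V):
        """U: users x r and V: items x r latent factors from an initial global fit."""
        def pairwise_angles(F):
            norms = np.linalg.norm(F, axis=1, keepdims=True) + 1e-12
            cos = np.clip((F / norms) @ (F / norms).T, -1.0, 1.0)
            return np.arccos(cos)            # small angle means similar rows
        Du, Dv = pairwise_angles(U), pairwise_angles(V)
        # product-form distance between matrix entries s = (i, j) and t = (k, l)
        def entry_dist(s, t):
            return Du[s[0], t[0]] + Dv[s[1], t[1]]
        return entry_dist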
I'm happy to answer more questions, but if you guys need to go, I think I'm out of time, so I just
want to make sure.
>> Ran Gilad-Bachrach: Let's thank the speaker.
>> Guy Lebanon: Thank you.
[applause].
>>: So the result of the lower bound, isn't it something like you have to also have the matrix
[inaudible] have to be [inaudible]?
>> Guy Lebanon: There is -- yeah, there are some conditions for that. Yeah. That's true. There
are some conditions [inaudible] represent. Yeah.
>>: I'm curious about how important it is to put this all back into one big matrix. So let me
propose a naive algorithm and tell me where this would fail.
Let's imagine we had some way of -- you know, pick 50 or a hundred anchor points [inaudible]
and then let's imagine we have a way of sort of clustering the users to the movies into those 50
different clusters based on those anchor points. And now for each of these groups just do a low
rank matrix approximation. And then when you need to make a prediction, you just use the
membership weight. I mean, it could be even a soft clustering. You use the membership weight
into the clusters. Each of these low-rank matrices, which would be fast and easy to do, just
makes its own prediction, even [inaudible] prediction with those weights, what -- where --
>> Guy Lebanon: So -- yeah, so --
>>: Where would that fail?
>> Guy Lebanon: So this is actually -- I think it will succeed. This is very similar to what we're
doing. There's only a couple of small differences.
>>: Right.
>> Guy Lebanon: One difference is that instead of partitioning into separate problems, each
problem being a matrix SVD, a matrix approximation, we make it a weighted matrix
approximation so that it's increasingly more accurate as you get close to the centroid, rather than
having everything get equal weight. So that's one difference.
The other difference is how exactly you take the linear combination. So the details matter.
Some formula would work better than others. I don't know exactly what's the best way. The
formula I just gave is locally constant regression. Conceivably there are other better ways of
doing it and worse ways of doing it. But it's very similar to what you described.
>>: I have a question about the [inaudible]. So let's look for [inaudible] right, so I have -- I look
at it as if I have the matrix of queries versus URLs and I know, say, I have [inaudible] I know
from click data or from whatever [inaudible] I know [inaudible]. So would it be a useful
approach to try this kind of matrix factorization [inaudible] recommendation system to work this
problem?
>> Guy Lebanon: Yeah. Yeah. Actually, we were thinking about looking at other datasets like
document term matrices or in social network you can think of graph structure. So we actually
thought about [inaudible]. Don't have any experiments to result -- any experiments to report.
But I think it's an interesting possibility.
>>: So how distinct is [inaudible] compared to [inaudible] prediction approach?
>> Guy Lebanon: My estimate or my guess is that it will work also well. Compared to, say,
LSA or, for example, in social graphs you have the problem of predicting links, you want to
recommend links -- hey, maybe you know this guy. My guess is that it will work equally well. It
will show an improvement because the assumption that there is a global low rank is incorrect,
and if you just tried to increase that low rank you're going to start facing overfitting problems.
So assuming local low rank is an effective way of defeating that problem. So I haven't tried to
do it, so that's just a guess. Yeah.
>>: So your work needs kind of two things. It needs this kind of high rank as a sum of low rank
and then solving like local low rank problem. I think [inaudible] you could have a
parameterization which is like a weighted [inaudible] your score is a weighted sum of the
product of some [inaudible] vectors. And then you could perfectly optimize the global MSE
with that.
So is it important to basically solve [inaudible] right?
>> Guy Lebanon: Right.
>>: To solve [inaudible]?
>> Guy Lebanon: Yeah. So I think it's very useful from a computational standpoint to separate
the problem into several smaller problems. We also looked at that. So we looked at solving the
problem while optimizing over all -- learning all local models at the same time, not separately.
But computationally that could be difficult.
>>: This is difficult. But is it a better -- is it a good [inaudible] computational aspect and there's
a [inaudible] aspect. Is this constrained if I have to solve the local problem [inaudible]?
>> Guy Lebanon: I think it works for us. We haven't noticed a problem with --
>>: Okay.
>> Guy Lebanon: Yeah, we didn't find the other approach to work better.
>>: I see.
>> Guy Lebanon: Yeah.
>>: How sensitive is it to the distance, and how easy is it to get gains by improving the
distance?
>> Guy Lebanon: It's actually very sensitive to the distance -- if the distance is bad, all of this is
worthless. It's not sensitive in the sense that if you perturb the distance a little bit it's going to
perform poorly. But if you pick a bad distance, then there's no point.
The first attempt we had was using a distance based just on the relationship between the
observed row indices and the observed column indices, without doing a first SVD and looking at
the latent space. That did not work as well because of a problem of sparsity. Two users
oftentimes have completely disjoint sets of movies that they've watched; you don't know how to
compute the distance between them if you look at the original space.
So that did not work well. The distance that we used then, which was the most obvious one --
just look at the, like, [inaudible] similarity of the observation vectors -- didn't work well. When
we did it in the latent space, that worked much better.
So I think it's very important, but it's not sensitive in the rigorous mathematical sense.
>>: [inaudible] does that further improve the results?
>> Guy Lebanon: We haven't tested that. But we hypothesize that it would be very helpful,
especially for items that don't have a lot of viewing history or for users that don't have a lot of
viewing history. You get a very poor notion for distance. If you have what we call the
[inaudible] users and [inaudible] items. So for these I think metadata would be very crucial.
Yeah.
>>: I'm curious if you or anyone else -- MSE is such a weird metric for this, you know, as you
were saying earlier, it's kind of like, you know, getting the average, like there's a lot of 3s, who
cares, you know. For instance, you could imagine saying, like, when we look at all the 5s and look
at all the 1s, looking at the accuracy of getting the love-its or hate-its is more meaningful; I should
never show the hate-its, I should always show the love-its [inaudible].
>> Guy Lebanon: Yeah.
>>: Are there any -- have people tried to do metrics like that?
>> Guy Lebanon: I think the main alternative, which I do think is an improvement -- although I
don't think it's necessarily good -- like I said, it's still a poor substitute for what you want to do in
the real world. But the alternative is to use ranked measures, because the thinking is that what you
care about is you show to the user five movies or a carousel of movies and you care about being
accurate in the top 5 or top 10. You don't care about the rest of the 50,000 movies. So that's a
reasonable improvement. Actually I think quite reasonable.
>>: Then once it gets down to the weeds, like once you're in the 2, 3, 4s, it might not really
matter, right, what the difference [inaudible].
>> Guy Lebanon: That's right.
>>: Like 3 and 2.
>> Guy Lebanon: That's right.
>>: [inaudible].
>> Guy Lebanon: So one of our extensions, in fact the paper that won the award at WWW,
exactly generalized this: the ranked loss approach, where we both trained and tested -- evaluated --
based on the ranked loss method. Yeah.
>>: [inaudible] do you have features for the users and the [inaudible] because the features
themselves would give you [inaudible] and then it will have [inaudible] then you would get like
[inaudible]. So something similar to [inaudible].
>> Guy Lebanon: So I think one has to be a little bit careful. And also I would say humble.
What you said I think would work well or would work better for movies and items that we don't
have a lot of viewing history for. But if we have a user that has watched a lot or rated a lot and
a movie that has been around, I think these methods would perform better than methods with
features -- no, than feature-based methods.
So it really depends. I think whether feature-based is better than collaborative filtering, I would
say, depends on which users and on which items. In some cases you're right, this would
work better. In other cases the collaborative filtering approach is the one to use, actually. And
we've had confirmation of that also in experiments. And also of course depends on the quality of
features. Like what do you know about your users. Like maybe you know their names and
address, that's okay, but do you know -- do you know their browsing behavior or which ads they
click on. I mean, the more you know, the better.
>>: And usually that's definitely true because, like, if you use feature-based things, there's all
kinds of really crappy music with the right descriptors or something -- it's not good music. You
know, whereas the metadata -- I'm sorry, the comp data actually tells you something really
meaningful about what people are watching.
>> Guy Lebanon: So I really think you have to look at both approaches. Okay? Thank you very
much.
[applause]