>> Ran Gilad-Bachrach: So it's a great pleasure to have Guy Lebanon with us. He came all the
way from across the lake to join us today.
Guy is a senior manager at Amazon where he leads the Machine Learning Science Group. Prior
to that he was a tenured professor at Georgia Institute of Technology and a scientist at Google
and Yahoo!. And even in his history there is an internship at Microsoft.
His main research areas are machine learning and data science. He received his Ph.D. from CMU
and his BA and MS from the Technion, the Israel Institute of Technology.
He has authored more than 60 refereed papers. He's an action editor of JMLR, was the program
chair of CIKM, and will be the conference chair of AISTATS in 2015, so be nice to him if you
want your paper accepted.
He received the NSF CAREER Award, the ICML best paper runner-up award in 2013, and the
WWW best paper in 2014. He had the Yahoo! Faculty Research and Engagement Award and is
a Siebel Scholar. So he's going to talk about local low rank matrix approximation.
>> Guy Lebanon: Thank you very much. I wanted to say a few words first. This is work I did
while I was at Google. And the collaborator is Yoram Singer, and Seungyeon Kim and Joonseok
Lee are two of my Ph.D. students at Georgia Tech. And they're still my Ph.D. students, although
I'll probably be handing them off soon and become a coadvisor instead of a regular advisor.
I've been one year so far at Amazon. But we're doing follow-up work on this. And, in fact, the
follow-up work, like Ranny mentioned, won an award at WWW.
And I'm happy to take questions during the presentation, so please feel free to ask questions.
And, finally, thank you for hosting me. I think I know almost everyone here. I'm happy to meet
old friends.
Okay. So the matrix completion problem -- one way to describe it is as
follows. We have a matrix. Let's think of it this way. And then we have some observations.
Some entries in the matrix marked by red are actually available to us. The numbers there are
known. The black dots are unknown. And the task is to predict what's the value of the black
dots from the red dots. That's matrix completion.
So formally we have these observations, a few entries of the matrix, and we want to predict the
unobserved entries of the matrix. And this is very hard to do if we don't say anything else,
because of course there are a lot of possible completions that are consistent with the training
data, and how should we favor one over the other?
In machine learning we have this principle of trying to favor a simple model rather than a
complicated model that agrees with the data to try to prevent overfitting.
But what is a simple matrix? If it's a vector, we would probably put regularization on it, favor
small entries or sparse entries, favor sparsity. But for a matrix, what does it mean? Probably the
most popular approach to define what is a simple matrix, or to favor simpler matrices, is to define
that as the rank of the matrix. So we want to favor low rank matrices.
And a low rank matrix, those of you that don't remember linear algebra, is a matrix that can be
written as a product of two other matrices. One of them is tall and narrow; the other one is short
and broad. And the dimensionality here and here is the rank of the matrix. And the idea is that
that rank is much smaller than the minimum of this and that.
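As an illustration of the factorized form just described, here is a minimal numpy sketch with made-up dimensions (not code from the talk): a rank-r matrix written as a tall-and-narrow factor times a short-and-broad factor.

    import numpy as np

    n1, n2, r = 1000, 500, 10    # hypothetical sizes, with r much smaller than min(n1, n2)
    U = np.random.randn(n1, r)   # tall and narrow
    V = np.random.randn(n2, r)   # V.T is short and broad
    X = U @ V.T                  # an n1 x n2 matrix whose rank is at most r
    assert np.linalg.matrix_rank(X) <= r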
Any questions so far? Okay.
Okay. So how does it -- how do we use that or what's an application of this matrix completion
problem? This is probably the most popular application these days, although there are other
applications as well: recommendation systems. Specifically, these entries are ratings. And here
is how we interpret it. The row denotes the user and the column denotes an item. And the entry
denotes the number of stars given to that item by that user.
>>: Is this really the best model for a recommendation system? Because a lot of
recommendation systems that have values [inaudible] random people, it's not truly Gaussianly
distributed [inaudible].
>> Guy Lebanon: I haven't gotten to the loss function or the specific [inaudible] yet, but you're
right that in many cases this is assumed.
And I can tell you -- so this paper is an academic paper and I've worked on it while I was at
Google. But now with Amazon I'm doing recommendation system in practice. And things are
very different.
So RMSE performance does not correlate well to increases in revenue. It's not the same thing.
And if you look at publications and there's ten papers and one of them is the winner in RMSE,
it's not necessarily the approach you want to take in industry in order to get better user
engagement or better sales revenue increase or some other metric you're interested in.
The whole concept, in fact, I would go further -- beyond what you said, the whole concept of
offline evaluation is not working very well. We need to innovate how we evaluate, not just a
particular loss function but the whole process of training on a training set and testing on a
testing set. I don't think that's working well for real-world products. It doesn't
get us beyond a certain point. That's a whole big topic.
Okay. So anything else? Yeah. So the training data is some observations that we have. Think
of Netflix example in your mind if you want. And the unobserved are the ratings of the users for
unseen items. And if we can complete a matrix, we can recommend items to users based upon
these values.
Now, again, I should say: if you want to build a recommendation system in industry, this may
not be the best strategy. But that's the kind of data that we have in academia
or public datasets. I cannot publish results on Amazon datasets. So this is results on -- I'm going
to show results on Netflix datasets, other public datasets, and that's the kind of things you can do
with it. Yeah.
>>: [inaudible] abstraction [inaudible] like all this like datasets are kind of static, like kind of
take [inaudible] the most present problem, the recommendations [inaudible]. So is it -- do you
think it is worth like abstracting [inaudible]?
>> Guy Lebanon: Absolutely. So I don't want to talk about my work at Amazon, but I can just
say very briefly a lot of work needs to be devoted to what is -- what do you usually watch in the
evening or in the morning or in the weekends versus the week, is it the kids watching or the wife
watching or the husband.
And is your taste changing? What about new movies that just came out? Everyone wants to
watch them. There's a lot of subtleties. So I absolutely agree with you. This is kind of a
well-contained talk within a standard kind of static offline dataset.
And I think still it has a lot of interesting ideas. I don't think you'll be disappointed. But I agree
with you that this is not necessarily the right setup for industry. Yeah. Good point.
Okay. So here is the standard or the most popular recommendation system. If you look at
academic papers and if you think about industry, probably this is one of the main baselines, and
variations of this are still being used.
Okay. So we have -- the thing to pay attention to here is we have the observations, which we
denote by M. That's the matrix of observations. And we take the squared error between the
observed entries and the values of the reconstructed matrix. Then we sum over all observed
entries. A denotes the training set, the set of observed entries. And then we find the matrix that
minimizes the reconstruction error on the training set subject to the rank of the matrix being a
specific value.
This is hard to optimize because this constraint is not very easy to work with. So what we do
instead, we cast it in the following way, which is an equivalent optimization problem. We
optimize instead of over X, we write X as U times V transposed in that kind of factorized form
that I showed you earlier. And then we optimize over U and V. And there are no constraints,
just an unconstrained optimization problem. If you want you can add regularization.
But this is much more convenient to work with.
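A rough sketch of solving this factorized problem by stochastic gradient descent, with an optional L2 regularizer of the kind he mentions (my own illustration, not the authors' code; the function name and hyperparameters are made up):

    import numpy as np

    def incomplete_svd(rows, cols, vals, n1, n2, rank, lr=0.01, reg=0.02, epochs=50):
        """Minimize sum over observed (i, j) of (M_ij - u_i . v_j)^2 (+ L2 penalty) by SGD."""
        rng = np.random.default_rng(0)
        U = 0.1 * rng.standard_normal((n1, rank))
        V = 0.1 * rng.standard_normal((n2, rank))
        for _ in range(epochs):
            for i, j, m in zip(rows, cols, vals):
                err = m - U[i] @ V[j]        # residual on one observed entry
                ui = U[i].copy()             # keep the old u_i for the v_j update
                U[i] += lr * (err * V[j] - reg * U[i])
                V[j] += lr * (err * ui - reg * V[j])
        return U, V                          # reconstructed matrix is U @ V.T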
This is a very popular matrix completion method. It works reasonably well if you measure your
performance by reconstruction error on test set.
One issue is you need to know the rank R in advance. And another issue, it's not convex. And I
want to say one more word about this because if you remember linear algebra, SVD, there is a
single SVD decomposition for any matrix. So why can't we find the right or the single SVD
decomposition? The reason is that we sum here only over a subset. We don't sum over all
entries in the matrix. If this sum was over the entire matrix, there would be a single SVD
decomposition that minimizes the reconstruction error. And you can get that decomposition by
looking at singular values and singular vectors of the matrix. But we don't sum over the entire
matrix, and that introduces lots of local minima.
Questions? Okay. Here is what happens if you run this kind of procedure. The Y axis is the
mean squared error or root mean squared error on a test set. And the X axis is the rank R of the
decomposed matrix. And you get better test set performance as you increase the rank, but at
some point you start hitting diminishing returns. You cannot increase the performance much
further.
>>: What are the units [inaudible] stars?
>> Guy Lebanon: Yeah. That's right. That's right.
>>: So there's no [inaudible] I can increase the rank?
>> Guy Lebanon: At some point there would be overfitting. And if you add regularization,
you can probably do better. Yeah.
>>: It goes to .95. What's 1? Is 1 total ignorance, or --
>> Guy Lebanon: No, so 1 would be an average error of one star away from the actual
prediction on a 1 to 5 star scale. 1 to 5 stars, yeah.
>>: Okay.
>> Guy Lebanon: So it means an average error of one, but in the sense of root mean squared
error.
More questions? Okay.
So here is the observation. As the rank increases, the accuracy quickly hits the law of
diminishing returns. And let's consider the following two hypotheses, okay? Hypothesis one: the
matrix that we are trying to complete has a low rank, and the law of diminishing returns reflects
the best possible prediction. Okay? The data just has some noise in it; we cannot improve the
prediction beyond that. We've correctly estimated the rank and we're using the right rank.
The second hypothesis is that the matrix that we're trying to complete has a high rank, so the
low rank assumption is incorrect. And in this case, why doesn't the accuracy keep improving as
we increase the rank? The diminishing returns are due to overfitting, and the reason for the
overfitting is that as you increase the rank, you're going to have more parameters. The matrices
that you use in the factorized form have more and more parameters. And you would overfit
because the data doesn't grow while the number of parameters can grow to infinity as you
increase the rank.
The second reason could be convergence to a poor local minimum. And -- yeah.
>>: How do you formalize the fact that it's neither low or -- is it like a binary distinction while in
reality there's going to be a sliding scale of as you increase the rank you're able to fit better but at
the same time there's like lower fitting stability? Is there a way to say like there's a mean in
some metric which indicates whether it's a --
>> Guy Lebanon: No, so this assumption is on the original matrix --
>>: Right.
>> Guy Lebanon: -- that we don't observe fully. And this is either a low rank or a high rank.
>>: So you're saying that the [inaudible] assume that there is a fixed rank that --
>> Guy Lebanon: We assume there is a truth. We assume there is some matrix that we're trying
to complete. We don't know what it is precisely, but it exists. And it either has a low rank or a
high rank.
>>: But even [inaudible] probabilistic assumption that [inaudible] some distribution over
possible ranks --
>> Guy Lebanon: No. No. It's either --
>>: Or an eigen [inaudible].
>>: Or [inaudible] approximately.
>>: Yeah, approximate rank.
>> Guy Lebanon: Yeah. Okay. So in the -- so what we're trying to -- what I'm going to try and
argue with this presentation is that in the context of recommendation systems and standard
datasets like Netflix and other datasets, the second hypothesis is the correct one rather than the
first. But also the first hypothesis is incorrect globally but is correct locally. And I'll explain
what that means.
The third thing is combining SVD with nonparametric smoothing significantly postpones the
diminishing returns, and it also reduces the computation. So we have a win-win situation here.
And, in fact, the computation is a big problem because datasets are big in industry, and in many
cases we have a real computational challenge trying to optimize that function that I showed
before.
Okay. So I'm going to describe the model first informally and then there will be a formal
definition later. Okay. So this is the informal description. A matrix -- so here is the assumption,
okay? We're assuming two things. We're assuming that the matrix completion that is accurate
across all users and items should be of high rank. The second assumption is that a matrix
completion that is accurate for one specific type of users and one specific type of items -- for
example, if we only look at the population of teenagers and the population of old movies, that
kind of submatrix, teenagers and old movies -- will have a low rank. And the rank is probably
very low: teenagers don't like old movies, the rank is 0. Right?
So it becomes a very simple problem once you narrow it down to a very coherent population of
users and a very coherent population of movies. The problem comes because you have a lot of
these coherent subpopulations, and if you want a matrix that is correct globally, then the rank
will be high.
>>: [inaudible].
>> Guy Lebanon: Sorry?
>>: Yeah, the matrix M is composed of [inaudible].
>> Guy Lebanon: It's not -- so it's more complicated than that because the -- these populations
may be overlapping. And it's not necessarily a block structure. But that is a correct intuition.
Okay. So here is the algorithm. Okay. Identify Q neighborhoods in the set of matrix entries. So
first thing, please parse the sentence here. Okay? We have a set of all possible matrix entries.
Okay? This is a set, but it's -- we can also think of it as a space with a matrix structure that is
discrete. It's a discrete space, not continuous, but we can still think of it as a metric space. And
we can still define the notion of neighborhoods if we have an appropriate distance function.
How do we get the distance function? I can talk about it a little bit later if you're interested. For
now, let's just assume we have the distance function that measures similarity of items and
similarity of users. Let's assume this is given. If we have that measure, we can identify the
neighborhoods.
>>: [inaudible] metadata.
>> Guy Lebanon: Not necessarily, actually. We -- in the experiments that I'll show, we get the
neighborhoods from the training set that we use to construct the matrix factorization later on. So
we don't use additional data. We use the same data to construct the neighborhoods as well.
>>: Are you using the neighborhoods because that particular [inaudible] will result in lowering
approximation for that piece, or is that [inaudible] define the subsets here?
>> Guy Lebanon: This is a pretty big issue how to find the neighborhoods. In the work that
we've done here, this would -- we're using a pretty elementary technique. And I can discuss how
we do that.
>>: Is it a kind of [inaudible] of the problem, or is it something like [inaudible] that are given to
me by an [inaudible] way --
>> Guy Lebanon: Right. So okay. Okay. I'm happy to describe. So what we did to define the
neighborhoods -- I prefer not to get into that right now, but I'm happy to answer since both of
you are interested. What we did is we sampled entries from the training set. Okay? And then
we defined neighborhoods that are centered at these samples.
So say we sampled ten -- for example, we want to have ten neighborhoods. We sample ten
entries from the training set's set of entries, and then we form ten balls, in the distance metric
that we defined, centered around these ten samples.
We have a follow-up paper where we do this in a more adaptive manner -- yeah.
>>: [inaudible] the neighborhoods are not necessarily [inaudible].
>> Guy Lebanon: Exactly. Yes. Absolutely. Absolutely. And, in fact, they don't even have to
have -- it's not like they have to be block -- even if you permute the rows and the columns, you
can still not get blocks. But absolutely. Yeah.
>>: [inaudible] the neighborhood don't have the [inaudible] structure.
>> Guy Lebanon: Not necessarily. In the general -- I mean, we are using a product distance in
the experiments, but not necessarily.
Okay. So we identified these Q neighborhoods. And the way we do that is we just sample Q
points from the training set's observed entries and we create balls around these points.
And then for each of the neighborhoods we construct a separate low rank matrix that is
especially accurate within it. We don't care so much about being accurate outside of it. And
then we patch up these local models to get a matrix approximation that is global and that is high
rank. But the local patches are low rank.
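A hypothetical end-to-end sketch of those three steps -- sample anchor entries, fit a weighted local low rank model per anchor, patch the local models together -- where dist, kernel, and weighted_fit are assumed placeholders rather than the authors' actual code:

    import numpy as np

    def llorma_sketch(train, q, rank, dist, kernel, weighted_fit):
        """train is a list of (i, j, rating) triples; dist is a distance on matrix
        entries, kernel a smoothing kernel, and weighted_fit solves a weighted
        rank-`rank` completion (for example, weighted SGD) returning factors U, V."""
        rng = np.random.default_rng(0)
        anchors = [train[k][:2] for k in rng.choice(len(train), size=q, replace=False)]
        local_models = []
        for s in anchors:                            # steps 1 and 2: one local model per anchor
            w = [kernel(dist(s, (i, j))) for (i, j, _) in train]
            local_models.append((s, weighted_fit(train, w, rank)))

        def predict(i, j):                           # step 3: patch the local models together
            ks = np.array([kernel(dist(s, (i, j))) for s, _ in local_models])
            preds = np.array([U[i] @ V[j] for _, (U, V) in local_models])
            return preds @ ks / max(ks.sum(), 1e-12)

        return predict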
>>: But you're doing no refinement of the neighborhood to improve the fit of the low rank at
that point? Like [inaudible]?
>> Guy Lebanon: That's right. That's right. Like I said, we have a follow -- we have a journal
paper where we generalize this. But right now this is -- we just find the neighborhoods in a very
naive way, we compute a local model, and then we patch up the local models.
Okay. So here is a picture of the neighborhoods. But like Chris said, and I always say this
caveat, this neighborhood isn't normally adjacent rows and adjacent columns. Okay? Because,
for example, this movie may be similar to this movie and to this movie and this user may be
similar to that user. But I cannot visualize it in a different way. So this is just for the sake of
visualization.
Assuming that the distance function just is the distance between the indices, okay, so these two
are necessarily similar items, these two a little bit further, a little bit further, same thing with the
rows, you could get something like this kind of neighborhood structure.
That's one thing I wanted to mention.
The second thing I want to mention is that this is not exactly what we do. We do something a little bit
different. Because here we assume the neighborhoods are precisely defined and they have a
precise boundary and everything within the neighborhood counts in the same way. We use
something called kernel smoothing where there's a weighting function that gives the weight
corresponding to the distance from the center of the neighborhood.
And that means we can have neighborhoods that even cover the entire matrix, or cover a subset
of the matrix. But as you get closer and closer to the center, the weight becomes higher.
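Two standard smoothing kernels that could serve as this weighting function; these are illustrative choices on my part, since the talk only says that the weight decays with distance from the neighborhood center:

    import numpy as np

    def epanechnikov(d, h):
        """Weight decays with distance d and vanishes outside bandwidth h (finite support)."""
        u = d / h
        return 0.75 * (1.0 - u * u) * (np.abs(u) <= 1.0)

    def gaussian(d, h):
        """Smooth weight that never quite reaches zero."""
        return np.exp(-0.5 * (d / h) ** 2)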
>>: So just to make sure that I understand. So at the end of the day what you're getting is
basically a model that can be represented by a set of [inaudible] and a set of matrix completions
which were trained to be low rank.
>> Guy Lebanon: That's right.
>>: So when you want to do a prediction, how would the prediction work?
>> Guy Lebanon: So I haven't gotten to this. I said patch up; that's the third step. I
haven't gotten to it yet. I will show you in the next slide or two how we do that. Yeah.
>>: I think so if -- let's set aside the low rank stuff. If you have a distance metric [inaudible]
even defining Gaussian process if there were constants and just made predictions, so you're
saying is that you could always -- if you have a distance metric or a kernel on your sort of item
or your pairs, you could always just do Gaussian process reduction, right? So is what you're
doing -- and that's kind of like if you imagine a Gaussian -- I mean, this is weird, it's [inaudible]
Gaussian with different sized bumps that are [inaudible].
>> Guy Lebanon: Yeah. Yeah, yeah.
>>: Now, are you doing -- there used to be an old technique where you took the Gaussians and
then had like a little local linear model that's sort of turned on by each Gaussian. Is that what
you're doing? [inaudible] but it also goes back to [inaudible].
>> Guy Lebanon: I think it's -- it would be similar because that would correspond to a low rank
structure, right, like a local low rank. So I think it's similar. Not exactly. But there is some
relationship.
>>: Okay. So I don't know if you've also tested versus just the plain old Gaussian process where
you take your neighborhoods and just a constant -- everything is rank 0. I don't know if you
tested that.
>> Guy Lebanon: That performs very badly.
>>: Oh, okay.
>> Guy Lebanon: Yeah. The local low rank is crucial because it kind of discovers this local
semantic structure in it. I'll show you later graphs where you can see the performance of the
local models as a function of the rank of the local models. And you can see that we really
postpone the law of diminishing returns, and there is [inaudible] significant improvement as you
increase the rank of the local model.
Okay. So here is the -- so this is the first step, identify the neighborhoods. Okay. Second step is
for each neighborhood construct a local low rank matrix completion. Suppose we look at this
neighborhood, and remember we have this weighting function. So we want to get this low rank
matrix that approximates the entire big matrix but is especially accurate here, a little bit less
accurate here, a little bit less accurate here, and here we don't care.
And that would be the local model that correctly describes the situation here. If we want to think
of it intuitively, we can think of teenagers and old movies. But like you said, we can do that
even without metadata, just by figuring out which users have similar viewing patterns and which
items have similar viewing patterns. Yeah.
>>: [inaudible] are not.
>> Guy Lebanon: Depends -- depends. Yeah. I mean, in the experiments we did, we did use a
product structure. This is -- like I said, the neighborhood will not be kind of especially similar to
each other. But in our experiments the neighborhoods do have a product structure. But you can
think of whatever distance you want more generally. If you have metadata, you can use the
metadata. And conceivably one would want to use nonproduct distances.
Okay. And then once you -- the third step, once you have these local models, you have a local
model for here, for here, and for here, you would patch them up.
Okay. So we are ready for the formal description. I think. So here are the -- what we have. So
we have a distance function on the set of matrix entries, and this is in our case computed based
on observed matrix entries. But it could also be computed based on additional data like if we
know the directors of the movies or the actors or the genres or whatever, or we know something
about users, we can have that help us.
Okay. And here is what we assume to be the truth that we're trying to discover.
We assume that there's an operator that maps the set of matrix entries to matrices. So for each
matrix entry, we get a specific matrix. Okay? This is the set of all matrices and these are the
matrix entries. So for each matrix entry, for example, the entry row 2 and column 3 we would
get a brand-new matrix corresponding to that.
And that matrix has two properties. One property it has rank R, low rank, a second property is
that that matrix evaluated at the specific matrix entry that we picked here would give us the same
as the observed value.
The second property of the operator is that it's somewhat continuous in just a sense -- just a
second -- it's continuous in the sense that it's slowly changing. Because we don't want the
operator to jump wildly. The whole point in defining the distance function is that the distance
function means something. What it means is that operator, if it maps two points that are close to
each other, it will give us similar matrices.
Now, this is not a continuous space, so slowly varying, or smoothness, we define in this way,
which is called Hölder continuity. The distance between -- this is one matrix evaluated -- this is
the operator evaluated at one point, the operator evaluated at another point, these are two
matrices. The Frobenius norm between them is less than the distance between the two matrix
entries.
And we have to have that inequality, meaning it cannot change too much.
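Putting the assumptions just stated into symbols (my reconstruction in LaTeX; the constant $C$ and exponent $\alpha$ of the Hölder condition are generic, not values given in the talk):

    T : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}, \qquad
    \operatorname{rank}\bigl(T(s)\bigr) \le r, \qquad
    \bigl[T(s)\bigr]_{s} = M_{s} \ \text{for observed entries } s,

    \bigl\| T(s) - T(s') \bigr\|_F \;\le\; C \, d(s, s')^{\alpha}
    \quad \text{for all matrix entries } s, s'.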
Wait with your question because I have a diagram that may make it more clear, and then I'll take
questions.
Here is a picture of what we have. This is the matrix that we're working with. Okay? And each
point in it is mapped to another matrix by this operator. And these matrices all have low rank.
And as we move from one point here to a neighboring point, neighboring defined by the distance
that we defined, if we move a little bit here, these matrices should move also slowly. Should not
change too much.
And the value of this matrix here is the same as the value here. Okay? And the value of this one
here is the same as the value here.
So these matrices have to agree with the values in this matrix. That's one. The second thing is
they have to have a low rank, these ones. And the third thing is they have to change slowly. Yeah.
>>: [inaudible] matrices defined for neighborhoods to essentially [inaudible].
>> Guy Lebanon: For every single -- every single matrix entry we have a brand-new matrix.
Yes. And this is not something we have to find, but this is -- this is the model, what we assume is
happening. We have these matrices. Now, these are low rank. Okay. And they describe what's
going on at this point and they change slowly.
>>: With this slow changing property, you're only requiring within [inaudible] right, not across
[inaudible]?
>> Guy Lebanon: No, it's defined -- it's defined this way. Yeah.
>>: [inaudible] kind of global --
>> Guy Lebanon: Yeah, these [inaudible] measured here. So, yeah, this is the distance measure
here. We define. It's up to us to define. If we do a bad job with defining D, then obviously this
assumption will not be correct.
>>: But you have a T --
>>: Constants will be large.
>>: But you look at T across -- so your different patches will have different ranked sizes, right?
They can -- because you have neighborhoods of different sizes. So T for one is not the same size
as T for another. Or is it a global T?
>> Guy Lebanon: So T -- the size of T -- T maps into the same size. All of them are matrices
over the entire set --
>>: Over the entire space.
>> Guy Lebanon: Yeah. So those are the number of users and number of items.
>>: So [inaudible] exponents bigger depending on the [inaudible]?
>> Guy Lebanon: Probably. Yeah.
>>: So to one in R, the rank is fixed across the entire [inaudible]?
>> Guy Lebanon: Yeah. This is also something that can be generalized. But in our experiments
we use a single constant for the rank of all these matrices.
>>: And the second question, is the distance metric [inaudible].
>> Guy Lebanon: That's right. That's right. Yeah.
>>: So have you thought about bridges this? Because right now D is completely separate, right?
So ideally you'd want to sort of [inaudible] and learn a good D that would give you good
[inaudible] as opposed to trying to fit some D that may or may not be right [inaudible].
>> Guy Lebanon: Yeah. Yeah. Yeah. It is a good point. I mean, we haven't done any work on
that. Our first extension was to -- I haven't even showed you what we're doing formally. So I
want to talk about it very quickly. The first extension is to work with ranked loss functions,
which perform better in real-world cases than squared loss. The second extension was to work
with -- to try to identify the neighborhoods in an adaptive manner.
And what you're talking about is another level, which is adaptively modifying D. That's another
reasonable approach.
>>: [inaudible].
>> Guy Lebanon: Right. So right. So right now this work we assume -- if you want, I'll tell you
how we get D, but we get it from the training observations based on similarities between how
users -- what users view and what items -- how items are being viewed.
Okay. So I -- this is another -- I'm just going to skip that I think because I don't have that much
time. So here is -- we have two estimators. Now I need to show you what exactly we do. I
haven't yet showed you formally. There are two estimators. One is called the expensive
estimator. And this estimates the operator T. Okay? So we're going to estimate the -- we're
going to estimate each one of these matrices. Okay? And each one of these, if we estimate the
matrix, this one, we can evaluate it here. And this can serve as an estimate for this. Okay?
So we're going to do that for every single point. And that's why we're going to call it the
expensive estimator. And then we're going to have a cheap estimator that I'll show you how we
do. Yeah.
>>: Before you jump in, how many neighbors are you talking about? About five? About 5
million? What sort of numbers are in your head?
>> Guy Lebanon: Some -- actually, I show -- I think I show you a graph -- I think I may have a
graph as a function of the number of models. But I would say in the tens or in the hundreds, I
would say. Depends on the size of your matrix also.
So this is the expensive estimator. So here is the -- what we formally do, how we get our model.
We look at the training set and we learn the low rank matrix that minimizes the squared
reconstruction error, but we give weights. The weights correspond to the distance of the point
from the center of the neighborhood. Okay?
Because I want my local model to be particularly accurate in the area of the neighborhood that
I'm choosing, I want it more and more accurate as I get closer to the center, subject to a low
rank. I showed you earlier how we do that without the weight. Sticking in the weight is
relatively simple; it's the same difficulty level.
And then I just evaluate this estimate that I've computed at the point I want to predict, and I get
my matrix completion or approximation.
K is a kernel function defined in one of these ways or a different way, just something that
measures -- is higher and higher as you get closer to the center.
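In symbols, the weighted local fit and the pointwise prediction he describes look roughly like this (my notation; $K_h$ is a smoothing kernel with bandwidth $h$, $A$ the set of observed entries):

    \hat{T}(s) \;=\; \arg\min_{X \,:\, \operatorname{rank}(X) \le r}
      \sum_{(i,j) \in A} K_h\bigl(d(s, (i,j))\bigr)\,\bigl(X_{ij} - M_{ij}\bigr)^2,
    \qquad
    \hat{M}_{ij} \;=\; \bigl[\hat{T}\bigl((i,j)\bigr)\bigr]_{ij}.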
What's the difficulty here? The difficulty is that for each entry that we want to predict, we have
to do a separate SVD. We have to run a separate SVD. When we need to estimate a rating for a
movie by a user, we sometimes get that request at runtime.
We don't want to solve an SVD at runtime and have the user wait for the answer. And even if it
was offline, it's still too much to compute.
So this is not practical. And we're going to get to the cheap estimator which is T double hat.
And this is going to be very practical, actually. In fact, it's going to be so practical that it's going
to be faster to compute than the nonlocal version.
So here is the cheap estimator. The first step is to compute the approximation T hat at Q anchor
points, S1 through SQ, offline. These are the neighborhoods I said we identify. And the way we
construct them is we just sample the centroids from the training set and we put a certain ball
around each one.
Okay. So we compute Q local models. Okay? We compute a low rank model that is especially
accurate as you get closer to these centroids. And then you patch up these local models, these Q
local models. And how do you do that? So you have different values you want to patch up, we
do that using a technique called kernel regression. This formula specifically is called
Nadaraya-Watson local regression or kernel regression, locally constant kernel regression.
And the thing we do here, though, is kernel regression in the space of matrices. The standard
way of talking about kernel regression is kernel regression on the real line, but here, instead of
doing kernel regression over scalars or over vectors, we do it over matrices.
So in essence this is just a weighted average of the local models. And the weights reflect the
similarity of the local model to the point that you want to predict. And then you
can form your completion this way.
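The patching step in formula form, roughly the Nadaraya-Watson estimator over matrices that he names (again my reconstruction): given anchor points $s_1, \dots, s_Q$ with local models $\hat T(s_1), \dots, \hat T(s_Q)$,

    \hat{\hat{T}}(s) \;=\;
      \frac{\sum_{t=1}^{Q} K_h\bigl(d(s, s_t)\bigr)\, \hat{T}(s_t)}
           {\sum_{t=1}^{Q} K_h\bigl(d(s, s_t)\bigr)},
    \qquad
    \hat{M}_{ij} \;=\; \bigl[\hat{\hat{T}}\bigl((i,j)\bigr)\bigr]_{ij}.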
So what do we have here? We have step one, offline computation. You can do that offline
before you need to serve anything to users: solving Q low rank SVD problems in parallel.
Step two is this patching up. It can be done very fast, a linear combination of local models. I'm
going to show you some graphs soon that show improved prediction accuracy as well as reduced
computational cost. Now, why would you have reduced computational cost? After all, you
replace one SVD with Q SVDs. So how do we save computation? Any ideas?
>>: [inaudible].
>> Guy Lebanon: That's right. So there are two ways that you gain. One is that the kernels may
have a finite support, and that means you can solve a smaller matrix approximation or a smaller
low rank SVD. And it could be much smaller. That's one gain.
A second gain is that the rank of the local models can be much smaller than the rank of the
global model that you do alternatively. And solving, say, 100 SVDs of rank 50 is much easier
than solving one SVD of rank 5,000.
>>: [inaudible] support they actually go to zero, though, then you need to check to make sure
there aren't places in the matrix [inaudible].
>> Guy Lebanon: That's right. Yep. That's right. That's right. So you have to be careful there.
But there's these two gains. One is the support and the other one is you can get really good
results by having a really low rank, much lower rank than the original low rank for the entire
matrix. And it's easier to solve many problems with low rank than one problem with bigger
rank. Yeah.
>>: [inaudible] SVD, correct?
>> Guy Lebanon: So I mean this problem right here. I call this incomplete SVD because we
sum only over a training set. We don't sum over all -- we don't sum over all the entries, which
would be the algebraic SVD. All right. So this requires gradient descent. You cannot do it with
singular values and singular vectors.
Okay. I'll show you some results. So the X axis is the number of anchor points which is the
number of local models or centroids or patches. The Y axis is the RMSE on test set.
These dashed lines show the performance of SVD, which is the standard baseline, still very
popular. And as you increase the rank from 5 to 7 to 10 to 15, you see the law of diminishing
returns. Right? Very quickly you get to a point that you cannot improve beyond. And the reason
you cannot improve beyond it is what we said earlier: that issue of overfitting and convergence to
local minima.
The local models: the performance increases as the number of anchor points increases. And
they also get better as you increase the rank of the local model. This is
rank 5. Every local model is of rank 5 here. Okay? But we have 45 models of rank 5 each.
Okay? So the total rank would be 45 times 5. Okay. The rank of the big matrix, that will be
approximately 45 times 5, here it will be 45 times 15.
But you see that if you would keep increasing it, you're not going to -- you're basically going to
be stuck here if you just do a rank 500 SVD. It's not -- it's not going to get much better. Sorry?
>>: What's DFC?
>> Guy Lebanon: DFC is something called Divide-and-Conquer Matrix Factorization, a paper
by Michael Jordan, Lester Mackey, which does a similar trick of dividing the original matrix into
patches and learning local models and then patching together.
So it's a relatively recent paper that takes a similar approach. And this is the Netflix winner.
Although we used -- it's a long story -- we used a slightly different dataset than the Netflix winner.
So this line should be -- it's the same size, but this line should be taken with a grain of salt. It's
not a formal comparison.
>>: Do you know where the limit is? I mean, because you -- you know, you're showing up to
15, right? Then what? [inaudible].
>> Guy Lebanon: Yeah. So -- right. So the -- you're [inaudible] the limit pretty soon after.
>>: All right.
>>: So all the way on the left, so if you have, say, one anchor point, you're still not doing a
global --
>> Guy Lebanon: That's right.
>>: [inaudible].
>> Guy Lebanon: That's right. So this is very -- that's a very good observation actually. Look
how much better we do with even a rank 1 local -- I'm sorry, with one anchor point. Why do we do so
much better than the global SVD? Because we sample the anchor point from the training set and
the learned model is more accurate in the vicinity of the training set than in remote corners of the
matrix that you don't care about.
>>: So you patch it on the rest [inaudible].
>> Guy Lebanon: So we focus -- we particularly focus our attention on what we've seen in the
training set, which is a good proxy for what we're going to see in the test set.
>>: So what does the training RMSE look like? So what you said is that basically you can refactor
the model [inaudible] say ten anchor points into the single matrix of higher dimension. So does
it indeed have -- how does its training -- so if you want to show that it's a local -- it's a form of
the optimization, you want to compare the training error, not the test error.
>> Guy Lebanon: It's not a problem of the optimization. It's a problem of the generalization of
the learned model on the test set.
>>: So in terms of -- in terms of its train error, actually its train error is going to be worse than
the train error if I would have done a model -- an SVD model with appropriate [inaudible]?
>> Guy Lebanon: No, I think -- no, I think the training error would also probably be better. But
the real comparison is -- I mean, you really care more about the test error, I think.
>>: It's true. But if we try to understand what is -- you know, what is the source, what is the
bottleneck, then it's interesting to look at whether it's just the optimization method. So it's better
off if I have a -- so in a sense it would say if I have a large nonconvex --
>> Guy Lebanon: I see what you're saying. Okay. So you're saying how much of the problem
is convergence to local minima. That's what you're saying. It's difficult for me to say because
the search space is so big and there are a lot of local minima. It's difficult to do an exhaustive
search and say with certainty here is the global minimum.
But that's a good point. Maybe we can look at this further to kind of quantify different -- let
me -- I just have a few more slides and I'm almost out of time, so --
>>: Do you have a --
>> Guy Lebanon: Quickly, though.
>>: [inaudible] why you should not approximate one or where you center [inaudible] on five of
them, so that's then cheaper than Q of them, right?
>> Guy Lebanon: Yeah. So we have that comparison in the follow-up paper, journal paper,
because there we adaptively select the anchor points, the centroids. Yeah.
Let me move ahead because there are a couple more things I want to say and we're almost out
of time. This is not something that we did, but I think it is interesting -- a couple of slides that I
think are interesting to learn about, and then something that we did. But it's all related to this.
So the nuclear norm of a matrix is the sum of the singular values. This is the definition. And
that's another way of defining simplicity of a matrix. Remember I said simplicity is the rank of
the matrix. This is one way to define simplicity. An alternative way is perhaps the nuclear
norm. So we want to favor matrices with low nuclear norm.
One of the motivations is that minimizing the nuclear norm subject to some constraints is a good
surrogate for minimizing the rank of the matrix.
Just in the same way that minimizing the L1 norm is a good surrogate for minimizing the L0
norm. There's papers showing both, both results. So there's some kind of parallelism between
these two problems.
So that kind of leads us to the following matrix completion problem. Given a dataset, you
minimize the nuclear norm over all possible matrices subject to the matrix being not too wrong
on the training data. That's an alternative matrix approximation or matrix completion problem.
And it has gotten a lot of attention recently in the academic community. Industry not so much --
I can say why -- but lots of papers in the academic community. And I think in large part it's
because of this and because it has some compressed sensing properties, et cetera.
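Written out (my reconstruction of the standard formulation he refers to; $\sigma_i(X)$ are the singular values, $A$ the set of observed entries, and $\alpha$ the tolerance mentioned a bit later):

    \|X\|_{*} \;=\; \sum_{i} \sigma_i(X), \qquad
    \min_{X} \ \|X\|_{*}
    \quad \text{s.t.} \quad
    \sum_{(i,j) \in A} \bigl(X_{ij} - M_{ij}\bigr)^2 \;\le\; \alpha .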
So we thought maybe we could do a local version of that as well. And here is the local version.
Minimize the nuclear -- so the estimate of the operator at a particular point would be the
minimizer of the nuclear norm over all possible matrices subject to the reconstruction error being
not too large, where the reconstruction error is the same except we have these weights, kernel
weights, to tell us how much we care about different reconstructions.
So in general this has an advantage in that it's a convex problem, and there's a single global
minimum. It has the advantage that we don't need to specify the rank in advance. However, you
need to specify alpha. Okay? So we don't quite get away with that.
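And the local version just described, with the kernel weights entering the constraint (again my notation):

    \hat{T}(s) \;=\; \arg\min_{X} \ \|X\|_{*}
    \quad \text{s.t.} \quad
    \sum_{(i,j) \in A} K_h\bigl(d(s, (i,j))\bigr)\,\bigl(X_{ij} - M_{ij}\bigr)^2 \;\le\; \alpha .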
>>: So it's weird, is MCD [inaudible] CD also was [inaudible] so you're kind of reweighting the
[inaudible]?
>> Guy Lebanon: So this would be the training set.
>>: Oh, the local.
>> Guy Lebanon: Yeah. So --
>>: [inaudible].
>> Guy Lebanon: This is a local at one specific point. This would be the cheap estimator.
>>: I see, I see.
>> Guy Lebanon: And we have to patch up multiple. With local regression we will patch them
up to get an estimation everywhere, wherever we want, we will patch them up.
One difficulty is that this is hard to compute at large scale because it requires semidefinite
programming optimization. And for the large-scale problems that we have in industry, this is not
very practical.
Okay. So there's this -- do I have another five minutes or -- okay. So there's this really
interesting result by Candès and Tao and some other people, and it goes like this. And I think
it's really interesting to be aware of it. Completion of a rank R matrix is possible with high
probability if the number of observed entries satisfies this inequality. So N is the minimum of
N1 and N2. And for simplicity, let's just think about completing square matrices.
So the number of observed entries needs to be larger than N log N times R, only the log N is
raised to the power of 6, and R is the rank of the original matrix.
So if we fix the rank, we need the number of observed entries to grow at the rate of almost
N log N -- N times log N to the power of 6.
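As an equation (my paraphrase of the rate he states, with $C$ an unspecified constant, $n = \min(n_1, n_2)$, and $m$ the number of observed entries):

    m \;\ge\; C \, n \, r \, \log^{6} n
    \quad \Longrightarrow \quad
    \text{exact completion of a rank-}r\text{ matrix succeeds with high probability.}

The matching lower bound he mentions next is the same expression without the sixth power on the logarithm: below roughly $C\, n\, r \log n$ observed entries, no algorithm can succeed.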
>>: [inaudible].
>> Guy Lebanon: I'm sorry?
>>: What kind of [inaudible] C1?
>> Guy Lebanon: I don't know. I'm not sure. The even more interesting direction is that if this
is not true -- if this is not true without the 6th power -- then it's impossible to reconstruct the
matrix. Okay? So we have almost exactly the rate at which the number of observed entries
needs to grow.
So if you don't have the 6th power, this is the lower bound. You cannot succeed with any
algorithm whatsoever -- similar to the result in compression: you cannot compress better than the
entropy rate with any algorithm whatsoever. Just a sec. So it's the same thing here. So this is
the first interesting result.
The second result is that this method that I showed, this thing right here, actually works at this
rate. There are some error rates, but basically it shows that the method that I showed you on a
slide before achieves almost the lower bound. It's almost like the best optimal compression, the
best optimal reconstruction, except that you need the 6th power here.
So we almost accomplished that optimal rate.
And the way this relates to the talk I just gave is that we've kind of generalized these results to
the local variation. So we get kind of bounds on the reconstruction error that is similar to what
they got in that paper.
I think I'm out of time, so I'll just talk a little bit about next steps. Matrix completion with
nuclear norm minimization, extend framework to ranked loss function, adaptive selection of
anchor points, large-scale massively parallel implementation. We've actually done all of these,
but there's other future steps. Maybe -- I haven't told you how to get the distance function. I can
tell you what we did here. What we did is we did an initial SVD, global SVD, and then we
looked in the latent space what is the distance between the users and the distance between the
items, and that gave us the distance that we use in our experiments.
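A sketch of the distance construction he just outlined: run a global factorization first, then measure distances between users and between items in the latent space. The arc-cosine of cosine similarity and the additive combination across users and items are my assumptions of one reasonable choice, not details stated in the talk.

    import numpy as np

    def latent_distances(U, V):
        """U: users x r and V: items x r latent factors from an initial global fit."""
        def pairwise_angles(F):
            norms = np.linalg.norm(F, axis=1, keepdims=True) + 1e-12
            cos = np.clip((F / norms) @ (F / norms).T, -1.0, 1.0)
            return np.arccos(cos)            # small angle means similar rows
        Du, Dv = pairwise_angles(U), pairwise_angles(V)
        # product-form distance between matrix entries s = (i, j) and t = (k, l)
        def entry_dist(s, t):
            return Du[s[0], t[0]] + Dv[s[1], t[1]]
        return entry_dist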
I'm happy to answer more questions, but if you guys need to go, I think I'm out of time, so I just
want to make sure.
>> Ran Gilad-Bachrach: Let's thank the speaker.
>> Guy Lebanon: Thank you.
[applause].
>>: So the result of the lower bound, isn't it something like you have to also have the matrix
[inaudible] have to be [inaudible]?
>> Guy Lebanon: There is -- yeah, there are some conditions for that. Yeah. That's true. There
are some conditions [inaudible] represent. Yeah.
>>: I'm curious about how important it is to put this all back into one big matrix. So let me
propose a naive algorithm and tell me where this would fail.
Let's imagine we had some way of -- you know, pick 50 or a hundred anchor points [inaudible]
and then let's imagine we have a way of sort of clustering the users to the movies into those 50
different clusters based on those anchor points. And now for each of these groups just do a low
rank matrix approximation. And then when you need to make a prediction, you just use the
membership weight. I mean, it could be even a soft clustering. You use the membership weight
into the clusters. Each of these low-rank matrices, which would be fast and easy to do, just
makes its own prediction, even [inaudible] prediction with those weights, what -- where --
>> Guy Lebanon: So -- yeah, so --
>>: Where would that fail?
>> Guy Lebanon: So this is actually -- I think it will succeed. This is very similar to what we're
doing. There's only a couple of small differences.
>>: Right.
>> Guy Lebanon: One difference is that instead of partitioning into separate problems, each
problem being a matrix SVD, a matrix approximation, we make it a weighted matrix
approximation so that it's increasingly more accurate as you get close to the centroid, rather than
having everything get equal weight. So that's one difference.
The other difference is how exactly you take the linear combination. So the details matter.
Some formula would work better than others. I don't know exactly what's the best way. The
formula I just gave is locally constant regression. Conceivably there are other better ways of
doing it and worse ways of doing it. But it's very similar to what you described.
>>: I have a question about the [inaudible]. So let's look for [inaudible] right, so I have -- I look
at it as if I have the matrix of queries versus URLs and I know, say, I have [inaudible] I know
from click data or from whatever [inaudible] I know [inaudible]. So would it be a useful
approach to try this kind of matrix factorization [inaudible] recommendation system to work this
problem?
>> Guy Lebanon: Yeah. Yeah. Actually, we were thinking about looking at other datasets like
document term matrices or in social network you can think of graph structure. So we actually
thought about [inaudible]. Don't have any experiments to result -- any experiments to report.
But I think it's an interesting possibility.
>>: So how distinct is [inaudible] compared to [inaudible] prediction approach?
>> Guy Lebanon: My estimate or my guess is that it will work also well. Compared to, say,
LSA or, for example, in social graphs you have the problem of predicting links, you want to
recommend links -- hey, maybe you know this guy. My guess is that it will work equally well. It
will show an improvement because the assumption that there is a global low rank is incorrect,
and if you just tried to increase that low rank you're going to start facing overfitting problems.
So assuming local low rank is an effective way of defeating that problem. So I haven't tried to
do it, so that's just a guess. Yeah.
>>: So your work needs kind of two things. It needs this kind of high rank as a sum of low rank
and then solving like local low rank problem. I think [inaudible] you could have a
parameterization which is like a weighted [inaudible] your score is a weighted sum of the
product of some [inaudible] vectors. And then you could perfectly optimize the global MSE
with that.
So is it important to basically solve [inaudible] right?
>> Guy Lebanon: Right.
>>: To solve [inaudible]?
>> Guy Lebanon: Yeah. So I think it's very useful from a computational standpoint to separate
the problem into several smaller problems. We also looked at that. So we looked at solving the
problem while optimizing over all -- learning all local models at the same time, not separately.
But computationally that could be difficult.
>>: This is difficult. But is it a better -- is it a good [inaudible] computational aspect and there's
a [inaudible] aspect. Is this constrained if I have to solve the local problem [inaudible]?
>> Guy Lebanon: I think it works for us. We haven't noticed a problem with --
>>: Okay.
>> Guy Lebanon: Yeah, we didn't find the other approach to work better.
>>: I see.
>> Guy Lebanon: Yeah.
>>: How sensitive is it to the distance, and how easy is it to get gains by improving the
distance?
>> Guy Lebanon: It's actually very sensitive to the distance -- if the distance is bad, all of this is
worthless. It's not sensitive in the sense that if you perturb the distance a little bit it's going to
perform poorly. But if you pick a bad distance, then there's no point.
The first attempt we had was using a distance based just on the relationship between the
observed row indices and the observed column indices, without doing a first SVD and looking at
the latent space. That did not work as well because of a problem of sparsity. Two users
oftentimes have completely disjoint sets of movies that they've watched; you don't know how to
compute the distance between them if you look at the original space.
So that did not work well. The distance that we used then, which was the most obvious one --
just look at the, like, [inaudible] similarity of the observation vectors -- didn't work well. When
we did it in the latent space, that worked much better.
So I think it's very important, but it's not sensitive in the rigorous mathematical sense.
>>: [inaudible] does that further improve the results?
>> Guy Lebanon: We haven't tested that. But we hypothesize that it would be very helpful,
especially for items that don't have a lot of viewing history or for users that don't have a lot of
viewing history. You get a very poor notion for distance. If you have what we call the
[inaudible] users and [inaudible] items. So for these I think metadata would be very crucial.
Yeah.
>>: I'm curious if you or anyone else -- MSE is such a weird metric for this, you know, as you
were saying earlier, it's kind of like, you know, getting the average, like there's a lot of 3s, who
cares, you know. For instance, you could imagine saying, like, when we look at all the 5s and look
at all the 1s, looking at the accuracy of getting the love-its or hate-its is more meaningful; I should
never show the hate-its, I should always show the love-its [inaudible].
>> Guy Lebanon: Yeah.
>>: Are there any -- have people tried to do metrics like that?
>> Guy Lebanon: I think the main alternative, which I do think is an improvement -- although I
don't think it's necessarily good -- like I said, it's still a poor substitute for what you want to do in
the real world. But the alternative is to use ranked measures, because the thinking is that what you
care about is you show to the user five movies or a carousel of movies and you care about being
accurate in the top 5 or top 10. You don't care about the rest of the 50,000 movies. So that's a
reasonable improvement. Actually I think quite reasonable.
>>: Then once it gets down to the weeds, like once you're in the 2, 3, 4s, it might not really
matter, right, what the difference [inaudible].
>> Guy Lebanon: That's right.
>>: Like 3 and 2.
>> Guy Lebanon: That's right.
>>: [inaudible].
>> Guy Lebanon: So one of our extensions, in fact the paper that won the award at WWW,
exactly generalized this: the ranked loss approach, where we both trained and tested -- evaluated --
based on the ranked loss method. Yeah.
>>: [inaudible] do you have features for the users and the [inaudible] because the features
themselves would give you [inaudible] and then it will have [inaudible] then you would get like
[inaudible]. So something similar to [inaudible].
>> Guy Lebanon: So I think one has to be a little bit careful. And also I would say humble.
What you said I think would work well or would work better for movies and items that we don't
have a lot of viewing history for. But if we have a user that has watched a lot or rated a lot and
a movie that has been around, I think these methods would perform better than methods with
features -- no, than feature-based methods.
So it really depends. I think whether feature-based is better than collaborative filtering, I would
say, depends on which users and on which items. In some cases you're right, this would
work better. In other cases the collaborative filtering approach is the one to use, actually. And
we've had confirmation of that also in experiments. And also of course depends on the quality of
features. Like what do you know about your users. Like maybe you know their names and
address, that's okay, but do you know -- do you know their browsing behavior or which ads they
click on. I mean, the more you know, the better.
>>: And usually that's definitely true because, like, if you use feature-based things, there's all
kinds of really crappy music with the right descriptors or something -- it's not good music. You
know, whereas the metadata -- I'm sorry, the comp data actually tells you something really
meaningful about what people are watching.
>> Guy Lebanon: So I really think you have to look at both approaches. Okay? Thank you very
much.
[applause]