>> Dengyong Zhou: I'm so pleased to have Tengyu here. Tengyu currently is a PhD student
at Princeton University in the Computer Science Department. His adviser is Sanjeev Arora,
and that's a very well-known name for many people here. Before that, he got a degree from
Tsinghua University, also in the Computer Science Department. Tengyu will stay here for
two days, today and tomorrow, so if you want to meet him, please let me know.
>> Tengyu Ma: Okay, thanks for having me here. I'm going to talk about analyzing non-convex optimization for sparse coding. This is joint work with my adviser Sanjeev Arora, with Rong Ge from Microsoft Research New England, and with Ankur Moitra from MIT. So sparse coding is often also called dictionary learning -- sometimes I will say dictionary learning, just because I'm more used to that terminology. The key idea is that in many settings, data has a sparse representation in a different, appropriately chosen basis. So Y is our data, and A is an appropriately chosen matrix -- I use Ai to denote the ith column of A -- and if you write this data in this basis, then you get a sparse vector X. So X is the representation, and by sparse, I mean that there are at most, say, K nonzero entries in X. And just to emphasize, this dictionary is usually unknown. If the dictionary is known, then this is the sparse recovery or sparse linear regression problem, but here we are interested in the setting where the dictionary is unknown. It's an N by M matrix, and usually we are interested in the case when this dictionary is overcomplete, where M is larger than N, but maybe not too much larger. When M is larger than N, the columns of A are not linearly independent, so it's a redundant or overcomplete basis. Because the dictionary is unknown and the representation is also unknown, this is an unsupervised learning problem, and you have no hope of finding A and X if you are just given one Y, right? So we are given multiple samples Y, i.i.d. samples from the same distribution. I'm going to use superscripts to index the samples, so my data is a matrix, and I'm going to decompose this matrix approximately into the product of two matrices: one is the dictionary, and the other is the representations, where each column of the representation matrix has at most K nonzeros. Sometimes I use capital X for this X matrix and capital Y for this data matrix. Just the notation.
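In symbols, the model just described (a reconstruction from the talk's description; the notation follows the talk):

$$
y = A x = \sum_{i \in \mathrm{supp}(x)} x_i A_i, \qquad A \in \mathbb{R}^{n \times m},\ m > n,\ |\mathrm{supp}(x)| \le k,
$$

and stacking the $p$ samples as columns, $Y \approx A X$ with every column of $X$ at most $k$-sparse.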
And so this problem originated from the seminal work of two computational neuroscientists, Olshausen and Field, in 1997. They were asking why the human visual system can recognize objects so quickly and so robustly. So they tried to understand the human visual system, and the idea is in this picture. On the left-hand side is the first layer of neurons of our visual system; this layer of neurons gets activated by the light that we see -- you see the picture, and the neurons get activated. This is just the raw data. And then the first layer -- through the synapses, in what is usually called the primary visual cortex, V1 -- after processing by this visual cortex, you get a sparse representation: in the second layer, only a small number of neurons get activated. And neuroscientists believe that the number of neurons in the second layer is much larger than the number of neurons in the first layer, so this is an overcomplete matrix. So basically, neuroscientists believe this is a very reasonable model for the first layer of your visual system. The reason I mention this is, first, that this is the origin, and second, that these two neuroscientists also proposed a very good model for this, and also a very good algorithm, which I am going to talk about. And so then computer scientists saw this very nice
explanation and tried to use it for image problems, because we believe that if human eyes can do it just with neurons -- and a neuron is a very weak computational unit compared to a computer -- then we should be able to do it too: if you can do it with neurons, then you can do it on a computer. So here is exactly the same picture: we take an image, or image patches, and we want to represent our image patches in this basis with a sparse representation. And if you learn such a dictionary and represent your image patches by sparse representations, then, you can imagine, if this is a nice representation, you can do supervised learning using it as a set of features, or you can do reconstruction, because this is just an approximation: if you multiply the dictionary with the representation you get, you get a slightly different image, and this is a reconstruction. Presumably, this could be better -- you could de-noise the original image. In super-resolution, I think people use this to restrict the space of possible images, because if you know that your image has this structure, then you know that not all images are possible. Also, a very important fact is that you can probably use these representations to build higher-level representations, like in deep
learning -- and actually, in visual systems, this is just the second layer of neurons, and we have many other layers behind it. So these are the applications of this problem. Also, deep learning is one of the motivations: because we want to understand deep learning, and it seems that this is the simplest version of deep learning -- it's not even a version of deep learning, it's just a building block of deep learning in some sense -- it's a good starting point for theoretical understanding. So the aim of this work is to theoretically understand this type of problem and the heuristics that are used for it. Okay, so we want to have not only efficient algorithms but also provable algorithms, basically algorithms with provable guarantees for this problem. So this is my outline: I'm going to first introduce some existing
heuristics that are used in practice, and then some models that allow us to understand this problem
from a theoretical point of view. So recall we want to decompose this data matrix into a product
of an overcomplete matrix and a sparse matrix. Olshausen and Field proposed this energy function, which they want to minimize, and it consists of two terms. One is the reconstruction error -- when you fit the model, what is the error -- and the other is the sparsity penalty, which enforces that this X matrix is sparse; you can, for example, use L0 sparsity or L1. The exact form of this penalty is not relevant to this talk, but you can certainly think of it as, say, an L1 penalty. So Olshausen and Field proposed that we minimize this energy function over A and X, because both A and X are unknown.
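Written out (a reconstruction consistent with the talk's description; the penalty $\rho$ is left generic, e.g., an L1 norm):

$$
E(A, X) = \sum_{j=1}^{p} \Big( \big\| y^{(j)} - A x^{(j)} \big\|_2^2 + \rho\big(x^{(j)}\big) \Big),
$$

minimized jointly over the dictionary $A$ and the representations $X$.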
And so these two neuroscientists proposed this alternating update algorithm, and the algorithm is very simple: you repeat until convergence, alternating between two steps. The first step is you fix A and update X. There are different choices for this step, but what they proposed is to just take the minimizer, because when you fix A, this energy function is the sum, over samples, of the reconstruction error of each sample plus the penalty -- you can decompose this energy function across samples, so you can treat each sample separately, and you just take the minimizer for each sample. For example, if the penalty is L1, then this is convex. And then the learning step: you fix X and update A, and the way they proposed to update A is to just take a gradient descent step. You take the gradient of this energy function with respect to A -- now X is fixed, so you only take the partial gradient with respect to A -- you choose a learning rate eta, and you update A to be A minus eta times the gradient. Actually, there are many other algorithms for this problem, and the efficient algorithms are basically all of this form, all following the same framework: you have this decoding step and this learning step, and the only difference is how you decode and how you learn, that is, update A. So let me do a little summary of the existing algorithms. I think there are three interesting algorithms in the literature. One is this Olshausen and Field algorithm, which I refer to as the neural algorithm; another is MOD, the method of optimal directions; and the third is k-SVD. So recall that we want to minimize this energy function. In the decoding step, OF just takes the minimizer, while k-SVD and MOD use pursuit algorithms. Recall that in the decoding step, we fix A and want to minimize over X, and this is essentially just the sparse recovery problem, so you can use different tools. Now L1 minimization is the dominating one, I guess, but before that, there were even simpler algorithms, like pursuit algorithms, where essentially -- you want to find the sparse vector X here, and you just choose the nonzeros in a greedy way. First, you take the entry that explains your data the most, and then you choose the next one, and so on. And here, you choose exactly K nonzero entries, and then you stop.
>>: Just the lasso.
>> Tengyu Ma: No, no -- lasso is where you minimize with the L1 penalty, so this is even simpler than lasso. It's an even greedier version.
>>: It's L0.
>> Tengyu Ma: Yes, so you really enforce L0 constraints.
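To make the decode/learn framework above concrete, here is a minimal runnable sketch in Python (this is not the speakers' code; the greedy top-K support guess and the column renormalization are illustrative choices):

```python
import numpy as np

def decode(Y, A, k):
    # Decoding step (A fixed): a greedy, pursuit-style sketch -- guess the
    # support from the k largest correlations, then least-squares on it.
    X = np.zeros((A.shape[1], Y.shape[1]))
    C = A.T @ Y                                   # correlations <A_i, y>
    for j in range(Y.shape[1]):
        S = np.argsort(-np.abs(C[:, j]))[:k]      # k-sparse support guess
        X[S, j] = np.linalg.lstsq(A[:, S], Y[:, j], rcond=None)[0]
    return X

def learn_step(A, X, Y, eta=0.1):
    # Learning step (X fixed): one gradient step on ||Y - A X||_F^2 wrt A.
    A = A - eta * (A @ X - Y) @ X.T / Y.shape[1]
    return A / np.linalg.norm(A, axis=0, keepdims=True)   # keep unit columns

def alternating_updates(A, Y, k, iters=50):
    # Repeat the two steps, as in the Olshausen-Field scheme.
    for _ in range(iters):
        A = learn_step(A, decode(Y, A, k), Y)
    return A
```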
>>: So lasso can be analyzed. People have known --
>> Tengyu Ma: Yes, yes.
>>: So the analysis you will be talking about is mainly of the other step.
>> Tengyu Ma: Because here we don't really know A -- in lasso, we assume that you know A, you are given A, but here, A is not known.
>>: So that part is hard.
>> Tengyu Ma: So the trouble is exactly that you don't know A. So you have to approximate A and X alternately so that you get close to the truth. And in the learning step, the OF algorithm just uses gradient descent, while MOD just takes the minimizer, because when you fix X and minimize over A, this is just a quadratic problem -- this penalty term only depends on X, so when you fix X, it is a constant. And k-SVD is a little bit trickier, so I'll just describe it for [indiscernible] of A, but you don't really need -- so the idea is that you fix everything else, everything except the ith column of A and the ith row of X, and you just update these two. If you fix everything else, then this is just a rank-one approximation, and although you still have two sets of parameters, thanks to this rank-one structure -- although this is not convex -- you can still do it, because you can use SVD. But I'm not going to talk about this algorithm in detail, just for [indiscernible].
>>: So none of them can be analyzed.
>> Tengyu Ma: It was not known -- we didn't know how to analyze them, and now we can analyze the ones with the check marks, the ones that I checked. So basically, we can analyze this plus this, or this plus this, and in this talk, I'm only going to talk about this plus this.
>>: For MOD here, on this problem -- OF gives one solution for step two. So when you use the minimizer, is that anything special [indiscernible]?
>> Tengyu Ma: No, no, so basically -- yes, that's a great question. So for the minimizer here, we are fixing X, so this is simpler; this is a quadratic problem. So indeed, if you take a lot of --
>>: This is dictionary learning, right? Given X, given [indiscernible].
>> Tengyu Ma: Yes, yes -- exactly. But sometimes, you don't really want to take such a big step, so that's why --
>>: Yes, hoping to get the global optimum if possible.
>> Tengyu Ma: Yes, so we are hoping to get the global optimum, but it turns out that with MOD, at least from a theoretical point of view, this step is too big. So we need to take a somewhat more conservative step, and actually, at least from the experiments I ran -- just very simple experiments -- MOD is not as good as k-SVD and OF in some respects.
>>: So for OF, do they actually do early stopping so they don't go too far? Do they need to go back to the matrix [indiscernible]?
>> Tengyu Ma: I don't think so. I'm not sure about the experimental details, so --
>>: Really, they just do one gradient step. Yes, just one step.
>> Tengyu Ma: Yes, yes, exactly. So that's a good point. You just take one step, and then you update this.
>>: And MOD just fully minimizes.
>> Tengyu Ma: These are alternating algorithms. You alternate between this step and this step many times.
>>: So that's a super, super early stopping, right? Just one step.
>> Tengyu Ma: So I think if you --
>>: If we use MOD, this step would be too much.
>>: Too much. Right, okay.
>> Tengyu Ma: So I think, compared to deep learning, this step is more or less the step where we calculate the back-propagation, and this step is more or less the learning step, the gradient step -- but probably this is not a good comparison. So we are going to talk about this plus this. And the main contribution of this work is that we can analyze this algorithm from some reasonable starting point. So before talking about the theory part, let me talk about the
generative model, which was also proposed by Olshausen and Field, and which makes a lot of
sense. We need to assume something about our data; otherwise it's too hard -- minimizing a general non-convex function is NP-hard, and there's really no hope. So basically, we assume that the data is generated from some ground truth A*, the ground truth dictionary, and it is overcomplete. And we need to assume that the columns of A* are close to orthogonal -- no two columns are very parallel to each other -- and that they are kind of isotropic, which means that the spectral norm is not too large. And then we assume that this X*, let's say, takes values in -1, 0, 1. We can relax this assumption a little bit, probably a lot, but for simplicity, I'm assuming that it just takes these three values, and the expectation of each Xi* is zero. We also assume that Xi* and Xj* are only weakly correlated -- kind of independent -- and also that X* is a k-sparse vector: it has at most K nonzero entries. And then the generative model is that we are given, say, P samples, and each sample is of this form: A* times X*, plus noise, and the noise is not too large. I won't talk much about how to deal with the noise, because even without the noise, this is a hard problem, but we can tolerate a certain amount of noise. And okay, I omit the superscript here; properly, Y with superscript j is equal to A* times X* with superscript j, for the jth sample.
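A minimal sketch of this generative model (the sizes and the Gaussian random dictionary are my illustrative assumptions; a random dictionary with unit-norm columns is incoherent and close to isotropic with high probability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, p = 64, 128, 5, 5000                  # illustrative: m > n, k < sqrt(n)

A_star = rng.standard_normal((n, m))
A_star /= np.linalg.norm(A_star, axis=0)       # unit-norm, nearly orthogonal columns

X_star = np.zeros((m, p))
for j in range(p):
    S = rng.choice(m, size=k, replace=False)          # k-sparse support
    X_star[S, j] = rng.choice([-1.0, 1.0], size=k)    # values in {-1, +1}, mean zero

Y = A_star @ X_star + 0.01 * rng.standard_normal((n, p))   # small Gaussian noise
```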
>>: But why do they limit X to be -1, 0, 1? That may not be realistic.
>> Tengyu Ma: Yes, that's not realistic, sure.
>>: Only under [indiscernible] analysis?
>> Tengyu Ma: Oh, no, no, we can relax this. Basically, what we need is: if this entry is zero, then this is zero, right? And if it's nonzero, then it should be bounded away from zero -- it shouldn't be too small. So this is a sparse vector with a kind of gap condition: every nonzero entry is larger than a constant C, like one half, or smaller than minus one half.
>>: So I know that. So the argument is that it doesn't really matter what X you have -- as long as A is general enough, you can synthesize, you can generate, any kind of Y.
>> Tengyu Ma: Yes.
>>: But on the other hand, in a practical problem, if X is not limited to [indiscernible], then the result may not be meaningful.
>> Tengyu Ma: We can analyze the case when it is bounded away from zero.
>>: So arbitrary X.
>> Tengyu Ma: Yes, as long as every nonzero entry is bounded away from zero. Technically, that's the assumption, but for simplicity, I am going to assume the -1, 0, 1 version for this talk. So technically, our assumption is that X is a sparse vector, and if Xi is nonzero, then it must be either larger than one half, say, or smaller than minus one half.
>>: [Indiscernible] lower bound or it's bounded away.
>> Tengyu Ma: This should be constant.
>>: As long as there exists a constant.
>> Tengyu Ma: Yes, exists a constant.
>>: I see.
>>: So X is between 0 -- between -1 and 1, or is it --
>> Tengyu Ma: There's no upper bound. There is no upper bound, just a lower bound: the nonzero entries cannot be too small. Of course, there are some scaling issues, so let's say the magnitudes of the entries of X are between constants.
>>: What is the amount of correlation that you can allow?
>> Tengyu Ma: So, basically, we can allow I think a fair amount of correlation. If X is a uniformly sparse vector, then the correlation between Xi and Xj is something like K over the dimension squared. We can allow up to another constant factor -- say, a constant times K over M squared.
>>: [Indiscernible].
>> Tengyu Ma: Oh, so incoherence. Incoherent. I'm going to talk about the details when we
use it.
>>: So even if there are two coordinates that always come up together, that will mess up the
algorithm.
>> Tengyu Ma: That will mess up the algorithm. If Xi is always equal to Xj, then that's really trouble.
>>: Then that's really trouble. The algorithm is not going to work, or the analysis.
>> Tengyu Ma: For sure, the analysis is not going to work. For the algorithm, I think you can certainly find some corner case where it doesn't work, but in practice, I think if just a small number of pairs have this weird correlation, it should be fine -- but we didn't analyze it. Okay. So this model makes sense, because the goal is to recover A*.
>>: So for the P samples, you draw them with P different X*s?
>> Tengyu Ma: Yes, different X*s, yes.
>>: It's not that the noises are --
>> Tengyu Ma: The noises are also different, but A* is the same.
>>: Yes, because the alternative would be the same X*, and we just generate the P samples from the noise.
>> Tengyu Ma: Oh, no, no, no. I think I omitted the superscripts here and here; you need to have a superscript j here and here. So this model makes sense, because if you think about the energy function without the penalty term -- the penalty term is just to enforce sparsity -- this is the log likelihood of Y given A and X, so when you minimize the energy function, you are trying to find the maximum likelihood. And indeed, you can show that when P is large enough, the ground truth -- A* and the representations -- is a minimizer of this energy function. So that's why.
>>: That's not entirely true, right?
>> Tengyu Ma: It’s not entirely true, because -- sorry?
>>: Because not all X -- if I put in an X which is not k-sparse, it has zero likelihood.
>> Tengyu Ma: Yes, so --
>>: So it's only a restricted log-likelihood.
>> Tengyu Ma: Yes, yes, exactly. So I guess this sentence means that I'm considering the sparsity penalty as well. Without the sparsity penalty, this is the log likelihood; with the sparsity penalty added, the ground truth is the minimizer in some sense. And also, you need to choose the penalty in the correct way. This is just the intuition; we are not going to use it.
>>: And the noise is assumed to be Gaussian distributed?
>> Tengyu Ma: Yes, we are assuming that this is just white Gaussian noise. I'm trying not to talk too much about the noise, because even without the noise -- if the noise is zero -- this is not an easy question.
>>: [Indiscernible].
>> Tengyu Ma: Okay, so this is the model, and with this model, we can talk about some theory work. So, regarding previous works: we've found that in practice, for this non-convex optimization, we have this alternating updates algorithm. It's really good -- it's very efficient, it's very simple, and it's also generic, so you don't really need to worry about what the parameters are; you just do gradient descent or something, and it's very successful, up to K less than square root N. K is the sparsity, let me remind you, and N is the dimension of the vectors. And from a theoretical point of view, okay, we don't know whether it can get stuck at local optima, and we don't even know whether there are local optima. We are not really sure, actually -- it's not very easy to certify whether there is a local optimum sometimes.
>>: So this one doesn't depend on the learning rate, eta?
>> Tengyu Ma: It depends on learning rate.
>>: That has to be appropriately chosen.
>> Tengyu Ma: Yes, to run the algorithm, to make the algorithm work -- right, yes, you need to choose that. But usually, if we choose it to be very small: the smaller, the more robust, but the slower. And we don't even know -- so I think even just one year ago, we started to know that if we start with a matrix that is 1 over K close to A* -- so if you are given a very, very close starting point, like 1 over K close column-wise, meaning for each column you are already 1 over K close to the ground truth -- then the algorithm is guaranteed to converge, in L2 norm, in Euclidean distance. I'm going to use Euclidean distance throughout the talk. So basically, we only had a very, very small basin of contraction, theoretically, but empirically, the basin of contraction is certainly much, much larger, I think. And there are
several existing theoretical works for this problem. I listed them. I think it started from Spielman, Wang and Wright, 2012 -- this was the COLT best paper -- and they used an LP-based algorithm. It's a convex optimization approach for this problem, and they allow the sparsity K up to square root N, which is good, but they cannot allow an overcomplete matrix. So in their case, this matrix A is just a square matrix, and the number of samples is something like N squared; you need this many samples. And then the next one is by Arora, Ge and Moitra -- they had a combinatorial algorithm for this problem -- and also by [indiscernible] and Kumar and other authors, and they can prove that if K is slightly less than square root N, then you need M squared times 1 over epsilon samples, where epsilon is the accuracy in some measure. So if you want to get epsilon accuracy, you need 1 over epsilon times M squared samples, and it can tolerate an overcomplete dictionary. And the third one is more of theoretical interest -- this is Barak, Kelner and Steurer, and they use Lasserre's sum-of-squares relaxation. If you're not familiar with this terminology, it doesn't really matter; this is a very high-order convex relaxation, probably the most powerful relaxation that we know of for now, and they can tolerate even larger sparsity, up to almost linear. For theory, this is really good, really powerful, but for practice, this is terrible, because the number of samples is exponential in 1 over epsilon. So if you want to get, say, 0.2 accuracy, which is I guess very bad for practice, then you need, let's say, M to the 10.
>>: You don't show any result on the rate of convergence, how fast the algorithm --
>> Tengyu Ma: These are not iterative algorithms. This is convex relaxation: you write down
the convex optimization and you solve it. And this one is combinatorial, and the rate is this -- well, this is not really the rate; this is the number of samples. So I guess the main point here is that for these theoretical approaches, the number of samples is too large, and they are really not doing as well as the non-convex heuristics do in practice.
>>: Are these guarantees deterministic, or with high probability?
>> Tengyu Ma: With high probability, it returns the ground truth. Yes.
>>: So the problem is non-convex to begin with, right?
>> Tengyu Ma: Yes.
>>: So when you say that they formulate the problem as a convex one and analyze it, do they address how much error you get by approximating the problem?
>> Tengyu Ma: Yes, they need to address that.
>>: Not here, right?
>> Tengyu Ma: Not for today, but in these works, they need to address that when you do this
convex relaxation, how much you lose.
>>: I see.
>>: What do you mean exactly by overcomplete? You mean M is like --
>> Tengyu Ma: M is larger than N. It's a constant times N.
>>: Constant time.
>> Tengyu Ma: Not much larger. I think this one could -- probably this work could tolerate M to be N to the 1.1, for example, something like that; that's probably the best guess. But anyway, this is not much larger. But I guess in practice, we don't really care about the case when M is much larger than N, probably. Okay. So I guess the key point here is that the non-convex approach is really the right way. We have been working really hard on this problem from a theoretical point of view using convex relaxations, but we never get the same result compared to the simple non-convex heuristics. So the next question for us is whether we can analyze those non-convex optimization approaches, since it seems that this is really the right way to do it. Okay, so
basically this motivates our work, and we tried to answer this question, and this is our main
theorem. We have some conditions -- K is less than square root N, and M is a constant times N -- and some assumptions [indiscernible] as we described on the generative model slides, and we need to have this. This is very crucial: A0 is column-wise 1/log n close to A*. By this, I mean that the column-wise distance between A0 and A* is at most 1/log n. By the way, the normalization is that each column of A* has unit norm, so that the squared Frobenius norm is something like M. And what we can show is that with this reasonably okay starting point, the OF algorithm, this non-convex approach, with a simple decoding rule -- which I didn't describe yet, but I'm going to describe later -- returns a guess As at each iteration such that this is true. So basically, after S iterations, the error has decreased geometrically in S. And we have some bias here, which can be removed, but for simplicity I'm going to keep it, because we are not going to talk about the --
>>: In some respect, the results will have better [indiscernible].
>> Tengyu Ma: Yes, if you choose -- yes, so yes. The learning rate needs to be properly chosen, and if you choose the learning rate --
>>: [Indiscernible].
>> Tengyu Ma: Oh, I have already optimized over the learning rate.
>>: Oh, I see. It's [indiscernible].
>> Tengyu Ma: So I have already chosen the best learning rate, so that this --
>>: Is this also high probability?
>> Tengyu Ma: Yes, this is also with high probability. Everything we are talking about holds with high probability, but I am cheating on that. Okay, and this underlined condition is really hurting us, because this is --
>>: How about X, the initialization of X?
>> Tengyu Ma: So we use this decoding rule to initialize X, so if we have A0, then we use this
decoding rule to initialize X.
>>: So in that perspective, does the permutation of the columns matter? Like you said, column 1 --
>> Tengyu Ma: Yes, so -- yes.
>>: Is that [indiscernible].
>> Tengyu Ma: Yes, I have considered the permutations. So I have to account for permutations and sign flips, actually -- if you multiply each column by -1, it doesn't really change the problem. So it's after permutation and sign flips that this is column-wise close. So okay. And to complement this, we show a spectral-method-based algorithm which returns an A0 with such a good initialization, column-wise 1/log N close to A*. But I guess the alternating analysis is the main part of this talk, and I'm going to talk a little bit about the initialization later. So just some notes.
Because this is geometric decay of the error, it means that if you want to achieve epsilon error, you only need log(N/epsilon) iterations, and the sample complexity is M times K. The previous one was M squared, so we improve by at least a factor of M over K. Also, the previous sample complexity depends on 1/epsilon, but here it depends on log(1/epsilon), so that's the theoretical improvement. And the runtime is M times N times P, which is really very good, because even to evaluate this energy function, you need to multiply an N by M matrix with an M by P matrix, so even this multiplication takes M times N times P time, if you don't use fast matrix multiplication.
>>: That's per iteration, right?
>> Tengyu Ma: Per iteration, but the number of iterations is logarithmic, so I use this tilde to hide the log factor, but here I wrote out the log. So the initialization also takes the same number of samples, and its runtime is a little bit larger, but we think that's in some sense necessary, because the initialization gives you a very good approximation, like 1/log N, so presumably it should take more time. And also, all the numbers here could be improved if you assume a little more independence on the X or if you change some assumptions somewhere, so this is not really optimal, just a demonstration. Okay, so I guess -- now
let's talk about how to analyze this alternating minimization, and my plan is that I'm going to
give a new perspective of this kind of non-convex optimization problem, and then I'm going to
describe a general condition for convergence, and then I'm going to apply this general framework
to sparse coding and show you how to get a result.
>>: Can you go back to see the [indiscernible].
>> Tengyu Ma: Okay, sure, sure.
>>: The generative model that you talked about -- so what is the hidden -- okay, so X is the hidden variable?
>> Tengyu Ma: Okay, X is the hidden variable. So I think, technically, I should write here that the sample is A* times X*j, plus noise.
>>: Yes, so the hidden variable --
>> Tengyu Ma: The hidden variable is X*. This, A*, is the parameter.
>>: Oh, so the generative model only applies to one step, only for the learning part.
>> Tengyu Ma: No. This is the generative model for the data. This is not about the algorithm.
>>: Not about the data -- so to me, X*... both A and X.
>> Tengyu Ma: Are unknown.
>>: So it doesn't make sense to talk about one being the parameter and one being the data.
>> Tengyu Ma: Oh, I see. Oh, I see.
>>: In a [indiscernible] model, you normally assume that Y is known, Y is observable. The hidden variable is really unobservable -- you cannot observe the hidden variable --
>> Tengyu Ma: Yes, both A and X are not known. So why did I call this the parameter? Because A*
is shared across all the samples, so if I write another --
>>: Oh, I see.
>> Tengyu Ma: And the X* is the independent stochastic part. You draw an independent X* for each j.
>>: Okay, so that is [indiscernible].
>> Tengyu Ma: So okay? Okay, so the new perspective -- what was the old perspective? At least, this was my understanding, say, 10 months ago. So we tried to analyze this non-convex function E of A comma X, and if you write this, this is the direction of X, and this is the direction of A. What we are doing -- because we are doing alternating updates -- is that each time we update either X or A. This is the update. And this is non-convex, and [indiscernible] our understanding was that somehow, because of this special update rule, this alternating rule, we can avoid local optima. I guess that was the best hope. It doesn't make much sense, I agree, exactly, but we didn't know much better than this. So maybe there is even no local optimum. That's also possible.
>>: But no local optimum -- can you come up with a small example, in a small dimension, where you would have a local minimum? [Indiscernible] or search over --
>> Tengyu Ma: For small dimensions?
>>: Yes, just to show that there exist examples with local minimums.
>> Tengyu Ma: For small dimensions, we didn't try, but I think we can. So I think the question is, if you increase the number of samples, whether you can still find one. I found that if I fix the number of samples to be just five times the dimension, then I can find a local optimum, but if I increase the number of samples, then it's harder and harder to find a local optimum.
>>: Maybe it's easier to construct a local optimum because of the symmetry, right? Because of the -- you can use the symmetry argument, right? [Indiscernible] has to be the same, unless they're all the same or everything in between them is the same. Unless that's the case, they have to be local --
>> Tengyu Ma: So I guess what you are saying is that there are a lot of global optima, because if you permute A and X accordingly, then you get exactly the same solution, so in the whole space there are a lot of minimizers, and so for sure, in the whole space, there are many, at least [indiscernible], points, just mathematically, right? But in the regime that we are interested in -- right, we can prove things 1/log N close -- whether there is one, right? So maybe there is no local optimum in this small ball? So okay, but anyway, this was our confusion, and we tried to propose something that is slightly better than this.
>>: So this is your conjecture?
>> Tengyu Ma: This is not a conjecture. This was our confusion -- we don't really understand why this works; it's just the confusion. This is just for comparison, because I am going to say something new. So now we are going to analyze this, and in the learning step, we are going to use gradient descent, right? And if you look at this energy function, it's kind of special, like we discussed: we have this A, which corresponds to a fixed set of parameters, and X, which corresponds to a stochastic part -- a random part in some sense. And also, this is special because if you fix X, then this function is, in our case, quadratic, so at least it's convex with respect to A when you fix X. Actually, if you choose the penalty to be a convex function, this is also convex with respect to X. So we hope to exploit this special structure of the function so that we can prove something, achieve something. And the way that we exploit it is that we observe that if you plug X* into this function -- X* is the ground truth, which we don't know, but as a thought experiment, let's just plug in X* -- we get a function that depends only on A: the penalty term becomes a constant, and this is a quadratic function over A. So this is convex, but it is not known; it's hypothetical, a thought experiment. But the observation is that when X is close to X* -- if you look at this and this, the only difference is X versus X* -- the gradient with respect to A when you plug in X and the gradient when you plug in X* are more or less close. This is just the intuition. And the gradient at the point X* is the gradient of this convex function, just by definition. So what we are saying is that this term, the one that we are interested in right here, this guy, is close to the gradient of the convex function, although that gradient is unknown. So that's the point. So then we have a different picture: we define this unknown convex function, and we show that the gradient that we have is an inexact version of the gradient of this convex [indiscernible]. And if you look at this picture, we know that if you follow this one, the blue one, you are guaranteed to converge to the global optimum, because this is a convex function, and we are in the space of A; this is convex, and everything is nice.
>>: Yes, but I thought this was precisely the second line that we talked about earlier --
>> Tengyu Ma: Yes, yes, yes, exactly. I'm just --
>>: So the problem is that it goes too far, so that's why it's not as good as just taking small steps to reach the global optimum. Earlier, you had three --
>> Tengyu Ma: Yes.
>>: And this one, the one you just talked about, actually --
>> Tengyu Ma: Yes, yes, exactly. Yes.
>>: Which is not as good as the first one.
>> Tengyu Ma: Yes, because you are going too far, yes. Exactly. That's what MOD does, this taking [indiscernible].
>>: So your point is that you want to show that not going so far is better than going [indiscernible].
>> Tengyu Ma: At least --
>>: [Indiscernible].
>> Tengyu Ma: At least -- so I am not going to say it is really better, because we don't really have evidence, but intuitively, at a high level, I think that's true. But at least from a theoretical point of view, when we analyze it, we want to choose the step size to be not too large, so that it doesn't do anything too bad. Anyway, if you follow this blue arrow, then you are going to converge to the global optimum. But what we have is this red arrow, which is in some sense an approximation of the blue arrow, and we hope that this approximation doesn't hurt us too much -- that even with the approximation, we can still do something good. So basically, this is the new perspective, versus this one. Okay, just to repeat the same thing: we know this converges to A*, and we hope to show that this implies that the one we have converges to A*. So basically we are asking whether we have a theory of biased approximate gradient descent. And another question is: because we are anyway estimating A*, maybe we can use some other approximation. A note is that this is different from stochastic gradient descent, because in stochastic gradient descent, you usually pick some samples to estimate your gradient, but that estimate is always unbiased. Here, the estimate could be biased, so we need a biased version of approximate gradient descent, and the answer is yes. Actually, we can be a little more general -- other approximations of the gradient also converge, and the answer is yes -- and that's why we can analyze this k-SVD one, which is something that doesn't really involve a gradient at all, but we can still think of it as an approximation of gradient descent, and maybe some other variants of the Olshausen and Field algorithm as well. Okay. So this is the new framework, and then we are going to talk about [indiscernible], and so -- okay, so then our goal
is to build this approximation theory of approximate gradient descent, in some sense. And
actually, we'll propose an even more general condition. So think about this: we are interested in this general first-order update rule, so we have some Z -- I changed the A to Z, so Z plays the role of A, and G is -- you can think of it as a gradient. Just because this is interesting on its own, right? This is my first-order update, and this is my theorem, this is my condition. My condition says that GS is somehow correlated with the desired solution Z*: the inner product of GS with ZS minus Z* is not too small. I guess this doesn't make too much sense at this point, but I'm going to have a picture in the next slides, just here. So what we are saying is that, okay, this is GS, the direction of the update, and this is ZS minus Z*, so what we are really saying is that the angle between this red arrow and the green one is less than 90 degrees, strictly less than 90 degrees. And the reason why we have these two norms here is just that we don't want --
>>: That less than 90 degrees -- is it always less than --
>> Tengyu Ma: So this is our condition. If it is always true, then we are good, right?
>>: I see. So that will require that your gradient is not too far from the real, true gradient, because if you go in the opposite direction, the whole [indiscernible].
>> Tengyu Ma: Yes, yes.
>>: ...away from your true solution, that solution. So [indiscernible] that condition says the estimate of the gradient cannot be too noisy.
>> Tengyu Ma: Not exactly, but I think you get the main point. Basically, if my direction is close to the true gradient, then certainly this angle is less than 90 degrees. But the reverse is not necessarily true.
>>: So there [indiscernible].
>> Tengyu Ma: It's possible that, for example, this is the true gradient, and this red direction is my update rule; both are [indiscernible] the desired direction, but the two of them are not really correlated with each other. That's also possible -- we don't really have an example, but theoretically, it's possible. I'm going to have a picture, I think, in a few slides. But basically, we are saying that it's not too far from the true gradient, and the theorem is that if this GS always satisfies the condition -- this is something we need to check, but let's say it always satisfies the condition -- then we have geometric [indiscernible]. Here, the learning rate is here; we need to choose it to be small -- I think we need the learning rate to be less than two beta, so we can't really choose it to be very large. And we also have some systematic error here. I didn't talk about this epsilon S, but there's some systematic [indiscernible] which we are going to allow, because for this problem, we really have this error.
>>: By the way, for this -- if you allow yourself to go more than one step, a few steps, is it going to do better?
>> Tengyu Ma: You mean like an accelerated gradient descent?
>>: No, a gradient descent, more than just one step. You can do gradient descent.
>> Tengyu Ma: Oh, yeah, so here, we have that: after each step, the error decays -- yes. It's not only about --
>>: [Indiscernible].
>> Tengyu Ma: So we have --
>>: So the room [indiscernible].
>> Tengyu Ma: So sure, yes, I think so. So let's -- okay. Okay, so the proof of this -- you can prove it by picture. We are trying to show that after S iterations, the error drops at this rate, so we only need to show that in each iteration it drops by a constant factor. I'll omit the systematic error here -- let's say this epsilon is zero. And this is almost a proof by picture, because if this angle is less than 90 degrees and we choose the step size to be a little bit small, then certainly this blue arrow is shorter than the green one, right? And that's it,
so -- and just a little bit of discussion about this. So there is no objective function involved in
this framework -- there is no gradient, no convexity really involved. We only need some first-order update, some update rule, and then we have a condition that we can check. But actually, behind it, this captures the analysis of convex optimization; in fact, it is extracted from the analysis of convex optimization. This is the most basic thing that people do in convex optimization, and the only difference is that this is a different level of abstraction. That's the only thing. Yes. Okay -- but we hope that this different level of abstraction can make it more general, so that you can apply it even to non-convex problems. That's the point. So now, I'm going to try to apply this framework to energy minimization, and recall that --
>>: [Indiscernible] level of abstraction. I think -- but this has to work locally when it is strongly
convex.
>> Tengyu Ma: Yes, yes, exactly. Yes.
>>: So that's the bottom line.
>> Tengyu Ma: Yes, that's the limitation.
>>: Yes, that's the limit. That's why I [indiscernible] your last point saying it possibly works on non-convex problems -- because [indiscernible] is the same. It's the level of abstraction.
>>: So what do you mean by level of abstraction?
>> Tengyu Ma: So, in terms of the analysis. Previously, we were saying that G must be the gradient of a convex function, and the update rule is gradient descent, and this condition is hidden in the proof. So previously, my theorem was: G is the gradient of some convex function, and then I have this. And now I'm saying that G doesn't have to be the gradient of anything -- it's just something -- and if I have this condition, I have --
>>: It's like the factor [indiscernible].
>>: Okay, but put another way: for example, if it is [indiscernible] strongly convex, you can imply this condition.
>> Tengyu Ma: Yes, yes.
>>: But on the other side, I doubt if you can find some function which satisfies this but does not
satisfy strong convexity.
>> Tengyu Ma: There is no function involved, right, so that's why it's slightly more powerful.
>>: But if you restrict it to any function, then that has --
>> Tengyu Ma: Yes, if you restrict to a function, yes, but here, we are going to apply this to a non-convex function, so that's the difference. Okay, so I think -- oops. Oh, here, okay. So
I guess we need to speed up a little bit, so I'm going to apply this to the energy minimization problem, so let's see. I think this relates to your question: how do you use this, right? So I have this alternating update, and my decoding is something like this -- this is the decoding, and this is the learning. And on the other hand, my framework says that if my update is like this, then I need to check this condition, so basically I need to connect this to that. So I just define G to be this gradient, with X being the decoding given A, and I want the update to be of this form -- I want to match these two -- and so A is updated to A minus eta G, and basically we want to check that G satisfies this.
>>: And so is it true?
>> Tengyu Ma: Yes, it is true.
>>: Nice.
>> Tengyu Ma: Locally. Only locally. With the initialization -- within 1 over log [indiscernible].
>>: That's the --
>>: Oh, okay. Perhaps it isn't, okay.
>> Tengyu Ma: But we believe 1/log N shouldn't be the truth, right? It should be something like a constant.
>>: But there is no hope that it would be true --
>> Tengyu Ma: Globally. It should be different. So I think, let me get to this, probably -- so this is
the picture that -- so it's possible that the gradient is in this direction, but the one that you have is
in this direction, and the angle between the purple one and the red one is very large, but they are
both correlated with the green one. That's possible, but that's just --
>>: It's okay, right? It's also less than 90 degrees from the green one?
>> Tengyu Ma: Yes, so this is still good.
>>: The same picture will show the same thing.
>> Tengyu Ma: Yes, but -- the point is just that you don't really need to compare the red with the purple. You only need to compare the red with the green. And actually, as you suggest --
>>: Okay, one extra point. That can be true for a particular instance, but if you want to do a
stochastic analysis, then the expectation will always be 90 degrees. That's what I would guess.
>> Tengyu Ma: This doesn't have to be the best estimate of the gradient -- maybe it's something else. You could choose something else that doesn't have anything to do with the gradient.
>>: Except, from the picture, if eta is very large, the --
>> Tengyu Ma: That's not going to work -- I'm going to choose eta to be small. And just a very
quick note -- as you suggested, if you compare this red one with the purple one, that's a slightly stronger condition, and this is exactly the condition in the [indiscernible] paper. We realized this is independent work, and they are doing this for EM algorithms, so it's a different problem, but it's exactly the same type. In the EM algorithm, we have the parameter and the hidden variable, and the only difference is that, when you estimate the hidden variable, you don't do a decoding, you do a posterior estimation, but this is exactly the same thing. So their condition is slightly stronger: they compare the red one with the purple one, but not the red one with the blue one, so in their case, this angle is small. Okay, so I'm going to -- yes, I think time-
wise it should be fine. So now I'm going to check this condition, right, the last line, right? This is what I need to do: G is correlated with A minus A*. And what is G? Okay, I'm going to check this. Yes. So what is G? If you take the gradient, G is this -- it's something like this -- and let's consider the population version: say we have an infinite number of samples, and let's consider the expectation of it. The sample complexity I am not going to talk about; that's very tough. And X is defined to be the decoding of Y given A. And this is a quite nasty thing, right? Because X is a function of A and Y -- it could be something like an L1 minimization, right? You don't really know. And so the question is whether you can get a good form for G, so that you can check your condition. And then the question is what decoding you want to use so that you can really have an analytic form for G.
>>: What is the expectation [indiscernible]?
>> Tengyu Ma: Excuse me?
>>: What is the expectation with respect to?
>> Tengyu Ma: So if you plug in X* here, right? Is that your question?
>>: Yes.
>> Tengyu Ma: If you plug in X*, this is just A minus A*, because this is a convex function, a quadratic function. But of course our X is not really X*, right? So, as an interlude: what decoding scheme are we going to use? Our problem is that we have Y equal to A* times X*, and we want to decode X from Y and a noisy version of A*. Let's say we were given A* itself: then this reduces to a sparse recovery problem, and in this situation we can use this projection decoding. The idea is that we just take A* transpose Y and we threshold it. By thresholding, I mean that if the inner product of Ai* and Y -- which is just the ith entry of A* transpose Y -- is larger than one half in magnitude, I keep it. Otherwise, I zero it out.
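A small sketch of this projection/threshold decoding (the 1/2 threshold is the one stated in the talk; the function name is mine):

```python
import numpy as np

def threshold_decode(Y, A, tau=0.5):
    # Correlate each sample with the dictionary columns and keep only the
    # entries whose magnitude exceeds tau; everything else is zeroed out.
    C = A.T @ Y
    return np.where(np.abs(C) > tau, C, 0.0)
```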
>>: In what sense -- how did you come up with this projection decoding [indiscernible]?
>> Tengyu Ma: Why did I come up --
>>: Because the original problem, given A, is a convex problem that gives you the global solution. But the -- you don't want to go too far in the --
>> Tengyu Ma: No, the point is that I need the -- yes. First of all, I don't really need that precise a solution; I'm okay with an approximation. That's one of the key points. Also, for theoretical analysis, I want something that has a closed form. If it's the minimizer of a convex function, there's no hope of analyzing it. And actually, I think we have some evidence that this is okay -- it's comparable, though probably slightly worse than the exact minimizer.
>>: So when you take this threshold, this is just an arbitrary threshold. How do you make sense
of this kind of solution?
>> Tengyu Ma: I'm going to talk about it. So for the proof -- there is a theorem, which dates back to probably 20 years ago: if A is incoherent, meaning the Ai are unit-norm columns and the inner product between distinct columns is less than 1 over square root N, then if you use this threshold, the support of the decoded X is equal to the support of the true X. And the proof --
>>: So is it intuitive that that threshold is sensible as an approximate solution?
>> Tengyu Ma: I hope I can convince you. Let me show you the proof, which is just one line -- I hope that I can convince you, but it's not that easy to see. If you take the inner product of Ai* and Y and do the calculation, this is Xi plus something: a sum over the j in the support with j not equal to i. Because X is a sparse vector, there are only K nonzero entries here, and each inner product is less than 1 over square root N, so this sum is less than K over square root N, which is less than a half. So basically, if you look at the beginning and the end, this says that the inner product of Ai* and Y is Xi plus or minus some error smaller than one half. So then, if you know that Xi is either 0 or at least 1 in magnitude, this threshold makes sense. Okay.
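The one-line proof, reconstructed from the description above: for a sample $y = A^* x$ with entries of $x$ in $\{-1, 0, 1\}$ and $|\mathrm{supp}(x)| \le k$,

$$
\langle A_i^*, y \rangle = x_i + \sum_{j \in \mathrm{supp}(x),\, j \ne i} x_j \langle A_i^*, A_j^* \rangle, \qquad \Big| \sum_{j \ne i} x_j \langle A_i^*, A_j^* \rangle \Big| \le \frac{k}{\sqrt{n}} < \frac{1}{2},
$$

where the last inequality needs $k$ somewhat below $\sqrt{n}$; thresholding at $1/2$ then recovers exactly the support of $x$.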
And then the next slide just shows that if you have an A that is only approximately close to A*, and you do the same thing, just with A instead of A*, you get the same conclusion. And how close do you need to be? You need to be 1 over square root of log N close -- this is exactly where we use this starting point. So we need the decoding to be not exactly correct, but correct on the support. And actually, we can relax this a little bit: if the support is not exactly correct but is correct with high probability, or constant probability, it is probably fine, but that's a little bit harder to calculate. So the point is that if the support is correct, then I can write this decoding, this threshold decoding, as X equals A transpose Y restricted to the support, and this is a closed form for the analysis, and it's also very fast -- you just do a matrix multiplication. And then, I can
do this calculation. So I'm going to calculate the expectation, and my decoding is just a closed form -- A transpose Y restricted to the support -- and I plug in all the definitions. I'm going to skip this; I guess it just shows how nasty this gets. I plug this definition of X into this, and you also need to plug in the definition of Y and everything, take the expectation, do a lot of calculation, and we get this. We cannot get a good form for G as a whole, but we can get a good form for each column of G. Each column of G is of this form: it's lambda squared times Ai, minus lambda times Ai*, plus some error term. Let's forget about the error; this lambda is something close to 1, so this is really correlated with the direction Ai minus Ai*, with slightly different constants, and this constant actually converges to 1.
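In symbols (my reconstruction from the talk; I omit an overall scaling, e.g., by the probability that coordinate $i$ is in the support):

$$
\mathbb{E}[g_i] \approx \lambda_i^2 A_i - \lambda_i A_i^* + \text{error} = \lambda_i (\lambda_i A_i - A_i^*) + \text{error}, \qquad \lambda_i = \langle A_i, A_i^* \rangle,
$$

and since $\lambda_i$ is close to 1 when $A_i$ is close to $A_i^*$, this direction essentially points along $A_i - A_i^*$, which is exactly the correlation that the framework needs.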
So it's a little bit hard to believe this, but I guess the reason, as you asked, is that if you plug in X equal to X*, then this is exactly Ai minus Ai*. So what we are saying is that when X is not X*, this doesn't cost too much error, due to cancellation. This cancellation is really important, because -- basically, we are saying that X minus X* is something that for an individual sample might be very large, but X minus X* points in different directions for different samples, so you have this cancellation phenomenon, and the error is not that large. And here we use the fact that A and A* are both close to isotropic. The important thing is that we also use that A is close to isotropic, which we cannot really guarantee, right? Because A is the estimate; A* is the ground truth. We can assume that A* is close to isotropic, but how do you maintain A to be isotropic? And okay,
so I will talk about that in two slides. And so then let's check this condition. I want to check that G is correlated with A minus A*, and we have this form, and we can show this by picture. So basically, lambda is the size of the projection of Ai* onto Ai -- this is lambda. And here I have lambda squared times Ai, so let me extract one factor of lambda from these two terms: if I multiply lambda with Ai, I get this vector, and minus Ai*, I get this vector. I guess I'm doing this a little bit quickly, but anyway, you can just show it by picture: because lambda is close to 1, this is really correlated with Ai minus Ai*. So for each column this is correlated, and then we can take a sum and show it for the whole matrix. So we check this is true, then we can apply the framework, and you get the theorem. So the only thing that remains is this caveat: how do you maintain each estimate to remain close to isotropic? Solution one is that each time, we can project to make the spectral norm smaller -- the spectral norm ball is convex, so that's a shortcut. And there are some other caveats that I am not going to
talk about, but this is the key idea. And I like solution two more, which is that I can use a different G-prime, built from Y minus AX: previously we used X here, and now I use the sign of X. Somehow, if you use the sign of X, it reduces the noise. And for this G-prime, first, I can show it still satisfies this condition -- it's still correlated with the direction Ai minus Ai* -- and also, if you use this update rule, you can just prove that the spectral norm of A always stays less than, say, three times that of A*. Because this update rule is simpler and more robust, you can prove something stronger.
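A one-line sketch of this variant direction (the function name is mine; per the talk, this is not the gradient of any energy function, but it still satisfies the correlation condition):

```python
import numpy as np

def g_prime(A, X, Y):
    # Variant update direction: replace X by sign(X) in the second factor.
    # Averaged over samples; still correlated with the direction A_i - A_i*.
    return (A @ X - Y) @ np.sign(X).T / Y.shape[1]
```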
>>: So does it make sense to take the sign of the gradient itself, putting the sign outside of the expectation? That's something very, very popular, as in resilient back-propagation.
>> Tengyu Ma: I see, the sign of the gradient. For this case, I think it doesn't. But actually, I think we can talk offline about this. This sign corresponds to -- we were inspired a little bit by back-propagation, but let's talk about it offline if you want. The point is that this G-prime is really not the gradient of any function, and that's the benefit of our framework: we don't really depend on G being close to the gradient of some true objective. Our definition, our framework, doesn't involve gradients or convexity at all.
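For concreteness, a hedged illustration of the distinction being drawn, with random placeholders for A, X, and Y and the same sign conventions as in the sketch above: the questioner's variant takes the entrywise sign of the whole gradient matrix, while the G-prime here keeps the residual and replaces only the code X by its sign.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 8))
    X = rng.normal(size=(8, 20))
    Y = rng.normal(size=(5, 20))

    # Rprop-style: entrywise sign of the full gradient matrix
    G_sign_of_grad = np.sign((A @ X - Y) @ X.T)

    # the talk's G': keep the residual, replace only the code X by its sign
    G_prime = (A @ X - Y) @ np.sign(X).T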
>>: So is this better than the original one?
>> Tengyu Ma: In practice, this is not better than the original one. It's almost the same, but for
analysis, it's slightly better. It's easier.
>>: It's not as accurate. The gradient is not as accurate.
>> Tengyu Ma: Yes, but I guess the whole point is that the gradient doesn't have to be accurate. You just need it to be more or less unbiased.
>>: But it may turn out to be greater than 90 degrees [indiscernible].
>> Tengyu Ma: Sorry?
>>: Because the gradient has to be within 90 degrees towards the green line. But if you do this,
you may not actually have that happen.
>> Tengyu Ma: You can prove that it still is within 90 degrees.
>>: Okay, okay.
>> Tengyu Ma: Okay, but these are just some caveats that we don't really need to care about here. So let me do a summary. We showed that, locally, minimizing this non-convex function is close to -- you can think of it as -- minimizing an unknown convex function with an inexact oracle, in some sense. And this is our condition: we have this general first-order update rule, and the condition that if GS is correlated with ZS minus Z*, then ZS converges to Z*. We can apply this to sparse coding, and we need some tricks for the decoding and such, so that we can calculate this expectation.
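A minimal sketch of this abstract condition, with a toy quadratic standing in for the unknown convex function; the step size, noise level, and all names are illustrative assumptions. Any direction g that stays sufficiently correlated with z minus z* drives geometric convergence, even though g is not the gradient of anything:

    import numpy as np

    rng = np.random.default_rng(2)
    d = 10
    z_star = rng.normal(size=d)        # the unknown optimum z*
    z = z_star + rng.normal(size=d)    # start somewhere in its neighborhood

    eta = 0.2
    for s in range(50):
        diff = z - z_star
        # any g works as long as <g, z - z*> stays a constant fraction of
        # ||z - z*||^2; here: the true direction plus bounded random noise
        g = diff + 0.3 * np.linalg.norm(diff) * rng.normal(size=d) / np.sqrt(d)
        z = z - eta * g                # the general first-order update rule

    print(np.linalg.norm(z - z_star)) # the error has shrunk geometrically in s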
Okay, so the initialization. I think -- let me leave some time; let's see. Okay, maybe let me show you just this slide. So we want to have an initialization, and what we do is just pick two samples, U and V, and we compute this matrix: the second moment Y Y transposed, but weighted by the inner products between U and Y and between V and Y -- a weighted second moment. And then we take the top eigenvector. That's it. The algorithm is very simple. If you don't have this weighting, then Y Y transposed is like the identity; it's not useful. So the point is that a sample has the form A* times a sparse vector: U is A* times alpha, V is A* times beta, and you can look at the supports of alpha and beta. If by some chance their supports intersect in exactly one place, T, then I can say that this M-U-V is really, up to noise, in this direction: AT* times AT* transposed. And this noise is small -- this [indiscernible] is 1 over log N, something like that. So if you happen to choose a U and V that satisfy this, then you find a good direction. And the chance that this happens is not too small -- it's roughly K squared over M, by a birthday-paradox argument -- so you usually have a reasonable chance of finding such a pair.
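A hedged sketch of this initialization step under the generative model y = A* x with random k-sparse codes; the dimensions, sample size, and the planted pair are illustrative (the actual algorithm just tries random pairs of samples):

    import numpy as np

    rng = np.random.default_rng(3)
    n, m, k, N = 32, 64, 3, 50_000
    A_star = rng.normal(size=(n, m)) / np.sqrt(n)    # ground-truth dictionary

    def sparse_code():
        # random k-sparse code with +/-1 nonzero entries
        x = np.zeros(m)
        S = rng.choice(m, size=k, replace=False)
        x[S] = rng.choice([-1.0, 1.0], size=k)
        return x, set(S)

    # plant a pair (u, v) whose supports share exactly one coordinate t;
    # trying random pairs succeeds with probability roughly k^2 / m
    while True:
        (alpha, Sa), (beta, Sb) = sparse_code(), sparse_code()
        if len(Sa & Sb) == 1:
            break
    u, v = A_star @ alpha, A_star @ beta
    t = (Sa & Sb).pop()

    # draw N samples y = A* x
    X = np.zeros((m, N))
    cols = np.argsort(rng.random((N, m)), axis=1)[:, :k]
    X[cols.T, np.arange(N)] = rng.choice([-1.0, 1.0], size=(N, k)).T
    Y = A_star @ X

    # weighted second moment M = average of <u,y> <v,y> y y^T
    w = (u @ Y) * (v @ Y)
    M = (Y * w) @ Y.T / N

    # the top eigenvector should be close to the shared column A*_t, up to sign
    vals, vecs = np.linalg.eigh(M)
    top = vecs[:, np.argmax(np.abs(vals))]
    print(abs(A_star[:, t] @ top) / np.linalg.norm(A_star[:, t]))   # near 1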
So let me skip the rest of this and jump to the discussion and open questions. So the first question is
whether this is generalizable to alternating updates, like EM algorithm, for example, and for
other hidden variable models. Especially EM: the [indiscernible] paper, they study EM, and at least for the problem that they are studying, our framework still applies, because our condition is weaker. But there are two issues that prevent this from being more powerful. The first one is when the decoding doesn't have a simple closed form. This is a technical issue: if you don't have a closed-form decoding, then you don't know how to calculate the direction of the update -- just technically, you don't know how to calculate it. The second one, I think, is more important, and it relates to [Shel's] question: what happens if this unknown convex function, this E*, is not strongly convex and smooth? I think I even read some papers by Nesterov which show that in this case, if you have an inexact oracle for the gradient, then the error accumulates. But this is not necessarily the end, because you can change the algorithm, or you can find some other way to avoid it; they probably only studied the naive approach -- I didn't read the paper very carefully. But anyway, the limitation of our approach is that the function must be strongly convex and smooth, and we don't
know how to go beyond that. And also, another limitation is that this is limited to local
convergence -- local in the sense that it doesn't need to be a tiny neighborhood, but our framework can only apply in the regime where the geometric decay of the error really happens. So, for example, for this sparse coding problem, if you randomly initialize this matrix A, with a small trick, then you know that it converges globally. But at the beginning, the reason for the convergence is, I think, essentially different, so it's not really a technical gap; it's just an essentially different condition, and we don't really understand it. I think it is related to some
dynamical-systems problem. And also, for global convergence, you may need ad hoc analysis somehow, because for some problems, global convergence is simply not true -- for the Gaussian mixture model, it's not true -- but for some models, it is true, so maybe we need to really exploit the structure of the problem. In general, our framework is not very sensitive to the structure of the problem: as long as the unknown convex function is strongly convex, it works. And finally, we would like a practical algorithm that goes beyond sparsity K of about square root of N. There is strong theoretical work -- the sum-of-squares relaxation algorithm -- which can go beyond this, but the running time is N to the 20, which is really not very practical, so this is still open. And finally, I guess because [Doug] is here, I
added this slide: can we go beyond sparse coding? Actually, this is the main motivation for studying sparse coding -- we want to go beyond it once we understand it. One lesson is that the simple decoding rule is enough for learning: you don't really need to solve lasso or some other minimization; you can just use a simple, heuristic decoding, and it's enough for learning. And this simple decoding rule is really nice, because it matches the feed-forward calculation in deep learning. For example, this is the observable, the image, and if you do the feed-forward calculation, it's something like the threshold of A transpose Y -- some nonlinear function applied to A transpose Y, and these nonlinearities are all similar in some sense. So this decoding rule is really a feed-forward calculation. And the generative model in this case, in dictionary learning, is a linear generative model, which is not going to be the case in deep learning, because in deep learning, the point is that you want to stack this in multiple layers. If you have this kind of model and you add another layer, and you generate and generate, linear plus linear is still linear. So I think a reasonable generative model for deep learning is Y equal to some nonlinear function r of A*X*, so that you can stack the layers. And we hope that if our data is r of A*X* for some nonlinear function r, we can still learn it, but currently, we don't know how to do it.
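A hedged sketch of the contrast being drawn; the nonlinearity r, the threshold, and all sizes are illustrative assumptions, and there is no claim here that this decoding provably recovers x* in the nonlinear case -- that is exactly the open question:

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, k = 32, 64, 3
    A_star = rng.normal(size=(n, m)) / np.sqrt(n)

    def r(z):
        # an entrywise nonlinearity; without it, stacked layers collapse,
        # since A2 @ (A1 @ x) = (A2 @ A1) @ x is linear again
        return np.maximum(z, 0.0)

    x_star = np.zeros(m)
    x_star[rng.choice(m, size=k, replace=False)] = 1.0
    y = r(A_star @ x_star)            # one layer of the model y = r(A* x*)

    # feed-forward decoding, as in the linear case: threshold(A^T y)
    z = A_star.T @ y
    x_hat = z * (np.abs(z) > 0.5)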
But this is, I think, a nice open question: if you can learn this, then we can hope to have some provable algorithm for unsupervised learning --
>>: Did you read the paper that I wrote on deep stacking networks, where you actually can
formulate each step to be convex, but I don't know how to prove it? Eventually, I'll show you.
>> Tengyu Ma: Oh, okay, I would like --
>>: [Indiscernible] combined in such a way that you don't have to [indiscernible].
>> Tengyu Ma: Oh, I see.
>>: I will show you.
>> Tengyu Ma: Yes, I would love to talk about it. And finally, we hope that our framework could be used to analyze back propagation in some sense. We have some preliminary ideas -- I think we have a proof for two layers under some model, but it's not very strong yet; we want to make it stronger. So I guess I'm going to end here.
>>: But you have to change back prop in such a way that -- you're not going to just do an analysis of the original back prop.
>> Tengyu Ma: Yes. We might need to change that a little bit.
>>: Modify it a little bit, and then you can see whether you need to apply that -- a revised version might handle it better, and I think that will give some valid [indiscernible].
>> Tengyu Ma: Yes, yes, yes. Yes, okay, I'll stop here. Thank you.
>>: I have a question. So right now, the analysis is based on the data being generated according to a ground-truth model that's linear. What would be the major technical barrier if you consider the case of model mismatch? Like, the actual data are [indiscernible] by some distribution, but you don't know it -- just some regularity conditions -- and then you still use sparse coding to do the data representation and modeling. In this case, what would be the major technical barrier to generalizing the current analysis? Which major part of your analysis relies on the model and the ground truth being consistent with each other?
>>: That's a totally different problem.
>> Tengyu Ma: Yes, so I think all of our techniques depend on --
>>: In this sparse case, how to [indiscernible] -- even the criterion is different, so whether the criterion for that is good or bad.
>>: Related to the comments, basically: your model is that you rely on a generative model to generate the data. That has something to do with your condition, that your gradient correlates with the direction to Z*. Okay, you can say that is one way to handle it when you cannot have the true gradient, but it's also a disadvantage that you cannot go beyond generative models. Alternatively, a classic approach is that you show the correlation between your direction and the true gradient -- then you do not need to know where the true optimal point is. All you need is the true gradient.
>> Tengyu Ma: Yes.
>>: If you do that, then you do not need the notion of a true model there, and a similar message would, of course, still be there: this thing will just converge to the local optimum.
>>: This is the best model -- this is the best parameter for my current model to fit this data. This is actually my --
>> Tengyu Ma: Yes, but I would guess that's a much harder problem, because you don't even know what the minimizer is, in your case, right? The minimizer is something out there, and you don't know how to relate it to anything, like the data. So technically --
>>: There, you can just separate it into two, right? One is that you show you always have convergence.
The second one, the only thing you need to show -- I guess that shouldn't be the same, because for the second one, you already have to be there. Anyway, locally, you are strongly convex, so it looks like you have some point with strength around it. So the advantage of that formulation is that you need to assume nothing about your true optimal point.
>>: That's if the [indiscernible] is misspecified -- I don't know what it would learn.
>>: It would be better, right? You have to regularize on it anyway.
>>: But actual machine learning problems will always have model mismatch, because the point of having machine learning is that we actually don't know the true model -- that's why we use deep learning models or this and that to fit. So, basically, we are assuming that, at the optimal point, our model is somehow closely related to the actual data, and then, in that case, of course, you have, for example, slight model mismatch or something like that.
>>: So actually, even [indiscernible], there are two different kinds of [indiscernible] min-max risk, just like here, and then for [indiscernible] about whether your model is misspecified.
>>: So to follow up on that discussion, can you show me in your slides where the analysis requires the generative model?
>> Tengyu Ma: Oh, it requires the generative model everywhere. It's almost everywhere.
>>: So when you say that you actually do have the generative, the correct -- so yeah, maybe just for --
>> Tengyu Ma: Yes, I think we require the generative model. So, for example, what is Z*? What is Z*? Z* is Ai*. This is defined by the --
>>: You need to have the true gradient. You don't need to have that generative model, so [indiscernible] comment on that. It's this analysis [indiscernible].
>>: What I'm saying is that -- what I said is that I think these should be roughly equivalent conditions, but with different assumptions. Instead of requiring the true optimal point, you can put the requirement on the [indiscernible].
>> Tengyu Ma: Yes, it's possible.
>>: So you don't have to assume that there's a generative model. You just assume that there is a unique global optimal solution for it.
>> Tengyu Ma: Yes, I completely agree. That would be better, yes.
>>: That's my question. It's a difficult problem.
>> Tengyu Ma: But I think this is definitely interesting from a theoretical point of view. Anyway, we were not bothered by that for a little --
>>: So that's the point. If you assume you have a local optimum -- okay, locally, sometimes you have a global one somewhere there -- then the proof goes through, and if you want to say [indiscernible], it is to say that your search direction always differs from the true gradient by something bounded. Then you will automatically get that kind of answer, so that's kind of a different assumption.
>>: And can you also show me where you use the assumption of the ground-truth model where X is -1, 0, 1?
>> Tengyu Ma: Oh, we only use it in the decoding.
>>: All right, which one? Which part?
>> Tengyu Ma: Here. So we want to prove this, and this assumes that Xi is in -1, 0, 1 -- or rather, we assume that Xi is either larger than 1, or smaller than -1, or 0. That's the only thing we need.
>>: I see. Okay, but you said they can be generalized without that kind of limit.
>> Tengyu Ma: So anyway, we need something like this. We need a condition like: Xi is either 0 or, let's say, between A and B for some constants, or between minus B and minus A. That will be enough.
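For concreteness, a small sketch of this condition; the constants, the stand-in for A transpose Y, and the threshold at a/2 are all illustrative assumptions. Entries of x are either 0 or bounded away from 0, so an entrywise threshold separates them as long as the decoding error is small:

    import numpy as np

    a = 1.0                                        # nonzeros satisfy a <= |Xi| <= b
    x = np.array([0.0, 1.4, 0.0, -2.5, 0.0])       # entries are 0 or in +/-[a, b]
    z = x + 0.1 * np.array([1, -1, 1, 1, -1])      # stand-in for A^T y = x + error
    x_hat = z * (np.abs(z) > a / 2)                # threshold at a / 2
    print(np.nonzero(x_hat)[0], np.nonzero(x)[0])  # recovered support matches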
>>: Oh, okay, so you can use that condition to generalize it a little bit.
>>: So, do you have time to --
>>: Yes -- I don't have time.
>>: Okay, so I will solve the problem, that will [indiscernible]. Okay, thanks very much.
>>: [Indiscernible].