>> Dengyong Zhou: I'm so pleased to have Tengyu here. Tengyu is currently a PhD student at Princeton University in the Computer Science Department. His adviser is Sanjeev Arora, and that's a very well-known name for many people here. Before that, he got a degree from Tsinghua University, also in computer science, and Tengyu will stay here for two days, today and tomorrow, so if you want to meet him, please let me know. >> Tengyu Ma: Okay, thanks for having me here. I'm going to talk about analyzing non-convex optimization for sparse coding, and this is joint work with my adviser, Sanjeev Arora, with Rong Ge from Microsoft Research New England, and with Ankur Moitra from MIT. So sparse coding is often also called dictionary learning -- sometimes I will say dictionary learning, just because that's the terminology I usually use. The key idea is that in many settings, data has a sparse representation in an appropriately chosen basis. So Y is our data, and A is an appropriately chosen basis, or matrix -- I use Ai to denote the ith column of A -- and if you write the data in this basis, you get a sparse vector X. So X is the representation, and by sparse, I mean that there are at most, say, K nonzero entries in X. And just to emphasize, this dictionary is usually unknown. If the dictionary is known, then this is the sparse recovery or sparse linear regression problem, but here we are interested in the setting where the dictionary is unknown. It's an N by M matrix, and usually we are interested in the case where the dictionary is overcomplete, where M is larger than N, but maybe not too much larger. When M is larger than N, the columns of A are not independent, so it's a redundant or overcomplete basis. Because the dictionary is unknown and the representation is also unknown, this is an unsupervised learning problem, and you have no hope of finding A and X if you are just given one Y, right? So we are given multiple samples Y, i.i.d. samples from the same distribution. I'm going to use superscripts to index the samples. Then my data is a matrix, and I'm going to decompose this matrix approximately into the product of two matrices. One is the dictionary, and the other is the representations, so each column of this representation matrix has at most K nonzeros, and sometimes I use capital X for this X matrix and capital Y for this data matrix. Just notation. This problem originated from Olshausen and Field, the seminal work of these two computational neuroscientists in 1997, and they were asking why the human visual system can recognize objects very quickly and very robustly. So they tried to understand the human visual system, and the idea is in this picture. On the left-hand side is the first layer of neurons of our visual system. These neurons get activated by the incoming light, so you see the picture, and the neurons get activated. This is just the raw data. And then the first layer -- the synapses, usually called the primary visual cortex, V1 -- after processing by this visual cortex, you get a sparse representation.
In the second layer, only a small number of neurons get activated, and neuroscientists believe that the number of neurons in the second layer is much larger than the number in the first layer, so this is an overcomplete matrix. Basically, neuroscientists believe this is a very reasonable model for the first layer of the visual system. One reason I mention this is that it is the origin of the problem, and the other is that these two neuroscientists also proposed a very good model and a very good algorithm, which I am going to talk about. Computer scientists saw this very nice explanation and tried to use it for image problems, because we believe that if the eye can do it with neurons -- and a neuron is a very weak computational unit compared to a computer -- then we should be able to do it too. If you can do it with neurons, you can do it on a computer. So here is exactly the same picture: we take images, or image patches, and we want to represent our image patches in this basis with a sparse representation. And if you learn such a dictionary and represent your image patches sparsely, then, you can imagine, if this is a nice representation, you can do supervised learning using it as a set of features, or you can do reconstruction, because this is just an approximation. If you multiply the dictionary with the representation you get, you get a slightly different image, and this is a reconstruction. Presumably, this could be better -- you could de-noise the original image. In super-resolution, I think people use this to restrict the space of possible images: because you know your image has this structure, you know that not all images are possible. Also, a very important point is that you can probably use these representations to build higher-level representations, as in deep learning -- and in the visual system, this is just the second layer of neurons, and there are many other layers behind it. So these are the applications of this problem. Deep learning is also one of the motivations: we want to understand deep learning, and this seems to be the simplest version of deep learning -- it's not even a version of deep learning, just a building block of it in some sense -- so it's a good starting point for theoretical understanding. Our aim in this work is to theoretically understand this type of problem and the heuristics that are used for it. We want algorithms that are not only efficient but also provable, basically algorithms with provable guarantees for this problem. So this is my outline: I'm going to first introduce some existing heuristics that are used in practice, and then a model that allows us to understand this problem from a theoretical point of view. So recall we want to decompose this data matrix into a product of an overcomplete matrix and a sparse matrix. Olshausen and Field proposed this energy function, which they want to minimize, and it consists of two terms. One is the reconstruction error -- when you fit the model, what is the error -- and the other is the sparsity penalty, which enforces that this X matrix is sparse; you can, for example, use L0 sparsity or L1.
The exact form of this penalty is not relevant to this talk, but you can think of it as, say, the L1 penalty. So Olshausen and Field proposed minimizing this energy function over A and X, because both A and X are unknown. And these two neuroscientists proposed this alternating update algorithm, which is very simple. You repeat until convergence, alternating between two steps. The first step: you fix A and update X. There are different choices for this step, but what they proposed is to just take the minimizer, because when you fix A, this energy function is the sum of the reconstruction error of each sample plus the penalty -- you can decompose the energy function across samples, so you can treat each sample separately, and you just take the minimizer. For example, if the penalty is L1, then this is convex. And then the learning step: you fix X and update A, and the way they proposed to update A is to just take a gradient step. You take the gradient of this energy function with respect to A -- now X is fixed, so you only take the partial gradient with respect to A -- you choose a learning rate eta, and you update A to be A minus eta times the gradient. Actually, there are many other algorithms for this problem, and basically the efficient algorithms are all of this form, all following the same framework: you have this decoding step and this learning step. The only difference is how you decode and how you learn, how you update A, so let me do a little summary of the existing algorithms. I think there are three interesting algorithms in the literature. One is this Olshausen and Field algorithm, which I refer to as the neural algorithm; another is MOD, the method of optimal directions; and the third is k-SVD. Recall that we want to minimize this energy function. In the decoding step, OF just takes the minimizer, while k-SVD and MOD use these pursuit algorithms. Recall that in the decoding step we fix A and minimize over X, and this is essentially just the sparse recovery problem, so you can use different tools. Now L1 minimization is the dominant one, I guess, but before that there were even simpler algorithms, like pursuit algorithms, where you want to find the sparse vector X and you choose the nonzeros in a very greedy way. First you take the entry which explains your data the most, and then you choose the next, and here you choose exactly K nonzero entries. >>: Just the lasso. >> Tengyu Ma: No, no -- lasso is where you do the L1 minimization, so this is even simpler than lasso. It's an even greedier version. >>: It's L0. >> Tengyu Ma: Yes, you really enforce the L0 constraint. >>: So lasso can be analyzed. People have known -- >> Tengyu Ma: Yes, yes. >>: So the analysis you will be talking about is mainly of the other step. >> Tengyu Ma: Because here we don't really know A -- in lasso, we assume that you are given A, but here, A is not known. >>: So that part is hard. >> Tengyu Ma: The trouble is exactly that you don't know A. So you have to approximate A and X alternately so that you get close to the truth.
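Just to make the loop concrete, here is a minimal sketch in Python/NumPy. This is an illustration under assumptions, not the exact algorithm from the slides: the decoding here is the simple thresholding of A-transpose-Y that comes up later in the talk (standing in for the L1/pursuit decodings), the step size and threshold are arbitrary illustrative values, and MOD's closed-form learning step (discussed next) is included for contrast.

```python
import numpy as np

def threshold_decode(A, Y, tau=0.5):
    # Decoding step: fix A, estimate X by hard-thresholding A^T Y.
    # (A stand-in for the L1 / pursuit decodings mentioned above.)
    Z = A.T @ Y
    return Z * (np.abs(Z) > tau)

def of_alternating_updates(Y, A0, eta=0.1, iters=50):
    # Olshausen-Field-style loop: decode X, then take one gradient step
    # on A for the reconstruction error (1/2) * ||Y - A X||_F^2.
    A = A0.copy()
    for _ in range(iters):
        X = threshold_decode(A, Y)      # decoding (X-update) step
        grad_A = (A @ X - Y) @ X.T      # partial gradient w.r.t. A
        A = A - eta * grad_A            # learning (A-update) step
    return A

def mod_update(Y, X):
    # MOD's learning step, by contrast, jumps to the exact minimizer:
    # argmin_A ||Y - A X||_F^2 = Y X^T (X X^T)^{-1} (least squares;
    # pinv guards against a singular X X^T).
    return Y @ X.T @ np.linalg.pinv(X @ X.T)
```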
And in the learning step, the OF algorithm just uses gradient descent, and MOD takes the minimizer, because when you fix X and minimize over A, this is just a quadratic problem -- the penalty term only depends on X, so when you fix X, it is a constant. And k-SVD is a little trickier. The idea is that you fix everything except the ith column of A and the ith row of X, and you just update those two. If you fix everything else, then this is just a rank-one approximation, and although you still have two sets of parameters, thanks to this rank-one structure, even though it is not convex, you can still do it, because you can use SVD. But I'm not going to talk about this algorithm in detail. >>: So none of them can be analyzed. >> Tengyu Ma: It was not known -- we didn't know how to analyze them, and now we can analyze the ones with the check marks. Basically, we can analyze this plus this, or this plus this, and in this talk, I'm only going to talk about this plus this. >>: MOD here -- so OF is one solution for step two. So when you use the minimizer, is that anything special [indiscernible]? >> Tengyu Ma: No, no -- that's a great question. The minimizer: here, we are fixing X, so this is simpler, a quadratic problem. So indeed, if you take a lot of -- >>: This is dictionary learning, right? Given X, given [indiscernible]. >> Tengyu Ma: Yes, exactly. But sometimes you don't really want to take such a big step, so that's why -- >>: Yes, hoping to get global if possible. >> Tengyu Ma: Yes, we are hoping to get the global optimum, but it turns out that with MOD, at least from a theoretical point of view, this step is too big. We need to take a somewhat more conservative step, and actually, at least from the experiments I ran -- just very simple experiments -- MOD is not as good as k-SVD and OF in some respects. >>: So OF, do they actually do early stopping so they don't go too far? They need to go back to the matrix [indiscernible]? >> Tengyu Ma: I don't think so. I'm not sure about the experimental details. >>: Really, they just do one gradient step. Yes, just one step. >> Tengyu Ma: Yes, exactly. That's a good point. You just take one step, and then you update this. >>: MOD just fully minimizes. >> Tengyu Ma: These are alternating algorithms. You alternate between this step and this step many times. >>: So that's a super, super, super early stopping, right? Just one step. >> Tengyu Ma: So I think if you -- >>: If we use MOD, this step would be too much. >>: Too much. Right, okay. >> Tengyu Ma: I think, compared to deep learning, this step is more or less the step where we calculate the backpropagation, and this step is more or less the learning step, the gradient step, but maybe this is not a good comparison. So we are going to talk about this plus this. The main contribution of this work is that we can analyze this algorithm from a reasonable starting point. Before talking about the theory, let me talk about the generative model, which was also proposed by Olshausen and Field, and which makes a lot of sense, because we need to assume something about our data; otherwise it's too hard.
Minimizing a non-convex function in general is NP-hard, and there's really no hope. So basically, we assume that the data is generated from some ground-truth dictionary A*, and it is overcomplete. We make the assumption that the columns of A* are close to orthogonal -- no two columns are very parallel to each other -- and they are kind of isotropic, which means the spectral norm is not too large. And then we assume that this X*, let's say, takes values in -1, 0, 1. We could relax this assumption a little, probably a lot, but for simplicity, I'm assuming it takes just these three values, and each entry of X* has expectation zero. We also assume that Xi* and Xj* are only weakly correlated, kind of independent, and X* is a K-sparse vector: it has at most K nonzero entries. The generative model is that we are given, say, P samples, each of this form: A* times X*, plus noise, and the noise is not too large. I won't talk much about how we deal with the noise, because even without noise this is a hard problem, but we can tolerate a certain amount of noise. Okay, I omit the superscripts here; technically, Y with superscript j is equal to A* times X* with superscript j, plus noise, for the jth sample. >>: But why do they limit X to be -1, 0, 1? That may not be realistic. >> Tengyu Ma: Yes, that's not realistic, sure. >>: Only under [indiscernible] analysis? >> Tengyu Ma: Oh, no, we can relax this. Basically, what we need is: if this is zero, then it is zero, right? And if it's nonzero, then it should be bounded away from zero -- it shouldn't be too small. So this is a sparse vector with a kind of gap condition: every nonzero entry is larger than a constant C, like a half, or less than minus a half. >>: So I know that. So the argument is that it doesn't really matter what X you have; as long as A is more general, you can generate any kind of Y. >> Tengyu Ma: Yes. >>: But on the other hand, in a practical problem, if X is not limited by [indiscernible], then the result may not be meaningful. >> Tengyu Ma: We can analyze the case where it is bounded away from zero. >>: So arbitrary X. >> Tengyu Ma: Yes, as long as it is bounded away from zero when it is nonzero. Technically, that's the assumption, but for simplicity, I am going to assume the -1, 0, 1 version for this talk. So technically, our assumption is that X is a sparse vector, and if Xi is nonzero, then it must be either larger than, say, a half, or smaller than minus a half. >>: [Indiscernible] lower bound or it's bounded away. >> Tengyu Ma: This should be a constant. >>: As long as there exists a constant. >> Tengyu Ma: Yes, there exists a constant. >>: I see. >>: So X is between -1 and 1, or is it -- >> Tengyu Ma: There's no upper bound. It's just a lower bound -- you cannot be too small. Of course, there are some scaling issues, so let's say the entries of X are of constant magnitude. >>: What is the amount of correlation that you can allow? >> Tengyu Ma: Basically, we can allow pretty good correlation. If X were a uniform sparse vector, the correlation between Xi and Xj would be something like K over M, squared. We can allow a constant times that, say, a constant times K over M, squared. >>: [Indiscernible]. >> Tengyu Ma: Oh, so incoherence. Incoherent. I'm going to talk about the details when we use it.
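To make the generative model concrete, here is a minimal sampler sketch. The dimensions, sparsity, and noise level are illustrative, and a random Gaussian dictionary is used only because it is incoherent and nearly isotropic with high probability, matching the assumptions above.

```python
import numpy as np

def sample_sparse_coding_data(n=64, m=128, k=5, p=1000, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Ground-truth overcomplete dictionary A* (n x m) with unit-norm columns.
    A_star = rng.standard_normal((n, m))
    A_star /= np.linalg.norm(A_star, axis=0)
    # Each column of X* is k-sparse with nonzeros in {-1, +1} (zero mean).
    X_star = np.zeros((m, p))
    for j in range(p):
        supp = rng.choice(m, size=k, replace=False)
        X_star[supp, j] = rng.choice([-1.0, 1.0], size=k)
    # Samples: y^(j) = A* x*^(j) + noise.
    Y = A_star @ X_star + sigma * rng.standard_normal((n, p))
    return Y, A_star, X_star
```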
>>: So even if there are two coordinates that always come up together, that will mess up the algorithm. >> Tengyu Ma: That will mess up the algorithm. If Xi is always equal to Xj, then that's really trouble. >>: Then that's really trouble. The algorithm is not going to work, or the analysis? >> Tengyu Ma: For sure, the analysis is not going to work. For the algorithm, I think you can certainly find some corner case where it doesn't work, but in practice, if just a small number of pairs have this strong correlation, it should be fine -- but we didn't analyze it. Okay. So this model makes sense, because the goal is to recover A*. >>: So for the P samples, you draw P different X*s? >> Tengyu Ma: Yes, different X*s. >>: It's not that only the noise is -- >> Tengyu Ma: The noise is also different, but A* is the same. >>: The alternative would be the same X*, and we just generate P samples from the noise. >> Tengyu Ma: Oh, no, no. I think I omitted the superscripts here and here; you need a superscript j here and here. So this model makes sense, because if you think about the energy function without the penalty term -- the penalty term is just there to enforce sparsity -- this is the log likelihood of Y given A times X, so when you minimize the energy function, you are trying to find the maximum likelihood solution. And indeed, you can show that when P is large enough, the ground truth -- A* and the representations -- is a minimizer of this energy function. So that's why. >>: That's not entirely true, right? >> Tengyu Ma: It's not entirely true, because -- sorry? >>: Because not all X -- if I put in an X which is not K-sparse, it has zero likelihood. >> Tengyu Ma: Yes, so -- >>: So it's only a restricted log-likelihood. >> Tengyu Ma: Yes, exactly. I guess this sentence refers to the version with the sparsity penalty. Without the sparsity penalty, this is the log likelihood; with the sparsity penalty added, it is the minimizer in some sense. And you also need to choose the penalty in the correct way. This is just the intuition; we are not going to use it. >>: So is the noise Gaussian distributed? >> Tengyu Ma: Yes, we are assuming this is just white Gaussian noise. I'm trying not to talk too much about the noise, because even without the noise, if the noise is zero, this is not an easy question. >>: [Indiscernible]. >> Tengyu Ma: Okay, so this is the model, and with this model, we can talk about some theory. So, previous works. We've found that in practice, for this non-convex optimization, we have this alternating updates algorithm. It's really good: very efficient, very simple, and also generic, so you don't really need to worry much about the parameters -- you just do gradient descent or something -- and it's very successful, up to K less than the square root of N. K is the sparsity, let me remind you, and N is the dimension of the vectors. From a theoretical point of view, though, we don't know whether it can get stuck at local optima, and we don't even know whether there are local optima. We are not really sure, actually -- sometimes it's not easy to certify that there is a local optimum. >>: So this doesn't depend on the learning rate, eta? >> Tengyu Ma: It depends on the learning rate. >>: That has to be appropriately chosen. >> Tengyu Ma: Yes, to run the algorithm, to make the algorithm work.
Right, you need to choose that. But usually, if you choose it to be very small: the smaller, the more robust, but the slower. And we don't even know -- I think only about a year ago did we start to know that if we start with a matrix that is 1/K-close to A*, column-wise -- a very, very close starting point, where each column is already 1/K-close to the ground truth -- then the algorithm is guaranteed to converge, in L2 norm, in Euclidean distance. I'm going to use Euclidean distance throughout the talk. So basically, we only had a very, very small basin of attraction, theoretically, but empirically, the basin of attraction is certainly much, much larger, I think. And there are several existing theoretical works for this problem; I've listed them. I think it started from Spielman, Wang and Wright in 2012 -- the COLT best paper -- and they used an LP-based algorithm, a convex optimization approach for this problem. They allow the sparsity K to be as large as the square root of N, which is good, but they cannot allow an overcomplete matrix: in their case, the matrix A is a square matrix, and the number of samples is something like N squared -- you need that many samples. Then there is the work of Arora, Ge and Moitra, who had a combinatorial algorithm for this problem, and also Agarwal, Anandkumar and other authors. They can prove that if K is slightly less than the square root of N, then you need M squared times 1 over epsilon samples, where epsilon is the accuracy -- so if you want epsilon accuracy in some measure, you need 1/epsilon times M squared samples -- and they can tolerate an overcomplete dictionary. And the third one is more theoretical. This is Barak, Kelner and Steurer, and they use Lasserre's sum-of-squares relaxation -- if you're not familiar with this terminology, it doesn't really matter; it's a very high-order convex relaxation, probably the most powerful relaxation that we know of for now -- and they can tolerate even larger sparsity, up to almost linear. For theory, this is really good, really powerful, but for practice, it's terrible, because the number of samples is exponential in 1 over epsilon, so if you want, say, 0.2 accuracy, which I guess is very bad for practice, then you need something like N to the 10. >>: You don't show any result on the rate of convergence, how fast the algorithm -- >> Tengyu Ma: These are not iterative algorithms. This is convex relaxation: you write down the convex program and you solve it. And this one is combinatorial, and the rate is this -- well, this is not really a rate; this is the number of samples. So I guess the main point here is that for these theoretical approaches, the number of samples is too large, and they really don't do as well as the non-convex heuristics do in practice. >>: Are these guarantees deterministic, or with high probability? >> Tengyu Ma: With high probability, it returns the ground truth. Yes. >>: So the problem is non-convex to begin with, right? >> Tengyu Ma: Yes. >>: So when you say they formulate the problem to be convex and analyze it, do they address how much error you get by approximating the problem? >> Tengyu Ma: Yes, they need to address that. >>: Not here, right?
>> Tengyu Ma: Not for today, but in those works, they need to address how much you lose when you do the convex relaxation. >>: I see. >>: What do you mean exactly by overcomplete? You mean M is like -- >> Tengyu Ma: M is larger than N -- a constant times N, not much larger. I think this work could probably tolerate M up to N to the 1.1, for example, something like that, at a guess. But anyway, it's not much larger, and I guess in practice, we don't really care about the case where M is much larger than N, probably. Okay. So I guess the key point here is that the non-convex approach is really the right way. We have been working really hard on this problem from a theoretical point of view using convex relaxation, but we never get the same result as the simple non-convex heuristics. So the natural question for us is whether we can analyze those non-convex optimization approaches, since it seems that this is really the right way to do it. Okay, so basically this motivates our work, and we tried to answer this question, and this is our main theorem. We have some conditions: K is less than the square root of N, M is a constant times N, and some assumptions as described in the generative model slides, and we need this -- this is crucial -- A0 is column-wise 1/log N-close to A*. By this, I mean that if you compute the column-wise distance between A0 and A*, it is at most 1/log N for each column. By the way, the normalization is that each column of A* has unit norm, so that the Frobenius norm squared is something like M. And what we can show is that from this reasonably okay starting point, the OF algorithm -- this non-convex approach -- with a simple decoding rule, which I'm going to describe later, returns a guess A^s at each iteration such that this holds. Basically, after s iterations, the error has decreased geometrically in s, so it's a geometric decay of the error. And we have some bias term here, which can be removed, but for simplicity I'm going to keep it, because we are not going to talk about the -- >>: In some respect, the results will have better [indiscernible]. >> Tengyu Ma: Yes, if you choose -- yes. The learning rate needs to be properly chosen, and if you choose the learning rate -- >>: [Indiscernible]. >> Tengyu Ma: Oh, I have already optimized over the learning rate. >>: Oh, I see. It's [indiscernible]. >> Tengyu Ma: I have already chosen the best learning rate, so that this -- >>: Is this also high probability? >> Tengyu Ma: Yes, this is also with high probability. Everything we are talking about holds with high probability, but I am cheating on that. Okay, and this underlined condition is really hurting us, because this is -- >>: How about X -- the initialization of X? >> Tengyu Ma: We use this decoding rule to initialize X: if we have A0, then we use this decoding rule to get X. >>: So in that respect, does the permutation of the columns matter? Like, you said column 1 -- >> Tengyu Ma: Yes. >>: Is that [indiscernible]. >> Tengyu Ma: Yes, I have accounted for the permutation. It's up to permutation and sign flip, actually: if you multiply each column by -1, it doesn't really change the problem. So it's up to permutation and sign flip that this is column-wise close. Okay.
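Written out, the guarantee just described reads roughly as follows. This is a reconstruction from the verbal statement, with delta standing for the per-iteration contraction factor:

```latex
% Assumptions (as stated above): k <= sqrt(n), m = O(n), the generative
% model holds, and A^0 is column-wise 1/log(n)-close to A^*.
% Then for every column i and every iteration s,
\[
  \| A^{s}_i - A^{*}_i \|^2 \;\le\; (1 - \delta)^{s}\, \| A^{0}_i - A^{*}_i \|^2 \;+\; \mathrm{bias},
  \qquad 0 < \delta < 1,
\]
% where the bias term can be removed with additional work, as noted above.
```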
And to complement this, we show a spectral-method-based algorithm which returns an A0 that is such a good initialization, column-wise 1/log N-close to A*. But the alternating part is the main part of this talk; I'm going to talk a little about the initialization later. Just some notes. Because this is geometric decay of the error, if you want to achieve epsilon error, you only need log(1/epsilon) iterations, and the sample complexity is N times K. The previous one was M squared, so we improve by at least a factor of M over K, and also, the previous sample complexity depended on 1/epsilon, while here it depends on log(1/epsilon), so that's the theoretical improvement. And the runtime is M times N times P per iteration, which is really very good, because even to evaluate this energy function, you need to multiply an N by M matrix with an M by P matrix, and that multiplication alone takes N times M times P, if you don't use fast matrix multiplication. >>: That's per iteration, right? >> Tengyu Ma: Per iteration, but the number of iterations is logarithmic, so I use this tilde to hide the log factors. The initialization also takes the same number of samples, and its runtime is a little larger, but we think that's in some sense necessary, because the initialization gives you a very good approximation, like 1/log N, so presumably it should take more time. Also, all the numbers here could be improved if you assume a little more independence on the X, or if you change some assumptions somewhere, so this is not really optimal, just a demonstration. Okay, so now let's talk about how to analyze this alternating minimization. My plan is: I'm going to give a new perspective on this kind of non-convex optimization problem, then describe a general condition for convergence, and then apply this general framework to sparse coding and show you how to get the result. >>: Can you go back to the [indiscernible]? >> Tengyu Ma: Okay, sure. >>: The generative model that you talked about -- so what is hidden? X is the hidden variable? >> Tengyu Ma: Okay, X is the hidden variable. Technically, I should write here that the sample is A* times X*j, plus noise. >>: Yes, so the hidden variable -- >> Tengyu Ma: The hidden variable is X*. This is the parameter. >>: Oh, so the generative model only applies to one step, only for the learning part. >> Tengyu Ma: No. This is the generative model for the data. This is not about the algorithm. >>: Not about the data -- to me, both A and X -- >> Tengyu Ma: Are not known. >>: So it doesn't make sense to talk about one being a parameter and one being the data. >> Tengyu Ma: Oh, I see. >>: In a [indiscernible] model, you normally assume that Y is observable. The hidden variable really is unobservable -- you cannot observe the hidden variable. >> Tengyu Ma: Yes, both A and X are not known. So why did I call this the parameter? Because A* is shared across all the samples, so if I write another -- >>: Oh, I see. >> Tengyu Ma: And X* is the independent stochastic part. You draw an independent X for each j. >>: Okay, so that is [indiscernible]. >> Tengyu Ma: Okay? Okay, so the new perspective. What is the old perspective? At least, this was my understanding, say, 10 months ago.
We tried to analyze this non-convex function E(A, X), and if you draw this picture -- this is the direction of X, this is the direction of A -- what we are doing, because we are doing alternating updates, is that each time we update either X or A. This is the update. And this is non-convex, and our understanding was just that somehow, because of this special update rule, this alternating rule, we can avoid local optima. I guess that was the best hope. It doesn't make much sense, I agree, but we didn't know much better than this. Maybe there is even no local optimum -- that's also possible. >>: But no local optimum -- can't you come up with a small example in a small dimension where you would have a local minimum? [Indiscernible] or search over -- >> Tengyu Ma: For small dimensions? >>: Yes, just to show that there exist examples with local minima. >> Tengyu Ma: For small dimensions, we didn't try, but I think we can. I think the question is whether you can still find one as you increase the number of samples. I even found that if I fix the number of samples to be just five times the dimension, then I can find a local optimum, but if I increase the number of samples, then it's harder and harder to find one. >>: Maybe it's easier to construct a local optimum because of the symmetry, right? You can use a symmetry argument: [indiscernible] has to be the same, unless they're all the same or everything in between them is the same. Unless that's the case, they have to be local -- >> Tengyu Ma: I guess what you are saying is that there are a lot of global optima, because if you permute A and X accordingly, you get exactly the same solution, so in the whole space there are a lot of minimizers, and for sure, in the whole space, there are many [indiscernible] points, just mathematically, right? But in the regime that we are interested in -- we can prove things within 1/log N closeness -- maybe there is no local optimum in this small ball. So okay, but anyway, this was our confusion, and we tried to propose something slightly better than this. >>: So this is your conjecture? >> Tengyu Ma: This is not a conjecture. This was our confusion -- we didn't really understand why this works. It's just the confusion, stated for comparison, because I am going to say something new. So we are going to analyze this, and in the learning step, we are going to use gradient descent. And if you look at this energy function, it is kind of special, as we discussed: we have this A that corresponds to a fixed set of parameters, and X corresponds to a stochastic part, a random part in some sense. Also, this is special because if you fix X, then this function -- in our case, it's quadratic, so at least it's convex with respect to A when you fix X. Actually, if you choose the penalty to be a convex function, then for fixed A it is also convex in X. So we hope to exploit this special structure of the function so that we can prove something. And the way we exploit it is this: we observe that if you plug X* into this function -- X* is the ground truth, which we don't know, but as a thought experiment, let's plug in X* -- we get a function that only depends on A. The penalty term becomes a constant, and this is a quadratic function of A. So this is convex, but it is not known.
It's hypothetical -- a thought experiment. But the observation is that when X is close to X*, the gradients are close: if you look at this and this, the only difference is X versus X*, and the gradient with respect to A when you plug in X and the gradient when you plug in X* are more or less close. This is just the intuition. And the gradient at the point X* is the gradient of this convex function, just by definition. So what we are saying is that this term, the one we are interested in right here, this guy, is close to the gradient of the convex function, although that gradient is unknown. That's the point. So then we have a different picture: we define this unknown convex function, and we show that the gradient we have is an inexact version of the gradient of this convex function. And if you look at this picture, we know that if you follow this blue one, you are guaranteed to converge to the global optimum, because this is a convex function, and we are in the space of A -- this is convex and everything is nice. >>: Yes, but I thought this was precisely the second line that we talked about earlier -- >> Tengyu Ma: Yes, exactly. I'm just -- >>: So the problem is that it goes too far, so that's why it's not as good as just taking small steps to reach the global optimum. Earlier, you had three -- >> Tengyu Ma: Yes. >>: And this one, the one you just talked about, actually -- >> Tengyu Ma: Yes, exactly. >>: Which is not as good as the first one. >> Tengyu Ma: Yes. Because you are going too far, exactly. This is the MOD thing, taking [indiscernible]. >>: So your problem is that you want to show that not going so far is better than going [indiscernible]. >> Tengyu Ma: At least -- I am not going to say it is really better, because we don't really have evidence, but intuitively, at a high level, I think that's true. At least from a theoretical point of view, when we analyze it, we want to choose the step size to be not too large, so that it doesn't do anything too bad. Anyway, if you follow this blue arrow, then you converge to the global optimum. But what we have is this red arrow, which is an approximation, in some sense, of the blue arrow, and we hope this approximation doesn't hurt us too much -- that even with the approximation, we can still do something good. So basically this is the new perspective, versus that one. Just to repeat: we know this converges to A*, and we hope to show this implies that the one we actually have converges to A*. So basically we are asking whether we have a theory of biased, approximate gradient descent, and another question is, since we are anyway estimating A*, maybe we can use other approximations. A note: this is different from stochastic gradient descent, because in stochastic gradient descent, you usually pick some samples to estimate your gradient, but that estimate is unbiased. Here, the estimate could be biased, so we need a biased version of approximate gradient descent, and the answer is yes -- actually, we can be a little more general.
Does another approximation of the gradient also converge? The answer is yes, and that's why we can analyze this k-SVD one, which doesn't really involve a gradient at all -- but we can still think of it as an approximation of gradient descent -- and maybe some other variants of the Olshausen and Field algorithm. Okay. So this is the new framework, and then we are going to talk about [indiscernible]. Our goal is to build this theory of approximate gradient descent, in some sense, and actually, we propose an even more general condition. So we are interested in this general update rule, this first-order update: we have some Z -- I changed the A to Z, so the map is Z is A, and G, you can think of it as a gradient -- because this is interesting in its own right. This is my first-order update, and this is my theorem, my condition. My condition says that GS should be somehow correlated with the desired solution Z*: the inner product of GS with ZS minus Z* should be not too small. I guess this doesn't make too much sense at this point, but I'm going to have a picture in the next slides -- just here. What we are saying is: this is GS, the direction of the update, and this is ZS minus Z*, so what we are really saying is that the angle between this red arrow and this green one is strictly less than 90 degrees. And the reason why we have these two norms here is just that we don't want -- >>: That less than 90 degrees -- is it always less than -- >> Tengyu Ma: This is our condition. If it is always true, then we are good, right? >>: I see. So that requires that your gradient is not too far from the real, true gradient, because if you go in the opposite direction, the whole [indiscernible] -- in your true solution, in that solution. So that condition says your estimate of the gradient cannot be too noisy. >> Tengyu Ma: Not exactly, but I think you get the main point. Basically, if my direction is close to the true gradient, then certainly this angle is less than 90 degrees. But the reverse is not necessarily true. >>: So there [indiscernible]. >> Tengyu Ma: It's possible that, for example, this is the true gradient, and this red direction is my update rule -- both are correlated with the desired direction, but the two are not really correlated with each other. That's also possible. We don't really have an example, but theoretically, it's possible. I'm going to have a picture in a few slides. But basically, we are saying that it's not too far from the true gradient, and the theorem is: if GS always satisfies this condition -- this is something we need to check, but say it always does -- then we have geometric decay of the error. Here, the learning rate needs to be chosen small; I think we need the learning rate to be less than 2 beta, so we can't choose it to be very large. And we also have some systematic error here: I didn't talk about this epsilon S, but there is some systematic error which we allow, because for this problem, we really do have this error. >>: By the way, for this -- is this just one step, or if you allow yourself to go more than one step, a few steps, is it going to do better?
>> Tengyu Ma: You mean like an accelerated gradient descent? >>: No, gradient descent, more than just one step. You can do gradient descent. >> Tengyu Ma: Oh, yes -- so here, we have s. After s iterations, the error decays, drops to -- yes. It's not only about -- >>: [Indiscernible]. >> Tengyu Ma: So we have -- >>: So the room [indiscernible]. >> Tengyu Ma: Sure, yes. I think so. Okay. So you can prove this by picture. We are trying to show that after S iterations, the error drops at this rate, so we only need to show that at each iteration, it drops by a constant factor -- I'm omitting the systematic error here, so let's say this epsilon is zero. And this is almost a proof by picture: if this angle is less than 90 degrees and we choose this eta to be a little bit small, then certainly this blue arrow is shorter than the green one, right? And that's it. Just a little bit of discussion about this. There is no objective function involved in this framework -- no gradient, no convexity really involved. We only need some first-order update rule, and then we have a condition we can check. But actually, behind it, this captures the analysis of convex optimization, and it is also extracted from the analysis of convex optimization. This is the most basic thing that people do in convex optimization, and the only thing different is that this is a different level of abstraction. That's the only thing. But we hope that this different level of abstraction makes it more general, so you can apply it even to non-convex problems. That's the point. So now I'm going to try to apply this framework to the energy minimization, and recall that -- >>: [Indiscernible] level of abstraction. I think -- but this has to work locally where it is strongly convex. >> Tengyu Ma: Yes, exactly. >>: So that's the bottom line. >> Tengyu Ma: Yes, that's the limitation. >>: Yes, that's the limit. That's why I [indiscernible] your last point saying possibly working on non-convex -- because [indiscernible] is the same. It's the level of abstraction. >>: So what does it mean, level of abstraction? >> Tengyu Ma: In terms of the analysis. Previously, we would say that G must be the gradient of a convex function, the update rule is gradient descent, and this condition is hidden in the proof. So previously, my theorem would be: G is the gradient of some convex function, and then I have this conclusion. And now I'm saying that G doesn't have to be the gradient of anything. It's just something, and if I have this condition, I have the conclusion. >>: It's like the factor [indiscernible]. >>: Okay, but put another way: for example, if it's [indiscernible] strongly convex, at most, you can imply this condition. >> Tengyu Ma: Yes. >>: But on the other side, I doubt you can find some function which satisfies this but does not satisfy strong convexity. >> Tengyu Ma: There is no function involved, right? That's why it's slightly more powerful. >>: But if you restrict to a function, then that has -- >> Tengyu Ma: Yes, if you restrict to a function, yes, but here, we are going to apply this to a non-convex function, so that's the difference. Okay, so I think -- oops. Oh, here, okay. I guess we need to speed up a little bit, so I'm going to apply this to the energy minimization problem. Let's see.
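In symbols, a plausible rendering of the condition and the conclusion just discussed, reconstructed from the verbal description (alpha, beta > 0 are the correlation parameters and epsilon_s is the systematic error):

```latex
% Condition: the update direction g_s is correlated with the desired solution z*,
\[
  \langle g_s,\; z_s - z^* \rangle \;\ge\; \alpha \,\|z_s - z^*\|^2 \;+\; \beta \,\|g_s\|^2 \;-\; \epsilon_s .
\]
% Conclusion: with z_{s+1} = z_s - \eta g_s and \eta \le 2\beta, expanding
% ||z_{s+1} - z*||^2 and applying the condition gives
\[
  \|z_{s+1} - z^*\|^2 \;\le\; (1 - 2\alpha\eta)\,\|z_s - z^*\|^2 \;+\; 2\eta\,\epsilon_s ,
\]
% i.e., geometric decay down to a systematic-error floor of order max_s \epsilon_s / \alpha.
```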
I think this relates to your question: how do you use this, right? So I have this alternating update. My decoding is something like this, and the learning is like this -- this is the decoding, and this is the learning. On the other hand, my framework says that if my update is like this, then I need to check this condition, so basically I need to connect this to that. So I just define G to be this gradient, with X being the decoding of Y under A, and I want the update to be of this form -- I want to match these two -- so A is updated to A minus eta G, and basically we want to check that G satisfies this. >>: And so is it true? >> Tengyu Ma: Yes, it is true. >>: Nice. >> Tengyu Ma: Locally. Only locally. With the initialization -- 1 over log N. >>: That's the -- >>: Oh, okay. Perhaps it isn't, okay. >> Tengyu Ma: But we believe 1/log N shouldn't be the truth, right? It should be something like a constant. >>: But there is no hope that it would be true -- >> Tengyu Ma: Globally. It should be different. So let me get to this picture: it's possible that the true gradient is in this direction, but the one you have is in this direction, and the angle between the purple one and the red one is very large, but they are both correlated with the green one. That's possible, but that's just -- >>: It's okay, right? It's also less than 90 degrees from the green one? >> Tengyu Ma: Yes, so this is still a good case. >>: The same picture will show the same thing. >> Tengyu Ma: Yes, but the point is that you don't really need to compare the red one with the purple one. You only need to compare the red one with the green one. And actually, as you suggest -- >>: One extra point. That can be true for a particular instance, but if you want to do a stochastic analysis, then in expectation it will always be less than 90 degrees. That's what I would guess. >> Tengyu Ma: This doesn't have to be the best estimate of the gradient -- maybe it's something else. You can choose something else that doesn't have anything to do with the gradient. >>: Except, from the picture, if eta is very large, the -- >> Tengyu Ma: We're not going to choose that. I'm going to choose eta to be small. And just very quickly: as you suggested, if you compare this red one with the purple one, that's a slightly stronger condition, and this is exactly the condition in the [indiscernible] paper, which we realized is independent work. They are doing this for EM algorithms, so it's a different setting, but it's exactly the same type of thing. In an EM algorithm, we have the parameter and the hidden variable; the only difference is that for the hidden variable, you don't do a decoding, you do a posterior estimation, but it's exactly the same thing. And their condition is slightly stronger: they compare the red one with the purple one, not the red one with the green one, so in their case, this is small. Okay. Time-wise, it should be fine. So now I'm going to check this condition, right, the last line. This is what I need to do: G is correlated with A minus A*. And what is G? If you take the gradient, G is this -- it's something like this -- and let's consider the population version. Let's say we have an infinite number of samples, so let's consider the expectation of it; the sample complexity I am not going to talk about -- that's very tough.
And X is defined to be the decoding of Y given A. And this is quite a nasty thing, right? Because X is a function of Y and A -- it could be something like an L1 minimization; you don't really know -- so the question is whether you can get a good closed form for G, so that you can check the condition. And then the question is what decoding you want to use so that you really have an analytic form for G. >>: What is the expectation [indiscernible]? >> Tengyu Ma: Excuse me? >>: What is the expectation with respect to? >> Tengyu Ma: So, if you plug in X* here -- is that your question? >>: Yes. >> Tengyu Ma: If you plug in X*, this is just A minus A*, because this is a convex, quadratic function. But of course X is not really X*, right? So, an interlude: what decoding rule are we going to use? Our problem is that we have Y equal to A* X*, and we want to decode X from Y and a noisy version of A*. If we were given A* itself, this reduces to a sparse recovery problem, and in this situation we can use this projection decoding. The idea is that we just take A* transpose Y, and we threshold it. By threshold, I mean: if this entry -- the inner product of Ai* and Y, which is just the ith entry of A* transpose Y -- is larger than a half, I keep it. Otherwise, I zero it out. >>: In what sense did you come up with this projection decoding [indiscernible]? >> Tengyu Ma: Why did I come up -- >>: Because the original problem, given A, is a convex problem; it gives you the global solution. But the -- you don't want to go too far in the -- >> Tengyu Ma: No, the point is that -- yes. First of all, I don't really need that precise a solution. I'm okay with an approximation. That's one of the key points. Also, for the theoretical analysis, I want something that has a closed form. If it's the minimizer of a convex function, there's no hope of analyzing it, and actually, I think we have some evidence that this is okay -- it's comparable, but probably slightly worse than the L1 version. >>: So when you take this threshold, this is just an arbitrary threshold. How do you make sense of this kind of solution? >> Tengyu Ma: I'm going to talk about it. There is a theorem, which dates back probably 20 years: if A* is incoherent, which means the Ai* are unit-norm columns and the inner product between distinct columns is less than 1 over the square root of N, then if you use this threshold, the support of the decoded X is equal to the support of the true X. >>: So is it intuitive that this threshold is sensible as an approximate solution? >> Tengyu Ma: I hope I can convince you. Let me show you the proof, which is just one line -- I hope I can convince you, but it's not that easy to see. If you take the inner product of Ai* and Y and do the calculation, this is Xi plus a sum over j not equal to i of Xj times the inner product of Ai* and Aj*. Because X is a sparse vector, there are only K nonzero terms in this sum, and each inner product is less than 1 over the square root of N, so the sum is at most K over the square root of N, which is less than a half. So basically, looking at the beginning and the end: if you take the inner product of Ai* and Y, this is Xi plus or minus some error smaller than a half.
So then, since Xi is 0 or plus or minus 1 -- or, in general, either 0 or something of magnitude at least a constant -- this threshold makes sense. Okay. And then the next slides just show that if you have an A that is approximately close to A*, and you use the same thing, just with A instead of A*, you get the same conclusion. And how close do you need to be? You need to be 1 over the square root of log N close. This is exactly where we use the 1/log N-close starting point. So we need the decoding to be, not exact, but correct on the support. Actually, we can relax this a little: if the support is not exactly correct but is correct with high probability, or constant probability, it's probably fine, but that's a little harder to calculate. So the point is that if the support is correct, then I can write this threshold decoding as X equals A transpose Y restricted to the support, and this is a closed form, for the analysis, and it's also very fast -- just a matrix multiplication. And then I can do the calculation: I'm going to calculate the expectation, my decoding is just this closed form, A transpose Y restricted to the support, and I plug in all the definitions. I'm going to skip this; I guess it just shows how nasty this gets. I plug this definition of X into this, and you also need to plug in the definition of Y and everything, take the expectation, do a lot of calculation, and we get this. So we cannot get a good form for G as a whole, but we can get a good form for each column of G. Each column of G is of this form: it's lambda squared times Ai, minus lambda times Ai*, plus some error. Let's forget about the error. This lambda is something close to 1, so this is really correlated with the direction Ai minus Ai*, just with slightly different constants, and this constant actually converges to 1. It's a little hard to believe, but I guess the reason, as you asked, is that if you plug in X equal to X*, then this is exactly Ai minus Ai*, so what we are saying is that when X is not X*, it doesn't cost too much error, due to cancellation. This cancellation is really important, because basically we are saying that X minus X* is something that, for an individual sample, might be large, but it points in different directions for different samples, so you get this cancellation. Somehow you have this cancellation phenomenon, so the error is not that large. And here we use the fact that A and A* are both close to isotropic. The important thing is that we also use that A is close to isotropic, which we cannot really guarantee, right? Because A is the estimate. A* is the ground truth -- we can assume that it's close to isotropic -- but how do you maintain A to be isotropic? I will talk about that in two slides. So then let's check the condition. I want to check that G is correlated with A minus A*, and we have this form, and we can show it by picture. Basically, lambda is the inner product of Ai and Ai*, so lambda is the size of the projection of Ai* onto Ai. And here I have lambda squared times Ai, so let me extract one lambda from these two terms: if I multiply lambda by Ai, I get this vector, and minus Ai*, I get this vector.
I guess I'm doing this a little quickly, but anyway, you can just show it by picture: because lambda is close to 1, this is really correlated with Ai minus Ai*, so for each column this is correlated, and then we can take a sum and show it for the whole matrix. So we check that the condition is true, and then we can apply the framework, and you get the result. The only thing that remains is this caveat: how do you keep each estimate close to isotropic? Solution one is that at each step we can project so that the spectral norm becomes smaller; the spectral norm is convex, so that's doable. There are some other caveats that I am not going to talk about, but this is the key idea. And I like solution two more, which is that I can use a different G-prime, which is Y minus AX times sign(X) transpose. Previously we used X here, and now I use the sign of X, and somehow, using the sign of X reduces the noise. For this G-prime, first, I can show it still satisfies the condition -- it's still correlated with the direction Ai minus Ai* -- and also, if you use this update rule, you can prove that the spectral norm of A always stays less than three times that of A*. Because this update rule is simpler and more robust, you can prove something stronger. >>: Does it make sense to take the sign of the gradient itself, putting the sign outside the expectation? That's something very, very popular in resilient backpropagation. >> Tengyu Ma: I see, the sign of the gradient. For this case, I think it doesn't, but actually, I think we can talk offline about this. This sign corresponds to -- we were inspired a little by backpropagation, but let's talk offline if you want. So the point is that this G-prime is really not the gradient of any function, and that's the benefit of our framework: we don't depend on the fact that G is close to the gradient of some true function. Our framework doesn't involve gradients or convexity. >>: So why is it better than the original one? >> Tengyu Ma: In practice, this is not better than the original one -- it's almost the same -- but for the analysis, it's slightly better. It's easier. >>: It's not as accurate. The gradient is not as accurate. >> Tengyu Ma: Yes, but I guess the whole point is that the gradient doesn't have to be accurate. You need it to be more or less unbiased. >>: But it may turn out to be greater than 90 degrees [indiscernible]. >> Tengyu Ma: Sorry? >>: Because the gradient has to be within 90 degrees of the green line. But if you do this, you may not actually have that happen. >> Tengyu Ma: You can prove that it still does. >>: Okay, okay. >> Tengyu Ma: Okay, but these are just some caveats that we don't really care about. So let me do the summary. We showed that locally, minimizing this non-convex function is close to -- you can think of it as -- minimizing this unknown convex function with an inexact oracle, in some sense. And this is our condition: we have this general first-order update rule, and we have this condition -- GS is correlated with ZS minus Z* -- which implies that ZS converges to Z*. We can apply this to sparse coding, and we need a careful choice of the decoding and so on, so that we can calculate this expectation. Okay, so the initialization -- let me leave some time. Let's see. Okay, maybe let me show you just this slide, just this one.
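Putting the pieces together -- the threshold decoding and the sign-based update G-prime -- here is a minimal sketch of the analyzed variant. The step size, threshold, iteration count, and column renormalization are illustrative choices, not part of the stated analysis.

```python
import numpy as np

def threshold_decode(A, Y, tau=0.5):
    # Keep entries of A^T Y whose magnitude exceeds tau; zero the rest.
    Z = A.T @ Y
    return Z * (np.abs(Z) > tau)

def learn_dictionary(Y, A0, eta=0.05, iters=100):
    A = A0.copy()
    for _ in range(iters):
        X = threshold_decode(A, Y)
        # G' = (A X - Y) sign(X)^T: not the gradient of any energy function,
        # but still correlated with A - A*, which is all the framework needs.
        G = (A @ X - Y) @ np.sign(X).T
        A = A - eta * G
        # Renormalize columns (a practical step to keep the scale fixed,
        # not part of the analysis described in the talk).
        A /= np.maximum(np.linalg.norm(A, axis=0), 1e-12)
    return A
```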
So we want to have an initialization, and what we do is we just pick two samples, U and V, and we compute this matrix: it's Y Y transposed -- the second moment -- but weighted by the inner products between U and Y and between V and Y, so it's a weighted second moment. And then we take the top eigenvector. That's it. So the algorithm is very simple. If you don't have this weighting, this Y Y transposed is like the identity; it's not useful. So the point is that a sample is of the form A* times a sparse vector, so U is A* alpha and V is A* beta, and take the supports of alpha and beta. If by some chance their supports intersect at only one place, T, then I can show that this M-U-V is really dominated by one direction: it's AT* times AT* transposed, plus some noise, and this noise is small. So if you happen to choose a U, V pair satisfying this, then you find one good direction. This [noise] is 1 over log N, something like that. So yes, let me skip the rest of this. Okay, and the chance that this happens is not too small -- it's on the order of K squared over M, by the birthday paradox -- so trying many pairs, you usually have a decent chance of finding such a pair. So let me skip the rest of this and jump to the discussion and open questions. So the first question is whether this is generalizable to alternating updates, like the EM algorithm, and to other hidden-variable models. The [indiscernible] paper studies EM, and at least for the problem they are studying, because our condition is weaker, our framework still applies there. But there are two obstacles that prevent this from being more powerful. The first one is that the decoding doesn't have a simple closed form. This is a technical issue: if you don't have a closed-form decoding, then you don't know how to calculate the direction of the update. The second one, I think, is more fundamental, and it relates to [Shel's] question: what happens if this unknown convex function, this E*, is not strongly convex and smooth? I think I even read some papers by Nesterov which show that in this case, if you have an inexact oracle for the gradient, then the error accumulates. But this is not necessarily the end, because you can change the algorithm or find some other way to avoid it; they only study the naive approach, probably -- I didn't read the paper very carefully. But anyway, the limitation of our approach is that E* must be strongly convex and smooth, and we don't know how to go beyond that. And another limitation is that this is limited to local convergence -- local in the sense that it doesn't need to be a small basin, but our framework can only apply to the regime where the geometric decrease of the error really happens. So, for example, for this sparse coding problem, if you randomly initialize this matrix A, with a small trick, then you observe that it converges globally. But at the beginning, the reason for the convergence, I think, is essentially different, so it's not really a technical problem. It's an essentially different condition, and we don't really understand it.
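(A minimal numpy sketch of the pairwise spectral initialization described above; Y is the n-by-N data matrix and u, v are two sample columns, with illustrative names not taken from the talk.)

```python
import numpy as np

def candidate_direction(u, v, Y):
    """Top eigenvector of the reweighted second moment
    M_{u,v} = mean_j <u, y_j> <v, y_j> y_j y_j^T.

    If the sparse codes of samples u and v share exactly one support
    coordinate t, M_{u,v} is close to a multiple of A*_t (A*_t)^T plus
    small noise, so the top eigenvector approximates the column A*_t.
    Without the weights, the moment is near-isotropic and useless.
    """
    w = (u @ Y) * (v @ Y)            # weights <u, y_j><v, y_j> per sample
    M = (Y * w) @ Y.T / Y.shape[1]   # weighted average of y_j y_j^T
    vals, vecs = np.linalg.eigh(M)   # M is symmetric
    return vecs[:, np.argmax(np.abs(vals))]
```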
I think it is related to some dynamical-systems problem, but yes. And also, for global convergence, you may need ad hoc analysis somehow, because for some problems, global convergence is simply not true -- for Gaussian mixture models, it's not true -- but for some models, it is true, so maybe we need to really exploit the structure of the problem. In general, our framework is not very sensitive to the structure of the problem: as long as the unknown convex function is strongly convex, it works. And then: a practical algorithm that goes beyond sparsity K around square root of N. There is strong theoretical work -- the sum-of-squares relaxation algorithms, which can go beyond this -- but the running time is N to the 20, which is really not practical. And finally, I guess because [Doug] is here, I added this slide: can we go beyond sparse coding? Actually, this is the main motivation for studying sparse coding -- we want to go beyond it once we understand it. So one message is that the simple decoding rule is enough for learning. You don't really need to solve lasso or some other minimization; a simple heuristic decoding is enough for learning. And this simple decoding rule is really nice: it matches the feed-forward calculation in deep learning. So, for example, Y is the observable, the image, and if you do the feed-forward calculation, this is something like A transpose Y followed by a threshold -- some nonlinear function; whether it's a threshold or a rectified linear unit, they are all similar in some sense. So this decoding rule is really a feed-forward calculation. Now, the generative model here, in dictionary learning, is a linear generative model, which is not going to be the case in deep learning, because the point of deep learning is that you want to stack this in multiple layers, and if you take this kind of model and add another layer and generate and generate, linear composed with linear is still linear. So for deep learning, I think a reasonable generative model is Y equals some nonlinear function of A*X*, so that you can stack the layers. And we hope that if our data is some nonlinear function R applied to A*X*, we can still learn it, but currently, we don't know how to do it. But this is, I think, a nice open question, because if you can learn this, then we can hope to have some provable algorithms for unsupervised learning -- >>: Did you read the paper that I wrote on the deep stacking network, where you can actually formulate each step to be convex? But I don't know how to prove it. Eventually, I'll show you. >> Tengyu Ma: Oh, okay, I would like -- >>: [Indiscernible] combined in such a way that you don't have to [indiscernible]. >> Tengyu Ma: Oh, I see. >>: I will show you. >> Tengyu Ma: Yes, I would love to talk about it. And finally, we hope that our framework could be used to analyze backprop in some sense, and we have some preliminary ideas -- I think we have a proof for two layers under some model, but it's not very -- we want to make it stronger. Yes, so I guess I'm going to end here. >>: But you have to change backprop in such a way that -- you're not going to do the analysis for just the original backprop. >> Tengyu Ma: Yes. We might need to change it a little bit.
>>: Modify it a little bit, and then you can see whether the revised version handles it better, and I think that will give some valid [indiscernible]. >> Tengyu Ma: Yes, yes, yes. Yes, okay, I'll stop here. Thank you. >>: I have a question. So right now, the analysis is based on the data being generated according to the ground-truth model, which is linear. So what would be the major technical barrier if you consider the case of model mismatch? Like, the actual data are generated by some distribution you don't know, satisfying some regularity conditions, and you still use sparse coding to do the data representation and modeling. In this case, what would be the major technical barrier to generalizing the current analysis? That is, which major part of your analysis relies on this ground-truth model? >>: That's a totally different problem. >> Tengyu Ma: Yes, so I think all of our techniques depend on -- >>: In the sparse case, even the criterion is different, so whether that criterion is good or bad [indiscernible]. >>: Related to that comment: basically, you rely on a generative model to generate the data, and that has something to do with your condition that the gradient correlates with the direction to Z*. You can say one advantage is that you don't need the true gradient, but the disadvantage is that you need the generative model. Alternatively, a classic approach is to show the correlation between your update and the true gradient; then you do not need to know where the true optimal point is. All you need is the true gradient. >> Tengyu Ma: Yes. >>: If you do that, then you do not need the notion of a true model, and with a similar argument -- of course, the convergence will be weaker; it will just converge to a local optimum. >>: That says: this is the best parameter for my current model to fit this data. This is actually my -- >> Tengyu Ma: Yes, but I would guess that's a much harder problem, because you don't even know what the minimizer is in that case, right? The minimizer is something out there, and you don't know how to relate it to anything, like the data. So technically -- >>: There, you can just separate it in two, right? One is you show convergence. The second one -- I guess it shouldn't be the same, because the second condition still has to hold: locally, you are strongly convex, so you have something like strong convexity around that point. The advantage of that formulation is that you need to assume nothing about the true optimal point. >>: But if the model is misspecified, I don't know what it would learn. >>: It would be better, right? You have to regularize it anyway. >>: But actual machine learning problems will always have model mismatch, because the point of having machine learning is really that we don't know the true model -- that's why we use deep learning models or this and that to fit. So, basically, we are assuming that at the optimal point, our model could be somehow closely related to the actual data, and then, in that case -- for example, with a slight model mismatch or something like that.
>>: So actually, even [indiscernible], there are two different kinds of [indiscernible] minimax risk, just like here, and then for [indiscernible] about whether your model is misspecified. >>: So, to follow up on that discussion, can you show me in your slides where the analysis requires the generative model? >> Tengyu Ma: Oh, it requires the generative model everywhere. Almost everywhere. >>: So when you say that you actually do have the correct -- so yeah, maybe just for -- >> Tengyu Ma: Yes, I think we require the generative model. So, for example, what is Z*? Z* is Ai*. This is defined by the -- >>: You need the true gradient. You don't need the generative model, so [indiscernible] comment on that. It's this analysis [indiscernible]. >>: What I said is that I think they should be roughly equivalent conditions, but with different assumptions. Instead of requiring the true optimal point, you can require [indiscernible]. >> Tengyu Ma: Yes, it's possible. >>: So you don't have to assume a generative model. You just assume that there is a unique global optimal solution. >> Tengyu Ma: Yes, I completely agree. That would be better, yes. >>: That's my question. It's a difficult problem. >> Tengyu Ma: But I think this is definitely interesting from a theoretical point of view. Anyway, we haven't been bothered by that so far -- >>: So that's the point. If you assume you have a local optimum -- okay, sometimes you have a global one somewhere there -- then the proof goes through, and what you want to say is that your update direction always deviates from the true gradient by at most something. Then you automatically get that kind of result, so it's a different assumption. >>: And can you also show me where you use the assumption of the generative model where X is -1, 0, or 1? >> Tengyu Ma: Oh, we only use it in the decoding. >>: All right, which one? Which part? >> Tengyu Ma: Here. We want to prove this, and this assumes that Xi is -1, 0, or 1. Or rather, we assume that Xi is either 0, or larger than 1, or smaller than -1. That's the only thing we need. >>: I see. Okay, but you said it can be generalized beyond that kind of limit. >> Tengyu Ma: So anyway, we need something like this. We need a condition like: Xi is either 0, or between A and B for some constants A and B, or between minus B and minus A. That would be enough. >>: Oh, okay, so you can use that to generalize it a little bit. >>: So, do you have time to -- >>: No, I don't have time. >>: Okay, so I will leave the problem there, that will [indiscernible]. Okay, thanks very much. >>: [Indiscernible].