>> Dengyong Zhou: It's my great pleasure to host Professor Tong Zhang from Rutgers University. Tong received a BA in mathematics and computer science from Cornell University in 1994 and a Ph.D. in computer science from Stanford University in 1998. After graduation he worked at the IBM T.J. Watson Research Center and at Yahoo! Research Labs in New York City. He is now a professor in the statistics department at Rutgers University, and his interests include machine learning, [indiscernible], and mathematical analysis and its applications. He has been here for a couple of days and has offered [indiscernible]; if you want to [indiscernible], just send me an e-mail. Today he'll talk about greedy algorithms for sparsity constrained optimization. >> Tong Zhang: Okay. Thanks for the introduction. So today I want to talk mostly about some more theoretical material in machine learning optimization. Here's the motivation. In machine learning, people are interested in learning problems with a large number of features. If you don't do anything about that, you run into the problem called overfitting. So what people do is assume some structure on the target. One such structure is sparsity, which basically means the target is a linear combination of a small number of features. In that case you can formulate the learning problem as sparsity constrained optimization, which we will see a little bit later. The general difficulty is that this optimization is NP-hard, so people have to use approximation algorithms. In this talk in particular I want to talk about greedy algorithms. It will probably be a little bit dry in the sense that it is mostly theoretical: I'll talk about variations of the greedy algorithm in the context of sparsity constrained optimization, with the focus on theoretical analysis. I know people in the theoretical computer science literature also analyze greedy algorithms, but they make different assumptions, basically submodularity, which is not quite the same thing: sparsity constrained problems are not submodular. I have only looked at that literature briefly, but I think for sparsity you need somewhat more involved techniques, because the key inequality you have to prove here is essentially assumed there -- if you assume submodularity, you are handed the thing you need to make the proof go through, while here a similar condition does not hold and has to be established. Anyway, here is sparsity constrained optimization. The formulation is: minimize some loss function -- we take it convex for now -- over a parameter w, subject to w being sparse, meaning the number of nonzeros in w is smaller than some k. An example is logistic regression for binary classification: the loss function is the negative log likelihood of w on the observed training data, and you minimize it subject to the sparsity constraint, which is an L0 constraint. So that's the setup for sparsity constrained optimization, and the question is how to solve it. Generally there are two types of strategies: one is convex relaxation, and the other is greedy algorithms.
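(For reference, a hedged reconstruction of the formulation just described, in standard notation; the training pairs (x_i, y_i) and the sample size n are my notation, not the slide's:)

```latex
\min_{w \in \mathbb{R}^d} \; Q(w)
\quad \text{subject to} \quad \|w\|_0 \le k,
\qquad \text{e.g.} \quad
Q(w) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + \exp\left(-y_i\, w^\top x_i\right)\right),
```

where ||w||_0 counts the nonzero entries of w, and the logistic case is the negative log likelihood mentioned above.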
So in convex relaxation, the standard relaxation is to minimize the same convex loss Q(w), but subject to an L1 constraint instead of the L0 constraint. The theoretical question to ask in this context is: under what conditions is the solution of the relaxed problem approximately a solution of the original problem? That is the kind of question to study there. The other approach, which is what this talk is mostly about, is greedy algorithms. The basic greedy algorithm tries to find the nonzero coefficients of w -- the support set -- one by one, greedily. In this talk I'll cover a little bit of the history and some variations of the greedy algorithm in this context, emphasizing the theoretical aspects, so I won't show many examples. I'll also focus on my own research. There are a lot of other people working on this, so don't be offended if you find that I mostly mention my own work; I'll give a rough picture, but probably not a comprehensive list of others -- mainly because this is my talk. Now, the types of greedy algorithms. I want to spend a moment on this, because there are different variations, and there are different axes along which to classify what kind of greedy algorithm you are doing. One axis is the loss function. A lot of people in the signal processing community use the least squares loss; that kind of procedure is called matching pursuit. With a general loss function, in machine learning it is called boosting. So you can classify by loss function: least squares is the simpler case, and boosting is more complicated. The second axis is the regularization. The traditional analysis includes an L1 constraint -- I will show you why, and give some results whose derivation includes the L1 constraint. You can also ask what happens when you do not include it. The third axis is optimization within the active set: after you select the features, do you do a full optimization within that set or not? That's non-corrective versus fully corrective, and again there are theoretical differences, which is also part of what I'll explain. The last axis is the search criterion: the traditional greedy algorithm is strictly forward, while what I call local search uses forward and backward steps -- basically you try to find a fixed point with a forward/backward procedure instead of going strictly forward. So you can view greedy algorithms differently along these axes. There are also extensions. In particular I'll talk about the generalization to structured sparsity optimization, where you have a more general form of the optimization problem. There is also the nonlinear case -- although I talk about linear models, boosting handles the nonlinear case easily; it's not a big issue, not even really an extension, you just need to arrange your algorithm carefully, so I'll stay in the linear setting. And as I said, the related greedy algorithms in the theory literature use different assumptions, so that line is somewhat different. For the history, I will mainly emphasize the theoretical analysis, organized by these variations. So first: greedy with L1 constraints, shown below.
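(The setting the next part studies, reconstructed in the same notation: the sparsity constraint augmented by an L1 bound of radius A, as in the classical analysis about to be discussed:)

```latex
\min_{w \in \mathbb{R}^d} \; Q(w)
\quad \text{subject to} \quad \|w\|_0 \le k \;\;\text{and}\;\; \|w\|_1 \le A
```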
That is: instead of minimizing Q(w) subject only to sparsity, you minimize Q(w) subject to the L1 norm of w also being small -- not only is w sparse, its L1 norm is bounded. This was studied back in the 1950s -- the Frank-Wolfe algorithm -- then largely forgotten, and later rediscovered. It has had a lot of impact on modern work; this line was mainly brought back by Clarkson, and a lot of people have followed up. There are a lot of works -- again, I'm not giving a comprehensive listing. In this talk I want to say a little about my own analysis of this setting, which is slightly different from the original; it's a variation, and when I did it I didn't know about this earlier work -- I was mostly focused on extensions of the other approaches. There's also a survey paper; it's a long history. Then there is least squares without the L1 constraint: Q(w) becomes the least squares objective, you remove the L1 constraint and keep only the sparsity constraint. In signal processing this is matching pursuit -- this is not my work -- and they have some convergence analysis, which I'll comment on. Once you make it fully corrective -- orthogonal matching pursuit -- you can get a rate, not just convergence; I'll say a little more about that when I get to the results. Then in the signal recovery community there is work doing feature selection, and I have some work on feature selection with stochastic noise, which is a bit of an extension of that. But I'm not going to talk about those; I'll talk about the general loss setting rather than just least squares. >>: Question. So the conventional convergence analysis -- is that about convex properties? >> Tong Zhang: Nonconvex. >>: Is it convergence to a local solution? >> Tong Zhang: Actually, the convergence statement says that as k goes to infinity you converge: you find the solution of the unconstrained problem, even when the dictionary size is infinite. >>: With infinity, the sparsity is not really a constraint -- >> Tong Zhang: It's not. It's not. I will talk about that later when we show the analysis. In fact, you want to say what the size of k is eventually. The flavor of the result is: I cannot solve the problem at sparsity k exactly, so I replace k by some k' that is a little bit larger -- the complexity is a little bit larger -- and then I can achieve an objective error as good as the optimum at sparsity k. So you define the optimal objective value at sparsity k, and compete with it using k' nonzeros. That's the flavor of the analysis; I will show it later, and then it makes more sense. Of course, if you just look at the limit by itself, you would only get local statements -- but this analysis is all global. The point is precisely to avoid convergence to a merely local solution, so you have to change the style a little bit. Good question; we will see the details later. Then there is the general loss function case. I'll talk a little bit about these three: sparse solutions minimizing a general convex loss, which is the boosting setting. The original boosting, of course, is AdaBoost with the exponential loss, which has a weak learning assumption, and the convergence statement is again as k goes to infinity. Then Friedman has a different style of boosting -- gradient boosting -- without analysis. We did some analysis in this setting.
Again, that is convergence analysis. By the way, here is why rates are harder: to get a rate of convergence you need fully corrective boosting; you cannot get that kind of result from the plain version -- I will explain that later. The rate is the more interesting part. That work also shows that if you do fully corrective boosting you can get sparse recovery under additional assumptions, which is more of a compressed sensing type of result. Then local search. In local search you attack the same problem, but you really target sparsity k directly, essentially using forward/backward ideas. We'll talk about two variations; one gives a somewhat better result, with better feature selection properties -- we will see what that means when we look at the results. This is also related to away steps in the optimization literature on Frank-Wolfe-type algorithms; those are with L1 constraints, while here we do not use L1 constraints. Okay. The other direction is to be a little more general: instead of Q(w) subject to sparsity, you replace sparsity by a general cost function. This is structured sparsity: instead of just the L0 norm, you look at the support -- the set of selected variables itself -- define a function on that set, and there are results extending the greedy algorithm to this problem. There are also other extensions that are different, like eigenvalue problems, where Q may not be convex, so you have to treat them specially. Anyway, let me just give you a flavor of the algorithms and of the kind of analysis they admit. Here is the L1-constrained case. What it does is the following. The active constraint is actually just L1; the sparsity level k here is effectively infinity. But the solutions this algorithm produces are themselves sparse approximations: after k steps you have a k-sparse w. At each step you shrink your current w and add one additional component, chosen greedily to optimize your cost function, and then update by combining the shrunken w with this additional component. Each step adds at most one feature, so after k steps w is k-sparse. Here is what you get with this kind of algorithm: Q(w_k) is within C A^2 / k of the unconstrained optimum over the L1 ball of radius A. You could state it against a sparsity-constrained optimum instead, but it doesn't make much difference, simply because the optimum at sparsity k or k' is never better than the unconstrained optimum over the ball. So you can approximately solve this problem with a sparse solution; that's the flavor. Of course it doesn't directly solve the original L0 problem -- that's one issue. The other issue -- the main one from the algorithmic point of view, if you want to implement this in a learning algorithm -- is that it relies on knowing A, together with a step size schedule. That's a disadvantage.
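(A minimal Python sketch of the shrink-and-add step just described, in the Frank-Wolfe style. It assumes a smooth loss supplied through its gradient; the names grad_Q and A, and the standard 2/(j+2) step schedule, are illustrative choices rather than the speaker's exact formulation:)

```python
import numpy as np

def frank_wolfe_l1(grad_Q, d, A, num_steps):
    """L1-constrained greedy (Frank-Wolfe style) sketch.

    Each step shrinks the current w and adds one coordinate: the
    vertex of the L1 ball of radius A best aligned with the negative
    gradient. After k steps, w has at most k nonzeros.
    """
    w = np.zeros(d)
    for j in range(1, num_steps + 1):
        g = grad_Q(w)
        i = np.argmax(np.abs(g))          # greedy coordinate choice
        vertex = np.zeros(d)
        vertex[i] = -A * np.sign(g[i])    # best vertex of the L1 ball
        eta = 2.0 / (j + 2)               # standard schedule; line search also works
        w = (1 - eta) * w + eta * vertex  # shrink, then add one component
    return w
```

For a least squares loss Q(w) = ||Xw - y||^2 / (2n), one would pass grad_Q = lambda w: X.T @ (X @ w - y) / len(y); the output after k steps is k-sparse, matching the C A^2 / k flavor of bound quoted above.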
The proof I'll mention briefly, in case there are people here doing theory. Roughly, you need to prove something like the following: for each step from w_{j-1} to w_j, you get a bound of the form Q(w_j) - Q(w*) <= [Q(w_{j-1}) - Q(w*)] - c [Q(w_{j-1}) - Q(w*)]^2, where Q(w_j) - Q(w*) is your excess risk after the j-th step of the greedy algorithm. That's the type of bound you get, and once you have it, you just solve this recursive relationship. Compare this with submodularity: this problem is not submodular, but you have a similar inequality. Submodularity is by definition actually a stronger inequality than this one -- basically Q(w_j) <= Q(w_{j-1}) minus a term linear in the excess risk, with no square. So this inequality is what you actually have to prove in the sparsity setting, whereas for submodular functions you assume it. And this is really the key part of the proof, not the rest. >>: Question, on the Q's -- is it a risk? >> Tong Zhang: Risk. >>: An expectation over a distribution -- would that be over the test data? Or is it just the training risk? >> Tong Zhang: It is the training risk, actually the training risk. Yeah. Like logistic regression -- this is all about training; I'm not talking about generalization here. >>: So this is training. >> Tong Zhang: Yes, this is optimization -- how you optimize the training error, purely optimization; nothing to do with generalization. Generalization is the reason you want sparsity in the first place: if you have a sparse solution, you can get good generalization, but that is a separate issue. >>: For the assumption -- do you need a smoothness assumption? >> Tong Zhang: Yes, you do. The constant C' depends on it. This result actually assumes the derivative of Q is Lipschitz; if the derivative is not Lipschitz, you get something worse, like a 1/sqrt(k) rate instead. >>: So this is [indiscernible]. >> Tong Zhang: Yeah, it just depends on the smoothness of Q -- but it does depend on that. >>: How much -- >> Tong Zhang: Right, and it holds on any training data. Generally it's fine: for logistic regression the smoothness constant is small, so on your training data it holds, and the constant here is a genuine constant -- for the logistic loss, maybe one or two, something like that. So roughly that's the type of result. The key point from the algorithmic side, as I said, is that you depend on A. That's not nice, because you have multiple tuning parameters: one is k and one is A. Really you want to tune one parameter; two is okay, but there are other issues when you actually implement it. You could implement boosting this way, but I haven't really seen anybody try it, mainly because it's complex. The boosting algorithms people actually try have more of the next flavor: you get rid of the L1 constraint. Here is the algorithm, without the constraint, with a small step size. This is a slight variation of Friedman's boosting -- Friedman's is gradient boosting, and the analysis holds for the gradient version too. Essentially you no longer have A: in the previous version you shrink w and the update involves A; here you don't shrink, you just add a small step, and that makes things a lot easier. Then there is some theory, but again, so far it is just convergence. The convergence is not trivial, in the sense that the dictionary can be very, very large: the convergence does not depend on the size of the dictionary, which is why it is not a trivial result -- it is not obvious a priori that you can avoid that dependence, so you really have to prove it.
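(For contrast, a sketch of the unconstrained small-step variant described above -- matching pursuit for least squares, Friedman-style gradient boosting in general. The fixed learning rate eta that the talk flags as the drawback is explicit here; the names are again illustrative:)

```python
import numpy as np

def gradient_boosting(grad_Q, d, eta, num_steps):
    """Greedy without the L1 constraint (matching-pursuit / Friedman-style).

    No shrinkage of the current w: just move the coordinate with the
    largest-magnitude gradient by a small fixed step eta. As noted in
    the talk, this variant needs a chosen learning rate, and only
    convergence (no rate) is known for it.
    """
    w = np.zeros(d)
    for _ in range(num_steps):
        g = grad_Q(w)
        i = np.argmax(np.abs(g))
        w[i] -= eta * np.sign(g[i])   # small fixed step, no shrinkage
    return w
```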
The key is that none of these results depend on the size of the dictionary. The drawback is that you have to pick the learning rate, and I cannot give a convergence rate for this version. To get a rate, you make another modification: the fully corrective step. This is fully corrective boosting. For the selection step you can use the original rule or the gradient -- it doesn't make much difference: pick the largest gradient, or pick the component that gives the largest decrease of your loss; it doesn't change the analysis. The key is that each time you add one element to the active set, you then do a full optimization over that set. This step does not exist in AdaBoost or in gradient boosting, and it actually makes a difference in the theory. Once you do it, you get something a bit better, and you can also get sparse recovery -- I'll come to that. Essentially you get results of this flavor: Q(w_k) <= Q(w) + C ||w||_1^2 / k, adaptively for all w. The key comparison is that the original L1-constrained result is not adaptive in this sense: you have to know A to get a similar bound. Here you get a similar bound without needing to know A -- once you do the full optimization it is automatically adaptive in A -- and there is no learning rate either, whereas the earlier version had the small step size constraint; and you get a convergence rate. >>: With this step, do you throw away the previous weights -- you keep the selected features and just re-learn the weights? >> Tong Zhang: Yeah, that's exactly it. That's fully corrective. And then you get the adaptive results. Later I'm going to talk about the backward step; the main observation is that this is good, but for sparse recovery it is not necessarily optimal -- you can do better with backward steps, for reasons I'll explain a little later. The proof idea, roughly: you get a similar one-step inequality to the earlier one, but instead of depending on A it is stated relative to an arbitrary target, which can be any k-sparse vector, and the recursion follows from that.
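(A sketch of the fully corrective variant: the same greedy selection, but after each addition the loss is re-minimized over the whole active set. The use of scipy's generic BFGS minimizer for the inner refit is my choice for illustration; any convex solver over the active coordinates would do:)

```python
import numpy as np
from scipy.optimize import minimize

def fully_corrective_greedy(Q, grad_Q, d, num_steps):
    """Fully corrective greedy (OMP-style) sketch.

    Each step adds the coordinate with the largest gradient magnitude,
    then fully re-optimizes Q over the whole active set -- the step the
    talk identifies as essential for the adaptive rate
    Q(w_k) <= Q(w) + O(||w||_1^2 / k) for every w.
    """
    w = np.zeros(d)
    active = []
    for _ in range(num_steps):
        g = grad_Q(w)
        i = int(np.argmax(np.abs(g)))
        if i not in active:
            active.append(i)

        def restricted(v):
            # Q evaluated on vectors supported on the active set only
            wa = np.zeros(d)
            wa[active] = v
            return Q(wa)

        res = minimize(restricted, w[active], method="BFGS")
        w = np.zeros(d)
        w[active] = res.x          # fully corrective refit
    return w
```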
Now, sparse recovery. You can use the fully corrective greedy algorithm to do sparse recovery, which targets the sparsity constraint directly. The previous bound is only loosely tied to the sparsity constraint, because the L1 norm appears in it; in sparse recovery there is no L1, only L0. Roughly: you need a somewhat stronger assumption -- I don't want to go into it, but it is of the restricted-conditioning type that makes sense for learning comparisons. You assume a k̄-sparse target -- the L0 norm is small -- and additionally that this target is close to a global optimum of the unconstrained problem. That second assumption does not appear in the earlier results; you have to add these two assumptions specially. Once you do, you can recover the sparsity pattern with the fully corrective algorithm. You run until you have selected k features, with k a little larger than k̄, the true sparsity -- you have to take more than k̄ steps to get this kind of result, but not too many more: a constant factor times k̄ -- and then you get the following flavor: the distance from w_k to w̄ is bounded in terms of k̄ and epsilon; it does not depend on k. Actually, there should be a big O here -- the slide is too optimistic. Once you have that, you get recovery: for example, if epsilon is 0 -- meaning your target is sparse and is the global solution of the unconstrained problem -- then you recover w̄ exactly with the greedy algorithm. That's the flavor of the result, though it is suboptimal for feature selection. Briefly, the proof: you first establish a stronger version of the key one-step progress inequality -- going from w_{j-1} to w_j, you bound how much each greedy step decreases the objective, and under this assumption the statement is stronger than before. Then you manipulate it a little to show that, at each step, either the relevant gap decreases sufficiently or it is already small -- one of the two has to happen; and if the gap goes to 0, you have recovered the original sparse signal. That's the rough proof idea. Another thing I'll mention is that you can improve this using local search. Here is the local search; now this really is a sparsity constrained algorithm. Define the optimum at sparsity k̄: the minimum of Q(w) over k̄-sparse w -- the best you can achieve with a k̄-sparse target. Then, for some fixed k, the local search repeats the following until the decrease of Q is no more than, say, epsilon. You start with an arbitrary set of k features. You add one feature, making a (k+1)-sparse vector, with a fully corrective step. Then you remove one feature, going from k+1 features back to k: the feature whose removal gives the smallest increase of Q. Either selection rule for the added feature is fine; roughly, you add one and remove one, and keep alternating these two steps. That's the local search replacement strategy. Once you do that, you get this kind of result: under a form of restricted strong convexity, Q(w) is at most the optimum at sparsity k̄ plus O(epsilon). Epsilon is the parameter in the algorithm; letting it go to 0 only affects the speed of the algorithm, not the result, so you can effectively think of epsilon as 0. You are competing with the optimal k̄-sparse solution while your own w is k-sparse. How large must k be? You need k̄ to be smaller than a constant times k, where the constant, between 0 and 1, depends on the conditioning of your loss function. Basically the statement is: I cannot solve the original k̄-sparse problem exactly, but I can solve a slightly worse problem at a slightly larger complexity -- k at least a constant factor larger than k̄. That's the flavor: you still cannot solve the k̄ problem within its own constraint, but you can solve it with a slightly relaxed constraint. That is exactly the answer to the earlier question. And it is globally optimal: the comparison is with the global optimum.
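(A sketch of the replacement-style local search just described: add the best feature, then drop the least useful one, stopping when a round gains less than epsilon. Q_restricted -- the fully corrective fit on a given support, e.g. a least squares solve on the selected columns -- is an assumed helper:)

```python
def local_search(Q_restricted, d, k, eps, max_rounds=1000):
    """Local search (add one, drop one) sketch for fixed sparsity k.

    Q_restricted(S) is assumed to return min_w { Q(w) : supp(w) <= S },
    i.e. the fully corrective fit on the support set S.
    """
    S = set(range(k))                      # arbitrary initial support
    q = Q_restricted(S)
    for _ in range(max_rounds):
        # forward: add the single feature giving the biggest decrease
        add = min((j for j in range(d) if j not in S),
                  key=lambda j: Q_restricted(S | {j}))
        S1 = S | {add}
        # backward: drop the feature whose removal hurts least
        drop = min(S1, key=lambda j: Q_restricted(S1 - {j}))
        S_new = S1 - {drop}
        q_new = Q_restricted(S_new)
        if q - q_new <= eps:               # stop when progress is below eps
            break
        S, q = S_new, q_new
    return S
```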
So feature selection for a sparse signal is still suboptimal here, in the sense that k is always a constant factor larger than k̄, and that cannot be reduced, at least with this style of analysis. The high-level idea of the proof: when you add one feature, the objective decreases by at least c1 times a progress quantity; when you remove one, the objective increases -- not improves -- by at most c2 times the corresponding quantity. Then you compare the two: as long as you are making progress, the decrease per step exceeds the increase per step, so the net change is a strict decrease until the progress quantity becomes negative. So you keep iterating until you reach the stated guarantee. That's the rough idea; the rest is details. Then there is a more aggressive local search -- this work is actually earlier than that paper, and it is more targeted at feature selection. You minimize an L0-regularized objective of the form Q(w) + lambda ||w||_0, and define a two-step local condition for reducing it. Forward addition: you add one feature, so the penalty term increases by lambda while the loss decreases. Backward deletion: the loss increases while the penalty decreases by lambda. The idea is to alternate additions and deletions to reduce the objective. It has the flavor of the previous algorithm, except that one uses a hard sparsity constraint and this one a soft penalty. With the soft penalty there is one more idea you need: to track the sparse solution, you start with lambda large and then gradually decrease it -- this turns out to be necessary. You cannot start with a very small lambda, because then the L0 term grows quickly until you are forced into backward steps, which is not good. So you start large and gradually decrease, and combining these two ideas gives the algorithm. You have a forward step: add one feature and check the error reduction; if it is small, you are done. And you have a backward step: check whether the squared error increase from deleting a feature is at most half of the last forward decrease. If a forward step reduced the error by delta_k and a backward step increases it by at most delta_k / 2, then after the forward and backward pair you are still better off than when you started, by at least delta_k / 2. So the backward step guarantees that whenever you go back to an earlier sparsity level, you have strictly improved the objective. You repeat this until you cannot anymore, and you can show it terminates in polynomial time -- polynomial in 1/epsilon.
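(A sketch of the forward-backward idea with the delta/2 backward rule described above. The bookkeeping here is simplified relative to the actual forward-backward procedure -- in particular, how backward steps are matched to recorded forward gains -- so treat it as illustrative; Q_restricted is the same assumed helper as before:)

```python
def foba(Q_restricted, d, eps):
    """Forward-backward greedy sketch.

    Forward: add the best single feature; record the decrease delta;
    stop when delta < eps. Backward: delete a feature whenever the
    deletion raises the objective by at most half the matching forward
    gain, so each forward/backward pair still nets >= delta/2 progress.
    """
    S, gains = set(), []
    q = Q_restricted(S)
    while len(S) < d:
        # forward step: best single addition
        add = min((j for j in range(d) if j not in S),
                  key=lambda j: Q_restricted(S | {j}))
        q_new = Q_restricted(S | {add})
        delta = q - q_new
        if delta < eps:
            break
        S, q = S | {add}, q_new
        gains.append(delta)
        # backward steps while they are cheap enough to undo
        while len(S) > 1 and gains:
            drop = min(S, key=lambda j: Q_restricted(S - {j}))
            q_drop = Q_restricted(S - {drop})
            if q_drop - q > gains[-1] / 2:   # too costly to undo
                break
            S, q = S - {drop}, q_drop
            gains.pop()
    return S
```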
Then you get a sparse recovery result for this algorithm. I'll give only the very high level, comparing with the earlier results that have no backward step -- the fully corrective setting, with sparsity k̄ and similar assumptions. The difference is that with the backward step you can do better in terms of this quantity G. G should count the ambiguous features -- actually, unambiguous; there seem to be a lot of typos on this slide, and this inequality should read "larger." Sorry about that. The point is: if the target is not only sparse but each nonzero component is relatively large, then the number of ambiguous features -- those with small weights -- is small, and your error is stated in terms of that number. So: if the set of small, near-zero weights is small -- most weights are bounded well away from zero -- then the bound depends only on the small weights, and this quantity can be much smaller than k̄. In that case this algorithm does better. Intuitively, here is an example where it is more effective, in the sense of using fewer features to solve the problem, than fully corrective forward greedy. A simple example: features F1 and F2, with the target Y a linear combination of F1 and F2. You also have F3, F4, F5, where F3 is the single feature closest to Y, but Y is not spanned by F3 alone. If you do pure forward greedy, you pick F3 first, and eventually you also pick the correct ones: F3, then F1, then F2. You have made a mistake, in the sense that you added an extra feature that is not necessary. With forward-backward, once you are at F3, F1, F2, at that point you see that Y is already spanned by F1 and F2, and you can remove F3, so you get the exact model. That is where this weight condition comes in -- each time you have to judge whether a weight is small -- and that is what gives you the more compact model. Now let me quickly mention another optimization problem, a little more general than sparsity optimization: the generalized, structured sparsity constrained problem. The motivation is that under the plain sparsity constraint the variables are completely unrelated, but often you want to take neighborhood structure into account. With wavelets, you may want to say: if this coefficient is selected, that one should be too -- they follow a tree structure, or a neighbor structure, for example. So you introduce a cost function more sophisticated than merely counting the nonzeros, one that encourages the expected configurations: if the selected variables are close together, the penalty is smaller than if they are randomly scattered. Given the support of w, you define a cost function on it: smaller if the support is structured, larger if it is completely random, and you minimize the loss subject to a budget on this cost. The greedy algorithm changes a little: at each step, instead of adding one component, you add a block -- a few components simultaneously. And the key idea is that instead of just asking that Q(old) minus Q(new) be large, you look at the new support formed by adding the block to the old one: your complexity increases while your objective decreases, and you want the largest decrease of the objective per unit increase of complexity. That's the intuition -- see the sketch below.
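(A sketch of the block-greedy rule for structured sparsity: among candidate blocks, pick the one with the largest decrease of the objective per unit increase of the coding cost c(S). The blocks list, the cost function, and Q_restricted are all assumed inputs -- for wavelets, blocks might be tree nodes or spatial neighborhoods:)

```python
def structured_greedy(Q_restricted, blocks, cost, budget):
    """Structured-sparsity greedy sketch.

    Each step adds the block with the best decrease of Q per unit
    increase of cost, then relies on Q_restricted for the fully
    corrective refit on the enlarged support -- the generalization of
    fully corrective greedy described above.
    """
    S = frozenset()
    q = Q_restricted(S)
    while cost(S) < budget:
        best, best_ratio = None, 0.0
        for B in blocks:
            S_new = S | frozenset(B)
            dc = cost(S_new) - cost(S)
            if dc <= 0:
                continue
            gain = (q - Q_restricted(S_new)) / dc   # progress per unit cost
            if gain > best_ratio:
                best, best_ratio = S_new, gain
        if best is None:        # no block makes progress
            break
        S = best
        q = Q_restricted(S)
    return S
```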
So the generalization of the greedy step is: maximize the decrease of the objective function per unit increase of cost. Here is what you can say about it. Roughly, you find a w with complexity c(w) that competes with targets of complexity s. The idea is similar to the recovery results: you want to find a w which competes with the best possible value at complexity s, while c(w) is larger than s but not by much. How much larger must c(w) be? Under a form of restricted convexity, c(w) needs to exceed s by a factor logarithmic in 1/epsilon; with more assumptions, you can show c(w) only a constant factor larger than s. The interpretation: you compete with complexity-s solutions using a not-much-more-complex solution -- ignoring the log factor, c(w) is a constant factor larger than s. So that's the kind of result. You can do local search and so on for this type of problem as well, but this is basically the generalization of fully corrective boosting, using the structure. Here is an example showing that using the structure actually helps -- and that is the point of introducing the new optimization procedure: you want to take advantage of the structure, and when you do, the performance improves. In this example, this is the original image; this is the plain greedy algorithm, not using the structure; this is convex relaxation; and this is what you get when you really use the structure -- and similarly for the second case. So when you use the structure, you can actually do better. That is why we want to put structure on top of sparsity, and why the sparsity optimization tool has to become a structured sparsity optimization tool. Basically, in summary: I talked about variations of the greedy algorithm and their theoretical properties. I started with the most classical one, going back to the 50s: the Frank-Wolfe algorithm, the L1-constrained greedy algorithm. The problem is that implementing it depends on the constraint size A and on a learning rate, which is not good. More practical is greedy boosting, Friedman's algorithm, which removes the dependence on A -- you don't need to worry about A, which makes things much easier -- but it still depends on a learning rate, the step size in Friedman's scheme. Then you can remove this eta and make the method fully adaptive with fully corrective boosting: on top of Friedman's algorithm, you do a full optimization within the active set of selected features. That gets rid of both the A problem and the eta problem. On top of that, you can do local search, which gives you better feature selection -- as in the intuitive example, you get a more compact solution than without local search: purely forward greedy gives a less compact selection than when you include the backward step, and that helps with feature selection. Then I talked about the generalization: instead of the plain sparsity constraint, the more complex structured sparsity optimization, a slightly more general optimization problem.
It's harder than the original problem. I talked about a specific algorithm, the generalization of fully corrective greedy, but you can study the other variants in this framework similarly. Finally, I want to say a little about sparsity optimization in general. Personally I'm more interested in greedy algorithms and local search, because I think, at least for this kind of problem, attacking the sparsity constraint directly is the more powerful approach. A lot of people are more interested in convex relaxation, and I think the reason is probably that convex relaxation is easier to understand: you just write down the formula. Convex relaxation is not a procedure, it's an optimization problem -- you change the optimization problem -- and there is a genuine advantage in that you can use different procedures to solve it. The disadvantage is that it's arguably less flexible: with greedy algorithms you have all the variations, and they attack the underlying L0 problem more directly than L1 does. L1 is just one relaxation -- you cannot relax to L0.5 and stay convex -- so you're basically stuck with it. And when you do think about optimization: the relaxation itself is not an optimization procedure, it's just a formula; once you look at specific optimization procedures for it, they turn out to be greedy-like. So from that point of view, you can study that particular optimization procedure and expand from there to other greedy algorithms, which attack the underlying problem more directly. Of course, the advantage of L1 is that you don't have to use any particular procedure -- you can bring other solvers to the problem. But then you have the limitation that you are essentially equivalent to a specific kind of greedy algorithm. So that's it. [applause]. >>: In your Frank-Wolfe example, there are two formulations: in one you jointly minimize over the component e_i and the step size eta; in the other you fix some eta_j and then just find the best e_i. >> Tong Zhang: Wait a second -- you mean the L1-constrained one? >>: The first one. >> Tong Zhang: Let's look at that. >>: Go forward a little bit. Backward. Backward. Here. >> Tong Zhang: Okay. Right. >>: Here you have the pair (i, eta_j) -- that's a joint minimization over eta and e_i. And in the next algorithm -- >> Tong Zhang: That one doesn't have A. >>: The other one has a fixed step. >> Tong Zhang: This is actually wrong -- I think I copy/pasted; there are a lot of typos, I notice. >>: So the alternative is: you fix eta_j and choose the best e_i? >> Tong Zhang: You can optimize that. It doesn't make too much difference, because you can also pick a specific eta_j here. >>: In greedy boosting, essentially the algorithm first finds the best e_i and then does a line search to find eta_j. >> Tong Zhang: AdaBoost does that, you're right. AdaBoost would be equivalent to jointly finding i and eta -- its step size is equivalent to optimizing i and eta simultaneously in this particular form, which gives you exactly that eta. >>: My question was just that it would be easy to first find the direction e_i and then do a line search for eta_j. >> Tong Zhang: You could. Friedman does fix it -- you can do that version. Let me show the fully corrective version: you can do the greedy selection in different ways.
There are several different ways to find i without using the step size. This is one; another is to solve a least squares problem -- I didn't say much about that, because I focused on other things, but they're all fine. Here the rule is to find the maximum gradient; there are different versions of that, and once you have the direction, you can pick a particular step size. Replacing this rule by finding the max gradient is fine, and then you optimize. But that is still not fully corrective, because you are only optimizing eta. To be fully corrective you have to optimize not only eta but everything in the active set. Yes, you can do that. When I wrote this down, it was not the only version -- there are variations -- but I didn't really go into them, because I'm focusing on the other side of the story, not the variations of implementing the greedy step. >>: A question -- can you go back to those compression pictures? >> Tong Zhang: Which? >>: The two images that you showed. >> Tong Zhang: The images -- the image recovery. >>: You're doing compression, right? >> Tong Zhang: It's image recovery. You send some random projections of the image, and from the random projections you want to recover the original image -- a kind of compression. >>: You're learning the projection? >> Tong Zhang: No, it's a random projection, random projection. >>: And from the random projections back to the original. >> Tong Zhang: Yes, you recover the image: as in compressed sensing, you randomly project down to a small set of measurements, and given those you go back. >>: Is this the best method for doing this? >> Tong Zhang: People talk about structured sparsity a lot nowadays, so I don't know if it's the best. There are convex versions and other approaches. Of course we would like to say ours is the best, but I don't think that's necessarily true; it's just one approach. The advantage of ours is that we can prove things -- I don't see many others doing that; they give formulations, say they're good, and maybe show you a picture like this, but without these guarantees. So there are other versions; this is not the only one. That's why I said I'm not giving a comprehensive review of the literature. >>: It's kind of like -- >> Tong Zhang: Huh? >>: It's kind of like doing [indiscernible]? >> Tong Zhang: That I don't know -- maybe. This one is not really related to kernels, but maybe it is. >> Dengyong Zhou: Okay. [applause]