>> Dengyong Zhou: It's my great pleasure to host Professor Tong Zhang from Rutgers
University. Tong received a BA in mathematics and computer science from Cornell University
in 1994 and a Ph.D. in computer science from Stanford University in 1998. After graduation he
worked at the IBM T.J. Watson Research Center and then at Yahoo! Research Labs in New York
City. He is now a professor in the statistics department at Rutgers University, and his interests
include machine learning algorithms as well as mathematical analysis and its applications. He
has been here a couple of days and has offered meetings; if you want to meet with him, just
send me an e-mail. Today he'll talk about greedy algorithms for sparsity constrained
optimization.
>> Tong Zhang: Okay. Thanks for the introduction. So today I want to talk mostly about some
more theoretical material in machine learning optimization. Here's the motivation. In machine
learning, people are interested in learning problems with a large number of features. If you
don't do anything about that, you run into the problem called overfitting. So what people do is
assume some structure on the target. One such structure is sparsity, which basically means that
your target is a linear combination of a small number of features.
In that setting you can formulate the learning problem as a sparsity constrained optimization,
which I will show a little bit later.
Generally the problem with this optimization is that it's an NP-hard problem, so people have to
resort to approximation algorithms. In this talk in particular I want to talk about greedy
algorithms. This is probably a little bit dry in the sense that it's more theoretical: I'll talk about
variations of the greedy algorithm in the context of sparsity constrained optimization, with a
focus on the theoretical analysis.
I know there are people in the theoretical computer science literature also doing greedy
algorithm analysis, but they make different assumptions, basically submodularity, which is not
quite the same as sparsity; the sparsity setting is not submodular. I'm not that familiar with that
literature, but having looked at it briefly, I think sparsity needs somewhat more involved
techniques than the computer science style of greedy analysis, because something you have to
prove here is essentially given for free there: once you assume submodularity, or some similar
assumption, you already have the key inequality needed to prove things, whereas here it does
not hold and must be established.
Anyway, here's the sparsity constrained optimization. The formulation: you want to minimize
some loss function Q(W), which we take to be convex for now, over a parameter vector W,
subject to W being sparse. Sparse means the number of nonzeros in W is smaller than some
budget, say K.
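As a worked equation, here is my transcription of that formulation (the notation w, Q, k is mine, chosen to match the talk):

```latex
\min_{w \in \mathbb{R}^d} \; Q(w) \quad \text{subject to} \quad \|w\|_0 \le k ,
```

where \(\|w\|_0\) counts the nonzero entries of w.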
An example is logistic regression for binary classification. There the loss function is the
negative log-likelihood of W given the observed training data; you minimize that negative
log-likelihood subject to the sparsity constraint. It's an L0 constraint. So that's the setup for
sparsity constrained optimization, and the question is how to solve it. Generally there are two
types of strategies: one is convex relaxation and the other is greedy algorithms. In convex
relaxation, the standard relaxation is to minimize the same objective, say the logistic or another
convex loss, subject to an L1 constraint instead of the L0 constraint.
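In the same notation, the standard L1 relaxation replaces the L0 budget with an L1 radius (again my transcription, with A denoting the radius):

```latex
\min_{w} \; Q(w) \quad \text{subject to} \quad \|w\|_1 \le A .
```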
The question to ask in this context is: under what conditions is the solution of the relaxation
approximately a solution of the original problem? Establishing that is the kind of theoretical
question to study.
The other approach, which is more what this talk is about, is greedy algorithms. The basic
greedy algorithm tries to find the nonzero coefficients of W, that is, its support set, one by one,
greedily.
In this talk I will cover a little bit of the history and some variations of greedy algorithms in this
context. In particular I will emphasize the theoretical aspects, so I won't show many examples.
I also focus on my own research; there are a lot of other people doing this, so don't be offended
if you find that I mostly mention my own work. I'll give a rough picture but probably not a
comprehensive list of other people's work, mainly because this is my talk. Now, the types of
greedy algorithms: I want to say there are different variations, and there are different ways to
classify what kind of greedy algorithm you are doing. One axis is the loss function. A lot of
people in the signal processing community use the least squares loss; that kind of procedure is
called matching pursuit. With a general loss function, in machine learning it's called boosting.
So you can classify by the loss function: least squares is the simpler case and boosting is more
general. Another axis is the regularization. The traditional analysis actually includes an L1
constraint; I will show you why, and give you some results that derive theoretical guarantees
with the L1 constraint included. You can also ask what happens if you don't include the L1
constraint. The third axis is optimization within the active set: once you select the features, do
you do a full optimization within that set or not? That's non-corrective versus fully corrective,
and again there are theoretical differences, which is also part of what I'll explain.
The last axis is the search criterion. The traditional greedy algorithm only goes forward. What
I call local search can go forward or backward: basically you try to find a fixed point using
forward and backward steps instead of a strictly forward greedy procedure. So you can view
greedy algorithms along these different axes. There are also extensions; in particular I'll talk
about an extension to structured sparsity optimization, where you have a more general form of
the optimization problem. Nonlinearity, in fact, is not a big issue for boosting: although I talk
about the linear case, you can easily handle nonlinear function classes, so it's not even an
extension; you just need to arrange your algorithm carefully, even though everything is
presented in the context of linear models. As for related work, I said the greedy algorithm
analysis in the theory literature uses different assumptions, so this line of work is somewhat
different.
For the history, I will mainly emphasize the theoretical analysis and the different ways to look
at greedy algorithms, basically the different variations, depending on how you classify them.
The first is greedy with an L1 constraint: instead of minimizing Q(W) subject only to sparsity,
you minimize Q(W) subject to the L1 norm of W also being small, so not only is W sparse but
its L1 norm is bounded.
This was studied a long way back, with the Frank-Wolfe algorithm, and then rediscovered later.
That early work got largely forgotten in optimization, and it was mainly brought back up by
Clarkson; after that a lot of people followed up and there are a lot of works, which is why I'm
saying I'm not giving a comprehensive listing. In this talk I want to say a little bit about my
own analysis of this setting, which is slightly different from the original; it's a variation. When
I did that work I didn't know about this earlier line; I was more focused on extending the
boosting work.
Then there's a survey paper; there is a long history. Next, least squares without the L1
constraint: Q(W) becomes the least squares loss, you remove the L1 constraint and keep just the
sparsity constraint. In signal processing this is matching pursuit. This is not my work, and it's a
different setting. They have some convergence analyses, of various kinds; I'll comment on that.
Once you use the fully corrective version, orthogonal matching pursuit, you can get a rate of
convergence, not just convergence; I'll say a little more about that when I get to the results.
Then in the signal recovery community there is work on feature selection, and I did some work
on feature selection with stochastic noise, which is a bit of an extension of that. But I'm not
going to talk about any of these in the least squares setting; I will talk about the more general
setting.
>>: Question. So this convergence analysis, is it about convexity properties --
>> Tong Zhang: It's nonconvex.
>>: Is it convergence to a local solution?
>> Tong Zhang: Actually, the convergence statement is for K going to infinity; then you have
convergence, and you find the solution of the unconstrained problem. The dictionary size can
even be infinite.
>>: With infinitely many components, sparsity is not a constraint --
>> Tong Zhang: It's not, it's not. I will talk about it later when we show the analysis. In fact
you want to ask what the size of K is eventually. The flavor of result is this: since I cannot
solve the problem at sparsity exactly K, I replace K by a slightly larger K prime, so the
complexity is a little bit larger, and then I can get an error as good as the optimal error at
sparsity K. So you define the optimal objective error at sparsity K, and then I replace K by K
prime. That's the flavor of the analysis; I will show you later and then it makes more sense. Of
course, if you just play with the constraint itself you will only get a local statement, but this
analysis is all global; the point is you don't want convergence to a local solution, so you have to
change the style a little bit. But that's a good question; we will see the details later.
Now the general loss function. I will talk a little bit about these three works, on sparse solutions
when minimizing a general convex loss, which is more the setting of boosting. The original
boosting, of course, is AdaBoost with the exponential loss, which relies on a weak learning
assumption, and its convergence result again says that as K goes to infinity you converge.
Friedman has a different version of boosting, without analysis. So we did some analysis more
in this setting; again it's a convergence analysis. By the way, this later work gives the rate of
convergence, but it uses fully corrective boosting; you cannot use the plain version to get that
kind of result. I will explain later how the rate is obtained.
The rate is the more interesting part. And this work also shows that if you do fully corrective
boosting you can get sparse recovery, with additional assumptions; that is more of a compressed
sensing type of result.
Then local search. Here, again, you have the same problem, but you really target the sparsity
level K now. Essentially you use forward/backward ideas, and we'll talk about two variations,
this one and this one. Basically once you add the backward step you get a somewhat better
result, with better feature selection properties; we will see what that means when we look at the
results. This is also related to away steps in the Frank-Wolfe literature, the literature on that
algorithm; but that's with L1 constraints, and here we don't use L1 constraints.
Okay. The other direction is to be a little more general. So far it's minimize Q(W) subject to
sparsity; now replace sparsity by a general cost function. This is structured sparsity: instead of
just the L0 norm you look at the support, the set of nonzeros itself, and define a cost function on
that set. There are some results about extending greedy algorithms to this problem, and there
are other extensions which are different, like eigenvalue problems, where Q may not be convex,
so you have to handle that specially.
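In my notation, a plausible reading of the structured formulation described here is:

```latex
\min_{w} \; Q(w) \quad \text{subject to} \quad c(\operatorname{supp}(w)) \le s ,
```

where supp(w) is the set of nonzero coordinates and c(.) is a cost function on sets; taking c(F) = |F| recovers plain sparsity.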
Anyway, let me give you a flavor of the algorithms and the kind of analysis this produces. Here
is the L1-constrained version. What it does is the following. The constraint is just L1 now; you
remove the sparsity budget K, so K in this case is effectively infinity. But when you produce a
solution with this algorithm, you get a sparse approximation: after K steps the iterate is
K-sparse. The algorithm is: each time you apply a shrinkage to your current W and you add one
additional component, where the additional component is chosen to greedily optimize your cost
function; then you update by adding this additional component to W. So each time you add one
more feature, and after K steps you have a K-sparse W. Here is what you get with this kind of
algorithm: basically the objective at the K-sparse iterate is the L1-constrained optimum plus
C A squared over K.
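My reconstruction of the bound being described, assuming Q has a Lipschitz-continuous gradient (C is the smoothness-dependent constant discussed below):

```latex
Q(w_k) \;\le\; \min_{\|\bar{w}\|_1 \le A} Q(\bar{w}) \;+\; \frac{C A^2}{k} .
```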
Whether you state it with or without the sparsity constraint doesn't make much difference,
simply because the optimal value at K or K prime is always bounded by the constrained
optimum. So you can approximately solve this problem using a sparse solution; that's the
flavor. Of course, it doesn't directly solve the original problem, that's one thing. The other
issue, from the algorithmic point of view, if you want to use it in your learning algorithm, is
that it really relies on knowing A, and also on the step size.
That is a disadvantage. Let me briefly mention the proof, in case there are people doing theory
here. Roughly you need to prove something like the following: for each step from W_{J-1} to
W_J you get a bound of this form. Q(W_J) minus the optimum, which is your excess risk after
the J-th step of the greedy algorithm, is at most the previous excess risk minus a term
proportional to its square. Once you have this one-step result, you just solve the recursive
relationship.
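Writing \(\epsilon_j = Q(w_j) - \min_{\|\bar w\|_1 \le A} Q(\bar w)\) for the excess risk, the one-step bound and the resulting rate are, in my reconstruction:

```latex
\epsilon_j \;\le\; \epsilon_{j-1} \;-\; \frac{\epsilon_{j-1}^2}{C A^2}
\qquad\Longrightarrow\qquad
\epsilon_k \;\le\; \frac{C A^2}{k} .
```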
About the relationship with submodularity: this problem is not submodular, but you have a
similar inequality. Submodularity is by definition an actually stronger inequality than this one:
basically Q(W_J) is smaller than Q(W_{J-1}) minus a term with no square. So this inequality is
what you actually have to prove in the sparsity setting, while in the submodular setting you
assume it. And this is really the key part, not the rest.
>>: A question on the Q's: is that a risk?
>> Tong Zhang: A risk, yes.
>>: Averaged over a distribution? Do you assume a distribution for the test data, or are these
just training risks?
>> Tong Zhang: It is a training risk, it's actually the training risk. Yeah. Like the logistic loss:
this is all about training, so I'm not talking about generalization yet.
>>: This is training.
>> Tong Zhang: Yes, this is optimization, how you optimize the training error. It's purely
optimization; it has nothing to do with generalization. Generalization comes separately: when
you have sparsity you can get good generalization, and that's why you want sparsity. Here you
just say that if you can find a sparse solution, you can get good generalization.
>>: The assumptions: what do you need, mostly a smoothness assumption?
>> Tong Zhang: Yes, you need that for this constant C; C depends on it. Actually, this result
even assumes the derivative is Lipschitz. If the derivative is not Lipschitz, you get a worse rate,
with a square root of K, but you can still derive something. Yeah.
>>: So this is [indiscernible].
>> Tong Zhang: Yeah, yeah. It just depends on the smoothness of Q. But it does depend on
that.
>>: How much --
>> Tong Zhang: Right. It holds for any training data. Generally it's fine: for logistic regression
the constant is small, so it's okay, and it holds on your training data. And the constant really is
an absolute constant; once you compute it, for the logistic loss it is maybe one or two or
something like that. But roughly this is the type of result.
The key point from the algorithmic perspective is that you depend on A. That's not nice,
because you have multiple tuning parameters, K and A. Really you want to tune one parameter;
two parameters is okay, but there are other issues when you actually implement it. You could
implement boosting using this, but I haven't really seen anybody try it, mainly because it's
complex.
The boosting algorithm people actually try is more of this flavor: you get rid of the L1
constraint. Here is the algorithm, without the constraint, with a small step size. It's a slight
variation of Friedman's boosting; Friedman's is gradient boosting, and you can do gradient
boosting and the analysis also holds. Essentially you don't have A now: you are just adding one
component, and you don't shrink W, which makes things a lot easier. Before, you shrink W and
you have A in the update; here you don't shrink, you just add something.
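A minimal sketch of this small-step variant, for the least squares loss (my own illustration, not the speaker's code; the step size eta and the choice of loss are assumptions for concreteness):

```python
import numpy as np

def greedy_boosting(X, y, steps, eta=0.1):
    """Small-step greedy boosting / matching pursuit (illustrative sketch).

    Each round picks the coordinate with the largest-magnitude gradient of
    the least squares loss and moves it by a fixed step eta.  There is no
    shrinkage of the current w and no L1 radius A, unlike the
    L1-constrained variant above.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n        # gradient of 0.5*||Xw - y||^2 / n
        i = int(np.argmax(np.abs(grad)))    # greedy coordinate choice
        w[i] -= eta * np.sign(grad[i])      # add a small step; no shrinkage
    return w
```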
Then there's some theory, although so far, as I said, this theory is just convergence itself. The
convergence is not trivial, in the sense that the dictionary can be very, very large, yet the
convergence does not depend on the size of the dictionary at all. That's why it's not a trivial
result; it's not completely obvious that you can avoid that dependence, so you really have to
prove it. The key is that all of these results do not depend on the size of the dictionary.
The drawback is that you have to pick the learning rate, and I cannot compute the convergence
rate. So you make another modification, the fully corrective version, to get the rate. This is
fully corrective boosting. For the selection you can use the original rule or the gradient; it
doesn't make much difference whether you pick the largest gradient or the component which
gives you the largest decrease of the loss, and it doesn't change the analysis. The key is the next
step: once you have selected and added one element to your active set, you do a full
optimization within that set. This step does not exist in AdaBoost or in gradient boosting, and it
actually makes a difference in the theory.
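A minimal sketch of the fully corrective step, again for least squares (my illustration; in this case the full optimization within the active set is an exact least squares refit):

```python
import numpy as np

def fully_corrective_greedy(X, y, k):
    """Fully corrective greedy selection (illustrative sketch).

    Select by largest gradient magnitude, then re-optimize ALL weights on
    the active set.  The full refit is the step that removes the step size
    eta and the dependence on the L1 radius A from the analysis.
    """
    n, d = X.shape
    w = np.zeros(d)
    active = []
    for _ in range(k):
        grad = X.T @ (X @ w - y)            # gradient at the current iterate
        if active:
            grad[active] = 0.0              # consider only new coordinates
        active.append(int(np.argmax(np.abs(grad))))
        # full optimization within the active set: exact refit
        w_act, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        w = np.zeros(d)
        w[active] = w_act
    return w, active
```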
Once you do that, you get somewhat better results; you can also get sparse recovery, which I'll
come to. Essentially you get this flavor of result: the bound holds adaptively for all comparators
W, scaling with the comparator's L1 norm squared over K. The key point is that the original
L1-constrained result is not adaptive: you have to know A to get a similar result. Here you get a
similar result but you don't need to know A; once you do this full optimization, the bound
adapts to A for free, and you don't have a learning rate. Earlier you also had the small step size,
the learning rate constraint; here you don't need it, and you get a convergence rate.
>>: With this step, do you throw away the previous weights, just fix the working set and
relearn the weights?
>> Tong Zhang: Yeah, exactly. That's fully corrective. And then you get adaptive results.
Next I'm going to talk about the backward step. The main observation is that this fully
corrective method is good, but for sparse recovery it's not necessarily optimal; you can do better
with a backward step, as I will explain a little later. The rough proof idea is that you get an
inequality similar to the one we saw before, but it doesn't depend on A: instead of depending on
A, as in the L1-constrained setting, it depends on an arbitrary target, which can be anything as
long as it is K-sparse, and from that you get the result.
Now, sparse recovery. You can use this fully corrective greedy algorithm to do sparse recovery,
which is targeted directly at the sparsity constraint; the L1-regularized version relates only
loosely to it, because there you have L1 regularization, while in sparse recovery you don't have
L1, you have L0. Roughly the following: you need a somewhat stronger assumption, which I
don't want to go into, but which makes more sense in comparison with learning. You have a
K-bar-sparse target signal, so its L0 norm is small, and in addition you assume this target is
close to a global optimum of the unconstrained problem. That assumption didn't appear in the
earlier results; you have to add these two assumptions specifically. Once you add them, you
can recover the sparsity pattern using the fully corrective algorithm. Roughly: you run for K
steps, where K is a little bit larger than K bar, the original sparsity.
You have to take more steps than the K-bar-sparse target would suggest in order to get this kind
of result, but not too many more: K is a constant times larger than K bar. Then you get this
flavor of bound: the distance from W_K to W bar is small; it doesn't depend on K but on K bar
and epsilon. Actually, there should be a big O here; I was too optimistic on the slide. And once
you have that, you get recovery: for example, if epsilon is 0, which basically means your target
is sparse and is the global solution of the unconstrained problem, then you can recover W bar
exactly using the greedy algorithm. So that's the flavor of the results.
But it's not optimal for feature selection. Let me briefly say how the proof goes. Roughly, the
key is again a one-step progress bound: from W_{J-1} to W_J, you bound how much each
greedy step decreases the objective function, and under this stronger assumption you get a
stronger result than before. From that you manipulate a little bit to show that, at each step,
either the objective gap is already sufficiently small or this error term goes to 0. One of the two
has to happen; and if the error term goes to 0, you have already recovered the original sparse
signal. That's the rough proof idea.
Another thing I'll mention is that you can improve this using local search. Here is the local
search; this one really is a sparsity constrained algorithm. Define the optimal value at sparsity
K: the minimum of Q(W) you can achieve with a K-sparse W, in other words the best you can
do with a K-sparse target. That's the optimization problem at sparsity level K.
Then, with some fixed K, the local search does the following: repeat these steps until the
decrease of Q is no more than epsilon, say. You start with an arbitrary set of K features. You
add one feature, making a (K+1)-sparse vector, using the fully corrective step. Then you
remove one feature, going from K+1 features back to K features: you remove the one whose
removal causes the smallest increase of Q. Either selection rule is fine. Roughly, you add one
and remove one, and you just keep repeating these two steps.
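A minimal sketch of this add-one/remove-one replacement strategy for least squares (my illustration; the exhaustive search over candidates is for clarity, not efficiency, and it assumes d > k with k at least 1):

```python
import numpy as np

def _fit_loss(X, y, S):
    """Fully corrective refit on the feature set S; returns the loss."""
    w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    r = X[:, S] @ w - y
    return 0.5 * float(r @ r)

def local_search(X, y, k, eps=1e-8):
    """Replacement local search (illustrative sketch).

    Repeat: add the single feature that lowers the refit loss the most,
    then delete the feature whose removal raises it the least, until a
    full add/remove round improves the objective by no more than eps.
    """
    d = X.shape[1]
    active = list(range(k))                 # arbitrary initial k features
    while True:
        before = _fit_loss(X, y, active)
        add = min((j for j in range(d) if j not in active),
                  key=lambda j: _fit_loss(X, y, active + [j]))
        active.append(add)                  # now (k+1)-sparse
        drop = min(active,
                   key=lambda j: _fit_loss(X, y, [a for a in active if a != j]))
        active.remove(drop)                 # back to k features
        if before - _fit_loss(X, y, active) <= eps:
            return active
```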
So this is a local-search replacement strategy. Once you do that you get the following kind of
result. Under some form of restricted strong convexity, you can find a K-sparse W with Q(W)
smaller than the optimal value at sparsity K bar plus O(epsilon). Epsilon is the tolerance in the
algorithm; you can let it go to 0, since it only affects the speed of the algorithm and not the
result itself, so effectively think of epsilon as 0. You want to compete with the optimal
K-bar-sparse objective while your own solution is K-sparse. So how do K and K bar relate? K
bar has to be smaller than a constant times K, where the constant, between 0 and 1, depends on
the conditioning of your loss function. Basically you're saying: I cannot solve the original
K-bar problem, but I can solve a slightly relaxed problem with a little larger complexity; the
sparsity K is at most a constant factor larger than the K-bar-sparse solution. That's the flavor of
this result. So indeed you still cannot solve the optimal K-bar problem within its own
constraint, but you can solve it under a slightly relaxed constraint. That actually connects to
your earlier question; this is the flavor. And it is globally optimal, in the sense that the
comparison is with the global optimum. It remains suboptimal for feature selection of sparse
signals: K is always larger than K bar by a constant factor, and that cannot be reduced, at least
in this style of analysis.
Let me give a little bit of the high-level proof idea, so you see how it goes. You add one
feature, and you lower bound how much the objective decreases: the decrease is at least this
quantity, for some constant C1. You remove a feature, and the objective increases, not
improves: it increases by at most this quantity, for some constant C2. Then you combine the
two. If the decrease term is larger than the increase term, you make progress at each round,
because you decrease more than you add back. With some manipulation you show that, as long
as the first term is larger than the second, your net decrease per round is bounded below, unless
the gap term becomes negative. So you just keep iterating until that happens. That's the rough
high-level idea; the rest is details. Now a more aggressive local search. This work is actually
earlier than that paper, but it's more targeted at feature selection.
You try to minimize an L0-regularized objective of this form, and you define a local condition,
basically a two-step local condition, for reducing this L0-regularized objective function. One
move is addition: in a forward step Q(W) decreases while the penalty term increases by lambda,
since you're adding one feature. The other move is feature deletion: Q(W) increases while the
penalty decreases by lambda. The idea is to alternate adding and deleting to reduce the
objective. It has the same kind of flavor as the previous algorithm, except that this one uses a
soft constraint, the L0 penalty, where the other used a hard constraint. But with the soft
constraint there's another idea you need: you have to track the sparse solutions along a
continuation path, which turns out to matter. You start with lambda large, then gradually
decrease lambda; this also has to be done. You cannot start with a very small lambda, because
if you do, the L0 term quickly increases before you start to take backward steps, which is not
good. So you have to start with a very large lambda and gradually decrease it.
Combining these two ideas, you get this kind of algorithm. You have a forward step with an
error reduction: you add one feature, then you check whether the reduction is small; if it is
small, you are done. Then you have a backward step: you check whether the squared error
increase from deleting a feature is more than half the last forward gain. If the forward step
reduced the error by delta_k but the deletion increases it by at most delta_k over 2, you have
taken one forward and one backward step and you are actually better off than where you
started, because the objective has decreased by at least delta_k over 2. Therefore the backward
step makes sure that when you go forward and then back to an earlier stage, you have always
improved your objective function. You just keep repeating this until you cannot anymore, and
you can show the procedure terminates in polynomial time; that's always the case, polynomially
in the tolerance.
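A minimal sketch of this forward-backward rule for least squares (my illustration; delta is the stopping tolerance, and the delta_k / 2 test is the backward criterion just described):

```python
import numpy as np

def _loss(X, y, S):
    """Squared error after a fully corrective refit on the set S."""
    if not S:
        return 0.5 * float(y @ y)
    w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    r = X[:, S] @ w - y
    return 0.5 * float(r @ r)

def foba(X, y, delta=1e-4):
    """Forward-backward greedy (illustrative sketch).

    Forward: add the feature with the largest error reduction delta_k;
    stop when delta_k <= delta.  Backward: delete any feature whose
    removal raises the error by at most delta_k / 2, so a forward step
    plus a backward step still nets a decrease of at least delta_k / 2.
    """
    d = X.shape[1]
    active = []
    while True:
        cand = [j for j in range(d) if j not in active]
        if not cand:
            break
        j_best = min(cand, key=lambda j: _loss(X, y, active + [j]))
        delta_k = _loss(X, y, active) - _loss(X, y, active + [j_best])
        if delta_k <= delta:                 # forward gain too small: done
            break
        active.append(j_best)
        while len(active) > 1:               # backward while deletion is cheap
            i = min(active,
                    key=lambda i: _loss(X, y, [a for a in active if a != i]))
            inc = _loss(X, y, [a for a in active if a != i]) - _loss(X, y, active)
            if inc <= delta_k / 2:
                active.remove(i)
            else:
                break
    return active
```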
Then you have a sparse recovery result for the signal. I'm just giving the very high level,
compared to the earlier results without the backward step. So again it's similar to the setting of
the fully corrective result, shown here, at sparsity K and under similar assumptions. But with
the backward step you can do better in terms of this quantity G. There seem to be a lot of typos
on this slide: the inequality should read "larger", and it should say "unambiguous" features;
sorry about that. The point is this. Suppose not only that the target is sparse, but also that most
nonzero components are relatively large. Call a feature ambiguous if its weight is small, not
bounded away from 0; then the number of ambiguous features is small, and your error is in
terms of that number instead of K bar. So if a lot of the weights are well bounded away from 0,
the set of small nonzero weights is small, and the error only depends on those small weights.
This quantity can be no more than K bar, but it can be much smaller.
In that case this algorithm does better. Intuitively, for feature selection you are more effective,
in the sense that you use a smaller number of features to solve the problem, in cases like the
following simple example with the fully corrective forward method. You have features F1 and
F2, and your target Y is a linear combination of F1 and F2. You also have F3, F4, F5, where F3
is the single feature closest to Y, although Y is not spanned by F3 alone. If you just do forward
greedy, you'll pick F3 first; eventually you pick the correct features, so you get F3, then F1,
then F2. In that case you have made a mistake, in the sense that you added an extra feature
which is not necessary. When you do forward-backward, once you reach F3, F1, F2, at that
point you see Y is already spanned by F1 and F2, and you can remove F3, so you end up with
the exact model. It gets better. And this is where you need that condition, because each time
you have to judge whether a weight is small; and this gets you a more compact model.
Now I'll quickly mention another optimization problem, a little more general than sparsity
optimization: the generalized sparsity constrained problem. The reason for it is that under the
plain sparsity constraint the variables are completely unrelated; the constraint takes no account
of neighborhood structure. In wavelets, for example, you may say that if this coefficient is
selected, that one has to be selected too: you follow a tree structure, or a neighborhood
structure. In this kind of setting you want to introduce a different cost function than merely
counting the number of nonzeros: a more sophisticated function which says, if the selected
variables are close together, I will give a smaller penalty than if they are just randomly
scattered. So, given the support of W, you define a cost function; this cost function is smaller if
the selected variables are close together and larger if they are completely random.
You want to minimize the loss subject to this cost function being small, and this requires a
small change to your greedy algorithm. Each time, instead of adding one component, you add a
block: a few components simultaneously. The key idea is that instead of only asking that Q(old)
minus Q(new) be large, you look at the new support set you form by adding the block to the old
set: your complexity increases while your objective decreases, and you want the largest
reduction of the objective per unit increase of complexity. That's the intuition; it's the
generalization of the greedy algorithm.
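A minimal sketch of this block-greedy rule (my illustration; `blocks` and the set cost `cost` are assumed inputs encoding the structure, e.g. wavelet tree nodes or spatial neighborhoods):

```python
import numpy as np

def structured_greedy(X, y, blocks, cost, budget):
    """Structured greedy selection (illustrative sketch).

    Each round adds the candidate block of features that maximizes the
    decrease of the squared error PER UNIT increase of the support cost
    c(F), generalizing the one-feature-at-a-time greedy step.
    """
    def loss(S):
        S = sorted(S)
        if not S:
            return 0.5 * float(y @ y)
        w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        r = X[:, S] @ w - y
        return 0.5 * float(r @ r)

    support = frozenset()
    while True:
        best, best_ratio = None, 0.0
        for b in blocks:
            new = support | frozenset(b)
            dc = cost(new) - cost(support)   # complexity increase
            if dc <= 0 or cost(new) > budget:
                continue
            ratio = (loss(support) - loss(new)) / dc
            if ratio > best_ratio:           # best gain per unit cost
                best, best_ratio = new, ratio
        if best is None:                     # no admissible block helps
            break
        support = best
    return sorted(support)
```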
Maximize the decrease of the objective function per unit increase of cost. Here is what you can
say about that. Roughly, you find a W with complexity c(W) that competes with the best target
of complexity S. The idea is again a recovery statement of similar form: you want to find W
which, allowing complexity c(W) larger than S, competes with the best possible solution of
complexity S. So how large does c(W) have to be? Here is a bound: under some restricted
convexity, c(W) needs to be larger than S by a factor logarithmic in 1 over epsilon. If you make
more assumptions, you can say c(W) only needs to be a constant factor larger than S. The
interpretation is that you compete with the best solution of complexity S using a solution that is
not much more complex: ignoring the log factor, the solution W has complexity c(W) a
constant times larger than S. So that's the kind of result. You can do local search and so on for
this type of problem as well, but what I described is basically the generalization of fully
corrective boosting.
Using the structure: here is an example showing that using the structure actually helps. I show
this because we introduced a new optimization procedure, and the question is why; the answer
is that you want to take advantage of the structure, and it turns out that when you do, the
performance improves. In this example, this one does not take advantage of the structure. This
is the original image, this is using the plain greedy algorithm, this is using the convex
relaxation, and this one is when you really use the structure; and similarly in this second case.
So when you use the structure, it turns out you can actually do better. That's why we want to
put structure on top of sparsity, and why you have to change the sparsity optimization tool to
handle structured sparsity optimization.
So, in summary, I talked about greedy algorithm variations and their theoretical properties. I
started with the most classical one, going back to the 50s: the Frank-Wolfe algorithm, which is
the L1-constrained greedy algorithm. The problem, if you want to implement this algorithm, is
that it depends on the constraint size A and on a learning rate, which is not good. More
practical is greedy boosting, Friedman's algorithm, which removes the dependence on A; you
don't need to worry about A, which makes things much easier, but it still depends on a learning
rate, which in Friedman's version is a small step size. Then you can remove this eta and make
the method fully adaptive with fully corrective boosting: based on Friedman's algorithm, you do
a full optimization within the active set of selected features. This gets rid of both the A
problem and the eta problem. On top of that you can do local search, which gives you better
feature selection; I gave one intuitive example of why you get better results: you get a more
compact solution than without local search. Purely forward greedy gives you a less compact
selection than when you include the backward step, and that helps when you do feature
selection. Then I talked about the generalization: instead of the sparsity constraint you can
handle the more complex structured sparsity optimization, a slightly more general optimization
problem, which is harder than the original one. I talked about one specific algorithm, a
generalization of the fully corrective greedy algorithm, but you can do the other variants too,
and they can be studied similarly in this framework. Finally, I want to say a little bit about
sparsity optimization in general. Personally I'm more interested in greedy algorithms and local
search, because I think that, at least for this type of problem, directly attacking the sparsity
constraint is the more powerful approach.
A lot of people are more interested in convex relaxation. I think the reason is probably that
convex relaxation is easier to understand, in the sense that you just write down a formula:
convex relaxation is not a procedure, it's just an optimization problem; you change the
optimization problem. And there is a real advantage to that: you can use different procedures to
solve it. The disadvantage, you can argue, is that it's less flexible. With greedy algorithms you
can have all these variations, and they attack the underlying L0 problem more directly than L1
does. L1 is just 1; you cannot relax to L0.5 and stay convex, so you're basically stuck with it.
And when you do think about optimization: the relaxation itself is not an optimization
procedure, it's just a formula. Once you look at a specific optimization procedure for it, it will
be greedy-like. For instance, take the Lasso solution path: if you look at that solution path, you
can study that particular optimization procedure and expand from it to other greedy algorithms,
which attack the underlying problem more directly. Of course, the advantage of L1 is that you
are not tied to one solver; you can use other optimization methods to solve it. But then you
have the limitations: you are essentially equivalent to a specific kind of greedy algorithm.
So that's it. [applause].
>>: In your Frank-Wolfe example, there are two things: in one of the formulations you
minimize jointly over the component E_I and the step size eta, and in the second formulation
you just fix some eta and then find the best component.
>> Tong Zhang: Wait a second. You mean the L1-constrained one?
>>: In the first one.
>> Tong Zhang: Let's look at that.
>>: Go forward a little bit. Backward. Backward. Here. Backward. Here.
>> Tong Zhang: Okay. Right.
>>: Here you have eta_J. That's a joint optimization problem over the eta and E_I. And the
next algorithm --
>> Tong Zhang: You don't have A.
>>: The other one is a fixed step.
>> Tong Zhang: This actually is wrong; I think I have some kind of copy/paste issue, and you
don't have it here either. This one you should read as -- yeah, I think I just copied and pasted;
there are a lot of typos, I notice.
>>: The alternative is that you fix eta_J and choose the best vector E_I?
>> Tong Zhang: You can optimize that. It doesn't make too much difference, because you can
also pick a specific eta_J here.
>>: In greedy boosting, essentially the algorithm first finds the best E_I and then does a line
search to find the eta_J.
>> Tong Zhang: AdaBoost does that, you're right. AdaBoost would be equivalent to jointly
finding I and eta, the step size; it's equivalent to optimizing simultaneously over I and eta in
this particular form, and that gives you exactly the AdaBoost eta.
>>: My question was just basically that it would be easy to first find the direction E_I and then
do a line search for eta_J.
>> Tong Zhang: You could. But Friedman's version fixes eta; you can do that version too. Let
me see, in the fully corrective version you can do it greedily or you can do it a different way;
there are several different ways to find I without using the step size.
This is one way, but there's another one based on a least squares problem. I didn't say much
about that, because I focused on other things, but there are several ways, and they're all okay.
So here the point is that you want to find the maximum gradient; there are different versions of
that, and once you have found it, yes, you can find a particular step size. Replacing this rule by
finding the maximum gradient is okay, yeah, and then you optimize. But that's still not fully
corrective, because you only optimize eta; to be fully corrective you have to optimize not only
eta but everything. Yes, you can do that, I think that's fine. When I wrote this down, it wasn't
the only version; there are variations. I didn't really go into that, because I'm focusing on the
other side of the story, not on the variations of implementing the greedy algorithm.
>>: A question: can you go back to those compression pictures?
>> Tong Zhang: Which ones?
>>: The two images that you showed.
>> Tong Zhang: The images, the image recovery.
>>: You're doing compression, right?
>> Tong Zhang: It's image recovery. You observe some random projections of the image, and
from the random projections you want to recover the original image. It's a kind of
compression.
>>: You're learning the projection?
>> Tong Zhang: No, it's a random projection, a random projection.
>>: So you go from the random projection back to the original.
>> Tong Zhang: Yeah, you recover it going back: given the observations from compressed
sensing, where you randomly project the image to a small set of measurements, you go back
from those.
>>: Is this the best method for doing this?
>> Tong Zhang: People talk about structured sparsity a lot nowadays, so I don't know if it's the
best. There are convex versions and there are other things. Of course we want to say ours is
best, but I don't think that's true, or at least not necessarily true. It's just one approach; the
advantage of ours is that we can prove things. I don't see many other people doing that; they
just give formulations and say they're good. They can show you a picture, maybe like this one,
but they don't have these guarantees. So there are other versions; this is not the only one.
That's why I'm saying I'm not giving a comprehensive review of the literature. I'm just --
>>: It's kind of like --
>>: Huh?
>>: It's kind of like doing this?
>> Tong Zhang: That I don't know; maybe. This one is not quite related to kernels, but maybe
it is.
>> Dengyong Zhou: Okay.
[applause]