>> Chris Burges: Let's get started. Today it's a pleasure to welcome Ambuj
Tewari to Redmond. He is an assistant professor at the Toyota Technological
Institute at Chicago. And also at the University of Chicago. Sounds like a lot.
And today he's going to talk about risk, regularization and regret and strong
convexity. Thank you.
>> Ambuj Tewari: Thank you, Chris. And thanks for inviting me. So this talk has
a sort of grander title than I usually like to give. I mean, I'm not going to talk all
there is to talk about risk -- about risk, regularization and regret. These are three
main very big topics in machine learning. But there is this notion of strong
convexity that appears in a lot of my recent work. So I will just use this as a
theme to present some of my recent results that have had some bearing on
these three Rs.
And this work is in collaboration with a bunch of smart people, John Duchi, a
student at UC Berkeley, Sham who -- Sham and Shai used to be at TTI, but have
now moved. Yoram at Google, and Karthik, who is a student at TTI. So feel free
to stop me, you know, if I'm rushing or going too slow.
So this talk is going to be about prediction problems which, you know, we want to
predict an output or response Y based on some input X. So that's the generic
prediction problem. So you want to come up with a function of the inputs that
can predict the output. Right? So that's a very general high-level view of
prediction problems.
Examples are -- you know, very common example, we deal with this everyday.
You get e-mails and then your spam classifier classifies each one as spam or not spam.
Or you might have microarray DNA data, genetic data, and you're trying to classify
individuals as diseased or healthy.
So this is something that Joseph Kashet at TTI works with. He likes to view this speech recognition problem as a prediction problem. So your input is
a speech signal and you want to output the corresponding phoneme sequence
that it corresponds to.
So here, unlike the previous two examples, the set of possible outputs is huge,
right? First of all, there is really no limit on how many things there might be in
your sequence, so it's really a combinatorial space here on the Y side. So this
example has a slightly different flavor than the other two.
Similarly, this one from the NLP community. People try to think of the problem of
parsing really as a prediction problem where your input is a sentence and you
want to predict a huge -- you want to predict really the whole parse tree as the
label. So again, the space of labels here is huge.
So, you know, structured prediction is the area that deals with such problems. I
will not go so much into structured prediction, but some of the ideas will apply to
such problems.
And then there's of course ranking. I was talking to Chris about it. I haven't
worked much on the ranking problem per se, but again some of the ideas I will
talk about do apply to the ranking problem, where you have a list of documents
that a query processing system returns and you want to display them in some
ranked order. All right. So hopefully the top-most document is the most relevant
document.
Okay. So these are just common examples of prediction problems. What I'm
getting at is really linear prediction, so prediction using linear functions. Now, the
world is nonlinear, but machine learning really likes linear functions because it
turns out you can do a lot with them if you work in a higher dimensional space or
even an infinite dimensional space using the kernel idea.
So I will assume that your predictors have this linear form, right? So W is a
vector of weights, and your input has been represented as some huge vector X,
and you just take the inner product of W and X; that gives you a prediction.
This is really the workhorse of machine learning, right? Lots of practical methods
are based on this idea. It's not as restrictive as it might look when you see it for
the first time, right? So we'll just see some examples of how you can use this
idea to not just predict real numbers, right? So it feels like this is just -- you just
want to predict one real number, then this makes sense. But what if you want to
predict some structured things, right?
But you can do that. And one thing I want to state up front is that this D, the
dimensionality, you should really be thinking of it as really, really large. I don't
want to depend explicitly on the dimensionality in the things I will claim about my
algorithms, right? So I want to avoid explicit dimension dependence as far as I can.
Okay. So how do you use linear predictors, let's say for ranking, right? So one
way to do it, not the way, perhaps, would be to learn a weight vector and then
take its inner product with the documents you want to rank, and then you can sort
those numbers. So maybe these numbers are good indication of the relevance.
And you can get the ranking, right?
So even if you want to predict a permutation, you can sort of do it by just learning
linear functions. So that's one simple reduction.
For multiclass problems, if you want to -- if you have an input and what you want
to predict is one of let's say K classes, you can learn individual weight vectors for
each of the classes and simply predict the class by taking inner products: you
want to classify this X, take the inner product with W1, take it with W2, and so on;
you get these K numbers, and take the max. And that's also one of the ways in
which multiclass classification is done in practice.
So again, you use the simple idea of linear predictors to solve not just binary
classification or regression but multiclass classification.
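In symbols, the multiclass rule just described is roughly

$$ \hat{y}(x) \;=\; \arg\max_{k \in \{1,\dots,K\}} \langle w_k, x \rangle, $$

where $w_k$ is the weight vector learned for class $k$.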
Then of course you can do multitask prediction, where your input itself maybe
comes from K different tasks, okay? So K is not class as in the last example, but
now these are different tasks. And maybe each task itself is simple. It's maybe
just a regression or a binary classification task. So for each task you just want to
predict one single Y, one number. But the idea of multitask is that it probably
shares some -- has some similarity, right? How do we encode similarity?
People usually put either a low rank assumption, so these predictors are for
individual tasks. But if you're assuming there is similarity, then something should
be simple about this matrix, right? So this matrix should not be just an arbitrary
matrix, right? Now, how do you say simple? Well, that depends on application.
Two kinds of assumptions people have made on this. One is that maybe there's
not full independence here. So maybe this matrix -- well, the individual task
predictors just sit as rows -- has lower dimension, like these rows don't span the
full possible space. Maybe the rank of this matrix is low.
And another is shared sparsity. Which is, for all these tasks, maybe the apparent
dimensionality is huge, but there are only a few relevant variables, and those
relevant variables are perhaps common across tasks. That's the sharing. So
people call it shared sparsity, and the whole group [inaudible] is based on that
idea.
So again, the idea is that you use these individual inner products to predict labels
for individual tasks but then you regularize in a way that encodes your
assumptions. And I will come to regularization very soon.
Okay. So this is probably my last example on this general idea of using linear
predictors to solve more complicated problems. This is for structured prediction,
like the phoneme sequence or the parse tree example. So here for every input
X, you have a huge number of possible labels. And the way you deal with that
problem is that you map each input label pair via feature mapping into some
high-dimensional space. So phi XY lives in some R to the D where D is huge,
and then you again use really the multiclass idea, right, that you will -- to take the
inner product of your W with each of these mappings, corresponding to the
labels, and provided this max can be computed efficiently over this
exponentially large set, you can still predict, and then the idea is to come up with
the learning algorithms where you can still do learning efficiently even though
your set is huge. And there has been some success in that front. Okay?
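In symbols, the structured prediction rule being described is roughly

$$ \hat{y}(x) \;=\; \arg\max_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle, $$

where $\mathcal{Y}$ is the (exponentially large) set of possible labels and $\phi$ is the feature mapping on input-label pairs.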
So this is like the introductory part of the talk. So are there any questions at this
stage? Okay.
So here is the outline. I will introduce the three Rs that the talk is about, risk,
regularization and regret; and then spend just a little bit of time introducing this
notion of strong convexity; and then the real meat of the talk is in the last portion
where I will present some new analysis for existing algorithms as well as some
new algorithms in the end. Okay?
Okay. So for the time being, forget about the more structured prediction of the
ranking problems. Let's stick to classification and regression, and we'll come
back to the more complicated ones.
I just wanted to get at this issue of loss functions. So for any prediction task it's
important to design the loss function correctly. But I'll sort of shove that issue
under the rug for this talk and just assume you have picked your favorite loss
function. Just think of, you know, your favorite loss function. The idea is that you
just want a way to go from predictions and true labels to quantification of how
good that prediction is, right? So just a loss function.
For me, for this talk, a loss function is just a way to do that, all right? So you get
your prediction using your predictor and you have the true label, and L of these
two guys just gives you how badly you did. Okay? So just use
some loss function.
And once you fix the loss function, the traditional statistical learning viewpoint is
that assume that your data comes from some unknown distribution P. And you
try to not assume too much about P, right, because this might be generated from
some really complicated process. But the thing is having this assumption that
your data is from a distribution allows you to associate with each predictor a
single number which is an expected loss, right? So that's the risk.
So that's the first R. So the risk of a predictor is simply its expected loss with
respect to the distribution. And the point is of course we don't know what P is, right? P is
how the world is in some sense. What's the distribution of natural images, you
know, what's the distribution of human language, human language sentences.
But we do have a training set to train on, okay? So this acts as a proxy for this P
that you don't know.
So since the sample's what -- is all you have, it makes sense to use it to minimize
what is known as the empirical risk. So the risk is the expectation under the
underlying distribution of the data, but the empirical risk is just the average loss
on the training set. All right? So that is the empirical risk. And you might go
ahead and minimize it, but then the question is well, what space do I minimize it
on? If I just minimize it over, let's say, all functions, all possible Ws, then, you
know, most of us know that we get hit by overfitting, right?
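In symbols, the two quantities just described are roughly

$$ R(w) \;=\; \mathbb{E}_{(X,Y)\sim P}\big[\ell(\langle w, X\rangle, Y)\big], \qquad \hat{R}(w) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell(\langle w, x_i\rangle, y_i), $$

where $\ell$ is the chosen loss, $P$ the unknown distribution, and $n$ the number of training examples.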
And that brings us to regularization. But we don't want to search over all -- the
whole space of predictors, right? So maybe we have some belief maybe about
what a good W looks like, and we want to incorporate that in regularization. So if
you ask what regularization is, I guess you get different answers from different
people. But at a very, very high level, and probably to have the least conflict with
most people: if you say, well, regularization is something done to prevent
overfitting, I think that is at least something that most people would have little
problem with. And
the idea is that you want to encourage simple functions that do fit your data,
right? So there is this eternal tradeoff, right, between fitting the data well, which
is the data part, L hat of W, the performance on the data, and then you also
want something which is low complexity in some sense. And lambda is a
regularization parameter that trades these off, okay?
For those of you who work with this everyday, I mean, I'm sure I'm boring you to
death but I just wanted to set up the context for the talk. And there is the
constrained version, of course, of this where you keep -- where you pull the
regularization out of the objective and put it in the constraints.
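Written out, the two forms being described are roughly the penalized objective

$$ \min_{w} \;\; \hat{R}(w) + \lambda\, \mathrm{reg}(w) $$

and the constrained version $\min_{\,\mathrm{reg}(w) \le B}\; \hat{R}(w)$, where reg is the chosen regularizer.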
So are there any questions on this very, very high level technical setup of the
situation? Okay.
So let's get into examples of regularization. I guess the most common one is the
L2, the one that is used in SVMs. So here the L2 norm is simply: square each
entry, add them up, take the square root. It also corresponds to the vector's
length, you know, in 2D and 3D, and in however many dimensions you can think
of it is the analog of length.
Another regularization that has become very popular recently and has
connections to sparsity is the L1 regularization where you just sum the absolute
values of each component. And that acts as a convex surrogate for the number
of nonzero entries in the vector. So it's another regularization.
The third one, a very popular regularization, comes from the max-ent and
boosting world, where you measure the complexity of a vector by its entropy.
Really the negative entropy, because I want to work with convex regularizers;
that's why I'm dealing with negative entropy, because entropy itself is concave.
So here I'm assuming W is a probability distribution, so each component is
positive and it sums to one.
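In symbols, the three vector regularizers just mentioned are roughly

$$ \|w\|_2 = \Big(\sum_{j=1}^{d} w_j^2\Big)^{1/2}, \qquad \|w\|_1 = \sum_{j=1}^{d} |w_j|, \qquad R(w) = \sum_{j=1}^{d} w_j \log w_j \;\; (\text{on the probability simplex}). $$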
So these are all examples where the Ws are really vectors. And what if we have
matrices? So these matrices arise -- you saw they arise in multiclass and multitask.
So the group idea, which I very briefly hinted at a few minutes ago, is to use this
idea of shared sparsity. So think of it this way: these rows are individual
predictors. So each predictor has, say, three dimensions. And the rows can
correspond to tasks in multitask or to classes in multiclass. And the idea is to
take the usual L2 norm of each column, so that's of each feature, and then add
them up. So the idea is that, like the L1 norm, which encourages sparsity of
each component, here this should encourage sparsity at the column level.
Only a few columns should really light up in this matrix and the rest should just
be turned off to zero if you use this regularization. That's at least the idea. It
doesn't always happen; it happens under some conditions. But that's the idea
behind this regularizer.
And the trace or the nuclear norm, which has become very popular due to its
association with the matrix completion problem, is there to incorporate this idea
of limited linear independence in your rows. So here you regularize as follows:
you take the SVD of your matrix W; these are the singular values, and you take
the L1 norm of the singular values. Why? Because rank is the number of
nonzero elements here, and a convex surrogate for the number of nonzero
elements in a vector is the L1 norm. So you just take the L1 norm here.
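In symbols, the two matrix regularizers just described are roughly

$$ \|W\|_{2,1} \;=\; \sum_{j=1}^{d} \|W_{\cdot j}\|_2 \;\; (\text{sum of column L2 norms}), \qquad \|W\|_{\mathrm{tr}} \;=\; \sum_{i} \sigma_i(W) \;\; (\text{sum of singular values}). $$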
So these are the five regularizers that I'll sort of use as running examples. Okay.
So I'm done with my risk and regularization. So there remains the third R is the
one I want to talk about, which is the regret, which may be not that many people
are familiar with. So I'd like to illustrate the idea with an example. So let's say
you have -- well, I'll get to the definition of regret shortly. But let's just do this
little example before that.
Let's say you have let's say five experts predicting the weather. And for
simplicity assume that the weather takes only two values, sunny or cloudy. And
you know a very unreasonable assumption, you know that one of these is
perfect, okay? And your job is to look at their predictions and make your
prediction and make the minimum number of mistakes. Okay?
So this is the first day you have no prior information. You just know one of them
is correct. So maybe you just go with the first one. And the reality is cruel. It
doesn't always do what you predict. It's cloudy. So you made a mistake. But
that's fine. You made a mistake, but we can rule this out, right, because we
know there's a perfect predictor, these two can't be perfect. So now we just have
these three. And now they predict cloudy, sunny, sunny. So it was cloudy
yesterday and so maybe if things persist you go with the first one again.
Well, but look again. So you make a mistake. But still progress has happened,
right, because you made a mistake. So you paid in your performance but you did
make progress by zooming in on these two guys.
So next day they both predict sunny, so you're in good shape because one of
them is perfect and they're both predicting sunny so you won't make a mistake.
And then the next day they have a disagreement. You go with the first one and -well, you made another mistake but now you're guaranteed that you won't make
mistakes because this one is the perfect one. Okay?
So the idea is that under this unreasonable assumption of a perfect expert there
is the strategy, a very simple strategy, you just go with the first uneliminated
expert that makes a finite number of mistakes. Okay? So that's the point I'm
making that there is no probabilistic assumption here on how the world behaves.
There is an unreasonable assumption of the perfect expert which can be
removed. But it just gives a flavor of how you can say something without making
probabilistic assumptions. Some adversarial guarantee that no matter what
happens my algorithm will satisfy certain such certain property.
And actually if you go not with the first expert but with the majority of the ones
that have not been eliminated, you only make a logarithmic number of mistakes.
So this strategy of going with the first is actually a bit silly but still gives only a
finite number.
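As a rough sketch, the majority-vote strategy just mentioned (often called the halving algorithm) could look something like this; the function and variable names are illustrative:

```python
def halving_algorithm(expert_predictions, outcomes):
    """Predict with the majority of the still-consistent experts.

    expert_predictions: list of rounds, each a list of 0/1 predictions
    (one per expert). outcomes: the true 0/1 outcome for each round.
    Assuming one expert is perfect, the number of mistakes is at most
    log2(number of experts), since every mistake halves the active set.
    """
    active = set(range(len(expert_predictions[0])))
    mistakes = 0
    for preds, truth in zip(expert_predictions, outcomes):
        votes = sum(preds[i] for i in active)
        my_prediction = 1 if 2 * votes >= len(active) else 0
        if my_prediction != truth:
            mistakes += 1
        # eliminate the experts that got this round wrong
        active = {i for i in active if preds[i] == truth}
    return mistakes
```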
So this motivates the online learning setup. Now, this was not a learning problem,
right, a very sort of concocted problem. But let's get back to learning. We're
trying to learn. So maybe you have a temporal process and you're trying to
predict it at each time, and you keep a predictor at each time T. So that's like
your move. So I'll think of this as a game between the learner and the
environment, which can be adversarial. So you play a predictor. In the expert
setting that just corresponds to picking one: you know, if there are five experts
and I go with the first one, you can encode that choice like this. So that will be my move. Then
the online learning protocol is that the world responds with a loss function. And
again the expert setting this loss function just encodes which experts made
mistakes. The first two experts made mistakes then this is what the loss function
looks like.
And then you suffer this loss function on your choice for that day, for that round.
So here in this case the loss function is very simple. It's just linear, right? So if
one of the mistake-making experts was what you picked, you just pay that, right?
So a very complicated way of saying that is you take the inner product between
these two vectors. But I'm making it complicated because it brings me to this
more general setting of online learning where there is this protocol and you suffer
these losses at each time.
And then what's the regret? Well, the regret is sum up the losses that your
algorithm -- that this online algorithm that, this learner made. There is this
division by T to scale it. And then subtract from it the best you could have done if
you knew all the loss functions in advance. All right? What's the best you could
have done if you knew everything in advance over your space. So this W lives in
the same space as what your algorithm lives.
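In symbols, the (time-averaged) regret just described is roughly

$$ \mathrm{Regret}_T \;=\; \frac{1}{T}\sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{w \in \mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} \ell_t(w), $$

where $w_t$ is the learner's play at round $t$ and $\ell_t$ the loss the world responds with.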
So I was mentioning this definition to Chris, and he said, well, it's sort of -- you're
competing with a quantity that's maybe not that interesting, because you're
allowed to change the WT at each time but what you're competing against has to
use the same W throughout. There are several possible responses to that
criticism.
The first response is that it is possible to study this quantity and prove interesting
bounds. And not only that, these bounds often give very competitive answers in
the statistical setting: if you do assume that these Ls are not just adversarial but
are coming from some distribution, then you recover some of the best guarantees
that are known for batch algorithms, which is kind of interesting. And the third
response is that yes, this is restrictive, but this is just a start, so you can consider
other notions of regret where this comparator is allowed to change a little bit,
okay?
This is the definition. And I just want to make sure everyone sort of gets it, what
this quantity is measuring. So your cumulative loss minus what's the minimum
you could have done if you saw the sequence in advance. Okay? Yes?
>>: So I mean usually [inaudible] T and N would be the same thing, so is that --
>> Ambuj Tewari: Oh, that's a typo. Zero. That small N is -- so, yeah, in
statistical learning we use N and somehow in the online world we use T, and I
got them mixed up here. N is equal to T here. Yeah. It's the same. Yeah.
You can see that's just propagating. Okay.
So the regret, unlike the risk -- so risk is just a number. I have a predictor, the
distribution. Risk is the average loss of this predictor. It's a single number; it
depends on the distribution. Regret, on the other hand, depends on
the actual reality, on the actual sequence of loss functions, on the actual data
you saw. So the regret of an algorithm is a sequence dependent quantity. But
the surprising fact is that you can often bound it for all sequences from a certain
class. And the sequence could have been generated adversarially. So the
adversary might be looking at your algorithm and trying to force bad examples.
Okay. And this is a slide that's illustrating that if you do come up with a way to
bound this regret, then you can use it in a streaming setting where your data is
IID. But perhaps you're seeing it one by one, right? So what you can do is you
can mimic this online protocol using a data stream where you -- you know,
maybe this is the stream of e-mails that is coming in. You maintain a classifier at
each state. You receive e-mail and then you have a label, you have a correct
label for it. You suffer the loss, maybe just a squared loss, and that's how you
can mimic the online protocol. And the point is, at a very high level, there are
these conversions that can go from the online adversarial setting to the batch
statistical setting by combining the regret guarantee with the assumption that
the data is actually IID, using some concentration inequalities to give a bound in
the statistical setting. In the statistical setting you'll say that the risk of the
average thing you played over time, so that's just an average of these Ws, is not
much worse than the best risk, so the blue star is the predictor in your class that
has the minimal risk. Plus a little bit, where this little bit depends on the
regret guarantee and the extra thing that you're accumulating by using some
concentration inequalities.
But the high-level idea is that you can use these maybe esoteric looking regret
bounds to get things in the statistical setting and you don't lose much. So these
regret bounds are apparently not that loose.
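Schematically, an online-to-batch conversion of this kind says that, for convex losses and an i.i.d. sample, with high probability (constants omitted)

$$ R(\bar{w}) \;\le\; \min_{w \in \mathcal{W}} R(w) \;+\; \mathrm{Regret}_T \;+\; O\!\Big(\sqrt{\tfrac{\log(1/\delta)}{T}}\Big), \qquad \bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t. $$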
And people have also used these online algorithms to do optimization of batch
objectives. So, for example, there is the Pegasos algorithm, which is really an
online algorithm but it works on the SVM objective, solves the primal problem. And the
idea is that instead of this data stream we just mimic the online protocol by
drawing from your given batch dataset. So although you have the batch dataset
before you, maybe you don't want to look through the whole dataset, you can just
sample randomly from it. And then again you can combine the regret guarantee
with this assumption that you're drawing randomly from a big dataset to get
guarantees on the batch error. So the error on this dataset S. So the loss of my
average predictor on the dataset will be less than the empirical minimizer's loss
plus a little bit. Okay? So two ways how online bounds are used within learning.
Okay. So what we want to do is understand all these different regularizers:
entropy, L1, L2, group norm, trace norm. In the probabilistic setting we want to
say, well, the risk won't be big. In the
adversarial setting we want to say that the regret will be bounded. And we want
to understand the relationship between the probabilistic model and the
adversarial model. When are the regret bounds the same as the statistical
bounds, when are they different, things like that. And the notion of strong
convexity appears to play an interesting and important role in all of these. I won't
get into the risk bounds because I don't have that much time. I'll mostly
concentrate on the online setting because I think that's also more practical. You
do get these fast algorithms out of it.
Okay. So strong convexity -- here is a very brief intro to this concept for those of
you who haven't seen it. Convexity I'm sure everyone is familiar with. So convexity
in one dimension means that if you connect two points on the graph of the
function, the line will always be above it. And in particular at the midpoint you'll
be above the function.
Another equivalent definition, at least if the function is differentiable, is that you
can draw a tangent at any point and the tangent will be below the function. That's
another equivalent characterization of convexity.
Yet another definition can be that the derivative is monotonically increasing, or at
least nondecreasing, which means that the derivative of the derivative, the
second derivative, is nonnegative. So there are three definitions. And you can try
to make each definition stronger to get a notion of strong convexity. The good
thing is you get one notion: you make each definition appropriately stronger, and
the resulting notions still coincide. So we have a robust notion of strong
convexity, and it doesn't depend on the definition of convexity you start with.
So you can take this definition and say, well, this is not just below but below by a
significant amount, right, where that significant amount is quantified by a
parameter alpha. So you can say my function is alpha strongly convex if there is
this -- at least this much gap between the function and this line. Okay? That's
one proposal of strong convexity.
One other proposal could be, well, I'll say that this function is not just above this
linear approximation but above by a significant amount. Again, quantified by
alpha. And you can say well, here my derivative of the derivative is not just
nonnegative but it's actually at least alpha where alpha is a positive number. And
all these three definitions are again equivalent. So we get some robust notion --
>>: So twice differentiable --
>> Ambuj Tewari: So that brings me to the definition. This I take as the
definition. And then the rest are equivalent if you have sufficient differentiability,
right. So in high dimensions, the definition of strong convexity that is used is that
the function at the midpoint is less than the average of the function values at the
end points minus -- this should be, yes, this should be a minus -- minus alpha
over eight times the squared distance between the points. The eight is just for
making sure that half the Euclidean norm squared is one strongly convex,
because you want that example to have a strong convexity constant of one.
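In symbols, the midpoint definition just stated is roughly: a function $f$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ if for all $u, v$

$$ f\!\Big(\frac{u+v}{2}\Big) \;\le\; \frac{f(u) + f(v)}{2} \;-\; \frac{\alpha}{8}\,\|u - v\|^2. $$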
>>: Is halfway enough, or can you construct some weird counter stuff like --
>> Ambuj Tewari: Oh, so this is all assuming functions are continuous. Yeah.
No weirdness is allowed.
>>: Okay.
>> Ambuj Tewari: Yes. But, yeah, half is enough only if you have continuity.
And then you can infer it for -- and the notion -- the issue of the norm is kind of
important. And I want to say that right now. Because in 1D, pictures are
misleading because there's only one way to measure norm, right, this the
absolute value. But in higher dimensions, it's very important which norm your
given function is strongly convex with. Because the constant will depend on the
norm. And I don't want my constants to depend on dimensionality.
Okay. So the definition, as far as I can tell, first appeared in 1966 in an
optimization theory paper by Polyak, and it's now appearing more and more in
the machine learning literature in the past four or five years.
And then there are these equivalent definitions if you have -- so this was the
definition and this was a consequence of the definition. If your function is
differentiable, then the linear approximation sits below by this much, and the
gradient mapping is a monotone function -- it's actually a strongly monotone
function and then there's the condition involving the Hessian. But these are just
showing that you can state this notion in different ways.
Okay. Examples. So one important family of examples are these LP -- the LP
norms, right? So the L1 and the L2 norms are called L1 and L2 because they
are the LP norm for P equals one and two and generally you can take the Pth
power of each absolute value, add them up and take the one over P power of
that. So that defines a norm. And half LP norm squared is known to be P minus
one strongly convex with respect to the P norm, okay? And this is important,
right? So this is not referring to the dimension at all, this is just P minus one
irrespective of the dimension.
And this is true only for P between one and two. And don't ask me what happens after two
because it sort of gets more -- [inaudible] says you don't have strong convexity,
you have a notion that's between convexity and strong convexity.
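In symbols, the fact being used is roughly: for $p \in (1, 2]$ and any dimension $d$,

$$ w \;\mapsto\; \tfrac{1}{2}\|w\|_p^2 \;\text{ is } (p-1)\text{-strongly convex w.r.t. } \|\cdot\|_p, \qquad \|w\|_p = \Big(\sum_{j=1}^{d} |w_j|^p\Big)^{1/p}. $$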
>>: Is it the same as alpha is P minus one? Is that -- when you say P minus one
strong [inaudible].
>> Ambuj Tewari: Yeah, yeah, so the alpha equals P minus one if you take this
function. But this specific function, the alpha -- the biggest alpha that you can
make -- that you can work with is P minus one.
>>: But in alpha zero --
>> Ambuj Tewari: Is always true, right? So the function -- if function is alpha
strongly convex, it's alpha prime strongly convex for all alpha primes less than
alpha.
>>: But is zero strongly convex function [inaudible].
>> Ambuj Tewari: A zero strongly convex function is just convex.
>>: So if you do the L1, it's not actually this strong --
>> Ambuj Tewari: Good point. And I'll come to that. So for L1, we have to do a
little trick. You have to stay just enough -- you go in one --
>>: L1.0 --
>> Ambuj Tewari: Yeah, you do one plus a little bit. Good point. So if you're
exactly at one, you lose strong convexity. I'll come back to that point. It's a good point.
And for matrices, one very nice -- actually when does the talk end?
>>: [inaudible].
>> Ambuj Tewari: Okay. So I'm doing good on time then. So, yeah, this is a
family of examples on vectors; for matrices, you have these so-called Schatten
norms where -- I mean, the name is complicated but the idea is simple. You do
the SVD of your matrix and then just apply the P norm on the singular values.
Right? And again, the trace norm and the Frobenius norm correspond to taking
the L1 or the L2 norm on the singular values. So that's the family of Schatten
norms.
And the interesting thing is -- and this is not a trivial consequence of this fact. It
takes some proof. It's a paper by Keith Ball and a few other people, that half
Schatten norm squared is also P minus one strongly convex with respect to the
Schatten P norm. So again no dependence on the dimensionality of the matrix,
no dependence on D1 and D2. And you might hope that you can just infer this
fact from the vector case by using some simple linear algebra, but because of
non-commutativity of matrices, the proof is actually not simple. At least I don't
know of any simple proof.
The third example is the negative entropy. And again you can get a dimension
independent strong convexity constant provided you work with the L1 norm. So
negative entropy is strongly convex, not with respect to L2 but with respect to L1.
And the logical content of this statement, that it's one strongly convex, is actually
equivalent to saying that Pinsker's inequality holds, which says the KL divergence
is lower bounded by the variational distance squared, for those of you who know
Pinsker's inequality.
It actually has a quantum analog, but I won't get into it. So in the quantum world
probability distributions get replaced by density matrices. Positive
semi-definite matrices with trace one.
Okay. So I'm done with the three Rs, I'm done with the definition of strong
convexity. Are there any questions on the introductory and the definition of
strong convexity?
Okay. So if there are no questions, I'll just move on. Okay. So now coming to
the meat of the talk, which is online algorithms using this idea of -- now I'll make
the connection, sort of try to put things together. Okay. So as I said, I won't go
into the statistical learning period results like how to obtain regularization error or
risk bounds in the statistical world, I'll just stick with the online setting.
So here is the protocol. So you're playing a predictor at each time. You receive
some loss function. And I think it's good to keep in mind that I'm not thinking of
this L sub-T as an abstract function. In my head I'm really thinking of some
machine learning loss function. That's what I'm really thinking in my head, what LT is, right?
It's the loss that you incur on the Tth example. And this could be your favorite
loss. This could be even like you could do -- you might be doing structured
prediction. In fact, people have used some of these ideas for structured
prediction.
Okay. So to come to -- to motivate mirror descent, let's start with two very
well-known algorithms in machine learning. One is this online gradient descent.
So the way you get your next iterate is by just going a little bit in the descent
direction for that given loss function. And then maybe you need a projection
to come back in your set. So that's just simple gradient descent.
The exponentiated version of that is of course let's say you're working on the
probability simplex. So the way you update each component of your -- so you're
keeping track of this WT is evolving, right, so you have these -- this is how your
online algorithm works so to get WT plus one from WT, this is what gradient
descent does, and this is what exponentiated gradient descent does: each
component gets multiplied by E to the power of minus the learning rate times the
appropriate component of the gradient. This is the multiplicative version of this
iterative algorithm.
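In symbols, the two updates just described are roughly (with learning rate $\eta$ and $g_t = \nabla \ell_t(w_t)$)

$$ \text{online gradient descent:}\;\; w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta\, g_t\big), \qquad \text{exponentiated gradient:}\;\; w_{t+1,i} \;\propto\; w_{t,i}\, e^{-\eta\, g_{t,i}}, $$

where the exponentiated gradient iterate is renormalized to sum to one on the simplex.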
So both are fairly -- I mean, have been around for a while. And they're both
specific cases of mirror descent. And here's how. So gradient descent, this
thing, has an equivalent description like this. And that the next iterate actually is
obtained by solving a simple optimization problem. What is our optimization
problem? You linearize your loss function at your -- at your current iterate, so
you linearize your function around WT. That's a bad approximation, far from WT.
So you add a proximal term. You don't want to go away from where you made
your approximation.
And you just minimize these two to get WT plus one. And it's actually not that
difficult to verify. So this is the linear approximation, right? The value at WT plus
the gradient times the difference. It's actually not that difficult to verify that if you
just minimize this than this is what you -- what the solution looks like.
This is called a proximal term. And for exponentiated gradient, the only
difference is the way you measure the distance: instead of Euclidean distance
you use, not really a true distance, but the KL divergence. And again, it's a bit of
a calculation to show that minimizing this actually gives you this closed form.
Okay. So now these algorithms start to look the same, right? Only one
ingredient has changed. And you can say well, okay, I'll just have a generic -- a
general Bregman divergence here. And I'll tell what I mean by Bregman
divergence. Bregman divergence is simply this: if you have a convex function R,
you can get a divergence out of it by evaluating that function at W, the first
argument, and subtracting from it the linear approximation evaluated at W --
where the linear approximation is made at V. Right? So
the picture of this -- so you have this R, and you make the linear approximation at
V, and you evaluate it at W, that will leave a gap, and that gap is exactly the
Bregman divergence.
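In symbols, the divergence and the resulting family of updates are roughly

$$ D_R(w, v) \;=\; R(w) - R(v) - \langle \nabla R(v),\, w - v\rangle, \qquad w_{t+1} \;=\; \arg\min_{w \in \mathcal{W}} \;\eta\,\langle \nabla \ell_t(w_t),\, w\rangle + D_R(w, w_t), $$

which recovers gradient descent and exponentiated gradient for the two choices of R above.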
So the divergence is nonnegative by definition if R is convex. Actually we'll
assume not just that: we'll assume R is strongly convex. So this now ties back to
the notion of strong convexity. So this is now a family of algorithms. You change
this ingredient R and you get different algorithms. What are these algorithms
good for? We'll now relate the choice of R to the task at hand,
right? So for multi task, for trace norm, it turns out there are appropriate Rs that
you can use to get the right algorithm and the right bounds. Yes. So I'll assume
that R is strongly convex. Okay.
Now, this is somehow not an intuitive way to think about this algorithm. This is
the definition. So remember that gradient descent had these
two equivalent descriptions. You can either view it this way or you can view that
you're doing this gradient step.
So it turns out even for the general mirror descent algorithm there are two views.
One view is that you are doing a tradeoff between this proximal term and the loss
dependent term. There's also a more geometric view, which is that you have the
WT, you go to a mirrored space -- and that's where actually the term mirror
descent comes from -- and you do a descent in the mirrored space.
You go through a mirror map to the mirrored space. What's that mapping? That
mapping is actually the gradient mapping of this strongly convex function. So you
assume it's differentiable. So this gradient mapping takes you to the mirrored
space. You do descent there. And then you come back. And since R is strongly
convex its gradient is invertible; in fact, for any strictly convex function the
gradient is invertible, so you can come back.
So with this picture you get the same algorithm as this. And, you know, online
gradient descent and exponentiated gradient just differed in the choice of this mapping.
>>: [inaudible] is the matrix.
>> Ambuj Tewari: Grad R -- so both of these are vectors of the same
dimensionality. So grad R just maps vectors to vectors. It's the gradient of a
scalar-valued function. So it takes inputs that are vectors and also outputs a vector.
So for simple gradient descent, R is half the norm squared. So gradient descent
corresponds to R of W being half the W two norm squared, and then the gradient
is simply W itself, so nothing happens. The mirror space is the same as the
original space. And then you might project if you go outside the set.
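As a rough sketch, the mirrored-space picture for the entropy regularizer (exponentiated gradient) could look like this; the function name and setup are illustrative:

```python
import numpy as np

def eg_mirror_descent_step(w, grad, eta):
    """One exponentiated-gradient step on the probability simplex.

    For R(w) = sum_i w_i log w_i, the mirror map is grad R(w) = 1 + log w,
    so 'descend in the mirrored space and map back' reduces to a
    multiplicative update followed by renormalization onto the simplex.
    """
    theta = np.log(w) - eta * grad   # descent step in the mirrored space
    w_new = np.exp(theta)            # map back to the original space
    return w_new / w_new.sum()       # projection onto the simplex
```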
Okay. So here's a very general guarantee not new to us.
>>: [inaudible].
>> Ambuj Tewari: The -- which gradient?
>>: The gradient --
>> Ambuj Tewari: So this gradient comes from the loss. And this is a mapping.
So I apply the mapping at WT to get theta T. So this is the mapping. It takes
vectors to vectors. Apply it to the current vector, get this here. Do
descent and come back.
>>: [inaudible] do it on the right-hand side?
>> Ambuj Tewari: Yes. So there are lazy versions where you maintain things
only in the mirrored space and only project where you need to make a prediction,
that's right. But if we're making predictions every time, then you have to come
back.
So here's a generic guarantee. The nice thing is this is a nice family of algorithms.
The only ingredient is this strongly convex function. So assume you have an
upper bound for that function and the parameter of strong convexity, and
assuming that all the gradients of your loss functions are not too big in the dual
norm, okay? So R is strongly convex with respect to some norm, which will be
application specific. And you have to measure your gradients in the dual norm,
so those two sort of go together, right? You don't have freedom in choosing a
different norm here.
But then there is this generic guarantee. If you use this mirrored descent
algorithm, no matter what your loss function -- loss sequence is, as long as the
gradients are bounded, your regret will always be bounded by something which
is linear in the dual norm, and then -- so the nice thing is there is no -- at least
there's no explicit dimension dependence here, right? So as long as alpha is
independent of the dimension, all these bounds are also independent of the
dimension.
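Schematically, a bound of this kind reads roughly as follows: if R is $\alpha$-strongly convex with respect to $\|\cdot\|$, $R(w) \le R_{\max}$ on the comparison set, and $\|\nabla \ell_t(w_t)\|_* \le G$ for all $t$, then with a suitably tuned step size

$$ \mathrm{Regret}_T \;\lesssim\; G\,\sqrt{\frac{R_{\max}}{\alpha\, T}}. $$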
>>: So [inaudible] always exists for strongly convex --
>> Ambuj Tewari: Actually the inverse exists for any strictly convex R. And for
the -- so it actually exists for strongly convex R just because all you need for the
inversion -- for the gradient to be invertible is strict convexity. It shouldn't have
any flat portions. The R shouldn't have any linear bits in it.
>>: [inaudible].
>> Ambuj Tewari: Okay. So, to remind us, the regret is just this quantity: sum of
the losses of the algorithm minus the best. Okay. So that's an abstract theorem.
It's not clear what it buys you. Let's now do specific cases. And this is something
which is old, right? These bounds are actually old. I'm just putting them here to
show that they come out of this generic framework. So these are not new.
So let's say you're doing learning with just plain vectors, doing some
regression or classification. Then if you -- if your predictors you're competing
against are bounded in the L2 norm, then actually you have to assume that your
data is bounded in the dual norm. So dual of two is two, so you have to assume
that your data is bounded in the two norm. And then the regret bound is really
the product of two norm of the predictor, the two norm bound on your data and
one over square root T. So this is classical. This is known that online gradient
descent has its regret bound and then you can also convert it into the statistical
guarantee if you have -- if your data [inaudible].
And so this is the bound for online gradient descent; this is for exponentiated
gradient. The nice thing is you can bring this dependence on the L2 norm of the
data down to the infinity norm. So think: if all your features are between zero
and one, then the infinity norm is just one, while the two norm scales with
dimensionality, right. If your data is dense and has D ones, then the two norm
will scale as square root D. So that's good.
But then you get hit by the -- now you get hit by the L1 [inaudible] predictor. So
that's bad because that's bigger. So you can't say which one is better, right? It
depends on the application. So if you have dense data but a good predictor is
sparse, then go ahead and use this. But if it's the other way around, then you're
probably better off using just simple gradient descent.
So this is well known. This tradeoff is well known. We just recover it using this
general theorem. And I'm using something about the loss function: I'm assuming
it's convex and Lipschitz. So even if you're using squared loss you have to sort
of work in a bounded interval to get these.
And this log D, this comes back to the question of one norm not being strongly
convex. So what you do is to handle the one norm like for the second line what
you do is you actually run mirrored descent. So you can either run mirrored
descent with like exponentiated gradient or if you want to run it with P norms,
then you run it with the -- this is what you choose your R to be, to get the second
bound. And you choose P carefully near one. And that's why P minus one is
one over log D so when you invert alpha you get that log D. So that log D comes
from this choice. And so -- and the reason I make this choice is that if you are
this much near one then the LP norm and the L one norm are within a constant
factor of each other. So the LP norm for this value of P approximates the L1
norm within a constant factor.
Okay. So those were all old bounds. Now, in the matrix world -- so this is recent
work of Sham and Shai. In the matrix world, let's say you were using multitask
learning. You can of course use these bounds, right? You think of your matrix
like a huge vector and then use the usual L2 norm or L1 norm. So you can
trivially get these bounds. But what's interesting is that you can also immediately
get bounds for group type regularization or trace regularization. So you get
these new bounds basically for free. And the only thing -- the only
ingredient that changes is that to get this line we use -- there's something I didn't
actually mention, that for matrices I can take the (2, P) group norm squared: this
is a matrix, and you do the two norm on this side and then you do the P norm
here. And then again this function is appropriately strongly convex. I didn't
mention that as one of the examples but you need it to get this row.
The high level point is I mean the bounds are sort of difficult to read here. It's too
much -- too many subscripts. But the point is you get either no dependence or
mild dependence on the dimension. You get sort of the expected dependence
on the various norms of the matrices and the data involved. Basically from this
just one unified algorithm and just plug in different --
>>: [inaudible].
>> Ambuj Tewari: Yes. So here K is the number of tasks, and I'm assuming that
the loss function just sums over the different tasks. The nice thing in this
analysis, unlike most [inaudible] analyses, is that we don't require that you use
the same loss function across tasks. So you can actually share features across
regression and classification tasks. So you might use a logistic loss for
classification and use squared loss for regression. And only -- I mean, here I'm
not doing it, I'm using the same loss. But none of these bounds depend on
whether you use the same loss function.
So you can mix tasks that are classification and regression. And Eric Xing
[inaudible] is doing some practical work on this. He calls it heterogeneous
multitask learning.
>>: And you don't even need [inaudible].
>> Ambuj Tewari: Yes. Yes. So these are -- so this is my -- so this is what my
classifier looks like, right? So the blue L is the Lth row of this matrix. So for the
Lth task I use the Lth row of that matrix. So this is initially how I showed how I
will do multitask.
>>: I was just trying to understand [inaudible] you don't have --
>> Ambuj Tewari: Oh, if you --
>>: You only have some features in common in the task [inaudible].
>> Ambuj Tewari: Oh, like this is -- so, well, so these bounds are always true,
but they're meaningful if your assumptions are correct. For example, this 2, 1
norm will be small if there is shared sparsity. But this bound will always hold.
This is the bound for the algorithm. This bound will be large if your
regularizer doesn't match the assumptions that it sort of corresponds to.
You can go -- you can use the -- this algorithm for the case of sparse W and
dense. It will still run and it will have a worse bound, right? So the algorithm is
not restricted to be used only in a certain setting. Its performance will depend on
whether the setting matches its regularizer.
>>: So you have a [inaudible].
>> Ambuj Tewari: So this, on the left hand side, this is the class I'm working with.
So this corresponds to the group regularization, right, because the constraint
form of the regularization says I'll only work with matrices whose group norm is
small. So this is the class I'm competing against. And this is the
assumption on my data. And then this is a bound and these are four different
algorithms.
So I am not sure if I understand your confusion.
>>: [inaudible].
>> Ambuj Tewari: There is X in the loss, yes.
>>: [inaudible].
>> Ambuj Tewari: So the index in the loss. T is the --
>>: [inaudible] subscripts [inaudible].
>> Ambuj Tewari: So L is -- the picture is --
>>: [inaudible].
>> Ambuj Tewari: So W sub T is a matrix; you have K tasks, and that is what
your WT looks like. And at time T, since you are doing multitask, at time T you
get all the tasks. So your matrix X also has K rows. And I think the index here is
that this is the Lth row of the matrix at time T. Yeah. So this notation gets ugly
with the tasks.
>>: So this presumably treats the tasks as of equal importance. Is there a way to
parameterize that matrix by scaling it by the importance vector and then have --
>> Ambuj Tewari: Yes, you can have --
>>: [inaudible].
>> Ambuj Tewari: You can scale things, yeah. I mean, the proofs are not that
sensitive to different scalings across tasks. So --
>>: [inaudible] propagate it through.
>> Ambuj Tewari: Propagate through --
>>: The vector. If you have the vector which implies the importance -- the
relative importance of [inaudible] would some properties of the vector then go
through into the actual bounds?
>> Ambuj Tewari: Oh, yeah. Great. So then the bounds will sort of scale with
those scalings, right. Yeah. So if you scale differently then it's not clear what
you can say about -- it definitely could have some effect on the overall norm of
the matrix.
And then I mean there is an interesting question here which we haven't fully -- I
don't have time to go into it. How do you compare? I mean, could you start
comparing these bounds and try to identify what properties of your data and
predictors makes one better than the other. So that part I'm not getting into.
So this is for multitask. You also get immediately new algorithms for multiclass
prediction. So the loss you use is the multiclass hinge loss. So this is the
generalization of the usual hinge loss when you have more than one class. So
YT is the correct label. So X is still a matrix here -- sorry, X is not a matrix here
because this is multiclass. But W is still a matrix here, because you use different
rows of W to predict different classes. And this is the margin. So YT is the true
label and Y is one of the other labels. You're sort of penalized by the worst
margin violation. It's the analog of the hinge loss for the multiclass case.
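In symbols, the multiclass hinge loss just described is roughly

$$ \ell(W; (x_t, y_t)) \;=\; \max_{y \neq y_t}\;\Big[\,1 - \big(\langle w_{y_t}, x_t\rangle - \langle w_{y}, x_t\rangle\big)\Big]_{+}, $$

where $[z]_+ = \max(z, 0)$ and $w_y$ denotes the row of $W$ used for class $y$.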
And for this loss, you get these -- so the first two are again already known. So
this is actually -- this is the multiclass perceptron bound due to Yoram and maybe
[inaudible] I'm not really sure who proved the first multiclass perceptron. But
anyway, you get these new algorithms. This actually I was quite excited when
we figured you get this -- so this is a multiclass algorithm, which is the hybrid
between like the group [inaudible] and the multiclass idea. So now you're doing
-- you're sharing features across predictors for different tasks. And you can run
this algorithm and you can see that this bound will actually be much better than
the default multiclass perceptron bound if you have sharing, because then the
W 2,1 norm will not be much larger, while you make significant gains by
depending only on the biggest element in X.
Anyway, again, so this is a lot of fun but the high-level point is that you can again
even for multiclass get new algorithms. This algorithm is not new. It was
proposed -- this trace norm regularized multiclass algorithm was proposed by
Nati Srebro and coauthors, but they didn't analyze it, we just get that analysis for
free by using the strong convexity property of the Schatten norms.
Okay. So I have maybe 15 minutes. So this is the last part of the talk. So in
mirror descent the algorithm is known. Its analysis was known. This generic
bound was known; this is not new. So there I was just showing how to change
the ingredients and get new bounds and new algorithms in different settings. But
now the last part is actually about a new algorithm. And it's not very general;
unlike mirror descent, unfortunately it only applies to particular regularizers. So
I'll work with the L1 norm. And actually the work was motivated by L1 regularized
problems.
So you have this online learning protocol where you play a predictor. You incur a
loss. But often, for example in regularized problems, your loss consists of a data
part and a part that comes from the regularizer. So there's nothing stopping you
from doing just plain gradient descent or mirror descent on this, right: just
evaluate the gradient of both parts, go in the descent direction and proceed.
There's a slight complication that this is not differentiable but you can work with
subgradients and all the guarantees of mirrored descent are still true if you just
have -- if you use a subgradient instead of gradient.
Okay. So you can still -- you can do mirror descent even with L1 regularization.
You will linearize your loss and you will also linearize the L1 part of it. And this is
what the geometric picture will look like. You'll do descent corresponding to the
loss part, then you'll do some descent corresponding to the regularizer. Never mind
the fact that this is not differentiable. But assume you get one sub-- one
subgradient from the subdifferential. And you come back.
Problem is that there is nothing in this update that even hints at sparsity, right?
You're just subtracting some vectors from some vectors, and it's not clear how
somehow magically some components will become zero. They will in the limit,
because you know that this will converge to the optimum under certain
conditions, and if you minimize L1 regularized losses then you get sparsity. But
one update does nothing to promote
sparsity. And this was a big problem in applications that this -- these updates
were not promoting sparsity, even though the reason we use L1 norm is to
encourage sparsity.
Okay. So what's the fix? The fix that we propose -- I mean, not we; the idea was
around, but we sort of put it in connection with mirror descent -- is that if your
online loss consists of two portions, one of which is not even data related, it
never changes, then don't linearize that part. So this is mirror descent. Instead
of linearizing the regularizer, just throw it in there as is. That part, don't touch.
Just linearize the loss function.
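In symbols, the modified update being described is roughly

$$ w_{t+1} \;=\; \arg\min_{w} \;\; \eta\,\langle \nabla f_t(w_t),\, w\rangle \;+\; \eta\,\lambda\,\|w\|_1 \;+\; D_R(w, w_t), $$

where only the data part $f_t$ is linearized and the L1 regularizer is kept as is.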
So this is what we call COMID, Composite Objective Mirror Descent. So there
are two questions. Well, you modified the algorithm, right? And mirror descent
was good because even though it had this flavor that you have to solve an
optimization problem at each step, in the end it was just this nice go to the mirror
space, do descent, come back. It could be implemented efficiently. So can this
modification also be implemented efficiently?
>>: But how does this relate to [inaudible] algorithm?
>> Ambuj Tewari: Yes. It -- they are very similar. You can replace -- instead of
linearizing just one, you can keep a linearization of all the previous losses that
you've seen. And there are actually a whole spectrum of algorithms between --
>>: Right. They're [inaudible].
>>: Is this the same as just Duchi and Singer's algorithm?
>> Ambuj Tewari: So that will -- yes. I'm getting to that. It will be a special case.
>>: Okay.
>> Ambuj Tewari: So that's exactly where -- so Duchi and Singer did some work,
me and Shai did some work and then we realized there was a single algorithm of
which they are special cases. So Duchi and Singer's algorithm is COMID but for
a specific choice of R.
Okay. And does this modification work? I mean, you change the algorithm,
there's no guarantee that it still does something reasonable. But one thing we
are sure of is that if it works it will give sparsity in each update, because
in each update you have the L1 regularizer sitting there, right? So in each
update if you can implement it efficiently, you will know that WT plus 1 will be
encouraged to be sparse because this sits right there in the objective function for
each update. Not just in the [inaudible].
Okay. So this is what this algorithm becomes when this distance is the usual W
minus WT two norm squared. You do descent and then you pass your
intermediate vector through the shrinkage operator, so each component of your
vector goes through this shrink operation. So if you're away from zero, something
gets subtracted. But if you're within some distance of zero, you just get set to
zero. So that's why you
get sparsity in this state. So at least for the simple setting where R is the is the
Euclidian norm you get an efficient implementation. You just do the gradient
descent and do the shrinkage. And in general, this is something new which I
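In the Euclidean case this descent-then-shrink step has a simple closed form; a minimal sketch, assuming a step size eta and L1 weight lam (again illustrative names):

    import numpy as np

    def comid_euclidean_step(w, g, lam, eta):
        # Gradient step on the linearized loss only.
        v = w - eta * g
        # Composite step with the L1 regularizer kept intact: componentwise
        # soft thresholding.  Entries within eta * lam of zero become exactly zero.
        return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)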
don't think it was new -- known before that no matter which LP norm you use in
this general mirrored descent setting, this modified mirror descent algorithm, this
COMID algorithm has the inefficient implementation and this is how mirrored
descent gets modified. It's quite intuitive.
You go to the mirror space, you do descent coming from the loss function, you apply the shrinkage operator in the mirror space, and you come back. The key thing is that when you come back, sparsity is preserved. All these p-norm Bregman functions have the nice property that this gradient mapping is sparsity preserving. If you have 20 nonzero components here, exactly those 20 components will be nonzero.
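Concretely, the mapping in question is the gradient of half a squared q-norm (its inverse has the same form with the conjugate exponent), and that gradient is zero exactly where its argument is zero; a small sketch, taking the formula as the standard p-norm link function rather than something read off the slide:

    import numpy as np

    def pnorm_link(w, q):
        # Gradient of (1/2) * ||w||_q^2.  Each output component has the same
        # sign as the corresponding input component and is zero exactly where
        # w is zero, so the sparsity pattern survives the mapping.
        norm = np.linalg.norm(w, ord=q)
        if norm == 0.0:
            return np.zeros_like(w)
        return np.sign(w) * np.abs(w) ** (q - 1) * norm ** (2 - q)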
So it would be a disaster if this mapping destroyed the sparsity you gained in the mirror space. But it preserves that. And this geometric picture is equivalent to the algorithm. So that's a theorem for this family of Bregman functions.
And it actually also explains why we were able to -- so there is some bound which is not much different from the mirror descent bound. But it's not about improving the bound, it's about having the sparsity property at each step. Actually the bound is cleaner, because if you do mirror descent you get hurt by how big the gradient is in both the loss part and the L1 part. But here you pay only in the loss part.
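Schematically, the bound has the usual mirror-descent shape (this is the generic form, not necessarily the exact constants from the slide): for a fixed comparator w* and step size eta,

    \sum_{t=1}^{T} \bigl[ f_t(w_t) + r(w_t) - f_t(w^*) - r(w^*) \bigr]
        \le \frac{B_\psi(w^*, w_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \| f_t'(w_t) \|_*^2,

where only subgradients f_t' of the loss terms appear on the right-hand side, not subgradients of the regularizer r.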
So it's a cleaner bound than mirror descent. And -- coming back now to the question that was asked about Duchi and Singer's work -- you get two algorithms as special cases. Shai and I gave an algorithm for L1 regularized problems using p-norms that we called SMIDAS, stochastic mirror descent made sparse, and John Duchi and Singer had another algorithm. Both were discovered independently. And then we realized there is this generic algorithm that gives these bounds as special cases. And the story doesn't stop there.
Because of this unified analysis we get new algorithms where, instead of L1, you think of matrix applications where now you're doing trace norm regularization. It's very interesting what the algorithm becomes in the end: you go to the mirror space, so that both spaces are spaces of matrices. You do the descent step coming from your loss function. Now you shrink the singular values. So you don't shrink each component of the matrix; you do the SVD, you shrink the singular values, and you put the SVD back together -- that's what [inaudible] this one is -- and you go back, and again this doesn't destroy sparsity in the spectrum. So [inaudible].
>>: [inaudible].
>> Ambuj Tewari: The R should -- yes, in the matrix case R should be the Schatten p-norm squared, and in the vector world -- actually the property is really that for the gradient at any W, the sign of its G-th component should match the sign of W sub G. So once you take the gradient -- you have a vector, you apply some map to it, you get another vector. The signs should match in each component. And that does happen for all the p-norms and all the --
>>: When you say maintain sparsity, do you mean that the Ws also have this property [inaudible] singular values?
>> Ambuj Tewari: Yes. In this case the notion of sparsity, by definition, is different for matrices. So here, if there are five nonzero singular values there, there will be exactly five nonzero singular values here. Right? Unlike the previous case, where the number of nonzero entries was preserved.
Okay. So here we get new algorithms which we haven't yet tried. John is actually just beginning to run experiments. And there are some interesting -- actually, I had questions for Lin about how we can use it with the -- oh, okay, I'll take the time and have the discussion offline. Anyway, you get new algorithms for these trace norm applications. Again, you have this ingredient R to choose, right? So you can choose R to be simply the Frobenius norm of the matrix, and then you get a particular bound. Then this algorithm is actually not new; it was published last year by Professor Goldfarb at Columbia and his colleagues. But they only proved that it converges -- no rates. So we get rates for free again from this generic COMID result, from this generic COMID result and the realization that this is what the algorithm becomes, and this is actually the algorithm that they proposed, the dual shrinkage. And for them these two spaces were not different, because the Frobenius norm means that there is no mapping. You just do this again and again.
And you actually get a p-norm version of this algorithm. So here both the algorithm and the analysis are new. And in trace norm applications you a priori believe that the good predictor has low trace norm, so this shouldn't be much bigger than that, while the operator norm of a matrix is always less than the Frobenius norm. So you should gain something. So we believe this should also work in practice, although we haven't tried it yet.
And then there are some interesting questions. You know, at least in this world I need to do an SVD at each step, which is not really a good thing to do, so we're actually looking into how to avoid a full SVD -- maybe we just update the SVD from the previous step a little bit. But anyway, that's actually the end of the story.
Let me just state that strong convexity actually arises elsewhere too -- so for the PEGASOS algorithm, for those of you who know it, it's an online algorithm. It was not known whether a single run of the PEGASOS algorithm gives you a good predictor with high probability. It turns out the notion of strong convexity allows you to answer that question.
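As a reminder, a single PEGASOS step on one example (x, y) looks roughly like this -- hinge loss, step size 1/(lambda t), optional projection step omitted -- so the objective it descends on is lambda-strongly convex:

    import numpy as np

    def pegasos_step(w, x, y, lam, t):
        # Stochastic subgradient step on lam/2 * ||w||^2 + hinge loss,
        # with step size decaying as 1 / (lam * t).
        eta = 1.0 / (lam * t)
        if y * np.dot(w, x) < 1.0:
            return (1.0 - eta * lam) * w + eta * y * x
        return (1.0 - eta * lam) * w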
And we've also recently analyzed general exponential families with L1 regularization. So that would include L1 regularized logistic regression, L1 regularized sparse covariance estimation, and L1 regularized squared loss, the Lasso, all in a single framework. And it turns out there is a restricted form of strong convexity which gives you guarantees like the Lasso guarantees in general exponential families.
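Roughly speaking, restricted strong convexity asks for the usual quadratic lower bound, but only along a restricted set of directions tied to the regularizer (the precise set and constant are not spelled out here); something like

    f(w^* + \Delta) - f(w^*) - \langle \nabla f(w^*), \Delta \rangle \ge \frac{\kappa}{2} \, \|\Delta\|^2
        \quad \text{for all } \Delta \in \mathcal{C},

for some constant kappa > 0 and a restricted set C of directions.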
And so that's the summary of the talk: we wanted to understand properties of different regularizers, derive fast online algorithms, and also understand the relationship between the probabilistic and adversarial settings, which is something I didn't emphasize in this talk but would be happy to discuss. So that's the end of the talk.
I just want to mention that I also work in some other areas. The tradeoff between exploration and exploitation is something I really like to study; I have a bunch of papers on reinforcement learning in unknown MDPs and on bandit problems. And I'm also interested in coming up with new algorithms for optimization on large datasets.
So I'll end with that.
[applause].
>> Chris Burges: More questions?
>>: So I'm just wondering, have you thought of using these [inaudible] for, like, log det [inaudible] for regularization across matrices?
>> Ambuj Tewari: Yes. So the log det -- I mean, we have thought about it, but we don't yet have anything.
>>: [inaudible].
>> Ambuj Tewari: Yeah. I mean, log det is actually nice because it's actually a [inaudible] function, not just strongly convex, so I --
>>: [inaudible] matrix inverses constantly, right?
>> Ambuj Tewari: Sorry?
>>: You have to compute matrix inverses constantly if --
>> Ambuj Tewari: Yeah. So there's definitely that issue. But -- yeah, I don't have any concrete answers for that problem, but it seems some of these ideas should carry through. And this is actually -- I'm just trying to recall -- there's a [inaudible] professor at UC Berkeley, I think he has some algorithm which has a very similar flavor for the sparse [inaudible] problem. I haven't actually looked at the real connection, but I just felt on reading that paper that there's some connection.
>> Chris Burges: Let's thank Ambuj again.
[applause]