>> Chris Burges: Let's get started. Today it's a pleasure to welcome Ambuj Tewari to Redmond. He is an assistant professor at the Toyota Technological Institute at Chicago, and also at the University of Chicago. Sounds like a lot. Today he's going to talk about risk, regularization, regret and strong convexity. Thank you. >> Ambuj Tewari: Thank you, Chris, and thanks for inviting me. So this talk has a grander title than I usually like to give. I'm not going to say all there is to say about risk, regularization and regret; these are three very big topics in machine learning. But there is this notion of strong convexity that appears in a lot of my recent work, so I will use it as a unifying theme to present some recent results that have some bearing on these three Rs. This work is in collaboration with a bunch of smart people: John Duchi, a student at UC Berkeley; Sham and Shai, who used to be at TTI but have now moved; Yoram at Google; and Karthik, who is a student at TTI. So feel free to stop me if I'm rushing or going too slowly. This talk is going to be about prediction problems, where we want to predict an output or response Y based on some input X. That's the generic prediction problem: you want to come up with a function of the inputs that can predict the output. That's a very general, high-level view. Examples: a very common one we deal with every day is e-mail, where your spam classifier labels each message as spam or not. Or you might have microarray DNA data, genetic data, and you're trying to classify individuals as diseased or healthy. Joseph Keshet at TTI works with speech; he likes to view the speech recognition problem as a prediction problem, where the input is a speech signal and you want to output the corresponding phoneme sequence. Here, unlike the previous two examples, the set of possible outputs is huge: there is really no limit on how long the sequence can be, so it's a combinatorial space on the Y side. So this example has a slightly different flavor than the other two. Similarly, in the NLP community, people think of parsing as a prediction problem where the input is a sentence and you want to predict the whole parse tree as the label. Again the space of labels is huge. This is what structured prediction deals with. I will not go much into structured prediction, but some of the ideas will apply to such problems. And then there is of course ranking. I was talking to Chris about it; I haven't worked much on the ranking problem per se, but again some of the ideas I will talk about do apply to ranking, where you have a list of documents that a query processing system returns and you want to display them in some ranked order, so that hopefully the top-most document is the most relevant one. Okay. So those are some common examples of prediction problems. What I'm really getting at is linear prediction, prediction using linear functions.
So linear functions: the world is nonlinear, of course, but machine learning really likes linear functions, because it turns out you can do a lot with them if you work in a higher-dimensional space, or even an infinite-dimensional space using the kernel idea. So I will assume that your predictors have this linear form: W is a vector of weights, your input has been represented as some huge vector X, and you just take the inner product of W and X; that gives you a prediction. This is really the workhorse of machine learning; lots of practical methods are based on this idea. And it's not as restrictive as it might look when you see it for the first time. It feels like this only makes sense if you want to predict one real number, but what if you want to predict structured things? You can do that too, and we'll see some examples of how. One thing I want to state up front is that you should think of this D, the dimensionality, as really, really large. I don't want the claims I make about my algorithms to depend explicitly on the dimension, so I will avoid explicit dimension dependence as far as I can. Okay. So how do you use linear predictors, say, for ranking? One way to do it -- not the only way, perhaps -- is to learn a weight vector, take its inner product with each of the documents you want to rank, and then sort those numbers. Maybe these numbers are a good indication of relevance, and you get a ranking. So even if you want to predict a permutation, you can do it by just learning linear functions. That's one simple reduction. For multiclass problems, where you have an input and you want to predict one of, let's say, K classes, you can learn an individual weight vector for each class and predict by taking inner products: to classify this X, take the inner product with W1, with W2, and so on; you get K numbers, and you take the max. That's also one of the ways multiclass classification is done in practice. So again you use the simple idea of linear predictors to solve not just binary classification or regression but multiclass classification. Then of course you can do multitask prediction, where your inputs come from K different tasks. So K here is not the number of classes as in the last example; now these are different tasks, and maybe each task itself is simple -- just a regression or binary classification task, where you want to predict one single number. The idea of multitask learning is that the tasks have some similarity. How do we encode similarity? If you stack the individual task predictors as the rows of a matrix, then something about that matrix should be simple; it should not be an arbitrary matrix. Now, what does simple mean? That depends on the application. Two kinds of assumptions people have made are these. One is low rank: the rows do not span the full possible space, so the rank of the matrix is low. And the other is shared sparsity.
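(A concrete picture of this setup -- the per-task or per-class predictors stacked as rows of one weight matrix -- as a minimal sketch. The shapes and names are hypothetical, not from the slides, and the two structural assumptions appear only as comments.)

import numpy as np

# Hypothetical sizes: K tasks/classes, d features (think of d as very large).
K, d = 4, 10000
W = np.zeros((K, d))          # row W[k] is the linear predictor for task/class k
x = np.random.randn(d)        # one input, represented as a (possibly huge) vector

scores = W @ x                # the K inner products <W[k], x>

y_hat_class = int(np.argmax(scores))   # multiclass: predict the highest-scoring class
y_hat_tasks = scores                   # multitask: row k's score is task k's prediction

# The structural assumptions live in W itself:
#   low rank        -> the rows of W span only a low-dimensional subspace
#   shared sparsity -> only a few columns (features) of W are nonzero, shared by all rows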
Shared sparsity means that even though the apparent dimensionality is huge, there are only a few relevant variables, and those relevant variables are common across tasks; that's the sharing. People call it shared sparsity, and the whole group Lasso line of work is based on this idea. So again, the idea is that you use these individual inner products to predict labels for individual tasks, but then you regularize in a way that encodes your structural assumption. And I will come to regularization very soon. Okay. This is probably my last example of this general idea of using linear predictors to solve more complicated problems. This is for structured prediction, like the phoneme sequence or parse tree examples. Here, for every input X you have a huge number of possible labels, and the way you deal with that is to map each input-label pair, via a feature mapping, into some high-dimensional space. So phi of X, Y lives in R to the D where D is huge, and then you again use the multiclass idea: you take the inner product of your W with each of these feature vectors, one per label, and provided this max can be computed efficiently over this exponentially large set, you can still predict. The idea then is to come up with learning algorithms that are efficient even though the label set is huge, and there has been some success on that front. Okay. So this is the introductory part of the talk. Are there any questions at this stage? Okay. So here is the outline. I will introduce the three Rs that the talk is about -- risk, regularization and regret -- then spend a little bit of time introducing this notion of strong convexity, and then the real meat of the talk is in the last portion, where I will present some new analyses of existing algorithms as well as some new algorithms at the end. Okay. So for the time being, forget about structured prediction and the ranking problems; let's stick to classification and regression, and we'll come back to the more complicated ones. I just wanted to get at this issue of loss functions. For any prediction task it's important to design the loss function correctly, but I'll shove that issue under the rug for this talk and just assume you have picked your favorite loss function. The idea is that you just want a way to go from a prediction and a true label to a quantification of how good that prediction is. For me, for this talk, a loss function is just a way to do that: you get your prediction using your predictor, you have the true label, and L of these two just tells you how badly you did. Okay? So just use some loss function. And once you fix the loss function, the traditional statistical learning viewpoint is to assume that your data comes from some unknown distribution P. You try not to assume too much about P, because it might be generated by some really complicated process. But having the assumption that your data comes from a distribution allows you to associate with each predictor a single number, which is its expected loss. That's the risk -- the first R. The risk of a predictor is simply its expected loss with respect to the distribution. And of course we don't know what P is; P is how the world is, in some sense.
What's the distribution of natural images? What's the distribution of human language sentences? But we do have a training set to train on, and it acts as a proxy for this P that we don't know. Since the sample is all you have, it makes sense to use it to minimize what is known as the empirical risk. The risk is the expectation under the underlying distribution of the data; the empirical risk is just the average loss on the training set. And you might go ahead and minimize it, but then the question is, over what space do I minimize it? If I just minimize it over, let's say, all functions, all possible Ws, then most of us know we get hit by overfitting. And that brings us to regularization. We don't want to search over the whole space of predictors; maybe we have some belief about what a good W looks like, and we want to incorporate that via regularization. If you ask what regularization is, I guess you get different answers from different people. But at a very high level, and probably with the least conflict with most people: regularization is something done to prevent overfitting. I think that is a statement most people would have little problem with. The idea is that you want to encourage simple functions that still fit your data. So there is this eternal tradeoff between fitting the data well -- that's the data part, L hat of W, the performance on the training data -- and having something of low complexity in some sense. And lambda is a regularization parameter that trades these off. For those of you who work with this every day, I'm sure I'm boring you to death, but I just wanted to set up the context for the talk. And there is of course the constrained version of this, where you pull the regularization out of the objective and put it into the constraints. Are there any questions on this very high-level setup? Okay. So let's get into examples of regularizers. I guess the most common one is the L2 norm, the one that is used in SVMs. The L2 norm is simply: square each entry, add them up, take the square root. It corresponds to the length of the vector in two or three dimensions, or however many dimensions -- it's the analog of length. Another regularizer that has become very popular recently, and has connections to sparsity, is the L1 norm, where you just sum the absolute values of the components. It acts as a convex surrogate for the number of nonzero entries in the vector. The third very popular regularizer comes from the max-ent and boosting world, where you measure the complexity of a vector by its entropy -- really the negative entropy, because I want to work with convex regularizers, and entropy itself is concave. Here I'm assuming W is a probability distribution, so each component is nonnegative and they sum to one. So these are all examples where W is really a vector. What if we have matrices? You saw that matrices arise in multiclass and multitask problems. The group idea, which I very briefly hinted at a few minutes ago, is to use this notion of shared sparsity: think of a matrix whose rows are the individual predictors, each predictor with, say, a few dimensions.
And the rows can correspond to tasks in multitask learning or to classes in multiclass. The idea is to take the usual L2 norm of each column -- so, of each feature -- and then add those up. Just as the L1 norm encourages sparsity at the level of individual components, this should encourage sparsity at the column level: only a few columns should really light up in this matrix, and the rest should be turned off to zero if you use this regularizer. At least, that's the idea; it doesn't always happen, only under some conditions. And the trace norm, or nuclear norm, which has become very popular due to its association with the matrix completion problem, encodes this idea of limited linear independence among the rows. Here you take the SVD of your matrix W; these are the singular values, and you take the L1 norm of the singular values. Why? Because the rank is the number of nonzero singular values, and a convex surrogate for the number of nonzero elements of a vector is the L1 norm. So you just take the L1 norm of the singular values. So these are the five regularizers that I'll use as running examples. Okay. I'm done with risk and regularization, so there remains the third R, the one I want to talk about, which is regret -- maybe not as many people are familiar with it. I'd like to illustrate the idea with an example. I'll get to the definition of regret in a moment, but let's do this little example first. Say you have five experts predicting the weather, and for simplicity assume the weather takes only two values, sunny or cloudy. And -- a very unreasonable assumption -- you know that one of these experts is perfect. Your job is to look at their predictions, make your own prediction, and make the minimum number of mistakes. Okay? So on the first day you have no prior information; you just know one of them is correct, so maybe you go with the first one. And reality is cruel; it doesn't always do what you predict. It's cloudy, so you made a mistake. But that's fine: we can now rule these experts out, because we know there is a perfect predictor and these ones can't be it. So now we have just these three. Now they predict cloudy, sunny, sunny. It was cloudy yesterday, so maybe, if things persist, you go with the first one again. Well, you make another mistake. But still, progress has happened: you paid in performance, but you zoomed in on these two. The next day they both predict sunny, so you're in good shape -- one of them is perfect and they agree, so you won't make a mistake. Then the next day they disagree. You go with the first one, and you make another mistake, but now you're guaranteed never to make a mistake again, because the remaining expert is the perfect one. Okay? So the idea is that under this unreasonable assumption of a perfect expert, there is a very simple strategy -- just go with the first uneliminated expert -- that makes only a finite number of mistakes. And the point I'm making is that there is no probabilistic assumption here on how the world behaves. There is an unreasonable assumption of a perfect expert, but that can be removed.
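(A minimal sketch of the elimination strategy in this story. The toy data below is made up to loosely mirror the example, not the actual slide.)

def predict_with_first_consistent(expert_predictions, outcomes):
    """Follow the first not-yet-eliminated expert; eliminate every expert
    that is ever wrong. Assumes (unreasonably) that one expert is perfect."""
    n_experts = len(expert_predictions[0])
    alive = list(range(n_experts))
    mistakes = 0
    for preds, truth in zip(expert_predictions, outcomes):
        my_prediction = preds[alive[0]]                   # go with the first surviving expert
        mistakes += (my_prediction != truth)
        alive = [i for i in alive if preds[i] == truth]   # rule out anyone who erred
    return mistakes

# Toy run: 5 experts, outcomes sunny (S) / cloudy (C); expert 5 happens to be perfect.
expert_predictions = [list("SSCCC"), list("CCCSS"), list("SCCSS"), list("CCCCS")]
outcomes = list("CSSS")
print(predict_with_first_consistent(expert_predictions, outcomes))   # prints 3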
It just gives a flavor of how you can say something without making probabilistic assumptions: an adversarial guarantee that no matter what happens, my algorithm will satisfy a certain property. And actually, if you go not with the first uneliminated expert but with the majority of the ones that have not been eliminated, you only make a logarithmic number of mistakes. So the strategy of going with the first one is actually a bit silly, but it still gives a finite number. This motivates the online learning setup. That example was not really a learning problem -- a rather concocted problem -- so let's get back to learning. Maybe you have a temporal process that you're trying to predict, and you keep a predictor at each time T. That's your move. I'll think of this as a game between the learner and the environment, which can be adversarial. You play a predictor; in the expert setting, that just corresponds to picking one of the five experts, so your move is the corresponding vector. Then the online learning protocol is that the world responds with a loss function. In the expert setting, this loss function just encodes which experts made mistakes: if the first two experts made mistakes, that's what the loss vector records. And then you suffer this loss function evaluated at your choice for that round. Here the loss function is very simple -- it's just linear: if the expert you picked made a mistake, you pay for it. A very complicated way of saying that is that you take the inner product between these two vectors. But I'm making it complicated because it brings me to the more general setting of online learning, where there is this protocol and you suffer these losses at each time. And then what's the regret? Well, the regret is: sum up the losses that this online learner suffered -- with a division by T to scale it -- and subtract the best you could have done if you knew all the loss functions in advance. What's the best you could have done, over your space, if you knew everything in advance? This comparator W lives in the same space your algorithm plays in. I was mentioning this definition to Chris, and he said, well, you're competing with a quantity that's maybe not that interesting, because you are allowed to change WT at each time while what you're competing against has to use the same W throughout. There are several possible responses to that criticism. The first is that it is possible to study this quantity and prove interesting bounds. Not only that -- and this is the second response -- these bounds often give very competitive answers in the statistical setting: if you assume that these loss functions are not adversarial but are coming from some distribution, then you recover some of the best guarantees known for batch algorithms, which is kind of reassuring. And the third response is that yes, this is restrictive, but it's just a start; you can consider other notions of regret where the comparator is allowed to change a little bit. Okay, so this is the definition, and I just want to make sure everyone gets what this quantity is measuring.
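(Written out, the definition just described, with the scaling by T:)

\[ \mathrm{Regret}_T \;=\; \frac{1}{T}\sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{w \in \mathcal{W}} \frac{1}{T}\sum_{t=1}^{T} \ell_t(w), \]

where w_1, ..., w_T are the learner's plays and the minimum is taken over the same space the learner plays in.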
Your cumulative loss, minus the minimum you could have achieved if you had seen the whole sequence in advance. Okay? Yes? >>: So usually [inaudible] T and N would be the same thing, so is that -- >> Ambuj Tewari: Oh, that's a typo. That small N -- yeah, in statistical learning we use N and somehow in the online world we use T, and I got them mixed up here. N is equal to T here; it's the same, and you can see that typo propagating. Okay. So the regret, unlike the risk -- the risk is just a number: I have a predictor and a distribution, and the risk is the expected loss of that predictor, a single number depending on the distribution. The regret, on the other hand, depends on the actual realized sequence of loss functions, on the actual data you saw. So the regret of an algorithm is a sequence-dependent quantity. The surprising fact is that you can often bound it for all sequences from a certain class, even sequences generated adversarially; the adversary might be looking at your algorithm and trying to feed it bad examples. Okay. And this slide illustrates that if you do come up with a way to bound the regret, then you can use it in a streaming setting where your data is IID but you're seeing it one example at a time. What you can do is mimic the online protocol using the data stream: maybe this is the stream of incoming e-mails; you maintain a classifier at each stage, you receive an e-mail, you get its correct label, you suffer the loss -- maybe just a squared loss -- and that's how you mimic the online protocol. The point, at a very high level, is that there are conversions that take you from the online adversarial setting to the batch statistical setting by combining the regret guarantee with the assumption that the data is actually IID, using some concentration inequalities, to give a bound in the statistical setting. In the statistical setting the statement is that the risk of the average of the predictors you played over time -- just the average of these Ws -- is not much worse than the best risk, the risk of the predictor in your class that has the minimal risk, plus a little bit, where the little bit depends on the regret guarantee and on what you accumulate from the concentration inequalities. The high-level idea is that you can use these maybe esoteric-looking regret bounds to get results in the statistical setting, and you don't lose much; so these regret bounds are apparently not that loose. People have also used these online algorithms to do optimization of batch objectives. For example, there is the PEGASOS algorithm, which is really an online algorithm but works on the SVM objective and solves the primal problem. The idea is that instead of a data stream, you mimic the online protocol by drawing samples from your given batch dataset. Although you have the whole dataset in front of you, maybe you don't want to scan through all of it; you can just sample randomly from it. And then again you can combine the regret guarantee with the fact that you're drawing randomly from a big dataset to get guarantees on the batch objective -- the error on this dataset S. The loss of my average predictor on the dataset will be at most the empirical minimizer's loss plus a little bit. Okay? So those are two ways in which online bounds are used in learning. Okay.
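(A schematic sketch of the online-to-batch conversion just described: feed IID samples to an online learner one at a time and return the average of its iterates. The names and the particular update are placeholders -- this is not the actual PEGASOS code -- and `samples` can equally well be drawn uniformly at random from a fixed training set for the batch-optimization use.)

import numpy as np

def online_to_batch(samples, grad_loss, eta=0.1):
    """Run a simple online (sub)gradient learner over the samples and return
    the average of its iterates; the regret bound plus a concentration
    argument is what guarantees the averaged predictor has low risk."""
    d = len(samples[0][0])
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for x, y in samples:                    # the "stream": one labeled example per round
        w_sum += w
        w = w - eta * grad_loss(w, x, y)    # placeholder online update (a plain gradient step)
    return w_sum / len(samples)             # averaged predictor, the one the guarantee is about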
What we want to do is understand all these different regularizers -- entropy, L1, L2, group norm, trace norm. In the probabilistic setting we want to say that the risk won't be big; in the adversarial setting we want to say that the regret will be bounded. And we want to understand the relationship between the probabilistic model and the adversarial model: when are the regret bounds the same as the statistical bounds, when are they different, things like that. The notion of strong convexity appears to play an interesting and important role in all of these. I won't get into the risk bounds because I don't have that much time; I'll mostly concentrate on the online setting, because I think that's also more practical -- you get fast algorithms out of it. Okay. So here is a very brief intro to strong convexity for those of you who haven't seen it. Convexity I'm sure everyone is familiar with. Convexity in one dimension means that if you connect two points on the graph of the function, the chord always lies above the graph; in particular, at the midpoint you are above the function. Another equivalent definition, at least if the function is differentiable, is that you can draw a tangent at any point and the tangent lies below the function. Yet another is that the derivative is monotonically increasing, or at least nondecreasing, which means the derivative of the derivative -- the second derivative -- is nonnegative. So there are three definitions, and you can try to make each one stronger to get a notion of strong convexity. The good thing is that you get a single notion: you strengthen each definition and the resulting notions coincide, so we have a robust notion of strong convexity that doesn't depend on which definition of convexity you start with. You can take the chord definition and say: the function is not just below the chord, but below by a significant amount, quantified by a parameter alpha. So my function is alpha-strongly convex if there is at least that much gap between the function and the chord. Another proposal: the function is not just above its linear approximation, but above by a significant amount, again quantified by alpha. And a third: the second derivative is not just nonnegative but at least alpha, where alpha is a positive number. All three definitions are again equivalent, so we get a robust notion -- >>: So twice differentiable -- >> Ambuj Tewari: So that brings me to the definition. This one I take as the definition, and the rest are equivalent if you have sufficient differentiability. In high dimensions, the definition of strong convexity that is used is that the function at the midpoint is at most the average of the function values at the endpoints minus -- yes, this should be a minus -- alpha over eight times the squared distance between the endpoints. The eight is just there so that one half times the Euclidean norm squared comes out with strong convexity constant one. >>: Is halfway enough, or can you construct some weird counterexample -- >> Ambuj Tewari: Oh, this is all assuming the functions are continuous. Yeah. No weirdness is allowed. >>: Okay. >> Ambuj Tewari: Yes.
But, yeah, half is enough only if you have continuity, and then you can infer it for the other points. And the issue of the norm is important, and I want to say that right now. In 1D, pictures are misleading because there's only one way to measure size -- the absolute value -- but in higher dimensions it is very important with respect to which norm your function is strongly convex, because the constant will depend on the norm, and I don't want my constants to depend on dimensionality. Okay. The definition, as far as I can tell, first appeared in 1966 in an optimization theory paper by Polyak, and it has been appearing more and more in the machine learning literature in the past four or five years. And there are these equivalent characterizations: this was the definition and this is a consequence of it; if your function is differentiable, then the linear approximation sits below by this much; the gradient mapping is not just monotone but strongly monotone; and there is a condition involving the Hessian. These just show that you can state the notion in different ways. Okay, examples. One important family of examples are the LP norms. The L1 and L2 norms are called L1 and L2 because they are the LP norm for P equal to one and two; in general, you take the Pth power of each absolute value, add them up, and take the one-over-P power of that. That defines a norm. And one half times the LP norm squared is known to be (P minus one)-strongly convex with respect to the P norm. This is important: it does not refer to the dimension at all; it is P minus one irrespective of the dimension. And this is true only for P between one and two. Don't ask me what happens after two, because it gets more complicated -- [inaudible] says you don't have strong convexity, you have a notion that's between convexity and strong convexity. >>: Is it the same as alpha equals P minus one? When you say P minus one strongly [inaudible]. >> Ambuj Tewari: Yeah, alpha equals P minus one if you take this function. For this specific function, the biggest alpha you can work with is P minus one. >>: But an alpha of zero -- >> Ambuj Tewari: Is always true, right? If a function is alpha-strongly convex, it's alpha-prime-strongly convex for all alpha prime less than alpha. >>: But is a zero strongly convex function [inaudible]. >> Ambuj Tewari: A zero strongly convex function is just convex. >>: So if you do the L1, it's not actually strongly convex -- >> Ambuj Tewari: Good point, and I'll come to that. For L1 we have to do a little trick: you stay just a bit above one -- >>: L1.0 something -- >> Ambuj Tewari: Yeah, you do one plus a little bit. Good point. If you're exactly at one, you lose strong convexity. I'll come back to that point. And for matrices -- actually, when does the talk end? >>: [inaudible]. >> Ambuj Tewari: Okay, so I'm doing well on time then. So this was a family of examples on vectors; for matrices you have the so-called Schatten norms. The name is complicated but the idea is simple: you take the SVD of your matrix and then apply the LP norm to the singular values. And again, the trace norm and the Frobenius norm correspond to taking the L1 or the L2 norm of the singular values. So that's the family of Schatten norms.
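(For reference, the vector fact this builds on, stated explicitly:)

\[ \text{For } 1 < p \le 2:\quad R(w) = \tfrac{1}{2}\|w\|_p^2 \text{ is } (p-1)\text{-strongly convex w.r.t. } \|\cdot\|_p, \text{ i.e. } R(w) \ge R(v) + \langle \nabla R(v),\, w - v\rangle + \tfrac{p-1}{2}\|w-v\|_p^2, \]

with no dependence on the dimension.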
And the interesting thing -- and this is not a trivial consequence of that fact; it takes some proof, in a paper by Keith Ball and a few other people -- is that one half times the Schatten P norm squared is also (P minus one)-strongly convex with respect to the Schatten P norm. So again, no dependence on the dimensions D1 and D2 of the matrix. You might hope that you could infer this from the vector fact using some simple linear algebra, but because of the noncommutativity of matrices the proof is actually not simple; at least I don't know of any simple proof. The third example is the negative entropy. Again you get a dimension-independent strong convexity constant, provided you work with the L1 norm: negative entropy is strongly convex not with respect to L2 but with respect to L1. And the logical content of the statement that it is 1-strongly convex is actually equivalent to Pinsker's inequality, which says the KL divergence is lower bounded by the squared variation distance, for those of you who know Pinsker's inequality. It actually has a quantum analog, but I won't get into that; in the quantum world, probability distributions get replaced by density matrices -- positive semidefinite matrices with trace one. Okay. So I'm done with the three Rs and with the definition of strong convexity. Are there any questions on the introduction or on strong convexity? Okay, if there are no questions, I'll just move on. So now, coming to the meat of the talk, which is online algorithms: now I'll make the connection and try to put things together. As I said, I won't go into the statistical learning theory results -- how to obtain generalization error or risk bounds in the statistical world -- I'll stick with the online setting. So here is the protocol: you play a predictor at each time, and you receive some loss function. I think it's good to keep in mind that I'm not thinking of this L sub T as an abstract function; in my head I'm really thinking of some machine learning loss function applied to the Tth example. It could be your favorite loss; you might even be doing structured prediction -- in fact, people have used some of these ideas for structured prediction. Okay. To motivate mirror descent, let's start with two very well-known algorithms in machine learning. One is online gradient descent: you get your next iterate by going a little bit in the descent direction for the given loss function, and then maybe you need a projection to come back into your set. That's just simple gradient descent. The exponentiated version of that: let's say you're working on the probability simplex. You're keeping track of this evolving WT, and to get WT plus one from WT, this is what gradient descent does, and this is what exponentiated gradient does: each component gets multiplied by E to the power of minus the learning rate times the corresponding component of the gradient. It's the multiplicative version of this iterative algorithm.
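(The two updates just described, as a minimal sketch; eta is the learning rate, and the projection step for gradient descent is omitted.)

import numpy as np

def ogd_step(w, grad, eta):
    """Online gradient descent: step in the negative gradient direction.
    (A projection back onto the feasible set would follow if needed.)"""
    return w - eta * grad

def eg_step(w, grad, eta):
    """Exponentiated gradient on the probability simplex: multiply each
    component by exp(-eta * gradient component), then renormalize."""
    v = w * np.exp(-eta * grad)
    return v / v.sum()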
Both have been around for a while, and they're both special cases of mirror descent. Here's how. Gradient descent has an equivalent description: the next iterate is obtained by solving a simple optimization problem. What is that optimization problem? You linearize your loss function at the current iterate, around WT. That's a bad approximation far from WT, so you add a proximal term -- you don't want to move away from where you made the approximation -- and you minimize the sum of these two to get WT plus one. And it's not that difficult to verify: this is the linear approximation, the value at WT plus the gradient times the difference, and if you minimize this objective, the solution is exactly the gradient step. The extra term is called a proximal term. For exponentiated gradient, the only difference is the way you measure the distance: instead of the Euclidean distance you use -- not really a true distance -- the KL divergence. And again, it's a bit of a calculation to see that minimizing this gives you that closed form. Okay. So now these algorithms start to look the same; only one ingredient has changed. And you can say, well, I'll just put a general Bregman divergence here. Let me say what I mean by Bregman divergence. If you have a convex function R, you can get a divergence out of it by evaluating the function at W, the first argument, and subtracting the linear approximation made at V, evaluated at W. The picture is: you have this R, you make the linear approximation at V, you evaluate it at W, and that leaves a gap; that gap is exactly the Bregman divergence. The divergence is nonnegative by definition if R is convex. We'll actually assume R is strongly convex, and this is where the notion of strong convexity comes back in. So this is now a family of algorithms: you change this ingredient R and you get different algorithms. What are these algorithms good for? We'll relate the choice of R to the task at hand. For multitask, for trace norm, it turns out there are appropriate Rs that give you the right algorithm and the right bounds. So I'll assume that R is strongly convex. Okay. Now, this is somehow not the most intuitive way to think about the algorithm, but it is the definition. Remember gradient descent had two equivalent descriptions: you can view it this way, or you can view it as taking a gradient step. It turns out that even for the general mirror descent algorithm there are two views. One view is that you are trading off the proximal term against the loss-dependent term. The other is more geometric: you have WT, and you go to a mirror space -- that's actually where the term mirror descent comes from. You do the descent step in the mirror space. What's the mapping that takes you there? It is the gradient mapping of this strongly convex function R, which we assume is differentiable. So this gradient mapping takes you to the mirror space, you do descent there, and then you come back.
And since R is strongly convex, its gradient is invertible -- in fact, for any strictly convex function the gradient is invertible -- so you can come back. With this picture you get the same algorithm as before, and online gradient descent and exponentiated gradient just differ in the choice of this mapping. >>: [inaudible] is the matrix. >> Ambuj Tewari: Grad R -- both of these are vectors of the same dimensionality. Grad R just maps vectors to vectors; it's the gradient of a scalar-valued function, so it takes a vector as input and outputs a vector. For simple gradient descent, R of W is one half the two-norm of W squared, the gradient map is simply the identity, and nothing happens -- the mirror space is the same as the original space. And then you might project if you go outside your set. >>: [inaudible]. >> Ambuj Tewari: Which gradient? >>: The gradient -- >> Ambuj Tewari: This gradient comes from the loss; and this one is the mapping. I apply the mapping to WT to get the mirror point, do the descent there, and come back. >>: [inaudible] do it on the right-hand side? >> Ambuj Tewari: Yes, there are lazy versions where you maintain things only in the mirror space and only map back when you need to make a prediction, that's right. But if we're making predictions every round, then you have to come back. So here's a generic guarantee -- not due to us. The nice thing is that this is a nice family of algorithms whose only ingredient is this strongly convex function. Assume you have an upper bound on that function and its parameter of strong convexity, and assume all the gradients of your loss functions are not too big in the dual norm. So R is strongly convex with respect to some norm, which will be application specific, and you have to measure your gradients in the dual norm; those two things go together -- you don't have freedom in choosing a different norm there. Then there is this generic guarantee: if you use the mirror descent algorithm, no matter what your loss sequence is, as long as the gradients are bounded, your regret will always be bounded by something linear in the dual-norm bound on the gradients. And the nice thing is there is no explicit dimension dependence: as long as alpha is independent of the dimension, all these bounds are also independent of the dimension. >>: So [inaudible] always exists for strongly convex -- >> Ambuj Tewari: Actually the inverse exists for any strictly convex R, so in particular for strongly convex R; all you need for the gradient to be invertible is strict convexity. R shouldn't have any flat portions, any linear bits in it. >>: [inaudible]. >> Ambuj Tewari: Okay. So the regret, to remind us, is just this quantity: the sum of the losses of the algorithm minus the best in hindsight. Okay. That's an abstract theorem, and it's not clear what it buys you, so let's now do specific cases. And these first cases are known results; I'm just putting them here to show that they come out of this generic framework. So these are not new.
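(Schematically, the generic guarantee just described has the following form -- stated with a generic constant, since the exact constant depends on the step-size tuning and may differ from the slides. If R is alpha-strongly convex with respect to some norm, R is at most R_max over the comparator set, and every loss gradient has dual norm at most G, then with a suitably tuned step size)

\[ \frac{1}{T}\sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{w}\frac{1}{T}\sum_{t=1}^{T}\ell_t(w) \;\le\; c\, G\,\sqrt{\frac{R_{\max}}{\alpha\, T}} \]

for a small absolute constant c. Note that the dimension enters only through alpha, R_max and G.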
So let's say you're learning with plain vectors, doing some regression or classification. If the predictors you're competing against are bounded in the L2 norm, then you have to assume that your data is bounded in the dual norm; the dual of two is two, so your data must be bounded in the two norm. The regret bound is then essentially the product of the two-norm bound on the predictor, the two-norm bound on your data, and one over square root of T. This is classical: it is known that online gradient descent has this regret bound, and you can also convert it into a statistical guarantee if your data is [inaudible]. And this is the bound for exponentiated gradient. The nice thing is that you can bring the dependence on the data down from the L2 norm to the L-infinity norm. Think of data where all the features are between zero and one: the infinity norm is just one, while the two norm scales with the dimensionality -- if your data is dense, with D ones, the two norm scales like square root of D. So that's good. But then you get hit by the L1 norm of the predictor instead, and that's bigger. So you can't say which one is better; it depends on the application. If you have dense data but a good predictor is sparse, go ahead and use exponentiated gradient; if it's the other way around, you're probably better off with simple gradient descent. This tradeoff is well known; we just recover it using the general theorem. I'm also using something about the loss function: I'm assuming it's convex and Lipschitz, so even for the squared loss you have to work in a bounded region to get these. And this log D comes back to the question of the one norm not being strongly convex. To handle the one norm, for the second line, you actually run mirror descent -- either with exponentiated gradient, or with a P norm: this is what you choose your R to be to get the second bound, and you choose P carefully, near one. P minus one is one over log D, so when you invert alpha you get that log D; the log D comes from this choice. And the reason for this choice is that if you are this close to one, the LP norm and the L1 norm are within a constant factor of each other; the LP norm for this value of P approximates the L1 norm up to a constant. Okay, so those are all old bounds. Now, in the matrix world -- this is recent work of Sham and Shai -- let's say you're doing multitask learning. You can of course still use these bounds: think of your matrix as one huge vector and apply the vector L2 and L1 results; you trivially get those bounds. What's interesting is that you can also immediately get bounds for group-type regularization or trace regularization -- you get these new bounds basically for free. The only ingredient that changes is that to get this line we use something I didn't actually mention: for matrices you can take the group (2, P) norm squared -- the two norm in one direction and then the P norm in the other, with P near one -- and again half of this norm squared is appropriately strongly convex.
I didn't mention it as one of the examples, but you need it to get this row. The high-level point -- the bounds are difficult to read here, too many subscripts -- is that you get either no dependence or only mild dependence on the dimension, and you get the expected dependence on the various norms of the matrices and the data involved, all basically from one unified algorithm, just plugging in different regularizers. >>: [inaudible]. >> Ambuj Tewari: Yes. Here K is the number of tasks, and I'm assuming that the loss function just sums over the different tasks. The nice thing about this analysis, unlike most [inaudible] analyses, is that we don't require that you use the same loss function across tasks. So you can share features across regression and classification tasks: you might use a logistic loss for the classification tasks and a squared loss for the regression tasks. Here I'm not doing that -- I'm using the same loss -- but none of these bounds depend on using the same loss function, so you can mix classification and regression tasks. Eric Xing [inaudible] is doing some practical work on this; he calls it heterogeneous multitask learning. >>: And you don't even need [inaudible]. >> Ambuj Tewari: Yes. So this is what my classifier looks like: the blue W sub L is the Lth row of the matrix, so for the Lth task I use the Lth row. This is how I initially said I would do multitask. >>: I was just trying to understand [inaudible] you don't have -- >> Ambuj Tewari: Oh, if you -- >>: You only have some features in common in the tasks [inaudible]. >> Ambuj Tewari: Well, these bounds are always true, but they're meaningful only if your assumptions are correct. For example, this 2,1 norm will be small if there is shared sparsity. The bound will always hold -- it is the bound for the algorithm -- but it will be large if your regularizer doesn't match the assumptions it corresponds to. You can use this algorithm in the case where W is dense rather than sparse; it will still run, it will just have a worse bound. So the algorithm is not restricted to a certain setting; its performance depends on whether the setting matches its regularizer. >>: So you have a [inaudible]. >> Ambuj Tewari: On the left-hand side, this is the class I'm working with. This corresponds to the group regularization, because the constraint form of the regularization says I'll only work with matrices whose group norm is at most this. So that's the class I'm competing against, this is the assumption on my data, this is the bound, and these are four different algorithms. So I'm not sure I understand your confusion. >>: [inaudible]. >> Ambuj Tewari: There is X in the loss, yes. >>: [inaudible]. >> Ambuj Tewari: So the index in the loss: T is the -- >>: [inaudible] subscripts [inaudible]. >> Ambuj Tewari: So L is -- the picture is -- >>: [inaudible]. >> Ambuj Tewari: So WT is a matrix with K rows, one per task; that's what WT looks like. And at time T, since you're doing multitask, you get an example for every task, so your matrix XT also has K rows. The index here means this is the Lth row of the matrix at time T. Yeah, the notation gets ugly with the tasks. >>: So this presumably treats the tasks as being of equal importance.
Is there a way to parameterize that matrix by scaling it by the importance vector and then have -- >> Ambuj Tewari: Yes, you can have -- >>: [inaudible]. >> Ambuj Tewari: You can scale things, yeah. The proofs are not that sensitive to different scalings across tasks. So -- >>: [inaudible] propagate it through. >> Ambuj Tewari: Propagate through -- >>: The vector. If you have a vector which gives the relative importance of [inaudible], would some properties of that vector then go through into the actual bounds? >> Ambuj Tewari: Oh, yeah. The bounds will scale with those scalings, right. If you scale differently, it's not completely clear what you can say, but it definitely could have some effect on the overall norm of the matrix. And there is an interesting question here which we haven't fully explored, and I don't have time to go into it: how do you compare these bounds and try to identify what properties of your data and predictors make one better than the other? That part I'm not getting into. So this was for multitask. You also immediately get new algorithms for multiclass prediction. The loss you use is the multiclass hinge loss, the generalization of the usual hinge loss to more than two classes. YT is the correct label; X is not a matrix here, because this is multiclass, but W is still a matrix, because you use different rows of W to predict different classes. And this is the margin: YT is the true label, Y is one of the other labels, and you are penalized by the worst margin violation -- the analog of the hinge loss for the multiclass case. For this loss you get these bounds. The first two are again already known: this is the multiclass perceptron bound, due to Yoram and maybe [inaudible] -- I'm not really sure who proved the first multiclass perceptron bound. But you also get new algorithms. I was quite excited when we figured this one out: a multiclass algorithm which is a hybrid between the group-norm idea and the multiclass idea, so now you're sharing features across the predictors for the different classes. You can run this algorithm, and this bound will actually be much better than the default multiclass perceptron bound if you have sharing, because then the 2,1 norm of W will not be much larger, while you make significant gains by depending only on the biggest element of X. Anyway, this is a lot of fun, but the high-level point is that even for multiclass you get new algorithms and new analyses. This last algorithm is not new: the trace-norm-regularized multiclass algorithm was proposed by Nati Srebro and coauthors, but they didn't analyze it; we get that analysis for free by using the strong convexity property of the Schatten norms. Okay. I have maybe 15 minutes, so this is the last part of the talk. In mirror descent, the algorithm was known, its analysis was known, the generic bound was known; that part is not ours. So far I've just been showing how to change the ingredients and get new bounds and new algorithms in different settings. But the last part is actually about a new algorithm. It's not very general; unlike mirror descent, unfortunately, it only applies to particular regularizers. So I'll work with the L1 norm.
And actually the work was motivated by L1-regularized problems. So you have this online learning protocol where you play a predictor and incur a loss, but often, for example in regularized problems, your loss consists of a data part and a part that comes from the regularizer. There's nothing stopping you from doing plain gradient descent or mirror descent on this: just evaluate the gradient of both parts, go in the descent direction, and proceed. There's a slight complication that the L1 part is not differentiable, but you can work with subgradients, and all the guarantees of mirror descent still hold if you use a subgradient instead of the gradient. So you can do mirror descent even with L1 regularization: you linearize your loss and you also linearize the L1 part, and this is what the geometry looks like. You do a descent step corresponding to the loss part, then some descent corresponding to the regularizer -- never mind that it's not differentiable, assume you pick one subgradient from the subdifferential -- and you come back. The problem is that nothing in this update even hints at sparsity. You're just subtracting some vectors from some vectors, and it's not clear how, magically, some coordinates will become zero. They will in the limit, because under certain conditions the iterates converge to the optimum, and if you minimize an L1-regularized loss you get sparsity at the optimum. But a single update does nothing to promote sparsity. And this was a big problem in applications: these updates were not producing sparsity, even though the whole reason we use the L1 norm is to encourage sparsity. So what's the fix? The fix that we propose -- well, not exactly we; the idea was around, but we put it in connection with mirror descent -- is that if your online loss consists of two pieces, one of which doesn't depend on the data and never changes, then don't linearize that piece. So this is mirror descent; instead of linearizing the regularizer, just keep it right there in the objective. Don't touch that part; linearize only the loss function. This is what we call COMID, Composite Objective Mirror Descent. So there are two questions. You've modified the algorithm, and mirror descent was good because, even though it has this flavor of solving an optimization problem at each step, in the end it was just this nice recipe: go to the mirror space, do descent, come back -- it could be implemented efficiently. So can this modification also be implemented efficiently? >>: But how does this relate to [inaudible] algorithm? >> Ambuj Tewari: Yes, they are very similar. Instead of linearizing just the current loss, you can keep a linearization of all the previous losses that you've seen, and there's actually a whole spectrum of algorithms between -- >>: Right. They're [inaudible]. >>: Is this the same as Duchi and Singer's algorithm? >> Ambuj Tewari: Yes, I'm getting to that; it will be a special case. >>: Okay. >> Ambuj Tewari: That's exactly where -- so Duchi and Singer did some work, Shai and I did some work, and then we realized there was a single algorithm of which both are special cases. Duchi and Singer's algorithm is COMID for a specific choice of R. Okay. And does this modification work?
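(For reference, the composite update just described -- loss linearized, regularizer kept intact -- written schematically; the exact placement of the step size eta may differ from the slides:)

\[ w_{t+1} \;=\; \arg\min_{w} \;\; \eta\,\langle \nabla \ell_t(w_t),\, w\rangle \;+\; B_R(w, w_t) \;+\; \eta\, r(w), \]

where r is the fixed regularizer, for example r(w) = lambda times the L1 norm of w, and B_R is the Bregman divergence of the strongly convex function R.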
I mean, you change the algorithm, and there's no guarantee that it still does something reasonable. But one thing we are sure of is that if it works, it will give sparsity at each update, because in each update you have the L1 regularizer sitting right there in the objective. So if you can implement each update efficiently, you know that WT plus one will be encouraged to be sparse, because the regularizer sits right there in the objective of every update, not just in the limit. Okay. This is what the algorithm becomes when the distance is the usual two-norm squared of W minus WT: you do a descent step and then you pass your intermediate vector through the shrinkage operator, so each component goes through this shrink operation. If you're away from zero, something gets subtracted; if you're within some distance of zero, you just get set to zero. That's why you get sparsity at each step. So at least in the simple setting where R is the Euclidean norm, you get an efficient implementation: just do the gradient step and then the shrinkage. And in general -- this is something I don't think was known before -- no matter which LP norm you use in this general mirror descent setting, this modified algorithm, COMID, has an efficient implementation, and this is how mirror descent gets modified. It's quite intuitive: you go to the mirror space, you do the descent step from the loss function, you apply the shrinkage operator in the mirror space, and you come back. The key thing is that when you come back, sparsity is preserved. All these P-norm Bregman functions have the nice property that the gradient mapping is sparsity preserving: if you have 20 nonzero components here, exactly those components will be nonzero after the map. It would be a disaster if the mapping destroyed the sparsity you gained in the mirror space, but it's preserved. And this geometric picture is equivalent to the algorithm; that's a theorem for this family of Bregman divergences. There is a regret bound which is not much different from the mirror descent bound, but the point is not improving the bound, it's having the sparsity property at each step. Actually the bound is a bit cleaner: with plain mirror descent you get hurt by how big the gradient is in both the loss part and the L1 part, whereas here you pay only for the loss part. So it's a cleaner bound than mirror descent. And now, coming back to the question about Duchi and Singer's work: you get two algorithms as special cases. Shai and I gave an algorithm for L1-regularized problems using P norms that we called SMIDAS, stochastic mirror descent made sparse, and John Duchi and Yoram Singer had another algorithm; both were discovered independently, and then we realized there is this generic algorithm that gives their bounds as special cases. And the story doesn't stop there, because from this unified analysis we get new algorithms: instead of L1, think of matrix applications where you're doing trace norm regularization. It's very interesting what the algorithm becomes in the end: you go to the mirror space -- both spaces are now spaces of matrices -- you do the descent step from your loss function, and now you shrink the singular values.
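(A minimal numpy sketch of the shrinkage step in the two cases just described -- componentwise for vectors, on the singular values for matrices. The threshold tau would be the step size times the regularization parameter in the L1- or trace-norm-regularized problem; this is schematic, not the speaker's exact formulation.)

import numpy as np

def shrink(v, tau):
    """Componentwise soft-thresholding: move each entry toward zero by tau,
    and set it exactly to zero if it is already within tau of zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def shrink_singular_values(M, tau):
    """Matrix version: soft-threshold the singular values and reassemble.
    Small singular values are zeroed out, so the result has lower rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt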
So you don't shrink each component of this matrix; you do the SVD, you shrink the singular values, and you put the SVD back together, and that's what [inaudible] this one is, and you go back, and again this doesn't destroy sparsity in the spectrum. So [inaudible]. >>: [inaudible]. >> Ambuj Tewari: The R should -- yes, in the matrix case R should be the Schatten P norm squared, and in the vector world -- actually the property is really that the gradient at any W, its Gth component, the sign of it should match the sign of W_G. So once you take the gradient -- so you have a vector, you apply some map to it, you get another vector. The signs should match in each component. And that does happen for all the P norms and all the -- >>: When you say maintain sparsity, do you mean that the Ws also have this property [inaudible] singular values? >> Ambuj Tewari: Yes. In this case the notion of sparsity is, by definition, different for matrices. So here, if there are five nonzero singular values there, there will be exactly five nonzero singular values here. Right? Unlike the previous case, where the number of nonzero entries was preserved. Okay. So here we get new algorithms which we haven't yet tried. John is actually just beginning to run experiments. And there are some interesting -- actually I had questions for Lin about how we can use it with the -- oh, okay. I'll take the time and have that discussion offline. Anyway, you get new algorithms for these trace norm applications. Again, you have this ingredient R to choose, right? You can choose R to be simply the Frobenius norm of the matrix, and then you get a particular bound. That algorithm is actually not new; it was published last year by Professor Goldfarb at Columbia and his colleagues. But they only proved that it converges. No rates. So we get rates for free, again, from this generic COMID result -- this generic COMID result plus the realization that this is what the algorithm becomes, and that this is actually the algorithm they proposed, the dual shrinkage. And for them these two spaces were not different, because the Frobenius norm means there is no mapping; you just do this again and again. And you actually get a P norm version of this algorithm. So here both the algorithm and the analysis are new. And in trace norm applications you believe a priori that the good predictor has low trace norm. So this shouldn't be much bigger than that, while the operator norm of a matrix is always less than the Frobenius norm. So you should gain something. So we believe this should also work in practice, although we haven't yet tried it. And then there are some interesting questions: at least in this world I need to do an SVD at each step, which is not really a good thing to do, so we're actually looking into how to avoid a full SVD -- maybe we can just update the SVD from the previous step a little bit. But anyway, that's actually the end of the story. Let me just state that strong convexity also arises elsewhere -- so for the PEGASOS algorithm, for those of you who know it, it's an online algorithm, and it was not known whether a single run of PEGASOS gives you a good predictor with high probability. It turns out the notion of strong convexity allows you to answer that question. And we've also recently analyzed general exponential families with L1 regularization.
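A minimal sketch of that matrix step, in the simple Frobenius geometry where no mirror map is needed; the names here are illustrative, not from the talk:

    import numpy as np

    def singular_value_shrink(W, tau):
        # SVD the matrix, soft-threshold the singular values,
        # and put the factors back together.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def comid_step_trace_norm(W, grad_loss, eta, lam):
        # Gradient step on the loss part, then shrink the spectrum,
        # which is the piece contributed by the trace norm regularizer.
        return singular_value_shrink(W - eta * grad_loss, eta * lam)

Only the singular values within eta times lam of zero get zeroed out, so each update directly encourages a low rank iterate, mirroring the componentwise shrinkage in the vector case.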
So that would include L1 regularized logistic regression, L1 regularized sparse covariance estimation, and L1 regularized squared loss, the Lasso, all in a single framework. And it turns out there is a restricted form of strong convexity which gives you guarantees like the Lasso guarantees in general exponential families. And so that's the summary of the talk: we wanted to understand properties of different regularizers, derive fast online algorithms, and also understand the relationship between the probabilistic and adversarial settings, which is something I didn't emphasize in this talk but would be happy to discuss. So that's the end of the talk. I just want to mention that I also work on some other areas: the tradeoff between exploration and exploitation is something I really like to study, and I have a bunch of papers on reinforcement learning, unknown MDPs, and bandit problems. I'm also interested in coming up with new optimization algorithms for large datasets. So I'll end with that. [applause]. >> Chris Burges: More questions? >>: So I'm just wondering, have you thought of using these [inaudible] for, like, log-det [inaudible] for regularization across matrices? >> Ambuj Tewari: Yes. So the log-det -- I mean, we have thought about it, but we don't yet have anything. >>: [inaudible]. >> Ambuj Tewari: Yeah. I mean, the log-det is actually nice because it's actually a [inaudible] function, not just strongly convex, so I -- >>: [inaudible] matrix inverses constantly, right? >> Ambuj Tewari: Sorry? >>: You have to compute matrix inverses constantly if -- >> Ambuj Tewari: Yeah. So there's definitely that issue. But I don't have any concrete answers for that problem; it seems some of these ideas should carry through. And I'm just trying to recall -- [inaudible], a professor at UC Berkeley, I think he has some algorithm with a very similar flavor for the sparse [inaudible] problem. I haven't actually looked at the real connection, but I just felt on reading that paper that there is some connection. >> Chris Burges: Let's thank Ambuj again. [applause]