>> Ofer Dekel: Okay. So let's get started. It's our pleasure today to have
Karthik Sridharan visiting us. He's now from U Penn and he was in TTI before
and he was our intern a couple years ago in the summer so we know him very well
and like him very much. And this is going to be Relax and Randomize, a recipe
for online learning algorithms. So thank you, Karthik.
>> Karthik Sridharan: Thank you. Thanks for the introduction. All right. So
I'm going to basically talk about online learning. What does that mean? We
have basically time going from 1 to T. The learner picks a mixed distribution over
a set of actions, script F. The adversary picks an action, ZT, belonging to some
set script Z. The learner draws his action from the distribution that he
picked, and he suffers the loss L of FT, ZT.
So and what we are interested in is minimizing this notion of regret. So what
does that mean? It means that we are interested in [indiscernible], because we
are playing mixed strategies. So the idea is you have your cumulative loss and
you want it to be as good as the best single action in hindsight.
So if I had known Z1 through Z2 -- sorry. If I had known Z1 through ZT in
advance, then I could have picked the best action, F, and I don't want to have
too much regret with respect to that. And so this is the basic framework that
we're going to be looking at.
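For reference, the regret being described can be written as follows; this is just a standard rendering of the spoken definition, with f_t the learner's draw and z_t the adversary's move at round t:

```latex
\mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t)
```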
So we are interested in how well we can do in this scenario, and can we come up
with some kind of generic tool for dealing with online learning.
So the outline of the talk is this. So first, we're going to look at some
nonconstructive -- by that I mean I'm not going to give you algorithms, but
we're going to look at how well we can do in these online learning games. So
it's coming from Minimax Analysis. I'll describe what that means in a bit.
So basically, we're going to look at the notion of the Minimax value of the
online learning game. That is, what's the best I can do versus the worst or
the [indiscernible] trying to hurt me the most.
And from this, what we're going to do is we're going to look at some notions of
sequential complexities for online learning that I'll be defining. So this
part is, well, it's non-constructive, basically kind of mirrors what happened
in empirical process theory and statistical machine learning.
So a lot of empirical process theory is used for giving complexity tools for
statistical learning theory. And we're going to kind of develop analogous
tools for online learning. Of course, as I mentioned, the first part is going
to be nonconstructive.
So we're also interested in how can we get algorithms for these. So the next
part we're going to look at relaxations and algorithms. I'll define what
relaxation is more later. But the basic idea is we basically want to take
these nonconstructive proofs that we had out here for getting upper bounds on
the value and we want to convert them into algorithms. We want to develop
algorithms from these. And hopefully, efficient algorithms for at least
certain cases.
And I'll also talk about random play-out, what this means and how it can be
used for analysis of things like follow the [indiscernible] leader, how
basically you can look at things like follow the [indiscernible] leader as
coming from a Minimax type analysis, all the way down to when you relax, you
can get this.
And I'll look at some -- I probably won't touch upon many of the examples.
I'll mention them, but I won't have time to go into details. So all of this
kind of comes from Minimax analysis, and all of these algorithms that we develop
again are going to show you that irrespective of what the adversary does, I can
get this kind of bound.
But it doesn't tell you that if the adversary gave you something nicer, you can
do something better. So okay. So in the worst case, for instance, let's say I
can get a square root T type of bound. But I want to also have the assurance
that, okay, maybe I'm okay with square root T. I'm okay with it being four
times square root T. But then [indiscernible] if the adversary
played something nicer, which I was expecting, then I get much better results.
And we look basically or we just touch upon how we can do this using some of
the techniques from here, inculcating them, and we look at localizing and
adapting. I'll again define what these mean. And we look at online learning
with predictable sequences. So when we have some type of model -- so the basic
idea of online learning with predictable sequences is that in online learning,
you deal with an adversary who is worst case. Sometimes he just throws adversarial
[indiscernible] at you.
Online learning with predictable sequences is: you have a predictable sequence.
You have like some kind of model of what you think the world should be. But then
there is an adversary who tries to corrupt your model. And the less he
corrupts your model, the better you want to be able to do. So we want to capture
that.
Okay. So now let's move to the first part. That's sequential complexities for
online learning. So before we go to that, let's see, okay, so we have the
statistical learning where we have this nice story where basically, nature
picks some distribution D on the set of instances Z and you're given
instances drawn i.i.d. from this distribution D, Z1 through ZT, and the goal of
the learner is to pick some F hat from F based on the sample.
And our goal is to have a low expected loss. So the expected loss of what the
learner picks should be close to the best expected loss that could have been
done. And if this can go to zero, we say that it's statistically learnable.
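In symbols, the statistical goal just described is that the excess expected loss of the learner's F hat vanishes with the sample size; this is a standard rendering of what was said, not a formula from the slides:

```latex
\mathbb{E}_{Z \sim D}\big[\ell(\hat f, Z)\big] \;-\; \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim D}\big[\ell(f, Z)\big] \;\longrightarrow\; 0
```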
In online learning again, we have kind of analogous scenario. Scenario here,
of course, is all sequential for time T equal to 1 to T. The learner picks
some action. The adversary simultaneously picks some ZT in Z, and we play the
game. And again, as I mentioned before, the goal is to minimize regret. And
if you can take this regret, the average regret to go to zero, then we say that
the game is online learnable.
Now, let's look at how we give certificates for learnability and learning
rates in general in statistical learning theory. In statistical learning theory,
we have this slew of tools from empirical process theory, and we can
essentially use these to kind of give complexity measures on F that will give
us nice bounds on the learning rate.
And examples of these are the VC dimension, the Rademacher complexity, covering
[indiscernible]. There are a lot of these. And the algorithm here is mostly
generic. You just do empirical risk minimization and for all of these, you can get
these as your upper bound. So that gives you an upper bound on the learning
rate.
However, for the online learning case, at least until fairly recently, the
certificate for learnability has been kind of case by case. So you have a
particular problem, you come up with an algorithm for this problem and you
[indiscernible] bound, and that's the way you kind of show online learnability and
you show the rate.
Of course, for when the problem is convex, you have like tools that are
slightly better in the sense that you can at least for convex scenario, you
have generic tool box to some extent. But in general, our question is can we
come up with complexity measures on the action set F to basically say that
a problem is online learnable. Or can we come up with like a generic tool box.
And to look at this, we look at what's called the value of the game. That is,
so before I get into the definition, I'm going to make, at least for the first
part, I'm going to make a slight change in that instead of Z belonging to Z,
I'm going to take X belonging to X, and instead of using L of F, Z, I just
use the mapping F of X. It's just to kind of mirror what happens in empirical
process theory in terms of complexity.
I just don't want to hold on to the L in terms of notation; it's just to simplify
notation, that's it.
Okay. So what is the value of the game? Well, it's the natural quantity that
you would want to, I mean that you would like to bound, which is basically
what's the best regret I can get in terms of the worst adversary. So if I play
optimally and the adversary plays optimally, then what's the regret that I
suffer? So this is the regret, so the sum from T equals 1 through T of my cumulative
loss minus the best loss.
And we are basically -- so we go: at time step one, we pick the best mixed
strategy. Adversary goes next. He picks the worst, or in his view the best
action that he could have picked. Of course, this is a randomized game, we
draw F1 from Q1 and so on. So this is defined as the value of the game.
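Set out in symbols, the minimax value being defined is the nested optimization below, with q_t the learner's mixed strategy at round t; again, this is a standard rendering of the spoken definition:

```latex
\mathcal{V}_T(\mathcal{F}) \;=\;
\inf_{q_1} \sup_{z_1} \mathbb{E}_{f_1 \sim q_1} \cdots
\inf_{q_T} \sup_{z_T} \mathbb{E}_{f_T \sim q_T}
\left[ \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t) \right]
```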
Now, we can do slight manipulations to this to get it to a form that kind of
doesn't have to explicitly deal with [indiscernible]. And what do I mean by
that? Before I show you what I mean by that, just to make things simpler, when
I have a sequence of this form of some operator, some operator, some operator,
I'm going to use this notation. So T equal to 1 to T of this basically means
that you just write this out in this format.
Is everyone clear about this? It's just because I don't want long things in my
slide.
Okay. So this is just the definition of this operator. Now, [indiscernible]
we can make this mixed and then we can put the expectation here because we have a
[indiscernible] at each term and it's the same thing because it's going to be
attained at a corner. And once we have this, we can do a Minimax swap. Well, you
need some mild assumptions on the set script F and set script X. But by those
assumptions, you can basically swap these two terms and once you do that, you
can push this expectation in here and pull it out. Which basically gives you
this term here.
So it's [indiscernible] over distributions, so at each time step we pick a
conditional distribution and then we look at the inf over FT belonging to
script F of the conditional expected loss.
Okay. So from this, we can move to what I'm going to term Sequential
Rademacher Complexity. So we have this from the previous slide. The first
thing that we can do is, instead of the inf over FT in each single one here, we can
just take -- the inf comes out of the minus, it's a sup -- and we can just go to an
upper bound of this one.
And this already should start looking familiar to those of you who have seen
empirical process theory. So the idea, I mean, you have this maximal
deviation, right. So you have a supremum over a function class of the maximal
deviation. In empirical process theory generally you take [indiscernible]
data, so all of this is just like a single expectation. There's no piece of
[indiscernible] each time. And you're just asking, supremum over F, what's the
speed at which my average converges to the expectation. And this is kind
of the [indiscernible] version of martingale [indiscernible], because these
things are martingale difference sequences here.
Okay.
>>: We can do a little more. XT, so the first expectation and XT for the second
expectation.
>> Karthik Sridharan: Pardon?
>>: The XT on the very right.
>> Karthik Sridharan: This one?
>>: No, more right.
>> Karthik Sridharan: This one? Here?
>>: That one. That's the one that's in the expectation. The last XT. Yeah,
sorry. [indiscernible].
>> Karthik Sridharan: Yeah, I should have used -- yeah, yeah. So the one in
here is the one from here. And the one here is the one from [indiscernible].
Yeah. Sorry.
Okay. So now the next thing you can do is use the [indiscernible] inequality,
pull these expectations out and you can have pairs drawn from this P sub T and
then you get this difference. But now you can essentially include a Rademacher
random variable, by which I mean a random coin flip that takes one or minus one
with equal probability. You can put that in here for free and just multiply this
because both of these are, conditionally, their distributions are the same.
And once you do this, instead of a supremum over distributions, you can always upper
bound it by a supremum over two points, X prime T and XT. And you get this and, of
course, you can divide it into two terms and write it as two times this. So
this is the conditional Rademacher complexity. The sequential Rademacher
complexity. The idea is that you pick -- so, okay, let me describe what this
term is doing here.
So I pick a particular X1, which is kind of the worst X1. Then we flip a coin,
then based on what I see in the coin, I pick the worst X2 and so on. And where
the final term that I want to maximize is this term in here. Now, notice that
basically all of this is generated by coin flips, and we are looking at the
coin flips. Like what are the total number of possibilities? They basically
form a tree, right? Because at the first step, I flip one.
So when I choose X1, I just have one choice. Now when I choose X2, I can
have -- I have two choices. Either my first flip could have been a plus or a
minus. And depending on that, X2 would vary and so on. So basically, it forms
a tree, and you can rewrite this in a kind of -- or at least depends on your
viewpoint, but in a nicer form.
Yes?
>>: In the last bound in your operation, so the very last, if F has a large
offset, the [indiscernible] instead of being 0, 1 becomes 1,000, 1,001.
>> Karthik Sridharan: True. So it could blow up. So here, there is no
centering, and so here there is an automatic centering and then when you go
here, there is no centering. So it turns out that, for instance, like if you
look at, say, a classification or things like regression with, like, standard
losses, in the worst case, the centering doesn't matter. But we'll actually
come back to this kind of in the third part again and we use the centering in a way
that actually can give us more meat: if you knew some predictable
sequence, you could do some centering through the predictable sequence.
That's a good point. There is no centering. So if my F values were in sort
of -- so I'm here, in a sense, implicitly assuming that F is between minus 1
and 1. But if your F was between a thousand and a thousand and one, you're losing
the centering here so you're adding like a huge constant. Yes, that's true.
You had a question? Okay. So as I mentioned, basically, the sup over X, you
can basically look at it as a sup over trees. And this is another form of
defining the Sequential Rademacher Complexity. So here we take a supremum, and
here when I write the bold X, it's a tree. It's a binary tree, by which I
mean each X sub T is a mapping from plus minus one to the T minus one to
script X.
And basically, notice that this would have been essentially the same as the
classical Rademacher Complexity if this weren't a tree and we had this sup over X1
through XT. But here the path actually tells us which XT to go to next. So
pictorially, let me tell you what this is instead of going through the map. So
you have this tree here. This X is like this tree here and you have these
nodes out here. And as an example, let's draw the path epsilon equal to plus
one, minus one, minus one. So that would correspond to this path out here. So
how do we calculate this term in here for a given F? It is the sum from t equal to
one to T of epsilon t F of bold X t of epsilon, right.
But what is X1? For X1 there is no choice. It's just the root. So you have an X1,
but what's the sign that goes with X1? It's epsilon 1. So epsilon 1 comes in
there. I don't remember if I had any -- sorry.
Okay. So the next -- so now we went to the plus one, which means we go to the
left or right. And so that's an X3. So the next one is X3 and its
corresponding sign is minus so we go to the left, that gives us X6 and so on.
So that's the way the inner term is taken. And we take a supremum over
[indiscernible], expectation over the path, and a supremum over the tree. And this
is the Rademacher Complexity. And as we saw in the previous slide, essentially
the result is that the value of the game or of the online learning game is
bounded by two times Rademacher.
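As a compact summary of the definition and the result just stated, the sequential Rademacher complexity over X-valued trees bold-x, and the bound on the value, read roughly as follows (treat this as a sketch of what is on the slides):

```latex
\mathfrak{R}_T(\mathcal{F}) \;=\; \sup_{\mathbf{x}} \; \mathbb{E}_{\epsilon}
\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \, f\big(\mathbf{x}_t(\epsilon)\big) \right],
\qquad
\mathcal{V}_T(\mathcal{F}) \;\le\; 2\, \mathfrak{R}_T(\mathcal{F})
```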
As we'll see, in many scenarios it's actually tight, so what we get here is, in a
sense, the best we can do. And I'd like to mention that properties like
Lipschitz contraction and other properties of Rademacher Complexity also hold
true for the sequential Rademacher Complexity, which kind of allows you to go
through and get rid of the loss or do other nice operations.
Okay. So the next kind of thing that comes to mind when you're talking -- when
we think of empirical process theory is covering numbers. So you can also get
analogous covering number for online learning. So instead of going through
this exact definition, let me show it to you pictorially.
So the idea is that we say that a set V of real valued trees [indiscernible] is
an alpha cover if for all F and all paths there exists a tree such that on
that path it's close. What do we mean by that? Let's evaluate F on all these
X's, and these are the values that you'd get. So I'm just taking the
binary case, for example. So you get these trees.
Now, the idea is that for all F. So let's pick some F, say F2 or F3, and let's
pick a path. So we pick this path here, which is, I guess, minus and plus. So
it's X1, X2 and X5. So what we want is we want to cover this, right? So we
have two real valued trees. In this case, it's enough for binary, and now
notice that basically, this path out here is covered by this path. I mean,
[indiscernible] in this case.
So essentially what we see is the definition of the covering, of what it
means to have a cover in the tree sense. And the idea of covering numbers is
the smallest cover on a particular tree. So I give you a particular tree,
what's the smallest number of such real valued trees such that you can cover the
script F.
And the covering number, without a given tree X, is basically a supremum over X.
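Roughly, in symbols -- this is the sup-norm flavor of the definition just described, stated here only as a sketch -- a set V of real-valued trees is an alpha-cover of F on the tree x if

```latex
\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^T,\ \exists \mathbf{v} \in V:\quad
\big| \mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon)) \big| \le \alpha \quad \text{for all } t \le T,
```

and the covering number is the size of the smallest such V, with the worst case taken over trees:

```latex
\mathcal{N}(\alpha, \mathcal{F}, T) \;=\; \sup_{\mathbf{x}} \; \mathcal{N}(\alpha, \mathcal{F}, \mathbf{x}).
```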
All right. So why is this useful? Well, it's useful because we can
essentially -- okay. So we have this Dudley integral bound in statistical
learning theory, and you can essentially get exactly an analog [indiscernible]
in an online learning world.
This is, if you've seen Dudley integral bound before, this is essentially
exactly the same thing. It's just that now instead of the usual covering
number, we have the tree-based covering number. And the nice thing is, an
analogous result to the statistical learning world holds: whenever F is in
minus 1 -- I mean, whenever F maps to between minus 1 and 1 -- we basically have that
the Rademacher Complexity is bounded by this Dudley integral complexity. Yes?
>>: So what are the [indiscernible]?
>> Karthik Sridharan: These are the functions in your function class. So it's
your actions.
>>: So that might be, the way you did all was --
>> Karthik Sridharan: Yeah, yeah, absolutely. If it's all possible functions
you can possibly learn, right. So yeah, definitely. So it's some subset of
functions F. So, for instance, if you could take like the class of all linear
functions and then you'll get a covering number for the class of all linear
functions, by which we mean that I play a vector W and the adversary plays a vector
X and I suffer W transpose X as the loss. So like a lot of online
convex optimization is basically this.
But here we want to capture in general more, like not just convex scenario, but
arbitrary function classes like [indiscernible] and neural networks and all kinds
of non-convex function classes. And you can do that using this.
>>: So I might be missing something. If I'm restricting myself to zero, one
functions, didn't the example you just showed show that using two trees, you can
cover a whole binary function?
>> Karthik Sridharan: Oh, no, these are not all binary functions.
>>: But [inaudible] therefore these trees --
>> Karthik Sridharan: There's a value on every, and there's a zero, zero.
>>: Oh, it's not just a --
>> Karthik Sridharan: No.
>>: Oh, you have to cover.
>> Karthik Sridharan: Yes.
>>: That means if you take a function of zero, one, and you take
[indiscernible], the total number of [indiscernible] is not the number of steps,
but it's also the width of the tree.
>> Karthik Sridharan: Right, in the worst case, it could be really bad.
>>: [inaudible] growth function and the numeric --
>> Karthik Sridharan: Right, right. It could be really bad. So for instance,
if you take the set of all threshold functions, like just in even one
dimension, if you take the threshold functions, the VC dimension is one or two,
right. But it's a little [indiscernible], I'll come to it in a little bit. But
basically, the growth function is as bad as it gets. It's basically not
learnable.
So because you basically can give examples that force you like -- because you
don't have a resolution, you can make it kind of forced, like a search that just
kind of goes on to smaller and smaller intervals. But you never reach that.
>>: [indiscernible].
>> Karthik Sridharan: Right. Okay. So now we have an analog of covering
numbers. Next is, well, next is the famous one that kind of started all of it,
which is the VC dimension, and we want to get [indiscernible], which has already
been done by Littlestone in '88 and then the agnostic case was by Ben-David,
Pal and Shalev-Schwartz.
So the idea is that you're given, again, an X valued tree X, and we say that an
X valued tree X of depth D is shattered by a function class if for every path
we can realize it. What do we mean by that? So this is a tree, and
basically, for every path that I choose, like minus, plus, minus, I should
be able to find a function, for instance F2, such that F2 of X1 is minus, F2
of X2 is plus, F2 of X5 is minus. And so for every path, I should have a
corresponding function that can actually attain those values.
So if that happens, then we say that F shatters this tree. And I'd like to
point out that this tree is really not the same as a covering tree. These are
two different objects.
Okay. So you can also -- of course, the Littlestone dimension, as I mentioned, is
the largest D such that F shatters some script X valued tree of depth D.
So we can also get similar kind of analogs, combinatorial parameter for the
real valued case. Again, we have this X tree. Now we have to have some notion
of margin, right. So the notion of margin is given by this witness tree. So
we have a real valued tree S, which kind of tells you what point you want the
margin across. And the idea is that if I pick, for instance, this path, plus,
minus, and plus, then I should be able to find a function, say, for instance,
F6 here, such that F6 of X1 minus S1 is larger than alpha over 2. So when I was
going to the right, it has to be larger by alpha over 2. And when I
was going to the left, it has to be smaller. And so on. So basically, we
should be able to not just get these values, but basically be able to move away
from them by [indiscernible].
So this is basically, it's very similar to how you get fat shattering dimension
in the usual case, except it's on a tree.
And how do we use these parameters? I just defined the parameters, but how do
we use them? So you can basically show an analog of the VC lemma, also
known as the Sauer-Shelah lemma. The idea is that if you have a function
class, say a multivalued function class whose fat shattering value is D, then
you can get a bound of this sort.
Then there is also a real valued version of it. So the covering number at scale
alpha is basically upper bounded by something like T over alpha to the
[indiscernible] dimension at scale alpha. So now you can use these
combinatorial parameters in there to get your bound. So you plug them in, you
get the bound on the covering number, put this back in your Dudley integral bound
and you get the bounds you want on the value of the game.
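As a rough statement of the combinatorial bound being invoked (the exact constants are in the paper, so treat this as schematic): the tree covering number at scale alpha is controlled by the sequential fat-shattering dimension at that scale,

```latex
\mathcal{N}(\alpha, \mathcal{F}, T) \;\le\; \left( \frac{2 e T}{\alpha} \right)^{\mathrm{fat}_\alpha(\mathcal{F})}.
```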
So let's just take -- so up to here, of course, I was trying to give you some
complexity measures for online learning. Let's just take a slight detour for a
few seconds. So I just want to mention that, I mean, so, of course, one of
the, like, being a machine learning person, I kind of like my main view of
things like Rademacher Complexity and VC dimension are in terms of learning.
Like basically, they help you give bounds for machine learning algorithms. But
in empirical process community, it's also studied because it gives you -- I
mean, it basically tells you when you have uniform [indiscernible] convergence,
or when, uniformly over the function class, averages converge to
the expectation.
And basically, you can kind of give an analogous result in the -- I'm sorry.
In the world where you have different sequences. So basically, let's define
class F to satisfy uniform universal convergence if for all alpha. I'm sorry.
I didn't complete the definition. So basically what you want is supremum over
F irrespective of what distribution you pick. Supremum over F of average of F
of XT minus the conditional expectation goes to zero.
So you want this to go to zero, almost surely. That's basically what I wanted
to write. And the supremum here is the distributions over the infinite
sequence, X0, X1.
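Written out, as I understand the spoken definition, the requirement is that uniformly over the class the averages track the conditional expectations, no matter what joint distribution generates the sequence:

```latex
\sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^{T} \Big( f(X_t) - \mathbb{E}\big[ f(X_t) \mid X_{1}, \ldots, X_{t-1} \big] \Big) \right| \;\longrightarrow\; 0
\quad \text{almost surely as } T \to \infty.
```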
If you saw basically how we built it from the sequential Rademacher and so on,
one direction is kind of obvious, that if you have finite sequential
complexities, like if you have finite fat shattering in the tree, then this
almost sure convergence holds. That is an easy direction. But it's not too
hard to basically show that this -- that at all scales, fat shattering
dimension of alpha being finite is also a necessary condition for this uniform
universal convergence. So that's all I wanted to say about that. We can get
back to learning.
Okay. So let's look at online supervised learning. The idea is that you have,
I guess the first example is the binary classification example. You're given X
and you're given Y and you basically want to predict Y and you have a function
class F such that it takes F of X and gives you binary labels. And in the
statistical learning world, we have a complete story in that things are
statistically learnable if and only if we have finite VC dimension, and
this is a [indiscernible]. And, well, in the online learning world again, it was
proved by Ben-David, Pal and Shalev-Schwartz that online binary classification
is learnable if and only if you have a finite [indiscernible] dimension, the
tree-based one that I described before.
And you can basically go to the general supervised learning problem, where F
of X is like a real valued thing and you're looking at, say, regression with
absolute loss. But you can also extend it to other losses. But for now, let's
just think of absolute loss.
And for this, again, there is this result by Alon, et al., and Bartlett, et al.,
which basically says that the problem is statistically learnable if and only if at
all scales the fat shattering dimension is finite. And, well, you would want the
analogous result for the online supervised problem, and it turns out that it is true.
So if you have a set of functions that are bounded by one, then the online
supervised problem is learnable if and only if we have a finite fat shattering
dimension in the [indiscernible] case. And moreover, what you can show is that the
value of the game, the Sequential Rademacher Complexity, the Dudley integral
complexity are all within a [indiscernible] factor of each other.
And, in fact, you can show that the Sequential Rademacher Complexity both
upper and lower bounds the value to within a factor of at most two.
So you cannot really do better than what -- than the bound that the Sequential
Rademacher Complexity gives.
Okay. So in terms of where you can apply the results, by that I mean you can get
bounds for these dimensions, as I mentioned, it's nonconstructive. So you can
give nonconstructive bounds -- of course, you can give bounds for online convex
optimization and learning with linear functions. But this we already knew.
But what you can kind of deal with are things that are non-convex like neural
networks, decision trees. You can give generic margin bounds where we have an
arbitrary function class, and lots of other examples. And the main thing here
is that you can deal with non-convex scenarios, but then it's all
nonconstructive.
Well, we started off with question, can we come up with generic toolbox to kind
of tell us how well we can do in online learning, and we said that things are
case by case. So we kind of said okay, we don't want to construct an algorithm,
get a bound, because that becomes case by case. We want a generic tool box.
And we have a generic tool box that tells us how well we do, but -- yes?
>>: You're now comparing to the statistical regression in that one. In the
[indiscernible] the examples come from a distribution. In the case where you lose
the IID part, you're [indiscernible] and you're looking at the worst sequence of
[indiscernible], whatever they are. So you must pay for that.
>> Karthik Sridharan: Right.
>>: Where do you pay?
>> Karthik Sridharan: Where do you pay? So the complexities are different,
right. Now it's the tree complexities. So it could be much higher. But if
you look at all these examples, like neural networks, decision trees, you don't
pay much, because the statistical learning and online learning are actually
equally bad. But then, for instance, just the threshold function, I mean,
threshold is like the simplest example, and we know that it's like very easily
statistically learnable with an efficient algorithm. But in the online learning
world, it's not at all learnable. So you pay because the tree complexity and the --
>>: [inaudible].
>> Karthik Sridharan: No, I said it's not at all learnable.
>>: Oh, okay.
>> Karthik Sridharan: Sorry, yeah. I just said not at all online learnable.
So basically, the tree complexity. You can also say -- the tree complexity is
always lower bounded by the classical complexity. Because you can just take a
tree, and at each level, just make all the nodes equal, then you exactly get
back the classical complexities.
So it basically means that the tree complexities are larger. In some cases they
can be much larger, but in a lot of applications they're actually more or less the
same.
>>: [inaudible] online learning as statistical learning is equal to uniform
convergence.
>> Karthik Sridharan: It's only for supervised learning.
>>: Only for the supervised.
>> Karthik Sridharan: Seen here.
>>: And here is uniform convergence with [indiscernible].
>> Karthik Sridharan: Right, exactly. Yeah. Then the [indiscernible] part,
we got only very recently; it took us quite some time to show that it's also a
necessary condition, that finite fat shattering is necessary for this uniform
universal convergence. Sufficiency is easy to show, but the necessary part
takes -- I mean, I guess it's again easy. It's just that we didn't get it for
a while.
>>: [indiscernible].
>> Karthik Sridharan: So there have been, like, I think like I've seen like a
couple of ones which are -- there's one in limited scenarios, not like a
generic function class. I have seen like one which is about like Lipschitz
function. But the basic thing, like if you look at a lot of empirical process
theory for, like, non-independent -- for dependent data, most of it kind of
assumes things like [indiscernible] and then the idea is to use VC dimension
with things like blocking technique to say that you get [indiscernible] it, but
then essentially the same tools kind of apply.
Yeah, so as far as we know, this is the only one. All right. So, well, we
have all these tools, but all of it is nonconstructive; we don't have algorithms.
We want algorithms that get these bounds. Otherwise, why is it useful? And that's
kind of what we're going to look at next, which is how we get from all these
analysis to actual algorithms.
So let's get back to the root, the Minimax algorithm, where we started, the
value of the game. So what's the definition of the value of the game? If you
remember, it's basically I do best, adversary does best and we kind of do this
for many rounds and we look at what regret is. And this was defined as the
value of the game.
And let's just rewrite this in a slightly different equivalent fashion. You
can basically take a few of these [indiscernible] inside by writing the sum out,
in the sense that you can do this. So the competitor is always going to --
>>: [inaudible].
>> Karthik Sridharan: Oh, yeah. So yeah, I mean, you can basically -- the
competitor always [indiscernible] but these losses kind of come out a little
bit. And let's just kind of rewrite this definition recursively by looking at
this form.
So I think you have this here for absolute loss. So basically, you can
rewrite the value in a recursive form which is inf over Q, sup over Z, and then
you have expectation of F drawn from Q of the expected loss, plus this term, which
goes all the way up to capital -- I mean, ZT, Z. And the starting condition
is, of course, when you're given everything, it has to be the competitor, that
is this term.
And once you have this, it's easy enough to see that the value of the game is
basically this thing given nothing. So all the way when you're given nothing,
it's at value of the game. When you're given everything, it's just minus the
competitor.
Okay. So the Minimax strategy again is simple. You play for T rounds. You
get F1 through FT, you get Z1 through ZT, and now for the next step what you do
is you take your Z1 through ZT, plug it into this, and then you take the argmin
over the mixed strategies of the sup over Z of this thing. This is just simple
rewriting of stuff.
And so the exact Minimax algorithm for particular cases have been done. One by
Nicolai and Gabor for absolute loss, which nicely kind of takes the value of
the game and shows that you can go to the classical Rademacher Complexity and
so on. And there's also, for another case, by Abernethy, Warmuth and I don't
remember the third person. But in general, the problem is that this Minimax
strategy is not computationally feasible. So we cannot really solve this
exactly.
So, well, the idea is to basically replace this by some other function,
which -- let's call it relaxation. Basically, good upper bounds. And, well,
relaxation, we say, is admissible if this condition is satisfied. So I add my
current loss. I take info over distributions, sup over Z. My expected current
loss plus the relaxation with the Z included should be upper bounded by
relaxation of only Z1 through ZT without the Z.
And the initial condition is that the relaxation is larger than minus the
competitor. Actually, it's not initial. It's final. But we are kind of going
from inside out, so it's initial --.
>>: You have kind of [indiscernible] and instead of obeying exactly the
recursive formula, it's just the upper bound.
>> Karthik Sridharan: Yes.
>>: And [inaudible].
>> Karthik Sridharan: It is very similar. And it resembles something else
which I'll quiz you on in just two seconds. Okay. So basically, again, just
like how we did for value, you can basically show that for any admissible
relaxation, the value is upper bounded by relaxation given nothing. And this
is just dynamic programming.
The nice part is that this -- I mean, like if you look at this in here, the Q
kind of comes in only in the first part. It doesn't come in the second part.
This is kind of special to external [indiscernible]. I guess even for
[indiscernible], but it doesn't work all the time. Like if you go to games
that are not regret form, it doesn't work. And if you go to partial
information games, then there is coupling. But for this, it has a nice form.
Okay. And the strategy again is kind of similar to what we did to the value.
You basically find the Q that minimizes this.
And again, the expected regret of the strategy is bounded by relaxation.
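To make the recipe concrete, here is a minimal Python sketch of one step of the meta-algorithm for finite action and move sets. It is an illustration only: the function and parameter names are mine, the relaxation `rel` is assumed to be supplied by the user, and the inner min-max is solved as a small linear program (assuming scipy is available).

```python
import numpy as np
from scipy.optimize import linprog

def relaxation_step(actions, moves, loss, rel, past):
    """One round of the relaxation meta-algorithm (illustrative sketch):
    return  q_t = argmin_q  max_z [ E_{f~q} loss(f, z) + rel(past + [z]) ],
    where 'rel' is an assumed, user-supplied admissible relaxation."""
    L = np.array([[loss(f, z) for z in moves] for f in actions])  # loss matrix
    n, m = L.shape
    # LP variables (q_1, ..., q_n, v); minimize v subject to
    #   sum_f q_f * loss(f, z_j) + rel(past + [z_j]) <= v   for every move z_j,
    #   q lying in the probability simplex.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    A_ub = np.hstack([L.T, -np.ones((m, 1))])
    b_ub = np.array([-rel(past + [z]) for z in moves])
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, None)] * n + [(None, None)])
    q = np.clip(res.x[:n], 0.0, None)
    return q / q.sum()
```

The admissibility condition from the slide is exactly what guarantees that playing such a q at every round keeps the cumulative loss plus the relaxation from growing, so the final regret is bounded by the relaxation evaluated on the empty prefix.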
Okay. So this is very -- I mean, like this is very similar to something that's
even used in online learning world, and Nicola knows the answer. So it's
basically exactly potential method. And initially, I mean, so we kind of
started with value and we wanted to look at how -- so we didn't actually write
it in terms of relaxation. We wrote it in terms of what we had in terms of the
sequential Rademacher complexity. We wrote it down, we said we had that and then
we said let's make this more general. And once we made it more general, we
just realized that this is potential method. Nothing different.
Okay. So the first kind of result is that the conditional sequential
Rademacher relaxation, forget that long name. Basically, what it is is kind of
a conditional version of the Sequential Rademacher Complexity that we had
before. What we had before, though, was no Z1 through ZT given. It's all the
way to the end. And we had the same sup over Z and then this expectation and
all that. Now, the only difference is that when we're given Z1 through ZT, you
just subtract the sum of the losses before.
And you get this. And what you can -- I mean, basically what happens is you
can look at the proof of how we showed the value is bounded by the Sequential
Rademacher Complexity. And exactly the same proof shows you that the conditional
sequential relaxation is admissible and that you can -- and that's basically how the
[indiscernible] value is upper bounded by two times Rademacher. But now you
have an actual strategy that comes from what we had in the previous slide. So
the strategy given by this corresponding -- that corresponds to this
relaxation.
Now, the observation that we make is that most algorithms found in the online
learning world actually come out from relaxations that are upper bounds on the
conditional sequential Rademacher relaxation. And so the basic idea is that
you want a general recipe. And notice that here we have the sup over Z, right.
And the Z is indexing over the future. It indexes from T plus 1 through capital T.
And in a sense what you're saying is, I want to have a look at what my current
losses are, and I want to look at how my adversary can hurt me, and try to minimize
this.
That's not just going to happen for this round. It's going to happen for the
future rounds. And the sup over Z and the sum over T plus 1 through T is
basically kind of discounting for the future. And the idea in all of this is
to get rid of the future trees Z, the Z out here, by passing to as [indiscernible]
an upper bound on the sequential Rademacher as possible.
And sometimes what kind of happens in all of this is you can go through all the
results that I give in a nonconstructive way, and all of them kind of go in a
way that is inside-out. Which basically means that you can take each of these
and you can convert them into an algorithm. You can convert them into a
relaxation and the relaxation gives you an algorithm.
And the idea is that you keep going to upper bounds, to larger and larger upper
bounds until you have a method that is actually efficient or has some property
that you desire.
Okay. So a kind of easy example, like the first try, would be to look at when you
have a finite set F, and it's easy enough to see that if you look at, for
instance, the proof of how you show [indiscernible] finite lemma, you basically
show an inequality like this. Of course, this is for the conditional version,
because it has the sum over this one, the sum over previous losses. If you don't
have the previous conditioning, you basically would have one over lambda log size of
F plus lambda times capital T. And that's how you prove, like, Massart's finite
lemma, saying that, uniform over script F, what's my -- uniform over script F,
what's the -- like how fast do [indiscernible] difference sequences converge.
And you basically go through the proof, you write this down and you can
basically put this inf over lambda in here for each step. It just kind of, you can
derive this relaxation out by just plugging in the first step of Massart's finite
lemma, and this automatically leads to a parameter-free version of the
multiplicative weights algorithm.
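As a concrete instance, here is a small Python sketch, my own illustration rather than code from the talk, of the parameter-free step this kind of relaxation suggests for a finite class with bounded losses: at each round, tune lambda against the observed cumulative losses and the remaining horizon, then play the corresponding exponential weights (the constant in front of the horizon term is schematic).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def parameter_free_weights(cum_loss, rounds_left):
    """Pick lambda by minimizing a relaxation of the assumed form
       (1/lambda) * log sum_f exp(-lambda * cum_loss[f]) + c * lambda * rounds_left
    and return the resulting exponential-weights distribution over the finite class."""
    def relaxation(lam):
        if lam <= 0:
            return np.inf
        shifted = -lam * cum_loss
        m = shifted.max()
        log_sum_exp = m + np.log(np.exp(shifted - m).sum())   # stable log-sum-exp
        return log_sum_exp / lam + 2.0 * lam * rounds_left    # '2.0' is a schematic constant
    lam_star = minimize_scalar(relaxation, bounds=(1e-6, 10.0), method="bounded").x
    weights = np.exp(-lam_star * (cum_loss - cum_loss.min()))
    return weights / weights.sum()

# usage sketch: q = parameter_free_weights(np.array([3.0, 5.0, 4.5]), rounds_left=97)
```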
So in the multiplicative weights algorithm, you have this step size eta. You either
set the step size based on knowing capital T, or you set it as one over square root
of T at each time step. Here, basically, what it says is: do this inf over
lambda here. So this is a [indiscernible] problem. Find out the lambda that
minimizes this. This only depends on my previous losses. So at the time
step where you want to calculate QT plus 1, look at my sum over previous losses,
take this term and try to find out what the best lambda is and plug that lambda
into your --
>>: How does the [indiscernible] come up?
>> Karthik Sridharan: This is an upper bound --
>>: The [indiscernible], where did you get it from?
>> Karthik Sridharan: Log what?
>>: Under sequence -- [indiscernible] complexity.
>> Karthik Sridharan: Yeah, basically it's just taken from the -- okay. So
the idea is you take like the, for instance, how you prove bound on the massage
finite lemma and take the step that you use there, and basically the proof is
really proof for showing the relaxation is admissible. And that's about it.
Yeah, sure, I mean, there is one like step which you cannot derive. I mean,
there has to be one step which you cannot derive because it's an upper bound,
right. So it's not the exact thing. So unless it's -- I mean, as long as it's
not equality, there has to be some kind of a, like some kind of an observation
that you need to use.
>>: [inaudible] random variable.
>> Karthik Sridharan: So here, I'm assuming they are bounded. So that's why
you have this T minus -- here it's bounded by --
>>: [inaudible].
>> Karthik Sridharan: Okay. So you can actually reason out why this is the
[indiscernible]. So if you had -- oh, I guess I didn't. Without any other
structural assumptions, if you didn't know any other kind of -- if you
didn't -- if F were just arbitrary, then you can show that this is a
[indiscernible] relaxation, because at least asymptotically, because of -- I don't
remember. It's Cramer something theorem. Chernoff, maybe. Cramer-Chernoff,
something like that. Which kind of shows that asymptotically the log partition
function is like the right way to get your -- I don't remember the exact
statement. But basically, you can show like your -- I mean, it's the asymptotic
version of [indiscernible] inequality. And it shows that the log partition is the
right function.
And so without any other extra assumptions, you basically, this relaxation kind
of comes out. You use that because that's the [indiscernible]. But I am using
some -- I mean, I'm using like Massart's finite lemma as the first step to
getting to this.
So the basic idea is, I mean, I guess I can't pull it out of the thin air, but
basic idea is that if I have a proof in the empirical process theory world and
I take that proof and I can [indiscernible] algorithm. And if I have an
algorithm in the online learning world, which kind of fits in this framework,
then I take this and I want to prove something in the empirical process theory
world. And kind of get this thing that I can use results from here into there.
And so, for instance, you can -- so for all of this, we have like the
algorithm, but for things like neural networks and for -- okay, for generic
binary class, online binary classification, there is the Ben-David, Pal and
Shalev-Schwartz algorithm. But you can show that if you can kind of only
evaluate the Littlestone dimension of some function class, like if you can get an
upper bound on that, then you kind of get algorithms -- I mean, you can get a
much more efficient version of it.
So for all the bounds that we got where there were no bounds before, you get an
algorithm out of this. Okay. Other examples are you can kind of derive -- you
can recover gradient descent and mirror descent and their variants. And in terms of
[indiscernible] there is like this universality of mirror descent, which kind
of comes from this result in [indiscernible] based theory by [indiscernible].
And a nice thing is that you can kind of, in a sense you can see that the
regularization term that you get there is, in a sense, almost just derived from
this. Because you can see that it is just like if you want to [indiscernible]
rate, then like the only option you're left with is like to go -- you go to the
tightest bound on the Rademacher in terms of getting rid of this future tree
that kind of gives you the exact relaxation that corresponds to that
regularizer. So in a sense, you derive it.
Okay. And you can also get relaxations based on sequential complexity measures
that I mentioned before and get algorithms out of it. So you get constructive
way of getting these upper bounds. Of course, even when the function class F is
[indiscernible], even though you get these constructive methods, they may not
be efficient at all, or they may not be [indiscernible] even in the sense of being
computationally feasible.
Okay. To get -- okay. So if you want to get more efficient algorithms, you
basically find like nicer relaxations that are easier to compute.
Okay. So now let's look at the idea of random play-out. So let's go back to the
Sequential Rademacher Complexity. So at round T, basically, the conditional
Sequential Rademacher Complexity relaxation was this guy here, and we had like
some horrible terms in this square bracket. But the basic idea is that, okay,
so what we want to do is imagine that I had an oracle that could give me this
Z, that could compute this sup over Z if I gave it all the terms that are
required in there.
Then the idea is to basically do a random play-out, and do online learning based
on that. So given any Z1 through ZT minus 1, let's say that I can compute this
arg max. So I can find the particular tree that maximizes this term in here.
Now, what we can do is, okay, so the basic idea in all of this is going to be
that notice that sup basically can come out of all the way to here. And now
expectation is linear, right. I mean, linearity of expectation we can use.
So the basic idea is that the Q is a distribution that we can pick here.
So
we're going to try to mirror Q to reflect what's happening here. That's the
idea in both this slide and the next one. So we pick Q to mirror what happens
here. And then essentially, we get, of course, we can calculate the sup over Z
for each term, and then we can put that out and you basically get a single
expectation due to linear expectation. You pull that out and effectively what
you get is that you can get a randomized strategy.
The idea is that at round T, you draw epsilon T plus one through epsilon capital T.
That is the coin flips for the future, and then you basically calculate this -- I
mean, we are assuming that we can calculate this arg max. So if you can
calculate this worst case tree, you basically find the Q that minimizes this.
So the idea here is that you can solve this efficiently. And, of course, I'm
assuming we have an oracle for this, which is not a realistic assumption, but
in the next slide, we'll see that in a lot of cases you can actually get rid of
it. And this gives you a randomized algorithm that essentially suffers the
same regret, expected regret bound as the Sequential Rademacher Complexity.
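Schematically, one round of this randomized strategy might look like the sketch below. Everything here is assumed rather than taken from the talk: `oracle` stands for the routine that returns the worst-case future given the drawn coin flips, and `best_response` for whatever solver computes the resulting mixed strategy.

```python
import numpy as np

def random_playout_round(past_moves, oracle, best_response, horizon, rng):
    """One round of random play-out (schematic): draw the future coin flips,
    ask the assumed oracle for the worst-case future, then solve the
    now-deterministic problem for the learner's mixed strategy."""
    t = len(past_moves) + 1
    eps_future = rng.choice([-1, 1], size=max(horizon - t, 0))   # flips for rounds t+1..T
    worst_future = oracle(past_moves, eps_future)                # assumed worst-case tree/path
    # with the coin flips and the future fixed, choosing the learner's mixed
    # strategy is an ordinary optimization; 'best_response' is assumed to solve it
    return best_response(past_moves, eps_future, worst_future)
```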
Okay. So the second idea is, again, you start with the same thing, but instead,
imagine that -- so this kind of stems from the observation that lots of times,
basically, in the worst case adversary scenario, the lower bounds are actually
obtained by forming the worst case [indiscernible] distribution. And this
happens a lot of times in online learning.
So the worst case statistical learning is not too far away from the worst case
online learning. And you can actually come up with these distributions. And
so you want to kind of use this knowledge in some sense.
And so imagine that I could actually do this; then I would basically replace
the sup over Z with this distribution D. And now, again, instead of F
from Q, I can replace -- I mean in some sense what I do with Q is I kind of
replicate this process, and then by linearity of expectation, it comes out -- and
due to convexity, it comes outside the sup and stuff. And basically what you
have is something similar to the last slide, but to kind of make it simpler,
let's look at the linear case. Now, there is a key difference here.
So on the previous slide, the Z, the trees that we found on each round
were different. Here, I'm putting a single D for all rounds. So this D is not
superscripted or subscripted by small T, which means that I want to find a
single distribution that's really bad.
So for this to happen -- I told you that it stems from the
observation that in lots of cases, the online and the
statistical complexity are the same, and the statistical and Sequential
Rademacher Complexity are the same. So basically, that means that there is
some kind of underlying thing that makes this happen.
So one -- yes?
>>: So [inaudible] loss, which is the same. But this here, you are not
going -- this is not an upper bound. You're restricting the adversary by
restricting [indiscernible].
>> Karthik Sridharan: Right, but you cannot do it for free, right. I mean, as
long as I can find a distribution that's equal. But in general, I'm going to
pay a constant. You'll see that I'll pay this constant C here. It will be
that I cannot find a distribution that's as bad in terms of without the
constant. But if I pay like an extra -- so, for instance, the worst case
adversarial might be like a factor three worse in the worst case. But that's
fine for me. I'm okay with constant factors.
But I need like, I mean, it's not just that these two are off by a constant.
You need a step-wise version, which is like the assumption that I'll be
introducing here. In the linear case, I'll explain what the assumption is.
For instance, the online [indiscernible] are basically static experts. You can
again show that basically, it falls under the same framework.
So what you can do is you can get like a randomized version whenever you have a
randomized algorithm, whenever you have -- you get an efficient randomized
algorithm whenever you have an online transductive learning problem with convex
losses. When it is non-convex, you get a mixed strategy. But there's no
guarantee that it will be computationally, like, efficient.
But when it's convex, you can get computationally, like something that just
solves a convex problem, what it does is it flips coins. It gets some -- I
mean, in a sense, it will start resembling follow the [indiscernible] leader.
And for the linear case, you can actually get the follow the [indiscernible]
leader. But the basic idea is you flip coins and instead of using this -- the
idea that these relaxations are -- so notice that basically this was like a
relaxation, right? You just took Rademacher as an example, but this proof kind
of goes through whenever you have a relaxation of the form: expectation of some
function of X1 through XT. Now this expectation may not be easy to compute.
But the idea is that you mirror this expectation: if you knew what
the draws were or what the distribution should be, you mirror what this
distribution is, you draw the same random variables. The expectation itself may
not be easy to optimize. But when you're given everything, lots of -- I mean,
oftentimes it becomes easy enough to optimize.
And then you use that fact to kind of get a randomized algorithm.
>>: So just like [indiscernible] technical amateur who is going to do technical
stuff.
>>: You're right.
>> Karthik Sridharan: Okay. So for linear loss, you can basically get like a
simple condition that basically says that sup over Z, this perturbation over Z
can be replaced by a draw from some distribution D that we know. And so the
idea would be that you draw -- so in terms of algorithm, what it's doing is it
draws the future from this distribution D and also the Rademacher variable.
And then it solves the simple problem where there's no expectation here. It's
just this exact problem.
And it turns out that -- and, of course, the bound that it enjoys is the
classical Rademacher Complexity up to constant factor C and from this you can
easily derive new versions of follow the perturbed leader for L1/L-infinity. If
you've seen the follow the perturbed leader proof, for instance, for
L1/L-infinity in this case, it's kind of, I mean, in some sense it's
magical, because this [indiscernible] distribution is used in a very, like in a
crucial way to actually get this proof to move through.
So here, first of all, we get this to work with Gaussian distribution, and the
proof is kind of more like, it kind of follows this general framework. And it
gives you a more -- yeah?
>>: [inaudible].
>> Karthik Sridharan: Yeah, L1/L-infinity is Gaussian. For L2/L2, we use
uniform sampling on the unit hypersphere. And the nice thing about L2/L2 is that
we don't know of a version, at least up to now, where you get a dimension-free
result for L2/L2. And so this gives you like a dimension-free follow the
perturbed leader for L2/L2.
The idea, okay, so to kind of give you a view of what this follow the perturbed
leader is, is you minimize the loss, like the sum up to time T minus 1, plus an
extra -- so it's like a linear term with the sum of X1 through XT minus one,
plus some extra perturbation. And this perturbation, in a sense, you can view
it as regularization. You can view it as inculcating. There are lots of ways
of viewing what it does.
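Concretely, a follow-the-perturbed-leader step of the kind being described might look like the following minimal sketch for linear losses over an L2 ball; the Gaussian perturbation and the square-root scaling of the noise are illustrative choices (the talk uses different perturbation distributions for different geometries), and the names are mine.

```python
import numpy as np

def ftpl_step(past_xs, dim, radius, rounds_left, rng):
    """Follow the perturbed leader for linear losses <f, x> with f constrained to
    an L2 ball of the given radius: minimize the perturbed cumulative loss."""
    cum_x = np.sum(past_xs, axis=0) if past_xs else np.zeros(dim)
    noise = rng.standard_normal(dim) * np.sqrt(max(rounds_left, 1))  # schematic scaling
    direction = cum_x + noise
    norm = np.linalg.norm(direction)
    # the minimizer of <f, direction> over the ball points opposite to 'direction'
    return -radius * direction / norm if norm > 0 else np.zeros(dim)

# usage sketch:
# rng = np.random.default_rng(0)
# f_t = ftpl_step(past_xs=[np.ones(5)], dim=5, radius=1.0, rounds_left=99, rng=rng)
```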
>>: [indiscernible] F is constrained to the one --
>> Karthik Sridharan: Yeah, the first is F and the second is X. And we
suspect that basically, I mean, we haven't tried too hard, but we suspect that
basically, you can derive follow the perturbed leader with the optimal rates for
any LP class, for like any small LP. But, I mean, we haven't worked that out
yet.
>>: [inaudible].
>> Karthik Sridharan: Yeah, LP, LQ.
>>: LQ is the Holder conjugate? [inaudible].
>> Karthik Sridharan: Yeah, the non-dual I don't know. But for dual, I think
it can go through. Okay. So this is random play-out. Now let me, like, kind of
give you a list of other stuff that I didn't have much time to cover.
So we kind of get parameter-free variants for most of the commonly used online
learning algorithms. Basically we just went through [indiscernible] book, tried to
look for all the online learning algorithms, see if it comes out of the
framework. And most of them did.
And we get some new online learning algorithms for the binary classification,
online binary classification problem, and in some sense, at least -- okay. So
if you have a particular oracle, you can show that given that oracle, our
algorithm is much more efficient compared to what was known before.
And we derive algorithms for the nonconstructive bounds that I showed earlier.
But I'm not claiming that they're efficient. But we do get new efficient
algorithm for online transductive learning with convex losses, and I guess the
more interesting one is for online matrix completion problem with trace norm,
we get a new algorithm -- I mean, we actually get two variants, both of which
are efficient. And one of the variants only needs to calculate like spectral
norms at each time step.
So in terms of worst case complexity, it's as bad as the other variant, or as
good as the other variant. But in reality, lots of times the spectral norm is
very easy -- in practice it gives you really fast convergence. So at least on
small scale experiments, we got some pretty nice results.
>>: [inaudible].
>> Karthik Sridharan: Yeah. At least that's what he told me. Okay. So the
other thing is that this bound, we don't need some of the assumptions that the
other works need. Like there's also the Shy and Hazan and [indiscernible] so
they have this optimal algorithm for this based on a particular decomposition.
Theirs also gives an efficient version with optimal bounds. But then they
also need like these boundedness assumptions and so on. We can basically get the
result that gives you the same guarantee as the Rademacher Complexity with no
other assumptions.
So the final part which I'm going to kind of like blaze through is adaptive
online learning algorithms and benign adversary. So the first thing I mean,
all of this, even the slides I made, I made at a high level so I'm not going to
give too much detail.
So all of what we did in the previous part were I have this worst case
adversary and I want to get the optimal guarantee or some nice guarantee
against this worst case adversary. But what if, I mean, what if you had like
some notion in mind saying that I think the adversary behaves like this, but
maybe he doesn't. But if he does behave like this, then I want to get a nice
bound.
And you want to kind of capture this notion. You kind of want to adapt to when
the adversary is nicer, you want to actually enjoy better regret bounds. But
you don't want to lose too much when he's not.
So we want to basically make use of, take advantage of the sub-optimality of
the observed sequence so that the idea is that -- so let me define what this
set is. You have like some Z1 through ZT given before.
So the idea is you want to kind of look at where the competitor is going to be.
So if I'm given Z1 through ZT for any like any kind of -- any further sequence
into the future, I look at the heuristic class of only items for the future,
because that's where my competitor comes through. And you can get this kind of
decomposition of regret bound.
The basic idea is you choose a partition on the fly using this thing called the
blocking technique, and then you use any relaxation algorithm that we
introduced before as a meta-algorithm, but you run it only on a localized set.
And, well, that's basically all there is to it.
Now I'll tell you what you get out of this. The first thing is something that
people have been looking for for a long time, a local Sequential Rademacher
Complexity: you can derive all of these local complexity measures for online
learning, which give fast rates.
And you can get adaptive algorithms that automatically adapt to the data norm.
So, for instance, one example would be that you adapt automatically: if the
adversary plays strongly convex functions, you get 1 over T, but if he plays
linear functions, you get 1 over square root T, and you automatically adapt to
it. And you can also adapt automatically if the adversary plays exp-concave
functions and so on.
>>:
[inaudible].
>> Karthik Sridharan: Yeah, this was known already. But the main thing is
that, for instance, you can also adapt in the same algorithm to concave losses.
It doesn't need to know anything beforehand. I mean, of course, there is a
trade-off between how computationally efficient you want it to be versus how
much you want it to adapt.
But if you [indiscernible] computational efficiency, you can adapt to almost
anything that happens. So you can kind of get the best thing that you would
expect.
And you also get an adaptive experts algorithm that takes advantage of when the
adversary is benign or when some experts are really good, and so on.
Okay. Another view of all of this is that you can have a constrained
adversary, where the adversary at each time step picks a mixed distribution,
but from a constrained set. So P sub t is a constrained set that depends on
what happened previously, Z1 through Zt minus 1.
So before, the sup was over the set of all distributions. Now we are saying
that, given the observed sequence so far, he's constrained to act in a
particular way.
And you can again use all the non-constructive analysis that we did before,
like the Rademacher analysis. You can get some results, but I don't want to
put them on the slide, because they are notationally complex.
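To make the change concrete, and assuming the same notation as the minimax
value defined at the start of the talk, the constrained version should look
roughly like the following, with the only difference being that each sup now
runs over the constrained set $\mathcal{P}_t(z_1,\dots,z_{t-1})$ rather than
over all distributions:

$$
\mathcal{V}_T(\mathcal{F}) \;=\; \sup_{p_1 \in \mathcal{P}_1} \inf_{q_1 \in \Delta(\mathcal{F})} \mathop{\mathbb{E}}_{f_1 \sim q_1,\, z_1 \sim p_1} \;\cdots\; \sup_{p_T \in \mathcal{P}_T(z_{1:T-1})} \inf_{q_T \in \Delta(\mathcal{F})} \mathop{\mathbb{E}}_{f_T \sim q_T,\, z_T \sim p_T} \left[ \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t) \right]
$$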
Okay. But let me give you the linear example. So let's say that you have some
function, M sub t of X1 through Xt minus 1. It's an arbitrary function. The
idea is that I want to do linear learning, but I have a model of what Xt I'm
going to receive. So in some sense I have a prediction of what I expect the
adversary to do.
So in a sense, I have an expectation that the adversary is most likely going to
play this vector in the next round. Of course, he may not play that, but the
idea is that if he did play that, or if he doesn't play too far away from
that -- he plays only like [indiscernible] far away from that, he's constrained
to do that -- then you can show that the value is bounded by this.
And note that this can be any arbitrary function. It need not be of a
particular form. Examples of this are, of course: one is Mt equals Xt minus 1,
that is, he's constrained to not play too far away from the previous round.
Another is he's constrained to not play too far away from the average of the
previous rounds. And the average of the previous rounds is basically what
variance bounds are all about.
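The bound being pointed to on the slide should be, up to constants, of the
following form -- I'm writing it only to indicate its shape, so treat the
constants and the exact choice of norm as approximate:

$$
\mathcal{V}_T \;\lesssim\; \sqrt{\sum_{t=1}^{T} \big\| x_t - M_t(x_1, \dots, x_{t-1}) \big\|_*^2 }
$$

So the regret is controlled by how far the adversary's actual plays end up from
the predicted sequence, rather than by T directly.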
And so we had this result which said that, okay, you can do it with an
arbitrary Mt, and basically it's going to point to this notion of predictable
sequences. The idea is that in online learning with predictable sequences, you
have this predictable sequence -- I'm talking about linear losses. So you have
a prediction of what Xt should be, and the adversary adds some adversarial
noise. If the noise is too huge, then it's as bad as worst-case online
learning. But you want to do something better if things are better.
So there was this nice paper recently at this [indiscernible] which basically
proposed an algorithm for the case when Mt is equal to Xt minus 1. What we had
was a nonconstructive method, and we were kind of hoping that we could take the
idea from this paper and see if we can do it with any predictable sequence.
And indeed, you can do that. You can also reduce some of the extra steps that
they make. And you basically get a variant of mirror descent. So this is the
usual mirror descent, where you just kind of follow the gradient.
And if you had Euclidean spaces, it would just be the usual gradient descent.
And now the extra step is that you have a trade-off between how far you want to
be from Gt plus 1, which is the usual online learning step, versus how well you
want to optimize what you think should be the next point in the sequence.
And by choosing [indiscernible] -- I'm not going to mention how -- you can
basically choose it such that you get a good bound that looks like this. So if
your predictable sequence is good, you get a good bound. If the predictable
sequence wasn't good, if it was completely arbitrary, then, of course, you pay
like a factor of four. So you might pay like 16 times square root D, but you
still get a square root [indiscernible]. So you pay this extra factor. But
when things are good, you do better.
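To make the mirror descent variant concrete, here is a minimal Euclidean sketch
in Python. The l2-ball constraint, the fixed step size eta, and the specific
two-step update below are my assumptions -- they follow the standard
"optimistic" gradient descent pattern for linear losses with a predictable
sequence, not necessarily the exact update on the slide.

import numpy as np

def project_l2_ball(w, radius=1.0):
    # Euclidean projection onto the l2 ball of the given radius.
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def optimistic_gradient_descent(gradients, predictions, eta=0.1, dim=2):
    # Sketch of mirror descent with a predictable sequence (Euclidean case).
    # gradients:   the linear losses x_t, revealed after playing at round t.
    # predictions: the guesses M_t for x_t, available before playing.
    g = np.zeros(dim)  # secondary iterate, updated with the true gradient
    plays = []
    for x_t, m_t in zip(gradients, predictions):
        f_t = project_l2_ball(g - eta * m_t)  # lean toward the predicted gradient
        plays.append(f_t)
        g = project_l2_ball(g - eta * x_t)    # standard step on the observed x_t
    return plays

# Toy usage: a slowly drifting adversary, predicted by M_t = x_{t-1}.
xs = [np.array([np.cos(0.01 * t), np.sin(0.01 * t)]) for t in range(100)]
ms = [np.zeros(2)] + xs[:-1]
print(optimistic_gradient_descent(xs, ms, eta=0.1)[-1])

When the predictions Mt track the actual Xt closely, the first step barely
moves away from where the true gradient would have taken you, which is exactly
the "do better when things are better" behavior described above.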
And without giving much more information, you can basically do all of this in
the bandit setting, by which I mean that you don't actually get to see the Xt's
themselves. You only get to see, like, the dot product with the F that you
chose.
So here the way it's written is in terms of self-concordant functions, but you
can also do it with the entropy function, which gives you [indiscernible] on
the bandit.
And basically what you get is this term here; this is like an extra term,
because you don't know the Mt's. You have to estimate M bar. And the result
is given for any arbitrary estimate of M bar. If I can get some kind of
estimate of M bar, I can plug that in, and it depends on how bad this estimate
is in this term.
And, for instance, Hazan et al., in '09, had this bandit against benign
adversaries, which basically gives a variance bound, and you can essentially
see that what they're doing is trying to get this term to be small. And they
use this trick of [indiscernible] sampling. This term need not always be
small. So if Mt was equal to Xt minus 1, then it's not going to be small. If
it's the average, or some notion of average, you can make it small.
Okay. So the next thing -- I'll be very fast -- is: what if you didn't know
the predictable sequence, and you want to learn the predictable sequence? So
can we learn a good model for M1 through MT? So what do we mean by this? We
have a set of models, capital Pi.
For each pi in Pi, we get a predictable sequence. So these M sub t's are
actually functions of the past. Each model gives you a prediction. So, for
instance, think of investing in the stock market. We invest in the stock
market and we basically gain based on what the price Xt is on each day, and so
on. The usual experts setting.
Now, imagine that you also have these hedge funds that also give you a
prediction about each of the different companies' share values, and I want to
somehow use that. I want to use the fact that if one of these guys is doing
well, then I should be able to get away with much smaller loss.
So can we do as well as knowing the best pi in Pi in hindsight? So there is
some model that is good, but we don't know which one beforehand, and we want to
be [indiscernible] as good as that. It turns out you can actually do that:
basically, what's happening is we're running an experts algorithm with this as
the loss, which is the norm of Mt minus Xt squared. And what you can show is
that you get a bound like this. Again, the parameter here can be chosen using
the doubling trick to get the optimal bound. But now you have an
[indiscernible] over pi, like with respect to the best model. The only extra
thing that you pay is this log cardinality of the number of strategies.
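If it helps to make the experts-over-models idea concrete, here is a minimal
sketch in Python. The two toy models, the fixed learning rate eta, and the
choice of combining predictions by a weighted average are my assumptions; the
talk only specifies the squared-error loss on the predictions and that the
parameter would really be set by the doubling trick.

import numpy as np

def learn_predictable_sequence(models, xs, eta=0.5):
    # Hedge / multiplicative-weights over a finite set of predictor models.
    # Each model maps the observed history to a guess M_t for the next x_t,
    # and is charged the squared error between its guess and the actual x_t.
    weights = np.ones(len(models))
    combined, history = [], []
    for x_t in xs:
        preds = np.array([m(history) for m in models])  # each model's guess for x_t
        p = weights / weights.sum()
        combined.append(p @ preds)                      # weighted prediction to use as M_t
        losses = np.sum((preds - x_t) ** 2, axis=1)     # squared-error loss per model
        weights *= np.exp(-eta * losses)                # multiplicative-weights update
        history.append(x_t)
    return combined

# Toy usage with two hypothetical models: "repeat the last x" and "average of the past".
last = lambda h: h[-1] if h else np.zeros(2)
avg = lambda h: np.mean(h, axis=0) if h else np.zeros(2)
xs = [np.array([np.cos(0.01 * t), np.sin(0.01 * t)]) for t in range(50)]
print(learn_predictable_sequence([last, avg], xs)[-1])

The combined predictions could then be plugged in as the Mt sequence for the
optimistic update sketched earlier.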
And extensions to this, which, of course, I absolutely have no time to go into:
you can learn the predictable sequences with partial information, and there are
quite a few variants of this. One variant is when you have bandit feedback on
the loss.
So you only see the loss you suffer on the arm you pick.
Another variant is when you only see one model -- so imagine that you had to
pay for a model. You're paying these hedge fund companies to give you
predictions, and you don't have an infinite budget. You only have money enough
to pay for one of these guys. So each day, I choose one of these guys and I
pay him. I only see what his predictions are. I don't get to see what the
other companies' predictions are. So this is the other environment.
Of course, the third variant is you kind of put both of these together. And
the more interesting variant is where, I mean, if you were in the multiarm
setting, then the idea is that you don't pay for, like, the hedge fund company
to give you all predictions, but you pay for each individual prediction
separately.
So in that case, you basically want to pick an arm, and you want to ask a model
what it predicts about this particular arm. I don't get to see what it
predicted about the other arms; I pay it to tell me what it predicted on this
arm, and I pick a particular arm. And so that's the fourth variant.
Okay. So in summary, we basically started with complexity measures for online
learning using Minimax analysis. We started from the value and went to some
complexity measures that drew inspiration from empirical process theory.
And then we basically showed how to get a principled way to move from this
nonconstructive Minimax analysis to online learning algorithms. And finally, I
gave you some hints on how to build adaptive online learning algorithms and on
learning with predictable sequences.
So in terms of further directions, one of the things is: can we extend the
relaxation mechanism to notions of performance beyond regret? So things like
[indiscernible] approachability, calibration, et cetera. There are some key
differences between what we did here and the structure of the losses or the
performance measures in these other contexts.
And so I'd like to mention the reason why we're asking this: it's because, on
the non-constructive side, in terms of things like the Rademacher Complexity,
we have nonconstructive tools that can actually get you these bounds. But the
question is, just like how we took the nonconstructive tools for the usual
regret and converted them to algorithms, is there a way to convert these
nonconstructive tools into algorithms?
Another thing is learning with partial information. As I was mentioning
before, when you have partial information, there are two terms whose dependence
on Q actually starts getting coupled, and so that creates extra problems. We
have some partial results, but we still don't have what we would like to have.
And also, can we extend all this analysis to online learning with states, and
basically to a competitive ratio type scenario rather than a [indiscernible]
type scenario?
So that's it.
>> Ofer Dekel: Thank you. So questions?
>>: [indiscernible] what I've found is that in this work, you have a lot of
connections between statistical learning and online learning. And I think a
very [indiscernible]. I find that interesting.
>> Karthik Sridharan: Oh, yeah. Yeah. I guess I was making these slides
sequentially, so by the time I came here I had a limited memory -- a faded
memory of what I had. Yeah, definitely.
So basically, it's like you have the whole empirical process theory for
statistical learning, and you have almost the same story for online learning.
>>: [inaudible] you have the relaxation is based on the [indiscernible]. And
you have the other thing where you do the opposite in parts [inaudible].
>> Karthik Sridharan: Oh, the predictable sequence.
>>: The predictable sequence. You look at what you have, and just like
[indiscernible] what you have to restrict, and then you consider the worst of
the futures, like you had before. There is also a kind of [indiscernible] and
then testing on the rest.
>>: So if you didn't understand, let's release the audience. He will be here
for the rest of the week, everybody is invited to keep badgering him to
understand all this stuff.