>>: It's a pleasure to introduce Professor Csaba Szepesvári, who is a professor at the University of Alberta. He has worked a lot on reinforcement learning, bandits, and statistical learning theory. And today he's going to talk about sparse contextual -- sparse bandit problems.

>> Csaba Szepesvári: Contextual or not, sparsity is going to be there -- sparse context, very sparse. So it's a pleasure to be here, and I'm going to be here for four months, so any time, you can knock on my door or send me an e-mail; I will be very happy to talk to any of you.

So this is joint work with my former student Yasin Abbasi-Yadkori, who is a postdoctoral fellow with [indiscernible] at the moment. And what I'm going to talk about is online-to-confidence-set conversion and its application to sparse stochastic bandits.

So here are the contents. First we're going to talk about linear bandits. How many of you are familiar with linear bandits? Okay. We will have some motivation, but maybe we can cut it a little bit short: why should we care about sparsity, and how does sparsity come into the picture. Then I'm going to talk about a general type of algorithm and how it actually reduces the problem of designing bandit algorithms for the stochastic setting to designing tight confidence sets for certain prediction problems, plus bounding the [indiscernible] regret. So that's the generality side. And then we're going to talk about how to construct these confidence sets, or confidence regions, for these prediction problems. And here, as opposed to many [indiscernible], we need what's called honest confidence sets, which means that if you declare that the true parameter is in your confidence set with a certain probability, it really has to be there with that probability -- it's not an asymptotic statement; you get the finite-time results that you need. So first I'll define what this problem is, and then I'm going to argue that these confidence sets actually do matter a lot from the perspective of obtaining good practical results. And then I'm going to jump into the main part of the talk, which is a new construction for confidence sets. We call it online-to-confidence-set conversion: it takes an online algorithm for an adversarial setting, and it converts it into a confidence set. It's kind of a neat result. So for that, we're going to talk about the framework of online linear prediction problems, and then we just look at the conversion, and then I'm going to explain on a slide why it works -- it's actually pretty simple. And then we talk about the application to linear bandits, in particular to sparse bandits, which is the main topic, and look at the results.

Okay. So linear bandits. I don't have to explain this to you -- this is actually from a paper coauthored by [indiscernible]. You want to recommend news articles to users coming to your website, and the news article could be something -- oops, the laser pointer doesn't work -- something like this. Or it could be about politics, or it could be about, you know, computer games, or it could be about the Olympics, or it could be about science, and so on and so forth. So there are zillions of articles the user could be interested in, and you want to put the article on the front page that matters the most to the user, and you're hoping to collect clicks. Okay, so here the goal is to maximize what's called the click-through rate. Okay. So how does this work? You have a number of rounds and users coming to your website.
And then in round t here, you're given a set of articles D_t. Given the set of articles, you have to choose one article -- it's going to be denoted by x_t -- to put on the front of the web page. And you're going to receive a reward of one if the user clicks on the article; otherwise, you don't receive a reward. And your goal is to maximize the total reward.

So how are we going to deal with large action sets? We have this article, for example, and it's just one of the many possible articles -- you could have zillions of articles in your database. So how do we generalize to articles that you have never shown to the users? One of the possibilities is to work out some features for the articles, as we all know. So we can ask whether the article is about sports -- in this case, yes; whether it's about politics -- in this case, no; whether it's about the Olympics -- yes; whether it's about games -- no; whether it's about science -- no; whether this is a trending article or not -- let's say it's a trending article; and so on and so forth. So you can tick off your features. In this case, these features give binary answers, and you can collect the answers and code them with numbers -- zero-one numbers -- and then you can collect them into a vector that will be the feature vector underlying your article. And you can do this for every article, this way turning the problem into a problem that is defined over some vector space of dimension D, where D is the number of features.

And then the probability of a click on x can be modeled by, let's say, a simple linear model: you just linearly combine the features of the article, and that gives the probability of the click. And I'm going to use this notation for denoting the inner product of this θ vector and the vector x. And so the click is going to be a Bernoulli random variable, and the assumption is that this P(x), for some magical reason, lies between zero and one; the probability of the click is just P(x). Okay. So you can alternatively write this in the following way: if you define eta as the difference between Y and P(x), then you can write that the click is equal to the inner product plus this eta. Eta is just noise; if you take the expectation of eta, you see it's zero mean right away.

Okay. Are we good? So far so good, right? So, more generally and more abstractly, linear bandit problems are defined over some vector space of, let's say, finite dimension D. In this case, let's say we work in a Euclidean vector space, and we have an unknown parameter vector that I'm going to denote by θ*. And the game is played in rounds. In round t, you're receiving a decision set D_t. Can I assume that you receive a convex set? [indiscernible] You can randomize, and with randomization you can always achieve any point that is in the convex hull of the original set; in terms of the expected reward, nothing changes if you introduce this randomization. So you are going to receive a convex set D_t, the set of articles, and then you have to choose an action in this convex set, and you're going to receive a reward, which is a random variable: the inner product of the action that you have chosen and the unknown parameter vector, with some noise confounding this inner product.
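As a minimal sketch of one round of this protocol (the names and sizes are hypothetical, and the Gaussian noise below is just one example of a zero-mean choice):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 6                                        # number of features
    theta_star = rng.normal(size=D)              # unknown parameter vector

    def play_round(policy, rng):
        D_t = rng.normal(size=(20, D))           # this round's decision set: 20 actions
        x_t = policy(D_t)                        # the learner picks an action from D_t
        eta = rng.normal(scale=0.1)              # zero-mean noise confounding the reward
        y_t = float(x_t @ theta_star) + eta      # observed reward: <x_t, theta*> + eta
        return x_t, y_t

    # e.g. a naive policy that ignores all feedback and picks a random action:
    x_t, y_t = play_round(lambda A: A[rng.integers(len(A))], rng)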
And we're going to assume that this noise is zero-mean noise, by which we mean that if you take the expectation of the noise conditioned on everything that you have seen up to that point -- all the previous actions and all the previous clicks, or rewards -- then the expectation is just zero. And we're going to assume, of course, that the noise has controlled tails as well. And the goal is to maximize the total expected reward. Okay. So that's just the standard framework that you see for this [indiscernible].

So why linear bandits? You'll see it's a beautiful model, and it's also very general. It allows you to model dependence between the rewards of arms, and infinite action sets. You can model multi-armed bandits, even with many arms, with it. You can model a lot of other interesting things, like computer games and interactive learning, and it has a bunch of applications: user interface optimization, which is what we started with, and online product recommendation. Even matrix prediction problems can be put into this form, which is kind of cool. And network routing, and so forth -- the list continues; there are tons of applications.

Okay. So how do we [indiscernible] a learner? The way to evaluate a learner is to look at its regret. Since we allow the action set to change in every time step, we have to be a little bit careful about how we define the regret. If we knew everything, then in round t we would choose this action, right -- the one that maximizes the expected reward. No question about it; everyone would do that. And so the regret is how much we lose by not doing this -- in expectation, if you wish. Some people call this pseudo-regret, because you're not comparing the actual rewards; you're comparing the conditional expectations given the choices, anyway. So here is the total reward that you're going to receive, and here is what you would have achieved if you knew θ* from the start. Okay, so that defines the regret, and we want the regret to grow as slowly as possible. If you divide the regret by the number of rounds, we want this quantity, the average regret, to go to zero as fast as possible, because this would mean that the average reward that you could have obtained is close to the average reward that you actually obtained. Yes?

>>: [inaudible]. Just 0/1 rewards?

>> Csaba Szepesvári: Right. Right. That's what it is, yeah. So, I mean, if you are interested in expectations, there is no difference. If you're interested in deviation bounds, like high-probability bounds, there is a slight difference between the two.

>>: [inaudible].

>> Csaba Szepesvári: I think the question is just -- like the problem definition, you mean?

>>: In the past.

>> Csaba Szepesvári: It's like you have this unknown vector θ*, and the reward is the inner product of your choice with θ*, which is an unknown vector, plus noise. So if you take the expected reward -- this is the expected reward, right, because we said that the noise has a zero mean.

>>: Do you see that the construction of that D_t may also be generated by the --

>> Csaba Szepesvári: Yeah.

>>: [inaudible].

>> Csaba Szepesvári: No. First -- the first, yeah. So in the stochastic linear setting, this is the assumption. If the model is not correct, then whatever I say here you can throw out the window; you are going to have linear regret in that case anyway. But still, you could study that case, right? It's just outside the scope of this. Okay. Good.
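In symbols -- just transcribing the definition above, with D_t the round-t action set, x_t the chosen action, and θ* the unknown parameter:

    R_N \;=\; \sum_{t=1}^{N} \max_{x \in D_t} \langle x, \theta_* \rangle \;-\; \sum_{t=1}^{N} \langle x_t, \theta_* \rangle ,

and the goal is that the average regret R_N / N goes to zero as fast as possible.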
So we want the average regret to go to zero as fast as possible, and we don't want the regret to grow faster than [indiscernible]. So when we say that we have a sparse problem, we mean that θ* is sparse. The hope here is that sparsity is going to allow us to add many features without slowing down learning, right? In this news article recommendation example, you're just ticking off these features, and you are hoping to capture everything that could possibly influence the reward. If you start to add too many features, it inevitably slows down learning -- except, maybe, if many of the features are irrelevant to your predictions, so that their coefficients can be taken to be zero because the other features already capture those things. Then you don't have to be that careful about how you're constructing your features. So it's a good thing if you can exploit sparsity.

Okay. Note that there is a difference here between a sparse parameter vector versus sparse features. You could use sparse features to design different algorithms, to adapt your algorithms; these problems are, of course, connected to each other. And if you have both of them, then you can look into the intersection, and that's interesting, but I'm not going to talk about that. I'm just going to talk about the sparse parameter vector case today.

Okay. So how do we play in stochastic linear bandits? One of the standard algorithms uses what's called the optimism in the face of uncertainty principle. The way it works is that you maintain a high-probability confidence set for the unknown parameter vector. This confidence set is a region in R^D that, with high probability, contains the unknown parameter vector θ*. And the algorithm is just one line -- that's the beauty of this thing: you take the joint maximizer of the inner product, where the first component is your action x and the second one is the parameter vector; the first component ranges over the action set, and the second component ranges over the confidence set that you have at that moment in time. Okay. So this way you're looking at the best possible world -- you are hoping to see the best possible world, and you're playing the best possible arm in the best possible way. So you're optimistic. This way θ̃_t is going to be optimistic; we're going to call it an optimistic estimate of the unknown parameter vector [indiscernible].

And this principle goes back to at least Lai and Robbins, maybe even before that. And there are a lot of algorithms, like UCB1, which was proposed by [indiscernible] -- that's a special case, so that's not the right reference, but anyway. It's a widely applied principle, and it's a very active area of research.

So you might worry about the implementation of this, but if you have a finite action set, then you can just enumerate all of the actions, and for each of the actions you can compute its optimistic reward. If your confidence set has some nice shape, then that's going to be a convex optimization problem; you can do that, it's not a big deal. Now, if the number of actions becomes very large, then you have to be a little bit more clever about this computation, but I'm going to sweep this under the rug and just move on. It's an interesting problem on its own how to actually compute these things.
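As a sketch of this one-line rule for a finite action set -- assuming, as in the pictures coming up, an ellipsoidal confidence set C_t = {θ : ||θ - θ̂||_V ≤ β}, over which the inner product maximizes in closed form as <x, θ̂> + β ||x||_{V^{-1}} -- the optimistic choice can be computed by enumeration:

    import numpy as np

    def optimistic_action(actions, theta_hat, V, beta):
        """Return argmax over x in actions of max over theta in C_t of <x, theta>."""
        V_inv = np.linalg.inv(V)
        def optimistic_reward(x):
            # Closed-form maximum of <x, theta> over the ellipsoid C_t.
            return float(x @ theta_hat) + beta * float(np.sqrt(x @ V_inv @ x))
        return max(actions, key=optimistic_reward)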
So this is just an illustration of what's going on. You have this confidence set C_t, and θ* must be inside it somewhere. Usually θ̂_t, which is an estimate of the unknown parameter, is the center of this set, and the set usually has this ellipsoidal shape. And the optimistic estimate [indiscernible] is going to be on the boundary, because you're maximizing a linear objective over this confidence set. Okay, so that's the picture that you should have in mind.

So why does it work? What does [indiscernible] look like? You can actually analyze the immediate regret of this algorithm, and it goes as follows. In round t, you're choosing this action, and you should have chosen that action, okay? So this is your pseudo-reward for that round, and this is the best possible pseudo-reward. You have to use θ*, the unknown parameter vector, when writing down the regret, and this can obviously be bounded by this quantity. Why? Because we're guaranteed that θ* belongs to the confidence set; and, of course, this action belongs to the decision set; and this pair maximizes the inner product over the cross-product of the decision set and the confidence set, so this is at least as large as the other quantity. So this follows because of the optimistic choice. And then by linearity, you can introduce some norm and its dual norm -- you can just use [indiscernible] -- and you get this inequality. So what you see here is that you have the freedom to choose any norm, and the immediate regret at time t is going to depend on the norm of the chosen action and the dual norm of the parameter difference. If the D_t set has some nice shape -- it's nicely bounded -- then the first factor is going to be a bounded quantity, right? The other quantity is more interesting: it's the difference between θ* and your optimistic estimate of θ*. If that difference cannot be big, then you will be in good shape, okay? So you want to choose this norm in such a way that you can show that this difference shrinks. That's the whole idea. In some sense, this difference is going to measure the size of the confidence set, and you can, if you wish, think about it as the confidence width. And you can use a different norm in every time step, and indeed that's what you would do. Okay, so from this you can derive a bound. I'm going to show some regret bounds later, but this is the basic idea of how this goes.
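Written out, the chain just described is (with x_t^* the best action of round t and θ̃_t the optimistic parameter):

    r_t \;=\; \langle x_t^*, \theta_* \rangle - \langle x_t, \theta_* \rangle
        \;\le\; \langle x_t, \tilde{\theta}_t \rangle - \langle x_t, \theta_* \rangle
        \;=\; \langle x_t, \tilde{\theta}_t - \theta_* \rangle
        \;\le\; \|x_t\| \, \|\tilde{\theta}_t - \theta_*\|_* ,

where the first inequality holds because θ* ∈ C_t and (x_t, θ̃_t) maximizes the inner product over D_t × C_t, and the last step is Hölder's inequality for any norm and its dual.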
So why optimism? Could you get away without optimism? Well, actually, you can. This was the previous algorithm, but now you can choose any θ_t that is an element of the confidence set C_t, and then you choose the maximizing action with respect to your chosen parameter vector; and then, with [indiscernible], you can get this other inequality. This other inequality looks very much like the previous inequality except for the appearance of these terms. So what you see is that what the optimistic choice buys you is that you don't have to analyze this term -- and this term might actually be large, or small. Anyway, the optimistic choice saves you from this; that's what you gain. But if you are really worried about how expensive the optimistic choice is -- how expensive it is to compute -- maybe you go with some other choice and just analyze that; for finite-armed bandits you can actually do that. Okay, so far so good. Yeah?

>>: So could you go back? If you take an [indiscernible] from C, that inequality holds, but it doesn't necessarily have to in terms of [indiscernible].

>> Csaba Szepesvári: That's right. If you cannot show that you're controlling this term -- this is a new term -- it doesn't help you, right? You can control this term as before, and this term is as before; this is the only difference, that's the new term. And if you think about finite-armed bandits, for finite-armed bandits you can actually control that term. For other bandits, I don't know. It's a meaningful term, actually; it's an interesting question whether you can control it.

>>: Just to make sure I understand: θ* is a constant -- it doesn't change across the rounds -- but the norm, the dual norm, is something that could change.

>> Csaba Szepesvári: Yeah. Yeah. It should be indexed by the round t.

>>: So if you have the same -- [indiscernible].

>> Csaba Szepesvári: It's because I allow the set to change with time. In my model, in every time step you receive a different set D_t, which is a convex set in the space -- the set of articles in every round is different.

>>: Okay.

>> Csaba Szepesvári: It's coming from there. This way you actually can have the model [indiscernible]. But if you didn't have the changing decision set, this index would go away -- well, you would still have an index there, and this would actually measure how confident you are about the reward of the optimal arm; if you could show that that shrinks fast enough, then you would be done.

Okay. So where do the confidence sets come from? We see that it's all about the design of the confidence sets: once you have good confidence sets, then you have good regret. And we hope that the [indiscernible] is not going to have any regrets -- it's high probability. It's better to go with honest confidence sets in this case; maybe even be a little bit conservative, I would suggest. But anyway, let's be honest. So how do we design these honest confidence sets?

If you look at the problem in a general way, you have this linear prediction problem: there are these covariates x_1, ..., x_N and responses Y_1, ..., Y_N. The covariates are the actions you happened to choose in the bandit case, but more generally they could come from any sequential procedure. Then you assume that the response is a linear function of the covariates, where [indiscernible] is an unknown vector, plus some noise. And here I'm being more specific about what I assume about the noise: you could assume that the noise is subgaussian with some constant. Subgaussian means that the tails of the noise [indiscernible] off as fast as the Gaussian's. It's a reasonable assumption under many circumstances.

And, as I said, x_t is often chosen based on the [indiscernible]. So there are different problems associated with this setting: one is just estimating θ*, and the other is to construct a confidence set with high-probability coverage. We're going to be interested in the second problem. So that's the definition of subgaussian, and I'm just arguing that subgaussianity is something you're familiar with, even if you think you are not.

Okay. So what do we mean by honest confidence sets? We mean a random set in R^D such that, given some δ -- that's your confidence parameter -- the unknown parameter vector lies in the confidence set with probability at least one minus δ. Yes?

>>: I thought you assumed the noise is bounded, because it [indiscernible].

>> Csaba Szepesvári: It could be bounded, yeah. That's a special case of subgaussianity.

>>: You find it to be a possibility?

>> Csaba Szepesvári: That was the motivation. And then we generalize.
So I'm moving to something more general -- slightly more general. If you have a bandit like this, then for certain bandit cases it's [indiscernible], certainly. Okay. So we want this confidence set.

>>: The probability here -- what exactly is the measurable space?

>> Csaba Szepesvári: You have a measurable space, and then you have all these random variables that are supported on this measurable space. So, I'm sorry, I didn't say: x_1, ..., x_N and Y_1, ..., Y_N are random variables; they're supported on this measurable space.

>>: It's not a product space, right?

>> Csaba Szepesvári: No, it's not. These are totally correlated.

>>: Right.

>> Csaba Szepesvári: It's like these are -- these are totally correlated. You could construct the space, and all these conditions you could construct.

>>: But unlike regression, where you assume this is -- it's not a --

>> Csaba Szepesvári: This is not a random design. It's not like random design or fixed design with [indiscernible] covariates. So it's a little bit harder, if you want. But, actually, it happens to be not much harder to analyze -- it's a little bit harder, but you actually get the same results.

Okay. So this is what we want, and this is just a picture showing what we want. One approach is to design confidence sets based on the [indiscernible]. You have your data, and you collect it into these matrices: the covariates go into this matrix, and Y_1 through Y_t go into the response vector, and then ridge regression with this positive parameter [indiscernible] produces this estimate. This is going to be the center of the confidence set, and you proceed as usual: you define the [indiscernible] matrix, which is slightly adjusted to take into account the ridge parameter. If you do that, then with some work you can prove that the following set is a confidence set. And what's important to remark here is that the result is pretty cool, because it holds for every time step -- it's uniform up to infinity, not just for a fixed finite time. You avoid the union bound, basically, by using a martingale argument -- it's learned from [indiscernible]; it's a stopping time argument. Yes?

>>: So in some ways, the dimensions of the matrix don't seem to match.

>> Csaba Szepesvári: Well, it's -- yeah. It should be -- that should be [indiscernible], I guess. Obviously that thing is divided.

>>: Divided.

>> Csaba Szepesvári: Are you -- okay. Thank you. Yeah?

>>: [indiscernible] shared space.

>> Csaba Szepesvári: Yeah? Okay. So that's the coolest thing about this, and the proof is based on the method of mixtures, which goes back to [indiscernible] in the 70's; it's kind of like using a [indiscernible] technique to actually prove this thing. It's a cool technique. Anyway, so this is a confidence set.

Okay. So what you see here is that this is basically the distance that you're going to use later on, say, to bound things. It's the [indiscernible] matrix, so the set is going to have much smaller radii in the directions where you saw a lot of covariates; in the other directions it's going to be very large -- potentially really large. And this is kind of the radius. So here this norm is this matrix-weighted norm. And this carries over [indiscernible] to, if you care about them, RKHSs or Gaussian processes. So that's cool.
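A sketch of this construction in code -- the center is the ridge estimate and the radius follows the self-normalized bound from the speaker's paper with Abbasi-Yadkori and Pál (2011), as best reconstructed here; the variable names are mine, and the formula should be checked against the paper:

    import numpy as np

    def ridge_confidence_set(X, Y, lam, R, S, delta):
        """X: n x d covariates, Y: n responses, lam: ridge parameter,
        R: subgaussian constant, S: bound on ||theta*||_2, delta: confidence."""
        n, d = X.shape
        V = lam * np.eye(d) + X.T @ X               # regularized Gram matrix
        theta_hat = np.linalg.solve(V, X.T @ Y)     # ridge estimate (the center)
        # radius: R*sqrt(2*log(det(V)^{1/2} * det(lam*I)^{-1/2} / delta)) + sqrt(lam)*S
        logdet_ratio = np.linalg.slogdet(V)[1] - d * np.log(lam)
        beta = R * np.sqrt(logdet_ratio + 2.0 * np.log(1.0 / delta)) + np.sqrt(lam) * S
        return theta_hat, V, beta   # C = {theta : ||theta - theta_hat||_V <= beta}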
Is this a good bound? Of course, there's been tons of work in the literature about producing similar bounds. For example, the [indiscernible] paper proved this bound -- showed that this is a confidence set -- and the difference you see here is that there it is for a fixed time. So if you want uniform-in-time behavior, then it's pretty common that you take a union bound, and here you would also need a covering argument. And here we are avoiding that, because we use this method of mixtures, which kind of integrates, and then you use the stopping time argument to get rid of the union bound. That's kind of neat. So the determinant of V_t could be as large as -- well, as large as this, but you are basically shaving off one of those terms. And [indiscernible] proved a similar bound; you can see that qualitatively these two bounds are similar, while our bound looks different -- it kind of adapts. This also adapts to the [indiscernible] matrix, which you hope is a good thing; whether it's a good thing or not is not so easy to see. So what you can do is -- well, first, you can derive a regret bound for the underlying optimistic algorithm. Yes?

>>: So you said that your bound is obtained by using ridge regression.

>> Csaba Szepesvári: Yeah. This is a regression-based bound.

>>: They're the same.

>> Csaba Szepesvári: That's basically the same. They're the same, yeah.

>>: The algorithm is different. The assumptions are different -- the assumption on x_t.

>> Csaba Szepesvári: No. No. No. No. These guys are like -- we are going to do like -- okay. So here I was too lazy to check how to extend our bound when these norms are different, but we have a general result. You know, S is a bound on the two-norm of the unknown parameter; I was just lazy and didn't calculate how S would come into this bound, too. And the same holds here: here they have it, or I had it, but -- okay, that's the only difference. But you could take the common case when everything is smaller than one; then S equals one, and then that's probably the only difference, I guess, in the bounds. Yeah. Okay.

So you could compare these by running some experiments, right? Before we go there: if you just plug these bounds into the previous regret bounds for the respective algorithms, you get these regret bounds for stochastic linear bandits, and you see that the dependence on D shows up, and it's going to be linear in D. In the worst case, you can show the D has to be there.

So, some empirical results: are these new confidence sets any tighter than the previous ones, or are we just going in circles? I'm going to show some experimental results. Here on the picture, what you see is the regret, and this is time, and this is the bound based on the [indiscernible] paper. What you can see is that the regret is not starting to curve -- it's almost linearly growing. If you take our confidence set, the regret is more gentle; it behaves much better, curving like a root T, which is what it should do. On the picture, I'm also showing that you can modify these algorithms not to switch that often between the arms, and that saves you a lot of computation -- the regret is not going to change by much if you do that. So you can save a lot of computation -- exponentially many computation steps -- because it's enough to recompute everything [indiscernible]. Okay. So these are ways to construct confidence sets, right? And these confidence sets, you see, are really important.
They're crucial to achieving good results -- in particular, can we get tighter ones? And sparsity is our main subject. If you only care about the prediction error in the linear regression setting, then what you know is that if there is no sparsity, the prediction error is going to depend on the dimension D, and if you have sparsity p within the D, then in the prediction error you can replace the D by p log D, which is pretty cool: you can play with a very large D and a smallish p. But we also know from the literature that the least-squares estimator won't cut it, so you have to look for something else, you know.

So how do we design confidence sets for this sparse case? The idea is to do a conversion -- a reduction from one problem to another. If you can solve online linear prediction with squared loss with small regret in the sparse case, then we claim that you're going to have very, very good confidence sets for the sparse case, and that should be enough. So the idea is to create confidence sets based on how well you can do in linear prediction. And it's pretty cool, because whenever you improve something on the prediction side in this online setting, you're going to improve your confidence sets automatically. And hopefully it will give you good bounds for the sparse case.

>>: I'm confused. I thought the goal was linear prediction. [indiscernible] to the prediction. Now you're -- the confidence sets are just --

>> Csaba Szepesvári: Yeah, I'm going back and forth. It seems that I'm going in circles, yeah, but not exactly. So now the linear prediction problem is going to be made tougher, because it's going to be adversarial. And why adversarial? The reason is that it's the bandit algorithm which is pulling these arms, and you don't really control how much the arms are spread out -- they happen to be the way they happen to be. So you have this prediction problem where the covariates could be very correlated. And now suppose you have, for the sparse case, a sparse predictor which has low regret. Can you turn that back into a confidence set? That's the big thing. You will see -- it's going to work out.

Okay, so what is online linear prediction? It's not anything that we talked about before. What do I call online linear prediction? Okay, so this is the sequential worst-case framework. In round t, you receive x_t; you need to produce a prediction, which is just a real number; and then you receive the correct label, or the correct response, Y_t. You suffer this quadratic loss. That's the whole thing. We didn't say anything about whether there is any statistical relationship between the x_t's, the Y_t's, and so forth -- there is no statistical assumption here; this is a worst-case framework. And you want to compete with the best predictor in hindsight, because there is nothing else you could do, right? Because there are no statistical assumptions here.
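As a concrete instance of this protocol, here is a minimal sketch with an online gradient descent learner -- one of the algorithms listed next; the step size and the projection radius are illustrative assumptions, not tuned values:

    import numpy as np

    def online_gradient_descent(stream, d, step=0.1, radius=1.0):
        """stream yields (x_t, y_t) pairs in order; returns the total squared loss."""
        theta = np.zeros(d)
        total_loss = 0.0
        for x, y in stream:
            y_hat = float(theta @ x)                 # predict before y is revealed
            total_loss += (y_hat - y) ** 2           # suffer the quadratic loss
            theta -= step * 2.0 * (y_hat - y) * x    # gradient step on the loss
            norm = np.linalg.norm(theta)
            if norm > radius:                        # project back onto the l2 ball
                theta *= radius / norm
        return total_loss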
So there are algorithms for this problem: all kinds of gradient descent algorithms, online least squares, exponentiated gradient for this online loss. And there is an algorithm that probably not many people know about, but it's actually a simple adaptation of continuous exponentiated weights algorithms to this setting -- it's just continuous exponentiated weights. The [indiscernible] is for sequential, and the rest is for the sparsity -- that's why the SEW: sparse exponentiated weights. So, basically, you use an exponentiated weights algorithm with a prior which prefers sparsity.

Okay. So, online linear prediction: what's the regret? The regret -- I'm going to use a different letter to denote it -- is this quantity. The regret against a parameter vector θ is the total loss that you suffer minus the total loss you could have suffered had you used this parameter vector in every time step to make your prediction. But there is no assumption that you should stick to some parameter vector θ in your predictions -- you don't have to do that; you can produce the predictions in any way.

Okay. So all of these prediction algorithms that we talked about come with some regret bounds. And what's a regret bound? The guarantee is that, no matter how the data is selected -- which is the x's and the Y's -- the algorithm is guaranteed to have a regret that is below some quantity B_N, which you can either compute from the data or from prior information, which could be things like how big the vectors can be, how big the Y_t can be, and so on and so forth. So there are different kinds of bounds in the literature: there are a priori bounds, which only depend on the magnitudes of the things that are coming in, and there are bounds that depend on the data -- data-dependent bounds. For us, data-dependent bounds are going to be a little bit better, because we're going to use these bounds to actually design our confidence sets, so we'd have a tighter result. The only thing that is required is that you come up with some algorithm, you have a regret bound, and the regret bound should be computable based on known quantities. That's it. And then we turn the wheels, and it will spit out a confidence set.

Okay. Typical results in the literature show that the regret is either of size root N or of size log N. If you exploit the curvature of the loss, then you can work a little bit harder and get the log N type of regret bound; if you throw away that information, you typically end up getting root-N regret.

Okay. So before we do this conversion, let's look at a reduction that was done previously by other people. The question here is whether small regret implies small risk. You might know this result -- it's kind of cool, and this reduction makes sense. So let's say you are in a statistical setting where you have i.i.d. data, and you want to do linear prediction. So you have x_t and Y_t, i.i.d., and you assume that Y_t is linearly related to x_t, and you choose an online learning algorithm A that produces a sequence of estimates used to make the predictions. This algorithm is special, because its predictions are based on parameter vector estimates -- not all algorithms do that. And you have a regret bound. What can we say about how well we can predict -- how small a risk can be achieved? So what's the risk of a vector θ [inaudible]? The risk of a vector θ is the expected squared error, where the expectation is over the joint distribution of x_t and Y_t. And these guys proved that if you take the average of the vectors produced by this online learning algorithm, then for any δ, with probability one minus δ, the risk of this average vector is going to be bounded by this. The main term here is B_N divided by N. So if you had an algorithm that has a regret of log N, then you have a log N divided by N risk.
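Schematically, with θ̄_N the average of the online algorithm's iterates, the statement has the form

    \mathrm{risk}(\bar{\theta}_N) \;\le\; \frac{B_N}{N} \;+\; (\text{lower-order terms in } N \text{ and } \log(1/\delta)),

holding with probability at least 1 - δ; the exact lower-order term depends on the loss and the noise assumptions, so take this as the shape of the result rather than its precise statement.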
So it's like turning the online algorithm into an algorithm that has a small risk. Online learning is powerful. So that's kind of where we started. So how do we use an online algorithm to produce a confidence set? That's done here. We have this data, where we don't make the i.i.d. assumption -- the x_t can be sequentially generated -- but Y_t is related to x_t as before, in a linear manner. And we have an algorithm A, an online algorithm, that we're going to feed with this data, x_t and Y_t, and it produces this prediction, ŷ_t. And it comes with a regret bound, okay, against this unknown parameter vector θ*; the regret bound is B_N. Then you can show that the following set is a high-probability confidence set -- and a uniform, for-all-times, high-probability confidence set.

So what's in the set? It's kind of an ellipsoid-shaped set again. You have this quadratic quantity, and you say that the quadratic quantity cannot be larger than this: you take all the parameter vectors for which this quadratic quantity is below this other quantity. And you can see that if you expanded it, the Gram matrix would come in, and it would have the exact same flavor as the previous confidence sets. So the shape is not surprising -- it's exactly the same shape. What's different here is that the radius-type quantity depends on the regret of the algorithm: if you design an algorithm with a smaller regret, your B_N goes down, and then you have a much smaller confidence set. And note what you're summing: if you worry that this radius is too big, it's actually not too big, because on the left you have N terms, while if B_N is like log N, the radius is like log N, which means the confidence set is actually shrinking. Okay.

>>: Why are you using the ŷ_t, the predictions, in defining the confidence set?

>> Csaba Szepesvári: Because that's how we can do it.

>>: But what you actually observe is the Y_t's.

>> Csaba Szepesvári: Well, you have this algorithm A that produces those predictions, so you actually do observe those, too, right? So you take this algorithm for this online setting; it produces these predictions, and then you build your -- so the idea is that if that algorithm is so good at predicting things, we can use what it predicts. It's kind of like a filtered version, if you want, of Y_t -- you're reducing the noise by using the predictions of the algorithm. It looks a little bit dangerous, because it's not the actual data, but you have to trust the algorithm. The algorithm actually uses Y_t, so you're indirectly using Y_t. Okay. So that's how it works.

And the proof -- it's just one slide. If you take the definition of the regret, it compares your loss with the loss that you could have achieved if you had used [indiscernible] in every time step, and that is bounded by B_N. You have this quadratic expression; you do the expansion, you do the algebra, and from that regret bound, just by the algebra, you derive this. And you see that this quantity is what appears in the confidence set. You have the B_N, and you have this other guy, and that guy is just a martingale, okay? Because here you have the noise, which multiplies these other things, which are measurable with respect to the past. So you have to analyze the martingale, and as soon as you're done with that, you're done. You use standard techniques to analyze the martingale, and then you get your confidence set. It's a little calculation -- it's like five pages. So that's it. Okay.
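Putting the pieces together as a sketch (my variable names; the radius constants are reconstructed from memory and should be treated as indicative, not as the paper's exact values):

    import numpy as np

    def in_confidence_set(theta, X, y_hat, B_n, R, delta):
        """Membership test for the conversion's confidence set.
        X: n x d covariates; y_hat: the online algorithm's n predictions;
        B_n: its regret bound; R: subgaussian constant; delta: confidence."""
        # Quadratic quantity: sum_t (y_hat_t - <theta, x_t>)^2. The proof bounds
        # it at theta* by B_n plus a martingale term,
        #   sum_t (y_hat_t - <theta*, x_t>)^2 <= B_n + 2 sum_t eta_t (y_hat_t - <theta*, x_t>),
        # and the radius below absorbs a high-probability bound on the martingale.
        lhs = float(np.sum((y_hat - X @ theta) ** 2))
        radius = 1.0 + 2.0 * B_n + 32.0 * R ** 2 * np.log(
            (np.sqrt(8.0) + np.sqrt(1.0 + B_n)) / delta)
        return lhs <= radius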
Okay. So if we combine things, how can we get good confidence sets for the sparse case? Well, we need an algorithm that achieves small regret in the sparse case -- and we need an algorithm that achieves a log N regret, because if you have a root-N regret, then your confidence set would be just too large. So you have to have an algorithm that achieves log N regret and which adapts to, or exploits, the sparsity of the unknown parameter vector. And this algorithm by Gerchinovitz, which was published in 2011 and was based on earlier ideas of [indiscernible], actually does it, and it's a fairly simple-to-describe algorithm. The idea is that you have the parameter space, and you put a prior on it, which is going to be kind of like a distribution that prefers sparsity, and then you basically run continuous exponentiated weights on it. That means you can apply [indiscernible] -- the loss is treated as if it were Gaussian -- and you predict from the posterior, or you compute the expected prediction based on the posterior; that's kind of the same thing. And they showed, with a not-too-difficult analysis, that the regret of the algorithm is actually scaling with p log N, where p is the sparsity of the vector θ* and N is the number of rounds [indiscernible]. And so, as a corollary, you get this confidence set -- it's all just plugging into the previous inequalities. And you will see that p appears there and log D appears there, which is great, because D itself doesn't appear.

So if you apply this to bandits, what do we get? There is a very general statement -- because of this reduction there is this generality: you take any online prediction algorithm with a regret of B_N, and then you get this regret for the stochastic linear bandit problem. You can see that a root of D N appears, and then B_N appears under the root, and then there are these other terms. We don't worry about those for now -- they might be a little bit worrisome, but we are going to come back to that. So if you plug in the regret result of Gerchinovitz, you will see that your regret is going to scale with the root of p D N. Previously, the regret was scaling as D times root N; now D appears only through this root of D p N. It's probably not what everyone was hoping for, which is that you could get totally rid of the dependence on D, so that the regret depends on D only through polylog terms. That didn't happen.
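In symbols, the improvement just described is, up to polylog factors (these are the orders as discussed, not exact constants):

    \underbrace{\tilde{O}\big(D\sqrt{N}\big)}_{\text{dense confidence sets}} \quad \longrightarrow \quad \underbrace{\tilde{O}\big(\sqrt{p\,D\,N}\big)}_{\text{sparse, via the conversion}} ,

so a root-D factor survives; the lower bound discussed next shows that it must.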
So can we do better? Actually, the answer, unfortunately, is that under these conditions you can't do any better. There is a lower bound, which is based on another paper, where they studied bandits in the adversarial setting, and the lower bound goes as follows. You take this decision set, which is basically all the unit vectors, with a one as the first component, and you pick some epsilon, which happens to be square root of D over N. And the unknown parameter vector is going to be any of the following D parameter vectors: the first component is fixed to 0.5, and you have a [indiscernible] on the [indiscernible] component -- so basically here the sparsity is two. And if you want to play well in this game, you have to guess where the little epsilon is -- you have to find where the epsilon is positive -- if you want reward. And the standard result goes that if you take the stochastic bandit problem, where Y is going to be a Bernoulli with parameter the inner product of the vector chosen and [indiscernible], then the regret of any bandit algorithm on this problem is at least root D N. Okay. So what you see is that the root D is never going to go away -- no matter what sparsity you're talking about, and this is a very sparse problem.

Now, there is a result with an alternate noise model, where the parameter is perturbed in every time step, and there is a strong assumption that this has to be i.i.d. noise, meaning the components are uncorrelated, or independent of each other, and zero mean. If you do this -- and let's say your decision set is the [indiscernible] ball; it has large vectors in it, in all kinds of directions -- then there is an algorithm that achieves a better regret. This result is due to [indiscernible], and it is concurrent to our result. So if you're willing to make stronger assumptions, then you can improve the situation, but whether this set of assumptions, or some other set of assumptions, is the one that you care about is another question.

Okay, so back to our results. Do we actually improve empirically? Well, we generated an artificial problem where you have 200 dimensions and the sparsity is ten -- only ten components are non-zero -- and that's the actual setup. And you have some noise. What you see is that if you don't apply this reduction result and just apply least squares, then the regret is not going to curve any time soon, and if you apply this reduction result, then you get a much gentler behavior on this problem. So you gain something.

So, in summary, we tackled sparse stochastic bandits, and the main tool was this online-to-confidence-set conversion tool. We think that this gives the first confidence sets for sparse linear prediction under such general conditions, and we got good empirical results. There are other results -- I didn't talk about this Yahoo! news article recommendation competition that Lihong Li devised a while ago. In terms of future work: currently I'm looking at other problems, like matrix prediction, where you can just [indiscernible] the framework, and it seems like every step goes through -- there doesn't seem to be any major difficulty there. One of the challenging questions is whether you can adapt to unknown sparsity. We don't know the answer to this, but if you asked me, I would say probably no, unfortunately. Then there are interesting questions like: when the action set has a few extreme points, then [indiscernible], and can you design algorithms that automatically take things like that into account and achieve the best of both worlds in all cases? And this algorithm that I talked about, the one based on continuous exponentiated weights, is pretty expensive -- the authors say they can run it for tens of thousands of dimensions, and it requires some approximate computations. So the question is whether you can get away with cheaper algorithms. In the experiments, instead of this algorithm, we used exponentiated gradient, which is a cheaper algorithm that comes with a worse regret bound, but we were careful to use a better, data-dependent regret bound, and that's where we were winning. So the question is whether other, cheaper algorithms exploiting the structure of stochastic bandits would cut it. Would they cut it? I don't know.
And, lastly, the question is whether there is a tradeoff between computation and statistical adaptation to the data from which you are learning. These days, a lot of results appear in this direction, and, of course, you can ask the same question here. You can relate these problems to other problems that we are starting to have some knowledge about, and then you can hope to be able to study this problem in a formal fashion. So far, what we have observed is that you can have cheaper algorithms, like [indiscernible] algorithms, which come with weaker regret bounds -- so your regret seems to go up if you use an algorithm which is cheaper to run -- and if you have a more expensive algorithm, which maybe doesn't even admit any polynomial complexity bounds, then you can get much better results. So the question is whether this tradeoff is real. All right.

>>: So thank you. [indiscernible] the confidence set for this.

>> Csaba Szepesvári: Okay. Is this on?

>>: So how did you do this when you choose the -- [indiscernible]?

>> Csaba Szepesvári: Yes. So like I said, you enumerate each of the actions. Then, for each of the actions, you need to find the θ vector that maximizes the inner product -- the θ vector that lies inside this confidence set. So that's a quadratic [indiscernible] problem, not a hard optimization problem; you can [indiscernible] it in closed form. It entails inverting a matrix: you can do an SVD, and once you do an SVD, then you're done. Okay. Any questions?

[applause]