>> Dengyong Zhou: Today we have Alekh Agarwal from UC Berkeley to talk at our machine learning series. Alekh is a fifth-year Ph.D. student advised by Peter Bartlett and Martin Wainwright. He has been a recipient of an MSR Fellowship and also a Google Fellowship. He works on machine learning, convex optimization, and statistics. And today he's going to talk about learning with non-i.i.d. data.
>> Alekh Agarwal: Thanks for the introduction. Good to be back here. Although, I think I prefer the summer more. Okay. So, yeah, this is some joint work with John Duchi, Mikael Johansson and Mike Jordan, mostly trying to understand when learning algorithms continue to do well even with non-i.i.d. data samples. But before I go into that, I just want to very quickly recap the basic setting that's typically assumed in statistics and a whole lot of machine learning, where you assume that you have N data samples, say (X1, Y1) through (XN, YN), so X can be the data and Y can be a binary label or a regression value or something. So here I just have a cartoon of, for instance, a binary classification task. And the usual underlying assumption is that all these N samples were sampled independently and identically from a certain fixed distribution. And what we would do then is to train some sort of a model that fits the data well. So in this case just this linear separator that seems to be classifying my training data. And then what a whole lot of theory and statistics focuses on is to show that this model will actually continue to work well even on unseen samples generated from the same distribution. So here if I see a new red point, a test point, as long as it's from the same distribution, we continue to do well on it. But the whole story relies very critically on the fact that everything was sampled in an i.i.d. fashion. And so what happens when things are not i.i.d.? This is not a new question. People have thought about it. And in particular one very nice framework that kind of takes the other extreme is what's often called online adversarial learning, where we now view the same set of examples, (X1, Y1) through (XN, YN), as coming at us in an online stream. Quite remarkably, we do not assume any statistical structure on the data at all. Of course, that means that the performance criterion that we use also changes, and in particular what people measure here is a notion of regret to a fixed predictor. So what goes on here is that, okay, I'm receiving these samples that I think of as an online data stream, and I train now a sequence of models. And in particular the model at time I, which has been trained on samples up to I minus 1, is then used to make a prediction on the sample at time I. I measure the loss, under some notion of a loss function, on this new sample point, and this cumulative loss incurred by the learner is accumulated. So that's the first term. And the second term is to say that, okay, now suppose I took the entire data and trained the best model from my model class on this data, then how small would that cumulative loss be. And the rationale here is that the second quantity is what you were computing in the offline setup; that's exactly the computation we were doing. And now we are trying to argue how close our online computation is to this offline computation.
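[Editor's note: in symbols, the regret notion just described can be written as follows; this is the standard definition the talk refers to, with the notation chosen by the editor rather than quoted from the slides.]

```latex
% Regret of an online learner producing w_1, ..., w_N on samples x_1, ..., x_N,
% with loss F(w; x): cumulative online loss minus the loss of the best fixed
% predictor in hindsight.
R_N \;=\; \sum_{i=1}^{N} F(w_i;\, x_i) \;-\; \min_{w \in \mathcal{W}} \sum_{i=1}^{N} F(w;\, x_i)
```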
But while this is an interesting performance measure, again, it's a bit hard to interpret in terms of what it buys us statistically, or what it means in terms of what's going to happen in the future based on the samples I've seen. So a very nice answer to that was provided by Cesa-Bianchi [indiscernible] in '04, who said: I have this class of algorithms that work with arbitrary data samples and give me guarantees on this regret; now let me ask what happens when the data samples are indeed drawn in an i.i.d. fashion. And what they showed was that any algorithm that has a small value of this regret will also generalize well to unseen examples drawn from the same distribution. And that, for instance, can be thought of as one justification for this performance measure of regret, because it implies good generalization at least when your data is i.i.d. But again this connection very critically relied on the fact that samples were indeed drawn in an i.i.d. fashion. And the problem with this whole approach is that of course the world is not always i.i.d. And there are many, many scenarios where dependence among samples is very naturally built into our sampling procedures. So if you've got time series data, just by the very definition there is some temporal structure in your samples that's governing their dependence. This arises, for instance, a lot in financial data. Here is a randomly taken snapshot from Google Finance; quite clearly the stock prices at two different instances of time are not i.i.d. with respect to each other. The same with more general autoregressive processes, or lots of situations in robotics or reinforcement learning. You've got a little robot going around the room measuring quantities; its position at time T and at time T plus 1 is not going to be independent, it's going to be very heavily dependent. Similar situations arise in a lot of other control and dynamic systems, and in lots of scenarios where we do want to do some sort of learning but our samples are non-i.i.d. And there are results about it, but the existing theory is nowhere near as comprehensive as what we understand for i.i.d. data processes. Before I go into more details of the work, I want to give a very high level picture of what our contribution is going to be in this talk. So of course if I want to learn with non-i.i.d. data and the data has no statistical structure, then the notion of generalization to future test samples is a very strange thing to be talking about. So we do need to assume something. What we assume is that the data satisfies certain mixing conditions. What this means is, okay, suppose I have a training sample from today and I look at a training sample far, far out in the future. Then these two would be more or less independent. So there is dependence, but the dependence weakens as you go further out in time. So if I take two samples ten days apart, then they are more dependent than two samples 100 days apart. That's the notion of mixing. I'll be more concrete about this later. The second thing you want to understand is what's the right notion of generalization when talking about dependent data. There's actually some confusion about this, and different definitions exist in the literature. The one that makes most sense to us personally, and the one we end up picking, is the one depicted in this cartoon here. So the idea is that you've got some, say, time series that's running.
And you sample a whole bunch of training data, denoted in blue here. And now you stop receiving training data. You train a model on this training data. The time series continues to run. You continue to draw more samples from it. These red squares now are going to form the test sample. So the test samples are just future samples from the same sample path from which your training data came, which means that the test samples are again going to be dependent on the training samples; the red guys and the blue guys are going to be dependent on each other, they're not going to be statistically independent. And the rationale behind this definition is that you're thinking that you're observing one random realization of a process that's going on in this world, and you want to make future predictions on that same random realization. It's not like you're going to go into an alternate universe and make some prediction on a different random realization of the process.
>>: [inaudible] multiple realizations of test samples for a given context?
>> Alekh Agarwal: So, no, we actually -- we only let the process run into the future, the same process runs into the future, and the test samples are all drawn from the same realization.
>>: [inaudible].
>> Alekh Agarwal: Yes. Only one forward realization. Some people have thought about having test samples coming from an independent realization than the training sample, which personally does not make that much sense to me, which is why we ended up picking this definition. Okay. Finally, we're going to make a claim about what sorts of algorithms can learn well in these non-i.i.d. data scenarios. We're not going to make a claim for just one algorithm. We're going to make a claim for a whole class of algorithms. And these are going to be algorithms that have been designed for basically online learning in an adversarial scenario but satisfy a certain added stability property. And it's going to cover several interesting examples like online gradient descent and mirror descent and regularized dual averaging and several other examples. And for all these algorithms, we are going to prove that, as long as your loss function, your performance criterion, is suitably convex, you can, based on this non-i.i.d. training data, generalize to test samples under the definition of generalization I talked about here. And one thing that people have discovered a lot in optimization is that when your loss function has more than convexity, if it has some nontrivial curvature, then optimization can gain from it. And one aspect of our work will be that we will show that these gains are also preserved in learning, in the error bounds that you can guarantee on the test sample. And in particular this will have some nice consequences for a variety of linear prediction problems, things like least-squares regression, least-squares SVMs, boosting, and so on. You can learn at a much faster rate than you would expect in the worst case using just convexity. Okay. So that's kind of what we do on the learning side. Now, in the last several years, people have realized that learning and stochastic optimization are essentially like two facets of the same coin. And that allows very nice exchange of technology between the two fields. And we use this intuition to also interpret and develop an application of our results in the broader context of stochastic optimization.
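[Editor's note: in symbols, the path-based generalization criterion described in this cartoon, which is formalized later in the talk, can be rendered as the expected loss on the next few samples of the same sample path, conditioned on the training prefix; the notation below is a plausible rendering by the editor, not a quote from the paper.]

```latex
% Path-based generalization: \hat{w}_N is trained on X_1, ..., X_N and evaluated
% on the next k samples of the same realization of the process.
\mathrm{TestErr}_k(\hat{w}_N) \;=\; \frac{1}{k} \sum_{j=1}^{k}
  \mathbb{E}\bigl[\, F(\hat{w}_N;\, X_{N+j}) \;\bigm|\; X_1, \ldots, X_N \,\bigr]
```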
So in particular what we are able to argue is that all these stable online algorithms that our theory applies to can actually be used for stochastic optimization, but stochastic optimization where the source of noise in your gradients or function value estimates is again non-i.i.d.; it's a stochastic source that again satisfies these mixing conditions. And it turns out, while this is an interesting theoretical result, it has many, many applications in the area of stochastic optimization, and many scenarios where the result is actually quite relevant. So there is a very simple and elegant distributed optimization algorithm that we will show we can directly analyze as an application of our results. There will also be applications to, say, things like learning ranking functions from partially ordered data observations, and a very gnarly but cool application to pseudorandom number generators which I will briefly sketch at the end. Okay. So I'm going to try to keep the notation to a bare minimum, but a few things I am going to need, and mostly in this slide. So you can assume that there's some stochastic process P that's governing the evolution of my samples, so P_i is the distribution of the process at time I, and the mixing assumption that I'm going to make will guarantee, as I said before, that this distribution P_i converges to some stationary distribution pi. So, for instance, I'm sure most of you are familiar with, say, the convergence of random walks: as you let a random walk run for long enough, it mixes to a stationary distribution. And we need that sort of phenomenon in broader generality here. We're going to receive N training samples according to the stochastic process P. Yes?
>>: I'm confused about P_i here and the stationarity condition.
>> Alekh Agarwal: Uh-huh.
>>: Are you saying that the stochastic process at time I has a stationary distribution, or that --
>> Alekh Agarwal: No. You start the process from an arbitrary state and then you look at the distribution induced at time I. Then as I goes to infinity, the distribution at time I approaches the stationary distribution. So if you let, say, your Markov chain run for long enough, then the distribution induced over the sample should start looking like the stationary distribution.
>>: That's the definition of stationarity. So the problem I have is connecting it to the stochastic process P_i.
>> Alekh Agarwal: No, P_i is not a stationary process.
>>: Sorry. The stochastic process P_i with the stationary process that you just described.
>> Alekh Agarwal: No, no. So okay, P_i would be the distribution of the Markov chain, for instance, at time I. And this --
>>: The distribution of the state.
>> Alekh Agarwal: Yes, the distribution of the states.
>>: The marginal distribution of the states for the Markov chain, as an example.
>> Alekh Agarwal: Exactly. And we want this to converge to a stationary distribution. So it could happen that you have a non-ergodic Markov chain that does not converge to a stationary distribution, and if that happens then my results aren't going to apply. All I'm saying is I need this convergence to a stationary distribution, which happens, for instance, for all ergodic finite-state Markov chains, to happen for the process P. So P is from that nice class of processes where you actually do converge to a stationary distribution. Yeah, pi itself is not a process. It's just a fixed distribution.
>>: So P_i, it's at time I?
>> Alekh Agarwal: Yes.
>>: And the training samples X1 through XN come from the process --
>> Alekh Agarwal: Yes. So P here is the process run from time 0 through N or, sorry, 1 through N.
>>: Okay.
>> Alekh Agarwal: Okay. So, for instance, in the very simple scenario of i.i.d. data, P_i will just be pi at every I. It will just be the stationary distribution. Okay. And the other thing I'll need, because I'm talking about learning, is a loss function to measure the quality of my estimator, and that's going to be denoted by capital F in this case. And capital F takes a predictor W and a sample X_i and measures the fit of the predictor on the sample. And a quantity that's going to turn out to be important is the expectation of this loss applied to a sample drawn from the stationary distribution. Intuitively, you can see why that might be important: if this were an i.i.d. scenario, then this is actually a fresh example drawn according to the same distribution as my training data, and we are measuring the fit on that sample. And as we'll see in the sequel, this quantity will be important for non-i.i.d. data too, but I'll point that out when it comes. Let's just see some examples of the setup, just to get a better understanding of the various quantities involved. So again the simplest example of non-i.i.d. data that you can think of is a finite-state Markov chain. So here I've shown a simple example. You've got a graph. Your process is just the uniform random walk on the undirected graph. And the data observation process is the following. Each node in this graph has some distribution associated with it. It could just be the empirical distribution on some local data, or maybe it just has a way of sampling from some distribution; whatever it is, each node has a certain distribution associated with it. Now I have a token that's randomly walking around on this graph. The token starts at the first node and draws a sample from that node. And then it moves according to the random walk; it will land in one of these two neighbors, maybe with equal probability in this case, and samples from whichever place it lands in. And then we take another step of the random walk. Now it can be at any one of these three places, and again it samples from that node. And so P_i in this case is evolving as the marginal distribution on states of the Markov chain. And by well-known Markov chain theory I know this is going to converge, and in this case, because the graph is regular, it's just going to converge to the uniform distribution over all the nodes. So that's the distribution I get in the limit. So that's just what these distributions can look like in a simple example. A slightly more realistic example that can actually start capturing data generating processes in machine learning applications is, say, data generated according to an autoregressive process. What an autoregressive process means is that the sample at time I is obtained by taking the sample at time I minus 1, hitting it with a fixed matrix A, and then adding some new random noise to it. So A governs how much of the past signal you retain, and Z_i is the fresh randomness you're getting. For instance, this is what the process can look like. Here of course the X_i's are only one dimensional, so A is just a scalar in this example. But in general the X_i's will be higher dimensional vectors, and you can even have dependence on the past K samples and so on. So this is a fairly general class of processes and is often used to model time series data, to model control system dynamics, and so on.
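[Editor's note: as a concrete illustration of the autoregressive process just described, here is a minimal simulation sketch; it is not code from the paper, and the dimension, matrix A, and noise scale are arbitrary choices for the example.]

```python
import numpy as np

def simulate_ar1(A, n, noise_std=0.1, rng=None):
    """Simulate X_i = A X_{i-1} + Z_i with Gaussian noise Z_i.

    The process mixes to a stationary distribution when the spectral
    radius of A is below one, which is the regime the talk assumes.
    """
    rng = np.random.default_rng(rng)
    d = A.shape[0]
    X = np.zeros((n, d))
    x = rng.normal(scale=noise_std, size=d)  # arbitrary starting state
    for i in range(n):
        x = A @ x + rng.normal(scale=noise_std, size=d)
        X[i] = x
    return X

# Example: a 3-dimensional AR(1) process with a contractive matrix A.
A = 0.5 * np.eye(3)
samples = simulate_ar1(A, n=1000)
```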
And an actual way this can arise in machine learning problems is that your data X_i is sampled according to an autoregressive process, and then you've got some regression output associated with this X_i. That's what you're trying to fit by minimizing something like a squared loss, right? So this is one form of a problem that we might be interested in. And as long as the matrix A satisfies some nice conditions, again this autoregressive process will mix to a stationary distribution. So, again, this sort of a thing can fit into our theory. A third and even more realistic example would be, say, the task of portfolio optimization. So here your data X will be the price vector of the stocks you might be interested in investing in. W would be the fraction of your wealth that you want to invest in each stock, and a commonly used loss function for these portfolio optimization problems is the log loss. So the negative log of W transpose X would be the error function we use. And in this case P_i is the stochastic process that governs the prices of stocks at time I, and I'm not going to make any claims about what that converges to, actually. But a lot of people seem to expect that if you look at long enough periods of time then it does converge to stationarity, maybe things like geometric Brownian motion, but that's a bit more uncertain. I cannot make formal claims that it converges to stationarity here. But at least for long enough periods of time you would expect some sort of mixing to happen. Now I just want to make precise the notion of mixing that I will need. The notion of mixing that I use is called phi-mixing, and the idea is this: it measures the closeness of two distributions in total variation, which is just like looking at the L1 norm between two distributions. And what it says is: if you condition on the data at time T and you look K steps into the future, then how close is the distribution induced at time T plus K to the stationary distribution pi. So I condition on a certain time and look K steps into the future, I want this to be close to the stationary distribution, and I take a uniform upper bound over all time steps T. That's the phi-mixing coefficient. And quite often the bound you get for any fixed T will be independent of T anyway, so the maximization over T doesn't really bother us. So let's see what this can actually look like. The simplest example is, again, going back to i.i.d. samples, because as we talked about earlier, the distribution is just pi at every time. So if I condition on the previous round and look one step ahead, then I'm already at the stationary distribution, right? Because the distribution is just pi, independently of the past. So phi of 1 is already 0 for i.i.d. samples. For a random walk, the kind of Markov chain example we saw earlier, by using existing results we can show that phi of K is bounded by a quantity that goes to 0 geometrically as a function of K, and the exponent of the contraction factor is governed by the spectral gap of the random walk. So it's a very standard result, for instance. And that's just another example of what these coefficients might look like.
>>: So your phi of K, it's a sup over T.
>> Alekh Agarwal: Yeah.
>>: So the P of T does converge to pi. If it converges to pi, then -- does it -- as K goes to infinity, that should be a 0 quantity?
>> Alekh Agarwal: Yes. But you take the sup over T, right?
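[Editor's note: in symbols, one standard way to write the phi-mixing coefficient as described here, comparing the conditional distribution K steps ahead with the stationary distribution in total variation, is the following; the paper's exact definition may differ in minor details.]

```latex
% Phi-mixing coefficient: worst-case total-variation distance between the
% distribution k steps ahead (conditioned on the past up to time t) and the
% stationary distribution \pi.
\phi(k) \;=\; \sup_{t \ge 1} \;
  \sup_{\substack{B \in \sigma(X_1,\ldots,X_t) \\ \Pr(B) > 0}}
  \bigl\| \Pr\bigl(X_{t+k} \in \cdot \,\bigm|\, B\bigr) - \pi \bigr\|_{\mathrm{TV}}
```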
So you will take the largest value of phi of K over T. The largest value will be at the beginning, when you start and look K steps ahead. So what you're saying is correct. But...
>>: Okay. So that's interesting.
>> Alekh Agarwal: Can you gain from it?
>>: So the stochastic [inaudible] you consider, the state does converge to a [inaudible].
>> Alekh Agarwal: Right.
>>: What's the relationship with a regenerative process? Is that -- I'm just confused.
>> Alekh Agarwal: I would have to think about that.
>>: Okay.
>> Alekh Agarwal: Okay. So now I want to quickly define the class of algorithms that we're going to work with. As I said, we will prove a result about all online algorithms that satisfy two conditions. First of all, when we run them on a sequence of convex functions, they should have a bounded regret. So that's all this is saying: I've got my losses indexed by the samples X_i, the online algorithm generates a sequence of predictors W_i, and they'd better have a bounded regret. The part where stability comes in is that I assume the predictor does not change too much from round to round. There's a bound on how much it can move. And the reason I don't think this is a restrictive assumption is that for a lot of algorithms, the proof of the regret bound goes through by proving this as an intermediate step anyway, either implicitly or explicitly. So this sort of stability requirement is proved just by the proof technique for a lot of online algorithms. But for our proof it's an essential assumption to make explicitly. Some examples that would satisfy this would be, for instance, the online gradient descent algorithm from Zinkevich in '03, which runs over the data in online fashion, taking a gradient step at every time step, just ignoring the fact that it's a changing sequence of functions rather than a fixed function that you're doing gradient descent on. That has a bound on the regret, and satisfies stability bounded by the step size of the algorithm. The same thing also holds for the regularized dual averaging algorithm of Lin Xiao; it has both of these quantities bounded in the same way. And just one final ingredient that I need, which will allow me to state my results in the simplest possible way, is making this connection, pointing out how this expected function little f that I mentioned before is important for learning in non-i.i.d. data scenarios as well. So recall we want to measure generalization on test samples drawn from the same sample path as the training data. So I condition on my first N samples, which are the training samples, and then I ask what is the expected error I would incur on the test samples. And I want that average expected error, conditioned on the training data, to be small. And what we can show is that that average error is actually bounded by the error under the function little f, plus a couple of terms which will basically be smaller or not significant enough in the analysis. So we will just focus on bounding this quantity, the gap under little f, in the sequel, with the understanding that that gives us bounds on the generalization as well. One quick thing to note is that if you put tau equals 1 in this result, then this term is 0 and this term is phi of 1. Phi of 1 is 0 for i.i.d. processes, so for i.i.d. processes this recovers the intuition that this is the expected function and this captures the generalization of the algorithm fully.
>>: [inaudible].
>> Alekh Agarwal: Tau is just a free variable. Tau can be any natural number.
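[Editor's note: as an illustration of the kind of stable online algorithm being described, here is a minimal sketch of Zinkevich-style projected online gradient descent; it is not the authors' code, and the projection radius and step-size constant are placeholders. The point is that the step size directly bounds how far the iterate can move per round.]

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(gradient_fn, n_rounds, dim, radius=1.0, eta0=1.0):
    """Run projected online gradient descent with eta_i = eta0 / sqrt(i).

    gradient_fn(i, w) should return the (sub)gradient of the i-th loss
    F(.; x_i) at w.  Because each update moves the iterate by at most
    eta_i * ||gradient||, the algorithm has the stability property the
    talk requires, on top of its O(sqrt(N)) regret guarantee.
    """
    w = np.zeros(dim)
    iterates = [w.copy()]
    for i in range(1, n_rounds + 1):
        eta = eta0 / np.sqrt(i)
        g = gradient_fn(i, w)
        w = project_l2_ball(w - eta * g, radius)
        iterates.append(w.copy())
    # The averaged iterate is the quantity the generalization result is stated for.
    return np.mean(iterates, axis=0)
```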
So anything between 1 and infinity, any natural number. This bound is true for all values of tau.
>>: [inaudible].
>> Alekh Agarwal: Exactly. So you can minimize the bound over all values of tau. Based on what the mixing coefficients look like, you can optimize which one gives you the best bound. The algorithm does not use the knowledge of tau.
>>: Okay.
>> Alekh Agarwal: Although, later on, if you want to optimize, for instance, the learning rate, you can incorporate tau into the learning rate to get the best possible bound; that sort of thing will come in. That's where the algorithm can exploit it.
>>: On the right side [inaudible] but sometimes [inaudible].
>> Alekh Agarwal: Well, actually, phi of tau is decreasing as tau increases, right? So the question is how small you can actually keep tau while still having phi of tau small.
>>: [inaudible] before --
>> Alekh Agarwal: And that's what we're going to see in the next results. Here I was kind of short on slide space, so I didn't actually put that in. But, yes, that's precisely what's going to happen. Okay. So with all this, I can give my first main result. We assume that we've got iterates W1 through WN from a stable online algorithm. We take their average and measure the generalization of that average under the function little f, and this is bounded by the sum of three main quantities. The first quantity is the average regret of the online algorithm. The second quantity is a deviation bound: this is going to be a probabilistic result that holds with probability one minus delta, and this term depends on how high a probability you want the result to hold with; it has to do with how far things can deviate from their expectation. Analogs of these first two terms also arise in similar results of [inaudible] for i.i.d. data. But the last term is really the main term that's new for the non-i.i.d. case, and it depends very critically on the average stability of the online algorithm -- that's why we need a stable online algorithm -- and again on the mixing rate of the data source. So we can again perform the same sanity check: we put tau equals 1 and recall that phi of 1 equals 0 for i.i.d. samples. So this term becomes 0, this term becomes 0, and we are left with just the 1 over N, log 1 over delta term here, and that's exactly the result you get for an i.i.d. setting. So it's kind of backwards compatible. But to understand this a bit better, what we would like to do is to fix a stable online algorithm, so that we have a handle on this quantity and this quantity, and maybe fix a mixing rate for the process, so that we have a handle on this quantity, so we can see a bit more concretely what the error rates look like. So to do that, we first start with fixing an online algorithm. And this applies to both online gradient descent and regularized dual averaging, with a step size proportional to 1 over square root of I at time I; both of them satisfy a regret guarantee that's order square root N and stability that's 1 over root I at time I, because the stability is just proportional to the step size in this case. So for these algorithms, with high probability, the error is essentially on the order of 1 over root N, which is the optimal error even for i.i.d. settings, except you incur a penalty based on tau. And there is this second term, root N times phi of tau.
So now you can minimize the bound over tau, and basically as long as the minimum of this is suitably small, everything's okay. So one thing that we show -- it's not completely obvious from looking at the expression, but with a few lines of work you can show it -- is that as long as phi of K goes to 0 as K goes to infinity, and it can be arbitrarily slow, this bound will actually go to 0. We get consistent learning as long as phi of K eventually goes to 0. So basically in the limit of infinite data you will actually learn the right model; that is one thing this does guarantee you. In particular, if you put yourself in a really, really nice setting where the process is converging extremely fast -- so if we have exponentially fast convergence, which, for instance, the Markov chain example that we saw had -- then the rate that you get, up to constants and up to logarithmic factors, is just the same as the i.i.d. rate. So the i.i.d. result was 1 over root N, and the non-i.i.d. result is log N over root N. So there isn't much of a penalty at all in this case, despite the fact that you were learning with non-i.i.d. data. In general, if the process mixes slowly, the exponent can be worse than 1 over root N; it can be a much slower rate as well. But it will go to 0 as long as the process eventually mixes. Okay. So I didn't have the chance to put many experimental slides in this talk. One example that I've included here is for an autoregressive process, a 25-dimensional autoregressive process. We run the online gradient descent algorithm, and this is basically the error under the function little f as a function of the number of training samples. If you are optimistic you could see this curve as being roughly like 1 over square root N; I don't know, the shape looks similar. But we are still doing a more careful investigation of the results. So at least this points to the basic [inaudible] consistency claim being correct when you're testing on future samples. Sorry, yeah, I said this was being measured under the function little f. That was wrong. This was actually the generalization error measured over a fixed test set, over a fixed number of test samples. So it is true generalization error, averaged over 20 trials. Okay. So the next thing that I want to briefly discuss is what happens when you have some more information about the structure of the loss function. As I remarked, in optimization there is this curvature assumption that we make that often helps us. What the strong convexity assumption means is: if you have a function and you take its first-order Taylor approximation -- of course the function is convex, so it lies above its first-order approximation -- strong convexity means that the gap is lower bounded by a constant times the squared distance between the two points in consideration. So your loss kind of curves up; it is not flat anywhere. And the canonical example of that is just the squared L2 norm. It has a very nice property: if you take a convex function and add a strongly convex function to it, then the result is strongly convex. That gives us a license to declare all kinds of regularized objectives strongly convex. So things like SVMs, ridge regression, regularized logistic regression, entropy-regularized methods like maxent -- they end up being strongly convex because they incorporate strongly convex regularizers.
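[Editor's note: in symbols, the strong convexity property just described is the standard definition below, with lambda denoting the curvature constant.]

```latex
% F is \lambda-strongly convex (in w, for each sample x) if it dominates its
% first-order approximation by a quadratic:
F(w'; x) \;\ge\; F(w; x) + \langle \nabla F(w; x),\, w' - w \rangle
          + \frac{\lambda}{2}\, \|w' - w\|_2^2
          \qquad \text{for all } w,\, w'.
```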
And we would like to ask whether the strong convexity of the regularizers in these problems helps the learning process somehow. Okay. So the exact assumption we're going to make is that the function F is strongly convex for every sample point X. And the reason we want to make that assumption is that under it, algorithms like online gradient descent and regularized dual averaging, with a slight modification, obtain a regret that scales as log N instead of the square root N regret that we saw for the convex case. So the regret is much, much smaller as a function of N. And again this directly translates to learning guarantees. We see now that our learning error is of order 1 over N, again times factors that depend on tau and on the mixing rate phi of tau. So the first thing is that this is faster than the 1 over root N bound that we get for convex losses. Setting tau equals 1 and phi of 1 equals 0 recovers the i.i.d. result for strongly convex losses that was proven by [inaudible]. And again you can do the same thing: you can look at what happens for geometric mixing, and the result is almost as good as i.i.d. Even though, given the fact that the regret is small, this seems like it has to be a natural thing, at a technical level it takes a fair bit of work. Even though conceptually it seems like a very obvious thing that has to happen, technically it's a bit more involved than it seems at first sight. The one main interesting application of these results, which I will not have much time to describe because it gets even more involved, is for the pretty common case of linear prediction that arises in machine learning problems. What I mean by linear prediction is that you have some covariate X, you have a linear predictor W, and you apply a loss function to the prediction W transpose X. This captures things like logistic regression, boosting, least-squares regression and so on. You can have a label Y as well; that's perfectly fine. But as long as there's this linear structure -- and this is a fairly common structure in statistics and learning -- the problem is that with this structure, the sample function F of (W, X) can never be strongly convex. Because there is only one direction, the direction of the sample X, in which it has any non-degeneracy; in all other directions you can move the parameter and the function is just flat, because the direction of the sample X is the only direction in which the loss function is acting. So here you can never have strong convexity, even though, quite frustratingly, the loss function L itself is often strongly convex in the scalar value on which it acts; it's just not strongly convex in the vector W that we are interested in. But at the same time, people in statistics have actually used this observation to show faster rates of learning with i.i.d. data for several of these problems. And we wanted to see if these gains do actually translate to non-i.i.d. scenarios as well. So the good message is yes, they do. The bad message is that they don't quite do that for just our canonical off-the-shelf algorithms that the previous results apply to, because those algorithms are not stable on these problems in the right way. If you directly try to apply those algorithms with the parameter settings that we want, they're not stable anymore. So you have to tweak the algorithms a bit.
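[Editor's note: to see the point about linear prediction concretely, here is the standard reasoning, written out by the editor rather than quoted from the paper: with a loss of the form F(w; (x, y)) = l(w^T x, y), the curvature of F in w lives only along the direction x.]

```latex
% For a linear-prediction loss F(w; (x,y)) = \ell(w^\top x,\, y), the Hessian in w is
\nabla^2_w F(w; (x,y)) \;=\; \ell''(w^\top x,\, y)\; x x^\top ,
% a rank-one matrix: it is positive only along the direction of x and vanishes on the
% orthogonal complement, so F(\cdot\,; (x,y)) cannot be strongly convex in w, even
% when the scalar loss \ell is strongly convex in its first argument.
```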
It's a minor correction, so implementation-wise it doesn't change much, but it gives us this upshot of fast learning. The details of this had to be left to the paper, because introducing the algorithm would have taken a while. Okay. So in the remaining time I want to talk just a tiny bit about this connection between learning and optimization, and what it allows us to conclude for the broader setup of stochastic optimization. So what we've seen so far could be alternatively described as: we observe a sequence of random functions indexed by samples X1 through XN. These samples are non-i.i.d. Each function is unbiased in a certain sense; that is, if instead of taking the function based on sample X_i I were to take a sample drawn according to the stationary distribution, then it would give me the right function. So it's roughly unbiased for this. Of course, you're not sampling from pi, you're sampling from the dependent process, so it's unbiased in quotes, I really should have said. The results that we proved were on the quantity little f of W hat N -- the value of the function at my predicted point -- minus the minimum value of the function. So we are trying to minimize the expected function little f, and we are trying to do that with these non-i.i.d. random realizations of the function, essentially. And this is exactly the setup of what's called in optimization the problem of stochastic approximation, except typically people study the problem with i.i.d. data samples and we are studying the problem now with non-i.i.d. data samples. Again, we're trying to minimize this function little f, and we are receiving noisy estimates that are going to kind of converge to the function, because our process P_t is converging to pi. So this sequence is eventually going to start looking like the function little f, right? And our results can then be interpreted as showing convergence of stable online algorithms for stochastic optimization with non-i.i.d. noise in the gradients. So, for instance, typically when you think of running stochastic gradient descent, at every step we draw a sample uniformly at random, maybe, and then evaluate the gradient on it. Now we don't have to necessarily draw it uniformly at random. The sampling at time T could be dependent on what you did at time T minus 1, and that would be fine, as long as the sampling procedure mixes. Okay. So a very interesting application of this arises in distributed optimization. This is an algorithm called Markov incremental gradient descent by Mikael Johansson and co-authors, and it's a very simple idea. So the idea is you've got a distributed network of nodes, and each node has an associated convex function F_i. What you are trying to do is minimize the average of these functions F_i. And this is how we're going to do it. We're going to again have a token randomly walking on this graph. So you run the random walk, and at each time your token is somewhere. You take the local function of the node that your random walk is at, you evaluate the gradient of that particular function, and you update your weight vector.
You've got now a weight vector; basically wherever the token goes, it takes the current point W_T along, and then you take a step based on the gradient of the local function. Then the updated W_T+1 goes to the next node with the random walk, you take another gradient step, and again the parameter goes to the next node, and you take a gradient step based on the local function of that node. This is a completely local operation, because when you're at a node you can evaluate the gradient of that node's local function. And, for instance, as an application you might think of this as a sensor network, where each sensor is measuring some local quantity and can evaluate the gradient of the local quantity, but cannot evaluate the gradient of a global quantity. Or -- this is not something I would apply to a cluster setting, but the way you could think of it is that this network is your cluster, and at each step only one computer in the cluster is doing the processing, and that computer has its own local dataset; it evaluates the stochastic gradient based on that local dataset. So those are the kinds of applications for this algorithm. And now in this case P_t is just the distribution of the Markov chain after T steps. If you design the Markov chain correctly -- this is not going to be the simple random walk, but you can define a Markov chain on the graph so that it converges to the uniform distribution over nodes -- then the expectation under the stationary distribution is exactly going to be the average of the local functions. That is the quantity that we are trying to minimize. And because Markov incremental gradient descent uses the online gradient algorithm as its backbone, it's a stable algorithm at the backbone. So our results directly imply that the stationary function little f, which is just the average of the local functions now, gets optimized by taking the average of all the iterates we've generated, and in particular we show that the error goes to 0 as 1 over square root T if these F's are convex. And the mixing rate now depends on the mixing of the Markov chain on this graph; 1 minus sigma 2 of P is going to be the spectral gap of the random walk. And this is basically the most low-powered distributed optimization algorithm that you can think of, because it communicates very little. It's kind of ideal for things like wireless sensor networks. And just by a simple application of our theory you are able to derive a sharper rate than what was known before for this algorithm. And the graph-structure-based dependence is actually reflected if you run this algorithm. So if we run the algorithm on different kinds of graphs, then the spectral gap depends on the graph topology. In particular, on expander graphs random walks mix really fast, and on cycles random walks mix really slowly. So that's reflected here: for the cycle, the error goes to 0 really slowly; for the expander it goes to 0 really fast. And, again, we are working on more simulations here, trying to get a more quantitative experimental verification of the exact scaling with the number of nodes that we predict. But that's again something we're still verifying. Okay.
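[Editor's note: a minimal sketch of the Markov incremental (sub)gradient idea just described, for illustration only; the graph, step size, and local losses are placeholders, and the real algorithm uses a Markov chain designed to have a uniform stationary distribution rather than the plain random walk shown here.]

```python
import numpy as np

def markov_incremental_gd(local_grads, neighbors, n_steps, dim, eta0=1.0, rng=None):
    """Token-passing incremental gradient descent on a network.

    local_grads: list of callables, local_grads[v](w) = gradient of node v's
        convex function F_v at w.
    neighbors: list of lists, neighbors[v] = nodes adjacent to v; the token
        moves as a random walk over this graph.
    The averaged iterate approximately minimizes the average of the F_v's,
    at a rate governed by how fast the walk mixes (its spectral gap).
    """
    rng = np.random.default_rng(rng)
    w = np.zeros(dim)
    running_sum = np.zeros(dim)
    v = rng.integers(len(neighbors))          # token starts at a random node
    for t in range(1, n_steps + 1):
        g = local_grads[v](w)                 # gradient of the local function only
        w = w - (eta0 / np.sqrt(t)) * g       # local gradient step
        running_sum += w
        v = rng.choice(neighbors[v])          # token moves to a random neighbor
    return running_sum / n_steps              # averaged iterate
```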
So to wrap up, what we saw today was how a [inaudible] of stable algorithms have the nice property that they generalize well to future test samples, even when your training samples are drawn in a non-i.i.d. fashion. Our analysis applies to a large class of mixing processes. And one nice aspect is that if the mixing is fast enough, then we get almost the same rates as i.i.d. learning, and we're also able to show better rates of convergence when the losses have better structure, such as strong convexity. There are several extensions that we discuss in our paper. First of all, in the paper we also provide results under weaker mixing assumptions; phi-mixing is often considered the strongest mixing condition you can assume, and we give all sorts of results under weaker mixing conditions. We also give other applications in the paper. One is basically: if you're trying to optimize a ranking function and your observations are just partial orders over subsets of samples, then how do you solve optimization problems of this nature? The problem here is that sampling permutations, rankings, is hard if you want to do stochastic optimization, but it's easy to write down Markov chains that sample from such combinatorial spaces and converge to some nice stationary distribution. The idea in fact extends more generally to optimization from Markov chain Monte Carlo samples over combinatorial spaces, and applies to problems like optimizing the parameters of a queueing system, or something like a routing network where your loads might be governed by some sort of queueing process. And coming back to this application to pseudorandom number generators, this is just something I randomly came across while looking at the literature, and what they were talking about is that, look, we often run simulations on our computers with these pseudorandom number generators and treat them as a sequence of i.i.d. random numbers; of course pseudorandom numbers are not i.i.d., they have some sort of dependence, because they're being generated starting from some random seed, so they have a dependence on each other. And people have actually thought about this sort of stuff carefully, and they have shown that even though they are non-i.i.d., the process actually mixes to stationarity. So the next time you're testing your new stochastic optimization algorithm, don't worry about the fact that pseudorandom numbers are not i.i.d.; things are going to be fine. So that's all I have to say. Papers that contain the results are available on arXiv, and I'm around for the rest of the day if you want to talk more. Thanks. [applause]