>> Lin Xiao: Today it's a pleasure to have Joseph Bradley come here. He is finishing a PhD at CMU in the Machine Learning Department. And he's going to talk about optimization trade-offs for scalable machine learning. >> Joseph Bradley: Thank you. Right. As he said, I'll be talking about how to trade off different quantities in order to help scale various machine learning problems, and so I'll start out just by talking about what those quantities are, at a very [inaudible] level view of machine learning where we'd like to get data, train a model to fit the data using some sort of optimization, and then test the model on new data. And the worries I'm going to be thinking about are: what if we have big data, a lot of it; a complex model; structured optimization. All of these difficulties are going to end up feeding into issues such as sample complexity, how many training examples we need to learn the model; computation in the optimization during learning; and, in the modern world, how to take advantage of things like parallelism to help with that computation. These three quantities are going to, of course, feed into the eventual accuracy of our model on the new data, but it's these three quantities which I'm going to be talking about. So in order to improve scalability, you can imagine improving each of these individually. For example, develop methods with better learning bounds, which improve the sample complexity; look at computation by developing faster optimization methods in the sequential setting; or, say, do parallel implementations of existing algorithms. What I'd like to think about instead is how to trade these off, where, for example, a small sacrifice in computation might lead to a big gain in parallelism. In the first part of the talk, I'll talk about a method which does this trade-off, and in the second part of the talk, I'll talk about a setting where we can trade off all three of these aspects. So our general approach to scaling is going to be to take some complex problem and decompose it into subproblems which are simpler to solve in and of themselves. Then, through analysis, we'll look at different trade-offs in this decomposition, and the analysis will actually help guide how we do that decomposition in order to optimize those trade-offs. And interestingly, we'll be able to talk about data- and model-specific ways of doing that optimal trade-off. So in this talk I'll start out talking about parallel regression, where I'll talk about a parallel coordinate descent method with which we can trade off computation and parallelism. In the second part, I'll talk about parameter learning for graphical models, where we look at a method which can trade off all three of these aspects. So first, looking at parallel regression. In the regression problem we want to predict essentially one label, or a small set of labels, given a large number of features. For example, in one data set we look at, the label is a measure of stock volatility, which it turns out you can predict somewhat, unlike the direction of the movement in stocks. And we're going to be predicting it from a large number of features which are small pieces of text from financial reports.
The sparsity part, of course, is that we want to explain that label using a very small number of features, and of course this is very useful in high-dimensional settings where the number of features is a lot larger than the number of training examples. So in this part, the problems we specifically look at are Lasso, with the least squares loss, and sparse logistic regression; in general, our analysis is applicable to generalized linear models. Now in the sequential setting, there is a whole lot of algorithms which can be used for sparse regression, like gradient descent, stochastic gradient, interior point methods, different thresholding methods. One which caught our eye was coordinate descent, also known as Shooting. And it caught our eye because it's been shown to be very fast; there have been a number of theoretical and empirical studies explaining this and showing it on many problems. But for big problems, where you have millions of features or hundreds of thousands of examples, even this fast sequential method may not be ideal. And so what we'd like to do is take advantage of parallelism. Here I'll be talking about the multicore setting, where there's shared memory and low latency; in ongoing work we're looking at the distributed setting. But in the multicore setting, we can think of parallelizing a number of aspects of the problem for parallel regression. First, matrix-vector operations: many methods, such as interior point methods, spend a lot of their time doing such operations, and we could think of using existing linear algebra libraries to do that. However, we found this did not work the best empirically, and I think it was largely because the methods which could benefit most from parallel matrix-vector operations were not actually the fastest methods for this particular problem. We can think of parallelizing over examples, such as with this stochastic gradient method, which has some parallel analysis; but one could argue that using stochastic gradient methods tends to be best when you have a large number of examples, not a large number of features, which is the setting we were looking at. And then finally, parallelizing over features, for example taking Shooting, or coordinate descent, and parallelizing that. And we asked the question, which I'll explain in a minute, of whether that should be inherently sequential. But it turns out that it's not, and I'll explain why. So what I'll talk about is a method called Shotgun, which is parallel coordinate descent for sparse regression. I'll first show a convergence analysis which predicts that you get essentially linear speedups up to a problem-dependent limit, and then show a big empirical study which shows that Shotgun is quite successful in practice. So just looking at a little background, our problem is going to be to minimize the convex objective F of W, where W is this weight vector, and F of W would be the loss and the regularization for whatever Lasso or logistic regression problem we are looking at. For Shooting in the sequential setting, the basic algorithm I'm looking at is a stochastic coordinate descent method, which says: while you're not converged, pick a random direction or coordinate J and update the weight for J, via sometimes a closed-form minimization, sometimes a line search. So if this were the contour map, where gray is better, and we start at some point, this method would optimize in some direction, pick another direction, optimize in that, and eventually get to the optimum.
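(To make that per-coordinate update concrete, here is a minimal sketch of the Shooting update for the Lasso case, F(w) = 1/2 ||Xw - y||^2 + lambda ||w||_1, where the coordinate minimization has a closed form via soft-thresholding. This is an illustrative NumPy sketch under those assumptions, not the speaker's implementation.)

```python
import numpy as np

def soft_threshold(c, lam):
    # Closed-form solution of the one-dimensional Lasso subproblem.
    return np.sign(c) * max(abs(c) - lam, 0.0)

def shooting(X, y, lam, n_iters=10000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for
    min_w 0.5 * ||X w - y||^2 + lam * ||w||_1.
    Assumes no column of X is all zeros."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)          # ||X_j||^2 for each column j
    resid = y - X @ w                      # running residual y - Xw
    for _ in range(n_iters):
        j = rng.integers(d)                # pick a random coordinate
        # Correlation of column j with the residual, adding w_j "back in"
        c = X[:, j] @ resid + w[j] * col_sq[j]
        w_new = soft_threshold(c, lam) / col_sq[j]
        resid += X[:, j] * (w[j] - w_new)  # keep the residual consistent
        w[j] = w_new
    return w
```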
So in the parallel setting, what I'm looking at is a very naive parallelization where, rather than updating a single direction at once, we'll update, on each of P processors, P different directions, just pretending that we are holding the other coordinates fixed. So for example, in this setting, if we start here, pick two random directions and compute the minimizations in those, and then add those updates together, we will get right to the optimum. This is a very nice setting, where we have uncorrelated features and the parallel updates are not going to conflict. Now in a bad case with extremely correlated features, if we compute minimizations in both directions independently and then add those updates together, of course we might diverge. And so you might ask: is coordinate descent inherently sequential? Well, here's why it's not, and that's our Shotgun convergence theorem. It's essentially stating that if we limit the number of parallel updates, where I'll talk about the limit on P in just a moment, then we get this bound, which states that the distance from the optimum, where W_T is the weight vector at T iterations and W* is the optimum, will be upper bounded by this quantity on the right, where in the numerator we have quantities such as D, the number of features, W*, the optimal weight vector, and W_0, where we began, all divided by the number of iterations T times P, the number of parallel updates. And so what this is essentially saying is that we are getting linear speedups, since this generalizes the bound from Shooting for the sequential setting. So we have this bound, which states that if we limit the number of parallel updates, we should get essentially linear speedups. This limit is going to be essentially D over rho, where D is the dimensionality, or number of features, in our problem, and rho is the spectral radius of the normalized Gram matrix, X-transpose X, where X is the data matrix of examples by features. So, intuitively, rho measures the correlation between features, and with proper normalization of the data matrix, rho will be between one and the number of features. In the ideal case with uncorrelated features, rho will be one, which means our theory would predict we can update all of the coordinates at once, which is what you'd expect. In a very bad case with exactly correlated features, rho will be D, which means that we can only update a single coordinate at once. Yes. >>: If you go back to your theorem, what happens if you subtract 1 trillion off of F? >> Joseph Bradley: If you subtract 1 trillion off of F? >>: It should be the same minimum; it should be [inaudible]? But the right-hand side becomes like huge negative and so I don't know how to interpret it. >> Joseph Bradley: I see. So if you subtract 1 trillion off of F, so the losses we were looking at- >>: [inaudible]? >> Joseph Bradley: I guess, right. All the losses we were looking at were non-negative. And that should be, I'm thinking of where that would appear. I think that was implicit in our proof and, right, it's not stated here: they have to be non-negative. >>: [inaudible] divide F by a trillion and make that term go away. >> Joseph Bradley: So if you divide F by 1 trillion, let's see, then- >>: Then it just scales, it's relative. >> Joseph Bradley: Okay. Yes. >>: Do you [inaudible] assumption [inaudible] F, because you scale like one over T and not like one over square root of T? >> Joseph Bradley: Right.
So in terms of the types of assumptions we are making: we're assuming that we can upper bound the change in F with essentially a second-order Taylor expansion, and the matrix which upper bounds the second-order term is what actually appears as the Gram matrix in the next part, defining rho; so the limit on rho is really a limit on the smoothness. And I think at the extreme values of rho, the spectral radius would essentially measure that smoothness, as I understand it. >>: You're saying like if you have like rho for D [inaudible]? >> Joseph Bradley: It would still exist. So in terms of smoothness assumptions, I guess what we are- >>: Even if rho equals D, [inaudible] and choose P equals one- >> Joseph Bradley: Right. >>: Still, you know that there are some problems for which you cannot do any better than [inaudible] square root T. >> Joseph Bradley: So I think it's really, as I was saying before about being able to upper bound the change: there's an assumption that we can upper bound the change in the objective with a second-order Taylor expansion, and the assumption that we can upper bound it with that is what's encoding smoothness. Right. I think for some problems, certainly, you cannot do that. >>: [inaudible]? >> Joseph Bradley: Right. >>: [inaudible] I think it still is strange to me, because you can multiply F by a huge number, then you basically get rid of your W* norm term. >> Joseph Bradley: So, let's see. >>: [inaudible] dimensional [inaudible]. >> Joseph Bradley: Right. >>: It just feels kind of weird that [inaudible]. >> Joseph Bradley: Right. >>: [inaudible] the proofs, so that everything [inaudible]? >> Joseph Bradley: I unfortunately do not. But I think- >>: Is there an assumption that F is like 0, 1 or normalized or something? >> Joseph Bradley: Right. This theorem, I think where it would appear perhaps, is that, actually hidden here, this is really the bound for the case of Lasso. For general models, there should be another term, which I think in our paper is called beta; basically it is a constant which appears in a second-order Taylor-expansion-like bound on the change in the objective, and if you multiplied the objective by a huge number, then I think that would essentially appear as a beta multiplied by that W* term. I'd have to check, but I'm pretty sure that constant, which is loss dependent, is multiplied by the W* term. For example, for Lasso it's like one; for logistic regression, I think it's four, something like that. So I just kind of hid it here. But you're right, I think if you multiplied by a huge constant then it would appear there. I think that's the answer. Okay. Right. So given that we would expect essentially linear speedups up to some limit, we can see how it actually looks in practice. If we look at an example data set, where here I'm plotting on the x-axis the number of very carefully simulated parallel updates and on the y-axis the number of iterations, where both axes have log scales: if we do a single update per iteration, we require almost 10,000 iterations to converge; extrapolating linear speedups, we'd expect to line up on this line; and our theory says that we should be able to do about 79 parallel updates before we start risking divergence.
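(As a rough illustration of how this limit might be computed and used, here is a hedged sketch; the helper names are hypothetical, and the per-coordinate loop is written serially for clarity, though each minimization is independent of the others and would run on its own core in Shotgun.)

```python
import numpy as np

def max_parallel_updates(X):
    """The theory's cap: P is about d / rho, where rho is the spectral radius
    of the normalized Gram matrix X^T X (unit-norm columns), so rho in [1, d]."""
    Xn = X / np.linalg.norm(X, axis=0)       # normalize columns; assumes none are all-zero
    rho = np.linalg.eigvalsh(Xn.T @ Xn)[-1]  # largest eigenvalue of the Gram matrix
    return max(1, int(X.shape[1] / rho))

def shotgun_round(X, y, w, lam, P, rng):
    """One Shotgun round for the Lasso: P coordinate minimizations, all
    computed from the same starting w (pretending the other coordinates are
    fixed), then applied together."""
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ w                        # residual at the shared starting point
    for j in rng.choice(X.shape[1], size=P, replace=False):
        c = X[:, j] @ resid + w[j] * col_sq[j]
        w[j] = np.sign(c) * max(abs(c) - lam, 0.0) / col_sq[j]
    return w
```

(A driver would call max_parallel_updates once and then repeat shotgun_round until the objective stops improving; with uncorrelated features rho is near one and nearly all coordinates can be updated at once, while with highly correlated features the cap shrinks toward a single update at a time, matching the theorem.)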
And if you actually run this in practice, then you see essentially that: approximately linear speedups, and then after the end of this plot we start hitting divergence. And you see similar behavior on other problems. So the experiments seem to match our theory, which is what we'd hoped. Thus far I've talked about Shotgun as this naive parallelization of coordinate descent, showing that linear speedups up to a problem-dependent limit actually seem to happen in practice. But now I'll quickly show you some results from larger experiments. First, looking at logistic regression, where our goal is to predict the probability of a discrete label, we compared a small number of algorithms here, since there had been a big empirical study before basically showing that Shooting, the sequential version of our algorithm, is extremely competitive. But we did take time to compare Shotgun with this parallel stochastic gradient method, since it was one of the few other parallel methods which we had not seen tested on these problems. The stochastic gradient was just a simple implementation where we estimate the gradient with a single training example, and of course it is considered to be very scalable. And we are running on an eight-core AMD Opteron, 2.69 GHz. So just quickly showing an example of our results: this is actually in a high-sample setting; the high-dimensional setting made us look even better. But in this setting, we had half a million examples and 2,000 features; on the x-axis is time, and on the y-axis the objective, where lower is better. And so if we look at Shooting, the sequential coordinate descent does seem to do reasonably well. Stochastic gradient, as you might expect, is faster at first, slower later on. Parallelizing it helps a little bit using this method. But parallelizing coordinate descent helps enormously, and it looks like Shotgun is the fastest. So basically, parallelizing over coordinates seems to give big speedups, and we saw even more extreme differences in behavior in the high-dimensional setting. >>: Do you know what the, it's convex, so if you run some sort of thing like [inaudible], do you know what the true objective minimum is? >> Joseph Bradley: Right. So for all of these, we essentially ran them for an extremely long time and then recorded that essential optimum, and then we would do these experiments, running these until they got within some percent. So we actually, I think, tried to compute the optimum using Shooting, that was what we did initially, but then if another method reached a lower objective value, we would record that as the optimum. So- >>: So 150,000 is really the optimum here? >> Joseph Bradley: Or a little bit below that. Right. We ran some of these for, right, an extremely long time until- >>: [inaudible] from the challenge or is this your only [inaudible]? >> Joseph Bradley: It's our own implementation, right. And this is actually a really simple SGD implementation. We also tried ones specifically tailored for the L1 setting, but those, in terms of objective value, were much slower even though they got sparser answers, just because of the issues with doing stochastic gradient with L1; it ended up making them less competitive on these sorts of plots than our implementation. >>: [inaudible] Pascal challenge data, like results [inaudible] optimization challenge? >> Joseph Bradley: Right. So that's a good question, and I'm not sure what their curves would look like relative to this.
That would be good to check. Yes. >>: So the [inaudible] averages the eight cores at the end of the [inaudible] right? >> Joseph Bradley: It does. >>: So how do you guys track the progress; how do you terminate each of the cores at convergence? >> Joseph Bradley: Oh. So the idea is, at each iteration, suppose we stopped there: compute that average, and that's one point in this plot. So it's essentially, I mean, this plot is saying, suppose we stopped at this moment, then that's what the parallel SGD would look like. >>: [inaudible] points? >> Joseph Bradley: That's right. >>: So it looks like Shotgun is maybe three times faster than Shooting or so. >> Joseph Bradley: I'll show you the actual speedup plots in a couple slides. Yeah. >>: [inaudible]. >> Joseph Bradley: So then, looking at Lasso quickly. The goal is to predict a real-valued label, and here we hadn't found as many big empirical studies, so we tested a lot more types of algorithms, as well as a large number of data sets, with sizes varying from hundreds to millions of features and examples. I don't have time to go into too much detail with these, but I'll show you two of the four classes of data sets we looked at. The important thing to note is what's circled in blue, which is the average predicted number of maximum parallel updates for the data sets shown in each of these boxes; basically, note that a very large number of parallel updates could potentially have been done. In each of these plots, the x-axis is Shotgun's runtime, the y-axis the other algorithm's runtime, and if something's above that diagonal line, then that means Shotgun was faster. So this point is saying that on this particular data set, Shotgun took 1.2 seconds and the sequential version 3.4. Just quickly plotting different methods up here, we have Shooting, the L1_LS interior point method, which used parallel matrix-vector operations, and then a number of other methods. And most of the dots are above the dotted line, so Shotgun did reasonably well. And then on this larger, sparser data set collection, a lot of the methods actually weren't even able to finish in a reasonable amount of time, and so Shotgun looks quite good. So essentially, Shooting seems to be one of the fastest algorithms, and Shotgun provides additional speedups. However, speaking of speedups, if we look at the actual curves, with the x-axis the number of cores, which is also the number of parallel updates we are doing, and the y-axis the speedup we got: these are aggregated results from all of our tests, and of course the diagonal would be optimal. So yeah, if you look at the wall-clock time speedup for Lasso, on average it doesn't look that great. And it varied a lot: sometimes there would be no speedup, sometimes almost optimal, but the average was there. But if you look at the number of iterations we're doing, it's almost optimal. So it is decreasing the number of iterations. What we believe we were hitting is this memory wall, where essentially the memory bus is getting flooded. And I think a reasonable explanation for this is that Lasso's updates are very cheap, very low computation per datum loaded, and so it's rather difficult to hide things like memory latency. One thing possibly supporting this is that if you look at the logistic regression time speedup, it's significantly better, although still not optimal. And logistic regression uses more floating point operations per datum loaded, and we believe that helps hide the memory latency.
So you had a question. >>: That was my question. >> Joseph Bradley: Right. So yeah, I think that definitely points to the need for, perhaps, more hardware-specific optimizations we could make, trying this on other hardware, testing it on other types of losses which might help hide that latency, right. There's a lot to it. >>: Just so I understand the setting: when you're measuring speedup, you're holding the problem size constant and just increasing the number of cores, or are you also scaling the problem size [inaudible]? >> Joseph Bradley: So this is, each point is the average over data sets of the speedup for that number of cores. So it's, right, you're right, I guess that means that everything is being held constant as we go across. That would definitely be interesting. Yeah. The variance in the actual wall-clock time speedups was pretty big. And it would be nice to be able to say a bit more about what types of data sets [inaudible]. >>: What's the variance of the red line? >> Joseph Bradley: I don't know the number. I know the ranges, which were essentially from one to eight. >>: If it's one to eight, then it can be [inaudible] one time [inaudible] that fast [inaudible]? >> Joseph Bradley: So I think it depends on, I'm not sure, but I would think it would depend on the dimensionality and the sparsity of the data. And I'm not sure. Right. >>: Did you store your data in columns or in rows? Because your- >> Joseph Bradley: You want to access it by column. So that was how we did it, for sure. >>: [inaudible]? But they still are [inaudible] different columns, and so the memory subsystem has to sort of stream through a bunch of stuff, but it's jumping. >> Joseph Bradley: Well, sort of. In practice we didn't choose completely randomly; we did a random permutation and then went through it. And there were things which we later tried which did help a bit, like trying to carefully sort the columns and choose which ones we handled at which time. Right. So it wasn't completely random, but yeah, there were definitely issues with locality. >>: And this is an eight-core standard [inaudible]. It's just one [inaudible]. >> Joseph Bradley: Right. >>: Do you need to see a whole column [inaudible]? >> Joseph Bradley: The whole column. >>: Do you need to see [inaudible] everything is there you might not have to [inaudible]? >> Joseph Bradley: I think that's definitely a question which interests me: whether you can mix the ideas of stochastic coordinate descent with stochastic gradient. It's not something that we have experimented with, really. But it would definitely be interesting. Yeah. >>: So I guess both for distributed or for like a NUMA system, it'd be interesting if each processor was choosing a column from a prearranged subset of columns. Have you done any analysis to see whether it would be just as effective if each- >> Joseph Bradley: We've done that sort of analysis to try to deal with, I guess you'd call it the statistical issue of conflicting updates, where we tried to sort columns based on the correlation between them. We haven't, I guess, done it as much for the more systems side. In terms of sorting by correlation, there can be some benefit to doing it. There's some recent work by Chad Sharer[phonetic], I believe, which looked at sort of extending this idea and combining greedy coordinate descent with stochastic coordinate descent, where they did some smart sorting of columns.
>>: You mentioned that, I just forgot: what's the dimensionality and the number of [inaudible]? >> Joseph Bradley: So it varied a lot, from hundreds to hundreds of thousands or millions. >>: I don't think the speedup would highly depend on whether it's in the cache or not, these kinds of things, so- >> Joseph Bradley: That is definitely- >>: Your method would [inaudible] data sets where, you know, a single box implementation would not be viable [inaudible] and then you- >> Joseph Bradley: As far as single box implementations, I mean, that's something we're thinking about now. This was really targeted at multicore. I think in the distributed setting we are having to think about a pretty different approach. >>: You did the experiment with different CPUs? >> Joseph Bradley: We tried a little bit. There is a 16-core [inaudible] machine which we were testing with. We saw, it was a bit better; I think it had, right, fewer issues with cache. >>: I think you have a lot of [inaudible] about [inaudible], so cache, and the structure of cache and- >> Joseph Bradley: I mean, definitely on a more souped-up machine, right, it would help. Yeah, I think it would be interesting to look at other types of hardware. >>: [inaudible] really [inaudible] computation [inaudible]. I think that's what you're trying [inaudible]. >> Joseph Bradley: Right. Definitely with Lasso. Okay. So just to sum up this part quickly. Looking at parallel regression, we talked about Shotgun, this parallel coordinate descent; gave an analysis showing essentially linear speedups; and of course in our experiments we did not get the ideal speedups, but especially since the sequential version was one of the best methods for these problems, speeding it up a bit with parallelism made Shotgun one of the best methods. So, going back to the themes I talked about at the beginning: what we did was decompose this computation based on different coordinate updates. And we saw that, basically, even though these coordinate updates would conflict and cause a little bit of divergence, making us do a bit of extra computation, we ended up getting a big gain in parallelism. And we could optimize this trade-off by choosing the number of parallel updates based on the amount of correlation in our data. So that's a pretty simple example of these sorts of trade-offs. What I'd like to do, if I have time, is talk about parameter learning for graphical models, which is a case where we can actually trade off these things in much more complex but beneficial ways. So in graphical models, a motivating example might be: say you want to model user interests or behavior in a social network, and to do this you might want to model a probability distribution over a bunch of random variables X, where each X, say X1, models a particular user, user one. So given a model for the probability distribution over all these random variables, you could ask queries, like the probability of some set of variables given another, which could translate to predicting some users' interests given what you know about others.
So the general framework I'll talk about is Markov random fields, or MRFs, where, of course, an edge in this graphical structure will indicate a direct dependence between two variables. Filling out this graph gives this structure, which essentially encodes our statistical assumptions, and then we'll factorize this model by writing it as a product of factors, which I'll write as psi, and each of these will be a function over a small set of random variables, which will correspond to edges, in this case, or perhaps hyperedges, in our graphical model. And so, if we fill out the rest of the factors, we have a fully specified probability distribution; and even though I'll talk about MRFs, all of these results generalize to conditional random fields as well. So the setup, of course, is a very principled statistical and computational framework; there's been a whole lot of work, including a lot from here, and many applications of graphical models showing they are quite useful. In this, I'm going to be talking about the parameter learning problem, where, given the structure and data sampled from P*, which is the target distribution, we want to learn parameters, which are the values of these actual factors. The traditional method is maximum likelihood estimation, or MLE: we want to maximize, with respect to our parameters, the expectation over our data of the log probability of seeing each data point. And of course, MLE is, in a sense, a gold standard in that it's optimally statistically efficient, and in the infinite sample limit no method is really going to be better than it. But the problem is that in this loss you have this probability over the full distribution, and computing this probability is difficult because of this proportionality constant. Estimating that proportionality constant requires inference over the entire model, all of X, and this has been shown to be provably hard in general, although of course it is tractable in some cases, such as if the graphical structure is a tree. So given that inference is hard, the question is: can we learn without intractable inference? In parameter learning, there have been a bunch of works, often using approximate inference or approximate objectives, but the problem is that most of these works lack really strong guarantees for general types of models, especially if the model is not a tree or the like. Our solution is going to be to use MLE as a baseline, which, as we stated, has optimal sample complexity but requires a lot of computation and, as we'll see, is difficult to parallelize. I'll then talk about pseudo-likelihood, which is a method which essentially breaks the problem into a separate optimization for each variable in your model, and, as we'll see, it has higher sample complexity but much lower computational complexity and easy parallelization. And actually, our analysis gives the first finite sample complexity bounds for general models. I'll then talk about composite likelihood, which is a method that ranges between MLE and pseudo-likelihood, allowing you essentially to choose substructures in this graphical model to have a more structured estimator for the parameters. What that will let us do is, in many situations, choose this estimator structure in order to optimize these trade-offs and get the best of sample complexity, computation, and parallelism for many problems.
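(To make the tractability point concrete, here is a minimal sketch of the pseudo-likelihood objective for one simple special case, a pairwise binary MRF with Ising-style factors; the talk's results cover general log-linear MRFs and CRFs, so treat this as an illustrative assumption rather than the paper's formulation. The key line is the local normalizer, which sums over just the two states of one variable instead of over all of X.)

```python
import numpy as np

def pseudo_log_likelihood(theta, edges, samples):
    """Average pseudo-log-likelihood for a pairwise binary MRF:
    P(x) proportional to exp(sum over edges (i,j) of theta_k * x_i * x_j),
    with each x_i in {-1, +1}. Each term conditions one variable on its
    neighbors, so normalization is a sum over two states only."""
    d = samples.shape[1]
    nbrs = {i: [] for i in range(d)}        # variable -> list of (edge index, neighbor)
    for k, (i, j) in enumerate(edges):
        nbrs[i].append((k, j))
        nbrs[j].append((k, i))
    total = 0.0
    for x in samples:
        for i in range(d):
            s = sum(theta[k] * x[j] for k, j in nbrs[i])  # local field at i
            # log P(x_i | neighbors) = x_i * s - log(exp(s) + exp(-s))
            total += x[i] * s - np.logaddexp(s, -s)
    return total / len(samples)
```

(Maximizing this in theta with a generic convex optimizer would be the joint-optimization variant described next; running one independent regression per variable and averaging duplicate factor estimates in log space would be the disjoint variant.)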
So, first to motivate the idea of pseudo-likelihood: MLE is essentially estimating this entire distribution, P over X, at once. What pseudo-likelihood is going to do is start by observing that if you take the statistical assumptions encoded in the graph, they essentially say, for example, that the probability of one variable, X1, given the rest of the graph, which is just its neighbors, is proportional to the single factor psi 1,2. So you could imagine doing regression, which we talked about in the first part of the talk, in order to get an estimate of this one factor. You can imagine doing this for every variable in your model, for example the probability of X2 given its neighbors, and estimating that via regression would get estimates of these other factors. And you can note that we run into some issues with multiple estimates of these factors; but you can actually show that you can average these in log space and still get good guarantees. So what I'll call this is pseudo-likelihood with disjoint optimization, where you regress each variable on its neighbors; that gives you factor estimates, and then if you have duplicates, you average them together in log space. I'll also talk about pseudo-likelihood with joint optimization, which is essentially the same problem, but you do parameter sharing when doing the optimization; this is actually how pseudo-likelihood was originally presented. The key is that this formulation allows tractable inference, where for each of these subproblems we have the probability of a single variable given its neighbors; and in order to compute this probability under our model, we just have to compute this proportionality constant by summing out a single random variable, in this case X1. So, to compare MLE with pseudo-likelihood: MLE estimates the full model at once, pseudo-likelihood essentially regresses each variable on its neighbors. MLE is going to be intractable in general; pseudo-likelihood will be tractable. MLE has been shown to be optimally statistically efficient, but people have shown that pseudo-likelihood is often empirically successful. MLE has had finite sample complexity bounds, meaning people know its behavior in the finite sample case, but before this work, pseudo-likelihood did not, and so you often hear it referred to as a sort of heuristic. But we'll see we can cross that off. Yes. >>: [inaudible] when you say [inaudible] so the model is the same, right? So how is the inference becoming tractable in one case and intractable in the other? I mean, the model is the same [inaudible]? >> Joseph Bradley: So I'm talking about the inference required during learning, not inference at test time. That actually raises a number of interesting questions. At learning time, though, the loss you're optimizing doesn't have the probability of the full distribution in it. >>: So in a sense you're changing the model itself. >> Joseph Bradley: It's an interesting, it's hard to know exactly how to phrase it. I would phrase it, I guess, as changing the loss, and optimizing that loss does give us guarantees with respect to our original model. >>: You can actually change the model [inaudible] dependency [inaudible]. You just give up on the graph. >> Joseph Bradley: That is another option. >>: The way I see this, you are constraining your potentials into essentially [inaudible] model. That's how I'm seeing it. >> Joseph Bradley: Right. >>: [inaudible] the loss. [inaudible].
>> Joseph Bradley: I guess it's hard to say. I mean, maybe you could phrase it another way. >>: Can you go back to where you were- >> Joseph Bradley: I'm not sure what the model would be, I guess, which would be elicited by this loss. >>: Can you go back to the [inaudible]? >>: You said you only change the training, only for training time, not testing time. You're not actually optimizing that objective. >> Joseph Bradley: That's true. >>: But does that mean you change the model? I mean, I guess it's semantic. I wouldn't call it changing the model. >> Joseph Bradley: Right. I mean- >>: You're changing parameters- >> Joseph Bradley: There might be something interesting where we could say this is actually optimizing MLE with respect to some model, but I'm not sure what that model would be. I mean, the closest, I guess, dependency networks seem very relevant, but I'm not sure if that's right. >>: But for small-scale problems, suppose you construct something very small for which you actually can compute the likelihood. Do you compare the likelihood [inaudible]? How much difference is it going to be? [inaudible]? >> Joseph Bradley: Right. I will show comparisons of how accurate the parameter estimates are from pseudo-likelihood versus optimizing- >>: [inaudible]. >> Joseph Bradley: Versus optimizing MLE, if that's your question. >>: I'm more thinking about, if you use a pseudo-likelihood, you change the loss [inaudible]. >> Joseph Bradley: Right. >>: And then you estimate the parameters. From the parameters that you estimate, which are [inaudible], then you can compute the likelihood for those [inaudible] estimates through pseudo-likelihood. >> Joseph Bradley: Yes, you can. >>: [inaudible] for small-scale [inaudible]. >> Joseph Bradley: Right, and that's something I'll show. >>: Okay. >> Joseph Bradley: Good point. Okay. So where were we? Right. So, talking about the finite sample complexity analysis of pseudo-likelihood. The result I'll actually phrase in terms of MLE first. The sample complexity result for MLE, which will be very similar to the pseudo-likelihood one, would look like this: to achieve a given error epsilon, we need at most n training examples, where n is around this amount. Here we have epsilon, the L1 norm of the parameter error, which is actually normalized by the number of parameters; R is the number of parameters; delta is the probability that this bound doesn't hold, very common in these analyses; and lambda_n is an eigenvalue condition which I'll discuss a little bit later. But first, what I'd like to show is the analogous bound for pseudo-likelihood, which you notice is essentially the same, except that this lambda_n value is going to be different. So before I explain that, I want to point out that, yes, as I said, these are the first finite sample complexity bounds for very general models, and when you add this together with tractable inference for many classes of MRFs and CRFs, you can actually show that this implies PAC learnability. Basically, what you need to do is control how that lambda_n value grows with respect to whatever problem parameter you're defining your class of models by. So given this, what I need to do is explain lambda_n. Yes. >>: What does error mean here? >> Joseph Bradley: Error? So it's the L1 norm of the parameter error; we have an optimal parameter vector and, right, it's the distance from that, also normalized by one over R, the number of parameters.
So, lambda_n: we'd expect the number of examples needed to be around one over lambda_n squared, so we expect this eigenvalue to be important. I won't take too much time to talk about this, but essentially it's an eigenvalue condition which measures the curvature of the objective, where greater curvature in this objective is going to make it easier to learn. And the important thing really is how it varies for MLE versus pseudo-likelihood. Essentially, MLE is going to have a larger lambda_n, implying lower sample complexity, but of course it requires more computation. Pseudo-likelihood will have a smaller lambda_n, higher sample complexity, but of course less computation. So we have this sample-computational complexity trade-off. And speaking of trade-offs, I'd like to point out one more, which is specific to pseudo-likelihood. Recall that pseudo-likelihood is essentially taking each variable and regressing it on its neighbors. But I mentioned you can do it both with joint optimization, with shared parameters, and disjoint, where you average duplicate estimates later. The bounds I've been showing have actually been for joint optimization, which will have lower sample complexity based on our bounds. With disjoint optimization, we get slightly worse bounds on the number of samples required, but of course it's completely data parallel. And so we have something of a sample complexity-parallelism trade-off. So finally, looking at the predictive power of the bounds. On the left here is lower sample complexity, or lower error for a fixed number of samples; on the right, higher. And so we'd expect to see MLE being the best in this respect, then pseudo-likelihood with joint optimization, and then disjoint. What I'm plotting here is a synthetic example where we can compare with the ground truth: the x-axis is the number of training examples on a log scale, the y-axis the parameter error, lower being better. And you can see that with max-likelihood we do indeed learn, decreasing the error. Pseudo-likelihood is a bit worse, especially in the low-sample regime, and pseudo-likelihood with disjoint optimization is a bit worse still, just as we'd expect. So the sample complexity [inaudible] actually occurs in practice. >>: [inaudible] disjoint optimization that you have. You basically have different [inaudible] models that have nothing to do with each other at same bounds, and they're very [inaudible] assumption. They both have the same sets of variables, and so it seems that they could get very different potentials? >> Joseph Bradley: There are definitely cases where that does happen and where that doesn't. And it's essentially actually going to be this lambda_n value which hides all of that behavior. In the actual proof, you could imagine the disjoint optimization proof as looking like a proof showing that simple regression works, and then basically showing that, if simple regression works, each regression will be reasonably accurate, and so averaging their estimates together will still be reasonably accurate. And our proof just used a simple- >>: [inaudible] create for different model. You won't get accurate for the same problem. I tell you X1 depends on X2, and then I tell you X2 depends on X3.
>> Joseph Bradley: In terms of the, so it would be for the same model in the sense that both problems would be using the same set of statistical assumptions, where the graphical model is encoding the statistical assumptions, and each problem would still be working with that same graphical model, although with a different loss, I agree. But the guarantees would be with respect to the same model. >>: One thing I'm [inaudible]. How would this sort of like [inaudible] resolve [inaudible] like, you know, basically what they call [inaudible] potentials? Where, you know, X1, X2 should be different? X2 and X1 should be different as well. I mean, you have basically a kind of, you understand what I'm saying? >>: Like a [inaudible] model? >>: Basically, yeah. [inaudible] model where everything is not [inaudible]. And in those cases the pseudo-likelihood will quickly fail. >> Joseph Bradley: So, in that case, I'm not sure how the eigenvalue condition would behave. We definitely tested it on models where, I guess we did not actually test it on models where there were completely all repulsive potentials, only on ones with a mix and with all attractive. But it- >>: People have looked at pseudo-likelihood for differences on [inaudible], and they actually found that when the potentials are not attractive, pseudo-likelihood tends to diverge quite a bit. >> Joseph Bradley: Right. So I think that's something I would like to test on data with a ground truth. I think if you have a model which is, right, maybe completely repulsive, I don't know; I'd have to explore that more. If you have some repulsive, like, problem areas, then I think actually the method I want to talk about next would be quite useful. >>: So can you show me what type of [inaudible]? >> Joseph Bradley: Right. I should know my slide numbers. Right. So for MLE, it's the minimum eigenvalue of the Hessian of the objective, the log probability over all of X, at the optimum. For pseudo-likelihood, it's the minimum over each variable of that eigenvalue condition for that local loss. >>: It's for the true likelihood minimization, not for what you are actually [inaudible]. So runtime is a measure on every problem. It's a measure on your [inaudible]- >> Joseph Bradley: So both of these are with respect to, like, include the target distribution; but I mean, in terms of the loss from which this Hessian is computed, that is loss specific. That's different for the two. But both involve expectations over the target distribution. >>: So can that be zero then? Because, I mean, I think it's [inaudible]- >> Joseph Bradley: So there are. It's actually the smallest nonzero eigenvalue. If you do have zero- >>: In the case of like infinite data, the pseudo-likelihood is going to converge to the same parameters as the MLE. >> Joseph Bradley: So that's a simplification. Really, I mean, in all cases, if you have an overcomplete representation, what you'll converge to is something which is of the same rank and can be transformed, but it is going to be equivalent in terms of the loss. >>: Yeah. So the parameters might not be the same, but the probability distributions [inaudible] by the two sets of parameters would be the same. >> Joseph Bradley: Right. I think you would always need a- >>: If you have intractable inference, surely the learning has multiple [inaudible] minima. So there's something [inaudible]. >> Joseph Bradley: Intractable inference? >>: Yeah.
I mean, if the underlying graphical model is not [inaudible], are there multiple minima for the learning? >> Joseph Bradley: So when I talk about MLE, I'm not talking about using approximate inference, so it would still be convex. If you threw in some types of approximate inference, then certainly you'd run into problems with non-convexity. Right. >>: What about for [inaudible]? >> Joseph Bradley: Right. And I should say, perhaps, the types of models I'm looking at are log-linear MRFs and CRFs. I should have said that, perhaps. >>: Okay. So [inaudible]. >> Joseph Bradley: If you have latent variables, that is definitely an interesting question, and it's something, as well as the semi-supervised case, which I can't really talk about here but would like to look more at in the future. And there are some simple ideas which you can immediately read from this composite likelihood method, which I wanted to talk about later, but yeah, this is not immediately applicable to that, other than through something like EM. Right. We'd seen that basically the ordering of these, in terms of error and sample complexity, is what we'd expect. Looking further at the predictive power of these bounds: you can see that in terms of sample complexity we have our bound like this, and what I'd like to say is, first, ignoring the log term, which is, as you'd expect, not that significant in practice, and fixing the number of training examples, what you might expect to see, if this bound is reasonably tight, is that the error increases as approximately one over lambda_n. And what I want to argue here is that lambda_n is important in controlling the difficulty of learning. So here, if you look at one over lambda_n, so harder problems going to the right and actual error on the y-axis, you can indeed see that for what we'd expect to be easier problems, we get lower error for a fixed number of training examples, and vice versa. So lambda_n does indeed seem to be important in controlling this difficulty of learning. >>: [inaudible]? >> Joseph Bradley: Controlling it. So basically, generate a whole bunch of random models, where each of those models is a point on this plot, and generate enough that we get points along the whole line. >>: Have you thought about using a different objective function here, like the error of the final joint probability, rather than the L1 error in parameters? Because at the end of the day, it's the probability that [inaudible]. >> Joseph Bradley: Right. So our analysis sort of does it in two steps, where we bound the parameter error given a number of training examples, and then the loss in terms of the parameter error. That second bound is not that interesting, but I do think that going directly from a bound on the loss to the number of training examples required would definitely be interesting to do. It wasn't super clear how to do that in the analysis, but that would definitely be valuable. Okay. By the way, can I ask when I should go until? >>: [inaudible]. >> Joseph Bradley: Okay. Thank you. Good. In terms of what this lets us do: given that we have this lambda_n value, which seems to control the difficulty of learning with MLE and pseudo-likelihood, what I'd like to see is how this varies for different types of models. So what I'll do is compare this lambda_n ratio between MLE and pseudo-likelihood. So first, looking at model diameter.
And here I'm just looking at chains as they increase in length to the right; and on the y-axis, this ratio between MLE's lambda_n and pseudo-likelihood's, where basically higher means that MLE is better. You can see that, other than end effects, the performance of pseudo-likelihood, as you might expect, doesn't change that much as chains get longer. So I'll call that not a problem scenario. In terms of factor strength, meaning basically the magnitude of parameters, or how strongly variables directly interact with each other: as you increase that going to the right, on a log scale, you can see that this ratio actually does shoot up. So for very strong factors, pseudo-likelihood can actually run into problems. Finally, for node degree: as you increase the degree of a node, this ratio again increases, albeit not as quickly as with factor strength. So we might call those problem scenarios for pseudo-likelihood. But what I'll now talk about is that we can often fix this using a method called composite likelihood. So, MLE is essentially estimating the entire model at once, and pseudo-likelihood is essentially taking each variable and regressing it on its neighbors. You can imagine the natural generalization: instead, take a chunk of variables and regress them on their neighbors. And you can see that this generalizes both MLE, where we have a single chunk of variables, and pseudo-likelihood, where each chunk is a single variable. So what we show are similar sample complexity bounds for joint and disjoint optimization, just like you saw for pseudo-likelihood; they take the same form, so I won't show them again. But I do want to emphasize that we analyze structured composite likelihood, meaning that we really paid attention to how these components were structured, especially with respect to our model and data. And that's something which, surprisingly, we haven't seen that much in the previous literature; there, often these chunks of variables were just, for example, all sets of two variables or three variables in your model. And it really benefited us to look at the structure. So the obvious question is how you choose these estimator structures, and what we do is look to our experiments with pseudo-likelihood for guidelines. First, we limit each component to trees, so that inference within that component will be tractable. We'd like to follow the structure of our model in order to avoid those failure modes of pseudo-likelihood, where, for example, if we have a star structure, we'd like to cover it with a single component; if we have a strong factor, we'd like to cover it with a single component; and we'd also like to try to choose large components, or minimize the number of components, in order to be intuitively closer to MLE than to pseudo-likelihood. >>: Do you have to know beforehand which factor [inaudible] or- >> Joseph Bradley: It's an interesting question, and it's sort of ongoing work: how you can adaptively, for example, get a rough estimate of your distribution in order to choose a much better estimator and then get an accurate estimate. Right. Unless you have some kind of prior knowledge. Yeah. So, for example, given this grid, we might choose these two combs to cover it, where each is a component in our composite likelihood estimator, and it sort of follows our guidelines to the right.
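(As one hypothetical way to realize the "two combs" idea, here is a sketch that builds two tree-structured edge sets which together cover every edge of a rows-by-cols grid; the node indexing and helper name are illustrative assumptions, not the paper's construction. For a 4-by-4 grid, each comb is a 15-edge spanning tree, and together the two combs cover all 24 grid edges, so inference within each component stays tractable.)

```python
def comb_components(rows, cols):
    """Two comb-shaped tree components covering every edge of a grid MRF:
    comb A = top-row spine plus all vertical 'teeth';
    comb B = left-column spine plus all horizontal 'teeth'.
    Each comb is a spanning tree, so within-component inference is tractable."""
    def node(r, c):
        return r * cols + c
    vertical = [(node(r, c), node(r + 1, c))
                for r in range(rows - 1) for c in range(cols)]
    horizontal = [(node(r, c), node(r, c + 1))
                  for r in range(rows) for c in range(cols - 1)]
    comb_a = [(node(0, c), node(0, c + 1)) for c in range(cols - 1)] + vertical
    comb_b = [(node(r, 0), node(r + 1, 0)) for r in range(rows - 1)] + horizontal
    return comb_a, comb_b
```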
So, showing some experiments with this: here is a synthetic grid model, where we are learning with a fixed number of training examples; I'm going to show increasing grid size to the right, and on the y-axis, comparing the log loss of each method versus MLE. So here it's saying that pseudo-likelihood is just as good as MLE for a single node, since it's the same, but as the grid gets bigger, MLE is actually going to do better, pseudo-likelihood worse, as we'd expect. Now if we use these two combs for composite likelihood, we make a significant gain. But what's really the clincher is that if you look at training time, MLE takes a lot longer than pseudo-likelihood, and using this structured composite likelihood, limiting ourselves to trees, doesn't actually make us require more time for training. Yes. >>: How do you compare when you use approximate inference but maximum likelihood estimation? >> Joseph Bradley: Yes. So this is actually using approximate inference; with exact it was not comparable. Here it was with Gibbs sampling, and there is certainly a lot of tuning which could be done to, yeah. >>: But Gibbs sampling is not the [inaudible], right? >> Joseph Bradley: I think it varies based on the problem. >>: [inaudible] it can do- >> Joseph Bradley: Certainly you could do, like, belief propagation or something. And, yeah, it would definitely merit further comparisons; there's a lot of tuning to be done there. >>: [inaudible] MLE training time is based on Gibbs sampling. >> Joseph Bradley: It is. >>: But you're using exact for the pseudo, for the composite, you're using exact. >> Joseph Bradley: That's right. So for this, composite likelihood essentially lowers sample complexity without increasing computation, which is what we'd hoped. But I do want to emphasize that these estimators were tailored to the model structure; and, though I won't talk about it here, it's also useful to tailor them to the correlations in the data, the strong factors, for which you need either expert knowledge or some adaptive method. So to sum up here: we looked at finite sample complexity bounds for general models and basically showed, first, that pseudo- and composite likelihood are not heuristics, you can get real bounds for those, and this allows PAC learnability for certain classes of models. And then we looked at structured composite likelihood, where we gave some guidelines for choosing estimator structures based on failure modes of pseudo-likelihood, especially trying to tailor those estimators to the model and data. That let us do this sort of decomposition, which was model specific, and get the best of all these trade-offs in certain cases. So what I've been trying to argue is that we can scale learning by using different types of decompositions which let us trade off quantities like sample complexity, computation, and parallelization. These decompositions used model structure and locality, and the trade-offs were tailored to the model and data. So, for example, in parallel regression, we decomposed over coordinate updates, and we were able to choose the number of parallel updates according to the correlation in our data. In parameter learning, we decomposed the loss into subgraphs, essentially, and were able to tailor that decomposition according to both the original model structure and the strength of factors in our data. I also looked at structure learning in the thesis, where we saw some similar trade-offs, but I didn't have time for it here.
So there are a number of future work items I'd like to look at. First, in terms of parallel regression, I've just talked about the multicore setting, but I'm very interested in the distributed setting, where you have limited communication: you want to limit both the number of messages you need to send and the size of those messages, and really taking advantage of sparsity in those messages is a challenging problem. I'm also interested in heterogeneous parallelism, where you might have multicore plus distributed plus perhaps GPU, and you'd like to parallelize over multiple facets of your problem, for example both examples and features. There are also a lot of interesting questions when you start talking about, for example, parameter learning, where you have more structured objectives. And for that, I think a really interesting question is how to do partly joint optimization with composite likelihood; there, I think methods such as the alternating direction method of multipliers might be very applicable. Another thing I'd like to look at in parameter learning: I've talked about some guidelines for how to do these trade-offs, but there's a lot of work which still needs to be done in order to create a sort of automated learning system which can automatically choose the structure and parallelization strategy. There are a lot of trade-offs in terms of sample complexity and computation, but also things I haven't talked about, such as, in parallelization, how you balance the number of components you can parallelize across versus the amount of communication required if you're doing partly joint optimization. And I think an interesting way of phrasing this might be, for example, graph partitioning, where some questions are: how do you even estimate or compute this lambda_n, given a model and estimator, cheaply of course? And how do you balance these various objectives and constraints in order to choose estimators and divide their components across machines? In the interest of time, I'll just get through this and mention that there are a number of applications I'm very interested in. First, in terms of social network modeling, I think that there is a lot to be done in terms of applying graphical model formalisms to social networks. There I'm really interested, first, in generative models which model both the network structure and the semantic content, like text and images, in those networks; and there, I think, there are a lot of interesting questions about both scalability up to social network size and how to jointly model those aspects. I'm also interested in looking at some work in temporal modeling, and in how to modify the methods I've talked about for the semi-supervised case. I've also done some initial work, which I'm interested in continuing, on machine reading, where an example goal might be to build a probabilistic knowledge base from text on the web. There, I think, the difficult questions are, if, say, you phrase this as a big graphical model: how you do focused learning and inference within that model, since you really would need to prioritize at that scale; as well as how to do active learning in order to keep building this knowledge base while minimizing the amount of human supervision and intervention required. So with that, I'll leave a summary of this talk here, and thank you very much for listening.
>>: Have you ever [inaudible] divergence compared to pseudo-[inaudible]? >> Joseph Bradley: Right. I've looked at it some, but have not done a full comparison, and I think it's a really interesting case where it sort of mixes ideas from several places, like stochastic gradient with approximating the objective, and it would definitely be interesting to look at. >>: Did you look at [inaudible] so that you propose on the generalization of [inaudible]? The reason I estimate, but the thing I really want to do is ask a question about [inaudible]? So maybe [inaudible]? So [inaudible] you could have a lower [inaudible] complexity to which you give [inaudible] error? >> Joseph Bradley: Right. So in terms of, you're asking about going more directly from some sort of loss to- >>: Yeah. So I guess the bound is hard, but I think kind of [inaudible]- >> Joseph Bradley: Right. It is definitely an interesting question; I would like to look into it more. I think, somewhat relatedly, one initial thing I've been thinking about is how to make these bounds not the sort of wide, sweeping bounds over all your parameters, but really more parameter specific, where, you know, you can imagine some parts of your model are easy to estimate and some are hard; and that in and of itself would let you perhaps get better bounds with respect to the loss you care about. Also, and I may not have fully understood your question, but I think it would help in cases where what you really cared about was a particular part of your model and wanted to estimate it well; then you might have sort of model-part-specific bounds. >>: [inaudible] composite likelihood, parallelism is done by doing, you know, different parts simultaneously. >> Joseph Bradley: That's right. That's how I've been thinking about it. >>: Okay. So [inaudible] would it be more parallel. >> Joseph Bradley: That's right. >>: Do you have a kind of curve to show what kind of speedup you get for that, the error [inaudible]? >> Joseph Bradley: Right, I don't. I think that, in terms of disjoint optimization, the curves I showed really are immediately applicable, since that's completely data parallel. In terms of cases where you're doing joint optimization, but distributed, that's something I'm very interested in looking at; and I think really more analysis is needed in terms of seeing how joint optimization behaves, with the hope that you don't really need to make it fully joint but can, you know, essentially stop early. Right. >>: And how about the graphical model? Do you [inaudible] in terms of- >> Joseph Bradley: I have not. In terms of parameter learning, I guess that's in a sense more straightforward, since you don't have to do inference over the full model. >>: Oh, I see. Okay. >> Joseph Bradley: But in terms of structure learning, that's very interesting. In structure learning, I stressed looking at undirected models. Thanks.