
>> Lin Xiao: Today it's a pleasure to have Joseph Bradley come here. He is finishing a PhD at CMU in the Machine Learning Department. And he's going to talk about optimization trade-offs for scalable machine learning.
>> Joseph Bradley: Thank you. Right. As he said, I'll be talking about how to trade off different quantities in order to help scale various machine learning problems. So I'll start out by talking about what those quantities are, at a very [inaudible] level view of machine learning, where we'd like to get data, train a model to fit the data using some sort of optimization, and then test the model on new data. The worries I'm going to be thinking about are: we may have big data, a lot of it; a complex model; structured optimization. All of these difficulties are going to end up feeding into issues such as sample complexity, how many training examples we need to learn the model; computation in the optimization during learning; and, in the modern world, how to take advantage of things like parallelism to help with that computation. And so these three quantities are going to, of course, feed into the eventual accuracy of our model on the new data. But it's these three quantities that I'm going to be talking about. So in order
to improve scalability, you can imagine improving each of these individually. For example,
develop methods with better learning bounds, which improve the sample complexity, look at
computation by developing faster optimization methods in the sequential setting, or say, do
parallel implementations of existing algorithms.
So what I'd like to think about instead is how to trade these off, where, for example, a small sacrifice in computation might lead to a big gain in parallelism. In the first part of the talk, I'll talk about a method which does this trade-off, and in the second part of the talk, I'll talk about a setting and a method where we can trade off all three of these aspects. So our general approach to scaling is going to be to take some complex problem and decompose it into subproblems which are simpler to solve in and of themselves. Then, through analysis, we'll look at different trade-offs in this decomposition, and the analysis will actually guide how we do that decomposition in order to optimize those trade-offs. And interestingly, we'll be able to talk about data- and model-specific ways of doing that optimal trade-off. So in this talk I'll start out talking about parallel regression, where I'll talk about a parallel coordinate descent method, where we can trade off computation and parallelism. And the
second part, I'll talk about parameter learning for Graphical Models where we look at a method
which can trade off all three of these aspects.
So first, looking at parallel regression. So in the regression problem we want to predict
essentially one label, or a small set of labels, given a large number of features, for example, in
one data set we look at, the label is a measure of stock volatility, which it turns out you can predict somewhat, unlike the direction of movement of stocks. And we're going to be
predicting them from a large number of features which are small pieces of text from financial
reports. The sparsity part, of course, is that we want to explain that label using a very small
number of features, and of course, this is very useful in high dimensional settings where the
number of features is a lot larger than the number of training examples. So in this part, the
problems we specifically look at are Lasso, with the least squares loss, and sparse logistic
regression. In general, our analysis is applicable to generalized linear models. So in the
sequential setting, there's a whole lot of algorithms which can be used for sparse regression,
like gradient descent, stochastic gradient, interior point methods, and different thresholding methods. One which caught our eye was coordinate descent, also known as Shooting. And it caught our eye because it's been shown to be very fast. There have been a number of theoretical and empirical studies explaining this and showing this on many problems. But for big problems where you have millions of features or hundreds of thousands of examples, even this fast sequential method may not be ideal. And so what we'd like to do is take advantage of parallelism.
So here I'll be talking about the multicore setting, where there's shared memory and low latency; in ongoing work we're looking at the distributed setting. But in the multicore setting, we can
think of parallelizing a number of aspects of the problem for parallel regression. First, matrix-vector operations: many methods, such as interior point methods, spend a lot of their time doing such
operations, and we could think of using existing linear algebra libraries to do that. However, we
found this did not work the best empirically, and I think it was largely because the methods
which could benefit most from parallel matrix vector operations were not actually the fastest
methods for this particular problem. We can think of parallelizing over examples, such as with this stochastic gradient method, which has some parallel analyses, but one could argue
that using stochastic gradient methods tends to be best when you have a large number of
examples, not a large number of features which is the setting we were looking at.
And then finally, looking at parallelizing over features, for example, Shooting or coordinate descent, and parallelizing that. And we asked the question, which I'll explain in a minute, of whether that should be inherently sequential. But of course, it turns out that it's not, and I'll explain why. So what I'll talk about is a method called Shotgun, which is parallel coordinate descent, for sparse regression, and I'll first show a convergence analysis which predicts that you
get essentially linear speedups up to a problem dependent limit. And then show a big empirical
study which shows that Shotgun is quite successful in practice.
So just looking at a little background, our problem is going to be to minimize the convex
objective, F of W, where W is this weight vector. And F of W would be the loss and the
regularization for whatever Lasso or logistic regression problem we are looking at. So for Shooting in the sequential setting, the basic algorithm I'm looking at is a stochastic coordinate descent method, which says: while not converged, pick a random direction or coordinate j and update the weight for j, sometimes via a closed-form minimization, sometimes via a line search.
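[Editor's note: for reference, here is a minimal sketch of this sequential Shooting update for the Lasso objective F(w) = (1/2)||Xw - y||^2 + lambda*||w||_1. This is an illustrative reconstruction, not code from the talk; the function names and the fixed iteration count are assumptions.]

```python
import numpy as np

def soft_threshold(a, t):
    # Closed-form minimizer of the one-dimensional Lasso subproblem.
    return np.sign(a) * max(abs(a) - t, 0.0)

def shooting(X, y, lam, n_iters=10000, seed=0):
    """Sequential stochastic coordinate descent ("Shooting") for the Lasso."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                      # residual r = y - X w  (w starts at 0)
    col_sq = (X ** 2).sum(axis=0)     # precomputed ||X_j||^2 per column
    for _ in range(n_iters):
        j = rng.integers(d)           # pick a random coordinate
        if col_sq[j] == 0.0:
            continue
        # Minimize F over coordinate j, holding all other coordinates fixed.
        rho_j = X[:, j] @ r + col_sq[j] * w[j]
        w_new = soft_threshold(rho_j, lam) / col_sq[j]
        if w_new != w[j]:
            r -= (w_new - w[j]) * X[:, j]   # incremental residual update
            w[j] = w_new
    return w
```

[The Shotgun variant described next simply runs P of these coordinate updates at the same time, each one pretending the other coordinates are held fixed.]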
So if this were the contour map where gray is better and we start at some point, this method
would optimize in some direction, pick another direction, optimize in that and eventually get to
the optimum. So in the parallel setting, what I'm looking at is very naive parallelization where,
rather than updating a single direction at once, excuse me, will update on each of P processors,
P different directions just pretending that we are holding the other coordinates fixed. So for
example, in this setting, if we start here, pick two random directions and compute the
minimizations in those, and then we add those updates together, we will get right to the
optimum. Whereas this is a very nice setting where here we have uncorrelated features and
the parallel updates are not going to conflict.
Now in a bad case with extremely correlated features, if we compute minimizations in both
directions independently and then add those updates together, of course we might diverge.
And so you might ask: is coordinate descent inherently sequential? Well, here's why it's not. And here's our Shotgun convergence theorem. It's essentially stating that if we limit the number of parallel updates, where I'll talk about the limit on P in just a moment, then we get this bound on the distance from the optimum, where W at iteration T is the weight vector and W star is the optimum. That distance will be upper bounded by this quantity on the right, where in the numerator we have quantities such as D, the number of features, W star, the optimal weight vector, and W naught, where we began; and we divide by the number of iterations T multiplied by P, the number of parallel updates. And so what this is essentially saying is that we are getting linear speedups, since this generalizes the bound from Shooting for the sequential setting.
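[Editor's note: as best the bound can be reconstructed from the spoken description, it has roughly this shape, up to constants and a loss-dependent factor that comes up in the Q&A below; the precise statement is in the Shotgun paper:]

```latex
\mathbb{E}\!\left[F\big(w^{(T)}\big)\right] - F(w^{*})
  \;\le\;
  \frac{d\left(\tfrac{1}{2}\,\lVert w^{*}\rVert_2^{2} + F\big(w^{(0)}\big)\right)}{T\,P},
  \qquad \text{for } P \text{ below a limit of roughly } d/\rho .
```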
So this bound states that if we limit the number of parallel updates, we should get essentially linear speedups. This limit is going to be essentially D over rho, where D is the dimensionality or number of features in our problem, and rho is the spectral radius of the normalized Gram matrix, X transpose X, where X is the data matrix of examples by features. So, intuitively, rho measures the correlation between features. And with proper normalization of the data matrix, rho will be between one and the number of features. So in the ideal case with uncorrelated features, rho will be one, and that means that our theory would predict we can update all of the coordinates at once, which is what you'd expect. In a very bad case with exactly correlated features, rho will be D, which means that we can only update a single coordinate at once. Yes.
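[Editor's note: a small illustrative sketch of this problem-dependent limit, assuming the column normalization described above; the function name and the exact normalization are editorial choices, not from the talk.]

```python
import numpy as np

def shotgun_parallelism_limit(X):
    """Estimate rho, the spectral radius of the normalized Gram matrix X^T X,
    and the predicted maximum number of parallel updates, roughly d / rho."""
    norms = np.linalg.norm(X, axis=0)
    Xn = X / np.where(norms > 0.0, norms, 1.0)  # diag(X^T X) = 1, so rho in [1, d]
    gram = Xn.T @ Xn
    rho = float(np.max(np.linalg.eigvalsh(gram)))  # PSD, so max eigenvalue
    d = X.shape[1]
    return rho, max(1, int(d / rho))

# Uncorrelated features: rho ~ 1, so ~d coordinates can be updated at once.
# Exactly correlated features: rho = d, so only 1 coordinate at a time.
```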
>>: If you go back to your theorem, what happens if you subtract 1 trillion off of F?
>> Joseph Bradley: If you subtract 1 trillion off of F?
>>: It should be the same minimum; it should be [inaudible]? But the right-hand side becomes
like huge negative and so I don’t know how to interpret it.
>> Joseph Bradley: I see. So if you subtract 1 trillion off of F, so the losses we were looking at-
>>: [inaudible]?
>> Joseph Bradley: I guess, right. All the losses we were looking at were non-negative.
And that should be, I'm thinking of where that would appear. I think that was implicit in our
proof and, right. It's not stated here. They have to be non-negative.
>>: [inaudible] divide F by trillion and make that term go away.
>> Joseph Bradley: So if you divide F by 1 trillion, let's see, then-
>>: Then it just scales, it's relative.
>> Joseph Bradley: Okay. Yes.
>>: To do [inaudible] assumption [inaudible] F because you scale like one over T and not like
one over square root of T?
>> Joseph Bradley: Right. So in terms of the types of assumptions we are making: we're assuming that we can write out essentially a second-order Taylor expansion, and the matrix upper bounding the one which appears in the second-order term is what actually appears as the Gram matrix in the next part, defining rho. And that limit on rho would be a limit on the smoothness. So I think the spectral radius rho would essentially measure that smoothness, as I understand it.
>>: You're saying like if you have like rho for D [inaudible]?
>> Joseph Bradley: It would still exist. So in terms of smoothness assumptions, I guess what we are-
>>: Even if rho equals D, [inaudible] and choose T equals one-
>> Joseph Bradley: Right.
>>: Still, you know that there are some problems for which you cannot do any better than
[inaudible] square root T.
>> Joseph Bradley: So I think it's really, as I was saying before, about being able to upper bound the change: there's an assumption that we can upper bound the change in the objective with a second-order Taylor expansion, and that assumption is what's encoding smoothness. Right. I think for some problems, certainly you cannot do that.
>>: [inaudible]?
>> Joseph Bradley: Right.
>>: [inaudible] I think it still is strange to me because you can multiply F by a huge number,
then you basically get rid of your W star norm term.
>> Joseph Bradley: So, let's see.
>>: [inaudible] dimensional [inaudible].
>> Joseph Bradley: Right.
>>: It is just feels kind of weird that [inaudible].
>> Joseph Bradley: Right.
>>: [inaudible] the proofs, so that everything [inaudible]?
>> Joseph Bradley: I unfortunately do not. But I think>>: Is there an assumption that F is like 0, 1 or normalized or something?
>> Joseph Bradley: Right. This assumption, for this theorem, I think where it would appear, perhaps, is that, actually hidden here, this is really a bound for the case of Lasso. And for general models, there should be another term, which I think in our paper is a beta, but basically it is a constant which appears in a second-order Taylor-expansion-like bound on the change in the objective, and if you multiplied the objective by a huge number, then I think that would essentially appear as a beta multiplied by that W star term. I'd have to check, but I'm pretty sure that constant, which is sort of loss dependent, is multiplied by the W star term. And for example, for Lasso, it's like one; for logistic regression, I think it's four, or something. So I just kind of hid it here. But you're right. I think if you multiplied by a huge constant then it would appear there. I think that's the answer. Okay.
Right. So given that we would expect essentially linear speedups up to some limit, we can see
how it actually looks in practice. And so if we look at an example data set where here I'm
plotting on the x-axis the number of very carefully simulated parallel updates and on the y-axis, the number of iterations, where both axes have log scales. If we do a single update per iteration we require almost 10,000 iterations to converge; extrapolating linear speedups, we'd expect to line up on this line; and our theory says that we should be able to do about 79 parallel updates before we start risking divergence. And if you actually run this in practice, then you see essentially that: approximately linear speedups, and then after the end of this plot we start hitting divergence. And you see similar behavior on other problems. So the
experiments seem to match our theory, which is what we'd hoped.
So thus far I've talked about Shotgun as this naive parallelization of coordinate descent working
and showing how linear speedups up to a problem dependent limit actually seemed to happen
in practice. But now I’ll quickly show you some results from larger experiments. So first looking
at logistic regression, where our goal is to predict the probability of a discrete label, we
compared a small number of algorithms here, since there has been a big empirical study before,
basically showing that Shooting the sequential version of our algorithm is extremely
competitive. But we did take time to compare Shotgun with this parallel stochastic gradient method, since it was one of the few other parallel methods which we had not seen tested on these problems. The stochastic gradient was just a simple implementation where we estimate the gradient with a single training example, and of course, it is considered to be very scalable. And we were running on an eight-core AMD Opteron at 2.69 GHz.
So just quickly showing an example of our result, this is actually in a high sample setting. The
high dimensional setting made us look even better. But in this setting, we had half a million examples and 2000 features; on the x-axis is time, and on the y-axis the objective, where lower is better. And so if we look at Shooting, the sequential coordinate descent does seem to do reasonably well. Stochastic gradient, as you might expect, is faster at first, slower later on. Parallelizing it helps a little bit using this method. But parallelizing coordinate descent helps enormously, and
it looks like Shotgun is the fastest. So basically parallelizing over coordinates seems to give big
speedups, and we saw even more extreme behavior differences in the high dimensional setting.
>>: Do you know what the, it’s convex, so if you run some sort of thing like [inaudible], do you
know what the true objective minimum is?
>> Joseph Bradley: Right. So for all of these, we essentially ran them for an extremely long
time and then recorded that essentially optimum, and then we would do these experiments
running of these until they got within some percent. So we actually I think tried to compute the
optimum using Shooting, and if another, that was what we did initially, but then if another
method reached a lower objective value, we would record that as the optimum. So-
>>: So 150,000 is really the optimum here?
>> Joseph Bradley: Or a little bit below that. Right. We ran some of these for, right, an extremely long time until-
>>: [inaudible] from the challenge or is this your only [inaudible]?
>> Joseph Bradley: It's our only implementation, right. And this is actually a really simple SGD implementation. We also tried ones specifically tailored for the L1 setting, but those, in terms of objective value, were much slower even though they got sparser answers, just because of the issues with doing stochastic gradient with L1; it ended up making them less competitive on these sorts of plots than our implementation.
>>: [inaudible] Pascal challenge data, like results [inaudible] optimization challenge?
>> Joseph Bradley: Right. So that's a good question, and I'm not sure what their curves would
look like relative to this. That would be good to check. Yes.
>>: So the [inaudible] average is the eight cores at the end of the [inaudible] right?
>> Joseph Bradley: It does.
>>: So how do you guys track the progress, how do you terminate each of the cores at
convergence?
>> Joseph Bradley: Oh. So the idea is at each iteration, suppose we stopped there, compute
that average and that's one point in this plot. And so it's essentially, I mean this plot saying,
suppose we stopped at this moment, then that's what the parallel SGD would look like.
>>: [inaudible] points?
>> Joseph Bradley: That's right.
>>: So looks like Shotgun is maybe three times faster than Shooting or so.
>> Joseph Bradley: I'll show you the actual speedup plots in a couple slides. Yeah.
>>: [inaudible].
>> Joseph Bradley: So then, looking at Lasso quickly. The goal is to predict a real-valued label, and here we haven't found as many big empirical studies, so we tested a lot more types of algorithms, as well as a large number of data sets, with sizes varying from hundreds to millions of features and
examples. I don't have time to go into too much detail with these, but I’ll show you two of the
four classes of data sets we looked at.
The important thing to note is what's circled in blue, which is the average predicted maximum number of parallel updates for the data set shown in each of these boxes. The important thing is basically to note that a very large number of parallel updates could potentially have been done. So in each of these plots, the x-axis is Shotgun's runtime; the y-axis, the other
algorithm’s runtime, and if something’s above that diagonal line, then that means Shotgun was
faster. So this point is saying that on this particular data set, Shotgun took 1.2 seconds, the
sequential version, 3.4. So just quickly plotting different methods up here, we have Shooting, the L1_LS interior point method, which used parallel matrix-vector operations, and then a number
of other methods. And most of the dots are above the dotted line, so Shotgun did reasonably
well. And then on this larger sparser data set collection, a lot of the methods actually weren't
even able to finish in a reasonable amount of time, and so Shotgun looks quite good.
So essentially, Shooting seems to be one of the fastest algorithms, and Shotgun provides
additional speedups. However, speaking of speedups, if we look at the actual curves, with the number of cores on the x-axis, which is also the number of parallel updates we are doing, and the speedup we got on the y-axis, these are aggregated results from all of our tests and, of course, the diagonal would be optimal. So yeah. If you look at the wall clock time speedup of Lasso, on average it doesn't look that great. And it varied a lot. Sometimes there would be no speedup, sometimes almost optimal, but the average was there. But if you look at the number of iterations we're doing, it's almost optimal. So it is decreasing the number of iterations. And what we believe we were hitting is the memory wall, where essentially the memory bus is getting flooded. And I think a reasonable explanation for this is that Lasso's updates are very cheap, very low computation per datum loaded, and so it's rather difficult to hide things like memory latency. And one thing possibly supporting this is that if you look at the
logistic regression times speedup, it’s significantly better, although still not optimal. And
logistic regression uses more floating point operations per datum loaded, and we believe that
helps hide the memory latency. So you had a question.
>>: That was my question.
>> Joseph Bradley: Right. So yeah. I think that definitely points to the need for possibly more,
perhaps there are more optimizations, hardware specific optimizations we could make, trying
this on other hardware, testing it on other types of losses, which might help hide that latency,
right. There's a lot to it.
>>: Just so I understand the setting, when you’re measuring speedup, you’re holding the
problem size constant and just increasing the number of cores or are you also scaling the
problem size [inaudible]?
>> Joseph Bradley: So each point is the average over data sets of the speedup for that number of cores. So, right, you're right; I guess that means that everything is being held constant as we go across. That would definitely be interesting. Yeah. The variance in the actual wall clock time speedups was pretty big. And it would be
nice to be able to say a bit more about what types of data sets [inaudible].
>>: What’s the variance of the red line?
>> Joseph Bradley: I don't know the number. I know the ranges, which were essentially from
one to eight.
>>: If it’s one to eight, then it can be [inaudible] one time [inaudible] that fast [inaudible]?
>> Joseph Bradley: So I think it depends on this, I'm not sure but I would think it would depend
on the dimensionality and the sparsity of the data. And I'm not sure. Right.
>>: Did you store your data in columns or in rows? Because your-
>> Joseph Bradley: You want to access it by column. So that was how we did it. For sure.
>>: [inaudible]? But they still are [inaudible] different columns, and so the memory subsystem has to sort of stream through a bunch of stuff, but it's jumping.
>> Joseph Bradley: Well, sort of. In practice we didn't choose completely randomly, we did a
random permutation and then went through it. And there were things which we later tried,
which did help a bit, like trying to carefully sort the columns and choose which ones we handled
at which time. Right. So it wasn't completely random, but yeah. There were definitely issues
with locality.
>>: And this is an eight core standard [inaudible]. It’s just one [inaudible].
>> Joseph Bradley: Right.
>>: Do you need to see a whole column [inaudible]?
>> Joseph Bradley: The whole column.
>>: Do you need to see [inaudible] everything is there you might not have to [inaudible]?
>> Joseph Bradley: I think that's definitely a question which interests me; whether you can mix
the ideas of stochastic coordinate with stochastic gradient. It's not something that we have
experimented with really. But it would definitely be interesting. Yeah.
>>: So I guess both for distributed or for like a NUMA system, it would be interesting if each processor was choosing a column from a prearranged subset of columns. Have you done any analysis, or looked to see whether it would be just as effective if each-
>> Joseph Bradley: We've done that sort of analysis to try to deal with the, I guess you'd call it
the statistical issue of conflicting updates where we tried to sort of sort columns based on
correlation between them. We haven't, I guess, done it as much for the more systems side. So in terms of sorting by correlation, there can be some benefit to doing it. There's some recent work by Chad Scherrer [phonetic], I believe, which looked at sort of extending this idea and combining greedy coordinate descent with stochastic coordinate descent, where they did some smart sorting of columns.
>>: You mentioned that. I just forgot; what's the dimensionality and the number of
[inaudible]?
>> Joseph Bradley: So it varied a lot, from hundreds to hundreds of thousands or millions.
>>: I don't think the speedup would highly depend on whether it's in the cache or not. These
kinds of things, so-
>> Joseph Bradley: That is definitely-
>>: Your method would [inaudible] data sets where, you know, a single box implementation would not be viable [inaudible] and then you-
>> Joseph Bradley: As far as single box implementations, I mean that's something we're
thinking about now. This was really targeted at multicore. I think in the distributed setting we
are having to think about pretty different approaches.
>>: You did the experiment with different CPUs?
>> Joseph Bradley: We tried a little bit. There is a 16 core [inaudible] machine which we were
testing with. We saw, it was a bit better, I think it had, right, less issues with cache.
>>: I think you have a lot of [inaudible] about [inaudible], so cache, and the structure of cache and-
>> Joseph Bradley: I mean, definitely on a more souped-up machine, right, it would help.
Yeah, I think it would be interesting to look at other types of hardware.
>>: [inaudible] really [inaudible] computation [inaudible]. I think that's what you're trying
[inaudible].
>> Joseph Bradley: Right. Definitely with Lasso. Okay. So just to sum up this part quickly.
Looking at parallel regression, we talked about Shotgun, this parallel coordinate descent, and gave
an analysis showing essentially linear speedups, of course in our experiments we did not get the
ideal speedups, but especially since the sequential version was one of the best methods for
these problems, speeding it up a bit with parallelism made Shotgun one of the best methods.
So, going back to the themes I talked about at the beginning: what we did was
decompose this computation based on different coordinate updates. And we saw that,
basically even though these coordinate updates would conflict and cause a little bit of
divergence, making us do a bit of extra computation, we ended up getting a big gain in
parallelism. And we could sort of optimize this trade-off by choosing the number of parallel
updates based on the amount of correlation in our data. So that's a pretty simple example of these sorts of trade-offs. What I'd like to do, if I have time, is talk about parameter
learning for Graphical Models, which is a case where we can actually trade off things in much
more complex but beneficial ways.
So in Graphical Models, motivating example might be say you want to model user interests or
behavior in a social network, and to do this you might want to say model a probability
distribution over a bunch of random variables X, where each variable, say X1, models a particular user,
user one. So given a model for the probability distribution over all these random variables, you
could ask queries, like probability of some set of variables given another, which could translate
to predicting some users’ interests given what you know about others. So the general
framework I'll talk about is Markov Random Fields, or MRFs, where, of course, an edge in this graphical structure will indicate a direct dependence between two variables. Filling out this graph gives this structure, which essentially encodes our statistical assumptions, and then we'll factorize this model by writing it as the product of factors, which I'll write as psi, and each of these will be functions over small sets of random variables, which will correspond to edges, in this case, or perhaps hyperedges, in our graphical model. And so
in this, if we fill out the rest of the factors, of course we have a fully specified probability
distribution; and even though I'll talk about MRFs, all of these results generalize to conditional random fields as well.
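[Editor's note: in symbols, the factorization being described is the standard MRF form; the notation is editorial:]

```latex
P(x) \;=\; \frac{1}{Z}\prod_{c}\psi_c(x_c),
\qquad
Z \;=\; \sum_{x}\prod_{c}\psi_c(x_c),
```

[where each factor psi_c is a function over a small set of variables x_c, corresponding to an edge or hyperedge, and Z is the proportionality (normalization) constant.]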
So this setup, of course, is a very principled statistical and computational framework; there's been a whole lot of work, including a lot from here, and many applications of graphical models showing they are quite useful. So, within this, I'm going to be talking about the parameter learning problem, where, given the structure and data sampled from P star, which is the target distribution, we'll want to learn the parameters, which are the values of these actual factors. The traditional method is maximum likelihood estimation, or MLE: we want to maximize, with respect to our parameters, the expectation over our data of the log probability of seeing that data point. And of course, MLE is,
in a sense, a gold standard in that it's optimally statistically efficient and in the infinite sample
limit, no method is really going to be better than it. But of course the problem is that in this
loss, you have this probability over the full distribution, and computing this probability is
difficult because of this proportionality constant. And estimating that proportionality constant
requires inference over the entire model, all of X, and this has been shown to be provably hard,
in general. Of course, being tractable in some cases, such as if the graphical structure is a tree.
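[Editor's note: written out, with theta denoting the parameters, the MLE objective being described is roughly:]

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \text{data}}\!\left[\log P_{\theta}(x)\right]
\;=\;
\max_{\theta}\;
\mathbb{E}_{x \sim \text{data}}\!\left[\sum_{c}\log \psi_c(x_c;\theta)\right]
\;-\;\log Z(\theta),
```

[and it is the log-partition term, log Z(theta), the proportionality constant, that requires inference over the entire model.]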
So given that inference is hard, the question is: can we learn without intractable inference? And, in parameter learning, there have been a bunch of works, often using approximate inference or approximate objectives; but the problem is that most of these works lack really strong guarantees for general types of models, especially if the model is not a tree or the like. And so our solution is going to be to use MLE as a baseline, which, as we
the like. And so what our solution is going to be is to use as a baseline MLE, which, as we
stated, has optimal sample complexity but requires a lot of computation, and as we’ll see, is
difficult to parallelize. I'll then talk about pseudo-likelihood, which is this method which
essentially breaks the problem into a separate optimization for each variable in your model,
and as we'll see, it has higher sample complexity, but much lower computational complexity
and easy parallelization. And actually, our analysis gives the first finite sample complexity
bounds for general models.
I'll then talk about composite likelihood, which is a method that ranges between MLE and pseudo-likelihood, allowing you essentially to choose substructures in this Graphical Model to get a more structured estimator for the parameters. What that will let us do is, in many situations, choose this estimator structure in order to optimize these trade-offs and get the best of sample complexity, computation, and parallelism for many problems. So, first, to motivate the idea of pseudo-likelihood: MLE is essentially estimating this entire distribution, P
over X at once. And what pseudo-likelihood is going to do is start by observing that if you take
the statistical assumptions encoded in the graph, they essentially say that, for example, the probability of one variable, X1, given the rest of the graph, which is just given its neighbors, is proportional to the single factor psi 1,2. So you could imagine doing regression, which we talked about in the first part of the talk, in order to get an estimate of this one factor. You can imagine doing this for every variable in your model, for example, the probability of X2 given its neighbors, and estimating that via regression would get us the estimates of these other factors.
And you can note that we run into some issues with multiple estimates of these factors. But
you can actually show that you can average these in log space and still get good guarantees. So
what I'll call this is pseudo-likelihood with disjoint optimization, where you regress each variable on its neighbors; that gives you factor estimates, and then if you have duplicates, you average them together in log space. I'll also talk about pseudo-likelihood with joint optimization, which is essentially the same problem, but you do parameter sharing when doing the optimization. And this is actually how pseudo-likelihood was originally presented. The key is that this formulation allows tractable inference, where for each of these subproblems, we have the probability of a single variable given its neighbors. And in order to compute this probability under our model, we just have to compute the proportionality constant by summing out a single random variable, in this case, X1.
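[Editor's note: in symbols, with editorial notation, pseudo-likelihood regresses each variable i on its neighbors N(i), and each conditional normalizes over only that one variable:]

```latex
\max_{\theta}\;\sum_{i}\,
\mathbb{E}_{x \sim \text{data}}\!\left[\log P_{\theta}\big(x_i \mid x_{N(i)}\big)\right],
\qquad
P_{\theta}\big(x_i \mid x_{N(i)}\big)
  = \frac{\prod_{c \ni i}\psi_c(x_c;\theta)}
         {\sum_{x_i'}\prod_{c \ni i}\psi_c\big(x_{c\setminus i},\, x_i';\theta\big)} .
```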
So to compare MLE with pseudo-likelihood: MLE estimates the full model at once; pseudo-likelihood essentially regresses each variable on its neighbors. MLE is going to be intractable, in general; pseudo-likelihood will be tractable. MLE has been shown to be optimally statistically efficient, but people have shown that pseudo-likelihood is often empirically successful. MLE has had finite sample complexity bounds, meaning people know its behavior in the finite sample case, but before this work, pseudo-likelihood did not. And so you often heard it referred to as a sort of heuristic. But we'll see we can cross that off. Yes.
>>: [inaudible] when you say [inaudible] so the model is the same. Right? So how is the
inference becoming tractable in one case and intractable in the other? I mean, the model is the
same [inaudible]?
>> Joseph Bradley: So I'm talking about the inference required during learning, not inference at test time. That actually raises a number of interesting questions. At learning time, though, the loss you're optimizing doesn't have the probability of the full distribution in it.
>>: So in the sense you’re changing the model itself.
>> Joseph Bradley: It's an interesting, it's hard to know exactly how to phrase it. I would
phrase it I guess as changing the loss, and optimizing that loss does give us guarantees with
respect to our original model.
>>: You can actually change the model [inaudible] dependency [inaudible]. You just give up on
the graph.
>> Joseph Bradley: That is another option.
>>: The way I see this, you are constraining your potentials into essentially [inaudible] model.
That's how I'm seeing it.
>> Joseph Bradley: Right.
>>: [inaudible] the loss. [inaudible].
>> Joseph Bradley: I guess it's hard to say. I mean, maybe you could phrase it another way.
>>: Can you go back to where you were-
>> Joseph Bradley: I'm not sure what the model would be, I guess, which would be elicited by this loss.
>>: Can you go back to the [inaudible]?
>>: You said you only change the training, only for training time not testing time. You're not
actually optimizing that opt.
>> Joseph Bradley: That's true.
>>: But does that mean you change the model? I mean, I guess it’s semantic. I wouldn’t go
changing the model.
>> Joseph Bradley: Right. I mean-
>>: You're changing parameters-
>> Joseph Bradley: There might be something interesting, where we'd say this is actually optimizing
MLE with respect to this model, but I'm not sure what that model would be unless it's, right. I
mean, the closest, I guess, is dependency networks, which seem very relevant, but I'm not sure if that's right.
>>: But for small scale problems, suppose you construct something very small from which you
actually can compute the likelihood. Do you compare the likelihood [inaudible]? How much
difference is it going to be? [inaudible]?
>> Joseph Bradley: Right. I will show comparisons between like how accurate the parameter
estimates are from pseudo-likelihood versus optimizing.
>>: [inaudible].
>> Joseph Bradley: Versus optimizing MLE. If that's your question.
>>: I'm more thinking about if you use a pseudo-likelihood, you change the loss [inaudible].
>> Joseph Bradley: Right.
>>: And then you estimate the parameter from the parameter. From the parameter that you
estimate which is [inaudible], then you can compute the likelihood for those [inaudible]
estimates through pseudo-likelihood.
>> Joseph Bradley: Yes, you can.
>>: [inaudible] for small-scale [inaudible].
>> Joseph Bradley: Right, and that's something I'll show.
>>: Okay.
>> Joseph Bradley: Good point. Okay. So where were we? Right. So, talking about the finite sample complexity analysis of pseudo-likelihood. The result I'll actually phrase in terms of MLE. So first, the sample complexity result for MLE, which will be very similar to the pseudo-likelihood one, looks like this: to achieve a given error epsilon, we need at most n training examples, where n is around this amount. Here we have epsilon, the L1 norm of the parameter error, which is actually normalized by the number of parameters; R is the number of parameters; delta is the probability that this bound doesn't hold, very common in these analyses; and lambda n is an eigenvalue condition which I'll discuss a little bit later. But first, what I'd like to show is the analogous bound for pseudo-likelihood, which you notice is essentially the same, except that this lambda n value is going to be different.
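[Editor's note: hedging on constants, the shape of the bound as reconstructed from the spoken description is roughly as follows, with epsilon the per-parameter L1 parameter error, R the number of parameters, delta the failure probability, and lambda the eigenvalue condition called "lambda n" in the talk:]

```latex
n \;\gtrsim\; \frac{1}{\lambda^{2}\,\epsilon^{2}}\,\log\frac{R}{\delta} ,
```

[which is consistent with the later remark that, for a fixed n and ignoring the log term, the error should grow like one over lambda n.]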
So before I explain that, I want to point out that yes, as I said, these are the first finite sample complexity bounds for very general models, and when you add this together with tractable inference for many classes of MRFs and CRFs, you can actually show that this implies PAC learnability. And basically what you need to do is control how that lambda n value grows with respect to whatever problem parameter you're defining your class of models with respect to. So given this, what I need to do is explain lambda n. Yes.
>>: What does error mean here?
>> Joseph Bradley: Error? So it's the L1 norm of the parameter error; we have an optimal parameter vector, and, right. And it's also normalized by one over R, the number of parameters. So, lambda n: we'd expect the number of examples needed to be around one over lambda n squared, so we expect this eigenvalue to be important. And I won't take too much time to talk about this, but essentially it's an eigenvalue condition which measures the curvature of the objective, where greater curvature in this objective is going to make it easier to learn. And the important thing really is how it varies for MLE versus pseudo-likelihood. Essentially, MLE is going to have a larger lambda n, implying lower sample complexity, but of course it requires more computation. Pseudo-likelihood will have a smaller lambda n, higher sample complexity, but of course, less computation. So we have this sample complexity versus computational complexity trade-off.
And speaking of trade-offs, I’d like to point out one more which is specific to pseudo-likelihood.
Now recall that pseudo-likelihood is essentially taking each variable and regressing it on its
neighbors. But I mentioned you can do it both with joint optimization with shared parameters
and disjoint, where you average duplicate estimates later. And the bounds I've been showing
have actually been for joint optimization which will have lower sample complexity based on our
bounds. And with disjoint optimization, we get slightly worse bounds on the number of
samples required, but of course that's completely data parallel. And so we have something of a
sample complexity versus parallelism trade-off.
So finally, looking at the predictive power of the bounds. On the left here is lower sample
complexity or lower error for a fixed number of samples, on the right, higher. And so we'd
expect to see MLE being the best in this respect, then pseudo-likelihood with joint optimization, and then disjoint. And so what I'm plotting here is a synthetic example where we can compare with the ground truth: x-axis, the number of training examples on a log scale; y-axis, the parameter error, lower being better. And so you can see with maximum likelihood
we do indeed learn, decreasing the error. Pseudo-likelihood is a bit worse, especially at the low
sample regime, and pseudo-likelihood with disjoint optimization is a bit worse, just as we'd
expect. So the sample complexity [inaudible] actually occurs in practice.
>>: [inaudible] disjoint optimization that you have. You basically different [inaudible] model
have nothing to do with each other at same bounds, and they're very [inaudible] assumption.
They both have the same sets of variables and so it seems that they could get very different
potentials?
>> Joseph Bradley: There are definitely cases where that does happen and where that doesn't.
And it's essentially actually going to be this like lambda n value which hides all of that behavior.
In the actual proof, you could imagine that this joint optimization proof as looking like a proof
showing that simple regression works and then basically showing that if simple regression
works, each regression will be reasonably accurate, and so averaging their estimates together
will still be reasonably accurate. And our proof just used a simple-
>>: [inaudible] create for a different model. You won't get accuracy for the same problem. I tell you X1 depends on X2, and then I tell you X2 depends on X3.
>> Joseph Bradley: In terms of the, so it would be for the same model in the sense that both
problems would be using the same set of statistical assumptions where the Graphical Model’s
encoding the statistical assumptions, and each problem would be still working with that same
Graphical Model, although with a different loss, I agree. But the guarantees would be with
respect to the same model.
>>: One thing I'm [inaudible]. How would this sort of like [inaudible] resolve [inaudible] like,
you know, basically what they call [inaudible] potentials? Where, you know, X1, X2 should be
different? X2 and X1 should be different as well. I mean, you have basically a kind of, you
understand what I'm saying?
>>: Like a [inaudible] model?
>>: Basically yeah. [inaudible] model where everything is not [inaudible]. And those cases the
pseudo-likelihood will quickly fail.
>> Joseph Bradley: So, in that case, I'm not sure what the eigenvalue condition would behave
as. We definitely, I guess, did not actually test it on models where there were completely all repulsive potentials; only on ones with a mix, and with all attractive. But it-
>>: People have looked at pseudo-likelihood for differences on [inaudible], and they actually found that when the potentials are not attractive, pseudo-likelihood tends to diverge quite a bit.
>> Joseph Bradley: Right. So I think that's something I would like to test on data with a ground
truth. I think if you have a model which is, right, maybe completely repulsive, I don't know. I'd
have to explore that more. If you have some repulsive, like problem areas that I think actually
the method I wanted to talk about next would be quite useful.
>>: So can you show me what type of [inaudible]?
>> Joseph Bradley: Right. I should know my slide numbers. Right. So for MLE, it's the minimum eigenvalue of the Hessian of the objective, the log probability over all of X, at the optimum. For pseudo-likelihood, it's the minimum over each variable of that eigenvalue condition for that local loss.
>>: It's for the true likelihood minimization. Not for what you are actually [inaudible]. So runtime is a measure on every problem. It's a measure on your [inaudible]-
>> Joseph Bradley: So both are with respect to, like, include the target distribution, but I
mean, in terms of the loss from which this Hessian is computed, that is loss specific. That's
different for the two. But both involve expectations over the target distribution.
>>: So can that be zero then? Because I mean, I think it’s [inaudible]-
>> Joseph Bradley: So there are. It's actually the smallest nonzero eigenvalue. If you do have zero-
>>: In the case of like infinite data, the pseudo-likelihood is going to converge to the same
parameters as the MLE.
>> Joseph Bradley: So that's a simplification. Really, it's that, I mean, in all cases, if you have an overcomplete representation, really what you'll converge to is something which is of the same rank and can be transformed, but it is going to be equivalent in terms of the loss.
>>: Yeah. So the parameters might not be the same, but the probability of distribution
[inaudible] by the two sets of parameters would be the same.
>> Joseph Bradley: Right. I think you would always need a-
>>: If you have intractable inference, surely the learning has multiple [inaudible] minima. So
there's something [inaudible].
>> Joseph Bradley: Intractable inference?
>>: Yeah. I mean if the underlying Graphical Model is not [inaudible], are there multiple
minima for the learning?
>> Joseph Bradley: So when I talk about MLE, I'm not talking about using approximate
inference. So it would still be convex. If you threw in some types of approximate inference,
then certainly, you'd run into problems with non-convexity. Right.
>>: What about for [inaudible]?
>> Joseph Bradley: Right. And I should say, perhaps, that the types of models I'm looking at are log-linear MRFs and CRFs. I should've said that earlier, perhaps.
>>: Okay. So [inaudible].
>> Joseph Bradley: If you have latent variables, that is definitely an interesting question and it's
something as well as the semi-supervised case, which I can't really talk about here, but would
like to look more at in the future. And there are some simple ideas, which you can immediately
read from this composite likelihood method, which I wanted to talk about later, but yeah. This
is not immediately applicable to that. Yeah. Other than through something like EM, right.
We'd seen that basically the ordering of these, in terms of error and sample complexity, is what we'd expect. Looking further at the predictive power of these bounds, in terms of sample complexity we have our bound like this, and what I'd like to say is: first, ignoring the log term, which is, as you'd expect, not that significant in practice, and fixing the number of training examples, what you might expect to see, if this bound is reasonably tight, is that the error increases as approximately one over lambda n. And what I want to argue here is that
lambda n is important in controlling the difficulty of learning.
So here if you look at one over lambda n, so harder problems going to the right and actual error
on the y-axis, you can indeed see that for what we'd expect to be easier problems, we get lower
error for a fixed number of training examples, and vice versa. So lambda n does indeed seem to
be important in controlling this difficulty of learning.
>>: [inaudible]?
>> Joseph Bradley: Controlling it. So basically generate a whole bunch of random models and
each of those models is a point on this and generate enough that we get points along the whole
line.
>>: Have you thought about using a different objective function here, like the error of the final
joint probability, rather than the L1 error in parameters? Because at the end of the day, it's
the probability that [inaudible].
>> Joseph Bradley: Right. So our analysis sort of does it in two steps where we bound the
parameter error given a number of training examples and then the loss in terms of the
parameter error. That second bound is not that interesting, but I do think that going directly
from bound on the loss to the number of training examples required would definitely be
interesting to do. It wasn't super clear how to do that in the analysis, but that would definitely
be valuable. Okay. By the way, can I ask when I should go until?
>>: [inaudible].
>> Joseph Bradley: Okay. Thank you. Good. In terms of what this lets us do, given that we
have this lambda n value, which seems to control the difficulty of learning with MLE and
pseudo-likelihood, what I’d like to say is how this varies for different types of models. So what
I'll do is compare this lambda n ratio between MLE and pseudo-likelihood. So first, looking at
model diameter. And here I'm just looking at chains as they increase in length to the right; and
on the y-axis, this ratio between MLE's lambda n and pseudo-likelihood's, where basically higher means that MLE is better. You can see that, other than end effects, the performance of pseudo-likelihood, as you might expect, doesn't change that much as chains get longer. So I'll call that
not a problem scenario. In terms of factor strength, meaning basically the magnitude of
parameters or how strongly variables directly interact with each other, as you increase that
going to the right, this is on a log scale, you can see that this ratio actually does shoot up. So
you can see that for very strong factors, pseudo-likelihood can actually run into problems.
Finally, for node degree. As you increase the degree of a node, this ratio again increases, albeit
not as quickly as with factor strength. So we might call that a problem scenario for pseudo-likelihood.
But what I'll now talk about is that we can often fix this using a method called composite
likelihood. So MLE is essentially estimating the entire model at once. And pseudo-likelihood is
essentially taking each variable and regressing it on its neighbors. You can imagine the natural generalization: instead, take a chunk of variables and regress it on its neighbors. And you can see that this generalizes both MLE, where we have a single chunk containing all the variables, and pseudo-likelihood, where each chunk is a single variable. And so
what we show are similar sample complexity bounds for joint and disjoint optimization, just like
you saw for pseudo-likelihood; they take the same form, so I won't show them again. But I do
want to emphasize that we analyze structured composite likelihood, meaning that we really
paid attention to how these components were structured, especially with respect to our model
and data. And that's something which, surprisingly, we haven't seen that much in previous
literature. In previous literature, often these chunks of variables were just, for example, all sets
of two variables or three variables in your model. And it really benefited us to look at the
structure.
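[Editor's note: in symbols, with components A_j (chunks of variables) and editorial notation, the composite likelihood objective is roughly:]

```latex
\max_{\theta}\;\sum_{j}\,
\mathbb{E}_{x \sim \text{data}}\!\left[\log P_{\theta}\big(x_{A_j} \mid x_{N(A_j)}\big)\right],
```

[where N(A_j) is the set of neighbors of component A_j; a single component containing all variables recovers MLE, and one singleton component per variable recovers pseudo-likelihood.]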
So the obvious question is how do you choose these estimator structures. And what we do is
look to our experiments with pseudo-likelihood for guidelines. First, we limit each component
to trees so that inference within that component will be tractable. We'd like to follow the
structure of our model in order to avoid those failure modes of pseudo-likelihood where, for
example, if we have a star structure, we'd like to cover it with a single component. If we have a
strong factor, we'd like to cover it with a single component, and we'd also like to try to choose
large components or minimize the number of components in order to be intuitively closer to
MLE than to pseudo-likelihood.
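[Editor's note: as a concrete illustration of these guidelines, here is a sketch of covering a grid with two tree-structured "comb" components; this is an editorial construction, not necessarily the exact combs on the speaker's slide.]

```python
def comb_components(m, n):
    """Cover an m-row by n-column grid MRF with two spanning-tree 'combs'
    whose union contains every grid edge; duplicated factor estimates can
    then be averaged in log space, as in disjoint optimization above."""
    horiz = [((r, c), (r, c + 1)) for r in range(m) for c in range(n - 1)]
    vert = [((r, c), (r + 1, c)) for r in range(m - 1) for c in range(n)]
    # Comb A: spine along the top row, with all vertical edges as teeth.
    comb_a = [e for e in horiz if e[0][0] == 0] + vert
    # Comb B: spine along the left column, with all horizontal edges as teeth.
    comb_b = [e for e in vert if e[0][1] == 0] + horiz
    return comb_a, comb_b

# Each comb has m*n - 1 edges and is connected, hence a tree, so inference
# within each component stays tractable.
```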
>>: Do you have to know beforehand which factor [inaudible] or-
>> Joseph Bradley: It's an interesting question, and it's sort of ongoing work: how you can
adaptively, for example, get a rough estimate of your distribution in order to choose a much
better estimator and then get an accurate estimate. Right. Unless you have some kind of prior
knowledge. Yeah. So, for example, given this grid, we might choose these two combs to cover
it, where each is a component in our composite likelihood estimator, and it sort of follows our
guidelines to the right. So showing some examples, experiments with this, here is a synthetic
grid model, where we are learning with a fixed number of training examples; and here I'm going
to show increasing grid size to the right, and on the y-axis, comparing the log loss of one
method versus MLE.
So here it's saying that pseudo-likelihood is just as good as MLE for a single node, since it’s the
same, but as the grid gets bigger, MLE is actually going to do better, pseudo-likelihood worse,
as we'd expect. Now if we use these two combs for composite likelihood, we make a significant gain. But what's really the clincher is that if you look at training time, MLE takes a lot longer than pseudo-likelihood, and using this structured composite likelihood, while limiting ourselves to trees, doesn't actually require more training time. Yes.
>>: How does it compare when you use approximate inference with maximum likelihood estimation?
>> Joseph Bradley: Yes. So this is actually using approximate inference; with exact, it was not comparable. Here it was with Gibbs sampling, and there is certainly a lot of tuning which could be done there, yeah.
>>: But Gibbs Sampling is not the [inaudible], right?
>> Joseph Bradley: I think it varies based on the problem.
>>: [inaudible] it can to do-
>> Joseph Bradley: Certainly you could do like belief propagation or something. And, yeah. It
would definitely merit like further comparisons. There's a lot of tuning to be done there.
>>: [inaudible] MLE training time is based on Gibbs Sampling.
>> Joseph Bradley: It is.
>>: But you're using exact for the pseudo, for the composite, you're using exact.
>> Joseph Bradley: That's right. So for this, composite likelihood essentially lowers sample complexity without increasing computation, which is what we'd hoped. But I do want to emphasize that these estimators were tailored to the model structure; and I won't talk about it here, but it's also useful to tailor them to the correlations in the data, the strong factors, for which you need either expert knowledge or some adaptive method.
So to sum up here, we looked at finite sample complexity bounds for general models and
basically showed, first, that pseudo- and composite likelihood are not heuristics; you can get real bounds for those, and this allows PAC learnability for certain classes of models. And then we
looked at structured composite likelihood, where we gave some guidelines for choosing
estimator structures based on failure modes of pseudo-likelihood, especially trying to tailor
those estimators to the model and data.
So that let us do this sort of decomposition, which was model specific, and get the best of all these
trade-offs in certain cases. So what I've been trying to argue is that we can scale learning by
using different types of decompositions which let us trade off quantities like sample complexity,
computation, and parallelization. And these decompositions used model structure and locality,
and the trade-offs were tailored to the model and data. So, for example, in parallel regression,
we decomposed over coordinate updates, and we were able to choose the number of parallel
updates according to correlation in our data. In parameter learning, we decomposed the loss
into subgraphs, essentially, and were able to tailor that decomposition according to both the
original model structure and the strength of factors in our data. And I then also looked at
structure learning in the thesis where we saw some similar trade-offs, but I didn't have time
for it here.
So there are a number of future work things I'd like to look at. First, in terms of parallel
regression, I've just talked about the multicore setting, but I'm very interested in the distributed
setting where you have limited communication, where you both want to limit the number of
messages you need to send, as well as the size of those messages; and really taking advantage of sparsity in those messages is a challenging problem. I'm also interested in
heterogeneous parallelism, where you might have multicore, plus distributed, plus perhaps
GPU, and you’d like to parallelize over multiple facets of your problem, for example, both
examples and features. There are also a lot of interesting questions when you start talking
about, for example, parameter learning, where you have more structured objectives. And for that, I think a really interesting question is how to do partly joint optimization with composite likelihood. And there I think methods such as the alternating direction method of multipliers might be very applicable to this.
So another thing I'd like to look at in parameter learning is I’ve sort of talked about some
guidelines for how to do these trade-offs, but there's a lot of work which still needs to be done
in order to create sort of an automated learning system, which can automatically choose the
structure and parallelization strategy. So there are a lot of trade-offs in terms of sample complexity and computation, but also things I haven't talked about, such as, in parallelization, how you balance the number of components you can parallelize across versus the amount of communication required if you're doing partly joint optimization. And I think an interesting way of phrasing this might be, for example, graph partitioning, where some questions are: how do you even estimate or compute this lambda n, given a model and estimator, cheaply of course? And how do you balance these various objectives and constraints in this problem in order to choose estimators and divide their components across machines?
So in the interest of time, I'll just get through this and mention that there are a number of
applications I'm very interested in. First, in terms of social network modeling, I think that there
is a lot to be done in terms of applying sort of Graphical Model formalisms to social networks.
And there I'm really interested in both, in first, generative models, which model both the
network structure and the semantic content, like text and images in those networks; and there,
I think, there are a lot of interesting questions about both scalability, up to social network size, and how to jointly model those aspects. I'm also interested in looking at some
work in temporal modeling and how to modify the methods I've talked about for the semi-supervised case. I've also done some initial work, which I'm interested in continuing, with
machine reading, where an example goal might be to build a probabilistic knowledge base from
text on the web, and there I think difficult questions are, if say, you phrase this as a big
Graphical Model, how you do focused learning and inference within that model, since you really would need to prioritize at that scale; as well as how to do active learning in order to
keep building this knowledge base, but minimize the amount of human supervision and
interference required. So with that, I'll leave a summary of this talk here, and thank you very
much for listening.
>>: Have you ever [inaudible] divergence compared to pseudo-[inaudible]?
>> Joseph Bradley: Right. I've looked at it some, but have not done a full comparison, and I
think it's a really interesting case where it sort of mixes ideas of, from several cases like
stochastic gradient with approximating the objective, and it would definitely be interesting to
look at.
>>: Did you look at [inaudible] so that you propose on the generalization of [inaudible]? The
reason I estimate, but the thing I really want to do is ask a question about [inaudible]? So
maybe [inaudible]? So [inaudible] you could have a lower [inaudible] complexity to which you
give [inaudible] error?
>> Joseph Bradley: Right. So in terms of, you're asking about going more directly from some
sort of loss to-
>>: Yeah. So I guess the bound is hard but I think kind of [inaudible]-
>> Joseph Bradley: Right. It is definitely an interesting question. I would like to look into it
more. I think somewhat relatedly, one initial thing I've been thinking about is how to make
these bounds not the sort of like wide sweeping bounds over all your parameters, but really be
more parameter specific, where you know, you can imagine some parts of your model are easy
to estimate and some are hard; and that in and of itself would let you perhaps get better
bounds with respect to the loss you care about, but also would perhaps, I may not have fully
understood your question, but I think would help in cases where what you really cared about
was a particular part of your model and wanted to estimate it well. And then you might have
sort of model part specific bounds.
>>: [inaudible] composite likelihood, parallelism is done by doing, you know, different parts
simultaneously.
>> Joseph Bradley: That's right. That's how I've been thinking about it.
>>: Okay. So [inaudible] would it be more parallel.
>> Joseph Bradley: That's right.
>>: Do you have a kind of curve to show what kind of speedup you get for that the earlier use
of error [inaudible]?
>> Joseph Bradley: Right. I don't. I think that in terms of disjoint optimization, the curves I
showed really are immediately applicable since that's completely data parallel. In terms of
cases where you're doing joint optimization, but distributed, that's something I'm very
interested in looking at; and there, I think, really, more analysis is also needed in terms of
seeing how joint optimization behaves and with the hope that you don't really need to make it
fully joint but can, you know, essentially stop early. Right.
>>: And how about the Graphical Model? Do you [inaudible] in terms of-
>> Joseph Bradley: I have not. In terms of parameter learning, I guess that's, in a sense, more
straightforward since you don't have to do inference over the full model.
>>: Oh, I see. Okay.
>> Joseph Bradley: But in terms of structure learning, that's very interesting. In structure
learning, I stressed looking at undirected models. Thanks.