>> John Platt: Okay. I'm happy to introduce Ali Rahimi from Intel Labs Berkeley.
>> Ali Rahimi: Hi. I'm going to talk about random kitchen sinks, but before I get into it,
I want to just make sure everybody can pace themselves through the talk. I'm going to
start with really lightweight stuff, and then we're going to ramp up slowly, and then I'll do
experiments and you can turn your minds off. And then I'll hit you again with some
math. And then this is new work and it's not published. So we'll try to breeze through it,
but it's still pretty mathy.
I'm going to start here. So this is to give you a little bit of context about where all this
work comes from.
There's a new trend in AI. Back in the day when we wanted to build smart things, we
would start with a really complicated statistical model, like Bayes nets, where inference
was NP-hard and learning was even NP-harder. And we gave it some data and a few
thousand examples. And we trained some intelligent thing.
And something happened in the late '90s where instead of models like this, these really
structured statistical models, we started using more generic models, nonparametric
models, like RBFs.
And to compensate for the lack of statistical structure in these models, we started feeding
these models lots and lots of data. Domain knowledge started to come from here instead of
from here.
And the nice thing about these generic models is that the optimization problems tend to
be convex. At least they're in P. But because we have so much data, the optimization
problem in practice ends up taking a lot of time.
So this talk is really about tricks for making these types of optimization problems on
these types of datasets go faster to support the new kind of artificial intelligence that
we're doing these days.
So I'll give you a few examples of this trend from here to here. Here's an example from
Alyosha Efros. This is a problem where you want to -- you're given an image, you say,
oh, I really don't like these houses being here, so you blot them out, and then you want to
fill in this blotted out part with something pretty and relevant.
So Alyosha these days takes a very data-heavy approach. He just crawls through millions
of images in Flickr and finds little patches and substitutes the patches in here. It's a very
heavyweight, heavy data-driven approach.
Contrast this with something he did ten years ago where he actually had a pretty
sophisticated HMM-based model, not very data driven, but the model was a lot more
complicated. This works a lot better.
Here's another example. This is work from Antonio Torralba. Here's -- each of these
little cells is a tiny image. There's 10 million images in here. And he uses this dataset to
do object recognition. Extremely data heavy. The operation that it goes through is just
the nearest neighbors search in these 10 million images.
Compare that with what we did a while back where they actually built this pretty
complicated discriminative model that can take into account the spatial relationships
between objects and wasn't trained on that much data.
This works really well.
Here's another example from Greg Shakhnarovich. Here's a pose detector. I hear
Microsoft recently solved this problem in the Xbox. So the goal is to recover the 3D pose
of the human body. And Greg's approach here is he takes a graphic simulator, generates
150,000 examples of a fake person under random different poses and just matches this
image against this image using a very simple distance metric.
Contrast this with something the same group did ten years ago that involved actually
reasoning about the 3D geometry of the shape in real time and trying to match it against
the 2D image and recover the 3D shape of the body. This works amazingly well and
it's fast.
My own motivation for this stuff is building object recognition systems that you can train
on, that can recognize millions of objects in your real world.
So, you know, this is the system trained on about 30 objects. Anyway, runs in real time.
But as you scale the number of objects, it goes more and more slowly and the accuracy
drops. So it would be neat to take systems and be able to scale them up, just the same
way those previous examples I showed you work.
Part of the reason this trend is happening now is we have access to a lot more data than
we did before. I think. I'm speculating here. We have really high-fidelity simulators like
the graphics simulator that Greg was using to generate these body poses. We have the
Web that has lots of images and annotations on it.
And ever since the MacArthur Foundation started giving grants to people for building
games, there's a lot of -- mechanical turks, too, there's a lot of hand-annotated stuff that
you get off of the Web.
So this talk is about supporting this trend. And I'm going to show you two tricks that I've
been playing with. I'll start with the random features trick. This is a way to speed up
kernel machines. So a little bit of background on kernel machines.
Here's a classification task. So this is a trick for speeding up your classifier. And I'm
going to tell you about the classification problem a little bit, I'll tell you about the kernel
machines, and I'll tell you about how we speed them up.
So in this classification problem you got a space, a bunch of points. The points are
labeled among the two classes. And you're trying to find a decision surface between
them. Linear decision surfaces don't always separate your two classes well, so one would
like to consider nonlinear decision surfaces.
And in kernel machines, the decision surface -- the form of the decision surface that we
use is a weighted sum of kernels placed on your training examples.
So there are N parameters here. There's a kernel that we define. Maybe it's a Gaussian or
something like that. And we place the kernel on each one of these points, and then we
come up with a good weighting and that will describe the family of curves in this space.
So this is -- turns out -- I mean, this is well known. A function of this form when this
kernel is positive-definite is equivalent to a linear function in a featurized space of the
input.
And these features are such that they satisfy this relationship. So the kernel effectively
maps the features, maps your inputs into some feature space and then takes the product in
those spaces. Okay. So this is a review of kernel machines.
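To make that concrete, here is a minimal sketch in Python/NumPy of a decision function of that form, with a Gaussian kernel and made-up data and weights (everything here is illustrative, not from the talk's slides):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): a positive-definite, shift-invariant kernel
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def decision_function(x, X_train, alpha, sigma=1.0):
    # f(x) = sum_i alpha_i k(x, x_i): one kernel bump per training point,
    # so there are N weights alpha to fit
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, X_train))

# toy usage with made-up numbers
X_train = np.random.randn(5, 2)
alpha = np.random.randn(5)
print(decision_function(np.zeros(2), X_train, alpha))
```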
This is a really neat trick because whereas you would normally be trying to fit a decision
surface in an infinite dimensional space, right, so this feature, for example, in general can
be an infinite-dimensional feature mapping, whereas you would normally have to find an
omega and some infinite-dimensional space.
This kernel trick lets you search for N parameters only. So you can start implementing
these things inside computers, which is great. And it even has this nice interpretation,
like I said. Instead of searching for curves like this, you map your data into a potentially
infinite-dimensional space implicitly, and then linear decision surface in that space.
So this works really well. The problem with these kernel machines is that they force you
to deal with these enormous matrices. If you have ten million training examples, one
way or another you're going to have to represent a 10-million-by-10-million matrix
whose entries consist of the kernel evaluated on pairs of your training data points.
I made this really big to take up the whole screen to emphasize how big these matrices
can be.
So you can do infinite-dimensional things in finite-dimensional computers with the
kernel trick, but these things are still huge. And so some researchers have come up with
very popular tricks for dealing with matrices like this.
Here's another trick. The trick is -- well, we talked about how these kernels actually
compute inner products between the featurized inputs.
So what we're going to do is instead of dealing with the kernel or with this
infinite-dimensional feature mapping, we're going to find a finite-dimensional feature
mapping, in fact a low-dimensional feature mapping, such that instead of having to take
the inner product in this infinite-dimensional space, you can just take the dot product of
the featurized inputs.
So just like with kernel machines, your decision surface took this form. With this
machine, because we're now using finite-dimensional features, your kernel machine just
takes this form.
And to go back to this diagram, the idea, again, is we're going to define these nonlinear
decision boundaries, we're going to map our data into some finite-dimensional,
low-dimensional space, and then train a linear decision boundary there.
And this mapping is going to be such that this relationship holds. So instead of training
your kernel machine with kernels and dealing with these enormous matrices, we're
actually going to randomly featurize your inputs and then train a linear classifier in this
relatively low-dimensional space.
And we're going to guarantee that the resulting classifier is going to be close to the
classifier we would have gotten.
And I'm going to tell you about two different types of random features. One of them are
Fourier random features, and they're based on the Fourier transform of the kernel. And
another one is a discrete random feature that's based on gridding up the space. I'll go
through both of them.
The proof for why this works is really simple. It's four lines and I think it's kind of neat
to look at. So I'm going to just pop up some math, and I'm going to walk through it
because this works -- this is just really neat.
So the trick is we would like -- we're given a kernel. I forgot to mention, this only works
with shift-invariant kernels, so you have to be able to represent the kernel like this. And
at the bottom we're going to get -- we're going to derive these random features such that
this relationship holds. We want the inner product between the random featurized inputs
to almost be equal to the value of the kernel.
And I'll walk you through here. So step one. Pick the Fourier transform of your kernel.
So this is just the standard Fourier transform that you learn about in elementary school.
Step two. So this is an integral. We're going to replace the integral. So P is the Fourier
transform of the kernel. We're going to replace the integral by approximating this
integral with an average. So treat this Fourier transform as a probability distribution.
Draw samples from it.
>>: Sorry.
>> Ali Rahimi: Yes.
>>: Going to the previous notations where Xs were vectors, you're now in a scalar
space?
>> Ali Rahimi: No. Xs are still vectors. Okay.
>>: So these are multidimensional?
>> Ali Rahimi: Yeah. So this is -- so omega prime times X minus Y is actually an inner
product between omega and the vector X minus Y.
>>: Aren't you missing a WI there? You wanted a weighted set of --
>> Ali Rahimi: Here?
>>: Or in the sum?
>> Ali Rahimi: No. Because I'm drawing from P of -- so what I didn't -- what I slipped
under the rug here is that because this kernel is positive-definite, the Fourier transform --
so here's a theorem: the Fourier transform of a positive-definite kernel is nonnegative.
This is Bochner's theorem. You don't learn that in elementary school
for some reason.
>>: You don't?
>> Ali Rahimi: I meant -- sorry -- in systems -- signals and systems. In Alan Willsky's
book that has all these Fourier identities, this identity is not there, unfortunately. And it's
a really powerful one.
So the point is that we can treat this Fourier transform as a probability distribution. It's
positive. You can sample from it.
So let's approximate this integral using a sample average. And now I'm just going to
rewrite this summation. I'm going to split up each of these terms into a product.
And then I'm going write this in vector form. This vector depends only on X. This
vector depends only on Y. And we have our random features. So the Fourier random
feature, really what it's doing is saying if you want to compute K of X and Y, take X,
project it down onto the random direction W. W is drawn from the Fourier transform of
the kernel.
So you take X, you project it down onto a hyperplane, and then you compute a phasor
from that. So you just project it down and then you just wrap it around the complex
circle, and this complex number now becomes your random feature.
And there's a squiggle mark here. Certainly this relationship holds in expectation. So
certainly this is true.
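Here is a minimal sketch (mine, not from the talk's slides) of that construction for the Gaussian kernel, whose Fourier transform is again a Gaussian: draw random directions w from it, map each point to the phasors exp(i w.x), and check that the averaged product approximates the kernel. The bandwidth and dimensions are made up.

```python
import numpy as np

def fourier_features(X, W):
    # z(x) = (1/sqrt(D)) [exp(i w_1.x), ..., exp(i w_D.x)]
    D = W.shape[0]
    return np.exp(1j * X @ W.T) / np.sqrt(D)

sigma, D, d = 1.0, 2000, 5            # kernel bandwidth, number of features, input dimension
W = np.random.randn(D, d) / sigma     # samples from the Gaussian kernel's Fourier transform

x, y = np.random.randn(d), np.random.randn(d)
zx = fourier_features(x[None, :], W)[0]
zy = fourier_features(y[None, :], W)[0]

approx = np.real(np.vdot(zy, zx))     # inner product of the featurized inputs
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(approx, exact)                  # close for large D, and equal in expectation
```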
>>: What about [inaudible]? It seems like some things are probably better than others.
>> Ali Rahimi: For this? So the sampling scheme is given to you. The sampling scheme
is draw from the Fourier transform of the kernel.
>>: Right. But you could draw regular samples as in a discrete transform or you could
just randomly sample -- I mean, is one better than the other or ->> Ali Rahimi: You could draw nonrandom samples you say?
>>: Yeah. I mean, you could draw --
>> Ali Rahimi: Yeah. So certainly this has that. So this has -- there exists a random
sampling such that these two guys are close to each other. Sorry. A random sampling
will probably produce something that's close to each other, and that implies that there
exists a deterministic sampling such that these two guys are close to each other.
The problem is I don't know how to come up with one. I know how to come up with one
by just sampling, but I don't know how to construct one.
Okay. So the point of this was to show you that at least in expectation featurizing your X
and Y and computing the inner product gives you something -- gives you the kernel
value.
I've also shown you how to compute this Z. It's just draw a bunch of samples from the
Fourier transform of the kernel and compute these phasors.
What I really want, though, is not this result in expectation. We want to show that
this actually holds throughout this space. So let me go through that right now. Let me
tell you what we know how to do.
So we know that the inner product for a given X and Y is going to be close to K of X and
Y in expectation. But we can also show that the tails of this are quite light. So this is just
by Hoeffding. So these guys, for a given X and Y, aren't going to deviate very much, with
very high probability.
Using the union bound on this, you can show this same thing on a discrete dataset of
N points. But even more so, using a covering number argument, you can show that this
holds throughout the whole space. So if you draw enough -- if the dimensionality of your
random feature is high enough, then this inner product will approximate this kernel for all
the points in your space with very high probability.
So it's not just a result in expectation. This is actually a result that holds with a very
high probability throughout the whole space.
In fact, let me reinterpret that theorem for you. It says that with probability at least 1
minus P, where P is some failure probability that you're given, the inner product
approximates the kernel over the whole space, as long as you have enough -- as long as
the dimension of your random feature is high enough, as long as you sample from the
Fourier transform enough times.
So this depends linearly on the dimension of the space. There's the standard 1 over
epsilon squared dependence on the error over there. And there's a dependence on the
curvature of the kernel as well, as you might expect to see.
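Written out, the kind of statement being paraphrased is roughly the following (a sketch from memory of the published random-features bound; the exact constants differ):

$$
\sup_{x,y \in \mathcal{M}} \bigl| z(x)^{\top} z(y) - k(x,y) \bigr| \le \epsilon
\quad \text{with probability at least } 1-\delta,
\quad \text{provided} \quad
D \;\gtrsim\; \frac{d}{\epsilon^{2}} \,\log \frac{\sigma_p \,\operatorname{diam}(\mathcal{M})}{\epsilon\,\delta},
$$

where d is the input dimension, M is the compact region the data lives in, and sigma_p is the second moment of the kernel's Fourier transform -- the curvature term mentioned above.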
>>: But D you have to pay for at runtime, so you don't want big D.
>> Ali Rahimi: That's right. That's right. In fact, we'll see -- I'll show you experiments
in the second part of the talk that I'll compare the cost of this D versus the cost of
choosing these features optimally. Yeah.
>>: Can you just say what the typical values of D are in your concrete things?
>> Ali Rahimi: Yeah. Why don't I -- just when we get to the experiments I'll tell you.
So here's another random feature. So there was the Fourier random features. There's a
totally different class of features that one can also construct. So you give me a kernel.
And my job, again, to remind you, is to build a function, possibly a randomized one, such
that the inner product between the featurized inputs is equal to the kernel.
So this random feature works like this. Grid up your input space, your space of Xs, so
just lay down a random grid. I'll tell you how to pick the pitch of the grid. In each bin of
the grids, assign a bin string. The bin string is just the representation of the number of
the bin, written in unary. So bin 1 gets a 1 over there, bin 2, bin 3. Okay.
And then the random feature representation of a point is just its grid ID written in unary.
That way when you compute the inner product, you're basically just 1 if you're in the
same bin or 0 if you're not in the same bin.
And now all that's left is for me to tell you how to compute these random grid pitches.
And in the same way that we picked the omegas from the Fourier transform of the kernel,
here we define a hat transform of the kernel: instead of sinusoids, it's in terms of these hat
basis functions.
So you'd randomly sample your grid pitches from the hat transform of the kernel you're
trying to approximate. And again you get the same theorems and same results as you do
with the Fourier kernel.
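Here is a toy reconstruction of the binning feature (hedged; my own, for the one-dimensional Laplacian kernel exp(-|x-y|/sigma), for which the hat transform works out to a Gamma distribution over pitches). Rather than materializing the unary bin codes, it checks directly that the fraction of random grids in which two points share a bin approximates the kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def same_bin_fraction(x, y, sigma=1.0, P=5000):
    # Draw P random grids. For each grid, the pitch delta is sampled from the
    # hat transform of the Laplacian kernel exp(-|x - y|/sigma), which here is a
    # Gamma(2, sigma) distribution, and the shift u is uniform in [0, delta).
    # Two points contribute 1 when they land in the same bin, which is exactly
    # the inner product of their unary bin codes.
    delta = rng.gamma(shape=2.0, scale=sigma, size=P)
    u = rng.uniform(0.0, delta)
    return np.mean(np.floor((x - u) / delta) == np.floor((y - u) / delta))

x, y = 0.3, 1.1
print(same_bin_fraction(x, y), np.exp(-abs(x - y)))   # the two should be close
```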
So let me show you what this looks like in code. It's very simple. If you want to train a
dataset with an L2 loss, you want to train a classifier using an L2 loss with Fourier
random features, you generate a bunch of random Ws, so these are the -- you just
sampled from the Fourier transform of, say, the Gaussian kernel. Fourier transform of
the Gaussian kernel is again a Gaussian.
So you just draw a bunch of Gaussian Ws. And then you pass your data through to the
random feature, so this is the complex sinusoid. And then you just now fit a linear solver.
You just fit a hyperplane in this featurized space, and that's just the least squares problem
[inaudible] over here. And, boom, you have your hyperplane. That's training in three
lines of MATLAB code.
And then for testing, you map your input vector through the random map. And you
evaluate the inner product with respect to the alpha that you just learned.
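The speaker describes this in MATLAB; here is a rough Python/NumPy equivalent (a sketch, using the real cosine/sine form of the phasor and plain least squares; the constants and names are mine):

```python
import numpy as np

def train_rks(X, y, D=500, sigma=1.0, seed=0):
    # 1) sample W from the Fourier transform of the Gaussian kernel (again a Gaussian)
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(D, X.shape[1]))
    # 2) pass the data through the random features (real form of the complex sinusoid)
    Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(D)
    # 3) fit a hyperplane in the featurized space: an ordinary least-squares problem
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return W, alpha

def predict_rks(X, W, alpha):
    D = W.shape[0]
    Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(D)
    return Z @ alpha        # the sign of this gives the class label

# toy usage with synthetic labels
X = np.random.randn(200, 10)
y = np.sign(np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(200))
W, alpha = train_rks(X, y)
print(np.mean(np.sign(predict_rks(X, W, alpha)) == y))   # training accuracy
```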
So let's get to the issue of dimensions that you brought up. So here's a bunch of datasets
that we ran this on. So typical dimension is anywhere from 21 dimensions to 127
dimensions on these datasets. I'll show you high-dimensional datasets. Dataset sizes
range from a few thousands to a few million. And generally training is really fast with
these random Fourier features.
And here are the typical dimensions that I pick. These are much smaller than what the
theorem would predict. So the theorem is quite loose.
The theorem would predict -- well, so depends what epsilon you want. There's a 1 over
epsilon squared. So if you wanted accuracy of 1 percent, it would be .01 squared. So
that's 1 over .01 squared, so it's 10,000.
So we're getting into -- so the theorem is loose because of this guy right here.
Okay. So, in practice, you know, we get typically better than state-of-the-art
performance on various heavily-tuned SVM libraries.
The Fourier random features work on most datasets. On some of them, like the forest
dataset, it didn't work so well. And so this dataset has this characteristic that if you
actually were to train an SVM with RBF kernels on it, most of the points become support
vectors.
>>: [inaudible]
>> Ali Rahimi: I'm sorry, what?
>>: What were these numbers? Test errors?
>> Ali Rahimi: Test errors. Yes. Sorry.
>>: [inaudible] two different flavors: there's the two-class version and there's the
seven-class version. I suspect this is the two-class version given the rate.
>> Ali Rahimi: I took the version of the forest cover from these people here, from the
core vector machine. I just grabbed it from there and ran everything else on it. Anyway.
So the point of this was -- the point of this line is to tell you that the two random
features are complementary in some sense. These are
really good for learning really smooth decision surfaces; these are really good for doing
nearest neighbors types of surfaces.
>>: So how do you [inaudible]?
>> Ali Rahimi: Set it to 500, see if it works. And then if it works really well, then you
set it to a hundred. If it doesn't work really well, you set it to a thousand.
>>: Have you played with combining the two?
>> Ali Rahimi: I have. I have, yes. So you just -- you can just stack up the random
features for the two guys, and things work quite well.
>>: Is it the best of both worlds?
>> Ali Rahimi: Yeah. You tend to -- I mean, so in the sense that if you run it on all of
these guys, you get basically the same performance.
>>: Why does it perform better than [inaudible]?
>> Ali Rahimi: Well, so there's an approximation going on. The approximation was in
terms of the kernel, not in terms of the decision surface or not even in terms of the ideal
decision surface.
We really are learning a different surface. It just tells you that the RBF representation
isn't the best representation.
>>: And it could be that the hinge -- you're also using regularized least squares
classification, right?
>> Ali Rahimi: Yes. So you get basically -- I've run all this stuff with -- all these things
with the hinge loss, so I basically get the same results. And it stops -- it stops mattering
what loss you use when you have large datasets.
Okay. So let me just tell you a few of the properties of these things.
So as you would expect, this is on the forest dataset [inaudible] the bigger the dataset. So
big datasets help, but you knew that.
And also, as you would expect, as you increase the dimension of the random feature, your
error also drops. So this is error dropping quite fast and training and testing time not
increasing very fast. So in practice these things tend to have desirable properties.
Let me -- okay. So this is random features. This was about training kernel machines
faster. Let me generalize the problem.
This is where the random kitchen sinks come from. So we're learning these feature
mappings based on a kernel that you give me. But why start with the kernel in the first
place?
So back in the day -- this is a picture out of a paper by Block from 1962. We had these
neural networks and there was this idea just from day one that there should be some
randomization that happens at the first layer.
So this idea of having some randomization in your training algorithm is classical. We
don't draw our neural networks like this anymore. Here's maybe a more modern way of
doing things.
Here's your input. It goes through some nonlinearities. The nonlinearities have some
parameter, and then you weight the output of the nonlinearities. And what you're
learning during training is these weights in the parameters of the nonlinearities. And
actually this is also outmoded. We just write this now.
Our neural network is a weighted sum of nonlinearities and they're parameterized. And
we just learn the weights and the parameters.
So let me focus on one popular way of doing it, of training these parameters. So when
we do AdaBoost, you build this function stage-wise, you train the alphas, you
train the omegas, and then you do that for the next step, and so we have T of these stages.
In random features we're also training a decision surface of a similar form. Our omegas
were random and we were just training for the alphas.
In kernel machines we're also doing something similar, except instead of a finite sum it's
just an integral, and we're learning this function alpha.
So a lot of these -- so basically the world of shallow network machine learning is all
focused on learning decision boundaries of this form. And I'm going to focus on one
particular way of training these, and that's the greedy method, which goes back to -- well,
it goes back about 50 years.
So the idea -- and I'm going to focus on a function-fitting framework. Forget the dataset
for now. Somebody gives you a function f* and says please approximate it with a
function of this form. You get to choose the omegas and the alphas. But I want the
resulting sum to be close to this function in some norm in a function space.
So you're given a function, a target function to approximate, and you're asked to come up with
a bunch of weights and a bunch of parameters, such that the weighted sum is close to the
target function.
So the greedy approach, which looks like AdaBoost, looks like Matching Pursuit, looks
like a lot of these other things, goes like this. Start with the function zero. And then
we're going to find one term. We're going to add one term to the function. And that's
going to be the term that gets -- the one term addition that gets us closest to f*.
And now we have a new function, and then you iterate again. For the next addition, for
the next term of the function, you again come up with the one that minimizes the
difference between the residual and the target function, and you do this T times. So this
has the flavor of AdaBoost.
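A toy version of that greedy loop (my own sketch, fitting random-phase cosines to a fixed target function, and picking each new term by crude random search rather than the careful fitting AdaBoost or Matching Pursuit would do):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 400)
f_star = np.sin(2 * x) + 0.5 * np.cos(5 * x)      # target function to approximate

def phi(x, w):
    return np.cos(w[0] * x + w[1])                # the parameterized nonlinearity

T = 30
residual = f_star.copy()
terms = []
for t in range(T):
    # greedy step: among candidate parameters, keep the single new term that,
    # with its best weight, gets closest to the current residual
    best = None
    for w in rng.uniform(-8, 8, size=(200, 2)):
        p = phi(x, w)
        a = (p @ residual) / (p @ p)              # optimal weight for this candidate
        err = np.sum((residual - a * p) ** 2)
        if best is None or err < best[0]:
            best = (err, a, w)
    _, a, w = best
    terms.append((a, w))
    residual = residual - a * phi(x, w)           # T stages, one term added per stage

print(np.sqrt(np.mean(residual ** 2)))            # approximation error after T terms
```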
And we know a lot about the performance of a function-fitting algorithm like this.
In fact, this result goes back 20 years, 25 years. 15 years. 15 years.
So suppose you want to approximate a function of this form. This is our target function.
It consists of infinitely many terms. And we want to proximate it with a T term function
that we built, as in the previous slide.
So what's known is that the distance between the approximation that we built, as in the
previous slide, and the target function decays as 1 over square root of the number of
terms. And there is a constant here that measures in some sense the norm of the target
function. So the L1 norm of the alphas is a norm on functions of this form.
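In symbols, the rate being described is roughly (a sketch of the Jones/Barron-style result; the constant is the L1-type norm just mentioned):

$$
\Bigl\| f^{\star} - \sum_{t=1}^{T} \alpha_t \,\phi(\,\cdot\,; \omega_t) \Bigr\| \;\le\; \frac{C}{\sqrt{T}},
\qquad C \;\approx\; \|\alpha\|_1 \text{ for the target's expansion.}
$$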
All right. So this is proved by induction over T. Let me write this down graphically for
you. It says that if you -- for any function in this space with infinite expansion, there
exists a function with a T term expansion if you're allowed to pick alpha and omega.
That's not too far from that function. It's within 1 over square root of T. So it's a
statement about for all functions in the blue there exists a function in the purple. That's
not too far.
Okay. That's about as good of a rate as you can get. This rate is tight.
So here's another idea. This is the random kitchen sinks idea for fitting functions. You're
again given the target function. And you're again asked to come up with alphas and
omegas such that this T term approximation is close to f*.
But now we do something much simpler. Just pick the omegas from some distribution.
Just randomly. Instead of going through this greedy algorithm. There's nothing greedy
about this now. Just pick them randomly. And then to pick the alphas, just solve this
convex problem. So this looks like a least squares problem, for example.
>>: So all at once [inaudible].
>> Ali Rahimi: All at once. It's a batch optimization over T alphas. So now -- and then
you just return the alphas and the omegas that you computed.
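As a sketch of that recipe with a nonlinearity other than a sinusoid (random sigmoids here, with made-up sampling distributions): draw all the omegas at random up front, then solve a single batch least-squares problem for the T alphas.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_kitchen_sinks(X, y, T=300):
    # randomize over the nonlinearity parameters ...
    W = rng.normal(size=(T, X.shape[1]))          # random directions
    b = rng.uniform(-1.0, 1.0, size=T)            # random offsets
    Phi = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))    # sigmoid features phi(x; omega)
    # ... and minimize over the weights alpha, all at once (a convex problem)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return (W, b), alpha

def predict(X, params, alpha):
    W, b = params
    return (1.0 / (1.0 + np.exp(-(X @ W.T + b)))) @ alpha

X = rng.normal(size=(500, 8))
y = np.tanh(X[:, 0] - X[:, 1] ** 2)               # toy regression target
params, alpha = fit_kitchen_sinks(X, y)
print(np.mean((predict(X, params, alpha) - y) ** 2))
```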
So let's see how well this does. And you would expect that the performance guarantees
would somehow depend on the distribution that you used to sample the omegas. Right?
And so here's the results.
If you remember, the theoretical result for the greedy algorithm depended on the L1
norm of the coefficients of the function we were trying to approximate, the C over here.
So let's define an analogous norm for the target function we're trying to approximate.
We're going to call it the P norm. And it's going to depend on the probability distribution
that you use to sample your parameters. So you could think of this as an importance
sampling ratio between the alphas and the distribution that you're using to sample these
guys.
So the theorem says if somebody gives me an f* to approximate using the algorithm I just
showed you on the previous slide, then the error of the T term approximation, with probability at
least 1 minus delta, for any delta that you pick, also drops as 1 over square root of T.
So we have the same rate in terms of the number of terms that we need in the expansion
as we do with the greedy algorithm. But here we just manage to sample the parameters
randomly. And then there's this dependence on the importance ratio between the alphas.
Yeah.
>>: So here f* is -- you fix it, right?
>> Ali Rahimi: Yeah.
>>: And then your picture before, I thought you said you had something that [inaudible]
problem. So can you [inaudible].
>> Ali Rahimi: Yeah. So here's -- so here's what this theorem says. So you fix f* and
then we're saying -- so you fix f* and f* is drawn from this big set that looks like that, it's
infinite expansions of your weighted features. And we say that if you consider this
random set, so this is a random set consisting of all alphas, all weights, but then these
features are drawn randomly. It says that with very high probability the distance between
this fixed f* and this set drops as 1 over square root of T.
So whereas before we were making a claim about for all points in this set, there is a point
close to here. In this case we're just saying for a fixed point in this set. And that's all you
need to talk about function fitting. After all, somebody gives you a function to fit and
then you draw stuff. You don't need to -- you don't need to approximate the whole space
ever; you just need to approximate the function at hand that you need to approximate.
And that's why we manage to get this 1 over square root of T.
>>: This gives -- the rate of improvement is the same, but the difference in the -- the
constant could be substantial.
>> Ali Rahimi: This constant could be substantial, yeah. You could pick a sampling
distribution for the weak learners, for the features that's terrible for the given function.
Yeah. It's easy to construct. It's just if you use a Dirac delta, for example, for your
omegas, you'll just learn a really crappy classifier. But at least the theorem is correct in
that that crappiness is reflected in this [inaudible].
>>: Have you scored any win doing any refinement at all? Because right now you're
assuming that you just pick all random ones. Because if you weed out some fraction and
replace them or something, where you can do a little bit more, do you win?
>> Ali Rahimi: What works really well is if you start with this random thing and then
just do a few iterations of gradient descent on the omegas and the alphas. Yeah. So that
works incredibly well.
>>: Oh, that sounds like a neural network.
>> Ali Rahimi: Of course. It's all a neural network. You just can't say that out loud.
>>: Oops.
>> Ali Rahimi: Yeah. So okay. What is neat about it is that it's a neural network that
you initialize with random weights and you have guarantees about how far you end up
from the thing you're trying to approximate.
>>: How far is that [inaudible]?
>> Ali Rahimi: Well, so it depends where you start with the training from scratch. I'm
not very good at it, even though I've tried very hard. I often get stuck in local minima.
Generally you end up doing quite well if you do this and then gradient descent.
In the experiments I'll report, I don't do the refinement, gradient descent refinement. I've
informally just tried training starting from zero or various parameters that I thought might
be good, and it works okay. But nowhere near as well as this.
>>: [inaudible]
>> Ali Rahimi: Oh, and it's much slower because you need to run gradient descent for a
lot longer too.
>>: Is it important to have the random -- it's important to have the random sampling,
right? Because I've tried stuff where the features were from the PCA of the dataset, and that
didn't work well either.
>> Ali Rahimi: Yeah. So this random sampling is completely independent of the data in
that.
>>: But you need to choose the omegas that you would expect to be good because of the
[inaudible] sample?
>> Ali Rahimi: That's right. That's right. That's right. So it is a design issue.
>>: [inaudible] okay. Well, I don't know. I tried it once [inaudible].
>> Ali Rahimi: So the reason this stuff ends up working well, and my intuition about all
of this stuff and what these theorems mean is that really it doesn't matter what the
nonlinearities are, but just put a lot of effort into figuring out what their weights should
be. That's where the magic is. Not in here necessarily. That's how I view this result.
>>: It's interesting because in boosting, for instance, people are trying to do things
that -- where instead of taking the greedy approach you go and you take the classifiers that you
fit and then you go back and you try to do the least squares batch fit of [inaudible] and
you end up doing much worse in terms of generalization. So it's interesting to hear
[inaudible] hurting you.
>> Ali Rahimi: It's not hurting you because of the way we picked the weak learners.
Right.
>>: Because of the random [inaudible].
>> Ali Rahimi: Because of the random, right. So you're talking about Dale Schuurmans's
result of -- yeah. Yeah. So yeah.
Yeah. You can't -- right. So if you pick your -- so Dale Schuurmans's result is if you pick
the omegas from boosting, you don't want to go back and refine your alphas only. You
got to go back and -- okay.
All right. So this theorem is in terms of some norm I defined. And, I mean, we can come
up with a much stronger form of this theorem in terms of the L infinity norm between
the functions. But these features have to be sigmoids like this.
Again, this is if you're going to nitpick about my choice of function norm here. This
gives you a result in terms of the L infinity norm.
So you buy that it's enough to just fit a fixed function in this space, that you don't need a
universal claim about the whole space. Did I convince you that this is a good enough
thing?
>>: Let's take it offline.
>> Ali Rahimi: Great. Perfect.
I'm going to skip the proof. The proof idea. It basically boils down to coming up with
tail bounds for a zero mean random variable in a Banach space. It's a bounded zero mean
random variable in a Banach space. Just replaces this Mth with this random variable, and
then you apply standard results from there.
I'm -- so everything I told you about so far about the random kitchen sinks was about
fitting functions, but really we're going to be fitting data and using a standard
decomposition of the empirical risk -- sorry, of the defect between the empirical risk and
the true risk. You can show this bound.
So if F hat is the T term expansion that you derived by looking at N data points and
minimizing the empirical risk, then the true risk of F hat compared to the true risk of the
best risk minimizer decays as follows: 1 over square root of T plus 1 over square root of
the number of data points that you looked at.
So the 1 over square root of T comes from the previous theorem. This 1 over square root
of N is a standard result from learning theory.
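Written out loosely (my paraphrase; constants and log factors omitted), the decomposition is:

$$
R(\hat f) \;-\; \min_{f} R(f) \;\lesssim\; \underbrace{\frac{1}{\sqrt{T}}}_{\text{approximation}} \;+\; \underbrace{\frac{1}{\sqrt{N}}}_{\text{estimation}},
$$

where R is the true risk, T the number of random features, and N the number of training points.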
>>: [inaudible]
>> Ali Rahimi: This is -- no. It's not. It's not a uniform convergence result. This is a
result about this optimizer.
It uses the uniform convergence for this part of the decomposition. This part, as I was
talking with Ofer [phonetic] about, only needs a pointwise bound.
So let's go over some experiments. Here is the adult dataset. It's a relatively small
dataset. But everybody uses it. This is the number of terms that we add in the expansion.
This is AdaBoost's testing error. So after adding a few terms, about 40, 50 terms is
enough for AdaBoost, it plateaus out to about 15 percent error.
For us, we need to draw a lot more random features to get to similar error rates. So
AdaBoost got there faster with many -- AdaBoost got there with many fewer terms. We
had to use a lot more terms to get to the same accuracy.
But we got there much faster. Our optimization is much, much faster. AdaBoost does
this pretty heavyweight iterative thing where it has to touch more or less the entire dataset
at every iteration. We just touch the dataset once in our least squares solution.
So this is the runtime as the number of features increase. AdaBoost takes a lot of time.
This is on a log scale. We take very little time. And, in fact, let me combine these two
graphs together. This graph, this is the amount of training time versus the amount of
error that you get.
So even though we ended up using a lot more terms in our expansion, we're still much,
much faster because our optimization procedure is much faster for a given error rate.
Here's another dataset. This is data coming from an accelerometer, a hip-worn thing
that detects your physical activity. We stopped AdaBoost after about a hundred iterations
because it was just taking too long. Whereas this thing -- the random kitchen sinks kept
on ticking. And, again, you have a couple of orders of magnitude less time that you spent
for the same error rate.
Another standardized dataset. Similar thing. Again, few orders of magnitude for similar
error rates. And that's consistent across the board. Here's the face detection experiment.
We compared AdaBoost with [inaudible] against the Fourier random features.
What's neat about this comparison is that you can't train Fourier random features with
AdaBoost very well. It's a hard weak learner to fit. It's hard to fit sinusoids to data. But
we're picking them randomly in our case, so that's easy.
So part of the benefit of this random kitchen sinks trick is that you can start using features
that you wouldn't be able to use with AdaBoost, because you no longer need to fit them to
data, which is convenient.
So we get slightly better performance than AdaBoost on our test accuracy. Training is
much faster; seconds instead of minutes. But we do use about a factor of 10 more weak
learners. And, again, here's your point, John, that at runtime D is what's important.
In these types of experiments, we were hoping to have a fast trainer. And there are lots of
tricks that we started using in a face detector that we built that can avoid you having to
slide the window -- the detection window -- over the whole image. There's an optimization
that happens where you can prune a lot of the search space for the sliding windows. So
that's how we get around that kind of slowdown.
>>: Have you tried to see -- so when you find the random features and then [inaudible],
then when you're doing the least squares. So do you see any sparsity in these? Because it
seems like, you know, the way to overcome this would be to find a sparse set of alphas
and then get rid of those random guys that you never use.
>> Ali Rahimi: Yeah. So that set of experiments, I don't report on them here. But one
idea was instead of least squares just to use Pegasos and hope to get sparseness out of
that. I can't find a good setting of the parameters of Pegasos to get as good accuracy as I
get with least squares. So --
>>: [inaudible]
>> Ali Rahimi: Yeah, but then those problems become huge, and I don't have -- I would
like to ask you for a really large-scale L1 regularized solver. I think there's a
couple out there. I just don't -- haven't talked to anybody who could just recommend one.
Offline.
>>: [inaudible]
>> Ali Rahimi: Similar things with MNIST. In this one we were comparing against
boosting by filtering, which is a much faster version of boosting where you -- instead of
touching the entire dataset at every iteration you just touch a random subset. And,
again, you see similar types of results where you're about a hundred times faster, similar
accuracy, but you use more features.
I'm going to skip this. So this is -- I can't really talk about this part, but I think it's neat
that Intel may consider using this in something one day. Right.
So here's the lesson from this part of the talk. So here's typically the way people fit these
nonlinear decision boundaries. You run the minimization over the weights, you run the
minimization over the parameters, and I just -- here's a caricature; this is not
mathematical. The caricature of what we've done is minimize over these weights but
randomize over the omegas and prove that you get very similar results.
So for the next few minutes, I want to talk about less baked things that I've been working
on for the past six months or so, unless there are questions about that stuff, then we can...
So here's a neat trend. Everybody's doing semi-definite programs for everything and
getting good results as long as they have 10,000 variables.
So it would be neat to come up with a way to solve semi-definite programs faster. So
these semi-definite programs typically take this form. You want to minimize a convex
function over matrices subject to a polyhedral constraint. So there's a linear constraint on
the matrix, and you want the matrix to be positive-definite.
So this blue thing is -- represents the cone of positive-definite matrices. And the problem
is while it's polynomially hard to perform minimizations like this, it's still -- it's still hard
to do it on computers today. We don't have very fast solvers. So it's a challenge to come
up with good solvers that can minimize things over this convex cone.
So a trick that Guillaume Obozinski [phonetic] pointed out to me that they'd used in a
paper is to replace this set in the optimization with a polyhedral set. They use a random
polyhedral set. That's the green thing over here. So you just generate a bunch of random
vertices that are positive-definite matrices, and you require X to live in that cone.
And it worked amazingly well for their application. And they didn't know why it worked
well. And I ran a bunch of experiments, and it looks like, you know, it seems to work well
for -- as long as your optimum is not on the wall of this cone, as long as it's not an
extreme point of this cone, it will work amazingly well. And if it is on the extreme point,
then you can still get within some epsilon with high probability.
So what can we say about this type of thing. So here's a theorem about it. Actually,
before the theorem, let me tell you how one uses this trick.
So we just replace the positive-definite constraint with this constraint. This is the
constraint that X has to lie in this polyhedral cone whose vertices are these randomly
drawn VIs. And just to say that it's a cone means that these weights have to be positive.
So now if F is -- if F is linear, for example, this turns into a linear program. If F is
quadratic, this is a quadratic program. We can solve all of these things really fast. And
this graph is a simulation that shows that actually a lot of these matrices that you draw
from this positive-definite cone do end up being extremely close to this random convex
polyhedral cone.
So the theorem is -- and it's still in flux. I think some of it can be improved -- is that if
you're given a target matrix X naught, so for a fixed X naught, draw a bunch of random positive
matrices from the Wishart distribution. Construct this cone.
So this is just positive combinations of these random Wishart matrices. Then with really
high probability, the target matrix is close to this convex polyhedral cone, as long as the
number of random points that you drew is large enough.
And large enough of course depends on how accurately you want to approximate the target
matrix, and it also depends on this guy, which quantifies how close the target matrix is to
the boundary of the convex cone.
So with this, you just -- you now have a tool to convert hairy semi-definite programs into
optimization problems over random polyhedral cones, like just turning a semi-definite
program into a random linear program. So that's one thread of research.
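A toy sketch of the trick (mine, hedged, using SciPy's nonnegative least squares): draw random Wishart vertices V_k, restrict X to their nonnegative combinations, and the positive-semidefiniteness constraint turns into a simple sign constraint on the coefficients. For a quadratic objective this becomes a nonnegative least-squares problem.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
n, K = 6, 200                                  # matrix size, number of random cone vertices

# random positive-definite vertices V_k = A_k A_k^T (Wishart draws)
A = rng.normal(size=(K, n, n))
V = np.einsum('kij,klj->kil', A, A)

X_target = np.eye(n) + 0.5 * np.ones((n, n))   # some PSD target, away from the cone's wall

# Replace "X is positive semidefinite" by "X = sum_k c_k V_k with c_k >= 0":
# fitting the target in Frobenius norm is then a nonnegative least-squares problem in c.
B = V.reshape(K, n * n).T                      # columns are the flattened vertices
c, _ = nnls(B, X_target.ravel())

X_approx = np.einsum('k,kij->ij', c, V)
print(np.linalg.norm(X_approx - X_target))     # small when K is large enough
```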
Here's another thread of research that I don't know if it's going to pay off, but it's all sci-fi
and it feels good to work on it.
So it turns out if you take a normal CPU and you drop its voltage below the voltage that
the instruction manual tells you to run it at, the CPU will still run. But it will make
mistakes.
So you save a lot of power. Power consumption drops somewhere between V squared
and V cubed.
>>: I thought the classical scale was V squared.
>> Ali Rahimi: Right. But you get to drop the clock, too.
>>: Oh.
>> Ali Rahimi: Because there's a dependence on the clock.
>>: Sure, sure, sure. [inaudible] squared F. Yeah.
>> Ali Rahimi: So wouldn't it be neat if the next processors that you -- actually I'm
totally not allowed to say it that way. Wouldn't it be neat if in the laboratory somebody
were allowed to -- somebody were to build a processor whose floating point unit ran at a
lower voltage. Saves a lot of power. But made some mistakes here and there. Or made a
lot of mistakes.
So actually that's explored. This is not an entirely new idea. People have been
building -- have been prototyping these circuits where they're designed normally, and
when you drop them at low voltage they have this little shadow latch that detects whether
the circuit is misbehaving.
And if the circuit detects that it's misbehaving, then it will flush the instruction pipeline
and reissue the instruction anew and raise the voltage a little bit higher.
So this is the hardware approach to resilience on undervolted processors. Yes.
>>: [inaudible] per unit the unit can just deliver [inaudible] and let [inaudible] take care
of it later. And [inaudible] which I'm sure you know about. Results are off.
>> Ali Rahimi: I've heard about it. Yeah. So there are various ways to notify the
software layer that an error has occurred. There are ways to mask it at the hardware layer
by just reissuing the instructions and not letting the software worry about it. But here's
another idea. Let's just get rid of the overhead of the shadow latch. That's taking up
power, it's taking up die area, and it may even force you to run stuff at high voltage just
to get the shadow latch to work correctly.
And let's expose all the errors to the software. The floating point unit will not just return
[inaudible] when it's made a mistake; it will just return the garbage that it computed. It
will just say A plus B is equal to something totally random. But now let's design our
algorithms so that they can tolerate that type of error.
So here's the idea. So let's start with a classical combinatorial problem, say bipartite
graph matching. So this is a standard problem. Let's express it as a linear program and
then convert the linear program into an optimization problem that's unconstrained. And
then so none of this involves computation. This is all pen-and-paper transformations.
And then to solve a bipartite graph matching problem, let's just toss this unconstrained
optimization problem into a stochastic gradient solver. The reason is that stochastic
gradient, we know, can tolerate noise in the gradients.
So whenever we compute the gradient, which is where you spend the bulk of the
computation when you're doing this type of minimization, drop the voltage. Feel free to
compute a really noisy gradient. And then do the update at normal voltage.
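A toy simulation of that idea (purely illustrative; the corruption model below is a stand-in for a faulty low-voltage FPU): compute per-example gradients whose entries are occasionally replaced by garbage, do the updates cleanly, and stochastic gradient still drifts toward the least-squares optimum.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 500, 20
A = rng.normal(size=(m, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)        # reference optimum

def noisy_gradient(x, error_rate=0.2):
    # gradient of 0.5 * (a_i.x - b_i)^2 for one random row, computed "at low voltage":
    # each entry is replaced by garbage with some probability
    i = rng.integers(m)
    g = (A[i] @ x - b[i]) * A[i]
    garbage = rng.random(d) < error_rate
    g[garbage] = rng.normal(scale=10.0, size=garbage.sum())
    return g

x = np.zeros(d)
for t in range(100000):
    step = 1e-2 / (1.0 + t / 1000.0)                   # decaying step size
    x -= step * noisy_gradient(x)                      # the update itself runs "at full voltage"

print(np.linalg.norm(x - x_star))                      # close to the optimum despite the corruption
```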
>>: Are the errors random errors?
>> Ali Rahimi: Yeah. So depends what regime you're in. If you're a regime slightly
below design voltage, the errors that you get are timing errors. And they're random only
in the sense that they're hard to model, in that they're -- the result that you get out of the
FPU depends on the previous result that the FPU returned. But if you're far below that,
then you actually get transistor noise, which is better modeled as a stochastic
source.
And here's some preliminary results. This is quite preliminary. So we built these -- this
hardware simulator that actually has a SPARC processor on an FPGA and then the FPGA
injects noise into the output of the floating point unit.
So here's a least squares solver. This is just the least squares solver from OpenCV. If
you drop the voltage of the CPU and inject all these errors, the least squares solver starts
-- returns really noisy results. So this is the difference between the output of the least
squares solver at low voltage and the optimum.
And you just get these very large residuals. But using our stochastic gradient solver, no
matter what the error rate, you just nail a result eventually.
Similar thing with bipartite graph matching. If you use OpenCV's earth mover's distance
solver at low voltages, it will do quite poorly, whereas ours does -- well, it doesn't get a
hundred percent yet, because there's a bug, but basically its performance doesn't depend
on the amount of error that you inject.
So we really just -- we are taking longer to compute these results because stochastic
gradient is obviously slower than, say, the simplex method or the SVD in this case. But
at least we're getting robustness right now. Yes.
>>: What if a single bit [inaudible] multiplies the result by a few
hundred -- by a few hundred orders of magnitude? Does that not happen, or can you recover, or
what's the deal?
>> Ali Rahimi: It does happen, and we can recover. That's right. Yes.
>>: Do you have the comparison between like the time lost and the amount of power you
saved?
>> Ali Rahimi: So what I can -- what I do know is that from here to here corresponds to
about power savings of -- so this uses about 2 percent of the power savings -- 2 percent of
the power and this uses about a hundred percent of the power. So it's a factor of a
hundred in power savings.
The amount of time that you spend is just ridiculous. This is just not a worthwhile
technique right now. So you end up using up a lot more power right now because you
keep running the stochastic gradient solver for too long.
>>: [inaudible] more energy.
>> Ali Rahimi: You end up using more energy because you're waiting for your
computation. But the trick -- so this is a motivation for us to develop faster stochastic
solvers: instead of just following the gradient, let's do -- let's try [inaudible] gradient
methods or second-order methods, or anything other than the gradient direction probably
will help.
>>: There are some very low-resolution floating point numbers [inaudible] about 8 bits,
but there was codecs. If you use that, you must [inaudible].
>> Ali Rahimi: That's right. That's right. So an alternative is to just compute -- have
your FPU be narrower and then stitch the output together later.
I wanted to -- before I get on my pontification slide, I wanted to acknowledge some
collaborators from various universities, a lot of people who I've talked to about this stuff
over the past couple of years who have been very helpful.
So I -- part of the flavor of this talk was about randomizing things instead of optimizing
things and just about generally doing less work and hoping that your random number
generator will just get you the right answer.
And turns out that we can prove that it often does. And I don't -- I like digging back into
the back literature and finding the root of some of these ideas like you saw with the
neural network picture, and I was trying to figure out why more people don't randomize
things. And my literature search there took me to the original source. And there's
support for both ways of doing things, for optimizing really hard or just throwing caution
to the wind.
And I can open for questions if you'd like. I probably won't be able to answer most of
them. Don't ask me. Okay. Yes.
>>: [inaudible] but if you reduce voltage by 30 percent, I can see a reduced power by a
factor of two, but a factor of 20, either it's megahertz or it's black magic.
>> Ali Rahimi: Well, certainly, if you drop voltage by a factor of two, you're dropping
power by a factor of four. So but you also get to run your clock more slowly under this
scheme.
>>: You can run the clock slowly [inaudible]?
>> Ali Rahimi: Pardon?
>>: I can see [inaudible] underclocking reduces power.
>> Ali Rahimi: Yeah.
>>: [inaudible]
>> Ali Rahimi: I'm sorry. Yes. Yes.
>>: The rest of it is either black magic or all in megahertz.
>> Ali Rahimi: So I failed to actually manifest any black magic here because I admitted
to you that these stochastic gradient solvers actually end up taking a long time. So don't
be too impressed by these results and don't think that I'm some voodoo master. This is
just a first step toward getting -- toward an algorithmic way to tolerate noise in numerical
algorithms.
The rest of it, this idea of making -- of becoming resilient to undervolting, that's standard
and classical. People have been solving this at the circuit level for a decade. The
innovation is to do this at the software level using tricks from the machine learning
community. Yes.
>>: [inaudible] randomization experiments, you mentioned the boosting thing didn't
actually look at the whole dataset, just looked at like a random subset. I was wondering
why that didn't help more and also related to if you can prove things about randomized
greedy schemes.
>> Ali Rahimi: Yeah. So the -- actually, Joseph Bradley is the one who came up -- has
all the results on randomized boosting schemes. Boosting by filtering is his work. He
has bounds on how well it does. And there, again, you get the 1 over square root of T
type of thing.
And my sense is that you just -- when you're going to train a weak learner, you just need
to look at the data, if you're going to pick the weights using the greedy AdaBoost method.
>>: [inaudible] a little tiny bit [inaudible].
>> Ali Rahimi: That's right. So --
>>: Almost random.
>> Ali Rahimi: Right. So that's their trick. So, if you will, they are picking their weak
learner randomly, just like I do. Except they pick it by looking at a random -- their
randomization is by looking at a random subset of data. The way they add these weak
learners to the final function is by the stage-wise thing. And I'm still insisting that this
stage-wise thing is what kills you.
>>: So [inaudible] you're fitting on the entire dataset once you -- I'm not sure --
>> Ali Rahimi: The stage-wise thing is that when you pick the weight, the optimum
weight for the weak learner that you just learned, you're again looking at a subset of the
dataset. You're always just looking at a subset of the dataset in the stage-wise thing.
Yeah.
>>: [inaudible] but when you're testing the function and when you're training --
>> Ali Rahimi: In this work, you mean?
>>: Yes.
>> Ali Rahimi: So there's no testing and training here. This is --
>>: [inaudible]
>> Ali Rahimi: Here? No. This is just a randomly generated least squares problem.
>>: Okay.
>> Ali Rahimi: There's no -- don't think of it as a machine learning problem.
>>: Okay.
>> Ali Rahimi: The stochastic gradient is a machine learning tool, but there's no data
fitting going on. Okay.
>> John Platt: Let's thank the speaker.
[applause]