>> John Platt: Okay. I'm happy to introduce Ali Rahimi from Intel Labs Berkeley.

>> Ali Rahimi: Hi. I'm going to talk about random kitchen sinks, but before I get into it, I want to just make sure everybody can pace themselves through the talk. I'm going to start with really lightweight stuff, and then we're going to ramp up slowly, and then I'll do experiments and you can turn your minds off. And then I'll hit you again with some math. And then this is new work and it's not published, so we'll try to breeze through it, but it's still pretty mathy.

I'm going to start here. This is to give you a little bit of context about where all this work comes from. There's a new trend in AI. Back in the day, when we wanted to build smart things, we would start with a really complicated statistical model, like Bayes nets, where inference was NP-hard and learning was even harder. We gave it some data, a few thousand examples, and we trained some intelligent thing. And something happened in the late '90s where instead of models like this, these really structured statistical models, we started using more generic models, nonparametric models, like RBFs. And to compensate for the lack of statistical structure in these models, we started feeding them lots and lots of data. Domain knowledge started to come from the data instead of from the model. And the nice thing about these generic models is that the optimization problems tend to be convex. At least they're in P. But because we have so much data, the optimization problem in practice ends up taking a lot of time. So this talk is really about tricks for making these types of optimization problems on these types of datasets go faster, to support the new kind of artificial intelligence that we're doing these days.

So I'll give you a few examples of this trend. Here's an example from Alyosha Efros. This is a problem where you're given an image, you say, oh, I really don't like these houses being here, so you blot them out, and then you want to fill in this blotted-out part with something pretty and relevant. Alyosha these days takes a very data-heavy approach. He just crawls through millions of images on Flickr, finds little patches, and substitutes the patches in here. It's a very heavyweight, heavily data-driven approach. Contrast this with something he did ten years ago, where he actually had a pretty sophisticated HMM-based model, not very data driven, but the model was a lot more complicated. This works a lot better.

Here's another example. This is work from Antonio Torralba. Each of these little cells is a tiny image. There are 10 million images in here. And he uses this dataset to do object recognition. Extremely data heavy. The operation it goes through is just a nearest neighbors search in these 10 million images. Compare that with what we did a while back, where we actually built this pretty complicated discriminative model that can take into account the spatial relationships between objects and wasn't trained on that much data. This works really well.

Here's another example from Greg Shakhnarovich. Here's a pose detector. I hear Microsoft recently solved this problem in the Xbox. The goal is to recover the 3D pose of the human body. And Greg's approach here is he takes a graphics simulator, generates 150,000 examples of a fake person under different random poses, and just matches this image against this image using a very simple distance metric.
Contrast this with something the same group did ten years ago that involved actually reasoning about the 3D geometry of the shape in real time, trying to match it against the 2D image and recover the 3D shape of the body. This works amazingly well and it's fast.

My own motivation for this stuff is building object recognition systems that you can train to recognize millions of objects in the real world. So, you know, this is a system trained on about 30 objects. Anyway, it runs in real time. But as you scale up the number of objects, it goes more and more slowly and the accuracy drops. So it would be neat to take systems like this and be able to scale them up, just the same way those previous examples I showed you work.

Part of the reason this trend is happening now is we have access to a lot more data than we did before. I think. I'm speculating here. We have really high-fidelity simulators, like the graphics simulator that Greg was using to generate these body poses. We have the Web, which has lots of images and annotations on it. And ever since the MacArthur Foundation started giving grants to people for building games, and with Mechanical Turk too, there's a lot of hand-annotated stuff that you can get off of the Web.

So this talk is about supporting this trend, and I'm going to show you two tricks that I've been playing with. I'll start with the random features trick. This is a way to speed up kernel machines.

So a little bit of background on kernel machines. Here's a classification task. This is a trick for speeding up your classifier. I'm going to tell you a little bit about the classification problem, then about kernel machines, and then about how we speed them up. In this classification problem you've got a space and a bunch of points. The points are labeled among two classes, and you're trying to find a decision surface between them. Linear decision surfaces don't always separate your two classes well, so one would like to consider nonlinear decision surfaces. And in kernel machines, the form of the decision surface that we use is a weighted sum of kernels placed on your training examples.

So there are N parameters here. There's a kernel that we define, maybe a Gaussian or something like that. We place the kernel on each one of these points, and then we come up with a good weighting, and that describes the family of curves in this space.

This turns out -- I mean, this is well known -- a function of this form, when this kernel is positive-definite, is equivalent to a linear function in a featurized space of the input. And these features are such that they satisfy this relationship. So the kernel effectively maps your inputs into some feature space and then takes the inner product in that space.

Okay. So this is a review of kernel machines. This is a really neat trick, because whereas you would normally be trying to fit a decision surface in an infinite-dimensional space -- this feature mapping, for example, can in general be infinite-dimensional, so you would normally have to find an omega in some infinite-dimensional space -- the kernel trick lets you search over N parameters only. So you can start implementing these things inside computers, which is great. And it even has this nice interpretation, like I said.
Instead of searching for curves like this, you map your data into a potentially infinite-dimensional space implicitly, and then fit a linear decision surface in that space. So this works really well. The problem with these kernel machines is that they force you to deal with enormous matrices. If you have ten million training examples, one way or another you're going to have to represent a 10-million-by-10-million matrix whose entries consist of the kernel evaluated on pairs of your training data points. I made this really big to take up the whole screen, to emphasize how big these matrices can be. So you can do infinite-dimensional things in finite-dimensional computers with the kernel trick, but these things are still huge. And some researchers have come up with very popular tricks for dealing with matrices like this.

Here's another trick. We talked about how these kernels actually compute inner products between the featurized inputs. So what we're going to do is, instead of dealing with the kernel or with this infinite-dimensional feature mapping, we're going to find a finite-dimensional feature mapping, in fact a low-dimensional feature mapping, such that instead of having to take the inner product in this infinite-dimensional space, you can just take the dot product of the featurized inputs. So just like with kernel machines, your decision surface took this form. With this machine, because we're now using finite-dimensional features, your kernel machine just takes this form. And to go back to this diagram, the idea, again, is we're going to define these nonlinear decision boundaries by mapping our data into some finite, low-dimensional space and then training a linear decision boundary there. And this mapping is going to be such that this relationship holds.

So instead of training your kernel machine with kernels and dealing with these enormous matrices, we're actually going to randomly featurize your inputs and then train a linear classifier in this relatively low-dimensional space. And we're going to guarantee that the resulting classifier is going to be close to the classifier we would have gotten otherwise.

I'm going to tell you about two different types of random features. One of them is Fourier random features, which are based on the Fourier transform of the kernel. The other is a discrete random feature that's based on gridding up the space. I'll go through both of them.

The proof for why this works is really simple. It's four lines, and I think it's kind of neat to look at. So I'm going to just pop up some math, and I'm going to walk through it, because this is just really neat. So the trick is -- we're given a kernel. I forgot to mention, this only works with shift-invariant kernels, so you have to be able to represent the kernel like this. And at the bottom we're going to derive these random features such that this relationship holds: we want the inner product between the random featurized inputs to almost be equal to the value of the kernel. And I'll walk you through it.

Step one. Take the Fourier transform of your kernel. This is just the standard Fourier transform that you learn about in elementary school. Step two. So this is an integral, where P is the Fourier transform of the kernel. We're going to approximate this integral with an average. So treat this Fourier transform as a probability distribution.
Draw samples from it.

>>: Sorry.

>> Ali Rahimi: Yes.

>>: Going back to the previous notation where the Xs were vectors, are you now in a scalar space?

>> Ali Rahimi: No. The Xs are still vectors. Okay.

>>: So these are multidimensional?

>> Ali Rahimi: Yeah. So this is -- so omega transpose here is actually an inner product between omega and the vector X minus Y.

>>: Aren't you missing a WI there? You wanted a weighted set of --

>> Ali Rahimi: Here?

>>: Or in the sum?

>> Ali Rahimi: No. Because I'm drawing from P of -- so what I slipped under the rug here is that because this kernel is positive-definite, the Fourier transform -- so here's a theorem. The Fourier transform of a positive-definite kernel is nonnegative. This is Bochner's theorem. You don't learn that in elementary school for some reason.

>>: You don't?

>> Ali Rahimi: I meant -- sorry -- in signals and systems. In Alan Willsky's book that has all these Fourier identities, this identity is not there, unfortunately. And it's a really powerful one. So the point is that we can treat this Fourier transform as a probability distribution. It's positive. You can sample from it. So let's approximate this integral using a sample average.

And now I'm just going to rewrite this summation. I'm going to split up each of these terms into a product. And then I'm going to write this in vector form. This vector depends only on X. This vector depends only on Y. And we have our random features. So the Fourier random feature, really what it's doing is saying: if you want to compute K of X and Y, take X and project it down onto the random direction W, where W is drawn from the Fourier transform of the kernel. So you take X, you project it down onto a hyperplane, and then you compute a phasor from that. You just project it down and wrap it around the complex circle, and this complex number now becomes your random feature. And there's a squiggle mark here: certainly this relationship holds in expectation. So certainly this is true.

>>: What about [inaudible]? It seems like some things are probably better than others.

>> Ali Rahimi: For this? So the sampling scheme is given to you. The sampling scheme is: draw from the Fourier transform of the kernel.

>>: Right. But you could draw regular samples as in a discrete transform, or you could just randomly sample -- I mean, is one better than the other, or --

>> Ali Rahimi: You could draw nonrandom samples, you say?

>>: Yeah. I mean, you could draw --

>> Ali Rahimi: Yeah. So certainly this has that. A random sampling will, with high probability, produce something where these two guys are close to each other, and that implies that there exists a deterministic sampling such that these two guys are close to each other. The problem is I don't know how to come up with one. I know how to come up with one by just sampling, but I don't know how to construct one deterministically.

Okay. So the point of this was to show you that, at least in expectation, featurizing your X and Y and computing the inner product gives you the kernel value. I've also shown you how to compute this Z: just draw a bunch of samples from the Fourier transform of the kernel and compute these phasors. What I really want, though, is not these results in expectation. We want to show that this actually holds throughout the space. So let me go through that right now. Let me tell you what we know how to do.
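(A quick numerical check of the expectation claim just derived, as a NumPy sketch rather than the slide's math. The Gaussian kernel and all the constants here are assumptions for illustration.)

    import numpy as np

    rng = np.random.default_rng(0)
    d, D, sigma = 5, 2000, 1.0   # input dim, number of random features, bandwidth

    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); its Fourier
    # transform is again a Gaussian, so sample projection directions from it.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))

    def z(x):
        # Phasor features: project onto each random direction, then wrap
        # around the complex circle; scaled so the inner product is an average.
        return np.exp(1j * W @ x) / np.sqrt(D)

    x, y = rng.normal(size=d), rng.normal(size=d)
    k_true = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
    k_approx = np.real(np.vdot(z(y), z(x)))   # z(x) . conj(z(y)) ~ k(x, y)
    print(k_true, k_approx)                   # agree to roughly 1/sqrt(D)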
So we know that the inner product for a given X and Y is going to be close to K of X and Y in expectation. But we can also show that the tails of this are quite light. This is just by Hoeffding. So for a given X and Y, this isn't going to deviate very much, with very high probability. Using the union bound, you can show the same thing over a discrete dataset of N points. But even more, using a covering number argument, you can show that this holds throughout the whole space. So if the dimensionality of your random feature is high enough, then this inner product will approximate this kernel for all the points in your space with very high probability. So it's not just a result in expectation. This is actually a result that holds with very high probability throughout the whole space.

In fact, let me reinterpret that theorem for you. It says that with probability at least 1 minus P, for some probability P that you're given, the inner product approximates the kernel over the whole space, as long as the dimension of your random feature is high enough -- as long as you sample from the Fourier transform enough times. This depends linearly on the dimension of the space. There's a standard epsilon squared dependence on the error over there. And there's a dependence on the curvature of the kernel as well, as you might expect to see.

>>: But D you have to pay for at runtime, so you don't want big D.

>> Ali Rahimi: That's right. That's right. In fact, in the second part of the talk I'll show you experiments that compare the cost of this D versus the cost of choosing these features optimally. Yeah.

>>: Can you just say what the typical values of D are in your concrete things?

>> Ali Rahimi: Yeah. Why don't I -- when we get to the experiments, I'll tell you.

So here's another random feature. That was the Fourier random feature; there's a totally different class of features that one can also construct. You give me a kernel, and my job, again, to remind you, is to build a function, possibly a randomized one, such that the inner product between the featurized inputs is close to the kernel. This random feature works like this. Grid up your input space, your space of Xs -- just lay down a random grid. I'll tell you how to pick the pitch of the grid. To each bin of the grid, assign a bin string: the bin string is just the number of the bin written in unary. So bin 1 gets a 1 over there, then bin 2, bin 3. And then the random feature representation of a point is just its bin ID written in unary. That way, when you compute the inner product, you basically get 1 if you're in the same bin and 0 if you're not.

And now all that's left is for me to tell you how to compute these random grid pitches. In the same way that we picked the omegas from the Fourier transform of the kernel, here we define a hat transform of the kernel: instead of sinusoids, it's in terms of these hat basis functions. So you randomly sample your grid pitches from the hat transform of the kernel you're trying to approximate. And again, you get the same theorems and the same results as you do with the Fourier features. So let me show you what this looks like in code. It's very simple.
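(Two hedged sketches in Python/NumPy, standing in for the MATLAB shown on the slide, which isn't reproduced in the transcript. First, the random binning feature just described. The Gamma-distributed grid pitch is what the hat transform works out to under the assumption that the kernel being approximated is a per-dimension Laplacian; for other kernels the pitch distribution changes.)

    import numpy as np

    rng = np.random.default_rng(0)
    d, P, sigma = 3, 5000, 1.0   # input dim, number of random grids, kernel scale

    # For the Laplacian kernel k(x, y) = exp(-|x - y|_1 / sigma), the hat
    # transform is a Gamma(2, sigma) distribution over grid pitches; each grid
    # gets one pitch and one uniform random shift per dimension.
    pitch = rng.gamma(shape=2.0, scale=sigma, size=(P, d))
    shift = pitch * rng.random((P, d))

    def bins(x):
        # Bin IDs of x under every random grid; two points share a feature
        # exactly when they land in the same bin of the same grid.
        return np.floor((x - shift) / pitch).astype(int)

    x, y = rng.normal(size=d), rng.normal(size=d)
    same = np.all(bins(x) == bins(y), axis=1)   # collision indicator per grid
    k_true = np.exp(-np.abs(x - y).sum() / sigma)
    print(k_true, same.mean())                  # collision rate ~ kernel value

(And a sketch of the three-line training recipe he walks through next, using the real-valued cosine variant of the phasor features and a ridge-regularized least squares fit; the regularizer and all constants are assumptions.)

    import numpy as np

    rng = np.random.default_rng(0)

    def train(X, y, D=500, sigma=1.0, lam=1e-3):
        # Line 1: sample directions from the kernel's Fourier transform.
        W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))
        # Line 2: featurize the data (random phases make the features real).
        b = rng.uniform(0, 2 * np.pi, size=D)
        Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
        # Line 3: regularized least squares for the hyperplane alpha.
        alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
        return W, b, alpha

    def predict(X, W, b, alpha):
        Z = np.sqrt(2.0 / len(b)) * np.cos(X @ W.T + b)
        return np.sign(Z @ alpha)

    # Toy two-class problem with a nonlinear (radial) decision boundary.
    X = rng.normal(size=(2000, 10))
    y = np.sign(np.linalg.norm(X, axis=1) - np.sqrt(10))
    W, b, alpha = train(X, y)
    print((predict(X, W, b, alpha) == y).mean())   # training accuracy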
If you want to train a classifier with an L2 loss using Fourier random features, you generate a bunch of random Ws -- you just sample them from the Fourier transform of, say, the Gaussian kernel. The Fourier transform of the Gaussian kernel is again a Gaussian, so you just draw a bunch of Gaussian Ws. Then you pass your data through the random feature, so this is the complex sinusoid. And then you fit a linear solver: you fit a hyperplane in this featurized space, and that's just the least squares problem [inaudible] over here. And, boom, you have your hyperplane. That's training in three lines of MATLAB code. And then for testing, you map your input vector through the random map and evaluate the inner product with the alpha that you just learned.

So let's get to the issue of dimensions that you brought up. Here's a bunch of datasets that we ran this on. The typical input dimension is anywhere from 21 to 127 on these datasets. I'll show you higher-dimensional datasets later. Dataset sizes range from a few thousand to a few million. And generally training is really fast with these random Fourier features. And here are the typical dimensions D that I pick. These are much smaller than what the theorem would predict, so the theorem is quite loose. The theorem would predict -- well, it depends what epsilon you want. There's a 1 over epsilon squared. So if you wanted an accuracy of 1 percent, it would be 1 over .01 squared, which is 10,000. So the theorem is loose because of this guy right here.

Okay. So, in practice, you know, we typically get better than state-of-the-art performance compared to various heavily-tuned SVM libraries. The Fourier random features work on most datasets. On some of them, like the forest dataset, they didn't work so well. This dataset has the characteristic that if you actually were to train an SVM with RBF kernels on it, most of the points become support vectors.

>>: [inaudible]

>> Ali Rahimi: I'm sorry, what?

>>: What were the substitutes? Test theorems?

>> Ali Rahimi: Test theorems. Yes. Sorry.

>>: [inaudible] two different flavors: there's the two-class version and there's the seven-class version. I suspect this is the two-class version given the rate.

>> Ali Rahimi: I took the version of the forest cover dataset from these people here, from the core vector machine work. I just grabbed it from there and ran everything on it. Anyway, the point of this line is to tell you that the two random features are complementary in some sense. These are really good for learning really smooth decision surfaces; these are really good for doing nearest-neighbors types of surfaces.

>>: So how do you [inaudible]?

>> Ali Rahimi: Set it to 500, see if it works. And then if it works really well, you set it to a hundred. If it doesn't work really well, you set it to a thousand.

>>: Have you played with combining the two?

>> Ali Rahimi: I have. I have, yes. You can just stack up the random features for the two guys, and things work quite well.

>>: Is it the best of both worlds?

>> Ali Rahimi: Yeah. I mean, in the sense that if you run it on all of these guys, you get basically the same performance.

>>: Why does it perform better than [inaudible]?

>> Ali Rahimi: Well, so there's an approximation going on.
The approximation was in terms of the kernel, not in terms of the decision surface, not even in terms of the ideal decision surface. We really are learning a different surface. It just tells you that the RBF representation isn't the best representation.

>>: And it could be that the hinge -- you're also using regularized least squares classification, right?

>> Ali Rahimi: Yes. I've run all of these things with the hinge loss too, and I basically get the same results. It stops mattering what loss you use when you have large datasets.

Okay. So let me just tell you a few of the properties of these things. As you would expect -- this is on the forest dataset -- [inaudible] the bigger the dataset. So big datasets help, but you knew that. And also, as you would expect, as you increase the dimension of the random feature, your error drops. So this is the error dropping quite fast while training and testing time don't increase very fast. So in practice these things tend to have desirable properties.

Okay. So that was random features. That was about training kernel machines faster. Let me generalize the problem. This is where the random kitchen sinks come from. We're learning these feature mappings based on a kernel that you give me. But why start with the kernel in the first place?

Back in the day -- this is a picture out of a paper by Block from 1962 -- we had these neural networks, and there was this idea from day one that there should be some randomization that happens at the first layer. So this idea of having some randomization in your training algorithm is classical. We don't draw our neural networks like this anymore. Here's maybe a more modern way of drawing things. Here's your input. It goes through some nonlinearities. The nonlinearities have some parameters, and then you weight the outputs of the nonlinearities. What you're learning during training is these weights and the parameters of the nonlinearities. And actually this is also outmoded. We just write this now: our neural network is a weighted sum of parameterized nonlinearities, and we learn the weights and the parameters.

So let me focus on one popular way of training these parameters. When we do AdaBoost, we build this function stage-wise: you train the alphas, you train the omegas, and then you do that for the next stage, and we have T of these stages. In random features we're also training a decision surface of a similar form: our omegas were random, and we were just solving for the alphas. In kernel machines we're also doing something similar, except instead of a finite sum it's an integral, and we're learning this function alpha.

So basically the world of shallow-network machine learning is all focused on learning decision boundaries of this form. And I'm going to focus on one particular way of training these, and that's the greedy method, which goes back about 50 years. So the idea -- and I'm going to focus on a function-fitting framework. Forget the dataset for now. Somebody gives you a function f* and says: please approximate it with a function of this form. You get to choose the omegas and the alphas, but I want the resulting sum to be close to this function in some norm in a function space.
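(In symbols, the problem being set up is the following; this is a reconstruction in the talk's own notation, since the slide isn't reproduced:)

    \[
      \min_{\alpha_1,\dots,\alpha_T,\;\omega_1,\dots,\omega_T}
      \;\Big\| \, f^* \;-\; \sum_{t=1}^{T} \alpha_t \, \phi(\,\cdot\,;\omega_t) \, \Big\|
    \]

where phi is the parameterized nonlinearity, the alphas are the weights, and the norm is whatever function-space norm you care about.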
So you're given a target function to approximate, and you're asked to come up with a bunch of weights and a bunch of parameters such that the weighted sum is close to the target function. The greedy approach -- which looks like AdaBoost, looks like matching pursuit, looks like a lot of these other things -- goes like this. Start with the zero function. Then find one term to add to the function: the one-term addition that gets us closest to f*. Now we have a new function, and we iterate again. For the next term, you again come up with the one that minimizes the difference between the residual and the target function, and you do this T times. So this has the flavor of AdaBoost.

And we know a lot about the performance of a function-fitting algorithm like this. In fact, this result goes back 15 years or so. Suppose you want to approximate a function of this form. This is our target function, and it consists of infinitely many terms. We want to approximate it with a T-term function that we build, as on the previous slide. What's known is that the distance between the approximation we built and the target function decays as 1 over the square root of the number of terms. And there's a constant here that measures, in some sense, the norm of the target function: the L1 norm of the alphas is a norm on functions of this form. This is proved by induction over T.

Let me draw this for you. It says that for any function in this space with an infinite expansion, there exists a function with a T-term expansion -- if you're allowed to pick alpha and omega -- that's not too far from that function: it's within 1 over square root of T. So it's a statement that for all functions in the blue, there exists a function in the purple that's not too far. Okay. That's about as good a rate as you can get. This rate is tight.

So here's another idea. This is the random kitchen sinks idea for fitting functions. You're again given the target function, and you're again asked to come up with alphas and omegas such that this T-term approximation is close to f*. But now we do something much simpler. Just pick the omegas from some distribution. Just randomly, instead of going through this greedy algorithm. There's nothing greedy about this now. Just pick them randomly. And then to pick the alphas, just solve this convex problem. So this looks like a least squares problem, for example.

>>: So all at once [inaudible].

>> Ali Rahimi: All at once. It's a batch optimization over the T alphas. And then just return the alphas and the omegas that you computed.

So let's see how well this does. You would expect that the performance guarantees would somehow depend on the distribution that you used to sample the omegas, right? And so here are the results. If you remember, the theoretical result for the greedy algorithm depended on the L1 norm of the coefficients of the function we were trying to approximate, the C over here. So let's define an analogous norm for the target function we're trying to approximate. We're going to call it the P norm. And it's going to depend on the probability distribution that you use to sample your parameters. So you could think of this as an importance sampling ratio between the alphas and the distribution that you're using to sample these guys.
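(A side-by-side sketch of the two fitting strategies just described, on a toy one-dimensional target. The cosine nonlinearity, the sinc target, and all constants are assumptions for illustration.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Approximate f* with sum_t alpha_t * cos(omega_t * x).
    X = rng.uniform(-3, 3, size=(500, 1))
    f_star = np.sinc(X[:, 0])
    T = 25

    # Greedy, matching-pursuit flavor: T stages, each adding the single term
    # from a candidate pool that best reduces the current residual.
    pool = rng.normal(scale=3.0, size=2000)
    Phi_pool = np.cos(X * pool)                # 500 x 2000 candidate features
    residual = f_star.copy()
    for t in range(T):
        j = np.argmax(np.abs(Phi_pool.T @ residual) / np.linalg.norm(Phi_pool, axis=0))
        a = (Phi_pool[:, j] @ residual) / (Phi_pool[:, j] @ Phi_pool[:, j])
        residual -= a * Phi_pool[:, j]
    greedy_err = np.linalg.norm(residual)

    # Random kitchen sinks: draw T omegas once, then one batch least squares
    # over all T alphas -- no stage-wise fitting at all.
    omegas = rng.normal(scale=3.0, size=T)
    Phi = np.cos(X * omegas)
    alpha, *_ = np.linalg.lstsq(Phi, f_star, rcond=None)
    rks_err = np.linalg.norm(Phi @ alpha - f_star)

    print(greedy_err, rks_err)   # both decay on the order of 1/sqrt(T)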
So the theorem says: if somebody gives me an f* to approximate using the algorithm I just showed you on the previous slide, then the error of the T-term approximation, with probability at least 1 minus delta for any delta that you pick, also drops as 1 over square root of T. So we get the same rate in the number of terms as we do with the greedy algorithm, but here we just sampled the parameters randomly. And then there's this dependence on the importance ratio between the alphas. Yeah.

>>: So here f* is -- you fix it, right?

>> Ali Rahimi: Yeah.

>>: And then your picture before, I thought you said you had something that [inaudible] problem. So can you [inaudible].

>> Ali Rahimi: Yeah. So here's what this theorem says. You fix f*, and f* is drawn from this big set that looks like that: the infinite expansions of your weighted features. And we say that if you consider this random set -- a random set consisting of all weights alpha, but where the features are drawn randomly -- then with very high probability the distance between this fixed f* and this set drops as 1 over square root of T. So whereas before we were making a claim that for all points in this set there is a point close by, in this case we're just making a claim for a fixed point in this set. And that's all you need for function fitting. After all, somebody gives you a function to fit and then you draw stuff. You don't ever need to approximate the whole space; you just need to approximate the function at hand. And that's why we manage to get this 1 over square root of T.

>>: So the rate of improvement is the same, but the difference in the constant could be substantial.

>> Ali Rahimi: This constant could be substantial, yeah. You could pick a sampling distribution for the weak learners, for the features, that's terrible for the given function. It's easy to construct one: if you use a Dirac delta, for example, for your omegas, you'll just learn a really crappy classifier. But at least the theorem is correct in that that crappiness is reflected in this [inaudible].

>>: Have you scored any win doing any refinement at all? Because right now you're assuming that you just pick all random ones. If you weed out some fraction and replace them, or something where you do a little bit more work, do you win?

>> Ali Rahimi: What works really well is if you start with this random thing and then just do a few iterations of gradient descent on the omegas and the alphas. That works incredibly well.

>>: Oh, that sounds like a neural network.

>> Ali Rahimi: Of course. It's all a neural network. You just can't say that out loud.

>>: Oops.

>> Ali Rahimi: Yeah. What is neat about it is that it's a neural network that you initialize with random weights, and you have guarantees about how far you end up from the thing you're trying to approximate.

>>: How far is that [inaudible]?

>> Ali Rahimi: Well, it depends where you start if you're training from scratch. I'm not very good at it, even though I've tried very hard. I often get stuck in local minima. Generally you end up doing quite well if you do this random initialization and then gradient descent. In the experiments I'll report, I don't do the gradient descent refinement.
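(A sketch of that refinement, continuing the toy setup above: random omegas plus least-squares alphas as the initialization, then a few plain gradient steps on both. The explicit gradients are for the cosine model; the step size and iteration count are arbitrary assumptions.)

    import numpy as np

    rng = np.random.default_rng(0)

    X = rng.uniform(-3, 3, size=(500, 1))
    f_star = np.sinc(X[:, 0])
    T, lr = 25, 1e-3

    # Random kitchen sinks initialization.
    omega = rng.normal(scale=3.0, size=T)
    alpha, *_ = np.linalg.lstsq(np.cos(X * omega), f_star, rcond=None)

    # Joint gradient descent on alphas and omegas for the squared loss.
    for step in range(300):
        Phi = np.cos(X * omega)                               # 500 x T
        err = Phi @ alpha - f_star                            # current residual
        g_alpha = Phi.T @ err                                 # d(loss)/d(alpha)
        g_omega = -((np.sin(X * omega) * X) * alpha).T @ err  # d(loss)/d(omega)
        alpha -= lr * g_alpha
        omega -= lr * g_omega

    print(np.linalg.norm(np.cos(X * omega) @ alpha - f_star))  # refined residual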
I've informally just tried training starting from zero or from various parameters that I thought might be good, and it works okay, but nowhere near as well as this.

>>: [inaudible]

>> Ali Rahimi: Oh, and it's much slower, because you need to run gradient descent for a lot longer too.

>>: Is it important to have the random sampling? Because I've tried stuff where the features were from the PCA of the dataset, and that didn't work well either.

>> Ali Rahimi: Yeah. So this random sampling is completely independent of the data.

>>: But you need to choose omegas that you would expect to be good because of the [inaudible] sample?

>> Ali Rahimi: That's right. That's right. So it is a design issue.

>>: [inaudible] okay. Well, I don't know. I tried it once [inaudible].

>> Ali Rahimi: So the reason this stuff ends up working well -- and my intuition about all of this and about what these theorems mean -- is that it really doesn't matter much what the nonlinearities are; just put a lot of effort into figuring out what their weights should be. That's where the magic is, not in here necessarily. That's how I view this result.

>>: It's interesting, because in boosting, for instance, people have tried things where instead of taking the greedy approach, you take the classifiers that you fit and then go back and do a least squares batch fit of [inaudible], and you end up doing much worse in terms of generalization. So it's interesting to hear [inaudible] hurting you.

>> Ali Rahimi: It's not hurting you because of the way we picked the weak learners.

>>: Because of the random [inaudible].

>> Ali Rahimi: Because of the randomness, right. So you're talking about Dale Sherman's result -- yeah. Right. So if you pick your omegas from boosting, you don't want to go back and refine your alphas only. You've got to go back and -- okay.

All right. So this theorem is in terms of some norm I defined. And we can come up with a much stronger form of this theorem in terms of the L infinity norm between the functions, but then the features have to be sigmoids like this. Again, this is if you're going to nitpick about my choice of function norm here: this gives you a result in terms of the L infinity norm. So do you buy that it's enough to just fit a fixed function in this space, that you don't need a universal claim about the whole space? Did I convince you that this is a good enough thing?

>>: Let's take it offline.

>> Ali Rahimi: Great. Perfect. I'm going to skip the proof. The proof idea basically boils down to coming up with tail bounds for a bounded zero-mean random variable in a Banach space. You just replace this term with that random variable, and then you apply standard results from there.

So everything I've told you so far about the random kitchen sinks was about fitting functions, but really we're going to be fitting data. Using a standard decomposition of the defect between the empirical risk and the true risk, you can show this bound.
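(Written out, the bound he describes next looks like this; a reconstruction from the verbal statement, since the slide isn't reproduced:)

    \[
      R[\hat f] \;-\; \min_{f \in \mathcal{F}} R[f]
      \;\le\; O\!\Big(\tfrac{1}{\sqrt{T}}\Big) \;+\; O\!\Big(\tfrac{1}{\sqrt{N}}\Big)
    \]

where R is the true risk, f-hat is the T-term expansion fit to N data points, and F is the class of infinite expansions with bounded norm.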
So if f-hat is the T-term expansion that you derived by looking at N data points and minimizing the empirical risk, then the true risk of f-hat, compared to the true risk of the best risk minimizer, decays as 1 over square root of T plus 1 over square root of the number of data points that you looked at. The 1 over square root of T comes from the previous theorem. The 1 over square root of N is a standard result from learning theory.

>>: [inaudible]

>> Ali Rahimi: This is -- no, it's not. It's not a uniform convergence result. This is a result about this optimizer. It uses uniform convergence for this part of the decomposition. This part, as I was discussing with Ofar [phonetic], only needs a pointwise result.

So let's go over some experiments. Here is the adult dataset. It's a relatively small dataset, but everybody uses it. This is the number of terms that we add in the expansion, and this is AdaBoost's testing error. So after adding a few terms -- about 40, 50 terms is enough for AdaBoost -- it plateaus out at about 15 percent error. For us, we need to draw a lot more random features to get similar error rates. So AdaBoost got there with many fewer terms; we have to use a lot more terms to get to the same accuracy. But we got there much faster: our optimization is much, much faster. AdaBoost does this pretty heavyweight iterative thing where it has to touch more or less the entire dataset at every iteration. We just touch the dataset once in our least squares solution. So this is the runtime as the number of features increases. AdaBoost takes a lot of time -- this is on a log scale -- and we take very little time. And, in fact, let me combine these two graphs. This graph is the amount of training time versus the amount of error that you get. So even though we ended up using a lot more terms in our expansion, we're still much, much faster for a given error rate, because our optimization procedure is much faster.

Here's another dataset. This is data coming from a hip-worn accelerometer that detects your physical activity. We stopped AdaBoost after about a hundred iterations because it was just taking too long, whereas the random kitchen sinks kept on ticking. And, again, you spend a couple of orders of magnitude less time for the same error rate. Another standardized dataset: similar thing. Again, a few orders of magnitude for similar error rates. And that's consistent across the board.

Here's a face detection experiment. We compared AdaBoost with [inaudible] against the Fourier random features. What's neat about this comparison is that you can't train Fourier random features with AdaBoost very well. It's a hard weak learner to fit: it's hard to fit sinusoids to data. But we're picking them randomly in our case, so that's easy. So part of the benefit of this random kitchen sinks trick is that you can start using features that you wouldn't be able to use with AdaBoost, because you no longer need to fit them to data, which is convenient. So we get slightly better performance than AdaBoost on test accuracy. Training is much faster -- seconds instead of minutes. But we do use about a factor of 10 more weak learners. And, again, here's your point, John, about how at runtime D is what's important. In these experiments we were hoping to have a fast trainer. And there are lots of tricks that we started using in a face detector that we built that can avoid you having to slide the window --
-- the detection window -- over the whole image. There's an optimization where you can prune a lot of the search space for the sliding windows. So that's how we get around that kind of slowdown.

>>: Have you tried to see -- so when you find the random features and then [inaudible], then when you're doing the least squares, do you see any sparsity in these? Because it seems like the way to overcome this would be to find a sparse set of alphas and then get rid of those random guys that you never use.

>> Ali Rahimi: Yeah. So that set of experiments I don't report on here. But one idea was, instead of least squares, to use Pegasos and hope to get sparseness out of that. I can't find a good setting of the parameters of Pegasos to get as good accuracy as I get with least squares. So --

>>: [inaudible]

>> Ali Rahimi: Yeah, but then those problems become huge, and I don't have -- I would like to ask you for a really large-scale L1-regularized solver. I think there's a couple out there. I just haven't talked to anybody who could recommend one. Offline.

>>: [inaudible]

>> Ali Rahimi: Similar things with MNIST. In this one we were comparing against boosting by filtering, which is a much faster version of boosting where, instead of touching the entire dataset at every iteration, you just touch a random subset. And, again, you see similar types of results, where you're about a hundred times faster with similar accuracy, but you use more features. I'm going to skip this. I can't really talk about this part, but I think it's neat that Intel may consider using this in something one day.

Right. So here's the lesson from this part of the talk. Here's typically the way people fit these nonlinear decision boundaries: you run the minimization over the weights, and you run the minimization over the parameters. And here's a caricature -- this is not mathematical. The caricature of what we've done is: minimize over the weights but randomize over the omegas, and prove that you get very similar results.

So for the next few minutes I want to talk about less baked things that I've been working on for the past six months or so, unless there are questions about that stuff, then we can...

So here's a neat trend. Everybody's doing semidefinite programs for everything and getting good results, as long as they have at most 10,000 variables or so. So it would be neat to come up with a way to solve semidefinite programs faster. These semidefinite programs typically take this form: you want to minimize a convex function over matrices subject to a polyhedral constraint -- there's a linear constraint on the matrix -- and you want the matrix to be positive semidefinite. This blue thing represents the cone of positive semidefinite matrices. And the problem is, while it's only polynomially hard to perform minimizations like this, it's still hard to do on computers today. We don't have very fast solvers. So it's a challenge to come up with good solvers that can minimize things over this convex cone.

So a trick that Guiomo Obozinsky [phonetic] pointed out to me, which they'd used in a paper, is to replace this set in the optimization with a polyhedral set -- a random polyhedral set. That's the green thing over here. You just generate a bunch of random vertices that are positive semidefinite matrices, and you require X to live in that cone. And it worked amazingly well for their application. And they didn't know why it worked well.
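(A minimal sketch of the trick, assuming all you want to check is how close a fixed target matrix sits to the random cone. SciPy's nonnegative least squares stands in for the LP or QP solver you'd use in a real problem, and all dimensions here are arbitrary assumptions.)

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    n, K = 8, 200   # matrix size, number of random vertices

    # Random positive semidefinite "vertices" V_i = G_i G_i^T (Wishart draws).
    V = np.stack([G @ G.T for G in rng.normal(size=(K, n, n))])

    # A target matrix strictly inside the cone (adding the identity keeps it
    # away from the boundary, matching the condition in the theorem).
    A = rng.normal(size=(n, n))
    X0 = A @ A.T + np.eye(n)

    # Nonnegative weights c so that sum_i c_i V_i is close to X0: an ordinary
    # nonnegative least squares problem instead of a semidefinite one.
    c, resid = nnls(V.reshape(K, -1).T, X0.ravel())
    print(resid / np.linalg.norm(X0))   # relative distance to the random cone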
And I ran a bunch of experiments, and it seems to work well as long as your optimum is not on the wall of this cone -- as long as it's not an extreme point of this cone, it works amazingly well. And if it is an extreme point, then you can still get within some epsilon with high probability. So what can we say about this type of thing? Here's a theorem about it. Actually, before the theorem, let me tell you how one uses this trick. You just replace the positive semidefinite constraint with this one: that X has to lie in the polyhedral cone whose vertices are these randomly drawn V_i's. And to say that it's a cone means that these weights have to be positive. So now if F is linear, for example, this turns into a linear program. If F is quadratic, this is a quadratic program. We can solve all of these things really fast. And this graph is a simulation showing that a lot of the matrices you draw from the positive semidefinite cone do end up being extremely close to this random convex polyhedral cone.

So the theorem -- and it's still in flux; I think some of it can be improved -- is that if you're given a target matrix X naught, so for a fixed X naught, draw a bunch of random positive semidefinite matrices from the Wishart distribution and construct this cone of positive combinations of these random Wishart matrices. Then with really high probability, the target matrix is close to this convex polyhedral cone, as long as the number of random points that you drew is large enough. And "large enough" of course depends on how accurately you want to approximate the target matrix, and it also depends on this guy, which quantifies how close the target matrix is to the boundary of the cone. So with this you now have a tool to convert hairy semidefinite programs into optimization problems over random polyhedral cones -- just turning a semidefinite program into a random linear program. So that's one thread of research.

Here's another thread of research that I don't know if it's going to pay off, but it's all sci-fi and it feels good to work on. It turns out that if you take a normal CPU and you drop its voltage below the voltage the instruction manual tells you to run it at, the CPU will still run, but it will make mistakes. And you save a lot of power. Power consumption drops somewhere between V squared and V cubed.

>>: I thought the classical scaling was V squared.

>> Ali Rahimi: Right. But you get to drop the clock, too.

>>: Oh.

>> Ali Rahimi: Because there's a dependence on the clock.

>>: Sure, sure, sure. [inaudible] squared F. Yeah.

>> Ali Rahimi: So wouldn't it be neat if the next processors that you -- actually, I'm totally not allowed to say it that way. Wouldn't it be neat if in the laboratory somebody were to build a processor whose floating point unit ran at a lower voltage? It saves a lot of power, but it makes some mistakes here and there. Or makes a lot of mistakes. Actually, that's been explored; this is not an entirely new idea. People have been prototyping these circuits where they're designed normally, and when you drop them to low voltage they have this little shadow latch that detects whether the circuit is misbehaving. And if the circuit detects that it's misbehaving, then it will flush the instruction pipeline, reissue the instruction anew, and raise the voltage a little bit higher.
So that's the hardware approach to resilience on undervolted processors. Yes.

>>: [inaudible] per unit the unit can just deliver [inaudible] and let [inaudible] take care of it later. And [inaudible] which I'm sure you know about. Results are off.

>> Ali Rahimi: I've heard about it, yeah. So there are various ways to notify the software layer that an error has occurred. There are ways to mask it at the hardware layer by just reissuing the instructions and not letting the software worry about it. But here's another idea. Let's just get rid of the overhead of the shadow latch. It's taking up power, it's taking up die area, and it may even force you to run stuff at a higher voltage just to get the shadow latch to work correctly. Let's expose all the errors to the software. The floating point unit will not just flag when it's made a mistake; it will just return the garbage that it computed. It will say A plus B is equal to something totally random. But now let's design our algorithms so that they can tolerate that type of error.

So here's the idea. Let's start with a classical combinatorial problem, say bipartite graph matching. This is a standard problem. Let's express it as a linear program and then convert the linear program into an unconstrained optimization problem. None of this involves computation; these are all pen-and-paper transformations. And then, to solve the bipartite graph matching problem, let's just toss this unconstrained optimization problem into a stochastic gradient solver. The reason is that we know stochastic gradient can tolerate noise in the gradients. So whenever we compute the gradient, which is where you spend the bulk of the computation in this type of minimization, drop the voltage. Feel free to compute a really noisy gradient. And then do the update at normal voltage.

>>: Are the errors random errors?

>> Ali Rahimi: Yeah, it depends what regime you're in. If you're in a regime slightly below the design voltage, the errors you get are timing errors. And they're random only in the sense that they're hard to model: the result that you get out of the FPU depends on the previous result that the FPU returned. But if you're far below that, then you actually get transistor noise, which is better modeled as a stochastic source.

And here are some preliminary results. This is quite preliminary. We built this hardware simulator that has a SPARC processor on an FPGA, and the FPGA injects noise into the output of the floating point unit. So here's a least squares solver -- just the least squares solver from OpenCV. If you drop the voltage of the CPU and inject all these errors, the least squares solver returns really noisy results. This is the difference between the output of the least squares solver at low voltage and the optimum; you just get these very large residuals. But using our stochastic gradient solver, no matter what the error rate, you just nail the result eventually. Similar thing with bipartite graph matching. If you use OpenCV's earth mover's distance solver at low voltages, it does quite poorly, whereas ours -- well, it doesn't get a hundred percent yet, because there's a bug, but basically its performance doesn't depend on the amount of error that you inject.
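(A software-only sketch of the same idea, with a crude, zero-mean noise model standing in for the undervolted FPU; the corruption rate and magnitude are assumptions.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Least squares by stochastic gradient, where every gradient evaluation
    # is corrupted -- a stand-in for computing gradients on faulty hardware.
    n, d, p_err = 1000, 20, 0.05
    A = rng.normal(size=(n, d))
    x_true = rng.normal(size=d)
    b = A @ x_true

    x = np.zeros(d)
    for t in range(1, 200001):
        i = rng.integers(n)
        g = (A[i] @ x - b[i]) * A[i]        # stochastic gradient from one row
        bad = rng.random(d) < p_err         # entries the "hardware" garbles
        g[bad] = rng.normal(scale=10.0, size=bad.sum())   # zero-mean garbage
        x -= g / (100.0 + t)                # decaying step absorbs the noise
    print(np.linalg.norm(x - x_true))       # error keeps shrinking with t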
So we really are just -- we are taking longer to compute these results, because stochastic gradient is obviously slower than, say, the simplex method, or the SVD in this case. But at least we're getting robustness right now. Yes.

>>: What if all you were doing a single bit [inaudible] multiply the result by a few hundred orders of magnitude -- it doesn't happen, or you can recover, or what's the deal?

>> Ali Rahimi: It does happen, and we can recover. That's right. Yes.

>>: Do you have a comparison between the time lost and the amount of power you saved?

>> Ali Rahimi: What I do know is that from here to here corresponds to power savings of -- this uses about 2 percent of the power and this uses about a hundred percent of the power. So it's a factor of a hundred in power savings. But the amount of time that you spend is just ridiculous. This is just not a worthwhile technique right now. You end up using a lot more power, because you keep running the stochastic gradient solver for too long.

>>: [inaudible] more energy.

>> Ali Rahimi: You end up using more energy because you're waiting for your computation. So this is a motivation for us to develop faster stochastic solvers: instead of just following the gradient, let's try [inaudible] gradient methods or second-order methods -- anything other than the plain gradient direction will probably help.

>>: There are some very low-resolution floating point numbers [inaudible] about 8 bits, but there were codecs. If you use that, you must [inaudible].

>> Ali Rahimi: That's right. That's right. So an alternative is to have your FPU be narrower and then stitch the output together later.

Before I get to my pontification slide, I wanted to acknowledge some collaborators from various universities -- a lot of people who I've talked to about this stuff over the past couple of years who have been very helpful. So part of the flavor of this talk was about randomizing things instead of optimizing things, and just about generally doing less work and hoping that your random number generator will get you the right answer. And it turns out that we can prove that it often does. I like digging back into the old literature and finding the roots of some of these ideas, like you saw with the neural network picture, and I was trying to figure out why more people don't randomize things. My literature search took me to the original source. And there's support for both ways of doing things: for optimizing really hard, or for just throwing caution to the wind. And I can open it up for questions if you'd like. I probably won't be able to answer most of them. Don't ask me. Okay. Yes.

>>: [inaudible] but if you reduce voltage by 30 percent, I can see reducing power by a factor of two, but a factor of 20 -- either it's megahertz or it's black magic.

>> Ali Rahimi: Well, certainly, if you drop voltage by a factor of two, you're dropping power by a factor of four. But you also get to run your clock more slowly under this scheme.

>>: You can run the clock slowly [inaudible]?

>> Ali Rahimi: Pardon?

>>: I can see [inaudible] underclocking reduces power.

>> Ali Rahimi: Yeah.

>>: [inaudible]

>> Ali Rahimi: I'm sorry. Yes. Yes.

>>: The rest of it is either black magic or all in megahertz.
>> Ali Rahimi: So I failed to actually manifest any black magic here, because I admitted to you that these stochastic gradient solvers actually end up taking a long time. So don't be too impressed by these results, and don't think that I'm some voodoo master. This is just a first step toward an algorithmic way to tolerate noise in numerical algorithms. The rest of it -- this idea of becoming resilient to undervolting -- that's standard and classical. People have been solving this at the circuit level for a decade. The innovation is to do it at the software level using tricks from the machine learning community. Yes.

>>: [inaudible] randomization experiments, you mentioned the boosting thing didn't actually look at the whole dataset, just looked at a random subset. I was wondering why that didn't help more, and also, relatedly, whether you can prove things about randomized greedy schemes.

>> Ali Rahimi: Yeah. So actually Joseph Bradley is the one who has all the results on randomized boosting schemes. Boosting by filtering is his work. He has bounds on how well it does, and there, again, you get the 1 over square root of T type of thing. And my sense is that when you're going to train a weak learner, you just need to look at the data, if you're going to pick the weights using the greedy AdaBoost method.

>>: [inaudible] a little tiny bit [inaudible].

>> Ali Rahimi: That's right. So --

>>: Almost random.

>> Ali Rahimi: Right. So that's their trick. So, if you will, they are picking their weak learners randomly, just like I do, except they pick them by looking at a random subset of the data. The way they add these weak learners to the final function is by the stage-wise thing. And I'm still insisting that this stage-wise thing is what kills you.

>>: So [inaudible] you're fitting on the entire dataset once you -- I'm not sure --

>> Ali Rahimi: The stage-wise thing is that when you pick the weight, the optimum weight for the weak learner that you just learned, you're again looking at a subset of the dataset. You're always just looking at a subset of the dataset in the stage-wise thing. Yeah.

>>: [inaudible] but when you're testing the function and when you're training --

>> Ali Rahimi: In this work, you mean?

>>: Yes.

>> Ali Rahimi: So there's no testing and training here.

>>: [inaudible]

>> Ali Rahimi: Here? No. This is just a randomly generated least squares problem.

>>: Okay.

>> Ali Rahimi: There's no -- don't think of it as a machine learning problem.

>>: Okay.

>> Ali Rahimi: The stochastic gradient is a machine learning tool, but there's no data fitting going on. Okay.

>> John Platt: Let's thank the speaker.

[applause]