>> Dengyong Zhou: Hi. It's my great pleasure to have Professor Le Song. Le is currently an assistant professor in the Department of Computer Science and Engineering, School of Computing, Georgia Institute of Technology. Le received a PhD in Computer Science from the University of Sydney in 2008 and then conducted postdoctoral research at the School of Computer Science, Carnegie Mellon University from 2008 to 2011. His research interests include nonparametric kernel methods, probabilistic graphical models, the dynamics of networked processes, and their applications. He received an NSF CAREER Award in 2014. Please. >> Le Song: Thanks everyone for coming. I'm going to talk about kernel methods and how to make them scale to millions of data points, or even tens of millions of data points. The motivation for this work is that we are trying to catch up with deep [indiscernible] using kernel methods. The motivating problem is this gigantic image classification problem: you have millions of labeled examples, and the number of classes is also very big, so 1000 classes is just a subset of the classes; there are even more. What we want to do is take an image and predict its label. The label can be something like mushroom, and what is shown here is the label predicted by some algorithm: it predicts a total of five labels, and some labels have higher probability. This prediction is actually produced by a neural net, and it's doing a pretty good job. Sometimes the correct label is not in the top place but it might be in second place, for instance like this. So the best performing algorithm I mentioned is the deep neural net. It's a very complicated model with multiple layers, but the layers are structured in such a way that you see patterns. The first five layers are these convolution and pooling layers, highly structured processing units that I'm going to explain a little more. The next two layers are fully connected layers that perform some kind of nonlinear transformation, and the last layer is a multiclass logistic regression. So what is in this convolution and max pooling layer? It performs [indiscernible] very simple operations. Essentially the convolution layer just takes some convolution kernel, which is different from the [indiscernible] kernel I'm going to talk about later, although it is also called a kernel. You take this small template, which has numbers in it, and you perform a weighted sum over your original image; as you move this window across the entire image you produce another image. Very often in neural nets you also apply this max operation after you finish a convolution; that's called a rectified linear unit, a nonlinear thresholding type of operation. So essentially the first five layers perform these types of operations: you convolve the image with some templates, you apply that nonlinear operation, and you also do max pooling, which I'm going to explain. Max pooling is something even simpler. It essentially reduces the resolution of the image: you take a small patch, look at the values in that small patch, and return the maximum value in that patch. That's the operation, and it is typically used to reduce the [indiscernible] of the image. You interleave this convolution and max pooling for many layers, five layers, and it's the best performing model.
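As a rough illustration of the convolution, rectified linear unit, and max pooling operations just described, here is a minimal NumPy sketch (my own toy code, not the actual network; the filter values, sizes, and strides are made up):

```python
import numpy as np

def conv2d(image, template, stride=1):
    """Slide a small template over the image and take weighted sums (a convolution layer)."""
    h, w = image.shape
    k = template.shape[0]
    out = np.empty(((h - k) // stride + 1, (w - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * template)
    return out

def relu(x):
    """Rectified linear unit: nonlinear thresholding at zero."""
    return np.maximum(x, 0.0)

def max_pool(image, size=3, stride=2):
    """Return the maximum value in each small patch, reducing the resolution."""
    h, w = image.shape
    out = np.empty(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i * stride:i * stride + size,
                              j * stride:j * stride + size].max()
    return out

# One convolution / ReLU / max-pooling stage on a toy grayscale image.
image = np.random.rand(32, 32)
template = np.random.randn(7, 7)   # a hypothetical 7x7 filter
feature_map = max_pool(relu(conv2d(image, template, stride=2)))
```

A real network would apply many such filters per layer across all input channels and stack the resulting feature maps, as described next.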
And then after that you have two of these fully connected layers. Each one just takes the input as a vector, performs a weighted combination of everything in that vector, and produces an output unit, and you again apply this max operation, the rectified linear unit. You do this twice in a hierarchical fashion; it's a highly mysterious thing. Afterwards you put a logistic regression on top of it: essentially you learn this weight W, you take a weighted combination of the input vector, and you push it through the exponential function, which gives you something related to the probability of predicting label Y. So this is the model. >>: So can you explain to us again this max pooling operation? >> Le Song: The max pooling operation just takes a small patch of the original image and returns the maximum value in that small patch as a new pixel. >>: And so you are saying that I'm doing convolution and then max pooling? >> Le Song: Yeah. You interleave between them. >>: So when you say interleave you mean that one layer is doing... >> Le Song: So max pooling goes together [indiscernible]. One layer is the convolution, and the max pooling [indiscernible] follows that convolution. Sometimes they may skip this step, by model selection or whatever; I don't know why. So it's such a highly structured model, doing lots of max operations, and in the end you push it through this multiclass logistic regression. It's highly structured; different layers have different functions, but the goal of the model is to do this gigantic nonlinear classification task. This slide is just detail about the model and its parameters. When you have a convolution you need to specify these filters, these kernels, and this is the way they specify those parameters for convolution and max pooling, things like this. You have the original image, which is 224 by 224 with three color channels. The kernel size is 7 by 7, and you slide it across the image, or rather you jump with a stride of 2 instead of sliding one pixel at a time; if you use 96 of them you get an image of 110 by 110 by 96 after the first layer of convolution. Then you do max pooling, reducing the spatial resolution, and you get something like 55 by 55 by pooling over a 3 by 3 window. That's how they specify this model. Then you do convolution again, a 5 by 5 window with [indiscernible] 2, and you use 256 of them and get a smaller image. You do max pooling, convolution again, convolution, convolution, and here you skip the max pooling operation; here is another convolution and max pooling, and then it pushes through a fully connected layer here, another fully connected layer, and then the multiclass logistic regression. If you look at this very complicated model, the convolution part actually consists of five layers, but the total number of parameters in the convolution layers is only 3.7 million. The majority of the parameters actually lie in the last three layers, the fully connected layers and the multiclass logistic regression, which have 58 million parameters. So essentially, given this gigantic image data set, you want to learn all these 60 million parameters. That's the problem. >>: Do you happen to know how they came up with this particular set of numbers? >> Le Song: So actually they have a nice package where you can just specify the configuration for these layers. >>: How do they come up with 7 by 7 and not... >> Le Song: I think model selection over 10 years or maybe 20 years. That's my impression.
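To make the last layer concrete, here is a minimal sketch of a multiclass logistic regression (softmax) layer of the kind described, with made-up dimensions rather than the network's actual sizes:

```python
import numpy as np

def softmax_predict(W, b, x):
    """Weighted combination of the input vector, pushed through the exponential
    to get class probabilities (multiclass logistic regression)."""
    scores = W @ x + b                    # one score per class
    scores -= scores.max()                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # P(y = c | x) for each class c

# Hypothetical sizes: a 4096-dimensional feature vector and 1000 classes.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1000, 4096))
b = np.zeros(1000)
x = rng.normal(size=4096)
probs = softmax_predict(W, b, x)
top5 = np.argsort(probs)[-5:][::-1]       # the five highest-probability labels
```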
You can experiment with the configuration by specifying a small configuration file; they can just search over architectures. The convolution makes sense in some sense [indiscernible] image features, and I understand the multiclass logistic regression, but I don't quite understand the fully connected layers. So we are all trying to understand what happens when we put all these things together. Why is it doing so well? That's what we try to understand. Yeah. >>: [inaudible] 96, that's the number of filters you use? >> Le Song: Number of filters, 96 filters. That's why you get an image with something like 96 channels. >>: Basically you reduce the image size but you increase the number of filters? >> Le Song: That's right. So when you do convolution, for instance in this layer with a 5 by 5 filter, it can [indiscernible] across different color channels, so some [indiscernible] look across 96 different channels. It's something like this, and they managed to get this working very nicely. When you try to learn these parameters, the algorithm that has been used is very simple: the driving force behind it is really stochastic gradient descent. You take a small batch of data, compute a gradient, use the [indiscernible] derivative, and you get updates for all these parameters. You have a gigantic nonconvex optimization problem; you just take the gradient of the objective with respect to the parameters and use the chain rule to update them. But when you train it you also generate something called virtual images. You're not just using the original data set of 1.3 million data points; you actually generate more images by randomly cropping the image, mirroring it, things like this. You do some transformation of the image and generate something like the original data set but slightly different. So along the way, although the original data set is 1.3 million, you actually generate up to 100 million data points, and it takes a week on a GPU to train this model. You get this performance: if you predict the top-one label and compare it to the actual label, the error is about 42 percent. That's the top-one classification error. >>: [inaudible]? >> Le Song: So at this point I should explain the manual adjustments [indiscernible]. The way they do it is they set the learning rate to some constant, and after some point, when they see the test error [indiscernible] flatten, they drop the learning rate and you see a sudden decrease in the error; then after it flattens again they decrease it again, until after three of these adjustments you don't see any improvement anymore. This is something that, if you don't try it, you won't realize how they actually train. >>: How many manual adjustments? >> Le Song: You just watch this [indiscernible]. You could basically estimate the change, I guess, if you wanted to do it automatically. >>: What's the training error? >> Le Song: The training error is going to be like 30 or 20 percent. There's huge overfitting there. >>: And training error for [inaudible]? >> Le Song: Top five. Nowadays the best-performing model can be below 10 percent; it's very accurate, close to human judgment. So this is the state of this model, and it actually raises lots of unsolved questions as well. Of course it performs very well in many applications, speech, images; the question is why this special kind of architecture? Maybe there is some special characteristic of the data which is particularly suitable for this type of architecture.
It's not clear what kind of assumption you can make about the data such that this [indiscernible] is really good, and I haven't seen any principled theoretical work on this question. Also, instead of five layers of convolution and max pooling, can we use other architectures or operations to extract these features? It's not clear whether there is an alternative, simpler operation you could use. Then of course, why three layers of fully connected nonlinear units? What you need is a nonlinear transformation, so why not four? Is there a way to do it in a shallower fashion? Can you use alternative nonlinear classifiers if all you want is nonlinear classification? In this talk I'm going to essentially explore these last two questions. I'm also very interested in the first two, but for now I don't have anything concrete for those. I'm going to use kernel methods to explore these two questions a little bit: I'm going to try to replace these three layers, the fully connected nonlinear units and the multiclass logistic regression, by a kernel method and see whether I'm able to achieve a comparable result. If I'm able to do that, maybe this part really is not necessary: you could just replace it by a traditional nonlinear nonparametric method, and what's maybe really useful is the convolution and max pooling layers. So the kernel in kernel methods is this type of kernel, a positive semidefinite kernel. It's essentially a function taking two arguments, and if you have a data set of fixed size M and form this kernel matrix, the matrix has to be positive semidefinite. It's this special type of kernel: it's different from the smoothing kernel in statistics, and it's different from the convolution kernel. So essentially I'm going to estimate some nonlinear classification function; I'm trying to do multiclass nonlinear classification, and I'm going to restrict my classification function to be in this space of functions, the space spanned by these kernel functions with one argument fixed at the data points. When I have this positive semidefinite kernel, the nice properties are that the kernel function itself lies in that space, that the inner product between two such kernel functions gives you the kernel value, and that if you have a function in that space you can evaluate it just by taking inner products in that space. For instance, some familiar kernel functions: polynomial kernels, which are definitely not in the smoothing kernel family, and the Gaussian kernel, which also appears as a smoothing kernel in statistics. For this particular kernel function, if you have a function in that space which is a linear combination of these kernels, you get a function of this shape. You can represent highly nonlinear functions just by a linear combination of kernels sitting on the individual data points. So this is just a brief introduction to this type of kernel function and the space of functions spanned by these kernels. In [indiscernible], you can think about the kernel method as transforming the data. Here is a binary classification problem: you have a negative class and a positive class and you want a nonlinear classifier. The kernel function transforms the data into a new space, potentially a three-dimensional space, and in that space you try to find a linear relationship.
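A small sketch of these objects, assuming the Gaussian RBF kernel (the data, bandwidth, and weights below are made up):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Positive semidefinite Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # M = 50 points in two dimensions

# The kernel matrix on the data set should be positive semidefinite.
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
assert np.linalg.eigvalsh(K).min() > -1e-8        # all eigenvalues >= 0 up to rounding

# A function in the span of the kernels: a weighted combination of kernels
# sitting on the individual data points, f(x) = sum_i alpha_i k(x_i, x).
alpha = rng.normal(size=len(X))
def f(x):
    return sum(a * rbf_kernel(xi, x) for a, xi in zip(alpha, X))

print(f(np.array([0.3, -0.2])))                   # evaluate the nonlinear function at a new point
```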
Many kernel methods have this intuition behind them, and many kernel methods can be formulated as this optimization problem: you try to find some function f in that space that minimizes some kind of expected loss. l is the loss function; you have data x and y, where y is the label generated from some distribution. You minimize this expected loss subject to a constraint that the [indiscernible] norm of the function is bounded. An equivalent formulation of this optimization problem moves the constraint into the objective function and adds this nu, which is something like a regularization parameter; the bound and nu are related. By choosing different loss functions you get different algorithms: kernel logistic regression if you choose this loss function, and you get [indiscernible] here. Typically for kernel methods you solve it in the [inaudible]: essentially you use this so-called representer theorem. You have M data points, you replace the expected loss by the empirical loss, and it turns out the solution of the optimization problem has a form like this: it is just a weighted combination of the kernel function applied to the training data points, even though you are optimizing over a function space that could be infinite dimensional. So if you have the representer theorem you can [indiscernible] substitute that into the original optimization problem and just optimize over this alpha instead. Alpha lives in R^M; it's not f, it's just a finite number of alphas. So you solve this optimization problem, but the problem is, if you look carefully, you have this kind of double sum in the objective function. That means for each pair of data points you have to evaluate the kernel function, and that creates a lot of problems. Think about M as 1 million: if you want to evaluate the pairwise kernel function you essentially need to fill in the entries of this big matrix, and one million by one million costs 10^12 in memory, and the computation is generally M squared times d, where d is the dimension of the original data. So it's huge, and you have to come up with a way to scale kernel methods up. There has already been a lot of effort, for instance approaches based on low-rank decomposition of the kernel matrix. You take this kernel matrix, you don't compute every entry, and you have a way to incrementally approximate the kernel matrix by some low-rank factors. These methods, like the [indiscernible] method [indiscernible], usually have computation that is linear in the number of data points times T squared, where T is the rank you choose, times d, the dimension of the data, and the storage is just M times T. But if you look at the generalization ability after you do this low-rank approximation, plug it into your optimization algorithm, and get your function, the best you can prove without any further assumptions, comparing the expected loss of the best function in the family with that of the function produced from the low-rank decomposition, is that the difference is of order 1 over the square root of T, the rank, plus 1 over the square root of the number of data points. If you want to get the best out of your data you essentially have to match these two terms, and that means T has to be of the order of M. So you get M cubed kind of computation, and you get M squared memory consumption again.
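Writing out, in rough form, the formulation and the rank-versus-sample trade-off described above (my paraphrase of the slides; constants are approximate):

```latex
% Kernel method as penalized risk minimization over the RKHS \mathcal{H}:
\min_{f \in \mathcal{H}} \;
  \mathbb{E}_{(x,y)}\big[\, \ell(f(x), y) \,\big] + \frac{\nu}{2}\,\|f\|_{\mathcal{H}}^{2}

% Representer theorem (empirical version with M training points):
f^{\star}(\cdot) \;=\; \sum_{i=1}^{M} \alpha_i\, k(x_i, \cdot), \qquad \alpha \in \mathbb{R}^{M}

% Substituting back produces the double sum over pairs, the O(M^2) bottleneck:
\|f\|_{\mathcal{H}}^{2} \;=\; \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i\, \alpha_j\, k(x_i, x_j)

% Rank-T approximation: generalization gap of order
O\!\big(1/\sqrt{T}\big) + O\!\big(1/\sqrt{M}\big)
\quad\Longrightarrow\quad T \sim M \text{ to match the statistical rate.}
```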
So in practice you also find this is a problem: if you do a low-rank decomposition fixed at some small rank, the learned classifier always loses some performance compared to the case where you optimize directly with respect to the full kernel matrix. That's what you observe, because in theory you need to match these two things up. More recently people have been looking into the so-called random feature approximation of the kernel function. There is an interesting relationship between a positive semidefinite kernel function and certain random processes: if you have this kind of function, you can always express it in an integral form. You have some random variable omega which follows a distribution p(omega), and the kernel function can be written as a random function indexed by this omega applied to your x, times the same random function applied to x', integrated [indiscernible] with respect to omega. So you can also write it in this expectation form. For some kernel functions you can find this distribution p(omega) and this phi_omega in closed form, but you can also go in the other direction: pick whatever nonlinear function phi_omega you want, pick whatever distribution you want, and just define the kernel this way. It will give you a [indiscernible] positive semidefinite kernel. So you can go in both directions, but people have worked out closed-form expressions for some well-known kernel functions. For instance, for the Gaussian RBF kernel the form is like this: delta is the difference between x and x', the random function phi_omega is something like cosine of omega transpose x plus tau, omega follows a Gaussian distribution, and tau follows a uniform distribution, between zero and 2 pi I think, something like that. And if you have a Laplacian kernel, or any kernel in this family of translation-invariant kernels, you have a nice solution: for the Laplacian kernel the corresponding distribution of omega is the Cauchy distribution, and if you have a Cauchy kernel then the corresponding distribution is the Laplacian. There's a nice relationship between them. Comparing this random feature approximation approach to the low-rank matrix approximation approach, there is already some advantage. Essentially what you do is draw some random feature parameters omega and approximate the kernel function by the average of these random features: instead of the expectation, you just draw the omegas at random from p(omega) and approximate it by a simple average. And the computation is simpler. Ignoring the [indiscernible] for the moment, suppose you have d there: you just need to apply each of these random feature functions to each of your data points, so if you have T random features it's T times d per data point, and you have M data points; that's the operation count. The log d is because I'm going to [indiscernible] some tricks for efficient matrix-vector multiplication that get a log d kind of scaling. The memory is still O(M times T). So essentially you can also form this matrix, but instead of using a low-rank matrix factorization you just compute it: you apply the random functions directly to each data point and get a number for each of these random features.
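A minimal sketch of random Fourier features for the Gaussian kernel along the lines just described (following the standard construction; the data are made up, and I use the usual uniform offset on [0, 2 pi]):

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, T=2000, sigma=1.0, rng=rng):
    """Random Fourier features phi_omega(x) = sqrt(2/T) * cos(omega^T x + tau),
    with omega ~ N(0, I / sigma^2) and tau ~ Uniform[0, 2*pi]."""
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, T))
    tau = rng.uniform(0.0, 2.0 * np.pi, size=T)
    return np.sqrt(2.0 / T) * np.cos(X @ omega + tau)

# Compare the exact Gaussian kernel matrix with its random-feature approximation.
X = rng.normal(size=(5, 10))
Phi = rff_map(X)                                    # the M x T factor "A"
K_approx = Phi @ Phi.T                              # average of phi_omega(x) * phi_omega(x')
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / 2.0)                   # sigma = 1
print(np.abs(K_exact - K_approx).max())             # small when T is large
```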
So again, once you have this low-rank factor A, you solve a T-dimensional problem. And again, if you want to prove something, you will find that the generalization ability is 1 over the square root of T plus 1 over the square root of M; again you want to balance the two terms. >>: I'm missing something on the computation part. Even if I just want to write down this approximate matrix I'm not getting the true [inaudible] including its formation; since it is of size M by M, it's at least M squared. >> Le Song: Yeah, that's right. So you don't explicitly form this approximate matrix. You just keep this low-rank factor A and work with it as if your data came from a [indiscernible]-dimensional space, a T-dimensional space. You work with this matrix. >>: But my algorithm is going to access all the entries you need. >> Le Song: So your algorithm is... >>: So in some sense the optimization algorithm is going to stay M squared regardless of... >> Le Song: Before you run the optimizer you do this preprocessing, and that preprocessing actually doesn't need to touch the entries of that matrix; it works directly with the data points. The same thing holds for this [indiscernible]: you don't actually go through every entry of the matrix. That's how you get M d squared; otherwise you would need on the order of M squared operations. You just use a few entries of the matrix to come up with an approximation. In this case you directly apply the random features to your data points to get this [indiscernible]. So again you need to match the two. >>: So how tight is this [inaudible] bound? Imagine your data lie in a lower-dimensional... >> Le Song: In that case you're lucky: the low-rank approximation will work really nicely, and if you incorporate that low-rank knowledge you might get better bounds. Here I'm not making any assumption on the data. If you want to do something fully nonparametric, that's the kind of bound you will get; with a low-rank assumption you could possibly get better results, better theoretical guarantees. That's the problem. So what I'm going to do is look into a scalable algorithm which has been applied in many other places, stochastic gradient descent, but I'm going to add another layer of randomness to this algorithm to make it scalable for the kernel case. First I'm going to show you why traditional stochastic gradient descent is not good enough for kernel methods. I'm going to directly optimize the function in the [indiscernible]. I'm going to use something called the functional gradient instead of a gradient over some finite-dimensional vector; it's just a generalization of that. Essentially you perturb the function by some epsilon in the direction of g, you look at the change, and you can express that change as an inner product; the thing in front of this g is going to be your functional gradient. For instance, if your function evaluation f(x) is here and you take the gradient with respect to the function, then using the reproducing property of this space you can express the evaluation like this; taking the gradient of this linear functional you get this term. If you have the squared norm in the [indiscernible] you get a functional gradient like this, two times f. If you think of the function as a vector, everything seems very natural.
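Written out, the functional gradients just mentioned and the stochastic update that the next part builds on (my transcription of the slides' notation, so treat the exact constants loosely):

```latex
% Reproducing property and the two basic functional gradients:
f(x) = \langle f,\, k(x,\cdot) \rangle_{\mathcal{H}}
\;\;\Rightarrow\;\;
\nabla_{f}\, f(x) = k(x,\cdot),
\qquad
\nabla_{f}\, \|f\|_{\mathcal{H}}^{2} = 2 f .

% Hence, for R(f) = \mathbb{E}[\ell(f(x),y)] + \tfrac{\nu}{2}\|f\|_{\mathcal{H}}^{2},
% applying the chain rule and exchanging expectation and gradient:
\nabla_{f} R(f) = \mathbb{E}_{(x,y)}\big[\, \ell'(f(x), y)\, k(x,\cdot) \,\big] + \nu f .

% Stochastic functional-gradient step using one data point (x_t, y_t):
f_{t+1} = f_{t} - \gamma_{t}\,\big( \ell'(f_t(x_t), y_t)\, k(x_t,\cdot) + \nu f_t \big).
```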
So for this expected loss, if you take the gradient you apply the chain rule once: you take the gradient of the loss function with respect to f(x), then take the gradient with respect to f again, and you get this additional term here, which comes from the squared norm. The expectation can be exchanged with the operation of taking the gradient, and that's what you get. So for many of these kernel methods formulated as convex optimization you can take the gradient, and it has a form like this. Then what you can do is take a subset of data points, one data point in the simplest case, though you can take a mini-batch, and update your function using the stochastic gradient computed from those individual data points. In the end you will find that your function is a weighted combination of the data points you have seen so far, and the number of terms in this summation can be equal to the number of data points. If you apply standard results for stochastic gradient descent or mirror descent you will get this type of rate, 1 over the square root of T; in this case the number of iterations T matches the number of data points you see, so you get this rate. But the problem with this approach is that you need to remember all these training points. If you want to evaluate the function at a new test point you have to keep all the training points: you plug the new test point into the kernel functions and do a weighted sum. You cannot throw away those training points in general. >>: So is this T the same as the T before, the [inaudible] points? >> Le Song: I tried to make them the same. This T you can think of as the rank; it's not the same parameter, but they are comparable in some sense. The T iterations here play the same role as the rank in the previous slides. >>: So I don't understand the comment that you need to remember all points. You just need to remember T points. >> Le Song: That's right, you need to remember T points. But suppose you want to get 1 over square root of M generalization ability; then you essentially need to remember M points, and you need to remember the points you have seen so far. >>: Where is the functional gradient, which is [inaudible]? >> Le Song: It's more like... >>: Where do you use [inaudible]? >> Le Song: Essentially I can plug this into the [indiscernible] equation and derive this result. For instance, you plug this f into this R, then add epsilon g, look at the difference, divide it by epsilon, and you will get this. You can think about it just as a vector; it's a vector. >>: Where is g here? So for this one... >> Le Song: G is just here, but your directional derivative is just the part in front of g; it doesn't involve the g. >>: So for here that's the part... >> Le Song: Here I just gave you the final result; I didn't go through that step. I just gave the final result of taking this directional derivative. >>: [inaudible]? >> Le Song: Some g [indiscernible], but it doesn't matter; usually for this simple function the direction doesn't matter. >>: I'm just trying to understand [inaudible] the next slide. So you're saying I need to choose T to be on the order of M for the generalization I want, but then the number of points that I need to remember is just the support; it's not going to be everything. >> Le Song: For instance, for [indiscernible]...
You might get a sparse solution where some of the alphas are zero, and then you can... >>: If you choose the kernel correctly most of them would be zero. >> Le Song: For support vector machines that might happen, and then hopefully the number of support vectors is orders of magnitude smaller than the data. But for kernel ridge regression the solution is dense, and for many of these loss functions the solution may be dense; and even for support vector machines there is generally no guarantee that the number of support vectors is orders of magnitude smaller than the number of training points. Especially in the highly nonlinear case you will find that you actually need lots of support vectors. That typically happens in practice. >>: The number of support vectors usually grows linearly. >> Le Song: If it grows linearly then it's the same order; maybe the constant is small, but it's still linear. So this is the key idea: how do we avoid keeping these [indiscernible], these training points? You make a second stochastic approximation. Your stochastic gradient already approximates the gradient using data points; now, knowing this duality between the kernel function and random features, I'm going to also sample some random features. In this particular case I just sample one, but you can sample a mini-batch as well. You sample these random features and approximate the kernel function by this product of random functions. The advantage of doing this is that after you get this doubly stochastic gradient, you move your function along it, and you will find that the final function is a weighted combination of these random functions. You know exactly what the form of this function is, so you only need to plug in your test x, evaluate the random functions at that test x, and do a weighted combination; that gives you your prediction. You don't need to remember all the training points anymore. The reason is that you only need to fold the evaluation of the random function at the training point into the weight alpha. The training point may be 1 million dimensional, but you just need to evaluate it on this random feature, which gives a single number, and that gets incorporated into a single number. So if you go through 1 million data points you just need to remember 1 million numbers, the alphas. And for the random function, you know its form and you know which distribution you are sampling from; typically we sample from some pseudorandom number generator, and if I know the seed I can re-instantiate the random number. I don't need to actually store this omega I drew from the distribution; I just remember the seed, and next time I need that particular random feature I redraw it using the same seed. So the algorithm is actually very simple: it just keeps drawing some data points, drawing some random features, and updating this alpha. The algorithm can be summarized in one slide. In the end the function is a weighted combination of the random functions you have drawn. Essentially you sample some data points, you sample some random features using a particular seed corresponding to the iteration, and then you can re-instantiate those random features very easily at test time. At the current iteration you already have a bunch of alphas and a bunch of random features.
If you want to evaluate this function f at a new test point, you need to pass the test point through the alphas you have: the evaluation re-instantiates each random feature, applies it to the new test point, weights it by the alpha you learned before, and accumulates. And because of the [indiscernible] regularization parameter you see some modification of all the earlier alpha_j, for j less than or equal to i: you forget a little bit about the alphas you had before. For the current alpha you are going to add, you just use this doubly stochastic gradient. >>: So that reminds me of a summation generation process in signal processing [inaudible]. >> Le Song: Here it's more naive [indiscernible]. You're not doing anything smart in terms of sampling the random features; you're just drawing them from some random distribution. If you chose the omegas more smartly maybe you would get better convergence, but I don't know how to do that. >>: But then if [inaudible] alpha_i is very [inaudible] zero, meaning that the random direction is orthogonal to the target function, then maybe you can ignore that i. Is that correct? >> Le Song: It's possible. We tried to explore the simplest version; this way we don't do any bookkeeping and just keep averaging all these random functions. But you can think about extensions that draw these random features more intelligently and do more intelligent bookkeeping, and potentially you can still prove convergence of such an algorithm. >>: I'm sorry, I'm not following, but are you describing to us the original technique of the random [inaudible] or is this a new thing? >> Le Song: This is the new thing. >>: It doesn't make sense, because to me it's exactly what they said. >> Le Song: What they did is they first generated the random features and then they optimized in the finite-dimensional space. >>: You can do that optimization stochastically, which is what people will do in practice. Is that what you're saying? >> Le Song: People don't do that in practice. They generate maybe just 1000 random features, so you're optimizing in just a 1000-dimensional space, and that's how people make these kernel methods scalable. Then if they want to try more random features, they regenerate another maybe 10,000 features and re-optimize. This one essentially generates the random features on the fly. >>: So you're saying yours is more flexible? >> Le Song: In some sense, very flexible. >>: How do you make sure you get the same omega if you generate [inaudible] on the fly... >> Le Song: The way you generate the sample in a program is with a pseudorandom number generator, and you can always supply the seed to that generator. That's why I keep track of the seed, to make sure that every time I sample exactly the same one. The question is, if you do something like this, is it going to converge? There is analysis for stochastic gradient descent; now you have added this second randomness to your stochastic gradient, and the question is whether it converges, and whether it converges at the same rate or not. We have some analysis of this, and essentially if you do this doubly stochastic gradient you will find that the rate of convergence is the same under certain conditions. And you don't actually need to remember these random things; what you actually remember in the algorithm is just the alphas. If you have 1 million data points you just remember 1 million numbers.
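To make the loop concrete, here is a minimal sketch of the doubly stochastic update for squared loss with a Gaussian kernel via random Fourier features. This is my own toy code following the description above, not the authors' implementation; the seed trick re-instantiates each feature on demand, and only the alphas are stored.

```python
import numpy as np

def phi(x, i, d, sigma=1.0):
    """Random feature for iteration i, re-instantiated from seed i:
    sqrt(2) * cos(omega^T x + tau), omega ~ N(0, I / sigma^2), tau ~ U[0, 2*pi]."""
    feat_rng = np.random.default_rng(i)         # the seed trick: seed = iteration index
    omega = feat_rng.normal(scale=1.0 / sigma, size=d)
    tau = feat_rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(x @ omega + tau)

def f_eval(x, alpha, d):
    """Evaluate f(x) = sum_j alpha_j * phi_j(x), redrawing each feature from its seed."""
    return sum(a * phi(x, j, d) for j, a in enumerate(alpha))

def train_dsg(X, y, T=200, gamma0=0.5, nu=1e-4):
    """Doubly stochastic gradient for least-squares loss: one random data point and
    one random feature per iteration; old alphas shrink because of the regularizer."""
    M, d = X.shape
    data_rng = np.random.default_rng(12345)
    alpha = []
    for i in range(T):
        t = data_rng.integers(M)                         # random data point
        gamma = gamma0 / (1.0 + i)                       # decaying step size
        grad = f_eval(X[t], alpha, d) - y[t]             # l'(f(x), y) for squared loss
        alpha = [(1.0 - gamma * nu) * a for a in alpha]  # forget a little (regularizer)
        alpha.append(-gamma * grad * phi(X[t], i, d))    # weight for feature i
    return alpha

# Toy regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sin(X @ np.ones(5)) + 0.1 * rng.normal(size=500)
alpha = train_dsg(X, y)
print(f_eval(X[0], alpha, X.shape[1]), y[0])    # prediction at a training point vs. its label
```

In a real implementation one would use mini-batches of data and of features and store the seeds explicitly; the point here is only the structure of the update.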
That's it. >>: So I get confused. You can apply this technique to a linear kernel, right? So if I apply this technique to a linear kernel [inaudible] features. So the process is [inaudible]. In that case maybe it would be zero, because if I have sparse features [inaudible] zero, then the chance I get an update would be very low, right? >> Le Song: There's no advantage in doing this for linear features. But the analogue of this in the linear case is exactly that you sample the dimensions uniformly at random, because for a linear kernel you just do an inner product: it's a sum over the product of one dimension, then the second dimension, and so on [inaudible]. >>: I think the problem in the linear case is that you don't need this [inaudible] trick, because to update any feature, the partial derivative for that feature requires evaluating the [inaudible] anyway. [inaudible] every feature is correct. >>: [inaudible] better technique [inaudible]. In some cases maybe you need to sample a lot of times [inaudible]. >> Le Song: For a linear kernel I don't suggest using this approach. This is for the nonlinear case. >>: So at each iteration you pick a new feature and you take a single data point. So if you keep generating new features, basically each feature only observes certain points. >> Le Song: You can make these separate: you can take a mini-batch of data points, and that mini-batch size can be different from the mini-batch size of random features. >>: You just mentioned that you don't recommend this for linear and use it for nonlinear, but [inaudible] not really that different, so for your original data you can just do, like, polynomial [inaudible]. >> Le Song: Exactly. A polynomial [indiscernible] is already nonlinear. Actually a random feature approximation for the polynomial kernel appeared recently [inaudible]: you do some kind of random hashing, and here you can think of this omega as performing some random hashing of your data. Again, to get the generalization bound the number of random hashes has to match the number of data points you see; otherwise you lose some performance, without any assumption on the distribution of the data. So the algorithm is just that simple. I mentioned the random feature evaluation: you get log d instead of d because, for instance for a translation-invariant kernel, the random feature essentially draws some omega from a Gaussian distribution and takes an inner product between omega and x. If you have a bunch of random features, you have many of these omegas drawn from the distribution, and to evaluate them you put the omegas into a matrix: you essentially want to perform the product between this matrix W and each of your data points. So W is a random matrix whose columns are random numbers drawn from some distribution, for instance a Gaussian, and you want to perform this product. There is a way to perform this product efficiently, a technique called Fastfood. You can approximate this matrix W by a product of several highly structured matrices, H G Pi H B. The B matrix is a diagonal matrix whose diagonal entries are uniformly distributed over minus 1 and 1, with equal probability for minus 1 and 1. And then H is the Hadamard matrix; it's very structured.
If your d is 2 it looks like this, and for 4 it's like this, so you don't actually need to store this matrix; the nice thing about it is that it allows very fast matrix-vector products. Pi is a random permutation matrix, and G is just a diagonal of Gaussians; and that is the Hadamard matrix again. So you don't actually need to draw d times T random numbers from the distribution. What you can do is just draw this diagonal, say for T equal to d, a diagonal of Gaussians, and use these very structured matrices to mimic a huge random Gaussian matrix. These matrices allow fast matrix-vector multiplication, and you can even provably guarantee that after you make this approximation in the matrix-vector product you get pretty much the same results. This is a result from random matrix theory, and you can use it here to speed up the evaluation of the random features on the data points. So this is just a speedup you can do. >>: So this is only for the [inaudible]? >> Le Song: This is for translation-invariant kernels; for some other kernels, for instance rotation-invariant kernels, you can derive something similar. It's not for every kernel: it has to be a kernel where you draw this omega from some distribution, maybe a Gaussian, multiply by the input data point, and apply some nonlinear function afterwards; then you have these types of results. And regarding convergence, we also provide some analysis of the algorithm. Essentially what we are trying to find is some function in the reproducing kernel Hilbert space. Without making the doubly stochastic approximation, the function we get would be something like this, a weighted sum of kernels applied to the training points; but we make the approximation and then we get something like this. Suppose you use a translation-invariant kernel; then you get a weighted sum of sine and cosine functions, and these functions themselves may not actually be in the RKHS. You are using functions in another space to approximate a function in the RKHS, and you try to show convergence. Here we show that if you take the function f_{t+1}, the function obtained at iteration t+1, evaluate it at some point x, and compare it to f_star, the optimal function in the [indiscernible], evaluated at x, the difference is going to be small. You can decompose this difference into two terms by introducing the singly stochastic kernel machine, the one that only uses the stochastic gradient coming from the data. So we can decompose the error into two terms; this decomposition holds only in expectation, by the way. One source of error is due to the random features, and the second source of error is due to the random data. For the analysis we handle these two terms separately. For the second term you can just use the standard analysis of mirror descent, generalized to functions in the RKHS, and you get a 1 over t kind of convergence. For the first term we did something special to this problem: we basically used a concentration inequality for martingale differences.
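The decomposition being described, written out roughly (my paraphrase of the analysis; h_{t+1} denotes the iterate that uses the true kernel and is stochastic only in the data):

```latex
% Pointwise error split into a random-feature part and a random-data part (in expectation):
\mathbb{E}\big[\, |f_{t+1}(x) - f_{\star}(x)|^{2} \,\big]
\;\le\;
2\,\underbrace{\mathbb{E}\big[\, |f_{t+1}(x) - h_{t+1}(x)|^{2} \,\big]}_{\text{error due to random features}}
\;+\;
2\,\underbrace{\mathbb{E}\big[\, |h_{t+1}(x) - f_{\star}(x)|^{2} \,\big]}_{\text{error due to random data}}

% First term: concentration for martingale differences over the sampled features.
% Second term: mirror-descent analysis in the RKHS.
% Both are O(1/t), giving an overall generalization rate of O(1/\sqrt{t}).
```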
Essentially, if you compare this function in the RKHS H with this function outside the RKHS, the difference is a sum of a bunch of [indiscernible] terms, one from the function obtained at each iteration. This sequence forms a martingale difference sequence [indiscernible], and you just apply a concentration inequality. You also get a 1 over t rate, so together you get this kind of rate: 1 over t for the squared difference of the functions, and then you also get a generalization bound which is 1 over the square root of t. So that's how the analysis works, and essentially you get the best possible generalization ability for this nonparametric estimation problem. The algorithm is very simple; the question is whether it works or not. So we tried it this way... >>: Could you go back one minute? So h is what you would get... >> Le Song: Without applying the random features. >>: That is what the [inaudible] would do? >> Le Song: Actually it's not. It's this one, where you don't apply any approximation to the kernel. >>: [inaudible] the actual kernel itself? >> Le Song: Yeah, the actual kernel itself. [indiscernible] applies stochastic gradients using random data and also random features simultaneously; you never revisit those random features, you just keep generating new ones. >>: So the martingale is just going one by one? >> Le Song: Yeah. The difference is [indiscernible] zero if you look at the difference between the two. So V_i is the difference: using one random feature you are trying to approximate this kernel, and conditioning on the previous randomness, the expectation of this one compared to that one is zero, because this random feature approximation equals the kernel function in expectation. So conditioned on the previous randomness, because we have the previous function here, this expectation is zero, and if you look at the overall difference it is the sum of a bunch of these. >>: So why is everything squared there? Why is the difference between f_{t+1}... >> Le Song: It's the squared difference, and then we take the expectation of it; this decomposition is in terms of expectation. We just look at the squared difference between the function values because it is easier to analyze. You could look at the absolute difference as well, but we use the squared difference. And then the second part is just mirror descent. And then, if you have some property of the loss function, you get this kind of rate for the risk, the expected loss. >>: And what is this [inaudible]? What is that constant [inaudible] up on the top? >> Le Song: The one on top is primarily related to the kernel function; it's like an upper bound on the kernel function. For instance, if you have an RBF kernel, then there we are using the reproducing property: you can express the evaluation of a function in the RKHS as an inner product between h and k(x, .), and then you use [indiscernible]. That's how you get the [indiscernible]. So we applied this algorithm to the [indiscernible] training data set. You need to go to maybe 10 million or even more data points in order to get state-of-the-art performance. We compared three models. One model, which we call the jointly trained neural net, is the original neural net: we use stochastic gradient descent and manually adjust the learning rate three times to get the best performance. Then we have a second model called the fixed neural net. It's basically a [indiscernible] check.
So we take the convolution layers learned by the first model and fix them, and we just retrain the fully connected nonlinear layers and the classifier. In some sense this model is the closest to our model: we also reuse exactly the convolution layers learned by the neural net, but we replace the top part by the kernel machine. It seems that this convolution and pooling is really doing something amazing, so we keep that and replace the rest by the doubly stochastic kernel machine. >>: [inaudible] use a joint... >> Le Song: It's not a convex problem anymore if you want to also optimize... >>: It's not like the first one is a convex problem either. >> Le Song: That's right. Actually I will show you that result, but at the moment I'm just focusing on replacing this part by the kernel machine, for which we have a very nice guarantee, and seeing whether it performs well in practice or not. This is the figure you saw before for training the deep neural net. If you train the fixed neural net, where the convolution layers are already learned and you just optimize the fully connected layers on top, somehow you don't get the best performance compared to... >>: Using all the same tricks of [inaudible]? >> Le Song: This one is not; that's right, this one has been adjusted. Somehow at this point, even when you adjust it, it's not decreasing anymore. There seems to be some co-adaptation between the convolution layers and the fully connected layers: when you train jointly you get a better model, but if you fix one part and train the other it seems you don't get as good a result, even though you are still optimizing some kind of nonconvex objective [indiscernible]. We don't understand why this is the case. But if you use the doubly stochastic kernel machine, that's what you get: you have faster convergence, and you just use 1 over t type learning rates. >>: Do you also have this graph in terms of time? >> Le Song: This is one week, and the time is comparable to the neural nets. >>: So the iteration time is comparable. >> Le Song: We also implemented our algorithm on GPUs; it's comparable. Actually most of the time is spent on loading the data and performing the transformations in the convolution layers. The convolution layers only have 3.7 million parameters but actually perform most of the computation; the [indiscernible]-parameter part later takes only a fraction of the time. >>: But you only have to go through it once? >> Le Song: No, not once. We can also go through a data point many times, but each time you don't see exactly the same sample: you make a random crop of the image, randomly mirror it, do some transformations. In some sense you never see the same data point. >>: So what is the red line [inaudible]? >> Le Song: It's not improving anymore. >>: How do you know? Did you go to the... >> Le Song: We ran it somewhat longer and it doesn't. >>: So it stops at 10 times less data than the green line. >> Le Song: Yeah. >>: So if the green line takes a week and you are at the same speed... >> Le Song: At this point it's a week; the student was not patient enough to wait longer. >>: And you're using the result of the green line to generate the convolution features. So how long does just your training take? >> Le Song: A few days. It takes a few days as well; I don't remember the exact time, but it takes a few days.
Most of the time is actually spent loading the data and doing all the convolutions. Because our method is something new, the student was not sure whether it would work or not; he tried many different things, for instance constant learning rates, and he didn't run each of those experiments for a full week. >>: What's the dimension of the inputs [inaudible]? >> Le Song: The dimension of the input to this layer is about 6000. >>: [inaudible] then maybe if you just used traditional kernel learning to train it, it could also be done in several days. >> Le Song: Possibly. Hopefully someone can try it; that would mean we really haven't pushed kernel methods hard enough. Actually, in this case we need 10 million or more data points [indiscernible]; we need to go to that kind of scale in order to get state-of-the-art performance. >>: [inaudible]? >> Le Song: The original data set is 1.3 million images, but as I mentioned you generate these virtual images, right? You take the original image, crop it, flip it, do some transformation, and you generate slight variations of the original image. >>: With so many samples you cannot do the traditional... >> Le Song: It's harder. Maybe you can wait longer or use a larger machine, but on a typical desktop it's difficult. We also tried some other data sets, and the story is pretty much the same. For [indiscernible] you also generate virtual images; that's how the data can go up to 10 million points. The kernel machine converges much faster, and you get about the same performance in this case. The data is simple, only 10 classes, so we actually get the same performance as the neural net. You can also look at even easier data sets. >>: So here you get a different picture from before, where the [inaudible] neural net actually performed better than the [inaudible]. Could it be that... is the difference statistically significant? Can you analyze something like that? >> Le Song: It takes quite a long time. We just look at this test error; if we ran it several times we could probably average the curves, and it's maybe not significantly different. The data set is simpler, and what I can say is that they get pretty much the same performance. In this case I believe we didn't adjust the learning rates. >>: What is the loss function that you use with the doubly stochastic gradient? >> Le Song: It's multiclass logistic regression, the same as for the neural net. And [indiscernible] is even simpler: you can also make some small transformations of the images and generate up to a million data points. In this case our method gets a very small error, below one percent, and it is for instance slightly faster, but I cannot say whether the difference is statistically significant or not. This one is faster because the images are smaller and the neural net is actually only two layers; it's not the [indiscernible]-layer model used for ImageNet. For the simpler data sets you use fewer layers, two or three, the images are also much smaller, and this one takes [indiscernible] a day, very fast. You get a similar type of result. We also tried a regression problem, not just classification: in this case we take the three-dimensional structure of a molecule and try to predict some property of that molecule.
For instance, how efficiently does the molecule convert sunlight to energy, things like this. In order to do that you first have to represent this three-dimensional structure with some feature representation. People have already come up with a permutation-invariant representation, because in this case you have three-dimensional coordinates: if you rotate the molecule you get different coordinates, but it's the same molecule, so you need some way to represent it. They came up with something called the Coulomb matrix, looking at the pairwise charges divided by the distances, and you use this matrix, suitably permuted, as your data point. It's a regression problem; the output is in the range 0 to 12. You don't use convolution and pooling anymore: in this case it's really just a fully connected neural net, three layers, and we just do this kind of kernel [indiscernible] regression. So this is just comparing the fully connected network to the kernel machine. In this case we go through 2.3 million data points and look at the absolute error on a test set; you see fewer points here because we have fewer points for the evaluation, and the [indiscernible] of the method looks smaller. >>: So have you tried to [inaudible] method to the traditional... >> Le Song: In the paper, yes, we have. We compared with these coordinate-ascent-type methods on data sets where we can actually run the competitors. In those cases what we show is that the convergence rate is about the same: we are not clearly better in terms of rate of convergence or final classification accuracy; it's just that our method is more scalable, which allows us to try this type of data set. So what we find is that the fully connected layers in the neural net are really not essential: you can replace them by another kernel method, a nonparametric method, and get about the same performance. What seems to be really useful is the convolution and pooling layers. So what we then tried is to learn the two things together: learn the kernel classifier and at the same time adapt the convolution layers, just by [indiscernible] derivatives. And you can also get close to the performance of the neural net. We tried this just recently and got 46; for the neural net you get 42. There is still a gap, but this is much better than running [indiscernible]; running [indiscernible] gives 99 point something percent error. >>: So is this better than doing it separately [inaudible]? >> Le Song: It's about the same. In the separate case the filters have already been learned and fixed; that's a much easier problem, you just learn a classifier. Here we [indiscernible] learn the filters as well, and we don't have any guarantee for this, but somehow it works. You don't need these fully connected [indiscernible] linear units: you put a kernel machine here, you adapt this, and it somehow works. We didn't tune it extensively, so we could possibly get even better results. >>: So here you still have the convolution layers, you just learn them, you do the [inaudible] from the kernel method. >> Le Song: Right. >>: So with a fixed convolution [inaudible]? >> Le Song: That's around 44. We only tried this in the last few months and that's the result; it can improve. Essentially, with this I'd like to summarize a little bit.
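As an aside, here is a minimal sketch of the Coulomb-matrix descriptor mentioned above, following the construction that is standard in that literature (the charges and coordinates are made up, and the diagonal term is the conventional self-interaction choice, which the talk does not spell out):

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb-matrix descriptor: off-diagonal entries are pairwise charge products
    divided by interatomic distance; the diagonal uses the conventional 0.5 * Z^2.4 term."""
    n = len(charges)
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * charges[i] ** 2.4
            else:
                C[i, j] = charges[i] * charges[j] / np.linalg.norm(coords[i] - coords[j])
    return C

# Toy "molecule": four atoms with hypothetical charges and 3-D coordinates.
Z = np.array([6.0, 1.0, 1.0, 8.0])
R = np.random.default_rng(0).normal(size=(4, 3))
x = coulomb_matrix(Z, R).flatten()   # flattened (after a canonical ordering) as the regression input
```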
The method we use to scale up the kernel machine really fits on one slide: instead of using only a random batch of training data, you also use a random batch of random features simultaneously to approximate the gradient, and you use this doubly stochastic gradient to update your function. In the end your function is a weighted sum over this randomness. That's the same representation as random kitchen sinks, but the way we use the random features is different: we generate them online and never revisit them. In practice you can revisit them; you can interpolate between the two. The advantage of this approach is that you can use it to handle streaming data: if your data just keeps streaming in, you can keep adding random features to accommodate the increasing complexity of the data. Of course there is also an interesting problem: if there are some data or classes you will never see again in the future, you want to forget about them. How do you do that nicely, so that you keep the overall [indiscernible] manageable? It can be applied to these convex objective functions, but we also recently tried it for kernel PCA, Principal Component Analysis. There you solve a kind of nonconvex problem, you try to maximize a convex function instead of minimizing it; you can also use stochastic gradients to do that, and nothing prevents you from using the doubly stochastic gradient. You can actually also provide a convergence guarantee if your initialization is close enough to the subspace of [indiscernible]. For Gaussian processes you want to estimate a predictive mean and covariance, and that can be cast as a convex optimization problem, so you can also use this approach. Memory-wise we just need to remember the alphas, one per data point, so it's O(T) if you run T iterations. In terms of estimating the function we get this rate, and in terms of generalization ability we get 1 over the square root of T; that's the best possible for this kind of nonparametric method without [indiscernible] assumptions. In practice it also works nicely, which is the most important thing in some sense. There are many other questions that haven't been resolved. At the moment I'm also looking at the two questions at the top: what are the max pooling and convolution actually doing, and why can't we use other architectures and get about the same performance? Some people have gone to the extreme of having not just five layers of this; you might have 15 or maybe 20 layers with very small convolution kernels, and you get an even better model. It's amazing that you can do this and get a model approaching human judgment. And if this architecture works, there must be some characteristic of the data which is suitable for that type of operation. If you apply this architecture to another domain, maybe language, it's not going to work, so there must be some characteristic of the data. What kind of characteristic makes this architecture work? It's not clear. We are studying this; hopefully we can get some results in the future, and if I come back I will give that talk. That's everything. Any further questions? >>: From the previous slide, [inaudible] function [inaudible] perspective you get the same [inaudible] as the singly [inaudible]. So the question is, [inaudible] do you observe more variance? >> Le Song: Yes, you have more variance. Essentially the variances add up in the bound.
>>: And also, even from the theoretical result, you get the same rate in terms of T, but the constant is much larger. >> Le Song: Yeah, it's going to be larger. You have two sources of variance: one is from the data and another is from the random features. But it is convex optimization, so in practice you try to grab as large a mini-batch of data as possible and as large a batch of random features as possible, as much as your memory can hold, and do that. You don't actually use just one data point. It's interesting that this is different for neural nets: somehow, in order to make neural nets work, you have to use a mini-batch that is small enough, not too small and not too large, a few hundred points or something, to see the best convergence. That's a nonconvex problem, and somehow there's a best amount of stochasticity that gives a better result. But here it's convex optimization, and you really want as little randomness as possible, so we grab a large mini-batch. Any other questions? I will be here for a few days; if you are free I would love to talk with you more.