>> Lin Xiao: Today we are very happy to have Yuchen Zhang give a talk. He's a Ph.D. candidate at UC Berkeley in Michael Jordan and Martin Wainwright's group. He has made a number of very impressive contributions to different topics in machine learning, optimization, and statistics, and he is also not shy about coding and software implementations, as you will see. Most recently he has been working on learning neural networks, and that will be the focus of his talk.

>> Yuchen Zhang: Thanks, Lin, very much for the introduction. Let me see if the microphone works.

>>: You have one on your shirt.

>> Yuchen Zhang: It's fine. Okay. Great. So it's my great pleasure to give a talk here at Microsoft Research Redmond. In this talk I will mainly talk about some provable algorithms for learning neural networks. And since this is supposed to be a [inaudible] talk, I will first spend a little moment to introduce myself. I'm a fifth-year Ph.D. student at UC Berkeley working with Michael Jordan and Martin Wainwright, and my general research interest is to develop machine learning algorithms for building artificial intelligence. This is a broad area, so my interests are also a little elastic. From my personal perspective, there are three key challenges in machine learning research. The first is how to represent AI in a rigorous mathematical form; that is the modeling part. The second is, given a concrete model, how to learn the parameters of the model using efficient machine learning algorithms, which can be challenging if the model is very complicated or the corresponding loss function is non-convex. And the third is that learning a complex model requires a huge amount of data, and data at that scale usually cannot fit on a single machine but has to be stored and processed in a distributed system. So the question is how to perform efficient machine learning on a distributed system. My research tries to answer these three questions from different angles. Before going into the main part of this talk about learning neural networks, I would like to spend a few minutes quickly going through my earlier work and its connections to the above challenges, to give you a more concrete idea of my research background.

Okay. Now, let's start from a specific family of algorithms that I find interesting, the divide-and-conquer algorithms. Such an algorithm divides a large-scale problem into smaller subproblems and combines the solutions of the subproblems into a single solution for the original problem. This computation scheme is very suitable for distributed computing, and indeed we have proposed some efficient divide-and-conquer algorithms for learning parametric models. These algorithms are computationally efficient in the sense that every machine only has to solve a small-scale problem after the divide step. They are also communication efficient in the sense that the computations on the separate machines are mutually independent and only one round of communication is required at the end. We also theoretically guarantee that the algorithm has optimal performance: if the number of machines is not too large compared to the overall number of samples, then the optimal statistical accuracy is guaranteed. All of these theoretical statements were verified by real-data experiments.
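To make the scheme concrete, here is a minimal sketch of the one-round divide-and-conquer estimator just described (split the data, solve each subproblem independently, combine by averaging); the local solver and the synthetic data are illustrative placeholders, not the exact algorithm from the papers.

```python
import numpy as np

def divide_and_conquer_estimate(X, y, num_machines, local_solver):
    """One-round divide-and-conquer: split the data, solve each subproblem
    independently (no communication during the solve), then average the
    local solutions in a single round of communication."""
    n = X.shape[0]
    parts = np.array_split(np.arange(n), num_machines)
    local_estimates = [local_solver(X[idx], y[idx]) for idx in parts]
    return np.mean(local_estimates, axis=0)

# Illustrative local solver: ordinary least squares on the local shard.
def least_squares(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10000)
w_hat = divide_and_conquer_estimate(X, y, num_machines=8, local_solver=least_squares)
```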
We also found that the idea of divide-and-conquer can be applied to learning nonparametric models. For kernel ridge regression, for example, the classical way of learning requires a running time of N cubed, where N is the overall number of samples in the dataset. We proposed a more efficient divide-and-conquer based algorithm whose running time is reduced from N cubed to almost linear in N, and the optimal statistical accuracy is still guaranteed. This is the very first efficient algorithm for kernel ridge regression that guarantees the optimal rate. I'm also interested in classical convex optimization algorithms, as convex models are widely used in practice. A key challenge in convex optimization is how to efficiently minimize a convex loss function with a very large condition number. The state-of-the-art empirical risk minimization algorithms such as SAG, SDCA, or SVRG may need to take 1 plus the condition number divided by the number of samples passes over the dataset in order to achieve high accuracy. We proposed a more efficient algorithm called SPDC which improves this iteration complexity from 1 plus kappa over N to 1 plus the square root of kappa over N. So it could be orders of magnitude faster for problems with a very large condition number kappa. It turns out that the efficiency of distributed optimization algorithms also suffers from the high condition number problem. For example, popular distributed algorithms such as ADMM or L-BFGS have convergence rates depending on the condition number, which in turn depends on the overall number of samples in the dataset. That means that when processing a very large dataset, the algorithm will converge slowly. We proposed a more efficient algorithm called DiSCO which converges [inaudible] faster than these popular algorithms, and we can show theoretically that its convergence rate is independent of the sample size, so it is suitable for processing very large datasets. All of the theoretical statements for SPDC and DiSCO were verified by real-data experiments. When people try to solve complicated problems, they tend to build more complicated models, such as non-convex models. A very important family of non-convex models is neural networks, and this will be the focus of this talk; I will defer the discussion of provable algorithms for learning neural networks to the main part of the talk. Besides neural networks, I have also been working on some other non-convex problems, including the crowdsourcing problem. The goal of crowdsourcing is to label a dataset with the help of Internet users, but since these labels are noisy, we want to infer the true labels as well as estimate the quality of the users at the same time. This requires a non-convex model because the loss function may have many local minima. For this problem, we have an efficient algorithm based on the ideas of the spectral method and the EM algorithm, which guarantees the optimal performance. So this is a very nice example of a non-convex problem that can be solved by carefully designed algorithms. Some other questions that I have been asking are more theoretical, and one of them is: what is the fundamental tradeoff between communication cost and statistical accuracy for all possible machine learning algorithms on a distributed system?
This is an area that has not been fully explored, so there are some very fundamental, basic, open questions. One of them is: if you want to estimate the parameters of a probability distribution and you have a distributed dataset sampled i.i.d. from this distribution, what is the best possible statistical accuracy given a fixed communication budget? Our work has established this tradeoff for some very fundamental problems, including estimating the mean in Gaussian estimation and estimating the coefficients for linear regression or probit regression. We have shown that this tradeoff is essentially tight, in the sense that the lower bound applies to all possible distributed algorithms and is realizable by practical algorithms. Another interesting set of problems is distributed linear algebra, for example computing the rank of an N x N matrix [inaudible], or the number of eigenvalues of this matrix above a specific threshold. For this problem, we have proposed a more efficient algorithm based on the idea of randomization which only communicates on the order of N bits. We have also theoretically characterized the communication and accuracy tradeoff for all possible algorithms for this kind of problem, and interestingly enough, we find that this tradeoff dramatically differs between deterministic algorithms and randomized algorithms. So this is some connection between communication and accuracy. Another interesting interface is between computation and accuracy. In statistical learning theory, we know that the optimal accuracy is usually characterized by the so-called minimax error rate. For some important problems, we know there are exponential-time algorithms that achieve this minimax rate, but there is no known polynomial-time algorithm achieving the same rate. We studied an interesting and widely used problem called sparse linear regression and showed that it is necessary to rethink the notion of statistical accuracy under a certain computational constraint. More specifically, we showed that there is an arbitrarily large gap between the performance of the best exponential-time algorithm and all possible polynomial-time algorithms. This shows that the minimax rate cannot be achieved by any polynomial-time algorithm, and indeed we have to redefine the notion of statistical optimality for the classical polynomial-time algorithms. We have shown similar gaps for improper learning as well. I'm also interested in building machine learning systems, and one example is Splash, a framework that I designed and implemented for parallelizing stochastic algorithms. Splash is a general-purpose programming interface that allows users to develop stochastic algorithms, such as stochastic gradient descent or Gibbs sampling, without knowing any details about distributed systems. On the other hand, it is also an execution engine that can automatically parallelize the algorithm the user implements on a distributed system. We have built Splash on top of Apache Spark, and it is integrated with the Spark ecosystem, so existing Spark users can very easily use Splash. We have verified by experiments on large-scale optimization and machine learning problems that Splash is able to achieve an order of magnitude speedup over the official machine learning package of Spark.
This is partly because of the advantage of stochastic algorithms over the traditional batch algorithms of machine learning and partly because of the efficient design of Splash. This is an open source project available online, and through this URL you can find some guidelines for programming and installation and some examples that you can try on your laptop or on any cluster. I have also worked to apply machine learning techniques to real-world problems, and one example is click modeling. The goal is to model users' behavior in Web search and analyze users' feedback to the search engine, to improve the search engine's ranking function and online advertising algorithms. This line of work was done when I was an intern at Microsoft Research Asia and was eventually shipped into the product, improving Bing's NDCG by more than 0.8 percent. I've also been working on a recommender system project, where the goal is to learn a nonparametric model to solve the data sparsity problem for online shopping recommendation. This work has dramatically improved the quality of recommending long-tail items in online recommendation. So roughly these are the works that I have done in the past. If you are interested in any one of them, you're more than welcome to come talk to me after this presentation for more details. Okay.

>>: [inaudible].

>> Yuchen Zhang: Yeah, so it was an internship project at Google. So let's go back to the main topic of this talk, provable algorithms for learning neural networks. This is joint work with my two advisors as well as my collaborator Jason Lee, who is now a postdoc at UC Berkeley. Right. So in recent years, we have all witnessed the great success of neural networks in many applications of artificial intelligence, including vision, speech, NLP, reinforcement learning, and many others. Compared with classical linear models, a neural network is able to encode nonlinear functions, which makes the model more powerful. Another benefit of using neural networks is that they allow researchers to incorporate their domain knowledge into the design of the architecture of the model, and it has been shown in practice that a specifically designed architecture can dramatically outperform a generic architecture. Successful examples, as we know, include the convolutional neural net and the recurrent neural net. Despite the diversity in model architectures, the algorithms for learning the model parameters are relatively uniform: we formalize the learning problem as an optimization problem by constructing a loss function and then run some optimization algorithm to minimize the loss function. So let's consider a simple neural network which contains only one neuron. It takes a feature vector x as input, applies a linear transformation, and then applies the sigmoid function to define the output. If we want to solve a regression problem, we can write the loss function as the empirical least-squares loss. And since the sigmoid is not a linear function, this loss function may not be convex, which means there could be multiple local minima in the loss function. If we run gradient descent to minimize this function, there is no guarantee that the algorithm will converge to the global minimum. The problem of local minima exists even if we only consider two points in one dimension; the loss is written out below.
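In symbols, the single-neuron loss just described (a linear map followed by the sigmoid, with a squared loss; the notation is mine) is

$$\widehat{L}(w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(\sigma\big(\langle w, x_i\rangle\big)-y_i\Big)^2, \qquad \sigma(t)=\frac{1}{1+e^{-t}},$$

which is non-convex in w because the sigmoid is nonlinear.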
For example, if you choose the feature/label pairs to take these two values and plot the landscape of the loss function, you will immediately see two local minima, and the second local minimum is substantially higher than the first one. So if you run gradient descent initialized in this region, it is likely that the algorithm will converge to the second minimum, which is obviously suboptimal. The classical way of dealing with the local minimum problem is to do multiple rounds of random initialization followed by gradient descent: we do this multiple times and choose the solution that achieves the smallest loss function value. This general approach can be combined with many heuristics which improve the performance in practice. For example, we can use mini-batch training to improve the efficiency of gradient descent, we can use the momentum method to get out of bad local minima, and we can use dropout to improve the robustness of the training procedure. It's been shown in practice that a proper combination of this general approach with these heuristics can achieve very decent performance on real-world problems. But does that mean this approach is the correct way to solve all neural network learning problems? The answer is negative, because there exist interesting problems, interesting neural networks, that are difficult to learn using this gradient descent and random initialization approach. To see this, let's look at a concrete example called learning parity functions, which is a famous and classical problem in learning theory. The setting is that we have a feature vector X and a label Y. The feature vector is uniformly sampled from the vertices of a 50-dimensional hypercube, and the true label is defined as the product of a subset of coordinates of the feature vector. The number of coordinates involved is called the degree of the parity function and is denoted by the letter p. The learner observes the feature vector as well as a noisy version of the label: with probability 0.9 it observes the true label, and with probability 0.1 it observes the negation of the true label. The goal is to train a classifier using multiple instances of this form such that, given a new feature vector, it can predict whether Y is equal to minus 1 or to 1. This image plots the true value of the parity function given two involved coordinates. If both coordinates are equal to minus 1 or both equal to 1, then the true value is plus 1; otherwise it is minus 1. It's easy to verify that this distribution of positive and negative samples cannot be separated by any linear classifier, but it can be separated by a two-layer neural net. So this is a classical example showing the limitation of linear classifiers and the power of neural nets. We want to train a two-layer neural net to solve this problem, and because the loss function is non-convex, we use multiple rounds of random initialization and back-propagation to train the neural network. Here is the result. If the degree of the parity function is equal to 2, the algorithm successfully learns the parity function using two hidden nodes in the only hidden layer and achieves the optimal classification error of 0.1.
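To make the parity setup concrete, here is a minimal data-generation sketch in Python; the function name and defaults are illustrative, not from the paper.

```python
import numpy as np

def noisy_parity_data(n, dim=50, degree=2, flip_prob=0.1, seed=0):
    """Sample features uniformly from {-1, +1}^dim; the clean label is the
    product of the first `degree` coordinates; each label is then flipped
    independently with probability `flip_prob`."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n, dim))
    y = np.prod(X[:, :degree], axis=1)       # parity of `degree` coordinates
    flips = rng.random(n) < flip_prob        # label noise
    y[flips] *= -1
    return X, y

X_train, y_train = noisy_parity_data(10000, degree=5)
# The best achievable classification error on this data equals flip_prob = 0.1.
```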
But if the degree of the parity function is equal to 5, you can see that no matter how many hidden nodes it uses, the algorithm fails to learn the parity function and its classification error is always around 0.5, which is the error of random guessing.

>>: [inaudible] try this problem using a rectifier [inaudible]?

>> Yuchen Zhang: So I have tried both sigmoid and rectifier.

>>: [inaudible].

>> Yuchen Zhang: Yeah, if I remember correctly I have tried the rectifier, and they have similar behavior. So it means that on this more complicated and more nonlinear function, back-propagation doesn't really outperform random guessing. Now, this observation motivates us to think about the following question. If we observe that a neural network fails on a particular task, we want to know the true reason for the failure. There are two possible reasons. The first is that the neural network architecture is not powerful enough, so that it cannot encode the function that we try to learn. The second possibility is that the learning algorithm is not good enough, so that even though a good neural network exists, it cannot be learned by the algorithm; from the optimization perspective, the algorithm is trapped in bad local minima. For this particular problem of learning parity functions, we know that a two-layer neural net is able to encode the parity function, so it must be because of a bad learning algorithm. But for more general and more complicated problems, it's really hard to distinguish these two reasons, so we don't really know what to improve when we observe bad performance. This motivates us to think about provable algorithms. If we know that the algorithm provably learns the model, then when we observe bad prediction performance, we must modify the model to reduce the prediction error; and if instead we observe a slow algorithm, we must modify the learning algorithm to improve the running time. In this way, the design of the model and the design of the algorithm are clearly separated, which is a good tradition in machine learning research.

>>: [inaudible] when you -- is this a theoretical result that you cannot reach that, or is it an empirical running [inaudible]?

>> Yuchen Zhang: Empirical result.

>>: But how do you know [inaudible] tricks?

>> Yuchen Zhang: So I'm sure that when you apply tricks, there is a pretty good chance that you can improve that. One point is that this is just an empirical illustration for the motivation of the work, and you're absolutely correct that there could be other possibilities. Yeah. So more precisely, there are three questions that we want to answer for provable algorithms. The first is: under which assumptions can we show that a neural network is efficiently learnable in polynomial time? The second is: if we weaken these assumptions a little bit, can we show that no efficient algorithm is possible under some classical assumptions? And third, we want to know whether this theoretical understanding can help us design a better algorithm that also works on real data. So I hope I have motivated that this is an interesting question. Here's the outline of the talk. First I will study a simple problem called learning a linear classifier with a non-convex loss, and we'll find that the idea for solving this problem is useful for the more complicated problem of learning neural networks. Then we'll proceed to proper learning of neural networks and show some algorithms along with their theoretical guarantees.
We'll also talk about improper learning, where the goal is to learn an arbitrary classifier which is not necessarily a neural network, but whose performance we want to be competitive with the best possible neural net. We're also going to show an algorithm and the corresponding theoretical result. Okay. So let's start from learning linear classifiers. To formalize the problem, assume that we are given a training set of n instances, where every instance takes this feature-value pair form: the feature is a d-dimensional vector and the value is either minus 1 or 1. So this is a simple binary classification problem. The goal is to learn a vector w such that the following loss function is minimized, where the function h is an arbitrary function that measures the classification loss on a single instance. Typically we would like to choose h to be the step function, so that this loss is exactly equal to the classification error. But unfortunately we know that even approximately minimizing this function is NP-hard. The hardness of this problem comes from the fact that the loss function is not continuous, so it is reasonable to use some continuous approximation to the step function and assume that h is L-Lipschitz continuous. But if L is very large, then the problem is still challenging, because it can be proved that minimizing this loss function in poly(L) time is still NP-hard. So this motivates us to assume that L is a constant that doesn't grow with the sample size or the input dimension d. Let's see some examples of loss functions that satisfy this condition. We can consider a piecewise linear function which is equal to 0 if x is below a negative threshold, equal to 1 if x is above a positive threshold, and linear in between. We could also consider smoother approximations such as the sigmoid function. For both functions, the Lipschitz continuity is characterized by this constant L. So given a loss function, what is the most straightforward way of minimizing it? We have mentioned that a very natural choice is to first do random initialization and then run gradient descent. For simplicity of the statement, let's assume that the minimizer of the loss function, w*, has unit norm, and that the data is normalized so that all of the feature vectors are contained in the unit ball. The first step of the algorithm uniformly samples a random vector from the unit sphere of the d-dimensional space, because we know that the correct solution lies on the surface of the unit sphere. Then we treat W0 as the initial point and run gradient descent or any optimization algorithm to improve the loss function value. Eventually we get a vector W whose loss function value is at least as good as that of the initial point. We do this multiple times and choose the vector that achieves the smallest loss function value as the final output. This is a widely used approach: the initialization step is very simple, and the optimization step is very flexible. But from a theoretical perspective it is problematic, because the loss function is non-convex, and the optimization step does not guarantee convergence to the global optimum w* unless the initial point W0 is very close to w*. But what is the probability that W0 is very close to w*? Since we are initializing in a high-dimensional space, this probability is extremely small.
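Here is a minimal sketch of this random-initialization-plus-descent scheme, written so that the sphere radius R is a parameter; R = 1 is the baseline just described, and the talk's fix of sampling from a larger radius R > 1 comes next. The surrogate loss, step size, and step count are illustrative choices, not the ones from the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def surrogate_loss(w, X, y):
    # Lipschitz surrogate for the 0-1 loss: h(z) = sigmoid(-z) applied to the margin y * <w, x>.
    return np.mean(sigmoid(-y * (X @ w)))

def random_init_descent(X, y, T, R=1.0, steps=100, lr=0.1, seed=0):
    """Repeat T times: sample w0 uniformly from the sphere of radius R,
    run a few gradient steps, and keep the best iterate found overall."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best_w, best_val = None, np.inf
    for _ in range(T):
        w = rng.normal(size=d)
        w *= R / np.linalg.norm(w)              # uniform direction, radius R
        for _ in range(steps):
            margins = y * (X @ w)
            grad = np.mean((-y * sigmoid(-margins) * sigmoid(margins))[:, None] * X, axis=0)
            w -= lr * grad
        val = surrogate_loss(w, X, y)
        if val < best_val:
            best_w, best_val = w, val
    return best_w, best_val
```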
And indeed it's easy to verify the cost of this scheme: if you want to guarantee with high probability that a good initialization appears at least once, then the number of iterations must scale as (1 over epsilon) to the power d minus 1. So this is exponentially dependent on d and too expensive unless d is very small. So the question we must answer is how to remove this exponential dependence on d. Surprisingly, there is a very simple solution. We only have to modify one place in the algorithm: instead of uniformly sampling from the unit sphere of the d-dimensional space, we sample from a larger sphere of radius R, where R is strictly greater than 1, and all the remaining parts of the algorithm are unchanged. With this simple modification there is an interesting guarantee: if you choose the hyper-parameter R following this formula, then the algorithm is guaranteed to be approximately optimal with high probability, and the iteration complexity is a polynomial function of n whose power only depends on epsilon. So if epsilon is a constant, this is a polynomial-time algorithm in both n and d. If you examine the theory more closely, it shows that the hyper-parameter R trades off the efficiency and accuracy of the algorithm. If you choose a greater value of R, you get a larger epsilon: the algorithm will be less accurate, but since the running time has an inverse dependence on epsilon, it will also run faster. So this is a very simple initialization scheme which has a theoretical guarantee, but it does have some limitations that we want to overcome. First, the output of the algorithm has radius up to R and is not necessarily contained in the unit ball, while we know the optimal solution is contained in the unit ball; so it is not exactly a proper learning algorithm and may cause some overfitting. Second, the constraints on w* and on all of the feature vectors are in terms of the Euclidean norm and cannot be generalized to other norms. But in practice, in many cases we want to impose other norms, like the L1-norm, to impose sparsity. This difficulty arises because the proof of the theorem uses the Johnson-Lindenstrauss lemma, and this lemma only preserves the Euclidean distance, not arbitrary Lp distances. And the third problem is that the iteration complexity is a polynomial function of n with a relatively large power. So --

>>: Sorry. If we go back to this result -- I mean, you're already rushing ahead to the next iteration. So here you're assuming that the function is defined over the entire space but the optimum is on the unit sphere.

>> Yuchen Zhang: Yes.

>>: And somehow that assumption allows you to prove your theorem, so somehow the basins of attraction around w* become bigger when you go farther away. Something is hiding here. I mean, you are making some assumption. It cannot be just an arbitrary objective function. I could choose any objective function; of course I can make the points that are far away have some [inaudible] structure of [inaudible], some very, very delicate structure of local minima where you will always get stuck. But somehow your assumptions are making the problem nicer. Can you shed some more light on what is the key here that makes going farther away --

>> Yuchen Zhang: Okay.
So one key assumption is that the objective function has to be continuous with a constant Lipschitz constant. Another key assumption is that the data should be i.i.d. The intuition behind this theorem is that if you consider some optimal solution w*, then w* affects the loss function only through the inner products between w* and the features. Right? So if you can preserve the inner products, then you can approximate the loss function and even approximate the optimal solution. Now, w* is in a high-dimensional space, and we construct a random subspace and project w* onto this lower-dimensional random subspace. Using the Johnson-Lindenstrauss lemma, you can show that after this projection, if you scale the vector by a factor of R, where R is greater than 1, then the inner products will be preserved. There is indeed a dependence on the number of samples, because as the number of samples increases, the dimension after the projection has to increase logarithmically as a function of n; but this is only a logarithmic dependence. So this factor R comes from the Johnson-Lindenstrauss lemma: after doing the projection you have to scale all of the vectors to preserve the distances and the inner products. And the reason we don't have to construct this random subspace in the algorithm is that you can think of two consecutive steps -- first construct a random subspace, then draw from the unit sphere of that low-dimensional space -- and combining both is equivalent to directly drawing from the original space but with a greater radius. So the projection step is hidden, but the effect of the projection is this greater radius.

>>: There's no assumption about the convexity of the loss function?

>> Yuchen Zhang: There is no assumption on convexity.

>>: Regarding that this algorithm is always going to [inaudible] optimum.

>> Yuchen Zhang: Actually, it doesn't -- so there's no notion of convergence, because the optimization step is arbitrary. In the simplest case you can directly output the initialization as your solution. But you repeat the first step multiple times, and there is a selection: you use the loss function to choose the best one. So this step is also very important.

>>: The loss function [inaudible].

>> Yuchen Zhang: Yeah, so there is no convergence, because it could be just random initialization and selecting the best sample. Okay. So we know this scheme has some limitations, and they can be addressed by a more sophisticated algorithm that constructs the initialization using a least-squares problem. For this more general algorithm we assume, more generally, that w* has Lp-norm bounded by 1 and all of the feature vectors have Lq-norm bounded by 1, where p and q satisfy the usual conjugate condition on the norms. The first step of the algorithm samples K i.i.d. instances from the training set and then samples a vector U uniformly from the K-dimensional hypercube. Then it solves a least-squares problem to compute a vector W0 in the original d-dimensional space. W0 is treated as an initial point, and then we run any optimization procedure to improve it. We repeat this procedure T times and choose the best one using the loss function to define the output.
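A minimal sketch of this least-squares initialization, under my reading of the description above (sample K points, draw u uniformly from [-1, 1]^K, then find a w0 whose inner products with the sampled points match u); the final projection onto the Euclidean unit ball is my own simplification of the talk's general norm constraint.

```python
import numpy as np

def least_squares_init(X, T=50, K=5, seed=0):
    """For each of T trials: draw K training points, draw u uniformly from
    [-1, 1]^K, and solve min_w sum_i (<w, x_i> - u_i)^2 to get a candidate
    initial point w0.  Returns all candidates; the caller runs a local
    optimizer from each and keeps the one with the smallest training loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    candidates = []
    for _ in range(T):
        idx = rng.choice(n, size=K, replace=True)   # K i.i.d. training instances
        u = rng.uniform(-1.0, 1.0, size=K)          # random target inner products
        w0, *_ = np.linalg.lstsq(X[idx], u, rcond=None)
        norm = np.linalg.norm(w0)
        if norm > 1.0:                              # keep the candidate in the unit ball
            w0 = w0 / norm                          # (a simplification of the general
        candidates.append(w0)                       #  norm constraint in the talk)
    return candidates
```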
So for this modified algorithm, there is a better theoretical guarantee. If you choose the hyper-parameter K properly, then the algorithm is guaranteed to be approximately optimal with high probability, and the iteration complexity is an exponential function of 1 over epsilon. Comparing this theoretical result with the previous one, you can see three improvements. First, the algorithm always returns a vector in the unit ball, so it is a proper learning algorithm. Second, it applies to general Lp-norms. And third, it has a better iteration complexity, because it replaces the dependence on n by a dependence on the universal constant e. The proof of this theorem is even simpler. We first define a sample-based empirical loss function G which only depends on the K random samples. Then, using classical learning theory, you can easily show that minimizing G will approximately minimize the original loss function L. On the other hand, it is easy to see that G(w), for any vector w, is uniquely determined by the vector of inner products phi(w), which is a K-dimensional vector. Using the norm constraints, it is also easy to verify that phi(w*) is contained in a K-dimensional hypercube. So if you draw a vector U uniformly from the K-dimensional hypercube, then no matter where w* is, U will be sufficiently close to phi(w*) with probability on the order of epsilon to the K-th power, because this is a K-dimensional space. Now, assuming U is sufficiently close to phi(w*), we solve the least-squares problem. This gives a vector W0 such that phi(W0) is close to U, and we know that U is close to phi(w*). So W0 is almost equivalent to w* in minimizing the loss function, and we get a good initialization with probability epsilon to the K-th power. We can repeat this multiple times to guarantee the high-probability bound, so the number of iterations is 1 over this quantity, and plugging the value of K into the expression, you will see the iteration complexity exactly matches the theoretical claim. Okay. So this is our algorithm for learning linear classifiers without any convexity assumption. Now let's proceed to the learning of neural networks. Before describing the algorithm, let's first define the function class that we want to learn. If the neural network has only one layer, then it is exactly a linear function, right? So it reduces to the problem that we have just studied. If the neural net has multiple layers, then we define the m-layer neural net recursively as a function mapping the feature space to a real number, positive for the positive class and negative for the negative class. The function is a linear combination of several components, and every component depends on a lower-level, (m-1)-layer neural net through an activation function. The activation function is arbitrary, but we assume it is 1-Lipschitz continuous, a condition that is easy to satisfy. The number of hidden nodes used to construct the m-layer network is assumed to be finite, but otherwise it can be arbitrarily large. But we make this additional assumption: the L1-norm of the incoming weights, the combination coefficients, is bounded by a constant B, and B is independent of the sample size and the input dimension. This might be the strongest assumption of this work; the class is written out below.
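In symbols, my reading of the recursive class just described (the notation is approximate; the norm bound on the first, linear layer is carried over from the linear-classifier setting and is my assumption; sigma is the 1-Lipschitz activation):

$$\mathcal{N}_1 = \big\{\, x \mapsto \langle w, x\rangle \;:\; \|w\|_p \le 1 \,\big\}, \qquad \mathcal{N}_m = \Big\{\, x \mapsto \sum_{j} w_j\, \sigma\big(g_j(x)\big) \;:\; g_j \in \mathcal{N}_{m-1},\ \textstyle\sum_j |w_j| \le B \,\Big\}.$$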
But it makes sense in practice, because the L1-norm constraint corresponds to a sparsely connected neural network, and we know in practice that sparsely connected neural networks are able to represent meaningful functions that characterize real data. For example, the convolutional neural net is sparsely connected, and the sparsity of the connections can be independent of the scale of the image: it only depends on the size of the sliding window and is independent of the cardinality of the training set. We also know that imposing an L1-norm constraint in practice improves the robustness of neural network training. Given this function class, we define the objective function as the empirical loss, again characterized by the single-instance loss function h, where F is the function that we want to learn, belonging to the neural network class. If the number of layers is equal to 1, then F is a linear function, so it reduces to the linear classifier learning problem; otherwise F is a more interesting nonlinear function. We saw in the previous slides that the random initialization scheme can help to learn a linear classifier even if the loss function is non-convex. So a natural question is: can the same idea be applied to learning multi-layer networks? The answer is yes, and here I'm going to present a generalization of the second algorithm for learning linear classifiers to multi-layer neural networks. The first step is still to sample K i.i.d. instances from the training set. In the second step, we generate a random neural network to treat as the initialization, and we do it recursively. If the number of layers to be generated is equal to 1, then we do exactly as when generating a linear function: we randomly sample a vector U from the K-dimensional hypercube, compute W0 by solving the least-squares problem, and construct the linear function using W0. If the number of layers to be constructed is greater than 1, then we construct the m-layer neural net recursively. Because this is a recursive program, we first generate a sequence of lower-level neural nets, G1 through GS, where S is another hyper-parameter of the algorithm, following the same program. Then we solve an analogous least-squares problem to compute W0, and construct the m-layer neural net F0 using W0 and these (m-1)-layer neural nets. If you compare the two least-squares problems, you'll see that the only difference is in the blue terms: in the first case, the blue term is the feature vector, and in the second case it is the output of the (m-1)-layer neural networks. So for constructing the weights of the m-th layer, we use the outputs of the lower-level networks, combined with the activation function, as the feature vector. And this is intuitive: we can recursively construct the weights of the neural network from bottom to top. We treat this neural network as the initial network and run back-propagation or any optimization algorithm to improve it until we get another neural network F which is also in the function class and whose loss function value is at least as good as the original one. We repeat this procedure T times and select the neural network that achieves the smallest loss function value.

>>: So you do this [inaudible], but in practice you'll be much slower than [inaudible].
>> Yuchen Zhang: In practice we find that running gradient descent in the optimization step is much better than just directly outputting the initialization. Theoretically, we don't need the optimization step -- we're going to see that there is still a polynomial-time guarantee -- but in practice optimization indeed helps a lot, and it is an open problem how to understand the role played by the optimization.

>>: I think the question is how fast or slow does [inaudible] just running the --

>>: [inaudible].

>>: [inaudible].

>> Yuchen Zhang: Computationally, comparing to what?

>>: To just --

>>: [inaudible] batch, for example, the mainstream method.

>>: [inaudible] step, right?

>> Yuchen Zhang: So on one hand, if you only consider computation, then on the specific class of problems where stochastic gradient descent works, there's no reason to use another algorithm, even one comparable to stochastic gradient descent in efficiency, because we already know stochastic gradient descent works. But for some other problems -- for example, we will show later that for learning parity functions there is no easy way to make stochastic gradient descent work -- we're going to show that this method, combined with another technique we'll introduce later, succeeds easily in learning that complicated function.

>>: [inaudible].

>> Yuchen Zhang: Yeah. So this is one algorithm, but [inaudible]. Okay. Because the algorithm is a generalization of the one for learning linear classifiers, the theoretical result is also a generalization of what we have shown before. If you properly choose the hyper-parameters K and S, then the algorithm is guaranteed to be approximately optimal with high probability, and the iteration complexity is an exponential function of 1 over epsilon. So if m is equal to 1, this theoretical result reduces exactly to the theory for learning linear classifiers. If m is greater than 1, then as long as the target optimality gap, the number of layers to fit, the L1-norm constraint, and the Lipschitz constant of the loss function are assumed to be constant, this is a polynomial-time algorithm in n and d. Importantly, I'd also like to emphasize that we haven't made any assumption on the data distribution, so this is a purely agnostic learning algorithm, and this will be important for the further improvement.

>>: Could you give some more intuition about the least-squares problem that you are solving? [inaudible] you first presented it for the linear classifier, and then you used it for the neural network portion, the step --

>> Yuchen Zhang: So if you go to the previous slide -- this one?

>>: It would be the first step under 2. When you draw this uniform vector U from the hypercube, basically when you're solving this least-squares problem, you're completely ignoring the Y primes, basically the labels.

>> Yuchen Zhang: Yeah, exactly. So this is a very good point. All of this initialization is performed in an unsupervised manner. We don't use any information in Y; we only explore the structure of the input space.

>>: [inaudible] give some intuition about this? Because if the data is well separated, right, this blind assignment of labels U, plus or minus 1, to [inaudible] -- I mean, I can't see how -- it's not like one of the random vectors U approximates the structure or anything, right? If it is well separated and it's an easy learning problem, then what's going to happen here?

>> Yuchen Zhang: Yeah, I'll try to explain.
So minimizing a non-convex function is very hard, whether it's a low-dimensional problem or a high-dimensional problem. The only difference is that in a low-dimensional problem you can just randomly try and evaluate, while in a high-dimensional space it is exponentially expensive to randomly try. Essentially we don't know how to do the optimization, so what we can do is just randomly sample. The purpose of solving this least-squares problem is that we assume U represents a mapping of the optimal solution, and U is in a low-dimensional space; so solving the least-squares problem decodes the structure in the low-dimensional space and maps it back to the original high-dimensional space. And we know that if your solution is close to the optimal solution in the low-dimensional space, then after mapping back they could be far apart, but their performance in minimizing the loss function is still similar. So the point is that in a high-dimensional space you cannot be close to the optimal solution, but you can be close to it in terms of performance, and you can do that by sampling, because the performance only depends on the structure of a low-dimensional space. Okay. And now here's the key part. So far we have an algorithm which has exponential dependence on 1 over epsilon. This is fine if epsilon is assumed to be a constant, but if you have a large number of samples, you want the error to be diminishing -- you want epsilon to decrease to 0 as a function of n -- and then the algorithm is too expensive because of the exponential dependence. So a natural question is how we can improve it to a polynomial-time algorithm under some reasonable assumptions. We need additional assumptions because it is already proved in the paper that it is impossible to improve the exponential dependence to polynomial in the worst case; it is an NP-hard problem. But it is possible under some reasonable assumptions. So what assumptions do we need? The inspiration comes from the fact that learning a linear classifier with the 0-1 loss is NP-hard, but the problem becomes easy if you know that the data is linearly separable; in that case you can formalize the learning problem as a linear program, which is known to be solvable in polynomial time. So we're asking if the same intuition holds for learning neural networks: if you know the distribution of the data is separable by some neural network, does that make it easier to learn a neural network that actually separates the data? Before answering this question, let's first define the notion of separability. We say that a dataset is gamma-separable if there exists some unknown neural network F* such that for every instance in the training set, Y times F*(X) is lower bounded by gamma; gamma is the margin of classification. And a distribution is called gamma-separable if a random sample satisfies this gamma-separability almost surely. Now, with this notion of separability, we can show the following theoretical result. Assume that the number of layers, the separability margin, the L1-norm constraint, and the Lipschitz constant of the loss function are constant. Then there is an algorithm such that, on any gamma-separable dataset, it trains a neural network F hat that correctly classifies every data point with a margin on the order of gamma, in polynomial time.
Furthermore, we can show that on any gamma-separable distribution, it learns a classifier achieving an epsilon generalization error using a polynomial number of samples and running in polynomial time. So both the sample complexity and the time complexity have polynomial dependence on 1 over epsilon. We have improved the exponential dependence to a polynomial dependence by assuming the separability condition.

>>: [inaudible].

>> Yuchen Zhang: It's exponential. So we have to assume that gamma is a constant. And we also have hardness results showing that it is impossible to improve it to polynomial even if you assume separability. But under these assumptions, it is also worst-case hardness.

>>: So the second statement that you made, is it for this algorithm or for any algorithm in general?

>> Yuchen Zhang: No, I'm going to present the specific algorithm, called BoostNet [inaudible]. Yeah. The algorithm runs AdaBoost to construct the m-layer neural net, so it is iterative. In the first step of each iteration, it uses the standard AdaBoost technique to weigh the samples according to the existing classifier. Then it trains an (m-1)-layer neural network G which achieves a classification error that is at most gamma divided by 2B worse than the best possible neural network. In the third step it combines this (m-1)-layer neural net into the stronger classifier to construct the m-layer neural net. And finally we may need to do some normalization to ensure that the neural net belongs to the function class we defined. I'd like to share three insights into why this algorithm achieves the theoretical result. The first insight is the equivalence between separability and weak learnability. More precisely, if the data is gamma-separable, then for any re-weighting of the data there exists some neural network g* which achieves a classification error bounded by 1/2 minus gamma divided by B. Recall that 1/2 is the error of the random guessing classifier, so g* is a nontrivial classifier whose classification error is bounded away from random guessing. Now, look at the second step of the algorithm. It trains a neural network g whose classification error is at most gamma divided by 2B worse than g*, so the classification error of g is upper bounded by 1/2 minus gamma divided by 2B, which is still bounded away from 1/2. This is sufficient for AdaBoost. On the other hand, the target optimality gap for g is gamma divided by 2B, which is a constant because gamma and B are assumed to be constant. So we can use the agnostic learning algorithm we just developed in this talk to implement the second step, and this is a polynomial-time algorithm in terms of n and d.

>>: Question. How does the alpha weighting play into the training of that?

>> Yuchen Zhang: The role played by alpha is that you first train an (m-1)-layer neural network and evaluate it. If some sample is correctly classified by the current classifier, then its weight decreases; otherwise its weight increases. So you always focus on the hard samples given the existing classifier. That is the intuition of boosting: train a sequence of classifiers incrementally, where each one focuses on the hard examples at the present stage.

>>: [inaudible] how is that done [inaudible]?
>> Yuchen Zhang: So you first have this function F which is an m-layer neural net, right? Given this F, you go to the next iteration and you assign alpha to be proportional to this. You can see there is a minus sign here: if F(X_i) times Y_i is positive, it means you have a correct classification and the weight will be small; otherwise the weight will be large. So, yeah, this is the specific way it is implemented.

>>: [inaudible] loss function?

>> Yuchen Zhang: For the actual loss function, you can consider the exponential loss; that is the standard way of analyzing AdaBoost. But here we are interested in the classification error, and because the exponential loss is an upper bound on the 0-1 loss, you can transfer the convergence on the exponential loss to convergence on the 0-1 loss.

>>: [inaudible] loss function embedded [inaudible].

>> Yuchen Zhang: Sorry?

>>: On the alpha-weighted data [inaudible] weight the data.

>>: Yeah, yeah, I know, but -- okay, I see. So each of those terms -- are these weights additive weights? What are those? Are they multiplicative at each of the loss terms?

>> Yuchen Zhang: Oh. So if you have alpha-weighted data, then alpha is multiplied into every single-instance loss. And all the algorithms presented before can be generalized easily to weighted data, so that is not a problem. Now that we have this weak classifier which achieves a nontrivial classification error, the standard theory of AdaBoost guarantees that after on the order of 1 over gamma squared iterations, the learned classifier will correctly classify every data point with a margin on the order of gamma. So this establishes the first claim of the theorem. The second claim follows as a direct consequence of the first, because if the first claim holds with this constant margin, then you can show that the generalization error of the same classifier is upper bounded by epsilon if the number of training instances scales as 1 over epsilon squared. So the sample complexity has this polynomial dependence on 1 over epsilon, and since the running time has a polynomial dependence on the sample complexity, the running time also has a polynomial dependence on 1 over epsilon. Now, given the theoretical guarantees for the BoostNet algorithm, let's see how it performs in practice. We revisit the problem of learning parity functions, where we know that back-propagation fails. For learning a parity function of degree 2, both algorithms learn the correct parity function and achieve the optimal classification error. The BoostNet algorithm incrementally constructs the hidden nodes, and it takes about five of them until it achieves the optimal rate, so it is slightly less efficient than back-propagation. On the other hand, if the degree of the parity function is equal to 5, then we know that back-propagation is roughly equivalent to random guessing, but BoostNet is able to learn the correct parity function by incrementally constructing fewer than 50 hidden nodes.

>>: [inaudible] do gradient descent after the random [inaudible]?

>> Yuchen Zhang: Yeah, we do do gradient descent. If you don't do gradient descent, it's much, much slower than this. Yeah. Okay.

>>: And probably there are few [inaudible].

>> Yuchen Zhang: No, no, we don't do any careful tuning, just [inaudible].

>>: [inaudible] like step 3 where you just go [inaudible] reduce the [inaudible]?

>> Yuchen Zhang: Yeah. Yes.

>>: [inaudible].
>> Yuchen Zhang: Yes.

>>: So why does the parity function satisfy the assumptions? You need gamma separability, right?

>> Yuchen Zhang: So actually there are two points. The first point is that it doesn't satisfy gamma separability when there is noise, but because of the symmetry of the noise it roughly satisfies the same structure; and this is an empirical result, so it just shows that the algorithm also works in a case where the assumption is not exactly satisfied. The second point is that in the paper we do have an extension of the algorithm that works in this case with a theoretical guarantee, under a corrupted version of separability which is satisfied by the parity function. It is in the paper, but not covered by the talk. So we know that both algorithms learn the same architecture, a two-layer neural net, but the second algorithm learns a better neural net. From the optimization perspective, it means that BoostNet is harder to trap in bad local minima, because if it is trapped, then you add another node. It is an incremental algorithm, so it's easier to get out of bad local minima, and that is the intuition for why the empirical result is better.

>>: So why would I choose BoostNet over, like, boosting with just neural networks? You know? Is there a -- is there a --

>> Yuchen Zhang: Yeah, that is a great question. I think the intuition provided by the experiment is that you do want to use boosting over neural networks to construct a deeper neural network; you can even do this recursively -- here we just boost one layer, but you can boost many layers. But theoretically, the intuition this work shows is that instead of using boosting alone, you also have to do very careful initialization, and in practice we also know that initialization is important. Right? So the initialization combined with boosting can achieve this polynomial rate, but either one alone cannot.

>>: [inaudible] have you tried experiments just with boosting on the neural networks?

>> Yuchen Zhang: We have tried using boosting with less careful initialization, and it can still achieve something like that. So it shows that empirically, at least for this problem, it is the boosting that is more important for the good performance. So finally let's talk about improper learning algorithms. As we have mentioned, the goal of improper learning is to learn some classifier that is not necessarily a neural network, but we want its generalization error to be competitive with the best possible neural network -- at most epsilon worse than the best possible neural net. That is the target. How do we do this? The main idea is to define another function class F and define F hat as the empirical risk minimizer over this class for the loss function. We should define this function class to satisfy the following three conditions. The first is that the empirical risk minimizer should be easy to compute -- computable in polynomial time; otherwise there's no reason to switch methodology. The second is that the function class should be powerful enough to contain the neural networks. And third, it shouldn't be too large; otherwise it is too easy to overfit.
Not being too large means that, with polynomial sample complexity, the generalization error of the empirical risk minimizer can be controlled by the best possible generalization error within this function class plus epsilon. Combining this inequality with the second property -- it is easy to see that the best possible generalization error using F is upper bounded by the best possible generalization error using the function class of neural networks -- we achieve the target. Now, the only remaining problem is how to find a good function class satisfying the three conditions, and our solution is to use the kernel method. We define a sequence of kernels: the zeroth-order kernel is defined as the inner product between the two inputs, and the p-th order kernel is defined as a function of the (p-1)-th order kernel. So given the inner product, all of these kernels can be easily computed. We choose p to be equal to the number of layers that we want to fit, and it can be verified that K_m is a valid kernel function, so it induces a reproducing kernel Hilbert space. If we treat these phi functions as the basis functions of the reproducing kernel Hilbert space, we can define a function class of linear combinations of the basis functions -- so F is still in the reproducing kernel Hilbert space -- but we constrain the norm of the combination coefficients to be upper bounded by a constant that only depends on the number of layers to fit and the L1-norm constraint on the neural net. If you define the function class in this way, you can show that it contains the neural network class. On the other hand, you can also show that this function class is not too large, in the sense that the Rademacher complexity of the function class with n random samples is bounded by the same constant divided by the square root of n. So this satisfies the third condition of the previous slide, and that satisfies the second condition. Now, the remaining problem is how to compute the empirical risk minimizer.

>>: [inaudible] how did you arrive at this kernel?

>> Yuchen Zhang: How did I arrive at this kernel?

>>: [inaudible].

>> Yuchen Zhang: There is some intuition, indeed. There is a very seminal paper by [inaudible] and others from several years ago about learning linear classifiers with non-convex losses, and they use a kernel that is similar to this. And it is a very interesting observation that if you stack these kernels, then the result is strong enough to approximate a neural network, because the neural network is just recursively defined as a stack of linear combinations --

>>: [inaudible].

>> Yuchen Zhang: This one?

>>: Yeah.

>> Yuchen Zhang: So the intuition behind this is that you can expand it as a sum of polynomial functions -- these basis functions are actually polynomial functions -- and you can show that the neural network class has a polynomial expansion whose coefficients satisfy this norm constraint.

>>: So you don't use a Gaussian kernel?

>> Yuchen Zhang: I don't use a Gaussian kernel. And indeed a Gaussian kernel is not strong enough to cover the neural network class, although the practical performance of the Gaussian kernel could also be good.

>>: [inaudible] contain the neural networks for any sigma?

>> Yuchen Zhang: No. So I'm going to talk about the condition on sigma.
So there is a condition -- the only condition on sigma is that it should be smooth, sufficiently smooth in some sense. I'm going to show some examples of sigma that satisfy this condition, and they are very, very close approximations to the standard activation functions that people use in practice.
>>: [inaudible] also have this multiple deep kernel. Is this similar to that?
>> Yuchen Zhang: So I may have read that paper, and I think there are some similar ideas of using a stacked version of kernels to approximate the neural network. It is known that the kernel idea has a connection to neural networks. So one contribution here is a theoretical justification for that connection -- a more rigorous justification of why they are connected.
>>: This special kind of recursive kernel to do all the [inaudible].
>> Yuchen Zhang: Yes. Exactly.
>>: Using [inaudible] the stacked kernel [inaudible].
>> Yuchen Zhang: I guess not. But I'm not sure, because I would need to look more carefully into the paper.
Okay. So the efficient computation is also easy, because it is a kernel method. By the representer theorem, we know that the empirical risk minimizer can be represented as a linear combination of kernel functions, so it suffices to learn the combination coefficients. And this can be written in the standard way as a convex optimization problem that can be solved efficiently. So --
>>: [inaudible].
>> Yuchen Zhang: So this is a convex constraint.
>>: H?
>> Yuchen Zhang: H? Where is H?
>>: H is the --
>> Yuchen Zhang: Oh. So H is a convex function. H is the final penalty on the output of the neural net. You can choose, for example, the hinge loss, and by minimizing the hinge loss you get an upper bound on the 0-1 loss.
>>: How many times do you need to recurse, to redefine the kernel?
>> Yuchen Zhang: So it depends on how deep the neural net we want to --
>>: [inaudible].
>> Yuchen Zhang: Yeah, it is the same as the depth that you want to fit.
>>: Yeah, I have some kind of [inaudible]; the intuition is that if you use the Gaussian kernel you effectively get [inaudible] size of the -- but using this [inaudible], do you have that property?
>> Yuchen Zhang: Infinite number of -- yeah. Yes. So --
>>: [inaudible].
>> Yuchen Zhang: Oh. So actually this can approximate any neural network in the class of neural networks that I defined on an earlier slide. If you recall the definition, the number of nodes in that function class can be arbitrarily large -- it could even be infinite in the limit. And the problem with the Gaussian kernel is that it can fit any function that is sufficiently smooth, but that smoothness condition is very, very strong if the dimension is high, and the neural network class contains functions that do not satisfy it.
>>: So I'm just trying to understand the relationship between the number of hidden nodes and the level of the recursion. You mentioned the level of recursion should be the same as the depth. But how do the number of hidden units and layers --
>> Yuchen Zhang: It is not related to that, but it is related to the L1-norm constraint.
>>: I see.
>> Yuchen Zhang: Right? Because B appears in the algorithm. It turns out -- it is a well-known fact -- that if you impose an L1-norm constraint, then the complexity of your function class can be independent of the number of nodes. Yeah.
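As a concrete illustration of the kernel computation and the ERM step, here is a minimal sketch. The recursive map k -> 1/(2 - k) is borrowed from the Shalev-Shwartz-style kernel the speaker cites and is an assumption, not necessarily the paper's exact recursion; likewise, a precomputed-kernel SVM with hinge loss stands in for the paper's constrained empirical risk minimizer.

    import numpy as np
    from sklearn.svm import SVC

    def recursive_kernel(X, Z, depth):
        # Gram matrix of a stacked kernel: K_0(x, z) = <x, z>, K_p = f(K_{p-1}).
        # The map 1 / (2 - k) is an illustrative choice; inputs are assumed
        # to have unit norm so the base kernel lies in [-1, 1] and the
        # recursion stays bounded and positive definite.
        K = X @ Z.T
        for _ in range(depth):
            K = 1.0 / (2.0 - K)
        return K

    # Toy usage: the hinge loss plays the role of the convex penalty H from
    # the talk, and the SVM plays the role of the kernelized ERM.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.sign(X[:, 0] * X[:, 1])           # a simple nonlinear target
    depth = 2                                 # number of layers we want to "fit"
    K_train = recursive_kernel(X, X, depth)
    clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)
    print("training accuracy:", clf.score(K_train, y))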
So the L1-norm constraint is also a strong constraint, which can control the complexity. Putting all the pieces together, the theory guarantees that if you assume M and B are constants and sigma is smooth enough -- I'm going to show what smooth enough means -- then the classifier can be computed in polynomial time, and the corresponding sample complexity for achieving an epsilon generalization error is also polynomial. If you compare this result with the theoretical guarantee for BoostNet, you will see that it doesn't need any separability condition on the data distribution. So by moving from proper learning to improper learning, you are considering a wider class of algorithms, and you can achieve a similar learnability result with a weaker assumption on the distribution. The only additional assumption is the smoothness condition on sigma, for which I'm going to show two examples. You can consider an erf function: it satisfies the constraint, and it is a very close approximation to the standard sigmoid function. You can also consider a smoothed version of the hinge loss, which is a smooth approximation to the standard ReLU function. So for both the standard sigmoid and the standard ReLU, you can find a very close approximation. These particular approximations are one choice, and they can be made even closer as long as the smoothness constant remains a universal constant. So for both standard activation functions, you can find an approximation that is close and satisfies the constraint of the theory.
And finally we also have some simple experiments to test the algorithm on the MNIST dataset. Because the original MNIST recognition task is relatively easy, we also compare on some variations of MNIST, including randomly rotating the images by some random angle, or adding a noisy background to the images, or combining the rotation and the noisy background. Roughly speaking, the experiments show that the kernel method outperforms logistic regression and the back-propagation-trained two-layer neural network. It is comparable with the convolutional neural net on the basic and rotation datasets, but not as good on the other two datasets. This is also expected, because for the kernel method we just design a generic kernel instead of using any property of the recognition problem, whereas the convolutional neural net does use domain knowledge. Another observation is that if you increase the hyper-parameter M, meaning the model is able to fit deeper neural networks, the performance also gets better in all cases. This is also intuitive, because you can expect the performance of a deeper neural network to be slightly better on the MNIST dataset.
Okay. So as a summary of this talk, we have studied provable algorithms for learning constant-depth neural networks in polynomial time. The take-home message is that for proper learning, two conditions are needed. The first is that the incoming weights of every neuron have an L1-norm bounded by a constant, and the second is the separability condition on the data distribution. Under these conditions, the neural network can be learned in polynomial time. For improper learning, we also have two conditions: the first is still the L1-norm constraint on the neural network, and the second is that the activation function is sufficiently smooth.
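To make the "close approximation" claim concrete, here is a tiny numeric sketch comparing an erf-based activation with the logistic sigmoid, and a softplus-style smoothed hinge with ReLU. The scaling constant sqrt(pi)/4 simply matches the slopes at zero and is my own choice for illustration, not necessarily the constant used in the paper.

    import numpy as np
    from scipy.special import erf

    x = np.linspace(-4.0, 4.0, 9)

    # erf-based activation vs. the standard logistic sigmoid.
    sigmoid = 1.0 / (1.0 + np.exp(-x))
    erf_act = 0.5 * (1.0 + erf(np.sqrt(np.pi) / 4.0 * x))

    # A softplus-style smoothed hinge vs. ReLU (illustrative smoothing only).
    relu = np.maximum(0.0, x)
    smooth_hinge = np.log1p(np.exp(x))

    print("max |sigmoid - erf activation|:", np.max(np.abs(sigmoid - erf_act)))
    print("max |ReLU - smoothed hinge|   :", np.max(np.abs(relu - smooth_hinge)))  # log 2 at x = 0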
So in summary, we know that learning neural networks is a challenging problem. It is NP-hard in the worst case, but it can become tractable if you add reasonable assumptions and use novel, principled algorithms; then you can get theoretical guarantees.
Okay. So understanding deep learning is a very challenging field, and this work is only an early step in searching for algorithms that are both understandable in theory and useful in practice. There are certainly many open problems -- let me mention at least three of them.
The first open problem is that in our algorithm some quantities are assumed to be constant, because the running time has an exponential dependence on those quantities. On the other hand, we also know that some exponential dependences can be improved to polynomial dependences if you consider additional constraints on the data distribution. Right? So it is very promising to consider further properties of real data distributions to reduce the complexity of the algorithm and make it fully polynomial.
The second point is that it is a little strange to observe that the theoretical guarantee still holds even if your optimization algorithm does nothing -- it just outputs the initialization. So it is very interesting to understand the role played by the optimization algorithm. If we had that understanding, it would definitely help the design of algorithms, because we know that in practice the optimization step helps a lot.
The third point is that the essential difference, as we have noticed, between BoostNet and traditional back propagation is that BoostNet incrementally constructs a neural network, while back propagation trains a fixed architecture. Although BoostNet only constructs the top layer, I believe that with some novel techniques it is possible to design a better algorithm that incrementally constructs the whole neural network from the bottom to the top, where every incremental step is simple, principled, and analyzable, and the combination of a large number of these incremental steps is sufficient to construct a very complicated neural network. If this can be done, then a very deep and complicated neural network can be learned in a principled way with theoretical guarantees. So this is a more ambitious goal.
Besides learning neural networks, another field that I'm excited to work on is deep reinforcement learning, so let me also make a few points. The first is that it is a branch of deep learning, and the principled methods that work for classical learning of neural networks can also be applied to deep reinforcement learning; so the effort spent on the first problem will not be wasted. On the other hand, the primary goal of my research is to explore algorithms for [inaudible] artificial intelligence, and I believe that a strong AI must interact with the environment, take actions on the environment, and collect rewards from the environment, and reinforcement learning is exactly a methodology for doing that. There are some unique challenges in reinforcement learning where key innovations can be made. One of them is that the current methodology of deep reinforcement learning uses a neural network to approximate the Q-function and uses that Q-function approximation to make decisions.
But the model itself cannot remember anything. On the other hand, in order to interact with a more complex environment, it is essential to maintain a memory that collects knowledge and experience from the past to improve performance. So it appears that a long-term memory mechanism would be very interesting to incorporate and could be useful in practice. This memory could be not only about how to solve a specific problem but also about how to interact with human beings -- with many different human beings. In other words, if you can personalize the reinforcement learning model, then it will find applications in many interesting fields, including playing games with multiple players instead of just a single-player video game, as well as interactive search, personalized recommendation, or even business applications like decision-making, sales, or negotiation. At the system level, I think it is also interesting to build interactive systems that interact with Internet users, actively collecting data from the users while serving the users with the model that is trained on that data. So, roughly, these are a subset of interesting directions that I think are worth pursuing in the future, and I'm more than happy to talk with any one of you about the details.
So just to wrap up, we have talked about some provable algorithms for learning neural networks, some past work of mine, and some future work. I'm looking forward to further meetings with the people at MSR. Thank you very much.
[applause]
>>: So, two [inaudible]. What is better, the method that you talked about earlier on [inaudible] theoretical results can be [inaudible]?
>> Yuchen Zhang: So for both the improper and [inaudible] we have binary classification algorithms. For improper learning we run the MNIST dataset with ten classes, and the way we do it is just to train ten one-versus-all classifiers. The performance is not bad. So I believe that proper learning can also be generalized in that way. On the other hand, if you --
>>: [inaudible] efficient than not doing --
>> Yuchen Zhang: Yeah, it is not as efficient as a native multi-class classifier. It is a very interesting question. I believe you can define some notion of separability, at least for [inaudible], for multi-class classification, and there is plenty of work on multi-class boosting, so I think the idea can be transferred. But that would be future work.
>>: So I have a question about the [inaudible] that you explained. Is your work the first to propose it, or is it something that exists in the literature? I'm just curious to know.
>> Yuchen Zhang: Well, as far as I'm aware, I don't know of other work, because there is very limited work on the theoretical analysis of neural networks, but a separability condition is indeed a standard assumption for linear classifiers. And we know that there is a connection between the [inaudible] and neural networks if you consider non-convex functions, so I think conceptually you can find a way to make it carry over. But specifically for neural networks, I don't know of any other paper that uses a similar assumption.
>>: My question was that, basically, given a dataset -- based on [inaudible] in machine learning, and I don't know the exact terms very well, but [inaudible] any neural network can approximate any function -- isn't it true that for a given dataset there will always be some neural network that can achieve [inaudible] separability?
>> Yuchen Zhang: So, first, it is not exactly true, because it is possible that with the same features the label could differ -- your dataset could be noisy. If that happens, then no classifier can achieve perfect performance. On the other hand, even though there are results saying that a neural network can approximate any function, that holds under some smoothness condition. And if you look at a high-dimensional problem, smoothness becomes a very serious issue, because in high dimensions there is a very interesting result in geometric functional analysis saying that if your function is sufficiently smooth, then it is almost a constant. So if you impose very strong smoothness conditions, you cannot learn anything interesting. The problem then lies in how to weaken the smoothness conditions, and that, I think, is the power of the neural network.
>>: I have a small practical question. I think the method you showed was tested on a two-layer neural network, so it is very shallow, and it was a great result: you showed that by more intelligent optimization -- or more intelligent initialization -- you could learn more complex classes of classifiers than back propagation can. That's great. In practice, when we meet that problem, we tend to add more layers, and that gives us more learnability. I wonder if you could comment on what amount of depth might be required for that problem with standard back propagation and random initialization. I'm trying to get a sense of how much this might have saved. Is it saving one layer or ten layers? Can you speculate? Or do you have any results?
>> Yuchen Zhang: Yeah, so I think this can be understood in two ways. The first is that as you add more and more layers, you get stronger representational power, and that might be one reason for the better performance. Another perspective is that when the neural network gets deeper, it is actually easier to train, because you have more parameters and there are more local minima, and reaching any one of those local minima is good enough. But if your neural net is shallow and small, most of the local minima are not good enough.
>>: [inaudible] more important, but I think both are true.
>> Yuchen Zhang: Yeah. So I cannot answer both questions explicitly. The first question is basically orthogonal to this line of work, because we don't address the representation part: we just assume that if the representational power is strong enough to classify your data, then we can learn that representation. We cannot characterize what kind of distribution can be separated by a ten-layer neural net. For the second question, to answer it I would need to understand how optimization behaves in the landscape of the loss function of a ten-layer, or at least a deep, neural net. So this is a challenging problem, and we still cannot answer it. A partial answer is that if you look at the boosting mechanism here, boosting is a special case of a gradient method, because boosting can be understood as a greedy method. Right? And we saw that using a greedy method can improve the complexity of the algorithm from exponential to polynomial, in some sense.
And what I believe is that the reason a deeper neural network is easier to train is that it is a better architecture for a greedy method to work on, because gradient descent is also a greedy method: it just searches your neighborhood to find the best direction to descend. Right? So I think there is a very strong connection between the greedy nature of gradient descent and the depth, or the architecture, of the neural network. It is a very interesting open problem to formalize this, either by some new theoretical analysis of gradient descent or by some new modeling technique, so that you can build a new model that is almost as powerful as the traditional way of building a neural network, with a guarantee that it can be constructed incrementally in a principled way, as the third open problem suggests. Yeah. So I only have some future visions, but there is no explicit answer to that question.
>> Lin Xiao: Okay. Let's thank [inaudible] again.
[applause]