>> Lin Xiao: Today we are very happy to have Yuchen Zhang give a talk. He's
a Ph.D. candidate at UC Berkeley in Michael Jordan and Martin Wainwright's
group. He has made actually a number of very impressive contributions to
different topics in machine learning, optimization, as well as statistics.
And he is also not shy about coding and software implementations. You will see.
And also most recently he has been working on learning neural networks. So that
will be the focus of his talk.
>> Yuchen Zhang: Yeah, thanks, Lin, very much for the introduction. Let me
see if the microphone works. Okay. So --
>>: You have one on your shirt.
>> Yuchen Zhang: It's fine. Okay.
>>: It's okay.
>> Yuchen Zhang: Great. Yeah. So it's my great pleasure to give a talk
here at Microsoft Research Redmond. In this talk I will mainly talk about
some provable algorithms for learning neural networks.
And since this is an [inaudible] talk, supposed to be [inaudible] talk, so I
will also -- I will first spend a little moment to introduce myself. So I'm
a fifth-year Ph.D. student at UC Berkeley working with Michael Jordan and
Martin Wainwright. And my general research interest is to develop machine
learning algorithms for building artificial intelligence. This is a broad
area. So my interest is also a little elastic.
In my personal perspective, there are three key challenges in machine
learning research. The first one is how to represent AI in a rigorous
mathematical form, so that is the modeling part. The second question is,
given a concrete model, how to learn the parameters of the model using an
efficient machine learning algorithm. And this problem can be challenging if
it is a very complicated model or the corresponding loss function is
non-convex. And the third question is, given a complex model, we need to use
a huge amount of data to learn it, and this scale of data is usually not
feasible on a single machine but has to be stored and processed in a
distributed system. So the question is how to perform efficient machine
learning with a distributed system.
My research tries to answer these three questions in different aspects. And
before going into the main part of this talk about learning neural networks,
I would like to spend a few minutes to quickly go through my earlier work and
the connections to the above challenges to probably give you a more concrete
idea of my research background.
Okay. Now, let's start from a specific family of algorithms that I have found
interesting. It is called divide-and-conquer algorithms, and these
algorithms divide a large-scale problem into smaller subproblems and combine
the solutions to the subproblems into a single solution for the original
problem.
This computation scheme is very suitable for distributed computing. And
indeed we have proposed some efficient divide-and-conquer algorithms for
learning parametric models.
Such algorithms are computationally efficient in the sense that every machine
only has to solve a small-scale problem after the divide-and-conquer step. They
are also communication efficient in the sense that the computations on the
separate machines are mutually independent and there is only one round of
communication required at the end of the computation.
We also theoretically guarantee that the algorithm has optimal performance in
the sense that if the number of machines is not too large compared to the
overall number of samples, then the optimal statistical accuracy is
guaranteed. And all of these theoretical statements were verified by real-data
experiments.
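As a rough illustration of the scheme being described, here is a minimal one-shot averaging sketch; the least-squares subproblem, the function names, and the toy data are illustrative assumptions, not details from the talk.

```python
import numpy as np

def local_estimate(X, y):
    # Each machine solves its own small-scale subproblem (here: least squares).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def divide_and_conquer_estimate(X, y, num_machines):
    # Split the data, solve locally and independently on each machine,
    # then combine with a single round of communication (one averaging step).
    X_parts = np.array_split(X, num_machines)
    y_parts = np.array_split(y, num_machines)
    local_solutions = [local_estimate(Xp, yp) for Xp, yp in zip(X_parts, y_parts)]
    return np.mean(local_solutions, axis=0)

# Toy usage on a linear model with Gaussian noise.
rng = np.random.default_rng(0)
n, d = 10000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
print(np.linalg.norm(divide_and_conquer_estimate(X, y, num_machines=10) - w_true))
```

The guarantee mentioned in the talk is about exactly this kind of averaged estimator: as long as the number of machines is small relative to the number of samples, the averaged solution matches the accuracy of the centralized one.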
We also find that the idea of divide-and-conquer can be applied to learning
nonparametric models. For kernel ridge regression, for example, the
classical way of learning it requires a running time of N cubed, where N is
the overall number of samples in the dataset. We have proposed a
divide-and-conquer based, more efficient algorithm whose running time is
reduced from N cubed to almost linear in N. And also the optimal
statistical accuracy is guaranteed. And this is the very first efficient
algorithm for kernel ridge regression that guarantees the optimal rate.
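A minimal sketch of the divide-and-conquer idea for kernel ridge regression follows; the RBF kernel, the regularization value, and the averaging of predictions are illustrative choices, not necessarily the exact estimator from the paper.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def fit_krr(X, y, lam=1e-3):
    # Standard kernel ridge regression on one partition: cubic in the partition size.
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return X, alpha

def predict_krr(model, X_test):
    X_train, alpha = model
    return rbf_kernel(X_test, X_train) @ alpha

def dc_krr_predict(X, y, X_test, num_partitions=10, lam=1e-3):
    # Fit KRR independently on each partition and average the predictions,
    # replacing one N^3 solve by num_partitions solves of size N/num_partitions.
    models = [fit_krr(Xp, yp, lam)
              for Xp, yp in zip(np.array_split(X, num_partitions),
                                np.array_split(y, num_partitions))]
    return np.mean([predict_krr(m, X_test) for m in models], axis=0)
```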
I'm also interested in classical convex optimization algorithms as the convex
model is widely used in practice. A key challenge in convex optimization is
how to efficiently minimize a convex loss function given a very large
condition number on the loss function.
The state-of-the-art empirical risk minimization algorithms such as SAG,
SDCA, or SVRG may need to take 1 plus the condition number divided by the
number of samples passes over the dataset in order to achieve a high accuracy.
We have proposed a more efficient algorithm called SPDC which improves this
iteration complexity from 1 plus kappa divided by N to 1 plus square root of
kappa divided by N. So it is a -- it could be orders of magnitude faster for
problems with a very large condition number kappa.
It appears that the efficiency of distributed optimization algorithms also
suffers from the high-condition-number problem. For example, popular
distributed algorithms such as ADMM or L-BFGS have a
convergence rate depending on the condition number, which further depends on
the overall number of samples in the dataset. So that means that when we
process a very large dataset, the algorithm will converge slowly.
We have proposed a more efficient algorithm called DiSCO which converges
[inaudible] faster than these popular algorithms. And we can show
theoretically that its convergence rate is independent of the sample size.
So it is suitable for processing very large datasets. And all of the
theoretical statements for SPDC and DiSCO were verified by real-data
experiments.
When people try to solve complicated problems, they tend to build more
complicated models, such as non-convex models. A very important family of
non-convex models are neural networks, and this will be the focus of this
talk. So I will [inaudible] the discussion on the provable algorithms for
learning neural networks to the main part of this talk.
Besides learning neural networks, I have also been working on some other
problems which are non-convex, including the crowdsourcing problem. The goal
of crowdsourcing is to label a dataset with the help of Internet users. But
since these labels are noisy, we want to infer the true labels as well as
estimating the quality of the users at the same time. This requires a non-convex
model because the loss function may have many local minimums.
For this problem, we have a [inaudible] efficient algorithm based on the idea
of spectral methods and the idea of the EM algorithm, which guarantees the optimal
performance. So this is a very nice example that a non-convex problem can
be solved by carefully designed algorithms.
Some other questions that I have been asking are more theoretical, and one of
them is what is the fundamental tradeoff between communication cost and
statistical accuracy for all possible machine learning algorithms on a
distributed system. This is basically an area that is not fully explored, so
there are some very fundamental, basic, open questions.
One of these open questions is if you want to estimate the parameter of a
probability distribution and you have a distributed dataset sampled i.i.d.
from this distribution, then the question is what is the best possible
statistical accuracy given a fixed communication budget. And our work has
established this tradeoff for some very fundamental problems, including
estimating the mean of a Gaussian distribution and estimating the coefficients
for linear regression or probit regression.
And we have shown that this tradeoff is essentially
tight in the sense that it applies to all possible distributed algorithms and
is realizable by practical algorithms.
Another interesting set of problems is distributed linear algebra. For
example, you may want to compute the rank of an N x N matrix [inaudible] or
the number of eigenvalues of this matrix that are above a specific threshold. For
this problem, we have proposed a more efficient algorithm which uses the idea
of randomization and which only communicates on the order of n bits.
We have also theoretically characterized the communication and accuracy
tradeoff for all possible algorithms for this kind of problem. And
interestingly enough, we find that this tradeoff dramatically differs between
deterministic algorithms and randomized algorithms. So this is some
connection between communication and accuracy.
And another interesting interface is between computation and accuracy.
In statistical learning theory, we know that the optimal accuracy is usually
characterized by the so-called minimax rate, the minimax error rate. For
some important problems, we know that there are exponential-time algorithms that
achieve this minimax rate, but there is no known polynomial-time algorithm
achieving the same rate.
We study an interesting problem called sparse linear regression, which is
also widely used, and show that it is necessary to rethink the notion of
statistical accuracy under a certain computational constraint. More
specifically, we show that there is an arbitrarily large gap between the
performance of the best exponential-time algorithm and all possible
polynomial-time algorithms.
This shows that the minimax rate cannot be achieved by any polynomial-time
algorithm. And indeed we have to redefine the notion of statistical
optimality for the classical polynomial-time algorithms, and we have shown
similar gaps for improper learning as well.
I'm also interested in building machine learning systems, and one example is
Splash, which is a framework that I designed and implemented for
parallelizing stochastic algorithms. So Splash is a general-purpose programming
interface that allows users to develop stochastic algorithms, such as
stochastic gradient descent or Gibbs sampling, without knowing any details
about the distributed system.
But on the other hand, it is also an execution engine that can automatically
parallelize the algorithm that the user implements on a distributed system.
We have built Splash on top of Apache Spark, and it is integrated with the
Spark ecosystem. So existing Spark users can very easily use Splash.
And we have verified by experiments on large-scale optimization and machine
learning problems that Splash is able to achieve an order of magnitude speedup
over the official machine learning package of Spark. And this is partially
because of the advantage of stochastic algorithms over the traditional batch
algorithms of machine learning and partially because of the efficient design
of Splash.
And this is an open source project which is available online, and through
this URL you can find some guidelines for programming, installation, and some
examples that you can try on your laptop or on any cluster.
[inaudible] to apply machine learning techniques to real-world problems, and
one example is click modeling. The goal is to model users' behavior in Web
search and analyze users' feedback to the search engine to improve the search
engine's ranking function and online advertising algorithms. This line of
work was done when I was an intern at Microsoft Research Asia and eventually
shipped into the product to improve Bing's NDCG by more than 0.8 percent.
I've also been working on a recommender system project,
and the goal is to learn a non-parametric model to solve the data sparsity
problem for online shopping and recommendation. And this work has
dramatically improved the quality of recommending long-tail items in online
recommendation.
So roughly these are the works that have been done in the past. If you are
interested in any one of them, you're more than welcome to come to talk to me
after this presentation for more details. Okay.
>>:
[inaudible].
>> Yuchen Zhang: Yeah. So it was an internship project at Google. So let's
go back to the main topic of this talk about the provable algorithms for
learning neural networks. This is a joint work with my two advisors as well
as my coauthor Jason Lee, who's now a postdoc at UC Berkeley. Right.
So in recent years, we have all witnessed the great success of neural
networks in many applications of artificial intelligence, including vision,
speech, NLP, reinforcement learning, and many others. Compared with
classical linear models, we know that a neural network is able to encode
nonlinear functions, which makes the model more powerful.
Another added benefit of using neural networks is that they allow
researchers to incorporate their domain knowledge into the design of the
architecture of the model. And it has been shown in practice that a
specifically designed architecture can dramatically outperform a generic
architecture, even for neural networks. And successful examples as we know
include convolutional neural nets, or CNNs, and recurrent neural nets.
Despite the diversity in model architectures, the algorithms for learning
these models' parameters are relatively uniform. [inaudible] formulate the
learning problem as an optimization problem by constructing a loss function
and then run an optimization algorithm to minimize the loss function.
So let's consider a simple neural network which contains only one neuron. It
takes a feature vector X as input, then applies a linear transformation,
then applies the sigmoid function to define the output. If we want to solve a
regression problem, we can write the loss function as the empirical
least-squares loss.
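In symbols, the single-neuron model and loss being described are presumably

$$f_w(x) = \sigma(w^\top x), \qquad L(w) = \frac{1}{n}\sum_{i=1}^{n}\big(\sigma(w^\top x_i) - y_i\big)^2,$$

where sigma is the sigmoid; the exact notation here is a reconstruction from the spoken description.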
And since the sigmoid is not a linear function, this loss function may not be
convex. So it means that there could be multiple local minimums in the loss
function. And if you run gradient descent to minimize this function, then
there's no guarantee that the algorithm will converge to the global minimum.
The problem of local minimums exists even if we only consider two points in
one dimension. For example, if you choose the feature-label pairs taking
these two values and plot the landscape of the loss function here, you will
see there are two local minimums immediately. And the second local minimum
is substantially higher than the first one.
So if you run gradient descent initializing in this region, then it is likely
that the algorithm will converge to the second minimum, which is obviously
suboptimal. The classical way of solving the local minimum problem is to do
multiple rounds of random initialization and then run gradient descent. So
we can do this multiple times and choose the best solution that achieves
the smallest loss function value.
This general approach can be combined with many heuristics which improve the
performance in practice. For example, we can use mini-batch training to
improve the efficiency of gradient descent. We can use the momentum method
to improve the -- to get out of the bad local minimums. And we can use
drop-out to improve the robustness of the training procedure.
It's been shown in practice that the proper combination of this general
approach with the heuristics can achieve very decent performance on
real-world problems. But does it mean that this approach is the correct way
for solving all kinds of neural network learning problems? The answer is
negative because there essentially are some interesting problems, some
interesting neural networks, that are difficult to learn using this gradient
descent and random initialization approach.
To see this, let's look at a concrete example called learning the parity
function, which is a famous and classical problem in learning theory. The
problem setting is that we have a feature vector X and a label Y. The
feature vector is uniformly sampled from the vertices of a 50-dimensional
hypercube, and the true label is defined as the product of a subset of
coordinates of the feature vector.
The number of coordinates involved is called the degree of the parity function
and is indicated by this letter P. The learner observes the feature vector
as well as the [inaudible] version of the label. With probability 0.9, it
observes the true label, and with probability 0.1 it observes the negative of
the true label.
The goal is to train a classifier using multiple instances in this form such
that given a new feature vector it can predict the value of Y, whether it is
equal to minus 1 or equal to 1.
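A small sketch of the data-generating process being described follows; which subset of coordinates defines the parity is not specified in the talk, so the first `degree` coordinates are used here purely for illustration.

```python
import numpy as np

def sample_noisy_parity(n, dim=50, degree=5, flip_prob=0.1, seed=0):
    # Features are uniform over the vertices of the {-1, +1}^dim hypercube.
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n, dim))
    # True label: product of a subset of coordinates (here, the first `degree`).
    y_true = np.prod(X[:, :degree], axis=1)
    # The learner sees the true label with probability 0.9 and its negation otherwise.
    flips = rng.random(n) < flip_prob
    y = np.where(flips, -y_true, y_true)
    return X, y

X, y = sample_noisy_parity(n=1000, degree=5)
```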
So this image plots the true value of the parity function given two involving
coordinates. If both coordinates are equal to minus 1 or equal to 1, then
the true value is equal to plus 1. Otherwise the true value is equal to
minus 1.
It's pretty easy to verify that this distribution of positive and negative
samples cannot be separated by any linear classifier but it can be separated
by a two-layer neural net. So this is a classical example of showing the
limitation of linear classifiers and the power of neural nets. We want to
train a two-layer neural net for solving this problem. And because the loss
function is non-convex, we use multiple rounds of random initialization and
back-propagation to train the neural network.
And here is the result. If the degree of the parity function is equal to 2,
then the algorithm successfully learns the parity function by using two
hidden nodes in the only hidden layer and achieves the optimal classification
error of 0.1. But if the degree of the parity function is equal to 5, you
can see that no matter how many hidden nodes it's using, the algorithm fails
to learn the parity function and its classification error is always around 0.5,
which is the error of --
>>: [inaudible] try this problem using a rectifier [inaudible]?
>> Yuchen Zhang: So I have tried both sigmoid and rectifier.
>>: [inaudible].
>> Yuchen Zhang: Yeah, if I remember correctly I have tried the rectifier, and
they have similar behavior. So it means that on this more complicated
and more nonlinear function, back propagation doesn't really outperform
random guessing.
Now, this observation motivates us to think about the following question. If
we observe that a neural network fails on a particular task, we want to know
what is the true reason for the failure. There are two possible reasons. The
first one is that the neural network architecture is not powerful enough, so
that it cannot encode the function that we try to learn.
And the second possibility is that the learning algorithm is not good enough,
so that even though a good neural network exists, it cannot be learned by
the algorithm. And from the optimization perspective, the algorithm is trapped in
bad local minimums.
For this particular problem of learning parity function, we know that a
two-layer neural net is able to encode a parity function, so it must be
because of a bad learning algorithm. But for the more general and more
complicated problem, it's really hard to distinguish these two reasons, so we
don't really know what to improve if we observe a bad performance.
So this motivates us to think about provable algorithms. If we know that
the algorithm provably learns the model, then if we observe a bad
performance, we must modify the model to reduce the prediction error.
Instead, if we observe a slow algorithm, then we must modify the learning
algorithm to improve the running time. So in this way, the design of the
model and the design of the algorithm are clearly separated, which is a good
tradition in machine learning research.
>>: [inaudible] when you say [inaudible], is this a theoretical result that
you cannot reach that, or is it an empirical running [inaudible]?
>> Yuchen Zhang: Empirical result.
>>: But how do you know [inaudible] tricks?
>> Yuchen Zhang: So I'm sure that when you apply tricks, there is a pretty
good chance that you can improve that. So one point is that this is just an
empirical illustration for the motivation of the work. And you're absolutely
correct that there could be other possibilities. Yeah.
So more precisely there are three questions that we want to answer for
provable algorithms. The first one is under which assumptions we can show
that a neural network is efficiently learnable in polynomial time. And the
second question is, if we weaken these assumptions a little bit, then can we
show that no efficient algorithm is possible under some classical
assumptions. And third, we do want to know if this theoretical understanding
can help us to design a better algorithm that also works on real data.
So I hope that I have motivated that this is an interesting question. So here's
the outline of the talk. First we study a simple problem called learning a
linear classifier with a non-convex loss. And we'll find that the idea for
solving this problem will be useful for solving the more complicated problem
of learning neural networks.
Then we'll proceed to proper learning of neural networks and show some
algorithms along with their theoretical guarantees. We'll also talk about
improper learning, where the goal is to learn an arbitrary classifier which is
not necessarily a neural network, but we want its performance to be competitive
with the best possible neural net. We're also going to show an algorithm and
the corresponding theoretical result.
Okay. So let's start from learning linear classifiers. To formalize the
problem, let's assume that we are given a training set of [inaudible] and
every instance takes this feature-value pair form. The feature is a
D-dimensional vector and the value is either minus 1 or 1. So this is a
simplified binary classification problem.
The goal is to learn a vector W such that the [inaudible] loss function is
minimized. So this function H is an arbitrary function that measures the
classification loss on a single instance. Typically we would like to choose
H to be the step function so that this loss is exactly equal to the
classification error. But unfortunately we know that even approximately
minimizing this function is NP-hard.
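For reference, the objective under discussion is presumably of the form

$$\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} h\big(y_i \langle w, x_i\rangle\big),$$

where h is the step function (giving exactly the classification error) or, below, an L-Lipschitz surrogate of it; this rendering is a reconstruction from the spoken description.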
The hardness of this problem comes from the fact that the loss function is
not continuous, so it is reasonable to use some continuous approximation to
the step function and assume that H is L-Lipschitz continuous. But if L is
very large, then the problem is still challenging, because it can be proved
that minimizing this loss function in poly(L) time is still NP-hard. This is a
known hardness result. So it motivates us to assume that L is a constant that
doesn't grow with the sample size or the input dimension D.
So let's see some examples of loss functions that satisfy this condition. We
can consider a piecewise linear function which is equal to 0 if X is below a
negative threshold and equal to 1 if X is above a positive threshold. It's a
linear function in between.
We could also certainly consider some smoother approximations such as the
sigmoid function. And for both functions, their Lipschitz continuity is
characterized by this constant L.
So given a loss function, what is the most straightforward way of minimizing
this loss function? We have mentioned that a very natural choice is to first
do random initialization and then run gradient descent. For the simplicity
of the statement, let's assume that the minimizer of the loss function w* has a
unit norm, and we also assume that the data is normalized so that all of the
feature vectors have a unit norm or are contained in the unit ball.
The first step of the algorithm uniformly samples a random vector from the unit
sphere of the D-dimensional space, because we know that a correct solution
lies on the surface of the unit sphere. And then
we treat W0 as the initial point and run gradient descent or any optimization
algorithm to improve the loss function value. Eventually we'll get
[inaudible] W which is also contained [inaudible] and whose loss function
value is at least as good as the initial point. We do this multiple
times and choose the best vector that achieves the smallest loss function
value as the final output.
This is a widely used approach. The first initialization step is very
simple, and the optimization step is very flexible. But from a theoretical
perspective, it is problematic because the loss function is non-convex, and
the second optimization step does not guarantee that it will converge to the
global optimum w* unless the optimization -- unless the initial point W0
is very close to w*.
But what is the probability that W0 is very close to w*? Since we are
initializing in a high-dimensional space, this probability is extremely small.
And it's easy to verify that if you want to guarantee with high probability
that a good initialization appears at least once among the iterations, then
the number of iterations must scale as 1 over epsilon to the (D minus 1)th
power. So this is exponentially dependent on D and too expensive unless D is
very small. So the question we must answer is how to remove this exponential
dependence on D.
Surprisingly there's a very simple solution to the problem. We only have to
modify one place in the algorithm. That is, instead of uniformly sampling from
the unit sphere of the D-dimensional space, we sample from a larger sphere of
radius R, where R is [inaudible] greater than 1. And all the remaining
parts of the algorithm are not changed.
With this simple algorithm, there is an interesting guarantee. If you choose
the [inaudible] R following this formula, then the algorithm is
guaranteed to be approximately optimal with high probability. And the iteration
complexity is a polynomial function of N whose power only
depends on epsilon. So if epsilon is a constant, then this is a
polynomial-time algorithm in both N and D.
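A minimal sketch of the modified scheme follows; the sigmoid-style surrogate loss, the choice of R, and the optional `refine` hook (standing in for "run gradient descent or any optimization algorithm") are all illustrative assumptions, not the exact procedure from the slides.

```python
import numpy as np

def surrogate_loss(margin):
    # Any L-Lipschitz approximation of the step loss; a sigmoid of the
    # negative margin is one illustrative choice.
    return 1.0 / (1.0 + np.exp(margin))

def empirical_loss(w, X, y):
    return np.mean(surrogate_loss(y * (X @ w)))

def random_init_learner(X, y, R, num_trials, refine=None, seed=0):
    # Sample initial points uniformly from the sphere of radius R (R > 1),
    # optionally refine each one locally, and keep the best by loss value.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best_w, best_loss = None, np.inf
    for _ in range(num_trials):
        w0 = rng.normal(size=d)
        w0 *= R / np.linalg.norm(w0)      # uniform on the radius-R sphere
        w = refine(w0, X, y) if refine is not None else w0
        loss = empirical_loss(w, X, y)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```

Setting R = 1 recovers the original unit-sphere scheme; the guarantee in the talk is about choosing a larger R according to the stated formula.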
If you examine the theory more closely, it shows that the hyperparameter R
controls a tradeoff between the efficiency and accuracy of the algorithm. If you
choose a greater value of R, then you [inaudible] of epsilon. It means that
the algorithm will be less accurate, but since the running time has an inverse
dependence on epsilon, it also means the algorithm will run faster.
So this is a very simple initialization scheme which has some
theoretical guarantee. But it does have some limitations that we want to
overcome. Three of the limitations are: first, the output of the
algorithm will have a radius of R and is not necessarily contained in the
unit ball. But we know that the optimal solution is contained in the unit ball.
So it is not exactly a proper learning algorithm and may cause some
overfitting.
And, second, the constraints on w* and on all of the feature vectors are in
terms of the L2-norm and cannot be generalized to other Lp-norms. But, in
practice, in many cases we want to impose other norms like the L1-norm to
impose sparsity. This difficulty is because, in the proof of the
theorem, we use the Johnson-Lindenstrauss lemma, but this lemma only preserves
the Euclidean distance instead of arbitrary Lp distances.
And the third problem is that the iteration complexity is a polynomial function
of N with a relatively large power. So --
>>: Sorry. If we go back to this result, I mean, you're already rushing
ahead to the next iteration. So here, I mean, you're assuming that the
function is defined over the entire space but the optimum is on the unit
sphere.
>> Yuchen Zhang: Yes.
>>: And somehow that, you know, that assumption allows you to prove your
theorem, so somehow the basin of attraction around w* becomes bigger when you
go farther away. Something is hiding here. I mean, you somehow -- you know,
you are making some assumption. It cannot be just an arbitrary objective
function. I could choose any objective function; of course I can make the
points that are far away, some [inaudible] structure of [inaudible] but some
very, very delicate structure of local minima, you will always get stuck,
but somehow your assumptions are making the problem nicer. Can you shed
some more light on what is the key here that makes this, the going farther
away --
>> Yuchen Zhang: Okay. So one key assumption is that your objective
function has to be -- has to be continuous with a constant Lipschitz
constant. Another key assumption is that the data should be i.i.d.
So the intuition behind this theorem is that if you consider some optimal
solution w*, then this w* affects the loss function only through the inner
products between w* and these features. Right? So if you can preserve the
inner products, then you can approximate the optimize -- you can approximate
the loss function and even approximate the optimal solution.
And now we know that w* is in a high-dimensional space, and now we construct
a random subspace and just project w* onto this lower-dimensional random
subspace. Using the Johnson-Lindenstrauss lemma, you can show that after
this projection, if you scale this vector by another factor of R, where R is a
factor that is greater than 1, then the inner products will be preserved.
So there is a -- there is indeed a dependence on the number of samples,
because we know that as the number of samples increases, the dimension after
the projection will increase logarithmically as a function of N. But this is
only a logarithmic dependence.
So this factor R comes from the Johnson-Lindenstrauss lemma, where after doing
the projection you have to scale all of the vectors to preserve the distances
and preserve the inner products.
And why do we -- why we don't have to construct this random subspace in the
algorithm is because you do two consecutive steps. The first step is to
construct a random subspace, and then you draw from the unit sphere of the
low-dimensional space. But combining both is equivalent to directly drawing
from the original space but with a greater radius. Yeah. So the projection
step is hidden, but the effect of the projection is this greater radius.
>>: There's no assumption about the convexity of the loss function?
>> Yuchen Zhang: There is no assumption on the convexity.
>>: Regarding that this algorithm is always going to [inaudible] optimum.
>> Yuchen Zhang: Actually, it doesn't -- so there's no notion of convergence
because the optimization step is arbitrary. So it's just a [inaudible] and then
you can -- so in the simplest case you can directly output the initialization as
your solution. But in the first step you repeat it multiple times, and
there is a selection here. You use the loss function to choose the best one.
So this step is also very important.
>>:
The loss function [inaudible].
>> Yuchen Zhang: Yeah, so there is no convergence because it could be just a
random initialization and selecting the best solution. Selecting the best
sample. Okay.
So we know that this has some limitations, and they can be solved by a slightly
more complicated algorithm that constructs the initialization using a
least-square problem. For this more general algorithm we can assume more
generally that the weight vector w* has an Lp-norm that is bounded by 1 and all
of the feature vectors have an Lq-norm that is also bounded by 1, where P and Q
satisfy the usual condition on the [inaudible] norms.
The first step of the algorithm samples K i.i.d. instances from the
training set and then samples a vector U uniformly from the K-dimensional
hypercube.
And then it solves a least-square problem to compute a vector W0 in the
original D-dimensional space. Then W0 is treated as an initial point, and
then you can run any optimization procedure to improve it. And then we
repeat this procedure for T times and choose the best one using the loss
function and then define the output.
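A minimal sketch of this least-squares-based initialization follows; the unconstrained least-squares solve followed by projection into the unit ball is a simplification of the norm-constrained version described in the talk, and `empirical_loss` is assumed to be an empirical loss like the surrogate sketched earlier.

```python
import numpy as np

def ls_random_init(X, y, K, num_trials, empirical_loss, refine=None, seed=0):
    # Per trial: draw K training points, draw u uniformly from [-1, 1]^K,
    # and find a w0 whose inner products with those points best match u.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_w, best_loss = None, np.inf
    for _ in range(num_trials):
        idx = rng.integers(0, n, size=K)       # K i.i.d. draws from the training set
        u = rng.uniform(-1.0, 1.0, size=K)     # a point in the K-dimensional hypercube
        w0 = np.linalg.lstsq(X[idx], u, rcond=None)[0]
        nrm = np.linalg.norm(w0)
        if nrm > 1.0:                          # keep the candidate in the unit ball
            w0 /= nrm
        w = refine(w0, X, y) if refine is not None else w0
        loss = empirical_loss(w, X, y)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w
```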
So for this modified algorithm, there is a better theoretical guarantee. If
you choose K properly, the [inaudible] K properly, then the algorithm is
guaranteed to be approximately optimal with high probability, and the iteration
complexity is an exponential function [inaudible].
And comparing this theoretical result with the previous one, you can see that
there are three improvements. First, the algorithm always returns a vector
in the unit ball. So it is a proper learning algorithm. And, second, it
applies to arbitrary Lq-norms. And it has a better iteration complexity
because it replaces the dependence on N by a dependence on the
universal constant e.
The proof of this theorem is even simpler. We first define an empirically
based -- a sample-based empirical loss function G which only depends on the
random samples. And then, using classical learning theory, you can easily
show that minimizing G will approximately minimize the original loss function
L.
On the other hand, it is easy to see that G of W for any vector W is uniquely
determined by the vector of inner products, phi of W, which is a K-dimensional
vector. And using the norm constraint, it is also easy to verify that phi of
w* is contained in a K-dimensional hypercube. So if you draw a vector U
uniformly from the K-dimensional hypercube, then no matter where w* is, U will
be sufficiently close to phi of w* with probability on the order of epsilon to
the Kth power, because this is a K-dimensional space.
Now, if you assume that U is sufficiently close to phi of w*, then we can
solve this least-square problem. This will give you a vector W0 such that
phi of W0 is close to U, and we know that U is close to phi of w*. So W0 is
almost equivalent to w* in minimizing the loss function. And we get a good
initialization with probability epsilon to the Kth power. And we can repeat
this multiple times to guarantee the high-probability bound. So the
number of iterations is 1 over this quantity. And plugging in the value of K
into the expression, you will see the iteration complexity exactly matches the
theoretical claim. Okay.
So this is our algorithm for learning linear classifiers without any
convexity assumption. Now let's proceed to the learning of neural networks.
Before describing the algorithm, let's first define the function class that we
want to learn. If the neural network has only one layer, then it is exactly a
linear function. Right? So it reduces to the problem that we have just
studied.
If the neural net has multiple layers, then for the m-layer
neural nets we recursively define it as a function mapping from the feature
space to a real number, a positive number in the case that it is the positive
class, otherwise the negative class.
And this function is a linear combination of several components, and every
component depends on the lower-level neural nets, so these are (m-1)-layer
neural nets.
The activation function is arbitrary, but we assume that it should be
1-Lipschitz continuous. It is a condition that is easy to satisfy. And the
number of hidden layers -- the number of hidden nodes used to construct the
m-th layer is assumed to be finite, but otherwise it could be arbitrarily
large. But we make this additional assumption that the L1-norm of all of the
incoming weights, the combination coefficients, is bounded by a constant B.
And B is independent of the sample size or the input dimension.
So this might be the strongest assumption of this work. But it makes sense
in practice because the L1-norm constraint imposes some sparsely connected
neural network. And we know in practice that the sparsely connected neural
network is able to represent some meaningful functions that characterize the
real data.
For example, the convolutional neural net is sparsely connected, and the
sparsity of the connection could be independent of the scale of the image,
only depending on the size of the sliding window and independent of the
cardinality of the training set. And we also know that imposing the L1-norm
constraint in practice will improve the robustness of neural network
training.
So given this function class, we also define the objective function as the
empirical loss also characterized by the single instance loss function H, and
F is the function that we want to learn. So it belongs to a neural network
class. If the number of layers is equal to 1, then F is a linear function.
So it reduces to the linear classifier learning problem. Otherwise F is a
more interesting nonlinear function.
So we saw in the previous slides that the random initialization scheme can
help to learn a linear classifier even if the loss function is non-convex. So
a natural question is: can the same idea be applied to learning multi-layer
networks? Right? And the answer is yes.
And here I'm going to present a generalization of the second algorithm for
learning linear classifiers to learning multi-layer neural networks. The
first step is still to sample K i.i.d. instances from the training set.
In the second step, we generate a random neural network to treat as the
initialization, and we do it in a recursive way. If the number of layers to be
generated is equal to 1, then we do exactly as when generating a linear
function, right, because this is a linear function. So we randomly sample a
vector U from the K-dimensional hypercube, compute W0 by solving the
least-square problem, and construct the linear function using W0.
If the number of layers to be constructed is greater than 1, then we
construct this m-layer neural net recursively. Because this is a recursive
program, we first generate a sequence of lower-level neural nets called G1
through GS, where S is also a hyperparameter of the algorithm, following the
same program; then we solve another least-square problem to compute W0.
Then we construct this m-layer neural net F0 using W0 and all of these
(m-1)-layer neural nets.
If we compare these two least-square problems, you'll see that the only
differences are in the blue terms. In the first case, the blue term is the
feature vector, and in the second case it is the output of the (m-1)-layer
neural networks. So it means that for constructing the weights of the m-th
layer, we use the outputs of the lower-level neural networks, combined with
the activation function, as the feature vector. And this is intuitive.
So we can recursively construct the weights of the neural network from the
bottom to the top. And we treat this neural network as the initial network,
run back propagation or any optimization algorithm to improve it until we get
another neural network F, which is also in the function class and whose loss
function value is at least as good as the original neural net. And we repeat
this procedure T times and select the best neural network that achieves the
smallest loss function value.
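A compact sketch of the recursive construction just described follows; passing the K sampled points directly, using tanh as the activation, and omitting the L1-norm normalization of the fitted weights are simplifications made for illustration.

```python
import numpy as np

def random_net_init(X_sample, u, m, S, sigma, rng):
    # Recursively build a random m-layer network whose outputs on the K sampled
    # points are fitted (by least squares) to the random target vector u.
    if m == 1:
        # Base case: a linear function, exactly as in the linear-classifier scheme.
        w0 = np.linalg.lstsq(X_sample, u, rcond=None)[0]
        return lambda X: X @ w0
    # Recursive case: generate S lower-level (m-1)-layer networks ...
    subs = [random_net_init(X_sample, rng.uniform(-1, 1, len(X_sample)),
                            m - 1, S, sigma, rng) for _ in range(S)]
    # ... and use their activations on the sampled points as the new features.
    H = np.column_stack([sigma(g(X_sample)) for g in subs])
    w0 = np.linalg.lstsq(H, u, rcond=None)[0]
    return lambda X: np.column_stack([sigma(g(X)) for g in subs]) @ w0

# Example: one random two-layer initialization built from K sampled points.
rng = np.random.default_rng(0)
K, d = 20, 10
X_sample = rng.normal(size=(K, d))
u = rng.uniform(-1, 1, size=K)
f0 = random_net_init(X_sample, u, m=2, S=5, sigma=np.tanh, rng=rng)
```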
>>: So you do this [inaudible] but in practice you'll be much slower than
[inaudible].
>> Yuchen Zhang: In practice we find that running gradient descent for the
optimization step is much better than just directly outputting the
initialization. So theoretically, we don't need this optimization step.
We're going to see that there are still some polynomial time guarantee but in
practice indeed optimization helps a lot. And it is an open problem, how to
understand the role played by optimization.
>>: I think the question is how fast or slow does [inaudible] just running
the --
>>: [inaudible].
>>: [inaudible].
>> Yuchen Zhang: Computationally-wise, comparing to what?
>>: To just --
>>: [inaudible] batch, for example, the mainstream method.
>>: [inaudible] step, right?
>> Yuchen Zhang: So on one hand, if you only consider computation, then on
the specific class of problems where stochastic gradient descent works,
there's totally no reason to use another algorithm which is -- even if it is
comparable to stochastic gradient descent in efficiency, we already know that
stochastic gradient descent works. But for some other problems -- for
example, we will show later that for learning parity functions there's
no known way to make stochastic gradient descent work -- we're going to
show that this method, combined with some other techniques we're going to
introduce later, will succeed easily in learning that complicated function.
>>:
[inaudible].
>> Yuchen Zhang: Yeah. So this is one algorithm, but [inaudible]. Okay.
So because the algorithm is a generalization of the one for learning linear
classifiers, the theoretical result is also a generalization of what we have
[inaudible] before. If you properly choose the hyperparameters K and S, then
the algorithm is guaranteed to be approximately optimal with high probability.
And the iteration complexity is an exponential function of 1 over epsilon. So
if M is equal to 1, then this theoretical result reduces to the theory for
learning linear classifiers. Exactly. If M is greater than 1, then as long
as the target optimality gap, the number of layers to fit, the L1-norm
constraint and the Lipschitz constant of the loss function are assumed to be
constant, then this is a polynomial-time algorithm in N and D.
And importantly, I'd also like to emphasize that we haven't made any assumption
on the data distribution. So this is a purely agnostic learning algorithm,
and this will be important for the future improvement.
>>: Could you mention some more intuition about the least-square problem that
you are solving [inaudible] you first presented a linear classifier, and then
you used it for the neural network portion. So if you go to the previous
slide, the step --
>> Yuchen Zhang: This one?
>>: It would be the first step under 2, when you draw this uniform vector
U from the hypercube, basically when you're solving this least-square
problem, you're completely ignoring the Y prime, basically the labels.
>> Yuchen Zhang: Yeah, exactly. So this is -- yeah, this is a very good
point. So all of this initialization is performed in an unsupervised manner. We
don't use any information in Y. We only explore the structure of the input
space.
>>: [inaudible] give some intuition about this? Because if data is well
separated, right, this blind assignment of, you know, labels U plus minus 1
to [inaudible] will all this -- I mean, I can't see how -- it's not like one
of the random vectors U approximates the structure or anything, right?
Because it is well separated and it's an easy learning problem, then what's
going to happen here?
>> Yuchen Zhang: Yeah, I'll try to explain. So minimizing a non-convex
function is very hard, no matter whether it's a low-dimensional problem or a
high-dimensional problem. And the only difference is that in a low-dimensional
problem you can just randomly try and evaluate. But in a high-dimensional
space, it is exponentially expensive to randomly try.
And we don't -- essentially we don't know how to do optimization, so what we
can do is to just randomly sample. And so this view -- so the purpose of
solving this least-square problem is that we assume that U represents the -- a
mapping of the optimal solution, and U is in a low-dimensional space.
So solving this -- the purpose of solving this least-square problem is to
decode the structure of the low-dimensional space and map it back to the
original high-dimensional space. And we know that -- so we know that if your
solution is close to the optimal solution in the low-dimensional space, then
after mapping back, they could be far apart, but their performance in
minimizing the loss function is still similar.
So the point here is that in a high-dimensional space, you cannot be close to
the optimal solution, but you can be close to it in terms of performance. So
you can do it by sampling, because the performance only depends on the
structure of the low-dimensional space.
Okay. And now here's the key part. So so far we have an algorithm which has
exponential dependence on 1 over epsilon. This is good if epsilon is assumed
to be a constant, but if you have a large number of samples, you want an
error to be diminishing. So you want epsilon to decrease to 0 as a function
of N. But in that way, it will be too expensive because of the exponential
dependence.
So a natural question is how can we improve it to be a polynomial-time
algorithm given some reasonable assumptions. We need additional
assumptions because it can be proved -- it is already proved in the paper --
that it is impossible to improve the exponential dependence to polynomial in
the worst case. It is an NP-hard problem. But it is possible under some
reasonable assumptions, so what assumptions do we need?
So the inspiration comes from the fact that learning a linear classifier with
0-1 loss is NP-hard, but this problem becomes easy if you know that the
data is linearly separable. And in this way you can formalize the learning
problem as a linear programming problem, which is known to be solvable in
polynomial time.
So we're asking if the same intuition holds for learning neural networks. If
you know the distribution of data is separable by some neural network, then
does it mean that it is easier to learn a neural network that actually
separates the data?
Before answering this question, let's first define a notion of separability.
We say that a dataset is [inaudible] separable if there exists some unknown
neural network called F* such that for any instance in the training set,
Y times F* of X is lower bounded by gamma. So gamma is the margin of
classification. And a distribution is called gamma separable if any random
sample satisfies this gamma separability almost surely.
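Written out, the condition being described is

$$y_i\, F^*(x_i) \;\ge\; \gamma \quad \text{for every training pair } (x_i, y_i),$$

and the distribution-level version requires this to hold almost surely for a random sample.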
Now, with this notion of separability, we can show the following theoretical
result. Assume that the number of layers, the separability margin, the
L1-norm constraint and the Lipschitz constant of the loss function are
constant; then there is an algorithm such that on any gamma separable dataset
it trains a neural network F hat that correctly classifies every data point
with a margin on the order of gamma in polynomial time.
And, furthermore, we can show that on any gamma separable distribution, it
learns a classifier that achieves an epsilon generalization error using a
polynomial number of samples and running in polynomial time. So we can see
here that both the sample complexity and the time complexity have a polynomial
dependence on 1 over epsilon. So we have improved the exponential dependence
to a polynomial dependence by assuming the separability condition.
>>:
[inaudible].
>> Yuchen Zhang: It's exponential. So we have to assume that the gamma is a
constant. And we also have hardness results showing that it is impossible to
improve it to polynomial even if you assume the separability. But under
these assumptions, it is also worst case hardness.
>>: So back -- the second that you made, is it for this algorithm or for any
algorithm in general?
>> Yuchen Zhang: No, I'm going to present a specific algorithm called
BoostNet [inaudible]. Yeah. So the algorithm is actually running AdaBoost
to construct the m-layer neural net. In the first -- so it is iterative.
And in the first step of any iteration, it uses the standard AdaBoost
technique to weight the samples according to the existing classifier.
And then it trains an (m-1)-layer neural network called G which achieves a
classification error that is at most gamma divided by 2B worse than the
best possible neural network.
And then in the third step it combines this (m-1)-layer neural net into
the stronger classifier to construct the m-layer neural net. And of course
finally we may need to do some normalization to ensure that the neural net
belongs to the function class we defined.
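Here is a minimal AdaBoost-style sketch of the construction just outlined; `weak_learner` stands in for the (m-1)-layer training step, the coefficient formula is the textbook AdaBoost choice, and the final normalization into the function class is omitted.

```python
import numpy as np

def boostnet(X, y, weak_learner, num_rounds):
    # weak_learner(X, y, sample_weights) is assumed to return a +/-1 predictor,
    # e.g. an (m-1)-layer network trained on the weighted data.
    n = len(y)
    weights = np.full(n, 1.0 / n)
    hypotheses, coeffs = [], []
    F = np.zeros(n)                           # current top-layer additive score
    for _ in range(num_rounds):
        g = weak_learner(X, y, weights)
        pred = g(X)
        err = np.sum(weights * (pred != y)) / np.sum(weights)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        beta = 0.5 * np.log((1 - err) / err)  # standard AdaBoost coefficient
        hypotheses.append(g)
        coeffs.append(beta)
        F += beta * pred
        # Re-weight: points with a negative margin (misclassified) get more weight.
        weights = np.exp(-y * F)
        weights /= weights.sum()
    def classifier(X_new):
        scores = sum(b * g(X_new) for b, g in zip(coeffs, hypotheses))
        return np.sign(scores)
    return classifier
```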
So I'd like to share three insights into why this algorithm works in achieving
this theoretical result. The first insight is the equivalence between
separability and weak learnability. More precisely, it says that if the
data is gamma separable, then for any re-weighting of the data there exists
some neural network called g* which achieves a classification error bounded by
1/2 minus gamma divided by B.
And recall that 1/2 is the error of random guessing. So it
means that g* is a nontrivial classifier whose classification error is
bounded away from random guessing.
Now, please look at the second step of the algorithm. This trains the neural
network g whose classification error is at most gamma divided by 2B worse
than g*. So the classification error of g is upper bounded by 1/2 minus
gamma divided by 2B, and this is still bounded away from 1/2, which is
sufficient for AdaBoost.
On the other hand, the [inaudible] of g is assumed to be gamma divided by 2B,
which is assumed to be a constant because gamma and B are assumed to be
constant. So we can use the agnostic learning algorithm we just developed in
this talk to implement the second step. And this is a polynomial-time
algorithm in terms of N and D.
>>: Question. How does the alpha weighting play into the training on that?
>> Yuchen Zhang: So the role played by alpha is that you first train an
(m-1)-layer neural network and evaluate it. And if some sample is correctly
classified by the first neural network, then its weight will decrease.
Otherwise, the weight will increase.
So you are focused on the hard samples; you always focus on the hard samples
given the existing classifier. So that is the intuition of boosting: to train
incrementally a sequence of classifiers where each one just focuses on the
hard examples at the present stage.
>>:
[inaudible] how is that done [inaudible]?
>> Yuchen Zhang: So you first have this function F which is an m-layer
neural net. Right? And given this F, you go to the next iteration and you
assign alpha to be proportional to this. So you can see that -- okay, I can
see here. So there is a minus sign here. If F times YI is positive, it
means that you have a correct classification and the weight will be small.
Otherwise the weight will be large. So, yeah, this is the specific way how
it is implemented.
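In other words, the re-weighting being pointed to on the slide is presumably the standard AdaBoost form

$$\alpha_i \;\propto\; \exp\big(-y_i\, F(x_i)\big),$$

so a correctly classified point (positive margin) gets a small weight and a misclassified one gets a large weight.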
>>:
[inaudible] loss function?
>> Yuchen Zhang: The actual loss function, you can consider it exponential
loss. That is the standard way of analyzing AdaBoost. But here we are
interested in classification error. So because the -- because the
exponential loss is an upper bound on the 01 loss, you can transfer the
convergence on the exponential loss to the convergence on the 01 loss.
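The bound being used here is the standard one,

$$\mathbf{1}\{\,y\,F(x) \le 0\,\} \;\le\; e^{-y\,F(x)},$$

so driving down the exponential loss also drives down the 0-1 error.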
>>: [inaudible] loss function embedded [inaudible].
>> Yuchen Zhang: Sorry?
>>: On the alpha weighted data [inaudible] weight the data.
>>: Yeah, yeah, I know, but -- okay. I see. So each of those terms. So
are these weights like additive weights? What are those? Are they
multiplicative at each of the loss terms?
>> Yuchen Zhang: Oh. So if you have alpha-weighted data, then alpha is
multiplied into every loss on the single instances. And all the algorithms
that were presented before can be generalized easily to weighted data. So
that is not a problem. And now we have this weak classifier which can achieve
a nontrivial classification error; then the standard theory of AdaBoost
guarantees that after on the order of 1 over gamma squared iterations the
classifier learned will correctly classify every data point with a margin on
the order of gamma. So this establishes the first claim of the theorem.
And the second claim follows as a direct consequence of the first claim,
because if the first claim can be established with this constant margin, then
you can show that the generalization error of the same classifier will be
upper bounded by epsilon if the number of training instances scales as 1 over
epsilon squared. So the sample complexity has this polynomial dependence on 1
over epsilon. And since the running time also has a polynomial dependence on
the sample complexity, the running time also has a polynomial dependence
on 1 over epsilon.
Now, given the theoretical guarantees for the BoostNet algorithm, let's see
how it performs in practice. And we revisit this problem of learning parity
functions where we know that back propagation fails. So for learning a
parity function of degree 2, both algorithms learn the correct parity
function and achieve the optimal classification error.
But the BoostNet algorithm incrementally constructs the hidden nodes, and it
takes five of them until it achieves the optimal rate. So it is slightly
less efficient than back propagation. On the other hand, if the degree of
the parity function is equal to five, then we know that back propagation is
roughly equivalent to random guessing. But BoostNet is able to learn the
correct parity function by incrementally constructing less than 50 hidden
nodes.
>>:
[inaudible] do gradient descent after the random [inaudible]?
>> Yuchen Zhang: Yeah, we do do gradient descent. If you don't do gradient
descent, it's much, much slower than this. Yeah. Okay.
>>:
And probably there are few [inaudible].
>> Yuchen Zhang:
No, no, we don't do any careful tuning, just [inaudible].
>>: [inaudible] like step 3 where you just go [inaudible] reduce the
[inaudible].
>> Yuchen Zhang: Yeah. Yes.
>>: [inaudible].
>> Yuchen Zhang: Yes.
>>: So why does the parity function satisfy the assumptions? You need gamma
separability, right?
>> Yuchen Zhang: So actually there are two points. The first point is
that it doesn't satisfy the gamma separability if there is noise. But
because of the symmetry of the noise, it roughly satisfies the same structure;
this is the first point. And these are empirical results. So it just shows
that the algorithm also works in the case where the assumption is not
exactly satisfied.
And the second point is that in the paper we do have an extension of the
algorithm that can work in this case with theoretical guarantee. So with
some corrupted version of the separability which is satisfied by the parity
function. So it is in the paper, but not covered by the talk.
So we know that both algorithms learn the same architecture, the two-layer
neural net, but the second algorithm learns a better neural net. And from the
perspective of optimization, it means that BoostNet is harder to trap in a bad
local minimum, because if it gets trapped, then you add another node. So it is
an incremental algorithm and it's easier to get out of the bad local minimum,
and that is the intuition why the empirical result is better.
>>: So why would I choose a BoostNet over like a boosting with just neural
networks? You know? Is there a -- is there a --
>> Yuchen Zhang: Yeah, it's actually -- that is a great question. So I
think the intuition that is provided by the experiment is that you do want to
choose boosting over neural networks to construct a deeper neural network.
You can even do this recursively to -- like this is just a boosting one
layer, but you can boost in many layers.
But theoretically the intuition that this work shows is that instead of using
boosting alone, you also have to do very careful initialization, and in
practice we also know that initialization is important. Right? So the
initialization combined with boosting can achieve this polynomial rate, but
either one of them alone cannot.
>>: [inaudible] have you tried experiments just with boosting on the neural
networks?
>> Yuchen Zhang: [inaudible] tried using boosting but less careful
initialization. And it can still achieve something like that. So it shows
that empirically, at least for this problem, it is boosting that is more
important to the good performance.
So finally let's talk about some improper learning algorithms. As we have
mentioned, the goal of improper learning is to learn some classifier that is
not necessarily a neural network, but we want a generalization error to be
compatible with the -- competitive with the best possible neural network. So
it is at most epsilon worse than the best possible neural net. So this is
the target.
And how do we do this? The main idea is that we define another function
class called F and define F hat as the empirical risk minimizer of everything
in this class for the loss function. And we should define this function
class satisfying the following three conditions. The first condition is that
the empirical risk minimizer should be easy to compute. So it should be
computable in polynomial time. Otherwise there's no reason to choose another
methodology.
And the second condition is that this function class should be powerful enough
to contain the neural networks. And, third, it shouldn't be too large;
otherwise, it is too easy to overfit. It is not too large in the sense
that with a polynomial sample complexity the generalization error of the
empirical risk minimizer can be controlled by the best possible
generalization error within this function class plus epsilon.
And combining this inequality with the second property, we can show
that -- so it is actually easy to see that the best possible
generalization using F is upper bounded by the best possible generalization
using the function class of neural networks. So combining these
inequalities, we achieve the target.
Now, the only remaining problem is how to find a good function class that
satisfies the three conditions, and our solution is to use the kernel method.
We define a sequence of kernels. The zeroth-order kernel is defined as the
inner product between the two inputs, and the p-th order kernel is defined as
a function of the (p-1)-th order kernel.
So given an inner product, all of these kernels can be easily computed.
So we choose P to be equal to the number of layers that we want
to fit. And it can be verified that the resulting K is a valid kernel function, so it
induces a reproducing kernel Hilbert space.
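To make the recursion concrete, here is a small sketch of a stacked kernel (editor's illustration; the transcript does not specify the actual composition function from the paper, so g(t) = exp(t - 1) below is only a stand-in that happens to preserve positive semidefiniteness):

```python
# Minimal sketch of a recursively defined ("stacked") kernel. The composition
# function used in the paper is not given in the transcript; g(t) = exp(t - 1)
# is a stand-in that keeps each level a valid (positive semidefinite) kernel.
import numpy as np

def base_kernel(X, Y):
    """K^(0): plain inner product between rows of X and rows of Y."""
    return X @ Y.T

def stacked_kernel(X, Y, depth):
    """K^(p) obtained by applying g elementwise to K^(p-1), for p = 1..depth."""
    K = base_kernel(X, Y)
    for _ in range(depth):
        K = np.exp(K - 1.0)          # hypothetical composition step g
    return K

# Usage: depth matches the number of layers one wants to fit.
X = np.random.default_rng(0).standard_normal((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs keep values bounded
K = stacked_kernel(X, X, depth=2)
```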
If we treat these phi functions as the basis functions of the reproducing
kernel Hilbert space, we can define a function class of linear
combinations of the basis functions, so F is still in the reproducing kernel
Hilbert space, but we impose the constraint that the Lq-norm of the combination
coefficients is upper bounded by a constant that only depends on the number
of layers to fit and the L1-norm constraint on the neural net.
So if you define a function class in this way, you can show that it contains
the neural network class. On the other hand, you can also show that this
function class is not too large in the sense that the Rademacher complexity
of the function class with N random samples is bounded by the same constant
divided by square root of N.
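In symbols, the complexity claim just stated is (editor's restatement; C(P, B) is shorthand for the constant that depends on the depth P being fitted and the L1-norm bound B, and R_n denotes the Rademacher complexity over n random samples):

```latex
\mathcal{R}_n(F) \;\le\; \frac{C(P, B)}{\sqrt{n}} .
```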
So this satisfies the third condition of the previous slide, and that satisfies
the second condition of the previous slide. Now, the remaining problem is
how to compute the empirical risk [inaudible].
>>: [inaudible] how did you arrive at this kernel?
>> Yuchen Zhang: How did I arrive at this kernel?
>>: [inaudible].
>> Yuchen Zhang: So there is some intuition indeed. There is a very seminal
paper by [inaudible] and others from several years ago, and they are
talking about learning a linear classifier with a non-convex loss. And they are
using a kernel that is similar to this.
And it is a very interesting observation that if you stack these kernels,
then it will be strong enough to approximate a neural network. Because the
neural network is just recursively defined as a stack of linear combinations --
>>:
[inaudible].
>> Yuchen Zhang: This one?
>>: Yeah.
>> Yuchen Zhang: So the intuition behind this is that you can expand this as
a sequence of -- as a sum of polynomial functions. So these basis functions
are actually polynomial functions. And you can show that the neural network
class has a polynomial expansion and the coefficients of that expansion
satisfy this constraint on the Lq-norm.
>>: So you don't use a Gaussian kernel?
>> Yuchen Zhang: I don't use a Gaussian kernel. And indeed a Gaussian
kernel is not strong enough to cover the neural network. Although the
practical performance of a Gaussian kernel could also be good.
>>: [inaudible] contain an M for any sigma?
>> Yuchen Zhang: No. So I'm going to talk about the condition on sigma. So
there is a condition -- the only condition on sigma is that it should be
smooth, sufficiently smooth in some sense. But I'm going to show some
examples of sigma that satisfy this condition, and they are very, very close
approximations to the standard activation functions that people use in
practice.
>>: [inaudible] also have this multiple deep kernel. Is this similar to this?
>> Yuchen Zhang: So I may have read that paper. And I think there -- so
there are some similar ideas using a stacked version of kernels to
approximate the neural network. And it is known that the kernel idea has a
connection to neural networks. So one contribution here is the
theoretical justification for that connection -- a more rigorous
justification of why they're connected.
>>: This special kind of recursive kernel to do all the [inaudible].
>> Yuchen Zhang: Yes. Exactly.
>>: Using [inaudible] the stacked kernel [inaudible].
>> Yuchen Zhang: I guess not. But I'm not sure because I need to look more
carefully into the paper. Okay. So the efficient computation is also easy
because it is a kernel method. And by using the representer theorem, we know
that the empirical risk minimizer can be represented as a linear
combination of kernel functions, so it suffices to learn the combination
coefficients. And this can be written in the standard way as a convex
optimization problem that can be solved efficiently. So --
>>: [inaudible].
>> Yuchen Zhang: So this is a convex constraint.
>>: H?
>> Yuchen Zhang: H? Where is H?
>>: H is the --
>> Yuchen Zhang: Oh. So H is a convex function. So H is the final -- is
the final penalty on the output of the neural net. So you can choose, for
example, the hinge loss. And if you minimize the hinge loss, you get an
upper bound on the 0-1 loss.
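To illustrate the kind of convex problem being described -- hinge loss over a kernel expansion with an RKHS-norm constraint -- here is a minimal projected-subgradient sketch (editor's illustration; the kernel matrix K, the bound B, and the step sizes are placeholders, and the paper's actual solver may differ):

```python
# Minimal sketch (editor's illustration): learn coefficients alpha of a kernel
# expansion f(x) = sum_i alpha_i K(x_i, x) by minimizing the average hinge loss
# subject to the RKHS-norm constraint alpha^T K alpha <= B^2.  Projection onto
# that ball is just a rescaling.  Inputs: Gram matrix K, labels y in {-1, +1}.
import numpy as np

def project_rkhs_ball(alpha, K, B):
    norm = np.sqrt(max(alpha @ K @ alpha, 0.0))
    return alpha if norm <= B else alpha * (B / norm)

def kernel_erm_hinge(K, y, B=1.0, steps=500, lr=0.01):
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(steps):
        margins = y * (K @ alpha)
        active = margins < 1.0                      # examples inside the hinge
        grad = -(K[:, active] @ y[active]) / n      # subgradient of the average hinge loss
        alpha = project_rkhs_ball(alpha - lr * grad, K, B)
    return alpha
```

Because both the objective and the constraint set are convex, any standard convex solver would do; the subgradient loop above is only the simplest self-contained choice.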
>>: How many times do you need to recurse, redefine the kernel?
>> Yuchen Zhang: So it depends on how deep the neural net we want to --
>>: [inaudible].
>> Yuchen Zhang: Yeah. Yeah, it is the same as the depth that you want to fit.
>>: Yeah, I have some kind of [inaudible] -- the intuition is that if you
use this Gaussian kernel you effectively can get [inaudible] size of the --
but using this [inaudible], do you have that property?
>> Yuchen Zhang: Infinite number of -- yeah.
>>: Yes. So -- [inaudible].
>> Yuchen Zhang: Oh. So actually this can -- so this can approximate any
neural network in the class of neural networks that I defined on an earlier
slide. So if you recall the definition, the number of nodes in that function
class can be arbitrarily large. So it could even be infinite in the limit.
And the problem of the Gaussian kernel actually is that the Gaussian kernel
can fit any function that is sufficiently smooth, but that condition on the
smoothness is very, very strong if the dimension is high. But the neural
network class contains functions that don't satisfy the smoothness
condition.
>>: So just trying to understand the relationship between the number of
hidden nodes and the level of the recursion. So you mentioned the level of
recursion should be the same as the depth. But how does the number of hidden
units and layers --
>> Yuchen Zhang: It is not related, but it is related to the L1-norm constraint.
>>: I see.
>> Yuchen Zhang: Right? Because B appears in the algorithm. So it turns
out it is a well-known fact that if you impose an L1-norm
constraint, then the complexity of your function class can be independent
of the number of nodes. Yeah. So the L1-norm constraint is also a strong
constraint which can control your complexity.
So putting all the pieces together, the theoretical guarantee is that if you
assume M and B are constant and sigma is smooth enough -- I'm going to show
what smooth enough means -- then the trained classifier can be computed in
polynomial time and the corresponding sample complexity for achieving an
epsilon generalization error is also polynomial.
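Putting the guarantee just stated into one line (an editor's paraphrase; the exact polynomial exponents are on the slide, not in the transcript):

```latex
% If M, B = O(1) and \sigma is sufficiently smooth, then
\text{training time} = \mathrm{poly}(n, d)
\quad\text{and}\quad
n(\epsilon) = \mathrm{poly}(d, 1/\epsilon)
\;\text{ samples suffice for }\;
L(\hat{f}) \le \min_{g \in N} L(g) + \epsilon .
```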
You can compare this result with the theoretical guarantee for BoostNet.
You're going to see that it doesn't need any separability condition on the data
distribution. So it means that by moving from proper learning to improper
learning, you're considering a wider class of algorithms, and now you can
achieve a similar learnability result with a weaker assumption on the
distribution.
The only additional assumption is the smoothness condition on sigma, for which
I'm going to show two examples. So you can consider an erf function that
satisfies the constraint, and it's a very close approximation to
the standard sigmoid function. You can also consider a smoothed version of
the hinge loss, which is a smooth approximation to the standard ReLU
function.
So for both the standard sigmoid and the standard ReLU function, you can find
some very close approximation. So this approximation is one of them, but it can
be even closer as long as the smoothness constant is a
universal constant. So for both standard functions, you can find some
approximation that is close and satisfies the constraint of the theory.
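As a concrete, editor-supplied illustration of these two smooth surrogates -- an erf-based sigmoid-like unit and a smoothed hinge that approximates ReLU -- here is a small sketch; the scaling constants c and delta are placeholders, not the values used in the paper:

```python
# Minimal sketch (editor's illustration): smooth activations of the kind
# mentioned here.  The exact scalings from the paper are not in the
# transcript, so the constants below are only illustrative.
import numpy as np
from scipy.special import erf

def erf_sigmoid(x, c=1.0):
    """A smooth, sigmoid-like activation built from the erf function."""
    return 0.5 * (1.0 + erf(c * x))

def smoothed_hinge(x, delta=1.0):
    """Smoothed hinge: zero below 0, quadratic on (0, delta), linear above; approximates ReLU."""
    return np.where(x <= 0, 0.0,
           np.where(x >= delta, x - delta / 2.0, x**2 / (2.0 * delta)))

# Compare against the standard activations on a grid.
xs = np.linspace(-4, 4, 9)
sigmoid = 1.0 / (1.0 + np.exp(-xs))
relu = np.maximum(xs, 0.0)
print(np.max(np.abs(erf_sigmoid(xs) - sigmoid)))   # rough gap to the sigmoid
print(np.max(np.abs(smoothed_hinge(xs) - relu)))   # rough gap to ReLU
```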
And finally we also have some simple experiments to test the algorithm on the
MNIST dataset. Because the original MNIST digit recognition task is
relatively easy, we also compare on some variations of MNIST, including
randomly rotating [inaudible] by some random angle, or adding a noisy
background into the image, or combining the rotation and the noisy
background.
So, roughly speaking, the experiment says that the kernel method outperforms
logistic regression and the back-propagation-trained two-layer neural
network. But it is not -- so it is comparable with the convolutional neural
net on the basic and rotation datasets but not as good on the other two datasets.
This is also expected because for the kernel method we just design a generic
kernel instead of using any property of the recognition problem. But the
convolutional neural net indeed uses the domain knowledge. Another observation
is that if you increase the hyper-parameter M, it means that the model will be
able to fit deeper neural networks, and the actual performance also gets better in
all cases. And this is also intuitive because you can expect that the performance
of a deeper neural network can be slightly better on the MNIST dataset.
Okay.
So as a summary of this talk, we have studied provable algorithms for learning
constant-depth neural networks in polynomial time. The take-home message is
that for proper learning, there are two conditions needed. The first one
is that the incoming weights of all the neurons have an L1-norm bound which is
a constant.
And the second condition is the separability condition on the data
distribution. And under these conditions, you can show that the neural
network can be learned in polynomial time. And for improper learning, we
also have two conditions. The first one is still the L1-norm constraint on
the neural network, and the second one is that the activation function is
sufficiently smooth.
So in summary, we know that learning neural networks is a challenging problem.
It is NP-hard in the worst case, but it could be easy if you use -- if you
add reasonable assumptions and use novel and principled algorithms. You can
get theoretical guarantees.
Okay. So understanding deep learning is a very challenging field, and
there are certainly many open problems. And this work is only an early step in
exploring, searching for algorithms that are both understandable in theory
and also useful in practice. And there are certainly many open problems --
well, at least these three.
So the first open problem is that we know that in our algorithm some
quantities are assumed to be constant. And this is because the running time
of the algorithm has exponential dependence on those quantities. But on the
other hand, we also know that some exponential dependence can be improved to
polynomial dependence if you consider some additional constraint on
the data distribution. Right? So it is very promising to consider some
further properties of the real data distribution to further reduce the
complexity of the algorithm and make it fully polynomial.
And the second point is that it is a little weird to observe that the
theoretical guarantee still holds even if your optimization algorithm does
nothing -- it just outputs the initialization. So it's very interesting to
understand the role played by the optimization algorithm. And if you can have
that understanding, it will definitely help the design of the algorithm,
because we know that in practice the optimization step helps a lot.
And the third point is that the essential difference, as we have noticed,
between BoostNet and traditional back propagation is
that BoostNet incrementally constructs a neural network, while back
propagation trains a fixed architecture. But although BoostNet only
constructs the top layer, I believe that using some novel technique, it is
possible to design a better algorithm that incrementally constructs a whole
neural network from the bottom to the top.
And every incremental step is very simple, very principled and analyzable.
And the combination of a large number of these incremental steps will be
sufficient to construct a very complicated neural network. And if this can
be done, then a very deep and complicated neural network can be learned in a
principled way and with theoretical guarantee. So this is a more ambitious
goal.
Besides learning neural networks, another field that I'm excited to work on
is deep reinforcement learning, so let me also cover some points. So the first one
is that it is a branch of deep learning, and all the principled methods that work for
classical learning of neural networks can also be applied to deep
reinforcement learning. So the effort that is spent on the first problem
will not be wasted for deep reinforcement learning.
On the other hand, the primary goal of my research is to explore
algorithms for [inaudible] artificial intelligence, and I believe that a
strong AI must interact with the environment, take actions on the
environment, and collect rewards from the environment, and reinforcement
learning is exactly a methodology for doing that.
So there are some unique challenges in reinforcement learning where some key
innovations can be made. So one of them is that the current methodology of
deep reinforcement learning uses a neural network to approximate the Q-function
and uses that Q-function approximation to make decisions.
But the model itself cannot remember anything.
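For readers unfamiliar with the setup being referenced, here is a bare-bones sketch of "approximate the Q-function with a network and act on it" (editor's illustration; the tiny network, the epsilon value, and the state/action dimensions are placeholders and are not taken from the speaker's systems):

```python
# Minimal sketch (editor's illustration): a tiny Q-network plus epsilon-greedy
# action selection.  This is only the generic pattern of "approximate Q(s, a)
# with a network and pick actions from it", not any particular agent.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS, HIDDEN = 8, 4, 32

# One hidden layer: state -> hidden -> one Q-value per action.
W1 = 0.1 * rng.standard_normal((STATE_DIM, HIDDEN))
W2 = 0.1 * rng.standard_normal((HIDDEN, NUM_ACTIONS))

def q_values(state):
    """Forward pass: estimated Q-value of each action in this state."""
    return np.maximum(state @ W1, 0.0) @ W2

def act(state, epsilon=0.1):
    """Epsilon-greedy: mostly take the action with the largest estimated Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(NUM_ACTIONS))
    return int(np.argmax(q_values(state)))

action = act(rng.standard_normal(STATE_DIM))
```

The point being made in the talk is that nothing in this pattern stores past experience beyond the network weights themselves, which motivates the long-term memory discussion that follows.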
On the other hand, in order to interact with a more complex environment, it
is essential to maintain a memory that can collect the knowledge and
experience from the past to improve performance. So it appears
that a long-term memory mechanism is very interesting to incorporate and
could be useful in practice. This memory could be not only about how to
solve a specific problem but also about how to interact with human beings,
with many different human beings.
So, in other words, if you can personalize the reinforcement learning model,
then it will find applications in many interesting fields, including playing
games with multiple players instead of just a video game with one player.
Other applications may include interactive search or personalized
recommendation or even some business applications like decision-making, sales,
or negotiation.
At the system level, I think it's also interesting to build some interactive
system that can interact with Internet users to actively collect data
from the users and serve the users simultaneously with the model that is being
trained on this data.
So roughly these are a subset of interesting directions that I think are worth pursuing in
the future. And I'm more than happy to talk to any one of you about the details
of that.
So just to wrap up, we have talked about some provable algorithms for
learning neural networks, some past work of mine, and some future work.
And I'm looking forward to any further meetings with the people at MSR. Thank you
very much.
[applause]
>>: So two. What is better, the method that you talked about earlier on
[inaudible] theoretical results can be [inaudible]?
>> Yuchen Zhang: So in the -- so for both the improper and [inaudible]
binary classification algorithms, for improper learning we do the MNIST dataset with ten
classes, and the way that we do it is just to train ten one-versus-all
classifiers. It is not bad according to the performance. So I believe that
proper learning can also be generalized to that. On the other hand, if
you --
>>:
[inaudible] efficient than not doing --
>> Yuchen Zhang: Yeah, it's not as efficient as doing a native multi-class
classifier. So it's a very interesting question. And I believe that you can
define some notion of separability, at least for [inaudible], for multi-class
classification, and you know that there is plenty of work on multi-class
boosting. So I think the idea can be transferred. But that might be future work.
>>: So I have a question about the [inaudible] that you explained. So is
it in your work that this is first proposed, or is it something that exists
in the literature? I'm just curious to know.
>> Yuchen Zhang: Well, as far as I'm concerned, I don't know of other work,
because there is very limited work on the theoretical analysis of neural
networks, but a separability condition is indeed a central thing for linear
classifiers. And we know that there's a connection between the [inaudible]
and neural networks if you consider non-convex functions.
So I think conceptually you can find some way to make it work. But specifically for
neural networks, I don't know any other paper that uses a similar assumption.
>>: My question was that then basically given a dataset, based on like
[inaudible] and machine learning, and I don't know the exact terms very well,
but just [inaudible] any neural network can approximate any function, and so
isn't it true that for a given dataset there will always be some neural
network that will be able to achieve [inaudible] separability?
>> Yuchen Zhang: So, yeah, so first it is not exactly true, because it is
possible that with the same features the label could be different, because your
dataset could be noisy. If that happens, then no classifier can achieve
perfect performance. On the other hand, so even if -- so even though there
are some results saying that a neural network can approximate any function,
that is under some smoothness condition.
And if you look at a high-dimensional problem, then smoothness becomes a very
serious problem. Because in high dimensions, there is a very interesting
result in geometric functional analysis saying that if your function is
sufficiently smooth, then it's almost a constant. So if you make -- impose
very strong smoothness conditions, then you cannot learn anything
interesting. So the problem now lies in how we weaken the smoothness
conditions, and that is the power of neural networks, I think.
>>: I have a small practical question. So I think the method you showed, you
tested on -- I think it was a two-layer neural network, so it's very shallow,
and it was a great result. You showed that by more intelligent optimization --
or more intelligent initialization -- you could learn more complex classes of
classifiers that back propagation can't. That's great.
In practice, when we meet that problem now, so this is a practical question,
we tend to add more layers and then that gives us more learnability. I
wonder if you could sort of comment on what amount of depth might be required
for that problem with standard back propagation and random initialization.
I'm trying to get a sense for how much compression might that have saved. Is
it saving one layer or ten layers? Can you speculate? Or do you have any
results?
>> Yuchen Zhang: Yeah, so that question is -- so you're asking -- so I think
this can be understood in two aspects. The first one is that as you get more
and more layers, you get stronger representation power. So that might be
one reason why they achieve better performance. Another perspective is that
when the neural network gets deeper, it is even easier to train, because we
have more parameters and there are more local minima. And actually even if
you go to one of the local minima, it is good enough. But if your neural net
is shallow and small, most of the local minima are not good enough.
>>: [inaudible] more important but I think both are true.
>> Yuchen Zhang: Yeah. So I cannot answer both questions explicitly. For
the first question, it is basically orthogonal to this line of work,
because we don't care about the representation part; we just assume that if the
representation power is strong enough to classify your data, then we can
learn this representation.
But we cannot characterize what kind of distribution can be separated by a
ten-layer neural net. And the second one is, if you want to answer the
second question, I need to understand how optimization behaves in the
landscape of the loss function of a ten-layer neural net, or at least a deep
neural net.
So yeah. So this is a challenging problem, and we still cannot answer that.
And the partial answer is, if you look at the boosting mechanism here, then
boosting is a special case of a gradient method, because boosting can be
understood as a greedy method. Right? And we see that using a greedy method
can improve the complexity of the algorithm from exponential to polynomial in
some sense.
And what I believe is that the reason a deeper neural network is easier
to train is that it is a better architecture for a greedy method to work on,
because gradient descent is also a greedy method. It just searches your
neighborhood to try to find the best direction to descend. Right?
So I think there is some very strong connection between the greedy nature of
gradient descent and the depth of the neural network, or the architecture of
the neural network. And it's a very interesting open problem to formalize
this, either by some new theoretical analysis of gradient descent or by
some new modeling technique, such that you can build a new model that is
almost as powerful as the traditional way of building a neural network, and
for that new model there's a guarantee that it can be constructed incrementally
in a principled way, as the third point suggests. Yeah. So I only have
some future visions, but there's no explicit answer to the question.
>> Lin Xiao: Okay. Let's thank [inaudible] again.
[applause]