>> Sebastien Bubeck: Okay, good afternoon, everyone. So I’m very happy to have Matus Telgarsky from
University of Michigan. So many of us are interested in deep learning and trying to understand what are
its theoretical underpinnings, and Matus will try to tell us something new about those.
>> Matus Telgarsky: Hi. Thanks a lot, and I’m very happy to be here. It was very nice to be invited by
Sebastien. And I’ll also say that I like this topic very, very much, and not just because it’s become
popular, but actually, I think there’s a lot of just independently interesting things about this problem,
and I think there’s a lot for basically everybody to contribute to it, just … from almost any standpoint,
even just a purely mathematical one. So before I tell you what’s in the talk and any sort of summary,
most of this talk will be from first principles, so I’ll just tell you even what a neural net is and specifically, the
kinds of neural nets we’ll talk about today. So a neural net is just a way to write a function as a graph.
So it’s a computational graph that works as follows: so the graph has some kind of multivariate input,
and then, there are nodes, and they’re computational nodes. The way they work is they collect a vector
from all of their parents; they wait for their parents to compute something; then they
perform a linear combination of what their parents did; and then they apply a fixed nonlinear function,
sigma. So a class of neural nets—the way it’s defined—is we fix the network layout—you’ve noticed I
left out some edges—so it’s just some collection of nodes and some collection of edges. That’s fixed, and
the sigma is fixed; what we vary are these linear combination weights.
And one of the standard choices for this nonlinear function, sigma, is this funny function which is
identity on one side of zero and zero on the other side. So this function’s actually become very
popular lately; it’s kind of the most popular one right now; and when I started thinking about this
problem, I thought, “Well—you know—I might as well just try to get caught up and use this one,
because it’s popular in practice,” but fascinatingly, this was actually beautiful to work with
mathematically. So this will be the nicest one for us to work with today, and if you follow any of this
literature, there’s all sorts of other words that people use to describe the kind of fanciest, most
complicated versions of neural nets that are popular, and we won’t be discussing those—these are
words like pooling and convolution—we won’t be discussing that today. So this is just a very simple
version of a neural network.
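To make that setup concrete, here is a minimal sketch in Python (not from the talk; the layer sizes, random weights, and helper names are illustrative assumptions): the layout and sigma—here the function just described, zero on one side and identity on the other—are fixed, and the only things that vary are the linear-combination weights.

import numpy as np

def relu(z):
    # the "popular" nonlinearity from the talk: identity on one side of zero, zero on the other
    return np.maximum(z, 0.0)

def feedforward(x, weights, biases):
    # Evaluate a fixed-layout fully connected network.
    # Each node collects the vector computed by its parents, takes a linear
    # combination of it, then applies the fixed nonlinearity sigma (here, ReLU).
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# example: 2 inputs -> 3 hidden nodes -> 1 output, with arbitrary weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
print(feedforward(np.array([0.5, -1.0]), weights, biases))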
Okay, so this is what the talk will cover: the … we’ll try to get a handle on exactly what this class of
functions is, because I’ve just defined it symbolically, but it’s not at all clear what these actually look like
as I vary all these linear combination weights. So let’s … we want to understand what these actually
look like—what this class of functions is. So I’ll first cover a classical result, which is that they can
fit continuous functions, and this result has a very strong limitation in the sense that it does not
tell us anything about what we gain from having multiple layers of these things. So in practice—and
especially lately—people build these very deep circuits, but these classical results only say something
about, in fact, something with only two layers. So it … because of that, we’ll talk about two results that
do tell us a little bit about the benefit of depth. So one is one of my personal favorite results in, actually,
the entire machine learning and statistical literature, which is the computation of VC dimension of these
functions. And don’t worry if you don’t know what that means; we’ll also … I’ll also explain what VC
dimension is. And then, the second one is what I call exponential separation; it’s basically a case where,
if you allow yourself to have multiple layers, then you can get away with using exponentially fewer
nodes or logarithmically as many nodes. And so this is—in terms of hype—this is the new result, if you
want to call it that, but on the other hand, it’s actually—as I’ll say in the closing remarks when I give a lot
of open problems—the new result is actually the … I’ll tell you guys it’s actually the wrong result; it’s not
nearly what we wanted to prove. So the problem is still very open, so like I said, there’s lots of room for
everybody to contribute for all … to all of these things.
Okay, so just to warm up and start this whole setup very slowly, when I say that we’re gonna fit a
continuous function, I have to say what “fit” means. So one of the standard ways to say we fit a function
is in the Lp sense, so if you’re familiar with it, then this is an Lp norm, and it’s just basically
an integral. And the point is that, often, we would like this uniform sense, which means for every point
in our domain, we are some epsilon away; this one’s a little bit weaker; we can average so that we can
give up entirely on some small regions of points. So this one is a … this one’s easier to satisfy; this one’s
harder to satisfy. Oh, and there were little pictures; that’s kind of the average sense, and this was the
uniform sense. And then, in the later sections, we’ll actually care about what I’ll call the classification
sense of “fit,” which means that there’s a problem we want to classify, and we’re gonna care about how
close we can get to classifying correctly on this problem.
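For reference, here is the notation in LaTeX (my transcription of the three notions of fit just described; the domain [0,1]^d and the threshold at one half are assumptions made for concreteness):

\|f - g\|_p = \Big( \int_{[0,1]^d} |f(x) - g(x)|^p \, dx \Big)^{1/p} \le \varepsilon \quad \text{(average / } L^p \text{ sense)},
\|f - g\|_\infty = \sup_{x \in [0,1]^d} |f(x) - g(x)| \le \varepsilon \quad \text{(uniform sense)},
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\big[ \mathbf{1}[g(x_i) \ge \tfrac12] \ne y_i \big] \quad \text{(classification sense, labels } y_i \in \{0,1\}\text{)}.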
Okay, so continuing with the warm-up theme, let’s just cover a very simple … kind of the simplest
possible setting. So what if we only have one layer, and of course, if we only have one layer, and I’m only
doing a univariate prediction, so this is really just one node—it’s a neural net with one node—what can
we fit with that? ‘Kay, it’s a simple question, and it has a simple answer; if this function sigma is
monotone, then, with this linear combination, I basically almost have an indicator on a
half-space; there’s some half-space; and I’m gonna be going up, basically, along with the normal vector
of that half-space. So another way to say that is: I clearly cannot fit arbitrary continuous functions,
because consider this one. If I’m correct over here, I have to be small over here, so that means I’m large
over here. So if I’m correct on this, I’m wrong on either of these two; if I’m correct on one of these two,
I’m wrong on this one. So there’s kind of a direct argument that tells us that in either of the senses of fit, we
have to make a pretty substantial error. So this was kind of a—maybe—a stupid, trivial example, but
there are two reasons it was valuable; so one is: we proved that one layer is not sufficient; and it’s
gonna end up that this is actually tight, so with two layers, this is all we’re gonna need; and the second
thing is that, in this lower bound, we were … we had a function that basically has one bump; we have
this half-space, and a monotone function along it, so we’re basically fitting something with kind of one
bump, and it’s gonna happen in all the results today that, if we build a shallow network, we basically
need to have the same order of nodes as the number of bumps in the function. So this principle will drive
all the lower bounds today.
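In symbols (my paraphrase of the single-node case): one node computes

f_{w,b}(x) = \sigma(\langle w, x \rangle + b),
\text{which is constant on every hyperplane } \{x : \langle w, x \rangle = c\} \text{ and, for monotone } \sigma, \text{ monotone along } w;
\text{so it is (approximately) an indicator of a half-space and cannot match a target with an interior bump in either sense of fit.}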
Okay, so to make things a little bit more interesting, let me tell you about one of the … kind of the … I
consider this to be the folklore proof of how powerful a neural net is. This isn’t quite as strong a result
as the standard one people throw around to say that neural nets can fit any continuous function, but I
find—personally find—this result to be extremely illustrative. So let’s say you want to fit a continuous
function from zero-one to the d—so the hypercube—to just zero-one, and in this picture, I’ve made it
even easier; the thick red regions are supposed to be where the function is one; so
it’s one in those regions and it’s zero outside. So if I wanted to fit this with a neural net, I claim that it’s
trivial if I can fit a box with a neural net, so by box …
>>: This is … sorry, didn’t say what was the continuous function here?
>> Matus Telgarsky: Oh, what … the … so in … you mean in the picture or …?
>>: Yes.
>> Matus Telgarsky: So it’s from the interval … it’s from zero-one squared—so it’s from the
plane.
>>: Right.
>> Matus Telgarsky: And then, I’ve just drawn the level curves, so the two thick, red …
>>: Sorry, so is it one inside the …
>> Matus Telgarsky: Yeah, sorry. It’s one inside here, one inside here, zero out here.
>>: So you have some decaying thing at the boundary; otherwise, it’s not continuous, right?
>> Matus Telgarsky: Yeah.
>>: So you have to …
>> Matus Telgarsky: Yeah, I was just trying to simplify the picture, so …
>>: So near the boundary, it goes from one to zero in a continuous fashion.
>> Matus Telgarsky: Yeah, very quickly.
>>: Mmhmm, okay, mmhmm.
>>: It’s a thick line, I guess.
>> Matus Telgarsky: Yeah, you could call it … or you could call it an iso-line, and I was just—you know—
at res … at scale one, but …
>>: Oh, okay.
>> Matus Telgarsky: Yeah, somehow drawing pictures is difficult, sorry. But … so if we had a neural
network that could, in the Lp sense, fit a box, then I claim this problem is trivial,
and the reason is: we just grid the space, and we just fit each one of those boxes, because then we can
just add these all up—I can add things up, I can take linear combinations through a neural net. So I’ve
reduced the problem of fitting this function to just the task of fitting a box. And let me just say there’s
kind of a common theme that’ll come up here, which is that I’m building, basically, a gadget out of a
neural network; I’m gonna have a little tiny neural network that fits a box, and then I’m just gonna look
at the span of those things. So I just like calling it a gadget. Okay, and to fit a box is also very easy for us.
So just suppose that this nonlinear function, sigma, is very close to the indicator function; we’re gonna
say how to fit it with just two d plus one nodes; and the hint as to why it’s two d plus one is because a
box is an intersection of two d half-spaces. So because these neural net single nodes behave roughly as
an indicator on a half-space—I’ve got two d half-spaces—so what I can do is I can just take one dimension,
and I can put one function along each of these
normals—so it’s two in here and one in each of these—and I can do this for all of them; I end up with
this; and now, I’m in good shape, because if I just apply one more node—and I can threshold it at two d minus
a half—then only this region will be satisfied. So basically, it’s just intersecting together two d
half-spaces, and then that’s kind of what you get after that. So with this kind of fuzzy reasoning, we’ve
fit a continuous function from zero-one to the d to R with two point five layers; I say two point five, because
I’m just using the linear combination part of the neural network. So with two layers, I fit a box, and then
the span of these can fit any continuous function from zero-one to the d to R, and then, if I apply
another nonlinearity, then I’ll be at three layers, but then my range is constrained to be zero-one.
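A minimal sketch of that box gadget in Python (my illustration, not the talk’s construction verbatim; I use a hard threshold as the near-indicator sigma to keep it simple): 2d half-space units, one per face of the box, plus a final node thresholding their sum at 2d minus a half.

import numpy as np

def box_gadget(x, lows, highs):
    # Approximate indicator of the box [lows_1, highs_1] x ... x [lows_d, highs_d],
    # using 2d "half-space" units (one per face) plus one thresholding node.
    x, lows, highs = map(np.asarray, (x, lows, highs))
    d = x.size
    # one near-indicator unit per half-space x_i >= low_i and per half-space x_i <= high_i
    units = np.concatenate([(x >= lows).astype(float), (x <= highs).astype(float)])
    # final node: linear combination (sum) thresholded at 2d - 1/2
    return float(units.sum() >= 2 * d - 0.5)

print(box_gadget([0.3, 0.7], lows=[0.2, 0.6], highs=[0.4, 0.8]))  # 1.0, inside the box
print(box_gadget([0.5, 0.7], lows=[0.2, 0.6], highs=[0.4, 0.8]))  # 0.0, outside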
Okay, so this is a … this is kind of the folklore result, and so there are a couple problems with it. So one
is that—as was kind of pointed out—because everything is continuous, we have a lot of fudge factors on
these boundaries; I can’t exactly do these hyper-rectangles; I have a little bit of fudge on the ends; and
so that’s why I have to do an Lp-type fit—because I have to allow myself to kind of make some errors on
the boundaries. If I wanted to do a uniform fit, then—in the supremum sense—then I wouldn’t be able
to use this argument; it wouldn’t work. I also want to say how old this proof is; I consider this proof to
be as ancient as mathematics, basically. If you look at Jordan content, or the definition of the Lebesgue integral,
or any of these things, you see these kinds of box arguments. I’ll also say that—notice the way the proof
worked—I claim that we really didn’t use anything about composition of functions in neural nets. I built
up a basic class of functions, and then I looked at its span, and so in fact, if you look up how to prove, for
instance, that boosting is consistent—AdaBoost algorithm or whatever you call it—then one way to do it
is to take decision trees of a certain size—you make them have two d nodes—then they can fit boxes
also; and so the same proof says that boosting can fit arbitrary decision surfaces. So again, this is only a
vector space argument; it is not an argument using anything about composition of functions; we
constructed a basis class, then we reasoned about its span.
Okay, and so we had a gap; we had a two-point-five-layer upper bound; we didn’t have a uniform fit;
and so we have a gap between this and the lower bound. So we can close this gap with the … so this is
the result everyone actually cites when they say neural nets can fit any function; they cite this result by
this guy, George Cybenko, from 1989, and I don’t know how it is for everybody else, but even though I
see this cited basically infinitely—it’s like in every paper—I’ve never seen anyone actually discuss it, just
for reasons I can’t really comprehend—so just a side comment. But I actually like this proof a lot; it’s
extremely clean, and every bug that you … and every little kind of nastiness, and sloppiness, and
everything I just said is gone from this proof. So … and I will say that I’m watching the clock, and so I’m
gonna rush … ah, screw it, I’ll just give the whole thing in detail. It’s fine; it’s very clean. So the setup of
the proof is very similar to the last one: I’m going to build a gadget—gonna build some kind of primitive
object out of just neural net nodes—and then I’m gonna reason about the span of this thing. And
before, I kind of used very hazy reasoning; I said fit things with boxes, grid the space, Jordan content,
but we don’t need to do any of that, because vector spaces are such well-understood objects that I can
just … there’s theorems I can use; I don’t have to say—you know—approximate things. So the proof
itself is all in … using functional analysis; that might be why it’s not discussed much, because the way it’s
stated is actually maybe a little bit impenetrable. But I’m actually gonna give a Hilbert space version of
it, and for the Hilbert space version, if you know what a vector is
and what the Pythagorean theorem is, you can understand the proof.
So okay, so here’s the proof: the first step is you prove that a single neural net node is what I
call a correlation gadget, and let me … I left something out of this slide; I have to tell you
what sigma is. All we need for sigma in this proof is that: on one side, it limits to zero; on one side, it
limits to one; the whole function is bounded and measurable. It’s all you need; it can be zero; it can do
any stupid thing you want; and then, it can go to one. So it’s an approximation of the indicator, but it’s a
very weak approximator of the indicator. So you have to believe this … what I call the correlation
property: give me an f that’s nonzero—a continuous f that’s nonzero, and it has to be an L2
function—then there exist choices of
the linear combination parameters so that this integral is nonzero. So for any nonzero function, I can
kind of detect some structure in it. And interestingly enough, the proof uses basically the same box-fitting
argument I gave earlier. Effectively, the way the proof goes is: if the function is continuous, then I
can kind of build up boxes—I can integrate the function using boxes—but those boxes have to have
nonzero measure; otherwise, the function is zero. So the proof of this
actually embeds the previous proof; technically, it has to use Fourier analysis, but it’s the same … it’s
using the same thing under the hood. So this is a lemma, and then here’s the rest; in the Hilbert case, it’s
just almost a direct argument. So I have all of these correlation gadgets; these are just
all of the functions here, as I vary w0 and w—so look at the span of that, ‘kay? So that’s a subspace. I’m
in infinite dimensions, so it’s not … might not be closed, so I take the closure of it.
>>: So closure in the …
>> Matus Telgarsky: L2 sense.
>>: Uh-huh.
>> Matus Telgarsky: Yeah, in … this is one of the places where, in the full proof, you have to work in a
uniform topology, and this kind of—I guess—so … but anyway, so it’s the … it’s this closure. But now,
it’s a closed subspace; it’s a closed subspace, and I’m in a Hilbert space; I can talk about things like perp
vectors, Hil—you know—Pythagorean theorem—all this kind of stuff. So the theorem will be that the
closure of the span of these things equals the continuous functions; by definition of closure, this means
that for any epsilon, I have an element in here that is epsilon-close. So the proof goes like this: so I say,
“Give me any continuous function, and I can project it onto my closed subspace, and then I can just look
at the difference of the two.”
>>: This implies that the continuous functions are closed in L2.
>> Matus Telgarsky: Sorry? I didn’t hear.
>>: So this implies that the continuous functions on zero-one to the d is a closed subspace of L2.
>>: No, it’s just …
>>: And it’s not. So what you can say is that continuous functions is a subset of this closure, but this
cannot be an equality.
>>: No, but any continuous function … they’re dense in L2.
>>: Right, right.
>>: [indiscernible]
>> Matus Telgarsky: I’m talking about the L2 continuous functions.
>>: So … but the closure is in the topology of the inverses, which is L2 closure of some subspace …
>> Matus Telgarsky: Yeah.
>>: If that has to be equal to the set of all continuous functions …
>> Matus Telgarsky: It’s not the set of all continuous functions.
>>: That is the statement there, right?
>> Matus Telgarsky: This is the … I’m giving the weakened version of the proof, which is only for Hilbert
spaces. I’m not giving the whole uniform version of the proof.
>>: So I don’t understand that statement—that equality there, which is the closure of that span equals
what?
>> Matus Telgarsky: You’re right, I should’ve justified the right-hand side with only Hilbert spaces, but …
so the full statement uses the uniform cloak … the uniform closure and the uniform topology.
>>: The pesky thing is: the continuous functions themselves are inside the Hilbert space; they’re not the
Hilbert space, ‘cause it’s not complete.
>> Matus Telgarsky: Yeah.
>>: But it is a inner product space, so you can do a closure in there.
>> Matus Telgarsky: Yeah, since I see that …
>>: Okay.
>>: We can move on.
>>: Yeah, that’s okay.
>> Matus Telgarsky: Yeah, what I’ll do is—because it’s, I mean, it’s clear to
me that you understand this very well. So after I give the Hilbert version, I’ll tell you how to translate all
of the lines into the functional-analytic version. This was …
>>: This was designed for different audience.
>> Matus Telgarsky: Well …
>>: Less picky people.
>> Matus Telgarsky: No, I’m … this is the audience I always want; I mean, this is great, because I have a
background in functional analysis and use it myself. I have to ask myself, at some point, why I so rarely see
the details of this proof discussed, and the only thing I can come up with is that people get scared when
they see phrases like the Hahn-Banach theorem. And this proof, for
instance—I said it used Fourier analysis, but you have to take the Fourier transform
of a measure, not of a function, and so this already is something that, you know, for instance, isn’t well
covered in my grad analysis textbook. So …
>>: Really?
>> Matus Telgarsky: But you are right that I … that in this case, I left out too much detail, and that
equality doesn’t hold. So I apologize for that. That aside, let me just say how the rest of the proof goes.
So now, let’s look at this thing. So I know that f minus g is orthogonal … it’s in the orthogonal
complement to this subspace S, and that means that the inner product between this and every element
in S itself also has to be zero, but then I can use the contrapositive of the lemma: if the inner product with
every one of these gadgets is zero, then f minus g has to be zero itself. Every perp element is
zero, so that equality holds.
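For the record, a sketch in LaTeX of the Hilbert-space version of the argument just given (with the caveat from the discussion above: the full proof works in C([0,1]^d) with the uniform norm, using Hahn-Banach and signed Radon measures instead of projections):

S := \overline{\operatorname{span}}\,\{\, x \mapsto \sigma(\langle w, x\rangle + w_0) \;:\; w \in \mathbb{R}^d,\ w_0 \in \mathbb{R} \,\} \subseteq L^2([0,1]^d),
g := \Pi_S f \ \Rightarrow\ f - g \in S^\perp \ \Rightarrow\ \langle f - g,\ \sigma(\langle w, \cdot\rangle + w_0) \rangle = 0 \ \ \text{for all } (w, w_0),
\text{and the contrapositive of the correlation lemma then forces } f - g = 0, \text{ i.e. } f \in S.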
So now that that’s been said, let me explain—just quickly, for the experts in the audience—how to
translate this thing into what it actually is supposed to be. So I cannot … I don’t use the L2 space; what I
use is … the correct topology for continuous functions is the set of continuous functions—by the way,
under the restriction zero-one to the d to R; it’s … that’s important, ‘cause it’s important that I’m talking
about continuous functions over a compact set—so this together with the uniform norm, which is that
sup norm, that is a Banach space, and the important thing is that the dual of this space is the set of
signed Radon measures on this thing—and so then, for this step, we
cannot take projections, but I can use the Hahn-Banach theorem to get what’s effectively a perp vector,
and so then I can still use the same reasoning over here. So no, that was an excellent point, and …
>>: Okay, it’s fine.
>>: Matus, we should move on.
>> Matus Telgarsky: Okay, okay, okay, yeah, yeah. I just feel … I feel very bad that—you know …
>>: No, no, don’t feel bad, just [indiscernible]
>> Matus Telgarsky: Okay. Alright, so this is the quick summary, so two layers is enough, in a very
strong sense—this L-infinity sense—and also, notice what we also constructed in this theorem, these
correlation gadgets; this is exactly what you’re doing in boosting algorithms, for instance. You basically
find the weak learner which is most correlated with the thing. So this is an algorithmic proof: greedily,
if you were constructing this fit, you would actually pick the
most correlated thing every iteration. And so this has been done; this is actually what I would argue
popularized, to a great extent, these greedy methods. This is a big paper by Andrew Barron from 1993.
And just a funny remark: we can fake-make this algorithm deep in the following way. I can
take a deep network and have it devote part of every layer to copying the input forward, and so then,
every time I would do my boosting-type thing, I would have the
nodes just kind of go like that. And while this is kind of stupid, because that means I have—you
know—d nodes in every layer doing the copying, maybe there’s some way to compress it and get
something nicer. Okay, the problem, though, with this is that there’s not a good
understanding of the number-of-nodes versus number-of-layers tradeoff, or of function composition in
general. So we haven’t … it’s nice that this result is true, but—and I’ll … I really like the Cybenko proof—
but we have not really gained that much about the heart of the problem.
So the next section—like I said—covers results very dear to my heart, and because I’ve slowed down a little
bit, I’ll just kind of state the highlights here, but I will say again that these are absolutely beautiful
results, and I find them, personally, very surprising. So I’m just gonna assume everyone knows what VC
dimension is. So the question is: so now that I gave you these networks, and so suppose that, in those
networks, every one of those sigmoid functions is just the indicator, zero-one. The question is: what is
the VC dimension? And for me, the fascinating thing is that it’s—literally ignoring a log factor—it is just
the number of parameters—so it’s this first one—and this, to me, is interesting to think about, because
a simple perceptron—so only one node—is also Theta of W. Now, this one is actually Theta of
W log W; W is the number of parameters in the network—so all the edges, for instance. So this scheme,
where I used that function—just an indicator—doesn’t reflect the structure of the nonlinearity
whatsoever. Okay, the proof is actually kind of a straightforward induction.
Now, if I allow these popular nonlinearities that I mentioned, the VC dimension changes,
and it becomes just the number of parameters times the number of layers—and the Theta was
only closed by Peter Bartlett, and he hasn’t even typed it up yet. If you look in his book, it’s
not completely nailed down yet. So this is already fascinating to me, that it only increases just
multiplicatively, by the number of layers. So yeah, I personally find this quite fascinating, and I have to
say: not only do I think the proof is beautiful, but I would argue that Bartlett himself loves the proof,
because the proof is actually the cover of the book. I’m actually serious; I’m not making this up. This
might look like some kind of stain—you know—some radioactive staining of a neuron with, like, axons or
something [indiscernible] it’s actually not. If you look in the book, in chapter eight, the figure appears,
and he just kind of doodled on it to make this. I haven’t asked him about this yet, but literally, the proof
of how this works is in there, and this is an amazing proof.
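Restating the bounds as given in the talk, with W the number of parameters (edges) and L the number of layers—these are the talk’s statements, up to log factors, not a literature survey:

\sigma = \mathbf{1}[z \ge 0] \ \text{(indicator)}: \quad \mathrm{VC} = \Theta(W \log W),
\sigma \ \text{piecewise linear, bounded number of layers } L: \quad \mathrm{VC} = \Theta(W L) \ \text{up to log factors},
\sigma \ \text{piecewise polynomial, no bound on } L: \quad \mathrm{VC} = O(W^2).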
>>: So Matus, in general, so VC dimension is going to be at most the number of parameters squared if
you have a sigmoid? If you have something that …
>> Matus Telgarsky: If we do not have a bound on the number of layers, the only upper bound we know
right now is W squared, yes. But for these piecewise ones—this is if
the sigma is piecewise polynomial. As we change what sigma is, all sorts of terrible things start happening.
>>: Is it … what I just said is true for piecewise …
>> Matus Telgarsky: Yes.
>>: … polynomial? Okay.
>> Matus Telgarsky: Yes. I’ll just say one brief thing: you might say, “Ah, I’m sure regularity assumptions
hold, so let’s make sigma concave, convex—goes to zero, goes to one—and I can impose a smoothness
bound on it.” Turns out, I can choose one of these to get VC dimension infinite with only three nodes.
So it … this is a very delicate business, and how do you actually prove these VC dimension bounds?
You have to use really high-powered techniques. So for those of you in the audience that know Sard’s
theorem and know Basu’s theorem, so Basu’s theorem talks about counting intersections of polynomials
in high dimensions, and it starts to make sense why that would come up. So I … these proofs are
amazing. So if you do have time after or later this week, you can ask me, because I love all these results
a lot.
Okay, so what we know so far is that a flat network can fit any continuous function, and we also know
that the number of functions we have in the classification sense—so how many classifiers we get—it
doesn’t actually grow that fast with the number of layers, but what we don’t know is what these
functions actually look like. So in an attempt to get a sense of what the functions with many layers look
like and how different they are from what you can get with a flat network, I asked and answered the
following question. So the setup is as follows: you give me an integer, k—so this is gonna be a result
that holds for all positive integers, k—you give me a k, I can construct two to the k points with the following
property. Any flat network with fewer than roughly two to the k nodes in the network will have error at least a sixth.
And the result itself will quantify what flat and all these things mean; there are no … it’s actually not
even gonna use asymptotic notation; it’s a very clean, easy-to-prove, and easy-to-state result; and it’s
not just separation from flat and deep; it’ll be for an arbitrary number of layers. And then, the
punchline will be that if you give me—it ends up—two k layers, I can get zero error with just two k
parameters; and something called a recurrent net—which is a fixed, small network, so a network of
constant size, where I take its output and plug it back into itself and do this k times—will also get
zero error. And the reason why I care about this, in addition to just understanding this class
of functions, is because in a statistical sense, I know that this deep thing has exponentially smaller VC
dimension, so maybe there’s some hope for learning these functions from data well. And the thing to
contrast this against is the switching lemma and related circuit complexity results, which get a similar
tradeoff. So yeah, I’ll talk about this more maybe offline.
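Written out, the shape of the statement as I understand it from the talk (the constants are made concrete on the later slides):

\text{For every } k \text{ there are } n = 2^k \text{ labeled points in } \mathbb{R} \text{ such that}
\text{(i) any network with } l \text{ layers and fewer than roughly } 2^{k/l} \text{ nodes per layer has classification error at least } \tfrac16, \text{ while}
\text{(ii) a network with } O(k) \text{ layers and } O(k) \text{ nodes—or } k \text{ iterations of a constant-size recurrent net—fits them exactly.}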
Okay, so let me tell you what the class of sigmas that I deal with is. Actually, yeah, okay, I’ll just do this.
So this is just the class that has two nice properties: it’ll slightly generalize that constant-and-then-identity
function, and also, I can use this class of functions to inductively reason about what every layer
and every node in the entire network is doing—so that’s why this’ll be a convenient class to work
with. So I just call this function t-sawtooth if it has t pieces—so I take the real line; I partition it into t
intervals—of course, two of them have to be infinite—and it’s affine in each of those
pieces, and I don’t require the function to be continuous, so it can have discontinuities—so it can be like
a piece, and then another piece that doesn’t connect, and another piece. So this is t-affine, and two
examples are: so this kind of popular function is piecewise affine with two pieces, but then, you can
come up with other examples. So again, this is only a univariate example, so a decision tree kind of isn’t
as meaningful as usual, but a decision tree with t minus one nodes is t-sawtooth. So this
this lower bound will also apply to diff … very different algorithms. So for instance, this lower bound
also applies to boosting, but you might say it doesn’t matter, ‘cause it’s univariate problem, but still,
result holds.
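In symbols (my paraphrase of the definition):

\sigma : \mathbb{R} \to \mathbb{R} \ \text{is } t\text{-sawtooth if } \mathbb{R} = I_1 \cup \dots \cup I_t \ \text{(consecutive intervals) and } \sigma|_{I_j} \text{ is affine for each } j;
\text{examples: } \sigma(z) = \max\{0, z\} \text{ is } 2\text{-sawtooth, and a univariate decision tree with } t - 1 \text{ internal nodes is } t\text{-sawtooth.}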
So reasoning about these functions is very easy. So first, if I have something that’s s-sawtooth—so it’s
piecewise affine in s pieces—I have another function which is t-sawtooth—so it’s piecewise affine in t
pieces—I claim the summation of these two is just s-plus-t-minus-one-sawtooth. And the proof is just to
look at every time the slope changes—so every one of the pieces over here—and notice that in each
one of these pieces, they both have a fixed slope, so I can just count the number of times
the slope changes between the two of them. That’s at most s plus t minus two changes—s minus one from one
function and t minus one from the other—so at most s plus t minus one pieces. On the other hand, if I compose the two
together, it’s s-times-t-sawtooth, and the way to see what goes wrong is: so I compose this with this, so
this comes first; so I take the t-sawtooth function; I look at any interval—any one of the t pieces that
define it—well, if I take that interval, and I map it through the function, I get another interval, ‘cause it’s
affine in that piece, but the interval I mapped to can hit every piece of this thing. So for every piece
of the inner function—sorry, composition should be defined the other way—for every piece over here, I
can get s pieces over here. So together, I get s times t.
So … and if I apply an inductive argument to a neural network, this is actually quite easy to see. So what
you do is you look at any node in the network, and I take all the functions that plug into it—so if I’m at
some layer, j, right now—all the nodes plugging into it are inductively gonna be (tm)-to-the-j-sawtooth; I
add together m of them, so I get at most m-times-(tm)-to-the-j-sawtooth; I apply my nonlinearity, which
multiplies by t, so the whole thing is gonna be (tm)-to-the-j-plus-one-sawtooth—so that’s the inductive step of
the proof. So saying it again, a network with a t-sawtooth
nonlinearity, m nodes in each layer, and l layers is (tm)-to-the-l-sawtooth. And the point of this is that the
number of bumps—the number of pieces in the function—grows exponentially in the number of layers,
but only linearly in the … in all the other parameters. So this is actually what’s gonna make the proof go
through; we’re building these bumps much more quickly with composition than with addition. Okay, so
this, as it turns out, is basically gonna complete the proof of the lower bound.
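A quick numerical illustration of this counting, in Python (mine, not from the talk; the grid resolution, the shift amounts, and the helper names are arbitrary choices): sums of sawtooth functions add piece counts, while compositions multiply them.

import numpy as np

def count_pieces(f, lo=0.0, hi=1.0, n=100001, min_run=5):
    # Count affine pieces of f on [lo, hi] by looking at slopes on a fine grid.
    xs = np.linspace(lo, hi, n)
    slopes = np.round(np.diff(f(xs)) / np.diff(xs), 6)
    # A breakpoint falling strictly inside a grid cell contaminates only that
    # cell's slope, so count runs of equal slope and ignore very short runs.
    pieces, run = 0, 1
    for a, b in zip(slopes[:-1], slopes[1:]):
        if a == b:
            run += 1
        else:
            if run >= min_run:
                pieces += 1
            run = 1
    if run >= min_run:
        pieces += 1
    return pieces

def tent(x):
    # the 2-sawtooth "pyramid" from the talk: 2x below 1/2, 2 - 2x above
    return np.minimum(2 * x, 2 - 2 * x)

def compose(f, k):
    def fk(x):
        y = np.asarray(x, dtype=float)
        for _ in range(k):
            y = f(y)
        return y
    return fk

def add_shifted(f, k):
    # sum of k slightly shifted copies: piece counts add (s + t - 1 at a time)
    shifts = np.linspace(0.0, 0.1, k)
    return lambda x: sum(f(x - s) for s in shifts)

for k in range(1, 6):
    print(k, count_pieces(add_shifted(tent, k)), count_pieces(compose(tent, k)))
# addition grows linearly (k + 1 pieces here); composition grows like 2^k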
So I’ll prove on the next slide a lemma; the lemma just says that if you give me any sequence of two to the k
points with labels alternating—so a prediction problem from the reals to zero-one—you just give me any
sequence of reals, and I label them zero, one, zero, one, zero, one—alternating as fast as possible—and
the claim is that any t-prime-sawtooth function—so it’s piecewise affine in at most t prime pieces—
has to have error at least—oh, sorry, I should say what it is: (two to the k minus two t prime)
over (three times two to the k). So yeah, so I’ll prove that on the next slide, and the reason this completes
the proof is because: suppose that your network structure satisfies this inequality. So if
you just plug this in, you get that the error is at least a sixth, and to make sense of what—so this is
where I’m going to say the concrete version of what the lower bound actually is—so to make sense of
this quantity, let’s say l equals two—so two layers—and … ah,
yeah, sorry, I wrote all this out here. So if I use this thing, which was zero on one side and identity on
the other side—which is two-sawtooth—so t is two, and let’s say I use two layers; then, if I have less
than two to the k over two nodes per layer, error’s at least a sixth; if I make the number of layers root k, and I
have fewer than two to the root k nodes per layer, then I, once again, have error of one sixth. So this is pretty rough; I will say,
actually, this is very much improvable. So I guess this is a theory audience, so: Rocco Servedio has this very nice
improvement of the switching lemma—it was at FOCS, right? Yeah, so that one actually says
that if you just bump down the number of layers by one, you still get an exponential gap—that isn’t
implied by this result. So the kind of separation you get from his result, over—you know—circuits and
whatever—AC0—it’s stronger.
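Spelling out the arithmetic behind that “plug this in” step (my reconstruction from the statements above): a network with a t-sawtooth nonlinearity, m nodes per layer, and l layers computes a (tm)^l-sawtooth function, so the lemma gives classification error at least

\frac{2^k - 2(tm)^l}{3 \cdot 2^k}, \qquad \text{and if } (tm)^l \le 2^{k-2} \text{ then this is at least } \frac{2^k - 2^{k-1}}{3 \cdot 2^k} = \frac{1}{6};
\text{for example, } t = 2,\ l = 2: \text{ having only on the order of } 2^{k/2} \text{ nodes per layer (up to constant factors) forces error at least } \tfrac16.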
‘Kay, so the way … so here’s how we prove this lemma up here. So I have my t-sawtooth function—or I
guess I said t prime—and I have two to the k alternating points. So because I’m talking about
classification, and for classification, I take everything that’s above a half, and I make it one, and below, I
make it zero. So all that matters is where I cross a half, basically. So that
means I get a function which is piecewise constant in two t pieces. The reason it’s two t is because at
discontinuities, I can also cross; so that’s why it’s not t-sawtooth to t pieces; it’s t to two t. So the
discontinuities actually nuke the constant, and the bound is not tight with constants in the continuous
case. Okay, so now, notice that if I just treat each of these intervals as a bin, then the number of points
that sit alone in a bin is at most the number of bins. So I have at most two t points that are alone in their
bins, and that means that the total number of points that land in bins with at least two points is at least n
minus two t. And the reason this
completes the proof is because if you’re in an interval that gets at least two points, they have alternating
labels, so a constant prediction has to make an error on at least a third of them; in the limit, it asymptotes to one half, but it’s at least a
third; so I just divide this by three, and it completes the proof. Sorry, that was the end, okay. So yeah,
it’s a … yeah, it’s just a counting argument; it’s all it is. I thought the proof would be much more
complicated.
And so you notice that there’s no “Oh, it only holds for certain k and not others;” it’s just that easy.
Now, let me tell you really quick how the upper
bound works, and this one’s even easier. So for the upper bound, I have to find a function that’s either
k repetitions of a constant-size network—that’s the recurrent case—or a k-layer network with
k parameters that fits these points exactly, and it’s really easy. So I take this function—and I wrote it out
tediously up there just to establish to you, if you’re doubting, that it is just three nodes in two layers, but
this is the easier way to read the functional form of it—so just a little pyramid, and now the question is:
what happens when you compose the pyramid with itself? It’s quite easy; so it looks at points that are
less than a half, multiplies them by two; points that are bigger than a half, it reflects them. So it’s
literally all it does—that’s just it—so … and then, by induction. So then, I can just take this set of points
with alternating labels to be the bottoms and the tops of this function, and then that’s really it. One
thing that I found kind of fascinating, because it wasn’t by design—it was just kind of an accident—so
you could argue—I mean, this isn’t … I don’t know how—okay, you can argue that this is effectively an
approximation by a Fourier basis; you give me a k, and I construct these kind of high-frequency piecewise
affine functions. And there is a fairly recent survey on neural nets where they also just asserted that a
Fourier basis is easy for a deep network. Also—you know—the switching lemma, that was parity
functions; that’s the Fourier basis over the Boolean domain. So I don’t know if everyone’s doing Fourier
bases ‘cause that’s the first thing you think of, but I don’t know … just interesting coincidence.
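A sketch of that upper bound in Python (my reconstruction; the particular ReLU formula for the pyramid is one standard way to write it with three nodes in two layers): composing the pyramid with itself k times classifies the 2^k alternating points exactly.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pyramid(x):
    # three ReLU nodes in two layers: sigma(2*sigma(x) - 4*sigma(x - 1/2));
    # on [0,1] this is 2x below 1/2 and 2 - 2x above (the pyramid)
    x = np.asarray(x, dtype=float)
    return relu(2 * relu(x) - 4 * relu(x - 0.5))

def deep_pyramid(x, k):
    # compose the pyramid with itself k times
    # (equivalently, run a tiny recurrent net for k steps)
    y = np.asarray(x, dtype=float)
    for _ in range(k):
        y = pyramid(y)
    return y

k = 4
n = 2 ** k
xs = np.arange(n) / n            # the 2^k points x_i = i / 2^k: tops and bottoms of the sawtooth
labels = np.arange(n) % 2        # alternating labels 0, 1, 0, 1, ...
preds = (deep_pyramid(xs, k) >= 0.5).astype(int)
print(np.array_equal(preds, labels))   # True: zero classification error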
Okay, so we can close from here. So to summarize, so first, we gave a couple classical results, where we
can fit any continuous function with a shallow network; then, we pointed out that the VC dimension, at
least—so the number of classifiers we can construct—does not grow too quickly; so it’s just linear in the
number of layers; and then, we gave this case where there do exist functions where you have to blow up
the number of parameters exponentially in order to fit them with a shallow thing. And of course, that
doesn’t contradict the VC result in any sense, because it is just one class of functions; it’s not
describing—you know—it’s not saying that in every case, we can kind of reduce the complexity by a
log—you know—logarithmically—just kind of a nuance that’s lost in the VC characterization.
Okay, so I have a bunch of kind of random remarks that are just things I found very interesting. So at
least for this construction, if I use these indicator functions, the result is false; you don’t increase the
complexity at all when you do compositions. I mentioned this to Sebastien earlier, and he pointed out
immediately that I’m using only the univariate case, and I can actually tell you that in the multivariate case,
it is slightly different, but broadly, I found it interesting that you have to use something that … all the
proofs I know for these upper bounds—not just that pyramid function I showed you, but other proofs I
know—they need a continuous function. But like I said earlier, that class has infinite VC dimension, so
you have to be very careful which continuous functions you use. Oh, sorry, yeah, so this has … oh, I
didn’t mention this; this is used as a lemma in the proof of the infinite VC dimension for a very specific
nonlinearity.
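One way to see the univariate indicator claim (my reasoning, not spelled out on the slide):

\text{With } \sigma = \mathbf{1}[z \ge 0] \text{ and a one-dimensional input, each first-layer node } x \mapsto \mathbf{1}[wx + b \ge 0] \text{ is a step function with one breakpoint;}
\text{every later node thresholds a linear combination of functions that are piecewise constant on the common partition those breakpoints induce,}
\text{so no new breakpoints are ever created: the network has at most } m_1 + 1 \text{ pieces (} m_1 = \text{number of first-layer nodes), regardless of depth.}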
Okay, so another thing—and this is a result I’d really like; I actually think I’m gonna probably try to prove
it sometime this month—is: so we constructed a single function that we know is very expensive to
construct with something that’s shallow. The question now is: what are some other ones? And the
result I would like to approach is: can we find a notion of independence and rank? So in other
words, let’s say I just have a grab bag of functions that I can represent efficiently with a multilayer neural
network and inefficiently with a shallow network; is there some other function, now, which I can’t
represent as—let’s say—a linear combination or just a composition of these other ones, now that I can
throw into this bag, and I can kind of keep increasing the set of functions I can characterize? Because
we didn’t at all characterize all the functions that have l layers and m nodes in each layer. We didn’t
even remotely characterize that function class, but maybe there’s a … maybe we can find a couple more
of these functions, maybe not just these piecewise affine Fourier transforms, but maybe there’s a
couple others that, together, they actually do characterize the entire cla … that’s what I mean by
“dense” in quotes there. So this is something I’d really like to answer.
Another thing is: what actually is this function? So a lot of you know Léon Bottou; I actually gave a
version of these results—and some other stuff—at Facebook,
and Léon said to me that he was very fascinated by this pyramid function. And so this
pyramid function is pretty funny. So if you take a symmetric function, g, and look at the
composition g composed with the pyramid, what does it do? If g is from zero-one to zero-one—so this is a
precomposition—then as I go from zero to two to the minus k, I’m gonna replicate g, just a
condensed version of it, and then from there to there—so from two to the minus k to two to the minus k plus
one—I’m gonna get the reverse of it, and since g is symmetric, I’m gonna duplicate it. So this
precomposition’s gonna repeat the function two to the k times. So it’s like a period
operator or a looping operator. So I say that there’s prior work; if you know about
this Kolmogorov-Arnold representation result, it uses space-filling curves and fractals, and there are—so
this is a paper—there are people that claim the result’s irrelevant, but I say, actually, that it captures a
lot of this kind of … this interesting structure that’s also in this pyramid map.
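In symbols (my paraphrase of the picture): writing m for the pyramid and m^{∘k} for its k-fold composition, on each dyadic interval the composition traverses [0,1] linearly, alternately forward and backward, so for symmetric g (that is, g(x) = g(1 - x)):

(g \circ m^{\circ k})(x) = g\big(2^k x - j\big) \quad \text{for } x \in [\,j\,2^{-k},\ (j+1)\,2^{-k}\,],\ j = 0, \dots, 2^k - 1,
\text{i.e. precomposition with } m^{\circ k} \text{ repeats a compressed copy of } g \text{ exactly } 2^k \text{ times: a period / looping operator.}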
And then, so I think in general, there’s a lot of room to develop just nice theories about what
composition of functions does. So—oh, and by the way, I’m not making fun of functional analysis,
because I actually spent way too much of my PhD reading functional analysis books and using it—but now I
realize: wait, like, I really need to know about more than just vector spaces. So … but there are a lot of
interesting fields, and maybe during the question section, which we’ll get to in just a moment, people
can just tell me about other ones. But I feel like we can just keep doing so much more. So of course, in
TCS, there’s circuit complexity results; so another family of results that I literally know nothing about—I
just found out about from Luca Trevisan’s blog—is a … so in additive combinatorics, they look at the
following problem: you give me a group, and I look at a subset of it—not a subgroup, but a subset—and
then I look at what happens when I use the group operation to multiply the set with itself. So I take
some subset of the group—sorry—not a subgroup—call it S—and I multiply the
elements together; I get what you’d call S squared, S cubed. What is the rate of growth of this
thing? To me, this is very, very similar to this “How many functions do I get as I add more layers”
problem. So I think there are lots of fields that are attacking this problem of exactly what this function
class is as you apply these compositions. I think there could be … so this, for me, is where this is just
a purely beautiful mathematical question, ‘cause I still don’t understand what these compositions are.
And so of course, from our computational perspective, what we really want is some structure that
actually helps us design algorithms, some structure an algorithm can pull out—you know, something
maybe like that correlation thing, but a correlation that works in a multilayer fashion. So for me, that
would be the best … that’d be the best kind of structure result. Okay, we’re done.
>> Sebastien Bubeck: Thank you. [applause] Questions?
>>: Maybe just a comment: so the result that you told us, it’s far from the separation in circuit
complexity—right—because here, it’s just for one …
>> Matus Telgarsky: Univariate.
>>: Not only univariate, but it’s also … it’s not a worst case. In the worst case, it could still be that
shallow and deep are the same thing.
>> Matus Telgarsky: Yeah, there could be certain functions that have that property. I just gave one
function class.
>>: So yeah. Whereas, in … what Servedio did is that he showed that in the worst case, there is
[inaudible] for Boolean circuits.
>> Matus Telgarsky: Yes.
>>: Yeah.
>> Matus Telgarsky: I … well, I have to trust you on that. I thought he did not show that, but I just
maybe don’t know the result well enough. I thought he did something kind of like the switching lemma,
where he gave a specific function which is hard to approximate with one less layer, but maybe I’m
wrong. Does anyone know? To be clear, I thought the result was a specific class—he calls it the Sipser-something functions—where it exactly exists in a depth-k circuit, but then if you go down to depth-k-minus-one circuits, you can only get one half minus little o of one close without getting …
>>: That’s true.
>> Matus Telgarsky: … without having an exponential blow-up in the size of the circuit. That was my
understanding of the Servedio result.
>>: Yeah. That’s true; this is right.
>> Matus Telgarsky: But—you know—the … yeah, I feel bad saying that, ‘cause I’m not trying to
diminish the result, like it’s not …
>>: Actually, can we go to your slides on VC dimension? You just went all the way [indiscernible] There
were two results you said, right? Or …
>> Matus Telgarsky: Yes.
>>: So I [indiscernible]
>> Matus Telgarsky: Go ahead; you can … you …
>>: No, I just want to recall them.
>> Matus Telgarsky: You want to admire the book cover or …? [laughter] That’s all I … I didn’t do this
very well, did I? I spent more time talking about the book cover than about the results, but
[indiscernible]
>>: Okay, so there was one result before. So there were two sigmas you talked about, right? So this …
>> Matus Telgarsky: Yes, the indicator …
>>: Mmhmm, and then the … just the size of the L?
>> Matus Telgarsky: Oh, yeah, yes, yes, yes. So just … yeah. So it was this one.
This is the answer. So it’s number of parameters.
This is the answer. So it’s number of parameters.
>>: Yeah, if sigma is just a … here?
>> Matus Telgarsky: Indicator, yeah.
>>: And the next …?
>> Matus Telgarsky: And this is if I use that one that is popular now. And the reason I picked this
function class was because for this, we do have a Theta … we do have a tight upper and lower bound … were
you gonna ask about what other function classes we …?
>>: Yeah, no, I just wanted to … so you always have these … like okay, that’s … the two numbers or the
two bullet points below, that was like a multiple choice—like okay, this could be that answer, or do you
expect something over there, or it …?
>> Matus Telgarsky: So you’re asking why I thought these were relevant bullet points to include. I could
have tried to sneakily construct this talk where I try to make it sound like my result is actually showing
that there are tons of these functions where it takes exponentially more nodes. So if it was true that I
always could take any function that has exponential size with a shallow network and compress it down
to a linearly-sized deep network, that would … that must imply that the VC dimension is exponential,
because I’m getting all those functions, so they have to live somewhere. So if my result was not … did
that make sense as I explained, or was that …?
>>: Mmhmm [indiscernible]
>> Matus Telgarsky: So what this … you can interpret this result as saying that my result, which was for
a fixed class of functions, there actually aren’t that many of them. There aren’t that many of them.
Then, the second bullet point is because if you look at all the existing bounds, a lot of the upper
bounds—not for this function, this sigma, but for many others—actually are quadratic. And I won’t call
out names, ‘cause maybe that’s unfavorable if the conjecture is wrong, but people I rate—you know—
way, way up, they have conjectured to me that it actually is superlinear—maybe
quadratic—in some cases, which I have only … the only intuition I have that that might be true is what
those people told me. So it might actually be quadratic in some real cases.
>>: So with some sigmas, you’re saying.
>> Matus Telgarsky: Yeah, yeah, including ones that people care about. Yeah, it’s not just pathologies.
By the way, an—I mean—an interesting case is if—‘cause you saw all my reasoning was doing these little
piecewise things, talking about bins, right, discretizing—you know, what if—you know—what if that
thing is—you know—one over one plus e to the x? Which is the sigmoid, which is the common thing
people use—or one over one plus e to the minus x. That thing, we can still prove a theorem, but right
now, the only upper bound there is actually quadratic—or actually, sorry, that upper bound does not
even depend on the number of layers, so actually exhibiting a dependence on number of layers is often
tricky.
>>: So we found a better result ten years ago …
>> Matus Telgarsky: I couldn’t hear …
>>: Something about a [indiscernible] result ten years ago saying that the neural networks, the size of
the weights is more important than num … than the size of the network.
>> Matus Telgarsky: The … you mean the magnitude of the weights.
>>: Yes. So can you kind of connect the dots? How does it connect to the results you presented here
today?
>> Matus Telgarsky: You mean how does it connect to these VC dimension results, or how does it
connect back to my …?
>>: Yeah, he showed that the complexity of kind of the class represented by neural nets is dependent
more on the—this is what he claimed in this paper—depends more on the size—the absolute fi …
>> Matus Telgarsky: Like the L1 norm or something.
>>: Sorry?
>> Matus Telgarsky: Like the L1 norm.
>>: Yeah, the L1 norms of the weights as opposed to the number of the weights. So how does it … kind
of how should we put it in context here?
>> Matus Telgarsky: Sure, so it’s possible that what I’m going to say is wrong, because I don’t know
exactly what he meant by that, but I’ll just tell you what I do understand. So it depends on what exactly
you’re giving a bound on. So the VC dimension is a bound on the classification error. If we were, for
instance, caring about the logistic loss, or the exponential loss, or hinge loss, or one of these, for these,
we should use something like a Rademacher complexity result. If we used Rademacher complexity instead,
where we think about the Rademacher complexity of this loss class—so I’m looking at the Rademacher
complexity of the loss composed with a predictor—in that case, then the norms of the weights will
come out naturally. So if you care about those kinds of problems—for instance, you care about
regression—then that quantity is the one that will come out naturally.
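As one concrete, well-known instance of how weight norms show up (not specific to the paper being asked about; this is the standard one-layer bound plus the Lipschitz contraction lemma): for the class of linear predictors x ↦ ⟨w, x⟩ with ‖w‖₁ ≤ B, on n points in dimension d with ‖x_i‖_∞ ≤ X,

\widehat{\mathcal{R}}_n \;\le\; B\, X\, \sqrt{\frac{2 \ln(2d)}{n}},
\text{and composing with a } \rho\text{-Lipschitz loss multiplies this by } \rho \text{ (up to constants); multilayer versions roughly stack such per-layer norm factors.}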
The reason I do consider it to be apples and oranges is because [indiscernible] really do care about
classification, and these are Theta bounds; they’re tight. So not only that, there really are cases where
you want some and not the others. So one thing I was telling Sebastien about earlier today is that this
fun—okay, I won’t jump to it—but the … that composed pyramid function I constructed, it’s got two to
the k up-and-downs in the interval; that means its Lipschitz constant is two to the k. So there are some
ways to analyze this class of functions that would just blow up the complexity, but if you care about
classification, then that’s not the right estimate. So I would say that it’s problem-dependent when you
think about the two different bounds, and they are—I would say—kind of apples and oranges. I don’t
find that answer entirely satisfactory; I just kind of summarized for you one of the
results, but …
>> Sebastien Bubeck: I think this is still a very interesting open question, actually.
>> Matus Telgarsky: Yeah.
>> Sebastien Bubeck: To understand precisely what is a norm that—you know—characterizes a capacity
of these neural networks.
>> Matus Telgarsky: Yeah.
>>: This is still essentially open. So what Peter Bartlett did, like—I guess—fifteen years ago, was indeed
either for the L1 norm over the entire network or the L2 norm over the entire network, but you could
think of things which are much finer, because—you know—the last layer and the first layer, they
shouldn’t be represented in the same way in this norm. So … but that’s still essentially open.
>> Matus Telgarsky: Oh, okay.
>>: Just—sorry—but one note about that result. Was that result, like, as a showing that if you have a
boundary dilemma? So [indiscernible]
>> Sebastien Bubeck: Right, so one … so you can express the Rademacher … so you can get a
dimension-free Rademacher complexity bound if you bound the weights of the entire network
either in L1 norm or in L2 norm, just like—you know—just like one layer. This is still true for multilayer,
but you could expect, presumably, to have a much bigger class, where you still have a dimension-free
bound on the Rademacher complexity, but where the norm is not going to be the same in the last layer
and the first layer. Does that make sense?
>> Matus Telgarsky: Actually, it … if I can make one more comment here: if you look at the original
Rademacher and Gaussian complexity paper—Bartlett-Mendelson—there’s a
very nice Rademacher complexity bound in there that I do not see discussed much—particularly, I never
see discussed in summaries of Rademacher complexity—on a two-layer network. So for a two-layer
network, so … one of the big—okay, so I was making it sound like VC versus Rademacher is a
question about whether you care about real-valued objects or zero-one objects, basically—but there’s
another big deal, which is, of course, Rademacher complexity’s distribution-dependent, and so he has a
bound in there that actually depends on sparsity structure a lot. And so then he has an … a really, really,
really nice theorem in there; basically, the Rademacher complexity you get is this s log n type of
effect that you always expect for sparse things, if you know what I mean. Instead of being root n,
it’s an s log n—so s is the number of nonzeroes. And so there is this other benefit of Rademacher
complexity that does allow us to have these nice distribution-dependent things.
>> Sebastien Bubeck: Alright, thanks Matus. [applause]