Chapter 5
Approximation Theory
5.1 Universality
Before we dive into the question of how to build and learn deeper neural networks, we would like to reassure ourselves that the representational power of architectures like the 3-layer MLP is sufficient. Which function classes can we approximate with neural networks, and how well?
5.1.1 Approximation & Density
=⇒ The best we can hope for is that the achievable is dense in the unachievable.
Uniform Metric How do we measure the quality of the approximation of some real-valued function f by some other function g? Ideally, one would like to ensure that g does not differ too much from f at any point of the domain S. This naturally leads to the infinity norm and its induced (extended) metric

def → \|f\|_\infty := \sup_{x \in S} |f(x)|, \qquad d(f, g) := \|f - g\|_\infty .   (infinity norm)
In order to make this a proper norm and metric, one usually considers bounded functions, such that \|f\|_\infty ≤ M < ∞. Using the infinity or uniform metric has the advantage of not depending on data distributions or measures, and it yields very strong guarantees. It is the criterion used in much of the classical literature in mathematical analysis.
The infinity norm is often used in combination with restrictions of functions to compact sets. In the Euclidean case, K ⊂ R^n being compact simply means that it is closed and bounded.
Example 5.1. The prototypical compact subsets of the real line are the closed, bounded intervals [a; b].
One can then define the norm and metric for the restrictions to K,

def → \|f\|_{\infty,K} := \sup_{x \in K} |f(x)|, \qquad d_K(f, g) := \|f - g\|_{\infty,K} .   (5.1)

As \|f\|_{\infty,K} = \|f_{|K}\|_\infty, the set K may sometimes be left tacit. Note that continuous functions are bounded on compact domains (a generalization of the extreme value theorem) and thus for f ∈ C(K) simply \|f\|_\infty = \max_{x \in K} |f(x)|.
Density Now assume we have a function class G with which we want to approximate functions f. The best we can hope for is to reach an error of \inf_{g \in G} d(f, g). We then say that f is approximated by G (to arbitrary precision) if the approximation error vanishes, i.e.

def → f \simeq G :\iff \inf_{g \in G} d(f, g) = 0 .   (approximation)

This is equivalent to there being a sequence of (approximating) functions g_m ∈ G, m = 1, 2, . . ., that converges uniformly to f,

⇒ g_m \overset{\infty}{\longrightarrow} f \iff f \simeq G .   (5.2)
Here uniform convergence means

def → g_m \overset{\infty}{\longrightarrow} f :\iff \forall \epsilon > 0 \; \exists M \ge 1 \; \forall m \ge M : \|g_m - f\|_\infty < \epsilon .

Clearly every representable f ∈ G is also approximated. However, if G is not closed, strictly more functions can be approximated. The largest class of approximated functions is the closure cl(G),

⇒ G \subseteq \mathrm{cl}(G) \simeq G \quad \text{and} \quad f \notin \mathrm{cl}(G) \implies f \not\simeq G .   (5.3)
One then also says that G is dense in F or has the density property with regard
to F, if F ⊆ cl(G), meaning all functions in F can be approximated by G.
Example 5.2. This generalizes the well-known density of the rational numbers in the reals, R = cl(Q). Every real number is the limit of some (non-unique) sequence of rational numbers; i.e., we can think of a real number as an equivalence class of such sequences.
Neural Networks One explicit way to construct function classes is from finitely parameterized models such as neural networks. We will work with MLPs with one hidden layer as a concrete example. We can then instantiate:

G_m := \{ g : g \text{ is realized by an MLP with } m \text{ hidden units} \}, \qquad G := \bigcup_{m=1}^{\infty} G_m .

Note that the G_m define a nested sequence G_m ⊂ G_{m+1}, yet we put no bound on m. This effectively allows us to choose m = m(ε) as a function of the desired approximation accuracy.
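For concreteness, here is a minimal sketch (our own illustration, not part of the original text; the logistic activation and the random parameters are arbitrary choices) of the map realized by a member of G_m: every setting of the parameters picks out one function g ∈ G_m.

import numpy as np

def logistic(z):
    # logistic (sigmoid) activation, one common choice for sigma
    return 1.0 / (1.0 + np.exp(-z))

def mlp_one_hidden(x, W, b, beta, b0):
    """Element of G_m: x in R^n  ->  sum_j beta_j * sigma(w_j . x + b_j) + b0."""
    # W: (m, n) hidden weights, b: (m,) hidden biases, beta: (m,) output weights
    h = logistic(x @ W.T + b)        # hidden activations, shape (..., m)
    return h @ beta + b0             # scalar output

# Example: a randomly parameterized member of G_5 on R^3
rng = np.random.default_rng(0)
m, n = 5, 3
W, b = rng.normal(size=(m, n)), rng.normal(size=m)
beta, b0 = rng.normal(size=m), 0.0
x = rng.normal(size=(10, n))                     # a batch of 10 inputs
print(mlp_one_hidden(x, W, b, beta, b0).shape)   # (10,)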
Continuous Functions As the continuous functions C(R^n) are of particular importance, it is common to use a title of grandeur for function classes that approximate them:

def → G is a universal approximator :\iff C(S) \simeq G(S) for any compact S ⊂ R^n,

where G(S) is the restriction of the functions in G to S. It will be a central result to establish that neural networks are universal approximators.
5.1.2 Weierstrass Theorem
=⇒ The classical result in approximation theory.
One of the most fundamental and classical results in approximation theory is the
uniform approximability of continuous functions on compacta by polynomials.
We will focus on the case of n = 1, i.e. on functions C(R).
Theorem 5.3 (Weierstrass Theorem). Polynomials P are dense in C(I), where
I = [a; b] for any a < b.
We will provide a sketch of the proof.
First note that w.l.o.g. we can focus on I = [0; 1], as

⇒ C([0; 1]) \simeq P([0; 1]) \implies C([a; b]) \simeq P([a; b]), \quad \forall a < b .   (5.4)

Proof.
1. Define the affine map φ : [a; b] → [0; 1], φ(t) = (t − a)/(b − a), and the induced map Φ : C([0; 1]) → C([a; b]), Φ(f) = f ∘ φ. Every F ∈ C([a; b]) arises this way, namely as Φ(F ∘ φ^{-1}).
2. If g_m \overset{\infty}{\longrightarrow} f ∈ C([0; 1]), then g_m ∘ φ \overset{\infty}{\longrightarrow} f ∘ φ ∈ C([a; b]), as the sup-norm is preserved under the reparameterization.
3. The result follows from observing that g_m ∘ φ is a polynomial if g_m is, since φ is affine.
Second, consider the Bernstein basis polynomials

def → b^m_k(x) = \binom{m}{k} x^k (1 - x)^{m-k} .   (5.5)

We know them from the binomial probability distribution, where x is the (fixed) success probability and k is the number of successes out of m trials. They form a partition of unity, i.e. a convex combination for every x,
⇒ \sum_{k=0}^{m} b^m_k(x) = 1 \quad \forall x \in [0; 1] .   (5.6)
We use the basis polynomials to convexly combine function values on a lattice with spacing 1/m and define the Bernstein polynomial

def → q_m(x) = \sum_{k=0}^{m} f\!\left(\tfrac{k}{m}\right) b^m_k(x), \qquad b^m_k(x) = \binom{m}{k} x^k (1-x)^{m-k},   (5.7)

which is a polynomial of degree m.
Third, we show that q_m \overset{\infty}{\longrightarrow} f by looking (independently) at the residuals

|f(x) - q_m(x)| = \left| \sum_{k=0}^{m} r^m_k(x) \right|, \qquad r^m_k(x) := \left( f(x) - f\!\left(\tfrac{k}{m}\right) \right) b^m_k(x) .   (5.8)

Informally, the law of large numbers implies that the probability mass of the binomial distribution will concentrate around k/m ≈ x as m → ∞. This basic observation is at the core of the proof.
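As a quick numerical illustration (a sketch we add here, assuming f(x) = |x − 1/2| as an example target; SciPy's binomial pmf is used because it coincides with the Bernstein basis b^m_k(x)):

import numpy as np
from scipy.stats import binom

def bernstein_poly(f, m, x):
    """Evaluate the degree-m Bernstein polynomial q_m of f at points x in [0, 1]."""
    k = np.arange(m + 1)
    # basis b_k^m(x) = C(m,k) x^k (1-x)^(m-k) is exactly the Binomial(m, x) pmf at k
    basis = binom.pmf(k[None, :], m, np.asarray(x, float)[:, None])   # (len(x), m+1)
    return basis @ f(k / m)

f = lambda t: np.abs(t - 0.5)              # continuous, but not smooth at 1/2
xs = np.linspace(0.0, 1.0, 1001)
for m in (10, 100, 1000):
    err = np.max(np.abs(f(xs) - bernstein_poly(f, m, xs)))
    print(f"m = {m:4d}   sup-error = {err:.4f}")   # decreases towards 0, slowly for this f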
Proof of Weierstrass Theorem. The proof proceeds in the following steps:
1. Each residual r^m_k(x) is the product of two factors. It will vanish if either one vanishes (while the counterpart remains bounded).

2. Choose δ > 0 such that |x − y| ≤ δ implies |f(x) − f(y)| ≤ ε/2, which is possible as f is (uniformly) continuous on [0; 1]. Consider the lattice points I := {k : |x − k/m| ≤ δ}; then

\sum_{k \in I} |r^m_k(x)| \le \frac{\epsilon}{2} \sum_{k \in I} b^m_k(x) \le \frac{\epsilon}{2} \sum_{k=0}^{m} b^m_k(x) = \frac{\epsilon}{2} .

3. Now look at the lattice points in I^c. As f is bounded (being continuous on a compact set), there is an R > 0 such that

\sum_{k \notin I} r^m_k(x) \le \sum_{k \notin I} |r^m_k(x)| \le R \sum_{k \notin I} b^m_k(x) \overset{!}{<} \frac{\epsilon}{2} .

4. For the final inequality, one may introduce the factor (x − k/m)^2 / \delta^2 > 1 for k ∉ I and use that

\sum_{k=0}^{m} \left( x - \frac{k}{m} \right)^2 b^m_k(x) = \frac{x(1-x)}{m},

which is a variance formula (we will not prove it here).

5. Then, for m large enough, namely m > \frac{R}{2 \epsilon \delta^2},

R \sum_{k \notin I} b^m_k(x) \le \frac{R}{\delta^2} \sum_{k=0}^{m} \left( x - \frac{k}{m} \right)^2 b^m_k(x) = \frac{R \, x(1-x)}{m \delta^2} \le \frac{R}{4 m \delta^2} < \frac{\epsilon}{2} .

Combining steps 2 and 5, |f(x) − q_m(x)| ≤ ε for all x ∈ [0; 1] once m > R/(2εδ^2), which establishes uniform convergence.
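The variance formula used in step 4 is just the variance of a Binomial(m, x) count rescaled by 1/m²; a quick numerical sanity check (our own sketch):

import numpy as np
from scipy.stats import binom

m, x = 50, 0.3
k = np.arange(m + 1)
basis = binom.pmf(k, m, x)                 # Bernstein basis values b_k^m(x)
lhs = np.sum((x - k / m) ** 2 * basis)     # sum_k (x - k/m)^2 b_k^m(x)
rhs = x * (1 - x) / m                      # variance formula
print(lhs, rhs)                            # both equal to about 0.0042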
5.1.3 Spans of Smooth Functions
=⇒ Everything, but a polynomial.
Let us now consider an arbitrary smooth function σ ∈ C^∞(R) and the span of its compositions with affine functions,

def → G^1_\sigma = \{ g : g(x) = \sigma(a x + b) \text{ for some } a, b \in \mathbb{R} \},   (5.9)
       H^1_\sigma = \mathrm{span}(G^1_\sigma) .   (5.10)
Then the following holds:

Theorem 5.4 (Leshno, Lin, Pinkus, Schocken, 1993). For any σ ∈ C^∞(R) that is not a polynomial, H^1_σ is a universal approximator.
Proof. We proceed backwards for didactic reasons.

1. It is sufficient to show that polynomials can be approximated,

P = \left\{ \sum_{k=0}^{r} \alpha_k x^k \;:\; \alpha_k \in \mathbb{R},\; r \ge 0 \right\} \subseteq \mathrm{cl}(H^1_\sigma).

We can then invoke the Weierstrass theorem along with the subadditivity of the sup-norm (triangle inequality) to prove the claim.
2. Since H^1_σ is a linear span, it is sufficient to show that monomials can be approximated,

\{ x^k : k \ge 0 \} \subset \mathrm{cl}(H^1_\sigma).

This is because if h_k ∈ H^1_σ are ε_k-approximations of x^k, then for h(x) := \sum_{k=0}^{r} \alpha_k h_k(x) \in H^1_\sigma one obtains by virtue of subadditivity

\left\| \sum_{k=0}^{r} \alpha_k x^k - h(x) \right\|_\infty \le \sum_{k=0}^{r} |\alpha_k| \, \| x^k - h_k(x) \|_\infty < \sum_{k=0}^{r} |\alpha_k| \, \epsilon_k =: \epsilon .
3. As σ ∈ C^∞(R), all its derivatives with respect to a exist; namely, with z = ax + b,

\frac{d^k \sigma(z)}{d a^k} = x^k \sigma^{(k)}(z), \qquad \left. \frac{d^k \sigma(z)}{d a^k} \right|_{a=0} = x^k \sigma^{(k)}(b) .

Note that if σ were a polynomial of degree less than k, we would get σ^{(k)} ≡ 0. It turns out that a converse also holds: if σ ∉ P, then for every k there is a point at which σ^{(k)} does not vanish (see the theorem below). We can thus choose b_k such that σ^{(k)}(b_k) ≠ 0.
4. Let us focus on k = 1. The derivative can be uniformly approximated on any compact set by a finite difference,

\frac{\sigma((a+h)x + b) - \sigma(ax + b)}{h} \;\overset{\infty}{\longrightarrow}\; \frac{d \sigma(ax+b)}{da} \quad \text{as } h \to 0 .

The left-hand side is indeed a linear combination of σ-ridge functions. Concretely, we can choose b such that σ'(b) ≠ 0 and define the sequence

h_m(x) = \frac{m}{\sigma'(b)} \left[ \sigma\!\left( \tfrac{x}{m} + b \right) - \sigma(b) \right] \;\overset{\infty}{\longrightarrow}\; x .

Similar expressions exist for higher-order derivatives (skipped for the sake of brevity) and provide uniform approximations of the monomials, x^k \simeq H^1_\sigma.
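As a sanity check (our own sketch, assuming the logistic function for σ and b = 0), the sequence h_m indeed converges uniformly to the identity on a compact interval:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))    # logistic activation, one admissible choice

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

b = 0.0                                 # any b with sigma'(b) != 0 works; sigma'(0) = 1/4
x = np.linspace(-3.0, 3.0, 601)         # a compact interval
for m in (1, 10, 100, 1000):
    h_m = m / sigma_prime(b) * (sigma(x / m + b) - sigma(b))
    # the sup-error shrinks roughly like 1/m^2 for the logistic sigma
    print(f"m = {m:5d}   sup |h_m(x) - x| = {np.max(np.abs(h_m - x)):.5f}")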
The missing step above follows from:

Theorem 5.5 (Donoghue, 1969; Pinkus, 1999). If σ is C^∞ on (a; b) and it is not a polynomial thereon, then there exists a point x_0 ∈ (a; b) such that σ^{(k)}(x_0) ≠ 0 for all k = 0, 1, 2, . . .

The above result can be further generalized; in fact, it can be shown that the smoothness assumption is not necessary and that σ ∈ C(R) \ P is sufficient. See, for instance, Proposition 3.7 of [59].
What we have shown now is that for the case of n = 1 (one-dimensional
inputs), an MLP with smooth activation function σ is a universal approximator,
unless σ is a polynomial. This is because the linear output layer is picking an
element in the span of the hidden units, each one of which computes a function
in Gσ1 . So as far as universal function approximation is concerned, there is
nothing special about the logistic function or the hyperbolic tangent as choices
of activation functions.
5.1.4 Universality of Ridge Functions
=⇒ Ridge functions are rich functions.
One basic tool for lifting results from one dimension to higher dimensions is ridge functions. They are universal approximators. Define

def → G^n_\sigma = \{ g : g(x) = \sigma(\theta \cdot x), \; \theta \in \mathbb{R}^n \}, \qquad H^n_\sigma = \mathrm{span}(G^n_\sigma),
       G^n = \bigcup_{\sigma \in C(\mathbb{R})} G^n_\sigma, \qquad H^n = \mathrm{span}(G^n) .   (5.11)
Theorem 5.6 (Vostrecov and Kreines, 1961). Hn is a universal function approximator.
Proof. Omitted.
Remark 5.7. In fact, it can be shown that it is sufficient to consider weights θ ∈ Θ, where Θ is such that no homogeneous polynomial vanishes on Θ. We do not need this stronger version here, but will assume that we can choose θ ∈ S^{n−1}.
Note that one can absorb linear combination weights into the ridge functions, so that

⇒ H^n = \left\{ h : h = \sum_{k=1}^{m} g_k, \; g_k \in G^n, \; m \in \mathbb{N} \right\} .   (5.12)
So this theorem can be used to show that 3-layer neural networks with adaptive
activation functions would be universal approximators. However, this in itself
is not practical as we do not know how to represent all C(R) functions that we
may want to choose for σ.
5.1.5 Dimension Lifting
=⇒ Heavy lifting, but elegantly.
The following beautiful theorem will provide the missing link that connects the above result with the universality result we have for n = 1.

Theorem 5.8 (Pinkus 1999, [59] Proposition 3.3). H^1_\sigma universal for C(\mathbb{R}) \implies H^n_\sigma universal for C(\mathbb{R}^n) for any n ≥ 1.
Proof sketch.

1. Fix f and a compact K ⊂ R^n. By Theorem 5.6 we can find ridge functions g_k and directions θ_k such that

\left| f(x) - \sum_{k=1}^{m} g_k(\theta_k \cdot x) \right| < \frac{\epsilon}{2} \quad (\forall x \in K) .

2. Since K is compact, θ_k · x ∈ [α_k, β_k] for x ∈ K.

3. Because H^1_σ is dense in each C([α_k, β_k]), we can find constants such that

\left| g_k(z) - \sum_{j=1}^{m_k} c_{kj} \, \sigma(a_{kj} z + b_{kj}) \right| \le \frac{\epsilon}{2m} \quad (\forall z \in [\alpha_k, \beta_k], \; k = 1, \dots, m) .

4. Plugging things together yields the result; by the triangle inequality the total error is bounded by ε/2 + m · ε/(2m) = ε.
So to summarize the argument: (1) For n = 1, an MLP with any continuous,
non-polynomial activation function is a universal approximator. (2) Spans of
ridge functions are universal approximators for C(Rn ). (3) The non-linear part
of any/each ridge function can be approximated by (1). (4) Hence MLPs are
universal function approximators.
5.2 Complexity
=⇒ Universality!? At what price?!
As we have seen, there is no need to worry that function classes represented by neural networks are not rich enough. Universality is the rule rather than the exception. But important questions remain: (1) How many units or parameters are required to obtain a desired approximation accuracy? (2) Is there an advantage of compositionality (multiple layers) over MLPs with a single hidden layer, i.e. function classes like span(G^n_σ)? These are the two questions investigated in this section.
5.2.1 Barron’s Theorem
We will state and discuss, without proof, a (simplified) version of the famous result of [5], which relates the approximation residual to the number of sigmoidal neurons in the (single) hidden layer.
Fourier Transform We need the concept of a Fourier transform and some basic understanding of Fourier analysis. For any absolutely integrable f, i.e.

\int_{\mathbb{R}^n} |f(x)| \, dx < \infty,

define

def → \hat{f}(\xi) = \int_{\mathbb{R}^n} e^{-2\pi i \, \xi \cdot x} f(x) \, dx, \qquad \hat{f} : \mathbb{R}^n \to \mathbb{C} .   (Fourier transform)
Intuitively, the Fourier transform represents the frequency components of a function. Specifically, note that every function is the sum of an even and an odd function,

f(x) = \frac{f(x) + f(-x)}{2} + \frac{f(x) - f(-x)}{2} .   (5.13)

The real part of \hat{f} will only depend on the even part of f, whereas the imaginary part will depend on its odd part, as by Euler’s formula

e^{ia} = \cos(a) + i \sin(a) .
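A small numerical illustration (our own sketch, using the discrete Fourier transform as a stand-in for the continuous one): the spectrum of the even part of a real signal is purely real, that of the odd part purely imaginary, up to rounding.

import numpy as np

rng = np.random.default_rng(0)
N = 64
f = rng.normal(size=N)                   # a generic real "function" on a periodic grid

f_reflect = np.roll(f[::-1], 1)          # circularly reflected signal f(-x)
f_even = 0.5 * (f + f_reflect)
f_odd  = 0.5 * (f - f_reflect)

F_even, F_odd = np.fft.fft(f_even), np.fft.fft(f_odd)
print(np.max(np.abs(F_even.imag)))       # numerically zero: even part -> real spectrum
print(np.max(np.abs(F_odd.real)))        # numerically zero: odd part  -> imaginary spectrum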
Regularity condition The theorem below applies to functions g that have a Fourier transform \hat{g} such that

def → C_g := \int_{\mathbb{R}^n} \|\omega\| \, |\hat{g}(\omega)| \, d\omega < \infty .   (gradient regularity)

Note that for differentiable g, the Fourier transform of the gradient function is given by

⇒ \widehat{\nabla g}(\omega) = \omega \, \hat{g}(\omega) .   (5.14)

So the condition C_g < ∞ can be interpreted as stating that the Fourier transform of the gradient function has to be absolutely integrable. However, the condition is also meaningful for non-differentiable g (see the discussion in [5]).
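For intuition, consider the (assumed) example g(x) = e^{−πx²} in one dimension, whose Fourier transform under the convention above is again e^{−πω²}; the constant C_g then equals 1/π, which a short numerical sketch confirms:

import numpy as np

# Assumed example: g(x) = exp(-pi x^2) in 1-D has Fourier transform g_hat(w) = exp(-pi w^2)
# under the e^{-2 pi i w x} convention, so C_g = int |w| exp(-pi w^2) dw = 1/pi.
w = np.linspace(-20.0, 20.0, 400001)
g_hat = np.exp(-np.pi * w ** 2)
C_g = np.sum(np.abs(w) * g_hat) * (w[1] - w[0])   # simple Riemann sum
print(C_g, 1.0 / np.pi)                           # both approximately 0.3183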
Main theorem The theorem below applies to the logistic activation function and more generally to any bounded (measurable) and monotonic function σ such that σ(t) → 1 as t → ∞ and σ(t) → 0 as t → −∞.

Theorem 5.9 (Barron, 1993). For every g : R^n → R with finite C_g and any r > 0, there is a sequence of MLP functions f_m of the form

f_m(x) = \sum_{j=1}^{m} \beta_j \, \sigma(\theta_j \cdot x + b_j) + b_0

such that

\int_{rB} (g(x) - f_m(x))^2 \, \mu(dx) \le O\!\left( \frac{1}{m} \right),

where rB = \{x \in \mathbb{R}^n : \|x\| \le r\} and µ is any probability measure on rB.
Interpretation The most striking aspects of the accuracy bound in the theorem and its (omitted) proof are:

1. The lack of dependency on n, the dimensionality of the input. It shows that MLPs do not suffer from the curse of dimensionality when approximating functions that fulfill (gradient regularity).

2. The freedom in the choice of the measure (or data distribution).

3. The remarkable drop in approximation error ∝ 1/m.

4. Additional bounds and constraints on the parameters can be imposed.

5. Moreover, the proof uses an iterative construction that successively adds units to fit residuals.

To appreciate the first point, note that for any fixed set of m basis functions the best achievable approximation order is (1/m)^{2/n}, which is much worse for large n. Details can be found in [5], Section X.
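To put the comparison into numbers (a quick sketch of our own): for n = 20 input dimensions and m = 1024 units, the two rates differ dramatically.

m, n = 1024, 20
mlp_rate = 1.0 / m                          # Barron's bound for m sigmoidal units
fixed_basis_rate = (1.0 / m) ** (2.0 / n)   # best order for any fixed m-term basis
print(mlp_rate)           # 0.00098
print(fixed_basis_rate)   # 0.5, i.e. barely any progress in 20 dimensions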
5.2.2 Depth Separation
=⇒ Why go deep, if shallow is universal?
Prologue There is clear and consistent evidence that deeper networks yield better approximations for many practical problems involving large data sets. However, classical results on universality have largely focused on showing the strength of shallow models such as MLPs with a single hidden layer. This raises the question of whether deep networks offer representational benefits. It is fair to say that – as of today – there is no comprehensive theory of why deeper models are preferred and when, but there are interesting pieces of the puzzle. We will present an example due to [19] that shows the benefit of a compositional (2 hidden layer) architecture over a shallow (1 hidden layer) one. This yields a paradigmatic example along with an illuminating construction.
The key idea is very simple. Define a radial function g(x) = ψ(\|x\|) that can be naturally approximated by first approximating the norm, or some one-to-one function thereof (via the span of the first hidden layer), and then approximating ψ (via the span of the second hidden layer). It should be clear that the norm can be approximated in a dimension-efficient manner. One needs an MLP approximation of the square, f_m(z) ≈ z², and can form

\bar{f}_{mn}(x) = \sum_{i=1}^{n} f_m(x_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} \beta_j \, \sigma(\theta_j x_i + b_j) \approx \|x\|^2 .   (5.15)

This efficient separation across dimensions may get lost, as we will see, when trying to approximate g with an MLP with a single hidden layer.
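A minimal sketch of this construction (illustrative only, not the construction from [19]; here f_m is obtained by a simple random-feature least-squares fit of the square with logistic units, since the argument only needs some one-dimensional approximation of z²):

import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# Step 1: fit f_m(z) ~ z^2 on [-1, 1] with m fixed random logistic ridge features.
m = 50
theta = rng.normal(scale=3.0, size=m)
b = rng.normal(size=m)
z = np.linspace(-1.0, 1.0, 400)
features = sigma(z[:, None] * theta + b)                # (400, m)
beta, *_ = np.linalg.lstsq(features, z ** 2, rcond=None)
print("1-D fit error:", np.max(np.abs(features @ beta - z ** 2)))

# Step 2: sum the same 1-D approximation over coordinates, as in eq. (5.15).
def f_bar(x):
    h = sigma(x[..., None] * theta + b)                 # (..., n, m)
    return np.sum(h @ beta, axis=-1)                    # approximates ||x||^2

n = 10
x = rng.uniform(-1.0, 1.0, size=(5, n)) / np.sqrt(n)    # keep coordinates well inside [-1, 1]
print(np.c_[f_bar(x), np.sum(x ** 2, axis=-1)])         # approximation vs. exact squared norm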
Setup Recall that an MLP with one hidden layer of width m implements a function

f \in \mathrm{span}\{ \sigma_j(x) := \sigma(\theta_j \cdot x), \; 1 \le j \le m \} \subset \mathrm{span}(G^n_\sigma) .   (5.16)

We are interested in the L_2 loss between f and some target function g with regard to some measure with density φ² (i.e. \int \phi^2(x)\,dx = 1),

\ell^\phi(f, g) := \int (f(x) - g(x))^2 \phi^2(x)\, dx = \int (f(x)\phi(x) - g(x)\phi(x))^2 dx   (5.17)
\qquad\quad = \| f\phi - g\phi \|^2_{L_2} \overset{(1)}{=} \| \widehat{f\phi} - \widehat{g\phi} \|^2_{L_2} \overset{(2)}{=} \| \hat{f} \ast \hat{\phi} - \hat{g} \ast \hat{\phi} \|^2_{L_2},

where \hat{h} denotes the Fourier transform of a function h (if h is square integrable) or its generalized Fourier transform¹, and (1) holds because of the unitarity of the Fourier transform (Parseval identity), whereas (2) follows from the convolution theorem. The goal will be to choose a density φ and a function g such that a small \ell^\phi(f, g) can only be achieved with (loosely speaking) exponential growth in m.

¹ Through the use of tempered distributions; see [34, Chapter 11].
Spans of ridge functions: the weakness Let us first discuss a limitation of spans of ridge functions that will be exploited as a weakness. We are interested in the (generalized) Fourier transform of f (as above) as well as that of fφ. In particular, we are interested in the support of \hat{f} \ast \hat{\phi}. The choice of φ used in the construction will be such that \hat{\phi} = \mathbf{1}[B^n], the indicator of the unit ball. This implies that φ is isotropic and band-limited. A visualization of φ is shown in Figure 5.1.

[Figure 5.1: The function φ that defines the density φ², visualized in 2 dimensions. Figure taken from [19].]
We first want to re-express the fact that a single ridge function σ(θ · x) is only sensitive to changes in a single direction θ in terms of its Fourier transform. This translates into a constraint on the support of \hat{\sigma} as follows:

⇒ \mathrm{supp}(\hat{\sigma}) = \mathrm{span}\{\theta\} .   (5.18)

Non-technically speaking, this is because ridge functions can be represented as a superposition of sine waves, all with spatial frequencies ω ∝ θ. If we convolve \hat{\sigma} with \hat{\phi} as above, then the support becomes

⇒ \mathrm{supp}(\hat{\sigma} \ast \hat{\phi}) = \mathrm{span}\{\theta\} + B,   (5.19)

which is a tube of radius 1. Finally note that

⇒ \mathrm{supp}(\hat{f} \ast \hat{\phi}) = \bigcup_j \big( \mathrm{span}\{\theta_j\} + B \big),   (5.20)

which follows from linearity,

\hat{f} \ast \hat{\phi} = \sum_j \beta_j \, (\hat{\sigma}_j \ast \hat{\phi}), \quad \text{s.t. } (\hat{f} \ast \hat{\phi})(\xi) = 0 \text{ if } (\hat{\sigma}_j \ast \hat{\phi})(\xi) = 0 \;\; (\forall j) .   (5.21)
So the frequency components of a function f ∈ span(G^n_σ) have a peculiar structure. In order to be sensitive to all spatial frequencies, one has to choose m large enough so that \mathrm{supp}(\widehat{f\phi}) \supseteq rB as r grows. Note that if \mathrm{supp}(\widehat{f\phi}) \not\supseteq rB, then there would be frequencies ξ ∈ rB representing oscillations that \widehat{f\phi} could not capture or approximate. In fact one can show the following volume ratio formula as n → ∞ [19]:

\frac{V\big(\mathrm{supp}(\hat{f} \ast \hat{\phi}) \cap rB\big)}{V(rB)} \lesssim m \, e^{-n} .

[Figure 5.2: Schematic view of the construction of a target function g in d dimensions.]
Designing the target g It now remains to choose a target function that is difficult to approximate by an MLP with one hidden layer, yet can be efficiently approximated by an MLP with two hidden layers. The key idea is to choose a radial function, g(x) = ψ(\|x\|) or – as a function composition – g = ψ ∘ \|\cdot\|. In a two-hidden-layer network, the span of the first layer would approximate the norm function and the span of the second the univariate function ψ. Intuitively speaking, one needs to choose ψ so that it puts enough mass far away from the origin. The technical details are involved, but one construction that works uses random sign indicator functions of thin shells. For instance, if \|x\| ≤ 1, then one can choose a partition {∆_i} of [0; 1] and define

\psi(z) = \sum_{i=1}^{N} \epsilon_i \, \psi_i(z), \qquad \psi_i(z) = \mathbf{1}\{z \in \Delta_i\}, \qquad \epsilon_i \in \{-1, 1\} .   (5.22)

The sign flips generate the oscillations. Although ψ and all ψ_i are discontinuous, an approximation by Lipschitz functions with regard to the chosen measure is possible. For further details we refer to [19]. A 2d-sketch of such a function is shown in Figure 5.2.
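A small sketch of such a target (illustrative only; the number of shells, the uniform partition, and the radius of the test points are arbitrary choices of ours, not those of [19]):

import numpy as np

rng = np.random.default_rng(0)
N = 40                                          # number of thin shells
edges = np.linspace(0.0, 1.0, N + 1)            # partition {Delta_i} of [0, 1]
signs = rng.choice([-1.0, 1.0], size=N)         # random epsilon_i

def psi(z):
    """Piecewise-constant radial profile: random sign on each shell, 0 outside [0, 1]."""
    z = np.asarray(z, dtype=float)
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, N - 1)
    return np.where((z >= 0.0) & (z <= 1.0), signs[idx], 0.0)

def g(x):
    """Radial target g(x) = psi(||x||); oscillates rapidly across shells."""
    return psi(np.linalg.norm(x, axis=-1))

x = rng.normal(size=(5, 20))
x /= np.linalg.norm(x, axis=-1, keepdims=True) * 2.0   # place points at radius 1/2
print(g(x))                                            # values in {-1, +1}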
Exponential Separability Assume that the ridge function used as activation function has polynomially bounded growth, |σ(z)| ≤ C(1 + |z|)^α for some constants C, α > 0. Moreover, there is a technical Lipschitz condition (Assumption 1 in [19]) leading to a constant c, which is fulfilled by standard activation functions such as the sigmoid functions introduced in the chapter.

Theorem 5.10 ([19]). For n ≥ C there exists a probability measure µ with density φ² and a function g with the following properties:

1. g is bounded in [−2; 2], supported on \{x : \|x\| \le C \sqrt{n}\}, and expressible by a 2-hidden-layer network of width C c n^{19/4}.

2. Every function f ∈ \mathrm{span}(G^n_\sigma) of width m ≤ c e^{c n} satisfies

\mathbb{E}_{x \sim \mu}\left[ (f(x) - g(x))^2 \right] \ge c .