Chapter 5  Approximation Theory

5.1  Universality

Before we dive into the question of how to build and learn deeper neural networks, we would like to reassure ourselves that the representational power of architectures like the 3-layer MLP is sufficient. Which function classes can we approximate with neural networks, and how well?

5.1.1  Approximation & Density

⇒ The best we can hope for is that the achievable is dense in the unachievable.

Uniform Metric   How do we measure the quality of the approximation of some real-valued function f by some other function g? Ideally, one would like to ensure that there are no points in the domain S of f at which g differs too much from f. This naturally leads to the infinity norm and its induced (extended) metric

    $\|f\|_\infty := \sup_{x \in S} |f(x)|, \qquad d(f, g) := \|f - g\|_\infty$.    (infinity norm)

In order to make this a proper norm and metric, one usually considers bounded functions, for which $\|f\|_\infty \le M < \infty$. Using the infinity (or uniform) metric has the advantage of not depending on data distributions or measures, and it yields very strong guarantees. It is the criterion used in much of the classical literature in mathematical analysis.

The infinity norm is often used in combination with restrictions of functions to compact sets. In the Euclidean case, $K \subset \mathbb{R}^n$ being compact simply means it is closed and bounded.

Example 5.1. The closed intervals [a; b] are the prototypical compact subsets of the real line.

One can then define the norm and metric for the restrictions to K,

    $\|f\|_{\infty,K} := \sup_{x \in K} |f(x)|, \qquad d_K(f, g) := \|f - g\|_{\infty,K}$.    (5.1)

As $\|f\|_{\infty,K} = \|f_{|K}\|_\infty$, K may sometimes be tacit. Note that continuous functions are bounded on compact domains (a generalization of the extreme value theorem), and thus for $f \in C(K)$, simply $\|f\|_\infty = \max_{x \in K} |f(x)|$.

Density   Now assume we have a function class G with which we want to approximate functions f. The best we can hope for is to reach an error of $\inf_{g \in G} d(f, g)$. We then say that f is approximated by G (to arbitrary precision) if the approximation error vanishes, i.e.

    $f \simeq G \;:\iff\; \inf_{g \in G} d(f, g) = 0$.    (approximation)

This is equivalent to there being a sequence of (approximating) functions $g_m \in G$, $m = 1, 2, \dots$, that converges uniformly to f,

    $g_m \xrightarrow{\infty} f \;\iff\; f \simeq G$.    (5.2)

Here uniform convergence means

    $g_m \xrightarrow{\infty} f \;:\iff\; \forall \epsilon > 0 \;\exists M \ge 1 \;\forall m \ge M: \|g_m - f\|_\infty < \epsilon$.

Clearly, every representable $f \in G$ is also approximated. However, if G is not closed, strictly more functions can be approximated. The largest class of approximated functions is the closure cl(G),

    $G \subseteq \mathrm{cl}(G) \simeq G$  and  $f \notin \mathrm{cl}(G) \implies f \not\simeq G$.    (5.3)

One then also says that G is dense in F, or has the density property with regard to F, if $F \subseteq \mathrm{cl}(G)$, meaning all functions in F can be approximated by G.

Example 5.2. This generalizes the well-known density of the rational numbers in the reals, $\mathbb{R} = \mathrm{cl}(\mathbb{Q})$. Every real number is a limit point of some (non-unique) sequence of rational numbers, i.e. we can think of a real number as the equivalence class of such sequences.

Neural Networks   One explicit way to construct function classes is from finitely parameterized models such as neural networks. We will work with MLPs with one hidden layer as a concrete example. We can then instantiate

    $G_m := \{g : g \text{ is realized by an MLP with } m \text{ hidden units}\}, \qquad G := \bigcup_{m=1}^{\infty} G_m$.

Note that $G_m$ defines a nested sequence $G_m \subset G_{m+1}$, yet we put no bound on m. This effectively allows us to choose $m = m(\epsilon)$ as a function of the desired approximation accuracy $\epsilon$.
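As a concrete handle on these definitions, here is a minimal numerical sketch (not part of the original notes): it evaluates the uniform distance $d_K(f,g)$ on a grid over a compact set K and samples a few random members of the nested classes $G_2 \subset G_8 \subset G_{32}$ of one-hidden-layer MLPs. The names `mlp` and `sup_distance` are ad hoc choices for illustration.

```python
# Minimal sketch (illustrative only): estimating the uniform distance d_K(f, g)
# on K = [a, b] by a max over a fine grid, with g drawn from the class G_m of
# one-hidden-layer MLPs with m tanh units. Names are ad hoc, not from the text.
import numpy as np

def mlp(x, W, b, beta):
    """One-hidden-layer MLP: x -> sum_j beta_j * tanh(W_j * x + b_j)."""
    return np.tanh(np.outer(x, W) + b) @ beta

def sup_distance(f, g, a=0.0, b=1.0, n_grid=10_001):
    """Grid approximation of ||f - g||_{infinity, [a, b]}."""
    xs = np.linspace(a, b, n_grid)
    return np.max(np.abs(f(xs) - g(xs)))

rng = np.random.default_rng(0)
f = np.sin  # target function, continuous on the compact set [0, 1]

for m in [2, 8, 32]:  # members of the nested classes G_2 ⊂ G_8 ⊂ G_32
    W, b_, beta = rng.normal(size=m), rng.normal(size=m), rng.normal(size=m) / m
    g = lambda x, W=W, b_=b_, beta=beta: mlp(x, W, b_, beta)
    print(m, sup_distance(f, g))  # a random g is a poor approximation; density
                                  # only asserts the infimum over all g in G is zero
```

Note the last comment: density is a statement about the infimum over the class, not about any particular (here randomly drawn) member.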
Continuous Functions   As the continuous functions $C(\mathbb{R}^n)$ are of particular importance, it is common to use a title of grandeur for function classes that approximate them:

    G is a universal approximator $:\iff$ $C(S) \simeq G(S)$ for any compact $S \subset \mathbb{R}^n$,

where G(S) is the restriction of the functions in G to S. It will be a central result to establish that neural networks are universal approximators.

5.1.2  Weierstrass Theorem

⇒ The classical result in approximation theory.

One of the most fundamental and classical results in approximation theory is the uniform approximability of continuous functions on compacta by polynomials. We will focus on the case of n = 1, i.e. on functions in C(R).

Theorem 5.3 (Weierstrass Theorem). The polynomials $\mathcal{P}$ are dense in C(I), where I = [a; b] for any a < b.

We will provide a sketch of the proof. First note that w.l.o.g. we can focus on I = [0; 1], as

    $C([0;1]) \simeq \mathcal{P}([0;1]) \implies C([a;b]) \simeq \mathcal{P}([a;b]) \quad \forall a < b$.    (5.4)

Proof.
1. Define the affine bijection $\varphi: [0;1] \to [a;b]$, $\varphi(t) = (1-t)a + tb$, and the map $\Phi: C([a;b]) \to C([0;1])$, $\Phi(f) = f \circ \varphi$.
2. If $g_m \xrightarrow{\infty} f \circ \varphi \in C([0;1])$, then $g_m \circ \varphi^{-1} \xrightarrow{\infty} f \in C([a;b])$.
3. The result follows from observing that $g_m \circ \varphi^{-1}$ is a polynomial if $g_m$ is, since $\varphi^{-1}$ is affine.

Second, consider the Bernstein basis polynomials

    $b^m_k(x) := \binom{m}{k} x^k (1-x)^{m-k}$.    (5.5)

We know them from the binomial probability distribution, where x is the (fixed) success probability and k is the number of successes out of m trials. They form a partition of unity, i.e. a convex combination for every x,

    $\sum_{k=0}^{m} b^m_k(x) = 1 \quad \forall x \in [0;1]$.    (5.6)

We use the basis polynomials to convexly combine function values on a lattice with spacing 1/m and define the Bernstein polynomials

    $q_m(x) := \sum_{k=0}^{m} f\!\left(\tfrac{k}{m}\right) b^m_k(x)$,    (5.7)

which are polynomials of degree at most m.

Third, we show that $q_m \xrightarrow{\infty} f$ by looking (independently) at the residuals

    $|f(x) - q_m(x)| = \left| \sum_{k=0}^{m} r^m_k(x) \right|, \qquad r^m_k(x) := \left( f(x) - f\!\left(\tfrac{k}{m}\right) \right) b^m_k(x)$.    (5.8)

Informally, the law of large numbers implies that the probability mass of the binomial distribution concentrates around $k/m \approx x$ as $m \to \infty$. This basic observation is at the core of the proof.

Proof of Weierstrass Theorem. The proof proceeds in the following steps:

1. Each residual is the product of two factors. It will vanish if either one vanishes (while the counterpart remains bounded).

2. Choose δ such that $|x - y| \le \delta$ implies $|f(x) - f(y)| \le \epsilon/2$. This is possible as f is (uniformly) continuous on [0; 1]. Consider the lattice points $I := \{k : |x - \tfrac{k}{m}| \le \delta\}$; then

    $\left| \sum_{k \in I} r^m_k(x) \right| \le \frac{\epsilon}{2} \sum_{k \in I} b^m_k(x) \le \frac{\epsilon}{2} \sum_{k=0}^{m} b^m_k(x) = \frac{\epsilon}{2}$.

3. Now look at the lattice points in $I^c$. By boundedness of f on [0; 1] there is an $R > 0$ such that

    $\left| \sum_{k \notin I} r^m_k(x) \right| \le \sum_{k \notin I} |r^m_k(x)| \le R \sum_{k \notin I} b^m_k(x) \overset{!}{<} \frac{\epsilon}{2}$.

4. For the final inequality, one may introduce $(x - \tfrac{k}{m})^2 > \delta^2$ and use

    $\sum_{k=0}^{m} \left( x - \tfrac{k}{m} \right)^2 b^m_k(x) = \frac{x(1-x)}{m}$,

which is a variance formula (we will not prove it here).

5. Then, for m large enough,

    $R \sum_{k \notin I} b^m_k(x) \le \frac{R\, x(1-x)}{m \delta^2} \le \frac{R}{4 m \delta^2} < \frac{\epsilon}{2}, \qquad \text{provided } m > \frac{R}{2 \epsilon \delta^2}$.
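The Bernstein construction (5.7) is easy to try out numerically. The sketch below (not from the notes) evaluates $q_m$ for a continuous, non-polynomial target and reports the sup error on a grid; the basis $b^m_k(x)$ is exactly the Binomial(m, x) pmf at k, which we exploit for a numerically stable evaluation. The particular target f is an arbitrary illustrative choice.

```python
# Minimal sketch: Bernstein polynomial approximation q_m of a continuous f on [0, 1],
# as in (5.7). The basis b^m_k(x) equals the Binomial(m, x) pmf at k.
import numpy as np
from scipy.stats import binom

def bernstein_approx(f, m, xs):
    """Evaluate q_m(x) = sum_k f(k/m) b^m_k(x) on the grid xs."""
    ks = np.arange(m + 1)
    basis = binom.pmf(ks, m, xs[:, None])   # shape (len(xs), m + 1)
    return basis @ f(ks / m)

f = lambda x: np.abs(x - 0.3) + np.sin(5 * x)   # continuous, not a polynomial
xs = np.linspace(0.0, 1.0, 2001)
for m in [5, 20, 80, 320]:
    err = np.max(np.abs(f(xs) - bernstein_approx(f, m, xs)))
    print(f"m = {m:4d}   sup-error ≈ {err:.4f}")   # decreases (slowly) with m
```

The slow decay of the error with m is typical of Bernstein polynomials; the theorem only asserts convergence, not a fast rate.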
5.1.3  Spans of Smooth Functions

⇒ Everything but a polynomial.

Let us now consider an arbitrary smooth function $\sigma \in C^\infty(\mathbb{R})$ and the span of its compositions with affine functions,

    $G^1_\sigma := \{g : g(x) = \sigma(ax + b) \text{ for some } a, b \in \mathbb{R}\}$,    (5.9)
    $H^1_\sigma := \mathrm{span}(G^1_\sigma)$.    (5.10)

Then the following holds:

Theorem 5.4 (Leshno, Lin, Pinkus, Schocken 1993). For any $\sigma \in C^\infty(\mathbb{R})$ that is not a polynomial on any [a; b], $H^1_\sigma$ is a universal approximator.

Proof. We proceed backwards, for didactic reasons.

1. It is sufficient to show that the polynomials can be approximated,

    $\mathcal{P} = \left\{ \sum_{k=0}^{r} \alpha_k x^k \;:\; \alpha_k \in \mathbb{R},\ r \ge 0 \right\} \subseteq \mathrm{cl}(H^1_\sigma)$.

We can then invoke the Weierstrass theorem along with the subadditivity of the sup-norm (triangle inequality) to prove the claim.

2. Since $H^1_\sigma$ is a linear span, it is sufficient to show that the monomials can be approximated, $\{x^k : k \ge 0\} \subset \mathrm{cl}(H^1_\sigma)$. This is because if $h_k \in H^1_\sigma$ are $\epsilon_k$-approximations of $x^k$, then for $h(x) := \sum_{k=0}^{r} \alpha_k h_k(x) \in H^1_\sigma$ one obtains by subadditivity

    $\left\| \sum_{k=0}^{r} \alpha_k x^k - h(x) \right\|_\infty \le \sum_{k=0}^{r} |\alpha_k| \, \|x^k - h_k(x)\|_\infty < \sum_{k=0}^{r} |\alpha_k|\, \epsilon_k =: \epsilon$.

3. As $\sigma \in C^\infty$, all derivatives with respect to a exist; namely, for $z = ax + b$,

    $\frac{d^k \sigma(z)}{d a^k} = x^k \sigma^{(k)}(z), \qquad \frac{d^k \sigma(z)}{d a^k}\Big|_{a=0} = x^k \sigma^{(k)}(b)$.

Note that if σ were a polynomial of degree less than k, we would get $\sigma^{(k)} \equiv 0$. It turns out the opposite is also true: if $\sigma \notin \mathcal{P}$, then $\sigma^{(k)} \not\equiv 0$ (see the theorem below). We can thus choose $b_k$ such that $\sigma^{(k)}(b_k) \ne 0$.

4. Let us focus on k = 1. The derivative can be uniformly approximated on any compact set by a finite difference,

    $\frac{\sigma((a+h)x + b) - \sigma(ax + b)}{h} \xrightarrow{\infty} \frac{d \sigma(ax+b)}{da} \quad \text{as } h \to 0$.

The left-hand side is indeed a linear combination of σ-ridge functions. Concretely, we can choose b such that $\sigma'(b) \ne 0$ and define the sequence

    $h_m(x) = \frac{m}{\sigma'(b)} \left( \sigma\!\left( \tfrac{x}{m} + b \right) - \sigma(b) \right) \xrightarrow{\infty} x$

(see the numerical sketch at the end of this subsection). Similar expressions exist for higher-order derivatives (skipped for the sake of brevity) and provide uniform approximations of the monomials, $x^k \simeq H^1_\sigma$.

The missing step above follows from:

Theorem 5.5 (Donoghue, 1969; Pinkus, 1999). If σ is $C^\infty$ on (a; b) and not a polynomial thereon, then there exists a point $x_0 \in (a; b)$ such that $\sigma^{(k)}(x_0) \ne 0$ for all $k = 0, 1, 2, \dots$.

The above result can be further generalized: in fact, it can be shown that the smoothness assumption is not necessary and $\sigma \in C(\mathbb{R}) \setminus \mathcal{P}$ is sufficient. See, for instance, Proposition 3.7 of [59].

What we have shown so far is that for the case of n = 1 (one-dimensional inputs), an MLP with one hidden layer and a smooth activation function σ is a universal approximator, unless σ is a polynomial. This is because the linear output layer picks an element in the span of the hidden units, each of which computes a function in $G^1_\sigma$. So as far as universal function approximation is concerned, there is nothing special about the logistic function or the hyperbolic tangent as choices of activation functions.
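The following sketch (illustrative, not from the notes) checks the finite-difference construction of step 4 numerically for $\sigma = \tanh$: the combination $h_m$ of two σ-ridge functions converges uniformly to the monomial x on a compact interval. The interval and the offset b are arbitrary choices with $\sigma'(b) \ne 0$.

```python
# Minimal sketch: the finite-difference construction from step 4 of the proof of
# Theorem 5.4, with sigma = tanh. h_m(x) = m / sigma'(b) * (sigma(x/m + b) - sigma(b))
# converges uniformly to x on a compact interval.
import numpy as np

sigma = np.tanh
b = 0.5                                  # any b with sigma'(b) != 0 works
dsigma_b = 1.0 - np.tanh(b) ** 2         # tanh'(b)

def h(x, m):
    """Element of H^1_sigma (a linear combination of two sigma-ridge functions)."""
    return m / dsigma_b * (sigma(x / m + b) - sigma(b))

xs = np.linspace(-3.0, 3.0, 2001)        # compact set K = [-3, 3]
for m in [1, 10, 100, 1000]:
    err = np.max(np.abs(h(xs, m) - xs))  # sup-norm error against the monomial x
    print(f"m = {m:5d}   ||h_m - x||_inf ≈ {err:.4f}")
```

The error decays roughly like 1/m here, reflecting the first-order accuracy of the forward difference.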
5.1.4  Universality of Ridge Functions

⇒ Ridge functions are rich functions.

One basic tool for lifting results from one dimension to higher dimensions are ridge functions. They are universal approximators. Define

    $G^n_\sigma := \{g : g(x) = \sigma(\boldsymbol{\theta} \cdot x),\ \boldsymbol{\theta} \in \mathbb{R}^n\}, \qquad H^n_\sigma := \mathrm{span}(G^n_\sigma)$,    (5.11)
    $G^n := \bigcup_{\sigma \in C(\mathbb{R})} G^n_\sigma, \qquad H^n := \mathrm{span}(G^n)$.

Theorem 5.6 (Vostrecov and Kreines, 1961). $H^n$ is a universal function approximator.

Proof. Omitted.

Remark 5.7. In fact, it can be shown that it is sufficient to consider weights $\boldsymbol{\theta} \in \Theta$ such that no homogeneous polynomial vanishes on Θ. We do not need that stronger version here, but will assume that we can choose $\boldsymbol{\theta} \in S^{n-1}$.

Note that one can absorb the linear combination weights into the ridge functions, so that

    $H^n = \left\{ h : h = \sum_{k=1}^{m} g_k,\ g_k \in G^n \right\}$.    (5.12)

So this theorem can be used to show that 3-layer neural networks with adaptive activation functions would be universal approximators. However, this in itself is not practical, as we do not know how to represent all C(R) functions that we may want to choose for σ.

5.1.5  Dimension Lifting

⇒ Heavy lifting, but elegantly.

The following beautiful theorem provides the missing link that connects the above result with the universality result we have for n = 1.

Theorem 5.8 (Pinkus 1999, [59] Proposition 3.3). $H^1_\sigma$ universal for $C(\mathbb{R})$ $\implies$ $H^n_\sigma$ universal for $C(\mathbb{R}^n)$ for any $n \ge 1$.

Proof sketch.
1. Fix f and a compact $K \subset \mathbb{R}^n$. We can find ridge functions $g_k$ such that

    $\left| f(x) - \sum_{k=1}^{m} g_k(\boldsymbol{\theta}_k \cdot x) \right| < \frac{\epsilon}{2} \quad (\forall x \in K)$.

2. Since K is compact, $\boldsymbol{\theta}_k \cdot x \in [\alpha_k, \beta_k]$ for $x \in K$.
3. Because $H^1_\sigma$ is dense in each $C([\alpha_k, \beta_k])$, we can find constants such that

    $\left| g_k(z) - \sum_{j=1}^{m_k} c_{kj}\, \sigma(a_{kj} z + b_{kj}) \right| \le \frac{\epsilon}{2m} \quad (\forall k = 1, \dots, m)$.

4. Plugging things together yields the result.

So, to summarize the argument: (1) For n = 1, an MLP with any continuous, non-polynomial activation function is a universal approximator. (2) Spans of ridge functions are universal approximators for $C(\mathbb{R}^n)$. (3) The non-linear part of each ridge function can be approximated by (1). (4) Hence MLPs are universal function approximators.

5.2  Complexity

⇒ Universality!? At what price?!

As we have seen, there is no need to worry that function classes represented by neural networks are not rich enough. Universality is the rule rather than the exception. But important questions remain: (1) How many units or parameters are required to obtain a desired approximation accuracy? (2) Is there an advantage of compositionality (multiple layers) over MLPs with a single hidden layer, e.g. over function classes $\mathrm{span}(G^n_\sigma)$? These are the two questions investigated in this section.

5.2.1  Barron's Theorem

We will state and discuss without proof a (simplified) version of the famous result of [5], which relates the approximation residual to the number of sigmoidal neurons in the (single) hidden layer.

Fourier Transform   We need the concept of a Fourier transform and some basic understanding of Fourier analysis. For any absolutely integrable f, i.e.

    $\int_{\mathbb{R}^n} |f(x)| \, dx < \infty$,

define

    $\hat{f}(\boldsymbol{\xi}) := \int_{\mathbb{R}^n} e^{-2\pi i \boldsymbol{\xi} \cdot x} f(x) \, dx, \qquad \hat{f}: \mathbb{R}^n \to \mathbb{C}$.    (Fourier transform)

Intuitively, the Fourier transform represents the frequency components of a function. Specifically, note that every function is the sum of an even and an odd function,

    $f(x) = \frac{f(x) + f(-x)}{2} + \frac{f(x) - f(-x)}{2}$.    (5.13)

By Euler's formula $e^{ia} = \cos(a) + i \sin(a)$, the real part of $\hat{f}$ only depends on the even part of f, whereas the imaginary part depends on its odd part.

Regularity condition   The theorem below applies to functions g that have a Fourier transform $\hat{g}$ such that

    $C_g := \int \|\omega\| \, |\hat{g}(\omega)| \, d\omega < \infty$.    (gradient regularity)

Note that for differentiable g, the Fourier transform of the gradient function is given (up to a constant factor of $2\pi i$ in this convention) by

    $\widehat{\nabla g}(\omega) = \omega \, \hat{g}(\omega)$.    (5.14)

So the condition $C_g < \infty$ can be interpreted as stating that the Fourier transform of the gradient function has to be absolutely integrable. The condition is, however, also meaningful for non-differentiable g (see the discussion in [5]).

Main theorem   The theorem below applies to the logistic activation function and, more generally, to any bounded (measurable) and monotonic function σ such that $\sigma(t) \to 1$ as $t \to \infty$ and $\sigma(t) \to 0$ as $t \to -\infty$.

Theorem 5.9 (Barron, 1993). For every $g: \mathbb{R}^n \to \mathbb{R}$ with finite $C_g$ and any $r > 0$, there is a sequence of MLP functions $f_m$ of the form $f_m(x) = \sum_{j=1}^{m} \beta_j \sigma(\boldsymbol{\theta}_j \cdot x + b_j) + b_0$ such that

    $\int_{rB} (g(x) - f_m(x))^2 \, \mu(dx) \le O\!\left( \frac{1}{m} \right)$,

where $rB = \{x \in \mathbb{R}^n : \|x\| \le r\}$ and μ is any probability measure on rB.
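To make the quantities in Theorem 5.9 concrete, the following sketch (not Barron's construction and not from the notes) builds an MLP of the stated form with the logistic activation and estimates the left-hand side $\int (g - f_m)^2 d\mu$ by Monte Carlo, with μ uniform on the ball rB. The target g and all parameter values are placeholders chosen only to show the interface; the theorem asserts the existence of parameters achieving error O(1/m), not that random ones do.

```python
# Minimal sketch: Monte Carlo evaluation of the L^2(mu) error in Theorem 5.9 for an
# MLP f_m(x) = sum_j beta_j * sigma(theta_j . x + b_j) + b_0 with logistic sigma.
import numpy as np

def sigma(t):                         # logistic: bounded, monotone, limits 0 and 1
    return 1.0 / (1.0 + np.exp(-t))

def f_m(X, Theta, b, beta, b0):
    """Barron-form MLP: rows of X are inputs in R^n."""
    return sigma(X @ Theta.T + b) @ beta + b0

def l2_error(g, params, n=10, r=1.0, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E_{x~mu}[(g(x) - f_m(x))^2], mu uniform on the ball rB."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n))
    X *= r * rng.uniform(size=(n_samples, 1)) ** (1 / n) / np.linalg.norm(X, axis=1, keepdims=True)
    return np.mean((g(X) - f_m(X, *params)) ** 2)

n, m = 10, 64
rng = np.random.default_rng(1)
params = (rng.normal(size=(m, n)), rng.normal(size=m), rng.normal(size=m) / m, 0.0)
g = lambda X: np.cos(X.sum(axis=1))   # a smooth target (finite C_g)
print(l2_error(g, params, n=n))       # existence of params with error O(1/m) is the theorem's claim
```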
Interpretation   The most striking aspects of the accuracy bound in the theorem and its (omitted) proof are:

1. The lack of dependency on n, the dimensionality of the input. It shows that MLPs do not suffer from the curse of dimensionality when approximating functions that fulfill the (gradient regularity) condition.
2. The freedom in the choice of the measure (or data distribution).
3. The remarkable drop in approximation error $\propto 1/m$.
4. Additional bounds and constraints on the parameters can be imposed.
5. Moreover, the proof uses an iterative construction that successively adds units to fit residuals.

To appreciate the first point, note that for any fixed set of m basis functions the best achievable approximation order is $(1/m)^{2/n}$, which is much worse. Details can be found in [5], Section X.

5.2.2  Depth Separation

⇒ Why go deep, if shallow is universal?

Prologue   There is clear and consistent evidence that deeper networks yield better approximations for many practical problems involving large data sets. However, classical results on universality have largely focused on showing the strength of shallow models such as MLPs with a single hidden layer. This raises the question of whether deep networks offer representational benefits. It is fair to say that – as of today – there is no comprehensive theory of why and when deeper models are preferable, but there are interesting pieces of the puzzle. We will present an example due to [19] that shows the benefit of a compositional (2 hidden layer) architecture over a shallow (1 hidden layer) one. This yields a paradigmatic example along with an illuminating construction.

The key idea is very simple. Define a radial function $g(x) = \psi(\|x\|)$ that can be naturally approximated by first approximating the norm, or some one-to-one function thereof (via the span of the first hidden layer), and then approximating ψ (via the span of the second hidden layer). It should be clear that the norm can be approximated in a dimension-efficient manner: one needs an MLP approximation of the square, $f_m(z) \approx z^2$, and can form

    $\bar{f}_{mn}(x) = \sum_{i=1}^{n} f_m(x_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} \beta_j \sigma(\theta_j x_i + b_j) \approx \|x\|^2$.    (5.15)

This efficient separation across dimensions may get lost, as we will see, when trying to approximate g with an MLP with a single hidden layer. A small numerical sketch of (5.15) follows below.
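The sketch below (illustrative only) realizes (5.15): it fits a small 1-D MLP $f_m(z) \approx z^2$ on [−1, 1] by least squares over random ridge features (an ad hoc fitting choice, not the construction used in [19]) and then reuses the same m units on every coordinate, so that n·m units approximate $\|x\|^2$ in $\mathbb{R}^n$.

```python
# Minimal sketch: dimension-efficient approximation of ||x||^2 as in (5.15).
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
theta, b = rng.normal(size=m), rng.normal(size=m)          # random first-layer weights
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

# Fit output weights beta so that sum_j beta_j sigma(theta_j z + b_j) ≈ z^2 on [-1, 1].
zs = np.linspace(-1.0, 1.0, 512)
Phi = sigma(np.outer(zs, theta) + b)
beta, *_ = np.linalg.lstsq(Phi, zs ** 2, rcond=None)

f_m = lambda z: sigma(np.outer(z, theta) + b) @ beta        # 1-D square approximation
f_bar = lambda X: sum(f_m(X[:, i]) for i in range(n))       # (5.15): sum over coordinates

X = rng.uniform(-1.0, 1.0, size=(1000, n))
err = np.max(np.abs(f_bar(X) - np.linalg.norm(X, axis=1) ** 2))
print(f"max |f_bar(x) - ||x||^2| over the sample ≈ {err:.3f}")
```

The point of the construction is that the cost scales additively in the dimension n; the difficulty for a single hidden layer arises only once ψ is composed on top of the norm.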
Setup   Recall that an MLP with one hidden layer of width m implements a function

    $f \in \mathrm{span}\{\sigma_j(x) := \sigma(\boldsymbol{\theta}_j \cdot x),\ 1 \le j \le m\} \subset \mathrm{span}(G^n_\sigma)$.    (5.16)

We are interested in the $L^2$ loss between f and some target function g with regard to some measure with density $\varphi^2$ (i.e. $\int \varphi^2(x)\, dx = 1$),

    $\ell^\varphi(f, g) := \int (f(x) - g(x))^2 \varphi^2(x)\, dx = \int (f(x)\varphi(x) - g(x)\varphi(x))^2 dx$
    $\qquad\quad = \|f\varphi - g\varphi\|^2_{L^2} \overset{(1)}{=} \|\widehat{f\varphi} - \widehat{g\varphi}\|^2_{L^2} \overset{(2)}{=} \|\hat{f} \ast \hat{\varphi} - \hat{g} \ast \hat{\varphi}\|^2_{L^2}$,    (5.17)

where $\hat{h}$ denotes the Fourier transform of a function h (if h is square integrable) or its generalized Fourier transform (through the use of tempered distributions, see [34, Chapter 11]). Step (1) holds because of the unitarity of the Fourier transform (Parseval identity), whereas (2) follows from the convolution theorem. The goal will be to choose a density $\varphi^2$ and a function g such that a small $\ell^\varphi(f, g)$ can only be achieved with (loosely speaking) exponential growth in m.

Spans of ridge functions: the weakness   Let us first discuss a limitation of spans of ridge functions that will be exploited as a weakness. We are interested in the (generalized) Fourier transform of f (as above) as well as that of fφ; in particular, in the support of $\hat{f} \ast \hat{\varphi}$. The choice of φ used in the construction will be such that $\hat{\varphi} = \mathbb{1}[B^n]$, the indicator of the unit ball. This implies φ is isotropic and band-limited.

[Figure 5.1: the function φ that defines the density $\varphi^2$, visualized in 2 dimensions. Figure taken from [19].]

We first want to re-express the fact that a single ridge function $\sigma(\boldsymbol{\theta} \cdot x)$ is only sensitive to changes in a single direction θ in terms of its Fourier transform. This translates into a constraint on the support of $\hat{\sigma}$ as follows:

    $\mathrm{supp}(\hat{\sigma}) = \mathrm{span}\{\boldsymbol{\theta}\}$.    (5.18)

Non-technically speaking, this is because ridge functions can be represented as a superposition of sine waves, all with spatial frequencies $\omega \propto \boldsymbol{\theta}$. If we convolve $\hat{\sigma}$ with $\hat{\varphi}$ as above, then the support becomes

    $\mathrm{supp}(\hat{\sigma} \ast \hat{\varphi}) = \mathrm{span}\{\boldsymbol{\theta}\} + B$,    (5.19)

which is a tube of radius 1. Finally, note that

    $\mathrm{supp}(\hat{f} \ast \hat{\varphi}) = \bigcup_j \left( \mathrm{span}\{\boldsymbol{\theta}_j\} + B \right)$,    (5.20)

which follows from linearity,

    $\hat{f} \ast \hat{\varphi} = \sum_j \beta_j (\hat{\sigma}_j \ast \hat{\varphi})$, so that $(\hat{f} \ast \hat{\varphi})(\boldsymbol{\xi}) = 0$ if $(\hat{\sigma}_j \ast \hat{\varphi})(\boldsymbol{\xi}) = 0$ for all j.    (5.21)

So the frequency components of a function $f \in \mathrm{span}(G^n_\sigma)$ have a peculiar structure. In order to be sensitive to all spatial frequencies, one has to choose m large enough so that $\mathrm{supp}(\widehat{f\varphi}) \supseteq rB$ as r grows. Note that if $\mathrm{supp}(\widehat{f\varphi}) \not\supseteq rB$, then there would be frequencies $\boldsymbol{\xi} \in rB$ representing oscillations that $f\varphi$ could not capture or approximate. In fact, one can show the following volume ratio formula as $n \to \infty$ [19]:

    $\frac{\mathrm{V}\!\left( \mathrm{supp}(\hat{f} \ast \hat{\varphi}) \cap rB \right)}{\mathrm{V}(rB)} \lesssim m\, e^{-n}$.

Designing the target g   It now remains to choose a target function that is difficult to approximate by an MLP with one hidden layer, yet can be efficiently approximated by an MLP with two hidden layers. The key idea is to choose a radial function, $g(x) = \psi(\|x\|)$ or – as a function composition – $g = \psi \circ \|\cdot\|$. In a two-hidden-layer network, the span of the first layer approximates the norm function and the span of the second the univariate function ψ. Intuitively speaking, one needs to choose ψ so that it puts enough mass far away from the origin. The technical details are involved, but one construction that works uses random-sign indicator functions of thin shells. For instance, if $\|x\| \le 1$, one can choose a partition $\{\Delta_i\}$ of [0; 1] and define

    $\psi(z) = \sum_{i=1}^{N} \epsilon_i \psi_i(z), \qquad \psi_i(z) = \mathbb{1}\{z \in \Delta_i\}, \quad \epsilon_i \in \{-1, +1\}$.    (5.22)

The sign flips generate the oscillations. Although ψ and all $\psi_i$ are discontinuous, an approximation by Lipschitz functions with regard to the chosen measure is possible. For further details we refer to [19]. A 2d-sketch of such a function is shown in Figure 5.2.

[Figure 5.2: schematic view of the construction of a target function g in d dimensions. Figure taken from [19].]

Exponential Separability   Assume that the activation function used in the ridge functions has polynomially bounded growth, $|\sigma(z)| \le C(1 + |z|)^\alpha$ for some constants C, α > 0. Moreover, there is a technical Lipschitz condition (Assumption 1 in [19]) leading to a constant c, which is fulfilled by standard activation functions such as the sigmoid functions introduced earlier in the chapter.

Theorem 5.10 ([19]). For $n \ge C$ there exists a probability measure μ with density $\varphi^2$ and a function g with the following properties:

1. g is bounded in [−2; 2], supported on $\{x : \|x\| \le C\sqrt{n}\}$, and expressible by a 2-hidden-layer network of width $C c\, n^{19/4}$.
2. Every function $f \in \mathrm{span}(G^n_\sigma)$ of width $m \le c\, e^{cn}$ satisfies $\mathbb{E}_{x \sim \mu}\!\left[ (f(x) - g(x))^2 \right] \ge c$.
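To make the construction in (5.22) concrete, here is a minimal sketch (illustrative only): it builds ψ as a random-sign sum of indicators over a uniform partition of [0, 1] and the radial target $g(x) = \psi(\|x\|)$. The number of shells N and the uniform partition are arbitrary choices, not the tuned parameters of [19].

```python
# Minimal sketch: the random-sign radial target of (5.22).
import numpy as np

rng = np.random.default_rng(0)
N = 40
edges = np.linspace(0.0, 1.0, N + 1)          # uniform partition {Delta_i} of [0, 1]
eps = rng.choice([-1.0, 1.0], size=N)         # random signs generating the oscillations

def psi(z):
    """psi(z) = sum_i eps_i * 1{z in Delta_i} for z in [0, 1], and 0 outside."""
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, N - 1)
    return np.where((z >= 0.0) & (z <= 1.0), eps[idx], 0.0)

def g(X):
    """Radial target g(x) = psi(||x||): natural for depth 2 (norm, then psi)."""
    return psi(np.linalg.norm(X, axis=-1))

X = rng.uniform(-1.0, 1.0, size=(5, 8))       # a few points in R^8
print(g(X))                                   # values in {-1, 0, +1}
```

A depth-2 network can realize g by first approximating the norm (as in (5.15)) and then the univariate ψ, whereas Theorem 5.10 shows that any single-hidden-layer network of sub-exponential width incurs constant error.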