Approximation Error and Approximation Theory

Federico Girosi
Center for Basic Research in the Social Sciences, Harvard University
and
Center for Biological and Computational Learning, MIT
fgirosi@latte.harvard.edu

1 Plan of the class

• Learning and generalization error
• Approximation problem and rates of convergence
• N-widths
• "Dimension independent" convergence rates

2 Note

These slides cover more material than will be presented in class.

3 References

The background material on generalization error (first 8 slides) is explained at length in:

1. P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for Radial Basis Functions. Neural Computation, 8:819–842, 1996.

2. P. Niyogi and F. Girosi. Generalization bounds for function approximation from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.

[1] has a longer explanation and introduction, while [2] is more mathematical and also contains a very simple probabilistic proof of a class of "dimension independent" bounds, like the ones discussed at the end of this class.

As far as I know, it is A. Barron who first clearly spelled out the decomposition of the generalization error into two parts. Barron uses a framework different from the one we use here, and he summarizes it nicely in:

3. A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

The paper is quite technical, but it is important to read it if you plan to do research in this field.

The material on n-widths comes from:

4. A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, New York, 1980.

Although the book is very technical, the first 8 pages contain an excellent introduction to the subject. The other great thing about this book is that you do not need to understand every single proof to appreciate the beauty and significance of the results, and it is a mine of useful information.

5. H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164–177, 1996.

4

6. A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

7. F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. A.I. Memo 1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.

For a curious way to prove dimension independent bounds using VC theory see:

8. F. Girosi. Approximation error bounds that use VC-bounds. In Proc. International Conference on Artificial Neural Networks, F. Fogelman-Souliè and P. Gallinari, editors, Vol. 1, 295–302. Paris, France, October 1995.

5 Notation review

• $I[f] = \int_{X \times Y} V(f(\mathbf{x}), y)\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy$
• $I_{\rm emp}[f] = \frac{1}{l} \sum_{i=1}^{l} V(f(\mathbf{x}_i), y_i)$
• $f_0 = \arg\min_{f} I[f]$, with $f_0 \in T$ (the target space)
• $f_H = \arg\min_{f \in H} I[f]$
• $\hat{f}_{H,l} = \arg\min_{f \in H} I_{\rm emp}[f]$

6 More notation review

• I[f_0] = how well we could possibly do
• I[f_H] = how well we can do in space H
• I[f̂_{H,l}] = how well we do in space H with l data
• $|I[f] - I_{\rm emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$ (from VC theory)
• I[f̂_{H,l}] is called generalization error
• I[f̂_{H,l}] − I[f_0] is also called generalization error ...
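As a purely illustrative companion to the notation above (not part of the original slides), the sketch below estimates both I[f] and I_emp[f] for a fixed hypothesis f under a quadratic loss. The target function f0, the noise level, and the input distribution p(x) are arbitrary choices made only for this example.

```python
# Illustration (not from the slides): expected vs. empirical risk for a fixed
# hypothesis f, with quadratic loss V(f(x), y) = (f(x) - y)^2.
# The target f0, the noise level and the input distribution are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def f0(x):
    """Hypothetical regression (target) function."""
    return np.sin(2 * np.pi * x)

def f(x):
    """A fixed hypothesis whose risk we want to evaluate."""
    return 2 * x - 1

def risk(g, x, y):
    """Average quadratic loss of g on the sample (x, y)."""
    return np.mean((g(x) - y) ** 2)

l = 50
x = rng.uniform(0, 1, l)                        # x_i ~ p(x), uniform on [0, 1]
y = f0(x) + 0.1 * rng.standard_normal(l)        # noisy observations of the target

x_big = rng.uniform(0, 1, 200_000)              # large sample: Monte-Carlo estimate of I[f]
y_big = f0(x_big) + 0.1 * rng.standard_normal(200_000)

print("I_emp[f] with l = 50 points:", risk(f, x, y))
print("I[f] (Monte-Carlo estimate):", risk(f, x_big, y_big))
```

The gap between the two printed numbers is exactly the fluctuation that the VC quantity Ω(H, l, δ) is meant to control uniformly over H.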
7 A General Decomposition

$$I[\hat{f}_{H,l}] - I[f_0] \;=\; \big(I[\hat{f}_{H,l}] - I[f_H]\big) + \big(I[f_H] - I[f_0]\big)$$

generalization error = estimation error + approximation error

When the cost function V is quadratic:

$$I[f] = \|f_0 - f\|^2 + I[f_0]$$

and therefore

$$\|f_0 - \hat{f}_{H,l}\|^2 = \big(I[\hat{f}_{H,l}] - I[f_H]\big) + \|f_0 - f_H\|^2$$

generalization error = estimation error + approximation error

8 A useful inequality

If, with probability 1 − δ,

$$|I[f] - I_{\rm emp}[f]| \leq \Omega(H, l, \delta) \quad \forall f \in H$$

then

$$I[\hat{f}_{H,l}] - I[f_H] \leq 2\,\Omega(H, l, \delta)$$

You can prove it using the following observations:

• I[f_H] ≤ I[f̂_{H,l}] (from the definition of f_H)
• I_emp[f̂_{H,l}] ≤ I_emp[f_H] (from the definition of f̂_{H,l})

9 Bounding the Generalization Error

$$\|f_0 - \hat{f}_{H,l}\|^2 \leq 2\,\Omega(H, l, \delta) + \|f_0 - f_H\|^2$$

Notice that:

• Ω has nothing to do with the target space T; it is studied mostly in statistics;
• ‖f_0 − f_H‖ has everything to do with the target space T; it is studied mostly in approximation theory.

10 Approximation Error

We consider a nested family of hypothesis spaces H_n:

$$H_0 \subset H_1 \subset \dots \subset H_n \subset \dots$$

and define the approximation error as:

$$\epsilon_T(f, H_n) \equiv \inf_{h \in H_n} \|f - h\|$$

ε_T(f, H_n) is the smallest error that we can make if we approximate f ∈ T with an element of H_n (here ‖·‖ is the norm in T).

11 Approximation error

For reasonable choices of hypothesis spaces H_n:

$$\lim_{n \to \infty} \epsilon_T(f, H_n) = 0$$

This means that we can approximate functions of T arbitrarily well with elements of {H_n}_{n=1}^∞.

Example: T = continuous functions on compact sets, and H_n = polynomials of degree at most n.

12 The rate of convergence

The interesting question is: how fast does ε_T(f, H_n) go to zero?

• The rate of convergence is a measure of the relative complexity of T with respect to the approximation scheme H.
• The rate of convergence determines how many samples we need in order to obtain a given generalization error.

13 An Example

• In the next slides we compute explicitly the rate of convergence of approximation of a smooth function by trigonometric polynomials.
• We are interested in studying how fast the approximation error goes to zero when the number of parameters of our approximation scheme goes to infinity.
• The reason for this exercise is that the results are representative: more complex and interesting cases all share the basic features of this example.

14 Approximation by Trigonometric Polynomials

Consider the set of functions

$$C_2[-\pi, \pi] \equiv C[-\pi, \pi] \cap L_2[-\pi, \pi]$$

Functions in this set can be represented as a Fourier series:

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}, \qquad c_k \propto \int_{-\pi}^{\pi} dx\, f(x)\, e^{-ikx}$$

The L_2 norm satisfies the equation:

$$\|f\|^2_{L_2} = \sum_{k=0}^{\infty} c_k^2$$

15

We consider as target space the following Sobolev space of smooth functions:

$$W_{s,2} \equiv \left\{ f \in C_2[-\pi, \pi] \;\Big|\; \left\| \frac{d^s f}{dx^s} \right\|^2_{L_2} < +\infty \right\}$$

The (semi)-norm in this Sobolev space is defined as:

$$\|f\|^2_{W_{s,2}} \equiv \left\| \frac{d^s f}{dx^s} \right\|^2_{L_2} = \sum_{k=1}^{\infty} k^{2s} c_k^2$$

If f belongs to W_{s,2}, its Fourier coefficients c_k must go to zero at a rate which increases with s.

16

We choose as hypothesis space H_n the set of trigonometric polynomials of degree n:

$$p(x) = \sum_{k=1}^{n} a_k e^{ikx}$$

Given a function of the form

$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}$$

the optimal hypothesis f_n(x) is given by the first n terms of its Fourier series:

$$f_n(x) = \sum_{k=1}^{n} c_k e^{ikx}$$

17 Key Question

For a given f ∈ W_{s,2} we want to study the approximation error:

$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2}$$

• Notice that n, the degree of the polynomial, is also the number of parameters that we use in the approximation.
• Obviously ε_n goes to zero as n → +∞, but the key question is: how fast?
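Before the estimate on the next slide, here is a quick numerical sanity check (a sketch, not part of the original slides). The coefficient sequence c_k = k^{-(s+1)} is an arbitrary assumption made only so that f lies in W_{s,2}; with it, the truncation error ε_n[f] stays below the bound ‖f‖²_{W_{s,2}}/n^{2s} derived next.

```python
# Numerical sanity check (not from the slides): Fourier truncation error for a
# function in W_{s,2}. The choice c_k = k^{-(s+1)} is arbitrary and only
# guarantees that sum_k k^{2s} c_k^2 is finite.
import numpy as np

s = 2
k = np.arange(1, 100_001, dtype=float)
c = k ** (-(s + 1.0))

sobolev_norm_sq = np.sum(k ** (2 * s) * c ** 2)   # ||f||^2_{W_{s,2}} = sum_k k^{2s} c_k^2
tail_sq = np.cumsum((c ** 2)[::-1])[::-1]         # tail_sq[j] = sum_{k >= j+1} c_k^2

for n in [5, 10, 20, 40, 80]:
    eps_n = tail_sq[n]                            # eps_n[f] = sum_{k > n} c_k^2
    bound = sobolev_norm_sq / n ** (2 * s)        # the bound derived on the next slide
    print(f"n={n:3d}   eps_n={eps_n:.3e}   bound={bound:.3e}")
```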
18 An easy estimate of ε_n

$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2} = \sum_{k=n+1}^{\infty} c_k^2 = \sum_{k=n+1}^{\infty} c_k^2 k^{2s} \frac{1}{k^{2s}} < \frac{1}{n^{2s}} \sum_{k=n+1}^{\infty} c_k^2 k^{2s} < \frac{1}{n^{2s}} \sum_{k=1}^{\infty} c_k^2 k^{2s} = \frac{\|f\|^2_{W_{s,2}}}{n^{2s}}$$

$$\Downarrow$$

$$\epsilon_n[f] < \frac{\|f\|^2_{W_{s,2}}}{n^{2s}}$$

More smoothness ⇒ faster rate of convergence

But what happens in more than one dimension?

19 Rates of convergence in d dimensions

It is enough to study d = 2. We proceed in full analogy with the 1-d case:

$$f(x, y) = \sum_{k,m=1}^{\infty} c_{km} e^{i(kx+my)}$$

$$\|f\|^2_{W_{s,2}} \equiv \left\| \frac{\partial^s f}{\partial x^s} \right\|^2_{L_2} + \left\| \frac{\partial^s f}{\partial y^s} \right\|^2_{L_2} = \sum_{k,m=1}^{\infty} (k^{2s} + m^{2s})\, c_{km}^2$$

Here W_{s,2} is defined as the set of functions such that ‖f‖²_{W_{s,2}} < +∞.

20

We choose as hypothesis space H_n the set of trigonometric polynomials of degree l:

$$p(x, y) = \sum_{k,m=1}^{l} a_{km} e^{i(kx+my)}$$

A trigonometric polynomial of degree l in d variables has a number of coefficients n = l^d. We are interested in the behavior of the approximation error as a function of n.

The approximating function is:

$$f_n(x, y) = \sum_{k,m=1}^{l} c_{km} e^{i(kx+my)}$$

21 An easy estimate of ε_n

$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2} = \sum_{k,m=l+1}^{\infty} c_{km}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2 \frac{k^{2s} + m^{2s}}{k^{2s} + m^{2s}} < \frac{1}{2 l^{2s}} \sum_{k,m=l+1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) < \frac{1}{2 l^{2s}} \sum_{k,m=1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) = \frac{\|f\|^2_{W_{s,2}}}{2 l^{2s}}$$

Since n = l^d, then l = n^{1/d} (with d = 2), and we obtain:

$$\epsilon_n < \frac{\|f\|^2_{W_{s,2}}}{2\, n^{\frac{2s}{d}}}$$

22 (Partial) Summary

The previous calculation generalizes easily to the d-dimensional case. Therefore we conclude that: if we approximate functions of d variables with s square integrable derivatives by a trigonometric polynomial with n coefficients, the approximation error satisfies:

$$\epsilon_n < \frac{C}{n^{\frac{2s}{d}}}$$

More smoothness s ⇒ faster rate of convergence
Higher dimension d ⇒ slower rate of convergence

23 Another Example: Generalized Translation Networks

Consider networks of the form:

$$f(\mathbf{x}) = \sum_{k=1}^{n} a_k \phi(A_k \mathbf{x} + \mathbf{b}_k)$$

where x ∈ R^d, b_k ∈ R^m, 1 ≤ m ≤ d, the A_k are m × d matrices, a_k ∈ R and φ is some given function.

For m = 1 this is a Multilayer Perceptron. For m = d, A_k diagonal and φ radial, this is a Radial Basis Functions network.

24 Theorem (Mhaskar, 1994)

Let W_s^p(R^d) be the space of functions whose derivatives up to order s are p-integrable in R^d. Under very general assumptions on φ one can prove that there exist m × d matrices {A_k}_{k=1}^n such that, for any f ∈ W_s^p(R^d), one can find b_k and a_k such that:

$$\left\| f - \sum_{k=1}^{n} a_k \phi(A_k \mathbf{x} + \mathbf{b}_k) \right\|_p \leq c\, n^{-\frac{s}{d}}\, \|f\|_{W_s^p}$$

Moreover, the coefficients a_k are linear functionals of f. This rate is optimal.

25 The curse of dimensionality

If the approximation error is

$$\epsilon_n \propto \left( \frac{1}{n} \right)^{\frac{s}{d}}$$

then the number of parameters needed to achieve an error smaller than ε is:

$$n \propto \left( \frac{1}{\epsilon} \right)^{\frac{d}{s}}$$

The curse of dimensionality is the d factor; the blessing of smoothness is the s factor.

26 Jackson type rates of convergence

It happens "very often" that rates of convergence for functions in d dimensions with "smoothness" of order s are of the Jackson type:

$$O\!\left( \left( \frac{1}{n} \right)^{\frac{s}{d}} \right)$$

Example: polynomial and spline approximation techniques, many non-linear techniques.

Can we do better than this? Can we defeat the curse of dimensionality? Have we tried hard enough to find "good" approximation techniques?

27 N-widths: definition (from Pinkus, 1980)

Let X be a normed space of functions and A a subset of X. We want to approximate elements of X with linear superpositions of n basis functions {φ_i}_{i=1}^n. Some sets of basis functions are better than others: which are the best basis functions? What error do they achieve? To answer these questions we define the Kolmogorov n-width of A in X:

$$d_n(A, X) = \inf_{\phi_1, \dots, \phi_n} \; \sup_{f \in A} \; \inf_{c_1, \dots, c_n} \left\| f - \sum_{i=1}^{n} c_i \phi_i \right\|_X$$
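The parameter counts implied by slide 25 can be tabulated directly. The sketch below is illustrative only (all constants are set to one, which the slides do not claim): it shows how the number of parameters n ~ (1/ε)^{d/s} needed for a fixed accuracy ε blows up with the dimension d and shrinks with the smoothness s.

```python
# Illustration of the curse of dimensionality (constants ignored): number of
# parameters n needed for accuracy eps when eps_n ~ (1/n)^(s/d), so that
# n ~ (1/eps)^(d/s).
def params_needed(eps: float, s: float, d: float) -> float:
    return (1.0 / eps) ** (d / s)

eps = 0.1
for d in (1, 2, 5, 10, 20):
    for s in (1, 2, 4):
        print(f"d={d:2d}  s={s}  n ~ {params_needed(eps, s, d):.1e}")
```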
28 Example (Kolmogorov, 1936)

$$X = L_2[0, 2\pi]$$

$$\tilde{W}_2^s \equiv \{ f \mid f \in C^{s-1}[0, 2\pi],\; f^{(j)} \text{ periodic},\; j = 0, \dots, s-1 \}$$

$$A = \tilde{B}_2^s \equiv \{ f \mid f \in \tilde{W}_2^s,\; \|f^{(s)}\|_2 \leq 1 \} \subset X$$

Then

$$d_{2n-1}(\tilde{B}_2^s, L_2) = d_{2n}(\tilde{B}_2^s, L_2) = \left( \frac{1}{n} \right)^s$$

and the following X_n is optimal (in the sense that it achieves the rate above):

$$X_{2n-1} = \mathrm{span}\{ 1, \sin x, \cos x, \dots, \sin(n-1)x, \cos(n-1)x \}$$

29 Example: multivariate case

$$I^d \equiv [0, 1]^d, \qquad X = L_2[I^d]$$

$$W_2^s[I^d] \equiv \{ f \mid f \in C^{s-1}[I^d],\; f^{(s)} \in L_2[I^d] \}$$

$$B_2^s \equiv \{ f \mid f \in W_2^s[I^d],\; \|f^{(s)}\|_2 \leq 1 \}$$

Theorem (from Pinkus, 1980)

$$d_n(B_2^s, L_2) \approx \left( \frac{1}{n} \right)^{\frac{s}{d}}$$

Optimal basis functions are usually splines (or their relatives).

30 Dimensionality and smoothness

Classes of functions in d dimensions with smoothness of order s have an intrinsic complexity characterized by the ratio s/d:

• the curse of dimensionality is the d factor;
• the blessing of smoothness is the s factor.

We cannot expect to find an approximation technique that "beats the curse of dimensionality", unless we let the smoothness s increase with the dimension d.

31 Theorem (Barron, 1991)

Let f be a function such that its Fourier transform satisfies

$$\int_{R^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$

Let Ω be a bounded domain in R^d. Then we can find a neural network with n coefficients c_i, n weights w_i and n biases θ_i such that

$$\left\| f - \sum_{i=1}^{n} c_i \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i) \right\|^2_{L_2(\Omega)} < \frac{c}{n}$$

The rate of convergence is independent of the dimension d.

32 Here is the trick...

The space of functions such that

$$\int_{R^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$

is the space of functions that can be written as

$$f = \frac{1}{\|\mathbf{x}\|^{d-1}} * \lambda$$

where λ is any function whose Fourier transform is integrable. Notice how the space becomes more constrained as the dimension increases.

33 Theorem (Girosi and Anzellotti, 1992)

Let f ∈ H^{s,1}(R^d), where H^{s,1}(R^d) is the space of functions whose partial derivatives up to order s are integrable, and let K_s(x) be the Bessel-Macdonald kernel, that is, the kernel whose Fourier transform is

$$\tilde{K}_s(\omega) = \frac{1}{(1 + \|\omega\|^2)^{\frac{s}{2}}}, \qquad s > 0.$$

If s > d and s is even, we can find a Radial Basis Functions network with n coefficients c_α and n centers t_α such that

$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha K_s(\mathbf{x} - \mathbf{t}_\alpha) \right\|^2_{L_\infty} < \frac{c}{n}$$

34 Theorem (Girosi, 1992)

Let f ∈ H^{s,1}(R^d), where H^{s,1}(R^d) is the space of functions whose partial derivatives up to order s are integrable. If s > d and s is even, we can find a Gaussian basis function network with n coefficients c_α, n centers t_α and n variances σ_α such that

$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha\, e^{-\frac{(\mathbf{x} - \mathbf{t}_\alpha)^2}{2 \sigma_\alpha^2}} \right\|^2_{L_\infty} < \frac{c}{n}$$

35 Same rate of convergence: O(1/√n)

(Jones)
  Function space: $\int_{R^d} d\omega\, |\tilde{f}(\omega)| < +\infty$
  Norm: L_2(Ω)
  Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i \sin(\mathbf{x} \cdot \mathbf{w}_i + \theta_i)$

(Barron)
  Function space: $\int_{R^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$
  Norm: L_2(Ω)
  Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i \sigma(\mathbf{x} \cdot \mathbf{w}_i + \theta_i)$

(Breiman)
  Function space: $\int_{R^d} d\omega\, \|\omega\|^2 |\tilde{f}(\omega)| < +\infty$
  Norm: L_2(Ω)
  Approximation scheme: $f(\mathbf{x}) = \sum_{i=1}^{n} c_i |\mathbf{x} \cdot \mathbf{w}_i + \theta_i|_+ + \mathbf{a} \cdot \mathbf{x} + b$

(Girosi and Anzellotti)
  Function space: $\tilde{f}(\omega) \in C_0^s$, 2s > d
  Norm: L_∞(R^d)
  Approximation scheme: $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|\mathbf{x} - \mathbf{t}_\alpha\|^2}$

(Girosi)
  Function space: H^{2s,1}(R^d), 2s > d
  Norm: L_∞(R^d)
  Approximation scheme: $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\frac{\|\mathbf{x} - \mathbf{t}_\alpha\|^2}{\sigma_\alpha^2}}$

36 Summary

• There is a trade-off between the size of the sample (l) and the size of the hypothesis space (n);
• For a given pair of hypothesis and target spaces, the approximation error depends on the trade-off between dimensionality and smoothness;
• The trade-off has a "generic" form and sets bounds on what can and cannot be done, both in linear and non-linear approximation;
• Suitable spaces, which trade dimensionality versus smoothness, can be defined in such a way that the rate of convergence of the approximation error is independent of the dimensionality.
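To make the "dimension independent" O(1/√n) rates above a little less mysterious, here is a minimal sketch of the probabilistic argument mentioned for reference [2] (Niyogi and Girosi, 1999): if f is a convolution f = λ ∗ G, drawing n centers at random from |λ|/‖λ‖₁ and averaging the corresponding basis functions gives an unbiased n-term approximation whose error decays like 1/√n. The one-dimensional setting, the Gaussian kernel G, and the particular λ below are assumptions made only for this illustration; they are not the kernels or spaces of the theorems on slides 31–34.

```python
# Sketch of the Monte-Carlo idea behind the O(1/sqrt(n)) rates.
# Assumptions for illustration only: d = 1, a Gaussian kernel G, and a
# hand-picked lambda; this is not the exact setting of the theorems above.
import numpy as np

rng = np.random.default_rng(1)

def G(x):
    """Gaussian basis function used as the convolution kernel."""
    return np.exp(-x ** 2 / 2)

# Target f = (lambda * G)(x), a convolution, with lambda a simple signed bump.
t_grid = np.linspace(-6, 6, 2001)
lam = np.where(np.abs(t_grid) < np.pi, np.sin(t_grid), 0.0)
dt = t_grid[1] - t_grid[0]
lam_l1 = np.sum(np.abs(lam)) * dt                 # ||lambda||_1

x = np.linspace(-8, 8, 400)
f = (lam[None, :] * G(x[:, None] - t_grid[None, :])).sum(axis=1) * dt

# n-term approximation: sample centers from |lambda| / ||lambda||_1 and average.
p = np.abs(lam) / np.abs(lam).sum()
for n in [10, 40, 160, 640]:
    centers = rng.choice(t_grid, size=n, p=p)
    signs = np.sign(np.interp(centers, t_grid, lam))
    f_n = (lam_l1 / n) * (signs[None, :] * G(x[:, None] - centers[None, :])).sum(axis=1)
    err = np.sqrt(np.mean((f - f_n) ** 2))
    print(f"n={n:4d}  RMS error ~ {err:.4f}")
```

Because each random term has expectation f(x), the n-term average is unbiased and its variance shrinks like 1/n, which is the Monte-Carlo flavor of the dimension independent bounds; the printed errors should roughly halve each time n is quadrupled.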