Approximation Error and Approximation Theory
Federico Girosi
Center for Basic Research in the Social Sciences
Harvard University
and
Center for Biological and Computational Learning
MIT
fgirosi@latte.harvard.edu
Plan of the class
• Learning and generalization error
• Approximation problem and rates of convergence
• N-widths
• “Dimension independent” convergence rates
Note
These slides cover more material than will be presented in class.
References
The background material on generalization error (first 8 slides) is explained at length in:
1. P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity,
and sample complexity for Radial Basis Functions. Neural Computation, 8:819–842, 1996.
2. P. Niyogi and F. Girosi. Generalization bounds for function approximation from scattered noisy
data. Advances in Computational Mathematics, 10:51–80, 1999.
[1] has a longer explanation and introduction, while [2] is more mathematical and also contains
a very simple probabilistic proof of a class of “dimension independent” bounds, like the ones
discussed at the end of this class.
As far as I know it is A. Barron who first clearly spelled out the decomposition of the
generalization error into two parts. Barron uses a different framework from ours, and he
summarizes it nicely in:
3. A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine
Learning, 14:115–133, 1994.
The paper is quite technical, and uses a framework which is different from what we use here,
but it is important to read it if you plan to do research in this field.
The material on n-widths comes from:
4. A. Pinkus. N-widths in Approximation Theory, Springer-Verlag, New York, 1980.
Although the book is very technical, the first 8 pages contain an excellent introduction to the
subject. The other great thing about this book is that you do not need to understand every
single proof to appreciate the beauty and significance of the results, and it is a mine of useful
information.
5. H.N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural Computation, 8:164–177, 1996.
6. A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information Theory, 39(3):930–945, 1993.
7. F. Girosi and G. Anzellotti. Rates of convergence of approximation by translates. A.I. Memo
1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.
For a curious way to prove dimension independent bounds using VC theory see:
8. F. Girosi. Approximation error bounds that use VC-bounds. In Proc. International Conference
on Artificial Neural Networks, F. Fogelman-Soulié and P. Gallinari, editors, Vol. 1, 295–302.
Paris, France, October 1995.
Notations review
• $I[f] = \int_{X \times Y} V(f(x), y)\, p(x, y)\, dx\, dy$
• $I_{\rm emp}[f] = \frac{1}{l} \sum_{i=1}^{l} V(f(x_i), y_i)$
• $f_0 = \arg\min_{f \in \mathcal{T}} I[f]$
• $f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} I[f]$
• $\hat{f}_{\mathcal{H},l} = \arg\min_{f \in \mathcal{H}} I_{\rm emp}[f]$
More notation review
• $I[f_0]$ = how well we could possibly do
• $I[f_{\mathcal{H}}]$ = how well we can do in space $\mathcal{H}$
• $I[\hat{f}_{\mathcal{H},l}]$ = how well we do in space $\mathcal{H}$ with $l$ data
• $|I[f] - I_{\rm emp}[f]| \leq \Omega(\mathcal{H}, l, \delta) \quad \forall f \in \mathcal{H}$ (from VC theory)
• $I[\hat{f}_{\mathcal{H},l}]$ is called generalization error
• $I[\hat{f}_{\mathcal{H},l}] - I[f_0]$ is also called generalization error ...
A General Decomposition
$$I[\hat{f}_{\mathcal{H},l}] - I[f_0] = \left( I[\hat{f}_{\mathcal{H},l}] - I[f_{\mathcal{H}}] \right) + \left( I[f_{\mathcal{H}}] - I[f_0] \right)$$
generalization error = estimation error + approximation error
When the cost function $V$ is quadratic:
$$I[f] = \|f_0 - f\|^2 + I[f_0]$$
and therefore
$$\|f_0 - \hat{f}_{\mathcal{H},l}\|^2 = \left( I[\hat{f}_{\mathcal{H},l}] - I[f_{\mathcal{H}}] \right) + \|f_0 - f_{\mathcal{H}}\|^2$$
generalization error = estimation error + approximation error
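For completeness, here is a sketch of why the quadratic identity holds, assuming $V(f(x), y) = (f(x) - y)^2$ and assuming that $f_0$ coincides with the regression function $\mathbb{E}[y \mid x]$ (which is the case whenever the regression function belongs to $\mathcal{T}$); here $\|\cdot\|$ denotes the $L_2$ norm weighted by the marginal $p(x)$:
$$I[f] = \int_{X \times Y} \left( f(x) - f_0(x) + f_0(x) - y \right)^2 p(x, y)\, dx\, dy = \|f - f_0\|^2 + I[f_0],$$
because the cross term $2 \int_{X \times Y} (f(x) - f_0(x))(f_0(x) - y)\, p(x, y)\, dx\, dy$ vanishes: integrating over $y$ first gives $\int_Y (f_0(x) - y)\, p(y \mid x)\, dy = 0$ by the definition of the regression function.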
A useful inequality
If, with probability $1 - \delta$,
$$|I[f] - I_{\rm emp}[f]| \leq \Omega(\mathcal{H}, l, \delta) \quad \forall f \in \mathcal{H}$$
then
$$I[\hat{f}_{\mathcal{H},l}] - I[f_{\mathcal{H}}] \leq 2\,\Omega(\mathcal{H}, l, \delta)$$
You can prove it using the following observations:
• $I[f_{\mathcal{H}}] \leq I[\hat{f}_{\mathcal{H},l}]$ (from the definition of $f_{\mathcal{H}}$)
• $I_{\rm emp}[\hat{f}_{\mathcal{H},l}] \leq I_{\rm emp}[f_{\mathcal{H}}]$ (from the definition of $\hat{f}_{\mathcal{H},l}$)
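A minimal sketch of the proof, chaining the second observation with the uniform bound (applied twice, once in each direction):
$$I[\hat{f}_{\mathcal{H},l}] - I[f_{\mathcal{H}}] = \underbrace{\left( I[\hat{f}_{\mathcal{H},l}] - I_{\rm emp}[\hat{f}_{\mathcal{H},l}] \right)}_{\leq\, \Omega} + \underbrace{\left( I_{\rm emp}[\hat{f}_{\mathcal{H},l}] - I_{\rm emp}[f_{\mathcal{H}}] \right)}_{\leq\, 0} + \underbrace{\left( I_{\rm emp}[f_{\mathcal{H}}] - I[f_{\mathcal{H}}] \right)}_{\leq\, \Omega} \leq 2\,\Omega(\mathcal{H}, l, \delta).$$
(The first observation simply tells us that the left-hand side is non-negative.)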
Bounding the Generalization Error
$$\|f_0 - \hat{f}_{\mathcal{H},l}\|^2 \leq 2\,\Omega(\mathcal{H}, l, \delta) + \|f_0 - f_{\mathcal{H}}\|^2$$
Notice that:
• $\Omega$ has nothing to do with the target space $\mathcal{T}$; it is studied mostly in statistics;
• $\|f_0 - f_{\mathcal{H}}\|$ has everything to do with the target space $\mathcal{T}$; it is studied mostly in approximation theory.
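This bound is obtained by plugging the inequality of the previous slide into the quadratic decomposition:
$$\|f_0 - \hat{f}_{\mathcal{H},l}\|^2 = \left( I[\hat{f}_{\mathcal{H},l}] - I[f_{\mathcal{H}}] \right) + \|f_0 - f_{\mathcal{H}}\|^2 \leq 2\,\Omega(\mathcal{H}, l, \delta) + \|f_0 - f_{\mathcal{H}}\|^2.$$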
Approximation Error
We consider a nested family of hypothesis spaces $\mathcal{H}_n$:
$$\mathcal{H}_0 \subset \mathcal{H}_1 \subset \ldots \subset \mathcal{H}_n \subset \ldots$$
and define the approximation error as:
$$\epsilon_{\mathcal{T}}(f, \mathcal{H}_n) \equiv \inf_{h \in \mathcal{H}_n} \|f - h\|$$
$\epsilon_{\mathcal{T}}(f, \mathcal{H}_n)$ is the smallest error that we can make if we approximate $f \in \mathcal{T}$ with an element of $\mathcal{H}_n$ (here $\|\cdot\|$ is the norm in $\mathcal{T}$).
Approximation error
For reasonable choices of hypothesis spaces $\mathcal{H}_n$:
$$\lim_{n \to \infty} \epsilon_{\mathcal{T}}(f, \mathcal{H}_n) = 0$$
This means that we can approximate functions of $\mathcal{T}$ arbitrarily well with elements of $\{\mathcal{H}_n\}_{n=1}^{\infty}$.
Example: $\mathcal{T}$ = continuous functions on compact sets, and $\mathcal{H}_n$ = polynomials of degree at most $n$.
The rate of convergence
The interesting question is:
How fast does $\epsilon_{\mathcal{T}}(f, \mathcal{H}_n)$ go to zero?
• The rate of convergence is a measure of the relative complexity of $\mathcal{T}$ with respect to the approximation scheme $\mathcal{H}$.
• The rate of convergence determines how many samples we need in order to obtain a given generalization error.
An Example
• In the next slides we compute explicitly the rate of
convergence of approximation of a smooth function by
trigonometric polynomials.
• We are interested in studying how fast the approximation
error goes to zero when the number of parameters of
our approximation scheme goes to infinity.
• The reason for this exercise is that the results are
representative: more complex and interesting cases all
share the basic features of this example.
Approximation by Trigonometric Polynomials
Consider the set of functions
$$C_2[-\pi, \pi] \equiv C[-\pi, \pi] \cap L_2[-\pi, \pi]$$
Functions in this set can be represented as a Fourier series:
$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}, \qquad c_k \propto \int_{-\pi}^{\pi} dx\, f(x)\, e^{-ikx}$$
The $L_2$ norm satisfies the equation:
$$\|f\|^2_{L_2} = \sum_{k=0}^{\infty} c_k^2$$
We consider as target space the following Sobolev space of smooth functions:
$$W_{s,2} \equiv \left\{ f \in C_2[-\pi, \pi] \;\Big|\; \left\| \frac{d^s f}{dx^s} \right\|^2_{L_2} < +\infty \right\}$$
The (semi)-norm in this Sobolev space is defined as:
$$\|f\|^2_{W_{s,2}} \equiv \left\| \frac{d^s f}{dx^s} \right\|^2_{L_2} = \sum_{k=1}^{\infty} k^{2s} c_k^2$$
If $f$ belongs to $W_{s,2}$ then the Fourier coefficients $c_k$ must go to zero at a rate which increases with $s$.
We choose as hypothesis space $\mathcal{H}_n$ the set of trigonometric polynomials of degree $n$:
$$p(x) = \sum_{k=1}^{n} a_k e^{ikx}$$
Given a function of the form
$$f(x) = \sum_{k=0}^{\infty} c_k e^{ikx}$$
the optimal hypothesis $f_n(x)$ is given by the first $n$ terms of its Fourier series:
$$f_n(x) = \sum_{k=1}^{n} c_k e^{ikx}$$
Key Question
For a given $f \in W_{s,2}$ we want to study the approximation error:
$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2}$$
• Notice that $n$, the degree of the polynomial, is also the number of parameters that we use in the approximation.
• Obviously $\epsilon_n$ goes to zero as $n \to +\infty$, but the key question is: how fast?
An easy estimate of $\epsilon_n$
$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2} = \sum_{k=n+1}^{\infty} c_k^2 = \sum_{k=n+1}^{\infty} c_k^2\, k^{2s}\, \frac{1}{k^{2s}} < \frac{1}{n^{2s}} \sum_{k=n+1}^{\infty} c_k^2\, k^{2s} < \frac{1}{n^{2s}} \sum_{k=1}^{\infty} c_k^2\, k^{2s} = \frac{\|f\|^2_{W_{s,2}}}{n^{2s}}$$
⇓
$$\epsilon_n[f] < \frac{\|f\|^2_{W_{s,2}}}{n^{2s}}$$
More smoothness ⇒ faster rate of convergence
But what happens in more than one dimension?
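As a sanity check (not in the original slides), here is a short numerical illustration in Python. We pick a hypothetical set of Fourier coefficients $c_k = k^{-(s+1)}$, which makes $f$ belong to $W_{s,2}$, and compare the truncation error $\epsilon_n[f]$ with the bound $\|f\|^2_{W_{s,2}} / n^{2s}$:

    # Illustration of eps_n[f] < ||f||^2_{W_{s,2}} / n^{2s} in one dimension.
    # Hypothetical test function: Fourier coefficients c_k = k^{-(s+1)}, which
    # lie in W_{s,2} because sum_k k^{2s} c_k^2 = sum_k k^{-2} < +infinity.
    import numpy as np

    s = 2                                   # assumed smoothness
    K = 200000                              # truncation of the infinite sums
    k = np.arange(1, K + 1, dtype=float)
    c = k ** (-(s + 1))                     # Fourier coefficients of the test function

    sobolev_norm_sq = np.sum(k ** (2 * s) * c ** 2)     # ||f||^2_{W_{s,2}}

    for n in (4, 8, 16, 32, 64):
        eps_n = np.sum(c[n:] ** 2)                      # tail sum = ||f - f_n||^2_{L_2}
        bound = sobolev_norm_sq / n ** (2 * s)
        print(f"n={n:3d}   eps_n={eps_n:.3e}   bound={bound:.3e}")

For this particular choice of coefficients the error decays even faster than the bound, which is expected: the bound only uses the fact that $f$ has $s$ square integrable derivatives.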
Rates of convergence in d dimensions
It is enough to study d = 2. We proceed in full analogy
with the 1-d case:
$$f(x, y) = \sum_{k,m=1}^{\infty} c_{km}\, e^{i(kx+my)}$$
$$\|f\|^2_{W_{s,2}} \equiv \left\| \frac{\partial^s f}{\partial x^s} \right\|^2_{L_2} + \left\| \frac{\partial^s f}{\partial y^s} \right\|^2_{L_2} = \sum_{k,m=1}^{\infty} (k^{2s} + m^{2s})\, c_{km}^2$$
Here $W_{s,2}$ is defined as the set of functions such that $\|f\|^2_{W_{s,2}} < +\infty$.
We choose as hypothesis space $\mathcal{H}_n$ the set of trigonometric polynomials of degree $l$:
$$p(x, y) = \sum_{k,m=1}^{l} a_{km}\, e^{i(kx+my)}$$
A trigonometric polynomial of degree $l$ in $d$ variables has a number of coefficients $n = l^d$.
We are interested in the behavior of the approximation error as a function of $n$. The approximating function is:
$$f_n(x, y) = \sum_{k,m=1}^{l} c_{km}\, e^{i(kx+my)}$$
An easy estimate of $\epsilon_n$
$$\epsilon_n[f] \equiv \|f - f_n\|^2_{L_2} = \sum_{k,m=l+1}^{\infty} c_{km}^2 = \sum_{k,m=l+1}^{\infty} c_{km}^2\, \frac{k^{2s} + m^{2s}}{k^{2s} + m^{2s}} < \frac{1}{2 l^{2s}} \sum_{k,m=l+1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) < \frac{1}{2 l^{2s}} \sum_{k,m=1}^{\infty} c_{km}^2 (k^{2s} + m^{2s}) = \frac{\|f\|^2_{W_{s,2}}}{2 l^{2s}}$$
Since $n = l^d$, then $l = n^{1/d}$ (with $d = 2$), and we obtain:
$$\epsilon_n < \frac{\|f\|^2_{W_{s,2}}}{2\, n^{2s/d}}$$
(Partial) Summary
The previous calculation generalizes easily to the $d$-dimensional case. Therefore we conclude that:
if we approximate functions of $d$ variables with $s$ square integrable derivatives with a trigonometric polynomial with $n$ coefficients, the approximation error satisfies:
$$\epsilon_n < \frac{C}{n^{2s/d}}$$
More smoothness $s$ ⇒ faster rate of convergence
Higher dimension $d$ ⇒ slower rate of convergence
Another Example: Generalized Translation Networks
Consider networks of the form:
$$f(x) = \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k)$$
where $x \in \mathbb{R}^d$, $b_k \in \mathbb{R}^m$, $1 \leq m \leq d$, the $A_k$ are $m \times d$ matrices, $a_k \in \mathbb{R}$ and $\phi$ is some given function.
For $m = 1$ this is a Multilayer Perceptron.
For $m = d$, $A_k$ diagonal and $\phi$ radial, this is a Radial Basis Functions network.
Theorem (Mhaskar, 1994)
Let $W_{s,p}(\mathbb{R}^d)$ be the space of functions whose derivatives up to order $s$ are $p$-integrable in $\mathbb{R}^d$. Under very general assumptions on $\phi$ one can prove that there exist $m \times d$ matrices $\{A_k\}_{k=1}^{n}$ such that, for any $f \in W_{s,p}(\mathbb{R}^d)$, one can find $b_k$ and $a_k$ such that:
$$\left\| f - \sum_{k=1}^{n} a_k\, \phi(A_k x + b_k) \right\|_p \leq c\, n^{-s/d}\, \|f\|_{W_{s,p}}$$
Moreover, the coefficients $a_k$ are linear functionals of $f$.
This rate is optimal.
The curse of dimensionality
If the approximation error is
$$\epsilon_n \propto \left( \frac{1}{n} \right)^{s/d}$$
then the number of parameters needed to achieve an error smaller than $\epsilon$ is:
$$n \propto \left( \frac{1}{\epsilon} \right)^{d/s}$$
the curse of dimensionality is the $d$ factor;
the blessing of smoothness is the $s$ factor.
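To make the growth concrete, here is a small Python calculation (illustrative numbers only, with the proportionality constant set to 1) of $n \propto (1/\epsilon)^{d/s}$ for a target error $\epsilon = 0.01$ and a few hypothetical values of $d$ and $s$:

    # Rough count of parameters needed for error eps, using n ~ (1/eps)^(d/s).
    eps = 0.01
    for d in (1, 2, 5, 10):
        for s in (1, 2, 4):
            n = (1.0 / eps) ** (d / s)
            print(f"d={d:2d}  s={s}  n ~ {n:.3g}")

For $d = 10$ and $s = 1$ this gives $n \sim 10^{20}$, while raising the smoothness to $s = 4$ brings it down to about $10^5$: the curse of dimensionality and the blessing of smoothness in numbers.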
Jackson type rates of convergence
It happens “very often” that rates of convergence for
functions in d dimensions with “smoothness” of order s
are of the Jackson type:
$$O\!\left( \left( \frac{1}{n} \right)^{s/d} \right)$$
Example: polynomial and spline approximation
techniques, many non-linear techniques.
Can we do better than this? Can we defeat the
curse of dimensionality? Have we tried hard enough
to find “good” approximation techniques?
N-widths: definition
(from Pinkus, 1980)
Let $X$ be a normed space of functions and $A$ a subset of $X$. We want to approximate elements of $X$ with linear superpositions of $n$ basis functions $\{\phi_i\}_{i=1}^{n}$.
Some sets of basis functions are better than others: which are the best basis functions? What error do they achieve?
To answer these questions we define the Kolmogorov $n$-width of $A$ in $X$:
$$d_n(A, X) = \inf_{\phi_1, \ldots, \phi_n}\ \sup_{f \in A}\ \inf_{c_1, \ldots, c_n} \left\| f - \sum_{i=1}^{n} c_i \phi_i \right\|_X$$
Example (Kolmogorov, 1936)
$$X = L_2[0, 2\pi]$$
$$\tilde{W}_2^s \equiv \{ f \;|\; f \in C^{s-1}[0, 2\pi],\ f^{(j)}\ \text{periodic},\ j = 0, \ldots, s-1 \}$$
$$A = \tilde{B}_2^s \equiv \{ f \;|\; f \in \tilde{W}_2^s,\ \|f^{(s)}\|_2 \leq 1 \} \subset X$$
Then
$$d_{2n-1}(\tilde{B}_2^s, L_2) = d_{2n}(\tilde{B}_2^s, L_2) = \frac{1}{n^s}$$
and the following $X_n$ is optimal (in the sense that it achieves the rate above):
$$X_{2n-1} = \text{span}\{ 1, \sin x, \cos x, \ldots, \sin(n-1)x, \cos(n-1)x \}$$
Example: multivariate case
$$I^d \equiv [0, 1]^d$$
$$X = L_2[I^d]$$
$$W_2^s[I^d] \equiv \{ f \;|\; f \in C^{s-1}[I^d],\ f^{(s)} \in L_2[I^d] \}$$
$$B_2^s \equiv \{ f \;|\; f \in W_2^s[I^d],\ \|f^{(s)}\|_2 \leq 1 \}$$
Theorem (from Pinkus, 1980)
$$d_n(B_2^s, L_2) \approx \left( \frac{1}{n} \right)^{s/d}$$
Optimal basis functions are usually splines (or their relatives).
Dimensionality and smoothness
Classes of functions in $d$ dimensions with smoothness of order $s$ have an intrinsic complexity characterized by the ratio $s/d$:
• the curse of dimensionality is the d factor;
• the blessing of smoothness is the s factor;
We cannot expect to find an approximation technique
that “beats the curse of dimensionality”, unless we let
the smoothness s increase with the dimension d.
Theorem (Barron, 1991)
Let $f$ be a function such that its Fourier transform satisfies
$$\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$
Let $\Omega$ be a bounded domain in $\mathbb{R}^d$. Then we can find a neural network with $n$ coefficients $c_i$, $n$ weights $w_i$ and $n$ biases $\theta_i$ such that
$$\left\| f - \sum_{i=1}^{n} c_i\, \sigma(x \cdot w_i + \theta_i) \right\|^2_{L_2(\Omega)} < \frac{c}{n}$$
The rate of convergence is independent of the dimension $d$.
Here is the trick...
The space of functions such that
$$\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$$
is the space of functions that can be written as
$$f = \frac{1}{\|x\|^{d-1}} * \lambda$$
where $\lambda$ is any function whose Fourier transform is integrable.
Notice how the space becomes more constrained as the dimension increases.
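A quick way to see this equivalence (a sketch, ignoring constants and integrability technicalities): the Fourier transform of $\|x\|^{-(d-1)}$ in $\mathbb{R}^d$ is proportional to $\|\omega\|^{-1}$, so for $f = \|x\|^{-(d-1)} * \lambda$ the convolution theorem gives
$$\tilde{f}(\omega) = c_d\, \frac{\tilde{\lambda}(\omega)}{\|\omega\|} \qquad \Longrightarrow \qquad \int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| = c_d \int_{\mathbb{R}^d} d\omega\, |\tilde{\lambda}(\omega)| < +\infty,$$
where $c_d$ is a dimension-dependent constant.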
Theorem (Girosi and Anzellotti, 1992)
Let $f \in H^{s,1}(\mathbb{R}^d)$, where $H^{s,1}(\mathbb{R}^d)$ is the space of functions whose partial derivatives up to order $s$ are integrable, and let $K_s(x)$ be the Bessel-Macdonald kernel, that is, the Fourier transform of
$$\tilde{K}_s(\omega) = \frac{1}{(1 + \|\omega\|^2)^{s/2}}, \qquad s > 0.$$
If $s > d$ and $s$ is even, we can find a Radial Basis Functions network with $n$ coefficients $c_\alpha$ and $n$ centers $t_\alpha$ such that
$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha\, K_s(x - t_\alpha) \right\|^2_{L_\infty} < \frac{c}{n}$$
Theorem (Girosi, 1992)
Let $f \in H^{s,1}(\mathbb{R}^d)$, where $H^{s,1}(\mathbb{R}^d)$ is the space of functions whose partial derivatives up to order $s$ are integrable. If $s > d$ and $s$ is even, we can find a Gaussian basis function network with $n$ coefficients $c_\alpha$, $n$ centers $t_\alpha$ and $n$ variances $\sigma_\alpha$ such that
$$\left\| f - \sum_{\alpha=1}^{n} c_\alpha\, e^{-\frac{(x - t_\alpha)^2}{2\sigma_\alpha^2}} \right\|^2_{L_\infty} < \frac{c}{n}$$
Same rate of convergence: $O(1/\sqrt{n})$
Function space | Norm | Approximation scheme
(Jones) $\int_{\mathbb{R}^d} d\omega\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i \sin(x \cdot w_i + \theta_i)$
(Barron) $\int_{\mathbb{R}^d} d\omega\, \|\omega\|\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i\, \sigma(x \cdot w_i + \theta_i)$
(Breiman) $\int_{\mathbb{R}^d} d\omega\, \|\omega\|^2\, |\tilde{f}(\omega)| < +\infty$ | $L_2(\Omega)$ | $f(x) = \sum_{i=1}^{n} c_i\, |x \cdot w_i + \theta_i|_+ + a \cdot x + b$
(Girosi and Anzellotti) $\tilde{f}(\omega) \in C_0^s,\ 2s > d$ | $L_\infty(\mathbb{R}^d)$ | $f(x) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|x - t_\alpha\|^2}$
(Girosi) $H^{2s,1}(\mathbb{R}^d),\ 2s > d$ | $L_\infty(\mathbb{R}^d)$ | $f(x) = \sum_{\alpha=1}^{n} c_\alpha\, e^{-\|x - t_\alpha\|^2 / \sigma_\alpha^2}$
Summary
• There is a trade-off between the size of the sample ($l$) and the size of the hypothesis space ($n$);
• For a given pair of hypothesis and target spaces, the approximation error depends on the trade-off between dimensionality and smoothness;
• The trade-off has a “generic” form and sets bounds on what can and cannot be done, both in linear and non-linear approximation;
• Suitable spaces, which trade dimensionality versus
smoothness, can be defined in such a way that the rate
of convergence of the approximation error is independent
of the dimensionality.