Analysis of Generalized Ridge Functions in High Dimensions
Sandra Keiper
Technische Universität Berlin
Institut für Mathematik
Straße des 17. Juni 136, 10623 Berlin
Email: keiper@math.tu-berlin.de
Abstract—The approximation of functions in many variables
suffers from the so-called “curse of dimensionality”. Namely,
functions on R^N with smoothness of order s can be recovered
with an accuracy of at most n^{−s/N} using n-dimensional
spaces for linear or nonlinear approximation. However, there is
a common belief that functions arising as solutions of real-world
problems have more structure than usual N-variate functions.
This has led to the introduction of different models for those
functions. One of the most popular models is that of so-called
ridge functions, which are of the form
R^N ⊇ Ω ∋ x ↦ f(x) = g(Ax),    (1)
where A ∈ R^{m×N} is a matrix and m is considerably smaller than
N. The approximation of such functions was for example studied
in [1], [2], [3], and [4].
However, by considering functions of the form (1), we assume
that real world problems can be described by functions that are
constant along certain linear subspaces. Such an assumption is quite
restrictive and we, therefore, want to study a more generalized
form of ridge functions, namely functions which are constant
along certain submanifolds of R^N. Hence, we introduce the notion
of generalized ridge functions, which are defined to be functions
of the form
R^N ∋ x ↦ f(x) = g(dist(x, M)),    (2)
where M is a d-dimensional, smooth submanifold of R^N and g ∈
C^s(R). Note that if M is an (N − 1)-dimensional, affine subspace
of R^N and we consider the signed distance in equation (2), we
indeed have the case of a usual ridge function. We will analyze
how the methods to approximate usual ridge functions apply to
generalized ridge functions and investigate new algorithms for
their approximation.
I. INTRODUCTION
An approach to break the curse of dimensionality is to
consider ridge functions of the form (1), where A is usually
called ridge matrix and g ∈ C^s(R^m), 1 ≤ s ≤ 2, is called
ridge profile.
For particular choices of A, different approaches have
been investigated. For example, if A is of the form A^T =
[e_{i_1}, . . . , e_{i_m}], where the e_{i_k} ∈ R^N are canonical unit vectors, i_k ∈
{1, . . . , N}, f can be rewritten as a function which depends
only on a few variables, i.e. f(x_1, . . . , x_N) = g(x_{i_1}, . . . , x_{i_m}).
An approach to recovering the active variables and approximating the ridge profile g was given in [1]. For g in
some approximation class A^s defined in [1] the result reads
as follows:
Theorem I.1 ([1]). If f(x) = g(x_{i_1}, . . . , x_{i_m}) with g ∈ A^s,
then the function f̂ determined by the algorithm introduced in
[1] satisfies
‖f − f̂‖_{C([0,1]^N)} ≤ |g|_{A^s} l^{−s},
where the number of point values used in the algorithm is
smaller than
(l + 1)^m #(A) + m⌈log_2 N⌉,
for a family A of partitions of {1, . . . , N } into m disjoint
subsets, which is rich enough in the sense that given m distinct
integers i_1, . . . , i_m ∈ {1, 2, . . . , N} there is a partition A ∈ A
such that each set in A contains precisely one of the integers
i_1, . . . , i_m.
Note that A can be chosen such that #(A) is bounded by
#(A) ≲ log_2 N.
Another approach is to assume that m = 1 and that the
matrix A therefore is a vector, usually called ridge vector and
denoted by A =: a. In this case f is of the form
f(x) = g(⟨x, a⟩).    (3)
We usually denote the space of all ridge functions of this type
with R(s) and with R+ (s) if all entries of a are assumed to
be positive. The recovery of usual ridge functions from point
queries was first considered by Cohen, Daubechies, DeVore,
Kerkyacharian and Picard in [2] for ridge functions with
positive ridge vector. It was shown that the accuracy of their
method is close to the approximation rate of one-dimensional
functions:
Theorem I.2 ([2]). Let f = g(⟨·, a⟩) be such that f ∈ R+(s),
with s > 1 and ‖g‖_{C^s} ≤ M0 as well as ‖a‖_{wℓ_p} ≤ M1.
Then the algorithm introduced in [2] requires O(l) queries to
approximate f by f̂ satisfying
‖f − f̂‖_{C([0,1]^N)} ≤ C0 M0 l^{−s} + C1 M1 (N, l)^{1/p−1},
where l ≥ 1, C0 is a constant depending only on s, C1 depends
on p, and
(N, l) := [1 + log(N/l)]/l  if l < N,    (N, l) := 0  if l ≥ N.
In the remainder of this abstract we will denote the space
of functions f ∈ R+ (s) which fulfill the assumptions of the
above theorem by R+ (s, p; M0 , M1 ).
However, the algorithm from [2] does not apply to arbitrary
ridge vectors. In [3] and [4] new algorithms were introduced
to waive the assumption of a positive ridge vector. In [4] it
was shown:
Theorem I.3 ([4]). Let f : [−1, 1]^N → R be a ridge function
with f(x) = g(⟨x, a⟩), where g is a Lipschitz continuous
function which is differentiable with g′(0) > 0 and g′ also
Lipschitz continuous. For an arbitrary h > 0 one can construct
a function f̂ satisfying
‖f − f̂‖_∞ ≤ 2 c0 ‖a − â‖_1 ≤ 4 c0 c1 h / (g′(0) − c1 h),
using N + 1 samples and assuming that g′(0) − c1 h > 0.
Here â is the approximation of the ridge vector a found by the
algorithm. The constants c0 and c1 are given by the Lipschitz
constants of g and g′.
The main idea of the algorithm in [4] is to approximate the
gradient of f by divided differences exploiting the fact that the
gradient of f is some scalar multiple of the ridge vector. The
accuracy of the approximation of the gradient is determined by
the choice of the number h, whereas the number of sampling
points is fixed.
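To make the idea concrete, here is a minimal Python sketch of this gradient-based recovery of the ridge direction (our own illustration, not the algorithm of [4]; the function name, the toy profile g and the step size h are assumptions): the divided differences (f(h e_k) − f(0))/h approximate ∇f(0) = g′(0)·a, which is then normalized.

```python
import numpy as np

def estimate_ridge_vector(f, N, h=1e-3):
    """Estimate the ridge direction of f(x) = g(<x, a>) from N + 1 point queries.

    The divided differences (f(h * e_k) - f(0)) / h approximate the gradient
    of f at 0, which equals g'(0) * a, i.e. a scalar multiple of the ridge
    vector; normalizing removes the unknown factor g'(0) (assumed nonzero).
    """
    f0 = f(np.zeros(N))
    grad = np.array([(f(h * np.eye(N)[k]) - f0) / h for k in range(N)])
    return grad / np.linalg.norm(grad)

# toy example with g(t) = sin(t), so g'(0) = 1 > 0
N = 50
a = np.random.randn(N)
a /= np.linalg.norm(a)
f = lambda x: np.sin(x @ a)
a_hat = estimate_ridge_vector(f, N)
print(np.linalg.norm(a - a_hat))  # shrinks as h is decreased
```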
The approach by Fornasier, Schnass and Vybiral [3] is based
on compressed sensing instead. Here, not the gradient but the
directional derivatives of f are approximated at a certain
number, say L_X, of random points in L_Φ random directions. It
was shown:
Theorem I.4 ([3]). Let 0 < s < 1 and let log N ≤ L_Φ ≤
[log 6]^{−2} N. Then for every h > 0 there is a constant C such
that, using L_X(L_Φ + 1) function evaluations of f, the algorithm
introduced in [3] defines a function f̂ that, with probability
1 − ( e^{−C L_Φ} + e^{−√(L_Φ N)} + 2 e^{−2 L_X s^2 α^2 / (4 C2^2)} ),
will satisfy
‖f − f̂‖_∞ ≤ 2 C2 √( ν1 / (α(1 − s) − ν1) ),
where α is the expectation of |g′(⟨a, ·⟩)|^2 with respect to the uniform
surface measure µ_{S^{d−1}} on the sphere, namely
α = ∫_{S^{d−1}} |g′(⟨a, ξ⟩)|^2 dµ_{S^{d−1}}(ξ),
which ensures a lower bound for g′(⟨a, ·⟩), and
ν1 = C′ [ (L_Φ / log(N/L_Φ))^{1/2−1/q} + h/√(L_Φ) ],
where the positive constant C′ depends only on C1 ≥ ‖a‖_q
and on C2 ≥ sup_{|α|≤2} ‖D^α g‖_∞.
Note that h again plays the role of determining the accuracy
of approximating the directional derivative by divided differences and that the last mentioned approach can also be applied
to functions of type (1).
II. ALMOST RIDGE FUNCTIONS
The assumption that functions arising as solutions of real
world problems are precisely of the form (3) is very strong.
Therefore, it may be useful to consider functions that are only
close to usual ridge functions.
Hence, we introduce two possibilities to define almost ridge
functions and analyze how particular algorithms, which are
designed to capture usual ridge functions, apply to recover
almost ridge functions. In particular we analyze functions of
type (2) with M being close to some hyperplane.
Let us define almost ridge functions of type I. Note that for
an (N − 1)-dimensional submanifold M the signed distance
is given as
dist_±(x, M) = dist(x, M)   if x lies on an outward normal ray of M,
dist_±(x, M) = −dist(x, M)  if x lies on an inward normal ray of M,
and that we denote the orthogonal projection onto an affine
subspace H of R^N by P_H.
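For the special case where M is itself an affine hyperplane H = {y : ⟨y, n⟩ = b} with unit normal n, the signed distance and the projection P_H have simple closed forms; the following short Python sketch (the function names and the hyperplane parametrization are our own choice) illustrates the two quantities used below.

```python
import numpy as np

def signed_dist(x, n, b):
    """Signed distance of x to H = {y : <y, n> = b}, where ||n||_2 = 1.

    Positive on the side the unit normal n points to, negative on the
    other side, in the spirit of dist_± above.
    """
    return x @ n - b

def project(x, n, b):
    """Orthogonal projection P_H(x) onto H = {y : <y, n> = b}, ||n||_2 = 1."""
    return x - signed_dist(x, n, b) * n
```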
Definition II.1. Let g ∈ C^s(R) and let M be an (N − 1)-dimensional
smooth, connected submanifold of R^N, such that we
can find an (N − 1)-dimensional affine subspace H of R^N and
an ε ≥ 0 with dist(H, M) := sup_{x∈M} ‖P_H(x) − x‖_2 ≤ ε
and such that P_H : M → H is surjective. Then we call
f(x) := g(dist_±(x, M))
an almost ridge function of type I and f̂ = g(dist(·, H)) a
ridge estimator of the almost ridge function.
The algorithm by Cohen et al. [2] can be applied immediately, since a stability result for noisy
measurements was proven there.
Theorem II.2 ([2]). Suppose that we receive the values of
a ridge function f only up to an accuracy ε. That is, when
sampling the value of f at any point x, we receive instead
the value f˜(x) satisfying |f (x) − f˜(x)| ≤ ε. Then if f ∈
R+ (s, q; M0 , M1 ) and
ε ≤ (M0/6) s̄^{2S+1/2} l^{−2S+3/2},
where s̄, S > 0 and s̄ ∈ N are such that s̄ + 1 ≤ s ≤ S, the
output f̂ of the algorithm satisfies
‖f − f̂‖_∞ ≤ C0 M0 l^{−s} + C1 M1 (N, l)^{1/q−1},
for some constants C0 depending on s̄, S and C1 depending
on q and on l, the number of function evaluations.
Note that we indeed obtain an approximation result for
almost ridge functions of type I by regarding the almost ridge
function f as the noisy data f̃ of Theorem II.2 and a corresponding
ridge estimator, which we write as f_r := g(dist(·, H)), as the
underlying ridge function. The algorithm then gives an approximation f̂ which fulfills
‖f − f̂‖_∞ ≤ ‖f − f_r‖_∞ + ‖f_r − f̂‖_∞
≤ ε + C0 M0 l^{−s} + C1 M1 (N, l)^{1/q−1}.
Also in [4] a stability result for the case of random noise
was proven. However, the difference between an almost ridge
function and a usual ridge function cannot be seen as random
noise. Instead we can prove the following result.
Theorem II.3. Let f(x) = g(dist_±(x, M)) be an almost ridge
function of type I and let f̂ = g(dist_±(·, H)) ∈ R+(s)
be a ridge estimator such that ε(x) := dist_±(x, M) −
dist_±(x, H) satisfies ‖ε′(x)‖_2 ≤ η. Then by the method introduced in [4] we can approximate the ridge vector a of the
ridge estimator f̂ by â, with an error bounded from above by
‖a − â‖_2 ≤ √( (h ν1(η) + h |g′(0)| ν2(η) + |g′(0)| N η) / (|g′(0)|^2 ν3(η) − |g′(0)| h ν4(η) − h^2 ν5(η)) ),
where ν_i(η), i = 1, . . . , 5, are some constants depending on
and decreasing with η.
Note that the assumption that ε′(·) is bounded by η is very
natural, since it ensures that the normal vector to the manifold
M does not change too much and is therefore close to the
normal vector a of H, which we try to recover.
Another possibility to define almost ridge functions is to
allow a varying ridge vector. These are almost ridge functions
which we refer to as almost ridge functions of type II.
Definition II.4. Let a(·) : R^N → R^N be a smooth function
which has norm one and is close to a constant vector, i.e. there
exists a vector a ∈ R^N, ‖a‖_2 = 1, with ‖a(x) − a‖_2 ≤ ε and
‖a(x)‖_2 = 1 for all x. We then call a function of the form
f(x) = g(⟨x, a(x)⟩)
an almost ridge function of type II.
The algorithm of Cohen et al. again applies immediately
to almost ridge functions of type II. For the algorithm
introduced in [4] we can prove the following result on the
approximation of almost ridge functions of type II.
Theorem II.5. Let f be an almost ridge function given by f(x) =
g(⟨x, a(x)⟩) with g ∈ C^2([0, 1]) and a := a(0) such that ‖a −
a(x)‖_2 ≤ min{ε, ε/‖x‖_2}. Then by the method introduced in
[4] we can approximate the ridge vector a(x) by â with an
error of at most
‖a(x) − â‖_2 ≤ ε + ( √N (1 + ε)ϑ + |g′(0)| ε ) / ( [|g′(0)| − c1 h(1 + ε)] (1 − ε) ),
where
ϑ = c1 h(1 + ε) + 2 c1 h [1 + ε + 2 c1 h(1 + ε|g′(0)|)]^{1/2},
for all x in R^N and c1 the Lipschitz constant of g′.
Fig. 1: Generalized ridge function of the form f(x) := g(dist(x, L)^2), where L is some one-dimensional affine subspace of R^N. The figure shows two different sets of constant function values for the function f(x_1, x_2, x_3) = x_2^2 + x_3^2 = dist(x, L)^2, where L := span{(1, 0, 0)} (blue line).
III. GENERALIZED RIDGE FUNCTIONS
The disadvantage of the method described before is that
the approximation error cannot fall below inf_{f̂∈R(s)} ‖f − f̂‖,
where R(s) denotes the space of usual ridge functions. Thus,
it would be more convenient to approximate a function of the
form (2) by an estimator of the same form. Moreover, it would
be favorable to waive the assumption that the manifold M is
close to some hyperplane. We initially analyze generalized
ridge functions of the form
f(x) = g(dist(x, L)^2),    (4)
where L is some d-dimensional affine subspace of R^N. We will
exploit the fact that we can estimate the tangent plane at some
x0 ∈ R^N of the (N − 1)-dimensional submanifold
{x ∈ R^N : dist(x, L) = dist(x0, L)}
as the unique hyperplane which is perpendicular to the gradient
of f at x0, and the fact that the function f restricted to this tangent
plane is again of the form (4).
We propose the following algorithm:
Algorithm
Initialize: f̃^N := f and T̃^N := R^N.
Repeat: For i = N, . . . , d + 1:
1) For some arbitrarily chosen x̃^i ∈ T̃^i compute
   ∇_h f̃^i(x̃^i) = [ (f̃^i(x̃^i + h e_k) − f̃^i(x̃^i)) / h ]_{k=1}^{N}.
2) Set ũ_i := ∇_h f̃^i(x̃^i) / ‖∇_h f̃^i(x̃^i)‖_2.
3) Define T̃^{i−1} := (span{ũ_i, . . . , ũ_N})^⊥.
4) Let f̃^{i−1} be the restriction of f to T̃^{i−1}.
Result: Set L̃ := T̃^d.
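As an illustration only (not the author's implementation), a direct Python transcription of these steps could look as follows; the choice of the evaluation points x̃^i, the step size h, and the use of an SVD to pass to the orthogonal complement inside T̃^i are our own assumptions.

```python
import numpy as np

def recover_subspace(f, N, d, h=1e-4, rng=np.random.default_rng(0)):
    """Sketch of the scheme above for f(x) = g(dist(x, L)^2), L a d-dim subspace.

    Returns (L_basis, normals): orthonormal bases (as columns) of the
    estimated subspace T-tilde^d and of its orthogonal complement.
    Uses at most (N - d)(N + 1) point evaluations of f.
    """
    B = np.eye(N)                      # orthonormal basis of the current T^i
    normals = []                       # estimated directions u_N, ..., u_{d+1}
    for i in range(N, d, -1):
        x = B @ rng.standard_normal(B.shape[1])        # some point in T^i
        fx = f(x)
        # divided differences of the restriction of f along the basis of T^i
        grad = np.array([(f(x + h * B[:, k]) - fx) / h
                         for k in range(B.shape[1])])
        u = B @ grad                                   # gradient in ambient coordinates
        u /= np.linalg.norm(u)
        normals.append(u)
        # T^{i-1}: orthogonal complement of u inside T^i
        M = B - np.outer(u, u @ B)                     # project basis vectors onto u-perp
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        B = U[:, :B.shape[1] - 1]
    return B, np.column_stack(normals)
```

Here the divided differences are taken along an orthonormal basis of the current subspace T̃^i, which is one possible way of realizing the restriction f̃^i in coordinates.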
Using this algorithm to recover f , we can show the following approximation result.
Theorem III.1. Let f be a generalized ridge function of the
form (4). Assume that the derivative of g ∈ C^s(R), s ∈ (1, 2],
is bounded from below and above by some positive constants c2, c3.
By sampling the function f at (N − d)(N + 1) appropriate points we
can construct an approximation of L by a subspace L̃ ⊂ R^N,
such that the error is bounded by
‖P_L − P_L̃‖_op ≲ (1 + K)^d √d h,
for some arbitrarily small h > 0, where K is some constant
depending on the Hölder constant and bounds of g′.
It then only remains to recover the ridge profile g. We begin
by showing that the described algorithm is well-defined if L is a
linear subspace of R^N, i.e. contains zero. Thus, our first aim is
to show that the system formed by the gradients of
the restrictions of f indeed forms a basis of L^⊥. We denote
the gradient of a function f by ∇f .
Theorem III.2. Let f(x) = g(dist(x, L)^2) = g(‖P_P x‖_2^2),
where P = L^⊥. We compute the vectors u_i, i = N, . . . , d + 1,
iteratively in the following way:
Initialize f^N = f and T^N = R^N. For i = N, . . . , d + 1 set
1) u_i := ∇f^i(x^i)/‖∇f^i(x^i)‖_2 for some randomly chosen
point x^i ∈ T^i,
2) T^{i−1} := (span{u_i, . . . , u_N})^⊥,
3) f^{i−1} := f|_{T^{i−1}}, the restriction of f to T^{i−1}.
Then L is given by L = T^d.
Proof: Assume we can write L as L = span{u_1, . . . , u_d},
where {u_1, . . . , u_d, u_{d+1}, . . . , u_N} is an orthonormal basis of
R^N, and set V = [u_1 . . . u_d u_{d+1} . . . u_N]. We begin by
computing the gradient of f and obtain
∇f(x) = 2 g′(‖P_P x‖_2^2) P_P x,
which is obviously perpendicular to L, since P and L are
perpendicular.
Therefore, we can assume that u_N = ∇f(x^N)/‖∇f(x^N)‖_2,
since the choice of V is not unique. Thus, we get the representation T^{N−1} = span{u_1, . . . , u_{N−1}}. Now define f^{N−1} to
be the restriction of f to T^{N−1} and h^{N−1} : R^{N−1} → R by
h^{N−1}(x̂) := f(V(x̂, 0)^T) = f^{N−1}(V(x̂, 0)^T).
Then ∇h^{N−1}(x̂)^T = ∇f(x)^T V̂_{N−1}, where V̂_i results from
deleting the (i + 1)-th up to the N-th column of V and x :=
V(x̂, 0)^T ∈ T^{N−1}. Thus, the gradient of h^{N−1}, considered as a
vector in R^N, is given by
(∇h^{N−1}(x̂), 0)^T = V_{N−1}^T ∇f(x),
where V_i results from substituting the i-th up to the N-th column of V
with the zero vector. And since h^{N−1} is the rotated version
of f^{N−1}, the gradient of f^{N−1} is the rotated version of the
gradient of h^{N−1}, i.e.
∇f^{N−1}(x) = V V_{N−1}^T ∇f(x) = 2 g′(‖P_P x‖_2^2) V V_{N−1}^T P_P x.
An easy computation shows that V V_{N−1}^T P_P x is the projection of P_P x onto T^{N−1}. Thus, it is obvious that
∇f^{N−1}(x) is perpendicular to L. Furthermore, ∇f^{N−1}(x) is
also orthogonal to u_N. Therefore, we set u_{N−1} =
∇f^{N−1}(x^{N−1})/‖∇f^{N−1}(x^{N−1})‖_2 for some x^{N−1} ∈ T^{N−1}
and T^{N−2} := span{u_1, . . . , u_{N−2}}. We repeat this procedure
until we get a basis {u_{d+1}, . . . , u_N} of P, which then gives
us the desired space L = P^⊥.
The previous theorem shows that, if we could compute the
gradient at (N − d) points, we would be able to recover the
space L, respectively its orthogonal complement P, exactly.
However, we are only allowed to sample the function at a
few points. Thus, we can only approximate the gradients by
computing the divided differences
∇_h f(x) = [ (f(x + h e_i) − f(x)) / h ]_{i=1}^{N}.
The described procedure can of course not find the correct
plane P exactly; however, it is able to compute a good approximation
of P, where the approximation error depends on the choice of
h.
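As a quick numerical illustration of this dependence on h (a hypothetical toy example, not taken from the paper), one can compare the divided-difference gradient with the exact gradient ∇f(x) = 2 g′(‖P_P x‖_2^2) P_P x for a simple profile and observe the error shrinking roughly linearly in h:

```python
import numpy as np

def grad_fd(f, x, h):
    """Forward divided-difference gradient, as in the display above."""
    fx = f(x)
    return np.array([(f(x + h * e) - fx) / h for e in np.eye(len(x))])

# toy generalized ridge function: L = span{e_1}, P = L^perp, g(t) = exp(-t)
N = 6
P = np.diag([0.0] + [1.0] * (N - 1))              # projection onto P = L^perp
f = lambda x: np.exp(-np.linalg.norm(P @ x) ** 2)
grad_exact = lambda x: -2.0 * np.exp(-np.linalg.norm(P @ x) ** 2) * (P @ x)

x = np.ones(N)
for h in (1e-1, 1e-2, 1e-3):
    print(h, np.linalg.norm(grad_fd(f, x, h) - grad_exact(x)))  # error ~ O(h)
```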
Proof of Theorem III.1: As mentioned above, the idea is
to approximate the gradients of f = f^N and of f^i for i = d +
1, . . . , N − 1. Since we need N + 1 samples for each gradient
approximation, we need (N − d)(N + 1) samples altogether.
We already know from Theorem III.2 that the subspace P can
be written in terms of the gradients of f and its restrictions.
Hence, we assume P = L^⊥ = span{u_{d+1}, . . . , u_N}, where
the u_i's are given as stated in Theorem III.2.
Lemma III.3. Under the assumptions of Theorem III.1 and
with the choice of the ũ_i's, i = d + 1, . . . , N, as proposed in
the algorithm it holds that
‖ũ_N − u_N‖_2 ≤ 2Ĉ√d h / C^d ≤ 2Ĉ√d h / c2 =: S0,    (5)
where C^j := min{1, min{ ∏_{i∈I} ‖∇f^{N−i}(x^i)‖_2 : I ⊂
{0, . . . , j}}}, Ĉ is some positive constant and x^i = P_{T^i} x̃^i.
We use the approximation of the gradient to approximate
the tangent plane T^{N−1} at x with T̃^{N−1} = ∇_h f(x)^⊥. The
approximation error is then of course given by (5). Further,
we let f^{N−1} and f̃^{N−1} be the restrictions of f to T^{N−1}
and T̃^{N−1}, respectively. In addition, we define the functions
h^{N−1} := f(V(x, 0)^T) and its approximation h̃^{N−1} :=
f(Ṽ(x, 0)^T) for x ∈ R^{N−1}, where V and Ṽ are unitary matrices
mapping R^{N−1} ⊂ R^N to T^{N−1} and T̃^{N−1}, respectively. Again we
want to compute, step by step, the column vectors u_i of V,
i = N, . . . , d + 1, as the normalized gradients of f, f^i. But
instead of computing the gradient of f^j, respectively h^j, we can
only approximate it through an approximation of the gradient
of f̃^j, respectively h̃^j. Thus, we set, step by step, the columns
ũ_i, i = N, . . . , d + 1, of Ṽ to the normalized approximated
gradients of f̃^i. The error of the approximation can then be
estimated as follows:
Lemma III.4. With the same assumptions and choices as in
Theorem III.1 and Lemma III.3 it holds that
S1 := ‖ũ_{N−1} − u_{N−1}‖_2 ≤ S0 + 2[c2 + C̃] S0,
where C̃ := 2‖g‖_{C^s} + c3.
We first have to prove the following lemma:
Lemma III.5. With the same assumptions and choices as
before, let x := P_{T^{N−1}} x̃, where x̃ ∈ T̃^{N−1}; then
‖P_P x − P_P x̃‖_2 ≤ ‖u_N − ũ_N‖_2.
We can show that similar estimates as in Lemma III.4
hold for ‖u_i − ũ_i‖_2, i = d + 1, . . . , N − 2. First, it follows
similarly to Lemma III.5:
Lemma III.6. With the same assumptions and choices as
before it holds for x̃^i ∈ T̃^i and x^i = P_{T^i} x̃^i that
‖P_P x^i − P_P x̃^i‖_2^2 ≤ Σ_{j=i+1}^{N} ‖ũ_j − u_j‖_2^2.
This inequality in turn yields the estimate we wished for:
Lemma III.7. With the same assumptions and choices as
before and with K := 2[C̃ + 1 + c2] it holds that
‖u_{N−i} − ũ_{N−i}‖_2 ≤ S0 + K Σ_{j=0}^{i−1} S_j =: S_i,
and therefore S_d ≤ (1 + K)^d S0. It further holds for every
constant K ≥ 1 that
Σ_{i=0}^{d−1} (1 + K)^i ≤ (1 + K)^d.
Putting the conclusions of the previous lemmas together
finishes the proof of Theorem III.1.
The estimation of g itself is again very simple: computing
the gradient gives the direction in which f changes, and in
this direction f becomes a one-dimensional function. Hence,
we can estimate g with well-known numerical methods. Indeed,
we have already seen that the gradient of f at some point x is
given by ∇f(x) = 2 g′(‖P_P x‖_2^2) P_P x, i.e. the normalized
direction is a := P_P x/‖P_P x‖_2. Setting x_t := t·a yields
f(x_t) = g(‖P_P x_t‖_2^2) = g(t^2 ‖P_P x‖_2^2 / ‖P_P x‖_2^2) = g(t^2).
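A minimal Python sketch of this step (assuming L is a linear subspace and that a unit vector a_hat ∈ P has already been estimated, e.g. by the algorithm above; the interpolation method and parameter names are our own choices): sampling f along t ↦ t·a_hat yields the values g(t^2), from which g can be tabulated and interpolated.

```python
import numpy as np

def estimate_profile(f, a_hat, t_max=2.0, l=50):
    """Recover g on [0, t_max^2] from the samples f(t * a_hat) = g(t^2).

    Returns a callable estimate of g built by piecewise-linear interpolation;
    any standard one-dimensional method could be used instead.
    """
    t = np.linspace(0.0, t_max, l)
    values = np.array([f(ti * a_hat) for ti in t])    # equals g(t_i^2)
    return lambda u: np.interp(u, t ** 2, values)
```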
We only have to work further if we do not know in advance
whether L is a linear subspace or an affine subspace. In this case, we
can only approximate g up to a translation with the method
described before.
However, similar to the algorithm in [4], this algorithm
uses a fixed number of samples and the estimate cannot
be improved by taking more samples. We therefore aim for
an algorithm which yields a true convergence result. For this
purpose we again rewrite the distance of a point x ∈ R^N to a
subspace L as dist(x, L)^2 = ‖P_{L^⊥} x‖_2^2. A possibility is now
to introduce an optimization problem over a Grassmannian
manifold G(N − d, N). To this end we set, for a randomly chosen
direction η ∈ S^{N−1} and all H ∈ G(N − d, N) such that
η ∉ H^⊥,
g_H(‖P_H x_t‖_2^2) := g_{H,η}(‖P_H x_t‖_2^2) := f(x_t),
and we will minimize the objective function
G(N − d, N) ∋ H ↦ F̂^l(H) = Σ_{i=1}^{n} ( f(x_i) − ĝ^l_H(‖P_H x_i‖_2^2) )^2,
for some appropriately chosen points x_1, . . . , x_n and ĝ^l_H being
the approximation of the one-dimensional function g_H, which
can be computed by sampling g_H at l equally spaced points
along the line given by η. Note that η is almost surely not
in L and that in this case g_L = g holds true. By choosing the
x_i, i = 1, . . . , n, carefully we can ensure that L is indeed the
unique minimizer of the objective function
G(N − d, N) ∋ H ↦ F(H) = Σ_{i=1}^{n} ( f(x_i) − g_H(‖P_H x_i‖_2^2) )^2.
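To make the construction concrete, a hypothetical Python sketch of evaluating the empirical objective F̂^l for a single candidate H (represented by an orthonormal basis of the (N − d)-dimensional subspace; the piecewise-linear interpolation and all names are our own assumptions, not the author's implementation) could look as follows.

```python
import numpy as np

def empirical_objective(f, B_H, eta, xs, l=50, t_max=2.0):
    """Evaluate F-hat^l(H) for a candidate H = range(B_H) in G(N-d, N).

    B_H : (N, N-d) array with orthonormal columns spanning H,
    eta : a direction with eta not in H^perp,
    xs  : iterable of sample points x_1, ..., x_n.
    g_H is learned from l equally spaced samples of f along the line t * eta
    and extended by piecewise-linear interpolation.
    """
    P_H = B_H @ B_H.T                                  # orthogonal projection onto H
    t = np.linspace(0.0, t_max, l)
    s = np.array([np.linalg.norm(P_H @ (ti * eta)) ** 2 for ti in t])
    v = np.array([f(ti * eta) for ti in t])            # values of g_H at abscissae s
    g_hat = lambda u: np.interp(u, s, v)               # s is increasing in t
    return float(sum((f(x) - g_hat(np.linalg.norm(P_H @ x) ** 2)) ** 2 for x in xs))
```

Minimizing this quantity over G(N − d, N) would then require an optimization routine on the Grassmannian, which is beyond this sketch.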
We will give results on how many points x_i, i = 1, . . . , n,
are needed and how to choose them. We will further show that
F̂^l almost surely converges to F as l → ∞ and that therefore
the solution L̃ of the minimization problem
L̃ := argmin_{H∈G(N−d,N)} F̂^l(H)
gives a suitable approximation to L. Namely, we can show the
following result:
Theorem III.8. Suppose that the derivative of g is bounded
from below by some positive constant and that d = 1 or d =
N − 1. Let L̃ := argmin_{H∈G(N−d,N)} F̂^l(H); then
‖P_L̃ − P_L‖_op ≲ √ε_l = l^{−1/2}.
IV. OUTLOOK
The algorithm we have introduced to recover generalized
ridge functions of type (4) is based on the approximation of
the gradient of f at several points. We also aim to use gradient
approximations to capture generalized ridge functions of the
type (2). In particular we want to use the gradients to compute
samples from the manifold. We then aim to apply the methods
of [5] to estimate the manifold M .
REFERENCES
[1] R. DeVore, G. Petrova, and P. Wojtaszczyk, “Approximation of functions of few variables in high dimensions,” Constructive Approximation,
vol. 33, no. 1, pp. 125–143, 2011.
[2] A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, and D. Picard,
“Capturing ridge functions in high dimensions from point queries,”
Constr. Approx., vol. 35, pp. 225–243, 2012.
[3] M. Fornasier, K. Schnass, and J. Vybiral, “Learning functions of few
arbitrary linear parameters in high dimensions,” Found. Comput. Math.,
vol. 12, pp. 229–262, 2012.
[4] A. Kolleck and J. Vybiral, “On some aspects of approximation of ridge
functions,” 2014. [Online]. Available: http://arxiv.org/pdf/1406.1747.pdf
[5] W. K. Allard, G. Chen, and M. Maggioni, “Multi-scale geometric methods for data sets II: Geometric multi-resolution analysis,” Applied and
Computational Harmonic Analysis, vol. 32, no. 3, pp. 435–462, 2012.