Hindawi Publishing Corporation
Abstract and Applied Analysis
Volume 2012, Article ID 619138, 9 pages
doi:10.1155/2012/619138
Research Article
On the Convergence Rate of Kernel-Based
Sequential Greedy Regression
Xiaoyin Wang,¹ Xiaoyan Wei,² and Zhibin Pan¹

¹ College of Sciences, Huazhong Agricultural University, Wuhan 430070, China
² Department of Statistics and Applied Mathematics, Hubei University of Economics, Wuhan 430205, China
Correspondence should be addressed to Zhibin Pan, zhibinpan2008@gmail.com
Received 13 October 2012; Accepted 27 November 2012
Academic Editor: Jean M. Combes
Copyright © 2012 Xiaoyin Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
A kernel-based greedy algorithm is presented to realize efficient sparse learning with data-dependent basis functions. An upper bound on the generalization error is obtained from a complexity measure of the hypothesis space based on covering numbers. A careful analysis shows that the error has a satisfactory decay rate under mild conditions.
1. Introduction
Kernel methods have been extensively utilized in various learning tasks, and their generalization performance has been investigated from the viewpoint of approximation theory [1, 2]. Among these methods, a family of them can be considered as a coefficient-based regularized framework in data-dependent hypothesis spaces; see, for example, [3–8]. For given samples $\{(x_i, y_i)\}_{i=1}^{n}$, the solution of these kernel methods has the form $\sum_{i=1}^{n} \alpha_i K(x_i, \cdot)$, where $\alpha_i \in \mathbb{R}$ and $K$ is a Mercer kernel. The aim of these coefficient-based algorithms is to search for a set of coefficients $\{\alpha_i\}$ with good predictive performance.
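As a minimal illustration of this representation (not part of the original paper), the following sketch evaluates a kernel expansion $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$; the Gaussian kernel and all data/parameter values are assumptions made only for the example.

```python
import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    """Gaussian (RBF) Mercer kernel; the kernel choice is illustrative only."""
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def kernel_expansion(x, X_train, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x) for a coefficient-based learner."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, X_train))

# Toy usage with assumed data: 5 training points in R^2 and arbitrary coefficients.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))
alpha = np.array([0.2, -0.1, 0.3, 0.0, 0.4])
print(kernel_expansion(np.zeros(2), X_train, alpha))
```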
Inspired by the greedy approximation methods in [9–12], we propose a sparse greedy algorithm for regression. Greedy approximation has two advantages over regularization methods: one is that sparsity is directly controlled by the greedy approximation algorithm rather than by a regularization parameter; the other is that greedy approximation does not change the objective function, while regularized methods usually modify the objective function by including a sparsity-inducing regularization term [13].
Before introducing the greedy algorithm, we recall some preliminary background on regression. Let the input space $X \subset \mathbb{R}^d$ be a compact subset and let $Y = [-M, M]$ for some constant $M > 0$. In the regression model, the learner gets a sample set $z = \{(x_i, y_i)\}_{i=1}^{n}$, where $(x_i, y_i) \in Z := X \times Y$, $1 \le i \le n$, are randomly and independently drawn from an unknown distribution $\rho$ on $Z$. The goal of learning is to pick a function $f : X \to Y$ with the expected error

$$\mathcal{E}(f) = \int_Z \big(f(x) - y\big)^2 \, d\rho \qquad (1.1)$$
as small as possible. Note that the regression function

$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X, \qquad (1.2)$$

is the minimizer of $\mathcal{E}(f)$, where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$.
The empirical error is defined as

$$\mathcal{E}_z(f) = \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2. \qquad (1.3)$$
We call a symmetric and positive semidefinite continuous function $K : X \times X \to \mathbb{R}$ a Mercer kernel. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ defined by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. For all $x \in X$ and $f \in \mathcal{H}_K$, the reproducing property is given by $\langle K_x, f \rangle_K = f(x)$. We can see that $\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty$ because of the continuity of $K$ and the compactness of $X$.
Different from the coefficient-based regularized methods [3–6], in this paper we use the idea of sequential greedy approximation to realize sparse learning. Denote $\widehat{H} = \{\hat h_i\}_{i=1}^{2n}$, where $\hat h_{2i-1} = K_{x_i}$ and $\hat h_{2i} = -K_{x_i}$. The hypothesis space depending on $z$ is defined as

$$\mathrm{CO}_{2n}\big(\widehat{H}\big) = \bigg\{ f : f(x) = \sum_{i=1}^{2n} \alpha_i \hat h_i(x), \ \alpha_i \ge 0, \ \sum_{i=1}^{2n} \alpha_i \le 1 \bigg\}. \qquad (1.4)$$
For any hypothesis function space $G$, we denote $\beta G = \{f : f = \beta g, \ g \in G\}$.

The definition of $f_\rho$ tells us that $|f_\rho(x)| \le M$, so it is natural to restrict the approximating functions to $[-M, M]$. The projection operator has been used in the error analysis of learning algorithms; see, for example, [2, 14].
Definition 1.1. The projection operator $\pi = \pi_M$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as

$$\pi(f)(x) = \begin{cases} M, & \text{if } f(x) > M; \\ -M, & \text{if } f(x) < -M; \\ f(x), & \text{otherwise}. \end{cases} \qquad (1.5)$$
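Numerically, the projection $\pi_M$ simply clips predicted values to $[-M, M]$; a one-line sketch (illustrative, not from the paper) is:

```python
import numpy as np

def project(f_values, M):
    """Projection operator pi_M of (1.5): clip predicted values to [-M, M]."""
    return np.clip(f_values, -M, M)

print(project(np.array([-3.2, 0.4, 5.0]), M=1.0))  # -> [-1.   0.4  1. ]
```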
The kernel-based greedy algorithm can be summarized as follows. Let $t$ be a stopping time and let $\beta$ be a positive constant. Set $f_\beta^0 = 0$, and then for $\tau = 1, 2, \ldots, t$, define

$$\big(\hat h_\tau, \hat\alpha_\tau, \hat\beta_\tau\big) = \operatorname*{arg\,min}_{\hat h \in \widehat{H}, \, 0 \le \alpha \le 1, \, 0 \le \hat\beta \le \beta} \mathcal{E}_z\Big((1 - \alpha) f_\beta^{\tau-1} + \alpha \hat\beta \hat h\Big), \qquad (1.6)$$

$$f_\beta^\tau = \big(1 - \hat\alpha_\tau\big) f_\beta^{\tau-1} + \hat\alpha_\tau \hat\beta_\tau \hat h_\tau.$$
Different from the regularized algorithms in [6, 12, 14–18], the above learning algorithm tries to realize efficient learning by greedy approximation. The study of its generalization performance can enrich the learning theory of kernel-based regression. In the remainder of this paper, we focus on establishing the convergence rate of $\pi(f_\beta^t)$ to the regression function $f_\rho$ under a suitable choice of parameters. The theoretical result depends on weaker conditions than the previous error analysis for the kernel-based regularization framework in [4, 5].
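To make the iteration (1.6) concrete, here is a minimal numerical sketch (not part of the original paper). It replaces the exact minimization over $\alpha \in [0, 1]$ and $\hat\beta \in [0, \beta]$ by a coarse grid search, and the Gaussian kernel, the toy data, and all parameter values are assumptions made only for illustration.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """Gaussian kernel Gram matrix G[i, j] = K(x_i, x_j); the kernel is an assumed example."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def sequential_greedy(X, y, beta=5.0, t=30, n_grid=21, sigma=1.0):
    """Sketch of (1.6): at each step choose a signed basis h = +/- K_{x_i}, alpha in [0, 1]
    and beta_hat in [0, beta] (on a grid) minimizing the empirical error of
    (1 - alpha) * f + alpha * beta_hat * h, then update the predictions of f."""
    n = len(y)
    G = gram_matrix(X, sigma)
    F = np.zeros(n)                                   # predictions of f_beta^0 = 0 on the sample
    A, B = np.meshgrid(np.linspace(0, 1, n_grid), np.linspace(0, beta, n_grid), indexing="ij")
    A, B = A.ravel()[:, None], B.ravel()[:, None]     # all (alpha, beta_hat) grid pairs
    for _ in range(t):
        best_err, best_F = np.mean((F - y) ** 2), F
        for i in range(n):
            for sign in (1.0, -1.0):
                h = sign * G[i]                       # candidate basis values on the sample
                cand = (1 - A) * F + A * B * h        # all grid updates at once
                errs = np.mean((cand - y) ** 2, axis=1)
                j = int(np.argmin(errs))
                if errs[j] < best_err:
                    best_err, best_F = errs[j], cand[j]
        F = best_F
    return F

# Toy usage on assumed one-dimensional data with M = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40), -1, 1)
F = sequential_greedy(X, y)
print(np.mean((F - y) ** 2))   # empirical error E_z(f_beta^t) after t greedy steps
```

In practice the inner minimization over $\hat\beta$ for a fixed candidate $\hat h$ and $\alpha$ can be solved in closed form for the squared loss; the grid search above is used only for readability.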
2. Main Result
Define a data-free basis function set

$$H = \big\{ h_i : h_{2i-1} = K_{u_i}, \ h_{2i} = -K_{u_i}, \ u_i \in X, \ i = 1, 2, \ldots \big\},$$
$$\mathrm{CO}(H) = \bigg\{ f : f(x) = \sum_{i=1}^{\infty} \alpha_i h_i(x), \ \alpha_i \ge 0, \ \sum_{i=1}^{\infty} \alpha_i \le 1 \bigg\}. \qquad (2.1)$$

To investigate the approximation of $\pi(f_\beta^t)$ to $f_\rho$, we introduce a data-independent function

$$f_\beta^* = \operatorname*{arg\,min}_{f \in \beta \mathrm{CO}(H)} \mathcal{E}(f). \qquad (2.2)$$
Observe that

$$\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) \le \Big[ \mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}_z\big(\pi\big(f_\beta^t\big)\big) + \mathcal{E}_z\big(f_\beta^*\big) - \mathcal{E}\big(f_\beta^*\big) \Big] + \Big[ \mathcal{E}_z\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}_z\big(f_\beta^*\big) \Big] + \Big[ \mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho) \Big]. \qquad (2.3)$$

Here, the three terms on the right-hand side are called the sample error, the hypothesis error, and the approximation error, respectively.

To estimate the sample error, we usually need a complexity measure of the hypothesis function space $\mathcal{H}_K$. For this reason, we introduce some definitions of covering numbers to measure the complexity.
Definition 2.1. Let $(\mathcal{U}, d)$ be a pseudometric space and let $S \subset \mathcal{U}$ be a subset. For every $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon, d)$ of $S$ with respect to $\varepsilon$ and $d$ is defined as the minimal number of balls of radius $\varepsilon$ whose union covers $S$, that is,

$$\mathcal{N}(S, \varepsilon, d) = \min\bigg\{ l \in \mathbb{N} : S \subset \bigcup_{j=1}^{l} B(s_j, \varepsilon) \text{ for some } \{s_j\}_{j=1}^{l} \subset \mathcal{U} \bigg\}, \qquad (2.4)$$

where $B(s_j, \varepsilon) = \{ s \in \mathcal{U} : d(s, s_j) \le \varepsilon \}$ is a ball in $\mathcal{U}$.
The empirical covering number with the $\ell^2$ metric is defined as below.

Definition 2.2. Let $\mathcal{F}$ be a set of functions on $X$, $\mathbf{u} = (u_i)_{i=1}^{k}$, and $\mathcal{F}|_{\mathbf{u}} = \{ (f(u_i))_{i=1}^{k} : f \in \mathcal{F} \} \subset \mathbb{R}^k$. Set $\mathcal{N}_{2, \mathbf{u}}(\mathcal{F}, \varepsilon) = \mathcal{N}(\mathcal{F}|_{\mathbf{u}}, \varepsilon, d_2)$. The $\ell^2$ empirical covering number of $\mathcal{F}$ is defined by

$$\mathcal{N}_2(\mathcal{F}, \varepsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{u} \in X^k} \mathcal{N}_{2, \mathbf{u}}(\mathcal{F}, \varepsilon), \qquad \varepsilon > 0, \qquad (2.5)$$

where the $\ell^2$ metric is

$$d_2(\mathbf{a}, \mathbf{b}) = \bigg( \frac{1}{k} \sum_{i=1}^{k} |a_i - b_i|^2 \bigg)^{1/2}, \qquad \forall \, \mathbf{a} = (a_i)_{i=1}^{k} \in \mathbb{R}^k, \ \mathbf{b} = (b_i)_{i=1}^{k} \in \mathbb{R}^k. \qquad (2.6)$$
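As a small illustration (not from the paper), the sketch below computes the empirical $\ell^2$ metric (2.6) and upper bounds an empirical covering number for a finite toy class by building a greedy $\varepsilon$-net; the function class and all numbers are assumptions made for the example.

```python
import numpy as np

def d2(a, b):
    """Empirical l2 metric (2.6): root of the mean squared coordinate difference."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_covering_size(value_vectors, eps):
    """Size of a greedy eps-net of the value vectors F|_u; this upper bounds N_{2,u}(F, eps)."""
    centers = []
    for v in value_vectors:
        if all(d2(v, c) > eps for c in centers):
            centers.append(v)
    return len(centers)

# Toy class {f_c(x) = c * x : c on a grid} evaluated at k sample points u (assumed setup).
u = np.linspace(0.0, 1.0, 20)
values = np.array([c * u for c in np.linspace(-1.0, 1.0, 101)])
print(greedy_covering_size(values, eps=0.05))
```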
Denote by $B_r$ the ball of radius $r > 0$ in $\mathcal{H}_K$, that is, $B_r = \{ f \in \mathcal{H}_K : \|f\|_{\mathcal{H}_K} \le r \}$. We need the following capacity assumption on $\mathcal{H}_K$, which has been used in [5, 6, 18].

Assumption 2.3. There exist an exponent $p \in (0, 2)$ and a constant $c_{p,K} > 0$ such that

$$\log \mathcal{N}_2(B_1, \varepsilon) \le c_{p,K} \, \varepsilon^{-p}. \qquad (2.7)$$
We now formulate the generalization error bound for $\pi(f_\beta^t)$. The result follows from Propositions 3.2–3.5 in the next section.

Theorem 2.4. Under Assumption 2.3, for any $0 < \delta < 1$, the following inequality holds with confidence $1 - \delta$:

$$\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) \le 4\Big(\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big) + \frac{32\beta^2}{t} + \frac{4\big(3M + \kappa^2\beta\big)^2}{n} \log\frac{2}{\delta} + 1280 M^2 \big(c_{p,K}(4M\kappa\beta)^p\big)^{2/(2+p)} \log\frac{2}{\delta} \, n^{-2/(2+p)}. \qquad (2.8)$$
From this result, we know that there exists a constant $C$ independent of $n$, $t$, and $\delta$ such that, with confidence $1 - \delta$,

$$\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) \le 4\Big(\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big) + C \max\bigg\{ \frac{\beta^2}{t}, \ \frac{\beta^2}{n}, \ \Big(\frac{\beta^p}{n}\Big)^{2/(2+p)} \bigg\} \log\frac{2}{\delta}. \qquad (2.9)$$
In particular, if $f_\rho \in \beta \mathrm{CO}(H)$ for some fixed constant $\beta$ and $t \ge n$, we have $\mathcal{E}(\pi(f_\beta^t)) - \mathcal{E}(f_\rho) \to 0$ with decay rate $O(n^{-2/(2+p)})$. The learning rate is satisfactory as $p \to 0$.
Here, the estimate of the hypothesis error is simple and does not need the strict conditions on $\rho$ and $X$ required in [3–5] for learning with data-dependent hypothesis spaces.

If some additional conditions on the approximation error as $\beta$ increases are imposed, we can obtain explicit learning rates with a suitable parameter selection.
Corollary 2.5. Assume that the RKHS $\mathcal{H}_K$ satisfies (2.7) and that $\mathcal{E}(f_\beta^*) - \mathcal{E}(f_\rho) \le c_\gamma \beta^{-\gamma}$ for some $\gamma > 0$. Choose $\beta = n^{p/(4(2+p))}$. For any $0 < \delta < 1$ and $t = n$, one has

$$\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) \le C \, n^{-\min\{(4+p)/(4+2p), \, p\gamma/(8+4p)\}} \log\frac{2}{\delta} \qquad (2.10)$$

with confidence $1 - \delta$. Here $C$ is a constant independent of $n$ and $\delta$.
Observe that the learning rate depends closely on the approximation condition between $f_\rho$ and $f_\beta^*$. This means that only when the target function can be well described by functions from the hypothesis space can the learning algorithm achieve good generalization performance. In fact, similar approximation assumptions are extensively studied for error analysis in learning theory; see, for example, [1, 2, 4, 17].
From Corollary 2.5, when the kernel $K \in C^\infty$, the exponent $p > 0$ can be arbitrarily small, and one can easily see that the learning rate is then quite low. Future research may improve the estimate by introducing some new analysis techniques.
3. Proof of Theorem 2.4
In this section, we provide the proof of Theorem 2.4 based on upper bound estimates of the sample error and the hypothesis error. Denote

$$S_1 = \Big[\mathcal{E}_z\big(f_\beta^*\big) - \mathcal{E}_z(f_\rho)\Big] - \Big[\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big], \qquad S_2 = \Big[\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho)\Big] - \Big[\mathcal{E}_z\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}_z(f_\rho)\Big]. \qquad (3.1)$$

We can observe that the sample error satisfies

$$\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}_z\big(\pi\big(f_\beta^t\big)\big) + \mathcal{E}_z\big(f_\beta^*\big) - \mathcal{E}\big(f_\beta^*\big) = S_1 + S_2. \qquad (3.2)$$
Here $S_1$ can be bounded by applying the following one-sided Bernstein-type probability inequality; see, for example, [1, 2, 14].

Lemma 3.1. Let $\xi$ be a random variable on a probability space $Z$ with mean $E\xi$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi(z) - E\xi| \le B$ for almost all $z \in Z$, then for all $\varepsilon > 0$,

$$\mathrm{Prob}_{z \in Z^n} \bigg\{ \frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - E\xi \ge \varepsilon \bigg\} \le \exp\bigg( -\frac{n \varepsilon^2}{2\big(\sigma^2 + B\varepsilon/3\big)} \bigg). \qquad (3.3)$$
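As an optional numerical sanity check (not part of the paper), the following sketch compares the empirical tail probability of Bernoulli sample averages with the bound (3.3); the distribution and all parameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.05, 20000
p_success = 0.3                                        # assumed toy distribution: Bernoulli(0.3)
samples = rng.binomial(1, p_success, size=(trials, n)).astype(float)
mean, var, B = p_success, p_success * (1 - p_success), 1.0   # |xi - E xi| <= 1

empirical_tail = np.mean(samples.mean(axis=1) - mean >= eps)
bernstein_bound = np.exp(-n * eps**2 / (2 * (var + B * eps / 3)))
print(empirical_tail, bernstein_bound)                 # the empirical tail stays below the bound
```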
Proposition 3.2. For any $0 < \delta < 1$, with confidence $1 - \delta$, one has

$$S_1 \le \frac{2\big(3M + \kappa^2\beta\big)^2}{n} \log\frac{1}{\delta} + \mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho). \qquad (3.4)$$
Proof. Following the definition of $S_1$, we have $S_1 = \frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - E\xi$, where the random variable $\xi(z) = (y - f_\beta^*(x))^2 - (y - f_\rho(x))^2$ for $z = (x, y)$.

From the definition of $f_\beta^*$, we know that $\|f_\beta^*\|_K \le \kappa\beta$ and $\|f_\beta^*\|_\infty \le \kappa \|f_\beta^*\|_K \le \kappa^2\beta$. Then

$$|\xi(z)| = \big| f_\beta^*(x) - f_\rho(x) \big| \, \big| \big(f_\beta^*(x) - y\big) + \big(f_\rho(x) - y\big) \big| \le \big(3M + \kappa^2\beta\big)^2 =: c_1 \qquad (3.5)$$

and $|\xi - E\xi| \le 2c_1$. Moreover,

$$\sigma^2 \le E\xi^2 = \int_Z \big(f_\beta^*(x) - f_\rho(x)\big)^2 \Big(\big(f_\beta^*(x) - y\big) + \big(f_\rho(x) - y\big)\Big)^2 \, d\rho \le c_1 \Big(\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big). \qquad (3.6)$$

Applying Lemma 3.1 with $B = 2c_1$ and $\sigma^2 \le c_1 \{\mathcal{E}(f_\beta^*) - \mathcal{E}(f_\rho)\}$, we get

$$\frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - E\xi \le \varepsilon \qquad (3.7)$$

with confidence at least $1 - \exp\big\{ -n\varepsilon^2 / \big( 2\big( c_1 (\mathcal{E}(f_\beta^*) - \mathcal{E}(f_\rho)) + \tfrac{2}{3} c_1 \varepsilon \big) \big) \big\}$. By setting $-n\varepsilon^2 / \big( 2\big( c_1 (\mathcal{E}(f_\beta^*) - \mathcal{E}(f_\rho)) + \tfrac{2}{3} c_1 \varepsilon \big) \big) = \log\delta$, we derive the solution

$$\varepsilon^* = \frac{\tfrac{2}{3} c_1 \log(1/\delta) + \sqrt{\big(\tfrac{2}{3} c_1 \log(1/\delta)\big)^2 + 2 c_1 n \log(1/\delta) \big(\mathcal{E}(f_\beta^*) - \mathcal{E}(f_\rho)\big)}}{n} \le \frac{2 c_1}{n} \log\frac{1}{\delta} + \mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho). \qquad (3.8)$$

Thus, with confidence $1 - \delta$, we have

$$\frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - E\xi \le \frac{2 c_1}{n} \log\frac{1}{\delta} + \mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho). \qquad (3.9)$$

This completes the proof.
To establish the uniform upper bound of $S_2$, we introduce a concentration inequality established in [18].
Lemma 3.3. Assume that there are constants $B, c > 0$ and $\alpha \in [0, 1]$ such that $\|f\|_\infty \le B$ and $Ef^2 \le c (Ef)^\alpha$ for every $f \in \mathcal{F}$. If, for some $a > 0$ and $p \in (0, 2)$,

$$\log \mathcal{N}_2(\mathcal{F}, \varepsilon) \le a \, \varepsilon^{-p}, \qquad \forall \varepsilon > 0, \qquad (3.10)$$

then there exists a constant $c_p'$ depending only on $p$ such that for any $t > 0$, with probability at least $1 - e^{-t}$, there holds

$$Ef - \frac{1}{n} \sum_{i=1}^{n} f(z_i) \le \frac{1}{2} \eta^{1-\alpha} (Ef)^\alpha + c_p' \eta + 2 \Big(\frac{c t}{n}\Big)^{1/(2-\alpha)} + \frac{18 B t}{n}, \qquad \forall f \in \mathcal{F}, \qquad (3.11)$$

where

$$\eta := \max\bigg\{ c^{(2-p)/(4 - 2\alpha + p\alpha)} \Big(\frac{a}{n}\Big)^{2/(4 - 2\alpha + p\alpha)}, \ B^{(2-p)/(2+p)} \Big(\frac{a}{n}\Big)^{2/(2+p)} \bigg\}. \qquad (3.12)$$
Proposition 3.4. Under Assumption 2.3, for any $0 < \delta < 1$, one has with confidence at least $1 - \delta$

$$S_2 \le \frac{1}{2} \Big( \mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) \Big) + 640 M^2 \big( c_{p,K} (4M\kappa\beta)^p \big)^{2/(2+p)} \log\frac{1}{\delta} \, n^{-2/(2+p)}. \qquad (3.13)$$
Proof. From the definition of $f_\beta^t$, we have $\|f_\beta^t\|_K \le \kappa\beta$. Denote

$$\mathcal{F}_{\kappa\beta} = \Big\{ g(z) = \big(y - \pi(f)(x)\big)^2 - \big(y - f_\rho(x)\big)^2 : f \in B_{\kappa\beta} \Big\}. \qquad (3.14)$$

We can see that $Eg = \mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho)$ and $\frac{1}{n} \sum_{i=1}^{n} g(z_i) = \mathcal{E}_z(\pi(f)) - \mathcal{E}_z(f_\rho)$. Since $\|\pi(f)\|_\infty \le M$ and $|f_\rho(x)| \le M$, we have

$$|g(z)| = \big| \pi(f)(x) - f_\rho(x) \big| \, \big| \big(\pi(f)(x) - y\big) + \big(f_\rho(x) - y\big) \big| \le 8 M^2,$$
$$Eg^2 = \int_Z \big(\pi(f)(x) - f_\rho(x)\big)^2 \Big( \big(\pi(f)(x) - y\big) + \big(f_\rho(x) - y\big) \Big)^2 \, d\rho \le 16 M^2 \, Eg. \qquad (3.15)$$
For $g_1, g_2 \in \mathcal{F}_{\kappa\beta}$, we have

$$\big| g_1(z) - g_2(z) \big| = \Big| \big(y - \pi(f_1)(x)\big)^2 - \big(y - \pi(f_2)(x)\big)^2 \Big| \le 4M \big| \pi(f_1)(x) - \pi(f_2)(x) \big| \le 4M \big| f_1(x) - f_2(x) \big|. \qquad (3.16)$$

Then, from Assumption 2.3,

$$\log \mathcal{N}_{2,z}\big(\mathcal{F}_{\kappa\beta}, \varepsilon\big) \le \log \mathcal{N}_{2,x}\Big(B_{\kappa\beta}, \frac{\varepsilon}{4M}\Big) \le \log \mathcal{N}_{2,x}\Big(B_1, \frac{\varepsilon}{4M\kappa\beta}\Big) \le c_{p,K} (4M\kappa\beta)^p \, \varepsilon^{-p}. \qquad (3.17)$$
Applying Lemma 3.3 with $B = c = 16M^2$, $\alpha = 1$, and $a = c_{p,K}(4M\kappa\beta)^p$, and taking $t = \log(1/\delta)$, for any $\delta \in (0, 1)$ and for all $g \in \mathcal{F}_{\kappa\beta}$,

$$Eg - \frac{1}{n} \sum_{i=1}^{n} g(z_i) \le \frac{1}{2} Eg + c_p' \big(16M^2\big)^{(2-p)/(2+p)} \bigg( \frac{c_{p,K}(4M\kappa\beta)^p}{n} \bigg)^{2/(2+p)} + 320 M^2 \frac{\log(1/\delta)}{n} \le \frac{1}{2} Eg + 640 M^2 \big( c_{p,K}(4M\kappa\beta)^p \big)^{2/(2+p)} \log\frac{1}{\delta} \, n^{-2/(2+p)} \qquad (3.18)$$

holds with confidence $1 - \delta$. This completes the proof.
Different from the previous studies related to the regularized framework [3–5], we introduce the estimate of the hypothesis error $\mathcal{E}_z(f_\beta^t) - \mathcal{E}_z(f_\beta^*)$ based on Theorem 4.2 in [11] for sequential greedy approximation.

Proposition 3.5. For a fixed sample $z$, one has

$$\mathcal{E}_z\big(f_\beta^t\big) - \mathcal{E}_z\big(f_\beta^*\big) \le \frac{16 \beta^2}{t}. \qquad (3.19)$$

The desired result in Theorem 2.4 can be derived directly by combining Propositions 3.2–3.5.
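For the reader's convenience, here is a sketch of how the propositions combine (this bookkeeping is implicit in the paper); it uses the observation that $\mathcal{E}_z(\pi(f_\beta^t)) \le \mathcal{E}_z(f_\beta^t)$ since $|y| \le M$, and applies Propositions 3.2 and 3.4 each with confidence $1 - \delta/2$:

$$\begin{aligned}
\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho) &= S_2 + \Big[\mathcal{E}_z\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}_z\big(f_\beta^*\big)\Big] + S_1 + \Big[\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big] \\
&\le \frac{1}{2}\Big(\mathcal{E}\big(\pi\big(f_\beta^t\big)\big) - \mathcal{E}(f_\rho)\Big) + \frac{16\beta^2}{t} + \frac{2\big(3M+\kappa^2\beta\big)^2}{n}\log\frac{2}{\delta} \\
&\quad + 640 M^2 \big(c_{p,K}(4M\kappa\beta)^p\big)^{2/(2+p)} \log\frac{2}{\delta}\, n^{-2/(2+p)} + 2\Big(\mathcal{E}\big(f_\beta^*\big) - \mathcal{E}(f_\rho)\Big).
\end{aligned}$$

Moving the term $\frac{1}{2}(\mathcal{E}(\pi(f_\beta^t)) - \mathcal{E}(f_\rho))$ to the left-hand side and multiplying by 2 yields (2.8).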
Acknowledgments
This work was supported partially by the National Natural Science Foundation of China under Grant no. 11001092, the Humanities and Social Science Projects of the Ministry of Education of China (Program no. 11y3jc630197), and the Fundamental Research Funds for the Central Universities (Program nos. 2011PY130 and 2011QC022).
References
[1] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[2] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, UK, 2007.
[3] Q. Wu and D. X. Zhou, "Learning with sample dependent hypothesis spaces," Computers & Mathematics with Applications, vol. 56, no. 11, pp. 2896–2907, 2008.
[4] Q. W. Xiao and D. X. Zhou, "Learning by nonsymmetric kernels with data dependent spaces and ℓ^1-regularizer," Taiwanese Journal of Mathematics, vol. 14, no. 5, pp. 1821–1836, 2010.
[5] L. Shi, Y. L. Feng, and D. X. Zhou, "Concentration estimates for learning with ℓ^1-regularizer and data dependent hypothesis spaces," Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 286–302, 2011.
[6] Y. L. Feng and S. G. Lv, "Unified approach to coefficient-based regularized regression," Computers & Mathematics with Applications, vol. 62, no. 1, pp. 506–515, 2011.
[7] S. G. Lv and J. D. Zhu, "Error bounds for ℓ^p-norm multiple kernel learning with least square loss," Abstract and Applied Analysis, vol. 2012, Article ID 915920, 18 pages, 2012.
[8] Y. K. Zhu and H. W. Sun, "Consistency analysis of spectral regularization algorithms," Abstract and Applied Analysis, vol. 2012, Article ID 436510, 16 pages, 2012.
[9] A. R. Barron, A. Cohen, W. Dahmen, and R. A. DeVore, "Approximation and learning by greedy algorithms," The Annals of Statistics, vol. 36, no. 1, pp. 64–94, 2008.
[10] S. Mannor, R. Meir, and T. Zhang, "Greedy algorithms for classification—consistency, convergence rates, and adaptivity," Journal of Machine Learning Research, vol. 4, no. 4, pp. 713–741, 2003.
[11] T. Zhang, "Sequential greedy approximation for certain convex optimization problems," IEEE Transactions on Information Theory, vol. 49, no. 3, pp. 682–691, 2003.
[12] H. Chen, L. Q. Li, and Z. B. Pan, "Learning rates of multi-kernel regression by orthogonal greedy algorithm," Journal of Statistical Planning and Inference, vol. 143, no. 2, pp. 276–282, 2013.
[13] T. Zhang, "Approximation bounds for some sparse kernel regression algorithms," Neural Computation, vol. 14, no. 12, pp. 3013–3042, 2002.
[14] D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou, "Support vector machine soft margin classifiers: error analysis," Journal of Machine Learning Research, vol. 5, pp. 1143–1175, 2004.
[15] H. Chen, L. Q. Li, and J. T. Peng, "Error bounds of multi-graph regularized semi-supervised classification," Information Sciences, vol. 179, no. 12, pp. 1960–1969, 2009.
[16] H. Chen, "On the convergence rate of a regularized ranking algorithm," Journal of Approximation Theory, vol. 164, no. 12, pp. 1513–1519, 2012.
[17] Z. C. Guo and D. X. Zhou, "Concentration estimates for learning with unbounded sampling," Advances in Computational Mathematics. In press.
[18] Q. Wu, Y. Ying, and D. X. Zhou, "Multi-kernel regularized classifiers," Journal of Complexity, vol. 23, no. 1, pp. 108–134, 2007.