
Hindawi Publishing Corporation
Abstract and Applied Analysis
Volume 2013, Article ID 148490, 6 pages
http://dx.doi.org/10.1155/2013/148490
Research Article
ERM Scheme for Quantile Regression
Dao-Hong Xiang
Department of Mathematics, Zhejiang Normal University, Jinhua, Zhejiang 321004, China
Correspondence should be addressed to Dao-Hong Xiang; daohongxiang@gmail.com
Received 30 November 2012; Accepted 21 February 2013
Academic Editor: Ding-Xuan Zhou
Copyright © 2013 Dao-Hong Xiang. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper considers the ERM scheme for quantile regression. We conduct error analysis for this learning algorithm by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The learning rates are derived by applying concentration techniques involving the $\ell^2$-empirical covering numbers.
1. Introduction
In this paper, we study the empirical risk minimization (ERM) scheme for quantile regression. Let $X$ be a compact metric space (input space) and $Y = \mathbb{R}$. Let $\rho$ be a fixed but unknown probability distribution on $Z := X \times Y$ which describes the noise of sampling. Conditional quantile regression aims at producing functions that estimate quantile regression functions. With a prespecified quantile parameter $\tau \in (0, 1)$, a quantile regression function $f_{\tau,\rho}$ is defined by taking its value $f_{\tau,\rho}(x)$ to be a $\tau$-quantile of $\rho(\cdot \mid x)$, that is, a value $t \in Y$ satisfying
$$\rho\left((-\infty, t] \mid x\right) \ge \tau, \qquad \rho\left([t, \infty) \mid x\right) \ge 1 - \tau, \qquad x \in X, \tag{1}$$
where $\rho(\cdot \mid x)$ is the conditional distribution of $\rho$ at $x \in X$.
We consider a learning algorithm generated by the ERM scheme associated with the pinball loss and a hypothesis space $\mathcal{H}$. The pinball loss $L_\tau : \mathbb{R} \to [0, \infty)$ is defined by
$$L_\tau(r) = \begin{cases} (\tau - 1)\, r, & \text{if } r \le 0, \\ \tau r, & \text{if } r > 0. \end{cases} \tag{2}$$
The hypothesis space $\mathcal{H}$ is a compact subset of $C(X)$, so there exists some $M > 0$ such that $\|f\|_{C(X)} \le M$ for any $f \in \mathcal{H}$. We assume without loss of generality that $\|f\|_{C(X)} \le 1$ for any $f \in \mathcal{H}$.
The ERM scheme for quantile regression is defined, for a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ drawn independently from $\rho$, as
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m L_\tau\left(y_i - f(x_i)\right). \tag{3}$$
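As an illustration only (not part of the original paper), the following minimal Python sketch implements scheme (3) for a finite hypothesis set; the data, the candidate functions, and the quantile level used here are hypothetical.

```python
import numpy as np

def pinball_loss(r, tau):
    """Pinball loss (2): (tau - 1) * r for r <= 0 and tau * r for r > 0."""
    return np.where(r <= 0, (tau - 1) * r, tau * r)

def erm_quantile(hypotheses, x, y, tau):
    """ERM scheme (3): return the candidate with minimal empirical pinball risk."""
    risks = np.array([np.mean(pinball_loss(y - f(x), tau)) for f in hypotheses])
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Hypothetical data: y = x + noise on [0, 1]; estimate the 0.75-quantile with
# a small finite hypothesis class of vertically shifted identity maps.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = x + rng.normal(0.0, 0.1, size=200)
hypotheses = [lambda t, c=c: t + c for c in np.linspace(-0.3, 0.3, 61)]
f_z, risk = erm_quantile(hypotheses, x, y, tau=0.75)
print("empirical pinball risk of f_z:", risk)
```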
A family of kernel-based learning algorithms for quantile regression has been widely studied in a large literature [1–4] and the references therein. These algorithms take the form of a regularized scheme in a reproducing kernel Hilbert space $\mathcal{H}_K$ (RKHS; see [5] for details) associated with a Mercer kernel $K$. Given a sample $\mathbf{z}$, the kernel-based regularized scheme for quantile regression is defined by
$$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m L_\tau\left(y_i - f(x_i)\right) + \lambda \|f\|_K^2 \right\}. \tag{4}$$
In [1, 3, 4], error analysis for general $\mathcal{H}_K$ has been carried out. Learning with varying Gaussian kernels was studied in [2].

The ERM scheme (3) is very different from the kernel-based regularized scheme (4). The output function $f_{\mathbf{z}}$ produced by the ERM scheme has a uniform bound under our assumption, namely $\|f_{\mathbf{z}}\|_{C(X)} \le 1$. However, we cannot expect such a bound for $f_{\mathbf{z},\lambda}$. It is easy to see that $\lambda \|f_{\mathbf{z},\lambda}\|_K^2 \le \sum_{i=1}^m |y_i|$ by choosing $f = 0$ in (4), and it happens often that $\|f_{\mathbf{z},\lambda}\|_K \to \infty$ as $\lambda \to 0$. The lack of a uniform bound for $f_{\mathbf{z},\lambda}$ has a serious negative impact on the learning rates. So in the literature on kernel-based regularized schemes for quantile regression, the values of the output function $f_{\mathbf{z},\lambda}$ are always projected onto the interval $[-1, 1]$, and the error analysis is conducted for the projected function, not for $f_{\mathbf{z},\lambda}$ itself.
In this paper, we aim at establishing convergence and learning rates for the error $\|f_{\mathbf{z}} - f_{\tau,\rho}\|_{L^r_{\rho_X}}$ in the space $L^r_{\rho_X}$. Here $r > 0$ depends on the pair $(\rho, \tau)$ and will be determined in Section 2, and $\rho_X$ is the marginal distribution of $\rho$ on $X$. In the rest of this paper, we assume $Y = [-1, 1]$, which in turn implies that the values of the target function $f_{\tau,\rho}$ lie in the same interval.
2. Noise Condition and Main Results
There is a large literature in learning theory (see [6] and references therein) devoted to least squares regression. It aims at learning the regression function $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$. The identity $\mathcal{E}^{\mathrm{ls}}(f) - \mathcal{E}^{\mathrm{ls}}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$ for the generalization error $\mathcal{E}^{\mathrm{ls}}(f) = \int_Z (y - f(x))^2 \, d\rho$ leads to a variance-expectation bound of the form $\mathbb{E}\xi^2 \le 4\,\mathbb{E}\xi$, where $\xi = (y - f(x))^2 - (y - f_\rho(x))^2$ on $(Z, \rho)$. This bound plays an essential role in the error analysis of kernel-based regularized schemes.

However, this identity and the variance-expectation bound fail in the setting of quantile regression, because the pinball loss lacks strong convexity. If we impose a noise condition on the distribution $\rho$, called a $\tau$-quantile of $p$-average type $q$ (see Definition 1), we can still obtain a similar identity, which in turn enables us to derive a variance-expectation bound, stated below, that was proved by Steinwart and Christmann [1].
Definition 1. Let $p \in (0, \infty]$ and $q \in [1, \infty)$. A distribution $\rho$ on $X \times [-1, 1]$ is said to have a $\tau$-quantile of $p$-average type $q$ if for $\rho_X$-almost every $x \in X$, there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and constants $\alpha_{\rho(\cdot \mid x)} \in (0, 2]$, $b_{\rho(\cdot \mid x)} > 0$ such that for each $s \in [0, \alpha_{\rho(\cdot \mid x)}]$,
$$\rho\left((t^* - s, t^*) \mid x\right) \ge b_{\rho(\cdot \mid x)} s^{q-1}, \qquad \rho\left((t^*, t^* + s) \mid x\right) \ge b_{\rho(\cdot \mid x)} s^{q-1}, \tag{5}$$
and the function $\gamma$ on $X$ defined by $\gamma(x) = b_{\rho(\cdot \mid x)} \alpha_{\rho(\cdot \mid x)}^{q-1}$ satisfies $\gamma^{-1} \in L^p_{\rho_X}$.
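As a simple illustration (our addition, not from the paper), suppose $\rho(\cdot \mid x)$ has a density $h_x$ satisfying $h_x(u) \ge c > 0$ for $|u - t^*| \le \alpha$, where $t^*$ is the $\tau$-quantile. Then for $0 \le s \le \alpha$,
$$\rho\left((t^* - s, t^*) \mid x\right) = \int_{t^*-s}^{t^*} h_x(u)\,du \ge c\,s, \qquad \rho\left((t^*, t^* + s) \mid x\right) \ge c\,s,$$
so (5) holds with $q = 2$, $b_{\rho(\cdot\mid x)} = c$, and $\alpha_{\rho(\cdot\mid x)} = \min\{\alpha, 2\}$. If $c$ and $\alpha$ can be chosen uniformly in $x$, then $\gamma^{-1}$ is bounded and the condition holds for every $p \in (0, \infty]$.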
We also need a capacity condition on the hypothesis space $\mathcal{H}$ for our learning rates. In this paper, we measure the capacity by empirical covering numbers.
Definition 2. Let $(\mathcal{M}, d)$ be a pseudometric space and $S$ a subset of $\mathcal{M}$. For every $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon, d)$ of $S$ with respect to $\varepsilon$ and $d$ is defined as the minimal number of balls of radius $\varepsilon$ whose union covers $S$, that is,
$$\mathcal{N}(S, \varepsilon, d) = \min\left\{\ell \in \mathbb{N} : S \subset \bigcup_{j=1}^{\ell} B(s_j, \varepsilon) \ \text{for some } \{s_j\}_{j=1}^{\ell} \subset \mathcal{M}\right\}, \tag{6}$$
where $B(s_j, \varepsilon) = \{s \in \mathcal{M} : d(s, s_j) \le \varepsilon\}$ is a ball in $\mathcal{M}$.
Definition 3. Let $\mathcal{F}$ be a set of functions on $X$, $\mathbf{x} = (x_i)_{i=1}^k \subset X^k$, and $\mathcal{F}|_{\mathbf{x}} = \{(f(x_i))_{i=1}^k : f \in \mathcal{F}\} \subset \mathbb{R}^k$. Set $\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon) = \mathcal{N}(\mathcal{F}|_{\mathbf{x}}, \varepsilon, d_2)$. The $\ell^2$-empirical covering number of $\mathcal{F}$ is defined by
$$\mathcal{N}_2(\mathcal{F}, \varepsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{x} \in X^k} \mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon), \qquad \varepsilon > 0. \tag{7}$$
Here $d_2$ is the normalized $\ell^2$-metric on the Euclidean space $\mathbb{R}^k$ given by
$$d_2(\mathbf{a}, \mathbf{b}) = \left(\frac{1}{k}\sum_{i=1}^k |a_i - b_i|^2\right)^{1/2} \qquad \text{for } \mathbf{a} = (a_i)_{i=1}^k,\ \mathbf{b} = (b_i)_{i=1}^k \in \mathbb{R}^k. \tag{8}$$
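To make Definition 3 concrete, here is a small Python sketch (our own illustration, not part of the paper's analysis) that computes the normalized $\ell^2$-metric $d_2$ and a greedy upper bound on $\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon)$ for a finite, hypothetical family of functions evaluated at a fixed sample $\mathbf{x}$.

```python
import numpy as np

def d2(a, b):
    """Normalized l2-metric on R^k from (8)."""
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_cover_size(vectors, eps):
    """Greedy upper bound on the covering number of a finite set under d2.

    Every vector ends up within eps of some chosen center, so the number of
    centers upper-bounds N(S, eps, d2)."""
    centers = []
    for v in vectors:
        if all(d2(v, c) > eps for c in centers):
            centers.append(v)
    return len(centers)

# Hypothetical family F = {f_c(x) = sin(c x) : c on a grid}, restricted to a sample x.
x = np.linspace(0.0, 1.0, 50)
F_x = [np.sin(c * x) for c in np.linspace(0.5, 5.0, 200)]
print("upper bound on N_{2,x}(F, 0.1):", greedy_cover_size(F_x, eps=0.1))
```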
Assumption. Assume that the empirical covering number of the hypothesis space $\mathcal{H}$ is bounded for some $a > 0$ and $\iota \in (0, 2)$ as
$$\log \mathcal{N}_2(\mathcal{H}, \varepsilon) \le a \varepsilon^{-\iota}, \qquad \forall \varepsilon > 0. \tag{9}$$
Theorem 4. Assume that $\rho$ satisfies (5) with some $p \in (0, \infty]$ and $q \in [1, \infty)$. Denote $r = pq/(p+1)$. One further assumes that $f_{\tau,\rho}$ is uniquely defined. If $f_{\tau,\rho} \in \mathcal{H}$ and $\mathcal{H}$ satisfies (9) with $\iota \in (0, 2)$, then for any $0 < \delta < 1$, with confidence $1 - \delta$, one has
$$\left\|f_{\mathbf{z}} - f_{\tau,\rho}\right\|_{L^r_{\rho_X}} \le \widetilde{C} m^{-\vartheta} \log\frac{2}{\delta}, \tag{10}$$
where
$$\vartheta = \frac{2(p+1)}{4q(p+1) - (2-\iota)\min\{2(p+1),\, pq\}} \tag{11}$$
and $\widetilde{C}$ is a constant independent of $m$ and $\delta$.
Remark 5. In the ERM scheme, we can choose the hypothesis space so that $f_{\tau,\rho} \in \mathcal{H}$, which in turn makes the approximation error described by (23) equal to zero. However, this is impossible for the kernel-based regularized scheme because of the appearance of the penalty term $\lambda\|f\|_K^2$.

If $q \le 2$, all conditional distributions behave around the quantile similarly to the uniform distribution. In this case $r = pq/(p+1) \le 2$ and $\theta = \min\{2/q, p/(p+1)\} = p/(p+1)$ for all $p > 0$. Hence $\vartheta = 2(p+1)/((2+\iota)pq + 4q)$. Furthermore, when $p$ is large enough, the parameter $r$ tends to $q$ and the power index of the above learning rate approaches $2/((2+\iota)q)$ arbitrarily closely, which shows that the power index of the learning rate for $\|f_{\mathbf{z}} - f_{\tau,\rho}\|^q_{L^q_{\rho_X}}$ is arbitrarily close to $2/(2+\iota)$, independently of $q$. In particular, $\iota$ can be arbitrarily small when $\mathcal{H}$ is smooth enough. In this case the power index $2/(2+\iota)$ of the learning rates can be arbitrarily close to 1, which is the optimal learning rate for least squares regression.
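To illustrate Theorem 4 numerically (our own illustration, not from the paper), the sketch below evaluates the exponent $\vartheta$ in (11) for a few hypothetical choices of $(p, q, \iota)$ and shows the limiting value $2/((2+\iota)q)$ discussed above as $p \to \infty$.

```python
def rate_exponent(p, q, iota):
    """Exponent (11): 2(p+1) / (4q(p+1) - (2 - iota) * min{2(p+1), pq})."""
    return 2 * (p + 1) / (4 * q * (p + 1) - (2 - iota) * min(2 * (p + 1), p * q))

# Hypothetical parameter choices with q = 2 and iota = 0.5.
for p in (1, 10, 100, 1e6):
    print(f"p = {p:>9}: exponent = {rate_exponent(p, q=2.0, iota=0.5):.4f}")
print("limit 2 / ((2 + iota) q) =", 2 / ((2 + 0.5) * 2.0))
```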
Let us take some examples to demonstrate the above main
result.
Example 6. Let $\mathcal{H}$ be the unit ball of the Sobolev space $H^s$ with $s > 0$. Observe that the empirical covering number is bounded above by the uniform covering number defined in Definition 2. Hence we have (see [6, 7])
$$\log \mathcal{N}_2(\mathcal{H}, \varepsilon) \le C_s \left(\frac{1}{\varepsilon}\right)^{n/s}, \tag{12}$$
where $n$ is the dimension of the input space $X$ and $C_s > 0$.
Under the same assumptions as in Theorem 4, we get, by replacing $\iota$ by $n/s$, that for any $0 < \delta < 1$, with confidence $1 - \delta$,
$$\left\|f_{\mathbf{z}} - f_{\tau,\rho}\right\|_{L^r_{\rho_X}} \le \widetilde{C} m^{-\vartheta} \log\frac{2}{\delta}, \tag{13}$$
where
$$\vartheta = \frac{2(p+1)}{4q(p+1) - (2 - n/s)\min\{2(p+1),\, pq\}} \tag{14}$$
and $\widetilde{C}$ is a constant independent of $m$ and $\delta$.

We carry out the same discussion as in Remark 5 for the case of $q \le 2$ and large enough $p$. Therefore the power index of the learning rates for $\|f_{\mathbf{z}} - f_{\tau,\rho}\|^q_{L^q_{\rho_X}}$ is arbitrarily close to $2/(2 + n/s)$, independently of $q$. Furthermore, $s$ can be arbitrarily large if the Sobolev space is smooth enough. In this special case, the learning rate power index approaches 1 arbitrarily closely.

Example 7. Let $\mathcal{H}$ be the unit ball of the reproducing kernel Hilbert space $\mathcal{H}_\sigma$ generated by a Gaussian kernel (see [5]). Reference [7] tells us that
$$\log \mathcal{N}(\mathcal{H}, \varepsilon) \le C_{n,\sigma} \left(\log\frac{1}{\varepsilon}\right)^{n+1}, \qquad \forall \varepsilon > 0, \tag{15}$$
where $C_{n,\sigma} > 0$ depends only on $n$ and $\sigma > 0$. Obviously, the right-hand side of (15) is bounded by $C_{n,\sigma}(1/\varepsilon)^{n+1}$.

So from Theorem 4, we can get different learning rates with power index
$$\vartheta = \frac{2(p+1)}{4q(p+1) - (1-n)\min\{2(p+1),\, pq\}}. \tag{16}$$
If $q \le 2$ and $p$ is large enough, the power index of the learning rates for $\|f_{\mathbf{z}} - f_{\tau,\rho}\|^q_{L^q_{\rho_X}}$ is arbitrarily close to $2/(3+n)$, which is very slow if $n$ is large. However, in most data sets the data are concentrated on a much lower dimensional manifold embedded in the high dimensional space. In this setting an analysis that replaces $n$ by the intrinsic dimension of the manifold would be of great interest (see [8] and references therein).

3. Error Analysis

Define the noise-free error, called the generalization error, associated with the pinball loss $L_\tau$ as
$$\mathcal{E}_\tau(f) = \int_Z L_\tau\left(y - f(x)\right) d\rho \qquad \text{for } f : X \to [-1, 1]. \tag{17}$$
Then the measurable function $f_{\tau,\rho}$ is a minimizer of $\mathcal{E}_\tau$. Obviously, $f_{\tau,\rho}(x) \in [-1, 1]$.

We need the following results from [1] for our error analysis.

Proposition 8. Let $L_\tau$ be the pinball loss. Assume that $\rho$ satisfies (5) with some $p \in (0, \infty]$ and $q \in [1, \infty)$. Then for all $f : X \to [-1, 1]$ one has
$$\left\|f - f_{\tau,\rho}\right\|_{L^r_{\rho_X}} \le 2^{1-1/q} q^{1/q} \left\|\gamma^{-1}\right\|_{L^p_{\rho_X}}^{1/q} \left(\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right)^{1/q}. \tag{18}$$
Furthermore, with
$$\theta = \min\left\{\frac{2}{q}, \frac{p}{p+1}\right\} \in (0, 1], \qquad C_\theta = 2^{2-\theta} q^{\theta} \left\|\gamma^{-1}\right\|_{L^p_{\rho_X}}^{\theta} > 0, \tag{19}$$
one has
$$\mathbb{E}\left\{\left(L_\tau(y - f(x)) - L_\tau(y - f_{\tau,\rho}(x))\right)^2\right\} \le C_\theta \left(\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right)^{\theta}, \qquad \forall f : X \to Y. \tag{20}$$

The above result implies that we can get convergence rates of $f_{\mathbf{z}}$ in the space $L^r_{\rho_X}$ by bounding the excess generalization error $\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho})$.

To bound $\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho})$, we need a standard error decomposition procedure [6] and a concentration inequality.

3.1. Error Decomposition. Define the empirical error associated with the pinball loss $L_\tau$ as
$$\mathcal{E}_{\mathbf{z},\tau}(f) = \frac{1}{m}\sum_{i=1}^m L_\tau\left(y_i - f(x_i)\right) \qquad \text{for } f : X \to [-1, 1]. \tag{21}$$
Define
$$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}_\tau(f) = \arg\min_{f \in \mathcal{H}} \left\{\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right\}. \tag{22}$$

Lemma 9. Let $L_\tau$ be the pinball loss, $f_{\mathbf{z}}$ be defined by (3), and $f_{\mathcal{H}} \in \mathcal{H}$ by (22). Then
$$\begin{aligned}
\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho}) \le{}& \left[\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right] - \left[\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})\right] \\
&+ \left[\mathcal{E}_{\mathbf{z},\tau}(f_{\mathcal{H}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})\right] - \left[\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right] \\
&+ \mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho}).
\end{aligned} \tag{23}$$

Proof. The excess generalization error can be written as
$$\begin{aligned}
\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho}) ={}& \left[\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}})\right] + \left[\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\mathcal{H}})\right] \\
&+ \left[\mathcal{E}_{\mathbf{z},\tau}(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\mathcal{H}})\right] + \left[\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right].
\end{aligned} \tag{24}$$
The definition of $f_{\mathbf{z}}$ implies that $\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\mathcal{H}}) \le 0$. Furthermore, by subtracting and adding $\mathcal{E}_\tau(f_{\tau,\rho})$ and $\mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})$ in the first and third terms, we see that Lemma 9 holds true.
We call the last term of (23) the approximation error. It has been studied in [9].

3.2. Concentration Inequality and Sample Error. Let us recall the one-sided Bernstein inequality as follows.

Lemma 10. Let $\xi$ be a random variable on a probability space $Z$ with variance $\sigma^2$ satisfying $|\xi - \mathbb{E}(\xi)| \le M_\xi$ for some constant $M_\xi$. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, one has
$$\frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mathbb{E}(\xi) \le \frac{2 M_\xi \log(1/\delta)}{3m} + \sqrt{\frac{2\sigma^2 \log(1/\delta)}{m}}. \tag{25}$$
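As a quick sanity check (our own illustration, not part of the paper), the following Python sketch verifies Lemma 10 empirically for a hypothetical bounded random variable: the fraction of trials in which the empirical mean exceeds its expectation by more than the bound in (25) should be at most $\delta$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, delta, n_trials = 200, 0.05, 20000

# Hypothetical xi: uniform on [-1, 1], so E(xi) = 0, Var(xi) = 1/3, |xi - E(xi)| <= 1.
M_xi, sigma2 = 1.0, 1.0 / 3.0
bound = 2 * M_xi * np.log(1 / delta) / (3 * m) + np.sqrt(2 * sigma2 * np.log(1 / delta) / m)

samples = rng.uniform(-1.0, 1.0, size=(n_trials, m))
deviations = samples.mean(axis=1) - 0.0  # (1/m) sum_i xi(z_i) - E(xi)
print("empirical failure rate:", np.mean(deviations > bound), "(should be <=", delta, ")")
```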
Proposition 11. Let $f_{\mathcal{H}} \in \mathcal{H}$. Assume that $\rho$ on $X \times [-1, 1]$ satisfies the variance bound (20) with index $\theta$ indicated in (19). For any $0 < \delta < 1$, with confidence $1 - \delta/2$, the second term on the right-hand side of (23) can be bounded as
$$\left[\mathcal{E}_{\mathbf{z},\tau}(f_{\mathcal{H}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})\right] - \left[\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right] \le \frac{4\log(2/\delta)}{3m} + \left(\frac{2 C_\theta \log(2/\delta)}{m}\right)^{1/(2-\theta)} + \mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho}). \tag{26}$$

Proof. Let $\xi(z) = L_\tau(y - f_{\mathcal{H}}(x)) - L_\tau(y - f_{\tau,\rho}(x))$, which satisfies $|\xi| \le 2$ and in turn $|\xi - \mathbb{E}(\xi)| \le 2$. The variance bound (20) implies that
$$\mathbb{E}(\xi - \mathbb{E}(\xi))^2 \le \mathbb{E}\xi^2 \le C_\theta \left(\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right)^{\theta}. \tag{27}$$
Using (25) on the random variable $\xi(z)$, we can get the desired bound (26) with the help of Young's inequality.

Let us turn to estimating the sample error term $[\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho})] - [\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})]$, which involves the function $f_{\mathbf{z}}$ running over a set of functions since $\mathbf{z}$ is itself a random sample. To estimate it, we use the concentration inequality below involving empirical covering numbers [10–12].

Lemma 12. Let $\mathcal{F}$ be a class of measurable functions on $Z$. Assume that there are constants $B, c > 0$ and $\alpha \in [0, 1]$ such that $\|f\|_\infty \le B$ and $\mathbb{E}f^2 \le c(\mathbb{E}f)^{\alpha}$ for every $f \in \mathcal{F}$. If $\log \mathcal{N}_2(\mathcal{F}, \varepsilon) \le a \varepsilon^{-\iota}$ for all $\varepsilon > 0$, then there exists a constant $c'_\iota$ depending only on $\iota$ such that for any $t > 0$, with probability at least $1 - e^{-t}$, there holds
$$\mathbb{E}f - \frac{1}{m}\sum_{i=1}^m f(z_i) \le \frac{1}{2}\eta^{1-\alpha}(\mathbb{E}f)^{\alpha} + c'_\iota \eta + 2\left(\frac{ct}{m}\right)^{1/(2-\alpha)} + \frac{18Bt}{m}, \qquad \forall f \in \mathcal{F}, \tag{28}$$
where
$$\eta := \max\left\{c^{(2-\iota)/(4-2\alpha+\iota\alpha)} \left(\frac{a}{m}\right)^{2/(4-2\alpha+\iota\alpha)},\ B^{(2-\iota)/(2+\iota)} \left(\frac{a}{m}\right)^{2/(2+\iota)}\right\}. \tag{29}$$

We apply Lemma 12 to the function set $\mathcal{F}$, where
$$\mathcal{F} = \left\{L_\tau(y - f(x)) - L_\tau(y - f_{\tau,\rho}(x)) : f \in \mathcal{H}\right\}. \tag{30}$$

Proposition 13. Assume $\rho$ on $X \times [-1, 1]$ satisfies the variance bound (20) with index $\theta$ indicated in (19). If $\mathcal{H}$ satisfies (9) with $\iota \in (0, 2)$, then for any $0 < \delta < 1$, with confidence $1 - \delta/2$, one has
$$\left[\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right] - \left[\mathcal{E}_{\mathbf{z},\tau}(f) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})\right] \le \frac{1}{2}\left[\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right] + C_{\iota,\theta} \log\frac{2}{\delta}\left(\frac{1}{m}\right)^{2/(4-2\theta+\iota\theta)}, \qquad \forall f \in \mathcal{H}, \tag{31}$$
where
$$C_{\iota,\theta} = \left(\frac{1}{2} + c'_\iota\right)\max\left\{C_\theta^{(2-\iota)/(4-2\theta+\iota\theta)} a^{2/(4-2\theta+\iota\theta)},\ 2^{(2-\iota)/(2+\iota)} a^{2/(2+\iota)}\right\} + 2 C_\theta^{1/(2-\theta)} + 36. \tag{32}$$

Proof. Take $g \in \mathcal{F}$ with the form $g(z) = L_\tau(y - f(x)) - L_\tau(y - f_{\tau,\rho}(x))$, where $f \in \mathcal{H}$. Hence $\mathbb{E}g = \mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})$ and $\frac{1}{m}\sum_{i=1}^m g(z_i) = \mathcal{E}_{\mathbf{z},\tau}(f) - \mathcal{E}_{\mathbf{z},\tau}(f_{\tau,\rho})$.

The Lipschitz property of the pinball loss $L_\tau$ implies that
$$\left|g(z)\right| \le \left|f(x) - f_{\tau,\rho}(x)\right| \le 2. \tag{33}$$
For $g_1, g_2 \in \mathcal{F}$, we have
$$\left|g_1(z) - g_2(z)\right| = \left|L_\tau(y - f_1(x)) - L_\tau(y - f_2(x))\right| \le \left|f_1(x) - f_2(x)\right|, \tag{34}$$
where $f_1, f_2 \in \mathcal{H}$. It follows that
$$\mathcal{N}_{2,\mathbf{z}}(\mathcal{F}, \varepsilon) \le \mathcal{N}_{2,\mathbf{x}}(\mathcal{H}, \varepsilon). \tag{35}$$
Hence
$$\log \mathcal{N}_2(\mathcal{F}, \varepsilon) \le a \varepsilon^{-\iota}. \tag{36}$$
Applying Lemma 12 with $B = 2$, $\alpha = \theta$, and $c = C_\theta$, we know that for any $0 < \delta < 1$, with confidence $1 - \delta/2$, there holds
$$\begin{aligned}
\mathbb{E}g - \frac{1}{m}\sum_{i=1}^m g(z_i) &\le \frac{1}{2}\eta^{1-\theta}(\mathbb{E}g)^{\theta} + c'_\iota \eta + 2\left(\frac{C_\theta \log(2/\delta)}{m}\right)^{1/(2-\theta)} + \frac{36\log(2/\delta)}{m} \\
&\le \frac{1}{2}\mathbb{E}g + \left(\frac{1}{2} + c'_\iota\right)\eta + 2\left(\frac{C_\theta \log(2/\delta)}{m}\right)^{1/(2-\theta)} + \frac{36\log(2/\delta)}{m}.
\end{aligned} \tag{37}$$
Here
$$\eta \le \max\left\{C_\theta^{(2-\iota)/(4-2\theta+\iota\theta)} a^{2/(4-2\theta+\iota\theta)},\ 2^{(2-\iota)/(2+\iota)} a^{2/(2+\iota)}\right\} \times \left(\frac{1}{m}\right)^{2/(4-2\theta+\iota\theta)}. \tag{38}$$
Note that
$$\left(\frac{1}{2} + c'_\iota\right)\eta + 2\left(\frac{C_\theta \log(2/\delta)}{m}\right)^{1/(2-\theta)} + \frac{36\log(2/\delta)}{m} \le C_{\iota,\theta} \log\frac{2}{\delta}\left(\frac{1}{m}\right)^{2/(4-2\theta+\iota\theta)}, \tag{39}$$
where $C_{\iota,\theta}$ is indicated in (32). Then our desired bound holds true.

Proposition 14. Assume $\rho$ on $X \times [-1, 1]$ satisfies the variance bound (20) with index $\theta$ indicated in (19). If $\mathcal{H}$ satisfies (9) with $\iota \in (0, 2)$, then for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\mathcal{E}_\tau(f_{\mathbf{z}}) - \mathcal{E}_\tau(f_{\tau,\rho}) \le 2\left(\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right) + \frac{8\log(2/\delta)}{3m} + 2\left(\frac{2 C_\theta \log(2/\delta)}{m}\right)^{1/(2-\theta)} + 2 C_{\iota,\theta} \log\frac{2}{\delta}\left(\frac{1}{m}\right)^{2/(4-2\theta+\iota\theta)}. \tag{40}$$

The above bound follows directly from Propositions 11 and 13 together with the fact that $f_{\mathbf{z}} \in \mathcal{H}$.

3.3. Bounding the Total Error. Now we are in a position to present our general result on the error analysis for algorithm (3).

Theorem 15. Assume that $\rho$ satisfies (5) with some $p \in (0, \infty]$ and $q \in [1, \infty)$. Denote $r = pq/(p+1)$. Further assume that $\mathcal{H}$ satisfies (9) with $\iota \in (0, 2)$ and $f_{\tau,\rho}$ is uniquely defined. Then for any $0 < \delta < 1$, with confidence $1 - \delta$, one has
$$\left\|f_{\mathbf{z}} - f_{\tau,\rho}\right\|_{L^r_{\rho_X}} \le \widetilde{C} \inf_{f \in \mathcal{H}}\left\{\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right\}^{1/q} + \widetilde{C} m^{-\vartheta} \log\frac{2}{\delta}, \tag{41}$$
where
$$\vartheta = \frac{2(p+1)}{4q(p+1) - (2-\iota)\min\{2(p+1),\, pq\}} \tag{42}$$
and $\widetilde{C}$ is a constant independent of $m$ and $\delta$.

Proof. Combining (18), (19), and (40), with confidence $1 - \delta$ we have
$$\begin{aligned}
\left\|f_{\mathbf{z}} - f_{\tau,\rho}\right\|_{L^r_{\rho_X}} &\le \widetilde{C}\left(\mathcal{E}_\tau(f_{\mathcal{H}}) - \mathcal{E}_\tau(f_{\tau,\rho})\right)^{1/q} + \widetilde{C} \log\frac{2}{\delta}\left(\frac{1}{m}\right)^{2/(q(4-2\theta+\iota\theta))} \\
&\le \widetilde{C} \inf_{f \in \mathcal{H}}\left\{\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right\}^{1/q} + \widetilde{C} m^{-\vartheta} \log\frac{2}{\delta},
\end{aligned} \tag{43}$$
where
$$\vartheta = \frac{2(p+1)}{4q(p+1) - (2-\iota)\min\{2(p+1),\, pq\}}, \qquad \widetilde{C} = 2 q^{1/q}\left\|\gamma^{-1}\right\|_{L^p_{\rho_X}}^{1/q}\left(\frac{4}{3} + \left(2 C_\theta\right)^{1/(2-\theta)} + C_{\iota,\theta}\right)^{1/q}. \tag{44}$$
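As a side note (our addition, not in the original proof), the exponent $2/(q(4 - 2\theta + \iota\theta))$ appearing in (43) coincides with $\vartheta$ from (42): since $\theta = \min\{2/q, p/(p+1)\}$ by (19), we have $q\theta(p+1) = \min\{2(p+1), pq\}$, and hence
$$\frac{2}{q(4 - 2\theta + \iota\theta)} = \frac{2(p+1)}{4q(p+1) - (2-\iota)\,q\theta(p+1)} = \frac{2(p+1)}{4q(p+1) - (2-\iota)\min\{2(p+1),\, pq\}} = \vartheta.$$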
Proof of Theorem 4. The assumption $f_{\tau,\rho} \in \mathcal{H}$ implies that
$$\inf_{f \in \mathcal{H}}\left\{\mathcal{E}_\tau(f) - \mathcal{E}_\tau(f_{\tau,\rho})\right\} = 0. \tag{45}$$
Therefore, our desired result follows directly from Theorem 15.
4. Further Discussions
In this paper, we studied the ERM algorithm (3) for quantile regression and provided convergence results and learning rates. We showed some essential differences between the ERM scheme and the kernel-based regularized scheme for quantile regression. We also pointed out the main difficulty in dealing with quantile regression: the lack of strong convexity of the pinball loss. To overcome this difficulty, a noise condition on $\rho$ is imposed which enables us to obtain a variance-expectation bound similar to the one for least squares regression.
In our analysis we only considered $f \in \mathcal{H}$ with $\|f\|_{C(X)} \le 1$. The case $\|f\|_{C(X)} \le R$ for $R \ge 1$ would be interesting for future work. The approximation error involving $R$ can be estimated using knowledge of interpolation spaces.
In our setting, the sample is drawn independently from the distribution $\rho$. However, in many practical problems the i.i.d. condition is rather demanding, so it would be interesting to investigate the ERM scheme for quantile regression with nonidentical distributions [13, 14] or dependent sampling [15].
Acknowledgment
The work described in this paper is supported by the NSF of China under Grants 11001247 and 61170109.
References
[1] I. Steinwart and A. Christmann, How SVMs Can Estimate Quantiles and the Median, vol. 20 of Advances in Neural Information Processing Systems, MIT Press, Cambridge, Mass, USA, 2008.
[2] D.-H. Xiang, “Conditional quantiles with varying Gaussians,” Advances in Computational Mathematics, 2011.
[3] D.-H. Xiang, T. Hu, and D.-X. Zhou, “Learning with varying
insensitive loss,” Applied Mathematics Letters, vol. 24, no. 12, pp.
2107–2109, 2011.
[4] D.-H. Xiang, T. Hu, and D.-X. Zhou, “Approximation analysis of
learning algorithms for support vector regression and quantile
regression,” Journal of Applied Mathematics, vol. 2012, Article ID
902139, 17 pages, 2012.
[5] N. Aronszajn, “Theory of reproducing kernels,” Transactions of
the American Mathematical Society, vol. 68, pp. 337–404, 1950.
[6] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation
Theory Viewpoint, vol. 24, Cambridge University Press, Cambridge, UK, 2007.
[7] D.-X. Zhou, “The covering number in learning theory,” Journal
of Complexity, vol. 18, no. 3, pp. 739–767, 2002.
[8] S. Mukherjee, Q. Wu, and D.-X. Zhou, “Learning gradients on
manifolds,” Bernoulli, vol. 16, no. 1, pp. 181–207, 2010.
[9] S. Smale and D.-X. Zhou, “Estimating the approximation error
in learning theory,” Analysis and Applications, vol. 1, no. 1, pp.
17–41, 2003.
[10] S. Mukherjee and Q. Wu, “Estimation of gradients and coordinate covariation in classification,” Journal of Machine Learning
Research, vol. 7, pp. 2481–2514, 2006.
[11] Y. Yao, “On complexity issues of online learning algorithms,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6470–6481, 2010.
[12] Y. Ying, “Convergence analysis of online algorithms,” Advances
in Computational Mathematics, vol. 27, no. 3, pp. 273–291, 2007.
[13] T. Hu and D.-X. Zhou, “Online learning with samples drawn
from non-identical distributions,” Journal of Machine Learning
Research, vol. 10, pp. 2873–2898, 2009.
[14] S. Smale and D.-X. Zhou, “Online learning with Markov
sampling,” Analysis and Applications, vol. 7, no. 1, pp. 87–113,
2009.
[15] Z.-C. Guo and L. Shi, “Classification with non-i.i.d. sampling,”
Mathematical and Computer Modelling, vol. 54, no. 5-6, pp.
1347–1364, 2011.