Chapter 1
A GRADIENT-BASED FORWARD GREEDY
ALGORITHM FOR SPARSE GAUSSIAN
PROCESS REGRESSION
Ping Sun, Xin Yao
CERCIA, School of Computer Science
University of Birmingham, Edgbaston Park Road
Birmingham, B15 2TT, UK
p.sun@cs.bham.ac.uk, x.yao@cs.bham.ac.uk
Abstract
In this chapter, we present a gradient-based forward greedy method for sparse approximation of the Bayesian Gaussian Process Regression (GPR) model. Different from previous work, which is mostly based on various basis vector selection strategies, we propose to construct, instead of select, a new basis vector at each iterative step. This idea was motivated by the well-known gradient boosting approach. The resulting algorithm, built on gradient-based optimisation packages, incurs similar computational cost and memory requirements to other leading sparse GPR algorithms. Moreover, the proposed work is a general framework which can be extended to deal with other popular kernel machines, including Kernel Logistic Regression (KLR) and Support Vector Machines (SVMs). Numerical experiments on a wide range of datasets are presented to demonstrate the superiority of our algorithm in terms of generalisation performance.
Keywords:
Gaussian process regression, sparse approximation, sequential forward greedy
algorithm, basis vector selection, basis vector construction, gradient-based optimisation, gradient boosting
1. Introduction
Recently, Gaussian Processes (GP) [16] have become one of the most popular kernel machines in the machine learning community. Besides their simplicity in training and model selection, GP models also yield probabilistic predictions for test examples with excellent generalisation capability. However, original GP models cannot be applied to large datasets due to their high computational demands. Firstly, GP models require the computation and storage of the full-order kernel matrix K (also known as the covariance matrix) of size n × n, where n is the number of training examples. Secondly, the computational cost of training GP models is about O(n^3). Thirdly, predicting a test case requires O(n) for evaluating the mean and O(n^2) for computing the variance. In order to overcome these limitations, a number of approximation schemes have been proposed recently (see [21], chapter 8) to accelerate the computation of GP. Most of these approaches can be broadly classified into two main types: (i) greedy forward selection methods, which can also be viewed as iteratively approximating the full kernel matrix by a low-rank representation [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]; (ii) methods that approximate the matrix-vector multiplication (MVM) operations by the Fast Gauss Transform (FGT) [35] and, more generally, the N-body approach [14]. All of these algorithms can achieve linear scalability in the number of training examples for both computational cost and memory requirement. In contrast to the MVM approximation, the method of approximating the kernel matrix is simpler to implement since it does not require determining additional critical parameters [35]. In this chapter we follow the path of approximating the full kernel matrix and propose a forward greedy algorithm, different from previous work, for achieving a low-rank kernel representation. The main idea is to construct instead of select basis vectors, which was inspired by the well-known gradient boosting framework [10]. Here we focus only on regression problems; the work can be extended to classification tasks [37].
We now outline the contents of this chapter. In Section 2, we introduce GP
regression (GPR) and briefly show how to achieve approximate GPR models in
the current literature. In Section 3, we review some forward greedy algorithms
for approximating the full GPR model and present our motivation. In Section
4, we detail our approach. Some experimental results are reported in Section
5. Finally, Section 6 concludes this chapter by presenting possible directions
of future research.
2. Gaussian Process Regression
In regression problems, we are given training data composed of n examples, D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^m is the m-dimensional input and y_i ∈ R is the corresponding target. It is common to assume that the outputs y_i are generated by

y_i = f(x_i) + ε_i,    (1.1)

where ε_i is a normal random variable with density P(ε_i) = N(ε_i | 0, σ^2) and f(x) is an unobservable latent function. The goal of the regression task is to estimate the function f(x), which is then used to predict the target y_* on an unseen test case x_*.
Nomenclature

n              total number of training examples
m              dimension of the input
x_i, X         input example i and X = [x_1 ... x_n]^⊤ ∈ R^{n×m}
x_i(l)         the l-th entry of the input x_i
y_i, y         target of x_i and y = [y_1, ..., y_n]^⊤ ∈ R^n
Id_q, 1_q      the identity matrix of size q × q and the all-one vector in R^q
K(x_i, x_j)    kernel function, also known as covariance function
θ_0, θ_l, θ_b  hyperparameters of the kernel K(x_i, x_j)
K              training kernel matrix, (K)_{ij} = K(x_i, x_j), i, j = 1, ..., n
σ^2            variance of the noise
f(x_i)         an unobservable latent function
f              vector of latent function values, i.e., f = [f(x_1), ..., f(x_n)]^⊤
N(·|µ, Σ)      density of a Gaussian with mean µ and covariance Σ
P(·)           the probability density function
x_*, y_*       test input and target
f_*            latent function value at x_*
k_*, k_**      (k_*)_i = K(x_i, x_*), i = 1, ..., n, and k_** = K(x_*, x_*)
µ_*, σ_*^2     the predictive mean and variance
α              weight parameter, α ∈ R^n
E(·)           the objective (error) function
p              iteration index or number of selected (or constructed) basis vectors
i_p            index of the p-th basis vector to be added
I_p            index set, I_p = {i_1, ..., i_p}
x̃_j, X̃_p       selected or constructed basis vector j and X̃_p = [x̃_1 ... x̃_p]^⊤
x̃_j(l)         the l-th entry of the basis vector x̃_j
K_p            kernel columns, (K_p)_{ij} = K(x_i, x̃_j), i = 1, ..., n; j = 1, ..., p
k_p            the p-th column of K_p
Q_p            matrix induced by {x̃_j}_{j=1}^p, (Q_p)_{ij} = K(x̃_i, x̃_j)
q_p^*, q_p     q_p^* is the p-th diagonal entry and q_p the p-th column of Q_p excluding q_p^*
K̃              approximate kernel matrix of K, K̃ = K_p Q_p^{-1} K_p^⊤
Q_p(·)         probability density function conditioned on K̃ = K_p Q_p^{-1} K_p^⊤
µ̃_*, σ̃_*^2     approximate predictive mean and variance
k̃_*            (k̃_*)_j = K(x̃_j, x_*), j = 1, ..., p
α_p            a sparse estimate of α, α_p = (K_p^⊤ K_p + σ^2 Q_p)^{-1} K_p^⊤ y
µ_p, r_p       training mean µ_p = K_p α_p and residual error r_p = y − µ_p
H_p            the matrix Id_n − K_p Σ_p K_p^⊤
L_p            factor of the Cholesky decomposition Q_p = L_p L_p^⊤
G_p            the product K_p L_p^{-⊤}
M_p            factor of the Cholesky decomposition (G_p^⊤ G_p + σ^2 Id_p) = M_p M_p^⊤
In the GPR framework, the underlying f(x) is assumed to be a zero-mean Gaussian process, which is a collection of random variables, any finite number of which have a joint Gaussian distribution [21]. Let f = [f(x_1), ..., f(x_n)]^⊤ be the vector of latent function values; GPR assumes a GP prior over the functions, i.e. P(f) = N(f | 0, K), where K is the covariance matrix generated by evaluating the covariance function K(x_i, x_j) on all pairs of inputs {(x_i, x_j) | i, j = 1, ..., n}.
A common example of K(x_i, x_j) is the squared-exponential function

K(x_i, x_j; θ) = θ_0 exp( −(1/2) Σ_{l=1}^m θ_l (x_i(l) − x_j(l))^2 + θ_b ),    (1.2)

where θ_0, θ_l, θ_b > 0 are hyperparameters, θ = [θ_0, θ_1, ..., θ_m, θ_b]^⊤ ∈ R^{m+2}, and x_i(l) denotes the l-th entry of x_i.
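As a concrete illustration, the following is a minimal NumPy sketch of the covariance (1.2), with θ_b placed inside the exponent as in the equation above; the function and argument names (se_kernel, theta_ard, and so on) are our own and are not taken from the chapter.

```python
import numpy as np

def se_kernel(X1, X2, theta0, theta_ard, theta_b):
    """Squared-exponential covariance (1.2):
    theta0 * exp(-0.5 * sum_l theta_l * (x_i(l) - x_j(l))^2 + theta_b)."""
    diff = X1[:, None, :] - X2[None, :, :]                 # (n1, n2, m) pairwise differences
    sqdist = np.einsum('ijl,l->ij', diff ** 2, theta_ard)  # sum_l theta_l * (.)^2
    return theta0 * np.exp(-0.5 * sqdist + theta_b)

# Example: a 5 x 5 training kernel matrix for random inputs.
X = np.random.randn(5, 3)
K = se_kernel(X, X, theta0=1.0, theta_ard=np.ones(3), theta_b=0.1)
```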
In order to make a prediction for a new input x_* we need to compute the predictive distribution P(f_*|x_*, y). First, the probability P(y|f), known as the likelihood, can be evaluated by

P(y|f) = Π_{i=1}^n N(y_i | f(x_i), σ^2) = N(y | f, σ^2 Id_n),    (1.3)

where Id_n is an identity matrix of size n × n. Second, the posterior probability of f can be written as

P(f|y) ∝ P(f) P(y|f) ∝ N(f | K(K + σ^2 Id_n)^{-1} y, σ^2 K(K + σ^2 Id_n)^{-1}).    (1.4)

Third, the joint GP prior P(f, f_*) is multivariate Gaussian as well, denoted as

P(f, f_*) = N( [f; f_*] | 0, [K, k_*; k_*^⊤, k_**] ),    (1.5)

where

k_* = (K(x_i, x_*))_{i=1}^n,   k_** = K(x_*, x_*).    (1.6)

Furthermore, the conditional distribution of f_* given f is a Gaussian

P(f_*|f, x_*) = N(k_*^⊤ K^{-1} f, k_** − k_*^⊤ K^{-1} k_*),    (1.7)

and finally the predictive distribution P(f_*|x_*, y) can be found by

P(f_*|x_*, y) = ∫ P(f_*|f, x_*) P(f|y) df = N(f_* | µ_*, σ_*^2),    (1.8)

where

µ_* = k_*^⊤ α,   σ_*^2 = k_** − k_*^⊤ (K + σ^2 Id_n)^{-1} k_*,    (1.9)

and the weight parameter

α = (K + σ^2 Id_n)^{-1} y.    (1.10)
Clearly, the main task of learning a GPR model is to estimate α. From (1.9) and (1.10), we can see that training a full GPR model requires O(n^3) time and O(n^2) memory, and that computing the predictive mean and variance for a new test case costs O(n) and O(n^2), respectively. It is therefore impractical to apply GPR to large-scale training or testing datasets. This has led people to investigate approximate GPR models.
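For reference, here is a small NumPy/SciPy sketch of the full model, (1.9) and (1.10), using a Cholesky factorisation in place of the explicit matrix inverse; the function name and argument layout are our own.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gpr_fit_predict(K, y, Kstar, kstarstar, sigma2):
    """Full GPR: alpha = (K + sigma2*I)^-1 y (1.10) and the predictive
    mean/variance (1.9).  K: (n,n) training kernel, Kstar: (n,t) cross
    kernel K(x_i, x_*), kstarstar: (t,) diagonal K(x_*, x_*)."""
    n = K.shape[0]
    # O(n^3) Cholesky of the noisy covariance, the dominant training cost.
    C = cho_factor(K + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve(C, y)                        # weight vector (1.10)
    mean = Kstar.T @ alpha                         # mu_* = k_*^T alpha
    V = cho_solve(C, Kstar)                        # (K + sigma2 I)^-1 k_*
    var = kstarstar - np.sum(Kstar * V, axis=0)    # sigma_*^2 per test point
    return mean, var
```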
In order to understand the main ideas of the approximate GPR models that have appeared in the literature, we view estimating α in (1.10) as the solution of the following optimisation problem [28, 30]:

min_α E(α) = (1/2) α^⊤ (σ^2 K + K^⊤ K) α − (K^⊤ y)^⊤ α + (1/2) y^⊤ y    (1.11)
           = (1/2) ||y − K α||^2 + (σ^2/2) α^⊤ K α.    (1.12)
Based on the formulation (1.12), it can be noted that many other popular kernel machines, such as Kernel Ridge Regression (KRR) [24], Least Squares Support Vector Machines (LS-SVM) [31], the Kernel Fisher Discriminant [18], Regularised Least Squares Classification (RLSC) [23] and the Proximal Support Vector Machine (PSVM) [11], are essentially equivalent to the GPR model.
Since the matrix (σ^2 K + K^⊤ K) in (1.11) is symmetric and the objective is a quadratic function, it is straightforward to exploit the well-known Conjugate Gradient (CG) method [12]. The CG method solves the problem (1.11) by iteratively performing matrix-vector multiplication (MVM) operations Kc, where c ∈ R^n is a vector. This directly motivated some researchers to apply the improved fast Gauss transform (IFGT) [35], KD-trees [27] and the general N-body approach [13] to accelerating the computation of the full GPR model through a series of efficient approximations of the product Kc.
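The following SciPy sketch illustrates this MVM view of (1.11): CG only ever touches K through a user-supplied product Kc, so any of the above approximations can be plugged in. The helper name solve_gpr_weights_cg and the mvm callback are our own constructions, not part of those libraries.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_gpr_weights_cg(mvm, y, sigma2, maxiter=1000):
    """Solve (sigma2*K + K^T K) alpha = K^T y, i.e. the stationarity condition
    of (1.11), using only matrix-vector products K c supplied by `mvm` (an exact
    product, or an IFGT / KD-tree / N-body approximation).  K is assumed symmetric."""
    n = y.shape[0]

    def matvec(c):
        Kc = mvm(c)                   # the only way K is ever accessed
        return sigma2 * Kc + mvm(Kc)  # (sigma2*K + K*K) c

    A = LinearOperator((n, n), matvec=matvec)
    alpha, info = cg(A, mvm(y), maxiter=maxiter)
    return alpha

# Usage with an explicit kernel matrix K (any MVM approximation could replace the lambda):
# alpha = solve_gpr_weights_cg(lambda c: K @ c, y, sigma2=0.1)
```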
Another class of approximate GPR models is based on the sparse estimate
of α and can be further explained as approximating the full kernel matrix K
by a low-rank kernel representation. A sparse estimate of α is defined as one
in which redundant or uninformative entries are set to exactly zero. If we use
αp to denote all the non-zero entries of α indexed by Ip = [i1 , ..., ip ], then the
objective function (1.12) can be equivalently written as
min_{α_p} E(α_p) = (1/2) ||y − K_p α_p||^2 + (σ^2/2) α_p^⊤ Q_p α_p,    (1.13)
where K_p denotes the submatrix of the columns of K centred on {x_{i_j}, j = 1, ..., p}. Let x̃_j = x_{i_j}; we refer to {x̃_j}_{j=1}^p as the set of basis vectors (Note 1). Q_p denotes the kernel matrix generated by these basis vectors, i.e., (Q_p)_{ij} = K(x̃_i, x̃_j), i, j = 1, ..., p. The sparse estimate α_p can be obtained from (1.13) as

α_p = Σ_p K_p^⊤ y    (1.14)

with

Σ_p = (K_p^⊤ K_p + σ^2 Q_p)^{-1}.    (1.15)

In contrast to (1.10), computing α_p in (1.14) only needs O(n p^2) operations instead of the original O(n^3), which greatly alleviates the computational burden involved in the training and testing procedures of the full GPR model when p ≪ n in practice.
It was observed that selecting a good index set Ip has a crucial effect on the
generalisation performance of the obtained sparse GPR model. Most current
algorithms generally formulate the selection procedure as an iterative forward
selection process. At each iteration, a new basis vector is identified based
on greedy optimisation of some criterion and the corresponding αp is then
updated. So we refer to this class of methods as greedy forward selection
algorithms.
In fact, the above sparsifying procedure can also be understood as approximating the kernel matrix K by a low-rank representation of the form K̃ = K_p Q_p^{-1} K_p^⊤. This can be seen from the optimal objective values of the problem (1.11) and the sparse version (1.13):

E(α) = (σ^2/2) y^⊤ (K + σ^2 Id_n)^{-1} y    (1.16)

and

E(α_p) = (σ^2/2) y^⊤ (K_p Q_p^{-1} K_p^⊤ + σ^2 Id_n)^{-1} y.    (1.17)

Further, it means that the sparse GPR model is obtained by replacing the original GP prior P(f) = N(f | 0, K) with an approximate prior Q_p(f) = N(f | 0, K_p Q_p^{-1} K_p^⊤) [5]. Following the same derivation as for the full GPR model, the approximate predictive distribution Q_p(f_*|x_*, y) of the sparse GPR model becomes

Q_p(f_*|x_*, y) = ∫ Q_p(f_*|f) P(f|y) df = N(f_* | µ̃_*, σ̃_*^2),    (1.18)

where

µ̃_* = k̃_*^⊤ α_p,   k̃_* = (K(x̃_j, x_*))_{j=1}^p,    (1.19)

σ̃_*^2 = k_** − k̃_*^⊤ Q_p^{-1} k̃_* + σ^2 k̃_*^⊤ Σ_p k̃_*.    (1.20)
It can be noted that computing the predictive mean and variance only needs
O(p) and O(p2 ), respectively, in sparse approximation of GPR models.
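A compact NumPy/SciPy sketch of this sparse model, covering (1.14)-(1.15) and the approximate prediction (1.19)-(1.20), is given below; the function name, the jitter term and the argument layout are our own choices rather than the chapter's code.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sparse_gpr(Kp, Qp, y, Kp_star, kstarstar, sigma2, jitter=1e-8):
    """Sparse GPR with a fixed basis set: alpha_p = Sigma_p Kp^T y (1.14)-(1.15),
    and the approximate predictive mean/variance (1.19)-(1.20).
    Kp: (n,p) kernel columns, Qp: (p,p) basis kernel, Kp_star: (p,t) = K(x~_j, x_*)."""
    p = Qp.shape[0]
    A = Kp.T @ Kp + sigma2 * Qp                      # Sigma_p^{-1}, size p x p
    A_chol = cho_factor(A + jitter * np.eye(p), lower=True)
    alpha_p = cho_solve(A_chol, Kp.T @ y)            # (1.14), O(n p^2)

    Q_chol = cho_factor(Qp + jitter * np.eye(p), lower=True)
    mean = Kp_star.T @ alpha_p                       # mu~_* = k~_*^T alpha_p
    var = (kstarstar
           - np.sum(Kp_star * cho_solve(Q_chol, Kp_star), axis=0)            # - k~_*^T Qp^{-1} k~_*
           + sigma2 * np.sum(Kp_star * cho_solve(A_chol, Kp_star), axis=0))  # + s2 k~_*^T Sigma_p k~_*
    return alpha_p, mean, var
```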
Compared to the approaches that approximate MVM by the IFGT [35] and KD-trees [27], greedy forward selection algorithms only involve standard linear algebra and do not require specifying any critical parameters, as is needed for the IFGT [35]. Moreover, the approximation quality of MVM degrades when we are confronted with high-dimensional problems, even though more sophisticated improved algorithms have been proposed [22, 3].
As mentioned above, the crucial step of greedy forward algorithms is to select a good index set I_p based on some criterion. In other words, the problem is how to find representative basis vectors among the original training examples. A number of basis vector selection schemes have been proposed [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]. In the next section, we briefly summarise these algorithms and motivate our new gradient-based algorithm.
3. Basis Vector Selection Algorithms
Clearly, choosing p basis vectors out of n possible choices involves a combinatorial search over a space of C(n, p) subsets and is an NP-hard problem [20]. So we have to resort to near-optimal search schemes, like the greedy forward selection algorithms mentioned above, to ensure computational efficiency. This section reviews some principled basis vector selection schemes and analyses their computational complexity. For any greedy forward selection approach, the associated time complexity is composed of two parts, T_basic and T_selection, as defined in [15]. T_basic denotes the cost of updating the sparse GPR model given the index set I_p; this cost is the same for all forward selection algorithms. The other part, T_selection, refers to the cost incurred by the procedure of selecting basis vectors. In the following, for simplicity we always neglect the T_basic cost, and all time complexity figures refer to the T_selection cost. For convenience, we categorise the algorithms that have appeared in the literature into unsupervised (i.e., independent of the target information) and supervised types. Although some algorithms, such as [1, 2, 9, 19], were not proposed to deal directly with sparse GPR models, their ideas can easily be extended to select the set of basis vectors for GPR models.
Unsupervised methods
The simplest unsupervised method is random selection [29, 34], but several experimental studies [26, 15] have shown that this can produce poor results. All other unsupervised methods [2, 9, 7, 8] attempt to directly minimise the trace of the residual matrix tr(∆K_p) = tr(K − K̃) = tr(K − K_p Q_p^{-1} K_p^⊤). Let Q_{p-1} = L_{p-1} L_{p-1}^⊤ be the Cholesky decomposition and G_{p-1} = K_{p-1} L_{p-1}^{-⊤}. Let i_p be the index of the next added basis vector, and let

k_p = (K(x_i, x_{i_p}))_{i=1}^n,   q_p = (K(x̃_j, x_{i_p}))_{j=1}^{p-1},   q̃_p^* = K(x_{i_p}, x_{i_p}),   l_p = L_{p-1}^{-1} q_p.

We have

J_p = tr(∆K_p) = J_{p-1} − ||g_p||^2,    (1.21)

where

g_p = (k_p − G_{p-1} l_p) / √(q̃_p^* − l_p^⊤ l_p).    (1.22)
So computing the exact reduction ||g_p||^2 after including the i_p-th column is an O(np) operation [2]. If this were done for all the remaining columns at each iteration, it would lead to a prohibitive total complexity of O(n^2 p^2). Fine and Scheinberg [9] proposed a cheap implementation: since ||g_p||^2 is lower bounded by (g_p(i_p))^2 = q̃_p^* − l_p^⊤ l_p, which can be maintained recursively, they simply evaluate this bound, at negligible cost, to choose the p-th basis vector. Another cheap implementation of this idea is the on-line scheme of [7, 8].
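To make the criterion concrete, here is a toy NumPy/SciPy sketch that scores candidate columns by the exact reduction ||g_p||^2 of (1.21)-(1.22). It works on an explicit kernel matrix, assumes a non-empty current basis, and all names in it are ours.

```python
import numpy as np
from scipy.linalg import solve_triangular

def trace_reduction_scores(K, basis_idx, cand_idx):
    """Exact reduction ||g_p||^2 in tr(K - K~) from (1.21)-(1.22) for each
    candidate column, given the current basis indices; O(n p) per candidate."""
    Kp = K[:, basis_idx]                        # n x (p-1) kernel columns
    Q = K[np.ix_(basis_idx, basis_idx)]         # (p-1) x (p-1) basis kernel
    L = np.linalg.cholesky(Q + 1e-10 * np.eye(len(basis_idx)))
    G = solve_triangular(L, Kp.T, lower=True).T            # G_{p-1} = K_{p-1} L^{-T}
    scores = np.empty(len(cand_idx))
    for t, i in enumerate(cand_idx):
        kp, qp = K[:, i], K[basis_idx, i]
        lp = solve_triangular(L, qp, lower=True)
        denom = K[i, i] - lp @ lp                           # q~_p* - l_p^T l_p
        gp = (kp - G @ lp) / np.sqrt(max(denom, 1e-12))
        scores[t] = gp @ gp                                 # reduction in (1.21)
    return scores
```

The cheap bound of Fine and Scheinberg corresponds to scoring by `denom` alone instead of the full `gp @ gp`.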
Supervised methods
Since we are faced with a supervised learning task, it is natural to take the target information into account when approximating K. Building on the results of the unsupervised methods, Bach and Jordan [1] recently proposed an algorithm which selects a new basis vector based on a trade-off between the unsupervised term tr(K − K_p Q_p^{-1} K_p^⊤) and the training squared error term ||y − K_p α_p||^2. Combined with an efficient 'look-ahead' strategy, their selection scheme only incurs O(δnp) T_selection complexity if p basis vectors are selected, where δ is set to a small value. Dropping the unsupervised term, Nair et al. [19] developed a very cheap strategy to decrease the supervised term ||y − K_p α_p||^2, which is achieved by examining the current residual (r_p = y − K_p α_p) and searching for the entry with the largest absolute value.
Following the formulation (1.13) of the sparse GPR model, it would be preferable to choose the basis vector which leads to the largest reduction in the objective (1.17); this was first proposed in [28]. Let H_p = Id_n − K_p Σ_p K_p^⊤. Then E(α_p) can be computed recursively as [30]:

E_p = E_{p-1} − ∆E_1(i_p),    (1.23)

where

∆E_1(i_p) = (1/2) (g_p^⊤ H_{p-1} y)^2 / (σ^2 + g_p^⊤ H_{p-1} g_p).    (1.24)
Similar to the criterion (1.21), computing the reduction ∆E_1(j), j ∉ I_{p-1}, for all n + 1 − p previously unselected vectors until p basis vectors are accumulated is a prohibitive O(n^2 p^2) operation. Therefore, Smola and Bartlett [28] resorted to a sub-greedy scheme which considers only κ candidates randomly chosen from outside I_{p-1} during the p-th basis vector selection; they used a value of κ = 59. For this sub-greedy method, the complexity is reduced to O(κ n p^2). Alternatively, Sun and Yao [30] recently improved the original complexity O(n^2 p^2) to O(n^2 p) by recursively maintaining some quantities for all remaining vectors. Furthermore, they [30] suggest using only the numerator part of ∆E_1(i_p), i.e.,

∆E_2(i_p) = (1/2) (g_p^⊤ H_{p-1} y)^2 = (1/2) (g_p^⊤ r_{p-1})^2,    (1.25)

where r_{p-1} = H_{p-1} y = y − K_{p-1} α_{p-1}, as the criterion for scoring all remaining vectors; this produces almost the same prediction accuracy as the criterion (1.24). The advantage of this simplified version (1.25) is that the computational cost can be decreased to O(κ n p) when combined with the sub-greedy scheme, compared to the O(κ n p^2) cost incurred by the sub-greedy method of [28].
Another scoring criterion, also based on optimising the objective (1.13), is the matching pursuit approach [15], which was motivated by [33]. Instead of minimising (1.13) over all of the entries of α_p, as in the case of (1.24), they adjust only the last entry of α_p to optimise (1.13). The resulting selection criterion is [15]

∆E_3(i_p) = (1/2) [k_p^⊤ r_{p-1} − σ^2 q_p^⊤ α_{p-1}]^2 / (σ^2 q_p^* + k_p^⊤ k_p).    (1.26)
The computational cost of using (1.26) to score one basis vector is O(n) time,
which is similar to the criterion (1.25). The empirical study conducted in [30]
showed that (1.26) is always inferior to (1.25) in generalisation performance,
especially on large-scale datasets.
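A plain NumPy/SciPy sketch of the two criteria (1.25) and (1.26) over explicit kernel columns is shown below; the recursive O(n) bookkeeping of [30] is deliberately omitted for clarity, and the function and argument names are ours.

```python
import numpy as np
from scipy.linalg import solve_triangular

def score_candidates(K, y, basis_idx, cand_idx, alpha_prev, r_prev,
                     L_prev, G_prev, sigma2):
    """Score candidate basis vectors by Delta E_2 (1.25) and Delta E_3 (1.26),
    given the current Cholesky factor L_{p-1} of Q_{p-1}, G_{p-1} = K_{p-1} L^{-T},
    the sparse weights alpha_{p-1} and the residual r_{p-1}."""
    e2 = np.empty(len(cand_idx))
    e3 = np.empty(len(cand_idx))
    for t, i in enumerate(cand_idx):
        kp, qp, qps = K[:, i], K[basis_idx, i], K[i, i]
        # Delta E_3 (1.26): only the last coefficient is adjusted.
        num = kp @ r_prev - sigma2 * (qp @ alpha_prev)
        e3[t] = 0.5 * num ** 2 / (sigma2 * qps + kp @ kp)
        # Delta E_2 (1.25): correlation of the orthogonalised column g_p with the residual.
        lp = solve_triangular(L_prev, qp, lower=True)
        gp = (kp - G_prev @ lp) / np.sqrt(max(qps - lp @ lp, 1e-12))
        e2[t] = 0.5 * (gp @ r_prev) ** 2
    return e2, e3
```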
The last supervised method we introduce here is the so-called 'Info-gain' approach [26]. Let Q_p(f|y) denote the posterior probability of f given the approximate GP prior Q_p(f), analogous to (1.4). Info-gain scores the "informativeness" of a basis vector by the Kullback-Leibler distance between Q_p(f|y) and Q_{p-1}(f|y), i.e. KL[Q_p || Q_{p-1}]. Under some assumptions, this criterion can be simplified to a very cheap approach of only O(1) cost for evaluating one basis vector. But sometimes Info-gain leads to very poor results, as reported in [15] and as also shown in our experiments.
Across the algorithms discussed above, we can note that, at the p-th iteration, all of them try to select a new basis vector from the remaining (n − p + 1) columns of K. If the dataset is very large, the computational cost of scoring (n − p + 1) candidates is prohibitive for some of the previous selection criteria. The interesting question here is: why do we have to select from a huge pool of vectors, and why not construct a basis vector instead? This is the starting point of our work.
4. A Gradient-based Forward Greedy Algorithm
The key idea is to construct, rather than select, a basis vector at each iteration. This is motivated by the well-known gradient boosting framework [10]. Before proceeding to our new algorithm, we briefly describe what boosting is. The basic idea behind boosting is that, rather than using just a single learner for prediction, a linear combination of T base learners

F(x) = Σ_{t=1}^T β_t h_t(x)    (1.27)

is used [17]. Here each h_t(x) is a base learner (e.g. a decision tree) and β_t is its coefficient in the linear combination. Following the pioneering work by Friedman [10], the boosting procedure can be generally viewed as a gradient-based incremental search for a good additive model. This is done by searching, at each iteration, for the base learner which gives the "steepest descent" in the loss, denoted by L(y, f). The essential steps of a boosting procedure can be summarised as follows:
1  F_0(x) = 0;
2  For t = 1 : T do:
   (a) (β_t, h_t(x)) = argmin_{β^*, h(x)} Σ_{i=1}^n L(y_i, F_{t-1}(x_i) + β^* h(x_i))
   (b) F_t(x) = F_{t-1}(x) + β_t h_t(x)
3  EndFor
4  F(x) = F_T(x) = Σ_{t=1}^T β_t h_t(x).
If the loss L(y, f) is replaced by different kinds of loss functions, a family of boosting algorithms can be produced. The most prominent example is AdaBoost [25], which employs the exponential loss function

L(y_i, f(x_i)) = exp{−y_i f(x_i)},   with   y_i ∈ {−1, +1}.    (1.28)
Let us go back to the sparse GPR approach, which aims to find a sparse representation of the regression model of the form

f_p(x) = Σ_{j=1}^p α_p(j) K(x̃_j, x),    (1.29)

where α_p(j) is the j-th entry of α_p. If we conceptually regard each term K(x̃_j, x), j = 1, ..., p, involved in (1.29) as a base learner, all of the greedy forward selection algorithms summarised in Section 3 are equivalent to the above
boosting procedure. The only difference is that greedy forward selection algorithms select a new base learner at each iteration, whereas boosting constructs one by gradient-based search. This ultimately motivates us to propose the following new approach for sparse GPR.
We formulate the problem of building a sparse GPR model as a boosting procedure. First, the loss L(y, f) is replaced by the objective (1.13). Then, at each iteration, we construct the 'base learner' K(x̃_p, x) by optimising (1.13) w.r.t. the parameters x̃_p, and its coefficient α_p^* is changed accordingly. In detail, this can be described by the following optimisation problem:

E(α_p^*, x̃_p) = (1/2) ||y − K_{p-1} α_{p-1} − α_p^* k_p(x̃_p)||^2
  + (σ^2/2) [α_{p-1}; α_p^*]^⊤ [Q_{p-1}, q_p(x̃_p); q_p(x̃_p)^⊤, q_p^*(x̃_p)] [α_{p-1}; α_p^*].    (1.30)
In order to emphasise that k_p, q_p and q_p^* depend on x̃_p, we have written them as functions of x̃_p in (1.30). For simplicity we sometimes still neglect this explicit dependence. It is easy to show that

E(α_p^*, x̃_p) = E_{p-1} + (1/2) (α_p^*)^2 (σ^2 q_p^* + k_p^⊤ k_p) + α_p^* (σ^2 q_p^⊤ α_{p-1} − k_p^⊤ r_{p-1}),    (1.31)

which is to be minimised over α_p^* ∈ R and x̃_p ∈ R^m.
Since the condition for optimality of α_p^* is

∂E(α_p^*, x̃_p)/∂α_p^* = α_p^* (σ^2 q_p^* + k_p^⊤ k_p) + [σ^2 q_p^⊤ α_{p-1} − k_p^⊤ r_{p-1}] = 0,    (1.32)

we get

α_p^* = (k_p^⊤ r_{p-1} − σ^2 q_p^⊤ α_{p-1}) / (σ^2 q_p^* + k_p^⊤ k_p).    (1.33)
Substituting α_p^* in (1.31) with (1.33), the problem (1.30) can be equivalently written as

min_{x̃_p ∈ R^m} E(x̃_p) = E_{p-1} − (1/2) [k_p(x̃_p)^⊤ r_{p-1} − σ^2 q_p(x̃_p)^⊤ α_{p-1}]^2 / (σ^2 q_p^*(x̃_p) + k_p(x̃_p)^⊤ k_p(x̃_p)).    (1.34)
In fact, the objective function (1.34) we derived is the same as the criterion (1.26). The only difference is that we do not pick a training example as the candidate for the next basis vector. The derivative of (1.34) w.r.t. x̃_p(l), l = 1, ..., m, is easily obtained:

p = 1:   ∂E(x̃_p)/∂x̃_p(l) = −(1/2) α_p^* [2 k̇_p^⊤ r_{p-1} − α_p^* (σ^2 q̇_p^* + 2 k_p^⊤ k̇_p)],

p > 1:   ∂E(x̃_p)/∂x̃_p(l) = −(1/2) α_p^* [2 (k̇_p^⊤ r_{p-1} − σ^2 q̇_p^⊤ α_{p-1}) − α_p^* (σ^2 q̇_p^* + 2 k_p^⊤ k̇_p)],

where

k̇_p = ∂k_p(x̃_p)/∂x̃_p(l),   q̇_p = ∂q_p(x̃_p)/∂x̃_p(l),   q̇_p^* = ∂q_p^*(x̃_p)/∂x̃_p(l).    (1.35)
So, any gradient-based optimisation algorithm can be used to construct the base learner K(x̃_p, x) and thus the new basis vector x̃_p. Note that it costs only O(n) time to compute E(x̃_p) and the corresponding gradient information if the dimension m ≪ n and the number of selected basis vectors p ≪ n. Therefore our algorithm is applicable to large-scale datasets, just like (1.26). From a complexity viewpoint, the proposed method is the same as the criteria (1.25) and (1.26), but our approach additionally requires computing the gradient information (1.35), which makes it slightly slower than the other approaches. The updating of the related quantities after x̃_p is constructed is detailed in the Appendix.
In our implementation, we employ the routine BFGS [4] as the gradient-based optimisation package. In the course of the numerical experiments, it was found that even with a small number of BFGS steps at each iteration we can obtain better results than those of other leading algorithms. In order to improve the performance of the proposed gradient-based algorithm even further, we use the following multiple initialisation strategy. At the beginning of each iteration, we randomly take 20 training examples as initial basis vectors and rank them by (1.34). The best one is used to initialise the routine BFGS. Moreover, we set the maximal allowed number of BFGS steps at each iteration to 39. Thus, the total number of evaluations of the objective function (1.34) is 59. The aim of this setting is to compare the performance of our work with the sub-greedy algorithms [28, 15, 30], which evaluate their corresponding selection criteria κ = 59 times at each iteration. The steps of the proposed gradient-based forward greedy algorithm can be summarised as follows:
For p = 1, ..., p_max (the maximal number of basis vectors):
1  Randomly take 20 training examples from {x_i}_{i=1}^n and score them by (1.34); pick the highest-scoring one, denoted x̃_p^0;
2  Using x̃_p^0 as the initial value, run the routine BFGS; the output x̃_p is the p-th constructed basis vector;
3  Update I_{p-1}, K_{p-1}, Q_{p-1}, G_{p-1}, L_{p-1}, α_{p-1}, µ_{p-1}, r_{p-1} and other related quantities (see Appendix for details);
End For
Outputs: {x̃_j}_{j=1}^p, α_p, Q_p and Σ_p.
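To make the whole loop concrete, the sketch below implements one construction step under the covariance (1.2), using SciPy's L-BFGS-B routine of [4] in place of the chapter's BFGS package. All function and variable names, the handling of the first step (p = 1) and the default settings are our own choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def construct_basis_vector(X, theta0, theta_ard, theta_b, sigma2,
                           Xb_prev, alpha_prev, r_prev,
                           n_init=20, max_bfgs_steps=39, rng=None):
    """One forward step: construct x~_p by minimising E(x~_p) in (1.34),
    starting from the best of `n_init` randomly chosen training inputs."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = X.shape

    def kernel_cols(Z, x):
        # K(z_i, x) for all rows z_i of Z, covariance (1.2) with theta_b in the exponent.
        sq = np.einsum('il,l->i', (Z - x) ** 2, theta_ard)
        return theta0 * np.exp(-0.5 * sq + theta_b)

    def objective_and_grad(x):
        kp = kernel_cols(X, x)                         # k_p(x~_p), length n
        qps = theta0 * np.exp(theta_b)                 # q_p* = K(x~_p, x~_p)
        if Xb_prev is None or len(Xb_prev) == 0:
            qp, qa = np.zeros(0), 0.0
        else:
            qp = kernel_cols(Xb_prev, x)
            qa = qp @ alpha_prev
        num = kp @ r_prev - sigma2 * qa                # numerator of (1.34)
        den = sigma2 * qps + kp @ kp
        obj = -0.5 * num ** 2 / den                    # E(x~_p) minus the constant E_{p-1}
        a = num / den                                  # alpha_p* from (1.33)
        # Gradients (1.35): dk_p/dx(l) = theta_l * k_p * (X(:,l) - x(l)),  dq_p*/dx = 0.
        dkp = theta_ard[None, :] * kp[:, None] * (X - x)           # (n, m)
        dnum = dkp.T @ r_prev
        dden = 2.0 * (dkp.T @ kp)
        if len(qp) > 0:
            dqp = theta_ard[None, :] * qp[:, None] * (Xb_prev - x)  # (p-1, m)
            dnum -= sigma2 * (dqp.T @ alpha_prev)
        grad = -0.5 * a * (2.0 * dnum - a * dden)
        return obj, grad

    # Multiple initialisation: score n_init random training inputs by (1.34), keep the best.
    cand = X[rng.choice(n, size=min(n_init, n), replace=False)]
    x0 = min(cand, key=lambda x: objective_and_grad(x)[0])

    res = minimize(objective_and_grad, x0, jac=True, method='L-BFGS-B',
                   options={'maxiter': max_bfgs_steps})
    return res.x
```

After each call, the quantities L_p, M_p, G_p, α_p and µ_p would be updated by the recursions of Appendix B before the next basis vector is constructed.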
Finally, it is worth emphasising that the proposed gradient-based approach to sparse GPR with the objective (1.13) can be straightforwardly extended to other types of objective functions, which correspond to different kinds of kernel machines. For example, the following two objectives E_KLR and E_SVM correspond to Kernel Logistic Regression (KLR) [37] and Support Vector Machines (SVM) [6], respectively:

E_KLR = (1/n) Σ_{i=1}^n ln(1 + exp{−y_i f_p(x_i)}) + (σ^2/2) α_p^⊤ Q_p α_p    (1.36)

and

E_SVM = (1/n) Σ_{i=1}^n max(0, 1 − y_i f_p(x_i))^2 + (σ^2/2) α_p^⊤ Q_p α_p,    (1.37)

where f_p(x) is defined in (1.29). Similar to sparse GPR, the expected training algorithms for both KLR and SVM scale linearly in the number of training cases and should be much faster and more accurate than existing selection-based approaches.
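For illustration, the two objectives can be evaluated as follows; this is a hedged NumPy sketch with our own function names, not the chapter's implementation.

```python
import numpy as np

def klr_objective(Kp, Qp, alpha_p, y, sigma2):
    """E_KLR in (1.36): mean logistic loss on f_p(x_i) = (Kp alpha_p)_i plus the
    (sigma^2/2) alpha_p^T Qp alpha_p regulariser; labels y in {-1, +1}."""
    f = Kp @ alpha_p
    return np.mean(np.logaddexp(0.0, -y * f)) + 0.5 * sigma2 * alpha_p @ (Qp @ alpha_p)

def svm_objective(Kp, Qp, alpha_p, y, sigma2):
    """E_SVM in (1.37): mean squared hinge loss plus the same regulariser."""
    f = Kp @ alpha_p
    return np.mean(np.maximum(0.0, 1.0 - y * f) ** 2) + 0.5 * sigma2 * alpha_p @ (Qp @ alpha_p)
```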
5. Numerical experiments
In this section, we compare our gradient-based forward greedy algorithm against other leading sparse GPR algorithms induced by different basis selection criteria on four datasets. For simplicity we refer to the compared algorithms by the name of their first author: Williams [34], Fine [9], Nair [19], Seeger [26], Baudat [2], Bach [1], Smola [28], Keerthi [15] and Sun [30]. The first four employ very cheap basis selection criteria and have negligible T_selection cost. The Baudat method is a special case of Bach (Note 2) when the trade-off parameter is set to zero, i.e., only the unsupervised term is considered. To reduce the complexity of the Baudat criterion, we also apply the 'look-ahead' strategy [1] to speed up its computation. Thus, both of them have the same T_selection complexity of O(δnp). We do not run the Smola method in our experiments for two reasons: (1) it has been shown empirically to generate almost the same results as Sun [30]; (2) it leads to an O(κnp^2) T_selection complexity, which is much higher than the other approaches. The Keerthi and Sun methods, induced by (1.26) and (1.25) respectively, employ the same sub-greedy strategy and incur O(κnp) T_selection complexity. In our implementation, we set δ = 59 and κ = 59 to ensure the same selection complexity, matching the setting of our gradient-based algorithm described above.
The algorithms presented in this section were coded in Matlab 7.0 and all the numerical experiments were conducted on a machine with a Pentium IV 2 GHz processor and 512 MB of memory. For all experiments, the squared-exponential kernel (1.2) was used. The hyperparameters involved are estimated via a full GPR model on a subset of 1000 examples randomly selected from the original dataset (Note 3); these tasks were accomplished with the GP routines of the well-known NETLAB software (Note 4). To evaluate generalisation performance, we use the mean squared error (MSE) and the negative logarithm of the predictive distribution (NLPD), defined as

MSE = (1/t) Σ_{i=1}^t (y_i − µ_i)^2,    (1.38)

NLPD = (1/t) Σ_{i=1}^t − log P(y_i | µ_i, σ_i^2),    (1.39)

where t is the number of test examples, y_i is the test target, and µ_i and σ_i^2 are the predictive mean and variance, respectively. Sometimes the normalised MSE (NMSE), given by NMSE = MSE/var(y), is used for convenience, where var(y) is the variance of the training targets. Note that NLPD measures the quality of the predictive distributions, as it penalises over-confident predictions as well as under-confident ones. The four employed datasets are Boston Housing, Kin-32nm, LogP and KIN40K (Note 5). Finally, we select some of the leading approaches in terms of generalisation performance on all four datasets and compare their scaling behaviour on a set of datasets generated from KIN40K.
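The three evaluation measures are straightforward to compute; a small NumPy sketch (with our own function names) follows.

```python
import numpy as np

def mse(y, mu):
    """Mean squared error (1.38) over t test cases."""
    return np.mean((y - mu) ** 2)

def nmse(y, mu, y_train):
    """Normalised MSE: MSE divided by the variance of the training targets."""
    return mse(y, mu) / np.var(y_train)

def nlpd(y, mu, var):
    """Negative log predictive density (1.39) under Gaussian predictions N(mu_i, var_i)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y - mu) ** 2 / var)
```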
A. Boston Housing Dataset
This popular regression dataset comprises 506 examples with 14 variables; the task is to predict the median value of owner-occupied homes based on the other 13 variables. The results were averaged over 100 repetitions, where the dataset was randomly partitioned into 481/25 training/testing splits, a common setting in the literature [19]. Table 1.1 summarises the test performance of the nine methods, along with the standard deviation, for p = 100 and p = 200, respectively.
From Table 1.1, it can be noted that, for both p = 100 and p = 200, our basis-constructing method almost always achieves the best results on both MSE and NLPD, although the advantage is not significant, especially when more basis vectors are picked. Where it is not the best, it still ranks second among all nine methods. In addition, the performance of the three unsupervised basis selection methods, marked by the superscript †, appears systematically worse than that of the six supervised methods when few basis vectors are selected. But when nearly half of the training examples are chosen, all of these methods generate very similar MSE results.
B. Kin-32nm Dataset
Table 1.1. Test results of nine sparse GPR algorithms on the Boston Housing dataset for p = 100 and p = 200, respectively. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 100 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method            MSE (p=100)   NLPD (p=100)   MSE (p=200)   NLPD (p=200)
Williams† [34]    9.97±6.58     2.73±0.44      6.98±4.01     2.66±0.57
Fine† [9]         8.22±3.97     2.53±0.29      6.83±2.83     2.48±0.38
Nair [19]         6.83±2.72     2.50±0.28      6.28±2.70     2.56±0.47
Seeger [26]       7.32±3.21     2.54±0.20      6.35±2.63     2.45±0.37
Baudat† [2]       8.15±4.27     2.48±0.29      6.56±2.68     2.52±0.43
Bach [1]          7.52±3.19     2.54±0.24      6.56±2.66     2.54±0.45
Keerthi [15]      7.08±2.92     2.44±0.24      6.38±2.54     2.48±0.40
Sun [30]          6.64±2.82     2.46±0.30      6.28±2.55     2.55±0.45
Ours              6.43±2.67     2.46±0.09      6.26±2.58     2.36±0.13
The Kin-32nm dataset is one of the eight kin-family datasets, which are synthetically generated from a realistic simulation of the forward kinematics of an 8-link all-revolute robot arm. The data comprise 8192 examples with 32 input dimensions, and the aim is to predict the distance of the end-effector from a target given the angular positions of the joints, the link twist angles, the link lengths and the link offset distances. We randomly split the mother data into 4000 training and 4192 testing examples and produce 20 such repetitions. Again, we apply the nine methods to this high-dimensional problem. The results on the Kin-32nm dataset are reported in Table 1.2.
According to Table 1.2, our proposed algorithm always ranks first by a significant margin; we believe that, in such a high-dimensional case, our flexible gradient-based approach can discover more representative basis vectors than selection-based algorithms. Moreover, the two other algorithms based on directly optimising the objective (1.13), Keerthi and Sun, also perform clearly better than the remaining methods. Again, we observe that the supervised basis selection methods are consistently superior to the unsupervised ones.
C. LogP Dataset
LogP is a popular benchmark problem in Quantitative Structure-Activity Relationships (QSAR). The data split we used is the same as that in [32]: of the 6912 examples, 691 (10%) were used for testing and the remaining 6221 for training (Note 6). Since the Matlab source code of the Bach method (and therefore Baudat) provided by the authors involves the computation and storage of the full kernel matrix, it cannot handle such a large dataset on our PC. Therefore, we remove these two methods from the following comparative study.
Table 1.2. Test results of nine sparse GPR algorithms on the Kin-32nm dataset for p = 100 and p = 200, respectively. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 20 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method       NMSE (p=100)   NLPD (p=100)   NMSE (p=200)   NLPD (p=200)
Williams†    0.634±0.015    0.501±0.017    0.594±0.011    0.541±0.012
Fine†        0.645±0.017    0.480±0.016    0.602±0.013    0.502±0.013
Nair         0.609±0.015    0.470±0.015    0.583±0.013    0.523±0.015
Seeger       0.610±0.017    0.470±0.017    0.584±0.013    0.524±0.015
Baudat†      0.643±0.022    0.490±0.020    0.599±0.014    0.511±0.013
Bach         0.606±0.013    0.450±0.011    0.588±0.011    0.512±0.009
Keerthi      0.588±0.012    0.441±0.008    0.575±0.012    0.506±0.012
Sun          0.587±0.012    0.441±0.010    0.575±0.011    0.513±0.011
Ours         0.569±0.011    0.384±0.007    0.553±0.015    0.396±0.015
Table 1.3. Test results of seven sparse GPR algorithms on the LogP dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. The best method is highlighted in bold and the second best in italic.

Method      MSE (p=100)  NLPD (p=100)  MSE (p=200)  NLPD (p=200)  MSE (p=300)  NLPD (p=300)
Williams†   0.615        5.50          0.571        9.04          0.571         9.04
Fine†       0.745        1.26          0.643        1.30          0.557         1.58
Nair        0.650        2.20          0.527        7.99          0.497        11.63
Seeger      0.673        1.75          0.547        2.57          0.516         3.83
Keerthi     0.577        1.79          0.550        2.89          0.526         4.463
Sun         0.544        3.91          0.523        7.75          0.518        11.43
Ours        0.528        1.13          0.521        1.08          0.509         1.06
Table 1.3 reports the performance of the seven methods on the LogP data as the number of selected or constructed basis vectors is increased from 100 to 300. It can be seen from the results that our method clearly outperforms the other six methods, especially on NLPD. Although the Nair method gets a slightly better result on MSE when p = 300, it produces a very poor result on NLPD at the same time. It should be emphasised that our prediction accuracy is much better than the results reported in [32], where the best achievable MSE was 0.601.
D. KIN40K Dataset
Table 1.4. Test results of seven sparse GPR algorithms on the KIN40K dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 10 repetitions, along with the standard deviation. The best method is highlighted in bold and the second best in italic.

Method      NMSE (p=100)  NLPD (p=100)   NMSE (p=300)  NLPD (p=300)   NMSE (p=500)  NLPD (p=500)
Williams†   0.235±0.014   -0.606±0.018   0.093±0.005   -1.060±0.016   0.060±0.001   -1.304±0.008
Fine†       0.227±0.012   -0.508±0.008   0.100±0.006   -0.910±0.010   0.064±0.003   -1.150±0.011
Nair        0.208±0.015   -0.424±0.027   0.080±0.003   -0.805±0.022   0.050±0.001   -1.042±0.016
Seeger      0.302±0.029   -0.282±0.056   0.130±0.020   -0.575±0.103   0.068±0.006   -0.820±0.099
Keerthi     0.139±0.005   -0.731±0.007   0.060±0.002   -1.143±0.005   0.041±0.001   -1.366±0.006
Sun         0.127±0.004   -0.751±0.005   0.057±0.001   -1.173±0.006   0.039±0.001   -1.400±0.007
Ours        0.088±0.003   -0.767±0.004   0.042±0.001   -1.060±0.004   0.029±0.001   -1.223±0.006
The KIN40K dataset is the largest one in our experiments. It is a variant of the kin family of datasets from the DELVE archive and is composed of 40,000 examples with 8 inputs. As the author of this data stated (Note 7), KIN40K was generated with maximum nonlinearity and little noise, giving a very difficult regression task. We randomly selected 10,000 examples for training and kept the remaining 30,000 examples as test cases. The results on 10 random partitions, reported in Table 1.4, show that the last three methods have a general advantage over the other four approaches under both NMSE and NLPD. Our method always achieves the best result on NMSE but is slightly worse than the best on NLPD. Note that the Seeger method is even worse than the random selection (Williams) method, which has already been observed in other work [15].
According to the results above, the four methods Nair, Keerthi, Sun and Ours often produce the better generalisation performance on test MSE (or NMSE). We now further compare these representative approaches in terms of scaling behaviour on a set of datasets generated from the KIN40K data. Figure 1.1 shows the computational time of the four methods for varying training dataset sizes. Note that the maximal number of selected basis vectors is fixed at p = 500. As expected, all of them scale linearly in the number of training examples. Nair is the fastest of the four methods since it only requires O(1) time for scoring one basis vector at each selection step, and the same holds for the Williams, Fine and Seeger approaches, although we did not plot them in the figure. In contrast to Nair's O(1) cost, the other three leading algorithms, Keerthi, Sun and Ours, need O(n) time to evaluate their corresponding criteria for one candidate. Furthermore, compared with Keerthi and Sun, our gradient-based search needs extra time to evaluate the gradient information, and this is responsible for the time gap between Ours and Keerthi shown in Figure 1.1.
E. Discussion
Figure 1.1. Comparison of the training time (log scale) required by four leading approaches as a function of the size of the training dataset (1000 to 10000 examples). The maximal number of selected basis vectors is fixed to p = 500. From bottom to top, they are Nair (square), Sun (circle), Keerthi (pentagram) and Ours (diamond).
To our knowledge, this is the first formal comparison of the various basis vector selection algorithms that have appeared in the literature. Based on our experimental studies, we can draw the following empirical summary. The supervised basis selection methods are clearly better than the unsupervised methods on almost all of the four datasets. Between Nair and Seeger, the two supervised basis selection methods with very low selection cost, Nair appears superior to Seeger on test MSE (or NMSE). The last three approaches, Keerthi, Sun and Ours, which are all based on optimising the original GPR objective (1.13), produce more stable results than the other sparse GPR methods on all datasets considered. On the large dataset, the Keerthi method seems inferior to the Sun method. Finally, the construction-based forward algorithm proposed in this chapter is more attractive than all of the selection-based forward algorithms on both the test NMSE and NLPD measures if generalisation performance is a major concern.
6. Conclusions
Basis vector selection is very important in building a sparse GPR model, and a number of selection schemes based on various criteria have been proposed. In this chapter, we did not follow the previous idea of selecting basis vectors from the training examples. Instead, we borrowed an idea from gradient boosting and proposed to construct basis vectors one by one through gradient-based optimisation. The proposed approach is quite simple to implement, and excellent results on a range of datasets have been obtained. In the near future, we will analyse why the presented algorithm was not the best in some of the cases reported here and evaluate it on more and larger problems. Another important extension is to apply this idea to classification problems [37, 6].
Appendix
A. Gradients of kp , qp and qp∗
If the squared-exponential function (1.2) is used as the kernel, the gradients of k_p, q_p and q_p^* are

k̇_p = ∂k_p(x̃_p)/∂x̃_p(l) = θ_l k_p .* [X(:, l) − x̃_p(l) 1_n],
q̇_p = ∂q_p(x̃_p)/∂x̃_p(l) = θ_l q_p .* [X̃_{p-1}(:, l) − x̃_p(l) 1_{p-1}],
q̇_p^* = ∂q_p^*(x̃_p)/∂x̃_p(l) = 0,

where X = [x_1 ... x_n]^⊤ ∈ R^{n×m} is the input matrix, X̃_{p-1} = [x̃_1 ... x̃_{p-1}]^⊤ ∈ R^{(p-1)×m} is the basis vector matrix, the notation '.*' denotes entry-by-entry multiplication, X(:, l) denotes the l-th column of X and similarly for X̃_{p-1}(:, l). Finally, 1_q denotes the all-one vector in R^q.
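As a quick sanity check of these formulas, the following NumPy sketch evaluates k_p and k̇_p under the covariance (1.2) and compares the analytic derivative with a forward finite difference; everything in it (names, the random test data, the step size) is our own illustration.

```python
import numpy as np

def kernel_column_and_grad(X, x, theta0, theta_ard, theta_b, l):
    """k_p(x~_p) = K(x_i, x~_p) over all training inputs, and its derivative
    with respect to the l-th entry of x~_p, following Appendix A."""
    sq = np.einsum('il,l->i', (X - x) ** 2, theta_ard)
    kp = theta0 * np.exp(-0.5 * sq + theta_b)
    dkp = theta_ard[l] * kp * (X[:, l] - x[l])        # theta_l * k_p .* (X(:,l) - x(l))
    return kp, dkp

# Forward finite-difference check of the analytic gradient.
rng = np.random.default_rng(0)
X, x = rng.normal(size=(5, 3)), rng.normal(size=3)
theta0, theta_ard, theta_b = 1.0, np.array([0.5, 1.0, 2.0]), 0.1
eps, l = 1e-6, 1
kp, dkp = kernel_column_and_grad(X, x, theta0, theta_ard, theta_b, l)
x_eps = x.copy(); x_eps[l] += eps
kp_eps, _ = kernel_column_and_grad(X, x_eps, theta0, theta_ard, theta_b, l)
print(np.max(np.abs((kp_eps - kp) / eps - dkp)))      # should be of order 1e-6 or smaller
```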
B. Inclusion of the constructed basis vector x̃p
In order to make a prediction for a new test case, we need to work out α_p, Q_p^{-1} and Σ_p, as can be seen from (1.19) and (1.20). Moreover, according to (1.34), our forward procedure of constructing basis vectors also requires the information of µ_p and r_p. Since directly computing Q_p^{-1} and Σ_p may run into numerical instability [12], we resort to the Cholesky decomposition. Let L_p be the Cholesky factor L_p L_p^⊤ = Q_p, let G_p = K_p L_p^{-⊤}, and let M_p be the factor of another Cholesky decomposition M_p M_p^⊤ = (G_p^⊤ G_p + σ^2 Id_p). Then we have

Q_p^{-1} = (L_p L_p^⊤)^{-1},
Σ_p = (K_p^⊤ K_p + σ^2 Q_p)^{-1} = (L_p M_p M_p^⊤ L_p^⊤)^{-1},

and further

α_p = Σ_p K_p^⊤ y = L_p^{-⊤} (M_p M_p^⊤)^{-1} G_p^⊤ y,
µ_p = K_p α_p,   r_p = y − µ_p.

Thus, the quantities L_p, M_p, G_p, α_p and µ_p need to be updated recursively. The involved steps can be summarised as follows:
k_p = [K(x_1, x̃_p), ..., K(x_n, x̃_p)]^⊤,
q_p = [K(x̃_1, x̃_p), ..., K(x̃_{p-1}, x̃_p)]^⊤,   q_p^* = K(x̃_p, x̃_p),
l_p = L_{p-1}^{-1} q_p,   l_p^* = √(q_p^* − l_p^⊤ l_p),
g_p = (k_p − G_{p-1} l_p) / l_p^*,
m_p = M_{p-1}^{-1} (G_{p-1}^⊤ g_p),   η = M_{p-1}^{-⊤} m_p,
d_p = g_p − G_{p-1} η,
b = d_p^⊤ y,   c = d_p^⊤ g_p,
m_p^* = √(σ^2 + c),   a = b / (l_p^* (σ^2 + c)),
α_p = [α_{p-1} − a L_{p-1}^{-⊤} (l_p + l_p^* η); a],
µ_p = µ_{p-1} + b d_p / (σ^2 + c),

and finally

L_p = [L_{p-1}, 0; l_p^⊤, l_p^*],   M_p = [M_{p-1}, 0; m_p^⊤, m_p^*],   G_p = [G_{p-1}  g_p].

Since the matrices L_p and M_p are lower-triangular, multiplying their inverses by a vector can be computed very efficiently.
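The recursion above translates almost line by line into code. The NumPy/SciPy sketch below performs one inclusion step; it is our own transcription of the listed formulas (the names and the p = 1 initialisation are ours), and its output can be checked against the direct formula α_p = (K_p^⊤ K_p + σ^2 Q_p)^{-1} K_p^⊤ y.

```python
import numpy as np
from scipy.linalg import solve_triangular

def include_basis_vector(kp, qp, qps, y, sigma2, state=None):
    """One incremental update from Appendix B.  `state` carries (L, M, G, alpha, mu);
    kp = K(x_i, x~_p) over training inputs, qp = K(x~_j, x~_p) for the old basis,
    qps = K(x~_p, x~_p)."""
    if state is None:                        # p = 1: initialise everything from scratch
        lps = np.sqrt(qps)
        gp = kp / lps
        mps = np.sqrt(sigma2 + gp @ gp)
        a = (gp @ y) / (lps * mps ** 2)
        L, M = np.array([[lps]]), np.array([[mps]])
        return L, M, gp[:, None], np.array([a]), (gp @ y) / mps ** 2 * gp
    L, M, G, alpha, mu = state
    lp = solve_triangular(L, qp, lower=True)                 # L_{p-1}^{-1} q_p
    lps = np.sqrt(qps - lp @ lp)
    gp = (kp - G @ lp) / lps
    mp = solve_triangular(M, G.T @ gp, lower=True)           # M_{p-1}^{-1} G^T g_p
    eta = solve_triangular(M.T, mp, lower=False)             # M_{p-1}^{-T} m_p
    dp = gp - G @ eta
    b, c = dp @ y, dp @ gp
    mps = np.sqrt(sigma2 + c)
    a = b / (lps * (sigma2 + c))
    alpha_new = np.append(alpha - a * solve_triangular(L.T, lp + lps * eta, lower=False), a)
    mu_new = mu + b * dp / (sigma2 + c)
    L_new = np.block([[L, np.zeros((L.shape[0], 1))], [lp[None, :], np.array([[lps]])]])
    M_new = np.block([[M, np.zeros((M.shape[0], 1))], [mp[None, :], np.array([[mps]])]])
    G_new = np.hstack([G, gp[:, None]])
    return L_new, M_new, G_new, alpha_new, mu_new
```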
Notes
1. Since each training case is responsible for one column in the full kernel matrix K, we sometimes also refer to those corresponding columns of K as basis vectors.
2. The Matlab source code can be accessed via http://cmm.ensmp.fr/~bach/csi/index.html.
3. Since the first employed dataset only includes 506 examples, we randomly pick 400 points to do model selection.
4. It is available at http://www.ncrg.aston.ac.uk/netlab/index.php.
5. The Boston Housing data can be found in StatLib at http://lib.stat.cmu.edu/datasets/boston; Kin-32nm and its full description can be accessed at http://www.cs.toronto.edu/~delve/data/datasets.html; the LogP data can be requested from Dr Peter Tino (pxt@cs.bham.ac.uk); the KIN40K dataset is available at http://ida.first.fraunhofer.de/~anton/data.html.
6. The validation data is not necessary in our case since we employ the evidence framework to select hyperparameters in NETLAB.
7. See http://ida.first.fraunhofer.de/~anton/data.html.
References
[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel
methods. In Proceedings of 22th International Conference on Machine
Learning (ICML 2005), pages 33–40, 2005.
[2] G. Baudat and F. Anouar. Kernel-based methods and function approximation. In Proceedings of 2001 International Joint Conference on Neural
Networks (IJCNN 2001), pages 1244–1249, 2001.
[3] A. Beygelzimer, S. M. Kakade, and J. Langford. Cover trees for nearest
neighbor. submitted, 2005.
[4] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound
constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.
[5] J. Quinonero Candela and C. E. Rasmussen. A unifying view of sparse
approximate Gaussian process regression. Journal of Machine Learning
Research, 6:1935–1959, 2005.
[6] O. Chapelle. Training a support vector machine in the primal. Journal of
Machine Learning Research, 2006. submitted.
[7] L. Csato and M. Opper. Sparse On-line Gaussian Processes. Neural Computation, 14(3):641–668, 2002.
[8] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares
algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285,
2004.
[9] S. Fine and K. Scheinberg. Efficient SVM Training Using Low-rank Kernel Representations. Journal of Machine Learning Research, 2:243–264,
2002.
[10] J. H. Friedman. Greedy function approximation: A gradient boosting
machine. Annals of Statistics, 29(5):1189–1232, 2001.
[11] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In KDD-2001: Knowledge Discovery and Data Mining, pages 77–
86, San Francisco, CA, 2001.
[12] G. H. Golub and C. V. Loan. Matrix Computations. Johns Hopkins Univ.
Press, 1996.
[13] A. G. Gray. Fast kernel matrix-vector multiplication with application to
Gaussian process learning. Technical report, School of Computer Science,
Carnegie Mellon University, 2004.
[14] A. G. Gray and A. W. Moore. ‘N-body’ problems in statistical learning.
In Advances in Neural Information Processing Systems 13, pages 521–
527. MIT Press, 2000.
[15] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse
Gaussian process regression. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[16] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop,
editor, Neural Networks and Machine Learning, pages 133–165. Springer,
Berlin, 1998.
[17] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In
Advanced Lectures on Machine Learning (LNAI2600), pages 118–183,
2003.
[18] S. Mika, A.J. Smola, and B. Schölkopf. An improved training algorithm
for kernel fisher discriminants. In Eighth International Workshop on Artificial Intelligence and Statistics, pages 98–104, Key West, Florida, 2001.
[19] P. B. Nair, A. Choudhury, and A. J. Keane. Some greedy learning algorithms for sparse regression and classification with mercer kernels. Journal of Machine Learning Research, 3:781–801, 2002.
[20] B.K. Natarajan. Sparse approximate solutions to linear systems. SIAM
Journal of Computing, 25(2):227–234, 1995.
[21] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine
Learning. The MIT Press, 2006.
[22] V. C. Raykar, C. Yang, R. Duraiswami, and N. Gumerov. Fast Computation of Sums of Gaussians in High Dimensions. Technical report, UM
Computer Science Department, 2005.
[23] R. Rifkin. Everything Old Is New Again: A Fresh Look at Historical
Approaches in Machine Learning. PhD thesis, MIT, Cambridge, MA,
2002.
[24] C. Saunders, A. Gammerman, and V. Vovk. Ridge Regression Learning
Algorithm in Dual Variables. In Proceedings of 15th International Conference on Machine Learning (ICML 1998), pages 515–521, 1998.
[25] R.E. Schapire. A brief introduction to boosting. In T. Dean, editor,
Proceedings of the Sixteenth International Joint Conference on Artificial
Intelligence, pages 1401–1406, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[26] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence and Statistics, Key West,
Florida, 2003.
[27] Y. Shen, A. Ng, and M. Seeger. Fast Gaussian Process Regression Using
KD-Trees. In Advances in Neural Information Processing Systems 18.
MIT Press, 2006.
[28] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In
Advances in Neural Information Processing Systems 14, pages 619–625.
MIT Press, 2001.
[29] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for
machine learning. In Proceedings of 14th International Conference on
Machine Learning (ICML 2000), pages 911–918, 2000.
[30] P. Sun and X. Yao. Greedy forward selection algorithms to sparse
Gaussian Process Regression. In Proceedings of 2006 International Joint
Conference on Neural Networks (IJCNN 2006), 2006. to appear.
[31] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[32] P. Tino, I. Nabney, B.S. Williams, J. Losel, and Y. Sun. Non-linear Prediction of Quantitative Structure-Activity Relationships. Journal of Chemical
Information and Computer Sciences, 44(5):1647–1653, 2004.
[33] P. Vincent and Y. Bengio. Kernel Matching Pursuit. Machine Learning,
48(1-3):165–187, 2002.
[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel
machines. In Advances in Neural Information Processing Systems 14,
pages 682–688. MIT Press, 2001.
[35] C. Yang, R. Duraiswami, and L. Davis. Efficient Kernel Machines Using
the Improved Fast Gauss Transform. In Advances in Neural Information
Processing Systems 17, pages 1561–1568. MIT Press, 2005.
[36] T. Zhang. Approximation bounds for some sparse kernel regression algorithms. Neural Computation, 14:3013–3042, 2002.
[37] J. Zhu and T. Hastie. Kernel logistic regression and the import vector
machine. Journal of Computational & Graphical Statistics, 14(1):185–
205, 2005.