Chapter 1

A GRADIENT-BASED FORWARD GREEDY ALGORITHM FOR SPARSE GAUSSIAN PROCESS REGRESSION

Ping Sun, Xin Yao
CERCIA, School of Computer Science
University of Birmingham, Edgbaston Park Road
Birmingham, B15 2TT, UK
p.sun@cs.bham.ac.uk, x.yao@cs.bham.ac.uk

Abstract    In this chapter, we present a gradient-based forward greedy method for sparse approximation of the Bayesian Gaussian Process Regression (GPR) model. Unlike previous work, which is mostly based on various basis vector selection strategies, we propose to construct, rather than select, a new basis vector at each iterative step. This idea is motivated by the well-known gradient boosting approach. The resulting algorithm, built on gradient-based optimisation packages, incurs similar computational cost and memory requirements to other leading sparse GPR algorithms. Moreover, the proposed work is a general framework which can be extended to other popular kernel machines, including Kernel Logistic Regression (KLR) and Support Vector Machines (SVMs). Numerical experiments on a wide range of datasets are presented to demonstrate the superiority of our algorithm in terms of generalisation performance.

Keywords:   Gaussian process regression, sparse approximation, sequential forward greedy algorithm, basis vector selection, basis vector construction, gradient-based optimisation, gradient boosting

1. Introduction

Recently, Gaussian Processes (GP) [16] have become one of the most popular kernel machines in the machine learning community. Besides their simplicity in training and model selection, GP models also yield probabilistic predictions for test examples with excellent generalisation capability. However, original GP models cannot be applied to large datasets because of their high computational demands. Firstly, GP models require the computation and storage of the full-order kernel matrix K (also known as the covariance matrix) of size n × n, where n is the number of training examples. Secondly, the computational cost of training GP models is about O(n³). Thirdly, predicting a test case requires O(n) for evaluating the mean and O(n²) for computing the variance. In order to overcome these limitations, a number of approximation schemes have been proposed recently (see [21], chapter 8) to accelerate the computation of GP. Most of these approaches can be broadly classified into two main types: (i) greedy forward selection methods, which can also be viewed as iteratively approximating the full kernel matrix by a low-rank representation [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]; (ii) methods that approximate the matrix-vector multiplication (MVM) operations by the Fast Gauss Transform (FGT) [35] and, more generally, N-body approaches [14]. All of these algorithms achieve linear scalability in the number of training examples for both computational cost and memory requirement. In contrast to the MVM approximation, the method of approximating the kernel matrix is simpler to implement, since it does not require determining additional critical parameters [35]. In this chapter we follow the path of approximating the full kernel matrix and propose a forward greedy algorithm, different from previous work, for achieving a low-rank kernel representation. The main idea is to construct instead of select basis vectors, which was inspired by the well-known gradient boosting [10] framework. Here we focus only on regression problems; the work can be extended to classification tasks [37].
We now outline the contents of this chapter. In Section 2, we introduce GP regression (GPR) and briefly show how approximate GPR models are obtained in the current literature. In Section 3, we review some forward greedy algorithms for approximating the full GPR model and present our motivation. In Section 4, we detail our approach. Some experimental results are reported in Section 5. Finally, Section 6 concludes this chapter by presenting possible directions of future research.

2. Gaussian Process Regression

In regression problems, we are given training data composed of n examples, D = {(x1, y1), ..., (xn, yn)}, where xi ∈ R^m is the m-dimensional input and yi ∈ R is the corresponding target. It is common to assume that the outputs yi are generated by

    yi = f(xi) + εi,    (1.1)

where εi is a normal random variable with density P(εi) = N(εi|0, σ²) and f(x) is an unobservable latent function. The goal of the regression task is to estimate the function f(x), which is then used to predict the target y∗ for an unseen test case x∗.

Nomenclature

n : total number of training examples
m : dimension of the input
xi, X : input example i and X = [x1 ... xn]⊤ ∈ R^{n×m}
xi(l) : the l-th entry of the input xi
yi, y : target of xi and y = [y1, ..., yn]⊤ ∈ R^n
Idq, 1q : the identity matrix of size q × q and the all-one vector in R^q
K(xi, xj) : kernel function, also known as covariance function
θ0, θl, θb : hyperparameters of the kernel K(xi, xj)
K : training kernel matrix with (K)ij = K(xi, xj), i, j = 1, ..., n
σ² : variance of the noise
f(xi) : an unobservable latent function
f : vector of latent function values, i.e., f = [f(x1), ..., f(xn)]⊤
N(·|µ, Σ) : density of a Gaussian with mean µ and covariance Σ
P(·) : the probability density function
x∗, y∗ : test input and target
f∗ : latent function value at x∗
k∗, k∗∗ : (k∗)i = K(xi, x∗), i = 1, ..., n, and k∗∗ = K(x∗, x∗)
µ∗, σ∗² : the predictive mean and variance
α : weight parameter, α ∈ R^n
E(·) : the objective (error) function
p : iteration index, or number of selected (or constructed) basis vectors
ip : index of the p-th basis vector to be added
Ip : index set, Ip = {i1, ..., ip}
x̃j, X̃p : selected or constructed basis vector j and X̃p = [x̃1 ... x̃p]⊤
x̃j(l) : the l-th entry of the basis vector x̃j
Kp : kernel columns, (Kp)ij = K(xi, x̃j), i = 1, ..., n; j = 1, ..., p
kp : the p-th column of Kp
Qp : matrix induced by {x̃j}, j = 1, ..., p, with (Qp)ij = K(x̃i, x̃j)
qp*, qp : qp* is the p-th diagonal entry of Qp and qp is the p-th column of Qp excluding qp*
K̃ : approximate kernel matrix of K, K̃ = KpQp⁻¹Kp⊤
Qp(·) : probability density function conditioned on K̃ = KpQp⁻¹Kp⊤
µ̃∗, σ̃∗² : approximate predictive mean and variance
k̃∗ : (k̃∗)j = K(x̃j, x∗), j = 1, ..., p
αp : a sparse estimate of α, αp = (Kp⊤Kp + σ²Qp)⁻¹Kp⊤y
µp, rp : training mean µp = Kpαp and residual error rp = y − µp
Hp : the matrix Idn − KpΣpKp⊤
Lp : factor of the Cholesky decomposition Qp = LpLp⊤
Gp : the product KpLp⁻⊤
Mp : factor of the Cholesky decomposition Gp⊤Gp + σ²Idp = MpMp⊤

In the GPR framework, the underlying f(x) is assumed to be a zero-mean Gaussian process, which is a collection of random variables, any finite number of which have a joint Gaussian distribution [21]. Let f = [f(x1), ..., f(xn)]⊤ be the vector of latent function values; GPR assumes a GP prior over the functions, i.e. P(f) = N(f|0, K), where K is the covariance matrix generated by evaluating all pairs of inputs {(xi, xj) | i, j = 1, ..., n} with a covariance function K(xi, xj).
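To make the GP prior concrete, the short sketch below (not from the chapter; prior_sample, kernel and jitter are illustrative names) builds K from an arbitrary covariance function and draws one vector of latent values f ~ N(0, K). Any covariance function, such as the squared-exponential function given next, can be plugged in.

    import numpy as np

    def prior_sample(X, kernel, jitter=1e-8, seed=0):
        """Draw one sample f ~ N(0, K) with (K)_ij = kernel(x_i, x_j)."""
        n = X.shape[0]
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        L = np.linalg.cholesky(K + jitter * np.eye(n))  # jitter guards the factorisation
        rng = np.random.default_rng(seed)
        return L @ rng.standard_normal(n)               # f = L z with z ~ N(0, Id_n)

The small jitter term is only a numerical safeguard when factorising K; it plays no role in the model itself.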
A common example of K(xi, xj) is the squared-exponential function

    K(xi, xj; θ) = θ0 exp( −(1/2) Σ_{l=1}^{m} θl (xi(l) − xj(l))² ) + θb,    (1.2)

where θ0, θl, θb > 0 are hyperparameters, θ = [θ0, θ1, ..., θm, θb]⊤ ∈ R^{m+2} and xi(l) denotes the l-th entry of xi.

In order to make a prediction for a new input x∗ we need to compute the predictive distribution P(f∗|x∗, y). First, the probability P(y|f), known as the likelihood, can be evaluated as

    P(y|f) = Π_{i=1}^{n} N(yi|f(xi), σ²) = N(y|f, σ²Idn),    (1.3)

where Idn is the identity matrix of size n × n. Second, the posterior probability of f can be written as

    P(f|y) ∝ P(f)P(y|f) ∝ N(f | K(K + σ²Idn)⁻¹y, σ²K(K + σ²Idn)⁻¹).    (1.4)

Third, the joint GP prior P(f, f∗) is multivariate Gaussian as well, denoted as

    P([f; f∗]) = N( [f; f∗] | 0, [K, k∗; k∗⊤, k∗∗] ),    (1.5)

where

    k∗ = (K(xi, x∗))_{i=1}^{n},  k∗∗ = K(x∗, x∗).    (1.6)

Furthermore, the conditional distribution of f∗ given f is a Gaussian

    P(f∗|f, x∗) = N(k∗⊤K⁻¹f, k∗∗ − k∗⊤K⁻¹k∗),    (1.7)

and finally the predictive distribution P(f∗|x∗, y) can be found as

    P(f∗|x∗, y) = ∫ P(f∗|f, x∗)P(f|y)df = N(f∗|µ∗, σ∗²),    (1.8)

where

    µ∗ = k∗⊤α,  σ∗² = k∗∗ − k∗⊤(K + σ²Idn)⁻¹k∗,    (1.9)

and the weight parameter

    α = (K + σ²Idn)⁻¹y.    (1.10)

Clearly, the main task of learning a GPR model is to estimate α. From (1.9) and (1.10), we can note that training a full GPR model requires O(n³) time and O(n²) memory, and that computing the predictive mean and variance for a new test case costs O(n) and O(n²), respectively. It is therefore impractical to apply GPR to large-scale training or testing datasets. This has led people to investigate approximate GPR models.

In order to understand the main ideas of the approximate GPR models that have appeared in the literature, we view estimating α in (1.10) as the solution of the following optimisation problem [28, 30]:

    min_α E(α) = (1/2) α⊤(σ²K + K⊤K)α − (K⊤y)⊤α + (1/2) y⊤y    (1.11)
               = (1/2) ‖y − Kα‖² + (σ²/2) α⊤Kα.    (1.12)

Based on formulation (1.12), it can be noted that many other popular kernel machines invented later, such as Kernel Ridge Regression (KRR) [24], Least Squares Support Vector Machines (LS-SVM) [31], the Kernel Fisher Discriminant [18], Regularised Least Squares Classification (RLSC) [23] and the Proximal Support Vector Machine (PSVM) [11], are essentially equivalent to the GPR model.

Since the matrix (σ²K + K⊤K) in (1.11) is symmetric and the objective is a quadratic function, it is straightforward to exploit the well-known Conjugate Gradient (CG) method [12]. The CG method solves the problem (1.11) by iteratively performing matrix-vector multiplication (MVM) operations Kc, where c ∈ R^n is a vector. This directly motivated some researchers to apply the improved fast Gauss transform (IFGT) [35], KD-trees [27] and the general N-body approach [13] to accelerating the computation of the full GPR model through a series of efficient approximations of the product Kc.

Another class of approximate GPR models is based on a sparse estimate of α and can be further explained as approximating the full kernel matrix K by a low-rank kernel representation. A sparse estimate of α is one in which redundant or uninformative entries are set to exactly zero.
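Before turning to the sparse case, here is a minimal sketch (NumPy only; illustrative names, not the authors' code) of the full-model predictor defined by (1.9) and (1.10), which makes the O(n³) training and O(n²) per-test-case costs explicit.

    import numpy as np

    def full_gpr_fit(K, y, sigma2):
        A = K + sigma2 * np.eye(K.shape[0])                   # K + sigma^2 Id_n
        L = np.linalg.cholesky(A)                             # O(n^3) training cost
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = A^{-1} y, eq. (1.10)
        return L, alpha

    def full_gpr_predict(L, alpha, k_star, k_ss):
        mu = k_star @ alpha                       # predictive mean (1.9), O(n) per test case
        v = np.linalg.solve(L, k_star)            # O(n^2) per test case
        var = k_ss - v @ v                        # k_** - k_*^T (K + sigma^2 Id_n)^{-1} k_*
        return mu, var

The Cholesky factor of (K + σ²Idn) is reused for both the weight vector and the predictive variance, the standard way of avoiding an explicit matrix inverse.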
If we use αp to denote the non-zero entries of α, indexed by Ip = {i1, ..., ip}, then the objective function (1.12) can be equivalently written as

    min_{αp} E(αp) = (1/2) ‖y − Kpαp‖² + (σ²/2) αp⊤Qpαp,    (1.13)

where Kp denotes the submatrix of the columns of K centred on {x_{ij}, j = 1, ..., p}. Let x̃j = x_{ij}; we refer to {x̃j}, j = 1, ..., p, as the set of basis vectors¹. Qp denotes the kernel matrix generated by these basis vectors, i.e., (Qp)ij = K(x̃i, x̃j), i, j = 1, ..., p. The sparse estimate αp can be obtained from (1.13) as

    αp = ΣpKp⊤y    (1.14)

with

    Σp = (Kp⊤Kp + σ²Qp)⁻¹.    (1.15)

In contrast to (1.10), computing αp in (1.14) only needs O(np²) operations instead of the original O(n³), which greatly alleviates the computational burden involved in the training and testing procedures of the full GPR model provided p ≪ n in practice. It has been observed that selecting a good index set Ip has a crucial effect on the generalisation performance of the obtained sparse GPR model. Most current algorithms formulate the selection procedure as an iterative forward selection process: at each iteration, a new basis vector is identified based on greedy optimisation of some criterion and the corresponding αp is then updated. We therefore refer to this class of methods as greedy forward selection algorithms.

In fact, the above sparsifying procedure can also be understood as approximating the kernel matrix K by a low-rank representation of the form K̃ = KpQp⁻¹Kp⊤. This can be seen from the optimal objective values of the problem (1.11) and the sparse version (1.13):

    E(α) = (σ²/2) y⊤(K + σ²Idn)⁻¹y    (1.16)

and

    E(αp) = (σ²/2) y⊤(KpQp⁻¹Kp⊤ + σ²Idn)⁻¹y.    (1.17)

Furthermore, it means that the sparse GPR model is obtained by replacing the original GP prior P(f) = N(f|0, K) with an approximate prior Qp(f) = N(f|0, KpQp⁻¹Kp⊤) [5]. Following the same derivation as for the full GPR model, the approximate predictive distribution Qp(f∗|x∗, y) of the sparse GPR model becomes

    Qp(f∗|x∗, y) = ∫ Qp(f∗|f)P(f|y)df = N(f∗|µ̃∗, σ̃∗²),    (1.18)

where

    µ̃∗ = k̃∗⊤αp,  k̃∗ = (K(x̃j, x∗))_{j=1}^{p},    (1.19)
    σ̃∗² = k∗∗ − k̃∗⊤Qp⁻¹k̃∗ + σ²k̃∗⊤Σpk̃∗.    (1.20)

It can be noted that, in the sparse approximation of GPR models, computing the predictive mean and variance only needs O(p) and O(p²), respectively. Compared to the approaches that approximate MVM by IFGT [35] and KD-trees [27], greedy forward selection algorithms involve only standard linear algebra operations and do not require specifying any critical parameters, as is the case for IFGT [35]. Moreover, the approximation quality of MVM degrades for high-dimensional problems, even though more sophisticated algorithms have been proposed [22, 3].

As mentioned above, the crucial step of greedy forward algorithms is to select a good index set Ip based on some criterion. In other words, the problem is how to find representative basis vectors among the original training examples. A number of basis vector selection schemes have been proposed [1, 29, 28, 34, 9, 19, 36, 26, 15, 30]. In the next section, we briefly summarise these algorithms and tease out the motivation for our new gradient-based algorithm.

3. Basis Vector Selection Algorithms

Clearly, choosing p basis vectors out of n possible choices involves a combinatorial search over a space of C(n, p) possibilities and is an NP-hard problem [20].
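Before reviewing the selection schemes, the following sketch (illustrative names, NumPy only; not the authors' code) spells out, for a fixed set of basis vectors, the sparse estimate (1.14)-(1.15) and the approximate prediction (1.19)-(1.20) that every method in this section ultimately feeds.

    import numpy as np

    def sparse_gpr_fit(Kp, Qp, y, sigma2):
        Sigma_p = np.linalg.inv(Kp.T @ Kp + sigma2 * Qp)   # (1.15), O(np^2 + p^3)
        alpha_p = Sigma_p @ (Kp.T @ y)                     # (1.14)
        return alpha_p, Sigma_p

    def sparse_gpr_predict(alpha_p, Sigma_p, Qp, k_tilde, k_ss, sigma2):
        mu = k_tilde @ alpha_p                                   # (1.19), O(p)
        var = (k_ss
               - k_tilde @ np.linalg.solve(Qp, k_tilde)          # k~_*^T Qp^{-1} k~_*
               + sigma2 * k_tilde @ Sigma_p @ k_tilde)           # (1.20), O(p^2)
        return mu, var

For clarity the sketch inverts (Kp⊤Kp + σ²Qp) directly; the appendix of this chapter instead maintains Cholesky factors incrementally, which is the numerically preferable route.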
Exhaustive search being infeasible, we have to resort to near-optimal search schemes, such as the greedy forward selection algorithms mentioned above, to ensure computational efficiency. This section reviews some principled basis vector selection schemes and analyses their computational complexity.

For any greedy forward selection approach, the associated time complexity is composed of two parts, Tbasic and Tselection, as defined in [15]. Tbasic denotes the cost of updating the sparse GPR model given the index set Ip; this cost is the same for all forward selection algorithms. The other part, Tselection, refers to the cost incurred by the procedure of selecting basis vectors. In the following, for simplicity, we always neglect the Tbasic cost, and all time complexity issues discussed refer to the Tselection cost. For convenience, we categorise the algorithms that have appeared in the literature into unsupervised (i.e., independent of the target information) and supervised types. Although some algorithms, such as [1, 2, 9, 19], were not proposed to deal directly with sparse GPR models, their ideas can easily be extended to select the set of basis vectors for GPR models.

Unsupervised methods

The simplest unsupervised method is random selection [29, 34], but several experimental studies [26, 15] have shown that this can produce poor results. All other unsupervised methods [2, 9, 7, 8] attempt to directly minimise the trace of the residual matrix tr(∆Kp) = tr(K − K̃) = tr(K − KpQp⁻¹Kp⊤). Let Qp−1 = Lp−1Lp−1⊤ be the Cholesky factorisation and Gp−1 = Kp−1Lp−1⁻⊤. Let ip be the index of the next basis vector to be added, kp = (K(xi, x_{ip}))_{i=1}^{n}, qp = (K(x̃j, x_{ip}))_{j=1}^{p−1}, qp* = K(x_{ip}, x_{ip}) and lp = Lp−1⁻¹qp. We have

    Jp = tr(∆Kp) = Jp−1 − ‖gp‖²,    (1.21)

where

    gp = (kp − Gp−1lp) / sqrt(qp* − lp⊤lp).    (1.22)

Computing the exact reduction ‖gp‖² after including the ip-th column is therefore an O(np) operation [2]. If this were done for all remaining columns at each iteration, it would lead to a prohibitive total complexity of O(n²p²). Fine and Scheinberg [9] proposed a cheap implementation: since ‖gp‖² is lower bounded by (gp(ip))² = qp* − lp⊤lp, which can be maintained recursively, they evaluate only this bound, at negligible cost, to choose the p-th basis vector. Another cheap implementation of this idea is the on-line scheme of [7, 8].

Supervised methods

Since we are confronted with a supervised learning task, it is natural to take the target information into account when approximating K. Building on the results of the unsupervised methods, Bach and Jordan [1] recently proposed an algorithm which selects a new basis vector based on a trade-off between the unsupervised term tr(K − KpQp⁻¹Kp⊤) and the training squared error term ‖y − Kpαp‖². Combined with an efficient 'look-ahead' strategy, their selection scheme incurs only O(δnp) Tselection complexity when p basis vectors are selected, where δ is set to a small value. Dropping the unsupervised term, Nair et al. [19] developed a very cheap strategy to decrease the supervised term ‖y − Kpαp‖², achieved by examining the current residual rp = y − Kpαp and selecting the entry with the largest absolute value.
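As a concrete illustration of the exact unsupervised score above, the sketch below (illustrative names, NumPy only; assumes p ≥ 2) computes the trace reduction ‖gp‖² of (1.21)-(1.22) for a single candidate column, the O(np) computation referred to in the text.

    import numpy as np

    def trace_reduction(k_cand, q_cand, qstar_cand, L_prev, G_prev):
        # k_cand     : n-vector  K(x_i, x_{i_p})
        # q_cand     : (p-1)-vector K(x~_j, x_{i_p});  qstar_cand : K(x_{i_p}, x_{i_p})
        # L_prev, G_prev : current factors L_{p-1} and G_{p-1}
        l = np.linalg.solve(L_prev, q_cand)            # l_p = L_{p-1}^{-1} q_p
        denom = np.sqrt(qstar_cand - l @ l)
        g = (k_cand - G_prev @ l) / denom              # g_p as in (1.22), O(np) work
        return g @ g                                   # the reduction ||g_p||^2 in (1.21)

The cheap bound of Fine and Scheinberg [9] amounts to returning qstar_cand − l @ l without ever forming g.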
Following the formulation (1.13) of the sparse GPR model, it would be preferable to choose the basis vector which leads to the largest reduction in the objective (1.17); this was first proposed in [28]. Let Hp = Idn − KpΣpKp⊤. Then E(αp) can be computed recursively as [30]

    Ep = Ep−1 − ∆E1(ip),    (1.23)

where

    ∆E1(ip) = (1/2) (gp⊤Hp−1y)² / (σ² + gp⊤Hp−1gp).    (1.24)

As with the criterion (1.21), computing the reduction ∆E1(j), j ∉ Ip−1, for all n + 1 − p previously unselected vectors until p basis vectors are accumulated is a prohibitive O(n²p²) operation. Therefore, Smola and Bartlett [28] resorted to a sub-greedy scheme which considers only κ candidates randomly chosen from outside Ip−1 when selecting the p-th basis vector; they used κ = 59. For this sub-greedy method, the complexity is reduced to O(κnp²). Alternatively, Sun and Yao [30] recently improved the original O(n²p²) complexity to O(n²p) by recursively maintaining some quantities for all remaining vectors. Furthermore, they [30] suggest using only the numerator part of ∆E1(ip), i.e.,

    ∆E2(ip) = (1/2) (gp⊤Hp−1y)² = (1/2) (gp⊤rp−1)²,    (1.25)

where rp−1 = Hp−1y = y − Kp−1αp−1, as the criterion for scoring all remaining vectors; this produces almost the same prediction accuracy as the criterion (1.24). The advantage of the simplified version (1.25) is that the computational cost decreases to O(κnp) when combined with the sub-greedy scheme, compared to the O(κnp²) cost incurred by the sub-greedy method of [28].

Another scoring criterion, also based on optimising the objective (1.13), is the matching pursuit approach [15], which was motivated by [33]. Instead of minimising (1.13) over all entries of αp, as in the case of (1.24), it adjusts only the last entry of αp to optimise (1.13). The resulting selection criterion is [15]

    ∆E3(ip) = (1/2) [kp⊤rp−1 − σ²qp⊤αp−1]² / (σ²qp* + kp⊤kp).    (1.26)

The computational cost of using (1.26) to score one basis vector is O(n) time, similar to the criterion (1.25). The empirical study conducted in [30] showed that (1.26) is consistently inferior to (1.25) in generalisation performance, especially on large-scale datasets.

The last supervised method we introduce here is the so-called 'Info-gain' approach [26]. Let Qp(f|y) denote the posterior probability of f given the approximate GP prior Qp(f), analogous to (1.4). Info-gain scores the "informativeness" of a basis vector by the Kullback-Leibler distance between Qp(f|y) and Qp−1(f|y), i.e. KL[Qp‖Qp−1]. Under some assumptions, this criterion can be simplified to a very cheap approach costing only O(1) per basis vector evaluation. However, Info-gain sometimes leads to very poor results, as reported in [15] and also shown in our experiments.

Across the algorithms discussed above, we note that, at the p-th iteration, all of them try to select a new basis vector from the remaining (n − p + 1) columns of K; a short sketch of how such candidates are scored under the sub-greedy scheme is given below. If the dataset is very large, the computational cost of scoring (n − p + 1) candidates is prohibitive for some of the selection criteria above. The interesting question is: why must we select from a huge pool of vectors rather than construct one? This is the starting point of our work.
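The sketch below (illustrative names; g_cand_fn is a hypothetical helper, not defined in the chapter, that forms gp for a candidate index as in (1.22)) shows the sub-greedy use of the simplified criterion (1.25): κ randomly chosen unselected columns are scored and the best one is kept.

    import numpy as np

    def subgreedy_select(residual, candidates, g_cand_fn, kappa=59, seed=0):
        """Pick the candidate index with the largest score (1.25) among kappa random ones."""
        rng = np.random.default_rng(seed)
        pool = rng.choice(candidates, size=min(kappa, len(candidates)), replace=False)
        scores = []
        for i in pool:
            g = g_cand_fn(i)                            # O(np) work per candidate
            scores.append(0.5 * (g @ residual) ** 2)    # Delta E_2(i) in (1.25)
        return pool[int(np.argmax(scores))]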
4. A Gradient-based Forward Greedy Algorithm

The key idea is to construct, rather than select, a basis vector at each iteration. This is motivated by the well-known gradient boosting framework [10]. Before proceeding to our new algorithm, we briefly describe what boosting is. The basic idea behind boosting is that, rather than using just a single learner for prediction, a linear combination of T base learners

    F(x) = Σ_{t=1}^{T} βt ht(x)    (1.27)

is used [17]. Here each ht(x) is a base learner (e.g. a decision tree) and βt is its coefficient in the linear combination. Following the pioneering work by Friedman [10], the boosting procedure can be viewed as a gradient-based incremental search for a good additive model. This is done by searching, at each iteration, for the base learner which gives the "steepest descent" in a loss denoted by L(y, f). The essential steps of a boosting procedure can be summarised as follows:

1  F0(x) = 0;
2  For t = 1 : T do:
   (a) (βt, ht(x)) = argmin_{β*, h(x)} Σ_{i=1}^{n} L(yi, Ft−1(xi) + β*h(xi));
   (b) Ft(x) = Ft−1(x) + βt ht(x);
3  EndFor
4  F(x) = FT(x) = Σ_{t=1}^{T} βt ht(x).

Replacing the loss L(y, f) with different loss functions produces a family of boosting algorithms. The most prominent example is AdaBoost [25], which employs the exponential loss function

    L(yi, f(xi)) = exp{−yi f(xi)},  with yi ∈ {−1, +1}.    (1.28)

Let us go back to the sparse GPR approach, which aims to find a sparse representation of the regression model of the form

    fp(x) = Σ_{j=1}^{p} αp(j) K(x̃j, x),    (1.29)

where αp(j) is the j-th entry of αp. If we conceptually regard each term K(x̃j, x), j = 1, ..., p, involved in (1.29) as a base learner, all the greedy forward selection algorithms summarised in Section 3 are equivalent to the above boosting procedure. The only difference is that greedy forward selection algorithms select a new base learner at each iteration, whereas boosting constructs a base learner by gradient-based search. This ultimately motivates us to propose the following new approach for sparse GPR.

We formulate the problem of building a sparse GPR model as a boosting procedure. First, the loss L(y, f) is replaced by the objective (1.13). Then, at each iteration, we construct the 'base learner' K(x̃p, x) by optimising (1.13) with respect to the parameters x̃p, and its coefficient αp* is adjusted accordingly. In detail, this can be described by the following optimisation problem:

    min_{αp*∈R, x̃p∈R^m} E(αp*, x̃p) = (1/2) ‖y − Kp−1αp−1 − αp* kp(x̃p)‖²
        + (σ²/2) [αp−1; αp*]⊤ [Qp−1, qp(x̃p); qp(x̃p)⊤, qp*(x̃p)] [αp−1; αp*].    (1.30)

In order to emphasise that kp, qp and qp* depend on x̃p, we have written them as functions of x̃p in (1.30); for simplicity we sometimes suppress this explicit dependence. It is easy to show that

    E(αp*, x̃p) = Ep−1 + (1/2) (αp*)²(σ²qp* + kp⊤kp) + αp*(σ²qp⊤αp−1 − kp⊤rp−1).    (1.31)

Since the condition for optimality of αp* is

    ∂E(αp*, x̃p)/∂αp* = αp*(σ²qp* + kp⊤kp) + [σ²qp⊤αp−1 − kp⊤rp−1] = 0,    (1.32)

we get

    αp* = (kp⊤rp−1 − σ²qp⊤αp−1) / (σ²qp* + kp⊤kp).    (1.33)

Substituting αp* in (1.31) with (1.33), the problem (1.30) can be equivalently written as

    min_{x̃p∈R^m} E(x̃p) = Ep−1 − (1/2) [kp(x̃p)⊤rp−1 − σ²qp(x̃p)⊤αp−1]² / (σ²qp*(x̃p) + kp(x̃p)⊤kp(x̃p)).    (1.34)

In fact, the objective function (1.34) we derived is the same as the criterion (1.26); the only difference is that we do not restrict the candidate for the next basis vector to be a training example. The derivative of (1.34) with respect to x̃p(l), l = 1, ..., m, is easily obtained:

    p = 1:  ∂E(x̃p)/∂x̃p(l) = −(1/2) αp* [2 k̇p⊤rp−1 − αp*(σ²q̇p* + 2 kp⊤k̇p)],
    p > 1:  ∂E(x̃p)/∂x̃p(l) = −(1/2) αp* [2 (k̇p⊤rp−1 − σ²q̇p⊤αp−1) − αp*(σ²q̇p* + 2 kp⊤k̇p)],    (1.35)

where

    k̇p = ∂kp(x̃p)/∂x̃p(l),  q̇p = ∂qp(x̃p)/∂x̃p(l),  q̇p* = ∂qp*(x̃p)/∂x̃p(l).

So, any gradient-based optimisation algorithm can be used to construct the base learner K(x̃p, x) and thus the new basis vector x̃p.
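As one possible realisation (a sketch under stated assumptions, not the authors' code): for the squared-exponential kernel (1.2), the criterion (1.34) and its gradient (1.35) can be packaged in the "value, gradient" form expected by standard quasi-Newton routines. All function and variable names below are illustrative; note that the constant θb offset in (1.2) does not depend on x̃p and is therefore excluded from the kernel derivatives.

    import numpy as np

    def se_kernel(A, b, theta0, theta, thetab):
        """Squared-exponential kernel (1.2) between the rows of A and a single point b."""
        d2 = ((A - b) ** 2 * theta).sum(axis=1)          # sum_l theta_l (A[i,l] - b[l])^2
        return theta0 * np.exp(-0.5 * d2) + thetab

    def construction_objective(xb_new, X, Xb, r, a, sigma2, theta0, theta, thetab):
        """Value of E(x~_p) - E_{p-1} from (1.34) and its gradient (1.35).

        X : n x m training inputs      Xb : (p-1) x m existing basis vectors
        r : residual r_{p-1}           a  : current weights alpha_{p-1}
        (Xb and a are empty arrays when p = 1.)
        """
        k = se_kernel(X, xb_new, theta0, theta, thetab)       # k_p(x~_p)
        q = se_kernel(Xb, xb_new, theta0, theta, thetab)      # q_p(x~_p)
        qstar = theta0 + thetab                               # q_p^*(x~_p) = K(x~_p, x~_p)
        num = k @ r - sigma2 * (q @ a)
        den = sigma2 * qstar + k @ k
        alpha_star = num / den                                # (1.33)
        value = -0.5 * num ** 2 / den                         # reduction part of (1.34)
        # kernel derivatives (appendix A); the theta_b offset is constant in x~_p
        kdot = theta * (k - thetab)[:, None] * (X - xb_new)   # n x m, column l = dk_p/dx~_p(l)
        qdot = theta * (q - thetab)[:, None] * (Xb - xb_new)  # (p-1) x m
        grad = -alpha_star * (kdot.T @ r - sigma2 * (qdot.T @ a)) \
               + alpha_star ** 2 * (kdot.T @ k)               # (1.35) with dq_p^*/dx~_p(l) = 0
        return value, grad

Minimising this function with, for example, scipy.optimize.minimize(..., jac=True, method="L-BFGS-B") plays the role of the BFGS routine of [4] used in the chapter.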
Note that computing E(x̃p) and the corresponding gradient costs only O(n) time when the dimension m ≪ n and the number of constructed basis vectors p ≪ n. Therefore our algorithm is applicable to large-scale datasets, just like (1.26). From a complexity viewpoint, the proposed method is on a par with the criteria (1.25) and (1.26), but our approach additionally requires the gradient information (1.35), which makes it slightly slower than the other approaches. The updating of the related quantities after x̃p has been constructed is detailed in the Appendix.

In our implementation, we employ the routine BFGS [4] as the gradient-based optimisation package. In the course of the numerical experiments, it was found that even with a small number of BFGS steps per iteration we obtain better results than those of other leading algorithms. To improve the performance of the proposed gradient-based algorithm even further, we use the following multiple initialisation strategy. At the beginning of each iteration, we randomly take 20 training examples as initial basis vectors and rank them by (1.34). The best one is used to initialise the routine BFGS. Moreover, we set the maximal number of BFGS steps allowed at each iteration to 39. Thus, the objective function (1.34) is evaluated 59 times per iteration in total. The aim of this setting is to allow comparison with the other sub-greedy algorithms [28, 15, 30], which evaluate their corresponding selection criteria κ = 59 times at each iteration.

The steps of the proposed gradient-based forward greedy algorithm can be summarised as follows (a condensed code sketch is given at the end of this section):

For p = 1, ..., pmax (the maximal number of basis vectors):
  1  Randomly take 20 training examples from {xi}, i = 1, ..., n, score them by (1.34), and pick the best one, denoted x̃p0;
  2  Use x̃p0 as the initial value and run the routine BFGS; the output x̃p is the p-th constructed basis vector;
  3  Update Ip−1, Kp−1, Qp−1, Gp−1, Lp−1, αp−1, µp−1, rp−1 and the other related quantities (see Appendix for details);
End For
Outputs: {x̃j}, j = 1, ..., p, together with αp, Qp and Σp.

Finally, it is worth emphasising that the proposed gradient-based approach to sparse GPR with the objective (1.13) can be straightforwardly extended to other types of objective functions, corresponding to different kinds of kernel machines. For example, the following two objectives EKLR and ESVM correspond to kernel logistic regression (KLR) [37] and support vector machines (SVM) [6], respectively:

    EKLR = (1/n) Σ_{i=1}^{n} ln(1 + exp{−yi fp(xi)}) + (σ²/2) αp⊤Qpαp    (1.36)

and

    ESVM = (1/n) Σ_{i=1}^{n} max(0, 1 − yi fp(xi))² + (σ²/2) αp⊤Qpαp,    (1.37)

where fp(x) is defined in (1.29). As with sparse GPR, the resulting training algorithms for both KLR and SVM are expected to scale linearly in the number of training cases and to be much faster and more accurate than existing selection-based approaches.
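The condensed sketch below ties the pieces together and is illustrative only: it reuses the hypothetical construction_objective and se_kernel from the previous sketch and the sparse_gpr_fit helper from Section 2, substitutes scipy's L-BFGS-B for the BFGS routine of [4], and, for brevity, refits the model directly after each inclusion instead of performing the recursive Cholesky updates detailed in the appendix.

    import numpy as np
    from scipy.optimize import minimize

    def build_sparse_gpr(X, y, sigma2, theta0, theta, thetab, p_max,
                         n_init=20, max_bfgs_steps=39, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]                                 # assumes n >= n_init
        Xb = np.empty((0, X.shape[1]))                 # constructed basis vectors
        alpha_p = np.empty(0)
        r = y.copy()                                   # residual r_{p-1}
        for p in range(p_max):
            # score n_init random training inputs by (1.34) and keep the best start
            idx = rng.choice(n, size=n_init, replace=False)
            vals = [construction_objective(X[i], X, Xb, r, alpha_p, sigma2,
                                           theta0, theta, thetab)[0] for i in idx]
            x0 = X[idx[int(np.argmin(vals))]]
            # a few quasi-Newton steps construct the new basis vector
            res = minimize(construction_objective, x0, jac=True,
                           args=(X, Xb, r, alpha_p, sigma2, theta0, theta, thetab),
                           method="L-BFGS-B", options={"maxiter": max_bfgs_steps})
            Xb = np.vstack([Xb, res.x])
            # refit on the enlarged basis (the appendix updates do this incrementally)
            Kp = np.column_stack([se_kernel(X, xb, theta0, theta, thetab) for xb in Xb])
            Qp = np.array([[se_kernel(Xb[i:i + 1], Xb[j], theta0, theta, thetab)[0]
                            for j in range(len(Xb))] for i in range(len(Xb))])
            alpha_p, _ = sparse_gpr_fit(Kp, Qp, y, sigma2)
            r = y - Kp @ alpha_p
        return Xb, alpha_p

With the incremental updates of the appendix, the per-iteration cost drops from the direct refit used here to the O(np) bookkeeping assumed in the complexity discussion above.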
5. Numerical experiments

In this section, we compare our gradient-based forward greedy algorithm with other leading sparse GPR algorithms, induced by different basis selection criteria, on four datasets. For simplicity, we refer to the algorithms to be compared by the name of their first author: Williams [34], Fine [9], Nair [19], Seeger [26], Baudat [2], Bach [1], Smola [28], Keerthi [15] and Sun [30]. The first four employ very cheap basis selection criteria and have negligible Tselection cost. The Baudat method is a special case of Bach² in which the trade-off parameter is set to zero, i.e., only the unsupervised term is considered. To reduce the complexity of the Baudat criterion, we also apply the 'look-ahead' strategy [1] to speed up its computation. Thus, both of them have the same Tselection complexity of O(δnp). We did not run the Smola method in our experiments for two reasons: (1) it has been shown empirically to generate almost the same results as Sun [30]; (2) it leads to O(κnp²) Tselection complexity, which is much higher than the other approaches. The Keerthi and Sun methods, induced by (1.26) and (1.25) respectively, employ the same sub-greedy strategy and incur O(κnp) Tselection complexity. In our implementation, we set δ = 59 and κ = 59 to ensure the same selection complexity, matching the setting of our gradient-based algorithm described above.

The algorithms presented in this section were coded in Matlab 7.0 and all numerical experiments were conducted on a PIV 2GHz machine with 512MB memory. For all experiments, the squared-exponential kernel (1.2) was used. The hyperparameters were estimated via a full GPR model on a subset of 1000 examples³ randomly selected from the original dataset; these tasks were accomplished with the GP routines of the well-known NETLAB software⁴. To evaluate generalisation performance, we use the mean squared error (MSE) and the negative logarithm of the predictive distribution (NLPD), defined as

    MSE = (1/t) Σ_{i=1}^{t} (yi − µi)²,    (1.38)
    NLPD = (1/t) Σ_{i=1}^{t} −log P(yi|µi, σi²),    (1.39)

where t is the number of test examples, yi is the test target, and µi and σi² are the predictive mean and variance, respectively. Sometimes the normalised MSE (NMSE), given by NMSE = MSE/var(y), is used for convenience, where var(y) is the variance of the training targets. Note that NLPD measures the quality of the predictive distributions, as it penalises over-confident as well as under-confident predictions. The four datasets employed are Boston Housing, Kin-32nm, LogP and KIN40K⁵. Finally, we select the approaches with the best generalisation performance across all four datasets and compare their scaling behaviour on a set of datasets generated from KIN40K.

A. Boston Housing Dataset

This popular regression dataset comprises 506 examples with 14 variables; the task is to predict the median value of owner-occupied homes from the other 13 variables. The results were averaged over 100 repetitions, in each of which the dataset was randomly partitioned into 481/25 training/testing splits, a common setting in the literature [19]. Table 1.1 summarises the test performance of the nine methods, along with standard deviations, for p = 100 and p = 200. From Table 1.1 it can be noted that, for both p = 100 and p = 200, our basis-vector-construction method almost always achieves the best results on both MSE and NLPD, although the margin is not large, especially when more basis vectors are selected. When it is not the best, it still ranks second among all nine methods.
In addition, the performance of the three unsupervised basis selection methods, marked by the superscript †, is systematically worse than that of the six supervised methods when fewer basis vectors are selected. However, when nearly half of the training examples are chosen, all of these methods produce very similar MSE results.

Table 1.1. Test results of nine sparse GPR algorithms on the Boston Housing dataset for p = 100 and p = 200. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 100 repetitions, with standard deviations. The best method is highlighted in bold and the second best in italic.

Method            p = 100: MSE / NLPD          p = 200: MSE / NLPD
Williams† [34]    9.97±6.58 / 2.73±0.44        6.98±4.01 / 2.66±0.57
Fine† [9]         8.22±3.97 / 2.53±0.29        6.83±2.83 / 2.48±0.38
Nair [19]         6.83±2.72 / 2.50±0.28        6.28±2.70 / 2.56±0.47
Seeger [26]       7.32±3.21 / 2.54±0.20        6.35±2.63 / 2.45±0.37
Baudat† [2]       8.15±4.27 / 2.48±0.29        6.56±2.68 / 2.52±0.43
Bach [1]          7.52±3.19 / 2.54±0.24        6.56±2.66 / 2.54±0.45
Keerthi [15]      7.08±2.92 / 2.44±0.24        6.38±2.54 / 2.48±0.40
Sun [30]          6.64±2.82 / 2.46±0.30        6.28±2.55 / 2.55±0.45
Ours              6.43±2.67 / 2.46±0.09        6.26±2.58 / 2.36±0.13

B. Kin-32nm Dataset

The Kin-32nm dataset is one of the eight kin-family datasets, which are synthetically generated from a realistic simulation of the forward kinematics of an 8-link all-revolute robot arm. The data comprise 8192 examples with 32 input dimensions; the aim is to predict the distance of the end-effector from a target, given the angular positions of the joints, the link twist angles, the link lengths and the link offset distances. We randomly split the mother data into 4000 training and 4192 testing examples and produced 20 such repetitions. Again, we apply all nine methods to this high-dimensional problem. The results on the Kin-32nm dataset are reported in Table 1.2. According to Table 1.2, our proposed algorithm always ranks first, by a clear margin, and we believe that, in such a high-dimensional case, our flexible gradient-based approach can discover more representative basis vectors than selection-based algorithms. Moreover, the two other algorithms based on directly optimising the objective (1.13), Keerthi and Sun, also perform noticeably better than the remaining methods. Again, we observe that the supervised basis selection methods are consistently superior to the unsupervised ones.

Table 1.2. Test results of nine sparse GPR algorithms on the Kin-32nm dataset for p = 100 and p = 200. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 20 repetitions, with standard deviations. The best method is highlighted in bold and the second best in italic.
Method       p = 100: NMSE / NLPD           p = 200: NMSE / NLPD
Williams†    0.634±0.015 / 0.501±0.017      0.594±0.011 / 0.541±0.012
Fine†        0.645±0.017 / 0.480±0.016      0.602±0.013 / 0.502±0.013
Nair         0.609±0.015 / 0.470±0.015      0.583±0.013 / 0.523±0.015
Seeger       0.610±0.017 / 0.470±0.017      0.584±0.013 / 0.524±0.015
Baudat†      0.643±0.022 / 0.490±0.020      0.599±0.014 / 0.511±0.013
Bach         0.606±0.013 / 0.450±0.011      0.588±0.011 / 0.512±0.009
Keerthi      0.588±0.012 / 0.441±0.008      0.575±0.012 / 0.506±0.012
Sun          0.587±0.012 / 0.441±0.010      0.575±0.011 / 0.513±0.011
Ours         0.569±0.011 / 0.384±0.007      0.553±0.015 / 0.396±0.015

C. LogP Dataset

LogP is a popular benchmark problem in Quantitative Structure-Activity Relationships (QSAR). The data split we used is the same as in [32]: of the 6912 examples, 691 (10%) were used for testing and the remaining 6221 for training⁶. Since the Matlab source code of the Bach method (which includes Baudat) provided by the authors involves the computation and storage of the full kernel matrix, it could not be used on such a large dataset with our PC. Therefore, we removed these two methods from the list in the following comparative study. Table 1.3 reports the performance of the seven remaining methods on the LogP data as the number of selected/constructed basis vectors is increased from 100 to 300. It can be seen from the results that our method clearly outperforms the other six methods, especially on NLPD. Although the Nair method obtains a slightly better MSE when p = 300, it produces a very poor NLPD at the same time. It should be emphasised that our prediction accuracy is much better than the results reported in [32], where the best achievable MSE was 0.601.

Table 1.3. Test results of seven sparse GPR algorithms on the LogP dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. The best method is highlighted in bold and the second best in italic.

Method       p = 100: MSE / NLPD    p = 200: MSE / NLPD    p = 300: MSE / NLPD
Williams†    0.615 / 5.50           0.571 / 9.04           0.571 / 9.04
Fine†        0.745 / 1.26           0.643 / 1.30           0.557 / 1.58
Nair         0.650 / 2.20           0.527 / 7.99           0.497 / 11.63
Seeger       0.673 / 1.75           0.547 / 2.57           0.516 / 3.83
Keerthi      0.577 / 1.79           0.550 / 2.89           0.526 / 4.463
Sun          0.544 / 3.91           0.523 / 7.75           0.518 / 11.43
Ours         0.528 / 1.13           0.521 / 1.08           0.509 / 1.06

D. KIN40K Dataset

Table 1.4. Test results of seven sparse GPR algorithms on the KIN40K dataset as the number of selected basis vectors increases. The superscript † denotes an unsupervised basis selection method. All reported results are averages over 10 repetitions, with standard deviations. The best method is highlighted in bold and the second best in italic.
Note that the Seeger method is even worse than the random-based (Williams) method, which is already observed in other work [15]. According to the results generated above, we can see that Nair, Keerthi, Sun and Ours four methods often produce better generalisation performance on test MSE (or NMSE). Now, we further compare these representative approaches for the scaling performance on a set of datasets generated from KIN40K data. Figure 1.1 shows the computational time of the four methods for varying training dataset sizes. Note that the maximal number of selected basis sectors is fixed on p = 500. As expected, all of them linearly scale in the number of the training examples. The Nair is the fastest one among four methods since it only requires O(1) time for scoring one basis at each selection step, and similarly for Williams, Fine and Seeger three approaches although we did not plot them in the figure. In contrast to Nair’s O(1) cost, other three leading algorithms including Keerthi, Sun and Ours, will need O(n) time to evaluate their corresponding criteria for one instance. Furthermore, compared with Keerthi and Sun our gradient-based search approach needs extra time to evaluate gradient information and this is finally responsible for the time gap between Ours and Keethi shown in the Figure 1.1. E. Discussion 18 3 10 2 10 1 10 1000 2000 3000 4000 5000 6000 7000 8000 900010000 Figure 1.1. Comparison of the training time required by four leading approaches as a function of the size of the training dataset. The maximal number of selected basis vectors is fixed to be p = 500. From bottom to top, they are Nair (square), Sun (circle), Keerthi (pentagram) and Ours (diamond). To our knowledge, this is the first time to formally compare all kinds of basis vector selection algorithms appeared in the literature. Based on our experimental studies, we can draw the following general summary empirically. The supervised basis selection methods are clearly better than unsupervised methods almost on all four datasets. In between Nair and Seeger two supervised basis selection methods which both lead to very minor selection cost, it appears that Nair is superior than Seeger on test MSE (or NMSE). The last three approaches Keerthi, Sun and Ours, which are all based on optimising the original GPR objective (1.13), produce more stable results than other sparse GPR methods on all datasets considered. On the large dataset, it seems that the Keerthi method is inferior to the Sun method. Finally, the constructionbased forward algorithm proposed in this chapter is more attactive than all of selection-based forward algorithms for both test NMSE and NLPD measures if the generalisation performance is a major concern. 6. Conclusions Basis vector selection is very important in building a sparse GPR model. A number of selection schemes based on various criteria have been proposed. In this paper, we did not follow the previous idea of selecting basis vectors from the training examples. Instead, we borrowed an idea from gradient boosting and proposed to construct basis vectors one by one through gradient-based optimisation. The proposed work is quite simple to implement. Excellent results on a range of datasets have been obtained. In the near future, we will analyse why the presented algorithm was not the best for some cases given A Gradient-based Greedy Algorithm for Sparse GPR 19 in this paper and evaluate it on more and large problems. Another important extension is to apply this idea to classification problems [37, 6]. 
Appendix

A. Gradients of kp, qp and qp*

If the squared-exponential function (1.2) is used as the kernel, the gradients of kp, qp and qp* are

    k̇p = ∂kp(x̃p)/∂x̃p(l) = θl kp .* [X(:, l) − x̃p(l)1n],
    q̇p = ∂qp(x̃p)/∂x̃p(l) = θl qp .* [X̃p−1(:, l) − x̃p(l)1p−1],
    q̇p* = ∂qp*(x̃p)/∂x̃p(l) = 0,

where X = [x1 ... xn]⊤ ∈ R^{n×m} is the input matrix, X̃p−1 = [x̃1 ... x̃p−1]⊤ ∈ R^{(p−1)×m} is the basis vector matrix, the notation '.*' denotes entry-by-entry multiplication, X(:, l) denotes the l-th column of X and similarly for X̃p−1(:, l). Finally, 1n denotes the all-one vector in R^n.

B. Inclusion of the constructed basis vector x̃p

In order to make a prediction for a new test case, we need αp, Qp⁻¹ and Σp, as can be seen from (1.19) and (1.20). Moreover, according to (1.34), our forward procedure for constructing basis vectors also requires µp and rp. Since directly computing Qp⁻¹ and Σp may suffer from numerical instability [12], we resort to Cholesky decompositions. Let Lp be the Cholesky factor LpLp⊤ = Qp, let Gp = KpLp⁻⊤ and let Mp be the factor of the further Cholesky decomposition MpMp⊤ = (Gp⊤Gp + σ²Idp). Then

    Qp⁻¹ = (LpLp⊤)⁻¹,
    Σp = (Kp⊤Kp + σ²Qp)⁻¹ = (LpMpMp⊤Lp⊤)⁻¹,

and further

    αp = ΣpKp⊤y = Lp⁻⊤(MpMp⊤)⁻¹Gp⊤y,  µp = Kpαp,  rp = y − µp.

Thus the quantities Lp, Mp, Gp, αp and µp need to be updated recursively. The steps involved can be summarised as follows:

    kp = [K(x1, x̃p), ..., K(xn, x̃p)]⊤,  qp = [K(x̃1, x̃p), ..., K(x̃p−1, x̃p)]⊤,  qp* = K(x̃p, x̃p),
    lp = Lp−1⁻¹qp,  lp* = sqrt(qp* − lp⊤lp),  gp = (kp − Gp−1lp)/lp*,
    mp = Mp−1⁻¹(Gp−1⊤gp),  η = Mp−1⁻⊤mp,  dp = gp − Gp−1η,
    b = dp⊤y,  c = dp⊤gp,  mp* = sqrt(σ² + c),  a = b / (lp*(σ² + c)),
    αp = [αp−1 − a Lp−1⁻⊤(lp + lp*η); a],  µp = µp−1 + b dp/(σ² + c),  rp = y − µp,

and finally

    Lp = [Lp−1, 0; lp⊤, lp*],  Mp = [Mp−1, 0; mp⊤, mp*],  Gp = [Gp−1, gp].

Since the matrices Lp and Mp are lower triangular, products of their inverses with a vector can be computed very efficiently.

Notes

1. Since each training case is responsible for one column of the full kernel matrix K, we sometimes also refer to the corresponding columns of K as basis vectors.
2. The Matlab source code can be accessed via http://cmm.ensmp.fr/~bach/csi/index.html.
3. Since the first dataset includes only 506 examples, we randomly pick 400 points for model selection in that case.
4. Available at http://www.ncrg.aston.ac.uk/netlab/index.php.
5. The Boston Housing data can be found in StatLib, available at http://lib.stat.cmu.edu/datasets/boston; Kin-32nm and its full description can be accessed at http://www.cs.toronto.edu/~delve/data/datasets.html; the LogP data can be requested from Dr Peter Tino (pxt@cs.bham.ac.uk); the KIN40K dataset is available at http://ida.first.fraunhofer.de/~anton/data.html.
6. A validation set is not necessary in our case since we employ the evidence framework in NETLAB to select the hyperparameters.
7. See http://ida.first.fraunhofer.de/~anton/data.html.

References

[1] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 33–40, 2005.
[2] G. Baudat and F. Anouar. Kernel-based methods and function approximation. In Proceedings of the 2001 International Joint Conference on Neural Networks (IJCNN 2001), pages 1244–1249, 2001.
[3] A. Beygelzimer, S. M. Kakade, and J. Langford.
Cover trees for nearest neighbor. Submitted, 2005.
[4] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.
[5] J. Quinonero Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935–1959, 2005.
[6] O. Chapelle. Training a support vector machine in the primal. Journal of Machine Learning Research, 2006. Submitted.
[7] L. Csato and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
[8] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.
[9] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.
[10] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
[11] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In KDD-2001: Knowledge Discovery and Data Mining, pages 77–86, San Francisco, CA, 2001.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[13] A. G. Gray. Fast kernel matrix-vector multiplication with application to Gaussian process learning. Technical report, School of Computer Science, Carnegie Mellon University, 2004.
[14] A. G. Gray and A. W. Moore. 'N-body' problems in statistical learning. In Advances in Neural Information Processing Systems 13, pages 521–527. MIT Press, 2000.
[15] S. S. Keerthi and W. Chu. A matching pursuit approach to sparse Gaussian process regression. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[16] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133–165. Springer, Berlin, 1998.
[17] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), pages 118–183, 2003.
[18] S. Mika, A. J. Smola, and B. Schölkopf. An improved training algorithm for kernel Fisher discriminants. In Eighth International Workshop on Artificial Intelligence and Statistics, pages 98–104, Key West, Florida, 2001.
[19] P. B. Nair, A. Choudhury, and A. J. Keane. Some greedy learning algorithms for sparse regression and classification with Mercer kernels. Journal of Machine Learning Research, 3:781–801, 2002.
[20] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 25(2):227–234, 1995.
[21] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[22] V. C. Raykar, C. Yang, R. Duraiswami, and N. Gumerov. Fast computation of sums of Gaussians in high dimensions. Technical report, UM Computer Science Department, 2005.
[23] R. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, MIT, Cambridge, MA, 2002.
[24] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning (ICML 1998), pages 515–521, 1998.
[25] R. E. Schapire. A brief introduction to boosting. In T. Dean, editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1401–1406, San Francisco, CA: Morgan Kaufmann Publishers, 1999.
[26] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, 2003.
[27] Y. Shen, A. Ng, and M. Seeger. Fast Gaussian process regression using KD-trees. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.
[28] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 14, pages 619–625. MIT Press, 2001.
[29] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 911–918, 2000.
[30] P. Sun and X. Yao. Greedy forward selection algorithms to sparse Gaussian process regression. In Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN 2006), 2006. To appear.
[31] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[32] P. Tino, I. Nabney, B. S. Williams, J. Losel, and Y. Sun. Non-linear prediction of quantitative structure-activity relationships. Journal of Chemical Information and Computer Sciences, 44(5):1647–1653, 2004.
[33] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.
[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14, pages 682–688. MIT Press, 2001.
[35] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems 17, pages 1561–1568. MIT Press, 2005.
[36] T. Zhang. Approximation bounds for some sparse kernel regression algorithms. Neural Computation, 14:3013–3042, 2002.
[37] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.