MARS: selecting basis and knots with the empirical Bayes method

Wataru Sakamoto

Division of Mathematical Science, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, JAPAN
sakamoto@sigmath.es.osaka-u.ac.jp

Summary. The multivariate adaptive regression spline (MARS), proposed by Friedman (1991), estimates regression structure, including interaction terms, adaptively with truncated power spline basis functions. However, it adopts the generalized cross-validation criterion to add and prune basis functions, and hence it tends to choose so many basis functions that the estimated regression structure may not be easily interpreted. On the other hand, some Bayesian approaches to MARS have been proposed, in which the reversible jump MCMC algorithm is adopted. However, they generate enormous combinations of basis functions, from which it would be difficult to obtain a clear interpretation of the regression structure. An empirical Bayes method to select basis functions and knots in MARS is proposed, taking advantage of both the frequentist model selection approach and the Bayesian approach. A penalized likelihood approach is used to estimate coefficients for given basis functions, and the Akaike Bayes information criterion (ABIC) is used to determine the number of basis functions. It is shown that the proposed method yields an estimate of the regression structure that is relatively simple and easy to interpret for an example data set.

Key words: Akaike Bayes information criterion, estimation of interaction terms, marginal likelihood, multivariate regression, penalized likelihood approach.

1 Introduction

The multivariate adaptive regression spline (MARS), proposed by Friedman (1991) [F91], estimates regression structure, including interaction terms, adaptively with truncated power spline basis functions. The basis functions, as well as the positions of knots selected from the data, give suggestive information for understanding regression structure including interaction terms. However, Friedman's MARS adopts the generalized cross-validation criterion to add and prune basis functions. It has been pointed out that the cross-validation criterion tends to make estimated functions relatively rough, because cross-validation is intended to optimize prediction of responses. In some applications, Friedman's MARS often chooses so many basis functions that the estimated regression structure may not be easily interpreted. Moreover, a criterion based on distance in the space of responses does not seem to match the original objective of MARS, which is to explore regression structure. From this point of view, another criterion should be considered.

Recently, some Bayesian approaches to MARS have been proposed (see Denison et al., 2002 [DHMS02]). Priors are prescribed on the number of basis functions, the variables involved in each basis function, the positions of knots, the basis coefficients and so on, as well as on some hyper-parameters, and then posterior sampling of the estimated functions is conducted with the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (Green, 1995 [G95]). The Bayesian approaches provide good prediction performance in some applications. However, they generate enormous combinations of basis functions, from which it would be difficult to obtain a clear interpretation of the regression structure.
Moreover, since sampling with the RJMCMC is extremely computer-intensive, the analysis may not be feasible on a PC.

In this paper, we propose an empirical Bayes method to select basis functions and knots in MARS, taking advantage of both the frequentist (or likelihood-based) model selection approach and the Bayesian approach. We use a penalized likelihood approach to estimate coefficients for given basis functions, with the prior on the basis coefficients considered as a penalty for complexity. We select the basis functions and the positions of knots, as well as the hyper-parameters, by maximizing the Laplace approximation of the marginal likelihood, and determine the number of basis functions by minimizing the Akaike Bayes information criterion (ABIC) (Akaike, 1980 [A80]). We show an application of our method to an example data set.

2 The MARS model

For simplicity, suppose that each response y_i, i = 1, ..., n, is observed with r explanatory variables x_i = (x_{i1}, ..., x_{ir})^T, and that we estimate a function f that satisfies the regression model

    y_i = f(x_i) + \epsilon_i,   i = 1, ..., n,                                  (1)

where \epsilon_i is an error term. We assume that the regression function f is represented as a linear combination of basis functions \phi_k(x), k = 1, ..., K:

    f(x) = a_0 + \sum_{k=1}^{K} a_k \phi_k(x),                                   (2)

and we estimate the coefficients a_0, a_1, ..., a_K. Each basis function is a truncated power spline function

    \phi_k(x) = \prod_{l=1}^{L_k} [s_{kl}(x_{v(k,l)} - c_{kl})]_+^m,              (3)

where [x]_+ is the positive part of x, that is, [x]_+ = x if x > 0 and [x]_+ = 0 if x <= 0, and a given positive integer m is the order of the splines. The number of basis functions K, the degree of interaction L_k, the sign s_{kl}, which is +1 or -1, the index of the variable involved in the basis function v(k,l), and the position of a knot c_{kl} are all selected with some lack-of-fit criterion based on the data.

The algorithm of Friedman's (1991) MARS [F91] is composed of the following two sections:

• Forward stepwise: Set the basis for a constant term as \phi_0 \equiv 1. Supposing that we have K + 1 basis functions \phi_0, \phi_1(x), ..., \phi_K(x), add two new basis functions

    \phi_{K+1}(x) = \phi_k(x)[+(x_{v(k,l)} - c_{kl})]_+^m   and
    \phi_{K+2}(x) = \phi_k(x)[-(x_{v(k,l)} - c_{kl})]_+^m,

  where \phi_k(x) is a parent basis function, x_{v(k,l)} is a variable that is not included in \phi_k(x) and c_{kl} is a knot position (c_{kl} \in \{x_{i,v(k,l)}\}_{i=1,...,n}), all of which are selected so that they minimize a lack-of-fit criterion. The addition of basis functions is continued until K becomes greater than a pre-specified K_max.
• Backward stepwise: Select one basis function (except \phi_0) and prune it so that the lack-of-fit criterion is minimized. The pruning is continued until the lack-of-fit criterion no longer decreases when any basis function is deleted.

Friedman (1991) [F91] used a version of the generalized cross-validation score, originally proposed by Craven and Wahba (1979) [CW79], as a lack-of-fit criterion. He also suggested using a cost complexity function proposed by Friedman and Silverman (1989) [FS89].

For given basis functions \phi_k(x), k = 0, 1, ..., K, we obtain the least squares estimate of a_K = (a_0, a_1, ..., a_K)^T, \hat{a}_K^{LS} = (\Phi_K^T \Phi_K)^{-1} \Phi_K^T y, where y = (y_1, ..., y_n)^T and \Phi_K is the matrix with (i, k) component \phi_k(x_i). However, the problem of multicollinearity can occur in the matrix \Phi_K as K increases. This fact motivates the use of regularization methods such as ridge regression.
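To make the construction in (2)–(3) and the least squares estimate concrete, the following minimal sketch in R (the language used for our program in Section 4) builds a small basis matrix \Phi_K from truncated power terms of order m = 1 and computes the least squares estimate. The knots, signs and variable indices are illustrative placeholders rather than values chosen by the MARS search, and the ridge-type estimate at the end merely anticipates the regularization mentioned above.

```r
## Minimal illustrative sketch (R): truncated power spline basis functions of
## order m = 1 and the (possibly ridge-regularized) least squares estimate.
## Knots, signs and variable indices below are arbitrary placeholders.

tp <- function(x, knot, s = +1, m = 1) pmax(s * (x - knot), 0)^m  # [s(x - c)]_+^m

set.seed(1)
n <- 100
x <- cbind(runif(n), runif(n))                    # r = 2 explanatory variables
y <- 1 + 2 * tp(x[, 1], 0.4) -
  1.5 * tp(x[, 1], 0.4) * tp(x[, 2], 0.6, s = -1) + rnorm(n, sd = 0.2)

## Basis matrix Phi_K: constant term plus two hand-picked basis functions,
## one main-effect term and one degree-two interaction term as in (3)
Phi <- cbind(1,
             tp(x[, 1], 0.4),
             tp(x[, 1], 0.4) * tp(x[, 2], 0.6, s = -1))

## Least squares estimate a_LS = (Phi' Phi)^{-1} Phi' y
a_ls <- solve(crossprod(Phi), crossprod(Phi, y))

## Ridge-type estimate (Phi' Phi + lambda* I)^{-1} Phi' y, which guards
## against the multicollinearity that appears as K grows
lambda_star <- 0.1
a_ridge <- solve(crossprod(Phi) + lambda_star * diag(ncol(Phi)),
                 crossprod(Phi, y))
```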
3 Empirical Bayes method for MARS

3.1 Estimation of basis coefficients

In this section we assume that the distribution of the response y_i belongs to the exponential family, that is,

    p(y_i | \theta_i, \psi) = exp\{ [y_i \theta_i - b(\theta_i)] / \psi + d(y_i, \psi) \},   i = 1, ..., n,

where b(·) and d(·, ·) are specific functions, \theta_i is a parameter, which varies with each observation, and \psi is a nuisance parameter. Note that \mu_i \equiv E(y_i | \theta_i, \psi) = b'(\theta_i) and var(y_i | \theta_i, \psi) = \psi b''(\theta_i). We estimate a function f that satisfies the regression model G(\mu_i) = f(x_i), where G is a specific link function.

For given basis functions \phi_k(x), k = 0, 1, ..., K, we assume (2) as in Section 2, and consider a normal prior on the coefficient vector a_K,

    p(a_K | \Phi_K, \lambda) = (\lambda / 2\pi)^{(K+1)/2} exp\{ -(\lambda/2) a_K^T a_K \},      (4)

where the positive number \lambda is a hyper-parameter, which controls the complexity of the function. This prior is also adopted in some Bayesian approaches (for example, Denison et al., 2002 [DHMS02]). Under the prior (4), the posterior density of a_K is obtained, from Bayes' theorem, as

    p(a_K | y, \Phi_K, \psi, \lambda) \propto p(y | \Phi_K, a_K, \psi) p(a_K | \Phi_K, \lambda),      (5)

where p(y | \Phi_K, a_K, \psi) is the density of y for given \Phi_K and a_K. Finding the mode \hat{a}_K of the posterior density (5) of a_K is equivalent to maximizing the penalized log-likelihood

    l_P(a_K | y, \Phi_K, \psi, \lambda) = log p(y | \Phi_K, a_K, \psi) - (\lambda/2) a_K^T a_K + ((K+1)/2) log \lambda + const.      (6)

with respect to a_K. In the case of normal responses, where \epsilon_i ~ N(0, \sigma^2) in (1), we obtain the maximum penalized likelihood estimate (MPLE)

    \hat{a}_K = (\Phi_K^T \Phi_K + \lambda^* I)^{-1} \Phi_K^T y      (7)

directly, which is equal to the posterior mean of a_K, where \lambda^* = \sigma^2 \lambda and I is the identity matrix. In the non-normal case, we can obtain the MPLE via an iterative algorithm such as the Fisher scoring method (see, for example, McCullagh and Nelder, 1989 [MN89]; Green and Silverman, 1994 [GS94]). The updating equation for obtaining the MPLE of a_K becomes

    \hat{a}_K^{new} = (\Phi_K^T W \Phi_K + \lambda^* I)^{-1} \Phi_K^T W z,      (8)

where \lambda^* = \psi \lambda, z = (z_1, ..., z_n)^T is the working response vector with z_i = (y_i - \hat{\mu}_i) G'(\hat{\mu}_i) + (\Phi_K \hat{a}_K)_i, and W = diag(w_1, ..., w_n) is the working weight matrix with w_i = \{G'(\hat{\mu}_i)^2 b''(\hat{\theta}_i)\}^{-1}.

3.2 Selection of basis functions and knots

To select the knot positions, c = \{c_{kl}\}, and the combination of basis functions, as well as to estimate the nuisance and hyper-parameters, \psi and \lambda, we consider the marginal likelihood as a lack-of-fit criterion. The marginal likelihood, in which a_K is integrated out of the posterior density (5), is written as

    p(y | \Phi_K, \psi, \lambda) = \int p(y | \Phi_K, a_K, \psi) p(a_K | \Phi_K, \lambda) da_K
                                 = \int exp\{ l_P(a_K | y, \Phi_K, \psi, \lambda) \} da_K.      (9)

The method of maximizing the marginal likelihood (9) can also be regarded as a kind of empirical Bayes method in which non-informative priors are assumed for \psi and \lambda.

The exact computation of the integral in (9) is not generally feasible except in the normal response case. Using the Laplace approximation (Tierney and Kadane, 1986 [TK86]; Davison, 1986 [D86]), that is, the Taylor expansion of the penalized log-likelihood (6) around its maximum \hat{a}_K, we obtain an approximated marginal log-likelihood

    l_M(c, \psi, \lambda | y) = log p(y | \Phi_K, \psi, \lambda)
                              \approx l_P(\hat{a}_K | y, \Phi_K, \psi, \lambda) - (1/2) log |H_P(\hat{a}_K)| + const.,      (10)

where H_P(\hat{a}_K) is the negative Hessian of the penalized log-likelihood,

    H_P(\hat{a}_K) = -[\partial^2 l_P / \partial a_K \partial a_K^T]_{a_K = \hat{a}_K} = \psi^{-1} (\Phi_K^T W \Phi_K + \lambda^* I),

in which W is evaluated at convergence of the Fisher scoring algorithm (8).
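The following R sketch illustrates, for a fixed basis matrix, the penalized Fisher scoring update (8) and the Laplace-approximated marginal log-likelihood (10) in the Bernoulli (logistic) case, where \psi = 1, w_i = \mu_i(1 - \mu_i) and z_i = \eta_i + (y_i - \mu_i)/w_i under the canonical link. The names Phi, y and lambda are assumed inputs; the knot and basis search of Section 3.3 is not part of this sketch.

```r
## Minimal sketch (R): maximum penalized likelihood estimation by Fisher
## scoring (8) and the Laplace approximation (10) of the marginal
## log-likelihood, for a logistic response and a FIXED basis matrix Phi.
fit_penalized_logistic <- function(Phi, y, lambda, max_iter = 50, tol = 1e-8) {
  K1 <- ncol(Phi)                        # K + 1 coefficients
  a  <- rep(0, K1)
  for (iter in seq_len(max_iter)) {
    eta <- drop(Phi %*% a)
    mu  <- plogis(eta)                   # inverse logit
    w   <- mu * (1 - mu)                 # working weights (psi = 1)
    z   <- eta + (y - mu) / w            # working responses
    a_new <- drop(solve(crossprod(Phi, w * Phi) + lambda * diag(K1),
                        crossprod(Phi, w * z)))
    if (max(abs(a_new - a)) < tol) { a <- a_new; break }
    a <- a_new
  }
  eta <- drop(Phi %*% a); mu <- plogis(eta); w <- mu * (1 - mu)
  ## Penalized log-likelihood (6) at the mode (additive constants dropped)
  lP <- sum(dbinom(y, 1, mu, log = TRUE)) -
    lambda / 2 * sum(a^2) + (K1 / 2) * log(lambda)
  ## Negative Hessian H_P and Laplace-approximated marginal log-likelihood (10)
  H  <- crossprod(Phi, w * Phi) + lambda * diag(K1)
  lM <- lP - 0.5 * as.numeric(determinant(H, logarithm = TRUE)$modulus)
  list(coef = a, marginal_loglik = lM)
}
```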
In the normal response case, the expression (10) is exact, and H_P(\hat{a}_K) = \sigma^{-2} (\Phi_K^T \Phi_K + \lambda^* I).

We maximize the approximated marginal log-likelihood (10) with respect to the knot positions c = \{c_{kl}\}, the variables \{v(k,l)\} involved in the basis functions and even the combination of basis functions, as well as the nuisance and hyper-parameters \psi and \lambda. However, the marginal log-likelihood (10) tends to increase as the number of basis functions K increases. Our idea is to select K and the combination of basis functions that minimize the Akaike Bayes information criterion (ABIC) (Akaike, 1980 [A80]),

    ABIC = -2 l_M(c, \psi, \lambda | y) + 2(q + 2),      (11)

in the forward stepwise section, where q is the number of knots c involved in the basis functions. In the backward stepwise section, we only maximize the marginal log-likelihood (10), as the knots c have already been estimated.

3.3 Algorithm for empirical Bayes MARS

Our algorithm for the empirical Bayes MARS is summarized as follows:

• Forward stepwise: Set the basis for a constant term as \phi_0 \equiv 1. Supposing that we have K + 1 basis functions \phi_0, \phi_1(x), ..., \phi_K(x), add two new basis functions

    \phi_{K+1}(x) = \phi_k(x)[+(x_{v(k,l)} - c_{kl})]_+^m   and
    \phi_{K+2}(x) = \phi_k(x)[-(x_{v(k,l)} - c_{kl})]_+^m,

  where the parent basis \phi_k(x), the variable x_{v(k,l)} and the knot position c_{kl} are selected so that they maximize the marginal log-likelihood (10). After the addition of basis functions has been continued until K becomes greater than a pre-specified K_max, determine K so that the selected model attains the minimum of the ABIC score (11).
• Backward stepwise: Select one basis function (except \phi_0) and prune it so that the marginal log-likelihood (10) is maximized. The pruning is continued until (10) no longer increases when any basis function is deleted.

4 An example

We applied our empirical Bayes method to the kyphosis data set, which was also analyzed by Hastie and Tibshirani (1990) [HT90] and Chambers and Hastie (1992) [CH92]. We fitted the logistic regression model, instead of (1),

    log\{ p_i / (1 - p_i) \} = f(x_i),   i = 1, ..., n,

where p_i = Pr(y_i = 1), y_i is the binary variable for the i-th patient that indicates whether the kyphosis remains after the surgery, and x_i = (x_{i1}, x_{i2}, x_{i3})^T are covariates that indicate [1] the age (in months), [2] the starting vertebra level of the surgery and [3] the number of vertebra levels involved, respectively. The order of the spline was set to m = 1 for simplicity.

The resulting estimated function with MARS for Windows 2.0, available from Salford Systems, becomes

    \hat{f}(x) = 0.025 + 0.465 [x_3 - 2]_+ - 0.020 [x_2 - 5]_+ [x_3 - 2]_+
                 - 0.003 [139 - x_1]_+ [x_3 - 2]_+ + 0.008 [x_1 - 140]_+ [x_3 - 3]_+
                 - 0.008 [x_1 - 87]_+ [x_3 - 3]_+.      (12)

Using (12), 73 out of 83 (about 88%) observations are predicted correctly with the threshold set to 0.5, that is, \hat{p}_i = [1 + exp\{-\hat{f}(x_i)\}]^{-1} >= 0.5. Although (12) gives good performance for prediction, it is rather complicated, and so it is hard to obtain any useful information from it.

Table 1 shows the stepwise process of the empirical Bayes MARS applied to the kyphosis data. The program for applying our approach was written in the R language. The rightmost column in Table 1 indicates the indices of the variables involved in the added basis functions. The ABIC score is minimized when seven basis functions are used in the forward stepwise section, and two basis functions are pruned out of the seven in the backward stepwise section.

Table 1. Empirical Bayes MARS applied to the kyphosis data: the stepwise process. An asterisk marks the minimum of the ABIC score.

    K + 1    l_M        ABIC        log10 lambda    v(k, ·)
    Forward stepwise:
     1       -44.725    91.450
     3       -37.841    79.682      -1.16           2
     5       -36.289    78.578      -1.43           1
     7       -34.713    77.426 *    -1.68           1, 3
     9       -34.186    78.372      -1.67           1, 3
    11       -33.918    79.836      -1.76           1, 3
    Backward stepwise:
     6       -34.713                -1.68
     5       -34.136                -1.43
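To illustrate how candidate models are scored and how the thresholded predictions are counted, a short R sketch follows. The function fit_penalized_logistic is the hypothetical routine sketched in Section 3, and q denotes the number of knots in the candidate model; these names, and the commented usage lines, are assumptions of the sketch rather than output of MARS for Windows or of our program.

```r
## Minimal sketch (R): the ABIC score (11) used to choose K in the forward
## stepwise section, and the 0.5-threshold rule used to count correctly
## predicted observations, p_hat_i = [1 + exp{-f_hat(x_i)}]^{-1} >= 0.5.
abic <- function(marginal_loglik, q) -2 * marginal_loglik + 2 * (q + 2)

predict_binary <- function(Phi, a, threshold = 0.5) {
  p_hat <- plogis(drop(Phi %*% a))       # estimated probabilities
  as.integer(p_hat >= threshold)
}

## Hypothetical usage, assuming Phi and y for a candidate model are given
## (the log10 lambda value -1.68 is taken from Table 1 for illustration):
## fit <- fit_penalized_logistic(Phi, y, lambda = 10^(-1.68))
## abic(fit$marginal_loglik, q)                 # q = number of knots involved
## mean(predict_binary(Phi, fit$coef) == y)     # proportion predicted correctly
```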
The resulting estimated function with our empirical Bayes approach becomes

    \hat{f}(x) = 1.02 - 0.369 [x_2 - 6]_+ - 0.105 [6 - x_2]_+
                 - 0.00285 [96 - x_1]_+ - 0.00191 [96 - x_1]_+ [14 - x_3]_+,      (13)

which is similar to those obtained with parametric regression approaches in the literature, such as Chambers and Hastie (1992) [CH92]. The function (13) is more parsimonious and so easier to understand than (12). Furthermore, 71 out of 83 (about 86%) observations are predicted correctly using (13), and so it also gives good prediction performance. Figure 1 shows the estimated probability as a function of x_2. The figure suggests that both the age and the start level are highly related to the risk of kyphosis.

[Figure omitted.] Fig. 1. Empirical Bayes MARS applied to the kyphosis data: the estimated probability (prob) plotted against the start level x_2. The solid line is drawn at x_1 = 84.7 and x_3 = 4.2 (both mean values), the dashed line at x_1 = 1 (minimum) and x_3 = 4.2, and the dotted line at x_1 >= 96 and x_3 = 4.2.

5 Concluding remarks

We have proposed an empirical Bayes approach for the MARS model. It provides an estimate of the regression structure that is relatively parsimonious and easy to understand for the data set to which we applied it. Also, the prediction performance is as good as that of Friedman's MARS.

A referee commented that computational issues were not discussed in detail in the first draft. We are now developing a Fortran program for the empirical Bayes approach, to provide much faster computation and, furthermore, to apply it to data sets that are large both in sample size and in the number of variables. We are studying the performance of the empirical Bayes MARS with regard to how appropriately it extracts regression structure including interaction terms. We expect that the empirical Bayes MARS would be more advantageous in preventing the multicollinearity that occurs in Friedman's MARS, especially when applied to large data sets. Furthermore, although we used a simple prior on the basis coefficients in this paper, more elaborate priors could be used in connection with a roughness penalty on the regression function.

Acknowledgement

I am grateful to the referees for their useful comments on the paper. This research is supported by MEXT Grant-in-Aid for Young Scientists (B), No. 17700280 (2005–).

References

[A80] Akaike, H.: Likelihood and Bayes procedure. In: Bernardo, J. M. et al. (eds.), Bayesian Statistics, pp. 143–166. University Press, Valencia (1980)
[CH92] Chambers, J. M. and Hastie, T. J.: Statistical Models in S. Wadsworth & Brooks/Cole, Pacific Grove, California (1992)
[CW79] Craven, P. and Wahba, G.: Smoothing noisy data with spline functions. Numer. Math. 31, 377–403 (1979)
[D86] Davison, A. C.: Approximate predictive likelihood. Biometrika 73, 323–332 (1986)
[DHMS02] Denison, D. G. T., Holmes, C. C., Mallick, B. K. and Smith, A. F. M.: Bayesian Methods for Nonlinear Classification and Regression. John Wiley and Sons (2002)
[F91] Friedman, J. H.: Multivariate adaptive regression splines (with discussion). Ann. Statist. 19, 1–141 (1991)
[FS89] Friedman, J. H. and Silverman, B. W.: Flexible parsimonious smoothing and additive modeling. Technometrics 31, 3–39 (1989)
[G95] Green, P. J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732 (1995)
[GS94] Green, P. J. and Silverman, B. W.: Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London (1994)
[HT90] Hastie, T. and Tibshirani, R.: Generalized Additive Models. Chapman and Hall, London (1990)
[MN89] McCullagh, P. and Nelder, J. A.: Generalized Linear Models (2nd edition). Chapman and Hall, London (1989)
[TK86] Tierney, L. and Kadane, J. B.: Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81, 82–86 (1986)