MARS: selecting basis and knots with the empirical Bayes method
Wataru Sakamoto
Division of Mathematical Science, Graduate School of Engineering Science, Osaka
University, 1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, JAPAN
sakamoto@sigmath.es.osaka-u.ac.jp
Summary. The multivariate adaptive regression spline (MARS) method, proposed by Friedman (1991), estimates regression structure, including interaction terms, adaptively with truncated power spline basis functions. However, it uses the generalized cross-validation criterion to add and prune basis functions, and hence it tends to choose so many basis functions that the estimated regression structure may not be easily interpreted. On the other hand, several Bayesian versions of MARS have been proposed, in which the reversible jump MCMC algorithm is used. However, they generate enormous combinations of basis functions, from which it is difficult to obtain a clear interpretation of the regression structure.
An empirical Bayes method for selecting basis functions and knots in MARS is proposed, which takes advantage of both the frequentist model selection approach and the Bayesian approach. A penalized likelihood approach is used to estimate the coefficients for given basis functions, and Akaike's Bayesian information criterion (ABIC) is used to determine the number of basis functions. For an example data set, the proposed method is shown to give an estimate of the regression structure that is relatively simple and easy to interpret.
Key words: Akaike's Bayesian information criterion, estimation of interaction terms, marginal likelihood, multivariate regression, penalized likelihood approach.
1 Introduction
The multivariate adaptive regression spline (MARS) method, proposed by Friedman (1991) [F91], estimates regression structure, including interaction terms, adaptively with truncated power spline basis functions. The basis functions, as well as the positions of the knots selected from the data, give suggestive information for understanding the regression structure, including interaction terms.
However, Friedman's MARS uses the generalized cross-validation criterion to add and prune basis functions. It has been pointed out that cross-validation criteria tend to make the estimated functions relatively rough, because cross-validation is intended to optimize prediction of the responses. In some applications, Friedman's MARS chooses so many basis functions that the estimated regression structure may not be easily interpreted. Moreover, a criterion based on distance in the space of responses does not seem to match the original objective of MARS, which is to explore regression structure. From these points of view, another criterion should be considered.
Recently, some Bayesian versions of MARS have been proposed (see Denison et al., 2002 [DHMS02]). Priors are placed on the number of basis functions, the variables involved in each basis function, the knot positions, the basis coefficients and so on, as well as on some hyper-parameters, and posterior sampling of the estimated functions is then carried out with the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (Green, 1995 [G95]). These Bayesian approaches give good prediction performance in some applications. However, they generate enormous combinations of basis functions, from which it is difficult to obtain a clear interpretation of the regression structure. Moreover, since sampling with RJMCMC is extremely computer-intensive, the analysis is not feasible on a standard PC.
In this paper, we propose an empirical Bayes method for selecting basis functions and knots in MARS, which takes advantage of both the frequentist (or likelihood-based) model selection approach and the Bayesian approach. We use a penalized likelihood approach to estimate the coefficients for given basis functions, with the prior on the basis coefficients regarded as a penalty for complexity. We select the basis functions and the knot positions, as well as the hyper-parameters, by maximizing the Laplace approximation of the marginal likelihood, and we determine the number of basis functions by minimizing Akaike's Bayesian information criterion (ABIC) (Akaike, 1980 [A80]). We illustrate the method with an application to an example data set.
2 The MARS model
For simplicity, suppose that each response $y_i$, $i = 1, \ldots, n$, is observed with $r$ explanatory variables $x_i = (x_{i1}, \ldots, x_{ir})^T$, and that we estimate a function $f$ that satisfies the regression model
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \eqno(1)$$
where $\varepsilon_i$ is an error term.
We assume that the regression function $f$ is represented as a linear combination of basis functions $\phi_k(x)$, $k = 1, \ldots, K$:
$$f(x) = a_0 + \sum_{k=1}^{K} a_k \phi_k(x), \eqno(2)$$
and we estimate the coefficients $a_0, a_1, \ldots, a_K$. Each basis function is a truncated power spline function
$$\phi_k(x) = \prod_{l=1}^{L_k} [s_{kl}(x_{v(k,l)} - c_{kl})]_+^m, \eqno(3)$$
where $[x]_+$ is the positive part of $x$, that is, $[x]_+ = x$ if $x > 0$ and $[x]_+ = 0$ if $x \le 0$, and a given positive integer $m$ is the order of the splines. The number of basis functions $K$, the degree of interaction $L_k$, the sign $s_{kl}$ (which is $+1$ or $-1$), the index $v(k,l)$ of the variable involved in the basis function, and the knot position $c_{kl}$ are all selected with some lack-of-fit criterion based on the data.
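To make (3) concrete, here is a minimal sketch in Python (the authors' own program is in R; the function name and the two-factor example values below are purely illustrative) of how one truncated power basis function with $m = 1$ can be evaluated:

```python
import numpy as np

def truncated_power_basis(X, signs, variables, knots, m=1):
    """Evaluate one basis function of form (3),
    phi(x) = prod_l [ s_l (x_{v(l)} - c_l) ]_+^m, for every row of X."""
    phi = np.ones(X.shape[0])
    for s, v, c in zip(signs, variables, knots):
        phi *= np.maximum(s * (X[:, v] - c), 0.0) ** m
    return phi

# A hypothetical two-factor basis (L_k = 2) involving variables 0 and 2,
# with signs -1 and -1 and knots 96 and 14:
X = np.array([[50.0, 5.0, 3.0],
              [120.0, 8.0, 10.0]])
phi = truncated_power_basis(X, signs=[-1, -1], variables=[0, 2], knots=[96.0, 14.0])
```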
The algorithm of Friedman's (1991) MARS [F91] consists of the following two stages:
• Forward stepwise:
Set the basis function for the constant term as $\phi_0 \equiv 1$.
Supposing that we have $K+1$ basis functions $\phi_0, \phi_1(x), \ldots, \phi_K(x)$, add two new basis functions
$$\phi_{K+1}(x) = \phi_k(x)[+(x_{v(k,l)} - c_{kl})]_+^m \quad \text{and} \quad \phi_{K+2}(x) = \phi_k(x)[-(x_{v(k,l)} - c_{kl})]_+^m,$$
where $\phi_k(x)$ is a parent basis function, $x_{v(k,l)}$ is a variable that is not included in $\phi_k(x)$, and $c_{kl}$ is a knot position ($c_{kl} \in \{x_{i\,v(k,l)}\}_{i=1,\ldots,n}$), all of which are selected to minimize a lack-of-fit criterion.
The addition of basis functions is continued until $K$ exceeds a pre-specified $K_{\max}$.
• Backward stepwise:
Select one basis function (other than $\phi_0$) and prune it so that the lack-of-fit criterion is minimized.
The pruning is continued until the lack-of-fit criterion no longer decreases when any basis function is deleted.
Friedman (1991) [F91] used a version of the generalized cross-validation score, originally proposed by Craven and Wahba (1979) [CW79], as the lack-of-fit criterion. He also suggested using the cost complexity function proposed by Friedman and Silverman (1989) [FS89].
For given basis functions $\phi_k(x)$, $k = 0, 1, \ldots, K$, the least squares estimate of $a_K = (a_0, a_1, \ldots, a_K)^T$ is $\hat{a}_K^{LS} = (\Phi_K^T \Phi_K)^{-1} \Phi_K^T y$, where $y = (y_1, \ldots, y_n)^T$ and $\Phi_K$ is the matrix with $(i,k)$ component $\phi_k(x_i)$. However, multicollinearity arises in $\Phi_K$ as $K$ increases. This motivates the use of regularization methods such as ridge regression.
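As a sketch of this point (assuming a basis matrix `Phi` with columns $\phi_0(x_i), \ldots, \phi_K(x_i)$ built as above; the function names are ours), the least squares fit and a condition-number check that signals when regularization becomes necessary:

```python
import numpy as np

def least_squares_coef(Phi, y):
    """Least squares estimate (Phi^T Phi)^{-1} Phi^T y, computed via a
    linear solve rather than an explicit matrix inverse."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def gram_condition_number(Phi):
    """Condition number of Phi^T Phi; it grows rapidly as nearly collinear
    basis functions are added, motivating ridge-type regularization."""
    return np.linalg.cond(Phi.T @ Phi)
```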
3 Empirical Bayes method for MARS
3.1 Estimation of basis coefficients
In this section we assume that the distribution of the response $y_i$ belongs to the exponential family, that is,
$$p(y_i \mid \theta_i, \psi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\psi} + d(y_i, \psi) \right\}, \quad i = 1, \ldots, n,$$
where $b(\cdot)$ and $d(\cdot)$ are specific functions, $\theta_i$ is a parameter that varies with each observation, and $\psi$ is a nuisance parameter. Note that $\mu_i \equiv \mathrm{E}(y_i \mid \theta_i, \psi) = b'(\theta_i)$ and $\mathrm{var}(y_i \mid \theta_i, \psi) = \psi b''(\theta_i)$. We estimate a function $f$ that satisfies the regression model $G(\mu_i) = f(x_i)$, where $G$ is a specified link function.
For given basis functions $\phi_k(x)$, $k = 0, 1, \ldots, K$, we assume (2) as in Section 2, and consider a normal prior on the coefficient vector $a_K$,
$$p(a_K \mid \Phi_K, \lambda) = (\lambda/2\pi)^{(K+1)/2} \exp\left( -\frac{\lambda}{2} a_K^T a_K \right), \eqno(4)$$
where the positive number $\lambda$ is a hyper-parameter that controls the complexity of the function. This prior is also adopted in some Bayesian approaches (for example, Denison et al., 2002 [DHMS02]). Under the prior (4), the posterior density of $a_K$ is obtained from Bayes' theorem as
$$p(a_K \mid y, \Phi_K, \psi, \lambda) \propto p(y \mid \Phi_K, a_K, \psi)\, p(a_K \mid \Phi_K, \lambda), \eqno(5)$$
where $p(y \mid \Phi_K, a_K, \psi)$ is the density of $y$ for given $\Phi_K$ and $a_K$. Finding the mode $\hat{a}_K$ of the posterior density (5) is equivalent to maximizing the penalized log-likelihood
$$l_P(a_K \mid y, \Phi_K, \psi, \lambda) = \log p(y \mid \Phi_K, a_K, \psi) - \frac{\lambda}{2} a_K^T a_K + \frac{K+1}{2} \log \lambda + \mathrm{const.} \eqno(6)$$
with respect to $a_K$.
In the case of normal responses, where $\varepsilon_i \sim N(0, \sigma^2)$ in (1), the maximum penalized likelihood estimate (MPLE)
$$\hat{a}_K = (\Phi_K^T \Phi_K + \lambda^* I)^{-1} \Phi_K^T y \eqno(7)$$
is obtained directly and is equal to the posterior mean of $a_K$, where $\lambda^* = \sigma^2 \lambda$ and $I$ is the identity matrix. For non-normal responses, the MPLE can be obtained via an iterative algorithm such as Fisher scoring (see, for example, McCullagh and Nelder, 1989 [MN89]; Green and Silverman, 1994 [GS94]). The updating equation for the MPLE of $a_K$ is
$$\hat{a}_K^{\mathrm{new}} = (\Phi_K^T W \Phi_K + \lambda^* I)^{-1} \Phi_K^T W z, \eqno(8)$$
where $\lambda^* = \psi \lambda$, $z = (z_1, \ldots, z_n)^T$ is the working response vector with $z_i = (y_i - \hat{\mu}_i) G'(\hat{\mu}_i) + (\Phi_K \hat{a}_K)_i$, and $W = \mathrm{diag}(w_1, \ldots, w_n)$ is the working weight matrix with $w_i = \{G'(\hat{\mu}_i)^2 b''(\hat{\theta}_i)\}^{-1}$.
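The following is a minimal sketch, in Python rather than the authors' R, of the penalized Fisher scoring update (8) for binary (logistic) responses, for which $\psi = 1$ and hence $\lambda^* = \lambda$; the function name, tolerance and iteration cap are our own choices:

```python
import numpy as np

def penalized_irls_logistic(Phi, y, lam, max_iter=100, tol=1e-8):
    """Maximum penalized likelihood estimate of a_K for logistic responses,
    iterating the update (8) until the coefficients stop changing."""
    n, p = Phi.shape
    a = np.zeros(p)
    for _ in range(max_iter):
        eta = Phi @ a                              # linear predictor (Phi_K a_K)_i
        mu = 1.0 / (1.0 + np.exp(-eta))            # fitted probabilities mu_i
        w = np.clip(mu * (1.0 - mu), 1e-10, None)  # working weights w_i
        z = eta + (y - mu) / w                     # working responses z_i
        a_new = np.linalg.solve(Phi.T @ (w[:, None] * Phi) + lam * np.eye(p),
                                Phi.T @ (w * z))
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a
```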
3.2 Selection of basis functions and knots
To select the knot positions $c = \{c_{kl}\}$ and the combination of basis functions, as well as to estimate the nuisance and hyper-parameters $\psi$ and $\lambda$, we use the marginal likelihood as the lack-of-fit criterion. The marginal likelihood, obtained by integrating $a_K$ out of the joint density in (5), is written as
$$p(y \mid \Phi_K, \psi, \lambda) = \int p(y \mid \Phi_K, a_K, \psi)\, p(a_K \mid \Phi_K, \lambda)\, da_K = \int \exp\{l_P(a_K \mid y, \Phi_K, \psi, \lambda)\}\, da_K. \eqno(9)$$
Maximizing the marginal likelihood (9) can also be regarded as a kind of empirical Bayes method in which non-informative priors are assumed for $\psi$ and $\lambda$.
The exact computation of the integral in (9) is not generally feasible except in the normal response case. Using the Laplace approximation (Tierney and Kadane, 1986 [TK86]; Davison, 1986 [D86]), that is, a Taylor expansion of the penalized log-likelihood (6) around its maximum $\hat{a}_K$, we obtain the approximated marginal log-likelihood
$$l_M(c, \psi, \lambda \mid y) = \log p(y \mid \Phi_K, \psi, \lambda) \approx l_P(\hat{a}_K \mid y, \Phi_K, \psi, \lambda) - \frac{1}{2} \log |H_P(\hat{a}_K)| + \mathrm{const.}, \eqno(10)$$
where $H_P(\hat{a}_K)$ is the negative Hessian of the penalized log-likelihood,
$$H_P(\hat{a}_K) = -\left.\frac{\partial^2 l_P}{\partial a_K \partial a_K^T}\right|_{\hat{a}_K} = \psi^{-1}(\Phi_K^T W \Phi_K + \lambda^* I),$$
in which $W$ is evaluated at convergence of the Fisher scoring algorithm (8). In the normal response case, expression (10) is exact and $H_P(\hat{a}_K) = \sigma^{-2}(\Phi_K^T \Phi_K + \lambda^* I)$.
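As a sketch (continuing the logistic setting above, with $\psi = 1$ and our own function names), the approximated marginal log-likelihood (10) can be evaluated at the converged fit as follows:

```python
import numpy as np

def marginal_loglik_logistic(Phi, y, a_hat, lam):
    """Laplace approximation (10) to log p(y | Phi_K, lambda) for logistic
    responses, up to an additive constant."""
    p = Phi.shape[1]
    eta = Phi @ a_hat
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = np.clip(mu * (1.0 - mu), 1e-10, None)
    # penalized log-likelihood (6) evaluated at its maximum a_hat
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    l_pen = loglik - 0.5 * lam * (a_hat @ a_hat) + 0.5 * p * np.log(lam)
    # negative Hessian H_P(a_hat) = Phi^T W Phi + lambda I  (psi = 1)
    _, logdet = np.linalg.slogdet(Phi.T @ (w[:, None] * Phi) + lam * np.eye(p))
    return l_pen - 0.5 * logdet
```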
We maximize the approximated marginal log-likelihood (10) with respect to the knot positions $c = \{c_{kl}\}$, the variables $\{v(k,l)\}$ involved in the basis functions, and even the combination of basis functions, as well as the nuisance and hyper-parameters $\psi$ and $\lambda$. However, the marginal log-likelihood (10) tends to increase as the number of basis functions $K$ increases. Our idea is therefore to select $K$ and the combination of basis functions that minimize Akaike's Bayesian information criterion (ABIC) (Akaike, 1980 [A80]),
$$\mathrm{ABIC} = -2\, l_M(c, \psi, \lambda \mid y) + 2(q + 2), \eqno(11)$$
in the forward stepwise stage, where $q$ is the number of knots $c$ involved in the basis functions. In the backward stepwise stage, we only maximize the marginal log-likelihood (10), since the knots $c$ have already been estimated.
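The criterion (11) itself is trivial to compute once $l_M$ is available (for example, from a routine such as the sketch above); a one-line sketch with our own function name:

```python
def abic(l_marginal, n_knots):
    """ABIC = -2 l_M + 2(q + 2), criterion (11), where q is the number of knots."""
    return -2.0 * l_marginal + 2.0 * (n_knots + 2)
```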
3.3 Algorithm for empirical Bayes MARS
Our algorithm for the empirical Bayes MARS is summarized as follows:
• Forward stepwise:
Set the basis function for the constant term as $\phi_0 \equiv 1$.
Supposing that we have $K+1$ basis functions $\phi_0, \phi_1(x), \ldots, \phi_K(x)$, add two new basis functions
$$\phi_{K+1}(x) = \phi_k(x)[+(x_{v(k,l)} - c_{kl})]_+^m \quad \text{and} \quad \phi_{K+2}(x) = \phi_k(x)[-(x_{v(k,l)} - c_{kl})]_+^m,$$
where the parent basis $\phi_k(x)$, the variable $x_{v(k,l)}$ and the knot position $c_{kl}$ are selected to maximize the marginal log-likelihood (10).
After the addition of basis functions has been continued until $K$ exceeds the pre-specified $K_{\max}$, determine $K$ so that the selected model attains the minimum of the ABIC score (11).
• Backward stepwise:
Select one basis function (other than $\phi_0$) and prune it so that the marginal log-likelihood (10) is maximized.
The pruning is continued until (10) no longer increases when any basis function is deleted.
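A high-level sketch of one forward step, searching over parent bases, variables and knots taken at observed values; the helper `marginal_loglik(Phi, y, lam)` is assumed to fit the coefficients (for example, by the IRLS routine above) and return the approximated $l_M$, all names are ours, and for brevity the restriction that the variable must not already appear in the parent basis is only noted in a comment:

```python
import numpy as np

def forward_step(X, y, bases, lam, marginal_loglik):
    """Add the pair of new basis functions (with m = 1) that maximizes the
    marginal log-likelihood (10). `bases` is the current n x (K+1) basis matrix."""
    best_lm, best_Phi = -np.inf, None
    n, r = X.shape
    for k in range(bases.shape[1]):            # candidate parent phi_k
        parent = bases[:, k]
        for v in range(r):                     # candidate variable x_v
            # (in the full algorithm, skip v if it already appears in phi_k)
            for c in np.unique(X[:, v]):       # candidate knot at an observed value
                plus = parent * np.maximum(X[:, v] - c, 0.0)
                minus = parent * np.maximum(c - X[:, v], 0.0)
                Phi = np.column_stack([bases, plus, minus])
                lm = marginal_loglik(Phi, y, lam)
                if lm > best_lm:
                    best_lm, best_Phi = lm, Phi
    return best_Phi, best_lm
```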
4 An example
We applied our empirical Bayes method to the kyphosis data set, which was also analyzed by Hastie and Tibshirani (1990) [HT90] and Chambers and Hastie (1992) [CH92]. Instead of (1), we fitted the logistic regression model
$$\log \frac{p_i}{1 - p_i} = f(x_i), \quad i = 1, \ldots, n,$$
where $p_i = \Pr(y_i = 1)$, $y_i$ is the binary variable for the $i$-th patient indicating whether kyphosis remains after surgery, and $x_i = (x_{i1}, x_{i2}, x_{i3})^T$ are covariates indicating [1] the age (in months), [2] the starting vertebra level of the surgery and [3] the number of vertebra levels involved, respectively. The order of the spline was set to $m = 1$ for simplicity.
The estimated function obtained with MARS for Windows 2.0, available from Salford Systems, is
$$\hat{f}(x) = 0.025 + 0.465[x_3 - 2]_+ - 0.020[x_2 - 5]_+[x_3 - 2]_+ - 0.003[139 - x_1]_+[x_3 - 2]_+$$
$$\qquad + 0.008[x_1 - 140]_+[x_3 - 3]_+ - 0.008[x_1 - 87]_+[x_3 - 3]_+. \eqno(12)$$
Using (12), 73 out of 83 (about 88%) observations are predicted correctly with the threshold set to 0.5, that is, $\hat{p}_i = [1 + \exp\{-\hat{f}(x_i)\}]^{-1} \ge 0.5$. Although (12) gives good prediction performance, it is rather complicated, and so it is hard to obtain useful information from it.
Table 1 shows the stepwise process of the empirical Bayes MARS applied to the kyphosis data. The program implementing our approach was written in the R language. The rightmost column in Table 1 gives the indices of the variables involved in the added basis functions. The ABIC score is minimized when seven basis functions are used in the forward stepwise stage, and two basis functions are pruned from the seven in the backward stepwise stage.
The estimated function obtained with our empirical Bayes approach is
$$\hat{f}(x) = 1.02 - 0.369[x_2 - 6]_+ - 0.105[6 - x_2]_+ - 0.00285[96 - x_1]_+ - 0.00191[96 - x_1]_+[14 - x_3]_+. \eqno(13)$$
Table 1. Empirical Bayes MARS applied to the kyphosis data: the stepwise process. The asterisk marks the minimum ABIC.

  K+1      l_M        ABIC       log10 λ    v(k,·)
Forward stepwise:
   1     −44.725     91.450
   3     −37.841     79.682     −1.16       2
   5     −36.289     78.578     −1.43       1
   7     −34.713     77.426 *   −1.68       1, 3
   9     −34.186     78.372     −1.67       1, 3
  11     −33.918     79.836     −1.76       1, 3
Backward stepwise:
   6     −34.713                −1.68
   5     −34.136                −1.43
Fig. 1. Empirical Bayes MARS applied to the kyphosis data: the estimated probability plotted against $x_2$ (start). The solid line is drawn at $x_1 = 84.7$ and $x_3 = 4.2$ (both mean values), the dashed line at $x_1 = 1$ (minimum) and $x_3 = 4.2$, and the dotted line at $x_1 \ge 96$ and $x_3 = 4.2$.
The estimated function (13) is similar to those obtained with parametric regression approaches in the literature, such as Chambers and Hastie (1992) [CH92]. The function (13) is more parsimonious, and hence easier to understand, than (12). Furthermore, 71 out of 86 (about 86%) observations are predicted correctly using (13), so it also gives good prediction performance. Figure 1 shows the estimated probability as a function of $x_2$. The figure suggests that both the age and the start level are strongly related to the risk of kyphosis.
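To show how predictions are read off a fitted function such as (13), here is a small sketch in Python with the coefficients of (13) hard-coded (the function names and the trial covariate values are ours):

```python
import numpy as np

def f_hat(x1, x2, x3):
    """The estimated regression function (13) on the logit scale."""
    return (1.02
            - 0.369 * np.maximum(x2 - 6, 0.0)
            - 0.105 * np.maximum(6 - x2, 0.0)
            - 0.00285 * np.maximum(96 - x1, 0.0)
            - 0.00191 * np.maximum(96 - x1, 0.0) * np.maximum(14 - x3, 0.0))

def predict_kyphosis(x1, x2, x3, threshold=0.5):
    """Fitted probability p_hat = 1 / (1 + exp(-f_hat)) and the 0.5-threshold class."""
    p = 1.0 / (1.0 + np.exp(-f_hat(x1, x2, x3)))
    return p, p >= threshold

# Example: a patient of age 30 months, start level 10, 4 vertebrae involved.
p, kyphosis_present = predict_kyphosis(30.0, 10.0, 4.0)
```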
5 Concluding remarks
We have proposed an empirical Bayes approach for the MARS model. For the data set analyzed here, it provides an estimate of the regression structure that is relatively parsimonious and easy to understand. Moreover, its prediction performance is as good as that of Friedman's MARS.
A referee commented that computational issues were not discussed in detail in the first draft. We are now developing a Fortran program for the empirical Bayes approach, in order to provide considerably faster computation and, furthermore, to handle data sets that are large both in sample size and in the number of variables.
We are studying how well the empirical Bayes MARS extracts regression structure, including interaction terms. We expect that the empirical Bayes MARS will be more effective at preventing the multicollinearity that occurs in Friedman's MARS, especially when applied to large data sets. Furthermore, although we used a simple prior on the basis coefficients in this paper, more elaborate priors could be used in connection with a roughness penalty on the regression function.
Acknowledgement
I am grateful to the referees for their useful comments on the paper. This research was supported by MEXT Grant-in-Aid for Young Scientists (B), No. 17700280 (2005–).
References
[A80] Akaike, H.: Likelihood and Bayes procedure. In: Bernardo, J. M. et al. (eds.),
Bayesian Statistics, pp. 143–166, University Press, Valencia (1980)
[CH92] Chambers, J. M. and Hastie, T. J.: Statistical Models in S. Pacific Grove,
California (1992)
[CW79] Craven, P. and Wahba, G.: Smoothing noisy data with spline functions.
Numer. Math. 31, 377–403 (1979)
[D86] Davison, A. C.: Approximate predictive likelihood. Biometrika 73, 323–332
(1986)
[DHMS02] Denison, D. G. T., Holmes, C. C., Mallick, B. K. and Smith, A. F. M.:
Bayesian Methods for Nonlinear Classification and Regression. John Wiley
and Sons (2002)
[F91] Friedman, J. H.: Multivariate adaptive regression splines (with discussion).
Ann. Statist. 19, 1–141 (1991)
[FS89] Friedman, J. H. and Silverman, B. W.: Flexible parsimonious smoothing and
additive modeling. Technometrics 31, 3–39 (1989)
[G95] Green, P. J.: Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika 82, 711–732 (1995)
[GS94] Green, P. J. and Silverman, B. W.: Nonparametric Regression and Generalized Linear Models. Chapman and Hall, London (1994)
[HT90] Hastie, T. and Tibshirani, R.: Generalized Additive Models. Chapman and
Hall, London (1990)
[MN89] McCullagh, P. and Nelder, J. A.: Generalized Linear Models (2nd edition).
Chapman and Hall, London (1989)
[TK86] Tierney, L. and Kadane, J. B.: Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81, 82–86 (1986)