A model selection criterion for functional PLS logit regression

Aguilera, A.M.(1), Escabias, M.(2), and Valderrama, M.J.(3)
(1) Department of Statistics and O.R. University of Granada. Faculty of Sciences.
    Campus de Fuentenueva. 18071-Granada. aaguiler@ugr.es
(2) Department of Statistics and O.R. University of Granada. Faculty of Pharmacy.
    Campus de Cartuja. 18071-Granada. escabias@ugr.es
(3) Department of Statistics and O.R. University of Granada. Faculty of Pharmacy.
    Campus de Cartuja. 18071-Granada. valderra@ugr.es
Summary. To estimate the parameter function of the functional logit regression model
from discrete-time observations of the functional predictor, it is usual to approximate
the sample curves and the parameter function in a finite-dimensional space generated by
a basis. The functional model then reduces to a multiple logit model with highly
dependent explanatory variables. As a consequence of this multicollinearity, the
estimate of the parameter function is very inaccurate and its interpretation in terms of
odds ratios may be erroneous. In this paper we apply a PLS logit regression algorithm
and introduce a new criterion for selecting the optimum PLS components, which provides
an optimum reconstruction of the parameter function with a small number of predictor
variables.
Key words: Logit regression, functional data analysis, PLS components
1 Introduction
Functional data analysis (FDA) methods have received much attention in recent years,
generalizing many statistical techniques to the functional setting, where the data are
a sample of curves instead of vectors as in classical multivariate data analysis
(Ramsay and Silverman, 2005).
In the general context of FDA, logistic regression has been recently extended
to model a binary response in terms of a functional predictor whose realizations
are curves. Different methods for estimating the parameter function of this model
have been developed. In the more general context of functional generalized linear
models, James (2002) used the EM algorithm to estimate the model, whereas
Müller and Stadtmüller (2005) employed an orthogonal expansion of the functional
predictor. The approximation of the parameter function and the sample curves in
terms of the same basis of functions has been considered in Ratcliffe et al. (2002)
and Escabias et al. (2004). This last paper introduces in addition different functional
principal component analysis (FPCA) approaches for solving the multicollinearity
problem that appears in this case and improving the parameter function estimation.
An odds ratio interpretation of the relationship between a binary response and a
functional predictor in terms of the estimated parameter function of the functional
logit model has been established in Escabias et al. (2005), where an application with
climatological data has been developed.
Because principal components do not take into account the relationship between the
response and the predictors, Wold et al. (1983) introduced the PLS regression method as
an alternative to principal component regression (PCR) for solving the multicollinearity
problem in linear regression. This method consists of taking as explanatory variables a
set of uncorrelated latent variables that maximize the covariance between the response
and the original predictors. Different methods for solving the multicollinearity problem
in linear regression (PLS, PCR, ridge regression and variable subset selection) were
compared in Frank and Friedman (1993), concluding that PLS is superior to all of them.
PLS has also been extended to the case of generalized linear models by Bastien et al.
(2005). For the particular case of logistic regression, Aguilera et al. (2006) proposed
an optimum principal component logistic regression (PCLR) model that includes principal
components in the model by a stepwise method based on conditional likelihood ratio
tests. Their simulation study with highly correlated data shows that this PCLR model
provides better parameter estimates with fewer components than the PLS logit model, so
that the interpretation of the model is more accurate.
In the FDA context, Preda and Saporta (2005) extended PLS to the estimation of the
functional linear model. An extension of classical multivariate linear discriminant
analysis to the case of a functional predictor and a categorical response was also
developed by Preda et al. (2005), while a non-parametric approach to curve
discrimination was proposed in Ferraty and Vieu (2003). Based on the PLS logit algorithm
introduced by Bastien et al. (2005), in this paper we develop a PLS approach to
functional logistic regression that provides an optimum reconstruction of the parameter
function with a reduced set of PLS components as predictors. In addition, we propose a
new criterion for selecting the optimum number of PLS components that improves on the
parameter estimation provided by the functional principal component logistic regression
(FPCLR) model of Escabias et al. (2004).
2 Basic Theory on Functional Logit Regression
To formulate the functional logistic regression (FLR) model, we consider a sample of
observations of a functional variable (sample paths), denoted by
$\{x_i(t),\ t \in T;\ i = 1, \dots, n\}$, and a set of associated observations of a
random binary variable $Y$, denoted by $\{y_i;\ i = 1, \dots, n\}$. The FLR model is
then given by $y_i = \pi_i + \varepsilon_i$, $i = 1, \dots, n$, where the
$\varepsilon_i$ are independent and centered random errors, and the probability of
response $Y = 1$ (success) given each functional observation is
$$\pi_i = P\{Y = 1 \mid x_i(t)\} = \frac{\exp(l_i)}{1 + \exp(l_i)}, \qquad (1)$$
where the $l_i$ are the logit transformations, modelled as
$$l_i = \alpha + \int_T x_i(t)\,\beta(t)\,dt, \qquad (2)$$
with $\alpha$ being a real parameter and $\beta(t)$ a parameter function. In terms of
the logit transformation, the model can be seen as a functional generalized linear model
(Müller and Stadtmüller, 2005).
As in functional linear regression, it is impossible to estimate the parameter function
from a finite number of sample paths, which are usually observed only at a finite number
of time points, not necessarily the same for all of them. In functional regression
models this problem has been solved by approximating the sample curves and the parameter
function in a finite-dimensional space spanned by a basis of functions (see Ramsay and
Silverman (2005)). In the case of the FLR model, Ratcliffe et al. (2002) used least
squares approximation on Fourier functions to predict the probability of a high-risk
birth outcome from periodically stimulated foetal heart rate tracings, while Escabias
et al. (2005) developed an approach based on quasi-natural cubic spline interpolation of
the sample curves at the observed time points to forecast the risk of drought from the
time evolution of temperatures over the year.
If we denote the vector of basis functions by $\Phi = (\phi_1(t), \dots, \phi_p(t))'$,
then the sample paths and the parameter function can be expressed as
$$x_i(t) = a_i'\Phi, \qquad \beta(t) = \beta'\Phi,$$
with $a_i = (a_{i1}, \dots, a_{ip})'$ being the vector of the $i$th sample path basis
coefficients and $\beta = (\beta_1, \dots, \beta_p)'$ the parameter function basis
coefficients. Then, the functional model in terms of the logit transformations given by
equation (2) is expressed in matrix form as
$$L = \mathbf{1}\alpha + A\Psi\beta, \qquad (3)$$
with $L = (l_1, \dots, l_n)'$, $\mathbf{1} = (1, \dots, 1)'$, $A$ being the $n \times p$
matrix that has as rows the sample path basis coefficients $a_i'$, and
$\Psi = (\psi_{jk})$ the $p \times p$ matrix that has as entries the inner products of
the basis functions
$$\psi_{jk} = \int_T \phi_j(t)\,\phi_k(t)\,dt, \qquad j, k = 1, \dots, p.$$
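As a rough illustration of the matrix form (3), the following Python sketch builds the matrix of basis inner products by trapezoidal quadrature and then forms the design matrix AΨ. It is only a sketch under stated assumptions: the basis (the monomials 1, t, t² on T = [0, 1]) is chosen for brevity and is not the B-spline basis used later in the paper, and the names trapezoid, gram_matrix and design_matrix, as well as the coefficient values, are hypothetical.

```python
def trapezoid(f, a, b, n=2000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


def gram_matrix(basis, a, b):
    """Psi[j][k] = integral over [a, b] of basis[j](t) * basis[k](t) dt."""
    p = len(basis)
    return [[trapezoid(lambda t: basis[j](t) * basis[k](t), a, b)
             for k in range(p)] for j in range(p)]


def design_matrix(A, Psi):
    """Matrix product A Psi (n x p times p x p), the design matrix of model (3)."""
    p = len(Psi)
    return [[sum(row[j] * Psi[j][k] for j in range(p)) for k in range(p)]
            for row in A]


# Hypothetical basis (assumption): monomials on T = [0, 1].
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]
Psi = gram_matrix(basis, 0.0, 1.0)

# Two made-up sample paths, given by their basis coefficient vectors a_i.
A = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 1.0]]
H = design_matrix(A, Psi)
```

For this monomial basis the exact entries are 1/(j + k − 1) for j, k = 1, 2, 3, so the off-diagonal entries are far from zero; this is one simple way to see why the columns of AΨ tend to be strongly dependent.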
In this way the FLR model is transformed into the multiple logistic regression model
(3), whose design matrix has a high dependence structure. This multicollinearity has an
undesirable effect on regression methods, yielding inaccurate parameter estimates and
inflating their estimated variances (see Jolliffe (2005) for a discussion of
multicollinearity in linear regression, or Aguilera et al. (2006) for logistic
regression). Such poor estimation can lead to an erroneous interpretation of the
relationship between the response and the functional predictor. Escabias et al. (2005)
show that the exponential of the integral of the parameter function multiplied by a
constant K can be interpreted as the multiplicative change in the odds of response
Y = 1 obtained when a functional observation is increased by a constant K units along T.
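The odds interpretation follows directly from equation (2): replacing x(t) by x(t) + K adds K times the integral of β(t) over T to the logit, so the odds of Y = 1 are multiplied by the exponential of that quantity. The Python sketch below evaluates this factor numerically. The parameter function, the interval and the names integrate and odds_ratio are illustrative assumptions, not taken from the paper.

```python
import math


def integrate(f, a, b, n=2000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n))
    return total * h


def odds_ratio(beta, a, b, K):
    """Multiplicative change in the odds of Y = 1 when x(t) is raised by K on [a, b]."""
    return math.exp(K * integrate(beta, a, b))


# Hypothetical parameter function (assumption): constant beta(t) = 0.3 on T = [0, 2].
ratio = odds_ratio(lambda t: 0.3, 0.0, 2.0, K=1.0)
```

With these made-up values the integral is 0.6, so a unit upward shift of the whole curve multiplies the odds by exp(0.6). An inaccurate estimate of β(t) feeds straight into this factor, which is why multicollinearity can distort the interpretation.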
3 Partial Least Squares Logit Regression
To solve the multicollinearity problem mentioned above and to reduce the number of
predictor variables, we propose to use as explanatory variables of the multiple logit
model (3) a reduced set of logit PLS components of its design matrix AΨ. The PLS logit
model applied in this paper to the functional case is a particular case of the
adaptation of the classical PLS regression algorithm (Wold et al., 1983) to generalized
linear models introduced by Bastien et al. (2005). To improve the accuracy of the
parameter function estimation, in this paper we also propose a new criterion for
selecting the optimum partial least squares components to be introduced as predictors in
the functional PLS logit model.
3.1 Model formulation
Any PLS regression model defines uncorrelated latent variables (PLS components), given
by linear combinations of the predictor variables, that maximize the covariance between
linear combinations of the explanatory and response variables, respectively, and uses
them as predictors in the regression model (see, for example, Hoskuldsson (1988)). The
PLS algorithm proposed by Bastien et al. (2005) for the logistic regression model is an
ad hoc adaptation of the PLS linear regression algorithm in which each linear model that
involves the binary response variable Y is replaced by the corresponding logit model,
while the remaining linear fits are kept. The functional PLS logit regression (FPLSLR)
model proposed in this work consists of applying this PLS logit algorithm to the
response variable Y and the design matrix AΨ.
The algorithm for computing a PLS regression model consists of three steps: (1)
Computation of a set S of PLS components, (2) Linear regression of the response
variable on the retained PLS components and (3) Formulation of the PLS regression
model in terms of the original predictor variables.
If we denote by $H_j$ the columns of the design matrix AΨ, the first step of the PLS
logit algorithm for estimating the logit model given by equation (3) is summarized as
follows:
Step 1. First logit PLS component
• Logit regression fits of $Y$ on $H_j$, $j = 1, \dots, p$. Let
  $\hat\delta^{(1)} = (\hat\delta_1^{(1)}, \dots, \hat\delta_p^{(1)})'$ be the estimated
  slope parameters.
• $V_1 = (v_{11}, \dots, v_{1p})'$ is the normalized vector of $\hat\delta^{(1)}$ after
  setting equal to zero those components for which
  $|\hat\delta_j^{(1)}| / SE(\hat\delta_j^{(1)}) \le z_{\alpha/2}$, with $z_{\alpha/2}$
  being a fixed critical value of the standard normal distribution.
• The first logit PLS component is defined as $T_1 = v_{11} H_1 + \cdots + v_{1p} H_p$.
Step s. Given $T_1, \dots, T_{s-1}$, the first $s - 1$ logit PLS components, the $s$th
one is obtained as follows:
• Logit regression fits of $Y$ on $(T_1, \dots, T_{s-1}, H_j)$, $j = 1, \dots, p$. Let
  $\hat\delta^{(s)} = (\hat\delta_1^{(s)}, \dots, \hat\delta_p^{(s)})'$ be the estimated
  slope parameters corresponding to the $H_j$.
• $V_s = (v_{s1}, \dots, v_{sp})'$ is the normalized vector of $\hat\delta^{(s)}$ after
  setting equal to zero those elements for which
  $|\hat\delta_j^{(s)}| / SE(\hat\delta_j^{(s)}) \le z_{\alpha/2}$.
• Linear regression fits of $H_j$ on $(T_1, \dots, T_{s-1})$, $j = 1, \dots, p$. Let
  $R_1, \dots, R_p$ be the residual vectors.
• The $s$th logit PLS component is $T_s = v_{s1} R_1 + \cdots + v_{sp} R_p$.
Note that the algorithm stops when, in computing a PLS component, none of its
coefficients is significantly different from zero. The statistical significance of
these coefficients is tested in this paper by using the classical Wald test.
Finally, all logit PLS components are expressed in terms of the original predictor
variables instead of the residuals $R_1, \dots, R_p$.
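The weighting part of each step (Wald thresholding, normalization, and the linear combination that defines a component) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the slope estimates and standard errors are made-up numbers standing in for the output of the logit fits, which are not carried out here, and the function names pls_weights and pls_component are hypothetical.

```python
import math
from statistics import NormalDist


def pls_weights(slopes, std_errors, alpha=0.05):
    """Zero out slopes that fail a two-sided Wald test, then normalize to unit length.

    Returns None when no slope is significant, which is the stopping condition.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    kept = [d if abs(d) / se > z else 0.0 for d, se in zip(slopes, std_errors)]
    norm = math.sqrt(sum(d * d for d in kept))
    if norm == 0.0:
        return None
    return [d / norm for d in kept]


def pls_component(weights, columns):
    """T = sum_j v_j * H_j, where columns[j] is the column H_j (or the residual R_j)."""
    n = len(columns[0])
    return [sum(v * col[i] for v, col in zip(weights, columns)) for i in range(n)]


# Made-up slopes and standard errors (assumption) for p = 3 predictors.
slopes = [3.0, 0.1, -4.0]
std_errors = [1.0, 1.0, 1.0]
v = pls_weights(slopes, std_errors)          # the second slope is not significant
H = [[1.0, 2.0], [5.0, 5.0], [0.0, 1.0]]     # three columns, n = 2 observations
T1 = pls_component(v, H)
```

For step s > 1 the same two functions would be applied to the residual vectors in place of the original columns; a full implementation would also need a logit fitting routine to supply the slopes and their standard errors.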
The second step of the PLS logit algorithm consists of formulating the logit model (3)
in terms of the S PLS components obtained in the first step. If we denote by $\Gamma$
the matrix of logit PLS components of the design matrix AΨ, then the logit model in
terms of the PLS components is expressed as
$$L = \mathbf{1}\gamma_0 + \Gamma\gamma, \qquad (4)$$
where $\gamma = (\gamma_1, \dots, \gamma_S)'$ are the coefficients of the logit model in
terms of the S logit PLS components obtained in the first step of the algorithm.
Finally, in the third step we obtain an estimate of the original functional logit model
through the following estimate of the basis coefficients of the parameter function,
computed from the estimated gamma parameters: $\hat\beta = V\hat\gamma$, with $V$ being
the matrix whose columns are the loading vectors $V_s$ of the PLS components.
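The third step is a single matrix-vector product followed by a basis expansion. The Python sketch below shows both operations under stated assumptions: the loadings, the component coefficients and the monomial basis are made-up values chosen for readability, and the names beta_coefficients and beta_function are hypothetical.

```python
def beta_coefficients(V, gamma):
    """Product V gamma, with V a p x S matrix stored as a list of rows."""
    return [sum(v_js * g_s for v_js, g_s in zip(row, gamma)) for row in V]


def beta_function(beta, basis, t):
    """Evaluate the expansion sum_j beta_j * phi_j(t) at a point t."""
    return sum(b * phi(t) for b, phi in zip(beta, basis))


# Made-up loadings (p = 3, S = 2) and component coefficients (assumption).
V = [[0.6, 0.0],
     [0.0, 1.0],
     [-0.8, 0.0]]
gamma_hat = [2.0, -1.0]
beta_hat = beta_coefficients(V, gamma_hat)

# Hypothetical basis (assumption): monomials 1, t, t^2.
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]
value_at_half = beta_function(beta_hat, basis, 0.5)
```

Keeping only the first s columns of V and the first s entries of the coefficient vector would give the reconstruction associated with a model that uses s components.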
3.2 Model selection criterion
Note that in the algorithm introduced above, all PLS components are included in the
FPLSLR model. As we will see in the simulation developed at the end of the paper, the
variance of the estimated parameter function associated with the model with all PLS
components is extremely large. To improve the accuracy of the parameter function
estimation with a small number of PLS components, in this paper we introduce two
criteria for selecting the optimum number of PLS components to be included as predictors
in the FPLSLR model. These model selection procedures are based on two different
accuracy measures of the estimated parameters. On the one hand, the mean squared error
of the beta parameter vector of basis coefficients (MSEB), defined by
$$MSEB(s) = \frac{1}{p+1} \sum_{j=0}^{p} \left(\hat\beta_{j(s)} - \beta_j\right)^2, \qquad s = 1, \dots, S,$$
with $\hat\beta_{j(s)}$ being the estimates of the parameter function basis coefficients
provided by the PLSLR(s) model with the first $s$ PLS components as predictor variables
($s = 1, \dots, S$). On the other hand, the integrated mean squared error of the beta
parameter function (IMSEB), given by
$$IMSEB(s) = \frac{1}{T} \int_T \left(\beta(t) - \hat\beta_{(s)}(t)\right)^2 dt, \qquad s = 1, \dots, S,$$
with $\hat\beta_{(s)}(t)$ being the parameter function estimate given by the PLSLR(s)
model with the first $s$ PLS components as predictor variables ($s = 1, \dots, S$).
Once these errors have been estimated, we select as optimum FPLSLR models the ones
associated with their minimum values.
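Both accuracy measures and the selection rule are straightforward to compute when the true parameters are known, as in a simulation. The following Python sketch assumes that the 1/T factor in IMSEB denotes division by the length of the interval T and approximates the integral by the trapezoidal rule; the coefficient values and the names mseb, imseb and best_s are hypothetical.

```python
def mseb(beta_hat, beta_true):
    """Mean squared error between estimated and true coefficient vectors.

    Both vectors are expected to hold the p + 1 entries indexed j = 0, ..., p.
    """
    return sum((bh - b) ** 2 for bh, b in zip(beta_hat, beta_true)) / len(beta_true)


def imseb(beta_hat_fn, beta_fn, a, b, n=2000):
    """Integral over [a, b] of (beta(t) - beta_hat(t))^2, divided by b - a."""
    h = (b - a) / n
    sq = lambda t: (beta_fn(t) - beta_hat_fn(t)) ** 2
    total = 0.5 * (sq(a) + sq(b)) + sum(sq(a + k * h) for k in range(1, n))
    return total * h / (b - a)


def best_s(errors):
    """Number of components s (1-based) with the smallest error."""
    return min(range(len(errors)), key=errors.__getitem__) + 1


# Made-up true coefficients and estimates for s = 1, 2, 3 (assumption).
beta_true = [0.5, 1.0, -1.0]
estimates = [[0.5, 0.0, 0.0], [0.5, 1.0, -0.5], [2.5, 3.0, -4.0]]
mseb_values = [mseb(est, beta_true) for est in estimates]
s_opt = best_s(mseb_values)
```

In this toy example the error first drops and then grows again when the third component is added, so the rule picks s = 2 rather than the model with all components.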
4 Simulation Study
To test the performance of the proposed optimum functional PLS logit model, we carried
out a simulation study following the simulation scheme proposed in Escabias et al.
(2004). As functional explanatory variable we considered one whose sample curves are
cubic spline functions expressed in terms of the basis of cubic B-spline functions
defined by the knots {0, 1, 2, ..., 10}. To simulate a set of sample paths, we simulated
100 vectors of dimension 13 from a centered multivariate normal distribution with highly
correlated components; these vectors correspond to the basis coefficients. The parameter
function of the simulated logit model was the natural cubic spline interpolation of the
function sin(x − π/4) on the knots defined above. Finally, each observation of the
response variable $y_i$ ($i = 1, \dots, n$) was simulated from a Bernoulli distribution
with probability $\pi_i$ calculated from expressions (1) and (2) with α = 0.5.
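The last stage of this scheme, drawing Bernoulli responses from the logits, can be sketched in Python as below. The sketch deliberately simplifies the setting: the rows of the design matrix are generated directly from a shared factor plus small noise to mimic high correlation (an assumption, not the covariance structure used in the paper), the coefficients are made-up values rather than the spline coefficients of sin(x − π/4), and the name simulate_responses is hypothetical.

```python
import math
import random


def simulate_responses(design, beta, alpha, rng):
    """Draw y_i from Bernoulli(pi_i), with logit l_i = alpha + sum_j h_ij * beta_j."""
    ys, pis = [], []
    for row in design:
        l_i = alpha + sum(h * b for h, b in zip(row, beta))
        pi_i = 1.0 / (1.0 + math.exp(-l_i))   # same value as exp(l) / (1 + exp(l))
        pis.append(pi_i)
        ys.append(1 if rng.random() < pi_i else 0)
    return ys, pis


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Highly correlated rows (assumption): a shared factor plus small independent noise.
n, p = 100, 13
design = []
for _ in range(n):
    common = rng.gauss(0.0, 1.0)
    design.append([common + 0.1 * rng.gauss(0.0, 1.0) for _ in range(p)])

beta = [0.2] * p          # made-up coefficients, not the paper's spline coefficients
y, pi = simulate_responses(design, beta, alpha=0.5, rng=rng)
```

Repeating only the response draw with a fresh random stream, while keeping the design fixed, gives independent replications of the binary response for the same set of curves.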
After simulating the data, we fitted the functional logistic regression model given by
equation (3), obtaining an estimated parameter function very different from the
simulated one. The variance of the estimated parameter function, defined in this paper
as the sum of the estimated variances of its basis coefficients, was extremely large.
The parameter function accuracy measures MSEB and IMSEB were also very large. However,
the deviance statistic $G^2$ and the CCR (correct classification rate for cut point 0.5)
showed that the model fits well.
To avoid the multicollinearity problem and to obtain an accurate estimate of the
parameter function of the FLR model, we computed the functional logit PLS components. We
then fitted the FPLSLR(s) model with different numbers s of PLS components. For each
fitted FPLSLR(s) model, we reconstructed the estimated parameter function
$\hat\beta_{(s)}(t)$ and computed the accuracy measures defined in the previous section
to evaluate the improvement in the estimation.
Table 1. Mean and standard deviation of goodness of fit and accuracy measures of
the different optimum FPLSLR models

                                  Optimum FPLSLR models
                  All PLS components     Minimum MSEB      Minimum IMSEB
Measures             Mean       SD       Mean     SD       Mean     SD
PLS Components       2.72      0.53      1.24    0.62      2.17    0.70
CCR (%)             87.11      3.42     82.66    3.76     85.44    3.29
IMSEB                0.52      2.19      0.32    0.08      0.20    0.09
MSEB                70.69    755.44      1.94    0.60      8.10    8.25
VAR                 61.47    313.64      0.67    0.53     11.52   18.69

Sample replication was used to validate the results obtained from a single simulation.
We repeated the simulation of the binary response variable 350 times and, in each
replication, selected three optimum FPLSLR(s) models: the one with all PLS components,
the one with the smallest MSEB(s) and the one with the smallest IMSEB(s). We then
computed, across the replications, the mean and standard deviation of the goodness of
fit and accuracy measures associated with these three optimum models.
The results are shown in Table 1, where we can observe that the mean CCR and the mean
IMSEB are similar for the three models, and that the mean number of components needed
for the best possible estimation of the parameter function is lowest for the model with
the minimum MSEB. However, the MSEB and the variance of the estimated parameter function
are drastically reduced in the models proposed in this paper, which minimize the MSEB
and IMSEB errors, with the minimum MSEB criterion being the best for selecting the PLS
components to be included in the model.
In addition, we have observed that the best estimation of the parameter function
provided by the FPLSLR model with minimum MSEB is usually followed by a large increase
in its estimated variance. This means that in real applications, where the true
parameter function is unknown, a good criterion for selecting the best model may be to
take as optimum the FPLSLR(s) model immediately preceding a significant increase in the
estimated variance of the beta parameters.
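One possible way to turn this practical rule into code is sketched below in Python. The paper does not specify what counts as a significant increase, so the sketch assumes a simple ratio threshold between consecutive variances; the threshold value, the variances and the name select_before_variance_jump are all hypothetical.

```python
def select_before_variance_jump(variances, ratio=5.0):
    """Return the 1-based s just before the first jump var[s + 1] > ratio * var[s].

    If no such jump is found, all components are kept.
    """
    for s in range(1, len(variances)):
        if variances[s] > ratio * variances[s - 1]:
            return s
    return len(variances)


# Made-up estimated variances of the FPLSLR(s) models for s = 1, ..., 4 (assumption).
variances = [0.4, 0.7, 12.0, 60.0]
s_selected = select_before_variance_jump(variances)
```

Here the variance grows moderately from s = 1 to s = 2 and then jumps, so the rule keeps two components. In practice the threshold would need to be chosen with care, for instance by inspecting the sequence of estimated variances.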
Acknowledgements
This research has been supported by Project MTM2004-5992 from Dirección General
de Investigación, Ministerio de Ciencia y Tecnología.
References
[AEV06] Aguilera, A.M., Escabias, M., Valderrama, M.J.: Using principal components
for estimating logistic regression with high-dimensional multicollinear data.
Computational Statistics and Data Analysis, 50, 1905–1924 (2006)
[BET05] Bastien, P., Esposito-Vinzi, V., Tenenhaus, M.: PLS generalised linear
regression. Computational Statistics and Data Analysis, 48, 17–46 (2005)
[EAV04] Escabias, M., Aguilera, A.M., Valderrama, M.J.: Principal component
estimation of functional logistic regression: discussion of two different approaches.
Journal of Nonparametric Statistics, 16(3-4), 365–384 (2004)
[EAV05] Escabias, M., Aguilera, A.M., Valderrama, M.J.: Modeling environmental data
by functional principal component logistic regression. Environmetrics, 16(1), 95–107
(2005)
[FV03] Ferraty, F., Vieu, P.: Curves discrimination: a nonparametric functional
approach. Computational Statistics and Data Analysis, 44, 161–173 (2003)
[FF93] Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics
regression tools. Technometrics, 35, 109–135 (1993)
[Jam02] James, G.M.: Generalized linear models with functional predictors. Journal
of the Royal Statistical Society, Series B, 64(3), 411–432 (2002)
[MS05] Müller, H.-G., Stadtmüller, U.: Generalized functional linear models. The
Annals of Statistics, 33(2), 774–805 (2005)
[PS05] Preda, C., Saporta, G.: PLS regression on a stochastic process. Computational
Statistics and Data Analysis, 48(1), 149–158 (2005)
[RS05] Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer-Verlag,
New York (2005)
[RHL02] Ratcliffe, S.J., Heller, G.Z., Leader, L.R.: Functional data analysis with
application to periodically stimulated foetal heart rate data. II: Functional
logistic regression. Statistics in Medicine, 21, 1115–1127 (2002)
[Hos88] Hoskuldsson, A.: PLS regression methods. Journal of Chemometrics, 2,
211–228 (1988)
[PSC05] Preda, C., Saporta, G., Caroline, L.: PLS classification of functional data.
In: Aluja, T., Casanovas, J., Esposito Vinzi, V., Morineau, A., Tenenhaus,
M. (eds) Proceedings of the PLS'05 International Symposium, SPAD
Groupe Test & Go, 164–174 (2005)
[WMW83] Wold, S., Martens, H., Wold, H.: The multivariate calibration problem in
chemistry solved by the PLS method. In: Ruhe, A., Kågström, B. (eds) Proceedings
of the Conference Matrix Pencils. Lecture Notes in Mathematics. Springer-Verlag,
Heidelberg, 286–293 (1983)