A model selection criterion for functional PLS logit regression

Aguilera, A.M. (1), Escabias, M. (2), and Valderrama, M.J. (3)

(1) Department of Statistics and O.R. University of Granada. Faculty of Sciences. Campus de Fuentenueva. 18071-Granada. aaguiler@ugr.es
(2) Department of Statistics and O.R. University of Granada. Faculty of Pharmacy. Campus de Cartuja. 18071-Granada. escabias@ugr.es
(3) Department of Statistics and O.R. University of Granada. Faculty of Pharmacy. Campus de Cartuja. 18071-Granada. valderra@ugr.es

Summary. To estimate the parameter function of the functional logit regression model from discrete-time observations of the functional predictor, it is usual to approximate the sample curves and the parameter function in a finite-dimensional space generated by a basis. The functional model then reduces to a multiple logit model with highly dependent explanatory variables. As a consequence of this multicollinearity problem, the estimation of the parameter function is very inaccurate and its interpretation in terms of odds ratios may be erroneous. In this paper we apply a PLS logit regression algorithm and introduce a new criterion for selecting the optimum PLS components, which provides an optimum reconstruction of the parameter function with a small number of predictor variables.

Key words: Logit regression, functional data analysis, PLS components

1 Introduction

Functional data analysis (FDA) methods have received much attention in recent years, generalizing many statistical techniques to the functional setting, where the data are a sample of curves instead of vectors as in classical multivariate data analysis (Ramsay and Silverman, 2005). In the general context of FDA, logistic regression has recently been extended to model a binary response in terms of a functional predictor whose realizations are curves. Different methods for estimating the parameter function of this model have been developed.
In the more general context of functional generalized linear models, James (2002) used the EM algorithm to estimate the model, whereas Müller and Stadtmüller (2005) employed an orthogonal expansion of the functional predictor. The approximation of the parameter function and the sample curves in terms of the same basis of functions has been considered in Ratcliffe et al. (2002) and Escabias et al. (2004). The latter paper also introduces different functional principal component analysis (FPCA) approaches for solving the multicollinearity problem that appears in this case and for improving the parameter function estimation. An odds ratio interpretation of the relationship between a binary response and a functional predictor, in terms of the estimated parameter function of the functional logit model, has been established in Escabias et al. (2005), where an application with climatological data was developed.

Because principal components do not take into account the relationship between the response and the functional predictor, Wold et al. (1983) introduced the PLS regression method as an alternative to principal component regression (PCR) for solving the multicollinearity problem in linear regression. This method takes as explanatory variables a set of uncorrelated latent variables that maximize the covariance between the response and the original predictors. Different methods for solving the multicollinearity problem in linear regression (PLS, PCR, ridge regression and variable subset selection) have been compared in Frank and Friedman (1993), concluding that PLS is superior to all of them. PLS has also been extended to the case of generalized linear models by Bastien et al. (2005). For the particular case of logistic regression, Aguilera et al.
(2006) have proposed an optimum principal component logistic regression (PCLR) model that enters principal components into the model by a stepwise method based on conditional likelihood ratio tests. Their simulation study with highly correlated data shows that this PCLR model provides better parameter estimation with fewer components than the PLS logit model, so that the interpretation of the model is more accurate.

In the FDA context, Preda and Saporta (2005) have extended PLS to the estimation of the functional linear model. An extension of classical multivariate linear discriminant analysis to the case of a functional predictor and a categorical response has also been developed by Preda et al. (2005), whereas a non-parametric approach to curve discrimination has been proposed in Ferraty and Vieu (2003).

Based on the PLS logit algorithm introduced by Bastien et al. (2005), in this paper we develop a PLS approach to functional logistic regression that provides an optimum reconstruction of the parameter function with a reduced set of PLS components as predictors. In addition, we propose a new criterion for selecting the optimum number of PLS components that improves on the parameter estimation provided by the functional principal component logistic regression (FPCLR) model of Escabias et al. (2004).

2 Basic Theory on Functional Logit Regression

To formulate the functional logistic regression (FLR) model, consider a sample of observations of a functional variable (sample paths), denoted by $\{x_i(t),\ t \in T;\ i = 1, \ldots, n\}$, and the associated observations of a binary random variable $Y$, denoted by $\{y_i;\ i = 1, \ldots, n\}$. The FLR model is then given by

\[ y_i = \pi_i + \varepsilon_i, \quad i = 1, \ldots, n, \]

where the $\varepsilon_i$ are independent, centered random errors, the probability of response $Y = 1$ (success) given each functional observation is

\[ \pi_i = P\{Y = 1 \mid x_i(t)\} = \frac{\exp(l_i)}{1 + \exp(l_i)}, \qquad (1) \]

and the $l_i$ are the logit transformations, modeled as

\[ l_i = \alpha + \int_T x_i(t)\,\beta(t)\,dt, \qquad (2) \]

with $\alpha$ a real parameter and $\beta(t)$ a parameter function. In terms of the logit transformation, the model can be seen as a functional generalized linear model (Müller and Stadtmüller, 2005).

As in functional linear regression, it is impossible to estimate the parameter function from a finite number of sample paths, which are usually observed only at a finite number of time points that are not necessarily the same for all of them. In functional regression models this problem has been solved by approximating the sample curves and the parameter function in a finite-dimensional space spanned by a basis of functions (see Ramsay and Silverman (2005)). In the case of the FLR model, Ratcliffe et al. (2002) used least squares approximation on Fourier functions to predict the probability of a high-risk birth outcome from periodically stimulated foetal heart rate tracings, whereas Escabias et al. (2005) developed an approach based on quasi-natural cubic spline interpolation of the sample curves at the observed time points to forecast the risk of drought from the time evolution of temperatures over the year.

If we denote the vector of basis functions by $\Phi = (\phi_1(t), \ldots, \phi_p(t))'$, then the sample paths and the parameter function can be expressed as

\[ x_i(t) = a_i'\Phi, \qquad \beta(t) = \beta'\Phi, \]

with $a_i = (a_{i1}, \ldots, a_{ip})'$ being the vector of basis coefficients of the $i$th sample path and $\beta = (\beta_1, \ldots, \beta_p)'$ the basis coefficients of the parameter function. Then the functional model in terms of the logit transformations given by equation (2) is expressed in matrix form as

\[ L = \mathbf{1}\alpha + A\Psi\beta, \qquad (3) \]

with $L = (l_1, \ldots, l_n)'$, $\mathbf{1} = (1, \ldots, 1)'$, $A$ being the $n \times p$ matrix whose rows are the sample path basis coefficients $a_i'$, and $\Psi = (\psi_{jk})$ the $p \times p$ matrix whose entries are the inner products of the basis functions,

\[ \psi_{jk} = \int_T \phi_j(t)\,\phi_k(t)\,dt, \quad j, k = 1, \ldots, p. \]

In this way the FLR model is transformed into the multiple logistic regression model (3), whose design matrix has a high dependence structure. This multicollinearity problem has an undesirable effect on regression methods, producing inaccurate parameter estimates and inflating their estimated variances (see Jolliffe (2005) for a discussion of multicollinearity in linear regression or Aguilera et al. (2006) for logistic regression). Such poor estimation can lead to an erroneous interpretation of the relationship between the response and the functional predictor. Escabias et al. (2005) show that the exponential of the integral of the parameter function multiplied by a constant $K$ can be interpreted as the multiplicative change in the odds of response $Y = 1$ obtained when a functional observation is incremented by a constant $K$ units along $T$; by equation (2), such an increment adds $K \int_T \beta(t)\,dt$ to the logit.

3 Partial Least Squares Logit Regression

To solve the multicollinearity problem described above and to reduce the number of predictor variables, we propose to use as explanatory variables of the multiple logit model (3) a reduced set of logit PLS components of its design matrix $A\Psi$. The PLS logit model applied in this paper to the functional case is a particular case of the adaptation of the classical PLS regression algorithm (Wold et al., 1983) to generalized linear models introduced by Bastien et al. (2005). To improve the accuracy of the parameter function estimation, we also propose a new criterion for selecting the optimum partial least squares components to be introduced as predictors in the functional PLS logit model.
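
As a small numerical illustration of the matrix form (3), the following Python sketch computes $\Psi$ by quadrature and checks that $\alpha + a_i'\Psi\beta$ reproduces the integral in equation (2). The monomial basis, the interval T = [0, 1], the trapezoidal rule and all numerical values are assumptions made only for this example; they are not the B-spline setting used in the simulation study.

```python
import numpy as np

# Hypothetical setting: T = [0, 1] and a monomial basis phi_j(t) = t**j, j = 0..p-1.
p, n, m = 4, 5, 2001
t = np.linspace(0.0, 1.0, m)
Phi = np.vander(t, p, increasing=True)       # m x p, column j holds phi_j on the grid

# Trapezoidal quadrature weights on the grid t.
w = np.full(m, t[1] - t[0])
w[[0, -1]] *= 0.5

# Psi[j, k] approximates the inner product of phi_j and phi_k over T.
Psi = Phi.T @ (w[:, None] * Phi)

rng = np.random.default_rng(0)
A = rng.normal(size=(n, p))                  # rows a_i': basis coefficients of x_i(t)
beta = rng.normal(size=p)                    # basis coefficients of beta(t)
alpha = 0.5

# Equation (3): L = 1*alpha + A Psi beta, then equation (1) for the probabilities.
L = alpha + A @ Psi @ beta
pi = 1.0 / (1.0 + np.exp(-L))

# Direct evaluation of equation (2) with the same quadrature rule.
X = A @ Phi.T                                # n x m, sampled curves x_i(t)
L_direct = alpha + (X * (Phi @ beta)) @ w
```

Because both routes use the same quadrature weights, `L` and `L_direct` agree up to rounding error, which is the content of the passage from (2) to (3).
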

3.1 Model formulation

Any PLS regression model defines uncorrelated latent variables (PLS components), given by linear combinations of the predictor variables, that maximize the covariance between linear combinations of the explanatory and response variables, respectively, and uses them as predictors of the regression model (see, for example, Hoskuldsson (1988)). The PLS algorithm proposed by Bastien et al. (2005) for the logistic regression model is an ad hoc adaptation of the PLS linear regression algorithm in which each of the linear models that involves the binary response variable $Y$ is replaced by the corresponding logit model, while the remaining linear fits are kept. The functional PLS logit regression (FPLSLR) model proposed in this work consists of applying this PLS logit algorithm to the response variable $Y$ and the design matrix $A\Psi$.

The algorithm for computing a PLS regression model consists of three steps: (1) computation of a set of $S$ PLS components, (2) linear regression of the response variable on the retained PLS components, and (3) formulation of the PLS regression model in terms of the original predictor variables.

If we denote by $H_j$ the columns of the design matrix $A\Psi$, the first step of the PLS logit algorithm for estimating the logit model given by equation (3) is summarized as follows:

Step 1. First logit PLS component
• Logit regression fits $Y / H_j$, $j = 1, \ldots, p$. Let $\hat\delta^{(1)} = (\hat\delta^{(1)}_1, \ldots, \hat\delta^{(1)}_p)'$ be the estimated slope parameters.
• $V_1 = (v_{11}, \ldots, v_{1p})'$ is the normalized vector of $\hat\delta^{(1)}$ after setting equal to zero those components for which $|\hat\delta^{(1)}_j| / SE(\hat\delta^{(1)}_j) \le z_{\alpha/2}$, with $z_{\alpha/2}$ being a fixed critical value of the standard normal distribution.
• The first logit PLS component is defined as $T_1 = v_{11} H_1 + \cdots + v_{1p} H_p$.

Step s. Given $T_1, \ldots, T_{s-1}$, the first $s - 1$ logit PLS components, the $s$th one is obtained as follows:
• Logit regression fits $Y / (T_1, \ldots, T_{s-1}, H_j)$, $j = 1, \ldots, p$. Let $\hat\delta^{(s)} = (\hat\delta^{(s)}_1, \ldots, \hat\delta^{(s)}_p)'$ be the estimated slope parameters corresponding to $H_j$.
• $V_s = (v_{s1}, \ldots, v_{sp})'$ is the normalized vector of $\hat\delta^{(s)}$ after setting equal to zero those elements for which $|\hat\delta^{(s)}_j| / SE(\hat\delta^{(s)}_j) \le z_{\alpha/2}$.
• Linear regression fits $H_j / (T_1, \ldots, T_{s-1})$, $j = 1, \ldots, p$. Let $R_1, \ldots, R_p$ be the residual vectors.
• The $s$th logit PLS component is $T_s = v_{s1} R_1 + \cdots + v_{sp} R_p$.

Note that the algorithm stops when, in computing a PLS component, none of its coefficients is significantly different from zero. The statistical significance of these coefficients is tested in this paper by using the classical Wald test. Finally, all logit PLS components are expressed in terms of the original predictor variables instead of the residuals $R_1, \ldots, R_p$.

The second step of the PLS logit algorithm consists of formulating the logit model (3) in terms of the $S$ PLS components obtained in the first step. If we denote by $\Gamma$ the matrix of logit PLS components of the design matrix $A\Psi$, then the logit model in terms of the PLS components is expressed as

\[ L = \mathbf{1}\gamma_0 + \Gamma\gamma, \qquad (4) \]

where $\gamma = (\gamma_1, \ldots, \gamma_S)'$ are the coefficients of the logit model in terms of the $S$ logit PLS components obtained in the first step of the algorithm.

Finally, in the third step we obtain an estimate of the original functional logit model through the following estimate of the parameter function basis coefficients from the estimated $\gamma$ parameters: $\hat\beta = V\hat\gamma$, with $V$ being the matrix whose columns are the loading vectors $V_s$ of the PLS components.

3.2 Model selection criterion

Note that in the algorithm introduced above, all PLS components are included in the FPLSLR model. As we will see in the simulation study at the end of the paper, the variance of the estimated parameter function associated with the model that uses all PLS components is extremely large.
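
The three steps of the PLS logit algorithm of Section 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the logit fits use a hand-written Newton-Raphson routine with a tiny ridge term for numerical stability, the predictors are centered first so that the linear fits need no intercept, the helper names `fit_logit` and `pls_logit` are hypothetical, and the loadings expressed in the original variables are stored in `W` (playing the role of $V$ in $\hat\beta = V\hat\gamma$). The data at the bottom are synthetic.

```python
import numpy as np

def fit_logit(X, y, n_iter=100, ridge=1e-8):
    """Logit fit with intercept by Newton-Raphson; returns (coefficients, std. errors)."""
    Z = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(Z.shape[1])
    pen = ridge * np.eye(Z.shape[1])          # assumption: tiny ridge, stability only
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(Z @ b, -30, 30)))
        info = Z.T @ (Z * (prob * (1 - prob))[:, None]) + pen
        step = np.linalg.solve(info, Z.T @ (y - prob))
        b += step
        if np.max(np.abs(step)) < 1e-10:
            break
    prob = 1.0 / (1.0 + np.exp(-np.clip(Z @ b, -30, 30)))
    info = Z.T @ (Z * (prob * (1 - prob))[:, None]) + pen
    return b, np.sqrt(np.diag(np.linalg.inv(info)))

def pls_logit(H, y, z_crit=1.96, max_comp=None):
    """Step 1: logit PLS components T = Hc @ W of the centered predictors Hc."""
    Hc = H - H.mean(axis=0)
    n, p = Hc.shape
    max_comp = p if max_comp is None else max_comp
    T, W = np.empty((n, 0)), np.empty((p, 0))
    while T.shape[1] < max_comp:
        delta = np.zeros(p)
        for j in range(p):                    # logit fits Y / (T_1, ..., T_{s-1}, H_j)
            b, se = fit_logit(np.column_stack([T, Hc[:, j]]), y)
            if abs(b[-1]) > z_crit * se[-1]:  # Wald test on the slope of H_j
                delta[j] = b[-1]
        if not delta.any():                   # no significant coefficient: stop
            break
        v = delta / np.linalg.norm(delta)
        if T.shape[1]:
            # Residuals of H_j on T are Hc @ M, so T_s = Hc @ (M @ v).
            C = np.linalg.lstsq(T, Hc, rcond=None)[0]
            M = np.eye(p) - W @ C
        else:
            M = np.eye(p)
        w = M @ v
        W = np.column_stack([W, w])
        T = np.column_stack([T, Hc @ w])
    return T, W

# Synthetic, highly correlated predictors sharing one common factor.
rng = np.random.default_rng(0)
n, p = 200, 5
f = rng.normal(size=n)
H = f[:, None] + 0.3 * rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * f))).astype(float)

T, W = pls_logit(H, y)                        # step 1
gamma, _ = fit_logit(T, y)                    # step 2: logit model on the components
beta_hat = W @ gamma[1:]                      # step 3: coefficients of the original H_j
alpha_hat = gamma[0] - H.mean(axis=0) @ beta_hat
```

Truncating `T` and `W` to their first `s` columns before steps 2 and 3 gives the model with the first `s` components, which is what the selection criteria of Section 3.2 compare.
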
To improve the accuracy of the parameter function estimation with a small number of PLS components, in this paper we introduce different criteria for selecting the optimum number of PLS components to be introduced as predictors in the FPLSLR model. These model selection procedures are based on two different accuracy measures of the estimated parameters. On the one hand, the mean squared error of the beta parameter vector of basis coefficients (MSEB), defined by

\[ MSEB(s) = \frac{1}{p+1} \sum_{j=0}^{p} \left( \hat\beta_{j(s)} - \beta_j \right)^2, \quad s = 1, \ldots, S, \]

with $\hat\beta_{j(s)}$ being the estimate of the parameter function basis coefficients provided by the FPLSLR(s) model with the first $s$ PLS components as predictor variables ($s = 1, \ldots, S$). On the other hand, the integrated mean squared error of the beta parameter function (IMSEB), given by

\[ IMSEB(s) = \frac{1}{T} \int_T \left( \beta(t) - \hat\beta_{(s)}(t) \right)^2 dt, \quad s = 1, \ldots, S, \]

with $\hat\beta_{(s)}(t)$ being the parameter function estimate given by the FPLSLR(s) model with the first $s$ PLS components as predictor variables ($s = 1, \ldots, S$). Once these errors have been computed, we select as optimum FPLSLR models the ones associated with their minimum values.

4 Simulation Study

To test the performance of the proposed optimum functional PLS logit model, we carried out a simulation study following the simulation scheme proposed in Escabias et al. (2004). We considered as functional explanatory variable the one whose sample curves are cubic spline functions expressed in terms of the basis of cubic B-spline functions defined by the knots $\{0, 1, 2, \ldots, 10\}$. To simulate a set of sample paths, we simulated 100 vectors of dimension 13 from a centered multivariate normal distribution with highly correlated components; these vectors correspond to the basis coefficients.
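
The simulation of the basis coefficients just described can be sketched as below. The text does not state the covariance matrix of the normal distribution, so the AR(1)-type matrix with correlation `rho = 0.9` used here is purely an assumption chosen to produce highly correlated components; the seed is arbitrary as well.

```python
import numpy as np

rng = np.random.default_rng(2006)
n, p, rho = 100, 13, 0.9                     # rho is a hypothetical correlation level

# Assumed covariance: Sigma[j, k] = rho**|j - k|, so neighbouring coefficients
# are highly correlated. The source does not specify Sigma.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

# One row per sample path: the 13 B-spline basis coefficients a_i'.
A = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
```

Each row of `A` then plays the role of $a_i'$ in the design matrix $A\Psi$ of equation (3).
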
The parameter function of the simulated logit model is the natural cubic spline interpolation of the function $\sin(x - \pi/4)$ on the knots defined above. Finally, each observation of the response variable $y_i$ ($i = 1, \ldots, n$) was simulated from a Bernoulli distribution with probability $\pi_i$ calculated from expressions (1) and (2) with $\alpha = 0.5$.

After simulating the data, we fitted the functional logistic regression model given by equation (3), obtaining an estimated parameter function very different from the simulated one. The variance of the estimated parameter function, defined in this paper as the sum of the estimated variances of its basis coefficients, was extremely large. The parameter function accuracy measures MSEB and IMSEB were also very large. However, the deviance statistic $G^2$ and the CCR (correct classification rate for cut point 0.5) showed that the model fits well.

To avoid the multicollinearity problem and to obtain an accurate estimate of the parameter function of the FLR model, we computed the functional logit PLS components. We then fitted the FPLSLR(s) model with different numbers $s$ of PLS components. For all the fitted FPLSLR(s) models we reconstructed the estimated parameter function $\hat\beta_{(s)}(t)$ and computed the accuracy measures defined in the previous section in order to evaluate the improvement in the estimation.

Table 1. Mean and standard deviation (SD) of goodness-of-fit and accuracy measures of the different optimum FPLSLR models

                   All PLS components   Minimum MSEB      Minimum IMSEB
Measure            Mean      SD         Mean     SD       Mean     SD
PLS components      2.72     0.53       1.24     0.62      2.17     0.70
CCR                87.11     3.42      82.66     3.76     85.44     3.29
IMSEB               0.52     2.19       0.32     0.08      0.20     0.09
MSEB               70.69   755.44       1.94     0.60      8.10     8.25
VAR                61.47   313.64       0.67     0.53     11.52    18.69

Sample replication has been used to validate the results obtained from each simulation.
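
The selection step behind Table 1 (computing MSEB(s) and IMSEB(s) for each $s$ and keeping the minimizer) can be sketched as follows. The sketch assumes that index $j = 0$ in the MSEB sum refers to the intercept, and it uses the identity $\int_T (\beta(t) - \hat\beta_{(s)}(t))^2 dt = d'\Psi d$ with $d = \beta - \hat\beta_{(s)}$, which follows from the basis expansion $\beta(t) = \beta'\Phi$. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def mseb(b_hat, b_true):
    """Mean squared error over the p + 1 coefficients (intercept first)."""
    return float(np.mean((np.asarray(b_hat, float) - np.asarray(b_true, float)) ** 2))

def imseb(beta_hat, beta_true, Psi, length):
    """Integrated squared error of beta(t) divided by the length of T."""
    d = np.asarray(beta_hat, float) - np.asarray(beta_true, float)
    return float(d @ Psi @ d) / length

def select_models(estimates, b_true, Psi, length):
    """Row s-1 of `estimates` holds (alpha_hat, beta_hat_1..p) for s components."""
    estimates = np.asarray(estimates, float)
    b_true = np.asarray(b_true, float)
    m = [mseb(e, b_true) for e in estimates]
    im = [imseb(e[1:], b_true[1:], Psi, length) for e in estimates]
    return int(np.argmin(m)) + 1, int(np.argmin(im)) + 1, m, im

# Toy example with p = 2 basis functions and S = 3 candidate models.
b_true = [0.5, 1.0, -1.0]
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])
estimates = [[0.5, 1.2, -1.0],
             [0.9, 1.0, -0.9],
             [0.5, 3.0, 1.0]]
s_mseb, s_imseb, m, im = select_models(estimates, b_true, Psi, length=1.0)
```

In this toy example the two criteria pick different models (`s_mseb` is 1 and `s_imseb` is 2), which mirrors the fact that Table 1 reports the minimum MSEB and minimum IMSEB models separately.
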
We repeated the simulation of the binary response variable 350 times and, in each replication, selected three optimum FPLSLR(s) models: the one with all PLS components, the one with the smallest $MSEB(s)$ and the one with the smallest $IMSEB(s)$. Across the replications we computed the mean and standard deviation of the goodness-of-fit and accuracy measures associated with these three optimum models. The results appear in Table 1, where we can observe that the mean CCR and the mean IMSEB are similar for the three models, and that the mean number of components needed for the best possible estimation of the parameter function is lowest for the model with the minimum MSEB. However, the MSEB and the variance of the estimated parameter function are drastically reduced in the models proposed in this paper, which minimize the MSEB and IMSEB errors, with the minimum MSEB criterion being the best for selecting the PLS components to be included in the model.

In addition, we have observed that the best estimate of the parameter function provided by the FPLSLR model with minimum MSEB is usually followed by a large increase in its estimated variance. This means that in real applications, where the true parameter function is unknown, a good criterion may be to select as optimum the FPLSLR(s) model immediately preceding a significant increase in the estimated variance of the beta parameters.

Acknowledgements

This research has been supported by Project MTM2004-5992 from Direccion General de Investigacion, Ministerio de Ciencia y Tecnologia.

References

[AEV06] Aguilera, A.M., Escabias, M., Valderrama, M.J.: Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics and Data Analysis, 50, 1905–1924 (2006)
[BET05] Bastien, P., Esposito-Vinzi, V., Tenenhaus, M.: PLS generalised linear regression.
Computational Statistics and Data Analysis, 48, 17–46 (2005)
[EAV04] Escabias, M., Aguilera, A.M., Valderrama, M.J.: Principal component estimation of functional logistic regression: discussion of two different approaches. Journal of Nonparametric Statistics, 16(3-4), 365–384 (2004)
[EAV05] Escabias, M., Aguilera, A.M., Valderrama, M.J.: Modeling environmental data by functional principal component logistic regression. Environmetrics, 16(1), 95–107 (2005)
[FV03] Ferraty, F., Vieu, P.: Curves discrimination: a nonparametric functional approach. Computational Statistics and Data Analysis, 44, 161–173 (2003)
[FF93] Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics, 35, 109–135 (1993)
[Jam02] James, G.M.: Generalized linear models with functional predictors. Journal of the Royal Statistical Society, Series B, 64(3), 411–432 (2002)
[MS05] Müller, H-G., Stadtmüller, U.: Generalized functional linear models. The Annals of Statistics, 33(2), 774–805 (2005)
[PS05] Preda, C., Saporta, G.: PLS regression on a stochastic process. Computational Statistics and Data Analysis, 48(1), 149–158 (2005)
[RS05] Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer-Verlag, New York (2005)
[RHL02] Ratcliffe, S.J., Heller, G.Z., Leader, L.R.: Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Statistics in Medicine, 21, 1115–1127 (2002)
[Hos88] Hoskuldsson, A.: PLS regression methods. Journal of Chemometrics, 2, 211–228 (1988)
[PSC05] Preda, C., Saporta, G., Caroline, L.: PLS classification of functional data. In: Aluja, T., Casanovas, J., Esposito Vinzi, V., Morineau, A., Tenenhaus, M. (eds) Proceedings of the PLS'05 International Symposium, SPAD Groupe Test & Go, 164–174 (2005)
[WMW83] Wold, S., Martens, H., Wold, H.: The multivariate calibration problem in chemistry solved by the PLS method. In: Ruhe, A., Kågström, B. (eds) Proceedings of the Conference Matrix Pencils. Lecture Notes in Mathematics. Springer-Verlag, Heidelberg, 286–293 (1983)