Continuum redundancy PLS regression: a simple continuum approach. Application to epidemiological data.

Stéphanie Bougeard (1), Mohamed Hanafi (2) and El Mostafa Qannari (2)

(1) AFSSA, Département d'épidémiologie animale, Les Croix, BP 53, 22440 Ploufragan (s.bougeard@afssa.fr)
(2) ENITIAA-INRA, Unité de Sensométrie et Chimiométrie, Rue de la Géraudière, BP 82225, 44322 Nantes Cedex (hanafi@enitiaa-nantes.fr, qannari@enitiaa-nantes.fr)

Summary. In this paper, we discuss new formulations of both redundancy analysis (RA) and partial least squares (PLS) regression which clearly show the connection between these two methods. These new formulations also show that RA and PLS regression are the two end points of a continuum approach that we propose to investigate. Moreover, the approach enjoys very interesting properties which highlight the rationale of the continuum and the way it handles the multicollinearity problem. As the method of analysis establishes a bridge between RA and PLS regression, we shall refer to it as "continuum redundancy PLS regression" (CR-PLS). The interest of the method is illustrated on a data set pertaining to epidemiology.

Key words: redundancy analysis, partial least squares regression, continuum, ridge regression.

1 Introduction

Research in epidemiology is concerned with detecting and preventing animal diseases. The purpose of the present study is to predict variables related to animal health from variables related to the breeding environment, alimentary factors and farm management, amongst others. Therefore, investigating the relationships between the variables related to the disease and the variables which are deemed to have an impact on this disease is of paramount importance, as it is a source of opportunities to reduce the disease. This paper deals with the description and the prediction of data organized in an explanatory table X and a table Y to be predicted. The first issue is to describe the two tables and to summarise the relationships between the variables. The second issue is to predict Y from X and to determine which X variables have an impact on Y.

Let us consider a set of predictors $X = [x_1, \dots, x_P]$ and a set of variables to be predicted $Y = [y_1, \dots, y_Q]$. All these variables are measured on the same n individuals and are assumed to be centred. Redundancy analysis (RA) [?] can be viewed either as a regression of Y upon linear combinations of X, or as a principal component analysis (PCA) of Y in which the components are constrained to be linear combinations of X [?]. Thus RA can be used as a method of analysis to investigate the relationships between Y and X or to set up prediction models. However, in the presence of quasi-collinearity among the X variables, RA may lead to unstable models owing to the fact that it involves the inversion of $X'X$, which is likely to be ill-conditioned. Partial least squares (PLS) regression [?] may be seen as an alternative to RA in the case of such ill-conditioned problems.

In this paper, we discuss new formulations of RA and PLS which clearly show the connection between these methods. These new formulations also show that RA and PLS are the two end points of a continuum approach that we propose to investigate. In comparison to other continuum approaches proposed within regression analysis, the paramount feature of the approach discussed herein lies in the fact that it is conceptually simple to grasp and implement.
Moreover, it enjoys properties which highlight the rationale of the continuum approach and the way it handles the multicollinearity problem. As the method of analysis establishes a bridge between RA and PLS, we shall refer to it as "Continuum Redundancy PLS regression" (CR-PLS).

2 New formulations of redundancy analysis and PLS

Several formulations have been proposed in order to introduce redundancy analysis (RA) [?, ?, ?]. In order to describe and predict Y from X, redundancy analysis seeks, in a first step, a standardized latent variable $t^{(1)} = Xw^{(1)}$ which is highly related to the variables to be predicted Y. More formally, RA consists in maximizing the following criterion:

\[ \max\; Q_{ra} = \sum_{q=1}^{Q} \operatorname{cov}^2(y_q, t^{(1)}) \quad \text{with } t^{(1)} = Xw^{(1)} \text{ and } \|t^{(1)}\| = 1 \qquad (1) \]

We can prove that $w^{(1)}$ is the first eigenvector of:

\[ M_{ra} = \frac{1}{n^2}\,(X'X)^{-1} X'YY'X \qquad (2) \]

In order to improve the prediction ability of the model, a second latent variable $t^{(2)}$ may be sought to complement $t^{(1)}$. This can be done by maximizing, with respect to $t^{(2)} = Xw^{(2)}$, the criterion $Q_{ra}$ under the constraint that $t^{(2)}$ is orthogonal to $t^{(1)}$. The solution to this problem leads to setting the second vector of loadings $w^{(2)}$ as the second eigenvector of $M_{ra}$. This process can easily be extended in order to derive subsequent latent variables $(t^{(1)}, \dots, t^{(h)})$. The vectors of loadings $(w^{(1)}, \dots, w^{(h)})$ associated with these latent variables are the successive eigenvectors of the matrix $M_{ra}$.

In order to set up a general strategy of analysis which encompasses RA and PLS, we propose to derive the latent variables by iteratively deflating X, as is done in PLS regression. More precisely, once the first latent variable $t^{(1)}$ is determined, we consider $X^{(1)}$, the matrix of residuals of the regression of X upon $t^{(1)}$. We seek a second latent variable $t^{(2)} = X^{(1)}w^{(2)}$ so as to maximize $Q_{ra}^{(1)} = \sum_{q=1}^{Q} \operatorname{cov}^2(y_q, t^{(2)})$ with $\|t^{(2)}\| = 1$. The solution $w^{(2)}$ is the first eigenvector of $M_{ra}^{(1)} = \frac{1}{n^2}\,(X^{(1)\prime}X^{(1)})^{-1} X^{(1)\prime}YY'X^{(1)}$. As a matter of fact, we can show that the first solution, based on the eigenvector associated with the second largest eigenvalue of the matrix $M_{ra}$, and the second solution, based on the first eigenvector of the matrix $M_{ra}^{(1)}$, lead to the same latent variable $t^{(2)}$.

The latent variables obtained by means of RA may be used for an exploratory purpose, in order to investigate the relationships between X and Y, or for predicting the Y variables from X. This can be done by considering a set of h latent variables $(t^{(1)}, \dots, t^{(h)})$ and performing a regression of Y upon these latent variables. This method is called reduced-rank regression [?, ?]. The number of latent variables to be retained in the model may be determined by usual techniques such as cross-validation [STO74].

As redundancy analysis is based on the eigenstructure of the matrix $M_{ra}$, which involves the inversion of $X'X$, it may lead to an unstable model in the case of ill-conditioned problems. In order to circumvent this problem, PLS regression may be seen as an alternative to RA. The first latent variable $t^{(1)} = Xw^{(1)}$ is determined so as to maximize the criterion:

\[ \max\; Q_{pls} = \sum_{q=1}^{Q} \operatorname{cov}^2(y_q, t^{(1)}) \quad \text{with } t^{(1)} = Xw^{(1)} \text{ and } \|w^{(1)}\| = 1 \qquad (3) \]

We can prove that $w^{(1)}$ is the first eigenvector of:

\[ M_{pls} = \frac{1}{n^2}\, X'YY'X \qquad (4) \]

Subsequent latent variables $(t^{(1)}, \dots, t^{(h)})$ may be obtained according to the usual deflation technique on the X table. It is worth noting at this stage that, in PLS regression, it is usually advocated to deflate the Y variables at each stage using the latent variables previously determined. However, the latent variables being orthogonal, it is clear that the same results are obtained whether or not this deflation is performed.
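To make these two formulations concrete, the following sketch in Python/numpy (an illustration of our own under the notation above, not code taken from the paper; the data matrices X and Y are assumed to be column-centred) returns the first vector of loadings as the dominant eigenvector of $M_{ra}$ or $M_{pls}$, together with the corresponding latent variable:

    import numpy as np

    def first_loading(X, Y, method="pls"):
        """First vector of loadings w(1) and latent variable t(1) = X w(1)
        for column-centred X (n x P) and Y (n x Q)."""
        n = X.shape[0]
        C = (X.T @ Y) @ (Y.T @ X) / n**2            # (1/n^2) X'Y Y'X
        if method == "ra":
            M = np.linalg.solve(X.T @ X, C)         # M_ra = (1/n^2) (X'X)^{-1} X'Y Y'X
        else:
            M = C                                   # M_pls = (1/n^2) X'Y Y'X
        vals, vecs = np.linalg.eig(M)               # M_ra is not symmetric in general
        w = np.real(vecs[:, np.argmax(np.real(vals))])
        if method == "ra":
            w = w / np.linalg.norm(X @ w)           # RA constraint ||t(1)|| = 1
        else:
            w = w / np.linalg.norm(w)               # PLS constraint ||w(1)|| = 1
        return w, X @ w

Subsequent components would then be obtained by applying the same computation to the deflated matrices, as described above.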
3 Continuum Redundancy PLS regression

3.1 A bridge between PLS and RA

It turns out that RA and PLS regression are respectively based on the eigenstructure of the matrices $(X'X)^{-1}X'YY'X$ and $X'YY'X$. Thus it appears that PLS regression corresponds to a shrinkage of the matrix $(X'X)^{-1}$ towards the identity matrix I. From this standpoint, we can adopt a gradual shrinkage of $(X'X)^{-1}$ towards I by considering a convex combination of these two matrices. Moreover, the difference between RA and PLS lies in the constraints imposed on the vectors of loadings: $\|Xw\| = 1$ for RA and $\|w\| = 1$ for PLS. Accordingly, we can adopt an intermediate constraint by seeking, in a first step, a latent variable $t_\gamma^{(1)}$ so as to maximize:

\[ \max\; Q_\gamma = \sum_{q=1}^{Q} \operatorname{cov}^2(y_q, t_\gamma^{(1)}) \quad \text{under the constraints } t_\gamma^{(1)} = Xw_\gamma^{(1)},\;\; \gamma\|w_\gamma^{(1)}\|^2 + (1-\gamma)\|t_\gamma^{(1)}\|^2 = 1 \text{ and } 0 \le \gamma \le 1 \qquad (5) \]

We can prove that $w_\gamma^{(1)}$ is the first eigenvector of:

\[ M_\gamma = \frac{1}{n^2}\left[(1-\gamma)(X'X) + \gamma I\right]^{-1} X'YY'X \qquad (6) \]

The first latent variable is given by $t_\gamma^{(1)} = Xw_\gamma^{(1)}$. In a second step, X is regressed upon $t_\gamma^{(1)}$ and, as previously, the matrix of residuals $X_\gamma^{(1)}$ is considered. A second latent variable $t_\gamma^{(2)} = X_\gamma^{(1)}w_\gamma^{(2)}$ is determined, where $w_\gamma^{(2)}$ is the first eigenvector of $M_\gamma^{(1)} = \frac{1}{n^2}\left[(1-\gamma)(X_\gamma^{(1)\prime}X_\gamma^{(1)}) + \gamma I\right]^{-1} X_\gamma^{(1)\prime}YY'X_\gamma^{(1)}$. Subsequent latent variables $(t_\gamma^{(1)}, \dots, t_\gamma^{(h)})$ are derived according to the same principle. This set of latent variables can be used for the purpose of exploring the relationships between X and Y. We shall refer to this strategy of analysis as Continuum Redundancy PLS regression (CR-PLS). It is clear that the case $\gamma = 0$ corresponds to RA, whereas $\gamma = 1$ corresponds to PLS.

3.2 Modelling and prediction

The prediction of the Y variables is based on the latent variables $(t_\gamma^{(1)}, \dots, t_\gamma^{(h)})$. These latent variables being orthogonal by construction, the Y table is split up into $Y = t_\gamma^{(1)}c_\gamma^{(1)\prime} + \dots + t_\gamma^{(h)}c_\gamma^{(h)\prime} + Y_\gamma^{(h)}$, where $Y_\gamma^{(h)}$ is the matrix of residuals. Moreover, the latent variables can be expressed as linear combinations of X: $t_\gamma^{(1)} = Xw_\gamma^{*(1)}, \dots, t_\gamma^{(h)} = Xw_\gamma^{*(h)}$. The vectors of loadings $w_\gamma^{*}$ and $c_\gamma$ are defined as in PLS regression [?]. This leads to the model $Y = X(w_\gamma^{*(1)}c_\gamma^{(1)\prime} + \dots + w_\gamma^{*(h)}c_\gamma^{(h)\prime}) + Y_\gamma^{(h)}$.

From a practical point of view, if the objective is to explain the Y table from all the X variables, the choice of the optimal method (optimal value of γ and optimal number $h_\gamma$ of components $t_\gamma$) is performed by minimizing the root mean square error of calibration, $\mathrm{RMSEC} = \sqrt{\sum_{q=1}^{Q} \|y_q - \hat{y}_q^{(h)}\|^2 / Q}$. Another objective is to use the model set up on a calibration set in order to predict the Y values for new observations. The choice of an optimal parameter γ and of the appropriate number of latent variables can then be made by means of a validation technique such as cross-validation [STO74], by minimizing the root mean square error of validation (RMSEV). This criterion is basically similar to RMSEC, except that it is computed on the basis of several subsets of the original observations which are set aside in turn and used for validation purposes.
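The extraction scheme of Section 3.1 lends itself to a very short implementation. The following Python/numpy sketch (our own illustration under the notation above, not the authors' code) extracts h CR-PLS latent variables for a given value of γ; γ = 0 gives the RA solution and γ = 1 the PLS solution:

    import numpy as np

    def cr_pls_components(X, Y, gamma, h):
        """Extract h CR-PLS latent variables (columns of T) for column-centred X and Y."""
        n, P = X.shape
        Xk = X.copy()
        T = np.zeros((n, h))
        for k in range(h):
            C = (Xk.T @ Y) @ (Y.T @ Xk) / n**2                      # (1/n^2) Xk'Y Y'Xk
            A = (1 - gamma) * (Xk.T @ Xk) + gamma * np.eye(P)       # shrunk cross-product matrix
            vals, vecs = np.linalg.eig(np.linalg.solve(A, C))       # M_gamma, cf. eq. (6)
            w = np.real(vecs[:, np.argmax(np.real(vals))])
            t = Xk @ w
            scale = np.sqrt(gamma * (w @ w) + (1 - gamma) * (t @ t))  # constraint of eq. (5)
            w, t = w / scale, t / scale
            T[:, k] = t
            Xk = Xk - np.outer(t, (t @ Xk) / (t @ t))               # deflate X upon t
        return T

A regression of Y upon the columns of T (which are mutually orthogonal) then yields the decomposition of Section 3.2, and RMSEC follows from the corresponding residuals.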
3.3 Properties

The CR-PLS approach discussed herein stands as a compromise between a model with a good explanation ability (RA, for γ = 0) and a stable model (PLS, for γ = 1). The following properties highlight the idea that, when γ increases, the strategy of analysis investigates more stable directions in the X space, as reflected by the variance of the first latent variable, but the explanation ability of the model decreases, as reflected by the percentage of the total variance of Y explained by the latent variable.

Stability of the model. It can be reflected by the variance of the latent component $t_\gamma$, $\operatorname{var}(t_\gamma)$, determined at each stage. We can prove that $\operatorname{var}(t_\gamma)$ is an increasing function of γ. When γ increases, the continuum approach investigates more stable directions and avoids latent variables with small variances, which may reflect noise only. It also follows that RA corresponds to the solution with the smallest variance of $t_\gamma$, whereas PLS regression corresponds to the solution with the largest variance of $t_\gamma$.

Explanation ability. It can be reflected by the variance of Y explained by the latent variable $t_\gamma$, which can be written as $\operatorname{ExplVar}(Y, t_\gamma) = \sum_q \operatorname{cov}^2(y_q, t_\gamma)/\operatorname{var}(t_\gamma)$. We can prove that this is a decreasing function of γ. When γ increases, the explanation ability of the model obtained by the regression of Y upon $t_\gamma$ decreases. It follows that RA corresponds to the solution with the best explanation ability, compared with PLS.

Sensitivity to multicollinearity. It can be reflected by the condition index [BKW80] $\eta = \lambda_1/\lambda_P$, with $\lambda_1$ and $\lambda_P$ respectively the largest and the smallest eigenvalues of $X'X$. A large value of η flags the presence of multicollinearity among the X variables, which is likely to lead to an unstable model. As stated above, CR-PLS involves, within the matrix $M_\gamma$, the inversion of $(1-\gamma)X'X + \gamma I$; the condition index of this matrix is given by $\eta_\gamma = [(1-\gamma)\lambda_1 + \gamma]/[(1-\gamma)\lambda_P + \gamma]$. It is easy to prove, by considering the derivative of $\eta_\gamma$, that $\eta_\gamma$ decreases when γ increases. PLS corresponds to the smallest $\eta_\gamma$, whereas RA corresponds to the largest one.
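The effect of the shrinkage on the conditioning can be checked directly. The short Python/numpy sketch below (our own illustration, not part of the paper) computes $\eta_\gamma$ for the matrix $(1-\gamma)X'X + \gamma I$; for γ = 0 it returns the condition index η of $X'X$ itself, and it decreases monotonically towards 1 as γ approaches 1:

    import numpy as np

    def condition_index(X, gamma):
        """Condition index of (1 - gamma) X'X + gamma I for a column-centred matrix X."""
        lam = np.linalg.eigvalsh(X.T @ X)      # eigenvalues of X'X, in ascending order
        lam_P, lam_1 = lam[0], lam[-1]         # smallest and largest eigenvalues
        return ((1 - gamma) * lam_1 + gamma) / ((1 - gamma) * lam_P + gamma)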
3.4 Link with existing methods

CR-PLS bears some resemblance to the regularization procedure used in ridge regression [HOE62] in the case where Y = [y]. The ridge coefficient, solution of $\hat{y} = X\beta_{ridge(k)}$, is defined by $\beta_{ridge(k)} = (X'X + kI)^{-1}X'y$. We can prove that, in this case, the first step of CR-PLS is tightly linked to ridge regression. CR-PLS, however, has some advantages over ridge regression. Among these advantages, we can mention the fact that the prediction model derived from the first latent variable can be improved by extracting additional orthogonal latent variables. Moreover, the strategy of analysis encompasses both the univariate and the multivariate cases, and can be viewed as an extension of ridge regression to the multivariate setting.

It is worth mentioning that several continuum approaches for linking two data sets already exist. The most popular is continuum regression, proposed by Stone and Brooks [STO90] and extended to the multivariate setting by Brooks and Stone [BRO94]. This procedure also encompasses several methods, including PLS and ridge regression [SUN93]. However, continuum regression is based on a relatively complex iterative algorithm, whereas the CR-PLS approach discussed herein is straightforward and simple. Notwithstanding this simplicity, it was demonstrated on the basis of a case study that, in the univariate setting (one variable to predict), the simple continuum approach outperformed continuum regression, which itself slightly outperformed PLS [QAN05]. CR-PLS also bears some similarities to principal covariates regression, proposed by De Jong and Kiers [JON92], who introduced a tuning parameter that bridges the gap between RA and regression on principal components (PCR), whereas, as stressed above, CR-PLS bridges the gap between RA and PLS2. This can be considered an advantage of CR-PLS over principal covariates regression, because some authors consider that it may not be appropriate to include PCR in a continuum regression family [FAR90].

4 Application

4.1 Epidemiological data

A study involving 158 farms was carried out in France in 2001 in order to assess the risk factors for post-weaning multisystemic wasting syndrome (PMWS). The impact of PMWS is great because of the losses due to mortality. Porcine circovirus type 2 (PCV2) plays a pivotal role in this syndrome. The Y table contains 3 quantitative variables giving the percentage of animals (sows, fattening pigs and piglets) seropositive to PCV2. The X table contains 17 variables pertaining to farming features, hygiene and infectious co-factors.

4.2 Explanation of Y from X

If the objective is to identify, from the X variables, the risk factors of the disease reflected by the Y variables, the choice of the optimal method is performed by minimizing the root mean square error of calibration (RMSEC). Figure 1 depicts this criterion as a function of the number of latent variables introduced in the model and of the tuning parameter γ. It can be seen that several values of γ and of the dimension $h_\gamma$ minimize the RMSEC. RA (γ = 0) is more oriented towards the explanation of Y than PLS (γ = 1): only 3 dimensions are needed in RA to achieve the same performance as PLS with 8 dimensions. However, RA seems to be unstable on the last dimensions, as reflected by the increase of the RMSEC. From Figure 1, we can state that an optimal method to explain Y from X is obtained for γ = 0.2 with $h_\gamma$ = 3 dimensions.

Fig. 1. RMSEC criterion as a function of the number of latent variables introduced in the model and of the tuning parameter γ.

4.3 Prediction of Y from X

Another objective is to predict the Y variables for new observations (new farms) that did not take part in the calibration process. The choice of the optimal method is then performed by minimizing the root mean square error of validation (RMSEV). In Figure 2, the sensitivity to multicollinearity is reflected by the fact that, when more than one latent variable is introduced into the model, the criterion RMSEV increases, indicating that the prediction ability deteriorates. This is particularly the case for RA, as the criterion jumps when more than 12 latent variables are introduced in the model. It can also be seen that the best method to predict Y from X is PLS (γ = 1) with only one latent variable.

Fig. 2. Prediction ability assessed by RMSEV, as a function of γ and of the number of latent variables introduced in the model.
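The γ/dimension scan underlying Figures 1 and 2 can be mimicked with a few lines of code. The Python/numpy sketch below is our own illustration of such a cross-validated selection, not the authors' implementation; cr_pls_fit_predict is a hypothetical function assumed to fit the CR-PLS model of Section 3 on a training set and return predictions for a test set:

    import numpy as np

    def rmsev(X, Y, fit_predict, n_folds=5, seed=0):
        """Cross-validated root mean square error of validation, averaged here over all
        left-out observations and all Y variables (one possible convention)."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
        sq_err, n_pred = 0.0, 0
        for test in folds:
            train = np.setdiff1d(np.arange(X.shape[0]), test)
            Y_hat = fit_predict(X[train], Y[train], X[test])
            sq_err += np.sum((Y[test] - Y_hat) ** 2)
            n_pred += Y[test].size
        return np.sqrt(sq_err / n_pred)

    # Hypothetical grid search over gamma and the number of latent variables h:
    # scores = {(g, h): rmsev(X, Y, lambda Xtr, Ytr, Xte: cr_pls_fit_predict(Xtr, Ytr, Xte, g, h))
    #           for g in np.linspace(0.0, 1.0, 11) for h in range(1, 9)}
    # gamma_opt, h_opt = min(scores, key=scores.get)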
5 Conclusion

Continuum Redundancy PLS regression is a continuum method for exploring and modelling the relationships between X and Y. The key feature of this approach is the shrinkage of the variance-covariance matrix $X'X$ towards the identity matrix, which leads to a continuum of methods ranging from redundancy analysis to PLS regression. Moreover, the method is easy to understand, because there is a global criterion to maximise and because the solutions are derived from the eigenanalysis of a matrix. CR-PLS stands as a compromise between a model with a good explanation ability (RA) and a stable model (PLS). CR-PLS with a small value of γ can lead to latent variables with small variances (likely to be linked to noise) but with a good explanation ability. CR-PLS with a large value of γ can lead to a biased solution, less linked to Y but more stable, that is, less sensitive to multicollinearity.

References

[BKW80] Belsley, D.A., Kuh, E. and Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York (1980)
[BRO94] Brooks, R.J. and Stone, M.: Joint continuum regression for multiple predictands. Journal of the American Statistical Association, 89, 1374-1377 (1994)
[DAV82] Davies, P.T. and Tso, M.K.S.: Procedures for reduced-rank regression. Applied Statistics, 31, 244-255 (1982)
[JON92] De Jong, S. and Kiers, H.A.L.: Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155-164 (1992)
[FAR90] Farebrother, R.W.: Discussion of the paper by Stone and Brooks. Journal of the Royal Statistical Society, 52(2), 263 (1990)
[HOE62] Hoerl, A.E.: Application of ridge analysis to regression problems. Chemical Engineering Progress, 58, 54-59 (1962)
[MUL81] Muller, K.E.: Relationships between redundancy analysis, canonical correlation and multivariate regression. Psychometrika, 46, 139-142 (1981)
[QAN05] Qannari, E.M. and Hanafi, M.: A simple continuum regression approach. Journal of Chemometrics, 19, 387-392 (2005)
[RAO64] Rao, C.R.: The use and interpretation of principal component analysis in applied research. Sankhya A, 26, 329-358 (1964)
[SAB84] Sabatier, R.: Quelques généralisations de l'analyse en composantes principales de variables instrumentales. Statistique et Analyse des Données, 9, 75-103 (1984)
[STO74] Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36, 111-147 (1974)
[STO90] Stone, M. and Brooks, R.J.: Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society, 52(2), 237-269 (1990)
[SUN93] Sundberg, R.: Continuum regression and ridge regression. Journal of the Royal Statistical Society, 55(3), 653-659 (1993)
[TEN98] Tenenhaus, M.: La régression PLS. Théorie et pratique. Technip, Paris (1998)
[WOL77] Van Den Wollenberg, A.: Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika, 42, 207-219 (1977)
[WOL66] Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis. Academic Press, New York (1966)