Continuum redundancy PLS regression: a simple continuum approach. Application to epidemiological data.
Stéphanie Bougeard¹, Mohamed Hanafi² and El Mostafa Qannari²

¹ AFSSA, Département d'épidémiologie animale - Les Croix, BP 53, 22440 Ploufragan (s.bougeard@afssa.fr)
² ENITIAA-INRA, Unité de Sensométrie et Chimiométrie - Rue de la Géraudière, BP 82225, 44322 Nantes Cedex (hanafi@enitiaa-nantes.fr, qannari@enitiaa-nantes.fr)
Summary. In this paper, we discuss new formulations of both redundancy analysis (RA) and partial least squares (PLS) regression which clearly show the connection between these two methods. These new formulations also show that RA and PLS regression are the two end points of a continuum approach that we propose to investigate. Moreover, this approach enjoys very interesting properties which highlight the rationale of the continuum approach and how it handles the multicollinearity problem. As the method of analysis establishes a bridge between RA and PLS regression, we shall refer to it as "continuum redundancy PLS regression" (CR-PLS). The interest of the method is illustrated on the basis of a data set pertaining to epidemiology.
Key words: redundancy analysis, partial least squares regression, continuum, ridge regression.
1 Introduction
Research in epidemiology is concerned with detecting and preventing animal diseases. The purpose of the present study is to predict variables related to animal health from variables related to the breeding environment, feeding factors and farm management, amongst others. Therefore, the investigation of the relationships between the variables related to the disease and the variables which are deemed to have an impact on this disease is of paramount importance, as it points to opportunities for reducing the disease.
This paper deals with the description and the prediction of data organized in an explanatory table X and a table Y to be predicted. The first issue is to describe the two tables and to summarise the relationships between the variables. The second issue is to predict Y from X and to determine which X variables have an impact on Y. Consider a set of predictors X = [x_1, ..., x_P] and a set of variables to be predicted Y = [y_1, ..., y_Q]. All these variables are measured on the same n individuals and are assumed to be centred. Redundancy analysis (RA) [WOL77] can be viewed either as a
regression of Y upon linear combinations of X, or as a principal component analysis (PCA) of X and Y where the components are constrained to be linear combinations of X [?]. Thus RA can be used as a method of analysis to investigate the relationships between Y and X or to set up prediction models. However, in the presence of quasi-collinearity among the X variables, RA may lead to unstable models, owing to the fact that it involves the inversion of (X'X), which is likely to be ill-conditioned. Partial least squares (PLS) regression [?] may be seen as an alternative to RA in the case of ill-conditioned problems.
In this paper, we discuss new formulations of RA and PLS which clearly show the connection between these methods. These new formulations also show that RA and PLS are the two end points of a continuum approach that we propose to investigate. In comparison with other continuum approaches proposed within regression analysis, the paramount feature of the approach discussed herein lies in the fact that it is conceptually simple to grasp and to implement. Moreover, it enjoys properties which highlight the rationale of the continuum approach and how it handles the multicollinearity problem. As the method of analysis establishes a bridge between RA and PLS, we shall refer to it as "Continuum Redundancy PLS regression" (CR-PLS).
2 New formulations of redundancy analysis and PLS
Several formulations have been proposed in order to introduce redundancy analysis (RA) [?, ?, ?]. In order to describe and predict Y from X, redundancy analysis seeks, in a first step, a standardized latent variable t^{(1)} = Xw^{(1)} which is highly related to the variables to be predicted Y. More formally, RA consists in maximizing the following criterion:

\max\; Q_{ra} = \sum_{q=1}^{Q} \mathrm{cov}^2\bigl(y_q, t^{(1)}\bigr) \quad \text{with } t^{(1)} = Xw^{(1)} \text{ and } \|t^{(1)}\| = 1 \qquad (1)
We can prove that w^{(1)} is the first eigenvector of:

M_{ra} = \frac{1}{n^2}\,(X'X)^{-1} X'YY'X \qquad (2)
In order to improve the prediction ability of the model, a second orthogonal latent variable t^{(2)} may be sought to complement t^{(1)}. This can be done by maximizing, with respect to t^{(2)} = Xw^{(2)}, the criterion Q_{ra} under the constraint that t^{(2)} is orthogonal to t^{(1)}. The solution to this problem leads to setting the second vector of loadings w^{(2)} as the second eigenvector of M_{ra}. This process can easily be extended in order to derive subsequent latent variables (t^{(1)}, ..., t^{(h)}). The vectors of loadings (w^{(1)}, ..., w^{(h)}) associated with these latent variables are obtained from the successive eigenvectors of matrix M_{ra}.
In order to set up a general strategy of analysis which encompasses RA and PLS, we propose to derive the latent variables by iteratively deflating X, as is done in PLS regression. More precisely, once the first latent variable t^{(1)} is determined, we consider X^{(1)}, the matrix of residuals of the regression of X upon t^{(1)}.
We seek a second latent variable t^{(2)} = X^{(1)}w^{(2)} so as to maximize Q_{ra}^{(1)} = \sum_{q=1}^{Q} \mathrm{cov}^2(y_q, t^{(2)}) with ||t^{(2)}|| = 1. The solution w^{(2)} is the first eigenvector of M_{ra}^{(1)} = \frac{1}{n^2}\,(X^{(1)'}X^{(1)})^{-1} X^{(1)'}YY'X^{(1)}. As a matter of fact, we can show that the first solution, based on the eigenvector associated with the second largest eigenvalue of matrix M_{ra}, and the second solution, based on the first eigenvector of matrix M_{ra}^{(1)}, lead to the same latent variable t^{(2)}.
The latent variables obtained by means of RA may be used for an exploratory purpose, in order to investigate the relationships between X and Y, or for predicting the Y variables from X. This can be done by considering a set of h latent variables (t^{(1)}, ..., t^{(h)}) and performing a regression of Y upon these latent variables. This method is called reduced-rank regression [RAO64, DAV82]. The number of latent variables to be retained in the model may be determined by usual techniques such as cross-validation [STO74].
As redundancy analysis is based on the eigenstructure of matrix M_{ra}, which involves the inversion of (X'X), it may lead to an unstable model in the case of ill-conditioned problems. In order to circumvent this problem, PLS regression may be seen as an alternative to RA. The first latent variable t^{(1)} = Xw^{(1)} is determined so as to maximize the criterion:

\max\; Q_{pls} = \sum_{q=1}^{Q} \mathrm{cov}^2\bigl(y_q, t^{(1)}\bigr) \quad \text{with } t^{(1)} = Xw^{(1)} \text{ and } \|w^{(1)}\| = 1 \qquad (3)
We can prove that w^{(1)} is the first eigenvector of:

M_{pls} = \frac{1}{n^2}\, X'YY'X \qquad (4)
Subsequent latent variables (t^{(1)}, ..., t^{(h)}) may be obtained according to the usual deflation technique applied to the X table. It is worth noting at this stage that, in PLS regression, it is advocated to deflate at each stage the Y variables using the latent variables that have been previously determined. However, the latent variables being orthogonal, it is clear that the same results are obtained whether or not the Y table is deflated.
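To make the two eigen-problems concrete, here is a minimal numpy sketch (ours, not part of the original paper; the function name and the assumption that X and Y are pre-centred arrays are ours) extracting the first RA or PLS latent variable from matrices (2) and (4):

```python
import numpy as np

def first_component(X, Y, method="ra"):
    """First latent variable t = X w of RA (eq. 2) or PLS (eq. 4).

    X (n x P) and Y (n x Q) are assumed to be column-centred.
    """
    n = X.shape[0]
    C = (X.T @ Y @ Y.T @ X) / n**2          # (1/n^2) X'YY'X, matrix (4)
    if method == "ra":
        M = np.linalg.solve(X.T @ X, C)     # (X'X)^{-1} on the left, matrix (2); assumes X'X invertible
    else:
        M = C
    vals, vecs = np.linalg.eig(M)           # M_ra is not symmetric in general
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    if method == "ra":
        w = w / np.linalg.norm(X @ w)       # RA constraint ||t|| = 1
    else:
        w = w / np.linalg.norm(w)           # PLS constraint ||w|| = 1
    return w, X @ w
```

For RA the unit eigenvector is rescaled so that the norm constraint bears on t rather than on w, in line with criteria (1) and (3).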
3 Continuum Redundancy PLS regression
3.1 A bridge between PLS and RA
It turns out that RA and PLS regression are respectively based on the eigenstructure of the matrices [(X'X)^{-1}X'YY'X] and (X'YY'X). Thus it appears that PLS regression corresponds to a shrinkage of the matrix (X'X)^{-1} towards the identity matrix I. From this standpoint, we can adopt a gradual shrinkage of the matrix (X'X)^{-1} towards I by considering a convex combination of these two matrices. Moreover, the difference between RA and PLS lies in the constraints imposed on the vectors of loadings: ||Xw|| = 1 for RA and ||w|| = 1 for PLS. Alternatively, we can adopt an intermediate constraint by seeking, in a first step, a latent variable t_{\gamma}^{(1)} so as to maximize:
\max\; Q_{\gamma} = \sum_{q=1}^{Q} \mathrm{cov}^2\bigl(y_q, t_{\gamma}^{(1)}\bigr) \quad \text{under the constraints} \quad t_{\gamma}^{(1)} = Xw_{\gamma}^{(1)}, \;\; \gamma\|w_{\gamma}^{(1)}\|^2 + (1-\gamma)\|t_{\gamma}^{(1)}\|^2 = 1 \;\; \text{and} \;\; 0 \le \gamma \le 1 \qquad (5)

We can prove that w_{\gamma}^{(1)} is the first eigenvector of:

M_{\gamma} = \frac{1}{n^2}\,\bigl[(1-\gamma)(X'X) + \gamma I\bigr]^{-1} X'YY'X \qquad (6)
The first latent variable is given by t_{\gamma}^{(1)} = Xw_{\gamma}^{(1)}. In a second step, X is regressed upon t_{\gamma}^{(1)} and, as previously, the matrix of residuals X_{\gamma}^{(1)} is considered. A second latent variable t_{\gamma}^{(2)} = X_{\gamma}^{(1)}w_{\gamma}^{(2)} is determined, where w_{\gamma}^{(2)} is the first eigenvector of M_{\gamma}^{(1)} = \frac{1}{n^2}\,[(1-\gamma)(X_{\gamma}^{(1)'}X_{\gamma}^{(1)}) + \gamma I]^{-1} X_{\gamma}^{(1)'}YY'X_{\gamma}^{(1)}. Subsequent latent variables (t_{\gamma}^{(1)}, ..., t_{\gamma}^{(h)}) are derived according to the same principle. This set of latent variables can be used for the purpose of exploring the relationships between X and Y. We shall refer to this strategy of analysis as Continuum Redundancy PLS Regression (CR-PLS). It is clear that the case γ = 0 corresponds to RA whereas γ = 1 corresponds to PLS.
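The full CR-PLS extraction loop can be sketched along the same lines as the snippet in Section 2 (again our code, with centred arrays assumed). Besides the latent variables, it tracks the weights expressed on the original X, which the prediction model of Section 3.2 requires:

```python
import numpy as np

def cr_pls(X, Y, gamma, h):
    """CR-PLS latent variables for a given gamma in [0, 1] (a sketch).

    gamma = 0 recovers RA, gamma = 1 recovers PLS. Returns T (n x h) and
    W_star (P x h) such that T = X @ W_star, with X and Y column-centred.
    """
    n, P = X.shape
    Xk, G = X.copy(), np.eye(P)              # invariant: Xk = X @ G
    T, W_star = [], []
    for _ in range(h):
        S = (1 - gamma) * (Xk.T @ Xk) + gamma * np.eye(P)
        # matrix (6); pinv guards the gamma = 0 case, where Xk'Xk is singular after deflation
        M = np.linalg.pinv(S) @ (Xk.T @ Y @ Y.T @ Xk) / n**2
        vals, vecs = np.linalg.eig(M)
        w = np.real(vecs[:, np.argmax(np.real(vals))])
        t = Xk @ w
        scale = np.sqrt(gamma * (w @ w) + (1 - gamma) * (t @ t))  # constraint (5)
        w, t = w / scale, t / scale
        ws = G @ w                           # same linear combination, on the original X
        T.append(t); W_star.append(ws)
        row = (t @ Xk) / (t @ t)             # loadings of Xk upon t
        Xk = Xk - np.outer(t, row)           # deflation step
        G = G - np.outer(ws, row)            # keeps the invariant Xk = X @ G
    return np.column_stack(T), np.column_stack(W_star)
```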
3.2 Modelling and prediction
The prediction of the Y variables is based on the latent variables (t_{\gamma}^{(1)}, ..., t_{\gamma}^{(h)}). These latent variables being orthogonal by construction, the Y table is split up into Y = t_{\gamma}^{(1)}c_{\gamma}^{(1)'} + \ldots + t_{\gamma}^{(h)}c_{\gamma}^{(h)'} + Y_{\gamma}^{(h)}, Y_{\gamma}^{(h)} being the matrix of residuals. Moreover, the latent variables can be expressed as linear combinations of X: t_{\gamma}^{(1)} = Xw_{\gamma}^{*(1)}, ..., t_{\gamma}^{(h)} = Xw_{\gamma}^{*(h)}. The vectors of loadings w_{\gamma}^{*} and c_{\gamma} are defined as in PLS regression [TEN98]. This leads to the model: Y = X\bigl(w_{\gamma}^{*(1)}c_{\gamma}^{(1)'} + \ldots + w_{\gamma}^{*(h)}c_{\gamma}^{(h)'}\bigr) + Y_{\gamma}^{(h)}.
From a practical point of view, if the objective is to explain the Y table from all the X variables, the choice of the optimal method (optimal value of γ and optimal number h_{\gamma} of components t_{\gamma}) is performed by minimizing the root mean square error of calibration: RMSEC = \sqrt{\sum_{q=1}^{Q} \|y_q - \hat{y}_q^{(h)}\|^2 / Q}.
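Continuing the sketch from Section 3.1 (our code, not the authors'), the matrix of regression coefficients B such that Y ≈ XB, together with the RMSEC above, follows directly:

```python
def cr_pls_model(X, Y, gamma, h):
    """Regression coefficients B (P x Q) and the RMSEC of Section 3.2."""
    T, W_star = cr_pls(X, Y, gamma, h)
    C, *_ = np.linalg.lstsq(T, Y, rcond=None)    # c loadings: regression of Y upon T
    B = W_star @ C                               # model Y ~ X @ B
    rmsec = np.sqrt(np.sum((Y - X @ B) ** 2) / Y.shape[1])  # RMSEC formula above
    return B, rmsec
```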
Another objective is to use the model, which was set up on a calibration set, in order to predict the Y values for new observations. The choice of the optimal parameter γ and of the appropriate number of latent variables can be made by means of a validation technique such as cross-validation [STO74], by minimizing the root mean square error of validation (RMSEV). The criterion RMSEV is basically similar to RMSEC, except that it is computed on the basis of several subsets of the original observations which are set aside in turn in order to be used for validation purposes.
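A possible implementation of this selection is sketched below; the paper does not prescribe the splitting scheme, so the 5-fold layout, and the simplification of centring X and Y once beforehand, are our assumptions:

```python
def select_by_cv(X, Y, gammas, h_max, n_folds=5, seed=0):
    """Grid-search (gamma, h) by minimizing a K-fold RMSEV (a sketch;
    strictly, centring should be redone on each training fold)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    best = (None, None, np.inf)
    for g in gammas:
        for h in range(1, h_max + 1):
            sse = 0.0
            for test in folds:
                train = np.setdiff1d(np.arange(X.shape[0]), test)
                B, _ = cr_pls_model(X[train], Y[train], g, h)
                sse += np.sum((Y[test] - X[test] @ B) ** 2)
            rmsev = np.sqrt(sse / Y.shape[1])
            if rmsev < best[2]:
                best = (g, h, rmsev)
    return best   # (gamma, h, RMSEV)
```

For instance, select_by_cv(X, Y, gammas=np.linspace(0, 1, 11), h_max=10) scans a (γ, h) grid of the kind depicted in Figures 1 and 2.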
3.3 Properties
The CR-PLS approach discussed herein stands as a compromise between a model with good explanation ability (RA, γ = 0) and a stable model (PLS, γ = 1). The following properties highlight the idea that, when γ increases, the strategy of analysis investigates more stable directions in the X space, as reflected by the variance of the first latent variable, but the explanation ability of the model decreases, as reflected by the percentage of the total variance of Y explained by the latent variable.
Stability of the model
It can be reflected by the variance var(t_{\gamma}) of the latent component t_{\gamma} determined at each stage. We can prove that var(t_{\gamma}) is an increasing function of γ. When γ increases, the continuum approach investigates more stable directions and avoids latent variables with small variances, which may reflect noise only. It also follows that RA corresponds to the solution with the smallest variance of t_{\gamma}, whereas PLS regression corresponds to the solution with the largest variance of t_{\gamma}.
Explanation ability
It can be reflected by the variance of Y explained by the latent variable t_{\gamma}, which can be written as ExplVar(Y, t_{\gamma}) = \sum_{q} \mathrm{cov}^2(y_q, t_{\gamma}) / \mathrm{var}(t_{\gamma}). We can prove that this is a decreasing function of γ: when γ increases, the explanation ability of the model obtained by the regression of Y upon t_{\gamma} decreases. It follows that RA corresponds to the solution with the best explanation ability, compared with PLS.
Sensitivity to multicollinearity
It can be reflected by the condition index [BKW80] η = λ_1/λ_P, with λ_1 and λ_P respectively the largest and the smallest eigenvalues of (X'X). A large value of η flags the presence of multicollinearity among the X variables, which is likely to lead to an unstable model. As stated above, CR-PLS involves the matrix M_{\gamma}, which rests on the shrunk matrix (1 − γ)(X'X) + γI; the condition index of the latter is given by η_{\gamma} = [(1 − γ)λ_1 + γ]/[(1 − γ)λ_P + γ]. It is easy to prove, by considering the derivative of η_{\gamma}, that η_{\gamma} decreases when γ increases: PLS corresponds to the smallest η_{\gamma} whereas RA corresponds to the largest one.
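As a small numerical illustration (the eigenvalues are hypothetical, chosen only to mimic an ill-conditioned X'X):

```python
lam1, lamP = 150.0, 0.01      # hypothetical extreme eigenvalues of X'X
for g in (0.0, 0.2, 0.5, 1.0):
    eta = ((1 - g) * lam1 + g) / ((1 - g) * lamP + g)
    print(f"gamma = {g:.1f}  ->  eta_gamma = {eta:.1f}")
# gamma = 0.0 gives eta = 15000.0 (RA); gamma = 1.0 gives eta = 1.0 (PLS)
```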
3.4 Link with existing methods
CR-PLS bears some resemblance to the regularization procedure used in ridge regression [HOE62] in the case where Y = [y]. The ridge coefficient, the solution of \hat{y} = X\beta_{ridge(k)}, is defined by \beta_{ridge(k)} = (X'X + kI)^{-1}X'y. We can prove that, in this case, the first step of CR-PLS is tightly linked to ridge regression, as the derivation below shows.
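The link can be made explicit with a short derivation which we add for clarity (it is not spelled out in the paper): with Y = [y], the matrix in (6) has rank one, so its first eigenvector is, up to a scalar,

\[
w_\gamma^{(1)} \;\propto\; \bigl[(1-\gamma)\,X'X + \gamma I\bigr]^{-1} X'y
\;\propto\; \bigl(X'X + kI\bigr)^{-1} X'y \;=\; \beta_{ridge(k)},
\qquad k = \frac{\gamma}{1-\gamma}, \quad 0 \le \gamma < 1.
\]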
There are some advantages of CR-PLS over ridge regression. Among these, we can mention the fact that the prediction model derived from the first latent variable can be improved by extracting additional orthogonal latent variables. Moreover, the strategy of analysis encompasses both the univariate and the multivariate case, and can be viewed as an extension of ridge regression to the multivariate setting.
It is worth mentioning that there are several continuum approaches for linking two data sets. The most popular is the continuum regression proposed by Stone and Brooks [STO90] and extended to the multivariate setting by Brooks and Stone [BRO94]. This procedure also encompasses several methods, including PLS and ridge regression [SUN93]. However, continuum regression is based on a complex iterative algorithm, whereas the CR-PLS approach discussed herein is straightforward and simple. Notwithstanding this simplicity, it was demonstrated on the basis of a case study that, in the univariate setting (one variable to predict), the simple continuum approach outperformed continuum regression, which itself slightly outperforms PLS [QAN05].
CR-PLS also bears some similarities to principal covariates regression, proposed by De Jong and Kiers [JON92], who suggested introducing a tuning parameter that bridges the gap between RA and principal components regression (PCR), whereas, as stressed above, CR-PLS bridges the gap between RA and PLS2. This can be considered as an advantage of CR-PLS over principal covariates regression, because some authors consider that it may not be appropriate to include PCR in a continuum regression family [FAR90].
4 Application
4.1 Epidemiological data
A study involving 158 farms was carried out in France in 2001 to assess the risk factors for post-weaning multi-systemic wasting syndrome (PMWS). The impact of PMWS is great because of the losses due to mortality. Porcine circovirus type 2 (PCV2) is pivotal in this syndrome. The Y table, which contains 3 quantitative variables, gives the percentages of animals (sows, fattening pigs and piglets) seropositive to PCV2. The X table contains 17 variables pertaining to farming features, hygiene and infectious co-factors.
4.2 Explanation of Y from X
If the objective is to identify, from the X variables, the risk factors of the disease reflected by the Y variables, the choice of the optimal method is performed by minimizing the root mean square error of calibration (RMSEC). From Figure 1, which depicts the criterion RMSEC as a function of the number of latent variables introduced in the model and of the tuning parameter γ, it can be seen that several values of γ and of the dimension h_{\gamma} can be chosen to minimize the RMSEC. RA (γ = 0) is more oriented towards the explanation of Y than PLS (γ = 1): only 3 dimensions are needed in RA to achieve the same performance as PLS with 8 dimensions. However, RA appears unstable on the last dimensions, as reflected by the increase of the RMSEC. From Figure 1, we can state that an optimal method to explain Y from X is obtained for γ = 0.2 with h_{\gamma} = 3 dimensions.
4.3 Prediction of Y from X
Another objective is to predict the Y variables for new observations (new farms) that did not take part in the calibration process. The choice of the optimal method is performed by minimizing the root mean square error of validation (RMSEV). In Figure 2, the sensitivity to multicollinearity is reflected by the fact that, when more than one latent variable is introduced into the model, the criterion RMSEV increases, indicating that the prediction ability deteriorates. This is particularly the case for RA, as this criterion jumps when more than 12 latent variables are introduced in the model. It can also be seen that the best method to predict Y from X is PLS (γ = 1) with only one latent variable.
Fig. 1. RMSEC criterion as a function of the number of latent variables introduced in the model and of the tuning parameter γ.

Fig. 2. Prediction ability assessed by RMSEV, as a function of γ and of the number of latent variables introduced in the model.

5 Conclusion
Continuum Redundancy PLS regression is a continuum method to be used for the purpose of exploring and modelling the relationships between X and Y. The key feature of this approach is the shrinkage of the variance-covariance matrix (X'X) towards the identity matrix, which leads to the exploration of a continuum of methods ranging from redundancy analysis to PLS. Moreover, the method is easy to understand because there is a global criterion to maximise and because the solutions are derived from the eigenanalysis of a matrix.
CR-PLS stands as a compromise between a model with good explanation ability (RA) and a stable model (PLS). CR-PLS with a small value of γ can lead to latent variables with small variance (likely to be linked to noise) but with good explanation ability. CR-PLS with a large value of γ can lead to a biased solution, less linked to Y but more stable (less sensitive to multicollinearity).
References
[BKW80] Belsley, D.A., Kuh, E. and Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York (1980)
[BRO94] Brooks, R.J. and Stone, M.: Joint continuum regression for multiple predictands. Journal of the American Statistical Association, 89, 1374–1377 (1994)
[DAV82] Davies, P.T. and Tso, M.K.S.: Procedures for reduced-rank regression. Applied Statistics, 31, 244–255 (1982)
[JON92] De Jong, S. and Kiers, H.A.L.: Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155–164 (1992)
[FAR90] Farebrother, R.W.: Discussion of the paper by Stone and Brooks. Journal of the Royal Statistical Society, Series B, 52, 263 (1990)
[HOE62] Hoerl, A.E.: Application of ridge analysis to regression problems. Chemical Engineering Progress, 58, 54–59 (1962)
[MUL81] Muller, K.E.: Relationships between redundancy analysis, canonical correlation and multivariate regression. Psychometrika, 46, 139–142 (1981)
[QAN05] Qannari, E.M. and Hanafi, M.: A simple continuum regression approach. Journal of Chemometrics, 19, 387–392 (2005)
[RAO64] Rao, C.R.: The use and interpretation of principal component analysis in applied research. Sankhyā A, 26, 329–358 (1964)
[SAB84] Sabatier, R.: Quelques généralisations de l'analyse en composantes principales de variables instrumentales. Statistique et Analyse des Données, 9, 75–103 (1984)
[STO74] Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147 (1974)
[STO90] Stone, M. and Brooks, R.J.: Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society, Series B, 52, 237–269 (1990)
[SUN93] Sundberg, R.: Continuum regression and ridge regression. Journal of the Royal Statistical Society, Series B, 55, 653–659 (1993)
[TEN98] Tenenhaus, M.: La régression PLS. Théorie et pratique. Technip, Paris (1998)
[WOL77] Van Den Wollenberg, A.: Redundancy analysis: an alternative for canonical correlation analysis. Psychometrika, 42, 207–219 (1977)
[WOL66] Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis. Academic Press, New York (1966)