CANONICAL CORRELATION ANALYSIS
V.K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi -110 012
vkbhatia@iasri.res.in
A canonical correlation is the correlation of two canonical (latent) variables, one
representing a set of independent variables, the other a set of dependent variables. Each set
may be considered a latent variable based on measured indicator variables in its set. The
canonical correlation is optimized such that the linear correlation between the two latent
variables is maximized. Whereas multiple regression is used for many-to-one relationships,
canonical correlation is used for many-to-many relationships. There may be more than one
such linear correlation relating the two sets of variables, with each such correlation
representing a different dimension by which the independent set of variables is related to the
dependent set. The purpose of canonical correlation is to explain the relation of the two sets of
variables, not to model the individual variables.
Analogous with ordinary correlation, canonical correlation squared is the percent of variance
in the dependent set explained by the independent set of variables along a given dimension
(there may be more than one). In addition to asking how strong the relationship is between
two latent variables, canonical correlation is useful in determining how many dimensions are
needed to account for that relationship. Canonical correlation finds the linear combination of
variables that produces the largest correlation with the second set of variables. This linear
combination, or "root," is extracted and the process is repeated for the residual data, with the
constraint that the second linear combination of variables must not correlate with the first one.
The process is repeated until a successive linear combination is no longer significant.
Canonical correlation is a member of the multivariate general linear hypothesis (MGLH) family
and shares many of the assumptions of multiple regression, such as linearity of relationships,
homoscedasticity (same level of relationship for the full range of the data), interval or
near-interval data, untruncated variables, proper specification of the model, lack of high
multicollinearity, and multivariate normality for purposes of hypothesis testing.
Often in applied research, scientists encounter variables of large dimensions and are faced
with the problem of understanding dependency structures, reduction of dimensionalities,
construction of a subset of good predictors from the explanatory variables, etc. Canonical
Correlation Analysis (CCA) provides us with a tool to attack these problems. However, its
appeal, and hence its motivation, seems to differ between theoretical statisticians and social
scientists. We deal here with the various motivations of CCA mentioned above and the related
statistical inference procedures.
Dependency between Two sets of Stochastic Variables
Let X: p×1 be a random vector partitioned into two subvectors X1: p1×1 and X2: p2×1, p1 ≤ p2,
p1+p2 = p. Assume EX = 0. In order to study the dependency between X1 and X2, we seek to
evaluate the maximum possible correlation between any two arbitrary linear compounds
U = α'X1 and V = γ'X2, subject to the normalizations Var(U) = α'Σ11α = 1 and Var(V) = γ'Σ22γ = 1,
where
               ⎛ Σ11  Σ12 ⎞
Disp.(X) = Σ = ⎜          ⎟
               ⎝ Σ21  Σ22 ⎠

is partitioned according to that of X as above.
It follows that this maximum correlation, say ρ1, is given by the positive square root of the
largest of the eigenroots ρ1² ≥ ρ2² ≥ ... ≥ ρr² ≥ ... ≥ ρp1², of Σ12Σ22⁻¹Σ21 in the metric of
Σ11, i.e. of Σ11⁻¹Σ12Σ22⁻¹Σ21. α and γ are then given by α1, γ1 such that α1'Σ11α1 = γ1'Σ22γ1 = 1 and
⎛ −ρ1Σ11    Σ12   ⎞ ⎛ α1 ⎞
⎜                 ⎟ ⎜    ⎟ = 0                                        (2.1)
⎝   Σ21   −ρ1Σ22  ⎠ ⎝ γ1 ⎠
Alternatively, α and γ may be obtained as the eigenvector solutions, subject to the same
normalizations, from

(Σ11⁻¹Σ12Σ22⁻¹Σ21 − ρ²I)α = 0,   (Σ22⁻¹Σ21Σ11⁻¹Σ12 − ρ²I)γ = 0        (2.2)

Further it follows that

α = Σ11⁻¹Σ12γ/ρ and γ = Σ22⁻¹Σ21α/ρ,                                  (2.3)

so that one needs to solve only one of the two equations in (2.2).
ρ1 is called the (first) canonical correlation between X1 and X2, and (U1, V1) = (α1'X1, γ1'X2) the
pair of first canonical variates. If Σii, i = 1 or 2, happens to be singular, one can use a g-inverse
Σii⁻ in place of Σii⁻¹ above.
Note that p1 = p2 = 1 ⇒ ρ1 = the usual Pearson's product moment correlation coefficient between
the scalar random variables X1 and X2; p1 = 1, p2 > 1 ⇒ ρ1 = the multiple correlation coefficient
between the scalar X1 and the vector X2. Sample analogues are trivially defined.
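As a numerical check on the definitions above, the first canonical correlation can be computed directly from the sample covariance blocks with numpy; the simulated data, dimensions, and variable names below are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: X1 (p1 = 2) and X2 (p2 = 3) sharing one latent signal z.
n = 500
z = rng.standard_normal((n, 1))
X1 = z @ rng.standard_normal((1, 2)) + 0.5 * rng.standard_normal((n, 2))
X2 = z @ rng.standard_normal((1, 3)) + 0.5 * rng.standard_normal((n, 3))

# Partitioned sample covariance matrix Sigma = [[S11, S12], [S21, S22]].
S = np.cov(np.hstack([X1, X2]), rowvar=False)
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

# Eigenvalues of S11^-1 S12 S22^-1 S21 are the squared canonical
# correlations rho_1^2 >= rho_2^2, as in equation (2.2).
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
eigvals, eigvecs = np.linalg.eig(M)
top = np.argmax(eigvals.real)
rho1 = np.sqrt(eigvals.real[top])

# First weight vectors: alpha_1 from the eigenvector, gamma_1 via (2.3),
# each scaled so the canonical variates have unit variance.
alpha = eigvecs.real[:, top]
alpha /= np.sqrt(alpha @ S11 @ alpha)
gamma = np.linalg.solve(S22, S21) @ alpha / rho1

# The correlation of U = alpha'X1 and V = gamma'X2 equals rho_1.
U, V = X1 @ alpha, X2 @ gamma
print(round(rho1, 4), round(np.corrcoef(U, V)[0, 1], 4))
```

The two printed numbers agree, confirming that the eigenvalue route and the direct correlation of the canonical variates give the same ρ1.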
Reduction of Dimensionality
In case p1 or p2 is large, it may become necessary to achieve a reduction of dimensionality, but
without sacrificing much of the dependency between X1 and X2. We then seek further linear
combinations Ui = αi'X1, Vi = γi'X2, i = 1, 2, ..., r+1, such that Ur+1 and Vr+1 are maximally
correlated among all linear combinations subject to having unit variances and further subject
to being uncorrelated with U1, V1, ..., Ur, Vr. It turns out that Corr(Ur+1, Vr+1) = ρr+1, and
αr+1, γr+1 are simply solutions of (2.1) with ρ1 replaced by ρr+1.
When ρk+1 is judged to be insignificant compared to zero for some k+1, one may then retain
only (Ui, Vi), i = 1, 2, ..., k, for further analysis in place of the original, presumably much
larger, p1 + p2 variables. Note, however, that information on all p1 + p2 variables in X1 and
X2 is still needed even to construct these 2k new variables.
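The successive canonical pairs described above can be verified numerically: the cross-correlation matrix of the U's with the V's is diagonal with the ρi on the diagonal. A minimal numpy sketch on simulated data (names, dimensions, and the simulated two-signal dependency are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two latent signals shared by both sets, so two sizeable canonical
# correlations and a third near zero.
n = 1000
Z = rng.standard_normal((n, 2))
X1 = Z @ rng.standard_normal((2, 3)) + rng.standard_normal((n, 3))
X2 = Z @ rng.standard_normal((2, 4)) + rng.standard_normal((n, 4))

p1 = X1.shape[1]
S = np.cov(np.hstack([X1, X2]), rowvar=False)
S11, S12 = S[:p1, :p1], S[:p1, p1:]
S21, S22 = S[p1:, :p1], S[p1:, p1:]

# All p1 squared canonical correlations and weight vectors at once.
eigvals, A = np.linalg.eig(np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21))
order = np.argsort(eigvals.real)[::-1]
rho = np.sqrt(np.clip(eigvals.real[order], 0, None))
A = A.real[:, order]

U, V = np.empty((n, p1)), np.empty((n, p1))
for i in range(p1):
    a = A[:, i] / np.sqrt(A[:, i] @ S11 @ A[:, i])  # unit-variance U_i
    g = np.linalg.solve(S22, S21) @ a / rho[i]      # gamma_i via (2.3)
    U[:, i], V[:, i] = X1 @ a, X2 @ g

# Cross-correlations of the U's with the V's: the diagonal recovers
# rho_1 >= rho_2 >= rho_3, and off-diagonal entries vanish, i.e. each
# successive pair is uncorrelated with the earlier ones.
C = np.corrcoef(U.T, V.T)[:p1, p1:]
print(np.round(np.diag(C), 3))
print(np.round(rho, 3))
```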
Canonical Correlation in SPSS
o Canonical correlation is part of MANOVA in SPSS, in which one has to refer to one set of
variables as "dependent" and the other as "covariates." It is available only in syntax. The
command syntax method is as follows, where set1 and set2 are variable lists:
MANOVA set1 WITH set2 /DISCRIM ALL ALPHA(1) /PRINT SIGNIF(MULTIV UNIV EIGEN DIMENR).
Note that one cannot save canonical scores in this method.
o Canonical correlation has to be run in syntax, not from the SPSS menus. If you just want
to create a dataset with canonical variables, SPSS supplies, as part of the Advanced
Statistics module, the CANCORR macro located in the file Canonical correlation.sps,
usually in the same directory as the SPSS main program. Open the syntax window with
File, New, Syntax. Enter this:
INCLUDE 'c:\Program Files\SPSS\Canonical correlation.sps'.
CANCORR SET1=varlist/
SET2=varlist/.
where "varlist" is one of two lists of numeric variables. Output will be saved to a file called
"cc_tmp2.sav," which will contain the canonical scores as new variables along with the
original data file. These scores will be labeled s1_cv1 and s1_cv2, s2_cv1 and s2_cv2, and the
like, standing for the scores on the two canonical variables associated with each canonical
correlation. The macro will create two canonical variables for a number of canonical
correlations equal to the smaller number of variables in SET1 or SET2.
o OVERALS, which is part of the SPSS Categories module, computes nonlinear canonical
correlation analysis on two or more sets of variables.
Some Comments on the Canonical Correlations
• There could be a situation where some of the variables have high structure correlations even
though their canonical weights are near zero. This could happen because the weights are
partial coefficients whereas the structure correlations (canonical factor loadings) are not: if
a given variable shares variance with other independent variables entered in the linear
combination of variables used to create a canonical variable, its canonical coefficient
(weight) is computed based on the residual variance it can explain after controlling for
these variables. If an independent variable is totally redundant with another independent
variable, its partial coefficient (canonical weight) will be zero. Nonetheless, such a
variable might have a high correlation with the canonical variable (that is, a high structure
coefficient). In summary, the canonical weights have to do with the unique contributions
of an original variable to the canonical variable, whereas the structure correlations have to
do with the simple, overall correlation of the original variable with the canonical variable.
• Canonical correlation is not a measure of the percent of variance explained in the original
variables. The square of the structure correlation is the percent of the variance in a given
original variable accounted for by a given canonical variable on a given (usually the first)
canonical correlation. Note that the average percent of variance explained in the original
variables by a canonical variable (the mean of the squared structure correlations for the
canonical variable) is not at all the same as the canonical correlation, which has to do with
the correlation between the weighted sums of the two sets of variables. Put another way,
the canonical correlation does not tell us how much of the variance in the original
variables is explained by the canonical variables. Instead, that is determined on the basis
of the squares of the structure correlations.
• Canonical coefficients can be used to explain with which original variables a canonical
correlation is predominantly associated. The canonical coefficients are standardized
coefficients and (like beta weights in regression) their magnitudes can be compared.
Looking at the columns in SPSS output which list the canonical coefficients as columns
and the variables in a set of variables as rows, some researchers simply note variables with
the highest coefficients to determine which variables are associated with which canonical
correlations and use this as the basis for inducing the meaning of the dimension
represented by the canonical correlation.
However, Levine (1977) argues against the procedure above on the ground that the canonical
coefficients may be subject to multicollinearity, leading to incorrect judgments. Also, because
of suppression, a canonical coefficient may even have a different sign compared to the
correlation of the original variable with the canonical variable. Therefore, instead, Levine
recommends interpreting the relations of the original variables to a canonical variable in terms
of the correlations of the original variables with the canonical variables - that is, by structure
coefficients. This is now the standard approach.
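The contrast between canonical weights and structure coefficients is easy to see numerically when two predictors are nearly redundant; a minimal numpy sketch on simulated data (all names and the simulated collinearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# x1b is almost a copy of x1a, so the two predictors are nearly redundant.
n = 2000
z = rng.standard_normal(n)
x1a = z + 0.3 * rng.standard_normal(n)
x1b = x1a + 0.05 * rng.standard_normal(n)
X1 = np.column_stack([x1a, x1b])
X2 = np.column_stack([z + 0.5 * rng.standard_normal(n),
                      rng.standard_normal(n)])

p1 = X1.shape[1]
S = np.cov(np.hstack([X1, X2]), rowvar=False)
S11, S12 = S[:p1, :p1], S[:p1, p1:]
S21, S22 = S[p1:, :p1], S[p1:, p1:]

# First canonical weight vector for the X1 set.
eigvals, A = np.linalg.eig(np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21))
top = np.argmax(eigvals.real)
alpha = A.real[:, top]
alpha /= np.sqrt(alpha @ S11 @ alpha)

# Weights are partial coefficients; structure correlations (loadings) are
# the simple correlations of each original variable with the variate.
U = X1 @ alpha
loadings = np.array([np.corrcoef(X1[:, j], U)[0, 1] for j in range(p1)])
print("canonical weights     :", np.round(alpha, 3))
print("structure correlations:", np.round(loadings, 3))
```

Because x1a and x1b are nearly interchangeable, the two weights split the same contribution more or less arbitrarily, while both structure correlations stay uniformly high, which is exactly the behaviour the bullet above describes.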
Redundancy in Canonical Correlation Analysis
Redundancy is the percent of variance in one set of variables accounted for by the variate of
the other set. The researcher wants high redundancy, indicating that the independent variate
accounts for a high percent of the variance in the dependent set of original variables. Note this
is not the canonical correlation squared, which is the percent of variance in the dependent
variate accounted for by the independent variate. The redundancy analysis section of SAS
output looks like that below, where rows 1 and 2 refer to the first and second canonical
correlations extracted for these data. Italicized comments are not part of SAS output.
Canonical Redundancy Analysis
Raw variance tables are reported by SAS but are omitted here because redundancy is
normally interpreted using the standardized tables.
Standardized Variance of the dependent variables Explained by

                Their Own                          The Opposite
             Canonical Variables                Canonical Variables
          Proportion   Cumulative    Canonical    Proportion   Cumulative
                       Proportion    R-Squared                 Proportion
     1      0.2394       0.2394        0.4715       0.1129       0.1129
     2      0.3518       0.5912        0.0052       0.0018       0.1147
The table above shows that, for the first canonical correlation, although the independent
canonical variable explains 47.15% of the variance in the dependent canonical variable, the
independent canonical variable is able to predict only 11.29% of the variance in the
individual original dependent variables. Also, the dependent canonical variable predicts only
23.94% of the variance in the individual original dependent variables. Similar statements
could be made about the second canonical correlation (row 2).
Canonical Redundancy Analysis
Standardized Variance of the independent variables Explained by

                Their Own                          The Opposite
             Canonical Variables                Canonical Variables
          Proportion   Cumulative    Canonical    Proportion   Cumulative
                       Proportion    R-Squared                 Proportion
     1      0.5000       0.5000        0.4715       0.2357       0.2357
     2      0.5000       1.0000        0.0052       0.0026       0.2383
The table above repeats the first, except for comparisons involving the independent canonical
variable.
Canonical Redundancy Analysis
Squared Multiple Correlations Between the dependent variables and
the First 'M' Canonical Variables of the independent variables
            M = 1      M = 2
     Y1    0.1510     0.1526
     Y2    0.0280     0.0305
     Y3    0.1596     0.1610
In the table above, the columns represent the canonical correlations and the rows represent
the original dependent variables, three in this case. The R-squareds are the percent of
variance in each original dependent variable explained by the independent canonical
variables. A similar table for the independent variables and the dependent canonical
variables is also output by SAS but is not reproduced here.
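The redundancy figures in tables like those above can be reproduced from first principles: the proportion of variance explained by the opposite variate equals the proportion explained by the set's own variate (the mean squared structure correlation) multiplied by the canonical R-squared. A sketch in numpy on simulated data (names and data are illustrative assumptions, not the SAS example above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated dependent set Y (3 variables) and independent set X (2 variables).
n = 1000
z = rng.standard_normal(n)
Y = np.column_stack([z + rng.standard_normal(n),
                     z + rng.standard_normal(n),
                     rng.standard_normal(n)])
X = np.column_stack([z + 0.7 * rng.standard_normal(n),
                     rng.standard_normal(n)])

py = Y.shape[1]
S = np.cov(np.hstack([Y, X]), rowvar=False)
Syy, Syx = S[:py, :py], S[:py, py:]
Sxy, Sxx = S[py:, :py], S[py:, py:]

# First canonical correlation and the dependent-set weight vector.
eigvals, A = np.linalg.eig(np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Sxy))
top = np.argmax(eigvals.real)
rho2 = eigvals.real[top]                      # canonical R-squared
a = A.real[:, top]
a /= np.sqrt(a @ Syy @ a)
Uy = Y @ a                                    # dependent canonical variate

# Variance of Y explained by its own variate = mean squared structure
# correlation; redundancy multiplies this by the canonical R-squared.
loadings = np.array([np.corrcoef(Y[:, j], Uy)[0, 1] for j in range(py)])
own = np.mean(loadings ** 2)
redundancy = own * rho2
print(round(own, 4), round(rho2, 4), round(redundancy, 4))
```

The three printed values correspond to the "Their Own" proportion, the Canonical R-Squared, and the "Opposite" proportion columns of the SAS redundancy table.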
Nonlinear Canonical Correlation (OVERALS)
Nonlinear canonical correlation analysis corresponds to categorical canonical correlation
analysis with optimal scaling. The OVERALS procedure in SPSS (part of SPSS Categories)
implements nonlinear canonical correlation. Independent variables can be nominal, ordinal, or
interval, and there can be more than two sets of variables (more than one independent set and
one dependent set). Whereas ordinary canonical correlation maximizes correlations between
the variable sets, in OVERALS the sets are compared to an unknown compromise set defined
by the object scores.
OVERALS uses optimal scaling, which quantifies categorical variables and then treats them as
numerical variables, including applying nonlinear transformations to find the best-fitting
model. For nominal variables, the order of the categories is not retained but values are created
for each category such that goodness of fit is maximized. For ordinal variables, order is
retained and values maximizing fit are created. For interval variables, order is retained as are
equal distances between values.
Obtain OVERALS from the SPSS menu by selecting Analyze, Data Reduction, Optimal
Scaling; Select Multiple sets; Select either Some variable(s) not multiple nominal or All
variables multiple nominal; click Define; define at least two sets of variables; define the value
range and measurement scale (optimal scaling level) for each selected variable. SPSS output
includes frequencies, centroids, iteration history, object scores, category quantifications,
weights, component loadings, single and multiple fit, object scores plots, category coordinates
plots, component loadings plots, category centroids plots, and transformation plots.
Tip: To minimize output, use the Automatic Recode facility on the Transform menu to create
consecutive categories beginning with 1 for variables treated as nominal or ordinal. To
minimize output, for each variable scaled at the numerical (integer) level, subtract the smallest
observed value from every value and add 1.
Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given
data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting
the model to the given data. Therefore, it is particularly appropriate to employ
cross-validation, developing the model for a training dataset and then assessing its
generalizability by running the model on a separate validation dataset.
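The cross-validation recommendation can be sketched with plain numpy: fit the first pair of canonical weights on a training half, then check the correlation those fixed weights induce on a held-out half (the split, simulated data, and helper name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated sets sharing a single latent signal z.
n = 1000
z = rng.standard_normal(n)
X1 = np.column_stack([z, rng.standard_normal(n)]) + 0.5 * rng.standard_normal((n, 2))
X2 = np.column_stack([z, rng.standard_normal(n)]) + 0.5 * rng.standard_normal((n, 2))

def first_canonical_weights(X1, X2):
    """First canonical weight pair (alpha, gamma) from sample covariances."""
    p1 = X1.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)
    S11, S12 = S[:p1, :p1], S[:p1, p1:]
    S21, S22 = S[p1:, :p1], S[p1:, p1:]
    eigvals, A = np.linalg.eig(np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21))
    top = np.argmax(eigvals.real)
    alpha = A.real[:, top]
    alpha /= np.sqrt(alpha @ S11 @ alpha)
    gamma = np.linalg.solve(S22, S21) @ alpha / np.sqrt(eigvals.real[top])
    return alpha, gamma

# Develop the model on the training half only.
train, valid = slice(0, n // 2), slice(n // 2, n)
alpha, gamma = first_canonical_weights(X1[train], X2[train])

# Apply the training weights to both halves; a large drop on the held-out
# half relative to the training value would signal overfitting.
r_train = np.corrcoef(X1[train] @ alpha, X2[train] @ gamma)[0, 1]
r_valid = np.corrcoef(X1[valid] @ alpha, X2[valid] @ gamma)[0, 1]
print(round(r_train, 3), round(r_valid, 3))
```

With a genuine shared signal, the validation correlation stays close to the training one; with many variables and little real dependency, the gap between the two is the overfitting the warning above describes.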
The SPSS manual notes, "If each set contains one variable, nonlinear canonical correlation
analysis is equivalent to principal components analysis with optimal scaling. If each of these
variables is multiple nominal, the analysis corresponds to homogeneity analysis. If two sets of
variables are involved and one of the sets contains only one variable, the analysis is identical
to categorical regression with optimal scaling."
Reference
Levine, Mark S. (1977). Canonical Analysis and Factor Comparison. Thousand Oaks, CA:
Sage Publications, Quantitative Applications in the Social Sciences Series, No. 6.