Correlation

Overview
Correlation is a bivariate measure of association (strength) of the relationship between two variables. It varies from 0
(random relationship) to 1 (perfect linear relationship) or -1 (perfect negative linear relationship). It is usually reported in
terms of its square (r²), interpreted as percent of variance explained. For instance, if r² is .25, then the independent
variable is said to explain 25% of the variance in the dependent variable. In SPSS, select Analyze, Correlate, Bivariate;
check Pearson.
There are several common pitfalls in using correlation. Correlation is symmetrical, not providing evidence of which way
causation flows. If other variables also cause the dependent variable, then any covariance they share with the given
independent variable in a correlation may be falsely attributed to that independent. Also, to the extent that there is a
nonlinear relationship between the two variables being correlated, correlation will understate the relationship. Correlation
will also be attenuated to the extent there is measurement error, including use of sub-interval data or artificial truncation
of the range of the data. Correlation can also be a misleading average if the relationship varies depending on the value of
the independent variable ("lack of homoscedasticity"). An, of course, atheoretical or post-hoc running of many
correlations runs the risk that 5% of the coefficients may be true by chance alone.
Beside Pearson correlation (r), the most common type, there are other special types of correlation to handle the special
characteristics of such types of variables as dichotomies, and there are other measures of association for nominal and
ordinal variables. Regression procedures produce multiple correlations, R, which is the correlation of multiple
independent variables with a single dependent. Also, there is partial correlation, which is the correlation of one variable
with another, controlling both the given variable and the dependent for a third or additional variables. And there is part
correlation, which is the correlation of one variable with another, controlling only the given variable for a third or
additional variables. See the separate discussions of these topics.
Key Concepts and Terms



Deviation. A deviation is a value minus its mean: x - meanx. In SPSS, select Analyze, Correlate, Bivariate; click
Options; check Cross-product deviations and covariances.
Covariance is a measure of how much the deviations of two variables match. The equation is:
cov(x,y) = SUM[(x - meanx)(y - meany)] / (n - 1). When the match is best, high positive deviations in x will be matched
with high positive deviations in y, high negatives with high negatives, and so on. Such a best-case match-up will
result in the highest possible sum of cross-products in the formula above. In SPSS, select Analyze, Correlate, Bivariate; click
Options; check Cross-product deviations and covariances.
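Although the original works through SPSS menus, a minimal Python/NumPy sketch (the library and the tiny data set are assumptions for illustration, not part of the original) shows the cross-product deviations and the sample covariance:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Deviations: each value minus its mean
dx, dy = x - x.mean(), y - y.mean()

# Sum of cross-product deviations (the quantity requested above in SPSS)
cross_products = np.sum(dx * dy)

# Sample covariance divides the sum by n - 1
cov_xy = cross_products / (len(x) - 1)
print(cross_products, cov_xy, np.cov(x, y, ddof=1)[0, 1])  # last two values agree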
Standardization. One cannot easily compare the covariance of one pair of variables with the covariance of
another pair of variables because variables differ in magnitude (mean value) and dispersion (standard deviation).
Standardization is the process of making variables comparable in magnitude and dispersion: one subtracts the
mean from each variable and divides by its standard deviation, giving all variables a mean of 0 and a standard
deviation of 1.


Correlation is the covariance of standardized variables - that is, of variables after you make them comparable by
subtracting the mean and dividing by the standard deviation. Standardization is built into correlation and need not
be requested explicitly in SPSS or other programs. Correlation is the ratio of the observed covariance of two
standardized variables, divided by the highest possible covariance when their values are arranged in the best
possible match by order. When the observed covariance is as high as the possible covariance, the correlation will
have a value of 1, indicating perfectly matched order of the two variables. A value of -1 is perfect negative
covariation, matching the highest positive values of one variable with the highest negative values of the other. A
correlation value of 0 indicates a random relationship by order between the two variables.
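Continuing the same hypothetical NumPy example, the sketch below verifies that Pearson's r is simply the covariance of the two variables after standardization:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Standardize: subtract the mean, divide by the standard deviation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Covariance of the standardized variables equals Pearson's r
r_from_z = np.sum(zx * zy) / (len(x) - 1)
print(r_from_z, np.corrcoef(x, y)[0, 1])  # the two values agree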
Pearson's r: This is the usual measure of correlation, sometimes called product-moment correlation. Pearson's r is
a measure of association which varies from -1 to +1, with 0 indicating no relationship (random pairing of values)
and 1 indicating perfect relationship, taking the form, "The more the x, the more the y, and vice versa." A value of
-1 is a perfect negative relationship, taking the form "The more the x, the less the y, and vice versa." In SPSS,
select Analyze, Correlate, Bivariate; check Pearson (the default).
Note: In older works, predating the prevalence of computers, special computation formulas were used for
computation of correlation by hand. For certain types of variables, notably dichotomies, there were computational
formulas which differed one from another (ex., phi coefficient for two dichotomies, point-biserial correlation for
an interval with a dichotomy). Today, however, SPSS will calculate the exact correlation regardless of whether the
variables are continuous or dichotomous.
Significance of correlation coefficients is discussed below in the frequently asked questions section.


Coefficient of determination, r²: The coefficient of determination is the square of the Pearson correlation
coefficient. It represents the percent of the variance in the dependent variable explained by the independent. Of
course, since correlation is bidirectional, r² is also the percent of variance in the independent accounted for by the dependent.
That is, the researcher must posit the direction of causation, if any, based on considerations external to correlation,
which, in itself, cannot demonstrate causality.
Correlation for Dichotomies and Ordinal Variables. Variations on r have been devised for binary and ordinal
data. Some studies suggest use of these variant forms of correlation rarely affects substantive research
conclusions. Rank correlation is nonparametric and does not assume normal distribution. It is less sensitive to
outliers.
Ordinal
o See the section on ordinal association for fuller discussion.
o Spearman's rho: The most common correlation for use with two ordinal variables or an ordinal and an
interval variable. Rho for ranked data equals Pearson's r for ranked data. Note SPSS will assign the mean
rank to tied values. The formula for Spearman's rho is:
rho = 1 - [6*SUM(d²)] / [n(n² - 1)]
where d is the difference in ranks. In SPSS, choose Analyze, Correlate, Bivariate; check Spearman's rho. (A Python check of this formula appears after this list.)
o
Kendall's tau: Another common correlation for use with two ordinal variables or an ordinal and an
interval variable. Prior to computers, rho was preferred to tau due to computational ease. Now that
computers have rendered calculation trivial, tau is generally preferred. Partial Kendall's tau is also
available as an ordinal analog to partial Pearsonian correlation. In SPSS, choose Analyze, Correlate,
Bivariate; check Kendall's tau.
o Polyserial correlation: Used when an interval variable is correlated with a dichotomy or an ordinal variable
which is assumed to reflect an underlying continuous variable. It is interpreted like Pearson's r. The chi-square test of polyserial correlation and the associated p value test the assumption of bivariate normality
required by the polyserial coefficient; if p > .05, the researcher fails to reject bivariate normality and may
treat the assumption as met. Polyserial correlation is supported by the PRELIS module of LISREL, a
structural equation modeling (SEM) package distributed by Scientific Software International.
o Polychoric correlation: Used when both variables are dichotomous or ordinal but both are assumed to
reflect underlying continuous variables. That is, polychoric correlation extrapolates what the categorical
variables' distributions would be if continuous, adding tails to the distribution. As such it is an estimate
strongly based on the assumption of an underlying continuous bivariate normal distribution. Polychoric
correlation is supported by SAS PROC FREQ and by the PRELIS module of LISREL, a structural
equation modeling (SEM) package distributed by Scientific Software International. It has been used
extensively in SEM applications as well as in assessing inter-rater agreement on instruments. Tetrachoric
correlation is polychoric correlation for the case of dichotomous variables.
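As a check on the Spearman formula given above, a brief Python sketch (SciPy and the made-up scores are assumptions, not part of the original SPSS instructions) compares the rank-difference formula with scipy.stats.spearmanr; the simple formula holds exactly only when there are no ties:

import numpy as np
from scipy.stats import spearmanr, rankdata

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([12, 25, 22, 48, 51, 70])   # no tied values

d = rankdata(x) - rankdata(y)            # differences in ranks
n = len(x)
rho_formula = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))

print(rho_formula, spearmanr(x, y)[0])   # identical when there are no ties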
Dichotomies
o
Point-biserial correlation: Pearson's r for a true dichotomy and continuous variable is the same as point-biserial correlation, which is recommended when an interval variable is correlated with a true dichotomous
variable. Thus in one sense it is true that a dichotomous or dummy variable can be used "like a continuous
variable" in ordinary Pearson correlation. Special formulas for point-biserial correlation in textbooks are
for hand computation; point-biserial correlation is the same as Pearson correlation when applied to a
dichotomy and a continuous variable. Thus, when one computes a Pearson correlation under the
assumption that one has a true dichotomy and a continuous variable, one is actually computing a point-biserial correlation. (A Python sketch after this list illustrates this for point-biserial correlation and for phi.)
However, when the continuous variable is ordered perfectly from low to high, then even when the
dichotomy is also ordered as perfectly as possible to match low to high, r will be less than 1.0 and
therefore resulting r's must be interpreted accordingly. Specifically, r will have a maximum of 1.0 only for
the datasets with only two cases, and will have a maximum correlation around .85 even for large datasets,
when the independent is normally distributed. The value of r may approach 1.0 when the independent
variable is bimodal and the dichotomy is a 50/50 split. Unequal splits in the dichotomy and curvilinearity
in the continuous variable will both further depress the maximum possible correlation even under perfect
ordering. Moreover, if the dichotomy represents a true underlying continuum, correlation will be
attenuated.
o Biserial correlation: Used when an interval variable is correlated with a dichotomous variable which
reflects an underlying continuous variable. Biserial correlation will always be greater than the
corresponding point-biserial correlation. Biserial correlation can be greater than 1.0. Biserial correlation is
rarely used any more, with polyserial/polychoric correlation now being preferred. Biserial correlation is
not supported by SPSS but is available in SAS as a macro.
o Rank biserial correlation: Used when an ordinal variable is correlated with a dichotomous variable. Rank
biserial correlation is not supported by SPSS but is available in SAS as a macro.
o Phi: Used when both variables are dichotomies. Special formulas in textbooks are for hand computation;
phi is the same as Pearson correlation for two dichotomies in SPSS correlation output, which uses exact
algorithms. Alternatively, in SPSS, select Analyze, Descriptive Statistics, Crosstabs; click Statistics; check
Phi and Cramer's V.
o Tetrachoric correlation: Used when both variables are dichotomies which are assumed to represent
underlying bivariate normal distributions, as might be the case when a dichotomous test item is used to
measure some dimension of achievement. Tetrachoric correlation is sometimes used in structural equation
modeling (SEM) during the data preparation phase of tailoring the input correlation matrix and is
computed by PRELIS, companion software to LISREL, a SEM package distributed by Scientific Software
International. That is, LISREL, EQS, and other SEM packages often default to estimating tetrachoric
correlation instead of Pearsonian correlation for correlations involving dichotomies. Tetrachoric
correlations estimated by various SEM packages can differ markedly depending on the method of
estimation (ex., serial vs. simultaneous). SAS PROC FREQ computes tetrachoric correlation. An SPSS
macro is available for computing tetrachoric correlation.
Note that tetrachoric correlation matrices in SEM often provide very inflated chi-square values and
underestimated standard errors of estimates due to larger variability than Pearson's r. Moreover, tetrachoric
correlation can yield a nonpositive definite correlation matrix because eigenvalues may be negative
(reflecting violation of normality, sampling error, outliers, or multicollinearity of variables). These
problems may lead the researcher away from SEM altogether, in favor of analysis using logit or probit
regression.
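As noted in the point-biserial and phi entries above, these coefficients are just Pearson's r applied to dichotomous data. A hedged Python/SciPy sketch (made-up data, not from the original) demonstrates both:

import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Point-biserial: a true dichotomy against a continuous variable
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score = np.array([4.1, 5.0, 3.8, 4.6, 6.2, 5.9, 7.1, 6.5])
print(pointbiserialr(group, score)[0], pearsonr(group, score)[0])   # identical values

# Phi: two dichotomies; the 2x2-table formula matches Pearson's r on the 0/1 data
a = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
b = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])
n11 = np.sum((a == 1) & (b == 1)); n10 = np.sum((a == 1) & (b == 0))
n01 = np.sum((a == 0) & (b == 1)); n00 = np.sum((a == 0) & (b == 0))
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
print(phi, pearsonr(a, b)[0])   # identical values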


Correlation ratio, eta. Eta, the coefficient of nonlinear correlation, known as the correlation ratio, is discussed in
the section on analysis of variance. Eta-squared is the ratio of the between-groups sum of squares to the total sum of squares in
analysis of variance, and eta is its square root. The extent to which eta is greater than r is an estimate of the extent to which the data
relationship is nonlinear. In SPSS, select Analyze, Compare Means, Means; click Options; check ANOVA table
and eta. Eta is also computed in Analyze, General linear model, Multivariate; and elsewhere in SPSS.
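As an illustration of this ANOVA-based definition, a short Python sketch (hypothetical groups; NumPy assumed) computes eta-squared as the between-groups sum of squares over the total sum of squares, and eta as its square root:

import numpy as np

# Scores broken out by category of the independent variable (made-up data)
groups = [np.array([3.0, 4.0, 5.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([9.0, 10.0, 12.0])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_total = np.sum((all_scores - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

eta_squared = ss_between / ss_total     # proportion of variance explained
eta = np.sqrt(eta_squared)
print(eta_squared, eta)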
The coefficient of intraclass correlation, r: This ANOVA-based type of correlation measures the relative
homogeneity within groups in ratio to the total variation and is used, for example, in assessing inter-rater
reliability. Intraclass correlation, r = (Between-groups MS - Within-groups MS) / (Between-groups MS + (n - 1)*Within-groups MS), where n is the average number of cases in each category of the independent. Intraclass
correlation is large and positive when there is no variation within the groups, but group means differ. It will be at
its largest negative value when group means are the same but there is great variation within groups. Its maximum
value is 1.0, but its maximum negative value is (-1/(n-1)). A negative intraclass correlation occurs when between-group variation is less than within-group variation, indicating some third (control) variable has introduced
nonrandom effects on the different groups. Intraclass correlation is discussed further in the section on reliability.
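A minimal Python sketch of the one-way formula above (NumPy assumed; the ratings are invented for illustration):

import numpy as np

# Rows are targets (groups); columns are the n cases (e.g., raters) per target
data = np.array([[9.0, 8.0, 9.0],
                 [5.0, 6.0, 5.0],
                 [2.0, 2.0, 3.0],
                 [7.0, 7.0, 6.0]])
k, n = data.shape                         # k groups, n cases per group

grand_mean = data.mean()
ms_between = n * np.sum((data.mean(axis=1) - grand_mean) ** 2) / (k - 1)
ms_within = np.sum((data - data.mean(axis=1, keepdims=True)) ** 2) / (k * (n - 1))

icc = (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)
print(icc)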
Assumptions
1. Interval level data (for Pearson correlation).
2. Linear relationships. It is assumed that the x-y scattergraph of points for the two variables being correlated can
be better described by a straight line than by any curvilinear function. To the extent that a curvilinear function
would be better, Pearson's r and other linear coefficients of correlation will understate the true correlation,
sometimes to the point of being useless or misleading.
Linearity can be checked visually by plotting the data. In SPSS, select Graphs, Scatter/Dots; select Simple Scatter;
click Define; let the independent be the x-axis and the dependent be the y-axis; click OK. One may also view
many scatterplots simultaneously by asking for a scatterplot matrix: in SPSS, select Graphs, Scatter/Dots, Matrix,
Scatter; click Define; move any variables of interest to the Matrix Variable list; click OK.
3. Homoscedasticity is assumed. That is, the error variance is assumed to be the same at any point along the linear
relationship. Otherwise the correlation coefficient is a misleading average of points of higher and lower
correlation.
4. No outliers. Outlier cases can attenuate correlation coefficients. Scatter plots may be used to spot outliers visually
(see above). A large difference between Pearson correlation and Spearman's rho may also indicate the presence of
outliers.
5. Minimal measurement error is assumed since low reliability attenuates the correlation coefficient. By definition,
correlation measures the systematic covariance of two variables. Measurement error usually, with rare chance
exceptions, reduces systematic covariance and lowers the correlation coefficient. This lowering is called
attenuation. Restricted variance, discussed below, also leads to attenuation.
o Correction for attenuation: Reliability may be thought of as the correlation of a variable with itself. The
correction for attenuation of a correlation, r_xy, is a function of the reliabilities of the two variables, r_xx and
r_yy:
corrected r_xy = r_xy / SQRT[r_xx * r_yy]
(A numeric illustration follows this list.)
6. Unrestricted variance. If variance is truncated or restricted in one or both variables due, for instance, to poor
sampling, this can also lead to attenuation of the correlation coefficient. This also happens with truncation of the
range of variables as by dichotomization of continuous data, or by reducing a 7-point scale to a 3-point scale.
7. Similar underlying distributions are assumed for purposes of assessing strength of correlation. That is, if two
variables come from unlike distributions, their correlation may be well below +1 even when data pairs are
matched as perfectly as they can be while still conforming to the underlying distributions. Thus, the larger the
difference in the shape of the distribution of the two variables, the more the attenuation of the correlation
coefficient and the more the researcher should consider alternatives such as rank correlation. This assumption may
well be violated when correlating an interval variable with a dichotomy or even an ordinal variable.
8. Common underlying normal distributions, for purposes of assessing significance of correlation. Also, for
purposes of assessing strength of correlation, note that for non-normal distributions the range of the correlation
coefficient may not be from -1 to +1 (see Shih and Huang, 1992, Evaluating correlation with proper bounds,
Biometrics, Vol. 48: 1207-1213). The central limit theorem demonstrates, however, that for large samples,
indices used in significance testing will be normally distributed even when the variables themselves are not
normally distributed, and therefore significance testing may be employed. The researcher may wish to use
Spearman or other types of nonparametric rank correlation when there are marked violations of this assumption,
though this strategy has the danger of attenuation of correlation.
9. Normally distributed error terms. Again, the central limit theorem applies.
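Returning to the correction for attenuation noted under assumption 5, a two-line Python illustration (the observed correlation and reliabilities are assumed values):

import math

r_xy = 0.42                  # observed correlation
r_xx, r_yy = 0.80, 0.70      # reliabilities of x and y

corrected_r = r_xy / math.sqrt(r_xx * r_yy)
print(round(corrected_r, 3))   # about 0.561, the correlation corrected for unreliability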
Frequently Asked Questions

What rules exist for determining the appropriate significance level for testing correlation coefficients?
Two rules have been proposed:
1. On the rationale that the more coefficients to be tested, the more conservative the test should be,
some argue that significance cut-off level should be set at .05/C, where C is the number of
coefficients to be tested. Thus, the significance level should be .01 if five coefficients are to be
tested.
2. A less conservative rule following the same rationale is to test the highest coefficient at .05/C, the
next highest at .05/(C-1), the third-highest at .05/(C-2), etc.
Even the less conservative rule is very stringent when testing many coefficients. For instance, if one is
testing 50 coefficients, the highest coefficient should be tested at .05/50 = .001 level. Since in social
science, relationships are often not strong and often the researcher cannot amass large samples, such a test
will mean a high risk of a Type II error. In reality, most researchers simply apply .05 across the board
regardless of the number of coefficients, but one should realize that the significance of 1 in 20 coefficients
is apt to be spurious when using the customary 95% confidence level. This is mainly a danger when doing
post hoc analysis without a priori hypotheses to be tested.
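The two cut-off rules are easy to compute directly; a hedged Python sketch (the p-values are hypothetical) applies both to a set of coefficients:

# Hypothetical p-values for C tested coefficients, sorted from smallest (strongest) to largest
p_values = [0.001, 0.004, 0.020, 0.030, 0.041]
C = len(p_values)

# Rule 1 (Bonferroni-style): every coefficient tested at .05 / C
cutoff1 = 0.05 / C
print([p <= cutoff1 for p in p_values])

# Rule 2 (step-down): strongest coefficient at .05/C, next at .05/(C-1), and so on
cutoffs2 = [0.05 / (C - i) for i in range(C)]
print([p <= c for p, c in zip(p_values, cutoffs2)])
# A strict step-down procedure would stop testing at the first non-significant coefficient.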

Do I want one-tailed or two-tailed significance?
Normally the researcher wants 2-tailed significance and this is the default in SPSS output. One is then
testing the chance the observed correlation is significantly different from zero correlation, one way or the
other. If for some theoretical reason one direction of correlation (negative or positive) can be ruled out
because it is impossible, then the researcher should choose one-tailed significance. In SPSS, choose
Analyze, Correlate, Bivariate; check Two-tailed (the default) or One-tailed.

How do I convert correlations into z scores?
Z-Score Conversions of Pearson's r
A correlation coefficient can be transformed into a z-score for purposes of hypothesis testing. This is done
by dividing the correlation plus 1 by the same correlation minus 1, taking the natural log of the absolute
value of the result, and dividing that result by 2. The end result is Fisher's z-score transformation of
Pearson's r. Fisher's transformation reduces skew and makes the sampling distribution more normal as
sample size increases. For example, let r = .3. Then follow the formula: Z = ln[|(r+1)/(r-1)|]/2
so in this case
Z = ln(|1.3/-.7|)/2 = ln(1.8571)/2 = .6190/2 = .3095 = the value shown in the table below
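In Python the same transformation is available as the inverse hyperbolic tangent; a quick check of the worked example above (NumPy assumed):

import numpy as np

r = 0.3
z = np.log(abs((r + 1) / (r - 1))) / 2   # the formula as given above
print(z, np.arctanh(r))                   # both print 0.3095...; arctanh(r) is the standard closed form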
Table of Z-score conversions for Pearson's r

r      z'        r      z'        r      z'        r      z'
0.00   0.0000    0.25   0.2554    0.50   0.5493    0.75   0.9730
0.01   0.0100    0.26   0.2661    0.51   0.5627    0.76   0.9962
0.02   0.0200    0.27   0.2769    0.52   0.5763    0.77   1.0203
0.03   0.0300    0.28   0.2877    0.53   0.5901    0.78   1.0454
0.04   0.0400    0.29   0.2986    0.54   0.6042    0.79   1.0714
0.05   0.0500    0.30   0.3095    0.55   0.6184    0.80   1.0986
0.06   0.0601    0.31   0.3205    0.56   0.6328    0.81   1.1270
0.07   0.0701    0.32   0.3316    0.57   0.6475    0.82   1.1568
0.08   0.0802    0.33   0.3428    0.58   0.6625    0.83   1.1881
0.09   0.0902    0.34   0.3541    0.59   0.6777    0.84   1.2212
0.10   0.1003    0.35   0.3654    0.60   0.6931    0.85   1.2562
0.11   0.1104    0.36   0.3769    0.61   0.7089    0.86   1.2933
0.12   0.1206    0.37   0.3884    0.62   0.7250    0.87   1.3331
0.13   0.1307    0.38   0.4001    0.63   0.7414    0.88   1.3758
0.14   0.1409    0.39   0.4118    0.64   0.7582    0.89   1.4219
0.15   0.1511    0.40   0.4236    0.65   0.7753    0.90   1.4722
0.16   0.1614    0.41   0.4356    0.66   0.7928    0.91   1.5275
0.17   0.1717    0.42   0.4477    0.67   0.8107    0.92   1.5890
0.18   0.1820    0.43   0.4599    0.68   0.8291    0.93   1.6584
0.19   0.1923    0.44   0.4722    0.69   0.8480    0.94   1.7380
0.20   0.2027    0.45   0.4847    0.70   0.8673    0.95   1.8318
0.21   0.2132    0.46   0.4973    0.71   0.8872    0.96   1.9459
0.22   0.2237    0.47   0.5101    0.72   0.9076    0.97   2.0923
0.23   0.2342    0.48   0.5230    0.73   0.9287    0.98   2.2976
0.24   0.2448    0.49   0.5361    0.74   0.9505    0.99   2.6467
How is the significance of a correlation coefficient computed?
Significance of r
One tests the hypothesis that the population correlation is zero (rho = 0) using this formula:
t = [r*SQRT(n - 2)] / SQRT(1 - r²)
where r is the correlation coefficient and n is sample size, and where one looks up the t value in a table of
the distribution of t, for (n - 2) degrees of freedom. If the computed t value is as high or higher than the
table t value, then the researcher concludes the correlation is significant (that is, significantly different
from 0). In practice, most computer programs compute the significance of correlation for the researcher
without need for manual methods.
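A brief Python sketch of the same test (SciPy assumed; the correlation and sample size are illustrative):

import numpy as np
from scipy import stats

r, n = 0.45, 30
t = (r * np.sqrt(n - 2)) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value from the t distribution
print(t, p)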

How do I compute the significance of the difference between two correlation coefficients?
Significance of the difference between two correlations from two independent samples
To compute the significance of the difference between two correlations from independent samples, such as
a correlation for males vs. a correlation for females, follow these steps:
1. Use the table of z-score conversions or convert the two correlations to z-scores, as outlined above.
Note that if the correlation is negative, the z value should be negative.
2. Estimate the standard error of difference between the two correlations as:
SE = SQRT[(1/(n1 - 3)) + (1/(n2 - 3))]
where n1 and n2 are the sample sizes of the two independent samples
3. Divide the difference between the two z-scores by the standard error.
4. If the z value for the difference computed in step 3 is 1.96 or higher, the difference in the
correlations is significant at the .05 level. Use a 2.58 cutoff for significance at the .01 level.
Example. Let a sample of 15 males have a correlation of income and education of .60, and let a sample of
20 females have a correlation of .50. We wish to test if this is a significant difference. The z-score
conversions of the two correlations are .6931 and .5493 respectively, for a difference of .1438. The SE
estimate is SQRT[(1/12)+(1/17)] = SQRT[.1422] = .3770. The z value of the difference is therefore
.1438/.3770 = .381, much smaller than 1.96 and thus not significant at the .05 level. (Source: Hubert
Blalock, Social Statistics, NY: McGraw-Hill, 1972: 406-407.)
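The same steps in Python, reproducing the Blalock example above (SciPy assumed for the normal distribution):

import numpy as np
from scipy.stats import norm

r1, n1 = 0.60, 15      # males
r2, n2 = 0.50, 20      # females

z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher z-score conversions
se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # standard error of the difference
z = (z1 - z2) / se
p = 2 * norm.sf(abs(z))                        # two-tailed p-value
print(z, p)    # z is about 0.38, far below 1.96, so not significant at .05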
Significance of the difference between two dependent correlations from the same sample
A t-test is used to test for the difference of two dependent correlations from the same sample. We compute
the value of t using the formula below, then look up the computed value in a t table with n - 3 degrees of
freedom, where n is the sample size. Let x be a variable such as education and let us be interested in the
significance of the difference between x's correlation with income (y) and the correlation of parental
socioeconomic status (z) with income (y). The applicable formula is:
t = (r_xy - r_zy) * SQRT[{(n - 3)(1 + r_xz)} / {2(1 - r_xy² - r_xz² - r_zy² + 2*r_xy*r_xz*r_zy)}]
If the computed t value is as great or greater than the cutoff value in the t-table, then the difference in
correlations is significant at that level (the t-table will have various cutoffs for various significance levels
such as .05, .01, .001). (Source: Hubert Blalock, Social Statistics, NY: McGraw-Hill, 1972: 407.)
For further tests of the difference between two dependent correlations or the difference between more than two
independent correlations, see Chen and Popovich (2002) in the bibliography.
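A sketch of this dependent-correlations t-test in Python (the three correlations and the sample size are hypothetical):

import numpy as np
from scipy import stats

n = 100
r_xy, r_zy, r_xz = 0.50, 0.35, 0.40

num = (n - 3) * (1 + r_xz)
den = 2 * (1 - r_xy**2 - r_xz**2 - r_zy**2 + 2 * r_xy * r_xz * r_zy)
t = (r_xy - r_zy) * np.sqrt(num / den)
p = 2 * stats.t.sf(abs(t), df=n - 3)      # two-tailed p with n - 3 degrees of freedom
print(t, p)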

How do I set confidence limits on my correlation coefficients?
SPSS does not compute confidence limits directly. However, since the regression module in SPSS does
compute confidence limits, if you standardize two variables of interest and then regress one on the other,
the regression coefficient (the slope, b) is identical to the correlation coefficient and its confidence limits
are identical to those for the correlation coefficient. When using this method, if the correlations are
very high and/or the reliabilities very low, it is possible for the upper and/or lower confidence limits to lie
outside the +1 to -1 bounds of the correlation coefficient.
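As an alternative to the regression approach described above (this is not the author's method, just a common one), confidence limits are often built from the Fisher z transformation, which keeps the limits inside the -1 to +1 bounds; a hedged sketch:

import numpy as np
from scipy.stats import norm

r, n = 0.45, 30
z = np.arctanh(r)                    # Fisher z of the observed correlation
se = 1 / np.sqrt(n - 3)              # standard error of z
zcrit = norm.ppf(0.975)              # about 1.96 for a 95% interval

lower, upper = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)   # back-transform to r
print(lower, upper)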

I have ordinal variables and thus used Spearman's rho. How do I use these ordinal correlations in SPSS for
partial correlation, regression, and other procedures?
You got the output by selecting Statistics, Correlate, then checking Spearman's rho as the correlation type.
This invoked the NONPAR CORR procedure, but the dialog boxes (as of ver. 7.5) did not provide for
matrix output. Re-run the Spearman's correlations from the syntax window, which is invoked with File,
New, Syntax. Enter syntax such as the following, then run it:
NONPAR CORR VARIABLES= horse engine cylinder
/MATRIX=OUT(*).
The correlation matrix will now be in the SPSS Data Editor, where you change the ROWTYPE_ variable
values to CORR instead of RHO. Optionally, you may want to select File, Save As at this point to save
your matrix. Then select Statistics, Correlate, Partial Correlation (or another procedure) and SPSS will use
the Spearman's matrix as input. Alternatively, in the syntax window use MATRIX=IN(*) in PARTIAL
CORR or another procedure which accepts a correlation matrix as input.

What is the relation of correlation to ANOVA?
The significance level of a correlation coefficient for the correlation of an interval variable with a
dichotomy will be the same as for an ANOVA on the interval variable using the dichotomy as the only
factor. This similarity does not extend to categorical variables with greater than two values.
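A quick Python check of this equivalence (hypothetical data; SciPy assumed):

import numpy as np
from scipy.stats import pointbiserialr, f_oneway

group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y = np.array([4.2, 5.1, 3.9, 4.8, 5.5, 6.1, 5.9, 7.2, 6.4, 6.8])

r, p_corr = pointbiserialr(group, y)                    # correlation with the dichotomy
F, p_anova = f_oneway(y[group == 0], y[group == 1])     # one-way ANOVA, two groups
print(p_corr, p_anova)    # the two p-values match (and F equals t squared)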
Bibliography






Bobko, Philip (2001), Correlation and regression, 2nd edition. Thousand Oaks, CA: Sage Publications.
Introductory text which includes coverage of range restriction and trivariate correlation.
Chen, P. Y. and P. M. Popovich (2002). Correlation: Parametric and nonparametric measures. Thousand Oaks,
CA: Sage Publications. Covers tests of difference between two dependent correlations, and the difference between
more than two independent correlations.
Cohen, Jacob and Patricia Cohen (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral
Sciences, Second Edition. Hillsdale, NJ: Lawrence Erlbaum Assoc; ISBN: 0898592682.
Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum
Assoc; ISBN: 0805802835.
Kendall, Maurice and Jean Dickinson Gibbons (1990). Rank Correlation Methods, Fifth Edition. NY: Oxford
Univ Press; ISBN: 0195208374.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. NY: McGraw-Hill.
Copyright 1998, 2006 by G. David Garson.
Last update 05/09/06