Correlation Overview

Correlation is a bivariate measure of association (strength) of the relationship between two variables. It varies from 0 (random relationship) to +1 (perfect positive linear relationship) or -1 (perfect negative linear relationship). It is usually reported in terms of its square (r2), interpreted as the percent of variance explained. For instance, if r2 is .25, the independent variable is said to explain 25% of the variance in the dependent variable. In SPSS, select Analyze, Correlate, Bivariate; check Pearson.

There are several common pitfalls in using correlation. Correlation is symmetrical, providing no evidence of which way causation flows. If other variables also cause the dependent variable, then any covariance they share with the given independent variable may be falsely attributed to that independent. Also, to the extent that there is a nonlinear relationship between the two variables being correlated, correlation will understate the relationship. Correlation will also be attenuated to the extent that there is measurement error, including use of sub-interval data or artificial truncation of the range of the data. Correlation can be a misleading average if the relationship varies depending on the value of the independent variable ("lack of homoscedasticity"). And, of course, atheoretical or post hoc running of many correlations runs the risk that about 5% of the coefficients will appear significant by chance alone.

Besides Pearson correlation (r), the most common type, there are other special types of correlation to handle the special characteristics of variables such as dichotomies, and there are other measures of association for nominal and ordinal variables. Regression procedures produce the multiple correlation, R, which is the correlation of multiple independent variables with a single dependent. There is also partial correlation, the correlation of one variable with another, controlling both the given variable and the dependent for a third or additional variables; and part correlation, the correlation of one variable with another, controlling only the given variable for a third or additional variables. These are treated in separate sections.

Key Concepts and Terms

Deviation. A deviation is a value minus its mean: x - mean(x).

Covariance. Covariance is a measure of how much the deviations of two variables match. The formula is cov(x,y) = SUM[(x - meanx)(y - meany)]/(n - 1). When the match is best, high positive deviations in x will be paired with high positive deviations in y, high negatives with high negatives, and so on. Such a best-case match-up results in the highest possible sum in the formula above. In SPSS, select Analyze, Correlate, Bivariate; click Options; check Cross-product deviations and covariances.

Standardization. One cannot easily compare the covariance of one pair of variables with the covariance of another pair because variables differ in magnitude (mean value) and dispersion (standard deviation). Standardization is the process of making variables comparable in magnitude and dispersion: one subtracts the mean from each variable and divides by its standard deviation, giving all variables a mean of 0 and a standard deviation of 1.

Correlation is the covariance of standardized variables, that is, of variables after they have been made comparable by subtracting the mean and dividing by the standard deviation. Standardization is built into correlation and need not be requested explicitly in SPSS or other programs. Correlation is the ratio of the observed covariance of two standardized variables to the highest possible covariance when their values are arranged in the best possible match by order. When the observed covariance is as high as the possible covariance, the correlation equals 1, indicating perfectly matched order of the two variables. A value of -1 is perfect negative covariation, matching the highest positive values of one variable with the highest negative values of the other. A correlation of 0 indicates a random relationship by order between the two variables.
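To make deviations, covariance, and standardization concrete, here is a minimal sketch in Python (assuming numpy is available; the simulated data and variable names are purely illustrative, not from the text) showing that Pearson's r is the covariance of the standardized variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)      # two illustrative variables

# Deviations: each value minus its mean
dx, dy = x - x.mean(), y - y.mean()

# Covariance: cross-products of deviations, summed and divided by n - 1
cov_xy = (dx * dy).sum() / (len(x) - 1)

# Standardization: mean 0, standard deviation 1
zx = dx / x.std(ddof=1)
zy = dy / y.std(ddof=1)

# Pearson's r is the covariance of the standardized variables
r_as_cov = (zx * zy).sum() / (len(x) - 1)

print(cov_xy, r_as_cov, np.corrcoef(x, y)[0, 1])   # the last two values agree
```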
Pearson's r. This is the usual measure of correlation, sometimes called product-moment correlation. Pearson's r is a measure of association which varies from -1 to +1, with 0 indicating no relationship (random pairing of values) and +1 indicating a perfect positive relationship, taking the form "The more the x, the more the y, and vice versa." A value of -1 is a perfect negative relationship, taking the form "The more the x, the less the y, and vice versa." In SPSS, select Analyze, Correlate, Bivariate; check Pearson (the default).

Note: In older works, predating the prevalence of computers, special formulas were used to compute correlation by hand. For certain types of variables, notably dichotomies, these computational formulas differed from one another (e.g., the phi coefficient for two dichotomies, point-biserial correlation for an interval variable with a dichotomy). Today, however, SPSS will calculate the exact correlation regardless of whether the variables are continuous or dichotomous. Significance of correlation coefficients is discussed below in the frequently asked questions section.

Coefficient of determination, r2. The coefficient of determination is the square of the Pearson correlation coefficient. It represents the percent of the variance in the dependent variable explained by the independent. Of course, since correlation is symmetrical, r2 is also the percent of variance in the independent variable accounted for by the dependent. That is, the researcher must posit the direction of causation, if any, based on considerations external to correlation, which in itself cannot demonstrate causality.

Correlation for dichotomies and ordinal variables. Variations on r have been devised for dichotomous and ordinal data. Some studies suggest that use of these variant forms of correlation rarely affects substantive research conclusions. Rank correlation is nonparametric, does not assume a normal distribution, and is less sensitive to outliers.

Ordinal variables (see the section on ordinal association for fuller discussion):

Spearman's rho. The most common correlation for use with two ordinal variables, or an ordinal and an interval variable. Rho for ranked data equals Pearson's r computed on the ranks, as shown in the sketch below. Note that SPSS assigns the mean rank to tied values. The formula for Spearman's rho is rho = 1 - [6*SUM(d2)]/[n(n2 - 1)], where d is the difference in ranks for a pair of observations and n is the number of cases. In SPSS, choose Analyze, Correlate, Bivariate; check Spearman.
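A minimal sketch (Python with scipy; the data are illustrative) showing that rho equals Pearson's r on the ranks, and that the textbook formula gives the same result when there are no ties:

```python
import numpy as np
from scipy import stats

x = np.array([86, 97, 99, 100, 101, 103, 106, 110, 112, 113])
y = np.array([2, 20, 28, 27, 50, 29, 7, 17, 6, 12])

rho, p = stats.spearmanr(x, y)                   # Spearman's rho and its p value

# Rho equals Pearson's r computed on the ranks (ties get the mean rank)
rx, ry = stats.rankdata(x), stats.rankdata(y)
r_on_ranks, _ = stats.pearsonr(rx, ry)

# Textbook formula, exact when there are no ties
d = rx - ry
n = len(x)
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(rho, r_on_ranks, rho_formula)              # all three agree here
```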
Kendall's tau. Another common correlation for use with two ordinal variables, or an ordinal and an interval variable. Prior to computers, rho was preferred to tau because of its computational ease. Now that computers have rendered calculation trivial, tau is generally preferred. Partial Kendall's tau is also available, as an ordinal analog to partial Pearson correlation. In SPSS, choose Analyze, Correlate, Bivariate; check Kendall's tau-b.

Polyserial correlation. Used when an interval variable is correlated with a dichotomy or with an ordinal variable that is assumed to reflect an underlying continuous variable. It is interpreted like Pearson's r. The chi-square test of polyserial correlation and the associated p value test the assumption of bivariate normality required by the polyserial coefficient; if p > .05, the researcher fails to reject the bivariate normality assumption and may treat it as tenable. Polyserial correlation is supported by the PRELIS module of LISREL, a structural equation modeling (SEM) package distributed by Scientific Software International.

Polychoric correlation. Used when both variables are dichotomous or ordinal, but both are assumed to reflect underlying continuous variables. That is, polychoric correlation estimates what the categorical variables' distributions would be if continuous, adding tails to the distribution. As such it is an estimate strongly dependent on the assumption of an underlying continuous bivariate normal distribution. Polychoric correlation is supported by SAS PROC FREQ and by the PRELIS module of LISREL, and it has been used extensively in SEM applications as well as in assessing inter-rater agreement on instruments. Tetrachoric correlation is polychoric correlation for the case of two dichotomous variables.

Dichotomies:

Point-biserial correlation. Pearson's r for a true dichotomy and a continuous variable is the same as point-biserial correlation, which is recommended when an interval variable is correlated with a true dichotomous variable. Thus in one sense it is true that a dichotomous or dummy variable can be used "like a continuous variable" in ordinary Pearson correlation. Special formulas for point-biserial correlation in textbooks are for hand computation; point-biserial correlation is the same as Pearson correlation applied to a dichotomy and a continuous variable, so when one computes a Pearson correlation under the assumption that one has a true dichotomy and a continuous variable, one is actually computing a point-biserial correlation (see the sketch below). However, even when the continuous variable is ordered perfectly from low to high and the dichotomy is ordered as well as possible to match it, r will be less than 1.0, and the resulting r's must be interpreted accordingly. Specifically, r will have a maximum of 1.0 only for datasets with only two cases, and will have a maximum of around .85 even for large datasets when the continuous variable is normally distributed. The value of r may approach 1.0 when the continuous variable is bimodal and the dichotomy is a 50/50 split. Unequal splits in the dichotomy and curvilinearity in the continuous variable will both further depress the maximum possible correlation, even under perfect ordering. Moreover, if the dichotomy represents a true underlying continuum, correlation will be attenuated.
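A minimal sketch (Python with scipy; illustrative data) confirming that point-biserial correlation is simply Pearson's r computed with a 0/1 dichotomy:

```python
import numpy as np
from scipy import stats

group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])              # true dichotomy
score = np.array([12, 14, 11, 15, 13, 18, 17, 20, 19, 16])    # interval variable

r_pb, p_pb = stats.pointbiserialr(group, score)
r_pearson, p_pearson = stats.pearsonr(group, score)

print(r_pb, r_pearson)        # identical values
```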
Biserial correlation. Used when an interval variable is correlated with a dichotomous variable that is assumed to reflect an underlying continuous variable. Biserial correlation will always be greater than the corresponding point-biserial correlation, and it can exceed 1.0. Biserial correlation is rarely used any more, polyserial and polychoric correlation now being preferred. Biserial correlation is not supported by SPSS but is available in SAS as a macro.

Rank biserial correlation. Used when an ordinal variable is correlated with a dichotomous variable. Rank biserial correlation is not supported by SPSS but is available in SAS as a macro.

Phi. Used when both variables are dichotomies. Special formulas in textbooks are for hand computation; phi is the same as Pearson correlation for two dichotomies in SPSS correlation output, which uses exact algorithms. Alternatively, in SPSS, select Analyze, Descriptive Statistics, Crosstabs; click Statistics; check Phi and Cramer's V.

Tetrachoric correlation. Used when both variables are dichotomies which are assumed to represent underlying bivariate normal distributions, as might be the case when a dichotomous test item is used to measure some dimension of achievement. Tetrachoric correlation is sometimes used in structural equation modeling (SEM) during the data preparation phase of tailoring the input correlation matrix, and it is computed by PRELIS, the companion software to LISREL. LISREL, EQS, and other SEM packages often default to estimating tetrachoric rather than Pearson correlation for correlations involving dichotomies. Tetrachoric correlations estimated by various SEM packages can differ markedly depending on the method of estimation (e.g., serial vs. simultaneous). SAS PROC FREQ computes tetrachoric correlation, and an SPSS macro is available for it. Note that tetrachoric correlation matrices in SEM often produce inflated chi-square values and underestimated standard errors of estimates because of their larger variability relative to Pearson's r. Moreover, tetrachoric correlation can yield a non-positive definite correlation matrix because eigenvalues may be negative (reflecting violation of normality, sampling error, outliers, or multicollinearity of variables). These problems may lead the researcher away from SEM altogether, in favor of analysis using logit or probit regression.

Correlation ratio, eta. Eta, the coefficient of nonlinear correlation, known as the correlation ratio, is discussed further in the section on analysis of variance. Eta-squared is the ratio of the between-groups sum of squares to the total sum of squares in analysis of variance; eta is its square root. The extent to which eta is greater than r is an estimate of the extent to which the data relationship is nonlinear. In SPSS, select Analyze, Compare Means, Means; click Options; check ANOVA table and eta. Eta is also computed under Analyze, General Linear Model, Multivariate, and elsewhere in SPSS.
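A minimal sketch (Python with numpy; the data are illustrative) computing eta from the ANOVA decomposition just described and comparing it with Pearson's r:

```python
import numpy as np

# y measured within three ordered groups of x; the relationship rises then falls
x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y = np.array([2.0, 3.0, 2.5, 6.0, 5.5, 6.5, 4.0, 4.5, 3.5])

grand_mean = y.mean()
ss_total = ((y - grand_mean) ** 2).sum()
ss_between = sum(len(y[x == g]) * (y[x == g].mean() - grand_mean) ** 2
                 for g in np.unique(x))

eta = np.sqrt(ss_between / ss_total)     # correlation ratio
r = np.corrcoef(x, y)[0, 1]              # linear (Pearson) correlation

print(eta, r)    # eta well above |r| signals a nonlinear relationship
```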
The coefficient of intraclass correlation, r. This ANOVA-based type of correlation measures the relative homogeneity within groups in ratio to the total variation and is used, for example, in assessing inter-rater reliability. Intraclass correlation r = (Between-groups MS - Within-groups MS)/(Between-groups MS + (n - 1)*Within-groups MS), where n is the average number of cases in each category of the independent variable. Intraclass correlation is large and positive when there is no variation within the groups but group means differ. It takes its largest negative value when group means are the same but there is great variation within groups. Its maximum value is 1.0; its largest negative value is -1/(n - 1). A negative intraclass correlation occurs when between-group variation is less than within-group variation, indicating that some third (control) variable has introduced nonrandom effects on the different groups. Intraclass correlation is discussed further in the section on reliability.

Assumptions

1. Interval-level data (for Pearson correlation).

2. Linear relationships. It is assumed that the x-y scatterplot of points for the two variables being correlated can be better described by a straight line than by any curvilinear function. To the extent that a curvilinear function would be better, Pearson's r and other linear coefficients of correlation will understate the true correlation, sometimes to the point of being useless or misleading. Linearity can be checked visually by plotting the data. In SPSS, select Graphs, Scatter/Dot; select Simple Scatter; click Define; let the independent variable be the x-axis and the dependent the y-axis; click OK. One may also view many scatterplots simultaneously by requesting a scatterplot matrix: in SPSS, select Graphs, Scatter/Dot, Matrix Scatter; click Define; move the variables of interest to the Matrix Variables list; click OK.

3. Homoscedasticity. The error variance is assumed to be the same at any point along the linear relationship. Otherwise the correlation coefficient is a misleading average of points of higher and lower correlation.

4. No outliers. Outlier cases can attenuate correlation coefficients. Scatterplots may be used to spot outliers visually (see above). A large difference between the Pearson correlation and Spearman's rho may also indicate the presence of outliers.

5. Minimal measurement error is assumed, since low reliability attenuates the correlation coefficient. By definition, correlation measures the systematic covariance of two variables. Measurement error usually (with rare chance exceptions) reduces systematic covariance and lowers the correlation coefficient. This lowering is called attenuation. Restricted variance, discussed below, also leads to attenuation. Correction for attenuation: reliability may be thought of as the correlation of a variable with itself. The correction for attenuation of a correlation rxy is a function of the reliabilities of the two variables, rxx and ryy: corrected rxy = rxy/SQRT(rxx*ryy). A small sketch follows this list of assumptions.

6. Unrestricted variance. If variance is truncated or restricted in one or both variables, due for instance to poor sampling, this also leads to attenuation of the correlation coefficient. The same happens with truncation of the range of variables, as by dichotomization of continuous data or by collapsing a 7-point scale to a 3-point scale.

7. Similar underlying distributions are assumed for purposes of assessing the strength of correlation. That is, if two variables come from unlike distributions, their correlation may be well below +1 even when data pairs are matched as perfectly as they can be while still conforming to the underlying distributions. Thus, the larger the difference in the shapes of the distributions of the two variables, the greater the attenuation of the correlation coefficient and the more the researcher should consider alternatives such as rank correlation. This assumption may well be violated when correlating an interval variable with a dichotomy or even with an ordinal variable.

8. Common underlying normal distributions, for purposes of assessing the significance of correlation. Also, for purposes of assessing the strength of correlation, note that for non-normal distributions the range of the correlation coefficient may not be from -1 to +1 (see Shih and Huang, 1992, in the bibliography). The central limit theorem shows, however, that for large samples the indices used in significance testing will be normally distributed even when the variables themselves are not, and therefore significance testing may still be employed. The researcher may wish to use Spearman or another type of nonparametric rank correlation when there are marked violations of this assumption, though this strategy carries the danger of attenuation of correlation.

9. Normally distributed error terms. Again, the central limit theorem applies.
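As noted under assumption 5, here is a minimal sketch (Python; the correlation and reliability values are illustrative) of the correction for attenuation:

```python
from math import sqrt

r_xy = 0.42    # observed correlation between x and y (illustrative)
r_xx = 0.80    # reliability of x
r_yy = 0.70    # reliability of y

# Correction for attenuation: corrected rxy = rxy / SQRT(rxx * ryy)
r_corrected = r_xy / sqrt(r_xx * r_yy)
print(round(r_corrected, 3))   # 0.561
```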
Frequently Asked Questions

What rules exist for determining the appropriate significance level for testing correlation coefficients?

Two rules have been proposed:

1. On the rationale that the more coefficients are to be tested, the more conservative the test should be, some argue that the significance cut-off level should be set at .05/C, where C is the number of coefficients to be tested. Thus the significance level should be .01 if five coefficients are to be tested.

2. A less conservative rule following the same rationale is to test the highest coefficient at .05/C, the next highest at .05/(C - 1), the third highest at .05/(C - 2), and so on.

Even the less conservative rule is very stringent when testing many coefficients. For instance, if one is testing 50 coefficients, the highest coefficient should be tested at the .05/50 = .001 level. Since relationships in social science are often not strong and the researcher often cannot amass large samples, such a test means a high risk of Type II error. In reality, most researchers simply apply .05 across the board regardless of the number of coefficients, but one should realize that roughly 1 in 20 coefficients found significant at the customary 95% confidence level is apt to be spurious. This is mainly a danger when doing post hoc analysis without a priori hypotheses to be tested.

Do I want one-tailed or two-tailed significance?

Normally the researcher wants two-tailed significance, and this is the default in SPSS output. One is then testing the chance that the observed correlation is significantly different from zero in either direction. If for some theoretical reason one direction of correlation (negative or positive) can be ruled out as impossible, then the researcher may choose one-tailed significance. In SPSS, choose Analyze, Correlate, Bivariate; check Two-tailed (the default) or One-tailed. Both of these points are illustrated in the sketch below.
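A minimal sketch (Python with scipy; the data, the number of coefficients C, and the direction assumed for the one-tailed test are all illustrative) showing a two-tailed p value for r, its one-tailed counterpart, and the .05/C-style cut-offs discussed above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 0.4 * x + rng.normal(size=40)               # illustrative variables

r, p_two_tailed = stats.pearsonr(x, y)          # SPSS-style two-tailed test
p_one_tailed = p_two_tailed / 2 if r > 0 else 1 - p_two_tailed / 2  # H1: positive r

C = 10                                          # number of coefficients tested
cutoff_rule_1 = 0.05 / C                        # same cut-off for every coefficient
cutoff_rule_2 = [0.05 / (C - k) for k in range(C)]   # largest coefficient first

print(r, p_two_tailed, p_one_tailed, cutoff_rule_1, cutoff_rule_2[:3])
```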
How do I convert correlations into z scores?

Z-score conversions of Pearson's r. A correlation coefficient can be transformed into a z-score for purposes of hypothesis testing: add 1 to the correlation and divide by the correlation minus 1, take the natural log of the absolute value of the result, and divide that result by 2. The end result is Fisher's z-score transformation of Pearson's r:

Z = ln[|(r + 1)/(r - 1)|]/2

which for correlations between -1 and +1 is equivalent to Z = ln[(1 + r)/(1 - r)]/2. Fisher's transformation reduces skew and makes the sampling distribution more normal as sample size increases. For example, let r = .3. Then Z = ln(|1.3/-0.7|)/2 = ln(1.8571)/2 = .6190/2 = .3095, the value shown in the table below. A sketch applying this transformation follows the next question.

Table of z-score conversions for Pearson's r

 r     z'       r     z'       r     z'       r     z'
0.00  0.0000   0.25  0.2554   0.50  0.5493   0.75  0.9730
0.01  0.0100   0.26  0.2661   0.51  0.5627   0.76  0.9962
0.02  0.0200   0.27  0.2769   0.52  0.5763   0.77  1.0203
0.03  0.0300   0.28  0.2877   0.53  0.5901   0.78  1.0454
0.04  0.0400   0.29  0.2986   0.54  0.6042   0.79  1.0714
0.05  0.0500   0.30  0.3095   0.55  0.6184   0.80  1.0986
0.06  0.0601   0.31  0.3205   0.56  0.6328   0.81  1.1270
0.07  0.0701   0.32  0.3316   0.57  0.6475   0.82  1.1568
0.08  0.0802   0.33  0.3428   0.58  0.6625   0.83  1.1881
0.09  0.0902   0.34  0.3541   0.59  0.6777   0.84  1.2212
0.10  0.1003   0.35  0.3654   0.60  0.6931   0.85  1.2562
0.11  0.1104   0.36  0.3769   0.61  0.7089   0.86  1.2933
0.12  0.1206   0.37  0.3884   0.62  0.7250   0.87  1.3331
0.13  0.1307   0.38  0.4001   0.63  0.7414   0.88  1.3758
0.14  0.1409   0.39  0.4118   0.64  0.7582   0.89  1.4219
0.15  0.1511   0.40  0.4236   0.65  0.7753   0.90  1.4722
0.16  0.1614   0.41  0.4356   0.66  0.7928   0.91  1.5275
0.17  0.1717   0.42  0.4477   0.67  0.8107   0.92  1.5890
0.18  0.1820   0.43  0.4599   0.68  0.8291   0.93  1.6584
0.19  0.1923   0.44  0.4722   0.69  0.8480   0.94  1.7380
0.20  0.2027   0.45  0.4847   0.70  0.8673   0.95  1.8318
0.21  0.2132   0.46  0.4973   0.71  0.8872   0.96  1.9459
0.22  0.2237   0.47  0.5101   0.72  0.9076   0.97  2.0923
0.23  0.2342   0.48  0.5230   0.73  0.9287   0.98  2.2976
0.24  0.2448   0.49  0.5361   0.74  0.9505   0.99  2.6467

How is the significance of a correlation coefficient computed?

Significance of r. One tests the hypothesis that the population correlation is zero (rho = 0) using the formula:

t = [r*SQRT(n - 2)]/SQRT(1 - r2)

where r is the correlation coefficient and n is the sample size, and one looks up the resulting value in a table of the t distribution with (n - 2) degrees of freedom. If the computed t value is as high as or higher than the table t value, the researcher concludes that the correlation is significant (that is, significantly different from 0). In practice, most computer programs compute the significance of correlation for the researcher without the need for manual methods.
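A minimal sketch (Python with numpy/scipy; the r and n values are illustrative) of the Fisher z transformation and the t test of r described above:

```python
import numpy as np
from scipy import stats

r, n = 0.30, 30                          # illustrative correlation and sample size

# Fisher's z transformation (arctanh is the same function as the formula above)
z = np.arctanh(r)                        # 0.3095, the table entry for r = .30

# t test of the hypothesis that the population correlation is zero
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_two_tailed = 2 * stats.t.sf(abs(t), df=n - 2)

print(z, t, p_two_tailed)
```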
How do I compute the significance of the difference between two correlation coefficients?

Significance of the difference between two correlations from two independent samples. To compute the significance of the difference between two correlations from independent samples, such as a correlation for males versus a correlation for females, follow these steps:

1. Use the table of z-score conversions, or convert the two correlations to z-scores as outlined above. Note that if a correlation is negative, its z value should be negative.
2. Estimate the standard error of the difference between the two correlations as SE = SQRT[1/(n1 - 3) + 1/(n2 - 3)], where n1 and n2 are the sample sizes of the two independent samples.
3. Divide the difference between the two z-scores by the standard error.
4. If the z value for the difference computed in step 3 is 1.96 or higher, the difference in the correlations is significant at the .05 level. Use a 2.58 cutoff for significance at the .01 level.

Example. Let a sample of 15 males have a correlation of income and education of .60, and let a sample of 20 females have a correlation of .50. We wish to test whether this is a significant difference. The z-score conversions of the two correlations are .6931 and .5493 respectively, for a difference of .1438. The SE estimate is SQRT[(1/12) + (1/17)] = SQRT[.1422] = .3770. The z value of the difference is therefore .1438/.3770 = .381, much smaller than 1.96 and thus not significant at the .05 level. (Source: Hubert Blalock, Social Statistics, NY: McGraw-Hill, 1972: 406-407.) This calculation is reproduced in the sketch at the end of this section.

Significance of the difference between two dependent correlations from the same sample. A t test is used for the difference between two dependent correlations from the same sample. Compute the value of t using the formula below, then look up the computed value in a t table with n - 3 degrees of freedom, where n is the sample size. Let x be a variable such as education, and let us be interested in the significance of the difference between x's correlation with income (y) and the correlation of parental socioeconomic status (z) with income. The applicable formula is:

t = (rxy - rzy)*SQRT[{(n - 3)(1 + rxz)}/{2(1 - rxy2 - rxz2 - rzy2 + 2*rxy*rxz*rzy)}]

If the computed t value is as great as or greater than the cutoff value in the t table, the difference in correlations is significant at that level (the t table will have cutoffs for various significance levels such as .05, .01, and .001). (Source: Hubert Blalock, Social Statistics, NY: McGraw-Hill, 1972: 407.) For tests of the difference between two dependent correlations or the difference between more than two independent correlations, see Chen and Popovich (2002) in the bibliography.

How do I set confidence limits on my correlation coefficients?

SPSS does not compute confidence limits on correlations directly. However, since the regression module in SPSS does compute confidence limits, if you standardize the two variables of interest and then regress one on the other, the regression coefficient (the slope, b) is identical to the correlation coefficient and its confidence limits are identical to those for the correlation coefficient. When using this method, if the correlations are very high and/or the reliabilities very low, it is possible for the upper and/or lower confidence limits to lie outside the -1 to +1 bounds of the correlation coefficient.

I have ordinal variables and thus used Spearman's rho. How do I use these ordinal correlations in SPSS for partial correlation, regression, and other procedures?

You obtained the output by selecting Statistics, Correlate, Bivariate, then checking Spearman's rho as the correlation type. This invoked the NONPAR CORR procedure, but the dialog boxes (as of version 7.5) did not provide for matrix output. Re-run the Spearman correlations from the syntax window, which is invoked with File, New, Syntax. Enter syntax such as the following, then run it:

NONPAR CORR VARIABLES= horse engine cylinder /MATRIX=OUT(*).

The correlation matrix will now be in the SPSS Data Editor, where you change the ROWTYPE_ variable values from RHO to CORR. Optionally, you may want to select File, Save As at this point to save your matrix. Then select Statistics, Correlate, Partial Correlation (or another procedure) and SPSS will use the Spearman matrix as input. Alternatively, in the syntax window use MATRIX=IN(*) in PARTIAL CORR or another procedure which accepts a correlation matrix as input.

What is the relation of correlation to ANOVA?

The significance level of a correlation coefficient for the correlation of an interval variable with a dichotomy will be the same as the significance for an ANOVA on the interval variable using the dichotomy as the only factor. This equivalence does not extend to categorical variables with more than two values.
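A minimal sketch (Python with numpy/scipy) reproducing the Blalock worked example above for two correlations from independent samples, with a normal-approximation p value added:

```python
import numpy as np
from scipy import stats

r1, n1 = 0.60, 15      # males (from the example above)
r2, n2 = 0.50, 20      # females

z1, z2 = np.arctanh(r1), np.arctanh(r2)         # 0.6931 and 0.5493
se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))       # 0.3770
z_diff = (z1 - z2) / se                         # about 0.38, well below 1.96

p_two_tailed = 2 * stats.norm.sf(abs(z_diff))   # not significant at .05
print(z_diff, p_two_tailed)
```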
Bibliography

Bobko, Philip (2001). Correlation and Regression, 2nd edition. Thousand Oaks, CA: Sage Publications. An introductory text which includes coverage of range restriction and trivariate correlation.

Chen, P. Y. and P. M. Popovich (2002). Correlation: Parametric and Nonparametric Measures. Thousand Oaks, CA: Sage Publications. Covers tests of the difference between two dependent correlations and the difference between more than two independent correlations.

Cohen, Jacob and Patricia Cohen (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd edition. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 0898592682.

Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 0805802835.

Kendall, Maurice and Jean Dickinson Gibbons (1990). Rank Correlation Methods, 5th edition. NY: Oxford University Press. ISBN 0195208374.

Shih and Huang (1992). Evaluating correlation with proper bounds. Biometrics 48: 1207-1213.

Siegel, S. (1956). Nonparametric Statistics for the Behavioral Sciences. NY: McGraw-Hill.

Copyright 1998, 2006 by G. David Garson. Last update 05/09/06.