252corr 1/22/07 (Open this document in 'Outline' view!)

L. CORRELATION

1. Simple Correlation

The simple sample correlation coefficient is

$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$

or, if the spare parts $S_{xy} = \sum XY - n\bar{X}\bar{Y}$, $SS_x = \sum X^2 - n\bar{X}^2$ and $SS_y = \sum Y^2 - n\bar{Y}^2$ are available, we can say $r = \frac{S_{xy}}{\sqrt{SS_x SS_y}}$.

Of course, since the coefficient of determination is $R^2 = r^2$, it is often easier to compute $r^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{\left(\sum xy\right)^2}{\sum x^2 \sum y^2}$ (where $x = X - \bar{X}$ and $y = Y - \bar{Y}$) and to give the correlation the sign of $S_{xy}$. But note that the correlation can range from +1 to -1, while the coefficient of determination can only range from 0 to 1.

Also note that, since the slope in simple regression is $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$, we have $R^2 = b_1^2 \frac{s_x^2}{s_y^2}$, or $b_1 = \frac{s_y}{s_x} r$. The last equation has a counterpart in $\beta_1 = \frac{\sigma_y}{\sigma_x}\rho$, where $\rho$ is the population correlation coefficient, so that testing $H_0: \beta_1 = 0$ is equivalent to testing $H_0: \rho = 0$, and the simple regression coefficient and the correlation will have the same sign.

2. Correlation when x and y are both independent

If we want to test $H_0: \rho_{xy} = 0$ against $H_1: \rho_{xy} \ne 0$ and $x$ and $y$ are normally distributed, we use $t^{(n-2)} = \frac{r}{s_r} = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$. But note that if we are testing $H_0: \rho_{xy} = \rho_0$ against $H_1: \rho_{xy} \ne \rho_0$, and $\rho_0 \ne 0$, the test is quite different. We need to use Fisher's z-transformation. Let $\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that $t^{(n-2)} = \frac{\tilde{z} - \mu_z}{s_z}$. (Note: To get $\ln$, the natural log, compute the log to the base 10 and divide by .434294482.)

Example: Test $H_0: \rho_{xy} = 0$ against $H_1: \rho_{xy} \ne 0$ when $n = 10$, $r = .704$ ($r^2 = .496$) and $\alpha = .05$.

To solve this we first compute $s_r = \sqrt{\frac{1-r^2}{n-2}}$ and then $t^{(n-2)} = \frac{r}{s_r} = \frac{.704}{\sqrt{\frac{1-.496}{8}}} = 2.805$. Compare this with $t^{(n-2)}_{\alpha/2} = t^{(8)}_{.025} = 2.306$. Since 2.805 is not between $-2.306$ and $+2.306$, reject the null hypothesis. (Note that $\left(t^{(n-2)}\right)^2 = F^{(1,\,n-2)}$, so that this is equivalent to an F test on $\beta_1$ in a regression.)

Example: Test $H_0: \rho_{xy} = 0.8$ against $H_1: \rho_{xy} \ne 0.8$ when $n = 10$, $r = .704$ ($r^2 = .496$) and $\alpha = .05$.

This time compute Fisher's z-transformation (because $\rho_0$ is not zero):

$\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\ln\left(\frac{1.704}{0.296}\right) = \frac{1}{2}\ln(5.75676) = \frac{1}{2}(1.75037) = 0.87519$

$\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) = \frac{1}{2}\ln\left(\frac{1.8}{0.2}\right) = \frac{1}{2}\ln(9.0000) = \frac{1}{2}(2.19722) = 1.09861$

$s_z = \sqrt{\frac{1}{n-3}} = \sqrt{\frac{1}{10-3}} = \sqrt{\frac{1}{7}} = 0.37796$

Finally $t = \frac{\tilde{z} - \mu_z}{s_z} = \frac{0.87519 - 1.09861}{0.37796} = -0.591$. Compare this with $t^{(n-2)}_{\alpha/2} = t^{(8)}_{.025} = 2.306$. Since $-0.591$ lies between $-2.306$ and $+2.306$, do not reject the null hypothesis.

Note: To do the above with logarithms to the base 10, try $\tilde{z}_{10} = \frac{1}{2}\log\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_{z10} = \frac{1}{2}\log\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_{z10} = \sqrt{\frac{0.18861}{n-3}}$, so that $t^{(n-2)} = \frac{\tilde{z}_{10} - \mu_{z10}}{s_{z10}}$.
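For readers who want to verify the arithmetic by machine, the calculations of sections 1 and 2 can be sketched in a few lines of Python. (Python is not part of these notes; the function names below are only illustrative.) Within rounding, the sketch reproduces the $t = 2.805$ of the first example and, as a check on the spare-parts formula, the .8018 correlation between W and H that appears in section 6 below.

import math

def r_from_parts(Sxy, SSx, SSy):
    # simple sample correlation from the spare parts of section 1
    return Sxy / math.sqrt(SSx * SSy)

def t_for_rho_zero(r, n):
    # t statistic with n - 2 degrees of freedom for H0: rho = 0 (section 2)
    return r / math.sqrt((1 - r ** 2) / (n - 2))

print(round(r_from_parts(1.8, 2.4, 2.1), 4))   # 0.8018, the W-H correlation used in section 6
print(round(t_for_rho_zero(0.704, 10), 3))     # 2.804, matching the 2.805 of the first example (which rounds r^2 to .496); compare with t(8, .025) = 2.306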
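The Fisher z-transformation test of $H_0: \rho_{xy} = \rho_0$ can be sketched the same way; again, this is only an illustration and not part of the original notes. It reproduces the $t = -0.591$ of the second example.

import math

def fisher_z_test(r, rho0, n):
    # test statistic for H0: rho = rho0 using Fisher's z-transformation (section 2)
    z_tilde = 0.5 * math.log((1 + r) / (1 - r))      # transformed r
    mu_z = 0.5 * math.log((1 + rho0) / (1 - rho0))   # transformed rho0
    s_z = math.sqrt(1 / (n - 3))
    return (z_tilde - mu_z) / s_z

print(round(fisher_z_test(0.704, 0.8, 10), 3))   # -0.591; this lies between -2.306 and +2.306, so do not reject H0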
3. Tests of Association

a. Kendall's Tau. (Omitted)

b. Spearman's Rank Correlation Coefficient. Take a set of $n$ points $(x, y)$ and rank both $x$ and $y$ from 1 to $n$ to get $(r_x, r_y)$. Do not attempt to compute a rank correlation without replacing the original numbers by ranks. A correlation coefficient between $r_x$ and $r_y$ can be computed as in point 2 above, but it is easier to compute $d = r_x - r_y$ and then $r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$. This can be given a t test for $H_0: \rho_s = 0$ as in point 2 above, but for $n$ between 4 and 30 a special table should be used. For really large $n$, $z = r_s\sqrt{n-1}$ may be used.

Example: 5 applicants for a job are rated by two officers, with the following results. Note that in this example the ranks are given initially; usually the data must be replaced by ranks. Test to see how well the ratings agree, using $\alpha = .05$. In this case we have a 1-sided test of $H_0: \rho_s = 0$ against $H_1: \rho_s > 0$. Arrange the data in columns.

Applicant   Rater 1 ($r_x$)   Rater 2 ($r_y$)   $d = r_x - r_y$   $d^2$
A                 4                 3                  1            1
B                 1                 1                  0            0
C                 3                 2                  1            1
D                 2                 5                 -3            9
E                 5                 4                  1            1

Note that $\sum d = 0$ and $\sum d^2 = 12$. Since $n = 5$,

$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{5(5^2 - 1)} = 1 - \frac{72}{120} = 0.40$.

If we check the table 'Critical Values of $r_s$, the Spearman Rank Correlation Coefficient,' we find that the critical value for $n = 5$ and $\alpha = .05$ is .8000, so we must not reject the null hypothesis, and we conclude that we cannot say that the rankings agree.

Example: We find that for $n = 122$, $r_s = 0.15$. We want to do the same one-sided test as in the last problem ($H_0: \rho_s = 0$ against $H_1: \rho_s > 0$, $\alpha = .05$).

We can do a t test by computing $t = \frac{r_s}{s_{r_s}} = \frac{r_s}{\sqrt{\frac{1 - r_s^2}{n-2}}} = \frac{.15}{\sqrt{\frac{1 - .15^2}{122 - 2}}} = 1.662$. This is compared with $t^{(n-2)}_{\alpha} = t^{(120)}_{.05} = 1.658$. Since the t we computed is above the table value, we reject the null hypothesis. Or we can compute a z-score, $z = r_s\sqrt{n-1} = .15\sqrt{121} = 1.650$. Since this is above $z_{.05} = 1.645$, we can reject the null hypothesis.

c. Kendall's Coefficient of Concordance. Take $k$ columns with $n$ items in each and rank each column from 1 to $n$. The null hypothesis is that the rankings disagree. Compute a sum of ranks $SR_i$ for each row. Then $S = \sum SR_i^2 - n\,\overline{SR}^2$, where $\overline{SR} = \frac{(n+1)k}{2}$ is the mean of the $SR_i$'s. If $H_0$ is disagreement, $S$ can be checked against a table for this test: if $S$ exceeds the table value $S_\alpha$, reject $H_0$. For $n$ too large for the table, use $\chi^{2(n-1)} = k(n-1)W = \frac{S}{\frac{1}{12}kn(n+1)}$, where $W = \frac{S}{\frac{1}{12}k^2(n^3 - n)}$ is the Kendall Coefficient of Concordance and must be between 0 and 1.

Example: $n = 6$ applicants are rated by $k = 3$ officers. The ranks are below.

Applicant   Rater 1   Rater 2   Rater 3   Rank sum $SR_i$   $SR_i^2$
A              1         1         6            8              64
B              6         5         3           14             196
C              3         6         2           11             121
D              2         4         5           11             121
E              5         2         4           11             121
F              4         3         1            8              64
Sum                                             63             687

Note that $\overline{SR} = \frac{63}{6} = \frac{(n+1)k}{2} = \frac{7(3)}{2} = 10.5$; if we had complete disagreement, every applicant would have a rank sum of 10.5. $S = \sum SR_i^2 - n\,\overline{SR}^2 = 687 - 6(10.5)^2 = 25.5$. The Kendall Coefficient of Concordance says that the degree of agreement on a zero-to-one scale is $W = \frac{S}{\frac{1}{12}k^2(n^3 - n)} = \frac{25.5}{\frac{1}{12}(3^2)(6^3 - 6)} = 0.162$. To do a test of the null hypothesis of disagreement ($\alpha = .05$), look up $S$ in the table giving 'Critical Values of Kendall's S as a Measure of Concordance': for $k = 3$ and $n = 6$, $S_{.05} = 103.9$, so that we accept the null hypothesis of disagreement.

Example: For $n = 31$ and $k = 3$ we get $W = 0.10$, and wish to test $H_0$: Disagreement against $H_1$: Agreement. Since $n = 31$ is too large for the table, use $\chi^{2(n-1)} = k(n-1)W = 3(30)(0.10) = 9.000$. Using a chi-squared table, look up $\chi^{2(30)}_{.05} = 43.773$. Since 9 is below the table value, do not reject $H_0$.
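The Spearman calculation can be sketched in Python as well (an illustration only; the function name is mine, not standard). For the ranks in the five-applicant example it returns $r_s = 0.40$, and it also gives the large-sample $z = 1.650$ of the $n = 122$ example.

import math

def spearman_rs(rank_x, rank_y):
    # Spearman rank correlation from two lists of ranks (section 3b)
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(round(spearman_rs([4, 1, 3, 2, 5], [3, 1, 2, 5, 4]), 2))   # 0.4, the five-applicant example
n, rs = 122, 0.15
print(round(rs * math.sqrt(n - 1), 3))                           # 1.65, the large-sample z; compare with z(.05) = 1.645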
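Kendall's Coefficient of Concordance can be sketched the same way; again this is only an illustrative sketch, not part of the original notes. It reproduces $S = 25.5$ and $W = 0.162$ for the six-applicant example and the $\chi^2 = 9.000$ of the $n = 31$ example.

def kendall_w(ranks_by_rater):
    # Kendall's coefficient of concordance from k lists of ranks, one list per rater (section 3c)
    k, n = len(ranks_by_rater), len(ranks_by_rater[0])
    row_sums = [sum(ranks) for ranks in zip(*ranks_by_rater)]   # SR_i for each of the n rows
    mean_sr = (n + 1) * k / 2
    S = sum((sr - mean_sr) ** 2 for sr in row_sums)
    W = S / (k ** 2 * (n ** 3 - n) / 12)
    return S, W

S, W = kendall_w([[1, 6, 3, 2, 5, 4],    # Rater 1
                  [1, 5, 6, 4, 2, 3],    # Rater 2
                  [6, 3, 2, 5, 4, 1]])   # Rater 3
print(S, round(W, 3))                # 25.5 and 0.162, as in the six-applicant example
print(round(3 * (31 - 1) * 0.10, 3)) # 9.0, the chi-squared statistic for the n = 31, k = 3, W = 0.10 example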
4. Multiple Correlation

If $R^2$ is the coefficient of determination for a regression $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$, then the square root of $R^2$, $R = r_{Y\hat{Y}}$, is called the multiple correlation coefficient. Note that $R^2 = \frac{\sum(\hat{Y} - \bar{Y})^2}{\sum Y^2 - n\bar{Y}^2}$ and that, for large $n$, $R^2 \approx 1 - \frac{s_e^2}{s_y^2}$, where $s_y^2$ is the sample variance of $y$ (the denominators of $s_e^2$ and $s_y^2$ are $n - k - 1$ and $n - 1$ respectively).

5. Partial Correlation (Optional)

If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, its multiple correlation coefficient can be written as $R_{Y.X_1X_2}$ or $R_{Y.12}$. For example, in the multiple regression problem we got three multiple correlation coefficients: $R^2_{Y.X} = .496$, $R^2_{Y.XW} = .857$ and $R^2_{Y.XWH} = .906$. If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$ and we compute the partial correlation of $X_3$ with $Y$, we compute $r^2_{Y3.12} = \frac{R^2_{Y.123} - R^2_{Y.12}}{1 - R^2_{Y.12}}$, the additional explanatory power of the third independent variable after the effects of the first two are considered. If we read $t_3 = \frac{b_3}{s_{b_3}}$ from the computer printout, $r^2_{Y3.12} = \frac{t_3^2}{t_3^2 + df}$, where $df = n - k - 1$ and $k$ is the number of independent variables.

Example: In the multiple regression problem with which we have been working, $r^2_{YH.XW}$ is the additional explanatory power of $H$ beyond what was explained by $X$ and $W$. It can be computed two ways. First, $r^2_{YH.XW} = \frac{R^2_{Y.XWH} - R^2_{Y.XW}}{1 - R^2_{Y.XW}} = \frac{.906 - .857}{1 - .857} = .343$. The partial correlation coefficient is then $r_{YH.XW} = \sqrt{.343} = .586$ in magnitude; the sign of the partial correlation is the sign of the corresponding coefficient in the regression, which here is negative. (For the regression equation see below.)

For the second method of computing $r^2_{YH.XW}$, recall that the last printout for the regression with which we were working was

Y = 1.51 + 0.595 X - 0.698 W - 0.937 H

Predictor      Coef     Stdev   t-ratio       p
Constant     1.5079    0.2709      5.57   0.001
X            0.5952    0.1198      4.97   0.003
W           -0.6984    0.4860     -1.44   0.201
H           -0.9365    0.5239     -1.79   0.124

Thus the t-ratio corresponding to H is $t_H = -1.79$ and, since $df = n - k - 1 = 10 - 3 - 1 = 6$, $r^2_{YH.XW} = \frac{t_H^2}{t_H^2 + df} = \frac{1.79^2}{1.79^2 + 6} = 0.348$, which agrees with the .343 found above except for the rounding of $t_H$.
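Both routes to the partial correlation are easy to sketch in Python (illustrative names only, not part of the original notes). With the rounded $t_H = -1.79$, the second route gives 0.348 rather than 0.343, as noted above.

def partial_r2_from_R2(R2_full, R2_reduced):
    # added explanatory power of the new independent variable (section 5)
    return (R2_full - R2_reduced) / (1 - R2_reduced)

def partial_r2_from_t(t_ratio, df):
    # the same quantity recovered from the printed t-ratio, with df = n - k - 1
    return t_ratio ** 2 / (t_ratio ** 2 + df)

print(round(partial_r2_from_R2(0.906, 0.857), 3))   # 0.343, the first method above
print(round(partial_r2_from_t(-1.79, 6), 3))        # 0.348, the second method; it matches 0.343 within the rounding of t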
6. Collinearity

If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, and $X_1$ and $X_2$ are highly correlated, then we have no real variation of $X_1$ relative to $X_2$. This is a condition known as (multi)collinearity. The standard deviations of both $b_1$ and $b_2$ will be large and, in extreme cases, the regression process may break down. Recall that in section I we said that small variation in $x$ can lead to large values of $s_{b_1}$ in simple regression and thus to insignificant values of $b_1$. Similarly, in multiple regression, lack of movement of the independent variables relative to one another leaves the regression process unable to tell which changes in the dependent variable are due to which independent variables. This will be indicated by large values of $s_{b_1}$ or $s_{b_2}$, which cause us to find the coefficients insignificant when we use a t test.

A relatively recent method to check for collinearity is the Variance Inflation Factor, $VIF_j = \frac{1}{1 - R_j^2}$. Here $R_j^2$ is the coefficient of determination gotten by regressing the independent variable $X_j$ against all the other independent variables. The rule of thumb seems to be that we should be suspicious if any $VIF_j > 5$ and positively horrified if $VIF_j > 10$. If you get results like this, drop a variable or change your model. Note that if you use a correlation matrix for your independent variables and see a large correlation between two of them, putting the square of that correlation into the VIF formula gives you a low estimate of the VIF, since the $R_j^2$ that you get from a regression against all the other independent variables will be higher.

Example: Note that in the printout in section 5, the standard deviations of the coefficients of W and H are quite large, resulting in small t-ratios and p-values above .10, which lead us to believe that the coefficients are not even significant at a significance level of 10%. The data from section J4 is repeated below:

Obs    Y    X    W    H
 1     0    0    1    1
 2     2    1    0    0
 3     1    2    1    1
 4     3    1    0    0
 5     1    0    0    0
 6     3    3    0    0
 7     4    4    0    0
 8     2    2    1    0
 9     1    2    1    1
10     2    1    0    0

Computation from these numbers reveals that $\sum W = 4$, $\sum H = 3$, $\sum W^2 = 4$, $\sum H^2 = 3$, $\sum WH = 3$ and $n = 10$. Thus $\bar{W} = 0.4$ and $\bar{H} = 0.3$, so that $\sum W^2 - n\bar{W}^2 = 4 - 10(0.4)^2 = 2.4$, $\sum H^2 - n\bar{H}^2 = 3 - 10(0.3)^2 = 2.1$ and $\sum WH - n\bar{W}\bar{H} = 3 - 10(0.4)(0.3) = 1.8$. Finally,

$r_{WH} = \frac{\sum WH - n\bar{W}\bar{H}}{\sqrt{\left(\sum W^2 - n\bar{W}^2\right)\left(\sum H^2 - n\bar{H}^2\right)}} = \frac{1.8}{\sqrt{2.4(2.1)}} = .8018$,

a relatively high correlation. This and the relatively small sample size account for the large standard deviations and the generally discouraging results.

Though the regression against two independent variables has been shown to be an improvement over the regression against one independent variable, the addition of the third independent variable, in spite of the high $R^2$, was useless. Preliminary use of the Minitab correlation command, as below, might have warned us of the problem.

MTB > Correlation 'X' 'W' 'H'.

Correlations (Pearson)
          X        W
W    -0.068
H    -0.145    0.802

Actually, for this problem the largest VIF, for H, is only about 2.86, but it seems to interact with the small sample size.
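As a final check, the correlation matrix and the VIFs can be computed directly from the data table above. The sketch below uses numpy, which is not part of these notes, and gets each $R_j^2$ from an ordinary least-squares fit; it reproduces the 0.802 correlation between W and H and a largest VIF of about 2.87 for H, in line with the 2.86 quoted above.

import numpy as np

# independent variables from the data table above (columns X, W, H)
X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
W = np.array([1, 0, 1, 0, 0, 0, 0, 1, 1, 0], dtype=float)
H = np.array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0], dtype=float)

print(np.corrcoef([X, W, H]).round(3))   # pairwise correlations; the W-H entry is about 0.802

def vif(target, others):
    # VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing X_j on the other independent variables
    A = np.column_stack([np.ones_like(target)] + others)   # design matrix with an intercept column
    fitted = A @ np.linalg.lstsq(A, target, rcond=None)[0]
    r2 = 1 - ((target - fitted) ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

for name, dep, rest in [('X', X, [W, H]), ('W', W, [X, H]), ('H', H, [X, W])]:
    print(name, round(vif(dep, rest), 2))   # the largest VIF, for H, is about 2.87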