252solnL1 11/26/07

L. CORRELATION
1. Simple Correlation
2. Correlation when x and y are both independent: Problem L1, Text 13.36 (Compute correlation and test correlation for significance!)
3. Tests of Association: Problems L3, L2 (L1, L2)
4. Multiple Correlation
5. Partial Correlation
6. Collinearity: Text 15.16-15.18 (15.19-15.21)
(A printout will be supplied for the last problem – make sure that you understand it.)
----------------------------------------------------------------------------------------------------------------------------------------------

Correlation

Problem L1: Assume that for $n = 49$, $r = .24$. Test for
a. a correlation of zero;
b. a correlation of 0.3.

Solution: Use $\alpha = .05$ throughout.

a) $H_0: \rho = 0$ against $H_1: \rho \neq 0$. The test statistic is $t^{(n-2)} = \frac{r}{s_r}$, where
$s_r = \sqrt{\frac{1-r^2}{n-2}} = \sqrt{\frac{1-.24^2}{49-2}} = \sqrt{.020051} = 0.1416$.
So $t^{(47)} = \frac{.24}{0.1416} = 1.695$. We do not reject $H_0$ if this $t$ lies between $\pm t_{.025}^{(47)} = \pm 2.012$, so we do not reject $H_0$.

b) $H_0: \rho = 0.3$ against $H_1: \rho \neq 0.3$. Use the Fisher $z$-transformation. The test statistic is $t = \frac{\tilde z - \mu_{\tilde z}}{s_{\tilde z}}$, where
$\tilde z = \frac{1}{2}\ln\frac{1+r}{1-r} = \frac{1}{2}\ln\frac{1+.24}{1-.24} = \frac{1}{2}\ln 1.6316 = \frac{1}{2}(0.4895) = 0.2448$,
$\mu_{\tilde z} = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0} = \frac{1}{2}\ln\frac{1+.30}{1-.30} = \frac{1}{2}\ln 1.8571 = \frac{1}{2}(0.6190) = 0.3095$ and
$s_{\tilde z} = \sqrt{\frac{1}{n-3}} = \sqrt{\frac{1}{46}}$.
So $t = \frac{0.2448 - 0.3095}{\sqrt{1/46}} = -0.4389$. We use the same test as in part a, and thus do not reject $H_0$.
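If you want to check these numbers, both tests are easy to reproduce. Below is a minimal Python sketch (assuming scipy is available for the critical value; the variable names are mine, not the text's). It follows the handout in comparing the Fisher z statistic to the same t critical value as in part a; many texts compare it to a standard normal instead, which changes nothing here.

```python
import math
from scipy import stats

n, r = 49, 0.24

# (a) t test of H0: rho = 0
s_r = math.sqrt((1 - r**2) / (n - 2))         # 0.1416
t = r / s_r                                   # 1.695
t_crit = stats.t.ppf(0.975, df=n - 2)         # about 2.012
print(f"(a) t = {t:.3f}, reject H0: {abs(t) > t_crit}")

# (b) Fisher z test of H0: rho = 0.3
rho0 = 0.30
z_r = 0.5 * math.log((1 + r) / (1 - r))        # 0.2448
z_0 = 0.5 * math.log((1 + rho0) / (1 - rho0))  # 0.3095
s_z = math.sqrt(1 / (n - 3))
stat = (z_r - z_0) / s_z                       # about -0.439
print(f"(b) statistic = {stat:.4f}, reject H0: {abs(stat) > t_crit}")
```

Both tests fail to reject, in agreement with the hand computations above.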
Exercise 13.36: Suppose that you are testing the null hypothesis that there is no relationship between x and y. Assume $n = 20$, $SSR = 60$ and $SSE = 40$.
a. What is the value of the F test statistic?
b. At the 5% significance level, what is the critical value of F?
c. Based on the answers to a) and b), what statistical decision should be made?
d. Calculate the correlation coefficient from $R^2$, assuming that the slope $b_1$ is negative.
e. At the 5% significance level, is there significant correlation between x and y?

Solution: Let's set up the ANOVA table. Note that the F test is the equivalent of a test on $b_1$: here $k = 1$, and the null hypothesis is that the regression is useless or, since there is only one independent variable, $H_0: \beta_1 = 0$.

Source           SS    DF   MS    F     F.05
Regression       60     1
Error (Within)   40    18
Total           100    19

The Instructor's Solution Manual says the following.
(a) $MSR = SSR/k = 60/1 = 60$, $MSE = SSE/(n-k-1) = 40/18 = 2.222$, so $F = MSR/MSE = 60/2.222 = 27$.
(b) $F_{.05}(1,18) = 4.41$.
(c) Reject $H_0$ and say that there is evidence that the fitted linear regression model is useful.

Of course, it would probably be easier to just complete the table. Remember that the total degrees of freedom are $n-1$, that the degrees of freedom for regression are $k$, the number of independent variables, that both SS and DF must add up, that MS is SS divided by DF, and that F is MSR divided by MSE. Also remember that you were supposed to know this, but judging by the last exam, you don't.

Source           SS    DF   MS      F       F.05
Regression       60     1   60      27.00   F(1,18) = 4.41 s
Error (Within)   40    18   2.222
Total           100    19

(d) $R^2 = \frac{SSR}{SST} = \frac{60}{100} = 0.6$, so $r = \mathrm{sign}(b_1)\sqrt{R^2} = -\sqrt{0.60} = -.7746$.

(e) $H_0: \rho = 0$ (there is no correlation between X and Y) against $H_1: \rho \neq 0$ (there is correlation between X and Y), with d.f. $= n - 2 = 18$. Decision rule: reject $H_0$ if $|t_{calc}| > t_{.025}^{(18)} = 2.101$. Test statistic: $t^{(n-2)} = \frac{r}{s_r}$, where $s_r = \sqrt{\frac{1-r^2}{n-2}} = \sqrt{\frac{1-.6}{18}} = .1491$, so $t = \frac{-.7746}{.1491} = -5.196$. Since $t_{calc} = -5.196$ is below the lower critical bound of $-2.1009$, reject $H_0$ and say that there is enough evidence to conclude that there is a significant correlation between x and y.
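Here is a short Python sketch of this exercise (again assuming scipy; the names are mine). It rebuilds the ANOVA table from SSR and SSE and then runs the t test on r, which is equivalent to the F test in simple regression because $t^2 = F$ (here $5.196^2 = 27$).

```python
from scipy import stats

n, k = 20, 1
SSR, SSE = 60.0, 40.0
SST = SSR + SSE                            # 100

MSR, MSE = SSR / k, SSE / (n - k - 1)      # 60 and 2.222
F = MSR / MSE                              # 27
F_crit = stats.f.ppf(0.95, k, n - k - 1)   # about 4.41
print(f"F = {F:.2f}, F.05 = {F_crit:.2f}, reject H0: {F > F_crit}")

R2 = SSR / SST                             # 0.6
r = -(R2 ** 0.5)                           # -.7746 since b1 < 0
t = r / (((1 - R2) / (n - 2)) ** 0.5)      # -5.196
t_crit = stats.t.ppf(0.975, df=n - 2)      # about 2.101
print(f"t = {t:.3f}, reject H0: {abs(t) > t_crit}")
print(f"t squared = {t**2:.2f}, which equals F")
```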
Tests of Association

Problem L2: The following are the rankings of 3 judges.

Swimmer  Judge A  Judge B  Judge C
   1        2        1        2
   2        1        2        1
   3        3        3        4
   4        4        4        3
   5        5        5        5

Is there significant agreement?

Solution: $H_0$: Disagreement against $H_1$: Agreement. Use Kendall's coefficient of concordance with $\alpha = .05$. We must rank the data within rows; this has already been done in this case. Now take column sums.

n = 5, k = 3
Swimmer    1    2    3    4    5
Judge A    2    1    3    4    5
Judge B    1    2    3    4    5
Judge C    2    1    4    3    5
SR         5    4   10   11   15    (sum = 45)
SR²       25   16  100  121  225    (sum = 487)

To check the sum of ranks, note that the sum of 1 through n, repeated through k rows, is $k\frac{n(n+1)}{2} = 3\frac{5(6)}{2} = 45$.

$\overline{SR} = \frac{\sum SR}{n} = \frac{45}{5} = 9$ and $S = \sum SR^2 - n\,\overline{SR}^2 = 487 - 5(9)^2 = 82$.

Table 12 for $n = 5$, $k = 3$ says that the 5% critical value is 64.4. Since our value of $S$ is above 64.4, reject $H_0$.

Note: to compute W, divide S by $\frac{1}{12}k^2(n^3 - n) = \frac{1}{12}(3^2)(5^3 - 5) = 90$. Since $S = 82$, $W = \frac{82}{90} = 0.9111$.

Also note: if $n > 7$, use $\chi^2 = k(n-1)W$, which has $n-1$ degrees of freedom.

Also note: assume that we get perfect agreement. Then our table would be as below.

n = 5, k = 3
Swimmer    1    2    3    4    5
Judge A    2    1    3    4    5
Judge B    2    1    3    4    5
Judge C    2    1    3    4    5
SR         6    3    9   12   15    (sum = 45)
SR²       36    9   81  144  225    (sum = 495)

Then $S = \sum SR^2 - n\,\overline{SR}^2 = 495 - 5(9)^2 = 90$ and $W = \frac{90}{90} = 1.000$.

Problem L3: In order to validate an aptitude test, a random sample of 15 salespersons is selected by an agency and their scores on the test are compared with their sales during their first year. Scores are as follows:

Row  Score  Sales
  1   71.0    225
  2   87.5    244
  3   69.0    218
  4   86.0    246
  5   70.0    205
  6   84.0    243
  7   88.0    249
  8   92.0    251
  9   97.0    250
 10   95.0    250
 11   85.0    245
 12   81.0    238
 13   87.0    248
 14   82.0    234
 15   79.0    237

The correlation is .911, but the statistician believes that a rank correlation is more appropriate. Calculate a rank correlation and test it for significance. Try to explain why the rank correlation is higher than the correlation.

Solution: Minitab output follows; the second correlation computed is actually the rank correlation. Assuming that $\alpha = .05$, since both p-values are below the significance level, we can reject our null hypothesis (below) and say that the correlation is significant. The fact that the rank correlation is higher than the Pearson correlation may be due to some curvature in the relationship between the original numbers; the Pearson correlation checks for straight-line relationships.

Pearson correlation of test and sales = 0.911
P-Value = 0.000

MTB > rank c1 c4
MTB > rank c6 c7
MTB > corr c4 c7

Correlations: testr, saler

Pearson correlation of testr and saler = 0.953
P-Value = 0.000

MTB > print c1 c4 c6 c7

Data Display

Row  test  testr  sales  saler
  1  71.0     3    225    3.0
  2  87.5    11    244    8.0
  3  69.0     1    218    2.0
  4  86.0     9    246   10.0
  5  70.0     2    205    1.0
  6  84.0     7    243    7.0
  7  88.0    12    249   12.0
  8  92.0    13    251   15.0
  9  97.0    15    250   13.5
 10  95.0    14    250   13.5
 11  85.0     8    245    9.0
 12  81.0     5    238    6.0
 13  87.0    10    248   11.0
 14  82.0     6    234    4.0
 15  79.0     4    237    5.0

I guess that it's time to do this by hand. First we rank the data as above; then we compute the differences between the ranks, $d = r_1 - r_2$, and square them. The hypotheses are $H_0: \rho_s = 0$ and $H_1: \rho_s \neq 0$.

Row    x1     x2    r1     r2      d     d²
  1   71.0   225     3    3.0    0.0    0
  2   87.5   244    11    8.0    3.0    9
  3   69.0   218     1    2.0   -1.0    1
  4   86.0   246     9   10.0   -1.0    1
  5   70.0   205     2    1.0    1.0    1
  6   84.0   243     7    7.0    0.0    0
  7   88.0   249    12   12.0    0.0    0
  8   92.0   251    13   15.0   -2.0    4
  9   97.0   250    15   13.5    1.5    2.25
 10   95.0   250    14   13.5    0.5    0.25
 11   85.0   245     8    9.0   -1.0    1
 12   81.0   238     5    6.0   -1.0    1
 13   87.0   248    10   11.0   -1.0    1
 14   82.0   234     6    4.0    2.0    4
 15   79.0   237     4    5.0   -1.0    1
Total                            0.0   26.50

$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6(26.5)}{15(225-1)} = .9527$.

For a 2-sided test use the .025 value for $n = 15$, which is .5179. We reject the null hypothesis if $r_s$ is above .5179 or below -.5179. In this case we reject the null hypothesis and say that the rank correlation is significant. The formula used here is probably a little high because of the ties.

I have left two exercises from last year to give you some more practice with rank correlations. The data and hypotheses should be self-explanatory.

Exercise 15.46 (McClave et al.): Put the data in columns and rank them within each column; $d = r_1 - r_2$. The hypotheses are $H_0: \rho_s = 0$ and $H_1: \rho_s \neq 0$.

 x1    r1    x2    r2      d     d²
  0    3      0   1.5    1.5   2.25
  3    5.5    2   5      0.5   0.25
  0    3      2   5     -2.0   4.00
 -4    1      0   1.5   -0.5   0.25
  3    5.5    3   7     -1.5   2.25
  0    3      1   3      0.0   0
  4    7      2   5      2.0   4.00
                         0.0  13.00

$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6(13)}{7(49-1)} = .7679$.

For a 2-sided test use the .025 value for $n = 7$, which is .7450. We reject the null hypothesis if $r_s$ is above .7450 or below -.7450. In this case we reject the null hypothesis and say that the rank correlation is significant. The formula used here is probably a little high because of the ties.

Exercise 15.48 (McClave et al.): This one is a right-side test: $H_0: \rho_s \leq 0$ and $H_1: \rho_s > 0$.

  x1    r1     x2    r2     d    d²
 643    11   2617    11    0    0
 381    10   1724     9    1    1
 342     9   1867    10   -1    1
 251     8   1238     7    1    1
 216     7    890     5    2    4
 208     6    681     4    2    4
 192     5   1534     8   -3    9
 141     4    899     6   -2    4
 131     3    492     1    2    4
 128     2    579     2    0    0
 124     1    672     3   -2    4
                           0   32

$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6(32)}{11(121-1)} = .8545$.

If we use Table 13, the 5% critical value is .5273. Since this is a right-side test, reject the null hypothesis if $r_s$ is above the critical value. Conclude that the number of parent companies is related to the number of subsidiaries.
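To close out this section, here are two small Python sketches of its computations. The first reproduces the Kendall coefficient of concordance from Problem L2 above; it is plain Python (no libraries), and the layout of the `judges` list, one row per judge, is my own choice.

```python
# Ranks given by each judge to swimmers 1..5 (Problem L2 above)
judges = [
    [2, 1, 3, 4, 5],   # Judge A
    [1, 2, 3, 4, 5],   # Judge B
    [2, 1, 4, 3, 5],   # Judge C
]
k, n = len(judges), len(judges[0])       # 3 judges, 5 swimmers

SR = [sum(col) for col in zip(*judges)]  # column sums: [5, 4, 10, 11, 15]
mean_SR = sum(SR) / n                    # 9
S = sum((s - mean_SR) ** 2 for s in SR)  # 82
W = S / (k**2 * (n**3 - n) / 12)         # 82/90 = 0.9111
print(SR, S, W)
# For n > 7 the handout would use chi2 = k*(n-1)*W with n-1 d.f.
```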
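The second sketch covers the rank correlations (assuming scipy). `rankdata` assigns average ranks to ties, matching the hand tables above; `spearmanr` computes the Pearson correlation of those ranks, which is how Minitab got 0.953 and which, as noted above, differs slightly from the shortcut formula when there are ties.

```python
from scipy import stats

test  = [71, 87.5, 69, 86, 70, 84, 88, 92, 97, 95, 85, 81, 87, 82, 79]
sales = [225, 244, 218, 246, 205, 243, 249, 251, 250, 250,
         245, 238, 248, 234, 237]
n = len(test)

# Shortcut formula with average ranks for ties (the hand computation)
r1, r2 = stats.rankdata(test), stats.rankdata(sales)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 26.5
rs_hand = 1 - 6 * d2 / (n * (n**2 - 1))          # .9527
print(rs_hand)

# Pearson correlation of the ranks (what Minitab reported as 0.953)
rs, p = stats.spearmanr(test, sales)
print(rs, p)
```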
Collinearity

Exercise 15.18 [15.16 in 9th]: If the r-squared between 2 independent variables is 0.2, what is the VIF?

Solution: The text recommends the use of the Variance Inflation Factor, $VIF_j = \frac{1}{1-R_j^2}$. Here $R_j^2$ is the coefficient of determination gotten by regressing the independent variable $X_j$ against all the other independent variables ($X$s). The rule of thumb seems to be that we should be suspicious if any $VIF_j \geq 5$ and positively horrified if $VIF_j \geq 10$. If you get results like this, drop a variable or change your model. In this problem $R_j^2 = .20$, so $VIF = \frac{1}{1-0.2} = 1.25$ and we don't need to worry.

Exercise 15.19 [15.17 in 9th]: If the r-squared between 2 independent variables is 0.5, what is the VIF?

Solution: $R_j^2 = .50$, so $VIF = \frac{1}{1-0.5} = 2.0$. What? Me worry?

Exercise 15.20 [15.18 in 9th]: In the WARECOST problem (14.4) find the VIF. Can we suspect collinearity?

Solution: I haven't verified this on Minitab, but it seems that since there are only 2 independent variables, the coefficient we want is the square of their correlation, so that both $R^2$s are the same: $R_1^2 = 0.64$ gives $VIF_1 = \frac{1}{1-0.64} = 2.778$, and $R_2^2 = 0.64$ gives $VIF_2 = \frac{1}{1-0.64} = 2.778$. There is no reason to suspect the existence of collinearity.

Note: the printout mentioned here is the printout for problem 14.4 in 252solnJ1. It includes Minitab's explanation of VIF.

Exercise 11.101 (McClave et al.): We are fitting $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5$ in a situation where the correlation matrix is

       x1     x2     x3     x4     x5
x1   1.00    .17    .02    .23    .19
x2    .17   1.00    .45    .93    .02
x3    .02    .45   1.00    .22    .01
x4    .23    .93    .22   1.00    .86
x5    .19    .02    .01    .86   1.00

In other words, the correlation between $x_2$ and $x_4$ is .93. Since this correlation and the correlation between $x_4$ and $x_5$ are so high, we can expect problems due to collinearity; that is, because of the lack of relative movement between these pairs of variables, it will be hard to decide what changes in Y to attribute to each. Probably at least one of the highly correlated independent variables should be dropped. Let's think about this: if we were doing a regression of $x_4$ against $x_2$ alone, and used these two variables only as explanatory (independent) variables, we would get $VIF = \frac{1}{1-.93^2} = \frac{1}{1-.8649} = \frac{1}{.1351} = 7.402$; but, in fact, things are far worse, since the R-squared that we would get if we did a regression of one of these two variables against all the others would probably be considerably higher. If, for example, it went to .90, the VIF would go to 10.
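Since the whole section turns on the formula $VIF = 1/(1-R_j^2)$, a few lines of Python make the rule of thumb concrete. This sketch reproduces every VIF computed in this section, plus the hypothetical .90 case from exercise 11.101, and shows how the factor blows up as $R_j^2$ approaches 1.

```python
def vif(r2):
    """Variance Inflation Factor for a predictor whose R-squared
    against the other predictors is r2."""
    return 1 / (1 - r2)

# R-squared values from this section; .8649 is .93 squared,
# and .90 is the hypothetical worst case from exercise 11.101
for r2 in (0.20, 0.50, 0.64, 0.8649, 0.90):
    print(f"R2j = {r2:.4f}  VIF = {vif(r2):.3f}")
# VIFs: 1.250, 2.000, 2.778, 7.402, 10.000; suspicious at 5, horrified at 10
```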