252solnL1 11/26/07 (Open this document in 'Page Layout' view!)
L. CORRELATION
1. Simple Correlation
2. Correlation when x and y are both independent
Problem L1, Text 13.36 (Compute correlation and test correlation for significance!)
3. Tests of Association
Problem L3, L2 (L1, L2)
4. Multiple Correlation
5. Partial Correlation
6. Collinearity
Text 15.16-15.18 (15.19-15.21) (A printout will be supplied for the last problem – make sure that you understand it.)
----------------------------------------------------------------------------------------------------------------------------------------------
Correlation
Problem L1: Assume that for n = 49, r = .24. Test for
a. Correlation of zero
b. Correlation of 0.3
Solution: α = .05.
a.) H₀: ρ = 0, H₁: ρ ≠ 0. The test statistic is t(n−2) = r/s_r, where
s_r = √[(1 − r²)/(n − 2)] = √[(1 − .24²)/(49 − 2)] = √.020051 = 0.1416. So
t(47) = .24/0.1416 = 1.695. We do not reject H₀ if this t lies between ±2.012 (t.025 with 47 degrees of freedom). So we do not reject H₀.
b.) H₀: ρ = 0.3, H₁: ρ ≠ 0.3. The test statistic is t(n−2) = (z* − μ_z*)/s_z*, where
z* = ½ ln[(1 + r)/(1 − r)] = ½ ln[(1 + .24)/(1 − .24)] = ½ ln 1.6316 = ½(0.4895) = 0.2448,
μ_z* = ½ ln[(1 + .30)/(1 − .30)] = ½ ln 1.8571 = ½(0.6190) = 0.3095 and
s_z* = √[1/(n − 3)] = √(1/46) = 0.1474. So
t(47) = (0.2448 − 0.3095)/0.1474 = −0.4389. We use the same test as in part a, and thus do not reject H₀.
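As a quick arithmetic check (mine, not part of the original solution), both tests can be reproduced in a few lines of Python using only the standard `math` module:

```python
import math

n, r = 49, 0.24

# Part a: t test of H0: rho = 0
s_r = math.sqrt((1 - r**2) / (n - 2))           # standard error of r, about 0.1416
t_a = r / s_r                                   # about 1.695, inside +/-2.012

# Part b: Fisher z-transformation test of H0: rho = 0.30
z_star = 0.5 * math.log((1 + r) / (1 - r))      # about 0.2448
mu_z = 0.5 * math.log((1 + 0.30) / (1 - 0.30))  # about 0.3095
s_z = 1 / math.sqrt(n - 3)                      # about 0.1474
t_b = (z_star - mu_z) / s_z                     # about -0.4389

print(round(t_a, 3), round(t_b, 4))
```

Neither statistic falls outside the ±2.012 bounds, so H₀ is not rejected in either part, matching the hand computation.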
Exercise 13.36: Suppose that you are testing the null hypothesis that there is no relationship between x and y. Assume n = 20 and that SSR = 60 and SSE = 40.
a. What is the value of the F test statistic?
b. At the 5% significance level what is the critical value of F?
c. Based on the answers to a) and b) what statistical decision should be made?
d. Calculate the correlation coefficient from R² by assuming that the slope b₁ is negative.
e. At the 5% significance level, is there significant correlation between x and y?
Solution: Let's set up the ANOVA table. Note that the F test is the equivalent of a test on b₁. k = 1, and the null hypothesis is that the regression is useless or, if there is only one independent variable, H₀: b₁ = 0.
Source           SS    DF    MS      F      F.05
Regression       60     1
Error (Within)   40    18
Total           100    19
The Instructor's Solution Manual says the following.
(a) MSR = SSR/k = 60/1 = 60
    MSE = SSE/(n − k − 1) = 40/18 = 2.222
    F = MSR/MSE = 60/2.222 = 27
(b) F(1,18) = 4.41
(c) Reject H₀ and say that there is evidence that the fitted linear regression model is useful.
Of course, it would probably be easier to just complete the table. Remember that total degrees of freedom are n − 1, that the degrees of freedom for regression are k, the number of independent variables, and that both SS and DF must add up. MS is SS divided by DF, and F is MSR divided by MSE. Also remember that you were supposed to know this, but judging by the last exam, you don't.
Source           SS    DF    MS      F       F.05
Regression       60     1    60     27.00    F(1,18) = 4.41 s
Error (Within)   40    18    2.222
Total           100    19

(d) R² = SSR/SST = 60/100 = 0.6, so r = sign(b₁)·√R² = −√0.60 = −.7746.
H 0 :   0

H 1 :   0
t n 2  
r
sr
where
1  .24 2
1 r 2

 .020051  0.1416 . So
n2
49  2
t n  2  
r
.24
47

 1.695 . We do not reject H 0 if this t lies between  t .025
 2.012 . So we do
s r 0.1416
not reject H 0 .
H 0:   0
There is no correlation between X and Y.
H 1:   0
There is correlation between X and Y.
d.f. = n  2  18. Decision rule: Reject H 0 if tcal > t n 2  = 2.101.
2
r
Test statistic: t 

sr
r
.7746

 5.196 .
1  .6
1 r 2
18
n2
Since t cal  5.196 is below the lower critical bound of –2.1009, reject H 0 and say that
there is enough evidence to conclude that there is a significant correlation between x and
y.
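A short Python sketch (my own check, not the text's) confirms the arithmetic, and also the fact that with one independent variable F is just t²:

```python
import math

SSR, SSE, n, k = 60.0, 40.0, 20, 1

MSR = SSR / k                        # 60
MSE = SSE / (n - k - 1)              # 40/18, about 2.222
F = MSR / MSE                        # 27.0; compare F.05(1,18) = 4.41

R2 = SSR / (SSR + SSE)               # 0.6
r = -math.sqrt(R2)                   # slope b1 is negative, so r = -.7746
s_r = math.sqrt((1 - R2) / (n - 2))  # about 0.1491
t = r / s_r                          # about -5.196; note t**2 equals F

print(round(F, 2), round(r, 4), round(t, 3))
```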
Tests of Association
Problem L2: The following are the rankings of 5 swimmers by 3 judges.
Swimmer   Judge A   Judge B   Judge C
   1         2         1         2
   2         1         2         1
   3         3         3         4
   4         4         4         3
   5         5         5         5
Is there significant agreement?
Solution: H₀: Disagreement; H₁: Agreement. Use Kendall's test of concordance with α = .05. We must rank the data within rows. This has already been done in this case. Now take column sums.
Swimmer    1    2     3     4     5      n = 5, k = 3
Judge A    2    1     3     4     5
Judge B    1    2     3     4     5
Judge C    2    1     4     3     5
SR         5    4    10    11    15      ΣSR = 5 + 4 + 10 + 11 + 15 = 45
SR²       25   16   100   121   225      ΣSR² = 25 + 16 + 100 + 121 + 225 = 487
To check the sum of ranks, note that the sum of 1 through n, repeated through k rows, is k·n(n + 1)/2 = 3(5·6)/2 = 45. The mean column sum is then SR̄ = ΣSR/n = 45/5 = 9, and S = ΣSR² − n·SR̄² = 487 − 5(9²) = 82. Table 12 for n = 5, k = 3 says that the 5% critical value is 64.4. Since our value is above 64.4, reject H₀.
Note: to compute W, divide S by (1/12)k²(n³ − n) = (1/12)(3²)(5³ − 5) = 90. Since S = 82, W = 82/90 = 0.9111.
Also note: If n > 7, use χ² = k(n − 1)W, which has n − 1 degrees of freedom.
Also note: Assume that we get perfect agreement. Then our table would be as below.
Swimmer    1    2    3     4     5      n = 5, k = 3
Judge A    2    1    3     4     5
Judge B    2    1    3     4     5
Judge C    2    1    3     4     5
SR         6    3    9    12    15      ΣSR = 6 + 3 + 9 + 12 + 15 = 45
SR²       36    9   81   144   225      ΣSR² = 36 + 9 + 81 + 144 + 225 = 495
Then S = ΣSR² − n·SR̄² = 495 − 5(9²) = 90 and W = 90/90 = 1.000.
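The concordance computation can be sketched in a few lines of Python (my own check, using the ranks from the table above):

```python
# Kendall's W for three judges' rankings (rows) of five swimmers (columns)
ranks = [
    [2, 1, 3, 4, 5],   # Judge A
    [1, 2, 3, 4, 5],   # Judge B
    [2, 1, 4, 3, 5],   # Judge C
]
k, n = len(ranks), len(ranks[0])                        # k = 3 judges, n = 5 swimmers

SR = [sum(row[i] for row in ranks) for i in range(n)]   # column sums: 5, 4, 10, 11, 15
mean_SR = sum(SR) / n                                   # 9
S = sum(x * x for x in SR) - n * mean_SR**2             # 487 - 5(81) = 82
W = S / (k**2 * (n**3 - n) / 12)                        # 82/90, about 0.9111

print(S, round(W, 4))
```

Since S = 82 exceeds the 5% critical value of 64.4 from Table 12, H₀ (disagreement) is rejected, as in the hand computation.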
Problem L3: In order to validate an aptitude test, a random sample of 15 salespersons is selected by an
agency and their scores on the test are compared with their sales during their first year. Scores are as
follows:
Row   Score   Sales
 1     71.0     225
 2     87.5     244
 3     69.0     218
 4     86.0     246
 5     70.0     205
 6     84.0     243
 7     88.0     249
 8     92.0     251
 9     97.0     250
10     95.0     250
11     85.0     245
12     81.0     238
13     87.0     248
14     82.0     234
15     79.0     237
The correlation is .911, but the statistician believes that a rank correlation is more appropriate. Calculate a
rank correlation, and test it for significance. Try to explain why the rank correlation is higher than the
correlation.
Solution:
Minitab output follows. The second correlation computed is actually the rank correlation. Assuming that
  .05 , since both the p-values are below the significance level, we can reject our null hypothesis (below)
and say that the correlation is significant. The fact that the rank correlation is higher than the Pearson
correlation may be due to the fact that there is some curvature in the relationship between the original
numbers. The Pearson correlation checks for straight line relationships.
Pearson correlation of test and sales = 0.911
P-Value = 0.000
MTB > corr c4 c7
Correlations: testr, saler
Pearson correlation of testr and saler = 0.953
P-Value = 0.000
MTB > print c1 c6

Data Display

Row   test   sales
  1   71.0     225
  2   87.5     244
  3   69.0     218
  4   86.0     246
  5   70.0     205
  6   84.0     243
  7   88.0     249
  8   92.0     251
  9   97.0     250
 10   95.0     250
 11   85.0     245
 12   81.0     238
 13   87.0     248
 14   82.0     234
 15   79.0     237

MTB > rank c1 c4
MTB > rank c4 c7
MTB > print c1 c4 c6 c7
Data Display

Row   test   testr   sales   saler
  1   71.0     3       225     3.0
  2   87.5    11       244     8.0
  3   69.0     1       218     2.0
  4   86.0     9       246    10.0
  5   70.0     2       205     1.0
  6   84.0     7       243     7.0
  7   88.0    12       249    12.0
  8   92.0    13       251    15.0
  9   97.0    15       250    13.5
 10   95.0    14       250    13.5
 11   85.0     8       245     9.0
 12   81.0     5       238     6.0
 13   87.0    10       248    11.0
 14   82.0     6       234     4.0
 15   79.0     4       237     5.0
I guess that it's time to do this by hand. First we rank the data as above. We then compute the squared differences between the ranks, d = r₁ − r₂. The hypotheses are H₀: ρ_s = 0, H₁: ρ_s ≠ 0.
Row    x₁     x₂    r₁     r₂      d      d²
 1    71.0   225     3     3.0     0       0
 2    87.5   244    11     8.0     3       9
 3    69.0   218     1     2.0    -1       1
 4    86.0   246     9    10.0    -1       1
 5    70.0   205     2     1.0     1       1
 6    84.0   243     7     7.0     0       0
 7    88.0   249    12    12.0     0       0
 8    92.0   251    13    15.0    -2       4
 9    97.0   250    15    13.5    1.5    2.25
10    95.0   250    14    13.5    0.5    0.25
11    85.0   245     8     9.0    -1       1
12    81.0   238     5     6.0    -1       1
13    87.0   248    10    11.0    -1       1
14    82.0   234     6     4.0     2       4
15    79.0   237     4     5.0    -1       1
                          Total   0.0   26.50

r_s = 1 − 6Σd²/[n(n² − 1)] = 1 − 6(26.5)/[15(225 − 1)] = .9527. For a 2-sided test use the .025 value for n = 15, which is .5179. We reject the null hypothesis if r_s is above .5179 or below −.5179. In this case we reject the null hypothesis and say that the rank correlation is significant. The formula used here is probably a little high because of the ties.
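For comparison, here is a small Python sketch of the same Spearman computation; the `midranks` helper is my own (not from the text) and assigns average ranks to ties, as Minitab did for the two sales of 250:

```python
def midranks(xs):
    """Rank from 1 (smallest) to n (largest), giving ties their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j to the end of a run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks

score = [71, 87.5, 69, 86, 70, 84, 88, 92, 97, 95, 85, 81, 87, 82, 79]
sales = [225, 244, 218, 246, 205, 243, 249, 251, 250, 250, 245, 238, 248, 234, 237]

r1, r2 = midranks(score), midranks(sales)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 26.5
n = len(score)
rs = 1 - 6 * d2 / (n * (n * n - 1))              # about 0.9527

print(d2, round(rs, 4))
```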
I have left two exercises from last year to give you some more practice with rank correlations. The data and hypotheses should be self-explanatory.
Exercise 15.46 (McClave et al.): Put the data in columns and rank them within the column. d = r₁ − r₂. The hypotheses are H₀: ρ_s = 0, H₁: ρ_s ≠ 0.
 x₁    r₁    x₂    r₂      d      d²
  0     3     0    1.5    1.5    2.25
  3    5.5    2    5      0.5    0.25
  0     3     2    5     -2      4.00
 -4     1     0    1.5   -0.5    0.25
  3    5.5    3    7     -1.5    2.25
  0     3     1    3      0      0
  4     7     2    5      2      4.00
                  Total   0     13.00

r_s = 1 − 6Σd²/[n(n² − 1)] = 1 − 6(13)/[7(49 − 1)] = .7679. For a 2-sided test use the .025 value for n = 7, which is .7450. We reject the null hypothesis if r_s is above .7450 or below −.7450. In this case we reject the null hypothesis and say that the rank correlation is significant. The formula used here is probably a little high because of the ties.
Exercise 15.48 (McClave et al.): H₀: ρ_s = 0, H₁: ρ_s > 0.
 x₁    r₁    x₂     r₂     d     d²
643    11   2617    11     0     0
381    10   1724     9     1     1
342     9   1867    10    -1     1
251     8   1238     7     1     1
216     7    890     5     2     4
208     6    681     4     2     4
192     5   1534     8    -3     9
141     4    899     6    -2     4
131     3    492     1     2     4
128     2    579     2     0     0
124     1    672     3    -2     4
                  Total    0    32

r_s = 1 − 6Σd²/[n(n² − 1)] = 1 − 6(32)/[11(121 − 1)] = .8545. If we use Table 13, the 5% critical value is .5273. Since this is a right-side test, reject the null hypothesis if r_s is above the critical value. Conclude that the number of parent companies is related to the number of subsidiaries.
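Since this data set has no ties, a simpler ranking step suffices; here is a quick Python check (mine, not the text's) of the .8545 value:

```python
parents = [643, 381, 342, 251, 216, 208, 192, 141, 131, 128, 124]
subs = [2617, 1724, 1867, 1238, 890, 681, 1534, 899, 492, 579, 672]

def ranks(xs):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(xs)
    return [ordered.index(x) + 1 for x in xs]

r1, r2 = ranks(parents), ranks(subs)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))   # 32
n = len(parents)
rs = 1 - 6 * d2 / (n * (n * n - 1))              # about 0.8545

print(d2, round(rs, 4))
```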
Collinearity
Exercise 15.18 [15.16 in 9th]: If the r-squared between 2 independent variables is 0.2, what is the VIF?
Solution: The text recommends the use of the Variance Inflation Factor, VIF_j = 1/(1 − R²_j). Here R²_j is the coefficient of multiple correlation gotten by regressing the independent variable X_j against all the other independent variables (the other Xs). The rule of thumb seems to be that we should be suspicious if any VIF_j > 5 and positively horrified if VIF_j > 10. If you get results like this, drop a variable or change your model. In this problem R²_j = .20, so VIF = 1/(1 − 0.2) = 1.25 and we don't need to worry.
Exercise 15.19 [15.17 in 9th]: If the r-squared between 2 independent variables is 0.5, what is the VIF?
Solution: R²_j = .50, so VIF = 1/(1 − 0.5) = 2.0. What? Me worry?
Exercise 15.20 [15.18 in 9th]: In the WARECOST problem (14.4) find the VIF – Can we suspect
collinearity?
Solution: I haven't verified this on Minitab, but it seems that since there are only 2 independent variables, the coefficient we want is the square of their correlation, so that both R²s are the same. R₁² = 0.64, VIF₁ = 1/(1 − 0.64) = 2.778 and R₂² = 0.64, VIF₂ = 1/(1 − 0.64) = 2.778.
There is no reason to suspect the existence of collinearity. Note – The printout mentioned here is the
printout for problem 14.4 in 252solnJ1. It includes Minitab’s explanation of VIF.
Exercise 11.101 (McClave et al.): We are fitting Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + ε in a situation where the correlation matrix is

        x₁     x₂     x₃     x₄     x₅
x₁     1      .17    .02   −.23    .19
x₂    .17     1      .45    .93    .02
x₃    .02    .45     1      .22   −.01
x₄   −.23    .93    .22     1      .86
x₅    .19    .02   −.01    .86     1

In other words, the
correlation between x 2 and x 4 is .93. Since this correlation and the correlation between x 4 and x5 are so
high, we can expect problems due to collinearity - that is, because of the lack of relative movement
between these pairs of variables, it will be hard to decide what changes in Y to attribute to each. Probably
at least one of the highly correlated independent variables should be dropped. Let's think about this. If we were doing a regression of x₄ against x₂ alone, and used these two variables only as explanatory (independent) variables, we would get VIF = 1/(1 − .93²) = 1/(1 − .8649) = 1/.1351 = 7.402, but, in fact, things are far worse, since the R-squared that we would get if we did a regression of one of these two variables against all the others would probably be considerably higher. If, for example, it went to .90, the VIF would go to 10.
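All of the VIF arithmetic in this section reduces to one formula; a minimal Python sketch (my own) covering each case:

```python
def vif(r_squared):
    """Variance Inflation Factor from the R-squared obtained by
    regressing one independent variable on all the others."""
    return 1.0 / (1.0 - r_squared)

print(vif(0.20))       # Exercise 15.18: 1.25
print(vif(0.50))       # Exercise 15.19: 2.0
print(vif(0.64))       # Exercise 15.20: about 2.778
print(vif(0.93 ** 2))  # Exercise 11.101, pairwise r = .93: about 7.402
```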