252corr    1/22/07
L. CORRELATION
1. Simple Correlation
The simple sample correlation coefficient is

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$$

or, if the spare parts $S_{xy} = \sum XY - n\bar{X}\bar{Y}$, $SS_x = \sum X^2 - n\bar{X}^2$ and $SS_y = \sum Y^2 - n\bar{Y}^2$ are available, we can say

$$r = \frac{S_{xy}}{\sqrt{SS_x\,SS_y}}.$$
Of course, since the coefficient of determination is $R^2$, $r^2 = R^2$ and it is often easier to compute $r^2 = \dfrac{S_{xy}^2}{SS_x\,SS_y}$ and to give the correlation the sign of $S_{xy}$. But note that the correlation can range from +1 to -1, while the coefficient of determination can only range from 0 to 1. Also note that since the slope in simple regression is $b_1 = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$, we have $b_1^2 = \dfrac{s_y^2}{s_x^2}R^2$, $R^2 = b_1^2\dfrac{s_x^2}{s_y^2}$, or $b_1 = \dfrac{s_y}{s_x}r$. The last equation has a counterpart in $\beta_1 = \dfrac{\sigma_y}{\sigma_x}\rho$, where $\rho$ is the population correlation coefficient, so that testing $H_0\!:\beta_1 = 0$ is equivalent to testing $H_0\!:\rho = 0$, and the simple regression coefficient and the correlation will have the same sign.
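As a quick illustration of the 'spare parts' route, here is a minimal Python sketch; the data are invented for the example.

    from math import sqrt

    x = [1, 2, 3, 4, 5]
    y = [2, 1, 4, 3, 5]           # made-up data
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n

    S_xy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    SS_x = sum(a * a for a in x) - n * xbar**2
    SS_y = sum(b * b for b in y) - n * ybar**2

    r = S_xy / sqrt(SS_x * SS_y)
    b1 = S_xy / SS_x              # simple-regression slope; b1 and r share a sign
    print(r, b1)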
2. Correlation when x and y are both independent
If we want to test $H_0\!:\rho_{xy} = 0$ against $H_1\!:\rho_{xy} \ne 0$ and $x$ and $y$ are normally distributed, we use

$$t^{(n-2)} = \frac{r}{s_r} = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}.$$

But note that if we are testing $H_0\!:\rho_{xy} = \rho_0$ against $H_1\!:\rho_{xy} \ne \rho_0$, and $\rho_0 \ne 0$, the test is quite different. We need to use Fisher's z-transformation. Let $\tilde{z} = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\dfrac{1}{n-3}}$, so that $t^{(n-2)} = \dfrac{\tilde{z}-\mu_z}{s_z}$.
(Note: To get ln, the natural log, compute the log to the base 10 and divide by .434294482.)
Example: Test $H_0\!:\rho_{xy} = 0$ against $H_1\!:\rho_{xy} \ne 0$ when $n = 10$, $r = .704$ and $r^2 = .496$ $(\alpha = .05)$.

To solve this we first compute $s_r = \sqrt{\dfrac{1-r^2}{n-2}} = \sqrt{\dfrac{1-.496}{10-2}}$ and $t = \dfrac{r}{s_r} = \dfrac{.704}{\sqrt{\dfrac{1-.496}{8}}} = 2.805$. Compare this with $\pm t^{(n-2)}_{.025} = \pm t^{(8)}_{.025} = \pm 2.306$. Since this is not between these two values of $t$, reject the null hypothesis. (Note that $\left[t^{(n-2)}\right]^2 = F^{(1,\,n-2)}$, so that this is equivalent to an F test on $\beta_1$ in a regression.)
Example: Test $H_0\!:\rho_{xy} = 0.8$ against $H_1\!:\rho_{xy} \ne 0.8$ when $n = 10$, $r = .704$ and $r^2 = .496$ $(\alpha = .05)$.

This time compute Fisher's z-transformation (because $\rho_0$ is not zero):

$$\tilde{z} = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\ln\!\left(\frac{1+.704}{1-.704}\right) = \frac{1}{2}\ln\!\left(\frac{1.704}{0.296}\right) = \frac{1}{2}\ln(5.75676) = \frac{1}{2}(1.75037) = 0.87519$$

$$\mu_z = \frac{1}{2}\ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) = \frac{1}{2}\ln\!\left(\frac{1+.8}{1-.8}\right) = \frac{1}{2}\ln\!\left(\frac{1.8}{0.2}\right) = \frac{1}{2}\ln(9.0000) = \frac{1}{2}(2.19722) = 1.09861$$

$$s_z = \sqrt{\frac{1}{n-3}} = \sqrt{\frac{1}{10-3}} = \sqrt{\frac{1}{7}} = 0.37796.$$

Finally $t = \dfrac{\tilde{z}-\mu_z}{s_z} = \dfrac{0.87519 - 1.09861}{0.37796} = -0.591$. Compare this with $\pm t^{(8)}_{.025} = \pm 2.306$. Since $-0.591$ lies between these two values, do not reject the null hypothesis.
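Both tests above are easy to reproduce by machine. Here is a minimal Python sketch; it assumes scipy is available for the t critical value, and the numbers match the two examples.

    from math import log, sqrt
    from scipy.stats import t as t_dist

    n, r, alpha = 10, 0.704, 0.05
    t_crit = t_dist.ppf(1 - alpha / 2, n - 2)        # about 2.306

    # Test of H0: rho = 0 (no transformation needed)
    t_zero = r / sqrt((1 - r**2) / (n - 2))          # about 2.805 -> reject H0

    # Test of H0: rho = 0.8 (Fisher's z-transformation)
    rho0 = 0.8
    z_tilde = 0.5 * log((1 + r) / (1 - r))           # 0.87519
    mu_z = 0.5 * log((1 + rho0) / (1 - rho0))        # 1.09861
    s_z = sqrt(1 / (n - 3))                          # 0.37796
    t_fisher = (z_tilde - mu_z) / s_z                # about -0.591 -> do not reject H0

    print(t_zero, t_fisher, t_crit)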
Note: To do the above with logarithms to the base 10, try $\tilde{z}_{10} = \frac{1}{2}\log\!\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_{z10} = \frac{1}{2}\log\!\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_{z10} = \sqrt{\dfrac{0.18861}{n-3}}$, so that $t^{(n-2)} = \dfrac{\tilde{z}_{10}-\mu_{z10}}{s_{z10}}$.
3. Tests of Association
a. Kendall's Tau. (Omitted)
b. Spearman's Rank Correlation Coefficient.
Take a set of $n$ points $(x, y)$ and rank both $x$ and $y$ from 1 to $n$ to get $(r_x, r_y)$. Do not attempt to compute a rank correlation without replacing the original numbers by ranks. A correlation coefficient between $r_x$ and $r_y$ can be computed as in point 2 above, but it is easier to compute $d = r_x - r_y$, and then

$$r_s = 1 - \frac{6\sum d^2}{n\left(n^2-1\right)}.$$

This can be given a t test for $H_0\!:\rho = 0$ as in point 2 above, but for $n$ between 4 and 30, a special table should be used. For really large $n$, $z = r_s\sqrt{n-1}$ may be used.
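A minimal Python sketch of this formula follows, assuming the data have already been converted to ranks with no ties; the applicant example below can be fed straight into it.

    def spearman_rs(rx, ry):
        """Spearman rank correlation from two lists of ranks (no ties)."""
        n = len(rx)
        sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * sum_d2 / (n * (n**2 - 1))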
Example: 5 applicants for a job are rated by two officers, with the following results. Note that in this example the ranks are given initially. Usually the data must be replaced by ranks. $(\alpha = .05)$

Applicant   A   B   C   D   E
Rater 1     4   1   3   2   5
Rater 2     3   2   5   1   4

Test to see how well the ratings agree.
In this case, we have a 1-sided test $H_0\!:\rho_s \le 0$ against $H_1\!:\rho_s > 0$. Arrange the data in columns.
Applicant   Rater 1 (rx)   Rater 2 (ry)     d    d²
A                4              3           1     1
B                1              2          -1     1
C                3              5          -2     4
D                2              1           1     1
E                5              4           1     1

Note that $\sum d = 0$ and $\sum d^2 = 8$. Since $n = 5$,

$$r_s = 1 - \frac{6\sum d^2}{n\left(n^2-1\right)} = 1 - \frac{6(8)}{5\left(5^2-1\right)} = 1 - \frac{2}{5} = 0.600.$$

If we check the table 'Critical Values of $r_s$, the Spearman Rank Correlation Coefficient,' we find that the critical value for $n = 5$ and $\alpha = .05$ is .8000, so we must not reject the null hypothesis and we conclude that we cannot say that the rankings agree.
Example: We find that for $n = 122$, $r_s = 1 - \dfrac{6\sum d^2}{n\left(n^2-1\right)} = 0.15$. We want to do the same one-sided test as in the last problem, $H_0\!:\rho_s \le 0$ against $H_1\!:\rho_s > 0$. $(\alpha = .05)$

We can do a t-test by computing $s_{r_s} = \sqrt{\dfrac{1-r_s^2}{n-2}}$ and $t = \dfrac{r_s}{s_{r_s}} = r_s\sqrt{\dfrac{n-2}{1-r_s^2}} = .15\sqrt{\dfrac{122-2}{1-.15^2}} = 1.662$. This is compared with $t^{(n-2)}_{.05} = t^{(120)}_{.05} = 1.658$. Since the t we computed is above the table value, we reject the null hypothesis.

Or we can compute a z-score $z = r_s\sqrt{n-1} = .15\sqrt{121} = 1.650$. Since this is above $z_{.05} = 1.645$, we can reject the null hypothesis.
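A small Python sketch of both large-sample approximations, using the numbers above, might look like this; the t critical value assumes scipy is available.

    from math import sqrt
    from scipy.stats import t as t_dist

    n, rs, alpha = 122, 0.15, 0.05
    t_stat = rs * sqrt((n - 2) / (1 - rs**2))   # about 1.662
    t_crit = t_dist.ppf(1 - alpha, n - 2)       # about 1.658 (one-sided)
    z_stat = rs * sqrt(n - 1)                   # about 1.650, vs z.05 = 1.645
    print(t_stat > t_crit, z_stat > 1.645)      # True, True -> reject H0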
c. Kendall's Coefficient of Concordance.
Take $k$ columns with $n$ items in each and rank each column from 1 to $n$. The null hypothesis is that the rankings disagree.

Compute a sum of ranks $SR_i$ for each row. Then $S = \sum SR_i^2 - n\,\overline{SR}^2$, where $\overline{SR} = \dfrac{(n+1)k}{2}$ is the mean of the $SR_i$'s. If $H_0$ is disagreement, $S$ can be checked against a table for this test. If $S \ge S_\alpha$, reject $H_0$. For $n$ too large for the table use $\chi^{2(n-1)} = k(n-1)W = \dfrac{S}{\frac{1}{12}kn(n+1)}$, where

$$W = \frac{S}{\frac{1}{12}k^2\left(n^3-n\right)}$$

is the Kendall Coefficient of Concordance and must be between 0 and 1.
Example: $n = 6$ applicants are rated by $k = 3$ officers. The ranks are below.

Applicant   Rater 1   Rater 2   Rater 3   Rank Sum SR   SR²
A              1         1         6           8         64
B              6         5         3          14        196
C              3         6         2          11        121
D              2         4         5          11        121
E              5         2         4          11        121
F              4         3         1           8         64
Total                                         63        687

Note that $\overline{SR} = \dfrac{63}{6} = 10.5 = \dfrac{(n+1)k}{2} = \dfrac{7(3)}{2}$; if we had complete disagreement, every applicant would have a rank sum of 10.5.

$S = \sum SR^2 - n\,\overline{SR}^2 = 687 - 6(10.5)^2 = 25.5$. The Kendall Coefficient of Concordance says that the degree of agreement on a zero to one scale is

$$W = \frac{S}{\frac{1}{12}k^2\left(n^3-n\right)} = \frac{25.5}{\frac{1}{12}\left(3^2\right)\left(6^3-6\right)} = 0.162.$$

To do a test of the null hypothesis of disagreement $(\alpha = .05)$, look up $S_\alpha$ in the table giving 'Critical values of Kendall's S as a Measure of Concordance' for $k = 3$ and $n = 6$: $S_{.05} = 103.9$. Since $S = 25.5$ is below this critical value, we accept the null hypothesis of disagreement.
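A minimal Python sketch of the S and W computation for this example follows, assuming each rater's column is already a ranking from 1 to n.

    ranks = {            # rater -> ranks for applicants A..F
        "Rater 1": [1, 6, 3, 2, 5, 4],
        "Rater 2": [1, 5, 6, 4, 2, 3],
        "Rater 3": [6, 3, 2, 5, 4, 1],
    }
    k, n = len(ranks), 6
    row_sums = [sum(col[i] for col in ranks.values()) for i in range(n)]  # [8, 14, 11, 11, 11, 8]
    sr_bar = (n + 1) * k / 2                                              # 10.5
    S = sum((sr - sr_bar) ** 2 for sr in row_sums)                        # 25.5
    W = S / (k**2 * (n**3 - n) / 12)                                      # about 0.162
    print(row_sums, S, W)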
Example: For $n = 31$ and $k = 3$ we get $W = 0.10$, and wish to test $H_0\!:$ Disagreement against $H_1\!:$ Agreement.

Since $n = 31$ is too large for the table, use $\chi^2 = k(n-1)W = 3(30)(0.10) = 9.000$. Using a $\chi^2$ table, look up $\chi^{2(n-1)}_{.05} = \chi^{2(30)}_{.05} = 43.773$. Since 9 is below the table value, do not reject $H_0$.
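The same check can be sketched in Python; the critical value assumes scipy is available.

    from scipy.stats import chi2

    n, k, W, alpha = 31, 3, 0.10, 0.05
    chi_stat = k * (n - 1) * W                 # 9.0
    chi_crit = chi2.ppf(1 - alpha, n - 1)      # about 43.77
    print(chi_stat < chi_crit)                 # True -> do not reject disagreement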
4. Multiple Correlation
If $R^2$ is the coefficient of determination for a regression $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$, then the square root of $R^2$, $R = r_{Y\hat{Y}}$, is called the multiple correlation coefficient. Note that

$$R^2 = \frac{\sum\left(\hat{Y} - \bar{Y}\right)^2}{\sum Y^2 - n\bar{Y}^2} = 1 - \frac{s_e^2\,(n-k-1)}{(n-1)\,s_y^2},$$

where $s_y^2$ is the sample variance of $y$, and that for large $n$, $R^2 \approx 1 - \dfrac{s_e^2}{s_y^2}$.
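These relations are easy to see numerically. Below is a minimal Python sketch on made-up data (not the course data set): it fits a two-variable regression with numpy and checks that the squared correlation between $Y$ and $\hat{Y}$ equals $R^2$ and the residual-variance form.

    import numpy as np

    rng = np.random.default_rng(0)              # made-up data, for illustration only
    n, k = 50, 2
    X = rng.normal(size=(n, k))
    Y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    b, *_ = np.linalg.lstsq(A, Y, rcond=None)
    Yhat = A @ b

    R2 = ((Yhat - Y.mean())**2).sum() / ((Y - Y.mean())**2).sum()
    R = np.corrcoef(Y, Yhat)[0, 1]              # multiple correlation coefficient
    s_e2 = ((Y - Yhat)**2).sum() / (n - k - 1)
    s_y2 = Y.var(ddof=1)
    print(R**2, R2, 1 - s_e2 * (n - k - 1) / ((n - 1) * s_y2))  # all equal up to rounding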
5. Partial Correlation (Optional)
If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, its multiple correlation coefficient can be written as $R_{Y.X_1X_2}$ or $R_{Y.12}$. For example, in the multiple regression problem, we got three multiple correlation coefficients: $R^2_{Y.X} = .496$, $R^2_{Y.XW} = .857$ and $R^2_{Y.XWH} = .906$.

If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$ and we compute the partial correlation of $X_3$ with $Y$, we compute

$$r^2_{Y3.12} = \frac{R^2_{Y.123} - R^2_{Y.12}}{1 - R^2_{Y.12}},$$

the additional explanatory power of the third independent variable after the effects of the first two are considered. If we read $t_3 = \dfrac{b_3}{s_{b_3}}$ from the computer printout, $r^2_{Y3.12} = \dfrac{t_3^2}{t_3^2 + df}$, where $df = n - k - 1$ and $k$ is the number of independent variables.
2
Example: In the multiple regression problem with which we have been working rYH
.XW is the additional
explanatory power of H beyond what was explained by X and Y . It can be computed two ways. First
RY2. XW H  RY2. XW .906  .857
2
rYH

 .343 . The partial correlation coefficient is actually
. XW 
1  .857
1  RY2. XW
rYH . XW   .343  0.576 . The sign of the partial correlation is the sign of the corresponding coefficient
in the regression. (For the regression equation see below.)
For the second method of computing $r^2_{YH.XW}$, recall that the last printout for the regression with which we were working was

Y = 1.51 + 0.595 X - 0.698 W - 0.937 H

Predictor      Coef     Stdev   t-ratio       p
Constant     1.5079    0.2709      5.57   0.001
X            0.5952    0.1198      4.97   0.003
W           -0.6984    0.4860     -1.44   0.201
H           -0.9365    0.5239     -1.79   0.124

Thus the t corresponding to $H$ is $t_H = -1.79$ and, since $df = 10 - 3 - 1 = 6$,

$$r^2_{YH.XW} = \frac{t_H^2}{t_H^2 + df} = \frac{(-1.79)^2}{(-1.79)^2 + 6} = 0.348,$$

which agrees with the first computation except for rounding.
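Both routes to the squared partial correlation are easy to script; here is a minimal Python sketch using the numbers above.

    # From the nested R-squared values
    r2_full, r2_reduced = 0.906, 0.857
    partial_r2_a = (r2_full - r2_reduced) / (1 - r2_reduced)   # about 0.343

    # From the t-ratio of H in the full regression
    t_H, df = -1.79, 10 - 3 - 1
    partial_r2_b = t_H**2 / (t_H**2 + df)                      # about 0.348

    print(partial_r2_a, partial_r2_b)  # agree up to rounding; the sign of r is the sign of b_H (negative)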
6. Collinearity
If $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$, and $X_1$ and $X_2$ are highly correlated, then we have no real variation of $X_1$ relative to $X_2$. This is a condition known as (multi)collinearity. The standard deviations for both $b_1$ and $b_2$ will be large and, in extreme cases, the regression process may break down. Recall that in section I, we said that small variation in $x$ can lead to large values of $s_{b_1}$ in simple regression and thus insignificant values of $b_1$.

Similarly, in multiple regression, lack of movement of the independent variables relative to one another leaves the regression process unable to tell what changes in the dependent variable are due to the various independent variables. This will be indicated by large values of $s_{b_1}$ or $s_{b_2}$, which cause us to find the coefficients insignificant when we use a t-test.
A relatively recent method to check for collinearity is to use the Variance Inflation Factor

$$VIF_j = \frac{1}{1 - R_j^2}.$$

Here $R_j^2$ is the coefficient of multiple correlation gotten by regressing the independent variable $X_j$ against all the other independent variables ($X$s). The rule of thumb seems to be that we should be suspicious if any $VIF_j > 5$ and positively horrified if $VIF_j > 10$. If you get results like this, drop a variable or change your model. Note that, if you use a correlation matrix for your independent variables and see a large correlation between two of them, putting the square of that correlation into the VIF formula gives you a low estimate of the VIF, since the R-squared that you get from a regression against all the independent variables will be higher.
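A minimal Python sketch of the VIF idea follows, computing each $R_j^2$ by ordinary least squares with numpy; the function name and the way the data are passed in are illustrative choices, not part of the course material.

    import numpy as np

    def vifs(X):
        """X: (n, p) array of independent variables. Returns one VIF per column."""
        n, p = X.shape
        out = []
        for j in range(p):
            y = X[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(n), others])        # add an intercept
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2_j = 1 - resid.var() / y.var()                 # R^2 of X_j on the other Xs
            out.append(1 / (1 - r2_j))
        return out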
Example: Note that in the printout in section 5, the standard deviations for the coefficients of $W$ and $H$ are quite large, resulting in small t-ratios and large p-values, which lead us to believe that the coefficients are not even significant when the significance level is 10%. The data from section J4 is repeated below:
Obs    Y   X   W   H
 1     0   0   1   1
 2     2   1   0   0
 3     1   2   1   1
 4     3   1   0   0
 5     1   0   0   0
 6     3   3   0   0
 7     4   4   0   0
 8     2   2   1   0
 9     1   2   1   1
10     2   1   0   0
Computation from these numbers reveals that $\sum W = 4$, $\sum H = 3$, $\sum W^2 = 4$, $\sum H^2 = 3$, $\sum WH = 3$, and $n = 10$. Thus $\bar{W} = 0.4$ and $\bar{H} = 0.3$, so that $\sum W^2 - n\bar{W}^2 = 4 - 10(0.4)^2 = 2.4$, $\sum H^2 - n\bar{H}^2 = 3 - 10(0.3)^2 = 2.1$ and $\sum WH - n\bar{W}\bar{H} = 3 - 10(0.4)(0.3) = 1.8$. Finally,

$$r_{WH} = \frac{\sum WH - n\bar{W}\bar{H}}{\sqrt{\left(\sum W^2 - n\bar{W}^2\right)\left(\sum H^2 - n\bar{H}^2\right)}} = \frac{1.8}{\sqrt{2.4\,(2.1)}} = .8018,$$

a relatively high correlation. This and the relatively small sample size account for the large standard deviations and the generally discouraging results. Though the regression against two independent variables has been shown to be an improvement over the regression against one independent variable, addition of the third independent variable, in spite of the high $R^2$, was useless. Preliminary use of the Minitab correlation command, as below, might have warned us of the problem.
MTB > Correlation 'X' 'W' 'H'.
Correlations (Pearson)
             X        W
W       -0.068
H       -0.145    0.802
Actually for this problem, the largest VIF, for H, is only about 2.86, but it seems to interact with the small
sample size.
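A quick cross-check of these numbers in Python (the correlation matrix of the independent variables and the VIF for H), using the data table above:

    import numpy as np

    X = np.array([0, 1, 2, 1, 0, 3, 4, 2, 2, 1], dtype=float)
    W = np.array([1, 0, 1, 0, 0, 0, 0, 1, 1, 0], dtype=float)
    H = np.array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0], dtype=float)

    print(np.corrcoef([X, W, H]))    # r_WH is about 0.802, as in the Minitab output

    # VIF for H: regress H on X and W (with an intercept) and use 1/(1 - R^2)
    A = np.column_stack([np.ones(len(H)), X, W])
    coef, *_ = np.linalg.lstsq(A, H, rcond=None)
    r2_H = 1 - ((H - A @ coef) ** 2).sum() / ((H - H.mean()) ** 2).sum()
    print(1 / (1 - r2_H))            # about 2.87, consistent with the value cited above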