Inference for correlation
Definition of covariance
For a pair of random variables x and y, one can define the covariance of the variables as the average (expected value) product of deviations from the mean, or $\sigma_{x,y} = \mathrm{cov}(x,y) = E\left[(x - \mu_x)(y - \mu_y)\right]$.

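As a quick numerical illustration (a sketch in Python, not part of the original text; the data vectors are made up for the demonstration), the sample analogue of this definition can be computed directly and checked against numpy's built-in covariance:

    import numpy as np

    # hypothetical example data, chosen only for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # average product of deviations from the means (dividing by n)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

    # np.cov divides by n-1 by default; bias=True uses the n divisor instead
    print(cov_xy, np.cov(x, y, bias=True)[0, 1])  # the two values agree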

Properties of covariance
The covariance of a variable with itself is just its variance, because when x is identical to y ($x = y$), then $\mathrm{cov}(x,x) = E\left[(x - \mu_x)^2\right] = \sigma_x^2$.

For given variances $\sigma_x^2$ and $\sigma_y^2$, the greatest covariance, $\sigma_{x,y} = \sigma_x \sigma_y$, occurs when y and x are directly linearly related, and the least, $\sigma_{x,y} = -\sigma_x \sigma_y$, when they are inversely linearly related.

For the covariance to equal zero, it is sufficient, but not necessary, that the variables be statistically independent, because then $E\left[(x - \mu_x)(y - \mu_y)\right] = E\left[x - \mu_x\right] E\left[y - \mu_y\right] = 0 \cdot 0 = 0$. A numerical illustration of the last two properties follows below.
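The following sketch (simulated data; an illustration, not part of the original) checks that an exact linear relation attains the covariance bound, while a dependent but nonlinear relation can still have covariance zero:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)

    # exact direct linear relation: covariance reaches sigma_x * sigma_y
    y_lin = 3.0 * x + 1.0
    print(np.cov(x, y_lin, bias=True)[0, 1], x.std() * y_lin.std())  # equal

    # dependent but uncorrelated: y = x^2 with x symmetric about zero
    y_sq = x ** 2
    print(np.cov(x, y_sq, bias=True)[0, 1])  # near 0, yet y_sq depends on x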

Definition of correlation
The covariance, then, is a measure of the extent of linear relation, but it depends on the scale of measurement of the variables. A scale-free measure is had by dividing by the extreme value, and is called the correlation, or
$$ \rho_{x,y} = \mathrm{corr}(x,y) = \frac{\sigma_{x,y}}{\sigma_x \sigma_y}. $$
It follows that the correlation ranges between the values $-1$ and $+1$, the extreme values corresponding to exact linear relation (inverse and direct, respectively).
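A short simulation (a sketch; the coefficients are assumptions for the demo) confirms that this measure is unaffected by changes of scale:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50_000)
    y = 0.8 * x + rng.normal(size=50_000)  # positively related by construction

    r1 = np.corrcoef(x, y)[0, 1]
    r2 = np.corrcoef(100 * x, 0.01 * y + 7)[0, 1]  # rescale and shift both
    print(r1, r2)  # identical: correlation is free of the units of measurement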
Point estimation
An estimate of the correlation can be constructed from sample analogues of the population parameters:
$$ r_{x,y} = \frac{s_{x,y}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$
(the term $n-1$ dividing the numerator and denominator cancels out). This estimator has a bias in small samples, but can be used for inference about correlation in the population.
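A direct implementation (a sketch with made-up data) shows the cancellation of the $n-1$ terms and agreement with numpy's built-in estimator:

    import numpy as np

    def r_xy(x, y):
        # sum of cross-products over the root of the product of sums of
        # squares; the n-1 divisors of s_xy, s_x, and s_y have cancelled
        dx, dy = x - x.mean(), y - y.mean()
        return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

    x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.7])
    y = np.array([2.0, 3.1, 3.9, 5.2, 4.8, 7.1])
    print(r_xy(x, y), np.corrcoef(x, y)[0, 1])  # the two agree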
Testing
For testing the null hypothesis of no correlation, one can transform r to obtain a test statistic, namely
$$ t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \sim t(n-2), $$
which follows a t-distribution with $n-2$ degrees of freedom under the null hypothesis. As usual, the null hypothesis is rejected for large enough absolute values of t, according to the desired probability of type I error. This t statistic turns out to be identical to the statistic used for testing for zero slope in the regression model (either x or y can play the role of the independent variable!). Thus, regression software can be used to perform this test even though it is designed for a different statistical model.
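The sketch below (with hypothetical data) computes the statistic and its two-sided p-value, and checks the result against scipy.stats.pearsonr, which carries out the same test:

    import numpy as np
    from scipy import stats

    x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.7, 7.3, 8.1])
    y = np.array([2.0, 3.1, 3.9, 5.2, 4.8, 7.1, 6.9, 8.4])
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value

    r_sp, p_sp = stats.pearsonr(x, y)  # built-in version of the same test
    print(t, p, p_sp)  # the two p-values agree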
If hypotheses other than the value zero are to be tested, say $\rho = \rho_0$, another transformation of r yields an approximately normal variable in moderately sized samples: $w = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right)$. If we let $\omega = \frac{1}{2} \ln\left(\frac{1+\rho}{1-\rho}\right)$ be the correspondingly transformed population correlation, then $w \sim N\left(\omega, \frac{1}{n-3}\right)$ approximately, i.e. approximately unbiased with variance $\frac{1}{n-3}$, and so a standard normal test statistic is formed as
$$ z = \frac{w - \omega_0}{\sqrt{1/(n-3)}}, $$
where $\omega_0$ is the transformed null value $\rho_0$. The critical (rejection) region of the test is determined as usual according to the direction of the test (alternative) and the probability of type I error.
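A sketch of this test for, say, $H_0\colon \rho = 0.5$ (the data and the null value are assumptions for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    y = 0.6 * x + 0.8 * rng.normal(size=200)  # true correlation is 0.6
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    w = 0.5 * np.log((1 + r) / (1 - r))           # transform of r
    omega0 = 0.5 * np.log((1 + 0.5) / (1 - 0.5))  # transformed null rho_0

    z = (w - omega0) / np.sqrt(1 / (n - 3))
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    print(z, p)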
Confidence interval for ρ
Confidence intervals for $\rho$ are obtained by back-transforming the corresponding interval for $\omega$. This latter interval is constructed as $w \pm z_{1-\alpha/2} \sqrt{1/(n-3)}$. To get the inverse transformation, we solve for $\rho$ in terms of $\omega$, which gives $\rho = \frac{e^{2\omega} - 1}{e^{2\omega} + 1}$, and apply this transformation to the upper and lower confidence limits, respectively; that is, the upper confidence limit for $\rho$ is obtained by substituting $w + z_{1-\alpha/2} \sqrt{1/(n-3)}$ for $\omega$, and similarly for the lower limit. The resulting interval for $\rho$ is not symmetrical unless w happens by chance to be zero.
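Putting this together, a sketch (simulated data; the 95% level and the helper name corr_ci are choices for the demo) of the back-transformed interval:

    import numpy as np
    from scipy import stats

    def corr_ci(x, y, level=0.95):
        # interval for omega, then back-transform each limit to the rho scale
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]
        w = 0.5 * np.log((1 + r) / (1 - r))
        half = stats.norm.ppf(1 - (1 - level) / 2) * np.sqrt(1 / (n - 3))
        # (e^{2w} - 1) / (e^{2w} + 1) is exactly tanh(w)
        return np.tanh(w - half), np.tanh(w + half)

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 0.6 * x + 0.8 * rng.normal(size=100)
    print(corr_ci(x, y))  # not symmetric about r unless w happens to be 0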