Module 20: Correlation

advertisement
Module 20: Correlation
This module focuses on the calculating, interpreting
and testing hypotheses about the Pearson Product
Moment Correlation Coefficient.
Reviewed 05 June 05/MODULE 20
20 - 1
Correlation
In Module 19, we examined how two variables, x and y,
relate to each other by using the simple linear regression
tool. In that context, x was the independent variable and y
was the dependent variable. Typical examples for the
independent variable include measures of time, including
age; whereas, typical examples for the dependent variable
are continuous measurements such as blood cholesterol
level. The general assumption is that there are separate
normal distributions of the dependent variable y for each
value of the independent variable x. Further, we need to
assume that these separate normal distributions for the
dependent variable all have the same population variance.
20 - 2
Clearly these assumptions are quite restrictive in that
we are often interested in the relationship between
two variables, x and y, where it is not at all clear
which should be labeled the independent variable and
which the dependent one. An example is the
relationship between blood cholesterol level and
blood pressure level.
20 - 3
For this situation, we have another tool to measure and
test hypotheses about the relationship between these two
variables. The tool is correlation and we focus here only
on what is usually called the Pearson Product Moment
Correlation Coefficient. There are other measures of
correlation which we will not discuss here. There are
restrictions for the use of this correlation tool as well,
which include the basic assumption that the x and y
variables together have a joint frequency distribution
which is called the bivariate normal distribution. This
distribution looks like a three-dimensional bell in a
manner similar to the way a normal distribution for one
variable looks like a cross section of a bell.
20 - 4
The degree of association or correlation between two
variables is measured by the correlation coefficient.
This is done in a manner similar to that for other
population parameters and estimates of these
parameters obtained by calculating statistics from
samples. That is, there is a value for the population
parameter for this coefficient which is estimated by
selecting a random sample and calculating the
appropriate coefficient using the data from this
sample. We can also use the information from the
sample to test hypotheses about the population.
20 - 5
The population parameter for the Pearson Product
Moment Correlation Coefficient is defined as
 xy 
 ( x  u )( y  u )
 (x  u )  ( y  u )
x
y
2
x
2
y
which is typically called Rho, for the Greek letter it
represents. The estimate of ρ calculated from the sample
data is the statistic
( x  x )( y  y )
rxy 
=


 (x  x )  ( y  y)
 xy  ( x)( y ) / n
[ x  ( x) / n][ y  ( y )
2
2
2
2
2
2
/ n]
SS ( xy )
SS ( x) SS ( y )
20 - 6
Fecal Fat and Urinary Oxalate Example
Fecal Fat (g/24 hr) and Urinary Oxalate (mg/24 hr)
secreted by a random sample of n = 11 persons
Patient
1
2
3
4
5
6
7
8
9
10
11
Sum
Mean
x = Fecal Fat
16
14
38
8
15
22
28
27
14
45
46
273
24.8
y = Urinary Oxalate
31
40
41
70
75
65
85
115
128
145
140
935
85.0
20 - 7
Scatter plot for Fecal Fat and Urinary Oxalate Data
Urinary Oxalate (mg/24hr)
160
140
120
100
80
60
40
20
0
0
10
20
30
40
50
Fecal Fat (g/24hr)
20 - 8
Calculations for Regressing Urinary Oxalate on Fecal Fat
Fecal Fat
Person
1
2
3
4
5
6
7
8
9
10
11
Sum
Mean
x
16
14
38
8
15
22
28
27
14
45
46
273
24.8
Sum2/n
SS
Variance
SD
6,775.36
1,743.64
174.36
13.2
Urinary Oxalate
x2
256
196
1,444
64
225
484
784
729
196
2,025
2,116
8,519
y
31
40
41
70
75
65
85
115
128
145
140
935
85.0
79,475.00
16,976.00
1,697.60
41.2
y2
961
1,600
1,681
4,900
5,625
4,225
7,225
13,225
16,384
21,025
19,600
96,451
xy
496
560
1,558
560
1,125
1,430
2,380
3,105
1,792
6,525
6,440
25,971
2,766.00
20 - 9
Regression Tools
y  85.0
x  24.8
SS ( x)  1,743.64
SS ( y )  16,976.00
SS ( xy )  2,766.00
20 - 10
So we can calculate
SS ( xy) 2,766.00
Slope  b 

 1.59
SS ( x) 1,743.64
Intercept  a  y  bx  85.0  1.59(24.8)  45.54
The straight line depicting the regression relationship of y on
x is
yˆ  a  bx
yˆ  45.54  1.59 x
At x = 40, the regression estimate for y is:
yˆ  45.54  1.59(40)  109.14
20 - 11
With this information, we can add the regression line
ˆ  45.54  1.59x
y
Urinary Oxalate (mg/24hr)
to the scatter plot, as shown below.
160
140
120
100
80
60
40
20
0
0
20
40
60
Fecal Fat (g/24hr)
yˆ  45.54  1.59x  45  117.09
20 - 12
Test for regression of Urinary Oxalate on Fecal Fat
1. The hypothesis:
H0: β = 0 vs. H1: β ≠0
2. The  level:
 = 0.05
3. The assumptions:
Random normal samples for yvariable from populations
defined by x-variable
4. The test statistic:
ANOVA as specified by
ANOVA
Source
df
SS
MS
Regression
1 bSS (xy) SS(Reg)/1
Residual
n-2 SS(Res)a SS(Res)/(n-2)
Total
n-1
SS (y)
a
SS(Residual) = SS(y) – SS(Regression)
F
MS(Reg)/MS(Res)
20 - 13
5. The critical region:
Reject H0: β = 0 if the value
calculated for F > F0.95(1, 9) = 5.12
6. The result:
SS(Reg) = bSS(xy)
= 1.59 (2,766.00) = 4,397.94
SS(Total) = SS(y) = 16,976.00
SS(Res) = 16,976.00 – 4,397.94
= 12,578.06
Source
Regression
Residual
Total
df
1
9
10
7. The conclusion:
ANOVA
SS
4,397.94
12,578.06
16,976.00
MS
4,397.94
1,397.56
F
3.15
Accept H0: β = 0 since F < 5.12
20 - 14
(x, y)
(45,145)
Urinary Oxalate (mg/24hr)
160
140
yˆ  45.54  1.59(45)
B  145  117  28
120
100
(45, 85)
80
A  117  85  32
60
y  y  145  85  60
y = 45.54+1.59x
40
45.54
20
0
0
10
20
30
40
50
60
Fecal Fat (g/24hr)
R2 = 0.2585
yˆ = Regression estimate = 45.54 + 1.59x = 117
C=A+B=
A=
yˆ  y
B = y  yˆ
yy
Total deviation
 SS(y)
Explained by line
 SS(Reg)
Left over
 SS(Residual)
20 - 15
Correlation Tools
SS ( x)  1,743.64
SS ( y)  16,976.00
SS ( xy)  2,766.00
The estimate of the correlation coefficient is:
SS ( xy)
rxy 
SS ( x) SS ( y)
2,766.00
rxy 
 0.5084
[1,743.64][16,976.00]
20 - 16
The Correlation Coefficient
r measures linear association
r has values between -1 ≤ r ≤ + 1
r  + 1 implies strong positive linear association
r  - 1 implies strong negative linear association
r  0 implies no linear association
20 - 17
.
y
.
. .
.
.
.
y
.
. .
. .
.
.
.
.
.
. .
.
.
.
.
.
x
x
r  +1
r  -1
y
. . .
.. .
.
y
.
.
.
.
.
.
.
.
.
.
. .
. .
.
. .
.
x
x
r 0
r  +1
.
y
. .
.
.
. .
. .
.
. .
.
.
. .
.
y
.
. .
.
.
. . .
.. . ..
. .. .
x
x
r 0
r >0
20 - 18
Correlation Hypothesis Testing
The hypothesis of interest deals with whether there is linear
association between x and y. If there is no such
association, we would have  = 0. Hence, the hypotheses
of interest are:
H0:  = 0 vs. H1:   0
which we can test by using the test statistic:
½
 n2 
t r
 t( n 2)

2
1  r 
Note that this calculation requires only the sample estimate
r of the correlation coefficient ρ and the sample size n and
that we need to use the t distribution with n - 2 degrees of
freedom.
20 - 19
Test of Correlation between Urinary Oxalate
and Fecal Fat, n = 11, r = 0.5084
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
½
 n2 
t r
2
1  r 
 t( n  2)
20 - 20
5. The critical region: Reject H0:  = 0 if the value
calculated for t is not between
± t0.975(9) = 2.262
6. The result:
r = 0.5084, n = 11
½


9
t  0.5084 

2
1  (0.5084) 
7. The conclusion:
½
9


 0.5084 

1

0.2585


 1.77
Accept H0: = 0 since t = 1.77
is between ± t0.975(9) = 2.262
20 - 21
Test of Correlation for Tono-Pen vs. Goldman
intraocular pressure, n = 40, r = 0.6574
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
n2
tr
2
1 r
20 - 22
5. The rejection region: Reject H0:  = 0 , if t is not
between  t0.975(38) = 2.02
6. The result:
n = 40, r = 0.6574, r2 = 0.44,
38
38
t  0.6574
 0.66
 5.44
2
0.56
1  0.6574
7. The conclusion:
Reject H0: = 0 since t = 5.44
is not between  2.02
20 - 23
Example : AJPH, 1995; 85: 1397-1401
20 - 24
Source: AJPH, 1995; 85: 1397-1401
20 - 25
Test of Correlation between infant mortality rate
and gross domestic product, n = 17, r = -0.64
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
n2
tr
1  r2
20 - 26
5. The rejection region: Reject H0:  = 0 , if t is not
between  t0.975(15) = 2.13
6. The result:
n = 17, r = -0.64
15
15
t  0.64
 0.64
 3.23
2
1  (0.64)
0.59
7. The conclusion:
Reject H0:  = 0 since
t = -3.23 is not between
 t0.975(15) = 2.13
20 - 27
Test of correlation hypothesis for life expectancy
for males and females, n = 17, r = 0.67
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
n2
tr
1  r2
20 - 28
5. The rejection region: Reject H0:  = 0 , if t is not
between  t0.975(15) = 2.13
6. The result:
n = 17, r = 0.67
15
15
t  0.67
 0.67
 3.49
2
1  0.67
1  0.45
7. The conclusion:
Reject H0:  = 0 since
t = 3.49 is not between
 t0.975 (15) = 2.13
20 - 29
Example : AJPH, 1997; 87: 1491-1498
20 - 30
20 - 31
Correlation between Mortality and
Social Mistrust, n = 39, r = 0.79
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
n2
tr
1  r2
20 - 32
5. The rejection region: Reject H0:  = 0 , if t is not
between  t0.975(37) = 2.02
6. The result:
n = 39, r = 0.79
37
37
t  0.79
 0.79
 7.8
2
1  0.79
0.3759
7. The conclusion:
Reject H0: = 0 since t = 7.8
is not between
 t0.975 (37) = 2.02
20 - 33
Example : AJPH, 1998; 88: 1496-1502
20 - 34
Note: very few
outliers can have a
large impact on the
location of the line
Source: AJPH, 1998; 88: 1496-1502
20 - 35
Test for Correlation between Gonorrhea
rate and Chlamydia rate
1. The hypothesis:
H0:  = 0 vs. H1:  ≠ 0
2. The  level:
 = 0.05
3. The assumptions:
Random sample from bivariate
normal distribution.
4. The test statistic:
n2
tr
1  r2
20 - 36
5. The rejection region: Reject H0:  = 0 , if t is not
between  t0.975(320)  2.00
6. The result:
n = 322, r = 0.83
320
320
t  0.83
 0.83
 26.67
2
1  0.83
1  0.69
7. The conclusion:
Reject H0:  = 0 since
t = 26.67 is not between
 t0.975(320) = 2.00
20 - 37
Download