Correlation

A bit about Pearson’s r
Questions
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?
• What is the purpose of the Fisher r to z transformation?
• What is range restriction? Range enhancement? What do they do to r?
• Give an example in which data properly analyzed by ANOVA cannot be used to infer causality.
• Why do we care about the sampling distribution of the correlation coefficient?
• What is the effect of reliability on r?
Basic Ideas
• Nominal vs. continuous IV
• Degree (direction) & closeness (magnitude) of linear relations
– Sign (+ or -) for direction
– Absolute value for magnitude
• Pearson product-moment correlation coefficient
$r = \frac{\sum z_X z_Y}{N}$
Illustrations
[Figures: Plot of Weight by Height; Plot of Errors by Study Time; Plot of SAT-V by Toe Size.]
The three plots illustrate positive, negative, and zero correlations.
Simple Formulas
$r = \frac{\sum xy}{N S_X S_Y}$, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$
$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
Use either N throughout or else use N-1 throughout (SD and denominator); result is the same as long as you are consistent.
$\mathrm{Cov}(X, Y) = \frac{\sum xy}{N}$
$r = \frac{\sum z_X z_Y}{N}$, where $z_X = \frac{X - \bar{X}}{S_X}$
Pearson’s r is the average cross product of z scores. Product of (standardized) moments from the means.
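The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `pearson_r`, the `ddof` switch, and the height/weight values are all hypothetical. It computes r as the average cross product of z scores and checks that using N throughout or N-1 throughout gives the same result.

```python
import math

def pearson_r(xs, ys, ddof=0):
    """Pearson's r as the average cross product of z scores.

    ddof=0 uses N throughout (SD and denominator);
    ddof=1 uses N-1 throughout.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - ddof))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - ddof))
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - ddof)

# Hypothetical data, for illustration only.
heights = [60, 62, 65, 68, 70, 72]
weights = [110, 120, 140, 150, 165, 180]

r_n = pearson_r(heights, weights, ddof=0)   # N throughout
r_n1 = pearson_r(heights, weights, ddof=1)  # N-1 throughout
print(r_n, r_n1)
```

The two calls agree because the N (or N-1) terms in the standard deviations cancel against the one in the denominator.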
Graphic Representation
[Figures: Plot of Weight by Height (raw scores; mean height = 66.8 inches, mean weight = 150.7 lbs.) and Plot of Weight by Height in Z-scores (Z-weight by Z-height). Lines at the means divide each plot into quadrants marked + or - for the sign of the cross product.]
1. Conversion from raw to z.
2. Points & quadrants. Positive & negative products.
3. Correlation is average of cross products. Sign & magnitude of r depend on where the points fall.
4. Product at maximum (average = 1) when points on line where zX = zY.
Descriptive Statistics
Variable             N    Minimum   Maximum   Mean       Std. Deviation
Ht                   10   60.00     78.00     69.0000    6.05530
Wt                   10   110.00    200.00    155.0000   30.27650
Valid N (listwise)   10
r = 1.0
Leave X, add error to Y: r = .99
Add more error: r = .91
With 2 variables, the correlation is the z-score slope.
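One way to see the z-score slope claim is to regress zY on zX and compare the least-squares slope with r. The sketch below is illustrative only; the helper names and the data are hypothetical.

```python
import math

def z_scores(values):
    """Convert raw scores to z scores (N in the SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def ls_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(x)  # average cross product
slope = ls_slope(zx, zy)                         # slope in z-score units
print(r, slope)
```

Because the z scores have mean 0 and variance 1, the slope reduces to the average cross product, which is r.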
Review
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?
Sampling Distribution of r
Statistic is r, parameter is ρ (rho). In general, r is slightly biased.
[Figure: Sampling Distributions of r. Relative frequency of observed r for rho = -.5, rho = 0, and rho = .5.]
The sampling variance is approximately: $\sigma_r^2 = \frac{(1 - \rho^2)^2}{N}$
Sampling variance depends both on N and on ρ.
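To make the dependence on N and ρ concrete, the approximation can be evaluated for a few cases. This is a small sketch; the function name is hypothetical, and the ρ and N values are simply the four conditions used in the empirical sampling distributions in this deck.

```python
def sampling_variance_r(rho, n):
    """Approximate sampling variance of r: (1 - rho^2)^2 / N."""
    return (1 - rho ** 2) ** 2 / n

for rho, n in [(0.5, 100), (0.5, 50), (0.7, 100), (0.7, 50)]:
    var_r = sampling_variance_r(rho, n)
    print(f"rho={rho}, N={n}: variance={var_r:.6f}, SE={var_r ** 0.5:.3f}")
```

The variance shrinks as N grows and as ρ moves away from zero.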
Empirical Sampling Distributions of the Correlation Coefficient
[Figure: box plots of observed r (vertical axis from -0.1 to 0.9) for four conditions: ρ = .5, N = 100; ρ = .5, N = 50; ρ = .7, N = 100; ρ = .7, N = 50.]
Fisher’s r to z Transformation
$z = .5 \ln\left(\frac{1 + r}{1 - r}\right)$

r      z
.10    .10
.20    .20
.30    .31
.40    .42
.50    .55
.60    .69
.70    .87
.80    1.10
.90    1.47

[Figure: Fisher r to z Transformation. z (output) plotted against r (sample value input).]
The sampling distribution of z approaches normal as N increases.
The transformation pulls out the short tail to make a better (more nearly normal) distribution.
The sampling variance of z, 1/(N-3), does not depend on ρ.
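The transformation is easy to compute directly. In the sketch below, `fisher_z` is a hypothetical helper name. It applies z = .5 ln((1 + r)/(1 - r)), which is the same function as the inverse hyperbolic tangent, and lists z for r = .10 to .90.

```python
import math

def fisher_z(r):
    """Fisher r to z transformation: .5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

r_values = [i / 10 for i in range(1, 10)]        # .10, .20, ..., .90
z_table = [round(fisher_z(r), 2) for r in r_values]

for r, z in zip(r_values, z_table):
    print(f"r = {r:.2f}  z = {z:.2f}")

# Standard error of z for a sample of size n (does not involve rho).
def se_fisher_z(n):
    return math.sqrt(1 / (n - 3))
```

Note how z stays close to r for small correlations and stretches out as r approaches 1.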
Hypothesis test: $H_0: \rho = 0$
$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$
Result is compared to t with (N-2) df for significance.
Say r=.25, N=100
$t = \frac{.25 \sqrt{98}}{\sqrt{1 - .25^2}} = \frac{9.899 \times .25}{.968} = 2.56$
t(.05, 98) = 1.984.
p < .05
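The t test above can be checked with a short calculation. This is a sketch only: `t_for_r` is a hypothetical name, and the critical value 1.984 is taken from the slide rather than computed, because the Python standard library has no t distribution.

```python
import math

def t_for_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_r(r=0.25, n=100)
df = 100 - 2
critical_t = 1.984  # t(.05, 98), two-tailed, as given on the slide

print(f"t({df}) = {t:.2f}, significant: {abs(t) > critical_t}")
```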
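Hypothesis test 2: $H_0: \rho = \text{value}$
$z = \frac{.5 \log_e\left(\frac{1 + r}{1 - r}\right) - .5 \log_e\left(\frac{1 + \rho}{1 - \rho}\right)}{\sqrt{1/(N - 3)}}$
One sample z test where r is sample value and ρ is hypothesized population value.
Say N=200, r = .54, and ρ is .30.
$z = \frac{.5 \log_e\left(\frac{1 + .54}{1 - .54}\right) - .5 \log_e\left(\frac{1 + .30}{1 - .30}\right)}{\sqrt{1/(200 - 3)}} = \frac{.60 - .31}{.07} = 4.13$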
Compare to unit normal, e.g., 4.13 > 1.96 so it is
significant. Our sample was not drawn from a
population in which rho is .30.
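A minimal sketch of this one-sample test follows. The function name is hypothetical; `math.atanh` is used for the Fisher transformation and `statistics.NormalDist` for the unit normal.

```python
import math
from statistics import NormalDist

def one_sample_r_test(r, rho0, n):
    """z test of H0: rho = rho0 using Fisher's r to z transformation."""
    z = (math.atanh(r) - math.atanh(rho0)) / math.sqrt(1 / (n - 3))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

z, p = one_sample_r_test(r=0.54, rho0=0.30, n=200)
print(f"z = {z:.3f}, two-sided p = {p:.6f}")
```

With unrounded intermediate values the statistic is about 4.135, in line with the hand calculation.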
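Hypothesis test 3: $H_0: \rho_1 = \rho_2$
Testing equality of correlations from 2 INDEPENDENT samples.
$z = \frac{.5 \log_e\left(\frac{1 + r_1}{1 - r_1}\right) - .5 \log_e\left(\frac{1 + r_2}{1 - r_2}\right)}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}}$
Say N1=150, r1=.63, N2=175, r2=.70.
$z = \frac{.5 \log_e\left(\frac{1 + .63}{1 - .63}\right) - .5 \log_e\left(\frac{1 + .70}{1 - .70}\right)}{\sqrt{1/(150 - 3) + 1/(175 - 3)}} = \frac{.74 - .87}{.11} = -1.18$, n.s.
(Carrying more decimal places gives z ≈ -1.12; the conclusion is the same.)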
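The two-sample test can be sketched as follows. The function name is hypothetical. Because the code does not round the intermediate z values, it gives about -1.12 rather than the hand-rounded -1.18; neither exceeds 1.96 in absolute value.

```python
import math

def two_independent_r_test(r1, n1, r2, n2):
    """z test of H0: rho1 = rho2 for correlations from independent samples."""
    diff = math.atanh(r1) - math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return diff / se

z = two_independent_r_test(r1=0.63, n1=150, r2=0.70, n2=175)
print(f"z = {z:.2f}, significant at .05: {abs(z) > 1.96}")
```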
Hypothesis test 4: $H_0: \rho_1 = \rho_2 = \ldots = \rho_k$
Testing equality of any number of independent correlations.
$\bar{z} = \frac{\sum_{i=1}^{k} (n_i - 3) z_i}{\sum_{i=1}^{k} (n_i - 3)}$
$Q = \sum_{i=1}^{k} (n_i - 3)(z_i - \bar{z})^2$
Compare Q to chi-square with k-1 df.

Study   r    n     z     (n-3)z   zbar   (z-zbar)^2   (n-3)(z-zbar)^2
1       .2   200   .20   39.94    .41    .0441        8.69
2       .5   150   .55   80.75    .41    .0196        2.88
3       .6   75    .69   49.91    .41    .0784        5.64
sum          425         170.6                        17.21 = Q
Chi-square at .05 with 2 df = 5.99. Not all rho are equal.
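The Q computation can be sketched in Python for the three studies above. The function name is hypothetical, and the critical value 5.99 is copied from the slide rather than computed. Since the code keeps full precision for each z, it yields Q of about 17.09 instead of the 17.21 obtained from two-decimal z values; the decision is unchanged.

```python
import math

def homogeneity_q(rs, ns):
    """Weighted mean Fisher z and Q statistic for k independent correlations."""
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return z_bar, q

rs = [0.2, 0.5, 0.6]
ns = [200, 150, 75]
z_bar, q = homogeneity_q(rs, ns)

critical_chi2 = 5.99  # chi-square at .05 with k - 1 = 2 df, from the slide
print(f"zbar = {z_bar:.2f}, Q = {q:.2f}, reject H0: {q > critical_chi2}")
```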
Hypothesis test 5: dependent r
$H_0: \rho_{12} = \rho_{13}$
Hotelling-Williams test
$t_{(N-3)} = (r_{12} - r_{13}) \sqrt{\frac{(N - 1)(1 + r_{23})}{2 \frac{N - 1}{N - 3} |R| + \bar{r}^2 (1 - r_{23})^3}}$
$\bar{r} = (r_{12} + r_{13}) / 2$
$|R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 (r_{12})(r_{13})(r_{23})$
Say N=101, r12=.4, r13=.6, r23=.3
$\bar{r} = (.4 + .6) / 2 = .5$
$|R| = 1 - .4^2 - .6^2 - .3^2 + 2(.4)(.6)(.3) = .534$
$t_{(N-3)} = (.4 - .6) \sqrt{\frac{(100)(1 + .3)}{2 \frac{100}{98} (.534) + .5^2 (1 - .3)^3}} = -2.1$
t(.05, 98) = 1.98, so |t| = 2.1 is significant.
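A sketch of the Hotelling-Williams calculation for this example follows. The function name and return layout are hypothetical; the arithmetic follows the formula for t with N - 3 df, the mean correlation, and the determinant |R| used in this test.

```python
import math

def williams_t(r12, r13, r23, n):
    """Hotelling-Williams t for H0: rho12 = rho13 (dependent correlations)."""
    r_bar = (r12 + r13) / 2
    det_r = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det_r + r_bar ** 2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt(numerator / denominator)
    return t, det_r, n - 3

t, det_r, df = williams_t(r12=0.4, r13=0.6, r23=0.3, n=101)
print(f"|R| = {det_r:.3f}, t({df}) = {t:.2f}")
```

The sign of t simply reflects that r12 is smaller than r13; the two-tailed decision uses its absolute value.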
$H_0: \rho_{12} = \rho_{34}$: See my notes.
Review
• What is the purpose of the Fisher r to z transformation?
• Test the hypothesis that $\rho_1 = \rho_2$
– Given that r1 = .50, N1 = 103
– r2 = .60, N2 = 128 and the samples are independent.
• Why do we care about the sampling distribution of the correlation coefficient?
Range Restriction/Enhancement
Reliability
Reliability sets the ceiling for validity. Measurement error attenuates correlations.
$\rho_{XY} = \rho_{T_X T_Y} \sqrt{\rho_{XX'} \rho_{YY'}}$
If the correlation between true scores is .7 and the reliabilities of X and Y are both .8, the observed correlation is .7 × sqrt(.8 × .8) = .7 × .8 = .56.
Disattenuated correlation
$\rho_{T_X T_Y} = \rho_{XY} / \sqrt{\rho_{XX'} \rho_{YY'}}$
If our observed correlation is .56 and the reliabilities of both X and Y are .8, our estimate of the correlation between true scores is .56/.8 = .70.
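Both directions of the reliability correction can be written as a pair of small functions. The names are hypothetical; the numbers are the ones used in the worked example above.

```python
import math

def attenuate(true_corr, rel_x, rel_y):
    """Expected observed correlation given the true-score correlation."""
    return true_corr * math.sqrt(rel_x * rel_y)

def disattenuate(observed_corr, rel_x, rel_y):
    """Estimated true-score correlation given the observed correlation."""
    return observed_corr / math.sqrt(rel_x * rel_y)

observed = attenuate(0.7, 0.8, 0.8)          # about .56
recovered = disattenuate(observed, 0.8, 0.8)  # back to about .70
print(f"observed = {observed:.2f}, disattenuated = {recovered:.2f}")
```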
Review
• What is range restriction? Range enhancement? What do they do to r?
• What is the effect of reliability on r?
SAS Power Estimation
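proc power;
   /* Power of a one-sided test of H0: rho = 0.2 when rho = 0.35 and N = 100 */
   onecorr dist=fisherz
      corr = 0.35
      nullcorr = 0.2
      sides = 1
      ntotal = 100
      power = .;
run;
Computed Power
Actual alpha = .05
Power = .486
proc power;
   /* N needed for power = .8 in a two-sided test of H0: rho = 0 when rho = 0.35 */
   onecorr
      corr = 0.35
      nullcorr = 0
      sides = 2
      ntotal = .
      power = .8;
run;
Computed N Total
Alpha = .05
Actual Power = .801
Ntotal = 61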
Power for Correlations
Rho    N required against Null: rho = 0
.10    782
.15    346
.20    193
.25    123
.30    84
.35    61

Sample sizes required for powerful conventional significance tests for typical values of the correlation coefficient in psychology. Power = .8, two tails, alpha is .05.
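For a rough check on these sample sizes, a common Fisher z approximation is N ≈ ((z for alpha/2 + z for power) / z(ρ))² + 3. This is an assumption introduced here for illustration, not the method SAS used to produce the table, and the function name is hypothetical. The approximate values land within about one case of the tabled ones.

```python
import math
from statistics import NormalDist

def approx_n_for_correlation(rho, power=0.80, alpha=0.05):
    """Approximate N to detect rho against H0: rho = 0 (two tails)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3

# Tabled values from the slide, for comparison.
table = {0.10: 782, 0.15: 346, 0.20: 193, 0.25: 123, 0.30: 84, 0.35: 61}
approx = {rho: approx_n_for_correlation(rho) for rho in table}

for rho, n_table in table.items():
    print(f"rho = {rho:.2f}: approx N = {approx[rho]:.1f}, table N = {n_table}")
```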