Correlation
A bit about Pearson’s r

Questions
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?
• What is the purpose of the Fisher r to z transformation?
• What is range restriction? Range enhancement? What do they do to r?
• Give an example in which data properly analyzed by ANOVA cannot be used to infer causality.
• Why do we care about the sampling distribution of the correlation coefficient?
• What is the effect of reliability on r?

Basic Ideas
• Nominal vs. continuous IV
• Degree (direction) & closeness (magnitude) of linear relations
  – Sign (+ or -) for direction
  – Absolute value for magnitude
• Pearson product-moment correlation coefficient

$$r = \frac{\sum z_X z_Y}{N}$$

Illustrations
[Figures: Plot of Weight by Height; Plot of Errors by Study Time; Plot of SAT-V by Toe Size]
Positive, negative, zero

Simple Formulas

$$r = \frac{\sum xy}{N S_X S_Y}, \qquad x = X - \bar{X} \text{ and } y = Y - \bar{Y}$$

$$S_X = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

Use either N throughout or else use N-1 throughout (SD and denominator); result is the same as long as you are consistent.

$$\mathrm{Cov}(X, Y) = \frac{\sum xy}{N}$$

$$r = \frac{\sum z_X z_Y}{N}, \qquad z_X = \frac{X - \bar{X}}{S_X}$$

Pearson’s r is the average cross product of z scores. Product of (standardized) moments from the means.

Graphic Representation
[Figures: Plot of Weight by Height (mean weight = 150.7 lbs., mean height = 66.8 inches); Plot of Weight by Height in Z-scores, with the quadrants marked by the sign of the cross product (- + / + -)]
1. Conversion from raw to z.
2. Points & quadrants. Positive & negative products.
3. Correlation is average of cross products. Sign & magnitude of r depend on where the points fall.
4. Product at maximum (average = 1) when points on line where zX = zY.

Descriptive Statistics

                      N    Minimum   Maximum   Mean       Std. Deviation
Ht                    10   60.00     78.00     69.0000    6.05530
Wt                    10   110.00    200.00    155.0000   30.27650
Valid N (listwise)    10

r = 1.0
Leave X, add error to Y.
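The idea that r is the average cross product of z scores, and that adding error to Y pulls r below 1, can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own computation: the heights and weights are assumed values chosen to be consistent with the descriptive statistics above (the raw data are not listed), and the helper `add_error` with its alternating error pattern is hypothetical.

```python
import math

def z_scores(values):
    """Standardize using the N-denominator SD (N is used throughout)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Pearson's r as the average cross product of z scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# Assumed data (not taken from the slides): weight is an exact
# linear function of height, so every point falls on the line.
height = [60 + 2 * i for i in range(10)]   # 60, 62, ..., 78
weight = [5 * h - 190 for h in height]     # 110, 120, ..., 200

def add_error(y, size):
    """Hypothetical helper: add a fixed alternating error (+size, -size, ...)."""
    return [v + (size if i % 2 == 0 else -size) for i, v in enumerate(y)]

r_exact = pearson_r(height, weight)                       # no error in Y
r_small_error = pearson_r(height, add_error(weight, 4))   # leave X, add error to Y
r_large_error = pearson_r(height, add_error(weight, 12))  # add more error

print(round(r_exact, 2), round(r_small_error, 2), round(r_large_error, 2))
```

With these assumed values, r starts at 1 and drops as the error in Y grows, which is the pattern the slides illustrate.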
r = .99
Add more error.
r = .91
With 2 variables, the correlation is the z-score slope.

Review
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?

Sampling Distribution of r
Statistic is r, parameter is ρ (rho). In general, r is slightly biased.
[Figure: Sampling Distributions of r for rho = -.5, rho = 0, and rho = .5; relative frequency plotted against observed r]
The sampling variance is approximately:

$$\sigma_r^2 = \frac{(1 - \rho^2)^2}{N}$$

Sampling variance depends both on N and on ρ.

Empirical Sampling Distributions of the Correlation Coefficient
[Figure: box plots of observed r for four conditions: ρ = .5, N = 100; ρ = .5, N = 50; ρ = .7, N = 100; ρ = .7, N = 50]

Fisher’s r to z Transformation

$$z = .5 \ln\frac{1 + r}{1 - r}$$

r   .10   .20   .30   .40   .50   .60   .70   .80    .90
z   .10   .20   .31   .42   .55   .69   .87   1.10   1.47

[Figure: Fisher r to z Transformation; z (output) plotted against r (sample value input)]
Sampling distribution of z approaches normal as N increases. Pulls out short tail to make better (normal) distribution. Sampling variance of z = 1/(N-3) does not depend on ρ.

Hypothesis test: H0: ρ = 0

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

Result is compared to t with (N-2) df for significance. Say r = .25, N = 100.

$$t = \frac{.25\sqrt{98}}{\sqrt{1 - .25^2}} = \frac{.25(9.899)}{.968} = 2.56$$

t(.05, 98) = 1.984. p < .05

Hypothesis test 2: H0: ρ = value

$$z = \frac{.5\log_e\frac{1 + r}{1 - r} - .5\log_e\frac{1 + \rho}{1 - \rho}}{\sqrt{1/(N - 3)}}$$

One sample z test where r is sample value and ρ is hypothesized population value. Say N = 200, r = .54, and ρ is .30.
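The Fisher transformation and the first two hypothesis tests can be checked numerically. The following is a minimal Python sketch using only the formulas stated above; the function names are illustrative choices, not something from the slides.

```python
import math

def fisher_z(r):
    """Fisher r to z transformation: z = .5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def t_for_zero_rho(r, n):
    """t statistic (N - 2 df) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def z_for_hypothesized_rho(r, rho0, n):
    """One-sample z test of H0: rho = rho0, using SE = sqrt(1 / (N - 3))."""
    standard_error = math.sqrt(1 / (n - 3))
    return (fisher_z(r) - fisher_z(rho0)) / standard_error

# Reproduce the r-to-z table for r = .10, .20, ..., .90.
table = [(r / 10, round(fisher_z(r / 10), 2)) for r in range(1, 10)]

# Worked examples from the slides.
t_stat = t_for_zero_rho(0.25, 100)                 # r = .25, N = 100
z_stat = z_for_hypothesized_rho(0.54, 0.30, 200)   # r = .54, rho = .30, N = 200

print(table)
print(round(t_stat, 2), round(z_stat, 2))
```

Carrying full precision through the one-sample test gives a z slightly different from a hand calculation with rounded intermediate values, but the conclusion (z well beyond 1.96) is the same.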
$$z = \frac{.5\log_e\frac{1 + .54}{1 - .54} - .5\log_e\frac{1 + .30}{1 - .30}}{\sqrt{1/(200 - 3)}} = \frac{.60 - .31}{.07} = 4.13$$

Compare to unit normal, e.g., 4.13 > 1.96 so it is significant. We reject the hypothesis that our sample was drawn from a population in which rho is .30.

Hypothesis test 3: H0: ρ1 = ρ2
Testing equality of correlations from 2 INDEPENDENT samples.

$$z = \frac{.5\log_e\frac{1 + r_1}{1 - r_1} - .5\log_e\frac{1 + r_2}{1 - r_2}}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}}$$

Say N1 = 150, r1 = .63, N2 = 175, r2 = .70.

$$z = \frac{.5\log_e\frac{1 + .63}{1 - .63} - .5\log_e\frac{1 + .70}{1 - .70}}{\sqrt{1/(150 - 3) + 1/(175 - 3)}} = \frac{.74 - .87}{.11} = -1.18, \text{ n.s.}$$

Hypothesis test 4: H0: ρ1 = ρ2 = ... = ρk
Testing equality of any number of independent correlations.

$$\bar{z} = \frac{\sum_{i=1}^{k} (n_i - 3) z_i}{\sum_{i=1}^{k} (n_i - 3)}, \qquad Q = \sum_{i=1}^{k} (n_i - 3)(z_i - \bar{z})^2$$

Compare Q to chi-square with k-1 df.

Study   r    n     z     (n-3)z   zbar   (z-zbar)^2   (n-3)(z-zbar)^2
1       .2   200   .20   39.94    .41    .0441        8.69
2       .5   150   .55   80.75    .41    .0196        2.88
3       .6   75    .69   49.91    .41    .0784        5.64
sum          425         170.6                        17.21 = Q

Chi-square at .05 with 2 df = 5.99. Not all rho are equal.

Hypothesis test 5: dependent r, H0: ρ12 = ρ13
Hotelling-Williams test

$$t_{(N-3)} = (r_{12} - r_{13})\sqrt{\frac{(N - 1)(1 + r_{23})}{2\left(\frac{N - 1}{N - 3}\right)|R| + \bar{r}^2 (1 - r_{23})^3}}$$

$$\bar{r} = (r_{12} + r_{13})/2, \qquad |R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}$$

Say N = 101, r12 = .4, r13 = .6, r23 = .3.

$$\bar{r} = (.4 + .6)/2 = .5$$

$$|R| = 1 - .4^2 - .6^2 - .3^2 + 2(.4)(.6)(.3) = .534$$

$$t_{(98)} = (.4 - .6)\sqrt{\frac{(100)(1 + .3)}{2(100/98)(.534) + .5^2 (1 - .3)^3}} = -2.1$$

t(.05, 98) = 1.98, so |t| = 2.1 is significant.

H0: ρ12 = ρ34. See my notes.

Review
• What is the purpose of the Fisher r to z transformation?
• Test the hypothesis that ρ1 = ρ2
  – Given that r1 = .50, N1 = 103
  – r2 = .60, N2 = 128 and the samples are independent.
• Why do we care about the sampling distribution of the correlation coefficient?

Range Restriction/Enhancement

Reliability
Reliability sets the ceiling for validity. Measurement error attenuates correlations.

$$\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX'}\,\rho_{YY'}}$$

If the correlation between true scores is .7 and the reliabilities of X and Y are both .8, the observed correlation is .7*sqrt(.8*.8) = .7*.8 = .56.
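The worked examples in this part of the notes (hypothesis tests 3, 4, and 5, and the attenuation formula) can be checked with a short Python sketch. The function names are illustrative assumptions, and the code simply follows the formulas as stated above. Because it carries full precision, it gives about -1.12 for test 3 and about 17.09 for Q, slightly different from the hand-rounded slide values; the conclusions do not change.

```python
import math

def fisher_z(r):
    """Fisher r to z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_two_independent(r1, n1, r2, n2):
    """Test 3: z for H0: rho1 = rho2 with two independent samples."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error

def q_homogeneity(rs, ns):
    """Test 4: Q statistic (chi-square, k - 1 df) and the weighted mean z."""
    weights = [n - 3 for n in ns]
    zs = [fisher_z(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return q, z_bar

def williams_t(r12, r13, r23, n):
    """Test 5: Hotelling-Williams t (N - 3 df) for H0: rho12 = rho13."""
    r_bar = (r12 + r13) / 2
    det_r = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    denominator = 2 * ((n - 1) / (n - 3)) * det_r + r_bar ** 2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denominator)

def attenuated_r(true_r, rel_x, rel_y):
    """Observed correlation implied by a true-score correlation and two reliabilities."""
    return true_r * math.sqrt(rel_x * rel_y)

z3 = z_two_independent(0.63, 150, 0.70, 175)
q, z_bar = q_homogeneity([0.2, 0.5, 0.6], [200, 150, 75])
t5 = williams_t(0.4, 0.6, 0.3, 101)
observed = attenuated_r(0.7, 0.8, 0.8)

print(round(z3, 2), round(q, 2), round(z_bar, 2), round(t5, 2), round(observed, 2))
```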
Disattenuated correlation

$$\rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}$$

If our observed correlation is .56 and the reliabilities of both X and Y are .8, our estimate of the correlation between true scores is .56/.8 = .70.

Review
• What is range restriction? Range enhancement? What do they do to r?
• What is the effect of reliability on r?

SAS Power Estimation

    proc power;
      onecorr dist=fisherz
        corr = 0.35
        nullcorr = 0.2
        sides = 1
        ntotal = 100
        power = .;
    run;

Computed Power: Actual alpha = .05, Power = .486

    proc power;
      onecorr
        corr = 0.35
        nullcorr = 0
        sides = 2
        ntotal = .
        power = .8;
    run;

Computed N Total: Alpha = .05, Actual Power = .801, Ntotal = 61

Power for Correlations

Rho    N required against Null: rho = 0
.10    782
.15    346
.20    193
.25    123
.30    84
.35    61

Sample sizes required for powerful conventional significance tests for typical values of the correlation coefficient in psychology. Power = .8, two tails, alpha is .05.
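The sample sizes in the table can be approximated by hand from the Fisher z transformation. The sketch below is a rough normal-approximation check in Python, not the method SAS uses; the function name `approx_n_for_power` and the formula N ≈ ((z_alpha/2 + z_power) / z_rho)^2 + 3 are assumptions brought in for illustration.

```python
import math
from statistics import NormalDist

def approx_n_for_power(rho, power=0.80, alpha=0.05):
    """Approximate N for a two-tailed test of H0: rho = 0 (Fisher z approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    effect = math.atanh(rho)                        # Fisher z of the alternative rho
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)

# Values tabled in the slides (power = .8, two tails, alpha = .05).
tabled_n = {0.10: 782, 0.15: 346, 0.20: 193, 0.25: 123, 0.30: 84, 0.35: 61}

for rho, n_table in tabled_n.items():
    print(rho, n_table, approx_n_for_power(rho))
```

Under this approximation the computed N lands within about one of each tabled value; small differences are expected because the tabled values come from a different (more exact) computation.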