Measures of Association

The association between two continuous variables can be illustrated in a scatterplot.

[Figure: scatterplot titled "serum cotinine and attachment levels"; x-axis: serum cotinine (ng/mL), y-axis: mean attachment level (mm)]

This scatterplot depicts serum cotinine levels (cotinine is a metabolite of nicotine) and mean attachment levels of 30 current smokers. Each point indicates the two values for a single person. There appears to be a slight positive correlation; that is, higher cotinine levels seem to be associated with higher attachment levels. An "eyeball" assessment such as this can be quite subjective.

Pearson Correlation Coefficient

An objective, numerical measure of correlation between two characteristics measured on the same subjects is the Pearson correlation coefficient. For n subjects the data for the two variables can be arranged in pairs.

subject    Variable 1 (X)    Variable 2 (Y)
1          x1                y1
2          x2                y2
.          .                 .
.          .                 .
.          .                 .
n          xn                yn

The sample correlation coefficient, r, is computed by

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}, \]

which is an estimate of the population correlation coefficient,

\[ \rho = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y}, \]

where $\mu_x, \mu_y$ are the population means and $\sigma_x, \sigma_y$ the population standard deviations of X and Y. The correlation is always between -1 and 1 and is a measure of the linear association between X and Y.

Scatterplots illustrating a range of correlation coefficients

[Figure: six scatterplot panels with r = 0.78, r = -0.7, r = 0.4, r = -0.4, r = 0.18, r = 0.01]

Interpretation of the correlation coefficient

1. Correlation is unitless; that is, it is not affected by changes of location or scale (multiplying one variable by a negative constant only flips the sign of r).
2. r and ρ are always between -1 and 1.
3. r > 0 says Y tends to ↑ when X ↑; r < 0 says Y tends to ↓ when X ↑ (same thing for ρ).
4. X and Y independent → r ≈ 0, ρ = 0.
5. r ≈ 0, ρ = 0 → no linear relationship between X and Y.
6. If r = 1 or -1 then the points (x, y) lie on a straight line (perfect linear relationship).
7. "Strength" of linear association is indicated by the magnitude of r. Closer to ±1 indicates stronger linear association.
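The formula for r and a few of the properties listed above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function name pearson_r and the small data set are made up for illustration, and only the standard library is assumed.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means.
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squares.
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative data (NOT the cotinine data from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_xy = pearson_r(x, y)
# Property 1: shifting X and rescaling it by a positive factor leaves r unchanged.
r_rescaled = pearson_r([10 * a + 3 for a in x], y)
# Property 6: points lying exactly on a line with positive slope give r = 1.
r_line = pearson_r(x, [2 * a + 1 for a in x])

print(r_xy, r_rescaled, r_line)
```

Note that the sketch divides by zero if either variable is constant, in which case r is undefined.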
Two Extreme Examples

[Figure: two scatterplot panels of y against x, one with r = -1 and one with r = 1]

r = 1 or -1 means a perfect linear relationship.

[Figure: scatterplot of y against x with r = 0]

Points with a perfect association but r = 0, because the association is not a linear association.

Inference for correlation coefficients

If ρ = 0 (and X or Y is Normal), then the statistic

\[ t = \sqrt{n-2}\,\frac{r}{\sqrt{1-r^2}} \]

has an approximate $t_{n-2}$ distribution. Thus we can use it to carry out a hypothesis test of H0: ρ = 0 vs H1: ρ ≠ 0.

Example: Attachment level and serum cotinine

The sample correlation coefficient between the serum cotinine levels and mean attachment levels for 30 current smokers is r = 0.498. To test whether there is good evidence that the true correlation, ρ, is different from zero, we compute the statistic

\[ t = \sqrt{30-2}\,\frac{0.498}{\sqrt{1-0.498^2}} = 3.04, \]

which is greater than $t_{28,\,.975} = 2.048$, so we reject H0 at the α = .05 level. The p-value for the test is $P(|t_{28}| > 3.04) = 0.005$ (from Excel).

Spearman Rank Correlation

One problem with the Pearson correlation coefficient is that, like the sample mean and standard deviation, it can be unduly influenced by outliers (extreme values). The Spearman rank correlation coefficient, $r_s$, is an alternative measure of association that is more robust (less likely to be influenced by a small number of outliers). It is calculated simply by using the Pearson correlation coefficient formula, but applying it to the ranks of X and Y instead.
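The recipe just described (rank each variable, then apply the Pearson formula to the ranks) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the helper names average_ranks, pearson_r and spearman_rs are hypothetical, the data are made up, and tied values are given the average of the ranks they span, which is a common convention assumed here.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def average_ranks(values):
    """Ranks from 1 (smallest) to n; ties share the average rank (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson's formula applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up data: five points on an increasing line plus one extreme outlier.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, -100]

r = pearson_r(x, y)     # dominated by the single outlier
rs = spearman_rs(x, y)  # the outlier counts only as "the smallest y"
print(r, rs)
```

In this toy data set the single extreme value makes the Pearson coefficient negative, while the rank-based coefficient stays slightly positive, because ranking caps how far any one observation can sit from the rest.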
Example: Clinical trial studying leptin and proinflammatory cytokines, before and after hypo-caloric diet

[Figure: scatterplot; x-axis: Change in TNF receptor 55, y-axis: Change in Leptin]

TNFα       leptin      rank(TNFα)   rank(leptin)
  59.1       89.80     15           14
 115.9      -49.25     16            5
 -67.8      188.55      4           18
 -67.7      -23.10      5            7
 660.6     -519.65     19            1
 148.3      102.90     17           15
  50.2      -29.55     14            6
 154.8      -83.20     18            3
 -93.7      -60.00      2            4
  22.4      -12.65     11           11
 -19.3      255.40      7           19
  -5.8       27.90      8           13
   9.4      -15.60     10           10
  36.3      186.19     13           17
  23.6      -23.10     12            7
 -91.6     -111.95      3            2
-121.5       -8.52      1           12
   0.6      165.75      9           16
 -54.3      -18.50      6            9

Pearson correlation of the raw values: r = -0.67
Spearman correlation (Pearson correlation of the ranks): $r_s$ = -0.16

If n > 10, we can test the hypothesis H0: $\rho_s$ = 0 vs H1: $\rho_s$ ≠ 0 using the same procedure, with $r_s$ in place of r:

\[ t_s = \sqrt{19-2}\,\frac{-0.16}{\sqrt{1-(-0.16)^2}} = -0.67, \]

so the p-value is $P(|t_{17}| > |-0.67|) = 0.51$. Compare with the corresponding values for the Pearson r: t = -3.72, p-value < 0.002.
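The test statistics quoted in these examples can be reproduced from r and n alone. The sketch below is an illustration, not the method used in the notes (which cite Excel for the p-value): the function names are hypothetical, and the two-sided p-value is approximated by integrating the t density with Simpson's rule so that only the Python standard library is needed.

```python
import math


def t_statistic(r, n):
    """t = sqrt(n - 2) * r / sqrt(1 - r^2), referred to a t distribution with n - 2 df."""
    return math.sqrt(n - 2) * r / math.sqrt(1 - r * r)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Approximate P(|T_df| > |t|) using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * (h / 3) * total  # P(|T| <= |t|), using symmetry about 0
    return 1 - central


# (label, correlation, sample size) taken from the examples in the notes.
examples = [
    ("cotinine vs attachment, Pearson r", 0.498, 30),   # notes report t = 3.04, p = 0.005
    ("TNF vs leptin, Spearman r_s", -0.16, 19),         # notes report t = -0.67, p = 0.51
    ("TNF vs leptin, Pearson r", -0.67, 19),            # notes report t = -3.72, p < 0.002
]
for label, r, n in examples:
    t = t_statistic(r, n)
    p = two_sided_p(t, n - 2)
    print(f"{label}: t = {t:.2f}, df = {n - 2}, two-sided p = {p:.4f}")
```

Because the inputs are the rounded correlations printed in the notes, the computed values can differ from the reported ones in the last digit.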