This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2006, The Johns Hopkins University and Karl W. Broman. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Fathers’ and daughters’ heights

[Figure: histograms of height (inches). Fathers’ heights: mean = 67.7, SD = 2.8. Daughters’ heights: mean = 63.8, SD = 2.7. Pearson and Lee (1906) Biometrika 2:357-462; 1376 pairs.]

Fathers’ and daughters’ heights

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches); corr = 0.52. Pearson and Lee (1906) Biometrika 2:357-462; 1376 pairs.]

Covariance and correlation

Let X and Y be random variables with µX = E(X), µY = E(Y), σX = SD(X), σY = SD(Y).

For example, sample a father/daughter pair and let X = the father’s height and Y = the daughter’s height.

Covariance: cov(X, Y) = E{(X − µX)(Y − µY)}
cov(X, Y) can be any real number.

Correlation: cor(X, Y) = cov(X, Y) / (σX σY)
−1 ≤ cor(X, Y) ≤ 1

Examples

[Figure: nine scatterplots of Y, with corr = 0, 0.1, −0.1, 0.3, 0.5, −0.5, 0.7, 0.9, −0.9.]

Estimated correlation

Consider n pairs of data: (x1, y1), (x2, y2), (x3, y3), . . . , (xn, yn)

We consider these as independent draws from some bivariate distribution.

We estimate the correlation in the underlying distribution by:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$

This is sometimes called the correlation coefficient.

Correlation measures linear association

[Figure: three scatterplots.]
All three plots have correlation ≈ 0.7!

Fathers’ and daughters’ heights

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches); corr = 0.52.]

Linear regression

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]

Regression line

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]
Slope = r × SD(Y) / SD(X)

SD line

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]
Slope = SD(Y) / SD(X)

SD line vs regression line

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]
Both lines go through the point (X̄, Ȳ).

Predicting father’s ht from daughter’s ht

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]

There are two regression lines!

[Figure: scatterplot of daughter’s height (inches) against father’s height (inches).]
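The estimate r and the two regression lines can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function and variable names are hypothetical, the numbers are the summary values quoted on the slides (means 67.7 and 63.8, SDs 2.8 and 2.7, corr = 0.52), and the slope for predicting father's height from daughter's height is taken to be r × SD(X) / SD(Y), which is the slide's regression slope with the roles of X and Y swapped.

```python
import math


def correlation(xs, ys):
    """Estimated correlation r for paired data (x1, y1), ..., (xn, yn)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, n >= 2")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Summary values quoted on the slides (X = father, Y = daughter, inches).
MEAN_X, SD_X = 67.7, 2.8
MEAN_Y, SD_Y = 63.8, 2.7
R = 0.52

# Regression line for predicting Y from X: slope = r * SD(Y) / SD(X).
slope_y_on_x = R * SD_Y / SD_X

# SD line: slope = SD(Y) / SD(X).
slope_sd_line = SD_Y / SD_X

# Assumption: predicting X from Y swaps the roles of X and Y,
# so the slope (inches of father per inch of daughter) is r * SD(X) / SD(Y).
slope_x_on_y = R * SD_X / SD_Y


def predict_daughter(father_height):
    """Predict daughter's height; the line passes through (MEAN_X, MEAN_Y)."""
    return MEAN_Y + slope_y_on_x * (father_height - MEAN_X)


def predict_father(daughter_height):
    """Predict father's height; this line also passes through (MEAN_X, MEAN_Y)."""
    return MEAN_X + slope_x_on_y * (daughter_height - MEAN_Y)
```

With these values, a 72-inch father gives a predicted daughter's height of roughly 66.0 inches, and feeding that prediction back through `predict_father` gives roughly 68.9 inches rather than 72. The two predictions are not inverses of each other, which is one way to see that there are two regression lines.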