Descriptive measures of the strength of a linear association

r-squared and the (Pearson) correlation coefficient r

Translating a research question into a statistical procedure
• How strong is the linear relationship between skin cancer mortality and latitude?
  – (Pearson) correlation coefficient r
  – Coefficient of determination r²

Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use

Situation #1: A very weak linear relationship
[Fitted line plot: y versus x, for x from 0 to 10, with the fitted line ŷ]
y = 54.4758 - 0.764016 x
S = 7.81137   R-Sq = 6.5 %   R-Sq(adj) = 3.2 %
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 119.1 \]
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5 \]
\[ SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 1827.6 \]

Situation #2: A fairly strong linear relationship
[Fitted line plot: y versus x, for x from 0 to 10, with the fitted line ŷ]
y = 75.5458 - 5.76402 x
S = 7.81137   R-Sq = 79.9 %   R-Sq(adj) = 79.2 %
\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 6779.3 \]
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5 \]
\[ SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 8487.8 \]

Coefficient of determination r²
\[ r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \]
• r² is a number (a proportion!) between 0 and 1.
• If r² = 1:
  – all data points fall perfectly on the regression line
  – the predictor x accounts for all of the variation in y
• If r² = 0:
  – the fitted regression line is perfectly horizontal
  – the predictor x accounts for none of the variation in y

Interpretation of r²
• r² × 100 percent of the variation in y is reduced by taking into account predictor x.
• r² × 100 percent of the variation in y is “explained by” the variation in predictor x.
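As a quick numerical check of the formula for r², the sketch below recomputes R-Sq for the two situations from the reported SSE and SSTO. The function name `r_squared` and the dictionary layout are illustrative choices, not part of the slides. Because SSTO = SSR + SSE, the form 1 - SSE/SSTO gives the same value as SSR/SSTO.

```python
def r_squared(sse: float, ssto: float) -> float:
    """Coefficient of determination: r^2 = 1 - SSE/SSTO (equal to SSR/SSTO)."""
    return 1.0 - sse / ssto

# (SSE, SSTO) as reported for the two situations.
situations = {
    "Situation #1": (1708.5, 1827.6),
    "Situation #2": (1708.5, 8487.8),
}

for name, (sse, ssto) in situations.items():
    # Express r^2 as a percentage to compare with Minitab's R-Sq.
    print(f"{name}: R-Sq = {100 * r_squared(sse, ssto):.1f} %")
# Situation #1: R-Sq = 6.5 %
# Situation #2: R-Sq = 79.9 %
```

The two situations share the same SSE; only the total variation SSTO differs, which is why the same scatter about the line yields such different r² values.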
R-sq in Minitab fitted line plot
[Fitted line plot: Mortality versus Latitude (at center of state)]
Mort = 389.189 - 5.97764 Lat
S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %

R-sq in Minitab regression output
The regression equation is
Mort = 389.189 - 5.97764 Lat

S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %

Analysis of Variance
Source        DF        SS        MS         F      P
Regression     1   36464.2   36464.2   99.7968  0.000
Error         47   17173.1     365.4
Total         48   53637.3

Pearson correlation coefficient r
If r² is represented in decimal form, e.g. 0.39 or 0.87, then:
\[ r = \pm\sqrt{r^2} \]
• r is a (unitless) number between -1 and 1, inclusive.
• Sign of coefficient of correlation
  – plus sign if slope of fitted regression line is positive
  – negative sign if slope of fitted regression line is negative

Formulas for the Pearson correlation coefficient r
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
\[ r = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \times b_1 \]

What do we learn from the formulas for r?
• The correlation coefficient r gets its sign from the slope b1.
• The correlation coefficient r is a unitless measure.
• The correlation coefficient r = 0 when the estimated slope b1 = 0 and vice versa.

Interpretation of Pearson correlation coefficient r
• There is no nice practical interpretation for r as there is for r².
• r = -1 is perfect negative linear relationship.
• r = 1 is perfect positive linear relationship.
• r = 0 is no linear relationship.
• For other r, how strong the relationship between x and y is deemed depends on the research area.

Pearson correlation coefficient r in Minitab
Correlations: Lat, Mort
Pearson correlation of Lat and Mort = -0.825

Correlations: Mort, Lat
Pearson correlation of Mort and Lat = -0.825

How strong is the linear relationship between Celsius and Fahrenheit?
[Fitted line plot: Fahrenheit versus Celsius]
Fahrenheit = 32 + 1.8 Celsius
S = 0   R-Sq = 100.0 %   R-Sq(adj) = 100.0 %
Pearson correlation of Celsius and Fahrenheit = 1.000

How strong is the linear relationship between # of stories and height?
[Fitted line plot: HEIGHT versus STORIES]
HEIGHT = 90.3096 + 11.2924 STORIES
S = 58.3259   R-Sq = 90.4 %   R-Sq(adj) = 90.2 %
Pearson correlation of HEIGHT and STORIES = 0.951

How strong is the linear relationship between driver age and see distance?
[Fitted line plot: Distance versus DrivAge]
Distance = 576.682 - 3.00684 DrivAge
S = 49.7616   R-Sq = 64.2 %   R-Sq(adj) = 62.9 %
Pearson correlation of Distance and DrivAge = -0.801

How strong is the linear relationship between height and g.p.a.?
[Fitted line plot: gpa versus height]
gpa = 3.41021 - 0.0065630 height
S = 0.542316   R-Sq = 0.3 %   R-Sq(adj) = 0.0 %
Pearson correlation of height and gpa = -0.053

Caution #1
• The correlation coefficient r quantifies the strength of a linear relationship.
• It is possible to get r = 0 with a perfect curvilinear relationship.

Example of Caution #1
[Fitted line plot: y versus x, with the fitted line ŷ]
y = 14 - 0.0000000 x
S = 13.4907   R-Sq = 0.0 %   R-Sq(adj) = 0.0 %
Pearson correlation of x and y = 0.000

Clarification of Caution #1
[Fitted quadratic plot: y versus x]
y = 0.0000000 - 0.0000000 x + 1 x**2
S = 0   R-Sq = 100.0 %   R-Sq(adj) = 100.0 %
Pearson correlation of x and y = 0.000

Caution #2
• A large r² value should not be interpreted as meaning that the estimated regression line fits the data well.
• Another function might better describe the trend in the data.
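The two extremes above, r = 1 for Celsius and Fahrenheit and r = 0 for the perfect quadratic of Caution #1, can be reproduced with a minimal sketch that computes Pearson's r directly from deviations about the means. The helper `pearson_r` and both data grids are assumptions made for illustration, since the slides do not list the plotted points. The integer grid from -6 to 6 was chosen because its mean of x² is 14, matching the fitted intercept shown.

```python
import math

def pearson_r(x, y):
    """Pearson r = Sxy / sqrt(Sxx * Syy), using deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Exact linear relationship: Fahrenheit = 32 + 1.8 * Celsius.
celsius = [0, 10, 20, 30, 40, 50]            # assumed grid, not the slide's data
fahrenheit = [32 + 1.8 * c for c in celsius]

# Exact curved relationship: y = x^2 on a grid symmetric about 0.
x = list(range(-6, 7))                       # assumed grid, symmetric about 0
y = [xi ** 2 for xi in x]

print(f"Celsius vs Fahrenheit: r = {pearson_r(celsius, fahrenheit):.3f}")
print(f"x vs x^2:              r = {pearson_r(x, y):.3f}")
```

In the quadratic case the positive and negative cross-products cancel exactly, so r is 0 even though y is completely determined by x.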
Example of Caution #2
[Fitted line plot: US Population (millions) versus Year]
USPopn = -2217.46 + 1.21862 Year
S = 22.8349   R-Sq = 92.0 %   R-Sq(adj) = 91.6 %
Pearson correlation of Year and USPopn = 0.959

Caution #3
• The coefficient of determination r² and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).

Example of Caution #3
[Fitted line plot: Deaths versus Magnitude]
Deaths = -1121.94 + 179.468 Magnitude
S = 140.359   R-Sq = 53.5 %   R-Sq(adj) = 41.9 %
Pearson correlation of Deaths and Magnitude = 0.732

Example of Caution #3
[Fitted line plot: Deaths versus Magnitude]
Deaths = 647.967 - 87.1465 Magnitude
S = 13.1447   R-Sq = 92.1 %   R-Sq(adj) = 89.4 %
Pearson correlation of Deaths and Magnitude = -0.960

Caution #4
• Correlation (association) does not imply causation.

Example of Caution #4
[Fitted line plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year)]
Heart = 260.563 - 22.9688 Wine
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %
Pearson correlation of Wine and Heart = -0.843

Caution #5
• Ecological correlations are correlations that are based on rates or averages.
• Ecological correlations tend to overstate the strength of an association.

Example of Caution #5
• Data from 1988 Current Population Survey
• Treating individuals as the units
  – Correlation between income and education for men age 25-64 in U.S. is r ≈ 0.4.
• Treating nine regions as the units
  – Compute average income and average education for men age 25-64 in each of the nine regions.
  – Correlation between the average incomes and the average education in U.S. is r ≈ 0.7.
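The averaging effect behind Caution #5 can be seen in a small sketch. The numbers below are hypothetical toy values invented purely for illustration. They are not the Current Population Survey data, and the effect is deliberately exaggerated. Three made-up regions each contain three individuals; within a region, income scatters widely around the regional average, but the regional averages themselves fall on a straight line.

```python
import math

def pearson_r(x, y):
    """Pearson r = Sxy / sqrt(Sxx * Syy), using deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data (NOT survey data): region -> (education, income).
regions = {
    "A": ([10, 12, 14], [35, 20, 35]),
    "B": ([12, 14, 16], [45, 30, 45]),
    "C": ([14, 16, 18], [55, 40, 55]),
}

# Individuals as the units: pool every person from every region.
edu_all = [e for edu, _ in regions.values() for e in edu]
inc_all = [i for _, inc in regions.values() for i in inc]
r_individuals = pearson_r(edu_all, inc_all)

# Regions as the units: correlate the regional averages only.
edu_means = [sum(edu) / len(edu) for edu, _ in regions.values()]
inc_means = [sum(inc) / len(inc) for _, inc in regions.values()]
r_regions = pearson_r(edu_means, inc_means)

print(f"individuals as units: r = {r_individuals:.3f}")  # r = 0.535
print(f"regions as units:     r = {r_regions:.3f}")      # r = 1.000
```

Averaging removes the person-to-person scatter within each region, so the correlation of the averages is stronger than the correlation among individuals.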
Example of Caution #5
[Fitted line plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year)]
Heart = 260.563 - 22.9688 Wine
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %

Example of Caution #5
[Fitted line plot: Mortality versus Latitude (at center of state)]
Mort = 389.189 - 5.97764 Lat
S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %

Caution #6
• A “statistically significant” r² does not imply that the slope β1 is meaningfully different from 0.

Caution #7
• A large r² does not necessarily mean that a useful prediction of the response y_new (or estimation of the mean response μ_Y) can be made.
• It is still possible to get prediction (or confidence) intervals that are too wide to be useful.

Using the sample correlation r to learn about the population correlation ρ

Translating a research question into a statistical procedure
• Is there a linear relationship between skin cancer mortality and latitude?
  – t-test for testing H0: β1 = 0
  – ANOVA F-test for testing H0: β1 = 0
• Is there a linear correlation between husband’s age and wife’s age?
  – t-test for testing population correlation coefficient H0: ρ = 0

Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use

Is there a linear correlation between husband’s age and wife’s age?
[Scatter plot: Husband's Age (years) versus Wife's Age (years)]
Pearson correlation of HAge and WAge = 0.939

Is there a linear correlation between husband’s age and wife’s age?
[Scatter plot: Wife's Age (years) versus Husband's Age (years)]
Pearson correlation of WAge and HAge = 0.939

The formal t-test for correlation coefficient ρ
Null hypothesis H0: ρ = 0
Alternative hypothesis HA: ρ ≠ 0 or ρ < 0 or ρ > 0
Test statistic
\[ t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
P-value = What is the probability that we’d get a t* statistic as extreme as we did, if the null hypothesis is true?
The P-value is determined by comparing t* to a t distribution with n-2 degrees of freedom.

Is there a linear correlation between husband’s age and wife’s age?
Test statistic:
\[ t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.939\sqrt{170-2}}{\sqrt{1-0.939^2}} = 35.39 \]
Help in determining the P-value:
Student's t distribution with 168 DF
      x   P( X <= x )
35.3900        1.0000

Just let Minitab do the work:
Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

When is it okay to use the t-test for testing H0: ρ = 0?
• When it is not obvious which variable is the response.
• When the (x, y) pairs are a random sample from a bivariate normal population.
  – For each x, the y’s are normal with equal variances.
  – For each y, the x’s are normal with equal variances.
  – Either, y can be considered a linear function of x.
  – Or, x can be considered a linear function of y.
• The (x, y) pairs are independent.

The three tests will always yield similar results.
The regression equation is
HAge = 3.59 + 0.967 WAge

170 cases used 48 cases contain missing values

Predictor      Coef   SE Coef      T      P
Constant      3.590     1.159   3.10  0.002
WAge        0.96670   0.02742  35.25  0.000

S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source        DF      SS      MS        F      P
Regression     1   20577   20577  1242.51  0.000
Error        168    2782      17
Total        169   23359

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

The three tests will always yield similar results.
The regression equation is
WAge = 1.57 + 0.911 HAge

170 cases used 48 cases contain missing values

Predictor      Coef   SE Coef      T      P
Constant      1.574     1.150   1.37  0.173
HAge        0.91124   0.02585  35.25  0.000

S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source        DF      SS      MS        F      P
Regression     1   19396   19396  1242.51  0.000
Error        168    2623      16
Total        169   22019

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000

In both outputs the slope T of 35.25 squares to the ANOVA F of 1242.51 (up to rounding), and it differs from the correlation t* of 35.39 only because r was rounded to 0.939 before t* was computed.

Which results should I report?
• If one of the variables can be clearly identified as the response, report the t-test or F-test results for testing H0: β1 = 0.
  – Does it make sense to use x to predict y?
• If it is not obvious which variable is the response, report the t-test results for testing H0: ρ = 0.
  – Does it only make sense to look for an association between x and y?
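For reference, the test statistic for H0: ρ = 0 in the husband and wife age example can be recomputed from the reported r and n with a short sketch. The function name `correlation_t` is an illustrative choice, not something from the slides. The P-value itself requires the t distribution with n-2 degrees of freedom, which this sketch does not compute.

```python
import math

def correlation_t(r: float, n: int) -> float:
    """Test statistic t* = r * sqrt(n - 2) / sqrt(1 - r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

# Values reported in the Minitab output for WAge and HAge.
r, n = 0.939, 170
t_star = correlation_t(r, n)
print(f"t* = {t_star:.2f} on {n - 2} degrees of freedom")
# t* = 35.39 on 168 degrees of freedom

# Slope t-statistic and ANOVA F reported in the regression output.
t_slope, f_anova = 35.25, 1242.51
print(f"t_slope^2 = {t_slope ** 2:.2f} versus F = {f_anova}")
# t_slope^2 = 1242.56 versus F = 1242.51
```

The squared slope statistic matches the ANOVA F up to rounding in the printed output, which is consistent with the three tests leading to the same conclusion.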