Chapter 15
Describing Relationships: Regression, Prediction, and Causation

Linear Regression
• Objective: to quantify the linear relationship between an explanatory variable and a response variable. We can then predict the average response for all subjects with a given value of the explanatory variable.
• Regression equation: y = a + bx
  – x is the value of the explanatory variable
  – y is the average value of the response variable
  – note that a and b are just the intercept and slope of a straight line
  – note that r and b are not the same thing, but their signs will agree

Thought Question 1
How would you draw a line through the points? How do you determine which line “fits best”?
[Scatterplot of data points]

Linear Equations
Y = mX + b
• m = slope (change in Y over change in X)
• b = Y-intercept

The Linear Model
• Remember from algebra that a straight line can be written as: y = mx + b
• In statistics we use a slightly different notation: ŷ = b₀ + b₁x
• We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.

Fat Versus Protein: An Example
• The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:
[Scatterplot: total fat (g) versus protein (g)]

Residuals
• The model won’t be perfect, regardless of the line we draw.
• Some points will be above the line and some will be below.
• The estimate made from a model is the predicted value (denoted ŷ).

Residuals (cont.)
• The difference between the observed value and its associated predicted value is called the residual.
• To find the residuals, we always subtract the predicted value from the observed one:
  residual = observed − predicted = y − ŷ

Residuals (cont.)
• A negative residual means the predicted value is too big (an overestimate).
• A positive residual means the predicted value is too small (an underestimate).
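The residual computation above can be sketched in a few lines of Python. The line and the data points below are made up purely for illustration, not taken from the slides:

```python
# Residuals: residual = observed - predicted (y - y_hat).
# The line y_hat = 2 + 3x and the three data points are assumed toy values.
a, b = 2.0, 3.0                  # assumed intercept and slope
xs = [1.0, 2.0, 3.0]             # explanatory values
ys = [5.5, 7.0, 11.5]            # observed responses

preds = [a + b * x for x in xs]                   # predicted values y_hat
residuals = [y - p for y, p in zip(ys, preds)]    # observed minus predicted

for x, p, res in zip(xs, preds, residuals):
    sign = "overestimate" if res < 0 else "underestimate"
    print(f"x={x}: predicted {p}, residual {res:+.1f} ({sign})")
```

The signs match the slide: the middle point sits below the line, so its prediction is an overestimate (negative residual); the other two predictions are underestimates.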
“Best Fit” Means Least Squares
• Some residuals are positive, others are negative, and, on average, they cancel each other out.
• So, we can’t assess how well the line fits by adding up all the residuals.
• Similar to what we did with deviations, we square the residuals and add the squares.
• The smaller the sum, the better the fit.
• The line of best fit is the line for which the sum of the squared residuals is smallest.

Least Squares
• Used to determine the “best” line.
• We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict).
• Least squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line.

The Linear Model (cont.)
• We write b₁ and b₀ for the slope and intercept of the line. The b’s are called the coefficients of the linear model.
• The coefficient b₁ is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient b₀ is the intercept, which tells us where the line hits (intercepts) the y-axis.

The Least Squares Line
• In our model, we have a slope (b₁):
  – The slope is built from the correlation and the standard deviations: b₁ = r(s_y / s_x)
  – Our slope is always in units of y per unit of x.
  – The slope has the same sign as the correlation coefficient.

The Least Squares Line (cont.)
• In our model, we also have an intercept (b₀):
  – The intercept is built from the means and the slope: b₀ = ȳ − b₁x̄
  – Our intercept is always in units of y.

Example
Fill in the missing information in the table below:
[Table omitted]

Interpretation of the Slope and Intercept
• The slope indicates the amount by which ŷ changes when x changes by one unit.
• The intercept is the value of ŷ when x = 0. It is not always meaningful.

Example
The regression line for the Burger King data is fat = 6.8 + 0.97·protein. Interpret the slope and the intercept.
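The slope and intercept formulas above (b₁ = r·s_y/s_x and b₀ = ȳ − b₁x̄) can be checked numerically. The data below are an assumed toy set, not the Burger King data:

```python
from math import sqrt

# Assumed toy data (not from the slides), just to exercise the formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

n = len(xs)
mx = sum(xs) / n                                       # x-bar
my = sum(ys) / n                                       # y-bar
sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))    # s_x
sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))    # s_y

# Correlation coefficient r.
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

b1 = r * sy / sx      # slope: b1 = r * (s_y / s_x)
b0 = my - b1 * mx     # intercept: b0 = y-bar - b1 * x-bar
print(f"r = {r:.2f}, slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
```

Equivalently, b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², the usual least-squares slope; the correlation form above gives the same line and makes the “same sign as r” property obvious.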
Slope: For every one-gram increase in protein, the predicted fat content increases by 0.97 g.
Intercept: A BK item with 0 g of protein is predicted to contain 6.8 g of fat.

Thought Question 2
From a long-term study on several families, researchers constructed a scatterplot of the cholesterol level of a child at age 50 versus the cholesterol level of the father at age 50. You know the cholesterol level of your best friend’s father at age 50. How could you use this scatterplot to predict what your best friend’s cholesterol level will be at age 50?

Predictions
In predicting a value of y based on some given value of x:
1. If there is not a linear correlation, the best predicted y-value is ȳ.
2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.

Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well:
  – The equation is fat = 6.8 + 0.97·protein.
  – The predicted fat content for a BK Broiler chicken sandwich that contains 30 g of protein is 6.8 + 0.97(30) = 35.9 grams of fat.

Prediction via Regression Line
Husband and Wife: Ages (Hand, et al., A Handbook of Small Data Sets, London: Chapman and Hall)
• The regression equation is ŷ = 3.6 + 0.97x, where ŷ is the average age of all husbands who have wives of age x.
• For all women aged 30, we predict the average husband age to be 3.6 + (0.97)(30) = 32.7 years.
• Suppose we know that an individual wife’s age is 30. What would we predict her husband’s age to be?

The Least Squares Line (cont.)
• Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
  – Quantitative Variables Condition
  – Straight Enough Condition
  – Outlier Condition

Guidelines for Using the Regression Equation
1. If there is no linear correlation, don’t use the regression equation to make predictions.
2.
When using the regression equation for predictions, stay within the scope of the available sample data.
3. A regression equation based on old data is not necessarily valid now.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.

Definitions
• Marginal change – refers to the slope; the amount the response variable changes when the explanatory variable changes by one unit.
• Outlier – a point lying far away from the other data points.
• Influential point – an outlier that has the potential to change the regression line.

Residuals Revisited
• Residuals help us to see whether the model makes sense.
• When a regression model is appropriate, nothing interesting should be left behind.
• After we fit a regression model, we usually plot the residuals in the hope of finding… nothing.

Residual Plot Analysis
• If a residual plot does not reveal any pattern, the regression equation is a good representation of the association between the two variables.
• If a residual plot reveals some systematic pattern, the regression equation is not a good representation of the association between the two variables.

Residuals Revisited (cont.)
• The residuals for the BK menu regression look appropriately boring:
[Residual plot]

Coefficient of Determination (R²)
• Measures the usefulness of regression prediction.
• R² (or r², the square of the correlation) measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line.
  – r = 1: R² = 1: the regression line explains all (100%) of the variation in y
  – r = .7: R² = .49: the regression line explains almost half (about 50%) of the variation in y

R² (cont.)
• Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
• Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.

A Caution: Beware of Extrapolation
• Sarah’s height was plotted against her age.
• Can you predict her height at age 42 months?
• Can you predict her height at age 30 years (360 months)?
[Scatterplot: height (cm) versus age (months), ages 30–65 months]

A Caution: Beware of Extrapolation
• Regression line: ŷ = 71.95 + 0.383x
• Height at age 42 months? ŷ = 88 cm.
• Height at age 30 years? ŷ = 209.8 cm.
  – She is predicted to be 6′ 10.5″ at age 30.
[Scatterplot with the regression line extended from 30 to 390 months]

Correlation Does Not Imply Causation
Even very strong correlations may not correspond to a real causal relationship.

Evidence of Causation
• A properly conducted experiment establishes the connection.
• Other considerations:
  – A reasonable explanation for a cause and effect exists
  – The connection happens in repeated trials
  – The connection happens under varying conditions
  – Potential confounding factors are ruled out
  – The alleged cause precedes the effect in time

Evidence of Causation
• An observed relationship can be used for prediction without worrying about causation as long as the patterns found in past data continue to hold true.
• We must make sure that the prediction makes sense.
• We must be very careful of extreme extrapolation.
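The extrapolation caution above can be made concrete with the fitted line from Sarah’s height example (ŷ = 71.95 + 0.383x); the helper-function name here is ours:

```python
# Sarah's regression line from the slides, fit on ages roughly 30-65 months:
#   height (cm) = 71.95 + 0.383 * age (months)
def predicted_height(age_months):
    return 71.95 + 0.383 * age_months

inside = predicted_height(42)     # within the range of the data used to fit
outside = predicted_height(360)   # far outside it: age 30 years

print(f"age 42 months  -> {inside:.1f} cm")   # plausible
print(f"age 360 months -> {outside:.1f} cm")  # absurd: about 6 ft 10.5 in
```

Inside the observed range the line gives a reasonable 88 cm; pushed to 360 months it predicts roughly 209.8 cm, which is why predictions should stay within the scope of the sample data.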
Reasons Two Variables May Be Related (Correlated)
• Explanatory variable causes change in response variable.
• Response variable causes change in explanatory variable.
• Explanatory variable may be a cause, but not the sole cause, of changes in the response variable.
• Confounding variables may exist.
• Both variables may result from a common cause – such as both variables changing over time.
• The correlation may be merely a coincidence.

Response Causes Explanatory
• Explanatory: hotel advertising dollars
• Response: occupancy rate
• Positive correlation? Does more advertising lead to an increased occupancy rate?
• The actual correlation is negative: lower occupancy leads to more advertising.

Explanatory Is Not the Sole Contributor
• Explanatory: consumption of barbecued foods
• Response: incidence of stomach cancer
• Barbecued foods are known to contain carcinogens, but other lifestyle choices may also contribute.

Common Response (both variables change due to a common cause)
• Explanatory: divorce among men
• Response: percent abusing alcohol
• Both may result from an unhappy marriage.

Both Variables Are Changing Over Time
• Both divorces and suicides have increased dramatically since 1900.
• Are divorces causing suicides? Are suicides causing divorces?
• The population has increased dramatically since 1900 (causing both to increase).
• Better to investigate: has the rate of divorce or the rate of suicide changed over time?

The Relationship May Be Just a Coincidence
We will see some strong correlations (or apparent associations) just by chance, even when the variables are not related in the population.

Coincidence (?)
Vaccines and Brain Damage
• A required whooping cough vaccine was blamed for seizures that caused brain damage.
  – This led to reduced production of the vaccine (due to lawsuits).
• A study of 38,000 children found no evidence for the accusations (reported in the New York Times).
  – “People confused association with cause-and-effect.”
  – “Virtually every kid received the vaccine… it was inevitable that, by chance, brain damage caused by other factors would occasionally occur in a recently vaccinated child.”

Key Concepts
• Least squares regression equation
• R²
• Correlation does not imply causation
• Confirming causation
• Reasons variables may be correlated