Basic Practice of Statistics, 7th Edition
Lecture PowerPoint Slides

Preliminary results

Questions on the Chapter 4 Moodle Quiz?

What just happened in Spreadsheet Assignment 4?
How does this connect with SA5?

Most of research is…
(finding variance)
Explaining variance (prediction, correlation)
Explaining what is causing the variance (causation)

Article

What’s wrong?

F3LIVARR
1 = Lives by him/herself
2 = Lives in parent/guardian’s home
3 = Not in parents’ home; lives w/ spouse
4 = Not in parents’ home; lives w/ partner
5 = Not in parents’ home; lives w/ children
6 = Not in parents’ home; lives w/ sibling
7 = Not in parents’ home; lives w/ roommate/friend
8 = Other living arrangement

What’s wrong?

Let’s examine our data
Which variables have the lowest means? The highest means?
Which variables have the lowest standard deviation? The highest standard deviation?
Which pairs of variables have the strongest correlations (positive or negative)? The weakest correlations?
Which pairs of variables provide an interesting question to ask?
What are the limitations of our data collection?

Starter Question
We hear about the U.S. being a “violent” place to live, but how does it compare to the rest of the developed world in terms of serial killings?

Let’s find and interpret the regression line for your spreadsheets

Regression line
REVIEW OF STRAIGHT LINES
Suppose that $y$ is a response variable and $x$ is an explanatory variable. A straight line relating $y$ to $x$ has an equation of the form $y = a + bx$. The coefficient of $x$, $b$, is the slope, the amount by which $y$ changes when $x$ increases by one unit. The number $a$ is the intercept, the value of $y$ when $x = 0$.

Influential Observations
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
Results are questionable if they depend strongly on a few influential observations.
Chapter 5, #6: From a graph in Tania Singer et al., “Empathy for pain involves the affective but not sensory components of pain,” Science, 303 (2004), pp. 1157-1162. Figure 5.5, The Basic Practice of Statistics, © 2015 W. H. Freeman

Outliers and influential points
Empathy score and brain activity
From all of the data: $r^2 = 51.5\%$
After removing observation 16: $r^2 = 33.1\%$

Multiple Regression
Let’s take a shot at predicting your future salary (with some important caveats!)
By putting other variables into the model, we increase our overall predictive power ($R^2$), and we can “control” for variables to get a better sense of the unique relationship between two variables.

Least-squares regression
The distinction between explanatory and response variables is essential in regression.
There is a close connection between correlation and the slope of the least-squares line. The slope is $b = r \frac{s_y}{s_x}$.
The slope $b$ and correlation $r$ always have the same sign.
The least-squares regression line always passes through $(\bar{x}, \bar{y})$.
The square of the correlation, $r^2$, is the fraction of the variation in the values of $y$ that is explained by the least-squares regression of $y$ on $x$.

Evidence of causation
A properly conducted experiment may establish causation.
Other considerations when we cannot do an experiment:
The association is strong and consistent.
Control for lurking variables.
Higher doses are associated with stronger responses.
Alleged cause precedes the effect in time.
Alleged cause is plausible (reasonable explanation).

Cautions about correlation and regression
Correlation and regression lines describe only linear relationships.
Correlation and least-squares regression lines are not resistant.
Beware of ecological correlation: correlation based on averages rather than individuals.
Beware of extrapolation: predicting outside of the range of $x$.
Beware of lurking variables: these have an important effect on the relationship among the variables in a study, but are not included in the study.
Correlation does not imply causation!

Least Squares Regression Line
Why is the trendline through a scatterplot called a “least squares regression line”?

Regression line
A regression line is a straight line that describes how a response variable $y$ changes as an explanatory variable $x$ changes.
Example: Predict the gain in fat (in kg) based on the change in Non-Exercise Activity (NEA change, in calories).
If the NEA change is 400 calories, what is the expected fat gain?
This regression line describes the overall pattern of the relationship.

How can we explain differences in accuracy?

Basketball Regression

The least-squares regression line
LEAST-SQUARES REGRESSION LINE
The least-squares regression line of $y$ on $x$ is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Entry Slip Question
What’s it called when we predict a y-value for an x-value that is far outside of our range?
Extrapolation. (Example: Trying to predict salary from age. We studied people between ages 25 and 65, but now attempt to predict the salary of a 100-year-old woman using our same regression line.)

The least-squares regression line
EQUATION OF THE LEAST-SQUARES REGRESSION LINE
We have data on an explanatory variable $x$ and a response variable $y$ for $n$ individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables and their correlation $r$. The least-squares regression line is the line $\hat{y} = a + bx$, with slope $b = r \frac{s_y}{s_x}$ and intercept $a = \bar{y} - b\bar{x}$.

Prediction via regression line
For the non-exercise activity example, the least-squares regression line is: $\hat{y} = 3.5051 - 0.0034x$
Suppose we know someone has an increase of 400 calories of NEA. What would we predict for fat gain?
$\hat{y} = 3.5051 - 0.0034(400) = 2.1451$ kg
This is the predicted response for someone with an increase of 400 calories of NEA.

What calculations should you know?
Definitely know these
Mean, median
Z-scores (and conversions for standard normal)
Interpret and use the linear regression line

No need to memorize
How to calculate standard deviation or variance
How to calculate correlation from data
How to calculate the linear regression line

The $1,300 homework finding
Remember that our regression found an average difference in salary of $5,000 between students who rarely completed homework and those who nearly always did.
Based on some (questionable) calculations, this could be interpreted as an additional $1,300 per night.
What should we be careful of?

Correlation does not imply causation
Even very strong correlations may not correspond to a real causal relationship (changes in $x$ actually causing changes in $y$).
Correlation may be explained by a lurking variable.

Social Relationships and Health
House, J., Landis, K., and Umberson, D. “Social Relationships and Health,” Science, Vol. 241 (1988), pp. 540-545.
Does lack of social relationships cause people to become ill? (There was a strong correlation.)
Or, are unhealthy people less likely to establish and maintain social relationships? (reversed relationship)
Or, is there some other factor that predisposes people both to have lower social activity and become ill?

Caution: beware of extrapolation
Sarah’s height was plotted against her age.
Can you predict her height at age 42 months?
Can you predict her height at age 30 years (360 months)?
[Scatterplot: height (cm) versus age (months), ages 30 to 65 months]

Caution: beware of extrapolation
Regression line: $\hat{y} = 71.95 + 0.383x$
Predicted height at age 42 months? $\hat{y} = 88$ cm
Predicted height at age 30 years? $\hat{y} = 209.8$ cm
She is predicted to be 6’10.5” at age 30!
[Scatterplot: height (cm) versus age (months), with the regression line extended to 390 months]
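
The extrapolation example above can be checked with a few lines of Python. This is only a sketch: the intercept and slope come from the slide's regression line, while the function names and the "observed" age range of 30 to 65 months are assumptions (the range is read off the first scatterplot's axis, not from the raw data, which the slides do not give).

```python
# Regression line from the slide: predicted height (cm) = 71.95 + 0.383 * age (months)
INTERCEPT_CM = 71.95
SLOPE_CM_PER_MONTH = 0.383

# Assumed range of ages in the data, taken from the first plot's x-axis ticks.
OBSERVED_AGE_RANGE_MONTHS = (30, 65)


def predict_height_cm(age_months: float) -> float:
    """Return the height predicted by the regression line for a given age."""
    return INTERCEPT_CM + SLOPE_CM_PER_MONTH * age_months


def is_extrapolation(age_months: float) -> bool:
    """Return True when the age lies outside the assumed observed range."""
    low, high = OBSERVED_AGE_RANGE_MONTHS
    return age_months < low or age_months > high


for age in (42, 360):
    height = predict_height_cm(age)
    note = "extrapolation, do not trust" if is_extrapolation(age) else "within range"
    print(f"age {age} months: predicted {height:.1f} cm ({note})")
```

The arithmetic is identical for both ages; what changes is whether the age falls inside the range the line was fitted on. At 42 months the prediction is about 88 cm, while at 360 months the same line gives about 209.8 cm, which is the implausible result the slide warns about.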