STAT 250
Dr. Kari Lock Morgan

Simple Linear Regression
SECTION 2.6
• Least squares line
• Interpreting coefficients
• Prediction
• Cautions

Statistics: Unlocking the Power of Data Lock5

Want More Stats???
• If you have enjoyed learning how to analyze data, and want to learn more:
  • STAT 460 (Intermediate Applied Statistics)
  • STAT 461 (Analysis of Variance)
  • STAT 462 (Applied Regression Analysis)
  • All applied, only prerequisite is STAT 200 or 250
• If you like math and want to learn more of the mathematical theory behind what we’ve learned:
  • take STAT 414 (Probability)
  • and then STAT 415 (Mathematical Statistics)
  • Prerequisite: MATH 230 or MATH 231

MODELING

Crickets and Temperature
• Can you estimate the temperature on a summer evening just by listening to crickets chirp?
• We will fit a model to predict temperature based on cricket chirp rate

Crickets and Temperature
[Scatterplot: response variable, y, against explanatory variable, x]

Linear Model
A linear model predicts a response variable, y, using a linear function of explanatory variables.
Simple linear regression predicts one response variable, y, as a linear function of one explanatory variable, x.

Regression Line
Goal: Find a straight line that best fits the data in a scatterplot

Equation of the Line
The estimated regression line is
$\hat{y} = \beta_0 + \beta_1 x$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, x is the explanatory variable, and $\hat{y}$ is the predicted response variable.
• Slope: increase in predicted y for every unit increase in x
• Intercept: predicted y value when x = 0

Regression in Minitab
Stat -> Regression -> Fitted Line Plot

Regression Model
$\widehat{\text{Temp}} = 37.68 + 0.23\,\text{Chirps}$
Which is a correct interpretation?
a) The average temperature is 37.68
b) For every extra 0.23 chirps per minute, the predicted temperature increases by 1 degree
c) Predicted temperature increases by 0.23 degrees for each extra chirp per minute
d) For every extra 0.23 chirps per minute, the predicted temperature increases by 37.68

Units
• It is helpful to think about units when interpreting a regression equation
$\hat{y} = \beta_0 + \beta_1 x$
  • $\hat{y}$: y units
  • $\beta_0$: y units
  • $\beta_1$: y units / x units
  • $x$: x units
$\widehat{\text{Temp}} = 37.68 + 0.23\,\text{Chirps}$
  • Temp: degrees
  • 37.68: degrees
  • 0.23: degrees / chirps per minute
  • Chirps: chirps per minute

Explanatory and Response
• Unlike correlation, for linear regression it does matter which is the explanatory variable and which is the response
$\widehat{\text{Temp}} = 37.68 + 0.23\,\text{Chirps}$
$\widehat{\text{Chirps}} = -157.8 + 4.25\,\text{Temp}$

Regression Line
Which plot goes with the line $y = x + 10$? (a) (b) (c) (d)
[Four scatterplots, not reproduced]

Prediction
• The regression equation can be used to predict y for a given value of x
$\widehat{\text{Temp}} = 37.68 + 0.23\,\text{Chirps}$
• If you listen and hear crickets chirping about 140 times per minute, your best guess at the outside temperature is
$37.68 + 0.23 \times 140 = 69.88$

Prediction
If the crickets are chirping about 180 times per minute, your best guess at the temperature is
(a) 60 (b) 70 (c) 80

Prediction
$\widehat{\text{Temp}} = 37.68 + 0.23\,\text{Chirps}$
The intercept tells us that the predicted temperature when the crickets are not chirping at all is 37.68.
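The prediction arithmetic on these slides can be sketched in a few lines of Python. The coefficients come from the fitted line shown on the slides; the function name `predict_temp` and the constant names are illustrative choices, not part of the course materials.

```python
# Coefficients taken from the slides' fitted line: Temp-hat = 37.68 + 0.23 * Chirps.
INTERCEPT = 37.68  # degrees
SLOPE = 0.23       # degrees per (chirp per minute)


def predict_temp(chirps_per_minute: float) -> float:
    """Return the predicted temperature for a given chirp rate (hypothetical helper)."""
    return INTERCEPT + SLOPE * chirps_per_minute


print(f"{predict_temp(140):.2f}")  # 69.88
print(f"{predict_temp(0):.2f}")    # 37.68
```

The second call evaluates the line at 0 chirps per minute, which simply returns the intercept.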
Do you think this is a good prediction?
(a) Yes (b) No

Regression Line
How do we find the best fitting line???

Predicted and Actual Values
• The actual response value, $y$, is the response value observed for a particular data point
• The predicted response value, $\hat{y}$, is the response value that would be predicted for a given x value, based on a model
• In linear regression, the predicted values fall on the regression line directly above each x value
• The best fitting line is that which makes the predicted values closest to the actual values

Predicted and Actual Values
[Figure: an actual value $y$ and its predicted value $\hat{y}$ marked on the plot]

Residual
The residual for each data point is
actual - predicted = $y - \hat{y}$
The residual is also the vertical distance from each point to the line

Residual
residual = $y - \hat{y} = 63.5 - 61.44 = 2.06$
• Want to make all the residuals as small as possible.
• How would you measure this?

Least Squares Regression
Least squares regression chooses the regression line that minimizes the sum of squared residuals:
$\text{minimize} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
*Note: you will never have to find the equation by hand; this is just letting you know what technology is doing…

Regression Caution 1
• Do not use the regression equation or line to predict outside the range of x values available in your data (do not extrapolate!)
• If none of the x values are anywhere near 0, then the intercept is meaningless!

Alkalinity and Mercury
Does the relationship look linear?
(a) Yes (b) No

Regression Caution 2
• Computers will calculate a regression line for any two quantitative variables, even if they are not associated or if the association is not linear
• ALWAYS PLOT YOUR DATA!
• The regression line/equation should only be used if the association is approximately linear

Regression Caution 3
• Outliers (especially outliers in both variables) can be very influential on the regression line
• ALWAYS PLOT YOUR DATA!
http://illuminations.nctm.org/LessonDetail.aspx?ID=L455

Life Expectancy and Birth Rate
Coefficients:
(Intercept)  LifeExpectancy
    83.4090         -0.8895
Which of the following interpretations is correct?
(a) A decrease of 0.89 in the birth rate corresponds to a 1 year increase in predicted life expectancy
(b) Increasing life expectancy by 1 year will cause birth rate to decrease by 0.89
(c) Both
(d) Neither

Regression Caution 4
• Higher values of x may lead to higher (or lower) predicted values of y, but this does NOT mean that changing x will cause y to increase or decrease
• Causation can only be determined if the values of the explanatory variable were determined randomly (which is rarely the case for a continuous explanatory variable)

Exam 3
• Exam 3 next Friday, 4/24
• Will primarily be on Chapters 5, 6, 7, and 9.1, but could also have questions from Units A or B (Chapters 1, 2, 3, 4)
• 3 double-sided pages of notes
• Bring a non-cell phone calculator
• Should know z* for 90%, 95%, 99%, but do not need to know t* values

Practice Problems
• Units C and D review reading posted online
• We covered all of Unit C
• We skipped Chapter 8, 9.2, and 9.3 and haven’t covered Chapter 10 yet, so there will be some unfamiliar topics in the Unit D review
• Good sources of practice problems:
  • Unit C Exercises and Review Exercises: All odd problems
  • Unit D Exercises and Review Exercises: 1, 3, 7, 11, 13, 21, 23, 27, 29, 31, 33, 35, 45, 47
  • Chapters 5, 6, 7, and 9.1: All odd problems

To Do
• Read Section 2.6
• Do HW 2.6 (due Friday, 4/24)
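
Supplementary sketch: Least Squares Regression
As a rough illustration of what the technology is doing on the Least Squares Regression slide, the Python sketch below fits a line by minimizing the sum of squared residuals. The five data points are invented for illustration and are not the course's cricket data. The slope and intercept use the standard closed-form least squares solution, which the slides do not show, and all names (`sum_squared_residuals`, `slope`, `intercept`, `best_ssr`) are hypothetical.

```python
# Hypothetical (chirps per minute, temperature) pairs, invented for illustration only.
x = [80.0, 100.0, 120.0, 140.0, 160.0]
y = [56.0, 61.0, 65.0, 70.0, 74.0]


def sum_squared_residuals(b0: float, b1: float) -> float:
    """Sum of (actual - predicted)^2 for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard closed-form least squares estimates (an assumption of this sketch,
# not a formula given on the slides).
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

best_ssr = sum_squared_residuals(intercept, slope)
print(f"fitted line: y-hat = {intercept:.2f} + {slope:.3f} x")
print(f"sum of squared residuals at the fitted line: {best_ssr:.3f}")

# Nudging either coefficient away from the fitted values should not reduce the sum.
print(f"with intercept + 0.5: {sum_squared_residuals(intercept + 0.5, slope):.3f}")
print(f"with slope + 0.01:    {sum_squared_residuals(intercept, slope + 0.01):.3f}")
```

Comparing the last three printed values shows the sense in which the fitted line is "best": any other line gives a sum of squared residuals at least as large.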