AP Statistics

Objectives Ch8
Find the Least Squares Regression Line and interpret its slope, y-intercept, and the coefficients of correlation and determination.
Justify the regression model using the scatterplot and residual plot.

Vocabulary
Model, Linear model, Residuals, Predicted value, Slope, Regression line, Regression to the mean, Intercept, \(R^2\)

Chapter 8 #1
\(b_1 = \frac{r s_y}{s_x}\)

      \(\bar{x}\)    \(s_x\)    \(\bar{y}\)    \(s_y\)    \(r\)     \(\hat{y} = b_0 + b_1 x\)
a)    10     2       20     3      0.5     \(\hat{y} = 12.5 + 0.75x\)
b)    2      0.06    7.2    1.2    -0.4    \(\hat{y} = 23.2 - 8x\)
c)    12     6       152    30     -0.8    \(\hat{y} = 200 - 4x\)
d)    2.5    1.2     ?      100    ?       \(\hat{y} = -100 + 50x\)

Standardized Foot Length vs Height 2011
\(\hat{z}_{Height} = 0.84\,z_{FootSize}\)
Slope is the correlation.
\((\bar{x}, \bar{y})\) is part of all regression lines.
NOTE: (0,0) represents the mean of x and the mean of y.

Regression Line for Standardized Values \((z_x, z_y)\)
\(\hat{z}_y = r z_x\)
\(\hat{z}_y\) is the predicted z-score for the response variable
\(z_x\) is the z-score for the explanatory variable
\(r\) is the correlation coefficient

Regression Line for \((x, y)\)
\(\hat{y} = b_0 + b_1 x\)
\(\hat{y}\) is the predicted response variable
\(b_1\) is the slope: \(b_1 = \frac{r s_y}{s_x}\)
\(b_0\) is the y-intercept: \(b_0 = \bar{y} - b_1 \bar{x}\)

Explanatory or Response
\(\widehat{\text{Height}} = 32.86 + 1.36\,\text{FootSize}\)
Now interpret the \(R^2\).
\(R^2 = 0.697\)
According to the linear model, 69.7% of the variability in height is accounted for by variation in foot size.
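The slope and intercept formulas from the regression line notes can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `regression_line` is made up for illustration. It recomputes rows a) and b) of Chapter 8 #1 from their summary statistics.

```python
def regression_line(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for y-hat = b0 + b1*x from summary statistics.

    Hypothetical helper for illustration; it only applies the two
    formulas from the notes.
    """
    b1 = r * s_y / s_x        # slope: b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar   # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1


# Row a): x-bar = 10, s_x = 2, y-bar = 20, s_y = 3, r = 0.5
b0_a, b1_a = regression_line(10, 2, 20, 3, 0.5)

# Row b): x-bar = 2, s_x = 0.06, y-bar = 7.2, s_y = 1.2, r = -0.4
b0_b, b1_b = regression_line(2, 0.06, 7.2, 1.2, -0.4)

print(f"a) intercept = {b0_a:.2f}, slope = {b1_a:.2f}")
print(f"b) intercept = {b0_b:.2f}, slope = {b1_b:.2f}")
```

Because the intercept is defined as \(b_0 = \bar{y} - b_1 \bar{x}\), the fitted line always passes through \((\bar{x}, \bar{y})\), which is why the same two formulas can also be run backwards to recover a missing mean or standard deviation.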
Explanatory or Response
2011 data resulted in the following linear equations:
\(\widehat{\text{FootSize}} = -9.55 + 0.51\,\text{Height}\)
\(\widehat{\text{Height}} = 32.86 + 1.36\,\text{FootSize}\)

Residual Plot Example
\(e = y - \hat{y}\)
REMEMBER: POSITIVE RESIDUALS are UNDERESTIMATES.
NEGATIVE RESIDUALS are OVERESTIMATES.

Assignment CHAPTER 8
Part I: pp. 189-190 #2, 4, 8 & 10, 12 & 14
Part II: pp. 190-192 #16, 18, 20, 28 & 30

Chapter 7 Answers
a) #1 shows little or no association
b) #4 shows a negative association
c) #2 & #4 each show a linear association
d) #3 shows a moderately strong, curved association
e) #2 shows a very strong association

a) -0.977
b) 0.736
c) 0.951
d) -0.021

The researcher should have plotted the data first. A strong, curved relationship may have a very low correlation. In fact, correlation is only a useful measure of the strength of a linear relationship.

If the association between GDP and infant mortality is linear, a correlation of -0.772 shows a moderate, negative association.

Continent is a categorical variable. Correlation measures the strength of linear associations between quantitative variables.

Correlation must be between -1 and 1, inclusive. Correlation can never be 1.22.

A correlation, no matter how strong, cannot prove a cause-and-effect relationship.
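One of the Chapter 7 answers above says that a strong, curved relationship may have a very low correlation. A small Python sketch can illustrate this. The data below are made up for the illustration (they are not from the textbook problem), and `correlation` is a hand-written helper that follows the usual definition of r.

```python
import math


def correlation(xs, ys):
    """Pearson correlation r, computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up x-values, symmetric around zero (an assumption for this sketch)
xs = [-3, -2, -1, 0, 1, 2, 3]

ys_curved = [x ** 2 for x in xs]      # perfect curved (quadratic) pattern
ys_linear = [2 * x + 1 for x in xs]   # perfect straight-line pattern

r_curved = correlation(xs, ys_curved)
r_linear = correlation(xs, ys_linear)

print("curved:", r_curved)
print("linear:", r_linear)
```

In the curved case y is completely determined by x, yet r comes out at zero, which is why the scatterplot has to be checked before r is interpreted.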
Chapter 8 Vocabulary
1) Regression to the mean – each predicted response value (\(\hat{y}\)) tends to be closer to the mean (in standard deviations) than its corresponding explanatory value (x)
2) \(\hat{y}\) – predicted response value
3) Residual – the difference between the actual response value and the predicted response value, \(e = y - \hat{y}\)
4) Overestimate – produces a negative residual
5) Underestimate – produces a positive residual
6) Slope – rate of change given in units of the response variable (y) per unit of the explanatory variable (x)
7) Intercept – predicted response value when the explanatory value is zero
8) \(R^2\) (aka Coefficient of Determination) – must also be interpreted when describing a regression model:
“According to the linear model, _____% of the variability in _______ (response variable) is accounted for by variation in ________ (explanatory variable)”
The remaining variation is left in the residuals.

CONDITIONS FOR USING A LINEAR REGRESSION
1) Quantitative Variables – Check the variables
2) Straight Enough – Check the scatterplot first (should be nearly linear), then check the residual plot (should be random scatter)
3) Outlier Condition – Any outliers need to be investigated

9) Residual Plot – a scatterplot of the residuals against either x or \(\hat{y}\)
If you find a pattern in the residual plot, that means the residuals (errors) are predictable. If the residuals are predictable, then a better model exists: A LINEAR MODEL IS NOT APPROPRIATE.
A residual plot is done with the RESIDUALS on the y-axis. On the x-axis, put the explanatory variable.
NOTE: Some software packages will put \(\hat{y}\) on the x-axis. This does not change the presence (or lack) of a pattern.

What number squared gives 4? Did you say 2? That is only half the answer. Both 2 and -2 work, because \((2)^2\) and \((-2)^2\) both equal 4. So what?

Important Note: The correlation is not given directly in this software package. You need to look in two places for it. Taking the square root of the “R squared” (coefficient of determination) is not enough. You must look at the sign of the slope too. A positive slope means a positive r-value. A negative slope means a negative r-value.

Printout: Grad Rate vs. S/F Ratio, coefficient -0.07861
Here you should note that the slope is negative, so the correlation will be negative too. Since \(R^2\) is 0.482, \(r = -\sqrt{0.482} = -0.694\).

Coefficient of Determination = \((0.694)^2 = 0.4816\)
With the linear regression model, 48.2% of the variability in airline fares is accounted for by the variation in distance of the flight.

\(b_1 = r\,\frac{s_y}{s_x} = 0.694 \cdot \frac{56.37}{497.8} = 0.0786\)
The predicted fare increases by about 7.86 cents for every additional mile, or $7.86 for every additional 100 miles.

The line \(\hat{y} = b_0 + b_1 x\) passes through \((\bar{x}, \bar{y})\), so:
\(244.33 = b_0 + (0.0786)(853.7)\)
\(b_0 = 244.33 - (0.0786)(853.7) = 177.2292\)
The model predicts a flight of zero miles will cost $177.23. The airline may have built in an initial cost to pay for some of its expenses.
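The note above about reading r from a printout (square root of \(R^2\), with the sign taken from the slope) can be written as a short Python sketch. The helper name `correlation_from_printout` is hypothetical; the numbers are the ones quoted in the slides.

```python
import math


def correlation_from_printout(r_squared, slope):
    """Recover r from R^2, taking its sign from the sign of the slope.

    Hypothetical helper for illustration. math.copysign(a, b) returns
    the magnitude of a with the sign of b.
    """
    return math.copysign(math.sqrt(r_squared), slope)


# Negative-slope printout from the slides: slope = -0.07861, R^2 = 0.482
r_negative = correlation_from_printout(0.482, -0.07861)

# Airfare vs. distance: slope = +0.0786, R^2 = 0.482
r_airfare = correlation_from_printout(0.482, 0.0786)

print(round(r_negative, 3))
print(round(r_airfare, 3))
```

The same \(R^2\) gives two different r-values depending only on the direction of the line, which is the reason both places on the printout have to be read.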
\(b_1 = 0.0786\)
\(b_0 = 177.2292\)
\(\widehat{\text{Airfare}} = 177.2292 + 0.0786\,\text{Distance}\)

200-mile flight: \(\widehat{\text{Airfare}} = 177.2292 + 0.0786(200) = \$192.95\)
2000-mile flight: \(\widehat{\text{Airfare}} = 177.2292 + 0.0786(2000) = \$334.43\)

8. Using those estimates, draw the line on the scatterplot.

1719-mile flight: \(\widehat{\text{Airfare}} = 177.2292 + 0.0786(1719) = \$312.34\)
\(e = y - \hat{y} = 212 - 312.34 = -\$100.34\)

12. In general, a positive residual means the model underestimated the actual value.
13. In general, a negative residual means the model overestimated the actual value.

A linear model should be appropriate, because 1) the scatterplot shows a nearly linear form and 2) the residual plot shows random scatter.
The coefficient of determination is 0.482, so the coefficient of correlation is \(\sqrt{0.482} = 0.694\) (positive, because the slope is positive). This shows a moderate strength in association for the model.
$150 for a flight of about 700 miles seems low compared to the other fares.

“fare” is the response variable. Not all software will call it the dependent variable.
Always look for “Constant” and what is listed beside it. The header above it shows that the column is for the “variable”, and below it “dist” is the explanatory variable.
Recall: For y = 3x + 1 the coefficient of x is ‘3’. For computer printouts, the “Coefficient” column is the key column for your regression model.
The “Coefficient” of the “Constant” is the y-intercept for your linear regression.
The “Coefficient” of the variable “dist” is the slope for your linear regression.
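The airfare predictions and the residual for the 1719-mile flight can be reproduced with a few lines of Python. This is a minimal sketch using the intercept and slope quoted in the slides; the function names are made up for illustration.

```python
B0 = 177.2292   # y-intercept from the slides (dollars)
B1 = 0.0786     # slope from the slides (dollars per mile)


def predict_airfare(distance_miles):
    """Predicted fare (y-hat) for a given flight distance."""
    return B0 + B1 * distance_miles


def residual(actual, predicted):
    """Residual e = y - y-hat (actual minus predicted)."""
    return actual - predicted


fare_200 = predict_airfare(200)
fare_2000 = predict_airfare(2000)
print(round(fare_200, 2), round(fare_2000, 2))

# The 1719-mile flight had an actual fare of $212.
fare_1719 = predict_airfare(1719)
e = residual(212, fare_1719)

# Positive residual: the model underestimated. Negative: it overestimated.
verdict = "underestimate" if e > 0 else "overestimate"
print(round(fare_1719, 2), round(e, 2), verdict)
```

Two predictions far apart on the x-axis (200 and 2000 miles) are enough to draw the line on the scatterplot, since a straight line is fixed by any two of its points.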
\(\widehat{\text{fare}} = 177.215 + 0.078619\,\text{distance}\)

5. Predict the airfare for a 1000-mile flight.
\(\widehat{\text{fare}} = 177.215 + 0.078619(1000) = \$255.83\)

Switching the explanatory and response variables: \(R^2\) doesn’t change, but the equation does.
\(\widehat{\text{distance}} = -644.287 + 6.13101\,\text{fare}\)
\(\widehat{\text{distance}} = -644.287 + 6.13101(255.83) = 924.2\) miles

8. Residual?
\(e = y - \hat{y} = 1000 - 924.2 = 75.8\) miles (positive, so the model underestimates the 1000-mile distance)

Chp 8 #17
R squared = 92.4%

17a. What is the correlation between tar and nicotine? (NOTE: scatterplot shows a strong positive linear association.)
\(r = +\sqrt{0.924} = 0.961\)

17b. What would you predict about the average nicotine content of cigarettes that are 2 standard deviations below average in tar content?
\(\hat{z}_{nicotine} = r z_{tar} = 0.961(-2) = -1.922\)
I would predict that the nicotine content would be 1.922 standard deviations below the average.

17c. If a cigarette is 1 standard deviation above average in nicotine content, what do you suspect is true about its tar content?
\(\hat{z}_{tar} = r z_{nicotine} = 0.961(1) = 0.961\)
I would predict that the tar content would be 0.961 standard deviations above the average.
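The answers to #17 all use the standardized regression line \(\hat{z}_y = r z_x\). The following Python sketch, with a hypothetical helper name `predict_z`, repeats those calculations. It keeps r unrounded, so the later digits can differ slightly from the slides, which round r to 0.961 first.

```python
import math

# 17a: R^2 = 92.4% and the scatterplot shows a positive association,
# so the positive square root is used.
r = math.sqrt(0.924)


def predict_z(r, z_explanatory):
    """Predicted z-score of the response: z-hat_y = r * z_x."""
    return r * z_explanatory


z_nicotine = predict_z(r, -2)   # 17b: tar is 2 SDs below average
z_tar = predict_z(r, 1)         # 17c: nicotine is 1 SD above average

print(round(r, 3), round(z_nicotine, 3), round(z_tar, 3))
```

Because |r| is less than 1 here, each predicted z-score is closer to zero than the z-score it was predicted from, which is regression to the mean in action.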