Notes 3.2 Least Squares Regression Tapping on cans Don’t you hate it when you open a can of soda and some of the contents spray out of the can? Two AP®Statistics students, Kerry and Danielle, wanted to investigate if tapping on a can of soda would reduce the amount of soda expelled after the can has been shaken. For their experiment, they vigorously shook 40 cans of soda and randomly assigned each can to be tapped for 0 seconds, 4 seconds, 8 seconds, or 12 seconds. Then, after opening the can and cleaning up the mess, the students measured the amount of soda left in each can (in ml). Here are the data and a scatterplot. The scatterplot shows a fairly strong, positive linear association between the amount of tapping time and the amount remaining in the can. The line on the plot is a regression line for predicting the amount remaining from the amount of tapping time. Tapping on cans The equation of the regression line in the previous Example is Problem: Identify the slope and y intercept of the regression line. Interpret each value in context. Tapping on cans For the soda example, the equation of the regression line is soda = 248.6 + 2.63 (tapping time). If we shook a can in the same way as the students did in their project and tapped on it for 10 seconds, the predicted amount of soda remaining would be ml. Extrapolation and Tapping on Cans Should we predict how much soda will be left after 60 seconds of tapping? No! We have data only for cans that were tapped between 0 and 12 seconds. We don’t know if the linear pattern will continue beyond these values. In fact, if we did make a prediction for 60 seconds of tapping, we would get 406.4 ml, over 50 ml more than the can originally contained (355 ml)! Tapping on cans Problem: Find and interpret the residual for the can that was tapped for 4 seconds and had 260 ml of soda remaining. Tapping on cans Here is a scatterplot showing the tapping time and amount of soda remaining for the 40 cans. The least-squares regression line, is shown on the scatterplot. The point in red is for the can that was tapped for 8 seconds and had 255 ml remaining after it was opened. The predicted amount remaining is The residual is therefore y – = 255 – 269.64 = –14.64 ml. The amount of soda remaining in this can is 14.64 ml less than expected, based on the tapping time. Tapping on cans For the can tapping data, the standard deviation of the residuals is When we use the least-squares regression line to predict the amount of soda remaining using the amount of tapping time, our predictions will typically be off by about 5 ml. Tapping on cans Suppose that we wandered in during the can tapping experiment and found a partially-full can. Without measuring the contents, how could we predict how much soda is left in the can? We don’t know how long it was tapped, so our best guess would be the mean amount remaining in all the cans: y = 264.45 ml. The first scatterplot shows the squared prediction errors when using the mean amount y as our prediction. When using y as our predicted value, the sum of the squared prediction errors is 6506. We could make much better predictions if we knew the tapping time. How much better? The second scatterplot shows the squared prediction errors when using the least-squares regression line. The sum of the squared residuals when using the least-squares regression line is 951.3 (the same quantity we used to calculate the standard deviation of the residuals, other than a small difference due to rounding error). This means that only = 14.6% of the variation in amount of soda remaining is unaccounted for by the least-squares regression line. The remaining variation is due to other factors, such as how vigorously the can was shaken. Therefore, 1 – = 85.4% of the variability in amount of soda remaining is accounted for by the linear model relating amount of soda remaining to tapping time. Does seat location affect grades? Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP®Statistics teacher randomly assigned students to seat locations in his classroom for a particular chapter. At the end of the chapter, he recorded the row number (row 1 is closest to the front) and test score for each student. Least-squares regression was performed on the data. A scatterplot with the regression line added, a residual plot, and some computer output from the regression are shown below. Problem: (a) What is the equation of the least-squares regression line that describes the relationship between row number and test score? Define any variables that you use. (b) Interpret the slope of the regression line in context. (c) Find the correlation. (d) Is a line an appropriate model to use for these data? Explain how you know. Back to the track! Here is a scatterplot with the least-squares regression line for predicting the long-jump distance from sprint time and a scatterplot with the least-squares regression line for predicting sprint time from long-jump distance. Does committing more turnovers lead to more points? In the National Basketball Association, there is a strong positive association between the number of turnovers a player has and the number of points that he scores. A turnover is when a player loses the ball to the other team. Could a player increase his point totals by turning the ball over more frequently? No! Turning the ball over to the other team doesn’t cause a player to score more points. Instead, there is another variable that influences both turnovers and points: playing time. Players who are on the court more often tend to score more points and have more turnovers than players who don’t get much playing time. HW: pg. 193: 35, 37, 39, 41, 45 HW: pg. 193 43, 47, HW: 49, 52 48, 50, 55, 58 HW: 59, 61, 63, 65, 69, 71-78