Section 3.2 Least-Squares Regression

» A regression line summarizes the relationship between two variables, but only in settings where one of the variables helps explain or predict the other. (We must have an explanatory variable and a response variable.)
» A regression line describes how a response variable y changes as an explanatory variable x changes.
» We often use a regression line to predict the value of y for a given value of x.

Don’t you hate it when you open a can of soda and some of the contents spray out of the can? Two AP® Statistics students, Kerry and Danielle, wanted to investigate whether tapping on a can of soda would reduce the amount of soda expelled after the can has been shaken. For their experiment, they vigorously shook 40 cans of soda and randomly assigned each can to be tapped for 0 seconds, 4 seconds, 8 seconds, or 12 seconds. Then, after opening the can and cleaning up the mess, the students measured the amount of soda left in each can (in mL). Here are the data and a scatterplot.

The scatterplot shows a fairly strong, positive linear association between the amount of tapping time and the amount remaining in the can. The line on the plot is a regression line for predicting the amount remaining from the amount of tapping time.

A regression line is a model for the data and provides a compact mathematical description of the relationship between the variables. For the soda example, the equation of the regression line is $\widehat{\text{soda}} = 248.6 + 2.63(\text{tapping time})$. Identify the slope and y-intercept of the regression line. Interpret each value in context.

We can use a regression line to predict the response $\hat{y}$ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line. For the soda example, use the equation of the regression line to predict the amount of soda remaining after tapping on the can for 10 seconds.
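As a rough illustration of how a regression equation is used for prediction, here is a minimal Python sketch. The function name `predict_soda` and the constant names are hypothetical; only the intercept 248.6 and the slope 2.63 come from the soda example.

```python
# Coefficients taken from the regression line in the soda example.
INTERCEPT_ML = 248.6      # predicted mL remaining for a can tapped for 0 seconds
SLOPE_ML_PER_SEC = 2.63   # predicted change in mL remaining per extra second of tapping


def predict_soda(tapping_time_sec: float) -> float:
    """Return the predicted amount of soda remaining (mL) for a given tapping time."""
    return INTERCEPT_ML + SLOPE_ML_PER_SEC * tapping_time_sec


if __name__ == "__main__":
    # Can tapped for 10 seconds: 248.6 + 2.63(10) = 274.9 mL
    print(f"{predict_soda(10):.1f} mL")
```

The same function can be called with any tapping time, but the prediction is only trustworthy for times close to those used in the experiment (0 to 12 seconds).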
Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Should we predict how much soda will be left over after 60 seconds of tapping? Try it! What do you find?

Some data were collected on the weight of a male white laboratory rat for the first 25 weeks after its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong, positive linear relationship. The linear regression equation $\widehat{\text{weight}} = 100 + 40(\text{time})$ models the data fairly well.
1. What is the slope of the regression line?
2. What is the y-intercept? Explain its meaning in context.
3. Predict the rat’s weight after 16 weeks.
4. Should you use this line to predict the rat’s weight at age 2 years? (There are 454 grams in a pound.)

In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible. Each of these vertical distances is a residual: residual = observed y − predicted y = $y - \hat{y}$.

Recall that the regression equation for “Tapping on Cans” is $\widehat{\text{soda}} = 248.6 + 2.63(\text{tapping time})$. Find and interpret the residual for the can that was tapped for 4 seconds and had 260 mL of soda remaining.

Different regression lines produce different residuals. The regression line we want is the one that minimizes the sum of the squared residuals. The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.

Find the value of each residual. What do you notice if you add them together? Where else have we seen this? We want the sum of the squares of our residuals to be as small as possible!!

1. Enter your data into two lists.
2. Create a scatterplot. Describe what you see. (DOFS)
3. Find the equation of your least-squares regression line. STAT -> CALC -> LinReg(a+bx)
4. To store your equation as a line: VARS -> Y-VARS -> Function: Y1

» Page 163 #27 – 32 (More review from 3-1)
» Page 193 #35, 37, 39, 41, 43, 45, 47

Here is a scatterplot showing the tapping time and amount of soda remaining for the 40 cans. The least-squares regression line, $\widehat{\text{soda}} = 248.6 + 2.63(\text{tapping time})$, is shown on the scatterplot. The point in red is for the can that was tapped for 8 seconds and had 255 mL remaining after it was opened. What is the residual for this point? Explain what the value means.

A regression line describes the overall pattern of a linear relationship between two variables. We see departures from this pattern by looking at the residuals. The residuals for a least-squares regression line have a special property: the mean of the least-squares residuals is always zero.

A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.

Here are the scatterplot and residual plot for the can tapping data.
** Notice that the horizontal axis is the same in both plots.
** The “residual = 0” line corresponds to the regression line.

» The residual plot in effect turns the regression line horizontal and magnifies deviations so they are easier to see.
» The scatterplot and residual plot show a nonlinear relationship.
» When an obvious curved pattern exists in a residual plot, the model we are using is not appropriate.
» When we use a line to model a linear association, the residual plot should show only a random scatter of points. (Consider the scatterplot and residual plot for price versus miles driven.)
» A residual is what is left over when we compare the actual observation with the predicted value: residual = $y - \hat{y}$ = observed − predicted.
» A residual plot shows the form that is left over when we compare the form of the association with the form of the regression model.
  ˃ If there is a pattern, then the form of the association and the form of the model are not the same.

The data below represent the years since 1995 and student enrollment.
Based on the residual plot, is a linear model appropriate for these data?
* Type your data into two lists.
* Enter your residuals. [2ndStat]
* Create a scatterplot with the explanatory variable and the residuals.

The plots below are the scatterplot and residual plot for years since 1995 and student enrollment. What do you notice? An association can be clearly nonlinear and still have a correlation close to ±1.

» Residuals show the amount of error for each observation.
» To estimate the approximate size of a “typical” prediction error (residual), calculate the standard deviation of the residuals:

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$

For the can tapping data, the standard deviation of the residuals is $s = \sqrt{\dfrac{951.536}{40-2}} = 5.00$ mL. When we use the least-squares regression line to predict the amount of soda remaining using the amount of tapping time, our predictions will typically be off by about 5 mL.

If all of the points fall directly on the least-squares line, the sum of squared residuals is 0, which means $r^2 = 1$. This means all the variation in y is accounted for by the linear relationship with x.

Suppose that we wandered in during the can tapping experiment and found a partially full can. Without measuring the contents, how could we predict how much soda is left in the can? We don’t know how long it was tapped, so our best guess would be the mean amount remaining in all the cans: $\bar{y} = 264.45$ mL. When using $\bar{y}$ as our predicted value, the sum of the squared prediction errors is 6506. The sum of the squared residuals when using the least-squares regression line is 951.3.

Interpreting $r^2$: “______ % of the variation in (y variable) is accounted for by the linear relationship to the (x variable).”

» Both s and $r^2$ help to answer the question, “How well does the line fit the data?”
» s is measured in the same units as the response variable (y).
» $r^2$ does not have units but is usually expressed as a percentage.
» Report BOTH when you are assessing the fit of a line.
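The two measures of fit can be checked with a few lines of Python. This is a minimal sketch, not the calculator procedure: the function and constant names are hypothetical, and it uses the standard definition $r^2 = 1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals from the regression line and SST is the sum of squared errors when every prediction is $\bar{y}$. The slides quote the sum of squared residuals as 951.536 in the s calculation and as 951.3 in the $r^2$ discussion, so the sketch takes each value as quoted.

```python
import math

N_CANS = 40          # number of cans in the experiment
SSE_FOR_S = 951.536  # sum of squared residuals, as quoted in the s calculation
SSE_FOR_R2 = 951.3   # sum of squared residuals, as quoted in the r^2 discussion
SST = 6506.0         # sum of squared errors when predicting with the mean


def residual_std_dev(sse: float, n: int) -> float:
    """Standard deviation of the residuals: sqrt(SSE / (n - 2))."""
    return math.sqrt(sse / (n - 2))


def r_squared(sse: float, sst: float) -> float:
    """Fraction of the variation in y accounted for by the line: 1 - SSE/SST."""
    return 1 - sse / sst


s = residual_std_dev(SSE_FOR_S, N_CANS)
r2 = r_squared(SSE_FOR_R2, SST)
print(f"s = {s:.2f} mL")    # s = 5.00 mL
print(f"r^2 = {r2:.3f}")    # r^2 = 0.854
```

Read together, the two numbers say that predictions from the line are typically off by about 5 mL, and that roughly 85% of the variation in the amount of soda remaining is accounted for by the linear relationship with tapping time.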
In Section 3.1, we looked at the relationship between the 40-yard sprint time (in seconds) and the long-jump distance (in inches) for a small statistics class with 12 students.
1. Use your graphing calculators to create a scatterplot and find the least-squares regression line.
2. Find the value of $r^2$.
3. Using your residuals, create a residual plot.
4. Find the value of s.

A scatterplot with the least-squares regression line ($\hat{y} = 414.79 - 45.74x$) and a residual plot are shown below. Also, s = 22.38 and $r^2 = 0.702$.
(a) Calculate and interpret the residual for Christian, who had a sprint time of 7.25 seconds and a long jump of 110 inches.
(b) Is a linear model appropriate for these data? Explain.
(c) Interpret the value of s.
(d) Interpret the value of $r^2$.

Section 3-2: Page 193 #48, 49, 50, 51, 55, 58

Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP® Statistics teacher randomly assigned students to seat locations in his classroom for a particular chapter. At the end of the chapter, he recorded the row number (row 1 is closest to the front) and test score for each student. Least-squares regression was performed on the data. A scatterplot with the regression line added, a residual plot, and some computer output from the regression are shown.
(a) What is the equation of the least-squares regression line that describes the relationship between row number and test score? Define any variables that you use.
(b) Interpret the slope of the regression line in context.
(c) Find the correlation.
(d) Is a line an appropriate model to use for these data? Explain how you know.

It is also possible to calculate the equation of the least-squares regression line using only the means and standard deviations of the two variables and their correlation. Both formulas are on the AP Test formula sheet!!
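For reference, the two formulas are usually written as shown here, with a the y-intercept and b the slope (matching LinReg(a+bx)), $\bar{x}$ and $\bar{y}$ the means, $s_x$ and $s_y$ the standard deviations, and r the correlation. This is the standard textbook form, not a copy of the formula sheet itself.

```latex
\hat{y} = a + bx, \qquad b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}
```

Compute the slope first, since the intercept formula uses it. Because $a = \bar{y} - b\bar{x}$, the least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.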
In the previous example, we investigated the relationship between test scores and seat location. The mean and standard deviation of the row numbers are $\bar{x} = 4.033$ and $s_x = 1.974$. The mean and standard deviation of the test scores are $\bar{y} = 81.2$ and $s_y = 10.135$. The correlation between row number and test score is r = −0.218. (Note that this value is slightly different from the one in the previous example because of rounding in the computer output.) Find the equation of the least-squares regression line for predicting test score from row number. Show your work.

Netbooks are a hybrid of a laptop computer and a tablet. They are smaller and have better battery life than a traditional laptop. They also have a separate keyboard, unlike most tablets. Consumer Reports did a study of 22 netbooks in their February 2010 issue. Among the variables measured were battery life (hours), weight (pounds), and cost. The data appear in the table. Should we use a linear model to predict battery life of a new netbook based on its weight? If so, how accurate will our predictions be?

1. The distinction between explanatory and response variables is important in regression.
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.
   An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.
   An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
4. Association does not imply causation. (Use common sense when drawing conclusions.)

Example – Does committing more turnovers lead to more points?
In the NBA, there is a strong positive association between the number of turnovers a player has and the number of points that he scores. A turnover is when a player loses the ball to the other team. Could a player increase his point totals by turning the ball over more frequently?

No! Turning the ball over to the other team doesn’t cause a player to score more points. Instead, there is another variable that influences both turnovers and points: playing time. Players who are on the court more often tend to score more points and have more turnovers than players who don’t get much playing time.

In the chapter-opening Case Study (page 141), the Starnes family had just missed seeing Old Faithful erupt. They wondered how long it would be until the next eruption. The scatterplot below shows data on the duration (in minutes) and the interval of time until the next eruption (also in minutes) for each Old Faithful eruption in the month before the eruption. Answer all of the questions on page 191 and turn them in.

Section 3-2: Page 193 #59, 61, 63, 65, 69, 71 – 78