MATH 2560 C F03 Elementary Statistics I
LECTURE 8: Least-Squares Regression: Regression Line and Prediction.

1 Outline
⇒ regression line;
⇒ fitting a line to data;
⇒ prediction;
⇒ regression line with Excel.

2 Regression Line
⇒ A regression line summarizes the relationship between two variables when one of the variables helps explain or predict the other.
⇒ Regression describes a relationship between an explanatory variable and a response variable.

Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. A regression line is used to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.

Example 2.10: Heights of Children in Kalama, Egypt (Table 2.7 and Figure 2.11)
The data were obtained by measuring the heights of 161 children from the village each month from 18 to 29 months of age. Figure 2.11 is a scatterplot of the data in Table 2.7. Age is the explanatory variable, which is plotted on the x axis. Mean height (in cm) is the response variable. The plot shows a strong positive linear association with no outliers. The correlation is r = 0.994, close to the r = 1 of points that lie exactly on a line. A line drawn through the points will describe these data very well. This line is called the regression line.

3 Fitting a Line to Data
The scatterplot displays a linear pattern, so the overall pattern can be described by drawing a straight line through the points.
⇒ Fitting a line to data means drawing a line that comes as close as possible to the points. (Of course, no straight line passes exactly through all of the points.)
⇒ The equation of a line fitted to the data gives a compact description of the dependence of the response variable y on the explanatory variable x.
⇒ It is a mathematical model for the straight-line relationship.
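
To make "as close as possible to the points" concrete, here is a minimal sketch of the least-squares criterion named in the lecture title. It fits a line y = a + bx with intercept a and slope b using the standard closed-form least-squares formulas, which are not stated in this section. The function name fit_line and the data points are assumptions made up for illustration; they are not the Kalama data from Table 2.7.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    Standard formulas: b = Sxy / Sxx and a = mean(y) - b * mean(x).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Made-up points for illustration only (NOT the Kalama data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
print(f"fitted line: y = {a:.2f} + {b:.2f} x")
```

When the points lie exactly on a line, the fitted line reproduces it; with scatter, it is the line that minimizes the sum of squared vertical distances from the points.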

Straight Line
Let y be a response variable and x an explanatory variable. A straight line relating y to x has an equation of the form y = a + bx. In this equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.

Example 2.10 (Table 2.7, Figure 2.11 and Figure 2.12)
The straight line describing the Kalama data has the form height = a + (b × age). In Figure 2.12 the regression line has been drawn with the equation height = 64.93 + (0.635 × age).
⇒ The figure shows that this line fits the data well. The slope b = 0.635 tells us that the height of Kalama children increases by about 0.6 cm for each month of age. The slope b of a line y = a + bx is the rate of change in the response y as the explanatory variable x changes.
⇒ The slope of a regression line is an important numerical description of the relationship between the two variables.

4 Prediction
⇒ A regression line is used to predict the response y for a specific value of the explanatory variable x.

Example 1. Predict the mean height of Kalama children at 32 months of age.
Using Figure 2.12: from age 32 months on the x axis, go up to the fitted line and over to the y axis. The predicted height is a bit more than 85 cm. It is faster and more accurate to substitute 32 for the age in the equation of the regression line. Our predicted height is height = 64.93 + (0.635 × 32) = 85.25 cm.

Important Remark: The accuracy of predictions from a regression line depends on how much scatter about the line the data show. In the Kalama example, the data points are all very close to the line, so we are confident that our prediction is accurate. If the data show a linear pattern with considerable spread, we may still use a regression line, but we will put less confidence in predictions based on it.

Example 2. Predict the mean height of Kalama children at 20 years of age.
20 years is 240 months, so we substitute 240 for the age.
The prediction is: height = 64.93 + (0.635 × 240) = 217.33 cm. Blind calculation has produced an unreasonable result. The data cover only ages from 18 to 29 months. As people grow older, they gain height more slowly, so our fitted line is not a good model at ages far removed from the data that produced it.

Extrapolation
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.

5 Summary
1. A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
2. A regression line is used to predict the value of y for any value of x by substituting this x into the equation of the line. Extrapolation beyond the range of x values spanned by the data is risky.
3. The slope b of a regression line ŷ = a + bx is the rate at which the predicted response ŷ changes along the line as the explanatory variable x changes. Specifically, b is the change in ŷ when x increases by 1.
4. The intercept a of a regression line ŷ = a + bx is the predicted response ŷ when the explanatory variable x = 0. This prediction is of no statistical use unless x can actually take values near 0.
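
The two prediction examples can be reproduced with a short sketch. It substitutes an age into the fitted line height = 64.93 + (0.635 × age) from Example 2.10 and flags any age outside the 18 to 29 months covered by the data. The names predict_height and is_extrapolation are hypothetical helpers introduced here for illustration. The range check is deliberately strict, so it also flags 32 months; how far outside the data is "far" remains a judgment call that the code does not make.

```python
# Fitted line from Example 2.10: height = 64.93 + (0.635 * age)
INTERCEPT_CM = 64.93          # a: intercept of the fitted line, in cm
SLOPE_CM_PER_MONTH = 0.635    # b: change in predicted height per month of age
AGE_RANGE_MONTHS = (18, 29)   # ages covered by the Kalama data


def predict_height(age_months):
    """Predicted mean height (cm) from the fitted line."""
    return INTERCEPT_CM + SLOPE_CM_PER_MONTH * age_months


def is_extrapolation(age_months):
    """True if the age lies outside the range of the data (strict check)."""
    low, high = AGE_RANGE_MONTHS
    return age_months < low or age_months > high


for age in (32, 240):  # Example 1 and Example 2 (20 years = 240 months)
    note = " (outside the data range)" if is_extrapolation(age) else ""
    print(f"age {age} months: predicted height {predict_height(age):.2f} cm{note}")
```

The arithmetic is the same in both cases; only the distance from the observed ages tells us that the 240-month prediction should not be trusted.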