Algebra Review
The equation of a straight line: y = mx + b
– m is the slope: the change in y over the change in x, or rise over run.
– b is the y-intercept: the value of y where the line crosses the y-axis.

[Figure: graph of y = 3x + 2, with the x-axis running from -5 to 5 and the y-axis from -15 to 15]

Review
y = 3x + 2
– x = 0: y = 2 (the y-intercept)
– x = 3: y = 11
– The change in y (+9) divided by the change in x (+3) gives the slope, 3.

Linear Regression
Example: Tar (mg) and CO (mg) in cigarettes.
– y, Response: CO (mg).
– x, Explanatory: Tar (mg).
– Cases: 25 brands of cigarettes.

Correlation Coefficient
Tar and CO
$r = \frac{\sum z_x z_y}{n - 1} = \frac{22.9796}{24} = 0.9575$

Linear Regression
There is a strong positive linear association between tar and CO.
What is the equation of the line that models the relationship between tar and CO?

Linear Model
The linear model is the equation of a straight line through the data.
A point on the straight line through the data gives a predicted value of y, denoted $\hat{y}$.

Residual
The difference between the observed value of y and the predicted value of y, $\hat{y}$, is called the residual.
Residual $= y - \hat{y}$

[Figure: Residual]

Line of “Best Fit”
There are many straight lines that go through the data.
The line of “best fit” is the line for which the sum of squared residuals is the smallest. This is the least squares line.

Line of “Best Fit”
$\hat{y} = b_0 + b_1 x$
Least squares
– slope: $b_1 = r \frac{s_y}{s_x}$
– intercept: $b_0 = \bar{y} - b_1 \bar{x}$

Summary of the Data
Tar, x: $\bar{x} = 12.216$ mg, $s_x = 5.6658$ mg
CO, y: $\bar{y} = 12.528$ mg, $s_y = 4.7397$ mg
$r = 0.9575$

Least Squares Estimates
$b_1 = 0.9575 \cdot \frac{4.7397}{5.6658} = 0.801$
$b_0 = 12.528 - 0.801(12.216) = 2.743$
$\hat{y} = 2.743 + 0.801x$
Predicted CO = 2.743 + 0.801*Tar

Interpretations
Slope: for every 1 mg increase in tar, the CO content increases, on average, by 0.801 mg.
Intercept: there is no reasonable interpretation of the intercept in this context because one wouldn’t see a cigarette with 0 mg of tar.
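The least squares estimates can be checked with a few lines of code. The sketch below is only an illustration: it assumes nothing beyond the summary statistics quoted on the slides, and the function name `least_squares_from_summary` is a hypothetical helper, not part of any library.

```python
# Summary statistics copied from the "Summary of the Data" slide.
x_bar, s_x = 12.216, 5.6658   # Tar (mg): mean and standard deviation
y_bar, s_y = 12.528, 4.7397   # CO (mg): mean and standard deviation
r = 0.9575                    # correlation between Tar and CO


def least_squares_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x.

    Hypothetical helper for illustration only.
    """
    b1 = r * s_y / s_x          # slope: b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar     # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1


b0, b1 = least_squares_from_summary(x_bar, s_x, y_bar, s_y, r)
print(f"slope b1 = {b1:.3f}")
print(f"intercept b0 = {b0:.3f}")
```

Rounded to three decimal places, the results agree with the slide values of 0.801 for the slope and 2.743 for the intercept. Working from summary statistics rather than the 25 raw data points keeps the example limited to numbers that actually appear in the slides.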
Predicted CO = 2.743 + 0.801*Tar

Prediction
Least squares line: Predicted CO = 2.743 + 0.801*Tar
For Tar = 16.0 mg:
Predicted CO = 2.743 + 0.801(16)
Predicted CO = 15.56 mg

Residual
Tar, x = 16.0 mg
CO, y = 16.6 mg
Predicted, $\hat{y}$ = 15.56 mg
Residual, $y - \hat{y}$ = 16.6 - 15.56 = 1.04 mg

Residuals
Residuals help us see if the linear model makes sense.
Plot residuals versus the explanatory variable.
– If the plot is a random scatter of points, then the linear model is the best we can do.

[Figure: residuals versus Tar]

Interpretation of the Plot
The residuals appear to have a pattern.
For values of Tar between 0 and 20, the residuals tend to increase.
The brand with Tar = 30 appears to have a large residual.

$(r)^2$ or $R^2$
The square of the correlation coefficient gives the proportion of the variation in y that is accounted for, or explained, by the linear relationship with x.

Tar and CO
$r = 0.9575$
$(r)^2 = (0.9575)^2 = 0.917$ or 91.7%
91.7% of the variation in CO content can be explained by the linear relationship with Tar content.

Regression Conditions
Quantitative variables: both variables should be quantitative.
Linear model: does the scatter plot show a reasonably straight line?
Outliers: watch out for outliers, as they can be very influential.

Regression Cautions
Beware of extraordinary points.
Don’t extrapolate beyond the data.
Don’t infer that x causes y just because there is a good linear model relating the two variables.
Don’t choose a model based on $R^2$ alone.
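The prediction, residual, and squared-correlation calculations from the slides can be reproduced as follows. This is a minimal sketch that assumes the rounded estimates quoted above (2.743 and 0.801). The function names `predict_co` and `residual` are hypothetical, chosen only for this illustration.

```python
# Rounded least squares estimates and correlation, as quoted on the slides.
B0 = 2.743   # intercept (mg of CO)
B1 = 0.801   # slope (mg of CO per mg of Tar)
R = 0.9575   # correlation between Tar and CO


def predict_co(tar_mg):
    """Predicted CO (mg) from the least squares line (hypothetical helper)."""
    return B0 + B1 * tar_mg


def residual(observed, predicted):
    """Residual = observed y minus predicted y."""
    return observed - predicted


# The brand used on the slides: Tar = 16.0 mg, observed CO = 16.6 mg.
tar, observed_co = 16.0, 16.6
predicted_co = predict_co(tar)
res = residual(observed_co, predicted_co)
r_squared = R ** 2

print(f"Predicted CO = {predicted_co:.2f} mg")
print(f"Residual     = {res:.2f} mg")
print(f"r squared    = {r_squared:.3f}")
```

A positive residual means the observed CO is above the line, so the model underpredicts for this brand. In line with the caution about extrapolation, `predict_co` should only be applied to Tar values within the range of the data.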