Least Squares Regression
Fitting a Line to Bivariate Data

Linear Relationships
Average occupants per car:
1980: 6/car
1990: 3/car
2000: 1.5/car
By the year 2010 every fourth car will have nobody in it!

Food for Thought
What kind of mathematical relationship is there between year and average number of occupants per car?
Why might the relationship break down by 2010?

Basic Terminology
Scatterplots, correlation: we are interested in the association between two variables (x and y are assigned arbitrarily).
Least squares regression: does one quantitative variable explain or cause changes in another variable?

Basic Terminology (cont.)
Explanatory variable: explains or causes changes in the other variable; the x-variable (independent variable).
Response variable: the y-variable; it responds to changes in the x-variable (dependent variable).

Examples
Fertilizer (x), corn yield (y)
Advertising $ (x), store income (y)
Drug dose (x), blood pressure (y)
Daily temperature (x), natural gas demand (y)
Change in minimum wage (x), unemployment rate (y)

Simplest Relationship
The simplest equation that describes the dependence of variable y on variable x is the linear equation

$$y = b_0 + b_1 x$$

Its graph is a line with slope $b_1$ and y-intercept $b_0$.

Graph of $y = b_0 + b_1 x$
[Figure: a line crossing the y-axis at $b_0$, with slope $b_1$ = rise/run.]

Notation
Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Draw the line $y = b_0 + b_1 x$ through the scatterplot. The point on the line corresponding to $x_i$ is $\hat{y}_i = b_0 + b_1 x_i$.
$\hat{y}_i$ is the value of y predicted by the line $\hat{y} = b_0 + b_1 x$ when $x = x_i$.
$y_i$ is the observed value of y when $x = x_i$.

Observed y, Predicted y
[Figure: Fuel consumption vs. car weight. At x = 2.7 the data point is (2.7, 3.6), so the observed y is 3.6; the predicted y is the height of the line at x = 2.7, $\hat{y} = b_0 + b_1(2.7)$.]

Scatterplot: Fuel Consumption vs Car Weight
[Figure: scatterplot of fuel consumption (gal/100 miles) against car weight (1000 lbs). Which is the "best" line?]

Scatterplot with least squares prediction line
[Figure: fuel consumption (gal/100 miles) vs. car weight (1000 lbs) with fitted line y = 1.639x - 0.3631, $r^2$ = 0.9538.]
How do we draw the line?

Residuals
The i-th residual is the vertical deviation of the i-th data point from the line:

$$i\text{-th residual} = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$

Residuals: graphically
[Figure: graphical display of residuals $e_i = y_i - \hat{y}_i$. Points above the line have positive residuals; points below the line have negative residuals.]

Criterion for choosing what line to draw: method of least squares
The method of least squares chooses the line that makes the sum of the squares of the residuals as small as possible.
This line has the slope $b_1$ and intercept $b_0$ that minimize

$$\sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2$$

for the given observations $(x_i, y_i)$.

Least Squares Line $\hat{y} = b_0 + b_1 x$: Slope $b_1$ and Intercept $b_0$
For data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:

$$\text{slope: } b_1 = r\,\frac{s_y}{s_x} \qquad\qquad \text{y-intercept: } b_0 = \bar{y} - b_1 \bar{x}$$

where

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$ is the standard deviation of $x_1, x_2, \ldots, x_n$,

$$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$ is the standard deviation of $y_1, y_2, \ldots, y_n$,

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$ is the correlation between x and y.

The sum of squared residuals for the least squares line can be computed as

$$\text{SSE} = \sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i$$

Example: Income vs Consumption Expenditure
(both in $1,000's)

Income (x)    Consumption expenditure (y)
     1                  7
     5                  6
     9                  9
    13                  8
    17                 10

Questions
Construct a scatterplot; determine if a linear model is appropriate. If so, find the least squares prediction line.
Estimate consumption expenditure in a household with an income of (i) $6,000, (ii) $25,000. Are you comfortable with the estimates?
Compute the residuals.

Scatterplot
[Figure: consumption expenditure ($1,000's) vs. household income ($1,000's).]

Solution

Inc. x   Exp. y   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
   1        7       -8          64          -1           1               8
   5        6       -4          16          -2           4               8
   9        9        0           0           1           1               0
  13        8        4          16           0           0               0
  17       10        8          64           2           4              16
Sum 45     40        0         160           0          10              32

$$\bar{x} = \frac{45}{5} = 9; \quad \bar{y} = \frac{40}{5} = 8; \quad s_x = \sqrt{\frac{160}{4}} = \sqrt{40} = 6.325; \quad s_y = \sqrt{\frac{10}{4}} = \sqrt{2.5} = 1.581$$

$$r = \frac{32}{4(6.325)(1.581)} = \frac{32}{40} = 0.8$$

Calculations

$$b_1 = r\,\frac{s_y}{s_x} = 0.8 \cdot \frac{1.581}{6.325} = 0.2; \qquad b_0 = \bar{y} - b_1 \bar{x} = 8 - 0.2(9) = 8 - 1.8 = 6.2$$

Least squares prediction line: $\hat{y} = b_0 + b_1 x = 6.2 + 0.2x$
Income $6,000 (x = 6): $\hat{y} = 6.2 + 0.2(6) = 7.4$ ($7,400)
Income $25,000 (x = 25): $\hat{y} = 6.2 + 0.2(25) = 11.2$ ($11,200)

Least Squares Prediction Line
[Figure: consumption expenditure ($1,000's) vs. household income ($1,000's) with the line y = 6.2 + 0.2x.]

Consumption Expenditure Prediction When x = $6,000
[Figure: the line y = 6.2 + 0.2x with the prediction 7.4 marked at x = 6.]

Consumption Expenditure Prediction When x = $25,000
[Figure: the line y = 6.2 + 0.2x extended to x = 25, with the prediction 11.2 marked.]

The Least Squares Line Always Goes Through $(\bar{x}, \bar{y})$
[Figure: consumption expenditure vs. income with the line y = 0.2x + 6.2 passing through $(\bar{x}, \bar{y}) = (9, 8)$.]

C. Compute the Residuals

Inc. x   ConE y   yhat = 6.2 + .2x   y - yhat   (y - yhat)^2
   1        7           6.4             0.6         0.36
   5        6           7.2            -1.2         1.44
   9        9           8.0             1.0         1.00
  13        8           8.8            -0.8         0.64
  17       10           9.6             0.4         0.16
Sum                                     0           3.6

Residuals
[Figure: consumption expenditure ($1,000's) vs. household income ($1,000's) with the line y = 6.2 + 0.2x and the residuals drawn as vertical deviations.]

Income Residual Plot
[Figure: residuals plotted against income.]

Sum of residuals, sum of (residuals)^2
Note that $\sum \text{residuals} = 0$ and $\sum (\text{residuals})^2 = 3.6$.
From the SSE formula (box on p. 7):

$$\text{SSE} = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i = 330 - 6.2(40) - 0.2(392) = 330 - 248 - 78.4 = 3.6$$

Any other line drawn through the scatterplot will have $\sum (\text{residuals})^2 > 3.6$.

Car Weight, Fuel Consumption Example, cont.
$(x_i, y_i)$: (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)
[Figure: fuel consumption (gal/100 miles) vs. car weight (1000 lbs).]

Wt (x)   Fuel (y)   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
  3.4      5.5        0.5        0.25         1.11       1.2321          0.555
  3.8      5.9        0.9        0.81         1.51       2.2801          1.359
  4.1      6.5        1.2        1.44         2.11       4.4521          2.532
  2.2      3.3       -0.7        0.49        -1.09       1.1881          0.763
  2.6      3.6       -0.3        0.09        -0.79       0.6241          0.237
  2.9      4.6        0          0            0.21       0.0441          0
  2.0      2.9       -0.9        0.81        -1.49       2.2201          1.341
  2.7      3.6       -0.2        0.04        -0.79       0.6241          0.158
  1.9      3.1       -1.0        1           -1.29       1.6641          1.29
  3.4      4.9        0.5        0.25         0.51       0.2601          0.255
Col. sum
 29       43.9        0          5.18         0         14.589           8.49

Calculations

$$\bar{x} = 2.9; \quad \bar{y} = 4.39; \quad s_x = \sqrt{\frac{5.18}{9}} = 0.7587; \quad s_y = \sqrt{\frac{14.589}{9}} = 1.2732$$

$$r = \frac{8.49}{9(0.7587)(1.2732)} = 0.9766$$

$$\text{slope: } b_1 = r\,\frac{s_y}{s_x} = 0.9766 \cdot \frac{1.2732}{0.7587} = 1.639$$

$$\text{intercept: } b_0 = \bar{y} - b_1 \bar{x} = 4.39 - 1.639(2.9) = -0.3631$$

Least squares prediction line: $\hat{y} = b_0 + b_1 x = -0.3631 + 1.639x$

Scatterplot with least squares prediction line
[Figure: fuel consumption (gal/100 miles) vs. car weight (1000 lbs) with fitted line y = 1.639x - 0.3631, $r^2$ = 0.9538.]

The Least Squares Line Always Goes Through $(\bar{x}, \bar{y})$
[Figure: fuel consumption (gal/100 miles) vs. car weight (1000 lbs); the line y = 1.639x - 0.3631 passes through $(\bar{x}, \bar{y}) = (2.9, 4.39)$.]

Using the least squares line for prediction
Fuel consumption of a 3,000 lb car? (x = 3)

$$\hat{y} = -0.3631 + 1.639(3) = 4.5539$$

[Figure: fuel consumption vs. car weight, scatterplot and least squares line y = -0.3631 + 1.639x, with the predicted point (3.0, 4.5539) marked.]

Be Careful!
Fuel consumption of a 500 lb car? (x = 0.5)

$$\hat{y} = -0.3631 + 1.639(0.5) = 0.4564$$

That is 0.4564 gal/100 miles, or about 219 mpg.
x = 0.5 is outside the range of the x-data that we used to determine the least squares line.

Avoid GIGO! Evaluating the least squares line
1. Create a scatterplot. Is it approximately linear?
2. Calculate $r^2$, the square of the correlation coefficient.
3. Examine the residual plot.

$r^2$: The Variation Accounted For
The square of the correlation coefficient r gives important information about the usefulness of the least squares line.
$-1 \le r \le 1$ implies $0 \le r^2 \le 1$.
$r^2$ is the fraction of the variation in y that is explained by the least squares regression of y on x (that is, by the variation in x).

Example: car weight, fuel consumption
x = car weight, y = fuel consumption

$$r^2 = (0.9766)^2 \approx 0.95$$

About 95% of the variation in fuel consumption (y) is explained by the linear relationship between car weight (x) and fuel consumption (y).
What else affects fuel consumption? Driver, size of engine, tires, road, etc.

Example: SAT scores
[Figure: SAT mean per state vs. % of seniors taking the test; fitted line y = -2.2375x + 1023.4, $R^2$ = 0.7542.]

SAT scores: calculations

$$\bar{x} = 33.882, \quad s_x = 24.103, \quad \bar{y} = 947.549, \quad s_y = 62.1, \quad r = -0.868$$

$$b_1 = r\,\frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$\text{slope: } b_1 = -0.868 \cdot \frac{62.1}{24.103} = -2.23635$$

$$\text{intercept: } b_0 = 947.549 - (-2.236)(33.882) = 1023.309$$

Least squares prediction line: $\hat{y} = 1023.309 - 2.236x$

SAT scores: result

$$r^2 = (-0.868)^2 = 0.7534$$

If 57% of NC seniors take the SAT, the predicted mean score is

$$\hat{y} = 1023.309 - 2.23635(57) = 895.84$$

Residuals
residual = observed y - predicted y = $y - \hat{y}$

Properties of residuals
1. The residuals always sum to 0 (therefore the mean of the residuals is 0).
2. The least squares line always goes through the point $(\bar{x}, \bar{y})$.

Graphically
[Figure: a data point $(x_i, y_i)$, its predicted value $\hat{y}_i$ on the line, and the residual $e_i = y_i - \hat{y}_i$ as the vertical distance between them.]

Residual Plot
Residuals help us determine whether fitting a least squares line to the data makes sense.
When a least squares line is appropriate, it should model the underlying relationship; nothing interesting should be left behind.
We make a scatterplot of the residuals in the hope of finding… NOTHING!

Car Wt / Fuel Consump: Residuals

Car wt.   Fuel consump.   Pred. fuel consump.   Residual
  3.4         5.5              5.2095             0.2905
  3.8         5.9              5.8651             0.0349
  4.1         6.5              6.3568             0.1432
  2.2         3.3              3.2427             0.0573
  2.6         3.6              3.8983            -0.2983
  2.9         4.6              4.3900             0.2100
  2.0         2.9              2.9149            -0.0149
  2.7         3.6              4.0622            -0.4622
  1.9         3.1              2.7510             0.3490
  3.4         4.9              5.2095            -0.3095

Example: Car wt / fuel consump. residual plot (page 13)
[Figure: residuals vs. car weight (x).]

SAT Residuals
[Figure: residuals vs. % of seniors taking the test (%TAKE).]

Linear Relationship?
[Figure: scatterplot of Y against X, labeled "Linear(?)".]

Garbage In Garbage Out
[Figure: the same scatterplot with the fitted line y = 4x + 11.]

Residual Plot: Clue to GIGO
[Figure: residuals vs. the X variable for the line y = 4x + 11, shown alone and then alongside the scatterplot with the fitted line.]
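
The hand calculations in these slides can be checked with a short program. What follows is a minimal sketch in Python using only the standard library; the function names `least_squares_line` and `residuals` are illustrative choices, not something the slides define. It applies the slide formulas (slope = r times sy over sx, intercept = ybar minus slope times xbar) to the income and car weight data sets given above.

```python
import math


def least_squares_line(xs, ys):
    """Return (b0, b1, r) using b1 = r*sy/sx and b0 = ybar - b1*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sx = math.sqrt(sxx / (n - 1))  # sample standard deviation of x
    sy = math.sqrt(syy / (n - 1))  # sample standard deviation of y
    r = sxy / ((n - 1) * sx * sy)  # correlation coefficient
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1, r


def residuals(xs, ys, b0, b1):
    """Observed y minus predicted y for each data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Income vs. consumption expenditure (both in $1,000's).
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]
b0, b1, r = least_squares_line(income, expenditure)
res = residuals(income, expenditure, b0, b1)
sse = sum(e ** 2 for e in res)
pred_6 = b0 + b1 * 6
pred_25 = b0 + b1 * 25  # extrapolation: 25 is beyond the observed incomes

# Car weight (1000 lbs) vs. fuel consumption (gal/100 miles).
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]
car_b0, car_b1, car_r = least_squares_line(weight, fuel)

print(f"income line: yhat = {b0:.1f} + {b1:.1f}x, r = {r:.1f}, SSE = {sse:.1f}")
print(f"car line: yhat = {car_b0:.4f} + {car_b1:.3f}x, r^2 = {car_r ** 2:.4f}")
```

The sketch computes the slope through r, sx, and sy to mirror the slides; dividing the sum of cross-products by the sum of squared x-deviations gives the same slope with fewer steps.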