Statistics and Data Analysis
Part 20 – Aspects of Regression
Professor William Greene
Stern School of Business, IOMS Department
Department of Economics

Regression Models
This part covers two topics:
- Using the regression model to predict the value of the dependent variable.
- "Cleaning" the data to remove what look like extreme values:
  - Trimming – removing observations with extreme x.
  - Truncation – removing observations with extreme y.

Prediction
Use of the model for prediction: use x to predict y based on y = α + βx + ε.
Sources of uncertainty:
- Predicting x first.
- Using sample estimates of α and β (and, possibly, σ).
- We cannot predict the noise, ε.
- Predicting outside the range of experience – uncertainty about the reach of the regression model.

Base Case Prediction
To predict y at a given value x*, we use the regression equation.
Sources of prediction error:
- The true value is y = α + βx* + ε.
- Since α and β must be estimated, the obvious estimate is ŷ = a + bx*.
- We have no prediction for ε other than 0; ε can never be predicted.
- The farther x* is from the center of experience, the greater the uncertainty.

A Prediction Interval
The prediction includes a range of uncertainty.
Point estimate: ŷ = a + bx*.
The range of uncertainty around the prediction is
\[
a + bx^* \;\pm\; 1.96\, s_e \sqrt{\,1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,}
\]
The 1.96 gives the usual 95% coverage; the leading 1 under the square root is due to ε, and the remaining terms are due to estimating α and β with a and b. (Remember the empirical rule: about 95% of a distribution lies within two standard deviations.)

Slightly Simpler Formula for Prediction
Point estimate: ŷ = a + bx*. The range of uncertainty around the prediction can also be written as
\[
a + bx^* \;\pm\; 1.96 \sqrt{\, s_e^2\!\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[\mathrm{SE}(b)]^2 \,}
\]

Prediction from Internet Buzz Regression
[Figure: fitted regression of Box Office on Buzz.] Mean of Buzz = 0.48242; Max(Buzz) = 0.79.

Prediction Interval for Buzz = .8
Predict Box Office for Buzz = .8:
a + bx* = -14.36 + 72.72(.8) = 43.82
\[
\sqrt{\, s_e^2\!\left(1 + \frac{1}{N}\right) + (.8 - \overline{Buzz})^2\,[\mathrm{SE}(b)]^2 \,}
= \sqrt{\, 13.3863^2\!\left(1 + \tfrac{1}{62}\right) + (.8 - .48242)^2 (10.94)^2 \,} = 13.93
\]
Interval = 43.82 ± 1.96(13.93) = 16.52 to 71.12.

Predicting Using a Loglinear Equation
Predict the log first:
- Prediction of the log.
- Prediction interval – (Lower to Upper).
- Prediction = exp(Lower) to exp(Upper).
This produces very wide intervals.

Interval Estimates for the Sample of Signed Monet Paintings
[Figure: fitted line plot of ln (US$) versus ln (SurfaceArea) with 95% prediction interval bands.]
Regression Analysis: ln (US$) versus ln (SurfaceArea)
The regression equation is ln (US$) = 2.83 + 1.72 ln (SurfaceArea)

Predictor          Coef     SE Coef   T      P
Constant           2.825    1.285     2.20   0.029
ln (SurfaceArea)   1.7246   0.1908    9.04   0.000

S = 1.00645   R-Sq = 20.0%   R-Sq(adj) = 19.8%
Mean of ln (SurfaceArea) = 6.72918

Prediction for an Out-of-Sample Monet
Claude Monet, Bridge Over a Pool of Water Lilies, 1899. Original: 36.5" x 29".
ln Surface = ln(36.5 × 29) = 6.96461
Prediction = 2.83 + 1.72(6.96461) = 14.809
\[
\text{Uncertainty} = 1.96\sqrt{\,1.00645^2\!\left(1 + \tfrac{1}{328}\right) + (6.96461 - 6.72918)^2 (.1908)^2\,}
= 1.96\sqrt{\,1.012942(1.003049) + (.23453)^2(.1908)^2\,} = 1.96(1.008984) = 1.977608
\]
Prediction interval = 14.809 ± 1.977608 = 12.831392 to 16.786608.
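The interval arithmetic above is easy to script. The following is a minimal Python sketch (not part of the original slides) of the "slightly simpler" prediction-interval formula; the function name prediction_interval and the argument names are my own, and the inputs are the summary statistics reported on the preceding slides.

```python
import math

def prediction_interval(a, b, x_star, s_e, n, x_bar, se_b, z=1.96):
    """95% prediction interval for y at x = x_star, using
    a + b*x_star +/- z*sqrt(s_e^2*(1 + 1/n) + (x_star - x_bar)^2*se_b^2)."""
    point = a + b * x_star
    half = z * math.sqrt(s_e**2 * (1 + 1/n) + (x_star - x_bar)**2 * se_b**2)
    return point, point - half, point + half

# Box Office at Buzz = .8 (Internet Buzz regression summary statistics)
print(prediction_interval(a=-14.36, b=72.72, x_star=0.8,
                          s_e=13.3863, n=62, x_bar=0.48242, se_b=10.94))
# roughly (43.82, 16.51, 71.13)

# Out-of-sample Monet, 36.5" x 29" (rounded coefficients, as on the slide)
ln_area = math.log(36.5 * 29)        # 6.96461
print(prediction_interval(a=2.83, b=1.72, x_star=ln_area,
                          s_e=1.00645, n=328, x_bar=6.72918, se_b=0.1908))
# roughly (14.81, 12.83, 16.79)
```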
Predicting y when the Model Describes log y
The interval predicts the log of price. What about the price itself?
Predicted Price: Mean = exp(a + bx*) = exp(14.809) = $2,700,641.78
Upper limit = exp(14.809 + 1.9776) = $19,513,166.53
Lower limit = exp(14.809 - 1.9776) = $373,771.53

Van Gogh: Irises, 39.5" x 39.125"
- Prediction by our model = $17.903M.
- The painting is in our data set: it sold for $16.81M on 5/6/04 and for $7.729M on 2/5/01.
- The last sale in our data set was in May 2004.
- The record sale was 6/25/08, at the market peak, just before the crash.

Uncertainty in Prediction
The interval half-width
\[
1.96\sqrt{\, s_e^2\!\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[\mathrm{SE}(b)]^2 \,}
\]
is narrowest at x* = x̄, the center of our experience, and widens as we move away from the center to reflect the greater uncertainty:
(1) uncertainty about the prediction of x, and
(2) uncertainty that the linear relationship will continue to hold as we move farther from the center.

http://www.nytimes.com/2006/05/16/arts/design/16oran.html

[Figure: "Morning", Claude Monet, 1920–1926, oil on canvas, 200 x 425 cm, Musée de l'Orangerie, Paris, France. Left panel shown, with dimension annotations: 167" (13 feet 11 inches), 78.74" (6 feet 7 inches), 26.2" (2 feet 2.2 inches), 32.1" (2 feet 8 inches).]

Predicted Price for a Huge Painting
Regression equation: ln $ = 2.825 + 1.725 ln SurfaceArea
Width = 167 inches, Height = 78.74 inches
Area = 13,149.58 square inches; ln Area = 9.484
Predicted ln Price = 2.825 + 1.725(9.484) = 19.185
Predicted Price = exp(19.185) = $214,785,473.40

Prediction Interval for Price
The prediction interval for ln Price is
\[
\text{Predicted ln Price} \;\pm\; 1.96\sqrt{\, s_e^2\!\left(1 + \frac{1}{N}\right) + (\ln Area^* - \overline{\ln Area})^2\,[\mathrm{SE}(b)]^2 \,}
\]
with ln Area* = ln(167 × 78.74) = 9.484, mean ln Area = 6.72918 (computed from the data), s_e = 1.00645 (from the regression results), and SE(b) = 0.1908:
\[
19.185 \;\pm\; 1.96\sqrt{\,(1.00645)^2\!\left(1 + \tfrac{1}{328}\right) + (9.484 - 6.72918)^2 (.1908)^2\,}
= 19.185 \pm 2.228 = [16.957 \text{ to } 21.413]
\]
Predicted Price = exp(16.957) to exp(21.413) = $23,138,304 to $1,993,185,600.
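Because the model is in logs, the dollar interval comes from exponentiating the endpoints of the ln-price interval, as in the log-price slides above. A short Python sketch of that step (the function name is my own; the numbers are those reported on the slides):

```python
import math

def price_interval_from_logs(ln_point, ln_half_width):
    """Convert a prediction interval for ln(price) into a dollar interval
    by exponentiating the point estimate and the two endpoints."""
    return (math.exp(ln_point),
            math.exp(ln_point - ln_half_width),
            math.exp(ln_point + ln_half_width))

# Out-of-sample Monet (36.5" x 29"): ln-price interval 14.809 +/- 1.9776
print(price_interval_from_logs(14.809, 1.9776))
# roughly ($2.70 million, $0.37 million, $19.5 million)

# The huge painting (167" x 78.74"): ln-price interval 19.185 +/- 2.228
print(price_interval_from_logs(19.185, 2.228))
# roughly ($215 million, $23 million, $1.99 billion) -- a very wide interval
```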
Use the Monet Model to Predict a Price for a Dali?
[Figure: Salvador Dali, The Hallucinogenic Toreador, 157" x 118" (13 feet 1 inch by 9 feet 10 inches), shown with an average-sized Monet, 32.1" x 26.2" (2 feet 8 inches by 2 feet 2.2 inches), for scale.]

Forecasting Out of Sample
[Figure: fitted line plot of G versus Income with 95% prediction interval bands. Per capita gasoline consumption vs. per capita income, 1953–2004.]
Regression Analysis: G versus Income
The regression equation is G = 1.93 + 0.000179 Income

Predictor   Coef         SE Coef      T       P
Constant    1.9280       0.1651       11.68   0.000
Income      0.00017897   0.00000934   19.17   0.000

S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%

How do we predict G for 2017? We would first need to predict Income for 2017. How should we do that?

Data Trimming
(Subset the worksheet to the rows that match a condition.)
[Figure: two fitted line plots of ln (US$) versus ln (SurfaceArea).]
- All 430 sales: ln (US$) = 5.290 + 1.326 ln (SurfaceArea); S = 1.10354, R-Sq = 33.4%, R-Sq(adj) = 33.2%.
- 377 sales with 403.4 < area < 2981.0 (6 < log area < 8): ln (US$) = 3.068 + 1.662 ln (SurfaceArea); S = 1.09636, R-Sq = 17.8%, R-Sq(adj) = 17.6%.
The sample is restricted to particular values of X – area between 403 and 2981. Trimming is generally benign, but the regression should be understood to apply to the specified range of x. The trimming is based on a variable that is not related to the underlying noise in Y.

Truncation
[Figure: two fitted line plots of ln (US$) versus ln (SurfaceArea).]
- Entire sample: ln (US$) = 5.290 + 1.326 ln (SurfaceArea); S = 1.10354, R-Sq = 33.4%, R-Sq(adj) = 33.2%.
- Subsample with 500,000 < Price < 3,000,000: ln (US$) = 11.44 + 0.3821 ln (SurfaceArea); S = 0.487426, R-Sq = 5.9%, R-Sq(adj) = 5.4%.
Truncation based on the values of the dependent variable is VERY BAD. It reduces, and sometimes destroys, the relationship. This is one reason we resist removing "outliers" from the sample. (A small simulation illustrating the effect appears at the end of this part.)

Where Have We Been?
- Sample data – description and display.
- Probability models:
  - Models for random experiments.
  - Models for random processes underlying sample data.
- Random variables.
- Models for covariation of random variables:
  - The linear regression model for covariation of a pair of variables.

Where Do We Go From Here?
- Simple linear regression:
  - Thus far, mostly a descriptive device.
  - Use for prediction and forecasting.
  - Yet to consider: statistical inference, testing the relationship.
- Multiple linear regression:
  - More than one variable to explain the variation of Y.
  - More elaborate model building.
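As a closing note on the trimming and truncation slides: the claim that truncating on the dependent variable attenuates the relationship, while trimming on x is relatively benign, is easy to check by simulation. The sketch below is not from the slides; it uses artificial data generated from a known line (y = 1 + 2x plus noise), so the true slope is 2, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Artificial data from a known relationship: y = 1 + 2x + noise
n = 1000
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 3, n)

print("full sample slope:", ols_slope(x, y))                     # close to 2

# Trimming: drop observations with extreme x (keep 2 < x < 8)
keep_x = (x > 2) & (x < 8)
print("trimmed-on-x slope:", ols_slope(x[keep_x], y[keep_x]))    # still close to 2

# Truncation: drop observations with extreme y (keep the middle half of y)
keep_y = (y > np.quantile(y, 0.25)) & (y < np.quantile(y, 0.75))
print("truncated-on-y slope:", ols_slope(x[keep_y], y[keep_y]))  # noticeably attenuated
```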