Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Part 16: Regression Model Specification
Part 16 – Aspects of Regression

Regression Models
  Prediction
  Loose Ends
  Trimming
  Truncation
  Summary
  Where to next

Prediction
  Use of the model for prediction: use “x” to predict y based on y = α + βx + ε.
  Sources of uncertainty:
    Predicting “x” first
    Using sample estimates of α and β (and, possibly, σ)
    Can’t predict noise, ε
    Predicting outside the range of experience: uncertainty about the reach of the regression model.

Base Case Prediction
  For a given value of x*: use the equation.
    True y = α + βx* + ε
    Obvious estimate: ŷ = a + bx* (note, no prediction for ε)
  Minimal sources of prediction error:
    Can never predict ε at all
    The farther from the center of experience, the greater is the uncertainty.

Prediction Interval
  Prediction includes a range of uncertainty.
  Point estimate:
  \[ \hat{y} = a + bx^* \]
  The range of uncertainty around the prediction:
  \[ a + bx^* \pm 1.96\, S_e \sqrt{1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}} \]
    1.96: the usual 95%.
    The leading 1 under the square root: due to ε.
    The remaining two terms under the square root: due to estimating α and β with a and b.
  (Remember the empirical rule: about 95% of the distribution lies within two standard deviations.)
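The prediction interval formula above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own software: the function name prediction_interval and the five data points are made up for the example, and S_e is computed with N - 2 degrees of freedom, which is the usual convention but is an assumption here because the slide does not state it.

```python
import math

def prediction_interval(x, y, x_star, z=1.96):
    """Return (y_hat, lower, upper) for a simple regression of y on x at x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x and cross-product of deviations
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # least squares slope
    a = y_bar - b * x_bar      # least squares intercept
    # Standard error of the regression (assumed: N - 2 degrees of freedom)
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = a + b * x_star
    # 1 is due to epsilon; 1/N and the last term are due to estimating alpha and beta
    half_width = z * s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat, y_hat - half_width, y_hat + half_width

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
for x_star in (3.0, 5.0, 10.0):
    y_hat, lower, upper = prediction_interval(x, y, x_star)
    print(f"x* = {x_star:4.1f}: y_hat = {y_hat:6.3f}, interval = {lower:6.3f} to {upper:6.3f}")
```

Running it for x* at the mean of x and then farther away shows the interval widening as x* moves from the center of the data.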

Slightly Simpler Formula for Prediction
  Prediction includes a range of uncertainty.
  Point estimate:
  \[ \hat{y} = a + bx^* \]
  The range of uncertainty around the prediction:
  \[ a + bx^* \pm 1.96 \sqrt{S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[SE(b)]^2} \]

Prediction from Internet Buzz Regression
  [Figure: regression results for Box Office on Buzz.]
  Mean of Buzz = 0.48242
  Max(Buzz) = 0.79

Prediction Interval for Buzz = .8
  Predict Box Office for Buzz = .8
  a + bx* = -14.36 + 72.72(.8) = 43.82
  \[ \sqrt{s_e^2\left(1 + \frac{1}{N}\right) + (0.8 - \overline{\text{Buzz}})^2\, SE(b)^2} = \sqrt{13.3863^2\left(1 + \frac{1}{62}\right) + (0.8 - 0.48242)^2\, 10.94^2} = 13.93 \]
  Interval = 43.82 ± 1.96(13.93) = 16.52 to 71.12

Predicting Using a Loglinear Equation
  Predict the log first:
    Prediction of the log
    Prediction interval (Lower to Upper)
  Prediction = exp(lower) to exp(upper)
  This produces very wide intervals.

Interval Estimates for the Sample of Monet Paintings
  [Figure: fitted line plot with regression line and 95% PI, ln (US$) versus ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%.]

  Regression Analysis: ln (US$) versus ln (SurfaceArea)
  The regression equation is
  ln (US$) = 2.83 + 1.72 ln (SurfaceArea)

  Predictor           Coef     SE Coef   T      P
  Constant            2.825    1.285     2.20   0.029
  ln (SurfaceArea)    1.7246   0.1908    9.04   0.000

  S = 1.00645   R-Sq = 20.0%   R-Sq(adj) = 19.8%
  Mean of ln (SurfaceArea) = 6.72918

Prediction for An Out of Sample Monet
  Claude Monet: Bridge Over a Pool of Water Lilies. 1899. Original, 36.5” x 29”.
  lnSurface = ln(36.5 × 29) = 6.96461
  Prediction = 2.83 + 1.72(6.96461) = 14.809
  \[ \text{Uncertainty} = \pm 1.96 \sqrt{1.00645^2\left(1 + \frac{1}{328}\right) + (6.96461 - 6.72918)^2 (0.1908)^2} \]
  \[ = \pm 1.96 \sqrt{1.012942(1.003049) + (0.23543)^2 (0.1908)^2} = \pm 1.96(1.008984) = \pm 1.977608 \]
  Prediction Interval = 14.809 ± 1.977608 = 12.83139 to 16.786608

Predicting y when the Model Describes log y
  The interval predicts log price. What about the price?
  Predicted Price: Mean = Exp(a + bx*) = Exp(14.809) = $2,700,641.78
  Upper Limit = Exp(14.809 + 1.9776) = $19,513,166.53
  Lower Limit = Exp(14.809 - 1.9776) = $373,771.53

Van Gogh: Irises
  39.5 x 39.125. Prediction by our model = $17.903M
  Painting is in our data set.
  Sold for 16.81M on 5/6/04
  Sold for 7.729M on 2/5/01
  Last sale in our data set was in May 2004.
  Record sale was 6/25/08: market peak, just before the crash.

Uncertainty in Prediction
  The interval is narrowest at x* = x̄, the center of our experience. The interval widens as we move away from the center of our experience to reflect the greater uncertainty:
  (1) Uncertainty about the prediction of x
  (2) Uncertainty that the linear relationship will continue to exist as we move farther from the center.
  \[ \pm 1.96 \sqrt{s_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,(SE(b))^2} \]

http://www.nytimes.com/2006/05/16/arts/design/16oran.html

"Morning", Claude Monet, 1920-1926, oil on canvas, 200 x 425 cm, Musée de l'Orangerie, Paris, France. Width: 167” (13 feet 11 inches).
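
The log-scale interval for Bridge Over a Pool of Water Lilies and its conversion to a price interval can be retraced with a short sketch. It uses only the summary figures reported on the slides (rounded coefficients 2.83 and 1.72, S = 1.00645, SE(b) = 0.1908, N = 328, mean ln area 6.72918); the function name log_model_price_interval is hypothetical, chosen for this illustration.

```python
import math

def log_model_price_interval(a, b, s_e, se_b, n, x_bar, x_star, z=1.96):
    """Interval for ln(price) from summary statistics, then exponentiated to prices."""
    ln_pred = a + b * x_star
    # Same formula as on the slides: s_e^2 (1 + 1/N) + (x* - x_bar)^2 SE(b)^2
    std_err = math.sqrt(s_e ** 2 * (1 + 1 / n) + (x_star - x_bar) ** 2 * se_b ** 2)
    half_width = z * std_err
    ln_interval = (ln_pred - half_width, ln_pred + half_width)
    price_interval = (math.exp(ln_interval[0]), math.exp(ln_interval[1]))
    return ln_pred, half_width, ln_interval, price_interval

# Values taken from the Monet regression slides
ln_area = math.log(36.5 * 29)
ln_pred, half_width, ln_interval, price_interval = log_model_price_interval(
    a=2.83, b=1.72, s_e=1.00645, se_b=0.1908, n=328, x_bar=6.72918, x_star=ln_area
)
print(f"ln area       = {ln_area:.5f}")
print(f"predicted ln  = {ln_pred:.3f} +/- {half_width:.4f}")
print(f"ln interval   = {ln_interval[0]:.4f} to {ln_interval[1]:.4f}")
print(f"price range   = ${price_interval[0]:,.0f} to ${price_interval[1]:,.0f}")
```

Because the bounds are exponentiated, an interval that is symmetric in logs becomes strongly asymmetric in dollars, which is why the price range looks so wide.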
Left panel. Height: 78.74” (6 feet 7 inches). [The figure also shows, for scale, a painting of 26.2” (2 feet 2.2”) by 32.1” (2 feet 8 inches).]

Predicted Price for a Huge Painting
  Regression Equation: ln $ = 2.825 + 1.725 ln Surface Area
  Width = 167 Inches
  Height = 78.74 Inches
  Area = 13,149.58 Square inches, ln = 9.484
  Predicted ln Price = 2.825 + 1.725 (9.484) = 19.185
  Predicted Price = exp(19.185) = $214,785,473.40

Prediction Interval for Price
  Prediction Interval for ln Price is
  \[ \text{Predicted ln Price} \pm 1.96 \sqrt{S_e^2\left(1 + \frac{1}{N}\right) + \left(\ln \text{Area}^* - \overline{\ln \text{Area}}\right)^2 SE^2(b)} \]
  ln Area* = ln (167 × 78.74) = 9.484
  Mean of ln Area = 6.72918 (computed from the data)
  Se = 1.00645 (from regression results)
  SE(b) = 0.1908
  \[ 19.185 \pm 1.96 \sqrt{(1.00645)^2\left(1 + \frac{1}{328}\right) + (9.484 - 6.72918)^2 (0.1908)^2} \]
  = 19.185 ± 2.228 = [16.957 to 21.413]
  Predicted Price = exp(16.957) to exp(21.413) = $23,138,304 to $1,993,185,600

Use the Monet Model to Predict a Price for a Dali?
  Hallucinogenic Toreador
  [Figure: dimension labels 157” (13 feet 1 inch) and 118” (9 feet 10 inches), shown beside an Average Sized Monet, 26.2” (2 feet 2.2”) by 32.1” (2 feet 8 inches).]

Forecasting Out of Sample
  [Figure: fitted line plot with regression line and 95% PI, G versus Income. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.]

  Regression Analysis: G versus Income
  The regression equation is
  G = 1.93 + 0.000179 Income

  Predictor   Coef         SE Coef      T       P
  Constant    1.9280       0.1651       11.68   0.000
  Income      0.00017897   0.00000934   19.17   0.000

  S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%

  How to predict G for 2017? You would need first to predict Income for 2017. How should we do that?

Data Trimming
  Data > Subset Worksheet > Rows that match condition.
  [Figure: two fitted line plots of ln (US$) versus ln (SurfaceArea).]
  All 430 Sales: ln (US$) = 5.290 + 1.326 ln (SurfaceArea); S = 1.10354, R-Sq = 33.4%, R-Sq(adj) = 33.2%
  377 Sales with 403.4 < area < 2981.0 (log area > 6 and < 8): ln (US$) = 3.068 + 1.662 ln (SurfaceArea); S = 1.09636, R-Sq = 17.8%, R-Sq(adj) = 17.6%
  The sample is restricted to particular values of X: area between 403 and 2981.
  Trimming is generally benign, but the regression should be understood to apply to the specified range of x.
  The trimming is based on a variable not related to the underlying noise in Y.

Truncation
  [Figure: two fitted line plots of ln (US$) versus ln (SurfaceArea).]
  Entire Sample: ln (US$) = 5.290 + 1.326 ln (SurfaceArea); S = 1.10354, R-Sq = 33.4%, R-Sq(adj) = 33.2%
  Subsample with 500,000 < Price < 3,000,000: ln (US$) = 11.44 + 0.3821 ln (SurfaceArea); S = 0.487426, R-Sq = 5.9%, R-Sq(adj) = 5.4%
  Truncation based on the values of the dependent variable is VERY BAD. It reduces and sometimes destroys the relationship.
  This is one reason we resist removing “outliers” from the sample.

Where Have We Been?
  Sample data – describing, display
  Probability models
  Models for random experiments
  Models for random processes underlying sample data
  Random variables
  Models for covariation of random variables
  Linear regression model for covariation of a pair of variables

Where Do We Go From Here?
  Simple linear regression
    Thus far, mostly a descriptive device
    Use for prediction and forecasting
    Yet to consider: statistical inference, testing the relationship
  Multiple linear regression
    More than one variable to explain the variation of Y
    More elaborate model building
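
The contrast drawn in the Data Trimming and Truncation slides, selecting on x versus selecting on y, can be checked with a small simulation. This is a sketch on simulated data, not the Monet sample: the true line and noise level are loosely borrowed from the full-sample fit (5.290 + 1.326 ln area, S about 1.1), while the normal distribution for ln area, the sample size, and the seed are assumptions made only for the illustration.

```python
import math
import random

def ols_slope(points):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sxx

random.seed(16)  # arbitrary seed so the run is repeatable

# Simulated (ln area, ln price) pairs; the distribution of ln area is assumed
sample = []
for _ in range(5000):
    ln_area = random.gauss(6.7, 0.7)
    ln_price = 5.290 + 1.326 * ln_area + random.gauss(0.0, 1.1)
    sample.append((ln_area, ln_price))

# Trimming: select on x (6 < ln area < 8), as in the Data Trimming slide
trimmed = [(x, y) for x, y in sample if 6.0 < x < 8.0]

# Truncation: select on y (500,000 < price < 3,000,000), as in the Truncation slide
low, high = math.log(500_000), math.log(3_000_000)
truncated = [(x, y) for x, y in sample if low < y < high]

slope_all = ols_slope(sample)
slope_trimmed = ols_slope(trimmed)
slope_truncated = ols_slope(truncated)

print(f"all data:        n = {len(sample):5d}, slope = {slope_all:.3f}")
print(f"trimmed on x:    n = {len(trimmed):5d}, slope = {slope_trimmed:.3f}")
print(f"truncated on y:  n = {len(truncated):5d}, slope = {slope_truncated:.3f}")
```

In runs of this kind the slope from the sample trimmed on x stays near the true 1.326, while the slope from the sample truncated on y is pulled strongly toward zero, the same pattern the two slides show for the Monet data.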