Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

Part 17: Regression Residuals

Statistics and Data Analysis
Part 17 – The Linear Regression Model

Regression Modeling
Theory behind the regression model
Computing the regression statistics
Interpreting the results
Application: Statistical Cost Analysis

A Linear Regression Predictor
Box Office = -14.36 + 72.72 Buzz

Data and Relationship
We suggested the relationship between box office sales and internet buzz is
Box Office = -14.36 + 72.72 Buzz
Box Office is not exactly equal to -14.36 + 72.72 × Buzz.
How do we reconcile the equation with the data?

Modeling the Underlying Process
A model that explains the process that produces the data that we observe: the regression model.
Observed outcome = the sum of two parts:
(1) Explained: the regression line
(2) Unexplained (noise): the remainder
Internet Buzz is not the only thing that explains Box Office, but it is the only variable in the equation.
The “model” is the statement that part (1) is the same process from one observation to the next.

The Population Regression
THE model (model statement):
(1) Explained: Explained Box Office = α + β Buzz
(2) Unexplained: The rest is “noise,” ε. The random ε has certain characteristics.
Box Office = α + β Buzz + ε
Box Office is related to Buzz, but is not exactly equal to α + β Buzz.

The Data Include the Noise

What explains the noise?
What explains the variation in fuel bills?
[Figure: Scatterplot of FUELBILL vs ROOMS]

Noisy Data?
What explains the variation in milk production other than number of cows?
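As a rough illustration of the two-part decomposition described above (not part of the original slides), the sketch below simulates Box Office = α + β Buzz + ε in Python. The values used for α and β borrow the fitted line from the earlier slide purely as stand-ins for the unknown population parameters. The noise standard deviation `sigma`, the range of Buzz values, the choice of a normal distribution for ε, and the function name `simulate_box_office` are all assumptions made for this example.

```python
import random


def simulate_box_office(n, alpha, beta, sigma, seed=0):
    """Generate n observations as explained part plus noise.

    Assumptions for illustration only: Buzz is uniform on [0.2, 0.8]
    and the noise is normal with mean zero and std. deviation sigma.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        buzz = rng.uniform(0.2, 0.8)       # hypothetical range of Buzz
        explained = alpha + beta * buzz    # part (1): the regression line
        noise = rng.gauss(0.0, sigma)      # part (2): mean-zero disturbance
        rows.append((buzz, explained, noise, explained + noise))
    return rows


# Stand-in parameter values; sigma = 13.4 is a hypothetical choice.
rows = simulate_box_office(n=62, alpha=-14.36, beta=72.72, sigma=13.4)
for buzz, explained, noise, box_office in rows[:5]:
    print(f"Buzz={buzz:.3f}  explained={explained:7.2f}  "
          f"noise={noise:7.2f}  Box Office={box_office:7.2f}")
```

Changing the seed changes the individual draws, but every observation is still produced by the same line plus fresh noise, which is what the model statement asserts.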
Assumptions (Regression)
The equation linking “Box Office” and “Buzz” is stable:
E[Box Office | Buzz] = α + β Buzz
Another sample of movies, say from 2012, would obey the same fundamental relationship.

Model Assumptions
\[ y_i = \alpha + \beta x_i + \varepsilon_i \]
The Disturbance is Random Noise
\(\alpha + \beta x_i\) is the “regression function.”
\(\varepsilon_i\) is the “disturbance.” It is the unobserved random component.
Mean zero: the regression is the mean of \(y_i\), and \(\varepsilon_i\) is the deviation from the regression.
Variance \(\sigma^2\).

We will use the data to estimate α and β.
Sample: a + b Buzz

We also want to estimate \(\sigma = \sqrt{E[\varepsilon_i^2]}\).
Residual: e = y - a - b Buzz
Sample: a + b Buzz

Standard Deviation of the Residuals
The standard deviation of \(\varepsilon_i = y_i - \alpha - \beta x_i\) is \(\sigma = \sqrt{E[\varepsilon_i^2]}\) (the mean of \(\varepsilon_i\) is zero).
The sample a and b estimate α and β.
The residual \(e_i = y_i - a - b x_i\) estimates \(\varepsilon_i\).
Use \(\sqrt{\tfrac{1}{N-2}\sum_i e_i^2}\) to estimate σ:
\[ s_e = \sqrt{\frac{\sum_{i=1}^{N} e_i^2}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - a - b x_i)^2}{N-2}} \]
Why N-2? It reflects the fact that two parameters (α, β) were estimated. This is the same reason N-1 was used to compute a sample variance, where one parameter (the mean) was estimated.

Residuals

Summary: Regression Computations
The same 5 statistics (with N) are still needed. N = 62 complete observations.
\[ \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 20.721 \]
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.48242 \]
\[ \mathrm{Var}(x) = s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453 \]
\[ \mathrm{Var}(y) = s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2 = 305.985 \]
\[ \mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784 \]
\[ b = \frac{s_{xy}}{s_x^2} = 72.72 \]
\[ a = \bar{y} - b\bar{x} = -14.36 \]
\[ s_e = \sqrt{\frac{(N-1)(s_y^2 - b^2 s_x^2)}{N-2}} = 13.386 \]
(For later...)
\[ R^2 = \frac{b^2 s_x^2}{s_y^2} = 0.424 \]

Using \(s_e\) to identify outliers
Remember the empirical rule: about 95% of observations will lie within the mean ± 2 standard deviations. (We show \((a + bx) \pm 2 s_e\) below.)
This point is 2.2 standard deviations from the regression.
Only 3.2% of the 62 observations lie outside the bounds. (We will refine this later.)

Linear Regression
Sample Regression Line

Results to Report

The Reported Results

Estimated equation

Estimated coefficients a and b

S = \(s_e\) = estimated standard deviation of ε

Square of the sample correlation between x and y

N-2 = degrees of freedom
N-1 = sample size minus 1

Sum of squared residuals, \(\sum_i e_i^2\)

\(S^2 = s_e^2\)

Total Variation = \(\sum_{i=1}^{N} (y_i - \bar{y})^2\)

Coefficient of Determination
\[ R^2 = \frac{\text{RegressionSS}}{\text{TotalSS}} = \frac{b^2 \sum_{i=1}^{N} (x_i - \bar{x})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \]

The Model
Constructed to provide a framework for interpreting the observed data.
What is the meaning of the observed relationship (assuming there is one)?
How it’s used:
Prediction: What reason is there to assume that we can use sample observations to predict outcomes?
Testing relationships

A Cost Model
Electricity.mpj
Total cost in $Million
Output in Million KWH
N = 123 American electric utilities
Model: Cost = α + β KWH + ε

Cost Relationship
[Figure: Scatterplot of Cost vs Output]

Sample Regression

Interpreting the Model
Cost = 2.44 + 0.00529 Output + e
Cost is in $Million, Output is in Million KWH.
Fixed Cost = Cost when output = 0
Fixed Cost = $2.44 Million
Marginal cost = Change in cost / change in output
= 0.00529 × $Million / Million KWH
= 0.00529 $/KWH
= 0.529 cents/KWH

Summary
Linear regression model
Estimating the parameters of the model
Assumptions of the model
Residuals and disturbances
Regression parameters
Disturbance standard deviation
Computation of the estimated model
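The computation of the estimated model listed in the summary can be sketched in Python as follows. The function name `regression_from_summary` and its argument names are made up for this illustration; the formulas and the six input values are the ones given on the "Summary: Regression Computations" slide for the Box Office and Buzz data.

```python
import math


def regression_from_summary(n, x_bar, y_bar, var_x, var_y, cov_xy):
    """Return (a, b, s_e, r_squared) from the five summary statistics and N."""
    b = cov_xy / var_x                     # slope: s_xy / s_x^2
    a = y_bar - b * x_bar                  # intercept: y-bar minus b times x-bar
    explained_var = b * b * var_x          # b^2 s_x^2
    # Residual standard deviation, with N-2 degrees of freedom.
    s_e = math.sqrt((n - 1) * (var_y - explained_var) / (n - 2))
    r_squared = explained_var / var_y      # share of variation explained
    return a, b, s_e, r_squared


# Rounded summary statistics as printed on the slide.
a, b, s_e, r_squared = regression_from_summary(
    n=62, x_bar=0.48242, y_bar=20.721,
    var_x=0.02453, var_y=305.985, cov_xy=1.784,
)
print(f"a = {a:.2f}, b = {b:.2f}, s_e = {s_e:.3f}, R^2 = {r_squared:.3f}")
```

Because the inputs are the rounded values shown on the slide, the slope comes out near 72.73 rather than the reported 72.72; the other results agree with the slide to within rounding.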