Notes 17: Regression Modeling

Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Part 17: Regression Residuals
Part 17 – The Linear Regression Model
Regression Modeling
Theory behind the regression model
Computing the regression statistics
Interpreting the results
Application: Statistical Cost Analysis

A Linear Regression
Predictor: Box Office = -14.36 + 72.72 Buzz
Data and Relationship

We suggested the relationship between box office
sales and internet buzz is
Box Office = -14.36 + 72.72 Buzz


Box Office is not exactly equal to -14.36 + 72.72 Buzz.
How do we reconcile the equation with the data?
Modeling the Underlying Process

A model that explains the process that produces
the data that we observe:




Regression model

Observed outcome = the sum of two parts
(1) Explained: The regression line
(2) Unexplained (noise): The remainder.
Internet Buzz is not the only thing that explains Box
Office, but it is the only variable in the equation.
The “model” is the statement that part (1) is the
same process from one observation to the next.
The Population Regression

THE model:



Model statement


(1) Explained:
Explained Box Office = α + β Buzz
(2) Unexplained: The rest is “noise, ε.”
Random ε has certain characteristics
Box Office = α + β Buzz + ε
Box Office is related to Buzz, but is not exactly
equal to α + β Buzz
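The two-part model can be sketched as a small simulation. This is only an illustration: the parameter values are hypothetical, and drawing ε from a normal distribution is an added assumption (the model itself only requires mean zero and a constant variance).

```python
import random

def simulate_box_office(buzz_values, alpha, beta, sigma, seed=0):
    """Generate outcomes y = alpha + beta * x + eps, one per Buzz value.

    Assumption for illustration: eps is drawn from Normal(0, sigma).
    """
    rng = random.Random(seed)
    outcomes = []
    for x in buzz_values:
        explained = alpha + beta * x      # part (1): the regression line
        noise = rng.gauss(0.0, sigma)     # part (2): the unexplained disturbance
        outcomes.append(explained + noise)
    return outcomes

# Hypothetical values, not estimates from the movie data.
buzz = [0.2, 0.4, 0.6, 0.8]
box_office = simulate_box_office(buzz, alpha=-14.0, beta=72.0, sigma=13.0)
for x, y in zip(buzz, box_office):
    print(f"Buzz={x:.1f}  explained={-14.0 + 72.0 * x:6.2f}  observed={y:6.2f}")
```

The observed values scatter around the explained part; with sigma set to 0 they would fall exactly on the line.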
The Data Include the Noise
What explains the noise?
What explains the variation in fuel bills?
[Figure: Scatterplot of FUELBILL (vertical axis) vs ROOMS (horizontal axis)]
Noisy Data?
What explains the variation in milk production other
than number of cows?
Assumptions

(Regression) The equation linking “Box Office”
and “Buzz” is stable
E[Box Office | Buzz] = α + β Buzz

Another sample of movies, say 2012, would
obey the same fundamental relationship.
Model Assumptions

yi = α + βxi + εi



The Disturbance is Random Noise



α + β xi is the “regression function”
εi is the “disturbance.” It is the unobserved
random component.
Mean zero. The regression is the mean of yi given xi.
εi is the deviation from the regression.
Variance σ².
We will use the data to estimate α and β
Sample: a + b Buzz
We also want to estimate σ = √E[εi²]
e = y - a - b Buzz
Sample: a + b Buzz
Standard Deviation of the Residuals
Standard deviation of εi = yi-α-βxi is σ
σ = √E[εi²] (the mean of εi is zero)
Sample a and b estimate α and β
Residual ei = yi - a - bxi estimates εi
Use √[Σei²/(N-2)] to estimate σ.





$$s_e = \sqrt{\frac{\sum_{i=1}^{N} e_i^2}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - a - b x_i)^2}{N-2}}$$
Why N-2? Relates to the fact that two
parameters (α,β) were estimated.
Same reason N-1 was used to
compute a sample variance.
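As a minimal sketch of this computation, the function below forms the residuals ei = yi - a - bxi and divides their sum of squares by N-2. The four data points are hypothetical, and a = 0.15, b = 1.94 are the least squares values for those points.

```python
import math

def residual_std_dev(x, y, a, b):
    """Estimate sigma by s_e = sqrt(sum of squared residuals / (N - 2))."""
    residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
    n = len(residuals)
    # Divide by N - 2 because two parameters (a and b) were estimated.
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Hypothetical data, used only to show the arithmetic.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
s_e = residual_std_dev(x, y, a=0.15, b=1.94)
print(f"s_e = {s_e:.4f}")
```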
Residuals
Summary: Regression Computations
The same 5 statistics (with N) are still needed:
N = 62 complete observations.
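$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i = 20.721$$
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.48242$$
$$\mathrm{Var}(x) = s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453$$
$$\mathrm{Var}(y) = s_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2 = 305.985$$
$$\mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784$$
$$b = \frac{s_{xy}}{s_x^2} = 72.72$$
$$a = \bar{y} - b\bar{x} = -14.36$$
$$s_e = \sqrt{\frac{(N-1)(s_y^2 - b^2 s_x^2)}{N-2}} = 13.386$$
(for later...)
$$R^2 = \frac{b^2 s_x^2}{s_y^2} = 0.424$$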
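The arithmetic on this slide can be checked with a short script. It starts from the five summary statistics reported above (rounded as shown), so the results should agree with the slide values only up to rounding. The variable names are illustrative.

```python
import math

# Summary statistics as reported on the slide (rounded).
N = 62
y_bar, x_bar = 20.721, 0.48242
var_x, var_y, cov_xy = 0.02453, 305.985, 1.784

b = cov_xy / var_x                      # slope
a = y_bar - b * x_bar                   # intercept
s_e = math.sqrt((N - 1) * (var_y - b ** 2 * var_x) / (N - 2))
r_squared = b ** 2 * var_x / var_y

print(f"b = {b:.2f}, a = {a:.2f}, s_e = {s_e:.3f}, R^2 = {r_squared:.3f}")
```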
Using se to identify outliers
Remember the empirical rule: about 95% of observations will lie within
mean ± 2 standard deviations. We show (a + bx) ± 2se below.
[Figure: data with the regression line and the bands (a + bx) ± 2se; one point is marked]
This point is 2.2 standard deviations from the regression.
Only 3.2% of the 62 observations (2 points) lie outside the bounds.
(We will refine this later.)
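A minimal sketch of the outlier check: each residual is divided by se, and points more than 2 standard deviations from the regression are flagged. The coefficients and se are the estimates reported earlier; the three (Buzz, Box Office) pairs are hypothetical, not taken from the movie data.

```python
def flag_outliers(x, y, a, b, s_e, k=2.0):
    """Return (index, standardized residual) for points outside (a + b*x) +/- k*s_e."""
    flagged = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        z = (yi - (a + b * xi)) / s_e   # residual in units of s_e
        if abs(z) > k:
            flagged.append((i, z))
    return flagged

# Hypothetical observations for illustration only.
buzz = [0.5, 0.6, 0.3]
box_office = [22.0, 60.0, 5.0]
flagged = flag_outliers(buzz, box_office, a=-14.36, b=72.72, s_e=13.386)
for i, z in flagged:
    print(f"observation {i} is {z:.1f} standard deviations from the regression")
```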
Linear Regression
Sample Regression Line
Results to Report
The Reported Results
Estimated equation
Estimated coefficients a and b
S = se = estimated standard deviation of ε
Square of the sample correlation between x and y
N-2 = degrees of freedom
N-1 = sample size minus 1
Sum of squared residuals, Σi ei²
S² = se²
Total Variation = $\sum_{i=1}^{N} (y_i - \bar{y})^2$
Coefficient of Determination R²
$$R^2 = \frac{\text{RegressionSS}}{\text{TotalSS}} = \frac{b^2 \sum_{i=1}^{N} (x_i - \bar{x})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
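The ratio above can be computed directly from data. The sketch below fits the least squares line to four hypothetical points, forms the regression, residual, and total sums of squares, and takes R² = RegressionSS / TotalSS. For a least squares fit, the regression and residual sums of squares add up to the total.

```python
def sums_of_squares(x, y):
    """Return (regression_ss, residual_ss, total_ss) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                  # least squares slope
    a = y_bar - b * x_bar          # least squares intercept
    regression_ss = b ** 2 * sxx
    residual_ss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    return regression_ss, residual_ss, total_ss

# Hypothetical data, used only to show the computation.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
reg_ss, res_ss, tot_ss = sums_of_squares(x, y)
r_squared = reg_ss / tot_ss
print(f"RegressionSS={reg_ss:.3f}  ResidualSS={res_ss:.3f}  TotalSS={tot_ss:.3f}")
print(f"R^2 = {r_squared:.4f}")
```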
The Model

Constructed to provide a framework for
interpreting the observed data


What is the meaning of the observed relationship
(assuming there is one)
How it’s used


Prediction: What reason is there to assume that we
can use sample observations to predict outcomes?
Testing relationships
A Cost Model
Electricity.mpj
Total cost in $Million
Output in Million KWH
N = 123 American electric utilities
Model: Cost = α + βKWH + ε
Cost Relationship
[Figure: Scatterplot of Cost (vertical axis) vs Output (horizontal axis)]
Sample Regression
Interpreting the Model
Cost = 2.44 + 0.00529 Output + e
Cost is in $Million; Output is in Million KWH.
Fixed Cost = Cost when Output = 0
Fixed Cost = $2.44 Million
Marginal cost
= Change in cost / change in output
= 0.00529 $Million per Million KWH
= 0.00529 $/KWH = 0.529 cents/KWH.
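The unit conversion and a prediction from the estimated equation can be sketched as follows. The coefficients are the estimates reported above; the function name and the output level of 10,000 Million KWH are chosen only for illustration.

```python
FIXED_COST_MILLION = 2.44   # intercept: cost in $Million when output is 0
SLOPE = 0.00529             # $Million per Million KWH

def predicted_cost_million(output_million_kwh):
    """Predicted cost in $Million from Cost = 2.44 + 0.00529 * Output."""
    return FIXED_COST_MILLION + SLOPE * output_million_kwh

# The "Million" in the numerator and denominator cancel, so the slope is also $/KWH.
marginal_cost_dollars_per_kwh = SLOPE
marginal_cost_cents_per_kwh = marginal_cost_dollars_per_kwh * 100

print(f"Marginal cost: {marginal_cost_cents_per_kwh:.3f} cents/KWH")
print(f"Predicted cost at 10,000 Million KWH: ${predicted_cost_million(10000):.2f} Million")
```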

Summary

Linear regression model



Estimating the parameters of the model



Assumptions of the model
Residuals and disturbances
Regression parameters
Disturbance standard deviation
Computation of the estimated model