Specifying the Regression Model

Statistics and Data
Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
16-1/25
Part 16: Regression Model Specification
Statistics and Data Analysis
Part 16 – Aspects of
Regression
Regression Models
  Prediction
  Loose Ends
    Trimming
    Truncation
  Summary
  Where to next
Prediction
  Use of the model for prediction
    Use “x” to predict y based on y = α + βx + ε
  Sources of uncertainty
    Predicting “x” first
    Using sample estimates of α and β (and, possibly, σ)
    Can’t predict noise, ε
    Predicting outside the range of experience: uncertainty about the reach of the regression model.
Base Case Prediction
  For a given value of x*:
    Use the equation.
    True y = α + βx* + ε
    Obvious estimate: ŷ = a + bx*
    (Note, no prediction for ε)
  Minimal sources of prediction error
    Can never predict ε at all
    The farther from the center of experience, the greater is the uncertainty.
Prediction Interval
Prediction includes a range of uncertainty
Point estimate: $\hat{y} = a + bx^*$
The range of uncertainty around the prediction:
$$a + bx^* \pm 1.96\, S_e \sqrt{1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}}$$
  1.96: The usual 95%
  The leading 1 under the square root: Due to ε
  The remaining terms under the square root: Due to estimating α and β with a and b
(Remember the empirical rule, 95% of the distribution within two standard deviations.)
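The interval formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `prediction_interval` and the five data points are made up, and S_e is computed as the square root of the residual sum of squares divided by N - 2, the usual definition in simple regression, which this slide does not restate.

```python
import math

def prediction_interval(x, y, x_star, z=1.96):
    """Return (point, lower, upper) for a prediction at x_star.

    Hypothetical helper: fits y = a + b*x by least squares and applies
    a + b*x_star +/- z * S_e * sqrt(1 + 1/N + (x_star - x_bar)^2 / Sxx).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # least squares slope
    a = y_bar - b * x_bar              # least squares intercept
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))     # assumed definition of S_e
    half_width = z * s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    point = a + b * x_star
    return point, point - half_width, point + half_width

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

center = prediction_interval(x, y, 3.0)    # x* at the sample mean of x
far = prediction_interval(x, y, 10.0)      # x* well outside the data
print("at the mean:", center)
print("far away:   ", far)
```

Comparing the two results shows the pattern the formula implies: the interval is narrowest when x* equals the sample mean and widens as x* moves away from it.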
Slightly Simpler Formula for Prediction
Prediction includes a range of uncertainty
Point estimate: $\hat{y} = a + bx^*$
The range of uncertainty around the prediction:
$$a + bx^* \pm 1.96 \sqrt{S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[SE(b)]^2}$$
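The simpler formula is the same interval as the one on the Prediction Interval slide. A short derivation, assuming the standard expression for the slope's standard error in simple regression, $SE(b) = S_e / \sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}$ (not restated on these slides), moves $S_e$ inside the square root and substitutes:

```latex
\begin{aligned}
S_e^2\left[1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2}\right]
&= S_e^2\left(1 + \frac{1}{N}\right)
   + (x^* - \bar{x})^2\,\frac{S_e^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \\
&= S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,[SE(b)]^2 .
\end{aligned}
```

The practical advantage is that $S_e$, $SE(b)$, $N$ and $\bar{x}$ can all be read from standard regression output, so the sum of squared deviations of x never has to be computed separately.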
Prediction from Internet Buzz Regression
Mean of Buzz = 0.48242
Max(Buzz) = 0.79
Prediction Interval for Buzz = .8
Predict Box Office for Buzz = .8
a + bx = -14.36 + 72.72(.8) = 43.82
$$\sqrt{S_e^2\left(1 + \frac{1}{N}\right) + (.8 - \overline{\text{Buzz}})^2\,[SE(b)]^2} = \sqrt{13.3863^2\left(1 + \frac{1}{62}\right) + (.8 - .48242)^2\,(10.94)^2} = 13.93$$
Interval = 43.82 ± 1.96(13.93)
= 16.52 to 71.12
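The Buzz calculation can be checked with a short script. This is a sketch that uses only the numbers printed on the slides; the variable names are illustrative. Because the slide rounds the intermediate values (43.82 and 13.93) before forming the interval, the unrounded endpoints can differ from 16.52 and 71.12 in the second decimal place.

```python
import math

# Values reported on the slides for the Internet Buzz regression
a, b = -14.36, 72.72      # intercept and slope
s_e = 13.3863             # standard error of the regression, S_e
se_b = 10.94              # standard error of the slope, SE(b)
n = 62                    # number of observations, N
buzz_mean = 0.48242       # sample mean of Buzz
x_star = 0.8              # value of Buzz at which to predict

point = a + b * x_star
std_err = math.sqrt(s_e**2 * (1 + 1 / n) + (x_star - buzz_mean) ** 2 * se_b**2)
lower = point - 1.96 * std_err
upper = point + 1.96 * std_err

print(f"point prediction = {point:.2f}")
print(f"prediction standard error = {std_err:.2f}")
print(f"95% interval = {lower:.2f} to {upper:.2f}")
```

Note that x* = 0.8 is slightly above Max(Buzz) = 0.79, so this prediction is already at the edge of the sample's experience.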
Predicting Using a Loglinear Equation
  Predict the log first
    Prediction of the log
    Prediction interval: (Lower to Upper)
  Prediction = exp(lower) to exp(upper)
  This produces very wide intervals.
Interval Estimates for the Sample of Monet Paintings
[Figure: fitted line plot with regression line and 95% PI, ln (US$) versus ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]
Regression Analysis: ln (US$) versus ln (SurfaceArea)
The regression equation is
ln (US$) = 2.83 + 1.72 ln (SurfaceArea)
Predictor          Coef     SE Coef   T      P
Constant           2.825    1.285     2.20   0.029
ln (SurfaceArea)   1.7246   0.1908    9.04   0.000
S = 1.00645 R-Sq = 20.0% R-Sq(adj) = 19.8%
Mean of ln (SurfaceArea) = 6.72918
Prediction for An Out of Sample Monet
Claude Monet: Bridge Over a Pool of Water Lilies. 1899. Original, 36.5” x 29”.
lnSurface = ln(36.5 × 29) = 6.96461
Prediction = 2.83 + 1.72(6.96461) = 14.809
$$\text{Uncertainty} = \pm 1.96 \sqrt{1.00645^2\left(1 + \frac{1}{328}\right) + (6.96461 - 6.72918)^2 (.1908)^2}$$
$$= \pm 1.96 \sqrt{1.012942(1.003049) + (.23543)^2 (.1908)^2}$$
= ± 1.96(1.008984)
= ± 1.977608
Prediction Interval = 14.809 ± 1.977608
= 12.83139 to 16.786608
Predicting y when the Model Describes log y
The interval predicts log price. What about the price?
Predicted Price: Mean = Exp(a + bx )
= Exp(14.809 ) = $2,700,641.78
Upper Limit
= Exp(14.809+1.9776)
= $19,513,166.53
Lower Limit
= Exp(14.809-1.9776)
= $ 373,771.53
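Converting the log-price interval to dollars is a one-line transformation of each endpoint. The sketch below uses the predicted log price and half-width from the slides (variable names are illustrative) and also shows why the dollar interval looks lopsided: an interval that is symmetric in logs is symmetric in ratios, not in differences.

```python
import math

log_pred = 14.809        # predicted ln price, from the slide
half_width = 1.977608    # 1.96 times the prediction standard error, from the slide

lower_log = log_pred - half_width
upper_log = log_pred + half_width

price = math.exp(log_pred)
lower_price = math.exp(lower_log)
upper_price = math.exp(upper_log)

print(f"ln price interval: {lower_log:.4f} to {upper_log:.4f}")
print(f"predicted price:   ${price:,.0f}")
print(f"price interval:    ${lower_price:,.0f} to ${upper_price:,.0f}")

# Both ratios equal exp(half_width): the upper limit is about 7.2 times the
# prediction, and the prediction is about 7.2 times the lower limit.
print(f"upper / predicted = {upper_price / price:.3f}")
print(f"predicted / lower = {price / lower_price:.3f}")
```

Small differences from the dollar figures on the slide come from the slide rounding the half-width to 1.9776 before exponentiating.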
Van Gogh: Irises
39.5 x 39.125. Prediction by our model = $17.903M
Painting is in our data set. Sold for 16.81M on 5/6/04
Sold for 7.729M on 2/5/01
Last sale in our data set was in May 2004
Record sale was 6/25/08, at the market peak, just before the crash.
Uncertainty in Prediction
$$\pm 1.96 \sqrt{S_e^2\left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\,(SE(b))^2}$$
The interval is narrowest at x* = x̄, the center of our experience.
The interval widens as we move away from the center of our experience to reflect the greater uncertainty.
(1) Uncertainty about the prediction of x
(2) Uncertainty that the linear relationship will continue to exist as we move farther from the center.
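How quickly the interval widens can be seen by evaluating the half-width at a few values of x*. The sketch below plugs in the Monet regression values reported on the slides (S_e = 1.00645, SE(b) = 0.1908, N = 328, mean of ln area = 6.72918); the function name and the choice of evaluation points are illustrative.

```python
import math

# Monet regression values reported on the slides
s_e = 1.00645       # standard error of the regression
se_b = 0.1908       # standard error of the slope
n = 328             # sample size used in the slides' calculation
x_bar = 6.72918     # mean of ln (SurfaceArea)

def half_width(x_star):
    """Half-width of the 95% prediction interval at x_star (in log price units)."""
    return 1.96 * math.sqrt(s_e**2 * (1 + 1 / n) + (x_star - x_bar) ** 2 * se_b**2)

# The sample mean, the 36.5 x 29 inch painting, an intermediate point,
# and the 167 x 78.74 inch panel
for x_star in (x_bar, 6.96461, 8.0, 9.484):
    print(f"ln area = {x_star:.3f}: prediction +/- {half_width(x_star):.3f}")
```

This captures only the first source of uncertainty, the part the formula can measure. The second, whether the linear relationship still holds far from the center, is not reflected in any of these numbers.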
http://www.nytimes.com/2006/05/16/arts/design/16oran.html
"Morning", Claude Monet 1920-1926, oil on canvas 200 x 425 cm, Musée de l'Orangerie, Paris France. Left panel
Dimension labels in the figure: 167” (13 feet 11 inches), 78.74” (6 feet 7 inches), 26.2” (2 feet 2.2”), 32.1” (2 feet 8 inches)
Predicted Price for a Huge Painting
Regression Equation: ln $ = 2.825 + 1.725 ln Surface Area
Width = 167 Inches
Height = 78.74 Inches
Area = 13,149.58 Square inches, ln = 9.484
Predicted ln Price = 2.825 + 1.725 (9.484) = 19.185
Predicted Price = exp(19.185) = $214,785,473.40
Prediction Interval for Price
Prediction Interval for ln Price is
$$\text{Predicted ln Price} \pm 1.96 \sqrt{S_e^2\left(1 + \frac{1}{N}\right) + \left(\ln \text{Area}^* - \overline{\ln \text{Area}}\right)^2 SE^2(b)}$$
ln Area* = ln (167 × 78.74) = 9.484
Mean of ln Area = 6.72918 (computed from the data)
S_e = 1.00645 (from regression results)
SE(b) = 0.1908
$$19.185 \pm 1.96 \sqrt{(1.00645)^2\left(1 + \frac{1}{328}\right) + (9.484 - 6.72918)^2 (.1908)^2}$$
19.185 ± 2.228 = [16.957 to 21.413]
Predicted Price = exp(16.957) to exp(21.413) = $23,138,304 to $1,993,185,600
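The whole calculation for the huge painting, from dimensions to a dollar interval, can be chained together as follows. This is a sketch using the coefficients and summary statistics quoted on the slides; variable names are illustrative. Because it carries unrounded intermediate values, its dollar figures will differ slightly from the slide's, which exponentiates rounded logs.

```python
import math

# Regression results quoted on the slides
a, b = 2.825, 1.725          # intercept and slope of ln price on ln area
s_e = 1.00645                # standard error of the regression
se_b = 0.1908                # standard error of the slope
n = 328                      # sample size used in the slides' calculation
mean_ln_area = 6.72918       # mean of ln (SurfaceArea)

# Dimensions of the panel, in inches
width_in, height_in = 167.0, 78.74

ln_area = math.log(width_in * height_in)
ln_price = a + b * ln_area
half_width = 1.96 * math.sqrt(
    s_e**2 * (1 + 1 / n) + (ln_area - mean_ln_area) ** 2 * se_b**2
)
lower, upper = ln_price - half_width, ln_price + half_width

print(f"area = {width_in * height_in:,.2f} sq in, ln area = {ln_area:.3f}")
print(f"predicted ln price = {ln_price:.3f} +/- {half_width:.3f}")
print(f"predicted price = ${math.exp(ln_price):,.0f}")
print(f"interval = ${math.exp(lower):,.0f} to ${math.exp(upper):,.0f}")
```

The upper limit is roughly 86 times the lower limit, which is what "very wide intervals" means in practice once a log-scale interval is exponentiated this far from the center of the data.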
Use the Monet Model to Predict a Price for a Dali?
Hallucinogenic Toreador: 118” (9 feet 10 inches) by 157” (13 feet 1 inch)
Average Sized Monet: 26.2” (2 feet 2.2”) by 32.1” (2 feet 8 inches)
Forecasting Out of Sample
[Figure: fitted line plot with regression line and 95% PI, G versus Income. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%]
Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.
Regression Analysis: G versus Income
The regression equation is
G = 1.93 + 0.000179 Income
Predictor   Coef         SE Coef      T       P
Constant    1.9280       0.1651       11.68   0.000
Income      0.00017897   0.00000934   19.17   0.000
S = 0.370241 R-Sq = 88.0% R-Sq(adj) = 87.8%
How to predict G for 2017? You would need first to predict Income for 2017. How should we do that?
Data Trimming
Data > Subset Worksheet > Rows that match condition.
[Figure, first panel: fitted line plot, ln (US$) versus ln (SurfaceArea)]
ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
S = 1.10354 R-Sq = 33.4% R-Sq(adj) = 33.2%
All 430 Sales: 5.290 + 1.326 log area
[Figure, second panel: fitted line plot, ln (US$) versus ln (SurfaceArea)]
ln (US$) = 3.068 + 1.662 ln (SurfaceArea)
S = 1.09636 R-Sq = 17.8% R-Sq(adj) = 17.6%
377 Sales of area 403.4 < area < 2981.0 (log > 6 and < 8): 3.068 + 1.662 log area
The sample is restricted to particular values of X: area between 403 and 2981. Trimming is generally benign, but the regression should be understood to apply to the specified range of x. The trimming is based on a variable not related to the underlying noise in Y.
Truncation
[Figure, first panel: fitted line plot, ln (US$) versus ln (SurfaceArea)]
ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
S = 1.10354 R-Sq = 33.4% R-Sq(adj) = 33.2%
Entire Sample: 5.290 + 1.326 log Area
[Figure, second panel: fitted line plot, ln (US$) versus ln (SurfaceArea)]
ln (US$) = 11.44 + 0.3821 ln (SurfaceArea)
S = 0.487426 R-Sq = 5.9% R-Sq(adj) = 5.4%
Subsample: 500,000 < Price < 3,000,000: 11.44 + 0.3821 log Area
Truncation based on the values of the dependent variable is VERY BAD. It reduces and sometimes destroys the relationship. This is one reason we resist removing “outliers” from the sample.
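The contrast between trimming on x and truncating on y can be reproduced with simulated data. The sketch below does not use the Monet sales. It generates artificial points from a line loosely patterned on the full-sample fit (intercept 5.29, slope 1.326, noise standard deviation 1.1), with an assumed normal distribution for log area, then refits the slope on a sample trimmed on x and on a sample truncated on y. All names and distributional choices are assumptions made for illustration.

```python
import math
import random

def ols_slope(points):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is repeatable

# Simulated data (assumed model, not the actual painting sales)
data = []
for _ in range(5000):
    x = rng.gauss(6.7, 0.8)                       # stand-in for ln area
    y = 5.29 + 1.326 * x + rng.gauss(0.0, 1.1)    # stand-in for ln price
    data.append((x, y))

# Trimming: select on the explanatory variable (6 < ln area < 8)
trimmed = [(x, y) for x, y in data if 6.0 < x < 8.0]

# Truncation: select on the dependent variable (500,000 < price < 3,000,000)
lo, hi = math.log(500_000), math.log(3_000_000)
truncated = [(x, y) for x, y in data if lo < y < hi]

slope_full = ols_slope(data)
slope_trimmed = ols_slope(trimmed)
slope_truncated = ols_slope(truncated)

print(f"full sample  (n={len(data)}): slope = {slope_full:.3f}")
print(f"trimmed on x (n={len(trimmed)}): slope = {slope_trimmed:.3f}")
print(f"truncated on y (n={len(truncated)}): slope = {slope_truncated:.3f}")
```

In this setup the trimmed slope stays near the value used to generate the data, while the truncated slope is pulled sharply toward zero. Selecting on y keeps the low-x points with unusually high noise and the high-x points with unusually low noise, which flattens the fitted line.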
Where Have We Been?
  Sample data – describing, display
  Probability models
    Models for random experiments
    Models for random processes underlying sample data
    Random variables
    Models for covariation of random variables
    Linear regression model for covariation of a pair of variables
Where Do We Go From Here?
  Simple linear regression
    Thus far, mostly a descriptive device
    Use for prediction and forecasting
    Yet to consider: Statistical inference, testing the relationship
  Multiple linear regression
    More than one variable to explain the variation of Y
    More elaborate model building