Notes 20: Specifying the Regression

Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Part 20: Aspects of Regression
Statistics and Data Analysis
Part 20 – Aspects of Regression
Regression Models
Using the regression model to predict the value of the dependent variable.
‘Cleaning’ the data to remove what look like extreme values.
  Trimming – removing values with extreme ‘x’
  Truncation – removing values with extreme ‘y’
Prediction


Use of the model for prediction
  Use “x” to predict y based on y = α + βx + ε
Sources of uncertainty
  Predicting “x” first
  Using sample estimates of α and β (and, possibly, σ)
  Can’t predict noise, ε
  Predicting outside the range of experience – uncertainty about the reach of the regression model.
Base Case Prediction


Predict y with a given value of x*:
  We would use the regression equation.
    True y = α + βx* + ε
    Since α and β must be estimated, the obvious estimate is ŷ = a + bx*
    We have no prediction for ε other than 0.
Sources of prediction error
  Can never predict ε at all
  The farther from the center of experience, the greater is the uncertainty.
A Prediction Interval
Prediction includes a range of uncertainty
Point estimate: \( \hat{y} = a + bx^* \)
The range of uncertainty around the prediction:
\[ a + bx^* \pm 1.96\, S_e \sqrt{1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}} \]
  1.96: the usual 95%
  The leading 1 under the square root: due to ε
  The remaining two terms under the square root: due to estimating α and β with a and b
(Remember the empirical rule, 95% of the distribution will be within two standard deviations.)
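The interval formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `prediction_interval` and the small data set are made up, and S_e is assumed to be the usual standard error of the regression with N - 2 degrees of freedom.

```python
import math

def prediction_interval(x, y, x_star, z=1.96):
    """Return (point, lower, upper) for predicting y at x_star.

    Hypothetical helper for illustration; fits y = a + b x by least squares.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Assumption: S_e uses N - 2 degrees of freedom.
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    point = a + b * x_star
    # 1 is the noise term; 1/n and the last term come from estimating a and b.
    half_width = z * s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    return point, point - half_width, point + half_width

# Made-up data, not from the course.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
center = prediction_interval(x, y, 3.5)   # x* at the sample mean of x
far = prediction_interval(x, y, 10.0)     # x* well outside the data
print(center)
print(far)
```

Comparing `center` with `far` shows the interval widening as x* moves away from the mean of x.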
Slightly Simpler Formula for Prediction
Prediction includes a range of uncertainty
Point estimate: \( \hat{y} = a + bx^* \)
The range of uncertainty around the prediction:
\[ a + bx^* \pm 1.96 \sqrt{S_e^2 \left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\, [SE(b)]^2} \]
Prediction from Internet Buzz Regression
Mean of Buzz = 0.48242
Max(Buzz)= 0.79
Prediction Interval for Buzz = .8
Predict Box Office for Buzz = .8
\[ a + bx = -14.36 + 72.72(.8) = 43.82 \]
\[ \sqrt{S_e^2 \left(1 + \frac{1}{N}\right) + (.8 - \overline{Buzz})^2\, SE(b)^2} = \sqrt{13.3863^2 \left(1 + \frac{1}{62}\right) + (.8 - .48242)^2\, 10.94^2} = 13.93 \]
Interval = 43.82 ± 1.96(13.93)
= 16.52 to 71.12
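The arithmetic on this slide can be checked with a short Python sketch that uses only the summary statistics shown (a, b, S_e, SE(b), N and the mean of Buzz). The function name `interval_from_summary` is a hypothetical helper introduced here for illustration.

```python
import math

def interval_from_summary(a, b, s_e, se_b, n, x_bar, x_star, z=1.96):
    """Prediction interval from regression summary statistics (illustrative helper)."""
    point = a + b * x_star
    # The "slightly simpler" formula: S_e^2 (1 + 1/N) + (x* - xbar)^2 SE(b)^2
    std_err = math.sqrt(s_e**2 * (1 + 1 / n) + (x_star - x_bar) ** 2 * se_b**2)
    return point, std_err, point - z * std_err, point + z * std_err

# Values as printed on the slide for the Internet Buzz regression.
point, std_err, lower, upper = interval_from_summary(
    a=-14.36, b=72.72, s_e=13.3863, se_b=10.94, n=62, x_bar=0.48242, x_star=0.8
)
print(round(point, 2), round(std_err, 2), round(lower, 2), round(upper, 2))
```

The limits computed this way can differ from the slide's 16.52 and 71.12 in the second decimal place, because the slide rounds the point prediction and the standard error before forming the interval.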
Predicting Using a Loglinear Equation

Predict the log first
  Prediction of the log
  Prediction interval – (Lower to Upper)
Prediction = exp(lower) to exp(upper)
This produces very wide intervals.
Interval Estimates for the Sample of
Signed Monet Paintings
[Figure: Fitted line plot with regression line and 95% PI, ln (US$) versus ln (SurfaceArea). ln (US$) = 2.825 + 1.725 ln (SurfaceArea); S = 1.00645, R-Sq = 20.0%, R-Sq(adj) = 19.8%]

Regression Analysis: ln (US$) versus ln (SurfaceArea)
The regression equation is
ln (US$) = 2.83 + 1.72 ln (SurfaceArea)

Predictor           Coef  SE Coef     T      P
Constant           2.825    1.285  2.20  0.029
ln (SurfaceArea)  1.7246   0.1908  9.04  0.000

S = 1.00645   R-Sq = 20.0%   R-Sq(adj) = 19.8%

Mean of ln (SurfaceArea) = 6.72918
Prediction for An Out
of Sample Monet
Claude Monet: Bridge Over a Pool of Water Lilies. 1899. Original, 36.5” x 29”.
\[ \ln Surface = \ln(36.5 \times 29) = 6.96461 \]
\[ \text{Prediction} = 2.83 + 1.72(6.96461) = 14.809 \]
\[ \text{Uncertainty} = \pm 1.96 \sqrt{1.00645^2 \left(1 + \frac{1}{328}\right) + (6.96461 - 6.72918)^2 (.1908)^2} \]
\[ = \pm 1.96 \sqrt{1.012942(1.003049) + (.23543)^2 (.1908)^2} \]
\[ = \pm 1.96(1.008984) \]
\[ = \pm 1.977608 \]
Prediction Interval = 14.809 ± 1.977608
= 12.83139 to 16.786608
Predicting y when the Model Describes log y
The interval predicts log price. What about the price?
Predicted Price: Mean = Exp(a + bx) = Exp(14.809) = $2,700,641.78
Upper Limit = Exp(14.809 + 1.9776) = $19,513,166.53
Lower Limit = Exp(14.809 - 1.9776) = $373,771.53
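As a rough check on the back-transformation, the following Python sketch exponentiates the log prediction and its limits, using the two numbers taken from the slide (14.809 and 1.9776). The variable names are illustrative only.

```python
import math

log_point = 14.809    # predicted ln(price), from the slide
half_width = 1.9776   # 1.96 times the prediction standard error, from the slide

price_point = math.exp(log_point)
price_lower = math.exp(log_point - half_width)
price_upper = math.exp(log_point + half_width)

# The interval is symmetric in logs, so in dollars both limits sit the same
# multiplicative factor away from the point prediction.
factor = math.exp(half_width)

print(f"point {price_point:,.2f}")
print(f"lower {price_lower:,.2f}  upper {price_upper:,.2f}")
print(f"factor {factor:.2f}")
```

Because the limits are the point prediction divided and multiplied by exp(1.9776), a factor of roughly 7, the dollar interval is very wide even though the log interval looks modest.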
Van Gogh: Irises
39.5 x 39.125. Prediction by our model = $17.903M
Painting is in our data set. Sold for 16.81M on 5/6/04
Sold for 7.729M on 2/5/01
Last sale in our data set was in May 2004
Record sale was 6/25/08, at the market peak, just before the crash.
Uncertainty in Prediction
The interval is narrowest at x* = x̄, the center of our experience.
The interval widens as we move away from the center of our experience to reflect the greater uncertainty.
\[ \pm 1.96 \sqrt{S_e^2 \left(1 + \frac{1}{N}\right) + (x^* - \bar{x})^2\, (SE(b))^2} \]
  (1) Uncertainty about the prediction of x
  (2) Uncertainty that the linear relationship will continue to exist as we move farther from the center.
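To see how the half-width of the interval grows with distance from the center, here is a small Python sketch using the summary statistics of the signed Monet regression reported in these notes (S = 1.00645, SE(b) = 0.1908, N = 328, mean ln area = 6.72918). The function name and the choice of x* values are illustrative assumptions.

```python
import math

# Summary statistics of the Monet regression, as reported in these notes.
S_E, SE_B, N, X_BAR = 1.00645, 0.1908, 328, 6.72918

def half_width(x_star, z=1.96):
    """Half-width of the 95% prediction interval at x_star (illustrative helper)."""
    return z * math.sqrt(S_E**2 * (1 + 1 / N) + (x_star - X_BAR) ** 2 * SE_B**2)

# x* at the sample mean, at a typical canvas, and progressively farther out.
widths = {x_star: half_width(x_star) for x_star in (X_BAR, 6.96461, 8.0, 9.484)}
for x_star, hw in widths.items():
    print(f"x* = {x_star:.3f}  half-width = {hw:.3f}")
```

The smallest half-width occurs at x* equal to the mean. This calculation captures only the estimation part of the uncertainty; it says nothing about whether the linear relationship still holds far from the center.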
http://www.nytimes.com/2006/05/16/arts/design/16oran.html
"Morning", Claude Monet, 1920-1926, oil on canvas, 200 x 425 cm, Musée de l'Orangerie, Paris, France. Left panel.
[Figure: the panel, 167” (13 feet 11 inches) wide by 78.74” (6 feet 7 inches) high, pictured with a 32.1” (2 feet 8 inches) by 26.2” (2 feet 2.2”) canvas.]
Predicted Price for a Huge Painting
Regression Equation: ln $ = 2.825 + 1.725 ln Surface Area
Width = 167 Inches
Height = 78.74 Inches
Area = 13,149.58 Square inches, ln = 9.484
Predicted ln Price = 2.825 + 1.725 (9.484) = 19.185
Predicted Price = exp(19.185) = $214,785,473.40
Prediction Interval for Price
Prediction Interval for ln Price is
\[ \text{Predicted ln Price} \pm 1.96 \sqrt{S_e^2 \left(1 + \frac{1}{N}\right) + \left(\ln Area^* - \overline{\ln Area}\right)^2 SE^2(b)} \]
ln Area* = ln (167 × 78.74) = 9.484
Mean of ln Area = 6.72918 (computed from the data)
Se = 1.00645 (from regression results)
SE(b) = 0.1908
\[ 19.185 \pm 1.96 \sqrt{(1.00645)^2 \left(1 + \frac{1}{328}\right) + (9.484 - 6.72918)^2 (.1908)^2} \]
19.185 ± 2.228 = [16.957 to 21.413]
Predicted Price = exp(16.957) to exp(21.413) = $23,138,304 to $1,993,185,600
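The whole chain, from canvas dimensions to a dollar interval, might be sketched as follows. The helper `price_interval` is hypothetical; it uses the rounded coefficients 2.825 and 1.725 and the summary statistics shown on the slides, so results agree with the slide only up to rounding.

```python
import math

A, B = 2.825, 1.725                  # rounded coefficients, as on the slide
S_E, SE_B, N = 1.00645, 0.1908, 328  # from the regression results
MEAN_LN_AREA = 6.72918               # computed from the data, per the slide

def price_interval(width_in, height_in, z=1.96):
    """Dollar prediction interval from canvas size in inches (illustrative helper)."""
    ln_area = math.log(width_in * height_in)
    ln_price = A + B * ln_area
    hw = z * math.sqrt(
        S_E**2 * (1 + 1 / N) + (ln_area - MEAN_LN_AREA) ** 2 * SE_B**2
    )
    # Build the interval in logs, then exponentiate each limit.
    return math.exp(ln_price - hw), math.exp(ln_price), math.exp(ln_price + hw)

lower, point, upper = price_interval(167, 78.74)
print(f"${lower:,.0f} to ${upper:,.0f} (point ${point:,.0f})")
```

The painting's ln area of 9.484 lies far above the sample mean of 6.72918, so this is an extrapolation well outside the range of experience and the interval should be read with that in mind.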
Use the Monet Model to Predict a Price for a Dali?
[Figure: Hallucinogenic Toreador, 157” (13 feet 1 inch) by 118” (9 feet 10 inches), pictured with an average sized Monet, 32.1” (2 feet 8 inches) by 26.2” (2 feet 2.2”).]
Forecasting Out of Sample
[Figure: Fitted line plot with regression line and 95% PI, G versus Income. G = 1.928 + 0.000179 Income; S = 0.370241, R-Sq = 88.0%, R-Sq(adj) = 87.8%. Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.]

Regression Analysis: G versus Income
The regression equation is
G = 1.93 + 0.000179 Income

Predictor        Coef     SE Coef      T      P
Constant       1.9280      0.1651  11.68  0.000
Income     0.00017897  0.00000934  19.17  0.000

S = 0.370241   R-Sq = 88.0%   R-Sq(adj) = 87.8%

How to predict G for 2017? You would need first to predict Income for 2017. How should we do that?
Data Trimming
DataSubset Worksheet 
Rows that match condition.
Fitted Line Plot
Fitted Line Plot
ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
ln (US$) = 3.068 + 1.662 ln (SurfaceArea)
18
S
R-Sq
R-Sq(adj)
17
18
1.10354
33.4%
33.2%
16
1.09636
17.8%
17.6%
16
15
15
ln (US$)
ln (US$)
S
R-Sq
R-Sq(adj)
17
14
13
12
14
13
11
12
10
11
9
10
3
4
5
6
7
ln (SurfaceArea)
All 430 Sales:
4.290 + 1.326 log area
8
9
6.0
6.2
6.4
6.6
6.8
7.0
ln (SurfaceArea)
7.2
7.4
7.6
377 Sales of area 403.4 < area < 2981.0
(log > 6 and < 8)
3.068 + 1.662 log area
The sample is restricted to particular values of X – area between 403 and
2981. Trimming is generally benign, but the regression should be
understood to apply to the specified range of x. The trimming is based on a
variable not related to the underlying noise in Y.
Truncation
[Figure: two fitted line plots of ln (US$) versus ln (SurfaceArea).]

Entire Sample:
ln (US$) = 5.290 + 1.326 ln (SurfaceArea)
S = 1.10354   R-Sq = 33.4%   R-Sq(adj) = 33.2%

Subsample: 500,000 < Price < 3,000,000
ln (US$) = 11.44 + 0.3821 ln (SurfaceArea)
S = 0.487426   R-Sq = 5.9%   R-Sq(adj) = 5.4%

Truncation based on the values of the dependent variable is VERY BAD. It reduces and sometimes destroys the relationship. This is one reason we resist removing “outliers” from the sample.
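A small simulation can illustrate the contrast between selecting on x and selecting on y. The sketch below uses simulated data, not the Monet sample; the true slope of 1.3, the noise level, the cutoffs and the helper `ols_slope` are all assumptions chosen for illustration.

```python
import random

def ols_slope(pairs):
    """Least squares slope for a list of (x, y) pairs (illustrative helper)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    return sxy / sxx

rng = random.Random(20)

# Simulated data, NOT the Monet sample: y = 5 + 1.3 x + noise.
data = []
for _ in range(5000):
    x = rng.uniform(3.0, 9.0)
    data.append((x, 5.0 + 1.3 * x + rng.gauss(0.0, 1.0)))

full = ols_slope(data)
# Trimming: keep observations by their x values only.
trimmed = ols_slope([(x, y) for x, y in data if 6.0 < x < 8.0])
# Truncation: keep observations by their y values.
truncated = ols_slope([(x, y) for x, y in data if 13.0 < y < 15.0])

print(f"full {full:.3f}  trimmed {trimmed:.3f}  truncated {truncated:.3f}")
```

In this setup the trimmed slope stays near the true value, with more sampling noise because fewer observations remain, while the truncated slope is pulled toward zero because the selection rule depends on the noise in y.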
Where Have We Been?


Sample data – describing, display
Probability models
  Models for random experiments
  Models for random processes underlying sample data
  Random variables
  Models for covariation of random variables
  Linear regression model for covariation of a pair of variables
Where Do We Go From Here?

Simple linear regression
  Thus far, mostly a descriptive device
  Use for prediction and forecasting
  Yet to consider: Statistical inference, testing the relationship
Multiple linear regression
  More than one variable to explain the variation of Y
  More elaborate model building