Notes 22: Multiple Regression Part 2

Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Part 22: Multiple Regression – Part 2
Statistics and Data Analysis
Part 22 – Multiple Regression: 2
Multiple Regression Models
Using Minitab To Compute A Multiple Regression
Basic Multiple Regression
Using Binary Variables
Logs and Elasticities
Trends in Time Series Data
Using Quadratic Terms to Improve the Model
Mini-seminar: Cost benefit test with a dynamic model
Application: WHO

Data Used in Assignment 1: WHO data on 191 countries in 1995-1999.
Analysis of Disability Adjusted Life Expectancy = DALE
EDUC = average years of education
PCHexp = Per capita health expenditure
DALE = α + β1 EDUC + β2 PCHexp + ε
The (Famous) WHO Data
Specify the Variables in the Model
Graphs
Regression Results
Practical Model Building
Understanding the regression: The left out variable problem
Using different kinds of variables
  Dummy variables
  Logs
  Time trend
  Quadratic
A Fundamental Result
What happens when you leave a crucial variable out of your model? Nothing good.

Regression Analysis: g versus GasPrice (no income)
The regression equation is
g = 3.50 + 0.0280 GasPrice

Predictor   Coef        SE Coef     T       P
Constant    3.4963      0.1678      20.84   0.000
GasPrice    0.028034    0.002809     9.98   0.000

Regression Analysis: G versus GasPrice, Income
The regression equation is
G = 0.134 - 0.00163 GasPrice + 0.000026 Income

Predictor   Coef         SE Coef      T       P
Constant    0.13449      0.02081       6.46   0.000
GasPrice    -0.0016281   0.0004152    -3.92   0.000
Income      0.00002634   0.00000231   11.43   0.000
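The sign flip on GasPrice above can be reproduced on made-up data. The sketch below is only an illustration: the data are simulated (not the gasoline data), the "true" coefficients are invented, and the helper name ols is hypothetical. Price is built to move with income, so leaving income out pushes the price slope from negative to positive.

```python
import numpy as np


def ols(y, *regressors):
    """Least-squares coefficients; the constant term comes first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simulated data (an assumption for illustration, not the slide's data).
rng = np.random.default_rng(0)
n = 500
income = rng.normal(10.0, 2.0, n)
price = 0.8 * income + rng.normal(0.0, 1.0, n)   # price rises with income
# Invented "true" model: consumption falls with price, rises with income.
g = 1.0 - 0.5 * price + 1.0 * income + rng.normal(0.0, 0.5, n)

short = ols(g, price)           # income left out
full = ols(g, price, income)    # income included

print("price slope, income omitted: ", round(short[1], 3))
print("price slope, income included:", round(full[1], 3))
```

In the short regression the price coefficient picks up part of the income effect, which is the same pattern as in the two Minitab outputs.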
Using Dummy Variables
Dummy variable = binary variable = a variable that takes values 0 and 1.
E.g., OECD life expectancies compared to the rest of the world:
DALE = α + β1 EDUC + β2 PCHexp + β3 OECD + ε
OECD countries: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Japan, Korea, Luxembourg, Mexico, The Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Spain, Sweden, Switzerland, Turkey, United Kingdom, United States.
OECD Life Expectancy
According to these results, after accounting for education and health expenditure differences, people in the OECD countries have a life expectancy that is 1.191 years shorter than people in other countries.
A Binary Variable in Regression
The regression shifts down by 1.191 years for the OECD countries.
We set PCHExp to 1000, approximately the sample mean.
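A small sketch of what "shifts down" means. Only the OECD coefficient (-1.191) comes from the slides; the constant and the EDUC and PCHexp coefficients below are hypothetical placeholders, chosen just to make the function runnable. Whatever those values are, the gap between the two lines is the same at every level of education.

```python
BETA_OECD = -1.191   # from the slide

# Hypothetical placeholder values, NOT the estimates from the slides.
ALPHA, BETA_EDUC, BETA_PCHEXP = 40.0, 3.0, 0.002


def predicted_dale(educ, pchexp, oecd):
    """Fitted DALE for given education, health spending and OECD dummy (0 or 1)."""
    return ALPHA + BETA_EDUC * educ + BETA_PCHEXP * pchexp + BETA_OECD * oecd


# PCHexp held at 1000, as on the slide; compare OECD = 1 with OECD = 0.
gaps = []
for educ in (4, 8, 12):
    gap = predicted_dale(educ, 1000, 1) - predicted_dale(educ, 1000, 0)
    gaps.append(gap)
    print(educ, round(gap, 3))
```

The dummy variable changes the intercept only, so the two fitted lines are parallel.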
Academic Reputation
Dummy Variable in a Log Regression
E.g., Monet’s signature equation
log Price = α + β1 log Area + β2 Signed
Unsigned: PriceU = exp(α) Area^β1
Signed: PriceS = exp(α) Area^β1 exp(β2)
Signed/Unsigned = exp(β2)
%Difference = 100%(Signed - Unsigned)/Unsigned = 100%[exp(β2) – 1]
The Signature Effect: 253%
100%[exp(1.2618) – 1] = 100%[3.532 – 1] = 253.2 %
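The arithmetic behind the 253% figure, as a short check. The coefficient 1.2618 is from the slide; the function name pct_effect_of_dummy is a hypothetical helper. Note that reading the coefficient directly as "126%" would understate the effect, because the coefficient is this large.

```python
import math


def pct_effect_of_dummy(beta):
    """Percent difference implied by a dummy coefficient in a log regression."""
    return 100.0 * (math.exp(beta) - 1.0)


beta_signed = 1.2618                          # coefficient on Signed, from the slide
ratio = math.exp(beta_signed)                 # signed price / unsigned price
effect = pct_effect_of_dummy(beta_signed)     # percent difference

print(round(ratio, 3), round(effect, 1))
```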
Monet Paintings in Millions
[Figure: Scatterplot of Price vs Square Inches, points marked by Signed (0 or 1). Annotation: Difference is about 253%]
Predicted Price is exp(4.122 + 1.3458*logArea + 1.2618*Signed) / 1000000
Logs in Regression
Elasticity




The coefficient on log(Area) is 1.346.
For each 1% increase in area, price goes up by 1.346%, even accounting for the signature effect.
The elasticity is +1.346.
Remarkable. Not only does price increase with area, it increases much faster than area.
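A numerical check of the elasticity, using the predicted price equation given with the Monet scatterplot. The starting area of 1000 square inches is an arbitrary illustration, and the function name predicted_price is hypothetical. The result is the same for signed and unsigned paintings, since the signature term cancels in the ratio.

```python
import math


def predicted_price(area, signed):
    """Predicted price in dollars from the fitted log equation on the slide."""
    return math.exp(4.122 + 1.3458 * math.log(area) + 1.2618 * signed)


area = 1000.0                                   # arbitrary starting size
base = predicted_price(area, signed=1)
bigger = predicted_price(area * 1.01, signed=1)  # 1% more area

pct_change = 100.0 * (bigger / base - 1.0)
print(round(pct_change, 2))   # close to the coefficient on log(Area)
```

The small gap between this number and the coefficient is the usual approximation error in reading a log coefficient as a percentage change.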
Monet: By the Square Inch
[Figure: Scatterplot of Price vs Area]
Logs and Elasticities
Theory: When the variables are in logs, a change in log x is approximately the proportional change in x, so 100 × (change in log x) ≈ % change in x.
log y = α + β1 log x1 + β2 log x2 + … + βK log xK + ε
Elasticity = βk
Elasticities
Price elasticity = -0.02070
Income elasticity = +1.10318
A Set of Dummy Variables
A complete set of dummy variables divides the sample into groups.
Fit the regression with “group” effects.
Need to drop one (any one) of the variables to compute the regression. (Avoid the “dummy variable trap.”)
Rankings of 132 U.S. Liberal Arts Colleges
Reputation = α + β1 Religious + β2 GenderEcon + β3 EconFac + β4 North + β5 South + β6 Midwest + β7 West + ε
Nancy Burnett: Journal of Economic Education, 1998
Minitab does not like this model.
Too many dummy variables

If we use all four region dummies, a is redundant:
Reputation = a + bn + … if north
Reputation = a + bm + … if midwest
Reputation = a + bs + … if south
Reputation = a + bw + … if west
Only three are needed, so Minitab dropped west:
Reputation = a + bn + … if north
Reputation = a + bm + … if midwest
Reputation = a + bs + … if south
Reputation = a + … if west
Why did it drop West and not one of the others? It doesn’t matter which one is dropped. Minitab picked the last.
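Why the software objects can be seen from the data matrix itself. The sketch below uses a made-up sample of 20 colleges (five per region) and NumPy rather than Minitab. With a constant and all four region dummies, the four dummy columns add up to the constant column, so one column carries no new information.

```python
import numpy as np

# Made-up sample: five colleges in each region (illustration only).
names = ["north", "midwest", "south", "west"]
regions = np.array(names * 5)

dummies = np.column_stack([(regions == r).astype(float) for r in names])
const = np.ones((len(regions), 1))

X_all = np.hstack([const, dummies])          # constant + 4 dummies: 5 columns
X_drop = np.hstack([const, dummies[:, :3]])  # constant + 3 dummies (west dropped)

rank_all = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)

print(X_all.shape[1], rank_all)    # more columns than independent columns
print(X_drop.shape[1], rank_drop)  # columns and rank agree
```

Dropping any one of the four dummies (not only west) restores full column rank; the omitted region becomes the comparison group.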
Unordered Categorical Variables
House price data (fictitious)
Style 1 = Split level
Style 2 = Ranch
Style 3 = Colonial
Style 4 = Tudor
Use 3 dummy variables for this kind of data. (Not all 4)
Using variable STYLE in the model makes no sense. You could change the numbering scale any way you like. 1,2,3,4 are just labels.
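One way to turn the STYLE codes into three dummy variables, sketched in NumPy. The seven style codes below are made up for illustration; Split Level (style 1) is treated as the omitted category, which is an assumption made here for the example.

```python
import numpy as np

# Made-up STYLE codes for seven houses (illustration only).
style = np.array([1, 2, 3, 4, 2, 1, 3])

# Style 1 (Split level) is omitted; each remaining code gets its own 0/1 column.
labels = {2: "Ranch", 3: "Colonial", 4: "Tudor"}
dummies = {name: (style == code).astype(int) for code, name in labels.items()}

for name, column in dummies.items():
    print(name, column.tolist())
```

A Split Level house has zeros in all three columns, so each dummy coefficient measures a difference from a Split Level.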
Transform Style to Types
House Price Regression
Each of these is relative to a Split Level, since that is the omitted category. E.g., the price of a Ranch house is $74,369 less than a Split Level of the same size with the same number of bedrooms.
Better Specified House Price Model Using Logs
Time Trends in Regression
y = α + β1x + β2t + ε
β2 is the year to year increase not explained by anything else.
log y = α + β1 log x + β2t + ε (not log t, just t)
100β2 is the year to year % increase not explained by anything else.
Time Trend in Multiple Regression
After accounting for income, the price of gasoline, and the price of new cars, per capita gasoline consumption falls by 1.25% per year. I.e., if income and the prices were unchanged, consumption would fall by 1.25% per year. This is probably the effect of improved fuel efficiency.
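A quick look at how a trend coefficient in a log regression translates into percentage changes. The value -0.0125 is an assumption inferred from the "1.25% per year" statement; the exact coefficient is in the regression output, which is not reproduced in these notes.

```python
import math

beta_t = -0.0125   # assumed trend coefficient, inferred from "1.25% per year"

approx_pct = 100.0 * beta_t                          # the 100*beta rule of thumb
exact_pct = 100.0 * (math.exp(beta_t) - 1.0)         # exact one-year change
ten_year_pct = 100.0 * (math.exp(10 * beta_t) - 1.0)  # compounded over 10 years

print(round(approx_pct, 3), round(exact_pct, 3), round(ten_year_pct, 2))
```

For a coefficient this small the rule of thumb and the exact figure are nearly the same; the difference matters only for large coefficients or long horizons.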
A Quadratic Income vs. Age Regression
+----------------------------------------------------+
| LHS=HHNINC  Mean                 = .3520836        |
|             Standard deviation   = .1769083        |
| Model size  Parameters           = 3               |
|             Degrees of freedom   = 27323           |
| Residuals   Sum of squares       = 794.9667        |
|             Standard error of e  = .1705730        |
| Fit         R-squared            = .7040754E-01    |
+----------------------------------------------------+
+--------+--------------+-----------+
|Variable| Coefficient  | Mean of X |
+--------+--------------+-----------+
|Constant|  -.39266196  |           |
|AGE     |   .02458140  | 43.5256898|
|AGESQ   |  -.00027237  | 2022.85549|
|EDUC    |   .01994416  | 11.3206310|
+--------+--------------+-----------+
Note the coefficient on Age squared is negative. Age ranges from 25 to 65.
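With a negative coefficient on age squared, predicted income rises with age, peaks, and then declines. The peak is at age = -b(AGE) / (2 b(AGESQ)). The sketch below uses the coefficients from the table; holding EDUC at its sample mean is a choice made here for illustration, and the function name predicted_income is hypothetical.

```python
# Coefficients and the EDUC mean are taken from the regression output above.
B_CONST, B_AGE, B_AGESQ, B_EDUC = -0.39266196, 0.02458140, -0.00027237, 0.01994416
MEAN_EDUC = 11.3206310


def predicted_income(age, educ=MEAN_EDUC):
    """Fitted HHNINC at a given age, with education held fixed."""
    return B_CONST + B_AGE * age + B_AGESQ * age ** 2 + B_EDUC * educ


# Setting the derivative B_AGE + 2*B_AGESQ*age to zero gives the peak age.
peak_age = -B_AGE / (2.0 * B_AGESQ)

print(round(peak_age, 1))
for age in (25, 45, 65):
    print(age, round(predicted_income(age), 4))
```

The peak falls inside the observed age range of 25 to 65, so the fitted curve bends over within the sample.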
Implied By The Model
Careful: This shows the incomes of people of different ages, not the path of income of a particular person at different ages.
Candidate Models for Cost
The quadratic equation is the appropriate model.
log c = a + b1 log q + b2 (log q)^2 + e
A Better Model?
Log Cost = α + β1 logOutput + β2 [logOutput]^2 + ε
Case Study Using A Regression Model: A Huge Sports Contract
Alex Rodriguez hired by the Texas Rangers for something like $25 million per year in 2000.
Costs – the salary plus and minus some fine tuning of the numbers
Benefits – more fans in the stands.
How to determine if the benefits exceed the costs? Use a regression model.
The Texas Deal for Alex Rodriguez
Year     Amount ($M)
2001     Signing Bonus = 10M
2001     21
2002     21
2003     21
2004     21
2005     25
2006     25
2007     27
2008     27
2009     27
2010     27
Total:   $252M ???
The Real Deal
Year   Salary   Bonus   Deferred Salary
2001   21       2       5 to 2011
2002   21       2       4 to 2012
2003   21       2       3 to 2013
2004   21       2       4 to 2014
2005   25       2       4 to 2015
2006   25               4 to 2016
2007   27               3 to 2017
2008   27               3 to 2018
2009   27               3 to 2019
2010   27               5 to 2020
Deferrals accrue interest of 3% per year.
Costs
Insurance: About 10% of the contract per year
(Taxes: About 40% of the contract)
Some additional costs in revenue sharing revenues from the league (anticipated, about 17.5% of marginal benefits – uncertain)
Interest on deferred salary - $150,000 in first year, well over $1,000,000 in 2010.
(Reduction) $3M it would cost to have a different shortstop. (Nomar Garciaparra)
PDV of the Costs
Using an 8% discount rate
Accounting for all costs including insurance
Roughly $21M to $28M in each year from 2001 to 2010, then the deferred payments from 2010 to 2020
Total costs: About $165 Million in 2001 (Present discounted value)
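A minimal sketch of the present discounted value calculation itself. It discounts only the headline salary stream (plus the signing bonus) at 8%, treating 2001 as the undiscounted base year; both the timing convention and the function name present_value are assumptions made here. It leaves out the deferrals, annual bonuses, insurance and other adjustments listed above, so it is not expected to reproduce the $165 million figure.

```python
def present_value(payments, rate):
    """PDV of a stream of annual payments; the first payment is not discounted."""
    return sum(p / (1.0 + rate) ** t for t, p in enumerate(payments))


# Headline salaries in $M for 2001..2010, from the slide.
salaries = [21, 21, 21, 21, 25, 25, 27, 27, 27, 27]
signing_bonus = 10   # $M, paid up front

nominal_total = signing_bonus + sum(salaries)
pdv_total = signing_bonus + present_value(salaries, 0.08)

print(nominal_total)          # the headline contract value in $M
print(round(pdv_total, 1))    # smaller, because later dollars count for less
```

The point of the exercise is only that the headline total overstates the cost in 2001 dollars once later payments are discounted.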
Benefits
More fans in the seats
  Gate
  Parking
  Merchandise
Increased chance at playoffs and world series
Sponsorships
(Loss to revenue sharing)
Franchise value
How Many New Fans?
Projected 8 more wins per year.
What is the relationship between wins and attendance?
  Not known precisely
  Many empirical studies (The Journal of Sports Economics)
  Use a regression model to find out.
Baseball Data
31 teams, 17 years (fewer years for 6 teams)
Winning percentage: Wins = 162 * percentage
Rank
Average attendance. Attendance = 81*Average
Average team salary
Number of all stars
Manager years of experience
Percent of team that is rookies
Lineup changes
Mean player experience
Dummy variable for change in manager
Baseball Data
(Panel Data – 31 Teams, 17 Years)
A Regression Model
Attendance(team, this year) = α(team)
  + γ Attendance(team, last year)
  + β1 Wins(team, this year)
  + β2 Wins(team, last year)
  + β3 All_Stars(team, this year)
  + ε(team, this year)
A Dynamic Equation
y(this year) = f[y(last year)…]
Fans(t) = a + bWins(t) + cFans(t-1) + ε(t)   (cFans(t-1) is the loyalty effect)
Suppose Fans(0) = Fans0 (start observing in a base year).
Suppose we fix Wins(t) at some Wins* and ε at 0 (no information).
What values does Fans(t) take in a sequence of years?
Fans(1) = a + bWins* + cFans0
Fans(2) = a + bWins* + c(a + bWins* + cFans0)
Fans(3) = a + bWins* + c(a + bWins* + c(a + bWins* + cFans0))
Fans(4) = a + bWins* + c(a + bWins* + c(a + bWins* + c(a + bWins* + cFans0)))
etc.
Collect terms: Fans(t) = a(1 + c + c^2 + ... + c^(t-1)) + bWins*(1 + c + c^2 + ... + c^(t-1)) + c^t Fans0
Suppose 0 < c < 1. Then c^t goes to 0 and 1 + c + c^2 + ... goes to 1/(1-c).
Fans finally settles down at Fans* = a/(1-c) + [b/(1-c)] Wins*.
dFans*/dWins* = b/(1-c)
Example: Fans(t) = 227969 + .6Fans(t-1) + 11093Wins
Wins = 85/year
Fans(1990) = 2,500,000
Fans(1991) = 227969 + .6(2.5M) + 11093(85) = 2.671M
Fans(1992) = ..... = 2.773M
...
Fans(2006) and years after = 2.93M (This is the 'equilibrium')
Example: Fans(t) = 227969 + .6Fans(t-1) + 11093Wins
Wins = 85/year to 2000, 93/year 2001...
Fans(1990) = 2,500,000
Fans(1991) = 2.671M
Fans(1992 to 2000) about 2.9M
...
Fans(2001) and years after about 3.15M (This is the 'equilibrium')
The 8 additional wins per year add about 220,000 fans.
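The two examples above can be traced year by year with a short simulation. The coefficients and starting attendance are from the slides; the function names simulate and equilibrium are hypothetical, and the error term is set to zero as in the derivation.

```python
A, B, C = 227969.0, 11093.0, 0.6   # Fans(t) = A + C*Fans(t-1) + B*Wins


def simulate(fans0, wins_by_year):
    """Attendance path, one entry per year, starting from fans0."""
    path, fans = [], fans0
    for wins in wins_by_year:
        fans = A + C * fans + B * wins
        path.append(fans)
    return path


def equilibrium(wins):
    """Long-run attendance when wins stay fixed: (A + B*wins) / (1 - C)."""
    return (A + B * wins) / (1.0 - C)


path85 = simulate(2_500_000, [85] * 16)   # 1991 through 2006 at 85 wins
eq85, eq93 = equilibrium(85), equilibrium(93)

print(round(path85[0]), round(path85[1]))   # 1991 and 1992
print(round(eq85), round(eq93), round(eq93 - eq85))
```

The gap between the two equilibria equals 8 * B / (1 - C), which is the long-run effect of 8 more wins per year in this simplified equation.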
Marginal Value of One More Win
Our Model is Fans(t) = α + β1Wins(t) + β2Wins(t-1) + β3AllStars + γFans(t-1)
Using the formula for the value of Fans*:
Fans* = (α + β1Wins* + β2Wins* + β3AllStars) / (1 - γ)
The effect of one more Win every year would be dFans*/dWins* = (β1 + β2) / (1 - γ)
The new player will definitely be an All Star, so we add this effect as well.
The effect of adding an All Star player to the team would be β3 / (1 - γ)
γ = .59414
β1 = 11093.7
β2 = 2201.2
β3 = 14593.5
Effect of 1 more win = (11093.7 + 2201.2) / (1 - .59414) = 32757
Effect of adding an All Star = 14593.5 / (1 - .59414) = 35957
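The two long-run effects can be checked directly. The sketch uses the estimates listed above, with γ = .59414, the value that appears in the slide's own arithmetic (1 - .59414); the variable names are chosen here for illustration.

```python
GAMMA = 0.59414                       # coefficient on last year's attendance
B1, B2, B3 = 11093.7, 2201.2, 14593.5  # wins (t), wins (t-1), all stars

# Long-run effects divide the short-run coefficients by (1 - gamma).
per_win = (B1 + B2) / (1.0 - GAMMA)
per_all_star = B3 / (1.0 - GAMMA)

print(round(per_win), round(per_all_star))
```

Because γ is about 0.6, each long-run effect is roughly 2.5 times the corresponding one-year effect.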
Marginal Value of an A Rod




(8 wins * 32,757 fans) + (1 All Star * 35,957 fans) = 298,016 new fans
298,016 new fans *
  $18 per ticket
  $2.50 parking etc. (Average. Most people don’t park)
  $1.80 stuff (hats, bobble head dolls,…)
= $6.67 Million per year !!!!!
It’s not close. (Marginal cost is at least $16.5M / year)
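A sketch of the benefit calculation from the regression estimates through to annual revenue. All inputs are from the slides; the variable names are chosen here. With the per-fan amounts exactly as listed ($18 + $2.50 + $1.80), the product comes to roughly $6.65 million, slightly below the $6.67 million shown, so the slide presumably used slightly different or unrounded per-fan figures.

```python
GAMMA = 0.59414
B1, B2, B3 = 11093.7, 2201.2, 14593.5

per_win = (B1 + B2) / (1.0 - GAMMA)     # long-run fans per extra win
per_all_star = B3 / (1.0 - GAMMA)       # long-run fans per extra All Star

extra_wins = 8                           # projected additional wins per year
new_fans = extra_wins * per_win + 1 * per_all_star

revenue_per_fan = 18.00 + 2.50 + 1.80    # ticket + parking etc. + merchandise
annual_revenue = new_fans * revenue_per_fan

marginal_cost = 16.5e6                   # "at least $16.5M / year" from the slide

print(round(new_fans))
print(round(annual_revenue / 1e6, 2), "million per year")
print(annual_revenue < marginal_cost)
```

Either way the conclusion on the slide stands: the estimated attendance benefit is well under half of the annual cost.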
Summary
Using Minitab To Compute a Regression
Building a Model
  Logs
  Dummy variables
  Qualitative variables
  Trends
  Effects across time
  Quadratic