Exercises: 3rd day in PC lab

Exercise 3
Third day in PC lab
- Wrap up from day 2
- Multivariate Regression Model
Production function and multicollinearity
• use <path\> cobb.dta
• describe
• Q = e^β1 · L^β2 · K^β3 · e^ε  ⇔  ln(Q) = β1 + β2·ln(L) + β3·ln(K) + ε
• generate logarithms of the variables
• estimate: regress lnQ lnL lnK
• interpret the coefficients
• high R2, but none of the variables is significant
• corr lnK lnL gives a correlation of 0.98 (extreme)
• corr K L is similarly high (this is usually the case; that is, taking logarithms does not change the correlation much)
• regress the model on sub-samples (if t>10, if t<25): the estimates differ greatly
• individual regressions of lnQ on lnK and of lnQ on lnL give very significant coefficients, which are much larger in magnitude, however, and are definitely biased (each variable also captures the effect of the other)
• collect more data? (this will probably not help: these are not primary data, so we cannot, except by waiting another 5 or 10 years)
• we could introduce non-sample information if we know that the production process exhibits constant economies of scale (the production elasticities sum to one)
• constraint 1 lnK+lnL=1
• cnsreg lnQ lnK lnL, constraint(1) (the command for constrained regression; see the do-file sketch after this list)
• the parameters are still not significant, but this is not crucial: if we know that constant scale economies occur, then this is a good estimate, since we know that both variables do play a significant role
• note that no R2 is displayed with constraints, because SST is no longer SSR + SSE
• constraint drop 1 (or constraint drop _all) if you do not need the constraints anymore
• run regress lnQ lnK lnL again
• just to show you the estimation without a constant term: regress lnQ lnK lnL, nocons
• see that this type of constraint also changes the estimates quite a bit; but look at the R2, which should not be reported, because it is wrong
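A minimal do-file sketch consolidating the steps above (the variable names Q, L and K are assumptions; check the actual names with describe):

	use cobb.dta, clear
	* logarithms of output and inputs (names Q, L, K assumed)
	generate lnQ = ln(Q)
	generate lnL = ln(L)
	generate lnK = ln(K)
	regress lnQ lnL lnK               // high R2, but insignificant t-values
	correlate lnL lnK                 // about 0.98: extreme collinearity
	constraint 1 lnL + lnK = 1        // constant returns to scale (non-sample information)
	cnsreg lnQ lnL lnK, constraint(1)
	constraint drop _all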
Beer data (Regression specification error test)
• use <path\> beer.dta
• demand for a good depends on own price, prices of substitutes, prices of other goods
and income
• we have data for the average per capita beer demand
• describe
• summarize and look at the data and the variation
• general model: q=f(pb, pl, pr, m)
• what are the expected signs?
• graph matrix q pb pl pr m (to see the whole matrix of two-way scatter plots)
• but you can easily be fooled, because elasticities are ceteris paribus, whereas the graphs do not control for changes in the other variables
• linear regression: regress q pb pl pr m
• see, for instance, that the marginal effect of income is positive, although the scatter plot rather indicated a negative relationship
• R2 is relatively high
• but is the model correctly specified in terms of variables and functional form?
• predict yroof
• RESET test: what do we have to do? (a consolidated sketch follows this list)
• gen yroof2=yroof^2 and gen yroof3=yroof^3
• regress q pb pl pr m yroof2 (the t-value is significant)
• regress q pb pl pr m yroof2 yroof3 (although the individual t-values are not significant, the joint F test is: test yroof2 yroof3)
• we have to reject H0 that there is no misspecification
• ovtest (after regressing the original model) is Stata's Ramsey RESET test, also labelled an "omitted variable test", but we know that it also picks up problems of functional form
• it also leads us to reject H0
• since we do not know what variables would be missing, there is possibly a problem with the functional form
• let us try the log-log model (which is popular)
• generate logs of all variables
• regress lnq lnpb lnpl lnpr lnm
• interpret the variables
• ovtest shows us that there is no misspecification anymore
• often results are not that clear-cut
• note that R2 would not have been a good indicator of correct or incorrect model specification (it is almost the same as in the previous linear model)
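The RESET steps as one do-file sketch (variable names as in the text):

	use beer.dta, clear
	regress q pb pl pr m
	predict yroof                        // fitted values
	gen yroof2 = yroof^2
	gen yroof3 = yroof^3
	regress q pb pl pr m yroof2          // t-test on yroof2
	regress q pb pl pr m yroof2 yroof3
	test yroof2 yroof3                   // joint F test on both powers
	quietly regress q pb pl pr m
	ovtest                               // built-in Ramsey RESET test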
Dummy variables
• use <path\> Soybean.dta
• describe
• soybean area: is there a clear time trend?
• regress Area t / predict yroof / scatter yroof Area t, c(l)
• additional information: in 1997 RR technology was introduced
• generate intercept and slope (interaction) dummies:
• gen RR=0
• replace RR=1 if Year>1996
• gen RR_t=RR*t
• if the dummies are not included, the estimator is biased (because we would assume that the parameters are the same before and after the RR introduction, when in fact they are not)
• regress Area t RR RR_t
• predict yhat
Area = β1 + β2·t + γ1·RR + γ2·RR_t
if RR = 0: Area = β1 + β2·t
if RR = 1: Area = (β1 + γ1) + (β2 + γ2)·t
• scatter yhat Area Year, c(l)
• this is nice to show in a simple regression model with only one regressor (the full sequence is sketched below)
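A sketch of the complete sequence (variable names Area, Year and t as in the text):

	use Soybean.dta, clear
	gen RR = 0
	replace RR = 1 if Year > 1996     // RR technology from 1997 onwards
	gen RR_t = RR*t                   // slope (interaction) dummy
	regress Area t RR RR_t
	predict yhat
	scatter yhat Area Year, c(l)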
House prices
• slope and interaction dummies and their interpretation
• use <path\> … (house price data set)
• describe
• hedonic house price equation: price = f(size, college, college_size, age, pool,
fireplace)
• regress price size age college pool fireplace college_size
• interpret slope and dummy variables, coefficients, R2 etc.
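The interaction term has to be generated before the regression; a one-line sketch, assuming the dummy is named college and size is continuous:

	gen college_size = college*size
	regress price size age college pool fireplace college_size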
Existence of qualitative effects?
(Restricted vs unrestricted model or, alternatively, Chow-test)
• use <path\> investment.dta
• describe
• investment function: invest = β1 + β2·value + β3·capital
• but the data over 20 years are pooled for two different electrical companies (dummy variable D: 1=Westinghouse, 0=General Electric)
• it is nice to use all 40 observations, but is this appropriate?
• Null Hypothesis: H0: the functions for both firms are identical, so the data can be
pooled
• generate set of slope dummies
• gen D_value=D*value
• gen D_capital=D*capital
1) Restricted vs unrestricted model
• Recall that the F-statistic is: F = [(SSER − SSEU)/J] / [SSEU/(T − K)]
• estimate the restricted model: regress invest value capital
• note down the sum of squared residuals = SS Residual = SSER
• estimate the unrestricted model: regress invest value capital D D_value D_capital
• note down the sum of squared residuals = SSEU
• construct the F-statistic: F = [(16563.00 − 14989.82)/3] / [14989.82/(40 − 6)] = 1.1894
• The α = .05 critical value Fc = 2.8826 comes from the F(3,34) distribution. Since F < Fc, we cannot reject the null hypothesis that the investment functions for General Electric and Westinghouse are identical.
• Therefore, pooling the data seems appropriate (without dummies)
• We can simplify the analysis by asking STATA to carry out the F test for joint
significance of the coefficients for D D_value and D_capital
• regress the unrestricted model: regress invest value capital D D_value D_capital
• test D D_value D_capital
• It is no coincidence that the F statistic is exactly equal to the one we calculated
manually before. The p-value of 0.3284 indicates that we cannot reject H0.
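The manual calculation can also be scripted with Stata's saved results; a sketch, starting from a fresh load of the data (e(rss) is the residual sum of squares left behind by regress):

	use investment.dta, clear
	gen D_value = D*value
	gen D_capital = D*capital
	quietly regress invest value capital
	scalar SSE_R = e(rss)
	quietly regress invest value capital D D_value D_capital
	scalar SSE_U = e(rss)
	scalar F = ((SSE_R - SSE_U)/3) / (SSE_U/(40 - 6))
	display F                          // about 1.19
	display invFtail(3, 34, 0.05)      // 5% critical value, about 2.88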
2) Chow test
• Alternatively, both sets of data can be estimated separately and SSE1 and SSE2 noted down. The same F statistic can be calculated, because SSEU = SSE1 + SSE2
• Then F = [(SSER − SSEU)/(DFR − DFU)] / (SSEU/DFU), with DFR = 40 − 3 = 37 and DFU = (20 − 3) + (20 − 3) = 34
• therefore the numerator df = 3 and the denominator df = 34
• this will result in exactly the same test statistic as above
• the advantage is that no dummy and interaction variables have to be generated (especially important in longer equations with many variables)
• regress invest value capital if D==1
• regress invest value capital if D==0 (add up the Residual SS; see the sketch after this list)
• the Chow test (and the F test in general) assumes that the MR assumptions hold for all observations, in particular homoskedasticity (equal error variances in both groups); otherwise the test fails
• since there is likely to be heteroskedasticity in the model, we have to use an alternative to the F-test, the so-called Wald test
• without going into the details, the Wald test is similar to the F-test, but it uses the correct variance-covariance matrix (White estimator)
• regress invest value capital D D_value D_capital, robust
• test D D_value D_capital. Now the p-value of 0.067 still tells us at the 5% level that pooling the data is okay, but the case is not as clear as before
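A sketch of the Chow variant using saved results (it yields the same F as before, since SSEU = SSE1 + SSE2):

	quietly regress invest value capital
	scalar SSE_R = e(rss)
	quietly regress invest value capital if D==0
	scalar SSE_1 = e(rss)
	quietly regress invest value capital if D==1
	scalar SSE_2 = e(rss)
	scalar F = ((SSE_R - (SSE_1 + SSE_2))/3) / ((SSE_1 + SSE_2)/34)
	display F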
Nonlinear models
• use <path\> EAF_technology.dta
• describe
• over the last 3 decades, the traditional technology for steel making, involving blast and oxygen furnaces and the use of iron ore, has been replaced by the newer electric arc furnace (EAF) technology, which utilizes scrap steel. Predictions about how fast this new technology is being adopted have implications for the suppliers of iron ore (mining companies) and of scrap steel
• the data display the share of EAF technology adoption in the US steel industry for the years 1970 to 1997
• look at the data: the share in 1970 and the share in 1997
Logistic growth curve
• y_t = α / (1 + exp(−β − δ·t)) + e_t
• α is the maximum adoption or saturation point (which we do not know)
• δ controls the speed; β determines how far below saturation the share is at time 0
• −β/δ is the point of inflection (half saturation)
• the relationship cannot be estimated by OLS, but by non-linear least squares
• programming is required
• in the head menu: select "Window", choose "Do-file Editor", "New Do-file"
• enter the following program:

	program nllogistic                    // the name we give the program; it has to start with "nl"
	version 8.0
	if "`1'"=="?" {                       // always the program start, to put the parameters into macro S_1
		global S_1 "A B D"                // declare the parameters to be estimated
		global A=0                        // give them initial values to start the iteration
		global B=0
		global D=0
		exit                              // exit the parameter definition
	}                                     // do not forget the closing bracket
	replace `1' = $A/(1+exp(-$B-$D*t))    // the function to be estimated (note the dollar signs)
	end
• save your file under the name "logistic.do"
• if you have another function, just replace the parameter definitions, the initial values and the functional form
• the program has to be put into memory before the next command can use it
• in the head menu: select "File", choose "Do…" and select your file logistic.do
• nl logistic eaf_share (only the dependent variable has to be specified, because the regressors are defined in the program)
• predict yhat
• scatter yhat t, c(l)
• be careful with the R2; other software does not report it at all for non-linear models
• to change the starting values: nl logistic eaf_share, init(A=-2) or init(A=2, B=6…)
• program drop nllogistic
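In newer Stata releases (version 9 and later), the same model can be fitted inline without writing a program file; a sketch, where {A=1} sets the starting value of α:

	nl (eaf_share = {A=1}/(1 + exp(-{B=0} - {D=0}*t)))
	predict yhat2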
Heteroskedasticity
Detecting heteroskedasticity
• use <path\> foodexp.dta
• regress foodexp income
• predict residual, residuals
• predict yhatlin
• scatter residual income
• scatter yhatlin foodexp income, c(l)
GQ Test by hand
• gen inc_group=1 if income>712 (median)
• replace inc_group=2 if missing(inc_group)
• regress foodexp income if inc_group==1 (note the Residual SS; var1 = SSE1/(T1 − K))
• regress foodexp income if inc_group==2 (note the Residual SS; var2 = SSE2/(T2 − K))
• calculate GQ = var1/var2
• The disadvantage: Here you need to know with respect to what variable
heteroskedasticity occurs. This is obvious in the simple regression model, but not
always in multiple regression.
• an alternative is:
• the GQ test does not seem to be included in STATA. Instead, the Cook-Weisberg test, which is the same as the Breusch-Pagan test, is included; it tests whether t = 0 in Var(e) = σ²·exp(x·t)
• if t = 0, the variance of the error term is homoskedastic (it does not depend on x)
• the STATA command is: hettest x (in our case hettest income)
• if income is not specified (just hettest) in a multiple regression model, the fitted values are used
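The GQ test by hand can also be scripted; a sketch using saved results (712 is the median income from the text; e(rss) and e(df_r) are the residual SS and its degrees of freedom):

	use foodexp.dta, clear
	quietly regress foodexp income if income > 712
	scalar var1 = e(rss)/e(df_r)
	quietly regress foodexp income if income <= 712
	scalar var2 = e(rss)/e(df_r)
	scalar GQ = var1/var2
	display GQ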
Heteroskedastic partition
• use <path\> wheat.dta
• describe
• 26 observations on wheat supply, price and time (Australian wheat growing district)
• q = f(price, technology, weather); t is a proxy for technology; there are no weather data, so weather is part of e
• q = β1 + β2·p + β3·t + e
• new wheat varieties with lower susceptibility to weather were introduced after year 13.
• split up the sample into two sub-samples
• generate group=0 if t<14
• replace group=1 if missing(group)
• STATA simplifies the process by providing a maximum-likelihood estimation (one could do things by hand using only OLS in several steps, but software packages are there to facilitate things; one only needs to understand what the software is actually doing). So here is the "shortcut" first:
• xtgls q p t, i(group) panels(hetero)
• (xt marks the panel-data commands (time-series cross-section); i() defines the panel, which in our case is the first and the second 13 years)
• panels(hetero) indicates that the two panels show heteroskedasticity (unequal variances)
• regress the simple OLS model and perform the Breusch-Pagan (Cook-Weisberg) test: hettest group (H0 is rejected)
• one could also do hettest t (in this case with the same conclusion but a lower test statistic). Hence, if there is very specific non-sample information, it should be included in order to come to more reliable conclusions
• perform the Goldfeld-Quandt test by running two OLS regressions:
• regress q p t if group==0
• regress q p t if group==1
• divide: var0/var1 = 11.11 > 2.98, the 5% critical value of F(10, 10); the df are (T1 − K, T2 − K)
• predict residual, residuals
• scatter residual t
• alternative to the "shortcut": GLS by hand (σ² for group 0 (t=1-13) = 6416.40762, for group 1 = 577.585656)
• gen cons_trans=1/sqrt(6416.40762) if group==0
• replace cons_trans=1/sqrt(577.585656) if group==1
• gen q_trans=q/sqrt(6416.40762) if group==0
• replace q_trans=q/sqrt(577.585656) if group==1
• gen p_trans=p/sqrt(6416.40762) if group==0
• replace p_trans=p/sqrt(577.585656) if group==1
• gen t_trans=t/sqrt(6416.40762) if group==0
• replace t_trans=t/sqrt(577.585656) if group==1
• regress q_trans cons_trans p_trans t_trans, nocons
• the result will be slightly different from the xtgls result, due to rounding errors (see the sketch below)
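The rounding problem disappears if the group variances are taken from the saved results instead of typed in by hand; a sketch:

	quietly regress q p t if group==0
	scalar s2_0 = e(rss)/e(df_r)
	quietly regress q p t if group==1
	scalar s2_1 = e(rss)/e(df_r)
	gen w = cond(group==0, 1/sqrt(scalar(s2_0)), 1/sqrt(scalar(s2_1)))
	gen q_tr = q*w                     // transformed dependent variable
	gen p_tr = p*w
	gen t_tr = t*w
	gen cons_tr = w                    // transformed constant term
	regress q_tr cons_tr p_tr t_tr, nocons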
• however, if we just know that there is heteroskedasticity but have no idea where it is coming from, then using White's estimator (robust) might still be the best choice
• use <path\> foodexp.dta
• describe
• in order to show the working of the White estimator, use the food expenditure data
• compare the two models: regress foodexp income and regress foodexp income, robust
• the estimated coefficients (and the R2) are the same, but the standard errors and t-statistics are different
GLS for proportional heteroskedasticity
• here we use weighted least squares (the weight is the reciprocal of x), i.e. instead of minimizing the usual sum of squared errors, we minimize the sum of squared transformed errors
• regress foodexp income [aweight=1/income]
• aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; i.e., the variance of the t-th observation is assumed to be var(y_t) = var(e_t) = σ²/aweight. Since we have assumed var(e_t) = σ²·x_t, the correct weight is 1/x_t
• The same can be done by hand by building the transformed model:
• y*_t = β1·x*_t1 + β2·x*_t2 + e*_t, where y*_t = y_t/√x_t, x*_t1 = 1/√x_t, x*_t2 = x_t/√x_t and e*_t = e_t/√x_t
• transform all variables, generate a new variable for x*_t1, and estimate the transformed model with the constant suppressed (…, nocons). This will lead to the same results. Only we should be careful with interpreting the R2, because of the nocons option.
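A sketch of this transformation for the food expenditure model, assuming var(e_t) = σ²·income_t as above:

	gen sqrtinc = sqrt(income)
	gen foodexp_tr = foodexp/sqrtinc
	gen income_tr = income/sqrtinc     // equals sqrt(income)
	gen cons_tr = 1/sqrtinc            // transformed constant term
	regress foodexp_tr cons_tr income_tr, nocons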