Problems in model averaging with dummy variables
David F. Hendry and J. James Reade
Economics Department, Oxford University
Model Evaluation in Macroeconomics Workshop, University of Oslo
6th May 2005
1. Introduction
Model averaging:
• widely used,
• proposed as a method for accommodating model uncertainty,
• can be shown to have desirable properties in a stationary world:
– Raftery et al. (1997) show, on a logarithmic scoring rule, that averaged model forecasts are better than those of any individual model.
– Hendry & Clements (2004) explore cases where averaging might improve forecasts.
But: extension to a non-stationary world presents difficulties.
Plan
Section 2:
• model averaging introduced,
• various methods of implementing it outlined,
• use of model averaging in the empirical literature discussed.
In Section 3:
• simple models introduced to highlight problems with model averaging in empirically relevant situations,
• Monte Carlo simulations used to support the predictions of the models and to suggest problems exist in a more general context.
Section 4 concludes.
2. Model Averaging
Possible in both:
• classical statistical framework (see Buckland et al. 1997), and
• Bayesian framework (BMA) (see Raftery et al. 1997).
Latter much more commonly used in literature. Examples include:
• growth theory:
– Fernandez et al. (2001),
– Doppelhofer et al. (2000),
• US quarterly GDP: Koop & Potter (2003),
• Swedish inflation: Eklund & Karlsson (2004).
2.1. Implementation
• Set of K variables thought to have explanatory power for parameters of interest.
• Form a set M of L models (every subset), {M1, . . . , ML} ∈ M.
• Model selection could be used to reduce size of M.
• Consider linear regression models.
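The enumeration of every subset of the K candidate variables can be sketched as follows; a minimal illustration, with hypothetical variable names standing in for the regressors:

```python
from itertools import chain, combinations

def model_set(variables):
    """Enumerate every subset of the K candidate variables,
    giving the L = 2**K models, from the empty model to the full GUM."""
    return list(chain.from_iterable(
        combinations(variables, r) for r in range(len(variables) + 1)))

# Hypothetical names: with K = 3 candidate regressors we get L = 2**3 = 8 models.
models = model_set(["x1", "x2", "x3"])
```

With even moderate K the set M grows as 2^K, which is why model selection to reduce its size is mentioned above.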
2.1.1. Reporting of results
In conventional linear regression analysis, β̂ = E(β|X) reported.
Might expect weighted average of β̂ over the L models,

    β̃ = Σ_{l=1}^{L} wl β̂(l),   (1)

and output of:

    y = β̃X + ũ.   (2)
2.1.2. Existence of true parameters
• BUT: Bayesian statisticians reject idea of a single, true estimate.
• Instead: each parameter has a distribution.
• Thus Fernandez et al. (2001) report probabilities of inclusion for parameters, plot distribution functions.
• Hoover & Perez (2004) pp. 767–769 for discussion on existence of ‘true’ parameters, relevance for empirical work.
• We analyse model averaging as in equations (1) and (2).
2.1.3. Issues relating to construction of weights
Key issue on two levels:
• how to construct the weights;
• which weighting criterion to use.
2.1.3. How to construct the weights
Considering the first issue, for any particular weighting criterion, say Cl, the weighting method might be:

    wl = Cl / Σ_{i=1}^{L} Ci.   (3)

• ensures that Σ_{l=1}^{L} wl = 1.
• but no variable appears in every model ⇒ sum of weights applied to particular variable not unity, so bias.
2.1.3. How to construct the weights
Alternative construction: rescale weight for regressor, so sum over models it appears in is unity. If Nk ⊂ M is the set of models in M containing βk:

    wl = Cl / Σ_{i∈Nk} Ci.   (4)

• weights for any particular regressor sum to unity.
2.1.3. How to construct the weights
• Buckland et al. (1997) favour first method: sum of weights measure of importance of regressor.
• But Doppelhofer et al. (2000) advocate rescaled weighting for reporting coefficients: coefficients produced by this method are ones used in forecasting, and for analysing marginal effects.
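The two weight constructions in equations (3) and (4) can be sketched as follows; a minimal illustration, assuming the criterion values Cl are already computed, with made-up numbers:

```python
def weights_all_models(C):
    """Equation (3): w_l = C_l / sum_i C_i, so the weights over ALL L models sum to 1."""
    total = sum(C)
    return [c / total for c in C]

def weights_rescaled(C, contains_regressor):
    """Equation (4): rescale so the weights over the models in N_k, i.e. those
    CONTAINING a given regressor beta_k, sum to 1."""
    total = sum(c for c, has in zip(C, contains_regressor) if has)
    return [c / total if has else 0.0 for c, has in zip(C, contains_regressor)]

C = [1.0, 2.0, 1.0, 4.0]            # illustrative criterion values for 4 models
has_k = [False, True, False, True]  # flags the models containing beta_k
w3 = weights_all_models(C)
w4 = weights_rescaled(C, has_k)
# Under (3) the weights on the two models containing beta_k sum to 6/8 < 1,
# biasing the averaged coefficient towards zero; under (4) they sum to 1.
```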
2.1.3. Which weighting criterion to use
• In Bayesian context each model weighted by posterior probability:

    Pr(Ml | X) = Pr(Ml) Pr(X | Ml) / Σ_{k=1}^{L} Pr(Mk) Pr(X | Mk).   (5)

• In non-Bayesian contexts, information criteria might be considered, e.g. Akaike, Bayesian (Schwarz) information criteria.
• Out-of-sample methods might be used:
– Eklund & Karlsson (2004): predictive Bayesian densities,
– Hendry & Clements (2004): minimising MSFE of averaged model.
2.1.3. Which weighting criterion to use
• Here, following Buckland et al. (1997), approximation to Schwarz information criterion (SIC) used. Uses exp(−σ̂²v,l/2), where σ̂²v,l is the residual variance of the l-th model:

    σ̂²v,l = (1/T) Σ_{t=1}^{T} v̂t².

• Thus (non-rescaled) weights given by:

    wl = exp(−½ σ̂²v,l) / Σ_{l=1}^{L} exp(−½ σ̂²v,l).   (6)
2.1.3. Which weighting criterion to use
Justification for non-Bayesian weights given predominance of BMA in literature:
• Schwarz information criterion does not discriminate strongly between similar models, fits in with concerns over model uncertainty.
• Schwarz criterion is approximation to Bayes factor.
• Difficulty of choosing priors for 2^K models.
Thus can apply analytical results and Monte Carlo simulation results here to Bayesian model averaging.
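The exponential weighting of equation (6) can be sketched as follows; a minimal illustration with made-up residual variances for the L models:

```python
import math

def exp_weights(sigma2):
    """Equation (6): w_l = exp(-sigma2_l / 2) / sum_l exp(-sigma2_l / 2),
    where sigma2_l is the residual variance of model l."""
    scores = [math.exp(-s / 2.0) for s in sigma2]
    total = sum(scores)
    return [s / total for s in scores]

# Illustrative residual variances: three similar fits and one poor one.
# Similar fits receive nearly equal weights, so the criterion does not
# discriminate strongly between similar models.
w = exp_weights([0.010, 0.011, 0.012, 0.500])
```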
2.2. Context of macroeconomic modelling
Model averaging just one way of carrying out a data-centred macroeconomic modelling exercise; other paradigms include:
• alternative Bayesian strategies such as extreme-bounds analysis (see Hoover & Perez 2004);
• General-to-Specific model selection (see Hoover & Perez 1999, Hendry & Krolzig 2005, Perez-Amaral et al. 2003):
– general model posited to include all possible determining factors for parameter of interest,
– then process of reduction (Hendry 1995, Ch. 9) carried out, resulting in parsimonious, congruent, encompassing model.
3. The Effect of Dummy Variables on Model Averaging
3.1. Impulse dummy variables
• Simple location-scale data generation process (DGP) with transient mean shift:

    yt = β + γ1{t=ta} + vt,  where vt ∼ IN(0, σv²)   (7)

• Parameter of interest is β, forecast for yT+1, from forecast origin T.
• Consider empirically relevant case where γ = λ√T for a fixed constant λ (see Doornik et al. 1998).
3.1. Impulse dummy variables
• General model (GUM, our initial K variables) is DGP augmented by unnecessary impulse dummy, d2,t = 1{t=tb} (where d1,t = 1{t=ta}):

    yt = β + γd1,t + δd2,t + ut   (8)

• thus structural break or outlier has been accounted for, but only one transient location shift actually occurs (investigator unaware of this).
3.1. Impulse dummy variables
• 2³ = 8 possible models result, can be estimated using least squares:

    M0: ŷt = 0                       M1: ŷt = β̂(1)
    M2: ŷt = δ̂(2)d2,t                M3: ŷt = γ̂(3)d1,t
    M4: ŷt = β̂(4) + δ̂(4)d2,t         M5: ŷt = β̂(5) + γ̂(5)d1,t
    M6: ŷt = γ̂(6)d1,t + δ̂(6)d2,t     M7: ŷt = β̂(7) + γ̂(7)d1,t + δ̂(7)d2,t   (9)
3.1.1. Deriving weights and estimates
• Least squares gives 3 possible outcomes:

    β̂(0) = β̂(2) = β̂(3) = β̂(6) = 0,

    β̂(1) = β̂(4) = (1/T) Σ_{t=1}^{T} yt = (1/T) Σ_{t=1}^{T} (β + γ1{t=ta} + vt) ≃ β + γ/T = β + λ/√T,

    β̂(5) = β̂(7) = (1/(T−1)) Σ_{t=1, t≠ta}^{T} yt = (1/(T−1)) Σ_{t=1, t≠ta}^{T} (β + γ1{t=ta} + vt) ≃ β.
3.1.1. Deriving weights and estimates
Cumulating these:

    β̃ ≃ (w5 + w7)β + (w1 + w4)(β + λ/√T)
      = (w1 + w4 + w5 + w7)β + (w1 + w4)λ/√T.   (12)

• averaged coefficient ≠ true coefficient if λ ≠ 0, and/or w1 + w4 + w5 + w7 < 1 (which Σ_{l=1}^{L} wl = 1 implies in most cases).
• rescaling will mean w1 + w4 larger, hence bias from λ/√T greater.
• rescaling ⇒ δ̃, the coefficient on the irrelevant regressor, will receive greater weight.
3.1.2. Model averaging for forecasting stationary data
• One justification for model averaging is for ‘forecast pooling’. Outlier one-off so will not occur in forecast period:

    M0: ŷT+1,0 = 0     M1: ŷT+1,1 = β̂(1)    M2: ŷT+1,2 = 0
    M3: ŷT+1,3 = 0     M4: ŷT+1,4 = β̂(4)    M5: ŷT+1,5 = β̂(5)
    M6: ŷT+1,6 = 0     M7: ŷT+1,7 = β̂(7)                         (13)

• Letting: ỹT+1|T = Σ_{i=0}^{7} wi ŷT+1,i
• Then forecast error is ṽT+1|T = yT+1 − ỹT+1|T, with mean:

    E[ṽT+1|T] = (w0 + w2 + w3 + w6)β − (w1 + w4)λ/√T.

• Hence again bias.
3.1.2. Model averaging for forecasting stationary data
Worse still, MSFE:

    E[ṽ²T+1|T] = E[(yT+1 − Σ_{i=0}^{7} wi ŷT+1,i)²]
               = E[((w0 + w2 + w3 + w6)β − (w1 + w4)λ/√T + vT+1)²]
               = σv² + (w0 + w2 + w3 + w6)²β² + (w1 + w4)²λ²/T
                 − 2(w0 + w2 + w3 + w6)(w1 + w4)βλ/√T,   (14)

• Likely to be worse for large λ than GUM or any selected model, even allowing for estimation uncertainty, certainly if weights are not rescaled.
3.1.3. Numerical Example
• β = 1, λ = −1, σv² = 0.01, and T = 25. Then averaged β estimate is:

    β̃ = (w5 + w7)β + (w1 + w4)(β + λ/√T)
      = 0.382 + (0.305)(1 − 1/5) = 0.626   (15)

• very biased for the true value of unity.
• MSFE when forecasting without rescaling the weights is:

    E[ṽ²T+1|T] = 0.118.
3.1.3. Numerical Example
• Bias smaller if second weighting methodology used:

    β̃ = β + (w5 + w7)λ/√T = 1 + 0.37754 × (−1/5) = 0.924.

• Hard to calculate MSFE with rescaled weights, because each weight depends on which coefficient it multiplies; would expect MSFE smaller when weights rescaled.
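The arithmetic of the two numerical examples can be checked directly; a minimal sketch, taking the weight sums 0.382, 0.305 and 0.37754 from the slides as given:

```python
import math

beta, lam, T = 1.0, -1.0, 25

# Equation (15): non-rescaled weights, with w5 + w7 = 0.382 and w1 + w4 = 0.305.
beta_tilde = 0.382 * beta + 0.305 * (beta + lam / math.sqrt(T))

# Rescaled weighting: beta_tilde = beta + 0.37754 * lam / sqrt(T).
beta_tilde_rescaled = beta + 0.37754 * (lam / math.sqrt(T))
```

The first line reproduces 0.626 and the second 0.924 (to three decimals), confirming the bias figures quoted above.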
3.1.3. Numerical Example
• γ = −5 large when σv = 0.1, but outliers of magnitude √T often occur in practical models (see Hendry 2001, Doornik et al. 1998).
• In Monte Carlo simulation, range of values of λ and T considered.
3.1.4. Impulse Dummy Monte Carlo Simulation
Table 1: Bias on β coefficient for each modelling strategy. Based on 1,000 replications.

            GUM      MA       MA R     Lib      Cons     DGP True Value
λ = 0
T 25        0.000   -0.319    0.000    0.000    0.000    1
T 50       -0.001   -0.316   -0.001   -0.001   -0.001    1
λ = -0.05
T 25        0.000   -0.323   -0.006    0.000    0.000    1
T 50       -0.001   -0.319   -0.004   -0.001   -0.001    1
λ = -0.5
T 25        0.000   -0.362   -0.048    0.000    0.000    1
T 50       -0.001   -0.348   -0.034   -0.001   -0.001    1
λ = -1
T 25        0.000   -0.397   -0.079    0.000    0.000    1
T 50       -0.001   -0.376   -0.055   -0.001   -0.001    1
3.1.4. Impulse Dummy Monte Carlo Simulation
• Bias when simply the GUM is run is tiny.
• Model averaging induces large bias, ranging from about 30% of the
true β coefficient size when the dummies are both insignificant
(λ = 0), to around 40% when λ = −1.
• Calculations of the previous section are supported here; the predicted bias of -0.374 is reproduced and is in fact stronger, at -0.397.
• Rescaling leads to lower bias; bias does increase with size of true
γ (T fixed), but only reaches about 8% of coefficient size (again
corroborating calculations from earlier).
• Model selection induces GUM-sized, negligible bias regardless of
strategy.
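The impulse-dummy experiment can be sketched as a small Monte Carlo. This is an illustrative reconstruction, not the authors' code: the dummy positions ta, tb are arbitrary choices, and the exp(−σ̂²/2) scores of equation (6) are normalised over all eight models as in equation (3):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_mc(beta=1.0, lam=-1.0, sigma=0.1, T=25, reps=300):
    """Monte Carlo bias of the model-averaged constant.
    DGP: y_t = beta + gamma*1{t=ta} + v_t, with gamma = lam*sqrt(T)."""
    ta, tb = 5, 10                      # impulse positions (arbitrary here)
    gamma = lam * np.sqrt(T)
    const = np.ones(T)
    d1 = np.zeros(T); d1[ta] = 1.0      # relevant dummy
    d2 = np.zeros(T); d2[tb] = 1.0      # irrelevant dummy
    # the 8 models M0..M7: every subset of {constant, d1, d2}
    designs = [[], [const], [d1], [d2], [const, d1], [const, d2],
               [d1, d2], [const, d1, d2]]
    biases = []
    for _ in range(reps):
        y = beta + gamma * d1 + sigma * rng.standard_normal(T)
        beta_hats, sig2 = [], []
        for X in designs:
            if X:
                Xm = np.column_stack(X)
                b, *_ = np.linalg.lstsq(Xm, y, rcond=None)
                resid = y - Xm @ b
                # estimate of the constant; zero if the model omits it
                beta_hats.append(b[0] if X[0] is const else 0.0)
            else:
                resid = y
                beta_hats.append(0.0)
            sig2.append(resid @ resid / T)
        w = np.exp(-0.5 * np.array(sig2))   # equation (6) scores
        w /= w.sum()                        # normalised over all 8 models
        biases.append(w @ np.array(beta_hats) - beta)
    return float(np.mean(biases))

bias = run_mc()   # a sizeable negative bias, in line with the MA column of Table 1
```

The exact magnitude depends on the weighting details, but the averaged constant is pulled well below its true value of unity whenever zero-constant models receive weight.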
3.1.4. Impulse Dummy Monte Carlo Simulation
Table 2: Bias on γ coefficient for each modelling strategy. Based on 1,000 replications.

            GUM      MA       MA R     Lib      Cons     DGP True Value
λ = 0
T 25       -0.002    0.212    0.383   -0.073   -0.116    0
T 50        0.000    0.211    0.381    0.026   -0.042    0
λ = -0.05
T 25       -0.002    0.324    0.383   -0.004   -0.007   -0.25
T 50        0.000    0.369    0.381   -0.002   -0.002   -0.354
λ = -0.5
T 25       -0.002    1.277    0.383   -0.003   -0.003   -2.5
T 50        0.000    1.709    0.381   -0.002   -0.002   -3.536
λ = -1
T 25       -0.002    1.966    0.383   -0.003   -0.003   -5
T 50        0.000    2.645    0.381   -0.002   -0.002   -7.07
3.1.4. Impulse Dummy Monte Carlo Simulation
• Again negligible GUM bias.
• Bias on γ under model averaging decreases as percentage of size
of true coefficient from about 100% when dummy barely noticeable
(λ = −0.05), to about 40% when dummy very conspicuous (λ = −1).
• Rescaled bias invariant to changes in λ (this can be shown analytically).
• Rescaled bias larger relative to the true coefficient when λ small (dummy effectively insignificant); backs up earlier assertion. Relative bias smaller when λ bigger.
• Model selection bias again very small.
3.1.4. Impulse Dummy Monte Carlo Simulation
Table 3: MSFE for 1-step forecast of T + 1 from T for each modelling
strategy. Based on 1,000 replications.
GUM MA MA R Lib Cons σv²
λ=0
T 25 0.011 0.108 0.011 0.013 0.013 0.01
T 50 0.010 0.109 0.010 0.013 0.013 0.01
λ = -0.05
T 25 0.011 0.111 0.011 0.013 0.013 0.01
T 50 0.010 0.111 0.010 0.013 0.013 0.01
λ = -0.5
T 25 0.011 0.137 0.013 0.013 0.013 0.01
T 50 0.010 0.130 0.011 0.013 0.013 0.01
λ = -1
T 25 0.011 0.163 0.016 0.013 0.013 0.01
T 50 0.010 0.151 0.013 0.013 0.013 0.01
3.1.4. Impulse Dummy Monte Carlo Simulation
• GUM MSFE is of expected size.
• Huge MSFEs predicted for MA are supported.
• Rescaling substantially improves MSFE; only when λ large is MSFE
here larger than under model selection.
• Model selection provides competitive MSFE for each (λ, T )
combination.
• Suggests that model averaging, appropriately used, could be useful
for forecasting (see Raftery et al. 1997, Hendry & Clements 2004).
3.1.5. Does this generalise?
• Constant and dummies very simplistic; relevant to real-world applications?
• Monte Carlo on initial model, constant replaced with regressor ⇒ same results; in fact stronger bias of 0.103 on β in λ = −1, T = 25 case.
• If dummy variable was say d = 1{t1<t<t2}, then all λ/√T expressions replaced with (t2 − t1)λ/√T; expect bias to increase with size of period dummy covers.
• Further, some time series, many cross-section studies have ‘intermittent’ dummies, i.e. d = 1{t∈D}, e.g. industrial action in a year, country located in Africa. Expect same effect on bias.
• Will consider these more general contexts...
3.2. Period and Intermittent Dummies
3.2.1. Period Dummy Monte Carlo Simulation
• Monte Carlo from earlier:

    yt = β + γd1,t + δd2,t + ut   (18)

rerun with d1,t = 1{0<t<T/2} (half the sample), and d2,t = 1{(T/2)+1<t<(T/2)+8}; practitioner unsure of point of break. Same parameter values as before.
3.2.1. Period Dummy Monte Carlo Simulation
Table 4: Bias on β and MSFE for each modelling strategy. Based on 1,000 replications.

            Bias on β (true value = 1)         MSFE (σv² = 0.01)
            GUM      MA R     Lib      Cons    GUM      MA R     Lib      Cons
λ = 0
T 25       -0.001   -0.001    0.000    0.000   0.013    0.011    0.014    0.014
T 50       -0.001   -0.001    0.000   -0.001   0.011    0.010    0.013    0.013
λ = -0.05
T 25       -0.001   -0.081   -0.001   -0.002   0.013    0.017    0.015    0.015
T 50       -0.001   -0.101   -0.001   -0.001   0.011    0.020    0.013    0.013
λ = -0.5
T 25       -0.001   -0.606    0.001    0.000   0.013    0.370    0.014    0.014
T 50       -0.001   -0.409   -0.001   -0.001   0.011    0.177    0.013    0.013
λ = -1
T 25       -0.001   -0.418    0.001    0.000   0.013    0.182    0.014    0.014
T 50       -0.001   -0.020   -0.001   -0.001   0.011    0.011    0.013    0.013
3.2.1. Period Dummy Monte Carlo Simulation
• Bias same across all strategies when dummies irrelevant but in GUM.
• But even slight break (λ = −0.05) gives bias up to 10% of coefficient size in model averaging.
• λ = −0.5 gives horrific bias for model averaging but nothing from selection or GUM.
• Similar story for MSFE; competitive when λ = 0 but quickly deteriorates as λ increases.
• Horrendous MSFE when λ = −0.5 of up to 30 times DGP error variance.
• Competitive MSFE again for larger T when λ = −1 (i.e. γ is massive).
• So predictions borne out for longer dummies.
3.2.2. Period Dummy Monte Carlo Simulation: Generalisation
• Ran experiment again with two regressors in place of constant.
• Principle appears to generalise as again get noticeable bias and worse MSFE. Results in paper.
3.2.3. Intermittent Dummy Monte Carlo Simulation
• Also considered role of intermittent dummies.
• GUM:

    yt = β1X1,t + β2X2,t + γd1,t + δd2,t + ut.

• d1,t is African dummy, d2,t is Latin America dummy (both from Sala-i-Martin 1997a, Sala-i-Martin 1997b).
• As before, d1,t relevant, d2,t irrelevant. X1,t, X2,t both relevant, both mean-zero Normally distributed random numbers.
• β1 = β2 = 1.
3.2.3. Intermittent Dummy Monte Carlo Simulation
Table 5: Bias on β1 and MSFE for each modelling strategy. Based on 1,000 replications.

            Bias on β1 (true value = 1)    MSFE (σv² = 0.01)
            GUM      MA R     MS           GUM      MA R     MS
λ = 0
T 50        0.000    0.109    0.000        0.010    0.067    0.013
T 75        0.000    0.069    0.000        0.010    0.011    0.015
λ = -0.05
T 50        0.000    0.133   -0.001        0.010    0.073    0.013
T 75        0.000    0.084    0.000        0.010    0.012    0.015
λ = -0.5
T 50        0.000    0.212   -0.001        0.010    0.095    0.013
T 75        0.000    0.085    0.000        0.010    0.013    0.015
λ = -1
T 50        0.000    0.123   -0.001        0.010    0.072    0.013
T 75        0.000    0.070    0.000        0.010    0.011    0.015
3.2.3. Intermittent Dummy Monte Carlo Simulation
• Bias on β1, one of parameters of interest, shown.
• Even when dummies irrelevant, bias noticeably stronger on model
averaging (rescaled), especially as T increases.
• Bias greatest when λ = −0.5, considerably more than under any other modelling strategy.
• Bias pretty invariant to λ when T larger (75). Not huge but much
larger than any other strategy.
3.2.4. Lessons from Monte Carlo
• Strong bias and shocking MSFE from simple model averaging
supported, shown to be general across (λ, T ) combinations. Argues
against Buckland et al.’s (1997) idea of model averaging.
3.2.4. Lessons from Monte Carlo
• Strong bias and shocking MSFE from simple model averaging
3.2.4. Lessons from Monte Carlo
• Strong bias and shocking MSFE from simple model averaging are
supported, and shown to be general across (λ, T ) combinations; this
argues against Buckland et al.’s (1997) idea of model averaging.
• Rescaling improves both the bias on β and the MSFE,
• BUT: rescaling increases the size of coefficients on irrelevant variables,
• Further, bias is still strong and MSFE still large when period or
intermittent dummies are present:
– Worrisome for empirical work, e.g. growth regressions.
– Often many dummies are specified:
∗ Doppelhofer et al. (2000): 8 dummies in a 32-variable, 98-country
dataset;
∗ Hoover & Perez (2004): 7 dummies in a 36-variable, 107-country
dataset.
– How biased are regression coefficients, given the inclusion of dummies?
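The rescaling discussed above can be sketched in code. The weights below are illustrative exp(−AIC/2) weights in the spirit of Buckland et al. (1997), not the paper's exact scheme, and the data, coefficient values, and function name are assumptions for demonstration. Un-rescaled averaging shrinks a coefficient towards zero by the total weight on models that exclude the variable; rescaling divides that shrinkage back out:

```python
import numpy as np

def average_beta0(y, X, rescale=False):
    """Average the coefficient on X[:, 0] over all non-empty subsets of
    regressors, using exp(-AIC/2) weights (an illustrative assumption)."""
    n, k = X.shape
    weights, betas, includes = [], [], []
    for mask in range(1, 2 ** k):                 # every non-empty model
        cols = [j for j in range(k) if mask & (1 << j)]
        Xm = X[:, cols]
        b, *_ = np.linalg.lstsq(Xm, y, rcond=None)
        rss = float(np.sum((y - Xm @ b) ** 2))
        aic = n * np.log(rss / n) + 2 * len(cols)
        weights.append(np.exp(-aic / 2.0))
        includes.append(0 in cols)
        # excluded variables contribute a zero coefficient to the average
        betas.append(float(b[cols.index(0)]) if 0 in cols else 0.0)
    w = np.array(weights) / np.sum(weights)
    beta_bar = float(w @ np.array(betas))
    if rescale:
        # divide by the total weight on models that include variable 0
        beta_bar /= float(w[np.array(includes)].sum())
    return beta_bar

# toy data: x0 is relevant (true coefficient 0.5), x1 and x2 are not
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 0.5 * X[:, 0] + rng.normal(size=60)
plain = average_beta0(y, X, rescale=False)
scaled = average_beta0(y, X, rescale=True)
print(plain, scaled)   # |scaled| >= |plain| by construction
```

Because the rescaled estimate is the un-rescaled one divided by an inclusion probability in (0, 1], rescaling always moves the averaged coefficient away from zero, which is why it also inflates coefficients on irrelevant variables.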
3.3. Larger Model
• Attempt to show problems of bias occur in larger models.
• 10 variable dataset with 3 dummies (two relevant).
• DGP:
yt = β3X3,t + β4X4,t + β5X5,t + β6X6,t + β7X7,t + δ1d1,t + δ2d2,t + vt,
vt ∼ N(0, σ²).
(21)
• GUM specified with two irrelevant variables and one irrelevant
dummy:
yt = β1X1,t + β2X2,t + β3X3,t + β4X4,t + β5X5,t + β6X6,t
+ β7X7,t + δ1d1,t + δ2d2,t + δ3d3,t + ut.
(22)
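As a sanity check on this design, a minimal simulation of DGP (21) estimated with GUM (22) shows that OLS on the GUM leaves the irrelevant β2 essentially unbiased, consistent with the near-zero GUM column of Table 6 below. The β and δ values and the impulse-dummy timings here are assumptions, since the slides do not report them:

```python
import numpy as np

rng = np.random.default_rng(1)
T, R = 100, 500
# assumed illustrative parameters: beta1 = beta2 = 0 (irrelevant), d3 irrelevant
beta = np.array([0.0, 0.0, 0.5, 0.4, 0.6, 0.5, 0.4])
delta = np.array([2.0, 2.0, 0.0])

est_b2 = []
for _ in range(R):
    X = rng.normal(size=(T, 7))
    d = np.zeros((T, 3))
    d[T // 2, 0] = 1.0        # d1: impulse dummy at mid-sample (assumption)
    d[3 * T // 4, 1] = 1.0    # d2: impulse dummy (assumption)
    d[T // 4, 2] = 1.0        # d3: irrelevant dummy
    y = X @ beta + d @ delta + rng.normal(size=T)   # DGP (21)
    Z = np.hstack([X, d])                           # GUM regressors (22)
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    est_b2.append(b[1])                             # coefficient on X2

print(np.mean(est_b2))   # near zero: OLS on the GUM is unbiased for beta2
```

The bias in the tables below therefore comes from the averaging step, not from estimating the GUM itself.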
3.4. Larger Model Monte Carlo simulation
Table 6: Bias on β2 coefficient for each modelling strategy. Based on
1,000 replications.
          GUM      MA(R)    Lib      Con      True DGP value
λ = 0
T = 50   -0.006   -0.212    0.001    0.002    0
T = 75   -0.007   -0.192   -0.005    0.001    0
T = 100  -0.004   -0.107   -0.020   -0.003    0
λ = -0.5
T = 50   -0.006   -0.107    0.002    0.001    0
T = 75   -0.007   -0.124   -0.001    0.002    0
T = 100  -0.004   -0.049   -0.026   -0.004    0
λ = -1
T = 50   -0.006   -0.033    0.002    0.001    0
T = 75   -0.007   -0.086   -0.001    0.002    0
T = 100  -0.004   -0.014   -0.026   -0.004    0
3.4. Larger Model Monte Carlo simulation
• Bias from rescaling on the insignificant coefficient β2 is decreasing in
T and λ,
• sizeable bias arises if dummies are erroneously specified (λ = 0),
• considerable bias remains even when λ = −0.5, where the dummy is noticeable.
3.4. Larger Model Monte Carlo simulation
Table 7: Bias on β4 coefficient for each modelling strategy. Based on
1,000 replications.
          GUM      MA(R)    Lib      Con      True DGP value
λ = 0
T = 50    0.008   -0.178    0.009    0.014    0.676
T = 75    0.001   -0.089    0.007    0.010    0.516
T = 100   0.000   -0.082    0.005    0.009    0.433
λ = -0.5
T = 50    0.008   -0.244    0.010    0.015    0.676
T = 75    0.001   -0.059    0.007    0.011    0.516
T = 100   0.000   -0.079    0.005    0.008    0.433
λ = -1
T = 50    0.008   -0.292    0.010    0.015    0.676
T = 75    0.001   -0.040    0.007    0.011    0.516
T = 100   0.000   -0.076    0.005    0.008    0.433
3.4. Larger Model Monte Carlo simulation
• Bias on the significant β4 coefficient is around 20% of the true
coefficient (final column), regardless of T .
• When the dummy is relevant, bias is greater as a percentage of the true
coefficient at T = 100 than at T = 75.
• Large bias for small sample sizes, as the T = 50 rows suggest.
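These percentage claims can be checked directly against the λ = 0 rows of Table 7:

```python
# bias on beta4 under model averaging (Table 7, lambda = 0 rows),
# expressed as a percentage of the true DGP value
bias = {50: -0.178, 75: -0.089, 100: -0.082}
true_value = {50: 0.676, 75: 0.516, 100: 0.433}
pct = {T: round(100 * bias[T] / true_value[T], 1) for T in bias}
print(pct)   # {50: -26.3, 75: -17.2, 100: -18.9}
```

The T = 50 figure is noticeably larger than 20 per cent, consistent with the small-sample point above.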
3.4.1. Effect on other regressors and lessons from Monte Carlo
• For β5 and β6, strongly significant parameters, there is bias in small
samples (increasing in λ), but small bias when T is large.
• Coefficients on dummies are strongly biased, invariant to λ, with large
bias even when T = 100.
• This suggests the predictions of the earlier small models generalise to
larger models; worrisome for empirical work using BMA.
• MSFE from model averaging is competitive here. But the period- and
intermittent-dummy Monte Carlos showed that forecasting suffers when
structural breaks exist; this does generalise.
4. Conclusions
• Small models with outliers and structural breaks ⇒ bias and poor
forecasting in averaged models.
• Bias and bad forecasting arise:
– because, regardless of rescaling, too much emphasis is placed on bad
models and no selection is made;
– even if the practitioner has noted the breaks/outliers and accounted
for them in the dataset;
– even in ‘clean’ datasets where DGP ∈ GUM, with no collinearity,
heteroskedasticity, or structural breaks in unmodelled variables.
4. Conclusions
We suggest these results:
• Refute Buckland et al.’s (1997) un-rescaled averaging,
• Call into question Raftery et al.’s (1997) forecasting results in
non-stationary datasets,
• ⇒ bias problems in the work of, amongst others, Fernandez et al. (2001)
and Doppelhofer et al. (2000).
Model selection is shown to be an effective alternative to model averaging,
and has been tested in other difficult modelling contexts (see e.g. Hoover &
Perez 2004, Castle 2004).
References
Buckland, S.T., K.P. Burnham & N.H. Augustin (1997), ‘Model selection: An integral part of inference’, Biometrics
53, 603–618.
Castle, J. (2004), Evaluating PcGets and RETINA as automatic model selection algorithms. Unpublished paper,
Economics Department, Oxford University.
Doornik, Jurgen A, David F Hendry & Bent Nielsen (1998), ‘Inference in cointegrating models: UK M1 revisited’,
Journal of Economic Surveys 12(5), 533–72.
Doppelhofer, Gernot, Ronald I. Miller & Xavier Sala-i-Martin (2000), Determinants of long-term growth: A Bayesian
Averaging of Classical Estimates (BACE) approach, Technical report, National Bureau of Economic Research,
Inc.
Eklund, J. & S. Karlsson (2004), Forecast combination and model averaging using predictive measures. Unpublished
paper, Stockholm School of Economics.
Fernandez, C., E. Ley & M.F.J. Steel (2001), ‘Model uncertainty in cross-country growth regressions’, Journal of Applied
Econometrics 16(5), 563–576.
Hendry, David F. (2001), ‘Modelling UK inflation, 1875-1991’, Journal of Applied Econometrics 16(3), 255–275.
Hendry, David F. & Hans-Martin Krolzig (2005), ‘The properties of automatic Gets modelling’, The Economic
Journal 115(502), C32–C61.
Hendry, D.F. (1995), Dynamic Econometrics, Oxford University Press, Oxford.
Hendry, D.F. & M.P. Clements (2004), ‘Pooling of forecasts’, Econometrics Journal 7, 1–31.
Hoover, K.D. & S.J. Perez (1999), ‘Data mining reconsidered: Encompassing and the general-to-specific approach to
specification search’, Econometrics Journal 2, 167–191.
Hoover, Kevin D. & Stephen J. Perez (2004), ‘Truth and robustness in cross-country growth regressions’, Oxford
Bulletin of Economics and Statistics 66(5), 765–798.
Koop, Gary & Simon Potter (2003), Forecasting in large macroeconomic panels using Bayesian Model Averaging,
Staff Report 163, Federal Reserve Bank of New York.
Perez-Amaral, Teodosio, Giampiero M. Gallo & Halbert White (2003), ‘A flexible tool for model building: the Relevant Transformation of the Inputs Network Approach (RETINA)’, Oxford Bulletin of Economics and Statistics
65(s1), 821–838.
Raftery, A.E., D. Madigan & J.A. Hoeting (1997), ‘Bayesian model averaging for linear regression models’, Journal
of the American Statistical Association 92(437), 179–191.
Sala-i-Martin, Xavier X. (1997a), ‘I just ran two million regressions’, American Economic Review 87(2), 178–83.
Sala-i-Martin, Xavier X. (1997b), I just ran four million regressions, Technical report, National Bureau of Economic
Research, Inc.