Problems in Model Averaging with Dummy
Variables
David F. Hendry and J. James Reade
May 3, 2005
Abstract
Model averaging is widely used in empirical work, and proposed as
a solution to model uncertainty. This paper provides a range of relevant
empirical contexts where model averaging performs poorly in terms of bias
on coefficients and forecast errors. These contexts are when outliers and
structural breaks exist in datasets. Monte Carlo simulations support these
assertions and suggest that they apply in more complicated models than
the simple ones considered here. It is argued that not selecting relevant
variables over irrelevant ones is precisely the cause of poor performance;
weights ascribed to irrelevant components will bias that attributable to
the relevant. Within this context, the superior performance of model
selection algorithms is indicated.
1 Introduction
Model averaging, the practice of taking a weighted average of a number of regression models, is widely used, and is proposed as a method for accommodating
model uncertainty in statistical analysis. However, while averaging can be
shown to have desirable properties in a stationary world (see Raftery, Madigan
& Hoeting 1997), extension to the non-stationary world presents difficulties. In
this paper the performance of model averaging compared to model selection is
assessed in the empirically relevant situation where dummy variables form part
of the data generating process.
In Section 2, model averaging is introduced, various methods of implementing it are touched upon, and the use of model averaging in the empirical literature is discussed. In Section 3 a number of simple models are introduced
to highlight problems with model averaging in particular empirically relevant
situations, before Monte Carlo simulations are used firstly to support the simple
models and their conclusions, and then to suggest the problems exist in a more
general context. Section 4 concludes.
2 Model Averaging
Model averaging can be carried out in both the classical statistical framework
(see Buckland, Burnham & Augustin 1997), or the Bayesian paradigm (see
Raftery et al. 1997). In the empirical literature, the latter has been much more
commonly used as computing power has increased exponentially. Examples in
the growth literature include Fernandez, Ley & Steel (2001) who use a pure
Bayesian methodology with non-informative priors, and Doppelhofer, Miller &
Sala-i-Martin (2000) who calculate weights in a Bayesian manner, but average
over classical OLS estimates, while Koop & Potter (2003) use Bayesian model
averaging to forecast US quarterly GDP, and Eklund & Karlsson (2004) forecast
Swedish inflation based on predictive Bayesian densities.
When carrying out model averaging, practitioners state a set of K variables
considered to have explanatory power for the parameter of interest. These
variables then form a set M of L models,

M = {M_1, . . . , M_L}.
These models could be any particular type of statistical model. Here, along
with Raftery et al. (1997) and the other empirical studies mentioned above,
linear regression models are considered. Thus each one is of the form:

y = β_l X_l + u_l = β_1^{(l)} X_1 + · · · + β_K^{(l)} X_K + u_l,
where zeros in the β vector would signify where a particular regressor is not
included in model l. The models in the set M are usually every subset of the
K variables specified in the initial dataset, or some subset of these models using
some kind of selection algorithm. Raftery et al. (1997) advocate a Bayesian
selection algorithm based on the posterior density of each individual model.
However, the use of non-informative priors induced by the inability to specify
specific priors for each variable in the 2K models that result from considering
every subset of the K variables specified means that such selection algorithms
favour larger models (see Eklund & Karlsson 2004). Buckland et al. (1997)
appear to suggest that model selection should not be carried out at all.1 In
conventional linear regression analysis, the mean of parameters of interest
conditional on the explanatory variables is usually reported, and as such one
might expect the weighted average of this conditional mean over the L models,
say

β^a = Σ_{l=1}^{L} w_l β_l,    (1)
1 It is not possible to challenge this claim in the small models considered here, because
model selection algorithms tend to choose just one model hence leaving nothing to average
over and leaving the comparison as one between model averaging and model selection per se.
It is hoped to investigate this claim in future research.
where wl is the weight for model l, to be reported in model averaging, hence
giving an output from the process of:
y = β^a X + u^a.    (2)
Bayesian model averagers, such as Fernandez et al. (2001), discuss the probability of including any particular regressor in the averaged model as its importance, refraining from reporting any coefficients or model of the form (2) in
their averaging analysis, since, as Doppelhofer et al. (2000) point out, Bayesian
statisticians reject the idea of a single, true estimate, believing that each parameter has a true distribution, and hence Fernandez et al. (2001) produce charts of
distribution functions for each parameter. At this point a debate about the existence or not of a true specification can be entered into; Hoover & Perez (2004,
pp. 767–769) summarise this well. Buckland et al. (1997) suggest reporting the
averaged coefficients as in (1), and they see the sum of weights as a measure
of the importance of each regressor. This introduces the debate over how the
models are weighted in the combination, which manifests itself on two levels;
firstly how to construct the weights, and secondly which weighting criterion to
use. Considering the first issue, for any particular weighting criterion, say Cl ,
the weighting method might be:
w_l = C_l / Σ_{i=1}^{L} C_i.    (3)
This ensures that Σ_{l=1}^{L} w_l = 1. However, no variable appears in every model,
meaning that the sum of the weights applied to a particular variable will not be
unity, and as such the coefficient will be biased down.2 An alternative weighting
method to account for this downward bias might be to rescale the weights for
each regressor so that the sum over the number of models it appears in is unity.
Thus, where N_k ⊂ M denotes the set of models in M that contain regressor X_k, the weight for model l might instead be:

w_l = C_l / Σ_{i∈N_k} C_i.    (4)
Hence the weights for any particular regressor will sum to unity. Doppelhofer
et al. (2000) advocate this rescaled weighting for reporting coefficients in their
averaged model, stating that the coefficients produced by this method would be
the ones used in forecasting, and for analysing marginal effects. Both weight
construction methods will be considered in this paper.
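The two constructions can be illustrated with a short sketch (for exposition only, not the code behind the reported results; the criterion values are arbitrary illustrative numbers):

```python
from itertools import combinations

# Sketch of the two weight constructions for an arbitrary criterion C_l.
# Each model is represented by the set of regressor indices it includes.
K = 2  # number of candidate regressors
models = [frozenset(s) for r in range(K + 1) for s in combinations(range(K), r)]
C = {m: 1.0 for m in models}  # equal criterion values, purely illustrative

# Construction (3): normalise over ALL models, so the w_l sum to one,
# but the weights applied to any one regressor sum to less than one.
w = {m: C[m] / sum(C.values()) for m in models}

# Construction (4): rescale per regressor k over N_k, the models that
# contain regressor k, so the weights on that regressor sum to one.
def rescaled(k):
    N_k = [m for m in models if k in m]
    total = sum(C[m] for m in N_k)
    return {m: C[m] / total for m in N_k}

sum_all = sum(w.values())                       # equals 1
sum_on_0 = sum(w[m] for m in models if 0 in m)  # below 1: footnote 2's point
sum_rescaled = sum(rescaled(0).values())        # equals 1 again
```

With K = 2 and equal criterion values, each of the four models receives weight 1/4, so the weight attached to either regressor sums to only 1/2 under (3); rescaling restores it to unity, which is precisely the trade-off discussed above.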
In terms of the weighting criterion Cl , in the Bayesian context each model
is weighted by its posterior probability, which is given by:
Pr(M_l | X) = Pr(M_l) Pr(X | M_l) / Σ_{k=1}^{L} Pr(M_k) Pr(X | M_k).    (5)
2 Taking the simplest 2-variable model illustrates this; then 4 models result, and each
variable will only appear in two of the models. Given a non-zero weighting for each model,
it cannot be the case that the sum of weights on either variable equals unity.
In non-Bayesian contexts, information criteria might be considered, such as the
Akaike or Bayesian information criteria. In this paper, following Buckland et al.
(1997), an approximation to the Schwarz information criterion (SIC) is employed, which uses exp(−σ̂²_{v,l}/2) (almost the same for the small number of parameters considered here), where σ̂²_{v,l} denotes the residual variance of the lth model:

σ̂²_{v,l} = (1/T) Σ_{t=1}^{T} v̂_t².
Estimator averaging, therefore, uses the weights given by:

w_l = exp(−σ̂²_{v,l}/2) / Σ_{i=1}^{L} exp(−σ̂²_{v,i}/2).    (6)
Finally, for weighting, out-of-sample methods might be used; Eklund & Karlsson
(2004) suggest using predictive Bayesian densities, while Hendry & Clements
(2004) discuss minimising the mean squared forecast error of the averaged model
as a criterion to construct weights.
The justification for using SIC-based weights in this paper is that the Schwarz
information criterion does not discriminate strongly between models differing by
a regressor or two, a property in keeping with the Bayesian concern for model uncertainty. Further, the SIC is an approximation to the Bayes factor. Thus the
analytical results and Monte Carlo simulation results, it is argued, can be applied to the more widely used Bayesian model averaging.
Model averaging is just one way of carrying out a data-focussed macroeconomic modelling exercise. Another method is General-to-Specific model selection (see Hoover & Perez 1999, Hendry & Krolzig 2005, Perez-Amaral, Gallo &
White 2003), whereby a general model is posited to include all possible factors
contributing to determination of a parameter of interest, and then a process of
reduction is carried out to leave the practitioner with the most parsimonious
congruent and encompassing econometric model.
3 The bias when dummy variables are included

3.1 Orthogonal model with irrelevant dummy
We consider the simplest location-scale data generation process (DGP) in (7) with a transient mean shift, namely:

y_t = β + γ1_{t=t_a} + v_t,  where v_t ∼ IN[0, σ_v²],    (7)

where 1_{t=t_a} denotes a zero-one observation-specific indicator, unity at observation t_a and zero otherwise. The parameter of interest is β and the forecast will be for y_{T+1}, 1-step ahead from the forecast origin T. We consider the empirically relevant case where γ = λ√T for a fixed constant λ (see e.g. Doornik, Hendry & Nielsen 1998), and neglect terms of O_p(T^{−1/2}) or smaller in the analytic derivations. The simulation illustration confirms their small impact on the outcomes.
The postulated model has an intercept augmented by adding one relevant
and one irrelevant impulse dummy, denoted d1,t = 1{t=ta } and d2,t = 1{t=tb }
respectively. This yields the general unrestricted model (GUM):

y_t = β + γd_{1,t} + δd_{2,t} + u_t,    (8)
for t = 1, . . . , T where in the DGP, δ = 0 and γ ≠ 0, the former holding in
the sense that only one transient location shift actually occurred, although the
investigator is unaware of that fact. Equation (8) is the starting point for model
averaging as it is the set of variables from which all possible models are derived;
it is also the starting point for model selection, which then follows a process of
reduction to arrive at the most parsimonious congruent encompassing economic
model (see ch. 9, Hendry 1995).
For model averaging a regression would be run on all subsets of the regressors
in (8), and the following 2^3 = 8 possible models result:
M0 : β = 0; δ = 0; γ = 0    M1 : δ = 0; γ = 0    M2 : β = 0; γ = 0
M3 : β = 0; δ = 0           M4 : γ = 0           M5 : δ = 0
M6 : β = 0                  M7 : —                             (9)
This yields eight estimated models, all using least squares, where estimators are denoted by the subscript of their model number:

M0 : ŷ_t = 0                                M1 : ŷ_t = β̂(1)
M2 : ŷ_t = δ̂(2) d_{2,t}                     M3 : ŷ_t = γ̂(3) d_{1,t}
M4 : ŷ_t = β̂(4) + δ̂(4) d_{2,t}             M5 : ŷ_t = β̂(5) + γ̂(5) d_{1,t}
M6 : ŷ_t = γ̂(6) d_{1,t} + δ̂(6) d_{2,t}     M7 : ŷ_t = β̂(7) + γ̂(7) d_{1,t} + δ̂(7) d_{2,t}
                                                                           (10)

3.1.1 Deriving the weights and estimates
For the regressors, using least squares we find that:

β̂(0) = β̂(2) = β̂(3) = β̂(6) = 0,

β̂(1) = β̂(4) = (1/T) Σ_{t=1}^{T} y_t = (1/T) Σ_{t=1}^{T} (β + γ1_{t=t_a} + v_t) ≃ β + γ/T = β + λ/√T,

β̂(5) = β̂(7) = (1/(T−1)) Σ_{t=1, t≠t_a}^{T} y_t = (1/(T−1)) Σ_{t=1, t≠t_a}^{T} (β + γ1_{t=t_a} + v_t) ≃ β.
Hence there are three possible outcomes for estimating the parameter of interest
(neglecting sampling variation as second order):
• β̂_i ≃ 0, when there is no intercept (M0, M2, M3, M6);

• β̂_i ≃ β, when an intercept and d_{1,t} are included (M5, M7); and

• β̂_i ≃ β + λ/√T, when an intercept, but no d_{1,t}, is included (M1, M4).
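These three outcomes, and the averaged estimate they produce, can be illustrated in a small simulation using the exp(−σ̂²_{v,l}/2) weights of (6). This is a sketch only: the parameter values, seed, dummy positions and replication count are illustrative choices, not the paper's exact design.

```python
import numpy as np

# Simulate DGP (7), fit all eight models of (10) by least squares,
# and average the intercept estimates with the weights of (6).
rng = np.random.default_rng(0)
T, beta, lam, sigma_v = 25, 1.0, -1.0, 0.1
gamma = lam * np.sqrt(T)
t_a, t_b = 10, 20                  # relevant / irrelevant impulse positions

d1 = np.zeros(T); d1[t_a] = 1.0
d2 = np.zeros(T); d2[t_b] = 1.0
cols = {"beta": np.ones(T), "gamma": d1, "delta": d2}

beta_tilde = []
for _ in range(300):
    y = beta + gamma * d1 + rng.normal(0.0, sigma_v, T)
    crit, b_hat = [], []
    for mask in range(8):          # all 2^3 subsets, M0..M7
        names = [n for i, n in enumerate(cols) if mask & (1 << i)]
        if names:
            X = np.column_stack([cols[n] for n in names])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ coef
        else:                      # M0: y_hat = 0
            coef, resid = np.array([]), y
        crit.append(np.exp(-(resid @ resid / T) / 2.0))   # weights as in (6)
        b_hat.append(coef[names.index("beta")] if "beta" in names else 0.0)
    w = np.array(crit) / np.sum(crit)
    beta_tilde.append(float(w @ np.array(b_hat)))

bias = float(np.mean(beta_tilde)) - beta   # clearly negative, as (12) predicts
```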
All the derivations of the weights follow the same formulation. First, for M0 from (7):

σ̂²_{v,0} = (1/T) Σ_{t=1}^{T} y_t²
         = (1/T) Σ_{t=1}^{T} (β + γ1_{t=t_a} + v_t)²
         = (1/T) Σ_{t=1}^{T} (β² + γ²1_{t=t_a} + v_t² + 2βγ1_{t=t_a} + 2βv_t + 2γ1_{t=t_a} v_t)
         = β² + σ̄_v² + 2βv̄ + (1/T)(γ² + 2βγ + 2γv_{t_a})
         = β² + σ̄_v² + λ² + O_p(T^{−1/2})
         ≃ β² + σ_v² + λ²,    (11)

where:

σ̄_v² = (1/T) Σ_{t=1}^{T} v_t²  and  v̄ = (1/T) Σ_{t=1}^{T} v_t,

and the last line of (11) uses the asymptotic approximations:

√T v̄ →_D N[0, σ_v²]  and  σ̄_v² →_P σ_v².
Clearly, β̂(0) = 0 in M0, yet its weight will be non-zero in (1).
A similar approach for M1 yields:

σ̂²_{v,1} = (1/T) Σ_{t=1}^{T} (λ√T 1_{t=t_a} + v_t − λ/√T)² ≃ λ² + σ_v²,

since:

β̂(1) ≃ β + λ/√T.
Continuing through the remaining models delivers the complete set of approximate error variances:

σ̂²_{v,0} ≃ β² + λ² + σ_v²;    σ̂²_{v,1} ≃ λ² + σ_v²;
σ̂²_{v,2} ≃ β² + λ² + σ_v²;    σ̂²_{v,3} ≃ β² + σ_v²;
σ̂²_{v,4} ≃ λ² + σ_v²;         σ̂²_{v,5} ≃ σ_v²;
σ̂²_{v,6} ≃ β² + σ_v²;         σ̂²_{v,7} ≃ σ_v².

The error variance σ_v² enters all 8 models, and β² and λ² both enter 4 times.
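These approximations are easy to check numerically; the following sketch verifies them for M0, M1 and M5 (the sample size, seed and dummy position are arbitrary choices):

```python
import numpy as np

# Numerical check of the residual-variance approximations for M0, M1, M5.
rng = np.random.default_rng(1)
T, beta, lam, sigma_v = 10_000, 1.0, -1.0, 0.1
gamma, t_a = lam * np.sqrt(T), 500

d1 = np.zeros(T); d1[t_a] = 1.0
v = rng.normal(0.0, sigma_v, T)
y = beta + gamma * d1 + v

s2_M0 = np.mean(y**2)                   # ~ beta^2 + lam^2 + sigma_v^2 = 2.01
s2_M1 = np.mean((y - y.mean())**2)      # ~ lam^2 + sigma_v^2 = 1.01
X = np.column_stack([np.ones(T), d1])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
s2_M5 = np.mean(resid**2)               # ~ sigma_v^2 = 0.01
```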
Cumulating these:

β̃ ≃ (w_5 + w_7) β + (w_1 + w_4)(β + λ/√T)
  = (w_1 + w_4 + w_5 + w_7) β + (w_1 + w_4) λ/√T.    (12)
Simulation confirms the accuracy of these calculations for the mean estimates
of β, even for T as small as 25 (where the number of parameters might matter
somewhat).
From (12), the averaged coefficient will not equal the true coefficient so long as λ ≠ 0 and/or w_1 + w_4 + w_5 + w_7 < 1, which Σ_{l=1}^{L} w_l = 1 will imply in most cases. On the other hand, rescaling the weights will make w_1 + w_4 larger, and hence the bias induced by the λ/√T term will be greater. Rescaling will also mean that δ̃, the coefficient on the irrelevant regressor, receives greater weight, since the rescaling principle applies to all regressors; if weights are meant to reflect the importance of a parameter, one would instead expect it to receive a low weighting.
3.1.2 Model averaging for forecasting stationary data
One justification for model averaging is as a method of ‘forecast pooling’, so we consider that aspect next. The outlier was one-off, so it will not occur in the forecast period, yielding:

M0 : ŷ_{T+1,0} = 0    M1 : ŷ_{T+1,1} = β̂(1)    M2 : ŷ_{T+1,2} = 0
M3 : ŷ_{T+1,3} = 0    M4 : ŷ_{T+1,4} = β̂(4)    M5 : ŷ_{T+1,5} = β̂(5)
M6 : ŷ_{T+1,6} = 0    M7 : ŷ_{T+1,7} = β̂(7)

Letting:

ỹ_{T+1|T} = Σ_{i=0}^{7} w_i ŷ_{T+1,i},

then the forecast error is ṽ_{T+1|T} = y_{T+1} − ỹ_{T+1|T}, with mean:

E[ṽ_{T+1|T}] = (w_0 + w_2 + w_3 + w_6) β − (w_1 + w_4) λ/√T.    (13)
Thus, forecasts can be considerably biased, for similar reasons that β̃ can be biased. The mean-square forecast error (MSFE) is:

E[ṽ²_{T+1|T}] = E[(y_{T+1} − Σ_{i=0}^{7} w_i ŷ_{T+1,i})²]
             = E[((w_0 + w_2 + w_3 + w_6) β − (w_1 + w_4) λ/√T + v_{T+1})²]
             = σ_v² + (w_0 + w_2 + w_3 + w_6)² β² + (w_1 + w_4)² λ²/T
               − 2 (w_0 + w_2 + w_3 + w_6)(w_1 + w_4) βλ/√T,    (14)

which for unrescaled weights is almost bound to be worse for large λ than the general unrestricted model (GUM) or any selected model, even allowing for estimation uncertainty.
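Expressions (13) and (14) can be evaluated directly once the weights are fixed. The sketch below plugs in the approximate error variances derived above, with the parameter values of the numerical example that follows (β = 1, λ = −1, σ_v² = 0.01, T = 25); the resulting figures are illustrative of the orders of magnitude, not the paper's exact numbers.

```python
import math

# Evaluate the forecast-error mean (13) and MSFE (14) from the
# approximate residual variances sigma_hat^2 for models M0..M7.
beta, lam, sigma_v2, T = 1.0, -1.0, 0.01, 25
s2 = [2.01, 1.01, 2.01, 1.01, 1.01, 0.01, 1.01, 0.01]  # sigma_hat^2, M0..M7
c = [math.exp(-x / 2.0) for x in s2]
w = [x / sum(c) for x in c]

a = w[0] + w[2] + w[3] + w[6]   # models whose forecast is zero
b = w[1] + w[4]                 # models forecasting beta + lam/sqrt(T)
mean_error = a * beta - b * lam / math.sqrt(T)             # equation (13)
msfe = (sigma_v2 + (a * beta) ** 2 + (b * lam) ** 2 / T
        - 2 * a * b * beta * lam / math.sqrt(T))           # equation (14)
```

Even under these rough approximations, the MSFE is an order of magnitude larger than the DGP error variance σ_v².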
3.1.3 Numerical Example
Consider β = 1, λ = −1, σ_v² = 0.01 (i.e., a standard deviation of 10%), and T = 25. Then using the weights based on exp(−σ̂_v²/2) in (12):

β̃ = (w_5 + w_7) β + (w_1 + w_4)(β + λ/√T) = 0.382 + (0.305)(1 − 1/5) = 0.626,    (15)
which is very biased for the true value of unity. The basic problem is that
the weights increase too little for better models over poorer ones, in addition
to which, the ‘irrelevant’ impulse creates several such poorer models in the
averaging pool. The bias can be seen to be smaller if one takes the second
weighting methodology outlined in Section 2; firstly if we rewrite (15) as in
(12):
β̃ = (w_1 + w_4 + w_5 + w_7) β + (w_1 + w_4) λ/√T = 1 + 0.37754 (−1/5) = 0.924,

since the rescaled weights on the β coefficient, which appears only in models 1, 4, 5 and 7, sum to unity.
The MSFE from (14) when forecasting without rescaling the weights is:

E[ṽ²_{T+1|T}] = 0.118,

so the MSFE is almost 12 times larger than the DGP error variance.
It is hard to calculate the MSFE with the rescaled weights, because each weight depends on which coefficient it is being multiplied by; one would expect the MSFE to be smaller when the weights are rescaled, since the bias on the coefficients is smaller in that case. Finally, considering the parameter values chosen for this example, γ = −5 is large when σ_v = 0.1, but outliers of magnitude √T often occur in practical models (see Hendry 2001, Doornik et al. 1998). In the
Monte Carlo simulation, a range of values of λ and T will be considered.
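The two averaged estimates of this example can be reproduced, approximately, from the asymptotic error variances listed in Section 3.1.1. This is a sketch: the text's exact weights use finite-sample residual variances, so the unrescaled figure here differs somewhat from the 0.626 above, while the rescaled one matches 0.924 closely.

```python
import math

# Averaged beta under both weightings, from the asymptotic
# approximations to sigma_hat^2 for models M0..M7.
beta, lam, sigma_v2, T = 1.0, -1.0, 0.01, 25
s2 = {0: 2.01, 1: 1.01, 2: 2.01, 3: 1.01, 4: 1.01, 5: 0.01, 6: 1.01, 7: 0.01}
c = {l: math.exp(-x / 2.0) for l, x in s2.items()}

# unrescaled weights (3), normalised over all eight models
w = {l: cl / sum(c.values()) for l, cl in c.items()}
beta_unrescaled = (w[5] + w[7]) * beta + (w[1] + w[4]) * (beta + lam / math.sqrt(T))

# rescaled weights (4): beta appears only in M1, M4, M5 and M7
total = sum(c[l] for l in (1, 4, 5, 7))
wr = {l: c[l] / total for l in (1, 4, 5, 7)}
beta_rescaled = beta + (wr[1] + wr[4]) * lam / math.sqrt(T)
```

The rescaled estimate is much closer to the true value of unity, mirroring the discussion above.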
3.1.4 Monte Carlo Evidence
A Monte Carlo simulation of 1,000 replications was run to assess the impact of sampling distributions on the bias derived in Section 3.1.1. Table 2 reports the average bias on the β (first panel) and γ (second panel) coefficients when various modelling strategies are used. The different columns report the bias from the various modelling strategies; GUM is the general unrestricted model, hence the simple regression run on the entire dataset originally specified (equation (8)), MA is model averaging, MA R is model averaging with rescaled weights, Lib is model selection using the PcGets Liberal strategy, and Cons is model selection using the PcGets Conservative strategy.3 A range of T and λ values are considered along with the parameterisation of the numerical example in Section 3.1.3. As λ varies, the size of the coefficient on the relevant dummy, γ, varies, and Table 1 shows the actual size of the γ coefficient for each (λ, T) combination.
Table 1: Size of γ coefficient for various values of λ and T.

T \ λ     0     -0.05    -0.25    -0.5     -0.75    -1
25        0     -0.25    -1.25    -2.5     -3.75    -5
50        0     -0.354   -1.768   -3.536   -5.303   -7.071
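The entries of Table 1 are simply γ = λ√T, rounded to three decimals:

```python
import math

# Reproduce Table 1: gamma = lam * sqrt(T) for each (lam, T) pair.
lams = [0, -0.05, -0.25, -0.5, -0.75, -1]
table = {T: [round(lam * math.sqrt(T), 3) for lam in lams] for T in (25, 50)}
```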
From Table 2, the bias when simply the GUM is run is tiny on both coefficients. Model averaging induces a very large bias, ranging from about 30% of the true β coefficient size when the dummies are both insignificant (λ = 0), to around 40% when λ = −1, and the calculations of the previous Section are supported here; the bias of −0.374 is reproduced, and is in fact stronger, at −0.397, in the Monte Carlo. The bias on γ under model averaging decreases as a fraction of the size of the true coefficient from about 100% of the size when the dummy is barely noticeable (λ = −0.05), to about 40% when the dummy is very conspicuous at λ = −1.
Thus a picture of strong bias is drawn from model averaging. Rescaling
the weights, as described in Section 2, improves this markedly. The bias on
β does increase with the size of the true γ coefficient, holding T fixed, but
is much smaller than when weights are not rescaled. The bias calculated in
the previous section (−0.076 when λ = −1) is supported with −0.079 found in
the simulation, while if λ = −0.5, the bias is around 5% of the β coefficient
size. The bias on the γ is invariant to changes in λ, and hence changes in the
3 For a description of these strategies, which amount to differing stringencies of tests used
at various stages of the selection procedure, see Hendry & Krolzig (2005).
Table 2: Bias on coefficients in DGP (equation (7)). Based on 1,000 replications.

β coefficient:
         GUM      MA       MA R     Lib      Cons
λ = 0
T 25     0.000   -0.319    0.000    0.000    0.000
T 50    -0.001   -0.316   -0.001   -0.001   -0.001
λ = -0.05
T 25     0.000   -0.323   -0.006    0.000    0.000
T 50    -0.001   -0.319   -0.004   -0.001   -0.001
λ = -0.25
T 25     0.000   -0.340   -0.026    0.000    0.000
T 50    -0.001   -0.332   -0.018   -0.001   -0.001
λ = -0.5
T 25     0.000   -0.362   -0.048    0.000    0.000
T 50    -0.001   -0.348   -0.034   -0.001   -0.001
λ = -0.75
T 25     0.000   -0.381   -0.067    0.000    0.000
T 50    -0.001   -0.363   -0.047   -0.001   -0.001
λ = -1
T 25     0.000   -0.397   -0.079    0.000    0.000
T 50    -0.001   -0.376   -0.055   -0.001   -0.001

γ coefficient:
         GUM      MA       MA R     Lib      Cons
λ = 0
T 25    -0.002    0.212    0.383   -0.073   -0.116
T 50     0.000    0.211    0.381    0.026   -0.042
λ = -0.05
T 25    -0.002    0.324    0.383   -0.004   -0.007
T 50     0.000    0.369    0.381   -0.002   -0.002
λ = -0.25
T 25    -0.002    0.766    0.383   -0.003   -0.003
T 50     0.000    0.993    0.381   -0.002   -0.002
λ = -0.5
T 25    -0.002    1.277    0.383   -0.003   -0.003
T 50     0.000    1.709    0.381   -0.002   -0.002
λ = -0.75
T 25    -0.002    1.692    0.383   -0.003   -0.003
T 50     0.000    2.281    0.381   -0.002   -0.002
λ = -1
T 25    -0.002    1.966    0.383   -0.003   -0.003
T 50     0.000    2.645    0.381   -0.002   -0.002
size of the γ coefficient itself; once λ is non-zero, however, the bias is smaller
when weights are rescaled.4 When λ = 0, γ is the coefficient on an irrelevant
regressor, and this highlights the problem discussed in Section 3.1.1 of rescaling
weights; it increases the weight on irrelevant regressors.5
Table 2 also shows the performance of model selection; the bias is generally
comparable to the bias on the GUM, hence negligible, while Tables 3 and 4,
which report the percentage of simulation replications on which the DGP was
selected as the final model in the model selection algorithm, tell us that in
almost every replication the true DGP is found.6
4 This invariance of the bias to λ is to be expected, since from (10), d_{1,t} only appears in M3, M5, M6 and M7. In M5 and M7 the coefficient is unbiased, γ̂(5) ≃ γ̂(7) ≃ γ, since the model is well specified, but in M3 and M6 we have:

γ̂(3) = Σ_{t=1}^{T} d_t y_t / Σ_{t=1}^{T} d_t² = Σ_{t=1}^{T} d_t (β + γd_t + v_t) / Σ_{t=1}^{T} d_t² ≃ β + γ.

Thus when we consider the averaged coefficient:

γ̂^{(a)} = (w_3 + w_6)(β + γ) + (w_5 + w_7) γ = (w_3 + w_5 + w_6 + w_7) γ + (w_3 + w_6) β,    (16)

then if w_3 + w_5 + w_6 + w_7 = 1 (rescaled weights) the bias is independent of γ = λ√T for fixed T.
5 Furthermore, the bias on δ, the irrelevant dummy, which is not reported, is larger when
weights are rescaled.
6 In fact the two relevant regressors were retained on every replication. It was only retention of the irrelevant dummy on a small number of replications that brought the percentages in Tables 3 and 4 down.
Table 3: Percentage of times specific model is DGP using PcGets Liberal Strategy. Based on 1,000 replications.

       λ = 0   λ = -0.1   λ = -0.25   λ = -0.5   λ = -0.75   λ = -1
T 25   91.1%   95.4%      95.8%       95.8%      95.8%       95.8%
T 50   91.6%   95.3%      95.3%       95.3%      95.3%       95.3%
Table 4: Percentage of times specific model is DGP using PcGets Conservative Strategy. Based on 1,000 replications.

       λ = 0   λ = -0.1   λ = -0.25   λ = -0.5   λ = -0.75   λ = -1
T 25   97.3%   97.3%      99.1%       99.1%      99.1%       99.1%
T 50   97.2%   99.0%      99.0%       99.0%      99.0%       99.0%
Table 5 provides information on the mean-square forecast error for the different modelling strategies. The MSFE for a 1-step forecast of T + 1 from T is given for the same modelling strategies reported in Table 2. One can see that the huge MSFEs for MA predicted in Section 3.1.3 are indeed supported, but when the weights are rescaled for model averaging, the MSFE is not noticeably worse than using the GUM, and MSFEs over the various modelling strategies are indistinguishable.
Thus the Monte Carlo simulations support the assertions from Section 3.1.1 of bias and terrible forecast performance when using model averaging in the presence of indicator variables. Rescaling weights does improve the bias performance of model averaging for regressors that are relevant, and substantially improves the MSFE. However, upward bias is increased for irrelevant variables, potentially giving false information as to the relevance of some variables, and strong bias remains for coefficients on dummy variables. The model here is extremely simplistic; however, it does generalise quite easily analytically to a model with a regressor in place of the constant in (7), and Monte Carlo simulations give almost identical results; in fact a stronger bias of 0.103 on β is reported in the λ = −1, T = 25 case. Generalising to large models with numerous regressors and dummies is analytically fearsome, but for empirical work it is important that we can make generalisations. Longer dummy variables and dummy variables that take unity for a certain state of the world are considered in the next Section, while larger models with impulse dummy variables are investigated via Monte Carlo simulation in Section 3.3.
3.2 Period and intermittent dummy variables
If the dummy variable were postulated to be a period dummy, say d_t = 1_{t_1<t<t_2}, then following the same procedure to calculate the bias as in Section 3.1.1, all 1/√T expressions can be replaced with (t_2 − t_1)/√T, hence the bias will
Table 5: MSFE from various modelling strategies in the Monte Carlo simulation in Section 3.1.4. Based on 1,000 replications.

           GUM     MA      MA R    Lib     Cons
λ = 0
T 25       0.011   0.108   0.011   0.013   0.013
T 50       0.010   0.109   0.010   0.013   0.013
λ = -0.05
T 25       0.011   0.111   0.011   0.013   0.013
T 50       0.010   0.111   0.010   0.013   0.013
λ = -0.25
T 25       0.011   0.122   0.011   0.013   0.013
T 50       0.010   0.119   0.010   0.013   0.013
λ = -0.5
T 25       0.011   0.137   0.013   0.013   0.013
T 50       0.010   0.130   0.011   0.013   0.013
λ = -0.75
T 25       0.011   0.151   0.015   0.013   0.013
T 50       0.010   0.141   0.012   0.013   0.013
λ = -1
T 25       0.011   0.163   0.016   0.013   0.013
T 50       0.010   0.151   0.013   0.013   0.013
increase with the size of the period the dummy spans. This is problematic since period dummies are often postulated in models, as structural breaks do exist. In this Section a model with a mean shift mid-sample will be considered. The DGP is specified as:

y_t = β + γ1_{t<t_a} + v_t,  v_t ∼ N(0, σ²),    (17)

while the practitioner is unsure of the end point of the structural break, and posits a model with two dummy variables, d_{1,t} = 1_{t<t_a}, which is correctly specified, and d_{2,t} = 1_{t_a+1<t<t_a+8}:7

y_t = β + γd_{1,t} + δd_{2,t} + u_t.    (18)
(18)
Hence again, as in the impulse dummy case, the mean-shift is noticed, and the
relevant regressors included in the GUM.
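The experiment in (17)-(18) can be sketched as follows, using rescaled exp(−σ̂²/2) weights; the break point t_a, λ = −0.5, seed and replication count are illustrative choices, not the paper's exact design.

```python
import numpy as np

# Period-dummy DGP (17), GUM (18): model-average beta with rescaled weights.
rng = np.random.default_rng(2)
T, beta, lam, sigma_v = 25, 1.0, -0.5, 0.1
gamma, t_a = lam * np.sqrt(T), 12

t = np.arange(T)
d1 = (t < t_a).astype(float)                        # correctly specified break
d2 = ((t > t_a + 1) & (t < t_a + 8)).astype(float)  # mistaken extra dummy
cols = {"beta": np.ones(T), "gamma": d1, "delta": d2}

biases = []
for _ in range(300):
    y = beta + gamma * d1 + rng.normal(0.0, sigma_v, T)
    c, b_hat = {}, {}
    for mask in range(1, 8):                        # all non-empty submodels
        names = [n for i, n in enumerate(cols) if mask & (1 << i)]
        X = np.column_stack([cols[n] for n in names])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        c[mask] = np.exp(-(resid @ resid / T) / 2.0)
        if "beta" in names:
            b_hat[mask] = coef[names.index("beta")]
    beta_models = [m for m in c if m & 1]           # models with an intercept
    total = sum(c[m] for m in beta_models)          # rescale over N_beta, as (4)
    biases.append(sum(c[m] / total * b_hat[m] for m in beta_models) - beta)

mean_bias = float(np.mean(biases))  # clearly negative, echoing Table 6
```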
Table 6 shows the deteriorating bias performance of rescaled model averaging as the true γ coefficient gets larger; when λ = 0, averaging produces a bias indistinguishable from that of the GUM, or from model selection.8 However, with λ moved only slightly to −0.05, there is bias nearing 10% of the coefficient

7 It is worth noting that the results from this experiment are very similar to those without the irrelevant dummy included.

8 After the horrendous biases shown in Section 3.1, non-rescaled model averaging is not reported here. All simulations relating to longer dummies showed bias of a similar size to that in Section 3.1.
Table 6: Bias on β coefficient from various modelling strategies in the period dummies Monte Carlo simulation. Based on 1,000 replications.

           GUM      MA R     Lib      Cons
λ = 0
T 25      -0.001   -0.001    0.000    0.000
T 50      -0.001   -0.001    0.000   -0.001
λ = -0.05
T 25      -0.001   -0.081   -0.001   -0.002
T 50      -0.001   -0.101   -0.001   -0.001
λ = -0.25
T 25      -0.001   -0.377    0.001    0.000
T 50      -0.001   -0.419   -0.001   -0.001
λ = -0.5
T 25      -0.001   -0.606    0.001    0.000
T 50      -0.001   -0.409   -0.001   -0.001
λ = -0.75
T 25      -0.001   -0.598    0.001    0.000
T 50      -0.001   -0.137   -0.001   -0.001
λ = -1
T 25      -0.001   -0.418    0.001    0.000
T 50      -0.001   -0.020   -0.001   -0.001
Table 7: MSFE from using various modelling strategies in the period dummies Monte Carlo simulation. Based on 1,000 replications.

           GUM     MA R    Lib     Cons
λ = 0
T 25       0.013   0.011   0.014   0.014
T 50       0.011   0.010   0.013   0.013
λ = -0.05
T 25       0.013   0.017   0.015   0.015
T 50       0.011   0.020   0.013   0.013
λ = -0.25
T 25       0.013   0.148   0.014   0.014
T 50       0.011   0.185   0.013   0.013
λ = -0.5
T 25       0.013   0.370   0.014   0.014
T 50       0.011   0.177   0.013   0.013
λ = -0.75
T 25       0.013   0.362   0.014   0.014
T 50       0.011   0.029   0.013   0.013
λ = -1
T 25       0.013   0.182   0.014   0.014
T 50       0.011   0.011   0.013   0.013
size. When λ = −0.5, the bias is 60% of the coefficient size when T = 25, and 40% when T = 50. As λ nears −1, the bias reduces as the effect of the weights takes over, the SIC attributing a very low weight to models that exclude the relevant dummy; however, when T = 25, the bias is still horrendous at −0.418. The bias on γ (not reported) follows the pattern from earlier; it is invariant to λ, hence relatively worse for small structural breaks. The favourable performance of model selection in this scenario can be gleaned from Table 6, almost always producing bias not noticeably different from the GUM, which it must encompass.
Table 7 shows the much worsened forecast performance of the rescaled averaged model when dummies are longer than single-period impulses. While averaging performs well if the dummy is irrelevant (λ = 0), even as λ registers a small value the MSFE increases rapidly. When the break is just about noticeable at λ = −0.25, the MSFE is already at least 15 times the size of σ². For T = 25, the MSFE keeps rising, reporting values of over 0.35 for λ = −0.5 and −0.75, although for T = 50, there is a decrease down to an MSFE of 0.029, still around 3 times the DGP error variance, when λ = −0.75.
These results for a constant and dummies generalise to models with regressors. A Monte Carlo was run replacing the constant in (17) and (18) by two mean-zero vectors of Normally distributed random numbers with coefficients β_1 and β_2, where β_1 = β_2 = 1, hence both are relevant; all else remained as before. So the DGP is:

y_t = β_1 X_{1,t} + β_2 X_{2,t} + γd_{1,t} + v_t,  v_t ∼ N(0, σ²),    (19)

and the GUM is:

y_t = β_1 X_{1,t} + β_2 X_{2,t} + γd_{1,t} + δd_{2,t} + u_t.    (20)
Only the bias on the β_1 coefficient is reported here, in Table 8; the MSFE when T = 25 is always around 4 times the DGP error variance, when T = 50 always at least twice the size, though when T = 75 the MSFE is quite competitive when |λ| > 0.25; before that it is up to three times the error variance. Table 8 reveals bias when model averaging is used in this more general context, holding fairly constant at around 10% of the true coefficient size for all values of λ and T. The bias on β_2 shows similar patterns. Other dummies commonly used in the literature are what might be called intermittent dummies: those that take unity in a particular state of the world, perhaps if there is industrial action in a particular year, or, in cross section, if a country is located in sub-Saharan Africa. As another experiment, the two dummies specified in (19) and (20), which could be seen to be somewhat arbitrary and unrealistic, were replaced with dummies from growth regressions (African and Latin American dummies from Sala-i-Martin 1997a, Sala-i-Martin 1997b), and considerably worse biases and MSFEs resulted (results not reported).
These results are quite stunning; they show that even if a practitioner has located a clear structural break in a dataset for a parameter of interest, or accommodated a salient feature with an intermittent dummy variable, model averaging based on that dataset still has the potential to produce very distorted
Table 8: Bias on β_1 coefficient in (20). Based on 1,000 replications.

           GUM     MA R    Lib      Cons
λ = 0
T 25       0.000   0.064   -0.001   -0.001
T 50       0.000   0.113   -0.001   -0.001
T 75       0.000   0.078    0.000    0.000
λ = -0.05
T 25       0.000   0.080   -0.001   -0.001
T 50       0.000   0.140   -0.001   -0.001
T 75       0.000   0.110    0.000    0.000
λ = -0.25
T 25       0.000   0.130   -0.001   -0.001
T 50       0.000   0.212   -0.001   -0.001
T 75       0.000   0.168    0.000    0.000
λ = -0.5
T 25       0.000   0.128   -0.001   -0.001
T 50       0.000   0.157   -0.001   -0.001
T 75       0.000   0.095    0.000    0.000
λ = -0.75
T 25       0.000   0.086   -0.001   -0.001
T 50       0.000   0.125   -0.001   -0.001
T 75       0.000   0.086    0.000    0.000
λ = -1
T 25       0.000   0.072   -0.001   -0.001
T 50       0.000   0.124   -0.001   -0.001
T 75       0.000   0.086    0.000    0.000
results both for inference and forecasting. Model selection, on the other hand, reports much better results in these contexts, showing negligible biases and MSFEs. This must be concerning for the empirical work that has been done using model averaging, in particular by growth theorists, as in this field of research many dummy variables are often specified in datasets: Doppelhofer et al. (2000) included 8 dummy variables in their 32-variable, 98-country dataset, and Hoover & Perez (2004) had 7 in the 36-variable, 107-country dataset they considered. Just how biased are the dummies for African nations or former Communist countries, and the other regressors more generally, given the inclusion of these dummies? It might be argued that in larger models such as the growth ones referred to above, the effects of dummy variables are drowned out; the next Section investigates this.
3.3 Larger Models
To show the relevance of the earlier, simpler models for empirical work,
a Monte Carlo experiment of 1,000 replications has been designed to consider
the effect of dummies when there are many regressors and a number of dummy
variables. A 10-variable dataset with 3 dummies, two of which were relevant,
was constructed and used. The DGP is thus:
yt = β3 X3,t + β4 X4,t + β5 X5,t + β6 X6,t + β7 X7,t + δ1 d1,t + δ2 d2,t + vt ,   vt ∼ N(0, σ²),   (21)

where the coefficients β3 , . . . , β7 are set to deliver population t-values of 4, 4, 8, 8 and 8 respectively.
The GUM specified in this case has two irrelevant variables and one irrelevant
dummy:

yt = β1 X1,t + β2 X2,t + β3 X3,t + β4 X4,t + β5 X5,t + β6 X6,t + β7 X7,t + δ1 d1,t + δ2 d2,t + δ3 d3,t + ut .   (22)
The Monte Carlo experiment again considered a range of λ and T values, with
σ² = 1 specified. Due to space constraints, only the results from model averaging with rescaled weights pertaining to four of the estimated coefficients in
(22) are shown below: β2 , an insignificant regressor (Table 9); β4 , a significant
regressor (Table 10); δ1 , the significant dummy (Table 11); and δ3 , the insignificant
dummy (Table 12). How strong the bias is on non-reported variables is not relevant;
noticeable bias on any coefficient of interest is enough to warrant concern about
the suitability of model averaging as a modelling technique.
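The experimental design of (21) and (22) can be sketched in a few lines. The coefficient and dummy magnitudes below are illustrative assumptions, not the paper's exact settings, and only plain OLS on the GUM is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def gum_bias(T=50, n_rep=1000):
    """Simulate a DGP like (21) and estimate the GUM (22) by OLS,
    returning the mean bias of each of the ten GUM coefficients.
    Coefficient values here are illustrative assumptions."""
    beta = np.array([0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0])  # X1, X2 irrelevant
    delta = np.array([2.0, 2.0, 0.0])                      # d1, d2 relevant, d3 irrelevant
    est = np.zeros((n_rep, 10))
    for r in range(n_rep):
        X = rng.standard_normal((T, 7))
        d = np.zeros((T, 3))
        d[T // 2, 0] = d[T // 4, 1] = d[3 * T // 4, 2] = 1.0  # impulse dummies
        y = X @ beta + d @ delta + rng.standard_normal(T)     # v_t ~ N(0, 1)
        Z = np.hstack([X, d])                                 # the GUM regressors
        est[r] = np.linalg.lstsq(Z, y, rcond=None)[0]
    return est.mean(axis=0) - np.concatenate([beta, delta])

bias = gum_bias()
print(np.round(bias, 3))  # OLS on the GUM is unbiased, so these are near zero
```

Replacing the single OLS fit with a weighted average over the 2¹⁰ sub-models would correspond to the model-averaging columns of the tables below.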
Considering first Table 9, the bias on irrelevant coefficients due to
rescaling weights can be observed; the irrelevant coefficient registers values of
up to around 0.3 in absolute terms, although these values decrease as |λ| and T
increase. Turning to Table 10, when T = 50 one can see that there is considerable
bias relative to the true value of β4 (the true value is given in the first column,
entitled DGP) induced by the inclusion of these dummy variables. Indeed this bias
increases with λ: from about 35% of the size of the true value when λ = 0, to almost
60% when λ = −1. When T = 100, the bias is around 20% of the true coefficient
size for all values of λ. Model selection in the form of PcGets delivers very
small bias here, as can be ascertained from the third and fourth columns in
each box.

Table 9: Bias on β2 coefficient from using various modelling strategies in larger
models Monte Carlo simulation. Based on 1,000 replications.

        λ = 0                              λ = -0.25
        GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
T 25    0.004  -0.296  -0.052  -0.017      0.004  -0.272  -0.023  -0.014
T 50   -0.006  -0.212   0.001   0.002     -0.006  -0.158   0.002   0.001
T 75   -0.007  -0.192  -0.005   0.001     -0.007  -0.156  -0.001   0.002
T 100  -0.004  -0.107  -0.020  -0.003     -0.004  -0.077  -0.026  -0.004

        λ = -0.5                           λ = -1
        GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
T 25    0.004  -0.247  -0.023  -0.014      0.004  -0.201  -0.023  -0.014
T 50   -0.006  -0.107   0.002   0.001     -0.006  -0.033   0.002   0.001
T 75   -0.007  -0.124  -0.001   0.002     -0.007  -0.086  -0.001   0.002
T 100  -0.004  -0.049  -0.026  -0.004     -0.004  -0.014  -0.026  -0.004
Turning next to Table 11, we consider the bias on the coefficient δ1 , on the
first dummy. It can be seen that the bias on this coefficient, apart from the
T = 100 case, does not diminish as T increases. This is clearly more problematic for
inference on smaller values of δ1 , such as when λ = −0.25, where the bias is
about 60% of the true coefficient size. When T = 75, the bias is not so strong,
ranging from 25% of the true coefficient to 6% as λ increases. It is reassuring
that when T = 100, as the dummies become more conspicuous, the bias
disappears and is negligible for the most part. However, this is not the case
for the irrelevant dummy. From Table 12, there is significant bias in
all cases for this irrelevant variable. Both Table 11 and Table 12 also show the
smaller bias that comes from using simply the GUM, or model selection; PcGets
in this situation selects the relevant dummy on every occasion, and retains the
irrelevant dummy on about 6% of replications for the liberal strategy, and about
1.5% of the time when the strategy is conservative.9
Thus via Monte Carlo simulation it has been indicated that at least one of the
predictions of Section 3.1, noticeable bias, holds in larger models. The MSFE
was competitive in the model considered in this Section; however, given the
results of Section 3.2, which showed that the MSFE deteriorates when dummies
taking unity for more than one observation are specified, we contend it is only
the specification of impulse dummies here that yields a competitive
MSFE. Thus outliers and structural breaks, two issues that can beset empirical
work in economics, are shown to cause serious difficulties for model averaging.
9 It might also be noted that the average coefficient values, hence biases, in Tables 9–12 are
calculated only over the replications in which the variable appears in the final model; given
that on the other 95% of occasions model selection has delivered a correct zero, averaging
over all replications would give a much smaller average value.
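The point of footnote 9, that averaging only over retentions overstates the typical contribution of an irrelevant variable, can be illustrated with a minimal sketch; the 5% t-test selection rule and the normal approximation are assumptions for illustration, not the PcGets algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# An irrelevant variable's OLS estimate is roughly N(0, 1/T); a t-test at
# the 5% level retains it about 5% of the time.
n_rep, T = 100_000, 50
est = rng.standard_normal(n_rep) / np.sqrt(T)
retained = np.abs(est) * np.sqrt(T) > 1.96

# Average absolute coefficient over retentions only, versus averaging in
# the correct zeros from the replications where the variable was dropped.
conditional = np.abs(est[retained]).mean()
unconditional = np.abs(np.where(retained, est, 0.0)).mean()
print(f"retained {retained.mean():.1%}, "
      f"conditional {conditional:.3f}, unconditional {unconditional:.3f}")
```

Here the conditional average is roughly twenty times the unconditional one, which is why the biases reported for selection in Tables 9–12 overstate its average cost.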
Table 10: Bias on β4 coefficient from using various modelling strategies in larger
models Monte Carlo simulation. Based on 1,000 replications.

              λ = 0                            λ = -0.25
DGP    T      GUM     MA R    Con     Lib      GUM     MA R    Con     Lib
0.676  25     0.006  -0.142   0.020   0.013    0.006  -0.149   0.024   0.016
0.516  50     0.008  -0.178   0.014   0.009    0.008  -0.212   0.015   0.010
0.433  75     0.001  -0.089   0.010   0.007    0.001  -0.073   0.011   0.007
0.381  100    0.000  -0.082   0.009   0.005    0.000  -0.080   0.008   0.005

              λ = -0.5                         λ = -1
DGP    T      GUM     MA R    Con     Lib      GUM     MA R    Con     Lib
0.676  25     0.006  -0.156   0.024   0.016    0.006  -0.174   0.024   0.016
0.516  50     0.008  -0.244   0.015   0.010    0.008  -0.292   0.015   0.010
0.433  75     0.001  -0.059   0.011   0.007    0.001  -0.040   0.011   0.007
0.381  100    0.000  -0.079   0.008   0.005    0.000  -0.076   0.008   0.005
Table 11: Bias on δ1 coefficient from using various modelling strategies in larger
models Monte Carlo simulation. Based on 1,000 replications.

        λ = 0                              λ = -0.25
T       GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
25      0.029  -0.052  -0.164  -0.047      0.029  -0.082  -0.040  -0.018
50     -0.079   0.900   0.123   0.065     -0.079   0.949   0.022   0.007
75     -0.024  -0.544  -0.088  -0.042     -0.024  -0.536  -0.018  -0.008
100     0.013   0.124   0.001  -0.014      0.013   0.103  -0.009  -0.004

        λ = -0.5                           λ = -1
T       GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
25      0.029  -0.111  -0.040  -0.018      0.029  -0.160  -0.040  -0.018
50     -0.079   0.993   0.022   0.007     -0.079   1.051   0.022   0.007
75     -0.024  -0.529  -0.018  -0.008     -0.024  -0.521  -0.018  -0.008
100     0.013   0.084  -0.009  -0.004      0.013   0.060  -0.009  -0.004
Table 12: Bias on δ3 coefficient from using various modelling strategies in larger
models Monte Carlo simulation. Based on 1,000 replications.

        λ = 0                              λ = -0.25
T       GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
25     -0.016  -1.063  -0.166  -0.080     -0.016  -1.079  -0.144  -0.082
50     -0.009  -0.842  -0.162  -0.069     -0.009  -0.770  -0.142  -0.068
75     -0.055   0.341   0.026  -0.031     -0.055   0.297   0.011  -0.036
100    -0.008   1.195  -0.080  -0.035     -0.008   1.195  -0.033  -0.043

        λ = -0.5                           λ = -1
T       GUM     MA R    Con     Lib        GUM     MA R    Con     Lib
25     -0.016  -1.092  -0.144  -0.082     -0.016  -1.108  -0.144  -0.082
50     -0.009  -0.701  -0.142  -0.068     -0.009  -0.595  -0.142  -0.068
75     -0.055   0.258   0.011  -0.036     -0.055   0.211   0.011  -0.036
100    -0.008   1.196  -0.033  -0.043     -0.008   1.193  -0.033  -0.043
4 Conclusions
In this paper, Monte Carlo simulations have been used to indicate worrisome
properties of model averaging, building on simple analytical models that incorporate
empirically relevant characteristics such as outliers and structural breaks.
The simulations have suggested that these problems extend to larger datasets.
Outliers and structural breaks do not appear to have been considered by model-averaging
investigators thus far.
Model averaging has been analysed, and it has been shown that averaging
without rescaling weights so that they sum to unity for each regressor leads to
large biases on relevant regressors, and gives large mean squared forecast errors
for one-step forecasts, refuting the arguments of Buckland et al. (1997) on model
averaging. Rescaling weights has been shown to improve the performance
of model averaging on both counts; however, in this situation irrelevant
regressors have been argued to be more biased, an argument supported by Monte
Carlo simulation. Furthermore, the results of the Monte Carlo simulations
have suggested that model averaging is inferior to the GUM and to model selection
for forecasting in the presence of period and intermittent dummy variables;
this overturns the assertions of Raftery et al. (1997) relating to the forecasting
performance of model averaging.
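A minimal sketch of averaging with rescaled weights, for a toy three-regressor GUM in which only the first variable is relevant; the BIC-based weighting and all settings are illustrative assumptions, not the paper's design:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

T = 50
X = rng.standard_normal((T, 3))
y = 1.0 * X[:, 0] + rng.standard_normal(T)  # X2, X3 irrelevant

# Enumerate every sub-model, recording its BIC and zero-padded coefficients.
models = [m for k in range(1, 4) for m in combinations(range(3), k)]
coefs = np.zeros((len(models), 3))
included = np.zeros((len(models), 3), dtype=bool)
bic = np.zeros(len(models))
for i, m in enumerate(models):
    Z = X[:, m]
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    rss = ((y - Z @ b) ** 2).sum()
    bic[i] = T * np.log(rss / T) + len(m) * np.log(T)
    coefs[i, list(m)] = b
    included[i, list(m)] = True

w = np.exp(-0.5 * (bic - bic.min()))
w /= w.sum()

# Rescale each regressor's weights so they sum to unity over the models
# that actually include it, then average its coefficient with those weights.
avg = np.zeros(3)
for j in range(3):
    wj = w * included[:, j]
    avg[j] = (wj / wj.sum()) @ coefs[:, j]
print(np.round(avg, 3))
```

Because the rescaling inflates the weight on the few models containing an irrelevant regressor, its averaged coefficient is pulled away from zero relative to an unrescaled average: the mechanism behind the biases on β2 and δ3 in the tables above.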
This would be a depressing outcome for the future possibilities of
macroeconomic modelling, were it not for the performance of model selection
in the contexts analysed; relevant variables are almost always retained, giving
a high probability of uncovering the DGP in each situation and providing
negligible bias throughout. This follows on from the growing number
of studies that have shown the impressive capabilities of model selection, and
specifically PcGets, in a range of empirical modelling contexts (see e.g. Hoover
& Perez 2004, Castle 2004). Hence we can conclude on a positive note: there
is indeed a better alternative to model averaging for macroeconomic modelling.
References

Buckland, S.T., K.P. Burnham & N.H. Augustin (1997), ‘Model selection: An
integral part of inference’, Biometrics 53, 603–618.

Castle, J. (2004), Evaluating PcGets and RETINA as automatic model selection
algorithms. Unpublished paper, Economics Department, Oxford University.

Doornik, Jurgen A., David F. Hendry & Bent Nielsen (1998), ‘Inference in
cointegrating models: UK M1 revisited’, Journal of Economic Surveys
12(5), 533–572.

Doppelhofer, Gernot, Ronald I. Miller & Xavier Sala-i-Martin (2000), Determinants
of long-term growth: A Bayesian Averaging of Classical Estimates
(BACE) approach, Technical report, National Bureau of Economic Research, Inc.

Eklund, J. & S. Karlsson (2004), Forecast combination and model averaging
using predictive measures. Unpublished paper, Stockholm School of Economics.

Fernandez, C., E. Ley & M.F.J. Steel (2001), ‘Model uncertainty in cross-country
growth regressions’, Journal of Applied Econometrics 16(5), 563–576.

Hendry, David F. (2001), ‘Modelling UK inflation, 1875–1991’, Journal of Applied
Econometrics 16(3), 255–275.

Hendry, David F. & Hans-Martin Krolzig (2005), ‘The properties of automatic
Gets modelling’, Economic Journal 115(502), C32–C61.

Hendry, D.F. (1995), Dynamic Econometrics, Oxford University Press, Oxford.

Hendry, D.F. & M.P. Clements (2004), ‘Pooling of forecasts’, Econometrics
Journal 7, 1–31.

Hoover, K.D. & S.J. Perez (1999), ‘Data mining reconsidered: Encompassing
and the general-to-specific approach to specification search’, Econometrics
Journal 2, 167–191.

Hoover, Kevin D. & Stephen J. Perez (2004), ‘Truth and robustness in cross-country
growth regressions’, Oxford Bulletin of Economics and Statistics
66(5), 765–798.

Koop, Gary & Simon Potter (2003), Forecasting in large macroeconomic panels
using Bayesian Model Averaging, Staff Report 163, Federal Reserve Bank
of New York.

Perez-Amaral, Teodosio, Giampiero M. Gallo & Halbert White (2003), ‘A flexible
tool for model building: the Relevant Transformation of the Inputs
Network Approach (RETINA)’, Oxford Bulletin of Economics and Statistics
65(s1), 821–838.

Raftery, A.E., D. Madigan & J.A. Hoeting (1997), ‘Bayesian model averaging for
linear regression models’, Journal of the American Statistical Association
92(437), 179–191.

Sala-i-Martin, Xavier X. (1997a), ‘I just ran two million regressions’, American
Economic Review 87(2), 178–183.

Sala-i-Martin, Xavier X. (1997b), I just ran four million regressions, Technical
report, National Bureau of Economic Research, Inc.