Problems in Model Averaging with Dummy Variables

David F. Hendry and J. James Reade

May 3, 2005

Abstract

Model averaging is widely used in empirical work, and is proposed as a solution to model uncertainty. This paper provides a range of empirically relevant contexts in which model averaging performs poorly in terms of coefficient bias and forecast errors, namely when outliers and structural breaks exist in datasets. Monte Carlo simulations support these assertions and suggest that they apply in more complicated models than the simple ones considered here. It is argued that the failure to select relevant variables over irrelevant ones is precisely the cause of the poor performance: weight ascribed to irrelevant components biases the weight attributable to relevant ones. Within this context, the superior performance of model selection algorithms is indicated.

1 Introduction

Model averaging, the practice of taking a weighted average of a number of regression models, is widely used, and is proposed as a method for accommodating model uncertainty in statistical analysis. However, while averaging can be shown to have desirable properties in a stationary world (see Raftery, Madigan & Hoeting 1997), extension to the non-stationary world presents difficulties. In this paper the performance of model averaging is compared to that of model selection in the empirically relevant situation where dummy variables form part of the data generating process.

In Section 2, model averaging is introduced, various methods of implementing it are touched upon, and the use of model averaging in the empirical literature is discussed. In Section 3 a number of simple models are introduced to highlight problems with model averaging in particular empirically relevant situations, before Monte Carlo simulations are used firstly to support the simple models and their conclusions, and then to suggest the problems exist in a more general context. Section 4 concludes.

2 Model Averaging

Model averaging can be carried out in both the classical statistical framework (see Buckland, Burnham & Augustin 1997) and the Bayesian paradigm (see Raftery et al. 1997). In the empirical literature, the latter has become much more common as computing power has increased. Examples in the growth literature include Fernandez, Ley & Steel (2001), who use a pure Bayesian methodology with non-informative priors, and Doppelhofer, Miller & Sala-i-Martin (2000), who calculate weights in a Bayesian manner but average over classical OLS estimates, while Koop & Potter (2003) use Bayesian model averaging to forecast US quarterly GDP, and Eklund & Karlsson (2004) forecast Swedish inflation based on predictive Bayesian densities.

When carrying out model averaging, practitioners state a set of K variables considered to have explanatory power for the variable of interest. These variables then form a set M of L models, {M_1, ..., M_L} ∈ M. These models could be any particular type of statistical model; here, following Raftery et al. (1997) and the other empirical studies mentioned above, linear regression models are considered. Thus each one is of the form:

    y_l = X_l β_l + u_l = β_1^(l) x_1 + ... + β_K^(l) x_K,

where zeros in the β_l vector signify that a particular regressor is not included in model l. The models in the set M are usually every subset of the K variables specified in the initial dataset, or some subset of these models chosen using some kind of selection algorithm.
Raftery et al. (1997) advocate a Bayesian selection algorithm based on the posterior density of each individual model. However, the inability to specify a specific prior for each variable in the 2^K models that result from considering every subset of the K variables forces the use of non-informative priors, and such priors mean that selection algorithms of this kind favour larger models (see Eklund & Karlsson 2004). Buckland et al. (1997) appear to suggest that model selection should not be carried out at all. [Footnote 1: It is not possible to challenge this claim in the small models considered here, because model selection algorithms tend to choose just one model, hence leaving nothing to average over and leaving the comparison as one between model averaging and model selection per se. It is hoped to investigate this claim in future research.]

In conventional linear regression analysis, the mean of the parameters of interest conditional on the explanatory variables is usually reported, and as such one might expect the weighted average of this conditional mean over the L models, say

    β^a = Σ_{l=1}^{L} w_l β_l,    (1)

where w_l is the weight for model l, to be reported in model averaging, hence giving an output from the process of:

    y = β^a X + u^a.    (2)

Bayesian model averagers, such as Fernandez et al. (2001), discuss the probability of including any particular regressor in the averaged model as its importance, refraining from reporting any coefficients or model of the form (2) in their averaging analysis, since, as Doppelhofer et al. (2000) point out, Bayesian statisticians reject the idea of a single, true estimate, believing instead that each parameter has a true distribution; hence Fernandez et al. (2001) produce charts of distribution functions for each parameter. At this point a debate about the existence or not of a true specification can be entered into; Hoover & Perez (2004, pp. 767-769) summarise this well. Buckland et al. (1997) suggest reporting the averaged coefficients as in (1), and they see the sum of weights as a measure of the importance of each regressor.

This introduces the debate over how the models are weighted in the combination, which manifests itself on two levels: firstly how to construct the weights, and secondly which weighting criterion to use. Considering the first issue, for any particular weighting criterion, say C_l, the weighting method might be:

    w_l = C_l / Σ_{i=1}^{L} C_i.    (3)

This ensures that Σ_{l=1}^{L} w_l = 1. However, no variable appears in every model, meaning that the sum of the weights applied to a particular variable will not be unity, and as such its coefficient will be biased downwards. [Footnote 2: Taking the simplest 2-variable model illustrates this: 4 models result, and each variable appears in only two of them. Given a non-zero weighting for each model, it cannot be the case that the sum of weights on either variable equals unity.]

An alternative weighting method that accounts for this downward bias is to rescale the weights for each regressor so that they sum to unity over the models in which it appears. Thus, where N_k ⊂ M is the set of models in M that contain regressor k, the weight for model l might be:

    w_l = C_l / Σ_{i∈N_k} C_i.    (4)

Hence the weights for any particular regressor will sum to unity. Doppelhofer et al. (2000) advocate this rescaled weighting for reporting coefficients in their averaged model, stating that the coefficients produced by this method would be the ones used in forecasting and for analysing marginal effects. Both weight-construction methods are considered in this paper.

In terms of the weighting criterion C_l, in the Bayesian context each model is weighted by its posterior probability, which is given by:

    Pr(M_l | X) = Pr(M_l) Pr(X | M_l) / Σ_{k=1}^{L} Pr(M_k) Pr(X | M_k).    (5)
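To make the contrast between the two weight constructions (3) and (4) concrete, the following minimal sketch computes both for a toy set of models; the criterion values and the inclusion pattern are hypothetical numbers chosen purely for illustration, not taken from any study cited here.

    import numpy as np

    # Hypothetical criterion values C_l for L = 4 models built from K = 2
    # regressors; row l of `includes` flags which regressors enter model l.
    C = np.array([0.2, 0.5, 0.4, 0.9])
    includes = np.array([[0, 0],   # neither regressor
                         [1, 0],   # regressor 1 only
                         [0, 1],   # regressor 2 only
                         [1, 1]])  # both regressors

    # Equation (3): normalise over all models, so the weights sum to one,
    # but the weights attached to any single regressor sum to less than one.
    w = C / C.sum()
    print("sum over all models:", w.sum())
    for k in range(2):
        print(f"weight on regressor {k + 1} under (3):", w[includes[:, k] == 1].sum())

    # Equation (4): rescale over N_k, the set of models containing regressor k,
    # so each regressor's weights sum to exactly one.
    for k in range(2):
        in_Nk = includes[:, k] == 1
        w_k = C[in_Nk] / C[in_Nk].sum()
        print(f"weights on regressor {k + 1} under (4): {w_k}, sum = {w_k.sum():.0f}")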
In non-Bayesian contexts, information criteria might be used instead, such as the Akaike or Schwarz information criteria. In this paper, following Buckland et al. (1997), an approximation to the Schwarz information criterion (SIC) is employed, which uses exp(−σ̂²_{v,l}/2), almost the same as the SIC-based weight for the small numbers of parameters considered here, where σ̂²_{v,l} denotes the residual variance of the l-th model:

    σ̂²_{v,l} = (1/T) Σ_{t=1}^{T} v̂_t².

Estimator averaging, therefore, uses the weights given by:

    w_l = exp(−σ̂²_{v,l}/2) / Σ_{l=1}^{L} exp(−σ̂²_{v,l}/2).    (6)

Finally, out-of-sample methods might be used for weighting: Eklund & Karlsson (2004) suggest using predictive Bayesian densities, while Hendry & Clements (2004) discuss minimising the mean squared forecast error of the averaged model as a criterion to construct weights.

The justification for using SIC-based weights in this paper is that the Schwarz information criterion does not discriminate strongly between models differing by a regressor or two, a property in keeping with Bayesian concern for model uncertainty. Further, the SIC is an approximation to the Bayes factor. Thus the analytical results and Monte Carlo simulation results, it is argued, can be applied to the more widely used Bayesian model averaging.

Model averaging is just one way of carrying out a data-focussed macroeconomic modelling exercise. Another method is general-to-specific model selection (see Hoover & Perez 1999, Hendry & Krolzig 2005, Perez-Amaral, Gallo & White 2003), whereby a general model is posited to include all possible factors contributing to the determination of the variable of interest, and a process of reduction is then carried out to leave the practitioner with the most parsimonious congruent and encompassing econometric model.

3 The bias when dummy variables are included

3.1 Orthogonal model with irrelevant dummy

We consider the simplest location-scale data generation process (DGP) in (7) with a transient mean shift, namely:

    y_t = β + γ 1_{t=t_a} + v_t, where v_t ~ IN[0, σ²_v],    (7)

where 1_{t=t_a} denotes a zero-one observation-specific indicator, unity at observation t_a and zero otherwise. The parameter of interest is β, and the forecast will be for y_{T+1}, 1-step ahead from the forecast origin T. We consider the empirically relevant case where γ = λ√T for a fixed constant λ (see e.g. Doornik, Hendry & Nielsen 1998), and neglect terms of O_p(T^{−1/2}) or smaller in the analytic derivations. The simulation illustration confirms their small impact on the outcomes.

The postulated model has an intercept augmented by adding one relevant and one irrelevant impulse dummy, denoted d_{1,t} = 1_{t=t_a} and d_{2,t} = 1_{t=t_b} respectively. This yields the general unrestricted model (GUM):

    y_t = β + γ d_{1,t} + δ d_{2,t} + u_t,    (8)

for t = 1, ..., T, where in the DGP δ = 0 and γ ≠ 0, the former holding in the sense that only one transient location shift actually occurred, although the investigator is unaware of that fact.
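As a concrete illustration of this setup, the sketch below generates one sample from (7), fits by least squares all 2³ = 8 subset models of (8) that the next subsection enumerates, and forms the weights (6). The break dates, the seed, and the bit-coded ordering of the subset models are our own illustrative choices, not values taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    T, beta, lam, sigma_v = 25, 1.0, -1.0, 0.1   # parameterisation of Section 3.1.3
    t_a, t_b = 10, 17                            # break dates: illustrative choices
    gamma = lam * np.sqrt(T)                     # transient shift scaled as in the text

    d1 = np.zeros(T); d1[t_a] = 1.0              # relevant impulse dummy
    d2 = np.zeros(T); d2[t_b] = 1.0              # irrelevant impulse dummy
    y = beta + gamma * d1 + sigma_v * rng.standard_normal(T)

    # Fit every subset of (intercept, d1, d2); bit j of `code` flags regressor j
    regressors = [np.ones(T), d1, d2]
    sigma2_hat, beta_hat = [], []
    for code in range(8):
        cols = [regressors[j] for j in range(3) if code >> j & 1]
        if cols:
            X = np.column_stack(cols)
            b, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ b
            beta_hat.append(b[0] if code & 1 else 0.0)  # intercept estimate, 0 if excluded
        else:
            resid = y                                   # the empty model
            beta_hat.append(0.0)
        sigma2_hat.append(resid @ resid / T)

    # Equation (6): weights proportional to exp(-sigma_hat^2 / 2), then (1)
    w = np.exp(-0.5 * np.array(sigma2_hat))
    w /= w.sum()
    print(f"averaged intercept = {w @ np.array(beta_hat):.3f}, true beta = {beta}")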
Equation (8) is the starting point for model averaging, as it defines the set of variables from which all possible models are derived; it is also the starting point for model selection, which then follows a process of reduction to arrive at the most parsimonious congruent encompassing econometric model (see Hendry 1995, ch. 9). For model averaging, a regression is run on every subset of the regressors in (8), and the following 2³ = 8 possible models result, each defined by its zero restrictions:

    M_0: β = 0, γ = 0, δ = 0    M_1: γ = 0, δ = 0    M_2: β = 0, γ = 0
    M_3: β = 0, δ = 0           M_4: γ = 0           M_5: δ = 0
    M_6: β = 0                  M_7: (unrestricted)    (9)

This yields eight estimated models, all using least squares, where estimators are denoted by the subscript of their model number:

    M_0: ŷ_t = 0                                M_1: ŷ_t = β̂_(1)
    M_2: ŷ_t = δ̂_(2) d_{2,t}                    M_3: ŷ_t = γ̂_(3) d_{1,t}
    M_4: ŷ_t = β̂_(4) + δ̂_(4) d_{2,t}            M_5: ŷ_t = β̂_(5) + γ̂_(5) d_{1,t}
    M_6: ŷ_t = γ̂_(6) d_{1,t} + δ̂_(6) d_{2,t}    M_7: ŷ_t = β̂_(7) + γ̂_(7) d_{1,t} + δ̂_(7) d_{2,t}    (10)

3.1.1 Deriving the weights and estimates

For the estimates of β, using least squares we find that:

    β̂_(0) = β̂_(2) = β̂_(3) = β̂_(6) = 0,

    β̂_(1) = β̂_(4) = (1/T) Σ_{t=1}^{T} y_t = (1/T) Σ_{t=1}^{T} (β + γ 1_{t=t_a} + v_t) ≃ β + γ/T = β + λ/√T,

    β̂_(5) = β̂_(7) = (1/(T−1)) Σ_{t≠t_a} y_t = (1/(T−1)) Σ_{t≠t_a} (β + γ 1_{t=t_a} + v_t) ≃ β.

Hence there are three possible outcomes for estimating the parameter of interest (neglecting sampling variation as second order):

    • β̂_i ≃ 0, when there is no intercept (M_0, M_2, M_3, M_6);
    • β̂_i ≃ β, when an intercept and d_{1,t} are included (M_5, M_7); and
    • β̂_i ≃ β + λ/√T, when an intercept, but no d_{1,t}, is included (M_1, M_4).

All the derivations of the weights follow the same formulation. First, for M_0, from (7):

    σ̂²_{v,0} = (1/T) Σ_{t=1}^{T} y_t² = (1/T) Σ_{t=1}^{T} (β + γ 1_{t=t_a} + v_t)²
             = (1/T) Σ_{t=1}^{T} (β² + γ² 1_{t=t_a} + v_t² + 2βγ 1_{t=t_a} + 2β v_t + 2γ 1_{t=t_a} v_t)
             = β² + σ̄²_v + 2β v̄ + (1/T)(γ² + 2βγ + 2γ v_{t_a})
             = β² + σ̄²_v + λ² + O_p(1/√T)
             ≃ β² + σ²_v + λ²,    (11)

where:

    σ̄²_v = (1/T) Σ_{t=1}^{T} v_t²  and  v̄ = (1/T) Σ_{t=1}^{T} v_t,

and the last line of (11) uses the asymptotic approximations:

    √T v̄ →_D N[0, σ²_v]  and  σ̄²_v →_P σ²_v.

Clearly β̂_(0) = 0 in M_0, yet its weight will be non-zero in (1). A similar approach for M_1 yields:

    σ̂²_{v,1} = (1/T) Σ_{t=1}^{T} (λ√T 1_{t=t_a} + v_t − λ/√T)² ≃ λ² + σ²_v,

since:

    β̂_(1) ≃ β + λ/√T.

Continuing through the remaining models delivers the complete set of approximate error variances:

    σ̂²_{v,0} ≃ β² + λ² + σ²_v;    σ̂²_{v,1} ≃ λ² + σ²_v;
    σ̂²_{v,2} ≃ β² + λ² + σ²_v;    σ̂²_{v,3} ≃ β² + σ²_v;
    σ̂²_{v,4} ≃ λ² + σ²_v;         σ̂²_{v,5} ≃ σ²_v;
    σ̂²_{v,6} ≃ β² + σ²_v;         σ̂²_{v,7} ≃ σ²_v.

The error variance σ²_v enters all 8 expressions, and β² and λ² each enter 4 times. Cumulating the estimates of β:

    β̃ ≃ (w_5 + w_7)β + (w_1 + w_4)(β + λ/√T)
      = (w_1 + w_4 + w_5 + w_7)β + (w_1 + w_4)λ/√T.    (12)

Simulation confirms the accuracy of these calculations for the mean estimates of β, even for T as small as 25 (where the number of parameters might matter somewhat).

From (12), the averaged coefficient will not equal the true coefficient so long as λ ≠ 0 and/or w_1 + w_4 + w_5 + w_7 < 1, which Σ_{l=1}^{L} w_l = 1 will imply in most cases. On the other hand, rescaling the weights will make w_1 + w_4 larger, and hence the bias induced by the λ/√T term will be greater. Rescaling will also mean that δ̃, the coefficient on the irrelevant regressor, receives greater weight, since the rescaling principle applies to all regressors, whereas previously, particularly if weights are viewed as reflecting the importance of a parameter, it received a low weighting.
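The cumulation in (12) can be traced numerically from the approximate error variances listed above. The sketch below does so for the parameterisation of the numerical example in Section 3.1.3 below; because it drops the O_p(T^{−1/2}) cross terms, its weights, and hence its output, are approximations and will not match sample-based or exact calculations precisely.

    import numpy as np

    beta, lam, sigma2_v, T = 1.0, -1.0, 0.01, 25   # values from Section 3.1.3

    # Approximate residual variances of the eight models, as derived above
    s2 = [beta**2 + lam**2 + sigma2_v,   # M_0: empty model
          lam**2 + sigma2_v,             # M_1: intercept only
          beta**2 + lam**2 + sigma2_v,   # M_2: d2 only
          beta**2 + sigma2_v,            # M_3: d1 only
          lam**2 + sigma2_v,             # M_4: intercept + d2
          sigma2_v,                      # M_5: intercept + d1
          beta**2 + sigma2_v,            # M_6: d1 + d2
          sigma2_v]                      # M_7: full model

    w = np.exp(-0.5 * np.array(s2))
    w /= w.sum()                         # weights (6) over all eight models

    # Equation (12): beta enters M_1, M_4 (biased) and M_5, M_7 (unbiased)
    beta_tilde = (w[5] + w[7]) * beta + (w[1] + w[4]) * (beta + lam / np.sqrt(T))
    print(f"approximate averaged beta = {beta_tilde:.3f}, true value = {beta}")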
3.1.2 Model averaging for forecasting stationary data

One justification for model averaging is as a method of 'forecast pooling', so we consider that aspect next. The outlier was one-off, so will not recur in the forecast period, yielding the 1-step forecasts:

    M_0: ŷ_{T+1,0} = 0    M_1: ŷ_{T+1,1} = β̂_(1)    M_2: ŷ_{T+1,2} = 0
    M_3: ŷ_{T+1,3} = 0    M_4: ŷ_{T+1,4} = β̂_(4)    M_5: ŷ_{T+1,5} = β̂_(5)
    M_6: ŷ_{T+1,6} = 0    M_7: ŷ_{T+1,7} = β̂_(7)

Letting:

    ỹ_{T+1|T} = Σ_{i=0}^{7} w_i ŷ_{T+1,i},

the forecast error is ṽ_{T+1|T} = y_{T+1} − ỹ_{T+1|T}, with mean:

    E[ṽ_{T+1|T}] = (w_0 + w_2 + w_3 + w_6)β − (w_1 + w_4)λ/√T.    (13)

Thus, forecasts can be considerably biased, for similar reasons that β̃ can be biased. The mean-square forecast error (MSFE) is:

    E[ṽ²_{T+1|T}] = E[(y_{T+1} − Σ_{i=0}^{7} w_i ŷ_{T+1,i})²]
                  = σ²_v + (w_0 + w_2 + w_3 + w_6)² β² + (w_1 + w_4)² λ²/T
                    − 2(w_0 + w_2 + w_3 + w_6)(w_1 + w_4) βλ/√T,    (14)

which for unrescaled weights is almost bound to be worse for large λ than the general unrestricted model (GUM) or any selected model, even allowing for estimation uncertainty.

3.1.3 Numerical example

Consider β = 1, λ = −1, σ̄²_v = 0.01 (i.e., an error standard deviation of 10% of β), and T = 25. Then, using the weights based on exp(−σ̂²_v/2) in (12):

    β̃ = (w_5 + w_7)β + (w_1 + w_4)(β + λ/√T) = 0.382 + (0.305)(1 − 1/5) = 0.626,    (15)

which is very biased for the true value of unity. The basic problem is that the weights increase too little for better models over poorer ones; in addition, the 'irrelevant' impulse creates several such poorer models in the averaging pool.

The bias is smaller under the second weighting methodology outlined in Section 2. Rewriting (15) as in (12):

    β̃ = (w_1 + w_4 + w_5 + w_7)β + (w_1 + w_4)λ/√T = 1 + 0.37754 × (−1/5) = 0.924,

since with rescaling the weights on the β coefficient, which appears only in models M_1, M_4, M_5 and M_7, sum to unity. The MSFE from (14) when forecasting without rescaling the weights is:

    E[ṽ²_{T+1|T}] = 0.118,

so the MSFE is almost 12-fold larger than the DGP error variance. It is hard to calculate the MSFE with the rescaled weights, because each weight depends on which coefficient it multiplies; one would expect the MSFE to be smaller when the weights are rescaled, since the bias on the coefficients is smaller in that case. Finally, considering the parameter values chosen for this example, γ = −5 is large when σ_v = 0.1, but outliers of magnitude √T often occur in practical models (see Hendry 2001, Doornik et al. 1998). In the Monte Carlo simulation, a range of values of λ and T is considered.

3.1.4 Monte Carlo evidence

A Monte Carlo simulation of 1,000 replications was run to assess the impact of sampling distributions on the bias derived in Section 3.1.1. Table 2 reports the average bias on the β (first panel) and γ (second panel) coefficients when various modelling strategies are used. The columns report the bias from the various strategies: GUM is the general unrestricted model, hence the regression run on the entire dataset originally specified (equation (8)); MA is model averaging; MA R is model averaging with rescaled weights; Lib is model selection using the PcGets Liberal strategy; and Cons is model selection using the PcGets Conservative strategy. [Footnote 3: For a description of these strategies, which amount to differing stringencies of tests used at various stages of the selection procedure, see Hendry & Krolzig (2005).] A range of T and λ values is considered, along with the parameterisation of the numerical example in Section 3.1.3.
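A sketch of the forecasting side of this experiment follows: it estimates by simulation the 1-step MSFE of the averaged model against that of the GUM, in the spirit of (13) and (14). The break dates, seed, and replication count are illustrative choices, so the printed figures will only approximate those reported below.

    import numpy as np

    rng = np.random.default_rng(2)
    T, beta, lam, sigma_v, reps = 25, 1.0, -1.0, 0.1, 1000
    gamma = lam * np.sqrt(T)
    t_a, t_b = 10, 17                                # illustrative break dates
    d1 = np.zeros(T); d1[t_a] = 1.0
    d2 = np.zeros(T); d2[t_b] = 1.0
    regressors = [np.ones(T), d1, d2]

    se_ma, se_gum = 0.0, 0.0
    for _ in range(reps):
        y = beta + gamma * d1 + sigma_v * rng.standard_normal(T)
        y_next = beta + sigma_v * rng.standard_normal()    # both dummies are 0 at T+1
        s2, fc = [], []
        for code in range(8):                              # the 8 subset models
            cols = [regressors[j] for j in range(3) if code >> j & 1]
            if cols:
                X = np.column_stack(cols)
                coef, *_ = np.linalg.lstsq(X, y, rcond=None)
                s2.append(np.mean((y - X @ coef) ** 2))
                fc.append(coef[0] if code & 1 else 0.0)    # forecast = intercept (or 0)
            else:
                s2.append(np.mean(y ** 2))
                fc.append(0.0)
        w = np.exp(-0.5 * np.array(s2))
        w /= w.sum()                                       # weights (6)
        se_ma += (y_next - w @ np.array(fc)) ** 2 / reps
        se_gum += (y_next - fc[-1]) ** 2 / reps            # full model is the last subset
    print(f"MSFE: averaging {se_ma:.3f}, GUM {se_gum:.3f}, DGP error variance {sigma_v**2}")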
As λ varies, the size of the coefficient γ on the relevant dummy varies; Table 1 shows the value of γ for each (λ, T) combination.

Table 1: Size of the γ coefficient for various values of λ and T.

    T \ λ    0    -0.05    -0.25    -0.5     -0.75    -1
    25       0    -0.25    -1.25    -2.5     -3.75    -5
    50       0    -0.354   -1.768   -3.536   -5.303   -7.071

Table 2: Bias on coefficients in the DGP (equation (7)). Based on 1,000 replications.

    Bias on β:
    λ        T     GUM      MA       MA R     Lib      Cons
    0        25    0.000    -0.319   0.000    0.000    0.000
             50    -0.001   -0.316   -0.001   -0.001   -0.001
    -0.05    25    0.000    -0.323   -0.006   0.000    0.000
             50    -0.001   -0.319   -0.004   -0.001   -0.001
    -0.25    25    0.000    -0.340   -0.026   0.000    0.000
             50    -0.001   -0.332   -0.018   -0.001   -0.001
    -0.5     25    0.000    -0.362   -0.048   0.000    0.000
             50    -0.001   -0.348   -0.034   -0.001   -0.001
    -0.75    25    0.000    -0.381   -0.067   0.000    0.000
             50    -0.001   -0.363   -0.047   -0.001   -0.001
    -1       25    0.000    -0.397   -0.079   0.000    0.000
             50    -0.001   -0.376   -0.055   -0.001   -0.001

    Bias on γ:
    λ        T     GUM      MA       MA R     Lib      Cons
    0        25    -0.002   0.212    0.383    -0.073   -0.116
             50    0.000    0.211    0.381    0.026    -0.042
    -0.05    25    -0.002   0.324    0.383    -0.004   -0.007
             50    0.000    0.369    0.381    -0.002   -0.002
    -0.25    25    -0.002   0.766    0.383    -0.003   -0.003
             50    0.000    0.993    0.381    -0.002   -0.002
    -0.5     25    -0.002   1.277    0.383    -0.003   -0.003
             50    0.000    1.709    0.381    -0.002   -0.002
    -0.75    25    -0.002   1.692    0.383    -0.003   -0.003
             50    0.000    2.281    0.381    -0.002   -0.002
    -1       25    -0.002   1.966    0.383    -0.003   -0.003
             50    0.000    2.645    0.381    -0.002   -0.002

From Table 2, the bias when simply the GUM is run is tiny on both coefficients. Model averaging induces a very large bias, ranging from about 30% of the true β coefficient size when the dummies are both insignificant (λ = 0) to around 40% when λ = −1, and the calculations of the previous section are supported here: the analytic bias of −0.374 is reproduced, and is in fact slightly stronger, at −0.397, in the Monte Carlo. The bias on γ under model averaging decreases as a fraction of the size of the true coefficient, from about 100% of that size when the dummy is barely noticeable (λ = −0.05) to about 40% when the dummy is very conspicuous at λ = −1. Thus a picture of strong bias emerges from model averaging.

Rescaling the weights, as described in Section 2, improves this markedly. The bias on β does increase with the size of the true γ coefficient, holding T fixed, but is much smaller than when weights are not rescaled. The bias calculated in the previous section (−0.076 when λ = −1) is supported, with −0.079 found in the simulation, while if λ = −0.5 the bias is around 5% of the β coefficient size. The bias on γ is invariant to changes in λ, and hence to changes in the size of the γ coefficient itself; [Footnote 4: This invariance of the bias to λ is to be expected since, from (10), d_{1,t} appears only in M_3, M_5, M_6 and M_7. In M_5 and M_7 the coefficient is unbiased, γ̂ ≃ γ, since those models include the intercept, but in M_3 and M_6, which lack the intercept, we have:

    γ̂_(3) = Σ_t d_{1,t} y_t / Σ_t d²_{1,t} = Σ_t d_{1,t}(β + γ d_{1,t} + v_t) / Σ_t d²_{1,t} ≃ β + γ.

Thus when we consider the averaged coefficient:

    γ̃ = (w_3 + w_6)(β + γ) + (w_5 + w_7)γ = (w_3 + w_5 + w_6 + w_7)γ + (w_3 + w_6)β,    (16)

so if w_3 + w_5 + w_6 + w_7 = 1 (rescaled weights), the bias is (w_3 + w_6)β, independent of γ = λ√T for fixed T.] once λ is non-zero, however, the bias is smaller when weights are rescaled. When λ = 0, γ is the coefficient on an irrelevant regressor, and this highlights the problem, discussed in Section 3.1.1, of rescaling weights: it increases the weight on irrelevant regressors. [Footnote 5: Furthermore, the bias on δ, the coefficient on the irrelevant dummy, which is not reported, is larger when weights are rescaled.]

Table 2 also shows the performance of model selection: the bias is generally comparable to that of the GUM, hence negligible, while Tables 3 and 4, which report the percentage of replications on which the DGP was selected as the final model by the selection algorithm, show that in almost every replication the true DGP is found. [Footnote 6: In fact the two relevant regressors were retained on every replication; it was only retention of the irrelevant dummy on a small number of replications that brought the percentages in Tables 3 and 4 down.]

Table 3: Percentage of times the specific model is the DGP using the PcGets Liberal strategy. Based on 1,000 replications.

    T \ λ    0       -0.1    -0.25   -0.5    -0.75   -1
    25       91.1%   95.4%   95.8%   95.8%   95.8%   95.8%
    50       91.6%   95.3%   95.3%   95.3%   95.3%   95.3%

Table 4: Percentage of times the specific model is the DGP using the PcGets Conservative strategy. Based on 1,000 replications.

    T \ λ    0       -0.1    -0.25   -0.5    -0.75   -1
    25       97.3%   97.3%   99.1%   99.1%   99.1%   99.1%
    50       97.2%   99.0%   99.0%   99.0%   99.0%   99.0%

Table 5 provides information on the mean-square forecast errors of the different modelling strategies: the MSFE for a 1-step forecast of T+1 from T is given for the same strategies reported in Table 2.

Table 5: MSFE from various modelling strategies in the Monte Carlo simulation of Section 3.1.4. Based on 1,000 replications.

    λ        T     GUM     MA      MA R    Lib     Cons
    0        25    0.011   0.108   0.011   0.013   0.013
             50    0.010   0.109   0.010   0.013   0.013
    -0.05    25    0.011   0.111   0.011   0.013   0.013
             50    0.010   0.111   0.010   0.013   0.013
    -0.25    25    0.011   0.122   0.011   0.013   0.013
             50    0.010   0.119   0.010   0.013   0.013
    -0.5     25    0.011   0.137   0.013   0.013   0.013
             50    0.010   0.130   0.011   0.013   0.013
    -0.75    25    0.011   0.151   0.015   0.013   0.013
             50    0.010   0.141   0.012   0.013   0.013
    -1       25    0.011   0.163   0.016   0.013   0.013
             50    0.010   0.151   0.013   0.013   0.013

One can see that the huge MSFEs for MA predicted in Section 3.1.3 are indeed found, but when the weights are rescaled the MSFE is not noticeably worse than using the GUM, and the MSFEs of the remaining strategies are indistinguishable. Thus the Monte Carlo simulations support the assertions of Sections 3.1.1 and 3.1.2 of bias and terrible forecast performance when using model averaging in the presence of indicator variables. Rescaling weights does improve the bias performance of model averaging for regressors that are relevant, and substantially improves the MSFE.
However, upward bias is increased for irrelevant variables, potentially giving false information as to the relevance of some variables, and strong bias remains for coefficients on dummy variables.

The model here is extremely simplistic; however, it generalises quite easily analytically to a model with a regressor in place of the constant in (7), and Monte Carlo simulations give almost identical results; in fact a stronger bias of 0.103 on β is reported in the λ = −1, T = 25 case. Generalising to large models with numerous regressors and dummies is analytically fearsome, but for empirical work it is important that we can make generalisations. Longer dummy variables, and dummy variables that take unity in a certain state of the world, are considered in the next section, while larger models with impulse dummy variables are investigated via Monte Carlo simulation in Section 3.3.

3.2 Period and intermittent dummy variables

If the dummy variable were instead postulated to be a period dummy, say d_t = 1_{t_1 < t < t_2}, then following the same procedure to calculate the bias as in Section 3.1.1, all 1/√T expressions are replaced with (t_2 − t_1)/√T, hence the bias will increase with the length of the period the dummy spans. This is problematic, since period dummies are often postulated in models, as structural breaks do exist. In this section a model with a mean shift mid-sample is considered. The DGP is specified as:

    y_t = β + γ 1_{t<t_a} + v_t,  v_t ~ N[0, σ²],    (17)

while the practitioner, unsure of the end point of the structural break, posits a model with two dummy variables, d_{1,t} = 1_{t<t_a}, which is correctly specified, and d_{2,t} = 1_{t_a+1<t<t_a+8}: [Footnote 7: It is worth noting that the results from this experiment are very similar to those without the irrelevant dummy included.]

    y_t = β + γ d_{1,t} + δ d_{2,t} + u_t.    (18)

Hence again, as in the impulse-dummy case, the mean shift is noticed and the relevant regressor is included in the GUM.
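A sketch of this design: simulate the mean-shift DGP (17), include the correctly specified break dummy plus the mis-dated follow-on dummy of (18), and compute the rescaled averaged estimate of β. The break date, seed, and parameter values are illustrative choices rather than the paper's exact settings.

    import numpy as np

    rng = np.random.default_rng(3)
    T, beta, lam, sigma = 50, 1.0, -0.5, 0.1
    gamma = lam * np.sqrt(T)
    t_a = T // 2                                   # break date: an illustrative choice
    t = np.arange(T)

    # DGP (17) with a mean shift for t < t_a; dummies postulated as in (18)
    d1 = (t < t_a).astype(float)                        # correctly specified break dummy
    d2 = ((t > t_a + 1) & (t < t_a + 8)).astype(float)  # mis-dated follow-on dummy
    y = beta + gamma * d1 + sigma * rng.standard_normal(T)

    # Fit every subset of (intercept, d1, d2); rescale the weights over the
    # models containing the intercept, as in (4), with criterion exp(-s2/2)
    regressors = [np.ones(T), d1, d2]
    crit, b0, has_b = [], [], []
    for code in range(8):
        cols = [regressors[j] for j in range(3) if code >> j & 1]
        if cols:
            X = np.column_stack(cols)
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            crit.append(np.exp(-0.5 * np.mean((y - X @ coef) ** 2)))
            b0.append(coef[0] if code & 1 else 0.0)
        else:
            crit.append(np.exp(-0.5 * np.mean(y ** 2)))
            b0.append(0.0)
        has_b.append(bool(code & 1))
    crit, b0, has_b = map(np.array, (crit, b0, has_b))
    w = crit[has_b] / crit[has_b].sum()
    print(f"rescaled averaged beta = {w @ b0[has_b]:.3f}, true value = {beta}")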
Table 6 shows the deteriorating bias performance of rescaled model averaging as the true γ coefficient gets larger. [Footnote 8: After the horrendous biases shown in Section 3.1, non-rescaled model averaging is not reported here. All simulations relating to longer dummies showed non-rescaled biases of a similar size to those in Section 3.1.] When λ = 0, averaging produces a bias indistinguishable from that of the GUM, or from model selection. However, with even a slight 0.05 increase in |λ|, there is bias nearing 10% of the coefficient size.

Table 6: Bias on the β coefficient from various modelling strategies in the period-dummies Monte Carlo simulation. Based on 1,000 replications.

    λ        T     GUM      MA R     Lib      Cons
    0        25    -0.001   -0.001   0.000    0.000
             50    -0.001   -0.001   0.000    -0.001
    -0.05    25    -0.001   -0.081   -0.001   -0.002
             50    -0.001   -0.101   -0.001   -0.001
    -0.25    25    -0.001   -0.377   0.001    0.000
             50    -0.001   -0.419   -0.001   -0.001
    -0.5     25    -0.001   -0.606   0.001    0.000
             50    -0.001   -0.409   -0.001   -0.001
    -0.75    25    -0.001   -0.598   0.001    0.000
             50    -0.001   -0.137   -0.001   -0.001
    -1       25    -0.001   -0.418   0.001    0.000
             50    -0.001   -0.020   -0.001   -0.001

When λ = −0.5, the bias is 60% of the coefficient size when T = 25, and 40% when T = 50. As λ nears −1 the bias falls as the effect of the weights takes over, the SIC attributing a very low weight to models that exclude the relevant dummy; however, when T = 25 the bias is still horrendous, at −0.418. The bias on γ (not reported) follows the earlier pattern: it is invariant to λ, and hence relatively worse for small structural breaks. The favourable performance of model selection in this scenario can be gleaned from Table 6: it almost always produces bias not noticeably different from that of the GUM, which it must encompass.

Table 7 shows the much worse forecast performance of the rescaled averaged model when dummies are longer than single-period impulses. While averaging performs well if the dummy is irrelevant (λ = 0), even as λ registers a small value the MSFE increases rapidly. When the break is just about noticeable, at λ = −0.25, the MSFE is already at least 15 times the size of σ². For T = 25, the MSFE keeps rising, reporting values of over 0.35 for λ = −0.5 and −0.75, although for T = 50 there is a decrease, down to an MSFE of 0.029, still around 3 times the DGP error variance, when λ = −0.75.

Table 7: MSFE from various modelling strategies in the period-dummies Monte Carlo simulation. Based on 1,000 replications.

    λ        T     GUM     MA R    Lib     Cons
    0        25    0.013   0.011   0.014   0.014
             50    0.011   0.010   0.013   0.013
    -0.05    25    0.013   0.017   0.015   0.015
             50    0.011   0.020   0.013   0.013
    -0.25    25    0.013   0.148   0.014   0.014
             50    0.011   0.185   0.013   0.013
    -0.5     25    0.013   0.370   0.014   0.014
             50    0.011   0.177   0.013   0.013
    -0.75    25    0.013   0.362   0.014   0.014
             50    0.011   0.029   0.013   0.013
    -1       25    0.013   0.182   0.014   0.014
             50    0.011   0.011   0.013   0.013

These results for a constant and dummies generalise to models with regressors. A Monte Carlo was run replacing the constant in (17) and (18) by two mean-zero vectors of Normally distributed random numbers with coefficients β_1 and β_2, where β_1 = β_2 = 1, hence both relevant; all else remained as before. So the DGP is:

    y_t = β_1 X_{1,t} + β_2 X_{2,t} + γ d_{1,t} + v_t,  v_t ~ N[0, σ²],    (19)

and the GUM is:

    y_t = β_1 X_{1,t} + β_2 X_{2,t} + γ d_{1,t} + δ d_{2,t} + u_t.    (20)

Only the bias on the β_1 coefficient is reported here, in Table 8. The MSFE (not shown) when T = 25 is always around 4 times the DGP error variance, and when T = 50 always at least twice its size, though when T = 75 the MSFE is quite competitive once |λ| exceeds 0.25; before that it is up to three times the error variance.

Table 8: Bias on the β_1 coefficient in (20). Based on 1,000 replications.

    λ        T     GUM     MA R    Lib      Cons
    0        25    0.000   0.064   -0.001   -0.001
             50    0.000   0.113   -0.001   -0.001
             75    0.000   0.078   0.000    0.000
    -0.05    25    0.000   0.080   -0.001   -0.001
             50    0.000   0.140   -0.001   -0.001
             75    0.000   0.110   0.000    0.000
    -0.25    25    0.000   0.130   -0.001   -0.001
             50    0.000   0.212   -0.001   -0.001
             75    0.000   0.168   0.000    0.000
    -0.5     25    0.000   0.128   -0.001   -0.001
             50    0.000   0.157   -0.001   -0.001
             75    0.000   0.095   0.000    0.000
    -0.75    25    0.000   0.086   -0.001   -0.001
             50    0.000   0.125   -0.001   -0.001
             75    0.000   0.086   0.000    0.000
    -1       25    0.000   0.072   -0.001   -0.001
             50    0.000   0.124   -0.001   -0.001
             75    0.000   0.086   0.000    0.000

Table 8 reveals bias when model averaging is used in this more general context, holding fairly constant at around 10% of the true coefficient size for all values of λ and T. The bias on β_2 shows similar patterns.

Other dummies commonly used in the literature are what might be called intermittent dummies: those that take unity in a particular state of the world, perhaps if there is industrial action in a particular year, or, in cross-section, if a country is located in sub-Saharan Africa.
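Such a dummy is straightforward to mimic in simulation. The sketch below draws an intermittent state at random (the 20% incidence is purely illustrative) and computes the rescaled averaged coefficient on the dummy from the 2² = 4 subset models of a small GUM; this is our own construction for illustration, not one of the paper's reported designs.

    import numpy as np

    rng = np.random.default_rng(4)
    T, beta, gamma, sigma = 50, 1.0, -0.5, 0.1

    # Intermittent dummy: unity in a recurring state of the world
    # (e.g. strike years); the 20% incidence is purely illustrative
    d = (rng.random(T) < 0.2).astype(float)
    y = beta + gamma * d + sigma * rng.standard_normal(T)

    # Average over the 2^2 = 4 subsets of (intercept, d) with per-regressor
    # rescaled weights, as in (4), based on exp(-sigma_hat^2 / 2)
    regressors = [np.ones(T), d]
    crit, b_d, has_d = [], [], []
    for code in range(4):
        cols = [regressors[j] for j in range(2) if code >> j & 1]
        if cols:
            X = np.column_stack(cols)
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            crit.append(np.exp(-0.5 * np.mean((y - X @ coef) ** 2)))
            b_d.append(coef[-1] if code & 2 else 0.0)  # d is the last column if present
        else:
            crit.append(np.exp(-0.5 * np.mean(y ** 2)))
            b_d.append(0.0)
        has_d.append(bool(code & 2))
    crit, b_d, has_d = map(np.array, (crit, b_d, has_d))
    w = crit[has_d] / crit[has_d].sum()
    print(f"rescaled averaged coefficient on d = {w @ b_d[has_d]:.3f}, true value = {gamma}")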
As another experiment, the two dummies specified in (19) and (20), which could be seen as somewhat arbitrary and unrealistic, were replaced with dummies from growth regressions (African and Latin American dummies from Sala-i-Martin 1997a, 1997b), and considerably worse biases and MSFEs resulted (results not reported).

These results are quite stunning; they show that even if a practitioner has located a clear structural break in a dataset for a parameter of interest, or accommodated a salient feature with an intermittent dummy variable, model averaging based on that dataset still has the potential to produce very distorted results, both for inference and for forecasting. Model selection, on the other hand, reports much better results in these contexts, showing negligible biases and MSFEs. This must be concerning for the empirical work that has been done using model averaging, in particular by growth theorists, as many dummy variables are often specified in datasets in this field of research: Doppelhofer et al. (2000) included 8 dummy variables in their 32-variable, 98-country dataset, and Hoover & Perez (2004) had 7 in the 36-variable, 107-country dataset they considered. Just how biased are the dummies for African nations or former Communist countries, and the other regressors more generally, given the inclusion of these dummies? It might be argued that in larger models, such as the growth models referred to above, the effects of dummy variables are drowned out; the next section investigates this.

3.3 Larger models

To show the relevance of the earlier, simpler models for empirical work, a Monte Carlo experiment of 1,000 replications was designed to consider the effect of dummies when there are many regressors and a number of dummy variables. A 10-variable dataset with 3 dummies, two of them relevant, was constructed and used. The DGP is:

    y_t = β_3 X_{3,t} + β_4 X_{4,t} + β_5 X_{5,t} + β_6 X_{6,t} + β_7 X_{7,t} + δ_1 d_{1,t} + δ_2 d_{2,t} + v_t,  v_t ~ N[0, σ²].    (21)

The GUM specified in this case has two irrelevant variables and one irrelevant dummy:

    y_t = β_1 X_{1,t} + β_2 X_{2,t} + β_3 X_{3,t} + β_4 X_{4,t} + β_5 X_{5,t} + β_6 X_{6,t} + β_7 X_{7,t} + δ_1 d_{1,t} + δ_2 d_{2,t} + δ_3 d_{3,t} + u_t.    (22)

The Monte Carlo experiment again considered a range of λ and T values, with σ² = 1 specified. Due to space constraints, only the results pertaining to four of the estimated coefficients in (22) are shown below, with model averaging restricted to rescaled weights: β_2, an insignificant regressor (Table 9); β_4, a significant regressor (Table 10); δ_1, the significant dummy (Table 11); and δ_3, the insignificant dummy (Table 12). It is not relevant how strong or weak the bias is on non-reported variables: if there is noticeable bias on any coefficient of interest, this is enough to warrant concern about the suitability of model averaging as a modelling technique.
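The design of (21) and (22) can be sketched as follows. The β values, dummy dates, and replication count below are illustrative stand-ins, since the paper does not report them all, and only the rescaled averaged coefficient on one relevant regressor is tracked; with 2^10 − 1 non-empty subset models per replication, the loop is slow but feasible.

    import numpy as np

    rng = np.random.default_rng(5)
    T, lam, sigma, reps = 50, -0.5, 1.0, 100       # paper uses 1,000 replications

    # DGP (21) / GUM (22): X1, X2 and d3 are irrelevant; X3..X7 and d1, d2
    # relevant. Betas and dummy dates are illustrative stand-ins, with the
    # relevant dummy coefficients scaled as lam * sqrt(T).
    betas = np.array([0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.5])
    delta = np.array([lam * np.sqrt(T), lam * np.sqrt(T), 0.0])
    dates = [T // 4, T // 2, 3 * T // 4]

    K = 10                                         # 7 regressors + 3 impulse dummies
    track = 3                                      # column index of X4, a relevant regressor
    bias = 0.0
    for _ in range(reps):
        X = rng.standard_normal((T, 7))
        D = np.zeros((T, 3))
        for j, date in enumerate(dates):
            D[date, j] = 1.0
        Z = np.column_stack([X, D])                # the GUM's full regressor set
        y = X @ betas + D @ delta + sigma * rng.standard_normal(T)
        crit, coef_k, has_k = [], [], []
        for code in range(1, 2 ** K):              # every non-empty subset model
            idx = [j for j in range(K) if code >> j & 1]
            Xs = Z[:, idx]
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            crit.append(np.exp(-0.5 * np.mean((y - Xs @ b) ** 2)))
            has_k.append(track in idx)
            coef_k.append(b[idx.index(track)] if track in idx else 0.0)
        crit, coef_k, has_k = map(np.array, (crit, coef_k, has_k))
        w = crit[has_k] / crit[has_k].sum()        # per-regressor rescaled weights (4)
        bias += (w @ coef_k[has_k] - betas[track]) / reps
    print(f"mean bias on X4's coefficient under rescaled averaging: {bias:.3f}")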
Considering first Table 9, the upward bias (in absolute terms) of irrelevant coefficients due to rescaling weights can be observed: the irrelevant coefficient registers values of up to around 0.3, although these values decrease in |λ| and T.

Table 9: Bias on the β_2 coefficient from various modelling strategies in the larger-models Monte Carlo simulation. Based on 1,000 replications.

    λ        T      GUM      MA R     Con      Lib
    0        25     0.004    -0.296   -0.052   -0.017
             50     -0.006   -0.212   0.001    0.002
             75     -0.007   -0.192   -0.005   0.001
             100    -0.004   -0.107   -0.020   -0.003
    -0.25    25     0.004    -0.272   -0.023   -0.014
             50     -0.006   -0.158   0.002    0.001
             75     -0.007   -0.156   -0.001   0.002
             100    -0.004   -0.077   -0.026   -0.004
    -0.5     25     0.004    -0.247   -0.023   -0.014
             50     -0.006   -0.107   0.002    0.001
             75     -0.007   -0.124   -0.001   0.002
             100    -0.004   -0.049   -0.026   -0.004
    -1       25     0.004    -0.201   -0.023   -0.014
             50     -0.006   -0.033   0.002    0.001
             75     -0.007   -0.086   -0.001   0.002
             100    -0.004   -0.014   -0.026   -0.004

Turning to Table 10, when T = 50 one can see that there is considerable bias relative to the true value of β_4 (given in the column entitled DGP) induced by the inclusion of these dummy variables. Indeed, this bias increases with |λ|: from about 35% of the size of the true value when λ = 0 to almost 60% when λ = −1. When T = 100, the bias is around 20% of the true coefficient size for all values of λ. Model selection in the form of PcGets delivers very small bias here, as can be ascertained from the Con and Lib columns.

Turning next to Table 11, we consider the bias on δ_1, the coefficient on the first, relevant, dummy. The bias on this coefficient, apart from the T = 100 case, does not die out as T increases. This is clearly more problematic for inference with smaller values of δ_1, such as when λ = −0.25, where the bias is about 60% of the true coefficient size. When T = 75, the bias is not so strong, ranging from 25% of the true coefficient size down to 6% as |λ| increases. It is of some reassurance that when T = 100, as the dummies become more conspicuous, the bias disappears and is negligible for the most part. However, this is not the case for the irrelevant dummy: from Table 12, there is significant bias in all cases for this irrelevant variable. Tables 11 and 12 also show the smaller bias that comes from using simply the GUM, or model selection: PcGets in this situation selects the relevant dummy on every occasion, and retains the irrelevant dummy on about 6% of replications under the Liberal strategy and about 1.5% under the Conservative strategy. [Footnote 9: It might also be noted that the average coefficient values, hence biases, in Tables 9-12 are calculated only over the replications in which that variable appears in the final model; given that on the other 95% or so of occasions model selection has given a correct zero, we could average over these as well and find a much smaller average value.]

Thus via Monte Carlo simulation it has been indicated that at least one of the predictions of Section 3.1, noticeable bias, holds in larger models. The MSFE was competitive in the model considered in this section; however, given the results of Section 3.2, which showed that the MSFE deteriorates when dummies taking unity for more than one observation are specified, we contend that it is only the fact that impulse dummies are specified here that keeps the MSFE competitive. Thus outliers and structural breaks, two issues that can beset empirical work in economics, are shown to cause serious difficulties for model averaging.
Table 10: Bias on the β_4 coefficient from various modelling strategies in the larger-models Monte Carlo simulation. The DGP column gives the true value of β_4. Based on 1,000 replications.

    λ        DGP     T      GUM     MA R     Con     Lib
    0        0.676   25     0.006   -0.142   0.020   0.013
             0.516   50     0.008   -0.178   0.014   0.009
             0.433   75     0.001   -0.089   0.010   0.007
             0.381   100    0.000   -0.082   0.009   0.005
    -0.25    0.676   25     0.006   -0.149   0.024   0.016
             0.516   50     0.008   -0.212   0.015   0.010
             0.433   75     0.001   -0.073   0.011   0.007
             0.381   100    0.000   -0.080   0.008   0.005
    -0.5     0.676   25     0.006   -0.156   0.024   0.016
             0.516   50     0.008   -0.244   0.015   0.010
             0.433   75     0.001   -0.059   0.011   0.007
             0.381   100    0.000   -0.079   0.008   0.005
    -1       0.676   25     0.006   -0.174   0.024   0.016
             0.516   50     0.008   -0.292   0.015   0.010
             0.433   75     0.001   -0.040   0.011   0.007
             0.381   100    0.000   -0.076   0.008   0.005

Table 11: Bias on the δ_1 coefficient from various modelling strategies in the larger-models Monte Carlo simulation. Based on 1,000 replications.

    λ        T      GUM      MA R     Con      Lib
    0        25     0.029    -0.052   -0.164   -0.047
             50     -0.079   0.900    0.123    0.065
             75     -0.024   -0.544   -0.088   -0.042
             100    0.013    0.124    0.001    -0.014
    -0.25    25     0.029    -0.082   -0.040   -0.018
             50     -0.079   0.949    0.022    0.007
             75     -0.024   -0.536   -0.018   -0.008
             100    0.013    0.103    -0.009   -0.004
    -0.5     25     0.029    -0.111   -0.040   -0.018
             50     -0.079   0.993    0.022    0.007
             75     -0.024   -0.529   -0.018   -0.008
             100    0.013    0.084    -0.009   -0.004
    -1       25     0.029    -0.160   -0.040   -0.018
             50     -0.079   1.051    0.022    0.007
             75     -0.024   -0.521   -0.018   -0.008
             100    0.013    0.060    -0.009   -0.004

Table 12: Bias on the δ_3 coefficient from various modelling strategies in the larger-models Monte Carlo simulation. Based on 1,000 replications.

    λ        T      GUM      MA R     Con      Lib
    0        25     -0.016   -1.063   -0.166   -0.080
             50     -0.009   -0.842   -0.162   -0.069
             75     -0.055   0.341    0.026    -0.031
             100    -0.008   1.195    -0.080   -0.035
    -0.25    25     -0.016   -1.079   -0.144   -0.082
             50     -0.009   -0.770   -0.142   -0.068
             75     -0.055   0.297    0.011    -0.036
             100    -0.008   1.195    -0.033   -0.043
    -0.5     25     -0.016   -1.092   -0.144   -0.082
             50     -0.009   -0.701   -0.142   -0.068
             75     -0.055   0.258    0.011    -0.036
             100    -0.008   1.196    -0.033   -0.043
    -1       25     -0.016   -1.108   -0.144   -0.082
             50     -0.009   -0.595   -0.142   -0.068
             75     -0.055   0.211    0.011    -0.036
             100    -0.008   1.193    -0.033   -0.043

4 Conclusions

In this paper, Monte Carlo simulations have been used to indicate worrisome properties of model averaging, building on simple analytical models that incorporate empirically relevant characteristics such as outliers and structural breaks. The simulations suggest that these problems extend to larger datasets. Outliers and structural breaks do not appear to have been considered by model-averaging investigators thus far.

Model averaging has been analysed, and it has been shown that averaging without rescaling the weights so that they sum to unity for each regressor leads to large biases on relevant regressors, and to large mean squared forecast errors for one-step forecasts, refuting the arguments of Buckland et al. (1997) on model averaging. Rescaling the weights has been shown to improve the performance of model averaging on both counts; however, in this situation irrelevant regressors have been argued to be more biased, an argument supported by the Monte Carlo simulations. Furthermore, the results of the Monte Carlo simulations suggest that model averaging is inferior to the GUM and to model selection for forecasting in the presence of period and intermittent dummy variables; this overturns the assertions of Raftery et al. (1997) relating to the forecasting performance of model averaging.
This would be a quite depressing exercise for the future possibilities of macroeconomic modelling were it not for the performance of model selection in the contexts analysed: relevant variables are almost always retained, giving a high probability of uncovering the DGP in each situation and producing negligible bias throughout. This accords with the growing number of studies that have shown the impressive capabilities of model selection, and specifically of PcGets, in a range of empirical modelling contexts (see e.g. Hoover & Perez 2004, Castle 2004). Hence we can conclude on a positive note: there is indeed a better alternative to model averaging for macroeconomic modelling.

References

Buckland, S.T., K.P. Burnham & N.H. Augustin (1997), 'Model selection: An integral part of inference', Biometrics 53, 603-618.

Castle, J. (2004), Evaluating PcGets and RETINA as automatic model selection algorithms. Unpublished paper, Economics Department, Oxford University.

Doornik, J.A., D.F. Hendry & B. Nielsen (1998), 'Inference in cointegrating models: UK M1 revisited', Journal of Economic Surveys 12(5), 533-572.

Doppelhofer, G., R.I. Miller & X. Sala-i-Martin (2000), Determinants of long-term growth: A Bayesian Averaging of Classical Estimates (BACE) approach, Technical report, National Bureau of Economic Research.

Eklund, J. & S. Karlsson (2004), Forecast combination and model averaging using predictive measures. Unpublished paper, Stockholm School of Economics.

Fernandez, C., E. Ley & M.F.J. Steel (2001), 'Model uncertainty in cross-country growth regressions', Journal of Applied Econometrics 16, 563-576.

Hendry, D.F. (1995), Dynamic Econometrics, Oxford University Press, Oxford.

Hendry, D.F. (2001), 'Modelling UK inflation, 1875-1991', Journal of Applied Econometrics 16(3), 255-275.

Hendry, D.F. & M.P. Clements (2004), 'Pooling of forecasts', Econometrics Journal 7, 1-31.

Hendry, D.F. & H.-M. Krolzig (2005), 'The properties of automatic Gets modelling', The Economic Journal 115(502), C32-C61.

Hoover, K.D. & S.J. Perez (1999), 'Data mining reconsidered: Encompassing and the general-to-specific approach to specification search', Econometrics Journal 2, 167-191.

Hoover, K.D. & S.J. Perez (2004), 'Truth and robustness in cross-country growth regressions', Oxford Bulletin of Economics and Statistics 66(5), 765-798.

Koop, G. & S. Potter (2003), Forecasting in large macroeconomic panels using Bayesian Model Averaging, Staff Report 163, Federal Reserve Bank of New York.

Perez-Amaral, T., G.M. Gallo & H. White (2003), 'A flexible tool for model building: the Relevant Transformation of the Inputs Network Approach (RETINA)', Oxford Bulletin of Economics and Statistics 65(s1), 821-838.

Raftery, A.E., D. Madigan & J.A. Hoeting (1997), 'Bayesian model averaging for linear regression models', Journal of the American Statistical Association 92(437), 179-191.

Sala-i-Martin, X.X. (1997a), 'I just ran two million regressions', American Economic Review 87(2), 178-183.

Sala-i-Martin, X.X. (1997b), I just ran four million regressions, Technical report, National Bureau of Economic Research.