Comment
Lutz Kilian
Department of Economics, University of Michigan, Ann Arbor, MI 48109-1220
Frank Diebold’s personal reflections about the history of the DM test remind us that this test was
originally designed to compare the accuracy of model-free forecasts such as judgmental forecasts
generated by experts, forecasts implied by financial markets, survey forecasts, or forecasts based
on prediction markets. This test is used routinely in applied work. For example, Baumeister and Kilian (2012) use the DM test to compare oil price forecasts based on prices of oil futures contracts against the no-change forecast.
Much of the econometric literature that builds on Diebold and Mariano (1995), in contrast, has been preoccupied with testing the validity of predictive models in pseudo out-of-sample environments. In this more recent literature, the concern is not actually the forecasting ability of the models in question. Rather, the focus is on testing the null hypothesis that there is no predictive relationship from one variable to another in population. Testing for the existence of a predictive relationship in population is viewed as an indirect test of all economic models that suggest such a predictive relationship. A case in point is studies of the predictive power of monetary fundamentals for the exchange rate (e.g., Mark 1995). Although this testing problem may seem similar to that in Diebold and Mariano (1995) at first sight, it is conceptually quite different from the original motivation for the DM test. As a result, numerous changes have been proposed in the way the test statistic is constructed and in how its distribution is approximated.
In a linear regression model, testing for predictability in population comes down to testing the null hypothesis of zero slopes, which can be assessed using standard in-sample t- or Wald tests. Alternatively, the same null hypothesis of zero slopes can also be tested based on recursive or rolling estimates of the loss in fit associated with generating pseudo out-of-sample predictions from the restricted rather than the unrestricted model. Many empirical studies, including Mark (1995), implement both tests.
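To make the equivalence concrete, the following sketch (a minimal illustration in Python with simulated data; all variable names and parameter values are hypothetical) contrasts the two approaches for a single-predictor regression: an in-sample t-test of the zero-slope restriction on the full sample versus recursive pseudo out-of-sample MSPEs of the restricted and unrestricted models.

    # Two ways of assessing the same null of no predictability in the
    # predictive regression y_{t+1} = a + b*x_t + eps_{t+1}.
    # Illustrative simulated data; not the data used in the studies cited above.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    T = 200
    x = rng.standard_normal(T)                  # predictor x_t
    y = 0.1 * x + rng.standard_normal(T)        # entry t holds y_{t+1}

    # (1) In-sample t-test of H0: b = 0 using the full sample.
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print("in-sample t-statistic on the slope:", fit.tvalues[1])

    # (2) Recursive pseudo out-of-sample comparison of the unrestricted model
    #     (constant + x_t) and the restricted model (constant only).
    R = 100                                     # initial estimation window
    err_u, err_r = [], []
    for t in range(R, T):
        b = sm.OLS(y[:t], sm.add_constant(x[:t])).fit().params
        err_u.append(y[t] - (b[0] + b[1] * x[t]))   # unrestricted forecast error
        err_r.append(y[t] - y[:t].mean())           # restricted forecast error
    print("recursive MSPE, unrestricted:", np.mean(np.square(err_u)))
    print("recursive MSPE, restricted:  ", np.mean(np.square(err_r)))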
Under standard assumptions, it follows immediately that pseudo out-of-sample tests have the same asymptotic size as, but lower power than, in-sample tests of the null hypothesis of no predictability in population, which raises the question of why anyone would want to use such tests. While perhaps obvious, this point has nevertheless generated extensive debate. The power advantages of in-sample tests of predictability were first formally established in Inoue and Kilian (2004). Recent work by Hansen and Timmermann (2013) elaborates on the same point. Less obviously, it can be shown that these asymptotic power advantages also generalize to comparisons of models subject to data mining, serial correlation in the errors, and even certain forms of structural breaks (see Inoue and Kilian 2004).
WHERE DID THE LITERATURE GO OFF TRACK?
In recent years, there has been increased recognition of the fact that tests of population
predictability designed to test the validity of predictive models are not suitable for evaluating the
accuracy of forecasts. The difference is best illustrated within the context of a predictive
regression with coefficients that are modelled as local to zero. The local asymptotics here serve
as a device to capture our inability to detect nonzero regression coefficients with any degree of
reliability. Consider the data generating process $y_{t+1} = \alpha + \varepsilon_{t+1}$, where $\alpha = \alpha_0 \equiv \delta/T^{1/2}$, $\delta \geq 0$, and $\varepsilon_t \sim NID(0,1)$. The Pitman drift parameter $\delta$ cannot be estimated consistently. We restrict attention to one-step-ahead forecasts. One is the restricted forecast $\hat{y}_{t+1|t} = 0$; the other is the unrestricted forecast $\hat{y}_{t+1|t} = \hat{\alpha}$, where $\hat{\alpha}$ is the recursively obtained least-squares estimate of $\alpha$.
This example is akin to the problem of choosing between a random walk with drift and without
drift in generating forecasts of the exchange rate.
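A minimal simulation sketch of this setup may help fix ideas. It assumes the data generating process above with purely illustrative values of T, R, and $\delta$; nothing in it is specific to the exchange rate application.

    # Local-to-zero DGP y_{t+1} = alpha + eps_{t+1} with alpha = delta/sqrt(T),
    # and the two competing one-step-ahead forecasts: the restricted forecast 0
    # and the unrestricted forecast alpha_hat (the recursive sample mean).
    import numpy as np

    rng = np.random.default_rng(1)
    T, R, delta = 480, 120, 1.0                 # illustrative values; R = 0.25T
    alpha = delta / np.sqrt(T)
    y = alpha + rng.standard_normal(T)

    err_r, err_u = [], []                       # restricted / unrestricted errors
    for t in range(R, T):
        alpha_hat = y[:t].mean()                # recursive least-squares estimate
        err_r.append(y[t] - 0.0)                # restricted forecast is 0
        err_u.append(y[t] - alpha_hat)

    print("out-of-sample MSPE, restricted:  ", np.mean(np.square(err_r)))
    print("out-of-sample MSPE, unrestricted:", np.mean(np.square(err_u)))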
It is useful to compare the asymptotic MSPEs of these two forecasts. The MSPE can be
expressed as the sum of the forecast variance and the squared forecast bias. The restricted
forecast has zero variance by construction for all values of $\delta$, but is biased away from the optimal forecast by $\delta$, so its MSPE is $\delta^2$. The unrestricted forecast, in contrast, has zero bias, but a constant variance for all $\delta$, which can be normalized to unity without loss of generality. As Figure 1 illustrates, the MSPEs of the two forecasts are equal for $\delta = 1$. This means that for values of $0 < \delta < 1$, the restricted forecast actually is more accurate than the unrestricted forecast, although the restricted forecast is based on a model that is invalid in population, given that $\delta > 0$. This observation illustrates that there is a fundamental difference between the objective of
selecting the true model and selecting the model with the lowest MSPE (also see Inoue and
Kilian 2006).
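In the scaled notation of this example, the bias-variance decomposition just described can be written compactly; this is merely a restatement of the argument above, not an additional result:

\[
\mathrm{MSPE} = \operatorname{Var}\bigl(\hat{y}_{t+1|t}\bigr) + \bigl[\operatorname{Bias}\bigl(\hat{y}_{t+1|t}\bigr)\bigr]^{2},
\qquad
\mathrm{MSPE}_{\mathrm{restricted}} = 0 + \delta^{2},
\qquad
\mathrm{MSPE}_{\mathrm{unrestricted}} = 1 + 0,
\]

so the two curves in Figure 1 intersect exactly where $\delta^{2} = 1$, that is, at $\delta = 1$.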
Traditional tests of the null hypothesis of no predictability (referred to as old school
WCM tests by Frank Diebold) correspond to tests of $H_0: \alpha = 0$, which is equivalent to testing $H_0: \delta = 0$. It has been common for proponents of such old school tests to advertise their tests as tests of equal forecast accuracy. This language is misleading. As Figure 1 shows, testing equal forecast accuracy under quadratic loss corresponds to testing $H_0: \delta = 1$ in this example. An immediate implication of Figure 1 is that the critical values of conventional pseudo-out-of-sample tests of equal forecast accuracy are too low if the objective is to compare the forecasting
accuracy of the restricted and the unrestricted forecast. As a result, these tests invariably reject
the null of equal forecast accuracy too often in favor of the unrestricted forecast. They suffer
from size distortions even asymptotically.
This point was first made in Inoue and Kilian (2004) and has become increasingly
accepted in recent years (e.g., Giacomini and White 2006; Clark and McCracken 2012). It
applies not only to all pseudo-out-of-sample tests of predictability published as of 2004, but also
to the more recently developed alternative test of Clark and West (2007). Although the latter test
embodies an explicit adjustment for finite-sample parameter estimation uncertainty, it is based
on the same $H_0: \delta = 0$ as more traditional tests of no predictability and hence is not suitable for
evaluating the null of equal MSPEs.
BACK TO THE ROOTS
Frank Diebold makes the case that we should return to the roots of this literature and abandon
old school WCM tests in favor of the original DM test to the extent that we are interested in
testing the null hypothesis of equal MSPEs. Applying the DM test relies on Assumption DM, which requires the loss differential to be covariance stationary for the DM test to have an asymptotic N(0,1) distribution. Frank Diebold suggests that as long as we carefully test that assumption and verify that it holds at least approximately, the DM test should replace the old school WCM tests in practice. He acknowledges that in some cases there are alternative tests of the null of equal out-of-sample MSPEs, such as those of Clark and McCracken (2011, 2012), which he refers to as new school WCM tests, but he considers these tests too complicated to be worth considering in practice.
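For concreteness, a minimal sketch of how the DM statistic is typically computed under squared-error loss is given below. It assumes one-step-ahead forecast errors, so that a simple variance estimate of the mean loss differential suffices; at longer horizons a HAC estimator of the long-run variance would be used instead. The function name and the simulated error series are purely illustrative.

    # DM statistic for H0: equal MSPE under squared-error loss, assuming
    # one-step-ahead forecast errors (no HAC correction). Illustrative only.
    import numpy as np
    from scipy import stats

    def dm_statistic(e1, e2):
        """DM statistic from two forecast-error series of equal length."""
        d = np.square(e1) - np.square(e2)       # loss differential d_t
        dbar = d.mean()
        var_dbar = d.var(ddof=1) / d.size       # variance of the sample mean of d_t
        return dbar / np.sqrt(var_dbar)

    # Example with simulated, equally accurate forecasts.
    rng = np.random.default_rng(2)
    e1 = rng.standard_normal(200)               # forecast errors, forecast 1
    e2 = rng.standard_normal(200)               # forecast errors, forecast 2
    dm = dm_statistic(e1, e2)
    print("DM statistic:", dm, "two-sided p-value:", 2 * stats.norm.sf(abs(dm)))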
It is useful to examine this proposal within the context of our local-to-zero predictive
model. Table 1 investigates the size of the DM test based on the N(0,1) asymptotic
approximation. We focus on practically relevant sample sizes. In each case, we choose $\alpha$ in the data generating process such that the MSPEs of the restricted and the unrestricted model are equal. Under our assumptions, $\alpha = 1/T^{1/2}$ under this null hypothesis. We focus on recursive estimation of the unrestricted model. The initial recursive estimation window consists of the first R sample observations, $R < T$. We explore two alternative asymptotic thought experiments. In the first case, $R \in \{15, 30, 45\}$ is fixed with respect to the sample size. In the second case, $R = \lambda T$ with $\lambda \in \{0.25, 0.50, 0.75\}$. Table 1 shows that, regardless of the asymptotic thought experiment, the effective size of the DM test may be lower or higher than the nominal size, depending on R and T. In most cases, the DM test is conservative in that its empirical size is below the nominal size. This is an interesting contrast to traditional tests of $H_0: \delta = 0$, whose size invariably exceeds the nominal size when testing the null of equal MSPEs. When the empirical size of the DM test exceeds its nominal size, the size distortions are modest and vanish for larger T. Table 1 also illustrates that in practice R must be large relative to T for the test to have reasonably accurate size. Otherwise, the DM test may become extremely conservative.
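The size calculations in Table 1 can be reproduced in outline with a short Monte Carlo loop along the following lines (a sketch only: it uses far fewer trials than the 50,000 underlying Table 1, a one-sided rejection rule against the N(0,1) critical value, and illustrative function and variable names).

    # Monte Carlo size of the DM test in the local-to-zero model with delta = 1,
    # i.e., equal MSPEs of the restricted and unrestricted forecasts.
    import numpy as np
    from scipy import stats

    def rejection_rate(T, R, level=0.05, trials=2000, seed=3):
        rng = np.random.default_rng(seed)
        crit = stats.norm.ppf(1.0 - level)      # one-sided N(0,1) critical value
        alpha = 1.0 / np.sqrt(T)                # delta = 1 implies equal MSPEs
        rejections = 0
        for _ in range(trials):
            y = alpha + rng.standard_normal(T)
            d = []
            for t in range(R, T):
                alpha_hat = y[:t].mean()        # recursive estimate
                d.append(y[t] ** 2 - (y[t] - alpha_hat) ** 2)
            d = np.asarray(d)
            dm = d.mean() / np.sqrt(d.var(ddof=1) / d.size)
            rejections += dm > crit             # reject in favor of unrestricted
        return rejections / trials

    print(rejection_rate(T=120, R=30))          # compare with the entry in Table 1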
This evidence does not mean that the finite-sample distribution of the DM test statistic is
well approximated by a N(0,1) distribution. Figure 2 illustrates that even for $T = 480$ the density
of the DM test statistic is far from N(0,1). This finding is consistent with the theoretical analysis
in Clark and McCracken (2011, p. 8) of the limiting distribution of the DM test statistic in the
local-to-zero framework. Nevertheless, the right tail of the empirical distribution is reasonably
close to that of the N(0,1) asymptotic approximation for large R/T, which explains the fairly accurate size in Table 1 for $R = 0.75T$.
The failure of the N(0,1) approximation in this example suggests a violation of
Assumption DM. This fact raises the question of whether one would have been able to detect this
problem by plotting and analyzing the recursive loss differential, as suggested by Frank Diebold.
The answer for our data generating process is sometimes yes, but often not. While there are some
draws of the loss differential in the Monte Carlo study that show clear signs of nonstationarity
even without formal tests, many others do not. Figure 3 shows one such example. This finding
casts doubt on our ability to verify Assumption DM in practice.
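The diagnostic underlying Figure 3 amounts to plotting the recursive loss differential and its squared demeaned version for a given draw, roughly as in the following sketch (matplotlib-based and purely illustrative).

    # Plot the recursive loss differential and its squared demeaned version
    # for one simulated draw from the local-to-zero DGP with delta = 1.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    T, R = 480, 120                             # R = 0.25T, as in Figure 3
    y = 1.0 / np.sqrt(T) + rng.standard_normal(T)
    d = np.array([y[t] ** 2 - (y[t] - y[:t].mean()) ** 2 for t in range(R, T)])

    fig, axes = plt.subplots(2, 1, sharex=True)
    axes[0].plot(d)
    axes[0].set_title("Recursive loss differential")
    axes[1].plot((d - d.mean()) ** 2)
    axes[1].set_title("Squared demeaned recursive loss differential")
    axes[1].set_xlabel("Evaluation period")
    plt.show()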
Leaving aside the question of the validity of the N(0,1) approximation in this context, the
fact that the DM test tends to be conservative for typical sample sizes raises concerns that the
DM test may have low power. It is therefore important to compare the DM test to the bootstrap-based tests of the same null hypothesis of equal MSPEs developed in Clark and McCracken
(2011). The latter new school WCM tests are designed for nested forecasting models and are
motivated by the same local-to-zero framework we relied on in our example. They can be
applied to estimates of the MSPE based on rolling or recursive regressions.
Simulation results presented in Clark and McCracken (2011) suggest that their alternative tests have size similar to the DM test in practice, facilitating the comparison. Clark and McCracken's bootstrap-based MSE-t test in several examples appears to have slightly higher power than the DM test, but the power advantages are not uniform across all data generating processes. Larger and more systematic power gains are obtained for their bootstrap MSE-F test, however. This evidence suggests that there remains a need for alternative tests of the null hypothesis of equal MSPEs such as the MSE-F test in Clark and McCracken (2011). While
such tests so far are not available for all situations of interest in practice, their further
development seems worthwhile. There is also an alternative test of equal forecast accuracy
proposed by Giacomini and White (2006), involving a somewhat different specification of the
null hypothesis, which is designed specifically for pseudo out-of-sample forecasts based on
rolling regressions.
An interesting implication of closely related work in Clark and McCracken (2012) is that
even when one is testing the null hypothesis of equal MSPEs of forecasts from nested models (as
opposed to the traditional null hypothesis of no predictability), in-sample tests have higher power
than the corresponding pseudo out-of-sample tests in Clark and McCracken (2011). This finding
closely mirrors the earlier results in Inoue and Kilian (2004) for tests of the null of no
predictability. It renews the question raised in Inoue and Kilian (2004) of why anyone would
care to conduct pseudo out-of-sample inference about forecasts as opposed to in-sample
inference. Frank Diebold is keenly aware of this question and draws attention to the importance
of assessing and understanding the historical evolution of the accuracy of forecasting models as
opposed to their MSPE. A case in point is the analysis in Baumeister, Kilian, and Lee (2013), who show that the accuracy of forecasts with low MSPEs may not be stable over time. Such analysis
does not necessarily require formal tests, however, and existing tests for assessing stability are
not designed to handle the real-time data constraints in many economic time series. Nor do these
tests allow for iterated as opposed to direct forecasts.
THE PROBLEM OF SELECTING FORECASTING MODELS
This leaves the related, but distinct question of forecasting model selection. It is important to
keep in mind that tests for equal forecast accuracy are not designed to select among alternative
parametric forecasting models. This distinction is sometimes blurred in discussions of tests of
equal predictive ability. Forecasting model selection involves the ranking of candidate
forecasting models based on their performance. There are well established methods for selecting
the forecasting model with the lowest out-of-sample MSPE, provided that the number of
candidate models is small relative to the sample size, that we restrict attention to direct forecasts,
and that there is no structural change in the out-of-sample period.
It is important to stress that consistent model selection does not require the true model to
be included among the forecasting models in general. For example, Inoue and Kilian (2006)
prove that suitably specified information criteria based on the full sample will select the
forecasting model with the lowest out-of-sample MSPE even when all candidate models are
misspecified. In contrast, forecasts obtained by ranking models by their rolling or recursive MSPE will, with positive probability, inflate the out-of-sample MSPE under conventional asymptotic approximations. Frank Diebold's discussion is correct that PLS effectively equals the
SIC under the assumption that R does not depend on T , but the latter nonstandard asymptotic
thought experiment does not provide good approximations in finite samples, as shown in Inoue
and Kilian (2006). Moreover, consistently selecting the forecasting model with the lowest MSPE
may require larger penalty terms in the information criterion than embodied in conventional
criteria such as the SIC and AIC.
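As an illustration of the distinction between full-sample information criteria and MSPE rankings, the following sketch selects between a restricted (constant-only) and an unrestricted (constant plus predictor) forecasting model by the full-sample SIC and, alternatively, by recursive out-of-sample MSPE. The data and names are illustrative, and the standard SIC penalty is used even though, as just noted, a larger penalty may be required for consistent selection of the lowest-MSPE model.

    # Model selection by full-sample SIC versus ranking by recursive MSPE.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    T = 200
    x = rng.standard_normal(T)                  # predictor x_t
    y = 0.1 * x + rng.standard_normal(T)        # entry t holds y_{t+1}

    # Full-sample SIC (BIC) for each candidate forecasting model.
    sic_u = sm.OLS(y, sm.add_constant(x)).fit().bic
    sic_r = sm.OLS(y, np.ones(T)).fit().bic
    print("SIC selects:", "unrestricted" if sic_u < sic_r else "restricted")

    # Ranking the same two models by recursive out-of-sample MSPE instead.
    R = 100
    err_u, err_r = [], []
    for t in range(R, T):
        b = sm.OLS(y[:t], sm.add_constant(x[:t])).fit().params
        err_u.append(y[t] - (b[0] + b[1] * x[t]))
        err_r.append(y[t] - y[:t].mean())
    mspe_u, mspe_r = np.mean(np.square(err_u)), np.mean(np.square(err_r))
    print("Recursive MSPE selects:", "unrestricted" if mspe_u < mspe_r else "restricted")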
The availability of these results does not mean that the problem of selecting forecasting
models is resolved. First, asymptotic results may be of limited use in small samples. Second,
information criteria are not designed for selecting among a large-dimensional set of forecasting
models. Their asymptotic validity breaks down once we examine larger sets of candidate models.
Third, standard information criteria are unable to handle iterated forecasts. Fourth, Inoue and
Kilian (2006) prove that no forecasting model selection method in general remains valid in the
presence of unforeseen structural changes in the out-of-sample period. The problem has usually
been dealt with in the existing literature by simply restricting attention to structural changes in
the past, while abstracting from structural changes in the future. Finally, there are no theoretical
results on how to select among forecasting methods that involve model selection at each stage of
recursive or rolling regressions. The latter situation is quite common in applied work.
CONCLUSION
The continued interest in the question of how to evaluate the accuracy of forecasts shows that
Diebold and Mariano’s (1995) key insights still are very timely, even twenty years later.
Although the subsequent literature in some cases followed directions not intended by the authors,
there is a growing consensus on how to think about this problem and which pitfalls must be
avoided by applied users. The fact that in many applications in-sample tests have proved superior
to simulated out-of-sample tests does not mean that there are no situations in which we care
about out-of-sample inference. Important examples include inference about forecasts from real-time data and inference about iterated forecasts. Unfortunately, however, existing tests of equal
out-of-sample accuracy are not designed to handle these and other interesting extensions. This
fact suggests that much work remains to be done for econometricians going forward.
ADDITIONAL REFERENCES
Baumeister, C., and L. Kilian (2012), “Real-Time Forecasts of the Real Price of Oil,”
Journal of Business and Economic Statistics, 30, 326-336.
Baumeister, C., Kilian, L., and T.K. Lee (2013), “Are There Gains from Pooling Real-Time Oil Price Forecasts?” mimeo, University of Michigan.
Clark, T.E., and M.W. McCracken (2012), “In-Sample Tests of Predictive Ability: A New
Approach,” Journal of Econometrics, 170, 1-14.
Clark, T.E., and K.D. West (2007), “Approximately Normal Tests for Equal Predictive
Accuracy in Nested Models,” Journal of Econometrics, 138, 291-311.
Inoue, A., and L. Kilian (2004), “Bagging Time Series Models,” CEPR Discussion Paper
No. 4333.
Mark, N.C. (1995), “Exchange Rates and Fundamentals: Evidence on Long-Horizon
Predictability,” American Economic Review, 85, 201-218.
Table 1: Size of Nominal 5% and 10% DM Tests under Local-to-Zero Asymptotics

             T = 60           T = 120          T = 240          T = 480
  R         5%     10%       5%     10%       5%     10%       5%     10%
  15        1.8     3.9      0.8     2.0      0.3     1.1      0.2     0.7
  30        3.7     7.1      1.5     3.5      0.6     1.9      0.4     1.0
  45        7.1    11.3      2.3     4.9      1.0     2.5      0.5     1.4
  0.25T     1.8     3.9      1.4     3.5      1.4     3.2      1.1     3.1
  0.50T     3.7     7.1      3.1     6.4      2.6     5.5      2.3     5.2
  0.75T     7.1    11.3      5.3     9.4      4.2     8.1      3.7     7.3
NOTES: All results based on 50,000 trials under the null hypothesis of equal MSPEs in
population. The DM test statistic is based on recursive regressions. R denotes the length of the
initial recursive sample and T the sample size.
Figure 1: Asymptotic MSPEs under Local-to-Zero Asymptotics
[Figure: the asymptotic MSPE of the restricted forecast and of the unrestricted forecast plotted against the Pitman drift parameter $\delta$ over the range 0 to 2.]
NOTES: Adapted from Inoue and Kilian (2004). $\delta$ denotes the Pitman drift term. The variance has been normalized to 1 without loss of generality. The asymptotic MSPEs are equal at $\delta = 1$, not at $\delta = 0$, so a test of $H_0: \delta = 0$ is not a test of the null hypothesis of equal MSPEs.
Figure 2: Gaussian Kernel Density Estimates of the Null Distribution of the
DM Test Statistics in the Local-to-Zero Model
[Figure: kernel density estimates of the DM statistic for $R = 0.25T$ and $R = 0.75T$ with $T = 480$, plotted together with the N(0,1) density; horizontal axis: DM statistic.]
Notes: Estimates are based on recursive forecast-error sequences obtained from an initial window of R observations. All results are based on 50,000 trials.
Figure 3: Assessing Assumption DM
[Figure: upper panel shows the recursive loss differential; lower panel shows the squared demeaned recursive loss differential; horizontal axis: evaluation period.]
Notes: Loss differential obtained from a random sample of length $T - R$ with $T = 480$ and $R = 0.25T$.