Comment

Lutz Kilian
Department of Economics, University of Michigan, Ann Arbor, MI 48109-1220

Frank Diebold's personal reflections about the history of the DM test remind us that this test was originally designed to compare the accuracy of model-free forecasts, such as judgmental forecasts generated by experts, forecasts implied by financial markets, survey forecasts, or forecasts based on prediction markets. This test is used routinely in applied work. For example, Baumeister and Kilian (2012) use the DM test to compare oil price forecasts based on the prices of oil futures contracts against the no-change forecast.

Much of the econometric literature that builds on Diebold and Mariano (1995), in contrast, has been preoccupied with testing the validity of predictive models in pseudo out-of-sample environments. In this more recent literature the concern actually is not the forecasting ability of the models in question. Rather, the focus is on testing the null hypothesis that there is no predictive relationship from one variable to another in population. Testing for the existence of a predictive relationship in population is viewed as an indirect test of all economic models that suggest such a predictive relationship. A case in point is studies of the predictive power of monetary fundamentals for the exchange rate (e.g., Mark 1995). Although this testing problem may seem similar to that in Diebold and Mariano (1995) at first sight, it is conceptually quite different from the original motivation for the DM test. As a result, numerous changes have been proposed in the way the test statistic is constructed and in how its distribution is approximated.

In a linear regression model, testing for predictability in population comes down to testing the null hypothesis of zero slopes, which can be assessed using standard in-sample t- or Wald tests. Alternatively, the same null hypothesis of zero slopes can also be tested based on recursive or rolling estimates of the loss in fit associated with generating pseudo out-of-sample predictions from the restricted rather than the unrestricted model. Many empirical studies, including Mark (1995), implement both tests. Under standard assumptions, it follows immediately that pseudo out-of-sample tests have the same asymptotic size as, but lower power than, in-sample tests of the null hypothesis of no predictability in population, which raises the question of why anyone would want to use such tests. While perhaps obvious, this point has nevertheless generated extensive debate. The power advantages of in-sample tests of predictability were first formally established in Inoue and Kilian (2004). Recent work by Hansen and Timmermann (2013) elaborates on the same point. Less obviously, it can be shown that these asymptotic power advantages also generalize to comparisons of models subject to data mining, serial correlation in the errors, and even certain forms of structural breaks (see Inoue and Kilian 2004).

WHERE DID THE LITERATURE GO OFF TRACK?

In recent years, there has been increased recognition of the fact that tests of population predictability designed to test the validity of predictive models are not suitable for evaluating the accuracy of forecasts. The difference is best illustrated within the context of a predictive regression with coefficients that are modelled as local to zero. The local asymptotics here serve as a device to capture our inability to detect nonzero regression coefficients with any degree of reliability.
Consider the data generating process $y_{t+1} = \theta + \varepsilon_{t+1}$, where $\theta = \theta_0 T^{-1/2}$, $\theta_0 \geq 0$, and $\varepsilon_t \sim \mathrm{NID}(0,1)$. The Pitman drift parameter $\theta_0$ cannot be estimated consistently. We restrict attention to one-step-ahead forecasts. One is the restricted forecast $\hat{y}_{t+1|t} = 0$; the other is the unrestricted forecast $\hat{y}_{t+1|t} = \hat{\theta}$, where $\hat{\theta}$ is the recursively obtained least-squares estimate of $\theta$. This example is akin to the problem of choosing between a random walk with drift and without drift in generating forecasts of the exchange rate.

It is useful to compare the asymptotic MSPEs of these two forecasts. The MSPE can be expressed as the sum of the forecast variance and the squared forecast bias. The restricted forecast has zero variance by construction for all values of $\theta_0$, but is biased away from the optimal forecast by $\theta_0$, so its MSPE is $\theta_0^2$. The unrestricted forecast, in contrast, has zero bias, but a constant variance for all $\theta_0$, which can be normalized to unity without loss of generality. As Figure 1 illustrates, the MSPEs of the two forecasts are equal for $\theta_0 = 1$. This means that for values of $0 \leq \theta_0 < 1$, the restricted forecast actually is more accurate than the unrestricted forecast, although the restricted forecast is based on a model that is invalid in population, given that $\theta_0 > 0$. This observation illustrates that there is a fundamental difference between the objective of selecting the true model and selecting the model with the lowest MSPE (also see Inoue and Kilian 2006).

Traditional tests of the null hypothesis of no predictability (referred to as old school WCM tests by Frank Diebold) correspond to tests of $H_0: \theta = 0$, which is equivalent to testing $H_0: \theta_0 = 0$. It has been common for proponents of such old school tests to advertise their tests as tests of equal forecast accuracy. This language is misleading. As Figure 1 shows, testing equal forecast accuracy under quadratic loss corresponds to testing $H_0: \theta_0 = 1$ in this example. An immediate implication of Figure 1 is that the critical values of conventional pseudo-out-of-sample tests of equal forecast accuracy are too low, if the objective is to compare the forecasting accuracy of the restricted and the unrestricted forecast. As a result, these tests invariably reject the null of equal forecast accuracy too often in favor of the unrestricted forecast. They suffer from size distortions even asymptotically. This point was first made in Inoue and Kilian (2004) and has become increasingly accepted in recent years (e.g., Giacomini and White 2006; Clark and McCracken 2012). It applies not only to all pseudo-out-of-sample tests of predictability published as of 2004, but also to the more recently developed alternative test of Clark and West (2007). Although the latter test embodies an explicit adjustment for finite-sample parameter estimation uncertainty, it is based on the same $H_0: \theta = 0$ as more traditional tests of no predictability and hence is not suitable for evaluating the null of equal MSPEs.

BACK TO THE ROOTS

Frank Diebold makes the case that we should return to the roots of this literature and abandon old school WCM tests in favor of the original DM test to the extent that we are interested in testing the null hypothesis of equal MSPEs. Applying the DM test relies on Assumption DM, which states that the loss differential has to be covariance stationary for the DM test to have an asymptotic N(0,1) distribution. Frank Diebold suggests that, as long as we carefully test that assumption and verify that it holds at least approximately, the DM test should replace the old school WCM tests in practice.
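To fix ideas, the sketch below shows one way the DM statistic for one-step-ahead forecasts under quadratic loss might be computed. The function name, the Bartlett-kernel long-run variance estimator, and the default truncation lag of zero (appropriate when the loss differential is serially uncorrelated) are illustrative choices and are not taken from Diebold's discussion.

```python
import numpy as np

def dm_statistic(e1, e2, lags=0):
    """DM statistic for equal MSPE of two competing one-step-ahead forecasts.

    e1, e2 : forecast error sequences of the two forecasts.
    lags   : truncation lag of the Bartlett-kernel long-run variance estimate
             of the loss differential (0 if the differential is serially
             uncorrelated).
    Under Assumption DM the statistic is asymptotically N(0,1) under the
    null hypothesis of equal MSPEs.
    """
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # quadratic loss differential
    P = d.size
    dbar = d.mean()
    u = d - dbar
    lrv = u @ u / P                                  # lag-0 autocovariance
    for j in range(1, lags + 1):
        gamma_j = u[j:] @ u[:-j] / P
        lrv += 2.0 * (1.0 - j / (lags + 1)) * gamma_j  # Bartlett weights
    return dbar / np.sqrt(lrv / P)
```

A two-sided test of equal MSPEs at the nominal 5% level then rejects when the absolute value of the statistic exceeds 1.96.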
Frank Diebold acknowledges that in some cases there are alternative tests of the null of equal out-of-sample MSPEs, such as Clark and McCracken (2011, 2012), which he refers to as new school WCM tests, but he considers these tests too complicated to be worth considering in practice.

It is useful to examine this proposal within the context of our local-to-zero predictive model. Table 1 investigates the size of the DM test based on the N(0,1) asymptotic approximation (a stylized simulation sketch is provided below). We focus on practically relevant sample sizes. In each case, we choose $\theta_0$ in the data generating process such that the MSPEs of the restricted and the unrestricted model are equal. Under our assumptions, $\theta = 1/T^{1/2}$ under this null hypothesis. We focus on recursive estimation of the unrestricted model. The initial recursive estimation window consists of the first $R$ sample observations, $R < T$. We explore two alternative asymptotic thought experiments. In the first case, $R \in \{15, 30, 45\}$ is fixed with respect to the sample size. In the second case, $R = \lambda T$ with $\lambda \in \{0.25, 0.5, 0.75\}$.

Table 1 shows that, regardless of the asymptotic thought experiment, the effective size of the DM test may be lower or higher than the nominal size, depending on $R$ and $T$. In most cases, the DM test is conservative in that its empirical size is below the nominal size. This is an interesting contrast to traditional tests of $H_0: \theta = 0$, whose size invariably exceeds the nominal size when testing the null of equal MSPEs. When the empirical size of the DM test exceeds its nominal size, the size distortions are modest and vanish for larger $T$. Table 1 also illustrates that in practice $R$ must be large relative to $T$ for the test to have reasonably accurate size. Otherwise the DM test may become extremely conservative.

This evidence does not mean that the finite-sample distribution of the DM test statistic is well approximated by a N(0,1) distribution. Figure 2 illustrates that even for $T = 480$ the density of the DM test statistic is far from N(0,1). This finding is consistent with the theoretical analysis in Clark and McCracken (2011, p. 8) of the limiting distribution of the DM test statistic in the local-to-zero framework. Nevertheless, the right tail of the empirical distribution is reasonably close to that of the N(0,1) asymptotic approximation for large $R/T$, which explains the fairly accurate size in Table 1 for $R = 0.75T$.

The failure of the N(0,1) approximation in this example suggests a violation of Assumption DM. This fact raises the question of whether one would have been able to detect this problem by plotting and analyzing the recursive loss differential, as suggested by Frank Diebold. The answer for our data generating process is sometimes yes, but often not. While there are some draws of the loss differential in the Monte Carlo study that show clear signs of nonstationarity even without formal tests, many others do not. Figure 3 shows one such example. This finding casts doubt on our ability to verify Assumption DM in practice.

Leaving aside the question of the validity of the N(0,1) approximation in this context, the fact that the DM test tends to be conservative for typical sample sizes raises concerns that the DM test may have low power. It therefore is important to compare the DM test to the bootstrap-based tests of the same null hypothesis of equal MSPEs developed in Clark and McCracken (2011). The latter new school WCM tests are designed for nested forecasting models and are motivated by the same local-to-zero framework we relied on in our example.
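The size experiment summarized in Table 1 can be illustrated in stylized form. The sketch below reuses the dm_statistic function from the earlier sketch, generates data under the equal-MSPE null ($\theta = 1/T^{1/2}$), forms the restricted and the recursively estimated unrestricted one-step-ahead forecasts with an initial window of $R$ observations, and reports the rejection frequency of a two-sided nominal 5% test. The function name, the use of a two-sided test, and the small number of trials are simplifying assumptions; this is not a description of the actual experiment underlying Table 1.

```python
import numpy as np

def dm_size(T=120, R=30, trials=2000, crit=1.96, seed=0):
    """Rejection frequency of a nominal 5% DM test in the local-to-zero model.

    DGP: y_{t+1} = theta + eps_{t+1} with theta = 1/sqrt(T) and eps ~ N(0,1),
    so that the restricted (zero) forecast and the unrestricted forecast based
    on the recursive least-squares estimate of theta have equal asymptotic MSPEs.
    """
    rng = np.random.default_rng(seed)
    theta = 1.0 / np.sqrt(T)
    rejections = 0
    for _ in range(trials):
        y = theta + rng.standard_normal(T)
        e_r, e_u = [], []
        for t in range(R, T):              # forecast y[t] with data through t-1
            theta_hat = y[:t].mean()       # recursive least-squares estimate
            e_r.append(y[t])               # error of the restricted forecast (0)
            e_u.append(y[t] - theta_hat)   # error of the unrestricted forecast
        stat = dm_statistic(np.array(e_r), np.array(e_u))
        rejections += abs(stat) > crit
    return rejections / trials

print(dm_size(T=120, R=30))   # rough counterpart to one entry in Table 1
```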
The Clark and McCracken tests can be applied to estimates of the MSPE based on rolling or recursive regressions. Simulation results presented in Clark and McCracken (2011) suggest that their alternative tests have size similar to the DM test in practice, facilitating the comparison. Clark and McCracken's bootstrap-based MSE-t test in several examples appears to have slightly higher power than the DM test, but the power advantages are not uniform across all data generating processes. Larger and more systematic power gains are obtained for their bootstrap MSE-F test, however. This evidence suggests that there remains a need for alternative tests of the null hypothesis of equal MSPEs such as the MSE-F test in Clark and McCracken (2011). While such tests so far are not available for all situations of interest in practice, their further development seems worthwhile. There is also an alternative test of equal forecast accuracy proposed by Giacomini and White (2006), involving a somewhat different specification of the null hypothesis, which is designed specifically for pseudo out-of-sample forecasts based on rolling regressions.

An interesting implication of closely related work in Clark and McCracken (2012) is that even when one is testing the null hypothesis of equal MSPEs of forecasts from nested models (as opposed to the traditional null hypothesis of no predictability), in-sample tests have higher power than the corresponding pseudo out-of-sample tests in Clark and McCracken (2011). This finding closely mirrors the earlier results in Inoue and Kilian (2004) for tests of the null of no predictability. It renews the question raised in Inoue and Kilian (2004) of why anyone would care to conduct pseudo out-of-sample inference about forecasts as opposed to in-sample inference. Frank Diebold is keenly aware of this question and draws attention to the importance of assessing and understanding the historical evolution of the accuracy of forecasting models as opposed to their MSPE. A case in point is the analysis in Baumeister, Kilian, and Lee (2013), who show that the accuracy of forecasts with low MSPEs may not be stable over time. Such analysis does not necessarily require formal tests, however, and existing tests for assessing stability are not designed to handle the real-time data constraints in many economic time series. Nor do these tests allow for iterated as opposed to direct forecasts.

THE PROBLEM OF SELECTING FORECASTING MODELS

This leaves the related, but distinct, question of forecasting model selection. It is important to keep in mind that tests of equal forecast accuracy are not designed to select among alternative parametric forecasting models. This distinction is sometimes blurred in discussions of tests of equal predictive ability. Forecasting model selection involves the ranking of candidate forecasting models based on their performance. There are well-established methods for selecting the forecasting model with the lowest out-of-sample MSPE, provided that the number of candidate models is small relative to the sample size, that we restrict attention to direct forecasts, and that there is no structural change in the out-of-sample period. It is important to stress that consistent model selection does not require the true model to be included among the forecasting models in general (a sketch of such a ranking exercise follows below).
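As a stylized illustration of the ranking exercise just described, the sketch below ranks two candidate forecasting models, one with and one without a predictor, both by the Schwarz information criterion (SIC) computed on the full sample and by the recursive pseudo out-of-sample MSPE. The data generating process, the function names, and the use of the standard SIC penalty are illustrative assumptions only; as noted below, consistently selecting the lowest-MSPE model may in fact require a larger penalty term.

```python
import numpy as np

def sic(y, X):
    """Schwarz information criterion for a linear regression of y on X."""
    T, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return T * np.log(resid @ resid / T) + k * np.log(T)

def recursive_mspe(y, X, R):
    """Pseudo out-of-sample MSPE of one-step-ahead recursive regression forecasts."""
    errors = []
    for t in range(R, len(y)):             # forecast y[t] with data through t-1
        beta = np.linalg.lstsq(X[:t], y[:t], rcond=None)[0]
        errors.append(y[t] - X[t] @ beta)
    return np.mean(np.square(errors))

# Two candidate forecasting models: a constant-only model and a model adding a
# weak predictor x (row t of X is assumed known when y[t] is forecast).
rng = np.random.default_rng(1)
T, R = 200, 50
x = rng.standard_normal(T)
y = 0.1 * x + rng.standard_normal(T)
candidates = {
    "constant only": np.ones((T, 1)),
    "constant + predictor": np.column_stack([np.ones(T), x]),
}
best_by_sic = min(candidates, key=lambda m: sic(y, candidates[m]))
best_by_oos = min(candidates, key=lambda m: recursive_mspe(y, candidates[m], R))
print(best_by_sic, best_by_oos)
```

The two rankings need not agree in any given sample, which is the point at issue in what follows.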
Inoue and Kilian (2006), for example, prove that suitably specified information criteria based on the full sample will select the forecasting model with the lowest out-of-sample MSPE even when all candidate models are misspecified. In contrast, forecasts obtained by ranking models by their rolling or recursive MSPE will, with positive probability, inflate the out-of-sample MSPE under conventional asymptotic approximations. Frank Diebold's discussion is correct that PLS effectively equals the SIC under the assumption that $R$ does not depend on $T$, but the latter nonstandard asymptotic thought experiment does not provide good approximations in finite samples, as shown in Inoue and Kilian (2006). Moreover, consistently selecting the forecasting model with the lowest MSPE may require larger penalty terms in the information criterion than embodied in conventional criteria such as the SIC and AIC.

The availability of these results does not mean that the problem of selecting forecasting models is resolved. First, asymptotic results may be of limited use in small samples. Second, information criteria are not designed for selecting among a large-dimensional set of forecasting models. Their asymptotic validity breaks down once we examine larger sets of candidate models. Third, standard information criteria are unable to handle iterated forecasts. Fourth, Inoue and Kilian (2006) prove that no forecasting model selection method in general remains valid in the presence of unforeseen structural changes in the out-of-sample period. The problem has usually been dealt with in the existing literature by simply restricting attention to structural changes in the past, while abstracting from structural changes in the future. Finally, there are no theoretical results on how to select among forecasting methods that involve model selection at each stage of recursive or rolling regressions. The latter situation is quite common in applied work.

CONCLUSION

The continued interest in the question of how to evaluate the accuracy of forecasts shows that Diebold and Mariano's (1995) key insights still are very timely, even twenty years later. Although the subsequent literature in some cases followed directions not intended by the authors, there is a growing consensus on how to think about this problem and which pitfalls must be avoided by applied users. The fact that in many applications in-sample tests have proved superior to simulated out-of-sample tests does not mean that there are no situations in which we care about out-of-sample inference. Important examples include inference about forecasts from real-time data and inference about iterated forecasts. Unfortunately, however, existing tests of equal out-of-sample accuracy are not designed to handle these and other interesting extensions. This fact suggests that much work remains to be done for econometricians going forward.

ADDITIONAL REFERENCES

Baumeister, C., and L. Kilian (2012), "Real-Time Forecasts of the Real Price of Oil," Journal of Business and Economic Statistics, 30, 326-336.

Baumeister, C., Kilian, L., and T.K. Lee (2013), "Are There Gains from Pooling Real-Time Oil Price Forecasts?" mimeo, University of Michigan.

Clark, T.E., and M.W. McCracken (2012), "In-Sample Tests of Predictive Ability: A New Approach," Journal of Econometrics, 170, 1-14.

Clark, T.E., and K.D. West (2007), "Approximately Normal Tests for Equal Predictive Accuracy in Nested Models," Journal of Econometrics, 138, 291-311.

Inoue, A., and L. Kilian (2004), "Bagging Time Series Models," CEPR Discussion Paper No. 4333.
Mark, N.C. (1995), "Exchange Rates and Fundamentals: Evidence on Long-Horizon Predictability," American Economic Review, 85, 201-218.

Table 1: Size of Nominal 5% and 10% DM Tests under Local-to-Zero Asymptotics

                   R = 15   R = 30   R = 45   R = 0.25T   R = 0.50T   R = 0.75T
T = 60      5%       1.8      3.7      7.1        1.8         3.7         7.1
           10%       3.9      7.1     11.3        3.9         7.1        11.3
T = 120     5%       0.8      1.5      2.3        1.4         3.1         5.3
           10%       2.0      3.5      4.9        3.5         6.4         9.4
T = 240     5%       0.3      0.6      1.0        1.4         2.6         4.2
           10%       1.1      1.9      2.5        3.2         5.5         8.1
T = 480     5%       0.2      0.4      0.5        1.1         2.3         3.7
           10%       0.7      1.0      1.4        3.1         5.2         7.3

NOTES: All results are based on 50,000 trials under the null hypothesis of equal MSPEs in population. The DM test statistic is based on recursive regressions. $R$ denotes the length of the initial recursive sample and $T$ the sample size.

Figure 1: Asymptotic MSPEs under Local-to-Zero Asymptotics

[Figure: asymptotic MSPE of the restricted and the unrestricted forecast plotted against the Pitman drift $\theta_0$ over the range 0 to 2.]

NOTES: Adapted from Inoue and Kilian (2004). $\theta_0$ denotes the Pitman drift term. The variance has been normalized to 1 without loss of generality. The asymptotic MSPEs are equal at $\theta_0 = 1$, not at $\theta_0 = 0$, so a test of $H_0: \theta_0 = 0$ is not a test of the null hypothesis of equal MSPEs.

Figure 2: Gaussian Kernel Density Estimates of the Null Distribution of the DM Test Statistic in the Local-to-Zero Model, T = 480

[Figure: kernel density estimates of the DM statistic for R = 0.25T and R = 0.75T together with the N(0,1) density, plotted against the value of the DM statistic over the range -3 to 3.]

NOTES: Estimates based on recursive forecast error sequences with an initial window of $R$ observations. All results are based on 50,000 trials.

Figure 3: Assessing Assumption DM

[Figure: two panels showing, for one random sample, the recursive loss differential (upper panel) and the squared demeaned recursive loss differential (lower panel) over the evaluation period.]

NOTES: Loss differential obtained from a random sample of length $T - R$ with $T = 480$ and $R = 0.25T$.