Recursive and direct multi-step forecasting: the best of both worlds

ISSN 1440-771X
Department of Econometrics and Business Statistics
http://www.buseco.monash.edu.au/depts/ebs/pubs/wpapers/
Working Paper 19/12

Souhaib Ben Taieb
Machine Learning Group, Department of Computer Science,
Université Libre de Bruxelles, Brussels, Belgium.
Email: Souhaib.Ben.Taieb@ulb.ac.be

Rob J Hyndman
Department of Econometrics and Business Statistics,
Monash University, Clayton VIC 3800, Australia.
Email: Rob.Hyndman@monash.edu

2 September 2012

JEL classification: J11, C53, C14, C32
Abstract
We propose a new forecasting strategy, called rectify, that seeks to combine the best properties of
both the recursive and direct forecasting strategies. The rationale behind the rectify strategy is
to begin with biased recursive forecasts and adjust them so they are unbiased and have smaller
error. We use linear and nonlinear simulated time series to investigate the performance of the
rectify strategy and compare the results with those from the recursive and the direct strategies.
We also carry out some experiments using real world time series from the M3 and the NN5
forecasting competitions. We find that the rectify strategy is always better than, or at least has
comparable performance to, the best of the recursive and the direct strategies. This finding
makes the rectify strategy very attractive as it avoids making a choice between the recursive and
the direct strategies which can be a difficult task in real-world applications.
Keywords: Multi-step forecasting; forecasting strategies; recursive forecasting; direct forecasting; linear time series; nonlinear time series; M3 competition; NN5 competition.
1 Introduction
Traditionally, multi-step forecasting has been handled recursively, where a single time series
model is estimated and each forecast is computed using previous forecasts. More recently, direct calculation of multi-step forecasts has been proposed, where a separate time series model is estimated for each forecast horizon and forecasts are computed using only the observed data.
Choosing between these different strategies involves a trade-off between bias and estimation
variance. Recursive forecasting is biased when the underlying model is nonlinear, but direct
forecasting has higher variance because it uses fewer observations when estimating the model,
especially for longer forecast horizons.
The literature on this topic often involves comparing the recursive and direct strategies, and
discussing the conditions under which one or other is better. For example, Ing (2003) shows that
in the linear case, the recursive MSE is greater than the direct MSE. Chevillon (2007) concludes
that the direct strategy is most beneficial when the model is misspecified.
In this paper, we take a different approach and propose a new forecasting strategy that seeks to
combine the best properties of both the recursive and direct strategies. The rationale behind the
rectify strategy is to begin with biased recursive forecasts and adjust them so they are unbiased
and have smaller error.
In the next section, we present both the recursive and the direct strategies, together with a decomposition of their corresponding mean squared errors. Section 3 presents the rectify
strategy using the same terminology as for the recursive and the direct strategies to allow
theoretical comparisons. Section 4 gives some details about the set-up of our experiments.
Section 5 investigates the performance of the different strategies using linear and nonlinear
simulated time series. Section 6 shows the performance of the different strategies with time
series from two forecasting competitions, namely the M3 and the NN5 competition. Finally, we
conclude our work in Section 7.
2 Forecasting strategies
Given a univariate time series {y1 , . . . , yT } comprising T observations, we want to forecast the
next H values of the time series.
We will assume the data come from a possibly nonlinear autoregressive process of the form
$$y_t = f(x_{t-1}) + \varepsilon_t \quad \text{with} \quad x_t = [y_t, \ldots, y_{t-d+1}]',$$
where $\{\varepsilon_t\}$ is a stochastic iid error process with mean zero, variance $\sigma^2$, $E[\varepsilon_t^3] = 0$ (for simplicity) and $\kappa = E(\varepsilon_t^4) > 0$. The process is specified by a function f, embedding dimension d, and an error term $\varepsilon_t$.
In this article, we assume the goal of forecasting is to estimate the conditional mean µt+h|t =
E(yt+h | xt ), and we will evaluate different forecasting strategies by how well they approximate
µt+h|t .
When h = 1, we have the simple expression $\mu_{t+1|t} = f(x_t)$. If f is linear, we can also write down some relatively simple expressions (Fan & Yao 2003, p.118):
$$\mu_{t+h|t} = \begin{cases} f\bigl([f^{(h-1)}(x_t), \ldots, f^{(h-d)}(x_t)]'\bigr), & \text{if } h > 0; \\ x_t' w_h, & \text{if } 1-d \le h \le 0; \end{cases} \qquad (1)$$
where wh has jth element equal to 1 if j = 1 − h and equal to 0 otherwise. Thus, when f is linear,
the conditional mean forecasts can be computed recursively. More generally, for nonlinear f
and h > 1, the calculation of µt+h|t has no simple form.
Each forecasting strategy involves estimating one or more models which are not necessarily
of the same form as f and may not have the same embedding dimension as f . For one-step
forecasts, the model will be denoted by $y_t = m(x_{t-1}; \theta) + e_t$, where $x_t = [y_t, \ldots, y_{t-p+1}]'$. That is, we estimate (or assume) the form of m, the parameters θ and the dimension p from a set of training data. If m is of the same form as f (up to some estimable parameters), we write m ≡ f. Ideally, we would like p = d, m ≡ f, and the estimates of θ close to the true parameters. But we are
allowing for model mis-specification by not making these assumptions. For multi-step forecasts,
some strategies will use the same model m as for one-step forecasts, while other strategies may
involve additional or alternative models to be estimated.
If we let $\hat{m}^{(h)}(x_t)$ denote the forecasts of a given strategy at horizon h and define $m^{(h)}(x_t) = E[\hat{m}^{(h)}(x_t) \mid x_t]$, then the mean squared error (MSE) at horizon h is given by
$$\text{MSE}_h = E\bigl[(y_{t+h} - \hat{m}^{(h)}(x_t))^2\bigr] = \underbrace{E\bigl[(y_{t+h} - \mu_{t+h|t})^2\bigr]}_{\text{Noise}} + \underbrace{(\mu_{t+h|t} - m^{(h)}(x_t))^2}_{\text{Bias } b_h^2} + \underbrace{E\bigl[(\hat{m}^{(h)}(x_t) - m^{(h)}(x_t))^2\bigr]}_{\text{Variance}} \qquad (2)$$
where all expectations are conditional on xt . The variance term on the right will converge to
zero as the size of the training set increases, and so we will not consider it further.
To simplify the derivation of the remaining terms, we use similar arguments to Atiya et al.
(1999) and consider only h = 2. An analogous approach can be used to derive the MSE for larger
forecast horizons.
Now $y_{t+2}$ is given by
$$y_{t+2} = f(y_{t+1}, \ldots, y_{t-d+2}) + \varepsilon_{t+2} = f(f(x_t) + \varepsilon_{t+1}, \ldots, y_{t-d+2}) + \varepsilon_{t+2}.$$
Using Taylor series approximations up to second-order terms, we get
$$y_{t+2} \approx f(f(x_t), \ldots, y_{t-d+2}) + \varepsilon_{t+1} f_{x_1} + \tfrac{1}{2}\varepsilon_{t+1}^2 f_{x_1 x_1} + \varepsilon_{t+2},$$
where $f_{x_1}$ is the derivative of f with respect to its first argument, $f_{x_1 x_1}$ is its second derivative with respect to its first argument, and so on.
The noise term depends only on the data generating process and is given by
$$E\bigl[(y_{t+2} - \mu_{t+2|t})^2\bigr] \approx E\Bigl[\bigl(f(f(x_t), \ldots, y_{t-d+2}) + \varepsilon_{t+1} f_{x_1} + \tfrac{1}{2}\varepsilon_{t+1}^2 f_{x_1 x_1} + \varepsilon_{t+2} - f(f(x_t), \ldots, y_{t-d+2}) - \tfrac{1}{2}\sigma^2 f_{x_1 x_1}\bigr)^2\Bigr] = \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2.$$
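Spelling out the intermediate step (not written out in the text above, but it follows directly from $E[\varepsilon_t^3] = 0$, $\kappa = E(\varepsilon_t^4)$ and the independence of $\varepsilon_{t+1}$ and $\varepsilon_{t+2}$):
$$\begin{aligned}
E\bigl[(y_{t+2} - \mu_{t+2|t})^2\bigr]
&\approx E\Bigl[\bigl(\varepsilon_{t+1} f_{x_1} + \tfrac{1}{2}(\varepsilon_{t+1}^2 - \sigma^2) f_{x_1 x_1} + \varepsilon_{t+2}\bigr)^2\Bigr] \\
&= f_{x_1}^2 E[\varepsilon_{t+1}^2] + \tfrac{1}{4} f_{x_1 x_1}^2 E\bigl[(\varepsilon_{t+1}^2 - \sigma^2)^2\bigr] + E[\varepsilon_{t+2}^2] \\
&= \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2,
\end{aligned}$$
since the cross terms vanish: $E[\varepsilon_{t+1}(\varepsilon_{t+1}^2 - \sigma^2)] = E[\varepsilon_{t+1}^3] = 0$, $\varepsilon_{t+2}$ is independent of $\varepsilon_{t+1}$ with mean zero, and $E[(\varepsilon_{t+1}^2 - \sigma^2)^2] = \kappa - \sigma^4$.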
Consequently, the MSE at horizon h = 2 is given by
$$\text{MSE}_2 \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2 + (\mu_{t+h|t} - m^{(h)}(x_t))^2. \qquad (3)$$

2.1 The recursive strategy
In the recursive strategy, we estimate the model
$$y_t = m(x_{t-1}; \theta) + e_t, \quad \text{where } x_t = [y_t, \ldots, y_{t-p+1}]', \qquad (4)$$
and $E[e_t] = 0$. In standard one-step optimization with squared errors, the objective function of the recursive strategy is $E\bigl[(y_{t+1} - m(x_t; \theta))^2 \mid x_t\bigr]$ and the parameters θ are estimated by
$$\hat{\theta} = \operatorname*{argmin}_{\theta \in \Theta} \sum_t \bigl(y_t - m(x_{t-1}; \theta)\bigr)^2, \qquad (5)$$
where Θ denotes the parameter space.
We compute forecasts recursively as in (1):
$$\hat{m}^{(h)}(x_t) = \begin{cases} \hat{m}\bigl([\hat{m}^{(h-1)}(x_t), \ldots, \hat{m}^{(h-p)}(x_t)]'\bigr), & \text{if } h > 0; \\ x_t' w_h, & \text{if } 1-p \le h \le 0; \end{cases}$$
where m̂(x) is a shorthand notation for m(x; θ̂). These forecasts are also sometimes called
“iterated multi-step” (IMS) forecasts (e.g., Chevillon & Hendry 2005, Chevillon 2007, Franses &
Legerstee 2009).
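To make the recursive (IMS) computation concrete, the following minimal Python sketch (illustrative, not the authors' code) fits a linear one-step model by least squares and then iterates it, feeding each forecast back in as a pseudo-observation; the function names and the intercept term are assumptions of this sketch.

import numpy as np

def fit_one_step(y, p):
    """Least-squares fit of a linear one-step model y_t = theta' [1, y_{t-1}, ..., y_{t-p}] + e_t."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([y[p - j - 1 : len(y) - j - 1] for j in range(p)])  # lags 1..p
    Y = y[p:]
    theta, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(Y)), X]), Y, rcond=None)
    return theta

def recursive_forecast(y, theta, p, H):
    """Iterate the fitted one-step model to obtain H-step (IMS) forecasts."""
    window = list(np.asarray(y, dtype=float)[-p:][::-1])   # [y_T, y_{T-1}, ..., y_{T-p+1}]
    forecasts = []
    for _ in range(H):
        yhat = theta[0] + np.dot(theta[1:], window)
        forecasts.append(yhat)
        window = [yhat] + window[:-1]                       # feed the forecast back in
    return np.array(forecasts)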
The choice of θ̂ given by (5) minimizes the mean squared error of the one-step forecasts, and so ensures (see, for example, Hastie et al. 2008, p.18) that $m^{(1)}(x_t) = \mu_{t+1|t}$ provided m ≡ f and p = d. Thus, the one-step forecasts are unbiased under these conditions. The same unbiasedness property does not hold for higher forecast horizons except in some special cases.
Again, we will consider only h = 2. The bias of the recursive strategy can be calculated as
$$\begin{aligned} b_2 = \mu_{t+2|t} - m^{(2)}(x_t) &\approx f(f(x_t), \ldots, y_{t-d+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1} - m(m(x_t), \ldots, y_{t-p+2}) \\ &= \bigl[f(f(x_t), \ldots, y_{t-d+2}) - m(m(x_t), \ldots, y_{t-p+2})\bigr] + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}. \end{aligned} \qquad (6)$$
So even when m ≡ f and d = p, the forecasts will be biased unless $f_{x_1 x_1} = 0$. That is, recursive forecasts are unbiased if and only if f is linear, a linear model m is used, and the embedding dimension is correctly determined; in all other situations, recursive forecasts will be biased for h ≥ 2. In particular, the bias will be large whenever $|f_{x_1 x_1}|$ is large; that is, when f has high curvature.
Using expression (3), we see that the recursive strategy has a mean squared error at horizon
h = 2 equal to
$$\text{MSE}_2^{\text{recursive}} \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2 + \bigl[f(f(x_t), \ldots, y_{t-d+2}) - m(m(x_t), \ldots, y_{t-p+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}\bigr]^2.$$

When m ≡ f and d = p, the MSE simplifies to
$$\text{MSE}_2^{\text{recursive}} \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}\kappa f_{x_1 x_1}^2.$$
But when the model is misspecified, either in the embedding dimension, or in the functional
form of m, the MSE can be much larger.
A variation on the recursive forecasting strategy is to use a different set of parameters for each
forecasting horizon:
$$\hat{\theta}_h = \operatorname*{argmin}_{\theta \in \Theta} \sum_t \bigl[y_t - m^{(h)}(x_{t-h}; \theta)\bigr]^2.$$

A further variation selects the parameters to minimize the forecast errors over the first H forecast horizons:
$$\hat{\theta} = \operatorname*{argmin}_{\theta \in \Theta} \sum_{h=1}^{H} \sum_t \bigl[y_t - m^{(h)}(x_{t-h}; \theta)\bigr]^2.$$
In both of these variants, forecasts are still obtained recursively from the one-step model.
The only difference with standard recursive forecasting is that the parameters are optimized
differently to allow more accurate multi-step forecasts. Examples of machine learning models
using these variations of the recursive strategy are recurrent neural networks (e.g., Williams &
Zipser 1989, Werbos 1990) and local models with the nearest trajectories criterion (McNames
1998).
An advantage of using the recursive strategy is that only one model is required, saving significant
computational time, especially when a large number of time series and forecast horizons are
involved. The strategy also ensures that the fitted model m matches the assumed data generating
process f as closely as possible. On the other hand, the recursive forecasts are not equal to the
conditional mean, even when the model is exactly equivalent to the data generating process.
2.2 The direct strategy
With the direct strategy, different forecasting models are used for each forecast horizon:
$$y_t = m_h(y_{t-h}, \ldots, y_{t-h-p_h}; \theta_h) + e_{t,h}, \qquad (7)$$
where h = 1, . . . , H. For each model, the parameters $\theta_h$ are estimated as follows:
$$\hat{\theta}_h = \operatorname*{argmin}_{\theta_h \in \Theta_h} \sum_t \bigl[y_t - m_h(x_{t-h}; \theta_h)\bigr]^2. \qquad (8)$$
Then forecasts are obtained for each horizon from the corresponding model: $\hat{m}^{(h)}(x_t) = m_h(x_t; \hat{\theta}_h)$.
This is sometimes also known as “direct multi-step” (DMS) forecasting (e.g., Chevillon & Hendry
2005, Chevillon 2007, Franses & Legerstee 2009).
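A corresponding sketch of the direct (DMS) strategy under the same illustrative linear-model assumption (not the authors' code): one least-squares model per horizon, each conditioning only on observed data.

import numpy as np

def make_lag_matrix(y, p, h):
    """Design matrix whose row for target y_t holds [y_{t-h}, ..., y_{t-h-p+1}]."""
    y = np.asarray(y, dtype=float)
    t0 = h + p - 1                      # first usable target index
    X = np.column_stack([y[t0 - h - j : len(y) - h - j] for j in range(p)])
    return X, y[t0:]

def fit_direct(y, p, H):
    """One least-squares linear model per horizon h = 1..H (DMS)."""
    models = []
    for h in range(1, H + 1):
        X, Y = make_lag_matrix(y, p, h)
        Xc = np.column_stack([np.ones(len(Y)), X])
        theta, *_ = np.linalg.lstsq(Xc, Y, rcond=None)
        models.append(theta)
    return models

def direct_forecast(y, models, p):
    """h-step forecast from the h-th model, using only the observed data."""
    x = np.concatenate([[1.0], np.asarray(y, dtype=float)[-p:][::-1]])
    return np.array([theta @ x for theta in models])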
Because multiple models are used, this approach involves a heavier computational load than
recursive forecasting. Also, we no longer match the model used for forecasting with the assumed
model; the various models are estimated independently and can in practice be quite different
from each other.
Because of the squared errors in (8), $m^{(h)}(x_t) = m_h(x_t)$, and the bias of the direct strategy at horizon h = 2 is given by
$$b_2 = \mu_{t+2|t} - m^{(2)}(x_t) \approx f(f(x_t), \ldots, y_{t-d+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1} - m_2(y_t, \ldots, y_{t-p_2+1}).$$
Thus the strategy leads to unbiased forecasts when $m_2(y_t, \ldots, y_{t-p_2+1}) \equiv f(f(x_t), \ldots, y_{t-d+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}$. These conditions will be satisfied whenever $m_2$ is a sufficiently flexible model. Similar arguments can be used for other forecast horizons.
As for the recursive strategy, we can get the MSE for the direct strategy:
$$\text{MSE}_2^{\text{direct}} \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2 + \bigl[f(f(x_t), \ldots, y_{t-d+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1} - m_2(y_t, \ldots, y_{t-p_2+1})\bigr]^2.$$

When the strategy is unbiased, the MSE simplifies to
$$\text{MSE}_2^{\text{direct}} \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2.$$
Consequently, under ideal conditions, when m ≡ f and p = d for the recursive strategy and the direct strategy is unbiased, we find that the recursive strategy has larger MSE than the direct strategy:
$$\text{MSE}_2^{\text{recursive}} - \text{MSE}_2^{\text{direct}} \approx \tfrac{1}{4}\sigma^4 f_{x_1 x_1}^2.$$
Ing (2003) shows that even in the linear case (when fx1 x1 = 0), when mh and m are assumed
linear and {yt } is assumed stationary, the recursive MSE is greater than the direct MSE, with the
difference being of order O(T −3/2 ) where T is the length of the time series.
Chevillon & Hendry (2005) provide a useful review of some of the literature comparing recursive
and direct forecasting strategies, and explore in detail the differences between the strategies
for VAR models applied to stationary and difference-stationary time series. They show that
for these models, when f , m and mh are all assumed multivariate linear, and under various
mis-specification conditions, then the recursive MSE is greater than the direct MSE.
In a later paper, Chevillon (2007) provides a further review and unifies some of the results
in the literature. He concludes that the direct strategy is most beneficial when the model is
misspecified (i.e., that m and f are not asymptotically equivalent), in particular when the data
contains misspecified unit roots, neglected residual autocorrelation and omitted location shifts.
Putting aside the computational disadvantage of using the direct strategy, it also has a problem
in generating forecasts from potentially very different models at different forecast horizons.
Because every model is selected independently at each horizon, it is possible for consecutive
forecasts to be based on different conditioning information and different model forms. This can
lead to irregularities in the forecast function. These irregularities are manifest as a contribution
to the forecast variance. The problem is exacerbated when each of the mh models is allowed to
be nonlinear and nonparametrically estimated (Chen et al. 2004).
3 The rectify strategy
In this section we propose a new forecasting strategy that borrows from the strengths of the
recursive and direct strategies. We call this new strategy “RECTIFY” because it begins with
recursive forecasts and adjusts (or rectifies) them so they are unbiased and have smaller MSE.
We begin with a simple linear base model, and produce forecasts from it using the recursive
strategy. These are known to be biased, even when the true data generating process is correctly
specified by the base model. Then we correct these forecasts by modelling the forecast errors
using a direct strategy. The resulting forecasts will be unbiased, provided the models used in
the direct strategy are sufficiently flexible.
The advantage of this two-stage process is that it links all the direct forecast models together with
the same unifying base model, thus reducing the irregularities that can arise with independent
models, and so reducing the forecast variance. Of course, it is still possible for the direct
models to be different from each other, but these differences are likely to be much smaller when
modelling the errors from the recursive strategy than when modelling the time series directly.
We shall denote the simple linear base model by yt = z(xt−1 ; θ) + et , from which recursive
forecasts are calculated. In the second stage, we adjust the forecasts from the base model by
applying direct forecasting models to the errors from the recursive base forecasts; that is, we fit
the models
$$y_t - z^{(h)}(y_{t-h}, \ldots, y_{t-h-p}; \hat{\theta}) = r_h(y_{t-h}, \ldots, y_{t-h-p_h}; \gamma_h) + e_{t,h}, \qquad (9)$$
where h = 1, . . . , H.
The parameters for all models are estimated by least squares. Then forecasts are obtained for
each horizon by combining the base model and the rectification models: $\hat{m}^{(h)}(x_t) = \hat{z}^{(h)}(x_t) + \hat{r}_h(x_t)$.
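The two-stage scheme can be sketched as follows (illustrative only, not the authors' implementation; scikit-learn's LinearRegression and KNeighborsRegressor stand in for the linear base model and the kNN rectification models described in Section 4, and the default orders and k below are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def embed(y, p, h=1):
    """Rows [y_{t-h}, ..., y_{t-h-p+1}] (most recent first) paired with targets y_t."""
    t0 = h + p - 1
    X = np.column_stack([y[t0 - h - j : len(y) - h - j] for j in range(p)])
    return X, y[t0:]

def ims(model, recent, p, h):
    """Iterate a one-step model h steps ahead from the p most recent values."""
    window = list(recent[:p])
    for _ in range(h):
        z = model.predict(np.array(window).reshape(1, -1))[0]
        window = [z] + window[:-1]
    return z

def rectify_forecast(y, H, p_base=4, p_rect=4, k=5):
    """Stage 1: recursive forecasts from a linear AR base model.
    Stage 2: per-horizon kNN models fitted to the base model's h-step errors.
    Assumes p_rect >= p_base and enough training rows for k neighbours."""
    y = np.asarray(y, dtype=float)

    Xb, Yb = embed(y, p_base)
    base = LinearRegression().fit(Xb, Yb)
    base_fc = np.array([ims(base, y[-p_base:][::-1], p_base, h) for h in range(1, H + 1)])

    rect_fc = []
    for h in range(1, H + 1):
        Xh, Yh = embed(y, p_rect, h)
        zh = np.array([ims(base, row, p_base, h) for row in Xh])   # in-sample base forecasts
        knn = KNeighborsRegressor(n_neighbors=k, weights="distance").fit(Xh, Yh - zh)
        rect_fc.append(knn.predict(y[-p_rect:][::-1].reshape(1, -1))[0])

    return base_fc + np.array(rect_fc)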
Let $m^{(h)}(x_t) = E[\hat{m}^{(h)}(x_t) \mid x_t]$. Then the bias of the rectify strategy at horizon h = 2 is given by
$$\begin{aligned} b_2 = \mu_{t+2|t} - m^{(2)}(x_t) &\approx f(f(x_t), \ldots, y_{t-d+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1} - \bigl[z(z(x_t), \ldots, y_{t-p+2}) + r_2(y_t, \ldots, y_{t-p_2+1})\bigr] \\ &= \bigl[f(f(x_t), \ldots, y_{t-d+2}) - z(z(x_t), \ldots, y_{t-p+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}\bigr] - r_2(y_t, \ldots, y_{t-p_2+1}). \end{aligned}$$
Thus the strategy leads to unbiased forecasts when $r_2(y_t, \ldots, y_{t-p_2+1}) \equiv f(f(x_t), \ldots, y_{t-d+2}) - z(z(x_t), \ldots, y_{t-p+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}$; in other words, when the rectification models are sufficiently flexible to estimate the conditional mean of the residuals from the base model.
Bias in the base model is corrected with the rectification model, so the value of the base model
is not in getting low bias but low variance. Consequently a relatively simple parametric base
model works best. In particular, we do not need z ≡ f. In our applications, we use a linear
autoregressive model where the order is selected using the AIC. While this model will be biased
for nonlinear time series processes, it will allow most of the signal in f to be modelled, and will
give relatively small variances due to being linear. Another advantage of making the base model
linear is that it provides an estimate of the underlying process in areas where the data is sparse.
The rectification models must be flexible in order to handle the bias produced by the base model.
In our applications we use nearest neighbour estimates (kNN).
We can get the MSE for the rectify strategy with
$$\text{MSE}_2^{\text{rectify}} \approx \sigma^2(1 + f_{x_1}^2) + \tfrac{1}{4}(\kappa - \sigma^4) f_{x_1 x_1}^2 + \Bigl[\bigl(f(f(x_t), \ldots, y_{t-d+2}) - z(z(x_t), \ldots, y_{t-p+2}) + \tfrac{1}{2}\sigma^2 f_{x_1 x_1}\bigr) - r_2(y_t, \ldots, y_{t-p_2+1})\Bigr]^2.$$

When the rectify strategy is unbiased, it has an asymptotic MSE equal to the direct strategy and less than the recursive strategy:
$$\text{MSE}_2^{\text{recursive}} - \text{MSE}_2^{\text{rectify}} = \text{MSE}_2^{\text{recursive}} - \text{MSE}_2^{\text{direct}} \approx \tfrac{1}{4}\sigma^4 f_{x_1 x_1}^2.$$
However, this overlooks the variance advantage of the rectify strategy. While the asymptotic
variances of the rectify and direct strategies are both zero (and hence not part of these MSE
results), the rectify strategy should have smaller finite-sample variance due to the unifying
effects of the base model. We will demonstrate this property empirically in the next section.
The rectify strategy proposed here is similar to the method proposed by Judd & Small (2000),
although they do not consider the statistical properties (such as bias and MSE) of the method.
One contribution we are making in this paper is an explanation of why the method of Judd &
Small (2000) works.
4 Experimental set-up

4.1 Regression methods and model selection
Each forecasting strategy requires a different number of regression tasks. In our experiments,
we considered two autoregression methods: a linear model and a kNN algorithm.
The linear models are estimated using conditional least squares with the order p selected using
the AIC.
The kNN model is a nonlinear and nonparametric model where the prediction for a given data
point xq is obtained by averaging the target outputs y[i] of the k nearest neighbors of the query point xq (Atkeson et al. 1997). We used a weighted kNN model by taking a weighted
average rather than a simple average. The weights are a function of the Euclidean distance
between the query point and the neighboring point (we used the biweight function of Atkeson
et al. 1997). The key parameter k has to be selected with care as it controls the bias/variance tradeoff. A large k will lead to a smoother fit and therefore a lower variance (at the expense of a
higher bias), and vice versa for a small k.
To select the best value of the parameter k, we adopted a holdout approach in which the first part
(70%) of each time series is used for training and the second part (30%) is used for validation.
More precisely, the first part is used to search for the neighbors and the second part to calculate
a validation error.
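A sketch of the weighted kNN prediction and the holdout selection of k described above; the bandwidth convention (distance to the furthest of the k neighbours) and the candidate grid for k are assumptions of this sketch rather than details taken from the paper.

import numpy as np

def knn_biweight_predict(X_train, y_train, x_query, k):
    """Weighted kNN prediction with a biweight (quartic) kernel on Euclidean distance."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(d)[:k]
    scale = d[idx].max() + 1e-12                 # bandwidth: furthest of the k neighbours (assumption)
    w = (1.0 - (d[idx] / scale) ** 2) ** 2       # biweight kernel weights
    if w.sum() == 0:                             # degenerate case: fall back to a simple average
        w = np.ones(k)
    return np.sum(w * y_train[idx]) / np.sum(w)

def select_k(X, y, candidates=(2, 3, 5, 7, 10, 15)):
    """Holdout selection of k: first 70% used to search for neighbours, last 30% for validation."""
    n_train = int(0.7 * len(y))
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_va, y_va = X[n_train:], y[n_train:]
    errors = {}
    for k in candidates:
        preds = np.array([knn_biweight_predict(X_tr, y_tr, x, k) for x in X_va])
        errors[k] = np.mean((preds - y_va) ** 2)
    return min(errors, key=errors.get)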
4.2 The forecasting strategies
In our experiments, we will compare four forecasting strategies:
• REC: the recursive strategy using the kNN algorithm;
• DIRECT: the direct strategy using the kNN algorithm for each of the H forecast horizons;
• RECTIFY: the rectify strategy. To show the impact of the base model, we will consider two implementations:
• LINARP-KNN: the linear model as base model and the kNN algorithm for the H rectification models;
• KNN-KNN: the kNN algorithm for both the base model and the rectification models.
5 Simulation experiments
We carry out a Monte Carlo study to investigate the performance of the rectify strategy compared
to the other forecasting strategies from the perspective of bias and variance. We use simulated
data with a controlled noise component to effectively measure bias and variance effects.
5.1 Data generating processes
Two different autoregressive data generating processes are considered in the simulation study.
First we use a linear AR(6) process given by
yt = 1.32yt−1 − 0.52yt−2 − 0.16yt−3 + 0.18yt−4 − 0.26yt−5 + 0.19yt−6 + εt ,
where εt ∼ NID(0, 1). This process exhibits cyclic behaviour and was selected by fitting an AR(6)
model to the famous annual sunspot series. Because it is a linear process, the variance of εt
simply scales the resulting series. Consequently, we set the error variance to one without loss of
generality.
We also consider a nonlinear STAR process given by
$$y_t = 0.3 y_{t-1} + 0.6 y_{t-2} + \bigl(0.1 - 0.9 y_{t-1} + 0.8 y_{t-2}\bigr)\bigl[1 + e^{-10 y_{t-1}}\bigr]^{-1} + \varepsilon_t,$$
where $\varepsilon_t \sim \text{NID}(0, \sigma^2)$. We considered two values for the error variance of the STAR model noise: $\sigma^2 \in \{0.05^2, 0.1^2\}$.
Several other simulation studies used this STAR process for the purposes of model selection,
model evaluation as well as model comparison. See Teräsvirta & Anderson (1992), Tong &
Lim (1980) and Tong (1995) for some examples as well as related theoretical background and
applications.
Figure 1 shows samples with T = 200 observations of both DGPs.
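For concreteness, a minimal sketch of simulating the two DGPs (the burn-in of 300 values anticipates Section 5.3; seeding and function names are illustrative):

import numpy as np

def simulate_ar6(T, burn=300, rng=None):
    """Linear AR(6) DGP fitted to the annual sunspot series, unit error variance."""
    if rng is None:
        rng = np.random.default_rng()
    phi = [1.32, -0.52, -0.16, 0.18, -0.26, 0.19]
    y = np.zeros(T + burn)
    for t in range(6, T + burn):
        y[t] = sum(c * y[t - i - 1] for i, c in enumerate(phi)) + rng.standard_normal()
    return y[burn:]

def simulate_star2(T, sigma=0.05, burn=300, rng=None):
    """Nonlinear STAR(2) DGP with a logistic transition in y_{t-1}."""
    if rng is None:
        rng = np.random.default_rng()
    y = np.zeros(T + burn)
    for t in range(2, T + burn):
        G = 1.0 / (1.0 + np.exp(-10.0 * y[t - 1]))          # smooth transition function
        y[t] = (0.3 * y[t - 1] + 0.6 * y[t - 2]
                + (0.1 - 0.9 * y[t - 1] + 0.8 * y[t - 2]) * G
                + sigma * rng.standard_normal())
    return y[burn:]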
5.2 Bias and variance estimation
In expression (2), we have seen the decomposition of the MSE for a given strategy at horizon h.
In this section, we will show how to estimate the different parts of the decomposition, namely
the noise, the squared bias and the variance.
For a given DGP, we generate L independent time series $D_i = \{y_1, \ldots, y_T\}$, each composed of
T observations using different random seeds for the error term, where i ∈ {1, . . . , L}. These
generated time series represent samples of the DGP.
To measure the bias and variance, we use an independent time series from the same DGP for
testing purposes. We represent this independent testing time series by a set of R input/output
pairs {(xj , yj )}R
j=1 where xj is a sequence of d consecutive observations and the vector yj comprises
the next H consecutive observations.
If we let $\hat{m}^{(h)}_{D_i}(x_j)$ be the forecast of a given strategy for the input $x_j$ at horizon h using dataset $D_i$, and $y_j(h)$ be the hth element of the vector $y_j$, then the MSE can be calculated as
$$\text{MSE}_h = \frac{1}{LR} \sum_{i=1}^{L} \sum_{j=1}^{R} \bigl(y_j(h) - \hat{m}^{(h)}_{D_i}(x_j)\bigr)^2 = \text{Noise}_h + \text{Bias}_h^2 + \text{Variance}_h.$$
Figure 1: Samples of T = 200 observations from the AR(6) DGP and the STAR(2) DGP with σ = 0.05 and σ = 0.1.
The three components can be estimated as follows:
$$\text{Noise}_h = \frac{1}{R} \sum_{j=1}^{R} \bigl(y_j(h) - E[y_j(h) \mid x_j]\bigr)^2,$$
$$\text{Bias}_h^2 = \frac{1}{R} \sum_{j=1}^{R} \bigl(E[y_j(h) \mid x_j] - \bar{m}(x_j)\bigr)^2,$$
$$\text{Variance}_h = \frac{1}{LR} \sum_{i=1}^{L} \sum_{j=1}^{R} \bigl(\hat{m}^{(h)}_{D_i}(x_j) - \bar{m}(x_j)\bigr)^2,$$
where $\bar{m}(x_j) = \frac{1}{L} \sum_{i=1}^{L} \hat{m}^{(h)}_{D_i}(x_j)$ is an estimate of $m^{(h)}(x_j)$.
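A compact sketch of this decomposition, assuming `forecasts` is an L x R array of h-step forecasts (one row per training replicate) and `cond_mean`, `y_true` hold the R conditional means and realized values on the test pairs:

import numpy as np

def mse_decomposition(forecasts, y_true, cond_mean):
    """Estimate Noise_h, Bias_h^2 and Variance_h from L x R forecasts at one horizon."""
    forecasts = np.asarray(forecasts, dtype=float)    # shape (L, R)
    y_true = np.asarray(y_true, dtype=float)          # shape (R,)
    cond_mean = np.asarray(cond_mean, dtype=float)    # shape (R,)
    m_bar = forecasts.mean(axis=0)                    # average over the L replicates
    noise = np.mean((y_true - cond_mean) ** 2)
    bias2 = np.mean((cond_mean - m_bar) ** 2)
    variance = np.mean((forecasts - m_bar) ** 2)
    mse = np.mean((forecasts - y_true) ** 2)          # broadcasts y_true across replicates
    return {"noise": noise, "bias2": bias2, "variance": variance, "mse": mse}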
5.3 Simulation details
The first three hundred simulated values are discarded for each simulated series to stabilize the
time series, as suggested by Law & Kelton (2000).
To show the importance of the size of the time series T for each strategy, we will compare
different sizes, namely T ∈ {50, 100, 200, 400}.
We considered a maximum forecasting horizon H = 10 for the AR(6) DGP and H = 20 for the
STAR(2) DGP.
For the bias and variance estimation, we generate L = 500 sample time series from the DGP.
A large time series is independently generated and R = 2000 input/output pairs are extracted
from it for testing purposes as described in Section 5.2.
For the AR(6) DGP, the conditional mean $E[y_j(h) \mid x_j]$ at input $x_j$ and horizons h ∈ {1, . . . , H} has been calculated analytically. For the STAR(2) DGP, we used simulations to compute $\frac{1}{S}\sum_{s=1}^{S} y_s^{(h)}(x_j)$, where $y_s^{(h)}(x_j)$ is one realization of the future of the input $x_j$. We chose a large value for the parameter S.
The range of values for the parameters p, p1 , . . . , pH (see equations (4), (7) and (9)) required by
all the regression tasks is {2, 3, 4, 5, 6, 7} and the set {2, 3, 4, 5} for the base model of the rectify
strategy. For the STAR(2) DGP, we used the set {2, 3, 4, 5}.
5.4 Results and discussion
Figure 2 gives for the AR(6) DGP and different values of T , the MSE together with the corresponding squared bias and variance. The same information is given in Figures 3 and 4 for the
STAR(2) DGP with σ = 0.05 and σ = 0.1, respectively.
For both DGPs, we can see in Figures 2, 3 and 4 that the recursive strategy has virtually the same
performance as the direct strategy in the first few horizons while it performs poorly for longer
horizons. For the AR(6) DGP, this is mainly due to the nonlinear model which increases the
forecast variance. For the STAR(2) DGP, both the nonlinear model and the nonlinear time series
make the forecasts of the recursive strategy worse. Indeed, as we have seen in expression (6), if
the time series is nonlinear and fx1 x1 > 0, there is a positive bias. Also, if the model is nonlinear
then an error made at one forecast horizon will be propagated to the subsequent forecast
horizons, thus increasing the bias of the recursive strategy as the horizon increases.
The same figures show also that the direct forecasts converge to the mean forecasts as the size of
the time series T increases, which is not the case for the recursive strategy. This is consistent
with the theoretical analysis of Section 2 where we have seen that the recursive strategy is biased
when the time series is noisy and f has a non-zero second derivative.
Concerning the two implementations of the rectify strategy, we can see that KNN-KNN has
significantly reduced the bias of the recursive strategy for both DGPs. However, this was at a
price of an increase in variance which consequently made the forecasts worse. The increase
in variance is particularly noticeable with small-sample time series and vanishes with large
samples. In consequence, using a nonlinear base model seems not to be a good idea when
implementing the rectify strategy.
If we look at the second implementation of the rectify strategy (LINARP-KNN in blue), we can see
that it performs better compared to KNN-KNN for both DGPs. Instead of using a nonlinear base
model, LINARP-KNN uses a linear AR(p) model which reduces the variance of the final forecasts.
Figure 2 shows that LINARP-KNN has improved the forecasting performance over both the
recursive and the direct strategies. The bias has been reduced over the forecasting horizons
for both strategies. Also, LINARP-KNN has significantly reduced the variance for the recursive
strategy. Compared to the direct strategy, the better performance is particularly noticeable at
shorter horizons and we get similar performance at longer horizons.
Figures 3 and 4 give the results for the STAR(2) DGP for two different levels of noise. Comparing
these two figures, we can see that the gain with LINARP-KNN is greater with a higher level of
noise. This can be explained by the fact that the recursive strategy will be more biased and
also because the linear base model performs well in a high noise setting compared to nonlinear
models.
We can also notice in Figure 4 that LINARP-KNN has a smaller MSE compared to other strategies
at shorter horizons and converges to the forecasts of the direct strategy as the horizon increases.
In addition, LINARP-KNN is better with small-sample time series and gets closer to the direct
strategy as T increases.
The different results suggest that we can take advantage of the linear model in areas where the data is sparse and benefit from the small variance due to the linearity.
To illustrate the MSE decomposition for the different strategies, Figure 5 shows for T = 100
and the AR(6) DGP, the noise, the squared bias and the variance stacked in each panel. The
same information is given in Figures 6 and 7 for the STAR(2) DGP with σ = 0.05 and σ = 0.1,
respectively.
Figure 5, 6 and 7 show that the main factors affecting the MSE for all horizons are the noise and
the variance. For both DGPs, we can see that the recursive strategy has both higher bias (in grey)
and higher variance (in yellow) at longer horizons compared to the direct strategy.
For the AR DGP, KNN-KNN has decreased the bias and increased the variance of the recursive
strategy. However, for LINARP-KNN, we can clearly see the decrease in bias as well as in variance.
For the STAR DGP, we can see a higher decrease in terms of both bias and variance for LINARP-KNN in Figure 7 compared to Figure 6.
Figure 2: AR(6) DGP and T = [50, 100, 200, 400]. MSE with corresponding squared bias and variance.

Figure 3: STAR(2) DGP, σ = 0.05 and T = [50, 100, 200, 400]. MSE with corresponding squared bias and variance.

Figure 4: STAR(2) DGP, σ = 0.1 and T = [50, 100, 200, 400]. MSE with corresponding squared bias and variance.

Figure 5: AR(6) DGP. MSE of the different forecasting strategies (top left). Decomposed MSE in noise (cyan), squared bias (grey) and variance (yellow).

Figure 6: STAR(2) DGP and σ = 0.05. MSE of the different forecasting strategies (top left). Decomposed MSE in noise (cyan), squared bias (grey) and variance (yellow).

Figure 7: STAR(2) DGP and σ = 0.1. MSE of the different forecasting strategies (top left). Decomposed MSE in noise (cyan), squared bias (grey) and variance (yellow).
6 Real data applications
To shed some light on the performance of the rectify strategy with real-world time series, we
carried out some experiments using time series from two forecasting competitions, namely the
M3 and the NN5 competitions.
6.1 Forecasting competition data
The M3 competition dataset consists of 3003 monthly, quarterly, and annual time series. The
competition was organized by the International Journal of Forecasting (Makridakis & Hibon 2000),
and has attracted a lot of attention. The time series of the M3 competition have a variety of
features. Some have a seasonal component, some possess a trend, and some are just fluctuating
around some level.
We have considered all the monthly time series in the M3 data that have more than 110 data
points. The number of time series considered was M = 800 with a range of lengths between
T = 111 and T = 126. For these monthly time series, the competition required forecasts for the
next H = 18 months, using the given historical points. Figure 8 shows four time series from the
set of 800 time series.
The NN5 competition dataset comprises M = 111 daily time series, each containing T = 735 observations. Each of these time series represents roughly two years of daily cash withdrawal amounts at ATMs in one of several cities in the UK. The competition was organized in order to compare and evaluate the performance of computational intelligence
methods. For all these time series, the competition required forecasts of the next H = 56 days,
using the given historical points. Figure 9 shows four time series from the NN5 data set.
6.2 Experiment details
The NN5 dataset includes some zero values that indicate no money withdrawal occurred and
missing observations for which no value was recorded. We replaced these two types of gaps
using the method proposed in Wichard (2010): the missing or zero observation yt is replaced by
the median of the set {yt−365 , yt−7 , yt+7 , yt+365 } using only non-zero and non-missing values.
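A sketch of this gap-filling rule (illustrative; the handling of candidates that fall outside the series is an assumption of this sketch):

import numpy as np

def fill_gaps(y):
    """Replace zero/missing values by the median of {y[t-365], y[t-7], y[t+7], y[t+365]},
    using only the candidates that exist and are neither zero nor missing (Wichard 2010)."""
    y = np.array(y, dtype=float)
    bad = np.isnan(y) | (y == 0)
    filled = y.copy()
    for t in np.where(bad)[0]:
        candidates = []
        for lag in (-365, -7, 7, 365):
            s = t + lag
            if 0 <= s < len(y) and not bad[s]:
                candidates.append(y[s])
        if candidates:
            filled[t] = np.median(candidates)
    return filled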
For both competitions, we deseasonalized the time series using STL (Seasonal-Trend decomposition based on Loess smoothing) (Cleveland et al. 1990). Of course, the seasonality has been restored after forecasting. For the parameter controlling the loess window for seasonal extraction, we used the value s = 50 for the M3 competition and s = periodic for the NN5 competition.

Figure 8: Four time series from the M3 forecasting competition.
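A rough sketch of the deseasonalize-forecast-reseasonalize step described above, using statsmodels' STL as a stand-in for the loess-based decomposition used by the authors; the seasonal-window argument differs from R's stl, so the value below is only indicative:

import numpy as np
from statsmodels.tsa.seasonal import STL

def deseasonalize(y, period):
    """Remove the STL seasonal component; return the adjusted series and the component."""
    res = STL(y, period=period, seasonal=13).fit()   # seasonal window is an assumption
    return y - res.seasonal, res.seasonal

def reseasonalize(forecasts, seasonal, period):
    """Add back the seasonal component by repeating the last observed seasonal cycle."""
    last_cycle = seasonal[-period:]
    H = len(forecasts)
    reps = np.tile(last_cycle, int(np.ceil(H / period)))[:H]
    return forecasts + reps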
For the M3 competition, the values of the parameters p, p1, . . . , pH were selected from the set {2, . . . , 10} for all strategies and from the set {2, . . . , 5} for the base model of the rectify strategy. For the NN5 competition, we used the set {2, . . . , 14} for all strategies and the set {2, . . . , 7} for the base model of the rectify strategy.
We considered two forecast accuracy measures. The first was the symmetric mean absolute percentage error (SMAPE). Let
$$\text{SMAPE}_h^m = \frac{2\,\bigl|\hat{y}_{t+h}^m - y_{t+h}^m\bigr|}{\hat{y}_{t+h}^m + y_{t+h}^m} \times 100$$
for the mth time series at horizon h, where m ∈ {1, . . . , M} and h ∈ {1, . . . , H}. Then SMAPE is the average of $\text{SMAPE}_h^m$ across the M time series. Hyndman & Koehler (2006) discussed some problems with this measure, but as it was used by Makridakis & Hibon (2000), we use it to enable comparisons with the M3 competition.

Figure 9: Four time series from the NN5 forecasting competition.
The second accuracy measure was the mean absolute scaled error (MASE) introduced by Hyndman & Koehler (2006). Let
$$\text{MASE}_h^m = \frac{\bigl|\hat{y}_{t+h}^m - y_{t+h}^m\bigr|}{\frac{1}{T-s}\sum_{t=s+1}^{T} \bigl|y_t^m - y_{t-s}^m\bigr|},$$
where s is the seasonal lag. We used s = 7 for the NN5 time series and s = 12 for the M3 time series. Then MASE is the average of $\text{MASE}_h^m$ across the M time series.
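Both measures are straightforward to compute; a sketch (array shapes and names are illustrative):

import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent (as defined above), averaged over the forecast horizons of one series."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(200.0 * np.abs(y_pred - y_true) / (y_pred + y_true))

def mase(y_true, y_pred, y_insample, s):
    """MASE with a seasonal naive scaling (s = 12 for M3 monthly data, s = 7 for NN5 daily data)."""
    y_insample = np.asarray(y_insample, float)
    scale = np.mean(np.abs(y_insample[s:] - y_insample[:-s]))
    return np.mean(np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float))) / scale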
Figure 10: Boxplots of the SMAPE (left) and MASE (right) measures averaged over the H = 18 consecutive horizons and obtained across the 800 time series for the different strategies.
6.3 Results and discussion
Figures 10 and 11 present a graphical depiction of the performance of each strategy for both the M3 and the NN5 competitions. Each figure has two plots, for the SMAPE and MASE measures; each plot gives notched boxplots, ranked according to the median, summarizing the distribution of the SMAPE/MASE measures averaged over the H forecast horizons for each strategy. Outliers were omitted to better facilitate the graphical comparison between the various forecasting strategies.
For both the M3 and the NN5 competitions, Figures 10 and 11 show that the rectify strategy is almost always better than, or at least equivalent to, the best of the recursive and the direct strategies.

In fact, for the M3 competition, we can see in Figure 10 that the recursive strategy is significantly better than the direct strategy in terms of both the SMAPE and MASE measures. We can also see that the rectify strategy is better than or equivalent to the recursive strategy.

For the NN5 competition, the direct strategy performs better than the recursive strategy and, as before, the rectify strategy is better than or equivalent to the direct strategy.
Figure 11: Boxplots of the SMAPE (left) and MASE (right) measures averaged over the H = 56 consecutive horizons and obtained across the 111 time series for the different strategies.
This suggests that using the rectify strategy in all situations is a reasonable approach, and avoids
making a choice between the recursive and the direct strategies which can be a difficult task in
real-world applications.
To allow a more detailed comparison between the different strategies, Figures 12–15 present the same results as in Figures 10 and 11 but at a more disaggregated level. Figures 12 and 13 give the SMAPE/MASE measures for each horizon h, while Figures 14 and 15 average them over 7 consecutive horizons instead of over all H = 56 horizons as in Figure 11.
If we compare the performance of the different forecasting strategies in terms of the lowest median and the upper and lower quartiles of the distribution of errors, the following conclusions can be drawn. For the M3 forecasting competition, Figures 12 and 13 suggest that direct is the least accurate strategy consistently over the entire horizon, both in terms of SMAPE and MASE. This might be explained by the size of the time series (between 111 and 126 observations) and the large forecasting horizon H = 18, which can reduce the available training data by about 15% at the longest horizons.

The LINARP-KNN strategy performs particularly well at short horizons and competes closely with KNN-KNN and REC according to MASE and SMAPE, respectively.
For the NN5 forecasting competition, Figures 14 and 15 suggest that the direct strategy provides
better forecasts compared to the recursive strategy, probably because of the greater nonlinearity
present in the NN5 data compared to the M3 data.
Figure 12: Boxplots of the SMAPE measures for horizon h = 1 to h = 18 obtained across the 800 time series for the different strategies.

Figure 13: Boxplots of the MASE measures for horizon h = 1 to h = 18 obtained across the 800 time series for the different strategies.

Figure 14: Boxplots of the SMAPE measures averaged over 7 consecutive horizons and obtained across the 111 time series for the different strategies.

Figure 15: Boxplots of the MASE measures averaged over 7 consecutive horizons and obtained across the 111 time series for the different strategies.
7 Conclusion
The recursive strategy can produce biased forecasts and the direct strategy can have high
variance forecasts with short time series or long forecasting horizons. The rectify strategy has
been proposed to take advantage of the strengths of both the recursive and direct strategies.
Using simulations with nonlinear time series, we have found that the rectify strategy reduces the bias of the recursive strategy, as well as the variance when a linear base model is used. A higher level of noise and a shorter time series make the performance of the rectify strategy even better relative to the other strategies. With linear time series, we also found that the rectify strategy performed particularly well, decreasing both the bias and the variance.
Using real time series, we found that the rectify strategy is always better than, or at least has comparable performance to, the best of the recursive and the direct strategies. This finding is valuable because it means the rectify strategy avoids the difficult task of choosing between the recursive and the direct strategies while remaining better than, or competitive with, both.
All these findings make the rectify strategy very attractive for multi-step forecasting tasks.
References
Atiya, A. F., El-shoura, S. M., Shaheen, S. I. & El-sherif, M. S. (1999), ‘A comparison between
neural-network forecasting techniques–case study: river flow forecasting.’, IEEE Transactions
on Neural Networks 10(2), 402–409.
Atkeson, C. G., Moore, A. W. & Schaal, S. (1997), ‘Locally weighted learning’, Artificial Intelligence
Review 11(1-5), 11–73.
Chen, R., Yang, L. & Hafner, C. (2004), ‘Nonparametric multistep-ahead prediction in time
series analysis’, Journal of the Royal Statistical Society, Series B 66(3), 669–686.
Chevillon, G. (2007), ‘Direct multi-step estimation and forecasting’, Journal of Economic Surveys
21(4), 746–785.
Chevillon, G. & Hendry, D. (2005), ‘Non-parametric direct multi-step estimation for forecasting
economic processes’, International Journal of Forecasting 21(2), 201–218.
Cleveland, R. B., Cleveland, W. S., McRae, J. E. & Terpenning, I. (1990), 'STL: A seasonal-trend decomposition procedure based on loess', Journal of Official Statistics 6(1), 3–73.
Fan, J. & Yao, Q. (2003), Nonlinear time series: nonparametric and parametric methods, Springer,
New York.
Franses, P. H. & Legerstee, R. (2009), ‘A unifying view on multi-step forecasting using an
autoregression’, Journal of Economic Surveys 24(3), 389–401.
Hastie, T. J., Tibshirani, R. & Friedman, J. H. (2008), The elements of statistical learning, 2nd edn, Springer-Verlag, New York.
Hyndman, R. J. & Koehler, A. B. (2006), 'Another look at measures of forecast accuracy', International Journal of Forecasting 22(4), 679–688.
Ing, C.-K. (2003), 'Multistep prediction in autoregressive processes', Econometric Theory 19(2), 254–279.
Judd, K. & Small, M. (2000), ‘Towards long-term prediction’, Physica D 136, 31–44.
Law, A. M. & Kelton, W. D. (2000), Simulation Modelling and Analysis, 3rd edn, McGraw-Hill.
Makridakis, S. G. & Hibon, M. (2000), ‘The M3-Competition: results, conclusions and implications’, International Journal of Forecasting 16(4), 451–476.
McNames, J. (1998), A nearest trajectory strategy for time series prediction, in ‘Proceedings
of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling’,
Citeseer, pp. 112–128.
Teräsvirta, T. & Anderson, H. M. (1992), 'Characterizing nonlinearities in business cycles using smooth transition autoregressive models', Journal of Applied Econometrics 7, 119–136.
Tong, H. (1995), ‘A personal overview of non-linear time series analysis from a chaos perspective’,
Scandinavian Journal of Statistics 22, 399–445.
Tong, H. & Lim, K. S. (1980), ‘Threshold autoregression, limit cycles and cyclical data’, Journal
of the Royal Statistical Society, Series B 42(3), 245–292.
Werbos, P. (1990), ‘Backpropagation through time: What it does and how to do it’, Proceedings of
the IEEE 78(10), 1550–1560.
Wichard, J. D. (2010), 'Forecasting the NN5 time series with hybrid models', International Journal of Forecasting.
Williams, R. J. & Zipser, D. (1989), ‘A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks’, Neural computation 1(2), 270–280.