Crogging (Cross-Validation Aggregation) for Forecasting – a novel
algorithm of Neural Network Ensembles on Time Series Subsamples
Devon K. Barrow and Sven F. Crone
Lancaster Centre for Forecasting, Lancaster University Management School, Lancaster University, Lancaster, UK (e-mail: {barrow, crone}@exchange.lancs.ac.uk)

Abstract— In classification, regression and time series
prediction alike, cross-validation is widely employed to estimate
the expected accuracy of a predictive algorithm by averaging
predictive errors across mutually exclusive subsamples of the
data. Similarly, bootstrapping aims to increase the validity of
estimating the expected accuracy by repeatedly sub-sampling
the data with replacement, creating overlapping samples of the
data. Estimates are then used to anticipate future risk in
decision making, or to guide model selection where multiple
candidates are feasible. Beyond error estimation, bootstrapping
has recently been extended to combine the diverse models
created for estimation, aggregating over their predictions
(rather than their errors), coined bootstrap aggregation or
bagging. However, similar extensions of cross-validation to
create diverse forecasting models have not been
considered. In accordance with bagging, we propose to combine
the benefits of cross-validation and forecast aggregation, i.e.
crogging. We assess different levels of cross-validation,
including a (single-fold) hold-out approach, 2-fold and 10-fold
cross-validation, and Monte-Carlo cross-validation, to create
diverse base-models of neural networks for time series
prediction trained on different data subsets, and average their
individual multiple-step ahead predictions. Results of
forecasting the 111 time series of the NN3 competition indicate
significant improvements in accuracy through Crogging relative
to Bagging or individual model selection of neural networks.
I. INTRODUCTION
The combination of predictive models in ensembles has
received substantial attention, aiming to increase an
existing algorithm's accuracy in out-of-sample
predictions for classification, regression and time series
forecasting alike. In forecasting, the seminal paper by Bates
and Granger [1] showed significant gains in accuracy
through the linear combination of univariate time series
methods, which has been independently confirmed to be
unbiased and more accurate (see, e.g., [2]) in comparison to
the accuracy of each of the individual models. Since then,
statistical methods as well as neural network ensembles have
been used to improve the accuracy over single models in
time series forecasting [3]. However, the majority of papers
have simply combined multiple algorithms previously
specified, or multiple initializations thereof, each one
parameterized on the same complete learning data.
As an alternative, algorithms which actively create diverse
base models by resampling the dataset on which methods are
parameterized have received less attention in forecasting.
Most notably, Bagging is a general algorithm for stabilizing
unstable prediction methods such as neural networks and
decision trees [4]. Bagging creates bootstrap replicates of a
given learning set by randomly sampling with replacement,
and uses each replicate as a new training set of equal size
to parameterize a base learner, aggregating their outputs to
an ensemble forecast by averaging their predictions. The
algorithm is fundamentally based on Bootstrapping, an
established statistical technique of resampling from observed
data, which is traditionally used to estimate the distribution
of almost any statistic [5], while enhancing the precision and
reducing variance of the estimation [6].
However, other established statistical methods of
resampling for estimation purposes, such as jackknifing or
cross-validation, have not yet been considered to create
diverse predictions. While the main application of cross-validation has been the estimation of prediction errors, in
model selection, or in neural network training to control
overfitting through early stopping on a validation dataset, it
has not been used to create diverse forecasts to be combined
in ensembles. In this work we propose a novel aggregation
and combination method in a similar spirit to Bagging [7],
but based on cross validation rather than bootstrapping,
consequently coined Crogging. This novel combination
method averages forecasts over a set of predictive models
trained using mutually exclusive cross-validation replicates
of the original learning set. As different splitting strategies
allow alternate cross-validation estimates, we assess the
accuracy of k-fold Crogging and Monte-Carlo Crogging in
comparison to established approaches of Bagging,
Ensembles trained using single-fold cross validation, i.e.
using a single hold-out set, and model selection (of a single
best), always using neural networks (NN) as a base learner.
The paper is organized as follows: section II describes the
cross validation strategies traditionally employed for error
estimation, including the benchmark approach of Bagging.
Section III introduces forecasting model combination and
aggregation, and motivates the novel algorithm. Section IV
outlines the experimental design, with findings presented in
section V. Section VI concludes the paper.
II. CROSS VALIDATION FOR ERROR ESTIMATION
A. Resampling Techniques for Error Estimation
The estimation of predictive accuracy is important, both
for comparing statistical models and for assessing the model
which is finally selected. Given a learning set of observed
data sampled from an unknown population, and a set of
models constructed for predicting future values, one
calculates the prediction error of each model and wants to
know which model performs best. However, in-sample
accuracy, i.e. the model fit which measures the ability to
approximate the data generating process, has been proven to
have little correlation with out-of-sample accuracy, i.e. the
ability to generalize for unseen data of the same data
generating process.
As a result, the statistical resampling technique of cross-validation (CV) assesses how the results of a statistical
estimate will generalize to an independent data set [8], [7].
Cross-validation splits the data, using one subset as a
learning set to train each model, and the remaining part as a
validation sample for estimating the error of the predictor.
This provides a less biased and more representative
estimation of the true ex ante performance of the model. For
multiple alternative candidate models, the model with the
lowest prediction error is then selected as the final model.
Different versions of cross-validation exist: hold-out CV, k-fold CV and Monte Carlo CV, depending on the number of
subsets, and whether the subsets are mutually exclusive or
overlapping. All variants have in common that they generate
different training sets based on splitting the original learning
set, and that they are used to estimate errors and aid in model
selection, but not in forecast combination for prediction.
In time series prediction, where data over time is often
non-stationary, error estimation through cross validation has
become a prerequisite to assess predictive out-of-sample
accuracy of an algorithm with validity and reliability [9],
and to perform model selection between competing
algorithms prior to their actual application. Out-of-sample
evaluations with a single hold-out dataset have been most
popular for NN or statistical methods alike, including
systematic analysis regarding theoretical properties [10] or
in a particular application area, such as climate forecasting
[11], or financial forecasting with statistics and neural
networks [12-14]. Similarly, forecasting competitions
regularly employ (single fold) out-of-sample evaluation,
where a part of each time series is not disclosed to the
contestants in order to assess the empirical accuracy of
competing forecasting methods objectively in a simulated ex
ante design. In contrast, only a few publications have
estimated accuracy across multiple folds, employing k-fold
subsampling for time series prediction.
B. Hold-Out Cross Validation
For the simplest case of CV, a single split into two data
subsets is performed. The holdout method, also referred to as
validation estimation, partitions the original learning set
into two mutually exclusive subsets S_Train and S_Valid of
training and validation (or holdout) sets respectively. A
model m is estimated on S_Train and used to obtain forecasts
to estimate predictive accuracy on S_Valid. Guidance
as to how many observations to include in either dataset is
inconclusive, often employing heuristic rules of thumb using
70%:30% splits of training and validation respectively.
The hold-out method may be considered both a special
case of k-fold CV (with k = 1) and also of Monte-Carlo CV,
both discussed next.
C. K-fold Cross Validation
In a more general setting, for a time series of length T, we
define a k-fold cross-validation, with k ≤ T, which divides a
learning set into k non-overlapping and mutually
exclusive subsets of approximately equal size. Observations
are drawn without replacement, either randomly, or, in the
case of time series data with potential autocorrelation,
sequentially in blocks of consecutive observations. The
predictive model is then estimated k times, each time using a
training data set S_Train comprising k - 1 of the subsamples.
The one remaining subsample is retained as validation data
S_Valid, used to estimate the out-of-sample performance of the
estimated model. This process is repeated k times, with each
of the k subsamples used exactly once as the validation data.
Estimates of the algorithm's out-of-sample predictive errors
are then obtained by averaging the errors across the k
validation samples omitted in each estimation.
Consequently, for k = 1 a hold-out evaluation is estimated.
For k = 2 a two-fold cross validation splits the dataset into 2
folds, training the model on one and estimating on the other,
then vice versa, and averaging the estimated out-of-sample
error across both validation sets. For k = T, the CV conducts
a leave-one-out (LOO) cross-validation, assessing T
estimated models (each using T - 1 observations) on T single
observations for validation, an approach equivalent to
jackknifing in statistical estimation.
An advantage of k-fold CV is that all observations are
used for both training and validation, all training
observations are used with equal weight, and each
observation is used for validation exactly once. A potential
disadvantage is that the proportion of the training/validation
split is dependent on the number of iterations (folds). Due to
its simplicity, single fold CV is widely applied in model
selection [8] and common in neural network training with
early stopping to prevent overfitting. However, as for larger
k the CV becomes more computationally demanding, only
a few scientific studies with LOO CV on time series exist.
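To make the blocked splitting concrete, the sketch below (an illustrative Python fragment, not part of the original study) partitions the indices of the lagged training examples into k contiguous, non-overlapping folds of approximately equal size; the function name is a hypothetical choice.

```python
import numpy as np

def blocked_kfold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k contiguous, non-overlapping folds
    of approximately equal size, preserving temporal order (no shuffling)."""
    fold_sizes = np.full(k, n_examples // k, dtype=int)
    fold_sizes[: n_examples % k] += 1            # distribute the remainder
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(np.arange(start, start + size))
        start += size
    return folds

# Example: 100 lagged training examples, k = 10
folds = blocked_kfold_indices(100, 10)
for i, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # estimate the model on train_idx and measure its error on valid_idx
```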
D. Monte-Carlo Cross-Validation
The Monte-Carlo CV repeats randomly splitting the original
learning set into two subsets S_Train and S_Valid multiple times,
each time randomly drawing without replacement a fixed
number of examples from the learning set to form a training
set S_Train, and using the remaining examples to form S_Valid.
Model m is trained using S_Train and used to obtain forecasts
for S_Valid. This is repeated K times, with K as large as
possible, and errors estimated by averaging across the K
validation folds. Note that although data subsets are
mutually exclusive for each round of Monte Carlo CV, they
are not if repeated K times. As a result, all observations in
Monte Carlo CV will be used for estimation of m and
validation of errors multiple times, but across all iterations a
different number of times depending on the independent
random sampling between rounds.
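A minimal sketch of the Monte-Carlo splitting step, assuming a NumPy random generator, a validation size of 14 examples and K = 50 repetitions (the values used later in the experimental design); within one split the two subsets are mutually exclusive, across the K splits they are not.

```python
import numpy as np

def monte_carlo_splits(n_examples, n_valid, K, seed=0):
    """Generate K random train/validation index pairs by drawing the training
    set without replacement and keeping the remainder for validation."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(K):
        perm = rng.permutation(n_examples)           # draw without replacement
        valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
        splits.append((np.sort(train_idx), np.sort(valid_idx)))
    return splits

splits = monte_carlo_splits(n_examples=100, n_valid=14, K=50)
```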
E. Bootstrapping
As an alternative to CV, Bootstrapping provides an
alternative statistical resampling technique to estimate
errors, using sampling with replacement to create different
training datasets for within-sample estimation.
We consider the ordinary bootstrap method [5], where the
temporal and spatial covariance structure of the original time
series is preserved in the lagged input vectors, much like the
moving block bootstrap [15, 16]. From the original learning
set, N examples are randomly drawn with replacement
according to a discrete uniform distribution, where each
example has equal probability of being chosen. These N
examples form the new training set S_Train, of equal size as
the original learning set. The training of model m is
performed using S_Train, which is then used to obtain
forecasts. This is repeated K times, with K as large as
possible, and validation errors estimated as an average
across the K sets of forecasts. In comparison to CV,
bootstrapping does not make use of a validation set, but
creates diverse estimates of errors by utilizing approximately
1 - (1 - 1/N)^N ≈ 63.2% unique examples in each training set
by sampling with replacement [6].
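The fraction of unique examples per bootstrap replicate quoted above can be verified with a short simulation (an illustrative check, not taken from the paper):

```python
import numpy as np

N = 200                                    # size of the learning set
rng = np.random.default_rng(0)
fractions = []
for _ in range(1000):                      # 1000 bootstrap replicates
    sample = rng.integers(0, N, size=N)    # draw N examples with replacement
    fractions.append(np.unique(sample).size / N)

print(np.mean(fractions))                  # simulated share of unique examples, ~0.632
print(1 - (1 - 1 / N) ** N)                # analytical value, ~0.6326
```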
III. CROSS VALIDATION FOR FORECAST AGGREGATION
A. Forecast Combination and Bootstrap Aggregation
As an alternative to identifying and selecting the single most
promising algorithm for forecasting future observations,
research in forecast combination remains active.
Makridakis et al. showed that using arithmetic means of
forecasts improves forecasting accuracy [17], and that taking
a simple average outperforms taking a weighted average
model combination [18] while being more robust [19].
Results of the M-competition showed that averaging
forecasts of six different algorithms performed better than
each of the individual methods included in the average [18].
Similarly, at the M3 competition the arithmetic mean of
Single, Holt and Dampen Trend Exponential Smoothing
proved more accurate than each of the three methods
individually, for practically all forecasting horizons [20]. For
NN models, constructing ensembles has proven equally
successful in increasing accuracy and hence prominent.
However, how to combine models and under which
conditions still remains a research question under debate.
Some papers dispel the notion that equally-weighted
combined forecasts lead to better performance [21], others
suggest the weighted median as it is deemed less sensitive to
outliers than the weighted mean [22], or using unweighted
averages with trimming and winsorisation to avoid the
influence of extreme values and errors [23].
Within forecasting and time series prediction, the majority
of papers have resorted to combining multiple algorithms
previously specified, or multiple initializations thereof, each
one parameterized on the same complete learning data. In
contrast, the Bagging algorithm has been recently proposed
as an alternative to simple combination. Rather than use
bootstrapping to estimate errors, Breiman averages forecasts
(not errors) across multiple models m trained on different
data subsets created using random uniform sampling with
replacement with substantial success [23].
However, despite the prominence of bagging, similar
extensions creating predictors from cross-validation
routines have not been developed, although they promise potential
benefits over single models and simple forecast combinations.
B. Cross-Validation with Forecast Aggregation
In k-fold CV, each of the k contender models m provides
forecasts only for the validation data, while its potential
to predict out of sample is ignored. As a result, many of the diverse
candidate models created in cross validation trained on
subsamples of the data are used only to estimate accuracy,
but not to create predictions themselves.
Rather than use each of the CV methods for error
estimation or model selection, we extend them to model
combination through forecast aggregation. In analogy to
Bagging, we propose to aggregate and combine the
predictions across each individual cross validation
prediction, termed Crogging. Specifically, we propose two
new algorithms, k-fold Crogging and Monte-Carlo Crogging,
and assess them in an empirical
evaluation. (Note that the case of 1-fold hold-out evaluation
is equivalent to using an ensemble of conventional neural
networks, and as such cannot be considered novel).
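The k-fold Crogging idea can be summarised in the sketch below, under the assumption of two hypothetical helpers that are not part of the paper: fit_mlp, which trains one randomly initialized MLP with early stopping on the supplied validation fold, and forecast, which returns its multiple-step-ahead predictions.

```python
import numpy as np

def crogging_forecast(X, y, k, models_per_fold, fit_mlp, forecast):
    """k-fold Crogging: train MLPs on each training/validation fold pair and
    average their multiple-step-ahead forecasts (cross-validation aggregation).

    fit_mlp(X_tr, y_tr, X_va, y_va) -> model   (early stopping on the fold's validation set)
    forecast(model)                 -> 1-D array of multi-step-ahead predictions
    """
    folds = np.array_split(np.arange(len(y)), k)     # contiguous, non-overlapping folds
    predictions = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for _ in range(models_per_fold):             # e.g. 5 MLPs per fold for k = 10
            model = fit_mlp(X[train_idx], y[train_idx], X[valid_idx], y[valid_idx])
            predictions.append(forecast(model))
    return np.mean(predictions, axis=0)              # ensemble forecast = simple average
```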
The difference between Bagging and the proposed
approach of Crogging lies in the generation of the data
samples used for training and validation. While both cross-validation and bootstrapping are based on resampling, cross-validation ensures that all observations are used for both
training and validation, though not simultaneously, and each
observation is guaranteed to be used for model estimation
and validation the same number of times. Furthermore, the
validation set available in CV can be used to control for
overfitting in neural network training using early stopping.
k-fold cross-validation allows the use of all validation sets
in performing early stopping, and this potentially further
reduces the risk of overfitting. In comparison to the
conventional ‘hold-out’ or validation method, commonly
used for early stopping neural networks, which uses only a
single split of the data and therefore only a single validation
set, Crogging promises the added benefit of using multiple
mutually-exclusive validation datasets. Nevertheless, while
cross-validation produces a nearly unbiased estimate of the
future value of a parameter, a major drawback is the high
variability which can be present in this estimation [24].
In light of these differences, this paper evaluates the
potential benefits from the proposed Crogging approach
based on cross-validation aggregation relative to standard
model averaging (hold-out aggregation) and Bagging (i.e.
Bootstrap aggregation), and investigates possible gains in
accuracy resulting from use of one method over another.
IV. EXPERIMENTAL DESIGN
A. Comparing Cross Validation and Bagging Forecasts
We conduct a rigorous empirical experiment to evaluate
the relative forecasting accuracy of Crogging, in comparison
to Bagging, conventional neural network ensembles, and
individual NN model selection. This is the first evaluation of
employing each of the CV methods for model combination,
rather than error estimation or model selection. The
Multilayer Perceptron (MLP) algorithm is used to obtain
neural network models. To assess under which conditions
each of the algorithms performs well, we evaluate k-fold
Crogging for k = 2 and k = 10 and Monte-Carlo Crogging. In
order to allow a valid comparison across algorithms capable
of creating a number of diverse models each, we constrain
the total number of NN base models estimated to 50.
For k-fold Crogging, we evaluate both 10-fold and 2-fold
variants to assess the impact of different k. For
2-fold
cross-validation for aggregation, 2 subsets are generated, one
for training and one for validation. This has the advantage
that both the training and validation sets are large, and each
data point is used for both training and validation on each
fold. We train 25 randomly initialized MLPs on each fold
generating a total of 50 models which are then averaged. For
10-fold CV for aggregation, on each of the 10 folds we
train 5 randomly initialized MLPs for a total of 50 trained
MLPs which are then averaged. As a result, each validation
fold is smaller yielding a potential tradeoff in the valid
estimation of out-of-sample accuracy for early stopping.
For Monte Carlo Crogging, we set K = 50, creating 50
random cross-validation splits of the learning set into
training and validation data, and averaging over 50 randomly
initialized MLPs each trained on a different training set.
Accuracy is compared to three established benchmark
methods of Bagging, NN ensembles and individual NN
model selection. For Bagging, we set K = 50, creating 50
bootstrap replicates of the learning set, and averaging over
50 randomly initialized MLPs each trained on a different
bootstrap. For NN ensembles using simple model averaging
on the Hold-out method, we use the single split of the
training set obtained, to train 50 differently initialized MLPs.
This is equivalently referred to as neural network model
averaging [25] and most widely used in combining neural
networks for time series forecasting [26], [27], [18]. This
provides a strong benchmark and allows investigating the
benefits of cross-validation versus validation for model
averaging. Finally, individual model selection is also
based on cross-validation, selecting, from a set of 50
randomly initialized MLPs, the MLP model with the
smallest mean squared error (MSE) on the validation set. In
doing this, we use the hold-out method, which uses a single
validation set on which the prediction error is calculated.
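For contrast, a sketch of the HOLDOUT ensemble and the BESTMLP selection benchmarks under the same hypothetical fit_mlp/forecast helpers as above; it further assumes the fitted models expose a predict method for computing the validation MSE.

```python
import numpy as np

def holdout_ensemble_and_bestmlp(X_tr, y_tr, X_va, y_va, fit_mlp, forecast, n_models=50):
    """Train n_models randomly initialized MLPs on a single hold-out split and
    return (ensemble-average forecast, forecast of the lowest-validation-MSE model)."""
    models = [fit_mlp(X_tr, y_tr, X_va, y_va) for _ in range(n_models)]
    preds = [forecast(m) for m in models]
    val_mse = [np.mean((m.predict(X_va) - y_va) ** 2) for m in models]   # assumes a .predict method
    ensemble = np.mean(preds, axis=0)          # HOLDOUT: simple model averaging
    best = preds[int(np.argmin(val_mse))]      # BESTMLP: individual model selection
    return ensemble, best
```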
B. Dataset
In order to provide empirical evidence across a large
number of time series, we utilize the time series data from
the NN3 competition [28]. The complete dataset of 111 time
series of the NN3 dataset was chosen containing between 68
and 144 observations. The dataset consists of a
representative set of long and short, monthly time series
drawn from a homogeneous population of empirical business
time series. Fig. 1 shows six time series from the NN3
competition dataset. As illustrated, the time series contain
both seasonal and non-seasonal patterns, with only minor
trends and different time series lengths.
To allow a valid comparison of the forecast accuracy of
the proposed Crogging methods to those originally
participating in the NN3 competition, we perform multi-step-ahead forecasting using the iterative method,
forecasting 18 months into the future from a single fixed
origin. As a result, 18 examples are designated for the
holdout test set while the remainder is used for training.
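The iterative multi-step procedure might be sketched as follows, assuming a hypothetical one-step-ahead model predict_one_step; each forecast is fed back into the lag vector until the 18-month horizon is reached.

```python
import numpy as np

def iterative_forecast(history, predict_one_step, horizon=18, n_lags=13):
    """Iterative multi-step forecasting from a single fixed origin: forecast one
    step ahead, append the forecast to the (pseudo-)history, and repeat.
    Assumes len(history) >= n_lags."""
    history = list(history)
    forecasts = []
    for _ in range(horizon):
        lags = np.array(history[-n_lags:])      # the 13 most recent observations
        y_hat = predict_one_step(lags)
        forecasts.append(y_hat)
        history.append(y_hat)                   # feed the forecast back in
    return np.array(forecasts)
```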
Fig. 1. Time series NN3_101 to NN3_106 of the NN3 Competition dataset (plots omitted).
The size of the single validation set is set to 14 to ensure
consistency with the hold-out and Monte-Carlo cross-validation setup. The size of the validation set in k-fold
cross-validation is determined by the value of k.
C. Error Metrics
We calculate the mean absolute scaled error (MASE) and
the symmetric mean absolute percentage error (SMAPE) for
all methods in assessing forecast accuracy and performance.
For a given actual value $Y_t$ and forecast $F_t$ made for period
$t$, with $H$ the number of observations forecasted by the
respective forecasting method, the SMAPE is calculated as
follows:

$$\text{SMAPE} = \frac{1}{H}\sum_{t=1}^{H}\frac{|Y_t - F_t|}{\left(|Y_t| + |F_t|\right)/2} \cdot 100 \qquad (1)$$
Hyndman and Koehler propose the use of the MASE to
overcome several degenerate problems associated with MAE
and sMAPE, and because it is less sensitive to outliers and
more easily interpreted than other scaled error measures
[29]. The MASE is used to compare across all time series
and forecast methods and is defined by:
$$\text{MASE} = \frac{1}{H}\sum_{t=N+1}^{N+H}\frac{|Y_t - F_t|}{\frac{1}{N-1}\sum_{i=2}^{N}\left|Y_i - Y_{i-1}\right|} \qquad (2)$$
where N is defined as the number of observations in the
training set and H is the number of values being forecasted
in the out-of-sample test set. The SMAPE and MASE are
then averaged over all time series in the dataset to produce
the mean SMAPE and mean MASE respectively.
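A direct implementation of (1) and (2) could look like the sketch below, where y_train holds the N in-sample observations used for the naive-forecast scaling term and y_test, y_hat the H out-of-sample actuals and forecasts.

```python
import numpy as np

def smape(y_test, y_hat):
    """Symmetric mean absolute percentage error, Eq. (1), in percent."""
    denom = (np.abs(y_test) + np.abs(y_hat)) / 2.0
    return 100.0 * np.mean(np.abs(y_test - y_hat) / denom)

def mase(y_train, y_test, y_hat):
    """Mean absolute scaled error, Eq. (2): out-of-sample MAE scaled by the
    in-sample MAE of the naive (random walk) forecast."""
    scale = np.mean(np.abs(np.diff(y_train)))   # (1/(N-1)) * sum |Y_i - Y_{i-1}|
    return np.mean(np.abs(y_test - y_hat)) / scale
```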
D. Specification of the Neural Networks
The base model used is a univariate Multilayer Perceptron
(MLP). MLPs are well researched and their ability to
approximate and generalize any linear and nonlinear
functional relationship to an arbitrary degree of accuracy has
been proven in time series prediction [30]. They are also
viewed as benefiting from model combination approaches
due to their learning instability and the large number of
factors or degrees of freedom affecting neural network
training [4], [26]. The functional form of these networks is
given by:

$$f(X, w) = \beta_0 + \sum_{j=1}^{J}\beta_j \, g\!\left(\gamma_{0j} + \sum_{i=1}^{I}\gamma_{ij}\, x_i\right) \qquad (3)$$

and describes a single-layered MLP characterized by its input
vector $X = [x_1, x_2, \ldots, x_I]$, which captures the lagged
observations of the time series in $I$ input nodes, its number
of hidden nodes $J$ and a single output node. We set $I = 13$,
which captures lags up to $t-13$. This is sufficient to model
the monthly (stochastic) seasonality of a seasonal
autoregressive process of lag 12, in addition to trends (i.e. an
I(1) process). All data is
pre-processed using linear scaling into the interval of [-0.5,
0.5] and each time series is modelled directly without prior
differencing or further data transformation. Level, trend and
seasonality are estimated directly in the model weights. Each
MLP network contains a single hidden layer with two hidden
nodes using the hyperbolic tangent transfer function [31],
and a single output node with a linear identity function.
The MLP is trained using the Levenberg-Marquardt
algorithm with a maximum of 1000 epochs. An early
stopping criterion is employed which stops the network
training if the validation error increases or remains the same
for more than 50 epochs. Additionally, network training stops
if the adaptive damping parameter μ of the Levenberg-Marquardt
algorithm exceeds 1e10. The network weights
giving the lowest validation error during training are used in
order to reduce overfitting to the data. All networks are
trained using early stopping on S_Valid. Alternatively, one can
consider training using only S_Train with regularization, or
forcing overfitting for diversity, but better results were
obtained using the former approach.
For all neural networks we employ random weight
initialization. This means that in creating each new model,
we randomly initialize the starting weights for each neural
network allowing for different solutions of the network to be
achieved, in addition to the randomness introduced by the
cross validation and bootstrap procedures. In all cases, we
combine a total of 50 models to allow for a fair comparison
of the different methods; any differences should not be due
to the number of models included in the final combination.
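The data preparation described above (13 lagged inputs, linear scaling into [-0.5, 0.5]) might be implemented along the following lines; the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def scale_linear(y, lo=-0.5, hi=0.5):
    """Linearly scale a series into [lo, hi]; also return the parameters
    needed to invert the transformation for the final forecasts."""
    y_min, y_max = y.min(), y.max()
    y_scaled = lo + (y - y_min) * (hi - lo) / (y_max - y_min)
    return y_scaled, (y_min, y_max)

def lag_embed(y, n_lags=13):
    """Build the input matrix of 13 lagged observations and the target vector."""
    X = np.column_stack([y[i : len(y) - n_lags + i] for i in range(n_lags)])
    t = y[n_lags:]
    return X, t
```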
V. EXPERIMENTAL RESULTS
A. Results Across all 111 Time Series
The results for the competing methods are summarized in
Table I and Table II for MASE and SMAPE respectively.
The error measures yield slightly different results, with the
10-fold cross-validation (10FOLDCV) method having the
lowest mean MASE (1.07) on the test set, and 2-fold cross-validation (2FOLDCV) having the lowest mean SMAPE
(15.29). Some consistent patterns, however, occur across error
measures. Most notably, all Crogging methods of cross-validation aggregation (MONTECV, 10FOLDCV and
2FOLDCV) generate smaller forecast errors compared to the
standard Hold-out method (HOLDOUT), which only
averages over a single validation set, the most widely used
approach to creating MLP ensembles. This indicates a
general improvement in forecast accuracy from multiple
splitting of the learning set, into either random or mutually
exclusive subsets, with significant improvements over the
benchmark model averaging approach. In addition, all
Crogging variants outperform the benchmark Bagging
algorithm which has an out-of-sample forecast error
(MASE=1.21, SMAPE=16.32) slightly larger than the
HOLDOUT method (MASE=1.20, SMAPE=16.08). These
findings are also consistent when errors on the validation
dataset are considered, indicating that these comparative
results are not subject to overfitting on the validation set.
TABLE I
AVERAGE MASE ON TRAINING, VALIDATION AND TEST DATASET ACROSS ALL TIME SERIES

Method      Train   Validation   Test
BESTMLP     0.67    0.60         1.50
HOLDOUT     0.64    0.75         1.20
BAG         0.76    0.70         1.21
MONTECV     0.76    0.41         1.16
10FOLDCV    0.69    0.45         1.07
2FOLDCV     0.73    0.60         1.15

TABLE II
AVERAGE SMAPE ON TRAINING, VALIDATION AND TEST DATASET ACROSS ALL TIME SERIES

Method      Train   Validation   Test
BESTMLP     12.36   11.10        17.89
HOLDOUT     11.78   12.57        16.08
BAG         12.95   13.17        16.32
MONTECV     13.81   8.29         15.35
10FOLDCV    12.65   8.94         15.52
2FOLDCV     13.68   11.19        15.29
As would be expected, all combination methods
outperform model selection, that is, the best MLP
(BESTMLP) method which runs 50 randomly initialized
MLPs and selects the MLP with the smallest error on the
validation set. While the BESTMLP performs well on
training and validation set, relative to other methods, for
example, Bagging and 2FOLDCV, it produces the highest
forecast errors on the test set. This is an indication of
overfitting of the individual MLP models to the validation
set, and the poor performance on the test set is explained by
the resulting instability in the model selection process from
selecting the model which minimizes the validation set
MSE. The selected model is not robust to changes in the
time series out-of-sample on the test set.
Table III shows the average MASE and SMAPE, and the
standard deviation and coefficient of variation of the
distribution of the MASE and SMAPE across all time series.
Results of both error measures show that model averaging
results in a lower standard deviation (SD) in forecast error
across time series when compared to model selection, with
the 2FOLDCV method having the lowest standard deviation,
reflecting a more robust performance across all time series.
The coefficient of variation (CoeVAR) over the distribution
of both the MASE and SMAPE across the time series also
supports the observation that the performance of the
2FOLDCV method is most robust across time series.
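For reference, the dispersion statistics reported in Table III reduce to simple summaries over the per-series errors; a minimal sketch, assuming errors holds the per-series MASE (or SMAPE) values of one method:

```python
import numpy as np

def dispersion_summary(errors):
    """Mean, standard deviation and coefficient of variation (SD / mean)
    of the error distribution across time series."""
    errors = np.asarray(errors, dtype=float)
    mean, sd = errors.mean(), errors.std(ddof=1)
    return mean, sd, sd / mean
```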
A plot of the distribution across all time series of the
SMAPE and in particular the MASE as shown in Fig. 2
shows further that the 2FOLDCV and MONTECV methods
produce lower variation in the forecast error and standard
deviation of forecast errors relative to other methods, in
particular the BESTMLP which has the largest variation.
TABLE III
AVERAGE MASE, STANDARD DEVIATION AND COEFFICIENT OF VARIATION ON TEST SET ACROSS ALL TIME SERIES

Method      mean MASE   mean SD_MASE   mean CoeVAR_MASE   mean SMAPE (%)   mean SD_SMAPE   mean CoeVAR_SMAPE
BESTMLP     1.50        1.06           0.75               17.89            12.81           0.74
HOLDOUT     1.20        0.80           0.72               16.08            11.59           0.73
BAG         1.21        0.82           0.73               16.32            11.51           0.72
MONTECV     1.16        0.79           0.73               15.35            11.40           0.74
10FOLDCV    1.07        0.80           0.76               15.52            12.04           0.77
2FOLDCV     1.15        0.78           0.71               15.29            11.12           0.73
Fig. 2 also shows that 2FOLDCV has the lowest median
MASE and standard deviation of the MASE across all time
series and that across both error measures, CV methods
produce smaller median errors and standard deviation, and
lower variation in both measures when compared to the
HOLDOUT method and Bagging. This gives further
evidence that the improvement in accuracy is due to the
manner in which cross validation introduces diversity
through data splitting rather than bootstrap resampling.
A factor which is likely to impact the performance of
cross-validation is the length of the time series which
determines the amount of data available in the learning set
and consequently the number of observations available for
training and validation in each cross-validation split.
B. Results by Time Series Data Conditions
Table IV shows the forecast accuracy measured using
SMAPE averaged across short, medium and long forecast
horizons, for time series categorized as long and short [28].
We present only the results of the SMAPE as these are
consistent in this case, with those of the MASE. It can be
observed that on long time series 10FOLDCV has the
smallest SMAPE for medium to long horizons, and over
forecast lead time, 1-18. In contrast 2FOLDCV and
MONTECV both outperform 10FOLDCV on short time
series across all forecast horizons.
The performance of 2FOLDCV and MONTECV reflects
an advantage of both methods which is the increase in length
of both training and validation data. Because 2-fold cross-validation generates only 2 folds of equal size, the training
and validation sets are both large. Likewise an advantage of
MONTECV is that the proportion of examples in the
training and validation set is not dependent on the number of
folds. This decoupling of the number of splits and the size of
the training/validation set results in larger validation sets.
The availability of sufficient data for training is
particularly important where the time series is short.
Fig. 2. Boxplots of the MASE and SMAPE (top) and standard deviation of the MASE and SMAPE (bottom) measures, averaged over all forecast horizons and
obtained across all time series for the different methods. The line of reference represents the median value of the distributions.
TABLE IV
SMAPE FOR TEST SET ACROSS SHORT, MEDIUM AND LONG FORECAST HORIZONS

                        Forecast Horizon (a)
Length   Method      1-3     4-12    13-18   1-18
Long     BESTMLP     10.79   16.59   20.02   16.77
         HOLDOUT      9.34   14.96   16.20   14.43
         BAG          9.74   15.46   16.38   14.81
         MONTECV     10.86   15.16   15.69   14.64
         10FOLDCV    10.39   14.04   14.82   13.69
         2FOLDCV      9.03   15.43   14.54   14.06
Short    BESTMLP     16.83   17.03   20.66   18.20
         HOLDOUT     17.59   17.04   20.12   18.16
         BAG         17.20   17.27   20.96   18.49
         MONTECV     15.47   14.71   19.05   16.28
         10FOLDCV    16.00   15.91   20.25   17.37
         2FOLDCV     15.86   14.51   18.95   16.21
(a) 1-3 = short horizon, 4-12 = medium horizon, 13-18 = long horizon.
This is reflected in Fig. 3, which shows the distribution of the
SMAPE for short and long time series. For short series, the
increased size of the training and validation set from using
2FOLDCV and MONTECV, results in better training of the
network and as the results suggest, improved forecast
accuracy. When sufficient data is available for training and
validation, the increase in the number of folds from 2 to 10
results in improved forecast accuracy (see Fig. 3, right).
C. Relative Ranking on NN3 results
Table V reports the results obtained by the first eight
participants of the NN3 competition, the top five methods of
this study, the benchmark neural network model of the
competition (AutomatANN) and the single MLP used in this
study. In keeping with the report format of the competition,
we report rankings first according to SMAPE and then to
MASE. Among the computational intelligence (NN/CI)
methods, 2FOLDCV and MONTECV rank 2nd and 3rd
respectively behind Illies, and 4th and 5th overall among all
methods. This reflects rather good performance by the
proposed cross-validation combination methods relative to
methods used in the competition; in the case of the
MASE, the 10FOLDCV method ranks 1st among
computational intelligence methods and 1st among all
methods. An advantage of these methods based on cross-validation and bootstrapping is their simplicity compared to
other methods. This includes the approach of Illies et al.
(C27) which is based on a combination of time series
clustering, decomposition and recurrent Echo State
Networks (ESN), and the method of Flores et al. (C03),
which uses a self-adaptive genetic algorithm to determine the
terms of a seasonal ARIMA (p,d,q)(P,D,Q) model.
VI. CONCLUSION
Current approaches to model averaging with neural
networks which are based on data sampling use either a
single training set which is then the original learning set, or
bootstrapping to generate multiple training sets through
resampling of the original learning set. Where a single
training set is used, model diversity is generated through
multiple random initializations of the neural network
weights and where bootstrapping is employed, model
diversity comes from the randomly sampled training data to
which neural network training is sensitive. This paper
proposes the use of cross-validation data splitting for model
averaging, and assesses different forms of cross-validation
for creating model diversity. In this case, the set of candidate
models are trained on different splits of the training data
while simultaneously reducing overfitting of the neural
network models through early-stopping on different training-validation set pairs. This approach proves to be a very
promising alternative to the current strategy of neural
network model averaging, Bagging and model selection.
Fig. 3. Boxplots of the SMAPE averaged over all forecast horizons and obtained across short (left) and long (right) time series for the different methods. The
line of reference represents the median value of the distributions.
TABLE V
AVERAGE ERRORS AND RANKS OF ERRORS ACROSS ALL TIME SERIES OF THE NN3 COMPETITION

                             Average errors      Ranking all methods   Ranking NN/CI
ID     Method         SMAPE    MASE              SMAPE    MASE         SMAPE    MASE
B09    Wildi          14.84    1.13              1        2            −        −
B07    Theta          14.89    1.13              2        2            −        −
C27    Illies         15.18    1.25              3        9            1        7
**     2FOLDCV        15.29    1.15              4        3            2        2
**     MONTECV        15.35    1.16              5        4            3        3
B16    ForecastPro    15.44    1.17              6        5            −        −
**     10FOLDCV       15.52    1.07              7        1            4        1
B17    DES            15.90    1.17              8        5            −        −
B03    Comb S-H-D     15.93    1.21              9        8            −        −
B05    Autobox        15.95    1.18              10       6            −        −
**     HOLDOUT        16.08    1.20              11       7            5        4
C03    Flores         16.31    1.20              12       7            6        4
**     BAG            16.32    1.21              13       8            7        5
B00    AutomatANN     16.81    1.21              14       8            8        5
**     MLP            17.89    1.50              15       10           9        6
** denotes methods evaluated in this study.
REFERENCES
[1] J. M. Bates and C. W. J. Granger, "Combination of Forecasts," Operational Research Quarterly, vol. 20, pp. 451-468, 1969.
[2] P. Newbold and C. W. J. Granger, "Experience with Forecasting Univariate Time Series and Combination of Forecasts," Journal of the Royal Statistical Society, Series A, vol. 137, pp. 131-165, 1974.
[3] S. Crone. (2007, 20/08/2009). NN3 Results. Available: http://www.neural-forecasting-competition.com/NN3/results.htm
[4] L. Breiman, "Heuristics of instability and stabilization in model selection," Annals of Statistics, vol. 24, pp. 2350-2383, 1996.
[5] B. Efron, "Bootstrap Methods: Another Look at the Jackknife" (1977 Rietz Lecture), Annals of Statistics, vol. 7, pp. 1-26, 1979.
[6] B. Efron, "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, vol. 78, pp. 316-331, 1983.
[7] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, Aug 1996.
[8] S. Arlot and A. Celisse, "A survey of cross-validation procedures for model selection," Statistics Surveys, vol. 4, pp. 40-79, 2010.
[9] J. Tashman, "Out-of-sample tests of forecasting accuracy: an analysis and review," International Journal of Forecasting, vol. 16, pp. 437-450, Oct-Dec 2000.
[10] T. E. Clark, "Can out-of-sample forecast comparisons help prevent overfitting?," Journal of Forecasting, vol. 23, pp. 115-139, Mar 2004.
[11] J. Michaelsen, "Cross-Validation in Statistical Climate Forecast Models," Journal of Climate and Applied Meteorology, vol. 26, pp. 1589-1600, Nov 1987.
[12] C. C. P. Wolff, "Time-Varying Parameters and the Out-of-Sample Forecasting Performance of Structural Exchange-Rate Models," Journal of Business & Economic Statistics, vol. 5, pp. 87-97, Jan 1987.
[13] R. H. Clarida, L. Sarno, M. P. Taylor, and G. Valente, "The out-of-sample success of term structure models as exchange rate predictors: a step beyond," Journal of International Economics, vol. 60, pp. 61-83, May 2003.
[14] M. Y. Hu, G. Q. Zhang, C. Z. Jiang, and B. E. Patuwo, "A cross-validation analysis of neural network out-of-sample performance in exchange rate forecasting," Decision Sciences, vol. 30, pp. 197-216, 1999.
[15] H. R. Kunsch, "The Jackknife and the Bootstrap for General Stationary Observations," Annals of Statistics, vol. 17, pp. 1217-1241, Sep 1989.
[16] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993.
[17] S. Makridakis and R. L. Winkler, "Averages of Forecasts - Some Empirical Results," Management Science, vol. 29, pp. 987-996, 1983.
[18] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R.
Lewandowski, et al., "The Accuracy of Extrapolation (Time-Series)
Methods - Results of a Forecasting Competition," Journal of
Forecasting, vol. 1, pp. 111-153, 1982.
[19] F. C. Palm and A. Zellner, "To combine or not to combine - Issues of
combining forecasts," Journal of Forecasting, vol. 11, pp. 687-701,
Dec 1992.
[20] S. Makridakis and M. Hibon, "The M3-Competition: results,
conclusions and implications," International Journal of Forecasting,
vol. 16, pp. 451-476, Oct-Dec 2000.
[21] G. Elliott and A. Timmermann, "Optimal forecast combinations
under general loss functions and forecast error distributions," Journal
of Econometrics, vol. 122, pp. 47-79, Sep 2004.
[22] M. Assaad, R. Bone, and H. Cardot, "A new boosting algorithm for
improved time-series forecasting with recurrent neural networks,"
Information Fusion, vol. 9, pp. 41-55, Jan 2008.
[23] V. R. R. Jose and R. L. Winkler, "Simple robust averages of
forecasts: Some empirical results," International Journal of
Forecasting, vol. 24, pp. 163-169, Jan-Mar 2008.
[24] B. Efron and R. Tibshirani, "Improvements on Cross-Validation: The
.632+ Bootstrap Method," Journal of the American Statistical
Association, vol. 92, pp. 548-560, 1997.
[25] L. K. Hansen and P. Salamon, "Neural Network Ensembles," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 12,
pp. 993-1001, Oct 1990.
[26] G. P. Zhang and V. L. Berardi, "Time series forecasting with neural
network ensembles: an application for exchange rate prediction,"
Journal of the Operational Research Society, vol. 52, pp. 652-664,
Jun 2001.
[27] U. Naftaly, N. Intrator, and D. Horn, "Optimal ensemble averaging of
neural networks," Network-Computation in Neural Systems, vol. 8,
pp. 283-296, Aug 1997.
[28] S. F. Crone, M. Hibon, and K. Nikolopoulos, "Advances in
forecasting with neural networks? Empirical evidence from the NN3
competition on time series prediction," International Journal of
Forecasting, vol. 27, pp. 635-660, 2011.
[29] R. J. Hyndman and A. B. Koehler, "Another look at measures of
forecast accuracy," International Journal of Forecasting, vol. 22, pp.
679-688, 2006.
[30] K. Hornik, "Approximation capabilities of multilayer feedforward
networks," Neural Networks, vol. 4, pp. 251-257, 1991.
[31] G. Zhang, B. E. Patuwo, and M. Y. Hu, "Forecasting with artificial
neural networks: The state of the art," International Journal of
Forecasting, vol. 14, pp. 35-62, 1998.