Crogging (Cross-Validation Aggregation) for Forecasting – a novel algorithm of Neural Network Ensembles on Time Series Subsamples

Devon K. Barrow and Sven F. Crone

Devon K. Barrow and Sven F. Crone are with the Lancaster Centre for Forecasting, Lancaster University Management School, Lancaster University, Lancaster, UK (e-mail: {barrow, crone}@exchange.lancs.ac.uk).

Abstract— In classification, regression and time series prediction alike, cross-validation is widely employed to estimate the expected accuracy of a predictive algorithm by averaging predictive errors across mutually exclusive subsamples of the data. Similarly, bootstrapping aims to increase the validity of estimating the expected accuracy by repeatedly sub-sampling the data with replacement, creating overlapping samples of the data. Estimates are then used to anticipate future risk in decision making, or to guide model selection where multiple candidates are feasible. Beyond error estimation, bootstrapping has recently been extended to combine each of the diverse models created for estimation, aggregating over their predictions (rather than their errors), coined bootstrap aggregation or bagging. However, similar extensions of cross-validation to create diverse forecasting models have not been considered. In accordance with bagging, we propose to combine the benefits of cross-validation and forecast aggregation, i.e. Crogging. We assess different levels of cross-validation, including a (single-fold) hold-out approach, 2-fold and 10-fold cross-validation and Monte-Carlo cross-validation, to create diverse base models of neural networks for time series prediction trained on different data subsets, and average their individual multiple-step-ahead predictions. Results of forecasting the 111 time series of the NN3 competition indicate significant improvements in accuracy through Crogging relative to Bagging or individual model selection of neural networks.

I. INTRODUCTION

The combination of predictive models in ensembles has received substantial attention, aiming to increase an existing algorithm's accuracy in out-of-sample predictions for classification, regression and time series forecasting alike. In forecasting, the seminal paper by Bates and Granger [1] showed significant gains in accuracy through the linear combination of univariate time series methods, which has been independently confirmed to be unbiased and more accurate (see, e.g., [2]) in comparison to the accuracy of each of the individual models. Since then, statistical methods as well as neural network ensembles have been used to improve the accuracy over single models in time series forecasting [3]. However, the majority of papers have simply combined multiple algorithms previously specified, or multiple initializations thereof, each one parameterized on the same complete learning data. As an alternative, algorithms which actively create diverse base models by resampling the dataset on which methods are parameterized have received less attention in forecasting. Most notably, Bagging is a general algorithm for stabilizing
unstable prediction methods such as neural networks and decision trees [4]. Bagging creates bootstrap replicates of a given learning set by randomly sampling with replacement, using each replicate as a new training set of equal size to parameterize a base learner, and aggregating the resulting outputs into an ensemble forecast by averaging their predictions. The algorithm is fundamentally based on Bootstrapping, an established statistical technique of resampling from observed data, which is traditionally used to estimate the distribution of almost any statistic [5], while enhancing the precision and reducing the variance of the estimation [6]. However, other established statistical methods of resampling for estimation purposes, such as jackknifing or cross-validation, have not yet been considered to create diverse predictions. While the main application of cross-validation has been the estimation of prediction errors, in model selection, or in neural network training to control overfitting through early stopping on a validation dataset, it has not been used to create diverse forecasts to be combined in ensembles.

In this work we propose a novel aggregation and combination method in a similar spirit to Bagging [7], but based on cross-validation rather than bootstrapping, consequently coined Crogging. This novel combination method averages forecasts over a set of predictive models trained using mutually exclusive cross-validation replicates of the original learning set. As different splitting strategies allow alternate cross-validation estimates, we assess the accuracy of k-fold Crogging and Monte-Carlo Crogging in comparison to established approaches of Bagging, ensembles trained using single-fold cross-validation, i.e. using a single hold-out set, and model selection (of a single best), always using neural networks (NN) as a base learner.

The paper is organized as follows: Section II describes the cross-validation strategies traditionally employed for error estimation, including the benchmark approach of Bagging. Section III introduces forecasting model combination and aggregation, and motivates the novel algorithm. Section IV outlines the experimental design, with findings presented in Section V. Section VI concludes the paper.

II. CROSS VALIDATION FOR ERROR ESTIMATION

A. Resampling Techniques for Error Estimation

The estimation of predictive accuracy is important, both for comparing statistical models and for assessing the model which is finally selected. Given a learning set of observed data sampled from an unknown population, and a set of models constructed for predicting future values, one calculates the prediction error of each model and wants to know which model performs best. However, in-sample accuracy, i.e. the model fit which measures the ability to approximate the data generating process, has been proven to have little correlation with out-of-sample accuracy, i.e. the ability to generalize to unseen data of the same data generating process. As a result, the statistical resampling technique of cross-validation (CV) assesses how the results of a statistical estimate will generalize to an independent data set [8], [7]. Cross-validation splits the data, using one subset as a learning set to train each model, and the remaining part as a validation sample for estimating the error of the predictor.
This provides a less biased and more representative estimate of the true ex ante performance of the model. For multiple alternative candidate models, the model with the lowest prediction error is then selected as the final model. Different versions of cross-validation exist: hold-out CV, k-fold CV and Monte-Carlo CV, depending on the number of subsets, and whether the subsets are mutually exclusive or overlapping. All variants have in common that they generate different training sets based on splitting the original learning set, and that they are used to estimate errors and aid in model selection, but not in forecast combination for prediction.

In time series prediction, where data over time is often non-stationary, error estimation through cross-validation has become a prerequisite to assess the predictive out-of-sample accuracy of an algorithm with validity and reliability [9], and to perform model selection between competing algorithms prior to their actual application. Out-of-sample evaluations with a single hold-out dataset have been most popular for NN and statistical methods alike, including systematic analyses of theoretical properties [10] or of particular application areas, such as climate forecasting [11], or financial forecasting with statistics and neural networks [12-14]. Similarly, forecasting competitions regularly employ (single-fold) out-of-sample evaluation, where a part of each time series is not disclosed to the contestants in order to assess the empirical accuracy of competing forecasting methods objectively in a simulated ex ante design. In contrast, only few publications have estimated accuracy across multiple folds, employing k-fold subsampling for time series prediction.

B. Hold-Out Cross Validation

For the simplest case of CV, a single split into two data subsets is performed. The hold-out method, also referred to as validation estimation, partitions the original learning set into two mutually exclusive subsets S_Train and S_Valid, the training and validation (or hold-out) set respectively. A model m is estimated on S_Train and used to obtain forecasts to estimate predictive accuracy on S_Valid. Guidance as to how many observations to include in either dataset is inconclusive, often employing heuristic rules of thumb such as a 70%:30% split into training and validation data respectively. The hold-out method may be considered both a special case of k-fold CV (with k = 1) and of Monte-Carlo CV, both discussed next.

C. K-fold Cross Validation

In a more general setting, for a time series of length T, we define a k-fold cross-validation, with k ≤ T, which divides a learning set into k non-overlapping and mutually exclusive subsets of approximately equal size. Observations are drawn without replacement, either randomly, or, in the case of time series data with potential autocorrelation, sequentially in blocks of consecutive observations. The predictive model is then estimated k times, each time using a training set S_Train comprised of k - 1 of the subsamples. The one remaining subsample is retained as validation data S_Valid, used to estimate the out-of-sample performance of the estimated model. This process is repeated k times, with each of the k subsamples used exactly once as the validation data. Estimates of the algorithm's out-of-sample predictive errors are then obtained by averaging the errors across the k validation samples omitted in each estimation. Consequently, for k = 1 a hold-out evaluation is estimated. For k = 2 a two-fold cross-validation splits the dataset into 2 folds, training the model on one and estimating errors on the other, then vice versa, and averaging the estimated out-of-sample error across both validation sets. For k = T, k-fold CV becomes a leave-one-out (LOO) cross-validation, assessing T estimated models (each using T - 1 observations) on T single observations held out for validation, an approach equivalent to jackknifing in statistical estimation. An advantage of k-fold CV is that all observations are used for both training and validation, all training observations are used with equal weight, and each observation is used for validation exactly once. A potential disadvantage is that the proportion of the training/validation split is dependent on the number of iterations (folds). Due to its simplicity, single-fold CV is widely applied in model selection [8] and common in neural network training with early stopping to prevent overfitting. However, as the CV becomes more computationally demanding for larger k, only few scientific studies apply LOO CV to time series.
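To make the blocked k-fold splitting concrete, the following minimal Python sketch (not code from the paper; the helper name and the assumption that lagged input/target examples are already indexed from 0 to n-1 are ours) generates the k mutually exclusive training/validation index sets from contiguous blocks of consecutive observations:

```python
import numpy as np

def kfold_blocked_splits(n_examples, k):
    """Split indices 0..n_examples-1 into k contiguous blocks and yield
    (train_idx, valid_idx) pairs, one per fold, as described in Sec. II.C.
    Blocks of consecutive observations preserve autocorrelation structure."""
    blocks = np.array_split(np.arange(n_examples), k)
    for i in range(k):
        valid_idx = blocks[i]
        train_idx = np.concatenate([blocks[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

# Example: 2-fold CV over 100 lagged training examples
for train_idx, valid_idx in kfold_blocked_splits(100, k=2):
    print(len(train_idx), len(valid_idx))  # 50 50 on each fold
```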
D. Monte-Carlo Cross-Validation

Monte-Carlo CV repeats the random splitting of the original learning set into two subsets S_Train and S_Valid multiple times, each time randomly drawing examples without replacement from the learning set to form a training set S_Train, and using the remaining examples to form S_Valid. A model m is trained on S_Train and used to obtain forecasts for S_Valid. This is repeated K times, with K as large as possible, and errors are estimated by averaging across the K validation folds. Note that although the data subsets are mutually exclusive within each round of Monte-Carlo CV, they are not when the procedure is repeated K times. As a result, all observations in Monte-Carlo CV will be used for estimation of m and for validation of errors multiple times, but across all iterations a different number of times, depending on the independent random sampling between rounds.

E. Bootstrapping

As an alternative to CV, Bootstrapping provides a statistical resampling technique to estimate errors, using sampling with replacement to create different training datasets for within-sample estimation. We consider the ordinary bootstrap method [5], in which the temporal and spatial covariance structure of the original time series is preserved in the lagged input vectors, much like in the moving block bootstrap [15, 16]. From the original learning set of N examples, N examples are randomly drawn with replacement according to a discrete uniform distribution, where each example in the learning set has equal probability of being chosen. These N examples form the new training set S_Train, of equal size as the original learning set. Model m is trained using S_Train and used to obtain forecasts on the original learning set. This is repeated K times, with K as large as possible, and validation errors are estimated as an average across the K replicates. In comparison to CV, bootstrapping does not make use of a separate validation set, but creates diverse estimates of errors by utilizing approximately 1 - (1 - 1/N)^N ≈ 63.2% unique examples in each training set obtained by sampling with replacement [6].
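The two resampling schemes above differ only in how indices are drawn. A minimal sketch under our own naming conventions (nothing here is prescribed by the paper beyond the sampling rules; the 70%:30% proportion is an illustrative choice) shows both, including the roughly 63.2% share of unique examples per bootstrap replicate:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_splits(n_examples, K, valid_fraction=0.3):
    """K independent random splits into training and validation indices,
    drawn without replacement within each round (Sec. II.D)."""
    n_valid = int(round(valid_fraction * n_examples))
    for _ in range(K):
        perm = rng.permutation(n_examples)
        yield perm[n_valid:], perm[:n_valid]   # (train_idx, valid_idx)

def bootstrap_samples(n_examples, K):
    """K bootstrap replicates drawn uniformly with replacement (Sec. II.E);
    each replicate contains roughly 63.2% unique examples on average."""
    for _ in range(K):
        yield rng.integers(0, n_examples, size=n_examples)

unique_share = np.mean([len(np.unique(idx)) / 500
                        for idx in bootstrap_samples(500, K=200)])
print(round(unique_share, 3))   # close to 1 - (1 - 1/N)**N, i.e. about 0.632
```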
III. CROSS VALIDATION FOR FORECAST AGGREGATION

A. Forecast Combination and Bootstrap Aggregation

As an alternative to identifying and selecting the single most promising algorithm for forecasting future observations, research in forecast combination remains active. Makridakis et al. showed that using arithmetic means of forecasts improves forecasting accuracy [17], and that taking a simple average outperforms a weighted average model combination [18] while being more robust [19]. Results of the M-competition showed that averaging the forecasts of six different algorithms performed better than each of the individual methods included in the average [18]. Similarly, at the M3 competition the arithmetic mean of Single, Holt and Dampened Trend Exponential Smoothing proved more accurate than each of the three methods individually, for practically all forecasting horizons [20]. For NN models, constructing ensembles has proven equally successful in increasing accuracy and hence prominent. However, how to combine models, and under which conditions, still remains a research question under debate. Some papers dispel the notion that equally weighted combined forecasts lead to better performance [21], others suggest the weighted median as it is deemed less sensitive to outliers than the weighted mean [22], or the use of unweighted averages with trimming and winsorisation to avoid the influence of extreme values and errors [23]. Within forecasting and time series prediction, the majority of papers have resorted to combining multiple algorithms previously specified, or multiple initializations thereof, each one parameterized on the same complete learning data.

In contrast, the Bagging algorithm has been recently proposed as an alternative to simple combination. Rather than use bootstrapping to estimate errors, Breiman averages forecasts (not errors) across multiple models m trained on different data subsets created using random uniform sampling with replacement, with substantial success [23]. However, despite the prominence of Bagging, similar extensions creating predictors from cross-validation routines have not been developed, despite promising potential benefits over single models and simple forecast combinations.

B. Cross-Validation with Forecast Aggregation

In k-fold CV, each of the k contender models m provides forecasts only for the validation data, ignoring their potential to predict out of sample. As a result, many of the diverse candidate models created in cross-validation and trained on subsamples of the data are used only to estimate accuracy, but not to create predictions themselves. Rather than use each of the CV methods for error estimation or model selection, we extend them to model combination through forecast aggregation. In analogy to Bagging, we propose to aggregate and combine the predictions of each individual cross-validation model, termed Crogging. Specifically, we propose two new algorithms of k-fold Crogging and Monte-Carlo Crogging, and seek to evaluate them in an empirical evaluation. (Note that the case of 1-fold hold-out evaluation is equivalent to using an ensemble of conventional neural networks, and as such cannot be considered novel.)

The difference between Bagging and the proposed approach of Crogging lies in the generation of the data samples used for training and validation. While both cross-validation and bootstrapping are based on resampling, cross-validation ensures that all observations are used for both training and validation, though not simultaneously, and each observation is guaranteed to be used for model estimation and validation the same number of times. Furthermore, the validation set available in CV can be used to control for overfitting in neural network training using early stopping. k-fold cross-validation allows the use of all validation sets in performing early stopping, which potentially further reduces the risk of overfitting. In comparison to the conventional hold-out or validation method commonly used for early stopping of neural networks, which uses only a single split of the data and therefore only a single validation set, Crogging promises the added benefit of using multiple mutually exclusive validation datasets. Nevertheless, while cross-validation produces a nearly unbiased estimate of the future value of a parameter, a major drawback is the high variability which can be present in this estimate [24]. In light of these differences, this paper evaluates the potential benefits of the proposed Crogging approach based on cross-validation aggregation relative to standard model averaging (hold-out aggregation) and Bagging (i.e. bootstrap aggregation), and investigates possible gains in accuracy resulting from the use of one method over another.
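The Crogging aggregation step itself is compact. The sketch below is our own illustration (function names, the requirement that the trained object expose a .predict method, and the direct use of prepared future input vectors are all assumptions; the paper itself produces the 18-step forecasts iteratively): one base model is trained per cross-validation split, early stopping on that split's validation set is assumed to happen inside the supplied training function, and the individual multi-step forecasts are averaged.

```python
import numpy as np

def crogging_forecast(X, y, splits, train_fn, X_future):
    """Cross-validation aggregation (Crogging): train one base model per
    CV split, using that split's validation set (e.g. for early stopping),
    then average the models' multiple-step-ahead forecasts.

    splits   : iterable of (train_idx, valid_idx) pairs, e.g. from
               kfold_blocked_splits or monte_carlo_splits sketched above
    train_fn : callable (X_tr, y_tr, X_va, y_va) -> fitted model with .predict
    X_future : input vectors for the future periods to be forecast
    """
    forecasts = []
    for train_idx, valid_idx in splits:
        model = train_fn(X[train_idx], y[train_idx],
                         X[valid_idx], y[valid_idx])
        forecasts.append(model.predict(X_future))
    return np.mean(forecasts, axis=0)   # simple (unweighted) average
```

With a bootstrap sampler in place of the CV split generator and the full learning set as the implicit validation data, the same loop reduces to Bagging, which is the sense in which Crogging is the cross-validation analogue of bootstrap aggregation.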
IV. EXPERIMENTAL DESIGN

A. Comparing Cross-Validation and Bagging Forecasts

We conduct a rigorous empirical experiment to evaluate the relative forecasting accuracy of Crogging in comparison to Bagging, conventional neural network ensembles, and individual NN model selection. This is the first evaluation employing each of the CV methods for model combination, rather than for error estimation or model selection. The Multilayer Perceptron (MLP) algorithm is used to obtain the neural network models m. To assess under which conditions each of the algorithms performs well, we evaluate k-fold Crogging for k = 2 and k = 10, and Monte-Carlo Crogging. In order to allow a valid comparison across algorithms, each capable of creating a different number of diverse models, we constrain the total number of NN base models estimated to 50.

For k-fold Crogging, we assess both 10-fold and 2-fold variants to assess the impact of different k. For 2-fold cross-validation for aggregation, 2 subsets are generated, one for training and one for validation. This has the advantage that both the training and validation sets are large, and each data point is used for both training and validation across the folds. We train 25 randomly initialized MLPs on each fold, generating a total of 50 models which are then averaged. For 10-fold CV for aggregation, on each of the 10 folds we train 5 randomly initialized MLPs, for a total of 50 trained MLPs which are then averaged. As a result, each validation fold is smaller, yielding a potential trade-off in the valid estimation of out-of-sample accuracy for early stopping. For Monte-Carlo Crogging, we set K = 50, creating 50 random cross-validation splits of the learning set into training and validation data, and averaging over 50 randomly initialized MLPs, each trained on a different training set.

Accuracy is compared to three established benchmark methods: Bagging, NN ensembles and individual NN model selection. For Bagging, we set K = 50, creating 50 bootstrap replicates of the learning set, and averaging over 50 randomly initialized MLPs, each trained on a different bootstrap replicate. For NN ensembles using simple model averaging on the hold-out method, we use the single split of the training set obtained to train 50 differently initialized MLPs. This is equivalently referred to as neural network model averaging [25] and is most widely used in combining neural networks for time series forecasting [26], [27], [18]. This provides a strong benchmark and allows investigating the benefits of cross-validation versus validation for model averaging. Finally, individual model selection is also based on cross-validation, selecting from a set of 50 randomly initialized MLPs the MLP model with the smallest mean squared error (MSE) on the validation set. In doing this, we use the hold-out method, which uses a single validation set on which the prediction error is calculated.
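The common budget of 50 base MLPs is allocated differently by each strategy. The following compact Python sketch (our own illustration, not code from the paper; the strategy labels follow the acronyms used in the results section) simply enumerates the (data split, random initialization) pairs implied by the design above, so that every compared method trains exactly 50 networks:

```python
# Allocation of the 50-model budget across the compared strategies (Sec. IV.A).
def ensemble_plan(strategy, total_models=50):
    if strategy == "2FOLDCV":            # 2 folds x 25 random initializations
        return [("fold", f, "init", i) for f in range(2) for i in range(25)]
    if strategy == "10FOLDCV":           # 10 folds x 5 random initializations
        return [("fold", f, "init", i) for f in range(10) for i in range(5)]
    if strategy == "MONTECV":            # 50 random splits x 1 initialization
        return [("split", s, "init", 0) for s in range(total_models)]
    if strategy == "BAG":                # 50 bootstrap replicates
        return [("bootstrap", b, "init", 0) for b in range(total_models)]
    if strategy == "HOLDOUT":            # 1 split x 50 random initializations
        return [("split", 0, "init", i) for i in range(total_models)]
    raise ValueError(strategy)

assert all(len(ensemble_plan(s)) == 50
           for s in ["2FOLDCV", "10FOLDCV", "MONTECV", "BAG", "HOLDOUT"])
```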
B. Dataset

In order to provide empirical evidence across a large number of time series, we utilize the time series data from the NN3 competition [28]. The complete dataset of 111 time series of the NN3 dataset was chosen, containing between 68 and 144 observations per series. The dataset consists of a representative set of long and short monthly time series drawn from a homogeneous population of empirical business time series. Fig. 1 shows six time series from the NN3 competition dataset. As illustrated, the time series contain both seasonal and non-seasonal patterns, with only minor trends and different time series lengths. To allow a valid comparison of the forecast accuracy of the proposed Crogging methods to those originally participating in the NN3 competition, we perform multiple-step-ahead forecasting using the iterative method, forecasting 18 months into the future from a single fixed origin. As a result, 18 examples are designated for the hold-out test set while the remainder is used for training.

Fig. 1. Six time series (NN3_101–NN3_106) of the NN3 Competition dataset.

The size of the single validation set is set to 14 to ensure consistency with the hold-out and Monte-Carlo cross-validation setup. The size of the validation set in k-fold cross-validation is determined by the value of k.

C. Error Metrics

We calculate the mean absolute scaled error (MASE) and the symmetric mean absolute percentage error (SMAPE) for all methods in assessing forecast accuracy and performance. For a given actual value $y_t$, a forecast $\hat{y}_t$ made for period $t$, and $H$ the number of observations forecasted by the respective forecasting method, the SMAPE is calculated as follows:

$$\text{SMAPE} = \frac{1}{H}\sum_{t=1}^{H}\frac{|y_t - \hat{y}_t|}{\left(|y_t| + |\hat{y}_t|\right)/2} \times 100\% \quad (1)$$

Hyndman and Koehler propose the use of the MASE to overcome several degenerate problems associated with the MAE and sMAPE, and because it is less sensitive to outliers and more easily interpreted than other scaled error measures [29]. The MASE is used to compare across all time series and forecast methods and is defined by:

$$\text{MASE} = \frac{1}{H}\sum_{t=N+1}^{N+H}\frac{|y_t - \hat{y}_t|}{\frac{1}{N-1}\sum_{i=2}^{N}|y_i - y_{i-1}|} \quad (2)$$

where N is the number of observations in the training set and H is the number of values being forecasted in the out-of-sample test set. The SMAPE and MASE are then averaged over all time series in the dataset to produce the mean SMAPE and mean MASE respectively.
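For reference, a short Python sketch of the two error measures as reconstructed in Eq. (1) and Eq. (2) (our own helper names; the toy numbers in the usage example are purely illustrative and not taken from the paper):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent, per Eq. (1)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return 100.0 * np.mean(np.abs(actual - forecast) / denom)

def mase(train, actual, forecast):
    """Mean absolute scaled error, per Eq. (2): out-of-sample MAE scaled by
    the in-sample MAE of the naive one-step forecast on the training data."""
    train = np.asarray(train, float)
    scale = np.mean(np.abs(np.diff(train)))  # (1/(N-1)) * sum |y_i - y_{i-1}|
    return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))) / scale

# Toy usage on an 18-step test window (illustrative numbers only)
y_train = np.arange(1, 101, dtype=float)
y_test = np.arange(101, 119, dtype=float)
y_hat = y_test + 2.0
print(round(smape(y_test, y_hat), 2), round(mase(y_train, y_test, y_hat), 2))
```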
D. Specification of the Neural Networks

The base model used is a univariate Multilayer Perceptron (MLP). MLPs are well researched and their ability to approximate and generalize any linear and nonlinear functional relationship to an arbitrary degree of accuracy has been proven in time series prediction [30]. They are also viewed as benefiting from model combination approaches due to their learning instability and the large number of factors or degrees of freedom affecting neural network training [4], [26]. The functional form of these networks is given by:

$$f(\mathbf{x}, \mathbf{w}) = \beta_0 + \sum_{h=1}^{H}\beta_h \, g\!\left(\gamma_{0h} + \sum_{i=1}^{I}\gamma_{ih} x_i\right) \quad (3)$$

and describes a single-hidden-layer MLP characterized by its input vector $\mathbf{x} = [x_1, \ldots, x_I]$, which captures the lagged observations of the time series in $I$ input nodes, its number of hidden nodes $H$, and a single output node. We set $I = 13$, which captures lags up to $t-13$. This is sufficient to model the monthly (stochastic) seasonality of a seasonal AR(12) process in addition to trends (i.e. an I(1) process). All data is pre-processed using linear scaling into the interval of [-0.5, 0.5] and each time series is modelled directly without prior differencing or further data transformation. Level, trend and seasonality are estimated directly in the model weights.

Each MLP network contains a single hidden layer with two hidden nodes using the hyperbolic tangent transfer function [31], and a single output node with a linear identity function. The MLP is trained using the Levenberg-Marquardt algorithm with a maximum of 1000 epochs. An early stopping criterion is employed which stops the network training if the validation error increases or remains the same for more than 50 epochs. Additionally, network training stops if the adaptive damping value of the Levenberg-Marquardt algorithm exceeds 1e10. The network weights giving the lowest validation error during training are used in order to reduce overfitting to the data. All networks are trained using early stopping on S_Valid. Alternatively one could consider training using only S_Train with regularization, or forcing overfitting for diversity, but better results were obtained using the former approach. For all neural networks we employ random weight initialization. This means that in creating each new model, we randomly initialize the starting weights of each neural network, allowing different solutions of the network to be achieved, in addition to the randomness introduced by the cross-validation and bootstrap procedures. In all cases, we combine a total of 50 models to allow for a fair comparison of the different methods; any differences should not be due to the number of models included in the final combination.
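For illustration only, a rough Keras-based sketch of a comparably configured base learner (our own construction: Keras does not provide Levenberg-Marquardt, so a generic gradient-based optimizer is substituted, and all function names are ours). It shows the 13-lag input, the 2-node tanh hidden layer, the linear output node, the [-0.5, 0.5] scaling, and early stopping on the supplied validation set with restoration of the best weights:

```python
import numpy as np
import tensorflow as tf

def make_lagged(y, n_lags=13):
    """Build (X, y) pairs of 13 lagged inputs and one target per time step."""
    X = np.array([y[t - n_lags:t] for t in range(n_lags, len(y))])
    return X, y[n_lags:]

def scale(y):
    """Linear scaling of the series into [-0.5, 0.5]."""
    lo, hi = y.min(), y.max()
    return (y - lo) / (hi - lo) - 0.5

def train_mlp(X_tr, y_tr, X_va, y_va):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X_tr.shape[1],)),
        tf.keras.layers.Dense(2, activation="tanh"),
        tf.keras.layers.Dense(1, activation="linear"),
    ])
    # Levenberg-Marquardt is not available here; Adam is used as a stand-in.
    model.compile(optimizer="adam", loss="mse")
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                            restore_best_weights=True)
    model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
              epochs=1000, callbacks=[stop], verbose=0)
    return model
```

Passed as the train_fn argument of the crogging_forecast sketch above, this yields one possible end-to-end illustration of the experimental pipeline under the stated assumptions.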
V. EXPERIMENTAL RESULTS

A. Results Across all 111 Time Series

The results for the competing methods are summarized in Table I and Table II for the MASE and SMAPE respectively. The error measures yield slightly different results, with the 10-fold cross-validation (10FOLDCV) method having the lowest mean MASE (1.07) on the test set, and 2-fold cross-validation (2FOLDCV) having the lowest mean SMAPE (15.29). Some consistent patterns however occur across error measures. Most notably, all Crogging methods of cross-validation aggregation, MONTECV, 10FOLDCV and 2FOLDCV, generate smaller forecast errors than the standard hold-out method (HOLDOUT), which only averages over a single validation set and is the most widely used approach to creating MLP ensembles. This indicates a general improvement in forecast accuracy from multiple splitting of the learning set into either random or mutually exclusive subsets, with significant improvements over the benchmark model averaging approach. In addition, all Crogging variants outperform the benchmark Bagging algorithm, which has an out-of-sample forecast error (MASE = 1.21, SMAPE = 16.32) slightly larger than the HOLDOUT method (MASE = 1.20, SMAPE = 16.08). These findings are also consistent when errors on the validation dataset are considered, indicating that these comparative results are not subject to overfitting on the validation set.

TABLE I
AVERAGE MASE ON TRAINING, VALIDATION AND TEST DATASET ACROSS ALL TIME SERIES

Method | Train | Validation | Test
BESTMLP | 0.67 | 0.60 | 1.50
HOLDOUT | 0.64 | 0.75 | 1.20
BAG | 0.76 | 0.70 | 1.21
MONTECV | 0.76 | 0.41 | 1.16
10FOLDCV | 0.69 | 0.45 | 1.07
2FOLDCV | 0.73 | 0.60 | 1.15

TABLE II
AVERAGE SMAPE ON TRAINING, VALIDATION AND TEST DATASET ACROSS ALL TIME SERIES

Method | Train | Validation | Test
BESTMLP | 12.36 | 11.10 | 17.89
HOLDOUT | 11.78 | 12.57 | 16.08
BAG | 12.95 | 13.17 | 16.32
MONTECV | 13.81 | 8.29 | 15.35
10FOLDCV | 12.65 | 8.94 | 15.52
2FOLDCV | 13.68 | 11.19 | 15.29

As would be expected, all combination methods outperform model selection, that is, the best MLP (BESTMLP) method which runs 50 randomly initialized MLPs and selects the MLP with the smallest error on the validation set. While BESTMLP performs well on the training and validation set relative to other methods, for example Bagging and 2FOLDCV, it produces the highest forecast errors on the test set. This is an indication of overfitting of the individual MLP models to the validation set, and the poor performance on the test set is explained by the resulting instability in the model selection process when selecting the model which minimizes the validation set MSE. The selected model is not robust to changes in the time series out-of-sample on the test set.

Table III shows the average MASE and SMAPE, and the standard deviation and coefficient of variation of the distribution of the MASE and SMAPE across all time series. Results for both error measures show that model averaging results in a lower standard deviation (SD) of forecast errors across time series when compared to model selection, with the 2FOLDCV method having the lowest standard deviation, reflecting a more robust performance across all time series. The coefficient of variation (CoeVAR) of the distribution of both the MASE and SMAPE across the time series also supports the observation that the performance of the 2FOLDCV method is most robust across time series. A plot of the distribution across all time series of the SMAPE, and in particular the MASE, as shown in Fig. 2, further shows that the 2FOLDCV and MONTECV methods produce lower variation in the forecast error and in the standard deviation of forecast errors relative to other methods, in particular BESTMLP which has the largest variation.

TABLE III
AVERAGE MASE AND SMAPE, STANDARD DEVIATION AND COEFFICIENT OF VARIATION ON TEST SET ACROSS ALL TIME SERIES

Method | mean MASE | mean SD_MASE | mean CoeVAR_MASE | mean SMAPE (%) | mean SD_SMAPE | mean CoeVAR_SMAPE
BESTMLP | 1.50 | 1.06 | 0.75 | 17.89 | 12.81 | 0.74
HOLDOUT | 1.20 | 0.80 | 0.72 | 16.08 | 11.59 | 0.73
BAG | 1.21 | 0.82 | 0.73 | 16.32 | 11.51 | 0.72
MONTECV | 1.16 | 0.79 | 0.73 | 15.35 | 11.40 | 0.74
10FOLDCV | 1.07 | 0.80 | 0.76 | 15.52 | 12.04 | 0.77
2FOLDCV | 1.15 | 0.78 | 0.71 | 15.29 | 11.12 | 0.73

Fig. 2 also shows that 2FOLDCV has the lowest median MASE and standard deviation of the MASE across all time series, and that, across both error measures, the CV methods produce smaller median errors and standard deviations, and lower variation in both measures, when compared to the HOLDOUT method and Bagging. This gives further evidence that the improvement in accuracy is due to the manner in which cross-validation introduces diversity through data splitting rather than through bootstrap resampling. A factor which is likely to impact the performance of cross-validation is the length of the time series, which determines the amount of data available in the learning set and consequently the number of observations available for training and validation in each cross-validation split.
B. Results by Time Series Data Conditions

Table IV shows the forecast accuracy measured using the SMAPE averaged across short, medium and long forecast horizons, for time series categorized as long and short [28]. We present only the results for the SMAPE as these are consistent, in this case, with those of the MASE. It can be observed that on long time series 10FOLDCV has the smallest SMAPE for medium to long horizons, and over the full forecast lead time 1-18. In contrast, 2FOLDCV and MONTECV both outperform 10FOLDCV on short time series across all forecast horizons.

Fig. 2. Boxplots of the MASE and SMAPE (top) and standard deviation of the MASE and SMAPE (bottom), averaged over all forecast horizons and obtained across all time series for the different methods. The line of reference represents the median value of the distributions.

TABLE IV
SMAPE FOR TEST SET ACROSS SHORT, MEDIUM AND LONG FORECAST HORIZONS

Length | Method | 1-3 | 4-12 | 13-18 | 1-18
Long | BESTMLP | 10.79 | 16.59 | 20.02 | 16.77
Long | HOLDOUT | 9.34 | 14.96 | 16.20 | 14.43
Long | BAG | 9.74 | 15.46 | 16.38 | 14.81
Long | MONTECV | 10.86 | 15.16 | 15.69 | 14.64
Long | 10FOLDCV | 10.39 | 14.04 | 14.82 | 13.69
Long | 2FOLDCV | 9.03 | 15.43 | 14.54 | 14.06
Short | BESTMLP | 16.83 | 17.03 | 20.66 | 18.20
Short | HOLDOUT | 17.59 | 17.04 | 20.12 | 18.16
Short | BAG | 17.20 | 17.27 | 20.96 | 18.49
Short | MONTECV | 15.47 | 14.71 | 19.05 | 16.28
Short | 10FOLDCV | 16.00 | 15.91 | 20.25 | 17.37
Short | 2FOLDCV | 15.86 | 14.51 | 18.95 | 16.21

Forecast horizons: 1-3 = short, 4-12 = medium, 13-18 = long.

The performance of 2FOLDCV and MONTECV reflects an advantage of both methods, namely the increase in the length of both the training and the validation data. Because 2-fold cross-validation generates only 2 folds of equal size, the training and validation sets are both large. Likewise, an advantage of MONTECV is that the proportion of examples in the training and validation set is not dependent on the number of folds. This decoupling of the number of splits from the size of the training/validation split results in larger validation sets. The availability of sufficient data for training is particularly important where the time series is short. This is reflected in Fig. 3, which shows the distribution of the SMAPE for short and long time series. For short series, the increased size of the training and validation set from using 2FOLDCV and MONTECV results in better training of the network and, as the results suggest, improved forecast accuracy. When sufficient data is available for training and validation, the increase in the number of folds from 2 to 10 results in improved forecast accuracy (see Fig. 3 – right).

C. Relative Ranking on NN3 Results

Table V reports the results obtained by the first eight participants of the NN3 competition, the top five methods of this study, the benchmark neural network model of the competition (AutomatANN), and the single MLP used in this study. In keeping with the report format of the competition, we report rankings first according to the SMAPE and then to the MASE. Among the computational intelligence (NN/CI) methods, 2FOLDCV and MONTECV rank 2nd and 3rd respectively behind Illies, and 4th and 5th overall among all methods. This reflects rather good performance by the proposed cross-validation combination methods relative to the methods used in the competition; in the case of the MASE, the 10FOLDCV method ranks 1st among computational intelligence methods and 1st among all methods. An advantage of the methods proposed here, based on cross-validation and bootstrapping, is their simplicity compared to other methods. This includes the approach of Illies et al. (C27), which is based on a combination of time series clustering, decomposition and recurrent Echo State Networks (ESN), and the method of Flores et al. (C03), which uses a self-adaptive genetic algorithm to determine the terms of a seasonal ARIMA (p,d,q)(P,D,Q) model.
VI. CONCLUSION

Current approaches to model averaging with neural networks which are based on data sampling use either a single training set, which is then the original learning set, or bootstrapping to generate multiple training sets through resampling of the original learning set. Where a single training set is used, model diversity is generated through multiple random initializations of the neural network weights, and where bootstrapping is employed, model diversity comes from the randomly sampled training data to which neural network training is sensitive. This paper proposes the use of cross-validation data splitting for model averaging, and assesses different forms of cross-validation for creating model diversity. In this case, the set of candidate models is trained on different splits of the training data while simultaneously reducing overfitting of the neural network models through early stopping on different training-validation set pairs. This approach proves to be a very promising alternative to the current strategies of neural network model averaging, Bagging and model selection.

Fig. 3. Boxplots of the SMAPE averaged over all forecast horizons and obtained across short (left) and long (right) time series for the different methods. The line of reference represents the median value of the distributions.

TABLE V
AVERAGE ERRORS AND RANKS OF ERRORS ACROSS ALL TIME SERIES OF THE NN3 COMPETITION

ID | Method | SMAPE | MASE | Rank all (SMAPE) | Rank all (MASE) | Rank NN/CI (SMAPE) | Rank NN/CI (MASE)
B09 | Wildi | 14.84 | 1.13 | 1 | 2 | − | −
B07 | Theta | 14.89 | 1.13 | 2 | 2 | − | −
C27 | Illies | 15.18 | 1.25 | 3 | 9 | 1 | 7
** | 2FOLDCV | 15.29 | 1.15 | 4 | 3 | 2 | 2
** | MONTECV | 15.35 | 1.16 | 5 | 4 | 3 | 3
B03 | ForecastPro | 15.44 | 1.17 | 6 | 5 | − | −
** | 10FOLDCV | 15.52 | 1.07 | 7 | 1 | 4 | 1
B16 | DES | 15.90 | 1.17 | 8 | 5 | − | −
B17 | Comb S-H-D | 15.93 | 1.21 | 9 | 8 | − | −
B05 | Autobox | 15.95 | 1.18 | 10 | 6 | − | −
** | HOLDOUT | 16.08 | 1.20 | 11 | 7 | 5 | 4
C03 | Flores | 16.31 | 1.20 | 12 | 7 | 6 | 4
** | BAG | 16.32 | 1.21 | 13 | 8 | 7 | 5
B00 | AutomatANN | 16.81 | 1.21 | 14 | 8 | 8 | 5
** | MLP | 17.89 | 1.50 | 15 | 10 | 9 | 6

** denotes methods of this study.

REFERENCES

[1] J. M. Bates and C. W. J. Granger, "Combination of Forecasts," Operational Research Quarterly, vol. 20, pp. 451-468, 1969.
[2] P. Newbold and C. W. J. Granger, "Experience with Forecasting Univariate Time Series and the Combination of Forecasts," Journal of the Royal Statistical Society, Series A, vol. 137, pp. 131-165, 1974.
[3] S. Crone. (2007, 20/08/2009). NN3 Results. Available: http://www.neural-forecasting-competition.com/NN3/results.htm
[4] L. Breiman, "Heuristics of instability and stabilization in model selection," Annals of Statistics, vol. 24, pp. 2350-2383, 1996.
[5] B. Efron, "1977 Rietz Lecture - Bootstrap Methods - Another Look at the Jackknife," Annals of Statistics, vol. 7, pp. 1-26, 1979.
[6] B. Efron, "Estimating the Error Rate of a Prediction Rule - Improvement on Cross-Validation," Journal of the American Statistical Association, vol. 78, pp. 316-331, 1983.
[7] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, Aug 1996.
[8] S. Arlot and A. Celisse, "A survey of cross-validation procedures for model selection," Statistics Surveys, vol. 4, pp. 40-79, 2010.
[9] L. J. Tashman, "Out-of-sample tests of forecasting accuracy: an analysis and review," International Journal of Forecasting, vol. 16, pp. 437-450, Oct-Dec 2000.
[10] T. E. Clark, "Can out-of-sample forecast comparisons help prevent overfitting?," Journal of Forecasting, vol. 23, pp. 115-139, Mar 2004.
[11] J. Michaelsen, "Cross-Validation in Statistical Climate Forecast Models," Journal of Climate and Applied Meteorology, vol. 26, pp. 1589-1600, Nov 1987.
[12] C. C. P. Wolff, "Time-Varying Parameters and the Out-of-Sample Forecasting Performance of Structural Exchange-Rate Models," Journal of Business & Economic Statistics, vol. 5, pp. 87-97, Jan 1987.
[13] R. H. Clarida, L. Sarno, M. P. Taylor, and G. Valente, "The out-of-sample success of term structure models as exchange rate predictors: a step beyond," Journal of International Economics, vol. 60, pp. 61-83, May 2003.
[14] M. Y. Hu, G. Q. Zhang, C. Z. Jiang, and B. E. Patuwo, "A cross-validation analysis of neural network out-of-sample performance in exchange rate forecasting," Decision Sciences, vol. 30, pp. 197-216, 1999.
[15] H. R. Kunsch, "The Jackknife and the Bootstrap for General Stationary Observations," Annals of Statistics, vol. 17, pp. 1217-1241, Sep 1989.
[16] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993.
[17] S. Makridakis and R. L. Winkler, "Averages of Forecasts - Some Empirical Results," Management Science, vol. 29, pp. 987-996, 1983.
[18] S. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, et al., "The Accuracy of Extrapolation (Time Series) Methods - Results of a Forecasting Competition," Journal of Forecasting, vol. 1, pp. 111-153, 1982.
[19] F. C. Palm and A. Zellner, "To combine or not to combine - Issues of combining forecasts," Journal of Forecasting, vol. 11, pp. 687-701, Dec 1992.
[20] S. Makridakis and M. Hibon, "The M3-Competition: results, conclusions and implications," International Journal of Forecasting, vol. 16, pp. 451-476, Oct-Dec 2000.
[21] G. Elliott and A. Timmermann, "Optimal forecast combinations under general loss functions and forecast error distributions," Journal of Econometrics, vol. 122, pp. 47-79, Sep 2004.
[22] M. Assaad, R. Bone, and H. Cardot, "A new boosting algorithm for improved time-series forecasting with recurrent neural networks," Information Fusion, vol. 9, pp. 41-55, Jan 2008.
[23] V. R. R. Jose and R. L. Winkler, "Simple robust averages of forecasts: Some empirical results," International Journal of Forecasting, vol. 24, pp. 163-169, Jan-Mar 2008.
[24] B. Efron and R. Tibshirani, "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, vol. 92, pp. 548-560, 1997.
[25] L. K. Hansen and P. Salamon, "Neural Network Ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, Oct 1990.
[26] G. P. Zhang and V. L. Berardi, "Time series forecasting with neural network ensembles: an application for exchange rate prediction," Journal of the Operational Research Society, vol. 52, pp. 652-664, Jun 2001.
[27] U. Naftaly, N. Intrator, and D. Horn, "Optimal ensemble averaging of neural networks," Network: Computation in Neural Systems, vol. 8, pp. 283-296, Aug 1997.
[28] S. F. Crone, M. Hibon, and K. Nikolopoulos, "Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction," International Journal of Forecasting, vol. 27, pp. 635-660, 2011.
[29] R. J. Hyndman and A. B. Koehler, "Another look at measures of forecast accuracy," International Journal of Forecasting, vol. 22, pp. 679-688, 2006.
[30] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991.
Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991. [31] G. Zhang, B. E. Patuwo, and M. Y. Hu, "Forecasting with artificial neural networks: The state of the art," International Journal of Forecasting, vol. 14, pp. 35-62, 1998/3/1 1998.