IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 3, MARCH 2011 337

Lower Upper Bound Estimation Method for Construction of Neural Network-Based Prediction Intervals

Abbas Khosravi, Member, IEEE, Saeid Nahavandi, Senior Member, IEEE, Doug Creighton, and Amir F. Atiya, Senior Member, IEEE

Abstract— Prediction intervals (PIs) have been proposed in the literature to provide more information by quantifying the level of uncertainty associated with point forecasts. Traditional methods for construction of neural network (NN)-based PIs suffer from restrictive assumptions about data distribution and massive computational loads. In this paper, we propose a new, fast, yet reliable method for the construction of PIs for NN predictions. The proposed lower upper bound estimation (LUBE) method constructs an NN with two outputs for estimating the prediction interval bounds. NN training is achieved through the minimization of a proposed PI-based objective function, which covers both interval width and coverage probability. The method does not require any information about the upper and lower bounds of PIs for training the NN. The simulated annealing method is applied for minimization of the cost function and adjustment of NN parameters. The results for 10 benchmark regression case studies clearly show that the LUBE method is capable of generating high-quality PIs in a short time. Also, a quantitative comparison with three traditional techniques for prediction interval construction reveals that the LUBE method is simpler, faster, and more reliable.

Index Terms— Neural network, prediction interval, simulated annealing, uncertainty.

I. INTRODUCTION

THERE are numerous reports discussing the successful application of neural networks (NNs) in prediction and regression problems. However, there is a belief that NN point predictions are of limited value where there is uncertainty in the data or variability in the underlying system.
Examples of such systems are transportation networks [1], manufacturing enterprises [2], and material handling facilities [3]. Statistically, the NN output approximates the average of the underlying target conditioned on the NN input vector [4]. If the target is multivalued, the NN conditional averaged output can be far from the actual target, and is therefore unreliable. Furthermore, NN point predictions convey no information about the sampling errors and the prediction accuracy. Incorporating the prediction uncertainty into the deterministic approximation generated by NNs improves the reliability and credibility of the predictions [5]. The semiconductor industry [2], surface mount manufacturing [6], electricity load forecasting [7], [8], fatigue lifetime prediction [9], financial services [10], hydrologic case studies [5], transportation [1], and baggage handling systems [3], to name a few, are examples discussing this problem in different domains. Confidence intervals (CIs) and prediction intervals (PIs) are two well-known tools for quantifying and representing the uncertainty of predictions.

Manuscript received May 30, 2010; revised November 28, 2010; accepted November 28, 2010. Date of publication December 23, 2010; date of current version March 2, 2011. This research was supported in part by the Center for Intelligent Systems Research at Deakin University. A. Khosravi, S. Nahavandi, and D. Creighton are with the Center for Intelligent Systems Research, Deakin University, Geelong, Victoria 3117, Australia (e-mail: abbas.khosravi@deakin.edu.au; saeid.nahavandi@deakin.edu.au; douglas.creighton@deakin.edu.au). A. F. Atiya is with the Department of Computer Engineering, Cairo University, Cairo 12613, Egypt (e-mail: amir@alumni.caltech.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2096824
While a CI describes the uncertainty in the prediction of an unknown but fixed value, a PI deals with the uncertainty in the prediction of a future realization of a random variable [11]. By definition, a PI accounts for more sources of uncertainty (model misspecification and noise variance) and is wider than the corresponding CI [12]. Unfortunately, many authors wrongly interchange these terminologies [13].

In the literature, several methods have been proposed for construction of PIs and assessment of the NN outcome uncertainty. Chryssolouris et al. [14] and Hwang and Ding [15] developed the delta technique through the nonlinear regression representation of NNs. The method first linearizes the NN model around a set of parameters obtained through minimization of the sum of squared errors cost function. Then, standard asymptotic theory is applied to the linearized model for constructing PIs [16]. Intervals are constructed under the assumption that the noise is homogeneous and normally distributed. As the noise is heterogeneous in many real-world case studies, the constructed intervals can be misleading [17]. De Veaux et al. [18] extended the delta method to those cases in which NNs are trained using the weight decay cost function instead of the sum of squared errors. Although the generalization power of NN models is improved using this method, the PIs suffer from the fundamental limitation of the delta technique (linearization). Despite this weakness, the delta method has been applied in many synthetic and real case studies [3], [6], [19], [20].

The Bayesian technique is another method for construction of NN-based PIs [21]. Training NNs using the Bayesian technique allows error bars to be assigned to the predicted values of the network [4]. Despite the strength of the supporting theories, the method suffers from a massive computational burden, and requires calculation of the Hessian matrix of the cost function for construction of PIs.

Bootstrap is one of the most frequently used techniques in the literature for construction of PIs for NN point forecasts [12], [22]–[26]. The main advantage of this method is its simplicity and ease of implementation. It does not require calculation of the complex derivatives and the Hessian matrix involved in the delta and Bayesian techniques. It has been claimed that the bootstrap method generates more reliable PIs than other methods [22]. The main disadvantage of this method is its computational cost for large datasets [6].

A mean-variance estimation-based method for PI construction has also been proposed by Nix and Weigend [27]. The method uses an NN to estimate the characteristics of the conditional target distribution. Additive Gaussian noise with nonconstant variance is the key assumption of the method for PI construction. Compared to the aforementioned techniques, the computational burden of this method in the training and utilization stages is negligible. However, this method underestimates the variance of the data, leading to a low empirical coverage probability, as discussed in [17] and [22].

The four methods identified above have been used for construction of PIs in the literature. Regardless of their implementation differences, they share one methodological principle. NNs are first trained through minimization of an error-based cost function, such as the sum of squared errors or weight decay cost function. Then, PIs are constructed for the outcomes of these trained NNs. The central argument in this paper is that the quality of PIs constructed in this way is questionable. In all traditional PI construction methods, the main strategy is to minimize the prediction error, instead of trying to improve the PI quality. The constructed PIs, therefore, are not guaranteed to be optimal in terms of their key characteristics, i.e., width and coverage probability.

1045–9227/$26.00 © 2010 IEEE
If optimality of PIs is the main concern in the process of PI construction, the NN development procedure should be revised to directly address the characteristics of PIs.

Another critical, yet often ignored issue related to PIs is that the literature is void of information regarding the quantitative and objective assessment of PIs. The main focus of the literature is on the methodologies for the construction of PIs. Quantitative examination of the quality of developed PIs (and also CIs) is often ignored or represented subjectively and ambiguously [1], [6], [7], [9], [15], [25], [27], [28]. In comparative studies of CI and PI construction methods, the coverage probability has been considered as the only criterion for assessing the quality of the constructed intervals [29]. We argue that the coverage probability by itself does not completely describe all characteristics of the intervals. A 100% coverage probability can be easily achieved by assigning sufficiently large and small values to the upper and lower bounds of PIs. To the best of our knowledge, there exist only a few papers that quantitatively evaluate the quality of constructed PIs in terms of their key characteristics [3], [5], [8].

The main objective of this paper is to propose a new method for construction of PIs using NN models. One goal of the proposed lower upper bound estimation (LUBE) method is to avoid the calculation of derivatives of the NN output with respect to its parameters. As indicated by Dybowski and Roberts [22], these derivatives can be a source of unreliability of the PIs constructed by the delta technique. Also, the NN training and development process is accomplished through improvement of the quality of PIs. While not directly performing an error minimization, the proposed technique aims at producing a narrow PI bracketing the prediction, thereby also achieving accurate predictions.
This aspect of LUBE is distinct from the common practice for construction of PIs using traditional techniques. A new PI-based cost function using the quantitative measures proposed in [3] is developed. The new cost function simultaneously examines PIs based on their width and coverage probability. An NN model is considered for approximating the upper and lower bounds of PIs. The parameters of this NN are adjusted through minimization of the proposed PI-based cost function. As the cost function is highly nonlinear, complex, and discontinuous, a simulated annealing (SA) method is implemented for its minimization. The effectiveness of the LUBE method is examined using synthetic and real-world case studies. It is shown that the proposed method builds narrow PIs with a high coverage probability. The performance of the proposed method is also compared with the Bayesian and delta techniques. The simulation results show that the quality of PIs constructed using the new technique is superior to that of the Bayesian and delta-based PIs.

The rest of this paper is organized as follows. Section II describes the proposed PI-based cost function. The new NN LUBE method for construction of PIs is presented in Section III. Experimental results are given in Section IV. Finally, Section V concludes this paper with some remarks for further study in this domain.

II. PI-BASED COST FUNCTION

The cornerstone of the proposed method for construction of optimal PIs is a new PI-based cost function. This cost function is later used for training and development of NNs. In order to derive the cost function, we first need to define measures for the quantitative assessment of PIs. By definition, future observations are expected to lie within the bounds of PIs with a prescribed probability called the confidence level, (1 − α)%. It is expected that the coverage probability of PIs will asymptotically approach the nominal level of confidence.
According to this, the PI coverage probability (PICP) is the natural measure of the quality of the constructed PIs [3], [5], [8]:

PICP = \frac{1}{n} \sum_{i=1}^{n} c_i    (1)

where c_i = 1 if y_i \in [L(X_i), U(X_i)], and c_i = 0 otherwise. L(X_i) and U(X_i) are the lower and upper bounds of the i-th PI, respectively. If the empirical PICP is much less than its nominal value, the first conclusion is that the constructed PIs are not at all reliable. This measure has been reported in almost all studies related to PIs as an indication of how good the constructed PIs are [1], [3], [5]–[7], [9], [15], [25], [29].

If the extreme values of the targets are considered as the upper and lower bounds of all PIs, the corresponding PICP will be perfect (100% coverage). Practically, PIs that are too wide are of no value, as they convey no information about the target variation. Another measure, therefore, is required for quantifying the width of PIs. The mean prediction interval width (MPIW) is defined as follows [3], [5]:

MPIW = \frac{1}{n} \sum_{i=1}^{n} \left( U(X_i) - L(X_i) \right)    (2)

where U(X_i) and L(X_i) represent the upper and lower bounds of the PI corresponding to the i-th sample. If the range of the target is known, MPIW can be normalized for objective comparison of PIs developed using different methods. The normalized MPIW (NMPIW) is given below [3], [8]:

NMPIW = \frac{MPIW}{R}    (3)

where R is the range of the underlying target. NMPIW is a dimensionless measure representing the average width of PIs as a percentage of the underlying target range.

In the case of using the extreme target values as the upper and lower bounds of PIs, both NMPIW and PICP will be 100%. This indicates that PICP and NMPIW have a direct relationship. Under equal conditions, a larger NMPIW will usually result in a higher PICP. From a practical standpoint, it is important to have narrow PIs (small NMPIW) with a high coverage probability (large PICP). Theoretically, these two goals are conflicting. Reducing the width of PIs often results in a decrease in PICP, due to some observations dropping out of the PIs. To quantitatively represent and measure this tradeoff, a combinational measure/index is required that carries information about how wide the PIs are and what their coverage probability is. As PICP is theoretically the fundamental feature of the PIs, the proposed measure should give more weight to the variation of PICP. Put in other words, the index should be large for those cases where PICP is less than its corresponding nominal confidence, (1 − α)%. The following combinational coverage width-based criterion (CWC) addresses all these issues for evaluation of the PIs:

CWC = NMPIW \left( 1 + \gamma(PICP) \, e^{-\eta (PICP - \mu)} \right).    (4)

Fig. 1. CWC for different values of η (η = 10, 100, 300), plotted against PICP; the vertical axis is the logarithm of CWC.

Assume for now that γ(PICP) = 1. The constants η and μ are two hyperparameters determining how much penalty is assigned to PIs with a low coverage probability. μ corresponds to the nominal confidence level associated with the PIs and can be set to 1 − α. The role of η is to magnify any small difference between PICP and μ. Usually, it should be selected to have a large value. The rationale for using such an asymmetric criterion (with regard to PICP) is that a PICP lower than μ will give a misleadingly optimistic PI, and should be penalized more (than a wider PI is rewarded). This is a reasonably conservative strategy. Applying such a strategy leads us to the following observation. Based on the CWC definition in (4), for PICP < μ we are operating in the steep and high portion of the exponential term. Accordingly, such solutions are highly penalized due to the dominance of the PICP term, and rightfully so.
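The measures defined above are straightforward to compute. The following sketch (ours, not the authors' implementation) collects PICP, NMPIW, and CWC in one function, with a flag for the training-mode penalty (γ = 1) versus the evaluation-mode step function, and then illustrates the asymmetry of the exponential penalty around μ:

```python
import numpy as np

def pi_metrics(y, lower, upper, alpha=0.1, eta=50.0, training=True):
    """PICP, NMPIW, and CWC of Eqs. (1)-(4); a sketch, not the authors' code."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    mu = 1.0 - alpha                                  # nominal confidence level

    # Eq. (1): coverage probability -- fraction of targets inside their PI
    picp = np.mean((y >= lower) & (y <= upper))

    # Eqs. (2)-(3): mean PI width, normalized by the target range R
    nmpiw = np.mean(upper - lower) / np.ptp(y)

    # Eq. (4): gamma = 1 during training; for evaluation, gamma is the step
    # function that switches the penalty off once PICP >= mu
    gamma = 1.0 if training else float(picp < mu)
    cwc = nmpiw * (1.0 + gamma * np.exp(-eta * (picp - mu)))
    return picp, nmpiw, cwc

# Asymmetry of the penalty: with NMPIW = 40% (as in Fig. 1), a PICP two points
# below mu = 0.9 inflates CWC far more than two points above mu reduce it.
for picp in (0.88, 0.90, 0.92):
    print(picp, 0.40 * (1.0 + np.exp(-50.0 * (picp - 0.90))))
```

With η = 50, the three printed CWC values are roughly 1.49, 0.80, and 0.55, which makes the conservative bias of the criterion concrete: undercoverage is punished much harder than extra width.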
When PICP is greater than but close to μ (PICP ≥ μ), there are two conflicting factors. Loosening the PIs' widths will increase PICP, thereby decreasing CWC. On the other hand, loosening the PIs' widths will increase NMPIW, and hence increase CWC. However, as PICP moves further above μ, the exponential term gradually levels off, and the NMPIW factor becomes more and more dominant in CWC. So, the algorithm will end up stopping at a PICP slightly higher than μ (for example, if we are constructing 90% PIs, it might get around 92–93%). One might then question the rationale for forgoing a few percentage points when we could get tighter PIs at a PICP exactly equal to μ. The reason is that it is better to leave some slack to allow for deviations in out-of-sample results. For example, assume μ = 90%. If we achieve a PICP exactly equal to 90% on the training set, most probably the test set will give a PICP lower than 90%, e.g., around 87% or 88%, thus violating the constraint. Violating the PI constraint (that PICP should be greater than 90%) has serious ramifications (as misleading results are obtained), whereas having a PICP a little higher than 90% means only that we are a little conservative (in having wider PIs).

When we evaluate a set of PIs, for example by measuring the CWC for a test set, there is no reason to allow for this slack. In such a case, γ(PICP) is given by the following step function:

\gamma(PICP) = \begin{cases} 0, & PICP \geq \mu \\ 1, & PICP < \mu \end{cases}    (5)

This means that the exponential term in (4) is eliminated whenever PICP ≥ μ, and CWC becomes equal to NMPIW. This allows an impartial evaluation that does not unnecessarily reward solutions giving a PICP a little higher than μ. Theoretically correct PIs are then assessed according to their widths.

Fig. 1 displays the evolution of CWC for different values of η. In the three plots, it has been assumed that NMPIW = 40%. If PICP ≥ μ, CWC is small and very close to NMPIW.
This indicates that the PICP has been above the nominal confidence level. When PICP < μ, the PICP is not satisfactory and CWC increases exponentially. In these cases, CWC is significantly greater than NMPIW. This explicitly means that the coverage probability of the constructed PIs has not been satisfactory. The location, rate, and magnitude of the CWC jump can be easily controlled by η and μ.

Traditionally, NNs are trained through minimization of error-based cost functions such as the sum of squared errors, the weight decay cost function, the Akaike information criterion, and the Bayesian information criterion [4]. Such an approach is theoretically and practically acceptable if the purpose of modeling and analysis is point prediction. If NNs are going to be used for PI construction, it is more reasonable to train them through minimization of PI-based cost functions. A set of optimal PIs constructed using NNs in this manner will have an improved quality in terms of width and coverage probability. To achieve this goal, the NN training procedure can be accomplished based on minimization of the proposed new cost function, with the CWC (γ(PICP) = 1) at its core.

III. LUBE METHOD

The cost function proposed in the previous section is now used for training an NN for constructing PIs. The structure of the proposed NN model with two outputs is shown in Fig. 2. The demonstrated NN is symbolic, and it can have any number of layers and neurons per layer.

Fig. 2. NN model developed for approximating the upper and lower bounds of PIs in the LUBE method.
The first output corresponds to the upper bound of the PI, and the second output approximates the lower bound of the PI. In the literature, PI construction techniques attempt to estimate the mean and variance of the targets for construction of PIs. In contrast to existing techniques, the proposed method tries to directly approximate the upper and lower bounds of PIs based on the set of inputs.

CWC, defined in (4), is nonlinear, complex, and nondifferentiable. Therefore, gradient descent-based algorithms cannot be applied for its minimization. Additionally, CWC is sensitive to the NN set of parameters. As gradient descent-based techniques are notorious for being trapped in local minima, their application may result in a suboptimal set of NN parameters. With regard to this discussion, stochastic gradient-free methods are the best candidates for global optimization of the PI-based cost function. The training algorithm here uses the SA optimization technique for minimization of the CWC cost function and adjustment of the NN set of parameters (w). Fig. 3 shows the procedure for training and development of the two-output NN for construction of PIs.

Fig. 3. LUBE method for PI-based training of NNs and PI construction.

The training (optimization) algorithm description is as follows.

Step 1: Data splitting. The method starts with randomly splitting the available data into a training set (Dtrain) and a test set (Dtest). If required, preprocessing of the datasets is completed in this stage.

Step 2: Initialization. An NN with two outputs (similar to the one shown in Fig. 2) is constructed. The initial parameters and weights of this network can be randomly assigned.
An alternative is to train this network using traditional learning methods, such as the Levenberg–Marquardt algorithm, to approximate the real target. The obtained NN is then used for construction of PIs for the training samples (Dtrain). PICP, NMPIW, and CWC are calculated and considered as the initial values for the training algorithm. The NN set of parameters is also recorded as the optimal set of NN parameters (wopt). The cooling temperature (T) is set to a high value (T0) to allow uphill movements in the early iterations of the optimization algorithm.

Step 3: Temperature update. The first step in the SA-based training loop is to update the cooling temperature. Different cooling schedules can be used depending on the application and data behavior. Examples are linear, geometric, and logarithmic [30], [31].

Step 4: Parameter perturbation. A new set of parameters (wnew) is obtained through random perturbation of one of the current NN parameters. The perturbation should be quite small, as the CWC cost function is sensitive to changes of the NN parameters.

TABLE I
DATASETS USED IN THE EXPERIMENTS

Case study | Target | Samples | Attributes | Reference
#1 | 5-D function with constant noise | 300 | 5 | [33], [34]
#2 | 5-D function with nonconstant noise | 500 | 5 | [29]
#3 | T90 | 272 | 3 | [3], [35]
#4 | Concrete compressive strength | 1030 | 8 | [36]
#5 | Plasma beta-carotene | 315 | 12 | [36]
#6 | Dry bulb temperature | 867 | 3 | [36]
#7 | Moisture content of raw materials | 867 | 3 | [36]
#8 | Steam pressure | 200 | 5 | [37]
#9 | Main steam temperature | 200 | 5 | [37]
#10 | Reheat steam temperature | 200 | 5 | [37]

Step 5: PI construction. PIs are constructed for the new set of NN parameters. The fitness function (CWC) is calculated as an indication of the quality of the constructed PIs. As discussed before, γ is set to 1 during training. This conservative approach is intended to avoid excessive shortening of PIs during the training stage, which may result in a low PICP for the test samples.

Step 6: Cost function evaluation.
a) If CWCnew ≤ CWCopt, wopt and CWCopt are replaced with wnew and CWCnew, respectively.
This means that the new transition in the optimization algorithm leads to higher quality PIs.
b) If CWCnew > CWCopt, a random number r is generated uniformly between zero and one (r ∈ [0, 1]). A decision about the acceptance or rejection of the new set of parameters is then made based on the Boltzmann probability factor [32]. If r ≤ e^{−(CWCnew − CWCopt)/(κT)}, wopt and CWCopt are again replaced with wnew and CWCnew, respectively. κ is the Boltzmann constant, an important parameter of the SA algorithm.
c) If neither of the above happens, the optimal set of NN parameters remains unchanged.

Step 7: Training termination. The training algorithm terminates if any of the following conditions is satisfied: the maximum number of iterations is reached; no further improvement is achieved for a specific number of consecutive iterations; a very low temperature is reached; or a very small CWC is found. Otherwise, the optimization algorithm returns to Step 4.

Step 8: PIs for test samples. Upon termination of the training, wopt is taken as the set of NN parameters, and PIs are constructed for Dtest.

One of the key features of the LUBE method for construction of PIs is its simplicity. Compared to the delta and Bayesian techniques, it does not require calculation of any derivatives, and therefore it does not suffer from the singularity problems of the Hessian matrix and its approximation [18]. Furthermore, it does not make any special assumption about the data and residual distributions. This freedom makes the method applicable to a wide range of problems without any special consideration of the data distributions. As the LUBE method uses only one NN model for constructing PIs, its online computational requirement is at least B times less than that of the bootstrap method (where B is the number of bootstrap models).
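The steps above can be condensed into a short gradient-free training loop. The sketch below is ours, not the authors' code: the tanh hidden layer, the flat parameter layout, the per-iteration cooling, and the large penalty for crossed bounds (upper below lower) are all assumptions the paper does not specify, and the acceptance test uses the standard Metropolis rule (accept an uphill move when r < e^{−Δ/(κT)}):

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_bounds(X, w, n_hid):
    """Two-output single-hidden-layer NN (Fig. 2): returns (upper, lower)."""
    n_in = X.shape[1]
    W1 = w[:n_hid * n_in].reshape(n_hid, n_in)
    b1 = w[n_hid * n_in:n_hid * (n_in + 1)]
    W2 = w[n_hid * (n_in + 1):n_hid * (n_in + 3)].reshape(2, n_hid)
    b2 = w[-2:]
    out = np.tanh(X @ W1.T + b1) @ W2.T + b2
    return out[:, 0], out[:, 1]

def cwc_cost(y, upper, lower, mu=0.9, eta=50.0):
    """Training-mode CWC (gamma = 1). Crossed bounds get a large penalty --
    an assumption added here to keep the stochastic search well-behaved."""
    if np.any(upper < lower):
        return 1e12
    picp = np.mean((y >= lower) & (y <= upper))
    nmpiw = np.mean(upper - lower) / np.ptp(y)
    return nmpiw * (1.0 + np.exp(-eta * (picp - mu)))

def lube_train(X, y, n_hid=5, T0=5.0, beta=0.9, n_iter=2000, kappa=1.0):
    n_par = n_hid * (X.shape[1] + 3) + 2
    w_opt = rng.normal(0.0, 0.5, n_par)                    # Step 2: initialization
    cwc_opt = cwc_cost(y, *nn_bounds(X, w_opt, n_hid))
    T = T0
    for _ in range(n_iter):
        T = max(beta * T, 1e-3)                            # Step 3: geometric cooling
        w_new = w_opt.copy()                               # Step 4: perturb ONE parameter
        w_new[rng.integers(n_par)] += rng.normal(0.0, 1.0)
        cwc_new = cwc_cost(y, *nn_bounds(X, w_new, n_hid))  # Step 5: evaluate PIs
        delta = cwc_new - cwc_opt                          # Step 6: Metropolis acceptance
        if delta <= 0 or rng.random() < np.exp(-delta / (kappa * T)):
            w_opt, cwc_opt = w_new, cwc_new
    return w_opt, cwc_opt                                  # Step 8: use w_opt on Dtest
```

The temperature floor of 10⁻³ is also ours; below it the loop is effectively greedy, mirroring the termination-by-low-temperature condition of Step 7.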
Focus on the quality of PIs is another important feature making the LUBE method distinct from the traditional PI construction methods. While other methods construct PIs in two steps (first performing point prediction and then constructing PIs), the LUBE method directly generates PIs in one step based on the set of NN inputs. This feature allows us to apply a mechanism, as proposed here, for directly improving the quality of PIs rather than reducing the point prediction error.

IV. EXPERIMENTAL RESULTS

This section presents the experiments conducted to evaluate the effectiveness of the LUBE method for construction of PIs. Structured in four subsections, it first describes the datasets used for carrying out the experiments (Section IV-A). The experiment methodology followed in all case studies is explained in Section IV-B. Ample care is exercised to provide each method with enough freedom to reveal its PI construction power. Assessment and optimization parameters are described in Section IV-C. Then the simulation results are presented and discussed in Section IV-D. The delta, Bayesian, bootstrap (pairs sampling [22]), and LUBE methods are compared based on the quality of the constructed PIs as well as their computational requirements.

A. Data

Experiments are carried out using 10 datasets to verify the effectiveness of the LUBE method. Table I outlines the characteristics of these datasets. The datasets represent a number of synthetic and real-world regression case studies from different domains, including mathematics, medicine, transportation, environmental studies, electronics, and manufacturing. They have a range of discrete and continuous input attributes. The first case study is a 5-D dataset consisting of 300 randomly generated data points from a highly nonlinear function. The second case study is the synthetic function described in [29], where the additive noise has a normal distribution with a nonconstant variance. Data in case study 3 come from a real-world baggage handling system. The target is to predict the travel time of 90% of the bags in the system based on the check-in gates, exit points, and workload. The fourth dataset consists of 1030 samples of concrete compressive strength, where eight inputs are used to approximate the target. The relationship between personal characteristics and dietary factors, and plasma concentrations of beta-carotene is studied in case study 5. Dry bulb temperature and moisture content of raw material for an industrial dryer are approximated using three inputs in case studies 6 and 7. Datasets in case studies 8, 9, and 10 come from a power station, where the targets are steam pressure, main steam temperature, and reheat steam temperature, respectively. All datasets are available from the authors on request.

TABLE II
PARAMETERS USED IN THE EXPERIMENTS

Parameter | Numerical value
Dtrain | 70% of all samples
Dtest | 30% of all samples
α | 0.1
η | 50
μ | 0.90
κ | 1
T0 | 5
Geometric cooling schedule | Tk+1 = βTk
Number of bootstrap models | 10

B. Experiment Methodology

Performance of PI construction methods depends on the NN structure and the training process. Therefore, it is important to give each method enough freedom and flexibility to generate the best possible results. Single hidden layer NNs are considered in the experiments conducted in this paper. For all compared methods, the optimal NN structure is found through applying a fivefold cross-validation technique, where the number of neurons is varied between 1 and 20. Experiments are repeated five times for each case study to avoid issues with the random sampling of datasets. For the LUBE method, the single-layer NNs are trained using the method proposed in Section III. Training is driven using CWC as the cost function with γ = 1. Upon completion of the training stage, PICP, NMPIW, and CWC are calculated for the test samples (Dtest), where γ(PICP) is the step function.
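The structure-selection protocol of Section IV-B (fivefold cross-validation over 1–20 hidden neurons) can be sketched as follows. The `train_and_score` callback is hypothetical, standing in for any routine that trains a model and returns its validation cost (e.g., CWC); the sketch is ours, not the authors' code:

```python
import numpy as np

def select_structure(X, y, train_and_score, n_hidden_range=range(1, 21), k=5, seed=0):
    """Pick the hidden-layer size with the best median k-fold validation cost."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)          # k disjoint validation folds
    best = None
    for n_hid in n_hidden_range:
        scores = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(X[tr], y[tr], X[val], y[val], n_hid))
        med = float(np.median(scores))
        if best is None or med < best[1]:
            best = (n_hid, med)
    return best                              # (n_hid, median validation cost)
```

Using the median across folds rather than the mean mirrors the paper's use of medians to dampen the effect of occasional bad runs.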
The median of these measures is then used for comparing the performance of the four methods on the ten case studies.

C. Parameters

Table II summarizes the parameters used in the experiments and the NN training process. The data are randomly split into training (70%) and test (30%) datasets. The level of confidence associated with all PIs is 90%. The initial temperature is set to 5 to allow uphill movements in the early iterations of the optimization algorithm. A geometric cooling schedule with a cooling factor of 0.9 is applied in the LUBE method. The NN parameter perturbation function in the LUBE method generates random numbers that are normally distributed with mean zero and unit variance. μ is set to 0.9, because the prescribed level of confidence of the PIs is 90%. Also, η is selected to be 50 in order to highly penalize PIs with a coverage probability lower than 90%. With these values, the profile of CWC is similar to the middle plot in Fig. 1.

D. Results and Discussion

The convergence behavior of CWC and the profile of the cooling temperature for the LUBE method are shown in Fig. 4 for the first six case studies. For better graphical visualization of the optimization effort, extreme values of CWC have not been displayed in some plots. The CWC plots for case studies 1 and 6 are logarithmic, and show the CWC variation throughout the NN training. In the early iterations, when the temperature is high (T = 2), CWC fluctuates rapidly, allowing uphill movements. As the temperature drops, the optimization algorithm becomes greedier. The CWC decreases gradually, but nonmonotonically. For temperatures less than 1, CWC takes a continuous downward trend and only improving movements are allowed. The initial value of CWC is very large (CWC0 ≥ 10^80 for all cases). This means that the initial set of parameters obtained using the traditional training techniques is not suitable. Such a large CWC is mainly due to an unsatisfactorily low PICP.
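For reference, with the Table II schedule (T0 = 5, β = 0.9) the temperature falls below 1, the near-greedy regime noted above, after only a handful of cooling steps; the quick check below is ours, and the paper does not state how often the schedule is applied relative to the 500–3000 optimization iterations:

```python
import math

T0, beta = 5.0, 0.9   # Table II: initial temperature and geometric cooling factor

# Smallest k with T0 * beta**k < 1 (log(beta) < 0 flips the inequality)
steps = math.ceil(math.log(1.0 / T0) / math.log(beta))
print(steps)          # 16 cooling steps
```

If the schedule were applied every iteration, most of a 500–3000 iteration run would therefore be spent in the effectively greedy phase.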
The number of iterations varies for the different case studies and ranges from approximately 500 to 3000. Each iteration takes less than 0.02 s to complete. This means the NN training is completed in less than 1 min in the worst case. Therefore, the convergence speed of the SA-based training algorithm is acceptable. The variation of NN parameters 21 to 25 during the optimization process is displayed in Fig. 5 for case studies 7 to 10. These parameters change throughout the optimization, indicating the complexity of the cost function in the multidimensional space of the NN parameters.

The medians of PICP and NMPIW for the test datasets (Dtest) are listed in Table III. The PIs obtained using the LUBE method are compared with those constructed using the delta, Bayesian, and bootstrap techniques. Hereafter, and for ease of reference, the subscripts Delta, Bayes, BS, and LUBE are used to indicate the delta, Bayesian, bootstrap, and LUBE methods for the construction of PIs. It is expected that PICPs will be at least equal to 90%, as the confidence level associated with all PIs is 90%. According to the results in Table III, the coverage probability of PI_LUBE is better than that of the other PIs in the majority of the conducted experiments. While the number of undercoverage cases for the LUBE technique is 1, it is 6, 4, and 4 for the delta, Bayesian, and bootstrap methods, respectively. It is only for case study 7 that PICP_LUBE is slightly lower than the nominal confidence level (89.47% instead of 90%). The mean and median of PICP_Delta and PICP_BS over the 10 case studies are lower than the nominal confidence level.
This indicates that PIs constructed using these methods suffer from the undercoverage problem, and are therefore not very reliable in practice.

Fig. 4. Profile of the cooling temperature (solid line) and the CWC (dashed line) for the six case studies.

Fig. 5. Evolution of NN parameters during the training algorithm.

The mean and median of PICP_Bayes over all case studies are 89.93% and 93.70%, respectively. Although the median is greater than the nominal confidence level, PICP_Bayes fluctuates strongly across case studies. The minimum of PICP_Bayes occurs for case study 9 (75%), where the constructed PIs are totally unreliable. For the LUBE method, the mean and median of PICP are 92.00% and 91.36%. Moreover, the standard deviation of PICP_LUBE is 2.3%, almost five times less than that of PICP_Bayes. These statistics specify
that the LUBE-based PIs: 1) properly cover the targets in different replicates of an experiment (median above 90%), and 2) show consistent behavior across case studies (a small standard deviation, with mean and median above the nominal confidence level).

TABLE III
PI Characteristics for Test Samples (D_test) of the Ten Case Studies

Case   Delta technique    Bayesian technique  Bootstrap technique  LUBE method
study  PICP(%) NMPIW(%)   PICP(%)  NMPIW(%)   PICP(%)  NMPIW(%)    PICP(%) NMPIW(%)
  1     85.56   79.37     100.00   120.12      85.56    85.47       90.00   72.22
  2     86.67   11.29      92.67    16.54      98.00    24.53       92.00   23.09
  3     93.90   57.97      97.56    70.39      82.93    36.48       91.46   41.51
  4     89.00   25.42      88.67    22.17      91.26    27.52       91.26   35.57
  5     90.39   36.12      97.31    36.20      93.85    37.51       91.15   33.91
  6     88.08   54.73     100.00    78.72      58.46    32.09       94.62   69.39
  7     95.79   47.04      94.74    60.60      85.26    33.13       89.47   41.80
  8     91.67   43.45      76.67    13.08      96.67    50.79       93.33   45.05
  9     83.33   15.85      75.00     8.52      91.67    32.62       96.67   24.71
 10     85.00   25.58      76.67    15.99      96.67    50.79       90.00   31.52

Fig. 6. Median of CWC measure for PIs constructed for test samples using the Bayesian, delta, and LUBE methods.

The medians of CWC_Delta, CWC_Bayes, CWC_BS, and CWC_LUBE for the 10 case studies are displayed in Fig. 6. Evaluation of the CWCs is done using γ(PICP) as the step function [see (5)]. As some CWCs, for instance CWC_BS for case study 6 or CWC_Bayes for case studies 9 and 10, are very large, the CWC axis is limited to 200 for better visualization. CWC_LUBE is lower than the other CWCs for 8 out of 10 case studies. CWC_LUBE is always less than 100, which means that PI_LUBE are correct and sufficiently narrow. For the other three methods, there are always cases where the computed CWCs for the constructed PIs are greater than 100 (four cases for each method).
This means that the LUBE method can more effectively establish a tradeoff between the correctness and informativeness of PIs than the other methods. Although PI_Delta are on average narrower than PI_LUBE, they are incorrect due to a low coverage probability. The Bayesian method generates the highest-quality PIs for the case of noise with a nonconstant variance (case study 2): PI_Bayes are narrower than the other PIs and have an acceptable coverage probability. PI_LUBE rank second in terms of overall quality. Bootstrap-based PIs are wider than the others in five replicates in order to achieve a high coverage probability. PI_Delta are excessively narrow, resulting in an unacceptable coverage probability of 86%. According to these results, the Bayesian and LUBE methods are the best candidates for PI construction in cases where there is additive noise with a nonconstant variance. Apart from the performance comparison, it is important to compare the four methods in terms of the computational burden required for constructing PIs. The LUBE method produces a more efficient set of PIs with a much lower computational requirement. The time required for the construction of PI_Delta, PI_Bayes, PI_BS, and PI_LUBE for D_test is shown in Table IV. A comparison of the times in Table IV reveals that the LUBE method constructs PIs in periods thousands of times shorter than those required by the delta and Bayesian methods. The time required by the LUBE method to construct PIs for the test samples is 4 ms in the worst case, while the averages for the Bayesian and delta techniques are 4447 ms and 3368 ms, respectively.
The large computational load of the delta and Bayesian techniques is mainly due to the calculation of the gradient of the NN outputs with respect to the NN parameters.

TABLE IV
PI Construction Time Using the Bayesian, Delta, Bootstrap, and LUBE Techniques for the Test Samples

                      Time (ms)
Case study   Bayesian   Delta   Bootstrap   LUBE
    1          4195     3562       51         3
    2          4761     3159       54         3
    3          3508     3181       51         3
    4          3498     3199       55         3
    5          7818     5924       56         3
    6          5826     3886       56         4
    7          5813     3892       51         4
    8          3002     2500       51         3
    9          3265     2029       50         4
   10          2779     2368       51         3

The computational requirement of the bootstrap method also exceeds that of the LUBE method by a factor of at least B (here, more than 10), because B NNs, with some extra calculations, are required to construct the PIs for a test sample. The computational load becomes important in cases where the constructed PIs are used for optimization purposes. For instance, in case study 3, the information provided by the PIs can be used for operational planning and scheduling in the underlying baggage handling system. Thousands of operational scenarios must be generated and evaluated in a short time to determine which one guarantees smooth operation of the whole system. As the computational load of the delta, Bayesian, and bootstrap techniques is high, their application for such purposes is limited. In contrast, the LUBE method can easily be applied in real time to construct PIs for different operational scenarios. This paves the way for wider application of PIs in real-time planning and decision making. In summary, in almost all experiments presented in Table III and Fig. 6, the quality of PI_LUBE is at least equal or superior to that of PI_Delta, PI_Bayes, and PI_BS in terms of the PICP and NMPIW measures. The demonstrated results confirm the reliability and quality of the PIs constructed using the LUBE method.
Moreover, its computational requirement for PI construction is much lower than that of the other three methods. It is therefore reasonable to conclude that the LUBE method constructs high-quality PIs in a very short time.

V. CONCLUSION

In this paper, a new method was proposed for the construction of PIs using NN models. The LUBE method uses an NN with two outputs to construct the upper and lower bounds of the PIs. To train this NN, a new cost function was proposed based on the two key features of PIs: width and coverage probability. As the proposed cost function is nonlinear, complex, and nondifferentiable, an SA method is applied for the minimization of the proposed prediction interval-based cost function and the training of the NN model. Through synthetic and real case studies, it was demonstrated that the LUBE method is effective and reliable for the construction of PIs. The comparative study revealed that the quality of prediction intervals constructed using the LUBE method is in many cases superior to that of intervals constructed using the delta, Bayesian, and bootstrap techniques. Furthermore, the computational expense of the proposed method for constructing PIs is virtually nil compared to that of the delta and Bayesian techniques. The LUBE method can be modified in different ways to generate narrower PIs with a higher coverage probability. As with any other application of neural networks, the performance of the proposed method depends on the network structure. The effectiveness of the LUBE method can easily be improved by combining it with one of the NN structure selection techniques [34]. Currently, the proposed optimization method does not consider the network structure; the NN structures in this paper were selected on a trial-and-error basis. The current training algorithm uses the SA method for minimization of the cost function.
Our minimization experiments using the genetic algorithm were unsuccessful: the algorithm diverged in the majority of cases. However, the application of other optimization methods may enhance the quality of the constructed prediction intervals. Finally, the overfitting problem of the developed NN can be avoided by applying cross-validation, pruning, and weight-elimination techniques.

ACKNOWLEDGMENT

The authors are grateful to the anonymous reviewers for their helpful comments and suggestions.

REFERENCES

[1] C. P. I. J. van Hinsbergen, J. W. C. van Lint, and H. J. van Zuylen, "Bayesian committee of neural networks to predict travel times with confidence intervals," Transp. Res. Part C: Emerg. Technol., vol. 17, no. 5, pp. 498–509, Oct. 2009.
[2] W.-H. Liu, "Forecasting the semiconductor industry cycles by bootstrap prediction intervals," Appl. Econ., vol. 39, no. 13, pp. 1731–1742, 2007.
[3] A. Khosravi, S. Nahavandi, and D. Creighton, "A prediction interval-based approach to determine optimal structures of neural network metamodels," Expert Syst. Appl., vol. 37, no. 3, pp. 2377–2387, Mar. 2010.
[4] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.: Oxford Univ. Press, 1995.
[5] D. L. Shrestha and D. P. Solomatine, "Machine learning approaches for estimation of prediction interval for the model output," Neural Netw., vol. 19, no. 2, pp. 225–235, Mar. 2006.
[6] S. L. Ho, M. Xie, L. C. Tang, K. Xu, and T. N. Goh, "Neural network modeling with confidence bounds: A case study on the solder paste deposition process," IEEE Trans. Electron. Packag. Manufact., vol. 24, no. 4, pp. 323–332, Oct. 2001.
[7] J. H. Zhao, Z. Y. Dong, Z. Xu, and K. P. Wong, "A statistical approach for interval forecasting of the electricity price," IEEE Trans. Power Syst., vol. 23, no. 2, pp. 267–276, May 2008.
[8] A. Khosravi, S. Nahavandi, and D. Creighton, "Construction of optimal prediction intervals for load forecasting problems," IEEE Trans. Power Syst., vol. 25, no. 3, pp.
1496–1503, Aug. 2010.
[9] S. G. Pierce, K. Worden, and A. Bezazi, "Uncertainty analysis of a neural network used for fatigue lifetime prediction," Mech. Syst. Signal Process., vol. 22, no. 6, pp. 1395–1411, Aug. 2008.
[10] D. F. Benoit and D. Van den Poel, "Benefits of quantile regression for the analysis of customer lifetime value in a contractual setting: An application in financial services," Expert Syst. Appl., vol. 36, no. 7, pp. 10475–10484, Sep. 2009.
[11] N. Meade and T. Islam, "Prediction intervals for growth curve forecasts," J. Forecast., vol. 14, no. 5, pp. 413–430, Sep. 1995.
[12] T. Heskes, "Practical confidence and prediction intervals," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, pp. 176–182.
[13] J. G. De Gooijer and R. J. Hyndman, "25 years of time series forecasting," Int. J. Forecast., vol. 22, no. 3, pp. 443–473, 2006.
[14] G. Chryssolouris, M. Lee, and A. Ramsey, "Confidence interval prediction for neural network models," IEEE Trans. Neural Netw., vol. 7, no. 1, pp. 229–232, Jan. 1996.
[15] J. T. G. Hwang and A. A. Ding, "Prediction intervals for artificial neural networks," J. Amer. Stat. Assoc., vol. 92, no. 438, pp. 748–757, Jun. 1997.
[16] C. J. Wild and G. A. F. Seber, Nonlinear Regression. New York: Wiley, 1989.
[17] A. A. Ding and X. He, "Backpropagation of pseudo-errors: Neural networks that are adaptive to heterogeneous noise," IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 253–262, Mar. 2003.
[18] R. D. De Veaux, J. Schumi, J. Schweinsberg, and L. H. Ungar, "Prediction intervals for neural networks via nonlinear regression," Technometrics, vol. 40, no. 4, pp. 273–282, Nov. 1998.
[19] T. Lu and M. Viljanen, "Prediction of indoor temperature and relative humidity using neural network models: Model comparison," Neural Comput. Appl., vol. 18, no. 4, pp. 345–357, Mar. 2009.
[20] A.
Khosravi, S. Nahavandi, and D. Creighton, "Improving prediction interval quality: A genetic algorithm-based method applied to neural networks," in Proc. 16th Int. Conf. Neural Inf. Process.: Part II, vol. 5864, 2009, pp. 141–149.
[21] D. J. C. MacKay, "The evidence framework applied to classification networks," Neural Comput., vol. 4, no. 5, pp. 720–736, Sep. 1992.
[22] R. Dybowski and S. J. Roberts, "Confidence intervals and prediction intervals for feed-forward neural networks," in Clinical Applications of Artificial Neural Networks, R. Dybowski and V. Gant, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2001, pp. 298–326.
[23] J. G. Carney, P. Cunningham, and U. Bhagwan, "Confidence and prediction intervals for neural network ensembles," in Proc. Int. Joint Conf. Neural Netw., vol. 2. Washington, DC, Jul. 1999, pp. 1215–1218.
[24] E. Zio, "A study of the bootstrap method for estimating the accuracy of artificial neural networks in predicting nuclear transient processes," IEEE Trans. Nucl. Sci., vol. 53, no. 3, pp. 1460–1478, Jun. 2006.
[25] F. Giordano, M. La Rocca, and C. Perna, "Forecasting nonlinear time series with neural network sieve bootstrap," Comput. Stat. Data Anal., vol. 51, no. 8, pp. 3871–3884, May 2007.
[26] N. O. Oleng', A. Gribok, and J. Reifman, "Error bounds for data-driven models of dynamical systems," Comput. Biol. Med., vol. 37, no. 5, pp. 670–679, May 2007.
[27] D. A. Nix and A. S. Weigend, "Estimating the mean and variance of the target probability distribution," in Proc. IEEE Int. Conf. Neural Netw. World Congr. Comput. Intell., vol. 1. Orlando, FL, Jun.–Jul. 1994, pp. 55–60.
[28] I. Rivals and L. Personnaz, "Construction of confidence intervals for neural networks based on least squares estimation," Neural Netw., vol. 13, nos. 4–5, pp. 463–484, Jun. 2000.
[29] G. Papadopoulos, P. J. Edwards, and A. F. Murray, "Confidence estimation methods for neural networks: A practical comparison," IEEE Trans. Neural Netw., vol. 12, no. 6, pp.
1278–1287, Nov. 2001.
[30] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory and Applications. Boston, MA: Kluwer, 1987.
[31] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing (Discrete Mathematics and Optimization). New York: Wiley, 1989.
[32] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, May 1983.
[33] S. Hashem, "Optimal linear combinations of neural networks," Neural Netw., vol. 10, no. 4, pp. 599–614, Jun. 1997.
[34] L. Ma and K. Khorasani, "New training strategies for constructive neural networks with application to regression problems," Neural Netw., vol. 17, no. 4, pp. 589–609, May 2004.
[35] A. Khosravi, S. Nahavandi, and D. Creighton, "Constructing prediction intervals for neural network metamodels of complex systems," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009, pp. 1576–1582.
[36] P. Vlachos. (2010, Jan. 10). StatLib Datasets Archive [Online]. Available: http://lib.stat.cmu.edu/datasets
[37] B. De Moor. (2010, Jan. 10). DaISy: Database for the Identification of Systems. Dept. Elect. Eng., ESAT/SISTA, Katholieke Univ. Leuven, Leuven, Belgium [Online]. Available: http://homes.esat.kuleuven.be/smc/daisy/

Abbas Khosravi (M'07) received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2002, and the M.Sc. (hons.) degree in electrical engineering from Amirkabir University of Technology, Tehran, in 2005. His specialization was artificial intelligence (AI). He joined the eXiT Group as a Research Academic at the University of Girona, Girona, Spain, in 2006, working in the area of AI applications. Currently, he is a Research Fellow in the Center for Intelligent Systems Research, Deakin University, Victoria, Australia.
His current research interests include the development and application of AI techniques for (meta)modeling, analysis, control, and optimization of operations within complex systems.

Saeid Nahavandi (SM'07) received the B.Sc. (hons.), M.Sc., and Ph.D. degrees in automation and control from Durham University, Durham, U.K. He is the Alfred Deakin Professor, Chair of Engineering, and Director of the Center for Intelligent Systems Research, Deakin University, Victoria, Australia. He has published over 350 peer-reviewed papers in various international journals and conference proceedings. He designed the world's first 3-D interactive surface/motion controller. His current research interests include modeling of complex systems, simulation-based optimization, robotics, haptics, and augmented reality. Dr. Nahavandi was a recipient of the Young Engineer of the Year award in 1996 and six international awards in engineering. He is an Associate Editor of the IEEE SYSTEMS JOURNAL, an Editorial Consultant Board Member for the International Journal of Advanced Robotic Systems, and Editor (South Pacific Region) of the International Journal of Intelligent Automation and Soft Computing. He is a Fellow of Engineers Australia and the Institution of Engineering and Technology.

Doug Creighton received the B.Eng. (hons.) degree in systems engineering and the B.Sc. degree in physics from the Australian National University, Canberra, Australia, in 1997, where he attended as a National Undergraduate Scholar, and the Ph.D. degree in simulation-based optimization from Deakin University, Victoria, Australia, in 2004. He spent several years as a software consultant before obtaining the Ph.D. He is currently a Research Academic and Stream Leader with the Center for Intelligent Systems Research, Deakin University.
His current research interests include modeling, discrete event simulation, intelligent agent technologies, human-machine interfaces and visualization, simulation-based optimization, and the development of algorithms that allow learning agents to be applied to industrial-scale systems for optimization, dynamic control, and scheduling.

Amir F. Atiya (S'86–M'90–SM'97) received the B.S. degree from Cairo University, Cairo, Egypt, in 1982, and the M.S. and Ph.D. degrees from the California Institute of Technology (Caltech), Pasadena, in 1986 and 1991, respectively, all in electrical engineering. He is currently a Professor in the Department of Computer Engineering, Cairo University. He has held several visiting appointments, including with Caltech and Chonbuk National University, Jeonju, South Korea. His current research interests include neural networks, machine learning, theory of forecasting, computational finance, and Monte Carlo methods. Dr. Atiya was a recipient of several awards, including the Kuwait Prize in 2005. He was an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS from 1998 to 2008.