IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 3, MARCH 2011
Lower Upper Bound Estimation Method for
Construction of Neural Network-Based Prediction
Intervals
Abbas Khosravi, Member, IEEE, Saeid Nahavandi, Senior Member, IEEE,
Doug Creighton, and Amir F. Atiya, Senior Member, IEEE
Abstract— Prediction intervals (PIs) have been proposed in
the literature to provide more information by quantifying the
level of uncertainty associated with point forecasts. Traditional
methods for construction of neural network (NN) based PIs
suffer from restrictive assumptions about data distribution and
massive computational loads. In this paper, we propose a new,
fast, yet reliable method for the construction of PIs for NN
predictions. The proposed lower upper bound estimation (LUBE)
method constructs an NN with two outputs for estimating the
prediction interval bounds. NN training is achieved through the
minimization of a proposed PI-based objective function, which
covers both interval width and coverage probability. The method
does not require any information about the upper and lower
bounds of PIs for training the NN. The simulated annealing
method is applied for minimization of the cost function and
adjustment of NN parameters. The demonstrated results for
10 benchmark regression case studies clearly show the LUBE
method to be capable of generating high-quality PIs in a short
time. Also, the quantitative comparison with three traditional
techniques for prediction interval construction reveals that the
LUBE method is simpler, faster, and more reliable.
Index Terms— Neural network, prediction interval, simulated
annealing, uncertainty.
I. INTRODUCTION
THERE are numerous reports discussing the successful
application of neural networks (NNs) in prediction and
regression problems. However, there is a belief that NN point
predictions are of limited value where there is uncertainty in
the data or variability in the underlying system. Examples
of such systems are transportation networks [1], manufacturing enterprises [2], and material handling facilities [3].
Statistically, the NN output approximates the average of the
underlying target conditioned on the NN input vector [4]. If the
target is multivalued, the NN conditional averaged output can
be far from the actual target, and is therefore unreliable. Furthermore, NN point predictions convey no information about
Manuscript received May 30, 2010; revised November 28, 2010; accepted
November 28, 2010. Date of publication December 23, 2010; date of current
version March 2, 2011. This research was supported in part by the Center for
Intelligent Systems Research at Deakin University.
A. Khosravi, S. Nahavandi, and D. Creighton are with the Center for Intelligent Systems Research, Deakin University, Geelong, Victoria 3117, Australia
(e-mail: abbas.khosravi@deakin.edu.au; saeid.nahavandi@deakin.edu.au;
douglas.creighton@deakin.edu.au).
A. F. Atiya is with the Department of Computer Engineering, Cairo
University, Cairo 12613, Egypt (e-mail: amir@alumni.caltech.edu).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2010.2096824
the sampling errors and the prediction accuracy. Incorporating
the prediction uncertainty into the deterministic approximation
generated by NNs improves the reliability and credibility of
the predictions [5]. The semiconductor industry [2], surface
mount manufacturing [6], electricity load forecasting [7],
[8], fatigue lifetime prediction [9], financial services [10],
hydrologic case studies [5], transportation [1], and baggage
handling systems [3], to name a few, are examples discussing
this problem in different domains.
Confidence intervals (CIs) and prediction intervals (PIs) are
two well-known tools for quantifying and representing the
uncertainty of predictions. While a CI describes the uncertainty
in the prediction of an unknown but fixed value, a PI deals
with the uncertainty in the prediction of a future realization
of a random variable [11]. By definition, a PI accounts
for more sources of uncertainty (model misspecification and
noise variance) and is wider than the corresponding CI [12].
Unfortunately, many authors wrongly use these terms interchangeably [13].
In the literature, several methods have been proposed for
construction of PIs and assessment of the NN outcome uncertainty. Chryssolouris et al. [14] and Hwang and Ding [15]
developed the delta technique through the nonlinear regression representation of the NNs. The method first linearizes
the NN model around a set of parameters obtained through
minimization of the sum of squared error cost function. Then,
standard asymptotic theory is applied to the linearized model
for constructing PIs [16]. Intervals are constructed under
the assumption that the noise is homogeneous and normally
distributed. As the noise is heterogeneous in many real-world
case studies, the constructed intervals can be misleading [17].
Veaux et al. [18] extended the delta method to those cases in
which NNs are trained using the weight decay cost function
instead of sum of squared errors. Although the generalization
power of NN models is improved using this method, the PIs
suffer from the fundamental limitation of the delta technique
(linearization). The delta method has been applied in many
synthetic and real case studies [3], [6], [19], [20], despite this
weakness.
The Bayesian technique is another method for construction
of NN-based PIs [21]. Training NNs using the Bayesian
technique allows error bars to be assigned to the predicted
values of the network [4]. Despite the strength of the supporting theories, the method suffers from massive computational
burden, and requires calculation of the Hessian matrix of the
cost function for construction of PIs.
Bootstrap is one of the most frequently used techniques in
the literature for construction of PIs for NN point forecasts
[12], [22]–[26]. The main advantage of this method is its
simplicity and ease of implementation. It does not require
calculation of complex derivatives and the Hessian matrix
involved in the delta and Bayesian techniques. It has been
claimed that the bootstrap method generates more reliable PIs
than other methods [22]. The main disadvantage of this method
is its computational cost for large datasets [6].
A mean-variance estimation-based method for PI construction has also been proposed by Nix and Weigend [27]. The
method uses a NN to estimate the characteristics of the
conditional target distribution. Additive Gaussian noise with
nonconstant variance is the key assumption of the method for
PI construction. Compared to the aforementioned techniques,
the computational burden of this method in the training and
utilization stages is negligible. However, this method underestimates the variance of the data, leading to a low empirical coverage
probability, as discussed in [17] and [22].
The four methods identified above have been used for construction of PIs in the literature. Regardless of their implementation differences, they share one methodological principle.
NNs are first trained through minimization of an error-based
cost function, such as the sum of squared errors or weight
decay cost function. Then, PIs are constructed for outcomes
of these trained NNs. The central argument in this paper is that
the quality of constructed PIs in this way is questionable. In
all traditional PI construction methods, the main strategy is to
minimize the prediction error, instead of trying to improve the
PI quality. The constructed PIs, therefore, are not guaranteed to
be optimal in terms of their key characteristics: i.e., width and
coverage probability. If optimality of PIs is the main concern in
the process of PI construction, the NN development procedure
should be revised to directly address the characteristics of PIs.
Another critical, yet often ignored issue related to PIs
is that the literature is void of information regarding the
quantitative and objective assessment of PIs. The main focus
of the literature is on the methodologies for the construction
of PIs. Quantitative examination of the quality of developed
PIs (and also CIs) is often ignored or represented subjectively
and ambiguously [1], [6], [7], [9], [15], [25], [27], [28]. In
the comparative studies of CI and PI construction methods,
the coverage probability has been considered as the only
criterion for assessing the quality of the constructed intervals
[29]. We argue that the coverage probability by itself does
not completely describe all characteristics of the intervals. A
100% coverage probability can be easily achieved through
assigning sufficiently large and small values to the upper
and lower bounds of PIs. To the best of our knowledge,
there exist only a few papers that quantitatively evaluate the
quality of constructed PIs in terms of their key characteristics
[3], [5], [8].
The main objective of this paper is to propose a new method
for construction of PIs using NN models. One goal of the
proposed lower upper bound estimation (LUBE) method is
to avoid the calculation of derivatives of NN output with
respect to its parameters. As indicated by Dybowski and
Roberts [22], these derivatives can be a source of unreliability
of constructed PIs in the delta technique. Also, the NN
training and development process is accomplished through
improvement of the quality of PIs. While not directly performing an error minimization, the proposed technique aims at
producing a narrow PI bracketing the prediction, thereby also
achieving accurate predictions. This aspect of LUBE is distinct
from the common practice for construction of PIs using
traditional techniques. A new PI-based cost function using
the quantitative measures proposed in [3] is developed. The
new cost function simultaneously examines PIs based on their
width and coverage probability. An NN model is considered
for approximating the upper and lower bounds of PIs. The
parameters of this NN are adjusted through minimization of
the proposed PI-based cost function. As the cost function
is highly nonlinear, complex, and discontinuous, a simulated
annealing (SA) method is implemented for its minimization.
The effectiveness of the LUBE method is examined using
synthetic and real-world case studies. It is shown that the
proposed method builds narrow PIs with a high coverage
probability. The performance of the proposed method is also
compared with Bayesian and delta techniques. The simulation
results show that the quality of constructed PIs using the new
technique is superior to the Bayesian and delta-based PIs.
The rest of this paper is organized as follows. Section II
describes the proposed PI-based cost function. The new NN
LUBE method for construction of PIs is presented in
Section III. Experimental results are given in Section IV.
Finally, Section V concludes this paper with some remarks
for further study in this domain.
II. PI-BASED COST FUNCTION
The cornerstone of the proposed method for construction
of optimal PIs is a new PI-based cost function. This cost
function is later used for training and development of NNs.
In order to derive the cost function, we first need to define
measures for the quantitative assessment of PIs. By definition,
future observations are expected to lie within the bounds of
PIs with a prescribed probability called the confidence level
((1 − α)%). It is expected that the coverage probability of PIs
will asymptotically approach the nominal level of confidence.
Accordingly, the PI coverage probability (PICP) is the most
direct measure of the quality of the constructed
PIs [3], [5], [8]:
$$\mathrm{PICP} = \frac{1}{n}\sum_{i=1}^{n} c_i \qquad (1)$$
where ci = 1 if yi ∈ [L(Xi), U(Xi)], otherwise ci = 0.
L(Xi) and U(Xi) are the lower and upper bounds of the ith
PI, respectively. If the empirical PICP is much less than its
nominal value, the first conclusion is that the constructed PIs
are not at all reliable. This measure has been reported in almost
all studies related to PIs, as an indication of how good the
constructed PIs are [1], [3], [5]–[7], [9], [15], [25], [29].
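As a minimal sketch, PICP in (1) can be computed as follows; the function name and the toy targets and bounds are illustrative, not from the paper:

```python
def picp(y, lower, upper):
    """PI coverage probability, Eq. (1): the fraction of targets
    falling inside their interval [L(X_i), U(X_i)]."""
    c = [1 if lo <= yi <= up else 0 for yi, lo, up in zip(y, lower, upper)]
    return sum(c) / len(c)

# hypothetical targets and interval bounds
y     = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.2, 3.0]
upper = [1.5, 2.5, 3.8, 5.0]
print(picp(y, lower, upper))  # 0.75: the third target lies outside its PI
```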
If the extreme values of targets are considered as the upper
and lower bounds of all PIs, the corresponding PICP will be
perfect (100% coverage). Practically, PIs that are too wide
are of no value, as they convey no information about the
target variation. Another measure, therefore, is required for
quantifying the width of PIs. Mean prediction interval width
(MPIW) is defined as follows [3], [5]:

$$\mathrm{MPIW} = \frac{1}{n}\sum_{i=1}^{n}\bigl(U(X_i) - L(X_i)\bigr) \qquad (2)$$

where U(Xi) and L(Xi) represent the upper and lower bounds
of the PI corresponding to the ith sample. If the width of the
target is known, MPIW can be normalized for objective comparison
of PIs developed using different methods. Normalized MPIW
(NMPIW) is given below [3], [8]:

$$\mathrm{NMPIW} = \frac{\mathrm{MPIW}}{R} \qquad (3)$$

where R is the range of the underlying target. NMPIW is a
dimensionless measure representing the average width of PIs
as a percentage of the underlying target range. In the case of
using the extreme target values as upper and lower bounds of
PIs, both NMPIW and PICP will be 100%. This indicates that
PICP and NMPIW have a direct relationship. Under equal
conditions, a larger NMPIW will usually result in a higher
PICP.
From a practical standpoint, it is important to have narrow
PIs (small NMPIW) with a high coverage probability
(large PICP). Theoretically, these two goals are conflicting.
Reducing the width of PIs often results in a decrease in PICP,
due to some observations dropping out of the PIs. To quantitatively
represent and measure this tradeoff, a combined
measure/index is required that carries information about how wide
the PIs are and what their coverage probability is. As PICP is
theoretically the fundamental feature of the PIs, the proposed
measure should give more weight to the variation of PICP.
Put in other words, the index should be large for those cases
where PICP is less than its corresponding nominal confidence level
((1−α)%). The following combined coverage width-based
criterion (CWC) addresses all these issues for evaluation of
the PIs:

$$\mathrm{CWC} = \mathrm{NMPIW}\left(1 + \gamma(\mathrm{PICP})\, e^{-\eta(\mathrm{PICP}-\mu)}\right). \qquad (4)$$

[Fig. 1. CWC for different values of η (η = 10, 100, 300), plotted as the logarithm of CWC against PICP.]
Assume for now that γ(PICP) = 1. The constants η and
μ are two hyperparameters determining how much penalty is
assigned to PIs with a low coverage probability. μ corresponds
to the nominal confidence level associated with PIs and can be
set to 1 − α. The role of η is to magnify any small difference
between PICP and μ. Usually, it should be selected to have
a large value. The rationale for using such an asymmetric
criterion (with regard to PICP) is that a PICP lower than μ
will give a misleadingly optimistic PI, and should be penalized
more (than a wider PI is rewarded). This is a reasonably
conservative strategy. Applying such a strategy leads us to
the following observation. Based on the CWC definition in
(4), for PICP < μ we are operating in the steep and high
portion of the exponential term. Accordingly, such solutions
are highly penalized due to the dominance of the PICP term,
and rightfully so. When PICP is greater than but close to μ
(PICP ≥ μ), there are two conflicting factors. Loosening
the PIs’ widths will increase PICP thereby decreasing CWC.
On the other hand, loosening the PIs’ widths will increase
NMPIW, and hence increase CWC. However, as PICP goes
further away from μ, the exponential term will gradually level
off, leading to the NMPIW factor becoming more and more
dominant in CWC. So, the algorithm will end up stopping
at a PICP slightly higher than μ (for example, if we are
constructing 90% PIs, it might get around 92–93%). One
might then question the rationale for forgoing a few percentage
points when we could get tighter PIs at PICP exactly equal
to μ. The reason is that it is better to leave a slack to allow
for deviations in out-of-sample results. For example, assume
μ = 90%. If we achieve a PICP of exactly 90% on the
training set, most probably the test set will give a PICP
lower than 90%, e.g., around 87% or 88%, thus violating the
constraint. Violating the PI constraint (that PICP should
be greater than 90%) has serious ramifications (as
misleading results are obtained), whereas having a PICP a little
higher than 90% means that we are just a little conservative
(in having wider PIs).
When we evaluate a set of PIs, for example by measuring
the CWC for a test set, there is no reason to allow for the
slack. In such a case, γ(PICP) is given by the following
step function:

$$\gamma(\mathrm{PICP}) = \begin{cases} 0, & \mathrm{PICP} \ge \mu \\ 1, & \mathrm{PICP} < \mu. \end{cases} \qquad (5)$$
This means that the exponential term in (4) is eliminated
whenever PICP ≥ μ and CWC becomes equal to NMPIW.
This allows an impartial evaluation that does not unnecessarily reward solutions that give a PICP a little higher than
μ. Theoretically correct PIs are assessed according to their
widths.
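The CWC criterion in (4), together with the training-time choice γ = 1 and the evaluation-time step function (5), can be sketched as follows; the function and argument names (cwc, nmpiw, training) are ours, not the paper's:

```python
import math

def nmpiw(lower, upper, target_range):
    """Normalized mean PI width, Eqs. (2)-(3)."""
    widths = [u - l for l, u in zip(lower, upper)]
    return (sum(widths) / len(widths)) / target_range

def cwc(picp_val, nmpiw_val, eta=50, mu=0.9, training=True):
    """Coverage width-based criterion, Eq. (4).
    training=True keeps gamma = 1 (the conservative choice used
    during NN training); training=False applies the step function (5),
    so CWC collapses to NMPIW whenever PICP >= mu."""
    gamma = 1 if training else (0 if picp_val >= mu else 1)
    return nmpiw_val * (1 + gamma * math.exp(-eta * (picp_val - mu)))

# with PICP above mu, evaluation-mode CWC equals NMPIW
print(cwc(0.92, 0.40, training=False))  # 0.4
```

Note how a PICP below μ blows the criterion up exponentially, while during training even a satisfactory PICP still contributes a small exponential term, leaving the slack discussed above.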
Fig. 1 displays the evolution of CWC for different values
of η. In the three plots, it has been assumed that NMPIW =
40%. If PICP ≥ μ, CWC is small and very close to the
NMPIW. This indicates that the PICP has been above the
nominal confidence level. When PICP < μ, the PICP is not
satisfactory and CWC exponentially increases. In these cases,
CWC is significantly greater than NMPIW. This explicitly
means that the coverage probability of constructed PIs has
not been satisfactory. The location, rate, and magnitude of the
CWC jump can be easily controlled by η and μ.
[Fig. 2. NN model developed for approximating upper and lower bounds of PIs in the LUBE method.]
Traditionally, NNs are trained through minimization of
error-based cost functions such as sum of squared error,
weight decay cost function, Akaike information criterion,
and Bayesian information criterion [4]. Such an approach
is theoretically and practically acceptable if the purpose of
modeling and analysis is point prediction. If NNs are going
to be used for PI construction, it is more reasonable to train
them through minimization of PI-based cost functions. A set
of optimal PIs constructed using the NNs in this manner
will have an improved quality in terms of their width and
coverage probability. To achieve this goal, the NN training
procedure can be accomplished based on minimization of the
proposed new cost function that constitutes the core of the
CWC (γ (P I C P) = 1).
III. LUBE METHOD
The proposed cost function in the previous section is now
used for training an NN for constructing PIs. The structure of
the proposed NN model with two outputs is shown in Fig. 2.
The demonstrated NN is symbolic and it can have any number
of layers and neurons per layer. The first output corresponds to
the upper bound of the PI, and the second output approximates
the lower bound of the PI. In the literature, the PI construction
techniques attempt to estimate the mean and variance of
the targets for construction of PIs. In contrast to existing
techniques, the proposed method tries to directly approximate
upper and lower bounds of PIs based on the set of inputs.
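A minimal sketch of such a two-output network (as in Fig. 2), with a single hidden tanh layer; the function name and the hand-picked weights are illustrative, not a trained model:

```python
import math

def two_output_nn(x, w_hidden, b_hidden, w_out, b_out):
    """Single-hidden-layer NN with two outputs: the first output
    gives the upper bound and the second the lower bound of the PI."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w_row, x)) + b)
              for w_row, b in zip(w_hidden, b_hidden)]
    outs = [sum(wi * hi for wi, hi in zip(w_row, hidden)) + b
            for w_row, b in zip(w_out, b_out)]
    upper, lower = outs
    return lower, upper

# hypothetical network: 3 inputs, 2 hidden neurons, 2 outputs
w_hidden = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.2]]
b_hidden = [0.0, 0.0]
w_out = [[0.3, 0.2], [0.1, -0.1]]
b_out = [1.0, -1.0]  # output biases pushed apart so upper > lower initially
lower, upper = two_output_nn([0.5, -0.2, 0.1], w_hidden, b_hidden, w_out, b_out)
print(lower < upper)  # True
```

Nothing in the architecture itself forces the first output above the second; it is the CWC-based training (a low PICP is heavily penalized) that drives the two outputs to bracket the targets.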
CWC, defined in (4), is nonlinear, complex, and nondifferentiable. Therefore, gradient descent-based algorithms cannot
be applied for its minimization. Additionally, CWC is sensitive
to the NN set of parameters. As the gradient descent-based
techniques are notorious for being trapped in local minima,
their application may result in a suboptimal set of NN parameters.
With regard to this discussion, stochastic, gradient-free
methods are the best candidates for global optimization
of the PI-based cost function. The training algorithm here uses
the SA optimization technique for minimization of the CWC
cost function and adjustment of the NN set of parameters
(w). Fig. 3 shows the procedure for training and development
of the two output NN for construction of PIs. The training
(optimization) algorithm description is as follows.
Step 1: Data splitting.
The method starts with randomly splitting the
available data into the training set (Dtrain ) and test
[Fig. 3. LUBE method for PI-based training of NNs and PI construction.]

set (Dtest). If required, preprocessing of datasets is
completed in this stage.
Step 2: Initialization.
An NN with two outputs (similar to the one shown
in Fig. 2) is constructed. The initial parameters and
weights of this network can be randomly assigned.
An alternative is to train this network using the
traditional learning methods, such as Levenberg–
Marquardt algorithm, to approximate the real target. The obtained NN is then used for construction
of PIs for the training samples (Dtrain). PICP,
NMPIW, and CWC are calculated and considered
as the initial values for the training algorithm.
The NN set of parameters is also recorded as the
optimal set of NN parameters (wopt ). The cooling
temperature (T ) is set to a high value (T0 ) to allow
uphill movements in the early iterations of the
optimization algorithm.
Step 3: Temperature update.
The first step in the SA-based training loop is to
update the cooling temperature. Different cooling
schedules can be used depending on the application
and data behavior. Examples are linear, geometric,
and logarithmic [30], [31].
Step 4: Parameter perturbation.
A new set of parameters (wnew ) is obtained through
random perturbation of one of the current NN
TABLE I
DATASETS USED IN THE EXPERIMENTS

Case study | Target                              | Samples | Attributes | Reference
#1         | 5-D function with constant noise    | 300     | 5          | [33], [34]
#2         | 5-D function with nonconstant noise | 500     | 5          | [29]
#3         | T90                                 | 272     | 3          | [3], [35]
#4         | Concrete compressive strength       | 1030    | 8          | [36]
#5         | Plasma beta-carotene                | 315     | 12         | [36]
#6         | Dry bulb temperature                | 867     | 3          | [36]
#7         | Moisture content of raw materials   | 867     | 3          | [36]
#8         | Steam pressure                      | 200     | 5          | [37]
#9         | Main steam temperature              | 200     | 5          | [37]
#10        | Reheat steam temperature            | 200     | 5          | [37]

parameters. The perturbation should be quite small,
as the CWC cost function is sensitive to changes
in the NN parameters.
Step 5: PI construction.
PIs are constructed for the new set of NN parameters.
The fitness function (CWC) is calculated as
an indication of the quality of the constructed PIs. As
discussed before, γ is set to 1. This conservative
approach is intended to avoid excessive shortening
of PIs during the training stage, which may result
in a low PICP for test samples.
Step 6: Cost function evaluation.
a) If CWCnew ≤ CWCopt, wopt and CWCopt are replaced
with wnew and CWCnew, respectively. This
means that the new transition in the optimization
algorithm leads to higher quality PIs.
b) If CWCnew > CWCopt, a random number is
generated between zero and one (r ∈ [0, 1]).
A decision about the acceptance or rejection
of the new set of parameters is then made based on
the Boltzmann probability factor [32]. If r ≥
e^(−(CWCnew − CWCopt)/κT), again wopt and CWCopt
are replaced with wnew and CWCnew, respectively.
κ is the Boltzmann constant, which is an important
parameter in the SA algorithm.
c) If neither of the above happens, the optimal set of NN
parameters remains unchanged.
Step 7: Training termination.
The training algorithm terminates if any of the
following conditions is satisfied: the maximum
number of iterations is reached; no further improvement
is achieved for a specific number of
consecutive iterations; a very low temperature is
reached; or a very small CWC is found. Otherwise,
the optimization algorithm returns to Step 4.
Step 8: PIs for test samples.
Upon termination of the training, wopt is taken as
the set of NN parameters and PIs are
constructed for Dtest.
One of the key features of the LUBE method for construction
of PIs is its simplicity. Compared to the delta and
Bayesian techniques, it does not require calculation of any
derivatives and, therefore, it does not suffer from the singularity
problems of the Hessian matrix and its approximation [18].
Furthermore, it does not make any special assumption about
the data and residual distributions. This freedom makes the
method applicable to a wide range of problems without any
special consideration of the data distributions. As the LUBE
method uses only one NN model for constructing PIs, its
online computational requirement is at least B times less than
the bootstrap method for PI construction.
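The SA-based training loop (Steps 2–7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names (sa_minimize, steps_per_temp, t_min) are ours, the acceptance test uses the conventional Metropolis form, and a toy quadratic stands in for the CWC cost:

```python
import math, random

def sa_minimize(cost, w0, t0=5.0, beta=0.9, kappa=1.0,
                steps_per_temp=50, t_min=1e-3, seed=0):
    """Sketch of an SA loop minimizing a cost over a parameter list.
    One randomly chosen parameter is perturbed per iteration; worse
    moves are occasionally kept via the Boltzmann factor
    exp(-dC / (kappa * T)), and the temperature cools geometrically."""
    rng = random.Random(seed)
    w_opt, c_opt = list(w0), cost(w0)
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            w_new = list(w_opt)
            i = rng.randrange(len(w_new))
            w_new[i] += rng.gauss(0.0, 1.0)   # small random perturbation
            c_new = cost(w_new)
            # accept improvements always; worse moves with Boltzmann probability
            if c_new < c_opt or rng.random() < math.exp(-(c_new - c_opt) / (kappa * t)):
                w_opt, c_opt = w_new, c_new
        t *= beta                              # geometric cooling schedule
    return w_opt, c_opt

# toy quadratic cost standing in for CWC; minimum at the origin
w, c = sa_minimize(lambda w: sum(x * x for x in w), [3.0, -2.0])
print(len(w), c >= 0.0)
```

At high temperature the acceptance probability is close to one, allowing the uphill movements mentioned above; as the temperature drops, the loop becomes effectively greedy.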
Focus on the quality of PIs is another important feature
making the LUBE method distinct from the traditional PI
construction methods. While other methods construct PIs in
two steps (first doing point prediction and then constructing
PIs), the LUBE method directly generates PIs in one step
based on the set of NN inputs. This feature allows us to apply
a mechanism, as proposed here, for directly improving the
quality of PIs rather than reducing the point prediction error.
IV. EXPERIMENTAL RESULTS
This section presents the experiments conducted to evaluate the effectiveness of the LUBE method for construction of PIs. Structured in four subsections, it describes the datasets used for carrying out the experiments
(Section IV-A). The experiment methodology followed in all
case studies is explained in Section IV-B. Ample care is
exercised to provide each method with enough freedom to
reveal its PI construction power. Assessment and optimization
parameters are described in Section IV-C. Then the simulation
results are presented and discussed in Section IV-D. The
delta, Bayesian, bootstrap (pairs sampling [22]), and LUBE
methods are compared based on the quality of constructed PIs
as well as their computational requirements.
A. Data
Experiments are carried out using 10 datasets to verify the
effectiveness of the LUBE method. Table I outlines the characteristics of these datasets. The datasets represent a number
of synthetic and real world regression case studies from different domains, including mathematics, medicine, transportation,
environmental studies, electronics, and manufacturing. They
have a range of discrete and continuous input attributes.
The first case study is a 5-D dataset consisting of 300
randomly generated data from a highly nonlinear function.
The second case study is the synthetic function described in
[29], where the additive noise has a normal distribution with
TABLE II
PARAMETERS USED IN THE EXPERIMENTS

Parameter                  | Numerical value
Dtrain                     | 70% of all samples
Dtest                      | 30% of all samples
α                          | 0.1
η                          | 50
μ                          | 0.90
κ                          | 1
T0                         | 5
Geometric cooling schedule | Tk+1 = β Tk
Number of bootstrap models | 10
a nonconstant variance. Data in case study 3 come from a
real-world baggage handling system. The target is to predict
the travel times of 90% of bags in the system based on
the check-in gates, exit points, and workload. The fourth
dataset consists of 1030 samples of concrete compressive
strength, where eight inputs are used to approximate the target.
The relationship between personal characteristics and dietary
factors, and plasma concentrations of beta-carotene is studied
in case study 5. Dry bulb temperature and moisture content
of raw material for an industrial dryer are approximated using
three inputs in case studies 6 and 7. Datasets in case studies
8, 9, and 10 come from a power station, where the targets are
steam pressure, main steam temperature, and reheat steam
temperature, respectively. All datasets are available from the
authors on request.
B. Experiment Methodology
Performance of PI construction methods depends on the
NN structure and the training process. Therefore, it is important to give each method enough freedom and flexibility
to generate the best possible results. Single hidden layer
NNs are considered in the experiments conducted in this
paper. For all compared methods, the optimal NN structure is
found through applying a fivefold cross-validation technique,
where the number of neurons is varied between 1 and 20.
Experiments are repeated five times for each case study to
avoid issues with the random sampling of datasets. For the
LUBE method, the single-layer NNs are trained using the
proposed method in Section III. Training is driven using CWC
as the cost function with γ = 1. Upon completion of the
training stage, PICP, NMPIW, and CWC are calculated for
test samples (Dtest), where γ(PICP) is the step function.
The median of these measures is then used for comparing
performance of the four methods for ten case studies.
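The structure-selection procedure described above can be sketched as follows, under the assumption that a score function wraps NN training for a given hidden-layer size and returns a validation cost such as CWC; kfold_indices and select_structure are hypothetical helper names, and a toy score replaces actual NN training:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Split sample indices into k folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def select_structure(score, n_samples, max_neurons=20):
    """Pick the hidden-layer size (1..max_neurons) with the lowest
    average validation cost over five folds. score(neurons, train, valid)
    stands in for training an NN of that size and returning its CWC."""
    folds = kfold_indices(n_samples)
    best, best_cost = None, float("inf")
    for neurons in range(1, max_neurons + 1):
        costs = []
        for i, valid in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            costs.append(score(neurons, train, valid))
        avg = sum(costs) / len(costs)
        if avg < best_cost:
            best, best_cost = neurons, avg
    return best

# toy score whose minimum sits at 7 hidden neurons
print(select_structure(lambda m, tr, va: (m - 7) ** 2, 100))  # 7
```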
C. Parameters
Table II summarizes parameters used in the experiments
and the NN training process. The data are randomly split into
training (70%) and test (30%) datasets. The level of confidence
associated with all PIs is 90%. Initial temperature is set to
5 to allow uphill movements in the early iterations of the
optimization algorithm. A geometric cooling schedule with a
cooling factor of 0.9 is applied in the LUBE method. The NN
parameter perturbation function in the LUBE method generates random numbers whose elements are normally distributed
with mean zero and unit variance.
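A minimal sketch of this perturbation function, which moves exactly one randomly chosen parameter by a standard-normal step; perturb is a hypothetical name:

```python
import random

def perturb(w, rng):
    """Perturb one randomly chosen NN parameter with a
    standard-normal step (mean zero, unit variance)."""
    w_new = list(w)
    i = rng.randrange(len(w_new))
    w_new[i] += rng.gauss(0.0, 1.0)
    return w_new

rng = random.Random(1)
w = [0.0, 0.0, 0.0]
w_new = perturb(w, rng)
changed = sum(1 for a, b in zip(w, w_new) if a != b)
print(changed)  # 1: exactly one parameter moved
```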
μ is set to 0.9, because the prescribed level of confidence
of PIs is 90%. Also, η is selected to be 50 in order to highly
penalize PIs with a coverage probability lower than 90%.
According to these values, the profile of CWC will be similar
to the middle plot in Fig. 1.
D. Results and Discussion
The convergence behavior of CWC and profile of the
cooling temperature for the first six case studies for the LUBE
method are shown in Fig. 4. For a better graphical visualization
of the optimization algorithm effort, extreme values of CWC
have not been displayed in some plots. CWC plots for case
studies 1 and 6 are logarithmic, and show the CWC variation
throughout the NN training. In the early iterations, when the
temperature is high (T = 2), CWC rapidly fluctuates allowing
uphill movements. As the temperature drops, the optimization algorithm becomes more greedy. The CWC decreases
gradually, but nonmonotonically. For temperatures less than 1,
CWC takes a continuous downward trend and only improving
movements are allowed.
The initial value of CWC is very large (CWC0 ≥ 10^80 for
all cases). This means that the initial set of parameters obtained
using the traditional training techniques has not been suitable.
Such a large CWC is mainly due to an unsatisfactorily low
PICP. The number of iterations varies for different case studies
and ranges from approximately 500 to 3000. Each iteration
takes less than 0.02 s to be completed. This means the NN
training is completed in less than 1 min in the worst case.
Therefore, the convergence speed of the SA-based training
algorithm is acceptable.
Variation of NN parameters 21 to 25 during the optimization
process is displayed in Fig. 5 for case studies 7 to 10. These
parameters change throughout the optimization, indicating the
complexity of the cost function in the multidimensional space
of the NN parameters.
The median of PICP and NMPIW for the test datasets
(Dtest) are listed in Table III. The PIs obtained using the LUBE
method are compared with those constructed using the delta,
Bayesian, and bootstrap techniques. Hereafter and for ease
of reference, Delta, Bays, BS, and LUBE subscripts are used
for indicating the delta, Bayesian, bootstrap, and the LUBE
method in this paper for the construction of PIs. It is expected
that PICPs will be at least equal to 90%, as the confidence
level associated with all PIs is 90%.
According to the results in Table III, the coverage probability of PI_LUBE is better than that of the other PIs in the majority of the conducted experiments. While the number of undercoverage cases for the LUBE technique is 1, it is 6, 4, and 4 for the delta, Bayesian, and bootstrap methods, respectively. It is only for case study 7 that PICP_LUBE is slightly lower than the nominal confidence level (89.47% instead of 90%). The mean and median of PICP_Delta and PICP_BS for the 10 case studies are lower than the nominal confidence level. This indicates that PIs constructed using these methods suffer from the undercoverage
Fig. 4. Profile of the cooling temperature (solid line) and the CWC (dashed line) for the six case studies.

Fig. 5. Evolution of NN parameters during the training algorithm.
problem and are therefore not very reliable in practice. The mean and median of PICP_Bays for all case studies are 89.93% and 93.70%, respectively. Although the median is above the nominal confidence level, PICP_Bays fluctuates strongly across the case studies. The minimum of PICP_Bays occurs for case study 9 (75%), where the constructed PIs are totally unreliable. For the LUBE method, the mean and median of PICP are 92.00% and 91.36%. Moreover, the standard deviation of PICP_LUBE is 2.3%, almost five times less than the standard deviation of PICP_Bays. These statistics indicate
TABLE III
PI CHARACTERISTICS FOR TEST SAMPLES (D_test) OF THE TEN CASE STUDIES

Case    Delta technique      Bayesian technique   Bootstrap technique   LUBE method
study   PICP(%)  NMPIW(%)    PICP(%)  NMPIW(%)    PICP(%)  NMPIW(%)     PICP(%)  NMPIW(%)
  1      85.56    79.37      100.00   120.12       85.56    85.47        90.00    72.22
  2      86.67    11.29       92.67    16.54       98.00    24.53        92.00    23.09
  3      93.90    57.97       97.56    70.39       82.93    36.48        91.46    41.51
  4      89.00    25.42       88.67    22.17       91.26    27.52        91.26    35.57
  5      90.39    36.12       97.31    36.20       93.85    37.51        91.15    33.91
  6      88.08    54.73      100.00    78.72       58.46    32.09        94.62    69.39
  7      95.79    47.04       94.74    60.60       85.26    33.13        89.47    41.80
  8      91.67    43.45       76.67    13.08       96.67    50.79        93.33    45.05
  9      83.33    15.85       75.00     8.52       91.67    32.62        96.67    24.71
 10      85.00    25.58       76.67    15.99       96.67    50.79        90.00    31.52
Fig. 6. Median of CWC measure for PIs constructed for test samples using the Bayesian, delta, and LUBE methods.
that the LUBE-based PIs: 1) properly cover the targets in
different replicates of an experiment (median above 90%), and
2) show a consistent behavior for different case studies (small
standard deviation with a mean and median above the nominal
confidence level).
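These summary statistics can be reproduced directly from the PICP_LUBE column of Table III (a short NumPy check; ddof=1 gives the sample standard deviation):

```python
import numpy as np

# PICP_LUBE values for the 10 case studies, taken from Table III.
picp_lube = np.array([90.00, 92.00, 91.46, 91.26, 91.15,
                      94.62, 89.47, 93.33, 96.67, 90.00])

print(picp_lube.mean())       # ≈ 92.00, the reported mean
print(np.median(picp_lube))   # ≈ 91.36, the reported median
print(picp_lube.std(ddof=1))  # ≈ 2.3, the reported standard deviation
```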
The medians of CWC_Delta, CWC_Bays, CWC_BS, and CWC_LUBE for the 10 case studies are displayed in Fig. 6. The CWCs are evaluated using γ(PICP) as the step function [see (5)]. As some CWCs, for instance CWC_BS for case study 6 or CWC_Bays for case studies 9 and 10, are very large, the CWC axis is limited to 200 for better visualization. CWC_LUBE is lower than the other CWCs for 8 out of 10 case studies. CWC_LUBE is always less than 100, which means that PI_LUBE are correct and sufficiently narrow. For each of the other three methods, there are cases where the computed CWCs for the constructed PIs are greater than 100 (four cases for each method). This means that the LUBE method establishes a tradeoff between the correctness and informativeness of PIs more effectively than the other methods. Although PI_Delta are on average narrower than PI_LUBE, they are incorrect due to a low coverage probability.

The Bayesian method generates the highest quality PIs for the case of noise with a nonconstant variance (case study 2). PI_Bays are narrower than the other PIs, with an acceptable coverage probability. PI_LUBE rank second in terms of overall quality. Bootstrap-based PIs are wider than the others in five replicates in order to achieve a high coverage probability. PI_Delta are excessively narrow, resulting in an unacceptable coverage probability of 86%. According to these results, the Bayesian and LUBE methods are the best candidates for PI construction in cases where there is additive noise with a nonconstant variance.
Apart from the performance comparison, it is important to compare the four methods in terms of the computational burden required for constructing PIs. The LUBE method produces a more efficient set of PIs with a much lower computational requirement. The time required for the construction of PI_Delta, PI_Bays, PI_BS, and PI_LUBE for D_test is shown in Table IV. A comparison of the times in Table IV reveals that the LUBE method constructs PIs in periods thousands of times shorter than those required by the delta and Bayesian methods. The time required for the LUBE method to construct PIs for the test samples is 4 ms in the worst case, while on average it is 4447 ms and 3368 ms for the Bayesian and delta techniques, respectively. The large computational load of the delta and Bayesian techniques is mainly due to the calculation
TABLE IV
PI CONSTRUCTION TIME USING THE BAYESIAN, DELTA, BOOTSTRAP, AND LUBE TECHNIQUES FOR THE TEST SAMPLES

                        Time (ms)
Case study   Bayesian   Delta   Bootstrap   LUBE
    1          4195      3562       51        3
    2          4761      3159       54        3
    3          3508      3181       51        3
    4          3498      3199       55        3
    5          7818      5924       56        3
    6          5826      3886       56        4
    7          5813      3892       51        4
    8          3002      2500       51        3
    9          3265      2029       50        4
   10          2779      2368       51        3
of the gradient of the NN outputs with respect to its parameters. The computational requirement of the bootstrap method is also greater than that of the LUBE method by a factor of at least B (here more than 10). This is because B = 10 NNs, with some extra calculations, are required for the construction of PIs for a test sample.
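The B-fold overhead can be seen in a simplified sketch of bootstrap interval construction (a common simplification that uses only the spread of the ensemble; the full bootstrap method also estimates a noise-variance term, and the names here are ours):

```python
import numpy as np

def bootstrap_pi(x, models, z=1.645):
    """Aggregate B bootstrap-trained models into an approximate 90% PI:
    one forward pass per model, so the cost grows linearly with B."""
    preds = np.array([m(x) for m in models])  # B predictions per test input
    mean = preds.mean()
    std = preds.std(ddof=1)                   # spread across the ensemble
    return mean - z * std, mean + z * std
```

By contrast, the LUBE network produces both bounds in a single forward pass, which accounts for the roughly B-fold difference in Table IV.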
The computational cost becomes important in cases where the constructed PIs are used for optimization purposes. For instance, in case study 3, the information provided by PIs can be used for operational planning and scheduling in the underlying baggage handling system. Thousands of operational scenarios must be generated and evaluated in a short time to determine which guarantees smooth operation of the whole system. As the computational cost of the delta, Bayesian, and bootstrap techniques is high, their application for these purposes is limited. In contrast, the LUBE method can easily be applied in real time to construct PIs for different operational scenarios. This paves the way for wider application of PIs in real-time planning and decision making.
In summary, in almost all experiments reported in Table III and Fig. 6, the quality of PI_LUBE is at least equal or superior to that of PI_Delta, PI_Bays, and PI_BS in terms of the PICP and NMPIW measures. The demonstrated results satisfactorily establish the reliability and quality of the PIs constructed using the LUBE method. Moreover, its computational requirement for the construction of PIs is substantially lower than that of the other three methods. It is therefore reasonable to conclude that the LUBE method constructs high-quality PIs in a very short time.
V. C ONCLUSION
In this paper, a new method was proposed for the construction of PIs using NN models. The LUBE method uses an NN with two outputs to construct the upper and lower bounds of the PIs. For training this NN, a new cost function was proposed based on the two key features of PIs: width and coverage probability. As the proposed cost function is nonlinear, complex, and nondifferentiable, an SA method is applied for the minimization of this prediction interval-based cost function and for training the NN model. Through synthetic and real case studies, it was demonstrated that the LUBE method is effective and reliable for construction of PIs.
The comparative study revealed that the quality of prediction
intervals constructed using the LUBE method is in many cases
superior to those constructed using the delta, Bayesian, and
bootstrap techniques. Furthermore, the computational expense
of the proposed method for construction of PIs is virtually nil
compared to that of the delta and Bayesian techniques.
The LUBE method can be modified in different ways for
generating narrower PIs with a higher coverage probability. As
with any other application of neural networks, performance of
the proposed method depends on the network structure. The
effectiveness of the LUBE method can be easily improved
through the combination of the LUBE method with one of
the NN structure selection techniques [34]. Currently, the
proposed optimization method does not consider the network
structure. The NN structures in this paper were selected on
a trial-and-error basis. The current training algorithm uses the SA method for minimization of the cost function. Our minimization experiments using the genetic algorithm were unsuccessful, as it diverged in the majority of cases.
However, the application of other optimization methods may
enhance the quality of constructed prediction intervals. Finally,
the overfitting problem of the developed NN can be avoided
by applying cross validation, pruning, and weight elimination
techniques.
ACKNOWLEDGMENT
The authors are grateful to the anonymous reviewers for
their helpful comments and suggestions.
R EFERENCES
[1] C. P. I. J. van Hinsbergen, J. W. C. van Lint, and H. J. van Zuylen,
“Bayesian committee of neural networks to predict travel times with
confidence intervals,” Trans. Res. Part C: Emerg. Technol., vol. 17, no.
5, pp. 498–509, Oct. 2009.
[2] W.-H. Liu, “Forecasting the semiconductor industry cycles by bootstrap
prediction intervals,” Appl. Econ., vol. 39, no. 13, pp. 1731–1742, 2007.
[3] A. Khosravi, S. Nahavandi, and D. Creighton, “A prediction interval-based approach to determine optimal structures of neural network
metamodels,” Expert Syst. Appl., vol. 37, no. 3, pp. 2377–2387, Mar.
2010.
[4] C. M. Bishop, Neural Networks for Pattern Recognition. London, U.K.:
Oxford Univ. Press, 1995.
[5] D. L. Shrestha and D. P. Solomatine, “Machine learning approaches for
estimation of prediction interval for the model output,” Neural Netw.,
vol. 19, no. 2, pp. 225–235, Mar. 2006.
[6] S. L. Ho, M. Xie, L. C. Tang, K. Xu, and T. N. Goh, “Neural network
modeling with confidence bounds: A case study on the solder paste
deposition process,” IEEE Trans. Electron. Packag. Manufact., vol. 24,
no. 4, pp. 323–332, Oct. 2001.
[7] J. H. Zhao, Z. Y. Dong, Z. Xu, and K. P. Wong, “A statistical approach
for interval forecasting of the electricity price,” IEEE Trans. Power Syst.,
vol. 23, no. 2, pp. 267–276, May 2008.
[8] A. Khosravi, S. Nahavandi, and D. Creighton, “Construction of optimal
prediction intervals for load forecasting problems,” IEEE Trans. Power
Syst., vol. 25, no. 3, pp. 1496–1503, Aug. 2010.
[9] S. G. Pierce, K. Worden, and A. Bezazi, “Uncertainty analysis of a
neural network used for fatigue lifetime prediction,” Mech. Syst. Signal
Process., vol. 22, no. 6, pp. 1395–1411, Aug. 2008.
[10] D. F. Benoit and D. Van den Poel, “Benefits of quantile regression
for the analysis of customer lifetime value in a contractual setting: An
application in financial services,” Expert Syst. Appl., vol. 36, no. 7, pp.
10475–10484, Sep. 2009.
[11] N. Meade and T. Islam, “Prediction intervals for growth curve forecasts,”
J. Forecast., vol. 14, no. 5, pp. 413–430, Sep. 1995.
346
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 22, NO. 3, MARCH 2011
[12] T. Heskes, “Practical confidence and prediction intervals,” in Advances
in Neural Information Processing Systems, vol. 9, T. P. M. Mozer and
M. Jordan, Eds. Cambridge, MA: MIT Press, 1997, pp. 176–182.
[13] J. G. De Gooijer and R. J. Hyndman, “25 years of time series forecasting,” Int. J. Forecast., vol. 22, no. 3, pp. 443–473, 2006.
[14] G. Chryssolouris, M. Lee, and A. Ramsey, “Confidence interval prediction for neural network models,” IEEE Trans. Neural Netw., vol. 7, no.
1, pp. 229–232, Jan. 1996.
[15] J. T. G. Hwang and A. A. Ding, “Prediction intervals for artificial neural
networks,” J. Amer. Stat. Assoc., vol. 92, no. 438, pp. 748–757, Jun.
1997.
[16] C. J. Wild and G. A. F. Seber, Nonlinear Regression. New York: Wiley,
1989.
[17] A. A. Ding and X. He, “Backpropagation of pseudo-errors: Neural
networks that are adaptive to heterogeneous noise,” IEEE Trans. Neural
Netw., vol. 14, no. 2, pp. 253–262, Mar. 2003.
[18] R. D. De Veaux, J. Schumi, J. Schweinsberg, and L. H. Ungar,
“Prediction intervals for neural networks via nonlinear regression,”
Technometrics, vol. 40, no. 4, pp. 273–282, Nov. 1998.
[19] T. Lu and M. Viljanen, “Prediction of indoor temperature and relative
humidity using neural network models: Model comparison,” Neural
Comput. & Appl., vol. 18, no. 4, pp. 345–357, Mar. 2009.
[20] A. Khosravi, S. Nahavandi, and D. Creighton, “Improving prediction
interval quality: A genetic algorithm-based method applied to neural
networks,” in Proc. 16th Int. Conf. Neural Inf. Process.: Part II, vol.
5864. 2009, pp. 141–149.
[21] D. J. C. MacKay, “The evidence framework applied to classification networks,” Neural Comput., vol. 4, no. 5, pp. 720–736,
Sep. 1992.
[22] R. Dybowski and S. J. Roberts, “Confidence intervals and prediction
intervals for feed-forward neural networks,” in Clinical Applications of
Artificial Neural Networks, R. Dybowski and V. Gant, Eds. Cambridge,
U.K.: Cambridge Univ. Press, 2001, pp. 298–326.
[23] J. G. Carney, P. Cunningham, and U. Bhagwan, “Confidence and
prediction intervals for neural network ensembles,” in Proc. Int.
Joint Conf. Neural Netw., vol. 2. Washington D.C., Jul. 1999,
pp. 1215–1218.
[24] E. Zio, “A study of the bootstrap method for estimating the accuracy
of artificial neural networks in predicting nuclear transient processes,”
IEEE Trans. Nucl. Sci., vol. 53, no. 3, pp. 1460–1478, Jun. 2006.
[25] F. Giordano, M. La Rocca, and C. Perna, “Forecasting nonlinear time
series with neural network sieve bootstrap,” Comput. Stat. & Data Anal.,
vol. 51, no. 8, pp. 3871–3884, May 2007.
[26] N. O. Oleng’, A. Gribok, and J. Reifman, “Error bounds for data-driven
models of dynamical systems,” Comput. Biol. Med., vol. 37, no. 5, pp.
670–679, May 2007.
[27] D. A. Nix and A. S. Weigend, “Estimating the mean and variance of the
target probability distribution,” in Proc. IEEE Int. Conf. Neural Netw.
World Congr. Comput. Intell., vol. 1. Orlando, FL, Jun.–Jul. 1994, pp.
55–60.
[28] I. Rivals and L. Personnaz, “Construction of confidence intervals for
neural networks based on least squares estimation,” Neural Netw., vol.
13, nos. 4–5, pp. 463–484, Jun. 2000.
[29] G. Papadopoulos, P. J. Edwards, and A. F. Murray, “Confidence estimation methods for neural networks: A practical comparison,” IEEE Trans.
Neural Netw., vol. 12, no. 6, pp. 1278–1287, Nov. 2001.
[30] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated Annealing: Theory
and Applications. Boston, MA: Kluwer, 1987.
[31] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines:
A Stochastic Approach to Combinatorial Optimization and Neural
Computing (Discrete Mathematics and Optimization). New York: Wiley,
1989.
[32] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, “Optimization by
simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, May
1983.
[33] S. Hashem, “Optimal linear combinations of neural networks,” Neural
Netw., vol. 10, no. 4, pp. 599–614, Jun. 1997.
[34] L. Ma and K. Khorasani, “New training strategies for constructive neural
networks with application to regression problems,” Neural Netw., vol.
17, no. 4, pp. 589–609, May 2004.
[35] A. Khosravi, S. Nahavandi, and D. Creighton, “Constructing prediction
intervals for neural network metamodels of complex systems,” in Proc.
Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009, pp. 1576–1582.
[36] P. Vlachos. (2010, Jan. 10). StatLib Datasets Archive [Online]. Available:
http://lib.stat.cmu.edu/datasets
[37] B. De Moor. (2010, Jan. 10). DaISy: Database for the
Identification of Systems. Dept. Elect. Eng., ESAT/SISTA,
Katholieke Univ. Leuven, Leuven, Belgium [Online]. Available:
http://homes.esat.kuleuven.be/smc/daisy/
Abbas Khosravi (M’07) received the B.Sc. degree
in electrical engineering from Sharif University of
Technology, Tehran, Iran, and the M.Sc. (hons.)
degree in electrical engineering from Amirkabir
University of Technology, Tehran, in 2002 and 2005,
respectively. His specialization was artificial intelligence (AI).
He joined the eXiT Group as a Research Academic
at the University of Girona, Girona, Spain, in 2006,
working in the area of AI applications. Currently,
he is a Research Fellow in the Center for Intelligent Systems Research, Deakin University, Victoria, Australia. His current
research interests include the development and application of AI techniques
for (meta)modeling, analysis, control, and optimization of operations within
complex systems.
Saeid Nahavandi (SM’07) received the B.Sc.
(hons.), M.Sc., and Ph.D. degrees in automation and
control from Durham University, Durham, U.K.
He is the Alfred Deakin Professor, Chair of
Engineering, and the Director for the Center for
Intelligent Systems Research, Deakin University,
Victoria, Australia. He has published over 350 peer-reviewed papers in various international journals and
conference proceedings. He designed the world’s
first 3-D interactive surface/motion controller. His
current research interests include modeling of complex systems, simulation-based optimization, robotics, haptics, and augmented
reality.
Dr. Nahavandi was a recipient of the Young Engineer of the Year award in
1996 and six international awards in Engineering. He is the Associate Editor
of the IEEE S YSTEMS J OURNAL, an Editorial Consultant Board Member for
the International Journal of Advanced Robotic Systems, and Editor (South
Pacific Region) of the International Journal of Intelligent Automation and
Soft Computing. He is a Fellow of Engineers Australia and the Institution of
Engineering and Technology.
Doug Creighton received the B.Eng. (hons.)
degree in systems engineering and the B.Sc. degree
in physics from the Australian National University,
Canberra, Australia, in 1997, where he attended as
a National Undergraduate Scholar. He received the
Ph.D. degree in simulation-based optimization from
Deakin University, Victoria, Australia, in 2004.
He spent several years as a software consultant
prior to obtaining the Ph.D. He is currently a Research Academic and Stream Leader with the Center
for Intelligent Systems Research, Deakin University.
His current research interests include modeling, discrete event simulation,
intelligent agent technologies, human machine interface and visualization,
simulation-based optimization research, development of algorithms to allow
the application of learning agents to industrial-scale systems for use in
optimization, dynamic control, and scheduling.
Amir F. Atiya (S’86–M’90–SM’97) received the
B.S. degree from Cairo University, Cairo, Egypt, in
1982, and the M.S. and Ph.D. degrees from California Institute of Technology (Caltech), Pasadena,
in 1986 and 1991, respectively, all in electrical
engineering.
He is currently a Professor in the Department
of Computer Engineering, Cairo University. He has
held several visiting appointments, including with
Caltech and Chonbuk National University, Jeonju,
South Korea. His current research interests include
neural networks, machine learning, theory of forecasting, computational
finance, and Monte Carlo methods.
Dr. Atiya was a recipient of several awards, including the Kuwait Prize in
2005. He was an Associate Editor for the IEEE T RANSACTIONS ON N EURAL
N ETWORKS from 1998 to 2008.