Using Clustering to Make Prediction Intervals for Neural Networks
Claus Benjaminsen
ECE 539
12/21/2005
Abstract
This project describes a way of estimating prediction intervals using clustering. The method is developed and implemented in Matlab, and tested on a small 2D dataset. Its performance is compared with the standard method of making prediction intervals using several identical neural networks, and with a baseline that simply applies equal-sized intervals to all predictions. The results show that the clustering method can indeed be used to estimate prediction intervals, and that its performance can even exceed that of the standard method.
Table of Contents
Abstract
Introduction
Motivation
Approach
  Data set
  How the models are set up
Implementation
Results
  Comparing the 3 methods
  Testing with different values of the bias and slag parameters
Discussion
Conclusion
References
Introduction
Neural networks are widely used nowadays to solve all kinds of problems related to classification and regression. A special group of problems has to do with prediction: estimating the output of an unknown function given a set of new inputs. This could for instance be forecasting the total overnight snowfall using today's weather data, or predicting how many hats of a certain brand will be sold this Christmas, given the money spent on marketing and the average temperature in December. Both these examples are regression problems, in that the output is a numerical value, and both are typical problems that could be solved by neural networks. Taking the first example, a trained neural network might predict that the total snowfall will be 2 inches. This estimate might be right or wrong, close to the actual snowfall or far from it. Sometimes the neural network will be very accurate and sometimes it won't, but for a person who wants to plan when to get up tomorrow in order to make it to work on time, the estimate alone might not be sufficient information. For him it could be more relevant to know the maximum possible snowfall, so that he can take precautions and get up early if there is a chance of more than, say, 5 inches of snow.
This kind of inference has to do with prediction intervals. Instead of predicting only a point estimate of the actual snowfall, an interval covering the possible range of snowfall can be estimated. These intervals can be given in many forms, for instance by a max and min value and possibly a confidence percentage, or by the mean and variance of a Gaussian distribution. Either way, the output is supplied with a lot of extra information and can be used to determine how precise the estimated output value is (the width of the prediction interval) and what the possible max and min values of the actual output are (the limits of the prediction interval).
The standard way of estimating prediction intervals with neural networks is to train multiple identical networks and use the variance of their predicted outputs to estimate a prediction interval for each output. This method normally makes some assumptions about the noise added to the unknown function, and can have problems in certain situations. Therefore this project investigates the possibility of implementing interval prediction with the help of clustering.

The idea is that if the noise in the unknown function is input dependent, inputs close together might be subject to the same underlying noise function, and hence their outputs might be about equally hard to predict. This can then be used to give the prediction intervals of similar inputs an equal size, and in that way give a good estimate of the possible range of output values.
Motivation
The motivation for this project is to test the possibility of using clustering in interval prediction. In many real-world problems prediction intervals can be a big help in making prediction-based decisions, so a good method of estimating them could be used extensively. Also, in many situations the standard method of training several identical neural networks is infeasible, because the model easily grows very large and training several neural networks on a large dataset can take a long time. The proposed method only requires training one neural network and clustering the training data, which would normally be much faster than training many neural networks.
Approach
In this project the main focus has been on implementing and testing the performance of interval prediction using a neural network and clustering. To evaluate the performance, two other methods of constructing prediction intervals have been included. The first is the baseline, which simply applies equally large prediction intervals to all predictions. The second is the standard way of implementing prediction intervals using neural networks. In this project it uses 10 identical neural networks, trained on the same data, to estimate the variance of each predicted output, and from these variances estimates the corresponding prediction intervals.
Data set
For this project I have chosen to use some easily accessible data from a neural network competition1 set up on the internet. The reason is twofold: first, it meant that I didn't have to spend a lot of time collecting data and transforming it into a format that can easily be processed. Secondly, the competition the data comes from was what gave me the inspiration for this project. It focuses exactly on making good prediction intervals using neural networks, so the datasets from the competition are very relevant for testing the implementations in this project. The competition has 4 different datasets:

- Synthetic – a small synthetic dataset
- Precip – a large dataset consisting of precipitation data
- Temp – a large dataset with temperature data
- SO2 – a medium-sized dataset of various measurements for predicting SO2 levels
I have only used the Synthetic dataset to develop the methods, as it is 2D (one input and one output) and therefore easy to work with and plot. It consists of 256 points divided into two files, one containing all the x-coordinates (synthetic_train.inputs) and one with the corresponding y-coordinates (synthetic_train.targets). A plot of this dataset is shown below in figure 1.
1. Predictive Uncertainty in Environmental Modeling Competition, http://theoval.cmp.uea.ac.uk/~gcc/competition/
Figure 1. A Matlab plot of the Synthetic dataset used in this project.
How the models are set up
When training a neural network it is often desirable that the feature vectors (the input part) of the training/testing data are scaled into a certain range. This both gives the same weight to different features and makes it easier for the neural network to adapt to the inputs. Also, since the output of a neural network is computed by a nonlinear function with a limited range of output values, the targets (the output values of the training/test set) need to be scaled into this range in order for the network to be able to make predictions. Therefore the Synthetic dataset is randomized and then scaled before it is divided into a training set and a testing set.
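As an illustration, the preprocessing might look like the following minimal Matlab sketch (the variable names are my own, not those of the project code; the ranges are the ones used in the results section):

    % Rescale inputs to [-5, 5] and targets to [0.2, 0.8], then split the
    % randomized data into a training set and a test set.
    x = -5 + 10 * (x - min(x)) / (max(x) - min(x));     % feature scaling
    y = 0.2 + 0.6 * (y - min(y)) / (max(y) - min(y));   % target scaling
    idx   = randperm(numel(y));                         % randomize the order
    test  = idx(1:100);                                 % 100 samples held out
    train = idx(101:end);                               % remaining 156 samples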
Using only the training set, 10 identical neural networks are trained using backpropagation with a small random perturbation of the weights, and the weights giving the lowest training error are stored. Then both the training feature vectors and the test feature vectors are fed to all 10 networks, and the mean and the variance of each predicted output are calculated from the 10 individual predictions. The mean values serve as the predicted point estimates of the targets. The training error is calculated as the sum of the distances between these predicted point estimates and the training targets. This value gives the minimum total sum of prediction intervals needed to cover all the training data for the found point predictions. Since all prediction intervals in this project are symmetric, the sum of the prediction intervals is calculated as the sum of the half interval lengths.
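Assuming the 10 networks' training set predictions are stacked in a 10 x N matrix pred_train (my notation, not the project's), these quantities could be computed as:

    y_hat   = mean(pred_train, 1);     % ensemble means = point estimates
    y_var   = var(pred_train, 0, 1);   % per-sample variance over the 10 nets
    e_train = abs(y_hat - t_train);    % half interval needed to cover each target
    E_total = sum(e_train);            % training error = minimum total interval sum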
Given the training error plus an optional slag ("slack") parameter, the baseline prediction intervals are formed as equal-sized intervals with a total sum equal to the training error. The standard method (from now on referred to as the variance method) scales the variance of each prediction, and adds a possible bias term, to make the sum of the variances (prediction intervals) equal to the training error.
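In the case of zero bias and zero slag, the two reference methods then reduce to something like this sketch (continuing the notation above):

    N        = numel(e_train);
    int_base = (E_total / N) * ones(1, N);   % baseline: equal half intervals
    s        = E_total / sum(y_var);         % variance method: scale factor
    int_var  = s * y_var;                    % scaled variances sum to E_total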
Finally the clustering method is applied, in which a k-means clustering algorithm is run on the input feature vectors to find a number of cluster centers. The membership of each training feature vector to these cluster centers is then found, and the mean of the training errors of all the feature vectors in a given cluster is assigned to that cluster center. These mean errors are then scaled, and possibly added to a bias term, to have a total sum equal to the total training error, and can thereby serve as prediction intervals for the training data.
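A sketch of this step, using Matlab's built-in kmeans in place of the project's cluster.m and kmeantest.m, and assuming no cluster ends up empty:

    n_centers = 52;
    [memb, C] = kmeans(x_train(:), n_centers);   % memb(i) = cluster of sample i
    m_err = zeros(1, n_centers);
    for j = 1:n_centers
        m_err(j) = mean(e_train(memb == j));     % mean training error per cluster
    end
    s        = E_total / sum(m_err(memb));       % rescale so the training intervals
    int_clus = s * m_err;                        % sum to the total training error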
When calculating the prediction intervals for the test data, the baseline method simply assigns the size of the training prediction intervals to all the test prediction intervals. The variance method scales the variances of the test predictions by the same amount as the training set variances were scaled. Finally, the clustering method first determines the cluster membership of the given test feature vector, and then assigns the mean error associated with the found cluster center to that feature vector. This mean error is then scaled by the same factor as during training, and the resulting value defines the test prediction interval.
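For the clustering method this lookup amounts to a nearest-center search, sketched here for the 1-D inputs of the Synthetic set (the implicit expansion assumes a reasonably recent Matlab; the original code used kmeantest.m instead):

    d = abs(x_test(:) - C');           % distance from each test input to each center
    [~, memb_test] = min(d, [], 2);    % nearest cluster center per test sample
    int_test = int_clus(memb_test);    % that cluster's scaled mean error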
The performance measures used in this project are the number of targets that fall within the prediction intervals, and the cost function, which is determined as the mean squared distance from the edge of the prediction intervals to the corresponding targets, for all the targets that fall outside the prediction intervals.
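These two measures could be computed as in the sketch below; I read the "mean" in the cost function as being taken over the targets that fall outside their intervals:

    err    = abs(y_hat - t);       % distance from point prediction to target
    inside = err <= intervals;     % targets covered by their prediction interval
    n_in   = sum(inside);          % first measure: coverage count
    excess = err(~inside) - intervals(~inside);
    cost   = mean(excess .^ 2);    % second measure: cost function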
Implementation
The implementation of the methods described above is done in Matlab, and makes extensive use of the Matlab programs developed by professor Yu Hen Hu2, or modifications of these programs. The main file is IntervalNN, which sets up all the parameters for training the 10 neural networks, scales the data and divides it into a training set and a test set. It then calls NNfunction, a modified version of bp.m, which takes care of training a neural network with the given training data and returns the found weights and the training data predictions. Given the weights, the test set predictions are calculated, and this procedure is repeated 10 times. Then the means and variances of the training and test predictions are determined, along with the training and test errors. Finally all the values are saved in IntNNvar.mat and the program c_test is called.

2. These programs can be found on the class webpage: http://homepages.cae.wisc.edu/~ece539/matlab/index.html
In c_test the prediction intervals for the baseline method and the variance method are calculated, and the performance of these intervals on both the training and test data is found. Then the cluster centers of the training feature vectors are found using the function cluster.m, a modified version of clusterdemo.m by professor Yu Hen Hu, and the membership of each training feature vector is determined using kmeantest.m, also developed by professor Yu Hen Hu. Then the prediction intervals are formed and the training performance evaluated, after which the same is done for the test set. The whole scheme is run multiple times for different numbers of cluster centers, so the performance can be compared.
Finally, disp_performance.m can be run. It uses the prediction intervals given by the number of cluster centers that gives the best performance in terms of the minimum cost function for the test data. These prediction intervals are plotted along with the training data and training predictions, and the same is done for the other two methods. Similar plots are made for the test data.

Another small program, test_bias.m, can also be run. It tests the performance of each of the three methods when the bias term in the prediction intervals and the training error slag parameter are changed. The bias term alters how much of the total available prediction interval is divided as an equal amount among all individual prediction intervals (the rest is divided by scaling the prediction intervals). The slag parameter changes the total sum of available prediction interval to be larger or smaller than the total training error. For each new combination of the slag parameter and the bias value, c_test2.m is called and the performance results recorded. c_test2.m is a very slightly modified version of c_test.m, made to be called from test_bias.m.
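The way I read this normalization (a sketch under that assumption, not the literal c_test2.m code): with raw interval sizes r (the prediction variances or cluster mean errors), a total budget equal to the training error plus the slag parameter, and a bias value b, each half interval becomes an equal share plus a scaled share:

    T   = E_total + e_slag;                       % total available interval sum
    int = b * T / N + (1 - b) * T * r / sum(r);   % equal share + scaled share
    % sum(int) equals T; b = 0 is pure scaling, b = 1 would equal the baseline.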
Results
I have run the programs described above multiple times with different parameters, and here I will present some of the results. The training and test feature vectors were scaled to the range -5 to 5, and the targets to the range 0.2 to 0.8. Out of the 256 samples in the Synthetic dataset, 100 were randomly picked out for testing. The 10 neural networks were set up with 3 layers (1 input, 1 hidden and 1 output layer) and 5 neurons in the hidden layer. Sigmoidal activation functions were used for all the hidden neurons, and a hyperbolic tangent function for the output neuron. The step size alpha was set to 0.1, the momentum term to 0.8 and the epoch size to 20. The stopping criterion was no improvement in training error for 200 consecutive iterations.
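Collected in one place, this training setup might be expressed as the following configuration sketch (the field names are my own, not bp.m's actual parameter names):

    cfg.arch      = [1 5 1];              % 1 input, 5 hidden neurons, 1 output
    cfg.act       = {'sigmoid', 'tanh'};  % hidden / output activation functions
    cfg.alpha     = 0.1;                  % step size
    cfg.momentum  = 0.8;                  % momentum term
    cfg.epochsize = 20;                   % epoch size
    cfg.patience  = 200;                  % stop after 200 iterations without improvement
    cfg.n_nets    = 10;                   % number of identical networks in the ensemble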
Comparing the 3 methods
After obtaining the estimated point predictions, I evaluated the prediction interval performance using n = 4, 8, ..., 100 cluster centers. The best value in terms of the minimum cost function for the test data was picked out, and the training and test data, along with the point predictions and prediction intervals, are shown in the figures below for each of the 3 methods.
Figures 2 and 3. Left figure: Training set and prediction intervals using clustering with 52 cluster centers. Right figure: The test set and corresponding prediction intervals using the same cluster centers.
Figures 4 and 5. Left figure: Training set and prediction intervals using the variance method. Right figure: Test set and prediction intervals using the variance method.
Figures 6 and 7. Left figure: Training set and prediction intervals using the baseline method. Right figure: Test set and prediction intervals using the baseline method.
From the plots it can be seen how the clustering and variance methods try to fit the sizes of the prediction intervals to the easy and difficult parts of the data. For the clustering method it can be seen how the training areas (in the feature vector space) that give large prediction intervals also give large prediction intervals for the test data. The same is to some extent true for the variance method.

The corresponding performance data for this experiment give an easier way of comparing the 3 methods. The performance measures are shown in the table below.
n_centers   c_clus   c_test_clus   cost_clus   cost_test_clus
    4         98         61        0.001842      0.001679
    8         90         59        0.001762      0.001547
   12         88         58        0.001702      0.001421
   16         83         54        0.001430      0.001480
   20         86         57        0.001372      0.001415
   24         84         56        0.001387      0.001506
   28         89         58        0.001092      0.001620
   32         89         56        0.001112      0.001322
   36         82         56        0.001105      0.001595
   40         86         55        0.001067      0.001314
   44         89         57        0.000986      0.001682
   48         90         52        0.001038      0.001345
   52         89         55        0.000960      0.001277
   56         79         54        0.000926      0.001617
   60         89         50        0.000898      0.001845
   64         76         54        0.000908      0.001887
   68         94         53        0.000851      0.002278
   72         97         55        0.000807      0.002155
   76         96         56        0.000866      0.002276
   80         66         52        0.000503      0.002347
   84        101         55        0.000525      0.002298
   88        105         54        0.000524      0.002301
   92        110         55        0.000513      0.002278
   96        110         52        0.000472      0.002425
  100        111         55        0.000431      0.002969

n_centers      = number of cluster centers
c_clus         = number of training targets inside prediction intervals
c_test_clus    = number of test targets inside prediction intervals
cost_clus      = cost function for training data
cost_test_clus = cost function for test data

Comparison values for the variance and baseline methods:

c_var   = 79     c_test_var   = 56    cost_var   = 0.002516    cost_test_var   = 0.004856
c_basis = 108    c_test_basis = 62    cost_basis = 0.003075    cost_test_basis = 0.002414

Settings: slag parameter e_slag = 0, bias term = 0.
Number of training targets: 156. Number of test targets: 100.
From the values in the table it can be seen that the number of cluster centers has a significant influence on the performance of the clustering method. Comparing the number of targets found inside the prediction intervals, it ranges quite a lot, from 66 to 111, for the training data for the clustering method, a range that includes the 79 and 108 of the variance and baseline methods respectively. For the test data the range of values is smaller, but the maximum is only 61, compared to the variance method's 56 and the baseline's 62. So the baseline method actually has a higher number of test targets falling within the prediction intervals. The values found might seem pretty low (the maximum percentage of data inside the prediction intervals is 111/156 ≈ 71%), but this has to be weighed against the fact that the total sum of the prediction intervals is equal to the minimum needed to include all the targets in the intervals3.
On the other hand, when comparing the cost functions (precision) of the different methods, the clustering method turns out to perform far better than the others. For the training data the cost generally decreases with the number of cluster centers, and should theoretically be able to reach zero for n_centers = 156 (the number of training targets). For the test data it has a minimum at 52 cluster centers, where its value is about half that of the baseline method and about one fourth that of the variance method. The increase in the test cost function for high numbers of cluster centers can be viewed as a form of overfitting. It comes from the fact that using too many cluster centers ties the size of the prediction interval to specific training samples instead of areas of the input feature space. So if a sample with low noise lies in an area with generally high noise levels, and a test sample is close to it, the test sample will get a small prediction interval, even though it might have a high level of noise, because it comes from the area with generally high noise levels.

3. This is controlled by the slag parameter e_slag, which is zero in the above experiment.
Testing with different values of the bias and slag parameters
In order to get a higher number of test targets inside the prediction intervals, I tested the methods using different values of the bias and slag parameters. The bias term changes the normalization of the individual prediction intervals away from pure linear scaling: it determines how much of the total available prediction interval is divided evenly among all the individual prediction intervals, which decreases the scaling difference between them. This is only applicable to the clustering and variance methods. The slag parameter determines how much larger than the minimum needed total sum of prediction intervals the sum of the calculated prediction intervals is allowed to be. Increasing the slag parameter increases the sizes of the prediction intervals, and hence makes it easier for the models to include more targets in the prediction intervals.
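To make the effect of the bias concrete (under the same reading as the sketch in the implementation section): with a total budget of 10 spread over 4 intervals whose raw sizes are 1, 1, 3 and 5, a bias of 0 gives the intervals 1, 1, 3 and 5, while a bias of 0.5 hands out an equal share of 10 * 0.5 / 4 = 1.25 to each and scales the rest, giving 1.75, 1.75, 2.75 and 3.75 – the same total, but with less spread between the small and large intervals.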
Using test_bias.m, the performance results for bias values in [0; 0.09] and slag parameter values in [-4; 14] were found. The results are given in the table below.
min_c_test_clus (clustering method):

bias \ e_slag   -4   -2    0    2    4    6    8   10   12   14
0.00            25   40   52   64   77   84   85   91   93   94
0.01            31   44   59   71   77   80   86   90   91   95
0.02            35   51   63   76   78   84   86   91   92   94
0.03            40   54   67   71   78   84   90   90   95   93
0.04            41   52   62   71   79   85   88   93   94   95
0.05            42   52   63   74   81   85   89   91   94   95
0.06            39   57   62   71   79   86   89   90   92   94
0.07            37   52   61   69   80   84   88   90   92   94
0.08            40   56   62   71   77   85   90   90   92   97
0.09            39   49   60   68   77   82   89   91   92   96

min_c_test_var (variance method):

bias \ e_slag   -4   -2    0    2    4    6    8   10   12   14
0.00            35   44   56   58   61   63   67   69   70   70
0.01            36   48   58   64   66   69   70   76   77   78
0.02            43   51   62   71   72   75   75   77   79   79
0.03            42   52   64   71   74   77   79   80   81   82
0.04            38   52   63   75   78   78   80   80   83   83
0.05            37   53   65   74   77   78   81   83   84   85
0.06            39   49   60   71   79   83   85   85   87   87
0.07            43   49   62   70   80   84   87   88   89   89
0.08            46   51   59   69   76   84   86   89   91   93
0.09            48   50   56   68   76   84   88   91   92   93

min_c_test_basis (baseline method; the bias term does not apply):

e_slag          -4   -2    0    2    4    6    8   10   12   14
                40   53   62   69   76   83   87   89   91   93

min_c_test_ = number of test targets inside the prediction intervals (for the clustering method the result with the minimum test cost function (cost_test_clus) is given).
From the results in the table above it can be seen that the larger the slag parameter, the more test targets fall inside the prediction intervals – as expected. The results also show that the bias term has a big influence on this measure of performance, and with the right bias term both the clustering method and the variance method match or outperform the baseline method for all values of the slag parameter. The best results are obtained with the clustering method, which, for nearly all values of the slag parameter, performs better than both the other methods when the right bias value is used.
Discussion
The above results show that clustering can be used with neural networks to estimate
prediction intervals in regression problems. At least in this problem it worked, but is this problem a
special case, or will it work in general? The method doesn’t really require anything in order to be
applied to a given regression problem, so the question is, if it in general will give good results as
above? The idea behind the method is that similar feature vectors often are equally difficult to
predict. This can be the result of a lot of noise or a high nonlinearity in the given area of the feature
space. Since it is this property the method exploits, its performance might decrease a lot in problems,
where this doesn’t apply. If the noise is uniform, which is often assumed in many regression
problems, the clustering method should still perform reasonably well, if a low number of cluster
centers are used. If the number of cluster centers is too high, the method will start to react on
individual output errors, and hence a single sample with, by chance, a relative high noise value,
might make the prediction intervals of similar inputs too large.
The biggest problem with this method is that it requires new input feature vectors to "look" like the old feature vectors it was trained on. If one of the input features is time, new feature vectors might never "look" like old ones, and the model will not perform very well. In such cases it might be possible to omit the features that behave like this and perform the clustering only on the remaining features. This has not been tested, though, and it is hard to predict what implications it would have on the performance of the method.
Another type of problem for this method relates to the size of the input space and the number of training samples. If both become large, a very high number of cluster centers will be needed to cover the feature space at a given resolution. This can make the method impractical if too many cluster centers need to be found, which can require a lot of computation. The centers also need to be stored along with the error associated with each of them, and when new inputs arrive it might be a heavy process to determine the membership of each one. Unfortunately I haven't had time to test the clustering method on larger datasets, so I can't really conclude how big a problem this will be.
Conclusion
As shown in the results section, the clustering method developed in this project can be used for making good prediction intervals. With the right values of its 3 important parameters (number of cluster centers, bias and slag), it gives prediction intervals with very good performance, even compared to standard methods. I did not find a good way of determining these parameters in this project; they were found by repeated trials, and I have no good solution to this problem. It is not very different from trying to determine the right structure for a neural network, and maybe a cross-validation scheme could be used to determine the right parameters.
The dataset used in this project is very well suited to this method of using clustering to make prediction intervals. Since I didn't have time to evaluate the performance on other datasets, it remains a question how well it works in general. Also, as brought up in the discussion section, there might be problems related to the size of the feature space, the number of training vectors, and test vectors that don't "look" like any of the training vectors. These problems need to be considered when deciding if the method is applicable to a given problem.
References
Haykin, S., "Neural Networks: A Comprehensive Foundation," 2nd edition, Prentice Hall, Upper Saddle River, New Jersey, 1999.

Papadopoulos, G.; Edwards, P. J.; Murray, A. F., "Confidence Estimation Methods for Neural Networks: A Practical Comparison," IEEE Transactions on Neural Networks, vol. 12, no. 6, Nov. 2001, pp. 1278-1287.

Carney, J. G.; Cunningham, P.; Bhagwan, U., "Confidence and prediction intervals for neural network ensembles," IJCNN '99, International Joint Conference on Neural Networks, vol. 2, 10-16 July 1999, pp. 1215-1218.

Predictive Uncertainty in Environmental Modeling Competition, http://theoval.cmp.uea.ac.uk/~gcc/competition/