Using Clustering to Make Prediction Intervals for Neural Networks

Claus Benjaminsen
ECE 539
12/21 2005

Abstract

This project describes a way of estimating prediction intervals using clustering. The method is developed and implemented in MATLAB and tested on a small 2D dataset. Its performance is compared with the standard method of making prediction intervals using several identical neural networks, and with a baseline that simply applies equally sized intervals to all predictions. The results show that the clustering method can indeed be used to estimate prediction intervals, and that its performance can even exceed that of the standard method.

Table of Contents

Abstract
Table of Contents
Introduction
Approach
    Data set
    How the models are set up
Implementation
Results
    Comparing the 3 methods
    Testing with different values of the bias and slag parameters
Discussion
Conclusion
References

Introduction

Neural networks are widely used nowadays to solve all kinds of problems related to classification and regression. A special group of problems has to do with prediction: estimating the output of an unknown function given a set of new inputs. This could for instance be forecasting the total overnight snowfall from today's weather data, or predicting how many hats of a certain brand will be sold this Christmas, given the money spent on marketing and the average temperature in December. Both examples are regression problems, in that the output is a numerical value, and both are typical problems that could be solved by neural networks.

Taking the first example, a trained neural network might predict that the total snowfall will be 2 inches. This estimate might be right or wrong, close to the actual snowfall or far from it. Sometimes the neural network will be very accurate and sometimes it won't, but for a person who wants to plan when to get up tomorrow in order to make it to work on time, the point estimate alone might not be sufficient information.
For him it could be more relevant to know what the maximum possible snowfall could be, so that he can take precautions and get up early if there is a chance of more than, for instance, 5 inches of snow. This kind of inference involves prediction intervals. Instead of predicting only a point estimate of the actual snowfall, an interval covering the possible range of snowfall can be estimated. Such intervals can be given in many forms, for instance by a max and min value, possibly with a confidence percentage, or by the mean and variance of a Gaussian distribution. In any case the output is supplied with a lot of extra information and can be used to determine how precise the estimated output value is (the width of the prediction interval) and what the possible max and min values of the actual output are (the limits of the prediction interval).

The standard way of estimating prediction intervals with neural networks is to train multiple identical networks and use the variance of their predicted outputs to estimate a prediction interval for each prediction. This method normally places some requirements on the noise added to the unknown function, and it can have problems in certain situations. Therefore this project investigates the possibility of implementing interval prediction with the help of clustering. The idea is that if the noise in the unknown function is input dependent, inputs that lie close together might be subject to the same underlying noise function, and hence their outputs might be about equally hard to predict. This can be used to give the prediction intervals of similar inputs an equal size, and in that way give a good estimate of the possible range of output values.

Motivation

The motivation for this project is to test the possibility of using clustering in interval prediction.
In many real-world problems prediction intervals can be a big help in making prediction-based decisions, so a good method of estimating them could be used extensively. Also, in many situations the standard method of training several identical neural networks is infeasible, because the model easily gets very big, and training several neural networks on a large dataset can take a long time. The proposed method only requires training one neural network and clustering the training data, which is normally much faster than training many neural networks.

Approach

In this project the main focus has been on implementing and testing the performance of interval prediction using a neural network and clustering. To evaluate the performance, two other methods of producing prediction intervals have been included. The first is the baseline, which simply applies equally large prediction intervals to all predictions. The second is the standard way of implementing prediction intervals with neural networks; in this project it uses 10 identical neural networks, trained on the same data, to estimate the variance of each predicted output, and from these variances it estimates the corresponding prediction intervals.

Data set

For this project I have chosen to use some easily accessible data from a neural network competition [1] set up on the internet. The reason is twofold: first, it meant that I didn't have to spend a lot of time collecting data and transforming it into a format that can easily be processed. Secondly, the competition the data comes from is what gave me the inspiration for this project. It focuses exactly on making good prediction intervals using neural networks, so its datasets are very relevant for testing the implementations in this project.
The competition has 4 different datasets:

Synthetic – a small synthetic dataset
Precip – a large dataset consisting of precipitation data
Temp – a large dataset with temperature data
SO2 – a medium-size dataset of various measurements for predicting SO2 levels

I have only used the Synthetic dataset to develop the methods, as it is 2D (one input and one output) and therefore easy to work with and plot. It consists of 256 points divided into two files, one containing all the x-coordinates (synthetic_train.inputs) and one with the corresponding y-coordinates (synthetic_train.targets). A plot of this dataset is shown below in figure 1.

[1] Predictive Uncertainty in Environmental Modeling Competition, http://theoval.cmp.uea.ac.uk/~gcc/competition/

Figure 1. A MATLAB plot of the Synthetic dataset used in this project.

How the models are set up

When training a neural network it is often desirable that the feature vectors (the input part) of the training/testing data are scaled into a certain range. This both gives the same weight to different features and makes it easier for the neural network to adapt to the inputs. Also, since the output of a neural network is computed by a nonlinear function with a limited range of output values, the targets (the output values of the training/test set) need to be scaled into this range in order for the network to be able to make predictions. Therefore the Synthetic dataset is randomized and then scaled before it is divided into a training set and a testing set.

Using only the training set, 10 identical neural networks are trained using backpropagation with a small random perturbation of the weights, and the weights giving the lowest training error are stored. Then both the training feature vectors and the test feature vectors are fed to all 10 networks, and the mean and variance of each predicted output are calculated from the 10 individual predictions.
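The ensemble statistics described above can be sketched as follows. The report's MATLAB programs are not reproduced here, so this is an illustrative Python sketch with made-up names and a toy 3-network ensemble standing in for the 10 trained MLPs:

```python
# Sketch: combine the predictions of an ensemble of networks into a point
# estimate (mean) and a spread (variance) per sample. The "networks" here
# are stand-ins; in the report they are 10 trained backpropagation MLPs.

def ensemble_stats(predictions):
    """predictions: list of lists, one inner list of outputs per network.
    Returns (means, variances), one entry per sample."""
    n_nets = len(predictions)
    n_samples = len(predictions[0])
    means, variances = [], []
    for j in range(n_samples):
        outs = [predictions[i][j] for i in range(n_nets)]
        m = sum(outs) / n_nets
        v = sum((o - m) ** 2 for o in outs) / n_nets  # population variance
        means.append(m)
        variances.append(v)
    return means, variances

# Toy example: 3 "networks", 2 samples.
preds = [[0.40, 0.70],
         [0.50, 0.72],
         [0.60, 0.74]]
means, variances = ensemble_stats(preds)
print(means)  # approximately [0.5, 0.72]
```

The means become the point predictions, and the per-sample variances are the raw material for the variance method's intervals.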
The mean values serve as the predicted point estimates of the targets. The training error is calculated as the sum of the distances between these predicted point estimates and the training targets. This value gives the minimum total sum of prediction intervals needed to cover all the training data for the found point predictions. Since all prediction intervals in this project are symmetric, the sum of the prediction intervals is calculated as the sum of the half interval lengths.

Given the training error plus an optional slag (slack) parameter, the baseline prediction intervals are formed as equally sized intervals whose total sum equals the training error. The standard method (from now on referred to as the variance method) scales the variance of each prediction, and adds an optional bias term, so that the sum of the variances (prediction intervals) equals the training error. Finally the clustering method is applied, in which a k-means clustering algorithm is run on the input feature vectors to find a number of cluster centers. The membership of each training feature vector to these cluster centers is then found, and the mean of the training errors of all feature vectors in a given cluster is assigned to that cluster center. These mean errors are then scaled, and possibly added to a bias term, so that their total sum equals the total training error; thereby they can serve as prediction intervals for the training data.

When calculating the prediction intervals for the test data, the baseline method simply assigns the size of the training prediction intervals to all test prediction intervals. The variance method scales the variance of the test predictions by the same amount as the training set variances were scaled. Finally, the clustering method first determines the cluster membership of the given test feature vector and then assigns it the mean error associated with the found cluster center.
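The clustering step can be sketched as below. This is a hypothetical Python illustration, not the report's MATLAB code (cluster.m / kmeantest.m); it uses a tiny 1-D k-means, since the Synthetic set has a single input, and all names are made up:

```python
# Sketch: k-means on the inputs, mean absolute training error per cluster,
# then a global rescaling so that the interval half-widths summed over the
# training set equal the total training error (the minimum needed coverage).

def kmeans_1d(xs, k, iters=20):
    """Very small 1-D k-means; returns cluster centers."""
    centers = [min(xs) + (max(xs) - min(xs)) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def membership(x, centers):
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def cluster_intervals(xs, abs_errors, centers):
    """Mean absolute training error per cluster, scaled so the summed
    half-widths over the training set equal the total training error."""
    k = len(centers)
    sums, counts = [0.0] * k, [0] * k
    for x, e in zip(xs, abs_errors):
        j = membership(x, centers)
        sums[j] += e
        counts[j] += 1
    mean_err = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(k)]
    total = sum(abs_errors)
    assigned = sum(mean_err[membership(x, centers)] for x in xs)
    scale = total / assigned
    return [m * scale for m in mean_err]

# Toy data: a noisy middle region should get wider intervals.
xs = [-4.0, -3.5, 0.1, 0.3, 3.8, 4.2]
errs = [0.02, 0.04, 0.30, 0.34, 0.05, 0.07]
centers = kmeans_1d(xs, 3)
half_widths = cluster_intervals(xs, errs, centers)
```

A new test input is then given the half-width of its nearest center, so inputs landing in the noisy middle region receive the larger interval.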
This mean error is then scaled by the same factor as during training, and the resulting value defines the test prediction interval.

The performance measures used in this project are the number of targets that fall within the prediction intervals, and a cost function, defined as the mean squared distance from the edge of the prediction interval to the corresponding target, over all targets that fall outside their prediction intervals.

Implementation

The implementation of the methods described above is done in MATLAB and makes extensive use of the MATLAB programs developed by Professor Yu Hen Hu [2], or modifications of these programs. The main file is IntervalNN, which sets up all the parameters for training the 10 neural networks, scales the data and divides it into a training set and a test set. It then calls NNfunction, a modified version of bp.m, which takes care of training a neural network on the given training data and returns the found weights and the training data predictions. Given the weights, the test set predictions are calculated, and this procedure is repeated 10 times. Then the means and variances of the training and test predictions are determined, along with the training and test errors. Finally all the values are saved in IntNNvar.mat and the program c_test is called.

In c_test the prediction intervals for the baseline method and the variance method are calculated, and the performance of these intervals on both the training and test data is found. Then the cluster centers of the training feature vectors are found using the function cluster.m, a modified version of clusterdemo.m by Professor Yu Hen Hu, and the membership of each training feature vector is determined using kmeantest.m, also developed by Professor Yu Hen Hu. Then the prediction intervals are formed and the training performance evaluated, after which the same is done for the test set.
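The two performance measures can be sketched as follows. This is an illustrative Python version with hypothetical names; in particular, averaging the squared edge distances over only the targets that fall outside their intervals is my reading of the description above:

```python
# Sketch: coverage (number of targets inside their interval) and cost
# (mean squared distance from the interval edge to each target outside it).

def interval_performance(targets, preds, half_widths):
    inside = 0
    sq_dists = []
    for t, p, h in zip(targets, preds, half_widths):
        lo, hi = p - h, p + h
        if lo <= t <= hi:
            inside += 1
        else:
            edge = lo if t < lo else hi          # nearest interval edge
            sq_dists.append((t - edge) ** 2)
    cost = sum(sq_dists) / len(sq_dists) if sq_dists else 0.0
    return inside, cost

# Toy check: third target lies 0.5 above its interval's upper edge.
inside, cost = interval_performance(
    targets=[1.0, 2.0, 3.0],
    preds=[1.1, 2.0, 2.0],
    half_widths=[0.2, 0.1, 0.5])
print(inside, cost)  # 2 0.25
```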
The whole scheme is run multiple times for different numbers of cluster centers, so the performance can be compared. Finally disp_performance.m can be run. It uses the prediction intervals given by the number of cluster centers that gives the best performance in terms of the minimum cost function for the test data. These prediction intervals are plotted along with the training data and training predictions, and the same is done for the other two methods. Similar plots are made for the test data.

Another small program, test_bias.m, can also be run. It tests the performance of each of the three methods when the bias term in the prediction intervals and the training error slag parameter are changed. The bias term sets how much of the total available prediction interval is divided equally among all individual prediction intervals (the rest is divided by scaling the prediction intervals). The slag parameter makes the total sum of available prediction interval larger or smaller than the total training error. For each new combination of the slag parameter and the bias value, c_test2.m is called and the performance results are recorded. c_test2.m is a very slightly modified version of c_test.m, made to be called from test_bias.m.

[2] These programs can be found on the class webpage: http://homepages.cae.wisc.edu/~ece539/matlab/index.html

Results

I have run the programs described above multiple times with different parameters, and here I will present some of the results. The training and test feature vectors were scaled to the range from -5 to 5, and the targets to the range from 0.2 to 0.8. Out of the 256 samples in the Synthetic dataset, 100 were randomly picked out for testing. The 10 neural networks were set up with 3 layers (1 input, 1 hidden and 1 output layer) and 5 neurons in the hidden layer.
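The roles of the bias and slag parameters can be illustrated with a small Python sketch (hypothetical names; the report's c_test2.m is not reproduced here). The total interval budget is the training error plus slag; a bias fraction of that budget is shared equally among all intervals, and the remainder rescales the raw interval sizes:

```python
# Sketch: reshape raw interval sizes under a bias/slag parameterization.
# budget = train_error + slag; bias * budget is spread evenly, the rest
# is distributed in proportion to the raw sizes.

def apply_bias_slag(raw_sizes, train_error, bias, slag):
    budget = train_error + slag
    n = len(raw_sizes)
    even_part = bias * budget / n            # equal share for every interval
    scaled_budget = (1.0 - bias) * budget    # divided proportionally
    scale = scaled_budget / sum(raw_sizes)
    return [even_part + r * scale for r in raw_sizes]

# With bias = 0.5 half the budget is spread evenly, flattening differences;
# the total still sums to train_error + slag.
sizes = apply_bias_slag([1.0, 2.0, 3.0, 4.0], train_error=20.0,
                        bias=0.5, slag=0.0)
print(sizes)  # [3.5, 4.5, 5.5, 6.5]
```

A higher bias pulls all intervals toward equal size (the baseline is the bias = 1 extreme), while a positive slag simply widens everything.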
Sigmoidal activation functions were used for all the hidden neurons, and a hyperbolic tangent function for the output neuron. The step size alpha was set to 0.1, the momentum term to 0.8 and the epoch size to 20. The stopping criterion was no improvement in training error for 200 consecutive iterations.

Comparing the 3 methods

After obtaining the estimated point predictions I evaluated the prediction interval performance using n = 4, 8, ..., 100 cluster centers. The best one in terms of the minimum cost function for the test data was picked out, and the training and test data, along with the point predictions and prediction intervals, are shown in the figures below for each of the 3 methods.

Figures 2 and 3. Left figure: training set and prediction intervals using clustering with 52 cluster centers. Right figure: test set and corresponding prediction intervals using the same cluster centers.

Figures 4 and 5. Left figure: training set and prediction intervals using the variance method. Right figure: test set and prediction intervals using the variance method.

Figures 6 and 7. Left figure: training set and prediction intervals using the baseline method. Right figure: test set and prediction intervals using the baseline method.

From the plots it can be seen how the clustering and variance methods try to fit the sizes of the prediction intervals to the easy and difficult parts of the data. For the clustering method, the areas of the feature space that give large prediction intervals for the training data also give large prediction intervals for the test data. The same is to some extent true for the variance method. The corresponding performance data give an easier way of comparing the 3 methods; the performance measures are shown in the table below.
n_centers   c_clus   c_test_clus   cost_clus   cost_test_clus
    4         98         61         0.001842      0.001679
    8         90         59         0.001762      0.001547
   12         88         58         0.001702      0.001421
   16         83         54         0.001430      0.001480
   20         86         57         0.001372      0.001415
   24         84         56         0.001387      0.001506
   28         89         58         0.001092      0.001620
   32         89         56         0.001112      0.001322
   36         82         56         0.001105      0.001595
   40         86         55         0.001067      0.001314
   44         89         57         0.000986      0.001682
   48         90         52         0.001038      0.001345
   52         89         55         0.000960      0.001277
   56         79         54         0.000926      0.001617
   60         89         50         0.000898      0.001845
   64         76         54         0.000908      0.001887
   68         94         53         0.000851      0.002278
   72         97         55         0.000807      0.002155
   76         96         56         0.000866      0.002276
   80         66         52         0.000503      0.002347
   84        101         55         0.000525      0.002298
   88        105         54         0.000524      0.002301
   92        110         55         0.000513      0.002278
   96        110         52         0.000472      0.002425
  100        111         55         0.000431      0.002969

Variance method: c_var = 79, c_test_var = 56, cost_var = 0.002516, cost_test_var = 0.004856
Baseline method: c_basis = 108, c_test_basis = 62, cost_basis = 0.003075, cost_test_basis = 0.002414

n_centers  = number of cluster centers
c_         = number of training targets inside prediction intervals
c_test_    = number of test targets inside prediction intervals
cost_      = cost function for training data
cost_test_ = cost function for test data

The slag parameter e_slag = 0; the bias term bias = 0. Number of training targets: 156. Number of test targets: 100.

From the values in the table it can be seen that the number of cluster centers has a significant influence on the performance of the clustering method. Comparing the number of targets inside the prediction intervals, the training value for the clustering method ranges widely, from 66 to 111, a range that includes the 79 and 108 of the variance and baseline methods respectively. For the test data the range of values is smaller, but the maximum is only 61, compared to 56 for the variance method and 62 for the baseline. So the baseline method actually has a higher number of test targets falling within the prediction intervals. The values found might seem pretty low.
The maximum percentage of data inside the prediction intervals is (111/156 ≈) 71%, but this has to be seen in light of the fact that the total sum of the prediction intervals equals the minimum needed to include all the targets in the intervals [3]. On the other hand, when comparing the cost functions (precision) of the different methods, the clustering method turns out to perform far better than the other methods. For the training data the cost generally decreases with the number of cluster centers, and should theoretically reach zero for n_centers = 156 (the number of training targets). For the test data it has a minimum at 52 cluster centers, where its value is about half that of the baseline method and about one fourth that of the variance method.

[3] This is controlled by the slag parameter c_slag, which is zero in the above experiment.

The increase in the test cost function for high numbers of cluster centers can be viewed as a form of overfitting. It comes from the fact that using too many cluster centers relates the size of the prediction interval to specific training samples instead of areas of the input feature space. So if a sample with low noise lies in an area with generally high noise levels, and a test sample is close to it, the test sample will get a small prediction interval, even though it might have a high level of noise because it comes from the high-noise area.

Testing with different values of the bias and slag parameters

In order to get a higher number of test targets inside the prediction intervals, I tested the methods using different values of the bias and slag parameters. The bias term changes the normalization of the individual prediction intervals from being completely linearly scaled.
Instead it determines how much of the total available prediction interval should be divided evenly among all the individual prediction intervals, which decreases the scaling differences. This is only applicable to the clustering and variance methods. The slag parameter determines how much larger than the minimum needed total sum of prediction intervals the sum of the calculated prediction intervals is allowed to be. Increasing the slag parameter increases the sizes of the prediction intervals, and hence makes it easier for the models to include more targets in the prediction intervals. Using test_bias.m, the performance results for bias values from 0.00 to 0.09 and slag parameter values from -4 to 14 were found. The results are given in the tables below.

min_c_test_clus

bias \ e_slag   -4   -2    0    2    4    6    8   10   12   14
0.00            25   40   52   64   77   81   85   91   93   94
0.01            31   44   59   71   84   85   86   90   91   95
0.02            35   51   63   76   77   79   86   91   92   94
0.03            40   54   67   71   80   86   90   90   95   93
0.04            41   52   62   71   78   80   88   93   94   95
0.05            42   52   63   74   84   84   89   91   94   95
0.06            39   57   62   71   78   77   89   90   92   94
0.07            37   52   61   69   84   85   88   90   92   94
0.08            40   56   62   71   79   77   90   90   92   97
0.09            39   49   60   68   85   82   89   91   92   96

min_c_test_var

bias \ e_slag   -4   -2    0    2    4    6    8   10   12   14
0.00            35   44   56   58   61   77   67   69   70   70
0.01            36   48   58   64   63   78   70   76   77   78
0.02            43   51   62   71   66   79   75   77   79   79
0.03            42   52   64   71   69   83   79   80   81   82
0.04            38   52   63   75   72   80   80   80   83   83
0.05            37   53   65   74   75   84   81   83   84   85
0.06            39   49   60   71   74   76   85   85   87   87
0.07            43   49   62   70   77   84   87   88   89   89
0.08            46   51   59   69   78   76   86   89   91   93
0.09            48   50   56   68   78   84   88   91   92   93

min_c_test_basis

e_slag          -4   -2    0    2    4    6    8   10   12   14
                40   53   62   69   76   83   87   89   91   93

min_c_test_ = number of test targets inside the prediction intervals (for the clustering method, the result with the minimum test cost function (cost_test_clus) is given).

From the results in the tables above, it can be seen that the bigger the slag parameter is, the larger the number of test targets inside the prediction intervals – as expected.
The results also show that the bias term has a big influence on this measure of performance: with the right bias term, both the clustering method and the variance method outperform the baseline method for all values of the slag parameter. The best results are obtained with the clustering method, which, for nearly all values of the slag parameter, performs better than both other methods when the right bias value is used.

Discussion

The above results show that clustering can be used with neural networks to estimate prediction intervals in regression problems. At least in this problem it worked, but is this problem a special case, or will the method work in general? The method doesn't really require anything in order to be applied to a given regression problem, so the question is whether it will in general give results as good as the above.

The idea behind the method is that similar feature vectors are often equally difficult to predict. This can be the result of a lot of noise, or of high nonlinearity, in the given area of the feature space. Since it is this property the method exploits, its performance might decrease a lot in problems where it does not apply. If the noise is uniform, which is often assumed in many regression problems, the clustering method should still perform reasonably well if a low number of cluster centers is used. If the number of cluster centers is too high, the method will start to react to individual output errors, and hence a single sample with, by chance, a relatively high noise value might make the prediction intervals of similar inputs too large.

The biggest problem with this method is that it requires new input feature vectors to "look" like the old feature vectors it was trained on. If one of the input features is time, the new feature vectors might never "look" like old feature vectors, and the model will not be able to perform very well.
In such cases it might be possible to omit the features that behave like that and perform the clustering only on the remaining features. This has not been tested, though, and it is hard to predict what implications it would have on the performance of the method.

Another type of problem for this method relates to the size of the input space and the number of training samples. If both become large, a very high number of cluster centers will be needed to cover the feature space with a certain resolution. This can make the method impractical: finding many cluster centers can require a lot of computation, they need to be stored along with the error associated with each of them, and when new inputs arrive, determining the membership of each of them can be a heavy process. Unfortunately I haven't had time to test the clustering method on larger datasets, so I can't really conclude how big this problem will be.

Conclusion

As shown in the results section, the clustering method developed in this project can be used for making good prediction intervals. With the right values of the 3 important parameters (number of cluster centers, bias and slag), it gives prediction intervals with very good performance, even when compared to standard methods. I didn't find a good way of determining these parameters in this project; they were found by repeated trials, and I have no good solution to this problem. It is not very different from trying to determine the right structure for a neural network, and maybe a cross-validation scheme could be used to determine the right parameters.

The dataset used in this project is very well suited to this method of using clustering to make prediction intervals. Since I didn't have time to evaluate the performance on other datasets, it is still a question how well it works in general.
Also, as brought up in the discussion above, there might be problems related to the size of the feature space, the number of training vectors, and test vectors that don't "look" like any of the training vectors. These issues need to be considered when deciding whether the method is applicable to a given problem.

References

Haykin, S., "Neural Networks: A Comprehensive Foundation", 2nd edition, Prentice Hall, Upper Saddle River, New Jersey, 1999.

Papadopoulos, G.; Edwards, P.J.; Murray, A.F., "Confidence Estimation Methods for Neural Networks: A Practical Comparison", IEEE Transactions on Neural Networks, Volume 12, Issue 6, Nov. 2001, pp. 1278-1287.

Carney, J.G.; Cunningham, P.; Bhagwan, U., "Confidence and prediction intervals for neural network ensembles", IJCNN '99, International Joint Conference on Neural Networks, Volume 2, 10-16 July 1999, pp. 1215-1218.

Predictive Uncertainty in Environmental Modeling Competition, http://theoval.cmp.uea.ac.uk/~gcc/competition/