Black box modeling for highly nonlinear systems

Skolidis G., Souriadakis M., Georgikopoulou A., Hatzopoulos P., Nikolaou G., Tseles D.I.

1. Introduction

Neural networks are interconnected systems that can be considered as simplified mathematical models functioning like the neuron patterns of the human brain. In contrast to traditional computing techniques, which are programmed with explicit rules to perform a specific task, neural networks must be taught, or trained, on a training data set, from which they derive by themselves the patterns and rules governing the network. Although computers perform better than artificial neural networks on tasks based on precise and fast arithmetic operations, neural networks can be used in problems where the associations and patterns between the input variables are unknown; it is also worth mentioning that the method does not require continuous relationships between the data being evaluated in order to identify key events or patterns.

Neural networks have many applications in finance and economics, such as stock selection, screening of mortgage applicants, bankruptcy forecasting, real estate appraisal and time-series forecasting of stock or index prices, although their utility in practice is often limited. On the other hand, the unique ability of neural networks to learn any nonlinear relationship from data without prior knowledge of the system makes them an excellent tool for forecasting applications.

In this paper two case studies are presented. The first is a financial application in which neural networks are used to predict the value of houses in Boston, the Boston Housing Project as it is called, which is a typical benchmark problem for neural networks. The second case study uses a neural network model for the prediction of meteorological parameters.

2. Neural Networks as Forecasting Models

Neural networks can be used as forecasting tools in many different areas. They have the ability to classify nonlinear systems and can approximate any nonlinear function to some level of accuracy. In economic and financial applications the most basic and commonly used neural network is the multilayer feedforward network. Figure 1 illustrates the architecture of a neural network with one hidden layer containing two neurons (N1, N2), three input variables x_i, i = 1, 2, 3 (X1, X2, X3) and one output y. All the nodes at each layer are connected to each node at the next layer by interconnection strengths called weights. A training algorithm is used to obtain a set of weights that minimizes the difference between the target and the output produced by the simulation of the network.

Figure 1. Architecture of a neural network.
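To make the mapping of Figure 1 concrete, the following is a minimal sketch of the forward pass of such a 3-2-1 feedforward network with a hyperbolic tangent (tansig) hidden layer and a linear (purelin) output. It is illustrative Python/NumPy code rather than the MATLAB implementation used later in the paper, and all weight values are arbitrary placeholders.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 3-2-1 feedforward network of Figure 1.

    x  : input vector with the three variables x1, x2, x3
    W1 : 2x3 weight matrix connecting the inputs to hidden neurons N1, N2
    b1 : biases of the two hidden neurons
    W2 : 1x2 weight matrix connecting the hidden neurons to the output y
    b2 : bias of the output neuron
    """
    hidden = np.tanh(W1 @ x + b1)   # tansig hidden layer
    return W2 @ hidden + b2         # purelin (linear) output layer

# Arbitrary example weights; a training algorithm would adjust these to
# minimize the difference between targets and simulated outputs.
W1 = np.array([[0.5, -0.3, 0.8],
               [0.1,  0.7, -0.2]])
b1 = np.array([0.05, -0.10])
W2 = np.array([[1.2, -0.4]])
b2 = np.array([0.02])

print(forward(np.array([0.2, -0.5, 0.9]), W1, b1, W2, b2))
```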
2.1 Learning Algorithms

In our study we used several different variations of the backpropagation training algorithm, each with different computation and storage requirements. The table below summarizes the training algorithms used in the search for the model with the highest level of accuracy.

Gradient Descent (GD): slow response; can be used in incremental training mode.
Gradient Descent with Momentum (GDM): faster training than GD; can be used in incremental training mode.
Gradient Descent with Adaptive Learning Rate (GDX): faster training than GD, but can only be used in batch training mode.
Resilient Backpropagation (RP): simple batch training algorithm with fast convergence and minimal storage requirements.
Polak-Ribiere Conjugate Gradient (CGP): slightly larger storage requirements and faster convergence on some problems.
Levenberg-Marquardt (LM): fast training algorithm for networks of moderate size, with a memory-reduction option for use when the training data set is large.
Bayesian Regularization (BR): modification of the Levenberg-Marquardt training algorithm that produces networks with improved generalization and reduces the difficulty of determining the optimum network architecture.

2.3 Techniques for Improving Generalization

In our research we studied two techniques for improving the generalization ability of the network: early stopping and Bayesian regularization.

Early Stopping
This technique requires the data set to be divided into three subsets: a training, a test and a validation set. The training set is used for computing the gradient and updating the network weights and biases. The training procedure monitors the error on the validation set and, as soon as this error starts to increase, training stops and the weights from the point where the validation error was minimum are returned. This is the stage at which the model should cease to be trained in order to overcome the over-fitting problem.

Bayesian Regularization
This technique involves modifying the performance function. A typical performance function used for training feedforward neural networks is the mean of the sum of squares of the network errors,

$\mathrm{mse} = \frac{1}{N}\sum_{i=1}^{N} e_i^{2} = \frac{1}{N}\sum_{i=1}^{N} (t_i - a_i)^{2}$

Generalization can be improved by modifying this performance function, adding a term that consists of the mean of the sum of squares of the network weights and biases,

$\mathrm{msereg} = \gamma\,\mathrm{mse} + (1-\gamma)\,\mathrm{msw}$

where $\gamma$ is the performance ratio and

$\mathrm{msw} = \frac{1}{n}\sum_{j=1}^{n} w_j^{2}$

Using this performance function causes the network to have smaller weights and biases, forcing the network response to be smoother and less likely to over-fit. Of these two techniques we chose Bayesian regularization, because we observed a significant increase in the accuracy of the models tested with it.
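The modified performance function above can be written as a short, self-contained routine. The sketch below is illustrative NumPy code under the assumption that all weights and biases have been flattened into a single vector; it is not the Neural Network Toolbox implementation used in the experiments.

```python
import numpy as np

def msereg(targets, outputs, weights_and_biases, gamma=0.9):
    """Regularized performance function: msereg = gamma*mse + (1-gamma)*msw.

    targets, outputs   : arrays of target values and simulated network outputs
    weights_and_biases : flat array containing all network weights and biases
    gamma              : performance ratio balancing data fit against weight size
    """
    mse = np.mean((targets - outputs) ** 2)   # mean squared network error
    msw = np.mean(weights_and_biases ** 2)    # mean squared weights and biases
    return gamma * mse + (1.0 - gamma) * msw

# Penalizing large weights pushes the network towards a smoother response.
t = np.array([1.0, 2.0, 3.0])          # targets
a = np.array([1.1, 1.9, 3.2])          # simulated outputs
w = np.array([0.5, -1.2, 0.8, 0.03])   # flattened weights and biases
print(msereg(t, a, w, gamma=0.9))
```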
2.4 Performance Metrics

For both case studies two statistical criteria are used to estimate the performance of each neural network: an in-sample criterion and an out-of-sample criterion, which are based on tests of significance. In the case of the in-sample criterion we evaluate the regression, since we want to know how well a model fits the actual data. Goodness of fit is best measured through the multiple correlation coefficient, also known as the R-squared coefficient: the ratio of the variance of the output predicted by the model to the variance of the true (observed) output,

$R^{2} = \frac{\sum_{t=1}^{T} (\hat{y}_t - \bar{y})^{2}}{\sum_{t=1}^{T} (y_t - \bar{y})^{2}}$

where $\hat{y}_t$ is the model output, $y_t$ the observed output and $\bar{y}$ its sample mean.

The out-of-sample criterion evaluates how well competing models generalize beyond the data set used for estimation. To evaluate the performance of a model out-of-sample, we begin by dividing the data into an in-sample estimation (training) set, from which the model coefficients are obtained, and an out-of-sample test set. The most commonly used statistic for evaluating out-of-sample fit is the root mean squared error (rmsq) statistic,

$\mathrm{rmsq} = \sqrt{\frac{1}{T^{*}}\sum_{t=1}^{T^{*}} (y_t - \hat{y}_t)^{2}}$

where $T^{*}$ is the number of out-of-sample observations.

3. Case studies

In the following section the two case studies are presented, and the simulation results of the models are compared using the performance metrics described in the previous section.

3.1 The Boston Housing Project

In this application we developed a number of neural networks for forecasting the value of houses in Boston, the Boston Housing Project as it is called, which is a typical benchmark problem for neural networks. The application was developed in the Matlab 6.5 programming environment, using the Neural Network Toolbox that it provides. Initially, the inputs and the targets were preprocessed so that they fall in the range [-1, 1]. Secondly, we built the network with the specific architecture (number of layers, neurons in each layer, transfer function of each layer) that we wanted to test. The network was then trained using one of the algorithms described above. After training, the neural network was simulated with the data on which it had been trained in order to evaluate its performance, that is, how well the model fits the actual data; this performance was measured by computing the R-squared coefficient and the root mean squared error. Lastly, the network was tested with new inputs to measure its ability to generalize, using the rmsq statistical criterion. After the first tests we set aside the models with the best performance and tried to improve their ability to forecast on new data, using Bayesian regularization with the modified performance function.

3.2 Experimental Results

Every model that was developed had eleven (11) inputs, one hidden layer and one output unit predicting the value of the house. The final set of weights to which a network settles depends on a number of factors, e.g. the initial weights chosen, the learning parameters and the number of hidden neurons. The number of hidden neurons varied between 5 and 14, and training was completed when the training cycle reached 5000 iterations. The experiment was conducted ten (10) times for each model constructed, and the models were compared using the average error on the test data. The neural network giving the forecast with the highest level of accuracy was selected in two steps: first a trial-and-error search, in which several architectures were compared and the best ones retained, and subsequently the optimization of the models selected by the trial-and-error step. A sketch of the preprocessing and evaluation steps of this procedure is given below.
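The following is a minimal illustration, in Python/NumPy rather than the MATLAB environment actually used, of the preprocessing and performance-evaluation steps described above: scaling data to [-1, 1] and computing the R-squared and rmsq criteria. The data values are hypothetical placeholders, not results from the study.

```python
import numpy as np

def scale_to_range(x, lo=-1.0, hi=1.0):
    """Linearly map each column of x into [lo, hi], as done before training."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

def r_squared(y_true, y_pred):
    """In-sample criterion: variance of the predictions over variance of the targets."""
    ybar = y_true.mean()
    return np.sum((y_pred - ybar) ** 2) / np.sum((y_true - ybar) ** 2)

def rmsq(y_true, y_pred):
    """Out-of-sample criterion: root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical raw values for two of the eleven input variables, scaled to [-1, 1].
X = np.array([[6.3, 15.3],
              [7.2, 17.8],
              [3.9, 21.0]])
print(scale_to_range(X))

# Hypothetical targets and simulated outputs for a training and a test set.
y_train, y_hat_train = np.array([24.0, 21.6, 34.7]), np.array([23.1, 22.0, 33.5])
y_test,  y_hat_test  = np.array([28.7, 22.9]),       np.array([26.9, 24.1])
print(r_squared(y_train, y_hat_train), rmsq(y_test, y_hat_test))
```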
3.2.1 Optimization

After the first selection, we tried to optimize those networks using Bayesian regularization with the modified performance function. When the regularization technique is used, the user has to determine the optimum value of the learning rate so that the network adequately fits the training data without becoming over-fitted: if the learning rate is too large the network may become over-trained, but if it is too small the network will not fit the data. After experimenting with several values of the learning rate, we settled on the values listed in the table below, which also indicates the level of accuracy of each model.

Architecture | Transfer functions | Training algorithm | Learning rate | Epochs | Average error of test data (rmsq)
11-5-1 | tansig-purelin | gda | default | 5000 | 3.058
11-5-1 | tansig-purelin | gdx | default | 5000 | 2.924
11-7-1 | tansig-purelin | gdx | default | 5000 | 2.856
11-9-1 | tansig-purelin | gda | default | 5000 | 3.001
11-9-1 | tansig-purelin | gdx | default | 5000 | 2.87
11-13-1 | tansig-purelin | gdx | default | 5000 | 2.832
11-14-1 | tansig-purelin | gdx | default | 5000 | 2.87
11-5-1 | tansig-purelin | gda | 0.8 | 5000 | 3.456
11-5-1 | tansig-purelin | gdx | 0.8 | 5000 | 3.178
11-7-1 | tansig-purelin | gdx | 0.8 | 5000 | 2.997
11-9-1 | tansig-purelin | gda | 0.8 | 5000 | 3.212
11-9-1 | tansig-purelin | gdx | 0.8 | 5000 | 2.907
11-13-1 | tansig-purelin | gdx | 0.8 | 5000 | 2.862
11-14-1 | tansig-purelin | gdx | 0.8 | 5000 | 2.834
11-5-1 | tansig-purelin | gda | 0.75 | 5000 | 3.419
11-5-1 | tansig-purelin | gdx | 0.75 | 5000 | 3.257
11-7-1 | tansig-purelin | gdx | 0.75 | 5000 | 3.13
11-9-1 | tansig-purelin | gda | 0.75 | 5000 | 3.219
11-9-1 | tansig-purelin | gdx | 0.75 | 5000 | 2.941
11-13-1 | tansig-purelin | gdx | 0.75 | 5000 | 2.932
11-14-1 | tansig-purelin | gdx | 0.75 | 5000 | 2.913
11-5-1 | tansig-purelin | gda | 0.70 | 5000 | 3.547
11-5-1 | tansig-purelin | gdx | 0.70 | 5000 | 3.312
11-7-1 | tansig-purelin | gdx | 0.70 | 5000 | 3.235
11-9-1 | tansig-purelin | gda | 0.70 | 5000 | 3.3
11-9-1 | tansig-purelin | gdx | 0.70 | 5000 | 3.115
11-13-1 | tansig-purelin | gdx | 0.70 | 5000 | 2.914
11-14-1 | tansig-purelin | gdx | 0.70 | 5000 | 2.9
11-5-1 | tansig-purelin | gda | 0.85 | 5000 | 3.304
11-5-1 | tansig-purelin | gdx | 0.85 | 5000 | 3.029
11-7-1 | tansig-purelin | gdx | 0.85 | 5000 | 2.932
11-9-1 | tansig-purelin | gda | 0.85 | 5000 | 3.309
11-9-1 | tansig-purelin | gdx | 0.85 | 5000 | 2.895
11-13-1 | tansig-purelin | gdx | 0.85 | 5000 | 2.826
11-14-1 | tansig-purelin | gdx | 0.85 | 5000 | 2.876

3.2.3 Model Selection

The model that gave the best forecast was the one with the 11-13-1 architecture, trained with the gdx algorithm for 5000 iterations with a learning rate of 0.85.

[Figures: training performance curve of the selected model (performance 0.0239781 after 5000 epochs, goal 0); real and simulated outputs of the training data; error of the test data; real and simulated outputs of the test data.]

4. Weather forecasting using neural networks

The purpose of the neural network models in this case study is to predict the ambient temperature from past measurements. After a thorough literature review and analysis it was decided that all models would have three parameters as inputs: previous temperature measurements, atmospheric pressure and relative humidity. After experimentation with different network architectures it turned out that, for the data used for modeling, the best prediction was given by a network with one hidden layer of five neurons. For this architecture the results for a number of training algorithms and activation functions are given in the next two tables; a sketch of how such a data set can be assembled is given after them.

Activation functions (best results):
TRAIN: MSE min 4.5348 (Tansig-Purelin); MAE min 1.6356 (Tansig-Purelin); R max 0.9685 (Tansig-Purelin)
TEST: MSE min 5.6598 (Tansig-Tansig); MAE min 1.7490 (Tansig-Tansig); R max 0.9555 (Tansig-Tansig)

Training algorithms (best results):
TRAIN: MSE min 4.6510 (TRAINBR); MAE min 1.6616 (TRAINBR); R max 0.9677 (TRAINBR)
TEST: MSE min 5.7135 (TRAINBR); MAE min 1.7554 (TRAINBR); R max 0.9553 (TRAINBR)
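For concreteness, the sketch below shows one way the three input parameters could be arranged into input/target pairs for one-step-ahead temperature prediction. It is illustrative Python/NumPy code with hypothetical measurements; the single-step lag and the sample values are assumptions, not the configuration or data used in the study.

```python
import numpy as np

def build_dataset(temperature, pressure, humidity, lag=1):
    """Assemble input/target pairs for one-step-ahead temperature prediction.

    Each input row holds the three parameters used as model inputs
    (past temperature, atmospheric pressure, relative humidity) measured
    `lag` steps before the temperature value to be predicted.
    """
    X = np.column_stack([temperature[:-lag], pressure[:-lag], humidity[:-lag]])
    y = temperature[lag:]
    return X, y

# Hypothetical hourly measurements (temperature in deg C, pressure in hPa,
# relative humidity as a fraction).
temp = np.array([18.2, 18.9, 19.5, 20.1, 20.4])
pres = np.array([1012.0, 1011.5, 1011.2, 1010.8, 1010.5])
hum  = np.array([0.61, 0.59, 0.58, 0.55, 0.54])

X, y = build_dataset(temp, pres, hum)
print(X.shape, y.shape)   # (4, 3) (4,)
```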
5. Conclusion

The purpose of this paper was to show that neural networks can be used for time-series forecasting. We have presented and compared many different neural network models on a typical benchmark problem from the financial sector and on an application modeling a highly nonlinear problem. The results showed that, given a large set of consistent data, neural networks can be used as time-series forecasters, providing results whose accuracy is comparable, if not superior, to that of traditional forecasting techniques.