Predicting Daily Returns for the IBM Stock

M. Oldemiro Fernandes (mofer@liacc.up.pt), Luis Torgo (ltorgo@liacc.up.pt)
LIACC, University of Porto, www.liacc.up.pt

Abstract. The goal of the work described in this paper is to predict the daily returns of the closing prices of the IBM stock. From the original data of IBM daily quotes, a new data set was built using technical indicators as predictor variables. Using this new data set, two modelling approaches were tried: regression and classification. Early analysis and experiments suggested that this prediction problem has some specific properties that make it difficult for standard learning algorithms. Based on this analysis we propose a two-stage approach to overcome these difficulties. Initial experiments show that the approach is promising; however, the current results are still far from the ideal performance achievable by the proposed methodology. Our analysis of these results shows that further work is needed, namely in improving the performance of the classification stage of our approach.

1. Introduction

During market hours, an accurate prediction of the last price of the day allows one to make profitable intraday trades. This is the main motivation behind our work, where we try to predict the daily returns of one particular stock. Daily stock returns are defined as the percentage change between two successive closing prices:

    R_{t+1} = (Close_{t+1} / Close_t - 1) x 100    (1)

For each period we set the prediction time at the moment after the Open price (Open_{t+1}) is known, as shown in Figure 1. This means that this value can be used to predict the Close price of that day, which is our main objective.

[Figure 1 - Prediction moment: the prediction is made after Open_{t+1} is known but before Close_{t+1} is observed.]

2. Data Presentation and Pre-Processing Methodology

IBM daily quotes were collected from finance.yahoo.com.
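The daily return defined in Section 1 can be computed directly from a series of closing prices. The following is a minimal sketch (the function name and list-based interface are our own illustration):

```python
def daily_returns(close):
    """Percentage change between successive closing prices:
    R_{t+1} = (Close_{t+1} / Close_t - 1) * 100."""
    return [(close[t + 1] / close[t] - 1.0) * 100.0
            for t in range(len(close) - 1)]
```

For example, a move from 100.0 to 101.0 yields a return of 1.0%.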
The data set consists of 7891 observations of the Open, High, Low and Close prices and the Volume traded for each day from 14/07/1970 to 12/10/2001. From this base data we generated a data set whose predictor variables are several technical indicators; namely, we used 9 attributes. The technical indicators were chosen among those most used by traders: Moving Averages, the Aroon Indicator, the Relative Strength Index, Chaikin Money Flow, the Stochastic Oscillator and the Average True Range. The variation between the last Close (t) and the last Open (t+1) was also incorporated in the data set as an attribute. This attribute is very important when the aim is to predict the closing price at t+1 (Zirilli, 1997). The target variable is the daily return of the closing prices, calculated according to the formula given before. This results in a regression data set. Additionally, using the same attributes, we created a classification data set by discretising the target variable into 4 classes, using the quartiles of the returns to determine the bin boundaries. The reason for using quartiles is to obtain a classification problem with balanced classes. The resulting classes have an easy interpretation: the first quartile represents large negative moves (sell opportunities), the last quartile represents large positive moves (buy opportunities), and the two middle quartiles represent very small moves in either direction (insufficient to compensate trading costs if any action is taken). In summary, we created two different data sets from the original data, representing two different views of the same problem: one with the original numeric returns and the other with the returns discretised into four classes. After the pre-processing steps described above we obtained a data set with 7837 observations.
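The quartile-based discretisation described above can be sketched as follows. This is a minimal illustration using NumPy; the function name is our own, and the class labels -2/-1/1/2 follow the encoding used later in the paper:

```python
import numpy as np

def quartile_classes(returns):
    """Discretise numeric returns into 4 balanced classes using the
    quartiles of the returns as bin boundaries:
    -2 = large fall, -1 = small fall, 1 = small rise, 2 = large rise."""
    q1, q2, q3 = np.percentile(returns, [25, 50, 75])
    r = np.asarray(returns)
    labels = np.full(len(r), 2)   # start with the top quartile label
    labels[r <= q3] = 1           # below the 3rd quartile: small rise
    labels[r <= q2] = -1          # below the median: small fall
    labels[r <= q1] = -2          # bottom quartile: large fall
    return labels
```

Because each bin holds one quarter of the cases by construction, the resulting classification problem is balanced.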
These observations were divided into a training set (the first 5173 cases) and a testing set (the remaining 2664 cases).

3. Exploratory Data Analysis

We carried out a simple analysis of the statistical properties of the target variable, separately for the training and testing cases (c.f. Figure 2).

[Figure 2 - Histograms of the returns in the training and testing sets, with the following statistical measures:]

                    Training set   Testing set
    Count           5172           2664
    Maximum         11.38          13.16
    Mean            0.03           0.07
    Median          0.00           0.00
    Mode            0.00           0.00
    Minimum         -23.52         -15.54
    Range           34.90          28.71
    Std. Deviation  1.42           2.15
    Variance        2.02           4.61
    Skewness        -0.47          0.22
    Kurtosis        17.55          5.64

The distribution of the returns target variable is non-normal, in the sense that it has longer tails and a bigger concentration of occurrences around the central value. This is more evident in the training set, which has a larger range of values, but it is also present in the test set, as can be seen through the kurtosis measure. Comparing the two data sets, the test set shows a larger variance. Regarding measures of centrality, the means are slightly different while the medians are the same.

4. Construction of Prediction Models

Using the two data sets described before, models were obtained with different learning algorithms.

4.1 Regression Data Set

Several experiments were carried out using the system RT4.1 (Torgo, 1999; www.liacc.up.pt/~ltorgo/RT). The results obtained on the testing set are summarised in Figure 3.
Results are presented using the Mean Square Error (MSE), Mean Absolute Deviation (MAD) and Normalised Mean Square Error (NMSE).

    Algorithm          MSE     MAD     NMSE
    Model of Mean      4.608   1.526   1.000
    RT4.1 (default)    4.152   1.487   0.901
    RT4.1 lr           3.376   1.338   0.733
    RT4.1 lr -tlm be   3.321   1.331   0.721

    Figure 3 - Results with RT.

The Model of Mean is the simplest model one can have: it always predicts the mean value of the returns (calculated on the training data). This model is used as a reference to evaluate the relative gain in performance obtained with more complex models. RT4.1 with default parameters grows a tree with 43 nodes and averages at its 22 leaves; this more complex model has little advantage over always predicting the average. The best model was a standard linear regression with attribute selection (3 attributes selected from the original 9), which performs slightly better than linear regression with all attributes. These experiments with the regression data set highlighted some specific characteristics of this domain. In fact, the most interesting observations (from a trading perspective) are a few extreme cases, usually considered outliers. Most learning algorithms will ignore such cases, because they are biased to reduce the overall prediction error, which is better achieved when the most common (non-outlier) cases are modelled well. This is the reason why some of the models we obtained simply predicted the mean. Thus, the main problem with this data set is that the cases that are most interesting from a trading perspective are not sufficiently representative, from a statistical point of view, to be considered relevant by the learning algorithms. This empirical observation has led us to develop a methodology that tries to overcome this difficulty, presented in Section 5.
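The three evaluation measures used above can be written down explicitly. This is a sketch; here NMSE normalises the model's squared error by that of the baseline predicting the training mean, which is consistent with the Model of Mean scoring an NMSE of 1.000:

```python
import numpy as np

def regression_scores(y_true, y_pred, train_mean):
    """Compute MSE, MAD and NMSE for a set of predictions.
    NMSE divides the model's MSE by the MSE of always predicting
    the mean return observed on the training data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    mad = np.mean(np.abs(y_true - y_pred))
    nmse = mse / np.mean((y_true - train_mean) ** 2)
    return mse, mad, nmse
```

An NMSE below 1 thus indicates a model that beats the mean-predicting baseline.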
4.2 Classification Data Set

We tried different classification learning algorithms on the classification data set, namely C5.0 (Quinlan, 1993; www.rulequest.com), Ltree (Gama, 2001; www.liacc.up.pt/~jgama) and a back-propagation Neural Network. The results obtained with these systems are summarised in Figure 4.

    Algorithm                  Error Rate (%)
    C5.0 (default)             65.1%
    C5.0 -r (decision rules)   62.6%
    C5.0 -t20 (boosting)       61.6%
    Neural Network             61.9%
    Ltree                      60.8%

    Figure 4 - Results with the Classification Data Set.

The default version of C5.0 produces one large decision tree, with 920 nodes, which has an error rate of 65.1%. Using C5.0 to generate rules, 87 rules are produced, with a smaller error rate; with 20-trial boosting, the error rate is smaller still. Ltree with pre-pruning (-m15) and a univariate parameter (-U) produced the best classification model, with a very small tree. We also tried to train a neural network for this task. The target variable was decomposed into 4 binary variables (each taking the value 1 if the case belongs to the corresponding class, and 0 otherwise). The best result (shown in Figure 4) was obtained with an architecture of 10 input neurons, 8 hidden neurons and 4 output neurons. In spite of all our efforts to reduce the error rate, the best result is still too high, meaning that more work needs to be done.

5. Suggested Methodology

Our regression experiments showed that the most interesting observations, from a trading perspective, were being disregarded by the regression models we obtained (c.f. Section 4.1). To overcome this difficulty, we propose an approach based on a two-stage learning process. In the first stage we try to obtain a model that is able to correctly identify the type of observation, according to the classes used before (large and small increases or decreases of the closing price returns).
Based on this classification of the training cases, we develop a regression model for each class of observations. In this way, each regression model is obtained only from similar cases. During prediction our approach follows the same two steps. In the first step, the classification model is used to assign the case for which we want a prediction to one of the four classes. Then, given the predicted class, the respective regression model is used to predict the closing price return.

5.1 Ideal Results (benchmark model)

The error of the proposed methodology has two sources: the classification error of the first-level model and the regression error of the second-level models. Ideally, the first-level model would have a classification error of zero. To understand the limits of the proposed methodology we simulated this situation: we looked at the test set and obtained the correct classification for each test case (thus "cheating"), and given this ideal classification we observed the error of the respective regression models. The regression algorithm used in this experiment was RT4.1 with default parameters (regression trees with averages at the leaves). The results obtained are presented in Figure 5.

    Case Class         Size   MSE     MAD     NMSE
    -2                 13     2.041   0.887   0.873
    -1                 3      0.073   0.236   0.995
    1                  3      0.058   0.207   1.021
    2                  7      2.176   0.959   0.728
    All observations   ---    1.277   0.638   ---

    Figure 5 - Results with the Benchmark Model.

These results can be regarded as the ideal performance one can aim at with our two-stage methodology, and they show a much lower prediction error than the results obtained without it. One can see that in the cases with lower absolute returns (classes 1 and -1), predicting the mean return seems to be a good compromise.
However, in the extreme cases, with bigger movements (classes 2 and -2), more complex models are necessary. Obviously, this experiment only gives an idea of how far one can expect to go with the proposed methodology, because in the real world the true class of the test cases is not known. Still, one can try to predict it using the classification models obtained as described previously. This is done in the next section.

5.2 Results of the Two-Stage Method

We now present the results obtained with the proposed method, including both prediction stages (classification and regression). The classification model was obtained using Ltree (the best classifier in our first experiments) to classify each new observation into one of the four classes. Within each class, a regression model was constructed using RT4.1 with the default parameter values. For each case of the test set, the classifier was used to obtain a probabilistic classification, i.e. for each test case Ltree produced a class probability distribution. With this set-up two different experiments were carried out. In the first, each test case was assigned to the class with the largest probability, and the respective regression model was used to obtain the prediction. In the second, all four regression models were used, and the final prediction was a weighted average of their predictions, with the class probabilities as weights. The results of both experiments are presented in Figure 6 and show that the second approach is better.

    Approach           MSE     MAD
    Maximum            4.579   1.593
    Weighted Average   3.760   1.394

    Figure 6 - Results with the Two-Stage Method.

These results show that the error of the classification stage was so high (60.8%, c.f. Figure 4) that the overall results of our two-stage methodology were worse than those obtained with the simpler approaches (c.f. Figure 3).
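The weighted-average variant of the second stage described above can be sketched as follows. The dictionary-based interfaces are hypothetical; any classifier producing class probabilities and any per-class regression models could be plugged in:

```python
def two_stage_predict(class_probs, regressors, x):
    """Combine the class-specific regression predictions for case x,
    weighted by the classifier's probability for each class.
    class_probs: dict mapping class label -> probability.
    regressors:  dict mapping class label -> prediction function."""
    return sum(p * regressors[c](x) for c, p in class_probs.items())
```

The first experiment (assigning each case to the most probable class) corresponds to using only the regressor of the class with the largest probability instead of this weighted sum.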
However, we should recall that, given the results of Figure 5, one can expect that if the classification error is reduced, the overall results can become significantly better than those of Figure 3. Thus the main conclusion of these experiments is that further work needs to be carried out on the classification stage for this two-stage approach to pay off. In particular, an idea that should provide better results is to explore misclassification costs within the classification task: rather than optimising the error rate, we would like to avoid certain types of error more than others (for instance, classifying a class -2 case as class 2).

6. Conclusions and Future Work

In this study we found that predicting daily returns is not an easy task. We believe this difficulty is increased because learning algorithms are error-oriented instead of profit-oriented. As such, these algorithms ignore the most interesting cases and concentrate their efforts on those without significant moves. This was the main motivation for developing a method that separates observations by class (type of market movement) and then uses class-specialised regression models. We constructed an ideal scenario to obtain a benchmark model, which shows that, with accurate classification of the cases, specialised regression models can achieve much lower errors. When we used our best classifier to obtain the class distributions for each observation, the results deteriorated. We believe that better results can be achieved with better classification, and we intend to explore misclassification costs as a means of biasing the classification stage towards more trading-oriented performance goals.

References

Gama, J. (2001), “Functional Trees for Classification”, IEEE International Conference on Data Mining, IEEE Computer Society.

Quinlan, J. (1993), “C4.5: Programs for Machine Learning”, Morgan Kaufmann.

Torgo, L.
(1999), “Inductive Learning of Tree-based Regression Models”, PhD thesis, Faculty of Sciences, University of Porto.

Zirilli, J.S. (1997), “Financial Prediction Using Neural Networks”, International Thomson Computer Press, London, UK.