Forecasting volatility with support vector machines: an international financial markets study

FERNANDO PÉREZ-CRUZ (1), JULIO A. AFONSO-RODRÍGUEZ (2) AND JAVIER GINER-RUBIO (3)
(1) Department of Signal Theory and Communications, University Carlos III, 28911 Madrid
(2) Department of Institutional Economics, Economic Statistics and Econometrics, University of La Laguna, 38071 Tenerife - Canary Islands
(3) Department of Financial Economy and Accounting, University of La Laguna, 38071 Tenerife - Canary Islands
SPAIN

Abstract: The Support Vector Machine (SVM) appeared in the late nineties as a new technique with a great number of analogies to Neural Network (NN) techniques. Since Vapnik's first paper (1992), the number of applications of this method has grown considerably, from classification tasks such as OCR to chaotic time series prediction, due to its better features compared to ordinary NN techniques. In this paper, we analyze the applications of SVM in finance by focusing on one important question of this subject: the forecasting of the volatility of financial time series, comparing the effectiveness of this technique with the usual GARCH estimation models. Empirical results with stock time series and comparisons between these techniques are presented, showing the improved accuracy obtained with the SVM technique.

Key-words: Support Vector Machines; Neural Networks; GARCH model; Volatility Forecasting

1. Introduction

Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery (Vapnik, 1998; Schölkopf and Smola, 2001). SVMs can be used for pattern recognition and regression estimation problems, both of which require solving a quadratic functional subject to linear constraints. The SVM was developed in the machine learning community and can be viewed as a Neural Network (NN). However, the SVM presents several advantages over other NN schemes, such as the Multi-Layered Perceptron (MLP) or Radial Basis Function Networks (RBFN). The architecture of the SVM is given by the learning procedure, so one does not need to choose a priori the number of hidden units or layers. More importantly, the functional to be minimized presents a single minimum, so one does not need to address the local minima that are present in most NN procedures.

In this paper, we will use the SVM for regression estimation, also known as the support vector regressor (SVR), for estimating the parameters of a GARCH model and for testing the efficient market hypothesis. The GARCH model is usually estimated using least squares procedures, which are only optimal if the variable to be estimated has been sampled from a Gaussian distribution function. The SVR defines an insensitivity zone (detailed in Section 2.1) that allows it to deal with any distribution function. So, if the estimated variable has not been sampled from a Gaussian distribution function, the SVR can actually lead to better predictions than ordinary least squares. The SVR will also be used to try to fit a predictive model over the actual return of the stock market, in order to assess whether the efficient market hypothesis holds in the light of this novel technique for knowledge discovery.

Figure 1. Cost function associated to SVR errors.

The rest of the paper is outlined as follows. Section 2 introduces the SVR and its solving procedure. We deal with the estimation of the GARCH model in Section 3, using least squares and SVR. In Section 4, we apply the SVR to test the efficient market hypothesis by trying to predict the return on the S&P100 stock market index. We end in Section 5 with some concluding remarks.

2. The Support Vector Machines

The SVR needs to work with a training set to adjust its parameters, and afterwards its model can be used to predict any possible outcome.
The prediction over the samples used for training is known as in-sample prediction, and the prediction over samples the algorithm did not use for training is known as out-sample prediction; its error is also known as the generalization error in the NN community.

The SVR needs a labelled training data set ($\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, for $i = 1, \ldots, n$, where $\mathbf{x}_i$ is the input vector and $y_i$ is its corresponding label; see footnote 1) to solve the regression estimation problem ($y_i = \mathbf{w}^T\mathbf{x}_i + b$, where $\mathbf{w}$ and $b$ define the linear regressor), i.e., to find the values of $\mathbf{w}$ and $b$. The SVR uses the penalty function shown in Figure 1, in which the samples with a prediction error ($e_i = y_i - \mathbf{w}^T\mathbf{x}_i - b$) lower than $\varepsilon$ in absolute value are not penalized, and those samples with a prediction error greater than $\varepsilon$ are linearly penalized. Most regression estimation problems use $f(e_i) = e_i^2$ (least squares), but for those problems in which $y_i$ has not been drawn from a Gaussian distribution function, least squares (LS) techniques are suboptimal. Furthermore, the value of $\varepsilon$ can be optimally set if the probability density function of $e_i$ is known, as shown in (Smola et al., 1998).

Footnote 1. We use matrix notation for compactness. Vectors are column vectors and are denoted by bold lower case letters, matrices by bold upper case letters and scalars by slanted lower case (sometimes also upper case) letters. The dot product between column vectors is written as a matrix multiplication ($\mathbf{w}^T\mathbf{x}$), where $T$ indicates the matrix transpose operation.

2.1. SVR Optimisation

The SVR is stated as a constrained optimisation problem:

$$\min_{\mathbf{w},\, b,\, \xi_i,\, \xi_i^*} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \tag{1}$$

subject to

$$y_i - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) - b \le \varepsilon + \xi_i \tag{2}$$

$$\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^* \tag{3}$$

$$\xi_i,\; \xi_i^* \ge 0 \tag{4}$$

Here $\boldsymbol{\phi}(\cdot)$ is a nonlinear transformation to a higher dimensional space ($\mathbf{x}_i \in \mathbb{R}^d \rightarrow \boldsymbol{\phi}(\mathbf{x}_i) \in \mathbb{R}^H$, $d \le H$). The SVR defines a linear regressor in the transformed space ($\mathbb{R}^H$), which is nonlinear in the input space unless $\boldsymbol{\phi}(\mathbf{x}_i) = \mathbf{x}_i$ (linear regression). The $\xi_i$ and $\xi_i^*$ are positive slack variables, introduced to deal with the samples with prediction error greater than $\varepsilon$.
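The penalty of Figure 1 and the objective (1) can be illustrated with a short sketch. For a fixed linear regressor, the smallest slack that satisfies (2)-(4) for sample i is max(0, |e_i| - ε), so the objective can be evaluated directly. The function names and the toy data below are hypothetical, chosen only for illustration, and the sketch assumes the linear case φ(x) = x.

```python
def eps_insensitive_loss(error, eps):
    """Penalty of Figure 1: zero inside the tube |e| <= eps, linear outside."""
    return max(0.0, abs(error) - eps)


def squared_loss(error):
    """Least-squares penalty, shown only for comparison."""
    return error * error


def svr_primal_objective(w, b, xs, ys, C, eps):
    """Value of (1) for a linear regressor (phi(x) = x assumed).

    For fixed w and b, the optimal slack of sample i is
    max(0, |e_i| - eps), with e_i = y_i - w^T x_i - b.
    """
    regulariser = 0.5 * sum(wj * wj for wj in w)
    total_slack = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(wj * xj for wj, xj in zip(w, x)) + b
        total_slack += eps_insensitive_loss(y - prediction, eps)
    return regulariser + C * total_slack


if __name__ == "__main__":
    # Hypothetical one-dimensional toy data.
    xs = [[0.0], [1.0], [2.0]]
    ys = [0.1, 0.9, 2.6]
    for e in (-2.0, -0.5, 0.0, 0.3, 1.5):
        print(e, eps_insensitive_loss(e, 0.5), squared_loss(e))
    print(svr_primal_objective([1.0], 0.0, xs, ys, C=10.0, eps=0.2))
```

Only the third toy sample falls outside the tube, so it alone contributes to the slack term; under a squared loss every non-zero error would contribute.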
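This problem is usually solved by introducing constraints (2), (3) and (4) through Lagrange multipliers, leading to the minimization of

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left(\varepsilon + \xi_i - y_i + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) + b\right) - \sum_{i=1}^{n} \alpha_i^* \left(\varepsilon + \xi_i^* + y_i - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) - b\right) - \sum_{i=1}^{n} \left(\mu_i \xi_i + \mu_i^* \xi_i^*\right) + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \tag{5}$$

with respect to $\mathbf{w}$, $b$, $\xi_i$ and $\xi_i^*$, and its maximization with respect to the Lagrange multipliers $\alpha_i$, $\alpha_i^*$, $\mu_i$ and $\mu_i^*$. To solve this problem one needs the Karush-Kuhn-Tucker conditions (Fletcher, 1987), which impose conditions on the variables in (5). These are (2), (3), (4) and

$$\frac{\partial L_P}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^*\right) \boldsymbol{\phi}(\mathbf{x}_i) = \mathbf{0} \tag{6}$$

$$\frac{\partial L_P}{\partial b} = \sum_{i=1}^{n} \left(\alpha_i^* - \alpha_i\right) = 0 \tag{7}$$

$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{8}$$

$$\frac{\partial L_P}{\partial \xi_i^*} = C - \alpha_i^* - \mu_i^* = 0 \tag{9}$$

$$\alpha_i,\; \alpha_i^*,\; \mu_i,\; \mu_i^* \ge 0 \tag{10}$$

$$\alpha_i \left(\varepsilon + \xi_i - y_i + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) + b\right) = 0 \tag{11}$$

$$\alpha_i^* \left(\varepsilon + \xi_i^* + y_i - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i) - b\right) = 0 \tag{12}$$

$$\mu_i \xi_i = 0 \quad \text{and} \quad \mu_i^* \xi_i^* = 0 \tag{13}$$

The usual procedure to solve the SVR introduces (6), (7), (8) and (9) into (5), leading to the maximization of

$$L_d = \sum_{i=1}^{n} y_i \left(\alpha_i - \alpha_i^*\right) - \varepsilon \sum_{i=1}^{n} \left(\alpha_i + \alpha_i^*\right) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right) \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}_j) \tag{14}$$

subject to (7) and $0 \le \alpha_i, \alpha_i^* \le C$. This problem can be solved using QP schemes, and to solve it one does not need to know the nonlinear mapping $\boldsymbol{\phi}(\cdot)$, only its Reproducing Kernel in Hilbert Space (RKHS), $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}_j)$. The SVR can also be solved with an Iterative Re-Weighted Least Squares (IRWLS) procedure, which is easier to implement and much faster than the usual QP schemes, as shown in (Pérez-Cruz et al., 2000). This is the procedure we use throughout this paper.

Some of the most widely used kernels are shown in Table 1, where $k$ is a natural number and $\sigma$ is a real number. We recall that the Mercer theorem (Schölkopf and Smola, 2001) states the necessary and sufficient conditions for any function $\kappa(\mathbf{x}_i, \mathbf{x}_j)$ to be a kernel in some Hilbert space.

Table 1. Some of the kernels used in SVR.
Linear:      $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$
Polynomial:  $\kappa(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + 1)^k$
RBF:         $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2)\right)$

3. Estimating GARCH with SVR

The SVR can be used instead of LS to estimate the parameters of a GARCH model, described below.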
With the use of SVR and an appropriate selection of its insensitivity parameter $\varepsilon$, we will be able to deal with the main empirical properties usually observed in high frequency financial time series: high kurtosis, small first order autocorrelation of squared observations and slow decay towards zero of the autocorrelation coefficients of squared observations. The SVR can adjust to any probability distribution function by setting the value of $\varepsilon$ differently for each problem.

The ability of GARCH models to provide good estimates of equity and index return volatility is well documented. Many studies show that the parameters of a variety of different GARCH models are highly significant in sample; see, for example, Bollerslev (1986 and 1987), Nelson (1991) or Andersen and Bollerslev (1998). However, there is less evidence that GARCH models provide good forecasts of equity return volatility. Some studies (Franses and Van Dijk, 1995; Figlewski, 1997) examine the out-of-sample predictive ability of GARCH models. All find that a regression of realized volatility on forecast volatility produces a low $R^2$ statistic (often less than 10%), and hence the predictive power of the forecasts may be questionable. Recent works are introducing important improvements in volatility forecasting and testing procedures (Andersen et al., 2001; Blair et al., 2001), but SVR estimation can be carried out independently of the initial approach used.

3.1. The GARCH(1,1) model

The GARCH(1,1) model provides a simple representation of the main statistical characteristics of the return series of a wide range of assets and, consequently, it is extensively used to model real financial time series. It serves as a natural benchmark for the forecast performance of heteroscedastic models based on ARCH. In the simplest setup, if $y_t$ follows a GARCH(1,1) model, then

$$y_t = \sigma_t \varepsilon_t, \qquad \sigma_t^2 = \omega + \alpha\, y_{t-1}^2 + \beta\, \sigma_{t-1}^2 \tag{15}$$

where $\varepsilon_t$ is a white noise process with unit variance, NID(0, 1).
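As an illustration of the recursion in (15), the following minimal sketch simulates a GARCH(1,1) path. The names omega, alpha and beta for the constant and the two weights, the parameter values, and the choice of starting the recursion at the unconditional variance omega / (1 - alpha - beta) are assumptions made for the example; they are not values estimated in this paper.

```python
import random


def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n observations of y_t = sigma_t * eps_t with
    sigma_t^2 = omega + alpha * y_{t-1}^2 + beta * sigma_{t-1}^2.

    Returns the observations and the conditional variances.
    """
    # Positivity and stationarity constraints on the parameters.
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    # Assumed starting point: the unconditional variance of the process.
    variance = omega / (1.0 - alpha - beta)
    ys, variances = [], []
    for _ in range(n):
        y = (variance ** 0.5) * rng.gauss(0.0, 1.0)  # eps_t ~ N(0, 1)
        ys.append(y)
        variances.append(variance)
        # Next period's conditional variance, as in (15).
        variance = omega + alpha * y * y + beta * variance
    return ys, variances


if __name__ == "__main__":
    # Hypothetical parameter values for a daily return series.
    ys, variances = simulate_garch11(2000, omega=1e-5, alpha=0.1, beta=0.85)
    sample_var = sum(y * y for y in ys) / len(ys)
    print("sample variance of y_t:", sample_var)
    print("unconditional variance:", 1e-5 / (1.0 - 0.1 - 0.85))
```

The conditional variance is computed before each draw, so `variances[t]` depends only on information available up to time t - 1, which is what makes it usable as a one-step-ahead forecast of the squared observation.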
For typical financial time series, the stationary mean of the original data $y_t$ can be neglected and set to zero. $\sigma_t$ is known as the conditional volatility of the process. Following the definition in (15), the conditional variance $\sigma_t^2$ is a stochastic process assumed to be a constant plus a weighted average of last period's forecast, $\sigma_{t-1}^2$, and last period's squared observation, $y_{t-1}^2$. $\omega$, $\alpha$ and $\beta$ are unknown parameters that satisfy $\omega > 0$ and $\alpha, \beta \ge 0$ to ensure the positivity of the conditional variance. The parameter $\omega$ has to be strictly positive for the process $y_t$ not to degenerate. The process $y_t$ is stationary if $\alpha + \beta < 1$.

Financial return series are mainly characterized by high kurtosis. They typically exhibit little correlation, but the squared observations often show high autocorrelation and persistence. This implies correlation in the variance process, and is an indication that the data are a candidate for GARCH modeling.

3.2. Empirical modeling

In order to illustrate the main empirical properties often observed in high frequency financial time series, we analyze the descriptive statistics of six series observed daily. If we denote by $p_t$ the observed price at time $t$, the series of interest are the returns, defined as $y_t = \ln p_t - \ln p_{t-1}$. The series considered are the returns of four international stock market indexes (the S&P100 index observed from January 1996 to October 2000, the FTSE 100 index observed from January 1995 to December 2000, the IBEX 35 of the Madrid Stock Exchange observed from January 1990 to December 1999 and the NIKKEI index observed from January 1995 to November 1999) and the returns of two stock prices, General Motors and Hewlett Packard, from January 1996 to October 2000. All the series (see footnote 2) have zero mean and excess kurtosis, with a kurtosis always above 3, the value for the normal distribution.
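The descriptive quantities discussed above (log returns, kurtosis and the autocorrelation of squared observations) can be computed as in the following minimal sketch. The helper names and the short price list are hypothetical and serve only to show the calculations; they are not data from the series analyzed in this paper.

```python
import math


def log_returns(prices):
    """y_t = ln p_t - ln p_{t-1} for consecutive prices."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]


def kurtosis(xs):
    """Fourth central moment over squared variance (3 for a normal law)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2)


def autocorrelation(xs, lag=1):
    """Sample autocorrelation coefficient of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    return num / denom


if __name__ == "__main__":
    # Hypothetical daily closing prices.
    prices = [100.0, 101.0, 99.5, 102.0, 101.2, 103.0, 102.1, 104.5]
    y = log_returns(prices)
    y_squared = [v * v for v in y]
    print("kurtosis of returns:", kurtosis(y))
    print("lag-1 autocorrelation of returns:", autocorrelation(y))
    print("lag-1 autocorrelation of squared returns:", autocorrelation(y_squared))
```

On real daily return series one would apply the same functions to several lags of the squared returns, since it is the slow decay across lags that signals persistence in the variance.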
It is also important to note that, although the series themselves show little autocorrelation, the squared observations are strongly correlated, and their analysis shows significant correlation and persistence.

We have obtained the ML estimates of the parameters of the GARCH(1,1) model for all the series considered (see footnote 3). The original series of length $N$ was divided into two sets: the in-sample or training set, with the first $N/2$ samples, and the out-sample or testing set, with the last $N/2$ samples. We have also estimated the same parameters using the SVR technique. We must point out that the two sets of parameters are quite different, which is due to the different optimization procedures used by the two schemes.

Footnote 2. We have eliminated some tables in order to save space in the article.
Footnote 3. The estimation has been carried out with Matlab, version 12.

3.3. Forecasting GARCH with ML and SVR

Given forecasts $\hat{y}_t^2$ of the squared returns $y_t^2$ known at times $t = 1, \ldots, N$, we report the proportion of variance explained by the forecasts with the $R^2$ statistic, e.g. Theil (1971), defined by:

$$R^2 = 1 - \frac{\sum_{t=1}^{N} \left(y_t^2 - \hat{y}_t^2\right)^2}{\sum_{t=1}^{N} \left(y_t^2 - \operatorname{mean}\left(y_t^2\right)\right)^2} \tag{16}$$

This relative accuracy statistic indicates the fraction of the variability in the observations that the model accounts for. For example, $R^2 = 0.11$ means that the model accounts for 11% of the variability in the observations. If $R^2$ is zero or very small, the model is not capable of extracting the deterministic part of the time series. If $R^2$ is negative, the model is worse than the sample mean as a predictor, because it introduces more variability than the original variance of the time series.

Table 2. $R^2$ statistic for GARCH solved by ML and SVM models. The in-sample data are the first half and the out-sample data are the second half.
          GARCH LS                   GARCH SVR
          In-sample    Out-sample    In-sample    Out-sample
Series    R2           R2            R2           R2
SP100     0.0466       0.0365        0.0565       0.0427
FTSE      0.0911       0.0352        0.0475       0.0423
IBEX      0.0590       0.0502        0.0999       0.1341
NIKKEI    0.0110       0.0423        0.0108       0.0479
GM        0.0175       0.0055        0.0153       0.0066
HP        -0.0116      -0.0171       2.16E-04     0.00478

In Table 2 we can see that the SVR technique is able to explain a higher percentage of all the series in the out-sample except for the IBEX one, in which the ML is superior. Also, the SVR is always able to predict better than the mean, which the ML technique does not achieve for the HP data set. These results are as expected, because the data sets do not resemble a Gaussian distribution function, and a different technique such as SVR is able to extract more knowledge from the data set than the usual ML techniques. We have plotted the predicted value of $\sigma_t^2$ and the squared observations $y_t^2$ for all data in the in-sample and out-sample sets in Figure 2, for the S&P100 index. In these plots one can see that the predictions of the two methods are very alike, although the prediction obtained by the SVM explains over half a percentage point more of the out-sample variance than the LS scheme.

Figure 2. S&P100. Squared observations $y_t^2$ and $\hat{\sigma}_t^2$ GARCH(1,1) estimations using LS and SVM. Top panel: S&P100 squared observations and GARCH LS forecast ($R^2$ in-sample = 0.0466, $R^2$ out-sample = 0.0365). Bottom panel: S&P100 squared observations and GARCH SVM forecast ($R^2$ in-sample = 0.0565, $R^2$ out-sample = 0.0427). Horizontal axes run from 0 to 1200 samples; vertical axes are scaled by $10^{-3}$.

4. Conclusions and further work

We have shown that SVR can be used to estimate the parameters of a GARCH model for estimating the volatility of stock market return series, with a much higher accuracy than the one obtained with LS. The SVR has been used over a widely known model, producing a better estimate. As further work, it will have to be tested over other models and kernels in order to assess whether it is able to further improve the results shown.

References:
Andersen, T.G.
and T. Bollerslev (1998), “Answering the skeptics: yes, standard volatility models do provide accurate forecasts”, International Economic Review, 39, pp. 885-905.
Andersen, T.G., T. Bollerslev, F.X. Diebold and H. Ebens (2001), “The distribution of realized stock return volatility”, Journal of Financial Economics, 61, pp. 43-76.
Blair, B.J., S.H. Poon and S.J. Taylor (2001), “Forecasting S&P100 volatility: the incremental information content of implied volatilities and high-frequency index returns”, Journal of Econometrics, 105, pp. 5-26.
Bollerslev, T. (1986), “Generalized Autoregressive Conditional Heteroskedasticity”, Journal of Econometrics, 31, pp. 307-327.
Bollerslev, T. (1987), “A conditional heteroskedasticity time series model for speculative prices and rates of returns”, Review of Economics and Statistics, 69, pp. 542-547.
Figlewski, S. (1997), “Forecasting volatility”, Financial Markets, Institutions and Instruments, 6, pp. 1-88.
Fletcher, R. (1987), Practical Methods of Optimization, Wiley.
Franses, P.H. and D. Van Dijk (1995), “Forecasting stock market volatility using (nonlinear) GARCH models”, Journal of Forecasting, 15, pp. 229-235.
Nelson, D.B. (1991), “Conditional heteroskedasticity in asset returns: a new approach”, Econometrica, 59, pp. 347-370.
Pérez-Cruz, F., A. Navia-Vázquez, P.L. Alarcón-Diana and A. Artés-Rodríguez (2000), “An IRWLS procedure for SVR”, Proceedings of the EUSIPCO'00.
Schölkopf, B. and A. Smola (2001), Learning with kernels, M.I.T. Press (to appear).
Smola, A., N. Murata, B. Schölkopf and K.R. Müller (1998), “Asymptotically Optimal Choice of ε-Loss for Support Vector Machines”, Proceedings of ICANN'98.
Theil, H. (1971), Principles of Econometrics, Wiley, New York.
Vapnik, V.N. (1998), Statistical Learning Theory, Wiley.