Forecasting volatility with support vector machines:
an international financial markets study
FERNANDO PÉREZ-CRUZ¹, JULIO A. AFONSO-RODRÍGUEZ² AND JAVIER GINER-RUBIO³
¹ Department of Signal Theory and Communications, University Carlos III, 28911 Madrid
² Department of Institutional Economics, Economic Statistics and Econometrics, University of La Laguna, 38071 Tenerife, Canary Islands
³ Department of Financial Economy and Accounting, University of La Laguna, 38071 Tenerife, Canary Islands
SPAIN
Abstract: The Support Vector Machine (SVM) appeared in the late nineties as a new technique with a great number of analogies to Neural Network (NN) techniques. Since Vapnik's first paper (1992), the number of applications of this method has grown considerably, from classification tasks such as OCR to chaotic time series prediction, due to its better features compared to ordinary NN techniques. In this paper, we analyze the applications of the SVM in finance by focusing on one important question of this subject: the forecasting of the volatility of financial time series, comparing the effectiveness of this technique to standard GARCH estimation models. Empirical results with stock time series and comparisons between these techniques are presented, showing the improved accuracy obtained with the SVM technique.
Key-words: Support Vector Machines; Neural Networks; GARCH model; Volatility Forecasting
1. Introduction
Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery (Vapnik, 1998; Schölkopf and Smola, 2001). SVMs can be applied to pattern recognition and regression estimation problems, requiring the solution of a quadratic functional under linear constraints. The SVM was developed in the machine learning community and can be viewed as a neural network (NN). However, the SVM presents several advantages over other NN schemes, such as the Multi-Layered Perceptron (MLP) or Radial Basis Function Networks (RBFN). The architecture of the SVM is given by the learning procedure, so one does not need to choose a priori the number of hidden units or layers. Also, and more relevantly, the functional to be minimized presents a single minimum, so one does not need to address the local minima that are present in most NN procedures.
In this paper, we will use the SVM for regression estimation, also known as the support vector regressor (SVR), to estimate the parameters of a GARCH model and to test the efficient market hypothesis. The GARCH model is usually estimated using least squares procedures, which are only optimal if the variable to be estimated has been sampled from a Gaussian distribution function. The SVR defines an insensitivity zone (detailed in Section 2.1) that allows it to deal with any distribution function. So, if the estimated variable has not been sampled from a Gaussian distribution function, the SVR can actually lead to better predictions than ordinary least squares. The SVR will also be used to try to fit a predictive model over the actual return of the stock market, in order to assess whether the efficient market hypothesis holds in the light of this novel technique for knowledge discovery.
Figure 1. Cost function associated with SVR errors.
The rest of the paper is outlined as follows. Section 2 is devoted to introducing the SVR and its solution procedure. We deal with the estimation of the GARCH model in Section 3, using least squares and SVR. In Section 4, we apply the SVR to test the efficient market hypothesis, trying to predict the return on the S&P100 stock market index. We end in Section 5 with some concluding remarks.
2. The Support Vector Machines
The SVR needs a training set to adjust its parameters; afterwards, its model can be used to predict any possible outcome. The prediction over the samples used for training purposes is known as in-sample prediction, and the prediction over the samples the algorithm did not use for training is known as out-of-sample prediction, also known as generalization error in the NN community. The SVR needs a labelled training data set ($\mathbf{x}_i \in \mathbb{R}^d$ and $y_i$, for $i = 1, \ldots, n$, where $\mathbf{x}_i$ is the input vector and $y_i$ is its corresponding label)¹ to solve the regression estimation problem ($y_i = \mathbf{w}^T\mathbf{x}_i + b$, where $\mathbf{w}$ and $b$ define the linear regressor), i.e., finding the values of $\mathbf{w}$ and $b$. The SVR uses the penalty function shown in Figure 1, in which the samples with a prediction error ($e_i = y_i - \mathbf{w}^T\mathbf{x}_i - b$) lower than $\varepsilon$ in absolute value are not penalized, and those samples with a prediction error greater than $\varepsilon$ are linearly penalized. Most regression estimation problems use $f(e_i) = e_i^2$ (least squares), but for those problems in which $y_i$ has not been drawn from a Gaussian distribution function, least squares (LS) techniques are suboptimal. Furthermore, the value of $\varepsilon$ can be optimally set if the probability density function of $e_i$ is known, as shown in (Smola et al., 1998).
¹ For compactness we will use matrix notation. Vectors will be column vectors denoted by bold lower case, matrices by bold upper case, and scalars by slanted lower case (sometimes also upper case). The dot product between column vectors will be written as a matrix multiplication ($\mathbf{w}^T\mathbf{x}$), where $T$ indicates the matrix transpose operation.
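To make the penalty function concrete, the following minimal Python sketch implements the $\varepsilon$-insensitive cost of Figure 1; the function name and the example errors are purely illustrative and not part of the original article.

```python
import numpy as np

def eps_insensitive_loss(e, eps):
    """Vapnik's epsilon-insensitive cost of Figure 1: zero inside the
    [-eps, +eps] tube, growing linearly outside it."""
    return np.maximum(np.abs(e) - eps, 0.0)

# Errors inside the tube cost nothing; larger errors are penalized linearly.
errors = np.array([-0.30, -0.05, 0.00, 0.05, 0.30])
print(eps_insensitive_loss(errors, eps=0.1))  # [0.2 0.  0.  0.  0.2]
```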
2.1. SVR Optimisation
The SVR is stated as a constrained optimisation
problem:
$$\min_{\mathbf{w},\, b,\, \xi_i,\, \xi_i^*} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \qquad (1)$$

subject to

$$y_i - \mathbf{w}^T \phi(\mathbf{x}_i) - b \le \varepsilon + \xi_i \qquad (2)$$

$$\mathbf{w}^T \phi(\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^* \qquad (3)$$

$$\xi_i,\, \xi_i^* \ge 0 \qquad (4)$$
$\phi(\cdot)$ is a nonlinear transformation to a higher dimensional space ($\mathbf{x}_i \in \mathbb{R}^d \to \phi(\mathbf{x}_i) \in \mathcal{H}$, with $d \ll \dim(\mathcal{H})$). The SVR defines a linear regressor in the transformed space $\mathcal{H}$, which is nonlinear in the input space, unless $\phi(\mathbf{x}_i) = \mathbf{x}_i$ (linear regression). The $\xi_i$ and $\xi_i^*$ are positive slack variables, introduced for the samples with prediction error greater than $\varepsilon$.
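To make the primal problem concrete, the following sketch solves (1)-(4) directly for the linear case $\phi(\mathbf{x}_i) = \mathbf{x}_i$ using the cvxpy modelling library; the library choice and the toy data are our assumptions, not part of the original study.

```python
import cvxpy as cp
import numpy as np

# Toy data: n samples in d dimensions (illustrative only).
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.standard_normal(n)

C, eps = 10.0, 0.1
w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)       # slacks for errors above +eps, eq. (4)
xi_star = cp.Variable(n, nonneg=True)  # slacks for errors below -eps

resid = y - X @ w - b
prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star)),  # eq. (1)
    [resid <= eps + xi,         # eq. (2)
     -resid <= eps + xi_star])  # eq. (3)
prob.solve()
print(w.value, b.value)
```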
This problem is usually solved by introducing the constraints (2), (3) and (4) into the functional using Lagrange multipliers, leading to the minimization of

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) - \sum_{i=1}^{n} \alpha_i \left( \varepsilon + \xi_i - y_i + \mathbf{w}^T \phi(\mathbf{x}_i) + b \right) - \sum_{i=1}^{n} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \mathbf{w}^T \phi(\mathbf{x}_i) - b \right) - \sum_{i=1}^{n} \left( \mu_i \xi_i + \mu_i^* \xi_i^* \right) \qquad (5)$$
with respect to $\mathbf{w}$, $b$, $\xi_i$ and $\xi_i^*$, and its maximization with respect to the Lagrange multipliers $\alpha_i$, $\alpha_i^*$, $\mu_i$ and $\mu_i^*$. In order to solve this problem one needs to compute the Karush-Kuhn-Tucker conditions (Fletcher, 1987), which impose some conditions over the variables in (5), namely (2), (3), (4) and
$$\frac{\partial L_P}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^* \right) \phi(\mathbf{x}_i) = 0 \qquad (6)$$

$$\frac{\partial L_P}{\partial b} = \sum_{i=1}^{n} \left( \alpha_i^* - \alpha_i \right) = 0 \qquad (7)$$

$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \qquad (8)$$

$$\frac{\partial L_P}{\partial \xi_i^*} = C - \alpha_i^* - \mu_i^* = 0 \qquad (9)$$

$$\alpha_i,\, \alpha_i^*,\, \mu_i,\, \mu_i^* \ge 0 \qquad (10)$$

$$\alpha_i \left( \varepsilon + \xi_i - y_i + \mathbf{w}^T \phi(\mathbf{x}_i) + b \right) = 0 \qquad (11)$$

$$\alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \mathbf{w}^T \phi(\mathbf{x}_i) - b \right) = 0 \qquad (12)$$

$$\mu_i \xi_i = 0 \quad \text{and} \quad \mu_i^* \xi_i^* = 0 \qquad (13)$$
The usual procedure to solve the SVR introduces (6), (7), (8) and (9) into (5), leading to the maximization of

$$L_d = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \alpha_i - \alpha_i^* \right)\left( \alpha_j - \alpha_j^* \right) \phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j) - \varepsilon \sum_{i=1}^{n} \left( \alpha_i + \alpha_i^* \right) + \sum_{i=1}^{n} y_i \left( \alpha_i - \alpha_i^* \right) \qquad (14)$$

subject to (7) and $0 \le \alpha_i, \alpha_i^* \le C$. This problem can be solved using QP schemes, and in order to solve it one does not need to know the nonlinear mapping $\phi(\cdot)$, only its Reproducing Kernel in Hilbert Space (RKHS) $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$. The SVR can also be solved relying on an Iterative Re-Weighted Least Squares (IRWLS) procedure, which is easier to implement and much faster than the usual QP schemes, as shown in (Pérez-Cruz and Navia-Vázquez, 2000); this is the procedure we use throughout this paper.
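As a hedged illustration of the dual formulation (14) in practice, the following sketch fits an RBF-kernel SVR with scikit-learn on toy data of our own making; note that scikit-learn solves the QP with an SMO-type scheme rather than the IRWLS procedure adopted in this paper, so it serves only as a readily available stand-in.

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression problem (illustrative, not the paper's data).
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(100)

# epsilon is the width of the insensitivity zone, C the penalty in eq. (1).
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5)
svr.fit(X, y)
y_hat = svr.predict(X)
print("support vectors:", svr.support_vectors_.shape[0])
```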
Some of the most widely used kernels are shown in Table 1, where $k$ is a natural number and $\sigma$ is a real number. We must recall that the Mercer theorem (Schölkopf and Smola, 2001) states the necessary and sufficient conditions for any function $\kappa(\mathbf{x}_i, \mathbf{x}_j)$ to be a kernel in some Hilbert space.
Table 1. Some of the kernels used in SVR.

Linear        $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
Polynomial    $\kappa(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + 1)^k$
RBF           $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2)\right)$
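The kernels of Table 1 can be written directly, as in this short sketch; we assume NumPy, and the sample vectors are illustrative.

```python
import numpy as np

# The three kernels of Table 1, written for single column vectors x, z.
def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, k=3):
    return (x @ z + 1.0) ** k

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```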
3. Estimating GARCH with SVR
The SVR can be used instead of LS to estimate the parameters of a GARCH model, described below. With the use of the SVR and an appropriate selection of its insensitivity parameter $\varepsilon$, we will be able to deal with the main empirical properties usually observed in high frequency financial time series: high kurtosis, small first order autocorrelation of squared observations, and slow decay towards zero of the autocorrelation coefficients of squared observations. The SVR can adjust to any probability distribution function by setting the value of $\varepsilon$ differently for each problem.
The ability of GARCH models to provide good
estimates of equity and index return volatility is
well documented. Many studies show that the
parameters of a variety of different GARCH
models are highly significant in sample; see, for
example, Bollerslev (1986 and 1987), Nelson
(1991) or Andersen and Bollerslev (1998).
However, there is less evidence that GARCH models provide good forecasts of equity return volatility. Some studies (Franses and Van Dijk, 1995; Figlewski, 1997) examine the out-of-sample predictive ability of GARCH models. All find that a regression of realized volatility on forecast volatility produces a low $R^2$ statistic (often less than 10%), and hence the predictive power of the forecasts may be questionable. Recent works introduce important improvements in volatility forecasting and testing procedures (Andersen et al., 2001 or Blair et al., 2001), but SVR estimation can be carried out independently of the initial approach used.
3.1. The GARCH(1, 1) model
The GARCH(1,1) model provides a simple representation of the main statistical characteristics of the return series of a wide range of assets and, consequently, it is extensively used to model real financial time series. It serves as a natural benchmark for the forecast performance of heteroskedastic models based on ARCH. In the simplest setup, if $y_t$ follows a GARCH(1,1) model, then

$$y_t = \sigma_t \varepsilon_t, \qquad \sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2 \qquad (15)$$

where $\varepsilon_t$ is a white noise process with unit variance, NID(0, 1). For the usual financial time series, the stationary mean of the original data $y_t$ can be neglected and set equal to zero. $\sigma_t$ is known as the conditional volatility of the process. Following the definition in (15), the conditional variance $\sigma_t^2$ is a stochastic process assumed to be a constant plus a weighted average of last period's forecast, $\sigma_{t-1}^2$, and last period's squared observation, $y_{t-1}^2$. $\omega$, $\alpha$ and $\beta$ are unknown parameters that satisfy $\omega > 0$ and $\alpha, \beta \ge 0$ to ensure the positivity of the conditional variance. The parameter $\omega$ has to be strictly positive for the process $y_t$ not to degenerate. The process $y_t$ is stationary if $\alpha + \beta < 1$.
Financial return series are mainly characterized by high kurtosis; they typically exhibit little correlation, but the squared observations often indicate high autocorrelation and persistence. This implies correlation in the variance process, and is an indication that the data is a candidate for GARCH modeling.
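For intuition, the following sketch simulates the GARCH(1,1) process of (15); the parameter values are our own, chosen only to satisfy the constraints above.

```python
import numpy as np

def simulate_garch11(omega, alpha, beta, n, seed=0):
    """Simulate y_t = sigma_t * eps_t with
    sigma_t^2 = omega + alpha * y_{t-1}^2 + beta * sigma_{t-1}^2,
    eps_t ~ NID(0, 1), as in eq. (15)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    sigma2 = np.zeros(n)
    sigma2[0] = omega / (1.0 - alpha - beta)  # unconditional variance
    y[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
        y[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return y, sigma2

# Illustrative parameters: omega > 0, alpha, beta >= 0, alpha + beta < 1.
y, sigma2 = simulate_garch11(omega=1e-6, alpha=0.08, beta=0.90, n=1000)
print(y.var(), sigma2.mean())
```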
3.2. Empirical modeling
In order to illustrate the main empirical properties often observed in high frequency financial time series, we analyze the descriptive statistics of six series observed daily. If we denote by $p_t$ the observed price at time $t$, we consider as the series of interest the returns defined as $y_t = \ln p_t - \ln p_{t-1}$. The series considered are the returns of four international stock market indexes, the S&P100 index observed from January 1996 to October 2000, the FTSE 100 index observed from January 1995 to December 2000, the IBEX 35 of the Madrid Stock Exchange observed from January 1990 to December 1999 and the NIKKEI index from January 1995 to November 1999, and the returns of two stock prices, General Motors and Hewlett Packard, from January 1996 to October 2000. It is possible to observe that all the series² have zero mean and excess kurtosis, always above the Gaussian value of 3. It is also important to note that, although the series themselves are hardly autocorrelated, the squared observations are strongly correlated, and the analysis of the squared observations shows significant correlation and persistence.
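The return and moment computations of this section can be reproduced as in the following sketch; the price series here is synthetic, since the article's data sets are not distributed with it.

```python
import numpy as np
from scipy.stats import kurtosis

# A hypothetical daily closing-price series (stand-in for the real data).
prices = np.cumprod(1.0 + 0.01 * np.random.default_rng(2).standard_normal(1500))

returns = np.diff(np.log(prices))  # y_t = ln p_t - ln p_{t-1}
print("mean:", returns.mean())
print("kurtosis:", kurtosis(returns, fisher=False))  # Gaussian value is 3

# Lag-1 autocorrelation of the returns vs. the squared returns.
def acf1(x):
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

print("acf1(y):", acf1(returns), "acf1(y^2):", acf1(returns ** 2))
```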
We have obtained the ML estimates of the parameters of the GARCH(1,1) model for all the series considered³. The original series of length N was divided into two sets: the in-sample or training set with the first N/2 samples, and the out-of-sample or testing set with the last N/2 samples. We have also estimated the same parameters using the SVR technique. We must point out that the parameters are quite different, which is due to the different optimization procedures used by the two schemes.
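The estimation itself was carried out in Matlab (see footnote 3); as one possible Python counterpart of the ML step, the third-party arch package fits a GARCH(1,1) by maximum likelihood. The package choice, the synthetic returns and the percentage scaling are our assumptions, not the authors' procedure.

```python
import numpy as np
from arch import arch_model  # third-party package: pip install arch

# Synthetic stand-in for a series of daily log-returns, in per cent.
returns = 100 * np.diff(np.log(np.cumprod(
    1.0 + 0.01 * np.random.default_rng(3).standard_normal(1200))))

# First N/2 observations form the in-sample (training) set.
n = len(returns)
train, test = returns[: n // 2], returns[n // 2 :]

model = arch_model(train, mean="Zero", vol="GARCH", p=1, q=1)
result = model.fit(disp="off")
print(result.params)  # omega, alpha[1], beta[1]
```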
3.3. Forecasting GARCH with ML and SVR
Given forecasts $\hat{y}_t^2$ of the squared returns $y_t^2$ known at times $t = 1, \ldots, N$, we report the proportion of variance explained by the forecasts with the $R^2$ statistic, e.g. Theil (1971), defined by:

$$R^2 = 1 - \frac{\sum_{t=1}^{N} \left( y_t^2 - \hat{y}_t^2 \right)^2}{\sum_{t=1}^{N} \left( y_t^2 - \operatorname{mean}(y_t^2) \right)^2} \qquad (16)$$

² We have eliminated some tables in order to optimize the text space of the article.
³ The estimation has been carried out with Matlab, version 12.
This relative accuracy statistic indicates that the model accounts for $100\,R^2$ per cent of the variability in the observations. For example, $R^2 = 0.11$ means that the model accounts for 11% of the variability in the observations. If $R^2$ is zero or small, the model is not capable of extracting the deterministic part of the time series. If $R^2$ is negative, the model is worse than the sample mean, because it introduces more variability than the original variance of the time series.
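As a check on the definition, the statistic of (16) can be computed as follows; the function name and the toy numbers are illustrative.

```python
import numpy as np

def r2_statistic(y_sq, y_sq_hat):
    """R^2 of eq. (16): proportion of the variance of the squared returns
    explained by the forecasts; negative means worse than the mean."""
    sse = np.sum((y_sq - y_sq_hat) ** 2)
    sst = np.sum((y_sq - y_sq.mean()) ** 2)
    return 1.0 - sse / sst

# A forecast equal to the sample mean gives R^2 = 0, as the text notes.
y_sq = np.array([1.0, 4.0, 2.0, 3.0])
print(r2_statistic(y_sq, np.full(4, y_sq.mean())))  # 0.0
```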
Table 2. $R^2$ statistic for GARCH solved by ML and SVM models. The in-sample data are the first half and the out-of-sample data are the second half.

           GARCH LS                  GARCH SVR
           In-sample   Out-sample    In-sample   Out-sample
SP100      0.0466      0.0365        0.0565      0.0427
FTSE       0.0911      0.0352        0.0475      0.0423
IBEX       0.0590      0.0502        0.0999      0.1341
NIKKEI     0.0110      0.0423        0.0108      0.0479
GM         0.0175      0.0055        0.0153      0.0066
HP         -0.0116     -0.0171       2.16E-04    0.00478
In Table 2 we can see that the SVR technique is able to explain a higher percentage of the variability of all the series in the out-of-sample set, except for the IBEX, for which the ML is superior. Also, the SVR is always able to predict better than the mean, which is not the case for the ML technique on the HP data set. These results are as expected, because the data sets do not resemble a Gaussian distribution function, and a different technique such as the SVR is able to extract more knowledge from the data set than the usual ML techniques.
We have plotted the predicted value of $\sigma_t^2$ and the squared observations $y_t^2$ for all the data in the in-sample and out-of-sample sets in Figure 2, for the S&P100 index. In these plots one can see that the predictions for both series are very alike, although the prediction obtained by the SVM explains about half a percentage point more of the variability than the LS scheme.
Figure 2. S&P100: squared observations $y_t^2$ and GARCH(1,1) estimates $\hat{\sigma}_t^2$ using LS and SVM. Top panel (LS): $R^2$ in-sample = 0.0466, $R^2$ out-of-sample = 0.0365. Bottom panel (SVM): $R^2$ in-sample = 0.0565, $R^2$ out-of-sample = 0.0427.
4. Conclusions and further work
We have shown that the SVR can be used to estimate the parameters of a GARCH model for estimating the volatility of stock market return series with higher accuracy than the one obtained with LS. The SVR has been used over a widely known model, producing a better estimate; in further work, it will have to be tested over other models and kernels in order to assess whether it can further improve the results shown here.
References:
Andersen, T.G. and T. Bollerslev (1998), "Answering the skeptics: yes, standard volatility models do provide accurate forecasts", International Economic Review, 39, pp. 885-905.
Andersen, T.G., T. Bollerslev, F.X. Diebold and H. Ebens (2001), "The distribution of realized stock return volatility", Journal of Financial Economics, 61, pp. 43-76.
Blair, B.J., S.H. Poon and S.J. Taylor (2001), "Forecasting S&P100 volatility: the incremental information content of implied volatilities and high-frequency index returns", Journal of Econometrics, 105, pp. 5-26.
Bollerslev, T. (1986), "Generalized autoregressive conditional heteroskedasticity", Journal of Econometrics, 31, pp. 307-327.
Bollerslev, T. (1987), "A conditional heteroskedasticity time series model for speculative prices and rates of returns", Review of Economics and Statistics, 69, pp. 542-547.
Figlewski, S. (1997), "Forecasting volatility", Financial Markets, Institutions and Instruments, 6, pp. 1-88.
Fletcher, R. (1987), Practical Methods of Optimization, Wiley.
Franses, P.H. and D. Van Dijk (1995), "Forecasting stock market volatility using (non-linear) GARCH models", Journal of Forecasting, 15, pp. 229-235.
Nelson, D.B. (1991), "Conditional heteroskedasticity in asset returns: a new approach", Econometrica, 59, pp. 347-370.
Pérez-Cruz, F., A. Navia-Vázquez, P.L. Alarcón-Diana and A. Artés-Rodríguez (2000), "An IRWLS procedure for SVR", Proceedings of EUSIPCO'00.
Schölkopf, B. and A. Smola (2001), Learning with Kernels, MIT Press (to appear).
Smola, A., N. Murata, B. Schölkopf and K.-R. Müller (1998), "Asymptotically optimal choice of ε-loss for support vector machines", Proceedings of ICANN'98.
Theil, H. (1971), Principles of Econometrics, Wiley, New York.
Vapnik, V.N. (1998), Statistical Learning Theory, Wiley.