value-growth style rotation on the US stock market based on a Support
Vector Regression model exhibits superior performance over a
benchmark once-and-for-all value-minus-growth investment strategy;
and second, to provide a short but self-contained introduction to
Support Vector Regressions and Support Vector Machines as a whole.
We find that our style rotation model significantly outperforms the
benchmark strategy, achieving a significant information ratio of 0.83 net
of 25bp single trip transaction costs in the trading period of January
1993 – January 2003. All estimates for the monthly differences in
returns between value and growth stocks are based on historically
available information on 17 pre-specified macroeconomic and technical
factors considered all together. Model selection is not based on familiar
financial selection criteria such as hit ratio or realized information ratio,
but on a standard technique for Support Vector Machines called cross-validation. The combination of the intrinsic analytical features of
Support Vector Regressions and the cross-validation technique has at
least two merits: first, it assures that model selection is based only on
(artificially created) out-of-sample performance, and second, it appears
to circumvent common factor-model shortcomings such as Data Mining
Bias, Look-Ahead Bias and others. We examine the performance of our
basic model against several model extensions, and discover remarkable
consistency and robustness of our basic-model results. Additionally, we
present the results for a small-big rotation on the US stock market,
which produces slightly better information ratios than value-growth
Chapter 1
The characterization of stock return predictability has long been a subject of great controversy
in the financial spheres (Cremers, 2002). Debating issues range from the extent of stock
market efficiency to the nature and number of factors that could contain information on future
stock returns (Haugen, 2001). At the same time, and in other quarters such as Data Mining,
the popular analytical tool Support Vector Machines (SVM) has been gaining momentum,
following a series of successful applications in fields such as optical character recognition and
DNA analysis (Smola and Schölkopf, 1998, Müller et al., 2001). The possibility to apply
Support Vector Machines in Finance and their excellent performance in time-series prediction
has already been reported by Smola and Schölkopf (1998) and Müller et al. (1997)
respectively. Furthermore, among others, Rocco and Moreno (2001) describe an approach to
detect currency crises based on SVM; Monteiro (2001) applies SVM for interest rate curve
estimation; and Van Gestel et al. (2003) propose an SVM financial credit scoring model to
assess the risk of default of a company and report significantly better results when contrasted
with state-of-the-art techniques in the field. Regarding financial time-series forecasting, SVM
have been implemented by Trafalis and Ince (2000), for instance. Pérez-Cruz et al. (2003),
further, estimate with SVM the parameters of a GARCH model for predicting the conditional
volatility of stock market returns. Despite the successful applications of SVM in various fields
(including Finance), however, to the best of our knowledge SVM have not yet been employed
in financial style rotation strategies. Therefore, the primary objective of this master’s thesis is
to evaluate the economic significance of applying Support Vector Machines in the financial
domain of stock returns predictability, and in particular, in what the financial literature
calls value-versus-growth style rotation. The secondary objective of the thesis, which is
of comparable significance, is to provide a brief but self-contained elementary
introduction to Support Vector Machines.
There is an extensive body of financial literature that documents the (extent of) predictability
of differences of returns both among classes of stocks and between stocks and other assets.
Pesaran and Timmermann (1995), for example, show how to construct a profitable rotating
strategy between two assets, stocks and bonds, for the period 1960 – 1992 on the basis of
historical information on 9 candidate explanatory factors. Regarding classes of stocks, such as
value and growth stocks, the long-run profitability of the strategy of going long on value
stocks and short on growth stocks has been popularized by Fama and French (1993). Attempts
to explain this profitability (or, the relatively higher return to value stocks) appeal
either to risk compensation (Fama and French, 1993) or to market overreaction leading
to security mispricing (Lakonishok et al., 1994, La Porta, 1996). Such a long-term strategy
however could potentially be suboptimal to a style strategy of rotating portfolios of value and
growth stocks (say, via switching long value / short growth and long growth / short value
trading positions on the S&P 500 Barra Value and Growth indices), depending on the level of
transaction costs. In fact, the need for a value-growth rotation is created by the enormous
potential that a value-growth rotation based on perfect forecasting ability offers over the
passive Value-minus-Growth strategy, even in a high-transaction-cost regime. As we will
show later, such a perfect-forecast rotation strategy (which takes a long position in the higher
returning asset class and a short position in the lower returning asset class) produces 21.29%
annual return during the sample period, under a 50bp[1] single-trip transaction cost scenario[2].
The corresponding Value-minus-Growth strategy yields a mere 0.21% annual return (although
it suffers from virtually no transaction costs[3]). This rotation potential has not gone
unnoticed in practice. Kahn (1996), for instance, reports that most funds either tend to follow
a value-growth style rotation strategy, or adopt a mixed style strategy. Further, Bauer and
Molenaar (2002) note that the difference in return between value and growth stocks is not
stable over time, especially after 1990, and propose a logit model to capture the arising value-growth rotation potential on the US stock market.
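The difference between a perfect-forecast rotation and the passive Value-minus-Growth position, and the way single-trip transaction costs erode both, can be illustrated with a minimal Python sketch. The function names, the toy monthly returns, and the rule that closing a position without opening a new one costs one single trip are assumptions made for illustration only; they are not taken from the thesis.

```python
def perfect_forecast_positions(value_ret, growth_ret):
    """+1 = long value / short growth, -1 = long growth / short value, 0 = no position."""
    return [(v > g) - (v < g) for v, g in zip(value_ret, growth_ret)]


def rotation_returns(value_ret, growth_ret, positions, cost_bp=50.0):
    """Monthly returns of a long/short rotation, net of single-trip costs.

    Assumption: every position opened or closed costs one single trip, so
    switching directly from +1 to -1 (or back) costs two single trips.
    """
    cost = cost_bp / 10_000.0  # 100 basis points = 1%
    net, previous = [], 0
    for v, g, position in zip(value_ret, growth_ret, positions):
        single_trips = abs(position - previous)  # 0, 1 or 2
        net.append(position * (v - g) - single_trips * cost)
        previous = position
    return net


# Toy monthly returns (hypothetical numbers, not data from the thesis).
value = [0.02, -0.01, 0.03]
growth = [0.01, 0.02, 0.01]

rotation = rotation_returns(value, growth, perfect_forecast_positions(value, growth))
passive = rotation_returns(value, growth, [1, 1, 1])  # permanent long value / short growth
print(sum(rotation), sum(passive))
```

In this toy example the rotation pays transaction costs in every month, whereas the passive position pays them only once, yet the rotation still ends ahead because it is never on the wrong side of the value premium.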
In order to exploit the admittedly huge potential stemming from value-growth style rotation,
and in line with the primary objective of the thesis, a basic financial factor model that utilizes
Support Vector Regressions has been put forward to capture the historically realized monthly
differences in returns, referred to also as the “value premium” (Bauer and Molenaar, 2002),
between two major US stock indices – the S&P 500 Barra Value and Growth indices. Several
model extensions have also been proposed. The two indices are constructed by dividing the
market capitalization of the US S&P 500 index (approximately) equally between stocks
according to the stocks’ book-to-market (BM) ratio[4]. The BM ratio is calculated by dividing
the book value of common equity by the market capitalization of a firm[5]. Stocks with
relatively higher BM ratio (value stocks) are assigned to the Value index, and stocks with
relatively lower BM ratio (growth stocks) constitute the Growth index. The trading period
under consideration ranges from January 1993 till January 2003. One way to assess the
economic significance of the proposed basic factor model, which is in effect pursued in the
thesis, is to simulate a real-time investment strategy, according to which every month
investors either short sell the Growth index and buy the Value index (in other words, establish
a long value / short growth position), or vice-versa, or do not take any trading position at all.
Each of the 121 monthly forecasts is based on 60 months of most recent historically observed
values for 17 macroeconomic and technical pre-specified factors considered as a whole. In
this way it is ensured that investors base their decisions on historically available information
only. The performance of the proposed factor model has to be evaluated against a benchmark
passive market strategy, which we choose to be a “Value-minus-Growth” strategy (or, a
“permanent long-Value-and-short-Growth” strategy) that goes short on the Growth index and
long on the Value index in the beginning of the sample period and never changes that trading
position thereafter. The question of primary concern is: Is it possible to construct a value-growth style rotation investment strategy for a certain (fairly long) time period which is more
profitable than a passive, “Value-minus-Growth” market strategy, via using Support Vector
Machines and publicly available historical information on a set of factors thought a priori to
be relevant for forecasting stock returns?
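The real-time simulation described above can be sketched as a walk-forward loop in which each monthly decision uses only the 60 most recent months of data. This is a minimal illustration under stated assumptions: `mean_forecaster` is a hypothetical stand-in for the Support Vector Regression model, the data layout (`factors[t]` known before month `t` begins, `premium[t]` realized during month `t`) is assumed, and taking no position when the forecast is exactly zero is an illustrative rule rather than the rule used in the thesis.

```python
def sign(x):
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def mean_forecaster(train_factors, train_premium, new_factors):
    """Hypothetical placeholder model: predicts the average past premium."""
    return sum(train_premium) / len(train_premium)


def walk_forward_positions(factors, premium, forecaster, window=60):
    """Trading position for every month that has a full history window.

    The forecast for month t sees only months t-window .. t-1, so no
    information from month t or later enters the decision.
    """
    positions = {}
    for t in range(window, len(premium)):
        prediction = forecaster(factors[t - window:t], premium[t - window:t], factors[t])
        # +1: long value / short growth, -1: the reverse, 0: no position
        positions[t] = sign(prediction)
    return positions


# With 60 + 121 months of data the loop yields 121 monthly decisions.
toy_factors = [[0.0] * 17 for _ in range(181)]  # 17 placeholder factor values per month
toy_premium = [0.01] * 181                      # placeholder value premium series
print(len(walk_forward_positions(toy_factors, toy_premium, mean_forecaster)))
```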
[1] Basis points. 100 basis points = 1%.
[2] A single-trip transaction cost of 50 basis points (bp) is defined here as the percentage level of the transaction
costs associated with either establishing a long value and short growth position, or a long growth and a short
value position. Thus, the cost of closing one of these two positions and entering the other is 100 bp.
[3] Some transaction costs, on top of establishing an initial long value / short growth position, are bound to be
incurred, since the S&P 500 Barra Value and Growth indices are rebalanced twice a year.
[4] Source:
[5] Note that the BM ratio is the reciprocal of the (market) price-to-book ratio.
In the process of constructing value-growth rotation strategies, researchers often choose
among factors that are believed to have some intrinsic explanatory power for the difference in
returns between value and growth stocks in general. By and large, these explanatory factors
fall into two categories – (macro)economic and technical (or, market-based). Kao and
Shumaker (1999), for example, document the explanatory power of economic factors, such as
the yield-curve spread, real bond yield and earnings-yield gap. Others report the significance
of economic factors such as rate of inflation (Levis and Liodakis, 1999), and growth in
industrial production (Sorensen and Lazzara, 1995). There is also considerable evidence for
the relevance of some technical variables in predicting the difference in returns between value
and growth portfolios. For example, Levis and Liodakis (1999) report the importance of the
value spread, and Chan et al. (1996) examine the relevance of momentum strategies. Bauer
and Molenaar (2002) try to put together previous research on the subject by constructing
value-growth rotation strategies for the period January 1990 – November 2001 on the basis of
logit models consisting of up to 5 factors from a set of 17 factors claimed to have some
economically interpretable explanatory power in the literature on the subject. In our research,
we use the same set of 17 factors.
What makes our models different from the typical models employed by researchers is that we
apply different model-building tools (that is, Support Vector Machines instead of the widely
used multiple regression analysis), consider all candidate factors as a whole for model
building, and finally, use a cross-validation procedure for model selection.
Considering all candidate explanatory factors simultaneously could seem counterintuitive in
the eyes of most researchers and practitioners. Typically, they use multivariate models
consisting of several candidate factors chosen among a (long) list of factors. Since most of the
studies (with the notable exception of the study of Kao and Shumaker (1999), who utilize a
decision-tree model with a five-fold cross-validation technique) use multiple regression
analysis (in its various forms) for model building, they are bound to take heed of number-of-variables restriction criteria such as adjusted R². One of the advantages of Support Vector
Machines in this respect is that such restriction is not necessary. They are expected to behave
robustly even in high-dimensional feature problems (Maragoudakis et al., 2002). This is quite
important since utilizing a tool that is able to produce models with good generalization (and
consequently, prediction) ability in a multivariate context can potentially capture variable
interactions that could not possibly be accounted for by models that contain only few
variables. The advantages and disadvantages of Support Vector Machines will be discussed at
length throughout the thesis, both from theoretical point of view and in the context of the
specific task of capturing the value premium on the US stock market.
In the process of model selection, models are chosen only on the basis of performance over
(artificially created) out-of-sample data, in order to avoid the critique of judging the merit of a
model on the basis of in-sample performance. This type of model selection is based on a
cross-validation procedure commonly used in Data Mining[6].
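The idea of selecting a model purely on artificially created out-of-sample data can be illustrated with a generic k-fold cross-validation routine. The sketch below is not the procedure of the thesis: the contiguous fold layout, the mean-squared-error score, the default of five folds, and the "shrunk mean" candidate models (hypothetical stand-ins for differently parameterized Support Vector Regressions) are all assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(fit, predict, X, y, k=5):
    """Mean squared error measured only on held-out observations."""
    squared_errors = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        squared_errors.extend((predict(model, X[i]) - y[i]) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)


def select_by_cross_validation(candidates, X, y, k=5):
    """Return the name of the candidate with the lowest cross-validated error."""
    scores = {name: cross_val_mse(fit, predict, X, y, k)
              for name, (fit, predict) in candidates.items()}
    return min(scores, key=scores.get), scores


def make_shrunk_mean(shrinkage):
    """Hypothetical toy model: predicts shrinkage * (mean of the training targets)."""
    def fit(train_X, train_y):
        return shrinkage * sum(train_y) / len(train_y)

    def predict(model, x):
        return model

    return fit, predict


candidates = {"full": make_shrunk_mean(1.0), "half": make_shrunk_mean(0.5)}
best, scores = select_by_cross_validation(candidates, [[0.0]] * 10, [1.0] * 10)
print(best, scores)
```

No candidate is ever scored on observations that were used to fit it, which is the property the paragraph above relies on.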
Our main results show that the value-growth style rotation strategy based on Support Vector
Machines (and more specifically, Support Vector Regressions) considerably outperforms the
passive long-term Value-minus-Growth strategy, even after various levels of transaction costs
have been accounted for. In the sample period under consideration, the best attainable Support
Vector Machines value-growth rotation strategy achieves an 8.21% annual return net of 25bp
single-trip transaction costs, more than 39 times the 0.21% annual return of the
Value-minus-Growth benchmark strategy. Moreover, the
benchmark strategy is 10.55% more volatile in this case. We also set our results against a so-called “MAX” strategy, which captures the maximum return that can be achieved on the basis
of style rotation, net of 50bp transaction costs. According to the “MAX” strategy, each month
throughout the sample period a position is taken that goes long on the better-performing
security class (here, one of the two S&P 500 Barra Value and Growth indices) and short on
the worse-performing security class. This strategy produces 21.29% annual return for the
whole sample period. Several model extensions are put forward, such as considering three- and six-month forecast horizons, which testify to the consistency of the basic one-month
forecast horizon results.
[6] The term “Data Mining” in financial literature bears a negative connotation and should be distinguished from
the Data Mining discipline, where the same negative idea is expressed by the term “data dredging”.
In order to cover both objectives of the thesis effectively, the thesis has been structured as
follows. At the outset (chapter 2), the financial side of the problem setting is addressed. In
particular, financial factor models and their role in stock market predictability are
briefly discussed. The following four chapters (chapters 3 to 6) deal extensively with Support
Vector Machines as an analytical tool, since they will be used to tackle the task of value
premium predictability. First, chapter 3 acquaints the reader with the fundamental nature of
Support Vector Machines, and their known advantages and limitations. Second, chapter 4
gives an extensive account of the rationale behind Support Vector Machines. Third, chapter
5 presents the construction of Support Vector Machines for binary classification problems.
Fourth, as a logical continuation of the preceding three chapters, the Support Vector
Regression tool employed by the basic investment model is considered in chapter 6.
Whenever possible, examples related to the factor models employed in the thesis are
used alongside the analysis of Support Vector Machines.
Chapter 7, “Methodology”, bridges the gap between the theoretical concept of Support Vector
Machines and the practical problem of value premium predictability. This chapter explains
how and why Support Vector Machines can be applied in a specific factor model to address
the question of this predictability. This is done in several steps. First, the basic investment model to be proposed, which utilizes Support Vector Regressions, is put in the
context of factor models. Second, the choice of explanatory macroeconomic and technical
variables and the nature of the explained variable (the difference in return between S&P 500
Barra Value and Growth indices: the value premium) are discussed. Third, the most attractive
merits of Support Vector Regressions that should justify their employment as a factor model
tool come into focus. With the necessary background laid down, the basic
investment model and several model extensions are put forward. The chapter closes with a
discussion of why Support Vector Machines and the proposed models are able to
withstand common drawbacks of factor models highlighted in the financial literature,
such as Survival Bias, Look-Ahead Bias, Data Snooping Bias, Data Mining Bias, and
Chapter 8 brings together all of the previous parts of the thesis. It shows how the actual
experiments for the practical realization of the models suggested in the “Methodology”
chapter have been carried out, summarizes the main findings, and assesses their
significance. Chapter 9 concludes.
It is important to stress that all chapters of the master’s thesis are constructed in such a way as
to touch upon only topics and issues that are relevant for answering its two objectives. Thus, it
falls outside the scope of the thesis to provide an extensive summary of how stock returns
predictability and Support Vector Machines are reflected in their respective domains of
Finance and Data Mining.
[7] We use the terminology of Haugen (1999).
Chapter 2
Stock Returns Predictability and Factor Models
2.1 Market (In)Efficiency and Factor Models
The idea that the stock market is fairly inefficient has been gaining momentum in the
academic and financial spheres. The term “market efficiency” is generally used to denote
“the extent to which market prices securities so as to reflect available (historical) information
pertaining to their valuation” (Haugen, 2001). Therefore, a high degree of market inefficiency
suggests that plenty of securities are mispriced. In this case, investing in a market index
would be a suboptimal investment strategy. In other words, the market index is not guaranteed
to be among the set of portfolios (called the “efficient set”) that offers the maximum expected
return for a given level of risk. Consequently, market inefficiency opens the possibility to
“beat” the market (index).
There is a growing body of financial literature that addresses the question of how to profit
from market inefficiency by putting forward models that admittedly are able to successfully
estimate expected returns and volatility of return of stocks on the basis of (a multitude of)
publicly available factors that are believed to affect (classes of) securities in different ways.
These factors are usually classified into macroeconomic (rate of inflation, rate of growth of
industrial production, etc.) and financial (book-to-market ratio, debt-to-equity ratio, etc.)
characteristics that could contain information on expected future movements of securities.
The multi-factor models could consequently be used in an inefficient market environment to
help investors move closer to the efficient set, and away from the market index. However,
building satisfactory factor models is not an easy task. Another branch of Finance, Behavioral
Finance (see e.g. Thaler, 1993), investigates the effects of investor psychology on stock prices
with a view to exploiting market inefficiencies. Studies in Behavioral Finance, however, are
not covered in the thesis.
2.2 Factor models: which factors matter
Despite the growing evidence that multi-factor models are quite powerful in explaining and
predicting stock returns, there does not exist a full consensus on which factors precisely are
important and why (Cremers, 2002). The majority of the studies have looked at the US stock
market, and confirm that stock returns can be predicted to some degree by means of interest
rates, dividend yields and a variety of macroeconomic variables exhibiting clear business
cycle variations (Pesaran, 2003). For example, in the late 1970s, Basu (1977) showed that
stocks with low price-to-earnings ratios performed significantly better (during the period
between April 1957 and March 1971) than stocks with high price-to-earnings ratios.
Subsequent findings have been reported by Keim and Stambaugh (1985) and Rosenberg et al.
(1985), who stress the relevance of price-to-dividends and (market) price-to-book ratios
respectively. Additionally, Chan et al. (1991) document the significant impact of the price-to-cash-flow ratio on expected returns of stocks (on the Japanese stock market). All of these
ratios are referred to as “value characteristics”, so that stocks with lower such ratios are
labeled “value”, and stocks with higher ratios – “growth”. Firm size also appears to play a role
in stock market predictability. Banz (1981) discovered that the stocks of firms with low
market capitalizations (small cap stocks) have higher average returns than stocks with high
market capitalization (large cap stocks). Furthermore, Bhandari (1988) reports that firms with
high leverage (high debt-to-equity ratios) have higher average returns than firms with low
leverage for the period from 1948 to 1979. Two admittedly quite influential papers that pulled
together much of the earlier empirical work were published by Fama and French (Fama and
French, 1992, and Fama and French, 1993), who proposed a three-factor model and argued
that the book-to-market and size variables bear the strongest relation to stock market returns. The
number and nature of candidate explanatory factors in factor models varies across studies,
however, and the three-factor model of Fama and French has been extended by other
researchers to include as many as fifty factors simultaneously, as advocated by Haugen
(1999), for example.
Fama and French (1993) provide evidence for a risk-based explanation of the long-term
difference between value stocks and growth stocks. According to them, the book-to-market
ratio is a proxy for an unobservable common risk factor, so that the fact that value stocks
(perceived to be more risky) have higher average returns over the long run is consistent with
rational asset pricing. Jensen et al. (1997) extend this view by claiming that value companies
are quite sensitive to the same macroeconomic conditions, such as interest rate risk and the
business cycle. Alternative views exist however, revealing themselves as differences in
attitude towards market efficiency. While Fama and French (1992) regard markets as
efficient, Haugen (1999) takes up an opposing position by arguing that the differences
between expected and real returns come as a surprise to investors. In the same line of thought,
Lakonishok et al. (1994) argue that value stocks had historically higher returns than growth
stocks because markets were inefficient (that is, investors were systematically wrong in their
expectations about future stock returns).
2.3 Factor models: fit versus complexity
A disquieting problem with most factor models is that the predictive power of
a model deteriorates in practice with the inclusion of more and more explanatory variables
(factors). The reason is that model complexity increases with the number of factors. At some
point, including new information in the model in the form of one more explanatory
factor will actually decrease the predictive power of the model (although it will increase the
“fit” of the model on the data that was used to generate it), as the resulting increase in model
complexity will outweigh the benefit of the new information embedded in the factor. This
phenomenon is known as overfitting. Typically, commentators implicitly or explicitly try to
correct for overfitting by estimating multiple regressions that include all possible
combinations of a pre-selected factor set and choosing among the resulting models. Thus, if
there are k candidate explanatory factors in a given factor set, then there will be 2^k possible
models (multiple regressions). Subsequently, all of these models are ranked according to
some performance criteria, such as statistical (adjusted R², AIC (Akaike’s Information
Criterion), BIC (Schwarz’s Bayesian Information Criterion), etc.) and financial (hit ratio[8],
recursive Sharpe ratio, etc.) criteria[9] (Pesaran and Timmermann, 1995). Another two financial
criteria – the mean return of a strategy and the information ratio[10] criterion – have been
employed, for example, by Bauer and Molenaar (2002). The financial criteria are generally
used to correct for the fact that statistical criteria are not necessarily in accordance with the
investor’s loss function (Pesaran and Timmermann, 1995). Other authors opt for a Bayesian
approach to model selection (e.g. Cremers, 2001, Avramov, 2002). What nearly all factor
models have in common, however, is that their model selection procedures explicitly or
implicitly seek a balance between model complexity (proxied, for instance, by the number of
factors included in a given model) on the one hand, and model “fit” on the model-generating
(training) data, on the other. It is known that including more explanatory variables in a model
will increase its “fit” on the training data, as measured by the R² statistic. Because of the
problem of overfitting, however, a model with greater R² can very well be worse (in
predictive power) than a more parsimonious model with a lower R². As mentioned above,
some widely used criteria (such as AIC and BIC) try to cope with this situation –
they penalize to a certain extent the inclusion of a new factor, and tolerate it only if it brings
“enough” additional explanatory power. The problem is that it is uncertain which criterion
is best.
[8] The hit ratio is the percentage of times (e.g., months) that a correct prediction has been made.
[9] There are also other types of performance criteria, which will not be covered here however.
[10] The information ratio is defined as the mean of a random variable (such as the stock market return realised by
a given model) divided by its standard deviation. This ratio is the same as the Sharpe ratio for a long-short strategy.
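The size of the model space and the way a complexity-penalizing criterion works can be made concrete with a short Python sketch. The factor names and the R² values below are hypothetical, and the adjusted R² formula is the standard textbook one for a regression with an intercept, not a formula quoted from the thesis.

```python
from itertools import combinations


def candidate_models(factor_names):
    """Yield every subset of the factor set (including the empty model)."""
    for size in range(len(factor_names) + 1):
        yield from combinations(factor_names, size)


def adjusted_r2(r2, n_obs, n_factors):
    """Standard adjusted R-squared for a regression with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_factors - 1)


# Three hypothetical factors give 2^3 = 8 candidate regressions.
print(len(list(candidate_models(["inflation", "yield_spread", "momentum"]))))

# With k = 17 factors the count is 2^17 = 131072.
print(2 ** 17)

# Hypothetical fits on 60 monthly observations: a slightly higher raw R-squared
# obtained with 17 factors scores lower after the penalty than a 5-factor fit.
print(adjusted_r2(0.52, 60, 17), adjusted_r2(0.50, 60, 5))
```

The last line shows the penalty at work: the larger model fits the training data marginally better, yet the adjusted statistic prefers the more parsimonious one.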
2.4 Factor models: an example
Expected-return multi-factor models are typically constructed on a monthly basis, where the
employed optimal model (or a combination of models) is reconsidered every to-be-predicted
month, allowing for changes in the number and nature of included factors. This strategy is, for
example, employed by Bauer and Molenaar (2002), who propose a model to capture the
historical value premium on the US stock market arising from the difference between two
indices, the S&P 500 Barra Value and Growth indices. Every month, a series of parsimonious
multiple regression models is constructed on the basis of 60 months of historical
values for 17 candidate macroeconomic and financial explanatory factors. Subsequently, all
models are ranked during a 24-month model selection period according to financial
criteria such as mean return of employed strategy, hit ratio, and information ratio. In addition,
different transaction-cost scenarios are considered.
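The financial ranking criteria named above are simple to compute. The following Python sketch follows the definitions used in this chapter (hit ratio as the share of correct directional predictions, information ratio as mean divided by standard deviation); the function names, the use of the sample standard deviation, the treatment of a zero forecast as a miss, and the toy numbers are assumptions for illustration.

```python
from statistics import mean, stdev


def hit_ratio(predicted, realized):
    """Share of periods in which the predicted sign matches the realized sign.

    Assumption: a zero prediction or a zero realization counts as a miss.
    """
    hits = sum(1 for p, r in zip(predicted, realized) if p * r > 0)
    return hits / len(realized)


def information_ratio(returns):
    """Mean strategy return divided by its (sample) standard deviation.

    No annualization is applied; any scaling convention would be a further assumption.
    """
    return mean(returns) / stdev(returns)


# Toy example with hypothetical numbers.
predicted_sign = [1, -1, 1, 1]
realized_premium = [0.02, -0.01, -0.03, 0.01]
strategy_returns = [p * r for p, r in zip(predicted_sign, realized_premium)]

print(hit_ratio(predicted_sign, realized_premium))
print(mean(strategy_returns), information_ratio(strategy_returns))
```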
The task of the basic model proposed in this thesis is to capture the (absolute value of the)
value premium on the US stock market (that is, the difference between S&P 500 Barra Value
and Growth indices) for a period of 121 months: January 1993 – January 2003. The basic
model utilizes the whole set of 17 factors (listed in Appendix I) used by Bauer and Molenaar
(2002), but instead of multiple regressions, it utilizes Support Vector Regressions.
In order to construct our basic value-growth rotation model, which is based on Support Vector
Regressions, it is indispensable, first, to define Support Vector Machines and describe the
theoretical rationale behind them. This will be done, respectively, in chapter 3 and chapter 4.
After that, in chapter 5, the primary technical building blocks of Support Vector Machines
will be introduced in the context of Support Vector Machines for Classification. Support
Vector Regressions, which build on these technical concepts, will be discussed in chapter
6 on a theoretical level, and then in chapter 7 on a more practical, financial level, where the
basic model will be put forward as well. Afterwards, in chapter 8, the results from the basic
Support Vector Regression model along with several model extensions will be analyzed and
compared to a passive Value-minus-Growth strategy and a “MAX” strategy.
Chapter 3
Support Vector Machines:
Definition, Advantages, and Limitations
3.1 What are Support Vector Machines?
Support Vector Machines (SVM) find their roots in Statistical Learning Theory, pioneered by
Vapnik and co-workers (Smola and Schölkopf, 1998). In essence, SVM are just functions,
named generally “learning machines”, whose basic task is to “explore” data (input-output
pairs) and provide optimally accurate predictions on unseen data. SVM could be defined[11] as
follows:
Support Vector Machines are a classification / regression tool used for optimally predicting
the class membership / real value of unseen outputs that are generated or characterized by one
or more inputs, by means of looking at some available training input-output pairs and then
building a model based on the observed input-output relations.
[11] This is an informal definition.
3.2 Remarks on the definition
There are several terms in the above definition that demand clarification.
• “Optimally predicting” means striking the best tradeoff between the function’s
complexity and its accuracy (the number of training errors it makes).
• The outputs and inputs can be considered as dependent and independent (or
explained and explanatory) variables respectively. Sometimes outputs are
called “outcomes” or “target values”, and inputs – “features” or “attributes”.
If the outcomes take discrete values (called “classes”), we have a classification
problem, while if they take continuous (real) values – we have a regression
estimation problem.
• The “class membership” is the label assigned to a given output in a
classification problem, so that outputs having the same label belong to the
same class. In a two-class classification problem, the labels could just be
“positive” and “negative”.
• “Looking at some training input-output pairs” refers to the first stage in the
process of learning from data: reading all training outputs together with their
respective inputs. In simple words: just reading the available data.
• “Building a certain model” is the second stage in the process of learning from
data, which implements the basic idea behind SVM – the idea of striking the
best balance between minimizing the number of errors made on the training
data set and maximizing (the Euclidean distance of) the
“margin” between the (two) different classes in a higher-dimensional feature
space implicitly defined by a certain kernel function. In the case of Support
Vector Regression (SVR), one employs the concept of “ε-insensitive region”
instead of the “margin”.
All of the above concepts (e.g. “margin”, “ε-insensitive region”, etc.) will be thoroughly
examined in the course of chapters 4 to 6. The informal definition of Support Vector
Machines and the remarks on the definition are presented here as a compendious first
overview of the idea behind these learning machines.
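As a first preview of the “ε-insensitive region”, the sketch below uses the formulation of Support Vector Regression that is standard in the literature: deviations smaller than ε incur no loss, and a tradeoff parameter weighs the remaining training errors against model complexity. The linear model, the symbol names `w`, `b`, `C` and `epsilon`, and the toy data are assumptions for illustration; the notation developed in chapters 4 to 6 of the thesis may differ.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon-insensitive region, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def linear_svr_objective(w, b, X, y, C, epsilon):
    """Complexity term plus C times the total epsilon-insensitive training error.

    A smaller norm of w means a flatter, less complex function; C sets the
    penalty attached to training errors that fall outside the region.
    """
    complexity = 0.5 * sum(w_j * w_j for w_j in w)
    predictions = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b for x in X]
    training_error = sum(epsilon_insensitive_loss(t, p, epsilon)
                         for t, p in zip(y, predictions))
    return complexity + C * training_error


# Toy data (hypothetical): the first point lies inside the region, the second outside.
X = [[1.0, 0.0], [2.0, 0.0]]
y = [1.0, 3.0]
print(linear_svr_objective([1.0, 0.0], 0.0, X, y, C=2.0, epsilon=0.5))
```

Training a Support Vector Regression amounts to minimizing an objective of this kind over the model parameters, which is why both the tradeoff parameter and ε have to be chosen before fitting.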
3.3 Overall Advantages
One can argue that the combination of three key properties of Support Vector Machines has
made them a favorable analytical tool among other learning algorithms. These properties are
summarized below.
First, contrary to other (machine) learning techniques, SVM behave robustly even in high
dimensional feature problems (Maragoudakis et al., 2002), that is, where the explanatory
variables are numerous, and in noisy, complex domains (Burbidge and Buxton, 2001).
Second, unlike neural networks, SVM cannot get stuck in a local minimum while learning,
since SVM solve a quadratic optimization problem that is bound to arrive at a global solution
(Vapnik, 1995, Smola, 1996).
Third, Support Vector Machines achieve remarkable generalization ability by striking a
balance between the accuracy of a function on a given training set and its
complexity. Note that in real-world applications the presence of noise (in regression) and
class overlap (in classification) necessitates the search for such a balance (Vapnik, 1995,
Woodford, 2001).
3.4 Overall Limitations
There are three major known shortcomings of SVM, which can be summarized as follows.
First, SVM make explicit classifications (point predictions in the case of regression) of new
outputs for new inputs, rather than predicting the posterior probability of class membership
(or, in the case of regression, the probability that a point estimate takes a particular value,
given the new inputs) (Bishop and Tipping, 2000).
Second, there is a requirement to estimate a tradeoff parameter that determines the level of
penalty associated with the training errors that SVM make. In the case of SVR, we also have to
estimate an additional insensitivity parameter “ε”. This generally entails a “cross-validation”
procedure, which is computationally inefficient (Tipping, 2000), but which will nevertheless
be used in our models. The cross-validation procedure can, however, also be considered an
advantage rather than a limitation in the context of constructing financial factor models, as will
be shown in chapter 7.
Third, there is a need to utilize “Mercer” kernel functions while constructing SVM (to be
introduced in chapter 5), which somewhat restricts our ability to use any function for
All of the above-mentioned advantages and limitations are quite general in nature. Concrete
advantages and shortcomings of Support Vector Machines (and, in particular, Support Vector
Regression) in relation to financial factor models will be discussed in depth in chapter 7.
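The cross-validation procedure mentioned under the second limitation can be sketched as follows. This is only an illustrative outline, not the procedure used in the thesis: the helper names (`k_fold_indices`, `cv_error`, `grid_search`) and the stand-in `toy_fit_and_score` are hypothetical, and in practice the stand-in would be replaced by a routine that trains an SVR on the training fold and returns its error on the held-out fold.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cv_error(fit_and_score, n_samples, k, C, epsilon):
    """Average held-out error of one (C, epsilon) pair over the k folds."""
    errors = [fit_and_score(train, val, C, epsilon)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)

def grid_search(fit_and_score, n_samples, k, C_grid, eps_grid):
    """Return the (C, epsilon) pair with the lowest cross-validated error."""
    return min(product(C_grid, eps_grid),
               key=lambda pair: cv_error(fit_and_score, n_samples, k, *pair))

# Hypothetical stand-in for "train on `train`, return the error on `val`".
# It ignores the data and is smallest at C = 1, epsilon = 0.1 by construction.
def toy_fit_and_score(train, val, C, epsilon):
    return (C - 1.0) ** 2 + (epsilon - 0.1) ** 2

best = grid_search(toy_fit_and_score, n_samples=10, k=5,
                   C_grid=[0.1, 1.0, 10.0], eps_grid=[0.01, 0.1, 1.0])
print(best)
```

Every candidate pair is scored only on data that was held out during training, which is why the selection rests on (artificially created) out-of-sample performance. The cost is one training run per fold per grid point, which is the computational inefficiency noted above.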
Chapter 4
Support Vector Machines and the Concept of Generalization
4.1 Rationale behind SVM: the concept of generalization
4.1.1 The idea behind “complexity” of a function
The problem of overfitting (or, fitting-too-well), referred to in chapter 2 of the thesis, has long
been apparent: functions that perform extremely well on a given training set usually make
unsatisfactory predictions on unseen, test data. Typically, those functions are quite
“complex”, in the sense that by construction they “fit” the training data almost perfectly (see
e.g. Smola, 1996, and Müller et al., 2001). In other words, they make virtually no training
errors. To illustrate the idea of complexity of a function, consider the white and black balls in
Figure 1.
Figure 1. A two-class classification problem of separating black balls from white balls. Not a single line in (a) can separate the two classes
of balls without an error. A polynomial of degree two however is able to separate the same configuration of balls, as seen in (b).
In this two-class classification problem, we are free to choose among any kind of functions to
separate the white and black balls from each other. While some functions will be able to
separate the classes without an error, it is evident that no line can separate the two classes
correctly (Figure 1 (a)). In contrast to a line, a parabola or a polynomial of higher
degree can (Figure 1 (b)). In essence, one can claim that the function
class “polynomials of degree two” (that is, parabolas) represents more “complex” functions
than the class of linear functions. But why do overly complex functions
tend to make more prediction errors? And, even more generally, what kind of functions shall
we use for prediction: “complex” or “simple”?
4.1.2 On the importance of generalization
Consider the following pattern recognition (here, “tree recognition”) problem. Two children, a
boy and his little sister, are given a number of pictures with objects. They are told which of
the objects are trees and which are not. They are also given some criteria according to which
to classify an object as either a “tree” or “not tree”. For example, the criteria could include
“number of branches”, “number of leaves”, “colour”, etc. After studying the pictures, they are
presented with a picture of an object. Their task is to name the object, which in this case is a
tree. The boy, having studied in detail all possible kinds of trees, concludes that the object is
not a tree since he has never seen a tree with exactly so many branches before. His little sister,
on the other hand, has been lazy and has not studied too much. However, she concludes
(correctly) that the object is a tree, based on the fact that it is … green. Clearly, neither of the
two is a good predictor of unseen trees. In order for the boy, who represents the “overfitting”
function here, to improve, he could for example admit that there is a big chance of other
trees existing with a different number of branches. Although this relaxation of his assumptions
will lead to a correct classification of lots of unseen trees that would otherwise be
misclassified, he might feel a bit uneasy because some non-trees could sneak in as trees
according to the new classification rule. The question he faces is: what is the best tradeoff
between relaxing some assumptions (in other words, using a class of less complex functions)
and increasing the risk of making more mistakes as a result of these relaxations?
It appears that a class of functions with optimal generalization ability is desired: they must
not be too “complex”, on the one hand, but simultaneously with this, they must not make too
many errors on the training data set, on the other. Hence, the criteria for the function we
choose are its complexity (also referred to as capacity) and the number of training errors it
makes (accuracy). Consider for clarity Figure 2.
In Figure 2, the best tradeoff between the complexity of a function class and the minimum
number of training errors it makes is struck at the minimum of the sum of the complexity term
and the number of training errors. At this point we find the function with the best generalization
ability. Function classes to the right of this point are too complex, in other
words there is overfitting. Analogously, function classes to the left of the best
tradeoff point are not complex enough, that is, there is underfitting. The idea that we should
try to find this optimal tradeoff point is called the Structural Risk Minimisation
principle (Vapnik, 1995, Burges, 1998). There is one crucial detail remaining, however: how
to measure complexity so that we can order functions according to their increasing
complexity?
[Figure 2 plot. Horizontal axis: function classes ordered in increasing complexity. Plotted: the minimum number of training errors, the sum “complexity term + training errors”, and the “best tradeoff” point at the minimum of that sum.]
Figure 2. Relation between complexity of the function class used for training, on the one hand, and function (class) complexity term and the
minimum number of training errors realized by a concrete function belonging to the function class, on the other. The function class with best
generalization ability, found at the “best tradeoff” point, is associated with the minimum sum of the complexity term and number of training
errors. Functions to the right of this point will overfit the training data, while those to the left of it – underfit the training data.
4.1.3 How to measure the level of complexity?
It seems that all we need to find the desired optimal point is a measure of complexity
according to which all classes of functions (polynomial, trigonometric, etc.), can be ordered.
One such measure of complexity of a class of functions, proposed in the SVM literature, is the
VC dimension (Vapnik, 1995). The horizontal axis in Figure 2 represents all classes of
functions, ordered according to their complexity, which increases monotonically with their
VC dimension.
4.1.4 What is the VC dimension?
The VC (Vapnik – Chervonenkis) dimension of a class of functions is defined as the largest
number h of points that can be separated in all possible classification ways they may appear
using functions of the given class (Burges, 1998). It follows that relatively more complex
(classes of) functions have higher VC dimension, since they are able to separate relatively
more points without an error (in all possible classification ways). Let us determine, for
example, the VC dimension of the class of linear functions– what is the maximum number of
points that linear functions can separate in all possible ways without an error?
Figure 3. Possible two-class classifications of three training points in a plane. Burges (1998) shows that there are exactly eight possible
classification ways in which three balls (points) from two classes (b = black, w = white) may appear. Clearly, a line is able to separate the
two classes in all possible eight ways, provided that the balls are not lined up with each other.
It is clear from Figure 3 that lines can separate three black and white balls no matter what
colour we choose the balls to be, provided that the balls do not lie on one and the same line.
Since a line cannot separate four balls in all different classification ways (see Figure 1 for an
example of this impossibility), we conclude that the VC dimension of the class of linear
functions (in a plane) is three.
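The counting argument above can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses perceptron updates as a simple separability test, which is guaranteed to terminate for linearly separable labellings, and it treats reaching the epoch cap as "not separable" (a heuristic, not a proof). The function names and the point coordinates are chosen here for illustration only.

```python
from itertools import product

def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if perceptron updates find a line with zero training errors.

    For linearly separable data the updates are guaranteed to stop; hitting
    the epoch cap is treated as "not separable" (heuristic, not a proof).
    """
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:  # wrong side, or on the line
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered_by_lines(points):
    """Check every +1/-1 labelling of the points for linear separability."""
    return all(perceptron_separates(points, labels)
               for labels in product([1, -1], repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]          # not on one and the same line
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]   # corners of a square

print(shattered_by_lines(three_points))                    # all eight labellings
print(perceptron_separates(four_points, [1, 1, -1, -1]))   # diagonal labelling
```

The diagonal labelling of the square is a standard example of four points that no line can separate, in the spirit of Figure 1 (a). Note that failing on one particular set of four points does not by itself establish the VC dimension; the definition requires that no set of four points can be separated in all possible ways.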
4.2 The concept of generalization in a binary classification problem
Now we can place the problem of finding an optimal point between complexity and accuracy
in a more formal setting. Suppose we are given a (row) vector $\mathbf{x}$ of $n$ explanatory variables,
$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$, and a finite number $l$ of (training) outcomes associated with observed
values of the explanatory variables. The $l$ outcomes can, for the time being, be labeled just
“plus one” and “minus one”. It is assumed that there is a fixed, but unknown relation between
the explained and explanatory variables. To sum up, we have training data in the form of $l$
input-output pairs:
$$
\begin{aligned}
&(x_{11}, x_{12}, x_{13}, \ldots, x_{1n}),\; y_1 \\
&(x_{21}, x_{22}, x_{23}, \ldots, x_{2n}),\; y_2 \\
&\qquad\vdots \\
&(x_{l1}, x_{l2}, x_{l3}, \ldots, x_{ln}),\; y_l
\end{aligned}
$$
or, equivalently:
$$
\mathbf{x}_1,\; y_1 \qquad \mathbf{x}_2,\; y_2 \qquad \ldots \qquad \mathbf{x}_l,\; y_l
$$
where $x_{ln}$ is the (real) value that the explanatory variable $x_n$ takes in the $l$th training
input-output pair, and each of the $y_1, y_2, \ldots, y_l$ is either plus one or minus one; and where
$\mathbf{x}_l$ is a vector containing the values of all $n$ explanatory variables in the $l$th training pair, that
is $\mathbf{x}_l = (x_{l1}, x_{l2}, x_{l3}, \ldots, x_{ln})$. We can present the training data alternatively as:
$$
(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_l, y_l) \in \Re^n \times \{\pm 1\}
$$
Returning to the tree recognition example, the explanatory variables could be “number of
branches”, “number of leaves”, “colour”, etc., and the training outcomes “tree” and “not
tree”. Our task is to find a function with the best generalization ability to classify some $k$
unseen outcomes, given new values for the $n$ inputs. In other words, find the best function
$f: \Re^n \to \{\pm 1\}$ that produces training outcomes¹² $\hat{y}_i \in \{\pm 1\}$, $i = 1, 2, \ldots, l$, from $n$ inputs,
$$
f(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}) = \hat{y}_i, \quad \text{for } i = 1, 2, \ldots, l,
$$
which will be used to classify $k$ new outcomes,
$$
f(x_{j1}, x_{j2}, x_{j3}, \ldots, x_{jn}) = \hat{y}_j, \quad \text{for } j = l+1, l+2, \ldots, l+k,
$$
given that there exists a fixed, but unknown relation between the $n$ independent variables and
the two classes in the form of $\varphi: \Re^n \to \{\pm 1\}$, according to which all $l+k$ input-output pairs
are generated. Even if it is impossible to find the best function, we can at least try to find the
one with the most adequate generalization ability from some pre-chosen classes of functions.
Note that “best” does not imply making no training errors.
Obviously, one can always find (many) complex functions that will be able to separate the
positive from the negative training outcomes without an error. In this case, the predicted
classes of the training outputs $\hat{y}_i \in \{\pm 1\}$, $i = 1, 2, \ldots, l$, and the actual classes of the outputs
$y_i \in \{\pm 1\}$, $i = 1, 2, \ldots, l$, will be the same. Such a function could be, for example, the function
$f^{*}$, for which
$$
f^{*}(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}) = \hat{y}_i = y_i, \quad \text{for } i = 1, 2, \ldots, l.
$$
It is common to say that the empirical error (or, empirical risk) of such functions is zero.
Moreover, one can always find a different (complex) function $f^{**}$ for which it holds that
$$
f^{**}(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}) = \hat{y}_i = y_i, \quad \text{for } i = 1, 2, \ldots, l.
$$
Now, if we are given $k$ additional test pairs $(\mathbf{x}_{l+1}, y_{l+1}), (\mathbf{x}_{l+2}, y_{l+2}), \ldots, (\mathbf{x}_{l+k}, y_{l+k})$, it may well
happen that the first function ($f^{*}$) makes no errors in predicting the test output values (that is:
$f^{*}(\mathbf{x}_{l+1}) = \hat{y}_{l+1} = y_{l+1}$, $f^{*}(\mathbf{x}_{l+2}) = \hat{y}_{l+2} = y_{l+2}$, …, $f^{*}(\mathbf{x}_{l+k}) = \hat{y}_{l+k} = y_{l+k}$), while the second
function ($f^{**}$) predicts all test $y$ values wrongly. To complicate things further, observe that
since both functions are complex to begin with (they both have zero empirical risk), there may
exist a simpler function ($f^{***}$) with better generalization ability: one that allows for some
training errors (say, $f^{***}(\mathbf{x}_2) = \hat{y}_2 \neq y_2$), but has a much lower value for the complexity term,
and thus finds itself closer to the optimal point in Figure 2. The question is: how can we
actually determine the relative position of these functions along the horizontal axis in Figure 2?

¹² Here the “hat”, as in $\hat{y}_i$, indicates the estimated class membership from the function $f(x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in})$, while we denote the true class of the $i$th outcome as $y_i$.
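The contrast between a complex function with zero empirical risk and a simpler function that tolerates a training error can be made concrete with a small sketch. Everything in it is a hypothetical illustration: the data points, the assumed true relation ("class +1 if and only if the first input is positive"), the single noisy training label, and the two functions `f_complex` and `f_simple` are invented here and do not come from the thesis.

```python
def error_rate(f, pairs):
    """Fraction of (x, y) pairs on which the predicted class differs from y."""
    return sum(1 for x, y in pairs if f(x) != y) / len(pairs)

# Hypothetical data. Assumed true relation: class +1 iff x1 > 0.
# The training pair at x = (0.5, 2.0) carries a noisy (wrong) label.
train = [((1.0, 0.0), 1), ((2.0, 1.0), 1), ((-1.0, 3.0), -1),
         ((-2.0, 1.0), -1), ((0.5, 2.0), -1)]
test = [((1.5, 2.5), 1), ((-0.5, 0.5), -1), ((3.0, -1.0), 1)]

memory = dict(train)

def f_complex(x):
    """Memorises every training pair; answers -1 for any unseen input."""
    return memory.get(x, -1)

def f_simple(x):
    """A simple rule that accepts one error on the noisy training label."""
    return 1 if x[0] > 0 else -1

for name, f in [("complex", f_complex), ("simple", f_simple)]:
    print(name, error_rate(f, train), error_rate(f, test))
```

In this constructed example the memorising function has zero empirical risk but misclassifies two of the three test pairs, while the simple rule makes one training error and no test errors.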
4.3 Bounds on the test error
Notice that the VC dimension of the complex functions in the above binary classification
problem is expected to be quite high in the general case (since they have to make no errors on
$l$ arbitrarily given training outcomes), and possibly it could be infinite (if these functions can
separate without an error any number of positive and negative outcomes). The simpler
function, on the other hand, inevitably makes at least one training error, implying that its VC
dimension should be relatively (much) smaller. It is tempting to conclude at this stage
that all that is left to do is to pick functions with different VC dimension, check how many
training errors they make, sum the two terms (VC dimension plus number of training errors),
and choose the function that produces the lowest sum. This strategy will work only if the
“complexity term” in Figure 2 and the level of complexity (the VC dimension) are one and the
same thing, which is generally not true. Vapnik (1995) and Burges (1998) have shown that if
training and testing input-output pairs are generated independently and distributed identically
according to some unknown, but fixed distribution $P[(x_1, x_2, x_3, \ldots, x_n), y]$, and if the VC
dimension ($h$) is less than the number of examples for training ($l$), then with probability at
least $1 - \eta$, the following bound on the test error holds:
$$
R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log(\eta/4)}{l}}
$$
The test error $R(\alpha)$ is also referred to as the risk of test error, the regularized risk, or simply
the risk. The $\alpha$ stands for a given class of functions, and $R_{emp}(\alpha)$ is the number of training
errors (the empirical risk) the best function in the class $\alpha$ makes on the training set. The
complexity term (known also as the “confidence term”) in this case is equal to the second
term on the right-hand side of the bound. The joint distribution of the explanatory and
explained variables, $P[(x_1, x_2, x_3, \ldots, x_n), y]$, is interpreted as a function that associates a
certain probability of a value (in our case, ±1) of $y_i$ occurring together with the observed
values $x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}$, for $i = 1, 2, \ldots, l$. Later, while exploiting SVM in the context of
constructing value-growth style rotation strategies, we will apply the concept of balancing the
empirical error and a certain complexity term (arising from the SVM formulation of
classification and regression problems) in a similar way.
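To get a feel for the magnitudes involved, the bound can be evaluated numerically. The sketch below assumes the form of the bound given in Burges (1998), with natural logarithms; the function names and the two candidate function classes (their VC dimensions and empirical risks) are hypothetical values chosen purely for illustration.

```python
import math

def vc_confidence(h, l, eta):
    """Complexity ("confidence") term of the bound; assumes 0 < h < l."""
    if not 0 < h < l:
        raise ValueError("the bound assumes 0 < h < l")
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(emp_risk, h, l, eta):
    """Upper bound on the test error, holding with probability >= 1 - eta."""
    return emp_risk + vc_confidence(h, l, eta)

l, eta = 1000, 0.05

# Hypothetical candidates: (VC dimension, empirical risk as an error rate).
candidates = {
    "simple class": (3, 0.10),    # makes some training errors
    "complex class": (50, 0.00),  # fits the training set perfectly
}

bounds = {name: risk_bound(emp, h, l, eta)
          for name, (h, emp) in candidates.items()}
best_class = min(bounds, key=bounds.get)

for name, value in bounds.items():
    print(f"{name}: bound = {value:.3f}")
print("preferred by the bound:", best_class)
```

With these illustrative numbers the simpler class has the lower bound even though it makes training errors, because the complexity term grows with $h$. Choosing the class with the lowest bound is the selection rule behind the Structural Risk Minimisation principle.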
It is possible to formulate other bounds in terms of concepts such as the annealed VC entropy,
the Growth function (Vapnik, 1998, Cristianini and Shawe-Taylor, 2000) and the fat
shattering dimension (Cristianini and Shawe-Taylor, 2000), which however will not be done
here.
Some readers may be surprised that the joint probability distribution function,
$P[(x_1, x_2, x_3, \ldots, x_n), y]$, does not appear in the bound. The bound, in fact, always holds
(provided that $h < l$), no matter what the underlying (but still existing) distribution is.
4.4 Remarks on the choice of a class of functions
It might appear that up till now we have used the term “a function” and “a class of functions”
interchangeably. There is a slight distinction between them. As shown in Figure 1 (b), for
example, there are (at least) two individual functions (two parabolas) belonging to the same
class of functions (the class of parabolas) that are able to separate the white balls from the
black ones without an error. According to the risk bound above, these two functions will be
equally preferable since the empirical risk (here, zero) and VC dimension stay one and the
same. In the event that a parabola which makes mistakes (that is, with empirical error ≥ 1) is
chosen out of the class of parabolas, again according to the bound, it will be less preferable
and need not be considered. Notice that hypothetically speaking it is quite possible that
neither a linear function, which will necessarily make at least one training error, nor a
parabola could actually be most appropriate for this separation problem, since there might
exist some other class of functions, which is able to strike a better balance between empirical
error and complexity term. In section 5.4 we will show the SVM solution for the case of
Figure 1, where one chooses the best function among polynomials of degree two.
We are now in a position to answer the previously posed question of why complex functions
tend to make more prediction errors relative to simpler ones on a given (two-class)
classification task. It is because their VC dimension is so big, that even though they make few
empirical errors, their test error bound (empirical error plus complexity term) is just too high,
for any fixed 1 – η and l.
Notice as well that it is quite possible that a class of functions with higher bound on the test
error can outperform functions with lower bound on a particular task. The boy from the tree
recognition example could have predicted correctly that the object is in fact a tree, had it had
the required number of branches; and the girl could have misclassified the tree, had the
object been pictured in winter and so not been green enough to count as a tree. Formally
speaking, this happens because we have a bound on the test error, which says that the test
error could not be above a given value (for fixed 1 – η), but it can certainly be below that
value, no matter how complex the chosen class of functions is. Imagine, for example, that a
given class of functions with VC dimension 4 has a test error R(α) of no more than 0.09 (for η
= 0.05 for instance), and a function with VC dimension 3 has a test error of no more than 0.07
(for the same η). It may well turn out that on a particular task the more complex class
produces an error of 0.06 (which is less than 0.09), and the simpler one – an error of 0.065
(which is less than 0.07).
As a last remark, observe that all other things held equal, functions with higher VC dimension
will have greater test error bounds, since the complexity term is increasing monotonically
with h.
Chapter 5
Constructing Support Vector Machines for Classification
5.1 Complexity and the width of the margin
5.1.1 The VC dimension of hyperplanes
In order to make use of the test error bound (or other similar bounds involving the VC
dimension), we have to come up with classes of functions whose VC dimension can be
computed. We observed in Figure 3 that the VC dimension of the linear functions in
two-dimensional ($\Re^2$) space is computable, and that it equals three. In a similar fashion, notice
that a (two-dimensional) plane in $\Re^3$ can separate without an error at most four points in all
possible classification ways the points may appear (if they do not lie on one and the same
plane). As asserted in Burges (1998), the VC dimension of (n – 1)-dimensional hyperplanes in
n-dimensional space is equal to n + 1.
Now that we have found functions (that is, (n – 1)-dimensional hyperplanes) whose VC
dimension is known, one more detail remains to be addressed. Returning to Figure 3, observe
that there are many lines that can be used to separate the black and white balls in each one of
the eight cases. Which one of them is optimal?
5.1.2 Optimal hyperplanes
Let us, for further clarity, have a look at Figure 4. In Figure 4, we are given several white and
black balls (outcomes), associated with different values of the two explanatory variables $x_1$
and $x_2$. Making a link to Finance, one can think of the two axes as representing two factors in
a factor model. For a value-growth rotation strategy, these could be for instance the
earning-yield gap and the change in the rate of inflation. In this case, the black and white dots could
be considered hypothetically as months during which either value or growth stocks have
outperformed. As is evident from the figure, many lines can be drawn which are able to
separate the two classes of value and growth months without an error. Abstractly speaking, we
are given one-dimensional hyperplanes (that is, lines) in two-dimensional input space (here,
defined by $x_1$ and $x_2$), together with $l$ members of two different classes. It is evident that the
two classes can be represented sufficiently well in the input space by just attaching the
appropriate labels to them (say, “value month” or “growth month”). Later, in Support Vector
Regression, we will actually need an additional axis (in addition to the axes required for each
of the inputs) since the target values there can take any real values (that is: $y \in \Re$), not just
Figure 4. Two out of infinitely many lines able to separate without an error two ball classes. Notice that we can draw a shaded area around
line number 1, called the “margin” between the classes, which just touches one of the black and one of the white points. The most preferable
separation line is line number 2, which yields the biggest margin between the two classes.
So, which line in Figure 4 shall we use? Rather than answering in a formal way, let us give
some intuitive reasoning. Notice that we can make line number 1 in Figure 4 a bit “fatter”
until it touches a black ball on its left and a white ball on its right. For simplicity, assume that
the line lies along the middle of this fat region, as shown in the figure. Relaxing this
assumption will not influence our conclusions. The width of the fat region is called the
“margin” that the given hyperplane produces between the two classes. As noted before, the
VC dimension of the class of lines (in a plane) is three. However one could argue that,
intuitively speaking, line number 2 is the least complex of all lines with zero empirical risk,
since it produces the region with largest area of “doubt” about the class of new balls coming
into the picture. To put it bluntly, line number 2 is the most “unintelligent” among its peers,
that is – among all lines that are able to separate the classes without any error. In line with the
principle of best generalization (the Structural Risk Minimization principle), we can conclude
that the most preferable among all lines with zero empirical risk is the least complex one. The
“doubt” in our case is not to be confused with “indecision”: any line (with zero training error)
will classify new balls as “white” if they appear on its appropriate side, even if they happen to
lie inside the margin determined by the line (inside the shaded regions in the graph).
However, the concept of the region of “doubt” (the region inside the margin) can still be used
intuitively as a proxy for complexity. Therefore, one could claim that complexity decreases
with increasing margin, and consequently line number 2 is the optimal hyperplane.
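The comparison of candidate lines by the width of their margin can be sketched in a few lines of code. The ball coordinates and the two candidate lines below are hypothetical and are not taken from Figure 4. As in the text, each line is assumed to lie along the middle of its fat region, so the margin width is taken as twice the distance from the line to its nearest ball.

```python
import math

def separates(w, b, data):
    """True if every ball lies strictly on the side matching its label."""
    return all(y * (w[0] * x1 + w[1] * x2 + b) > 0 for (x1, x2), y in data)

def margin_width(w, b, data):
    """Twice the smallest distance from the line w.x + b = 0 to any ball."""
    norm = math.hypot(w[0], w[1])
    return 2 * min(abs(w[0] * x1 + w[1] * x2 + b) / norm
                   for (x1, x2), _ in data)

# Hypothetical balls: label -1 for black, +1 for white.
data = [((1.0, 1.0), -1), ((1.0, 3.0), -1),
        ((4.0, 2.0), 1), ((5.0, 3.0), 1)]

line_1 = ((1.0, 1.0), -4.5)   # the line x1 + x2 = 4.5
line_2 = ((1.0, 0.0), -2.5)   # the vertical line x1 = 2.5

for name, (w, b) in [("line 1", line_1), ("line 2", line_2)]:
    print(name, separates(w, b, data), round(margin_width(w, b, data), 4))
```

Both lines have zero empirical risk on these balls, but the second leaves a much wider margin, so by the argument above it would be the preferred one of the two. The sketch only compares two given lines; it does not search for the line with the largest possible margin.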
One can think of yet another intuitive way to show why the line yielding the largest margin is
the most preferable. Imagine, for simplicity, that we are given only one black and one white
ball, as in Figure 5. Allowing for more than two balls from two different classes will not
change our intuitive reasoning.
Suppose that the exact position of the two balls is perturbed by some noise, in other words the
input-output relation f : ( x1 , x 2 ) → {“black”, “white”} exists in a noisy environment. Let the
noise intensity be given by the radius of a circle around any ball, so that the greater the radius,
the greater the noise. We can see immediately that line number 2 in Figure 5 can “absorb” the
biggest amount of noise around the two balls. Line number 1, being relatively closer to the
two classes, would for example classify the white ball incorrectly, had it been “pushed” a bit to
the left by some noise. On the other hand, line number 2, the one that yields the largest
margin between the classes, will be able to cope with the same situation. In other words, it is
preferable to line number 1.
Figure 5. Presence of noise in the data that dislocates points from their truthful position by a certain amount. Since the higher the noise level,
the greater the dislocation, the line in the figure that is furthest away from both classes (line 2) will be able to cope with the greatest noise
There is actually a formal way to display the relation between complexity and the margin
width, by referring for example to the margin-based bounds on generalization as in Cristianini
and Shawe-Taylor (2000).
5.2 Linear SVM: the separable case
Let us start with defining what is meant by a “separable” case. A given number of $l$ points in
($n$-dimensional input space) $\Re^n$ from two classes are said to be “separable” if there exists at
least one hyperplane that can separate the two classes from each other without an error.
In the simplest case the two classes are explained by two explanatory variables, $x_1$ and $x_2$.
Such is the case in Figure 4, where the two classes, a total of $l$ black and white balls¹³, are
separable by a line (which in this case is a hyperplane) in the input space of $(x_1, x_2)$. The line
that yields the largest margin is also drawn: line number 2. Our task will be to find an
expression for this optimal separating line, given the coordinates of all $l$ balls. Once we have
solved this problem, it will be relatively easy to move to the nonseparable case, and then
to two-class separation problems in $n$-dimensional input space, covering both the separable
and nonseparable cases.
In order to determine the exact position of the optimal line in Figure 4, suppose that it has
already been found and has the form of $w'_1 x_1 + w'_2 x_2 + b' = 0$. In this case all white balls
satisfy $w'_1 x_1 + w'_2 x_2 + b' \ge a$, and all black balls satisfy $w'_1 x_1 + w'_2 x_2 + b' \le -a$, for some
positive¹⁴ $a$. The balls for which these inequalities hold as equalities are called support
vectors. Thus, the support vectors are the balls that just “touch” the sides of the margin.
Notice that the support vectors completely determine the position of the line: even if all the
other balls were removed, the position of the separating line would not change.
Finding the width of the margin involves a few steps. First, we divide both of the above
inequalities by $a$, and second, we define new $w_1 = w'_1 / a$, $w_2 = w'_2 / a$ and $b = b' / a$. In
this way we rescale the optimal separating line and the lines that pass through the
support vectors, which now become $w_1 x_1 + w_2 x_2 + b = 0$, $w_1 x_1 + w_2 x_2 + b = 1$, and
$w_1 x_1 + w_2 x_2 + b = -1$, respectively. From here it is straightforward to find the width of
the margin, which is given by the distance between the latter two parallel lines, those on
which the support vectors from the two classes lie. As a consequence, the (width of the)
margin equals $\frac{2}{\|\mathbf{w}\|}$, where $\|\mathbf{w}\| \equiv \sqrt{w_1^2 + w_2^2}$. In the general $n$-dimensional case the vector
$\mathbf{w}$ will have $n$ coordinates, not just two. We are now in a position to formulate the
optimization problem of finding the expression for the maximal-margin hyperplane:
$$
\max_{\mathbf{w},\, b} \; \frac{2}{\|\mathbf{w}\|}
$$
Subject to:
$$
\begin{aligned}
(\mathbf{w} \cdot \mathbf{x}_i) + b &\ge 1, && \text{for all } i \text{ that represent white balls} \\
(\mathbf{w} \cdot \mathbf{x}_i) + b &\le -1, && \text{for all } i \text{ that represent black balls.}
\end{aligned}
$$

¹³ We use the terms “balls” and “points” interchangeably.
¹⁴ Specifying $a$ as negative will not alter the conclusions from the ensuing analysis.
We use small bold letters to denote vectors, and the symbol “•” (written $\cdot$ in the formulas) to denote a dot product, so
that $(\mathbf{w} \cdot \mathbf{x}_i) = w_1 x_{i1} + w_2 x_{i2}$. For ease of exposition, we define $y_i = 1$ if ball $i$ has a label
“white”, and $y_i = -1$ if ball $i$ has a label “black”. Furthermore, notice that maximizing
$\frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing its reciprocal $\frac{\|\mathbf{w}\|}{2}$, which in turn is equivalent to minimizing
$\frac{\|\mathbf{w}\|^2}{2}$ (since the distance $\frac{2}{\|\mathbf{w}\|}$, and hence $\|\mathbf{w}\|$, is necessarily positive). All these transformations will lead
us to the following equivalent formulation of the above optimization problem (Vapnik, 1995,
Müller et al., 2001):
$$
\min_{\mathbf{w},\, b} \; \frac{\|\mathbf{w}\|^2}{2}
$$
Subject to
$$
y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1, \quad i = 1, 2, \ldots, l.
$$
Notice that this is a convex quadratic optimization problem, which means that there will be
only one, global solution to it. This is a very desirable property of SVM, which distinguishes
them from neural networks.
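The rescaling step and the resulting constraints can be illustrated with a short sketch. It assumes a separating line is already given in unscaled form (here the hypothetical line $2 x_1 - 5 = 0$ on hypothetical balls); the helper names are invented for this example. The code divides by the constant $a$, checks the constraints $y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1$ and reports the margin width $2 / \|\mathbf{w}\|$.

```python
import math

def functional_margins(w, b, data):
    """Values y_i * ((w . x_i) + b) for every ball."""
    return [y * (w[0] * x1 + w[1] * x2 + b) for (x1, x2), y in data]

def canonical_form(w_raw, b_raw, data):
    """Divide by a = smallest functional margin, so the closest balls get 1."""
    a = min(functional_margins(w_raw, b_raw, data))
    if a <= 0:
        raise ValueError("the given line does not separate the two classes")
    return (w_raw[0] / a, w_raw[1] / a), b_raw / a

# Hypothetical balls: label -1 for black, +1 for white.
data = [((1.0, 1.0), -1), ((1.0, 3.0), -1),
        ((4.0, 2.0), 1), ((5.0, 3.0), 1)]

w, b = canonical_form((2.0, 0.0), -5.0, data)
margins = functional_margins(w, b, data)

tol = 1e-9  # tolerance for floating-point rounding
constraints_hold = all(m >= 1 - tol for m in margins)
touching = [i for i, m in enumerate(margins) if abs(m - 1) < tol]
width = 2 / math.hypot(w[0], w[1])

print(constraints_hold, touching, round(width, 6))
```

The balls listed in `touching` satisfy their constraint as an equality and therefore lie on the sides of the margin of this particular line. Whether the line is also the optimal one is a separate question that the optimization problem itself has to answer.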
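This constrained optimization problem can be solved by introducing non-negative multipliers
$\alpha_i$ and the Lagrangian (Burges, 1998):
$$
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{\|\mathbf{w}\|^2}{2} - \sum_{i=1}^{l} \alpha_i \cdot [\, y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 \,].
$$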
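In order to find the optimal solution $(\mathbf{w}, b, \boldsymbol{\alpha})$, the following system of conditions (1) – (4)
must be solved for non-negative multipliers $\alpha_i$, in line with optimization theory (Burges,
1998):
$$
\begin{aligned}
&(1) \quad \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial w_j} = 0, \quad j = 1, 2, \ldots, n \\
&(2) \quad \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = 0 \\
&(3) \quad y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 \ge 0, \quad i = 1, 2, \ldots, l \\
&(4) \quad \alpha_i \cdot [\, y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 \,] = 0, \quad i = 1, 2, \ldots, l.
\end{aligned}
$$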
The first two sets of equations lead to $\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$ and
$\sum_{i=1}^{l} \alpha_i y_i = 0$, respectively. It can be
shown (e.g., Burges, 1998) that an $\alpha$ multiplier has a value greater than zero if and only if
it is associated with a support vector. If we substitute these two results in the original
Lagrangian function, we arrive at the Wolfe dual maximization formulation of the
minimization problem (Vapnik, 1995, Burges, 1998):
$$
\max_{\boldsymbol{\alpha}} \; W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j)
$$
Subject to
$$
\alpha_i \ge 0, \quad i = 1, 2, \ldots, l, \quad \text{and} \quad \sum_{i=1}^{l} y_i \alpha_i = 0.
$$
Prior to elaborating on the reason why we would prefer to use the dual formulation of the
optimization problem, we shall address the case of finding an optimal solution when the two
classes are not separable by a hyperplane.
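The relation between the dual variables and the separating line can be checked on the smallest possible example: one white and one black ball. For two balls the dual has a closed-form solution, since the equality constraint forces $\alpha_1 = \alpha_2 = \alpha$ and maximizing $W$ gives $\alpha = 2 / \|\mathbf{x}_1 - \mathbf{x}_2\|^2$. The coordinates and function names below are hypothetical, and the sketch does not implement a general quadratic programming solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    l = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad

def weights_from_alpha(alpha, X, y):
    """w = sum_i alpha_i y_i x_i (from the first set of equations)."""
    return [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X)))
            for k in range(len(X[0]))]

# Hypothetical balls: one white (+1) and one black (-1).
X = [(2.0, 2.0), (0.0, 0.0)]
y = [1, -1]

diff = [a - c for a, c in zip(X[0], X[1])]
a_opt = 2.0 / dot(diff, diff)        # closed form for the two-ball case
alpha = [a_opt, a_opt]

w = weights_from_alpha(alpha, X, y)
b = y[0] - dot(w, X[0])              # from y_0 * ((w . x_0) + b) = 1, y_0 = +/-1

primal_value = 0.5 * dot(w, w)       # ||w||^2 / 2
dual_value = dual_objective(alpha, X, y)
print(w, b, primal_value, dual_value)
```

In this example both balls are support vectors with positive multipliers, the recovered line lies halfway between them, and the primal and dual objective values coincide at the optimum.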
5.3 Linear SVM: the nonseparable case
How can we manage problems where the two classes are not linearly separable? If we still
desire to use a linear function for separation, we could introduce so-called “slack variables”,
which take into account the possibility that one or more members of the classes will appear on
the “wrong” side of the margin. In Figure 6, for instance, which takes the problem of Figure 4
as a starting point, one of the black balls is classified mistakenly as white by the optimal
hyperplane. Referring to the financial interpretation of Figure 4, this is equivalent to saying
that during a certain month value stocks have outperformed growth stocks contrary to our
Since (by assumption) all black support vectors satisfy w1 ∗ x1 + w2 ∗ x 2 = b − 1 , it follows that
the equation for the line that passes through the misclassified ball and is parallel to the line
through the black support vectors can be given as w1 ∗ x1 + w2 ∗ x 2 = b − 1 + ξ , for some
positive slack variable ξ. The introduction of slack variables will alter our original
optimization problem, which becomes, for some positive constant C (Vapnik, 1995, Müller et
al., 2001):
Minimize
$$\frac{1}{2}\|\mathbf{w}\|^2 + C \Big( \sum_{i=1}^{l} \xi_i \Big)$$
subject to
$$y_i \cdot ((\mathbf{w} \cdot \mathbf{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \dots, l.$$
Figure 6. A non-linearly-separable binary classification problem. The optimal hyperplane makes one training error by classifying a black ball
as white. One way of dealing with such situations is to introduce for each training point a slack variable that takes a positive value if the
respective point turns out to be a training error, and zero otherwise.
In line with generalization theory, our goal here is to simultaneously maximize the margin (by
minimizing $\|\mathbf{w}\|$), and minimize the amount of training errors, which are proxied by the
slack variables (a non-zero slack variable means that a training error has been made). In other
words, we minimize the sum of two terms: empirical risk (via the amount of training errors)
and complexity (via the width of the margin). The positive constant C is introduced to control
for the penalty we would like to associate with a given empirical risk: the higher the C, the
greater the penalty associated with a given value for $\sum_{i=1}^{l} \xi_i$. Thus, if C is set too high, then a
relatively small margin between the classes would be tolerated, if it yields a small number of
training mistakes. On the other hand, a small value for C means that the width of the margin
takes precedence over the amount of training mistakes, and so the solution to the optimization
problem will tolerate a relatively large number of training mistakes. The benefit of the
introduction of the constant C is that via C we can control explicitly both the complexity and
the empirical training error by affecting the optimal w and the optimal number of training
mistakes. Consequently, we can call C a “complexity-error tradeoff (adjustment) parameter”.
The above optimization problem has its dual representation in the form of (Vapnik 1995,
Burges, 1998):
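$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$
subject to $C \ge \alpha_i \ge 0$, $i = 1, 2, \dots, l$, and $\sum_{i=1}^{l} y_i \alpha_i = 0$.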
Notice that the slack variables have disappeared in the dual formulation.
5.4 Nonlinear SVM: the nonseparable case
Notice that the problem in Figure 6, as well as in Figure 4, can be solved with no training
mistakes by a nonlinear function. The introduction of nonlinear functions would be tractable
if we are able to compute their complexity, that is – their VC-dimension. Knowing both
functions’ complexity and empirical risk enables us to make comparisons among them in
terms of generalization theory.
Let us try to find a second-order polynomial that will solve the nonlinear separation problem
of Figure 6. As mentioned, the empirical risk in this case can be zero, since a parabola can
separate the two classes without an error. However, in order to make judgments about the
regularized risk (the risk of test error), we have to be able to compute the VC dimension of
such functions. It has already been shown how to compute the VC dimension of hyperplanes.
Our task, then, boils down to finding a way to represent a given second-order polynomial as a
hyperplane in a certain n-dimensional space. For example, the polynomial
$a_1 x_1^2 + a_2 \sqrt{2}\, x_1 x_2 + a_3 x_2^2 + a_4$ can be thought of as an equation of a plane in a three-dimensional space with coordinates $x_1^2$, $\sqrt{2}\, x_1 x_2$, and $x_2^2$. The VC dimension of such planes
in $\Re^3$ is equal to four, as shown before. Notice that this three-dimensional space is just a
transformation of the two-dimensional input space $(x_1, x_2)$ via (for example) the explicit
mapping $\Phi: \Re^2 \to \Re^3$ (Burges, 1998):
$$\Phi: \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} \to \begin{pmatrix} x_{i1}^2 \\ \sqrt{2}\, x_{i1} x_{i2} \\ x_{i2}^2 \end{pmatrix},$$
for any point $(x_{i1}, x_{i2})$. This transformation is illustrated graphically in Figure 7.
In the transformed higher-dimensional space, called feature space, the two classes are clearly
separable. Notice that the two black and the two white balls lie on top of each other in the
transformed space. We can now apply the SVM optimization algorithm of finding the optimal
hyperplane in the new, higher-dimensional space.
The dual optimization problem in this case is (Vapnik, 1995, Müller et al., 2001):
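$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j))$$
subject to $C \ge \alpha_i \ge 0$, $i = 1, 2, \dots, l$, and $\sum_{i=1}^{l} y_i \alpha_i = 0$.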
The only difference as compared to the non-transformed case is that we have to compute the
dot product $(\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j))$ instead of $(\mathbf{x}_i \cdot \mathbf{x}_j)$, where
$\Phi(\mathbf{x}_i) = (x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2)$ and $\mathbf{x}_i = (x_{i1}, x_{i2})$.
Figure 7. An SVM solution to the classification problem of Figure 1, presented in feature space. The input space $(x_1, x_2)$ is transformed via
the mapping Φ into the $(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ feature space. The originally (linearly) nonseparable problem becomes (linearly)
separable in the feature space, where the two black and the two white points overlap each other. The optimal hyperplane, $\sqrt{2}\, x_1 x_2 = 0$, which
is constructed in the feature space, corresponds to a nonlinear decision surface in the input space (which says that the two quadrants with the
two white balls should contain only white balls and the two quadrants with the two black balls should contain only black balls).
In general, feature spaces of more than three dimensions can be used. A computational
problem could potentially arise in such cases, however, since the calculations in the
transformed space could become very cumbersome as its dimensionality increases. This
“curse of dimensionality” is elegantly overcome by SVM (Burges, 1998) since in the dual
formulation of the optimization problem we have to compute only dot products in the form
$\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, and never actually need to know the explicit coordinates of the points $\Phi(\mathbf{x}_i)$,
$i = 1, 2, \dots, l$, in the feature space. This allows us to make computations even in infinite-dimensional feature spaces, as long as the dot product $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ is computable. In some
cases this dot product can be computed by a simple kernel function:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$
This is actually the reason why it is preferable to use the dual formulation of the
optimization problem. The dot product in the feature space of the optimization problem at
hand, for example, is given in explicit form as
$$\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = (x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2) \cdot (x_{j1}^2, \sqrt{2}\, x_{j1} x_{j2}, x_{j2}^2).$$
It can also be expressed implicitly via the kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2$, since
$$k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2 = ((x_{i1}, x_{i2}) \cdot (x_{j1}, x_{j2}))^2 = (x_{i1} x_{j1} + x_{i2} x_{j2})^2$$
$$= x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{i2} x_{j1} x_{j2} + x_{i2}^2 x_{j2}^2 = (x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2) \cdot (x_{j1}^2, \sqrt{2}\, x_{j1} x_{j2}, x_{j2}^2).$$
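The identity between the explicit feature-space dot product and the kernel can also be checked numerically. The following is a minimal sketch in plain Python; the function names `phi`, `kernel` and `dot`, as well as the sample points, are illustrative choices of ours and not part of the original derivation.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit map from the 2-D input space to the 3-D feature space."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)


def kernel(x, z):
    """Homogeneous second-order polynomial kernel k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


# Arbitrary illustrative points (not data from the thesis).
pairs = [
    ((1.0, 2.0), (3.0, -1.0)),
    ((0.5, -0.5), (2.0, 4.0)),
    ((-1.0, 1.0), (1.0, 1.0)),
]

for x, z in pairs:
    explicit = dot(phi(x), phi(z))   # computed in feature space
    implicit = kernel(x, z)          # computed in input space only
    print(x, z, explicit, implicit)
```

The two printed values agree up to floating-point rounding, although `kernel` never constructs the three feature-space coordinates.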
It has been shown by Vapnik (1995) that the polynomial kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
corresponds to a map Φ into the space spanned by all products of order up to d. By using the
kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2$ for the problem of Figure 1, for example, we attain a
nonlinear decision boundary in the input space (which is linear in the corresponding feature
space) represented in Figure 8.
Figure 8. An SVM solution to the classification problem of Figure 1, presented in input space. The decision surface between the two classes
in the figure is found by means of implicitly mapping the input space into a feature space via the kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2$,
and then mapping the optimal hyperplane (together with the margin it produces) back from the feature space into the input space. The lightly shaded area is the margin between the classes. The borders of the margin in feature space correspond to curves in the input space, drawn in
the figure. Notice that all four points are support vectors, since they lie on these curves.
The margin between the classes is, as in the separable case, denoted as a shaded area. The
borders of the margin in feature space appear as curves in the input space.
5.5 Classifying unseen, test points
In order to find out how to classify a new, unseen point, let us re-write the equation of the
optimal hyperplane in the (nontransformed) linear case: $\mathbf{w} \cdot \mathbf{x} + b = 0$. By substituting the
expression of the optimal w, $\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i$, we attain the hyperplane decision function (for a
new test point) for the linear case (in line with Vapnik, 1995 and Burges, 1998):
$$f(\mathbf{x}) = \mathrm{sgn}\Big( \sum_{i=1}^{l} \alpha_i y_i (\mathbf{x} \cdot \mathbf{x}_i) + b \Big).$$
As a result, if f (x) > 0 (f (x) < 0), the new point will be classified as a white (black) ball.
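The decision function can be sketched in a few lines of plain Python. The sketch takes the kernel as an argument: with the default dot product it is the linear decision function above, and with the second-order polynomial kernel of section 5.4 the same code evaluates the decision function in feature space. The four-point layout, the labels (+1 for one colour, -1 for the other), the multipliers of 1/8 and the zero offset are assumptions made for illustration; they are not values reported in the thesis.

```python
def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Second-order polynomial kernel k(x, z) = (x . z + 1)^2."""
    return (dot(x, z) + 1.0) ** 2


def decision_value(x, points, labels, alphas, b, kernel=dot):
    """Value inside sgn(.): sum_i alpha_i * y_i * k(x, x_i) + b."""
    return sum(a * y * kernel(x, p)
               for a, y, p in zip(alphas, labels, points)) + b


def classify(x, points, labels, alphas, b, kernel=dot):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(x, points, labels, alphas, b, kernel)
    return 1 if value > 0 else -1


# Hypothetical layout: points in same-sign quadrants carry label +1.
points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels = [1, -1, -1, 1]
alphas = [0.125] * 4   # assumed multipliers for this layout
b = 0.0                # assumed offset

# Decision values at the four training points themselves.
margin_values = [decision_value(p, points, labels, alphas, b, poly_kernel)
                 for p in points]
print(margin_values)

# Two unseen points, one in a same-sign and one in a mixed-sign quadrant.
print(classify((2.0, 0.5), points, labels, alphas, b, poly_kernel))
print(classify((-3.0, 0.2), points, labels, alphas, b, poly_kernel))
```

For this layout each training point evaluates to exactly +1 or -1, so all four points sit on the border of the margin, and an unseen point is classified by the sign of the product of its two coordinates.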
In case we map our data into a feature space, the equation of the optimal hyperplane becomes
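$\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$. Hence, the optimal w equals $\sum_{i=1}^{l} \alpha_i y_i \Phi(\mathbf{x}_i)$, and the hyperplane decision
function becomes (Müller et al., 2001):
$$f(\mathbf{x}) = \mathrm{sgn}\Big( \sum_{i=1}^{l} \alpha_i y_i (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{x}_i)) + b \Big) = \mathrm{sgn}\Big( \sum_{i=1}^{l} \alpha_i y_i \cdot k(\mathbf{x}, \mathbf{x}_i) + b \Big).$$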
It is important to notice that the support vectors in Figure 8 (in this case, all four points) lie on
the curves $\sum_{i=1}^{l} \alpha_i y_i \cdot k(\mathbf{x}, \mathbf{x}_i) + b = \pm 1$.
5.6 Admissible kernels
Unfortunately, one cannot use just any kernel function to compute dot products in feature
spaces. The following theorem of functional analysis shows the sufficient conditions for a
kernel to be admissible (Burges, 1998):
Theorem 1 ( Mercer )
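There exists a mapping Φ and an expansion $k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$ if and only if, for any
$g(\mathbf{x})$ such that $\int g(\mathbf{x})^2 \, d\mathbf{x}$ is finite,
$$\int k(\mathbf{x}, \mathbf{y})\, g(\mathbf{x})\, g(\mathbf{y}) \, d\mathbf{x}\, d\mathbf{y} \ge 0.$$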
It is generally not true, however, that any kernel that does not satisfy the Mercer condition is
not admissible, that is – cannot be used in the optimization problem (Burges, 1998). In other
words, the Mercer condition is sufficient, but not necessary.
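On a finite sample, the integral in the Mercer condition has a discrete analogue: the quadratic form $\sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j)$ should be nonnegative for any coefficients $c_i$. The sketch below evaluates this form for the second-order polynomial kernel and, for contrast, for its negation. Such a check can only reveal violations on the chosen points; it does not prove that a kernel satisfies the theorem. The sample points and function names are illustrative assumptions.

```python
import random


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Second-order polynomial kernel k(x, z) = (x . z + 1)^2."""
    return (dot(x, z) + 1.0) ** 2


def negated_kernel(x, z):
    """A function that is deliberately not a dot product in any feature space."""
    return -poly_kernel(x, z)


def quadratic_form(kernel, points, c):
    """Discrete analogue of the Mercer integral: sum_ij c_i c_j k(x_i, x_j)."""
    return sum(ci * cj * kernel(xi, xj)
               for ci, xi in zip(c, points)
               for cj, xj in zip(c, points))


# Arbitrary illustrative points (not data from the thesis).
points = [(1.0, 2.0), (-1.0, 0.5), (0.0, 1.5), (2.0, -1.0), (0.5, 0.5)]

rng = random.Random(0)  # fixed seed so that the run is reproducible
coefficient_sets = [[rng.uniform(-1.0, 1.0) for _ in points]
                    for _ in range(200)]

smallest = min(quadratic_form(poly_kernel, points, c)
               for c in coefficient_sets)
print("smallest value for the polynomial kernel:", smallest)

unit = [1.0, 0.0, 0.0, 0.0, 0.0]
print("negated kernel:", quadratic_form(negated_kernel, points, unit))
```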
Chapter 6
Support Vector Regression
6.1 The ε-insensitive loss function
Up till now we have considered Support Vector Machines for classification tasks. In this
chapter, we will extend our analysis to function estimation, which is carried out by Support
Vector Regressions (SVR). In SVR, the target values are $y \in \Re$, and not
$y \in \{-1, 1\}$, as in (binary) classification.
In SVR one utilizes the concept of “ε-insensitive region” instead of “the margin”, as in
support vector classification. Following Vapnik (1995), we introduce the ε-insensitive loss
$$|y - f(\mathbf{x})|_{\varepsilon} \equiv \max\{0, |y - f(\mathbf{x})| - \varepsilon\},$$
for a predetermined nonnegative ε.
Intuitively, if the value of the estimate f (x) of y is off-target by ε or less, then there is no
“loss”, that is – no penalty should be imposed. However, if the opposite is true, that is
| y - f (x) | - ε > 0, then the value of the loss function rises linearly with the difference between
y and f (x) above ε, as illustrated in Figure 9.
Figure 9. The ε-insensitive loss function. The ε-insensitive loss function associates no penalty with a given estimated value, if the estimated
value is within ε distance of the true value. However, as the discrepancy grows above ε, the penalty increases monotonically with it.
Notice that the ε-insensitive loss function differs from the quadratic loss function (used in
statistics and elsewhere), which is given by $(y - f(\mathbf{x}))^2$.
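The difference between the two loss functions is easy to see numerically. The following minimal sketch implements both definitions directly; the function names and the sample values are ours, chosen only for illustration.

```python
def eps_insensitive_loss(y, f, eps):
    """|y - f|_eps = max(0, |y - f| - eps): zero inside the eps-tube, linear outside."""
    return max(0.0, abs(y - f) - eps)


def quadratic_loss(y, f):
    """Quadratic loss (y - f)^2, which penalizes every deviation."""
    return (y - f) ** 2


y_true = 1.0
eps = 0.1
for estimate in (1.0, 1.05, 1.5, 2.0):
    print(estimate,
          eps_insensitive_loss(y_true, estimate, eps),
          quadratic_loss(y_true, estimate))
```

An estimate of 1.05 is within ε of the target and therefore carries no ε-insensitive loss, whereas the quadratic loss is already positive; beyond ε, the ε-insensitive loss grows linearly while the quadratic loss grows with the square of the deviation.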
6.2 Function estimation with SVR
Let us consider the simplest case first, where there is only one input variable, x1 , and l
training data-points. That is, we have to estimate the function y = w1 ∗ x1 + b , as in Figure 10.
Figure 10. An SVR solution to the problem of estimating a relation between x1 and y. All points inside the white region in the figure are
within ε distance from the solid, optimal regression line, and therefore are not penalized. However, penalties ξ and ξ* are assigned to the
two points that lie inside the shaded areas. The optimal regression line is as flat as possible, and strikes a balance between the area of the
white region and the amount of points that lie outside this region.
Imagine that the optimal regression line has already been found (as in Figure 10). The
equation of the optimal line is consequently y = w1 ∗ x1 + b . It is possible to give a financial
interpretation of Figure 10. The y values can be viewed for example as representing the actual
difference between the S&P 500 Barra Value and Growth indices. This difference, in the
simplest case, might be explained by a single factor x1 , say the one-month oil price change.
All points that are within distance ε of the optimal line (that is, all points in the non-shaded
area) are not associated with any loss/penalty, in line with the concept of ε-insensitive loss
function. However, points for which | y − ( w1 ∗ x1 + b) | ≥ ε will be penalized by the
introduction of slack variables ξ i and ξ i* , i = 1 ,2 ,…, l, in line with Smola and Schölkopf
Notice that the flexibility of assigning different values for ε makes it possible to consider a
myriad of overfitting-correction criteria and investors’ loss functions, corresponding to
different values for ε. If ε is set too small, and the penalty associated with values off-target too high, then the resulting ε-insensitive region (in the input space) must necessarily
look like a serpent maneuvering through the data, making lots of curves. As a result, almost
all points will be fitted closely. This type of loss function could be typical for investors
who consider even small losses as quite disastrous. If ε is set too high (and the penalty
associated with values off-target too low), then rather few points will be penalized, meaning
that investors in this case are inclined to put up with greater losses, that is – to be indifferent
to losses of magnitude up to ε. The resulting ε-insensitive region (in the input space) is very
likely to resemble a linear surface. In Figure 10, there are two penalized points, with
respective penalties ξ and ξ * .
All these considerations, together with the Structural Risk Minimization principle, lead
logically to the formulation of the optimization problem used in SVR for function estimation
(Vapnik, 1995):
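Minimize
$$\frac{1}{2}\|\mathbf{w}\|^2 + C \Big( \sum_{i=1}^{l} (\xi_i + \xi_i^*) \Big)$$
subject to
$$((\mathbf{w} \cdot \mathbf{x}_i) + b) - y_i \le \varepsilon + \xi_i, \quad i = 1, 2, \dots, l,$$
$$y_i - ((\mathbf{w} \cdot \mathbf{x}_i) + b) \le \varepsilon + \xi_i^*, \quad i = 1, 2, \dots, l,$$
$$\xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \dots, l.$$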
As in the binary classification problem, it is assumed that there are a total of l training points.
Notice that we can use the same formulation of the optimization problem to solve cases where
the input space is n-dimensional. In this case the vector w and each point x i will have n
coordinates. The predetermined constant C plays a role completely analogous to the one it
plays in classification: it pre-specifies the amount of penalty associated with each training
mistake (that is, with each x i for which either ξ i* > 0 or ξ i > 0).
The formulation of the optimization problem could be explained intuitively as follows. It can
be shown that in solving the optimization problem one strives to strike a balance between the
area of the non-shaded, ε-insensitive, region (as in Figure 10) – in other words, complexity –
and the amount of training errors that are allowed to occur. Thus, for example, if the pre-specified ε is big enough to give rise to (many) ε-insensitive regions that contain all training
points, then the resulting optimal estimated function will be as horizontal (“flat”) as possible.
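The preference for a flat function can be illustrated by evaluating the SVR objective for two candidate lines. The sketch below assumes the standard form of the objective, $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i (\xi_i + \xi_i^*)$, for a single input variable, and uses the smallest slack values that satisfy the constraints for a given line. The data, ε and C are made-up illustrative values.

```python
def svr_objective(w, b, xs, ys, eps, C):
    """Primal SVR objective for a one-input linear function f(x) = w * x + b.

    For a fixed (w, b), the smallest feasible slacks are
    xi_i      = max(0, f(x_i) - y_i - eps)   (estimate too high)
    xi_star_i = max(0, y_i - f(x_i) - eps)   (estimate too low)
    """
    total_slack = 0.0
    for x, y in zip(xs, ys):
        f = w * x + b
        xi = max(0.0, f - y - eps)
        xi_star = max(0.0, y - f - eps)
        total_slack += xi + xi_star
    return 0.5 * w * w + C * total_slack


# Illustrative data: every target lies within eps = 0.2 of zero.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.1, -0.1, 0.05]

flat = svr_objective(0.0, 0.0, xs, ys, eps=0.2, C=10.0)
tilted = svr_objective(0.5, 0.0, xs, ys, eps=0.2, C=10.0)
print(flat, tilted)
```

Because the horizontal line already keeps every point inside the ε-insensitive region, it incurs neither a slack penalty nor a complexity penalty, so any tilted line has a strictly larger objective value.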
As in classification, in SVR there exists a suitable dual representation of the regression
optimization problem (Vapnik, 1995, Smola and Schölkopf, 1998):
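$$W(\boldsymbol{\alpha}^*, \boldsymbol{\alpha}) = -\varepsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i \cdot \mathbf{x}_j)$$
subject to $C \ge \alpha_i^*, \alpha_i \ge 0$, $i = 1, 2, \dots, l$, and $\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0$.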
Generalization to nonlinear regression estimation is carried out analogously to the case of
binary classification, by substituting the kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ for $(\mathbf{x}_i \cdot \mathbf{x}_j)$ in the dual
formulation above. In SVR the regression estimates take the form of (Smola and Schölkopf,
$$f(\mathbf{x}) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \cdot k(\mathbf{x}, \mathbf{x}_i) + b.$$
Analogously to Support Vector Machines for classification, all α and α* multipliers have
a value greater than zero if and only if they are associated with support vectors.
Chapter 7
7.1 A factor-model approach to the basic model
Before explaining the technical part of the basic Support Vector Regression Value-versus-Growth rotation model (to be presented in section 7.4), we will first of all put it in the context
of multi-factor models discussed in chapter 2. Consider Figure 11, which presents a tree-like
structure stemming from the term “factor models” that captures different facets (alluded to in
chapter 2) of these models.
[Figure 11 is a tree diagram of factor models. Its boxes read: models for estimating expected returns; models for estimating volatility of return; models using all factors simultaneously from a pre-specified set; models using (many) subsets of a pre-specified factor set; models utilizing other (regression) tools; model selection based on statistical criteria, such as adjusted R²; model selection based on financial criteria, such as hit ratio; model selection based on Bayesian analysis; model selection based on cross-validation.]
Figure 11. Classification of factor models according to different characteristics. The basic model of the thesis can be regarded as a factor
model with features that appear in the shaded rectangles.
The proposed basic model of this thesis has the factor-model characteristics that appear inside
the shaded rectangles of Figure 11. It employs Support Vector Regressions, uses all pre-selected factors simultaneously, predicts the difference in returns between value and growth stocks
in the S&P 500 index (split in market capitalization according to their book-to-market ratio),
and finally, uses a cross-validation procedure for model selection (to be explained in section
Notice at the outset that, regarding market efficiency, whatever the results of the Support
Vector Regression model, they will be intrinsically inconclusive as evidence for
or against US stock market efficiency. This is, to begin with, a consequence of the fact that
although all information on the factors used has been (publicly) available throughout the
whole estimation period, the Support Vector Regression tool was not. This is at odds
with the notion of market efficiency, which requires that at the time of model creation
only modeling tools available at that time (and not afterwards) be applied. Additionally, as
pointed out by Pesaran (2003), market efficiency and stock market non-predictability
are concepts that cannot be equated per se. Pesaran (2003) shows that “stock market
returns will be non-predictable only if market efficiency is combined with (investor) risk-neutrality”.
7.2 Indices and data choice
7.2.1 The explained variable: the “value premium”
The actual task of the proposed Support Vector Regression model is to predict the direction of
the monthly value premium (that is, the difference in monthly returns) between two indices –
the S&P 500 Barra Value and Growth indices. The choice of these two indices is motivated
by the expected low transaction costs (associated with high expected liquidity), since it is
possible to buy and sell futures on them (Bauer and Molenaar, 2002). There exist a number of
characteristics for classifying stocks as belonging either to the value or the growth club, such
as the ratios of (current) market price to earnings per share, and market price to cash flow per
share, but we will confine ourselves only to the book-to-market ratio in deciding upon our
Value-versus-Growth style rotation strategy, because it can be easily implemented (through
the S&P 500 Barra Value and Growth indices). The logic behind this particular split of
stocks can be explained as follows. Firms with a low book-to-market (BM) ratio are generally expected (by the
market) to grow fast and be quite profitable some time in the future, so as to compensate for
the high market price of their stocks compared to the book value of their (existing) equity
capital. These expected-to-grow-fast firms form the growth club. The rest of the firms are
labeled “value”.
7.2.2 On the choice of explanatory factors
Potentially a myriad of factors are expected to have impacts on the two classes of value and
growth stocks. In this thesis, we restrict ourselves to the set of 17 factors used by Bauer and
Molenaar (2002), who claim to consider only factors whose effects on stock market returns
are asserted in the literature on the subject to have some economic interpretation. It could be
potentially argued that all of the 17 pre-chosen candidate factors in the base factor set (given
in Appendix I) actually affect value and growth stocks in a certain way, but there does not
exist a consensus in the literature on what is the precise nature of each factor’s influence, what
is the extent of that influence, and whether the direction of the influence is constant through
time. Bauer and Molenaar (2002) for example find that some of their 17 pre-chosen factors
“appear to be relevant in a particular time frame, but loose their power completely in a
different period”. Alongside, Levis and Liodakis (1999) state that “there are good
fundamental reasons and considerable empirical evidence to suggest that … value spreads are
associated with economic fundamentals”. Asness et al. (2000), however, point out that one
criticism of considering economically meaningful variables is that “it becomes very difficult
to determine which of the observed relations are real and which ones are artifacts of the data”.
Technical factors
The 17 pre-chosen factors can be divided into “economic” and “technical”. Some suggestion
as to why the pre-chosen technical (or, market-based) factors appear to be relevant can be
found, for example, in the works of Levis and Liodakis (1999) [15], Asness et al. (2000),
Copeland and Copeland (1999) and Chan et al. (1996).
To start with, Levis and Liodakis (1999) and Asness et al. (2000) report that, among others,
the one-month lagged value spread is an important predictor of the (following month’s) value
premium. Copeland and Copeland (1999) find that on days that follow increases
(decreases) in the VIX [16], value-based portfolios outperform (underperform) growth-based
portfolios. The authors interpret this observation on the basis of the idea that rising
uncertainty about the future leads to falling confidence in growth stocks and a shift to value
stocks. Chan et al. (1996) investigate whether there is “momentum” in stock prices (that
is, whether past winners on average continue to outperform past losers), and discover that the
market responds only gradually to new information. The researchers also provide evidence on
the profitability of price momentum strategies and relate it to portfolio value and growth
characteristics by showing that past winners (losers) tend to be growth (value) stocks.
Macroeconomic factors
The subset of pre-chosen macroeconomic factors is based on findings documented for
instance by Bauer and Molenaar (2002), Kao and Shumaker (1999), Levis and Liodakis
(1999) and others.
According to these studies, one of the important determinants of the sign of the value
premium is the overall interest rate environment. For example, as the spread between long
term and short term interest rates (the yield-curve spread) widens, firms whose profits are
expected to lie in the more distant future (that is, growth firms) will be hurt relatively more,
since their future (expected) earnings are discounted over longer horizons as compared to value
firms. Macedo (1995) maintains that the equity risk premium (the expected future extra return
that the overall stock market or a particular stock must provide over the rate of risk-free bonds
to compensate for market risk) is the strongest determinant of future style performance.
In his view, a high equity risk premium favors riskier portfolios; and since value stocks are
perceived to be more risky, they tend to do well when the equity risk premium is high. A steadily
rising expected equity risk premium implies decreased confidence in the future and hence
hurts growth stocks disproportionately, since their profits are expected to materialize in the
more distant future. Another relevant determinant of the value-growth (monthly) return spread
appears to be the rate of inflation (see Levis and Liodakis, 1999, and Kao and Shumaker,
1999). Additionally, Sorensen and Lazzara (1995) find a positive relationship between the
growth in industrial production and interest rates and the value-growth return spread.
According to Kao and Shumaker (1999), if the earnings-yield gap (which subtracts bond
yields from a market earnings/price ratio) is small and is produced by a low earnings-to-price
environment in combination with high interest rates, then value stocks should be favored. The
researchers go on to contend that, regarding credit spreads, one would expect
growth stocks to outperform value stocks in a recessionary environment characterized by high
default rates. Lucas et al. (2001) also consider the effect of changes in the business cycle,
proxied by a composite index of leading indicators of the US business cycle, and hypothesize
that growing firms are likely to be more flexible in reacting to and profiting from a changing
economic environment.
[15] These results of Levis and Liodakis (1999) are established for the UK stock market.
[16] The Market Volatility Index of the Chicago Board Options Exchange.
7.2.3 Factor explanatory power and Support Vector Regressions
Without going further into deep discussions on the expected effects of each of the 17
candidate factors, it should be stressed that it seems reasonable in principle to first test what
their actual relevance appears to be, and then try to explain why, at least empirically, a factor
stands out as relevant or not. In any case, it could well be that all factors depend to a
certain degree on each other, so that it will be difficult to disentangle the effects of a single
factor, and also that different factors are relevant at different times. However, for prediction
purposes (which are in effect the subject of greatest interest) it does not actually matter what
the precise role of individual factors is, since it is more interesting to see how the interactions
of all factors could be used effectively to predict which of the two S&P 500 Barra indices will
outperform the other in a given time period (in our case, the following month). At any rate,
the Support Vector Regression tool that is used to build the proposed models is expected to
derive information (or, estimates) stemming from the interactions between (many)
explanatory factors. However, it is a nonparametric tool which can provide only limited
information as to which individual factors exactly stand out as important. An extensive
account of the properties of Support Vector Regressions in relation to factor models is given
in the following sections.
7.3 Support Vector Regression as a factor-model tool
Several of the preceding chapters of the thesis have dealt in detail with the rationale behind
and the nature of Support Vector Regression and Support Vector Machines as a whole. What
is important to emphasize in this section are those qualities of Support Vector Regressions
that most justify their employment as a factor-model tool.
7.3.1 The generalization property of Support Vector Regression
What stands out first of all is the elegant theoretical property of Support Vector Regressions of
automatically striking a balance between model explanatory power (or, “fit”) on the training data and
model complexity, for given regression parameters such as ε, C, and kernel function
parameters. As shown in chapter 4, it is precisely this generalization feature of functions (or,
models) that matters most for prediction purposes. Functions, or models, that extremely
overfit the training data are generally expected to be worse predictors than functions that
make some training mistakes, but are less complex. Overfitting is especially characteristic of
models that include numerous explanatory variables. Despite this, as noted in chapter 3,
Support Vector Regressions, and Support Vector Machines in general, are renowned for their
capacity to achieve good generalization performance even in high-dimensional input
(explanatory) data. This capacity could potentially provide a solution to the debate
surrounding the choice of the most important (several) explanatory factors of the value
premium out of a universe of explanatory factors. Making this choice is to a certain extent
unavoidable in an “ordinary” factor model since the statistical model selection criteria (e.g.,
adjusted R²) associated with multiple regression analyses do not tolerate a large number of
explanatory variables. In contrast, Support Vector Machines are expected to deal with ease
with all candidate explanatory variables considered simultaneously and so it is not imperative
to come up with a list of most important factors. Additionally, there might be hidden
interaction patterns between the explanatory factors that cannot possibly be captured by any
parsimonious model (by construction), but which may be accounted for by a multivariate
analysis (involving in our case 17 explanatory variables) using a tool which possesses an
adequate generalization ability in addressing multi-factor problems.
7.3.2 The internally-controlled-complexity property of Support Vector Regression
Using Support Vector Regressions one can alter model complexity without changing the
number (and nature) of explanatory variables. This is, to start with, due to the employment of
the ε-insensitive loss function instead of the “standard” quadratic loss function used
commonly in statistics and econometrics. Smaller values for the ε parameter force in general
the modeled function to provide a better fit on the training data (ceteris paribus) since as the
error-insensitive ε-region becomes smaller, the versatility of the function becomes greater,
and with it model complexity as well. The role of the complexity-error tradeoff
parameter, C, is analogous. A greater value for C forces the modeled function to become more flexible (that
is, complex) and make fewer training mistakes (ceteris paribus). It is also possible that similar
alterations of model complexity can be induced by the parameters (if there are any) of the
utilized kernel function. What is common for all these cases is that model complexity is being
changed “internally”, within the model itself, and not “externally” via making changes in the
(amount and nature of) data used for training.
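The effect of ε and C on the fitted function can be illustrated on a toy problem with a single explanatory variable. The sketch below does not solve the quadratic program; as a stand-in, it searches a coarse grid of candidate lines for the one with the smallest value of the objective $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i (\xi_i + \xi_i^*)$. The data, the grid and the parameter values are illustrative assumptions, and a practical application would use a proper SVR solver.

```python
def svr_objective(w, b, xs, ys, eps, C):
    """0.5 * w^2 + C * (sum of slacks) for the line f(x) = w * x + b.

    For each point at most one of the two slacks is positive, so their sum
    equals max(0, |y - f(x)| - eps).
    """
    slack = sum(max(0.0, abs(y - (w * x + b)) - eps) for x, y in zip(xs, ys))
    return 0.5 * w * w + C * slack


def fit_by_grid(xs, ys, eps, C):
    """Brute-force search over a coarse (w, b) grid; a stand-in for a QP solver."""
    candidates = [(i * 0.05, j * 0.1 - 1.0)
                  for i in range(41)      # slopes 0.00, 0.05, ..., 2.00
                  for j in range(21)]     # intercepts -1.0, -0.9, ..., 1.0
    # Ties in the objective are broken in favour of the smaller (w, b) pair.
    return min(candidates,
               key=lambda wb: (svr_objective(wb[0], wb[1], xs, ys, eps, C), wb))


# Illustrative data lying roughly on a line with slope one.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.1, 1.9, 3.2, 4.0]

tight = fit_by_grid(xs, ys, eps=0.1, C=10.0)   # small eps, high penalty
wide = fit_by_grid(xs, ys, eps=5.0, C=10.0)    # eps-region covers all points
cheap = fit_by_grid(xs, ys, eps=0.1, C=0.01)   # training errors are cheap

print("small eps, large C:", tight)
print("large eps, large C:", wide)
print("small eps, small C:", cheap)
```

With a small ε and a large C the selected line follows the data closely. Widening ε until all points fall inside the insensitive region, or lowering C so that training errors become cheap, pulls the slope towards zero, although the training data and the explanatory variable are unchanged.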
7.3.3 The property of specifying numerous investor loss functions
As a spin-off of the internally-controlled-complexity property, and stemming from the
utilization of the ε-insensitive loss function, comes yet another property, which addresses the
issue of investors’ perception of suffered losses. After all, nobody can spell out the
precise form of the (aggregate) loss function that investors have. Moreover, it may well
happen that this loss function is not constant through time and is highly sensitive to economic
regime switches. The advantage of introducing the ε-insensitive loss function instead of the
“standard” quadratic loss function is that the ε parameter gives the liberty of explicitly
formulating a myriad of loss functions (see e.g. Figure 9).
7.3.4 The property of distinguishing the information-bearing input-output pairs
As pointed out in chapter 6, in Support Vector Regression different weights are automatically
assigned, via optimization, to all factors-estimate combinations (which can be represented as points in a
factors-estimates space) for a given ε-insensitive loss function, a
pre-specified complexity-error tradeoff adjustment constant C, and a kernel parameter (if
any). These weights are expected to be indicative of the relative importance of each input-output
pair. Only the support vectors will be given a positive weight, which might suggest
that the rest of the input-output pairs do not contain useful information and should not be
considered for future model-building. Alongside, it might be possible in this process to create
a list of factors ordered according to their relevance. These conclusions however have to be
substantiated or refuted by further research in this area.
7.3.5 Cross-validation procedure for choosing among optimal models
What is quite striking is that since the Support Vector Regression parameters can be easily
manually controlled, one can generate a myriad of optimal models – one optimal model for
each possible combination of training parameters. The freedom of choosing among different
values for the ε-parameter, complexity-error tradeoff adjustment parameter, and kernel
parameters (if any) leads to the natural question of how to find the combination between them
that will yield the model with best predictive power. To the best of our knowledge, no
universally optimal technique for dealing with this issue has yet been discovered. Nevertheless,
one way of tackling it is via a procedure that is “standard” for Support Vector Machines:
cross-validation. Basically, a k-fold cross-validation procedure is utilized as follows: a
given dataset is divided into k folders (folds) of equal size; subsequently, a model is built on
each of the k possible combinations of k-1 folders, and each time the remaining folder is used
for validation. The best model is the one that performs best on average over the k validation
folders. The benefit of using a cross-validation procedure is that by construction it ensures
that model selection is based entirely on out-of-sample rather than in-sample performance.
Thus, the search for the best Support Vector Regression model is immune to a critique of
drawing conclusions about the merits of a factor model based on its in-sample performance.
To illustrate this critique in terms of the concept of generalization, it has been suggested in
chapter 4 that an extremely good model in-sample performance – that is, performance over
the training data set – is associated with a considerable (training data) overfitting, which in
turn is associated with poor generalization ability and poor model predictive power.
For greater clarity, the stages of a 5-fold cross-validation procedure are illustrated in Figure
12. Suppose that we are initially given training data consisting of values of both explanatory
and explained variables for n months, as in Figure 12 (a). The first stage of the cross-validation
procedure is to randomly permute the (chronological or original) sequence of the data, as in
Figure 12 (b). The second stage is to divide the permuted data into five (approximately)
equally-sized blocks, called folders, as in Figure 12 (c). The third stage consists of five
sub-stages. At each sub-stage, four folders of data are used as a training (model-building) set,
and the remaining fifth folder is used for validation (in other words, for testing), as
illustrated in Figure 12 (d). This procedure is repeated five times (once for each validation
folder). Model selection is based on performance over the five folders used for validation, which
is critical because this ensures that the model selection itself is based only on (artificially
created) out-of-sample performance. If our model-building tool is Support Vector Regression,
then the performance of any model is judged by the mean squared error between estimated and
real target values associated with each of the five validation folders. The model that achieves
the minimum mean squared error on average (over the five validation folders) is considered to be
the best. This best model is said to achieve minimum cross-validation mean squared error.
Figure 12. A 5-fold cross-validation procedure. The original data for the n months in (a) is
randomly permuted in (b), and divided into 5 equally-sized folders in (c). Afterwards, a
validation folder is selected and a model is built on the remaining four folders. This procedure
is repeated 5 times in total (once for each validation folder), as suggested in (d).
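The procedure in Figure 12 can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the helper names (k_fold_indices, cross_validation_mse,
fit_mean) are hypothetical, and a trivial mean predictor stands in for a Support Vector
Regression, which in practice would be fitted with a dedicated library.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly permute the indices 0..n-1 and split them into k folders."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validation_mse(xs, ys, fit, k=5, seed=0):
    """Average validation mean squared error over the k folders.

    `fit(train_xs, train_ys)` must return a function mapping x to a prediction.
    """
    fold_mses = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        squared = [(predict(xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_mses.append(sum(squared) / len(squared))
    return sum(fold_mses) / k


def fit_mean(train_xs, train_ys):
    """Toy stand-in model (an assumption): always predict the training mean."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean


# 60 months of made-up data, validated with 5 folders of 12 months each.
xs = list(range(60))
ys = [0.01 * (x % 7) for x in xs]
print(cross_validation_mse(xs, ys, fit_mean))
```

Competing models would each be scored with cross_validation_mse on the same data, and the one
with the smallest score retained.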
The major potential drawback of the cross-validation procedure is that by construction it is
bound at some point to make use of future information to predict past target values, which
seems counterintuitive for a time-series analysis. This issue is somewhat related to the
Look-Ahead Bias critique and will be addressed in section 7.7.2. Another shortcoming is that
cross-validation is a time-consuming procedure, which is not guaranteed to produce the best
estimates of the target values.
7.4 The basic model
The basic “real-time” simulated investment model consists of two steps.
First, at month t, all (historical) values for all 17 candidate explanatory factors, together with
the differences in returns between the S&P 500 Barra Value and Growth indices, for months t-60
through t-1 are used to build numerous Support Vector Regressions. Thus, the dependent
variable of the basic model is the “value premium” – the difference between the realized
returns of the S&P 500 Barra Value and Growth indices. The independent variables are the 17
pre-specified factors referred to in section 2.4, and listed in Appendix I. The total time span of
predicted months is between January 1993 and January 2003. Going further back in time is
untenable due to unavailability of (sufficient) macroeconomic data. The choice of exactly 60
months of data for model building is to a certain extent arbitrary. On the one hand, it could be
argued that 60 months of data are rather insufficient for forming reliable forecasting
hypotheses. On the other hand, it seems risky to consider periods that are too long, since the
model might then be criticized for failing to account for structural change. Moreover,
information extracted from months that lie further and further in the past becomes
increasingly irrelevant to the present. Thus, it seems reasonable to make the
assumption that 60 months can be viewed as belonging to roughly the same economic regime.
Second, once the Support Vector Regressions have been constructed, a standard procedure for
ranking the resulting models is applied. This procedure is a 5-fold cross-validation (as
explained in section 7.3.5), according to which models are ranked on the basis of their
cross-validation mean squared error. The regression with minimal cross-validation mean squared
error is used to predict the Value index minus the Growth index return difference for month t.
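The rolling 60-month window used in the first step can be written down explicitly. The sketch
below is illustrative only; the function names are hypothetical and months are indexed from 0,
which is an assumption of the sketch.

```python
def training_window(t, length=60):
    """Indices of the months t-length .. t-1 used to build the models for month t."""
    if t < length:
        raise ValueError("not enough history before month t")
    return range(t - length, t)


def walk_forward(n_months, length=60):
    """Yield (t, window) for every month that has a full training window."""
    for t in range(length, n_months):
        yield t, training_window(t, length)


# With 60 months of history followed by 121 to-be-predicted months (181 in total),
# every prediction uses only data that precede the predicted month.
for t, window in walk_forward(181):
    assert max(window) == t - 1
print(len(list(walk_forward(181))))
```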
In addition, some data re-balancing and other adjustments (explained in chapter 8) have been
made. One consequence is that if the predicted Value minus Growth return difference is between
-0.05 and 0.05 relative to the average over the training period, then we conclude that there is
no signal for the next month, which implies taking no trading position. If, however, the
predicted difference is above 0.05, then at time t we buy the Value index and sell the Growth
index, so as to capture the predicted positive value premium. And if the predicted difference is
below -0.05, then analogously at time t we buy the Growth index and sell the Value index, so as
to capture the negative value premium.
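The signal rule above can be expressed as a small function. The sketch assumes that “relative to
the average over the training period” means that the training-period average is subtracted from
the prediction before the ±0.05 band is applied; the function names are hypothetical.

```python
def trading_signal(predicted_diff, training_mean, band=0.05):
    """Map a predicted Value-minus-Growth difference to a trading signal."""
    centered = predicted_diff - training_mean
    if centered > band:
        return "value"    # long Value, short Growth
    if centered < -band:
        return "growth"   # long Growth, short Value
    return "none"         # no trading position


def strategy_return(signal, realized_diff):
    """Gross monthly return of the position, given the realized value premium."""
    if signal == "value":
        return realized_diff
    if signal == "growth":
        return -realized_diff
    return 0.0


print(trading_signal(0.30, 0.10), trading_signal(-0.20, 0.10), trading_signal(0.12, 0.10))
```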
In the basic model, transaction costs are unaccounted for. This zero-transaction-cost
assumption will be relaxed in section 7.5, where model extensions are discussed. Let us point
out here also that since the Support Vector Regression is a non-parametric tool, we can obtain
only point estimates and not the probability for a certain value to be observed.
Using only historically available data ensures that the implementation of the trading strategies
is carried out without the benefit of foresight, in the sense that investment decisions are not
based on data that have become available after any of the to-be-predicted months. Moreover,
investment decisions for the to-be-predicted months are always based on the entire factor set
of historical (60-month) data, ensuring that no variable-selection procedures based on
extensive manipulation of the whole available data have been carried out. At any rate, the
utilized cross-validation procedure for model selection ensures that the best candidate model
for each month is selected only on the basis of performance on external validation data.
For comparison, we set our results against a benchmark strategy which always (that is, each
month) bets that value stocks will outperform growth stocks. Thus, according to the
benchmark strategy, at the beginning of the forecasting period (January 1993) a hypothetical
investor takes a long position in the Value index and a short position in the Growth index. This
position is held throughout the entire prediction period (January 1993 – January 2003). The
monthly difference between the returns of the two indices is the monthly return from this
Value-minus-Growth investment strategy.
7.5 Model extensions
The basic model could be augmented in a number of ways. For example, following
Bauer and Molenaar (2002), next to the one-month-ahead forecast horizon of the basic model,
one can calculate signals for three- and six-month forecast horizons, and subsequently mix
them in order to come up with one signal. In addition, different levels of transaction costs
should be taken into account in order to make the implementation of the strategies realistic.
Considering the three-month horizon, for instance, if at time t the models built at t-2, t-1 and t
produce “value”, “growth”, and “value” signals for time t+1 respectively, then the combined
signal for month t+1 using a simple unweighted-average rule is “go long one-third on the
Value index, and short one-third on the Growth index”. Notice that one of the two “value”
signals cancels out with the “growth” signal. If the combined signal produced by the three
models pertaining to month t+1 is “no signal”, then no trading position is established. The
six-month horizon is calculated analogously. One could consider assigning greater weights to
more recent months in this procedure, which is not done here, however. One of the main
reasons for estimating the additional three- and six-month horizons is to observe whether the
signals produced by them are consistent with those of the one-month horizon strategy (that is,
with the basic model). “Consistent” in this case means that the results from the three-month
horizon strategy should be worse than the results from the one-month horizon strategy, and
the results from the six-month horizon strategy should be worse than the results from the
three-month horizon strategy. Such consistency, if it exists, would lend greater credibility to
the results of the basic model, on the one hand, and avoid the “data-mining” critique (to be
addressed in subsection 7.7.4) on the other. Indeed, we do find evidence of such consistency.
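The unweighted-average rule for combining signals can be sketched as follows. The numeric
encoding of the signals (+1, -1, 0) and the function name are assumptions made for illustration.

```python
from fractions import Fraction

# Assumed encoding: +1 = long Value / short Growth, -1 = the reverse, 0 = no signal.
SIGNAL_VALUE = {"value": 1, "growth": -1, "none": 0}


def combined_position(signals):
    """Unweighted average of the individual signals for the same target month."""
    total = sum(SIGNAL_VALUE[s] for s in signals)
    return Fraction(total, len(signals))


# The example from the text: "value", "growth", "value" nets to one-third long Value.
print(combined_position(["value", "growth", "value"]))
```

A result of zero corresponds to the “no signal” case, in which no trading position is
established. The six-month horizon would pass six signals to the same function.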
Another possible extension is to consider models that incorporate different combinations of
explanatory factors. This procedure will lead to a total of 2^17 (that is, 131,072) candidate
factor-models for each to-be-predicted month. Because of a lack of appropriate computational
equipment, however, only one model, the one based on all 17 pre-specified factors, has been
considered.
In the same line of thought, there is no guarantee that exactly 60 months represent an accurate
historical horizon for model selection. It may well happen that 65 or 55 months of training
data produce better results. The advantage of considering and comparing different model
selection horizons, furthermore, is that in this way the reliability of the default 60-month
historical horizon can be put on trial: if small changes in the number of months produce
enormously different results, then this inconsistency would be a sign of unreliability of the
default model strategy. Considering such time horizons, however, falls outside the scope of this
master’s thesis.
A fourth extension, which is important in practice, is to include transaction costs in the
calculations. We have allowed for two possible (non-zero) transaction cost regimes: one that
assumes that transaction costs are fixed at 25bp single trip and one that assumes 50bp (single
trip).
It is important to note that investors could not know in advance which strategy
would perform best. In order to address this issue, a hyper model selection tool analogous to
the one proposed by Pesaran and Timmermann (1995) could be utilized. Constructing such a
tool, however, falls outside the scope of this master’s thesis.
The last proposed extension here, which can be viewed rather as a curtailment, is to reformulate
the regression problem of the basic model as a classification problem. As with
the three- and six-month forecast horizons, the results from this model “extension”
can serve as a consistency test for the basic model. In the classification case, all months where
the value premium is positive/negative can be labeled just “+ 1” / “– 1”, while the values for
the explanatory variables remain unchanged. Since the ε-insensitive loss function parameter
(ε) does not enter the calculations in the classification problems, the time for carrying out
calculations should be relatively shorter. The accuracy of the classification results is expected
to be worse, since months where the value premium is, say, close to 0.01 would have the
same importance (or, weight) in model building as months where the value premium is about
4.50 as both receive a label “+1”. If this expectation turns out to be the case in practice, this
could be perceived as evidence of consistency of the basic model. The results (for the one-,
three- and six-month forecast horizons) from this classification problem approach, which in
fact testify to such kind of consistency, will be shown in section 8.3, immediately after the
results from the Support Vector Regression approach.
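The relabeling step of the classification variant is simple enough to state in code. The sketch
below is illustrative; the function name is hypothetical, and assigning a value premium of
exactly zero to the “-1” class is an assumption, since the text does not cover that case.

```python
def to_class_label(value_premium):
    """Collapse a monthly value premium into a +1 / -1 class label."""
    return 1 if value_premium > 0 else -1


# Months with very different premia receive the same label, which is the
# loss of information discussed above.
premia = [0.01, 4.50, -0.30, -6.20]
print([to_class_label(p) for p in premia])
```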
7.6 Small-versus-Big Rotation with Support Vector Regressions
We also present the results from a so-called monthly “Small-versus-Big” Support Vector
Regression rotation model for the sample period January 1993 – January 2003 and compare these
results with those from the “Small-minus-Big” and “MAX_SB” strategies. Because of space
considerations, we do not provide a complete account of those strategies and only sketch them
briefly. The “Small-versus-Big” strategy is a monthly rotation strategy conducted on the S&P 500
and S&P SmallCap 600 indices (see footnote 17); the “Small-minus-Big” strategy is an investment
strategy that at the beginning of the sample period goes long on the S&P SmallCap 600 index and
short on the S&P 500 index, and holds that position thereafter; and the “MAX_SB” is a perfect
foresight rotation strategy that each month goes long on the index with the higher monthly
return and short on the index with the lower monthly return. The results from those strategies
are sketched in section 8.2.7, and presented in full in Appendix 6 and Appendix 7. The
associated explanatory variables, listed in Appendix 5, are somewhat different from those
considered for the Value-versus-Growth strategies. The analysis of the performance of Support
Vector Regressions in the “Small-versus-Big” case is of great value, because if the results turn
out to be as promising as those from the “Value-versus-Growth” case, then greater credibility
should be lent to Support Vector Machines as a tool for constructing financial factor models.
Our results confirm that Support Vector Machines perform extraordinarily well both in the
“Value-versus-Growth” and the “Small-versus-Big” prediction tasks.
7.7 Support Vector Machines vis-à-vis common factor model pitfalls
This section deals exclusively with the question of why the proposed methodology is
expected to be immune to common critiques to which factor models are exposed, such as
Survival Bias, Look-Ahead Bias, Data Snooping, Data Mining, and Counterfeit.
7.7.1 Support Vector Machines versus the Survival Bias
According to Haugen (1999), Survival Bias occurs “if individual firms that go inactive during
the test period are systematically excluded from the test population”. It could be argued that if
the results from the Support Vector Machines models depend on the status of the companies
(which is either “active” or “inactive”), then failure to include inactive companies in the
calculations may lead to misleading estimates. For example, one can pick some active firms
that were regarded in the beginning of 2001 as successful, scrutinize their technical financial
characteristics (for instance, price-to-earnings ratio, book-to-market ratio, etc.), and then come
up with a list of characteristics that the “ideal” firms should have. In this situation, it may well
happen that firms that were close to this “ideal” type in the beginning of, say, 1999 actually
went bankrupt by the beginning of 2001, and so have never been taken into consideration at
the time of constructing the list of coveted characteristics. This, most probably, will cast
serious doubt on the validity of the resulting list of characteristics. The results of the Support
Vector Machines models do not suffer from Survival Bias. Firms that change their status
during the model training period are never excluded from the test population. All tests are
performed over whole indices, whose lists of constituents are not adjusted to include only
firms that were active at the end of any model testing period.
Footnote 17: Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank
Russell 1000 and Frank Russell 2000 indices have been used as inputs for the Small-versus-Big
calculations.
7.7.2 Support Vector Machines versus the Look-Ahead Bias
Look-Ahead Bias occurs when one builds prediction models based partially on “data items
that would not have been known when the predictions were made” (Haugen, 1999). For
example, suppose one constructs a model based on the entire sample period. In our case this
period is January 1993 – January 2003. Undoubtedly, some explanatory factors would appear
to have greater explanatory power than others. It would be totally unfair to use this whole
information in trying to predict the value of the explained variable for, say, January 1995,
because at that time investors would not have known which factors exactly would turn out to
be important in the future. The Support Vector Machines models do not suffer from such
Look-Ahead Bias. All predictions are based on data for the 60 months preceding any
to-be-predicted month.
The Look-Ahead Bias critique, however, can be partially directed at the cross-validation
procedure used for model selection. As explained in section 7.3.5, by construction the
cross-validation procedure is bound to use future data in predicting past outcomes (see Figure 12).
Although this “future data” is actually past data from the point of view of any to-be-predicted
month (the “future data” is always part of the 60 months of training data, and so has been
available prior to any to-be-predicted month), it seems at first sight unjustifiable to apply the
procedure in analyzing time-series data. The question is, can one use future data (from the
point of view of any month) to predict past outcomes, as the cross-validation procedure
suggests? Even though this should usually be considered a weak point of the cross-validation
procedure, our assumption that the 60 months prior to any month can be considered as
belonging roughly to one and the same economic regime (that is, there are no abrupt regime
changes in-between these 60 months) gives us the possibility to compare input-output
relations at different times within a strict 60-month time frame.
An alternative to the cross-validation procedure for model selection, which does not suffer
even partially from any Look-Ahead Bias, has been implicitly suggested by Bauer and
Molenaar (2002), who build their models based on 60 months of training data and then select
the best model (or models) based on out-of-sample performance over 24 months following the
model training period. The advantage of this approach is that model selection is always based
on performance over a post-training-data period, as opposed to out-of-sample sub-periods
created artificially within the training data (in the case at hand, there are five such sub-periods
in the cross-validation procedure). The disadvantage however is that the selected model (or
models) following 24 months of post-training observed performance has to be used to predict
an outcome that is 25 months ahead of the actual model training period. That is yet another
reason why we have decided to opt for the cross-validation procedure – the selected model
out of this procedure can be used for the prediction of the month coming immediately after
the model training period. In this way all available (60 months of) most recent (and thus, most
relevant) data prior to the to-be-predicted month is used for model building.
7.7.3 Support Vector Machines versus the Data Snooping Bias
In financial literature, the term “Data Snooping” is associated with the act of testing one’s
model using the same data as previous studies (Haugen, 1999). At least partially, our models
suffer from this bias: they take as a starting point a set of 17 factors, some or all of which
have been used by other studies. What is crucial to observe however is that the proposed
models in the thesis do not take into account in any way which of these factors appeared to be
important in these studies, since all of these 17 candidate factors have been used
simultaneously for our prediction purposes. The Support Vector Regression tool by
construction implicitly determines by itself for every single to-be-predicted month which
factors play a vital role, and which do not.
7.7.4 Support Vector Machines versus the Data Mining Bias
According to Haugen (1999) a Data Miner “spins the computer a thousand times; tries a
thousand ways to beat the market”. Invariably, the Data Miner is bound to hit the bull’s eye
once in a thousand times. The resulting “successful” model will be most probably due to
chance rather than special merit. Support Vector Machine models do not suffer from the Data
Mining Bias, because the available computer has been “spun” only one single time. That is,
only one model, which includes all pre-specified factors, has been tested.
7.7.5 Support Vector Machines versus the Counterfeit Critique
The counterfeit critique stems from the observation that to beat the market “on paper” is quite
different from beating the market “for real” (Haugen, 1999). It is precisely for this reason that
we have constructed a real time investment strategy. In this way we are as close to a real
trading simulation as possible. True, the Support Vector Regression tool could not have been
used in the first couple of years of the trading period. However, from a present-time
viewpoint one can certainly assess the economic significance of applying Support Vector
Machines in stock market predictability by tracing the performance of a hypothetical investor
through (a considerable amount of) time.
Chapter 8
Experiments and Results
This chapter describes the experiments that have been carried out and the results obtained.
The software used throughout the analysis is LIBSVM 2.4, developed by
Chih-Chung Chang and Chih-Jen Lin.
8.1 Experiments carried out with Support Vector Regression
When employing Support Vector Regression, prediction steps for any month t run as follows.
First of all, 60 months of training data available prior to month t are selected. The data consist
of the differences in returns between the S&P 500 Barra Value and Growth indices (the
explained variable), and the values for all 17 preset factors (the explanatory variables).
Second, Support Vector Regressions have been applied to the 60 months of training data prior
to any to-be-predicted month in order to select the best model. More concretely, a 5-fold
cross-validation procedure has been carried out to determine the best combination among C
(the complexity-error tradeoff parameter), ε (the level of insensitivity of the ε-insensitive loss
function), and a parameter inherent to the kernel function used (see footnote 18). A tiny part of
this procedure is visualized in Figure 13, where the vertical axis shows cross-validation mean
squared errors for C∈(0,32), while keeping ε and the kernel function parameter fixed at 1.0 and
0.007, respectively. By the “best combination” among the parameters is meant the one that
produces the minimal sum of squared errors between the true values and their corresponding
estimates coming out of the cross-validation procedure.
Figure 13. Five-fold cross-validation mean squared errors associated with complexity-error
tradeoff parameter C∈(0,32) and fixed ε-insensitive loss function parameter (ε) at 1.0 and Radial
Basis Function parameter at 0.007. The to-be-predicted month here is April 2000. The “best” model
is the one for which the combination of the three parameters over suitable parameter ranges
produces minimal cross-validation mean squared error.
Footnote 18: The Radial Basis Function kernel has been used in the calculations. It has been
examined by Burges (1998) and Smola and Schölkopf (1998), for example.
In practice, it is virtually impossible to find the best parameter combination from a
cross-validation procedure, since the search for it requires making infinitely many tests. For
example, the C parameter is free to take any positive value, whereas empirical tests can be
performed only over a finite number of those values. As suggested by Figure 13, however, the
minimum over the tested values is quite well defined.
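A search over a finite grid of parameter values can be sketched as below. The grids and the
error surface are purely hypothetical: the thesis does not list the values that were actually
tested, and toy_cv_mse merely stands in for the cross-validation mean squared error that would be
obtained by fitting a Support Vector Regression for each parameter combination.

```python
import itertools


def grid_search(cv_mse, c_grid, eps_grid, gamma_grid):
    """Return (best_mse, (C, epsilon, gamma)) over a finite parameter grid."""
    best = None
    for C, eps, gamma in itertools.product(c_grid, eps_grid, gamma_grid):
        score = cv_mse(C, eps, gamma)
        if best is None or score < best[0]:
            best = (score, (C, eps, gamma))
    return best


# Hypothetical grids (assumptions, not the values used in the thesis).
c_grid = [2 ** k for k in range(-3, 5)]   # 0.125, 0.25, ..., 16
eps_grid = [0.1, 0.5, 1.0]
gamma_grid = [0.001, 0.007, 0.05]


def toy_cv_mse(C, eps, gamma):
    """Toy stand-in for the cross-validation error surface (an assumption)."""
    return (C - 4) ** 2 + (eps - 1.0) ** 2 + (gamma - 0.007) ** 2


best_mse, best_params = grid_search(toy_cv_mse, c_grid, eps_grid, gamma_grid)
print(best_params, best_mse)
```

Only the grid points are ever evaluated, so the selected combination is the best among the
tested values rather than the best over all positive parameter values.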
Post-computational result adjustments have been made in order to avoid the risk of placing
unsubstantiated “trust” in border estimates. Thus, estimates within the arbitrarily chosen range
of (-0.05, 0.05) relative to the average over the training period have been regarded as giving
no clear indication of the forecasted direction of the value premium. In these “no signal”
cases, no trading position is taken.
The advantage of utilizing a cross-validation procedure throughout the analysis is that model
selection is based entirely on performance over (artificially created) out-of-sample data. The
disadvantages are that, first, there is no guarantee that the 5-fold cross-validation procedure
will yield the best approximately correct model, and second, the procedure is in itself rather
time-consuming (about three days for the whole 121-month estimation period on a computer
with a 2.66 GHz processor and 512 MB of RAM).
8.2 Results from Support Vector Regression Estimation
In this section we present the main results from the value-growth and small-big rotation
strategies. These strategies include: the passive Value-minus-Growth strategy; the MAX
strategy; the Support Vector Regression strategies for the one-, three- and six-month forecast
horizons under different transaction cost regimes; and, very briefly, the small-versus-big
strategies. In addition, we focus on the extent to which the strategies are
consistent with each other.
8.2.1 Value-minus-Growth strategy
Let us first of all examine the Value-minus-Growth strategy of implicitly taking each month a
long position in the Value index and a short position in the Growth index. The main results are
outlined here. Further details can be found in Appendix II. The Value-minus-Growth strategy
has not performed very satisfactorily during the prediction (testing) period, which starts in
January 1993 and ends in January 2003. The annualized mean return is merely 0.21% and
consequently the realized information ratio is not spectacular either, at 0.02. Investors who
followed this buy-and-hold strategy experienced devastating maximal 3-month (-11.55%) and
12-month (-22.86%) losses. The high standard deviation of returns (10.90%) has
also contributed to the overall distress. The sole bright feature of this strategy is the low level
of transaction costs associated with it.
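The reported figures are consistent with an information ratio computed as the annualized mean
return divided by the annualized standard deviation of returns: 0.21 / 10.90 ≈ 0.02. The sketch
below reproduces this arithmetic. The annualization rule (multiplying the monthly mean by 12 and
the monthly standard deviation by the square root of 12) is an assumption, as are the function
names.

```python
import math
import statistics


def information_ratio(annual_mean, annual_std):
    """Annualized mean return divided by annualized standard deviation."""
    return annual_mean / annual_std


def annualize(monthly_returns):
    """Annualize monthly returns (assumed rule: mean * 12, std * sqrt(12))."""
    mean = statistics.mean(monthly_returns) * 12
    std = statistics.stdev(monthly_returns) * math.sqrt(12)
    return mean, std


# Figures reported for the Value-minus-Growth strategy (in percent).
print(round(information_ratio(0.21, 10.90), 2))
```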
8.2.2 “MAX” strategy
The “MAX” strategy is defined as the strategy of going long on the better-performing index
and short on the worse-performing index every month throughout the sample period. The
detailed results from this strategy for a transaction-cost level of 50bp single trip can be found
in Appendix II. This strategy shows the potential profit from style rotation. In a 0bp, 25bp,
and 50bp transaction-cost environment, the maximum annual mean return from style rotation
is 27.14%, 24.21% and 21.29%, respectively. It is interesting to observe that in the 50bp
transaction-cost case 20% of the months yield a negative performance. This results from the
combination of Value and Growth indices outperforming each other in consecutive months and a
concomitant (absolute value of the) value premium of less than 0.5%. Such combinations, 24
in total, are most common between 1994 and 1999. The “MAX” strategy, unlike any of the
model strategies based on Support Vector Regressions, goes long on value and short on growth
stocks more than half of the time (in 53.72% of the months).
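Before transaction costs, the perfect-foresight “MAX” strategy can be sketched as follows. The
function name and the numeric position encoding are assumptions, and the premia in the example
are made up for illustration.

```python
def max_strategy(value_premia):
    """Perfect-foresight positions and gross returns.

    Position +1 means long Value / short Growth, -1 the reverse, 0 no position.
    The gross monthly return is the absolute value of the value premium.
    """
    positions = [(p > 0) - (p < 0) for p in value_premia]
    gross = [abs(p) for p in value_premia]
    return positions, gross


positions, gross = max_strategy([1.2, -0.4, 0.0, 2.5])
share_long_value = positions.count(1) / len(positions)
print(positions, gross, share_long_value)
```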
8.2.3 Basic model investment strategy
Detailed results from the basic model strategy (of forecasting the one-month-ahead difference in
returns between the S&P 500 Value and Growth indices on the basis of the whole set of 17
factors and data for 60 months preceding each of the to-be-predicted months) are presented in
Table 1 below (see footnote 19). What is most striking is that this strategy has produced much
better results than the Value-minus-Growth one. Investors would have enjoyed an annualized
mean return of 10.19%, under the assumption, however, of zero transaction costs. Combining
this result with the relatively lower standard deviation of returns yields an (annualized)
information ratio of 1.03 for the 121-month prediction period. It should be stressed, however,
that even when high transaction costs of 50 bp (single trip) are included in the calculations, the
realized information ratio remains quite high (0.63), and statistically significant at the
(two-tail) 5% level. The calculated Z(equality)-scores (see footnote 20) provide further strong
evidence (in the 0bp and 25bp transaction-cost environments) of a significant performance
difference between the basic model rotation strategy and the passive Value-minus-Growth one.
Remarkably, the basic model investment strategy is able to capture more than one third of the
return from the “MAX” strategy (in a 0bp and 25bp transaction-cost environment).
The positive skewness of the basic model returns adds to the bright picture, suggesting that the
risk from following this strategy is somewhat lower than the one implied by the standard
deviation of returns. The largest 3-month (-5.90%) and 12-month (-8.07%) losses (in the
zero-transaction-cost case) are substantially smaller than those incurred by the
Value-minus-Growth strategy. Only one-third of the time has the basic strategy generated wrong
signals. It is interesting to note that about half of the time it preferred the Growth portfolio,
while only (slightly less than) one-third of the time it favored the Value one. In about 18% of
the months no positions have been taken.
Footnote 19: Table 1 is reproduced in Appendix II as well.
Footnote 20: Z(equality) measures the risk-adjusted performance difference between a switching
Support Vector Regression strategy and the Value-minus-Growth strategy. The Z(equality)-score is
computed in a standard way (in line with, e.g. Stanton, 1992).
Table 1
Results of the Value-versus-Growth Support Vector Regression rotation strategy using a one-month
forecast horizon. Time frame: January 1993 – January 2003
Header: S&P Barra, 1-month forecast horizon
Column headings (cost regimes): costs 0, 25 and 50 bp | costs 0 bp | costs 25 bp | costs 50 bp | costs 50 bp
Row labels: Standard deviation; Information ratio; Minimum (monthly); Maximum (monthly);
Skewness (monthly); Excess kurtosis (monthly); Proportion of negative months; Largest 3-month
loss; Largest 12-month loss; % months in Growth; % months in Value; % months no position
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy.
CV denotes the timing strategy based on Support Vector Regression Cross Validation Mean
Squared Error. All numbers are annualized data unless stated otherwise. All strategies are
long/short monthly positions on the S&P 500 Barra Value and Growth indices.
The overall position for month t+1 is based on the signal produced by the optimal model
based on 60 months of prior historical data (factors included = 17). If for example the
produced signal for month t+1 is “Value”, then a position is taken that is long on the Value
index and short on the Growth index. Note that if the optimal model produces no signal, then
no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for
instance, if the current position is long-value / short-growth, and the signal for the following
month is “Growth”, then 2 * 0.25% (1 * 0.25% for closing the current long-value / short-growth position, plus 1 * 0.25% for establishing a long-growth / short-value position) have to
be deducted from the following month’s accrued (absolute value of the) value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
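The position and transaction-cost rules in the notes to Table 1 can be illustrated with a minimal Python sketch. It assumes that positions are coded as +1 (long Value / short Growth), -1 (long Growth / short Value) and 0 (no position), and that the cost charged equals the single-trip rate times the absolute change in position. This turnover rule is an assumption, chosen because it reproduces the 2 * 0.25% example given in the notes; the function names and the sample premium are hypothetical.

```python
def switching_cost(old_position, new_position, single_trip_cost):
    """Cost of moving from old_position to new_position, as a fraction of capital.

    Positions: +1 = long Value / short Growth, -1 = long Growth / short Value,
    0 = no position. Assumption: cost = single-trip rate * |change in position|.
    """
    return abs(new_position - old_position) * single_trip_cost


def net_monthly_return(position, value_premium, old_position, single_trip_cost):
    """Monthly return of the long/short position, net of the switching cost."""
    gross = position * value_premium  # value_premium = Value return minus Growth return
    return gross - switching_cost(old_position, position, single_trip_cost)


# Switching from long-value / short-growth to long-growth / short-value at 25 bp
cost = switching_cost(+1, -1, 0.0025)
print(f"{cost:.4%}")  # 0.5000%, i.e. 2 * 0.25%

# Hypothetical month in which Growth beats Value by 2% after a "Growth" signal
print(net_monthly_return(-1, -0.02, +1, 0.0025))
```

Under this rule, keeping the same position from one month to the next incurs no cost, and opening a position from "no position" incurs a single 0.25% charge.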
8.2.4 Three- and six-month horizon strategies
Table 2 and Table 3 in Appendix II present detailed results from the three- and six-month
forecast horizon strategies. The results are suggestive of strong consistency of the one-month
horizon results, as the real performance of the strategies is in the expected logical order: first
is the one-month horizon strategy, second is the three-month horizon strategy, and third
comes the six-month horizon strategy. For the zero-transaction-cost scenario, the lowest
annual mean return is achieved by the six-month horizon strategy and stands at
4.95%. It is associated with an information ratio of 0.68. The standard deviations of returns for
the three- and six-month horizon strategies are the lowest among all strategies: 8.43% and 7.31%
respectively. The largest 3-month and 12-month relative losses from these two strategies are -9.12% and -8.77%. An interesting feature to observe is that as the estimation horizon
increases, the number of months with Growth position steadily rises to 67% (from about
50%), and the number of months with no position steadily drops from about 18% to less than
7%, suggesting that the strategy of incorporating information from models constructed in
earlier months tends to show a steadily increasing preference for growth stocks over taking no
trading positions.
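As a rough consistency check on the six-month figures quoted above, the following sketch assumes that the information ratio of a long/short strategy is its annualized mean return divided by the annualized standard deviation of its returns. This definition is an assumption here (it is a common one, and it agrees with the quoted numbers); the function name is hypothetical.

```python
def information_ratio(annual_mean_return, annual_std):
    """Annualized mean return per unit of annualized standard deviation."""
    return annual_mean_return / annual_std


# Six-month horizon strategy, zero transaction costs: 4.95% mean, 7.31% std
ir_six_month = information_ratio(4.95, 7.31)
print(round(ir_six_month, 2))  # 0.68
```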
8.2.5 Consistency of the strategies
It is remarkable that, first, all estimation horizons show quite similar patterns
and, second, the one-month horizon strategy produces the best results. One possible
interpretation of the latter fact is that the approach of utilizing models at present time that
were created in the (more and more distant) past is bound to yield inferior outcomes since
those models become more and more irrelevant. This is evident in particular during the two
periods of February 1993 – June 1993 and January 1999 – October 1999. During these
periods the three- and six-month horizon strategies catch up only slowly with
the one-month horizon basic strategy. Not surprisingly, the average absolute value of the
value premium during these two periods considered as a whole, 2.81%, is substantially above
the average one computed over the twelve months preceding each of the two periods, 1.82%.
Thus, the one-month horizon strategy appears, as logically expected, to be the first one to
“sense” upcoming turbulent developments on the stock market. The three- and six-month
horizon strategies follow suit, but with a time lag, most probably due to their lack of taking
into account recent (from the point of view of the to-be-predicted month) relevant
information. In spite of this, during non-turbulent times both the three- and the six-month
horizon strategies seem to perform almost identically to the one-month horizon strategy. This
consistency suggests, above all, that the results produced by the basic model are reliable:
relatively “minor” extensions to the basic model (such as considering three- and six-month
forecast horizons) do not abruptly change the main outcomes. Moreover, the direction and
the extent of the influence appear logical. Illustrative of these findings is Figure 14, which
graphs the cumulative returns throughout the entire prediction period of all three
horizon strategies (for the zero-transaction-cost scenario), plus, for comparison, the
cumulative returns of the Value-minus-Growth strategy.
Figure 14. Accrued cumulative returns from the Value-minus-Growth strategy and the Support Vector Regression (SVR) one-, three-, and
six-month horizon strategies for the period January 1993 – January 2003. The one-month horizon strategy performs best, gaining most of its
accumulated profits during turbulent times on the financial market. In such periods, the three- and six-month horizon models follow suit
with a time lag, as expected. During relatively calmer periods, all strategies perform similarly.
Figure 15 shows the investment style signals associated with the basic model strategy. Note
that style signals, unlike realized excess returns, are not affected by the level of transaction
costs (footnote 21). According to the figure, the predominant investment style signals during this period
are “Growth”, with some notable exceptions however. “Value” signals have been produced
mostly in 1993, in the beginning of 1994, and in the first half of 2001. Almost no “Value”
signals have been given during the periods that stretch from June 1996 till August 1998, and
from June 1999 till November 2000.
Figure 15. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by the basic model investment strategy.
Footnote 21: This is assumed to be true in our case since we regard the estimates of the Support Vector Regression tool as
an indication of the direction of the value-growth return difference, and not the amount of that difference. If,
however, estimates are associated with the expected amount of the value premium, then the level of transaction
costs will influence the investment decision if the expected (absolute value of the) value premium is low.
8.2.6 Non-zero transaction cost scenarios
Adding transaction costs of 25 bp and 50 bp single trip into the calculations does not change
the results abruptly, as shown in Appendix II and Appendix III. For the worst-case scenario of
50 bp single trip costs, the basic model strategy still performs exceptionally well, managing a
significant (at the 5% two-tail level) information ratio of 0.63. The item that appears to have
deteriorated most as compared to the zero-cost case is the largest 12-month relative loss,
which has worsened to -15.26%. Considering the other two horizon strategies, the lowest
information ratio achieved stands at 0.27, arising from the six-month strategy in a
50 bp transaction-cost environment.
Figure 16 presents the realized excess returns forecasted by the basic model strategy in the 25
bp transaction-cost scenario. It can be seen from the figure that most of the accrued returns
come out of the last four years of the sample period, which actually appears to be the most
Figure 16. Realized excess returns forecasted by the basic investment strategy for the 25 bp transaction costs scenario.
8.2.7 Small-versus-Big Strategies
The results from the Small-versus-Big Support Vector Regression strategy and the Small-minus-Big and MAX_SB strategies mentioned in section 7.6 can be found in full in Appendix
6 and Appendix 7. In the sample period of January 1993 – January 2003 the passive Small-minus-Big rotation strategy, unlike the Value-minus-Growth strategy, achieves a negative
annual return of –1.28%. The MAX_SB strategy attains 26.76% annual return in the 50 bp
transaction cost scenario, which is 5.47 percentage points more than the corresponding result for the MAX
strategy presented in section 8.2.2. This fact reveals that the potential benefit from Small-versus-Big rotation is much greater than that from the corresponding Value-versus-Growth rotation (footnote 22). If the
Support Vector Regression tool can capture this extra potential, then, first, greater credibility
would be lent to Support Vector Regressions as a tool for constructing factor models, and
second, one could claim that there is a consistency between Small-versus-Big and Value-versus-Growth Support Vector Regression strategies. Our results show that this extra potential
can indeed be captured by the Small-versus-Big Support Vector Regression tool. For the zero-transaction-cost regime, for example, the one-, three- and six-month forecast horizon Small-versus-Big strategies produce 10.66%, 7.95% and 7.64% annual returns, while the respective
results from the Value-versus-Growth strategies are 10.19%, 5.77% and 4.95%.
Footnote 22: This will be true if the market impact from Small-versus-Big rotation is the same as that from Value-versus-Growth rotation.
8.3 Results from the Classification Reformulation of the Regression
Detailed results from the classification problem reformulation of the basic value-growth
regression problem of section 7.4 are presented in Appendix IV. This reformulation can be
used, as mentioned in section 7.5, to make yet another kind of consistency test (next to
considering three- and six-month forecast horizon strategies) for the basic model strategy. The
classification results are expected, logically, to be worse than the regression results. If these
expectations materialize in practice, then they could serve as an indication of consistency of
the basic model strategy. All classification experiments have been carried out in complete
analogy to the regression experiments of section 8.1. There are only two differences. First, the
actual positive/negative monthly value premiums have been replaced with “+1” / “-1”
values respectively in all computations, implying that in classification the months when value
stocks outperformed growth stocks are labeled “+1”, and the months when growth stocks
outperformed value stocks are labeled “-1”. Second, the ε parameter of the ε-insensitive loss
function disappears from the calculations, since it is inherent only to regression estimation.
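The relabeling step described above can be sketched in a few lines of Python. The function name and the sample premiums are hypothetical, and the treatment of a premium of exactly zero (labeled -1 here) is an assumption, since the text does not specify it.

```python
def to_class_labels(value_premiums):
    """Map monthly value premiums to class labels.

    +1: Value outperformed Growth; -1: Growth outperformed Value.
    Assumption: a premium of exactly zero is labeled -1.
    """
    return [1 if premium > 0 else -1 for premium in value_premiums]


premiums = [0.021, -0.034, 0.005, -0.001]  # hypothetical monthly value premiums
print(to_class_labels(premiums))  # [1, -1, 1, -1]
```

The labels keep only the direction of each month's premium: in this example a 0.5% month and a 2.1% month receive the same label.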
All results from the reformulated classification problem are much worse than those from the
original regression problem. Considering for example the one-month forecast horizon
strategy, in analogy to the regression problem, investors achieve a modest 2.34% mean annual
return for the January 1993 – January 2003 sample period in the zero-transaction-cost
scenario. This result is more than four times lower than the corresponding regression result. The
standard deviation of annual returns in this case stands at 10.88%, suggesting that this strategy
is more than 9.78% more volatile than the corresponding regression strategy. As a result, the
realized information ratio from this classification strategy stands quite low at 0.21. The results
for the three- and six-month horizon classification strategies are similar, but slightly worse
than those of the one-month horizon strategy. None of the horizon strategies seems to produce
a performance radically different from that of the passive Value-minus-Growth strategy (see Figure 17).
The results from classification, though worse than those from regression, could be useful in
two main ways. First of all, the relatively worse results are anything but unexpected. It seems quite
logical that when all months with positive / negative value premiums are given artificially
equal values of plus one / minus one, then some model prediction power would be lost. This
logic is substantiated in practice by the empirical tests. Second, one can compare the results
produced by the different horizon strategies, and look for (in)consistencies. These results are,
as in the regression problem, consistent with each other, as illustrated in Figure 17.
Remarkably, the performance order of the strategies applied in the classification problem is
the same as the performance order of strategies in the regression problem. Regarding
differences in performance dynamics between the classification and regression problems,
what is striking is that it is especially during the turbulent period of January 1999 – October
1999 that the Support Vector Machines for classification lose ground and produce negative
excess returns, whereas Support Vector Regression gains momentum. This pattern, though not
so pronounced, is repeated throughout 2001.
Figure 17. Accrued cumulative returns from the Value-minus-Growth strategy and Support Vector Classification (SVC) one-, three-, and six-month horizon strategies for the period of January 1993 till January 2003 under the zero-transaction-cost regime. The one-month horizon
strategy performs best, gaining most of its accumulated profits during turbulent times on the financial market, as in the regression model
formulation. In such periods, the three- and six-month horizon models follow suit with a time lag, as expected. During relatively calmer
periods, all strategies perform similarly.
Chapter 9
The purpose of this research is to employ the theoretical opportunities which Support Vector
Machines are expected to provide over common financial factor models in the practical
context of constructing Value-versus-Growth rotation strategies. The biggest theoretical
advantage of utilizing Support Vector Machines is that numerous factors can be included in
one model simultaneously, without a loss of generalization (and thus, prediction) ability. The
biggest practical outcome is that the basic model strategy, which is the one that is logically
expected to perform best, shows remarkable consistency and robustness of results and
produces exceptionally high information ratios.
A number of important theoretical and practical conclusions appear to stand out from this
master’s thesis. From a theoretical viewpoint, one may conclude that it pays to investigate
the modeling tool of Support Vector Machines while constructing financial factor
models. First of all, this tool enhances the features of state-of-the-art factor models by providing
the following opportunities: (1) to achieve robust results in the process of model building
when the candidate explanatory variables are numerous and are considered as a group; (2) to
alter manually model complexity without changing the number and nature of explanatory
variables and arrive automatically at a new optimal (in the sense of achieving best
generalization ability) model for each complexity alteration; (3) to specify numerous investor
loss functions and arrive automatically at a new optimal model for each loss function; (4) to
choose among numerous optimal models corresponding to various possible combinations of
Support Vector Machines parameters via a cross-validation model selection procedure,
ensuring in this process both that model selection is based only on (artificially created) out-of-sample performance and that the most recent data are used for model building. Second,
Support Vector Machines are able to cope with common factor model shortcomings, such as
Data Mining Bias, Look-Ahead Bias, Data Snooping Bias, and Counterfeit. This is, above all,
due to the theoretical property of Support Vector Machines to manage with ease multidimensional attribute spaces, and to the employed standard cross-validation model selection procedure.
Usually, factor models fall back on linear regression analyses in their various forms in order
to come up with a reliable lucrative investment strategy. Commentators often select their best
models in this process based on widely established overfitting-correction statistical criteria,
such as adjusted R², AIC, BIC, or financial criteria, such as hit ratio, information ratio, etc.
Alongside, a set of most appealing candidate explanatory factors is typically extracted from a
(long) list of potential explanatory factors. Support Vector Machines, and Support Vector
Regressions in particular, offer a different approach. They usually make use of all information
available as a whole, and attempt to find out the best non-linear decision surface on the space
defined by the explanatory and explained variables, which can be represented as a linear
surface in some higher-dimensional feature space. In this procedure, different weights are
assigned to each data point in the feature space, which may differ vastly from the
respective data-point weights coming out of a multiple linear regression analysis.
Additionally, Support Vector Machines offer a myriad of overfitting-correction possibilities
that do not have a direct analogy in multiple linear regression analysis, which can be applied,
quite remarkably, without changing the number and nature of explanatory factors in a given
model. These possibilities are given by the utilization of the complexity-error tradeoff
parameter, ε-insensitive parameter and kernel function parameters (if any).
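The role of the ε-insensitive parameter can be illustrated with a short sketch of the standard ε-insensitive loss, max(0, |y − f(x)| − ε): prediction errors smaller than ε are ignored, and larger errors are penalized linearly. The function name and the sample numbers are purely illustrative.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Standard epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Error of 0.005 with epsilon = 0.01: inside the tube, so no loss
print(epsilon_insensitive_loss(0.020, 0.015, epsilon=0.01))  # 0.0

# Error of 0.03 with epsilon = 0.01: loss of approximately 0.02
print(epsilon_insensitive_loss(0.020, -0.010, epsilon=0.01))
```

A larger ε widens the zone of ignored errors, which is one way of altering model complexity without changing the set of explanatory factors.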
From a practical point of view, Support Vector Machines have been shown to be able to
produce investment strategies that outperform the passive Value-minus-Growth strategy by a factor of
more than 39, net of 25 bp single trip transaction costs, and by a factor of almost 50 in a zero-transaction-cost environment, for the sample period of January 1993 – January 2003. The
information ratios for the basic model strategy are robust and extraordinarily high: 0.83 and
0.63 for the 25 bp and 50 bp transaction-cost scenarios respectively. The performance of the
basic investment strategy has been tested against (some) modifications in order to assess its
reliability in a better way. All tested model variations seem to show remarkable consistency,
where the best logically expected model performs best (the one-month forecast horizon
strategy), followed by the models expected to perform worse (the three- and six-month
forecast horizon strategies). Especially during turbulent financial times, the modified
strategies, which base their decisions heavily on models that have been constructed several
months before the actual prediction month, fail to catch up quickly with the basic model
strategy, as logically expected. Another possible test of consistency of the basic model
strategy is to reformulate the original regression problem into a classification problem. The
results from the classification reformulation are worse, as expected, which once again testifies
to the consistency of the basic model (regression) strategy.
In spite of the appealing results, there are a number of open, unresolved issues that have not
been touched upon in this thesis. For example, there is no guarantee that the pre-specified
factor set used to create the models contains most (or even, enough) of the information needed
to forecast Value-versus-Growth monthly returns. The reverse could also be true – it is
possible that some of the explanatory factors are actually unnecessary, in which case they
have to be excluded from the models. The procedure to test for this latter possibility is
computationally quite demanding and thus has not been carried out. More broadly, in order to
assess fully the applicability of Support Vector Machines in Finance, they have to be tested in
different financial areas and on different types of financial data sets. For example, it is
interesting to apply Support Vector Machines to the so-called “Small-versus-Big” rotation
strategies of predicting the monthly difference of returns between stocks with relatively
higher market capitalization and stocks with relatively lower market capitalization. The
results from strategies of this kind, which utilize Support Vector Regressions, are shown in
Appendix 6 and Appendix 7. It could also be argued that a Bayesian type of inference should be
applied to model selection (Cremers, 2002), but the analysis of topics of this kind, as well as of
technical-in-nature issues that arise from within Support Vector Machines, falls outside the
scope of this master’s thesis.
Appendix I
Factors used in all Value-versus-Growth regression and classification models.
All data are provided by ABP Investments. The factors are the same as those employed by
Bauer and Molenaar (2002).
Technical variables are:
∗ Lagged Value/Growth spread
∗ Lagged Small/Large spread
∗ VIX: the 3-month change in the VIX indicator
∗ 12 month Forward P/E (S&P 500)
∗ 3 month return momentum (S&P 500)
∗ Profit cycle: Year on Year change in earnings per share of the S&P 500
∗ PE dif
∗ DY dif
Economic variables are:
∗ Corporate Credit Spread: the yield spread of (Lehman Aggregate) Baa over Aaa
∗ Core inflation: the 12-month trailing change in the U.S. Consumer Price Index
∗ Earnings-yield gap: the difference between forward E/P ratio (S&P 500) and the 10-year T-bond yield
∗ Yield Curve Spread: the yield spread of 10-year T-bonds over 3-month T-bills
∗ Real Bond Yield: the 10-year T-bond yield adjusted for the 12-month trailing inflation rate
∗ Ind. Prod: U.S. Industrial Production Seasonally Adjusted
∗ Oil Price: the 1-month price change
∗ ISM (MoM): 1-month change of US ISM Purchasing Managers Index (Mfg Survey)
∗ Leading Indicator: the 12-month change in the Conference Board Leading Indicator
Appendix II
Tables showing the results from different Support Vector Regression Value-versus-Growth investment strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Table 1.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
one-month forecast horizon.
Table 2.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
three-month forecast horizon.
Table 3.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
six-month forecast horizon.
Table 1
Results Value-versus-Growth Support Vector Regression rotation strategy using a one-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra, 1-month forecast horizon
Rows: Standard deviation; Information ratio; Minimum (monthly); Maximum (monthly); Skewness (monthly); Excess kurtosis (monthly); prop. negative months; Largest 3-month loss; Largest 12-month loss; % months in Growth; % months in Value; % months no position
Columns: (costs 0, 25 and 50 bp); (costs 0 bp); (costs 25 bp); (costs 50 bp); (costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Value”, then a
position is taken that is long on the Value index and short on the Growth index. Note that if the optimal model
produces no signal, then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-value / short-growth, and the signal for the following month is “Growth”, then 2 * 0.25%
(1 * 0.25% for closing the current long-value / short-growth position, plus 1 * 0.25% for establishing a long-growth / short-value position) have to be deducted from the following month’s accrued (absolute value of the)
value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
Table 2
Results Value-versus-Growth Support Vector Regression rotation strategy using a
three-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra, 3-month forecast horizon
Rows: Standard deviation; Information ratio; Minimum (monthly); Maximum (monthly); Skewness (monthly); Excess kurtosis (monthly); prop. negative months; Largest 3-month loss; Largest 12-month loss; % months in Growth; % months in Value; % months no position
Columns: (costs 0, 25 and 50 bp); (costs 0 bp); (costs 25 bp); (costs 50 bp); (costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Value”, “Growth”, and “Value”, then the combined signal is “1/3 Value”. Out of this
combined signal, a position is taken that is long 1/3 of the Value index and short 1/3 of the Growth index. Note
that if the optimal models produce a combined “no signal” signal, then no trading position for month t+1 should
be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is 1/3 long-value / 1/3 short-growth, and the signal for the following month is “1/3 Growth”, then
2*(1/3)*0.25% (1*(1/3)*0.25% for closing the current 1/3 long-value / 1/3 short-growth position, plus 1*(1/3)*0.25% for
establishing a 1/3 long-growth / 1/3 short-value position) have to be deducted from the following month’s accrued
(absolute value of the) value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
Table 3
Results Value-versus-Growth Support Vector Regression rotation strategy using a six-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra, 6-month forecast horizon
Rows: Standard deviation; Information ratio; Minimum (monthly); Maximum (monthly); Skewness (monthly); Excess kurtosis (monthly); prop. negative months; Largest 3-month loss; Largest 12-month loss; % months in Growth; % months in Value; % months no position
Columns: (costs 0, 25 and 50 bp); (costs 0 bp); (costs 25 bp); (costs 50 bp); (costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Value”, “Value”, “Value”, “no signal”, “Growth”, and “Value”, then the
combined signal is “½ Value”. Out of this combined signal, a position is taken that is long ½ of the Value index
and short ½ of the Growth index. Note that if the optimal models produce a combined “no signal” signal, then no
trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-value / ½ short-growth, and the signal for the following month is “½ Growth”, then 2 *
0.125% (1* 0.125% for closing the current ½ long-value / ½ short-growth position, plus 1* 0.125% for
establishing a ½ long-growth / ½ short-value position) have to be deducted from the following month’s accrued
(absolute value of the) value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
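The unweighted averaging of signals described in the notes to Table 2 and Table 3 can be sketched as follows. The coding of signals (“Value” = 1, “Growth” = -1, “no signal” = 0) follows the convention used in the signal figures; the function and variable names are hypothetical.

```python
from fractions import Fraction

# Numeric coding of the individual model signals
SIGNAL_VALUES = {"Value": 1, "Growth": -1, "no signal": 0}


def combined_position(signals):
    """Unweighted average of the individual signals.

    A positive result is the fraction held long Value / short Growth,
    a negative result the fraction held long Growth / short Value,
    and zero means that no trading position is taken.
    """
    total = sum(SIGNAL_VALUES[signal] for signal in signals)
    return Fraction(total, len(signals))


# Three-month horizon example from the notes to Table 2
print(combined_position(["Value", "Growth", "Value"]))  # 1/3

# Six-month horizon example from the notes to Table 3
print(combined_position(["Value", "Value", "Value", "no signal", "Growth", "Value"]))  # 1/2
```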
Appendix III
Figures showing the results from different Value-versus-Growth investment strategies
and different cost scenarios.
Time frame: January 1993 – January 2003
Figure A3.1. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the one-month forecast horizon Support Vector Regression rotation
strategy under different transaction-cost regimes.
Figure A3.2. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the one-month forecast horizon Support Vector Regression rotation strategy
Figure A3.3. Realized excess returns by the one-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A3.4. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the three-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A3.5. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the three-month forecast horizon Support Vector Regression rotation strategy.
Figure A3.6. Realized excess returns by the three-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A3.7. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the six-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A3.8. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the six-month forecast horizon Support Vector Regression rotation strategy.
Figure A3.9. Realized excess returns by the six-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Appendix IV
Tables showing the results from different Support Vector Classification investment
strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Table 4.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a one-month forecast horizon.
Table 5.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a three-month forecast horizon.
Table 6.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a six-month forecast horizon.
Table 4
Results Value-versus-Growth Support Vector Classification rotation strategy using a
one-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra, 1-month forecast horizon
Rows: Standard deviation; Information ratio; Minimum (monthly); Maximum (monthly); Skewness (monthly); Excess kurtosis (monthly); prop. negative months; Largest 3-month loss; Largest 12-month loss; % months in Growth; % months in Value; % months no position
Columns: (costs 0, 25 and 50 bp); (costs 0 bp); (costs 25 bp); (costs 50 bp); (costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Value”, then a
position is taken that is long on the Value index and short on the Growth index.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-value / short-growth, and the signal for the following month is “Growth”, then 2 * 0.25%
(1 * 0.25% for closing the current long-value / short-growth position, plus 1 * 0.25% for establishing a long-growth / short-value position) have to be deducted from the following month’s accrued (absolute value of the)
value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
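The cost rule in the notes to Table 4 can be illustrated with a minimal sketch. It assumes positions are coded as +1 (long Value / short Growth), -1 (long Growth / short Value) and 0 (no position), and that the gross return is the position times that month's value premium. The function name, the coding and the sample numbers are illustrative and are not taken from the original.

```python
def net_monthly_return(prev_pos, new_pos, value_premium, cost_bp=25.0):
    """Return of a full long/short position for one month, net of costs.

    prev_pos, new_pos: +1 (long Value / short Growth), -1 (the reverse), 0 (flat).
    value_premium: that month's Value-minus-Growth return, as a decimal.
    cost_bp: single-trip transaction cost in basis points (0, 25 or 50 here).
    """
    cost = cost_bp / 10_000.0
    trips = 0
    if prev_pos != new_pos:
        trips += 1 if prev_pos != 0 else 0  # one trip to close the old position
        trips += 1 if new_pos != 0 else 0   # one trip to open the new position
    gross = new_pos * value_premium
    return gross - trips * cost


# Switch from long-value to long-growth in a month where Growth beats Value
# by 1% (made-up figure): two single trips of 25 bp are deducted.
print(net_monthly_return(+1, -1, -0.01))
```

When the position is unchanged from one month to the next, no trips are counted and the gross return is passed through untouched.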
Table 5
Results Value-versus-Growth Support Vector Classification rotation strategy using a
three-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra
3-month forecast horizon
Standard deviation
Information ratio
Minimum (monthly)
Maximum (monthly)
Skewness (monthly)
Excess kurtosis (monthly)
prop. negative months
Largest 3-month loss
Largest 12-month loss
% months in Growth
% months in Value
% months no position
(costs 0, 25 and 50 bp)
(costs 0 bp)
(costs 25 bp)
(costs 50 bp)
(costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Value”, “Growth”, and “Value”, then the combined signal is “1/3 Value”. Out of this
combined signal, a position is taken that is long 1/3 of the Value index and short 1/3 of the Growth index.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is 1/3-long-value / 1/3-short-growth, and the signal for the following month is “1/3 Growth”,
then 2*(1/3)*0.25% (1*(1/3)*0.25% for closing the current 1/3-long-value / 1/3-short-growth position, plus
1*(1/3)*0.25% for establishing a 1/3-long-growth / 1/3-short-value position) have to be deducted from the
following month’s accrued (absolute value of the) value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
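The averaging of signals and the scaled transaction cost described in the notes to Table 5 can be sketched as follows. The numeric coding of the signals and the function names are assumptions made for illustration; exact fractions are used so the results match the arithmetic in the note.

```python
from fractions import Fraction

# Assumed coding: positive means net long Value / short Growth.
SIGNAL_VALUE = {"Value": 1, "Growth": -1, "no signal": 0}


def combined_position(signals):
    """Unweighted average of the individual model signals."""
    return Fraction(sum(SIGNAL_VALUE[s] for s in signals), len(signals))


def reversal_cost_pct(prev_pos, new_pos, single_trip_pct=Fraction(1, 4)):
    """Cost (in percent) of closing prev_pos and opening an opposite new_pos."""
    return (abs(prev_pos) + abs(new_pos)) * single_trip_pct


position = combined_position(["Value", "Growth", "Value"])
print(position)  # 1/3, i.e. "1/3 Value"

# Reversing from 1/3 Value to 1/3 Growth at 25 bp single trip:
print(reversal_cost_pct(Fraction(1, 3), Fraction(-1, 3)))  # 1/6 (percent)
```

The second result, 1/6 of a percent, equals the 2*(1/3)*0.25% quoted in the note.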
Table 6
Results Value-versus-Growth Support Vector Classification rotation strategy using a
six-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra
6-month forecast horizon
Standard deviation
Information ratio
Minimum (monthly)
Maximum (monthly)
Skewness (monthly)
Excess kurtosis (monthly)
prop. negative months
Largest 3-month loss
Largest 12-month loss
% months in Growth
% months in Value
% months no position
(costs 0, 25 and 50 bp)
(costs 0 bp)
(costs 25 bp)
(costs 50 bp)
(costs 50 bp)
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Value”, “Value”, “Value”, “no signal”, “Growth”, and “Value”, then the
combined signal is “½ Value”. Out of this combined signal, a position is taken that is long ½ of the Value index
and short ½ of the Growth index. Note that if the optimal models produce a combined “no signal” signal, then no
trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is 1/3-long-value / 1/3-short-growth, and the signal for the following month is “1/3 Growth”,
then 2*(1/3)*0.25% (1*(1/3)*0.25% for closing the current 1/3-long-value / 1/3-short-growth position, plus
1*(1/3)*0.25% for establishing a 1/3-long-growth / 1/3-short-value position) have to be deducted from the
following month’s accrued (absolute value of the) value premium.
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
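The two worked examples in the notes to Table 6 can be written out explicitly. The notation is introduced here for illustration only and does not appear in the original: s_i is an individual signal (+1 for "Value", -1 for "Growth", 0 for "no signal"), h is the forecast horizon in months, p is the resulting position and c is the single-trip cost.

```latex
% Combined position as the unweighted average of h signals
p_{t+1} = \frac{1}{h} \sum_{i=1}^{h} s_i, \qquad s_i \in \{+1,\, 0,\, -1\}

% Six-month example from the note: four "Value", one "no signal", one "Growth"
p_{t+1} = \frac{4(+1) + 1(0) + 1(-1)}{6} = \frac{1}{2}

% Cost of reversing a 1/3 position at c = 0.25% single trip
\text{cost} = c \,\bigl( |p_t| + |p_{t+1}| \bigr)
            = 0.25\% \cdot \left( \tfrac{1}{3} + \tfrac{1}{3} \right)
            \approx 0.167\%
```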
Appendix V
Factors used in Small-versus-Big rotation models.
All data are provided by ABP Investments.
Technical variables are:
∗ Lagged Value/Growth spread
∗ Lagged Small/Large spread
∗ MOM S&P (HoH): 6 months return momentum S&P 500
∗ Profit cycle: Year on Year change in earnings per share of the S&P 500
∗ RF: US Treasury Constant Maturities 3 Mth - Middle Rate
∗ GSCINE (QoQ): GSCI Non Energy (Quarterly changes)
∗ DIV YLD: Difference between Dividend Yields of Barra Value and Barra Growth
∗ VOL S&P (22 DAY): Volatility of S&P 500 on daily basis
Economic variables are:
∗ Corporate Credit Spread: the yield spread of (Lehman Aggregate) Baa over Aaa
∗ Core inflation: the 12-month trailing change in the U.S. Consumer Price Index
∗ Earnings-yield gap: the difference between forward E/P ratio (S&P 500) and the 10-year T-bond yield
∗ Yield Curve Spread: the yield spread of 10-year T-bonds over 3-month T-bills
∗ Bond Yield: US Treasury Constant Maturities 10 Yr - Middle Rate
∗ Ind. Prod: U.S. Industrial Production Seasonally Adjusted
∗ Oil Price (QoQ): the 3-month change in West Texas Int. Near Month FOB $/BBL
∗ ISM (YoY): yearly change of US ISM Purchasing Managers Index (Mfg Survey),
Seasonally adjusted
∗ Leading Indicator: the 12-month change in the Conference Board Leading Indicator
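Several of the factors listed above are changes over a fixed lag (12-month, 3-month, year-on-year) or spreads between two series. A minimal sketch of such transformations follows. The original does not say whether the changes are relative or absolute, so both are offered as an assumption; the function names and the sample data are made up.

```python
def trailing_change(series, lag, relative=True):
    """Change of each observation versus the one `lag` periods earlier.

    Returns a list aligned with `series`; the first `lag` entries are None
    because no earlier observation exists for them. With relative=True the
    earlier observation must be non-zero.
    """
    out = [None] * len(series)
    for i in range(lag, len(series)):
        diff = series[i] - series[i - lag]
        out[i] = diff / series[i - lag] if relative else diff
    return out


def spread(a, b):
    """Element-wise difference a - b, e.g. a 10-year yield minus a 3-month yield."""
    return [x - y for x, y in zip(a, b)]


levels = [100.0, 101.0, 102.0, 103.0]  # made-up monthly index levels
print(trailing_change(levels, lag=3))
print(spread([5.0, 5.2], [3.0, 3.1]))  # made-up yields in percent
```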
Appendix VI
Tables showing the results from different Small-versus-Big investment strategies and
different cost scenarios.
Time frame: January 1993 – January 2003
Table 7.
Results Small-versus-Big rotation strategy using a one-month forecast horizon.
Table 8.
Results Small-versus-Big rotation strategy using a three-month forecast horizon.
Table 9.
Results Small-versus-Big rotation strategy using a six-month forecast horizon.
Table 7
Results Small-versus-Big Support Vector Regression rotation strategy using a one-month
forecast horizon. Time frame: January 1993 – January 2003
large/small cap
1-month forecast horizon
Standard deviation
Information ratio
Minimum (monthly)
Maximum (monthly)
Skewness (monthly)
Excess kurtosis (monthly)
prop. negative months
Largest 3-month loss
Largest 12-month loss
% months in Big
% months in Small
% months no position
(costs 0, 25 and 50 bp)
(costs 0 bp)
(costs 25 bp)
(costs 50 bp)
(costs 50 bp)
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight
Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross
Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are
long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices [23].
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Small”, then a
position is taken that is long on the Small cap index and short on the Large cap index. Note that if the optimal
model produces no signal, then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-small / short-big, and the signal for the following month is “Big”, then 2 * 0.25%
(1 * 0.25% for closing the current long-small / short-big position, plus 1 * 0.25% for establishing a
long-big / short-small position) have to be deducted from the following month’s accrued (absolute value of the)
“small premium” (the difference in return between S&P SmallCap 600 and S&P 500 indices).
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
[23] Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank
Russell 2000 indices were used as inputs for the Small-versus-Big calculations.
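The notes to Table 7 base each month's signal on 60 months of prior historical data. A minimal sketch of how such a walk-forward scheme can be indexed is given below. It only generates the estimation window and the target month and does not fit any model; the function name and the sample size are assumptions for illustration.

```python
def rolling_windows(n_months, window=60):
    """Yield (train_indices, target_index) pairs for a walk-forward backtest.

    For each target month t, the model may only use months t-window .. t-1,
    so no information from month t or later enters the estimation.
    """
    for t in range(window, n_months):
        yield range(t - window, t), t


# Made-up sample: 60 months of history followed by 120 traded months.
pairs = list(rolling_windows(180))
first_train, first_target = pairs[0]
print(len(pairs), first_train[0], first_train[-1], first_target)  # 120 0 59 60
```

Each estimation window ends one month before its target month, which matches the requirement that signals rely on historically available information only.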
Table 8
Results Small-versus-Big Support Vector Regression rotation strategy using a three-month
forecast horizon. Time frame: January 1993 – January 2003
large/small cap
3-month forecast horizon
Standard deviation
Information ratio
Minimum (monthly)
Maximum (monthly)
Skewness (monthly)
Excess kurtosis (monthly)
prop. negative months
Largest 3-month loss
Largest 12-month loss
% months in Big
% months in Small
% months no position
(costs 0, 25 and 50 bp)
(costs 0 bp)
(costs 25 bp)
(costs 50 bp)
(costs 50 bp)
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight
Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross
Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are
long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices [24].
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Small”, “Big”, and “Small”, then the combined signal is “1/3 Small”. Out of this
combined signal, a position is taken that is long 1/3 of the Small cap index and short 1/3 of the Large cap index.
Note that if the optimal models produce a combined “no signal” signal, then no trading position for month t+1
should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-small / ½ short-big, and the signal for the following month is “½ Big”, then 2 *
0.125% (1 * 0.125% for closing the current ½ long-small / ½ short-big position, plus 1 * 0.125% for
establishing a ½ long-big / ½ short-small position) have to be deducted from the following month’s accrued
(absolute value of the) “small premium” (the difference in return between S&P SmallCap 600 and S&P 500
indices).
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
[24] Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank
Russell 2000 indices were used as inputs for the Small-versus-Big calculations.
Table 9
Results Small-versus-Big Support Vector Regression rotation strategy using a six-month
forecast horizon. Time frame: January 1993 – January 2003
large/small cap
6-month forecast horizon
Standard deviation
Information ratio
Minimum (monthly)
Maximum (monthly)
Skewness (monthly)
Excess kurtosis (monthly)
prop. negative months
Largest 3-month loss
Largest 12-month loss
% months in Big
% months in Small
% months no position
(costs 0, 25 and 50 bp)
(costs 0 bp)
(costs 25 bp)
(costs 50 bp)
(costs 50 bp)
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight
Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross
Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are
long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices [25].
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Small”, “Small”, “Small”, “no signal”, “Big”, and “Small”, then the
combined signal is “½ Small”. Out of this combined signal, a position is taken that is long ½ of the Small cap
index and short ½ of the Large cap index. Note that if the optimal models produce a combined “no signal” signal,
then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-small / ½ short-big, and the signal for the following month is “½ Big”, then 2 *
0.125% (1 * 0.125% for closing the current ½ long-small / ½ short-big position, plus 1 * 0.125% for
establishing a ½ long-big / ½ short-small position) have to be deducted from the following month’s accrued
(absolute value of the) “small premium” (the difference in return between S&P SmallCap 600 and S&P 500
indices).
indicates significance at the (2-tail) 10% level
indicates significance at the (2-tail) 5% level
indicates significance at the (2-tail) 1% level
[25] Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank
Russell 2000 indices were used as inputs for the Small-versus-Big calculations.
Appendix VII
Figures showing the results from different Small-versus-Big Support Vector Regression
investment strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Figure A7.1. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the one-month forecast horizon Support Vector Regression rotation strategy
under different transaction-cost regimes.
Figure A7.2. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the one-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.3. Realized excess returns by the one-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.4. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the three-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A7.5. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the three-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.6. Realized excess returns by the three-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.7. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the six-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A7.8. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the six-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.9. Realized excess returns by the six-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.10. Accrued cumulative monthly returns from the one-, three-, and six-month
forecast horizon Support Vector Regression Small-versus-Big rotation
strategies under zero-transaction-cost regime.
[Figures A7.1 to A7.10 appear here in the original; only their labels survive in this text.]