Maastricht University
Faculty of Economics and Business Administration
SHORT-HORIZON VALUE-GROWTH STYLE ROTATION WITH
SUPPORT VECTOR MACHINES
Final thesis
Georgi Nalbantov
Student ID: i983140
International Economic Studies
November 2003
Supervisor I:
Dr. Rob Bauer
Associate Professor of Finance
Limburg Institute of Financial Economics (LIFE)
Faculty of Economics and Business Administration
Maastricht University
The Netherlands

Supervisor II:
Dr. Ida Sprinkhuizen-Kuyper
Assistant Professor
Department of Computer Science
Faculty of General Sciences
Maastricht University
The Netherlands
We are grateful to ABP Investments for providing the data for this research.
Table of Contents

List of Tables .......... iii
List of Figures .......... iv
Abstract .......... vii
Chapter 1    Introduction .......... 1
Chapter 2    Stock Returns Predictability and Factor Models .......... 5
    2.1    Market (In)Efficiency and Factor Models .......... 5
    2.2    Factor models: which factors matter .......... 5
    2.3    Factor models: fit versus complexity .......... 6
    2.4    Factor models: an example .......... 7
Chapter 3    Support Vector Machines: definition, advantages, and limitations .......... 8
    3.1    What are Support Vector Machines? .......... 8
    3.2    Remarks on the definition .......... 8
    3.3    Overall Advantages .......... 9
    3.4    Overall Limitations .......... 9
Chapter 4    Support Vector Machines and the Concept of Generalization .......... 11
    4.1    Rationale behind SVM: the concept of generalization .......... 11
        4.1.1    The idea behind "complexity" of a function .......... 11
        4.1.2    On the importance of generalization .......... 11
        4.1.3    How to measure the level of complexity? .......... 13
        4.1.4    What is the VC dimension? .......... 13
    4.2    The concept of generalization in a binary classification problem .......... 13
    4.3    Bounds on the test error .......... 15
    4.4    Remarks on the choice of a class of functions .......... 16
Chapter 5    Constructing Support Vector Machines for Classification Problems .......... 18
    5.1    Complexity and the width of the margin .......... 18
        5.1.1    The VC dimension of hyperplanes .......... 18
        5.1.2    Optimal hyperplanes .......... 18
    5.2    Linear SVM: the separable case .......... 20
    5.3    Linear SVM: the nonseparable case .......... 22
    5.4    Nonlinear SVM: the nonseparable case .......... 24
    5.5    Classifying unseen, test points .......... 26
    5.6    Admissible kernels .......... 27
Chapter 6    Support Vector Regression .......... 28
    6.1    The ε-insensitive loss function .......... 28
    6.2    Function estimation with SVR .......... 28
Chapter 7    Methodology .......... 31
    7.1    A factor-model approach to the basic model .......... 31
    7.2    Indices and data choice .......... 32
        7.2.1    The explained variable: the "value premium" .......... 32
        7.2.2    On the choice of explanatory factors .......... 32
        7.2.3    Factor explanatory power and Support Vector Regressions .......... 34
    7.3    Support Vector Regression as a factor-model tool .......... 34
        7.3.1    The generalization property of Support Vector Regression .......... 34
        7.3.2    The internally-controlled-complexity property of Support Vector Regression .......... 35
        7.3.3    The property of specifying numerous investor loss functions .......... 35
        7.3.4    The property of distinguishing the information-bearing input-output pairs .......... 35
        7.3.5    Cross-validation procedure for choosing among optimal models .......... 36
    7.4    The basic model .......... 37
    7.5    Model extensions .......... 38
    7.6    Small-versus-Big Rotation with Support Vector Regressions .......... 40
    7.7    Support Vector Machines vis-à-vis common factor model pitfalls .......... 40
        7.7.1    Support Vector Machines versus the Survival Bias .......... 40
        7.7.2    Support Vector Machines versus the Look-Ahead Bias .......... 41
        7.7.3    Support Vector Machines versus the Data Snooping Bias .......... 42
        7.7.4    Support Vector Machines versus the Data Mining Bias .......... 42
        7.7.5    Support Vector Machines versus the Counterfeit Critique .......... 42
Chapter 8    Experiments and Results .......... 43
    8.1    Experiments carried out with Support Vector Regression .......... 43
    8.2    Results from Support Vector Regression Estimation .......... 44
        8.2.1    Value-minus-Growth strategy .......... 44
        8.2.2    "MAX" strategy .......... 45
        8.2.3    Basic model investment strategy .......... 45
        8.2.4    Three- and six-month horizon strategies .......... 47
        8.2.5    Consistency of the strategies .......... 47
        8.2.6    Non-zero transaction cost scenarios .......... 49
        8.2.7    Small-versus-Big Strategies .......... 49
    8.3    Results from the Classification Reformulation of the Regression Problem .......... 50
Chapter 9    Conclusion .......... 52
References .......... 54
Appendix I      Factors used in all Value-versus-Growth regression and classification models .......... 58
Appendix II     Tables showing the results from different Support Vector Regression Value-versus-Growth investment strategies and different cost scenarios .......... 59
Appendix III    Figures showing the results from different Value-versus-Growth investment strategies and different cost scenarios .......... 63
Appendix IV     Tables showing the results from different Support Vector Classification investment strategies and different cost scenarios .......... 67
Appendix V      Factors used in Small-versus-Big rotation models .......... 71
Appendix VI     Tables showing the results from different Small-versus-Big investment strategies and different cost scenarios .......... 72
Appendix VII    Figures showing the results from different Small-versus-Big Support Vector Regression investment strategies and different cost scenarios .......... 76
List of Tables

Table 1    Results of the Value-versus-Growth Support Vector Regression rotation strategy using a one-month forecast horizon .......... 46
Table 2    Results of the Value-versus-Growth Support Vector Regression rotation strategy using a three-month forecast horizon .......... 61
Table 3    Results of the Value-versus-Growth Support Vector Regression rotation strategy using a six-month forecast horizon .......... 62
Table 4    Results of the Value-versus-Growth Support Vector Classification rotation strategy using a one-month forecast horizon .......... 68
Table 5    Results of the Value-versus-Growth Support Vector Classification rotation strategy using a three-month forecast horizon .......... 69
Table 6    Results of the Value-versus-Growth Support Vector Classification rotation strategy using a six-month forecast horizon .......... 70
Table 7    Results of the Small-versus-Big rotation strategy using a one-month forecast horizon .......... 73
Table 8    Results of the Small-versus-Big rotation strategy using a three-month forecast horizon .......... 74
Table 9    Results of the Small-versus-Big rotation strategy using a six-month forecast horizon .......... 75
List of Figures

Figure 1      A two-class classification problem of separating black balls from white balls .......... 11
Figure 2      Relation between complexity of the function class used for training, on the one hand, and function (class) complexity term and the minimum number of training errors realized by a concrete function belonging to the function class, on the other .......... 12
Figure 3      Possible two-class classifications of three training points in a plane .......... 13
Figure 4      Two out of infinitely many lines able to separate two ball classes without an error .......... 19
Figure 5      Presence of noise in the data that dislocates points from their truthful position by a certain amount .......... 20
Figure 6      A non-linearly-separable binary classification problem .......... 23
Figure 7      An SVM solution to the classification problem of Figure 1, presented in feature space .......... 25
Figure 8      An SVM solution to the classification problem of Figure 1, presented in input space .......... 26
Figure 9      The ε-insensitive loss function .......... 28
Figure 10     An SVR solution to the problem of estimating a relation between x1 and y .......... 29
Figure 11     Classification of factor models according to different characteristics .......... 31
Figure 12     A 5-fold cross-validation procedure .......... 37
Figure 13     Five-fold cross-validation mean squared errors associated with complexity-error tradeoff parameter C ∈ (0,32), ε-insensitive loss function parameter (ε) fixed at 1.0, and Radial Basis Function parameter fixed at 0.007 .......... 43
Figure 14     Accrued cumulative returns from the Value-minus-Growth strategy and the Support Vector Regression (SVR) one-, three-, and six-month horizon strategies for the period January 1993 – January 2003 .......... 48
Figure 15     Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the basic model investment strategy .......... 48
Figure 16     Realized excess returns forecasted by the basic investment strategy for the 25 bp transaction-cost scenario .......... 49
Figure 17     Accrued cumulative returns from the Value-minus-Growth strategy and the Support Vector Classification (SVC) one-, three-, and six-month horizon strategies for the period January 1993 till January 2003 under the zero-transaction-cost regime .......... 51
Figure A3.1   Accrued cumulative monthly returns from the Value-versus-Growth strategy and the one-month forecast horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 64
Figure A3.2   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the one-month forecast horizon Support Vector Regression rotation strategy .......... 64
Figure A3.3   Realized excess returns by the one-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 64
Figure A3.4   Accrued cumulative monthly returns from the Value-versus-Growth strategy and the three-month horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 65
Figure A3.5   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the three-month forecast horizon Support Vector Regression rotation strategy .......... 65
Figure A3.6   Realized excess returns by the three-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 65
Figure A3.7   Accrued cumulative monthly returns from the Value-versus-Growth strategy and the six-month horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 66
Figure A3.8   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the six-month forecast horizon Support Vector Regression rotation strategy .......... 66
Figure A3.9   Realized excess returns by the six-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 66
Figure A7.1   Accrued cumulative monthly returns from the Small-versus-Big strategy and the one-month forecast horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 77
Figure A7.2   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the one-month forecast horizon Support Vector Regression rotation strategy .......... 77
Figure A7.3   Realized excess returns by the one-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 77
Figure A7.4   Accrued cumulative monthly returns from the Small-versus-Big strategy and the three-month horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 78
Figure A7.5   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the three-month forecast horizon Support Vector Regression rotation strategy .......... 78
Figure A7.6   Realized excess returns by the three-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 78
Figure A7.7   Accrued cumulative monthly returns from the Small-versus-Big strategy and the six-month horizon Support Vector Regression rotation strategy under different transaction-cost regimes .......... 79
Figure A7.8   Investment signals ("value" = 1, "growth" = -1, "no signal" = 0) produced by the six-month forecast horizon Support Vector Regression rotation strategy .......... 79
Figure A7.9   Realized excess returns by the six-month forecast horizon Support Vector Regression rotation strategy under the 25 bp transaction-cost scenario .......... 79
Figure A7.10  Accrued cumulative monthly returns from the one-, three-, and six-month forecast horizon Support Vector Regression Small-versus-Big rotation strategies under the zero-transaction-cost regime .......... 80
Abstract
This final thesis has two purposes: first, to examine whether short-term value-growth style rotation on the US stock market based on a Support Vector Regression model exhibits superior performance over a benchmark once-and-for-all value-minus-growth investment strategy; and second, to provide a short but self-contained introduction to Support Vector Regressions and Support Vector Machines as a whole. We find that our style rotation model significantly outperforms the benchmark strategy, achieving a significant information ratio of 0.83 net of 25bp single-trip transaction costs in the trading period of January 1993 – January 2003. All estimates of the monthly differences in returns between value and growth stocks are based on historically available information on 17 pre-specified macroeconomic and technical factors considered all together. Model selection is not based on familiar financial selection criteria such as the hit ratio or the realized information ratio, but on a standard technique for Support Vector Machines called cross-validation. The combination of the intrinsic analytical features of Support Vector Regressions and the cross-validation technique has at least two merits: first, it ensures that model selection is based only on (artificially created) out-of-sample performance, and second, it appears to circumvent common factor-model shortcomings such as the Data Mining Bias, the Look-Ahead Bias, and others. We examine the performance of our basic model against several model extensions, and find remarkable consistency and robustness in our basic-model results. Additionally, we present the results for a small-big rotation on the US stock market, which produces slightly better information ratios than value-growth rotation.
Chapter 1
Introduction
The characterization of stock return predictability has long been a subject of great controversy
in the financial spheres (Cremers, 2002). Debating issues range from the extent of stock
market efficiency to the nature and number of factors that could contain information on future
stock returns (Haugen, 2001). At the same time, and in other quarters such as Data Mining,
the popular analytical tool Support Vector Machines (SVM) has been gaining momentum,
following a series of successful applications in fields such as optical character recognition and
DNA analysis (Smola and Schölkopf, 1998, Müller et al., 2001). The possibility to apply
Support Vector Machines in Finance and their excellent performance in time-series prediction
have already been reported by Smola and Schölkopf (1998) and Müller et al. (1997)
respectively. Furthermore, among others, Rocco and Moreno (2001) describe an approach to
detect currency crises based on SVM; Monteiro (2001) applies SVM for interest rate curve
estimation; and Van Gestel et al. (2003) propose an SVM financial credit scoring model to
assess the risk of default of a company and report significantly better results when contrasted
with state-of-the-art techniques in the field. Regarding financial time-series forecasting, SVM
have been implemented by Trafalis and Ince (2000), for instance. Pérez-Cruz et al. (2003),
further, estimate with SVM the parameters of a GARCH model for predicting the conditional
volatility of stock market returns. Despite the successful applications of SVM in various fields
(including Finance), however, to the best of our knowledge SVM have not yet been employed
in financial style rotation strategies. Therefore, the primary objective of this master’s thesis is
to evaluate the economic significance of applying Support Vector Machines in the financial
domain of stock returns predictability, and in particular, in what the financial literature calls value-versus-growth style rotation. The secondary objective of the thesis, which is of comparable significance, is to provide a brief but self-contained elementary introduction to Support Vector Machines.
There is an extensive body of financial literature that documents the (extent of) predictability
of differences of returns both among classes of stocks and between stocks and other assets.
Pesaran and Timmermann (1995), for example, show how to construct a profitable rotating
strategy between two assets, stocks and bonds, for the period 1960 – 1992 on the basis of
historical information on 9 candidate explanatory factors. Regarding classes of stocks, such as
value and growth stocks, the long-run profitability of the strategy of going long on value
stocks and short on growth stocks has been popularized by Fama and French (1993). This
profitability (or, the relatively higher return to value stocks) has been attributed either to risk compensation (Fama and French, 1993) or to market overreaction leading
to security mispricing (Lakonishok et al., 1994, La Porta, 1996). Such a long-term strategy
however, could potentially be inferior to a style strategy of rotating portfolios of value and
growth stocks (say, via switching long value / short growth and long growth / short value
trading positions on the S&P 500 Barra Value and Growth indices), depending on the level of
transaction costs. In fact, the need for a value-growth rotation is created by the enormous
potential that a value-growth rotation based on perfect forecasting ability offers over the
passive Value-minus-Growth strategy, even in a high-transaction-cost regime. As we will
show later, such a perfect-forecast rotation strategy (which takes a long position in the higher
returning asset class and a short position in the lower returning asset class) produces 21.29%
annual return during the sample period, under a 50bp [1] single-trip transaction-cost scenario [2]. The corresponding Value-minus-Growth strategy yields a mere 0.21% annual return (although it suffers from virtually no transaction costs [3]). This rotation potential has not passed by
unnoticed in practice. Kahn (1996), for instance, reports that most funds either tend to follow
a value-growth style rotation strategy, or adopt a mixed style strategy. Further, Bauer and
Molenaar (2002) note that the difference in return between value and growth stocks is not
stable over time, especially after 1990, and propose a logit model to capture the arising valuegrowth rotation potential on the US stock market.
In order to exploit the admittedly huge potential stemming from value-growth style rotation,
and in line with the primary objective of the thesis, a basic financial factor model that utilizes
Support Vector Regressions has been put forward to capture the historically realized monthly
differences in returns, referred to also as the “value premium” (Bauer and Molenaar, 2002),
between two major US stock indices – the S&P 500 Barra Value and Growth indices. Several
model extensions have also been proposed. The two indices are constructed by dividing the
market capitalization of the US S&P 500 index (approximately) equally between stocks
according to the stocks' book-to-market (BM) ratio [4]. The BM ratio is calculated by dividing the book value of common equity by the market capitalization of a firm [5]. Stocks with
relatively higher BM ratio (value stocks) are assigned to the Value index, and stocks with
relatively lower BM ratio (growth stocks) constitute the Growth index. The trading period
under consideration ranges from January 1993 till January 2003. One way to assess the
economic significance of the proposed basic factor model, which is in effect pursued in the
thesis, is to simulate a real-time investment strategy, according to which every month
investors either short sell the Growth index and buy the Value index (in other words, establish
a long value / short growth position), or vice-versa, or do not take any trading position at all.
Each of the 121 monthly forecasts is based on 60 months of most recent historically observed
values for 17 macroeconomic and technical pre-specified factors considered as a whole. In
this way it is ensured that investors base their decisions on historically available information
only. The performance of the proposed factor model has to be evaluated against a benchmark
passive market strategy, which we choose to be a “Value-minus-Growth” strategy (or, a
“permanent long-Value-and-short-Growth” strategy) that goes short on the Growth index and
long on the Value index in the beginning of the sample period and never changes that trading
position thereafter. The question of primary concern is: is it possible to construct a value-growth style rotation investment strategy, over a certain (fairly long) time period, that is more profitable than a passive "Value-minus-Growth" market strategy, using Support Vector Machines and publicly available historical information on a set of factors thought a priori to be relevant for forecasting stock returns?

[4] Source: www.standardandpoors.com. Note that the BM ratio is the reciprocal of the (market) price-to-book ratio.
[5] Source: www.barra.com.
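To make the simulated strategy concrete, the following sketch outlines the monthly forecasting loop in Python. It is an illustration only, not the implementation actually used in this thesis: the data containers (a 17-column factor table and a value-premium series), the use of scikit-learn's SVR, and the fixed parameter values are all assumptions (in the actual experiments the parameters are chosen by cross-validation, as explained in chapter 7).

    # A minimal sketch of the monthly rotation loop described above.
    # Assumed inputs: `factors`, a pandas DataFrame with 60 + 121 monthly rows
    # and 17 columns, and `premium`, a pandas Series with the realized monthly
    # value premium (Value index return minus Growth index return).
    import numpy as np
    import pandas as pd
    from sklearn.svm import SVR

    WINDOW = 60                                      # months of history per forecast
    signals, realized = [], []

    for t in range(WINDOW, WINDOW + 121):            # 121 monthly forecasts
        X_train = factors.iloc[t - WINDOW:t].values  # historically available data only
        y_train = premium.iloc[t - WINDOW:t].values
        model = SVR(kernel="rbf", C=1.0, epsilon=1.0, gamma=0.007)
        model.fit(X_train, y_train)
        forecast = model.predict(factors.iloc[[t]].values)[0]
        signal = int(np.sign(forecast))              # +1: long value / short growth;
        signals.append(signal)                       # -1: the reverse; 0: no position
        realized.append(signal * premium.iloc[t])

    returns = pd.Series(realized)
    information_ratio = returns.mean() / returns.std()

The simple sign rule shown here is a simplification; the strategies evaluated in chapter 8 also allow a "no signal" outcome, and transaction costs would be subtracted from the realized returns.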
In the process of construction of value-growth rotation strategies, researchers often choose
among factors that are believed to have some intrinsic explanatory power for the difference in
returns between value and growth stocks in general. By and large, these explanatory factors
fall into two categories – (macro)economic and technical (or, market-based). Kao and
Shumaker (1999), for example, document the explanatory power of economic factors, such as
the yield-curve spread, real bond yield and earnings-yield gap. Others report the significance
of economic factors such as rate of inflation (Levis and Liodakis, 1999), and growth in
industrial production (Sorensen and Lazzara, 1995). There is also considerable evidence for
the relevance of some technical variables in predicting the difference in returns between value
and growth portfolios. For example, Levis and Liodakis (1999) report the importance of the
value spread, and Chan et al. (1996) examine the relevance of momentum strategies. Bauer
and Molenaar (2002) try to put together previous research on the subject by constructing
value-growth rotation strategies for the period January 1990 – November 2001 on the basis of
logit models consisting of up to 5 factors from a set of 17 factors claimed to have some
economically interpretable explanatory power in the literature on the subject. In our research,
we use the same set of 17 factors.
What makes our models different from the typical models employed by researchers is that we
apply different model-building tools (that is, Support Vector Machines instead of the widely
used multiple regression analysis), consider all candidate factors as a whole for model
building, and finally, use a cross-validation procedure for model selection.
Considering all candidate explanatory factors simultaneously could seem counterintuitive in
the eyes of most researchers and practitioners. Typically, they use multivariate models
consisting of several candidate factors chosen among a (long) list of factors. Since most of the
studies (with the notable exception of the study of Kao and Shumaker (1999), who utilize a
decision-tree model with a five-fold cross-validation technique) use multiple regression
analysis (in its various forms) for model building, they are bound to take heed of number-of-variables restriction criteria such as adjusted R². One of the advantages of Support Vector
Machines in this respect is that such a restriction is not necessary. They are expected to behave
robustly even in high-dimensional feature problems (Maragoudakis et al., 2002). This is quite
important since utilizing a tool that is able to produce models with good generalization (and
consequently, prediction) ability in a multivariate context can potentially capture variable
interactions that could not possibly be accounted for by models that contain only a few
variables. The advantages and disadvantages of Support Vector Machines will be discussed at
length throughout the thesis, both from a theoretical point of view and in the context of the
specific task of capturing the value premium on the US stock market.
In the process of model selection, models are chosen only on the basis of performance over
(artificially created) out-of-sample data, in order to avoid the critique of judging the merit of a
model on the basis of in-sample performance. This type of model selection is based on a
cross-validation procedure commonly used in Data Mining [6].

[6] The term "Data Mining" in the financial literature bears a negative connotation and should be distinguished from the Data Mining discipline, where the same negative idea is expressed by the term "data dredging".
Our main results show that the value-growth style rotation strategy based on Support Vector
Machines (and more specifically, Support Vector Regressions) considerably outperforms the
passive long-term Value-minus-Growth strategy, even after various levels of transaction costs
have been accounted for. In the sample period under consideration, against the 0.21% annual
return from the Value-minus-Growth strategy, the best attainable Support Vector Machines
value-growth rotation strategy is able to surpass this benchmark result more than 39 times,
achieving 8.21% annual return net of 25bp single-trip transaction costs. Moreover, the
benchmark strategy is 10.55% more volatile in this case. We also set our results against a so-called "MAX" strategy, which captures the maximum return that can be achieved on the basis
of style rotation, net of 50bp transaction costs. According to the “MAX” strategy, each month
throughout the sample period a position is taken that goes long on the better-performing
security class (here, one of the two S&P 500 Barra Value and Growth indices) and short on
the worse-performing security class. This strategy produces 21.29% annual return for the
whole sample period. Several model extensions are put forward, such as considering three- and six-month forecast horizons, which testify to the consistency of the basic one-month
forecast horizon results.
In order to cover both of its objectives effectively, the thesis has been structured as
follows. At the outset (chapter 2), the financial side of the problem setting is addressed. In
particular, financial factor models and their role in stock market predictability are briefly discussed. The following four chapters (chapters 3 through 6) deal extensively with Support
Vector Machines as an analytical tool, since they will be used to tackle the task of value
premium predictability. First, chapter 3 acquaints the reader with the fundamental nature of
Support Vector Machines, and their known advantages and limitations. Second, chapter 4
gives an extensive account of the rationale behind Support Vector Machines. Third, chapter 5 presents the construction of Support Vector Machines for binary classification problems. Fourth, as a logical continuation of the preceding three chapters, the Support Vector Regression tool employed by the basic investment model is considered in chapter 6.
Whenever possible, examples related to the factor models employed in the thesis are used along with the analysis of Support Vector Machines.
Chapter 7, “Methodology”, bridges the gap between the theoretical concept of Support Vector
Machines and the practical problem of value premium predictability. This chapter explains
how and why Support Vector Machines can be applied in a specific factor model to address
the question of this predictability. This has been done in several steps. First, the to-beproposed basic investment model, which utilizes Support Vector Regressions, is put in the
context of factor models. Second, the choice of explanatory macroeconomic and technical
variables and the nature of the explained variable (the difference in return between S&P 500
Barra Value and Growth indices: the value premium) are discussed. Third, the most attractive
merits of Support Vector Regressions that should justify their employment as a factor model
tool come into focus. The necessary background having been laid down, the basic
investment model and several model extensions are put forward. The chapter closes with a
discussion of why Support Vector Machines and the proposed models are elegantly able to
withstand common drawbacks of factor models highlighted in the financial literature,
such as Survival Bias, Look-Ahead Bias, Data Snooping Bias, Data Mining Bias, and
Counterfeit [7].
Chapter 8 brings together all of the previous parts of the thesis. It shows how the actual
experiments for the practical realization of the models suggested in the “Methodology”
chapter have been carried out and summarizes the main findings and assesses their
significance. Chapter 9 concludes.
It is important to stress that all chapters of the master’s thesis are constructed in such a way as
to touch upon only those topics and issues that are relevant to its two objectives. Thus, it
falls outside the scope of the thesis to provide an extensive summary of how stock returns
predictability and Support Vector Machines are reflected in their respective domains of
Finance and Data Mining.
[7] We use the terminology of Haugen (1999).
Chapter 2
Stock Returns Predictability and Factor Models
2.1 Market (In)Efficiency and Factor Models
The idea that the stock market is fairly inefficient has been gaining momentum in the
academic and financial spheres. The term “market efficiency” is generally used to denote
"the extent to which the market prices securities so as to reflect available (historical) information pertaining to their valuation" (Haugen, 2001). Therefore, a high degree of market inefficiency
suggests that plenty of securities are mispriced. In this case, investing in a market index
would be a suboptimal investment strategy. In other words, the market index is not guaranteed
to be among the set of portfolios (called the “efficient set”) that offers the maximum expected
return for a given level of risk. Consequently, market inefficiency opens the possibility to
“beat” the market (index).
There is a growing body of financial literature that addresses the question of how to profit
from market inefficiency by putting forward models that reportedly are able to successfully estimate expected returns and return volatilities of stocks on the basis of (a multitude of)
publicly available factors that are believed to affect (classes of) securities in different ways.
These factors are usually classified into macroeconomic (rate of inflation, rate of growth of
industrial production, etc.) and financial (book-to-market ratio, debt-to-equity ratio, etc.)
characteristics that could contain information on expected future movements of securities.
The multi-factor models could consequently be used in an inefficient market environment to
help investors move closer to the efficient set, and away from the market index. However,
building satisfactory factor models is not an easy task. Another branch of Finance, Behavioral
Finance (see e.g. Thaler, 1993), investigates the effects of investor psychology on stock prices
with a view to exploiting market inefficiencies. Studies in Behavioral Finance will not be
covered in this thesis, however.
2.2 Factor models: which factors matter
Despite the growing evidence that multi-factor models are quite powerful in explaining and
predicting stock returns, there is no full consensus on precisely which factors are
important and why (Cremers, 2002). The majority of the studies have looked at the US stock
market, and confirm that stock returns can be predicted to some degree by means of interest
rates, dividend yields and a variety of macroeconomic variables exhibiting clear business
cycle variations (Pesaran, 2003). For example, in the late 1970s, Basu (1977) showed that
stocks with low price-to-earnings ratios performed significantly better (during the period
between April 1957 and March 1971) than stocks with high price-to-earnings ratios.
Subsequent findings have been reported by Keim and Stambaugh (1985) and Rosenberg et al.
(1985), who stress the relevance of price-to-dividends and (market) price-to-book ratios
respectively. Additionally, Chan et al. (1991) document the significant impact of the price-to-cash-flow ratio on expected returns of stocks (on the Japanese stock market). All of these
ratios are referred to as “value characteristics”, so that stocks with lower such ratios are
labeled "value", and stocks with higher ratios – "growth". Firm size also appears to play a role
in stock market predictability. Banz (1981) discovered that the stocks of firms with low
market capitalizations (small cap stocks) have higher average returns than stocks with high
market capitalization (large cap stocks). Furthermore, Bhandari (1988) reports that firms with
high leverage (high debt-to-equity ratios) have higher average returns than firms with low
leverage for the period from 1948 to 1979. Two admittedly quite influential papers that pulled
together much of the earlier empirical work were published by Fama and French (Fama and
French, 1992, and Fama and French, 1993), who proposed a three-factor model and argued
that the book-to-market and size variables bear the strongest relation to stock market returns. The
number and nature of candidate explanatory factors in factor models vary across studies,
however, and the three-factor model of Fama and French has been extended by other
researchers to include as many as fifty factors simultaneously, as advocated by Haugen
(1999), for example.
Fama and French (1993) provide evidence for a risk-based explanation of the long-term
return difference between value stocks and growth stocks. According to them, the book-to-market
ratio is a proxy for an unobservable common risk factor, so that the fact that value stocks
(perceived to be more risky) have higher average returns over the long run is consistent with
rational asset pricing. Jensen et al. (1997) extend this view by claiming that value companies
are quite sensitive to common macroeconomic conditions, such as interest rate risk and the
business cycle. Alternative views exist however, revealing themselves as differences in
attitude towards market efficiency. While Fama and French (1992) regard markets as
efficient, Haugen (1999) takes up an opposing position by arguing that the differences
between expected and real returns come as a surprise to investors. In the same line of thought,
Lakonishok et al. (1994) argue that value stocks had historically higher returns than growth
stocks because markets were inefficient (that is, investors were systematically wrong in their
expectations about future stock returns).
2.3 Factor models: fit versus complexity
A disquieting problem with most factor models is that the predictive power of
a model deteriorates in practice with the inclusion of more and more explanatory variables
(factors). The reason is that model complexity increases with the number of factors. At some
point, including new information in the model in the form of one more explanatory
factor will actually decrease the predictive power of the model (although this will increase the
“fit” of the model on the data that was used to generate it) as the resulting increase in model
complexity will outweigh the benefit of the new information embedded in the factor. This
phenomenon is known as overfitting. Typically, commentators implicitly or explicitly try to
correct for overfitting by estimating multiple regressions that include all possible
combinations of a pre-selected factor set and choose among the resulting models. Thus, if
there are k candidate explanatory factors in a given factor set, then there will be 2^k possible models (multiple regressions). Subsequently, all of these models are ranked according to some performance criteria, such as statistical criteria (adjusted R², Akaike's Information Criterion (AIC), Schwarz's Bayesian Information Criterion (BIC), etc.) and financial criteria (hit ratio [8], recursive Sharpe ratio, etc.) [9] (Pesaran and Timmermann, 1995). Another two financial criteria, the mean return of a strategy and the information ratio [10], have been employed, for example, by Bauer and Molenaar (2002). The financial criteria are generally used to correct for the fact that statistical criteria are not necessarily in accordance with the investor's loss function (Pesaran and Timmermann, 1995). Other authors opt for a Bayesian approach to model selection (e.g. Cremers, 2001, Avramov, 2002). What is common to nearly all factor models, however, is that the model selection procedure explicitly or implicitly searches for a balance between model complexity (proxied, for instance, by the number of factors included in a given model) on the one hand, and model "fit" on the model-generating (training) data, on the other. It is known that including more explanatory variables in a model will increase its "fit" on the training data, as suggested by the R² statistic. Because of the problem of overfitting, however, a model with a greater R² can very well be worse (in predictive power) than a more parsimonious model with a lower R². As mentioned above, some widely used criteria, such as AIC and BIC, try to cope with this situation: they penalize the inclusion of a new factor to a certain extent, and tolerate it only if it brings "enough" additional explanatory power. The problem is that it is uncertain which criteria are optimal.

[8] The hit ratio is the percentage of times (e.g., months) that a correct prediction has been made.
[9] There are also other types of performance criteria, which will not be covered here.
[10] The information ratio is defined as the mean of a random variable (such as the stock market return realised by a given model) divided by its standard deviation. This ratio is the same as the Sharpe ratio for a long-short strategy.
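For contrast, the conventional subset-selection procedure discussed above can be written down in a few lines of Python. The code is illustrative (the data names and the use of statsmodels are assumptions); note that with k = 17 candidate factors it would already require fitting 2^17 = 131,072 regressions, which is why parsimony restrictions and penalized criteria such as AIC and BIC become unavoidable.

    # A sketch of exhaustive factor-subset regression with AIC/BIC ranking.
    # Assumed inputs: X, an (n x k) numpy array of factor observations, and
    # y, a length-n array holding the variable to be explained.
    from itertools import combinations
    import statsmodels.api as sm

    def rank_models(X, y, max_factors=None):
        k = X.shape[1]
        ranked = []
        for size in range(1, (max_factors or k) + 1):
            for subset in combinations(range(k), size):
                fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
                ranked.append((fit.aic, fit.bic, fit.rsquared_adj, subset))
        return sorted(ranked)   # lowest AIC first; ties broken by BIC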
2.4 Factor models: an example
Expected-return multi-factor models are typically constructed on a monthly basis, where the
employed optimal model (or a combination of models) is reconsidered every to-be-predicted
month, allowing for changes in the number and nature of included factors. This strategy is, for
example, employed by Bauer and Molenaar (2002), who propose a model to capture the
historical value premium on the US stock market arising from the difference between two
indices, the S&P 500 Barra Value and Growth indices. Every month, a series of parsimonious
multiple regression models are constructed on the basis of 60 months of historical values for 17 candidate macroeconomic and financial explanatory factors. Subsequently, all models are ranked during a 24-month model selection period according to financial criteria such as the mean return of the employed strategy, the hit ratio, and the information ratio. Alongside,
different transaction-cost scenarios are considered.
The task of the basic model proposed in this thesis is to capture the (absolute value of the)
value premium on the US stock market (that is, the difference between S&P 500 Barra Value
and Growth indices) for a period of 121 months: January 1993 – January 2003. The basic
model utilizes the whole set of 17 factors (listed in Appendix I) used by Bauer and Molenaar
(2002), but instead of multiple regressions, it utilizes Support Vector Regressions.
In order to construct our basic value-growth rotation model, which is based on Support Vector
Regressions, it is indispensable, first, to define Support Vector Machines and describe the
theoretical rationale behind them. This will be done, respectively, in chapter 3 and chapter 4.
After that, in chapter 5, the primary technical building blocks of Support Vector Machines
will be introduced in the context of Support Vector Machines for Classification. Support
Vector Regressions, which build on these technical concepts, will be discussed in chapter
6 on a theoretical level, and then in chapter 7 on a more practical, financial level, where the
basic model will be put forward as well. Afterwards, in chapter 8, the results from the basic
Support Vector Regression model along with several model extensions will be analyzed and
compared to a passive Value-minus-Growth strategy and a “MAX” strategy.
Chapter 3
Support Vector Machines:
Definition, Advantages, and Limitations
3.1 What are Support Vector Machines?
Support Vector Machines (SVM) find their roots in Statistical Learning Theory, pioneered by
Vapnik and co-workers (Smola and Schölkopf, 1998). In essence, SVM are just functions,
named generally “learning machines”, whose basic task is to “explore” data (input-output
pairs) and provide optimally accurate predictions on unseen data. SVM could be defined11 as
follows:
Support Vector Machines are a classification / regression tool used for optimally predicting
the class membership / real value of unseen outputs that are generated or characterized by one
or more inputs, by means of looking at some available training input-output pairs and then
building a model based on the observed input-output relations.
3.2 Remarks on the definition
There are several terms in the above definition that demand clarification.
a. By "optimally predicting" it is meant that the best tradeoff is struck between the function's complexity and its accuracy (the number of training errors it makes).

b. The outputs and inputs can be considered as dependent and independent (or explained and explanatory) variables, respectively. Sometimes outputs are called "outcomes" or "target values", and inputs "features" or "attributes".

c. If the outcomes take discrete values (called "classes"), we have a classification problem, while if they take continuous (real) values, we have a regression estimation problem.

d. The "class membership" is the label assigned to a given output in a classification problem, so that outputs having the same label belong to the same class. In a two-class classification problem, the labels could just be "positive" and "negative".

e. "Looking at some training input-output pairs" refers to the first stage in the process of learning from data: reading all training outputs together with their respective inputs. In simple words: just reading the available data.

f. "Building a certain model" is the second stage in the process of learning from data, which implements the basic idea behind SVM: striking the best balance between minimizing the number of errors made on the training data set and maximizing the (Euclidean distance of the) "margin" between the (two) different classes in the higher-dimensional feature space implicitly defined by a certain kernel function. In the case of Support Vector Regression (SVR), one employs the concept of the "ε-insensitive region" instead of the "margin".

[11] This is an informal definition.
All of the above concepts (e.g. “margin”, “ε-insensitive region”, etc.) will be thoroughly
examined in chapters 4 through 6. The informal definition of Support Vector
Machines and the remarks on the definition are presented here as a compendious first
overview of the idea behind these learning machines.
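As a minimal illustration of the two learning stages just described, the sketch below uses scikit-learn's SVC as a stand-in for the machinery developed in chapters 4 through 6; the toy data are assumptions for illustration only.

    from sklearn.svm import SVC

    X_train = [[0.1, 0.9], [0.3, 0.8], [0.8, 0.2], [0.9, 0.1]]  # inputs ("features")
    y_train = [1, 1, -1, -1]                                    # outputs ("classes")

    machine = SVC(kernel="rbf")    # the kernel implicitly defines the feature space
    machine.fit(X_train, y_train)  # stages one and two: read the data, build the model
    print(machine.predict([[0.2, 0.7]]))  # classify an unseen input (here near class +1)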
3.3 Overall Advantages
One can argue that the combination of three key properties of Support Vector Machines has
made them a favored analytical tool among other learning algorithms. These properties are
summarized below.
First, contrary to other (machine) learning techniques, SVM behave robustly even in high-dimensional feature problems (Maragoudakis et al., 2002), that is, where the explanatory
variables are numerous, and in noisy, complex domains (Burbidge and Buxton, 2001).
Second, unlike neural networks, SVM cannot get stuck in a local minimum while learning,
since SVM solve a quadratic optimization problem that is bound to arrive at a global solution
(Vapnik, 1995, Smola, 1996).
Third, Support Vector Machines achieve remarkable generalization ability by striking a
balance between a certain level of a function's accuracy on a given training set and its complexity. Note that in real-world applications the presence of noise (in regression) and class overlap (in classification) necessitates the search for such a balance (Vapnik, 1995, Woodford, 2001).
3.4 Overall Limitations
There are three major known shortcomings of SVM, which can be summarized as follows.
First, SVM make explicit classifications (point predictions in the case of regression) of new
outputs for new inputs, rather than predicting the posterior probability of class membership
(or, in the case of regression, the probability that a point estimate takes a particular value,
given the new inputs) (Bishop and Tipping, 2000).
Second, there is a requirement to estimate a tradeoff parameter that determines the level of
penalty associated with the training errors SVM make. In the case of SVR, we also have to estimate an additional insensitivity parameter "ε". This generally entails a "cross-validation" procedure (sketched below), which is computationally inefficient (Tipping, 2000), but which will nevertheless be used in our models. The cross-validation procedure, however, can also be considered an advantage rather than a limitation in the context of constructing financial factor models, as will be shown in chapter 7.
Third, there is a need to utilize “Mercer” kernel functions while constructing SVM (to be
introduced in chapter 5), which somewhat restricts our ability to use any function for
prediction.
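The cross-validation procedure referred to in the second limitation can be sketched as a grid search over the two parameters. The parameter grids and data names below are illustrative assumptions; the actual procedure used in this thesis is described in chapter 7.

    # Grid-search the tradeoff parameter C and the insensitivity parameter
    # epsilon of an SVR by 5-fold cross-validation (a sketch).
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    param_grid = {
        "C": [0.5, 1, 2, 4, 8, 16, 32],   # penalty on training errors
        "epsilon": [0.1, 0.5, 1.0],       # width of the insensitive region
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)          # X_train, y_train assumed given
    print(search.best_params_)            # the parameter pair with lowest CV error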
All of the above-mentioned advantages and limitations are quite general in nature. Concrete
advantages and shortcomings of Support Vector Machines (and, in particular, Support Vector
Regression) in relation to financial factor models will be discussed in depth in chapter 7.
Chapter 4
Support Vector Machines and the Concept of Generalization
4.1 Rationale behind SVM: the concept of generalization
4.1.1 The idea behind "complexity" of a function
The problem of overfitting (or, fitting-too-well), referred to in chapter 2 of the thesis, has long
been apparent: functions that perform extremely well on a given training set usually make
unsatisfactory predictions on unseen, test data. Typically, those functions are quite
“complex”, in the sense that by construction they “fit” the training data almost perfectly (see
e.g. Smola, 1996, and Müller et al., 2001). In other words, they make virtually no training
errors. To illustrate the idea of complexity of a function, consider the white and black balls in
Figure 1.
[Figure 1. A two-class classification problem of separating black balls from white balls. Not a single line in (a) can separate the two classes of balls without an error. A polynomial of degree two, however, is able to separate the same configuration of balls, as seen in (b).]
In this two-class classification problem, we are free to choose among any kind of functions to
separate the white and black balls from each other. While some functions will be able to
separate the classes without an error, it is evident that there is no way in which a line can
correctly separate the two classes. In contrast to a line, a parabola or a polynomial of higher
degree can (as illustrated in Figure 1 (a) and (b)). In essence, one can claim that the function
class “polynomials of degree two” (that is, parabolas) represents more “complex” functions
than the (class of) linear functions. But why is it that overly complex functions tend to make more prediction errors? And, even more generally, what kind of functions shall
we use for prediction – “complex” or “simple”?
4 .1. 2 On the importance of generalization
Consider the following pattern recognition (here, “tree recognition”) problem. Two children, a
boy and his little sister, are given a number of pictures with objects. They are told which of
the objects are trees and which are not. They are also given some criteria according to which
to classify an object as either a “tree” or “not tree”. For example, the criteria could include
“number of branches”, “number of leaves”, “colour”, etc. After studying the pictures, they are
presented with a picture of an object. Their task is to name the object, which in this case is a
tree. The boy, having studied in detail all possible kinds of trees, concludes that the object is
not a tree, since he has never seen a tree with exactly so many branches before. His little sister,
on the other hand, has been lazy and has not studied too much. However, she concludes
(correctly) that the object is a tree, based on the fact that it is … green. Clearly, neither of the
two is a good predictor of unseen trees. In order for the boy – who represents the “overfitting”
function here – to improve, he could for example admit that there is a big chance that other
trees exist with a different number of branches. Although this relaxation of his assumptions
will lead to a correct classification of many unseen trees that would otherwise be
misclassified, he might feel a bit uneasy, because some non-trees could sneak in as trees
according to the new classification rule. The question he faces is: what is the best tradeoff
between relaxing some assumptions (in other words – using a class of less complex functions)
and increasing the risk of making more mistakes as a result of these relaxations?
It appears that a class of functions with optimal generalization ability is desired: such
functions must not be too “complex”, on the one hand, but at the same time they must not
make too many errors on the training data set, on the other. Hence, the criteria for the function
we choose are its complexity (also referred to as capacity) and the number of training errors it
makes (accuracy). Consider for clarity Figure 2.

In Figure 2, the best tradeoff between the complexity of a function class and the minimum
number of training errors it makes is struck at the minimum of the sum of the complexity term
and the amount of training errors. At this point we find the function with the best
generalization ability. Functions that are ordered to the right of it are considered too complex;
in other words – there is overfitting. Analogously, all functions that appear to the left of the
best tradeoff point are not complex enough, that is – there is underfitting. The idea that we
should try to find this particular, optimal best tradeoff point is called the Structural Risk
Minimization principle (Vapnik, 1995, Burges, 1998). There is one crucial detail remaining,
however: how to measure complexity so that we can order functions according to their
increasing complexity?
Figure 2. Relation between complexity of the function class used for training, on the one hand, and function (class) complexity term and the
minimum number of training errors realized by a concrete function belonging to the function class, on the other. The function class with best
generalization ability, found at the “best tradeoff” point, is associated with the minimum sum of the complexity term and number of training
errors. Functions to the right of this point will overfit the training data, while those to the left of it – underfit the training data.
4 .1. 3 How to measure the level of complexity?
It seems that all we need to find the desired optimal point is a measure of complexity
according to which all classes of functions (polynomial, trigonometric, etc.) can be ordered.
One such measure of complexity of a class of functions, proposed in the SVM literature, is the
VC dimension (Vapnik, 1995). The horizontal axis in Figure 2 represents all classes of
functions, ordered according to their complexity, which increases monotonically with their
VC dimension.
4 .1. 4 What is the VC dimension?
The VC (Vapnik – Chervonenkis) dimension of a class of functions is defined as the largest
number h of points that can be separated, in all possible ways the class labels may be assigned
to them, by functions of the given class (Burges, 1998). It follows that relatively more complex
(classes of) functions have a higher VC dimension, since they are able to separate relatively
more points without an error (in all possible classification ways). Let us determine, for
example, the VC dimension of the class of linear functions – what is the maximum number of
points that linear functions can separate in all possible ways without an error?
Figure 3. Possible two-class classifications of three training points in a plane. Burges (1998) shows that there are exactly eight possible
classification ways in which three balls (points) from two classes (b = black, w = white) may appear. Clearly, a line is able to separate the
two classes in all possible eight ways, provided that the balls are not lined up with each other.
It is clear from Figure 3 that lines can separate three black and white balls no matter what
colour we choose the balls to be, provided that the balls do not lie on one and the same line.
Since a line cannot separate four balls in all different classification ways (see Figure 1 for an
example of this impossibility), we conclude that the VC dimension of the class of linear
functions (in a plane) is three.
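The shattering argument can also be checked numerically. The sketch below (with scikit-learn’s linear support vector classifier standing in for “any line”, and hypothetical coordinates) verifies that every two-class labeling of three non-collinear points is linearly separable, while the four-point configuration of Figure 1(a) is not:

```python
# Numerical check of the VC-dimension argument for lines in the plane.
import itertools
import numpy as np
from sklearn.svm import SVC

pts3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # not collinear
for labels in itertools.product([-1, 1], repeat=3):
    if len(set(labels)) < 2:
        continue                  # one-class labelings are trivially separable
    y3 = np.array(labels)
    clf = SVC(kernel="linear", C=1e6).fit(pts3, y3)
    assert clf.score(pts3, y3) == 1.0    # a line always succeeds on 3 points

# The labeling of four points as in Figure 1(a) is NOT linearly separable:
pts4 = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]], dtype=float)
y4 = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(pts4, y4)
print(clf.score(pts4, y4))        # below 1.0: no line shatters four points
```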
4. 2 The concept of generalization in a binary classification problem
Now we can place the problem of finding an optimal point between complexity and accuracy
in a more formal setting. Suppose we are given a (row) vector x of n explanatory variables,
x = (x_1, x_2, x_3,…, x_n), and a finite number l of (training) outcomes associated with
observed values of the explanatory variables. The l outcomes can, for the time being, be
labeled just “plus one” and “minus one”. It is assumed that there is a fixed, but unknown
relation between the explained and explanatory variables. To sum up, we have training data in
the form of l input-output pairs:

(x_11, x_12, x_13,…, x_1n), y_1
(x_21, x_22, x_23,…, x_2n), y_2
…
(x_l1, x_l2, x_l3,…, x_ln), y_l

or, equivalently:

x_1, y_1
x_2, y_2
…
x_l, y_l

where x_in is the (any real) value that the explanatory variable x_n takes in the ith training
input-output pair, each of y_1, y_2,…, y_l is either plus one or minus one, and x_i is a vector
containing the values of all n explanatory variables in the ith training pair, that is
x_i = (x_i1, x_i2, x_i3,…, x_in). We can present the training data alternatively as:

(x_1, y_1), (x_2, y_2),…, (x_l, y_l) ∈ ℜ^n × {±1}.
Returning to the tree recognition example, the explanatory variables could be “number of
branches”, “number of leaves”, “colour”, etc., and the training outcomes – “tree” and “not
tree”. Our task is to find a function with the best generalization ability to classify some k
unseen outcomes, given new values for the n inputs. In other words, find the best function
f: ℜ^n → {±1} that produces the training outcomes12 ŷ_i ∈ {±1}, i = 1, 2,…, l, from the n
inputs,

f(x_i1, x_i2, x_i3,…, x_in) = ŷ_i, for i = 1, 2,…, l,

and which will be used to classify k new outcomes,

f(x_j1, x_j2, x_j3,…, x_jn) = ŷ_j, for j = l+1, l+2,…, l+k,

given that there exists a fixed, but unknown relation between the n independent variables and
the two classes in the form of φ: ℜ^n → {±1}, according to which all l+k input-output pairs
are generated. Even if it is impossible to find the best function, we can at least try to find the
one with the most adequate generalization ability from some pre-chosen classes of functions.
Note that best does not imply making no training errors.
Obviously, one can always find (many) complex functions that are able to separate the
positive from the negative training outcomes without an error. In this case, the predicted
classes of the training outputs, ŷ_i ∈ {±1}, i = 1, 2,…, l, and the actual classes of the outputs,
y_i ∈ {±1}, i = 1, 2,…, l, will be the same. Such a function could be, for example, the
function f*, for which

f*(x_i1, x_i2, x_i3,…, x_in) = ŷ_i = y_i, for i = 1, 2,…, l.

It is common to say that the empirical error (or, empirical risk) of such functions is zero.
Moreover, one can always find a different (complex) function f** for which it holds that

f**(x_i1, x_i2, x_i3,…, x_in) = ŷ_i = y_i, for i = 1, 2,…, l.

12 Here the “hat”, as in ŷ_i, indicates the estimated class membership from the function f(x_i1, x_i2, x_i3,…, x_in), while we denote the true class of the ith outcome as y_i.
Now, if we are given k additional, test pairs (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}),…,
(x_{l+k}, y_{l+k}), it may well happen that the first function, f*, makes no errors in predicting
the test output values (that is: f*(x_{l+1}) = ŷ_{l+1} = y_{l+1}, f*(x_{l+2}) = ŷ_{l+2} = y_{l+2},
…, f*(x_{l+k}) = ŷ_{l+k} = y_{l+k}), while the second function, f**, predicts all test y values
wrongly. To complicate things further, observe that since both functions are complex to begin
with (they both have zero empirical risk), there may exist a simpler function f*** with better
generalization ability – one that allows for some training errors (say, f***(x_2) = ŷ_2 ≠ y_2),
but has a much lower value of the complexity term, and thus finds itself closer to the optimal
point in Figure 2. The question is: how can we actually determine the relative position of these
functions along the horizontal axis in Figure 2?
4. 3 Bounds on the test error
Notice that the VC dimension of the complex functions in the above binary classification
problem is expected to be quite high in the general case (since they have to make no errors on
l arbitrarily given training outcomes), and it could possibly be infinite (if these functions can
separate without an error any number of positive and negative outcomes). The simpler
function, on the other hand, inevitably makes at least one training error, implying that its VC
dimension should be relatively (much) smaller. It is tempting to conclude at this stage that all
that is left to do is to pick functions with different VC dimensions, check how many training
errors they make, sum the two terms (VC dimension plus number of training errors), and
choose the function that produces the lowest sum. This strategy would work only if the
“complexity term” in Figure 2 and the level of complexity (the VC dimension) were one and
the same thing, which is generally not true. Vapnik (1995) and Burges (1998) have shown that
if the training and testing input-output pairs are generated independently and identically
distributed according to some unknown, but fixed, distribution P[(x_1, x_2, x_3,…, x_n), y],
and if the VC dimension h is less than the number of training examples l, then with probability
at least 1 − η the following bound on the test error holds:
R(α) ≤ R_emp(α) + √[ ( h(log(2l/h) + 1) − log(η/4) ) / l ]
The test error R(α) is also referred to as the risk of test error, the regularized risk, or simply –
the risk. Here α indexes a given class of functions, and R_emp(α) is the amount of training
errors (the empirical risk) the best function in the class α makes on the training set. The
complexity term (known also as the “confidence term”) in this case is equal to the second
term on the right-hand side of the bound. The joint distribution of the explanatory and
explained variables, P[(x_1, x_2, x_3,…, x_n), y], is interpreted as a function that associates a
certain probability with a value (in our case, ±1) of y_i occurring together with the observed
values x_i1, x_i2, x_i3,…, x_in, for i = 1, 2,…, l. Later, while exploiting SVM in the context of
constructing value-growth style rotation strategies, we will apply the concept of balancing the
empirical error and a certain complexity term (arising from the SVM formulation of
classification and regression problems) in a similar way.
It is possible to formulate other bounds in terms of concepts such as the annealed VC entropy,
the Growth function (Vapnik, 1998, Cristianini and Shawe-Taylor, 2000) and the fat-shattering
dimension (Cristianini and Shawe-Taylor, 2000), which however will not be done here.

Some readers may be surprised that the joint probability distribution function,
P[(x_1, x_2, x_3,…, x_n), y], does not appear in the bound. The bound, in fact, always holds
(provided that h < l), no matter what the underlying – but still, existing – distribution is.
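As an illustration of how the confidence term behaves, the following sketch evaluates the second term of the bound for some hypothetical values of h, l and η:

```python
# Evaluating the confidence (complexity) term of the VC bound; the values
# of h, l and eta below are hypothetical and purely illustrative.
import math

def confidence_term(h: int, l: int, eta: float) -> float:
    """Second term on the right-hand side of the bound, valid for h < l."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

for h in (3, 10, 50):
    print(h, round(confidence_term(h, l=120, eta=0.05), 3))
# The term grows monotonically with h: all else equal, more complex
# function classes receive looser bounds on the test error.
```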
4. 4 Remarks on the choice of a class of functions
It might appear that up till now we have used the terms “a function” and “a class of functions”
interchangeably. There is a slight distinction between them. As shown in Figure 1 (b), for
example, there are (at least) two individual functions (two parabolas) belonging to the same
class of functions (the class of parabolas) that are able to separate the white balls from the
black ones without an error. According to the risk bound above, these two functions are
equally preferable, since the empirical risk (here, zero) and the VC dimension stay one and the
same. In the event that a parabola which makes mistakes (that is, with empirical error ≥ 1) is
chosen out of the class of parabolas, then, again according to the bound, it will be less
preferable and need not be considered. Notice that, hypothetically speaking, it is quite possible
that neither a linear function, which will necessarily make at least one training error, nor a
parabola is actually most appropriate for this separation problem, since there might exist some
other class of functions which is able to strike a better balance between the empirical error and
the complexity term. In section 5.4 we will show the SVM solution for the case of Figure 1,
where one chooses the best function among polynomials of degree two.
We are now in a position to answer the previously posed question of why complex functions
tend to make more prediction errors relative to simpler ones on a given (two-class)
classification task. It is because their VC dimension is so big that, even though they make few
empirical errors, their test error bound (empirical error plus complexity term) is just too high,
for any fixed 1 − η and l.
Notice as well that it is quite possible that a class of functions with a higher bound on the test
error outperforms functions with a lower bound on a particular task. The boy from the tree
recognition example could have predicted correctly that the object is in fact a tree, had it had
the required number of branches; and the girl could have misclassified the tree, had the object
been presented in winter, when it is not green enough to pass for a tree. Formally speaking,
this happens because we have a bound on the test error, which says that the test error cannot
be above a given value (for fixed 1 − η), but it can certainly be below that value, no matter
how complex the chosen class of functions is. Imagine, for example, that a given class of
functions with VC dimension 4 has a test error R(α) of no more than 0.09 (for η = 0.05, for
instance), and a class with VC dimension 3 has a test error of no more than 0.07 (for the same
η). It may well turn out that on a particular task the more complex class produces an error of
0.06 (which is less than 0.09), and the simpler one – an error of 0.065 (which is less than
0.07).
As a last remark, observe that, all other things being equal, functions with a higher VC
dimension will have greater test error bounds, since the complexity term increases
monotonically with h.
Chapter 5
Constructing Support Vector Machines for Classification Problems
5. 1 Complexity and the width of the margin
5 .1. 1 The VC dimension of hyperplanes
In order to make use of the test error bound (or other similar bounds involving the VC
dimension), we have to come up with classes of functions whose VC dimension can be
computed. We observed in Figure 3 that the VC dimension of the linear functions in
two-dimensional (ℜ²) space is computable, and that it equals three. In a similar fashion, notice
that a (two-dimensional) plane in ℜ³ can separate without an error at most four points in all
possible classification ways the points may appear (if they do not lie on one and the same
plane). As asserted in Burges (1998), the VC dimension of (n − 1)-dimensional hyperplanes in
n-dimensional space is equal to n + 1.

Now that we have found functions (that is, (n − 1)-dimensional hyperplanes) whose VC
dimension is known, one more detail remains to be addressed. Returning to Figure 3, observe
that there are many lines that can be used to separate the black and white balls in each one of
the eight cases. Which one of them is optimal?
5 .1. 2 Optimal hyperplanes
Let us, for further clarity, have a look at Figure 4. In Figure 4, we are given several white and
black balls (outcomes), associated with different values of the two explanatory variables x_1
and x_2. Making a link to finance, one can think of the two axes as representing two factors in
a factor model. For a value-growth rotation strategy, these could be, for instance, the
earnings-yield gap and the change in the rate of inflation. In this case, the black and white dots
could hypothetically be considered as months during which either value or growth stocks have
outperformed. As is evident from the figure, many lines can be drawn which are able to
separate the two classes without an error. Abstractly speaking, we are given one-dimensional
hyperplanes (that is, lines) in a two-dimensional input space (here, defined by x_1 and x_2),
together with l members of two different classes. It is evident that the two classes can be
represented sufficiently well in the input space by just attaching the appropriate labels to them
(say, “value month” or “growth month”). Later, in Support Vector Regression, we will
actually need an additional axis (in addition to the axes required for each of the inputs), since
the target values there can take any real value (that is: y ∈ ℜ), not just two.
Figure 4. Two out of infinitely many lines able to separate without an error two ball classes. Notice that we can draw a shaded area around
line number 1, called the “margin” between the classes, which just touches one of the black and one of the white points. The most preferable
separation line is line number 2, which yields the biggest margin between the two classes.
So, which line in Figure 4 shall we use? Rather than answering in a formal way, let us give
some intuitive reasoning. Notice that we can make line number 1 in Figure 4 a bit “fatter”
until it touches a black ball on its left and a white ball on its right. For simplicity, assume that
the line lies along the middle of this fat region, as shown in the figure. Relaxing this
assumption will not influence our conclusions. The width of the fat region is called the
“margin” that the given hyperplane produces between the two classes. As noted before, the
VC dimension of the class of lines (in a plane) is three. However, one could argue that,
intuitively speaking, line number 2 is the least complex of all lines with zero empirical risk,
since it produces the region with the largest area of “doubt” about the class of new balls
coming into the picture. To put it bluntly, line number 2 is the most “unintelligent” among its
peers, that is – among all lines that are able to separate the classes without any error. In line
with the principle of best generalization (the Structural Risk Minimization principle), we can
conclude that the most preferable among all lines with zero empirical risk is the least complex
one. The “doubt” in our case is not to be confused with “indecision”: any line (with zero
training error) will classify a new ball as “white” if it appears on the appropriate side, even if
it happens to lie inside the margin determined by the line (inside the shaded regions in the
graph). However, the concept of the region of “doubt” (the region inside the margin) can still
be used intuitively as a proxy for complexity. Therefore, one can claim that complexity
decreases with increasing margin, and consequently line number 2 is the optimal hyperplane.
One can think of yet another intuitive way to show why the line yielding the largest margin is
the most preferable. Imagine, for simplicity, that we are given only one black and one white
ball, as in Figure 5. Allowing for more than two balls from two different classes will not
change our intuitive reasoning.

Suppose that the exact position of the two balls is perturbed by some noise; in other words, the
input-output relation f: (x_1, x_2) → {“black”, “white”} exists in a noisy environment. Let the
noise intensity be given by the radius of a circle around each ball, so that the greater the
radius, the greater the noise. We can see immediately that line number 2 in Figure 5 can
“absorb” the biggest amount of noise around the two balls. Line number 1, being relatively
closer to the two classes, will, for example, classify the white ball incorrectly, had it been
“pushed” a bit to the left by some noise. Line number 2, the one that yields the largest margin
between the classes, will be able to cope with the same situation. In other words, it is
preferable to line number 1.
Figure 5. Presence of noise in the data dislocates points from their true position by a certain amount. Since the higher the noise level, the greater the dislocation, the line in the figure that is furthest away from both classes (line 2) will be able to cope with the greatest noise level.
There is actually a formal way to display the relation between complexity and the margin
width, by referring for example to the margin-based bounds on generalization as in Cristianini
and Shawe-Taylor (2000).
5. 2 Linear SVM: the separable case
Let us start by defining what is meant by a “separable” case. A given number l of points in
(n-dimensional input space) ℜ^n from two classes are said to be “separable” if there exists at
least one hyperplane that can separate the two classes from each other without an error.

In the simplest case the two classes are explained by two explanatory variables, x_1 and x_2.
Such is the case in Figure 4, where the two classes, a total of l black and white balls13, are
separable by a line (which in this case is a hyperplane) in the input space of (x_1, x_2). The
line that yields the largest margin is also drawn – line number 2. Our task will be to find an
expression for this optimal separating line, given the coordinates of all l balls. Once we have
solved this problem, it will be relatively easy to move to the nonseparable case, and then to
two-class separation problems in n-dimensional input space, covering both the separable and
nonseparable cases.
In order to determine the exact position of the optimal line in Figure 4, suppose that it has
already been found and has the form w′_1∗x_1 + w′_2∗x_2 + b′ = 0. In this case all white balls
satisfy w′_1∗x_1 + w′_2∗x_2 + b′ ≥ a, and all black balls satisfy w′_1∗x_1 + w′_2∗x_2 + b′ ≤ −a,
for some positive14 a. The balls for which these inequalities hold as equalities are called
support vectors. Thus, the support vectors are the balls that just “touch” the sides of the
margin. Notice that the support vectors completely determine the position of the line – even if
all the other balls were removed, the position of the separating line would not change.

Finding the width of the margin involves a few steps. First, we divide both of the above
inequalities by a, and second, we define new coefficients w_1 = w′_1/a, w_2 = w′_2/a and
b = b′/a. In this way we rescale the optimal separating line and the lines that pass through the
support vectors, which now become w_1∗x_1 + w_2∗x_2 + b = 0, w_1∗x_1 + w_2∗x_2 + b = 1,
and w_1∗x_1 + w_2∗x_2 + b = −1, respectively. From here it is straightforward to find the
width of the margin, which is given by the distance between the latter two parallel lines –
those on which the support vectors from the two classes lie. As a consequence, the (width of
the) margin equals 2/||w||, where ||w|| ≡ √(w_1² + w_2²). In the general n-dimensional case the
vector w will have n coordinates, not just two. We are now in a position to formulate the
optimization problem of finding the expression for the maximal-margin hyperplane:

Maximize 2/||w||
Subject to: (1) (w • x_i) + b ≥ 1, for all i that represent white balls
(2) (w • x_i) + b ≤ −1, for all i that represent black balls.

13 We use the terms “balls” and “points” interchangeably.
14 Specifying a as negative will not alter the conclusions from the ensuing analysis.
We use small bold letters to denote vectors, and the symbol “•” to denote a dot product, so
that (w • x_i) = w_1∗x_i1 + w_2∗x_i2. For ease of exposition, we define y_i = 1 if ball i has the
label “white”, and y_i = −1 if ball i has the label “black”. Furthermore, notice that maximizing
2/||w|| is equivalent to minimizing its reciprocal ||w||/2, which in turn is equivalent to
minimizing ||w||²/2 (since the distance ||w|| is necessarily positive). All these transformations
lead us to the following equivalent formulation of the above optimization problem (Vapnik,
1995, Müller et al., 2001):

Minimize ||w||²/2
Subject to y_i ⋅ ((w • x_i) + b) ≥ 1, i = 1, 2,…, l.
Notice that this is a convex quadratic optimization problem, which means that there will be
only one, global solution to it. This is a very desirable property of SVM, which distinguishes
them from neural networks.
This constrained optimization problem can be solved by introducing non-negative multipliers
α_i and the Lagrangian (Burges, 1998):

L(w, b, α) = ||w||²/2 − Σ_{i=1}^{l} α_i ⋅ [ y_i ⋅ ((w • x_i) + b) − 1 ].

In order to find the optimal solution (w, b, α), the following system of conditions (1) – (4)
must be solved for non-negative multipliers α_i, in line with optimization theory (Burges,
1998):

(1) ∂L(w, b, α)/∂w_j = 0, j = 1, 2,…, n
(2) ∂L(w, b, α)/∂b = 0
(3) y_i ⋅ ((w • x_i) + b) − 1 ≥ 0, i = 1, 2,…, l
(4) α_i ⋅ [ y_i ⋅ ((w • x_i) + b) − 1 ] = 0, i = 1, 2,…, l.
The first two sets of conditions lead to w = Σ_{i=1}^{l} α_i y_i x_i and Σ_{i=1}^{l} α_i y_i = 0,
respectively. It can be shown (e.g., Burges, 1998) that the α multipliers have a value greater
than zero if and only if they are associated with support vectors. If we substitute these two
results into the original Lagrangian function, we arrive at Wolfe’s dual maximization
formulation of the minimization problem (Vapnik, 1995, Burges, 1998):

Maximize W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j (x_i • x_j)

Subject to α_i ≥ 0, i = 1, 2,…, l, and Σ_{i=1}^{l} y_i α_i = 0.
Prior to elaborating on the reason why we would prefer to use the dual formulation of the
optimization problem, we shall address the case of finding an optimal solution when the two
classes are not separable by a hyperplane.
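Before moving on, a brief numerical sketch may help fix ideas. It solves the above problem for a hypothetical separable data set with an off-the-shelf solver (scikit-learn, which of course postdates this thesis) and recovers w = Σ_{i=1}^{l} α_i y_i x_i from the dual coefficients:

```python
# Sketch: maximal-margin line for a separable toy data set. A very large C
# emulates the hard-margin case, so alpha_i > 0 exactly for support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],          # "white" balls (+1)
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.0]])   # "black" balls (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:", clf.support_vectors_)

# Recover w = sum_i alpha_i * y_i * x_i; dual_coef_ stores alpha_i * y_i.
w = clf.dual_coef_ @ clf.support_vectors_
print("w:", w, " b:", clf.intercept_, " margin width:", 2 / np.linalg.norm(w))
```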
5. 3 Linear SVM: the nonseparable case
How can we manage problems where the two classes are not linearly separable? If we still
desire to use a linear function for separation, we could introduce so-called “slack variables”,
which take into account the possibility that one or more members of the classes will appear on
the “wrong” side of the margin. In Figure 6, for instance, which takes the problem of Figure 4
as a starting point, one of the black balls is classified mistakenly as white by the optimal
hyperplane. Referring to the financial interpretation of Figure 4, this is equivalent to saying
that during a certain month value stocks have outperformed growth stocks contrary to our
expectations.
Since (by assumption) all black support vectors satisfy w_1∗x_1 + w_2∗x_2 + b = −1, it follows
that the equation of the line that passes through the misclassified ball and is parallel to the line
through the black support vectors can be given as w_1∗x_1 + w_2∗x_2 + b = −1 + ξ, for some
positive slack variable ξ. The introduction of slack variables alters our original optimization
problem, which becomes, for some positive constant C (Vapnik, 1995, Müller et al., 2001):

Minimize ||w||²/2 + C Σ_{i=1}^{l} ξ_i

Subject to (1) y_i ⋅ ((w • x_i) + b) ≥ 1 − ξ_i, i = 1, 2,…, l
(2) ξ_i ≥ 0, i = 1, 2,…, l.
Figure 6. A non-linearly-separable binary classification problem. The optimal hyperplane makes one training error by classifying a black ball
as white. One way of dealing with such situations is to introduce for each training point a slack variable that takes a positive value if the
respective point turns out to be a training error, and zero otherwise.
In line with generalization theory, our goal here is to simultaneously maximize the margin (by
minimizing ||w||²/2) and minimize the amount of training errors, which are proxied by the
slack variables (a non-zero slack variable means that a training error has been made). In other
words, we minimize the sum of two terms: the empirical risk (via the amount of training
errors) and complexity (via the width of the margin). The positive constant C is introduced to
control the penalty we would like to associate with a given empirical risk: the higher the C,
the greater the penalty associated with a given value of Σ_{i=1}^{l} ξ_i. Thus, if C is set high, a
relatively small margin between the classes will be tolerated if it yields a small number of
training mistakes. On the other hand, a small value for C means that the width of the margin
takes precedence over the amount of training mistakes, and so the solution to the optimization
problem will tolerate a relatively large number of training mistakes. The benefit of introducing
the constant C is that via C we can control explicitly both the complexity and the empirical
training error by affecting the optimal w and the optimal number of training mistakes.
Consequently, we can call C a “complexity-error tradeoff (adjustment) parameter”.
The above optimization problem has its dual representation in the form of (Vapnik, 1995,
Burges, 1998):

Maximize W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j (x_i • x_j)

Subject to C ≥ α_i ≥ 0, i = 1, 2,…, l, and Σ_{i=1}^{l} y_i α_i = 0.
Notice that the slack variables have disappeared in the dual formulation.
5. 4 Nonlinear SVM: the nonseparable case
Notice that the problem in Figure 6, as well as that in Figure 4, can be solved with no training
mistakes by a nonlinear function. The introduction of nonlinear functions is tractable only if
we are able to compute their complexity, that is – their VC dimension. Knowing both a
function’s complexity and its empirical risk enables us to make comparisons among functions
in terms of generalization theory.
Let us try to find a second-order polynomial that solves the nonlinear separation problem of
Figure 6. As mentioned, the empirical risk in this case can be zero, since a parabola can
separate the two classes without an error. However, in order to make judgments about the
regularized risk (the risk of test error), we have to be able to compute the VC dimension of
such functions. It has already been shown how to compute the VC dimension of hyperplanes.
Our task, then, boils down to finding a way to represent a given second-order polynomial as a
hyperplane in a certain n-dimensional space. For example, the polynomial
a_1 x_1² + a_2 √2 x_1 x_2 + a_3 x_2² + a_4 = 0 can be thought of as an equation of a plane in a
three-dimensional space with coordinates x_1², √2 x_1 x_2, and x_2². The VC dimension of
such planes in ℜ³ is equal to four, as shown before. Notice that this three-dimensional space
is just a transformation of the two-dimensional input space (x_1, x_2) via (for example) the
explicit mapping Φ: ℜ² → ℜ³ (Burges, 1998):

Φ: (x_i1, x_i2) → (x_i1², √2 x_i1 x_i2, x_i2²),

for any point (x_i1, x_i2). This transformation is illustrated graphically in Figure 7.
In the transformed higher-dimensional space, called feature space, the two classes are clearly
separable. Notice that the two black and the two white balls lie on top of each other in the
transformed space. We can now apply the SVM optimization algorithm of finding the optimal
hyperplane in the new, higher-dimensional space.
The dual optimization problem in this case is (Vapnik, 1995, Müller et al., 2001):

Maximize W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j (Φ(x_i) • Φ(x_j))

Subject to C ≥ α_i ≥ 0, i = 1, 2,…, l, and Σ_{i=1}^{l} y_i α_i = 0.

The only difference compared to the non-transformed case is that we have to compute the dot
product (Φ(x_i) • Φ(x_j)) instead of (x_i • x_j), where Φ(x_i) = (x_i1², √2 x_i1 x_i2, x_i2²) and
x_i = (x_i1, x_i2).
Figure 7. An SVM solution to the classification problem of Figure 1, presented in feature space. The input space (x_1, x_2) is transformed via the mapping Φ into the (x_1², √2 x_1 x_2, x_2²) feature space. The originally (linearly) nonseparable problem becomes (linearly) separable in the feature space, where the two black and the two white points overlap each other. The optimal hyperplane, √2 x_1 x_2 = 0, which is constructed in the feature space, corresponds to a nonlinear decision surface in the input space (which says that the two quadrants with the two white balls should contain only white balls and the two quadrants with the two black balls should contain only black balls).
In general, feature spaces of more than three dimensions can be used. A computational
problem could potentially arise in such cases, however, since the calculations in the
transformed space can become very cumbersome as its dimensionality increases. This “curse
of dimensionality” is elegantly overcome by SVM (Burges, 1998), since in the dual
formulation of the optimization problem we only have to compute dot products of the form
Φ(x_i) • Φ(x_j), and never actually need to know the explicit coordinates of the points Φ(x_i),
i = 1, 2,…, l, in the feature space. This allows us to make computations even in
infinite-dimensional feature spaces, as long as the dot product Φ(x_i) • Φ(x_j) is computable.
In some cases this dot product can be computed by a simple kernel function:

k(x_i, x_j) = Φ(x_i) • Φ(x_j).
This is actually the reason why it is preferable to use the dual formulation of the optimization
problem. The dot product in the feature space of the optimization problem at hand, for
example, is given in explicit form as

Φ(x_i) • Φ(x_j) = (x_i1², √2 x_i1 x_i2, x_i2²) • (x_j1², √2 x_j1 x_j2, x_j2²).

It can also be expressed implicitly via the kernel k(x_i, x_j) = (x_i • x_j)², since

k(x_i, x_j) = (x_i • x_j)² = ((x_i1, x_i2) • (x_j1, x_j2))² = (x_i1 x_j1 + x_i2 x_j2)²
= x_i1² x_j1² + 2 x_i1 x_i2 x_j1 x_j2 + x_i2² x_j2² = (x_i1², √2 x_i1 x_i2, x_i2²) • (x_j1², √2 x_j1 x_j2, x_j2²).
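This identity is easy to verify numerically; the short sketch below compares the implicit (kernel) and explicit (feature-map) computations for two hypothetical points:

```python
# Verifying the kernel trick: (x . y)^2 equals the dot product of the
# explicit maps Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.5, -0.5]), np.array([2.0, 3.0])
lhs = np.dot(xi, xj) ** 2         # implicit computation, via the kernel
rhs = np.dot(phi(xi), phi(xj))    # explicit computation, via the feature map
print(lhs, rhs)                   # identical up to floating-point error
```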
It has been shown by Vapnik (1995) that the polynomial kernel k(x_i, x_j) = (x_i • x_j + 1)^d
corresponds to a map Φ into the space spanned by all products of input coordinates of order up
to d. By using the kernel k(x_i, x_j) = (x_i • x_j + 1)² for the problem of Figure 1, for example,
we attain a nonlinear decision boundary in the input space (which is linear in the
corresponding feature space), represented in Figure 8.
Figure 8. An SVM solution to the classification problem of Figure 1, presented in input space. The decision surface between the two classes in the figure is found by means of implicitly mapping the input space into a feature space via the kernel function k(x_i, x_j) = (x_i • x_j + 1)², and then mapping the optimal hyperplane (together with the margin it produces) back from the feature space into the input space. The lightly-shaded area is the margin between the classes. The borders of the margin in feature space correspond to curves in the input space, drawn in the figure. Notice that all four points are support vectors, since they lie on these curves.
The margin between the classes is, as in the separable case, denoted as a shaded area. The
borders of the margin in feature space appear as curves in the input space.
5. 5 Classifying unseen, test points
In order to find out how to classify a new, unseen point, let us re-write the equation of the
optimal hyperplane in the (non-transformed) linear case: w • x + b = 0. By substituting the
expression for the optimal w, w = Σ_{i=1}^{l} α_i y_i x_i, we attain the hyperplane decision
function (for a new test point) for the linear case (in line with Vapnik, 1995 and Burges,
1998):

f(x) = sgn( Σ_{i=1}^{l} α_i y_i (x • x_i) + b ).
As a result, if f (x) > 0 (f (x) < 0), the new point will be classified as a white (black) ball.
In case we map our data into a feature space, the equation of the optimal hyperplane becomes
w • Φ(x) + b = 0. Hence, the optimal w equals Σ_{i=1}^{l} α_i y_i Φ(x_i), and the hyperplane
decision function becomes (Müller et al., 2001):

f(x) = sgn( Σ_{i=1}^{l} α_i y_i (Φ(x) • Φ(x_i)) + b ) = sgn( Σ_{i=1}^{l} α_i y_i ⋅ k(x, x_i) + b ).
It is important to notice that the support vectors in Figure 8 (in this case, all four points) lie on
the curves Σ_{i=1}^{l} α_i y_i ⋅ k(x, x_i) + b = ±1.
5. 6 Admissible kernels
Unfortunately, one cannot use just any kernel function to compute dot products in feature
spaces. The following theorem of functional analysis gives sufficient conditions for a kernel to
be admissible (Burges, 1998):

Theorem 1 (Mercer). There exists a mapping Φ and an expansion k(x, y) = Φ(x) • Φ(y) if and
only if, for any g(x) such that ∫ g(x)² dx is finite,

∫∫ k(x, y) g(x) g(y) dx dy ≥ 0.

It is not true, however, that any kernel that fails to satisfy the Mercer condition is inadmissible,
that is – cannot be used in the optimization problem (Burges, 1998). In other words, the
Mercer condition is sufficient, but not necessary.
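To close the chapter, the decision function of section 5.5 can be reconstructed directly from the dual coefficients of a fitted machine. The sketch below does so for the four-point problem of Figure 1, using scikit-learn’s polynomial kernel with parameters chosen to reproduce the admissible kernel k(x_i, x_j) = (x_i • x_j + 1)²; it is an illustration under these assumptions, not part of the original analysis:

```python
# Rebuilding f(x) = sgn(sum_i alpha_i y_i k(x, x_i) + b) from a fitted SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

# gamma=1, coef0=1, degree=2 makes sklearn's kernel equal (x . x_i + 1)^2.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)

def f(x):
    k = (clf.support_vectors_ @ x + 1.0) ** 2   # k(x, x_i) for all SVs
    return np.sign(clf.dual_coef_ @ k + clf.intercept_)

print([f(x)[0] for x in X])   # reproduces the training labels
print(clf.n_support_)         # two support vectors per class: all four points
```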
Chapter 6
Support Vector Regression
6. 1 The ε-insensitive loss function
Up till now we have considered Support Vector Machines for classification tasks. In this
chapter, we extend our analysis to function estimation, which is carried out by Support Vector
Regression (SVR). In the case of SVR, the target values are thus y ∈ ℜ, and not y ∈ {−1, 1}
as in (binary) classification.
In SVR one utilizes the concept of an “ε-insensitive region” instead of “the margin” used in
support vector classification. Following Vapnik (1995), we introduce the ε-insensitive loss
function:

| y − f(x) |_ε ≡ max {0, | y − f(x) | − ε}, for a predetermined nonnegative ε.

Intuitively, if the estimate f(x) of y is off-target by ε or less, then there is no “loss”, that is –
no penalty should be imposed. However, if the opposite is true, that is | y − f(x) | − ε > 0, then
the value of the loss function rises linearly with the difference between y and f(x) above ε, as
illustrated in Figure 9.
Figure 9. The ε-insensitive loss function. The ε-insensitive loss function associates no penalty with a given estimated value, if the estimated
value is within ε distance of the true value. However, as the discrepancy grows above ε, the penalty increases monotonically with it.
Notice that the ε-insensitive loss function is different from the quadratic loss function used in
statistics and elsewhere, which is given by (y − f(x))².
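A direct transcription of the two loss functions makes the difference visible; the sketch below uses an arbitrary ε = 0.5:

```python
# The epsilon-insensitive loss versus the quadratic loss (epsilon = 0.5
# is an arbitrary illustrative choice).
import numpy as np

def eps_insensitive(residual, eps=0.5):
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.linspace(-2, 2, 9)        # residuals y - f(x)
print(eps_insensitive(r))        # zero inside the eps-tube, linear outside
print(r ** 2)                    # the quadratic loss penalizes every residual
```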
6. 2 Function estimation with SVR
Let us consider the simplest case first, where there is only one input variable, x1 , and l
training data-points. That is, we have to estimate the function y = w1 ∗ x1 + b , as in Figure 10.
Figure 10. An SVR solution to the problem of estimating a relation between x1 and y. All points inside the white region in the figure are
within ε distance from the solid, optimal regression line, and therefore are not penalized. However, penalties ξ and ξ* are assigned to the
two points that lie inside the shaded areas. The optimal regression line is as flat as possible, and strikes a balance between the area of the
white region and the amount of points that lie outside this region.
Imagine that the optimal regression line has already been found (as in Figure 10). The
equation of the optimal line is consequently y = w_1∗x_1 + b. It is possible to give a financial
interpretation of Figure 10. The y values can be viewed, for example, as representing the
actual difference between the returns on the S&P 500 Barra Value and Growth indices. This
difference, in the simplest case, might be explained by a single factor x_1, say the one-month
oil price change. All points that are within distance ε of the optimal line (that is, all points in
the non-shaded area) are not associated with any loss/penalty, in line with the concept of the
ε-insensitive loss function. However, points for which | y − (w_1∗x_1 + b) | exceeds ε will be
penalized through the introduction of slack variables ξ_i and ξ_i*, i = 1, 2,…, l, in line with
Smola and Schölkopf (1998).
Notice that the flexibility of assigning different values to ε makes it possible to consider a
myriad of overfitting-correction criteria and investors’ loss functions, corresponding to
different values of ε. If ε is set too small, and the penalty associated with values off-target too
high, the resulting ε-insensitive region (in the input space) must necessarily look like a serpent
maneuvering through the data, making lots of curves. As a result, almost all points will be
classified correctly. This type of loss function could be typical for investors who consider
even small losses quite disastrous. If ε is set too high (and the penalty associated with values
off-target too low), then rather few points will be penalized, meaning that investors in this
case are inclined to put up with greater losses, that is – to be indifferent to losses of magnitude
up to ε. The resulting ε-insensitive region (in the input space) is then very likely to resemble a
linear surface. In Figure 10, there are two penalized points, with respective penalties ξ and ξ*.
All these considerations, together with the Structural Risk Minimization principle, lead
logically to the formulation of the optimization problem used in SVR for function estimation
(Vapnik, 1995):
Minimize ||w||²/2 + C Σ_{i=1}^{l} (ξ_i + ξ_i*)

Subject to (1) ((w • x_i) + b) − y_i ≤ ε + ξ_i, i = 1, 2,…, l
(2) y_i − ((w • x_i) + b) ≤ ε + ξ_i*, i = 1, 2,…, l
(3) ξ_i, ξ_i* ≥ 0, i = 1, 2,…, l.

As in the binary classification problem, it is assumed that there are a total of l training points.
Notice that we can use the same formulation of the optimization problem to solve cases where
the input space is n-dimensional; in this case the vector w and each point x_i have n
coordinates. The predetermined constant C plays a role completely analogous to the one it
plays in classification: it pre-specifies the amount of penalty associated with each training
mistake (that is, with each x_i for which either ξ_i > 0 or ξ_i* > 0).
The formulation of the optimization problem can be explained intuitively as follows. In
solving the optimization problem one strives to strike a balance between the area of the
non-shaded, ε-insensitive region (as in Figure 10) – in other words, complexity – and the
amount of training errors that are allowed to occur. Thus, for example, if the pre-specified ε is
big enough to give rise to (many) ε-insensitive regions that contain all training points, then the
resulting optimal estimated function will be as horizontal (“flat”) as possible.
optimization problem (Vapnik, 1995, Smola and Schölkopf, 1998):
l
(
W (α ∗ , α ) = − ε ∑ α i* + α i
Maximize
i =1
−
(
) + ∑ (α
l
i =1
)(
1 l
α i* − α i α *j − α
∑
2 i , j =1
C ≥ α i* , α i ≥ 0 , i = 1, 2,…, l and
Subject to
)
−α i y i
*
i
j
)(x
i
∑ (α
•xj)
l
i =1
i
)
− α i* = 0 .
Generalization to nonlinear regression estimation is carried out analogously to the case of
binary classification – by substituting the kernel function k(x_i, x_j) for (x_i • x_j) in the dual
formulation above. In SVR the regression estimates take the form of (Smola and Schölkopf,
1998):

f(x) = Σ_{i=1}^{l} (α_i* − α_i) ⋅ k(x, x_i) + b.

Analogously to the Support Vector Machines for classification, the α and α* multipliers have
a value greater than zero if and only if they are associated with support vectors.
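As a brief illustration of the machinery of this chapter, the sketch below fits a linear SVR to a hypothetical one-factor data set in the spirit of Figure 10; points strictly inside the ε-tube receive α_i = α_i* = 0 and are therefore not support vectors:

```python
# Sketch: linear SVR on synthetic one-factor data (parameters illustrative).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = 0.8 * x.ravel() + rng.normal(scale=0.2, size=40)  # noisy linear relation

svr = SVR(kernel="linear", C=1.0, epsilon=0.3).fit(x, y)
# Only points on or outside the eps-tube become support vectors:
print("support vectors:", len(svr.support_), "out of", len(x), "points")
```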
Chapter 7
Methodology
7. 1 A factor-model approach to the basic model
Before explaining the technical part of the basic Support Vector Regression Value-versus-Growth
rotation model (to be presented in section 7.4), we will first put it in the context of the
multi-factor models discussed in chapter 2. Consider Figure 11, which presents a tree-like
structure stemming from the term “factor models” that captures different facets (alluded to in
chapter 2) of these models.
[Figure 11 depicts a tree rooted at the term “factor models”. Its branches distinguish: single-factor versus multiple-factor models; models using all factors simultaneously from a pre-specified set versus models using (many) subsets of a pre-specified factor set; models for estimating expected returns versus models for estimating the volatility of returns; models utilizing multiple regressions versus other (regression) tools; and model selection based on statistical criteria (such as adjusted R², AIC, BIC), on financial criteria (such as hit ratio, information ratio), on Bayesian analysis, principal component analysis, etc., or on a cross-validation procedure.]

Figure 11. Classification of factor models according to different characteristics. The basic model of the thesis can be regarded as a factor model with features that appear in the shaded rectangles.
The proposed basic model of this thesis has the factor-model characteristics that appear inside
the shaded rectangles of Figure 11. It employs Support Vector Regressions, uses all
pre-selected factors simultaneously, predicts the difference between the returns on value and
growth stocks in the S&P 500 index (split by market capitalization according to their
book-to-market ratio), and, finally, uses a cross-validation procedure for model selection (to
be explained in section 7.3.5).
Notice at the outset that, regarding market efficiency, whatever the results of the Support
Vector Regression model, they will be intrinsically inconclusive as evidence for or against US
stock market efficiency. This is, to begin with, a consequence of the fact that although all
information on the factors used has been (publicly) available throughout the whole estimation
period, the Support Vector Regression tool was not. This is at odds with the notion of market
efficiency, which requires that at the time of model creation only modeling tools available at
that time (and not afterwards) be applied. Additionally, as pointed out by Pesaran (2003),
market efficiency and the non-predictability of stock market returns are concepts that cannot
be equated per se. Pesaran (2003) shows that “stock market returns will be non-predictable
only if market efficiency is combined with (investor) risk-neutrality”.
7. 2 Indices and data choice
7 .2. 1 The explained variable: the “value premium”
The actual task of the proposed Support Vector Regression model is to predict the direction of
the monthly value premium (that is, the difference in monthly returns) between two indices –
the S&P 500 Barra Value and Growth indices. The choice of these two indices is motivated
by the expected low transaction costs (associated with high expected liquidity), since it is
possible to buy and sell futures on them (Bauer and Molenaar, 2002). There exist a number of
characteristics for classifying stocks as belonging either to the value or the growth club, such
as the ratios of (current) market price to earnings per share and market price to cash flow per
share, but we confine ourselves to the book-to-market (BM) ratio in devising our
Value-versus-Growth style rotation strategy, because it can be easily implemented (through
the S&P 500 Barra Value and Growth indices). The logic behind the chosen split of stocks can
be explained as follows. Firms with a low BM ratio are generally expected (by the market) to
grow fast and be quite profitable some time in the future, so as to compensate for the high
market price of their stocks compared to the book value of their (existing) equity capital.
These expected-to-grow-fast firms form the growth club. The rest of the firms are labeled
“value”.
7 .2. 2 On the choice of explanatory factors
A myriad of factors can potentially be expected to affect the two classes of value and growth
stocks. In this thesis, we restrict ourselves to the set of 17 factors used by Bauer and Molenaar
(2002), who claim to consider only factors whose effects on stock market returns are asserted
in the literature to have some economic interpretation. It could be argued that all of the 17
pre-chosen candidate factors in the base factor set (given in Appendix I) actually affect value
and growth stocks in a certain way, but there is no consensus in the literature on the precise
nature of each factor’s influence, the extent of that influence, or whether the direction of the
influence is constant through time. Bauer and Molenaar (2002), for example, find that some of
their 17 pre-chosen factors “appear to be relevant in a particular time frame, but loose their
power completely in a different period”. Alongside, Levis and Liodakis (1999) state that
“there are good fundamental reasons and considerable empirical evidence to suggest that …
value spreads are associated with economic fundamentals”. Asness et al. (2000), however,
remind us that one criticism of considering economically meaningful variables is that “it
becomes very difficult to determine which of the observed relations are real and which ones
are artifacts of the data”.
7.2.2.1 Technical factors
The 17 pre-chosen factors can be divided into “economic” and “technical” ones. Some
suggestions as to why the pre-chosen technical (or, market-based) factors appear to be
relevant can be found, for example, in the works of Levis and Liodakis (1999)15, Asness et al.
(2000), Copeland and Copeland (1999) and Chan et al. (1996).

To start with, Levis and Liodakis (1999) and Asness et al. (2000) report that, among others,
the one-month lagged value spread is an important predictor of the (following month’s) value
premium. Copeland and Copeland (1999) find that on days that follow increases (decreases)
in the VIX16, value-based portfolios outperform (underperform) growth-based portfolios. The
authors interpret this observation on the basis of the idea that rising uncertainty about the
future leads to falling confidence in growth stocks and a shift to value stocks. Chan et al.
(1996) investigate whether there is “momentum” in stock prices (that is, whether past winners
on average continue to outperform past losers), and discover that the market responds only
gradually to new information. The researchers also provide evidence on the profitability of
price momentum strategies and relate it to portfolio value and growth characteristics by
showing that past winners (losers) tend to be growth (value) stocks.
7.2.2.2 Macroeconomic factors
The subset of pre-chosen macroeconomic factors is based on findings documented, for
instance, by Bauer and Molenaar (2002), Kao and Shumaker (1999), Levis and Liodakis
(1999) and others.

According to these studies, one of the important determinants of the sign of the value
premium is the overall interest rate environment. For example, as the spread between
long-term and short-term interest rates (the yield-curve spread) widens, firms whose profits
are expected to lie in the more distant future – that is, growth firms – will be hurt relatively
more, since their future (expected) earnings are discounted over longer horizons compared to
value firms. Macedo (1995) maintains that the equity risk premium (the expected future extra
return that the overall stock market or a particular stock must provide over the rate on
risk-free bonds to compensate for market risk) is the strongest determinant of future style
performance. In his view, a high equity risk premium favors riskier portfolios; and since value
stocks are perceived to be more risky, they tend to do well when the equity risk premium is
high. A steadily rising expected equity risk premium implies decreased confidence in the
future and hence hurts growth stocks disproportionately, since their profits are expected to
materialize in the more distant future. Another relevant determinant of the value-growth
(monthly) return spread appears to be the rate of inflation (see Levis and Liodakis, 1999, and
Kao and Shumaker, 1999). Additionally, Sorensen and Lazzara (1995) find a positive
relationship between the growth in industrial production and interest rates, on the one hand,
and the value-growth return spread, on the other.
According to Kao and Shumaker (1999), if the earnings-yield gap (which subtracts bond
yields from a market earnings/price ratio) is small and is produced by a low earnings-to-price
environment in combination with high interest rates, then value stocks should be favored. The
researchers go on to contend that, regarding credit spreads, one would expect growth stocks to
outperform value stocks in a recessionary environment characterized by high default rates.
Lucas et al. (2001) also consider the effect of changes in the business cycle, proxied by a
composite index of leading indicators of the US business cycle, and hypothesize that growing
firms are likely to be more flexible in reacting to, and profiting from, a changing economic
environment.

15 These results of Levis and Liodakis (1999) are established for the UK stock market.
16 The Market Volatility Index of the Chicago Board Options Exchange.
7 .2. 3 Factor explanatory power and Support Vector Regressions
Without going further into deep discussions of the expected effects of each of the 17
candidate factors, it should be stressed that it seems reasonable in principle to first test what
their actual relevance appears to be, and then try to explain why, at least empirically, a factor
does or does not stand out as relevant. In any case, it could well be that all factors depend to a
certain degree on each other, so that it is difficult to disentangle the effects of a single factor,
and also that different factors are relevant at different times. However, for prediction purposes
– which is in effect the subject of greatest interest – it does not actually matter what the
precise role of individual factors is, since it is more interesting to see how the interactions of
all factors can be used effectively to predict which of the two S&P 500 Barra indices will
outperform the other in a given time period (in our case, the following month). At any rate,
the Support Vector Regression tool that is used to build the proposed models is expected to
derive information (or, estimates) stemming from the interactions between (many)
explanatory factors. However, it is a nonparametric tool which can provide only limited
information as to exactly which individual factors stand out as important. An extensive
account of the properties of Support Vector Regressions in relation to factor models is given
in the following sections.
7. 3 Support Vector Regression as a factor-model tool
Several of the preceding chapters of the thesis have dealt in detail with the rationale behind
and the nature of Support Vector Regression and Support Vector Machines as a whole. What
is important to emphasize in this section are those qualities of Support Vector Regressions
that most justify their employment as a factor-model tool.
7 .3. 1 The generalization property of Support Vector Regression
The first property that stands out is the elegant theoretical ability of Support Vector Regressions to strike automatically a balance between model explanatory power (or, “fit”) on the training data and model complexity, for given regression parameters such as ε, C, and kernel function
parameters. As shown in chapter 4, it is precisely this generalization feature of functions (or,
models) that matters most for prediction purposes. Functions, or models, that extremely
overfit the training data are generally expected to be worse predictors than functions that
make some training mistakes, but are less complex. Overfitting is especially characteristic of
models that include numerous explanatory variables. Despite this, as noted in chapter 3,
Support Vector Regressions, and Support Vector Machines in general, are renowned for their
capacity to achieve good generalization performance even in high-dimensional input
(explanatory) data. This capacity could potentially provide a solution to the debate
surrounding the choice of the most important (several) explanatory factors of the value
premium out of a universe of explanatory factors. Making this choice is to a certain extent
unavoidable in an “ordinary” factor model since the statistical model selection criteria (e.g.,
adjusted R2) associated with multiple regression analyses do not tolerate a large number of
explanatory variables. In contrast, Support Vector Machines are expected to handle with ease all candidate explanatory variables considered simultaneously, so it is not imperative to come up with a list of the most important factors. Additionally, there might be hidden
interaction patterns between the explanatory factors that cannot possibly be captured by any
parsimonious model (by construction), but which may be accounted for by a multivariate
analysis (involving in our case 17 explanatory variables) using a tool which possesses an
adequate generalization ability in addressing multi-factor problems.
7.3.2 The internally-controlled-complexity property of Support Vector Regression
Using Support Vector Regressions one can alter model complexity without changing the
number (and nature) of explanatory variables. This is, to start with, due to the employment of
the ε-insensitive loss function instead of the “standard” quadratic loss function used
commonly in statistics and econometrics. Smaller values for the ε parameter force in general
the modeled function to provide a better fit on the training data (ceteris paribus) since as the
error-insensitive ε-region becomes smaller, the versatility of the function becomes greater,
and with it model complexity as well. The role of the complexity-error tradeoff parameter, C, is analogous: a greater value for C forces the modeled function to become more flexible (that
is, complex) and make fewer training mistakes (ceteris paribus). It is also possible that similar
alterations of model complexity can be induced by the parameters (if there are any) of the
utilized kernel function. What is common for all these cases is that model complexity is being
changed “internally”, within the model itself, and not “externally” via making changes in the
(amount and nature of) data used for training.
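To make the role of these parameters concrete, the snippet below is a minimal sketch, assuming Python with scikit-learn (whose SVR implementation wraps LIBSVM, the library used in chapter 8); the data and parameter values are purely illustrative, not those of the thesis.

```python
# Sketch of how epsilon and C alter model complexity "internally".
# Data and parameter values are illustrative only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 17))            # 60 months, 17 candidate factors
y = rng.normal(scale=2.0, size=60)       # stand-in for the value premium

for eps, C in [(1.0, 1.0), (0.1, 1.0), (1.0, 100.0)]:
    model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=0.007)
    model.fit(X, y)
    # A smaller epsilon or a larger C typically yields more support vectors,
    # i.e. a more flexible (more complex) fitted function.
    print(f"eps={eps}, C={C}: {len(model.support_)} support vectors")
```

Holding the data fixed, shrinking ε or raising C makes the fitted function track the training data more closely: model complexity changes without any change to the explanatory variables themselves.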
7.3.3 The property of specifying numerous investor loss functions
As a spin-off of the internally-controlled-complexity property, and stemming from the
utilization of the ε-insensitive loss function, comes yet another property, which addresses the
issue of investors’ perception of suffered losses. After all, admittedly nobody can spell out the precise form of the (aggregate) loss function that investors have. Moreover, it may well happen that this loss function is not constant through time and is highly sensitive to economic regime switches. The advantage of introducing the ε-insensitive loss function instead of the “standard” quadratic loss function is that the ε parameter gives the liberty of explicitly formulating a myriad of loss functions (see e.g. Figure 9).
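For reference, the ε-insensitive loss penalizes only residuals exceeding ε in absolute value, in contrast to the quadratic loss $(y - f(x))^2$, which penalizes every residual; its standard form is

$$ L_\varepsilon\bigl(y, f(x)\bigr) \;=\; \max\bigl(0,\; \lvert y - f(x)\rvert - \varepsilon\bigr), $$

so that each choice of ε corresponds to a different tolerance for small estimation errors.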
7.3.4 The property of distinguishing the information-bearing input-output pairs
As pointed out in chapter 6, in Support Vector Regression weights are automatically and optimally (via optimization) assigned to all factors-estimate combinations (which can be represented as points in a factors-estimates space), for a given ε-insensitive loss function, a pre-specified complexity-error tradeoff adjustment constant C, and a kernel parameter (if
any). These weights are expected to be indicative of the relative importance of each input-
output pair. Only the support vectors will be given a positive weight, which might suggest
that the rest of the input-output pairs do not contain useful information and should not be
considered for future model-building. Alongside, it might be possible in this process to create
a list of factors ordered according to their relevance. These conclusions however have to be
substantiated or refuted by further research in this area.
7.3.5 Cross-validation procedure for choosing among optimal models
What is quite striking is that, since the Support Vector Regression parameters can easily be controlled manually, one can generate a myriad of optimal models – one optimal model for
each possible combination of training parameters. The freedom of choosing among different
values for the ε-parameter, complexity-error tradeoff adjustment parameter, and kernel
parameters (if any) leads to the natural question of how to find the combination between them
that will yield the model with the best predictive power. To the best of our knowledge, no universally optimal technique for dealing with this issue has yet been discovered, but one way of tackling it is via a procedure that is “standard” for Support Vector Machines, namely cross-validation. Basically, a k-fold cross-validation procedure works as follows: a
given dataset is divided into k folders of equal size; subsequently, a model is built on all
possible (k) combinations of k-1 folders, and each time the remaining one folder is used for
validation. The best model is the one that performs best on average over the k validation
folders. The benefit of using a cross-validation procedure is that by construction it ensures
that model selection is based entirely on out-of-sample rather than in-sample performance.
Thus, the search for the best Support Vector Regression model is immune to a critique of
drawing conclusions about the merits of a factor model based on its in-sample performance.
To illustrate this critique in terms of the concept of generalization, it was suggested in chapter 4 that extremely good in-sample model performance – that is, performance over the training data set – is associated with considerable overfitting of the training data, which in turn is associated with poor generalization ability and poor model predictive power.
For greater clarity, the stages of a 5-fold cross-validation procedure are illustrated in Figure
12. Suppose that we have initially given training data consisting of values of both explanatory
and explained variables for n months, as in Figure 12 (a). The first stage of the cross-validation procedure is to randomly permute the (chronological or original) sequence of the data, as in Figure 12 (b). The second stage is to divide the permuted data into five (approximately) equally-sized blocks, called folders, as in Figure 12 (c). The third stage consists of five sub-stages. At each sub-stage, four folders of data are used as a training (model-building) set, and
the remaining fifth folder is used for validation (in other words, for testing), as illustrated in
Figure 12 (d). This procedure is repeated five times (one time for each validation folder).
Model selection is based on performance over the five folders used for validation, which is
critical because this ensures that the model selection itself is based only on (artificially
created) out-of-sample performance. If our model-building tool is Support Vector Regression,
then the performance of any model is judged by the mean sum of squared errors between
estimated and real target values associated with each of the five validation folders. The model
that achieves minimum mean sum of squared errors on average (over the five validation
folders) is considered to be the best. This best model is said to achieve minimum cross-validation mean squared error.
Figure 12. A 5-fold cross-validation procedure. The original data for n months in (a) is randomly permuted in (b) and divided into 5 equally-sized folders in (c). Afterwards, each folder in turn is selected for validation and a model is built on the remaining four folders; this is repeated 5 times in total (once for each validation folder), as suggested in (d).
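To make the selection loop concrete, the following is a minimal sketch assuming Python with scikit-learn; the helper names (`cv_mse`, `select_best`) and the parameter grid are illustrative, not part of the thesis code.

```python
# Minimal sketch of 5-fold cross-validation model selection for SVR.
# The parameter grid below is illustrative only.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cv_mse(X, y, C, eps, gamma, k=5, seed=0):
    """Mean squared validation error over k randomly permuted folders."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)  # stages (b)-(c)
    errors = []
    for train_idx, val_idx in kf.split(X):                   # stage (d)
        model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return np.mean(errors)

def select_best(X, y, grid):
    """Return the (C, eps, gamma) combination with minimal CV MSE."""
    return min(grid, key=lambda p: cv_mse(X, y, *p))

# Illustrative grid; the true search ranges are continuous.
grid = [(C, eps, g) for C in (1, 8, 32) for eps in (0.5, 1.0) for g in (0.007,)]
```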
The major potential drawback of the cross-validation procedure is that by construction it is bound at some point to make use of future information to predict past target values, which seems counterintuitive for a time-series analysis. This issue is somewhat related to the Look-Ahead Bias critique and will be addressed in section 7.7.2. Another shortcoming is that cross-validation is a time-consuming procedure, which is not guaranteed to produce the best estimates of the target values.
7.4 The basic model
The basic “real-time” simulated investment model consists of two steps.
First, at month t, all (historical) values of all 17 candidate explanatory factors, together with the differences in returns between the S&P 500 Barra Value and Growth indices for months t-60 till t-1, are used to build numerous Support Vector Regressions. Thus, the dependent
variable of the basic model is the “value premium” – the difference between the realized
returns of the S&P 500 Barra Value and Growth indices. The independent variables are the 17
pre-specified factors referred to in section 2.4, and listed in Appendix I. The total time span of
predicted months is between January 1993 and January 2003. Going further back in time is
untenable due to unavailability of (sufficient) macroeconomic data. The choice of exactly 60
months of data for model building is to a certain extent arbitrary. On the one hand, it could be
argued that 60 months of data are rather insufficient for forming reliable forecasting
hypotheses. On the other hand, however, it seems risky to consider too long a period, since the model might then be exposed to the critique of failing to account for structural change. Moreover,
information that can be extracted from months lying in the ever more distant past becomes increasingly irrelevant to the present day. Thus, it seems reasonable to make the assumption that 60 months can be viewed as belonging to roughly the same economic regime.
Second, once the Support Vector Regressions have been constructed, a standard procedure for
ranking the resulting models has been applied. This procedure is a 5-fold cross-validation (as
explained in section 7.3.5), according to which models are ranked on the basis of their cross-validation mean squared error. The regression with minimal mean squared error is used to
predict the Value index minus the Growth index return difference for month t.
Alongside, some data re-balancing and other adjustments (explained in chapter 8) have been
made, one of the consequences thereof being that if the predicted Value minus Growth return
difference is between -0.05 and 0.05 relative to the average over the training period, then we
conclude that there is no signal for the next month, which implies taking no trading position.
If however the predicted difference is above 0.05, then at time t we buy the Value index and
sell the Growth index, so as to capture the predicted positive value premium. And if the
predicted difference is below -0.05, then analogically at time t we buy the Growth index and
sell the Value index, so as to capture the negative value premium.
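In code, the decision rule just described can be sketched as follows; `trading_signal` is a hypothetical helper, and the ±0.05 band relative to the training-period average is the one described above.

```python
# Sketch of the basic model's trading rule. `predicted` is the SVR estimate
# of next month's value premium; `train_mean` is the average value premium
# over the 60-month training window. Illustrative helper, not thesis code.
def trading_signal(predicted: float, train_mean: float, band: float = 0.05) -> int:
    """Return +1 (long Value / short Growth), -1 (long Growth / short Value),
    or 0 (no position) for the coming month."""
    deviation = predicted - train_mean
    if deviation > band:
        return 1          # capture a predicted positive value premium
    if deviation < -band:
        return -1         # capture a predicted negative value premium
    return 0              # no clear signal: take no trading position
```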
In the basic model, transaction costs are unaccounted for. This zero-transaction-cost
assumption will be relaxed in section 7.5, where model extensions are discussed. Let us point
out here also that since the Support Vector Regression is a non-parametric tool, we can obtain
only point estimates and not the probability for a certain value to be observed.
Using only historically available data ensures that the implementation of the trading strategies
is carried out without the benefit of foresight, in the sense that investment decisions are not
based on data that have become available after any of the to-be-predicted months. Moreover,
investment decisions for the to-be-predicted months are always based on the entire factor set
of historical (60-month) data, ensuring that no variable-selection procedures based on
extensive manipulation of the whole available data have been carried out. At any rate, the
utilized cross-validation procedure for model selection ensures that the best candidate model
for each month is being selected only on the basis of performance on external validation
samples.
For comparison purposes, we set our results against a benchmark strategy which always (that is, each month) bets that value stocks will outperform growth stocks. Thus, according to the
benchmark strategy, at the beginning of the forecasting period (January 1993) a hypothetical investor takes a long position in the Value index and a short position in the Growth index. This position is held throughout the entire prediction period (January 1993 – January 2003). The monthly
difference between the two indices is the monthly return from this Value-minus-Growth
investment strategy.
7.5 Model extensions
The basic model could be augmented in a number of ways. For example, analogously to Bauer and Molenaar (2002), next to the one-month-ahead forecast horizon of the basic model, one can calculate signals for three- and six-month forecast horizons, and subsequently mix
them in order to come up with one signal. Alongside, different levels of transaction costs
should be taken into account in order to make the implementation of the strategies realistic.
Considering the three-month horizon, for instance, if at time t the models built at t-2, t-1 and t produce “value”, “growth”, and “value” signals for time t+1 respectively, then the combined signal for month t+1 using a simple unweighted-average rule is “go long one-third on the Value index, and short one-third on the Growth index”. Notice that one of the two “value” signals cancels out against the “growth” signal. If the combined signal produced by the three models pertaining to month t+1 is “no signal”, then no trading position is established. The six-month horizon is calculated analogously. One could consider assigning greater weights to
more recent months in this procedure, which is not done here however. One of the main
reasons for estimating the additional three- and six-month horizons is to observe whether the
signals produced by them are consistent with those of the one-month horizon strategy (that is,
with the basic model). “Consistent” in this case means that the results from the three-month
horizon strategy should be worse than the results from the one-month horizon strategy, and
the results from the six-month horizon strategy should be worse than the results from the
three-month horizon strategy. Such consistency, if it exists, would lend greater credibility to
the results of the basic model, on the one hand, and avoid the “data-mining” critique (to be
addressed in subsection 7.7.4) on the other. Indeed, we do find evidence of such consistency.
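The unweighted-average mixing rule amounts to the small sketch below, using the +1 (value) / -1 (growth) / 0 (no signal) convention; `combined_signal` is a hypothetical helper.

```python
# Sketch of the unweighted-average signal-mixing rule for the three-month
# horizon. Signals follow the +1/-1/0 convention used above; the fraction
# returned is the position size (positive = long Value / short Growth).
def combined_signal(signals: list[int]) -> float:
    """Average the one-month signals produced at t-2, t-1 and t for month t+1."""
    return sum(signals) / len(signals)

# Example from the text: "value", "growth", "value" -> +1/3,
# i.e. go long one-third on the Value index and short one-third on Growth.
assert abs(combined_signal([1, -1, 1]) - 1/3) < 1e-12
```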
Another possible extension is to consider models that incorporate different combinations of explanatory factors. This procedure would lead to a total of 2^17 candidate factor models for each to-be-predicted month. For lack of appropriate computational equipment, however, only one model, the one based on all 17 pre-specified factors, has been considered.
In the same line of thought, there is no guarantee that exactly 60 months represent an accurate
historical horizon for model selection. It may well happen that 65 or 55 months of training
data produce better results. The advantage of considering and comparing different model
selection horizons, furthermore, is that in this way the reliability of the default 60-month
historical horizon can be put on trial: if small changes in the number of months produce
enormously different results, then this inconsistency would be a sign of unreliability of the
default model strategy. Considering such time horizons, however, falls outside the scope of this master’s thesis.
A fourth extension, which is important in practice, is to include transaction costs in the
calculations. We have allowed for two possible (non-zero) transaction-cost regimes: one that assumes transaction costs fixed at 25bp single trip, and one that assumes 50bp (single trip).
Very importantly, it is worth noting that investors could not know in advance which strategy
would perform best. In order to address this issue, a hyper-model selection tool analogous to the one proposed by Pesaran and Timmermann (1995) could be utilized. Constructing such a tool, however, falls outside the scope of this master’s thesis.
The last proposed extension here, which can rather be viewed as a curtailment, is to reformulate the regression problem of the basic model as a classification problem. Similarly to
considering the three- and six-month forecast horizon, the results from this model “extension”
can serve as a consistency test for the basic model. In the classification case, all months where
the value premium is positive/negative can be labeled just “+ 1” / “– 1”, while the values for
the explanatory variables remain unchanged. Since the ε-insensitive loss function parameter
(ε) does not enter the calculations in the classification problems, the time for carrying out
calculations should be relatively shorter. The accuracy of the classification results is expected
to be worse, since months where the value premium is, say, close to 0.01 would have the
same importance (or, weight) in model building as months where the value premium is about
4.50, as both receive the label “+1”. If this expectation turns out to be the case in practice, this
could be perceived as evidence of consistency of the basic model. The results (for the one-,
three- and six-month forecast horizons) from this classification problem approach, which in
fact testify to such kind of consistency, will be shown in section 8.3, immediately after the
results from the Support Vector Regression approach.
7.6 Small-versus-Big Rotation with Support Vector Regressions
It is important to stress that we will also present the results from a so-called monthly “Small-versus-Big” Support Vector Regression rotation model for the sample period January 1993 – January 2003 and compare these results with those from the “Small-minus-Big” and “MAX_SB” strategies. Because of space considerations, we will not provide a complete account of those strategies, but just sketch them briefly. The “Small-versus-Big” strategy is a
monthly rotation strategy conducted on the S&P 500 and S&P SmallCap 600 indices17; the
“Small-minus-Big” strategy is an investment strategy that in the beginning of the sample
period goes long on the S&P SmallCap 600 index and short on the S&P 500 index, and holds
that position thereafter; and the “MAX_SB” is a perfect foresight rotation strategy that each
month goes long on the index with higher monthly return and short on the index with lower
monthly return. The results from those strategies are sketched in section 8.2.7, and presented
in full in Appendix 6 and Appendix 7. The associated explanatory variables, listed in Appendix 5, are somewhat different from those considered for the Value-versus-Growth
strategies. The analysis of the performance of Support Vector Regressions in the “Small-versus-Big” case is of great value, because if the results turn out to be as promising as those from the “Value-versus-Growth” case, then greater credibility should be lent to Support Vector Machines as a tool for constructing financial factor models. Our results confirm that Support Vector Machines perform extraordinarily well both in the “Value-versus-Growth” and the “Small-versus-Big” prediction tasks.

17 Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank Russell 2000 indices have been used as inputs for the Small-versus-Big calculations.
7.7 Support Vector Machines vis-à-vis common factor model pitfalls
This section deals exclusively with the question of why the proposed methodology is
expected to be immune to common critiques to which factor models are exposed, such as
Survival Bias, Look-Ahead Bias, Data Snooping, Data Mining, and Counterfeit.
7.7.1 Support Vector Machines versus the Survival Bias
According to Haugen (1999), Survival Bias occurs “if individual firms that go inactive during
the test period are systematically excluded from the test population”. It could be argued that if
the results from the Support Vector Machines models depend on the status of the companies
(which is either “active” or “inactive”), then failure to include inactive companies in the
calculations may lead to misleading estimates. For example, one can pick some active firms
that were regarded at the beginning of 2001 as successful, scrutinize their technical financial characteristics (for instance, price-to-earnings ratio, book-to-market ratio, etc.), and then come
up with a list of characteristics that the “ideal” firms should have. In this situation, it may well
happen that firms that were close to this “ideal” type in the beginning of, say, 1999 actually
went bankrupt by the beginning of 2001, and so have never been taken into consideration at
the time of constructing the list of coveted characteristics. This, most probably, will cast a
serious doubt on the validity of the resulting list of characteristics. The results of the Support
Vector Machines models do not suffer from Survival Bias. Firms that change their status
during the model training period are never excluded from the test population. All tests are performed over whole indices, whose lists of constituents are not adjusted to include only firms that were active at the end of any model testing period.
7.7.2 Support Vector Machines versus the Look-Ahead Bias
Look-Ahead Bias occurs when one builds prediction models based partially on “data items
that would not have been known when the predictions were made” (Haugen, 1999). For
example, suppose one constructs a model based on the entire sample period. In our case this
period is January 1993 – January 2003. Undoubtedly, some explanatory factors would appear
to have greater explanatory power than others. It would be totally unfair to use this whole
information in trying to predict the value of the explained variable for, say, January 1995,
because at that time investors would not have known which factors exactly would turn out to
be important in the future. The Support Vector Machines models do not suffer from such Look-Ahead Bias. All predictions are based on data for the 60 months preceding any to-be-predicted month.
The Look-Ahead Bias critique however can be partially directed at the cross-validation
procedure used for model selection. As explained in section 7.3.5, by construction the cross-validation procedure is bound to use future data in predicting past outcomes (see Figure 12).
Although this “future data” is actually past data from the point of view of any to-be-predicted
month (the “future data” is always part of the 60 months of training data, and so has been available prior to any to-be-predicted month), it seems at first sight unjustifiable to apply the
procedure in analyzing time-series data. The question is, can one use future data (from the
point of view of any month) to predict past outcomes, as the cross-validation procedure
suggests? Even though this should usually be considered a weak point of the cross-validation procedure, our assumption that the 60 months prior to any month belong roughly to one and the same economic regime (that is, there are no abrupt regime changes within these 60 months) gives us the possibility to compare input-output relations at different times across a strict 60-month time frame.
An alternative to the cross-validation procedure for model selection, which does not suffer
even partially from any Look-Ahead Bias, has been implicitly suggested by Bauer and
Molenaar (2002), who build their models based on 60 months of training data and then select
the best model (or models) based on out-of-sample performance over 24 months following the
model training period. The advantage of this approach is that model selection is always based
on performance over a post-training-data period, as opposed to out-of-sample sub-periods
created artificially within the training data (in the case at hand, there are five such sub-periods
in the cross-validation procedure). The disadvantage however is that the selected model (or
models) following 24 months of post-training observed performance has to be used to predict
an outcome that is 25 months ahead of the actual model training period. That is yet another
reason why we have decided to opt for the cross-validation procedure – the selected model
out of this procedure can be used for the prediction of the month coming immediately after
the model training period. In this way all available (60 months of) most recent (and thus, most
relevant) data prior to the to-be-predicted month is used for model building.
7.7.3 Support Vector Machines versus the Data Snooping Bias
In the financial literature, the term “Data Snooping” is associated with the act of testing one’s model using the same data as previous studies (Haugen, 1999). At least partially, our models suffer from this bias: they take as a starting point a set of 17 factors, some or all of which
have been used by other studies. What is crucial to observe however is that the proposed
models in the thesis do not take into account in any way which of these factors appeared to be
important in these studies, since all of these 17 candidate factors have been used
simultaneously for our prediction purposes. The Support Vector Regression tool by
construction implicitly determines by itself for every single to-be-predicted month which
factors play a vital role, and which do not.
7.7.4 Support Vector Machines versus the Data Mining Bias
According to Haugen (1999) a Data Miner “spins the computer a thousand times; tries a
thousand ways to beat the market”. Invariably, the Data Miner is bound to hit the bull’s eye
once in a thousand times. The resulting “successful” model will most probably be due to chance rather than special merit. Support Vector Machine models do not suffer from the Data
Mining Bias, because the available computer has been “spun” only one single time. That is,
only one model, which includes all pre-specified factors, has been tested.
7.7.5 Support Vector Machines versus the Counterfeit Critique
The counterfeit critique stems from the observation that to beat the market “on paper” is quite different from beating the market “for real” (Haugen, 1999). It is precisely for this reason that we have constructed a real-time investment strategy. In this way we are as close to a real trading simulation as possible. True, the Support Vector Regression tool could not have been
used in the first couple of years of the trading period. However, from a present-time
viewpoint one can certainly assess the economic significance of applying Support Vector
Machines in stock market predictability by tracing the performance of a hypothetical investor
through (a considerable amount of) time.
Chapter 8
Experiments and Results
This chapter describes the actual experiments that have been carried out and the results obtained. The software program used throughout the analysis is LIBSVM 2.4, developed by Chih-Chung Chang and Chih-Jen Lin.
8.1 Experiments carried out with Support Vector Regression
When employing Support Vector Regression, prediction steps for any month t run as follows.
First of all, 60 months of training data available prior to month t are selected. The data consist
of the differences in returns between the S&P 500 Barra Value and Growth indices (the
explained variable), and the values for all 17 preset factors (the explanatory variables).
Second, Support Vector Regressions have been applied to the 60 months of training data prior
to any to-be-predicted month in order to select the best model. More concretely, a 5-fold
cross-validation procedure has been carried out to determine the best combination among C
(the complexity-error tradeoff parameter), ε (the level of insensitivity of the ε-insensitive loss
function), and a parameter inherent to the kernel function used.18 A tiny part of this procedure is visualized in Figure 13, where the vertical axis shows cross-validation mean squared errors for C∈(0,32), while keeping ε and the kernel function parameter fixed at 1.0 and 0.007, respectively. By the “best combination” of the parameters is meant the one that produces the minimal sum of squared errors between the true values and their corresponding
estimates coming out of the cross-validation procedure.

Figure 13. Five-fold cross-validation mean squared errors associated with complexity-error tradeoff parameter C∈(0,32), with the ε-insensitive loss function parameter (ε) fixed at 1.0 and the Radial Basis Function parameter at 0.007. The to-be-predicted month here is April 2000. The “best” model is the one for which the combination of the three parameters over suitable parameter ranges produces minimal cross-validation mean squared error.

18 The Radial Basis Function kernel has been used in the calculations. It has been examined by Burges (1998) and Smola and Schölkopf (1998), for example.
In practice, it is virtually impossible to find the truly best parameter combination from a cross-validation procedure, since the search for it would require infinitely many tests. For example, the C parameter is free to take any positive value, whereas empirical tests can be performed only over a finite number of those values. As suggested by the figure, however, the minimum of the cross-validation error is quite well defined.
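For reference, the Radial Basis Function kernel of footnote 18 takes, in LIBSVM’s parameterization, the form

$$ K(x, x') = \exp\bigl(-\gamma \,\lVert x - x'\rVert^{2}\bigr), $$

where γ is the kernel parameter fixed at 0.007 in the Figure 13 example.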
Post-computational result adjustments have been made in order to avoid the risk of placing unsubstantiated “trust” in borderline estimates. Thus, estimates within the arbitrarily chosen range of (-0.05, 0.05) relative to the average over the training period have been regarded as giving no clear indication of the forecasted direction of the value premium. In these “no signal” cases, no trading position should be taken.
The advantage of utilizing a cross-validation procedure throughout the analysis is that model
selection is based entirely on performance over (artificially created) out-of-sample data. The
disadvantages are that, first, there is no guarantee that the 5-fold cross-validation procedure will yield the best (approximately correct) model, and second, the procedure is in itself rather time-consuming (about three days for the whole 121-month estimation period on a computer with a 2.66 GHz processor and 512 MB of RAM).
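Putting the steps of this section together, the rolling “real-time” loop can be outlined as below; this is a sketch assuming the hypothetical helpers `select_best` and `trading_signal` from the earlier snippets, and chronologically ordered arrays `X` (months × 17 factors) and `y` (value premiums).

```python
# Outline of the rolling "real-time" simulation: for each to-be-predicted
# month t, train only on the 60 months preceding t, select the model by
# 5-fold cross-validation, and convert the prediction into a signal.
from sklearn.svm import SVR

def simulate(X, y, grid, window=60, band=0.05):
    signals = []
    for t in range(window, len(y)):            # t is the to-be-predicted month
        X_tr, y_tr = X[t - window:t], y[t - window:t]   # months t-60 .. t-1
        C, eps, gamma = select_best(X_tr, y_tr, grid)   # CV model selection
        model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
        model.fit(X_tr, y_tr)
        pred = model.predict(X[t:t + 1])[0]
        signals.append(trading_signal(pred, y_tr.mean(), band))
    return signals
```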
8.2 Results from Support Vector Regression Estimation
In this section we will present the main results from the value-growth and small-big rotation
strategies. These strategies include: the passive Value-minus-Growth strategy; the MAX strategy; the Support Vector Regression strategies for the one-, three- and six-month forecast horizons under different transaction-cost regimes; and, very briefly, the small-versus-big
strategies. Alongside, we will focus our attention on the extent to which the strategies are
consistent with each other.
8.2.1 Value-minus-Growth strategy
Let us first of all examine the Value-minus-Growth strategy of implicitly taking each month a long position in the Value index and a short position in the Growth index. The main results are outlined here; further details can be found in Appendix II. The Value-minus-Growth strategy has not performed very satisfactorily during the prediction (testing) period, which starts in January 1993 and ends in January 2003. The annualized mean return is merely 0.21%, and consequently the realized information ratio is not spectacular either, at 0.02. Investors that have followed this buy-and-hold strategy have experienced devastating maximal 3-month (-11.55%) and 12-month (-22.86%) losses. The high standard deviation of returns (10.90%) has also contributed to the overall distress. The sole bright feature of this strategy is the low level of transaction costs associated with it.
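For reference, the information ratios quoted in this chapter are consistent with the usual convention (stated here as an assumption, since the computation is not restated in this section) of dividing the annualized mean return by the annualized standard deviation of returns:

$$ IR \;=\; \frac{\bar r_{\text{ann}}}{\sigma_{\text{ann}}}, \qquad \bar r_{\text{ann}} = 12\,\bar r_{\text{m}}, \quad \sigma_{\text{ann}} = \sqrt{12}\,\sigma_{\text{m}}, $$

where $\bar r_{\text{m}}$ and $\sigma_{\text{m}}$ denote the mean and standard deviation of monthly strategy returns; indeed, 0.21/10.90 ≈ 0.02 for the Value-minus-Growth strategy above.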
8.2.2 “MAX” strategy
The “MAX” strategy is defined as the strategy of going long on the better-performing index
and short on the worse-performing index every month throughout the sample period. The
detailed results from this strategy for a transaction-cost level of 50bp single trip can be found
in Appendix II. This strategy shows the potential profit from style rotation. In a 0bp, 25bp,
and 50bp transaction-cost environment, the maximum annual mean return from style rotation is 27.14%, 24.21% and 21.29%, respectively. It is interesting to observe that in the 50bp transaction-cost case 20% of the months yield a negative performance. This comes as a result of the combination of the Value and Growth indices outperforming each other in consecutive months and a concomitant (absolute value of the) value premium of less than 0.5%. Such combinations, 24 in total, are most common between 1994 and 1999. The “MAX” strategy, unlike any of the model strategies based on Support Vector Regressions, goes long on value and short on growth stocks more than half of the time (in 53.72% of the months).
8.2.3 Basic model investment strategy
Detailed results from the basic model strategy (of forecasting one-month-ahead difference in
returns between the S&P 500 Value and Growth indices on the basis of the whole set of 17
factors and data for 60 months preceding each of the to-be-predicted months) are presented in
Table 1 below.19 What strikes most is that this strategy has produced much better results than the Value-minus-Growth one. Investors would have enjoyed an annualized mean return of 10.19%, under the assumption, however, of zero transaction costs. Combining these results with the relatively lower standard deviation of returns yields an (annualized) information ratio of 1.03 for the 121-month prediction period. It should be stressed, however,
that even when high transaction costs of 50 bp (single trip) are added into the calculations, the realized information ratio remains quite high (0.63) and statistically significant at the (two-tail) 5% level. The calculated Z(equality)-scores20 provide further strong evidence (in the 0bp and 25bp transaction-cost environments) of a significant performance difference between the basic model rotation strategy and the passive Value-minus-Growth one. Remarkably, the basic model investment strategy is able to capture more than one third of the return from the “MAX” strategy (in a 0bp and 25bp transaction-cost environment).
The positive skewness of the basic model adds to the bright picture, suggesting that the risk from following this strategy is somewhat lower than the one implied by the standard deviation
of returns. The largest 3-month (-5.90%) and 12-month (-8.07%) losses (in the zero-transaction-cost case) are substantially lower than those incurred by the Value-minus-Growth
strategy. Only one-third of the time has the basic strategy generated wrong signals. It is
interesting to note that about half of the time it preferred the Growth portfolio, while only
(slightly less than) one-third of the time it favored the Value one. In about 18% of the months
no positions have been taken.
19 Table 1 is reproduced in Appendix II as well.
20 Z(equality) measures the risk-adjusted performance difference between a switching Support Vector Regression strategy and the Value-minus-Growth strategy. The Z(equality)-score is computed in a standard way (in line with, e.g., Stanton, 1992).
Table 1
Results of the Value-versus-Growth Support Vector Regression rotation strategy using a one-month forecast horizon. Time frame: January 1993 – January 2003

S&P Barra,                        VmG           CV        CV        CV        MAX
1-month forecast horizon       (costs 0, 25   (costs    (costs    (costs    (costs
                                and 50 bp)     0 bp)     25 bp)    50 bp)    50 bp)
Mean                              0.21         10.19      8.21      6.23     21.29
Standard deviation               10.90          9.91      9.86      9.86      7.84
Information ratio                 0.02          1.03***   0.83***   0.63**    2.72***
Z(equality)                          –          2.15***   1.73*     1.30      4.99***
Median                           -0.11          0.32      0.31      0.30      0.50
Minimum (monthly)               -12.02         -5.51     -5.51     -5.51     -0.98
Maximum (monthly)                 9.74         12.02     11.77     11.52     11.02
Skewness (monthly)                0.01          1.23      1.19      1.14      1.61
Excess kurtosis (monthly)         2.44          2.71      2.57      2.38      3.40
Prop. negative months             0.46          0.33      0.50      0.50      0.20
Largest 3-month loss            -11.55         -5.90     -6.40     -6.90     -1.99
Largest 12-month loss           -22.86         -8.07    -11.51    -15.26      2.21
% months in Growth                0.00         52.89     52.89     52.89     46.28
% months in Value               100.00         28.93     28.93     28.93     53.72
% months no position              0.00         18.18     18.18     18.18      0.00

VmG denotes the Value-minus-Growth strategy. MAX denotes the perfect foresight rotation strategy. CV denotes the timing strategy based on Support Vector Regression cross-validation mean squared error. All numbers are annualized unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra Value and Growth indices.

The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Value”, then a position is taken that is long on the Value index and short on the Growth index. Note that if the optimal model produces no signal, then no trading position for month t+1 should be taken.

Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the current position is long-value / short-growth, and the signal for the following month is “Growth”, then 2 * 0.25% (1 * 0.25% for closing the current long-value / short-growth position, plus 1 * 0.25% for establishing a long-growth / short-value position) have to be deducted from the following month’s accrued (absolute value of the) value premium.

* indicates significance at the (2-tail) 10% level
** indicates significance at the (2-tail) 5% level
*** indicates significance at the (2-tail) 1% level
8.2.4 Three- and six-month horizon strategies
Table 2 and Table 3 in Appendix II present detailed results from the three- and six-month
forecast horizon strategies. The results are suggestive of strong consistency of the one-month
horizon results, as the real performance of the strategies is in the expected logical order: first
is the one-month horizon strategy, second is the three-month horizon strategy, and third
comes the six-month horizon strategy. For the zero-transaction-cost scenario, the lowest
achieved annual mean return comes from the six-month horizon strategy and stands at 4.95%, associated with an information ratio of 0.68. The standard deviations of returns for the three- and six-month horizon strategies are the lowest among all strategies: 8.43% and 7.31% respectively. The largest 3-month and 12-month relative losses from these two strategies are -9.12% and -8.77%. An interesting feature to observe is that as the estimation horizon
increases, the number of months with Growth position steadily rises to 67% (from about
50%), and the number of months with no position steadily drops from about 18% to less than
7%, suggesting that the strategy of incorporating information from models constructed in
earlier months tends to show a steadily increasing preference for growth stocks over taking no
trading positions.
8.2.5 Consistency of the strategies
What is remarkable to notice is that, first, all estimation horizons show quite similar patterns
and, second, that the one-month horizon strategy produces best results. One possible
interpretation of the latter fact is that the approach of utilizing models at present time that
were created in the (more and more distant) past is bound to yield inferior outcomes since
those models become more and more irrelevant. This is evident in particular during the two
periods of February 1993 – June 1993 and January 1999 – October 1999. During these periods the three- and six-month horizon strategies are able to catch up with the one-month horizon basic strategy only slowly. Not surprisingly, the average absolute value of the
value premium during these two periods considered as a whole, 2.81%, is substantially above
the average one computed over the twelve months preceding each of the two periods, 1.82%.
Thus, the one-month horizon strategy appears, as logically expected, to be the first one to
“sense” upcoming turbulent developments on the stock market. The three- and six-month
horizon strategies follow suit, but with a time lag, most probably due to their lack of taking
into account recent (from the point of view of the to-be-predicted month) relevant
information. In spite of this, during non-turbulent times both the three- and the six-month horizon strategies seem to perform almost identically to the one-month horizon strategy. This
consistency suggests, above all, that the results produced by the basic model are reliable:
relatively “minor” extensions to the basic model (such as considering three- and six-month
forecast horizons) do not influence abruptly the main outcomes. Moreover, the direction and
the extent of the influence appear logical. Illustrative of these findings is Figure 14, which graphs the cumulative returns throughout the entire prediction period of all three horizon strategies (for the zero-transaction-cost scenario), plus, for comparison, the cumulative returns of the Value-minus-Growth strategy.
Figure 14. Accrued cumulative returns from the Value-minus-Growth strategy and the Support Vector Regression (SVR) one-, three-, and
six-month horizon strategies for the period January 1993 – January 2003. The one-month horizon strategy performs best, gaining most of its
accumulated profits during turbulent times on the financial market. In such periods, the three- and six-month horizon models follow suit
with a time lag, as expected. During relatively calmer periods, all strategies perform similarly.
Figure 15 shows the investment style signals associated with the basic model strategy. Note
that style signals, unlike realized excess returns, are not affected by the level of transaction
costs.21 According to the figure, the predominant investment style signals during this period
are “Growth”, with some notable exceptions however. “Value” signals have been produced mostly in 1993, in the beginning of 1994, and in the first half of 2001. Almost no “Value” signals have been given during the periods that stretch from June 1996 till August 1998, and from June 1999 till November 2000.

Figure 15. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by the basic model investment strategy.

21 This is assumed to be true in our case since we regard the estimates of the Support Vector Regression tool as an indication of the direction of the value-growth return difference, and not the amount of that difference. If, however, estimates are associated with the expected amount of the value premium, then the level of transaction costs will influence the investment decision if the expected (absolute value of the) value premium is low.
8.2.6 Non-zero transaction cost scenarios
Adding transaction costs of 25 bp and 50 bp single trip into the calculations does not change
the results abruptly, as shown in Appendix II and Appendix III. For the worst-case scenario of
50 bp single trip costs, the basic model strategy still performs exceptionally well, managing a
significant (at the 5% two-tail level) information ratio of 0.63. The item that appears to have
deteriorated most as compared to the zero-cost case is the maximum 12-month relative loss,
which has dropped down to -15.26%. Considering the other two horizon strategies, the lowest
possible information ratio achieved stands at 0.27, arising from the six-month strategy in a
50bp transaction-cost environment.
Figure 16 presents the realized excess returns of the basic model strategy in the 25 bp transaction-cost scenario. It can be seen from the figure that most of the accrued returns come from the last four years of the sample period, which actually appear to be the most volatile.
Figure 16. Realized excess returns forecasted by the basic investment strategy for the 25 bp transaction costs scenario.
8.2.7 Small-versus-Big Strategies
The results from the Small-versus-Big Support Vector Regression strategy and the Small-minus-Big and MAX_SB strategies mentioned in section 7.6 can be found in full in Appendix 6 and Appendix 7. In the sample period of January 1993 – January 2003 the passive Small-minus-Big rotation strategy, unlike the Value-minus-Growth strategy, achieves a negative annual return of –1.28%. The MAX_SB strategy attains a 26.76% annual return in the 50bp transaction-cost scenario, which is 5.47% more than the corresponding result for the MAX strategy presented in section 8.2.2. This fact reveals that the potential benefit from Small-versus-Big rotation is much greater than that from the corresponding Value-versus-Growth rotation.22 If the Support Vector Regression tool can capture this extra potential, then, first, greater credibility would be lent to Support Vector Regressions as a tool for constructing factor models, and second, one could claim that there is a consistency between the Small-versus-Big and Value-versus-Growth Support Vector Regression strategies. Our results show that this extra potential can indeed be captured by the Small-versus-Big Support Vector Regression tool. For the zero-transaction-cost regime, for example, the one-, three- and six-month forecast horizon Small-versus-Big strategies produce 10.66%, 7.95% and 7.64% annual returns, while the respective results from the Value-versus-Growth strategies are 10.19%, 5.77% and 4.95%.

22 This will be true if the market impact from Small-versus-Big rotation is the same as that from Value-versus-Growth rotation.
8.3 Results from the Classification Reformulation of the Regression Problem
Detailed results from the classification problem reformulation of the basic value-growth
regression problem of section 7.4 are presented in Appendix IV. This reformulation can be
used, as mentioned in section 7.5, to make yet another kind of consistency test (next to
considering three- and six-month forecast horizon strategies) for the basic model strategy. The
classification results are expected, logically, to be worse than the regression results. If these
expectations materialize in practice, then they could serve as an indication of consistency of
the basic model strategy. All classification experiments have been carried out in complete
analogy to the regression experiments of section 8.1. There are only two differences. First, the
actual positive/negative monthly value premiums have been replaced with “+ 1” / “– 1”
values respectively in all computations, implying that in classification the months when value
stocks outperformed growth stocks are labeled “+1”, and the months when growth stocks
outperformed value stocks are labeled “-1”. Second, the ε parameter of the ε-insensitive loss function disappears from the calculations, since it is inherent only to regression estimation problems.
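In code, the reformulation amounts to replacing the regression target by its sign and the ε-SVR by a classifier; the sketch below assumes Python with scikit-learn’s SVC (which, like SVR, wraps LIBSVM) and purely illustrative data.

```python
# Sketch of the classification reformulation: the value premium is replaced
# by its sign (+1: value outperforms growth, -1: growth outperforms value),
# and an SVM classifier is trained instead of an SVR. Illustrative only.
import numpy as np
from sklearn.svm import SVC

def to_labels(value_premium: np.ndarray) -> np.ndarray:
    """Label months +1 / -1 according to the sign of the value premium."""
    return np.where(value_premium > 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 17))                 # 60 months, 17 factors
y = to_labels(rng.normal(size=60))            # signed stand-in targets

clf = SVC(kernel="rbf", C=8.0, gamma=0.007)   # no epsilon parameter here
clf.fit(X, y)
signal = clf.predict(X[-1:])[0]               # +1 -> value, -1 -> growth
```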
All results from the reformulated classification problem are much worse than those from the
original regression problem. Considering for example the one-month forecast horizon
strategy, in analogy to the regression problem, investors achieve a modest 2.34% mean annual
return for the January 1993 – January 2003 sample period in the zero-transaction-cost
scenario. This result is more than 4 times worse than the corresponding regression result. The
standard deviation of annual returns in this case stands at 10.88%, suggesting that this strategy
is more than 9.78% more volatile than the corresponding regression strategy. As a result, the
realized information ratio from this classification strategy stands quite low at 0.21. The results
for the three- and six-month horizon classification strategies are similar to, but slightly worse than, those of the one-month horizon strategy. None of the horizon strategies seems to produce a performance radically different from that of the passive Value-minus-Growth strategy (see Figure 17).
The results from classification, though worse than those from regression, could be useful in
two main ways. First of all, the relatively worse results are anything but unexpected. It seems quite
logical that when all months with positive / negative value premiums are given artificially
equal values of plus one / minus one, then some model prediction power would be lost. This
logic is substantiated in practice by the empirical tests. Second, one can compare the results
produced by the different horizon strategies, and look for (in)consistencies. These results are,
as in the regression problem, consistent with each other, as illustrated in Figure 17.
Remarkably, the performance order of the strategies applied in the classification problem is
the same as the performance order of strategies in the regression problem. Regarding
differences in performance dynamics between the classification and regression problems,
what is striking is that it is especially during the turbulent period of January 1999 – October 1999 that the Support Vector Machines for classification lose ground and produce negative excess returns, whereas Support Vector Regression gains momentum. This pattern, though less pronounced, is repeated throughout 2001.
Figure 17. Accrued cumulative returns from the Value-minus-Growth strategy and Support Vector Classification (SVC) one-, three-, and six-month horizon strategies for the period January 1993 till January 2003 under the zero-transaction-cost regime. The one-month horizon
strategy performs best, gaining most of its accumulated profits during turbulent times on the financial market, as in the regression model
formulation. In such periods, the three- and six-month horizon models follow suit with a time lag, as expected. During relatively calmer
periods, all strategies perform similarly.
Chapter 9
Conclusion
The purpose of this research is to employ the theoretical opportunities which Support Vector
Machines are expected to provide over common financial factor models in the practical
context of constructing Value-versus-Growth rotation strategies. The biggest theoretical
advantage of utilizing Support Vector Machines is that numerous factors can be included in
one model simultaneously, without a loss of generalization (and thus, prediction) ability. The
biggest practical outcome is that the basic model strategy, which is the one that is logically
expected to perform best, shows remarkable consistency and robustness of results and
produces exceptionally high information ratios.
A number of important theoretical and practical conclusions appear to stand out from this master’s thesis. From a theoretical viewpoint, one may conclude that it pays to investigate the modeling tool of Support Vector Machines when constructing financial factor models. First of all, this tool enhances the features of state-of-the-art factor models by providing
the following opportunities: (1) to achieve robust results in the process of model building
when the candidate explanatory variables are numerous and are considered as a group; (2) to
alter manually model complexity without changing the number and nature of explanatory
variables and arrive automatically at a new optimal (in the sense of achieving best
generalization ability) model for each complexity alteration; (3) to specify numerous investor
loss functions and arrive automatically at a new optimal model for each loss function; (4) to
choose among numerous optimal models corresponding to various possible combinations of
Support Vector Machines parameters via a cross-validation model selection procedure,
ensuring in this process both that model selection is based only on (artificially created) out-of-sample performance and that the most recent data is used for model building. Second,
Support Vector Machines are able to cope with common factor model shortcomings, such as
Data Mining Bias, Look-Ahead Bias, Data Snooping Bias, and Counterfeit. This is, above all, due to the theoretical property of Support Vector Machines of managing with ease multi-dimensional attribute spaces, and to the employed standard cross-validation model selection procedure.
Usually, factor models fall back on linear regression analyses in their various forms in order
to come up with a reliable lucrative investment strategy. Commentators often select their best
models in this process based on widely established overfitting-correction statistical criteria,
such as adjusted R2, AIC, BIC, or financial criteria, such as hit ratio, information ratio, etc.
Alongside, a set of most appealing candidate explanatory factors is typically extracted from a
(long) list of potential explanatory factors. Support Vector Machines, and Support Vector
Regressions in particular, offer a different approach. They usually make use of all available information as a whole, and attempt to find the best non-linear decision surface in the space defined by the explanatory and explained variables, which can be represented as a linear surface in some higher-dimensional feature space.
being assigned to each data point in the feature space, which may vastly contradict the
respective data-point weights coming out of a multiple linear regression analysis.
Additionally, Support Vector Machines offer a myriad of overfitting-correction possibilities
that do not have a direct analogy in multiple linear regression analysis, which can be applied,
quite remarkably, without changing the number and nature of explanatory factors in a given
model. These possibilities are given by the utilization of the complexity-error tradeoff
parameter, ε-insensitive parameter and kernel function parameters (if any).
From a practical point of view, Support Vector Machines have been shown to be able to produce investment strategies that outperform the passive Value-minus-Growth strategy more than 39 times, net of 25bp single-trip transaction costs, and almost 50 times in a zero-transaction-cost environment, for the sample period of January 1993 – January 2003. The information ratios for the basic model strategy are robust and extraordinarily high: 0.83 and 0.63 for the 25bp and 50bp transaction-cost scenarios respectively.
basic investment strategy has been tested against (some) modifications in order to assess its
reliability in a better way. All tested model variations seem to show remarkable consistency,
where the best logically expected model performs best (the one-month forecast horizon
strategy), followed by the models expected to perform worse (the three- and six-month
forecast horizon strategies). Especially during turbulent financial times, the modified strategies, which base their decisions heavily on models that have been constructed several months before the actual prediction month, fail to keep pace with the basic model strategy, as logically expected. Another possible test of consistency of the basic model
strategy is to reformulate the original regression problem into a classification problem. The
results from the classification reformulation are worse, as expected, which once again testifies
to the consistency of the basic model (regression) strategy.
In spite of the appealing results, a number of open, unresolved issues have not been touched upon in this thesis. For example, there is no guarantee that the pre-specified factor set used to create the models contains most (or even enough) of the information needed to forecast Value-versus-Growth monthly returns. The reverse could also be true: it is possible that some of the explanatory factors are actually unnecessary, in which case they should be excluded from the models. The procedure to test for this latter possibility is computationally quite demanding and has therefore not been carried out. More broadly, in order to assess fully the applicability of Support Vector Machines in finance, they have to be tested in different financial areas and on different types of financial data sets. For example, it would be interesting to apply Support Vector Machines to the so-called “Small-versus-Big” rotation strategies, which predict the monthly difference in returns between stocks with relatively higher market capitalization and stocks with relatively lower market capitalization. The results from this kind of strategy, which utilizes Support Vector Regressions, are shown in Appendix VI and Appendix VII. It could also be argued that a Bayesian type of inference should be applied to model selection (Cremers, 2002), but the analysis of such topics, as well as of the more technical issues that arise from within Support Vector Machines, falls outside the scope of this master’s thesis.
References
Asness, C., J. Friedman, R. Krail, and J. Liew, Style timing: value versus growth, Journal of Portfolio Management 26, pp. 51-60, 2000.
Avramov, D., Stock return predictability and model uncertainty, Journal of Financial Economics 64, pp. 423-458, 2002.
Banz, R., The relationship between return and market value of common stocks, Journal of
Financial Economics 9, pp. 3-18, 1981.
Basu, S., Investment performance of common stocks in relation to their price-earnings ratios,
Journal of Finance 32, pp. 663-682, 1977.
Bauer, R. and R. Molenaar, Is the Value Premium Predictable in Real Time?, Working Paper 02-003, Limburg Institute of Financial Economics, 2002.
Bauer, R., J. Derwall and R. Molenaar, The Real-Time Predictability of the Size and Value Premium in Japan, Working Paper 03-011, Limburg Institute of Financial Economics, 2003.
Bhandari, L., Debt/equity ratio and expected common stock returns, Journal of Finance 43,
pp. 507-528, 1988.
Bishop, C. and M. Tipping, Variational relevance vector machines, In Proceedings of the 16th
Conference in Uncertainty in Artificial Intelligence, pp. 46-53, Morgan Kaufmann
Publishers, 2000.
Burbidge, R., and B. Buxton, An Introduction to Support Vector Machines for Data Mining,
Keynote Papers, Young OR12, 2001.
Burges, C., A tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery 2, pp 121-167, 1998.
Chan, L., Y. Hamao, and J. Lakonishok, Fundamentals and stock returns in Japan, Journal of
Finance 46, pp. 1739-1764, 1991.
Chan, L., N. Jegadeesh, and J. Lakonishok, Momentum strategies, Journal of Finance 51, pp.
1681-1713, 1996.
Chang, C.-C. and C.-J. Lin, LIBSVM: a library for support vector machines, 2002. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Copeland, M. and T. Copeland, Market timing: style and size rotation using the VIX, Financial Analysts Journal 55, pp. 73-81, 1999.
Cremers, K., Stock Return Predictability: A Bayesian Model Selection Perspective, Review of
Financial Studies, Vol. 15, No. 4, pp. 1223-1249, 2002.
Cristianini, N. and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge
University Press, 2000.
Fama, E. and K. French, The cross-section of expected stock returns, Journal of Finance 47,
pp. 427-465, 1992.
Fama, E. and K. French, Common risk factors in the returns on stocks and bonds, Journal of
Financial Economics 33, pp. 3-53, 1993.
Fama, E. and K. French, Value versus growth: the international evidence, Journal of Finance
53, pp. 1975-1999, 1998.
Haugen, R., The Inefficient Stock Market: What Pays Off and Why, Second Edition. Upper
Saddle River, N.J.: Prentice Hall, 1999.
Haugen, R., Modern Investment Theory, Fifth Edition. Upper Saddle River, N.J.: Prentice
Hall, 2001.
Jensen, G., R. Johnson, and J. Mercer, New Evidence on Size and Price-to-Book Effects in Stock Returns, Financial Analysts Journal 53, pp. 34-42, 1997.
Kahn, V., A Question of Style: Must Consistency Equal Mediocrity in Mutual Funds?
Financial World, pp. 70-75, 1996.
Kao, D. and R. Shumaker, Equity style timing, Financial Analysts Journal 55, pp. 37-48, 1999.
Keim, D., Dividend Yields and Stock Returns: Implications of Abnormal January Returns,
Journal of Financial Economics, 14, pp. 473-489, 1985.
Keim, D. and R. Stambaugh, Predicting returns in the stock and bond markets, Journal of
Financial Economics 17, pp. 357-390, 1986.
La Porta, R., Expectations and the cross-section of returns, Journal of Finance 51, pp. 1715-1742, 1996.
Lakonishok, J., A. Shleifer, and R. Vishny, Contrarian investment, extrapolation and risk, Journal of Finance 49, pp. 1541-1578, 1994.
Levis, M. and M. Liodakis, The profitability of style rotation strategies in the United
Kingdom, Journal of Portfolio Management 25, pp. 73-86, 1999.
Liew, J. and M. Vassalou, Can book-to-market, size and momentum be risk factors that
predict economic growth?, Journal of Financial Economics 57, pp. 221-245, 2000.
Lucas, A., R. Van Dijk, and T. Kloek, Stock selection, style rotation, and risk, Journal of
Empirical Finance 9, pp. 1-34, 2002.
Macedo, R., Value, Relative Strength, and Volatility in Global Equity Country Selection,
Financial Analysts Journal 51, No.2, pp. 70-78, 1995.
Maragoudakis, M., K. Kermanidis, N. Fakotakis, and G. Kokkinakis, Combining Bayesian
and Support Vector Machines Learning to automatically complete Syntactical Information
for HPSG-like Formalisms, 2002, http://slt.wcl.ee.upatras.gr/papers/maragoudakis10.pdf
Monteiro, A., Interest Rate Curve Estimation: A Support Vector Regression Application to
Finance, 2001, http://www.princeton.edu/~monteiro/SVM%20swaps.pdf
Müller, K., A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik. Predicting
time series with support vector machines, Proceedings of the International Conference on
Artificial Neural Networks, Springer Lecture Notes in Computer Science. Springer, 1997.
Müller, K., S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, An Introduction to Kernel-Based
Learning Algorithms, IEEE Transactions on Neural Networks, Vol.12, No.2, 2001.
Pérez-Cruz, F., J. Afonso-Rodríguez, and J. Giner, Estimating GARCH models using support
vector machines, Quantitative Finance 3, pp. 163-172, 2003.
Pesaran, M. and A. Timmermann, Predictability of stock returns: robustness and economic
significance, Journal of Finance 50, pp. 1201-1228, 1995.
Pesaran, M., Stock Market Regressions, 2003, http://www.econ.cam.ac.uk/faculty/pesaran/ReturnRegressions.pdf
Rocco, C. and J. Moreno, A support vector machine model for currency crisis discrimination,
2001, http://www.aiecon.org/staff/shc/course/annga/CIEF4-2.ps
Rosenberg, B., K. Reid, and R. Lanstein, Persuasive evidence of market inefficiency, Journal
of Portfolio Management 11, pp. 9-17, 1985.
Smola, A., Regression estimation with Support Vector Learning Machines, Master's thesis,
Technical University Munich, 1996.
Smola, A. and B. Schölkopf, A tutorial on support vector regression, NeuroCOLT2 Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
Sorensen, E., and C. Lazzara, Equity Style Management: The Case of Growth and Value, In R.
Klein and J. Lederman, eds., Equity Style Management: Evaluating and Selecting
Investment Styles, Burr Ridge, IL: Irwin, 1995.
Stanton, A., Primer of Biostatistics, 3rd Ed., McGraw-Hill, New York, pp. 81-88, 1992.
Thaler, R. (ed.), Advances in Behavioral Finance, Russell Sage Foundation, 1993.
Tipping, M., The relevance vector machine, In Advances in Neural Information Processing Systems, No. 12, pp. 652-658, MIT Press, 2000.
Trafalis, T. and H. Ince, Support Vector Machine for Regression and Applications to
Financial Forecasting, 2000, http://212.67.202.199/~msewell/svm/regression/TrIn00.pdf
Van Gestel, T., B. Baesens, J. Garcia, and P. Van Dijcke, A Support Vector Machine
Approach to Credit Scoring, 2003,
http://www.geocities.com/joaogarcia18/BANKFINVer4.pdf
Vapnik, V., The Nature of Statistical Learning Theory, Springer, New York, 1995; 2nd
edition, 2000.
Woodford, B., Comparative analysis of the EFuNN and the Support Vector Machine models
for the classification of horticulture data, 2001,
http://divcom.otago.ac.nz/infosci/kel/CBIIS/pubs/ps-gz/wood-kasa-annes2001.ps.gz
Appendix I
Factors used in all Value-versus-Growth regression and classification models.
All data are provided by ABP Investments. The factors are the same as those employed by
Bauer and Molenaar (2002).
Technical variables are:
∗ Lagged Value/Growth spread
∗ Lagged Small/Large spread
∗ VIX: the 3-month change in the VIX indicator
∗ 12 month Forward P/E (S&P 500)
∗ 3 month return momentum (S&P 500)
∗ Profit cycle: Year on Year change in earnings per share of the S&P 500
∗ PE dif: the difference between the P/E ratios of the Barra Value and Barra Growth indices
∗ DY dif: the difference between the dividend yields of the Barra Value and Barra Growth indices
Economic variables are:
∗ Corporate Credit Spread: the yield spread of (Lehman Aggregate) Baa over Aaa
∗ Core inflation: the 12-month trailing change in the U.S. Consumer Price Index
∗ Earnings-yield gap: the difference between forward E/P ratio (S&P 500) and the 10-year T-bond yield
∗ Yield Curve Spread: the yield spread of 10-year T-bonds over 3-month T-bills
∗ Real Bond Yield: the 10-year T-bond yield adjusted for the 12-month trailing inflation rate
∗ Ind. Prod: U.S. Industrial Production Seasonally Adjusted
∗ Oil Price: the 1-month price change
∗ ISM (MoM): 1-month change of US ISM Purchasing Managers Index (Mfg Survey)
∗ Leading Indicator: the 12-month change in the Conference Board Leading Indicator
Appendix II
Tables showing the results from different Support Vector Regression Value-versus-Growth investment strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Table 1.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
one-month forecast horizon.
Table 2.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
three-month forecast horizon.
Table 3.
Results Value-versus-Growth Support Vector Regression rotation strategy using a
six-month forecast horizon.
Table 1
Results Value-versus-Growth Support Vector Regression rotation strategy using a one-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
1-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           10.19          8.21           6.23           21.29
Standard deviation              10.90          9.91           9.86           9.86           7.84
Information ratio               0.02           1.03***        0.83***        0.63**         2.72***
Z(equality)                     –              2.15***        1.73*          1.30           4.99***
Median                          -0.11          0.32           0.31           0.30           0.50
Minimum (monthly)               -12.02         -5.51          -5.51          -5.51          -0.98
Maximum (monthly)               9.74           12.02          11.77          11.52          11.02
Skewness (monthly)              0.01           1.23           1.19           1.14           1.61
Excess kurtosis (monthly)       2.44           2.71           2.57           2.38           3.40
prop. negative months           0.46           0.33           0.50           0.50           0.20
Largest 3-month loss            -11.55         -5.90          -6.40          -6.90          -1.99
Largest 12-month loss           -22.86         -8.07          -11.51         -15.26         2.21
% months in Growth              0.00           52.89          52.89          52.89          46.28
% months in Value               100.00         28.93          28.93          28.93          53.72
% months no position            0.00           18.18          18.18          18.18          0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Value”, then a
position is taken that is long on the Value index and short on the Growth index. Note that if the optimal model
produces no signal, then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-value / short-growth, and the signal for the following month is “Growth”, then 2 * 0.25%
(1* 0.25% for closing the current long-value / short-growth position, plus 1* 0.25% for establishing a long-growth / short-value position) have to be deducted from the following month’s accrued (absolute value of the)
value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
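For concreteness, the position-and-cost rule described in the notes above can be sketched as follows (Python; the signal and premium series in the example are hypothetical, and charging the trading cost in the month a new position accrues is a simplifying convention consistent with the note’s wording):

    import numpy as np

    def strategy_returns(signals, premium, cost_bp=25):
        """Net monthly returns of the rotation strategy.

        signals : monthly positions, +1 = Value, -1 = Growth, 0 = none
        premium : monthly Value-minus-Growth returns (as decimals)
        cost_bp : single-trip transaction cost in basis points
        """
        signals = np.asarray(signals, dtype=float)
        premium = np.asarray(premium, dtype=float)
        cost = cost_bp / 10_000.0
        # Single trips per month equal the absolute change in position:
        # Value -> Growth is two trips (close one long/short pair, open
        # the other); no position -> Value is one trip.
        prev = np.concatenate(([0.0], signals[:-1]))
        trips = np.abs(signals - prev)
        return signals * premium - trips * cost

    # Hypothetical example: enter Value, switch to Growth, then hold.
    print(strategy_returns([1, -1, -1], [0.004, -0.002, 0.003]))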
Table 2
Results Value-versus-Growth Support Vector Regression rotation strategy using a
three-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
3-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           5.77           4.15           2.53           21.29
Standard deviation              10.90          8.43           8.47           8.53           7.84
Information ratio               0.02           0.68**         0.49           0.30           2.72***
Z(equality)                     –              1.28           0.91           0.53           4.99***
Median                          -0.11          0.38           0.38           0.38           0.50
Minimum (monthly)               -12.02         -4.28          -4.28          -4.28          -0.98
Maximum (monthly)               9.74           12.02          11.94          11.85          11.02
Skewness (monthly)              0.01           1.76           1.76           1.74           1.61
Excess kurtosis (monthly)       2.44           5.59           5.56           5.45           3.40
prop. negative months           0.46           0.43           0.55           0.58           0.20
Largest 3-month loss            -11.55         -8.39          -8.47          -8.55          -1.99
Largest 12-month loss           -22.86         -8.77          -11.02         -13.27         2.21
% months in Growth              0.00           62.81          62.81          62.81          46.28
% months in Value               100.00         28.93          28.93          28.93          53.72
% months no position            0.00           8.26           8.26           8.26           0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Value”, “Growth”, and “Value”, then the combined signal is “1/3 Value”. Out of this
combined signal, a position is taken that is long 1/3 of the Value index and short 1/3 of the Growth index. Note
that if the optimal models produce a combined “no signal” signal, then no trading position for month t+1 should
be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-value / ½ short-growth, and the signal for the following month is “½ Growth”, then 2 *
0.125% (1* 0.125% for closing the current ½ long-value / ½ short-growth position, plus 1* 0.125% for
establishing a ½ long-growth / ½ short-value position) have to be deducted from the following month’s accrued
(absolute value of the) value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
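The unweighted averaging of signals into a fractional position, as described in the notes above (and used analogously for the six-signal variant of Table 3), can be sketched in a few lines; the signal values are hypothetical:

    def combined_position(signals):
        # Average several one-month signals (+1 = Value, -1 = Growth,
        # 0 = no signal) into a fractional position in [-1, 1].
        return sum(signals) / len(signals)

    # Example from the notes: "Value", "Growth", "Value" -> 1/3 Value.
    print(combined_position([1, -1, 1]))           # 0.333... = long 1/3 Value
    # Six-signal example: "Value" x3, "no signal", "Growth", "Value".
    print(combined_position([1, 1, 1, 0, -1, 1]))  # 0.5 = long 1/2 Value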
Table 3
Results Value-versus-Growth Support Vector Regression rotation strategy using a six-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
6-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           4.95           3.48           2.00           21.29
Standard deviation              10.90          7.31           7.34           7.39           7.84
Information ratio               0.02           0.68**         0.47           0.27           2.72***
Z(equality)                     –              1.15           0.79           0.43           4.99***
Median                          -0.11          0.19           0.18           0.18           0.50
Minimum (monthly)               -12.02         -4.28          -4.28          -4.28          -0.98
Maximum (monthly)               9.74           8.01           7.97           7.93           11.02
Skewness (monthly)              0.01           1.12           1.11           1.08           1.61
Excess kurtosis (monthly)       2.44           2.19           2.15           2.08           3.40
prop. negative months           0.46           0.45           0.55           0.59           0.20
Largest 3-month loss            -11.55         -9.12          -9.24          -9.37          -1.99
Largest 12-month loss           -22.86         -8.15          -10.19         -12.23         2.21
% months in Growth              0.00           66.94          66.94          66.94          46.28
% months in Value               100.00         26.45          26.45          26.45          53.72
% months no position            0.00           6.61           6.61           6.61           0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P 500 Barra
Value and Growth indices.
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Value”, “Value”, “Value”, “no signal”, “Growth”, and “Value”, then the
combined signal is “½ Value”. Out of this combined signal, a position is taken that is long ½ of the Value index
and short ½ of the Growth index. Note that if the optimal models produce a combined “no signal” signal, then no
trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-value / ½ short-growth, and the signal for the following month is “½ Growth”, then 2 *
0.125% (1* 0.125% for closing the current ½ long-value / ½ short-growth position, plus 1* 0.125% for
establishing a ½ long-growth / ½ short-value position) have to be deducted from the following month’s accrued
(absolute value of the) value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Appendix III
Figures showing the results from different Value-versus-Growth investment strategies
and different cost scenarios.
Time frame: January 1993 – January 2003
Figure A3.1. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the one-month forecast horizon Support Vector Regression rotation
strategy under different transaction-cost regimes.
Figure A3.2. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the one-month forecast horizon Support Vector Regression rotation strategy.
Figure A3.3. Realized excess returns by the one-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A3.4. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the three-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A3.5. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the three-month forecast horizon Support Vector Regression rotation strategy.
Figure A3.6. Realized excess returns by the three-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A3.7. Accrued cumulative monthly returns from the Value-versus-Growth strategy
and the six-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A3.8. Investment signals (“value” = 1, “growth” = -1, “no signal” = 0) produced by
the six-month forecast horizon Support Vector Regression rotation strategy.
Figure A3.9. Realized excess returns by the six-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
[Figures A3.1 – A3.9 appear here; see the captions listed above.]
Appendix IV
Tables showing the results from different Support Vector Classification investment
strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Table 4.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a one-month forecast horizon.
Table 5.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a three-month forecast horizon.
Table 6.
Results Value-versus-Growth Support Vector Classification rotation strategy using
a six-month forecast horizon.
Table 4
Results Value-versus-Growth Support Vector Classification rotation strategy using a
one-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
1-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           2.34           -0.54          -3.42          21.29
Standard deviation              10.90          10.88          11.00          11.19          7.84
Information ratio               0.02           0.21           -0.05          -0.31          2.72***
Z(equality)                     –              0.44           -0.15          -0.74          4.99***
Median                          -0.11          0.13           0.10           0.08           0.50
Minimum (monthly)               -12.02         -9.48          -9.98          -10.48         -0.98
Maximum (monthly)               9.74           12.02          12.02          12.02          11.02
Skewness (monthly)              0.01           0.43           0.48           0.51           1.61
Excess kurtosis (monthly)       2.44           2.35           2.45           2.48           3.40
prop. negative months           0.46           0.54           0.55           0.58           0.20
Largest 3-month loss            -11.55         -13.28         -13.78         -14.28         -1.99
Largest 12-month loss           -22.86         -15.18         -19.68         -24.18         2.21
% months in Growth              0.00           53.72          53.72          53.72          46.28
% months in Value               100.00         46.28          46.28          46.28          53.72
% months no position            0.00           0.00           0.00           0.00           0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Value”, then a
position is taken that is long on the Value index and short on the Growth index.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-value / short-growth, and the signal for the following month is “Growth”, then 2 * 0.25%
(1* 0.25% for closing the current long-value / short-growth position, plus 1* 0.25% for establishing a long-growth / short-value position) have to be deducted from the following month’s accrued (absolute value of the)
value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Table 5
Results Value-versus-Growth Support Vector Classification rotation strategy using a
three-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
3-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           0.28           -1.54          -3.36          21.29
Standard deviation              10.90          8.81           8.84           8.90           7.84
Information ratio               0.02           0.03           -0.17          -0.38          2.72***
Z(equality)                     –              0.02           -0.40          -0.81          4.99***
Median                          -0.11          0.01           -0.01          -0.04          0.50
Minimum (monthly)               -12.02         -9.48          -9.81          -10.14         -0.98
Maximum (monthly)               9.74           9.74           9.57           9.40           11.02
Skewness (monthly)              0.01           -0.03          -0.06          -0.10          1.61
Excess kurtosis (monthly)       2.44           4.43           4.42           4.36           3.40
prop. negative months           0.46           0.50           0.55           0.61           0.20
Largest 3-month loss            -11.55         -9.85          -9.86          -10.36         -1.99
Largest 12-month loss           -22.86         -18.28         -20.61         -22.95         2.21
% months in Growth              0.00           55.37          55.37          55.37          46.28
% months in Value               100.00         40.50          40.50          40.50          53.72
% months no position            0.00           4.13           4.13           4.13           0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Value”, “Growth”, and “Value”, then the combined signal is “1/3 Value”. Out of this
combined signal, a position is taken that is long 1/3 of the Value index and short 1/3 of the Growth index.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is 1/3-long-value / 1/3-short-growth, and the signal for the following month is “1/3 Growth”,
then 2*(1/3)*0.25% (1*(1/3)*0.25% for closing the current 1/3-long-value / 1/3-short-growth position, plus
1*(1/3)*0.25% for establishing a 1/3-long-growth / 1/3-short-value position) have to be deducted from the
following month’s accrued (absolute value of the) value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Table 6
Results Value-versus-Growth Support Vector Classification rotation strategy using a
six-month forecast horizon. Time frame: January 1993 – January 2003
S&P Barra                       VmG            CV             CV             CV             MAX
6-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            0.21           -0.33          -1.79          -3.25          21.29
Standard deviation              10.90          8.02           8.02           8.05           7.84
Information ratio               0.02           -0.04          -0.22          -0.40          2.72***
Z(equality)                     –              -0.13          -0.47          -0.81          4.99***
Median                          -0.11          0.01           -0.02          -0.05          0.50
Minimum (monthly)               -12.02         -9.48          -9.81          -10.14         -0.98
Maximum (monthly)               9.74           9.74           9.40           9.07           11.02
Skewness (monthly)              0.01           0.09           0.02           -0.05          1.61
Excess kurtosis (monthly)       2.44           5.48           5.39           5.26           3.40
prop. negative months           0.46           0.49           0.60           0.61           0.20
Largest 3-month loss            -11.55         -10.02         -10.14         -10.27         -1.99
Largest 12-month loss           -22.86         -18.16         -19.74         -21.32         2.21
% months in Growth              0.00           57.02          57.02          57.02          46.28
% months in Value               100.00         32.23          32.23          32.23          53.72
% months no position            0.00           10.74          10.74          10.74          0.00
VmG denotes Value-minus-Growth strategy. MAX denotes perfect foresight rotation strategy. CV denotes the
timing strategy based on Support Vector Classification Cross Validation Mean Squared Error. All numbers are
annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P Barra Value
and Growth indices.
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Value”, “Value”, “Value”, “no signal”, “Growth”, and “Value”, then the
combined signal is “½ Value”. Out of this combined signal, a position is taken that is long ½ of the Value index
and short ½ of the Growth index. Note that if the optimal models produce a combined “no signal” signal, then no
trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is 1/3-long-value / 1/3-short-growth, and the signal for the following month is “1/3 Growth”,
then 2*(1/3)*0.25% (1*(1/3)*0.25% for closing the current 1/3-long-value / 1/3-short-growth position, plus
1*(1/3)*0.25% for establishing a 1/3-long-growth / 1/3-short-value position) have to be deducted from the
following month’s accrued (absolute value of the) value premium.
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Appendix V
Factors used in Small-versus-Big rotation models.
All data are provided by ABP Investments.
Technical variables are:
∗ Lagged Value/Growth spread
∗ Lagged Small/Large spread
∗ MOM S&P (HoH): 6 months return momentum S&P 500
∗ Profit cycle: Year on Year change in earnings per share of the S&P 500
∗ RF: US Treasury Constant Maturities 3 Mth - Middle Rate
∗ GSCINE (QoQ): GSCI Non Energy (Quarterly changes)
∗ DIV YLD: Difference between Dividend Yields of Barra Value and Barra Growth
∗ VOL S&P (22 DAY): Volatility of S&P 500 on daily basis
Economic variables are:
∗ Corporate Credit Spread: the yield spread of (Lehman Aggregate) Baa over Aaa
∗ Core inflation: the 12-month trailing change in the U.S. Consumer Price Index
∗ Earnings-yield gap: the difference between forward E/P ratio (S&P 500) and the 10-year T-
bond yield
∗ Yield Curve Spread: the yield spread of 10-year T-bonds over 3-month T-bills
∗ Bond Yield: US Treasury Constant Maturities 10 Yr - Middle Rate
∗ Ind. Prod: U.S. Industrial Production Seasonally Adjusted
∗ Oil Price (QoQ): the 3-month change in West Texas Int. Near Month FOB $/BBL
∗ ISM (YoY): yearly change of US ISM Purchasing Managers Index (Mfg Survey),
Seasonally adjusted
∗ Leading Indicator: the 12-month change in the Conference Board Leading Indicator
Appendix VI
Tables showing the results from different Small-versus-Big investment strategies and
different cost scenarios.
Time frame: January 1993 – January 2003
Table 7.
Results Small-versus-Big rotation strategy using a one-month forecast horizon.
Table 8.
Results Small-versus-Big rotation strategy using a three-month forecast horizon.
Table 9.
Results Small-versus-Big rotation strategy using a six-month forecast horizon.
Table 7
Results Small-versus-Big Support Vector Regression rotation strategy using a one-month forecast horizon. Time frame: January 1993 – January 2003
large/small cap                 SmB            CV             CV             CV             MAX_SB
1-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            -1.28          10.66          9.02           7.39           26.76
Standard deviation              12.94          10.88          10.88          10.93          8.89
Information ratio               -0.10          0.98***        0.83***        0.68**         3.01***
Z(equality)                     –              2.15**         1.81*          1.48           5.99***
Median                          0.05           0.45           0.44           0.43           0.78
Minimum (monthly)               -15.71         -7.70          -7.70          -7.70          -0.96
Maximum (monthly)               16.78          16.78          16.53          16.28          16.78
Skewness (monthly)              0.21           0.77           0.77           0.76           2.61
Excess kurtosis (monthly)       4.55           4.80           4.61           4.35           11.66
prop. negative months           0.52           0.33           0.45           0.46           0.11
Largest 3-month loss            -21.63         -8.84          -8.84          -9.21          -1.18
Largest 12-month loss           -31.85         -3.21          -5.46          -7.71          8.03
% months in Big                 0.00           45.45          45.45          45.45          52.07
% months in Small               100.00         44.63          44.63          44.63          47.93
% months no position            0.00           9.92           9.92           9.92           0.00
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices (see note below).
The overall position for month t+1 is based on the signal produced by the optimal model based on 60 months of
prior historical data (factors included = 17). If for example the produced signal for month t+1 is “Small”, then a
position is taken that is long on the Small cap index and short on the Large cap index. Note that if the optimal
model produces no signal, then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is long-small / short-big, and the signal for the following month is “Big”, then 2 * 0.25% (1* 0.25% for closing the current long-small / short-big position, plus 1* 0.25% for establishing a long-big / short-small position) have to be deducted from the following month’s accrued (absolute value of the) “small premium” (the difference in return between the S&P SmallCap 600 and S&P 500 indices).
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Note: Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank Russell 2000 indices have been used as inputs for the Small-versus-Big calculations.
Table 8
Results Small-versus-Big Support Vector Regression rotation strategy using a three-month forecast horizon. Time frame: January 1993 – January 2003
large/small cap                 SmB            CV             CV             CV             MAX_SB
3-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            -1.28          7.95           6.59           5.24           26.76
Standard deviation              12.94          10.95          10.91          10.90          8.89
Information ratio               -0.10          0.73**         0.60*          0.48           3.01***
Z(equality)                     –              1.59           1.31           1.04           5.99***
Median                          0.05           0.31           0.29           0.26           0.78
Minimum (monthly)               -15.71         -10.47         -10.55         -10.64         -0.96
Maximum (monthly)               16.78          16.78          16.36          15.95          16.78
Skewness (monthly)              0.21           0.59           0.57           0.55           2.61
Excess kurtosis (monthly)       4.55           6.00           5.64           5.24           11.66
prop. negative months           0.52           0.41           0.45           0.46           0.11
Largest 3-month loss            -21.63         -13.34         -13.51         -13.67         -1.18
Largest 12-month loss           -31.85         -2.75          -4.83          -6.91          8.03
% months in Big                 0.00           48.76          48.76          48.76          52.07
% months in Small               100.00         49.59          49.59          49.59          47.93
% months no position            0.00           1.65           1.65           1.65           0.00
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices (see note below).
The overall position for month t+1 is based on the unweighted average of three signals produced by the optimal
models associated with months t-1, t and t+1 respectively (factors included = 17). If for example the produced
signals for month t+1 are “Small”, “Big”, and “Small”, then the combined signal is “1/3 Small”. Out of this
combined signal, a position is taken that is long 1/3 of the Small cap index and short 1/3 of the Large cap index.
Note that if the optimal models produce a combined “no signal” signal, then no trading position for month t+1
should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-small / ½ short-big, and the signal for the following month is “½ Big”, then 2 *
0.125% (1* 0.125% for closing the current ½ long-small / ½ short-big position, plus 1* 0.125% for
establishing a ½ long-big / ½ short-small position) have to be deducted from the following month’s accrued
(absolute value of the) “small premium” (the difference in return between the S&P SmallCap 600 and S&P 500 indices).
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Note: Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank Russell 2000 indices have been used as inputs for the Small-versus-Big calculations.
Table 9
Results Small-versus-Big Support Vector Regression rotation strategy using a six-month
forecast horizon. Time frame: January 1993 – January 2003
large/small cap                 SmB            CV             CV             CV             MAX_SB
6-month forecast horizon        (costs 0, 25   (costs 0 bp)   (costs 25 bp)  (costs 50 bp)  (costs 50 bp)
                                and 50 bp)
Mean                            -1.28          7.64           6.40           5.16           26.76
Standard deviation              12.94          10.04          10.02          10.02          8.89
Information ratio               -0.10          0.76**         0.64**         0.52           3.01***
Z(equality)                     –              1.59           1.33           1.06           5.99***
Median                          0.05           0.45           0.43           0.41           0.78
Minimum (monthly)               -15.71         -7.70          -7.70          -7.70          -0.96
Maximum (monthly)               16.78          16.78          16.41          16.03          16.78
Skewness (monthly)              0.21           1.19           1.15           1.10           2.61
Excess kurtosis (monthly)       4.55           7.57           7.07           6.53           11.66
prop. negative months           0.52           0.38           0.46           0.46           0.11
Largest 3-month loss            -21.63         -8.84          -8.92          -9.01          -1.18
Largest 12-month loss           -31.85         -2.14          -3.35          -4.56          8.03
% months in Big                 0.00           45.45          45.45          45.45          52.07
% months in Small               100.00         48.76          48.76          48.76          47.93
% months no position            0.00           5.79           5.79           5.79           0.00
SmB denotes Small-minus-Big strategy (long Small and short Big). MAX_SB denotes perfect foresight Small-versus-Big rotation strategy. CV denotes the timing strategy based on Support Vector Regression Cross Validation Mean Squared Error. All numbers are annualized data unless stated otherwise. All strategies are long/short monthly positions on the S&P SmallCap 600 and S&P 500 indices (see note below).
The overall position for month t+1 is based on the unweighted average of six signals produced by the optimal
models associated with months t-4, t-3, t-2, t-1, t and t+1 respectively (factors included = 17). If for example the
produced signals for month t+1 are “Small”, “Small”, “Small”, “no signal”, “Big”, and “Small”, then the
combined signal is “½ Small”. Out of this combined signal, a position is taken that is long ½ of the Small cap
index and short ½ of the Large cap index. Note that if the optimal models produce a combined “no signal” signal,
then no trading position for month t+1 should be taken.
Transaction costs are assumed to be 0 bp, 25 bp, and 50 bp single trip. In the 25 bp case for instance, if the
current position is ½ long-small / ½ short-big, and the signal for the following month is “½ Big”, then 2 *
0.125% (1* 0.125% for closing the current ½ long-small / ½ short-big position, plus 1* 0.125% for
establishing a ½ long-big / ½ short-small position) have to be deducted from the following month’s accrued
(absolute value of the) “small premium” (the difference in return between the S&P SmallCap 600 and S&P 500 indices).
*    indicates significance at the (2-tail) 10% level
**   indicates significance at the (2-tail) 5% level
***  indicates significance at the (2-tail) 1% level
Note: Prior to the introduction of the S&P SmallCap 600 index in January 1994, the Frank Russell 1000 and Frank Russell 2000 indices have been used as inputs for the Small-versus-Big calculations.
Appendix VII
Figures showing the results from different Small-versus-Big Support Vector Regression
investment strategies and different cost scenarios.
Time frame: January 1993 – January 2003
Figure A7.1. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the one-month forecast horizon Support Vector Regression rotation strategy
under different transaction-cost regimes.
Figure A7.2. Investment signals (“small” = 1, “big” = -1, “no signal” = 0) produced by
the one-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.3. Realized excess returns by the one-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.4. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the three-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A7.5. Investment signals (“small” = 1, “big” = -1, “no signal” = 0) produced by
the three-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.6. Realized excess returns by the three-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.7. Accrued cumulative monthly returns from the Small-versus-Big strategy and
the six-month horizon Support Vector Regression rotation strategy under
different transaction cost regimes.
Figure A7.8. Investment signals (“small” = 1, “big” = -1, “no signal” = 0) produced by
the six-month forecast horizon Support Vector Regression rotation strategy.
Figure A7.9. Realized excess returns by the six-month forecast horizon Support Vector
Regression rotation strategy under the 25 bp transaction-cost scenario.
Figure A7.10. Accrued cumulative monthly returns from the one-, three-, and six-month
forecast horizon Support Vector Regression Small-versus-Big rotation
strategies under zero-transaction-cost regime.
[Figures A7.1 – A7.10 appear here; see the captions listed above.]