The Efficient Set GA for Stock Portfolios

Jacqueline Shoaf and James A. Foster
Dept. of Computer Science, University of Idaho, Moscow, Idaho
email: jackies@alaska.net, foster@cs.uidaho.edu

Abstract

The genetic algorithm (GA) for the efficient set portfolio problem based on the Markowitz model, introduced by Shoaf and Foster [4], offers significant benefits over the quadratic programming approach. These benefits include simultaneous optimization of risk and return. The efficient set GA uses an indirect representation in order to avoid infeasible solutions and penalty functions. This representation is generally applicable to problems that seek an optimal partition of a given amount of some resource, where the partition includes both negative and positive allocations. Efficient set GA evolution scales well: its run time is O(n log n) with a small constant for portfolios containing up to n=100 stocks. Using demes further improves the quality of solution and the run time for this GA.

1. Introduction

The efficient set portfolio problem is to find the allocation of investments for a given set of securities with minimum risk for any given rate of return. Markowitz’s [2] approach to solving this problem uses the covariance matrix derived from historical rates of return to predict the variance, or “risk factor”, of any allocation of resources. The Markowitz model accommodates both long and short positions. A long position represents an allocation for the purchase of securities, whereas a short position represents an allocation from the sale of borrowed securities.

In the Markowitz model, the weighted sum of the values in the rates of return covariance matrix represents the overall variance, $\sigma^2(r_p)$, of a portfolio. Let n be the number of stocks in the portfolio, $x_i$ be the proportion of resources allocated for stock i (negative for short positions), $E(r^*_p)$ be the given expected rate of return for the portfolio, and $E(r_j)$ be the expected rate of return for each security.
The objective equation for the efficient set portfolio problem is:

$$\min \sigma^2(r_p) = \sum_{j=1}^{n} \sum_{i=1}^{n} x_i x_j \, \mathrm{Cov}(r_i, r_j)$$

with the following constraints:

1) $E(r^*_p) = \sum_{j=1}^{n} x_j E(r_j)$

2) $1.0 = \sum_{j=1}^{n} x_j$

The quadratic programming approach to this problem described by Haugen [1, Appendix 3] requires the objective equation to be rewritten in Lagrangian form. For minimization, the partial derivative with respect to each variable is taken and set to 0. This leaves a set of linear simultaneous equations which can be solved for the coefficients of allocation for the minimum variance portfolio. This portfolio will have the given expected rate of return, $E(r^*_p)$, which is specified as a constraint. Note that $E(r^*_p)$ is required as input, so the quadratic programming approach can only solve the efficient set problem for one portfolio rate of return at a time. Also, the algorithm for solving a set of linear simultaneous equations has time complexity between $O(n^2)$ and $O(n^3)$, according to Smith [5].

2. The Efficient Set GA

The GA solution to the efficient set problem (Shoaf and Foster [4]) alters the problem slightly to solve for an efficient set portfolio over the entire range of potential expected portfolio returns. Each member of the GA population represents an allocation of resources for the portfolio. The user selects a desirable balance between risk and return using adjustable constants in the GA fitness function, which combines the portfolio variance $\sigma^2(r_p)$ with the expected returns $E(r_p)$ and $E(r^*_p)$. The weighting constants are set by the user, $E(r_p)$ represents the expected rate of return of the portfolio represented by the population member, and $E(r^*_p)$ represents the user’s target expected rate of portfolio return.

Because the efficient set portfolio problem is an allocation problem, a direct representation of resource allocation by each population member in the GA will not work well. This type of representation will result in predominantly infeasible solutions in every generation, where the allocations do not sum to 1.0.
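As a rough illustration of the quadratic programming procedure described in the Introduction, the following Python sketch builds the linear system obtained by setting the partial derivatives of the Lagrangian to zero and solves it by Gaussian elimination. The function names, the multipliers lam1 and lam2, and the three-security covariance and return figures are hypothetical and chosen only for illustration; they are not taken from Haugen [1] or from the experiments reported here.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def min_variance_allocation(cov, exp_ret, target):
    """Allocation minimizing the variance subject to constraints 1) and 2)."""
    n = len(cov)
    rows, rhs = [], []
    for i in range(n):
        # d/dx_i: 2 * sum_j Cov(r_i, r_j) x_j - lam1 * E(r_i) - lam2 = 0
        rows.append([2.0 * cov[i][j] for j in range(n)] + [-exp_ret[i], -1.0])
        rhs.append(0.0)
    rows.append(list(exp_ret) + [0.0, 0.0])  # sum_j x_j E(r_j) = target
    rhs.append(target)
    rows.append([1.0] * n + [0.0, 0.0])      # sum_j x_j = 1
    rhs.append(1.0)
    return solve_linear(rows, rhs)[:n]       # drop the two multipliers


def portfolio_variance(x, cov):
    """Weighted sum of the covariance matrix entries."""
    n = len(x)
    return sum(x[i] * x[j] * cov[i][j] for i in range(n) for j in range(n))


# Hypothetical three-security example (not the data used in this paper).
cov = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.16]]
exp_ret = [0.08, 0.12, 0.20]
x = min_variance_allocation(cov, exp_ret, target=0.12)
print(x, portfolio_variance(x, cov))
```

Each call answers the question for a single target return, which is the limitation noted in the Introduction: tracing out the whole efficient set this way means re-solving the system once per target.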
Our representation has a single field of k+1 bits for each security. The first bit indicates whether the position on that stock will be long (one) or short (zero). The remaining k bits are an unsigned index onto an “allocation wheel”. Conceptually, this is a wheel representing the resources to be allocated. It is divided into $2^k$ equal sections, each indexed by a k-bit binary value. The distance between an index and the index of the next long position, plus any enclosed short position wedges, is the proportion of the total resource allocation for the security with that index. (See Figure 1.)

More precisely, suppose that $i_1, \ldots, i_n$ are the indices for n securities, in non-decreasing order, mod $2^k$ (so that 000, for example, follows 111). Let S be the set of securities with short positions, and let L(j) be the next long position after security j on the allocation wheel. Now, let $d_j$ be:

$$d_j = \frac{a \left[ (i_{L(j)} - i_j) \bmod 2^k \right] + b \sum_{m \in S,\; j < m < L(j)} \left[ (i_{L(m)} - i_m) \bmod 2^k \right]}{2^k}$$

with a=1 and b=1 when $j \notin S$ (a long position), and a=-1 and b=0 when $j \in S$ (a short position). Pictorially, the magnitude of $d_j$ is the length of the arc between the index $i_j$ and the next long security, with any subtended short position arcs added in for long positions; short positions take their arc with a negative sign. Figure 1 is an example of this representation for k=3 and n=5. There, Stock 3 (long, index 111) is followed by the long position at index 010, an arc of 3 sections, and it encloses the 2-section wedge of Stock 1 (short, index 000), so its allocation is (3+2)/8 = .625.

An important benefit of this representation is that $\sum_{j \in S} d_j + \sum_{j \notin S} d_j = 1$ for any set of short positions S and any chromosome. That is, the total investment is always one hundred percent of the available resources. This makes it impossible to create an infeasible solution. So, every member of the population in each generation can contribute to the next generation and no valuable schemata are discarded. This also avoids any unpredictable influence on evolution by penalty functions in the fitness function, since they are not necessary. Because no evaluations are spent on infeasible individuals, this representation makes the GA efficient. This indirect style of representation should work for any optimization problem where allocation proportions may be either positive or negative and must sum to a given value.
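The decoding step can be sketched in Python as follows. This is a minimal illustration that follows the prose description and the worked values in Figure 1, not the authors' implementation: the function name decode_allocations, the (is_long, index) tuple layout, the tie-breaking rule for equal indices, and the decision to reject a chromosome with no long position are all assumptions made for the example.

```python
def decode_allocations(fields, k):
    """Map (is_long, index) fields to signed allocation proportions."""
    size = 1 << k  # number of sections on the allocation wheel
    n = len(fields)
    # Visit securities in non-decreasing index order (ties broken by position).
    order = sorted(range(n), key=lambda j: (fields[j][1], j))
    long_pos = [p for p, j in enumerate(order) if fields[j][0]]
    if not long_pos:
        # Assumption: the text does not say how an all-short chromosome is handled.
        raise ValueError("at least one long position is required")

    def arc_to_next_long(p):
        """Sections from sorted position p to the next long position, wrapping once."""
        here = fields[order[p]][1]
        for q in long_pos:
            if q > p:
                return fields[order[q]][1] - here
        return fields[order[long_pos[0]]][1] + size - here

    alloc = [0.0] * n
    for p, j in enumerate(order):
        arc = arc_to_next_long(p) / size
        if fields[j][0]:
            alloc[j] += arc
        else:
            alloc[j] -= arc
            # Credit the short wedge to the long position that encloses it:
            # the nearest long position before p on the wheel.
            before = [q for q in long_pos if q < p]
            owner = order[before[-1] if before else long_pos[-1]]
            alloc[owner] += arc
    return alloc


# The chromosome of Figure 1 (k=3, n=5), Stocks 0-4 in order.
figure1 = [(True, 0b110), (False, 0b000), (False, 0b101), (True, 0b111), (True, 0b010)]
print(decode_allocations(figure1, k=3))
```

For the Figure 1 chromosome this yields .125, -.25, -.125, .625 and .625, matching the allocations listed in the figure. Every short wedge is subtracted once and credited once, and the long arcs cover the wheel exactly once, so the result sums to 1.0 whenever at least one long position exists.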
            Stock 0   Stock 1   Stock 2   Stock 3   Stock 4
Position    1         0         0         1         1
            LONG      SHORT     SHORT     LONG      LONG
Index       110       000       101       111       010

The allocation wheel has eight sections indexed 000 through 111. Stock 1 (short) sits at 000, Stock 4 at 010, Stock 2 (short) at 101, Stock 0 at 110, and Stock 3 at 111.

Sum of allocations for Stocks 0-4 (in order): .125 + -.25 + -.125 + .625 + .625 = 1.00

Figure 1. Allocation based on solution representation

One of the less obvious effects of this representation style, however, is the sensitivity of the efficient set GA to increases in mutation and crossover rates. A change in the index of one security can affect one to two other security allocations.

Experiments were conducted using a small set of 5 stocks [3], comparing the efficient set GA to the quadratic programming technique; they demonstrated the benefits of simultaneously optimizing for return and risk. These experiments demonstrated that the GA could find portfolio allocations with similar risk and higher rates of return than the return-constrained quadratic programming solution. The data for these experiments was derived from end-of-week closing data accumulated over an eleven month period beginning October 3, 1994. The covariance matrix for stock rates of return is shown in Table 1 and the averaged annualized rates of return are shown in Table 2.

Table 1. Stock covariance matrix

        CYBE    ISLI    NBL     ORLY
CYBE
ISLI    2.45
NBL     1.36    2.76
ORLY    0.07   -0.10    0.99
RGIS    0.55    0.44   -0.02    0.68

Table 2. Average Rate of Return

CYBE     0.589
ISLI    -1.573
NBL      1.219
ORLY     0.159
RGIS    -0.094

The allocation proportions, obtained using the quadratic programming technique with a given expected return rate of .15, are shown in Table 3. This solution yields a minimum risk, $\sigma^2(r_p)$, of .405.

Table 3. Allocation by quadratic programming

CYBE    ISLI    NBL     ORLY    RGIS
-0.10   0.15    0.30    0.55    0.10

Five randomly-seeded GA runs were conducted using the same data [3]. The length of each chromosome was 35 bits, with individual fields of 7 bits. Each run lasted 200 generations using a population size of 300.
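The quadratic programming allocation in Table 3 can be checked against the two constraints of the problem using only the figures in Table 2 and Table 3. The short Python sketch below does this; the variable and function names are chosen for illustration.

```python
tickers = ["CYBE", "ISLI", "NBL", "ORLY", "RGIS"]
avg_return = [0.589, -1.573, 1.219, 0.159, -0.094]  # Table 2
qp_alloc = [-0.10, 0.15, 0.30, 0.55, 0.10]          # Table 3


def expected_return(alloc, returns):
    """Constraint 1: E(r_p) = sum_j x_j E(r_j)."""
    return sum(x * r for x, r in zip(alloc, returns))


# Constraint 2: the proportions should sum to 1.0.
print(round(sum(qp_alloc), 3))
# Constraint 1: the expected return should be close to the target of .15.
print(round(expected_return(qp_alloc, avg_return), 3))
```

The proportions sum to 1.0 and the expected return works out to about .149, which agrees with the stated target of .15 to the two decimal places reported in Table 3.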
The allocation proportions obtained using the GA are shown in Table 4. This solution is the best over all five runs. These allocations yield an expected rate of return of .52 with an associated minimum risk, $\sigma^2(r_p)$, of .384. In this case the efficient set GA was clearly able to find a better solution than quadratic programming: a higher expected return (.52 versus .15) at a lower risk (.384 versus .405).

Table 4. Allocation by GA

CYBE    ISLI    NBL     ORLY    RGIS
0.0     0.047   0.422   0.516   0.016

Figure 2. 5 Stock Portfolio GA: scaled fitness (Best of Gen, Ave of Gen) plotted against generation.

The convergence graph of the averaged GA solution, which plots average generation fitness and best-of-generation fitness against generation, is shown in Figure 2. The plot demonstrates the expected exponential increase in fitness values.

3. Complexity of the Efficient Set GA

We also ran experiments to determine the expected time complexity of the GA as a function of n, the number of securities in the portfolio. Since n is a variable that is local to the objective function of the GA, this would confirm that the time complexity of the fitness function dominates the GA as n increases. An efficient algorithm in terms of n implies that the GA can be used for portfolio allocations involving large numbers of securities.

Notice that the size of the chromosomes, and therefore the complexity of the GA, depends on both the number of securities and the number of slices on the allocation wheel, so it is not obvious a priori that the number of securities is the critical performance parameter. Each chromosome contains n fields, each representing an investment position (long or short) and the index used for allocation for one security. Thus the product of n and the field size determines the total length of each chromosome. However, the fitness function performs a number of transformations based on the values in each field in order to convert the chromosome into a portfolio allocation. These transformations affect the algorithmic time complexity of the GA.
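The first of these transformations, unpacking the chromosome into one field per security, might look like the following Python sketch. Representing the chromosome as a string of '0' and '1' characters is an assumption made for readability, and parse_chromosome is a hypothetical helper rather than part of GAlib or of the original implementation.

```python
def parse_chromosome(bits, field_size):
    """Split a bit string into (is_long, index) pairs, one per security."""
    if field_size < 2:
        raise ValueError("a field needs a position bit and at least one index bit")
    if len(bits) % field_size != 0:
        raise ValueError("chromosome length must be a multiple of the field size")
    fields = []
    for start in range(0, len(bits), field_size):
        field = bits[start:start + field_size]
        is_long = field[0] == "1"   # first bit: long (one) or short (zero)
        index = int(field[1:], 2)   # remaining k bits: allocation wheel index
        fields.append((is_long, index))
    return fields


# The Figure 1 chromosome: five fields of k+1 = 4 bits each.
figure1_bits = "1110" "0000" "0101" "1111" "1010"
print(parse_chromosome(figure1_bits, field_size=4))
```

This step is linear in the chromosome length, n times the field size. The sort of the resulting indices that follows it is the step whose cost grows faster with n.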
The indirect representation method for allocation determination described earlier requires that the fields of the chromosome be sorted, an operation of average expected complexity O(n log n) using a quicksort (the $O(n^2)$ worst case behavior rarely shows up in practice). We anticipated that sorting would dominate the time complexity of the fitness function and the GA.

The experiments consisted of sets of 25 GA runs, one set for each of 5 different values of n: 8, 16, 32, 64 and 100. Time was clocked on either side of the evolution step in the GA in order to exclude the time required by chromosome initialization, which is assumed to be linear in n with a small constant. Between the sets, all other GA constants remained the same. However, since the absolute length of the chromosome is also affected by the change in n, a separate set of experiments was conducted to determine which type of change dominated the time complexity of the GA. A fixed population size of 100 chromosomes and runs lasting 100 generations were used. These and other basic GA parameters are noted in Table 5.

Table 5. Basic GA parameters for complexity experiments

GA Type            Simple
Crossover          2-Pt
Selection          Roulette-Wheel
Field Size         7
Population Size    100
Crossover Rate     .6
Mutation Rate      .001
Generations        100

The averaged experimental results are summarized on the chart in Figure 3. Let t be the time, in seconds, required to evolve this population 100 generations. The algorithmic complexity of the fitness function, based on the data used here, appears to fit the equation $t = F(n) = c \cdot n \log_2 n$ very well. In this case the constant c is between .1 and .2.

Figure 3. Sec/100 Generations plotted against n = Stock Set Size (Number of Stocks): averaged data with the curves Time = c * n log2 n for C=0.1 and C=0.2.

In order to determine whether the GA run time was dominated by changes in n or by changes in overall chromosome length, we ran an additional set of experiments in which n, the number of securities, remains constant but the chromosome length changes based on the field size. The number of bits in a field determines the minimum allocation proportion for any stock in the portfolio and the maximum number of securities in the portfolio with allocations greater than 0. Increasing the number of securities in the portfolio also makes it desirable to increase the field size to reduce the likelihood of index collision. So, in practice these two values should be correlated. But for our experiments, we used a constant number of fields (securities), n=32, and varied the size of each field from 3 to 7, effectively changing the chromosome length from 96 to 224 bits. With the exception of field size, all other parameters from Table 5 remained the same.

The averaged results from sets of 25 experiments in each configuration, summarized in the chart in Figure 4, show that increasing the number of bits in each field has a more gradual effect on time complexity than increasing n, the number of fields. The graph shows linear growth in time with increasing chromosome length. For this population, t = a + c*(bits per field), with a=14.3 and c=0.49. The effect of increasing the number of bits in the chromosome, without increasing n, may be to increase the time required for the mutation operation, which is dependent only on the total number of bits.

Figure 4. Sec./100 Generations plotted against FIELDSIZE (Bits per Field): averaged data with the line Time = a + c * FIELDSIZE, c=0.49, a=14.3.

Therefore, the empirical data from these experiments confirms that overall algorithmic time complexity for this GA application is affected mainly by changes in the number of securities in the portfolio, n, and that the order of expected time complexity for the fitness function and the GA is O(n log n) with small constants.

4.
The Deme Modification

A natural modification for improving the time efficiency of the efficient set GA over multiple single population runs is the use of a deme model. In a deme model, subpopulations evolve independently and periodically migrate their most highly fit members. The deme model was designed to allow parallel evolution on a multiprocessor system. In addition to improving time efficiency, the deme model may improve the capability of the efficient set GA. For the efficient set GA the deme model appears to be better than multiple single runs because it provides for alternating periods of local hillclimbing and global competition between improved local optima.

The single population and deme models are compared here based on the number of generation steps rather than absolute GA runtime. There are several reasons for this. The deme modification that was implemented for these experiments requires a steady state GA framework, one in which only a small proportion of deme members are replaced every generation. The deme model has a migration step, which is absent from the single population model. And, although the deme model can be run on a multiprocessor system, our implementation was designed for a single processor system with deme evolution implemented sequentially. Because of the disparity in the models, this comparison is based on the quality of optimal results between the two models after 2000 generation steps rather than on absolute runtime. A generation step for the deme model includes one generation in each of the subpopulations, since the deme model can potentially allow evolution to proceed simultaneously for each subpopulation on a multiprocessor system.

The GA parameters are shown in Table 6. Two sets of experiments were run for the single population model. In the first set, mutation and crossover rates were the highest possible (within .001 and .1 increments, respectively) that would still allow convergence and guarantee a final local hill-climbing phase. The second set was run at a much higher mutation rate with no convergence (there were fewer than 3 matched population members in any final generation from set #2) to see whether it would be possible to find a better solution through a more random search. In general, higher mutation rates provide the longest possible solution space exploration phase in the efficient set GA, which may be a result of the potentially highly multimodal solution space.

Table 6.

Experiment Set    1 (10 runs)    2 (10 runs)    3 (10 runs)
GA Type           Simple         Simple         Steady State/Deme
Populations       1              1              5
Gens              2000           2000           100
Epochs            N/A            N/A            20 Gens/Epoch
Replace/Gen       All            All            5
PopSize           100            100            100
Xover             2-Pt           2-Pt           2-Pt
Select            Roulette       Roulette       Roulette
Fields            6              6              6
Stock Set         10             10             10
Xover Rate        .6             .6             .7
Mut Rate          .001           .007           .007

The fitness profile (Figure 5) from one of the deme runs in set #3 demonstrates the alternating influences of local improvement and total population competition during evolution.

Figure 5. Deme GA: scaled fitness (Best of Gen, Ave of Gen) plotted against generations.

The results of the three sets of experiments, in terms of the statistics for the generated portfolios, are shown in Table 7. While the empirical data is very limited, it serves to illustrate how well the deme model works for this GA. The portfolio statistics, which reflect optimal fitness statistics, show that the deme model has the potential to produce results comparable to and better than the single population GA for the same time resource or number of generations. Assuming the time for the migration step is minimal, the time resource required for multiple GA runs can be cut geometrically by the use of the deme model with no loss of capability.

Table 7. Portfolio statistics

                      Set 1    Set 2    Set 3
Return   Mean         .325     .366     .398
         Std. Dev.    .069     .074     .063
Risk     Mean         .485     .426     .424
         Std. Dev.    .122     .022     .017

Conclusions

We compared a simple GA approach to solving the efficient set problem with the more traditional quadratic programming approach using covariances. The GA can simultaneously minimize risk and maximize expected return, whereas the quadratic programming approach must hold the expected return fixed as a constraint. This flexibility allows the GA to discover portfolio opportunities that the more traditional approach misses.

We also examined the expected time complexity of the GA solution. Our experiments show that when the GA is run with portfolios of up to n=100 stocks, the expected time complexity of the genetic algorithm is O(n log n) with a very small constant. This compares favorably with the $O(n^2)$ to $O(n^3)$ complexity of solving the simultaneous equations in the quadratic programming approach. Moreover, the GA complexity can be primarily attributed to the fitness function, which produces a portfolio allocation from an indirect solution representation. The representation style is advantageous because it eliminates the possibility of infeasible solutions and the need for penalty functions. Additional experiments demonstrate that the O(n log n) complexity of the fitness function overshadows the linear relationship between overall length of the chromosome and GA runtime.

Finally, we demonstrated the effectiveness of using demes for this GA. This modification was shown to have the potential of finding solutions comparable and possibly superior to those gained from multiple single population runs. A contributing factor to the success of this type of modification may be the highly multimodal character of the potential solution space in the efficient set problem.

Acknowledgements

The software for this work used the GAlib genetic algorithm package, written by Matthew Wall at the Massachusetts Institute of Technology.

References

[1] Haugen, R.A., Modern Investment Theory, Prentice Hall Inc., Englewood Cliffs, NJ, 1993.
[2] Markowitz, H.M., Portfolio Selection, Basil Blackwell, Inc.,
Cambridge, MA, 1991.
[3] Shoaf, J.S. and Foster, J.A., “A Genetic Algorithm Solution to the Efficient Set Problem: A Technique for Portfolio Selection Based on the Markowitz Model”, Tech. Report, Dept. of Computer Science, Univ. of Idaho, Moscow, ID, 1995.
[4] Shoaf, J.S. and Foster, J.A., “A Genetic Algorithm Solution to the Efficient Set Problem: A Technique for Portfolio Selection Based on the Markowitz Model”, Proc. 1996 Annual Meeting, Vol. 2, Decision Sciences Institute, Orlando, FL, 1996, pp. 571-573.
[5] Smith, H.A., Data Structures: Form and Function, Harcourt Brace Jovanovich, Inc., San Diego, CA, 1987.