Proceedings of 7th Asia-Pacific Business Research Conference, 25-26 August 2014, Bayview Hotel, Singapore. ISBN: 978-1-922069-58-0

Frequent-Pattern Tree Algorithm Application to S&P and Equity Indexes

E. Younsi, H. Andriamboavonjy, A. David, S. Dokou and B. Lemrabet
Undergraduate Students, ECE Paris School of Engineering. Corresponding author: eyounsi@ece.fr

Software and time optimization are very important factors in financial markets, which are highly competitive fields, and the emergence of new computer tools further raises the stakes. In this context, any improvement of the technical indicators that generate a buy or sell signal is a major issue, and many tools have been created to make them more effective. This concern for efficiency has led, in the present paper, to a search for the best (and most innovative) way of obtaining the largest improvement in these indicators. The approach consists in attaching a signature to frequent market configurations by applying a frequent-pattern extraction method, which is here the most appropriate way to optimize investment strategies. The goal of the proposed trading algorithm is to find the most accurate signatures by a back-testing procedure applied to technical indicators, in order to improve their performance. The problem is then to determine the signatures which, combined with an indicator, outperform this indicator alone. To this end, the FP-Tree algorithm has been preferred, as it appears to be the most efficient algorithm for performing this task.

Keywords: Quantitative Analysis, Back-testing, Computational Models, Apriori Algorithm, Pattern Recognition, Data Mining, FP-Tree

I. Introduction

With the advance of computer technology, it is today very easy to collect all kinds of data; the processing limitation lies in extracting value from them. In all domains, from science to finance, huge volumes of data exist, and most of them remain untapped. Banks, for example, must analyze this mass of data quickly enough to make the right investment decisions. The problem is no longer the collection but the extraction of value from these data. To meet this challenge, knowledge extraction from databases (data mining) has been developed [1,8,9,16,19,21], particularly in the area of financial decision-making [4,5,6]. In this world, events follow the traditional cycle of cause and effect. The expansion of hedge funds and the development of computers have boosted data analysis, and with the emergence of behavioral finance [23] a comprehensive, in-depth analysis of all data and archived transactions is needed. In this context, the improvement of the technical indicators [7,22] generating a buy/sell signal is a major issue. This concern for efficiency leads to a search for the best (and most innovative) way of further improving these indicators [2]. Given our ignorance of the internal dynamics of the economic system, the technique of frequent patterns appears to be the most appropriate method for allowing fund managers to optimize their quantitative strategies [18]. To address the problem of handling large databases, a method is identified that is capable of classifying the curves representing data behavior, with a relatively modest numerical reduction and without losing information [15].
Thus, these classes of behavior are submitted to the frequent-pattern algorithm, which then no longer faces the storage-memory problem. Mining frequent patterns has been the most studied data mining task for the past ten years. It is used here to find signatures. A signature is a set of elements, representing in the present case a market configuration, which is compared to another one to find similarities. Hence, the method is to assign a signature to common market configurations. Through a back-testing procedure [24], the best signatures are kept for the next period (called validation if it is preceded by a "good" signature) and used in combination with the technical indicator to improve its performance. Conventional data mining (i.e., searching for frequent patterns) is used to identify signatures characterizing the few days before the technical indicator triggers. One thus obtains specific underlying rules of technical analysis, based on several days and adapted to the period. Clustering, or unsupervised classification, which groups similar situations, is the general context of financial decision-making on which the present problem is based. The algorithm principle is based on the Frequent-Pattern Tree (FP-Tree) extraction algorithm [17]. Indeed, the FP-Tree implementation is the most effective for analyzing the whole set of frequent patterns, and it is also the most suitable for real-time high-frequency trading. Two families of methods exist for sequential pattern mining: Apriori-based approaches [3,14], which include the GSP and SPADE algorithms [12], and pattern-growth-based approaches [25], which include the FreeSpan and PrefixSpan algorithms. The common data mining approach is to find frequent item-sets in a transaction dataset and then derive association rules from them. Finding frequent item-sets (item-sets whose appearance frequency is larger than or equal to a user-specified minimum support) is not trivial because of combinatorial explosion. Once the frequent item-sets are obtained, it is straightforward to generate association rules with confidence larger than or equal to a user-specified minimum confidence.
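To make these notions concrete, the following minimal Python sketch (ours, not the paper's) computes support and confidence over a toy transaction database. The first two rows mirror the example of Figure 1 in Section II; the remaining rows are illustrative assumptions chosen to be consistent with it.

    # Toy illustration of minimum support and rule confidence.
    transactions = [
        {"f", "c", "a", "m", "p"},
        {"f", "c", "a", "b", "m"},
        {"f", "b"},                  # assumed row, not from the paper
        {"c", "b", "p"},             # assumed row, not from the paper
        {"f", "c", "a", "m", "p"},   # assumed row, not from the paper
    ]

    def support(itemset, db):
        """Fraction of transactions containing every item of `itemset`."""
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        """Support of the full rule divided by support of its antecedent."""
        return support(antecedent | consequent, db) / support(antecedent, db)

    print(support({"f", "c", "a"}, transactions))  # 0.6: frequent at a 50% minimum support
    print(confidence({"c"}, {"p"}, transactions))  # 0.75: confidence of the rule c -> p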
The most outstanding improvement over Apriori is FP-Growth (frequent-pattern growth) [10,11], developed further in Part II. It is based on a "divide and conquer" strategy: the database is compressed into a structure retaining all essential information, this compressed database is divided into a set of conditional databases, each associated with one frequent item, and each of them is mined separately. FP-Growth is an order of magnitude faster than the original Apriori algorithm. SPADE (Sequential PAttern Discovery using Equivalence classes) is an algorithm for fast mining of sequential patterns in large databases. Unlike previous approaches, which make multiple database scans and use complex hash-tree structures that tend to have sub-optimal locality, SPADE, a vertical-format-based mining method, decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences. Not only can each equivalence class be solved independently, but it is also very likely that it can be processed in main memory. Thus SPADE usually makes only three database scans: one for frequent 1-sequences, another for frequent 2-sequences, and one to generate all other frequent sequences. Extensive experiments have shown that SPADE outperforms GSP by a factor of two, and by an order of magnitude when the support of 2-sequences is precomputed. However, the transformation of the database requires a large memory capacity and considerable program response time, and SPADE still needs three database scans whereas FP-Growth requires only two. FreeSpan, or Frequent-Pattern-Projected Sequential Pattern Mining [13], is a scalable and efficient sequential mining method. It uses frequent items to recursively project sequence databases, and grows subsequence fragments in each projected database. The strength of FreeSpan is that, in each subsequent database projection, it searches a smaller projected database than GSP does. This is because FreeSpan recursively projects a large sequence database into a set of small projected sequence databases based on the currently mined frequent item-patterns, and the subsequent mining is confined to each projected database, which is relevant to a smaller set of candidates. The major overhead of FreeSpan is that it may have to generate many nontrivial projected databases: if a pattern appears in every sequence of a database, its projected database does not shrink (except for the removal of some infrequent items). Moreover, since a length-k subsequence may grow at any position, the search for a length-(k+1) candidate sequence has to check every possible combination, which is costly.
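To illustrate the projection idea behind FreeSpan and PrefixSpan, the following toy Python sketch (ours) shows a single projection step. It simplifies by treating each sequence as a plain list of items, whereas the real algorithms handle sequences of item-sets.

    def project(seq_db, item):
        """Keep, for each sequence containing `item`, the suffix that
        follows its first occurrence; drop empty suffixes."""
        projected = []
        for seq in seq_db:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        return projected

    seq_db = [["a", "b", "c"], ["a", "c", "b", "c"], ["b", "a", "c"]]
    print(project(seq_db, "a"))  # [['b', 'c'], ['c', 'b', 'c'], ['c']]

Mining then recurses inside each (much smaller) projected database instead of rescanning the full one.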
In this context, the FP-Tree algorithm [1] has finally been adopted: it overcomes these disadvantages and optimizes the tool for investment decisions. The interest of the FP-Tree algorithm is that it avoids running repeatedly through the database, by storing all common elements in a compact structure called the Frequent-Pattern Tree. Furthermore, these elements are automatically sorted within the compact structure, which accelerates pattern search. Among existing data mining algorithms, the FP-Tree algorithm therefore appears as the most appropriate solution in this case, with significantly less execution time and greater accuracy; indeed, the faster the algorithm, the higher the achievable financial profitability. The FP-Tree algorithm is a powerful tool for analyzing signatures, extracting the most frequent of them, and optimizing the back-testing process.

II. FP-Tree Algorithm

The methodology combines a signature with frequent scenarios through the FP-Tree algorithm. To implement the trading strategy based on signatures, one must first define the content of a market configuration (a scenario) and transform it into the format expected by the frequent-pattern extraction algorithm. An intermediate coding step is thus necessary to make the raw data compatible with the algorithm. The FP-Tree is composed of two elements: a header table indexing the different items, and a tree structure whose root is labeled "null".

Figure 1. Transactions Database

Consider the transactions database of Figure 1, where the first column gives the transaction number and the second the corresponding items; a minimum support has to be fixed beforehand. For the 5-transaction database of Figure 1 and a 50% threshold, the minimum support is 50/100 * 5 = 2.5, so item sets are pruned at a count of at least 3. The retained items are sorted by decreasing frequency to obtain the header table; here, only the items {f, c, a, m, b, p} are kept. In parallel, the frequent items of each transaction are ordered accordingly, see Figure 2.

Figure 2. Frequent Ordered Items

Once this step is completed, everything is ready to construct the FP-Tree structure by creating and inserting the different nodes. The first created node, called the root, contains nothing but the links to its "children". Each transaction is browsed and its items are placed in the tree as follows: if a node already exists for an element, its occurrence count is incremented; otherwise the node is created. Then, for each created node, a link is established from the header table to the node inserted in the tree. The first transaction is composed of the elements {f, c, a, m, p}, ordered by decreasing weight. As f is the first element of the list, a node is inserted under the "root" element of the tree. The element f has been inserted for the first time, so its occurrence count is 1. Two links are created for the element f: one with the "root" element and one with the header table. The same method is applied for the insertion of the other elements {c, a, m, p} of the first transaction.

Figure 3. FP-Tree Structure after Insertion of First Transaction Elements

The second transaction shows the case where the occurrence count of an element has to be incremented. According to Figure 1, the second transaction is composed of the elements {f, c, a, b, m}. This time the tree already contains elements, and each element already present on the path has its occurrence count incremented by 1. This is the case for the elements {f}, {c} and {a}: their count becomes 2 (1+1). As the element {b} does not yet exist in the tree, a new node is created for it at the current position in the tree: a link is therefore created from node {a} to node {b}, and from the header table to node {b}. The same applies to the element {m}, which had not yet been inserted along this path. The second transaction is thus treated, and the remaining transactions have their elements inserted in the same way, see Figure 4.

Figure 4. Construction of FP-Tree from Second Transaction

At the end of the procedure, once all transactions of the database have been processed, the FP-Tree structure is as illustrated in Figure 5, where each element of the structure is linked from the corresponding entry of the header table.

Figure 5. FP-Tree Structure

The last step is to ensure that the information inside the tree is valid, by comparing the information obtained from the different nodes with that of the header table. From this table, the position of each item in the FP-Tree can easily be found.
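The construction just described can be condensed into the following minimal Python sketch; the class and function names are ours, not the paper's, and error handling is omitted.

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item = item        # item label, None for the root
            self.count = 1          # occurrence counter
            self.parent = parent    # link back toward the root
            self.children = {}      # item -> child FPNode
            self.link = None        # next node carrying the same item

    def build_fptree(transactions, min_count):
        # First database scan: count items and keep the frequent ones.
        counts = Counter(i for t in transactions for i in t)
        frequent = {i for i, c in counts.items() if c >= min_count}
        root = FPNode(None, None)
        header = {}                 # item -> first node of that item's chain
        # Second scan: insert each transaction with its frequent items
        # sorted by decreasing global frequency, so prefixes are shared.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1
                else:
                    child = FPNode(item, node)
                    node.children[item] = child
                    child.link = header.get(item)   # thread the header chain
                    header[item] = child
                node = node.children[item]
        return root, header

    # First two transactions from Figure 1; the last three are assumed.
    transactions = [list("fcamp"), list("fcabm"), list("fb"),
                    list("cbp"), list("fcamp")]
    root, header = build_fptree(transactions, min_count=3)
    node, total = header["m"], 0
    while node:                      # walk the header-table chain for "m"
        total += node.count
        node = node.link
    print(total)                     # 3 occurrences of "m" in the tree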
To identify the frequent item sets from the database transactions, the x-conditional database method is used, where x is any item belonging to the header table. The x-conditional database is first built for the item with the lowest occurrence. A simple procedure determines the x-conditional database; it is applied in the same way to the other items, browsing the header table from bottom to top:

1. Determine from the tree the paths containing x.
2. For each path, set the occurrence of the other items in the path equal to the occurrence of x.
3. Remove x from the different paths.
4. Gather the items, add up their occurrences, and prune them according to the minimum support.
5. From the remaining items, construct the x-conditional tree.
6. Re-include x and form all possible combinations with the remaining items, keeping the same occurrence for each combination.
7. Repeat this procedure until the header table has been entirely browsed.

After the procedure is completed (a compact implementation is sketched below), the results of Figure 6 are obtained for all items.

Figure 6. Summary Table of x-Conditional Pattern Bases
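The following Python sketch (ours) condenses steps 1 to 7 under one simplifying assumption: it recurses directly on the weighted prefix paths (the conditional pattern bases) rather than materializing each x-conditional tree, which yields the same frequent patterns. It assumes the items of each path follow one global order, as produced by the ordering step of Figure 2.

    from collections import Counter

    def mine(paths, min_count, suffix=()):
        """paths: list of (item_tuple, count) with consistently ordered items."""
        counts = Counter()
        for items, c in paths:
            for i in items:
                counts[i] += c
        result = {}
        for item, c in counts.items():
            if c < min_count:
                continue                      # step 4: prune by minimum support
            pattern = (item,) + suffix        # step 6: re-include x
            result[pattern] = c
            # Steps 1-3: paths containing `item`, truncated just before it,
            # each weighted by the count carried along with that path.
            conditional = [(items[:items.index(item)], c2)
                           for items, c2 in paths if item in items]
            result.update(mine(conditional, min_count, pattern))
        return result

    # Same illustrative transactions as before, ordered by global frequency.
    transactions = [list("fcamp"), list("fcabm"), list("fb"),
                    list("cbp"), list("fcamp")]
    counts = Counter(i for t in transactions for i in t)
    base = [(tuple(sorted(t, key=lambda i: (-counts[i], i))), 1)
            for t in transactions]
    patterns = mine(base, min_count=3)
    print(patterns[("c", "p")], patterns[("c", "f", "a", "m")])  # 3 3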
III. Discussion and Results

The main advantage offered by the FP-Tree algorithm is the reduced number of database scans. The algorithm is also comprehensive, as all information is contained in the frequent items, and concise, as only these frequent items are memorized, which has a positive effect both on mining efficiency and on processing time. Nevertheless, in spite of its compact structure, it is not guaranteed that the whole FP-Tree will fit in computer memory when the volume of database transactions is large. Moreover, its construction may take much more time and system resources than expected. To address this Big Data problem, the program classifies the different types of data behavior upstream. This yields separate inputs for the FP-Tree algorithm and reduces the amount of memory needed; the algorithm is then applied within each data compartment. The construction of the data distribution from the database transactions is shown in Figure 7, where the parameter epsilon (left of curve) sets the maximum difference allowed between behaviors within the same data class. This parameter is varied depending on FP-Tree algorithm performance.

Figure 7. Data Division and Classification vs Time

In order to compare the performance of the FP-Tree algorithm with a previously existing one such as Apriori, four different back-tests with five different databases have been performed. The back-tests were run on samples of 21,500, 10,000 and 5,000 timeframes. Data are current ones, i.e., tests are performed by comparing future data with previous ones. The support rate of data repetitions is 1.0. Our conclusions are the following:

- Concerning software performance, a large amount of data still limits execution time. Nevertheless, the larger the number of data, the better the chance of recognizing a signature from the past.
- Epsilon is an important parameter to take into account. Its variation directly influences the quality of the results, in terms of returns and standard deviation, when analyzing previous and current signatures. The larger epsilon is, the higher the chance of finding frequent patterns; conversely, the smaller epsilon is, the more accurate the result.

To calibrate epsilon, many back-tests have been performed, searching for its right value over the interval [0.01, 1]. For example, 0.01 means that only one piece of information out of 100 has been kept. To make the software more powerful, it is possible to incorporate a decaying exponential law into the code, giving more weight to recent information and filtering the database. Taking the decay function f(x) = 1 - exp(-λx) and varying λ, the weight of recent data can be adjusted according to its relevance. For example, information about a value from 1985 will carry much more weight than information about the same value from 1950. Indeed, owing to the constant improvement of technical analysis over the years, recent data can be expected to be more complete and faithful than old ones. These circumstances follow from the notion of market efficiency. A market is defined as efficient if and only if all available data about each financial asset listed on this market are immediately incorporated into the asset price (Fama, 1965). This efficiency helps explain why asset prices stay in synchrony with actual value, thanks to the new tools of technical analysis. Their constant improvement is the reason why asset data from 1985 are better than those from 1950, justifying the inclusion of a decaying (exponential) weighting rule in the algorithm.

The second remark concerns the parameter epsilon. It was noticed earlier how much epsilon affects result precision: the smaller epsilon is, the less time is needed to analyse the data, since fewer data fall within the ribbon created by epsilon. To verify this, two experiments have been carried out with epsilon held fixed, and the relationship between analysis time and the amount of manipulated data is reported in Figure 8 for epsilon = 0.3 and epsilon = 0.6.

Figure 8. Algorithm Execution Time (min) vs Number of Timeframes, for epsilon = 0.3 (right) and epsilon = 0.6 (left)

As Figure 8 shows, the algorithm needs less time to analyse the whole database for smaller epsilon, as correspondingly fewer data are manipulated. Indeed, when the user chooses a smaller epsilon, he decreases the search precision of the algorithm: the database is browsed faster, but the chances of matching a new signature to an already existing one are lower. After several tests, we obtained the curve linking the analysis time to the parameter epsilon (the amount of data becoming an intrinsic magnitude of this curve). This parameter varies from 0.01 to 1; when it equals 1, it no longer influences the database and becomes somehow neutral. Figure 9 displays the execution time vs the epsilon value; both the classification and weighting ideas of this section are sketched in code below.

Figure 9. Algorithm Execution Time (min) vs Epsilon Value
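The paper does not spell out the upstream classification procedure of Figure 7, so the following Python sketch is only a plausible reading of it, with an assumed assignment rule: a window joins an existing behavior class when its maximum pointwise gap to the class representative is at most epsilon, and otherwise starts a new class.

    def classify(windows, epsilon):
        """Group fixed-length data windows into behavior classes (assumed rule)."""
        classes = []                        # list of (representative, members)
        for w in windows:
            for rep, members in classes:
                if max(abs(a - b) for a, b in zip(rep, w)) <= epsilon:
                    members.append(w)       # close enough to this class
                    break
            else:
                classes.append((w, [w]))    # start a new behavior class
        return classes

    windows = [[0.00, 0.10, 0.20], [0.01, 0.12, 0.19], [0.50, 0.40, 0.30]]
    for rep, members in classify(windows, epsilon=0.05):
        print(rep, "->", len(members))      # the first two windows share a class

A larger epsilon merges more windows into each class (more candidate matches, less precision), consistent with the trade-off reported above.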
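Similarly, the recency weighting can be illustrated with the reconstructed form f(x) = 1 - exp(-λx); the date normalization and the λ value below are illustrative assumptions, not the paper's calibration.

    import math

    def recency_weight(x, lam=3.0):
        """Weight in [0, 1): x = 0 for the oldest datum, x = 1 for the newest."""
        return 1.0 - math.exp(-lam * x)

    for year in (1950, 1985, 2014):
        x = (year - 1950) / (2014 - 1950)   # normalize the date into [0, 1]
        print(year, round(recency_weight(x), 3))
    # 1950 -> 0.0, 1985 -> 0.806, 2014 -> 0.95: recent data weigh more.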
IV. Conclusion

The improvement of the technical indicators generating a buy/sell signal, so as to optimize fund managers' quantitative strategies, requires good ways of searching for the best signatures, in order to study their similarities with market patterns already present. The present study suggests a method for handling large databases that contributes to quicker and more efficient decisions by investors. The results show that, by using a good algorithm for extracting frequent item sets, it is possible to find in minimal time the best signatures from historical data when compared with existing ones. The implementation of the efficient FP-Tree algorithm with a support rate of 1.0 halves the average execution time, whatever the number of timeframes, compared with other uses of data mining algorithms such as Apriori. Despite the size of the data, no information is lost and the accuracy of the results is not degraded. The combination of a data-reduction method applied upstream and of the FP-Tree algorithm is essential for fully addressing the problem of handling large databases and avoiding a decline in profitability.

Acknowledgments

The authors are very much indebted to ECE Paris School of Engineering for having provided the environment in which the study has been developed, to Dr for advice during the research, and to Pr. M. Cotsaftis for help in the preparation of the manuscript.

References

[1] N.P. Lin, W.H. Hao, H.J. Chen: Fast Accumulation Lattice Algorithm for Mining Sequential Patterns, Proc. 6th WSEAS Intern. Conf. on Applied Computer Science, Vol. 6, pp. 229-233.
[2] T. Oates, D. Jensen, P.R. Cohen: Automatically Acquiring Rules for Event Correlation from Event Logs, Technical Report 97-14, Computer Science Dept., University of Massachusetts, Amherst, MA, 1997.
[3] R.J. Bayardo: Efficiently Mining Long Patterns from Databases, Proc. ACM-SIGMOD Intern. Conf. on Management of Data, pp. 145-154, 1999.
[4] C. Avery, P. Zemsky: Multidimensional Uncertainty and Herd Behavior in Financial Markets, Amer. Economic Review, Vol. 88(4), pp. 724-748, 1998.
[5] S. Grossman, J. Stiglitz: On the Impossibility of Informationally Efficient Markets, Amer. Economic Review, Vol. 70(3), pp. 393-408, 1980.
[6] R. Bloomfield, M. O'Hara: Market Transparency: Who Wins and Who Loses?, The Review of Financial Studies, Vol. 12(1), pp. 5-35, 1999.
[7] S.B. Achelis: Technical Analysis from A to Z, 2nd ed., McGraw-Hill, New York, 2000.
[8] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases, Proc. 20th Intern. Conf. on Very Large Databases (VLDB'94), Chile, pp. 487-499, 1994.
[9] B. Dunkel, N. Soparkar: Data Organization and Access for Efficient Data Mining, Proc. 15th Intern. Conf. on Data Engineering (ICDE'99), pp. 522-529, 1999.
[10] J. Han, J. Pei, Y. Yin: Mining Frequent Patterns without Candidate Generation, Tech. Rept. 99-10, Simon Fraser Univ., Vancouver, CA, 1999.
[11] Consultant and IT Architect KTA: L'algorithme FP-Growth, unpublished.
[12] M.J. Zaki: SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, Vol. 42, pp. 31-60, 2001.
[13] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu: FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining, Proc. 6th ACM SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining, pp. 355-359, 2000.
[14] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg: Top 10 Algorithms in Data Mining, Knowledge and Information Systems, Vol. 14, pp. 1-37, 2008.
[15] I. Klein, W. Schachermayer: Asymptotic Arbitrage in Non-Complete Large Financial Markets, Theory of Probability and its Applications, Vol. 41(4), pp. 927-934, 1996.
[16] P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining, Addison-Wesley, New York, 2006.
[17] J. Han, J. Pei: Mining Frequent Patterns by Pattern-Growth: Methodology and Implications, ACM SIGKDD Explorations Newsletter, Vol. 2(2), pp. 14-20, 2000.
[18] J. Pei, J. Han, L.V.S. Lakshmanan: Mining Frequent Itemsets with Convertible Constraints, Proc. 17th Intern. Conf. on Data Engineering, pp. 433-442, 2001.
[19] B. Liu: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, Heidelberg, 2007.
[20] Q. Yang, X. Wu: 10 Challenging Problems in Data Mining Research, Intern. J. of Information Technology and Decision Making, Vol. 5(4), pp. 597-604, 2006.
[21] A. Gangemi: A Comparison of Knowledge Extraction Tools for the Semantic Web, Proc. ESWC 2013, LNCS 7882, pp. 351-366, Springer-Verlag, Berlin Heidelberg, 2013.
[22] M. Sheikh, S. Conlon: A Rule-Based System to Extract Financial Information, JCIS, Vol. 52(4), pp. 10-19, 2014.
[23] M. Sewell: Introduction to Behavioral Finance, Behavioral Finance Net, 14 April 2010.
[24] S.D. Campbell: A Review of Backtesting and Backtesting Procedures, Finance and Economics Discussion Series 2005-21, Div. of Research and Statistics and Monetary Affairs, Federal Reserve Board, Washington, D.C., 2005.
[25] F. Verhein: Frequent Pattern Growth (FP-Growth) Algorithm, School of Information Technologies, Univ. of Sydney, Australia, Jan. 10, 2008.