Proceedings of 29th International Business Research Conference, 24-25 November 2014, Novotel Hotel Sydney Central, Sydney, Australia, ISBN: 978-1-922069-64-1

Application of Frequent-Pattern Tree Algorithm to Dow Jones US Pharmaceuticals Index

Enis Younsi, Holy Andriamboavonjy, Anthony David, Steven Dokou and Badre Lemrabet

Time optimization and high-quality software are very important factors in competitive financial markets, so any improvement of the technical indicators that generate a buy or sell signal is a major issue. Many tools have thus been created to make them more effective. In a previous paper, this concern for efficiency led to a search for the best (and most innovative) approach yielding the largest improvement of these indicators. The approach consists in attaching a signature to frequent market configurations by applying a frequent-pattern extraction method, the most appropriate here to optimize investment strategies. The ultimate goal of the algorithm is to extract the most accurate signatures, using a back-testing procedure applied to technical indicators to improve their performance, and to determine the signatures that outperform these indicators. To this aim, the FP-Tree algorithm has been selected as the most efficient for the task. Application to the DJ Pharmaceuticals Index confirms the interesting possibilities of the proposed method.

Keywords: Quantitative Analysis, Back-testing, Computational Models, Apriori Algorithm, Pattern Recognition, Data Mining, FP-Tree

I. Introduction

Today it is very easy to collect all kinds of data; however, processing limitations persist in extracting value from them. In all domains, from science to finance, huge volumes of data exist, yet most of them remain untapped. Banks, for example, must analyze this mass of data quickly enough to make the right investment decisions. The problem no longer lies in the collection but in the extraction of value from these data.
To meet this challenge, knowledge extraction in databases (data mining) has been largely developed [1, 2, 3, 4, 5], particularly in the financial decision area [6, 7, 8]. In this world, events follow the traditional cycle of cause and effect. The expansion of hedge funds and the development of computers have boosted data analysis. With the emergence of behavioral finance [9], a comprehensive in-depth analysis of all data and archived transactions is needed. In this context, the improvement of technical indicators [10, 11] generating a buy/sell signal is a major issue. This concern for efficiency leads to finding the best (and most innovative) way to further improve these indicators [12]. In the ignorance of the internal dynamics of the economic system, the technique of frequent patterns appeared to be the most appropriate method to allow fund managers to optimize their quantitative strategies [13]. To address the problem of handling large databases, a method has been identified that is capable of classifying curves representing data behavior with a relatively modest numeric involution, without losing information [14]. These classes of behavior are then submitted to the frequent-pattern algorithm, which will therefore not face storage-memory problems.

Mining frequent patterns has been the most studied data mining task of the last ten years. It is used here to find signatures. A signature is a set of elements, representing in the present case a market configuration, which will be compared to another one to find similarities. Hence, the method is to assign a signature to common market configurations. Through a back-testing procedure [15], the best signatures are kept for the next period, called validation if it is preceded by a "good" signature, and used in combination with the technical indicator to improve its performance. A conventional stock data mining (i.e., a search for frequent patterns) is used to identify signatures characterizing the few days before a technical indicator outbreak. One thus obtains specific underlying rules of technical analysis, based on several days and adapted to the period. Clustering, or unsupervised classification, grouping similar situations, is the general context of financial decision-making on which the present problem is based.

The algorithm principle is based on the Frequent-Pattern Tree (FP-Tree) extraction algorithm [16]. Indeed, the implementation of FP-Tree is the most effective for analyzing the whole set of frequent patterns, as well as the most suitable for real-time high-frequency trading. Two families of methods exist for sequential pattern mining: Apriori-based approaches [17, 18], which include the GSP and SPADE algorithms [19], and pattern-growth-based approaches [20], which include the FreeSpan and PrefixSpan algorithms. The common data mining approach is to find frequent item-sets in a transaction dataset and then derive association rules from them. Finding frequent item-sets (item-sets with an appearance frequency larger than or equal to a user-specified minimum support) is not trivial because of combinatorial explosion. Once frequent item-sets are obtained, it is straightforward to generate association rules with a confidence larger than or equal to a user-specified minimum confidence. The most outstanding improvement over Apriori is FP-Growth (frequent-pattern growth) [21, 22], based on a "divide and conquer" strategy: it compresses the database while retaining all essential information, divides the compressed database into a set of conditional databases, each associated with one frequent item, and mines each one separately.

(E. Younsi, H. Andriamboavonjy, A. David, S. Dokou and B. Lemrabet are undergraduate students at ECE Paris School of Engineering. Corresponding author: eyounsi@ece.fr)
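The minimum-support and minimum-confidence notions described in this section can be illustrated with a short sketch. The toy transactions, item names and thresholds below are illustrative assumptions, not data from the paper; real mining would use FP-Growth rather than this brute-force enumeration:

```python
from itertools import combinations

# Toy transaction database; each transaction is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

min_support = 3  # absolute count: itemset must appear in >= 3 transactions

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Enumerate all frequent itemsets by brute force (fine for toy data; this
# combinatorial enumeration is exactly what FP-Growth avoids on real databases).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Confidence of the association rule {a} -> {b}:
conf_ab = support({"a", "b"}) / support({"a"})  # 3/4 = 0.75
```

Here {a, b, c} appears in only 2 of 5 transactions, so it is pruned, while every pair survives; the rule {a} -> {b} would be emitted only if the user-specified minimum confidence is at most 0.75.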
FP-Growth is an order of magnitude faster than the original Apriori algorithm. Unlike previous approaches, which make multiple database scans and use complex hash-tree structures with sub-optimal locality, SPADE (Sequential PAttern Discovery using Equivalence classes) decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences. Not only can each equivalence class be solved independently, but it is also very likely that it can be processed in main memory. FreeSpan, or Frequent-Pattern-Projected Sequential Pattern Mining [23], is a scalable and efficient sequential mining method. It uses frequent items to recursively project sequence databases, and grows subsequence fragments in each projected database. The strength of FreeSpan is that it searches a smaller projected database than GSP in each subsequent database projection; its major overhead is that it may have to generate many nontrivial projected databases. In this context, the FP-Tree algorithm [24] has finally been opted for. It overcomes these disadvantages and optimizes the tool for investment decisions. The interest of the FP-Tree algorithm is that it avoids running repeatedly through the database by storing all common elements in a compact structure called the Frequent-Pattern Tree. Furthermore, these elements are automatically sorted in this compact structure, which accelerates pattern search. The FP-Tree algorithm thus appears as the most appropriate solution for this case among existing data mining algorithms, with significantly less execution time and greater accuracy. Indeed, the faster the algorithm, the higher the financial profitability can be. The FP-Tree algorithm is a powerful tool for analyzing signatures, extracting the most frequent of them, and optimizing the back-testing process.

II. FP-Tree Algorithm

The methodology combines a signature with frequent scenarios through the FP-Tree algorithm.
To implement the trading strategy based on signatures, one must first define the content of a market configuration (a scenario) and transform it into the format expected by the frequent-pattern extraction algorithm. An intermediate coding step is thus necessary to make raw data compatible with the algorithm. The FP-Tree algorithm is particularly suited to searching frequent patterns in a very large database. Its interest is to avoid browsing the database repeatedly, by storing the set of frequent elements in the (compact) FP-Tree structure. Moreover, these frequent elements are sorted automatically in the compact structure, which accelerates the search for patterns. The FP-Tree algorithm is the most effective solution among data mining algorithms, with lower complexity and greater precision.

An FP-Tree is basically composed of two elements: a header table indexing the different item sets, and a tree structure with a root labeled "null". After the different steps of data manipulation, the last one is to ensure that the information inside the tree is well validated, by comparing the information obtained from the different nodes with the one from the header table. From this table, the position of each item can easily be found in the FP-Tree. To identify the frequent item sets from the database transactions, the x-conditional database method (x being any item belonging to the header table) is used. The x-conditional database is first built for the item having the lowest occurrence. To determine the x-conditional database, a simple method in seven steps, summarized below, is followed; it is applied in the same way to the other items by browsing from the bottom to the top of the header table [25].

1. Determine from the tree the paths containing x.
2. For each path, equalize the occurrence of the other items inside this path with the x-occurrence.
3. Remove x from the different paths.
4. Put together each item, count their occurrences, and prune the items according to the minimum support.
5. From the remaining items, construct the x-conditional tree.
6. Include x and make all possible combinations with the remaining items, keeping the same occurrence for each combination.
7. Repeat this procedure until the header table is completely browsed.

The main advantage offered by the FP-Tree algorithm is the reduced number of database scans. The algorithm is also comprehensive, as all information is contained in the frequent items, and concise, as only these frequent items are memorized, which has a positive effect both on mining efficiency and on processing time. Nevertheless, in spite of its compact structure, it is not guaranteed that the whole FP-Tree structure can be stored in computer memory when the volume of database transactions is too large. To address this Big Data problem, the program classifies the different types of data behavior upstream; different inputs for the FP-Tree algorithm are thus obtained, which reduces the size of the needed memory. The algorithm is then applied in each data compartment.

III. Algorithm Efficiency

The construction of the data distribution from the database transactions is shown in Figure 1, where the parameter ε (left of curve) determines the size of the maximum difference between the behaviors within each data class. This parameter is variable depending on FP-Tree algorithm performance.

Figure 1. Data Division and Classification vs Time

In order to compare the performance of the FP-Tree algorithm with a previously existing one like Apriori, four different back-tests with five different databases have been performed on samples of 21 500, 10 000 and 5 000 timeframes. Data are current ones, i.e., tests are performed by comparing future data with previous ones.
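As a concrete illustration of the structure just described (a header table plus a tree rooted at "null") and of steps 1-3 of the x-conditional procedure, here is a minimal Python sketch. The toy transactions and the minimum support of 3 are illustrative assumptions, not data from the paper:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # First scan: count item frequencies and keep only the frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}

    root = Node(None, None)       # tree root labeled "null"
    header = defaultdict(list)    # header table: item -> nodes carrying it

    # Second scan: insert each transaction with its frequent items sorted
    # by descending frequency, so common prefixes share tree paths.
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

def conditional_patterns(header, x):
    """Steps 1-3 of the x-conditional procedure: collect the prefix paths
    leading to x, each weighted by x's count on that path."""
    patterns = []
    for node in header[x]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            patterns.append((frozenset(path), node.count))
    return patterns

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                {"b", "c"}, {"a", "b", "c"}]
root, header, freq = build_fptree(transactions, min_support=3)
base_c = conditional_patterns(header, "c")  # c's conditional pattern base
```

On this toy database the conditional base of c is {a, b}: 2, {a}: 1, {b}: 1, from which the c-conditional tree of step 5 would be built.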
The support rate of data repetitions is 1.0. As a result, a balance exists on the software performance side between limiting execution time with a smaller data set and the higher chance of recognizing a past signature with a larger one. Another balance also exists with ε: a larger ε increases the chance of finding frequent patterns, while the smaller ε is, the more accurate the result. To calibrate ε, many back-tests have been performed to find its right value by analyzing the interval [0.01, 1] (0.01 means that only one piece of information out of 100 has been taken). To increase software power and reduce data processing, a decaying law can be introduced into the code, giving more weight to recent information and filtering the database. With a decaying function f(x), the value of recent data can be adjusted according to its relevance. The second point concerns the trade-off between preciseness and data processing through ε. To this aim, the relationship between analysis time and amount of manipulated data has first been studied; it shows a strong reduction of execution time for smaller ε, as reasonably less data are manipulated, but there are fewer chances of matching a new signature to an already existing one. After several tests, the curve linking execution time to the parameter ε (with the amount of data becoming an intrinsic magnitude of this curve) is displayed in Figure 3, with ε varying from 0.01 to 1. When ε = 1, it no longer influences the database and becomes somehow neutral.

Figure 3. Algorithm Execution Time (min) vs ε Value

IV. Application and Convergence Analysis

After evaluating the treatment time needed to analyze the different signature batches, the algorithm has been applied to real financial products.
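The decaying law f(x) is not written out explicitly in the text; a common concrete choice, shown here purely as an assumption, is an exponential decay parameterized by a half-life in timeframes:

```python
import math

def decay_weights(n, half_life):
    """Exponential decay weights for n observations ordered oldest-first.

    `half_life` (in timeframes) is an illustrative parameter, not a value
    from the paper: an observation `half_life` steps in the past receives
    half the weight of the most recent one.
    """
    lam = math.log(2.0) / half_life
    return [math.exp(-lam * (n - 1 - i)) for i in range(n)]

# Weight 5 observations with a half-life of one timeframe: the most recent
# observation gets weight 1.0, the previous one 0.5, then 0.25, and so on.
w = decay_weights(5, half_life=1)
```

Such weights can multiply each timeframe's contribution before signature extraction, so that old market configurations fade out of the database instead of being filtered by a hard cutoff.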
Three different financial products with different behaviors have been considered: stocks, commodities and bonds. Each type of financial instrument has a very different behavior, and testing the algorithm on each of them allows evaluating its prediction ability. Here a stock has first been taken as being very variable; this is a good test of the proposed algorithm's ability to identify signatures with very high amplitude. Indeed, the more irregular the product price, the more difficult it is to find similarities in the past and to guess future variations. Attention has therefore been focused here on a more global Dow Jones sector index. The Dow Jones U.S. Select Sector Indices measure precise, and typically narrowly-defined, sectors of the U.S. stock market. To be included in the indices, stocks must meet minimum size requirements based on market capitalization, and component weights are capped for diversification. For the present application, the Dow Jones US Pharmaceuticals Index (DJUSPR) has been retained.

Figure 4. Historical Prices of the Dow Jones US Pharmaceuticals Index (DJUSPR) since its beginning

As seen in Figure 4, the Pharmaceuticals Index has been growing steadily over the last five years, with some small decreases. From its creation to 2009, this index was globally stable and exhibits different types of variations. The proposed algorithm being based on past variations, the more different situations it analyzes, the better it will be able to guess future trends. On the other hand, the disadvantage of the DJ US Pharmaceuticals Index is that it is quite recent, with fourteen years of collected data, compared to the other stocks existing on the market. So it is very interesting to test the algorithm on this product, to see whether it can distinguish between a small decrease followed by growth and a real decrease or a leveling off. The first step has been to gather all Open, High, Low and Close data of this index.
Indeed, to be able to create a reliable signature database, one must make sure to have all needed data for each day, to avoid a shift in the data. Since its origin, there have been 3691 quotations, or 3671 timeframes of twenty days, for the DJ Pharmaceuticals Index. The analysis and the signature creation took 143 minutes. A small number of timeframes are analyzed compared to the amount collected in the different data batches, so benchmarking the algorithm is quick; what is slower is the calibration. As seen before, ε fixes the needed accuracy for signature comparison. For a very high prescribed precision, finding exactly the same signature is very rare. Conversely, choosing a very low accuracy allows finding many other signatures looking like the one under analysis. So the algorithm must be tested with different values of ε. Starting from the initial value ε = 1, one is looking for exactly the same signature as in the past. If the result is positive, this is a very good opportunity to find out whether the chart will go up, go down or stay stable. With this value of ε, it was not possible to find a similar signature. This result agrees with the fact that stock variations are strong, so it is not easy to find exactly the same one in the future.

Figure 5. Number of similar signatures found vs ε value

As seen in Figure 5, the result also depends on the amount of available data. In this case, taking ε = 0.5, 22 signatures are found, which is a very high number for a recent database. Taking ε = 0.8 gives 2 cases similar to the current one. The first one indicates that there will be a decrease in the following time period and the second one that the quotation will increase, so it is not possible to use this data, as it does not give a precise statement on the action to take.
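The paper does not spell out its exact similarity measure, so the sketch below is one plausible interpretation: signatures are encoded as fixed-length sequences of normalized moves, ε = 1 demands an exact match, and smaller ε values tolerate a larger per-component deviation. The metric, the `scale` parameter and the toy series are all illustrative assumptions:

```python
def matches(sig_a, sig_b, epsilon, scale=1.0):
    """Compare two signatures (equal-length sequences of normalized moves).

    Following the paper's convention that epsilon = 1 means an exact match
    while lower epsilon accepts looser matches, the allowed per-component
    deviation is taken here as (1 - epsilon) * scale -- an assumed metric,
    not the paper's own.
    """
    if len(sig_a) != len(sig_b):
        return False
    tol = (1.0 - epsilon) * scale
    return all(abs(x - y) <= tol for x, y in zip(sig_a, sig_b))

def find_similar(history, current, epsilon):
    """Indices of past windows whose signature is similar to `current`."""
    n = len(current)
    return [i for i in range(len(history) - n + 1)
            if matches(history[i:i + n], current, epsilon)]

history = [0.1, -0.2, 0.1, -0.2, 0.3]   # toy series of normalized moves
current = [0.1, -0.2]                   # signature under analysis
exact = find_similar(history, current, epsilon=1.0)  # exact matches only
loose = find_similar(history, current, epsilon=0.5)  # looser tolerance
```

As in the calibration discussed above, lowering ε can only grow the set of matches: every window accepted at ε = 1 is still accepted at ε = 0.5, along with merely similar ones.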
So even if signatures are found, it does not mean that the exact future trend has been found. One may be in a case where a new situation occurs that never happened before, or a case where the same number of signatures with a decrease and with an increase has been found. An important point is the choice of the correct ε. In fact, when applying the present algorithm to the market, a specific ε value has to be determined for each stock, suggesting the implementation of an adaptive scheme [26] allowing the algorithm to estimate the correct ε value and to remember the previously determined one.

V. Conclusion

The improvement of technical indicators generating a buy/sell signal, to optimize fund managers' quantitative strategies, requires good ways to search for the best signatures in order to study similarities with already present market patterns. The present study suggests a method to solve the problem of handling large databases and to contribute to quicker and more efficient decisions by investors. The results show that, by using a good extraction algorithm for frequent item sets, it is possible to find in minimal time the best signatures from historical data when compared to existing ones. The implementation of the efficient FP-Tree algorithm with a support rate of 1.0 divides the average execution time by two, whatever the number of timeframes, compared to other data mining algorithms like Apriori. Despite the size of the data, no information is lost and the accuracy of the results is not degraded. The combination of a numeric involution method on upstream data with the FP-Tree algorithm is essential to fully address the problem of handling large databases and to avoid a decline in profitability. Application to the DJ Pharmaceuticals Index confirms the applicability and the limitations of the proposed method.

Acknowledgments

The authors are very much indebted to ECE Paris School of Engineering for having provided the environment where the study has been developed, and to Pr. M. Cotsaftis for help in the preparation of the manuscript.
References

[1] P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining, Addison-Wesley, New York, 2006; N.P. Lin, W.H. Hao, H.J. Chen: Fast Accumulation Lattice Algorithm for Mining Sequential Patterns, Proc. 6th WSEAS Intern. Conf. on Applied Computer Science, Vol.6, pp.229-233, 2007
[2] R. Agrawal, R. Srikant: Fast Algorithm for Mining Association Rules in Large Databases, Proc. 20th Int. Conf. on Very Large Databases (VLDB'94), Santiago, Chile, pp.487-499, 1994; H. Liu, J. Han, D. Xin, Z. Shao: Mining Frequent Patterns on Very High Dimensional Data: a Top-down Row Enumeration Approach, Proc. SIAM Intern. Conf. on Data Mining (SDM'06), Bethesda, MD, pp.280-291, 2006
[3] B. Dunkel, N. Soparkar: Data Organization and Access for Efficient Data Mining, Proc. 15th Int. Conf. on Data Engineering (ICDE'99), pp.522-529, 1999; T.P. Exarchos, M.G. Tsipouras, C. Papaloukas, D.I. Fotiadis: An Optimized Sequential Pattern Matching Methodology for Sequence Classification, Knowledge and Information Systems, Vol.19, pp.249-264, 2009; M.J. Zaki, S. Parthasarathy, M. Ogihara, W. Li: Parallel Algorithm for Discovery of Association Rules, Data Mining and Knowledge Discovery, Vol.1, pp.343-374, 1997
[4] B. Liu: Web Data Mining: Exploring Hyperlinks, Contents and Usage, Springer, Heidelberg, 2007
[5] A. Gangemi: A Comparison of Knowledge Extraction Tools for the Semantic Web, Proc. ESWC 2013, LNCS 7882, pp.351-366, Springer-Verlag, Berlin Heidelberg, 2013
[6] C. Avery, P. Zemsky: Multidimensional Uncertainty and Herd Behavior in Financial Markets, Amer. Economic Review, Vol.9(1), pp.724-748, 1998
[7] S. Grossman, J. Stiglitz: On the Impossibility of Informationally Efficient Markets, Amer. Economic Review, Vol.70(3), pp.393-408, 1980
[8] R. Bloomfield, M. O'Hara: Market Transparency: Who Wins and Who Loses?, The Review of Financial Studies, Vol.12(1), pp.5-35, 1999
[9] M. Sewell: Introduction to Behavioral Finance, Behavioral Finance Net, 14 April 2010
[10] S.B. Achelis: Technical Analysis from A to Z, 2nd ed., McGraw-Hill, New York, 2000
[11] M. Sheikh, S. Conlon: A Rule-Based System to Extract Financial Information, JCIS, Vol.52(4), pp.10-19, 2014
[12] T. Oates, D. Jensen, P.R. Cohen: Automatically Acquiring Rules for Event Correlation from Event Logs, Technical Report 97-14, Computer Science Dept., University of Massachusetts, Amherst, MA, 1997
[13] J. Pei, J. Han, L.V.S. Lakshmanan: Mining Frequent Item-sets with Convertible Constraints, Proc. 17th Intern. Conf. on Data Engineering, pp.433-442, 2001; G. Grahne, J. Zhu: High Performance Mining of Maximal Frequent Item-sets, Proc. 6th SIAM Intern. Workshop on High Performance Data Mining, pp.135-143, 2003
[14] I. Klein, W. Schachermayer: Asymptotic Arbitrage in Non-Complete Large Financial Markets, Theory of Probability and its Applications, Vol.41(4), pp.927-934, 1996
[15] S.D. Campbell: A Review of Back-testing and Back-testing Procedures, Finance and Economics Discussion Series 2005-21, Div. of Research and Statistics and Monetary Affairs, Federal Reserve Board, Washington, D.C., 2005
[16] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu: FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining, Proc. 6th SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining, pp.355-359, 2000
[17] R.J. Bayardo: Efficiently Mining Long Patterns from Databases, Proc. ACM-SIGMOD Intern. Conf. on Management of Data, pp.145-154, 1999
[18] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg: Top 10 Algorithms in Data Mining, Knowledge and Information Systems, Vol.14, pp.1-37, 2008
[19] M.J. Zaki: SPADE: an Efficient Algorithm for Mining Frequent Sequences, Machine Learning, Vol.42, pp.31-60, 2001
[20] F. Verhein: Frequent Pattern Growth (FP-Growth) Algorithm, School of Information Technologies, Univ. of Sydney, Australia, Jan. 10, 2008; J. Han, J. Pei: Mining Frequent Patterns by Pattern-Growth: Methodology and Implications, ACM SIGKDD Explorations Newsletter, Vol.2(2), pp.14-20, 2000
[21] J. Han, J. Pei, Y. Yin: Mining Frequent Patterns without Candidate Generation, Tech. Rept. 99-10, Simon Fraser Univ., Vancouver, CA, 1999
[22] Consultant and IT Architect KTA: L'algorithme FP-Growth, unpublished
[23] J. Han, H. Cheng, D. Xin, X. Yan: Frequent Pattern Mining: Current Status and Future Directions, Data Mining and Knowledge Discovery, Vol.15, pp.55-86, 2007; B. Goethals: Survey on Frequent Pattern Mining, Technical Report, http://www.adrem.ua.ac.be/bibrem/pubs/fpm_survey.pdf, 2005
[24] B.S. Kumar, K.V. Rukmani: Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms, Int. J. of Advanced Networking and Applications, Vol.1(6), pp.400-404, 2010
[25] E. Younsi, H. Andriamboavonjy, A. David, S. Dokou, B. Lemrabet: Frequent-Pattern Tree Algorithm - Application to S&P and Equity Indexes, Proc. 7th Asia-Pacific Business Research Conference, Singapore, August 25-26, pp.1-8, 2014
[26] D. Xin, X. Shen, Q. Mei, J. Han: Discovering Interesting Patterns through User's Interactive Feedback, Proc. 2006 ACM SIGKDD Intern. Conf. on Knowledge Discovery in Databases (KDD'06), Philadelphia, PA, pp.773-778, 2006