Proceedings of 7th Asia-Pacific Business Research Conference
25 - 26 August 2014, Bayview Hotel, Singapore ISBN: 978-1-922069-58-0
Frequent-Pattern Tree Algorithm Application to S&P and
Equity Indexes
E. Younsi, H. Andriamboavonjy, A. David, S. Dokou and B. Lemrabet
Software and execution-time optimization are critical in financial markets, which are
highly competitive fields, and the emergence of new computing tools further sharpens
the challenge. In this context, any improvement of the technical indicators that
generate buy or sell signals is a major issue, and many tools have been created to
make them more effective. This concern for efficiency leads, in the present paper, to
a search for the most effective (and most innovative) way to improve these indicators.
The approach consists in attaching a signature to frequent market configurations by
applying a frequent-pattern extraction method, which appears here to be the most
appropriate way to optimize investment strategies. The goal of the proposed trading
algorithm is to find the most accurate signatures through a back-testing procedure
applied to technical indicators, in order to improve their performance. The problem is
then to determine the signatures which, combined with an indicator, outperform that
indicator alone. For this task the FP-Tree algorithm has been chosen, as it appears to
be the most efficient algorithm available.
Keywords: Quantitative Analysis, Back-testing, Computational Models, Apriori
Algorithm, Pattern Recognition, Data Mining, FP-Tree
I. Introduction
With the advance of computer technology, it is now very easy to collect all kinds of
data. However, processing limitations persist when it comes to extracting value. In all
domains, from science to finance, huge volumes of data exist, and most of them
remain untapped. Banks, for example, must analyze this mass of data quickly enough
to make the right investment decisions. The problem no longer lies in the collection
but in the extraction of value from these data.
To meet this challenge, knowledge extraction from databases (data mining) has been
developed [1,8,9,16,19,21], particularly in the "financial-decision" area [4,5,6]. In this
world, events follow the traditional cycle of cause and effect. The expansion of hedge
funds and the development of computers have boosted data analysis, and with the
emergence of behavioral finance [23] a comprehensive, in-depth analysis of all
archived data and transactions is needed.
In this context, the improvement of technical indicators [7,22] generating a buy/sell
signal is a major issue. This concern for efficiency leads to a search for the best (and
most innovative) way to further improve these indicators [2]. As the internal dynamics
of the economic system are unknown, the technique of frequent patterns appears to
be the most appropriate method for allowing fund managers to optimize their
quantitative strategies [18]. To address the problem of handling large databases, a
method is identified that can classify the curves representing data behavior, with a
relatively modest numerical reduction and without loss of information [15]. Thus, these
__________________________________________________________________
E. Younsi, H. Andriamboavonjy, A. David, S. Dokou and B. Lemrabet, Undergraduate Students, ECE
Paris School of Engineering. Corresponding author: eyounsi@ece.fr
classes of behavior are submitted to the frequent-pattern algorithm, which then does
not face storage-memory problems. Mining frequent patterns has been the most
studied data mining task for the past ten years. It is used here to find signatures.
A signature is a set of elements, representing in the present case a market
configuration, which is compared with another one to find similarities. Hence, the
method assigns a signature to common market configurations. Through a back-testing
procedure [24], the best signatures are kept for the next period, called validation, if
they are preceded by a "good" signature, and are used in combination with the
technical indicator to improve its performance.
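To fix ideas, the sketch below shows how a few days of market data could be encoded as such a signature, i.e., a set of symbolic items. It is written in Python for illustration only; the bucketing scheme and thresholds are entirely hypothetical, not the coding step actually used in the study.

def encode_day(ret: float) -> str:
    # Map a daily return to a coarse symbolic item (thresholds assumed).
    if ret > 0.01:
        return "strong_up"
    if ret > 0:
        return "up"
    if ret > -0.01:
        return "down"
    return "strong_down"

def signature(returns) -> frozenset:
    # A market configuration as a set of (day-offset, symbol) items.
    return frozenset(f"d{i}:{encode_day(r)}" for i, r in enumerate(returns))

# Three days preceding an indicator signal:
print(signature([0.012, -0.004, 0.002]))
# e.g. frozenset({'d0:strong_up', 'd1:down', 'd2:up'})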
Conventional data mining (i.e., the search for frequent patterns) is used to identify signatures
characterizing the few days before a technical indicator fires. One thus obtains specific underlying
rules of technical analysis, based on several days and adapted to the period. Clustering, or
unsupervised classification, which groups similar situations, is the general context of financial
decision-making on which the present problem is based. The algorithm principle is based on the
Frequent-Pattern Tree (FP-Tree) extraction algorithm [17]. Indeed, an FP-Tree implementation is the
most effective for analyzing the whole set of frequent patterns, and also the most suitable for real-time
high-frequency trading.
Two families of methods exist for sequential pattern mining: Apriori-based approaches [3,14], which
include the GSP and SPADE algorithms [12], and pattern-growth-based approaches [25], which include
the FreeSpan and PrefixSpan algorithms. The common data mining approach is to find frequent
item-sets in a transaction dataset and then derive association rules. Finding frequent item-sets
(item-sets whose appearance frequency is larger than or equal to a user-specified minimum support) is
not trivial because of combinatorial explosion. Once frequent item-sets are obtained, it is
straightforward to generate association rules with confidence larger than or equal to a user-specified
minimum confidence.
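As an illustration of this generic approach (not of any particular algorithm above), the following brute-force Python sketch enumerates frequent item-sets under a minimum support and derives association rules under a minimum confidence; practical miners such as Apriori or FP-Growth exist precisely to avoid this exhaustive enumeration.

from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Enumerate every subset of every transaction (exponential; fine for
    # tiny examples only) and keep those meeting the minimum support.
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def association_rules(freq, min_conf):
    # Derive rules A -> B with confidence = support(A u B) / support(A).
    out = []
    for itemset, sup in freq.items():
        for k in range(1, len(itemset)):
            for a in map(frozenset, combinations(itemset, k)):
                conf = sup / freq[a]
                if conf >= min_conf:
                    out.append((set(a), set(itemset - a), conf))
    return out

tx = [{"f", "c", "a", "m", "p"},
      {"f", "c", "a", "b", "m"},
      {"f", "b"},
      {"c", "b", "p"},
      {"f", "c", "a", "m", "p"}]
freq = frequent_itemsets(tx, min_support=0.6)
for lhs, rhs, conf in association_rules(freq, min_conf=0.9):
    print(lhs, "->", rhs, round(conf, 2))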
The most outstanding improvement over Apriori is FP-Growth (frequent-pattern growth) [10,11],
developed later in Part II. It is based on a "divide and conquer" strategy: the database is compressed
into a structure holding all essential information, then divided into a set of conditional databases,
each associated with one frequent item, and each one is mined separately. FP-Growth is an order of
magnitude faster than the original Apriori algorithm. SPADE (Sequential PAttern Discovery using
Equivalence classes) is an algorithm for fast mining of sequential patterns in large databases. Unlike
previous approaches, which make multiple database scans and use complex hash-tree structures that
tend to have sub-optimal locality, SPADE, a vertical-format mining method, decomposes the original
problem into smaller sub-problems using equivalence classes on frequent sequences. Not only can
each equivalence class be solved independently, it is also very likely that it can be processed in main
memory.
Thus SPADE usually makes only three database scans: one for frequent 1-sequences, another for
frequent 2-sequences, and one to generate all other frequent sequences. Extensive experiments have
shown that SPADE outperforms GSP by a factor of two, and by an order of magnitude with
pre-computed support of 2-sequences. However, the transformation of the database requires a large
memory capacity and considerable program response time, and it still needs three database scans
whereas FP-Growth requires only two.
FreeSpan, or Frequent-pattern-projected Sequential pattern mining [13], is a scalable and efficient
sequential mining method. It uses frequent items to recursively project sequence databases and grows
subsequence fragments in each projected database. The strength of FreeSpan is that it searches a
smaller projected database than GSP in each subsequent database projection. This is because FreeSpan
recursively projects a large sequence database into a set of small projected sequence databases, based
on the currently mined frequent item-patterns, and subsequent mining is confined to each projected
database, relevant to a smaller set of candidates. The major overhead of FreeSpan is that it may have
to generate many nontrivial projected databases: if a pattern appears in every sequence of a database,
its projected database does not shrink (except for the removal of some infrequent items). Moreover,
since a length-k subsequence may grow at any position, the search for length-(k+1) candidate
sequences must check every possible combination, which is costly.
In this context, the FP-Tree algorithm [1] has finally been chosen. It overcomes these disadvantages
and optimizes the tool for investment decisions. The interest of the FP-Tree algorithm is that it avoids
running repeatedly through the database by storing all common elements in a compact structure called
the Frequent-Pattern Tree. Furthermore, these elements are automatically sorted in the compact
structure, which accelerates pattern search. The FP-Tree algorithm appears as the most appropriate
solution in this case among existing data mining algorithms, with significantly shorter execution time
and greater accuracy. Indeed, the faster the algorithm, the higher the financial profitability can be. The
FP-Tree algorithm is a powerful tool for analyzing signatures, extracting the most frequent of them,
and optimizing the back-testing process.
II. FP-Tree Algorithm
The methodology combines a signature with frequent scenarios through the FP-Tree algorithm. To
implement the trading strategy based on signatures, one must first define the content of a market
configuration (a scenario) and transform it into the format expected by the frequent-pattern extraction
algorithm. An intermediate coding step is thus necessary to make raw data compatible with the
algorithm. The FP-Tree algorithm is particularly suited to searching for frequent patterns in a very
large database. Its interest is to avoid browsing the database repeatedly, by storing the set of frequent
elements in the (compact) FP-Tree structure. Moreover, these frequent elements are sorted
automatically in the compact structure, which accelerates the search for patterns. The FP-Tree
algorithm is the most effective solution among data mining algorithms, with lower complexity and
greater precision. An FP-Tree is composed of two elements: a header table indexing the different item
sets and a tree structure with a root labeled "null".
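In code, these two components could look as follows (a minimal Python sketch; field names are illustrative, not prescribed by the paper):

class FPNode:
    # One node of the Frequent-Pattern Tree.
    def __init__(self, item, parent=None):
        self.item = item        # item label; None for the "null" root
        self.count = 1          # transactions passing through this node
        self.parent = parent    # link back toward the root
        self.children = {}      # item -> child FPNode
        self.next = None        # next node carrying the same item

root = FPNode(None)             # the root labeled "null"
header_table = {}               # item -> first node of that item's chain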
Figure 1. Transactions Data Base
Considering the transaction database of Figure 1, where the first column gives the transaction number
and the second one the corresponding items, a minimum support has first to be fixed. For a database of
5 transactions as in Figure 1, with a 50% rate the minimum support is computed as 50/100 x 5 = 2.5,
so an item must appear in at least 3 transactions. Item sets are then pruned according to this minimum
support, and the remaining items are sorted by frequency to obtain the Header Table. Here, only the
items {f, c, a, m, b, p} are retained. For each transaction, the frequent items are reordered in the same
way, see Figure 2.
Figure 2. Frequent Ordered Items
Once this step is completed, everything is ready to construct the FP-Tree structure by creating and
inserting the different nodes.
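The pruning and ordering step can be sketched as follows, assuming Figure 1 contains the classical FP-Growth example transactions (they yield exactly the retained items {f, c, a, m, b, p}):

from collections import Counter
from math import ceil

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = ceil(0.5 * len(transactions))    # 50% of 5 -> 3

counts = Counter(i for t in transactions for i in t)
frequent = {i for i, c in counts.items() if c >= min_support}

first_seen = {}
for t in transactions:
    for i in t:
        first_seen.setdefault(i, len(first_seen))

def ordered(t):
    # Keep frequent items only, most frequent first (ties: first seen).
    return sorted((i for i in t if i in frequent),
                  key=lambda i: (-counts[i], first_seen[i]))

print(ordered(transactions[0]))    # ['f', 'c', 'a', 'm', 'p']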
The first node created, called the root, contains nothing but links to its "children". Each transaction is
browsed and its items are placed in the tree as follows: if a node already exists for an element, its
occurrence count is incremented; otherwise the node is created. Then, for each node created, a link is
established from the header table to the element inserted in the tree. The first transaction is composed
of the elements {f, c, a, m, p}, ordered by decreasing weight. As f is the first element of the list, a node
is inserted under the "root" element of the tree. The element f has been inserted for the first time, so its
occurrence count is 1. Two links are created with the element f: one with the "root" element and one
with the header table. The same method is applied for the insertion of the other elements {c, a, m, p}
of the first transaction.
Figure 3. FP-Tree Structure after Insertion of First Transaction Elements
The second transaction illustrates the case where the occurrence count of an element has to be
incremented. According to Figure 1, the second transaction is composed of the elements {f, c, a, b, m}.
This time the tree already contains elements, and each element already present has its occurrence
count incremented by 1. This is the case for elements {f}, {c} and {a}: their occurrence count becomes
2 (1+1). As element {b} does not yet exist in the tree, a new node is created for it at the current
position in the tree: a link is therefore created from node {a} to node {b} and from the header table to
node {b}. The same applies to element {m}, which had not yet been inserted at this position in the
tree. The second transaction is thus treated, and the remaining transactions have their elements
inserted in the same way, see Figure 4.
Figure 4. Construction of FP-Tree from 2nd Transaction
At the end of the procedure, once all the transactions of the database have been processed, the FP-Tree
structure is as illustrated in Figure 5, where each element of the FP-Tree structure is linked from the
corresponding element of the header table.
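Putting the construction together, the following self-contained Python sketch builds the tree exactly as just described, incrementing counts on existing nodes and header-linking new ones (the node class from the earlier sketch is repeated so the snippet runs on its own):

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children, self.next = {}, None

def build_fp_tree(ordered_transactions):
    root, header = FPNode(None), {}
    for t in ordered_transactions:
        node = root
        for item in t:
            if item in node.children:           # node exists: increment
                node = node.children[item]
                node.count += 1
            else:                               # create node and header link
                child = FPNode(item, parent=node)
                node.children[item] = child
                child.next = header.get(item)   # prepend to item's chain
                header[item] = child
                node = child
    return root, header

ordered_tx = [["f", "c", "a", "m", "p"],
              ["f", "c", "a", "b", "m"],
              ["f", "b"],
              ["c", "b", "p"],
              ["f", "c", "a", "m", "p"]]
root, header = build_fp_tree(ordered_tx)
print(root.children["f"].count)   # 4: four transactions start with f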
Figure 5. FP-Tree Structure
The last step is to ensure that the information inside the tree is valid, by comparing the information
obtained from the different nodes with that of the header table. From this table, the position of each
item can easily be found in the FP-Tree. To identify the frequent item sets from the database
transactions, the x-conditional database method (x being any item of the header table) is used. The
x-conditional database is first built for the item with the lowest occurrence; the same simple procedure
is then applied to the other items by browsing the header table from bottom to top:
1. Determine from the tree the paths containing x.
2. For each path, set the occurrence of the other items in the path equal to the occurrence of x.
3. Remove x from the different paths.
4. Gather the items, count their occurrences, and prune them according to the minimum support.
5. From the remaining items, construct the x-conditional tree.
6. Re-include x and form all possible combinations with the remaining items, keeping the same
occurrence for each combination.
7. Repeat this procedure until the header table has been completely browsed.
After the procedure is completed, the results shown in Figure 6 are obtained for all items.
Figure 6. Summary Table of x-Conditioned Basic Patterns
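Steps 1-3 of this procedure can be sketched as follows, reusing FPNode and build_fp_tree from the construction sketch above: for an item x, each header-table link points to a branch containing x, and climbing from that node to the root yields a prefix path weighted by x's count on the branch.

def conditional_pattern_base(x, header):
    # Collect (prefix path, count) pairs for every branch containing x.
    paths = []
    node = header.get(x)
    while node is not None:                 # follow the header-table chain
        prefix, up = [], node.parent
        while up is not None and up.item is not None:
            prefix.append(up.item)          # items between x and the root
            up = up.parent
        if prefix:
            paths.append((list(reversed(prefix)), node.count))
        node = node.next
    return paths

print(conditional_pattern_base("m", header))
# [(['f', 'c', 'a', 'b'], 1), (['f', 'c', 'a'], 2)]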
III. Discussion and Results
The main advantage offered by the FP-Tree algorithm is the reduced number of database scans. The
algorithm is also comprehensive, as all information is contained in the frequent items, and concise, as
only these frequent items are memorized, which has a positive effect both on mining efficiency and on
processing time. Nevertheless, in spite of its compact structure, it is not guaranteed that the whole
FP-Tree will fit in computer memory when the volume of database transactions is large. Moreover, its
construction may use much more time and system resources than expected. To address this Big Data
problem, the program classifies the different types of data behavior upstream. This yields separate
inputs for the FP-Tree algorithm, which reduces the amount of memory needed, and the algorithm is
then applied within each data compartment.
The construction of the data distribution from the database transactions is shown in Figure 7, where
the parameter epsilon (left of the curve) sets the maximum allowed difference between behaviors
within each data class. This parameter is tuned according to FP-Tree algorithm performance; the
sketch following Figure 7 illustrates one possible classification rule.
Figure 7. Data Division and Classification vs Time
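The paper does not spell out the classification rule; the Python sketch below assumes a simple greedy scheme in which a curve joins an existing class when its maximum pointwise difference from the class representative stays within epsilon, and opens a new class otherwise.

def classify(curves, epsilon):
    # curves: list of equal-length numeric sequences.
    classes = []                          # list of lists of curves
    for c in curves:
        for cls in classes:
            rep = cls[0]                  # compare with class representative
            if max(abs(a - b) for a, b in zip(c, rep)) <= epsilon:
                cls.append(c)
                break
        else:
            classes.append([c])           # no close class: open a new one
    return classes

curves = [[0.0, 0.1, 0.2], [0.01, 0.12, 0.19], [1.0, 0.9, 1.1]]
print(len(classify(curves, epsilon=0.05)))   # 2 classes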
In order to compare the performance of the FP-Tree algorithm method with previously existing ones
like Apriori, four different back-tests with five different databases have been performed. The back-tests
have been run on samples of 21,500, 10,000 and 5,000 timeframes. Data are current ones, i.e., tests are
performed by comparing future data with previous ones. The support rate of data repetitions is 1.0.
Our conclusions are the following:
- Concerning software performance, a large number of data still limits execution time. Nevertheless,
the larger the number of data, the better the chance of recognizing a signature over the past.
- Epsilon is an important parameter to take into account. Its variation directly influences the quality of
results, in terms of returns and standard deviation, when analyzing previous and current signatures.
The larger epsilon is, the better the chance of finding frequent patterns; the smaller epsilon is, the more
accurate the result. To calibrate epsilon, many back-tests have been performed to find its right value
by analyzing the interval [.01, 1]. For example, 0.01 means that only one piece of information out of
100 has been kept.
To get more powerful software, it is possible to incorporate a decaying exponential law into the code,
giving more weight to recent information and filtering the database. Taking the decaying function
f(x) = 1 - exp(-λx) and varying λ, the weight given to recent data can be adjusted according to its
relevance. For example, information on a value from 1985 will be much more important than
information on a value from 1950. Indeed, it is to be expected that, owing to the constant improvement
of technical analysis over the years, recent data are more complete and faithful than old ones. These
circumstances follow from the notion of market efficiency: a market is defined as efficient if and only
if all available data about each financial asset listed on this market are immediately incorporated into
the asset price (Fama, 1965). This efficiency explains why asset prices are in synchrony with actual
values thanks to the new tools of technical analysis. Their constant improvement is the reason why
asset data from 1985 are better than those from 1950, justifying the inclusion of a decaying
(exponential) rule in the algorithm.
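As a sketch of this weighting, assuming the reconstructed decay law f(x) = 1 - exp(-λx) and reading it as an exponential weight exp(-λ·age) applied to observations of a given age (the value of λ below is purely illustrative):

import math

def recency_weight(age_years, lam=0.05):
    # Illustrative weight: observations decay exponentially with age.
    # lam (lambda) is assumed; the paper only says it is varied.
    return math.exp(-lam * age_years)

# Seen from 2014, data from 1985 outweigh data from 1950:
print(round(recency_weight(2014 - 1985), 2))   # 0.23
print(round(recency_weight(2014 - 1950), 2))   # 0.04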
The second remark concerns parameter ε. It was noticed earlier how much ε is involved in result
precision: the smaller ε is, the less time is needed to analyse the data, since only a small part of it falls
within the ribbon created by ε. To verify this, two experiments have been carried out with ε fixed, and
the relationship between analysis time and the amount of manipulated data is reported in Figure 8 for
ε = 0.3 and 0.6 respectively.
Figure 8. Algorithm Execution Time (min) vs Number of Timeframes, for ε = 0.3 (right) and ε = 0.6 (left)
From Figure 8, the algorithm needs less time to analyse the whole database for smaller ε, as
reasonably fewer data are manipulated. Indeed, when the user chooses a smaller ε, he decides to
decrease the algorithm's search precision: the database is browsed faster, but the chances of matching
a new signature to an already existing one are lower.
After several tests, we managed to establish the curve linking analysis time to the ε parameter (the
amount of data becoming an intrinsic magnitude of this curve). This ε parameter varies from 0.01 to 1.
When it is equal to 1, it no longer influences the database and becomes somehow neutral. Figure 9
displays the execution time vs. the ε value:
Figure 9. Algorithm Execution Time (min) vs ε Value
IV. Conclusion
The improvement of technical indicators generating buy/sell signals, so as to optimize fund managers'
quantitative strategies, requires good ways of searching for the best signatures in order to study
similarities with already observed market patterns. The present study suggests a method to solve the
problem of handling large databases and to contribute to quicker and more efficient decisions by
investors. The results show that, by using a good extraction algorithm for frequent item sets, it is
possible to find in minimal time the best signatures in historical data when compared with existing
ones. The implementation of the efficient FP-Tree algorithm with a support rate of 1.0 allows this
search to run with an average execution time divided by two, whatever the number of timeframes,
compared with other uses of data mining algorithms like Apriori. Despite the size of the data, no
information is lost and result accuracy is not degraded. The combination of an upstream
numerical-reduction method on the data with the FP-Tree algorithm is essential to fully address the
problem of handling large databases and to avoid a decline in profitability.
Acknowledgments
The authors are very much indebted to ECE Paris School of Engineering for having provided the
environment where the study was developed, to Dr for advice during the research, and to Prof. M.
Cotsaftis for help in the preparation of the manuscript.
References
[1] N.P. Lin, W.H. Hao, H.J. Chen : Fast Accumulation Lattice Algorithm for Mining Sequential Patterns, Proc.
6th WSEAS Intern. Conf. on Applied Computer Science, Vol.6, pp.229-233
[2] T. Oates, D. Jensen, P.R. Cohen : Automatically Acquiring Rules for Event Correlation from Event Logs,
Technical Report 97-14, Computer Science Dept, University of Massachusetts, Amherst, MA, 1997.
[3] R J. Bayardo : Efficiently Mining Long Patterns from Databases, Proc. ACM-SIGMOD Intern. Conf. on
Management of Data, pp.145-154, 1999
[4] C. Avery, P. Zemsky : Multidimensional Uncertainty and Herd Behavior in Financial Markets, Amer.
Economic Review, Vol.88(4), pp.724-748, 1998
[5] S. Grossman, J. Stiglitz : On the Impossibility of Informationally Efficient Markets, Amer. Economic Review,
Vol.70(3), pp.393-408, 1980
[6] R. Bloomfield, M. O’Hara : Market Transparency: Who Wins and Who Loses?, The Review of Financial
Studies, Vol.12(1),pp.5-35, 1999
[7] S.B. Achelis : Technical Analysis from A to Z, 2nd ed., McGraw-Hill, New York, 2000
[8] R. Agrawal, R. Srikant : Fast Algorithm for Mining Association Rules in Large Databases, Proc. 20th Int.
Conf. Very Large Databases (VLDB’94), Chile, pp.487-499, 1994
[9] B. Dunkel, N. Soparkar : Data Organization and Access for Efficient Data Mining, Proc. 15th Int. Conf. on
Data Engineering (ICDE’99), pp.522-529, 1999
[10] J. Han, J. Pei, Y. Yin : Mining Frequent Patterns without Candidate Generation, Tech. Rept., 99-10, Simon
Fraser Univ., Vancouver, CA, 1999
[11] Consultant and IT Architect KTA : L'algorithme FP-Growth [The FP-Growth Algorithm], unpublished
[12] M.J. Zaki : SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, Vol.42,
pp.31-60, 2001
[13] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu : FreeSpan: Frequent Pattern-Projected
Sequential Pattern Mining, Proc. 6th SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining,
pp.355-359, 2000
[14] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S.
Yu, Z.H. Zhou, M. Steinbach, D.J. Hand, D. Steinberg : Top 10 Algorithms in Data Mining, Knowledge and
Information Systems, Vol.14, pp.1-37, 2008
[15] I. Klein, W. Schachermayer : Asymptotic Arbitrage in Non-Complete Large Financial Markets, Theory of
Probability and its Application, Vol.41(4), pp.927-934, 1996
[16] P.-N. Tan, M. Steinbach, V. Kumar : Introduction to Data Mining, Addison-Wesley, New York 2006.
[17] J. Han, J. Pei : Mining Frequent Patterns by Pattern-Growth Methodology and Implications, ACM SIGKDD
Explorations Newsletter, Vol.2(2), pp.14-20, 2000
[18] J. Pei, J. Han, L.V.S. Lakshmanan : Mining Frequent Itemsets with Convertible Constraints, Proc. 17th
Intern. Conf. on Data Engineering, pp.433-442, 2001
[19] B. Liu : Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, Heidelberg, 2007
[20] Q. Yang, X. Wu : 10 Challenging Problems in Data Mining Research, Intern. J. Information Technologies
and Decision Making, Vol.5(4), pp.597–604, 2006
[21] A.Gangemi : A Comparison of Knowledge Extraction Tools for the Semantic Web, Proc. ESWC 2013,
LNCS 7882, pp.351-366, Springer-Verlag, Berlin Heidelberg, 2013
[22] M. Sheikh, S. Conlon : A Rule-Based System to Extract Financial Information, JCIS, Vol.52(4), pp.10-19,
2014
[23] M. Sewell : Introduction to Behavioral Finance, Behavioral Finance Net, 14 April 2010
[24] S.D. Campbell : A Review of Backtesting and Backtesting Procedures, Finance and Economics Discussion
Ser., Div. of Research and Statistics and Monetary Affairs, Fed. Reserve board, Washington, D.C., 2005-21
[25] F. Verhein : Frequent Pattern Growth (FP-Growth) Algorithm, School of Information Technologies, Univ.
of Sydney, Australia, Jan. 10, 2008