Application of Frequent-Pattern Tree Algorithm to Dow Jones US
Pharmaceuticals Index
Enis Younsi, Holy Andriamboavonjy, Anthony David, Steven Dokou and Badre Lemrabet
Time optimization and high-quality software are very important factors in competitive financial
markets, so any improvement of the technical indicators that generate a buy or sell signal is a
major issue. Many tools have thus been created to make them more effective. This concern for
efficiency led, in a previous paper, to a search for the best (and most innovative) approach
giving the largest improvement of these indicators. The approach consists in attaching a signature to
frequent market configurations by applying a frequent-pattern extraction method, which is here the most
appropriate way to optimize investment strategies. The ultimate goal of the algorithm is to extract the most accurate
signatures using a back-testing procedure applied to technical indicators in order to improve their
performance, and to determine the signatures that outperform these indicators. To this aim, the FP-Tree
algorithm has been selected as the most efficient to perform the task. Application to the DJ
Pharmaceuticals Index confirms the potential of the proposed method.
Keywords: Quantitative Analysis, Back-testing, Computational Models, Apriori Algorithm, Pattern
Recognition, Data Mining, FP-Tree
I. Introduction
Today it is very easy to collect all kinds of data. However, processing limitations persist in extracting value from them.
In all domains, from science to finance, huge volumes of data exist but most of them remain untapped. Banks,
for example, must analyze this mass of data quickly enough to make the right investment decisions. The problem
is no longer the collection of data but the extraction of value from it. To meet this challenge, knowledge
extraction in databases (data mining) has been largely developed [1, 2, 3, 4, 5], particularly in the “financial decision” area [6, 7, 8]. In this world, events follow the traditional cycle of cause and effect. The expansion of "hedge
funds" and the development of computers have boosted data analysis. With the emergence of behavioral finance [9],
a comprehensive in-depth analysis of all data and archived transactions is needed.
In this context, the improvement of technical indicators [10, 11] generating a buy/sell signal is a major issue.
This concern for efficiency leads to a search for the best (and most innovative) way of further improving
these indicators [12]. Since the internal dynamics of the economic system are unknown, the technique of frequent
patterns seemed to be the most appropriate method to allow fund managers to optimize their quantitative
strategies [13]. To address the problem of handling large databases, a method has been identified that is capable of
classifying curves representing data behavior with a relatively modest numeric involution without losing
information [14]. These classes of behavior are then submitted to the frequent-pattern algorithm, which thus
avoids the problem of memory storage. Mining frequent patterns has been the most studied data mining task over the
last ten years. It is used here to find signatures.
A signature is a set of elements, representing in the present case a market configuration, which is compared
to another one to find similarities. Hence, the method is to assign a signature to common market
configurations. Through a back-testing procedure [15], the best signatures are kept for the next period, called
validation: an indicator signal is retained if it is preceded by a "good" signature, so that the signature is used in
combination with the technical indicator to improve its performance. Conventional stock data mining (i.e., searching
for frequent patterns) is used to identify signatures characterizing the few days before a technical indicator fires.
One thus obtains specific underlying rules of technical analysis, based on several days and adapted to the period.
Clustering, or unsupervised classification grouping similar situations, is the general financial decision-making
context of the present problem. The algorithm relies on the Frequent Pattern Tree (FP-Tree)
extraction algorithm [16]. Indeed, the FP-Tree implementation is the most effective for analyzing
the complete set of frequent patterns, as well as the most suitable for real-time high-frequency trading.
Two families of methods exist for sequential pattern mining: Apriori-based approaches [17,18], which include the GSP and SPADE
algorithms [19], and pattern-growth-based approaches [20], which include the FreeSpan and PrefixSpan algorithms.
A common data mining approach is to find frequent item-sets from a transaction dataset and then derive association
rules. Finding frequent item-sets (item-sets whose frequency of appearance is larger than or equal to a user-specified
minimum support) is not trivial because of combinatorial explosion. Once frequent item-sets are obtained, it is
straightforward to generate association rules with confidence larger than or equal to a user-specified minimum
confidence.
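The notions of minimum support and confidence can be illustrated with a minimal sketch. The toy transactions and the function names (`count`, `support`, `frequent_itemsets`, `confidence`) are hypothetical and are not taken from the paper; the enumeration is deliberately brute-force to show why the combinatorial explosion matters.

```python
from itertools import combinations

# Hypothetical toy data: each set stands for one coded market configuration.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions containing `itemset`."""
    return count(itemset, transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force search: tests every subset, so cost grows as 2**n items."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

def confidence(antecedent, consequent, transactions):
    """confidence(A -> B) = count(A and B together) / count(A)."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

if __name__ == "__main__":
    for itemset, s in frequent_itemsets(transactions, 0.6).items():
        print(sorted(itemset), s)
    print(confidence({"a"}, {"b"}, transactions))
```

With n distinct items the loop examines up to 2^n - 1 candidates, which is the cost that Apriori-style pruning and FP-Growth are designed to avoid.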
The most outstanding improvement over Apriori is FP-Growth (frequent pattern growth) [21,22], based on a “divide and
conquer” strategy: it compresses the database into a structure representing all essential information, divides the compressed
database into a set of conditional databases, each associated with one frequent item-set, and mines each one
separately. FP-Growth is an order of magnitude faster than the original Apriori algorithm. Unlike previous approaches,
which make multiple database scans and use complex hash-tree structures that tend to have sub-optimal locality,
SPADE (Sequential Pattern Discovery using Equivalence classes) decomposes the original problem into smaller sub-problems
using equivalence classes on frequent sequences. Not only can each equivalence class be solved independently, but it
is also very likely that it can be processed in main memory.
FreeSpan, or Frequent Pattern-Projected Sequential Pattern Mining [23], is a scalable and efficient sequential mining
method. It uses frequent items to recursively project sequence databases, and grows subsequence fragments in each
projected database. The strength of FreeSpan is that it searches a smaller projected database than GSP in each
subsequent database projection. The major overhead of FreeSpan is that it may have to generate many nontrivial
projected databases.
In this context, the FP-Tree algorithm [24] has finally been chosen. It overcomes these disadvantages and optimizes
the tool for investment decision. The interest of the FP-Tree algorithm is that it avoids scanning the database
repeatedly by storing all frequent elements in a compact structure called the Frequent-Pattern Tree. Furthermore, these
elements are automatically sorted in the compact structure, which accelerates pattern search. The FP-Tree algorithm
appears as the most appropriate solution in this case among existing data mining algorithms, with significantly less
execution time and greater accuracy. Indeed, the faster the algorithm is, the higher the financial profitability can be. The FP-Tree algorithm is a powerful tool for analyzing signatures to get the most frequent of them and to optimize the “back
testing” process.
II. FP-Tree Algorithm
The methodology combines a signature with frequent scenarios through the FP-Tree algorithm. To implement the trading
strategy based on signatures, one must first define the content of a market configuration (a scenario) and transform it
into the format expected by the frequent-pattern extraction algorithm. An intermediate coding step is thus
necessary to make raw data compatible with the algorithm. The FP-Tree algorithm is particularly suited to searching
frequent patterns in a very large database. The interest of this algorithm is to avoid browsing the database repeatedly, by
storing the set of frequent elements in the (compact) FP-Tree structure. Moreover, these frequent elements are sorted
automatically in the compact structure, which accelerates the search for patterns. The FP-Tree algorithm is the most
effective solution among data mining algorithms, with lower complexity and greater precision. An FP-Tree is
basically composed of two elements: a header table indexing the different items and a tree structure with a root
labeled “null”. After the different steps of data manipulation, the last one is to ensure that the information inside the tree is
valid, by comparing the information obtained from the different nodes with that from the header table. From this
table, the position of each item can easily be found in the FP-Tree. To identify the frequent item sets from the database
transactions, the x-conditional database method (x being any item belonging to the header table) is used. The x-conditional database is first computed for the item having the lowest occurrence. To determine the x-conditional database, a
simple method in seven steps, summarized below, is followed; it is applied in the same way to the other items by
browsing the header table from bottom to top [25].
1. Determine from the tree the paths containing x.
2. For each path, set the occurrence count of every other item in the path equal to that of x.
3. Remove x from the different paths.
4. Group the items together, count their occurrences and prune the items according to the minimum support.
5. Build the x-conditional tree from the remaining items.
6. Include x and make all the possible combinations with the remaining items, keeping the same occurrence for
each combination.
7. Repeat this procedure until the header table has been entirely browsed.
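The seven steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `build_tree`, `conditional_base` and `mine` are hypothetical, the minimum support is given as an absolute count, and step 6 is carried out by recursing on the conditional database rather than by enumerating combinations directly, which also covers trees with several branches.

```python
from collections import defaultdict

class Node:
    """One FP-Tree node; the root has item None."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-Tree from (items, weight) pairs.

    Returns (root, header, counts); header maps each frequent item
    to the list of tree nodes carrying it.
    """
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i for i, c in counts.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += weight
    return root, header, counts

def conditional_base(x, header):
    """Steps 1-3: prefix paths of x, each weighted by the count of its x node."""
    base = []
    for node in header[x]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path, node.count))
    return base

def mine(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frozenset(pattern): count} for every frequent pattern."""
    _, header, counts = build_tree(weighted_transactions, min_count)
    patterns = {}
    # Step 7: browse the header table from least to most frequent item.
    for x in sorted(header, key=lambda i: (counts[i], i)):
        pattern = suffix | {x}
        patterns[pattern] = counts[x]
        # Steps 4-6: prune, build the x-conditional tree, extend the pattern.
        patterns.update(mine(conditional_base(x, header), min_count, pattern))
    return patterns

if __name__ == "__main__":
    db = [(["a", "b", "c"], 1), (["a", "b"], 1), (["a", "c"], 1),
          (["b", "c"], 1), (["a", "b", "c"], 1)]
    for pattern, n in mine(db, min_count=3).items():
        print(sorted(pattern), n)
```

The database is read twice when the first tree is built (once to count items, once to insert transactions); all later work operates on conditional databases derived from the tree.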
The main advantage offered by the FP-Tree algorithm is the reduced number of database scans. The algorithm is also
complete, as all information is contained in the frequent items, and concise, as only these frequent items are
stored, which has a positive effect both on mining efficiency and on processing time. Nevertheless, in spite of its
compact structure, it is not guaranteed that the whole FP-Tree structure can be stored in computer memory when the
volume of database transactions is too large. To address this Big Data problem, the program first classifies the data upstream into
different types of behavior, so that different, smaller inputs are obtained for the FP-Tree algorithm, which reduces the size of needed
memory. The algorithm is then applied to each data compartment.
III. Algorithm Efficiency
Construction of the data distribution from the database transactions is shown on Figure 1, where the parameter ε (left of curve)
determines the maximum difference between the behaviors within each data class. This parameter is varied
depending on FP-Tree algorithm performance.
Figure 1. Data Division and Classification vs Time
In order to compare the performance of the FP-Tree algorithm with that of a previously existing one such as Apriori, four different
back-tests with five different databases have been performed on samples of 21 500, 10 000 and 5 000 timeframes.
Data are current ones, i.e., tests are performed by comparing future data with previous ones. The support rate of data
repetitions is 1.0. As a result, a balance exists on the software performance side between the execution time, which a large
data set penalizes, and the higher chance of recognizing a past signature that a larger data set offers. Another balance also exists
with ε. A larger ε increases the chance of finding frequent patterns, and the smaller ε is, the more accurate the result.
To calibrate ε, many back-tests have been performed to find its right value by analysis of the interval [0.01, 1]; ε = 0.01
means that only one piece of information has been taken out of 100.
To increase software power and reduce data processing, a decay law can be introduced into the code, giving more
weight to recent information and filtering the database. With a decay function f(x), the value of recent
data can be adjusted according to its relevance.
The second point concerns the trade-off between precision and data processing with ε. To this aim, the relationship
between analysis time and amount of manipulated data has first been studied; it shows a strong reduction of execution
time for smaller ε, as less data are manipulated, but there are fewer chances of matching a new
signature to an already existing one. After several tests, the curve linking execution time to the parameter ε (with the
amount of data becoming an intrinsic magnitude of this curve) is displayed on Figure 3, with the parameter varying from
0.01 to 1. When ε = 1, it no longer influences the database and becomes somehow neutral.
Figure 3. Algorithm Execution Time (min) vs Value of ε
IV. Application and Convergence Analysis
After evaluating the processing time needed to analyze the different signature batches, the algorithm has been applied
to real financial products. Three financial products with different behaviors have been considered: stocks, commodities and bonds.
Each type of financial instrument behaves very differently, and testing the algorithm on
each of them allows its prediction ability to be evaluated. Here a stock has been taken first as being very variable; this
is a very good test to verify whether the proposed algorithm can identify signatures with very high amplitude.
Indeed, the more irregular the product price is, the more difficult it is to find similarities in the past and guess future
variations. So attention has been focused here on a more global Dow Jones Sector Index. The Dow Jones U.S. Select
Sector Indices measure precise, and typically narrowly-defined, sectors of the U.S. stock market. To be included in the
indices, stocks must meet minimum size requirements based on market capitalization. Component weights are capped
for diversification. For present application, and
Figure 4. Historical Prices of the Dow Jones US Pharmaceuticals Index (DJUSPR) since its beginning
As seen on Figure 4, the Pharmaceuticals Index has been growing steadily over the last five years with some small
decreases. From its creation to 2009, this index was globally stable and exhibited different types of variations. The
proposed algorithm being based on past variations, the more different situations it analyzes, the better it is able to
guess future trends. On the other hand, the disadvantage of the DJ US Pharmaceuticals Index is that it is extremely recent,
with fourteen years of collected data, compared to the other stocks existing on the market. So it is very interesting to
test the algorithm on this product to see if it can distinguish a small decrease followed by growth from a
real decrease or a leveling off.
The first step has been to gather all Open, High, Low and Close data of this index. Indeed, to be able to create a
reliable signature database, one must make sure to have all needed data for each day to avoid a shift in the data.
Since its origin, there have been 3691 quotations, or 3671 timeframes of twenty days, for the DJ Pharmaceuticals Index.
The analysis and the creation of signatures took 143 minutes. A small number of timeframes are analyzed compared to
the amount collected in the different data batches, so benchmarking the algorithm is quick; what is slower is the
calibration. As seen before, ε fixes the accuracy needed for signature comparison. For a very high
prescribed precision, finding exactly the same signature is very rare. Conversely, choosing a very low accuracy allows
many other signatures resembling the one under analysis to be found. So the algorithm must be tested with different values of
ε. Starting from the initial value ε = 1, one looks for exactly the same signature as in the past. If the result is positive,
this is a very good opportunity to find out whether the chart would go up, down or stay stable. With this value of ε, it was not
possible to find a similar signature. This result agrees with the fact that stock variations are strong, so it is not easy to
find exactly the same one in the future.
Figure 5. Number of similar signatures found vs ε
As seen on Figure 5, the number of similar signatures found depends on ε. It also depends on the amount of
available data. In this case, taking ε = 0.5, 22
signatures are found, which is a very high number for a recent database. Taking ε = 0.8 gives 2 cases similar to
the current one. The first one indicates that there will be a decrease in the following time period and the second one that
the quotation will increase, so it is not possible to use this data as it does not give a precise statement on the action to
take. So even if signatures are found, it does not mean that the exact future trend has been found.
One may be in a case where a new situation occurs that has never happened before, or in a case where the same
number of signatures followed by a decrease and by an increase has been found.
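The two inconclusive cases described above (no matching signature, or as many matches followed by a decrease as by an increase) can be expressed as a simple decision rule. The sketch below is only an illustration: the function name `consensus_signal`, the outcome labels and the `min_matches` parameter are hypothetical and do not come from the paper.

```python
from collections import Counter

def consensus_signal(outcomes, min_matches=1):
    """Return the dominant past outcome, or None when no decision is possible.

    `outcomes` lists what followed each matching past signature,
    e.g. "up", "down" or "flat" (hypothetical labels).
    """
    if len(outcomes) < min_matches:
        return None  # new situation: no comparable signature in the past
    ranked = Counter(outcomes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two leading outcomes: no clear action
    return ranked[0][0]

if __name__ == "__main__":
    print(consensus_signal(["down", "up"]))        # one decrease, one increase
    print(consensus_signal(["up", "up", "down"]))  # a majority exists
    print(consensus_signal([]))                    # nothing found
```

A stricter variant could require a minimum majority ratio before emitting a signal; that threshold would itself need calibration.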
An important point is the choice of the correct ε. In fact, when applying the present algorithm to the market, a specific value
has to be determined for each stock, which suggests implementing an adaptive scheme [26] allowing the algorithm to estimate the
correct value and remember the previously determined one.
V. Conclusion
The improvement of technical indicators generating a buy/sell signal, in order to optimize fund managers' quantitative strategies,
requires good ways to search for the best signatures in order to study similarities with already present market patterns.
The present study suggests a method to solve the problem of handling large databases and to contribute to quicker and
more efficient decisions by investors. The results show that by using a good frequent item-set extraction algorithm, it
is possible to find the best signatures from historical data in minimal time when compared to existing ones. The
implementation of the efficient FP-Tree algorithm with a support rate of 1.0 performs this search with an average execution
time divided by two, whatever the number of timeframes, compared to other data mining algorithms such as
Apriori. Despite the size of the data, no information is lost and the accuracy of results is not degraded. The combination of a
method of digital involution on upstream data with the FP-Tree algorithm is essential to fully address the problem of
handling large databases and to avoid a decline in profitability. Application to the DJ Pharmaceuticals Index confirms the
applicability and the limitations of the proposed method.
Proceedings of 29th International Business Research Conference
