CS 632 Paper Survey on Sequential Pattern Data Mining

Steven Yuan-Wei Li
Introduction:
Data mining has recently gained popularity in many fields.
The most frequently mentioned examples include:
Micro-marketing in the retail industry.
Genetic engineering: discovering DNA sequence patterns.
The financial industry: identifying interesting share price movements.
According to the classification used by IBM's Quest data mining project
(http://www.almaden.ibm.com/cs/quest/publications.html),
data mining can be divided into the following sub-fields:
Associations
Classification
Clustering
Database-Mining Integration
Deviation Detection
Incremental Mining
OLAP
Text and Web Mining
Time-Series Clustering
Sequential Patterns
The topic surveyed here is sequential pattern mining, a direct descendant and
generalization of association rule mining. This survey therefore starts by
introducing the basic concepts of association rule mining. I then bring in
sequential pattern mining in its most primitive form, as presented in the 1995 paper
by R. Agrawal and R. Srikant [2]. I explain the basic idea and investigate some
inherent problems of its Apriori algorithm. After that, the survey branches out in two
directions: the primary one is performance improvement, the second is
generalization.
Section I: Association rules [1]
R. Agrawal's association rule mining paper is the origin of a whole series of
follow-up papers. The simplest way to demonstrate the purpose of association rule
mining is a statement like:
"20% of the customers who buy product A will also buy product B in the same
transaction."
In the formal problem formulation, there is a set of items I = {i1, i2, i3, ...}, which
can be thought of as the universe of products sold by a retailer.
The target database is a set of transactions D, where each transaction T is a set
of items such that T ⊆ I. Associated with each transaction is a unique
identifier, its TID. A transaction T contains an itemset (or pattern) X, a set of items
from I, if X ⊆ T.
An association rule is an implication of the form X ==> Y, where X, Y ⊆ I and
X ∩ Y = ∅. The rule X ==> Y holds in the transaction set D with confidence c if c% of
the transactions in D that contain X also contain Y. The rule X ==> Y has support s if
s% of the transactions in D contain X ∪ Y. Given a set of transactions D, the goal of
mining association rules is to generate all association rules whose support and
confidence exceed user-specified minimum levels.
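To make these definitions concrete, here is a minimal Python sketch (my own
illustration, with an invented five-transaction database) that computes the support
and confidence of a rule:

```python
# Toy transaction database; items and values are illustrative only.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, db) / support(x, db)

x, y = {"A"}, {"B"}
print(support(x | y, transactions))    # support of A ==> B: 0.6
print(confidence(x, y, transactions))  # confidence of A ==> B: 0.75
```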
The Apriori algorithm proposed for mining association rules is a bottom-up,
breadth-first algorithm. In each pass it uses the frequent sets (large itemsets) from
the previous pass to generate candidate sets one level larger (one more item in X),
counts the support of those candidates in a scan of the database, and keeps the
frequent ones to generate the candidates for the next pass. The recursion continues
until the current frequent sets can no longer generate any bigger frequent sets.
The candidate generation function prefix-joins two of the current frequent sets to
generate a candidate set for the next pass. Some of the generated candidates are then
pruned because one of their immediate subsets is not among the current frequent sets.
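A rough sketch of this join-and-prune step, assuming frequent k-itemsets are kept as
sorted tuples (the function name apriori_gen and the toy input are illustrative):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """One pass of Apriori candidate generation (a sketch).
    `frequent_k` is a set of frequent k-itemsets, each a sorted tuple."""
    k = len(next(iter(frequent_k)))
    candidates = set()
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:-1] == b[:-1]:                 # join on common (k-1)-prefix
            cand = a + (b[-1],)
            # prune: every immediate k-subset must itself be frequent
            if all(sub in frequent_k for sub in combinations(cand, k)):
                candidates.add(cand)
    return candidates

# Example: frequent 2-itemsets over items 1..4.
f2 = {(1, 2), (1, 3), (2, 3), (2, 4)}
print(apriori_gen(f2))  # {(1, 2, 3)}; (2, 3, 4) is pruned since (3, 4) is infrequent
```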
In Agrawal's paper, the candidate sets in each pass are stored in a hash-tree. A node
of the hash-tree contains either a list of itemsets (for leaf nodes) or a hash table
(for interior nodes). At an interior node of depth d, we decide which branch to follow
by hashing the d-th item of the transaction T. By hashing on every item of T at the
root (and recursively below), the completeness of the support count is guaranteed.
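A minimal sketch of such a hash-tree; the branch factor, leaf capacity, and modulo
hash are arbitrary illustrative choices rather than the paper's exact parameters:

```python
BRANCH, LEAF_CAP = 4, 3   # illustrative parameters

class HashTreeNode:
    def __init__(self, depth=0):
        self.depth = depth
        self.itemsets = []    # used while this node is a leaf
        self.children = None  # hash buckets once the node goes interior

    def insert(self, itemset):
        """Store a candidate (a sorted tuple of item ids), splitting
        overfull leaves by hashing the item at this node's depth."""
        if self.children is not None:
            self.children[itemset[self.depth] % BRANCH].insert(itemset)
            return
        self.itemsets.append(itemset)
        if len(self.itemsets) > LEAF_CAP and self.depth < len(itemset):
            self.children = [HashTreeNode(self.depth + 1) for _ in range(BRANCH)]
            for s in self.itemsets:
                self.children[s[self.depth] % BRANCH].insert(s)
            self.itemsets = []

    def collect(self, t, start, matches):
        """Gather every stored candidate contained in transaction t by
        hashing, at each interior node, every remaining item of t."""
        if self.children is None:
            matches.update(s for s in self.itemsets if set(s) <= set(t))
            return
        for i in range(start, len(t)):
            self.children[t[i] % BRANCH].collect(t, i + 1, matches)

# Count support for some toy candidates over a toy database.
candidates = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (1, 3, 5)]
root = HashTreeNode()
for c in candidates:
    root.insert(c)

counts = {c: 0 for c in candidates}
for t in [(1, 2, 3, 4), (1, 3, 5), (2, 3, 4)]:
    matches = set()            # a set prevents double counting per transaction
    root.collect(t, 0, matches)
    for m in matches:
        counts[m] += 1
print(counts)
```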
Section II: Sequential patterns [2]
Association rule mining discovers intra-transaction patterns, while sequential
pattern mining searches for inter-transaction patterns.
For example, consider the statement: "25% of the customers who buy product A will
buy product B within 2 months of the first transaction."
In its simplest form, mining sequential patterns sets no constraint on the time gap
between two transactions, as long as the transactions considered were conducted by
the same customer.
As with mining association rules, the transaction database is often transformed,
either in memory or physically, into another abstract database to facilitate the
mining work. This is usually done by mapping items to integers, so that all
transactions can be represented as ordered integer lists.
Compared to mining association rules, some extra work must be done before one can
mine sequential patterns. First, the database must be sorted by customer and then
by transaction time. The assumption is made that a customer conducts at most one
transaction at any given time.
In association rule mining we count over transactions, but in sequential pattern
mining one customer can contribute to the support of a candidate pattern at most
once. In short, in association mining we count the fraction of transactions that
contain a pattern, while in sequential mining we count the fraction of customers, as
in the sketch below.
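A small sketch of this customer-level counting, assuming each customer's
transactions are already ordered by time (the customers and pattern are invented):

```python
def supports(cust_seq, pattern):
    """True if `pattern` (a list of itemsets) is contained, in order,
    in the customer's time-ordered list of transactions `cust_seq`."""
    i = 0
    for trans in cust_seq:
        if i < len(pattern) and pattern[i] <= trans:
            i += 1   # this transaction covers the next pattern element
    return i == len(pattern)

customers = {
    "c1": [{"A"}, {"B", "C"}, {"D"}],
    "c2": [{"A", "B"}, {"C"}],
    "c3": [{"B"}, {"A"}],
}
pattern = [{"A"}, {"C"}]   # buy A, then later buy C
count = sum(supports(seq, pattern) for seq in customers.values())
print(count / len(customers))   # support = 2/3: each customer counts once
```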
The Apriori algorithm can be used here without major changes for candidate
generation, support counting, and maximality pruning. Hash-trees can likewise be
used to store the candidates.
Section III: Performance improvement and generalizations
Performance improvement:
The Apriori algorithm adopts a bottom-up, breadth-first way of finding all the
maximum frequent sequential patterns.
The cost of plain Apriori comes from reading the transaction database (I/O), the
one-time transformation (CPU and possibly I/O), and the generation of candidates
and backward pruning (CPU).
The inherent problem with this and similar algorithms is that many candidate
frequent sets (patterns) have to be generated, counted, and then pruned later,
because we only need the maximum frequent sets (patterns).
When all maximum frequent sets are short (say three items at most), these
algorithms perform reasonably well. However, performance decreases drastically
when any of the maximum frequent sets becomes longer, because a maximum
frequent itemset of size L implies the presence of 2^L - 2 non-trivial frequent
itemsets (its proper, non-empty subsets) as well; for L = 20 that is already over a
million subsets, and Apriori will examine every one of them.
Another perspective concerns the support level. When the specified minimum support
is high, more candidates are pruned in each pass of the Apriori algorithm; if we
lower the required support level, the pass-through rate of candidate frequent sets
becomes much higher, which hurts performance.
In the papers surveyed, several improvements over this shortcoming were proposed;
the ideas can be summarized as follows.
The common goal is to reduce the number of candidates generated, and to bypass
some passes if possible.
Specific approaches:
1. Instead of using only the bottom-up approach, also use a top-down approach to
speed up the pruning process. The performance boost comes from the early
identification of long sequences (maximum frequent candidate sets). The goal is to
use the frequent sets generated at the bottom to shrink the maximum frequent
candidate sets at the top, and at the same time use the maximum frequent candidate
sets to prune their subsets appearing at the bottom. Two of the papers surveyed
pursued this approach, with slightly different algorithms.
I. Pincer-Search [4]
Both the frequent and the infrequent sets are maintained. The top (the maximum
frequent candidate set, MFCS) is initialized to the sequence of all items, but the
MFCS shrinks quickly as the set-difference operator is applied between the MFCS and
newly found infrequent sets; this brings the MFCS to its final size very quickly. In
addition, Pincer-Search needs a recovery function to bring back candidates that were
pruned but should not have been, because at the beginning the MFCS is often bigger
than it should be. Subset-infrequency pruning works as in Apriori, and
superset-frequency pruning works by deleting from the candidate frequent sets those
that are subsets of the MFCS. The paper also evaluates the effect of item
distribution experimentally. For the same number of frequent itemsets, the
distribution can be concentrated or scattered. In a concentrated distribution, the
frequent itemsets on each level share many common items: the frequent items tend
to cluster. In a scattered distribution, the performance lift from adding top-down
superset-frequency pruning is not as pronounced as in the concentrated case,
because the overhead of maintaining the MFCS is relatively large; moreover, in a
scattered distribution the MFCS tends to be short and cannot prune many subsets.
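The core MFCS update can be sketched as follows, under my reading of the paper:
when an itemset turns out infrequent, each MFCS member containing it is split by
removing one item of the infrequent set at a time, and only maximal members are
kept (the data is illustrative):

```python
def update_mfcs(mfcs, infrequent):
    """Remove the infrequent itemset from every MFCS member that
    contains it, keeping only the maximal surviving members."""
    new = set()
    for m in mfcs:
        if infrequent <= m:
            # split m: drop one item of the infrequent set at a time
            new.update(m - {item} for item in infrequent)
        else:
            new.add(m)
    # keep only maximal members (no member is a proper subset of another)
    return {s for s in new if not any(s < t for t in new)}

mfcs = {frozenset(range(1, 6))}            # start: one set of all items 1..5
result = update_mfcs(mfcs, frozenset({2, 4}))
print([sorted(s) for s in result])         # [1, 3, 4, 5] and [1, 2, 3, 5]
```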
II. Max-Miner [5]
Using Rymon’s generic set enumeration tree, and representing each node in the set
enumeration tree by candidate groups. A candidate group (g) consists of 2 item sets,
head h(g) and tail t(g). For a candidate group {1,2,3,4}, if a node enumerate {1} has
h(g)={1}, t(g)={2,3,4}. When we are counting the support of candidate group g, we
are computing the support of item sets h(g), h(g)t(g) and h(g)  {i} for all {I}  t(g).
The support for item sets other than h(g) are for pruning. For example, consider the
first item sets h(g)  t(g). Since h(g)  t(g) contains every item that appears in any
sub-node of g, if it is frequent, then any item sets of its sub nodes are also frequent,
but not maximum. Super set frequency pruning can therefore be implemented by
halting sub node expansion at any candidate group g for which h(g)  t(g) is frequent.
Consider on the other hand the item set h(g)  {I} for some {I}  t(g). If h(g)  {I}
is infrequent, then any head of a sub-node that contains item {I} will also be
in-frequent. Subset infrequency pruning can therefore be implemented by removing
any such tail item from a candidate group before expanding its sub-nodes.
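A compact sketch of the two pruning rules at a single candidate group; the support
function, threshold, and toy database are my own illustration:

```python
def expand(head, tail, support, minsup):
    """Apply Max-Miner's two pruning rules at one candidate group."""
    # superset-frequency pruning: if h(g) ∪ t(g) is frequent, every itemset
    # below this node is frequent but not maximal, so halt expansion
    if support(head | tail) >= minsup:
        return set(), True
    # subset-infrequency pruning: a tail item i with h(g) ∪ {i} infrequent
    # cannot appear in the head of any frequent sub-node, so drop it
    return {i for i in tail if support(head | {i}) >= minsup}, False

# Toy database and support function (illustrative).
db = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3, 4}]
sup = lambda s: sum(s <= t for t in db) / len(db)
print(expand(frozenset({1}), {2, 3, 4}, sup, 0.5))  # ({2, 3}, False): item 4 dropped
```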
2. Stick to the bottom-up approach, but instead of advancing one item at a time,
dynamically change the step size according to the previous passes' average
pass-through (hit) ratio: the higher the pass-through ratio, the bigger the step.
The purpose is to balance the trade-off between the time wasted counting
non-maximal sequences when the steps are small and the time wasted counting
extensions of infrequent sets when the steps are large.
Generalizations: [3]
Three ways of generalizing sequential pattern mining are proposed in the papers
surveyed.
Sliding windows: Instead of requiring one element of a pattern (an element can be
composed of many products grouped together) to come from a single transaction, we
can relax the requirement: as long as all the products of that group are purchased
within the same window (say, one week), the element counts as present for this
customer.
Maximum time gap: We are interested in a sequential pattern only if its neighboring
elements are meaningfully close in time. A customer buying a TV today and a VCR
5 years later will not be an interesting pattern for most retailers. The maximum-gap
constraint filters out patterns whose transactions are spread too far apart in time
to be meaningful; a sketch of the check follows.
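A sketch of the maximum-gap check on one customer's history; backtracking is used
because a greedy match can fail under a gap constraint (the timestamps and gap
value are invented):

```python
def supports_with_gap(trans, pattern, max_gap):
    """`trans` is a time-ordered list of (timestamp, itemset) pairs;
    consecutive matched transactions must be within `max_gap` apart."""
    def match(ti, pi, last_time):
        if pi == len(pattern):
            return True
        for j in range(ti, len(trans)):
            t, items = trans[j]
            if last_time is not None and t - last_time > max_gap:
                break   # transactions are time-ordered: later gaps only grow
            if pattern[pi] <= items and match(j + 1, pi + 1, t):
                return True
        return False
    return match(0, 0, None)

history = [(0, {"TV"}), (40, {"DVD"}), (45, {"VCR"})]
print(supports_with_gap(history, [{"TV"}, {"VCR"}], 30))  # False: gap is 45 days
print(supports_with_gap(history, [{"TV"}, {"VCR"}], 60))  # True
```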
Taxonomies: Sometimes the items are more specific than we need. For example,
"customer A bought an RCA TV and then an RCA VCR one month later" and "customer B
bought a GE TV and then a GE VCR one month later" can both be generalized to the
pattern "buy a TV, then buy a VCR one month later".
Section IV: Conclusions
This survey provides a basic framework for understanding current research topics
in sequential pattern data mining and its evolution, generalization, and improvement
over time. Issues such as scalability, data distribution, and generalization are
also covered. From these papers it can be observed that there are still various
possible ways to speed up the support counting phase and improve the overall
performance of data mining, either by reducing the candidates generated in each pass
or by bypassing certain passes (levels). There are opportunities to integrate the
methods surveyed here into even better mining algorithms.
References:
[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. IBM
Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, June 1994.
[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th
Int'l Conference on Data Engineering, 1995.
[3] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and
Performance Improvements. In Proc. of the 5th Int'l Conference on Extending
Database Technology (EDBT), Avignon, France, March 1996.
[4] D. Lin and Z. M. Kedem. Pincer-Search: A New Algorithm for Discovering the
Maximum Frequent Set. In Proc. of the 6th Int'l Conference on Extending Database
Technology (EDBT), 1998.
[5] R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. In Proc. of
the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
Read, but not directly relevant:
[6] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying Shapes of Histories.
In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland,
September 1995.
[7] R. Agrawal, K. Lin, H. S. Sawhney, and K. Shim. Fast Similarity Search in the
Presence of Noise, Scaling, and Translation in Time-Series Databases. In Proc. of
the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September
1995.