CS 632 Paper Survey on Sequential Pattern Data Mining
Steven Yuan-Wei Li

Introduction:

Data mining has recently gained popularity in many fields. The most frequently mentioned examples include:
- micro-marketing in the retail industry;
- genetic engineering (DNA sequence patterns);
- the financial industry, where data mining is used to identify interesting share-price movements.

According to the classification of IBM's Quest data mining project (http://www.almaden.ibm.com/cs/quest/publications.html), data mining can be divided into the following sub-fields: associations, classification, clustering, database-mining integration, deviation detection, incremental mining, OLAP, text and web mining, time-series clustering, and sequential patterns.

The topic surveyed here is sequential pattern mining. It is a direct descendant, or generalization, of association rule mining, so this survey starts by introducing the basic concepts of association rule mining. I then bring in sequential pattern mining in its most primitive form, the 1995 paper by R. Agrawal and R. Srikant [2], explain the basic idea, and investigate some inherent problems of its Apriori algorithm. After that, the survey branches out in two directions: the primary one is performance improvement, the second is generalization.

Section I: Association rules [1]

R. Agrawal's association rule mining paper is the origin of a whole series of follow-up papers. The simplest way to demonstrate the purpose of association rule mining is a statement like: "20% of the customers who buy product A also buy product B in the same transaction."

In the formal problem formulation, there is a set of items I = {i1, i2, i3, ...}, which can be thought of as the universe of products sold by a retailer. The target database is a set of transactions D, where each transaction T is a set of items such that T ⊆ I. Each transaction carries a unique identifier, TID. A transaction T contains an itemset (or pattern) X, a set of items from I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s if s% of the transactions in D contain X ∪ Y. Given a set of transactions D, the goal of mining association rules is to generate all association rules whose support and confidence exceed user-specified minimums.

The Apriori algorithm proposed for mining association rules is a bottom-up, breadth-first algorithm. In each pass it uses the frequent itemsets (also called large itemsets) of the previous pass to generate the candidate itemsets of the next level (one more item in X), counts the support of those candidates in the same pass over the database, and keeps the ones that reach the minimum support as the frequent itemsets for the next pass. The recursion continues until the current frequent sets can no longer generate any larger frequent sets. The candidate-generation function joins two current frequent sets that share a common prefix to produce a candidate for the next pass; some of the generated candidates are then pruned because one of their immediate subsets is not among the current frequent sets (a sketch of this step appears at the end of this section).

The candidate sets of each pass are stored in a hash tree in Agrawal's paper. A node of the hash tree contains either a list of itemsets (leaf node) or a hash table (internal node). At an internal node at depth d, we decide which branch to follow by hashing the d-th item of the transaction T. By hashing on every item of T at the root, every candidate contained in T is reached, so the completeness of the support count is guaranteed.
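To make the join-and-prune step concrete, here is a minimal Python sketch of one Apriori pass, assuming itemsets are represented as sorted tuples of integer item IDs; the function names apriori_gen and count_supports are mine, not from the paper, and a real implementation would store the candidates in a hash tree rather than scan them linearly.

    from itertools import combinations

    def apriori_gen(frequent_prev):
        """Generate candidate k-itemsets from the frequent (k-1)-itemsets.
        frequent_prev: set of sorted tuples."""
        candidates = set()
        for a in frequent_prev:
            for b in frequent_prev:
                # Prefix join: identical except for the last item.
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    c = a + (b[-1],)
                    # Prune: every immediate subset must itself be frequent.
                    if all(s in frequent_prev
                           for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        return candidates

    def count_supports(candidates, transactions):
        """Count, for each candidate, how many transactions contain it."""
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            items = set(t)
            for c in candidates:
                if items.issuperset(c):
                    counts[c] += 1
        return counts

For example, the frequent 2-itemsets (1,2), (1,3) and (2,3) join to the single candidate (1,2,3), which survives pruning because all three of its 2-subsets are frequent.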
Section II: Sequential patterns [2]

Association rule mining looks for intra-transaction patterns, while sequential pattern mining searches for inter-transaction patterns. For example, consider the statement: "25% of the customers who buy product A will buy product B within 2 months of the first transaction." In its simplest form, sequential pattern mining sets no constraint on the time gap between two transactions, as long as the transactions considered were conducted by the same customer.

As in association rule mining, the transaction database is usually transformed, either in memory or physically, into another abstract database to facilitate the mining work. Often this is done by mapping items to integers, so that all transactions can be represented as ordered integer lists. Compared to association rule mining, some extra work must be done before sequential patterns can be mined. First, the database must be sorted, by customer and then by transaction time; the assumption is made that a customer has at most one transaction at any point in time. Second, the unit of support changes: in association rule mining we care only about transactions when counting supports, but in sequential pattern mining one customer can contribute to a candidate pattern at most once, no matter how many of that customer's transactions contain it. In short, in association mining we count the fraction of transactions, while in sequential mining we count the fraction of customers (see the sketch below).

The Apriori algorithm can be used here without major changes for candidate generation, support counting, and maximality pruning, and hash trees can again be used to store the candidates.
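A minimal sketch of the customer-based support count, assuming each customer's data-sequence is a time-ordered list of transactions (each a set of item IDs) and a candidate pattern is a list of itemsets; the helper names are mine.

    def contains_sequence(data_sequence, pattern):
        """True if the customer's ordered transactions contain the pattern,
        i.e. each itemset of the pattern is a subset of some transaction
        strictly later than the one matching the previous itemset."""
        i = 0
        for element in pattern:
            while i < len(data_sequence) and not data_sequence[i] >= element:
                i += 1
            if i == len(data_sequence):
                return False
            i += 1  # the next element must match a later transaction
        return True

    def sequence_support(customers, pattern):
        """Fraction of customers whose data-sequence contains the pattern;
        each customer is counted at most once."""
        hits = sum(1 for seq in customers if contains_sequence(seq, pattern))
        return hits / len(customers)

For instance, a customer with transactions [{10}, {20, 30}, {40}] supports the pattern [{10}, {30}] exactly once, however many of that customer's later transactions also contain item 30.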
Section III: Performance improvements and generalizations

Performance improvement: The Apriori algorithm finds all maximal frequent sequential patterns in a bottom-up, breadth-first fashion. The cost of plain Apriori comes from reading the transaction database (I/O), the one-time transformation (CPU and possibly I/O), and the generation of candidates plus backward pruning (CPU). The inherent problem with this and similar algorithms is that many candidate frequent sets (patterns) must be generated, counted, and then pruned later, because only the maximal frequent sets (patterns) are needed. When all maximal frequent sets are short (say three items at most), these algorithms perform reasonably well. Performance drops drastically, however, as soon as any maximal frequent set becomes long, because a maximal frequent itemset of size L implies the presence of 2^L - 2 non-trivial frequent itemsets (its proper, non-empty subsets), and Apriori examines every one of them; a maximal set of only 20 items already implies over a million frequent subsets. Another perspective concerns the support level: when the specified support level is high, more candidates are pruned in each pass of the Apriori algorithm, but if the required support level is lowered, the pass-through rate of candidate frequent sets becomes much higher, which hurts performance.

The papers surveyed propose several improvements on this shortcoming, but the ideas can be summarized as follows: the common goal is to reduce the number of candidates generated and, where possible, to bypass some passes entirely. The specific approaches are:

1. Instead of working only bottom-up, also work top-down to speed up the pruning process. The performance gain comes from early identification of long sequences (maximal frequent candidate sets). The goal is to use the frequent sets generated at the bottom to shrink the maximal frequent candidate sets at the top, and at the same time to use the maximal frequent candidate sets to prune their subsets appearing at the bottom. Two of the papers surveyed pursue this approach, with slightly different algorithms.

I. Pincer Search [4]: Both the frequent and the infrequent sets are maintained. The top, the maximum frequent candidate set (MFCS), is initialized to the set of all items, but it shrinks quickly as a set-difference operation is applied between the MFCS and the infrequent sets; this drives the MFCS toward its final size very quickly (a sketch of this update appears after the Max-Miner description below). In addition, Pincer Search needs a recovery function to restore candidates that were pruned but should not have been, because at the beginning the MFCS is often larger than it should be. Subset-infrequency pruning works as in Apriori; superset-frequency pruning works by deleting from the candidate sets every subset of the MFCS.

The paper also evaluates the effect of item distribution experimentally. For the same number of frequent itemsets, their distribution can be clustered or scattered. In a clustered (concentrated) distribution, the frequent itemsets on each level share many common items: the frequent items tend to cluster. In a scattered distribution, the performance gain from adding top-down superset-frequency pruning is not as pronounced as in the clustered case, because the overhead of maintaining the MFCS is relatively large, and in a scattered distribution the MFCS tends to be short and cannot prune many subsets.

II. Max-Miner [5]: Max-Miner uses Rymon's generic set-enumeration tree and represents each node of the tree by a candidate group. A candidate group g consists of two itemsets, the head h(g) and the tail t(g). For a candidate group over {1,2,3,4}, the node that enumerates {1} has h(g) = {1} and t(g) = {2,3,4}. When counting the support of a candidate group g, we compute the support of the itemsets h(g), h(g) ∪ t(g), and h(g) ∪ {i} for every i ∈ t(g). The supports of the itemsets other than h(g) are used for pruning. Consider first the itemset h(g) ∪ t(g). Since h(g) ∪ t(g) contains every item that appears in any sub-node of g, if it is frequent, then every itemset enumerated in g's sub-nodes is also frequent but not maximal. Superset-frequency pruning can therefore be implemented by halting sub-node expansion at any candidate group g for which h(g) ∪ t(g) is frequent. Consider, on the other hand, the itemset h(g) ∪ {i} for some i ∈ t(g). If h(g) ∪ {i} is infrequent, then the head of any sub-node that contains item i is also infrequent. Subset-infrequency pruning can therefore be implemented by removing any such tail item from a candidate group before expanding its sub-nodes. Both pruning rules are sketched below.
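A minimal sketch of the Pincer-Search MFCS update, assuming itemsets are Python frozensets; the name mfcs_update is mine. When an itemset turns out to be infrequent, every MFCS member containing it is replaced by the maximal subsets that avoid it.

    def mfcs_update(mfcs, infrequent):
        """For every infrequent itemset s and every MFCS member m with
        m >= s, replace m by the sets m - {e} for each item e in s,
        keeping only members not subsumed by another member."""
        for s in infrequent:
            for m in [m for m in mfcs if m >= s]:
                mfcs.remove(m)
                for e in s:
                    candidate = m - {e}
                    if not any(candidate <= other for other in mfcs):
                        mfcs.add(candidate)
        return mfcs

    # Example: if {1,2} is found infrequent, the MFCS {1,2,3,4}
    # splits into {1,3,4} and {2,3,4}.
    mfcs = {frozenset({1, 2, 3, 4})}
    mfcs_update(mfcs, [frozenset({1, 2})])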
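And a sketch of Max-Miner's two pruning rules applied to a single candidate group, assuming a support(itemset) function and a minimum support threshold are available; the recursive structure and names are mine, not Bayardo's exact formulation, which also re-orders tail items and tracks previously found maximal sets.

    def expand(head, tail, support, minsup, maximal):
        """Expand one candidate group g = (head, tail) of the
        set-enumeration tree; itemsets are frozensets throughout."""
        # Superset-frequency pruning: if h(g) ∪ t(g) is frequent,
        # everything below this node is frequent but not maximal.
        if support(head | tail) >= minsup:
            maximal.add(head | tail)
            return
        # Subset-infrequency pruning: drop tail items i for which
        # h(g) ∪ {i} is infrequent; no frequent head below this node
        # can contain them.
        tail = [i for i in tail if support(head | {i}) >= minsup]
        for k, i in enumerate(tail):
            expand(head | {i}, frozenset(tail[k + 1:]),
                   support, minsup, maximal)
        if not tail:
            maximal.add(head)  # leaf: head is a candidate maximal set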
2. Stick to the bottom-up approach, but instead of advancing one item at a time, dynamically adjust the step size according to the average pass-through (hit) ratio of the previous passes: the higher the pass-through ratio, the bigger the step. The purpose is to balance the trade-off between the time wasted counting non-maximal sequences when the steps are small and the time wasted counting extensions of infrequent sets when the steps are large.

Generalizations [3]: Three generalizations of sequential pattern mining are proposed in the papers surveyed.

Sliding windows: Instead of requiring an element (an element can be composed of several products grouped together) to come from a single transaction, we can relax the requirement: as long as all products of the group are purchased within the same window (say, the same week), the element counts as present for that customer.

Maximum time gap: We are interested in a sequential pattern only if its neighboring elements are meaningfully close in time. A customer who buys a TV today and a VCR five years later is not an interesting pattern for most retailers. A maximum-gap constraint filters out patterns whose consecutive transactions are spread too far apart to be meaningful.

Taxonomies: Sometimes the items are more specific than we need. For example, "customer A bought an RCA TV and then an RCA VCR one month later" and "customer B bought a GE TV and then a GE VCR one month later" can both be generalized, by climbing the product taxonomy, to the pattern "buy a TV, then buy a VCR one month later."

Section IV: Conclusions

This survey provides a basic framework for understanding current research topics in sequential pattern data mining and its evolution: its generalizations and its performance improvements over time. Issues such as scalability, data distribution, and generalization are also covered. From these papers it can be observed that there are still different possible ways to speed up the support-counting phase and improve the overall performance of the mining, either by reducing the candidates generated in each pass or by bypassing certain passes (levels). There are opportunities to integrate the methods surveyed here into even better mining algorithms.

References:

[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. IBM Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, June 1994.
[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th Int'l Conference on Data Engineering, 1995.
[3] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[4] D. Lin and Z. M. Kedem. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set. In Proc. of the 6th Int'l Conference on Extending Database Technology (EDBT), 1998.
[5] R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.

Read but not directly relevant:

[6] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying Shapes of Histories. In Proc. of the 21st Int'l Conference on Very Large Data Bases, Zurich, Switzerland, September 1995.
[7] R. Agrawal, K. Lin, H. S. Sawhney, and K. Shim. Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. In Proc. of the 21st Int'l Conference on Very Large Data Bases, Zurich, Switzerland, September 1995.