Survey on Data Mining -- Association Rules

Chen-jung Stella Li
csl15@cornell.edu

Introduction

Data mining has become a research area of increasing importance because of its capability of helping end users extract useful information from large databases. With the rapid adoption of high-technology tools, businesses can now generate and collect massive amounts of data, which they could not have done before. From this myriad of data, businesses would like to "discover" useful and interesting information that could assist them in marketing strategies, decision making, and so on. This can be accomplished with various data mining techniques, such as association rules, characterization, classification, clustering, and so forth. This survey paper presents the method of association rules and explores various algorithms that have been studied in this field.

Mining association rules is, at its core, the task of finding important associations among items in a given database of sales transactions, such that the presence of some items in a transaction implies the presence of other items in the same transaction. A formal model is introduced in [AIS93]. Let I = {i1, i2, ..., im} be a set of binary attributes, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Let X be a set of items. A transaction T is said to contain X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. (X is called the antecedent of the rule, while Y is called the consequent.) Furthermore, the rule X ⇒ Y is said to hold in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y is said to have support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The confidence factor indicates the strength of the implication, whereas the support factor indicates the frequency of the pattern occurring in the rule. The goal of mining association rules is then to find "strong" rules, that is, rules with high confidence and strong support, in a large database.

Using the above formulation, the problem of mining association rules can be decomposed into two sub-problems: 1) discovering the large itemsets, and 2) using the large itemsets to generate the association rules for the database. The first sub-problem, discovering the large itemsets, is to generate all combinations of items whose fractional transaction support is above a certain threshold; all other combinations, which fall below the support threshold, are called small itemsets. The second sub-problem, generating association rules, uses the large itemsets found in the first sub-problem to generate the desired rules. A straightforward way to generate the rules is the following: for every large itemset X, find all non-empty proper subsets of X, and for every such subset Y, output a rule of the form Y ⇒ (X - Y) if the ratio of the support of X to the support of Y is at least c, the minimum confidence factor. Since the second sub-problem can be solved in this straightforward manner, the major emphasis has been placed on finding efficient algorithms for discovering the large itemsets. The algorithms discussed in the following sections are the AIS algorithm [AIS93], Apriori and AprioriTid [AS94], DHP [PCY95], the Partition algorithm [SON95], and SETM [HS95].
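To make these definitions concrete, the following is a minimal Python sketch (not taken from any of the surveyed papers) of how support, confidence, and the straightforward rule-generation step can be computed. The function names, transaction data, and threshold are illustrative only.

from itertools import combinations

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def generate_rules(large_itemsets, transactions, min_conf):
    # For every large itemset X and every non-empty proper subset Y of X,
    # output the rule Y => (X - Y) when support(X) / support(Y) >= min_conf.
    rules = []
    for x in large_itemsets:
        x = frozenset(x)
        for r in range(1, len(x)):
            for y in map(frozenset, combinations(x, r)):
                conf = support(x, transactions) / support(y, transactions)
                if conf >= min_conf:
                    rules.append((set(y), set(x - y), conf))
    return rules

# Toy data: four transactions over three items.
transactions = [frozenset(t) for t in ({"bread", "milk"}, {"bread", "butter"},
                                       {"bread", "milk", "butter"}, {"milk"})]
print(generate_rules([{"bread", "milk"}], transactions, min_conf=0.6))

On this toy data the sketch prints the two rules {bread} ⇒ {milk} and {milk} ⇒ {bread}, each with confidence of about 0.67.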
The "frontier set" of a pass is a set which consists of itemsets that are extended during the pass. During each pass, a measurement of the support for certain itemsets is taken. These itemsets are called "candidate itemsets." They are derived from the tuples in the database and the itemsets contained in the frontier set. The AIS algorithm makes multiple passes over the database. Initially, the frontier set is empty. During each pass over the database, the candidate sets are generated from taking the extensions of the frontier set and the tuples in the database. At the end of each pass, if the support for a candidate set is equal or above the minimum support required, this candidate set is kept and considered as a large itemset which will then be used in the following passes. Meanwhile, this itemset will be determined whether it should be added to the frontier set for the next pass. That is, those candidate itemsets that were expected to be small but turned out to be large in the current pass would be included in the frontier set for the next pass. The entire algorithm terminates when the frontier set becomes empty. At the end, the remaining candidate itemsets are the large itemsets that we are supposed to discover. Apriori, AprioriTid, and AprioriHybrid Algorithms The Apriori algorithm first counts occurrences of items to determine the large 1-itemsets. (itemset with the cardinality of 1) Then there are two phases in the 2 subsequent passes. First of all, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the "apriori-gen" function. Next, the support of candidates in Ck is counted by scanning the database. The final answer is obtained by taking the union of all Lk itemsets. The "apriori-gen" function takes Lk-1 as argument and returns a superset of the set of all large k-itemsets. Firstly, the join step is taken by joining Lk-1 with Lk-1. The next step is the prune step, which deletes all itemsets cCk such that some (k-1)-subset of c is not in Lk-1. The intuition behind the "apriori-gen" function is that every subset of a large itemset must be large; thus we can combine almost-matching pairs of large (k-1)-itemsets, i.e. Lk-1, and then prune out those with non-large (k-1)-subsets. The AprioriTid algorithm is a variation of the Apriori algorithm. The AprioriTid algorithm also uses the "apriori-gen" function to determine the candidate itemsets before the pass begins. The main difference from the Apriori algorithm is that the AprioriTid algorithm does not use the database for counting support after the first pass. Instead, the set <TID, {Xk}> is used for counting. (Each Xk is a potentially large k-itemset in the transaction with identifier TID.) The benefit of using this scheme for counting support is that at each pass other than the first pass, the scanning of the entire database is avoided. But the downside of this is that the set <TID, {Xk}> that would have been generated at each pass may be huge. Another algorithm, called AprioriHybrid, is introduced in [AS94]. The basic idea of the AprioriHybird algorithm is to run the Apriori algorithm initially, and then switch to the AprioriTid algorithm when the generated database (i.e. <TID, {Xk}>) would fit in the memory. DHP Algorithm The DHP (Direct Hashing and Pruning) algorithm is an effective hash-based algorithm for the candidate set generation. The DHP algorithm consists of three steps. The first step is to get a set of large 1-itemsets and constructs a hash table for 2-itemsets. 
DHP Algorithm

The DHP (Direct Hashing and Pruning) algorithm is an effective hash-based algorithm for candidate set generation. It consists of three parts. The first part obtains the set of large 1-itemsets and constructs a hash table for 2-itemsets, where each bucket of the hash table accumulates an occurrence count for the itemsets hashed into it. The second part generates the set of candidate itemsets Ck, but it adds a k-itemset to Ck only if that k-itemset hashes to a bucket whose count is greater than or equal to the minimum transaction support. The third part is essentially the same as the second, except that it does not use the hash table when deciding whether to include a particular itemset among the candidate itemsets. The second part is designed for use in the early iterations, whereas the third part should be used in later iterations, once the number of hash buckets with a count greater than or equal to s (the minimum transaction support required) falls below a pre-defined threshold.

The DHP algorithm has two major features: efficient generation of large itemsets and effective reduction of the transaction database size. Generating smaller candidate sets is the key to trimming the transaction database in the early iterations, so that the computational cost of the later iterations is significantly reduced. The DHP algorithm is therefore particularly powerful for determining large itemsets in the early stages, where the performance bottleneck lies.

Partition Algorithm

The Partition algorithm logically partitions the database D into n partitions and reads the entire database at most twice to generate the association rules. The reason the partition scheme works is that any potentially large itemset must appear as a large itemset in at least one of the partitions. The algorithm consists of two phases. In the first phase, the algorithm iterates n times, and in each iteration only one partition is considered: the function "gen_large_itemsets" takes a single partition and generates the local large itemsets of all lengths from that partition. The local large itemsets of the same length from all n partitions are then merged to form the global candidate itemsets. In the second phase, the algorithm counts the support of each global candidate itemset and generates the global large itemsets. Note that the database is read at most twice during the process: once in the first phase and once in the second phase, in which the support counting requires a scan of the entire database. Making this minimal number of passes over the entire database drastically reduces the time spent on I/O.
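As a rough illustration of this two-phase scheme (a sketch only, not the exact procedure of [SON95]), the following Python snippet computes local large itemsets per partition with a deliberately naive helper, merges them into global candidates, and then verifies their support in a second pass over the whole database. The helper, the data, and the thresholds are all illustrative assumptions.

from itertools import combinations

def local_large_itemsets(partition, min_frac, max_len=3):
    # Naive enumeration of the itemsets that are large within one partition.
    # (A real implementation would use an Apriori-style level-wise search.)
    items = sorted(set().union(*partition))
    large = set()
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            if sum(set(cand) <= t for t in partition) >= min_frac * len(partition):
                large.add(cand)
    return large

def partition_algorithm(transactions, n, min_frac, max_len=3):
    # Phase I: one pass over the data, local large itemsets per partition.
    size = -(-len(transactions) // n)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    global_candidates = set().union(*(local_large_itemsets(p, min_frac, max_len)
                                      for p in parts))
    # Phase II: second pass, count the global support of the merged candidates.
    return {c for c in global_candidates
            if sum(set(c) <= t for t in transactions) >= min_frac * len(transactions)}

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b"}]
print(partition_algorithm(transactions, n=2, min_frac=0.6))

On this toy data, {"a", "c"} is locally large in the first partition but fails the global support check in phase two, which illustrates why the second pass is needed.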
SETM Algorithm

The SETM algorithm is a set-oriented algorithm for mining association rules, whereas the algorithms introduced previously are mostly tuple-oriented. SETM differs from the previous algorithms a great deal; in particular, it builds on simple database primitives such as sorting and the merge-scan join. In the first iteration, the SETM algorithm sorts R1 (the relation used in the first iteration) on item, and C1 is the count relation generated from R1. In each following iteration, two sort operations and one merge-scan join are performed. After the first sort operation, a temporary relation R'k for the kth iteration is generated by a merge-scan join of Rk-1 and R1. The second sort operation then sorts R'k on items, which allows the support counts to be generated efficiently: the count relation Ck is produced by a sequential scan over R'k. At the end of each iteration, Rk is generated by deleting from R'k the tuples whose itemsets do not meet the minimum support; this is done via table look-ups on the relation Ck.

Another major contribution of [HS95] is that it proposes ways to express the generation of association rules using SQL queries. Thus, besides introducing the SETM algorithm, the paper also presents two formulations of its set-oriented data mining strategy, which naturally lead to nested-loop joins and sort-merge joins, respectively. Suppose there is a relation SALES(trans_id, item). In the formulation that leads to nested-loop joins, the first step is to generate the count for each item x and to store the result in a relation C1(item, count):

INSERT INTO C1
SELECT r1.item, COUNT(*)
FROM SALES r1
GROUP BY r1.item
HAVING COUNT(*) >= :min_support;

The next step is to generate all patterns (x, y) and check whether they meet the minimum support criterion. For ease of exposition, consider a specific item 'A'. All the patterns (A, y) can be generated using a self-join of SALES:

SELECT r1.item, r2.item, COUNT(*)
FROM SALES r1, SALES r2
WHERE r1.trans_id = r2.trans_id
  AND r1.item = 'A'
  AND r2.item <> 'A'
GROUP BY r1.item, r2.item
HAVING COUNT(*) >= :min_support;

The other formulation leads to sort-merge joins. The first step is to generate the ordered patterns of length k:

INSERT INTO R'k
SELECT p.trans_id, p.item1, ..., p.item[k-1], q.item
FROM Rk-1 p, SALES q
WHERE q.trans_id = p.trans_id
  AND q.item > p.item[k-1];

The next step is to generate the count relation for those patterns in R'k that meet the minimum support criterion:

INSERT INTO Ck
SELECT p.item1, ..., p.item[k], COUNT(*)
FROM R'k p
GROUP BY p.item1, ..., p.item[k]
HAVING COUNT(*) >= :min_support;

Then, the tuples of R'k that meet the minimum support are selected, so that the relation can be extended further:

INSERT INTO Rk
SELECT p.trans_id, p.item1, ..., p.item[k]
FROM R'k p, Ck q
WHERE p.item1 = q.item1 AND ... AND p.item[k] = q.item[k]
ORDER BY p.trans_id, p.item1, ..., p.item[k];

These steps are repeated until Rk is empty. The paper [HS95] thus demonstrates the point that "at least some aspects of data mining can be carried out by using general query languages such as SQL, rather than by developing specialized black box algorithms."

Conclusion

The goal of mining association rules is to discover important associations among items in a database of transactions, such that the presence of some items implies the presence of other items. The problem has been decomposed into two sub-problems: discovering the large itemsets, and then generating rules based on those large itemsets. Attention has been focused on the first sub-problem, since the second is quite straightforward, and several algorithms have been proposed to solve it. This research is largely motivated by the fact that the amount of data processed when mining association rules is huge, so it is crucial to devise efficient algorithms for mining such data.

The classical algorithm is introduced in [AIS93] and is often referred to as the AIS algorithm. It lays out the framework of mining association rules and introduces the notion of candidate itemsets, which is used throughout the later work. The Apriori and AprioriTid algorithms [AS94], however, are the most well known in the field; both use the "apriori-gen" function to generate the candidate itemsets.
The DHP algorithm [PCY95] is very similar to the Apriori algorithm, except that it employs a hash table scheme to make candidate generation more efficient and effective. The Partition algorithm [SON95] reduces the I/O overhead, in comparison with the Apriori algorithm, by first dividing the database into partitions and generating large itemsets from those partitions. The last algorithm examined is the SETM algorithm [HS95], which is inherently different from the others because it is set-oriented; it was developed from the realization that there are SQL formulations of the problem that lead to nested-loop joins and sort-merge joins. Since the efficiency of mining association rules is application dependent, there is at this point no clear performance model for determining which algorithm ultimately performs better than the others.

References

[AIS93] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD, May 1993.

[AS94] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. on Very Large Data Bases, Sept. 1994.

[PCY95] J.-S. Park, M.-S. Chen, and P. S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD, May 1995.

[SON95] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. on Very Large Data Bases, 1995.

[HS95] M. Houtsma and A. Swami, "Set-Oriented Mining for Association Rules in Relational Databases," Proc. 11th IEEE Int'l Conf. on Data Engineering, 1995.