Parallel Mining of Maximal Frequent Itemsets from Databases
Soon M. Chung and Congnan Luo
Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’03)

Outline
- Introduction
- Max-Miner Algorithm
- Parallel Max-Miner (PMM) Algorithm
- Performance Evaluation
- Conclusion

Introduction (1)
- In mining association rules, the most time-consuming job is finding all frequent itemsets in a large database with respect to a given minimum support
- In Apriori, the subset-infrequency based pruning step prevents many candidate k-itemsets from being counted in each pass k
- In Apriori-like algorithms, if there is a frequent itemset of length l, then they will generate and count its 2^l subsets

Introduction (2)
- Our basic idea is that if we find a large frequent itemset early, we can avoid counting all its subsets, because they are all frequent
- We propose a parallel algorithm, named Parallel Max-Miner (PMM), for mining maximal frequent itemsets
- Like the Count Distribution algorithm, PMM requires multiple passes over the database and needs synchronization between nodes at the end of every pass

Max-Miner algorithm
- Unlike Apriori, the Max-Miner algorithm extracts only the maximal frequent itemsets
- Superset-frequency based pruning: Max-Miner always attempts to look ahead in order to identify large frequent itemsets early
- All subsets of these discovered frequent itemsets can then be pruned from the search space

Set-enumeration tree of Max-Miner (1)

Set-enumeration tree of Max-Miner (2)
- Each node in the tree is called a candidate group
- A candidate group g consists of two components, both of which are itemsets
- The first itemset is called the head of the group and is denoted by h(g)
- The second itemset is called the tail of the group and is denoted by t(g)
- t(g) is an ordered set and contains all the items that are not in h(g) but can potentially appear in any subnode derived from node g

The main procedure of Max-Miner (1)
- From the root of the tree at level 0, count the support of the 1-itemsets. Only the frequent 1-itemsets are enumerated at level 1
- Four nodes are generated at level 1 if 1, 2, 3, and 4 are all frequent 1-itemsets
- For the node g1, we need to count the support of h(g1) ∪ t(g1) = {1, 2, 3, 4}
- If the support of h(g1) ∪ t(g1) is greater than or equal to minsup, then we do not need to expand the tree from the node g1 any further

The main procedure of Max-Miner (2)
- At any node g, if h(g) ∪ t(g) is not frequent, then for each item i in t(g) we check whether h(g) ∪ {i} is frequent
- If h(g) ∪ {i} is frequent, a corresponding subnode is generated
- For a candidate group node g, an item that appears last in the ordering of the tail of g will appear in most of the offspring of g
- To discover the maximal frequent itemsets early, it is better to order the subnodes of each node in ascending order of their support, so that the most frequent items appear in the most candidate groups

Parallel Max-Miner (PMM) algorithm
- The database is evenly divided into N partitions {D_0, D_1, D_2, …, D_{N-1}}, one for each of the N nodes {P_0, P_1, P_2, …, P_{N-1}}
- Each node is allocated the same number of transactions
- PMM requires multiple passes over the database
- For each pass k, all the nodes have exactly the same set of candidate groups, C_k
- Each node counts the support of C_k in its local database partition, independently
- At the end of each pass, all nodes exchange the count information so that they can generate the same set C_{k+1} for the next pass

Performance Evaluation
- Speedup of PMM
- Sizeup of PMM

Conclusion
- We proposed a parallel maximal frequent itemset mining algorithm, Parallel Max-Miner, for shared-nothing multiprocessor systems
- Drawback: it requires synchronization between nodes to exchange the count information at the end of every pass
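
The candidate-group search and the per-pass count exchange described in the slides can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`global_counts`, `parallel_max_miner`) are hypothetical, the "nodes" are simulated in a single process by summing per-partition counts, `minsup` is assumed to be an absolute transaction count, and ties in support are broken by item value.

```python
from itertools import chain

def global_counts(partitions, itemsets):
    """Each simulated node counts every candidate in its own partition;
    summing the local counts stands in for the end-of-pass exchange."""
    local = [[sum(1 for t in part if s <= t) for s in itemsets]
             for part in partitions]
    return [sum(col) for col in zip(*local)]

def parallel_max_miner(partitions, minsup):
    """partitions: list of lists of frozenset transactions (one list per node)."""
    items = sorted(set(chain.from_iterable(chain.from_iterable(partitions))))
    counts = global_counts(partitions, [frozenset([i]) for i in items])
    # Pass 1: keep frequent items, ordered by ascending support.
    order = [i for _, i in sorted((c, i) for c, i in zip(counts, items)
                                  if c >= minsup)]
    if not order:
        return []
    found = [frozenset([order[-1]])]  # the last item has an empty tail
    # Level-1 candidate groups: head {i}, tail = the items that follow i.
    groups = [(frozenset([i]), order[j + 1:])
              for j, i in enumerate(order[:-1])]
    while groups:
        # One pass: count h(g) ∪ t(g) and every h(g) ∪ {i} for all groups.
        cands = []
        for head, tail in groups:
            cands.append(head | frozenset(tail))
            cands.extend(head | {i} for i in tail)
        support = dict(zip(cands, global_counts(partitions, cands)))
        next_groups = []
        for head, tail in groups:
            full = head | frozenset(tail)
            if support[full] >= minsup:
                found.append(full)  # look-ahead succeeded, do not expand g
                continue
            # Keep only tail items that extend the head to a frequent set.
            kept = sorted((support[head | {i}], i) for i in tail
                          if support[head | {i}] >= minsup)
            new_tail = [i for _, i in kept]
            if not new_tail:
                found.append(head)
                continue
            found.append(head | {new_tail[-1]})
            for j, i in enumerate(new_tail[:-1]):
                next_groups.append((head | {i}, new_tail[j + 1:]))
        # Superset-frequency pruning: drop groups already covered by a result.
        groups = [(h, t) for h, t in next_groups
                  if not any((h | frozenset(t)) <= f for f in found)]
    unique = set(found)
    maximal = [f for f in unique if not any(f < g for g in unique)]
    return sorted(sorted(f) for f in maximal)
```

For example, with the six transactions {1,2,3,4}, {1,2,3,4}, {1,2,3}, {2,3,4}, {1,5}, {5} split across two partitions and `minsup = 2`, the sketch returns `[[1, 2, 3, 4], [5]]`. Because only summed counts are used, the result does not depend on how the transactions are partitioned.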