CSCI6405 class project
Implementation and comparison of three AR mining algorithms
Xuehai Wang, Xiaobo Chen, Shen Chen
AR mining

Outline
• Motivation
• Dataset
• Apriori based hash tree algorithm
• FP-tree algorithm
• Conclusion
• Reference

Motivation
• Make rule generation as fast as possible
• Understand the three algorithms
  – Apriori algorithm
  – Apriori with hash tree algorithm
  – FP-tree algorithm
• Learn how to improve an algorithm

Dataset
• IBM dataset generator
  – Can set item number
  – Can set minimal support
  – Can set dataset size
• Example record, as extracted (columns: Tid, item): 12589 2 3 4 6 7 12

Apriori principle
• Apriori principle
  – A candidate generation-and-test approach [4]
  – Every subset of a frequent itemset must be frequent
  – If a set is infrequent, its supersets are not generated or tested
• But some places can still be improved
  – Counting the support
  – Number of I/O scans

Apriori Hash Tree Alg
• There are l candidate k-itemsets
• There are n transactions
• Average transaction size is m
• Cost of calculating the support counts:
  – Original Apriori Alg: O(n · l · m · k)
  – With hash tree: O(n · log(l) · m · k)

Apriori Hash Tree Alg
• Candidates are stored in a hash tree structure

  Tid  Items
  1    1 2
  2    1 3 6
  3    1 2 3
  4    2 4
  5    2 3 6
  6    5 6

• [Figure: 1-itemset candidate hash tree; each node is labeled item(count)]

Apriori Hash Tree Alg
• 1-itemsets, min support = 2 (same transactions as above)
  – Hash tree nodes, item(count): 1(3) 4(1), 2(4) 5(1), 3(3) 6(3)

Apriori Hash Tree Alg
• 2-itemsets, min support = 2 (same transactions as above)
  – {2 3}(2), {2 6}(1), {1 2}(2), {3 6}(2), {1 3}(2), {1 6}(1)
• 3-itemsets, min support = 2
  – {1 2 3}(1)

FP-tree
• The dataset to be mined is usually very large, so it is impossible to read all transactions into memory at once.
• But I/O scans are very time consuming.
• The FP-tree algorithm tries to fit all the information from the dataset into memory, and hence needs only two I/O scans.
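The level-wise counts on the Apriori Hash Tree Alg slides above can be reproduced with a short sketch. It is a minimal illustration, not the project's implementation: a plain Python dict stands in for the hash tree, candidates are produced by enumerating k-combinations and pruning them with the Apriori principle, and the names TRANSACTIONS, count_support, generate_candidates, and apriori are hypothetical.

```python
from itertools import combinations

# The six example transactions from the slides (Tid -> items).
TRANSACTIONS = {
    1: {1, 2},
    2: {1, 3, 6},
    3: {1, 2, 3},
    4: {2, 4},
    5: {2, 3, 6},
    6: {5, 6},
}


def count_support(candidates, transactions):
    """Count, for each candidate itemset, the transactions that contain it.

    Assumption: a dict replaces the hash tree, so every candidate is tested
    against every transaction.
    """
    counts = {c: 0 for c in candidates}
    for items in transactions.values():
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    return counts


def generate_candidates(frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    items = sorted({i for itemset in frequent for i in itemset})
    candidates = []
    for c in combinations(items, k):
        # Apriori principle: keep c only if every (k-1)-subset is frequent.
        if all(s in frequent for s in combinations(c, k - 1)):
            candidates.append(c)
    return candidates


def apriori(transactions, min_support):
    """Return {k: {itemset: count}} for every candidate counted at level k."""
    all_items = sorted({i for items in transactions.values() for i in items})
    candidates = [(i,) for i in all_items]
    levels = {}
    k = 1
    while candidates:
        counts = count_support(candidates, transactions)
        levels[k] = counts
        frequent = {c for c, n in counts.items() if n >= min_support}
        k += 1
        candidates = generate_candidates(frequent, k)
    return levels


levels = apriori(TRANSACTIONS, min_support=2)
for k, counts in levels.items():
    print(k, counts)
```

With min support 2, levels[2] holds the six 2-itemset counts shown on the slide, and levels[3] holds only the candidate (1, 2, 3) with count 1, so the example has no frequent 3-itemset.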
FP-tree
• FP-tree algorithm and implementation
  – By Xiaobo Chen

FP-tree (Frequent Pattern Tree)
• Mining frequent patterns without candidate generation
• Divide-and-conquer methodology: decompose mining tasks into smaller ones

FP-tree (Merits of FP-tree algorithm)
• Makes the most use of common shared prefixes
• Complete and compact
  – All information of a transaction is stored in a path
  – The size is constrained by the data set; consequently, the longest path corresponds to the longest pattern
  – Compression ratio: over 100

FP-tree (Construction of FP-tree)
• min_support = 3

  TID  freq. items bought
  100  {f, c, a, m, p}
  200  {f, c, a, b, m}
  300  {f, b}
  400  {c, b, p}
  500  {f, c, a, m, p}

  Item  frequency
  f     4
  c     4
  a     3
  b     3
  m     3
  p     3

• Tree after inserting TID 100:

  root
    f:1
      c:1
        a:1
          m:1
            p:1

FP-tree (Construction (Cont'd))
• Tree after inserting TID 200 (same transactions as above):

  root
    f:2
      c:2
        a:2
          m:1
            p:1
          b:1
            m:1

FP-tree (Construction (Cont'd))
• min_support = 3
• Header Table: item, frequency, and a head pointer to the tree nodes for that item (same items and frequencies as above)
• Tree after inserting all five transactions:

  root
    f:4
      c:3
        a:3
          m:2
            p:2
          b:1
            m:1
      b:1
    c:1
      b:1
        p:1

FP-tree (Mining Frequent Patterns Using the FP-tree)
• General idea (divide-and-conquer)
  – Recursively grow frequent pattern paths using the FP-tree
• Method
  – For each item, construct its conditional pattern base, and then its conditional FP-tree
  – Repeat the process on each newly created conditional FP-tree
  – Stop when the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

FP-tree (Mining Frequent Patterns Using the FP-tree)
• Start with the last item in order (i.e., p).
• Follow node pointers and traverse only the paths containing p.
• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base
• Paths containing p: f:4 -> c:3 -> a:3 -> m:2 -> p:2 and c:1 -> b:1 -> p:1
• Conditional pattern base for p: fcam:2, cb:1
• Constructing a new FP-tree based on this pattern base leads to only one branch, c:3
• Thus we derive only one frequent pattern containing p (besides p itself): cp

FP-tree (Mining Frequent Patterns Using the FP-tree)
• Move to the next least frequent item in order, i.e., m
• Follow node pointers and traverse only the paths containing m.
• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base
• Paths containing m: f:4 -> c:3 -> a:3 -> m:2 and f:4 -> c:3 -> a:3 -> b:1 -> m:1
• Conditional pattern base for m: fca:2, fcab:1
• Constructing a new FP-tree based on this pattern base leads to the path fca:3
• From this we derive the frequent patterns fcam, fcm, fam, cam, fm, cm, am

FP-tree (Conditional Pattern-Bases for the example)

  Item  Conditional pattern-base     Conditional FP-tree
  p     {(fcam:2), (cb:1)}           {(c:3)}|p
  m     {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
  b     {(fca:1), (f:1), (c:1)}      Empty
  a     {(fc:3)}                     {(f:3, c:3)}|a
  c     {(f:3)}                      {(f:3)}|c
  f     Empty                        Empty

FP-tree (Why is Frequent Pattern Growth fast?)
• Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning:
  – No candidate generation, no candidate test
  – Uses a compact data structure
  – Eliminates repeated database scans
  – Basic operations are counting and FP-tree building

FP-tree: Expected result
• FP-growth vs. Apriori: scalability with the support threshold
• [Figure: run time (sec.) against support threshold (%), with series "D1 FP-growth runtime" and "D1 Apriori runtime"]

Conclusion
• FP-tree is faster than the other two algorithms.
• Apriori and the hash tree algorithm are easier to implement.
  – We can easily combine them with other methods or tools (e.g., distributed parallel computing).
• The parameters of the dataset are very important too.
  – Density, size, min support, …

References
• [1] Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann, 2001.
• [2] Jiawei Han, Jian Pei, and Yiwen Yin. "Mining Frequent Patterns without Candidate Generation". ACM SIGMOD, 2000.
• [3] N. Mamoulis. Advanced Database Technologies (slides).
• [4] Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2001.
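Supplementary sketch for the FP-tree construction and conditional pattern-base slides. It is a minimal illustration under stated assumptions, not the project's implementation: the names Node, build_fp_tree, conditional_pattern_base, and TIE_ORDER are hypothetical, ties between equally frequent items are broken with the order of the slides' header table (f, c, a, b, m, p), and node links are kept as Python lists instead of chained head pointers.

```python
from collections import Counter

# Example transactions from the construction slides (TID -> items bought).
# Item order inside a transaction does not matter; the second scan re-sorts.
TRANSACTIONS = {
    100: ["f", "c", "a", "m", "p"],
    200: ["f", "c", "a", "b", "m"],
    300: ["f", "b"],
    400: ["c", "b", "p"],
    500: ["f", "c", "a", "m", "p"],
}

# Assumption: tie-break order taken from the slides' header table.
TIE_ORDER = ["f", "c", "a", "b", "m", "p"]


class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support, tie_order):
    """Build the tree in two scans; return (root, header table)."""
    # Scan 1: count item frequencies and rank the frequent items.
    counts = Counter(i for items in transactions.values() for i in items)
    frequent = [i for i in counts if counts[i] >= min_support]
    frequent.sort(key=lambda i: (-counts[i], tie_order.index(i)))
    rank = {item: r for r, item in enumerate(frequent)}

    # Scan 2: insert each transaction's frequent items in rank order,
    # sharing existing prefixes and incrementing their counts.
    root = Node(None, None)
    header = {item: [] for item in frequent}  # item -> nodes holding it
    for items in transactions.values():
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(item, header):
    """Collect (prefix path, count) for every tree node that holds `item`."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:  # nodes directly under the root have no prefix path
            base.append(("".join(reversed(path)), node.count))
    return base


root, header = build_fp_tree(TRANSACTIONS, min_support=3, tie_order=TIE_ORDER)
for item in reversed(list(header)):
    print(item, conditional_pattern_base(item, header))
```

On the slide example this yields the pattern bases listed in the conditional pattern-base table, for instance fcam:2 and cb:1 for p. Building the conditional FP-trees and recursing on them is left out to keep the sketch short.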