Mining Frequent Patterns Using
FP-Growth Method
Ivan Tanasić (itanasic@gmail.com)
Department of Computer Engineering and Computer Science,
School of Electrical Engineering,
University of Belgrade

Mining Frequent Patterns
without Candidate Generation:
A Frequent-Pattern Tree Approach
◦ Jiawei Han (UIUC)
◦ Jian Pei (Buffalo)
◦ Yiwen Yin (SFU)
◦ Runying Mao (Microsoft)
Ivan Tanasic (itanasic@gmail.com)
Problem Definition

Mining frequent patterns from a DB
◦ Frequent itemsets (milk + bread)
◦ Frequent sequential patterns (computer -> printer -> paper)
◦ Frequent structural patterns (subgraphs, subtrees)
Problem Importance 1/2
• Basic data mining (DM) primitive
• Used for mining data relationships
◦ Associations
◦ Correlations
• Helps with basic DM tasks
◦ Classification
◦ Clustering
Problem Importance 2/2

Association rules
◦ buys(“laptop”)=>buys(“mouse”)
[support = 2%, confidence = 30%]
• Support = % of all transactions that contain all items of the rule (I1 and I2)
• Confidence = % of transactions containing I1 that also contain I2
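
A minimal sketch of how these two measures can be computed for a rule I1 => I2. The ten-transaction basket list and the helper names `count`, `support`, and `confidence` are made up for illustration and do not come from the slides, so the numbers differ from the 2% / 30% example above.

```python
# Hypothetical toy database: each transaction is a set of purchased items.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop"},
    {"laptop", "bag"},
    {"mouse"},
    {"bag"},
    {"printer", "paper"},
    {"paper"},
    {"printer"},
    {"bag", "paper"},
]

def count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in db if itemset <= t)

def support(antecedent, consequent, db):
    """Share of all transactions that contain both sides of the rule."""
    return count(set(antecedent) | set(consequent), db) / len(db)

def confidence(antecedent, consequent, db):
    """Share of transactions with the antecedent that also contain the consequent."""
    return count(set(antecedent) | set(consequent), db) / count(antecedent, db)

s = support({"laptop"}, {"mouse"}, transactions)
c = confidence({"laptop"}, {"mouse"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 20%, confidence = 50%
```

Here 2 of the 10 baskets contain both items, and 2 of the 4 laptop baskets also contain a mouse.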
Problem Trend
• Speeding up Apriori with additional techniques
• New data structures (trees)
• Association rule specific algorithms
• Specific AR algorithms (OneR, ZeroR)
• FP-Growth still widely used

Existing Solutions 1/3 (Apriori)
• Agrawal et al. (1994)
• Apriori property (AP): all nonempty subsets of a frequent itemset must also be frequent
• Starts from 1-itemsets
• Join + prune (using AP + min. support)
• Generates a huge number of candidates
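
The level-wise join + prune loop can be sketched as follows. This is a simplified illustration, not the original authors' implementation: the function name `apriori` is hypothetical, the join is a naive pairwise union, and the nine transactions are the ones used in the FP-Tree construction example later in this deck.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def apriori(db, min_support):
    """Return {frozenset: support count} for every frequent itemset."""
    db = [frozenset(t) for t in db]
    # Level 1: count single items with one database scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One more database scan to count the surviving candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

frequent = apriori(transactions, min_support=2)
print(len(frequent))  # 13 frequent itemsets
```

Each pass over `candidates` is a full database scan, which is where the cost noted on this slide comes from.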

Existing Solutions 2/3 (ECLAT)
• Zaki (2000)
• Equivalence CLass Transformation
• Vertical format: {item, TID_set} instead of {TID, itemset}
• Intersects TID_sets of candidates
• TID_sets hold support info (no extra DB scans)
• Still generates candidates
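
A rough sketch of the vertical-format idea, assuming a plain depth-first search without the equivalence-class optimizations of the published algorithm. The names `to_vertical` and `eclat` are hypothetical, and the transactions are the nine from the FP-Tree construction example later in this deck.

```python
from collections import defaultdict

transactions = {
    "T1": {"I1", "I2", "I5"}, "T2": {"I2", "I4"}, "T3": {"I2", "I3"},
    "T4": {"I1", "I2", "I4"}, "T5": {"I1", "I3"}, "T6": {"I2", "I3"},
    "T7": {"I1", "I3"}, "T8": {"I1", "I2", "I3", "I5"}, "T9": {"I1", "I2", "I3"},
}

def to_vertical(db):
    """Turn {TID: itemset} into {item: TID_set}."""
    tidsets = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

def eclat(prefix, items, min_support, out):
    """Depth-first search over a list of (item, TID_set) pairs."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_support:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)  # support is just the size of the TID set
        # Candidates: extend with each later item by intersecting TID sets.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_support, out)

vertical = to_vertical(transactions)
frequent = {}
eclat(frozenset(), sorted(vertical.items()), 2, frequent)
print(len(frequent))  # 13 frequent itemsets
```

After the single pass in `to_vertical`, the database is never read again; every support count comes from a set intersection.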

Existing Solutions 3/3 (TreeProjection)
• Agarwal et al. (2001)
• Creates a lexicographical tree and projects the DB into sub-DBs based on the patterns mined so far
• Recursively mines the sub-databases
• Less scalable than FP-Growth

FP-Tree construction 1/6
• Items in each transaction are sorted by descending support (desc. supp. sort)
• Min support = 2
FP-Tree construction 2/6
Desc. supp. sort
T1 = {I2, I1, I5}
FP-Tree construction 3/6
Desc. supp. sort
T1 = {I2, I1, I5}
T2 = {I2, I4}
FP-Tree construction 4/6
Desc. supp. sort
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
FP-Tree construction 5/6
Desc. supp. sort
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
T4 = {I2, I1, I4}
FP-Tree construction 6/6
Desc. supp. sort
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
T4 = {I2, I1, I4}
T5 = {I1, I3}
T6 = {I2, I3}
T7 = {I1, I3}
T8 = {I2, I1, I3, I5}
T9 = {I2, I1, I3}
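
The two-scan construction walked through on these slides can be sketched as below. This is an illustrative version under stated assumptions: the class `FPNode` and function `build_fp_tree` are hypothetical names, and ties in support (I1 and I3 both occur 6 times) are broken by item name, which happens to reproduce the item order shown in T1 to T9.

```python
from collections import Counter

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fp_tree(db, min_support):
    # Scan 1: support count of every item; keep only the frequent ones.
    counts = Counter(item for t in db for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    def rank(item):
        # Descending support, ties broken by name (an assumption of this sketch).
        return (-frequent[item], item)

    root = FPNode(None, None)
    header = {item: [] for item in sorted(frequent, key=rank)}
    # Scan 2: insert each sorted transaction as a path, sharing common prefixes.
    for t in db:
        node = root
        for item in sorted((i for i in set(t) if i in frequent), key=rank):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)  # node-links for this item
            child.count += 1
            node = child
    return root, header

root, header = build_fp_tree(transactions, min_support=2)
nodes = sum(len(links) for links in header.values())
occurrences = sum(len(t) for t in transactions)
print(list(header), nodes, occurrences)
```

For these nine transactions the tree holds 10 nodes for 23 item occurrences, because shared prefixes such as I2, I1 are stored once with a count.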
Mining of the FP-Tree 1/4
Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated
I5 | {{I2,I1:1},{I2,I1,I3:1}} | {I2:2, I1:2} | {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
Mining of the FP-Tree 2/4
Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated
I5 | {{I2,I1:1},{I2,I1,I3:1}} | {I2:2, I1:2} | {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
I4 | {{I2,I1:1},{I2:1}} | {I2:2} | {I2,I4:2}
Mining of the FP-Tree 3/4
Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated
I5 | {{I2,I1:1},{I2,I1,I3:1}} | {I2:2, I1:2} | {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
I4 | {{I2,I1:1},{I2:1}} | {I2:2} | {I2,I4:2}
I3 | {{I2,I1:2},{I2:2},{I1:2}} | {I2:4, I1:2}, {I1:2} | {I2,I3:4}, {I1,I3:4}, {I2,I1,I3:2}
Mining of the FP-Tree 4/4
Item | Conditional pattern base | Conditional FP-tree | Frequent patterns generated
I5 | {{I2,I1:1},{I2,I1,I3:1}} | {I2:2, I1:2} | {I2,I5:2}, {I1,I5:2}, {I2,I1,I5:2}
I4 | {{I2,I1:1},{I2:1}} | {I2:2} | {I2,I4:2}
I3 | {{I2,I1:2},{I2:2},{I1:2}} | {I2:4, I1:2}, {I1:2} | {I2,I3:4}, {I1,I3:4}, {I2,I1,I3:2}
I1 | {{I2:4}} | {I2:4} | {I2,I1:4}
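
The mining steps in the table can be sketched as a short recursive procedure: for each item, collect its prefix paths (the conditional pattern base), build a conditional tree from them, and recurse. This is a compact illustration, not the paper's exact procedure. The names `FPNode`, `build_tree`, and `fp_growth` are hypothetical, the single-path shortcut is omitted, and each conditional tree is re-sorted by its own counts, so its shape may differ from the table even though the resulting patterns match.

```python
from collections import Counter

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_db, min_support):
    """Build an FP-tree from (items, count) pairs; return (header, supports)."""
    counts = Counter()
    for items, n in weighted_db:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    def rank(item):
        return (-frequent[item], item)  # descending support, ties by name

    root = FPNode(None, None)
    header = {item: [] for item in frequent}
    for items, n in weighted_db:
        node = root
        for item in sorted((i for i in set(items) if i in frequent), key=rank):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += n
    return header, frequent

def fp_growth(weighted_db, min_support, suffix=frozenset(), out=None):
    """Return {frozenset: support count} without generating candidates."""
    out = {} if out is None else out
    header, frequent = build_tree(weighted_db, min_support)
    for item, nodes in header.items():
        pattern = suffix | {item}
        out[pattern] = frequent[item]
        # Conditional pattern base: the prefix path of every node of `item`,
        # weighted by that node's count.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional database for this pattern.
        fp_growth(base, min_support, pattern, out)
    return out

patterns = fp_growth([(t, 1) for t in transactions], min_support=2)
for itemset, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With min support 2 this yields the five frequent single items plus the eight multi-item patterns listed in the table.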
How much better is it? 1/3

Runtime on sparse data:
How much better is it? 2/3

Runtime on mixed data:
How much better is it? 3/3

Compactness:
Is it Original?

A lot of methods try to improve Apriori
◦ Hashing
◦ Transaction reduction
◦ Partitioning
◦ Sampling
TreeProjection uses a similar structure,
but it is still a different method
Importance over time
• Basic primitive (a strong foundation for a tall building)
• Performance becomes very important as databases grow huge
• So does scalability
• FP-Growth offers both performance and scalability

Conclusion
• An important method for solving important DM tasks
• Fast
• Compact
• Scalable (DB projection / tree on disk)
