class presentation

CSCI6405 class project
Implementation and comparison of
three AR mining algorithms
Xuehai Wang, Xiaobo Chen, Shen Chen
AR mining
Outline
• Motivation
• Dataset
• Apriori based hash tree algorithm
• FP-tree algorithm
• Conclusion
• Reference
Motivation
• Make the time needed to generate rules as short as
possible!
• To understand the three algorithms
– Apriori algorithm
– Apriori with hash tree algorithm
– FP-tree algorithm
• Learn how to improve an algorithm
Dataset
• IBM dataset generator
– Can set item number
– Can set minimal support
– Can set dataset size
Sample generated transactions (columns: Tid, items):
12589
2 3 4 6 7 12
Apriori principle
• Apriori principle
– A candidate generation-and-test approach [4]
– Every subset of a frequent itemset must also be
frequent
– If an itemset is infrequent, its supersets are not
generated or tested
• But some steps can still be
improved
– Counting the support
– Number of I/O scans
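A minimal sketch of the generation-and-test idea, assuming itemsets are stored as Python frozensets. The function name apriori_gen and the sample input are illustrative and do not come from the project code. It joins frequent (k-1)-itemsets and drops any candidate that has an infrequent subset.

```python
from itertools import combinations

def apriori_gen(frequent_prev, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue  # the join must add exactly one new item
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Illustrative input: four frequent 2-itemsets.
frequent_2 = [frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 6)]]
print(sorted(sorted(c) for c in apriori_gen(frequent_2, 3)))  # [[1, 2, 3]]
```

Here {1, 3, 6} and {2, 3, 6} are never counted against the data, because {1, 6} and {2, 6} are not frequent.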
Apriori Hash Tree Alg
• The candidate k-itemset set has size l
• There are n transactions
• Average transaction size is m
• Cost of calculating the support counts:
– Original Apriori algorithm: O(n · l · C(m, k))
– With hash tree: O(n · log(l) · C(m, k))
(C(m, k) is the number of k-item subsets of a transaction of size m)
Apriori Hash Tree Alg
• Candidate is stored in a hash tree structure
Tid  Items
1    1 2
2    1 3 6
3    1 2 3
4    2 4
5    2 3 6
6    5 6
1-itemset candidate hash tree
[Figure: hash tree with partial support counts, written item(count): 1(2), 2(1), 1(1), 1(2), 2(1), 3(1), 3(1)]
Apriori Hash Tree Alg
Tid  Items
1    1 2
2    1 3 6
3    1 2 3
4    2 4
5    2 3 6
6    5 6
1-itemset, min support = 2
Support counts, written item(count): 1(3), 4(1), 2(4), 5(1), 3(3), 6(3)
Apriori Hash Tree Alg
Tid  Items
1    1 2
2    1 3 6
3    1 2 3
4    2 4
5    2 3 6
6    5 6
2-itemset, min support = 2
2 3(2)
2 6(1)
1 2(2)
3 6(2)
1 3(2)
1 6(1)
3-itemset, min support = 2
1 2 3(1)
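The support counts in this example can be reproduced with a simplified hash tree. This is a sketch under stated assumptions, not the project's implementation: the hash function is item % 3, the tree has a fixed depth of k, and leaves are never split. The class name HashTree and the helper support_counts are illustrative.

```python
from itertools import combinations

BRANCH = 3  # assumed hash function: item % BRANCH

class HashTree:
    """Fixed-depth hash tree holding candidate k-itemsets in leaf buckets."""

    def __init__(self, candidates, k):
        self.k = k
        self.root = {}
        self.counts = {tuple(sorted(c)): 0 for c in candidates}
        for cand in self.counts:
            node = self.root
            for item in cand[:-1]:                    # interior levels
                node = node.setdefault(item % BRANCH, {})
            node.setdefault(cand[-1] % BRANCH, set()).add(cand)  # leaf bucket

    def count(self, transaction):
        self._visit(self.root, sorted(transaction), 0, ())

    def _visit(self, node, items, start, prefix):
        for i in range(start, len(items)):
            child = node.get(items[i] % BRANCH)
            if child is None:
                continue  # no candidate hashes this way, so skip the branch
            chosen = prefix + (items[i],)
            if len(chosen) == self.k:
                if chosen in child:                   # child is a leaf bucket
                    self.counts[chosen] += 1
            else:
                self._visit(child, items, i + 1, chosen)

transactions = [{1, 2}, {1, 3, 6}, {1, 2, 3}, {2, 4}, {2, 3, 6}, {5, 6}]
MIN_SUPPORT = 2

def support_counts(candidates, k):
    tree = HashTree(candidates, k)
    for t in transactions:
        tree.count(t)
    return tree.counts

c1 = support_counts([(i,) for i in range(1, 7)], 1)
f1 = sorted(c[0] for c, n in c1.items() if n >= MIN_SUPPORT)
c2 = support_counts(list(combinations(f1, 2)), 2)
# {1, 2, 3} is the only 3-itemset whose 2-item subsets are all frequent.
c3 = support_counts([(1, 2, 3)], 3)
print(c1, c2, c3, sep="\n")
```

Each transaction only descends into branches that hold at least one candidate, which is where the saving over comparing against every candidate comes from.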
FP-tree
• Since the dataset being mined is usually very
large, it is often impossible to read all transactions
into memory at once.
• But I/O scans are very time consuming.
• The FP-tree algorithm tries to fit all the
information it needs from the dataset into
memory, and hence needs to scan the dataset only
twice.
FP-tree
• FP-tree algorithm and implementation
– By Xiaobo Chen
FP-tree (Frequent Pattern Tree)
• Mining frequent patterns without candidate
generation
• Divide and conquer methodology:
decompose mining tasks into smaller ones
FP-tree (Merits of FP-tree algorithm)
• Makes the most of shared common prefixes
• Complete and compact
– All information of a transaction is
stored in a path
– The size is constrained by the data set;
consequently, the longest path corresponds to the longest
pattern
– Compression ratio: over 100
FP-tree (Construction of FP-tree)
min_support = 3
TID  freq. items bought
100  {f, c, a, m, p}
200  {f, c, a, b, m}
300  {f, b}
400  {c, b, p}
500  {f, c, a, m, p}
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
Tree after inserting TID 100:
root
  f:1
    c:1
      a:1
        m:1
          p:1
FP-tree (construction (Cont’d))
TID  freq. items bought
100  {f, c, a, m, p}
200  {f, c, a, b, m}
300  {f, b}
400  {c, b, p}
500  {f, c, a, m, p}
Tree after inserting TID 200:
root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
FP-tree construction (Cont’d)
min_support = 3
TID  freq. items bought
100  {f, c, a, m, p}
200  {f, c, a, b, m}
300  {f, b}
400  {c, b, p}
500  {f, c, a, m, p}
Header Table
Item  frequency  head
f     4
c     4
a     3
b     3
m     3
p     3
Final tree:
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
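The two-scan construction can be sketched as follows. This is an illustrative version, not the project's implementation: the names Node, build_fp_tree and dump are hypothetical, transactions are given as strings of one-letter items, and ties in frequency are broken using the order shown in the header table (f, c, a, b, m, p), which is an assumption taken from the slide.

```python
from collections import Counter

TIE_BREAK = "fcabmp"  # assumed tie order, copied from the header table

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: count item frequencies and fix the item order.
    freq = Counter(item for t in transactions for item in t)
    order = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: (-freq[i], TIE_BREAK.index(i)))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [] for item in order}   # item -> list of its tree nodes
    # Second scan: insert each transaction along a shared prefix path.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines

transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(transactions, min_support=3)
print("\n".join(dump(root)))
```

The header dictionary plays the role of the header table: each item maps to the list of nodes that carry it, which is what the mining step follows later.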
FP-tree (Mining Frequent Patterns Using the FP-tree)
Item frequency: f 4, c 4, a 3, b 3, m 3, p 3
• General idea (divide-and-conquer)
– Recursively grow frequent pattern paths using the FP-tree
• Method
– For each item, construct its conditional pattern-base,
and then its conditional FP-tree
– Repeat the process on each newly created conditional
FP-tree
– Until the resulting FP-tree is empty, or it contains only
one path (a single path generates all the combinations of its
sub-paths, each of which is a frequent pattern)
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Start with the last item in the order (i.e., p).
• Follow node pointers and traverse only the paths containing p.
• Accumulate all transformed prefix paths of that item to form a conditional
pattern base.
Paths containing p:
root → f:4 → c:3 → a:3 → m:2 → p:2
root → c:1 → b:1 → p:1
Conditional pattern base for p: fcam:2, cb:1
Constructing a new FP-tree based on this pattern base leads to
only one branch, c:3.
Thus we derive only one frequent pattern containing p besides p itself: cp.
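A conditional pattern base can also be computed directly from the ordered transactions, without following node pointers: it is the multiset of prefixes that precede the item. The sketch below takes that shortcut, so it is not a tree traversal, but it gives the same result on this example. The function names and the string representation of transactions are assumptions made for illustration.

```python
from collections import Counter

ORDER = "fcabmp"  # item order from the header table
transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item):
    """Count the prefixes that precede `item` in the ordered transactions."""
    base = Counter()
    for t in transactions:
        ordered = sorted(t, key=ORDER.index)
        if item in ordered:
            prefix = "".join(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

def conditional_frequent_items(base, min_support=3):
    """Items that stay frequent inside a conditional pattern base."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return {i: n for i, n in counts.items() if n >= min_support}

base_p = conditional_pattern_base("p")
print(base_p)                               # {'fcam': 2, 'cb': 1}
print(conditional_frequent_items(base_p))   # {'c': 3}
```

Only c reaches the minimum support of 3 inside the base for p, which is why the conditional FP-tree for p has the single branch c:3.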
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Move to the next least frequent item in the order, i.e., m.
• Follow node pointers and traverse only the paths containing m.
• Accumulate all transformed prefix paths of that item to form a conditional
pattern base.
Paths containing m:
root → f:4 → c:3 → a:3 → m:2
root → f:4 → c:3 → a:3 → b:1 → m:1
Conditional pattern base for m: fca:2, fcab:1
Constructing a new FP-tree
based on this pattern base
leads to the path fca:3.
From this we derive the frequent
patterns fcam, fcm, fam, cam, fm,
cm, am
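When a conditional FP-tree is a single path, the patterns are all non-empty combinations of the path's items joined with the suffix item. A short sketch for the path fca and suffix m, with a hypothetical helper name:

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """All non-empty combinations of the path's items, each extended by the suffix."""
    return ["".join(c) + suffix
            for r in range(1, len(path) + 1)
            for c in combinations(path, r)]

print(patterns_from_single_path("fca", "m"))
# ['fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```

A path of three items gives 2^3 - 1 = 7 patterns. Every node on this path has count 3, so each pattern has support 3.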
FP-tree (Conditional Pattern-Bases for the example)
Item  Conditional pattern-base      Conditional FP-tree
p     {(fcam:2), (cb:1)}            {(c:3)}|p
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}       Empty
a     {(fc:3)}                      {(f:3, c:3)}|a
c     {(f:3)}                       {(f:3)}|c
f     Empty                         Empty
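The whole mining procedure can be sketched as a short recursive function. This is an illustrative version under stated assumptions, not the project's implementation: the names Node, build and fp_growth are hypothetical, the item order f, c, a, b, m, p is reused inside every conditional tree, and the recursion runs until the conditional tree is empty instead of stopping early at a single path.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build(weighted_paths, min_support, order):
    """Build an FP-tree from (items, count) pairs; return its header table."""
    freq = defaultdict(int)
    for items, n in weighted_paths:
        for i in items:
            freq[i] += n
    keep = {i for i in freq if freq[i] >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted_paths:
        node = root
        for i in sorted((x for x in items if x in keep), key=order.index):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += n
    return header

def fp_growth(weighted_paths, min_support, order, suffix=""):
    """Return {pattern: support} for every frequent pattern ending in suffix."""
    header = build(weighted_paths, min_support, order)
    patterns = {}
    for item in sorted(header, key=order.index, reverse=True):  # last item first
        pattern = item + suffix
        patterns[pattern] = sum(node.count for node in header[item])
        base = []  # conditional pattern base: prefix paths of this item
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
        patterns.update(fp_growth(base, min_support, order, pattern))
    return patterns

transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
result = fp_growth([(t, 1) for t in transactions], 3, "fcabmp")
for pattern, support in result.items():
    print(pattern, support)
```

On this example it finds cp as the only multi-item pattern ending in p, and no multi-item pattern ending in b, in line with the table.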
FP-tree (Why is frequent pattern growth fast?)
• Performance studies show that
FP-growth is an order of magnitude faster than
Apriori, and is also faster than tree-projection
• Reasoning:
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
FP-tree: Expected result: FP-growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time (sec., axis 0 to 100) against support threshold (%, axis 0 to 3) for two series, D1 FP-growth runtime and D1 Apriori runtime]
Conclusion
• FP-tree is faster than the other two algorithms.
• Apriori and the hash tree algorithm are
easier to implement.
– We can easily combine them with other
methods or tools (e.g. distributed parallel
computing).
• The parameters of the dataset are also very
important.
– Density, size, min support …
References
• [1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001
• [2] Jiawei Han, Jian Pei, Yiwen Yin: "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD, 2000
• [3] N. Mamoulis: "Advanced Database Technologies" (slides)
• [4] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001