Mining Frequent Item Sets by Opportunistic Projection

Junqiang Liu (1,4), Yunhe Pan (1), Ke Wang (2), Jiawei Han (3)
1 Institute of Artificial Intelligence, Zhejiang University, China
2 School of Computing Science, Simon Fraser University, Canada
3 Department of Computer Science, UIUC, USA
4 Dept. of CS, Hangzhou University of Commerce, China
Outline

How to discover frequent item sets

Previous works

Our approach: Mining Frequent Item
Sets by Opportunistic Projection

Performance evaluations

Conclusions
2
What Are Frequent Item Sets

What is a frequent item set?

set of items, X, that occurs together frequently in a
database, i.e., support(X) ≥ a given threshold

Example
tid   items
01    a c d f g i m p
02    a b c f l m o
03    b f h j o
04    b c k p s
05    a c e f l m n p
Given support threshold 3, the frequent item sets are as follows:
a:3, b:3, c:4, f:4, m:3, p:3,
ac:3, af:3, am:3, cf:3, cm:3, cp:3, fm:3,
acf:3, acm:3, afm:3, cfm:3,
acfm:3
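The list above can be checked with a brute-force count. The following is only a minimal sketch for verifying the example, not the mining method of this talk; the names `DB` and `frequent_item_sets` are illustrative assumptions, and each item is written as one letter.

```python
from collections import Counter
from itertools import combinations

# Example database from the slide: tid -> items (one letter per item).
DB = {
    "01": "acdfgimp",
    "02": "abcflmo",
    "03": "bfhjo",
    "04": "bckps",
    "05": "aceflmnp",
}

def frequent_item_sets(db, min_support):
    """Brute force: count every subset of the frequent items of each transaction."""
    item_counts = Counter(i for items in db.values() for i in set(items))
    frequent = {i for i, c in item_counts.items() if c >= min_support}
    counts = Counter()
    for items in db.values():
        kept = sorted(set(items) & frequent)
        for k in range(1, len(kept) + 1):
            for subset in combinations(kept, k):
                counts["".join(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

result = frequent_item_sets(DB, 3)
for s in sorted(result, key=lambda s: (len(s), s)):
    print(f"{s}:{result[s]}")
```

Enumerating all subsets is exponential in the transaction length, which is acceptable only for a toy database like this one.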
3
How To Discover Frequent Item Sets

Frequent item sets can be represented by a tree,
which is not necessarily materialized.
( , )
  (a,3)
    (c,3)
      (f,3)
        (m,3)
      (m,3)
    (f,3)
      (m,3)
    (m,3)
  (b,3)
  (c,4)
    (f,3)
      (m,3)
    (m,3)
    (p,3)
  (f,4)
    (m,3)
  (m,3)
  (p,3)

Mining process:
 a process of tree construction, accompanied by
 a process of projecting transaction subsets
4
Frequent Item Set Tree - FIST

FIST is an ordered tree
  each node: (item, weight)
  the following orders are imposed
    items ordered on a path (top-down)
    items ordered at children (left to right)
Frequent item set
  a path starting from the FIST root
  its support is the ending node's weight
PTS - projected transaction subset
  Each FIST node has its own PTS, filtered or unfiltered
  All transactions that support the frequent item set
  represented by the node
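The definition above, where a frequent item set is a path from the root and its support is the weight of the ending node, can be sketched as a small data structure. This is an illustrative sketch only: `FistNode` and `item_sets` are assumed names, and the fragment below is built by hand rather than mined.

```python
from dataclasses import dataclass, field

@dataclass
class FistNode:
    item: str                     # "" for the null root
    weight: int                   # support of the item set ending at this node
    children: list = field(default_factory=list)   # kept in item order

def item_sets(node, prefix=""):
    """Yield (item set, support) for every path that starts at the root."""
    for child in node.children:
        path = prefix + child.item
        yield path, child.weight
        yield from item_sets(child, path)

# Hand-built fragment of the example FIST: the subtree under item a.
root = FistNode("", 0, [
    FistNode("a", 3, [
        FistNode("c", 3, [FistNode("f", 3, [FistNode("m", 3)]),
                          FistNode("m", 3)]),
        FistNode("f", 3, [FistNode("m", 3)]),
        FistNode("m", 3),
    ]),
])

for path, support in item_sets(root):
    print(f"{path}:{support}")
```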
5
Frequent Item Set Tree (example)
[Figure: the FIST of the example database. (i,w): a FIST node; the list drawn beside a node is the PTS of that node. The PTS of the root ( , ) is the whole database, transactions 01 to 05.]
PTSs of the first-level nodes as shown in the figure:
(a,3): 01 c f m p; 02 b c f m; 05 c f m p
(b,3): 02 c f m; 03 f; 04 c p
(c,4): 01 f m p; 02 f m; 04 p; 05 f m p
(f,4): 01 m p; 02 m; 05 m p
(m,3): 01 p; 05 p
6
Factors Related to
Mining Efficiency and Scalability

The FIST construction strategy
  breadth first vs. depth first
The PTS representation
  Memory-based representation: array-based, tree-based,
  vertical bitmap, horizontal bitstring, etc.
  Disk-based representation
PTS projecting method and item counting method
7
Previous Works
Research                 | Strategy      | PTS Representation   | Projecting Method                                | Remarks
-------------------------+---------------+----------------------+--------------------------------------------------+--------
Apriori, Tree Projection | breadth first | original DB          | on the fly                                       | Repetitive DB scans; huge FIST for dense; exp. pattern matching
FPGrowth                 | depth first   | FP-tree              | recursively materialize conditional DB / FP-tree | # of conditional FP-trees in same order of magnitude as # of frequent item sets; not most efficient for sparse
H-Mine                   | depth first   | H-struct             | partially materialize sub H-struct               | Call FP-Growth for dense; partition for large
DepthProject             | depth first   | horizontal bitstring | selective projection                             | Maximal frequent item sets; less efficient than array-based for sparse & large
MAFIA                    | depth first   | vertical bitmap      | recursively materialize compressions             | Maximal frequent item sets; less efficient than tree-based for dense
8
Our Approach: Mining Frequent Item Sets
by Opportunistic Projection

Philosophy:
The algorithm must adapt the construction strategy of FIST,
the representation of PTS, and the methods of item
counting in and projection of PTSs to the features of PTSs.

Main points:
  Mining sparse data by projecting array-based PTS
  Intelligently projecting tree-based PTS for dense data
  Heuristics for opportunistic projection
9
Mining sparse data by
projecting array-based PTS
TVLA - threaded varied length array for sparse PTS
  FIL - local frequent items list
  LQ - linked queues
  arrays
Each local frequent item has a FIL entry that consists of
an item, a count, & a pointer.
Each transaction is stored in an array that is threaded to
FIL by LQ according to its heading item in the imposed
order.

Figure: filtered TVLA of the original DB in the example
FIL (item count) with LQ:
  a 3   LQ: 01, 02, 05
  b 3   LQ: 03, 04
  c 4
  f 4
  m 3
  p 3
arrays:
  01: a c f m p
  02: a b c f m
  03: b f
  04: b c p
  05: a c f m p
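A filtered TVLA like the one in the figure can be sketched as follows. This is a minimal sketch under stated assumptions: the imposed order is taken to be alphabetical, a `deque` of tids stands in for each linked queue, and `build_tvla` is an illustrative name rather than one from the original work.

```python
from collections import Counter, deque

DB = {
    "01": "acdfgimp",
    "02": "abcflmo",
    "03": "bfhjo",
    "04": "bckps",
    "05": "aceflmnp",
}

def build_tvla(db, min_support):
    """Return (fil, arrays); fil maps item -> [count, queue of tids]."""
    counts = Counter(i for items in db.values() for i in set(items))
    order = sorted(i for i, c in counts.items() if c >= min_support)
    fil = {i: [counts[i], deque()] for i in order}   # FIL entries in item order
    arrays = {}
    for tid, items in db.items():
        # Filtered transaction: frequent items only, in the imposed order.
        arr = [i for i in sorted(set(items)) if i in fil]
        if arr:
            arrays[tid] = arr
            fil[arr[0]][1].append(tid)               # thread by heading item
    return fil, arrays

fil, arrays = build_tvla(DB, 3)
for item, (count, lq) in fil.items():
    print(item, count, list(lq))
```

Only the queues of a and b are non-empty at this point, because every filtered transaction in the example starts with a or b.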
10
How to project TVLA for PTS



Arrays (transactions)
that support a node’s
first child are threaded
by the LQ attached to
the first entry of FIL.
(see previous figure)
TVLA for a child node’s
PTS has its own FIL
and LQ.
A child TVLA is
unfiltered if it shares
arrays with its parent,
filtered otherwise.
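Figure: projecting the parent TVLA for the first child (a)
parent TVLA:
  FIL: a 3, b 3, c 4, f 4, m 3, p 3
  arrays: 01: a c f m p; 02: a b c f m; 03: b f; 04: b c p; 05: a c f m p
unfiltered child TVLA:
  FIL(a): c 3, f 3, m 3
  shares the parent's arrays 01, 02, 05
filtered child TVLA:
  FIL(a): c 3, f 3, m 3
  arrays: 01: c f m; 02: c f m; 05: c f m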
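The difference between the two kinds of child TVLA can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `project_child` is an assumed name, arrays are written as strings, and an unfiltered child is modeled as a (tid, offset) reference into the parent's array.

```python
from collections import Counter

# Filtered parent arrays from the example and the LQ of the first FIL entry (a).
ARRAYS = {"01": "acfmp", "02": "abcfm", "03": "bf", "04": "bcp", "05": "acfmp"}
LQ_A = ["01", "02", "05"]

def project_child(arrays, lq, min_support, filtered):
    """Build the child FIL and child transactions for the entry whose queue is lq."""
    # Local counts over the part of each array after its heading item.
    counts = Counter(i for tid in lq for i in arrays[tid][1:])
    fil = {i: c for i, c in sorted(counts.items()) if c >= min_support}
    if filtered:
        # Copy only the local frequent items into new arrays.
        child = {tid: "".join(i for i in arrays[tid][1:] if i in fil)
                 for tid in lq}
    else:
        # Share the parent's arrays: keep a reference and a start offset.
        child = {tid: (tid, 1) for tid in lq}
    return fil, child

fil_a, filtered_child = project_child(ARRAYS, LQ_A, 3, filtered=True)
_, unfiltered_child = project_child(ARRAYS, LQ_A, 3, filtered=False)
print(fil_a)
print(filtered_child)
print(unfiltered_child)
```

The filtered copy costs memory but drops b and p, which are not frequent in the PTS of a; the unfiltered child costs almost nothing but keeps them in the shared arrays.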
11
How to project TVLA for PTS (cont.)
Get the next child's PTS by shifting the transactions currently explored
(the current child's PTS), which are threaded in the LQ, to the LQs of their next items.
[Figure: snapshots of the TVLA in slide 10 as transactions are shifted. After child a is explored, transactions 01, 02, and 05 leave a's LQ: 02 joins 03 and 04 in b's LQ, and 01 and 05 move to c's LQ. An emptied LQ is NULL.]
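The shifting step can be sketched as below. It is a minimal sketch under assumptions: `shift_to_next` is an illustrative name, a `deque` of tids stands in for each linked queue, and `pos` records how far each transaction has been advanced.

```python
from collections import deque

# Filtered arrays of the example TVLA.
ARRAYS = {"01": "acfmp", "02": "abcfm", "03": "bf", "04": "bcp", "05": "acfmp"}

def shift_to_next(arrays, queues, pos, item):
    """Move every transaction threaded under item to the queue of its next item."""
    lq = queues[item]
    while lq:
        tid = lq.popleft()
        pos[tid] += 1                          # advance past the current item
        if pos[tid] < len(arrays[tid]):
            queues[arrays[tid][pos[tid]]].append(tid)
    # lq is now empty (NULL in the figure)

# Initial threading: each transaction sits in the queue of its heading item.
queues = {i: deque() for i in "abcfmp"}
pos = {tid: 0 for tid in ARRAYS}
for tid, arr in ARRAYS.items():
    queues[arr[0]].append(tid)

shift_to_next(ARRAYS, queues, pos, "a")
print({i: list(q) for i, q in queues.items()})
```

After the shift, the queue of b holds exactly the transactions that contain b, so the PTS of the next child is available without rescanning the arrays.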
12
Intelligently projecting tree-based
PTS for dense data

Tree-based representation of dense PTS,
inspired by FP-Growth
Novel projecting methods, totally different
from those of FP-Growth
  Bottom up pseudo projection
  Top down pseudo projection
13
Tree-based Representation of
dense PTS

TTF - threaded transaction forest
  IL - item list: each entry consists of an item, a count, and a pointer.
  Forest: each node is labeled by an item and associated with a weight.
Each local item in the PTS has an entry in the IL.
Each transaction in the PTS is one path starting from a root in the forest.
The weight of a node is the number of transactions represented by the path
from a root to that node.
All nodes of the same item are threaded by an IL entry.
A TTF is filtered if only local frequent items appear in it,
otherwise unfiltered.

Figure: filtered TTF of original DB in the example
IL (item count): a 3, b 3, c 4, f 4, m 3, p 3
Forest:
  a,3
    b,1
      c,1
        f,1
          m,1
    c,2
      f,2
        m,2
          p,2
  b,2
    c,1
      p,1
    f,1
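The filtered TTF of the example can be built with a short prefix-forest sketch. The names `Node` and `build_ttf` are illustrative assumptions, and a Python list of nodes stands in for the thread that hangs off each IL entry.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.weight = 0           # transactions sharing the path to this node
        self.children = {}

def build_ttf(transactions):
    """Return (roots, il); il maps item -> [count, nodes labeled with the item]."""
    roots, il = {}, {}
    for items in transactions:
        level, parent = roots, None
        for item in items:
            entry = il.setdefault(item, [0, []])
            entry[0] += 1
            node = level.get(item)
            if node is None:
                node = level[item] = Node(item, parent)
                entry[1].append(node)          # thread the new node to its IL entry
            node.weight += 1
            level, parent = node.children, node
    return roots, il

# Filtered transactions of the example, items in the imposed order.
FILTERED = ["acfmp", "abcfm", "bf", "bcp", "acfmp"]
roots, il = build_ttf(FILTERED)
print({item: entry[0] for item, entry in il.items()})
print(sum(len(entry[1]) for entry in il.values()), "forest nodes")
```

Transactions 01 and 05 share one path, which is where the compression of the tree-based representation comes from.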
14
Bottom up pseudo projection
of TTF (example)
[Figure: successive snapshots of the TTF of the example and its item list (a, b, c, f, m, p with their counts) during bottom up pseudo projection.]
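One plausible reading of bottom up pseudo projection is sketched below: follow the thread of an item and walk the parent links toward the roots, counting the items above it without building a new tree. This is an assumption about the mechanics, since the original figure is not recoverable here; `Node`, `build_ttf`, and `bottom_up_counts` are illustrative names.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.weight, self.children = item, parent, 0, {}

def build_ttf(transactions):
    """Build the forest; il maps item -> nodes labeled with it (the thread)."""
    roots, il = {}, {}
    for items in transactions:
        level, parent = roots, None
        for item in items:
            if item not in level:
                level[item] = Node(item, parent)
                il.setdefault(item, []).append(level[item])
            node = level[item]
            node.weight += 1
            level, parent = node.children, node
    return roots, il

def bottom_up_counts(il, item):
    """Count the items above every node of item, weighted by that node."""
    counts = Counter()
    for node in il[item]:                  # follow the thread of the item
        ancestor = node.parent
        while ancestor is not None:        # walk toward the root
            counts[ancestor.item] += node.weight
            ancestor = ancestor.parent
    return dict(counts)

roots, il = build_ttf(["acfmp", "abcfm", "bf", "bcp", "acfmp"])
print(bottom_up_counts(il, "p"))
```

For item p, only c reaches the threshold of 3, which agrees with cp:3 being the only frequent item set of length 2 that contains p.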
15
Top down pseudo projection
of TTF (example)
[Figure: successive snapshots of the TTF of the example and its item list (a, b, c, f, m, p with their counts) during top down pseudo projection.]
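One plausible reading of top down pseudo projection is sketched below: follow the thread of an item and count the items in the subtrees below its nodes, again without copying any part of the forest. This is an assumption about the mechanics rather than the authors' exact procedure; `Node`, `build_ttf`, and `top_down_counts` are illustrative names.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.weight, self.children = item, parent, 0, {}

def build_ttf(transactions):
    """Build the forest; il maps item -> nodes labeled with it (the thread)."""
    roots, il = {}, {}
    for items in transactions:
        level, parent = roots, None
        for item in items:
            if item not in level:
                level[item] = Node(item, parent)
                il.setdefault(item, []).append(level[item])
            node = level[item]
            node.weight += 1
            level, parent = node.children, node
    return roots, il

def top_down_counts(il, item):
    """Count the items in the subtrees below every node of item."""
    counts = Counter()
    stack = [c for node in il[item] for c in node.children.values()]
    while stack:
        node = stack.pop()
        counts[node.item] += node.weight   # weight = transactions through node
        stack.extend(node.children.values())
    return dict(counts)

roots, il = build_ttf(["acfmp", "abcfm", "bf", "bcp", "acfmp"])
print(top_down_counts(il, "a"))
print(top_down_counts(il, "c"))
```

The counts for a give c, f, and m a support of 3, matching the local frequent items list FIL(a) of the array-based example.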
16
Opportunistic Projection:
Observations and Heuristics

Observation 1:
  The upper portion of a FIST can fit in memory.
  The number of transactions that support length-k item sets decreases
  sharply when k is greater than 2.
Heuristic 1:
  Grow the upper portion of a FIST breadth first.
  Grow the lower portion under level k depth first, whenever the
  reduced transaction set can be represented by a memory-based
  structure, either TVLA or TTF.
17
Opportunistic Projection:
Observations and Heuristics(2)

Observation 2:
  TTF compresses well at lower levels or denser branches, where
  there are fewer local frequent items in PTSs and the relative
  support is larger.
  TTF is space expensive relative to TVLA if its compression ratio is
  less than 6 - t/n (t: number of transactions, n: number of items in a
  PTS).
Heuristic 2:
  Represent PTSs by TVLA at high levels on FIST, unless the
  estimated compression ratio of TTF is sufficiently high.
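The space test in Observation 2 can be written as a one-line predicate. This is a hedged sketch: the function names are illustrative, and the usage lines additionally assume that n counts item occurrences in the PTS and that the compression ratio is n divided by the number of TTF nodes, neither of which the slide states.

```python
def ttf_is_space_expensive(compression_ratio, t, n):
    """Observation 2: TTF costs more space than TVLA below the bound 6 - t/n."""
    return compression_ratio < 6 - t / n

def choose_representation(compression_ratio, t, n):
    """Heuristic 2: default to TVLA unless TTF compresses well enough."""
    return "TVLA" if ttf_is_space_expensive(compression_ratio, t, n) else "TTF"

# Filtered example DB: 5 transactions, 20 item occurrences, 13 forest nodes.
# Treating 20 / 13 as the compression ratio is an assumption of this sketch.
t, n, nodes = 5, 20, 13
print(choose_representation(n / nodes, t, n))
```

On the small example the ratio is far below the bound, so the array-based TVLA would be chosen.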
18
Opportunistic Projection:
Observations and Heuristics(3)

Observation 3:
  PTSs shrink very quickly at high levels or sparse branches on FIST,
  where filtered PTSs are usually in the form of TVLA.
  PTSs at lower levels or dense branches shrink slowly, where PTSs
  are represented by TTF. The creation of a filtered TTF involves
  expensive pattern matching.
Heuristic 3:
  When projecting a parent TVLA, make a filtered copy for the child
  TVLA as long as there is free memory.
  When projecting a parent TTF, delimit the pseudo child TTF first and
  then make a filtered copy only if it shrinks substantially.
19
Algorithm OpportuneProject






OpportuneProject(Database: D)
begin
  create a null root for frequent item set tree T;
  D' = BreadthFirst(T, D);
  GuidedDepthFirst(root_of_T, D');
end
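The two phases of the pseudocode can be sketched in simplified form. This is not the authors' implementation: here `breadth_first` handles only the first level, `guided_depth_first` always projects plain lists, and the opportunistic choice between TVLA and TTF and between filtered and unfiltered projection is omitted. All function names and signatures are assumptions.

```python
from collections import Counter

DB = ["acdfgimp", "abcflmo", "bfhjo", "bckps", "aceflmnp"]

def breadth_first(db, min_support):
    """Level 1 of the FIST: count items in one scan, then filter the database."""
    counts = Counter(i for items in db for i in set(items))
    frequent = {i for i, c in counts.items() if c >= min_support}
    reduced = [[i for i in sorted(set(items)) if i in frequent] for items in db]
    return [t for t in reduced if t]

def guided_depth_first(prefix, pts, min_support, out):
    """Grow the FIST below prefix by projecting its PTS item by item."""
    counts = Counter(i for t in pts for i in t)
    for item in sorted(i for i, c in counts.items() if c >= min_support):
        out[prefix + item] = counts[item]
        # Child PTS: what follows item in each transaction that contains it.
        child = [t[t.index(item) + 1:] for t in pts if item in t]
        guided_depth_first(prefix + item, child, min_support, out)

def opportune_project(db, min_support):
    out = {}
    reduced = breadth_first(db, min_support)
    guided_depth_first("", reduced, min_support, out)
    return out

result = opportune_project(DB, 3)
print(len(result), "frequent item sets")
```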
20
Performance Evaluation:
Efficiency on BMS-POS (sparse)
21
Performance Evaluation:
Efficiency on BMS-WebView1 (sparse)
22
Performance Evaluation:
Efficiency on BMS-WebView2 (sparse)
23
Performance Evaluation:
Efficiency on Connect4 (dense)
24
Performance Evaluation:
Efficiency on T25I20D100kN20kL5k
25
Performance Evaluation:
Scalability on T25I20D1mN20kL5k
26
Performance Evaluation:
Scalability on T25I20D10mN20kL5k
27
Performance Evaluation:
Scalability on T25I20D100k~15mN20kL5k
28
Conclusions

OpportuneProject maximizes efficiency and scalability for all data features
by combining
  depth first with breadth first search strategies
  array-based and tree-based representations for projected
  transaction subsets
  unfiltered and filtered projections
29
Acknowledgement
We would like to thank
Blue Martini Software, Inc.
for providing us the BMS datasets!
30
Thank you !!!
34