Chen

Efficiently Mining Frequent
Trees in a Forest
Mohammed J. Zaki
Frequent Structure Mining (FSM)
Deals with extracting patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases.
Typical applications:
Bioinformatics
Web mining
Mining semi-structured documents
Tree Mining Problems
Goal: to efficiently enumerate all frequent subtrees in a forest (a database D of trees) according to a given minimum support (minsup).
The support of a subtree S is the number of trees in D that contain at least one occurrence of S.
A subtree S is frequent if its support is greater than or equal to a user-specified minsup value.
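The support definition above counts trees, not occurrences. A minimal sketch of that distinction follows; the function names and the occurrence counts are hypothetical and chosen only for illustration.

```python
def support(occurrences_per_tree):
    """Number of trees that contain at least one occurrence of the subtree."""
    return sum(1 for count in occurrences_per_tree.values() if count > 0)


def is_frequent(occurrences_per_tree, minsup):
    """A subtree is frequent if its support reaches the minsup threshold."""
    return support(occurrences_per_tree) >= minsup


# Hypothetical counts: three occurrences in T2 still add only 1 to the support.
occurrences = {"T0": 1, "T1": 0, "T2": 3}
print(support(occurrences))         # 2
print(is_frequent(occurrences, 2))  # True
```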
Rooted, Ordered & Labeled tree
A tree is an acyclic connected graph.
Rooted: there exists one vertex that is distinguished from the others (the root).
Ordered: the children of each node in a rooted tree are ordered.
Labeled: each node is associated with a label.
Every tree in the paper is a rooted, ordered, and labeled tree.
Definition of Subtrees
We denote a tree as T = (N, B), where N is a set of labeled nodes and B is a set of branches.
We say that a tree S = (Ns, Bs) is an embedded subtree of T = (N, B) if:
1. Ns is a subset of N.
2. A branch (nx, ny) appears in Bs iff nx is an ancestor of ny in T, i.e., the two vertices are on the same path from the root to a leaf in T.
A disconnected pattern is a sub-forest of T.
Hence, embedded subtrees allow not only direct parent-child branches, but also ancestor-descendant branches.
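To make the definition concrete, here is a brute-force sketch of an embedded-subtree test for ordered, labeled trees. It is not the paper's method (TreeMiner avoids this kind of search), and the nested-tuple representation `(label, [children])` and all function names are assumptions made for this example.

```python
def flatten(tree):
    """Number the nodes depth-first; return labels, parents, last descendants."""
    labels, parents, last = [], [], []

    def visit(node, parent):
        label, children = node
        me = len(labels)
        labels.append(label)
        parents.append(parent)
        last.append(me)
        for child in children:
            visit(child, me)
        last[me] = len(labels) - 1   # number of the rightmost descendant

    visit(tree, -1)
    return labels, parents, last


def is_embedded_subtree(s, t):
    """True if the ordered, labeled tree s is an embedded subtree of t."""
    s_labels, s_parents, _ = flatten(s)
    t_labels, _, t_last = flatten(t)
    prev_sibling, last_child = [], {}
    for node, parent in enumerate(s_parents):
        prev_sibling.append(last_child.get(parent, -1))
        last_child[parent] = node

    def place(q, match):
        """Try to map pattern node q, given the images of nodes 0..q-1."""
        if q == len(s_labels):
            return True
        p, c = s_parents[q], prev_sibling[q]
        lo, hi = 0, len(t_labels) - 1
        if p >= 0:                    # must be a descendant of the parent's image
            lo, hi = match[p] + 1, t_last[match[p]]
        if c >= 0:                    # must come after the previous sibling's subtree
            lo = max(lo, t_last[match[c]] + 1)
        return any(t_labels[cand] == s_labels[q] and place(q + 1, match + [cand])
                   for cand in range(lo, hi + 1))

    return place(0, [])


t = (1, [(2, []), (3, [(4, [])])])
print(is_embedded_subtree((1, [(2, []), (4, [])]), t))  # True: 4 is a descendant of 1
print(is_embedded_subtree((2, [(4, [])]), t))           # False: 4 is not below 2
```

The first pattern uses an ancestor-descendant branch (1 to 4) that is not a direct parent-child branch in the tree, which is exactly what the embedded definition allows.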
Examples of subtrees:
[Figure: a tree T, an embedded subtree S of T, and a pattern that is not a subtree of T but a sub-forest.]
Node Numbers and Labels
[Figure: example tree T with eight nodes; each node is shown with its label and its number, 0 to 7, assigned in depth-first order.]
Each node has a well-defined number, i, according to its position in a depth-first traversal of the tree.
The label of each node is taken from a set of labels L = {0, 1, …, m-1}. It represents the value of the node.
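Depth-first numbering can be sketched in a few lines. The nested-tuple representation `(label, [children])`, the function name, and the shape of the example tree are assumptions made for illustration.

```python
def number_nodes(tree):
    """Return the labels in depth-first (preorder) order; list index = node number."""
    label, children = tree
    order = [label]
    for child in children:
        order.extend(number_nodes(child))
    return order


# Illustrative tree with labels drawn from L = {0, 1, 2, 3, 4}.
tree = (0, [(2, [(1, [(1, [])]), (1, [])]), (4, [(3, [(2, [])])])])
print(number_nodes(tree))  # [0, 2, 1, 1, 1, 4, 3, 2]
```

Node 5, for instance, carries label 4, and several nodes may share the same label.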
Scope of Node
[Figure: tree T with each node annotated by its scope: n0 [0,7], n1 [1,4], n2 [2,3], n3 [3,3], n4 [4,4], n5 [5,7], n6 [6,7], n7 [7,7].]
The scope of each node ni is given as [i, r], i.e., the lower bound is the position (i) of the node itself, and the upper bound is the position (r) of its rightmost leaf node.
Assume two nodes x and y have the scopes Sx = [ix, rx] and Sy = [iy, ry].
Sx is strictly less than (<) Sy iff rx < iy, i.e., Sx occurs before Sy. It means that y is an embedded sibling of x.
Sx contains Sy iff ix <= iy and rx >= ry. It means that y is a descendant of x.
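The scope computation and the two scope tests can be sketched as follows. The parent-array input and the function names are assumptions; the array is chosen so that the result reproduces the scopes shown on this slide.

```python
def scopes(parents):
    """parents[i] is the depth-first number of node i's parent (-1 for the root)."""
    right = list(range(len(parents)))            # a leaf's scope is [i, i]
    for node in range(len(parents) - 1, 0, -1):  # children are numbered after parents
        p = parents[node]
        right[p] = max(right[p], right[node])
    return [(i, r) for i, r in enumerate(right)]


def strictly_before(sx, sy):
    """Sx < Sy: x's scope ends before y's begins (y is an embedded sibling of x)."""
    return sx[1] < sy[0]


def contains(sx, sy):
    """Sx contains Sy (y is a descendant of x when x and y are different nodes)."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]


parents = [-1, 0, 1, 2, 1, 0, 5, 6]   # assumed shape matching the slide's scopes
s = scopes(parents)
print(s)  # [(0, 7), (1, 4), (2, 3), (3, 3), (4, 4), (5, 7), (6, 7), (7, 7)]
print(strictly_before(s[1], s[5]))    # True
print(contains(s[1], s[3]))           # True
```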
Representing trees as Strings
[Figure: the example tree T.]
To create the string encoding of a tree, denoted t, we perform a depth-first search starting (and ending) at the root, adding the current node’s label x to t. Whenever we backtrack from a child to its parent, we add a special symbol -1 to the string.
The String Encoding:
0 2 1 1 -1 -1 1 -1 -1 4 3 -1 2 -1 -1
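The encoding procedure can be sketched as a short recursion. The nested-tuple input `(label, [children])` and the function name are assumptions; the example tree is the shape obtained by reading the encoding above back into a tree.

```python
def encode(tree):
    """Tokens of the string encoding: labels in depth-first order, -1 per backtrack."""
    label, children = tree
    tokens = [str(label)]
    for child in children:
        tokens.extend(encode(child))
        tokens.append("-1")          # backtrack from the child to this node
    return tokens


# Tree shape read off the encoding shown on this slide (an assumption).
tree = (0, [(2, [(1, [(1, [])]), (1, [])]), (4, [(3, []), (2, [])])])
print(" ".join(encode(tree)))
# 0 2 1 1 -1 -1 1 -1 -1 4 3 -1 2 -1 -1
```

A tree with n nodes yields n labels and n-1 backtrack symbols, since no -1 is written after the root.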
Equivalence Classes
Two k-subtrees X, Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node.
[Figure: the prefix tree 2 1 0 -1 3 with the positions where a new node x can be attached; one further position, off the rightmost path, is marked "Not a valid element!".]
Prefix String: 2 1 0 -1 3
The following three subtrees are in the same prefix equivalence class:
2 1 0 -1 3 -1 -1 x -1   // (x, 0)
2 1 0 -1 3 -1 x -1 -1   // (x, 1)
2 1 0 -1 3 x -1 -1 -1   // (x, 3)
Element list: (label, position of the node to which x is attached)
(x, 0); (x, 1); (x, 3)
A valid element x may be attached only to those nodes that lie on the path from the root to the rightmost leaf of the prefix.
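The rule for valid elements can be sketched by computing the rightmost path of a prefix and building the extended encoding. The function names are assumptions, and the prefix is expected to end at its rightmost leaf (no trailing -1 symbols), as in the prefix string above.

```python
def rightmost_path(prefix):
    """Depth-first numbers of the nodes from the root to the rightmost leaf."""
    path, count = [], 0
    for tok in prefix.split():
        if tok == "-1":
            path.pop()
        else:
            path.append(count)
            count += 1
    return path


def extend(prefix, x, i):
    """Encoding of the prefix with a new last node x attached to node i."""
    path = rightmost_path(prefix)
    if i not in path:
        raise ValueError("not a valid element: node is off the rightmost path")
    depth = path.index(i)
    ups = len(path) - 1 - depth                  # backtracks from the rightmost leaf to node i
    tokens = prefix.split() + ["-1"] * ups + [str(x)]
    tokens += ["-1"] * (depth + 1)               # backtracks from x to the root, as on the slide
    return " ".join(tokens)


prefix = "2 1 0 -1 3"
print(rightmost_path(prefix))   # [0, 1, 3]
print(extend(prefix, "x", 1))   # 2 1 0 -1 3 -1 x -1 -1
```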
Candidate Generation:
Goal: given an equivalence class of k-subtrees, obtain candidate (k+1)-subtrees.
Main idea: consider each pair of elements in the class for extension, including self-extension.
Theorem:
Assume elements are kept sorted by node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing extensions of element (x, i). Define (y, j) join (x, i) as follows:
Case I (i = j): 1) If P ≠ ∅, add (y, j) and (y, j+1) to Px.
2) If P = ∅, add (y, j+1) to Px.
Case II (i > j): add (y, j) to Px.
Case III (i < j): no new element is possible in this case.
The theorem as stated has a mistake: in Case I.1, (y, j+1) is not always a valid element, as the following example shows.
Prefix: 1 2
Element List: (3, 1); (4, 0)
[Figure: the candidate trees produced by each join.]
Extending with (3, 1) gives Prefix = 1 2 3:
(3,1) join (3,1): (3,1), (3,2)
(4,0) join (3,1): (4,0)
Extending with (4, 0) gives Prefix = 1 2 -1 4:
(4,0) join (4,0): (4,0), (4,2)
If we add (y, j+1), i.e., (4, 1), we get the following tree: 1 2 4 -1 4, wrong!
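The three join cases can be sketched as a small function. In Case I it uses the position that x takes in Px (the number of nodes in P) instead of j+1; that correction is an assumption drawn from the (3,2) and (4,2) results in the example above, and the function name and signature are hypothetical.

```python
def join(y_elem, x_elem, prefix_size):
    """Elements that (y, j) join (x, i) contributes to the class Px.

    prefix_size is the number of nodes in the prefix P, so the newly added
    node x gets depth-first position prefix_size in Px.
    """
    (y, j), (x, i) = y_elem, x_elem
    if i == j:
        if prefix_size == 0:
            return [(y, 0)]                  # empty prefix: y can only go below the root x
        return [(y, j), (y, prefix_size)]    # y as a sibling of x, or as a child of x
    if i > j:
        return [(y, j)]                      # Case II
    return []                                # Case III


# Prefix 1 2 has two nodes; elements (3, 1) and (4, 0) as in the example.
print(join((3, 1), (3, 1), 2))  # [(3, 1), (3, 2)]
print(join((4, 0), (3, 1), 2))  # [(4, 0)]
print(join((4, 0), (4, 0), 2))  # [(4, 0), (4, 2)]
```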
TreeMiner Algorithm
TreeMiner(D, minsup), where D is the database of trees (the forest):
    F1 = { frequent 1-subtrees };
    F2 = { classes [P]1 of frequent 2-subtrees };
    For all [P]1 do Enumerate-Frequent-Subtree([P]1);
Enumerate-Frequent-Subtree([P]):
    For each element (x, i) ∈ [P] do
        For each element (y, j) ∈ [P] do
            (y, j) join (x, i) => at most two new candidate subtrees
            For each candidate subtree, do scope-list joins
            If it is frequent, add the subtree to the list of frequent subtrees
    Repeat until all frequent subtrees have been enumerated.
P: prefix class. [P]1 means the prefix size is 1, i.e., there is only one node in the prefix. Px refers to the new prefix tree formed by adding (x, i) to P.
Fk: the set of all frequent subtrees of size k.
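The outline above can be turned into a compact end-to-end sketch. This is not the author's implementation: the data layout (scope-list entries as `(tree id, match label, scope)` tuples), all function names, and the use of the position of x in Px for the child extension are assumptions made for this example. minsup is an absolute count of trees.

```python
from collections import defaultdict


def parse(encoding):
    """Labels and scopes (lower, upper) of the nodes, in depth-first order."""
    labels, scopes, open_nodes = [], [], []
    for tok in encoding.split():
        if tok == "-1":
            node = open_nodes.pop()
            scopes[node] = (node, len(labels) - 1)
        else:
            open_nodes.append(len(labels))
            labels.append(int(tok))
            scopes.append(None)
    for node in open_nodes:                      # nodes not closed by a -1 (the root)
        scopes[node] = (node, len(labels) - 1)
    return labels, scopes


def encode(nodes):
    """String encoding of a subtree given as [(label, parent position), ...]."""
    tokens, path = [], []
    for pos, (label, parent) in enumerate(nodes):
        while path and path[-1] != parent:
            path.pop()
            tokens.append("-1")
        path.append(pos)
        tokens.append(str(label))
    return " ".join(tokens)


def scope_join(lx, ly, inside):
    """Join two scope-lists; inside=True makes y a descendant of x, else a later sibling."""
    out = []
    for tx, mx, sx in lx:
        for ty, my, sy in ly:
            if tx != ty or mx != my:
                continue
            ok = (sx[0] < sy[0] and sy[1] <= sx[1]) if inside else (sx[1] < sy[0])
            if ok:
                out.append((ty, mx + (sx[0],), sy))
    return out


def support(scope_list):
    return len({tid for tid, _, _ in scope_list})


def enumerate_class(prefix, elements, minsup, found):
    n = len(prefix)                              # position that x takes in Px
    for (x, i), lx in elements:
        px = prefix + [(x, i)]
        found.append(encode(px))
        new_class = []
        for (y, j), ly in elements:
            candidates = []
            if i == j and n > 0:
                candidates.append(((y, j), scope_join(lx, ly, inside=False)))
            if i == j:
                candidates.append(((y, n), scope_join(lx, ly, inside=True)))
            if i > j:
                candidates.append(((y, j), scope_join(lx, ly, inside=False)))
            new_class += [(e, sl) for e, sl in candidates if support(sl) >= minsup]
        enumerate_class(px, new_class, minsup, found)


def tree_miner(db, minsup):
    f1 = defaultdict(list)
    for tid, enc in db.items():
        for label, scope in zip(*parse(enc)):
            f1[label].append((tid, (), scope))
    elements = [((label, -1), sl) for label, sl in sorted(f1.items())
                if support(sl) >= minsup]
    found = []
    enumerate_class([], elements, minsup, found)
    return found


db = {0: "1 2 -1 3 4 -1 -1", 1: "2 1 2 -1 4 -1 -1 2 -1 3 -1"}
print(tree_miner(db, 2))  # ['1', '1 2', '1 2 -1 4', '1 4', '2', '3', '4']
```

The recursion replaces the "repeat until" step: each frequent element (x, i) becomes a new prefix Px, and the search stops on its own when a class has no frequent elements.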
An Example of the TreeMiner Algorithm
Database D of 3 Trees
[Figure: the three trees T0, T1, and T2, with node numbers and labels.]
D in Horizontal Format: (tree-id, string encoding):
(T0, 1 2 -1 3 4 -1 -1)
(T1, 2 1 2 -1 4 -1 -1 2 -1 3 -1)
(T2, 1 3 2 -1 -1 5 1 2 -1 3 4 -1 -1 -1 -1)
D in Vertical Format: for each label, its (tree-id, scope) pairs:
1: 0,[0,3]  1,[1,3]  2,[0,7]  2,[4,7]
2: 0,[1,1]  1,[0,5]  1,[2,2]  1,[4,4]  2,[2,2]  2,[5,5]
3: 0,[2,3]  1,[5,5]  2,[1,2]  2,[6,7]
4: 0,[3,3]  1,[3,3]  2,[7,7]
5: 2,[3,7]
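The conversion from the horizontal to the vertical format can be sketched as one pass over each string encoding. The function name and the dictionary layout are assumptions; the input strings are the three encodings listed above.

```python
from collections import defaultdict


def scope_lists(db):
    """Map each label to its list of (tree-id, (lower, upper)) pairs."""
    lists = defaultdict(list)
    for tid, encoding in db.items():
        labels, scopes, open_nodes = [], [], []
        for tok in encoding.split():
            if tok == "-1":
                node = open_nodes.pop()
                scopes[node] = (node, len(labels) - 1)
            else:
                open_nodes.append(len(labels))
                labels.append(int(tok))
                scopes.append(None)
        for node in open_nodes:          # nodes never closed by a -1 (the root here)
            scopes[node] = (node, len(labels) - 1)
        for label, scope in zip(labels, scopes):
            lists[label].append((tid, scope))
    return dict(lists)


D = {
    0: "1 2 -1 3 4 -1 -1",
    1: "2 1 2 -1 4 -1 -1 2 -1 3 -1",
    2: "1 3 2 -1 -1 5 1 2 -1 3 4 -1 -1 -1 -1",
}
vertical = scope_lists(D)
print(vertical[1])  # [(0, (0, 3)), (1, (1, 3)), (2, (0, 7)), (2, (4, 7))]
print(vertical[5])  # [(2, (3, 7))]
```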
Scope-List Joins Example: minsup = 100%
Step 1: Calculate F1:
Prefix = {}, Element list: (1,-1), (2,-1), (3,-1), (4,-1)
1: 0,[0,3]*  1,[1,3]  2,[0,7]  2,[4,7]
2: 0,[1,1]  1,[0,5]  1,[2,2]  1,[4,4]  2,[2,2]  2,[5,5]
3: 0,[2,3]  1,[5,5]  2,[1,2]  2,[6,7]
4: 0,[3,3]  1,[3,3]  2,[7,7]
Infrequent Element: (5,-1)
*: 0 = tree id; [0,3] = node scope.
Step 2: Calculate F2:
Suppose Prefix = {1}, Element list: (2,0), (4,0)
(2,0): 0,0,[1,1]*  1,1,[2,2]  2,0,[2,2]  2,0,[5,5]  2,4,[5,5]
(4,0): 0,0,[3,3]  1,1,[3,3]  2,0,[7,7]  2,4,[7,7]
Infrequent Elements: (1,0), (3,0)
*: 0 = tree id; 0 = the node number (position) of the prefix {1}; [1,1] = scope of the element node.
Step 3: Calculate F3:
Suppose Prefix = {1,2}, Element list: (4,0)
(4,0): 0,01,[3,3]*  1,12,[3,3]  2,02,[7,7]  2,05,[7,7]  2,45,[7,7]
Infrequent Elements: (2,0), (2,1), (4,1)
*: 0 = tree id; 01 = the node numbers (positions) of the prefix {1,2}; [3,3] = scope of the element node.
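The joins behind Steps 2 and 3 can be sketched with two tests on scopes: an in-scope test (y lies inside x's scope and becomes a descendant of x) and an out-scope test (y lies after x's scope and becomes a later embedded sibling). The tuple layout `(tree id, match label, scope)` and the function names are assumptions; the input lists are the F1 scope-lists for labels 1, 2, and 4.

```python
def in_scope_join(lx, ly):
    """y becomes a descendant of x: y's scope lies strictly inside x's scope."""
    return [(ty, mx + (sx[0],), sy)
            for tx, mx, sx in lx
            for ty, my, sy in ly
            if tx == ty and mx == my and sx[0] < sy[0] and sy[1] <= sx[1]]


def out_scope_join(lx, ly):
    """y becomes an embedded sibling after x: x's scope ends before y's begins."""
    return [(ty, mx + (sx[0],), sy)
            for tx, mx, sx in lx
            for ty, my, sy in ly
            if tx == ty and mx == my and sx[1] < sy[0]]


def support(scope_list):
    """Number of distinct trees in a scope-list."""
    return len({tid for tid, _, _ in scope_list})


# F1 scope-lists (empty match label, since the prefix is empty).
L1 = [(0, (), (0, 3)), (1, (), (1, 3)), (2, (), (0, 7)), (2, (), (4, 7))]
L2 = [(0, (), (1, 1)), (1, (), (0, 5)), (1, (), (2, 2)), (1, (), (4, 4)),
      (2, (), (2, 2)), (2, (), (5, 5))]
L4 = [(0, (), (3, 3)), (1, (), (3, 3)), (2, (), (7, 7))]

L12 = in_scope_join(L1, L2)       # Step 2: element (2,0) under prefix {1}
L14 = in_scope_join(L1, L4)       # Step 2: element (4,0) under prefix {1}
L124 = out_scope_join(L12, L14)   # Step 3: element (4,0) under prefix {1,2}
for entry in L124:
    print(entry)
print(support(L124))                      # 3
print(support(in_scope_join(L12, L14)))   # 0, so element (4,1) is infrequent
```

Each printed entry corresponds to one row of the Step 3 list, with the match label written as a tuple, e.g. (0, 1) for 01.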
Conclusion
Introduces the notion of mining embedded subtrees in a (forest) database of trees.
Systematic candidate subtree generation: no subtree is generated more than once (but the join theorem has a mistake).
Uses a string encoding of trees to store the dataset efficiently.
Uses a node’s scope to develop scope-lists.
Introduces a new algorithm: TreeMiner.