Efficiently Mining Frequent Trees in a Forest
Mohammed J. Zaki

Frequent Structure Mining (FSM)
Deals with extracting patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases.
Typical applications:
  Bioinformatics
  Web mining
  Mining semi-structured documents

Tree Mining Problem
Goal: efficiently enumerate all frequent subtrees in a forest (a database D of trees), given a minimum support (minsup).
The support of a subtree S is the number of trees in D that contain at least one occurrence of S.
A subtree S is frequent if its support is greater than or equal to the user-specified minsup value.

Rooted, Ordered and Labeled Trees
A tree is an acyclic connected graph.
Rooted: one vertex, the root, is distinguished from the others.
Ordered: the children of each node in a rooted tree are ordered.
Labeled: each node is associated with a label.
Every tree in the paper is a rooted, ordered and labeled tree.

Definition of Subtrees
We denote a tree as T = {N, B}, where N is a set of labeled nodes and B is a set of branches.
A tree S = {Ns, Bs} is an embedded subtree of T = {N, B} if:
1. Ns is a subset of N.
2. A branch (nx, ny) appears in Bs iff nx is an ancestor of ny in T, i.e., the two vertices lie on the same path from the root to a leaf in T.
Hence, embedded subtrees allow not only direct parent-child branches but also ancestor-descendant branches.
A disconnected pattern is a sub-forest of T, not a subtree.

Examples of Subtrees
[Figure: a tree T, an embedded subtree S of T, and a disconnected pattern that is not a subtree but a sub-forest of T.]

Node Numbers and Labels
[Figure: the example tree with each node's number and label.]
Each node has a well-defined number i, given by its position in a depth-first traversal of the tree.
The label of each node is taken from a set of labels L = {0, 1, …, m-1}. It represents the value of the node.
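A minimal Python sketch of depth-first node numbering, and of the ancestor test that decides whether a branch may appear in an embedded subtree. The nested (label, children) representation, the function names and the small tree are assumptions made for illustration; they do not come from the paper.

```python
from typing import List

def number_nodes(root):
    """Number nodes in depth-first (pre-order) order, starting at 0.

    A node is a pair (label, [children]); this layout is an assumption.
    Returns (labels, parent): labels[i] is the label of node i and
    parent[i] is the number of its parent (-1 for the root).
    """
    labels: List[int] = []
    parent: List[int] = []

    def visit(node, parent_number):
        label, children = node
        number = len(labels)          # next free depth-first number
        labels.append(label)
        parent.append(parent_number)
        for child in children:        # children are visited in their given order
            visit(child, number)

    visit(root, -1)
    return labels, parent

def is_ancestor(parent, a, b):
    """True if node a is a proper ancestor of node b."""
    while parent[b] != -1:
        b = parent[b]
        if b == a:
            return True
    return False

# Hypothetical tree: the root (label 0) has two children (labels 1 and 2),
# and the child labelled 1 has one child labelled 3.
tree = (0, [(1, [(3, [])]), (2, [])])
labels, parent = number_nodes(tree)
print(labels)                      # [0, 1, 3, 2]
print(parent)                      # [-1, 0, 1, 0]
print(is_ancestor(parent, 0, 2))   # True: node 0 to node 2 is a valid embedded branch
print(is_ancestor(parent, 2, 3))   # False: nodes 2 and 3 are on different paths
```

Node 2 is a grandchild of the root, so the branch from node 0 to node 2 is allowed in an embedded subtree even though it is not a direct parent-child branch.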
Scope of a Node
[Figure: the example tree annotated with the node scopes [0,7], [1,4], [2,3], [3,3], [4,4], [5,7], [6,7], [7,7].]
The scope of each node ni is given as [i, r], i.e., the lower bound is the position i of the node itself, and the upper bound is the position r of its rightmost leaf node.
Assume two nodes x and y have the scopes Sx = [ix, rx] and Sy = [iy, ry].
Sx is strictly less than Sy (Sx < Sy) iff rx < iy, i.e., Sx occurs before Sy. This means that y is an embedded sibling of x.
Sx contains Sy iff ix <= iy and rx >= ry. This means that y is a descendant of x.

Representing Trees as Strings
[Figure: the example tree.]
To create the string encoding of a tree, denoted t, we perform a depth-first search that starts and ends at the root, adding the current node's label x to t. Whenever we backtrack from a child to its parent, we add the special symbol -1 to the string.
The string encoding: 0 2 1 1 -1 -1 1 -1 -1 4 3 -1 2 -1 -1

Equivalence Classes
Two k-subtrees X and Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node.
Prefix string: 2 1 0 -1 3
[Figure: the prefix tree with a new node x attached at each candidate position; one attachment is marked "Not a valid element!".]
The following three subtrees are in the same prefix equivalence class:
  2 1 0 -1 3 -1 -1 x -1    // (x, 0)
  2 1 0 -1 3 -1 x -1 -1    // (x, 1)
  2 1 0 -1 3 x -1 -1 -1    // (x, 3)
Element list, as (label, position of the node to which x is attached): (x, 0); (x, 1); (x, 3)
A valid element x may be attached only to nodes that lie on the path from the root to the rightmost leaf.

Candidate Generation
Goal: given an equivalence class of k-subtrees, obtain the candidate (k+1)-subtrees.
Main idea: consider each pair of elements in the class for extension, including self-extension.
Theorem: Assume elements are kept sorted by node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing the extension of element (x, i).
Define (y, j) join (x, i) as follows:
Case I (i = j):
  1) If P ≠ ∅, add (y, j) and (y, j+1) to Px.
  2) If P = ∅, add (y, j) to Px.
Case II (i > j): add (y, j) to Px.
Case III (i < j): no new element is possible in this case.
The theorem as stated has a mistake: (y, j+1) in Case I is not always the correct position, as the following example shows.

Example
Prefix: 1 2
Element list: (3, 1); (4, 0)
[Figure: the two subtrees of the class and the candidate trees produced by each join.]
Extending (3, 1) gives the class with prefix 1 2 3:
  (3, 1) join (3, 1) adds (3, 1) and (3, 2)
  (4, 0) join (3, 1) adds (4, 0)
Extending (4, 0) gives the class with prefix 1 2 -1 4:
  (4, 0) join (4, 0) adds (4, 0) and (4, 2)
If we add (y, j+1), i.e., (4, 1), we get the tree 1 2 4 -1 4, which is wrong.

TreeMiner Algorithm
TreeMiner(D, minsup), where D is the database of trees (the forest):
  F1 = { frequent 1-subtrees };
  F2 = { classes [P]1 of frequent 2-subtrees };
  For all [P]1, do Enumerate-Frequent-Subtree([P]1);

Enumerate-Frequent-Subtree([P]):
  For each element (x, i) ∈ [P] do
    For each element (y, j) ∈ [P] do
      (y, j) join (x, i) => at most two new candidate subtrees
      For each candidate subtree, do the scope-list join.
      If it is frequent, add the subtree to the list of frequent subtrees.
  Repeat until all frequent subtrees have been enumerated.

P: prefix class. [P]1 means the prefix size is 1, i.e., the prefix contains only one node.
Px: the new prefix tree formed by adding (x, i) to P.
Fk: the set of all frequent subtrees of size k.
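The join step of candidate generation can be sketched in Python as below. This is not the theorem as literally stated: following the worked example, the second candidate in Case I attaches y to the new node x, whose depth-first number in Px equals the number of nodes in P, rather than to position j+1. The sketch covers only a non-empty prefix and does no frequency checking; the function name extend_class and its argument layout are assumptions made for illustration.

```python
def extend_class(prefix_size, elements):
    """Join every ordered pair of elements of one prefix class.

    prefix_size: number of nodes in the prefix P (numbered 0..prefix_size-1).
    elements:    list of (label, position) pairs; position is the prefix node
                 to which the new node is attached.
    Returns a dict mapping each element (x, i) to the element list of Px.
    """
    if prefix_size < 1:
        raise ValueError("this sketch only covers a non-empty prefix")
    new_position = prefix_size  # depth-first number of node x inside Px
    ordered = sorted(elements)  # label is the primary key, position the secondary
    classes = {}
    for (x, i) in ordered:
        px = []
        for (y, j) in ordered:
            if i == j:
                px.append((y, j))             # y becomes a later sibling of x
                px.append((y, new_position))  # y becomes a child of x
            elif i > j:
                px.append((y, j))             # y attaches higher on the rightmost path
            # i < j: no candidate is possible
        classes[(x, i)] = px
    return classes

# The example class: prefix "1 2" (two nodes), elements (3, 1) and (4, 0).
result = extend_class(2, [(3, 1), (4, 0)])
print(result[(3, 1)])  # [(3, 1), (3, 2), (4, 0)]
print(result[(4, 0)])  # [(4, 0), (4, 2)]
```

With the literal (y, j+1) rule, the last line would contain (4, 1) instead of (4, 2), which is the wrong candidate pointed out in the example.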
An Example of the TreeMiner Algorithm
Database D of 3 trees
[Figure: trees T0, T1 and T2 with node numbers and labels.]
D in horizontal format, as (tree-id, string encoding) pairs:
  (T0, 1 2 -1 3 4 -1 -1)
  (T1, 2 1 2 -1 4 -1 -1 2 -1 3 -1)
  (T2, 1 3 2 -1 -1 5 1 2 -1 3 4 -1 -1 -1 -1)
D in vertical format, as (tree-id, scope) pairs for each label:
  1:  0,[0,3]  1,[1,3]  2,[0,7]  2,[4,7]
  2:  0,[1,1]  1,[0,5]  1,[2,2]  1,[4,4]  2,[2,2]  2,[5,5]
  3:  0,[2,3]  1,[5,5]  2,[1,2]  2,[6,7]
  4:  0,[3,3]  1,[3,3]  2,[7,7]
  5:  2,[3,7]

Scope-List Joins
Example: minsup = 100%

Step 1: Calculate F1.
Prefix = {}, element list: (1,-1), (2,-1), (3,-1), (4,-1)
Scope-lists:
  1:  0,[0,3]*  1,[1,3]  2,[0,7]  2,[4,7]
  2:  0,[1,1]  1,[0,5]  1,[2,2]  1,[4,4]  2,[2,2]  2,[5,5]
  3:  0,[2,3]  1,[5,5]  2,[1,2]  2,[6,7]
  4:  0,[3,3]  1,[3,3]  2,[7,7]
Infrequent element: (5,-1)
*: 0 is the tree id; [0,3] is the node scope.

Step 2: Calculate F2.
Suppose prefix = {1}, element list: (2,0), (4,0)
Scope-lists:
  (2,0):  0,0,[1,1]*  1,1,[2,2]  2,0,[2,2]  2,0,[5,5]  2,4,[5,5]
  (4,0):  0,0,[3,3]  1,1,[3,3]  2,0,[7,7]  2,4,[7,7]
Infrequent elements: (1,0), (3,0)
*: 0 is the tree id; 0 is the node number (position) of the prefix {1}; [1,1] is the scope of the element node.

Step 3: Calculate F3.
Suppose prefix = {1,2}, element list: (4,0)
Scope-list:
  (4,0):  0,01,[3,3]*  1,12,[3,3]  2,02,[7,7]  2,05,[7,7]  2,45,[7,7]
Infrequent elements: (2,0), (2,1), (4,1)
*: 0 is the tree id; 01 gives the node numbers (positions) of the prefix {1,2}; [3,3] is the scope of the element node.

Conclusion
Introduces the notion of mining embedded subtrees in a (forest) database of trees.
Systematic candidate subtree generation: no subtree is generated more than once (but the theorem has a mistake).
Uses a string encoding of trees to store the dataset efficiently.
Uses a node's scope to develop scope-lists.
Introduces a new algorithm, TreeMiner.
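As a rough check of the worked example, the scope-lists can be rebuilt from the string encodings of D and joined with a simple containment test. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the join is a plain nested loop rather than an efficient implementation, and the lower bound is compared strictly so that a node is never matched as its own descendant.

```python
from collections import defaultdict

def scopes_from_encoding(encoding):
    """Return [(label, (lower, upper)), ...] indexed by depth-first node number."""
    labels, upper, stack = [], [], []
    for token in (int(tok) for tok in encoding.split()):
        if token == -1:                    # backtrack: the node on top is complete
            node = stack.pop()
            upper[node] = len(labels) - 1
        else:                              # first visit of a new node
            stack.append(len(labels))
            labels.append(token)
            upper.append(None)
    for node in stack:                     # nodes still open at the end (the root)
        upper[node] = len(labels) - 1
    return [(labels[n], (n, upper[n])) for n in range(len(labels))]

def vertical_format(database):
    """Map each label to its list of (tree-id, scope) pairs."""
    lists = defaultdict(list)
    for tid, encoding in enumerate(database):
        for label, scope in scopes_from_encoding(encoding):
            lists[label].append((tid, scope))
    return lists

def support(entries):
    """Number of distinct trees that appear in a scope-list."""
    return len({entry[0] for entry in entries})

def in_scope_join(ancestors, descendants):
    """Keep (tree-id, ancestor position, descendant scope) where the
    descendant's scope lies inside the ancestor's scope in the same tree."""
    out = []
    for tid_a, (la, ua) in ancestors:
        for tid_d, (ld, ud) in descendants:
            if tid_a == tid_d and la < ld and ud <= ua:
                out.append((tid_a, la, (ld, ud)))
    return out

D = ["1 2 -1 3 4 -1 -1",
     "2 1 2 -1 4 -1 -1 2 -1 3 -1",
     "1 3 2 -1 -1 5 1 2 -1 3 4 -1 -1 -1 -1"]
minsup = len(D)                            # minsup = 100%
lists = vertical_format(D)
frequent_labels = sorted(l for l in lists if support(lists[l]) >= minsup)
print(frequent_labels)                     # [1, 2, 3, 4]
one_two = in_scope_join(lists[1], lists[2])
print(one_two)
# [(0, 0, (1, 1)), (1, 1, (2, 2)), (2, 0, (2, 2)), (2, 0, (5, 5)), (2, 4, (5, 5))]
print(support(in_scope_join(lists[1], lists[3])))  # 2, so element (3,0) is infrequent
```

The result for prefix {1} and element (2,0) matches the scope-list shown in Step 2, and label 5 drops out in Step 1 because it occurs only in tree 2.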