ICDE 2014, Chicago, USA
A General Algorithm for Subtree Similarity-Search
Sara Cohen, Nerya Or
The Hebrew University of Jerusalem

The Setting
Huge Labeled Tree Data
• Arises in
– computational biology,
– image analysis,
– automatic theorem proving,
– compiler optimization,
– XML databases

Subtree Similarity-Search
[Figure: a query tree Q; a database tree T with n nodes, hence n subtrees; the top-k subtrees of T most similar to Q]
• Goal: Given a (small) tree Q and a number k, find the k subtrees S of T most similar to Q
• Similarity: defined using a function that takes two trees and returns a real value

The Bottom Line
• An algorithm for subtree similarity-search
• Compatible with a wide family of tree distance functions
• Runtime is linear
– (Depending on the distance function used; see paper for exact analysis)
• Experimental results show near-invariance to query size and number of results fetched

Defining Distance
• We introduce profile distance functions for determining similarity among two given trees
• Several previously proposed distance measures can be shown to be profile distance functions:
– pq-gram distance (Augsten et al.)
– Windowed pq-gram distance (Augsten et al.)
– Binary branch distance (Yang et al.)
– Other multiset-based distance measures

Profile Distance Functions
• Main idea:
1. Associate each tree T with a multiset of small objects that represent the tree structure and contents
2. Use a multiset comparison method to determine similarity between two trees

Profile Distance Functions
[Figure: summarize the interesting features of each tree using a multiset; compare the multisets; obtain a distance value between the two trees]

Profile Distance: A Simple Example
[Figure: first tree, with node labels “meow”, “woof”, “ribbit”, “meow”, “cluck”; its multiset is “cluck”, “meow”, “meow”, “ribbit”, “woof”]
Compare the multisets.
For example: Dice coefficient
Summarize the interesting features of each tree using a multiset. For example: take bags of the tree labels
[Figure: second tree, with node labels “purr”, “meow”, “cluck”, “purr”; its multiset is “cluck”, “meow”, “purr”, “purr”]

Profile Distance: pq-grams (Augsten et al.)
[Figure: each tree is summarized by the multiset of its pq-grams, small groups of node labels padded with * placeholders (… and many more)]
Summarize the interesting features of each tree using a multiset. For example: pq-grams
Compare the multisets. For example: Normalized Dice for multisets (Augsten et al.)
This profile function respects the tree’s structure as well as its content!

Profile Distance Functions
• Main idea:
1. Associate each tree T with a multiset of small objects that represent the tree structure and contents
– Actually, the multiset for a tree is determined by multisets associated with its nodes
2. Use a multiset comparison method to determine similarity between two trees
– Comparison functions will be based on intersection, union and sizes of multisets

Multisets Associated with Trees
• Each node u is associated with two multisets, a local one and a global one:
– One contains elements that describe the subtree rooted at u
– The other contains elements that describe the node u and its surroundings
• A tree T, rooted at node r, is then associated with the multiset that combines the local multiset of r with the global multisets of all non-root nodes

Example: Subtree Rooted In Node r
• Take the local multiset from the root node r
• Take the global multiset from non-root nodes (v1, v2, v3, …)

Subtree Similarity Search
• A friendly reminder. Our mission: find the top-k subtrees of a tree T most similar to a query tree Q
• This problem can trivially be solved in polynomial time
• The challenge: huge size of the data, and efficiently computing distances for all subtrees
[Figure: query tree Q; database tree T; top-k subtrees of T most similar to Q]

Subtree Similarity Search
• Our algorithm’s basic strategy, given a number k, a query Q, and a tree T:
– Go over T in post-order:
– Calculate the multiset intersection and union sizes with Q for the subtree S rooted in the current node of T
– Derive a distance value between Q and S
– If S is one of the top-k subtrees we’ve seen, keep it in the results set

Calculating The Multiset Unions
• Note: for multisets, |A ∪ B| = |A| + |B| − |A ∩ B|, so the union size follows from the multiset sizes and the intersection size
• The multiset size of each subtree S is easy to calculate while iterating over T in post-order

Calculating The Multiset Intersections
• Suppose we want to calculate the size of the multiset intersection between A={α,α,α,β} and B={α,α,β,γ}
– For each element x, take the smaller of the number of times x appears in A and the number of times x appears in B, then sum these values
– We sum over each x exactly once, even if it appears several times in the multisets
– Here: min(3,2) + min(1,1) + min(0,1) = 3

Calculating The Multiset Intersections
• We begin by describing a simple algorithm for calculating the intersection sizes
– This method is used within the DynamicSearch algorithm in the paper
• Later, we will describe an improved algorithm
– This improved approach is what we use in the ProfileSimSearch algorithm in the paper

Multiset Intersections; Simple Version
• We want to find the intersection size for each subtree S
• Q always stays constant, so we calculate its multiset once
• Any element that does not appear in Q’s multiset contributes 0 to this sum, so we only calculate for elements that do appear in it

Multiset Intersections; Simple Version
• For each distinct x in Q’s multiset, define a queue
• This queue initially contains as many elements as x has occurrences in Q’s multiset, all of which are null placeholders
• For example, if Q’s multiset is {a,a,a,b}, we have two queues:
– The queue of a: null, null, null
– The queue of b: null

Multiset Intersections; Simple Version
• We iterate over T in post-order
• For each node v, and for each x in v’s global multiset that also appears in Q’s multiset, we perform the following action once per occurrence of x:
– Pop an element from the queue of x
– Insert v into the queue of x
[Animation: as nodes are inserted, the null placeholders in the queues are replaced one by one]
Multiset Intersections; Simple Version
• In the queue of x, we have:
– A prefix of nulls and nodes from outside the current subtree
– A suffix of the nodes from the current subtree that have x in their global profile
[Figure: a tree with nodes v1 to v9, the current iteration’s node in T highlighted, and the queue contents split into prefix and suffix]

Multiset Intersections; Simple Version
• The length of the queue of x is always exactly the number of times x appears in Q’s multiset
• We can count the sizes of the suffix and prefix in order to obtain the intersection size (with respect to x)
– Note: “local” multiset elements can fit in any slot of the prefix and contribute to the intersection size. We use this fact to account for the local multiset of the current node.

Is that all?
The tree T is huge! Runtime of the simple algorithm is too high.
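The simple queue-based procedure from the preceding slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's DynamicSearch implementation: the `Node` class and the function names are hypothetical, and the profile is the label-bag one from the simple example, so each node's local and global multisets both contain just its own label. Queue entries are post-order numbers, which makes "belongs to the current subtree" a simple comparison.

```python
from collections import Counter, deque


class Node:
    """Hypothetical labeled tree node (illustrative, not from the paper)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def subtree_bag(node):
    """Brute-force bag of labels in the subtree rooted at node (for checking only)."""
    bag = Counter([node.label])
    for child in node.children:
        bag += subtree_bag(child)
    return bag


def intersection_sizes(root, query_bag):
    """Return [(node, intersection size with query_bag)] for every subtree, in post-order.

    Assumption: local(v) == global(v) == {v.label}.
    """
    # One fixed-length queue per distinct query element, pre-filled with nulls.
    queues = {x: deque([None] * n) for x, n in query_bag.items()}
    results = []
    next_index = 0  # post-order number of the next node to be finished

    def visit(v):
        nonlocal next_index
        first = next_index  # descendants of v get post-order numbers >= first
        for child in v.children:
            visit(child)
        size = 0
        for x, queue in queues.items():
            # Suffix: entries inserted by nodes inside the current subtree.
            suffix = sum(1 for e in queue if e is not None and e >= first)
            # Local elements of v may fill any remaining prefix slot.
            local = 1 if v.label == x else 0
            size += suffix + min(local, len(queue) - suffix)
        results.append((v, size))
        # Enqueue v's global element: pop the oldest entry, insert v.
        if v.label in queues:
            queues[v.label].popleft()
            queues[v.label].append(next_index)
        next_index += 1

    visit(root)
    return results


if __name__ == "__main__":
    tree = Node("meow", [Node("woof"),
                         Node("meow", [Node("cluck"), Node("meow")]),
                         Node("ribbit")])
    query = Counter(["meow", "meow", "cluck"])
    for node, size in intersection_sizes(tree, query):
        brute = sum((query & subtree_bag(node)).values())
        print(node.label, size, brute)  # the two numbers should agree
```

The sketch rescans every queue at every node, which is the cost that makes the simple version too slow on a huge tree. It also uses recursion, so a very deep tree would need an explicit stack instead.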
Making it Scalable
• By careful book-keeping, we can avoid the need to count the size of each queue suffix
– This reduces the runtime from quadratic to linear
• Calculating the intersection with local multiset elements is still needed
– But the runtime of this operation is bounded by the local multiset sizes, so it is linear overall in the input size

Calculating the Suffix Size On-The-Fly
1st attempt:
• Each node in T keeps a counter, initialized to 0
– However, we’ll never use more than O(height(T)) memory
• During the post-order iteration over T:
– Increase counter(v) whenever v is enqueued in some queue
– At the end of the iteration over v, add counter(v) to counter(v.parent)
This is not good enough! What happens when a node is evicted from the queue?

Calculating the Suffix Size On-The-Fly
Fixed:
• Each node in T keeps a counter, initialized to 0
– However, we’ll never use more than O(height(T)) memory
• During the post-order iteration over T:
– Increase counter(v) whenever v is enqueued in some queue
– At the end of the iteration over v, add counter(v) to counter(v.parent)
– Whenever a node u is evicted from a queue and node v is inserted instead, decrement counter(LCA(u,v))
counter(w) contains the size of the suffix during the iteration over w

Calculating the Suffix Size On-The-Fly
• The queue of x contains the last nodes that we’ve seen with which x was associated, as many as x has occurrences in Q’s multiset
• Each x can’t contribute more than that number to the intersection size
[Figure: w is the lowest common ancestor (LCA) of u and v; dequeue u, decrement counter(w), enqueue v; the queue length never changes]

The ProfileSimSearch Algorithm
• Runtime:
– Linear in the multiset sizes for Q and T, plus a |T| log(k) term (assuming O(1) calculation time for lowest common ancestors)
• Memory use:
– Linear in the query’s multiset size, in k, and in height(T)
• Runs in a single post-order pass over T
• Multisets of T’s nodes can be indexed in advance, for a quick implementation
– If all multiset elements can be
generated on-the-fly easily, no such preprocessing is necessary

Experimentation

Setup
• State of the art for subtree similarity search:
– TASM-postorder [Augsten et al.]
– StructureSearch [Cohen]
– Both algorithms use tree edit distance, and not profile distance functions
• We also compare performance with the implementation of tree-to-tree distance using pq-grams by Augsten et al.

Setup
• Data sets:
– DBLP (17.6 million nodes)
– XMark100 to XMark1800 (3.6 to 57.8 million nodes)
– Sprot (9.4 million nodes)
• Queries:
– Random subtrees from the data
• Extensive experimentation in paper
• In the next slides, all times are in seconds

Varying |Q|
[Figure: runtime when varying |Q|; dataset: 14.5 million nodes]
Similar results were observed on all other datasets that were tested

Varying k
[Figure: runtime when varying k; dataset: 14.5 million nodes]
Similar results were observed on all other datasets that were tested

Varying Dataset Size
[Figure: runtime when varying the dataset size]
Different multiset-generating functions are compared here

Comparison with tree-to-tree pq-gram distance
• A MySQL-based implementation of the pq-gram distance calculation routine given by Augsten et al. is compared to ProfileSimSearch
• Note: ProfileSimSearch may output top-k results, while the other algorithm is designed to calculate pq-gram distance between two given trees Q,T
• Both algorithms use an indexing stage over the database tree T, which is not measured in the following results
[Figure: runtime comparison between the two implementations]

Conclusion and Future Work
• We presented a definition capable of expressing a large general family of tree distance functions
• Efficient and scalable algorithm for subtree search using this definition
– Can also be used for tree search with a large set of trees
• Future Work:
– Use of upper bounds on subtree sizes or other attributes, to prune search space
– Use a profile distance function to obtain bounds on tree edit distance, and modify the algorithm to calculate top-k using tree edit distance

Thanks! Questions?