Conference presentation - The Hebrew University of Jerusalem

ICDE 2014, Chicago, USA
A General Algorithm for Subtree Similarity-Search
Sara Cohen, Nerya Or
The Hebrew University of Jerusalem
1
The Setting
Huge labeled tree data
• Arises in:
– computational biology
– image analysis
– automatic theorem proving
– compiler optimization
– XML databases
2
Subtree Similarity-Search
[Figure: a query tree Q and a database tree T; T has n nodes ⇨ n subtrees; the output is the top-k subtrees of T most similar to Q]
• Goal: Given a (small) tree Q and a number k, find the k subtrees S of T most similar to Q
• Similarity: defined using a function that
takes two trees and returns a real value
4
The Bottom Line
• An algorithm for subtree similarity-search
• Compatible with a wide family of tree distance
functions
• Runtime is linear
– (Depending on the distance function used; see paper
for exact analysis)
• Experimental results show near-invariance to
query size and number of results fetched
5
Defining Distance
• We introduce profile distance functions for determining the similarity between two given trees
• Several previously proposed distance measures can be shown to be profile distance functions:
– pq-gram distance (Augsten et al.)
– Windowed pq-gram distance (Augsten et al.)
– Binary branch distance (Yang et al.)
– Other multiset-based distance measures
6
Profile Distance Functions
• Main idea:
1. Associate each tree T with a multiset of small objects that represent the tree structure and contents
2. Use a multiset comparison method to determine similarity between two trees
7
Profile Distance Functions
1. Summarize the interesting features of each tree using a multiset
2. Compare the multisets
3. Obtain a distance value between the two trees
8
Profile Distance: A Simple Example
Summarize the interesting features of each tree using a multiset. For example: take bags of the tree labels.
– First tree: {“cluck”, “meow”, “meow”, “ribbit”, “woof”}
– Second tree: {“cluck”, “meow”, “purr”, “purr”}
Compare the multisets. For example: Dice coefficient.
9
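The bag-of-labels example can be sketched in a few lines of Python. This is only an illustration: it assumes the multiset Dice coefficient 2|A ∩ B| / (|A| + |B|) for two non-empty bags, and the function name `dice` is ours rather than the paper's.

```python
from collections import Counter

def dice(a: Counter, b: Counter) -> float:
    """Dice coefficient of two non-empty multisets: 2|A ∩ B| / (|A| + |B|)."""
    inter = sum((a & b).values())  # Counter's & keeps the minimum count per element
    return 2 * inter / (sum(a.values()) + sum(b.values()))

# Label bags of the two example trees.
first = Counter(["cluck", "meow", "meow", "ribbit", "woof"])
second = Counter(["cluck", "meow", "purr", "purr"])

similarity = dice(first, second)  # 2 * 2 / (5 + 4)
distance = 1 - similarity         # one common way to turn the coefficient into a distance
```

The two bags share one “cluck” and one “meow”, so the intersection has size 2 and the coefficient is 4/9.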
Profile Distance: pq-grams (Augsten et al.)
[Figure: the two example trees and some of their pq-grams, with “*” marking dummy positions, … and many more]
Summarize the interesting features of each tree using a multiset. For example: pq-grams.
Compare the multisets. For example: normalized Dice for multisets (Augsten et al.).
This profile function respects the tree’s structure as well as its content!
10
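To make the pq-gram idea concrete, here is a hedged sketch of how a pq-gram multiset could be generated. It follows the usual dummy-node convention as we understand it (a stem of p labels ending at an anchor node, and a window of q consecutive children, padded with “*”); the `Node` class and the function name are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def pq_gram_profile(root, p=2, q=3):
    """Multiset of pq-grams, each a tuple of p stem labels followed by q base labels."""
    profile = Counter()

    def visit(node, ancestors):
        stem = (ancestors + (node.label,))[-p:]  # the node and its p-1 nearest ancestors
        base = ("*",) * q                         # sliding window over the children
        if not node.children:
            profile[stem + base] += 1             # a leaf yields one all-dummy base
            return
        for child in node.children:
            base = base[1:] + (child.label,)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):                    # slide the window past the last child
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(root, ("*",) * p)
    return profile

tree = Node("a", [Node("b"), Node("c")])
grams = pq_gram_profile(tree, p=2, q=2)
```

For this three-node tree with p = 2 and q = 2 the sketch produces five pq-grams, for example `("*", "a", "b", "c")` for the root with both children in its window.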
Profile Distance Functions
• Main idea:
1. Associate each tree T with a multiset of small objects that represent the tree structure and contents
– Actually, the multiset for a tree is determined by the multisets associated with its nodes
2. Use a multiset comparison method to determine similarity between two trees
– Comparison functions will be based on the intersection, union and sizes of multisets
11
Multisets Associated with Trees
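• Each node u is associated with two multisets, a local multiset and a global multiset:
– One contains elements that describe the subtree rooted at u
– The other contains elements that describe the node u and its surroundings
• A tree T, rooted at node r, is then associated with the multiset obtained by combining the local multiset of r with the global multisets of all non-root nodes of T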
12
Example: Subtree Rooted In Node r
[Figure: a tree rooted at r with non-root nodes v1, v2, v3, …]
• Take the local multiset from the root node r
• Take the global multiset from each non-root node (v1, v2, v3, …)
13
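The composition rule (local multiset from the root, global multisets from all non-root nodes) can be sketched as follows. The `Node` class and the `local_ms` / `global_ms` callbacks are hypothetical names, and the sketch assumes the multisets are combined additively; in the usage lines both callbacks return the node's label, which reduces to the bag-of-labels profile.

```python
from collections import Counter

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def tree_multiset(root, local_ms, global_ms):
    """Local multiset of the root plus the global multisets of all non-root nodes."""
    total = Counter(local_ms(root))
    pending = list(root.children)      # every node reached from here is a non-root node
    while pending:
        v = pending.pop()
        total += global_ms(v)          # additive (bag) combination
        pending.extend(v.children)
    return total

def bag(v):
    return Counter([v.label])

tree = Node("meow", [Node("woof"), Node("ribbit", [Node("meow"), Node("cluck")])])
profile = tree_multiset(tree, bag, bag)
```

Because only the root contributes its local multiset, the multiset of any subtree differs from the sum of its nodes' global multisets in a single term, which is what makes incremental computation over all subtrees plausible.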
Subtree Similarity Search
A friendly reminder:
• Our mission: find the top-k subtrees of a tree T most similar to a query tree Q
• This problem can trivially be solved in polynomial time
• The challenge: the huge size of the data, and efficiently computing distances for all subtrees
[Figure: query tree Q, database tree T, and the top-k subtrees of T most similar to Q]
14
Subtree Similarity Search
• Our algorithm’s basic strategy, given a number k, a query Q, and a tree T:
– Go over T in post-order
– Calculate the multiset intersection and union sizes between Q and the subtree S rooted at the current node of T
– Derive a distance value between Q and S
– If S is one of the top-k subtrees we’ve seen, keep it in the result set
15
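The basic strategy can be sketched as a naive baseline. This is deliberately not the efficient algorithm of the talk: it rebuilds each subtree's multiset explicitly, uses a Dice-based distance as one possible choice, and keeps the k best subtrees in a heap. `Node`, `top_k_subtrees` and the callbacks are hypothetical names; the query multiset is assumed non-empty and k at least 1.

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def top_k_subtrees(root, query_ms, k, local_ms, global_ms):
    """Return up to k (distance, subtree root) pairs, closest first."""
    heap = []      # min-heap on (-distance, visit number): the worst kept subtree is on top
    visited = 0

    def visit(v):
        nonlocal visited
        below = Counter()
        for c in v.children:
            below += visit(c)                      # global multisets of proper descendants
        profile = local_ms(v) + below              # multiset of the subtree rooted at v
        inter = sum((profile & query_ms).values())
        dist = 1 - 2 * inter / (sum(profile.values()) + sum(query_ms.values()))
        item = (-dist, visited, v)                 # visit number breaks ties before nodes
        visited += 1
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)          # push, then drop the worst candidate
        return global_ms(v) + below

    visit(root)
    return sorted(((-d, v) for d, _, v in heap), key=lambda pair: pair[0])

def bag(v):
    return Counter([v.label])

tree = Node("meow", [Node("woof"), Node("ribbit", [Node("meow"), Node("cluck")])])
best = top_k_subtrees(tree, Counter(["meow", "cluck"]), 1, bag, bag)
```

With the query bag {“meow”, “cluck”}, the closest subtree is the one rooted at “ribbit” (distance 0.2). The heap update costs O(log k) per node, which matches the |T| log(k) term quoted later for ProfileSimSearch; the per-subtree multiset rebuilding is the part the talk's algorithm avoids.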
Calculating The Multiset Unions
• Note:
• Using the following formula, calculating the multiset size for each subtree S while iterating over T in post-order is easy:
16
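One way to maintain these sizes during the post-order pass is sketched below. It assumes the tree multiset is the additive (bag) combination of the root's local multiset and the non-root nodes' global multisets, so that sizes simply add up; the slide's own formula is not reproduced here, and `Node`, `local_size` and `global_size` are hypothetical names.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def subtree_multiset_sizes(root, local_size, global_size):
    """Return [(node, size of the multiset of the subtree rooted at node)] in post-order."""
    sizes = []

    def visit(v):
        below = sum(visit(c) for c in v.children)  # global sizes of all proper descendants
        sizes.append((v, local_size(v) + below))
        return global_size(v) + below              # what v's subtree adds to its ancestors

    visit(root)
    return sizes

tree = Node("a", [Node("b", [Node("c"), Node("d")]), Node("e")])
# With one local and one global element per node, each size equals the subtree's node count.
sizes = [(v.label, s) for v, s in subtree_multiset_sizes(tree, lambda v: 1, lambda v: 1)]
```

Each node is handled once and only a running total is passed upward, so the sizes for all subtrees come out of a single traversal.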
Calculating The Multiset Intersections
• Suppose we want to calculate the size of the multiset intersection between A={α,α,α,β} and B={α,α,β,γ}:
|A ∩ B| = Σ_x min(A(x), B(x)) = min(3,2) + min(1,1) + min(0,1) = 3
– Notation: A(x) is the number of times x appears in A
– We sum over each x exactly once, even if it appears several times in the multisets
17
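The same calculation can be checked with Python's `collections.Counter`, whose `&` operator keeps the minimum count of each element. This is just a small illustration of the definition, not part of the algorithm itself.

```python
from collections import Counter

A = Counter({"α": 3, "β": 1})
B = Counter({"α": 2, "β": 1, "γ": 1})

# Sum min(count in A, count in B) over each distinct element exactly once.
size = sum(min(A[x], B[x]) for x in A.keys() | B.keys())

# Counter's built-in multiset intersection gives the same result.
builtin_size = sum((A & B).values())
```

Both expressions evaluate to 3: two copies of α, one β, and no γ.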
Calculating The Multiset Intersections
• We begin by describing a simple algorithm for
calculating the intersection sizes
– This method is used within the DynamicSearch
algorithm in the paper
• Later, we will describe an improved algorithm
– This improved approach is what we use in the
ProfileSimSearch algorithm in the paper
18
Multiset Intersections; Simple Version
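• We want to find the size of the intersection between the multiset of Q and the multiset of S, for each subtree S
• Q always stays constant, so we calculate the multiset of Q once
• Any element that does not appear in the multiset of Q contributes 0 to this sum, so we will only calculate for elements that do appear in it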
19
Multiset Intersections; Simple Version
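• For each distinct element x in the multiset of Q, define a queue for x
• This queue initially contains as many elements as the number of times x appears in the multiset of Q, all of which are null placeholders
• For example, if the multiset of Q is {a,a,a,b}, we have two queues:
– Queue for a: null, null, null
– Queue for b: null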
20
Multiset Intersections; Simple Version
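• We iterate over T in post-order
• For each node v, and for each x that appears both in the multiset of Q and in the global multiset of v, we perform the following action once per occurrence of x in the global multiset of v:
– Pop an element from the front of the queue for x
– Insert v at the back of the queue for x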
21
Multiset Intersections; Simple Version
• In the queue for x, we have:
– A prefix of nulls and nodes from outside the current subtree
– A suffix of the nodes from the current subtree that have x in their global profile
[Figure: a tree T with nodes v1, …, v9, the current iteration’s node in T highlighted, and the contents of one queue]
27
Multiset Intersections; Simple Version
• The length of the queue for x is always exactly the number of times x appears in the multiset of Q
• We can count the sizes of the suffix and the prefix in order to obtain the intersection size (with respect to x)
– Note: “local” multiset elements can fit in any slot of
the prefix and contribute to the intersection size. We
use this fact to account for the local multiset of the
current node.
28
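The simple queue-based method can be sketched as follows. The details are our assumptions rather than the paper's DynamicSearch: queues store post-order indices in place of node objects, the intersection for a node is computed before that node's own global elements are enqueued, and the per-element contribution is taken as min(queue length, suffix size + local count). `Node` and the callbacks are hypothetical names, and the callbacks are expected to return `Counter` objects.

```python
from collections import Counter, deque

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def simple_intersection_sizes(root, query_ms, local_ms, global_ms):
    """Return [(node, intersection size of Q's multiset with the subtree's multiset)]."""
    queues = {x: deque([None] * n) for x, n in query_ms.items()}
    results = []
    done = 0                                   # number of nodes finished so far

    def visit(v):
        nonlocal done
        first = done                           # post-order index of the first node under v
        for c in v.children:
            visit(c)
        local = local_ms(v)
        total = 0
        for x, queue in queues.items():
            suffix = 0                         # entries belonging to proper descendants of v
            for u in reversed(queue):
                if u is None or u < first:
                    break
                suffix += 1
            # Local elements of v may occupy the remaining (prefix) slots.
            total += min(len(queue), suffix + local.get(x, 0))
        results.append((v, total))
        for x, times in global_ms(v).items():
            if x in queues:
                for _ in range(times):
                    queues[x].popleft()        # evict the oldest entry
                    queues[x].append(done)     # enqueue v, identified by its index
        done += 1

    visit(root)
    return results

def bag(v):
    return Counter([v.label])

tree = Node("a", [Node("b", [Node("a"), Node("a")]), Node("a", [Node("b")])])
sizes = [n for _, n in simple_intersection_sizes(tree, Counter("aab"), bag, bag)]
```

Scanning each queue suffix at every node is what makes this version slow on large trees: the work per node grows with the size of the query multiset.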
Is that all?
The tree T is huge!
Runtime of the simple algorithm is too high.
29
Making it Scalable
• By careful book-keeping, we can avoid the
need to count the size of each queue suffix
– This reduces the runtime from quadratic to linear
• Calculating the intersection with local multiset
elements is still needed
– But, the runtime of this operation is bounded by the
local multiset sizes, so overall linear in the input size
30
Calculating the Suffix Size On-The-Fly
1st attempt:
• Each node in T keeps a counter, initialized to 0
– However, we’ll never use more than O(height(T)) memory
• During the post-order iteration over T:
– Increase counter(v) whenever v is enqueued in some
queue
– At the end of the iteration over v, add counter(v) to
counter(v.parent)
This is not good enough!
What happens when a node
is evicted from the queue?
31
Calculating the Suffix Size On-The-Fly
Fixed:
• Each node in T keeps a counter, initialized to 0
– However, we’ll never use more than O(height(T)) memory
• During the post-order iteration over T:
– Increase counter(v) whenever v is enqueued in some
queue
– At the end of the iteration over v, add counter(v) to
counter(v.parent)
– Whenever a node u is evicted from a queue and
node v is inserted instead, decrement
counter(LCA(u,v))
counter(w) contains the size of the
suffix during the iteration over w
32
Calculating the Suffix Size On-The-Fly
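• The queue for x contains the most recent nodes we’ve seen that x was associated with, as many as the number of times x appears in the multiset of Q
• Each x can’t contribute more than that number to the intersection size
• When a node v is enqueued and a node u is evicted: let w be the lowest common ancestor (LCA) of u and v; dequeue u, decrement counter(w), enqueue v
• The queue length always stays the same
[Figure: a tree in which w is the LCA of u and v, next to a queue from which u is dequeued and into which v is enqueued]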
33
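The counter bookkeeping can be sketched as below, under simplifying assumptions that are ours and not the paper's ProfileSimSearch: the local multiset handling is omitted (every node, including the subtree root, contributes its global multiset), queues store post-order indices, and the LCA is found by a naive walk up the stack of unfinished nodes instead of an O(1) LCA structure. `Node` and the function names are hypothetical.

```python
from collections import Counter, deque

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def counter_intersection_sizes(root, query_ms, global_ms):
    """Return [(node, counter value)] in post-order; no queue suffix is ever scanned."""
    queues = {x: deque([None] * n) for x, n in query_ms.items()}
    stack = []                     # one [first index, counter] frame per unfinished node
    results = []
    done = 0                       # number of nodes finished so far

    def visit(v):
        nonlocal done
        frame = [done, 0]
        stack.append(frame)
        for c in v.children:
            visit(c)
        for x, times in global_ms(v).items():
            queue = queues.get(x)
            if queue is None:
                continue
            for _ in range(times):
                u = queue.popleft()
                queue.append(done)             # enqueue v, identified by its index
                frame[1] += 1                  # counter(v) += 1
                if u is not None:
                    # LCA(u, v): the lowest unfinished node whose subtree contains u.
                    for f in reversed(stack):
                        if f[0] <= u:
                            f[1] -= 1          # counter(LCA(u, v)) -= 1
                            break
        results.append((v, frame[1]))
        stack.pop()
        if stack:
            stack[-1][1] += frame[1]           # add counter(v) to counter(v.parent)
        done += 1

    visit(root)
    return results

def bag(v):
    return Counter([v.label])

tree = Node("a", [Node("b", [Node("a"), Node("a")]), Node("a", [Node("b")])])
sizes = [n for _, n in counter_intersection_sizes(tree, Counter("aab"), bag)]
```

Only the stack of unfinished nodes and the queues are kept in memory, in line with the O(height(T)) remark on the slides. The naive LCA walk costs up to O(height(T)) per eviction, so the linear bound quoted for the real algorithm would require a constant-time LCA method.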
The ProfileSimSearch Algorithm
• Runtime:
– Linear in the multiset sizes for Q and T, plus an additive |T| log(k) term
(assuming O(1) calculation time for lowest common ancestors)
• Memory use:
– Linear in the query’s multiset size, in k, and in height(T)
• Runs in a single post-order pass over T
• Multisets of T’s nodes can be indexed in advance, for a
quick implementation
– If all multiset elements can be generated on-the-fly easily, no
such preprocessing is necessary
34
Experimentation
35
Setup
• State of the art for subtree similarity search:
– TASM-postorder [Augsten et al.]
– StructureSearch [Cohen]
– Both algorithms use tree edit distance, and not profile
distance functions
• We also compare performance with the
implementation of tree-to-tree distance using
pq-grams by Augsten et al.
36
Setup
• Data sets:
– DBLP (17.6 million nodes)
– XMark100 – XMark1800 (3.6 to 57.8 million nodes)
– Sprot (9.4 million nodes)
• Queries:
– Random subtrees from the data
• Extensive experimentation in paper
• In the next slides, all times are in seconds
37
Varying |Q| (Dataset: 14.5 million nodes)
Similar results were observed on all other datasets that were tested
38
Varying k (Dataset: 14.5 million nodes)
Similar results were observed on all other datasets that were tested
39
Varying Dataset Size
Different multiset-generating functions are compared here
40
Comparison with tree-to-tree pq-gram distance
• A MySQL-based implementation of the pq-gram
distance calculation routine given by Augsten et
al. is compared to ProfileSimSearch
• Note: ProfileSimSearch may output top-k
results, while the other algorithm is designed to
calculate pq-gram distance between two given
trees Q,T
• Both algorithms use an indexing stage over the
database tree T, which is not measured in the
following results
41
Comparison with tree-to-tree pq-gram distance
42
Conclusion and Future Work
• We presented a definition capable of expressing
a large general family of tree distance functions
• Efficient and scalable algorithm for subtree
search using this definition
– Can also be used for tree search with a large set of
trees
• Future Work:
– Use of upper bounds on subtree sizes or other
attributes, to prune search space
– Use a profile distance function to obtain bounds on
tree edit distance, and modify the algorithm to
calculate top-k using tree edit distance
43
Thanks!
Questions?
44