On the Optimality of the Holistic Twig Join Algorithm

Speaker: Byron Choi (Upenn)
Joint Work with Susan Davidson (Upenn),
Malika Mahoui (Upenn) and Derick Wood
(HKUST)
DIMACS Streaming Data Working
Group II
A Scenario
XML Doc. Server
  Sends streams of elements
Small devices
  Limited computing resources
  Memory is shared by many concurrent apps.
  Picking up useful elements on the fly
Background

The Model, Data Representation and
Assumptions
The Model

Data Streaming Model



Spend constant time to process each element
An element in a stream is either discarded or stored in the main memory once it is processed
Each element in the streams is seen only once
Node Representation


4-ary tuple: <preorder #, postorder #, depth, label>
Complexity of Desc, Child, Ances, Parent: O(1)

Desc(n1, n2) = true (n2 is a descendant of n1) if
n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder

Child(n1, n2) = true (n2 is a child of n1) if
n1.preorder < n2.preorder ∧ n1.postorder > n2.postorder ∧ n1.depth + 1 = n2.depth
Example Document
a1 (1, 9, 1, A)
  b1 (2, 7, 2, B)
    a2 (3, 6, 3, A)
      b2 (4, 4, 4, B)
      c1 (5, 5, 4, C)
  c2 (8, 8, 2, C)
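The Desc and Child predicates can be checked directly against the example document. The following is a minimal Python sketch; the `Node` type and the function names `is_desc` and `is_child` are assumptions introduced here for illustration and do not come from the talk.

```python
from typing import NamedTuple


class Node(NamedTuple):
    # Mirrors the 4-ary tuple <preorder #, postorder #, depth, label>.
    preorder: int
    postorder: int
    depth: int
    label: str


def is_desc(n1: Node, n2: Node) -> bool:
    """Desc(n1, n2): n2 is a descendant of n1. Two comparisons, so O(1)."""
    return n1.preorder < n2.preorder and n1.postorder > n2.postorder


def is_child(n1: Node, n2: Node) -> bool:
    """Child(n1, n2): Desc(n1, n2) holds and n2 is exactly one level deeper."""
    return is_desc(n1, n2) and n1.depth + 1 == n2.depth


# Tuples copied from the example document.
a1 = Node(1, 9, 1, "A")
b1 = Node(2, 7, 2, "B")
c2 = Node(8, 8, 2, "C")
a2 = Node(3, 6, 3, "A")
b2 = Node(4, 4, 4, "B")
c1 = Node(5, 5, 4, "C")

print(is_desc(a1, c1))   # True
print(is_child(a1, c1))  # False
print(is_child(a2, c1))  # True
print(is_desc(a2, c2))   # False
```

Ances and Parent follow by swapping the arguments, so all four tests stay constant time.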
Twig Queries

Syntax:
Step ::= / | //
NodeTest ::= symbol
Path ::= Step NodeTest | Step NodeTest Path
Twig ::= Path | Path (Twig, Twig, …, Twig)
Example
  // A (//B, //C)
  [Figure: query tree with root A and branches B and C]
  In English: find the A nodes that have a B descendant and a C descendant
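To make the grammar concrete, here is a small recursive-descent parser sketch in Python. The names `QueryNode`, `tokenize`, `parse_twig` and `parse` are hypothetical, and the tokenizer assumes that a NodeTest symbol is a run of word characters, which the slide does not specify.

```python
import re
from dataclasses import dataclass, field

SPECIAL = {"/", "//", "(", ")", ","}
TOKEN = re.compile(r"\s*(//|/|\(|\)|,|\w+)")


@dataclass
class QueryNode:
    axis: str                       # "/" (child edge) or "//" (descendant edge)
    label: str                      # the NodeTest symbol
    children: list = field(default_factory=list)


def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_twig(tokens, i=0):
    """Twig ::= Path | Path (Twig, ..., Twig). Returns (root, next index)."""
    root = last = None
    # Path ::= Step NodeTest | Step NodeTest Path
    while i < len(tokens) and tokens[i] in ("/", "//"):
        if i + 1 >= len(tokens) or tokens[i + 1] in SPECIAL:
            raise ValueError("expected a NodeTest after a Step")
        node = QueryNode(tokens[i], tokens[i + 1])
        if root is None:
            root = node
        else:
            last.children.append(node)
        last = node
        i += 2
    if root is None:
        raise ValueError("expected a Step")
    # Optional branches hang off the last node of the path.
    if i < len(tokens) and tokens[i] == "(":
        i += 1
        while True:
            child, i = parse_twig(tokens, i)
            last.children.append(child)
            if i < len(tokens) and tokens[i] == ",":
                i += 1
                continue
            break
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        i += 1
    return root, i


def parse(text):
    tokens = tokenize(text)
    root, i = parse_twig(tokens)
    if i != len(tokens):
        raise ValueError("trailing input after the twig")
    return root


q = parse("// A (//B, //C)")
print(q.axis, q.label, [(c.axis, c.label) for c in q.children])
# // A [('//', 'B'), ('//', 'C')]
```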
Twig Join Algorithms

Containment Join [Jiang et al.]
  Decompose a twig query into a set of steps
  Apply relational join algorithms to join the nodes of each step
  Use customized traditional indexes and estimation methods [SIGMOD03]
Path Join [Zhang et al.]
  Decompose a twig query into a set of paths
  Apply relational join algorithms to join the nodes of each path
Holistic Twig Join [Bruno et al.]

Evaluate the twig query as a whole
Twig Join Algorithms (cont’)


The first two approaches may compute large intermediate results and are not suitable for data streaming
In this talk we will focus on the third approach.

The TwigStack Algor. (Bruno et al. SIGMOD
02)
The TwigStack Algor.
(Overview)

Associate a stream to each NodeTest
  The nodes in the stream satisfy the NodeTest
Asymptotically optimal among the algorithms that read the entire input
  Scan the streams only once
  Spend constant memory only on the nodes that are useful, i.e. participate in at least one solution
Guarantee the optimality when the query contains descendant edges only
Suboptimal when the query contains some child edges
  Memory is spent on possibly useless nodes
Problem Statement

Given a twig query and the associated
streams, is it possible to find all
solutions …



By using a single forward scan of the
streams
By paying constant memory only to the
useful nodes
By spending constant time on processing
each node in the streams
Main Results So Far

Assume the data streaming model…



There is no optimal holistic twig join algorithm –
Theorem 1.
The evaluation of the twig queries is not memory
bounded – Theorem 1.
By relaxing some restrictions on the data
streaming model, we showed…

The lower bounds of such relaxed models are still
quite high – Theorem 2 and Theorem 3.
Outline





TwigStack By Examples
Offline Sorting
Multiple Scans
Discussion
Conclusion
TwigStack By Examples



Query: //A (//B, //C)
Document: [Figure: tree with nodes a1, b1, c2, a2, b2, c1]
Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]
pA, pB, pC are the anchors pointing to the “top” of the streams
Useful nodes are stored in the main memory and can be read later
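The streams listed on this slide can be reproduced by grouping the nodes by label and ordering each group by preorder number. The sketch below assumes that the walk-through uses the example document shown earlier and that each stream is sorted by preorder number; both are assumptions that happen to match the listed streams. `DOCUMENT` and `build_streams` are hypothetical names.

```python
from collections import defaultdict

# (name, preorder #, postorder #, depth, label), copied from the example
# document; deliberately listed out of order to show the sort matters.
DOCUMENT = [
    ("a1", 1, 9, 1, "A"),
    ("c2", 8, 8, 2, "C"),
    ("b1", 2, 7, 2, "B"),
    ("a2", 3, 6, 3, "A"),
    ("c1", 5, 5, 4, "C"),
    ("b2", 4, 4, 4, "B"),
]


def build_streams(nodes):
    """One stream per label; each stream is sorted by preorder number."""
    streams = defaultdict(list)
    for name, preorder, postorder, depth, label in sorted(nodes, key=lambda n: n[1]):
        streams[label].append(name)
    return dict(streams)


streams = build_streams(DOCUMENT)
print(streams)
# {'A': ['a1', 'a2'], 'B': ['b1', 'b2'], 'C': ['c1', 'c2']}
```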
TwigStack By Examples

Step 0
  pA -> a1, pB -> b1, pC -> c1
  a1 is useful, TA is advanced, pA -> a2
Step 1
  b1 is useful, TB is advanced, pB -> b2
TwigStack By Examples

Step 2
  a2 is useful, TA is advanced, pA -> null
Step 3
  b2 is useful, TB is advanced, pB -> null
TwigStack By Examples

Step 4
  c1 is useful, TC is advanced, pC -> c2
Step 5
  Printing
Step 6
  c2 is useful, TC is advanced, pC -> null
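As a cross-check on the walk-through, the solutions of //A (//B, //C) can be enumerated by brute force. This is not TwigStack: it is a reference enumeration over all stream combinations. It assumes the walk-through uses the example document from earlier, and the names `NODES`, `desc`, `solutions` and `useful` are hypothetical.

```python
from itertools import product

# name -> (preorder #, postorder #, depth, label), from the example document.
NODES = {
    "a1": (1, 9, 1, "A"),
    "b1": (2, 7, 2, "B"),
    "c2": (8, 8, 2, "C"),
    "a2": (3, 6, 3, "A"),
    "b2": (4, 4, 4, "B"),
    "c1": (5, 5, 4, "C"),
}


def desc(ancestor, descendant):
    """Desc test on (preorder, postorder), as defined on the node slide."""
    a, d = NODES[ancestor], NODES[descendant]
    return a[0] < d[0] and a[1] > d[1]


TA, TB, TC = ["a1", "a2"], ["b1", "b2"], ["c1", "c2"]

# A solution binds one node per NodeTest such that both edges hold.
solutions = [
    (a, b, c)
    for a, b, c in product(TA, TB, TC)
    if desc(a, b) and desc(a, c)
]
# A node is useful if it participates in at least one solution.
useful = sorted({node for solution in solutions for node in solution})

for solution in solutions:
    print(solution)
print(useful)  # ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']
```

Under these assumptions there are five solutions and every node takes part in at least one of them, which agrees with each step of the walk-through marking its node as useful.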
TwigStack By Examples



Query: //A (/B, /C)
Document: [Figure: tree with nodes a1, b1, a2, b2, c1, c2]
Streams: TA = [a1, a2], TB = [b1, b2], TC = [c1, c2]
TwigStack By Examples

Computation 1
  pA -> a1, pB -> b1, pC -> c1
  TA is advanced, pA -> a2, TB is advanced, pB -> b2
  a2 is useful (a1 is discarded)
Computation 2
  TC is advanced, pC -> c2
  a1 is useful
  a2 is useless because c1 is discarded
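The dilemma between the two computations can be illustrated by brute force. The sketch below assumes the document is the example document from earlier (the node labels match) and uses the hypothetical helper names `child` and `solutions`. It enumerates the solutions of //A (/B, /C) and then repeats the enumeration with a1 removed (as in Computation 1) and with c1 removed (as in Computation 2).

```python
from itertools import product

# name -> (preorder #, postorder #, depth, label), from the example document.
NODES = {
    "a1": (1, 9, 1, "A"),
    "b1": (2, 7, 2, "B"),
    "c2": (8, 8, 2, "C"),
    "a2": (3, 6, 3, "A"),
    "b2": (4, 4, 4, "B"),
    "c1": (5, 5, 4, "C"),
}


def child(parent, kid):
    """Child test: the Desc test plus a depth difference of exactly one."""
    p, k = NODES[parent], NODES[kid]
    return p[0] < k[0] and p[1] > k[1] and p[2] + 1 == k[2]


def solutions(ta, tb, tc):
    """Brute-force answers to //A (/B, /C) over the given streams."""
    return [
        (a, b, c)
        for a, b, c in product(ta, tb, tc)
        if child(a, b) and child(a, c)
    ]


TA, TB, TC = ["a1", "a2"], ["b1", "b2"], ["c1", "c2"]

print(solutions(TA, TB, TC))      # [('a1', 'b1', 'c2'), ('a2', 'b2', 'c1')]
print(solutions(["a2"], TB, TC))  # a1 discarded: [('a2', 'b2', 'c1')]
print(solutions(TA, TB, ["c2"]))  # c1 discarded: [('a1', 'b1', 'c2')]
```

Under these assumptions, discarding either a1 or c1 loses one of the two solutions, so with the anchors at a1, b1 and c1 neither node can be dropped safely without looking further ahead in the streams.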
TwigStack By Examples

The Extreme Case
  O(stream size)
  [Figure: example document with node labels a1 b1 c4, a1 b2 c3, a1 b3 c2, a1 b4 c1]
TwigStack Pseudo Code
We’ve only walked
through the red
boxes
Twig Queries over Streams

Theorem 1
  There is no optimal holistic twig join algorithm, no matter how the nodes are sorted.
    Memory must be spent on possibly useless nodes
  Given arbitrary streams, the memory requirement of exact algorithms is unbounded.
Proof of Theorem 1 (Sketch)




Fix a document
Issue a few queries: //A//B, /A (/A, /A)
and /A/A
Optimality implies certain constraints on
the streams
No single stream can satisfy all the
constraints
Proof of Theorem 1 (cont’)



Reduce a twig query to an SPJ query
The twig query is memory bounded iff the SPJ query is memory bounded.
Babcock et al. PODS 02
Variation 1: Offline Sorting



Pre-compute some intermediate results
and collect the results in a scan
Allow offline sorting on the nodes and
keep all the necessary sorted nodes
Allow the algorithm to scan the nodes
in the correct orderings
Motivation

The anchors are performing a depth-first traversal
  But why? How about an ordering in which recursions are removed?
[Figure: node diagrams with labels a1, a2, b1, b2, c1, c2]
The Lower Bound

The number of sortings that must be performed offline is high
  Data redundancy
  m is the number of structurally recursive labels in the document's DTD; d is the document depth
  The lower bound is $m^d$
We identify a restricted case in which DTDs help to reduce the lower bound
Variation 2: Multiple Scans


Massive storage (tapes, disks) naturally produces a stream of items.
Sequential scanning is a vital requirement of such storage
  Can only allow a small number of scans due to the high volume of data
The Lower Bound



Allow P scans on the data streams.
The lower bound on P is high
  $d^t$, where d is the document depth and t is the number of simple child-edge queries in a twig query
Discussion




Bruno et al. assign memory to possibly useless nodes and show by experiments that such a computation model is practical
No work on approximating the twig queries with provable guarantees
Constraints expressed in DTDs
Our work assumes a node representation in which ancestor, descendant, parent and child relationships can be determined in O(1)
Conclusion




The evaluation of twig queries in a data streaming context is tricky.
  It is not memory bounded.
  The optimal memory constraint cannot be satisfied in a single pass over the streams.
Need to look for other solutions.