Efficient Processing of XML Twig Patterns with Parent Child Edges

CIKM 2004
Washington D.C. U.S.A.
Efficient Processing of XML Twig
Patterns with Parent Child Edges: A
Look-ahead Approach
Jiaheng Lu, Ting Chen, Tok Wang Ling
National University of Singapore
Nov. 11. 2004
Outline
☞ XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm: TwigStackList
Performance
Conclusion
XML Twig Pattern Matching

An XML document is commonly modeled as a rooted, ordered and labeled tree.
[Figure: example XML tree for a book document, with preface, chapter, section, title, paragraph and figure elements and the text values “Intro”, “Data” and “XML”.]
Regional Coding


Node label [1]: (startPos : endPos, LevelNum)
E.g.
book (0:32, 1)
  preface (1:3, 2)
    “Intro” (2:2, 3)
  chapter (4:29, 2)
    section (5:28, 3)
      title (6:8, 4)
        “Data” (7:7, 5)
      section (9:17, 4)
        title (10:12, 5)
          “XML” (11:11, 6)
        paragraph (13:16, 5)
          figure (14:15, 6)
      section (18:23, 4)
        paragraph (19:22, 5)
          figure (20:21, 6)
      paragraph (24:27, 4)
        figure (25:26, 5)
  chapter (30:31, 2)
1. M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
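With region codes, structural relationships can be tested without walking the tree. The sketch below is only an illustration of that idea: the names `Region`, `is_ancestor` and `is_parent` are made up for this example, and the sample values are taken from the book example.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of one node: (startPos, endPos, LevelNum)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """True if a's interval strictly encloses d's interval."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """True if p encloses c and c is exactly one level deeper."""
    return is_ancestor(p, c) and c.level == p.level + 1


# Sample nodes from the book example.
book = Region(0, 32, 1)
chapter = Region(4, 29, 2)
section = Region(5, 28, 3)
title = Region(6, 8, 4)

print(is_ancestor(book, title))    # True: 0 < 6 and 8 < 32
print(is_parent(section, title))   # True: enclosed, and level 4 == 3 + 1
print(is_parent(chapter, title))   # False: enclosed, but two levels apart
```

An A-D edge in a query corresponds to `is_ancestor`, and a P-C edge to `is_parent`.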
What is a Twig Pattern?



A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
E.g. Select Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a child element Title.
Twig pattern: Section[/Title]/Paragraph//Figure
XML Twig Pattern Matching


Problem Statement
 Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
E.g. Consider Q1 and Doc1:
[Figure: twig pattern Q1 (Section, title, figure) and document Doc1 (s1, s2, t1, t2, p1, f1).]
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
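To make the problem statement concrete, here is a brute-force matcher that tries every combination of candidate elements. It is a baseline sketch only, not TwigStack or TwigStackList. The region codes in `DOC1` are hypothetical: the slide's figure is not available, so the tree shape s1(t1, s2(t2, p1(f1))) and the use of A-D edges in Q1 are assumptions chosen to be consistent with the three listed solutions.

```python
from itertools import product

# Hypothetical Doc1: name -> (tag, start, end, level). Assumed shape:
# s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1.
DOC1 = {
    "s1": ("section", 1, 12, 1),
    "t1": ("title", 2, 3, 2),
    "s2": ("section", 4, 11, 2),
    "t2": ("title", 5, 6, 3),
    "p1": ("paragraph", 7, 10, 3),
    "f1": ("figure", 8, 9, 4),
}

# Q1 as query nodes plus edges (parent index, child index, axis).
# "//" is an A-D edge, "/" a P-C edge; both edges are assumed to be A-D.
Q1_NODES = ["section", "title", "figure"]
Q1_EDGES = [(0, 1, "//"), (0, 2, "//")]


def related(anc, desc, axis):
    """Check one query edge using region codes."""
    _, a_start, a_end, a_level = anc
    _, d_start, d_end, d_level = desc
    inside = a_start < d_start and d_end < a_end
    return inside and (axis == "//" or d_level == a_level + 1)


def naive_twig_match(doc, nodes, edges):
    """Enumerate all bindings of query nodes to elements that satisfy every edge."""
    candidates = [[name for name, rec in doc.items() if rec[0] == tag]
                  for tag in nodes]
    return [combo for combo in product(*candidates)
            if all(related(doc[combo[p]], doc[combo[c]], axis)
                   for p, c, axis in edges)]


for answer in naive_twig_match(DOC1, Q1_NODES, Q1_EDGES):
    print(answer)
```

The cost grows with the product of the stream sizes, which is what holistic algorithms are designed to avoid.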
Previous work: TwigStack

TwigStack [2]: a holistic approach
Two-phase algorithm:
 Phase 1 (TwigJoin): intermediate root-to-leaf path solutions are output
 Phase 2 (Merge): the intermediate path lists are merged to produce the final results
2. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Previous work: TwigStack


A node q in a twig pattern Q is associated with a stack Sq.
Insertion and deletion in a stack Sq:
Insertion: An element eq from stream Tq is pushed onto its stack Sq if and only if
  (i) eq has a descendant eqi in each stream Tqi, where qi is a child of q, and
  (ii) each such element eqi recursively satisfies property (i).
Deletion: An element eq is popped from its stack once all matches involving it have been output.
Sub-optimality of TwigStack



TwigStack is I/O optimal only for queries whose edges are all ancestor-descendant edges.
Unfortunately, TwigStack is sub-optimal for queries with any parent-child edge.
For such queries, TwigStack may output a large number of intermediate results that are not merge-joinable into any final solution.
Sub-optimality of TwigStack: an example
[Figure: a simple XML tree (s1, t1, p1, t2, f1) and the twig pattern (Section, title, paragraph, figure).]
Since s1 has descendants t1 and p1, and p1 in turn has descendant f1, TwigStack outputs the intermediate path solution <s1,t1>.
But this path is useless, because there is no solution for this example at all.
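The effect can be reproduced with a small sketch of the push test. With `honour_pc=False` the function applies the descendant-only condition used by TwigStack. With `honour_pc=True` it also enforces parent-child edges; this is an idealised check, not TwigStackList's look-ahead procedure. The tree shape s1(t1, p1(t2(f1))), the region codes, and the choice of which query edges are P-C are all assumptions, since the slide's figure is not available.

```python
# Hypothetical query: tag -> list of (axis, child tag).
# Only paragraph/figure is assumed to be a parent-child ("/") edge.
QUERY = {
    "section": [("//", "title"), ("//", "paragraph")],
    "paragraph": [("/", "figure")],
    "title": [],
    "figure": [],
}

# Hypothetical document: name -> (tag, start, end, level).
DOC = {
    "s1": ("section", 1, 10, 1),
    "t1": ("title", 2, 3, 2),
    "p1": ("paragraph", 4, 9, 2),
    "t2": ("title", 5, 8, 3),
    "f1": ("figure", 6, 7, 4),
}


def has_extension(name, honour_pc):
    """True if the element has a suitable descendant for every child query node,
    recursively. P-C edges are enforced only when honour_pc is True."""
    tag, start, end, level = DOC[name]
    for axis, child_tag in QUERY[tag]:
        found = False
        for other, (o_tag, o_start, o_end, o_level) in DOC.items():
            if o_tag != child_tag or not (start < o_start and o_end < end):
                continue  # wrong tag, or not a descendant
            if honour_pc and axis == "/" and o_level != level + 1:
                continue  # a descendant, but not a child
            if has_extension(other, honour_pc):
                found = True
                break
        if not found:
            return False
    return True


print(has_extension("s1", honour_pc=False))  # True: s1 is pushed, <s1,t1> is output
print(has_extension("s1", honour_pc=True))   # False: f1 is not a child of p1
```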
Main problem and our experiment




TwigStack might output intermediate results that are useless for the query answers.
To understand the problem better, we ran TwigStack on a real dataset.
Data set: TreeBank (from the University of Washington XML datasets)
Queries:
Q1: VP[/DT]//PRP_DOLLAR_
Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ
Q3: S[/JJ]/NP
All queries contain parent-child relationships.
Our experimental results

Query   Intermediate paths by TwigStack   Merge-joinable paths   Percentage of useless intermediate paths
Q1      10,663                            5                      99.9%
Q2      24,493                            49                     99.5%
Q3      70,967                            10                     99.9%

Most intermediate paths do not contribute to final answers due to parent-child edges!
It is a big challenge to improve TwigStack to answer queries with parent-child edges.
Intuition for improvement
[Figure: a simple XML tree (s1, t1, p1, t2, f1) and the twig pattern (Section, title, paragraph, figure).]
Our intuitive observation: why not read more paragraph elements and cache them in main memory?
For example, after scanning p1 we do not stop, but continue to read the next paragraph element. We then find that p1 is the only paragraph element and that f1 is not a child of it, so we should not output any intermediate solution.
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
☞ Our algorithm TwigStackList
Experimental results
Conclusion
Our main idea


Main idea: we read more elements from the input streams and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to a final answer.
But we cannot cache too many elements in main memory. For each node q in the twig query, the number of cached elements with tag q should not exceed the length of the longest path in the XML dataset.
Our caching method

What elements should be cached in main memory?
 Only those that might contribute to final answers
[Figure: a simple XML tree (s1, t1, p1, p2, p3, f1) and the twig pattern (Section, title, paragraph, figure).]
We only need to cache p1 and p3 in main memory. Why not p2?
Because if p2 contributed to a final answer, there would have to be an element before f1 that is a child of p2. But f1 is the first element, so p2 is guaranteed not to contribute to any final answer.
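One way to read this caching rule: scan the paragraph stream and keep only the elements whose region encloses f1. The sketch below illustrates that reading under stated assumptions. The function name `cache_ancestors` and the region codes are hypothetical, with p2 and p3 assumed to be children of p1 and f1 a child of p3.

```python
# Hypothetical paragraph stream, sorted by start position: (name, start, end).
PARAGRAPH_STREAM = [("p1", 2, 11), ("p2", 3, 4), ("p3", 5, 8)]
F1 = ("f1", 6, 7)  # first element of the figure stream


def cache_ancestors(stream, target):
    """Return the names of stream elements whose region encloses target.

    Elements that end before target starts (such as p2) are skipped, and the
    scan stops at the first element that starts after target.
    """
    _, t_start, t_end = target
    cached = []
    for name, start, end in stream:
        if start > t_start:
            break              # later elements start even later
        if end > t_end:
            cached.append(name)
    return cached


print(cache_ancestors(PARAGRAPH_STREAM, F1))  # ['p1', 'p3']
```

Every cached element is an ancestor of the same element, so the cached elements lie on a single root-to-leaf path. Their number therefore cannot exceed the length of the longest path in the document.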
Our criteria for pushing an element onto the stack
The criteria for pushing an element onto the stack are very important for controlling intermediate results. Why?
Because once an element is pushed onto the stack, it is ready to be output. So the fewer elements are pushed onto the stack, the fewer intermediate results are output.
Our criteria: given an element eq from stream Tq, before eq is pushed onto stack Sq, we ensure that
(i) eq has a descendant eq’ for each child q’ of q,
(ii) if (q, q’) is a parent-child relationship, eq’ has a parent with tag q on the path from eq to eqmax, where eqmax is the descendant of eq with the maximal start value and qmax is a child of q, and
(iii) each eq’ recursively satisfies the first two conditions.
Examples
[Figure: a simple XML tree (s1, t1, p1, p2, p3, f1) and the twig pattern (Section, title, paragraph, figure).]
Element p3 can be pushed onto the stack, but p1 and p2 cannot, because only p3 has f1 as a child.
Although p1 has f1 as a descendant, f1 is not a child of p1.


Our algorithm: TwigStackList
We propose a novel holistic twig join algorithm, TwigStackList, to evaluate a twig query.
Unique features of TwigStackList:
  It considers the parent-child edges in the query.
  There is a list for each query node to cache elements that are likely to participate in final solutions.
  It identifies a broader class of optimal queries.
TwigStackList can guarantee I/O optimality for queries with only ancestor-descendant edges connecting branching nodes and their children.
TwigStackList: an example
[Figure: an XML tree (Root, s1, s2, t1, t2, t3, p1, p2, p3, f1, f2), the twig pattern (Section, title, paragraph, figure), and the contents of the stacks and lists at each step.]
Step 1: Scan s1, t1, p1, f1.
Step 2: Since p1 is an ancestor of f1 but not its parent, we continue to scan p2 and put p1 into the list.
Step 3: Put p2 and p3 into the list; the cursor points to p3, because it is the parent of f2.
Step 4 (Merge): Output the intermediate solutions <s2,t3> and <s2,p3,f2>. Final: <s2,t3,p3,f2>
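The merge step joins path solutions that agree on their shared query nodes. The sketch below uses a plain nested loop for clarity; an actual implementation would more likely use a sort-merge join over sorted path lists. The function name `merge_paths` and its `shared_len` parameter are made up for this illustration.

```python
def merge_paths(left_paths, right_paths, shared_len=1):
    """Join two lists of root-to-leaf path solutions.

    Two paths are joinable when their first `shared_len` elements (the
    bindings of the query nodes they share) are identical.
    """
    merged = []
    for left in left_paths:
        for right in right_paths:
            if left[:shared_len] == right[:shared_len]:
                merged.append(left + right[shared_len:])
    return merged


title_paths = [("s2", "t3")]
figure_paths = [("s2", "p3", "f2")]
print(merge_paths(title_paths, figure_paths))  # [('s2', 't3', 'p3', 'f2')]

# A path such as ("s1", "t1") that has no partner on the other branch
# simply disappears in the merge: it was produced, stored and read for nothing.
print(merge_paths([("s1", "t1")], figure_paths))  # []
```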
TwigStackList vs. TwigStack
[Figure: an XML tree (Root, s1, s2, t1, t2, t3, p1, p2, p3, f1, f2) and the twig pattern (Section, title, paragraph, figure).]
TwigStackList is I/O optimal for the above query. In contrast, TwigStack is sub-optimal, because it outputs the “useless” path solution <s1,t1>.
Sub-optimality of TwigStackList
Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges connecting branching nodes and their children.
[Figure: a simple XML tree (s1, t1, s2, p1, and a possible later element p2) and the twig pattern (Section, title, paragraph).]
Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 onto the stack, so (s1,t1) is output as a useless solution.
 Here the behavior of TwigStackList is still reasonable, since we do not know whether s1 has a child p2 following p1 before we advance p1.
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm TwigStackList
☞ Experimental results
Conclusion
Experimental Setting
Pentium 4 CPU, 768 MB RAM, 2 GB disk
TreeBank
  Downloaded from the University of Washington XML datasets
  Maximal depth 36, 2.4 million nodes
Random
  Seven tags: a, b, c, d, e, f, g; uniformly distributed
  Fan-out of elements varied from 2 to 100, depth varied from 10 to 100
Performance against TreeBank

Queries with XPath expressions:
Q1   S[//MD]//ADJ
Q2   S/VP/PP[/NP/VBN]/IN
Q3   S/VP//PP[//NP/VBN]//IN
Q4   VP[/DT]//PRP_DOLLAR_
Q5   S[//VP/IN]//NP
Q6   S[/JJ]/NP

Number of intermediate path solutions for TwigStackList vs. TwigStack:

Query   TwigStack   TwigStackList   Reduction percentage   Useful paths
Q1      35          35              0%                     35
Q2      2957        143             95%                    92
Q3      25892       4612            82%                    4612
Q4      10663       11              99.9%                  5
Q5      702391      22565           96.8%                  22565
Q6      70988       30              99.9%                  10
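The reduction percentages above appear to follow (TwigStack - TwigStackList) / TwigStack, and an algorithm is optimal on a query when every path it outputs is useful. The sketch below recomputes both from the reported counts. The formula is an inference from the numbers rather than something stated on the slide, and rounding in the last digit may differ slightly from the reported figures.

```python
# (TwigStack paths, TwigStackList paths, useful paths), copied from the table.
TREEBANK = {
    "Q1": (35, 35, 35),
    "Q2": (2957, 143, 92),
    "Q3": (25892, 4612, 4612),
    "Q4": (10663, 11, 5),
    "Q5": (702391, 22565, 22565),
    "Q6": (70988, 30, 10),
}


def reduction(twigstack: int, twigstacklist: int) -> float:
    """Percentage of TwigStack's intermediate paths avoided by TwigStackList."""
    return 100.0 * (twigstack - twigstacklist) / twigstack


def useless(produced: int, useful: int) -> int:
    """Number of intermediate paths that join into no final answer."""
    return produced - useful


for query, (ts, tsl, useful) in TREEBANK.items():
    wasted = useless(tsl, useful)
    status = "optimal" if wasted == 0 else f"{wasted} useless paths"
    print(f"{query}: reduction {reduction(ts, tsl):.1f}%, TwigStackList {status}")
```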
Performance analysis




We have three observations:
(1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.
(2) When the edges connecting branching nodes and their children are all ancestor-descendant edges, TwigStackList is optimal, but TwigStack is sub-optimal. See Q3, Q5.
(3) When the edges connecting branching nodes and their children include parent-child edges, both TwigStack and TwigStackList are sub-optimal. But TwigStackList typically outputs far fewer “useless” intermediate solutions (<5%) than TwigStack. See Q2, Q4, Q6.
Performance against random dataset
[Figure: the five twig queries (a) Q1, (b) Q2, (c) Q3, (d) Q4 and (e) Q5 over the tags a, b, c, d, e, f, g.]
The following table shows that, for all queries, TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results.

Query   TwigStack   TwigStackList   Reduction   Useful paths
Q1      9048        4354            52%         2077
Q2      1098        467             57%         100
Q3      25901       14476           44%         14476
Q4      32875       16775           49%         16775
Q5      3896        1320            66%         566
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm TwigStackList
Experimental results
☞ Conclusion
Conclusion





The previous algorithm, TwigStack, is sub-optimal for queries with parent-child edges.
We propose a new algorithm, TwigStackList, to address this problem.
TwigStackList broadens the class of queries with I/O optimality.
Experiments show that TwigStackList typically outputs far fewer useless intermediate results whenever the query contains parent-child edges.
We recommend using TwigStackList as a new holistic join algorithm to evaluate queries with parent-child edges.


Thank You!
Q&A