APWeb 2004 Hangzhou, China

Labeling and Querying Dynamic XML Trees
Jiaheng Lu and Tok Wang Ling
School of Computing
National University of Singapore
Contents
Introduction
Introduction to structural join
Introduction to labeling scheme
Our Methods
Preliminary definition
Group based prefix labeling scheme
Group based join algorithm
Our Experiments
Introduction to Structural Join
XML employs a tree-structured model for representing data.
An XML query can be decomposed into a set of basic
structural (parent-child or ancestor-descendant)
relationships between pairs of nodes.
a) XML source
<book title="XML">
  <allauthors>
    <author>John</author>
    <author>Tom</author>
  </allauthors>
  <year>2003</year>
  <chapter>
    <head>….</head>
    <section>…</section>
  </chapter>
</book>
b) XML tree: book has the children title (XML), allauthors (author John, author Tom), year (2003) and chapter (head, section).
c) XPath tree pattern over the nodes book, title, author, XML and John, with parent-child and ancestor-descendant edges.
d) Basic structural relationships: the pattern decomposed into pairs over the nodes book, title, XML, author and John.
Any node in the XML tree may be an element, an attribute, or a value of the XML source.
Labeling scheme
To perform a structural join, each node in an XML tree
is assigned a unique label.
We can determine the ancestor-descendant (or parent-child)
relationship between any two nodes from their labels alone.
The method of assigning the labels is called a labeling
scheme.
Range labeling scheme
In the range labeling scheme, the label of a node v is
a pair of numbers <a_v, b_v>: a_v is called the
start position and b_v the end position. A node v (<a_v, b_v>)
is an ancestor of u (<a_u, b_u>) iff a_v ≤ a_u ≤ b_u ≤ b_v. In other
words, range <a_u, b_u> is contained in range <a_v, b_v>.
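The containment test can be sketched in a few lines. The labels below are the ones from the range labeling example slide; the function name `is_ancestor`, the dictionary keys `Author1` and `Author2`, and the extra check that the two labels differ (so that a node is not reported as its own ancestor) are assumptions made for illustration.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (start position, end position)

def is_ancestor(v: Range, u: Range) -> bool:
    """Return True if v's range contains u's range: a_v <= a_u <= b_u <= b_v.

    The `v != u` guard is an added assumption so that a node is not
    treated as an ancestor of itself.
    """
    (av, bv), (au, bu) = v, u
    return v != u and av <= au <= bu <= bv

# Labels taken from the example slide (Author1/Author2 are illustrative keys).
labels: Dict[str, Range] = {
    "Book": (1, 12),
    "Title": (1, 2),
    "XML": (1, 1),
    "Allauthors": (3, 6),
    "Author1": (3, 4),
    "John": (3, 3),
    "Author2": (5, 6),
    "Tom": (5, 5),
    "Year": (7, 8),
    "2003": (7, 7),
    "Chapter": (9, 11),
    "Head": (9, 9),
    "Section": (10, 10),
}
```

Each test needs only four integer comparisons, which is why the relationship can be decided in constant time.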
Range labeling scheme
Book <1,12>
  Title <1,2>
    XML <1,1>
  Allauthors <3,6>
    Author <3,4>
      John <3,3>
    Author <5,6>
      Tom <5,5>
  Year <7,8>
    2003 <7,7>
  Chapter <9,11>
    Head <9,9>
    Section <10,10>
Range labeling scheme
Pros: The ancestor-descendant relationship can be decided in
constant time.
Cons: This method lacks flexibility. Inserting nodes can force
existing nodes to be renumbered. To get around this problem,
some papers propose leaving "gaps" between the
numbers of the leaves. However, if one part of the document
is heavily updated, the available numbers may still run
out and the tree needs to be renumbered. So this approach
cannot ultimately solve the problem.
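A small sketch shows how quickly a gap can be exhausted. It assumes a hypothetical insertion strategy (not taken from the slides) in which each new node receives the integer midpoint of the free interval and every insertion lands at the same spot, just after the left neighbor; the function name is illustrative.

```python
def inserts_before_renumbering(left: int, right: int) -> int:
    """Count how many nodes can be inserted right after `left` before no
    free integer remains between `left` and its current right neighbor.

    Assumption: each new node takes the midpoint of the free interval,
    and the next insertion happens between `left` and that new node.
    """
    count = 0
    while right - left > 1:          # at least one unused integer remains
        mid = (left + right) // 2    # position given to the new node
        right = mid                  # the new node becomes the right neighbor
        count += 1
    return count
```

Under these assumptions a gap of 1024 positions absorbs only 10 insertions at the same place, so a skewed update pattern still triggers renumbering.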
Prefix labeling scheme
Cohen et al. (PODS 2002) propose a simple prefix labeling scheme (1),
which avoids renumbering for any insertion sequence.
For any new node v, the label L(v) = L(u) + 1…10, that is, L(u) followed by i ones and a single zero,
where u is the parent of v and i is the number of already labeled children of u. The root
node is labeled with the empty string.
(1) Edith Cohen, Haim Kaplan, and Tova Milo. Labeling Dynamic XML Trees.
In ACM PODS 2002.
How to generate simple prefix label
L(v) = L(u) + 1…10 (i ones followed by a single zero)
u is the parent of v and i is the number of already labeled children of u.
Book ""
  Title "0"
  Authors "10"
    Author "100"
    Author "1010"
    Author "10110"
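The labeling rule can be sketched as follows. The `Node` class and its method names are illustrative assumptions; the ancestor test (a node is an ancestor exactly when its label is a proper prefix of the other label) follows from how the labels are built.

```python
class Node:
    """Tree node carrying a simple prefix label (illustrative helper class)."""

    def __init__(self, name: str, label: str = "") -> None:
        self.name = name
        self.label = label          # the root keeps the empty string
        self.children = []

    def add_child(self, name: str) -> "Node":
        i = len(self.children)      # number of already labeled children
        child = Node(name, self.label + "1" * i + "0")
        self.children.append(child)
        return child


def is_ancestor(u: Node, v: Node) -> bool:
    """u is an ancestor of v iff L(u) is a proper prefix of L(v)."""
    return u.label != v.label and v.label.startswith(u.label)


# Rebuild the example from the slide.
book = Node("Book")
title = book.add_child("Title")
authors = book.add_child("Authors")
a1 = authors.add_child("Author")
a2 = authors.add_child("Author")
a3 = authors.add_child("Author")
```

Existing labels never change when a child is appended, which is the property that removes the need for renumbering.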
Simple Prefix labeling scheme
Pros: Compared with other labeling schemes, such as the range
labeling scheme, the simple prefix scheme does not
need renumbering for any insertion sequence.
Cons: The index size is too large. The tight bound on the
size is O(N^2) in the worst case, where N is the
number of nodes in an XML tree.
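One way to see the quadratic growth is a tree in which a single parent has all the other nodes as children. The sketch below, with an illustrative function name, sums the label lengths for that case, assuming the root carries the empty label.

```python
def total_label_bits(n_children: int) -> int:
    """Total label length when one root (empty label) has n_children children.

    The i-th child (counting from 0) gets i ones followed by a zero,
    so its label has i + 1 bits.
    """
    labels = ["1" * i + "0" for i in range(n_children)]
    return sum(len(label) for label in labels)
```

The sum is 1 + 2 + ... + n = n(n + 1) / 2, so 1000 children already need 500500 bits in total.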
Group
Definition: Given an XML tree T, a group is a set of subtrees whose
root nodes all have a common parent node in T.
[Figure: example XML tree with nodes A through J, marking a group formed by one subtree and a group formed by two subtrees.]
Group
Property 1: Given an XML tree T and a group S, for any
node n ∈ T with n ∉ S, one of the following two conditions
must be satisfied: (1) n is an ancestor of all nodes in S;
(2) n is not an ancestor of any node in S.
In other words, it is impossible for n to be an ancestor of only
some of the nodes in S.
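Property 1 can be checked by brute force on a small tree. The tree below is hypothetical (it is not the one drawn on the slide), and the helper names are illustrative; the tree is stored as a child-to-parent map.

```python
from typing import Dict, Iterable, Set

# Hypothetical tree: A is the root; B and C are its children, and so on.
parent: Dict[str, str] = {
    "B": "A", "C": "A", "D": "B", "E": "B", "H": "D", "I": "E", "F": "C",
}

def ancestors(parent: Dict[str, str], n: str) -> Set[str]:
    """All proper ancestors of n."""
    out = set()
    while n in parent:
        n = parent[n]
        out.add(n)
    return out

def subtree(parent: Dict[str, str], root: str) -> Set[str]:
    """The root plus every node that has root as an ancestor."""
    nodes = set(parent) | set(parent.values())
    return {n for n in nodes if n == root or root in ancestors(parent, n)}

def holds_property1(parent: Dict[str, str], roots: Iterable[str]) -> bool:
    """Check that each outside node is an ancestor of all or none of the set."""
    nodes = set(parent) | set(parent.values())
    members = set().union(*(subtree(parent, r) for r in roots))
    for n in nodes - members:
        covered = {m for m in members if n in ancestors(parent, m)}
        if covered and covered != members:
            return False
    return True
```

With roots D and E, which share the parent B, the check succeeds. With roots D and F, which have different parents, node B is an ancestor of only part of the set, so the check fails; this illustrates why the definition requires a common parent.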
GRP Labeling Scheme
The Group based Prefix (GRP) labeling scheme associates
each node n in the XML tree with a pair
<groupID, prefix-label>, where groupID is a non-negative
integer and prefix-label is a binary string. All nodes in
the same group have the same groupID and are
distinguished by their prefix-labels.
GRP label example:
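Group 1: Root <1,"0"> with nodes A and B, which carry the labels <1,"00"> and <1,"010">.
Group 2: nodes C, D and E, which carry the labels <2,"0">, <2,"10"> and <2,"110">.
In this example, each group contains at most three nodes.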
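A GRP label can be modeled as a small pair type. The sketch below assumes, by analogy with the simple prefix scheme, that two nodes in the same group are in an ancestor-descendant relationship exactly when one prefix-label is a proper prefix of the other; the names `GRPLabel` and `intra_group_ancestor` are illustrative.

```python
from typing import NamedTuple

class GRPLabel(NamedTuple):
    group_id: int   # non-negative integer shared by all nodes of a group
    prefix: str     # binary string distinguishing nodes inside the group

def intra_group_ancestor(a: GRPLabel, d: GRPLabel) -> bool:
    """Ancestor test that is valid only for two nodes of the same group.

    Assumption: within a group, ancestry is decided by the proper-prefix
    test. Nodes of different groups need extra information about how the
    groups are connected, so this function returns False for them.
    """
    return (
        a.group_id == d.group_id
        and a.prefix != d.prefix
        and d.prefix.startswith(a.prefix)
    )

# Labels from the example slide.
root = GRPLabel(1, "0")
g1 = [GRPLabel(1, "00"), GRPLabel(1, "010")]
g2 = [GRPLabel(2, "0"), GRPLabel(2, "10"), GRPLabel(2, "110")]
```

Because the prefix-label restarts in every group, each label stays short even when the whole tree is large.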
Group based Structural Join (GRJ) Algorithm
The main idea of GRJ is to divide the join operations
into two classes: intra-group joins and
inter-group joins.
An intra-group join is a join among
elements in the same group.
An inter-group join is a join among
elements in different groups.
Intra-group Join
The intra-group join is easy to understand. There are two
alternative methods to perform it:
(i) Simply compare the prefix labels of every pair of elements to
identify their relationship, like a nested-loop join in an RDBMS.
(ii) A cleverer method first sorts the prefix-labels, then uses a
stack to cache the potential ancestors and scans the join
data only once, like a sort-merge join in an RDBMS.
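Method (ii) might look like the sketch below for the prefix-labels of a single group. It assumes that sorting binary prefix-labels as plain strings yields document order (a label sorts before every label it prefixes, and siblings "0", "10", "110" sort in insertion order) and that labels within each input list are unique; the function name is illustrative.

```python
from typing import List, Tuple

def stack_join(a_labels: List[str], d_labels: List[str]) -> List[Tuple[str, str]]:
    """Return (ancestor, descendant) pairs for prefix-labels of one group."""
    # Merge both lists; role 0 = descendant, 1 = ancestor. For equal labels
    # the descendant role sorts first, so a node never pairs with itself.
    events = sorted(
        [(lab, 1) for lab in a_labels] + [(lab, 0) for lab in d_labels]
    )
    stack: List[str] = []   # chain of nested ancestors of the current label
    pairs: List[Tuple[str, str]] = []
    for lab, is_ancestor_role in events:
        # Drop cached ancestors that are not prefixes of the current label.
        while stack and not lab.startswith(stack[-1]):
            stack.pop()
        if is_ancestor_role:
            stack.append(lab)
        else:
            pairs.extend((anc, lab) for anc in stack)
    return pairs
```

After sorting, each element is pushed and popped at most once, so the scan itself is linear in the input plus the output size, in contrast to the quadratic pairwise comparison of method (i).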
Inter-group Join algorithm
The key point of the inter-group join is to use a hash table to
cache the ancestor nodes of each group.
Each key of the hash table is the group ID of a group in the descendant set, and
the corresponding value is the parent element of this group.
Algorithm. GRJ algorithm
Input: ancestor list A and descendant list D
Output: pairs of ancestor-descendant elements
1. Scan the A and D lists once to assign every element to its group bucket;
2. Initialize DgroupHash as a hash table whose keys are the group IDs of Dlist and whose values are
   initialized as empty sets;
   /* DgroupHash caches the ancestors of each group in Dlist. */
3. For i := 2 to max group do
   /* Since group 1 contains only the root node, begin from group 2. */
4.   Output each element in set DgroupHash(i) as an ancestor of each element in Dlist of
     group i;
5.   Delete key i from DgroupHash;
6.   Perform intra-group join for group i;
7.   Perform inter-group join for group i (the join result is stored in the hash table DgroupHash);
8. End For
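The overall flow can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes a hypothetical `group_parent` map from each group ID to the GRP label of the common parent of that group's subtree roots, assumes that this parent always lies in a group with a smaller ID, and it pulls cached ancestors from the parent group instead of pushing join results forward as in step 7. The intra-group part uses the plain pairwise comparison.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Label = Tuple[int, str]  # (groupID, prefix-label)

def grj_sketch(
    a_list: List[Label],
    d_list: List[Label],
    group_parent: Dict[int, Label],  # hypothetical: group -> label of its parent node
) -> List[Tuple[Label, Label]]:
    # Step 1: assign every element to its group bucket.
    a_by_group: Dict[int, List[str]] = defaultdict(list)
    d_by_group: Dict[int, List[str]] = defaultdict(list)
    for gid, prefix in a_list:
        a_by_group[gid].append(prefix)
    for gid, prefix in d_list:
        d_by_group[gid].append(prefix)

    # Plays the role of DgroupHash: ancestors of *every* node in a group.
    cache: Dict[int, Set[Label]] = defaultdict(set)
    result: List[Tuple[Label, Label]] = []
    all_groups = sorted(set(group_parent) | set(a_by_group) | set(d_by_group))

    for gid in all_groups:
        # Inter-group part: by Property 1, an outside node is an ancestor of
        # the whole group iff it is the group's parent or one of its ancestors.
        if gid in group_parent:
            pgid, pprefix = group_parent[gid]
            cache[gid] |= cache[pgid]
            cache[gid] |= {
                (pgid, a) for a in a_by_group[pgid] if pprefix.startswith(a)
            }
        for d in d_by_group[gid]:
            for anc in sorted(cache[gid]):
                result.append((anc, (gid, d)))
            # Intra-group part: proper-prefix test inside the group.
            for a in a_by_group[gid]:
                if a != d and d.startswith(a):
                    result.append(((gid, a), (gid, d)))
    return result
```

For example, with the labels of the GRP example and the assumption that group 2 hangs below node <1,"00">, calling `grj_sketch([(1, "0"), (1, "010")], [(2, "0"), (1, "00")], {2: (1, "00")})` pairs <1,"0"> with both descendants and pairs <1,"010"> with neither.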
Experiment setup
Comprehensive experiments were conducted to study the
effectiveness and efficiency of the GRJ algorithm.
We use synthetic and real-life data, including XMARK, the IBM
XML generator, and DBLP.
Query performance
For the GRP scheme, we use the GRJ algorithm.
For the SP scheme, we first use the block nested loop (BNL)
algorithm. When labels are assigned directly
in insertion order, the element lists are usually
unsorted, so a more efficient algorithm cannot be used.
Experiment result: GRJ is much more efficient than BNL.
[Figure: elapsed time (sec) versus number of nodes (K) for the GRJ algorithm and the BNL algorithm.]
Query Performance
When special effort is taken to guarantee that the element
lists are sorted, for the SP labeling scheme we may use a more
efficient algorithm, called Stack-Tree-Desc (2), to perform the
structural join. Stack-Tree-Desc is like a sort-merge join in
an RDBMS.
Since the original Stack-Tree-Desc algorithm is based on the
range labeling scheme, we first modify it to use the SP
labeling scheme (the main idea is the same).
(2) D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel,
and Yuqing Wu. Structural Joins: A Primitive for Efficient XML Query
Pattern Matching. In ICDE 2002.
[Figure: elapsed time versus number of nodes in the join set (K) for SP-stack-tree-desc and GRJ.]
Query Performance
Interestingly, we find that although the GRJ algorithm needs to
scan the element lists twice and the Stack-Tree-Desc algorithm
scans them only once, GRJ still performs better
than Stack-Tree-Desc on large data.
Query Performance
This can be explained as follows: the Stack-Tree-Desc algorithm
is based on the SP labeling scheme, while GRJ is based on the GRP
scheme. Since SP labels are much larger than
GRP labels, accessing the GRP labels twice may
still take less time than accessing the SP labels once. As a result,
the GRJ algorithm outperforms the Stack-Tree-Desc algorithm.
This result shows the importance of label size.
------ End-------
Thank you !
Question and Answer