Efficient processing of path query with not-predicates on XML data
Enhua Jiao, Tok Wang Ling, Chee-Yong Chan
{jiaoenhu, lingtw, chancy}@comp.nus.edu.sg
Computer Science Department
School of Computing
National University of Singapore
Outline
• XML Basics
• Motivating example
• Naïve approach
• Our solutions:
– PathStack
– Imp-PathStack
• A performance study
• Conclusion and future work
XML basics
• Commonly modeled as ordered trees
– Tree nodes: elements and values.
– Parent-child node pairs: element - direct subelement, element – value.
<project>
  <supplier>
    <part>
      <color>‘blue’</color>
      <part>
        <color>‘red’</color>
      </part>
    </part>
  </supplier>
</project>
[Figure: XML data tree rooted at spj. Element nodes: project, supplier, part, color. Value nodes: ‘red’, ‘blue’, ‘yellow’.]
XML basics: node labeling scheme
• How do we determine the structural relationship
between two XML data nodes?
– i.e., parent-child, ancestor-descendant, and preceding-following relationships.
• Several labeling schemes have been proposed
– Each node in the XML data tree is represented by a label
derived from its position in the tree. The structural
relationship between two data nodes can then be
determined from their labels alone.
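As a minimal sketch of how labels can answer these questions, assume a region encoding in which each node carries a (start, end, level) triple. The slides do not say which labeling scheme is used, so the Label class, the helper functions and the sample label values below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    start: int  # position where the node opens, in document order
    end: int    # position where the node closes
    level: int  # depth in the tree (root = 1)


def is_ancestor(a: Label, d: Label) -> bool:
    # a contains d if d's region nests strictly inside a's region
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    # a parent is an ancestor exactly one level above
    return is_ancestor(p, c) and p.level + 1 == c.level


def precedes(x: Label, y: Label) -> bool:
    # x precedes y if x closes before y opens
    return x.end < y.start


# Hypothetical labels for a supplier, a part inside it, and that part's color
supplier = Label(start=2, end=11, level=2)
part = Label(start=3, end=10, level=3)
color = Label(start=4, end=5, level=4)
other_supplier = Label(start=12, end=15, level=2)

print(is_ancestor(supplier, color), is_parent(part, color), precedes(supplier, other_supplier))
```

Each test is a constant-time comparison on the two labels, so no tree traversal is needed.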
XML basics: XML path queries
• Building block of XML queries: the path query (PQ)
– Specifies a path pattern to be matched by paths in the XML
data tree:
• //project/supplier[.//part/color=‘red’]
– Search by value
• color=‘red’
• Easily supported by existing indices.
– Search by structure
• //project/supplier[.//part/color]
• The focus of current research.
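The two kinds of search can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch on a small hypothetical document: the id attributes and the sample data are assumptions made for the example. ElementTree supports only a subset of XPath, so the predicate is checked in Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the id attributes are added only to name the results.
doc = ET.fromstring("""
<spj>
  <project>
    <supplier id="s1"><part><color>red</color></part></supplier>
    <supplier id="s2"><part><color>blue</color></part></supplier>
    <supplier id="s3"><part/></supplier>
  </project>
</spj>
""")

suppliers = doc.findall(".//project/supplier")

# Search by structure: //project/supplier[.//part/color]
by_structure = [s.get("id") for s in suppliers
                if s.find(".//part/color") is not None]

# Structure plus value: //project/supplier[.//part/color='red']
by_value = [s.get("id") for s in suppliers
            if any(c.text == "red" for c in s.findall(".//part/color"))]

print(by_structure)
print(by_value)
```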
Motivating examples
• Current research focus: path queries without not-predicates
– //project/supplier[.//part/color=‘red’]
[Figure: the example data tree rooted at spj, with project, supplier, part and color elements and the values ‘red’, ‘blue’ and ‘yellow’.]
• Path queries with not-predicates: //project/supplier[not(.//part[./color=‘red’])]
• No solution has been proposed so far to process such queries.
Naïve approach
• Decompose //project/supplier[not(.//part[./color=‘red’])] into
– //project/supplier
– //project/supplier[.//part/color=‘red’]
• Make use of existing solutions.
• The answer is obtained by comparing the two result sets (a set difference).
[Figure: the example data tree rooted at spj, with project, supplier, part and color elements and the values ‘red’, ‘blue’ and ‘yellow’.]
• The same idea is applied recursively to path queries with recursive not-predicates.
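The decomposition can be sketched as follows. This is an illustration of the naïve idea on a small hypothetical document using Python's xml.etree.ElementTree, not the implementation used in the experiments; the id attributes, the sample data and the helper has_red_part are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; ids exist only to name the results.
doc = ET.fromstring("""
<spj>
  <project>
    <supplier id="s1"><part><color>red</color></part></supplier>
    <supplier id="s2"><part><color>blue</color></part></supplier>
    <supplier id="s3"><part/></supplier>
  </project>
</spj>
""")


def has_red_part(supplier):
    # .//part[./color='red'] evaluated by hand on one supplier subtree
    return any(color.text == "red"
               for part in supplier.iter("part")
               for color in part.findall("color"))


# Subquery 1: //project/supplier (first scan of the data)
all_suppliers = [s.get("id") for s in doc.findall(".//project/supplier")]

# Subquery 2: the same path with the predicate made positive (second scan)
red_suppliers = {s.get("id") for s in doc.findall(".//project/supplier")
                 if has_red_part(s)}

# Answer: set difference of the two intermediate results
answer = [sid for sid in all_suppliers if sid not in red_suppliers]
print(answer)
```

Even in this toy form the data is traversed twice and both intermediate results are materialized before the difference is taken.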
Naïve approach: problems
• High I/O
– XML data is scanned repeatedly.
– Writing/reading of intermediate results.
• High CPU
– Redundant processing of some structural relationships.
– Set difference computation.
• High memory space
– Storage of intermediate results.
Our Solution: PathStack
• Objectives
– XML data is scanned only once.
– No intermediate results.
– No redundant processing of structural
relationships.
– Run-time memory is bounded by the longest
path in the XML data tree.
PathStack: query definitions
• //project/supplier[not(.//part/color=‘red’)]
n1 : project
|
n2 :supplier
||
n3 : part
|
n4 : color
|
'red'
• ni: element tagname, where i indicates the nesting level of the element.
• Two query nodes are connected by “||” if they are in an ancestor-descendant
relationship, or by “|” if they are in a parent-child relationship.
• A negative edge represents a not-predicate (here, the edge between n2 and n3).
• Result: <project, supplier> pairs such that the project node is the parent of
the supplier node, and the supplier has no descendant part node with color ‘red’.
PathStack : satisfaction of subqueries
n1 : project
|
n2:supplier
||
n3: part
|
n4 : color
|
'red'
(a) query
(b) Data tree: [Figure: the example data tree rooted at spj, with project, supplier, part and color elements and the values ‘red’, ‘blue’ and ‘yellow’.]
PathStack : data structures
• Each query node ni: X is associated with a data stream Ti and a stack Si.
• Data stream (Ti): contains all data nodes in the XML data tree with
tagname = X, sorted in document order.
• Stack (Si):
– Let nj: Y be the query node that is the parent of the highest
negative edge.
– Regular stack: associated with query nodes ni with i<j.
• Stack item: <X, pointer to an item in Si-1>, where X is a data node.
– Boolean stack: associated with query nodes ni with i≥j.
• Stack item: <X, pointer to an item in Si-1, satisfy>, where X is a
data node and satisfy is a boolean variable indicating whether X
satisfies its corresponding subquery.
• Also denoted Sbooli.
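The assignment of stack types can be sketched in a few lines. The QueryNode and StackItem classes and the function stack_kinds are hypothetical names introduced for illustration; the query below is the chain A || B || C || D with the negative edge assumed to lie between B and C.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryNode:
    tag: str
    negated: bool = False  # True if the edge from the parent query node is negative


@dataclass
class StackItem:
    node: str                       # data node label, e.g. "B1"
    parent_index: int               # position of an item in S(i-1); -1 for S1
    satisfy: Optional[bool] = None  # only meaningful in boolean stacks


def stack_kinds(query):
    # j is the 1-based index of the parent of the highest negative edge
    j = next(i for i, q in enumerate(query, start=1) if q.negated) - 1
    return {f"S{i}": ("regular" if i < j else "boolean")
            for i in range(1, len(query) + 1)}


# Assumed query: A || B, negative edge B || C, then C || D
query = [QueryNode("A"), QueryNode("B"), QueryNode("C", negated=True), QueryNode("D")]
print(stack_kinds(query))
```

Under this assumption j = 2, so S1 is a regular stack and S2 to S4 are boolean stacks.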
PathStack : an example
(a) data tree: [Figure: tree over the nodes A1, B1, C1, D1, B2, E1 and C2.]
(b) query:
n1 : A
||
n2 : B
||
n3 : C
||
n4 : D
(c) associated streams:
T1: [A1]
T2: [B1, B2]
T3: [C1, C2]
T4: [D1]
(d) associated stacks:
S1: A1
Sbool2: B1, f
Sbool3: C1, t
Sbool4: D1, t
In (a), Ai, Bi, … are labels for elements with tagname ‘A’, ‘B’, …,
respectively; they distinguish elements that share the same tagname.
PathStack : key idea
• Visit the data nodes in the associated streams in document order.
• Pop the nodes in the stacks that do not lie on the same path as the
data node selected in the current round. Nodes must be popped from Si
in decreasing order of i.
• Let nj: Y be the query node that is the parent of the highest negative
edge.
• For each <X, satisfy> popped from Si:
– If i>j, determine whether nodes in Si-1 satisfy their
corresponding subqueries, based on the satisfy value of X and the edge
between query nodes ni-1 and ni.
– Else if i=j and satisfy=true, there is a potential answer, which can be
read from the stacks.
• Push the current node onto its corresponding stack Sk. If Sk is a boolean
stack, the current node’s satisfy value is initialized according to the
edge between nk and nk+1.
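The steps above can be sketched as a small single-pass evaluator. This is a simplified reading of the slide, not the authors' PathStack algorithm: it assumes region labels (start, end, level), distinct tags per query node, a linear query, and initial satisfy values of true for a leaf or a node with a negative child edge and false otherwise. All names, the tree shape and the label values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str   # display label, e.g. "B2"
    tag: str
    start: int  # region label: descendants nest inside (start, end)
    end: int
    level: int


@dataclass(frozen=True)
class QNode:
    tag: str
    axis: str = "//"       # edge to the parent query node: "//" or "/"
    negated: bool = False  # True if that edge is a negative edge


def path_stack_sketch(query, nodes):
    m = len(query)
    # 0-based index of the parent of the highest negative edge
    j = next(i for i, q in enumerate(query) if q.negated) - 1
    index_of = {q.tag: i for i, q in enumerate(query)}  # assumes distinct tags
    stacks = [[] for _ in query]  # item: [node, pointer into previous stack, satisfy]
    answers = []

    def axis_ok(i, anc, desc):
        return query[i].axis == "//" or anc.level == desc.level - 1

    def expand(i, item):
        # enumerate tuples (n1..ni) that end at this stack item
        if i == 0:
            return [(item[0].name,)]
        out = []
        for parent in stacks[i - 1][: item[1] + 1]:
            if axis_ok(i, parent[0], item[0]):
                out += [t + (item[0].name,) for t in expand(i - 1, parent)]
        return out

    def pop(i):
        item = stacks[i].pop()
        node, ptr, satisfy = item
        if not satisfy:
            return
        if i > j:
            # a satisfied match sets (positive edge) or clears (negative edge)
            # the satisfy flag of its matching ancestors in the previous stack
            for parent in stacks[i - 1][: ptr + 1]:
                if axis_ok(i, parent[0], node):
                    parent[2] = not query[i].negated
        elif i == j:
            answers.extend(expand(i, item))

    def clean(current_start):
        # pop, in decreasing stack order, every node that ended before current_start
        for i in range(m - 1, -1, -1):
            while stacks[i] and stacks[i][-1][0].end < current_start:
                pop(i)

    for node in sorted(nodes, key=lambda n: n.start):  # document order
        k = index_of.get(node.tag)
        if k is None:
            continue  # tag does not occur in the query
        clean(node.start)
        if k > 0 and not stacks[k - 1]:
            continue  # no candidate ancestor, so the node cannot contribute
        init = k == m - 1 or query[k + 1].negated
        stacks[k].append([node, len(stacks[k - 1]) - 1 if k else -1, init])
    clean(float("inf"))  # flush the stacks at end of input
    return answers


# Assumed tree shape: A1 > B1 > {C1 > D1, B2 > {E1, C2}}
nodes = [Node("A1", "A", 1, 14, 1), Node("B1", "B", 2, 13, 2),
         Node("C1", "C", 3, 6, 3), Node("D1", "D", 4, 5, 4),
         Node("B2", "B", 7, 12, 3), Node("E1", "E", 8, 9, 4),
         Node("C2", "C", 10, 11, 4)]
# Assumed query: A || B, negative edge B || C, then C || D
query = [QNode("A"), QNode("B"), QNode("C", negated=True), QNode("D")]
print(path_stack_sketch(query, nodes))
```

Each data node is visited once, and a node's satisfy value is final when it is popped because all of its descendants have already been popped from the higher-numbered stacks.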
PathStack : key idea (cont.)
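(a) data tree: [Figure: tree over the nodes A1, B1, C1, D1, B2, E1 and C2.]
(b) query:
n1 : A
||
n2 : B
||
n3 : C
||
n4 : D
(c) stack encoding: [Figure: animated contents of S1, Sbool2, Sbool3 and Sbool4 as satisfy values are updated; the stacks hold A1, B1, B2, C1/C2 and D1.]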
PathStack : key idea (cont.)
<A1, B2>
A1
B1
C1
D1
B2
E1
C2
(a) data tree
n1 : A
||
n2 : B
||
n3 : C
||
n4 : D
(b) query
answer
B2, t
Sbool 4
C2, f
B1, f
A1
Sbool 3
Sbool 2
S1
(c) stack encoding
16
Imp-PathStack: minimizing Number of
Boolean Stacks
• Boolean stacks are more costly to maintain than regular
stacks.
• Can we use less Boolean stacks to achieve the same result as
PathStack?
– Yes, only query node with negative child edge needs to be
associated with Boolean stack.
– The leaf node in query path: always true (virtual Boolean stack)
– Query node with positive child node: satisfy value can be
determined easily from the nodes in Sboolj, where nj is the nearest
descendant query node of ni that is associated with a (real or
virtual) boolean stack
17
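A minimal sketch of this rule, applied to the chain A || B || C || D with the negative edge assumed between B and C. The function name imp_stack_kinds and the tuple representation of the query are hypothetical.

```python
def imp_stack_kinds(query):
    """query: list of (tag, negated) pairs, where negated marks a negative
    edge from the parent query node. Returns one stack kind per query node."""
    kinds = []
    for i, (tag, _) in enumerate(query):
        is_leaf = i == len(query) - 1
        if not is_leaf and query[i + 1][1]:
            kinds.append("boolean")          # has a negative child edge
        elif is_leaf:
            kinds.append("virtual boolean")  # leaf: satisfy is always true
        else:
            kinds.append("regular")          # satisfy derived from a descendant stack
    return kinds


# Assumed query: A || B, negative edge B || C, then C || D
query = [("A", False), ("B", False), ("C", True), ("D", False)]
print(imp_stack_kinds(query))
```

For this query only n2 keeps a real Boolean stack, compared with three Boolean stacks (n2 to n4) under the PathStack rule.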
Imp-PathStack: optimizing Stack
Operations
• Some document nodes that do not affect the final results
are still pushed into stacks.
• Can we avoid pushing such nodes into stacks?
root
A1
A2
B1
B2 B3
C1
C2
D1
B5
B4
C4
C3
E1
A3
Not affecting the
satisfy value of
A1, A2 and A3,
can be skipped
n1 : A
||
n2 : B
||
n3 : C
D2
(a) data tree
(b) query1
18
Performance study: configurations
• The testbed: we implemented the Naïve approach, PathStack and
imp-PathStack in Java, using the file system as the storage engine.
Experiments were run on a 750 MHz UltraSPARC III CPU with 512 MB of
main memory and a 300 MB disk quota.
• Experimental dataset: Treebank.xml. It has a maximum depth of 35, an
average depth of 7.87, an average fan-out of 2.3, and about half a
million nodes.
• Experimental queries: 3 sets of path queries containing 1, 2 or
3 not-predicates (denoted Q1, Q2 and Q3, respectively) were
used in the experiment. Each query has around 152000 data nodes
in total (30% of the experimental dataset) in its associated streams
and 2000 nodes (0.4% selectivity) in its final result.
• Evaluation metrics
– Execution time
– Disk I/O: the total number of data nodes read from or written to disk.
Performance study: experiment queries
Q1
EMPTY/S//X/VP[not(NP/PP//JJR)]
EMPTY/S//X[not(VP/NP/PP//JJR)]
EMPTY/S[not(//X/VP/NP/PP//JJR)]
Q2
EMPTY/S//X/VP[not(NP/PP[not(//JJR)])]
EMPTY/S//X[not(VP/NP/PP[not(//JJR)])]
EMPTY/S[not(//X/VP/NP/PP[not(//JJR)])]
Q3
EMPTY/S//X/VP[not(NP[not(PP[not(//JJR)])])]
EMPTY/S//X[not(VP/NP[not(PP[not(//JJR)])])]
EMPTY/S[not(//X/VP/NP[not(PP[not(//JJR)])])]
Performance study: Naïve vs. PathStack
[Figure: two bar charts comparing Naive and PathStack on query sets Q1, Q2 and Q3. Execution time (sec), bar labels: 69, 51.1, 31.1, 21.3, 21.4, 21.4. Disk I/O (# of nodes, in thousands), bar labels: 480.2, 346.2, 193.9, 154.1, 153.9, 154.4.]
Observation: PathStack is more efficient than the Naïve approach.
The performance improvement increases with the number of not-predicates.
Why?
In the Naïve approach, the more not-predicates in the given query, the
more repeated scans of the associated streams are performed, and
the more intermediate results are generated.
Performance study: PathStack vs. imp-PathStack
[Figure: bar chart of execution time (sec) for sequential scan, imp-PathStack and PathStack on query sets Q1, Q2 and Q3. Bar labels: 20.7, 18.7, 21.3, 21.4, 21.4, 21.1, 20.2, 18.7, 18.7.]
Observation: imp-PathStack requires less execution time, but the
improvement is marginal. Why?
Execution time is dominated by I/O cost; CPU cost contributes only a small
portion of the overall execution time.
Because our implementation lacks index support, we still need to read
the entire associated streams of a query to determine which nodes
can be skipped, so the node-skipping step does not reduce I/O cost.
Performance study: PathStack vs. impPathStack
Stream Size * Nodes Skipped
% of skipping
(# of nodes)
(# of nodes)
Q1
152.1 k
10.2 k
6.7 %
Q2
152.1 k
3.6 k
2.4 %
Q3
152.1 k
28.1 k
18.5 %
* Stream size of each query set refers the total number of nodes
in the set of data streams of each query.
Observation: (1) percentage of nodes skipped is irrelevant to the
number of not-predicates in the query; (2) the percentage of
nodes skipped is not exciting. Why?
The experimental data set we used has a deeply nested structure
with low fan-out, our node skipping mechanism works well for
data set with high fan-out.
23
Conclusion and future work
• In this paper, we have
– Defined the representation and matching of path
queries with not-predicates.
– Proposed PathStack and its improved variant imp-PathStack.
– Implemented the naïve approach and our two solutions
to study their performance.
• For future work, we would like to extend our
algorithm to process more general twig queries
with not-predicates.