indexing - Missouri University of Science and Technology

Efficient Processing of XPath
Queries Using Indexes
Yan Chen1, Sanjay Madria1, Kalpdrum Passi2, Sourav Bhowmick3
1 Department of Computer Science, University of Missouri-Rolla,
Rolla, MO 65409, USA
madrias@umr.edu
2 Dept. of Math. & Computer Science, Laurentian University,
Sudbury ON P3E 2C6 Canada
kpassi@cs.laurentian.ca
3 School of Computer Engineering, Nanyang Technological
University, Singapore
assourav@ntu.edu.sg
Querying Semistructured Data
• Query languages to query semistructured data
– XQuery, XML-QL, XML-GL, Lorel, and Quilt
• Semistructured data is represented as a graph
• Queries on such data are expressed in the form of regular
path expressions
• XPath is a language that describes the syntax for addressing
path expressions over XML data
• Indexes on XML data improve the performance of
queries on large XML files
• Indexing techniques used in relational and object-oriented
databases do not suffice for semistructured data due to the
nature of the data
Indexing Semistructured Data
• Dataguides
– record information on the existing paths in a database
– do not provide any information about parent-child
relationships between nodes in the database
– as a result they cannot be used for navigation from any
arbitrary node.
• T-indexes
– specialized path indexes, which only summarize a
limited class of paths.
– 1-index and 2-index are special cases of T-indexes
Indexing Semistructured Data
• LORE
– Uses four different types of index structures - value,
text, link, and path indexes
– Value index and text index are used to search objects
that have specific values
– link index and path index provide fast access to parents
of an object and all objects reachable via a given
labeled path
– Lore uses OEM (Object Exchange Model) to store data
and Lorel, an extension of OQL (Object Query Language),
as its query language
Indexing Semistructured Data
• ToXin
– has two different types of index structure: the value
index and the path index.
– The path index has two parts: index tree and instance
functions, and these functions can be used to trace the
parent-child relationship.
– ToXin's path index contains only parent and child
information, whereas our model stores the complete
path from the root to each node.
– ToXin uses an index for a single level, while we use
multiple indexes for different levels
A Sample XML File
<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world">
    <ISBN>1-1-1</ISBN>
    <AUTHOR> David </AUTHOR>
  </BOOK>
  <BOOK title="Glory days">
    <ISBN>1-1-2</ISBN>
    <AUTHOR> Chris </AUTHOR>
  </BOOK>
  <BOOK title="I love the game">
    <ISBN>1-1-3</ISBN>
    <AUTHOR> Chris </AUTHOR>
  </BOOK>
  <BOOK title="What lies beneath">
    <ISBN>1-1-4</ISBN>
    <AUTHOR> Michael </AUTHOR>
  </BOOK>
  <BOOK title="Matrix II">
    <ISBN>1-1-5</ISBN>
    <AUTHOR> Jason </AUTHOR>
  </BOOK>
  <BOOK title="The Root">
    <ISBN>1-1-6</ISBN>
    <AUTHOR> Tomas </AUTHOR>
  </BOOK>
</BOOKSTORE>
XML as DOM Tree
[Figure: DOM tree of the sample XML file; &n is the identifier of a node]
&1 [BOOKSTORE: Benny-bookstore]
  &2 [BOOK: Brave the New World]
    &7 [ISBN:1-1-1]
    &8 [AUTHOR: David]
  &3 [BOOK: Glory days]
    &9 [ISBN:1-1-2]
    &10 [AUTHOR: Chris]
  &4 [BOOK: I love the game]
    &11 [ISBN:1-1-3]
    &12 [AUTHOR: Chris]
  &13 [BOOK: What lies beneath]
    &14 [ISBN:1-1-4]
    &15 [AUTHOR: Michael]
  &16 [BOOK: Matrix II]
    &17 [ISBN:1-1-5]
    &18 [AUTHOR: Jason]
  &19 [BOOK: The Root]
    &20 [ISBN:1-1-6]
    &21 [AUTHOR: Tomas]
Indexing XML Data - Motivation
• Retrieve all the books with the author’s name “Chris”
from the Benny-bookstore
– Without an index, we first need to find all BOOK nodes in the DOM
tree that are children of BOOKSTORE.
– Then, for each BOOK, we need to test the author’s name.
– After about 100,000 comparisons we get a couple of books
with author “Chris” as the output.
– By using an index on AUTHOR, we do not need to test the author of
each BOOK node.
– With “Chris” as the index key, we can find all matching author
nodes faster.
– The nodes obtained can then be checked to see whether they satisfy
the query condition.
– This is a “bottom-up” query plan.
– Such a plan is useful when we have a relatively
“small” result set at the bottom, which can be pre-selected
Indexing XML Data - Motivation
• Find all the books with the name beginning with
“glory” and the author “Chris”
– The query plan could be to get all the books with the
name “glory”, disregarding their authors.
– If only a small number of books satisfy the
constraint (e.g., four “glory” books), it might be useful to
introduce another type of index, which is built on the
values of some nodes.
– Here, we need an index on strings.
– On the basis of the nodes obtained in the first step, we can
then test the other condition of the query.
– Hence, we can build a set of nodes as the “entry set”,
which will depend on the specific query and on the type of
XML data
Types of Indexes
• Name-index (Nindex)
– A name-index locates nodes by their tag names
– The Nindex for the incoming tag <BOOK> over the XML fragment in figure
2 will then be {&2, &3, &4, &13, &16, &19}
• Value-index (Vindex)
– A value-index locates nodes with a given value
– The Value-index for the word “Chris” is {&10, &12}; for the word “the” it is
{&2, &4}
• Path-index (Pindex)
– A path-index locates nodes by their path from the root node
– The path index is the information we attach to each node to record its
ancestors’ path
– In the DOM tree, the path information of node &11 is {&1, &4}; that of node &7 is {&1, &2}
• Descent Number (DN)
– The Descent Number is the information we attach to every node to record the
number of its descendants.
– In the DOM tree, the DN of node &11 is 0; the DN of node &3 is 2
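The four structures above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the node table is a hand-written subset of the sample DOM tree, the ids of nodes not named in the examples (such as &8) are assumptions, and splitting values into lower-cased words for the Vindex is also an assumption.

```python
from collections import defaultdict

# (id, tag, value, parent id): a hand-written subset of the sample DOM tree.
NODES = [
    (1, "BOOKSTORE", "Benny-bookstore", None),
    (2, "BOOK", "Brave the new world", 1),
    (3, "BOOK", "Glory days", 1),
    (4, "BOOK", "I love the game", 1),
    (7, "ISBN", "1-1-1", 2),
    (8, "AUTHOR", "David", 2),
    (9, "ISBN", "1-1-2", 3),
    (10, "AUTHOR", "Chris", 3),
    (11, "ISBN", "1-1-3", 4),
    (12, "AUTHOR", "Chris", 4),
]

parent = {nid: p for nid, _, _, p in NODES}

nindex = defaultdict(set)  # tag name -> node ids
vindex = defaultdict(set)  # lower-cased word -> node ids (assumed tokenization)
for nid, tag, value, _ in NODES:
    nindex[tag].add(nid)
    for word in value.lower().split():
        vindex[word].add(nid)

def ancestors(nid):
    """Ancestor ids ordered from the root down to the parent."""
    chain = []
    p = parent[nid]
    while p is not None:
        chain.append(p)
        p = parent[p]
    return chain[::-1]

# Pindex: the ancestor path attached to every node.
pindex = {nid: ancestors(nid) for nid in parent}

# DN: every node adds one to the count of each of its ancestors.
dn = {nid: 0 for nid in parent}
for nid in parent:
    for a in pindex[nid]:
        dn[a] += 1

print(sorted(vindex["chris"]))  # [10, 12]
print(pindex[11], dn[3])        # [1, 4] 2
```

Because only three of the six books are in the table, the Nindex for BOOK here is {2, 3, 4} rather than the full set listed above.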
Example for XPath Queries
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
Data Model for XPath
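[Figure: XPath data model tree. The root has a processing instruction, a comment, and the root element bib as children; bib contains book elements, whose children include publisher (Addison-Wesley) and author (Serge Abiteboul).]
Much like the XQuery data model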
XPath: Simple Expressions
/bib/book/year
Result: <year> 1995 </year>
<year> 1998 </year>
/bib/paper/year
Result: empty
(there were no papers)
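The two expressions can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is a small sketch on an abridged copy of the bib document. ElementTree evaluates paths relative to the element it is called on, so /bib/book/year is written here as book/year from the bib root.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bib example (some authors omitted).
BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""

root = ET.fromstring(BIB.strip())  # root is the <bib> element

years = [y.text for y in root.findall("book/year")]  # /bib/book/year
papers = root.findall("paper/year")                  # /bib/paper/year

print(years)   # ['1995', '1998']
print(papers)  # []
```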
Entry-point Technique
• We find an entry-point node among a set of middle level
nodes in the XPath expression.
• Then we split the XPath expression at the entry-point and
test for the path condition for the first part and eliminate
nodes from DOM tree that do not satisfy the path
condition.
• Then we test the remaining part of the XPath expression
recursively eliminating nodes that do not satisfy the path
condition.
• The algorithm can be implemented using either a top-down
or a bottom-up approach
Entry-point Technique – An Example
Select BOOKSTORE/BOOK
where BOOK.name = “Glory days” and /AUTHOR.title =
“Chris” and BOOKSTORE.name = “Benny-bookstore”
• The above query is transformed into the following XPath
expression
/BOOKSTORE[name = "Benny-bookstore"]/child::BOOK[title
= "Glory Days"]/child::AUTHOR/child::FIRSTNAME[name = "Chris"]
• Use the Nindex to get all BOOK nodes or AUTHOR nodes
Entry-point Technique – An Example
• In the first strategy, get all books named “Glory Days” and then test
each one of them for whether the author is “Chris”
/BOOKSTORE[name = "Benny-bookstore"]/child::BOOK[title
= "Glory Days"]
• Then, we test each author child node, which is the latter part
of the XPath expression
/child::AUTHOR/child::FIRSTNAME[name = "Chris"]
• In the second strategy, first get all authors named “Chris”, and
then test the parent nodes to see whether the book name is “Glory Days”
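The two strategies can be compared on a shortened copy of the sample bookstore. This is a sketch under assumptions: the sample file has no FIRSTNAME element, so the author test is applied to the AUTHOR text directly, and the function names top_down and bottom_up are illustrative only. ElementTree keeps no parent pointers, so a child-to-parent map is built first.

```python
import xml.etree.ElementTree as ET

STORE = """<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world"><ISBN>1-1-1</ISBN><AUTHOR>David</AUTHOR></BOOK>
  <BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
  <BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
</BOOKSTORE>"""

root = ET.fromstring(STORE)
# Child -> parent map, needed for the bottom-up direction.
parent_of = {child: p for p in root.iter() for child in p}

def top_down(title, author):
    """Enter at BOOK: select by title, then test each AUTHOR child."""
    books = [b for b in root.iter("BOOK") if b.get("title") == title]
    return [b for b in books
            if any(a.text.strip() == author for a in b.findall("AUTHOR"))]

def bottom_up(title, author):
    """Enter at AUTHOR: select by value, then test the parent BOOK."""
    authors = [a for a in root.iter("AUTHOR") if a.text.strip() == author]
    return [parent_of[a] for a in authors
            if parent_of[a].get("title") == title]

for plan in (top_down, bottom_up):
    print(plan.__name__, [b.findtext("ISBN") for b in plan("Glory days", "Chris")])
```

Both plans return the same book. They differ in how many candidates are tested: one BOOK for the top-down plan, two AUTHOR nodes for the bottom-up plan.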
Entry-point Root-first Algorithm
INPUT: XPath expression root/X1/X2/…/Xi/…/Xm
STEP 1: FOR each Xi
  BEGIN
    IF Xi is indexed THEN
    BEGIN
      get every node xi of type Xi
      get the DN ni of each xi
      Sumi = sum of ni over all xi
    END
  END
STEP 2: Get entry point Xn with minimum Sum,
  add all xn to a node set S;
  Consider the tree obtained after deleting all
  branches that do not have the node xn in its
  path.
  split the XPath into root/X1/X2/…/Xn-1 and
  /Xn+1/…/Xm by the entry point Xn;
STEP 3: FOR each node xn in S
  BEGIN
    IF the path starting from root to node xn
    is not included in the path
    root/X1/X2/…/Xn-1/Xn
    THEN
      delete the sub tree that does not
      satisfy the path condition
  END
STEP 4: FOR each node xn in S, consider all sub
  trees starting with xn
  BEGIN
    IF Xn+1/…/Xm is same as /Xm
    THEN return nodes Xm
    ELSE INPUT = Xn/Xn+1/…/Xm
    GO TO STEP 1
  END
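The steps of the algorithm can be sketched in Python as follows. This is a simplified illustration, not the authors' code. It handles child steps only. Step 4 walks the remaining steps directly instead of re-entering Step 1. The Node class, the function names, and the small test tree are all hypothetical.

```python
class Node:
    def __init__(self, tag, *children):
        self.tag, self.parent, self.children = tag, None, list(children)
        for c in children:
            c.parent = self

def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)

def root_path(n):
    """Tag names from the root down to n."""
    tags = []
    while n is not None:
        tags.append(n.tag)
        n = n.parent
    return tags[::-1]

def entry_point_query(root, steps, indexed):
    # Name index: tag -> nodes.
    nindex = {}
    for n in [root, *descendants(root)]:
        nindex.setdefault(n.tag, []).append(n)
    # STEP 1: sum the descent numbers of the nodes of every indexed step.
    sums = {t: sum(len(list(descendants(x))) for x in nindex.get(t, []))
            for t in steps if t in indexed}
    # STEP 2: the entry point is the step with the minimum sum.
    entry = min(sums, key=sums.get)
    k = steps.index(entry)
    # STEP 3: keep entry nodes whose root path matches the first part.
    survivors = [x for x in nindex.get(entry, [])
                 if root_path(x) == steps[:k + 1]]
    # STEP 4 (simplified): evaluate the remaining steps below each survivor.
    frontier = survivors
    for tag in steps[k + 1:]:
        frontier = [c for x in frontier for c in x.children if c.tag == tag]
    return entry, frontier

tree = Node("A",
            Node("B", Node("C", Node("E", Node("H")))),
            Node("B", Node("D", Node("E", Node("H")))))
entry, hits = entry_point_query(tree, ["A", "B", "C", "E", "H"],
                                indexed={"B", "E"})
print(entry, [root_path(h) for h in hits])
```

In this toy tree the B nodes have a total DN of 6 and the E nodes a total of 2, so E becomes the entry point. The E node below D is discarded in Step 3.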
Example – Entry-point Root-first
Algorithm
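[Figure: example DOM tree rooted at A. The indexed nodes are annotated with their descent numbers: B (14), B (17), E (8), E (4), E (6). The remaining nodes are labeled C, D, F, G, H, and I.]
XPath: A/B/C/E//H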
Example – Entry-point Root-first
Algorithm
• Step 1: calculate the descent numbers (DN) of the
nodes that have indexes
• Total DN of the B nodes = 14 + 17 = 31
• Total DN of the E nodes = 8 + 4 + 6 = 18
• Entry-point = node E (minimum DN)
Example – Entry-point Root-first
Algorithm
• Step 2: Delete the branches that do not have E
[Figure: the tree after deleting the branches that do not contain an E node.]
The XPath is split into A/B/C/E and E//H
Example – Entry-point Root-first
Algorithm
• Step 3: test A/B/C/E on each E node and discard the
rightmost subtree with node E
[Figure: the two remaining subtrees rooted at E, with their F, G, and H descendants.]
• Step 4: evaluate E//H on each E and finally we get the
three H nodes
• Cost: O(N), where N is the number of nodes
Rest-tree Conception
• Performance deterioration in the Entry-point algorithm
– Find books written by “David” where the title of the book
contains the word “book”
– The XML file might have hundreds of books with the
word “book” in the title, and
– there might also be a large number of books by the author
“David”, but only one of them has the word “book” in its
title
– The Entry-point algorithm first eliminates all the nodes that
do not have the word “book” in their titles.
– Then it eliminates the nodes that do not have “David” as the
author
– Due to the relatively large number of instances at the two levels,
a large number of eliminations is required
Rest-tree Conception
• The Rest-tree of a node is the tree formed by a node
that meets a certain condition at its level, along with
its descendant and ancestor nodes
• In the example, the Rest-tree of the <BOOK> node that
satisfies the condition of having the word “glory” in its
title is as shown
&1 [BOOKSTORE: Benny-bookstore]
  &3 [BOOK: Glory days]
    &9 [ISBN:1-1-2]
    &10 [AUTHOR: Chris]
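The definition can be written as a small set computation. This is a sketch under assumptions: the child-to-parent table covers only part of the sample DOM tree, ids not shown in the examples (such as &8) are assumed, and rest_tree is an illustrative name.

```python
# child id -> parent id for a subset of the sample DOM tree (some ids assumed).
PARENT = {1: None, 2: 1, 3: 1, 4: 1, 7: 2, 8: 2, 9: 3, 10: 3, 11: 4, 12: 4}

def ancestors(nid):
    out = set()
    p = PARENT[nid]
    while p is not None:
        out.add(p)
        p = PARENT[p]
    return out

def descendants(nid):
    return {n for n in PARENT if nid in ancestors(n)}

def rest_tree(nid):
    """The node itself plus all of its ancestors and descendants."""
    return ancestors(nid) | {nid} | descendants(nid)

print(sorted(rest_tree(3)))   # [1, 3, 9, 10]  (BOOK "Glory days")
print(sorted(rest_tree(10)))  # [1, 3, 10]     (AUTHOR "Chris" under &3)
```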
Rest-tree Conception
• First employ the Entry-point algorithm to find all nodes that meet
the condition statements at each level
• The final result will then be the intersection of the Rest-trees
of these nodes
• In practice, we do not need to find the Rest-tree of every node
satisfying the condition.
• Only a small set of nodes is left after applying the Entry-point
algorithm
• So we need to find the Rest-trees of a relatively small set of
nodes within a small sub tree
• To get the intersection of the Rest-trees, note that the nodes that
satisfy the query condition and that have the minimum
number of descendants are available from the Entry-point
algorithm
Rest-tree Conception
• The level with the minimum number of candidate nodes (the
minimum level) is the anchor level of the Rest-tree algorithm.
• We just need to intersect the Rest-trees at this
minimum level.
• For example, after the first step of the Entry-point
algorithm, we know there are 2000 nodes at Level
A that meet condition A, 1000 nodes at Level B
that meet condition B, 200 nodes at Level C, 3000
at Level D, and 400 at Level E.
• The minimum level is C and the order of the levels
is C->E->B->A->D
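The ordering in this example is a sort of the levels by their candidate counts. A minimal sketch, using the counts quoted above (variable names are illustrative):

```python
# Candidate nodes per level after the first step of the Entry-point algorithm.
counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}

order = sorted(counts, key=counts.get)  # fewest candidates first
anchor = order[0]                       # the minimum (anchor) level

print(anchor)            # C
print("->".join(order))  # C->E->B->A->D
```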
Rest-tree Conception
• Ancestor node information is available from the path-index
• Filter some nodes at Level C by checking the
grandparent node information of the 400 nodes at
Level E
• Similarly, we can filter some other nodes at Level
C by checking their parent node information against
the nodes at Level B.
• The intersection at Level C is completed by
checking the ancestor information of the Level D nodes.
• The final step is to get all the nodes that satisfy the
query requirement
Rest-tree Algorithm
INPUT: XPath expression root/X1/X2/…/Xi/…/Xm
STEP 1: FOR each Xi
  BEGIN
    IF Xi is indexed THEN
    BEGIN
      get every node xi of type Xi
      get the DN ni of each xi
      Sumi = sum of ni over all xi;
    END
  END
STEP 2: get entry point Xj with minimum Sum, add
  all xj to a node set Sj;
  get comparison point Xk with second minimum
  Sum, add all xk to a node set Sk;
STEP 3: IF Xj is closer to the root than Xk (j < k) THEN
    FOR each node xk in Sk
      IF none of its ancestors is in Sj THEN
        delete xk from Sk
  ELSE
    FOR each node xj in Sj
      IF none of its ancestors is in Sk THEN
        delete xj from Sj
STEP 4: FOR each node xj in Sj
  BEGIN
    IF the path starting from root to node
    xj is not included in the path
    root/X1/X2/…/Xj
    THEN
      delete the sub tree that does not
      satisfy the path condition
  END
STEP 5: FOR each node xj in Sj, consider all sub
  trees starting with xj
  BEGIN
    IF Xj+1/…/Xm is same as /Xm
    THEN return nodes Xm
    ELSE INPUT = Xj/Xj+1/…/Xm
    GO TO STEP 1;
  END
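Step 3 is the part that distinguishes the Rest-tree algorithm from the Entry-point algorithm, so here is a small sketch of it. It follows the reading used in the worked example for this algorithm, where E nodes without a C ancestor are deleted: the deeper of the two node sets is pruned against the shallower one. The function names, the level numbers, and the six-node tree are hypothetical.

```python
def ancestors(nid, parent):
    """Yield the ancestor ids of nid, nearest first (path-index content)."""
    p = parent[nid]
    while p is not None:
        yield p
        p = parent[p]

def prune(s_j, s_k, level_j, level_k, parent):
    """STEP 3: drop nodes of the deeper set with no ancestor in the other set."""
    if level_j < level_k:
        # Xj is above Xk: an xk survives only if one of its ancestors is in Sj.
        s_k = {x for x in s_k if any(a in s_j for a in ancestors(x, parent))}
    else:
        # Xk is above Xj: an xj survives only if one of its ancestors is in Sk.
        s_j = {x for x in s_j if any(a in s_k for a in ancestors(x, parent))}
    return s_j, s_k

# Hypothetical tree: A(1) -> B(2) -> C(3) -> E(4), and B(2) -> D(5) -> E(6).
parent = {1: None, 2: 1, 3: 2, 4: 3, 5: 2, 6: 5}

# Entry point E (level 3) with nodes {4, 6}; comparison point C (level 2).
s_e, s_c = prune({4, 6}, {3}, level_j=3, level_k=2, parent=parent)
print(s_e, s_c)  # {4} {3}
```

Node 6 is removed because its ancestors (5, 2, 1) contain no C node, so Step 4 only has to check the root path of node 4.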
Rest-tree Algorithm - Example
XPath - A/B/C/E//H
Step 1: Calculate DNs
[Figure: DOM tree rooted at A, with descent numbers on the indexed nodes: B (17), B (1), B (14); C (9), C (5), C (6); E (8), E (4), E (6). Other nodes are labeled D, F, G, H, I, and M.]
Rest-tree Algorithm - Example
Step 2: Minimum DN
DN of node B = 17 + 1 + 14 = 32
DN of node C = 9 + 5 + 6 = 20
DN of node E = 8 + 4 + 6 = 18
Entry point: E (minimum sum); comparison point: C (second minimum sum)
[Figure: the same DOM tree, annotated with the DNs of the B, C, and E nodes.]
Rest-tree Algorithm - Example
Step 3: Delete “E” nodes whose ancestor does not have “C”
[Figure: the DOM tree after Step 3. The node E (6) no longer appears; E (8) and E (4) remain.]
Rest-tree Algorithm - Example
Step 4: Delete the subtree that does not satisfy the path A/B/C/E
Step 5: Get all the nodes from E//H
[Figure: the remaining tree: A, B (17), B (14), C (9), C (5), E (8), E (4), and the F, G, and H nodes below the E nodes.]
Test Cases and Comparisons
• Size of DOM Tree
– Entry-point algorithm performs much better than the
traditional algorithm, taking less than one third of the
processing time of the traditional algorithm
[Figure: “DOM Tree Size”. Processing time (ms, axis from 0 to 700) against the number of nodes (in units of 10,000, from 1 to 8) for the XPath query //A20//C30//A80.]
Test Cases and Comparisons
• Result Nodes Set
– The processing time of the Entry-point algorithm
increases slightly with an increasing number of result nodes.
– This is partly due to the recursive function calls in
the Entry-point algorithm code
[Figure: “Result Nodes Set”. Processing time (ms, axis from 0 to 300) against the number of result nodes (in units of 10, from 1 to 3).]
Test Cases and Comparisons
• Tree Height
– The processing time of the three methods follows
the same trend as the height of the tree increases
[Figure: “Tree Height Increasing”. Processing time (ms, axis from 0 to 140) against tree height (0, 10, 16, 23).]
Test Cases and Comparisons
• Without Index on result nodes
– The traditional method performs very poorly, falling
into the no-index method category.
– However, the Entry-point algorithm still performs
well
[Figure: “Without index on result nodes”. Processing time (ms, axis from 0 to 300) against the number of nodes (in units of 10,000, from 1 to 3).]
Conclusions
• We proposed three types of indexes on XML data to execute
XPath queries efficiently.
• We proposed two algorithms that process and optimize XPath
queries using these indexes.
• We have also simulated both bottom-up and top-down
approaches
• Processing XPath queries using the Entry-point indexing
technique performs much better than traditional algorithms
with or without indexes