Validity-Sensitive Querying of XML Databases Slawomir Staworko Jan Chomicki

advertisement
Validity-Sensitive Querying of XML Databases
Slawomir Staworko
Jan Chomicki
Department of Computer Science and Engineering
University at Buffalo, SUNY
Motivation
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
Legacy XML databases
I
Database updates
Motivation
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
Legacy XML databases
I
Database updates
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
Motivation
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases
Database updates
proj
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Motivation
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases
Database updates
proj
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Motivation
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
Legacy XML databases
Database updates
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
emp
emp
name salary
name salary
John $80k
Mary
name salary
???
???
proj
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Positive Core XPath
Positive Core XPath Queries
I
text values but without attributes
I
all standard axes
I
tests with subexpressions and equality:
//*[A/B], //*[B//text()=‘abcd’]
I
I
N0
A
N1
N3
B
but no negation in tests:
//*[not A/B], //*[B/text()6=‘abcd’]
N2
B
abcd
and no functions
Tree Reachability Fact: (x, Q, y )
query Q
node x
object y
Query Answers
Basic facts use only /* and following sibling::
I
(N0 ,/*,N1 ), (N1 ,/*,N2 )
I
Other facts are inferred with (Horn) rules
(X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ).
(N0 ,/*/*,N2 ).
I
given query Q and document T
with the root node r
find all tree facts that hold in T
x in an answer to Q in T iff the
tree fact (r , Q, x) holds in T
Query Evaluation
Bottom-up approach
I
I
I
computing tree facts for query Q
tree facts for T1 , . . . , Tm
computed before
including inferred facts (involving
subqueries of Q)
Tree
N0
X
N1
X1
...
Ni−1
Xi−1
T1
Ti−1
Algorithm
I start with ∅
II for subtree Ti (i = 1, . . . , m)
1 add all facts of the subtree (obtained by
recursion)
2 add (N0 ,/*,Ni )
3 if i > 1 add (Ni−1 ,following sibling::*,Ni )
Ni
Xi
Ti
...
Nm
Xm
Tm
Editing operations
A
B
C
D
Editing operations
Cost: 3
Insert
A
B
C
Editing operations
I
Inserting a subtree
A
D
E
B
C
F
D
G
Editing operations
Cost: 3
Insert
A
B
C
Editing operations
I
I
Inserting a subtree
Deleting a subtree
Cost: 2
Delete
A
D
E
B
C
F
A
D
G
E
F
D
G
Editing operations
Cost: 1
H
B
Cost: 3
Modify
D
C
A
B
C
Editing operations
I
I
I
Insert
Inserting a subtree
Deleting a subtree
Modifying node’s label
Cost: 2
Delete
A
D
E
B
C
F
A
D
G
E
F
D
G
Edit Distance and Repairs
Distance between documents
dist(T , S) is the minimal cost of
transforming T into S
dist = 1
C
C
Distance to a DTD
A
dist(T , D) is the minimal cost of
repairing T w.r.t D i.e.,
B
B
B
B
min{dist(T , S)|S valid w.r.t D}
DTD
C → (A,B)*
A → EMPTY
B → EMPTY
Repair
T 0 is a repair of T w.r.t D iff
dist(T 0 , T ) = dist(T , D)
C
A
B
C
A
B
C
A
B
There can be an
exponential
number of repairs
A
B
Valid Query Answers
Valid Query Answers
XML Document
x is a valid answer to query Q in T w.r.t. D iff
x is an answer to Q in every repair of T w.r.t. D.
proj
name
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
proj
Pierogies
...
emp
emp
name salary name salary
Alice $90K Bob
Queries and Valid Answers
//proj[emp[1]/salary=’$90K’]/name/text()
//proj[name=’Pierogies’]/emp[1]/salary/text()
//proj[name=’Pierogies’]/emp[1]/name/text()
//proj[name=’Pierogies’]/emp[1]/salary
→
→
→
→
{Pierogies}
{$90K}
∅
∅
$90K
Trace graph
Read A
DTD
C
A
B
B
C → (A,B)*
A → EMPTY
B → EMPTY
Q0
Q1
Read B
A
B
B
Trace graph
Read A
DTD
C
A
B
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
Q0
Q1
Ins B
Read B
A
B
B
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
A
B
B
Q0
Q0
Q0
Q0
Q1
Q1
Q1
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
A
Q0
Q0
Re
Q1
B
ad
Q0
Re
Q1
B
ad
Q0
d
a
Re
Q1
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1
ad
Del
Q1
d
a
Re
Del
Q0
Ins B
Del
Re
Del
Q0
Ins B
ad
Del
Q0
Ins B
Ins B
Re
B
Ins A
Del
Q0
Q1
B
Ins A
A
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1 ,0
ad
Del
Q1 ,1
d
a
Re
Del
Q0 ,1
Ins B
Del
Re
Del
Q0 ,0
Ins B
ad
Del
Q0 ,1
Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1 ,0
ad
Del
Q1 ,1
d
a
Re
Del
Q0 ,1
Ins B
Del
Re
Del
Q0 ,0
Ins B
ad
Del
Q0 ,1
Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Del
Q1
Ins B
Read B
Compact representation of all repairs
Del
Q1 ,1
d
a
Re
Del
Ins A
Ins A
Ins A
Q1 ,0
ad
Q0 ,1
Repairing Paths:
Ins B
Del
Re
Del
Q0 ,0
Ins B
ad
Ins B
Ins B
Q1 ,1
Re
Del
Q0 ,1
Ins A
Del
Q0 ,0
Q1 ,2
I
(Read, Read, Del)
I
(Read, Read, Ins A, Read)
I
(Read, Del, Read)
Computing Valid Query Answers
Certain Tree Facts
Tree facts present in every repair of
a given tree
A
B
Q0 ,0
Q0 ,0
ad
Re
ad
Bottom-up approach
Q1 ,0
I
I
certain facts for all children
certain facts common for every
minimal tree satisfying DTD
Q0 ,1
ead
R
Q1 ,1
set of facts collected “so far”:
I
Intersection of the sets of all repairing paths
Del
For every repairing path, construct:
I
Obtain certain facts by
Del
Ins A
Re
Precomputed values:
B
I
I
start with ∅
Read adds certain facts of the corresponding
child
Del adds nothing
Ins A adds certain facts common for minimal
trees labeled with A
Eager intersection
Data complexity of VQA
Problem
Possibly an exponential number of paths
Computation of valid answers to positive
core XPath queries without joins can be performed in polynomial time in the size of the
document.
Solution: Eager Intersection
Data complexity of VQA with joins
For queries without tests of form (joins)
There exists a query with joins for which
computing VQA is co-NP-complete.
[Q/text() = P/text()].
Intersect all sets of certain facts for paths
sharing the same last adding operation
(Read/Ins).
Combined complexity of VQA
Combined complexity of computing valid
query answers is co-NP-complete.
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
base line
Validate regular automata
Dist
distance computation
Data generation
1. random valid document
2. removing and adding
random nodes
3. invalidity ratio
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
Parse
base line
Validate regular automata
Dist
distance computation
1. random valid document
2. removing and adding
random nodes
3. invalidity ratio
Dist
12
Time (sec.)
Data generation
Validate
9
6
3
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
10
20
30
Document size (MB)
40
50
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
base line
Validate regular automata
Dist
distance computation
1. random valid document
2. removing and adding
random nodes
Dist
60
DTD size |D|
90
6
Time (sec.)
Data generation
Validate
4
2
3. invalidity ratio
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
30
120
Experimental Results: Valid Query Answer Computation
Compared algorithms
QA base line
VQA eager intersection
Data generation
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
Experimental Results: Valid Query Answer Computation
Compared algorithms
QA
QA base line
VQA eager intersection
VQA
16
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation
12
8
4
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
3
6
Document size (MB)
9
Experimental Results: Valid Query Answer Computation
Compared algorithms
VQA
QA base line
VQA eager intersection
30
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation
20
10
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
10
20
30
40
DTD size |D|
50
60
Conclusions and Future Work
Conclusions
I
Framework for querying of documents with validity violations of local
nature (missing or superfluous nodes)
I
Efficient algorithm for computing valid answers to a class of XPath queries
Future Work
I
Valid answers by query rewriting
I
Valid answers to queries with negation
I
Other tree operations (swap, shift, contraction, . . . )
I
Semantic inconsistencies (keys, functional dependencies,. . . )
Download