Valid Query Answers for XML Slawomir Staworko Jan Chomicki June 25-29, 2007

advertisement
Valid Query Answers for XML
Slawomir Staworko1
1
2
Jan Chomicki2
University at Buffalo and INRIA Futurs
University at Buffalo and Warsaw University
June 25-29, 2007
Invalid XML documents
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
Legacy XML databases
I
Database updates
Invalid XML documents
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
Legacy XML databases
I
Database updates
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
Invalid XML documents
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases
Database updates
proj
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Invalid XML documents
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
emp
name salary
name salary
John $80k
Mary
Legacy XML databases
Database updates
proj
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Invalid XML documents
Querying Invalid XML
I
Integration of XML documents
I
Slight differences between
schemas (e.g. different
cardinality constraints)
I
I
Document with errors
proj
name
Pierogies
emp
Legacy XML databases
Database updates
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
emp
emp
name salary
name salary
John $80k
Mary
name salary
???
???
proj
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
Query : get salaries of all employees that are not managers
//proj/name/emp/following sibling::emp/salary
...
$40k
Core XPath
Core XPath Queries
I
text values but no attributes
I
all standard axes
I
subexpressions, value tests, and joins:
//*[A/B], //*[B/text()=C/text()]
I
negation and disjunction:
//*[not A/B], //A | //B
I
no functions
count(/A/*)
N0
A
N1
N3
B
N2
B
abcd
Tree Reachability Fact: (x, Q, y )
query Q
node x
object y
Query Answers
I
Basic facts use only /* and following sibling::
(N0 ,/*,N1 ), (N1 ,/*,N2 ),...
I
I
Other facts are inferred with implications:
(X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ).
(N0 ,/*/*,N2 ).
given query Q and document T
with the root node r
find all tree facts that hold in T
x in an answer to Q in T iff the
tree fact (r , Q, x) holds in T
Query Evaluation for Positive Core XPath (no negation)
Bottom-up approach
I
I
I
computing tree facts for query Q
tree facts for T1 , . . . , Tm
computed before
including inferred facts (involving
subqueries of Q)
Tree
N0
X
Ni−1
N1
X1
...
Xi−1
T1
Ti−1
Algorithm
I start with ∅
II for subtree Ti (i = 1, . . . , m)
1 add all facts of the subtree (obtained by
recursion)
2 add (N0 ,/*,Ni )
3 if i > 1 add (Ni−1 ,following sibling::*,Ni )
Ni
Xi
Ti
...
Nm
Xm
Tm
Editing operations
A
B
C
D
Editing operations
Cost: 3
Insert
A
B
C
Editing operations
I
Inserting a subtree
A
D
B
C
E
F
D
G
Editing operations
Cost: 3
Insert
A
B
C
Editing operations
I
I
Inserting a subtree
Deleting a subtree
Cost: 2
Delete
A
D
B
C
E
F
A
D
G
E
F
D
G
Editing operations
Cost: 1
H
B
D
C
Editing operations
I
I
I
Cost: 3
Modify
Inserting a subtree
Deleting a subtree
Modifying a node’s
label
Insert
A
B
C
Cost: 2
Delete
A
D
B
C
E
F
A
D
G
E
F
D
G
Edit Distance and Repairs [3]
Distance between documents
dist(T , S) is the minimal cost of
transforming T into S
dist = 1
C
C
Distance to a DTD
A
dist(T , D) is the minimal cost of
repairing T w.r.t D i.e.,
B
B
B
B
min{dist(T , S)|S valid w.r.t D}
DTD
C → (A,B)*
A → EMPTY
B → EMPTY
Repair
T 0 is a repair of T w.r.t D iff
dist(T 0 , T ) = dist(T , D)
C
A
B
C
A
B
C
A
B
There can be an
exponential
number of repairs
A
B
Valid Query Answers
Valid Query Answers
XML Document
x is a valid answer to query Q in T w.r.t. D iff
x is an answer to Q in every repair of T w.r.t. D.
proj
name
Document Type Definition (D0 )
proj
emp
name
salary
→
→
→
→
(name, emp, proj*, emp*)
(name, salary)
#PCDATA
#PCDATA
proj
Pierogies
...
emp
emp
name salary name salary
Alice $90K Bob
Queries and Valid Answers
//proj[emp[1]/salary=’$90K’]/name/text()
//proj[name=’Pierogies’]/emp[1]/salary/text()
//proj[name=’Pierogies’]/emp[1]/name/text()
//proj[name=’Pierogies’]/emp[1]/salary
→
→
→
→
{Pierogies}
{$90K}
∅
∅
$90K
Trace graph
Read A
DTD
C
A
B
B
C → (A,B)*
A → EMPTY
B → EMPTY
Q0
Q1
Read B
A
B
B
Trace graph
Read A
DTD
C
A
B
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
Q0
Q1
Ins B
Read B
A
B
B
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
A
B
B
Q0
Q0
Q0
Q0
Q1
Q1
Q1
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
A
Q0
Q0
Re
Q1
B
B
Q0
ad
Re
ad
Q1
Q0
Re
Q1
ad
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1
Del
Q1
Re
ad
Del
Q0
Ins B
Del
ad
Re
Del
Q0
Ins B
ad
Del
Q0
Ins B
Ins B
Re
B
Ins A
Del
Q0
Q1
B
Ins A
A
Q1
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1 ,0
Del
Q1 ,1
Re
ad
Del
Q0 ,1
Ins B
Del
ad
Re
Del
Q0 ,0
Ins B
ad
Del
Q0 ,1
Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Q1
Ins B
Read B
Ins A
Ins A
Q1 ,0
Del
Q1 ,1
Re
ad
Del
Q0 ,1
Ins B
Del
ad
Re
Del
Q0 ,0
Ins B
ad
Del
Q0 ,1
Ins B
Ins B
Re
B
Ins A
Del
Q0 ,0
Q1 ,1
B
Ins A
A
Q1 ,2
Del
Trace graph
Read A
DTD
C
A
B
Ins A
Del
C → (A,B)*
A → EMPTY
B → EMPTY
B
Q0
Del
Q1
Ins B
Read B
Compact representation of all repairs
Del
Q1 ,1
ad
Del
Ins A
Ins A
Ins A
Q1 ,0
Re
Q0 ,1
Repairing Paths:
Ins B
Del
ad
Re
Del
Q0 ,0
Ins B
ad
Ins B
Ins B
Q1 ,1
Re
Del
Q0 ,1
Ins A
Del
Q0 ,0
Q1 ,2
I
(Read, Read, Del)
I
(Read, Read, Ins A, Read)
I
(Read, Del, Read)
Computing Valid Query Answers
Certain Tree Facts
Tree facts present in every repair of
a given tree
A
B
Q0 ,0
Q0 ,0
e ad
ad
R
Bottom-up approach
Q1 ,0
Q0 ,1
ead
R
Q1 ,1
set of facts collected “so far”:
I
Intersection of the sets of all repairing paths
Del
For every repairing path, construct:
I
Obtain certain facts by
Del
Ins A
Re
Precomputed values:
I certain facts for all children
I certain facts common for every
minimal tree satisfying DTD
B
I
I
start with ∅
Read adds certain facts of the corresponding
child
Del adds nothing
Ins A adds certain facts common for minimal
trees labeled with A
Eager intersection
Problem
Possibly an exponential number of paths
Solution: Eager Intersection
Intersect all sets of certain facts for paths sharing the same last operation adding
facts (Read/Ins).
Computing VQA
Queries
Combined-complexity
Data-complexity
Descending axes
PTIME
PTIME
+ Ascending axes
co-NP-complete
???
+ Sliding axes
co-NP-complete
???
+ Negation/Disjunction
co-NP-complete
???
+ Joins
co-NP-complete
co-NP-complete
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
base line
Validate regular automata
Dist
distance computation
Data generation
1. random valid document
2. removing and adding
random nodes
3. invalidity ratio
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
Parse
base line
Validate regular automata
Dist
distance computation
1. random valid document
2. removing and adding
random nodes
3. invalidity ratio
Dist
12
Time (sec.)
Data generation
Validate
9
6
3
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
10
20
30
Document size (MB)
40
50
Experimental Results: Edit Distance Computation
Compared algorithms
Parse
base line
Validate regular automata
Dist
distance computation
1. random valid document
2. removing and adding
random nodes
Dist
60
DTD size |D|
90
6
Time (sec.)
Data generation
Validate
4
2
3. invalidity ratio
dist(T , D)/|T | ' 0.1%
4. small height (8-10)
30
120
Experimental Results: Valid Query Answer Computation
Compared algorithms
QA base line
VQA eager intersection
Data generation
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
Experimental Results: Valid Query Answer Computation
Compared algorithms
QA
QA base line
VQA eager intersection
VQA
16
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation
12
8
4
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
3
6
Document size (MB)
9
Experimental Results: Valid Query Answer Computation
Compared algorithms
VQA
QA base line
VQA eager intersection
30
I
invalidity ratio
dist(T , D)/|T | ' 0.1%
I
small height (8-10)
Time (sec.)
Data generation
20
10
Implemented Queries
I //, /, following sibling::
I name()=A,text()=’str’
I QA works in linear time
10
20
30
40
DTD size |D|
50
60
Conclusions and Future Work
Conclusions
I
Framework for querying documents with validity violations of local nature
(missing or superfluous nodes)
I
Efficient algorithm for computing valid answers to a class of XPath queries
Future Work
I
Valid answers by query rewriting
I
In-depth analysis of data complexity
I
Other tree operations: inner node deletion/insertion, node move, . . . [1]
I
Semantic inconsistencies: keys, functional dependencies,. . . [2]
Acknowledgments
Marcelo Arenas
Leo Bertossi
Wenfei Fan
Jerzy Marcinkowski
Slawomir Staworko
Jef Wijsen
P. Bille.
A Survey on Tree Edit Distance and Related Problems.
Theoretical Computer Science, 337(1-3):217–239, 2003.
S. Flesca, F. Furfaro, S. Greco, and E. Zumpano.
Querying and Repairing Inconsistent XML Data.
In Web Information Systems Engineering, pages 175–188. Springer, LNCS
3806, 2005.
S. Staworko and J. Chomicki.
Validity-Sensitive Querying of XML Databases.
In EDBT Workshops (dataX), pages 164–177. Springer, LNCS 4254, 2006.
Download