Valid Query Answers for XML Slawomir Staworko1 1 2 Jan Chomicki2 University at Buffalo and INRIA Futurs University at Buffalo and Warsaw University June 25-29, 2007 Invalid XML documents Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I Legacy XML databases I Database updates Invalid XML documents Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I Legacy XML databases I Database updates Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary Invalid XML documents Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp emp name salary name salary John $80k Mary Legacy XML databases Database updates proj Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Invalid XML documents Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp emp name salary name salary John $80k Mary Legacy XML databases Database updates proj Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Invalid XML documents Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp Legacy XML databases Database updates Document Type Definition (D0 ) proj emp name salary → → → → emp emp name salary name salary John $80k Mary name salary ??? ??? proj (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Core XPath Core XPath Queries I text values but no attributes I all standard axes I subexpressions, value tests, and joins: //*[A/B], //*[B/text()=C/text()] I negation and disjunction: //*[not A/B], //A | //B I no functions count(/A/*) N0 A N1 N3 B N2 B abcd Tree Reachability Fact: (x, Q, y ) query Q node x object y Query Answers I Basic facts use only /* and following sibling:: (N0 ,/*,N1 ), (N1 ,/*,N2 ),... I I Other facts are inferred with implications: (X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ). (N0 ,/*/*,N2 ). given query Q and document T with the root node r find all tree facts that hold in T x in an answer to Q in T iff the tree fact (r , Q, x) holds in T Query Evaluation for Positive Core XPath (no negation) Bottom-up approach I I I computing tree facts for query Q tree facts for T1 , . . . , Tm computed before including inferred facts (involving subqueries of Q) Tree N0 X Ni−1 N1 X1 ... Xi−1 T1 Ti−1 Algorithm I start with ∅ II for subtree Ti (i = 1, . . . , m) 1 add all facts of the subtree (obtained by recursion) 2 add (N0 ,/*,Ni ) 3 if i > 1 add (Ni−1 ,following sibling::*,Ni ) Ni Xi Ti ... Nm Xm Tm Editing operations A B C D Editing operations Cost: 3 Insert A B C Editing operations I Inserting a subtree A D B C E F D G Editing operations Cost: 3 Insert A B C Editing operations I I Inserting a subtree Deleting a subtree Cost: 2 Delete A D B C E F A D G E F D G Editing operations Cost: 1 H B D C Editing operations I I I Cost: 3 Modify Inserting a subtree Deleting a subtree Modifying a node’s label Insert A B C Cost: 2 Delete A D B C E F A D G E F D G Edit Distance and Repairs [3] Distance between documents dist(T , S) is the minimal cost of transforming T into S dist = 1 C C Distance to a DTD A dist(T , D) is the minimal cost of repairing T w.r.t D i.e., B B B B min{dist(T , S)|S valid w.r.t D} DTD C → (A,B)* A → EMPTY B → EMPTY Repair T 0 is a repair of T w.r.t D iff dist(T 0 , T ) = dist(T , D) C A B C A B C A B There can be an exponential number of repairs A B Valid Query Answers Valid Query Answers XML Document x is a valid answer to query Q in T w.r.t. D iff x is an answer to Q in every repair of T w.r.t. D. proj name Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA proj Pierogies ... emp emp name salary name salary Alice $90K Bob Queries and Valid Answers //proj[emp[1]/salary=’$90K’]/name/text() //proj[name=’Pierogies’]/emp[1]/salary/text() //proj[name=’Pierogies’]/emp[1]/name/text() //proj[name=’Pierogies’]/emp[1]/salary → → → → {Pierogies} {$90K} ∅ ∅ $90K Trace graph Read A DTD C A B B C → (A,B)* A → EMPTY B → EMPTY Q0 Q1 Read B A B B Trace graph Read A DTD C A B B Ins A Del C → (A,B)* A → EMPTY B → EMPTY Q0 Q1 Ins B Read B A B B Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B A B B Q0 Q0 Q0 Q0 Q1 Q1 Q1 Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B A Q0 Q0 Re Q1 B B Q0 ad Re ad Q1 Q0 Re Q1 ad Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 Del Q1 Re ad Del Q0 Ins B Del ad Re Del Q0 Ins B ad Del Q0 Ins B Ins B Re B Ins A Del Q0 Q1 B Ins A A Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 ,0 Del Q1 ,1 Re ad Del Q0 ,1 Ins B Del ad Re Del Q0 ,0 Ins B ad Del Q0 ,1 Ins B Ins B Re B Ins A Del Q0 ,0 Q1 ,1 B Ins A A Q1 ,2 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 ,0 Del Q1 ,1 Re ad Del Q0 ,1 Ins B Del ad Re Del Q0 ,0 Ins B ad Del Q0 ,1 Ins B Ins B Re B Ins A Del Q0 ,0 Q1 ,1 B Ins A A Q1 ,2 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Del Q1 Ins B Read B Compact representation of all repairs Del Q1 ,1 ad Del Ins A Ins A Ins A Q1 ,0 Re Q0 ,1 Repairing Paths: Ins B Del ad Re Del Q0 ,0 Ins B ad Ins B Ins B Q1 ,1 Re Del Q0 ,1 Ins A Del Q0 ,0 Q1 ,2 I (Read, Read, Del) I (Read, Read, Ins A, Read) I (Read, Del, Read) Computing Valid Query Answers Certain Tree Facts Tree facts present in every repair of a given tree A B Q0 ,0 Q0 ,0 e ad ad R Bottom-up approach Q1 ,0 Q0 ,1 ead R Q1 ,1 set of facts collected “so far”: I Intersection of the sets of all repairing paths Del For every repairing path, construct: I Obtain certain facts by Del Ins A Re Precomputed values: I certain facts for all children I certain facts common for every minimal tree satisfying DTD B I I start with ∅ Read adds certain facts of the corresponding child Del adds nothing Ins A adds certain facts common for minimal trees labeled with A Eager intersection Problem Possibly an exponential number of paths Solution: Eager Intersection Intersect all sets of certain facts for paths sharing the same last operation adding facts (Read/Ins). Computing VQA Queries Combined-complexity Data-complexity Descending axes PTIME PTIME + Ascending axes co-NP-complete ??? + Sliding axes co-NP-complete ??? + Negation/Disjunction co-NP-complete ??? + Joins co-NP-complete co-NP-complete Experimental Results: Edit Distance Computation Compared algorithms Parse base line Validate regular automata Dist distance computation Data generation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10) Experimental Results: Edit Distance Computation Compared algorithms Parse Parse base line Validate regular automata Dist distance computation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio Dist 12 Time (sec.) Data generation Validate 9 6 3 dist(T , D)/|T | ' 0.1% 4. small height (8-10) 10 20 30 Document size (MB) 40 50 Experimental Results: Edit Distance Computation Compared algorithms Parse base line Validate regular automata Dist distance computation 1. random valid document 2. removing and adding random nodes Dist 60 DTD size |D| 90 6 Time (sec.) Data generation Validate 4 2 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10) 30 120 Experimental Results: Valid Query Answer Computation Compared algorithms QA base line VQA eager intersection Data generation I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time Experimental Results: Valid Query Answer Computation Compared algorithms QA QA base line VQA eager intersection VQA 16 I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Time (sec.) Data generation 12 8 4 Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time 3 6 Document size (MB) 9 Experimental Results: Valid Query Answer Computation Compared algorithms VQA QA base line VQA eager intersection 30 I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Time (sec.) Data generation 20 10 Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time 10 20 30 40 DTD size |D| 50 60 Conclusions and Future Work Conclusions I Framework for querying documents with validity violations of local nature (missing or superfluous nodes) I Efficient algorithm for computing valid answers to a class of XPath queries Future Work I Valid answers by query rewriting I In-depth analysis of data complexity I Other tree operations: inner node deletion/insertion, node move, . . . [1] I Semantic inconsistencies: keys, functional dependencies,. . . [2] Acknowledgments Marcelo Arenas Leo Bertossi Wenfei Fan Jerzy Marcinkowski Slawomir Staworko Jef Wijsen P. Bille. A Survey on Tree Edit Distance and Related Problems. Theoretical Computer Science, 337(1-3):217–239, 2003. S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Querying and Repairing Inconsistent XML Data. In Web Information Systems Engineering, pages 175–188. Springer, LNCS 3806, 2005. S. Staworko and J. Chomicki. Validity-Sensitive Querying of XML Databases. In EDBT Workshops (dataX), pages 164–177. Springer, LNCS 4254, 2006.