Validity-Sensitive Querying of XML Databases Slawomir Staworko Jan Chomicki Department of Computer Science and Engineering University at Buffalo, SUNY Motivation Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I Legacy XML databases I Database updates Motivation Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I Legacy XML databases I Database updates Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary Motivation Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp emp name salary name salary John $80k Mary Legacy XML databases Database updates proj Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Motivation Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp emp name salary name salary John $80k Mary Legacy XML databases Database updates proj Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Motivation Querying Invalid XML I Integration of XML documents I Slight differences between schemas (e.g. different cardinality constraints) I I Document with errors proj name Pierogies emp Legacy XML databases Database updates Document Type Definition (D0 ) proj emp name salary → → → → emp emp name salary name salary John $80k Mary name salary ??? ??? proj (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA Query : get salaries of all employees that are not managers //proj/name/emp/following sibling::emp/salary ... $40k Positive Core XPath Positive Core XPath Queries I text values but without attributes I all standard axes I tests with subexpressions and equality: //*[A/B], //*[B//text()=‘abcd’] I I N0 A N1 N3 B but no negation in tests: //*[not A/B], //*[B/text()6=‘abcd’] N2 B abcd and no functions Tree Reachability Fact: (x, Q, y ) query Q node x object y Query Answers Basic facts use only /* and following sibling:: I (N0 ,/*,N1 ), (N1 ,/*,N2 ) I Other facts are inferred with (Horn) rules (X , Q/P, Y ) ⇐ (X , Q, Z ) ∧ (Z , P, Y ). (N0 ,/*/*,N2 ). I given query Q and document T with the root node r find all tree facts that hold in T x in an answer to Q in T iff the tree fact (r , Q, x) holds in T Query Evaluation Bottom-up approach I I I computing tree facts for query Q tree facts for T1 , . . . , Tm computed before including inferred facts (involving subqueries of Q) Tree N0 X N1 X1 ... Ni−1 Xi−1 T1 Ti−1 Algorithm I start with ∅ II for subtree Ti (i = 1, . . . , m) 1 add all facts of the subtree (obtained by recursion) 2 add (N0 ,/*,Ni ) 3 if i > 1 add (Ni−1 ,following sibling::*,Ni ) Ni Xi Ti ... Nm Xm Tm Editing operations A B C D Editing operations Cost: 3 Insert A B C Editing operations I Inserting a subtree A D E B C F D G Editing operations Cost: 3 Insert A B C Editing operations I I Inserting a subtree Deleting a subtree Cost: 2 Delete A D E B C F A D G E F D G Editing operations Cost: 1 H B Cost: 3 Modify D C A B C Editing operations I I I Insert Inserting a subtree Deleting a subtree Modifying node’s label Cost: 2 Delete A D E B C F A D G E F D G Edit Distance and Repairs Distance between documents dist(T , S) is the minimal cost of transforming T into S dist = 1 C C Distance to a DTD A dist(T , D) is the minimal cost of repairing T w.r.t D i.e., B B B B min{dist(T , S)|S valid w.r.t D} DTD C → (A,B)* A → EMPTY B → EMPTY Repair T 0 is a repair of T w.r.t D iff dist(T 0 , T ) = dist(T , D) C A B C A B C A B There can be an exponential number of repairs A B Valid Query Answers Valid Query Answers XML Document x is a valid answer to query Q in T w.r.t. D iff x is an answer to Q in every repair of T w.r.t. D. proj name Document Type Definition (D0 ) proj emp name salary → → → → (name, emp, proj*, emp*) (name, salary) #PCDATA #PCDATA proj Pierogies ... emp emp name salary name salary Alice $90K Bob Queries and Valid Answers //proj[emp[1]/salary=’$90K’]/name/text() //proj[name=’Pierogies’]/emp[1]/salary/text() //proj[name=’Pierogies’]/emp[1]/name/text() //proj[name=’Pierogies’]/emp[1]/salary → → → → {Pierogies} {$90K} ∅ ∅ $90K Trace graph Read A DTD C A B B C → (A,B)* A → EMPTY B → EMPTY Q0 Q1 Read B A B B Trace graph Read A DTD C A B B Ins A Del C → (A,B)* A → EMPTY B → EMPTY Q0 Q1 Ins B Read B A B B Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B A B B Q0 Q0 Q0 Q0 Q1 Q1 Q1 Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B A Q0 Q0 Re Q1 B ad Q0 Re Q1 B ad Q0 d a Re Q1 Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 ad Del Q1 d a Re Del Q0 Ins B Del Re Del Q0 Ins B ad Del Q0 Ins B Ins B Re B Ins A Del Q0 Q1 B Ins A A Q1 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 ,0 ad Del Q1 ,1 d a Re Del Q0 ,1 Ins B Del Re Del Q0 ,0 Ins B ad Del Q0 ,1 Ins B Ins B Re B Ins A Del Q0 ,0 Q1 ,1 B Ins A A Q1 ,2 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Q1 Ins B Read B Ins A Ins A Q1 ,0 ad Del Q1 ,1 d a Re Del Q0 ,1 Ins B Del Re Del Q0 ,0 Ins B ad Del Q0 ,1 Ins B Ins B Re B Ins A Del Q0 ,0 Q1 ,1 B Ins A A Q1 ,2 Del Trace graph Read A DTD C A B Ins A Del C → (A,B)* A → EMPTY B → EMPTY B Q0 Del Q1 Ins B Read B Compact representation of all repairs Del Q1 ,1 d a Re Del Ins A Ins A Ins A Q1 ,0 ad Q0 ,1 Repairing Paths: Ins B Del Re Del Q0 ,0 Ins B ad Ins B Ins B Q1 ,1 Re Del Q0 ,1 Ins A Del Q0 ,0 Q1 ,2 I (Read, Read, Del) I (Read, Read, Ins A, Read) I (Read, Del, Read) Computing Valid Query Answers Certain Tree Facts Tree facts present in every repair of a given tree A B Q0 ,0 Q0 ,0 ad Re ad Bottom-up approach Q1 ,0 I I certain facts for all children certain facts common for every minimal tree satisfying DTD Q0 ,1 ead R Q1 ,1 set of facts collected “so far”: I Intersection of the sets of all repairing paths Del For every repairing path, construct: I Obtain certain facts by Del Ins A Re Precomputed values: B I I start with ∅ Read adds certain facts of the corresponding child Del adds nothing Ins A adds certain facts common for minimal trees labeled with A Eager intersection Data complexity of VQA Problem Possibly an exponential number of paths Computation of valid answers to positive core XPath queries without joins can be performed in polynomial time in the size of the document. Solution: Eager Intersection Data complexity of VQA with joins For queries without tests of form (joins) There exists a query with joins for which computing VQA is co-NP-complete. [Q/text() = P/text()]. Intersect all sets of certain facts for paths sharing the same last adding operation (Read/Ins). Combined complexity of VQA Combined complexity of computing valid query answers is co-NP-complete. Experimental Results: Edit Distance Computation Compared algorithms Parse base line Validate regular automata Dist distance computation Data generation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10) Experimental Results: Edit Distance Computation Compared algorithms Parse Parse base line Validate regular automata Dist distance computation 1. random valid document 2. removing and adding random nodes 3. invalidity ratio Dist 12 Time (sec.) Data generation Validate 9 6 3 dist(T , D)/|T | ' 0.1% 4. small height (8-10) 10 20 30 Document size (MB) 40 50 Experimental Results: Edit Distance Computation Compared algorithms Parse base line Validate regular automata Dist distance computation 1. random valid document 2. removing and adding random nodes Dist 60 DTD size |D| 90 6 Time (sec.) Data generation Validate 4 2 3. invalidity ratio dist(T , D)/|T | ' 0.1% 4. small height (8-10) 30 120 Experimental Results: Valid Query Answer Computation Compared algorithms QA base line VQA eager intersection Data generation I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time Experimental Results: Valid Query Answer Computation Compared algorithms QA QA base line VQA eager intersection VQA 16 I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Time (sec.) Data generation 12 8 4 Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time 3 6 Document size (MB) 9 Experimental Results: Valid Query Answer Computation Compared algorithms VQA QA base line VQA eager intersection 30 I invalidity ratio dist(T , D)/|T | ' 0.1% I small height (8-10) Time (sec.) Data generation 20 10 Implemented Queries I //, /, following sibling:: I name()=A,text()=’str’ I QA works in linear time 10 20 30 40 DTD size |D| 50 60 Conclusions and Future Work Conclusions I Framework for querying of documents with validity violations of local nature (missing or superfluous nodes) I Efficient algorithm for computing valid answers to a class of XPath queries Future Work I Valid answers by query rewriting I Valid answers to queries with negation I Other tree operations (swap, shift, contraction, . . . ) I Semantic inconsistencies (keys, functional dependencies,. . . )