On Testing Satisfiability of Tree Pattern Queries
Laks V.S. Lakshmanan, Ganesh Ramesh, Hui Wang, Zheng Zhao
Summary by Victoria Shulman Becher

The article studies the satisfiability of tree pattern queries, i.e. whether there exists a database, consistent with the schema, on which the query has a non-empty answer. Detecting an unsatisfiable query before evaluating it can yield substantial savings in query evaluation. The authors identify cases in which satisfiability can be tested in polynomial time and develop efficient algorithms for this purpose. They also identify the cases where the problem is NP-complete.

Introduction: Formulating queries against an XML database can be more challenging than for relational databases. Querying an XML database in the absence of any schema knowledge can be complicated. Even when a schema is known, getting the query right can still be non-trivial for the user. Satisfiability testing is a necessary first step in building any tool that could assist the user in getting the query right.

Example of an unsatisfiable tree pattern query:

FOR $a IN document("doc.xml")//a, $e IN $a/b//e, $f IN $a/d//f, $c IN $a//c, $e1 IN $c//e, $f1 IN $c//f
WHERE $e = $e1 AND $f = $f1
RETURN

The constraint that the two E leaves and the two F leaves must be identical requires nodes A, B, C and E to lie on the same path, and similarly A, D, C and F. This is impossible: C has a different tag than the two children of A (B and D), so it is forced to be a descendant of both, whereas B and D cannot lie on the same path. (Single (double) lines on the diagram represent parent-child (ancestor-descendant) relationships between nodes.)

Definitions: A database D = (N, e, r, λ) is a finite rooted ordered tree, where N is the set of element nodes, e is the parent-child relationship, λ is the labelling function, which assigns a tag to each node, and r is the root.
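To make the database definition concrete, here is a minimal Python sketch of the tuple D = (N, e, r, λ). The class name `Database`, its field names, and the node identifiers `n1` to `n4` are hypothetical and chosen only for illustration; they do not come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Illustrative encoding of D = (N, e, r, λ) as a rooted ordered tree."""
    root: str                                      # r
    children: dict = field(default_factory=dict)   # e: node -> ordered child list
    label: dict = field(default_factory=dict)      # λ: node -> tag

    def nodes(self):
        """Return N in document (pre-order) order."""
        out, stack = [], [self.root]
        while stack:
            n = stack.pop()
            out.append(n)
            # Push children reversed so the leftmost child is visited first.
            stack.extend(reversed(self.children.get(n, [])))
        return out

    def is_descendant(self, anc, desc):
        """True if desc lies strictly below anc (ancestor-descendant)."""
        stack = list(self.children.get(anc, []))
        while stack:
            n = stack.pop()
            if n == desc:
                return True
            stack.extend(self.children.get(n, []))
        return False


# A small database: root a with children b and d; b has a child e.
db = Database(
    root="n1",
    children={"n1": ["n2", "n3"], "n2": ["n4"]},
    label={"n1": "a", "n2": "b", "n3": "d", "n4": "e"},
)
```

In this sketch the parent-child relationship is stored as ordered child lists, so the ancestor-descendant relationship is obtained by walking down from the candidate ancestor.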
A tree pattern query (TPQ) is a triple Q = (V, E, F), where (V, E) is a rooted tree with nodes V labelled by variables and edges E = Ec ∪ Ed. The pc-edges (Ec) represent the child relationship and the ad-edges (Ed) represent the descendant relationship of XPath. F is a conjunction of tag constraints (TCs, of the form $x.tag = t where t is a tag name), value-based constraints (VBCs) and node identity constraints (NICs, of the form $x idop $y where idop ∈ {≈, ≉}). VBCs include selection constraints ($x.val relop c, $x.attr relop c) and join constraints ($x.attr relop $y.attr', $x.val relop $y.val), where relop ∈ {=, ≠, >, ≥, <, ≤}, attr and attr' are attributes, val represents content, and c is a constant.

A matching of a TPQ Q to a database D is a function h: Q→D that maps nodes of Q to nodes of D such that (i) structural relationships are preserved and (ii) the formula F is satisfied. A database D satisfies a query Q provided there is a matching h: Q→D. A query Q is satisfiable provided there is a database D that satisfies Q.

Problems Studied: The article considers testing satisfiability of various classes of TPQs (with/without VBCs, with/without disjunction in VBCs, with/without join and node identity constraints, with/without wildcards), both in the absence of a schema and in the presence of a schema without disjunction and cycles.

Satisfiability without Schema: The following proposition summarizes the situation for join-free TPQs with wildcards. For a join-free tree pattern query Q, possibly containing wildcards, the following holds:
1. If Q contains no VBCs associated with any node, then Q is satisfiable.
2. If Q contains value-based selection constraints (but is join-free), then Q is satisfiable iff for every node the associated set of VBCs is consistent.
The complexity of verifying satisfiability depends on the type of constraints at each node.
If no disjunction occurs, consistency of the constraints can be verified in polynomial time. If the VBCs associated with a node x can involve arbitrary disjunctions, testing consistency is NP-complete.

Wildcard-free TPQs with joins: Join and node identity constraints interact in a complex way with the structural constraints. The article separates the reasoning about satisfiability into a structural part and a value-based part and then determines how the two interact. The examples of unsatisfiable queries shown in the article illustrate that:
1. Testing satisfiability involves inferring relationships between pairs of nodes based on the structural constraints stated in the query, and hence inference rules are required.
2. Some of the intermediate relationships cannot be directly represented in the language of TPQs, e.g. "x and y must lie on the same path", i.e. (x ≈ y ∨ ad(x, y) ∨ ad(y, x)).

To avoid the complexity of representing these relationships, the following predicates are added: sad(x, y), meaning x ≈ y or ad(x, y); OTSP(x, y), meaning sad(x, y) or ad(y, x); and COUS(x, y), meaning ¬OTSP(x, y).

Determining satisfiability of a query works as follows. First, inference rules are used to obtain the closure of the structural predicates. Then the resulting set of predicates is checked for violations. A violation is a pair of conflicting predicates between a pair of nodes. Examples of conflicting pairs of predicates are x ≈ y and x ≉ y; ad(x, y) and sad(y, x); and OTSP(x, y) and COUS(x, y). Violations make the query unsatisfiable.

To implement satisfiability checking efficiently, a (structural) constraint graph G_Q for the query Q is constructed as follows: G_Q contains one node for each query node. For each predicate φ(x, y) in Q, G_Q contains a directed edge labelled φ from x to y. For symmetric predicates, the edge is bidirected.

The following tools are used: Inference rules define new structural predicates that are inferred from existing ones.
A chase procedure applies the inference rules until no new inferences are possible. A violation is a pair of conflicting predicates between a pair of nodes. A constraint graph to which the chase has been applied is a chased constraint graph.

The article proves that a tree pattern query Q containing node identity constraints, but no wildcards, is satisfiable iff the chased constraint graph of Q is violation-free (Completeness of Chase). A naive implementation of the chase would take time O(n^5), where n is the number of nodes in the query: each rule involves 3 nodes, so one pass over the rules costs O(n^3), and O(n^2) iterations are possible in the worst case before no new inferences are made. The article suggests a more efficient implementation whose worst case remains the same but which performs much better in practice.

VBCs: Consistency of VBCs can be checked in polynomial time. The checking algorithm can be implemented efficiently using a separate value-based constraint graph, with ideas similar to those of the structural constraint graph. The procedure for testing satisfiability of a query Q with structural constraints and VBCs is then as follows:
(i) Chase the VBCs (using a separate value-based constraint graph); if any violation is found, return "unsatisfiable".
(ii) Construct the (structural) constraint graph G of Q; propagate all constraints on node pairs x, y derived from the VBC chase to G, and chase it.
(iii) Q is satisfiable iff the chase terminates with no violation.
The article proves that a tree pattern query with structural constraints and VBCs and no wildcards can be tested for satisfiability in polynomial time using the procedure above (TPQs with VBCs).

For TPQs with wildcards, joins, and disjunction the problem becomes NP-complete. The article proves that testing satisfiability of a tree pattern query with wildcards and only ≈ constraints, where the query uses only pc- and sad-edges, is NP-complete ([Hidders]).
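The chase-then-check idea described above can be sketched in Python as a fixpoint computation over a set of predicates. This is only an illustration: the rule set below (child implies descendant, symmetry of ≈ and ≉, transitivity of ad, and substitution of identical nodes) is a simplified assumption and is not the article's rule set, the predicates sad, OTSP and COUS are omitted, and the names `chase`, `violations`, `"eq"` and `"neq"` are hypothetical. The rules are sound but not complete, so a violation-free result here does not establish satisfiability.

```python
from itertools import product


def chase(facts):
    """Close a set of (pred, x, y) triples under the illustrative rules.

    pred is one of "pc", "ad", "eq" (x ≈ y) or "neq" (x ≉ y).
    """
    facts = set(facts)
    while True:
        new = set()
        for p, x, y in facts:
            if p == "pc":
                new.add(("ad", x, y))        # a child is also a descendant
            if p in ("eq", "neq"):
                new.add((p, y, x))           # identity predicates are symmetric
        for (p, x, y), (q, y2, z) in product(facts, repeat=2):
            if y != y2:
                continue
            if p == "ad" and q == "ad":
                new.add(("ad", x, z))        # transitivity of ad
            if p == "eq":
                new.add((q, x, z))           # x ≈ y, q(y, z) gives q(x, z)
            if q == "eq":
                new.add((p, x, z))           # p(x, y), y ≈ z gives p(x, z)
        if new <= facts:                     # fixpoint: nothing new inferred
            return facts
        facts |= new


def violations(facts):
    """Return the facts that conflict with the tree structure or each other."""
    bad = set()
    for p, x, y in facts:
        if p in ("pc", "ad", "neq") and x == y:
            bad.add((p, x, y))               # a node below or distinct from itself
        if p == "neq" and ("eq", x, y) in facts:
            bad.add((p, x, y))               # x ≈ y together with x ≉ y
    return bad


# $x/$y, $y//$z with the node identity constraint $z ≈ $x: z would be its own ancestor.
unsat = {("pc", "x", "y"), ("ad", "y", "z"), ("eq", "z", "x")}
sat = {("pc", "x", "y"), ("ad", "y", "z"), ("neq", "z", "x")}
```

Under these assumed rules, `violations(chase(unsat))` is non-empty because the chase derives `("ad", "x", "x")`, while `violations(chase(sat))` is empty. The loop re-scans all pairs of facts on every pass, which mirrors the naive implementation rather than the more efficient one mentioned in the article.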
The article also proves that testing satisfiability of a tree pattern query containing VBCs, with disjunction allowed in the selection constraints associated with nodes, is NP-complete (TPQ with disjunction).

The complexity results for the schemaless case are summarized in the table below:

Disjunction NICs/joins Wildcards Complexity
* PTIME
* PTIME
* * NP-complete
* * NP-complete

Satisfiability in the presence of Acyclic Schema: A schema of a database is abstracted as a graph with nodes corresponding to tags and edges labelled by one of the quantifiers ?, 1, *, + (with their standard meanings 'optional', 'one', 'zero or more', 'one or more'). In the paper only DTDs are considered.

To present the main issues of this part, some definitions are needed. An embedding of a query Q into a schema ∆ is a function f: Q → ∆ satisfying the following conditions: (i) f maps each tagged node to a node with the same tag; (ii) whenever (x, y) is a pc-edge (ad-edge) in Q, there is an edge (path) from f(x) to f(y) in ∆.

Let ∆ be a schema and let Q be a tree pattern query with no wildcards or VBCs. Then Q is satisfiable with respect to ∆ iff there is an embedding f from Q into ∆. The article shows that for TPQs that contain only wildcard nodes, checking satisfiability trivially reduces to checking whether the schema has a given depth. When wildcards are present, semantically we can assign any tags to the wildcards and check for the existence of an embedding. Trying all assignments takes exponential time, so a labelling procedure is used instead to confirm the existence of an embedding. The article then proves that a TPQ containing wildcards but no VBCs and no NICs is satisfiable with respect to a schema ∆ iff the set of schema labels computed by the labelling procedure is not empty for each node in the TPQ (Labeling).
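For the wildcard-free case, the embedding condition defined above can be checked directly, because each query node can only map to the schema node that carries its tag. The following Python sketch assumes the schema is given as a mapping from each tag to the set of its child tags, with the quantifiers ignored; the function names `reachable` and `has_embedding` and the example schema are hypothetical and do not come from the article.

```python
def reachable(schema):
    """Map each tag to the set of tags reachable by a non-empty path."""
    reach = {}
    for start in schema:
        seen, stack = set(), list(schema.get(start, ()))
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(schema.get(t, ()))
        reach[start] = seen
    return reach


def has_embedding(query_edges, tag, schema):
    """Check conditions (i) and (ii) of an embedding for a wildcard-free TPQ.

    query_edges: list of (x, y, kind) with kind "pc" or "ad"
    tag:         query node -> tag name
    schema:      tag -> set of child tags (assumed encoding of the schema graph)
    """
    all_tags = set(schema) | {c for cs in schema.values() for c in cs}
    # (i) every tagged query node needs a schema node with the same tag.
    if any(t not in all_tags for t in tag.values()):
        return False
    reach = reachable(schema)
    for x, y, kind in query_edges:
        tx, ty = tag[x], tag[y]
        # (ii) pc-edges need a schema edge, ad-edges need a schema path.
        allowed = schema.get(tx, set()) if kind == "pc" else reach.get(tx, set())
        if ty not in allowed:
            return False
    return True


# Assumed acyclic schema: a has children b and d; b has child e; d has child f.
schema = {"a": {"b", "d"}, "b": {"e"}, "d": {"f"}}
tags = {"x1": "a", "x2": "b", "x3": "e"}
edges = [("x1", "x2", "pc"), ("x2", "x3", "ad")]
```

Here `has_embedding(edges, tags, schema)` returns `True`, whereas a query asking for f as a direct child of a has no embedding, since the schema only connects a to f through d.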
In the presence of node identity constraints and VBCs, determining satisfiability of a query works as follows. The schema is used to infer structural predicates between any pair of (tagged) query nodes. Inference rules are used to compute the closure of the structural predicates, and the resulting set is checked for violations. The query is satisfiable iff the resulting set is violation-free. As before, a constraint graph and a set of inference rules are used to compute the closure. The set of inference rules is adapted from the one developed for the schemaless case. Rules involving sad or OTSP are dropped, since the schema allows us to derive an unambiguous ad relationship whenever sad or OTSP holds. Additionally, relationships between element types need to be inferred from the schema. The schema can tell us that two tags t, t' are related by a pc-/ad-relationship, or that two query nodes must be identical, or that they must be cousins.

The article proves that a TPQ with NICs but no wildcards is satisfiable with respect to an acyclic schema iff there is an embedding of the TPQ into the schema and no violation is detected when the constraint graph of the TPQ is chased (Chase Completeness with Schema). The article also proves that the query complexity of satisfiability checking in the presence of an acyclic schema without choice is PTIME (Query Complexity) and that the combined complexity of satisfiability in the presence of an acyclic schema without choice is co-NP-complete (Combined Complexity).

In the presence of a schema, testing satisfiability of a TPQ with node identity constraints and wildcards is NP-complete. The article proves that testing whether a TPQ is satisfiable with respect to a schema is NP-complete if the TPQ contains wildcards and NICs, and also if it contains VBCs (and no NICs) (Hardness Results).
The complexity results for the schema case are summarized in the table below:

Disjunction NICs/joins Wildcards Complexity
* PTIME
* PTIME
* * NP-complete
* NP-complete

Experimental Results

To study the effectiveness of testing satisfiability, the authors systematically ran a range of experiments to measure the impact of various parameters. They ran the experiments on the XMark benchmark dataset and on a biomedical dataset from the National Biomedical Research Foundation. For each dataset they constructed documents of various sizes using the IBM XMLGenerator. The queries chosen for experimentation correspond to the classes of TPQs studied in the paper. The results of measuring savings and overheads were:

Saving Ratio: As expected, for unsatisfiable queries the satisfiability check leads to phenomenal savings. The saving ratio is close to 1 (usually between about 0.8 and 0.9) whether a schema is present or not.

Overhead Ratio: The overhead ratio decreases as the document size increases. The results show that the overhead is a negligible fraction of the evaluation time.

The authors also tested the impact of the number of constraints on the satisfiability check time. For satisfiable queries the time increases with the number of constraints, as expected, while for unsatisfiable queries it decreases, because violations are found faster.

Summary

The article presents a method for testing satisfiability of various classes of tree pattern queries, which are known to be closely related to XPath and XQuery and to be of fundamental importance. The problem is studied both without a schema and with a schema (acyclic and choice-free). Cases in which the problem is NP-complete or in PTIME are identified. For the latter, efficient algorithms based on a chase procedure are developed. The analytical results are complemented with an extensive set of experiments. Satisfiability checking can provide substantial savings in query evaluation, and the experiments demonstrate that it incurs negligible overhead on satisfiable queries.