On Testing Satisfiability of Tree Pattern Queries

Laks V.S. Lakshmanan, Ganesh Ramesh,
Hui Wang, Zheng Zhao
Summary by Victoria Shulman Becher
The article studies the satisfiability of tree pattern queries, i.e. whether there exists a
database, consistent with the schema, on which the query has a non-empty answer. Detecting
unsatisfiable queries before evaluation can yield substantial savings in query evaluation.
The authors identify cases in which satisfiability can be tested in polynomial time and
develop efficient algorithms for them. They also identify the cases where the problem is
NP-complete.
Introduction: Formulating queries against an XML database can be more challenging than
against a relational database. Querying an XML database in the absence of any schema
knowledge can be complicated. Even when a schema is known, getting the query right can
still be non-trivial for the user. Satisfiability testing is a necessary first step in building
any tool that could assist the user in getting the query right.
Example of unsatisfiable tree pattern query:
FOR $a IN document("doc.xml")//a,
$e IN $a/b//e, $f IN $a/d//f,
$c IN $a//c, $e1 IN $c//e, $f1 IN $c//f
WHERE $e = $e1 AND $f = $f1
RETURN
The constraint that the two E leaves and the two F leaves must be identical requires nodes
A, B, C and E to lie on the same path, and similarly nodes A, D, C and F. This is impossible:
C, having a different tag than the two children B and D of A, is forced to be a descendant
of both, whereas B and D cannot lie on the same path. (In the diagram, single lines represent
parent-child relationships and double lines represent ancestor-descendant relationships
between nodes.)
Definitions:
A database D = (N, e, r, λ) is a finite rooted ordered tree, where N is the set of
element nodes, e is the parent-child relationship, λ is the labelling function that assigns a
tag to each node, and r is the root.
A tree pattern query (TPQ) is a triple Q = (V, E, F), where (V, E) is a rooted tree with
nodes V labelled by variables and with edges E = Ec ∪ Ed. The pc-edges (Ec) represent the
child relationship and the ad-edges (Ed) represent the descendant relationship of XPath.
F is a conjunction of tag constraints (TCs, of the form $x.tag = t where t is a tag name),
value-based constraints (VBCs, which include selection constraints $x.val relop c and
$x.attr relop c, and join constraints $x.attr relop $y.attr’ and $x.val relop $y.val, where
relop ∈ {=, ≠, >, ≥, <, ≤}, attr and attr’ are attributes, val represents content, and c is a
constant) and node identity constraints (NICs, of the form $x idop $y where idop ∈ {≈, ≉}).
A matching of a TPQ Q to a database D is a function h: Q→D that maps nodes of
Q to nodes of D such that: (i) structural relationships are preserved and (ii) the formula F
is satisfied. We say that a database D satisfies a query Q provided there is a matching
h : Q→D. A query Q is satisfiable provided there is a database D that satisfies Q.
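The definition of a matching can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions: the database is encoded as two dictionaries (`parent` and `tag`), the query carries tag constraints only, and the names `find_matching`, `q_tags` and `q_edges` are hypothetical, not taken from the article.

```python
from itertools import product

def is_ancestor(parent, a, d):
    """True if node a is a proper ancestor of node d in the tree."""
    d = parent.get(d)
    while d is not None:
        if d == a:
            return True
        d = parent.get(d)
    return False

def find_matching(parent, tag, q_tags, q_edges):
    """Return one matching h (query variable -> database node) or None.

    parent:  database node -> parent node (the root maps to None)
    tag:     database node -> tag
    q_tags:  query variable -> required tag (tag constraints only)
    q_edges: list of (x, y, kind) with kind 'pc' or 'ad'
    """
    variables = list(q_tags)
    # Candidate database nodes per variable, filtered by the tag constraint.
    candidates = [[n for n in tag if tag[n] == q_tags[v]] for v in variables]
    for choice in product(*candidates):
        h = dict(zip(variables, choice))
        ok = all(
            parent.get(h[y]) == h[x] if kind == "pc"
            else is_ancestor(parent, h[x], h[y])
            for x, y, kind in q_edges
        )
        if ok:
            return h
    return None

# Database: root a with children b and c; b has a child e.
parent = {"n1": None, "n2": "n1", "n3": "n2", "n4": "n1"}
tag = {"n1": "a", "n2": "b", "n3": "e", "n4": "c"}
# Query: $x (tag a) with child $y (tag b) and descendant $z (tag e).
q_tags = {"x": "a", "y": "b", "z": "e"}
q_edges = [("x", "y", "pc"), ("x", "z", "ad")]
match = find_matching(parent, tag, q_tags, q_edges)
```

The search is exponential in the number of query nodes; it only serves to show what "structural relationships are preserved" means for a fixed database, whereas satisfiability asks whether any such database exists.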
Problems Studied: the article considers testing satisfiability of various classes of TPQs
(with/without VBCs, with/without disjunction in VBCs, with/without join and node
identity constraints, with/without wildcards) both in the absence of a schema and in the
presence of a schema without disjunction and cycles.
Satisfiability without Schema: The following proposition summarizes the situation for
join-free TPQs with wildcards:
For a join-free tree pattern query Q, possibly containing wildcards, the following holds:
1. If Q contains no VBCs associated with any node then Q is satisfiable.
2. If Q contains value-based selection constraints (but is join-free), then Q is satisfiable
iff for every node, the associated set of VBCs is consistent.
The complexity of verifying satisfiability depends on the type of constraints on each node. If
no disjunction occurs, consistency of the constraints can be verified in polynomial time. If
the VBCs associated with a node x can involve arbitrary disjunctions, testing
consistency is NP-complete.
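For the disjunction-free case, the per-node consistency test can be sketched as an interval check. This is a minimal sketch, assuming that values range over the real numbers (a dense ordered domain) and that each constraint has the form val relop c; the function name and the encoding of constraints as (relop, constant) pairs are illustrative choices, not the article's.

```python
import math

def selection_constraints_consistent(constraints):
    """Check a conjunction of constraints 'val relop c' on one node.

    constraints: list of (relop, c), relop in {'=', '!=', '<', '<=', '>', '>='}.
    Assumes values range over the real numbers (a dense ordered domain).
    """
    lo, lo_strict = -math.inf, False   # tightest lower bound seen so far
    hi, hi_strict = math.inf, False    # tightest upper bound seen so far
    excluded = set()                   # constants ruled out by '!='
    for op, c in constraints:
        if op == "!=":
            excluded.add(c)
        if op in (">", ">=", "="):
            strict = op == ">"
            if c > lo or (c == lo and strict):
                lo, lo_strict = c, strict
        if op in ("<", "<=", "="):
            strict = op == "<"
            if c < hi or (c == hi and strict):
                hi, hi_strict = c, strict
    if lo > hi:
        return False
    if lo == hi:
        # Single candidate value: both bounds must be closed and not excluded.
        return not lo_strict and not hi_strict and lo not in excluded
    # A non-degenerate interval over a dense domain always has a free value.
    return True
```

Over an integer domain the last case would need an extra count of the values left in the interval, which this sketch does not attempt.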
Wildcard-free TPQs with joins: The presence of join and node identity constraints
interacts in a complex way with the structural constraints. The reasoning about
satisfiability was separated into structure and value-based parts and their interaction was
determined.
The examples of unsatisfiable queries shown in the article illustrate that:
1. Testing satisfiability involves inferring relationships between pairs of nodes based on
structural constraints stated in the query and hence inference rules are required.
2. Some of the intermediate relationships cannot be directly represented in the language
of TPQs, e.g. “x and y must lie on the same path”, i.e. x ≈ y ∨ ad(x, y) ∨ ad(y, x).
To represent such relationships directly, the following predicates are added:
sad(x, y), meaning x ≈ y or ad(x, y);
OTSP(x, y), meaning sad(x, y) or ad(y, x);
COUS(x, y), meaning ¬OTSP(x, y).
Determining satisfiability of a query works as follows. First, we use inference rules to
obtain the closure of structural predicates. Then, we check the resulting set of predicates
for violations. A violation is a pair of conflicting predicates between a pair of nodes.
Examples of conflicting pairs of predicates are x ≈ y, x ≉ y; ad(x, y), sad(y, x); and
OTSP(x, y), COUS(x, y). Violations make the query unsatisfiable.
To implement satisfiability checking efficiently, a (structural) constraint graph GQ for
the query Q is constructed as follows: GQ contains one node for each query node. For each
predicate φ(x, y) in Q, GQ contains a directed edge labeled φ from x to y. For symmetric
predicates, the edge is bidirected.
The following tools are used:
Inference rules are used to define new structural predicates that are inferred from existing
ones.
A chase procedure which applies the inference rules until no new inferences are possible.
A violation, which is a pair of conflicting predicates between a pair of nodes.
A constraint graph, with chase applied on it, is a chased constraint graph.
The article proves that a tree pattern query Q containing node identity constraints, but no
wildcards, is satisfiable iff the chased constraint graph of Q is violation-free
(Completeness of Chase).
A naive implementation of the chase would take time O(n^5), where n is the number of
nodes in the query. This is because each rule involves 3 nodes, so one iteration over all
triples costs O(n^3), and O(n^2) iterations are possible in the worst case before no new
inferences are made. A more efficient implementation is suggested in the article; its
worst case remains the same, but in practice it is much faster.
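The shape of a naive chase can be sketched as follows. The two inference rules used here are illustrative assumptions chosen because they hold in any tree; they are not the article's complete rule set, and node identity and sad predicates are left out. Facts are encoded as (predicate, x, y) triples, and pc-edges are assumed to have been entered as ad facts beforehand.

```python
from itertools import product

def chase(nodes, facts):
    """Apply two illustrative inference rules until a fixpoint is reached.

    facts: set of (pred, x, y) with pred in {'ad', 'OTSP', 'COUS'}.
    Rule 1: ad(x, y) and ad(y, z)          =>  ad(x, z)
    Rule 2: ad(x, z) and ad(y, z), x != y  =>  OTSP(x, y)
    """
    facts = set(facts)
    changed = True
    while changed:                                 # repeat until nothing new
        changed = False
        for x, y, z in product(nodes, repeat=3):   # O(n^3) triples per round
            new = set()
            if ("ad", x, y) in facts and ("ad", y, z) in facts:
                new.add(("ad", x, z))
            if x != y and ("ad", x, z) in facts and ("ad", y, z) in facts:
                new.add(("OTSP", x, y))
            if not new <= facts:
                facts |= new
                changed = True
    return facts

def has_violation(facts):
    """Detect conflicting predicate pairs between two nodes."""
    for pred, x, y in facts:
        if pred == "ad" and x == y:
            return True   # a node cannot be its own proper ancestor
        if pred == "OTSP" and (("COUS", x, y) in facts or ("COUS", y, x) in facts):
            return True   # on the same path and cousins at the same time
    return False

# b and c are both ancestors of e, yet declared cousins: unsatisfiable.
bad = chase(["b", "c", "e"], {("ad", "b", "e"), ("ad", "c", "e"), ("COUS", "b", "c")})
# A simple chain a // b // c raises no conflict.
good = chase(["a", "b", "c"], {("ad", "a", "b"), ("ad", "b", "c")})
```

Each round scans all node triples and the loop stops once a full round adds nothing, which mirrors the fixpoint behaviour described above without any of the optimizations the article proposes.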
VBCs
Consistency of VBCs can be checked in polynomial time. The checking
algorithm can be implemented efficiently with a separate value-based constraint graph,
using ideas similar to those of the structural constraint graph.
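One standard way to check a conjunction of comparisons between variables in polynomial time is a reachability test on a "less than or equal" graph. The sketch below uses that general technique; it is not claimed to be the article's value-based constraint graph. It assumes an unbounded ordered value domain, omits constants, and the function name is hypothetical.

```python
def join_constraints_consistent(variables, constraints):
    """Check a conjunction of comparisons between variables.

    constraints: list of (x, relop, y), relop in {'=', '!=', '<', '<=', '>', '>='}.
    Assumes an unbounded ordered value domain and no constants.
    """
    # leq[x][y] is True when x <= y is implied by the constraints.
    leq = {x: {y: x == y for y in variables} for x in variables}
    strict, distinct = [], []
    for x, op, y in constraints:
        if op in (">", ">="):              # normalize to '<' / '<='
            x, y, op = y, x, "<" if op == ">" else "<="
        if op == "!=":
            distinct.append((x, y))
            continue
        leq[x][y] = True
        if op == "=":
            leq[y][x] = True
        if op == "<":
            strict.append((x, y))
    # Transitive closure (Floyd-Warshall style), O(n^3).
    for k in variables:
        for i in variables:
            for j in variables:
                if leq[i][k] and leq[k][j]:
                    leq[i][j] = True
    # x < y conflicts with an implied y <= x.
    if any(leq[y][x] for x, y in strict):
        return False
    # x != y conflicts with an implied x = y.
    return not any(leq[x][y] and leq[y][x] for x, y in distinct)
```

If the constraints are consistent, ordering the groups of forced-equal variables topologically and giving each group a distinct value yields a satisfying assignment, which is why the two conflict checks suffice under the stated assumptions.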
The procedure for testing satisfiability of a query Q with structural constraints and VBCs
is then as follows: (i) Chase the VBCs (using a separate value-based constraint graph); if
any violation is found, return “unsatisfiable”. (ii) Construct the (structural) constraint
graph G of Q, propagate the node identity constraints derived from the VBC chase to G, and
chase it. (iii) Q is satisfiable iff the chase terminates with no violation.
The article proves that a tree pattern query with structural constraints and VBCs and no wildcard
can be tested for satisfiability in polynomial time using the procedure above (TPQs with VBCs).
For TPQs with wildcards, joins, and disjunction, the problem becomes NP-complete.
The article proves that testing satisfiability of a tree pattern query with wildcards and only ≈
constraints, where the query uses only pc- and sad-edges, is NP-complete ([Hidders]).
The article also proves that testing satisfiability of a tree pattern query containing VBCs, with
disjunction allowed in selection constraints associated with nodes, is NP-complete (TPQ with
disjunction).
The complexity results for the schemaless case are summarized in the table below:
Disjunction | NICs/joins | Wildcards | Complexity
*  PTIME
*  PTIME
* *  NP-complete
* *  NP-complete
Satisfiability in the presence of Acyclic Schema:
A schema of a database is abstracted as a graph with nodes corresponding to tags and
edges labelled by one of the quantifiers ?, 1, *, + (with their standard meanings
‘optional’, ‘one’, ‘zero or more’, ‘one or more’). In the paper only DTDs are considered.
In order to present the main issues appearing in this part, some definitions are needed:
An embedding of a query Q into a schema ∆ is a function f: Q → ∆ satisfying the
following conditions: (i) f maps each tagged node to a node with the same tag; (ii)
whenever (x, y) is a pc-edge (ad-edge) in Q, there is an edge (path) from f(x) to f(y) in ∆.
Let ∆ be a schema and let Q be a tree pattern query with no wildcards or VBCs. Then
Q is satisfiable with respect to ∆ iff there is an embedding f from Q into ∆.
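For a wildcard-free query, the embedding test is direct, because each query node can only map to the schema node carrying its tag. The sketch below assumes the schema is given as a dictionary from a tag to the set of its child tags, with quantifiers ignored; the names `has_embedding`, `q_tags` and `q_edges` are hypothetical.

```python
def reachable(schema, src, dst):
    """True if dst can be reached from src via one or more schema edges."""
    stack, seen = list(schema.get(src, ())), set()
    while stack:
        t = stack.pop()
        if t == dst:
            return True
        if t not in seen:
            seen.add(t)
            stack.extend(schema.get(t, ()))
    return False

def has_embedding(schema, q_tags, q_edges):
    """schema:  tag -> set of child tags (quantifiers ignored)
    q_tags:  query variable -> tag (no wildcards)
    q_edges: list of (x, y, kind) with kind 'pc' or 'ad'
    """
    tags = set(schema) | {c for cs in schema.values() for c in cs}
    # Condition (i): every query tag must occur in the schema.
    if any(t not in tags for t in q_tags.values()):
        return False
    # Condition (ii): pc-edges need a schema edge, ad-edges a schema path.
    for x, y, kind in q_edges:
        tx, ty = q_tags[x], q_tags[y]
        if kind == "pc" and ty not in schema.get(tx, ()):
            return False
        if kind == "ad" and not reachable(schema, tx, ty):
            return False
    return True

schema = {"a": {"b", "c"}, "b": {"e"}, "c": {"f"}}
ok = has_embedding(schema, {"x": "a", "y": "b", "z": "e"},
                   [("x", "y", "pc"), ("x", "z", "ad")])
no = has_embedding(schema, {"x": "a", "y": "c", "z": "e"},
                   [("x", "y", "ad"), ("y", "z", "ad")])
```

In the second query, e is not reachable from c in this example schema, so no embedding exists.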
The article shows that for TPQs that contain only wildcard nodes, checking
satisfiability trivially reduces to checking whether the schema is of a given depth.
When wildcards are present, semantically we can assign any tags to the wildcards and
check for the existence of an embedding. This approach takes exponential time.
To test for the existence of an embedding efficiently in this case, a labeling procedure is used.
The article proves that a TPQ containing wildcards but no VBCs and no NICs is
satisfiable with respect to a schema ∆ iff the set of schema labels computed by the
labeling procedure is not empty for each node in the TPQ (Labeling).
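The summary does not spell out the labeling procedure, so the following is only one plausible bottom-up realization, not the article's algorithm. It assumes the schema is a dictionary from a tag to its child tags, that a wildcard is encoded as `None`, and that the query root may map to any schema node; all names are hypothetical.

```python
def label_query(schema, q_tags, q_children, root):
    """Compute, bottom-up, the schema tags each query node may map to.

    schema:     tag -> set of child tags (acyclic, quantifiers ignored)
    q_tags:     query variable -> tag, or None for a wildcard
    q_children: query variable -> list of (child, kind), kind 'pc' or 'ad'
    """
    all_tags = set(schema) | {c for cs in schema.values() for c in cs}

    def descendants(t):
        seen, stack = set(), list(schema.get(t, ()))
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(schema.get(u, ()))
        return seen

    desc = {t: descendants(t) for t in all_tags}
    labels = {}

    def visit(x):
        for y, _ in q_children.get(x, ()):
            visit(y)
        candidates = all_tags if q_tags[x] is None else {q_tags[x]} & all_tags
        # Keep a tag only if every child has a compatible label below it.
        labels[x] = {
            t for t in candidates
            if all(
                labels[y] & (set(schema.get(t, ())) if kind == "pc" else desc[t])
                for y, kind in q_children.get(x, ())
            )
        }

    visit(root)
    return labels

schema = {"a": {"b", "c"}, "b": {"e"}, "c": {"f"}}
# Query: a / * / e, where the middle node is a wildcard.
labels = label_query(schema, {"x": "a", "w": None, "z": "e"},
                     {"x": [("w", "pc")], "w": [("z", "pc")]}, "x")
satisfiable = all(labels.values())
```

Each node is visited once and each check is a set intersection, so the sketch runs in polynomial time, in contrast to enumerating all tag assignments for the wildcards.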
In the presence of node identity constraints and VBCs, determining satisfiability of a
query works as follows: We use the schema to infer structural predicates between any pair
of query nodes (which are tagged). We use inference rules to compute the closure of
structural predicates and check the resulting set for violations. The query is satisfiable iff
the resulting set is violation-free. As before, we use a constraint graph and a set of
inference rules to compute the closure. The inference rules are adapted from those
developed for the schemaless case. Rules involving sad or OTSP are dropped, since the
schema allows us to derive an unambiguous ad relationship whenever sad or OTSP holds.
Additionally, we need to infer relationships between element types from the schema. The
schema can tell us that two tags t, t’ are related by a pc-/ad-relationship, or that two query
nodes must be identical or that they must be cousins.
The article proves that a TPQ with NICs but no wildcards is satisfiable with respect to an
acyclic schema iff there is an embedding of the TPQ into the schema and no violation is
detected when the constraint graph of the TPQ is chased (Chase Completeness with
Schema).
The article also proves that the query complexity of satisfiability checking in the presence
of an acyclic schema without choice is PTIME (Query Complexity) and that the combined
complexity of satisfiability in the presence of an acyclic schema without choice is
co-NP-complete (Combined Complexity).
In the presence of a schema, testing satisfiability of a TPQ with node identity
constraints and wildcards is NP-complete.
The article proves that testing whether a TPQ is satisfiable with respect to a schema is
NP-complete if it contains wildcards and NICs, or if it contains VBCs (and no NICs)
(Hardness Results).
The complexity results for the schema case are summarized in the table below:
Disjunction | NICs/joins | Wildcards | Complexity
*  PTIME
*  PTIME
* *  NP-complete
*  NP-complete
Experimental Results
To study the effectiveness of testing satisfiability, the authors systematically ran a range
of experiments to measure the impact of various parameters. They ran the experiments on
the XMark benchmark dataset and a biomedical dataset from the National Biomedical
Research Foundation. For each dataset they constructed documents of various sizes using
the IBM XMLGenerator. The queries chosen for experimentation correspond to the classes
of TPQs studied in the paper. The results of measuring savings and overheads were:
Saving Ratio: As expected, for unsatisfiable queries the satisfiability check leads to
phenomenal savings. The saving ratio is close to 1 (usually between about 0.8 and 0.9)
whether the schema is present or not.
Overhead Ratio: The overhead ratio decreases as the document size increases. The
results show that the overhead is a negligible fraction of the evaluation time.
The authors also tested the impact of the number of constraints on satisfiability check time.
For satisfiable queries, as expected, the time increases, while for unsatisfiable queries it
decreases, as violations are found faster.
Summary
The article presents a method for testing satisfiability of various classes of tree pattern
queries, which are known to be closely related to XPath and XQuery and to be of
fundamental importance. The problem is studied for queries both with and without a
schema (acyclic and choice-free). Cases in which it is NP-complete or PTIME are
identified. For the latter, efficient algorithms are developed based on a chase
procedure. Analytical results are complemented with an extensive set of experiments.
Satisfiability checking can provide substantial savings in query evaluation, and the
results demonstrate that it incurs a negligible overhead on satisfiable queries.