TESTING THE SATISFIABILITY OF TREE PATTERN QUERIES WITH NODE IDENTITY CONSTRAINTS A Thesis by Barbara Jane Gobbert B. Science, University Of Queensland, 1979 B. Commerce, University Of Queensland, 1983 Submitted to the Department of Computer Science and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Master of Science May 2007 © Copyright 2007 by Barbara Jane Gobbert All Rights Reserved TESTING THE SATISFIABILITY IN TREE PATTERN QUERIES WITH NODE IDENTITY CONSTRAINTS I have examined the final copy of this Thesis for form and content and recommend that it be accepted in partial fulfillment of the requirement for the degree of Master of Science, with a major in Computer Science. ___________________________ Prakash Ramanan, Committee Chair We have read this Thesis and recommend its acceptance: ________________________________ Prakash Ramanan, Committee Member ________________________________ Sattiraju Prabhakar, Committee Member ________________________________ Thomas DeLillo, Committee Member iii DEDICATION To my family iv ACKNOWLEDGEMENT I would like to thank my advisor, Dr. Ramanan, for his help and insight on issues raised during the course of this research. This work was motivated by my previous work with the use of XML in the management, storage and retrieval of long-term electronic records as part of the Victorian Electronic Records Strategy in Australia. v ABSTRACT This research deals with testing the satisfiability of a subclass of XQuery and XPath expressions that contain node identity constraints. This subclass of expressions is called Conjunctive XPath. A query is satisfiable if there exists a database of XML documents that will result in a non-empty answer to the query, whereas a query that is not satisfiable will result in an empty answer when run against any database. Determining that a query is unsatisfiable prior to execution will result in savings in computer run-time by not executing unsatisfiable queries. Although the general problem is undecidable, we examine a subclass of queries called Conjunctive XPath where satisfiability is decidable. Previous researchers have presented algorithms for determining satisfiability based on predicate logic and also using non-deterministic finite automata. We present an algorithm for XPath queries with a single node identity constraint, based on topological sorting. This algorithm has faster run-time compared to previously known algorithms. vi TABLE OF CONTENTS Chapter Page 1. INTRODUCTION 2. XML, XPATH AND XQUERY 2.1 2.2 2.3 2.4 2.5 2.6 3. 6. 3 XML Documents .............................................................. XPath .................................................................................. Tree Pattern Queries .......................................................... XQuery ................................................................................ Tree Pattern Queries with Node Identity Constraints........... Topological Sort of Directed Acyclic Graphs........................ 3 9 11 14 15 19 ................................................ 21 Complexity of Deciding Satisfiability .................................... Unsatisfiability Due to Structural Constraints ....................... Determining Satisfiability from Structural Constraint Graph. Summary of the Problem ..................................................... 22 23 30 41 ALGORITHM ................................................................................ 42 Outline of Algorithm ............................................................ Formalization of Algorithm ................................................... Application of Algorithm ..................................................... Runtime of Algorithm .......................................................... 42 46 56 59 SUMMARY AND CONCLUSIONS .................................................... 61 5.1 5.2 5.3 5.4 Theoretical Implications ...................................................... Conclusion ........................................................................... Extension of Algorithm ........................................................ Areas of Further Research................................................... 61 61 62 63 LIST OF REFERENCES ................................................................... 64 4.1 4.2 4.3 4.4 5. 1 ....................................................... THE SATISFIABILITY PROBLEM 3.1 3.2 3.3 3.4 4. .......................................................................... vii LIST OF FIGURES Figure Page 1. Example of an XML Document........................................................... 4 2. XML document tree ............................................................................ 8 3. Tree Pattern Query............................................................................. 13 4. Two examples of descendant nodes .................................................. 16 5. Tree Pattern Query from Lakshmanan ............................................... 18 6. Structural Constraint Graphs for TPQ of Figure 5 .............................. 18 7. Directed Acyclic Graphs ..................................................................... 20 8. Nodes cannot be on same path and also be cousins ......................... 26 9. Node cannot have 2 parents .............................................................. 27 10. Constraint graphs with 3 vertices and 2 arcs...................................... 31 11. Constraint graphs – 3 vertices, 3 arcs, satisfiable .............................. 32 12. Constraint graphs – 3 vertices, 3 arcs, unsatisfiable .......................... 33 13. Constraint graphs – 4 vertices, 4 arcs, satisfiable .............................. 34 14. Constraint graphs – 4 nodes, 4 arcs, unsatisfiable............................. 34 15. Constraint graphs – 4 vertices, 5 arcs, satisfiable .............................. 36 16. Constraint graphs – 4 vertices, 5 arcs, 1 NIC, unsatisfiable ............... 37 17. Constraint graphs – 4 vertices, 5 arcs, 2 NICs, unsatisfiable ............. 38 18. Constraint graphs – 6 vertices, 7 arcs, 2 NICs, satisfiable ................. 40 19. Constraint graphs – 6 vertices, 8 arcs, 3 NICs, unsatisfiable ............. 40 20. Generalized Structural Constraint Graph for C XPath query .............. 45 viii 21. Query Q, document (Q), and embedding of Q in (Q)..................... 46 22. Case 1: before and after the merge.................................................... 48 23. Case 2: before and after the merge.................................................... 49 24. At least one of the arcs must be a d-arc............................................. 50 25. Case 3: before and after embedding .................................................. 51 26. Case 4: arcs are c-arcs ...................................................................... 52 27. Case 4.1: before and after embedding ............................................... 53 28. Case 4.2: Pu contains no d-arcs ......................................................... 54 29. Case 4.2: Pv partitioned into c-chains and d-chains ........................... 55 30. Case 4.2: Example of satisfiable query with c-chain .......................... 57 31. Case 4.2: Example of satisfiable query with d-chains and c-chain ..... 58 32. Case 4.2: Example of unsatisfiable query .......................................... 59 ix LIST OF ABBREVIATIONS / NOMENCLATURE Symbol Meaning NIC Node Identity Constraint TPQ Tree Pattern Query W3C World Wide Web Consortium XML eXtensible Mark-up Language => "is an ancestor of" relation -> "is the parent of" relationship / Root element / Child axis // Descendant axis BNF Backus Naur Form x CHAPTER 1 INTRODUCTION In this thesis, we study the problem of testing the satisfiability of some XPath queries with node identity constraints. A query is satisfiable if there exists a database of XML documents that will produce a non-empty answer to that query. Research has shown that in many cases the decidability of whether a query is satisfiable or not is NP-hard (Hidders 2003) (Geerts and Fan 2005), and in some cases it is actually undecidable (Benedikt, Fan et al. 2005). We will focus on special cases where the problem is decidable and solutions can be found in polynomial time. Further, this thesis examines the use of topological sort to determine some conditions, which will result in an XPath query being unsatisfiable. The queries we consider are XPath 2.0 (Kay 2004) queries that involve only child and descendant relationships amongst the elements of the XML document. We will examine how Node Identity Constraints (NICs) impose structural constraints on the documents; this will help us to determine the satisfiability of the query. Other researchers have used predicate logic to examine satisfiability (Lakshmanan, Ramesh et al. 2004), and also using non-deterministic finite automata (Fernández, Hidders et al. 2004). We present an algorithm based on topological sorting of the query vertices to determine if a query is satisfiable. If a total ordering of the child and descendant edges among the query nodes can be obtained, then the query is satisfiable. 1 In Chapter 2 we provide examples of XML documents and discuss some of the key features of XML documents and XPath queries. Tree Pattern Queries are defined and examples provided for sample XML documents and XPath queries. In Chapter 3 the satisfiability problem is examined with examples that highlight specific structural aspects that are evident in unsatisfiable queries. Constraint graphs (Lakshmanan, Ramesh et al. 2004) are used to explain the impact of increasing the number of vertices in a Tree Pattern Query. The impact on satisfiability determination is explored by adding c-edges to a constraint graph and comparing this to the impact of adding d-edges. In Chapter 4 the algorithm for determining satisfiability for a Tree Pattern Query with a single node identity constraint is presented, first as an outline followed by a formal theorem. Four cases are considered which cover the various possible structures in these special Tree Pattern Queries. In the last chapter, theoretical implications are considered and the extension of the algorithm to Tree Pattern Queries with multiple Node Identity Constraints is discussed. 2 CHAPTER 2 XML, XPATH AND XQUERY XML, eXtensible Mark-up Language, was developed as an open standard for marking up documents by the World Wide Web Consortium (W3C) XML Working Group in 1996 (W3C 1996). Due to its flexibility, it has become widely adopted in the electronic exchange of information across the Internet and also for the longterm storage of electronic information (Lewis, Bernstein et al. 2002). XML is a meta-language that allows a user to define a markup language to describe their data. Data in XML format is both human readable and machine readable. 2.1 XML Documents Mark-up languages are used to define extra information about text within the same document. For example, chapter and section headings in a book are an example of mark-up information within a piece of writing. The most well known mark-up language is HTML, where the tags are used to provide additional information about how a web browser should display the data content. By using XML, a user can assign meaning to their data within the document itself. Each piece of information is surrounded by a description of its meaning. XML is particularly suited to documents which often contain semi-structured content. This has resulted in XML documents becoming a popular format for representing, storing and sharing information by many different computer applications. XML documents can be accompanied by optional schema documents for type definitions and other constraints; however this thesis focuses on XML documents and databases without the presence of schemas. 3 <?xml version="1.0" ?> <StudentList Date="2006-04-15"> <Student> <Personal> <Name> <FamilyName>Aardvark </FamilyName> <GivenName>Alexander</GivenName> <GivenName>Stuart</GivenName> </Name> <StudentID>123-456-789</StudentID> <Address Type="Mailing" Status="Confirmed"> <AddressLine>P.O. Box 12</AddressLine> <AddressLine>Any Town, VIC 3130</AddressLine> </Address> <Address Type="Residential" Status="Previous"> <AddressLine>12 Some Street</AddressLine> <AddressLine>Any Town, VIC 3130</AddressLine> </Address> </Personal> <Courses> <Semester ID= "Spring 2006"> <Course> <Code>CS311</Code> <Instructor> <FamilyName>Zebra></FamilyName> </Instructor> </Course> <Course> <Code>MA412</Code> <Instructor> <FamilyName>Smith></FamilyName> </Instructor> </Course> <Course> <Code>CH101</Code> <Instructor> <FamilyName>Brown></FamilyName> </Instructor> </Course> </Semester> <Semester ID= "Fall 2005"> <Course> <Code>CS211</Code> </Course> 4 <Course> <Code>MA333</Code> </Course> </Semester> </Courses> </Student> <Student> <Personal> <Name> <FamilyName>Zebra </FamilyName> <GivenName>Zoe</GivenName> </Name> <BirthDate>1993-05-05</BirthDate> <Address Type="Email">zzebra@cox.net</Address> </Personal> </Student> </StudentList> Figure 1. Example of an XML document. Figure 1 shows an example of an XML document for a list of students. Notice how the document is self-describing in that the name for each piece of data is immediately adjacent to the actual data content. Each piece of data is enclosed by opening and closing tags of the form <tag>data</tag>. The combination of the opening tag, optional data, and a closing tag is called an “element” of the XML document. Other data with its tags can be nested within another element; however the opening and closing tags of the elements must be properly nested. A nested element cannot extend beyond its outer enclosing element, for example the <FamilyName> tag is nested properly within the <Name> tag which is properly nested within the outer element of <Personal>, because the opening tag <FamilyName> is after the opening tag <Name> and the closing tag 5 </FamilyName> is before the closing tag </Name>. There is one root element for the document and all other elements are nested within the root element. The data is called semi-structured because each element can repeat a varying number of times or be absent altogether. A student can have a varying number of given names, they can have a variety of different type of addresses with varying lines of address information. Similarly, a student can be enrolled for courses for different semesters. Some students may not be enrolled in any courses, particularly if they are new at the University. Some data is included inside the opening tag itself, and this type of data is called an Attribute. The Date, Type and ID information above are examples of attributes. Attributes usually contain information about the element itself, in welldesigned XML documents. The Address opening tag has an attribute 'Type' which holds information about the nature of the address instead of direct address data. Well-formed XML documents contain: • one or more elements identifying the data which they surround with opening and closing tags • an unique root element • properly nested elements • no repetition of attributes within one opening tag, and • attribute values are enclosed in quotes. 6 A formal data model is used to define XML, and in this data model an XML document is represented as an ordered, labeled tree of nodes, where the nodes represent the elements (Kay 2004). Each node of an XML document tree may have a number of child nodes, but each node, except for the root node has exactly one parent node. The nodes are numbered sequentially so each node has a unique identifier in addition to the tag-name of the XML element. Tagnames can be repeated throughout the document. The XML document in Figure 1 can be represented as an XML document tree as shown in Figure 2. Each element, such as Student, in the document is represented as a node of the tree, where the tag name is the node name. Nested elements of an outer element are shown as child nodes of the parent element. For example the nested elements Name, StudentID and Address with the element Personal are shown as separate child nodes of the node labeled Personal. Edges are used to link the parent node to its nested child nodes to show the hierarchy. The root node, called the document node, of the XML document tree is different from the root element of the XML document. The root node is represented by / and it has a single child node representing the root element of the document, in this example, StudentList. The attributes and textual information of an element are represented as the contents of the node. 7 Figure 2. XML document tree. The XML document tree can be represented by a 4-tuple (Lakshmanan, Ramesh et al. 2004) T = (N, P, r, ), where N is the set of element nodes P represents the parent-child relationships r is the root element of the document, and is the labeling function which assigns a unique tag name to each node 8 All nodes of the tree have a parent node except for the root node. For an element or attribute node x D, (x) denotes its tag-name. The special root node of the document, root(D), does not correspond to any element of the document and ( root(D)) = /. 2.2 XPath XPath is a query language for selecting a set of nodes in an XML document. The XPath data model views an XML document as a rooted tree of nodes, as in section 2.1. The input to an XPath expression is an XML document tree and the output is a set of nodes of the tree. An XPath query Q consists of a sequence of location steps, Q = L1L2 … Ln. Each location step Li specifies an axis, node test, and predicates, Li = <axis> <node test> <predicates> XPath provides thirteen relative axes which are defined for a given context node: • self, attribute, namespace, • child, descendant, descendant-or-self, • parent, ancestor, ancestor-or-self, • preceding-siblings, following-siblings, • preceding (nodes that end before the context node) and following. The most common axes used are child and descendant, which are represented by / and // respectively in XPath expressions. In this thesis, only these two axes are considered. Node test is a test on the name or type of the element. 9 We consider a subclass of XPath 2.0 called Conjunctive XPath (C XPath) consisting of queries, where each predicate is either an and of predicates, or a relative query. This class of queries is defined by the following grammar in Backus-Naur Form: <query> ::= <loc_step> | <loc_step> <query> <loc_step> ::= <axis> <node_test> <predicates> <axis> :: = / | // <node_test> ::= elem_label | * <predicates> ::= | [<predicate>] <predicate> ::= <predicate> and <predicate> | . <query> elem_label belongs to the alphabet of tag names; .<query> indicates a relative query; * is the wildcard label that matches any tag name; axis is either / and // which correspond to child or descendant axis respectively. Let axis(Li), nodeTest(Li) and predicate(Li) denote the axis, node test and predicate in step Li, respectively. XPath expressions can either be absolute path expressions, which always start the navigation at the root of the document tree, or relative path expressions which start the navigation from the current node which is called the context node. For example, //Address will select all Address elements in the document. In comparison, the expression .//Address will select all Address elements that are descendants of the context node, while the expression Address will return only 10 the Address elements which are children of the current node (Lewis, Bernstein et al. 2002). In an XPath expression, every step evaluates a sequence of nodes. For example, /doc/bib/book will find all book nodes within all ‘bib’ nodes in a document with root element “doc”. The data model also defines an ordered collection of nodes, called a NodeList, for the traversal of the list of nodes. Using the NodeList, nodes can be located using path expressions. In this example of an XPath expression, //Student//Address[@status = “Confirmed”]/AddressLine the node Address is a descendant node of the Student nodes, and AddressLine is a child node of the Address nodes. The query will search for Address nodes with an attribute of status = “Confirmed” and return the AddressLine nodes immediately below those Address nodes. 2.3 Tree Pattern Queries (Amer-Yahia, Cho et al. 2001) proposed a model to represent a C Xpath query in the form of a tree-shaped pattern. A query Q C XPath can be represented by a tree tree(Q) = (V,A) where V is a set of vertices and A is a set of arcs (Ramanan 2003). Each vertex v V has a tag (v) {/, *} associated with it. (v) is the element name associated with v; / is the tag of root(Q), and * denotes the wildcard tag. Each arc a A is either a child arc (c-arc) or a descendant arc (d-arc), corresponding to a child or descendant axis in Q, 11 respectively. In our figures, c-arcs and d-arcs are represented by single lines and double lines, respectively. For example, Figure 3 shows tree(Q) for Q = //a[b and .//c]/ * [a and .//b]. The vertex of tree(Q) that corresponds to the node test in the last location step in Q is called the output vertex of Q, and is denoted by opv(Q); it is marked by a # sign in Figure 3 (a). For an arc r = (u, v): If r is a c-arc, we say that v is a c-child of u; if r is d-arc , v is a d-child of u. In general, |tree(Q)| is linear in |Q|. From now onwards, we will not distinguish between Q and tree(Q). To minimize confusion the terms vertices and arcs will be used when referring to components of Q; and nodes and edges to refer to components of D, an XML document. For a vertex u Q, let Qu denote the subtree of Q that is rooted at u. For a node n D, let Dn denote the subtree of D rooted at n. An embedding of Qu in Dn is a mapping from the vertices of Qu to the nodes Dn, that satisfies the following conditions: 1. Preserve vertex types: For each vertex v in Qu: • If (v) = /, then (v) = root(D). In this case, v = u = root(Q) and n = root(D). • If (v) , then ( (v)) = (v). 2. Preserve arc types: For each vertex v in Qu: • If v´ is a c-child of v: (v´) is a child of (v) in D. • If v´ is a d-child of v: (v´) is a descendant of (v) in D. 12 The output of Q on D is Q(D) = { (opv(Q)) | is an embedding of Q in D}. The answer to the query is the set Q(D) of nodes that result from all the possible ways of “embedding” the tree pattern into the database. For example, using the XML document from Figure 1, a query to find for all students only the AddressLines for Addresses which have a status of “Confimed”, can be represented by the XPath query //Student//Address[@status = “Confirmed”]/AddressLine and as a tree pattern as shown below in Figure 4. Q(D) = { <AddressLine>P.O. Box 12</AddressLine>, <AddressLine>Any Town, VIC 3130</AddressLine>} Figure 3. Tree Pattern Query. 13 2.4 XQuery XQuery is a declarative language that can be used to formulate queries for both XML documents and XML databases. Declarative languages state what needs to be computed instead of how it is to be computed. XPath is a subset of XQuery, which is a more powerful query language. XQuery, like XPath, uses a tree-structured data model for the XML data and for navigation. XQuery was developed by the World Wide Web Consortium (W3C) XML Working Group to be a concise and flexible query language for XML (Brundage 2004). Structured Query Language, SQL, was developed to query relational data, unordered sets of "flat" rows of data. In contrast, XQuery focuses on ordered sequences of values and hierarchical nodes and supports node identity. XQuery can construct temporary XML results within a query. New XML documents can also be constructed using XQuery. The central expression in XQuery is the FLWOR (For, Let, Where, Order by, and Return) expression. Simple FLWOR expressions can be expressed by using XPath expressions to navigate the NodeList using the relative axes. As an example of a FLOWR expression, consider the XML document from Figure 1 and the following query, which will produce a document of student names who took CS311 in the Spring 2006 semester. The document will be sorted by the student’s last name. FOR $s IN doc(“http://www.wsu. Studentlist.xml”)//Student LET $c := $s//Semester[ID=“Spring 2006”]/Course WHERE $c/Code = “CS311” 14 ORDERBY ($s//FamilyName) RETURN <StudentSummary LastName=$s//FamilyName FirstName=$s//GivenName /> 2.5 Tree Pattern Queries with Node Identity Constraints Tree Pattern Queries with Node Identity Constraints arise in XQuery. XQuery provides the operator ‘is’ to determine whether two nodes have the same identity or not. Tree Pattern Queries with node identity constraints can be represented as Structural Constraint Graphs, where the node pairs with identity constraints are represented as a single node. In a tree and therefore in an XML document, a node can have only one parent. In a tree pattern query, paths are represented with child edges and descendant edges. For example, suppose node x has 2 descendants e and f. Then both e and f share a common ancestor x. There are 2 possibilities, either e and f are on the same path from node x, or they are cousins. If e and f share a common ancestry path (or pedigree), it could be x => e => f or it could be x => f => e where => is used to represent "is a descendant of" relation. In the other possibility, "cousins" share a common ancestor but are the descendants of sibling nodes. Sibling nodes are child nodes of the same parent node and so "cousin" nodes cannot occur on the same common path. Therefore 15 nodes e and f do not lie on the same path. If e and f are cousin nodes, then a query with a node identity constraint on them is unsatisfiable. The least common ancestor of two nodes e and f in a TPQ T is the node x that is an ancestor of both e and f and that has the greatest depth in T (Bender, Pemmasani et al. 2001). Therefore any query that requires two nodes to be identical but the nodes do not share a common path from their least common ancestor is not satisfiable (Lakshmanan, Ramesh et al. 2004). Figure 4. Two examples of descendant nodes 16 Consider the following XQuery in Figure 4 from (Lakshmanan, Ramesh et al. 2004): FOR $a in document(“doc.xml”)//a, $e IN $a/b//e, $f IN $a/d//f, $c IN $a//c, $e1 IN $c//e, $f1 IN $c//f WHERE $e = $e1 AND $f = $f1 RETURN {$a} The corresponding TPQ with two node identity constraints is shown in Figure 5. The query can be represented by the structural constraint graphs in Figure 6 (a) and (b). The pairs of node identity constraints are represented by a single vertex for each NIC as shown in Figure 6(a). However the Structural Constraint Graph for this TPQ may alternatively be represented by Figure 6(b). Vertices B and C are both ancestors of vertex E and also descendants of vertex A, but B is a child of A. This implies that vertex C must be a descendant of vertex B. Similarly vertex C is also a descendant of vertex D. Therefore the implicit constraints on vertex C may be represented more clearly by explicitly showing vertex C as a descendant of vertex B and also as a descendant of vertex D. 17 Figure 5. Tree Pattern Query from (Lakshmanan, Ramesh et al. 2004). Figure 6. Structural Constraint Graphs for TPQ of Figure 5. 18 Vertex A has child vertices B and D and so B and D are "siblings". Therefore B and D cannot share any common descendants of A as this would require the descendants to lie on the same path. However vertex C is a common descendant of both B and D. Therefore this query is unsatisfiable. 2.6 Topological Sort of Directed Acyclic Graphs A directed graph, G = (V, A), where V is a set of vertices, and A is a set of arcs. An arc is represented by a tuple (u,v) where u, v are elements of V. In a directed graph, the first vertex in the arc is the start of the arc, and the second vertex is the end of the arc. Therefore in a directed graph, arc (u,v) is different from arc (v,u). A directed graph is acyclic if it contains no cycles, that is, it is possible to produce a linear ordering of the vertices in G consistent with the total order u < v if (u,v) is an element of A. Any such ordering is a topological sort. For any directed acyclic graph, there will be one or more topological sorts. The topological sort of G can be found using a variation of the Depth First Search algorithm (Cormen, Leiserson et al. 2001). Tree pattern queries can be represented by a special type of directed graph, Structural Constraint Graphs, where there are two types of arcs representing the child and descendant relationships of the query. For the purposes of a topological sort, descendant arcs behave the same as directed arcs in a graph, and it is possible that other vertices may appear in the linear 19 ordering between the two vertices of the descendant arc. However a child arc means that the second vertex of the arc must appear immediately after the first vertex of a child arc in a linear ordering. Standard topological sort algorithms cannot be used on directed graphs with child arcs. For Graph 1 shown in Figure 7, there are many possible linear orderings, such as ABDFCEGH, ABCDEFGH etc. However for Graph 2, which has both c-arcs and d-arcs, there is only one possible linear ordering which is ABCEGDFH. Figure 7. Directed Acyclic Graphs. 20 CHAPTER 3 THE SATISFIABILITY PROBLEM Increasingly, more electronic information is managed by XML or stored as XML documents. In part this is due to the adoption of XML for data exchange between organizations and the use of XML by archival agencies for the long-term storage of digital information (Quenault 2004). When terabytes of XML documents need to be searched, it is essential that queries are optimized. A key step in query evaluation is to determine if there is an answer to the query before running the query against a database. A query is satisfiable if there is a database that has a non-empty answer. (Lakshmanan, Ramesh et al. 2004) showed that a satisfiability check can make substantial savings in query evaluation. Some queries are unsatisfiable because there is no database that can return a non-empty answer to the query. In these cases, if the query is determined to be unsatisfiable, then the query does not need to be evaluated against the given database. This chapter examines some conditions which will result in the query being unsatisfiable. First we examine the complexity of deciding whether a query is satisfiable. Next we examine some structural constraints on a query that if broken causes a query to be unsatisfiable. In the next chapter we present an efficient algorithm to test for satisfiability in some special cases. 21 3.1 Complexity of Deciding Satisfiability (Hidders 2003) examined the problem of deciding the satisfiability of XPath 2.0 expressions. He showed the problem of deciding the satisfiability of XPath expressions is NP-hard when all axes are allowed and predicates may contain set intersection, set union, and set difference. Subsequent researchers have examined special situations where the problem of deciding satisfiability may be more tractable. Although XML data is semi-structured, an XML document is often associated with a set of rules, a grammar. These rules impose structural constraints on the database and may be represented by a Document Type Definition (DTD) or an XML Schema. The presence of a DTD or XML Schema provides an additional factor in determining the satisfiability of an XPath expression. In the presence of recursion in the DTD, (Benedikt, Fan et al. 2005) showed that it is undecidable to determine if an XPath expression with negation is satisfiable. XPath queries can be constructed using vertical axes, such as child, descendant, parent, and ancestor axes, which are similar to file path navigation in many operating systems. However XPath queries can also be constructed to access the order of XML data. These types of queries, using horizontal or siblings axes, allow access to the ordered sequence of the nodes. This makes it possible to construct queries such as find the third course taken by students, or find the second author of books. In the absence of DTDs, (Benedikt, Fan et al. 2005) showed that for queries with vertical axes and without negation, it was possible to test for satisfiability in time (O |Q|3). In contrast, (Geerts and Fan 22 2005) found that for queries with sibling axes and without a DTD, that satisfiability is undecidable. 3.2 Unsatisfiability Due To Structural Constraints Given the difficulty of deciding satisfiability, we shall focus our examination on queries which contain only descendant and child axes. Queries with negation are excluded from this study. These queries shall be considered in the absence of DTDs and XML Schemas. There are 3 constraints that can be represented in a Tree Pattern Query: tag, value and node identity. Tag constraints are used when searching for particular types of nodes, for example author nodes in a book database. Value constraints are used for searching for particular content of a node, for example books with date published after a particular year. Node identity constraints are particularly common in XQuery FLOWR expressions (see Figure 4) when searching for nodes that have several conditions in common. Node Identity Constraints are similar in concept to join conditions in SQL queries. In TPQ, two nodes, b1 and b2, have a Node Identity Constraint, NIC, when there exists one or more paths from an ancestor node, x, to the two nodes, b1 and b2 which are equated by the node identity constraint. The nodes must have the same tag. The interaction between these various constraints can be evaluated to determine if a query is satisfiable or not. Queries with just tag and value constraints are satisfiable, providing the value based constraints are consistent and in the absence of a DTD or XML Schema (Lakshmanan, Ramesh et al. 23 2004). Queries that contain Node Identity Constraints may be satisfiable under certain conditions. Violations of node identity constraints, tag constraints, and value-based constraints can make a query unsatisfiable, through a pair of conflicting predicates. Examples are (a) node x is identical to node y and node x is not identical to node y (b) node x is the ancestor of y and y is the ancestor of x, referred to as a cycle (c) node x and node y are on the same path and node x and node y are cousins (d) node x is a child of node y and node x is a child of node z ( a node cannot have 2 parents). The conflicts in examples (a) and (b) are self-evident. Example 1 – Violation (c) //x[b//d = c//d] is an example of a query that is unsatisfiable. This is because the query is asking for 2 distinct children of node x that have a common descendant. Let NIC-node be used to refer to the node with the node identity constraint, in this example this is the node with tag name D. In a constraint graph, each NICnode will have at least 2 incoming arcs. There will be a vertex which is the start of these incoming paths that end with the NIC-node, and this vertex will be the least common ancestor of the NIC-node. In the constraint graph above, X is the least common ancestor of the NIC-node. In the XPath query, this node is called the context node. In general, in a constraint graph for a query, where the least 24 common ancestor of the NIC-node has 2 or more child arcs that form paths to the same NIC-node, the query will be unsatisfiable. There is one special case where it is possible for the query to be satisfiable. Consider the query, //x[c//f//d = c//d], where d is the NIC-node and x is the context node. Due to the way the query is written, the references to node c are referring to all nodes with tag c which have a parent with tag x. This is because c occurs inside the square brackets. If c was the context node of the two d nodes then the query would be written as //x/c[.//f/d = .//d] and this query is satisfiable. Therefore in the special case above, the query could be satisfied if the two child nodes of x were in fact the same node. This is possible only if the nodes have the same tags which they do in this special case. 25 Figure 8. Nodes cannot be on the same path and also be cousins. Example 2 – violation (d) //a[.//b/d = .//c/d] is an example of a query that is unsatisfiable, as it requires d to be the child of two different nodes, which is not possible in an XML document. In general, any query is unsatisfiable where the NIC-node has two or more parents. There is one special case where it is possible for the query to be satisfiable. Consider the query, //a[.//b//c/d = .//f//c/d], where d is the NIC-node with 2 parents. However in this particular query, it would 26 be possible for the query to be satisfied if both the parent nodes were in fact the same node. Figure 9. Nodes cannot have 2 parents. Example 3 – complex violation of (c) //a[/b/c/d/e/f = //g//f] is another example of a query that is unsatisfiable, as the path from a to f (the pedigree) is completely defined by the child edges and it does not contain g. Consider the query //a[b/c/d/e/f = .//c/d//f] which can be satisfied if the nodes c and d on both paths are the same nodes. 27 In general, any query is unsatisfiable where one path in the query contains only child edges from the least common ancestor to the NIC-node, unless it is possible to "map" or embed the other path(s) onto the path of child edges. Therefore in cases where the length of the path with descendant edges is shorter than or equal to the length of the path with all child edges, it may be possible for the query to be satisfiable in the following situation. Let the sequence of nodes defining the full pedigree, P, (that is, the path containing all the child edges) be written as a string starting with the first node, followed by the next node, etc until the string is finished with the node with the Node Identity Constraint. (NIC). So a/b/c/d/e/f would be written as the string abcedf. Let the other path, Q, be written as the string containing all the nodes on the other path to the NIC, such that where there is a child edge, the parent node is followed by the child node tab, so a/b would be written as ab, and where there is a descendant edge the wildcard * is inserted into the string, so a//b would be written as a*b, where the wildcard can match 0 or more nodes from string P. Let the length of path P be represented by |p| and the length of path Q by |q|. If |p| < |q| then the query is unsatisfiable. However if |p| >= |q| and q is a substring of p then the query is satisfiable. There are various algorithms for determining if a string is a substring of another string, (Stephen 1994). 28 Example 4 – satisfiable queries //a[b//d//e = .//f//e] is a query that can be satisfied as node f can be embedded in the path either between b and d or between d and e. The query can be rewritten as //a/[b[//d//e = //f//e]] and queries in this form are satisfiable. Example 5 – cycle in a TPQ Query that results in a cycle, for example the following FLWOR expression from (Lakshmanan, Ramesh et al. 2004), will be unsatisfiable. FOR $a in document("doc.xml")//a $e IN $a/b//e, $f IN $a/d//f, $c IN $a//c, $e1 IN $c//e, $f1 IN $c//f WHERE $e = $e1 AND $f = $f1 RETURN {$a} Figure 4 shows the tree pattern query and the constraint graph for this FLOWR expression. The constraint that $e = $e1 requires a/b/c to lie on the same path to node e, and similarly $f = $f1 requires a/d/c to lie on the path to node f, thus requiring node a to have a child with two different labels. Different representations of the constraint graph will also contain a cycle, which is a violation of condition (b) which makes the query unsatisfiable. 29 Example 6 – satisfiable query //a[.//b//c//d/e = .//f//e] is a query that can be satisfied as node f can be embedded in the path as an ancestor of d. In general queries of this form, with only descendant edges and no child edges, can be satisfied. A constraint graph with only descendant edges is comparable to a directed acyclic graph. A total ordering, or a topological sort, of the nodes of the TPQ can be matched directly to at least an XML document that contains nodes in the same sequence as the result of the topological sort of the constraint graph. Therefore there is at least one database which will satisfy the query, so the query is satisfiable. 3.3 Determining Satisfiability from Structural Constraint Graph In this section we use the structural constraint graphs for minimized tree patterns as described by (Lakshmanan, Ramesh et al. 2004), but with a difference. Nodes with node identity constraints are shown as the same node, which allows a DAG to be used instead of the more specialized tree. XML queries that initially appear to be very similar can vary in terms of satisfiability. We shall examine various structural constraint graphs and compare the ability to produce a topological sort of the DAG with c-arcs and d-arcs, with the satisfiability of the query with reference to the structural constraints described in 3.2. In terms of the constraint graph, these constraints can be redefined to the following rules: (a) a vertex can have only one parent but many ancestors (b) constraint graph does not contain a cycle 30 (c) if a path between two vertices contains only child arcs, then any alternative path must be able to be “embedded” into that c-arc only path. In the case of a single vertex, a topological sort is possible and the query is satisfiable. Similarly in the case of 2 vertices, a topological sort is possible and the query is satisfiable. Now to consider some more interesting cases. 3.3.1 Constraint Graph with 3 vertices There are only 3 possible constraint graphs (ignoring isomorphic graphs) consisting of 3 vertices and 2 arcs as shown in Figure 10. Note that the direction of the arcs is assumed by the relative hierarchy of the nodes, that is all arcs are in the downward direction. Single lines represent c-arcs and double lines represent d-arcs. Figure 10. Constraint graphs with 3 vertices and 2 arcs. We will not consider graphs with fewer arcs than vertices, as we are interested in exploring the impact of node identity constraints on the satisfiability of queries. By drawing the constraint graph with only one vertex to represent the 31 node with the node identity constraint, then the node representing the NIC node will have at least 2 incoming arcs. Any nodes that are not on the path the NIC nodes are effectively “pruned” from the constraint graph as they have no impact on the satisfiability of the query. If we consider 3 nodes and 3 edges in the TPQ, there are 4 graphs which represent satisfiable queries (Figure 11) and 4 graphs that represent unsatisfiable queries (Figure 12). In the case of the graphs representing the unsatisfiable queries, it is not possible to produce a topological sort of the graph. The most noticeable point of difference is that the constraint graphs of the satisfiable queries all have a d-arc between the highest vertex and the lowest vertex. Whereas the unsatisfiable queries all have a c-arc from the highest vertex to the lowest vertex. Figure 11. Constraint graphs – 3 vertices, 3 arcs, satisfiable. 32 If we compare the topological sort to inequality expressions, then a c-arc between b and c can be considered the same as b < c where c = b + 1. In contrast, d-arcs represent the relationship b < c. Figure 12. Constraint graphs – 3 vertices, 3 arcs, unsatisfiable. 3.3.2. Constraint Graph for TPQs with 4 nodes With 4 nodes in a TPQ and one Node Identity Constraint, then it is possible for the restraint graphs to have 4 or 5 arcs. Initially graphs with 4 arcs are examined. Out of the 10 distinct graphs, 4 are satisfiable and 6 are not. 33 Figure 13. Constraint graphs – 4 vertices, 4 arcs, satisfiable. Notice that as the number of c-arcs increases, then more possible constraint graphs are for queries that are unsatisfiable. Once half the arcs are c-arcs, there is only one variation where the query is satisfiable. With the addition of another arc, then it is possible to determine unsatisfiability with the presence of an unsatisfiable sub-graph within the constraint graph. Figure 14. Constraint graphs – 4 vertices, 4 arcs, unsatisfiable. 34 There are 4 possible ways to add a 5th arc to the constraint graphs of 4 vertices and 4 arcs. a. Add a d-arc from A to D which will not affect the satisfiability of the 4 arc graph b. Add a c-arc from A to D which will make all the graphs unsatisfiable c. Add a d-arc from B to C which will make some previously satisfiable queries unsatisfiable. d. Add a c-arc from B to C which will make all the queries unsatisfiable except for where there are d-arcs from A to C and from B to D. The first 4 constraint graphs in Figure 15 show that the addition of a new d-arc from A to D has no impact on the satisfiability of the query as this arc is redundant as it contains no additional constraints within the query. The new arc does not increase the number of NICs in the constraint graph. Expressed as inequality expressions, we have added a < d to the existing expressions of a < b and b < d, which has added no further information. However the addition of a “sideways” d-arc from B to C does affect the satisfiability of the query. Vertex C and Vertex D represent separate NIC nodes in the constraint graph. Notice that the constraint graph now contains sub-graphs for each NIC. Only 4 graphs represent a satisfiable query with the addition of this additional d-arc. These 4 graphs also contain satisfiable sub-graphs. Alternatively if a c-arc is added from Node B to Node C, these same constraint graphs are still satisfiable. In contrast, the addition of a c-arc from A to D makes all the queries 35 unsatisfiable, as shown in Figure 16. This additional c-arc places an additional constraint, namely that the lowest node must be a child of the highest node. Figure 15. Constraint graphs – 4 vertices, 5 arcs, satisfiable. 36 The addition of further edges to queries that are already unsatisfiable has no effect on satisfiability. If a topological sort is not possible for a graph, the addition of further arcs cannot make the sort possible. The unsatisfiable queries with the addition of c-edge from A to D are shown in Figure 16. Figure 16. Constraint graphs – 4 vertices, 5 arcs, 1 NIC, unsatisfiable. The addition of an arc from B to C adds a new separate NIC node to the constraint graph. The last 8 graphs in Figure 15, show the queries that are still 37 satisfiable after adding a d-edge or c-edge from B to C. Figure 17 shows the queries that are satisfiable with 4 edges but are unsatisfiable with an additional edge from B to C. Figure 17. Constraint graphs – 4 vertices, 5 arcs, 2 NICs, unsatisfiable. The addition of 6th edge can be accomplished in 2 ways: either have 2 or 3 NIC nodes. First we will examine the scenario with 2 NIC nodes and consider the impact on the satisfiable queries shown in Figure 15. We can add a d-arc from A to D but we have shown that this arc is redundant as a constraint. Similarly we have previously shown that adding a c-arc from A to D makes the queries unsatisfiable, basically because it requires node D to be both a child and a grand-child of Node A at the same time which is not possible. 38 The addition of a third NIC Node will lead to a cycle in the graph. No ordering of the vertices is possible if there is a cycle in the constraint graph since it means that in the TPQ, node B is a child or descendant of Node C which is also a child or descendant of node B. 3.3.3. Constraint Graph with 6 vertices With 6 nodes in a TPQ, the number of possible structural constraint graphs is quite large. However there are some general principles that can be applied to assist with determining whether a constraint graph has a topological sort and hence whether the query is satisfiable. If there exists a sub-graph in the query that it unsatisfiable, then the larger query is unsatisfiable. If there is just one NIC in the graph then it is relatively easy to determine satisfiability. If the ancestor of the path has 2 c-edges or if the NIC node has 2 incoming c-edges then the query will be unsatisfiable as either a node has 2 parents, or we are asking for nodes to be cousin and also have the same parent, which is not possible. More complex issues need to be considered when there are 2 or 3 NICs in the constraint graph. Both of the constraint graphs shown in Figure 18 are satisfiable and also each graph has only one possible topological sort that satisfies the constraints of the c-arcs and d-arcs. For the first graph the only possible ordering is A => C -> E => B -> D => F For the second graph the only possible ordering is A => B => C -> D => E => F If the arc from C to D is omitted from both graphs then there are several possible orderings for each graph. Once there is only one possible ordering for a graph, then for the query to remain satisfiable as further constraints are added, it means 39 that these additional constraints are redundant as they do not add any additional constraints on the query. Figure 18. Constraint graphs – 6 vertices, 7 arcs, 2 NICs, satisfiable. Using the graphs from Figure 18, let us examine the impact of adding another NIC to the constraint graph at node E. For the first graph a d-arc is added from B to E, and for the second graph a c-arc is added from B to E. In both cases the query becomes unsatisfiable and there is no ordering possible of the constraint graph. These graphs are shown in Figure 19. Figure 19. Constraint graphs – 6 vertices, 8 arcs, 3 NICs, unsatisfiable. 40 3.4 Summary of the Problem Determining satisfiability of XPath expressions with Node Identity Constraints is a complex problem and in some situations the problem is undecidable. We examined the satisfiability of query expressions in the absence of DTD and XML schemas and in the absence of sibling axes, wildcard and negative expressions within the query. This restricted group of queries has been shown to be have PTIME solutions for determining satisfiability. We examined conditions that cause a query to be unsatisfiable and compared the related constraint graph and the examined whether it was possible to produce a topological sort of the nodes. We found that iff a topological sort of the constraint graph could be produced then the query was satisfiable. Similarly we found that if it was not possible to produce a topological sort of the constraint graph then the equivalent query was unsatisfiable. In the next chapter we present an efficient algorithm to test for satisfiability in these special cases of queries without negation and wildcards, with only child and descendant axes and consider these queries in the absence of DTDs and XML Schemas. 41 CHAPTER 4 ALGORITHM In this chapter, we present an algorithm for testing the satisfiability of a Tree Pattern Query. First we present an algorithm for a TPQ with a single node identity constraint. First we will provide an outline of the algorithm and then examine how the algorithm handles various cases. This algorithm can be used to test for satisfiability of queries with more than one node identity constraint, if the query vertices involved in the various NICs are disjoint. 4.1 Outline of Algorithm The general structure for a structural constraint graph for a Conjunctive XPath query with one node identity constraint is shown in Figure 20. In the generalized Structural Constraint Graph, the least common ancestor of the nodes with the Node Identity Constraint is represented by the vertex b; the Node identity constraint is shown as vertex v(1+2) indicating that node v1 is identical to node v2 in the document. The path from node b to node v1 is represented by e1, p1, f1 and similarly the path from node b to node v2 is represented by e2, p2, f2 . The algorithm determines if there is at least one possible way to create a single path from b to v(1+2) that satisfies the structural constraints. In the case where arcs e1 , e2 , and f1 , f2 are all d-arcs, it is possible to embed the path e2, p2, f2 into the d-arc e1 and produce a single path from b to v(1+2) and thus the query is satisfiable as it is possible to find an XML document that satisfies the query 42 which has its elements listed in the same order as the nodes on the path. Other embeddings are also possible that will also produce a single path. The algorithm considers 4 special cases where it is possible that the query may not be satisfiable: 1) arcs e1 and e2 are both c-arcs 2) arcs f1 and f2 are both c-arcs 3) arc e1 is a c-arc and e2 is a d-arc and arc f1 is a d-arc and f2 is a c-arc 4) arc e1 is a c-arc and e2 is a d-arc but arc f1 is a c-arc and f2 is a d-arc. In this case 2 sub-cases are considered 4.1) path p1 contains a d-arc 4.1) path p1 contains no d-arcs, that is it contains only c-arcs. In case 1, we are requiring nodes c and d to be cousins but also to be on the same path to v(1+2). Generally this would cause the query to be unsatisfiable. However there is one special case where this constraint conflict can be resolved: namely nodes c and d are the same node which would require them to have the same tag. In this special case, we can “merge” these 2 nodes and this new node is now the least common ancestor of v(1+2), then testing for satisfiability can be done recursively. In case 2, we are requiring nodes m and n to be the parents of v(1+2). This is a conflict as a node cannot have 2 parents in a TPQ and so the query would be unsatisfiable. However there is a special case where this conflict can be resolved namely nodes m and n are the same node which would require them to have the 43 same tag. In this special case, we can “merge” these 2 nodes and this new node is now the node with the node identity constraint instead of v(1+2) which is now the child of the new merged node, then testing for satisfiability can be done recursively. From Case 1, we can assume that at least one of e1 and e2 is a d-arc, (otherwise we would apply the algorithm recursively until we had this situation, and if we cannot do this then the query is unsatisfiable). Without loss of generality, we can assume that e2 is a d-arc. From Case 2, we can assume that at least one of f1 and f2 is a d-arc. If f1 is a d-arc, then there is a path b -> c => m => d => n -> v(1+2) which satisfies the constraints arising from the possible c-arcs, and so the query is satisfiable. However in Case 4 we consider the situation where f1 is a c-arc and also e1 is a c-arc. If path p1 contains a d-arc, then it is possible to embed the path e2, p2, f2 into this d-arc in p1 as both e2 and f2 are d-arcs. However if p1 does not contain a d-arc, then p2 must be a substring of p1 that conforms to the c-arc constraints. 44 Figure 20. Generalized Structural Constraint Graph for a C XPath query with 1 NIC. 4.2 Formalization of Algorithm For any TPQ Q (no node identity constraint), let (Q) be the XML document obtained from Q as follows: • Replace all c-arcs and d-arcs by parent-child edges; • Replace each * vertex with any tag name in . There is a natural embedding of Q in (Q): Each vertex is mapped to the corresponding node. So TPQ Q (no node identity constraint) is definitely satisfiable as it can be embedded in (Q). For example, see Figure 21. 45 Figure 21. Query Q, document (Q), and an embedding of Q in (Q). Now consider a TPQ Q with one Node Identity Constraint (NIC). Let u and v be the vertices of Q that are equated by the node identity constraint. Let b be the least common ancestor of u and v in Q. Let Pu = (b, u1, u2, …, um, um+1 = u) be the path from b to u in Q, and let Pv = (b, v1, v2, …, vn, vn+1 = v) be the path from b to v in Q. 46 Theorem 4.1 A TPQ Q with one NIC that equates vertices u and v is satisfiable iff there exists a path P of XML nodes and an embedding of Pu and Pv in P such that • (b) has the same value for Pu and Pv • (u) = (v) Proof. First consider the “only-if” case. Consider any possible embedding of Q in an XML document. Clearly, D must contain a path satisfying the constraints in the theorem. Now consider the “if” case. Suppose that there exists a path P and an embedding satisfying the constraints in the theorem. Then P can be extended by attaching extra nodes and subtrees, to embed the whole query Q. Consider a possible embedding of Pu and Pv in path P. We need to find the conditions under which such P and exist. In the Figures in this section, a solid single line represents a c-arc, a solid double line represents a d-arc, and a dotted line represents either a c-arc or a d-arc. We consider the following cases. Case 1. The arcs (b, u1) and (b, v1) are both c-arcs. Then, in any possible embedding of Pu and Pv in a path P, we must have (u1) = (v1). So, for Q to be satisfiable, we must have (u1) = (v1); so let (u1) = (v1). Let Q’ be the TPQ obtained by merging u1 and v1 into a single vertex b1; the node identity 47 constraint is rephrased in terms of b1 (see Figure 21 (b)). Q is satisfiable iff Q’ is; satisfiability of Q’ is tested recursively. (a) (b) Figure 22. Case 1: before and after the merge. Case 2. The arcs (um, u) and (vn, v) are both c-arcs. Then, in any possible embedding of Pu and Pv in a path P, we must have (um) = (vn). So, for Q to be satisfiable, we must have (um) = (vn); so let (um) = (vn). Let Q’ be the same as Q, except that the node identity constraint is rephrased to equate um and vn (see Figure 23 (b)). Q is satisfiable iff Q’ is; satisfiability of Q’ is tested recursively. 48 (a) (b) Figure 23. Case 2: before and after the merge. By Case1, we can assume that at least one of (b, u1) and (b, v1) is a d-arc. Without loss of generality, let (b, v1) be a d-arc. 49 Figure 24. At least one of arcs (b,u1) and (b,v1) must be a d-arc By Case 2, we can assume that at least one of arcs (um, u) and (vn, v) is a darc. We consider two cases. Case 3. The arc (um, u) is a d-arc (see Figure 24 (a)). Then, Q is satisfiable. Let P’u = Pu – {u} = (b, u1, u2, …, um), and let P’v = Pv – {b} = (v1, v2, …, vn, vn+1 = v). Let P = (P’u) • (P’v), where • denotes the concatenation operator. There is a natural embedding of Pu and Pv in P (see Figure 24 (b)). So by Theorem 4.1, Q is satisfiable. 50 (a) (b) Figure 25. Case 3: before and after an embedding. 51 Case 4. The arc (um, u) is a c-arc, but the arc (vn, v) is a d-arc (see Figure 26). Figure 26. Case 4: arcs (b, u1) and (um,Vu+v) are c-arcs. We consider two sub-cases. Case 4.1. Pu contains a d-arc (x,y). Then, Q is satisfiable. Let P’u,1 be the part of Pu from b to x; P’u,1 = (b, u1, u2 …, ui = x), for some i < m. Let P’u,2 be the part of Pu from y to u; P’u,2 = (y = ui+1, ui+2, …, um, um+1 = u). Let P’v = Pv – {b, v} = (v1, v2, …, vn). Let P = (P’u,1) • (P’v) • (P’u,2) . There is a natural embedding of of Pu and Pv in P (see Figure 27). So by Theorem 4.1, Q is satisfiable. 52 Figure 27. Case 4.1: before and after an embedding. 53 Figure 28. Case 4.2: Pu contains no d-arcs. Case 4.2. Pu contains no d-arcs (see Figure 28). In this case, we must take P to be (Pu) , which is essentially Pu except that we can replace each * vertex label by any tag name in . So, Q is satisfiable iff Pv can be embedded in Pu. To determine whether such an embedding exists, we partition Pv into maximal length chains such that each chain consists either of only c-arcs or only d-arcs; they will be referred to as c-chains and d-chains, respectively. Let Pv = (Pv,1, Pv,2, …, Pv,k) be the partition. Pv,i is a d-chain if i is odd, and is a c-chain if i is even, Since (b, v1) and (vn, v) are both d-arcs, both Pv,1 and Pv,k are d-chains, and k must be odd. Let the vertices assigned to Pv,i, denoted by V(Pv,i ), be defined as follows: If Pv,i 54 is a d-chain, V(Pv,i ) is the sequence of vertices on Pv,i, excluding the two endpoints; so, if Pv,i consists of a single d-arc, V(Pv,i ) would be empty. If Pv,i is a c-chain, V(Pv,i ) is the sequence of vertices on Pv,i, including the two endpoints. Figure 29. Case 4.2: Pv partitioned into c-chains and d-chains. Embedding Pv in Pu reduces to embedding each Pv,i, one below the other. Suppose that we have embedded Pv,i and let x be the image in Pu of the last vertex in V(Pv,i ). Now consider the embedding of Pv,i+1 .If Pv,i+1 is a c-chain, we need to find the first sequence of consecutive vertices (not necessarily immediately) after x in Pu , whose tag names match the sequence of tag names 55 in V(Pv,i+1 ). If Pv,i+1 is a d-chain, we need to find the first sequence of not necessarily consecutive vertices (not necessarily immediately) after x in Pu , whose tag names match the sequence of tag names in V(Pv,i+1 ). Q is satisfiable iff each of the c-chains and d-chains in Pv can be embedded in Pu, in this manner, one below the other. 4.3 Application of Algorithm to Examples for Case 4.2 Figure 30 shows an example of a query that is satisfiable. By applying the algorithm, we partition the path Pv = (b, c, e, g) into a c-chain consisting of the carc (c,e) and into two d-chains containing no vertices. Next the path Pu = (b,c,d,c,e,f,g) is checked to determine if there is a possible embedding. The first vertex c is a match but the next vertex d does not match the next vertex e in the c-chain. So this is not a possible embedding so the checking continues. Vertex d is not a match but the next vertex c is a possible match. The subsequent vertex e is a match to the next vertex in the c-chain. All the vertices of the c-chain have been embedded into the path of c-arcs from b to g. The two d-chains do not contain any vertices so no further embedding is required. 56 Figure 30. Case 4.2: Example of satisfiable query with c-chain. Figure 31 shows a more complex example with two d-chains and 1 cchain where the d-chains contain a vertex. The chains are shown together with their “match” in the long path, which shows that an embedding is possible and so the query is satisfiable. 57 Figure 31. Case 4.2: Example of satisfiable query with d-chains and c-chain If an embedding is not possible, then the query is unsatisfiable. Figure 32 shows an example of such a query. In this example, it is not possible to match the c-chain with a set of c-arcs in the long path. The vertices x and h do not match and so an embedding is not possible. 58 Figure 32. Case 4.2: Example of unsatisfiable query 4.4 Runtime of the Algorithm Let n = | Pu | and m = | Pv |. Consider embedding a chain Pv,i in Pu . If Pv,i is a d-chain, the time spent is equal to the length of Pu that is used up by the embedding; this is because we never look at a vertex in Pu more than once. So, the total time spent over all d-chains is O(n). Now consider the c-chains. For a cchain Pv,i , we spend O(|Pv,i |) time for each starting point we try in Pu. So, total time spent on embedding Pv,i is O(m |Pv,i |). Over all the c-chains, the total time 59 is O(nm). This can be improved to O(n+m) using the string matching algorithm of Knuth, Morris and Pratt (Cormen, Leiserson et al. 2001). 60 CHAPTER 5 SUMMARY AND CONCLUSIONS 5.1 Theoretical Implications Efficient algorithms for evaluating XPath and XQuery expressions are an important research area of Computer Science. Techniques developed for SQL queries are only partially applicable to the more complex queries for XML data. The ability to determine if a query is satisfiable before running a query against a large dataset is useful. Although it is not always possible to decide if an XPath or XQuery query is satisfiable, there is a subset of queries where it is possible to determine if a query is satisfiable. One subset is the Conjunctive XPath subclass and we have shown an efficient algorithm for determining satisfiability for this class of query with one Node Identity Constraint. 5.2 Conclusion Algorithms for determining satisfiability may also be applicable to other problems. XPath in many ways resembles the path structures for storing and retrieving data in file systems. The structural constraint graphs are also similar to the representation of scheduling problems: this activity must occur after another but immediately before another. The concept of child and descendant axis could be used to represent the scheduling constraints and hence the algorithm developed for XML queries could be applied to other problems that can be represented by structural constraint graphs. 61 5.3 Extension of the Algorithm The algorithm developed in Chapter 4 is for a single node identity constraint. However in more complex XQuery queries, the TPQ may contain multiple Node Identity Constraints. In section 3.3.3, it was noted that if there exists a sub-graph in the query that it unsatisfiable, then the larger query is unsatisfiable. Therefore one method to extend this algorithm to multiple node identity constraints would be to determine if the various node identity constraints are satisfiable. A rough outline of such an algorithm is provided below, for the case where the various NICs involve disjoint sets of vertices. Step 1: Partition the graph into sub-graphs for each Node identity Constraint. Step 2: Choose a node identity constraint and apply the algorithm from Theorem 4.3. If an embedding is possible, then continue applying the algorithm for any remaining node identity constraints, however do not make the embedding. It is only necessary to determine if an embedding is possible. If an embedding is possible for all the node identity constraints continue to Step 3, else the query is unsatisfiable. In the situation where the NICS involved overlapping sets of vertices, these multiple node identity constraints can introduce a cycle into the graph. A 62 topological sort is one method of determining if a graph contains a cycle. If a topological sort can be created for a DAG, then the graph does not contain a cycle. 5.4 Areas of Further Research The extended algorithm requires further research in the area of nested and overlapping Node Identity Constraints. Where the Node Identity Constraints do not overlap or are not nested within each other, then the extended algorithm would be suitable. In the case of nested or overlapping Node Identity Constraints, an embedding for the inner most NIC, may make an embedding for the outermost or overlapping NIC impossible to achieve. This is because our algorithm selects just one of many possible embeddings to determine satisfiability. When the NICs overlap or are nested, it may be that a different choice of embeddings need to be considered. 63 REFERENCES 64 LIST OF REFERENCES Amer-Yahia, S., S. Cho, et al. (2001). Minimization of Tree Pattern Queries Proceedings of ACM SIGMOD, Santa Barbara California. Bender, M. A., G. Pemmasani, et al. (2001). Finding Least Common Ancestors in Directed Acyclic Graphs. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington, D.C. Benedikt, M., W. Fan, et al. (2005). XPath Satisfiability in the Presence of DTDs ACM Symposium on Principles of Database Systems (PODS), Baltimore, Maryland. Brundage, M. (2004). XQuery: The XML Query Language. Boston, Addison-Wesley. Cormen, T., C. Leiserson, et al. (2001). Introduction to Algoritms. Cambridge, MA, MIT Press/McGraw Hill. Fernández, M., J. Hidders, et al. (2004). Automata for Avoiding Unnecessary Ordering Operations in XPath Implementations, University of Antwerp Technical Report TR UA 2004-02. Geerts, F. and W. Fan (2005). Satisfiability of XPath Queries with Sibling Axes The 10th International Workshop on Database Programming Languages, Trondheim, Norway. Hidders, J. (2003). Satisfiability of XPath Expressions. Proceedings of the 9th International Conference on Data Base Programming Languages, Potsdam, Germany. Kay, M. (2004). XPath 2.0 Programmer's Reference (Programmer to Programmer). Indianapolis, IN, Wiley Publishing. Lakshmanan, L. V. S., G. Ramesh, et al. (2004). On Testing Satisfiability of Tree Pattern Queries. Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada. Lewis, P. M., A. Bernstein, et al. (2002). XML and Web Data. Databases and Transaction Processing: An Application-Oriented Approach. Boston, Addison-Wesley: 537623. Quenault, H. S. (2004). VERS: Building a Digital Heritage. VALA, Victorian Association for Library Automation, Melbourne, Australia. Ramanan, P. (2003). Covering Indexes for XML Queries: Bisimulation - Simulation = Negation. VLDB. 65 Stephen, G. A. (1994). String Searching Algorithms. Singapore, World Scientific. W3C (1996). Extensible Markup Language (XML). 66