XTree for Declarative XML Querying
Zhuo Chen, Tok Wang Ling, Mengchi Liu, and Gillian Dobbie
January 2004

Outline
- Introduction
- Preliminaries
- XTree
- Algorithm to transform XTree query to XQuery
- Conclusion and future work

Introduction
- How to query XML documents is an important issue in XML research
- Various query languages have been proposed: XPath, XQuery, Lorel, XML-GL, XQL, XML-QL, XSLT, YATL, XDuce, a rule-based semantic querying approach, a declarative XML querying approach, etc.
- XQuery, which is based on XPath, was selected as the basis for an official W3C query language for XML

Introduction
- In this paper, we will
  - Analyze the limitations of XPath
  - Propose a new set of syntax rules called XTree, which is a generalization of XPath
  - Show how XTree can efficiently replace the notations of XPath
  - Give algorithms to convert queries based on XTree expressions to standard XQuery queries

Outline
- Introduction
- Preliminaries
  - Background on XPath
  - Limitations of XPath
- XTree
- Algorithm to transform XTree query to XQuery
- Conclusion and future work

Preliminaries
- XPath
  - A W3C standard
  - A set of syntax rules for addressing parts of an XML document
  - It uses paths to identify nodes (elements and attributes) in XML documents
  - These path expressions look very much like paths in a computer file system

Background on XPath
Sample XML document of a bibliography

    <bib name="IT">
      <book id="b001" year="1994">
        <title>TCP/IP Illustrated</title>
        <author><last>Stevens</last><first>W.</first></author>
        <publisher>Addison-Wesley</publisher>
      </book>
      <book id="b002" year="1992">
        <title>Advanced Programming in the Unix Environment</title>
        <author><last>Stevens</last><first>W.</first></author>
        <publisher>Addison-Wesley</publisher>
      </book>
      <book id="b003" year="2000">
        <title>Data on the Web</title>
        <edition>3</edition>
        <author><last>Abiteboul</last><first>Serge</first></author>
        <author><last>Buneman</last><first>Peter</first></author>
        <author><last>Suciu</last><first>Dan</first></author>
        <publisher>Morgan Kaufmann</publisher>
      </book>
      <journal id="j001" year="1998">
        <title>XML</title>
        <editor><last>Date</last><first>C.</first></editor>
        <editor><last>Gerbarg</last><first>M.</first></editor>
        <publisher>Morgan Kaufmann</publisher>
      </journal>
    </bib>

Background on XPath
XPath examples

    /bib/book/@year      Get attribute "year" of each book
    /bib/book/author     Get element "author" of each book
    /bib/book/@*         Get all attributes of each book
    /bib/book/*          Get all sub-elements of each book
    //author             Get all elements named "author", regardless of their absolute paths
    /bib/book[2]         Get the second book element
    /bib/book[last()]    Get the last book element

Background on XQuery
- XQuery
  - An XML query language for searching XML documents
  - Based on XPath
- FLWOR statements
  - For, Let, Where, Order by, Return
  - The for clause iterates the variable over the result of its expression
  - The let clause binds the variable to the result of its expression
- Complex queries (nested clauses)
- Complex result constructions
- User-defined functions

Background on XQuery
XQuery example: list the year and title of all books published after 1995

XQuery:

    for $book in /bib/book
    where $book/@year > 1995
    return <book> { $book/@year } { $book/title } </book>

Result:

    <book year="2000">
      <title>Data on the Web</title>
    </book>

Limitations of XPath
XPath has some limitations:
1. We can only assign one variable for each XPath expression
   - It is just a linear path, unlike XML's tree structure
   - Inefficient to write: if a query needs to get values from several places, it has to use several paths
2. It is difficult to reveal the relationship among correlated XPaths
   - This may cause mistakes if a user does not pay attention when writing a query
   - E.g., if we want to output the title and author of each book:
       XPath 1: /bib/book/title
       XPath 2: /bib/book/author
     Wrong! The two paths above are not correlated: nothing ties a title to the authors of the same book

Limitations of XPath
XPath has some limitations:
3.
XPath is inefficient at expressing a query that returns elements at path A while the condition is on a distant path B
   - Difficult to distinguish the condition branch from the target branch
   - Especially for multiple conditions and nested conditions
   - E.g., find the value of the publisher id of a book which has an author with last name "Stevens" and first name "W.":
       /bib/book/author[last="Stevens" and first="W."]/../publisher/@pubid
4. XPath expressions are only used in the querying part of XQuery, not in the result construction part
   - In XQuery, the result construction part mixes literal text, variable evaluation and even nested sub-queries
   - The whole query is difficult to read and comprehend

Limitations of XPath
XPath has some limitations:
5. XPath can only bind a variable to the whole node (element or attribute) structure, which is a name-value pair
   - If we want to get the substructure of the node, we have to invoke built-in functions
     - local-name() to get the node name
     - string() to get the string value
   - Difficult to query XML documents with unknown structure, or to rename the nodes in the result
   - E.g., suppose we do not know the sub-structure of the book element, and we want to re-structure books in this way: keep text nodes and sub-elements unchanged, but convert attributes to sub-elements:

       for $book in /bib/book
       return <book>
                { $book/text(), $book/* }
                { for $attrib in $book/@*
                  return <attribute name="{ local-name($attrib) }"
                                    value="{ string($attrib) }"/> }
              </book>

Outline
- Introduction
- Preliminaries
- XTree
  - Basic syntax
  - XTree for querying
  - XTree for result construction
- Algorithm to transform XTree query to XQuery
- Conclusion and future work

XTree
- XTree is a generalization of XPath
- XTree has a tree structure like XML
- XTree is more efficient (more compact to write) than XPath
  - In the querying part, one XTree expression can bind multiple variables
  - In the result construction part, one XTree expression can be used to define the result format
  - In XQuery, one XPath expression can only bind one variable
- Avoid nested structure in the query
  - Make the
    whole query easier to read and understand
- Supports list-valued variables explicitly, and determines their values uniquely

XTree syntax
- Similar to that of XPath
- ( ) in front to indicate the URL of the document
- Sibling tree nodes are enclosed by { } and separated by commas
  - { } can be nested
- In XTree, conditions are written directly, without { }
- Use logic variables as placeholders to bind/match the values at their places
- / means parent-child hierarchy
- // means any number of levels down (ancestor-descendant)
- → assigns variables in the querying part
- ← gets values from variables in the result construction part
- Only the sub-trees of interest are written in XTree, not the whole XML tree structure

XTree for querying
- The symbol → assigns the values of the nodes on its left side to the variable on its right side
- Example. For the sample bibliography document, suppose we want to get the year and title of each book, and its authors' last names and first names
- We can use the variables $y, $t, $last, $first to bind them respectively, as in the following XTree expression:
    /bib/book/{@year→$y, title→$t, author/{last→$last, first→$first}}
- We can instantiate many variables in one XTree expression
- The above XTree expression corresponds to the following 6 XPath expressions in XQuery:
    for $book in /bib/book,
        $y in $book/@year,
        $t in $book/title,
        $author in $book/author,
        $last in $author/last,
        $first in $author/first

XTree for querying
- XTree allows a user to use path abbreviation as in XPath
- Example. Suppose we want to get the last name and first name elements at whatever depth in the document. We can write the following XTree expression:
    /bib//{last→$last, first→$first}
- The curly braces enclosing the two elements last and first specify that these two elements are siblings.
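As a rough illustration of what the sibling constraint means, the Python sketch below emulates the bindings of /bib//{last→$last, first→$first} on a trimmed copy of the sample document. It is not an XTree implementation: the names SAMPLE and sibling_bindings are hypothetical, and pairing every last with every first under the same parent is an assumption about how the two single-valued variables combine.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Trimmed copy of the sample bibliography (one book, one journal).
SAMPLE = """
<bib name="IT">
  <book id="b001" year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
  </book>
  <journal id="j001" year="1998">
    <title>XML</title>
    <editor><last>Date</last><first>C.</first></editor>
    <editor><last>Gerbarg</last><first>M.</first></editor>
  </journal>
</bib>
"""


def sibling_bindings(root):
    """Yield (parent tag, $last, $first) for each pair of sibling last/first elements."""
    for parent in root.iter():  # every element at any depth, in document order
        lasts = parent.findall("last")    # direct children only
        firsts = parent.findall("first")  # direct children only
        # product() yields nothing when either list is empty,
        # so only parents that have both children contribute bindings.
        for last, first in product(lasts, firsts):
            yield parent.tag, last.text, first.text


root = ET.fromstring(SAMPLE)
bindings = list(sibling_bindings(root))
for row in bindings:
    print(row)
```

Only author and editor elements contribute bindings here, because they are the only parents with both a last and a first child.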
- According to the XML document, the parent of the sibling elements last and first is /bib/book/author or /bib/journal/editor

XTree for querying
- XTree allows a user to bind variables on the structure of an XML document
  - A user can put a variable $var on the left side of the → symbol
  - Here $var will bind to the name of the corresponding node
- Example. Suppose we want to find an attribute with value "2000" in some book element, and bind variable $b to that book:
    /bib/book→$b/@$attr="2000"
- According to the sample document, $b will bind to the third book, and $attr will bind to the attribute name "year".

XTree
- Two types of variables

    Single-valued variable   $X     An element instance of the specified path
    List-valued variable     {$X}   A list of all $X instances, explicitly indicated by a pair of curly braces

- Note that both sibling nodes and list-valued variables are enclosed by curly braces
  - Sibling nodes have commas as separators in the braces
  - List-valued variables do not have commas in the braces

List-valued variables
Object-oriented functions of list-valued variables: aggregate functions
Suppose the list-valued variable {$nums} binds to a list of numbers

    {$nums}.count()    returns the number of items in the list
    {$nums}.avg()      returns the average value of items in the list
    {$nums}.min()      returns the minimum value in the list
    {$nums}.max()      returns the maximum value in the list
    {$nums}.sum()      returns the sum of values in the list

List-valued variables
Object-oriented functions of list-valued variables: list operations
Suppose the list-valued variable {$names} binds to a list of name elements

    {$names}.[1-3, 6]       returns a sublist of the 1st to 3rd items, and the 6th item
    {$names}.last()         returns the last item in the list
    {$names}.sort()         sorts the items in the list in ascending order
    {$names}.sort_desc()    sorts the items in the list in descending order
    {$names}.distinct()     eliminates duplicate items in the list
    {$names}.random(3)      picks out 3 items randomly
    $name ∈ {$names}        checks whether an item is in the list
    {$names’} ⊆ {$names}    checks whether the first list is a sub-list of the second list

Semantics of list-valued variables
- Definition 1. The associated path of variable $a (or {$a}) is the absolute path expression from the root to the nodes represented by $a (or {$a}).
    /bib/book→$b/title→$t
  The associated path of $t is /bib/book/title.
- Definition 2. Variable $a is an ancestor variable of $b if $a and $b are defined in the same XTree expression, and the associated path of $a is a prefix of the associated path of $b.
    /bib/book→$b/{title→$t, author→$a}
  $b is an ancestor variable of $t and $a, but $t is not an ancestor variable of $a.

Semantics of list-valued variables
- Definition 3. In an XTree expression, when a variable is bound to a value in the query evaluation, the variable is instantiated.
    /bib/book/{author→$a/first→$first, title→$t}
  In the evaluation, when we reach /bib/book/author, $a is instantiated; when we reach /bib/book/author/first, $first is instantiated.
- Definition 4. The value of a list-valued variable {$a} is a list of all instances of $a with all its ancestor variables instantiated.
    /bib/book/author→{$a}
  {$a} means all the author elements of all the books
    /bib/book→$b/author→{$a}
  {$a} means all the authors of a certain book $b

XTree for result construction
- An XTree expression can also be used to define the result format
  - The symbol ← gets the values of variables from its right side and assigns them to the expression on its left side
- The result construction part is just one XTree expression
  - No nested structure as in the return clause of XQuery, since XTree already has a tree structure
  - Easy to read and understand
- Must be concrete
  - No condition checking or uncertainty in the structure
  - Unlike XTree expressions in the querying part

XTree for result construction
Example.
We want to list the titles and publishers of books which are published after 1993. Suppose we have bound the variables by the following XTree expression:
    /bib/book/{@year>1993, title→$t, publisher→$p}
We can write the following XTree expression to define the result format:
    /result/recentbook/{title←$t, publisher←$p}
The result format is defined as: under the root result, each recentbook element will store the title and publisher of that book

    <result>
      <recentbook>
        <title>TCP/IP Illustrated</title>
        <publisher>Addison-Wesley</publisher>
      </recentbook>
      <recentbook>
        <title>Data on the Web</title>
        <publisher>Morgan Kaufmann</publisher>
      </recentbook>
    </result>

XTree for result construction
Example. For each book, show the title, the number of authors and the first author. Suppose the variable bindings are defined in the following XTree expression:
    /bib/book/{title→$t, author→{$a}}
We can write the following XTree expression to return the result:
    /result/book/{title←$t, authNum←{$a}.count(), author←{$a}[1]}
- {$a}.count() counts the number of items in the {$a} list
- {$a}[1] returns the first item in the {$a} list
Output:

    <result>
      <book>
        <title>TCP/IP Illustrated</title>
        <authNum>1</authNum>
        <author><last>Stevens</last><first>W.</first></author>
      </book>
      <book>
        <title>Advanced Programming in the Unix Environment</title>
        <authNum>1</authNum>
        <author><last>Stevens</last><first>W.</first></author>
      </book>
      <book>
        <title>Data on the Web</title>
        <authNum>3</authNum>
        <author><last>Abiteboul</last><first>Serge</first></author>
      </book>
    </result>

XTree for result construction
- The right side of the ← symbol can be:
  - A pre-defined variable or an invocation of functions on variables
  - Literal text, indicating static content
  - Omitted, indicating an empty value
Example.
Suppose we want to return a book whose title is "Computer Architecture" and which does not have a specified author. We can write the following XTree expression:
    /bib/book/{title←"Computer Architecture", no-author}
It will output the following XML segment:

    <bib>
      <book>
        <title>Computer Architecture</title>
        <no-author/>
      </book>
    </bib>

XTree for result construction
- A query based on XTree expressions has QWOC (Query-Where-Order by-Construct) statements
  - The query clause contains one or more XTree expressions for selection and variable binding
  - The where clause is optional; it defines constraints
  - The order by clause is optional; it defines the ordering
  - The construct clause contains one XTree expression to define the output format

Outline
- Introduction
- Preliminaries
- XTree
- Algorithm to transform XTree query to XQuery
  - An algorithm to transform an XTree expression in the query part to a set of XPath expressions
  - An algorithm to transform an XTree expression in the result construction part to some nested XQuery expressions
- Conclusion and future work

Transformation algorithm for querying part
- Transform an XTree expression in the querying part to a set of XPath expressions
- Not as trivial as just extracting each path associated with a variable to be an XPath expression
  - Variables may correlate to each other through some common ancestors
  - We have to use such common ancestors to constrain the descendant variables
- The common ancestors we want are just the branching nodes (the nodes just before every pair of curly braces used for branching)
- Use a stack to store such common ancestors for later use

Transformation algorithm for querying part
- Process the XTree expression from left to right. For each common ancestor of variables (except the root), assign a single-valued variable to it if it is not originally bound to a variable
- Translate each single-valued variable to an XPath expression in a for clause; translate each list-valued variable to an XPath expression in a let clause
- Try to write the
  path expression of a variable as a path relative to its nearest ancestor variable (make use of the stack)
  - If it has such an ancestor variable, write its path expression as the relative path from that ancestor variable
  - If it does not have any ancestor variable, write its path expression as the absolute path from the root
- The output paths will be in depth-first order of the XTree

Transformation algorithm for querying part
Example:

    /bib/{book/{title→$t, author→{$a}},
          journal/{title→$jt,
                   editor/{last→$last, first→$first}}}

After assigning variables to the branching nodes:

    /bib/{book→$b/{title→$t, author→{$a}},
          journal→$j/{title→$jt,
                      editor→$e/{last→$last, first→$first}}}

XPaths generated:

    for $b in /bib/book
    for $t in $b/title
    let $a := $b/author
    for $j in /bib/journal
    for $jt in $j/title
    for $e in $j/editor
    for $last in $e/last
    for $first in $e/first

Transformation algorithm for result construction part
- Transform an XTree expression in the result construction part to some XQuery expressions
- More complicated
  - We will often encounter nested sub-queries in XQuery
  - Consider the case where the node name that receives the variable value is different from the node name where the variable was bound in the querying part
- Process the XTree expression step by step
  - Find the corresponding XPath expression of each variable in the XPaths generated by the previous algorithm
  - Translate each variable value substitution to some XQuery statement
  - Use curly braces { } to form sub-query blocks according to the structure of the XTree expression in the construct clause

Transformation algorithm for result construction part
Example:

    query     /bib/{book/{title→$t, author→{$a}},
                    journal/{title→$jt, editor/{last→$last, first→$first}}}
    construct /result/{book/{name←$t, authors/{@count←{$a}.count(), au←{$a}}},
                       journal/{title←$jt, editor/{first←$first, last←$last}}}

Generated XPath expressions of the querying part:

    for $b in /bib/book
    for $t in $b/title
    let $a := $b/author
    for $j in /bib/journal
    for $jt in $j/title
    for $e in $j/editor
    for $last in $e/last
    for $first in $e/first

Transformation algorithm for result construction part
Output:

    <result>
    { for $b in /bib/book
      return
        <book>
        { for $t in $b/title
          return <name> {$t/@*} {$t/*} {$t/text()} </name> }
        { let $a := $b/author
          return
            <authors count="{count($a)}">
            { for $x in $a
              return <au> {$x/@*} {$x/*} {$x/text()} </au> }
            </authors> }
        </book> }
    { for $j in /bib/journal
      return
        <journal>
        { for $jt in $j/title return $jt }
        { for $e in $j/editor
          return
            <editor>
            { for $first in $e/first return $first }
            { for $last in $e/last return $last }
            </editor> }
        </journal> }
    </result>

Outline
- Introduction
- Preliminaries
- XTree
- Algorithm to transform XTree query to XQuery
- Conclusion and future work
  - Conclusion
  - Future work

Conclusion
- Discussed the limitations of XPath
- Proposed a new set of syntax rules called XTree
  - XTree has a tree structure
  - In the querying part, one XTree expression can bind multiple variables
  - In the result construction part, one XTree expression can define the result format
  - List-valued variables are explicitly indicated, and their values are uniquely determined
  - XTree is more compact and convenient to use than XPath
- Designed algorithms to transform a query based on XTree expressions to a standard XQuery query

Future work
- Implement an XTree query parser
  - Queries based on XTree expressions can then be executed directly
  - Query evaluation should be more efficient with this approach, since we will have a global view of the whole query tree
- Extend the transformation algorithms to support queries with join, negation, grouping and recursion
- Optimize the output XQuery queries of our transformation algorithms according to the schema of the XML document
- Follow the ongoing development of XPath to continuously enhance XTree

References
- S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener. The Lorel Query Language for Semistructured Data.
  International Journal of Digital Library 1(1):68-99, 1997.
- S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a Graphical Language for Querying and Restructuring WWW data. In Proceedings of the 8th International World Wide Web Conference, Toronto, Canada, 1999.
- S. Cluet and J. Simeon. YATL: a Functional and Declarative Language for XML. Draft manuscript, March 2000.
- H. Hosoya and B. Pierce. XDuce: A Typed XML Processing Language (Preliminary Report). In Proceedings of WebDB Workshop, 2000.
- M. Liu and T. W. Ling. Towards Declarative XML Querying. In Proceedings of WISE 2002, 127-138, Singapore, 2002.
- P. Chippimolchai, V. Wuwongse, and C. Anutariya. Semantic Query Formulation and Evaluation for XML Databases. In Proceedings of WISE 2002, 205-214, Singapore, 2002.
- D. Chamberlin, P. Fankhauser, M. Marchiori, and J. Robie. XML Query Requirements. W3C Working Draft, http://www.w3.org/TR/xquery-requirements/, June 2003.
- J. Clark and S. DeRose. XML Path Language (XPath) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xpath, November 1999.
- D. Chamberlin, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu. XQuery 1.0: A Query Language for XML. W3C Working Draft, http://www.w3.org/TR/xquery/, May 2003.
- J. Robie, J. Lapp, and D. Schach. XML Query Language (XQL). http://www.w3.org/TandS/QL/QL98/pp/xql.html, 1998.
- A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A Query Language for XML. http://www.w3.org/TR/NOTE-xml-ql/, August 1998.
- J. Clark. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, http://www.w3.org/TR/xslt, November 1999.

Thank you
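
Backup: sketch of the querying-part transformation
The querying-part transformation described in the deck (give every branching node a variable, emit a for clause for each single-valued variable and a let clause for each list-valued one, and write each path relative to the nearest ancestor variable) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Node class and the functions declared_vars and to_clauses are hypothetical, the XTree expression is supplied as an already-parsed tree, recursion stands in for the explicit stack, and generated variables are named after the first letter of the step, which the slides do not specify.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One step of an XTree expression (hypothetical representation)."""
    step: str                      # element or attribute name, e.g. "book" or "@year"
    var: Optional[str] = None      # bound variable, e.g. "$t"; None if unbound
    is_list: bool = False          # True for a list-valued variable such as {$a}
    children: List["Node"] = field(default_factory=list)


def declared_vars(nodes: List[Node]) -> Iterator[str]:
    """Yield every variable the user wrote in the expression."""
    for node in nodes:
        if node.var:
            yield node.var
        yield from declared_vars(node.children)


def to_clauses(root_path: str, nodes: List[Node]) -> List[str]:
    """Turn the sub-trees under root_path into for/let clauses, depth-first."""
    used = set(declared_vars(nodes))
    clauses: List[str] = []

    def fresh(step: str) -> str:
        # Assumed naming rule: "$" + first letter, with a numeric suffix on clashes.
        base = "$" + step.lstrip("@")[0]
        name, n = base, 1
        while name in used:
            n += 1
            name = f"{base}{n}"
        used.add(name)
        return name

    def visit(node: Node, anchor: str, steps: List[str]) -> None:
        # anchor is the nearest ancestor variable, or the root path if there is none;
        # steps are the path steps walked since that anchor.
        steps = steps + [node.step]
        var = node.var
        if var is None and len(node.children) > 1:
            var = fresh(node.step)  # branching node without a variable
        if var is not None:
            expr = anchor + "/" + "/".join(steps)
            if node.is_list:
                clauses.append(f"let {var} := {expr}")
            else:
                clauses.append(f"for {var} in {expr}")
            anchor, steps = var, []  # descendants are written relative to var
        for child in node.children:
            visit(child, anchor, steps)

    for node in nodes:
        visit(node, root_path, [])
    return clauses


# /bib/{book/{title→$t, author→{$a}},
#       journal/{title→$jt, editor/{last→$last, first→$first}}}
query = [
    Node("book", children=[Node("title", "$t"), Node("author", "$a", is_list=True)]),
    Node("journal", children=[
        Node("title", "$jt"),
        Node("editor", children=[Node("last", "$last"), Node("first", "$first")]),
    ]),
]
clauses = to_clauses("/bib", query)
print("\n".join(clauses))
```

For the example expression, the printed clauses match the "XPaths generated" list shown for the querying part. The root /bib is passed separately because the algorithm assigns variables to common ancestors except the root.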