Quilt Query Interface for Niagara
Gokul Nadathur, Sanjeev Kulkarni and Nitin Bahadur
{gokul, bnitin, sanjeevk} @cs.wisc.edu
Introduction
Niagara is a research project at Wisconsin for querying XML data [1]. Niagara currently
uses XML-QL [4] as its query language. However, Quilt appears likely to become the standard
query language for XML documents. Quilt [2] is a functional language in which a query is represented
as an expression.
Our project involved creating a Quilt interface for Niagara. The interface produces a Logical
Plan, which the Niagara query execution engine can then process. With our interface, Niagara
can therefore accept and process queries written in Quilt.
Considering the numerous features Quilt provides, we have implemented only a subset of the
language. This subset encapsulates the essential features of the language that illustrate the feasibility of
using Quilt queries in the existing Niagara framework. In particular our interface implements:
1. Simple Path expressions.
2. Features of path expression like // and *.
3. Complex predicates involving operators like =, AND and others.
4. Quilt FLWR expressions
5. FOR, WHERE and RETURN expressions.
6. Queries involving multiple sources.
7. Join queries
We believe that these features constitute the core functionality and power of Quilt and therefore
validate our approach. Further, our framework can be extended fairly easily to cover other
features of the language.
We begin by examining the Quilt language and comparing its power and flexibility with XML-QL.
Then we describe how we translate Quilt into Niagara's Logical Plan and discuss the
various problems that we faced. Finally we outline what remains to be done and draw some conclusions.
Quilt
Quilt queries consist of a series of expressions. An expression can be of several types. The
important ones are
1. Path expressions
These are used to navigate through the XML documents. Path expressions in Quilt are based on the XPath
specification[3]. Thus the expression
document("zoo.xml")/Animal/Elephant
traverses the XML file zoo.xml and returns all Elephant elements that are children of Animal elements.
Quilt path expressions are very rich in features. The expression
document("movie.xml")/Movies/Actress/..
returns all the Movies elements that have an Actress child element. The .. (DOTDOT) step is a special
feature that returns the parent of the current element. The expression
document("movie.xml")/Movies//Year
returns all the Year elements that are descendants of a Movies element. Thus // generalizes the
child step of / to arbitrary descendants.
Predicates can be used in conjunction with path expressions. These predicates apply comparison operators
(e.g. =, >) either to element values or to element attributes. An example of this is the expression
document("Movies.xml")/Movie[Year = 1998]/Actor = "Bruce Wills"
that selects the movies made in 1998 in which Bruce Wills was an actor. Note that Year is
an attribute of the Movie element, so "Year = 1998" is a predicate on an attribute of Movie,
while "Actor = "Bruce Wills"" is a predicate on the value of the Actor element.
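To make the difference between the /, // and .. steps concrete, the following is a minimal sketch that evaluates analogous paths with Python's xml.etree.ElementTree, whose limited XPath subset serves here only as a stand-in for Quilt path expressions. The sample document and all variable names are invented for illustration, and Year is modelled as a child element rather than an attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this sketch.
root = ET.fromstring(
    "<Movies>"
    "<Movie><Title>A</Title><Year>1998</Year><Actor>X</Actor></Movie>"
    "<Movie><Title>B</Title><Info><Year>1999</Year></Info><Actor>Y</Actor></Movie>"
    "</Movies>"
)


def titles(movies):
    """Return the Title text of each Movie element, in document order."""
    return [m.findtext("Title") for m in movies]


# "/" follows child links only: Year must be a direct child of Movie.
child_years = root.findall("./Movie/Year")

# "//" follows descendant links: Year may be nested at any depth.
descendant_years = root.findall(".//Year")

# ".." steps back to the parent: Movie elements that have an Actor child.
movies_with_actor = root.findall("./Movie/Actor/..")

# A predicate in brackets filters the step it is attached to.
movies_1998 = root.findall("./Movie[Year='1998']")

print([y.text for y in child_years])
print([y.text for y in descendant_years])
print(titles(movies_with_actor))
print(titles(movies_1998))
```

Only the first movie has Year as a direct child, so the child step finds one Year element while the descendant step finds both.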
2. FLWR expressions
These are the main variable-binding expressions in Quilt. They bind an element or a collection of
elements to a variable name, and the bound variable can be used in subsequent Quilt expressions.
The two binding forms are FOR and LET. A FOR expression iterates over a collection of elements,
binding the variable to one element of the collection in each iteration.
Thus the statement
FOR $e IN document("Movie.xml")/Movie
binds $e to each element of the collection of Movie elements in turn. LET, on the other hand, binds
the variable to the entire collection. So the statement
LET $e := document("Movies.xml")/Movie
binds the variable $e to the collection of all Movie elements in the file Movies.xml. Quilt supports a
number of operators on collections, which become available through the LET construct.
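The contrast between the two binding forms can be sketched as follows. This is an illustration of the semantics described above, not Quilt or Niagara code; the function names and the list of strings standing in for Movie elements are assumptions.

```python
from typing import Dict, Iterator, List


def for_bindings(var: str, collection: List[str]) -> Iterator[Dict[str, str]]:
    """FOR: produce one binding per element of the collection."""
    for item in collection:
        yield {var: item}


def let_binding(var: str, collection: List[str]) -> Dict[str, List[str]]:
    """LET: produce a single binding to the whole collection."""
    return {var: list(collection)}


# Hypothetical stand-ins for the Movie elements of Movies.xml.
movies = ["Movie1", "Movie2", "Movie3"]

for_result = list(for_bindings("$e", movies))
let_result = let_binding("$e", movies)

print(len(for_result), "FOR bindings:", for_result)
print("1 LET binding:", let_result)
```

A collection operator such as a count would be applied to the single LET binding, whereas the body of a FOR is evaluated once per binding.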
3. Conditional expressions
These are useful when the structure of the information to be returned depends on some condition. These
expressions are of the form
IF Expression THEN Expression ELSE Expression
4. Functions
Quilt allows users to define their own functions to be used in their queries.
For a complete set of Quilt expressions refer to [2].
Query parsing stages in Niagara
Niagara currently uses XML-QL as its query language. The backend query engine of Niagara
uses its own Logical Algebra. XML-QL queries are parsed and converted to this Logical Algebra, which
is then fed into the query execution engine for processing. In this section we briefly describe the Logical
Algebra of Niagara and the process of translation of XML-QL query to the Logical Algebra.
Niagara Logical Algebra
The logical algebra of Niagara is the query language independent interface to the query execution
engine. The main operators of this Algebra are:
1. Select: This is the select operator that selects tuples based on a certain condition.
2. Follow: This operator is used to traverse from the parent element to its child.
3. Expose: This operator projects the required elements.
4. Vertex: This creates new elements.
5. Join: This is the join operator that is used to join two relations.
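As a rough picture of how these operators compose into a plan, the sketch below models a logical plan as a tree of nodes. The Node class, the helper constructors, the Source leaf and the render function are all assumptions made for illustration; they are not Niagara's actual classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """One logical operator: its name, its argument, and its input plans."""
    op: str
    arg: str
    inputs: Tuple["Node", ...] = ()


# Hypothetical constructors, one per operator listed above, plus a Source leaf.
def source(document: str) -> Node:
    return Node("Source", document)

def follow(child: str, inp: Node) -> Node:
    return Node("Follow", child, (inp,))

def select(predicate: str, inp: Node) -> Node:
    return Node("Select", predicate, (inp,))

def expose(element: str, inp: Node) -> Node:
    return Node("Expose", element, (inp,))

def vertex(tag: str, inp: Node) -> Node:
    return Node("Vertex", tag, (inp,))

def join(condition: str, left: Node, right: Node) -> Node:
    return Node("Join", condition, (left, right))


def render(node: Node) -> str:
    """Print a plan as nested operator applications."""
    if not node.inputs:
        return f"{node.op}({node.arg})"
    inner = ", ".join(render(i) for i in node.inputs)
    return f"{node.op}[{node.arg}]({inner})"


plan = expose("Actor", select("Year = 1998", follow("Movie", source("Movies.xml"))))
print(render(plan))
```

Join is the only operator here with two inputs; the others form a chain that reads from the source outward.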
XML-QL parsing and translation
The conversion from the XML-QL query to the Logical Plan occurs in two stages.
1. Construction of an Abstract Syntax Tree ( AST )
In this stage an Abstract Syntax Tree is created while parsing the XML-QL query. This AST
encapsulates the XML-QL query in tree form and as such is very specific to the XML-QL syntax.
2. Deriving the Logical Plan Tree from the AST.
In this stage the AST constructed in the first stage is translated to the Logical Plan. The
XML-QL-specific AST is converted to the Logical Plan Tree by mapping each XML-QL structure
to its equivalent in the Logical Algebra.
The Quilt Interface
While constructing our interface we considered three approaches to solve the problem. Each
approach has its own advantages and disadvantages. The approaches are:
1. Map the Quilt query into the AST of existing XML-QL specific tree.
With this approach our Quilt interface would parse Quilt into the XML-QL AST and thus reuse the existing
translation from AST to Logical Plan for the second stage. The main advantage of this approach is
that once the AST is ready, the existing translator takes care of the rest, so AST construction is the
only thing that needs to be done. However, the existing AST is very specific to the XML-QL language,
and it is not clear whether all Quilt features can be captured with XML-QL features. Thus this plan
was not pursued further.
2. Build an AST and then build a translator to translate into Logical Plan.
This approach allows us to define our own Quilt specific structures and a new AST. We can then
use this AST to do a translation similar to the existing one to create the logical plan. But it is not clear
whether this approach is really worth it. Some expressions translate themselves very easily into the
corresponding operators of the Logical Algebra; so it is not clear whether we need a separate phase of
translation.
3. Translate Quilt directly into Logical Plan.
With this approach, each Quilt grammar rule is interpreted as it is encountered, mapped to Niagara
operators, and added to a logical tree. The big question was whether we could translate all
rules of Quilt into the corresponding Logical Plan operators.
Our Approach
We have followed the third approach. There were several reasons to make this decision.
The main motivation for this comes from the fact that many Quilt structures map directly to
Logical Algebra structures. To illustrate, let's consider a simple Quilt path expression.
element1/element2
When we reduce this rule in the parser, we can set element2 to be the Follow child of element1.
We can also update the path expression of element2 by appending element2 to the path expression of
element1. The same is the case with predicates. The predicate
element = "Value"
maps directly to the Logical Algebra Select operator making a reduction possible immediately.
Thus in our approach we maintain the following structures at any point.
1. A forest of logical trees.
2. A reduced skeleton of AST pointing to this forest as a placeholder for the logical tree.
Whenever we reduce a rule
1. If we can map the rule to a logical algebra operator, we combine the appropriate Logical Plan trees.
Thus in the Follow example above, we combine the logical plan trees for element1 and element2 by
making the Follow for element2 a child of the Follow tree for element1.
2. Else if we recognize a new operand of a logical plan operator, we create a logical plan tree and create a
corresponding placeholder node that points to this logical tree.
Once the parsing is over, we will have one connected Logical tree constructed.
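The idea of building plan nodes directly inside the parser's reduce actions can be sketched as below. This is a simplified illustration, not the actual C parser: the Plan class, the two reduce functions and the translate driver (which stands in for the generated parser by splitting the path on "/") are hypothetical, and the direction of the child link is one possible reading of the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    op: str                         # "Follow" or "Select"
    arg: str                        # element name or predicate text
    path: str                       # path accumulated from the document root
    child: Optional["Plan"] = None  # the plan this operator consumes


def reduce_step(parent: Optional[Plan], element: str) -> Plan:
    """Action for a rule like  path -> path '/' element : add a Follow."""
    path = element if parent is None else f"{parent.path}:{element}"
    return Plan("Follow", element, path, parent)


def reduce_predicate(operand: Plan, value: str) -> Plan:
    """Action for a rule like  predicate -> path '=' value : add a Select."""
    return Plan("Select", f'{operand.path} = "{value}"', operand.path, operand)


def translate(path_expr: str, value: Optional[str] = None) -> Plan:
    """Stand-in for the parser: fire one reduce action per path step."""
    plan = None
    for element in path_expr.strip("/").split("/"):
        plan = reduce_step(plan, element)
    if value is not None:
        plan = reduce_predicate(plan, value)
    return plan


plan = translate("Movie/Actor", "Mel Gibson")
print(plan.op, plan.arg)
print(plan.child.op, plan.child.path)
```

No separate AST is built: each reduction either extends an existing plan tree or starts a new one, which is the property the third approach relies on.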
Optimizations
Our implementation performs the following optimizations.
a) Consider the following Quilt path expression
document("Movies.xml")/Movies/Year/MovieName/Actor/Name="Bruce Wills"/SigningAmount
If the Follow tree chain is created as
Name <- Actor <- MovieName <- Year <- Movies
the un-nested follow containers will look like
(Movies,
Movies:Year,
Movies:Year:MovieName,
Movies:Year:MovieName:Actor,
Movies:Year:MovieName:Actor:Name)
which contain too many tuples. On closer inspection, however, these individual follows can be
crunched into a single follow containing (Movies:Year:MovieName:Actor:Name), which dramatically
reduces the tuple space and thereby boosts performance.
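The effect of crunching can be sketched in a few lines. The two function names are invented for this illustration, and the colon-separated strings simply mirror the container notation used above.

```python
from typing import List


def unnested_follows(steps: List[str]) -> List[str]:
    """One follow container per path prefix, as in the unoptimized chain."""
    return [":".join(steps[: i + 1]) for i in range(len(steps))]


def crunched_follow(steps: List[str]) -> str:
    """A single follow container covering the whole path."""
    return ":".join(steps)


steps = ["Movies", "Year", "MovieName", "Actor", "Name"]

print(unnested_follows(steps))
print(crunched_follow(steps))
```

A path of n steps yields n containers without the optimization and one container with it.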
b) We push Select on top of Follows
c) Intelligent Evaluation of AND
Consider the evaluation of the predicate
document("Movies.xml")/Movie[/Actor = "Mel Gibson" AND /Actress = "Kate Winslet"]
A naive evaluation plan for this query would first collect all the tuples featuring Mel Gibson
as the Actor, then collect all the tuples featuring Kate Winslet as the Actress, and then compute
the AND over the two. The cleverer plan is to collect all the tuples featuring Mel Gibson
and choose among them the tuples with Kate Winslet as the Actress. Our implementation does this
in the rule that processes the AND clause, where we simply make one predicate the child of the
other. This small optimization considerably improves performance for AND rules.
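The two evaluation strategies can be compared with a small sketch that counts predicate evaluations. The rows, class and function names are invented for illustration and do not reflect Niagara's tuple representation.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


class CountingPredicate:
    """Equality predicate on one field that records how often it is evaluated."""

    def __init__(self, field: str, value: str) -> None:
        self.field, self.value, self.calls = field, value, 0

    def __call__(self, row: Row) -> bool:
        self.calls += 1
        return row[self.field] == self.value


def naive_and(rows: List[Row], p: Callable, q: Callable) -> List[Row]:
    """Evaluate both predicates over all rows, then intersect the results."""
    left = [r for r in rows if p(r)]
    right = [r for r in rows if q(r)]
    return [r for r in left if r in right]


def chained_and(rows: List[Row], p: Callable, q: Callable) -> List[Row]:
    """Make q consume the output of p, so q only sees rows that passed p."""
    survivors = [r for r in rows if p(r)]
    return [r for r in survivors if q(r)]


# Hypothetical data set.
ROWS = [
    {"Title": "A", "Actor": "Mel Gibson", "Actress": "Kate Winslet"},
    {"Title": "B", "Actor": "Mel Gibson", "Actress": "Other"},
    {"Title": "C", "Actor": "Other", "Actress": "Kate Winslet"},
    {"Title": "D", "Actor": "Other", "Actress": "Other"},
]


def run(strategy: Callable) -> Tuple[List[str], int]:
    actor = CountingPredicate("Actor", "Mel Gibson")
    actress = CountingPredicate("Actress", "Kate Winslet")
    result = strategy(ROWS, actor, actress)
    return [r["Title"] for r in result], actor.calls + actress.calls


naive_titles, naive_calls = run(naive_and)
chained_titles, chained_calls = run(chained_and)
print(naive_titles, naive_calls)
print(chained_titles, chained_calls)
```

Both strategies return the same rows, but the chained form evaluates the second predicate only on rows that passed the first, and it needs no separate intersection step.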
Implementation Details
One of the first modules we created was a C-based parser for Quilt. Although a Java parser based
on CUP was available to us, we had to resolve some incompatibilities between the semantics of the
CUP scanner in Java and the lex scanner in C.
While implementing our Quilt interface we used the same structure as the current
Niagara implementation. This has the added benefit that the transition from the XML-QL-based
interface to the Quilt interface is very simple. One just has to call the Quilt parser instead of
the XML-QL parser, and at the end of parsing the Logical Plan is available to be fed to the execution engine.
Type of Expressions covered
Currently we support the following type of Quilt expressions
1. Simple Path expressions: We support multiple follows to arbitrary depth. As described earlier we also
crunch multiple follows into single follows for performance reasons.
2. Special Path expression features // and *.
3. Path expression predicates and predicate lists. Predicates are mapped to Select operators and
predicate lists are handled the same way as AND.
4. Variable Binding. We keep a Book data structure that maintains the mapping between the bound
variables and the Logical Tree Nodes they are bound to. Later uses of a bound variable
are then simply replaced by the Logical Tree Node recorded in the mapping.
5. FOR WHERE RETURN Clauses: These make use of the variable binding infrastructure described above.
6. RETURN clauses with multiple nesting levels in CONSTRUCTion of result vertices.
7. Joins.
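The variable-binding mechanism described in the list above can be sketched as a small lookup table. Only the name Book comes from the text; the PlanNode class, the method names and the expand helper are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PlanNode:
    """Hypothetical logical tree node identified by its accumulated path."""
    path: str
    parent: Optional["PlanNode"] = None


class Book:
    """Maps bound variable names to the logical tree nodes they stand for."""

    def __init__(self) -> None:
        self._bindings: Dict[str, PlanNode] = {}

    def bind(self, var: str, node: PlanNode) -> None:
        self._bindings[var] = node

    def resolve(self, var: str) -> PlanNode:
        if var not in self._bindings:
            raise NameError(f"unbound variable {var}")
        return self._bindings[var]


def expand(book: Book, expr: str) -> PlanNode:
    """Replace the leading variable of `$var/step/...` by its bound node."""
    var, *steps = expr.split("/")
    node = book.resolve(var)
    for step in steps:
        node = PlanNode(f"{node.path}:{step}", node)
    return node


book = Book()
book.bind("$m", PlanNode("Movie"))  # e.g. FOR $m IN document(...)/Movie
actor = expand(book, "$m/Actor")    # a later use of $m, e.g. in WHERE or RETURN
print(actor.path)
```

Because a later use resolves to the very node the variable was bound to, the FOR, WHERE and RETURN clauses end up attached to the same logical tree rather than to copies of it.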
Problems faced
This section details the various implementation problems that we faced in this project.
1. Changing Niagara
One of the main challenges that we faced was that the Niagara algebra was undergoing transformation
while we were translating the Quilt grammar to it. More than once we had to change our Logical Tree
structure to accommodate these changes.
2. Niagara Blues
During our project we discovered a lot of bugs in Niagara 2.0. For example, in many of the XML-QL
test cases we noticed that the path expressions generated by Niagara were inconsistent. In such situations
the main problem was that we weren't sure whether the behavior was a bug or a feature. We are very
grateful to the Niagara team members for extending full support to us in all these cases.
3. Understanding data structures, constructing the logical plan as Niagara expects.
Since our goal was to use Niagara's data structures, understanding them was imperative. While
this meant a steep learning curve, it had the advantage that our module could be plugged into the current
code.
4. Huge Quilt grammar.
Unlike XML-QL, which has a very small and compact grammar, Quilt's grammar has over 65 production
rules. And most of the rules have a large number of entry paths. So at each production rule in the
grammar, we had to check for a lot of conditions.
What can be done but we have not done
1. DOTDOT. Going back to parent node. This is non-trivial with follow optimizations. DOTDOT means
un-nesting to previous level and if we optimize, we lose the previous level. So some checks have to be
made to support DOTDOT. However it can be accommodated.
2. Namespaces. We are not sure if these can be implemented in Niagara. We have not looked deep into
this.
3. Complex RETURN Clauses. A Return Clause (CONSTRUCT) can also be an expression in Quilt. We
haven't tested for all possible expression cases in the RETURN clause. These have to be handled slightly
differently than ordinary expressions and can be done if time is spent on them.
4. We are not sure how IF THEN ELSE and FUNCTIONS in Quilt translate to Niagara.
5. The Niagara database does not have elements with attributes, so we could not test any attribute-related
queries with Niagara, although we wrote code to support them.
6. There are some complex cases in path expressions that we haven't supported for lack of time.
What cannot be done due to lack of Niagara support.
1. chapter[5] and chapter[RANGE 2 TO 5] type path expressions cannot be implemented as Niagara
does not have support for indexing a particular element.
2. IDRefs or references (@) cannot be implemented.
3. There is no direct support for Universal Quantifier in Niagara.
Conclusions
We have built a Quilt interface to the Niagara query engine. Our implementation supports basic
path expressions, predicates and predicate lists, FWR expressions, variable bindings and Join operators.
With our current design we've shown that it could be extended to cover additional expressions like LET.
Also some Quilt structures cannot be supported in the current framework of Niagara. Our implementation
has shown that the Niagara algebra is powerful enough to capture most of Quilt and with some
modifications, the whole of Quilt could probably be handled.
Acknowledgements
We are grateful to the Niagara team members for their wholehearted cooperation. In particular we
want to thank Stratis Viglas for always being there to help us deal with various data structures and
correct bugs in the Niagara code he gave us. We also thank Rajshekhar for his useful advice at various
times during the project. Finally we thank Prof. David DeWitt for suggesting this project to us.
References
[1] Niagara, www.cs.wisc.edu/niagara, 2000.
[2] Quilt, XML Query Language,
http://www.almaden.ibm.com/cs/people/chamberlin/quilt.html, 2000.
[3] XPath, XML Path Language, http://www.w3.org/TR/xpath, Nov. 1999.
[4] XML-QL, A Query Language for XML,
http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/, 1998.