Issues In XML Query Optimization

advertisement
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
ISSUES AND TECHNIQUES IN XML QUERY
OPTIMIZATION
1 ASMITA
1
P. ASRE,
2 PROF.
DR. M. S. ALI
PG Student, Department of Computer Sc. & Engg. , Prof. Ram Meghe Institute of
Technology & Research, Badnera.
2 Principal, Prof. Ram Meghe College of Engineering & Management, Badnera.
asmitaasre@gmail.com , softalis@hotmail.com
ABSTRACT: Over the past few years, XML (eXtensible Mark-up Language) has emerged as the standard for
information representation and data exchange over the Internet. Query optimization in the context of XML
databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as
compared with other data models, e.g., relational models. This high complexity renders a much enlarged search
space for XML query optimization. Furthermore, XML applications are typically Web-hooked and have many
simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and
optimization. The classical cost-based and heuristic-based approaches yield unacceptably low efficiency when
applied to XML data—query optimization itself becomes very time-consuming because of the huge search space
for optimization caused by the high complexity of the XML data model. Lots of work related to XML query
processing, evaluation and optimization has been done, but the majority is focused on investigation for efficient
supporting algorithms and indexing schemes. Various optimization technologies have been developed to solve
the query retrieval and updating problems. Towards the later year, most researchers proposed hybrid
optimization techniques. Hybrid system opens the possibility of covering each technology’s weakness by its
strengths. This paper reviews on the issues and techniques used in query optimization for XML databases.
Keywords: Query Optimization, XML Databases, Cost Model, Query Optimization issues.
1. INTRODUCTION
After relational, network-based, hierarchical,
object-oriented, object-relational, and deductive
database systems, academic research and
businesses increase their attention to the
database-driven processing of XML documents,
resulting in a new kind of information system,
namely the (native) XML database system
(XDBS). This development is reasonable,
because the eXtensible Markup Language
nowadays plays an important role in various key
technologies like content management systems,
electronic data interchange, and data integration
techniques. Furthermore, for the management of
a possibly large collection of XML documents,
the classical advantages of dedicated database
systems over file systems still hold: Convenient
use of XML data through a standardized
application programming interface (API);
Transactional warranties for all operations on
XML data; Processing of large volumes of data,
measured in number of documents as well as
document size. Further advantages of database
systems like scalability with respect to the
current transactional load, high availability and
fault tolerance, as well as data and application
independence
shall
be
mentioned
for
completeness, though they are not XML specific.
As XML [1] has gained prevalence in recent
years, the storage and querying of XML data have
become an important issue. Effective query
optimization is crucial to obtaining good performance
from an XML database given a declarative query
specification. A join is frequently the most expensive
physical operation in evaluating a relational query.
Thus, selection of join order is a key task for a
relational query optimizer.
This observation is true for an XML query optimizer
as well, but with significant twists. Perhaps the most
important of these is the prevalence of structural
joins in XML. Structural join order selection is a
critical component of an XML query optimizer. A
join in the relational context is usually a value-based
equi-join, which involves two tables and is based on
the values of two columns, one in each table. In the
XML context, even though there are value-based
joins, structural joins occur much more frequently. A
structural join focuses on the containment (ancestordescendant or parent-child) relationship of the XML
elements to be joined. The join condition is specified
not on the value of the XML elements, but on their
relative positions in the XML document. In short,
queries on XML data have some features that are
different from queries in the traditional relational
context. Therefore, the set of alternative plans, and
their relative costs, in the XML context are also quite
different.
An XML query language defines more
comprehensible and structurized construct for
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 83
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
conducting operation on an XML document or
various XML documents. For processing the query,
an XML query engine or processor translates the
syntaxes and executing the operations hinted by the
query. Output is returned after process and
processing time is projected to be minimum thus
alluding efficient processing. [2]
A query processor extracts the high level abstraction
of declarative query and its procedural evaluation
into a set of low-level operations [3]. Analogous to
SQL processor, SQL query is translated at logical
access model and then the logical access prior to
accessing and returning the physical storage model.
When querying data in XML, the query is read by the
query parser. It manipulates it and passes an internal
data structure to the Query Optimizer. Then, the
Query Optimizer, using information from the
Metadata Manager, changes the structure and passes
it to the Query Evaluator. The Query Evaluator uses
the Data Manager and the Index Manager to evaluate
the query. Again, both the Data Manager and the
Index Manager use the Storage Manager to get actual
data. Overall, the evaluator takes in a query
evaluation tree and substitutes each of the nodes with
an appropriate access method. The evaluator uses
both the Index Manager and the Data Manager to get
results of the query and passes them to the Query
Output API.
Query optimization in the context of XML databases
is extremely challenging. The main reason for this is
the high complexity of the XML data model, as
compared with other data models, e.g., relational
models. This high complexity renders a much
enlarged search space for XML query optimization.
Furthermore, XML applications are typically Webhooked and have many simultaneous, interactive
users. This dynamic nature requires highly efficient
XML query processing and optimization. The
classical cost-based and heuristic-based approaches
yield unacceptably low efficiency when applied to
XML data—query optimization itself becomes very
time-consuming because of the huge search space for
optimization caused by the high complexity of the
XML data model. Lots of work related to XML query
processing has been done, but the majority is focused
on investigation for efficient supporting algorithms
[4], [5] and indexing schemes
There are many query optimization techniques
discovered and implemented by researchers. Among
them are path traversal, use of indexes, use of
materialized views, pipeline evaluation, structured
join order selection, schema-based optimization, and
reformulation of XML constraints, duplicate removal,
tree pattern matching and labeling scheme. [6].
2. RELATED WORK
There are many query optimization techniques
discovered and implemented by researchers. Among
them are path traversal, use of indexes, use of
materialized views, pipeline evaluation, structured
join order selection, schema-based optimization, and
reformulation of XML constraints, duplicate removal,
tree pattern matching and labeling scheme. Query
optimization in the context of XML databases is
extremely challenging. The main reason for this is the
high complexity of the XML data model, as
compared with other data models, e.g., relational
models. This high complexity renders a much
enlarged search space for XML query optimization.
Furthermore, XML applications are typically Web
hooked and have many simultaneous, interactive
users. This dynamic nature requires highly efficient
XML query processing and optimization. The
classical cost-based and heuristic based approaches
yield unacceptably low efficiency when applied to
XML data – query optimization itself becomes very
time-consuming because of the huge search space for
optimization caused by the high complexity of the
XML data model. Lots of work related to XML query
processing has been done, but the majority is focused
on investigation for efficient supporting algorithms
[7, 8, 9, 10, 11, 12, 13] and indexing schemes.
Complete and systematical work on XML query
optimization has hardly been reported.
The essential difference between XML data and
traditional data (e.g., relational data) is the extra
structural relationships between the various elements
in an XML data source. On the one hand, these
structural relationships render high complexity for
XML data modeling and query optimization; on the
other hand, they imply invaluable opportunities—
source of semantic knowledge for efficient XML
query optimization which is, however, often
overlooked or inadequately exploited. For example,
assume a query asks for the instances of element type
t1 that are subelements of the instances of element
type t2 and if we know that every t1 element has to
be contained in a t2 element according to the data
sources DTD or XSD (XML Schema Definition),
then we can simply return all instances of type t1
without checking the containment relationship
between the elements of the two types. This is a
trivial example of using the structure knowledge of
XML data to conduct a deterministic transformation
on an input query for better evaluation efficiency.
Here, by deterministic, we mean that the
transformation once carried out will bring in definite
improvement (simplification in this case) on the input
query expression. More complicated cases exist that
yield plenty of opportunities for efficient
optimization of XML queries, e.g., exploiting an
available structure index which is otherwise
inapplicable to an input query (A structure index for
now can be thought of as a shortcut between two
distant elements within the structure of the XML data
source). Obviously, using a structure index can
significantly reduce the cost of evaluating a covered
containment operation.
For example, let t1, t2, and t3 be three element types;
t1 is a distant descendent of t2, and t2 is a direct child
of t3 (see Fig. 1); if the evaluation of a query needs to
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 84
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
check the containment relationship between t1 and t2,
a straightforward but costly way is to traverse all the
intermediate “generations” between t1 and t2;
however if a structure index between t1 and t3 is
available, we can bypass the long path (from t1 to t2)
by using the structure index to reach t2’s parent, t3,
first, and then get to the target t2.
T3
T2
T1
Figure 1: Bypassing a lengthy path using a
‘shortcut’
The above examples indicate a huge realm where
efficient XML query optimization can be pursued by
exploiting the structure knowledge of XML data and
structure-related semantics carried on by an XML
query expression.
XML query optimization is one of the most
challenging issues facing the database research
community. It is only obtainable through a
systematically and carefully worked-out approach.
To this end, initially one needs to study the
equivalence issue related to XML queries and XML
data because query transformation has to be based on
query equivalence. Secondly, there is a need to work
out a good strategy to efficiently and effectively
accomplish XML query optimization by using these
equivalences. A pretty straightforward guideline is to
find a way to radically prune the search space during
query optimization but still be able to obtain a
sufficiently good plan though it may not be an
optimal one which is the so-called deterministic
optimization approach. By this approach, every
transformation applied to a query has to be
deterministic, in the sense that it is bound to produce
nontrivial improvement on the input query in terms
of evaluation efficiency. This type of transformation,
by its nature, is heuristic-based and relies on proper
exploitation of the structure knowledge of XML data.
Often in a scenario, a structure index is available but
not applicable because of the particular, unfavorable
structure pattern of a query. However, the hidden
opportunity of eventually applying the index to the
query may exist and can be identified by guided
transformations that exploit particular structural
properties of the source XML data. [14, 15]
3. ISSUES IN QUERY OPTIMIZATION
Given a query, there are generally a variety of
methods for computing the answer. It is the
responsibility of the system to transform the query as
entered by the user into an equivalent query that can
be computed more efficiently. The process of finding
a good strategy for processing a query is called query
optimization. Complex queries mainly involve many
accesses to the disk. Since the transfer of data from
disk is slow relative to the speed of main memory
and the CPU of the computer system, it is worthwhile
to allocate a considerable amount of processing to
choose a method that will minimize the disk accesses.
The size of the database, number of tuples, size of the
tuple and other parameters that also has an effect on
the time required to process a query. Hence, time
utilized for a query to get the answer as against the
query executed by a user, the disk space i.e. storage
required to store the results, and the potential cost of
the query are the prominent parameters in the
optimization process. Being, storage space is not an
issue nowadays since, it is available in a great
amount and at an inexpensive price. Time and the
cost of a query evaluation process are to be
optimized. These two parameters are non-linear,
hence a focus of this dissertation is to minimize the
time of execution of a query.Query optimization in
the context of XML databases is extremely
challenging. The main reason for this is the high
complexity of the XML data model, as compared
with other data models, e.g., relational models. This
high complexity renders a much enlarged search
space for XML query optimization. Furthermore,
XML applications are typically Web-hooked and
have many simultaneous, interactive users. This
dynamic nature requires highly efficient XML query
processing and optimization. The classical cost-based
and heuristic-based approaches yield unacceptably
low efficiency when applied to XML data—query
optimization itself becomes very time-consuming
because of the huge search space for optimization
caused by the high complexity of the XML data
model. Lots of work related to XML query
processing has been done, but the majority is focused
on investigation for efficient supporting algorithms
[16], [17] and indexing schemes [18].
There are many query optimization techniques
discovered and implemented by researchers. Among
them are path traversal, use of indexes, use of
materialized views, pipeline evaluation, structured
join order selection, schema-based optimization, and
reformulation of XML constraints, duplicate removal,
tree pattern matching and labeling scheme.
A problem with path traversal methods is that
traversing is only possible in the constrained set of
path. However, for the structure summary indexing,
most of it has the problem of large index size growth
in the worst case and not supporting partial queries
path matching. Labeling scheme allow quick
determination of the relationships among the element
nodes and reduce the index size, but fails to support
dynamic XML data. Towards the later year, most
researchers proposed hybrid-indexing techniques.
Hybrid system opens the possibility of covering each
technology’s weaknesses by its strengths. [18]
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 85
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
4. Comparison of existing algorithms for query optimization
Category
Based
content
Optimization using Tree
Yes
Optimization using Lore
Yes
on
Based on
structure
Rewrite
Query
Reduce
search area
Yes
Optimizing queries in XML
Structured document
Optimization based on Schema
Optimization by rewriting queries
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Yes
Optimization by classifying elements
Optimization
By transforming semantic XPath
query
Table 1. Part I Comparative study of various
Category
Optimization using Tree
Statistic
implementa
tion
Yes
Optimization using Lore
Yes
Classify
elements
Simplify
query
Sub query
execution
Yes
Yes
Optimizing queries in XML
Structured document
Optimization on Schema
Yes
Optimization by rewriting queries
Optimization by classifying elements
Yes
Optimize by transforming semantic XPath
query
Table 1. Part II Comparative study of various
Table 2. Comparison of algorithms
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 86
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
Table 1 and Table 2 focuses and provides the
different approaches and the baseline provided earlier
to solve the problem of query optimization for XML
databases.
5. QUERY OPTIMIZATION TECHNIQUES IN
A NUT SHELL
The tree generation algorithm and some of the
optimal plan selection and generation algorithms run
in a polynomial time and hence, they need to be
optimized to run in linear time. PAT algebra is being
extended to make it more suitable for query
optimization. Frequency search operations heavily
make use of the indexing technologies in PAT. The
future research will also be focused more towards
generation and use of partially correlated sub-plans,
which depend on bindings passed between portions
of query plan. When a significant number of paths
pass through a small number of objects, a
transformation which introduces a group-by clause
can be useful. Further examination is being
conducted in order to implement the TOXIN Graph
and to check if the TOXIN Tree can be extended to
be used as an alternative to DOM for querying,
updating and storing XML documents.
Value-based grouping and join techniques are being
investigated along with multi-way structural joins,
new access methods for merged operators and several
structural pattern techniques. In addition to this, new
optimization algorithms have to be implemented to
improve caching in Web service management
Systems, XQUERY constructs are to be optimized.
Cost based decisions are to be integrated in earlier
stages of the query evaluation process and the cost
model has to be refined in order model the CPU cost
in a precise manner.
6. OUR PROPOSE APPROACH
Query optimization is function wherein multiple
query plans satisfy a query by selecting a appropriate
query plan which is evaluated by the system. The
proposed approach of query optimization is by using
XML database management system. When Query is
optimized the time to traverse the query over the path
gets minimized. Thus it enhances the throughput of
the system. Cost and time are the two important
constrains, if cost decreases, the time for optimizing
query increases and if cost increases then the time for
optimizing query decreases since cost and time both
are non linear. The proposed system deals with
various evaluation plan for satisfying the query by
selecting an appropriate evaluation plan so as to
minimize the time leading to the optimization of a
query.
Selects the nearest neighbor
destination (node)
Entity
Reference
Point to store
Envelope
Reference
XML
Based
Database
System
Point to store
Next nearest neighbor
destination (node)
Output XML
Document
Reference
Point to store
Figure 2: System Diagram
The conceptual model of the proposed system is
illustrated in figure 2. The model depicts the main
components enveloped in the system. The entity
submits the envelop to the desired node (location)
and simultaneously keep the track of the delivery of
the envelop by storing the corresponding reference
point in the database. The node wherein the first
envelop is delivered is then treated as the reference
point and the process is continued till the last envelop
is delivered to the desired location. By referring to
the reference points which were stored in the
database, one can get the path of delivered envelop to
the node or location.
7. CONCLUSION
With the continuous expansion of XML applications,
more and more data is represented and stored by
using XML standards. From the database point of
view, the data in these many XML documents can be
collected and analyzed. How to search the contents of
these documents is becoming increasingly important.
Because of the characteristics of Web data, the data
XML documents described must be irregular and
incomplete, that is, the so-called "semi-structured
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 87
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
data", which is different from common relational data
model or object-oriented data model, so how to
effectively query XML documents is a hot research
problem.
Query optimization in the context of XML databases
is extremely challenging. The main reason for this is
the high complexity of the XML data model, as
compared with other data models, e.g., relational
models. This high complexity renders a much
enlarged search space for XML query optimization.
Furthermore, XML applications are typically Webhooked and have many simultaneous, interactive
users. This dynamic nature requires highly efficient
XML query processing and optimization. The
classical cost-based and heuristic-based approaches
yield unacceptably low efficiency when applied to
XML data—query optimization itself becomes very
time-consuming because of the huge search space for
optimization caused by the high complexity of the
XML data model. Lots of work related to XML query
processing has been done, but the majority is focused
on investigation for efficient supporting algorithms
and indexing schemes. Our approach of minimizing
time to execute a query, will certainly add to the
available solutions in the context of query
optimization in XML databases. The proposed work
is being carried out and the experimental results will
be compared with the existing solution.
8. REFERENCES
[1] T. Bray, J. Paoli, C. M. Sperberg-McQueen.
“Extensible Markup Language (XML) 1.0”. W3C
Recommendation.
Available
at
http://www.w3.org/TR/1998/REC-xml-19980210,
Feb. 1998.
[2] Christian Mathis, Theo Härder-“A Query
Processing Approach for XML Database Systems”.
www.lgis.informatik.unikl.de/cms/fileadmin/users/mathis/.../gws2.pdf
[3] C. Mathis and T. Harder. “A Query Processing
Approach for XML Database Systems”. 2005.
[4] Dunren Che, Karl Aberer, M. Tamer Özsu.
”Query optimization in XML structured-document
databases”. The VLDB Journal (2006) DOI
10.1007/s00778-005-0172-6. Published online: 28
April 2006 _c Springer-Verlag 2005
[5] Gottlob, G., Koch, C., Pichler, R.: “Efficient
algorithms for processing XPath queries. In:
Proceedings of VLDB, Hongkong, China (2002).
[6] Chan, C.-Y., Felber, P., Garofalakis, M., Rastogi,
R.: “Efficient filtering of XML documents with
XPath expressions”. In: Proceedings of International
Conference on Data Engineering, San Jose,
California, February 2002, pp. 235–244 (2002).
[7] S-Y Chien, Z Vagena, D Zhang, V J Tsotras, C
Zaniolo (2002) Efficient Structural Joins on Indexed
XML Documents. In: Proc. of 28th International
Conference on VLDB, Hong Kong, China, 2002,pp:
263-274.
[8] M F Fernandez, D Suciu (1998) Optimizing
Regular Path Expressions Using Graph Schemas. In:
Proceedings of the Fourteenth International
Conference on Data Engineering, February 23-27,
1998, Orlando, Florida, USA, pp:14-23.
[9] S Guha, H V Jagadish, N Koudas, D Srivastava, T
Yu (2002) Approximate XML joins. In: Proceedings
of ACM SIGMOD International Conference on
Management of Data, pp:287-298.
[10] Q Li, B Moon (2001) Indexing and Querying
XML Data for Regular Path Expressions. In:
Proceedings Of the 27th International Conference on
Very Large Databases, Rome, Italy, September 2000,
pp:361-370.
[11]. JMcHugh, S Abiteboul, R Goldman, D Quass,
J. Widom (1997) Lore: A Database Management
System for Semistructured Data. SIGMOD Record,
September 1997, 26(3):54-66.
[12] D Srivastava, S Al-Khalifa, H V Jagadish, N
Koudas, J M Patel, Y Wu (2002) Structural Joins: A
Primitive for Efficient XML Query Pattern Matching.
In: Proceedings of ICDE, 2002.
[13] N Zhang, V Kacholia, M T Vzsu (2004) A
Succinct Physical Storage Scheme for Efficient
Evaluation of Path Queries in XML. In: Proceedings
of 20th International Conference on Data
Engineering, Boston, MA, March 2004, pp: 56-65.
[14] Q Li, B Moon (2001) Indexing and Querying
XML Data for Regular Path Expressions. In:
Proceedings of the 27th International Conference on
Very Large Databases, Rome, Italy, September 2000,
pp: 361-370.
[15] T Milo, D Suciu (1999) Index Structures for
Path Expressions. In: Proceedings of ICDT, pp: 277295.
[16] F Frasincar, G-J Houben, C Pau (2002) XAL: an
Algebra for XML Query Optimization. In:
Proceedings of 13th Australasian Database
Conference, Melbourne, Australia.
[17] L J Gerstein (1987) Discrete Mathematics and
Algebraic Structures. W H Freeman and Company,
New York, 1987.
[18] G Gottlob, C Koch, R Pichler (2002) Efficient
Algorithms for Processing XPath Queries. In:
Proceedings of VLDB, Hongkong, China.
ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01
Page 88
Download