JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING ISSUES AND TECHNIQUES IN XML QUERY OPTIMIZATION 1 ASMITA 1 P. ASRE, 2 PROF. DR. M. S. ALI PG Student, Department of Computer Sc. & Engg. , Prof. Ram Meghe Institute of Technology & Research, Badnera. 2 Principal, Prof. Ram Meghe College of Engineering & Management, Badnera. asmitaasre@gmail.com , softalis@hotmail.com ABSTRACT: Over the past few years, XML (eXtensible Mark-up Language) has emerged as the standard for information representation and data exchange over the Internet. Query optimization in the context of XML databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as compared with other data models, e.g., relational models. This high complexity renders a much enlarged search space for XML query optimization. Furthermore, XML applications are typically Web-hooked and have many simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and optimization. The classical cost-based and heuristic-based approaches yield unacceptably low efficiency when applied to XML data—query optimization itself becomes very time-consuming because of the huge search space for optimization caused by the high complexity of the XML data model. Lots of work related to XML query processing, evaluation and optimization has been done, but the majority is focused on investigation for efficient supporting algorithms and indexing schemes. Various optimization technologies have been developed to solve the query retrieval and updating problems. Towards the later year, most researchers proposed hybrid optimization techniques. Hybrid system opens the possibility of covering each technology’s weakness by its strengths. This paper reviews on the issues and techniques used in query optimization for XML databases. Keywords: Query Optimization, XML Databases, Cost Model, Query Optimization issues. 1. INTRODUCTION After relational, network-based, hierarchical, object-oriented, object-relational, and deductive database systems, academic research and businesses increase their attention to the database-driven processing of XML documents, resulting in a new kind of information system, namely the (native) XML database system (XDBS). This development is reasonable, because the eXtensible Markup Language nowadays plays an important role in various key technologies like content management systems, electronic data interchange, and data integration techniques. Furthermore, for the management of a possibly large collection of XML documents, the classical advantages of dedicated database systems over file systems still hold: Convenient use of XML data through a standardized application programming interface (API); Transactional warranties for all operations on XML data; Processing of large volumes of data, measured in number of documents as well as document size. Further advantages of database systems like scalability with respect to the current transactional load, high availability and fault tolerance, as well as data and application independence shall be mentioned for completeness, though they are not XML specific. As XML [1] has gained prevalence in recent years, the storage and querying of XML data have become an important issue. Effective query optimization is crucial to obtaining good performance from an XML database given a declarative query specification. A join is frequently the most expensive physical operation in evaluating a relational query. Thus, selection of join order is a key task for a relational query optimizer. This observation is true for an XML query optimizer as well, but with significant twists. Perhaps the most important of these is the prevalence of structural joins in XML. Structural join order selection is a critical component of an XML query optimizer. A join in the relational context is usually a value-based equi-join, which involves two tables and is based on the values of two columns, one in each table. In the XML context, even though there are value-based joins, structural joins occur much more frequently. A structural join focuses on the containment (ancestordescendant or parent-child) relationship of the XML elements to be joined. The join condition is specified not on the value of the XML elements, but on their relative positions in the XML document. In short, queries on XML data have some features that are different from queries in the traditional relational context. Therefore, the set of alternative plans, and their relative costs, in the XML context are also quite different. An XML query language defines more comprehensible and structurized construct for ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 83 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING conducting operation on an XML document or various XML documents. For processing the query, an XML query engine or processor translates the syntaxes and executing the operations hinted by the query. Output is returned after process and processing time is projected to be minimum thus alluding efficient processing. [2] A query processor extracts the high level abstraction of declarative query and its procedural evaluation into a set of low-level operations [3]. Analogous to SQL processor, SQL query is translated at logical access model and then the logical access prior to accessing and returning the physical storage model. When querying data in XML, the query is read by the query parser. It manipulates it and passes an internal data structure to the Query Optimizer. Then, the Query Optimizer, using information from the Metadata Manager, changes the structure and passes it to the Query Evaluator. The Query Evaluator uses the Data Manager and the Index Manager to evaluate the query. Again, both the Data Manager and the Index Manager use the Storage Manager to get actual data. Overall, the evaluator takes in a query evaluation tree and substitutes each of the nodes with an appropriate access method. The evaluator uses both the Index Manager and the Data Manager to get results of the query and passes them to the Query Output API. Query optimization in the context of XML databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as compared with other data models, e.g., relational models. This high complexity renders a much enlarged search space for XML query optimization. Furthermore, XML applications are typically Webhooked and have many simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and optimization. The classical cost-based and heuristic-based approaches yield unacceptably low efficiency when applied to XML data—query optimization itself becomes very time-consuming because of the huge search space for optimization caused by the high complexity of the XML data model. Lots of work related to XML query processing has been done, but the majority is focused on investigation for efficient supporting algorithms [4], [5] and indexing schemes There are many query optimization techniques discovered and implemented by researchers. Among them are path traversal, use of indexes, use of materialized views, pipeline evaluation, structured join order selection, schema-based optimization, and reformulation of XML constraints, duplicate removal, tree pattern matching and labeling scheme. [6]. 2. RELATED WORK There are many query optimization techniques discovered and implemented by researchers. Among them are path traversal, use of indexes, use of materialized views, pipeline evaluation, structured join order selection, schema-based optimization, and reformulation of XML constraints, duplicate removal, tree pattern matching and labeling scheme. Query optimization in the context of XML databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as compared with other data models, e.g., relational models. This high complexity renders a much enlarged search space for XML query optimization. Furthermore, XML applications are typically Web hooked and have many simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and optimization. The classical cost-based and heuristic based approaches yield unacceptably low efficiency when applied to XML data – query optimization itself becomes very time-consuming because of the huge search space for optimization caused by the high complexity of the XML data model. Lots of work related to XML query processing has been done, but the majority is focused on investigation for efficient supporting algorithms [7, 8, 9, 10, 11, 12, 13] and indexing schemes. Complete and systematical work on XML query optimization has hardly been reported. The essential difference between XML data and traditional data (e.g., relational data) is the extra structural relationships between the various elements in an XML data source. On the one hand, these structural relationships render high complexity for XML data modeling and query optimization; on the other hand, they imply invaluable opportunities— source of semantic knowledge for efficient XML query optimization which is, however, often overlooked or inadequately exploited. For example, assume a query asks for the instances of element type t1 that are subelements of the instances of element type t2 and if we know that every t1 element has to be contained in a t2 element according to the data sources DTD or XSD (XML Schema Definition), then we can simply return all instances of type t1 without checking the containment relationship between the elements of the two types. This is a trivial example of using the structure knowledge of XML data to conduct a deterministic transformation on an input query for better evaluation efficiency. Here, by deterministic, we mean that the transformation once carried out will bring in definite improvement (simplification in this case) on the input query expression. More complicated cases exist that yield plenty of opportunities for efficient optimization of XML queries, e.g., exploiting an available structure index which is otherwise inapplicable to an input query (A structure index for now can be thought of as a shortcut between two distant elements within the structure of the XML data source). Obviously, using a structure index can significantly reduce the cost of evaluating a covered containment operation. For example, let t1, t2, and t3 be three element types; t1 is a distant descendent of t2, and t2 is a direct child of t3 (see Fig. 1); if the evaluation of a query needs to ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 84 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING check the containment relationship between t1 and t2, a straightforward but costly way is to traverse all the intermediate “generations” between t1 and t2; however if a structure index between t1 and t3 is available, we can bypass the long path (from t1 to t2) by using the structure index to reach t2’s parent, t3, first, and then get to the target t2. T3 T2 T1 Figure 1: Bypassing a lengthy path using a ‘shortcut’ The above examples indicate a huge realm where efficient XML query optimization can be pursued by exploiting the structure knowledge of XML data and structure-related semantics carried on by an XML query expression. XML query optimization is one of the most challenging issues facing the database research community. It is only obtainable through a systematically and carefully worked-out approach. To this end, initially one needs to study the equivalence issue related to XML queries and XML data because query transformation has to be based on query equivalence. Secondly, there is a need to work out a good strategy to efficiently and effectively accomplish XML query optimization by using these equivalences. A pretty straightforward guideline is to find a way to radically prune the search space during query optimization but still be able to obtain a sufficiently good plan though it may not be an optimal one which is the so-called deterministic optimization approach. By this approach, every transformation applied to a query has to be deterministic, in the sense that it is bound to produce nontrivial improvement on the input query in terms of evaluation efficiency. This type of transformation, by its nature, is heuristic-based and relies on proper exploitation of the structure knowledge of XML data. Often in a scenario, a structure index is available but not applicable because of the particular, unfavorable structure pattern of a query. However, the hidden opportunity of eventually applying the index to the query may exist and can be identified by guided transformations that exploit particular structural properties of the source XML data. [14, 15] 3. ISSUES IN QUERY OPTIMIZATION Given a query, there are generally a variety of methods for computing the answer. It is the responsibility of the system to transform the query as entered by the user into an equivalent query that can be computed more efficiently. The process of finding a good strategy for processing a query is called query optimization. Complex queries mainly involve many accesses to the disk. Since the transfer of data from disk is slow relative to the speed of main memory and the CPU of the computer system, it is worthwhile to allocate a considerable amount of processing to choose a method that will minimize the disk accesses. The size of the database, number of tuples, size of the tuple and other parameters that also has an effect on the time required to process a query. Hence, time utilized for a query to get the answer as against the query executed by a user, the disk space i.e. storage required to store the results, and the potential cost of the query are the prominent parameters in the optimization process. Being, storage space is not an issue nowadays since, it is available in a great amount and at an inexpensive price. Time and the cost of a query evaluation process are to be optimized. These two parameters are non-linear, hence a focus of this dissertation is to minimize the time of execution of a query.Query optimization in the context of XML databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as compared with other data models, e.g., relational models. This high complexity renders a much enlarged search space for XML query optimization. Furthermore, XML applications are typically Web-hooked and have many simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and optimization. The classical cost-based and heuristic-based approaches yield unacceptably low efficiency when applied to XML data—query optimization itself becomes very time-consuming because of the huge search space for optimization caused by the high complexity of the XML data model. Lots of work related to XML query processing has been done, but the majority is focused on investigation for efficient supporting algorithms [16], [17] and indexing schemes [18]. There are many query optimization techniques discovered and implemented by researchers. Among them are path traversal, use of indexes, use of materialized views, pipeline evaluation, structured join order selection, schema-based optimization, and reformulation of XML constraints, duplicate removal, tree pattern matching and labeling scheme. A problem with path traversal methods is that traversing is only possible in the constrained set of path. However, for the structure summary indexing, most of it has the problem of large index size growth in the worst case and not supporting partial queries path matching. Labeling scheme allow quick determination of the relationships among the element nodes and reduce the index size, but fails to support dynamic XML data. Towards the later year, most researchers proposed hybrid-indexing techniques. Hybrid system opens the possibility of covering each technology’s weaknesses by its strengths. [18] ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 85 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING 4. Comparison of existing algorithms for query optimization Category Based content Optimization using Tree Yes Optimization using Lore Yes on Based on structure Rewrite Query Reduce search area Yes Optimizing queries in XML Structured document Optimization based on Schema Optimization by rewriting queries Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Optimization by classifying elements Optimization By transforming semantic XPath query Table 1. Part I Comparative study of various Category Optimization using Tree Statistic implementa tion Yes Optimization using Lore Yes Classify elements Simplify query Sub query execution Yes Yes Optimizing queries in XML Structured document Optimization on Schema Yes Optimization by rewriting queries Optimization by classifying elements Yes Optimize by transforming semantic XPath query Table 1. Part II Comparative study of various Table 2. Comparison of algorithms ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 86 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING Table 1 and Table 2 focuses and provides the different approaches and the baseline provided earlier to solve the problem of query optimization for XML databases. 5. QUERY OPTIMIZATION TECHNIQUES IN A NUT SHELL The tree generation algorithm and some of the optimal plan selection and generation algorithms run in a polynomial time and hence, they need to be optimized to run in linear time. PAT algebra is being extended to make it more suitable for query optimization. Frequency search operations heavily make use of the indexing technologies in PAT. The future research will also be focused more towards generation and use of partially correlated sub-plans, which depend on bindings passed between portions of query plan. When a significant number of paths pass through a small number of objects, a transformation which introduces a group-by clause can be useful. Further examination is being conducted in order to implement the TOXIN Graph and to check if the TOXIN Tree can be extended to be used as an alternative to DOM for querying, updating and storing XML documents. Value-based grouping and join techniques are being investigated along with multi-way structural joins, new access methods for merged operators and several structural pattern techniques. In addition to this, new optimization algorithms have to be implemented to improve caching in Web service management Systems, XQUERY constructs are to be optimized. Cost based decisions are to be integrated in earlier stages of the query evaluation process and the cost model has to be refined in order model the CPU cost in a precise manner. 6. OUR PROPOSE APPROACH Query optimization is function wherein multiple query plans satisfy a query by selecting a appropriate query plan which is evaluated by the system. The proposed approach of query optimization is by using XML database management system. When Query is optimized the time to traverse the query over the path gets minimized. Thus it enhances the throughput of the system. Cost and time are the two important constrains, if cost decreases, the time for optimizing query increases and if cost increases then the time for optimizing query decreases since cost and time both are non linear. The proposed system deals with various evaluation plan for satisfying the query by selecting an appropriate evaluation plan so as to minimize the time leading to the optimization of a query. Selects the nearest neighbor destination (node) Entity Reference Point to store Envelope Reference XML Based Database System Point to store Next nearest neighbor destination (node) Output XML Document Reference Point to store Figure 2: System Diagram The conceptual model of the proposed system is illustrated in figure 2. The model depicts the main components enveloped in the system. The entity submits the envelop to the desired node (location) and simultaneously keep the track of the delivery of the envelop by storing the corresponding reference point in the database. The node wherein the first envelop is delivered is then treated as the reference point and the process is continued till the last envelop is delivered to the desired location. By referring to the reference points which were stored in the database, one can get the path of delivered envelop to the node or location. 7. CONCLUSION With the continuous expansion of XML applications, more and more data is represented and stored by using XML standards. From the database point of view, the data in these many XML documents can be collected and analyzed. How to search the contents of these documents is becoming increasingly important. Because of the characteristics of Web data, the data XML documents described must be irregular and incomplete, that is, the so-called "semi-structured ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 87 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING data", which is different from common relational data model or object-oriented data model, so how to effectively query XML documents is a hot research problem. Query optimization in the context of XML databases is extremely challenging. The main reason for this is the high complexity of the XML data model, as compared with other data models, e.g., relational models. This high complexity renders a much enlarged search space for XML query optimization. Furthermore, XML applications are typically Webhooked and have many simultaneous, interactive users. This dynamic nature requires highly efficient XML query processing and optimization. The classical cost-based and heuristic-based approaches yield unacceptably low efficiency when applied to XML data—query optimization itself becomes very time-consuming because of the huge search space for optimization caused by the high complexity of the XML data model. Lots of work related to XML query processing has been done, but the majority is focused on investigation for efficient supporting algorithms and indexing schemes. Our approach of minimizing time to execute a query, will certainly add to the available solutions in the context of query optimization in XML databases. The proposed work is being carried out and the experimental results will be compared with the existing solution. 8. REFERENCES [1] T. Bray, J. Paoli, C. M. Sperberg-McQueen. “Extensible Markup Language (XML) 1.0”. W3C Recommendation. Available at http://www.w3.org/TR/1998/REC-xml-19980210, Feb. 1998. [2] Christian Mathis, Theo Härder-“A Query Processing Approach for XML Database Systems”. www.lgis.informatik.unikl.de/cms/fileadmin/users/mathis/.../gws2.pdf [3] C. Mathis and T. Harder. “A Query Processing Approach for XML Database Systems”. 2005. [4] Dunren Che, Karl Aberer, M. Tamer Özsu. ”Query optimization in XML structured-document databases”. The VLDB Journal (2006) DOI 10.1007/s00778-005-0172-6. Published online: 28 April 2006 _c Springer-Verlag 2005 [5] Gottlob, G., Koch, C., Pichler, R.: “Efficient algorithms for processing XPath queries. In: Proceedings of VLDB, Hongkong, China (2002). [6] Chan, C.-Y., Felber, P., Garofalakis, M., Rastogi, R.: “Efficient filtering of XML documents with XPath expressions”. In: Proceedings of International Conference on Data Engineering, San Jose, California, February 2002, pp. 235–244 (2002). [7] S-Y Chien, Z Vagena, D Zhang, V J Tsotras, C Zaniolo (2002) Efficient Structural Joins on Indexed XML Documents. In: Proc. of 28th International Conference on VLDB, Hong Kong, China, 2002,pp: 263-274. [8] M F Fernandez, D Suciu (1998) Optimizing Regular Path Expressions Using Graph Schemas. In: Proceedings of the Fourteenth International Conference on Data Engineering, February 23-27, 1998, Orlando, Florida, USA, pp:14-23. [9] S Guha, H V Jagadish, N Koudas, D Srivastava, T Yu (2002) Approximate XML joins. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp:287-298. [10] Q Li, B Moon (2001) Indexing and Querying XML Data for Regular Path Expressions. In: Proceedings Of the 27th International Conference on Very Large Databases, Rome, Italy, September 2000, pp:361-370. [11]. JMcHugh, S Abiteboul, R Goldman, D Quass, J. Widom (1997) Lore: A Database Management System for Semistructured Data. SIGMOD Record, September 1997, 26(3):54-66. [12] D Srivastava, S Al-Khalifa, H V Jagadish, N Koudas, J M Patel, Y Wu (2002) Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In: Proceedings of ICDE, 2002. [13] N Zhang, V Kacholia, M T Vzsu (2004) A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML. In: Proceedings of 20th International Conference on Data Engineering, Boston, MA, March 2004, pp: 56-65. [14] Q Li, B Moon (2001) Indexing and Querying XML Data for Regular Path Expressions. In: Proceedings of the 27th International Conference on Very Large Databases, Rome, Italy, September 2000, pp: 361-370. [15] T Milo, D Suciu (1999) Index Structures for Path Expressions. In: Proceedings of ICDT, pp: 277295. [16] F Frasincar, G-J Houben, C Pau (2002) XAL: an Algebra for XML Query Optimization. In: Proceedings of 13th Australasian Database Conference, Melbourne, Australia. [17] L J Gerstein (1987) Discrete Mathematics and Algebraic Structures. W H Freeman and Company, New York, 1987. [18] G Gottlob, C Koch, R Pichler (2002) Efficient Algorithms for Processing XPath Queries. In: Proceedings of VLDB, Hongkong, China. ISSN: 0975 – 6760| NOV 11 TO OCT 12 | VOLUME – 02, ISSUE - 01 Page 88