CXquery (Chamois Xquery) and its Applications
  Hwan-Seung Yong (용 환승, 龍 煥昇)
  Dept. of Computer Science and Engineering
  Ewha Womans University, Seoul, Republic of Korea
  2005-09-26 H.S.Yong, EWU.

Contents
  Motivations of CXquery: Structure-Agnostic Query
  Query Processing Issues
  Developing CXquery System Experience
  CXquery to Xquery Conversion
  CXquery to XML Stream Query Processing
  Final Remarks

What is CXquery
  CXquery: Chamois Xquery
  Chamois
    Project name for the Component-Based Knowledge Engineering System Framework at Ewha, led by Dr. Won Kim [IEEE 2002]
    The chamois is an antelope that lives in the Alps
    This animal needs only short steps to leap high
  CXquery is the same as Xquery except for one thing
    No Xpath is needed to compose conditions
    Only element/attribute names are used

Background
  In an RDB
    A query is composed from the schema (relation names, attribute names) and constant values for condition checking
    The schema has a relatively simple structure
    Queries are easy to learn and can be written by end users
  In OO and ORDB
    The schema has a complex structure
    So query composition and design is a very hard task
    Only professionals do it
  In XML, what happens?

XML and Query Issues
  Queries are hard to compose, as in the OO/OR case
  Until now, the effort has been to design SQL-like XML query languages
  XML even allows data with no schema (DTD unknown)
  Xquery, Xpath: W3C standards
    Are they SQL-like?
  How do we make a query?
    Natural language query for RDB
      For ease of use
    How about a natural language query for XML?
    Or how about a semi-natural language query?

Some Aspects of XML
  XML is a meta-language for encoding domain information (book, movie, music, product, company, math, chemistry, etc.)
  Each domain needs a worldwide XML standard (MathML, CML, BioXML, etc.)
    But there are not yet enough of them
  This means data from the same domain can be encoded using different DTDs
    There can be many kinds of movie DTDs and music DTDs
  An Xquery has to follow the DTD, so the same query must be expressed by a different Xquery for each DTD

DTD Design Choices for the Same Data
  Element representation
  Attribute representation
  Nested representation
  Combinations of the above

Element representation: 1st DTD Type
  Xquery depends on the DTD structure

    for $m in doc()//movie
    where $m/genre[text() = "action"]
      and $m/year[text() = "1994"]
      and $m/actor[text() = "Jean Reno"]
    return <title>{$m/title/text()}</title>

  ① elements representation

    <movie>
      <year>1994</year>
      <country>America</country>
      <country>France</country>
      <genre>drama</genre>
      <genre>action</genre>
      <title>Leon</title>
      <director>Luc Besson</director>
      <actor>Jean Reno</actor>
      <actor>Natalie Portman</actor>
    </movie>

Attribute representation: 2nd DTD Type

    for $m in doc()//movie
    where $m/*[contains(@genre, "action")]
      and $m/*[@year = "1994"]
      and $m/*[@actor = "Jean Reno"]
    return <title>{data($m/*/@title)}</title>

  ② attributes representation

    <movie>
      <production year="1994" country="America France"> </production>
      <detail_info genre="drama action" title="Leon"> </detail_info>
      <people actor="Jean Reno" director="Luc Besson"> </people>
    </movie>

Nested representation: 3rd DTD Type

    for $y in doc()//year
    where $y[text() = "1994"]//genre[text() = "action"]//actor[text() = "Jean Reno"]
    return <title>{$y//title/text()}</title>

  ③ nested representation

    <year>1994
      <country>America
        <genre>action
          <movie>
            <title>Leon</title>
            <director>Luc Besson</director>
            <actor>Jean Reno</actor>
          </movie>
          …
        </genre>
      </country>
      …
    </year>
Combinations: 4th DTD Type

    for $g in doc()//genre
    where $g[@* = "action"]//year[* = "1994"]//*[@actor = "Jean Reno"]
    return <title>{$g//title/text()}</title>

  ④ nested + attributes + elements representation

    <genre type="action">
      <country name="America">
        <year><yyyy>1994</yyyy>
          <movie>
            <title>Leon</title>
            <people director="Luc Besson" actor="Jean Reno"> </people>
          </movie>
          …
        </year>
      </country>
    </genre>

Independence and DBMS
  But for XML in a heterogeneous(?) distributed environment
    Each Xquery depends heavily on its DTD
    Without defining a single DTD and converting the XML data, we have to write a different Xquery for each DTD

Real SQL-like XML Query
  Rather than using XPath expressions such as
    /continent/country/state/city/name = 'Kyoto'
    //city/name = 'Kyoto'
    //city//* = 'Kyoto'
    //city/@name = 'Kyoto'
  just use
    name = 'Kyoto'
  Use the element or attribute name instead of an Xpath
  "Find information about city named Kyoto"
    A natural language query requires heavy semantic processing

CXquery Approach
  Assumptions
    The user has to know the exact tag names (element/attribute) and values
    The user does not know the structure (DTD) of the XML
  Query example
    Search for movie titles whose genre is 'action', release year is '1994', and whose stars include 'Jean Reno'
      genre = "action" and year = "1994" and actor = "Jean Reno"
  Applied to XQuery

    for $t in doc()//title
    where genre = "action" and year = "1994" and actor = "Jean Reno"
    return $t

Four Query Processing Issues
  First, 'similarity matching' is required
    In an environment where the schema or DTD of the XML documents is not precisely known, or where "fuzzy" (approximate) search is done, even the precise names of the elements and attributes may not be known.
    Use thesaurus-based matching
      E.g., for the element names "actor", "genre", and "year", the query processor may also need to search for names such as "performer", "category", and "date", respectively.

Query Processing Issues
  Second, the same content can be represented heterogeneously in XML
    Elements and/or an attribute may intervene between an element name and its corresponding value.
    Figure (a): one example DTD, element representation
    Figure (b): type intervenes between genre and "action", name intervenes between actor and "Jean Reno", and yyyy intervenes between year and "1994"
    Figure (c): genre, year, and actor are represented as attributes
    Figure (d): genre, year, and actor are represented as elements, but their values are represented as attribute values
  This introduces significant implementation difficulties
    The processor should consider all possible representations.

  [Figure: panels (a), (b), (c), (d) showing the four representations listed above]

Query Processing Issues
  Third, intervening elements (<family>) and/or an attribute between an element and its corresponding value lead to "semantic uncertainty" in the association between the element and the value.
  Ex)

    <actor>
      <family>
        <name>Jean Reno</name>
        …
      </family>
    </actor>

  "Jean Reno" is the value associated with the element or attribute "family" of "actor".
  Blind binding of actor to "Jean Reno" is possible: declare that the search predicate actor = "Jean Reno" is true
  Semantic correctness may be in question!
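The second and third issues can be made concrete with a minimal sketch. It is not the CXquery processor from the talk: `find_bindings` and the three sample documents are hypothetical names introduced here, and the sketch only checks a single `name = value` predicate against the representations listed above (attribute, element text, intervening attribute, intervening element) using Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

def find_bindings(root, name, value):
    """List (kind, tag) pairs showing how `name` can be bound to `value`.

    Hypothetical helper: it tries each representation in turn and reports
    which one matched, so "blind" bindings through intervening nodes stay
    visible to the caller instead of being silently accepted.
    """
    hits = []
    for elem in root.iter():
        # `name` used as an attribute name, e.g. <people actor="Jean Reno"/>
        if elem.get(name) == value:
            hits.append(("attribute", elem.tag))
        if elem.tag != name:
            continue
        # `name` used as an element with the value as its own text
        if (elem.text or "").strip() == value:
            hits.append(("element", elem.tag))
        # An attribute intervenes, e.g. <genre type="action"/>
        if value in elem.attrib.values():
            hits.append(("intervening-attribute", elem.tag))
        # Elements intervene, e.g. <actor><family><name>...</name></family></actor>
        for node in elem.iter():
            if node is not elem and (node.text or "").strip() == value:
                hits.append(("intervening-element", node.tag))
    return hits

ELEMENT_DOC = "<movie><genre>action</genre><actor>Jean Reno</actor></movie>"
ATTRIBUTE_DOC = '<movie><people actor="Jean Reno"/><genre type="action"/></movie>'
NESTED_DOC = "<movie><actor><family><name>Jean Reno</name></family></actor></movie>"

for doc in (ELEMENT_DOC, ATTRIBUTE_DOC, NESTED_DOC):
    print(find_bindings(ET.fromstring(doc), "actor", "Jean Reno"))
```

A binding reported as `intervening-element` is exactly the case where semantic correctness may be in question, so a real processor would probably want to rank or flag it rather than treat it like a direct match.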
Query Processing Issues
  Fourth, the nearest common ancestor (NCA) of all element and attribute names that appear in the search predicates must be identified
    For query-processing optimization
    For preventing erroneous results

    <movies>
      <movie>
        <general_info>
          <year>1994</year>
          <genre>action</genre>
        </general_info>
        <detail_info>
          <actors>
            <actor>Jean Reno</actor>
          </actors>
        </detail_info>
      </movie>
      <movie>
        <general_info>
          <year>1994</year>
          <genre>action</genre>
          ...
      </movie>
    </movies>

  [Figure marks: ○ and × label candidate ancestors in the original slide]

Query Processing Issues
  However, the problem is difficult because the structure of the XML hierarchy is not specified in CXquery
  Ex) NCA of year, genre, and actor

One Approach to Support CXquery: Implementation
  Condition clause: genre = "action" AND year = "1993" AND actor = "Tommy Lee Jones"
    Structure: unknown
    Data names: element/attribute names
    Data values
  Result clause: title

One Approach to Support CXquery
  Implementation of an XML server based on CXquery
  Special indexing is used
    Node index: all element and attribute names
    Value index: all constant values in the XML
    All nodes and values are numbered so that their structural relationships can be found
    Indices are stored in an RDB
  Performance evaluation shows promising results [ISMIS 2005]

One Approach
  The query processor should derive all paths among the names and values.
    Identification of the relationship between a name and a value
    Identification of the relationship between names
  All possible path types that XML can have are investigated and classified
    genre = "action" AND year = "1993" AND actor = "Tommy Lee Jones"
    All possible paths
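The nearest-common-ancestor issue can be illustrated with a small sketch on the two-movie example. Everything here is an assumption for illustration (the helper names `ancestors`, `nearest_common_ancestor`, `nca_tags`, and the completed second movie in `DOC`); it simply enumerates every combination of nodes satisfying the three predicates and reports the tag of their NCA.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Sample data modeled on the slide; the second movie is closed without an actor.
DOC = """<movies>
  <movie>
    <general_info><year>1994</year><genre>action</genre></general_info>
    <detail_info><actors><actor>Jean Reno</actor></actors></detail_info>
  </movie>
  <movie>
    <general_info><year>1994</year><genre>action</genre></general_info>
  </movie>
</movies>"""

def ancestors(node, parent):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def nearest_common_ancestor(nodes, parent):
    """Deepest element that is an ancestor-or-self of every node."""
    chains = [ancestors(n, parent) for n in nodes]
    common = set(chains[0])
    for chain in chains[1:]:
        common &= set(chain)
    # chains[0] is ordered deepest-first, so the first common entry is the NCA
    return next(n for n in chains[0] if n in common)

def nca_tags(root, predicates):
    """NCA tag for every combination of nodes matching the predicates."""
    parent = {child: p for p in root.iter() for child in p}
    candidates = [
        [e for e in root.iter(name) if (e.text or "").strip() == value]
        for name, value in predicates
    ]
    return [nearest_common_ancestor(combo, parent).tag
            for combo in product(*candidates)]

tags = nca_tags(ET.fromstring(DOC),
                [("year", "1994"), ("genre", "action"), ("actor", "Jean Reno")])
print(tags)
```

Only one combination meets at a single `movie`; the others meet only at the `movies` root, which means they mix the year and genre of one movie with the actor of another. A processor that accepts any co-occurrence in the document would therefore return erroneous results.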
  [Figure: classification of all possible path types. Panels: m-FE, m-HE, m-HEA, m-FEA, d-FE-FA, d-FA-FE, d-FE-FEA, d-FEA-FEA, d-uHE-FE, d-lHE-FE, d-ulHE-FE, d-HEA-FE, d-HE-HA, d-HA-HE, d-HE-HEA, d-HEA-HE, d-HEA-HEA, d-uHE-HE, d-lHE-HE, d-ulHE-HE. Node labels used in the diagrams: C1E … CnE, C1A … CnA, iE, iA, V1 … Vn]

One Approach
  To search all possible paths, a node numbering scheme is applied to each node in the XML

    <year yyyy="1994">
      <country name="America">
        <movie>
          <title>Leon</title>
          <genre type="drama" type="action"></genre>
          <people>
            <director>Luc Besson</director>
            <actor>
              <name>Jean Reno</name>
              <name>Natalie Portman</name>
            </actor>
          </people>
        </movie>
        ….
      <country name="France">
        ….
    </year>

  [Figure: tree view of the document above. year has children yyyy (1994) and two country nodes; the first country has name (America) and movie; movie has title (Leon), genre with two type nodes (drama, action), and people; people has director (Luc Besson) and actor with two name nodes (Jean Reno, Natalie Portman); the second country has name (France)]

Node Numbering to Identify Relationships
  [Figure: the same tree with a (start, end) region pair on every node, e.g. year (10,1000), yyyy (20,25), the first country (30,490), the second country (500,990); a value node carries the same pair as the name node it belongs to, e.g. 1994 (20,25)]
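A region numbering of this kind can be sketched as follows. This is an assumed scheme, not necessarily the one used in the talk: `number_regions` hands out a start number on entering an element and an end number on leaving it (with a gap of `step` so later insertions would have room), and `contains` tests the ancestor relationship by interval containment alone, without walking the tree.

```python
import xml.etree.ElementTree as ET

def number_regions(root, step=10):
    """Assign (tag, start, end) to every element in document order.

    Hypothetical scheme: a single counter advances by `step` when an
    element is entered and again when it is left, so every descendant's
    region lies strictly inside its ancestor's region.
    """
    regions = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += step
        start = counter
        for child in elem:
            visit(child)
        counter += step
        regions.append((elem.tag, start, counter))

    visit(root)
    return sorted(regions, key=lambda r: r[1])

def contains(outer, inner):
    """True if `outer` is an ancestor of `inner`, judged by regions only."""
    return outer[1] < inner[1] and inner[2] < outer[2]

SAMPLE = "<movie><year>1994</year><people><actor>Jean Reno</actor></people></movie>"
regions = number_regions(ET.fromstring(SAMPLE))
for row in regions:
    print(row)
```

Because the test needs only two integer comparisons, rows of this shape can be stored in a relational table and structural relationships can be checked with ordinary range predicates.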
    Doc-ID   StartRegion   EndRegion   name
    1        10            220         movie
    1        20            30          year
    1        40            110         Basicinfo
    1        120           210         people

Processing Flow Diagram Overview
  An XML server was implemented to evaluate the performance of the query expression
  [Figure: processing flow.
   Indexing side: XML Documents, Parser, Node & Value Analyzer, Identifier Creator (node names & identifiers, values & identifiers), Index Constructor, which fills the Node Table, Value Table, Data Table, and Index Table.
   Query side: Queries, Parser (components of a condition clause and components of a result clause), Query Processor made up of Path Type Classifier, SQL Translator (SQL statements), SQL Processor, Region Processor, and Result Creator, which returns the query results]

CXquery to Xquery Conversion System
  Diagram overview
  [Figure: the user submits a CXQuery and the DTDs to the CXQuery to Xquery Converter; the converter sends the generated Xquery to the XML server / XML DB (eXist 1.0), which holds XML documents for DTD 1 and DTD 2; the result is returned to the user]

CXquery to Xquery Converter
  A set of Xqueries should be generated for one CXquery, one per distinct DTD

  CXQuery

    for $c in doc()
    where genre = "action" and actor = "Jean Reno"
    return title

  XQuery

    for $c in /movies
    where $c/genre = "action" and $c/actor = "Jean Reno"
    return $c/title

Xquery for Each DTD Type
  DTD Type 1

    for $c in /movies
    where $c/movie/@genre = "action" and $c/movie/actor = "Jean Reno"
    return $c/movie/title

  DTD Type 2

    for $c in /movies
    where $c/movie/genre = "action" and $c/movie/genre/actor = "Jean Reno"
    return $c/title

  DTD Type 3

    for $c in /movie
    where $c/movie/@genre = "action" and $c/movie/title/@actor = "Jean Reno"
    return $c/movie/title

  DTD Type 4

    for $c in /movies
    where $c/movie/genre = "action" and $c/movie/actor = "Jean Reno"
    return $c/movie/title
System Flow Diagram
  [Figure: system flow.
   Input: CXQuery file, DTD file, XML stream.
   Processing: the Path Generator turns the DTD file into a DTD path set; the CXQuery Converter combines the CXQueries with the DTD path set to produce Xpath queries; the YFilter XML stream engine evaluates the Xpath queries against the XML stream.
   Output: XML document]

CXquery to Xquery Conversion
  (a) DTD Path Set: path_mondial-cities.txt

    /cities/city/name
    /cities/city/latitude
    /cities/city/population
    /cities/city/located_at
    /cities/city[@is_country_cap]
    /cities/city[@is_state_cap]
    /cities/city/population[@year]
    /cities/city/located_at[@watertype]

  (b) CXQuery

    CXQ1:is_country_cap="yes" or latitude
    CXQ2:car_code="MK" and area="25333"
    CXQ3:name="Caspian Sea" or area="17000"
    CXQ4:latitude
    CXQ5:ethnicgroups
    CXQ6:name
    CXQ7:country="Korea"
    …

  (c) Xpath set from (a) and (b)

    /cities/city/name
    /cities/city/latitude
    /cities/city[@is_country_cap]

  (d) Xquery Generation

    /cities/city/latitude
    /cities/city[@is_country_cap="yes"]
    /cities/city[name="Caspian Sea"]
    /cities/city/name

Xquery Conversion for Each DTD
  path_mondial-countries.txt

    ...
    /countries/country/name
    /countries/country/provinces
    /countries/country/encompasses
    /countries/country/neighbor
    /countries/country[@car_code]
    /countries/country[@area]
    /countries/country/population[@year]
    /countries/country/ethnicgroups[@percentage]
    /countries/country/religions[@percentage]
    /countries/country/ethnicgroups

  path_mondial-cities.txt

    ...
    /cities/city/name
    /cities/city/latitude
    /cities/city/population
    /cities/city/located_at
    /cities/city[@is_country_cap]
    /cities/city[@is_state_cap]
    /cities/city/population[@year]
    ...
  path_mondial-continents.txt

    /continents/continent
    /continents/continent/name

  path_mondial-lakes.txt

    /lakes/lake/name

  qry6.txt

    CXQ1:is_country_cap="yes" or latitude
    CXQ2:name="Caspian Sea" or area="17000"
    CXQ3:latitude
    CXQ4:ethnicgroups
    CXQ5:name

  xpath_qry6.txt

    CXQ5:/cities/city/name
    CXQ1:/cities/city/latitude
    CXQ1:/cities/city[@is_state_cap="yes"]
    CXQ5:/continents/continent/name
    CXQ5:/countries/country/name
    CXQ4:/countries/country/ethnicgroups
    CXQ2:/countries/country[@area="17000"]
    CXQ4:/countries/country/ethnicgroups[@percentage]
    CXQ5:/lakes/lake/name

Implementation Result
  [Screenshot: CXquery example]
  [Screenshot: matching process of CXquery with the path set]
  [Screenshot: Xquery conversion results]
  [Screenshot: example with 6 CXqueries]
  [Screenshot: the 13 converted Xqueries]

CXquery for Distributed XML Servers
  In a heterogeneous DBMS environment
    A single standard schema is required at the central server
    Query translation is required
      A query on the standard schema is translated into each site's schema
  Distributed CXquery environment
    No standard XML schema is needed; a collection of each site's DTD is enough
    The user composes queries using CXquery only
      CXquery is DTD-neutral
    The central site then converts the CXquery into each site's Xquery and collects the results.

Heterogeneous XML Stream Query Processing
  Stream data is increasing
    RSS streams, news streams, stock trading, sensor streams, multimedia streams, etc.
  A stream processing engine is needed
    It must handle a large number of heterogeneous XML streams concurrently
  How do we use a single stream query on multiple heterogeneous streams?
    Translate the query for each stream and process each one differently?
    Apply a single CXquery to multiple heterogeneous streams.
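The conversion from a CXquery to Xpath queries through a DTD path set can be sketched in a few lines. This is a simplified reading of the examples in the path-set slides, not the converter used in the talk: `convert_term`, `cxquery_to_xpaths`, and `CITY_PATHS` are names introduced here, each term is matched only against the last step (or the trailing attribute) of a path, and the boolean connectives are used only to split the query into terms.

```python
import re

# Path set copied from path_mondial-cities.txt in the slides.
CITY_PATHS = [
    "/cities/city/name",
    "/cities/city/latitude",
    "/cities/city/population",
    "/cities/city/located_at",
    "/cities/city[@is_country_cap]",
    "/cities/city[@is_state_cap]",
    "/cities/city/population[@year]",
    "/cities/city/located_at[@watertype]",
]

# Matches paths that end in an attribute test, e.g. /cities/city[@is_country_cap]
ATTR_PATH = re.compile(r"^(?P<base>.*)\[@(?P<attr>[^\]]+)\]$")

def convert_term(term, paths):
    """Turn one CXquery term (`name` or `name="value"`) into Xpath strings."""
    name, _, value = term.partition("=")
    name, value = name.strip(), value.strip()
    xpaths = []
    for path in paths:
        m = ATTR_PATH.match(path)
        if m:
            # The name is an attribute in this DTD
            if m.group("attr") == name:
                xpaths.append(f"{m.group('base')}[@{name}={value}]" if value else path)
        elif path.rsplit("/", 1)[-1] == name:
            # The name is an element: put the value test on its parent
            parent = path.rsplit("/", 1)[0]
            xpaths.append(f"{parent}[{name}={value}]" if value else path)
    return xpaths

def cxquery_to_xpaths(query, paths):
    """Split a CXquery on `and` / `or` and convert each term.

    Simplification: a quoted value containing ' and ' or ' or ' would be
    split incorrectly; a real converter would need a proper parser.
    """
    xpaths = []
    for term in re.split(r"\s+(?:or|and)\s+", query):
        xpaths.extend(convert_term(term, paths))
    return xpaths

print(cxquery_to_xpaths('is_country_cap="yes" or latitude', CITY_PATHS))
print(cxquery_to_xpaths('name="Caspian Sea" or area="17000"', CITY_PATHS))
```

Running the same function once per path set (cities, countries, continents, lakes) would give one group of Xpath queries per DTD, and a term with no match in a given DTD, such as `area` in the cities path set, simply contributes nothing for that site or stream.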
Final Remarks
  CXquery, a query language with no paths, has been introduced
  This area of research needs more work from now on
  Technical issues for future research
    The element/attribute to value association must be resolved
    Semantic ambiguity problem
      Name = "Kyoto" vs name = "Tanaka" vs name = "Winter Sonata"

Final Remarks
  Possible approaches
    Define DTD tag names more specifically
      Cityname = "Kyoto" vs person-name = "Tanaka" vs Movie-name = "Winter Sonata"
    The system resolves domain conflicts exactly, through data mining etc.
      "Kyoto" represents a city name, "Tanaka" represents a person name, etc.
    The user specifies the exact domain name for every constant
      Name = "Kyoto[City]", name = "Tanaka[Person]", name = "Winter Sonata[Movie]"
      An XML extension is required

Thank you for your attention
  Thank you very much for listening
  Questions?

References
  [ISMIS 2005] Wol Young Lee, Hwan Seung Yong, "A Query Expression and Processing Technique for an XML Search Engine," ISMIS 2005: 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, NY, USA, May 2005, pp. 266-275.
  [JOT 2004b] Won Kim, Wol Young Lee, Hwan Seung Yong, "On Query-Processing Issues for Non-Navigational Queries for XML," Journal of Object Technology, Vol. 3, No. 10, November-December 2004, pp. 19-26.
  [JOT 2004a] Won Kim, Wol Young Lee, Hwan Seung Yong, "On Supporting Structure-Agnostic Queries for XML," Journal of Object Technology, Vol. 3, No. 7, July-August 2004, pp. 27-35.
  [JOT 2002] Won Kim et al., "The Chamois Reconfigurable Data-Mining Architecture," Journal of Object Technology, Vol. 1, No. 2, July-August 2002, pp. 2-10.
  [IEEE 2002] Won Kim et al., "Chamois: A Component-Based Knowledge Engineering Framework," IEEE Computer, Vol. 35, No. 5, May 2002, pp. 46-54.