Cooperative XML (CoXML) Query Answering

Motivation
- XML has become the standard format for information representation and data exchange.
- The amount of XML data available on the web has increased explosively, e.g.:
  - bills at the Library of Congress
  - the IEEE Computer Society's publications
  - SwissProt, a protein sequence database
  - XMark, online auction data
- Effective XML search methods are needed!

Challenges
- XML schemas are usually very complex. E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1,000 distinct paths.
- It is often unrealistic for users to fully understand a schema before posing queries.
- Exact query answering is inadequate; approximate query answering is more appropriate!

Approach: CoXML
- Derive approximate answers by relaxing query conditions, i.e., query relaxation.
- [Figure: a query is submitted to the cooperative XML query answering layer, which runs on top of an XML database engine over the XML documents and returns approximate answers.]

Roadmap
- Introduction
- Background
- CoXML
- Related Work
- Conclusion

XML Data Model
- XML data is often modeled as an ordered labeled tree: tree nodes are elements, and tree edges are element-nesting relationships.
- Example data tree: an article element (node 1) with children title "Search engine spam detection" (2), author (3) with children name "XYZ" (4) and title "IEEE Fellow" (5), year "2003" (6), and body (7) with a section "...a spam detection technique by content analysis..." (8).

XML Query Model
- XML queries are often modeled as trees.
- Structure conditions: a set of query nodes connected by parent-to-child ('/', directly connected) or ancestor-to-descendant ('//', connected either directly or indirectly) edges.
- Content conditions: either value predicates or keyword constraints on query nodes.
- Example query tree: article with children title "search engine", year "2003", and section "spam detection".

XML Query Answer
- An answer for a query is a set of nodes in a data tree that satisfies both the structure and the content conditions.
- Example: in the data tree above, the nodes title (2), year (6), and section (8) under article (1) answer the query tree article[title "search engine", year "2003", section "spam detection"].

XML Query Relaxation Types (1)
- Value relaxation: enlarging a value condition's search scope, e.g., relaxing the condition year = 2003 to 2000 <= year <= 2005.
- Node relabeling: changing the label of a node to a similar or more general label using domain knowledge, e.g., relabeling the query node article to document.
[1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)

XML Query Relaxation Types (2)
- Edge generalization: relaxing a '/' edge to a '//' edge, e.g., relaxing article/title to article//title.
- Node deletion: dropping a node from a query tree, e.g., dropping the title node from the example query.

XML Relaxation Properties
- Definition: a relaxation operation is an application of a relaxation type to a specific query node or edge.
- Lemma: given a query tree with n applicable relaxation operations, there are potentially up to C(n,1) + C(n,2) + ... + C(n,n) = 2^n - 1 relaxed trees, one for each non-empty combination of operations.

Challenges
- Query relaxation is often user-specific: a query with n relaxation operations has potentially up to 2^n relaxed queries, and different users may have different approximate-matching specifications for the same query tree. How can we provide user-specific approximate query answering? How can we systematically relax a query?
- Query relaxation generates a set of approximate answers. How can we effectively rank the returned approximate answers?
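The lemma above — n applicable relaxation operations yield up to 2^n relaxed trees — can be illustrated with a short enumeration of every non-empty combination of operations. This is a minimal sketch, not part of the CoXML system; the operation descriptors below are hypothetical labels borrowed from the slides' notation.

```python
from itertools import combinations

def relaxed_trees(operations):
    """Yield every non-empty combination of relaxation operations.

    Each combination identifies one relaxed query tree, so a query
    with n applicable operations has up to 2**n - 1 relaxed trees
    (in addition to the original, unrelaxed tree).
    """
    for k in range(1, len(operations) + 1):
        for combo in combinations(operations, k):
            yield frozenset(combo)

# Hypothetical operations on the example query tree: generalize the
# article-title edge, generalize the body-section edge, delete title.
ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
trees = list(relaxed_trees(ops))
print(len(trees))  # 2**3 - 1 = 7
```

The exponential count is exactly why an exhaustive generate-and-test approach does not scale, motivating the relaxation indexes that follow.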
CoXML System Overview
- [Figure: system architecture. A query expressed in the relaxation language (RLXQuery) is sent to the Relaxation Engine. The engine poses queries to the XML database engine over the XML documents; exact answers are returned directly, while relaxed queries are derived with the help of relaxation indexes (XTAHs) produced by the Relaxation Index Builder. The Ranking Module orders the results by similarity metrics and returns ranked results.]

Roadmap
- Introduction
- Background
- CoXML: Relaxation Language, Relaxation Indexes, Ranking, Evaluation Testbed
- Related Work
- Conclusion

Relaxation Language
- Motivation: enable users to specify approximate conditions in queries and to control the approximate matching process.
- RLXQuery, a relaxation-enabled XQuery, extends the standard XML query language (XQuery) with relaxation constructs and controls, such as:
  - ~ : approximate conditions
  - ! : non-relaxable conditions
  - REJECT : unacceptable relaxations
  - AT-LEAST : minimum number of answers to be returned
  - RELAX-ORDER : relaxation order among multiple conditions
  - USE : allowable relaxation types

RLXQuery Example

    FOR $a in doc("bib.xml")//article
    WHERE $a/year = ~2003 V-COND-LABEL t1
      and ~($a[about(./!title, "search engine")]/body/section)
            [about(., "spam detection")] S-COND-LABEL t2
    RETURN $a
    RELAX-ORDER (t1, t2)
    USE (edge generalization, node deletion)
    AT-LEAST 20

- [Figure: the corresponding query tree — article with a non-relaxable title node ("search engine"), a year node with the approximate value condition ~2003 (labeled t1), and a body/section branch ("spam detection") whose structure condition is labeled t2.]

Relaxation Index
- Naïve approach: generate all possible relaxed queries and iteratively select the best relaxed query to derive approximate answers. Exhaustive, but not scalable.
- Observation: many queries share the same (or similar) tree structures.
- Our approach, the relaxation index: treat the structure of a query tree T as a template, build an index on the relaxed trees of T, and use the index to guide the relaxation of any query with the same (or a similar) tree structure as T.

Relaxation Index - XTAH
- An XTAH is a hierarchical, multi-level labeled cluster of relaxed trees.
- Building an XTAH: given a query structure template T, generate all possible relaxed trees, where each relaxed tree uses a unique set of relaxation operations; then cluster the relaxed trees into groups based on their relaxation operations and distances, similar to "suffix-tree" clustering.

XTAH Example
- Template structure T: article ($1) with children title ($2) and body ($3), where body has child section ($4).
- Notation: gen(e$u,$v) relaxes the edge between $u and $v; del($u) deletes the node $u.
- [Figure: a sample XTAH for T. The root branches into groups by relaxation type (edge_generalization, node_relabel, node_deletion, ...). Each internal node is labeled with the set of operations shared by its group, e.g., the group {gen(e$1,$2)} contains T1; {gen(e$3,$4)} contains T2 and T3 and refines to groups such as {gen(e$1,$2), gen(e$3,$4)} and {gen(e$3,$4), gen(e$1,$3)}; {del($2)} contains T6 and refines to {del($2), del($3)}, which contains T7.]

XTAH Properties
- Each group consists of relaxed trees obtained by similar relaxation operations, which enables efficient location of relaxed trees based on relaxation operations.
- The higher a group's level, the less relaxed the trees in the group, which enables relaxing queries at different granularities by traversing up and down the XTAH.

XTAH-Guided Query Relaxation
- Problem: given a query with relaxation specifications (constructs and controls), how can we search an XTAH for relaxed queries that satisfy the specifications?
- Approach: first, prune XTAH groups containing trees that use relaxations the query declares unacceptable; this step can be carried out efficiently using the internal node labels. Then, iteratively search the XTAH for the best relaxed query.

Query Relaxation Process Example
- [Figure: relaxing the sample RLXQuery (with controls USE (edge generalization, node deletion) and AT-LEAST 20) over the sample XTAH for template T: the groups under node_relabel are pruned, since only edge generalization and node deletion are allowed, and the remaining groups are searched for relaxed queries.]
- Problem: given a query and an XTAH, how can we efficiently locate the best relaxation candidate at the leaf level?
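The prune-then-search procedure described above can be sketched as follows. This is a simplified illustration, not the CoXML implementation: XTAH groups are represented as hypothetical dictionaries labeled with their operation sets, any group using a rejected relaxation is pruned together with its whole subtree, and "best" is approximated here by the smallest operation set rather than a real distance measure.

```python
import heapq

def search_xtah(root, rejected, max_candidates=20):
    """Best-first search of an XTAH for relaxed trees.

    Groups whose operation set overlaps the rejected relaxations
    are pruned with their entire subtree; the remaining groups are
    visited least-relaxed first (smallest operation set).
    """
    heap = [(len(root["ops"]), id(root), root)]
    results = []
    while heap and len(results) < max_candidates:
        _, _, group = heapq.heappop(heap)
        if group["ops"] & rejected:      # unacceptable relaxation: prune
            continue
        results.extend(group["trees"])   # relaxed trees stored in the group
        for child in group["children"]:
            heapq.heappush(heap, (len(child["ops"]), id(child), child))
    return results[:max_candidates]

# A tiny XTAH mirroring part of the sample in the slides (layout assumed):
g3 = {"ops": frozenset({"del($2)", "del($3)"}), "children": [], "trees": ["T7"]}
g2 = {"ops": frozenset({"del($2)"}), "children": [g3], "trees": ["T6"]}
g1 = {"ops": frozenset({"gen(e$1,$2)"}), "children": [], "trees": ["T1"]}
root = {"ops": frozenset(), "children": [g1, g2], "trees": []}

# REJECT on del($2) prunes g2 and everything beneath it:
print(search_xtah(root, {"del($2)"}))  # ['T1']
```

Because group labels are operation sets, a single set-intersection test at an internal node suffices to discard all relaxed trees below it, which is the efficiency argument the slides make for the internal node labels.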
- Approach: the M-tree [2]. Assign representatives to internal groups; the representatives summarize the distance properties of the trees within their groups; use the representatives to guide the search path to the best relaxation candidate.
- [Figure: an XTAH with internal groups R0, R1, R2, R3, R5, R8, R11 and their representatives, guiding the search toward a relaxed tree j at the leaf level.]
[2] M-tree: An efficient access method for similarity search in metric space (P. Ciaccia et al., VLDB 1997)

Ranking
- Ranking criteria: based on both the content and the structure similarities between a query and an answer, i.e., a set of data nodes.
- Approach:
  - Content similarity: an extended vector space model.
  - Structure similarity: tree editing distance with a model for assigning operation costs.
  - Overall relevancy: a ranking model combining both content and structure similarities.

Content Similarity
- Traditional IR ranking: the vector space model measures the content similarity between a query and a document using term frequency and inverse document frequency.
- XML content ranking: our extended vector space model measures the content similarity between a query and an answer (i.e., a set of data nodes) using weighted term frequency and inverse element frequency.

Weighted Term Frequency
- Terms under different paths of a node are weighted differently. Example: for the query section "spam detection", a data section node (5) contains the query terms under a title child ("Spam Detection By Content Analysis", node 6), a paragraph child ("...an approach to detect spam by...", node 8), and a reference child ("Spam detection taxonomy", node 12); occurrences under these different paths carry different weights.
- The weighted term frequency of a term t in a node v is

    tf_w(v, t) = sum_{i=1..m} w(p_i) * tf(p_i, t)

  where p_i is a path under the node v leading to the term t, and m is the number of distinct paths under v that contain t.

Inverse Element Frequency
- The more XML elements contain a term, the less disambiguating power the term has; e.g., the term "spam" is less disambiguating than the term "detection".
- The inverse element frequency of a query term t is

    ief($u, t) = log(N1 / N2)

  where $u is the query node whose content condition contains t, N1 is the number of data nodes that match the structure condition related to $u, and N2 is the number of data nodes that match the structure condition related to $u and contain t.

Extended Vector Space Model
- The content similarity between an answer A and a query Q is

    cont_sim(A, Q) = sum_{i=1..n} sum_{j=1..|$u_i.cont|} tf_w(v_i, t_ij) * ief($u_i, t_ij)

  where n is the number of nodes in Q; {$u_1, ..., $u_n} is the set of query nodes in Q; {v_1, ..., v_n} is the set of data nodes in A, with v_i matching $u_i (1 <= i <= n); |$u_i.cont| is the number of terms in the content conditions on the node $u_i; and t_ij is a term in the content condition on the query node $u_i.

Structure Distance Function
- Both XML data and queries are modeled as trees, and similarity between trees is often computed by editing distance, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other.
- The structure distance between an answer A and a query Q is measured as the total cost of the relaxation operations used to derive A:

    struct_dist(A, Q) = sum_{i=1..k} cost(r_i)

  where {r_1, ..., r_k} is the set of relaxation operations used to derive A and cost(r_i) is the cost of r_i, with 0 <= cost(r_i) <= 1.

Relaxation Operation Cost
- Naïve approach: assign a uniform cost to all relaxation operations. Simple, but ineffective.
- Our approach: assign each operation a cost based on the similarity between the two nodes being approximated by the operation; the closer the two nodes, the lower the operation cost:

    cost(r_i) = 1 - similarity($u, $v)

  where r_i is a relaxation operation and $u, $v are the two nodes being approximated by r_i.

Nodes Approximated by Relaxation Operations

    Relaxation operation | Nodes approximated ($u, $v)                        | Example
    Node relabeling      | (node with the old label, node with the new label) | (article, document)
    Node deletion        | (child node, parent node)                          | (section, body)
    Edge generalization  | (child node, descendant node)                      | (article/title, article//title)

- [Figure: a query tree T1 (article with children title and body, body with child section) and relaxed versions illustrating node relabeling (article to document), node deletion, and edge generalization.]

Overall Relevancy Function
- The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q) with the properties:
  - sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0;
  - sim(A, Q) increases as cont_sim(A, Q) increases;
  - sim(A, Q) decreases as struct_dist(A, Q) increases.
- Implementation:

    sim(A, Q) = alpha^struct_dist(A, Q) * cont_sim(A, Q)

  where alpha is a small constant between 0 and 1.

Evaluation Studies
- INEX (Initiative for the Evaluation of XML retrieval):
  - Document collection: scientific articles from the IEEE Computer Society, 1995-2002; about 500 MB; each article consists of about 1,500 XML nodes on average.
  - Queries (similar to TREC topics for text retrieval): strict content-and-structure (SCAS) and vague content-and-structure (VCAS).
  - Gold standard: relevance assessments provided by INEX.

Evaluation of Content Similarity
- Dataset: INEX 03 test collection. Query set: 30 SCAS queries. Comparison: the 38 submissions to INEX 03.
- [Figure: precision-recall curve; average precision 0.3309.]

Evaluation of the Cost Model
- Dataset: INEX 05 test collection. Query set: 22 simple VCAS queries.
- Evaluation metric: normalized extended cumulative gain (nxCG), the official evaluation metric of INEX 05. Given a rank i (i >= 1), nxCG@i, similar to precision@i, measures the relative gain users accumulate up to rank i, e.g., nxCG@10, nxCG@25, nxCG@50.
- Cost models compared: UCost, a uniform cost for every relaxation operation (the baseline), and SCost, our proposed semantic cost model.
- Results over all content-and-structure queries in INEX 05, reporting nxCG@10 for each pair (alpha, cost model), with the improvement over the baseline in parentheses:

    alpha | 0.1              | 0.3              | 0.5              | 0.7          | 0.9
    UCost | 0.2584           | 0.2616           | 0.2828           | 0.2894       | 0.2916
    SCost | 0.3319 (+28.44%) | 0.3190 (+21.94%) | 0.3196 (+13.04%) | 0.3068 (+6%) | 0.2957 (+4.08%)

  where alpha is the constant in sim(A, Q) = alpha^struct_dist(A, Q) * cont_sim(A, Q).
- Utilizing node similarities to assign different costs to different relaxation operations improves retrieval performance! Similar results are observed for nxCG@25 and nxCG@50.

Expressiveness of the Relaxation Language
- INEX 05 Topic 267:

    <inex_topic topic_id="267" query_type="CAS">
      <castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
      <description>Articles containing "digital libraries" in their title.</description>
      <narrative>I'm interested in articles discussing Digital Libraries as their
      main subject. Therefore I require that the title of any relevant article
      mentions "digital library" explicitly. Documents that mention digital
      libraries only under the bibliography are not relevant, as well as documents
      that do not have the phrase "digital library" in their title.</narrative>
    </inex_topic>

- Expressing Topic 267 in RLXQuery:

    FOR $a in doc("inex.xml")//article
    LET $b = $a//fm//!atl REJECT(fm, bb)
    WHERE $b[about(., "digital libraries")]
    RETURN $b

Effectiveness of the Relaxation Control
- Results for Topic 267 (the RLXQuery above):

    Method                  | nxCG@10 | nxCG@25
    No relaxation control   | 0.1013  | 0.2365
    With relaxation control | 1.0     | 0.8986

- With relaxation control, the system achieves perfect accuracy at nxCG@10: relaxation control enables the system to provide answers with greater relevancy!

Evaluation of the Ranking Function
- Dataset: INEX 05 test collection. Query set: the 4 official VCAS queries with available relevance assessments. Comparison: the top-1 submission in INEX 05.
- Results:

    Topic   | nxCG@10: Top-1 | nxCG@10: CoXML | nxCG@25: Top-1 | nxCG@25: CoXML
    256     | 0.4293         | 0.4248         | 0.4733         | 0.5555
    264     | 0.0            | 0.0069         | 0.0            | 0.0033
    275     | 0.7715         | 0.638          | 0.589          | 0.5922
    284     | 0.0            | 0.1259         | 0.0            | 0.1233
    Average | 0.3002 (+0.4%) | 0.2989         | 0.2656         | 0.3186 (+20%)

- The systematic relaxation approach enables our system to derive more approximate answers! Our ranking function, based on both content and structure relevancy, outperforms ranking functions that use content similarity only.

CoXML Testbed
- [Figure: testbed architecture. An RLXQuery passes through the RLXQuery Parser and RLXQuery Preprocessor to the Relaxation Controller; the Relaxation Manager and Database Manager interact with the XTAH (built by the Relaxation Index Builder) and the XML database engine over the XML documents; the Ranking Module returns the approximate answers.]
- Team members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian.

Relaxation Examples Using the Testbed
- [Screenshots of relaxation examples run on the testbed.]

Related Work: Query Relaxation
- Relaxation based on schema conversions ([LC01, LMC01], [LMC03]): no structure relaxation.
- Native XML relaxation:
  - Proposals of structure relaxation types (e.g., [KS01, ACS02]); we use the relaxation types introduced in [ACS02].
  - Efficient algorithms for deriving top-k answers based on the supported relaxation types (e.g., [Sch02, ACS02, ALP04, AKM05]); no relaxation control.

Related Work: XML Ranking
- Content ranking: most approaches extend text-retrieval ranking models to the XML setting, e.g., HyREX, XXL, JuruXML, XSEarch. We utilize structure to weight terms differently depending on where they occur in a document.
- Structure ranking: based on tree editing distance without considering operation cost [NJ02], or on the occurrence frequency of the query trees, paths, or predicates in the data [MAK05, AKM05]. Our structure ranking is similar to editing distance, but we take operation cost into account.

Conclusion
- Cooperative XML (CoXML) query answering:
  - RLXQuery enables users to express approximate query conditions effectively and to control the approximate matching process.
  - XTAHs provide systematic query relaxation guidance.
  - Content and structure similarity metrics together evaluate the relevancy of approximate answers.
  - Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology.
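As a worked illustration of the ranking model: relaxation operation costs come from node similarity, the structure distance sums them, and the overall relevancy discounts the content similarity exponentially. This is a minimal sketch assuming the form sim(A, Q) = alpha^struct_dist(A, Q) * cont_sim(A, Q) reconstructed from the slides; the similarity values and the choice alpha = 0.3 are illustrative, not from the evaluation.

```python
def relaxation_cost(node_similarity):
    """cost(r) = 1 - similarity($u, $v): closer nodes cost less."""
    return 1.0 - node_similarity

def struct_dist(costs):
    """struct_dist(A, Q) = total cost of the relaxation operations."""
    return sum(costs)

def overall_sim(cont_sim, s_dist, alpha=0.3):
    """sim(A, Q) = alpha ** struct_dist(A, Q) * cont_sim(A, Q)."""
    return (alpha ** s_dist) * cont_sim

# An exact answer (no relaxation) keeps its full content similarity:
assert overall_sim(cont_sim=5.0, s_dist=0.0) == 5.0

# Assumed node similarities: relabeling article -> document (0.8)
# and deleting section under body (0.6).
costs = [relaxation_cost(0.8), relaxation_cost(0.6)]
d = struct_dist(costs)           # 0.2 + 0.4 = 0.6
print(overall_sim(5.0, d))       # strictly below 5.0
```

Note how the three stated properties fall out of this form: alpha^0 = 1 recovers pure content similarity, the score grows with cont_sim, and since 0 < alpha < 1 it shrinks as the structure distance grows.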