Structure and Content Scoring for XML

Amélie Marian (Rutgers University)

Joint work with:
- Sihem Amer-Yahia (AT&T Research Labs)
- Nick Koudas (University of Toronto)
- Divesh Srivastava (AT&T Research Labs)
- David Toman (University of Waterloo)

Motivations: XML Data Heterogeneity

[Figure: heterogeneous XML data about books. Several book trees arrange the same info, author (Dickens), title (Great Expectations), and edition (paperback) nodes in different structures.]

Query: book[./info[./title=“Great Expectations” and ./author=“Dickens”] and ./edition=“paperback”]

[Figure: the query as a tree pattern. book has children info and edition (paperback); info has children author (Dickens) and title (Great Expectations).]

Query root node: Distinguished node

6/30/2016 Amélie Marian - Rutgers University

XML Query Relaxation [Amer-Yahia et al. EDBT’02]

Tree pattern relaxations:
- Leaf node deletion
- Edge generalization
- Subtree promotion

[Figure: the book query shown next to data trees and relaxed versions of the query.]

Motivations

- Top-k query processing is suitable for relaxed XML queries over heterogeneous collections
  - Return the k XML nodes that are closest to the query structure
  - Opportunity for more efficient query processing
- Need a scoring mechanism to identify the best k answers

Contributions

- Scoring mechanism for XML queries
- Data structures for top-k query processing
- Experimental evaluation

Scoring Functions Critical for Top-k Query Processing

- Top-k answer quality depends on the scoring function
- Efficient top-k query processing requires a scoring function that is:
  - Monotonic
  - Fast to compute
- Little attention given to scoring functions for structured and semi-structured data
  - Extensively studied over text data (e.g., tf.idf)
- Proposed scoring function inspired by tf.idf for XML data

Adaptation of tf.idf to XML Queries

Information Retrieval -> XML

- Document collection -> XML document
- Document -> XML node (the result is a subtree rooted at a distinguished node, i.e., a node with a given label and structural properties)
- Keyword(s) -> Query pattern
- idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) -> idf is a function of the fraction of distinguished nodes that match the query pattern
- tf (term frequency) is a function of the number of occurrences of the keyword in the document -> tf is a function of the number of ways the query pattern matches the distinguished node

Scoring Function for XML Approximate Matches

Required properties:
- Exact matches should be scored higher than relaxed matches (idf)
- Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)

How to combine tf and idf?
- tf.idf, as used by IR, violates the above properties
- Ranking based on idf, then breaking ties using tf, satisfies the properties

[Figure: two book matches, (a) and (b), used to compare score(a) with score(b).]

A Family of Scoring Methods for XML

- Twig: the query is scored as a single twig predicate. High quality, expensive computation.
- Path: the query is decomposed into path predicates.
- Binary: the query is decomposed into binary predicates. Low quality, fast computation.

[Figure: the book query shown whole (twig), as a sum of root-to-leaf paths (path predicates), and as a sum of node pairs (binary predicates).]

Matrix Representation of Twigs

Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query. In the matrices below, = marks a node against itself, / a child relationship, // a descendant relationship, X no relationship, and ? a relationship that is not known yet.

Query (a has children b and d; b has child c; d has child e):

      a    b    c    d    e
 a    =    /    //   /    //
 b         =    /    X    X
 c              =    X    X
 d                   =    /
 e                        =

Partial tuple (a1, b1, c1, d1), not joined with e yet:

      a    b    c    d    e
 a    =    //   //   /    ?
 b         =    X    X    ?
 c              =    X    ?
 d                   =    ?
 e                        ?

Once the tuple is joined with a match e1, the e column becomes //, X, X, /, = (top to bottom). If there are no matches for e, every entry in the e column becomes X.
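
The query matrix for the twig over a, b, c, d, e can be derived mechanically from its edges. The sketch below is an illustration under stated assumptions, not the authors' data structure: the function name `twig_matrix`, the dictionary layout, and the restriction to child (/) edges in the input are all choices made for the example. Nodes are assumed to be listed so that ancestors come before their descendants.

```python
from itertools import combinations


def twig_matrix(nodes, child_edges):
    """Build the upper-triangular relationship matrix of a twig.

    nodes: node labels, ancestors listed before descendants.
    child_edges: (parent, child) pairs, all read as "/" edges.
    Returns a dict mapping (x, y) to "=", "/", "//" or "X".
    """
    parent = {child: par for par, child in child_edges}

    def is_ancestor(anc, node):
        # Walk up from node to the root looking for anc.
        while node in parent:
            node = parent[node]
            if node == anc:
                return True
        return False

    matrix = {(n, n): "=" for n in nodes}
    for x, y in combinations(nodes, 2):
        if parent.get(y) == x:
            matrix[(x, y)] = "/"      # y is a child of x
        elif is_ancestor(x, y):
            matrix[(x, y)] = "//"     # y is a deeper descendant of x
        else:
            matrix[(x, y)] = "X"      # no ancestor relationship
    return matrix


NODES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e")]
query_matrix = twig_matrix(NODES, EDGES)

# Print one row per node, upper triangle only.
for i, x in enumerate(NODES):
    print(x, " ".join(query_matrix[(x, y)] for y in NODES[i:]))
```

Storing only the upper triangle is enough here because each pair of nodes has a single relationship once ancestors are listed first.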

Matrix subsumption is used to compare tuples and queries.

Representing Relaxed Query Patterns: DAG Structure

- Each child is more relaxed (has more matches) than its parent
- The idf of a child is no higher than the idf of its parent
- idf scores are accessible in constant time for any match (complete or partial) using a hash function
- Exhaustive algorithm to build the DAG

[Figure: DAG of relaxations of a query over nodes a, b, and c, ending at the most relaxed query, the single node a.]

Information stored in the DAG

- idf score information: idf = (1+|a|)/(1+|ap|), where |a| is the number of a nodes and |ap| is the number of a nodes that satisfy the query predicate
- For query processing:
  - Best possible score from this node
  - Best possible score after each remaining join operation
  - Number of matches (useful for tf)

[Figure: the relaxation DAG annotated with idf scores: 1.228, 1.2, 1.195, 1.167, 1.156, 1.049, and 1 for the most relaxed query a.]

Query Processing using the DAG

Benefits:
- Score computation is done in a preprocessing phase (using exact or approximate information)
- Score access during query processing is done in constant time
- Additional information needed for query processing is precomputed and accessed in constant time (e.g., score upper bound)
- tf is estimated at runtime based on available information

Quality/Space/Time tradeoff

Binary Predicates:
- Smaller DAG (O(4q))
- Faster pre-processing (and processing)
- Lower quality (fewer possible scores)

Path Predicates and Twig:
- DAG is O(4q^2/2) in space (still reasonable in practice)
- More pre-processing
- Higher quality (more differences between scores)

Experimental Setup

Data:
- Synthetic heterogeneous document collections generated with ToXgene
- Real dataset: Wall Street Journal Treebank corpora
- Pregenerated queries exhibiting different sizes, query structures, and predicates

Measures:
- DAG size
- DAG preprocessing time
- Query processing time
- Precision (percentage of returned top-k answers that are actual top-k answers, as given by Twig)

XML Scoring Precision

[Figure: precision (0 to 1) of Twig, Path-Independent, and Binary-Independent scoring for queries q0 to q17.]

XML Scoring Preprocessing Time

[Figure: DAG preprocessing time in seconds (log scale, 0.001 to 100000) of Twig, Path-Independent, and Binary-Independent scoring for queries q0 to q17.]

XML Scoring Real data

[Figure: precision (0 to 1) of Twig, Path-Independent, and Binary-Independent scoring for Treebank queries TB0 to TB5, in two groups labeled O(1000) and O(10000).]

Conclusions

- Scoring method for XML queries
  - Inspired by tf.idf
  - Accounts for structure and content
  - Accounts for structural relaxations
- Efficient data structures to compute and access scores during top-k query processing
  - DAG
  - Matrix representation of queries and tuples
- Evaluation of the scoring methods' tradeoffs
  - Answer quality vs. preprocessing time

Related Work

- IR Scoring
  - Content only
- XML Scoring
  - Content with structure: XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  - None of these techniques accounts for structural relaxations (with the exception of our previous work [ICDE’05])
- XML Structural Relaxation
  - FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

Future Work

- Streaming scenarios
  - Incremental updates on the DAG
  - Approximate scoring
- Integration with approximate text scoring
  - Extend the proposed XML scoring function to handle text content approximation (e.g., misspellings)
  - Unify structure and content scores
- Quality evaluation (INEX)
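
To make the scoring scheme from the earlier slides concrete, the sketch below computes idf with the formula idf = (1+|a|)/(1+|ap|) and ranks candidates by idf first, using tf only to break ties. The node identifiers, the counts, and the function names are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def idf(total_nodes, matching_nodes):
    """idf = (1 + |a|) / (1 + |ap|), as given on the DAG slide."""
    return (1 + total_nodes) / (1 + matching_nodes)


def rank(candidates, k):
    """Rank (node_id, idf, tf) triples by idf, breaking ties with tf."""
    ordered = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    return ordered[:k]


# Hypothetical collection: 1000 distinguished nodes in total,
# 100 match the exact query and 500 match a relaxed version.
exact_idf = idf(1000, 100)
relaxed_idf = idf(1000, 500)

candidates = [
    ("n1", exact_idf, 1),    # exact match, one way to match
    ("n2", relaxed_idf, 5),  # relaxed match, five ways to match
    ("n3", exact_idf, 3),    # exact match, three ways to match
]

top = rank(candidates, 2)
print([node_id for node_id, _, _ in top])

# With these numbers the plain product tf * idf would put the relaxed
# match n2 above the exact match n1, which the required properties forbid.
product_prefers_relaxed = relaxed_idf * 5 > exact_idf * 1
```

The lexicographic ordering keeps every exact match ahead of every relaxed one regardless of tf, which is the property the product form fails to guarantee.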