Structure and Content Scoring for XML

Amélie Marian (Rutgers University)
Joint work with:
Sihem Amer-Yahia (AT&T Research Labs)
Nick Koudas (University of Toronto)
Divesh Srivastava (AT&T Research Labs)
David Toman (University of Waterloo)
Motivations: XML Data Heterogeneity
[Figure: heterogeneous XML data about books. Three book fragments place author (Dickens), title (Great Expectations), and edition (paperback) elements at different positions under book and info.]
Query:
book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
The query root node (book) is the distinguished node.
6/30/2016
Amélie Marian - Rutgers University
[Figure: the query drawn as a tree pattern. book has children info and edition (paperback); info has children author (Dickens) and title (Great Expectations).]
XML Query Relaxation
[Amer-Yahia et al. EDBT’02]
Tree pattern relaxations:
  Leaf node deletion
  Edge generalization
  Subtree promotion
[Figure: the book query, relaxed versions of it (including one where edition is marked optional, "edition?"), and the book data fragments they are matched against.]
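The three relaxations listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the algorithm from the cited work: the `Node` class, the `with_child` helper, and the `relaxations` function are hypothetical names, the root (distinguished node) is never relaxed, and each relaxation is applied regardless of the axis of the edge involved.

```python
from dataclasses import dataclass, replace
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Node:
    label: str
    axis: str = "/"  # edge from the parent: "/" (child) or "//" (descendant)
    children: Tuple["Node", ...] = ()


def with_child(node: Node, i: int, *new: Node) -> Node:
    """Copy of `node` whose i-th child is replaced by `new` (possibly nothing)."""
    return replace(node, children=node.children[:i] + new + node.children[i + 1:])


def relaxations(node: Node) -> Iterator[Node]:
    """Yield every pattern reachable from `node` in one relaxation step."""
    for i, child in enumerate(node.children):
        if child.axis == "/":
            # Edge generalization: parent/child becomes ancestor//descendant.
            yield with_child(node, i, replace(child, axis="//"))
        if not child.children:
            # Leaf node deletion: drop a leaf of the pattern.
            yield with_child(node, i)
        for j, grandchild in enumerate(child.children):
            # Subtree promotion: re-attach a grandchild subtree to `node`
            # through a descendant edge.
            yield with_child(node, i, with_child(child, j),
                             replace(grandchild, axis="//"))
        # Apply the same relaxations deeper in the pattern.
        for relaxed_child in relaxations(child):
            yield with_child(node, i, relaxed_child)


# The book query from the earlier slide (predicate values omitted).
query = Node("book", children=(
    Node("info", children=(Node("title"), Node("author"))),
    Node("edition"),
))
one_step = list(relaxations(query))
```

Under these assumptions, the book query has nine distinct one-step relaxations, one of which is the query without its edition leaf.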
Motivations
Top-k query processing is suitable for relaxed XML queries over heterogeneous collections
  Return the k XML nodes that are closest to the query structure
  Opportunity for more efficient query processing
  Need a scoring mechanism to identify the best k answers
Contributions
Scoring mechanism for XML queries
Data structures for top-k query processing
Experimental evaluation
Scoring Functions Critical for Top-k Query Processing
Top-k answer quality depends on the scoring function
Efficient top-k query processing requires a scoring function that is:
  Monotonic
  Fast to compute
Little attention has been given to scoring functions for structured and semi-structured data
Scoring functions have been studied extensively over text data (e.g., tf.idf)
=> Proposed scoring function for XML data, inspired by tf.idf
Adaptation of tf.idf to XML Queries
Information Retrieval -> XML
Document collection -> XML document
Document -> XML node (the result is a subtree rooted at a distinguished node, i.e., a node with a given label and structural properties)
Keyword(s) -> Query pattern
idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) -> idf is a function of the fraction of distinguished nodes that match the query pattern
tf (term frequency) is a function of the number of occurrences of the keyword in the document -> tf is a function of the number of ways the query pattern matches the distinguished node
Scoring Function for XML Approximate Matches
Required properties:
  Exact matches should be scored higher than relaxed matches (idf)
  Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
  tf.idf, as used by IR, violates the above properties
  Ranking based on idf, then breaking ties using tf, satisfies the properties
[Figure: two example matches of the book query, (a) and (b), used to compare score(a) with score(b).]
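The difference between the two ways of combining tf and idf can be seen on a small example. The sketch below is illustrative only: the node names and the (idf, tf) values are made up, `rank_idf_then_tf` and `rank_product` are hypothetical helper names, and the plain product tf * idf is assumed as the IR-style combination.

```python
def rank_idf_then_tf(scores):
    """Order answers by idf first, using tf only to break idf ties."""
    # Python compares tuples lexicographically, so (idf, tf) sorts as required.
    return sorted(scores, key=lambda node: scores[node], reverse=True)


def rank_product(scores):
    """Order answers by the product tf * idf (assumed IR-style combination)."""
    return sorted(scores, key=lambda node: scores[node][0] * scores[node][1],
                  reverse=True)


# Hypothetical answers: node -> (idf of the best pattern it matches, tf).
scores = {
    "exact_1": (1.228, 1),   # exact match, matched one way
    "exact_2": (1.228, 2),   # exact match, matched two ways
    "relaxed": (1.049, 3),   # relaxed match only, matched three ways
}

lexicographic = rank_idf_then_tf(scores)
product = rank_product(scores)
```

With these values the lexicographic ranking keeps both exact matches ahead of the relaxed one and uses tf only to order `exact_2` before `exact_1`, whereas the product ranking puts the relaxed match first because its higher tf outweighs its lower idf.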
A Family of Scoring Methods for XML Path Queries
Twig predicate
  High quality
  Expensive computation
Path predicates
Binary predicates
  Low quality
  Fast computation
[Figure: the book query (book with children info and edition (paperback); info with children author (Dickens) and title (Great Expectations)) scored as one twig, as three patterns rooted at book (path predicates), or as four patterns rooted at book (binary predicates).]
Matrix Representation of Twigs
Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query.
Query (a with children b and d; c below b; e below d):
      a    b    c    d    e
a     =    /    //   /    //
b          =    /    X    X
c               =    X    X
d                    =    /
e                         =
(= same node, / child, // descendant, X no relationship)
[Figure: matrix for a partial tuple over a1, b1, c1, d1, e1, shown in three states: not joined with e yet (entries for e are "?"), no matches for e (entries for e are "X"), and e1 matches e.]
Matrix subsumption is used to compare tuples and queries
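As an illustration of how such a query matrix could be derived from a tree pattern, here is a minimal Python sketch. The function name `twig_matrix` and the derivation rule are assumptions made for this example: a direct edge keeps its axis, a relationship implied through two or more edges is recorded as "//", and unrelated pairs get "X". Nodes are assumed to be listed so that ancestors come before descendants.

```python
from itertools import product


def twig_matrix(nodes, edges):
    """Upper-triangular relationship matrix of a tree pattern.

    nodes: labels ordered so that every ancestor precedes its descendants.
    edges: {(parent, child): "/" or "//"} for the direct pattern edges.
    """
    rel = {(n, n): "=" for n in nodes}
    rel.update(edges)
    # Transitive closure: a pair connected through several edges is "//".
    changed = True
    while changed:
        changed = False
        for x, y, z in product(nodes, repeat=3):
            if (x != y and y != z and (x, y) in rel and (y, z) in rel
                    and (x, z) not in rel):
                rel[(x, z)] = "//"
                changed = True
    # Pairs with no ancestor/descendant relationship are marked "X".
    return {(x, y): rel.get((x, y), "X")
            for i, x in enumerate(nodes) for y in nodes[i:]}


# The example query: a with children b and d, c below b, e below d.
nodes = ["a", "b", "c", "d", "e"]
edges = {("a", "b"): "/", ("b", "c"): "/", ("a", "d"): "/", ("d", "e"): "/"}
matrix = twig_matrix(nodes, edges)
rows = {x: [matrix[(x, y)] for y in nodes[i:]] for i, x in enumerate(nodes)}
```

For this query, `rows["a"]` is `["=", "/", "//", "/", "//"]` and `rows["b"]` is `["=", "/", "X", "X"]`, matching the first two rows of the query matrix.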
Representing Relaxed Query Patterns: DAG Structure
Each child is more relaxed (has more matches) than its parent
The idf of a child is no higher than the idf of its parent
idf scores are accessible in constant time for any match (complete or partial) using a hash function
Exhaustive algorithm to build the DAG
[Figure: DAG of relaxations of a query over nodes a, b, and c, from the original query down to the single node a.]
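An exhaustive construction of such a DAG can be sketched as a breadth-first exploration of one-step relaxations. The code below is a simplified illustration, not the algorithm from the talk: patterns are represented as frozensets of hypothetical (parent, axis, child) edges, only edge generalization and leaf deletion are applied (subtree promotion is left out), and the empty set stands for the pattern reduced to its root.

```python
from collections import deque


def one_step(pattern):
    """Yield the patterns obtained from `pattern` by one relaxation."""
    parents = {parent for parent, _, _ in pattern}
    for edge in pattern:
        parent, axis, child = edge
        if axis == "/":
            # Edge generalization: "/" becomes "//".
            yield (pattern - {edge}) | {(parent, "//", child)}
        if child not in parents:
            # Leaf deletion: remove the edge leading to a leaf.
            yield pattern - {edge}


def build_dag(query):
    """Map every reachable pattern to the set of its one-step relaxations."""
    dag = {}
    frontier = deque([query])
    while frontier:
        pattern = frontier.popleft()
        if pattern in dag:
            continue  # already expanded; patterns are shared, hence a DAG
        children = set(one_step(pattern))
        dag[pattern] = children
        frontier.extend(children)
    return dag


# Path query a/b/c.
query = frozenset({("a", "/", "b"), ("b", "/", "c")})
dag = build_dag(query)
sinks = [pattern for pattern, children in dag.items() if not children]
```

Because patterns are hashable, the dictionary doubles as the hash-based index: looking up any pattern, complete or partial, takes constant time on average. For the path query a/b/c this simplified version produces seven patterns, with the root-only pattern as the single sink.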
Information stored in the DAG
idf score information:
  idf=(1+|a|)/(1+|ap|), where |a| is the number of a nodes and |ap| is the number of a nodes that satisfy the query predicate
For query processing:
  Best possible score from here
  Best possible score after each remaining join operation
  Number of matches (useful for tf)
[Figure: the relaxation DAG over nodes a, b, and c, annotated with idf scores between 1.228 and 1; the single-node pattern a has score 1.]
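The idf formula and the constant-time lookup can be illustrated with a short Python sketch. The pattern strings and all counts below are invented for the example, and `idf` and `idf_table` are hypothetical names; a real system would key the table on the DAG nodes rather than on strings.

```python
def idf(total_nodes, matching_nodes):
    """idf = (1 + |a|) / (1 + |ap|), as given on the slide."""
    return (1 + total_nodes) / (1 + matching_nodes)


# Hypothetical collection with 99 `a` nodes; number of `a` nodes matching
# each pattern, listed from the original query to the most relaxed one.
total_a_nodes = 99
match_counts = {
    "a[./b/c]": 39,
    "a[./b//c]": 49,
    "a[.//b]": 79,
    "a": 99,
}

# Precomputed once; a dict gives average constant-time access by pattern.
idf_table = {pattern: idf(total_a_nodes, count)
             for pattern, count in match_counts.items()}
```

With these counts the scores are 2.5, 2.0, 1.25, and 1.0: the more relaxed a pattern is, the more nodes match it and the lower its idf, and the pattern matched by every `a` node scores exactly 1.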
Query Processing using the DAG
Benefits:
  Score computation is done in a preprocessing phase (using exact or approximate information)
  Score access during query processing is done in constant time
  Additional information needed for query processing is precomputed and accessed in constant time (e.g., score upper bound)
  tf is estimated at runtime based on available information
Quality/Space/Time tradeoff
Binary Predicates
  Smaller DAG (O(4q))
  Faster pre-processing (and processing)
  Lower Quality (fewer possible scores)
Path Predicates and Twig
  DAG is O(4q^2/2) in space (still reasonable in practice)
  More pre-processing
  Higher Quality (more differences between scores)
Experimental Setup
Data:
  Synthetic heterogeneous document collections generated with ToXgene
  Real dataset: Wall Street Journal Treebank corpora
  Pregenerated queries exhibiting different sizes, query structures and predicates
Measures:
  DAG size
  DAG preprocessing time
  Query processing time
  Precision (percentage of top-k answers that are actual top-k answers, as given by Twig)
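The precision measure can be written down directly. The sketch below assumes each method returns a ranked list of node identifiers and that the Twig ranking serves as the reference; `precision_at_k` and the node names are hypothetical.

```python
def precision_at_k(ranked, reference, k):
    """Fraction of the top-k answers that are also in the reference top-k."""
    top = set(ranked[:k])
    reference_top = set(reference[:k])
    return len(top & reference_top) / k


# Hypothetical rankings: `twig` is the reference, `binary` a cheaper method.
twig = ["n1", "n3", "n5", "n2"]
binary = ["n1", "n2", "n3", "n4"]
p = precision_at_k(binary, twig, k=3)
```

Here two of the three answers returned by the cheaper method (n1 and n3) belong to the reference top-3, so the precision is 2/3.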
XML Scoring Precision
[Figure: precision (0 to 1) of the Twig, Path-Independent, and Binary-Independent scoring methods for queries q0 to q17.]
XML Scoring Preprocessing Time
[Figure: DAG preprocessing time in seconds (log scale, 0.001 to 100000) of the Twig, Path-Independent, and Binary-Independent scoring methods for queries q0 to q17.]
XML Scoring: Real data
[Figure: precision (0 to 1) of the Twig, Path-Independent, and Binary-Independent scoring methods for queries TB0 to TB5, in two groups labeled O(1000) and O(10000).]
Conclusions
Scoring method for XML queries
  Inspired by tf.idf
  Accounts for structure and content
  Accounts for structural relaxations
Efficient data structures to compute and access scores during top-k query processing
  DAG
  Matrix representation of queries and tuples
Evaluation of the tradeoffs between scoring methods
  Answer quality vs. preprocessing time
Related Work
IR Scoring
  Content only
XML Scoring
  Content with structure
    XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  None of these techniques accounts for structural relaxations (with the exception of our previous work [ICDE’05])
XML Structural Relaxation
  FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]
Future Work
Streaming scenarios
  Incremental updates on DAG
  Approximate scoring
Integration with approximate text scoring
  Extend proposed XML scoring function to handle text content approximation (e.g., misspellings)
  Unify structure and content score
Quality evaluation (INEX)