Cooperative XML (CoXML)
Query Answering
Motivation

XML has become the standard format for information representation and data exchange
An explosive increase in the amount of XML data available on the web, e.g.,
  Bills at the Library of Congress
  IEEE Computer Society’s publications
  SwissProt – protein sequence databases
  XMark – online auction data
  …
Effective XML search methods are needed!
Challenges

XML schema is usually very complex
  E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths
It is often unrealistic for users to fully understand a schema before asking queries
Exact query answering is inadequate and approximate query answering is more appropriate!
Approach: CoXML
Derive approximate answers by relaxing query conditions, i.e., query relaxation
(figure: a Query is submitted to the Cooperative XML Query Answering layer, which sits on an XML Database Engine over XML Documents and returns Approximate Answers)
Roadmap





Introduction
Background
CoXML
Related Work
Conclusion
XML Data Model

XML data is often modeled as an ordered labeled tree
  Tree nodes: elements
  Tree edges: element-nesting relationships
Example data tree (node number, element label, content in quotes):
  1 article
    2 title: “Search engine spam detection”
    3 author
      4 name: “XYZ”
      5 title: “IEEE Fellow”
    6 year: “2003”
    7 body
      8 section: “..a spam detection technique by content analysis…”
XML Query Model

XML queries are often modeled as trees
  Structure conditions: a set of query nodes connected by
    Parent-to-child (‘/’): directly connected
    Ancestor-to-descendant (‘//’): connected (either directly or indirectly)
  Content conditions:
    Either value predicates or keyword constraints on query nodes
Example query tree: article with query nodes title (“search engine”), year (2003), and section (“spam detection”)
XML Query Answer


An answer for a query is a set of nodes in a data tree that satisfies both structure and content conditions
(figure: the example query tree (article; title “search engine”; year 2003; section “spam detection”) shown next to the example data tree, nodes 1 to 8, from the XML Data Model slide)
XML Query Relaxation Types

Value relaxation: enlarging a value condition’s search scope
  Example: the condition year = 2003 is relaxed to the range 2000 to 2005
Node relabel: changing the label of a node to a similar or more general label using domain knowledge
  Example: the node article is relabeled to document
[1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)
XML Query Relaxation Types

Edge generalization: relaxing a ‘/’ edge to a ‘//’ edge
  (figure: the example query tree before and after generalizing one of its edges)
Node deletion: dropping a node from a query tree
  Example: the title node is dropped, so its keywords “search engine” attach to article
XML Relaxation Properties

Definition
  Relaxation operation: an application of a relaxation type to a specific query node or edge
Lemma
  Given a query tree with n applicable relaxation operations, there are potentially up to $2^n$ relaxed trees
  Possible combinations: $\binom{n}{1} + \dots + \binom{n}{n} = 2^n - 1$
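The count in the lemma can be checked with a short enumeration. The sketch below is only an illustration: the operation names are borrowed from the XTAH notation used later in the deck, and it treats every non-empty subset of operations as a candidate relaxed tree, whereas a real system may reject some combinations.

```python
from itertools import combinations

def relaxation_combinations(operations):
    """Yield every non-empty subset of the given relaxation operations."""
    for size in range(1, len(operations) + 1):
        for subset in combinations(operations, size):
            yield frozenset(subset)

# Three illustrative operations on one query tree (names are assumptions).
ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
relaxed = list(relaxation_combinations(ops))
print(len(relaxed))  # 2**3 - 1 = 7
```

The exponential growth is what makes exhaustive generation impractical for larger query trees.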
Challenges

Query relaxation is often user-specific
  Different users may have different approximate matching specifications for a given query tree
  How to provide user-specific approximate query answering?
A query with n relaxation operations has potentially up to $2^n$ relaxed queries
  How to systematically relax a query?
Query relaxation generates a set of approximate answers
  How to effectively rank the returned approximate answers?
CoXML System Overview
(figure: CoXML system architecture. Components: Relaxation Engine, Ranking Module, Relaxation Index Builder, and relaxation indexes (XTAH), layered over an XML Database Engine and XML Documents. Inputs: relaxation language (RLXQuery) and similarity metrics. Arrow labels: query, relaxed query, results, exact answers, ranked results.)
Roadmap
Introduction
Background
CoXML
  Relaxation Language
  Relaxation Indexes
  Ranking
  Evaluation
  Testbed
Related Work
Conclusion
Relaxation Language

Motivation


Enabling users to specify approximate conditions in queries and
to control the approximate matching process
RLXQuery - relaxation-enabled XQuery

Extends the standard XML query language (XQuery) with
relaxation constructs & controls, such as






~ : approximate conditions
! : non-relaxable conditions
REJECT : unacceptable relaxations
AT-LEAST : minimum # of answers to be returned
RELAX-ORDER : relaxation orders among multiple conditions
USE: allowable relaxation types
RLXQuery Example
FOR $a in doc(“bib.xml”)//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
      ~($a[about(./!title, “search engine”)]/body/section)[about(., “spam detection”)] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
(figure: the query tree for this RLXQuery. Condition t1 is year = 2003 under article; condition t2 is the structure article with !title (“search engine”) and body/section (“spam detection”))
Relaxation Index

Naïve approach
  Generate all possible relaxed queries & iteratively select the best relaxed query to derive approximate answers
  Exhaustive, but not scalable
Observation
  Many queries share the same (or similar) tree structures
Our approach: relaxation index
  Consider the structure of a query tree T as a template
  Build indexes on the relaxed trees of T
  Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T
Relaxation Index - XTAH

XTAH
  A hierarchical multi-level labeled cluster of relaxed trees
Building an XTAH
  Given a query structure template T, generate all possible relaxed trees
  Each relaxed tree uses a unique set of relaxation operations
  Cluster relaxed trees into groups based on relaxation operations and distances, similar to “suffix-tree” clustering
XTAH Example
Template structure T: article $1 with children title $2 and body $3; section $4 is a child of body $3
(figure: a sample XTAH for the template structure T. The root, relax, has child groups edge_generalization, node_relabel, and node_deletion. Groups are labeled with operation sets such as {gen(e$1,$2)}, {gen(e$3,$4)}, {gen(e$1,$2), gen(e$3,$4)}, {gen(e$3,$4), gen(e$1,$3)}, {del($2)}, and {del($2), del($3)}; the leaves are relaxed trees T1, T2, T3, T4, T6, T7, …)
gen(e$u, $v) – relaxing the edge between $u and $v
del($u) – deleting the node $u
XTAH Properties

Each group consists of a set of relaxed trees obtained by using similar relaxation operations
  Efficient location of relaxed trees based on relaxation operations
The higher the level of a group, the less relaxed the trees in the group
  Relaxing queries at different granularities by traversing up and down the XTAH
XTAH-Guided Query Relaxation

Problem


Given a query with relaxation specifications (constructs and
controls), how to search an XTAH for relaxed queries that
satisfy the specification?
Approach

First, prune XTAH groups containing trees that use
unacceptable relaxations as specified in the query


This step can be efficiently achieved by utilizing internal node labels
Then, iteratively search the XTAH for the best relaxed query
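The pruning step can be sketched as a recursive walk that drops a whole group as soon as its label names a relaxation type the query does not allow. This is a minimal illustration, not the CoXML implementation: the `Group` class, the `OP_TYPE` prefix mapping, and the tiny hierarchy (including which operation set belongs to which tree) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    label: str                       # internal label, e.g. a relaxation type
    ops: frozenset = frozenset()     # operation set of a leaf-level relaxed tree
    children: list = field(default_factory=list)

# Hypothetical mapping from an operation prefix to its relaxation type.
OP_TYPE = {"gen": "edge_generalization", "del": "node_deletion", "rel": "node_relabel"}

def op_type(op):
    return OP_TYPE[op.split("(", 1)[0]]

def prune(group, allowed):
    """Return a copy of the hierarchy without groups that use disallowed types."""
    if group.label in OP_TYPE.values() and group.label not in allowed:
        return None                  # the label alone rejects the whole subtree
    if any(op_type(op) not in allowed for op in group.ops):
        return None
    kept = [c for c in (prune(ch, allowed) for ch in group.children) if c is not None]
    return Group(group.label, group.ops, kept)

xtah = Group("relax", children=[
    Group("edge_generalization", children=[
        Group("T1", frozenset({"gen(e$1,$2)"})),
        Group("T2", frozenset({"gen(e$3,$4)"})),
    ]),
    Group("node_relabel"),
    Group("node_deletion", children=[Group("T6", frozenset({"del($2)"}))]),
])

pruned = prune(xtah, allowed={"edge_generalization", "node_deletion"})
print([child.label for child in pruned.children])
# ['edge_generalization', 'node_deletion']
```

Checking the internal label first is what lets a disallowed subtree be skipped without visiting its leaves.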
Query Relaxation Process Example
(figure: the sample RLXQuery tree from the RLXQuery Example slide with relaxation controls USE (edge generalization, node deletion) and AT-LEAST 20; its template structure T (article $1; title $2; body $3; section $4); and the sample XTAH for the template structure T)
XTAH-Guided Query Relaxation

Problem
  Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?
Approach: M-tree
  Assign representatives to internal groups
  Representatives summarize distance properties of the trees within groups
  Use representatives to guide the search path to the best relaxation candidate
(figure: XTAH groups R0, R1, R2, R3, R5, R8, R11 with leaf-level relaxed trees)
[2] M-tree: An efficient access method for similarity search in metric spaces (P. Ciaccia et al., VLDB 97)
Ranking

Ranking criteria
  Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes
Approach
  Content similarity – extended vector space model
  Structure similarity – tree editing distance with a model for assigning operation cost
  Overall relevancy – a ranking model combining both content and structure similarities
Content Similarity
Traditional IR ranking: Vector Space Model
  Content similarity between a query and a document
  Term Frequency, Inverse Document Frequency
XML content ranking: Extended Vector Space Model
  Content similarity between a query and an answer (i.e., a set of data nodes)
  Weighted Term Frequency, Inverse Element Frequency
Weighted Term Frequency


Terms under different paths of a node are weighted differently
Example
  Data: section (node 5) with children
    title (node 6): “Spam Detection By Content Analysis”
    paragraph (node 8): “…an approach to detect spam by …”
    reference (node 12): “Spam detection taxonomy”
  Query: section about “spam detection”
The weighted term frequency for a term t in a node v is:
$tf_w(v, t) = \sum_{i=1}^{m} w(p_i) \cdot tf(p_i, t)$
pi: a path under the node v to a term t
w(pi): the weight assigned to the path pi
m: # of different paths under the node v that contain the term t
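To make the weighted term frequency concrete, here is a small sketch applied to the section node from the example. The path weights are invented for illustration (the slides do not give values), and the term counts cover only the words visible in the example, since the paragraph text is elided.

```python
def weighted_tf(path_term_counts, path_weights, term):
    """tf_w(v, t): sum of w(p_i) * tf(p_i, t) over the paths under v that contain t."""
    total = 0.0
    for path, counts in path_term_counts.items():
        tf = counts.get(term, 0)
        if tf:
            total += path_weights[path] * tf
    return total

# Section node 5: visible term counts per child path (lower-cased tokens).
section_5 = {
    "title": {"spam": 1, "detection": 1, "content": 1, "analysis": 1},
    "paragraph": {"approach": 1, "detect": 1, "spam": 1},
    "reference": {"spam": 1, "detection": 1, "taxonomy": 1},
}
weights = {"title": 1.0, "paragraph": 0.5, "reference": 0.25}  # assumed values

print(weighted_tf(section_5, weights, "spam"))       # 1.0 + 0.5 + 0.25 = 1.75
print(weighted_tf(section_5, weights, "detection"))  # 1.0 + 0.25 = 1.25
```

With these weights a match in the title counts four times as much as one in a reference, which is the kind of distinction the slide motivates.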
Inverse Element Frequency



The more XML elements contain a term, the less disambiguating power the term has
  E.g., the term “spam” is less disambiguating than the term “detection”
The inverse element frequency for a query term t is
$ief(\$u, t) = \log \frac{N_1}{N_2}$
$u: a query node whose content condition contains the term t
N1: # of data nodes that match the structure condition related to $u
N2: # of data nodes that match the structure condition related to $u and contain t
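A minimal sketch of the inverse element frequency follows. The slide writes only "log", so the natural logarithm used here is an assumption, and the counts are made-up numbers chosen to show that a rarer term scores higher.

```python
import math

def ief(n_structure_matches, n_matches_with_term):
    """ief($u, t) = log(N1 / N2); the log base (natural) is an assumption."""
    if n_matches_with_term == 0:
        raise ValueError("term does not occur under any structure match")
    return math.log(n_structure_matches / n_matches_with_term)

# Hypothetical counts: 1000 sections in total, 500 contain a common term, 10 a rare one.
common = ief(1000, 500)
rare = ief(1000, 10)
print(common < rare)  # the rarer term has more disambiguating power
```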
Extended Vector Space Model

The content similarity between an answer A and a query Q is
$cont\_sim(A, Q) = \sum_{i=1}^{n} \sum_{j=1}^{|\$u_i.cont|} tf_w(v_i, t_{ij}) \cdot ief(\$u_i, t_{ij})$
n: # of nodes in Q
{$u1, …, $un}: the set of query nodes in Q
{v1, …, vn}: the set of data nodes in A, where vi matches $ui (1 ≤ i ≤ n)
|$ui.cont|: the number of terms in the content conditions on the node $ui
tij: a term in the content condition on the query node $ui
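The double sum can be sketched directly. In this illustration the weighted term frequencies and inverse element frequencies are looked up from small tables of invented values; the node names and numbers are assumptions, not results from the deck.

```python
def cont_sim(query_nodes, answer_nodes, tf_w, ief):
    """Sum tf_w(v_i, t_ij) * ief($u_i, t_ij) over all query nodes and their terms."""
    return sum(
        tf_w(v, t) * ief(u, t)
        for (u, terms), v in zip(query_nodes, answer_nodes)
        for t in terms
    )

# Query nodes with their content terms, and the data nodes matched to them.
query = [("title", ["search", "engine"]), ("section", ["spam", "detection"])]
answer = ["v2", "v8"]

# Invented lookup tables standing in for the two measures.
TF = {("v2", "search"): 1.0, ("v2", "engine"): 1.0,
      ("v8", "spam"): 1.75, ("v8", "detection"): 1.25}
IEF = {("title", "search"): 2.0, ("title", "engine"): 1.0,
       ("section", "spam"): 0.5, ("section", "detection"): 2.0}

score = cont_sim(query, answer,
                 tf_w=lambda v, t: TF.get((v, t), 0.0),
                 ief=lambda u, t: IEF[(u, t)])
print(score)  # 2.0 + 1.0 + 0.875 + 2.5 = 6.375
```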
Structure Distance Function


Both XML data and queries are modeled as trees
Similarities between trees are often computed by editing distances,
  i.e., the cost of the cheapest sequence of editing operations that transform one tree into the other tree
The structure distance between an answer A and a query Q can be measured as the total cost of relaxation operations used to derive A
$struct\_dist(A, Q) = \sum_{i=1}^{k} cost(r_i)$
{r1, …, rk}: the set of relaxation operations used to derive A
cost(ri): the cost for ri (0 ≤ cost(ri) ≤ 1)
Relaxation Operation Cost

Naïve approach
  Assign uniform cost to all relaxation operations
  Simple but ineffective
Our approach
  Assign an operation cost based on the similarity between the two nodes being approximated by the operation
  The closer the two nodes, the less the operation costs
$cost(r_i) = 1 - similarity(\$u, \$v)$
ri: a relaxation operation
$u, $v: the two nodes that are being approximated by ri
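Combining this cost with the structure distance (the sum of the costs of the operations used) gives a short sketch. The similarity values below are invented; how similarity between two nodes is actually measured is not specified on these slides.

```python
def operation_cost(similarity):
    """cost(r_i) = 1 - similarity($u, $v), with similarity assumed to lie in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    return 1.0 - similarity

def struct_dist(similarities):
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)

# Assumed similarities for two operations, e.g. a node relabel and a node deletion.
print(struct_dist([0.75, 0.5]))  # 0.25 + 0.5 = 0.75
```

Under a uniform cost model both operations would contribute the same amount, so answers derived through very different relaxations could not be told apart by structure.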
Nodes Approximated By Relaxation Operations
Relaxation operation | Nodes being approximated by the operation: ($u, $v) | Example
Node relabel | (a node with the old label, a node with the new label) | (article, document)
Node deletion | (a child node, the parent node) | (section, body)
Edge generalization | (a child node, a descendant node) | (article/title, article//title)
(figure: query tree T1 (article; title; body; section) and relaxed trees T2, T3, T4 illustrating node relabel, node deletion, and edge generalization)
content similarity
structure distance
overall relevancy
Overall Relevancy Function



The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q)
Properties
  sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  sim(A, Q) increases as cont_sim(A, Q) increases
  sim(A, Q) decreases as struct_dist(A, Q) increases
Implementation
$sim(A, Q) = \alpha^{struct\_dist(A, Q)} \cdot cont\_sim(A, Q)$
α is a small constant between 0 and 1
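The three properties can be checked with a few lines of code, assuming the implementation formula reads sim(A, Q) = α^struct_dist(A, Q) · cont_sim(A, Q) with 0 < α < 1. The default α and the sample scores below are arbitrary illustration values.

```python
def overall_sim(cont_sim, struct_dist, alpha=0.5):
    """sim(A, Q) = alpha ** struct_dist(A, Q) * cont_sim(A, Q), for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return (alpha ** struct_dist) * cont_sim

print(overall_sim(6.0, 0))    # no relaxation: equals the content similarity, 6.0
print(overall_sim(6.0, 1.0))  # one full-cost relaxation halves the score: 3.0
print(overall_sim(6.0, 2.0))  # two full-cost relaxations: 1.5
```

A smaller α penalizes structural relaxation more heavily, which is the knob varied in the cost model evaluation.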
Evaluation Studies

INEX (Initiative for the Evaluation of XML Retrieval)
  Similar to TREC for text retrieval
Document collections
  Scientific articles from IEEE Computer Society 1995 – 2002
  About 500 MB
  Each article consists of 1500 XML nodes on average
Queries
  Strict content and structure (SCAS)
  Vague content and structure (VCAS)
Gold standard
  Relevance assessment provided by INEX
Evaluation of Content Similarity


Datasets: INEX 03 test collection
Query sets: 30 SCAS queries
Comparisons: 38 submissions in INEX 03
(figure: precision vs. recall curve, both axes from 0 to 1; Avg. Precision 0.3309)
Evaluation of the Cost Model



Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative gain (nxCG)
  The official evaluation metric used in INEX 05
  Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users accumulated up to the rank i
  E.g., nxCG@10, nxCG@25, nxCG@50, …
Cost Models:
  UCost: uniform cost for each relaxation operation (Baseline)
  SCost: our proposed cost model
Retrieval performance improvements with semantic cost model
Query set: all content-and-structure queries in INEX 05
$sim(A, Q) = \alpha^{struct\_dist(A, Q)} \cdot cont\_sim(A, Q)$
nxCG@10 for each (α, cost model) pair:
Cost Model | α = 0.1 | α = 0.3 | α = 0.5 | α = 0.7 | α = 0.9
Uniform | 0.2584 | 0.2616 | 0.2828 | 0.2894 | 0.2916
Semantic | 0.3319 (+28.44%) | 0.3190 (+21.94%) | 0.3196 (+13.04%) | 0.3068 (+6%) | 0.2957 (+4.08%)
Assigning relaxation operations different costs based on the similarities of the nodes being operated on improves retrieval performance!
nxCG@25 and nxCG@50 yield similar results
Evaluation of the Cost Model

Result
$sim(A, Q) = \alpha^{struct\_dist(A, Q)} \cdot cont\_sim(A, Q)$
Cost Model | α = 0.1 | α = 0.3 | α = 0.5 | α = 0.7 | α = 0.9
UCost | 0.2584 | 0.2616 | 0.2828 | 0.2894 | 0.2916
SCost | 0.3319 (+28.44%) | 0.3190 (+21.94%) | 0.3196 (+13.04%) | 0.3068 (+6%) | 0.2957 (+4.08%)
Each cell: nxCG@10 for a given pair (α, cost model) (% of improvement over the baseline)
Utilizing node similarities to distinguish costs of different operations improves retrieval performance!
Similar results are observed using nxCG@25 and nxCG@50
Expressiveness of the Relaxation Language

INEX 05 Topic 267
<inex_topic topic_id="267" query_type="CAS" >
<castitle> //article//fm//atl[about(., "digital libraries")] </castitle>
<description> Articles containing "digital libraries" in their title. </description>
<narrative> I'm interested in articles discussing Digital Libraries as their main
subject. Therefore I require that the title of any relevant article mentions "digital
library" explicitly. Documents that mention digital libraries only under the
bibliography are not relevant, as well as documents that do not have the phrase "digital
library" in their title. </narrative>
</inex_topic>

Expressing Topic 267 using RLXQuery
FOR $a in doc(“inex.xml”)//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., “digital libraries”)]
RETURN $b
Effectiveness of the Relaxation Control

Expressing Topic 267 with RLXQuery
FOR $a in doc(“inex.xml”)//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., “digital libraries”)]
RETURN $b

Results
Method | nxCG@10 | nxCG@25
No relaxation control | 0.1013 | 0.2365
With relaxation control | 1.0 (perfect accuracy) | 0.8986
Relaxation control enables the system to provide answers with greater relevancy!
Evaluation of the Ranking Function




Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available relevance assessments
Comparison: top-1 submission in INEX 05
Results
Topic | nxCG@10 Top-1 | nxCG@10 CoXML | nxCG@25 Top-1 | nxCG@25 CoXML
256 | 0.4293 | 0.4248 | 0.4733 | 0.5555
264 | 0.0 | 0.0069 | 0.0 | 0.0033
275 | 0.7715 | 0.638 | 0.589 | 0.5922
284 | 0.0 | 0.1259 | 0.0 | 0.1233
Average | 0.3002 (+0.4%) | 0.2989 | 0.2656 | 0.3186 (+20%)
The systematic relaxation approach enables our system to derive more approximate answers!
Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!
CoXML Testbed
(figure: CoXML testbed architecture. Input: RLXQuery; output: Approximate Answers. Components: RLXQuery Parser, RLXQuery Preprocessor, Relaxation Controller, Relaxation Manager, Database Manager, Ranking Module, Relaxation Index Builder, XTAH, XML Database Engine, XML Documents.)
Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian
Relaxation Examples using the Testbed
Related Work: Query Relaxation

Relaxation based on schema conversions ([LC01, LMC01], [LMC03])
  No structure relaxation
Native XML relaxation
  Propose structure relaxation types [e.g., KS01, ACS02]
    We use the relaxation types introduced in [ACS02]
  Investigate efficient algorithms for deriving top-K answers based on relaxation types supported [e.g., Sch02, ACS02, ALP04, AKM05]
    No relaxation control
Related Work: XML Ranking

Content ranking



Most extend ranking models for text retrieval to the XML
scenario, e.g., HyRex, XXL, JuruXML, XSearch
We utilize structure to distinguish terms of different weights
occurring in different parts of a document
Structure ranking



Based on tree editing distance algorithms w/o considering
operation cost [NJ02]
Based on the occurrence frequency of the query trees, paths,
or predicates in data [MAK05, AKM05]
Our structure ranking is similar to editing distance, but we
consider operation cost
Conclusion

Cooperative XML (CoXML) query answering




RLXQuery enables users to effectively express
approximate query conditions and to control the
approximate matching process
XTAH provides systematic query relaxation guidance
Both content and structure similarity metrics for
evaluating the relevancy of approximate answers
Evaluation studies with the INEX test collections
demonstrate the effectiveness of our methodology