Keyword Proximity Search on XML Graphs

Vagelis Hristidis
Yannis Papakonstantinou
Andrey Balmin
@UCSD
Presenter: Feng Shao
Outline
Introduction
 Proximity Keyword Query Semantics
 Architecture
 XML Decompositions
 Execution
 Experiment
 Conclusion

Introduction

Keyword search is easy to use
 No need to know the structure and query language

XML: a labeled graph representing semistructured, self-describing data
 Feb. 10: 5th birthday of XML (from www.w3c.org)
Problem: keyword proximity query

Input: a set of keywords

Results: trees of XML fragments (called target objects) that contain all the keywords, ranked according to their size

Assumes a schema exists; the schema facilitates the presentation of the results and is used to optimize the performance of the system.
Name[John]personsupplierlineitemlinepartproductdescr[set of VCR and DVD] , size 6
Name[John]personsupplierlineitemlinepartpartsubpartpartname[VCR], size 8
Challenges

Presentation of result graphs:
 Semantically meaningful
 Avoid a huge number of trivial results

Providing fast response time
 Efficient storage of data
 On-demand execution, guided according to user’s navigation
Semantics

XML graph: a labeled graph
 Node v: id(v), label λ(v), value val(v)
 Edge: containment and reference edges

Schema graph: a directed graph
 Node vs: label λ(vs), content type type(vs) (all or choice)
 Edge es: containment or reference, annotated with a maximum occurrence occ(es)

An XML graph conforms to a schema graph
[Figure: a schema graph and an XML graph]
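The definitions above can be made concrete with a small sketch. It is only an illustration, not the authors' implementation: it assumes schema node labels are unique, ignores the all/choice content type, and treats conformance as "every edge matches a schema edge of the same kind and no node exceeds occ(es)". The names `SchemaEdge`, `Edge`, and `conforms` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaEdge:
    src: str        # label of the source schema node
    dst: str        # label of the target schema node
    kind: str       # "containment" or "reference"
    max_occ: float  # occ(es); float("inf") means unbounded


@dataclass(frozen=True)
class Edge:
    src: int   # id(v) of the source XML node
    dst: int   # id(v) of the target XML node
    kind: str  # "containment" or "reference"


def conforms(labels, edges, schema_edges):
    """Simplified conformance check (assumption: labels identify schema nodes).

    labels maps id(v) to the label λ(v) of each XML node.
    """
    allowed = {(e.src, e.dst, e.kind): e.max_occ for e in schema_edges}
    counts = Counter()
    for e in edges:
        key = (labels[e.src], labels[e.dst], e.kind)
        if key not in allowed:
            return False  # no schema edge of this kind between these labels
        counts[(e.src, key)] += 1
        if counts[(e.src, key)] > allowed[key]:
            return False  # this source node exceeds occ(es)
    return True
```

For example, with a schema edge person -> name of maximum occurrence 1, a person node with one name child conforms, while a person node with two name children does not.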
Query semantics

Result: the set of all possible Minimal Total Target Object Networks (MTTONs)

What's an MTTON?

Node network j: an acyclic subgraph of G, such that each edge in j is an edge in G

Total node network j of keywords {k1,…,km}: a node network where every keyword is contained in at least one node n of j

Minimal Total Node Network (MTNN): a total node network j from which no node can be removed such that j is still a total node network. Score: number of edges

Target object of node n: a segment of the XML graph, large enough to be meaningful and to semantically identify the node n, and as small as possible.
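A minimal sketch of the total and minimal conditions, under stated assumptions: the network is already known to be a connected tree, `contains` maps each node to the set of keywords it holds, and only leaves are tried for removal (removing an inner node would disconnect the tree). The function names are hypothetical.

```python
def is_total(nodes, keywords, contains):
    """Every keyword is contained in at least one node of the network."""
    return all(any(k in contains[n] for n in nodes) for k in keywords)


def leaves(nodes, edges):
    """Nodes with at most one incident edge in the (undirected) tree."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [n for n in nodes if degree[n] <= 1]


def is_mtnn(nodes, edges, keywords, contains):
    """Total, and no leaf can be dropped without losing a keyword."""
    if not is_total(nodes, keywords, contains):
        return False
    return all(
        not is_total(nodes - {leaf}, keywords, contains)
        for leaf in leaves(nodes, edges)
    )


def score(edges):
    """Score of an MTNN: its number of edges (smaller is better)."""
    return len(edges)
```

Here `nodes` is a Python set of node ids and `edges` a list of id pairs; a network with a trailing leaf that holds no query keyword is total but not minimal.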
MTTON (cont.)

Given an MTNN j with nodes v1, . . . , vn, there is a corresponding MTTON t, which is a tree:

Its nodes are a minimal set of target objects {t1, . . . , tm} such that for every node vk ∈ j there is a tl ∈ t such that target(vk) = tl.

There is an edge from a target object ti to a target object tj if there is an edge (or a path) from a node that belongs to ti to a node that belongs to tj.

The score of an MTTON t is the score of its corresponding MTNN.
MTNN: name
MTNN:namepersonnation
MTNN & MTTON
Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR]
Target object

Defined by an administrator using the Target Schema Segment (TSS) graph

TSS graph: built from a partial mapping of the nodes of the schema graph G to TSS nodes
 A node tS is created in G_TSS for each set S = {s1, . . . , sw} of nodes of G that are mapped to tS.
 An edge (tS, tS’) is created in G_TSS if the schema graph has nodes s ∈ S and s’ ∈ S’ that are connected directly through an edge (s, s’) or indirectly through a path of dummy schema nodes (schema nodes not mapped to any TSS node).

Target decomposition: given the TSS graph, decompose the XML graph into target objects, connected to each other
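The edge rule for the TSS graph can be sketched as a graph walk. This is an illustration under assumptions, not the authors' code: schema edges are directed pairs, `tss_of` is the partial mapping (unmapped nodes are the dummy nodes), and a direct edge between two schema nodes of the same segment is treated as internal to that segment, while a path through dummy nodes that returns to the same segment is kept as a self-loop. The function name and the sample labels are hypothetical.

```python
from collections import defaultdict


def build_tss_edges(schema_edges, tss_of):
    """Return the set of TSS edges (tS, tS') induced by the schema graph."""
    succ = defaultdict(list)
    for s, s2 in schema_edges:
        succ[s].append(s2)

    tss_edges = set()
    for start, t_start in tss_of.items():
        # Each stack entry: (schema node, was a dummy node crossed to reach it?)
        stack = [(n, False) for n in succ[start]]
        seen = set()
        while stack:
            node, via_dummy = stack.pop()
            if (node, via_dummy) in seen:
                continue
            seen.add((node, via_dummy))
            if node in tss_of:
                # Mapped node reached: record the edge and stop this path.
                if tss_of[node] != t_start or via_dummy:
                    tss_edges.add((t_start, tss_of[node]))
            else:
                # Dummy node: keep walking along its outgoing edges.
                stack.extend((n, True) for n in succ[node])
    return tss_edges
```

With lineitem mapped to L, part and its name mapped to Pa, and linepart and subpart left unmapped, the walk yields the edges (L, Pa) and the self-loop (Pa, Pa).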
Example
Presentation Graph

Naïve method: multiple threads evaluate various plans for producing MTTONs and output results as they come.
 Pro: fast response time
 Con: many trivial results

Interactive interface: allows navigation and hides the trivial results
Presentation Graph
Architecture
Load Stage
 Keyword index: <TO_id, node_id, schema_node> for each keyword
 Statistics: the number of nodes of each type, etc.
 A decomposition of the TSS graph into fragments, which correspond to connection relations that allow efficient retrieval of MTTONs
 Target object store: given an object id, instantly return the whole target object
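The keyword lookup structure can be sketched as an inverted index. This is a simplified in-memory illustration; the slides state that the system is built on a relational database, so the dictionary and the whitespace tokenizer here are assumptions, and `build_keyword_index` is a hypothetical name.

```python
from collections import defaultdict


def build_keyword_index(nodes):
    """Map each keyword to the <TO_id, node_id, schema_node> triples containing it.

    nodes: iterable of (to_id, node_id, schema_node, value) tuples.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    index = defaultdict(list)
    for to_id, node_id, schema_node, value in nodes:
        for keyword in set(value.lower().split()):
            index[keyword].append((to_id, node_id, schema_node))
    return index
```

A query keyword is then answered with a single lookup, for example `index["vcr"]`, and the returned target object ids can be used to fetch whole target objects from the store.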
Example of decomposition
Query processing
[Figure: query-processing pipeline with the example keywords TV, VCR. Labels: keyword lookup <TO_id, node_id, schema_node>; candidate network; candidate TSS network; execution plan; schema graph; TSS graph; connection relations schema]
XML Decomposition

Decompose the TSS graph into fragments
 Determines how the connections are stored in the database
 Can dramatically change the performance

[Figure: example of a decomposition]
Decomposition Tradeoff
 Number of fragments vs. performance

Minimal decomposition
 A fragment is built for each edge of the TSS graph
 A candidate TSS network C of size S requires S-1 joins

Maximal decomposition
 A fragment F is built for every possible candidate TSS network C
 C requires zero joins
 Not feasible in practice
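The two extremes can be compared with a back-of-the-envelope calculation. The sketch below assumes a path-shaped candidate TSS network whose size is counted in edges, and assumes that fragments of every size up to the maximum are available so the path can be tiled greedily; real decompositions need not satisfy this. `joins_needed` is a hypothetical helper.

```python
import math


def joins_needed(network_size, max_fragment_size):
    """Joins to evaluate a path-shaped network of `network_size` edges
    when each connection relation covers at most `max_fragment_size` edges."""
    fragments = math.ceil(network_size / max_fragment_size)
    return max(fragments - 1, 0)
```

For a network of size 6, one fragment per edge gives 5 joins, fragments of size 2 give 2 joins, and a single fragment covering the whole network gives 0 joins.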
Tradeoff (cont.)

Clustering and indexing are critical
 Maximal decomp.: multi-attribute indices
 Non-maximal decomp.: a connection relation R is clustered on the direction in which R is used

Classify the fragments of the TSS graph based on the storage redundancy in the corresponding connection relations.
 Example: 4NF, inlined (non-MVD, non-4NF)
Decomposition Algorithm
 See paper
Execution
Goal: fast response time

Web search engine-like presentation
 Use the inlined decomposition
 Use a thread pool
 Use nested-loop joins
 Example: outermost loop over TSS partVCR,name
 Optimization: store partial results
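One way to read "nested-loop joins with stored partial results" is sketched below. It is an assumption-laden illustration, not the system's algorithm: connection relations are modeled as dictionaries from an id to the ids it connects to, the candidate network is a two-hop chain, and the inner lookup is memoized so that a shared intermediate node is probed only once. All names are hypothetical, and the thread pool is left out.

```python
from functools import lru_cache


def top_results(outer_ids, hop1, hop2, wanted, limit):
    """Nested-loop join outer -> hop1 -> hop2, stopping after `limit` results.

    Keeps rows whose last id is in `wanted` (ids that contain the other
    keyword). Returns the rows and the number of uncached inner lookups.
    """
    lookups = 0

    @lru_cache(maxsize=None)
    def matches(mid):
        # Inner loop, cached per intermediate id (the stored partial result).
        nonlocal lookups
        lookups += 1
        return tuple(x for x in hop2.get(mid, ()) if x in wanted)

    results = []
    for a in outer_ids:                 # outermost loop
        for mid in hop1.get(a, ()):     # first connection relation
            for b in matches(mid):      # second connection relation, cached
                results.append((a, mid, b))
                if len(results) == limit:
                    return results, lookups
    return results, lookups
```

When two outer ids share the same intermediate id, the second probe is served from the cache, which is where the saving over a non-caching nested loop comes from.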
Execution

Presentation graphs (on-demand)
 Initially, the XKeyword decomposition is used to retrieve the top result of each candidate network (CN).
 Then a combination of decompositions is used to find the minimal connection of the expanded nodes.
Experiments

Measure various decompositions, for top-K and full results

Evaluate the performance of the algorithms for the search engine-like presentation method and the on-demand expansion method

Data: DBLP XML database, 2 keywords
Maximum size of CTSSN: M = 6
Maximum size of fragments: L = 2
Decompositions
Execution algorithm
Speedup: optimized algorithm relative to the naïve, non-caching algorithm
Execution algorithm
Keyword queries: the names of two authors, k1 and k2
Candidate Network: Author[k1] - Paper - Author[k2]
Time measured: average time to expand a Paper node
Conclusion

XKeyword is built on a relational database and, hence, can accommodate very large graphs.

Presents keyword proximity search semantics, extended to capture the novel result presentation method.

Presents an architecture that allows choosing which connections will be precomputed.

Addresses the on-demand performance requirement.

Demo: http://www.db.ucsd.edu/Xkeyword