Keyword Proximity Search on XML Graphs
Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin @UCSD
Presenter: Feng Shao

Outline
  - Introduction
  - Proximity Keyword Query Semantics
  - Architecture
  - XML Decompositions
  - Execution
  - Experiment
  - Conclusion

Introduction
  - Keyword search is easy to use: no need to know the structure or a query language
  - XML: a labeled graph representing semistructured, self-describing data
  - Feb. 10: 5th birthday of XML (from www.w3c.org)

Problem: keyword proximity query
  - Input: a set of keywords
  - Results: trees of XML fragments (called target objects) that contain all the keywords, ranked according to their size
  - Assume the existence of a schema, which facilitates the presentation of the results and is used to optimize the performance of the system
  - Example results:
      Name[John] - person - supplier - lineitem - linepart - product - descr[set of VCR and DVD], size 6
      Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR], size 8

Challenges
  - Presentation of result graphs
      - Semantically meaningful
      - Avoid a huge number of trivial results
  - Providing fast response time
      - Efficient storage of data
      - On-demand execution, guided by the user's navigation

Semantics
  - XML graph: a labeled graph
      - Node v: id(v), label λ(v), value val(v)
      - Edge: containment and reference edges
  - Schema graph: a directed graph
      - Node vs: label λ(vs), content type type(vs) (all or choice)
      - Edge es: containment or reference, annotated with a maximum occurrence occ(es)
  - An XML graph conforms to a schema graph
  - Figures: schema graph; XML graph

Query semantics: what's an MTTON?
  - Result: the set of all possible Minimal Total Target Object
Networks (MTTON’s)
  - Node network j: an acyclic subgraph of G such that each edge in j is an edge in G
  - Total node network j of keywords {k1,…,km}: a node network where every keyword is contained in at least one node n of j
  - Minimal Total Node Network (MTNN): a total node network j from which no node can be removed while j remains a total node network
  - Score: number of edges
  - Target object of node n: a segment of the XML graph, large enough to be meaningful and to semantically identify the node n, and as small as possible

MTTON (cont.)
  - Given an MTNN j with nodes v1, ..., vn, there is a corresponding MTTON t, which is a tree whose nodes are a minimal set of target objects {t1, ..., tm} such that for every node vk ∈ j there is a tl ∈ t such that target(vk) = tl
  - There is an edge from a target object ti to a target object tj if there is an edge (or a path) from a node that belongs to ti to a node that belongs to tj
  - The score of an MTTON t is the score of its corresponding MTNN
  - Examples: MTNN: name; MTNN: name - person - nation

MTNN & MTTON
  - Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR]

Target object
  - Defined by an administrator using the Target Schema Segment (TSS) graph
  - TSS graph: a partial mapping of nodes in G
  - A node tS is created in G_TSS for each set S = {s1, ..., sw} of nodes of G that are mapped to tS
  - An edge (tS, tS') is created in G_TSS if the schema graph has nodes s ∈ S and s' ∈ S' that are connected directly through an edge (s, s') or indirectly through a path of dummy schema nodes
  - Target decomposition: given the TSS graph, decompose the XML graph into target objects that are connected to each other
  - Example (figure)

Presentation Graph
  - Naïve method: multiple threads evaluate various plans for producing MTTON’s and output results as they come
  - Pro: fast response time
  - Con: many trivial results
  - Interactive interface: allows navigation and hides the trivial results

Architecture
  - Load stage
      - Keyword: <TO_id, node_id, schema_node>
      - The number of nodes of each type, etc.
      - A decomposition of the TSS graph into fragments, which correspond to connection relations that allow efficient retrieval of MTTON’s
      - Given an object id, instantly return the whole target object
  - Example of decomposition (figure)

Query processing
  - Keywords: TV, VCR
  - Keyword: <TO_id, node_id, schema_node>
  - Figure labels: candidate network; candidate TSS network; execution plan
  - Figure inputs: schema graph; TSS graph; connection relations schema; connection relations

XML Decomposition
  - Decompose the TSS graph into fragments
  - Determines how the connections are stored in the database
  - Can dramatically change the performance
  - Example (figure)

Decomposition Tradeoff
  - Number of fragments vs. performance
  - Minimal decomposition
      - A fragment is built for each edge of the TSS graph
      - A candidate TSS network C of size S requires S-1 joins
  - Maximal decomposition
      - A fragment F is built for every possible candidate TSS network C
      - C requires zero joins
      - Not feasible in practice

Tradeoff (cont.)
  - Clustering and indexing are critical
  - Classify the TSS graph based on the storage redundancy in the corresponding connection relations
  - Maximal decomp.: multi-attribute indices
  - Non-maximal decomp.: a connection relation R is clustered on the direction in which R is used
  - Example: 4NF, inlined (non-MVD, non-4NF)

Decomposition Algorithm
  - See paper

Execution
  - Goal: fast response time
  - Web search engine-like presentation
      - Use inlined decomposition
      - Use thread pool
      - Use nested-loop joins
      - Example: outermost loop over TSS partVCR,name
      - Optimization: store partial results

Execution (cont.)
  - Presentation graphs (on-demand)
      - Initially, XKeyword decomposition is used to retrieve the top result of each CN
      - Then use a combination of decompositions to find the minimal connection of the expanded nodes

Experiments
  - Measure various decompositions, for top-K and full results
  - Evaluate the performance of the algorithm for the search engine-like presentation method and the on-demand expansion method
  - Data: DBLP XML database, 2 keywords
  - Maximum size of CTSSN: M = 6
  - Maximum size of fragments: L = 2

Decompositions
  - (figure)

Execution algorithm
  - Speedup = optimized algorithm / naïve, non-caching algorithm

Execution algorithm (cont.)
  - Keyword queries: the names of two authors, k1 and k2
  - Candidate network: Author_k1 - Paper - Author_k2
  - Time measured: average time to expand a Paper node

Conclusion
  - XKeyword is built on a relational database and hence can accommodate very large graphs
  - Presents keyword proximity search semantics, extended to capture the novel result presentation method
  - Presents an architecture that allows choosing which connections will be precomputed
  - Addresses the on-demand performance requirement
  - Demo: http://www.db.ucsd.edu/Xkeyword
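
Recap sketch: MTNN check and score
  The keyword proximity semantics summarized in the conclusion rest on the MTNN definition: a total node network from which no node can be removed, scored by its number of edges. The Python sketch below is one possible way to check that definition on a small in-memory graph; it is not XKeyword's implementation, which runs on a relational database. The helper names (is_tree, is_total, is_mtnn, score) and the keyword assignment are hypothetical. The node path follows the size-6 example result, with the node boundaries assumed, and a node network is assumed to be connected and is treated as undirected.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def is_tree(nodes: Set[str], edges: Set[Edge]) -> bool:
    """True if (nodes, edges) is connected and acyclic, ignoring edge direction."""
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in edges:
        if a not in adjacency or b not in adjacency:
            return False
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal: with |nodes| - 1 edges, connected implies acyclic.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def is_total(nodes: Set[str], keywords_of: Dict[str, Set[str]], query: Set[str]) -> bool:
    """True if every query keyword is contained in at least one node."""
    covered: Set[str] = set()
    for n in nodes:
        covered |= keywords_of.get(n, set())
    return query <= covered


def is_mtnn(nodes: Set[str], edges: Set[Edge],
            keywords_of: Dict[str, Set[str]], query: Set[str]) -> bool:
    """True if the network is total and no single node can be removed."""
    if not (is_tree(nodes, edges) and is_total(nodes, keywords_of, query)):
        return False
    for n in nodes:
        rest = nodes - {n}
        rest_edges = {e for e in edges if n not in e}
        if is_tree(rest, rest_edges) and is_total(rest, keywords_of, query):
            return False  # a smaller total node network exists
    return True


def score(edges: Set[Edge]) -> int:
    """Score of a node network: its number of edges."""
    return len(edges)


# Assumed segmentation of the size-6 example path.
PATH = ["name", "person", "supplier", "lineitem", "linepart", "product", "descr"]
nodes = set(PATH)
edges = set(zip(PATH, PATH[1:]))
keywords_of = {"name": {"John"}, "descr": {"VCR"}}  # hypothetical keyword index
query = {"John", "VCR"}

print(is_mtnn(nodes, edges, keywords_of, query), score(edges))
```

  Under these assumptions, only the two end nodes hold keywords, so removing either one loses a keyword and removing an inner node disconnects the path. Attaching an extra keyword-free node (for example a nation node under person) keeps the network total but makes it non-minimal.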