Smarter Cities SI

Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer

Presenter: Freddy Lecue

IBM Research Ireland

Outline

 Introduction

– Text-Rich Data-Graphs and Hybrid Queries

– Problem Definition

– Contributions

 TopGuess

– Data Synopsis

– Probabilistic Component

 Evaluation

 Conclusion

 References

Text-Rich Data-Graphs and Hybrid Queries

 Increasing amount of semi-structured, text-rich data:

– Structured data with unstructured texts (e.g., [1]).

– Unstructured data annotated with structured information (e.g., [2]).

[Figure labels: Structure, Text]

[1] DBpedia – A Crystallization Point for the Web of Data.

[2] http://webdatacommons.org.

Text-Rich Data-Graphs and Hybrid Queries (2)

 Focus of our work: conjunctive, hybrid queries

[Figure: example hybrid query over variables ?x and ?y. Relation and attribute edges are the structured query predicates; "keyword" nodes are the unstructured ("string") query predicates. Labels: Structure, Text.]

Problem Definition (1)

 Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.

– Decompose problem: sel(Q) = R(Q) * P(Q), [5].

– R(Q): upper-bound cardinality for result set.

– P(Q): probability for Q having a non-empty result.

– Correlations between query predicates (data elements) make approximation of P(Q) hard.

[Figure: hybrid query with variables ?x and ?y, relation and attribute predicates, and "keyword" nodes; correlations are marked between the predicates.]

Correlations make estimations relying on "independence assumptions" error-prone!

[5] Selectivity estimation using probabilistic models.
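The effect of correlations on the decomposition sel(Q) = R(Q) * P(Q) can be illustrated with a small sketch. The toy data, the choice of R(Q) as the total number of entities, and the function name `estimate` are assumptions made for this example only; they are not taken from the paper.

```python
# Toy illustration: sel(Q) = R(Q) * P(Q), with P(Q) computed once under an
# independence assumption and once from the true joint distribution.
# The data is hypothetical and only serves the example.

# Each entity: (is a Person, name contains "audrey")
entities = (
    [(True, True)] * 5        # persons named Audrey
    + [(True, False)] * 45    # other persons
    + [(False, False)] * 50   # non-person entities, none named Audrey
)

def estimate(entities):
    n = len(entities)  # assumed R(Q): upper bound on the result size
    p_person = sum(e[0] for e in entities) / n
    p_audrey = sum(e[1] for e in entities) / n
    p_joint = sum(e[0] and e[1] for e in entities) / n

    independent = n * p_person * p_audrey  # treats the predicates as independent
    exact = n * p_joint                    # uses the true joint probability
    return independent, exact

independent, exact = estimate(entities)
print(f"independence estimate: {independent:.2f}, exact: {exact:.2f}")
```

Because every "audrey" in this toy data belongs to a Person, the two predicates are fully correlated and the independence-based estimate is half the true result size.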

Problem Definition (2)

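 Previous work focuses either on structured or on unstructured query constraints.

– Structured query constraints (relation predicates): graph synopses [4], join samples [3], PRMs [5,6], …

– Unstructured query constraints (attribute predicates): fuzzy string matching [7,8], extraction operators [9,10]

[Figure: hybrid query with variables ?x and ?y; the two groups of approaches point at the relation and attribute predicates, with correlations marked between the predicates.]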

 In our previous work [18], we introduced a uniform model (BN+) for hybrid queries:

– Effectiveness Issues:

 Difficulty of capturing all correlations between text and structure

 Pruning text (i.e., vocabulary) using string synopses results in an "information loss"

– Efficiency Issues:

 Data synopsis: Large query-independent BN constructed offline. Grows exponentially w.r.t. vocabulary size

 Estimation: BN inference over a large synopsis, which is NP-hard.

[18] Wagner et al., EDBT 2013, Selectivity estimation for hybrid queries over text-rich data graphs

Problem Definition (3)

Motivating Example

There can be many entities of type Person (i.e., bindings for ?p a Person), while only few entities have a name "Audrey". So, in order to estimate the number of bindings for ?p, a synopsis has to capture statistics for any word associated (via name) with Person entities.

[Figure: example data graph and hybrid query]

Contributions

 We propose a novel approach (TopGuess), which utilizes relational topic models as data synopsis

– summarizing textual data with linear space complexity w.r.t. vocabulary size

– allowing to capture statistics for the complete vocabulary of words by means of topics (no "information loss" due to coarse-grained string synopses)

– capturing correlations between the structure and the text via topics

 TopGuess constructs a small query-specific BN at the query time for estimation

– With time complexity independent of the synopsis size

– That is, TopGuess does not directly use a large synopsis in memory at runtime; instead, it employs a small, compact synopsis for the current query.

 Experiments on real-world data: improve effectiveness by up to 88%, without sacrificing runtime performance.

TOPGUESS

Data Synopsis

 Uniform synopsis using relational topic models

– Different topic models can be used [19] [20] [21] [22]

 Synopsis Parameters

– Topics: Textual data in a low-dimensional representation via a set of k topics

– Class-Topic Parameter: correlations between a class (e.g. Movie, Person) and topics (represented as a vector for each class)

– Relation-Topic Parameter: correlations between a relation (e.g. starring) and topics (represented as a matrix for each relation)

 Given topics, TopGuess data synopsis has linear space complexity w.r.t. vocabulary (see Thm. 1 in the paper)

[Figure: synopsis of example data graph using TRM [19]]
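A rough way to see the linear growth is to count synopsis parameters. The sketch below assumes one word distribution per topic (k × vocabulary size), one length-k vector per class, and one k × k matrix per relation; the exact layout and the function name `synopsis_size` are assumptions for illustration, not the definition from Thm. 1.

```python
def synopsis_size(vocab_size, num_topics, num_classes, num_relations):
    """Count parameters of a topic-based synopsis (assumed layout)."""
    topic_word = num_topics * vocab_size               # one word distribution per topic
    class_topic = num_classes * num_topics             # one vector per class
    relation_topic = num_relations * num_topics ** 2   # one k x k matrix per relation
    return topic_word + class_topic + relation_topic

# Hypothetical sizes: doubling the vocabulary with a fixed number of topics.
small = synopsis_size(vocab_size=1_000_000, num_topics=50,
                      num_classes=10, num_relations=20)
large = synopsis_size(vocab_size=2_000_000, num_topics=50,
                      num_classes=10, num_relations=20)
print(small, large, large - small)
```

With the number of topics fixed, only the topic-word term depends on the vocabulary, so each additional word adds exactly k parameters.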

Probabilistic Component (1)

 TopGuess constructs a small query-specific BN for each query at query time

– Every predicate in the query is represented as an observed random variable in the BN

 Class, relation and string predicates

– Each query variable v (e.g., m, p, l) is also represented as a topical random variable X_v in the BN (e.g., X_m, X_p, X_l)

 Those topical random variables are modelled as multinomial distributions over the topics

 So every query variable is perceived as a topic mixture

 However, the distribution of X_v is initially unknown (hidden), so it is learned using gradient ascent

– The query-specific BN is acyclic (see Thm. 2 in the paper)

[Figure: hybrid query and its query-specific BN]
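The structure of such a query-specific BN can be sketched as a mapping from each observed predicate node to its topical parents. This is a simplified illustration, not the paper's construction: the triple-pattern encoding, the predicate names (`type`, `title`, `name`, `starring`, `bornIn`), the assumption that a relation predicate depends on the topical variables of both of its query variables, and the function name `build_query_bn` are all hypothetical.

```python
def build_query_bn(query):
    """Map each observed predicate node to its topical parent variables.

    `query` is a list of (subject, predicate, object) patterns. Strings
    starting with '?' are query variables; anything else is a constant.
    """
    parents = {}
    for s, p, o in query:
        if o.startswith("?"):
            # Relation predicate: assumed to depend on both variables' topics.
            parents[(s, p, o)] = [f"X_{s[1:]}", f"X_{o[1:]}"]
        else:
            # Class or string predicate: depends on one variable's topics.
            parents[(s, p, o)] = [f"X_{s[1:]}"]
    return parents

# Hypothetical hybrid query over variables m, p and l.
example_query = [
    ("?m", "type", "Movie"),
    ("?m", "title", "holiday"),
    ("?p", "name", "audrey"),
    ("?m", "starring", "?p"),
    ("?p", "bornIn", "?l"),
]
bn = build_query_bn(example_query)
for node, node_parents in bn.items():
    print(node, "<-", node_parents)
```

In this sketch edges only run from topical variables to observed predicate nodes, so the resulting graph cannot contain a cycle.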

Probabilistic Component (2)

Topical Independence Assumption (TIA)

Given the topical random variables (X_v), all query predicate random variables in the query-specific BN are independent

 TIA considers that query predicate probabilities depend on (and are governed by) the topics of their associated topical random variables

– For instance, random variable X_holiday is only dependent on X_m. In other words, given X_m, X_holiday is conditionally independent of all other variables, e.g., X_audrey.

 TIA allows us to easily estimate P(Q) via:
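The slide's formula is not preserved in this text. As a rough sketch of what the TIA implies, P(Q) factorizes into one term per query predicate, each conditioned only on its topical variables. The parametrization below (a topic-weighted average for class and string predicates, a bilinear form for relation predicates), all numbers, and the function names are assumptions for illustration and not the paper's exact formula.

```python
import math

def unary_prob(theta, beta):
    """P(predicate | X_v): per-topic probabilities beta mixed by theta."""
    return sum(t * b for t, b in zip(theta, beta))

def relation_prob(theta_s, omega, theta_o):
    """P(relation | X_s, X_o) as the bilinear form theta_s^T * omega * theta_o."""
    return sum(
        theta_s[i] * omega[i][j] * theta_o[j]
        for i in range(len(theta_s))
        for j in range(len(theta_o))
    )

def query_prob(factors):
    """Under TIA the predicates are independent given the topical
    variables, so P(Q) is the product of the per-predicate factors."""
    return math.prod(factors)

# Hypothetical topic mixtures for query variables m and p (two topics).
theta_m = [0.8, 0.2]
theta_p = [0.5, 0.5]

factors = [
    unary_prob(theta_m, [0.10, 0.01]),                          # string predicate "holiday"
    unary_prob(theta_p, [0.02, 0.04]),                          # string predicate "audrey"
    relation_prob(theta_m, [[0.3, 0.1], [0.1, 0.2]], theta_p),  # relation between m and p
]
p_q = query_prob(factors)
print(p_q)
```

The cost of this product depends only on the number of query predicates and topics, not on the size of the stored synopsis.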

EVALUATION

Evaluation (1) – Setting

 Data: IMDB [14] and DBLP [15].

– IMDB featured more correlations than DBLP.

– Both datasets have large vocabularies: ~25 million (DBLP) and ~7 million (IMDB) words

 Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.

 Systems:

– We used n-gram-based string synopses [10]: random samples of 1-grams, top-k 1-grams, stratified bloom filters on 1-grams.

– String predicates were integrated via (1) independence (ind) or (2) conditional independence (bn) assumption.

[13] Spark2: Top-k keyword query in relational databases.

[14] A framework for evaluating database keyword search strategies.

– TopGuess

Evaluation (2) – Setting (2)

 Synopsis size:

– We employ baselines with varying synopsis size by varying the number of words captured by the string synopsis

– Overall synopsis size depends mainly on string synopsis size.

– Synopsis sizes ∈ {2, 4, 20, 40} MByte in memory.

– In contrast, TopGuess keeps a large topic model (281 MB for IMDB and 229 MB for DBLP) on disk and constructs a small, query-specific BN in memory at runtime (~100 KBytes)

 Metrics:

– Efficiency: selectivity estimation time.

– Effectiveness: multiplicative error [17].

[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
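The effectiveness metric can be sketched as follows, assuming the common definition of multiplicative error as the larger of the true and estimated selectivity divided by the smaller one. The function name `multiplicative_error` and the restriction to positive values are choices made for this example.

```python
def multiplicative_error(true_sel, est_sel):
    """max(true, est) / min(true, est); 1.0 means a perfect estimate."""
    if true_sel <= 0 or est_sel <= 0:
        raise ValueError("both selectivities must be positive")
    return max(true_sel, est_sel) / min(true_sel, est_sel)

# Over- and underestimation by the same factor give the same error.
print(multiplicative_error(100, 50))
print(multiplicative_error(50, 100))
```

The measure is symmetric, so it penalizes over- and underestimation by the same factor equally.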

Evaluation (3) – Results

Conclusion

 We proposed a holistic approach (TopGuess) for selectivity estimation of hybrid queries.

– TopGuess uses RTMs with linear space complexity w.r.t. vocabulary

– Compact query-specific BN as probabilistic component enables estimation independent of synopsis size

– Empirical studies on real-world data achieved strong effectiveness improvements, while not requiring additional runtime.

 Future work:

– Extending TopGuess to a more generic selectivity estimation approach for RDF data and BGP queries

– Replacing the topic models in our data synopsis with different application-specific synopses (e.g. streaming RDF data)

References

[1] C. Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, pages 154–165, 2009.

[2] http://webdatacommons.org/

[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999.

[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006.

[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.

[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011.

[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.

[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.

References (2)

[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007.

[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011.

[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.

[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011.

[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010.

[15] http://knoesis.org/swetodblp/

[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.

[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.

[18] A. Wagner, V. Bicer, and T. Tran. Selectivity estimation for hybrid queries over text-rich data graphs. In EDBT, pages 383–394, 2013.

References (3)

[19] V. Bicer, T. Tran, Y. Ma, and R. Studer. TRM - Learning Dependencies between Text and Structure with Topical Relational Models. In ISWC, 2013.

[20] J. Chang and D. Blei. Relational Topic Models for Document Networks. In AISTATS, 2009.

[21] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: Joint Models of Topic and Author Community. In ICML, 2009.

[22] L. Zhang et al. Multirelational Topic Models. In ICDM, 2009.
