Holistic and Compact Selectivity Estimation for Hybrid Queries over RDF Graphs

Authors: Andreas Wagner, Veli Bicer, Thanh Tran, and Rudi Studer

Presenter: Freddy Lecue

IBM Research Ireland

 Introduction

– Text-Rich Data-Graphs and Hybrid Queries

– Problem Definition

– Contributions

 TopGuess

– Data Synopsis

– Probabilistic Component

 Evaluation

 Conclusion

 References

Text-Rich Data-Graphs and Hybrid Queries

 Increasing amount of semi-structured, text-rich data:


– Structured data with unstructured texts (e.g., [1]).

– Unstructured data annotated with structured information (e.g., [2]).

[1] DBpedia – A Crystallization Point for the Web of Data.


Text-Rich Data-Graphs and Hybrid Queries (2)

 Focus of our work: conjunctive, hybrid queries

[Figure: hybrid query over variables ?x and ?y, combining structured query predicates (relation, attribute) with unstructured "keyword" string predicates.]

Problem Definition (1)

 Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.

– Decompose problem: sel(Q) = R(Q) * P(Q) [5].

– R(Q): upper-bound cardinality for the result set.

– P(Q): probability of Q having a non-empty result.

– Correlations between query predicates (data elements) make approximating P(Q) hard.

[5] Selectivity estimation using probabilistic models.

[Figure: query variable ?x with relation and attribute predicates, annotated with correlations.]

Correlations make estimations relying on "independence assumptions" error-prone.
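To make the effect concrete, here is a minimal sketch on a hypothetical eight-entity data set (the entities, helper names, and the choice of R(Q) as the total entity count are all assumptions for illustration). It applies sel(Q) = R(Q) * P(Q) twice: once with the exact joint probability and once with the product of marginals that an independence assumption would use.

```python
from fractions import Fraction

# Hypothetical toy data: one (class, name keyword) pair per entity.
ENTITIES = [
    ("Person", "audrey"), ("Person", "audrey"),
    ("Person", "tom"), ("Person", "tom"),
    ("Movie", "holiday"), ("Movie", "holiday"),
    ("Movie", "roman"), ("Movie", "roman"),
]


def prob(predicate) -> Fraction:
    """Fraction of entities that satisfy the given predicate."""
    hits = sum(1 for entity in ENTITIES if predicate(entity))
    return Fraction(hits, len(ENTITIES))


def is_person(entity) -> bool:
    return entity[0] == "Person"


def named_audrey(entity) -> bool:
    return entity[1] == "audrey"


# R(Q): upper bound on the result size (assumed here: all entities).
R = len(ENTITIES)

# P(Q) from the exact joint distribution vs. under independence.
p_joint = prob(lambda e: is_person(e) and named_audrey(e))
p_indep = prob(is_person) * prob(named_audrey)

sel_exact = R * p_joint   # matches the true result size
sel_indep = R * p_indep   # ignores that only persons are named "audrey"
```

Because the keyword "audrey" only ever occurs with Person entities in this toy data, the independence-based estimate comes out at half the true result size.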


Problem Definition (2)

 Previous work focuses either on structured or on unstructured query constraints.

– Structured query constraints:

- Graph synopses [3]

- Join samples [4]

- PRMs [5,6]

- …

– Unstructured query constraints:

- Fuzzy string matching [7,8]

- Extraction operators [9,10]

[Figure: query with variable ?x (relation predicates) and ?y (attribute predicate), annotated with correlations.]



 In our previous work [18], we introduced a uniform model (BN+) for hybrid queries:

– Effectiveness Issues:

 Difficulty of capturing all correlations between text and structure

 Pruning text (i.e. vocabulary) using string synopses results in an "information loss"

– Efficiency Issues:

 Data synopsis: a large query-independent BN is constructed offline. It grows exponentially w.r.t. vocabulary size

 Estimation: BN inference over a large synopsis, which is NP-hard.

[18] Wagner, EDBT 2013, Selectivity estimation for hybrid queries over text-rich data graphs

Problem Definition (3)

Motivating Example

There can be many entities of type Person (i.e., bindings for ?p a Person), while only a few entities have the name "Audrey". So, in order to estimate the number of bindings for ?p, a synopsis has to capture statistics for any word associated (via name) with Person entities.

[Figure: data graph and hybrid query.]

Contributions

 We propose a novel approach (TopGuess), which utilizes relational topic models as data synopsis

– summarizing textual data with linear space complexity w.r.t. vocabulary size

– allowing to capture statistics for the complete vocabulary of words by means of topics (no "information loss" due to coarse-grained string synopses)

– capturing correlations between the structure and the text via topics

 TopGuess constructs a small query-specific BN at query time for estimation

– with time complexity independent of the synopsis size

– so no large synopsis is used directly in memory at runtime; instead, a small and compact synopsis is employed for the current query.

 Experiments on real-world data: effectiveness improves by up to 88%, without sacrificing runtime performance.

Data Synopsis

 Uniform synopsis using relational topic models

– Different topic models can be used [19] [20] [21] [22]

 Synopsis Parameters

– Topics: Textual data in a low-dimensional representation via a set of k topics

– Class-Topic Parameter: correlations between a class (e.g. Movie, Person) and topics (represented as a vector for each class)

– Relation-Topic Parameter: correlations between a relation (e.g. starring) and topics (represented as a matrix for each relation)

 Given topics, TopGuess data synopsis has linear space complexity w.r.t. vocabulary (see Thm. 1 in the paper)
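The space claim can be checked with a back-of-the-envelope parameter count. The sketch below assumes that each of the k topics stores one weight per vocabulary word, each class stores a vector of length k, and each relation stores a k × k matrix. The class name `SynopsisShape` and these exact shapes are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynopsisShape:
    """Hypothetical size description of a topic-model-based synopsis."""

    num_topics: int      # k
    vocab_size: int      # |V|
    num_classes: int     # e.g. Movie, Person
    num_relations: int   # e.g. starring

    def parameter_count(self) -> int:
        # Topics: assumed one weight per (topic, word) pair.
        topic_word = self.num_topics * self.vocab_size
        # Class-topic parameter: one vector of length k per class.
        class_topic = self.num_classes * self.num_topics
        # Relation-topic parameter: assumed one k x k matrix per relation.
        relation_topic = self.num_relations * self.num_topics ** 2
        return topic_word + class_topic + relation_topic


small = SynopsisShape(num_topics=10, vocab_size=1_000,
                      num_classes=2, num_relations=1)
large = SynopsisShape(num_topics=10, vocab_size=2_000,
                      num_classes=2, num_relations=1)

# With k fixed, each extra vocabulary word adds exactly k parameters.
growth = large.parameter_count() - small.parameter_count()
```

Under these assumptions only the topic-word term depends on the vocabulary, so for a fixed number of topics the total grows linearly in |V|.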

Synopsis of example data graph using TRM [19]

Probabilistic Component (1)

 TopGuess constructs a small query-specific BN for each query at query time

– Every predicate in the query is represented as an observed random variable in the BN

 Class, relation and string predicates

– Each query variable v (e.g. m, p, l) is also represented as a topical random variable X_v in the BN (e.g. X_m, X_p, X_l)

 Topical random variables are modelled as multinomial distributions over the topics

 So every query variable is perceived as a topic mixture

 Initially, however, the distribution of X_v is unknown (hidden), so it is learned using gradient ascent

– The query-specific BN is acyclic (see Thm. 2 in the paper)
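The construction can be sketched as follows. This is a minimal illustration of the network structure only, not the paper's algorithm: the tuple encoding of the query, the node naming scheme, and the function names are assumptions. Edges run from each topical variable X_v to the predicate variables that mention v, and a standard topological check (Kahn's algorithm) confirms that the resulting graph has no cycle.

```python
from collections import defaultdict

# Hypothetical encoding of a hybrid query as predicate tuples.
QUERY = [
    ("class", "m", "Movie"),
    ("string", "m", "title", "holiday"),
    ("relation", "m", "starring", "p"),
    ("class", "p", "Person"),
    ("string", "p", "name", "audrey"),
]


def build_query_bn(query):
    """Return {topical variable: [predicate variables it points to]}."""
    edges = defaultdict(list)
    for pred in query:
        kind = pred[0]
        if kind == "class":
            _, var, cls = pred
            edges[f"X_{var}"].append(f"X_{cls}")
        elif kind == "string":
            _, var, _attribute, word = pred
            edges[f"X_{var}"].append(f"X_{word}")
        elif kind == "relation":
            _, subj, rel, obj = pred
            # A relation predicate depends on both of its query variables.
            edges[f"X_{subj}"].append(f"X_{rel}")
            edges[f"X_{obj}"].append(f"X_{rel}")
        else:
            raise ValueError(f"unknown predicate kind: {kind}")
    return dict(edges)


def is_acyclic(edges):
    """Kahn's algorithm: a graph is acyclic iff every node can be removed."""
    indegree = defaultdict(int)
    nodes = set(edges)
    for children in edges.values():
        for child in children:
            indegree[child] += 1
            nodes.add(child)
    frontier = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while frontier:
        node = frontier.pop()
        removed += 1
        for child in edges.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                frontier.append(child)
    return removed == len(nodes)


BN = build_query_bn(QUERY)
```

Since every edge in this sketch goes from a topical variable to a predicate variable and never back, the number of nodes grows with the query size only, not with the size of the data synopsis.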

[Figure: hybrid query and its query-specific BN.]

Probabilistic Component (2)

Topical Independence Assumption (TIA)

Given the topical random variables (X_v), all query predicate random variables in the query-specific BN are independent.

 TIA considers that query predicate probabilities depend on (and are governed by) the topics of their associated topical random variables

– For instance, random variable X_holiday depends only on X_m. In other words, given X_m, X_holiday is conditionally independent of all other variables, e.g., X_audrey.


 TIA allows us to easily estimate P(Q) via:
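A hedged sketch of the factorization that TIA implies, in notation assumed here for illustration (not necessarily the paper's): write $X_q$ for the observed variable of query predicate $q$, and $\mathrm{pa}(X_q)$ for the topical variables of the query variables that $q$ mentions (one for a class or string predicate, two for a relation predicate). Conditional independence given the topical variables lets the joint probability split into one factor per predicate:

```latex
P(Q) \;\approx\; P\bigl(\{X_q = T\}_{q \in Q} \,\big|\, \{X_v\}_{v}\bigr)
     \;=\; \prod_{q \in Q} P\bigl(X_q = T \,\big|\, \mathrm{pa}(X_q)\bigr)
```

For the running example, the factor contributed by the keyword "holiday" would be $P(X_{holiday} = T \mid X_m)$.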

Evaluation (1) – Setting

 Data: IMDB [14] and DBLP [15].

– IMDB featured more correlations than DBLP.

– Both datasets have large vocabularies: ~25 million (DBLP) and ~7 million (IMDB) words

 Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.

[13] Spark2: Top-k keyword query in relational databases.

[14] A framework for evaluating database keyword search strategies.

 Systems:

– We used n-gram-based string synopses [10]: random samples of 1-grams, top-k 1-grams, and stratified bloom filters on 1-grams.

– String predicates were integrated via (1) an independence (ind) or (2) a conditional independence (bn) assumption.

– TopGuess
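To illustrate what a top-k 1-gram synopsis keeps and what it discards, here is a minimal sketch. The function names, the sample texts, and the choice to return probability 0 for pruned words are assumptions for illustration; the baselines in the evaluation may handle unseen words differently.

```python
from collections import Counter


def build_top_k_synopsis(texts, k):
    """Keep counts for the k most frequent 1-grams (words) only."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return dict(counts.most_common(k)), total


def word_probability(synopsis, total, word):
    """Relative frequency of a word; pruned words fall back to 0 (assumed)."""
    return synopsis.get(word.lower(), 0) / total


# Hypothetical attribute values from a small data graph.
TEXTS = ["roman holiday", "holiday inn", "the holiday", "audrey"]

synopsis, total = build_top_k_synopsis(TEXTS, k=1)
p_holiday = word_probability(synopsis, total, "holiday")
p_audrey = word_probability(synopsis, total, "audrey")
```

With k = 1 only "holiday" survives, so a query on "audrey" can no longer be distinguished from a query on a word that never occurs. This is the kind of information loss that a synopsis covering the complete vocabulary avoids.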

Evaluation (2) – Setting (2)

 Synopsis size:

– We employ baselines with varying synopsis size by varying # words captured by the string synopsis

– Overall synopsis size depends mainly on string synopsis size.

– Synopses sizes ∈ {2, 4, 20, 40} MByte in memory.

– In contrast, TopGuess keeps a large topic model (281 MB for IMDB and 229 MB for DBLP) on disk and constructs a small, query-specific BN in memory at runtime (~100 KBytes)

 Metrics:

– Efficiency: selectivity estimation time.

– Effectiveness: multiplicative error [17].

[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
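A minimal sketch of the effectiveness metric, assuming the common definition of multiplicative error as the ratio of the larger to the smaller of the true and estimated cardinalities. The function name and the rejection of non-positive inputs are assumptions made here for illustration.

```python
def multiplicative_error(true_card: float, est_card: float) -> float:
    """Return max(true, est) / min(true, est); 1.0 means a perfect estimate."""
    if true_card <= 0 or est_card <= 0:
        # Zero cardinalities need special handling that is not covered here.
        raise ValueError("cardinalities must be positive")
    return max(true_card, est_card) / min(true_card, est_card)


under = multiplicative_error(100, 25)   # underestimate by a factor of 4
over = multiplicative_error(25, 100)    # overestimate by a factor of 4
exact = multiplicative_error(10, 10)
```

The metric is symmetric, so it penalizes over- and underestimation by the same factor equally.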

Evaluation (3) – Results

Conclusion

 We proposed a holistic approach (TopGuess) for selectivity estimation of hybrid queries.

– TopGuess uses RTMs with linear space complexity w.r.t. vocabulary

– Compact query-specific BN as probabilistic component enables estimation independent from synopsis size

– Empirical studies on real-world data achieved strong effectiveness improvements, while not requiring additional runtime.

 Future work:

– Extending TopGuess to a more generic selectivity estimation approach for RDF data and BGP queries

– Replacing the topic models in our data synopsis with different application-specific synopses (e.g. streaming RDF data)

References (1)

[1] Christian Bizer et al: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, Pages 154–165, 2009.


[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999.

[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006.

[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.

[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011.

[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.

[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.

References (2)

[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007.

[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011.

[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.

[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011.

[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729




[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.

[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.

[18] A. Wagner, V. Bicer, T. Tran: Selectivity estimation for hybrid queries over text-rich data graphs. EDBT 2013: 383-394

References (3)

[19] V. Bicer, T. Tran, Y. Ma, and R. Studer. TRM - Learning Dependencies between Text and Structure with Topical Relational Models. In ISWC, 2013.

[20] J. Chang and D. Blei. Relational Topic Models for Document Networks. In AIStats, 2009.

[21] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: Joint Models of Topic and Author Community. In ICML, 2009.

[22] L. Zhang et al. Multirelational Topic Models. In ICDM, 2009.
