10 Years of Probabilistic Querying – What Next?

Martin Theobald
University of Antwerp

Joint work with
Maximilian Dylla, Sairam Gurajada, Angelika Kimmig, Andre Melo,
Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek
"The important thing is to not stop questioning ...
One cannot help but be in awe when contemplating
the mysteries of eternity, of life, of the marvelous
structure of reality. It is enough if one tries
merely to comprehend a little of this mystery
every day."
- Albert Einstein, 1936

"The Marvelous Structure of Reality"
Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego
Look, There is Structure!

C1 Text is not just "unstructured data"

Plethora of natural-language-processing techniques & tools:
- Part-Of-Speech (POS) Tagging
- Named-Entity Recognition & Disambiguation (NERD)
- Dependency Parsing
- Semantic Role Labeling

But:
- Even the best NLP tools frequently yield errors
- Facts found on the Web are logically inconsistent
- Web-extracted knowledge bases are inherently incomplete
Information Extraction

YAGO/DBpedia et al.:
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
>120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

New fact candidates:
type(Jeff, Author)[0.9]
author(Jeff, Drag_Book)[0.8]
author(Jeff, Cind_Book)[0.6]
worksAt(Jeff, Bell_Labs)[0.7]
type(Jeff, CEO)[0.4]
100's M additional facts from Wikipedia free-text
YAGO Knowledge Base
3 M entities, 120 M facts, 100 relations, 200k classes; fact accuracy ≈ 95%

[Figure: excerpt of the YAGO knowledge graph. A class taxonomy (Entity, Organization, Person, Location, Scientist, Biologist, Physicist, Politician, City, State, Country) is connected via subclass and instanceOf edges; entities such as Max_Planck, Erwin_Planck, the Max_Planck Society, Angela Merkel, Kiel, SchleswigHolstein, Germany, and the Nobel Prize are linked via relations like bornOn/diedOn (Apr 23, 1858; Oct 4, 1947; Oct 23, 1944), bornIn, hasWon, fatherOf, citizenOf, and locatedIn; "means" edges map surface names like "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", and "Angela Dorothea Merkel" to their entities.]

http://www.mpi-inf.mpg.de/yago-naga/
Linked Open Data

As of Sept. 2011:
>200 linked-data sources
>30 billion RDF triples
>400 million owl:sameAs links

http://linkeddata.org/
Maybe Even More Importantly: Linked Vocabularies!

- LinkedData.org: instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more…
- Schema.org: common vocabulary released by Google, Yahoo!, BING to annotate Web pages, incl. links to DBpedia.
- Micro-Formats: RDFa (W3C)

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      version="XHTML+RDFa 1.0" xml:lang="en">
<head><title>Martin's Home Page</title>
  <base href="http://adrem.ua.ac.be/~tmartin/" />
  <meta property="dc:creator" content="Martin" />
</head>

Source: http://en.wikipedia.org/wiki/Linked_data

As of Sept. 2011:
>5 million owl:sameAs links between DBpedia/YAGO/Freebase
Application I: Enrichment of Search Results

"Recent Advances in Structured Data and the Web."
Alon Y. Halevy, Keynote at ICDE 2013, Brisbane
Application II: Machine Reading

[Annotated plot summary of "The Girl with the Dragon Tattoo"; the labels overlaid in the original slide mark extracted relations such as same, uncleOf, owns, hires, enemyOf, affairWith, and headOf:]

It's about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is "the perfect victim for anyone who wished her ill."

Etzioni, Banko, Cafarella: Machine Reading. AAAI'06
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI'10
Application III: Natural-Language Question Answering

evi.com (formerly trueknowledge.com)

wolframalpha.com:
>10 trillion(!) facts
>50,000 search algorithms
>5,000 visualizations
IBM Watson: Deep Question Answering

Example Jeopardy! clues:
- William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
- This town is known as "Sin City" & its downtown is "Glitter Gulch"
- As of 2010, this is the only former Yugoslav republic in the EU
- 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

Pipeline: question classification & decomposition over various knowledge back-ends.

D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
www.ibm.com/innovation/us/watson/index.htm
Natural-Language QA over Linked Data

Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13
http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
<question id="4" answertype="resource" aggregation="false" onlydbo="true">
<string lang="en">Which river does the Brooklyn Bridge cross?</string>
<string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
<string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
<string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
<string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
<string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
<keywords lang="en">river, cross, Brooklyn Bridge</keywords>
<keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
<keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
<keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
<keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
<keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
<query>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
SELECT DISTINCT ?uri
WHERE {
res:Brooklyn_Bridge dbo:crosses ?uri .
}
</query>
</question>
Natural-Language QA over Linked Data

INEX Linked Data Track, CLEF 2012-13
https://inex.mmci.uni-saarland.de/tracks/lod/
<topic id="2012374" category="Politics">
<jeopardy_clue>Which German politician is a successor of another politician
who stepped down before his or her actual term was over,
and what is the name of their political ancestor?</jeopardy_clue>
<keyword_title>German politicians successor other stepped down before actual
term name ancestor</keyword_title>
<sparql_ft>
SELECT ?s ?s1 WHERE {
?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
?s1 <http://dbpedia.org/property/successor> ?s .
FILTER FTContains (?s, "stepped down early") .
}
</sparql_ft>
</topic>
Outline

- Probabilistic Databases
  - Stanford's Trio System: Data, Uncertainty & Lineage
  - Handling Uncertain RDF Data: URDF (Max-Planck-Institute/U-Antwerp)
- Probabilistic & Temporal Databases
  - Sequenced vs. Non-Sequenced Semantics
  - Interval Alignment & Probabilistic Inference
- Statistical Relational Learning & Probabilistic Programming
  - Learning "Interesting" Deduction Rules
- Summary & Challenges
Probabilistic Databases: A Panacea for All of the Aforementioned Tasks?

C2 Probabilistic databases combine first-order logic and probability theory in an elegant way:

- Declarative: Queries formulated in SQL/Relational Algebra/Datalog; support for updates, transactions, etc.
- Deductive: Well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization
- Scalable (?): Polynomial data complexity (SQL), but #P-complete for the probabilistic inference
Probabilistic Database

A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Possible worlds of WorksAt(Sub, Obj):
  D1: {(Jeff, Stanford), (Jeff, Princeton)}   p = 0.42
  D2: {(Jeff, Stanford)}                      p = 0.18
  D3: {(Jeff, Princeton)}                     p = 0.28
  D4: {}                                      p = 0.12

Special Cases:

(I) Tuple-independent PDB:
  WorksAt(Sub, Obj)      p
  Jeff     Stanford      0.6
  Jeff     Princeton     0.7

(II) Block-independent PDB:
  WorksAt(Sub, Obj)      p
  Jeff     Stanford      0.6
  Jeff     Princeton     0.4

Note: (I) and (II) are not equivalent!

Query Semantics ("Marginal Probabilities"):
Run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di where t exists.
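As an aside, this query semantics is directly executable on toy examples. The following Python sketch (hypothetical relation names and probabilities, mirroring the tuple-independent table above) enumerates all possible worlds and sums up the world probabilities per answer tuple:

from itertools import product

# Tuple-independent PDB: each tuple exists independently with probability p.
pdb = {("Jeff", "Stanford"): 0.6, ("Jeff", "Princeton"): 0.7}

def worlds(pdb):
    """Enumerate all deterministic instances D_i with their probabilities."""
    tuples = list(pdb)
    for bits in product([True, False], repeat=len(tuples)):
        world = {t for t, b in zip(tuples, bits) if b}
        p = 1.0
        for t, b in zip(tuples, bits):
            p *= pdb[t] if b else 1.0 - pdb[t]
        yield world, p

def marginal(pdb, query):
    """P(t is an answer) = sum of P(D_i) over all worlds where query yields t."""
    answers = {}
    for world, p in worlds(pdb):
        for t in query(world):
            answers[t] = answers.get(t, 0.0) + p
    return answers

# Query: who works at Stanford?
q = lambda world: {sub for (sub, obj) in world if obj == "Stanford"}
print(marginal(pdb, q))   # {'Jeff': 0.6}

For the two tuples above, this reproduces the four worlds with probabilities 0.42, 0.18, 0.28, and 0.12, and a marginal of 0.6 for Jeff working at Stanford.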
Stanford Trio System [Widom: CIDR 2005]

Uncertainty-Lineage Databases (ULDBs):
1. Alternatives
2. '?' (Maybe) Annotations
3. Confidence values
4. Lineage
Trio's Data Model

1. Alternatives: uncertainty about value

Saw (witness, color, car)
Amy: (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda)

Three possible instances
Trio's Data Model

1. Alternatives
2. '?' (Maybe): uncertainty about presence

Saw (witness, color, car)
Amy:   (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda)
Betty: (blue, Acura) ?

Six possible instances
Trio's Data Model

1. Alternatives
2. '?' (Maybe) Annotations
3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy:   (red, Honda) 0.5 ∥ (red, Toyota) 0.3 ∥ (orange, Mazda) 0.2
Betty: (blue, Acura) 0.6 ?

Still six possible instances, each with a probability
So Far: Model is Not Closed

Saw (witness, car)
Cathy: Honda ∥ Mazda

Drives (person, car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)

Suspects = πperson(Saw ⋈ Drives)

Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?

Does not correctly capture the possible instances in the result!
Example with Lineage

ID  Saw (witness, car)
11  Cathy: Honda ∥ Mazda

ID  Drives (person, car)
21  (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22  (Billy, Honda) ∥ (Frank, Honda)
23  (Hank, Honda)

Suspects = πperson(Saw ⋈ Drives)

ID  Suspects
31  Jimmy ?
32  Billy ∥ Frank ?
33  Hank ?

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1);  λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

With lineage, the representation correctly captures the possible instances in the result (4).
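To see that the lineage formulas pin down exactly these four result instances, here is a small Python sketch (the encoding of alternatives as (tuple ID, alternative) pairs is hypothetical but follows the λ-notation above):

from itertools import product

# Which alternative (1-based) each x-tuple can take.
alternatives = {11: [1, 2], 21: [1, 2], 22: [1, 2], 23: [1]}

# Lineage of each result alternative: a conjunction of (tuple ID, alt) pairs.
lineage = {
    ("Jimmy", 31): [(11, 2), (21, 2)],
    ("Billy", 32): [(11, 1), (22, 1)],
    ("Frank", 32): [(11, 1), (22, 2)],
    ("Hank",  33): [(11, 1), (23, 1)],
}

# Enumerate all combinations of alternatives -> possible result instances.
ids = list(alternatives)
results = set()
for choice in product(*alternatives.values()):
    world = dict(zip(ids, choice))
    suspects = frozenset(person for (person, _), conj in lineage.items()
                         if all(world[tid] == alt for tid, alt in conj))
    results.add(suspects)

for r in sorted(results, key=sorted):
    print(set(r) or "{}")
# Exactly 4 distinct instances: {}, {Billy, Hank}, {Frank, Hank}, {Jimmy}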
Operational Semantics

[Diagram: a probabilistic database Dp denotes possible instances D1, D2, …, Dn; running Q on each instance yields result instances D1′, D2′, …, Dm′; a direct implementation instead evaluates Q over the representation Dp to obtain a representation Dp′ of exactly these result instances.]

Closure: the result representation Dp′ always exists.
Completeness: any (finite) set of possible instances can be represented.

But: data complexity is #P-complete!
Summary on Trio's Data Model

Uncertainty-Lineage Databases (ULDBs):
1. Alternatives
2. '?' (Maybe) Annotations
3. Confidence values
4. Lineage

Theorem: ULDBs are closed and complete.

Formally studied properties like minimization, equivalence, approximation, and membership based on lineage.

[Benjelloun, Das Sarma, Halevy, Theobald, Widom: VLDB-J 2008]
Basic Complexity Issue
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"]

Theorem [Valiant 1979]:
For a Boolean expression E, computing Pr(E) is #P-complete.

NP = class of problems of the form "is there a witness?" → SAT
#P = class of problems of the form "how many witnesses?" → #SAT

The decision problem for 2CNF is in PTIME, but the counting problem for 2CNF is already #P-complete.
(We will come back to this later…)
…back to Information Extraction

bornIn(Barack, Honolulu)
bornIn(Barack, Kenya)
Uncertain RDF (URDF): Facts & Rules

Extensional Knowledge (the "facts"):
- High-confidence facts: existing knowledge base ("ground truth")
- New fact candidates: extracted fact candidates with confidences
- Linked Data & integration of various knowledge sources: ontology merging or explicitly linked facts (owl:sameAs, owl:equivalentProperty)
→ Large "Probabilistic Database" of RDF facts

Intensional Knowledge (the "rules"):
- Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
- Hard rules: consistency constraints (more general FOL rules)
→ Propositional & probabilistic inference, at query time!
Soft Rules vs. Hard Rules

(Soft) Deduction Rules (deductive database: Datalog, core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc.):

- People may live in more than one place:
  livesIn(x,y) ← marriedTo(x,z) ∧ livesIn(z,y)  [0.8]
  livesIn(x,y) ← hasChild(x,z) ∧ livesIn(z,y)   [0.5]

(Hard) Consistency Constraints (more general FOL constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc.):

- People are not born in different places/on different dates:
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
  bornOn(x,y) ∧ bornOn(x,z) ⇒ y=z

- People are not married to more than one person (at the same time, in most countries?):
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
URDF Running Example

KB Rules:
- hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)  [0.4]   (soft)
- graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z                (hard)

RDF Base Facts:
[Figure: a small graph over Jeff, Surajit, David, Stanford, and Princeton with weighted edges]
- type(Jeff, Computer_Scientist) [1.0]; type(Surajit, Computer_Scientist) [1.0]; type(David, Computer_Scientist) [1.0]
- type(Stanford, University) [1.0]; type(Princeton, University) [1.0]
- hasAdvisor(Surajit, Jeff) [0.8]; hasAdvisor(David, Jeff) [0.7]
- worksAt(Jeff, Stanford) [0.9]
- graduatedFrom(Surajit, Princeton) [0.7]; graduatedFrom(Surajit, Stanford) [0.6]; graduatedFrom(David, Princeton) [0.9]

Derived Facts (confidence [?] to be inferred):
- gradFr(Surajit, Stanford)
- gradFr(David, Stanford)
Basic Types of Inference

- MAP Inference
  - Find the most likely assignment to the query variables y under given evidence x.
  - Compute: argmax_y P(y | x)  (NP-complete for MaxSAT)

- Marginal/Success Probabilities
  - Probability that query y is true in a random world under given evidence x.
  - Compute: ∑y P(y | x)  (#P-complete already for conjunctive queries)
General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

1) Grounding: consider only the facts (and rules) which are relevant for answering the query.

2) Propositional formula in CNF, consisting of grounded soft & hard rules and weighted base facts:

(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))                       [1000]
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))                         [1000]
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))       [0.4]
∧ worksAt(Jeff, Stanford)            [0.9]
∧ hasAdvisor(Surajit, Jeff)          [0.8]
∧ hasAdvisor(David, Jeff)            [0.7]
∧ graduatedFrom(Surajit, Princeton)  [0.7]
∧ graduatedFrom(Surajit, Stanford)   [0.6]
∧ graduatedFrom(David, Princeton)    [0.9]

3) Propositional Reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized.
→ MAP inference: compute the "most likely" possible world.
URDF: MaxSAT Solving with Soft & Hard Rules
[Theobald,Sozio,Suchanek,Nakashole: VLDS'12]

Special case: Horn clauses as soft rules & mutex constraints as hard rules.

S: Mutex constraints
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: Weighted Horn clauses (CNF)
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))     [0.4]
∧ worksAt(Jeff, Stanford)            [0.9]
∧ hasAdvisor(Surajit, Jeff)          [0.8]
∧ hasAdvisor(David, Jeff)            [0.7]
∧ graduatedFrom(Surajit, Princeton)  [0.7]
∧ graduatedFrom(Surajit, Stanford)   [0.6]
∧ graduatedFrom(David, Princeton)    [0.9]

Find: argmax_y P(y | x); resolves to a variant of MaxSAT for propositional formulas.

MaxSAT Algorithm:
  Compute W0 = ∑_{clauses C} w(C) · P(C is satisfied);
  For each hard constraint S_t {
    For each fact f in S_t {
      Compute W_f,t+ = ∑_{clauses C} w(C) · P(C is sat. | f = true);
    }
    Compute W_t- = ∑_{clauses C} w(C) · P(C is sat. | all facts in S_t = false);
    Choose the truth assignment to the facts in S_t that maximizes W_f,t+, W_t-;
    Remove satisfied clauses C;
    t++;
  }

- Runtime: O(|S||C|)
- Approximation guarantee of 1/2
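A runnable sketch of this greedy scheme (hypothetical encoding: a clause is a weight plus a map from facts to their required truth values; unassigned facts are treated as independent with their prior confidences):

def p_clause_sat(clause, fixed, prior):
    # Probability that a disjunctive clause is satisfied, given the facts
    # fixed so far and independent priors for the remaining facts.
    p_all_false = 1.0
    for fact, pos in clause.items():
        if fact in fixed:
            if fixed[fact] == pos:
                return 1.0          # clause already satisfied
            continue                # this literal is already false
        p_lit_true = prior[fact] if pos else 1.0 - prior[fact]
        p_all_false *= 1.0 - p_lit_true
    return 1.0 - p_all_false

def expected_weight(clauses, fixed, prior):
    return sum(w * p_clause_sat(c, fixed, prior) for w, c in clauses)

def greedy_maxsat(clauses, mutex_sets, prior):
    # Per mutex set: either set exactly one fact true (W_f,t+) or all of its
    # facts false (W_t-), whichever maximizes the expected total clause weight.
    fixed = {}
    for S in mutex_sets:
        options = [dict.fromkeys(S, False)]
        options += [{g: g == f for g in S} for f in S]
        best = max(options,
                   key=lambda o: expected_weight(clauses, {**fixed, **o}, prior))
        fixed.update(best)
    return fixed

prior = {"gradSurStan": 0.6, "gradSurPri": 0.7,
         "advSurJeff": 0.8, "worksJeffStan": 0.9}
clauses = [  # Horn clause advisor & worksAt -> graduatedFrom, plus unit clauses
    (0.4, {"advSurJeff": False, "worksJeffStan": False, "gradSurStan": True}),
    (0.6, {"gradSurStan": True}), (0.7, {"gradSurPri": True}),
    (0.8, {"advSurJeff": True}), (0.9, {"worksJeffStan": True}),
]
print(greedy_maxsat(clauses, [{"gradSurStan", "gradSurPri"}], prior))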
Experiment (I): MAP Inference

- YAGO Knowledge Base: 2 Mio entities, 20 Mio facts
- Query Answering: deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
- Asymptotic runtime checks via synthetic (random) soft-rule expansions

[Plots: URDF grounding & MaxSAT solving runtimes as a function of |C| (# literals in grounded soft rules) and |S| (# literals in grounded hard rules); URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT).]
Basic Types of Inference

- MAP Inference ✔
  - Find the most likely assignment to the query variables y under given evidence x.
  - Compute: argmax_y P(y | x)  (NP-complete for MaxSAT)

- Marginal/Success Probabilities
  - Probability that query y is true in a random world under given evidence x.
  - Compute: ∑y P(y | x)  (#P-complete already for conjunctive queries)
Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)
[Yahya,Theobald: RuleML'11; Dylla,Miliaraki,Theobald: ICDE'13]

Query: graduatedFrom(Surajit, y)

Rules:
- hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)  [0.4]
- graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z

Base Facts:
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]

Grounding yields two answers with propositional lineage over the base facts
A = graduatedFrom(Surajit, Princeton), B = graduatedFrom(Surajit, Stanford),
C = hasAdvisor(Surajit, Jeff), D = worksAt(Jeff, Stanford):

Q1 = graduatedFrom(Surajit, Princeton):  A ∧ ¬(B ∨ (C∧D))
Q2 = graduatedFrom(Surajit, Stanford):   ¬A ∧ (B ∨ (C∧D))
Lineage & Possible Worlds
[Das Sarma,Theobald,Widom: ICDE'08; Dylla,Miliaraki,Theobald: ICDE'13]

Query: graduatedFrom(Surajit, y)

1) Deductive Grounding:
- Build the dependency graph of the query.
- Trace the lineage of individual query answers:
  Q1 = A ∧ ¬(B ∨ (C∧D)),  Q2 = ¬A ∧ (B ∨ (C∧D))

2) Lineage DAG (not in CNF), consisting of
- grounded soft & hard rules
- probabilistic base facts:
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]

3) Probabilistic Inference: compute marginals.
- P(C∧D) = 0.8 × 0.9 = 0.72
- P(B ∨ (C∧D)) = 1 − (1−0.72) × (1−0.6) = 0.888
- P(Q1) = 0.7 × (1−0.888) = 0.078
- P(Q2) = (1−0.7) × 0.888 = 0.266
- P(Q): sum up the probabilities of all possible worlds that entail the query answer's lineage.
- P(Q|H): drop the "impossible worlds".
Possible Worlds Semantics

Lineage: Q2 = ¬A ∧ (B ∨ (C∧D));  Hard rule H: ¬A ∨ ¬(B ∨ (C∧D))
(with P(A)=0.7, P(B)=0.6, P(C)=0.8, P(D)=0.9)

 A  B  C  D | Q2 | P(W)
 1  1  1  1 |  0 | 0.7×0.6×0.8×0.9 = 0.3024
 1  1  1  0 |  0 | 0.7×0.6×0.8×0.1 = 0.0336
 1  1  0  1 |  0 | 0.0756
 1  1  0  0 |  0 | 0.0084
 1  0  1  1 |  0 | 0.2016
 1  0  1  0 |  0 | 0.0224
 1  0  0  1 |  0 | 0.0504
 1  0  0  0 |  0 | 0.0056
 0  1  1  1 |  1 | 0.3×0.6×0.8×0.9 = 0.1296
 0  1  1  0 |  1 | 0.3×0.6×0.8×0.1 = 0.0144
 0  1  0  1 |  1 | 0.3×0.6×0.2×0.9 = 0.0324
 0  1  0  0 |  1 | 0.3×0.6×0.2×0.1 = 0.0036
 0  0  1  1 |  1 | 0.3×0.4×0.8×0.9 = 0.0864
 0  0  1  0 |  0 | 0.0096
 0  0  0  1 |  0 | 0.0216
 0  0  0  0 |  0 | 0.0024
                   (∑ P(W) = 1.0)

P(Q1) = 0.0784,  P(Q2) = 0.2664
P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2|H) = 0.2664 / 0.412 = 0.6466
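The marginals above can be re-derived mechanically. A minimal Python sketch (assuming independent base facts A, B, C, D with the probabilities from the table):

from itertools import product

p = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}   # base-fact probabilities

# Propositional lineage from the grounding step above.
Q1 = lambda w: w["A"] and not (w["B"] or (w["C"] and w["D"]))
Q2 = lambda w: not w["A"] and (w["B"] or (w["C"] and w["D"]))

def marginal(phi):
    """Sum the probabilities of all possible worlds that satisfy phi."""
    total = 0.0
    for bits in product([True, False], repeat=len(p)):
        w = dict(zip(p, bits))
        if phi(w):
            pw = 1.0
            for f, b in w.items():
                pw *= p[f] if b else 1.0 - p[f]
            total += pw
    return total

print(round(marginal(Q1), 4), round(marginal(Q2), 4))   # 0.0784 0.2664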
Inference in Probabilistic Databases

- Safe query plans [Dalvi,Suciu: VLDB-J'07]
  - Can propagate confidences along with the relational operators.
- Read-once functions [Sen,Deshpande,Getoor: PVLDB'10]
  - Can factorize a Boolean formula (in polynomial time) into read-once form, where every variable occurs at most once.
- Knowledge compilation [Olteanu et al.: ICDT'10, ICDT'11]
  - Can decompose a Boolean formula into an ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
- Top-k pruning [Ré,Dalvi,Suciu: ICDE'07; Karp,Luby,Madras: J-Alg.'89]
  - Can return the top-k answers based on lower and upper bounds, even without knowing their exact marginal probabilities.
  - Multi-Simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.
Monte Carlo Simulation (I)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp,Luby,Madras: J-Alg.'89]

Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling:
  cnt = 0
  repeat N times:
    randomly choose X1, X2, X3 ∈ {0,1}
    if E(X1, X2, X3) = 1 then cnt = cnt + 1
  P = cnt / N
  return P  /* estimate for the true Pr(E) */

Zero/One-Estimator Theorem: If N ≥ (1/Pr(E)) × (4 ln(2/δ) / ε²),
then: Pr[ |P/Pr(E) − 1| > ε ] < δ.

Works for any E, but N may be very big for small Pr(E) (not in PTIME).
Monte Carlo Simulation (II)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp,Luby,Madras: J-Alg.'89]

Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm

Importance sampling:
  cnt = 0; S = Pr(C1) + … + Pr(Cm)
  repeat N times:
    randomly choose i ∈ {1,2,…,m} with probability Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1 = 0 and C2 = 0 and … and Ci−1 = 0 then cnt = cnt + 1
  P = S × cnt / N
  return P  /* estimate for the true Pr(E) */

Theorem: If N ≥ m × (4 ln(2/δ) / ε²), then: Pr[ |P/Pr(E) − 1| > ε ] < δ.

This is better: for E in DNF, approximation works in PTIME.
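A runnable sketch of this Karp-Luby estimator (hypothetical encoding: each DNF clause maps variables to their required truth values; the variables are assumed independent):

import random

def karp_luby(clauses, p, N=100_000, seed=42):
    # Estimate Pr(E) for E = C1 v ... v Cm in DNF. Each clause is a dict
    # {var: required truth value}; vars are independent with Pr(var) = p[var].
    rng = random.Random(seed)

    def pr_clause(c):
        prob = 1.0
        for v, pos in c.items():
            prob *= p[v] if pos else 1.0 - p[v]
        return prob

    weights = [pr_clause(c) for c in clauses]
    S = sum(weights)
    cnt = 0
    for _ in range(N):
        # 1) pick clause i with probability Pr(Ci)/S
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # 2) sample a world conditioned on Ci being true
        world = dict(clauses[i])
        for v in p:
            if v not in world:
                world[v] = rng.random() < p[v]
        # 3) count only if no earlier clause is already satisfied
        if not any(all(world[v] == pos for v, pos in clauses[j].items())
                   for j in range(i)):
            cnt += 1
    return S * cnt / N

# E = X1X2 v X1X3 v X2X3 with Pr(Xi) = 0.5  ->  Pr(E) = 0.5
clauses = [{"X1": True, "X2": True}, {"X1": True, "X3": True},
           {"X2": True, "X3": True}]
print(karp_luby(clauses, {"X1": 0.5, "X2": 0.5, "X3": 0.5}))   # ~0.5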
Top-k Ranking by Marginal Probabilities
[Dylla,Miliaraki,Theobald: ICDE'13]

Query: graduatedFrom(Surajit, y)

- Datalog/SLD resolution: top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before the rules are fully grounded.
- Subgoals may represent entire sets of answer candidates.

First-order lineage formulas (partially grounded):
  Φ(Q1) = A
  Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y)

with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], and the remaining subgoal still to be resolved against hasAdvisor(Surajit, Jeff) [0.8] ∧ worksAt(Jeff, Stanford) [0.9].

→ Prune entire sets of answer candidates represented by Φ.
Bounds for First-Order Formulas
[Dylla,Miliaraki,Theobald: ICDE'13]

Theorem 1: Given a (partially grounded) first-order lineage formula Φ, e.g.
  Φ(Q2) = B ∨ ∃y gradFrom(S,y):

- Lower bound Plow (for all query answers that can be obtained from grounding Φ):
  substitute ∃y gradFrom(S,y) with false (or true if negated).
  Plow(Q2) = P(B ∨ false) = P(B) = 0.6
- Upper bound Pup (for all query answers that can be obtained from grounding Φ):
  substitute ∃y gradFrom(S,y) with true (or false if negated).
  Pup(Q2) = P(B ∨ true) = P(true) = 1.0

Proof (sketch): Substituting a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substituting it with true increases that number.
Convergence of Bounds
[Dylla,Miliaraki,Theobald: ICDE'13]

Theorem II: Let Φ1,…,Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into Pi,low and Pi,up creates a monotonic series of lower and upper bounds that converges to P(φ).

0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C∧D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1

Proof (sketch, via induction): Substituting true by a subformula reduces the number of models that satisfy Φ; substituting false by a subformula increases this number.
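In code, Theorem 1's substitutions amount to evaluating the partially grounded formula twice. A minimal sketch for Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y), assuming the open subgoal is independent of B:

P = {"B": 0.6}

def phi_Q2(subgoal_value):
    # P(B ∨ s) = 1 - (1 - P(B)) * (1 - P(s)) for the substituted value s.
    return 1.0 - (1.0 - P["B"]) * (1.0 - subgoal_value)

P_low = phi_Q2(0.0)   # substitute the open subgoal with false -> 0.6
P_up  = phi_Q2(1.0)   # substitute the open subgoal with true  -> 1.0
print(P_low, P_up)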
Top-k Pruning ("Fagin's Algorithm")
[Fagin et al.'01; Balke,Kießling'02; Dylla,Miliaraki,Theobald: ICDE'13]

- Maintain two disjoint queues: a Top-k queue sorted by Plow and a Candidates queue sorted by Pup.
- Return the top-k queue at the t-th grounding step when:
  min { Pt,low(Qk) | Qk ∈ Top-k }  >  max { Pt,up(Qj) | Qj ∈ Candidates }

[Plot: over the SLD grounding steps t, the bounds P1,up(Qj) ≥ P2,up(Qj) ≥ … ≥ Pn,up(Qj) and P1,low(Qj) ≤ P2,low(Qj) ≤ … ≤ Pn,low(Qj) of a candidate Qj converge; Qj is dropped from the Candidates queue as soon as its upper bound falls below the k-th lower bound.]
Top-k Stopping Condition ("Fagin's Algorithm")
[Fagin et al.'01; Balke,Kießling'02; Dylla,Miliaraki,Theobald: ICDE'13]

[Plot (k=2): at SLD step t, the lower bounds Pt,low(Q1) and Pt,low(Q2) of the current top-2 answers exceed the upper bounds Pt,up(Qm) of all remaining candidates; stop and return the top-2 query answers.]
Experiment (II): Computing Marginals

- IMDB data with 26 Mio facts about movies, directors, actors, etc.
- 4 query patterns, each instantiated to 1,000 queries (showing runtime averages):
  - Q1: safe, non-repeating hierarchical
  - Q2: unsafe, repeating hierarchical
  - Q3: unsafe, head-hierarchical
  - Q4: general unsafe

[Plot: average runtimes (ms, log scale from 1 to 100,000) of Top-10/Top-20/Top-50 pruning and MultiSim Top-10/20/50 vs. Postgres, MayBMS, and Trio over the four query patterns.]
Experiment (II): Computing Marginals

- IMDB data set, 26 Mio facts

[Plots: runtime vs. number of top-k results for a single join query; percentage of tuples scanned from the input relations.]
Basic Types of Inference

- MAP Inference ✔
  - Find the most likely assignment to the query variables y under given evidence x.
  - Compute: argmax_y P(y | x)  (NP-complete for MaxSAT)

- Marginal/Success Probabilities ✔
  - Probability that query y is true in a random world under given evidence x.
  - Compute: ∑y P(y | x)  (#P-complete already for conjunctive queries)
Probabilistic & Temporal Database

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T.

BornIn(Sub, Obj)       T            p
DeNiro   Greenwich     [1943,1944)  0.9
DeNiro   Tribeca       [1998,1999)  0.6

Wedding(Sub, Obj)      T            p
DeNiro   Abbott        [1936,1940)  0.3
DeNiro   Abbott        [1976,1977)  0.7

Divorce(Sub, Obj)      T            p
DeNiro   Abbott        [1988,1989)  0.8

Sequenced Semantics & Snapshot Reducibility [Dignös, Gamper, Böhlen: SIGMOD'12]:
- Built-in semantics: reduce temporal-relational operators to their non-temporal counterparts at each snapshot of the database.
- Coalesce/split tuples with consecutive time intervals based on their lineages.

Non-Sequenced Semantics [Dylla,Miliaraki,Theobald: PVLDB'13]:
- Queries can freely manipulate timestamps just like regular attributes.
- A single temporal operator ≤T supports all of Allen's 13 temporal relations.
- Deduplicate tuples with overlapping time intervals based on their lineages.
Temporal Alignment & Deduplication

Base Facts:
f1 = Wedding(DeNiro, Abbott) @ 1936
f2 = Wedding(DeNiro, Abbott) @ 1976
f3 = Divorce(DeNiro, Abbott) @ 1988

Non-Sequenced Semantics:
MarriedTo(X,Y)[Tb1,tmax) ← Wedding(X,Y)[Tb1,Te1) ∧ ¬Divorce(X,Y)[Tb2,Te2)
MarriedTo(X,Y)[Tb1,Te2) ← Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2

[Figure: over the time axis tmin … 1936 … 1976 … 1988 … tmax, the rules deduce MarriedTo facts with lineages f1∧f3, f1∧¬f3, f2∧f3, f2∧¬f3 on the respective sub-intervals; deduplication then disjoins the lineages of overlapping intervals, e.g. (f1∧f3) ∨ (f1∧¬f3) and (f2∧f3) ∨ (f2∧¬f3).]
Inference in Temporal-Probabilistic Databases
[Wang,Yahya,Theobald: MUD'10; Dylla,Miliaraki,Theobald: PVLDB'13]

Rule:
teamMates(Beckham, Ronaldo, T3) ← playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2, T3)

[Figure: the base facts playsFor(Beckham, Real, T1) and playsFor(Ronaldo, Real, T2) are probability histograms over intervals between '00 and '07 (with probabilities 0.1, 0.2, 0.4, 0.2 and 0.4, 0.6); the derived fact teamMates(Beckham, Ronaldo, T3) obtains the product probabilities 0.08, 0.16, and 0.12 on the overlapping sub-intervals between '03 and '07.]
Inference in Temporal-Probabilistic Databases
[Wang,Yahya,Theobald: MUD'10; Dylla,Miliaraki,Theobald: PVLDB'13]

[Figure: adding the base fact playsFor(Zidane, Real, T3) yields the further derived facts teamMates(Beckham, Zidane, T5) and teamMates(Ronaldo, Zidane, T6); derived facts that share base facts become non-independent, while facts over disjoint base facts remain independent.]
Inference in Temporal-Probabilistic Databases
[Wang,Yahya,Theobald: MUD'10; Dylla,Miliaraki,Theobald: PVLDB'13]

- Closed and complete representation model (incl. lineage)
- Temporal alignment is linear in the number of input intervals
- Confidence computation per interval remains #P-hard
- In general requires Monte-Carlo approximations (Luby-Karp for DNF, MCMC-style sampling), decompositions, or top-k pruning
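The alignment step itself is easy to sketch. The following Python snippet intersects two interval histograms and, assuming independence of the two base facts, assigns product probabilities to the overlaps (the interval boundaries and probabilities are hypothetical stand-ins for the playsFor histograms above):

def overlaps(f1, f2):
    # Intersect two interval histograms; with a sort-merge over the interval
    # boundaries this becomes linear in the number of input intervals.
    out = []
    for b1, e1, p1 in f1:
        for b2, e2, p2 in f2:
            b, e = max(b1, b2), min(e1, e2)
            if b < e:   # non-empty intersection: product under independence
                out.append((b, e, p1 * p2))
    return out

beckham = [(2000, 2003, 0.2), (2003, 2007, 0.4)]   # playsFor(Beckham, Real, T1)
ronaldo = [(2002, 2004, 0.2), (2004, 2007, 0.4)]   # playsFor(Ronaldo, Real, T2)
for b, e, pr in overlaps(beckham, ronaldo):        # teamMates(Beckham, Ronaldo, T3)
    print(f"[{b},{e}): {pr:.2f}")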
Experiment (III): Temporal Alignment & Probabilistic Inference

- 1,827 base facts with temporal annotations, extracted from free-text biographies from Wikipedia, IMDB.com, and biography.com
- 11 handcrafted temporal deduction rules, e.g.:
  MarriedTo(X,Y)[Tb1,Te2) ← Wedding(X,Y)[Tb1,Te1) ∧ Divorce(X,Y)[Tb2,Te2) ∧ Te1 ≤T Tb2
- 21 handcrafted temporal consistency constraints, e.g.:
  BornIn(X,Y)[Tb1,Te1) ∧ MarriedTo(X,Y)[Tb2,Te2) ⇒ Te1 ≤T Tb2
Statistical Relational Learning & Probabilistic Programming

- SRL combines first-order logic and probabilistic inference.
- Employs relational data as input, but with a focus also on learning the relations (facts, rules & weights).
- Knowledge compilation for probabilistic inference.

Markov Logic Networks (U-Washington):
- Grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model (→ Markov Random Field)
- Including recent techniques for "lifted inference"

Probabilistic Programming (ProbLog, KU-Leuven):
- Deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian Net)
Learning Soft Deduction Rules

Goal: Inductively learn a soft rule S: livesIn(x,y) :- bornIn(x,y)

[Venn diagram:
R  = ground truth for livesIn (only partially known)
KB = knowledge base for livesIn (known positive examples)
G  = facts inferred for livesIn from the body of the rule, bornIn (only partially correct)]

confidence(S) ≈ P(Head | Body) = |Head ∧ Body| / |Body|

- Inductive learning algorithm based on dynamic programming
- A-priori-style pre-filtering & pruning of low-support join patterns
- Adaptation of confidence and support measures from data mining (see the sketch below)
- Learning "interesting" rules with constants and type constraints
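The confidence measure above boils down to counting over two fact sets. A tiny Python sketch with hypothetical toy facts:

# confidence(S) ~ P(Head | Body) = |Head ∧ Body| / |Body|
born  = {("Angela", "Hamburg"), ("Max", "Kiel"), ("Erwin", "Berlin")}   # body: bornIn(x,y)
lives = {("Angela", "Berlin"), ("Max", "Kiel")}                         # head: livesIn(x,y)

support    = len(born & lives)         # |Head ∧ Body|
confidence = support / len(born)       # |Head ∧ Body| / |Body|
print(support, round(confidence, 2))   # 1 0.33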
Learning "Interesting" Deduction Rules (I)

Candidate rule bodies: income(x, y) ∧ quarterOfBirth(x, z)  vs.  income(x, y) ∧ educationLevel(x, z)

[Plots: relative frequency of income for the overall population vs. conditioned on quarterOfBirth (1st to 4th quarter) and on educationLevel.]

Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts).
Divergence from the "overall population" shows a strong correlation of income with educationLevel, but not with quarterOfBirth.
Learning "Interesting" Deduction Rules (II)

income(x, y) :- educationLevel(x, z)

income(x, "low")    :- educationLevel(x, "Nursery school to Grade 4")
income(x, "medium") :- educationLevel(x, "Professional school degree")
income(x, "high")   :- educationLevel(x, "Professional school degree")

[Plot: relative income frequency (low, medium, high) for the overall population vs. "Nursery school to Grade 4" vs. "Professional school degree".]

Divergence is measured using Kullback-Leibler or χ² between the "overall population" and "Nursery school to Grade 4" / "Professional school degree" over the discretized income domain.
Summary & Challenges (I): Web-Scale Information Extraction

[2×2 matrix: one axis ranges from open-domain & unsupervised to domain-oriented training data/facts (increasing human effort); the other from names & patterns to entities & relations (increasing ontological rigor).]

Names & patterns (surface triples):
  <"N. Portman", "honored with", "Academy Award">
  <"Jeff Bridges", "expected to win", "Oscar">
  <"Bridges", "nominated for", "Academy Award">

Entities & relations (canonicalized facts):
  wonAward: Person × Prize
  type(Meryl_Streep, Actor)
  wonAward(Meryl_Streep, Academy_Award)
  wonAward(Natalie_Portman, Academy_Award)
  wonAward(Ethan_Coen, Palme_d'Or)
Summary & Challenges (I): Web-Scale Information Extraction

[The same matrix, populated with systems: TextRunner and Probase (names & patterns, open-domain & unsupervised); WebTables/FusionTables, StatSnowball/EntityCube, ReadTheWeb/NELL, and Sofie/Prospera in between; Freebase, DBpedia 3.8, and YAGO2 (entities & relations, domain-oriented, high human effort). Open-domain, unsupervised extraction of canonicalized entities & relations is still marked "?".]
Summary & Challenges (II): RDF is Not Enough!

- HMMs, CRFs, PCFGs (not in this talk) yield much richer output structures than just triplets.
- Extraction of
  - facts
  - beliefs, modifiers, modalities, etc.
  - intensional knowledge ("rules")
- More expressive but canonical representation of natural language: trees, graphs, objects, frames (F-Logic, KL-ONE, CycL, OWL, etc.)
- All combined with structured probabilistic inference
Summary & Challenges (III): Scalable Probabilistic Inference

"Domain-liftable" FO formula:
∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) ⇒ smokes(Y)

[Figure: the corresponding FO d-DNNF circuit.]

- Exact lifted inference via Weighted First-Order Model Counting (WFOMC):
  The probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.
  [Van den Broeck'11]: Compilation rules and inference algorithms for FO d-DNNFs
  [Jha & Suciu'11]: Classes of SQL queries which admit polynomial-size (propositional) d-DNNFs
- Approximate inference via Belief Propagation, MCMC-style sampling, etc.
- Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT)
Final Summary

C1 Text is not just unstructured data.
C2 Probabilistic databases combine first-order logic and probability theory in an elegant way.
C3 Natural-Language-Processing people, Database guys, and Machine-Learning folks: it's about time to join forces!

Demo! urdf.mpi-inf.mpg.de
References

- Maximilian Dylla, Iris Miliaraki, Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear)
- Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013
- Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20
- Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96
- Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560
- Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236
- Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493
- Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65
- Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010
- Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433
- Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032
- Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with Uncertainty and Lineage. VLDB J. 17(2): 243-264 (2008)