Cambridge - University of Washington

advertisement
Probabilities in
Databases and Logics I
Nilesh Dalvi and Dan Suciu
University of Washington
Two Lectures
•Today:
probabilistic database to model imprecisions
probabilistic logics
•Tomorrow:
probabilistic database to model incompletness
random graphs
2
Motivation
Record reconciliation
Information extraction
Constraint violations
Schema matching
3
Problem Setting
Queries:
Tables:
A(x,y) :- Review(x,y),
Movie(x,z), z > 1991
Review
Movie
name
rating
p
Monkey Love
good
.5
title
year
p
fair
.2
Twelve Monkeys
1995
.8
fair
.6
Monkey Love
1997
.4
poor
.9
Monkey Love
1935
.9
title
Monkey Love Pl
2005
.7
Twelve Monkeys fair
.53
Monkey Love
good
.42
Monkey Love Pl
fair
.15
Problem: complexity of
query evaluation
Answers:
Top k
rating p
5
Two Problems
Fixed schema S, conjunctive query Q(x,y)
Query evaluation problem
Fix answer tuple (a,b)
Given database I, compute Pr(Q(a,b))
Top-k answering problem
Fix k > 0
Given database I, find k answer tuples with
highest probabilities
6
Related Work: DB
Cavallo&Pitarelli:1987
Barbara,Garcia-Molina, Porter:1992
Lakshmanan,Leone,Ross&Subrahmanian:199
7
Fuhr&Roellke:1997
Dalvi&S:2004
Widom:2005
7
Related Work: Logic
Query reliability [Gradel,Gurevitch,Hirsch’98]
Degrees of belief
[Bacchus,Grove,Halpern,Koller’96]
Probabilistic Logic [Nielson]
Probabilistic model checking [Kwiatkowska’02]
Probabilistic Relational Model
[Taskar,Abbeel,Koller’02]
8
Outline
Definitions
Query Evaluation
Top-k answering (joint with Chris Re)
Conclusions
9
Probabilistic Database
•Schema S, Domain D, Set of instances Inst
•Definition
Probabilistic database is a probability distribution
Pr : Inst → [0,1], ∑I Pr[I] = 1
If Pr[I] > 0 then I is called “possible world”
23
Probabilistic Database
•Representation:
Independent tuples:
I-database DB over some schema Si
Independent and disjoint tuples:
ID-database DB over some schema Sid
•Semantics:
DB “means” probability distribution Pr over schema S
24
Representation
I-Databases
Reviewsi(M,S,p)
Reviews(M,S)
Mov Scor
Movie Score
P
m42
good
p
m99
good
p
m76
poor
p
1
2
3
Mov Scor
Mov Scor
Mov Scor
Mov Scor
Mov Scor
Mov Scor
Mov Scor
m42 good
m99 good
m42 good
m76 poor
m42 good
m42 1995
m42 good
m76 poor
m76 poor
m99 good
m99 good
Pr[I1]=
(1-p )*(1-p )*(1-p )
1
2
3
m76 poor
Pr[I4]=
p *p *(1-p )
1 2
3
Pr[I1] + Pr[I2] + . . . + Pr[I8] = 1
Possible worlds semantics
Pr[I8]=
p *p *p
1 2 3
26
ID-Databases
id
Activities
Activities
Time Act
Timed
Activity
P
t
walk
t
run
t+1
walk
p
1
p
2
p
3
Time Act
Time Act
Time Act
Time Act
Time Act
t
t
t+1
t
walk
t
run
t+1
walk
t+1
walk
walk
run
walk
Pr[I1]=
(1-p -p )*(1-p )
1 2
3
Pr[I3]=
p *(1-p )
2
3
Pr[I5]=
p *p
1 3
Pr[I1] + Pr[I2] + . . . + Pr[I6] = 1
28
ID subsumes I
id
Reviews
i
Reviews
Movied Scored
P
m42
good
p
m99
good
p
m76
poor
p
=
1
2
3
Note:
Movie Score
P
Reviewsid
m42
good
p
m99
good
p
m76
poor
p
1
2
3
Movie Score
P
m42
good
p
m99
good
p
m76
poor
p
means all
tuples are
disjoint
1
2
3
29
Queries
•Syntax:
conjunctive queries over schema S
Q(y) :- Movie(x,y), Review(x,z), z >= 3
i
Review
i
Movie
id
m42
m99
m76
m05
year
1995
2002
2002
2005
P
0.95
0.65
0.1
0.7
mid
rating
p
m42
4
0.7
m42
m99
m99
5
5
4
0.45
0.82
0.68
m05
5
0.79
30
Two Query Semantics
•Possible
answer sets
Given set A:
Pr[{t | I ⊨ Q(t)} = A]
Used for views
•Possible
This
talk
tuples
Given tuple t:
Pr[I ⊨ Q(t)]
Used for query evaluation and top-k
31
Query Semantics
id
year
m42
id
year
1935
m76
1903
id
year
2004
m99
m99
1901
m05
m76
1902
p1
p2
id
year
1995
m87
1934
m99
1935
m44
1904
m05
2004
p3
p4
Q(y) :- Movie(x,y), Review(x,z)
Tuple
probabilities
year
1935
2004
1995
. . .
p
p2 + p3 = 0.6
p1 + p3 = 0.5
p3 =
0.2
. . .
top k
35
Summary on Data Model
Data Model:
Semantics = possible worlds
Syntax = I-databases or ID-databases
Queries:
Syntax = unchanged (conjunctive queries)
Semantics = tuple probabilities
38
Outline
Definitions
Query evaluation
Top-k answering
Conclusions
39
Problem Definition
•Fix
schema S, query Q, answer tuple t
•Problem:
given I/ID-database DB, compute Pr[I ⊨ Q(t)]
notation:
Pr[Q(t)]
•Conventions:
For upper bounds (P or #P): probabilities are rationals
For lower bounds (#P): probabilities are 1/2
40
Query Evaluation
on I-Databases
•Outline
Intuition
Extensional plans: PTIME case
Hard queries: #P-complete case
Dichotomy Theorem
41
Q(y) :- Movie(x,y),
Review(x,z)
Intuition
Reviewi
Moviei
id
year
p
m42
1995
p1
m99
2002
p2
m76
2002
p3
m05
2005
p4
mid
m42
m42
m42
m99
m99
m76
Answer
Year
rate
4
2
3
1
3
5
p
q1
q2
q3
q4
q5
q6
p
1995
p1 × (1 - (1 - q1) ×(1 - q2)×(1 - q3))
2002
1 - (1 - p2 × (1 - (1 - q4)×(1
) ×- q5))
(1 - p3 × )q6
42
I-Extensional Plans
[Barbara92,Lakshmanan97]
Add p
Join ⋈
p = p1 * p2
Projection ∏ p = 1-(1-p1)(1-p2)...(1-pn)
Selection σ p = p
Note: data complexity is PTIME
43
Q(y) :- Movie(x,y),
Review(x,z)
1995
1-(1-pq1)(1-pq2)(1pq3)
∏
∏
1995 m1 pq1
p(1-(1-q1)(1-q2)(11995 m1
q3))
⋈
1 - (1-q1)(1-q2)(1m1
q3)
1995 m1 pq2
⋈
1995 m1 pq3
∏
Movie Review
1995
p
Movie
m1 q1
m1 q2
m1 q3
INCORRECT!
1995
p
Review
m1 q1
m1 q2
m1 q3
CORREC
46
T
#P-Complete Queries
i
R
A
i
T
S
p
A
B
B
p
p1
q1
p2
q2
p3
q3
p4
q4
Qbad :- Ri(x), S(x,y), Ti(y)
Theorem: Data complexity is #P-complete
48
Proof:
Theorem [Provan&Ball83] Counting the number of
satisfying assignments for bipartite DNF is #Pcomplete
Reduction:
x2y3 V x1y2 V x4y3 V x3y1
i
T
S
Ri
A
p
A
B
B
p
x1
1/2
x2
y3
y1
1/2
x2
1/2
x1
y2
y2
1/2
x3
1/2
x4
y3
y3
1/2
x4
1/2
x3
y1
Qbad :- Ri(x), S(x,y), Ti(y)
49
I-Dichotomy
Q = boolean conjunctive query
Definition 1. For each variable x:
goals(x) = set of goals that contain x
Definition 2. Q is hierarchical if forall x, y:
(a) goals(x) ∩ goals(y) = ∅, or
(b) goals(x) ⊆ goals(y), or
(c) goals(y) ⊆ goals(x)
Q :- R(x),S(x,y),T(x,y,z),K(x,v)
x
y
z
R
S
v
T
K
“hierarchical”
Q :- R(x), S(x,y), T(y)
x
“non-hierarchical”
R
y
S
T
52
[Dalvi&S.’04]
I-Dichotomy
Schema Si = {R1i, R2i, . . ., Rmi}
Theorem Let Q = conjunctive query w/o self-joins.
Then one of the following holds:
or:
Q is in PTIME
Q has a correct extensional plan
Q is hierarchical
Q is #P-complete
Q has subgoals R(x,...),S(x,y,...),T(y,...)
53
Proof
Lemma 1.
If Q is non-hierarchical, then #P-complete
Proof:
y
x
v
R
S
z
T
K
Q :- Ri(v,x), Si(x,y), Ti(y,z), Ki(z)
rest is like for Qbad
54
Proof
Lemma 2. If Q is hierarchical, then PTIME
Proof:
Case 1: has no root
Pr(Q) = Pr(Q1) Pr(Q2) Pr(Q3)
This is extensional join ⋈
55
Proof
Case 2: has root x
x
Dom={a1, a2, . . ., an}
Pr(Q) =
1 - (1-Pr(Q(a1/x))(1-Pr(Q(a2/x))...(1-Pr(Q(an/x)))
This is an extensional projection: ∏
QED
56
Query Evaluation
on ID-Databases
ID-extensional plans
#P-complete queries
Dichotomoy Theorem
57
Extensional Plans for ID-DBs
Only difference: two kinds of projections:
independent 1-(1-p1)...(1-pn)
disjoint
p1 + ... + pn
58
#P-Complete Queries
Q1 :- Ri(x), Si(x,y), Ti(y)
Q2 :- Rd(xd,y), Sd(yd,z)
Q3 :- Rd(xd,y), Sd(zd,y)
59
[Dalvi&S.’04]
I-DB Dichotomy
Schema Sid s.t. each table is either Ri or Rid
Theorem Let Q = conjunctive query w/o self-joins.
Then one of the following holds:
Q is in PTIME
Q has a correct extensional plan
or:
Q is #P-complete
Q has one of Q1, Q2, Q3 as subqueries
60
Extensions
•Extensions
of the dichotomoy theorem exists for:
Mixed schemas (some relations are deterministic)
Functional dependencies
61
Summary on Query
Evaluation
•Extensional
plans: popular, efficient, BUT
“Equivalent” plans lead to different results
Some queries admit “correct” plans
•Some
simple queries: #P-complete complexity
•Dichotomy
•Future
theorem
work: remove ‘no-self-join’ restriction
62
Outline
Definitions
Query evaluation
Top-k answering (joint with Chris Re)
Conclusions
65
Top-k Ranking Problem
•Fix
schema S, query Q, number k > 0
•Problem:
given I- or ID-database DB,
find k answers t1,...,tk with highest probabilities
Pr[Q(t1)] > Pr[Q(t2)] > .... > Pr[Q(tk)] > ...
•Note:
Checking Pr[Q(ti)] > Pr[Q(tj)] is PP-complete
Goal: efficient polynomial time approximation
69
Probabilities of Boolean
Expressions
A
p
e1
e2
e3
e1
p1
0
0
0
e2
p2
0
0
1
Pr
(1-p1)(1-p2)(1p3)
(1-p1)(1-p2)p3
e3
p3
0
1
0
(1-p1)p2(1-p3)
0
1
1
(1-p1)p2p3
1
0
0
p1(1-p2)(1-p3)
1
0
1
p1(1-p2)p3
1
1
0
p1p2(1-p3)
1
1
1
p1p2p3
What is the probability of
e1⋀e2 ⋁ e1⋀e2 ⋁ e1⋀e3?
Theorem #P-hard [Valiant]
(1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3
70
Monte Carlo Simulation
Algorithm:
radomly pick each e1, e2, e3 = false or true
compute e1∧e2 ∨ e1∧e3 ∨ e2∧e3: true or false ?
repeat
Approximate probability p with frequency p’
[Karp&Luby’83]
Pr( |p’-p| < ε ) > 1-δ
p’- ε
p’
p’+ ε
p
Better: PTAS
71
Monte Carlo Simulation
0
p
1
N=0
N=1
N=2
N=3
72
The Multisimulation
Problem
0
Year
P
1995
2002
??
??
1933
??
1984
??
1
Schedule simulation steps to find top-k
73
Multisimulation
How to find the top k out of n ?
0
p1
1
5
p4
2
1
p2
p54
p3
3
Example: looking for top k=2;
Which one simulate next ?
74
Multisimulation
Critical region: (k’th left, k+1’th right)
0
1
k=2
75
Multisimulation Algorithm
Case 1: pick a “double crosser” and simulate it
0
this
k=2
1
76
Multisimulation Algorithm
Case 2: pick both a “left” AND a “right” crosser
0
1
this
and
this
k=2
77
Multisimulation Algorithm
Case 3: pick a “max crosser” and simulate it
0
this
k=2
1
78
Multisimulation Algorithm
End: when critical region is “empty”
0
1
k=2
To sort the top k, find the top k-1, etc
79
Multisimulation Algorithm
Theorem
(1) It runs in < 2 Optimal # steps
(2) no other deterministic algorithm does better
80
Experiments
82
Summary on Top-k
Answering
Simple algorithm, optimal (x2) w.r.t. a very
powerful standard
Marriage of probabilistic and top-k answers
make probabilistic databases practical
83
Outline
Definitions
Query evaluation
Top-k answering
Conclusions
85
Conclusions
Strong motivation from practical applications
Opportunity to merge query and search technologies
Probabilistic DB’s are hard !
Great opportunity for impactful theory work
Tomorrow: applications of random graphs to model
incompleteness in databases
87
Thank you !
Questions ?
Download