On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Presented by Xi Zhang
February 8th, 2008
Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion
Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion
Probabilistic Databases
- Motivation
  - Uncertainty/vagueness/imprecision in data
- History
  - Incomplete information in relational DBs [Imielinski & Lipski 1984]
  - Probabilistic DB model [Cavallo & Pittarelli 1987]
  - Probabilistic Relational Algebra [Fuhr & Rölleke 1997, etc.]
- Comeback
  - Flourishing of uncertain data in real-world applications
  - Examples: WWW, biological data, sensor networks, etc.
Probabilistic Database Model
[Fuhr & Rölleke 1997]
- Probabilistic Database Model
  - A generalization of the relational DB model
- Probabilistic Relational Algebra (PRA)
  - A generalization of standard relational algebra
A Table in a Probabilistic Database
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
Each tuple carries an event expression; in a base table these are basic events, which are assumed to be independent.
Probabilistic Relational Algebra
Just like in Relational Algebra...
- Selection
- Projection
- Join
- Union
- Difference
Selection
Selecting the DocTerm tuples with Term = 'IR':
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
Result:
  DocNo | Term | Prob | Complex Event
  1     | IR   | 0.9  | eDT(1, IR)
  3     | IR   | 0.8  | eDT(3, IR)
In a derived table, each tuple carries a complex event: a propositional expression over basic events.
Projection
Projecting DocTerm onto Term:
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
Result:
  Term | Prob | Complex Event
  IR   | 0.98 | eDT(1, IR) ∨ eDT(3, IR)
  DB   | 0.85 | eDT(2, DB) ∨ eDT(3, DB)
  AI   | 0.80 | eDT(4, AI)
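For concreteness, here is a minimal sketch (assumed, not taken from the slides) of how the projected probabilities above can be computed when the basic events are independent: a projected tuple corresponds to a disjunction of basic events, so its probability is 1 - ∏(1 - p_i).

    # Extensional computation of the projection probabilities above:
    # each projected Term is a disjunction of independent basic events.
    doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8), (3, "DB", 0.5), (4, "AI", 0.8)]

    def project_on_term(tuples):
        """Group tuples by Term; combine probabilities as 1 - prod(1 - p)."""
        by_term = {}
        for _, term, p in tuples:
            by_term.setdefault(term, []).append(p)
        result = {}
        for term, probs in by_term.items():
            none_appears = 1.0
            for p in probs:
                none_appears *= 1.0 - p
            result[term] = 1.0 - none_appears
        return result

    print(project_on_term(doc_term))  # {'IR': 0.98, 'DB': 0.85, 'AI': 0.8} (up to rounding)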
Join
DocAu:
  DocNo | AName | Prob | Basic Event
  1     | Bauer | 0.9  | eDU(1, Bauer)
  2     | Meier | 0.8  | eDU(2, Meier)
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
Result:
  DocAu.DocNo | AName | DocTerm.DocNo | Term | Prob      | Complex Event
  1           | Bauer | 1             | IR   | 0.9 * 0.9 | eDU(1, Bauer) ∧ eDT(1, IR)
  1           | Bauer | 2             | DB   | 0.9 * 0.7 | eDU(1, Bauer) ∧ eDT(2, DB)
  2           | Meier | 1             | IR   | 0.8 * 0.9 | eDU(2, Meier) ∧ eDT(1, IR)
  2           | Meier | 2             | DB   | 0.8 * 0.7 | eDU(2, Meier) ∧ eDT(2, DB)
Join + Projection
Example query (implied by the result below): authors who have written about both IR and DB.
DocAu:
  DocNo | AName   | Prob | Basic Event
  1     | Bauer   | 0.9  | eDU(1, Bauer)
  2     | Bauer   | 0.3  | eDU(2, Bauer)
  2     | Meier   | 0.9  | eDU(2, Meier)
  2     | Schmidt | 0.8  | eDU(2, Schmidt)
  3     | Schmidt | 0.7  | eDU(3, Schmidt)
  4     | Koch    | 0.9  | eDU(4, Koch)
  4     | Bauer   | 0.6  | eDU(4, Bauer)
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
IR documents:
  DocNo | Prob | Complex Event
  1     | 0.9  | eDT(1, IR)
  3     | 0.8  | eDT(3, IR)
DB documents:
  DocNo | Prob | Complex Event
  2     | 0.7  | eDT(2, DB)
  3     | 0.5  | eDT(3, DB)
Authors of IR documents:
  AName   | Prob | Complex Event
  Bauer   | 0.81 | eDU(1, Bauer) ∧ eDT(1, IR)
  Schmidt | 0.56 | eDU(3, S) ∧ eDT(3, IR)
Authors of DB documents:
  AName   | Prob | Complex Event
  Bauer   | 0.21 | eDU(2, Bauer) ∧ eDT(2, DB)
  Meier   | 0.63 | eDU(2, Meier) ∧ eDT(2, DB)
  Schmidt | 0.91 | (eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))
Authors of both IR and DB documents:
  AName   | Prob                 | Complex Event
  Bauer   | 0.81 * 0.21 = 0.1701 | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
  Schmidt | 0.56 * 0.91 = 0.5096 | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))
(The correct intensional probability for Schmidt is 0.4368; see below.)
(The Join + Projection example above is repeated here to contrast the intensional semantics with the extensional semantics, discussed on the next slide.)
Intensional vs. Extensional
- Intensional Semantics
  - Assumes independence of the basic events in the base tables
  - Keeps track of data dependence during the evaluation
- Extensional Semantics
  - Assumes data independence during the evaluation
  - Can get the probability computation WRONG!
- When is Extensional = Intensional?
  - When no basic event occurs more than once in the event expression
Example (from the previous slides):
  AName   | Prob (extensional)   | Complex Event
  Bauer   | 0.81 * 0.21 = 0.1701 | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
  Schmidt | 0.56 * 0.91 = 0.5096 | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))
Schmidt's event contains the identical basic event eDU(3, S) twice, so the extensional value 0.5096 is wrong; the correct intensional probability is 0.4368. Bauer's event has no repeated basic events, so 0.1701 is correct.
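A small sketch (an illustration, not code from the paper) that contrasts the two semantics: it computes the exact intensional probability of each complex event above by enumerating truth assignments of the independent basic events.

    from itertools import product

    # Basic-event probabilities from the DocAu/DocTerm tables above.
    basic = {
        "eDU(1,B)": 0.9, "eDU(2,B)": 0.3, "eDU(2,S)": 0.8, "eDU(3,S)": 0.7,
        "eDT(1,IR)": 0.9, "eDT(3,IR)": 0.8, "eDT(2,DB)": 0.7, "eDT(3,DB)": 0.5,
    }

    def intensional_prob(event):
        """Exact probability of a boolean function over the independent basic events."""
        names = list(basic)
        total = 0.0
        for values in product([False, True], repeat=len(names)):
            world = dict(zip(names, values))
            weight = 1.0
            for name, holds in world.items():
                weight *= basic[name] if holds else 1.0 - basic[name]
            if event(world):
                total += weight
        return total

    # Bauer: no repeated basic event, so the extensional value 0.1701 is also correct.
    bauer = lambda w: (w["eDU(1,B)"] and w["eDT(1,IR)"]) and (w["eDU(2,B)"] and w["eDT(2,DB)"])
    # Schmidt: eDU(3,S) occurs twice, so the extensional 0.5096 overestimates;
    # the exact intensional probability is 0.4368.
    schmidt = lambda w: (w["eDU(3,S)"] and w["eDT(3,IR)"]) and \
                        ((w["eDU(2,S)"] and w["eDT(2,DB)"]) or (w["eDU(3,S)"] and w["eDT(3,DB)"]))

    print(intensional_prob(bauer), intensional_prob(schmidt))  # ~0.1701  ~0.4368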
Fuhr & Rölleke 1997: Summary
- Probabilistic DB Model
  - Concept of an event
  - Basic vs. complex events
  - Event expressions
- Probabilistic Relational Algebra
  - Just like in Relational Algebra...
- Computation of event probabilities
  - Intensional vs. extensional semantics
  - They yield the same result when there is NO data dependence in the event expressions
Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion
Top-k Queries
- Traditionally, given
  - Objects: o1, o2, ..., on
  - A non-negative integer k
  - A scoring function s
- Question: what are the k objects with the highest scores?
- Top-k queries have been studied in Web, XML and relational databases, and more recently in probabilistic databases.
Scoring Function
- A scoring function s over a deterministic relation R assigns a numeric score s(t) to every tuple t of R.
- For any ti and tj from R, ti is ranked above tj iff s(ti) > s(tj).
Outline
- Background
- Motivation Examples
  - Smart Environment Example
  - Sensor Network Example
- Top-k Queries in Probabilistic Databases
- Conclusion
Motivating Example I
- Smart Environment
- Sample Question
  - "Who were the two visitors in the lab last Saturday night?"
- Data
  - Biometric data from sensors
    - We can see how well the data match the profile of every candidate -- a scoring function
  - Historical statistics
    - e.g., the probability of a certain candidate being in the lab on Saturday nights
Motivating Example I (cont.)
  Personnel | score(Face Detection, Voice Detection, ...) | Probability of being in the lab on Saturday nights
  Aiden     | score(0.70, 0.60, ...) = 0.65               | 0.3
  Bob       | score(0.50, 0.60, ...) = 0.55               | 0.9
  Chris     | score(0.50, 0.40, ...) = 0.45               | 0.4
Question: find the two people who were in the lab last Saturday night --
a Top-2 query over the above probabilistic database under the above scoring function.
Motivating Example II
- Sensor Network in a Habitat
- Sample Question
  - "What is the temperature of the warmest spot?"
- Data
  - Sensor readings from different sensors
  - At a sampling time, only one "real" reading per sensor
  - Each sensor reading comes with a confidence value
Motivating Example II (cont.)
  Part               | Temp (F) | Prob
  C1 (from Sensor 1) | 22       | 0.6
  C1 (from Sensor 1) | 10       | 0.4
  C2 (from Sensor 2) | 25       | 0.1
  C2 (from Sensor 2) | 15       | 0.6
Question: what is the temperature of the warmest spot? --
a Top-1 query over the above probabilistic database under a scoring function proportional to temperature.
Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion
Models
- A probabilistic relation Rp = <R, p, C>
  - R: the support deterministic relation
  - p: the probability function, p: R → (0, 1]
  - C: a partition of R, such that the probabilities of the tuples within each part sum to at most 1
- Simple vs. general probabilistic relations
  - Simple
    - Assumes tuple independence, i.e. |C| = |R|
    - e.g., the smart environment example
  - General
    - Tuples can be independent or exclusive, i.e. |C| < |R|
    - e.g., the sensor network example
Challenges
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
two questions arise:
- What is the top-k answer set over Rp? (Semantics)
- How do we compute the top-k answer of Rp? (Query Evaluation)
What is a “Good” Semantics?
- Desired Properties
  - Exact-k
  - Faithfulness
  - Stability
Properties
- Exact-k
  - If R has at least k tuples, then exactly k tuples are returned as the top-k answer
- Faithfulness
  - A "better" tuple, i.e. one with a higher score and a higher probability, is more likely to be in the top-k answer than a "worse" one
- Stability
  - Raising the score/probability of a winning tuple will not cause it to lose
  - Lowering the score/probability of a losing tuple will not cause it to win
Global-Topk Semantics
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
what is the top-k answer set over Rp? (Semantics)
- Global-Topk
  - Return the k highest-ranked tuples according to their probability of being in the top-k answer of a possible world
- Global-Topk satisfies the three aforementioned properties
Smart Environment Example
Query: find the two people who were in the lab last Saturday night.
  Personnel | score(Face Detection, Voice Detection, ...) | Prob.
  Aiden     | score(0.70, 0.60, ...) = 0.65               | 0.3
  Bob       | score(0.50, 0.60, ...) = 0.55               | 0.9
  Chris     | score(0.50, 0.40, ...) = 0.45               | 0.4
Possible worlds and their top-2 answers:
  Possible world      | Prob  | Top-2
  {}                  | 0.042 | {}
  {Aiden}             | 0.018 | {Aiden}
  {Bob}               | 0.378 | {Bob}
  {Chris}             | 0.028 | {Chris}
  {Aiden, Bob}        | 0.162 | {Aiden, Bob}
  {Aiden, Chris}      | 0.012 | {Aiden, Chris}
  {Bob, Chris}        | 0.252 | {Bob, Chris}
  {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}
Global-Topk Semantics:
  Pr(Bob in top-2) = 0.9
  Pr(Aiden in top-2) = 0.3
  Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
Top-2 Answer: {Bob, Aiden}
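A minimal sketch (assumed, not from the slides) that reproduces these Global-Top2 probabilities by enumerating all possible worlds of the simple probabilistic relation:

    from itertools import product

    # (name, score, probability) from the smart environment example.
    people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

    def global_topk(tuples, k):
        """Pr(t is in the top-k of a possible world), for every tuple t."""
        result = {name: 0.0 for name, _, _ in tuples}
        for present in product([False, True], repeat=len(tuples)):
            weight = 1.0                                   # probability of this world
            for (_, _, p), bit in zip(tuples, present):
                weight *= p if bit else 1.0 - p
            world = sorted((t for t, bit in zip(tuples, present) if bit),
                           key=lambda t: t[1], reverse=True)
            for name, _, _ in world[:k]:                   # the world's top-k answer
                result[name] += weight
        return result

    print(global_topk(people, 2))
    # {'Aiden': 0.3, 'Bob': 0.9, 'Chris': 0.292} (up to rounding) -> Top-2 answer: {Bob, Aiden}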
Other Semantics
- [Soliman, Ilyas & Chang 2007]
- Two alternative semantics
  - U-Topk
  - U-kRanks
U-Topk Semantics
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
what is the top-k answer set over Rp? (Semantics)
- U-Topk
  - Return the most probable top-k answer set that belongs to a possible world
- U-Topk does not satisfy all three properties
Smart Environment Example
Query: find the two people who were in the lab last Saturday night.
  Personnel | score(Face Detection, Voice Detection, ...) | Prob.
  Aiden     | score(0.70, 0.60, ...) = 0.65               | 0.3
  Bob       | score(0.50, 0.60, ...) = 0.55               | 0.9
  Chris     | score(0.50, 0.40, ...) = 0.45               | 0.4
Possible worlds and their top-2 answers:
  Possible world      | Prob  | Top-2
  {}                  | 0.042 | {}
  {Aiden}             | 0.018 | {Aiden}
  {Bob}               | 0.378 | {Bob}
  {Chris}             | 0.028 | {Chris}
  {Aiden, Bob}        | 0.162 | {Aiden, Bob}
  {Aiden, Chris}      | 0.012 | {Aiden, Chris}
  {Bob, Chris}        | 0.252 | {Bob, Chris}
  {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}
U-Topk Semantics:
  Pr({Bob}) = 0.378
  ...
  Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
Top-2 Answer: {Bob}
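The same possible-worlds enumeration also gives the U-Topk answer: group the worlds by their top-k set and pick the most probable set (a sketch assumed for illustration, not taken from the paper):

    from itertools import product

    people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

    def u_topk(tuples, k):
        """Most probable top-k answer set across all possible worlds."""
        by_answer = {}
        for present in product([False, True], repeat=len(tuples)):
            weight = 1.0
            for (_, _, p), bit in zip(tuples, present):
                weight *= p if bit else 1.0 - p
            world = sorted((t for t, bit in zip(tuples, present) if bit),
                           key=lambda t: t[1], reverse=True)
            answer = tuple(name for name, _, _ in world[:k])   # this world's top-k set
            by_answer[answer] = by_answer.get(answer, 0.0) + weight
        return max(by_answer.items(), key=lambda kv: kv[1])

    print(u_topk(people, 2))  # (('Bob',), 0.378): {Bob} is the most probable top-2 set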
U-kRanks Semantics
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
what is the top-k answer set over Rp? (Semantics)
- U-kRanks
  - For i = 1, 2, ..., k, return the most probable i-th ranked tuple across all possible worlds
- U-kRanks does not satisfy all three properties
Smart Environment Example
Query: find the two people who were in the lab last Saturday night.
  Personnel | score(Face Detection, Voice Detection, ...) | Prob.
  Aiden     | score(0.70, 0.60, ...) = 0.65               | 0.3
  Bob       | score(0.50, 0.60, ...) = 0.55               | 0.9
  Chris     | score(0.50, 0.40, ...) = 0.45               | 0.4
Possible worlds and their top-2 answers:
  Possible world      | Prob  | Top-2
  {}                  | 0.042 | {}
  {Aiden}             | 0.018 | {Aiden}
  {Bob}               | 0.378 | {Bob}
  {Chris}             | 0.028 | {Chris}
  {Aiden, Bob}        | 0.162 | {Aiden, Bob}
  {Aiden, Chris}      | 0.012 | {Aiden, Chris}
  {Bob, Chris}        | 0.252 | {Bob, Chris}
  {Aiden, Bob, Chris} | 0.108 | {Aiden, Bob}
U-kRanks Semantics:
  Personnel | Rank-1 | Rank-2
  Aiden     | 0.3    | 0
  Bob       | 0.63   | 0.27
  Chris     | 0.028  | 0.264
  e.g. Pr(Chris at rank 2) = 0.012 + 0.252 = 0.264
Bob is the most probable tuple at rank 1 and also at rank 2.
Top-2 Answer: {Bob}
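And for U-kRanks (again an illustrative sketch, not the paper's code): accumulate, for each rank i, the probability that each tuple occupies exactly rank i in a possible world, then take the per-rank winner.

    from itertools import product

    people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]

    def u_kranks(tuples, k):
        """For each rank i = 1..k, the most probable tuple at that rank."""
        rank_prob = [{name: 0.0 for name, _, _ in tuples} for _ in range(k)]
        for present in product([False, True], repeat=len(tuples)):
            weight = 1.0
            for (_, _, p), bit in zip(tuples, present):
                weight *= p if bit else 1.0 - p
            world = sorted((t for t, bit in zip(tuples, present) if bit),
                           key=lambda t: t[1], reverse=True)
            for i, (name, _, _) in enumerate(world[:k]):
                rank_prob[i][name] += weight
        return [max(rank_prob[i].items(), key=lambda kv: kv[1]) for i in range(k)]

    print(u_kranks(people, 2))
    # approximately [('Bob', 0.63), ('Bob', 0.27)] -> the answer {Bob} has fewer than k tuples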
Properties
Which is the better semantics?
  Semantics   | Exact-k | Faithfulness | Stability
  Global-Topk | Yes     | Yes          | Yes
  U-Topk      | No      | Yes/No*      | Yes
  U-kRanks    | No      | No           | No
* Yes when the relation is simple, No otherwise
Challenges
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
two questions arise:
- What is the top-k answer set over Rp? (Semantics)
- How do we compute the top-k answer of Rp? (Query Evaluation)
Global-Topk
Global-Topk in Simple Relations
- Given Rp = <R, p, C>, a scoring function s, and a non-negative integer k
- Assumptions
  - Tuples are independent, i.e. |C| = |R|
  - R = {t1, t2, ..., tn}, ordered in decreasing order of score, i.e. s(t1) > s(t2) > ... > s(tn)
Global-Topk in Simple Relations
- Query Evaluation
  - Recursion
    - Pk,s(ti): the Global-Topk probability of tuple ti, i.e. the probability that ti is in the top-k answer of a possible world
    - Computed by a recursion on the tuple index and on k
  - Dynamic Programming (see the sketch below)
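The recursion itself is not reproduced on the slide. As a hedged sketch: for independent tuples sorted by decreasing score, ti is in the top-k of a world exactly when ti appears and fewer than k of t1, ..., t(i-1) appear, so the distribution of "how many higher-scoring tuples appear" can be maintained by a dynamic program over the tuple index and k.

    def global_topk_simple(probs, k):
        """Global-Topk probabilities for a simple relation (a DP sketch).

        probs: tuple probabilities, already sorted by decreasing score.
        cnt[j] = Pr(exactly j of the tuples processed so far appear), capped at k.
        """
        result = [0.0] * len(probs)
        cnt = [1.0] + [0.0] * k
        for i, p in enumerate(probs):
            # ti is in the top-k iff it appears and < k higher-scoring tuples appear
            result[i] = p * sum(cnt[:k])
            new = [0.0] * (k + 1)
            for j in range(k + 1):
                new[min(j + 1, k)] += cnt[j] * p          # tuple i appears
                new[j] += cnt[j] * (1.0 - p)              # tuple i is absent
            cnt = new
        return result

    # Smart environment example, tuples sorted by score (Aiden, Bob, Chris):
    print(global_topk_simple([0.3, 0.9, 0.4], k=2))  # [0.3, 0.9, 0.292]

This sketch runs in O(kn) time, which is consistent with the complexity table later in the deck.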
Optimization
- Threshold Algorithm (TA) [Fagin, Lotem & Naor 2001]
- Given a system of objects such that
  - For each object attribute, there is a sorted list ranking the objects in decreasing order of their scores on that attribute
  - An aggregation function f combines the individual attribute scores xi, i = 1, 2, ..., m, into the overall object score f(x1, x2, ..., xm)
  - f is monotonic: f(x1, x2, ..., xm) <= f(x'1, x'2, ..., x'm) whenever xi <= x'i for every i
- TA is cost-optimal for finding the top-k objects
- TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc. (a generic sketch follows)
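For reference, a generic TA sketch (assumed; not the pseudocode of the original paper): round-robin sorted access over the m lists, random access to complete each newly seen object, and a stopping threshold computed from the last scores seen under sorted access.

    import heapq

    def threshold_algorithm(sorted_lists, f, k):
        """Find the k objects with the highest aggregated score f.

        sorted_lists: m lists of (object_id, attribute_score), each sorted by score
                      in decreasing order; every object appears in every list.
        f: a monotonic aggregation function of m attribute scores.
        """
        scores = [dict(lst) for lst in sorted_lists]      # index for random access
        seen, top = set(), []                             # top: min-heap of (f-score, obj)
        for depth in range(max(len(lst) for lst in sorted_lists)):
            last = []                                     # last score seen in each list
            for lst in sorted_lists:
                obj, x = lst[min(depth, len(lst) - 1)]
                last.append(x)
                if obj not in seen:                       # random access to other attributes
                    seen.add(obj)
                    heapq.heappush(top, (f(*[s[obj] for s in scores]), obj))
                    if len(top) > k:
                        heapq.heappop(top)
            threshold = f(*last)                          # bound on any unseen object
            if len(top) == k and top[0][0] >= threshold:
                break                                     # the current top-k is final
        return sorted(top, reverse=True)

    lists = [[("b", 0.9), ("a", 0.7), ("c", 0.4)], [("a", 0.8), ("c", 0.6), ("b", 0.2)]]
    print(threshold_algorithm(lists, lambda x, y: x + y, k=1))  # [(1.5, 'a')]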
Applying TA Optimization
- Global-Topk
  - Two attributes: probability & score
  - Aggregation function: the Global-Topk probability
Global-Topk in General Relations
- Given Rp = <R, p, C>, a scoring function s, and a non-negative integer k
- Assumptions
  - Tuples are independent or exclusive, i.e. |C| < |R|
  - R = {t1, t2, ..., tn}, ordered in decreasing order of score, i.e. s(t1) > s(t2) > ... > s(tn)
Global-Topk in General Relations
- Induced Event Relation
  - For each tuple t in R, a probabilistic relation Ep = <E, pE, CE> is generated by two rules (illustrated on the next slide): one event tuple for each part other than t's own, carrying the total probability of that part's tuples scoring above t, and one event tuple for t itself, with probability p(t)
  - Ep is simple
Sensor Network Example
Probabilistic relation (general):
  Part               | Temp (F) | Prob
  C1 (from Sensor 1) | 22       | 0.6
  C1 (from Sensor 1) | 10       | 0.4
  C2 (from Sensor 2) | 25       | 0.1
  C2 (from Sensor 2) | 15       | 0.6
For example, take t = (15 F), with p(t) = 0.6, from part C2.
Induced event relation (simple):
  Event                                   | Prob
  e(C1): some tuple of C1 scores above t  | 0.6 = p(22 F), the only C1 tuple above 15 F
  e(t): t itself appears                  | 0.6 = p(t)
Evaluating Global-Topk in General Relations
- For each tuple t, generate the corresponding induced event relation
- Compute the Global-Topk probability of t by Theorem 4.3
- Pick the k tuples with the highest Global-Topk probabilities
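The slides do not reproduce Theorem 4.3 itself, so the sketch below implements the reduction as suggested by the induced event relation example (one independent event per other part, carrying the total probability of that part's higher-scoring tuples, combined with the event that t itself appears), and cross-checks it against brute-force enumeration of the possible worlds of the sensor example. Treat it as an illustration of the idea, not as the paper's exact formulation.

    from itertools import product

    # Sensor example: each part is a set of exclusive alternatives, (temperature, prob).
    parts = {"C1": [(22, 0.6), (10, 0.4)], "C2": [(25, 0.1), (15, 0.6)]}

    def global_topk_general(parts, k):
        """Global-Topk probabilities via the induced-event reduction (sketch)."""
        result = {}
        for cname, tuples in parts.items():
            for score, p in tuples:
                # one event per *other* part: "that part contributes a tuple scoring
                # above t"; its probability is a sum over the part's tuples
                events = [sum(q for s, q in other if s > score)
                          for oname, other in parts.items() if oname != cname]
                # t is in the top-k iff t appears and fewer than k of these events hold
                cnt = [1.0] + [0.0] * len(events)
                for e in events:
                    cnt = [cnt[j] * (1 - e) + (cnt[j - 1] * e if j > 0 else 0.0)
                           for j in range(len(cnt))]
                result[(cname, score)] = p * sum(cnt[:k])
        return result

    def global_topk_bruteforce(parts, k):
        """Sanity check: enumerate worlds (each part contributes one tuple or none)."""
        names = list(parts)
        result = {}
        for choice in product(*[parts[n] + [None] for n in names]):
            weight, present = 1.0, []
            for n, c in zip(names, choice):
                if c is None:
                    weight *= 1.0 - sum(p for _, p in parts[n])
                else:
                    weight *= c[1]
                    present.append((n, c[0]))
            present.sort(key=lambda x: x[1], reverse=True)
            for key in present[:k]:
                result[key] = result.get(key, 0.0) + weight
        return result

    print(global_topk_general(parts, k=1))    # the 22 F reading wins, with probability 0.54
    print(global_topk_bruteforce(parts, k=1)) # same values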
Summary on Query Evaluation
- Simple (independent tuples)
  - Dynamic programming
    - Tuples are ordered by score
    - Recursion on the tuple index and on k
- General (independent/exclusive tuples)
  - Polynomial reduction to the simple case
Complexity
          | Global-Topk | U-Topk             | U-kRanks
  Simple  | O(kn)       | O(kn)              | O(kn)
  General | O(kn^2)     | Θ(mkn^(k-1) lg n)* | Ω(mn^(k-1))*
* m is a rule-engine-related factor; it represents how complicated the relationships between tuples can be.
Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion
Conclusion
- Three intuitive semantic properties for top-k queries in probabilistic databases
- The Global-Topk semantics, which satisfies all of the properties above
- Query evaluation algorithms for Global-Topk in simple and general probabilistic databases
Future Problems
- Weak-order scoring functions
  - Allow ties
  - Not clear how to extend the properties
  - Not clear how to define the semantics (other than with an "arbitrary tie breaker")
- Preference Strength
  - Sensitivity to score: given a probabilistic relation Rp, if the DB is sufficiently large, then by manipulating the scores of tuples we can obtain different answers
    - NOT satisfied by our semantics
    - NOT satisfied by any semantics in the literature
  - Need to consider preference strength in the semantics
Thank you !
Related Work
- Introduction to Probabilistic Databases
  - Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
- Top-k Queries in Probabilistic Databases
  - On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
  - Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]