On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases
Presented by Xi Zhang
February 8th, 2008

Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion

Probabilistic Databases: Motivation
- History
  - Uncertainty/vagueness/imprecision in data
  - Incomplete information in relational DBs [Imielinski & Lipski 1984]
  - Probabilistic DB model [Cavallo & Pittarelli 1987]
  - Probabilistic Relational Algebra [Fuhr & Rölleke 1997, etc.]
- Comeback
  - Flourishing of uncertain data in real-world applications
  - Examples: WWW, biological data, sensor networks, etc.

Probabilistic Database Model [Fuhr & Rölleke 1997]
- Probabilistic database model: a generalization of the relational DB model
- Probabilistic Relational Algebra (PRA): a generalization of standard relational algebra

A Table in a Probabilistic Database
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
- Each tuple is annotated with an event expression; basic events are independent.

Probabilistic Relational Algebra
- Just like in relational algebra: Selection, Projection, Join, Union, Difference

Selection (Term = 'IR')
  DocNo | Term | Prob | Complex Event
  1     | IR   | 0.9  | eDT(1, IR)
  3     | IR   | 0.8  | eDT(3, IR)
- In a derived table, each tuple carries a complex event: a propositional expression over basic events.

Projection (onto Term)
  Term | Prob | Complex Event
  IR   | 0.98 | eDT(1, IR) ∨ eDT(3, IR)
  DB   | 0.85 | eDT(2, DB) ∨ eDT(3, DB)
  AI   | 0.80 | eDT(4, AI)
- E.g. Pr(IR) = 1 − (1 − 0.9)(1 − 0.8) = 0.98, since the two basic events are independent.

Join
DocAu:
  DocNo | AName | Prob | Basic Event
  1     | Bauer | 0.9  | eDU(1, Bauer)
  2     | Meier | 0.8  | eDU(2, Meier)
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
DocAu ⋈ DocTerm:
  DocAu.DocNo | AName | DocTerm.DocNo | Term | Prob      | Complex Event
  1           | Bauer | 1             | IR   | 0.9 × 0.9 | eDU(1, Bauer) ∧ eDT(1, IR)
  1           | Bauer | 2             | DB   | 0.9 × 0.7 | eDU(1, Bauer) ∧ eDT(2, DB)
  2           | Meier | 1             | IR   | 0.8 × 0.9 | eDU(2, Meier) ∧ eDT(1, IR)
  2           | Meier | 2             | DB   | 0.8 × 0.7 | eDU(2, Meier) ∧ eDT(2, DB)

Join + Projection
DocAu:
  DocNo | AName   | Prob | Basic Event
  1     | Bauer   | 0.9  | eDU(1, Bauer)
  2     | Bauer   | 0.3  | eDU(2, Bauer)
  2     | Meier   | 0.9  | eDU(2, Meier)
  2     | Schmidt | 0.8  | eDU(2, Schmidt)
  3     | Schmidt | 0.7  | eDU(3, Schmidt)
  4     | Koch    | 0.9  | eDU(4, Koch)
  4     | Bauer   | 0.6  | eDU(4, Bauer)
DocTerm:
  DocNo | Term | Prob | Basic Event
  1     | IR   | 0.9  | eDT(1, IR)
  2     | DB   | 0.7  | eDT(2, DB)
  3     | IR   | 0.8  | eDT(3, IR)
  3     | DB   | 0.5  | eDT(3, DB)
  4     | AI   | 0.8  | eDT(4, AI)
Intermediate results:
IR (documents about IR):
  DocNo | Prob | Complex Event
  1     | 0.9  | eDT(1, IR)
  3     | 0.8  | eDT(3, IR)
DB (documents about DB):
  DocNo | Prob | Complex Event
  2     | 0.7  | eDT(2, DB)
  3     | 0.5  | eDT(3, DB)
Authors of IR documents:
  AName   | Prob | Complex Event
  Bauer   | 0.81 | eDU(1, Bauer) ∧ eDT(1, IR)
  Schmidt | 0.56 | eDU(3, S) ∧ eDT(3, IR)
Authors of DB documents:
  AName   | Prob | Complex Event
  Bauer   | 0.21 | eDU(2, Bauer) ∧ eDT(2, DB)
  Meier   | 0.63 | eDU(2, Meier) ∧ eDT(2, DB)
  Schmidt | 0.91 | (eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))
Authors of both IR and DB documents:
  AName   | Prob                               | Complex Event
  Bauer   | 0.81 × 0.21 = 0.1701               | (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
  Schmidt | 0.56 × 0.91 = 0.5096; truly 0.4368 | (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))

Intensional vs. Extensional Semantics
- Intensional semantics
  - Assumes independence of the base tables only
  - Keeps track of event dependence during evaluation
- Extensional semantics
  - Assumes independence at every step of the evaluation
  - Can be WRONG in its probability computation!
- When does intensional = extensional?
  - When no basic event occurs more than once in an event expression
- In Schmidt's final event above, the basic event eDU(3, S) occurs in both conjuncts, so the extensional product 0.56 × 0.91 = 0.5096 is wrong; the intensional probability is 0.4368.

Summary of Fuhr & Rölleke 1997
- Probabilistic DB model
  - Concept of event: basic vs. complex events, event expressions
- Probabilistic Relational Algebra: just like relational algebra
- Computation of event probabilities
  - Intensional vs. extensional semantics
  - The two yield the same result when there is NO dependence within event expressions

Top-k Queries
- Traditionally, given
  - Objects o1, o2, …, on
  - A non-negative integer k
  - A scoring function s
- Question: what are the k objects with the highest score?
- Top-k queries have been studied over the Web, XML, and relational databases, and more recently over probabilistic databases.
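The intensional/extensional gap in the Schmidt example above can be checked numerically. The sketch below (function and variable names are mine, not from the slides) computes the exact probability of Schmidt's complex event by enumerating truth assignments of its five independent basic events; it yields 0.4368 rather than the extensional product 0.56 × 0.91 = 0.5096:

```python
from itertools import product

# Basic-event probabilities from the running example
p = {"du3S": 0.7, "dt3IR": 0.8, "du2S": 0.8, "dt2DB": 0.7, "dt3DB": 0.5}

def intensional(formula, p):
    """Exact probability of a propositional event expression, computed by
    enumerating truth assignments of the (independent) basic events."""
    names = list(p)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= p[n] if world[n] else 1 - p[n]
        if formula(world):
            total += weight
    return total

# Schmidt: (du3S ∧ dt3IR) ∧ ((du2S ∧ dt2DB) ∨ (du3S ∧ dt3DB))
schmidt = lambda w: (w["du3S"] and w["dt3IR"]) and \
                    ((w["du2S"] and w["dt2DB"]) or (w["du3S"] and w["dt3DB"]))
print(round(intensional(schmidt, p), 4))   # 0.4368
```

Brute-force enumeration is exponential in the number of basic events, which is exactly why extensional evaluation is tempting; the point of the example is that it is only safe when no basic event repeats.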
Scoring Function
- A scoring function over a deterministic relation R is a function s : R → ℝ.
- It induces a ranking: for any ti and tj from R, ti is ranked above tj iff s(ti) > s(tj).

Motivating Example I: Smart Environment
- Sample question: "Who were the two visitors in the lab last Saturday night?"
- Data
  - Biometric data from sensors: how well the data match each candidate's profile gives a scoring function
  - Historical statistics, e.g. the probability of a candidate being in the lab on Saturday nights

Motivating Example I (cont.)
  Personnel | score(FaceDetection, VoiceDetection, …) | Prob. of being in lab on Saturday nights
  Aiden     | score(0.70, 0.60, …) = 0.65             | 0.3
  Bob       | score(0.50, 0.60, …) = 0.55             | 0.9
  Chris     | score(0.50, 0.40, …) = 0.45             | 0.4
- Question: finding the two people in the lab last Saturday night is a top-2 query over this probabilistic database under this scoring function.

Motivating Example II: Sensor Network in a Habitat
- Sample question: "What is the temperature of the warmest spot?"
- Data
  - Sensor readings from different sensors
  - At a sampling time, only one "real" reading per sensor
  - Each sensor reading comes with a confidence value

Motivating Example II (cont.)
  Part               | Temp (°F) | Prob
  C1 (from Sensor 1) | 22        | 0.6
                     | 10        | 0.4
  C2 (from Sensor 2) | 25        | 0.1
                     | 15        | 0.6
- Question: the warmest spot's temperature is a top-1 query over this probabilistic database, under a scoring function proportional to temperature.

Models
- A probabilistic relation Rp = <R, p, C>
  - R: the support deterministic relation
  - p: the probability function
  - C: a partition of R such that the probabilities of the tuples within each part sum to at most 1; tuples in the same part are exclusive
- Simple vs. general probabilistic relations
  - Simple: tuple independence, i.e. |C| = |R|; e.g. the smart environment example
  - General: tuples can be independent or exclusive, i.e. |C| < |R|; e.g. the sensor network example

Challenges
Given
- A probabilistic relation Rp = <R, p, C>
- An injective scoring function s over R (no ties)
- A non-negative integer k
Questions
- What is the top-k answer set over Rp? (Semantics)
- How do we compute the top-k answer of Rp? (Query evaluation)

What is a "Good" Semantics?
Desired properties: Exact-k, Faithfulness, Stability
- Exact-k: if R has at least k tuples, then exactly k tuples are returned as the top-k answer
- Faithfulness: a "better" tuple, i.e. one higher in both score and probability, is more likely to be in the top-k answer than a "worse" one
- Stability: raising the score/probability of a winning tuple cannot make it lose; lowering the score/probability of a losing tuple cannot make it win

Global-Topk Semantics
- Return the k highest-ranked tuples according to their probability of being in the top-k answer of a possible world
- Global-Topk satisfies all three properties above

Smart Environment Example (Global-Topk)
Query: find the two people in the lab last Saturday night (scores and probabilities as above).
Possible worlds (A = Aiden, B = Bob, C = Chris):
  World     | Prob
  {}        | 0.042
  {A}       | 0.018
  {B}       | 0.378
  {C}       | 0.028
  {A, B}    | 0.162
  {A, C}    | 0.012
  {B, C}    | 0.252
  {A, B, C} | 0.108
Global-Topk semantics:
- Pr(Bob in top-2) = 0.9
- Pr(Aiden in top-2) = 0.3
- Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
Top-2 answer: {Bob, Aiden}

Other Semantics [Soliman, Ilyas & Chang 2007]
Two alternative semantics: U-Topk and U-kRanks.

U-Topk Semantics
- Return the most probable top-k answer set that occurs in some possible world
- U-Topk does not satisfy all three properties

Smart Environment Example (U-Topk)
Same possible worlds as above; candidate top-2 sets include:
- Pr({Bob}) = 0.378
- Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
- …
Top-2 answer: {Bob}

U-kRanks Semantics
- For i = 1, 2, …, k, return the most probable i-th-ranked tuple across all possible worlds
- U-kRanks does not satisfy all three properties

Smart Environment Example (U-kRanks)
  Personnel | Pr at rank 1 | Pr at rank 2
  Aiden     | 0.3          | 0
  Bob       | 0.63         | 0.27
  Chris     | 0.028        | 0.264
- E.g. Pr(Chris at rank 2) = 0.012 + 0.252 = 0.264
- Bob is the most probable tuple at rank 1 and at rank 2
Top-2 answer: {Bob}

Properties: Which Semantics Is Better?
  Semantics   | Exact-k | Faithfulness | Stability
  Global-Topk | Yes     | Yes          | Yes
  U-Topk      | No      | Yes/No*     | Yes
  U-kRanks    | No      | No           | No
* Yes when the relation is simple, No otherwise.

Challenges (revisited)
- Semantics: answered by Global-Topk above
- Query evaluation: how do we compute the top-k answer of Rp?

Global-Topk in Simple Relations
- Given Rp = <R, p, C>, a scoring function s, and a non-negative integer k
- Assumptions
  - Tuples are independent, i.e. |C| = |R|
  - R = {t1, t2, …, tn}, ordered by decreasing score, i.e. s(t1) > s(t2) > … > s(tn)
- Query evaluation
  - Recursion on P_{k,s}(ti), the Global-Topk probability of tuple ti: ti is in the top-k iff ti appears and fewer than k of t1, …, t(i-1) appear
  - Dynamic programming
  - Optimization: the Threshold Algorithm

Threshold Algorithm (TA) [Fagin, Lotem & Naor 2001]
- Given a system of objects such that
  - For each object attribute, there is a list of the objects sorted in decreasing order of their score on that attribute
  - An aggregation function f combines the individual attribute scores xi, i = 1, 2, …, m, into the overall object score f(x1, x2, …, xm)
  - f is monotonic: f(x1, x2, …, xm) <= f(x'1, x'2, …, x'm) whenever xi <= x'i for every i
- TA is cost-optimal for finding the top-k objects
- TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.

Applying the TA Optimization to Global-Topk
- Two attributes: probability and score
- Aggregation function: the Global-Topk probability

Global-Topk in General Relations
- Given Rp = <R, p, C>, a scoring function s, and a non-negative integer k
- Assumptions
  - Tuples are independent or exclusive, i.e. |C| < |R|
  - R = {t1, t2, …, tn}, ordered by decreasing score
- Induced event relation
  - For each tuple t in R, a probabilistic relation Ep = <E, pE, CE> is generated by two rules
  - Ep is simple

Sensor Network Example
Probabilistic relation (general):
  Part               | Temp (°F) | Prob
  C1 (from Sensor 1) | 22        | 0.6
                     | 10        | 0.4
  C2 (from Sensor 2) | 25        | 0.1
                     | 15        | 0.6
For example, take t = (15 °F, 0.6) from part C2. Its induced event relation (simple) is:
  Event | Prob
  tC1   | 0.6   (Rule 1: the total probability of the tuples in C1 scoring above t, here the 22 °F reading)
  tt    | 0.6   (Rule 2: p(t), for t itself)

Evaluating Global-Topk in General Relations
- For each tuple t, generate the corresponding induced event relation
- Compute the Global-Topk probability of t (Theorem 4.3)
- Pick the k tuples with the highest Global-Topk probability

Summary of Query Evaluation
- Simple (independent tuples)
  - Dynamic programming: tuples are ordered by score; recursion on the tuple index and k
- General (independent/exclusive tuples)
  - Polynomial reduction to the simple case

Complexity
  Semantics   | Simple | General
  Global-Topk | O(kn)  | O(kn^2)
  U-Topk      | O(kn)  | Θ(mk·n^(k-1)·log n)*
  U-kRanks    | O(kn)  | Ω(m·n^(k-1))*
* m is a rule-engine-related factor; it captures how complicated the relationships between tuples can be.

Conclusion
- Three intuitive semantic properties for top-k queries in probabilistic databases
- The Global-Topk semantics, which satisfies all of the properties above
- Query evaluation algorithms for Global-Topk over simple and general probabilistic databases

Future Problems
- Weak-order scoring functions
  - Allowing ties
  - Not clear how to extend the properties
  - Not clear how to define the semantics (other than via an "arbitrary tie-breaker")
- Preference strength: sensitivity to scores
  - Given a probabilistic relation Rp, if the DB is sufficiently large, then by manipulating the scores of tuples we can obtain different answers
  - Not satisfied by our semantics, nor by any semantics in the literature
  - Need to take preference strength into account in the semantics

Thank you!
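The dynamic program for simple relations sketched above can be made concrete. The snippet below (names are mine, not from the paper) maintains, for each tuple in decreasing-score order, the distribution of how many higher-scored tuples appear, and from it the Global-Topk probability P_{k,s}(ti) = p(ti) · Pr(fewer than k of t1, …, t(i-1) appear); this is a sketch assuming tuple independence and an injective scoring function:

```python
def global_topk_simple(tuples, k):
    """Global-Topk probabilities for a simple probabilistic relation.

    tuples: list of (name, score, prob); tuples are assumed independent.
    Returns (per-tuple Global-Topk probabilities, top-k names).
    """
    ranked = sorted(tuples, key=lambda t: t[1], reverse=True)
    result = {}
    # dp[j] = Pr(exactly j of the tuples ranked above the current one appear)
    dp = [1.0]
    for name, _score, p in ranked:
        # t is in the top-k iff t appears and fewer than k higher-ranked tuples do
        result[name] = p * sum(dp[:k])
        # fold t into the prefix distribution for the next tuple
        nxt = [0.0] * (len(dp) + 1)
        for j, q in enumerate(dp):
            nxt[j] += q * (1 - p)      # t absent
            nxt[j + 1] += q * p        # t present
        dp = nxt[:k + 1]               # counts >= k never matter, so O(kn) overall
    answer = sorted(result, key=result.get, reverse=True)[:k]
    return result, answer

# Smart environment example: scores and probabilities from the slides
probs, top2 = global_topk_simple(
    [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)], 2)
```

On the slides' example this reproduces the possible-worlds computation: Pr(Bob) = 0.9, Pr(Aiden) = 0.3, Pr(Chris) = 0.292, and the top-2 answer {Bob, Aiden}.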
Related Works
- Introduction to probabilistic databases
  - Probabilistic DB model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
- Top-k queries in probabilistic databases
  - On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
  - Alternative top-k semantics and query evaluation in probabilistic databases [Soliman, Ilyas & Chang 2007]