A Generic Framework for Handling Uncertain Data with Local Correlations

Xiang Lian and Lei Chen
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong, China
{xlian, leichen}@cse.ust.hk
VLDB 2011 @ Seattle

Motivation Example
- Sensory data: <temperature, light>
- Forest monitoring application
[Figure: a forest with sensor nodes n1 to n8]

Motivation Example (cont'd)
- Samples si collected from sensor node ni
[Figure: samples s1 to s8 plotted in the temperature-light space]

Motivation Example (cont'd)
- Sensory data are uncertain and imprecise
[Figure: uncertainty regions of objects o1 to o8 in the temperature-light space]

Motivation Example (cont'd)
- 3 monitoring areas
[Figure: the forest with sensor nodes n1 to n8 grouped into monitoring Areas 1, 2, and 3]

Motivation Example (cont'd)
- 3 monitoring areas
[Figure: the same three monitoring areas, annotated "spatially close sensors" and "sensors far away"]

Locally Correlated Sensory Data
[Figure: uncertainty regions of o1 to o8 in the temperature-light space, grouped by Areas 1, 2, and 3; annotation: local correlations among sensory data]

Nearest Neighbor Queries on Locally Correlated Uncertain Data
[Figure: query point q among the uncertainty regions of o1 to o8 in the temperature-light space]

Outline
- Introduction
- Model for Locally Correlated Uncertain Data
- Problem Definition
- Query Answering on Uncertain Data With Local Correlations
- Experimental Evaluation
- Conclusions

Introduction
- Uncertain data are pervasive in real applications
  - Sensor networks
  - RFID networks
  - Location-based services
  - Data integration
- While existing works often assume the independence among uncertain objects, uncertain objects exhibit correlations (local correlations!)

Data Model for Local Correlations
- Data Model
  - Uncertain objects contain several locally correlated partitions (LCPs)
  - Uncertain objects within each LCP are correlated with each other
  - Uncertain objects from distinct LCPs are independent of each other
[Figure: uncertainty regions of o1 to o8 in the temperature-light space; annotation: local correlations among sensory data]

Data Model for Local Correlations (cont'd)
- Bayesian network
  - Each vertex corresponds to a random variable
  - Each vertex is associated with a conditional probability table (CPT)
[Figure: a locally correlated partition containing o3, o4, o5, and o6 in the temperature-light space, labeled LCS(o5), next to its Bayesian network with Pr{o6}, Pr{o4 | o6}, Pr{o3 | o6}, and Pr{o5 | o3, o4}]

Data Model for Local Correlations (cont'd)
- The joint probability of variables
  - Join tuples in CPTs and multiply conditional probabilities
  - Variable elimination
[Figure: the Bayesian network over o6, o4, o3, and o5 with CPTs for Pr{o6}, Pr{o4 | o6}, Pr{o3 | o6}, and Pr{o5 | o3, o4}, shown as tables T1 to T4 with columns for temperature (oi.t), light (oi.l), and probability; example row: o6.t = 23, o6.l = 60, Pr{o6} = 0.8]
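The "join tuples in CPTs and multiply conditional probabilities" step can be illustrated with a minimal Python sketch. It assumes the network structure shown on the slides (o6 is the parent of o3 and o4, which are the parents of o5) and two possible (temperature, light) samples per object. Apart from Pr{o6 = (23, 60)} = 0.8, every sample value and probability below is hypothetical, and the marginal is computed by brute-force summation, which variable elimination would compute more efficiently by summing out one variable at a time.

```python
from itertools import product

# Hypothetical (temperature, light) samples, two per object.
A6, B6 = (23, 60), (24, 62)
A4, B4 = (23, 61), (25, 63)
A3, B3 = (22, 65), (21, 66)
A5, B5 = (23, 65), (22, 64)

# Hypothetical CPTs; only Pr{o6 = (23, 60)} = 0.8 is taken from the slide.
pr_o6 = {A6: 0.8, B6: 0.2}
pr_o4_given_o6 = {(A4, A6): 0.2, (B4, A6): 0.8, (A4, B6): 0.5, (B4, B6): 0.5}
pr_o3_given_o6 = {(A3, A6): 0.4, (B3, A6): 0.6, (A3, B6): 0.7, (B3, B6): 0.3}
pr_o5_given_o3_o4 = {
    (A5, A3, A4): 0.5, (B5, A3, A4): 0.5,
    (A5, A3, B4): 0.1, (B5, A3, B4): 0.9,
    (A5, B3, A4): 0.3, (B5, B3, A4): 0.7,
    (A5, B3, B4): 0.6, (B5, B3, B4): 0.4,
}


def joint(o3, o4, o5, o6):
    """Multiply the four CPT entries that match one full assignment."""
    return (pr_o6[o6]
            * pr_o4_given_o6[(o4, o6)]
            * pr_o3_given_o6[(o3, o6)]
            * pr_o5_given_o3_o4[(o5, o3, o4)])


def marginal_o5(o5):
    """Sum the joint over all values of o3, o4, and o6 (brute force)."""
    return sum(joint(o3, o4, o5, o6)
               for o3, o4, o6 in product((A3, B3), (A4, B4), (A6, B6)))


p_joint = joint(A3, A4, A5, A6)                      # 0.8 * 0.2 * 0.4 * 0.5
p_total = marginal_o5(A5) + marginal_o5(B5)          # marginals of o5 sum to 1
```

Because objects in distinct LCPs are independent, a table like this only needs to cover the objects of a single LCP, which keeps the number of joined tuples small.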
Definition of LC-PNN Query
- Probabilistic Nearest Neighbor Query on Uncertain and Locally Correlated Data, LC-PNN
[Figure: query point q among the uncertainty regions of o1 to o8 in the temperature-light space]

Challenges & Solutions
- Challenges
  - Straightforward method of linear scan is costly
  - Computation cost of integration is expensive
  - Dealing with data correlations
- Filtering Methods
  - Index pruning
  - Candidate filtering with pre-computations

Index Pruning
- Basic idea
  - Let best_so_far be the smallest maximum distance from query point q to any uncertain objects seen so far
  - Then, any objects/nodes e having mindist(q, e) > best_so_far can be safely pruned
[Figure: query point q, the distance best_so_far, and the uncertainty regions of o1 to o8 in the temperature-light space]

Candidate Filtering with Pre-Computations
- Basic idea
  - Obtain an upper bound, UB_PrLC-PNN(q, oi), of the LC-PNN probability
  - Object oi can be safely pruned, if UB_PrLC-PNN(q, oi) < a
- How to obtain the probability upper bound?
  - Derived from formula of the LC-PNN probability upper bound via pivots!
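As a minimal sketch of the index pruning rule only (not the candidate filtering step), the Python below applies the best_so_far test to a flat list of objects. It assumes each uncertainty region is an axis-aligned rectangle given as a pair of corner points and that distances are Euclidean. The function names `mindist`, `maxdist`, and `prune`, and the sample coordinates, are hypothetical, and the scan is a simplification of a real R-tree traversal, which would apply the same test to index nodes as well.

```python
import math


def mindist(q, rect):
    """Smallest Euclidean distance from point q to rectangle rect = (lo, hi)."""
    lo, hi = rect
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))


def maxdist(q, rect):
    """Largest Euclidean distance from point q to rectangle rect = (lo, hi)."""
    lo, hi = rect
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(q, lo, hi)))


def prune(q, regions):
    """Return the names of objects that survive index pruning.

    best_so_far is the smallest maximum distance from q to any object;
    every object e with mindist(q, e) > best_so_far is discarded.
    """
    best_so_far = min(maxdist(q, r) for r in regions.values())
    return sorted(name for name, r in regions.items()
                  if mindist(q, r) <= best_so_far)


# Hypothetical uncertainty regions in the (temperature, light) space.
q = (0.5, 0.5)
regions = {
    "o1": ((0.0, 0.0), (1.0, 1.0)),   # contains q
    "o2": ((0.8, 0.0), (1.8, 1.0)),   # overlaps the best_so_far radius
    "o3": ((5.0, 5.0), (6.0, 6.0)),   # far from q
}
survivors = prune(q, regions)
```

The rule is safe because some object is guaranteed to lie within best_so_far of q, so an object whose closest possible position is farther than that can never be the nearest neighbor.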
Derivation of Probability Upper Bound
[Figure: pivot piv_s5 and the quantity l]

Range [min_l, max_l] of l
- l = [formula]
- Let min_l = [formula], max_l = [formula]
- If online l is smaller than min_l, then JPo(s5) = 1
- If online l is greater than max_l, then JPo(s5) = 0
- Thus, we do not need to store pre-computations with l outside the range [min_l, max_l]

Candidate Positions of Pivots
[Figure: sample s5 with candidate pivot positions nw, ne, sw, and se, pivot piv_s5, and query point q in the temperature-light space]

Selection of Pivot Positions
- We provide a cost model to formalize the filtering and refinement costs, and obtain a good value of the parameter to achieve low query cost
[Figure: sample s5 with candidate positions nw, ne, sw, and se, pivot piv_s5, and query point q]

LC-PNN Query Procedure
- Index uncertain objects containing LCPs in an R-tree based index
- For an LC-PNN query
  - When traversing the index, apply index pruning method and candidate filtering to remove false alarms
  - Refine candidates and return true query answers

Experimental Evaluation
- Data Sets
  - Real data: California road network
  - Synthetic data: lUeU, lUeG, lSeU, and lSeG
    - Generate center locations of LCPs with Uniform or Skew distribution
    - Produce extent lengths of LCPs with Uniform or Gaussian distribution
    - Within LCPs, randomly generate locally correlated uncertain objects with Bayesian networks
- Competitor
  - Basic method [Cheng et al., SIGMOD 2003]
    - Assuming uncertain objects are independent
- Measures
  - Wall clock time
  - Speed-up ratio

LC-PNN Performance vs. a
[Figure: two panels, lUeU and lUeG, each plotting wall clock time (sec) and speed-up ratio against a for a = 0.1, 0.2, 0.5, 0.8, 0.9]
- Extent length of LCP = [1, 3], data size N = 150K, average No. of uncertain objects in an LCP = 5

Conclusions
- We proposed the problem of queries over locally correlated uncertain data, in particular, the LC-PNN query, which is important in real applications
- We designed the index pruning method, and based on a proposed cost model, we presented the candidate filtering method via offline pre-computations w.r.t. pivots
- We provided efficient query processing techniques to answer LC-PNN queries on locally correlated uncertain data, and discussed applying the same framework to answer other types of queries

Thank you!
Q/A