A Generic Framework for Handling
Uncertain Data with Local Correlations
Xiang Lian and Lei Chen
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon
Hong Kong, China
{xlian, leichen}@cse.ust.hk
VLDB 2011 @ Seattle
Motivation Example

Sensory data:
<temperature, light>
Forest monitoring application
[Figure: a forest with sensor nodes n1 to n8]
Motivation Example (cont'd)

Samples si collected from sensor node ni
[Figure: samples s1 to s8 plotted in the temperature-light plane]
Motivation Example (cont'd)

Sensory data are uncertain and imprecise
[Figure: uncertainty regions of objects o1 to o8 in the temperature-light plane]
Motivation Example (cont'd)

3 monitoring areas
[Figure: the forest with sensor nodes n1 to n8 and monitoring areas Area 1, Area 2, and Area 3]
Motivation Example (cont'd)

3 monitoring areas
[Figure: the same forest and monitoring areas, annotated with "spatially close sensors" and "sensors far away"]
Locally Correlated Sensory Data
[Figure: uncertainty regions o1 to o8 in the temperature-light plane, grouped by Area 1, Area 2, and Area 3; annotation: local correlations among sensory data]
Nearest Neighbor Queries on Locally Correlated Uncertain Data
[Figure: query point q and uncertainty regions o1 to o8 in the temperature-light plane]
Outline

Introduction
Model for Locally Correlated Uncertain Data
Problem Definition
Query Answering on Uncertain Data With Local Correlations
Experimental Evaluation
Conclusions
Introduction

Uncertain data are pervasive in real applications
Sensor networks
RFID networks
Location-based services
Data integration
Existing works often assume independence among uncertain objects, but uncertain objects exhibit correlations, in particular local correlations!
Data Model for Local Correlations

Data Model

Uncertain objects are grouped into several locally correlated partitions (LCPs)
Uncertain objects within each LCP are correlated with each other
Uncertain objects from distinct LCPs are independent of each other
[Figure: uncertainty regions o1 to o8 in the temperature-light plane; annotation: local correlations among sensory data]
Data Model for Local Correlations
(cont'd)

Bayesian network


Each vertex corresponds to a random variable
Each vertex is associated with a conditional probability table (CPT)
[Figure: a locally correlated partition with objects o3, o4, o5, and o6 in the temperature-light plane (labels: LCS(o5), {o5}), and its Bayesian network with vertices o6, o4, o3, o5 and CPTs Pr{o6}, Pr{o4 | o6}, Pr{o3 | o6}, Pr{o5 | o3, o4}]
Data Model for Local Correlations
(cont'd)

The joint probability of variables
Join tuples in CPTs and multiply conditional probabilities
Variable elimination
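The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the dependency structure (o6 as parent of o3 and o4, which are both parents of o5) follows the CPTs named on the slides, but each variable is reduced to two hypothetical values ("lo", "hi") and every probability below is invented.

```python
from itertools import product

# Hypothetical two-value domain for every object (an assumption for brevity).
VALUES = ("lo", "hi")

# CPTs keyed by (own value, *parent values). All probabilities are invented.
P_O6 = {("lo",): 0.8, ("hi",): 0.2}
P_O3_GIVEN_O6 = {("lo", "lo"): 0.4, ("hi", "lo"): 0.6,
                 ("lo", "hi"): 0.7, ("hi", "hi"): 0.3}
P_O4_GIVEN_O6 = {("lo", "lo"): 0.2, ("hi", "lo"): 0.8,
                 ("lo", "hi"): 0.5, ("hi", "hi"): 0.5}
P_O5_GIVEN_O3_O4 = {  # key order: (o5, o3, o4)
    ("lo", "lo", "lo"): 0.5, ("hi", "lo", "lo"): 0.5,
    ("lo", "lo", "hi"): 0.1, ("hi", "lo", "hi"): 0.9,
    ("lo", "hi", "lo"): 0.3, ("hi", "hi", "lo"): 0.7,
    ("lo", "hi", "hi"): 0.6, ("hi", "hi", "hi"): 0.4,
}


def joint(o3, o4, o5, o6):
    """Join one tuple from each CPT and multiply the conditional probabilities."""
    return (P_O6[(o6,)]
            * P_O3_GIVEN_O6[(o3, o6)]
            * P_O4_GIVEN_O6[(o4, o6)]
            * P_O5_GIVEN_O3_O4[(o5, o3, o4)])


def marginal_o5_brute_force(o5):
    """Sum the full joint distribution over o3, o4, and o6."""
    return sum(joint(o3, o4, o5, o6)
               for o3, o4, o6 in product(VALUES, repeat=3))


def marginal_o5_variable_elimination(o5):
    """Eliminate o6 first, then sum out o3 and o4."""
    # Intermediate factor over (o3, o4), obtained by summing out o6.
    factor = {
        (o3, o4): sum(P_O6[(o6,)]
                      * P_O3_GIVEN_O6[(o3, o6)]
                      * P_O4_GIVEN_O6[(o4, o6)]
                      for o6 in VALUES)
        for o3, o4 in product(VALUES, repeat=2)
    }
    return sum(factor[(o3, o4)] * P_O5_GIVEN_O3_O4[(o5, o3, o4)]
               for o3, o4 in product(VALUES, repeat=2))
```

Both marginal functions return the same value; variable elimination simply avoids materializing the full joint table, which matters once an LCP holds more than a handful of objects.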


[Figure: the Bayesian network over o6, o4, o3, and o5 with one CPT per vertex (tables T1 to T4): Pr{o6} over columns o6.t, o6.l; Pr{o4 | o6} over o6.t, o6.l, o4.t, o4.l; Pr{o3 | o6} over o6.t, o6.l, o3.t, o3.l; Pr{o5 | o3, o4} over o3.t, o3.l, o4.t, o4.l, o5.t, o5.l. Here oi.t is the temperature and oi.l the light value of oi; for example, the tuple (o6.t = 23, o6.l = 60) has Pr{o6} = 0.8.]
Definition of LC-PNN Query

Probabilistic Nearest Neighbor Query on Uncertain and
Locally Correlated Data, LC-PNN
[Figure: query point q and uncertainty regions o1 to o8 in the temperature-light plane]
Challenges & Solutions

Challenges




Straightforward method of linear scan is costly
Computation cost of integration is expensive
Dealing with data correlations
Filtering Methods


Index pruning
Candidate filtering with pre-computations
Index Pruning

Basic idea


Let best_so_far be the smallest maximum distance from query point q to any uncertain object seen so far
Then, any object/node e with mindist(q, e) > best_so_far can be safely pruned
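The pruning rule can be sketched as follows. This is a simplified illustration under assumptions: uncertainty regions are modeled as axis-aligned rectangles, and the scan runs over a flat collection sorted by mindist rather than over the nodes of a tree index. The helper names `maxdist` and `index_pruning` are hypothetical; the same test would apply to the bounding rectangle of an index node.

```python
import math


def mindist(q, rect):
    """Smallest distance from point q to rectangle ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = rect
    dx = max(x1 - q[0], 0.0, q[0] - x2)
    dy = max(y1 - q[1], 0.0, q[1] - y2)
    return math.hypot(dx, dy)


def maxdist(q, rect):
    """Largest distance from point q to any point of the rectangle."""
    (x1, y1), (x2, y2) = rect
    dx = max(abs(q[0] - x1), abs(q[0] - x2))
    dy = max(abs(q[1] - y1), abs(q[1] - y2))
    return math.hypot(dx, dy)


def index_pruning(q, regions):
    """Return the names of regions that survive the best_so_far test.

    regions maps an object name to its rectangular uncertainty region.
    """
    best_so_far = math.inf
    candidates = []
    # Visit regions in ascending order of mindist, as a best-first traversal would.
    for name, rect in sorted(regions.items(), key=lambda kv: mindist(q, kv[1])):
        if mindist(q, rect) > best_so_far:
            break  # every remaining region has an even larger mindist
        candidates.append(name)
        best_so_far = min(best_so_far, maxdist(q, rect))
    return candidates


regions = {
    "o1": ((1.0, 1.0), (2.0, 2.0)),
    "o2": ((2.0, 0.0), (3.0, 1.0)),
    "o3": ((5.0, 5.0), (6.0, 6.0)),
}
survivors = index_pruning((0.0, 0.0), regions)  # o3 lies farther than best_so_far
```

Pruning is safe because some object is certain to lie within best_so_far of q, so an object whose closest possible position is farther away can never be the nearest neighbor.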
[Figure: query point q, uncertainty regions o1 to o8, and the distance best_so_far in the temperature-light plane]
Candidate Filtering with Pre-Computations

Basic idea


Obtain an upper bound, UB_PrLC-PNN(q, oi), of the LC-PNN probability
Object oi can be safely pruned if UB_PrLC-PNN(q, oi) < a, where a is the probability threshold
How to obtain the probability upper bound?
Derive it from the formula of the LC-PNN probability, via pivots!
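The filtering rule itself is a one-line comparison, sketched below. The upper bounds are passed in as plain numbers because the slides do not give their formula; the function name `filter_candidates` and the sample values are hypothetical.

```python
def filter_candidates(upper_bounds, a):
    """Split objects into kept and pruned, given upper bounds on their LC-PNN probability.

    upper_bounds maps an object id to a value assumed to be >= its true probability.
    """
    kept, pruned = [], []
    for oid, ub in upper_bounds.items():
        # If even the upper bound is below a, the true probability is below a too.
        (pruned if ub < a else kept).append(oid)
    return kept, pruned


kept, pruned = filter_candidates({"o1": 0.9, "o2": 0.3, "o3": 0.5}, a=0.5)
```

Objects in `kept` are only candidates: their exact probabilities still have to be computed in the refinement step.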
Derivation of Probability Upper Bound
[Figure: pivot piv_s5 and the quantity l]
Range [min_l, max_l] of l

l = [equation]
Let min_l = [equation] and max_l = [equation]
If the online l is smaller than min_l, then JPo(s5) = 1
If the online l is greater than max_l, then JPo(s5) = 0
Thus, we do not need to store pre-computations for l outside the range [min_l, max_l]
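One way to picture the bounded store is a lookup table that clamps at both ends. The sketch below makes several assumptions that the slides do not state: JP values are pre-computed on a discrete grid of l values, JP is non-increasing in l (consistent with the two boundary cases), and an in-range l is rounded down to the nearest grid point so that the returned value never underestimates. The class name `PrecomputedJP` and all numbers are hypothetical.

```python
import bisect


class PrecomputedJP:
    """Hypothetical store of pre-computed JP values for one sample."""

    def __init__(self, l_grid, jp_values):
        # l_grid is ascending and spans [min_l, max_l]; jp_values[i] belongs to l_grid[i].
        self.l_grid = list(l_grid)
        self.jp_values = list(jp_values)
        self.min_l = self.l_grid[0]
        self.max_l = self.l_grid[-1]

    def lookup(self, l):
        """Return a value that is >= JP at the online l (assuming JP is non-increasing)."""
        if l < self.min_l:
            return 1.0  # nothing stored below min_l
        if l > self.max_l:
            return 0.0  # nothing stored above max_l
        # Largest grid point <= l: rounding down keeps the result an upper bound.
        i = bisect.bisect_right(self.l_grid, l) - 1
        return self.jp_values[i]


store = PrecomputedJP(l_grid=[1.0, 2.0, 3.0], jp_values=[0.9, 0.5, 0.1])
```

With this layout the storage cost depends only on the number of grid points inside [min_l, max_l], not on the range of l values seen at query time.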
Candidate Positions of Pivots
[Figure: candidate positions for the pivot piv_s5 of sample s5 relative to query point q, with regions labeled nw, ne, sw, and se, in the temperature-light plane]
Selection of Pivot Positions

We provide a cost model that formalizes the filtering and refinement costs, and use it to obtain a parameter value that achieves low query cost
[Figure: sample s5, its pivot piv_s5, query point q, and regions nw, ne, sw, se]
LC-PNN Query Procedure


Index uncertain objects containing LCPs in an R-tree-based index
For an LC-PNN query
When traversing the index, apply the index pruning and candidate filtering methods to remove false alarms
Refine the candidates and return the true query answers
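The refinement step can be illustrated with a brute-force computation over possible worlds. This is a sketch under assumptions, not the procedure from the talk: each LCP is given directly as a small list of joint instances (a probability plus one position per object), LCPs are combined as independent, and an object is returned when its nearest-neighbor probability is at least a. The function name `lc_pnn_refine` and the data are hypothetical.

```python
import math
from itertools import product


def lc_pnn_refine(q, lcps, a):
    """Return {object id: NN probability} for objects with probability >= a.

    lcps is a list of LCPs; each LCP is a list of (probability, {object id: (x, y)})
    joint instances, so correlations inside an LCP are captured by its instances.
    """
    nn_prob = {}
    # LCPs are independent, so a possible world picks one joint instance per LCP.
    for world in product(*lcps):
        p = math.prod(prob for prob, _ in world)
        positions = {}
        for _, instance in world:
            positions.update(instance)
        nearest = min(positions, key=lambda oid: math.dist(q, positions[oid]))
        nn_prob[nearest] = nn_prob.get(nearest, 0.0) + p
    return {oid: pr for oid, pr in nn_prob.items() if pr >= a}


# o1 and o2 are correlated (same LCP); o3 forms its own LCP.
lcps = [
    [(0.6, {"o1": (1.0, 0.0), "o2": (3.0, 0.0)}),
     (0.4, {"o1": (4.0, 0.0), "o2": (2.0, 0.0)})],
    [(0.5, {"o3": (1.5, 0.0)}),
     (0.5, {"o3": (5.0, 0.0)})],
]
answers = lc_pnn_refine((0.0, 0.0), lcps, a=0.5)
```

In this toy input o1 is the nearest neighbor with probability 0.6, while o2 and o3 each reach 0.2. The enumeration is exponential in the number of LCPs, which is why the pruning and filtering steps are applied first.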
Experimental Evaluation

Data Sets
Real data: California road network
Synthetic data: lUeU, lUeG, lSeU, and lSeG
Generate center locations of LCPs with Uniform or Skew distribution
Produce extent lengths of LCPs with Uniform or Gaussian distribution
Within LCPs, randomly generate locally correlated uncertain objects with Bayesian networks
Competitor
Basic method [Cheng et al., SIGMOD 2003]
Assumes uncertain objects are independent
Measures
Wall clock time
Speed-up ratio
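A rough sketch of how synthetic LCPs of this kind could be generated is shown below. Everything specific here is an assumption: reading the data set names as center-location distribution (U or S) plus extent distribution (U or G), the power transform used for skew, the Gaussian mean and spread, the size of the data space, and the function name `generate_lcps`. The Bayesian network step is omitted, so objects are only placed inside each LCP's extent.

```python
import random


def generate_lcps(n_lcps, center_dist="uniform", extent_dist="uniform",
                  extent_range=(1.0, 3.0), space=1000.0,
                  objects_per_lcp=5, seed=0):
    """Generate LCP centers, extent lengths, and object positions (hypothetical sketch)."""
    rng = random.Random(seed)
    lo, hi = extent_range
    lcps = []
    for _ in range(n_lcps):
        if center_dist == "uniform":
            cx, cy = rng.uniform(0, space), rng.uniform(0, space)
        else:
            # "skew": an assumed power transform that concentrates centers near the origin.
            cx, cy = space * rng.random() ** 3, space * rng.random() ** 3
        if extent_dist == "uniform":
            extent = rng.uniform(lo, hi)
        else:
            # "gaussian": assumed mean at the middle of the range, clamped to the range.
            extent = min(hi, max(lo, rng.gauss((lo + hi) / 2, (hi - lo) / 6)))
        half = extent / 2
        objects = [(rng.uniform(cx - half, cx + half),
                    rng.uniform(cy - half, cy + half))
                   for _ in range(objects_per_lcp)]
        lcps.append({"center": (cx, cy), "extent": extent, "objects": objects})
    return lcps


sample = generate_lcps(20, center_dist="skew", extent_dist="gaussian", seed=1)
```

Fixing the seed makes a generated data set reproducible across runs, which helps when comparing wall clock times between methods.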
LC-PNN Performance vs. a
[Figure: wall clock time (sec) and speed-up ratio versus a, for a = 0.1, 0.2, 0.5, 0.8, 0.9, on the lUeU and lUeG data sets]
Extent length of LCP = [1, 3], data size N = 150K, average number of uncertain objects in an LCP = 5
Conclusions



We proposed the problem of queries over locally correlated uncertain data, in particular the LC-PNN query, which is important in real applications
We designed the index pruning method and, based on a proposed cost model, presented the candidate filtering method via offline pre-computations w.r.t. pivots
We provided efficient query processing techniques to answer LC-PNN queries on locally correlated uncertain data, and discussed applying the same framework to answer other types of queries
Thank you!
Q/A