Adaptive Cleaning for RFID Data Streams

Shawn Jeffery (UC Berkeley)
Minos Garofalakis (Intel Research Berkeley)
Michael Franklin (UC Berkeley)
Presented by: Hamid Haidarian Shahri
Where Are We? Look at the Signs!
Looking at Signs – Before Jumping In
• S. Chaudhuri, U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, 1997.
  - 800+ citations
• DW and information integration
• “Data cleaning” term publicized
  - Identified its importance in integration
• Extensive research followed
VLDB 2001
• Session R12: DATA QUALITY & CLEANING
• Declarative data cleaning: language, model, and algorithms
  Helena Galhardas (INRIA Rocquencourt), Daniela Florescu (Propel), Dennis Shasha (NYU), Eric Simon, and Cristian-Augustin Saita (INRIA Rocquencourt)
• Potter's wheel: an interactive data cleaning system
  Vijayshankar Raman and Joseph M. Hellerstein (University of California at Berkeley)
• Update propagation strategies for improving the quality of data on the Web
  Alexandros Labrinidis and Nick Roussopoulos (University of Maryland)
Data Cleaning Previous Work - 2006
• Hamid Haidarian Shahri, S.H. Shahri, “Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework,” IEEE Intelligent Systems, Vol. 21, No. 5, 2006.
Putting Things into Context
• Data cleaning required after integration
  - No unified standard across sources
  - NOW: sensor/hardware errors inevitable; research opportunity
• Data modeling (Amol Deshpande)
  - An important use case is cleaning
VLDB 2006 – Three weeks ago
• Research Session 5: Sensor Data (dedicated to cleaning!)
• Title: Adaptive Cleaning for RFID Data Streams
  Authors: Shawn R. Jeffery, Minos Garofalakis, Michael J. Franklin
• Title: A Deferred Cleansing Method for RFID Data Analytics
  Authors: Jun Rao, Sangeeta Doraiswamy, Hetal Thakkar, Latha S. Colby
• Title: Online Outlier Detection in Sensor Data Using Non-Parametric Models
  Authors: Sharmila Subramaniam, Themis Palpanas, Dimitris Papadopoulos, Vana Kalogeraki, Dimitrios Gunopulos
RFID: Radio Frequency IDentification
RFID data is dirty
A simple experiment:
• 2 RFID-enabled shelves
• 10 static tags
• 5 mobile tags
[Figure: experimental setup with two shelves (Shelf 0, Shelf 1), RFID readers, static tags, and mobile tags; distance labels: 1.5 ft, 3 ft, 9 ft, 15 ft]
RFID Data Cleaning
• RFID data has many dropped readings
• Typically, use a smoothing filter to interpolate
    SELECT DISTINCT tag_id
    FROM RFID_stream [RANGE '5 sec']
    GROUP BY tag_id
• But how to set the size of the window?
[Figure: smoothing filter turning raw readings into smoothed output over time]
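A minimal Python sketch of such a fixed-window smoothing filter, assuming readings arrive as (epoch, tag_id) pairs and the window is counted in epochs rather than seconds. The function name and data layout are illustrative and not taken from the slides.

```python
from collections import defaultdict

def smooth_fixed_window(readings, window, num_epochs):
    """For each epoch t, report every tag read in epochs t-window+1 .. t."""
    seen = defaultdict(set)  # epoch -> set of tag ids read in that epoch
    for epoch, tag_id in readings:
        seen[epoch].add(tag_id)
    output = []
    for t in range(num_epochs):
        present = set()
        for e in range(max(0, t - window + 1), t + 1):
            present |= seen.get(e, set())
        output.append(present)
    return output

# Hypothetical trace: the tag is read in epochs 0, 1 and 3 (epoch 2 is a
# dropped reading) and has left the read range after epoch 3.
readings = [(0, "fido"), (1, "fido"), (3, "fido")]
small = smooth_fixed_window(readings, window=1, num_epochs=8)
large = smooth_fixed_window(readings, window=3, num_epochs=8)
```

With `window=1` the dropped reading at epoch 2 shows up as a gap; with `window=3` the gap is filled, but the tag is still reported for two epochs after its last reading. Choosing a single fixed size forces this trade-off.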
Window Size for RFID Smoothing
[Figure: Fido moving, then Fido resting; timelines for reality, raw readings, a small window, and a large window]
  - Need to balance completeness vs. capturing tag movement
Truly Declarative Smoothing
• Problem: window size non-declarative
  - Application wants a clean stream of data
  - Window size is how to get it
• Solution: adapt the window size in response to data
Itinerary
• Introduction: RFID data cleaning
• A statistical sampling perspective
• SMURF
  - Per-tag cleaning
  - Multi-tag cleaning
• Ongoing work
• Conclusions
A Statistical Sampling Perspective
• Key Insight: RFID data ≈ random sample of present tags
• Map RFID smoothing to a sampling experiment
RFID’s Gory Details
[Figure: antenna and reader interrogating Tags 1 to 4 over read cycles (epochs) E0 through E9]
Tag List (for Alien readers):
Epoch | TagID | ReadRate
0     | 1     | .9
0     | 2     | .6
0     | 3     | .3
RFID Smoothing to Sampling
RFID               | Sampling
Read cycle (epoch) | Sample trial
Reading            | Single sample
Smoothing window   | Repeated trials
Read rate          | Probability of inclusion (p_i)
  - Now use sampling theory to drive adaptation!
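To make the mapping concrete, here is a small Python sketch that treats each epoch as one Bernoulli trial with success probability equal to the read rate. It assumes a constant read rate and independent epochs; both function names are hypothetical.

```python
import random

def simulate_reads(read_rate, num_epochs, seed=0):
    """One Bernoulli trial per epoch: True means the tag was read."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.random() < read_rate for _ in range(num_epochs)]

def estimate_read_rate(reads):
    """Fraction of epochs in the window in which the tag was read."""
    return sum(reads) / len(reads)

reads = simulate_reads(read_rate=0.6, num_epochs=10_000)
estimated = estimate_read_rate(reads)  # close to 0.6 for a long window
```

Under these assumptions the number of readings in a window of w epochs follows a binomial distribution, which is what lets sampling theory reason about window sizes.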
SMURF
• Statistical Smoothing for Unreliable RFID Data
• Adapts window based on statistical properties
• Mechanisms for per-tag and multi-tag cleaning
[Figure: SMURF takes raw RFID streams as input; per-tag cleaning delivers cleaned per-tag readings and multi-tag cleaning delivers cleaned count readings to the application(s)]
Per-Tag Smoothing: Model and Background
• Use a binomial sampling model
[Figure: read rate p_i of tag i (0 to 1) over epochs E0 to E9, with its average p_i^avg; the smoothing window covers w_i Bernoulli trials, and S_i is the set of readings of tag i observed in that window]
Per-Tag Smoothing: Completeness
• If the tag is there, read it with high probability
  - Want a large window
[Figure: p_i (0 to 1) over epochs E0 to E9; a reading with a low p_i leads to expanding the window]
Per-Tag Smoothing: Completeness
$$ w_i \ge \left\lceil \frac{1}{p_i^{avg}} \ln\frac{1}{\delta} \right\rceil $$
• $w_i$: desired window size for tag $i$
• $1/p_i^{avg}$: expected epochs needed to read
• $\ln(1/\delta)$: with probability $1-\delta$
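A short Python sketch of the completeness bound, assuming the window is the ceiling of (1 / p_avg) * ln(1 / delta) epochs. The reasoning: a tag with read rate p is missed in all w epochs with probability (1 - p)^w, which is at most e^(-p*w), and that drops below delta once w reaches ln(1/delta) / p. The function name is illustrative.

```python
import math

def completeness_window(p_avg, delta):
    """Smallest window (in epochs) that reads the tag with prob. >= 1 - delta."""
    if not 0.0 < p_avg <= 1.0:
        raise ValueError("p_avg must be in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return max(1, math.ceil(math.log(1.0 / delta) / p_avg))

w_good = completeness_window(0.9, 0.05)  # well-read tag: 4 epochs
w_poor = completeness_window(0.3, 0.05)  # poorly-read tag: 10 epochs
```

A tag with a low read rate therefore needs a much larger window than a tag that is read almost every epoch.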
Per-Tag Smoothing: Transitions
• Detect transitions as statistically significant changes in the data
[Figure: p_i (0 to 1) over epochs E0 to E9; a statistically significant difference in the readings indicates the tag has likely left by this point, so flag a transition and shrink the window]
Per-Tag Smoothing: Transitions
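• Statistically significant: is the difference between observed and expected readings “statistically significant”?
$$ \left|\, |S_i| - w_i\, p_i^{avg} \,\right| > 2 \sqrt{w_i\, p_i^{avg} \left(1 - p_i^{avg}\right)} $$
• $|S_i|$: # observed readings
• $w_i\, p_i^{avg}$: # expected readings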
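The transition test can be sketched in Python as a comparison against two binomial standard deviations. This assumes the number of readings in the window is binomial with w trials and success probability p_avg; the function name is hypothetical.

```python
import math

def transition_detected(num_observed, window, p_avg):
    """True if the observed count is more than 2 std devs from the expectation."""
    expected = window * p_avg
    std_dev = math.sqrt(window * p_avg * (1.0 - p_avg))
    return abs(num_observed - expected) > 2.0 * std_dev

# Window of 10 epochs, average read rate 0.8: about 8 readings expected.
still_here = transition_detected(num_observed=7, window=10, p_avg=0.8)
likely_left = transition_detected(num_observed=3, window=10, p_avg=0.8)
```

Seven readings out of ten is within normal sampling noise, while three readings is far enough below the expected eight to flag a transition.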
SMURF in Action
[Figure: SMURF output for the Fido moving and Fido resting scenario]
  - Experiments with real and simulated data show similar results
Multi-tag Cleaning
• Some applications only need aggregates
  - E.g., count of items on each shelf
  - Don’t need to track each tag!
• Use statistical mechanisms for both:
  - Aggregate computation
  - Window adaptation
Aggregate Computation
• π-estimators (Horvitz-Thompson)
• Count:
$$ \hat{N}_w = \sum_{i \in S_w} \frac{1}{\pi_i} $$
• P[tag i seen in a window of size w]:
$$ \pi_i = 1 - \left(1 - p_i^{avg}\right)^{w} $$
• Use small windows to capture movement
• Use the estimator to compensate for lost readings
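A minimal Python sketch of the count estimator: each tag seen in the window contributes 1 / pi_i, where pi_i = 1 - (1 - p_avg_i)^w is its probability of being seen at least once. The dictionary layout and function names are assumptions made for illustration.

```python
def inclusion_probability(p_avg, window):
    """Probability that a tag is read at least once in `window` epochs."""
    return 1.0 - (1.0 - p_avg) ** window

def estimate_count(read_rates, window):
    """Horvitz-Thompson style count over the tags seen in the window.

    read_rates maps each observed tag id to its average read rate (> 0).
    """
    return sum(1.0 / inclusion_probability(p, window)
               for p in read_rates.values())

# Three tags were seen in a 2-epoch window, each with read rate 0.5.
# Each is seen with probability 0.75, so each counts as 1 / 0.75 tags.
observed = {"a": 0.5, "b": 0.5, "c": 0.5}
estimated_count = estimate_count(observed, window=2)  # 4.0
```

The estimate of 4 exceeds the 3 tags actually seen because, at this read rate, about one tag in four is expected to be missed in a window this small.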
Window Adaptation
• Upper bound window similar to per-tag:
$$ w \ge \left\lceil \frac{1}{p^{avg}} \ln\frac{1}{\delta} \right\rceil $$
• “Transition” based on variance within subwindows:
$$ \left| \hat{N}_w - \hat{N}_{w'} \right| > 2 \left( \sqrt{Var[\hat{N}_w]} + \sqrt{Var[\hat{N}_{w'}]} \right) $$
[Figure: count estimates $\hat{N}_w$ (full window) and $\hat{N}_{w'}$ (subwindow) over epochs E0 to E9]
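A hedged Python sketch of a variance-based transition check on counts. Two things here are assumptions rather than content of the slide: the variance of each count estimate is approximated with the standard Horvitz-Thompson form, the sum of (1 - pi_i) / pi_i^2 over observed tags with independent inclusion, and a transition is flagged when the two estimates differ by more than twice the sum of their standard deviations. All names are illustrative.

```python
import math

def ht_count_and_variance(inclusion_probs):
    """Count estimate and its estimated variance from per-tag pi values."""
    count = sum(1.0 / pi for pi in inclusion_probs)
    variance = sum((1.0 - pi) / (pi * pi) for pi in inclusion_probs)
    return count, variance

def count_transition_detected(probs_window, probs_subwindow):
    """Compare the full-window estimate against a subwindow estimate."""
    n_w, var_w = ht_count_and_variance(probs_window)
    n_sub, var_sub = ht_count_and_variance(probs_subwindow)
    threshold = 2.0 * (math.sqrt(var_w) + math.sqrt(var_sub))
    return abs(n_w - n_sub) > threshold

# Ten tags seen over the full window, but only two in the recent subwindow.
tags_left = count_transition_detected([0.9] * 10, [0.8] * 2)
# Ten tags over the full window and nine in the subwindow: within noise.
tags_stable = count_transition_detected([0.9] * 10, [0.8] * 9)
```

In the first case the estimates (about 11.1 versus 2.5) differ by far more than the threshold, so the window would be shrunk; in the second they agree closely and the window is left alone.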
Multi-tag Scenario
Ongoing Work: Spatial Smoothing
• With multiple readers, more complicated
[Figure: two rooms, two readers per room (readers A, B, C, D)]
• Reinforcement: A? B? A ∪ B? A ∩ B?
• Arbitration: A? C?
  - All are addressed by a statistical framework!
Beyond RFID
Other sensor data
• π-estimator for other aggregates
  - Use SMURF for sensor networks
Other streaming data
• Use SMURF in general streaming systems (e.g., TelegraphCQ)
  - Remove RANGE clause from CQL
Related Work
• Commercial RFID middleware
  - Smoothing filters: need to set smoothing window
• RFID-related work
  - Rao et al., StreamClean: complementary
  - Intel Seattle, HiFi, ESP: static window size
• BBQ, MauveDB
  - Heavyweight, model-based
  - SMURF is non-parametric, sampling-based
• Statistical filters (digital signal processing & DB)
  - Non-linear digital filters inspired SMURF design
Conclusions
• Current smoothing filters not adequate
• Not declarative!
• SMURF: Declarative smoothing filter
• Uses statistical sampling to adapt window size
Thanks!
Questions?