Adaptive Cleaning for RFID Data Streams
Shawn Jeffery (UC Berkeley), Minos Garofalakis (Intel Research Berkeley), Michael Franklin (UC Berkeley)
Presented by: Hamid Haidarian Shahri

Where Are We? Look at the Signs!

Looking at Signs – Before Jumping In
• S. Chaudhuri, U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, 1997. 800+ citations
• DW and information integration
• "Data cleaning" term publicized
  • Identified its importance in integration
• Extensive research followed

VLDB 2001
• Session R12: DATA QUALITY & CLEANING
• Declarative data cleaning: language, model, and algorithms
  Helena Galhardas (INRIA Rocquencourt), Daniela Florescu (Propel), Dennis Shasha (NYU), Eric Simon, and Cristian-Augustin Saita (INRIA Rocquencourt)
• Potter's wheel: an interactive data cleaning system
  Vijayshankar Raman and Joseph M. Hellerstein (University of California at Berkeley)
• Update propagation strategies for improving the quality of data on the Web
  Alexandros Labrinidis and Nick Roussopoulos (University of Maryland)

Data Cleaning Previous Work - 2006
• Hamid Haidarian Shahri, S.H. Shahri, "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework," IEEE Intelligent Systems, Vol. 21, No. 5, 2006.

Putting Things into Context
• Data cleaning required after integration
  • No unified standard across sources
  • NOW: sensor/hardware errors inevitable; research opportunity
• Data modeling (Amol Deshpande)
  • An important use case is cleaning

VLDB 2006 – Three weeks ago
• Research Session 5: Sensor Data (dedicated to cleaning!)
• Title: Adaptive Cleaning for RFID Data Streams
  Authors: Shawn R. Jeffery, Minos Garofalakis, Michael J. Franklin
• Title: A Deferred Cleansing Method for RFID Data Analytics
  Authors: Jun Rao, Sangeeta Doraiswamy, Hetal Thakkar, Latha S. Colby
• Title: Online Outlier Detection in Sensor Data Using Non-Parametric Models
  Authors: Sharmila Subramaniam, Themis Palpanas, Dimitris Papadopoulos, Vana Kalogeraki, Dimitrios Gunopulos

RFID: Radio Frequency IDentification

RFID data is dirty
A simple experiment:
• 2 RFID-enabled shelves
• 10 static tags
• 5 mobile tags
[Figure: layout of Shelf 0 and Shelf 1 with their RFID readers, static tags, and mobile tags; distances labeled from 1.5 ft to 15 ft]

RFID Data Cleaning
• RFID data has many dropped readings
• Typically, use a smoothing filter to interpolate

    SELECT distinct tag_id
    FROM RFID_stream [RANGE '5 sec']
    GROUP BY tag_id

• But, how to set the size of the window?
[Figure: raw readings pass through the smoothing filter to give the smoothed output; axis: Time]

Window Size for RFID Smoothing
[Figure: timelines for Reality, Raw readings, Small window, and Large window, over two phases: Fido moving and Fido resting]
• Need to balance completeness vs. capturing tag movement

Truly Declarative Smoothing
• Problem: window size non-declarative
  • Application wants a clean stream of data
  • Window size is how to get it
• Solution: adapt the window size in response to data

Itinerary
• Introduction: RFID data cleaning
• A statistical sampling perspective
• SMURF
  • Per-tag cleaning
  • Multi-tag cleaning
• Ongoing work
• Conclusions

A Statistical Sampling Perspective
• Key Insight: RFID data is a random sample of present tags
• Map RFID smoothing to a sampling experiment

RFID's Gory Details
[Figure: antenna & reader with Tags 1 to 4; read cycles (epochs) E0 to E9; each epoch yields a tag list]
Tag list (for Alien readers):

    Epoch   TagID   ReadRate
    0       1       .9
    0       2       .6
    0       3       .3

RFID Smoothing to Sampling

    RFID                  Sampling
    Read cycle (epoch)    Sample trial
    Reading               Single sample
    Smoothing window      Repeated trials
    Read rate             Probability of inclusion ($p_i$)

Now use sampling theory to drive adaptation!
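The window-size trade-off described above (completeness vs. capturing tag movement) can be illustrated with a minimal sketch of a fixed-window smoothing filter. This is not the presenters' code: the function names `smooth` and `count_errors`, the epoch-based window (rather than the 5-second range in the query), and the sample readings are all assumptions made for illustration.

```python
def smooth(readings, window):
    """Fixed-window smoothing over per-epoch readings (1 = read, 0 = not read).

    The tag is reported present in epoch t if it was read at least once in
    the last `window` epochs, including t.
    """
    smoothed = []
    for t in range(len(readings)):
        start = max(0, t - window + 1)
        smoothed.append(1 if any(readings[start:t + 1]) else 0)
    return smoothed


def count_errors(smoothed, truth):
    """Return (false_negatives, false_positives) against the true presence."""
    false_neg = sum(1 for s, r in zip(smoothed, truth) if r and not s)
    false_pos = sum(1 for s, r in zip(smoothed, truth) if s and not r)
    return false_neg, false_pos


# Hypothetical data: the tag is present for epochs 0-5 and then leaves.
truth = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
raw = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # dropped readings while present

for window in (1, 2, 5):
    print(window, count_errors(smooth(raw, window), truth))
```

On this made-up trace, a window of 1 epoch misses the tag in 3 epochs and never reports it falsely, while a window of 5 epochs misses nothing but keeps reporting the tag for 4 epochs after it has left. No single fixed window removes both kinds of error, which is the motivation for adapting the window to the data.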
SMURF
• Statistical Smoothing for Unreliable RFID Data
• Adapts window based on statistical properties
• Mechanisms for:
  • Per-tag and multi-tag cleaning
[Figure: raw RFID streams enter SMURF; per-tag cleaning sends cleaned per-tag readings to the application(s), and multi-tag cleaning sends cleaned count readings to the application(s)]

Per-Tag Smoothing: Model and Background
• Use a binomial sampling model
[Figure: read rate of tag i ($p_i$, between 0 and 1) over epochs E0 to E9, with its average $p_i^{avg}$; the smoothing window $w_i$ covers a run of epochs treated as Bernoulli trials, and $S_i$ marks the readings of tag i in the window; axis: Time (epochs)]

Per-Tag Smoothing: Completeness
• If the tag is there, read it with high probability
  • Want a large window
[Figure: $p_i$ over epochs E0 to E9 with a reading at a low $p_i$; annotation: Expand the window; axis: Time (epochs)]

Per-Tag Smoothing: Completeness
$w_i \ge \frac{1}{p_i^{avg}} \ln\frac{1}{\delta}$
• $w_i$: desired window size for tag i
• $\frac{1}{p_i^{avg}}$: expected epochs needed to read the tag
• The tag is read with probability $1 - \delta$

Per-Tag Smoothing: Transitions
• Detect transitions as statistically significant changes in the data
[Figure: $p_i$ over epochs E0 to E9; annotations: "The tag has likely left by this point" and "Statistically significant difference"; axis: Time (epochs)]
• Flag a transition and shrink the window

Per-Tag Smoothing: Transitions
• Statistically significant:
$\left|\, |S_i| - w_i \, p_i^{avg} \,\right| > 2 \sqrt{w_i \, p_i^{avg} \, (1 - p_i^{avg})}$
• $|S_i|$: # observed readings
• $w_i \, p_i^{avg}$: # expected readings
• Is the difference "statistically significant"?

SMURF in Action
[Figure: SMURF output over the two phases, Fido moving and Fido resting]
• Experiments with real and simulated data show similar results

Multi-tag Cleaning
• Some applications only need aggregates
  • E.g., count of items on each shelf
  • Don't need to track each tag!
• Use statistical mechanisms for both:
  • Aggregate computation
  • Window adaptation

Aggregate Computation
• π-estimators (Horvitz-Thompson)
  • P[tag i seen in a window of size w]: $\pi_i = 1 - (1 - p_i^{avg})^w$
  • Count: $\hat{N}_w = \sum_{i \in S_w} \frac{1}{\pi_i}$
• Use small windows to capture movement
• Use the estimator to compensate for lost readings

Window Adaptation
• Upper bound on the window $w$, similar to per-tag: $\frac{1}{p^{avg}} \ln\frac{1}{\delta}$
• "Transition" based on variance within subwindows:
$\left| \hat{N}_w - \hat{N}_{w'} \right| > 2 \left( \sqrt{Var[\hat{N}_w]} + \sqrt{Var[\hat{N}_{w'}]} \right)$
[Figure: count estimates $\hat{N}_w$ and $\hat{N}_{w'}$ over epochs E0 to E9; axes: Count, Time (epochs)]

Multi-tag Scenario

Ongoing Work: Spatial Smoothing
• With multiple readers, more complicated
[Figure: two rooms, two readers per room (A, B, C, D)]
• Reinforcement: A? B? A ∪ B? A ∩ B?
• Arbitration: A? C?
• All are addressed by a statistical framework!

Beyond RFID
• Other sensor data
  • π-estimator for other aggregates
  • Use SMURF for sensor networks
• Other streaming data
  • Use SMURF in general streaming systems (e.g., TelegraphCQ)
  • Remove RANGE clause from CQL

Related Work
• Commercial RFID middleware
  • Smoothing filters: need to set smoothing window
• RFID-related work
  • Rao et al., StreamClean: complementary
  • Intel Seattle, HiFi, ESP: static window size
• BBQ, MauveDB
  • Heavyweight, model-based
  • SMURF is non-parametric, sampling-based
• Statistical filters (digital signal processing & DB)
  • Non-linear digital filters inspired SMURF design

Conclusions
• Current smoothing filters not adequate
  • Not declarative!
• SMURF: Declarative smoothing filter
  • Uses statistical sampling to adapt window size

Thanks! Questions?
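The per-tag window bound, the transition test, and the Horvitz-Thompson count from the slides above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not SMURF's implementation: the function names are hypothetical, rounding the window up to a whole number of epochs is an assumption, and the choice of 0.05 for the failure probability (written δ in the reconstructed formulas) is made up for the example. The read rates .9, .6, and .3 are the ones in the tag list shown earlier.

```python
import math


def completeness_window(p_avg, delta):
    """Smallest whole number of epochs w with w >= (1 / p_avg) * ln(1 / delta).

    Rounding up to an integer is an assumption of this sketch.
    """
    return math.ceil(math.log(1.0 / delta) / p_avg)


def is_transition(num_observed, w, p_avg):
    """Flag a transition when the observed number of readings differs from
    the expected w * p_avg by more than two binomial standard deviations."""
    expected = w * p_avg
    std_dev = math.sqrt(w * p_avg * (1.0 - p_avg))
    return abs(num_observed - expected) > 2.0 * std_dev


def estimate_count(p_avgs_of_seen_tags, w):
    """Horvitz-Thompson count: sum 1 / pi_i over the tags seen in the window,
    where pi_i = 1 - (1 - p_i_avg) ** w is the chance tag i is seen at all."""
    return sum(1.0 / (1.0 - (1.0 - p) ** w) for p in p_avgs_of_seen_tags)


# Window sizes for the three read rates in the tag list, with delta = 0.05
# (delta is an assumed value, not taken from the slides).
for p in (0.9, 0.6, 0.3):
    print(p, completeness_window(p, 0.05))  # 4, 5, and 10 epochs

# Zero readings in a 10-epoch window at read rate 0.5: 5 were expected.
print(is_transition(0, 10, 0.5))  # True

# Two tags seen in a 1-epoch window, each with read rate 0.5.
print(estimate_count([0.5, 0.5], 1))  # 4.0
```

The last call shows how the estimator compensates for lost readings: each seen tag with a 50% chance of being observed stands in for two tags, so two observations yield an estimated count of four.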