Data Quality Management in IoT
Zibin Zheng
Dept. of Computer Science & Engineering
The Chinese University of Hong Kong

“The Internet of Things” is a concept originally coined and introduced by the MIT Auto-ID Center and intimately linked to RFID and the electronic product code (EPC): “… all about physical items talking to each other..” Like RFID, it is a concept that has attracted much rhetoric, misconception and confusion as to what it means and what its implications are in a social context.

Internet of Things
• 1999: Foundation of the Auto-ID Center at MIT (RFID technology)
• 2003: SUN article, “Toward a Global ‘Internet of Things’”
• Nov. 17, 2005: “ITU Internet Report 2005: The Internet of Things”
• Jan. 23, 2009: IBM Smarter Planet; Obama: “IoT technology is a way for the United States to maintain and regain its competitive advantage in the 21st century.”
• Aug. 07, 2009: Wen Jiabao: “In developing sensor networks, we should plan for the future early, break through the core technologies early, and combine sensing systems with the TD technology in 3G.”
• Sept. 2009: Internet of Things – An action plan for Europe

Data Quality in IoT
• Providing high-quality, trusted data to upper-level analysis, optimization and decision making is a critical task.
  – Tolle et al. [1] deployed a sensor network with the goal of examining the microclimate over the volume of a redwood tree.
    • Many data anomalies had to be discarded after deployment.
    • Only 49% of the collected data could be used for meaningful interpretation.
  – In a WSN deployment at Great Duck Island, Szewczyk et al. [2] classified 3% to 60% of the data from each sensor as faulty.
  – Buonadonna et al. [3] observed the difficulty of obtaining accurate sensor data. Following a test deployment, they noted that failures can occur in unexpected ways and that calibration is a difficult task.
• Fault classification: [4] Kevin Ni et al., Sensor Network Data Fault Types, ACM Trans. on Sensor Networks, Vol. 5, No. 3, May 2009

Current Problems
• Data quality is a very important problem
  – 84% of large businesses believe that poor data governance can reduce the accuracy of business decisions.
    -- “2006 IBM Survey – 50 Large Companies”
  – 90% of upper-level managers feel they don’t have the necessary information for critical business decisions; 50% of them are afraid they are making poor decisions because of it.
    -- Economist survey 2008
  – There is a difference between data ownership and data stewardship. Most organizations do not distinguish between the two or use the terms interchangeably, but the distinction, while seemingly subtle, is significant. The data owner plays the role of a king or queen (the business) and the data steward is the farmer (stewardship role).
    -- “Organizing for Data Quality”, Gartner G00148815, June 2007
• Negative impact on businesses from poor data quality
  – Lost productivity is the No. 1 impact on companies that do not have adequate access to information.
    -- Economist survey 2007
  – 40% of orders were blocked due to master data problems.
    -- Ericsson
  – 57% of marketing content work was done to mitigate errors.
    -- IDC Case Study 2006
  – 43% of users say they are not sure whether internal information is accurate; 77% said bad decisions had been made because of a lack of information.
    -- Business Week study, 2005

Current Problems (cont.)
• Features of IoT environment
  – Sensors with strong resource constraints
    • Because of their low cost, sensor nodes usually have limited resources, including battery energy, data processing capability, transmission rate and memory.
  – Great number of remote sensors deployed in a wide-area environment
    • Many sensor systems consist of hundreds or thousands of sensor nodes densely distributed over a wide area. This mass of sensors generates massive, diverse data characterized by high redundancy.
  – Unreliable communications and mixed traffic between remote sensors and the service data centers
    • Some sensors are mobile, and new sensors may be added; the power management mechanism may switch a sensor into or out of sleep mode; some nodes may even die when their battery energy is exhausted.
      All of these factors can make the network unreliable and cause the network topology to change dynamically.
    • There are three basic sensor data delivery models: event-driven, query-driven, and continuous delivery.
  – Heterogeneous sensors under diverse standards across industries
  – Shared infrastructure
• Data quality in IoT environment
  – Tolle et al. [1] deployed a sensor network with the goal of examining the microclimate over the volume of a redwood tree.
    • The authors discovered that many data anomalies had to be discarded after deployment. Only 49% of the collected data could be used for meaningful interpretation.
  – In a deployment at Great Duck Island, Szewczyk et al. [2] classified 3% to 60% of the data from each sensor as faulty.

Anomaly Data Identification and Cleansing Technologies
• Anomaly Data Identification and Cleansing
  – Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting this dirty data.
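The removal side of data cleansing can be illustrated with a minimal Python sketch. It is not taken from the slides: the function name `cleanse`, the plausible range of -40 to 60, and the sample readings (including the `-999.0` sentinel) are all assumptions chosen for illustration. It only deletes dirty records; replacing or modifying them would need domain knowledge.

```python
from typing import Optional


def cleanse(readings: list[Optional[float]], low: float, high: float) -> list[float]:
    """Drop missing readings and readings outside the plausible range [low, high]."""
    cleaned = []
    for value in readings:
        if value is None:
            continue  # incomplete record: remove it
        if not (low <= value <= high):
            continue  # implausible (likely incorrect) record: remove it
        cleaned.append(value)
    return cleaned


# Hypothetical temperature readings; None marks a missing sample.
raw = [21.5, None, 22.0, -999.0, 21.8, 150.0]
clean = cleanse(raw, low=-40.0, high=60.0)
print(clean)  # [21.5, 22.0, 21.8]
```

A fixed range check is the simplest possible rule; the anomaly detection techniques listed next are ways to find dirty records that such a rule would miss.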
• Anomaly Detection Techniques
  – Classification-based anomaly detection techniques (classifier training and testing)
    • Neural Networks (Multi-layered Perceptrons, Neural Trees, Autoassociative Networks, Adaptive Resonance Theory Based, Radial Basis Function, Hopfield Networks, Oscillatory Networks)
    • Bayesian Networks
    • Support Vector Machines (SVM, RSVM)
    • Rule-Based
  – Nearest-neighbor-based anomaly detection techniques
    • Using Distance to kth Nearest Neighbor
    • Using Relative Density
  – Clustering-based anomaly detection techniques
  – Statistical anomaly detection techniques
    • Parametric Techniques (Gaussian Model, Regression Model, Mixture of Parametric Distributions)
    • Nonparametric Techniques (Histogram, Kernel Function)
  – Spectral anomaly detection techniques

Data Fusion Technologies
• Data Fusion
  – Data fusion is generally defined as the use of techniques that combine data from multiple sources in order to draw inferences that are more efficient and potentially more accurate than those drawn from a single source.
• Data Fusion Technologies
  – Dempster-Shafer (D-S) Rule of Combination
  – Voting and Summing Approaches
  – Expert Systems
  – Rule-based
  – Artificial neural networks (ANNs)
  – ……

Data Quality Management of IoT
GARBAGE IN, GARBAGE OUT
[Diagram, reconstructed from its labels]
• Causes: Resource constraints, Environmental effects, Communication, Malicious attacks
• Appearance: Noise & error, Missing values, Inconsistent data, Duplicate data
• IoT Data Quality Management: Outlier detection, Missing value prediction, Data fusion, Data confidence
• Application Domain: Intelligent traffic, Intelligent healthcare, Smart Grids, Industrial process, Intelligent home, Environmental monitoring, ……
• Decision Making

Why Outlier Detection of IoT is Difficult?
• Notion of outlier
  – The exact notion of an outlier is different for different application domains.
• Errors and events
  – Traditional outlier detection algorithms often do not distinguish between errors and events. How can outlier sources be identified, and how can errors, events and malicious attacks be told apart?
• Heterogeneous data source
  – Heterogeneous data sources have different confidence levels.
• Label data
  – Labeled data may not be available.
• Normal region
  – Defining a normal region that encompasses every possible normal behavior is very difficult.

Outlier Detection of IoT
• Outlier Detection
  – Input data: Data-driven (Streaming), Event-driven, Query-driven
  – Outlier sources: Noise & error, Actual event, Anomalous data caused by malicious attack
• Target of outlier detection in IoT
  – Remove noise to enhance data quality
    • Controls the quality of measured data
    • Improves the robustness of data analysis in the presence of noise
    • Prevents the aggregated results from being affected
  – Event reporting
    • Searches for values that do not follow the normal pattern of sensor data in the network.
    • The detected values are treated as events indicating a change in a phenomenon of interest.
  – Secure functioning
    • Identifies malicious sensors that always generate outlier values
    • Detects potential network attacks by adversaries

Type of Anomaly
• Point anomalies
  – An individual data instance can be considered anomalous with respect to the rest of the data.
• Contextual anomalies
  – A data instance is anomalous in a specific context, but not otherwise.
• Collective anomalies
  – A collection of related data instances is anomalous with respect to the entire data set.

Characteristics of Anomaly Detection in IoT (not textual)
• Digital anomaly detection
• Spatio-temporal correlation
  – Temporal correlation: different data instances from the same sensor at different times.
  – Spatial correlation (topology correlation): different data instances from different sensors at the same time.
• How to handle these correlations?
  – Prior art: design domain-specific algorithms that handle different scenarios case by case (see references [1], [2], [3] in the speaker notes).
    • Too many scenarios in IoT
    • The algorithms are too specific to be reused.
  – Our approach: a general framework for different IoT application domains
    • Decouple the correlations by defining new features
    • Transform the problem into point anomaly detection.

Guidance of Algorithm Selection
Provide guidelines for selecting a suitable technique for each application domain.
• Anomaly detection algorithms
  – Nearest neighbor-based
  – Classification
  – Clustering
  – Statistical anomaly detection
  – Spectral anomaly detection
• Selection Criteria
  – Anomaly type: Point anomalies, Contextual anomalies, Collective anomalies
  – Availability of training data: Supervised, Semi-supervised, Unsupervised
  – Type of input data: Streaming data (Frequency), Event data, Query-based data
  – Type of expected output: Scalar (0 or 1), Anomaly score
  – Data features: Volume of data, Univariate or multivariate, Dimensionality
  – Domain-specific features
  – System requirements and constraints: Real-time or not, Memory/data buffer size for stream
• Result: Optimal Algorithm

Procedures of Using the Framework
• Feature extraction (with guidelines)
• Anomaly detection algorithm selection (with guidelines)
• Model training & parameter configuration
• Integration
• Anomaly detection
• Evaluation

Evaluation
• Evaluation Metrics
  – Detection rate
    • The percentage of outliers that are detected. For example, if all the data are claimed to be outliers, the detection rate is 100%.
  – False alarm rate
    • The rate at which normal data are claimed to be outliers. The example above has a large false alarm rate.
  – Receiver operating characteristic (ROC)
    • Represents the trade-off between these two rates.
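The two rates defined above can be computed directly from labeled data. The following is a minimal Python sketch, not part of the original slides; the function name `detection_metrics` and the ten-instance example are assumptions made for illustration. It reproduces the slide's example: flagging everything gives a 100% detection rate but also a 100% false alarm rate.

```python
def detection_metrics(is_outlier, flagged):
    """Return (detection_rate, false_alarm_rate) for two boolean label lists.

    is_outlier: ground-truth labels (True = real outlier)
    flagged:    detector output   (True = claimed outlier)
    """
    if len(is_outlier) != len(flagged):
        raise ValueError("label lists must have the same length")
    n_outliers = sum(1 for t in is_outlier if t)
    n_normal = len(is_outlier) - n_outliers
    if n_outliers == 0 or n_normal == 0:
        raise ValueError("need at least one outlier and one normal instance")
    # Outliers that the detector found.
    detected = sum(1 for t, f in zip(is_outlier, flagged) if t and f)
    # Normal instances wrongly claimed to be outliers.
    false_alarms = sum(1 for t, f in zip(is_outlier, flagged) if not t and f)
    return detected / n_outliers, false_alarms / n_normal


# Hypothetical data: 8 normal instances followed by 2 outliers.
truth = [False] * 8 + [True] * 2

flag_everything = [True] * 10
print(detection_metrics(truth, flag_everything))  # (1.0, 1.0)

selective = [False] * 7 + [True, True, False]
print(detection_metrics(truth, selective))  # (0.5, 0.125)
```

Sweeping a detector's decision threshold and plotting the resulting pairs of rates gives the ROC curve.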
• Evaluation Approaches
  – Employ labeled data to evaluate the detection performance.
    • Requires a large amount of data and labor cost.
  – Inject faults and check whether the detection algorithm can find the injected faults.
    • Requires good knowledge of the various faults.

Point Anomaly Detection
• Point anomaly detection: does not consider the correlations between different data instances.
• Procedures of anomaly detection by One-class SVM

Temporal Correlation: Neighborhood
• The last two lines of the slide's data are normal data. However, compared with the values one hour ago, these values are abnormal.
• Feature (Neighborhood): a - mean
  – a: the current value.
  – mean: the mean of the previous k values.
  – k: a configurable parameter.

Spatial Correlation: Topology
• The value of “虹山北路” is normal by itself. However, it is abnormal when compared with the value from the neighbor sensor “羊仙坡”.
• Feature (Topology): a - b × ratio
  – a: the current value, e.g., the traffic flow at 虹山北路, 445.
  – b: the value from the neighbor sensor, e.g., the traffic flow at 羊仙坡, 315.
  – ratio: the historical value ratio, e.g., 0.557 (the historical average traffic flow at 虹山北路 is 0.557 times that at 羊仙坡).
  – Example: 445 - 315 × 0.557 ≈ 269.5 (the slide gives 269.31656, presumably computed with an unrounded ratio).
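The two features above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names `neighborhood_feature` and `topology_feature` and the sample history values are hypothetical, while the numbers 445, 315 and 0.557 come from the slide's example.

```python
def neighborhood_feature(history, current, k):
    """Temporal feature: current value minus the mean of the previous k values."""
    if k <= 0 or len(history) < k:
        raise ValueError("need at least k previous values, with k > 0")
    window = history[-k:]  # the k most recent previous values
    return current - sum(window) / k


def topology_feature(current, neighbor_value, ratio):
    """Spatial feature: current value minus the neighbor's value scaled by the
    historical ratio between the two sensors."""
    return current - neighbor_value * ratio


# Hypothetical history for one sensor, followed by a sudden jump.
previous = [100, 102, 98, 101]
print(neighborhood_feature(previous, current=160, k=4))  # 59.75

# Slide example: a = 445, b = 315, ratio = 0.557 (rounded).
residual = topology_feature(445, 315, 0.557)
print(f"{residual:.1f}")  # 269.5
```

A value close to zero means the reading agrees with its recent history or with its neighbor; a large value can then be handed to a point anomaly detector as an ordinary feature.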