Data Quality Management in IoT - Department of Computer Science

Data Quality Management in IoT
Zibin Zheng
Dept. of Computer Science & Engineering
The Chinese University of Hong Kong
‘The Internet of Things’ is a concept originally coined and introduced by the MIT Auto-ID Center and intimately linked to RFID and the Electronic Product Code (EPC).
“… all about physical items talking to each other …”
Like RFID, it is a concept that has attracted much rhetoric, misconception, and confusion as to what it means and what its implications are in a social context.
Internet of Things
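• 1999: Foundation of the Auto-ID Center at MIT (RFID technology)
• 2003: SUN article: Toward a Global “Internet of Things”
• Nov. 17, 2005: “ITU Internet Reports 2005: The Internet of Things”
• Jan. 23, 2009: IBM Smarter Planet; Obama: “IoT technology is a way for the United States to maintain and regain its competitive advantage in the 21st century.”
• Aug. 7, 2009: Wen Jiabao: “In developing sensor networks, we should plan for the future early and break through the core technologies early, and combine sensing systems with the TD technology in 3G.”
• Sep. 2009: Internet of Things – An action plan for Europe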
Data Quality in IoT
• Ensuring that high-quality, trusted data is provided to upper-level analysis, optimization, and decision making is a critical task.
– Tolle et al. [1] deployed a sensor network with the goal of examining the microclimate over the volume of a redwood tree.
• Many data anomalies needed to be discarded post deployment.
• Only 49% of the collected data could be used for meaningful interpretation.
– In a WSN deployment at Great Duck Island, Szewczyk et al. [2] classified 3% to 60% of data from each sensor as faulty.
– Buonadonna et al. [3] observe the difficulty of obtaining accurate sensor data. Following a test deployment, they note that failures can occur in unexpected ways and that calibration is a difficult task.
Fault classification
[4] Kevin Ni et al., “Sensor Network Data Fault Types,” ACM Trans. on Sensor Networks, Vol. 5, No. 3, May 2009
Current Problems
• Data quality is a very important problem
– 84% of large businesses believe that poor data governance can reduce the accuracy of business decisions. -- “2006 IBM Survey – 50 Large Companies”
– 90% of upper-level management feel they don’t have the necessary information for critical business decisions; 50% of them are afraid they are making poor decisions because of it. -- Economist survey 2008
– There is a difference between data ownership and data stewardship. Most organizations do not distinguish between the two or use the terms interchangeably, but the distinction, while seemingly subtle, is significant. The data owner plays the role of a king or queen (the business) and the data steward is the farmer (stewardship role). -- “Organizing for Data Quality”, Gartner G00148815, June 2007
• Negative impact on businesses from poor data quality
– Lost productivity is the No. 1 impact on companies that do not have adequate access to information. -- Economist survey 2007
– 40% of orders blocked due to master data problems. -- Ericsson
– 57% of marketing content work was to mitigate errors. -- IDC Case Study 2006
– 43% of users say they’re not sure if internal information is accurate; 77% said bad decisions had been made because of lack of information. -- Business Week study, 2005
Current Problems (cont.)
• Features of IoT environment
– Sensors with strong resource constraints
• Because of their low cost, sensor nodes usually have limited resources, including battery energy, data processing capability, transmission rate, and memory.
– Large numbers of remote sensors deployed over wide areas
• Many sensor systems consist of hundreds or thousands of sensor nodes densely distributed over a wide area. These sensors generate massive, diverse data that is characterized by high redundancy.
– Unreliable communication and mixed traffic between remote sensors and the service data centers
• Some sensors are mobile, and new sensors may be added; a sensor may be switched into or out of sleep mode by the power management mechanism; some nodes may even die when their battery is exhausted. All of these factors can make the network unreliable and cause the network topology to change dynamically.
• There are three basic sensor data delivery models: event-driven, query-driven, and continuous delivery.
– Heterogeneous sensors under diverse standards across industries
– Shared infrastructure
• Data quality in IoT environment
– Tolle et al. [1] deployed a sensor network with the goal of examining the microclimate over the
volume of a redwood tree.
• The authors discovered that there were many data anomalies that needed to be discarded post
deployment. Only 49% of the collected data could be used for meaningful interpretation.
– In a deployment at Great Duck Island, Szewczyk et al. [2] classified 3% to 60% of data from each
sensor as faulty.
Anomaly Data Identification and Cleansing Technologies
• Anomaly Data Identification and Cleansing
– Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
• Anomaly Detection Techniques
– Classification-based anomaly detection techniques (classifier → training → testing)
• Neural networks (multi-layered perceptrons, neural trees, autoassociative networks, adaptive resonance theory based, radial basis function, Hopfield networks, oscillatory networks)
• Bayesian networks
• Support vector machines (SVM, RSVM)
• Rule-based
– Nearest neighbor-based anomaly detection techniques
• Using distance to the kth nearest neighbor
• Using relative density
– Clustering-based anomaly detection techniques
– Statistical anomaly detection techniques
• Parametric techniques (Gaussian model, regression model, mixture of parametric distributions)
• Nonparametric techniques (histogram, kernel function)
– Spectral anomaly detection techniques
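As an illustration of the nearest neighbor-based family listed above, the following is a minimal sketch of scoring by distance to the kth nearest neighbor. The readings, the function name `kth_nn_distance`, and the threshold are hypothetical and chosen only for the example; they are not taken from the slides.

```python
import math

def kth_nn_distance(points, k):
    """For each point, return the Euclidean distance to its k-th nearest neighbor.

    A larger distance means the point lies in a sparser region and is
    therefore more likely to be an anomaly.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical (temperature, humidity) readings; the last one is far from the rest.
readings = [(20.1, 55.0), (20.3, 54.8), (19.9, 55.4), (20.0, 55.1), (35.0, 20.0)]
scores = kth_nn_distance(readings, k=2)

THRESHOLD = 5.0  # assumed cut-off; in practice it depends on the application
flags = [score > THRESHOLD for score in scores]
print(flags)
```

This brute-force version compares every pair of points, so it only suits small batches; larger deployments would need an index structure or a sliding window.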
Data Fusion Technologies
• Data Fusion
– Data fusion is generally defined as the use of techniques that combine data from multiple sources in order to achieve inferences that are more efficient and potentially more accurate than those achieved from a single source.
• Data Fusion Technologies
– Dempster-Shafer (D-S) Rule of Combination
– Voting and Summing Approaches
– Expert Systems
– Rule-based
– Artificial neural networks (ANNs)
– ……
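To make the first item in the list concrete, here is a minimal sketch of Dempster's rule of combination for two sources. The two-element frame {fault, ok}, the mass values, and the names `dempster_combine`, `sensor_a`, and `sensor_b` are assumptions made for illustration only.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses counts as conflict K.
            conflict += mass_a * mass_b
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Normalize by 1 - K so the combined masses sum to 1.
    return {hypothesis: mass / (1.0 - conflict) for hypothesis, mass in combined.items()}

FAULT = frozenset({"fault"})
OK = frozenset({"ok"})
EITHER = FAULT | OK  # the whole frame: "don't know"

# Hypothetical beliefs of two sensors about whether a reading is faulty.
sensor_a = {FAULT: 0.6, OK: 0.1, EITHER: 0.3}
sensor_b = {FAULT: 0.7, OK: 0.2, EITHER: 0.1}

fused = dempster_combine(sensor_a, sensor_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

With these assumed inputs the conflict mass is 0.19, and the fused belief in "fault" (0.69 / 0.81) is higher than either sensor's individual belief, which is the sense in which fusion can be more accurate than a single source.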
Data Quality Management of IoT
GARBAGE IN, GARBAGE OUT
[Figure: IoT Data Quality Management. Panel labels: Causes (resource constraints, environmental effects, communication, malicious attacks); Appearance (noise & error, missing values, inconsistent data, duplicate data); outlier detection, missing value prediction, data fusion, data confidence, decision making. Application domain: intelligent traffic, intelligent healthcare, smart grids, industrial process, intelligent home, environmental monitoring, ...]
Why Is Outlier Detection in IoT Difficult?
• Normal region
– Defining a normal region that encompasses every possible normal behavior is very difficult.
• Notion of outlier
– The exact notion of an outlier is different for different application domains.
• Errors and events
– Traditional outlier detection algorithms often do not distinguish between errors and events. How can outlier sources be identified, and how can errors, events, and malicious attacks be told apart?
• Heterogeneous data sources
– Heterogeneous data sources have different confidence levels.
• Labeled data
– Labeled data may not be available.
Outlier Detection of IoT
[Figure: Outlier detection. Input data: data-driven (streaming), event-driven, query-driven. Outlier sources: noise & error, actual event, anomalous data from malicious attack.]
• Target of outlier detection in IoT
– Remove noise to enhance data quality
• Controls the quality of measured data.
• Improves the robustness of data analysis in the presence of noise.
• Prevents aggregated results from being affected.
– Event reporting
• Searches for values that do not follow the normal pattern of sensor data in the network.
• The detected values are treated as events indicating changes in phenomena of interest.
– Secure functioning
• Identifies malicious sensors that always generate outlier values.
• Detects potential network attacks by adversaries.
Type of Anomaly
• Point anomalies: an individual data instance can be considered anomalous with respect to the rest of the data.
• Contextual anomalies: a data instance is anomalous in a specific context, but not otherwise.
• Collective anomalies: a collection of related data instances is anomalous with respect to the entire data set.
Characteristics of Anomaly Detection in IoT (not textual)
• Digital (numeric) anomaly detection
• Spatio-temporal correlation
– Temporal correlation: different data instances from the same sensor at different times.
– Spatial correlation (topology correlation): different data instances from different sensors at the same time.
• How to handle these correlations?
– Prior art: design domain-specific algorithms to handle different scenarios case by case, such as the references in speaker notes [1], [2], [3].
• There are too many scenarios in IoT.
• The algorithms are too specific to be reused.
– Our approach: a general framework for different IoT application domains
• Decouple the correlations by defining new features.
• Transform the problem into point anomaly detection.
Guidance of Algorithm Selection
Provide guidelines for selecting a suitable technique for each application domain.
Anomaly detection algorithms
• Nearest neighbor-based
• Classification
• Clustering
• Statistical anomaly detection
• Spectral anomaly detection
Selection criteria
• Anomaly type
• Availability of training data
• Input type
• Output type
• Data features
• Domain-specific features
Optimal algorithm
Anomaly type:
• Point anomalies
• Contextual anomalies
• Collective anomalies
Type of input data:
• Streaming data (frequency)
• Event data
• Query-based data
Data features:
• Volume of data
• Univariate or multivariate
• Dimensionality
Availability of training data:
• Supervised
• Semi-supervised
• Unsupervised
Type of expected output:
• Scalar (0 or 1)
• Anomaly score
System requirements and constraints:
• Real-time or not
• Memory/data buffer size for stream
Procedures of Using the Framework
[Figure: workflow. Feature extraction (guidelines) → anomaly detection algorithm selection (guidelines) → model training & parameter configuration → integration → anomaly detection → evaluation]
Evaluation
• Evaluation Metrics
– Detection rate
• The percentage of outliers that are detected. For example, if all data are claimed to be outliers, the detection rate is 100%.
– False alarm rate
• The rate at which normal data are claimed to be outliers. The example above has a large false alarm rate.
– Receiver operating characteristic (ROC)
• Represents the trade-off between these two rates.
• Evaluation Approaches
– Employ labeled data to evaluate the detection performance.
• Requires a large amount of data and labeling effort.
– Inject faults and test whether the detection algorithm can find the injected faults.
• Requires good knowledge of the various fault types.
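The two rates can be computed from labeled data as in the sketch below. It assumes labels and predictions are encoded as 1 for outlier and 0 for normal; the function name `detection_metrics` and the sample data are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection rate, false alarm rate) for binary outlier labels.

    detection rate   = detected outliers / all true outliers
    false alarm rate = normal data flagged as outliers / all normal data
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate

# Hypothetical ground truth: two outliers among ten readings.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Degenerate detector that claims every reading is an outlier.
flag_everything = [1] * len(labels)
print(detection_metrics(labels, flag_everything))  # (1.0, 1.0)

# A more selective detector: one hit, one miss, one false alarm.
selective = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
print(detection_metrics(labels, selective))  # (0.5, 0.125)
```

The first case shows why detection rate alone is not enough: flagging everything reaches 100% detection only by also flagging every normal reading. Sweeping a detector's threshold and plotting the resulting pairs of rates gives the ROC curve.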
Point Anomaly Detection
Point anomaly detection does not consider the correlations between different data instances.
[Figure: procedure of anomaly detection with a one-class SVM]
Temporal Correlation: Neighborhood
• The last two rows are normal data when viewed in isolation. However, compared with the values from one hour earlier, these values are abnormal.
• Feature: Neighborhood
a − mean
a: the current value.
mean: the mean of the previous k values; k is a configurable parameter.
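A minimal sketch of the neighborhood feature, a − mean, is shown below. The series values and the function name `neighborhood_feature` are hypothetical; only the formula and the parameter k come from the slide.

```python
def neighborhood_feature(values, k):
    """Return a - mean for each reading that has k predecessors.

    a is the current value and mean is the average of the previous k values,
    so a large magnitude signals a sudden change relative to recent history.
    """
    features = []
    for i in range(k, len(values)):
        previous = values[i - k:i]
        features.append(values[i] - sum(previous) / k)
    return features

# Hypothetical traffic-flow series with a sudden jump in the last two readings.
flow = [300, 310, 305, 295, 440, 445]
features = neighborhood_feature(flow, k=3)
print([round(f, 2) for f in features])
```

Each output value can then be fed to any point anomaly detector, since the temporal context has been folded into the feature itself.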
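Spatial Correlation: Topology
• The value of “Hongshan North Road” (虹山北路) is normal data by itself. However, it is abnormal when compared with the value from the neighboring sensor “Yangxianpo” (羊仙坡).
• Feature: Topology
a − b × ratio
a: the current value, e.g., Hongshan North Road, traffic flow 445.
b: the value from the neighboring sensor, e.g., Yangxianpo, traffic flow 315.
ratio: the historical value ratio, e.g., 0.557: the historical average traffic flow of Hongshan North Road is 0.557 times that of Yangxianpo.
445 − 315 × 0.557 ≈ 269.5 (the original figure, 269.31656, presumably uses the unrounded ratio)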
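The topology feature, a − b × ratio, can be sketched as follows using the values quoted in the example (445, 315, and the rounded ratio 0.557). The function name `topology_feature` is hypothetical.

```python
def topology_feature(current, neighbor, ratio):
    """Return a - b * ratio.

    b * ratio is the value expected at this sensor given the neighbor's
    reading and the historical ratio between the two sensors; the feature
    is the deviation of the actual reading from that expectation.
    """
    return current - neighbor * ratio

# Values from the example: this sensor reads 445, the neighbor reads 315,
# and the historical ratio (rounded to three decimals) is 0.557.
deviation = topology_feature(current=445, neighbor=315, ratio=0.557)
print(round(deviation, 3))
```

With the rounded ratio the deviation is about 269.5, far above the expected flow of roughly 175, which is why the reading stands out once the neighbor is taken into account.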