Erroneous Distribution Data Identification Using Outlier Detection

Erroneous Distribution Data Identification Using
Outlier Detection Techniques
W. Zhuang, Y. Zhang, J.F. Grassle
Rutgers, the State University of New Jersey,
USA
Overview
Review of OBIS DQ issues
Review of existing DQ methods
Case study: detecting outliers in multidimensional data
Discussion and future directions
Data Quality (DQ)
DQ problems can arise at every step of the data life cycle:
DQ problems (I)
Data gathering: instrument failures; false identifications; geo-referencing errors
Data storage: key metadata missing; erroneous data entry; database default values masquerading as real values
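A minimal sketch of how default values masquerading as real coordinates might be screened. The field names `lat` and `lon`, the sentinel list, and the function name are assumptions made for illustration; the actual placeholders depend on the source database.

```python
# Hypothetical placeholder values; the real list depends on the database.
SENTINELS = {-9999.0, -999.0, 9999.0}

def suspicious_coordinates(record):
    """Return a list of reasons why a record's coordinates look suspect."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return ["missing coordinate"]
    reasons = []
    if lat in SENTINELS or lon in SENTINELS:
        reasons.append("sentinel value")
    if lat == 0.0 and lon == 0.0:
        # (0, 0) is a valid position, but it is also a common database default.
        reasons.append("possible default (0, 0)")
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        reasons.append("out of range")
    return reasons

# Made-up records for illustration only.
records = [
    {"lat": 40.5, "lon": -74.4},
    {"lat": 0.0, "lon": 0.0},
    {"lat": -9999.0, "lon": 10.0},
    {"lat": None, "lon": 1.0},
]
for rec in records:
    print(rec, suspicious_coordinates(rec))
```

A flag here only marks a candidate for review; it does not prove the record is wrong.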

DQ problems (II)
Data delivery: data corruption due to encoding conversion
Data integration: duplicated records
Data retrieval: missing values
Data analysis/cleaning: inappropriate models used, etc.
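One simple way duplicated records from data integration could be found is by grouping on a normalized key. This is only a sketch: the key fields (`name`, `lat`, `lon`, `date`), the rounding precision, and the helper names are assumptions, not part of the study.

```python
from collections import defaultdict

def normalize(value):
    """Make values comparable: trim and lowercase text, round coordinates."""
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, float):
        return round(value, 4)
    return value

def find_duplicates(records, key_fields=("name", "lat", "lon", "date")):
    """Return lists of record indices that share the same normalized key."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(normalize(rec.get(f)) for f in key_fields)
        groups[key].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Made-up records for illustration only.
records = [
    {"name": "Gadus morhua", "lat": 40.5, "lon": -74.4, "date": "1999-06-01"},
    {"name": " gadus morhua ", "lat": 40.50001, "lon": -74.4, "date": "1999-06-01"},
    {"name": "Gadus morhua", "lat": 41.0, "lon": -70.0, "date": "1999-06-01"},
]
print(find_duplicates(records))
```

Exact-key grouping misses near-duplicates that differ in spelling or date format, so it is a first pass at best.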
DQ solving: a process-based approach
DQ solving is an essential component of data analysis and thus part of the data life cycle
A. It builds the foundation for analysis and modeling
B. It provides feedback to improve the whole data life cycle
C. It could lead to more DQ problems if not carefully executed
DQ solving methods
Harvest metadata close to data
Built-in integrity check and double data entry
Model-based approach:
a) statistical
b) heuristic
OBIS DQ Study
Metadata-related problems
DQ on scientific names
Integrity checking
Redundant record detection
Outlier detection: a case study
Outliers sometimes represent erroneous data
We are examining data mining tools for detecting erroneous data points
DBSCAN: a clustering tool
DBSCAN is density-based in feature space
It deals with high-dimensional data
There is no need to specify the number of clusters
It identifies outliers during the clustering process
It is a fast algorithm and freely available
M. Ester, H.-P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96.
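The properties listed above can be illustrated with a small pure-Python sketch of the DBSCAN procedure. It is not the implementation used in the study: the function names, the Euclidean distance on raw coordinates, and the sample points are assumptions, and the brute-force neighbor search is far slower than an indexed implementation.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    labels = [None] * len(points)          # None means not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Made-up points: one dense group and one isolated record.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.25), (50.0, 60.0)]
print(dbscan(points, eps=1.0, min_pts=5))
```

The number of clusters is never passed in; it falls out of the two density parameters, and points left with the label -1 are the outlier candidates.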
A diagram of DBSCAN
[Figure: example points labeled Outlier, Border, and Core, with Eps = 1 unit and MinPts = 5]
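The three point types in the diagram can be computed directly from neighbor counts. The sketch below assumes Euclidean distance and counts a point as its own neighbor; the function name and sample points are hypothetical.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier'."""
    # Neighborhood of each point: every point within eps, itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")     # not dense itself, but next to a core point
        else:
            kinds.append("outlier")
    return kinds

# Made-up points: a dense cross, one point on its edge, one far away.
points = [
    (0.0, 0.0), (0.4, 0.0), (-0.4, 0.0), (0.0, 0.4), (0.0, -0.4),
    (1.2, 0.0),
    (5.0, 5.0),
]
print(classify_points(points, eps=1.0, min_pts=5))
```

With the diagram's settings (Eps of 1 unit, MinPts of 5), the five clustered points come out as core, the edge point as border, and the distant point as an outlier.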
Total points distribution
[Figure: scatter plot of the whole dataset; axes span -180 to 180 and -90 to 90]
Result from DBSCAN
[Figure: scatter plot on the same axes (-180 to 180, -90 to 90); legend: cluster points, outliers]
Limitations of the method
Geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records.
Are there other methods to identify erroneous distribution data? How about using environmental data as proxies?
Can we get some more information?
[Figure: scatter plot on the same axes (-180 to 180, -90 to 90); legend: dcsn, dcso, dosn, doso]
Limitations of using environmental variables
Risk of imposing a rigid model at the time of preprocessing
Risk of losing valuable outliers
Risk of circular logic in later analyses
Discussion
Why don’t you use more environmental variables?
Can you use DBSCAN on environmental variables directly?
Possible improvements
Define multiple methods as DQ components
Assign bootstrap weights
Present outlier candidates to experts
Update weights based on user feedback
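The loop above could be sketched as a weighted vote over several detectors whose weights shift with expert feedback. Everything here is an assumption made for illustration: the detector names, the equal starting weights standing in for bootstrap weights, and the multiplicative update rule are not taken from the study.

```python
def score(flags, weights):
    """Weighted vote for one record; flags maps detector name -> bool."""
    total = sum(weights.values())
    return sum(weights[name] for name, hit in flags.items() if hit) / total

def update_weights(weights, flags, expert_says_error, rate=0.5):
    """Reward detectors that agreed with the expert, penalize the others.

    Hypothetical rule: scale each weight by (1 + rate) or (1 - rate),
    then renormalize so the weights sum to 1.
    """
    new = {}
    for name, w in weights.items():
        agreed = flags[name] == expert_says_error
        new[name] = w * (1 + rate) if agreed else w * (1 - rate)
    total = sum(new.values())
    return {name: w / total for name, w in new.items()}

# Two hypothetical DQ components with equal starting weights.
weights = {"geographic": 0.5, "environmental": 0.5}
flags = {"geographic": True, "environmental": False}   # detector output for one record

print(score(flags, weights))                 # candidate score shown to the expert
weights = update_weights(weights, flags, expert_says_error=True)
print(weights)                               # the detector that agreed gains weight
```

The expert's verdict stays decisive; the weights only change the order in which candidates are presented.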
Summary
Many data quality problems can arise throughout the data life cycle.
Preliminary checking can eliminate many simple errors.
Expert knowledge should be integrated and be the decisive factor in DQ solving.
Data mining techniques may act as metal detectors, so that experts can focus on a narrowed-down group of candidates.