An Unbiased Distance-based
Outlier Detection Approach for
High Dimensional Data
DASFAA 2011
By
Hoang Vu Nguyen, Vivekanand Gopalkrishnan and Ira Assent
Presented By
Salman Ahmed Shaikh (D1)
Contents
• Introduction
• Subspace Outlier Detection Challenges
• Objectives of Research
• The Approach
– Subspace Outlier Score Function: FSout
– HighDOD Algorithm
• Empirical Results and Analysis
• Conclusion
Introduction
• An outlier is an observation that appears to deviate
markedly from other members of the sample
in which it occurs. [1]
• Popular techniques of outlier detection
– Distance based
– Density based
• Since these techniques take the full-dimensional space into account, their
performance is impacted by noisy or
irrelevant features.
• Recently, researchers have switched to
subspace anomaly detection.
[Figure: scatter plot with axes X and Y, regions labelled N1 and N2, and points o1, o2 and o3. o1, o2 and o3 are anomalous instances w.r.t. the data.]
[Figure: Anomalous Subsequence]
Subspace Outlier Detection Challenges
• Unavoidable exploration of all subspaces to mine the full result
set:
– As the monotonicity property does not hold for outliers,
one cannot apply an Apriori-like heuristic for mining them.
• Difficulty in devising an outlier notion:
– Full-dimensional outlier detection techniques suffer from
dimensionality bias when applied to subspaces.
– They assign higher outlier scores in high-dimensional subspaces than in
lower-dimensional ones.
• Exposure to a high false alarm rate:
– A binary decision on each data point (normal or outlier) in each
subspace flags too many points as outliers.
– The solution is a ranking-based algorithm.
Objectives
• Build an efficient technique for mining outliers
in subspaces, which should
– Avoid an expensive scan of all subspaces while still
yielding high detection accuracy
– Ease the task of parameter setting
– Facilitate the design of pruning heuristics to
speed up the detection process
– Provide a ranking of outliers across subspaces
The Approach
• The authors have made an assertion and given some definitions
to explain their research approach.
Non-monotonicity Property: Consider a data point p in the
dataset DS. Even if p is not anomalous in subspace S of DS, it
may be an outlier in some projection(s) of S. Even if p is a normal
data point in all projections of S, it may be an outlier in S.
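[Figure (a): 2-dimensional plot, both axes from 0 to 4, marking point A. A is an outlier in the full space but not in the subspaces.]
[Figure (b): 2-dimensional plot, both axes from 0 to 4, marking point B. B is an outlier in a subspace but not in the full space.]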
(Subspace) Outlier Score Function
• Outlier Score Function: Fout, as given by Angiulli and Pizzuti for the full space [2]
The dissimilarity of a point p with respect to its k nearest neighbors is
measured by its cumulative neighborhood distance, defined as the total
distance from p to its k nearest neighbors in DS.
– To ensure that the non-monotonicity property is not violated, the authors
redefine the outlier score function for subspaces as follows.
• Subspace Outlier Score Function: FSout
The dissimilarity of a point p with respect to its k nearest neighbors in a
subspace S of dimensionality dim(S) is measured by its cumulative
neighborhood distance, defined as the total distance from p to its k
nearest neighbors in DS (projected onto S), normalized by dim(S).
– Here p^S denotes the projection of a data point p ∈ DS onto S.
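The two score functions described above can be written compactly. This is a sketch reconstructed from the slide wording, not a copy of the authors' formulas: kNN_p and kNN_p(S) (the k nearest neighbors of p in the full space and in S) and D (an L_l distance) are notation assumed here. The normalizer dim(S)^{1/l} is also an assumption; it is one reading of "normalized by dim(S)" that, with l = 2, agrees with the 1/2^{1/2} score quoted for a 2-dimensional space in the dimensionality example.

```latex
\[
  F_{out}(p) \;=\; \sum_{m \in kNN_p} D(p, m)
\]
\[
  FS_{out}(p, S) \;=\; \frac{1}{\dim(S)^{1/l}} \sum_{m \in kNN_p(S)} D\!\left(p^{S}, m^{S}\right)
\]
```

Under this reading the two functions differ only in the projection onto S and the dimensionality-dependent normalizer.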
FSout is Dimensionality Unbiased
• FSout assigns multiple outlier scores to each data point and is dimensionality
unbiased.
• Example: let k=1 and l=2
• In Fig. (a), A's outlier score in the 2-dimensional space is $1/2^{1/2}$, which is the
largest across all subspaces.
• In Fig.(b), the outlier score of B when projected on the subspace of the x-axis
is 1, which is also the largest in all subspaces.
• Hence, FSout flags A and B as outliers.
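To make the idea of one score per subspace concrete, here is a minimal Python sketch. It rests on assumptions: the data points are hypothetical (they are not the points from the figures), the distance is the L_l norm, and the normalizer is taken to be dim(S)^(1/l), which is one reading of "normalized by dim(S)". The names `minkowski` and `fs_out` are illustrative only.

```python
from itertools import combinations


def minkowski(a, b, dims, l=2):
    """L_l distance between a and b, using only the attribute indices in dims."""
    return sum(abs(a[d] - b[d]) ** l for d in dims) ** (1.0 / l)


def fs_out(idx, data, dims, k=1, l=2):
    """Sum of distances from data[idx] to its k nearest neighbours in the
    subspace dims, divided by dim(S)**(1/l) (ASSUMED normalisation)."""
    dists = sorted(
        minkowski(data[idx], q, dims, l)
        for j, q in enumerate(data)
        if j != idx
    )
    return sum(dists[:k]) / len(dims) ** (1.0 / l)


# Hypothetical 2-D data set; the last point plays the role of A.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (3, 3)]
a = len(data) - 1

# All non-empty subspaces of a 2-D space: (0,), (1,), (0, 1).
subspaces = [s for size in (1, 2) for s in combinations(range(2), size)]

# One score per subspace for point A, and the subspace where it peaks.
scores = {s: fs_out(a, data, s) for s in subspaces}
best = max(scores, key=scores.get)

# Rank every point by its largest score over all subspaces.
ranking = sorted(
    range(len(data)),
    key=lambda i: max(fs_out(i, data, s) for s in subspaces),
    reverse=True,
)
```

In this toy data set the last point shares its x value and its y value with other points, so it scores 0 in both 1-dimensional projections, yet it has the highest score of all points in the 2-dimensional space. This mirrors the situation described for A.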
Subspace Outlier Detection Problem
• With FSout as the outlier score in subspaces, the mining
problem can be redefined as:
Given two positive integers k and n, mine the
top n distinct anomalies whose outlier scores
(in any subspace) are largest.
HighDOD-Subspace Outlier Detection Algorithm
• HighDOD (High-dimensional Distance-based Outlier Detection) is
– A distance-based approach to detecting outliers in very high-dimensional datasets
– Unbiased w.r.t. the dimensionality of different subspaces
– Capable of producing a ranking of outliers
• HighDOD is composed of the following three algorithms
– OutlierDetection
– CandidateExtraction
– SubspaceMining
• Algorithm OutlierDetection examines subspaces of
dimensionality up to some threshold m = O(log N), as
suggested by Aggarwal and Yu [3] and by Ailon and Chazelle [4]
Algorithm 1: OutlierDetection
• Carries out a bottom-up exploration of all subspaces up to
dimensionality m = O(log N)
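The bottom-up exploration can be sketched as a generator that yields subspaces in order of increasing dimensionality. This is a minimal illustration, not the authors' algorithm: the slides only state m = O(log N), so the base-10 logarithm and the ceiling used in `dimensionality_threshold` are assumptions, and both function names are hypothetical.

```python
import math
from itertools import combinations


def dimensionality_threshold(n_points):
    """ASSUMED concrete choice for m = O(log N): ceil(log10(N)), at least 1."""
    return max(1, math.ceil(math.log10(n_points)))


def subspaces_up_to(num_dims, m):
    """Yield attribute-index tuples bottom-up: all 1-D subspaces first,
    then all 2-D subspaces, and so on up to dimensionality m."""
    for size in range(1, m + 1):
        for dims in combinations(range(num_dims), size):
            yield dims


# Small example: 3 attributes, subspaces of dimensionality 1 and 2.
small = list(subspaces_up_to(3, 2))

# For a data set of 68,040 points in 32 dimensions (the CH data set size),
# compare the number of subspaces explored with the number of all
# non-empty subspaces.
m = dimensionality_threshold(68040)
explored = sum(math.comb(32, size) for size in range(1, m + 1))
all_subspaces = 2 ** 32 - 1
```

Under these assumptions the threshold cuts the search from about 4.3 billion subspaces to a few hundred thousand.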
Algorithm 2: CandidateExtraction
• Estimates the data points' local densities using a kernel density
estimator and chooses the βn data points with the lowest estimates as
potential candidates.
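The candidate extraction step can be illustrated with a small density-based filter. This is a sketch under stated assumptions: the slides do not name the kernel or the bandwidth, so a Gaussian kernel with a hypothetical `bandwidth` parameter is used, the kernel's constant factor is dropped because only the ordering of the estimates matters, and `beta` is treated as a positive integer.

```python
import math


def kernel_density(data, dims, bandwidth=1.0):
    """Unnormalised Gaussian kernel density estimate for every point,
    computed in the subspace given by the attribute indices dims."""
    estimates = []
    for i, p in enumerate(data):
        total = 0.0
        for j, q in enumerate(data):
            if i == j:
                continue  # a point does not contribute to its own density
            sq_dist = sum((p[d] - q[d]) ** 2 for d in dims)
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        estimates.append(total / (len(data) - 1))
    return estimates


def extract_candidates(data, dims, n, beta=2, bandwidth=1.0):
    """Return the indices of the beta * n points with the lowest density."""
    density = kernel_density(data, dims, bandwidth)
    order = sorted(range(len(data)), key=density.__getitem__)
    return order[: beta * n]


# Hypothetical data: a tight cluster plus one isolated point (index 4).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
candidates = extract_candidates(data, dims=(0, 1), n=1, beta=2)
```

Only the points returned here would then need an exact nearest-neighbor score, which is where the saving comes from.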
Algorithm 3: SubspaceMining
• This procedure is used to update the set of outliers TopOut with 2n
candidate outliers extracted from a subspace S.
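One way to picture the update of TopOut is as a merge that keeps, for each point, its best score seen so far and then trims the set to the n highest-scoring points. The sketch below is an assumption about the bookkeeping only. It does not reproduce the authors' pruning rules, and `update_top_out` and its dictionary layout are hypothetical.

```python
def update_top_out(top_out, candidate_scores, subspace, n):
    """Merge candidates from one subspace into top_out.

    top_out maps point id -> (score, subspace) and is modified in place.
    candidate_scores maps point id -> outlier score in `subspace`.
    Each point is kept once, with the largest score seen so far.
    """
    for pid, score in candidate_scores.items():
        if pid not in top_out or score > top_out[pid][0]:
            top_out[pid] = (score, subspace)

    # Keep only the n points with the largest scores.
    keep = sorted(top_out, key=lambda pid: top_out[pid][0], reverse=True)[:n]
    for pid in list(top_out):
        if pid not in keep:
            del top_out[pid]
    return top_out


# Hypothetical scores from two subspaces, with n = 2.
top_out = {}
update_top_out(top_out, {1: 0.5, 2: 0.9, 3: 0.1, 4: 0.7}, (0,), n=2)
after_first = dict(top_out)
update_top_out(top_out, {1: 1.2, 2: 0.3, 5: 0.8, 6: 0.2}, (0, 1), n=2)
```

Because a point's entry is overwritten rather than duplicated, the result stays a set of distinct anomalies, as the problem statement requires.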
Empirical Results and Analysis
• The authors compare HighDOD with DenSamp, HighOut,
PODM and LOF.
• The experiments compare detection
accuracy and scalability.
• The precision-recall trade-off curve is used to evaluate the quality
of an unordered set of retrieved items.
• Datasets
– Four real data sets from the UCI Repository are used.
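For readers unfamiliar with the precision-recall evaluation mentioned above, here is a minimal sketch. The identifiers and labels are hypothetical, and building the curve by cutting a ranked list at every position is an assumption about how the points of the curve are obtained, not a description of the authors' setup.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of an unordered set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def pr_curve(ranking, relevant):
    """(precision, recall) pairs for the top-1, top-2, ... prefixes of a ranking."""
    return [
        precision_recall(ranking[:cut], relevant)
        for cut in range(1, len(ranking) + 1)
    ]


# Hypothetical example: points 2 and 4 are the true outliers.
single = precision_recall([1, 2, 3, 4], [2, 4, 6])
curve = pr_curve([2, 1, 4], [2, 4])
```

Each prefix of the ranking is itself an unordered retrieved set, which is why set-based precision and recall suffice to trace the trade-off.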
Comparison of Detection Accuracy
Detection accuracy of HighDOD, DenSamp, HighOut, PODM and LOF
Comparison of Scalability
• Since PODM and LOF yield unsatisfactory accuracy, they are excluded
from this experiment.
• The scalability test uses the CorelHistogram (CH) dataset, consisting of
68,040 records in a 32-dimensional space.
Scalability of HighDOD, DenSamp and HighOut
Conclusion
• The work proposes a new outlier detection
technique that is dimensionality unbiased.
• It extends distance-based anomaly detection to
subspace analysis.
• It facilitates the design of ranking-based
algorithms.
• It introduces HighDOD, a ranking-based
technique for subspace outlier mining.
References
[1] Wikipedia: Outlier. http://en.wikipedia.org/wiki/Outlier
[2] Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets.
IEEE Trans. Knowl. Data Eng., 2005.
[3] Aggarwal, C.C., Yu, P.S.: An effective and efficient
algorithm for high-dimensional outlier detection.
VLDB Journal, 2005.
[4] Ailon, N., Chazelle, B.: Faster dimension reduction.
Commun. ACM, 2010.