An Unbiased Distance-based Outlier Detection Approach for High Dimensional Data
DASFAA 2011
By Hoang Vu Nguyen, Vivekanand Gopalkrishnan and Ira Assent
Presented by Salman Ahmed Shaikh (D1)

Contents
• Introduction
• Subspace Outlier Detection Challenges
• Objectives of Research
• The Approach
  – Subspace Outlier Score Function: FSout
  – HighDOD Algorithm
• Empirical Results and Analysis
• Conclusion

Introduction
• An outlier is an observation that appears to deviate markedly from other members of the sample in which it occurs. [1]
• Popular outlier detection techniques
  – Distance based
  – Density based
• Since these techniques take the full-dimensional space into account, their performance is degraded by noisy or irrelevant features.
• Recently, researchers have switched to subspace anomaly detection.
[Figure: 2-D scatter plot (axes X and Y) with labels N1, N2, o1, o2 and o3. o1, o2 and o3 are anomalous instances w.r.t. the data.]
[Figure: Anomalous subsequence]

Subspace Outlier Detection Challenges
• Unavoidable exploration of all subspaces to mine the full result set:
  – Since the monotonicity property does not hold for outliers, one cannot apply Apriori-like heuristics for mining outliers.
• Difficulty in devising an outlier notion:
  – Full-dimensional outlier detection techniques suffer from dimensionality bias when applied to subspaces.
  – They assign higher outlier scores in high-dimensional subspaces than in lower-dimensional ones.
• Exposure to a high false alarm rate:
  – A binary decision on each data point (normal or outlier) in each subspace flags too many points as outliers.
  – The solution is a ranking-based algorithm.

Objectives
• Build an efficient technique for mining outliers in subspaces, which should
  – Avoid an expensive scan of all subspaces while still yielding high detection accuracy
  – Ease the task of parameter setting
  – Facilitate the design of pruning heuristics to speed up the detection process
  – Provide a ranking of outliers across subspaces

The Approach
• The authors state a property and give some definitions to explain their approach.
  Non-monotonicity Property: Consider a data point p in the dataset DS. Even if p is not anomalous in a subspace S of DS, it may be an outlier in some projection(s) of S. Conversely, even if p is a normal data point in all projections of S, it may be an outlier in S.
[Figure: two 2-D scatter plots, both axes ranging from 0 to 4. (a) A is an outlier in the full space but not in a subspace. (b) B is an outlier in a subspace but not in the full space.]

(Subspace) Outlier Score Function
• Outlier Score Function: Fout, as given by Angiulli et al. for the full space [2]
  The dissimilarity of a point p with respect to its k nearest neighbors is measured by its cumulative neighborhood distance, defined as the total distance from p to its k nearest neighbors in DS.
  – To ensure that the non-monotonicity property is not violated, the authors redefine the outlier score function for subspaces.
• Subspace Outlier Score Function: FSout
  The dissimilarity of a point p with respect to its k nearest neighbors in a subspace S of dimensionality dim(S) is measured by its cumulative neighborhood distance, defined as the total distance from p to its k nearest neighbors in DS (projected onto S), normalized by dim(S).
  – Here pS denotes the projection of a data point p ∈ DS onto S.

FSout is Dimensionality Unbiased
• FSout assigns multiple outlier scores to each data point (one per subspace) and is dimensionality unbiased.
• Example: let k = 1 and l = 2.
• In Fig. (a), A's outlier score in the 2-dimensional space is 1/√2, which is the largest across all subspaces.
• In Fig. (b), the outlier score of B when projected onto the x-axis subspace is 1, which is also the largest across all subspaces.
• Hence, FSout flags A and B as outliers.
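
The score described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. It assumes that l is the order of the L_l (Minkowski) distance and that the normaliser is dim(S)^(1/l), i.e. FSout(p, S) = (1 / dim(S)^(1/l)) · Σ D(pS, mS) over the k nearest neighbours m of p in S. That reading is consistent with the 1/√2 value in the example, but it is a reconstruction, since the slide text only says "normalized by dim(S)". The function names and both toy datasets are hypothetical and are not the coordinates of Fig. (a) or (b).

```python
from itertools import combinations


def minkowski(a, b, dims, l=2):
    """L_l distance between points a and b, using only the dimensions in dims."""
    return sum(abs(a[d] - b[d]) ** l for d in dims) ** (1.0 / l)


def fs_out(data, idx, dims, k=1, l=2):
    """Total distance from data[idx] to its k nearest neighbours in subspace dims,
    divided by dim(S) ** (1/l). The exponent 1/l is an assumption (see lead-in)."""
    dists = sorted(
        minkowski(data[idx], q, dims, l)
        for j, q in enumerate(data)
        if j != idx
    )
    return sum(dists[:k]) / (len(dims) ** (1.0 / l))


def all_subspace_scores(data, idx, k=1, l=2):
    """Map every non-empty subspace (a tuple of dimension indices) to the score."""
    n_dims = len(data[0])
    return {
        dims: fs_out(data, idx, dims, k, l)
        for size in range(1, n_dims + 1)
        for dims in combinations(range(n_dims), size)
    }


# Hypothetical data: A = (1, 3) lies off a diagonal cluster, so each 1-D
# projection hides it, while the full 2-D space exposes it.
data_a = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (1, 3)]
scores_a = all_subspace_scores(data_a, idx=5)

# Hypothetical data: B = (2, 1.5) stands out on the x-axis only.
data_b = [(0, 0), (0, 1), (0, 2), (0, 3), (2, 1.5)]
scores_b = all_subspace_scores(data_b, idx=4)

print("A:", max(scores_a, key=scores_a.get), scores_a)
print("B:", max(scores_b, key=scores_b.get), scores_b)
```

With these toy points, A scores 0 on each single axis and highest in the full 2-D space, while B scores highest on the x-axis alone, mirroring the two cases of the non-monotonicity property.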

Subspace Outlier Detection Problem
• Using FSout to score outliers in subspaces, the mining problem can be redefined as:
  Given two positive integers k and n, mine the top n distinct anomalies whose outlier scores (in any subspace) are largest.

HighDOD: Subspace Outlier Detection Algorithm
• HighDOD (High-dimensional Distance-based Outlier Detection) is
  – A distance-based approach to detecting outliers in very high-dimensional datasets
  – Unbiased w.r.t. the dimensionality of different subspaces
  – Capable of producing a ranking of outliers
• HighDOD is composed of the following 3 algorithms
  – OutlierDetection
  – CandidateExtraction
  – SubspaceMining
• Algorithm OutlierDetection examines subspaces of dimensionality up to a threshold m = O(log N), as suggested by Aggarwal and Ailon in [3, 4]

Algorithm 1: OutlierDetection
• Carries out a bottom-up exploration of all subspaces of dimensionality up to m = O(log N)

Algorithm 2: CandidateExtraction
• Estimates the data points' local densities with a kernel density estimator and chooses the βn data points with the lowest estimates as potential candidates.

Algorithm 3: SubspaceMining
• Updates the set of outliers TopOut with the 2n candidate outliers extracted from a subspace S.

Empirical Results and Analysis
• The authors compare HighDOD with DenSamp, HighOut, PODM and LOF.
• The experiments compare detection accuracy and scalability.
• The precision-recall trade-off curve is used to evaluate the quality of an unordered set of retrieved items.
• Datasets
  – 4 real datasets from the UCI Repository are used.

Comparison of Detection Accuracy
[Figure: Detection accuracy of HighDOD, DenSamp, HighOut, PODM and LOF]

Comparison of Scalability
• Since PODM and LOF yield unsatisfactory accuracy, they are not included in this experiment.
• The scalability test uses the CorelHistogram (CH) dataset, which consists of 68040 records in a 32-dimensional space.
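
Stepping back to Algorithms 1 to 3, the way the three procedures fit together can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it enumerates every subspace exhaustively with no pruning, uses a fixed-bandwidth Gaussian kernel for the density estimate, normalises the score by dim(S)^(1/l), and reads N in m = O(log N) as the number of data points. All function and parameter names (candidate_extraction, subspace_mining, outlier_detection, bandwidth, max_dim) are hypothetical.

```python
import math
from itertools import combinations


def dist(a, b, dims, l=2):
    """L_l distance between a and b restricted to the dimensions in dims."""
    return sum(abs(a[d] - b[d]) ** l for d in dims) ** (1.0 / l)


def fs_out(data, idx, dims, k, l=2):
    """Sum of distances to the k nearest neighbours in dims, divided by
    dim(S) ** (1/l) (assumed normaliser)."""
    dists = sorted(dist(data[idx], q, dims, l)
                   for j, q in enumerate(data) if j != idx)
    return sum(dists[:k]) / (len(dims) ** (1.0 / l))


def candidate_extraction(data, dims, n_cand, bandwidth=1.0):
    """Return the indices of the n_cand points with the lowest Gaussian-kernel
    density estimate in subspace dims (kernel and bandwidth are assumptions)."""
    def density(i):
        return sum(math.exp(-0.5 * (dist(data[i], q, dims) / bandwidth) ** 2)
                   for j, q in enumerate(data) if j != i)
    return sorted(range(len(data)), key=density)[:n_cand]


def subspace_mining(data, dims, candidates, top_out, n, k):
    """Score the candidates in dims, keep each point's best score so far,
    and return only the n highest-scoring points."""
    for i in candidates:
        score = fs_out(data, i, dims, k)
        if score > top_out.get(i, -1.0):
            top_out[i] = score
    keep = sorted(top_out, key=top_out.get, reverse=True)[:n]
    return {i: top_out[i] for i in keep}


def outlier_detection(data, n, k, beta=2, max_dim=None):
    """Bottom-up pass over all subspaces of dimensionality 1..max_dim.
    Returns (index, best score) pairs, highest score first."""
    n_dims = len(data[0])
    if max_dim is None:
        # Assumed reading of m = O(log N): natural log of the number of points.
        max_dim = min(n_dims, max(1, math.ceil(math.log(len(data)))))
    top_out = {}
    for size in range(1, max_dim + 1):
        for dims in combinations(range(n_dims), size):
            cands = candidate_extraction(data, dims, beta * n)
            top_out = subspace_mining(data, dims, cands, top_out, n, k)
    return sorted(top_out.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: a 4 x 4 grid at z = 0 plus one point lifted to z = 5.
grid = [(float(x), float(y), 0.0) for x in range(4) for y in range(4)]
data = grid + [(1.5, 1.5, 5.0)]
print(outlier_detection(data, n=1, k=1))
```

On this toy input the lifted point (index 16) is returned as the top outlier, with its best score coming from the 1-dimensional z subspace. Exhaustive enumeration is only feasible here because the data has three dimensions; the sketch does not attempt the speed-ups the slides attribute to HighDOD.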
[Figure: Scalability of HighDOD, DenSamp and HighOut]

Conclusion
• The work proposes a new outlier detection technique that is dimensionality unbiased.
• It extends distance-based anomaly detection to subspace analysis.
• It facilitates the design of ranking-based algorithms.
• It introduces HighDOD, a ranking-based technique for subspace outlier mining.

References
[1] Wikipedia: Outlier. http://en.wikipedia.org/wiki/Outlier
[2] Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng., 2005.
[3] Aggarwal, C.C., Yu, P.S.: An effective and efficient algorithm for high-dimensional outlier detection. VLDB Journal, 2005.
[4] Ailon, N., Chazelle, B.: Faster dimension reduction. Commun. ACM, 2010.