Dongmei Ren, Baoying Wang, William Perrizo
Outlier detection can lead to discovering unexpected and interesting knowledge, which is critically important to areas such as the monitoring of criminal activities in electronic commerce, credit card fraud detection, etc. In this paper, we develop an efficient density-based outlier detection method for large datasets. Our contributions are: a) we introduce a relative density factor (RDF); b) based on RDF, we propose an RDF-based outlier detection method which efficiently prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data; c) the performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method on NHL and NBA data. It shows an order of magnitude speed improvement compared to contemporary approaches.
The problem of mining rare events, deviant objects, and exceptions is critically important in many domains, such as electronic commerce, networking, surveillance, and health monitoring. Outlier mining is drawing more and more attention. The current outlier mining approaches can be classified as statistics-based [1], distance-based [2][3][4], clustering-based [7], and deviation-based [8][9]. Density-based outlier detection approaches [5][6] are attracting the most attention for KDD in large databases.
Breunig et al. proposed a density-based approach to mining outliers over datasets with different densities and arbitrary shapes [5]. Their notion of outliers is local in the sense that the outlier degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. The method does not suffer from the local density problem, so it can mine outliers over non-uniformly distributed datasets. However, it needs three scans over the data, and the neighborhood search is computationally expensive, which makes the method inefficient. Another density-based approach was introduced by Papadimitriou & Kitagawa [6] using the local correlation integral (LOCI). This method selects a point as an outlier if its multi-granularity deviation factor (MDEF) deviates by more than three standard deviations from the MDEF values in its neighborhood. However, the cost of computing the standard deviation is high.
In this paper, we propose an efficient density-based outlier detection method using a vertical data model, P-trees 1. We introduce a novel local density measurement, the relative density factor (RDF). RDF indicates the degree to which the density of a point P contrasts with those of its neighbors, and we take RDF as an outlierness measurement. Based on RDF, our method prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data, which makes it efficient. The performance of the algorithm is further enhanced by means of P-trees. Our method was tested on NHL and NBA datasets. Experiments show that it achieves an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.
In previous work, we proposed a novel vertical data structure, the P-tree [10][11]. In the P-tree approach, we decompose the attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-tree. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees remarkably compress the data and the P-tree logical operations scale extremely well, this vertical data structure has the potential to address non-scalability with respect to data size. In this section, we briefly review the features of P-trees that will be used in this paper, including their optimized logical operations.

1 Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
Given a data set with d attributes, X = (A_1, A_2, …, A_d), and the binary representation of the j-th attribute A_j as b_{j.m} b_{j.m-1} … b_{j.i} … b_{j.1} b_{j.0}, we decompose each attribute into bit files, one file for each bit position [10]. Each bit file is converted into a P-tree. Logical AND, OR and NOT are the most frequently used P-tree operations; they facilitate efficient neighborhood search, pruning, and the computation of RDF.
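To make the decomposition concrete, the following minimal Python sketch (our illustration, not the authors' implementation) builds one bit file per bit position of an attribute column. For simplicity it keeps each bit file as an uncompressed bitmap stored in a Python integer; a real P-tree would additionally compress these bitmaps into quadrant trees.

def vertical_bit_slices(values, nbits):
    # One bitmap per bit position, most significant first. Bit i of a
    # bitmap is set iff tuple i has a 1 at that position (an
    # uncompressed stand-in for a P-tree).
    slices = []
    for pos in range(nbits - 1, -1, -1):
        bm = 0
        for i, v in enumerate(values):
            if (v >> pos) & 1:
                bm |= 1 << i
        slices.append(bm)
    return slices

# Example: the 3-bit column [5, 3, 7] (101, 011, 111) yields the bit
# files [0b101, 0b110, 0b111] for positions 2, 1, 0; column-wise
# AND/OR/NOT become &, |, ^ on these integers.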
Calculation of inequality P-trees P_{x≥v} and P_{x<v}: Let x be a value of an m-bit attribute of the data set X; let P_m, P_{m-1}, …, P_0 be the P-trees of the vertical bit files of X, and P'_m, P'_{m-1}, …, P'_0 be the P-trees of their complements. Let v = b_m … b_i … b_0, where b_i is the i-th binary bit of v. Then

P_{x≥v} = P_m op_m … P_i op_i P_{i-1} … op_{k+1} P_k,  k ≤ i ≤ m   (a)

P_{x<v} = P'_m op_m … P'_i op_i P'_{i-1} … op_{k+1} P'_k,  k ≤ i ≤ m   (b)

In (a), op_i is ∧ if b_i = 1 and ∨ if b_i = 0; in (b), op_i is ∨ if b_i = 1 and ∧ if b_i = 0. In both (a) and (b), ∨ stands for OR and ∧ for AND, k is the rightmost bit position of v with value 1, and the operators are right binding, which means they associate from right to left, e.g., P_2 op_2 P_1 op_1 P_0 is equivalent to (P_2 op_2 (P_1 op_1 P_0)).
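The right-binding chains in (a) and (b) can be evaluated by a simple fold from the least significant bit upward; starting the fold from "x ≥ 0 holds everywhere" (respectively "x < 0 holds nowhere") makes the cutoff at position k automatic. A small Python sketch over the uncompressed bitmaps of the previous example (again our illustration, with hypothetical helper names):

def p_geq(slices, v, nbits, all_ones):
    # P(x >= v): fold op_i from the LSB up; op_i is AND where bit i
    # of v is 1 and OR where it is 0. Base: x >= 0 holds everywhere.
    r = all_ones
    for pos in range(nbits):
        p = slices[nbits - 1 - pos]          # slices are MSB-first
        r = (p & r) if (v >> pos) & 1 else (p | r)
    return r

def p_less(slices, v, nbits, all_ones):
    # P(x < v): the same fold over the complement slices with the ops
    # swapped. Base: x < 0 holds nowhere.
    r = 0
    for pos in range(nbits):
        pc = all_ones ^ slices[nbits - 1 - pos]   # complement slice
        r = (pc | r) if (v >> pos) & 1 else (pc & r)
    return r

# With values [5, 3, 7] and all_ones = 0b111, p_geq(slices, 4, 3, 0b111)
# marks tuples 0 and 2 (values 5 and 7), i.e. the bitmap 0b101.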
High Order Bit (HOBit) metric: The HOBit metric [12] is a bitwise distance function. It measures distance based on the most significant consecutive matching bit positions, starting from the left. Assume A_i is an attribute of a tabular data set R(A_1, A_2, …, A_n) whose values are represented as binary numbers, i.e., x = x_m x_{m-1} … x_1 x_0 . x_{-1} … x_{-n}. Let X and Y be the A_i values of two tuples/samples. The HOBit similarity between X and Y is defined by m(X, Y) = max{i | x_j ⊕ y_j = 0 for all j ≤ i}, where x_j and y_j are the j-th bits of X and Y counted from the most significant bit, and ⊕ denotes the XOR (exclusive OR) operation; that is, m is the number of most significant consecutive matching bits. Correspondingly, the HOBit dissimilarity is defined by dm(X, Y) = N_bit − m, where N_bit is the number of bits of the attribute.
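For two attribute values held as machine integers, both quantities can be read off the XOR of the values; a minimal Python sketch (our illustration):

def hobit(x, y, nbits):
    # XOR leaves 1-bits exactly where x and y differ; the position of
    # the most significant differing bit determines how many leading
    # bits match.
    diff = x ^ y
    m = nbits - diff.bit_length()   # HOBit similarity: leading matching bits
    dm = nbits - m                  # HOBit dissimilarity: N_bit - m
    return m, dm

# Example: x = 0b10110000 and y = 0b10100110 first differ at the 4th
# most significant bit, so hobit(x, y, 8) == (3, 5).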
In this section, we first introduce some definitions related to outlier detection. Then we propose an RDF-based outlier detection method, whose performance is enhanced significantly by means of the bitwise vertical data structure, P-trees, and its optimized logical operations.
From the density point of view, a point P is an outlier if its density is much lower than the densities of its neighbors. Based on this intuition, we propose some definitions related to outliers.
Definition 1 (neighborhood)
The neighborhood of a data point P with radius r is defined as the set Nbr(P, r) = {x ∈ X | |P − x| ≤ r}, where |P − x| is the distance between P and x. It is also called the r-neighborhood. The points in this neighborhood are called the neighbors of P, or direct r-neighbors of P. The number of neighbors of P is denoted N(Nbr(P, r)). Indirect neighbors of P are those points that lie within the r-neighborhood of the direct neighbors of P but are not themselves direct neighbors of P.
Definition 2 (density factor)
Given a data point P and a neighborhood radius r, the density factor (DF) of P, denoted DF(P, r), is a measurement of the local density around P:

DF(P, r) = N(Nbr(P, r)) / r^d,   (1)

where d is the number of dimensions. The neighborhood density factor of P, denoted DF_nbr(P, r), is the average density factor of the neighbors of P:

DF_nbr(P, r) = Σ_{i=1..N(Nbr(P,r))} DF(q_i, r) / N(Nbr(P, r)),

where the q_i are the neighbors of P, i = 1, 2, …, N(Nbr(P, r)). The relative density factor (RDF) of P, denoted RDF(P, r), is the ratio of the neighborhood density factor of P to its density factor (DF):

RDF(P, r) = DF_nbr(P, r) / DF(P, r).   (2)
RDF indicates the degree to which the density of the point P contrasts with those of its neighbors. We take RDF as an outlierness measurement, indicating the degree to which a point can be considered an outlier in view of the whole dataset.
Definition 3 (outliers)
Based on RDF, we define outliers as the subset of the dataset X with RDF > t, where t is an RDF threshold chosen per application. The outlier set is denoted Ols(X, t) = {x ∈ X | RDF(x, r) > t}.
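As a reference point for the definitions above, the following naive Python sketch computes RDF directly from Definitions 1-2 with an O(n²) scan, using the L∞ (maximum per-attribute difference) distance as an assumed concretization of |P − x|; this brute-force form is exactly what the P-tree version below avoids.

def rdf(points, p, r, d):
    # points: list of d-dimensional tuples; p: query point; r: radius.
    def nbr(q):
        # Direct r-neighbors of q (L-infinity distance, q included).
        return [x for x in points
                if max(abs(a - b) for a, b in zip(x, q)) <= r]
    def df(q):
        return len(nbr(q)) / r ** d          # DF(q, r), equation (1)
    neighbors = nbr(p)
    df_nbr = sum(df(q) for q in neighbors) / len(neighbors)
    return df_nbr / df(p)                    # RDF(p, r), equation (2)

# A point x is reported as an outlier when rdf(points, x, r, d) > t.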
Given a dataset X and an RDF threshold t, RDF-based outlier detection proceeds in two phases: a “zoom-out” procedure and a “zoom-in” procedure. The detection process starts with the “zoom-out” procedure, which calls the “zoom-in” procedure when necessary; the “zoom-in” procedure, in turn, calls the “zoom-out” procedure as appropriate.
“Zoom-out” process: The procedure starts with an arbitrary point P and a small neighborhood radius r, and calculates the RDF of the point. There are three possible local data distributions with regard to the value of RDF, shown in Figure 1, where α is a small value and β is a large value. In our experiments, we chose α < 0.3 and β > 12, which leads to a good balance between accuracy and pruning speed.
Figure 1. Three different local data distributions: (a) RDF = 1 ± α; (b) RDF ≤ 1/β; (c) RDF ≥ β
In case (a), neither the point P nor its direct and indirect neighbors are outliers; the local neighbors are distributed uniformly. The “zoom-in” procedure is called to quickly reach points located on the boundary, or outlier points.
In case (b), the point P is highly likely to be a center point of a cluster. We prune all neighbors of P while calculating the RDF of each of the indirect r-neighbors. If the RDF of such a point is larger than the threshold t, the point is inserted into the outlier set together with its RDF value.
In case (c), the RDF is large, so P is inserted into the outlier set. We then prune all the indirect neighbors of P.
“Zoom-in” process: The “zoom-in” procedure is a pruning process based on neighborhood expansion. We calculate DF and observe how its value changes. First, we increase the radius from r to 2r, compute DF(P, 2r), and compare it with DF(P, r). If DF(P, r) is close to DF(P, 2r), the whole 2r-neighborhood has uniform density, so we keep increasing the radius (e.g., doubling or quadrupling it) until a significant change is observed. If a significant decrease of DF is observed, the cluster boundary and potential outliers have been reached, and the “zoom-out” procedure is called to detect outliers at a finer scale. Figure 2 illustrates this case: all 4r-neighbors are pruned off, and the “zoom-out” procedure detects outliers over the points in the 4r-6r ring. If a significant increase of DF is observed, we pick a point with a high DF value, which is likely to lie in a denser cluster, and call the “zoom-in” procedure again to prune off all points in that dense cluster.
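The control flow of the “zoom-in” expansion can be summarized in a few lines. The sketch below is our reading of the procedure, with a factor-of-2 density change standing in for “significant change” (an assumption; the paper leaves the threshold open): it doubles the radius while the density stays roughly uniform and reports why the expansion stopped.

def zoom_in(df, r, r_max, change=2.0):
    # df(radius) -> DF(P, radius); it can be built from the df helper
    # in the earlier rdf() sketch. 'change' is an assumed threshold
    # for a "significant" density change.
    while 2 * r <= r_max:
        ratio = df(2 * r) / df(r)
        if ratio < 1.0 / change:
            return 2 * r, "boundary"   # significant decrease: hand over to "zoom-out"
        if ratio > change:
            return 2 * r, "denser"     # significant increase: zoom in on a high-DF point
        r *= 2                         # still uniform: keep expanding
    return r, "uniform"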
As a result, our method applies the full “zoom-out” outlier detection only to a small candidate set: the boundary points and the outliers. This subset is much smaller than the original dataset, which is where the performance gain of our algorithm comes from.
Both the “zoom-in” and “zoom-out” procedures are further improved by the P-tree data structure and its optimized logical operations. The speed improvement comes from three sources: a) P-trees allow the “zoom-in” process to be carried out on the fly using the HOBit metric; b) P-trees support very efficient neighborhood search through their logical operations; c) P-trees serve as self-indexes for the unprocessed dataset, the clustered dataset, and the outlier set, so pruning is executed efficiently by P-tree logical operations.
Figure 2. “Zoom-in” process followed by “zoom-out”
“Zoom-in” using the HOBit metric: Given a point P, we define the neighbors of P hierarchically, based on the HOBit dissimilarity between P and its neighbors; they are denoted ξ-neighbors. The ξ-neighbors are the neighbors within ξ bits of dissimilarity, where ξ = 1, 2, …, 8 if P is an 8-bit value. The basic calculations in the procedure are computing DF(P, ξ) for each ξ-neighborhood and pruning the neighborhood. HOBit dissimilarity is calculated by means of the P-tree AND. For any data point P, let P = b_{1,1} b_{1,2} … b_{n,m}, where b_{i,j} is the j-th bit value in the i-th attribute column of P. The attribute P-tree for P with ξ bits of HOBit dissimilarity on the i-th attribute is defined by

Pv_{i,ξ} = Pp_{i,1} ∧ Pp_{i,2} ∧ … ∧ Pp_{i,m−ξ},

where Pp_{i,j} is the P-tree of the j-th bit of the i-th attribute if b_{i,j} = 1, and its complement otherwise; i.e., the AND runs over the m − ξ most significant bit positions. The ξ-neighborhood P-tree for P over all n dimensions is then calculated by

PN_{P,ξ} = Pv_{1,ξ} ∧ Pv_{2,ξ} ∧ Pv_{3,ξ} ∧ … ∧ Pv_{n,ξ}.
The density factor DF(P, ξ) of the ξ-neighborhood is simply the root count of PN_{P,ξ} divided by r^d as in equation (1), with the radius r determined by ξ bits of dissimilarity. The neighborhood pruning is accomplished by

PU = PU ∧ PN'_{P,ξ},

where PU is a P-tree representing the unprocessed points of the dataset and PN'_{P,ξ} is the complement of PN_{P,ξ}.
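Over the uncompressed bitmaps of the earlier sketches, the ξ-neighborhood and the pruning step reduce to a handful of bitwise operations; a minimal illustration (helper names are ours):

def xi_neighborhood(attr_slices, p_attrs, nbits, xi, all_ones):
    # PN_{P,xi}: tuples that match P on the m - xi most significant
    # bits of every attribute. attr_slices[i] holds the MSB-first bit
    # files of attribute i; p_attrs[i] is P's value on attribute i.
    pn = all_ones
    for slices, pv in zip(attr_slices, p_attrs):
        for j in range(nbits - xi):                   # top m - xi positions
            bit = (pv >> (nbits - 1 - j)) & 1
            pn &= slices[j] if bit else (all_ones ^ slices[j])
    return pn

def prune(pu, pn, all_ones):
    # PU = PU AND PN': drop the whole neighborhood from the
    # unprocessed set in one logical operation.
    return pu & (all_ones ^ pn)

# The root count is a population count, giving N for the density
# factor: n = bin(pn).count("1")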
“Zoom-out” using inequality P-trees: In the “zoom-out” procedure, we use inequality P-trees to search the neighborhood, upon which the RDF is calculated. The direct neighborhood P-tree of a given point P within radius r, denoted PDN_{P,r}, is the P-tree representation of its direct neighbors. It is calculated as

PDN_{P,r} = P_{x>P−r} ∧ P_{x≤P+r}.

The root count of PDN_{P,r} is equal to N(Nbr(P, r)).
Accordingly, DF(P, r) and RDF(P, r) are calculated using equations (1) and (2), respectively.
Using P-tree AND operations, pruning is carried out as follows. If RDF(P, r) = 1 ± α, we prune the non-outlier points by PU = PU ∧ PDN'_{P,r} ∧ PIN'_{P,r}; if RDF ≤ 1/β, the dataset is pruned by PU = PU ∧ PDN'_{P,r}; if RDF ≥ β, the dataset is pruned by PU = PU ∧ PDN'_{P,r} ∧ PIN'_{P,r}. Here PIN_{P,r} denotes the P-tree of the indirect neighbors of P, and PDN'_{P,r} and PIN'_{P,r} are the complements of PDN_{P,r} and PIN_{P,r}.
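Combining the inequality sketches gives the direct-neighborhood bitmap and the three pruning cases. Again an uncompressed illustration with our own helper names, assuming integer-valued attributes; p_geq and p_less are the functions from the earlier sketch.

def direct_neighborhood(attr_slices, p_attrs, r, nbits, all_ones):
    # PDN_{P,r}: per attribute, x > p - r AND x <= p + r; attributes
    # are combined by AND.
    pdn = all_ones
    for slices, pv in zip(attr_slices, p_attrs):
        lo = max(pv - r + 1, 0)                 # x > p - r  <=>  x >= p - r + 1
        hi = min(pv + r, (1 << nbits) - 1)
        if hi == (1 << nbits) - 1:
            upper = all_ones                    # x <= max value: always true
        else:
            upper = p_less(slices, hi + 1, nbits, all_ones)   # x <= hi
        pdn &= p_geq(slices, lo, nbits, all_ones) & upper
    return pdn

# Case pruning, with pin the indirect-neighborhood bitmap:
#   RDF = 1 +/- alpha:  pu &= (all_ones ^ pdn) & (all_ones ^ pin)
#   RDF <= 1/beta:      pu &= (all_ones ^ pdn)
#   RDF >= beta:        pu &= (all_ones ^ pdn) & (all_ones ^ pin)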
4. Experiments
In this section, we experimentally compare our method (RDF) with two current approaches: LOF (local outlier factor) and aLOCI (approximate local correlation integral). LOF was the first density-based outlier detection approach; aLOCI is the fastest density-based approach to date. We compare the three methods in terms of run time and scalability with respect to data size, and show that our approach is efficient and highly scalable.
We ran the methods on a 1400-MHz AMD machine with 1 GB of main memory, running Debian Linux 4.0.
The datasets used are the National Hockey League (NHL, 1996) dataset and an NBA dataset. Due to space limitations, we only show the results on the NHL dataset in this paper; the results on the NBA dataset lead to the same conclusions in terms of speed and scalability.
Figure 3 shows that our method achieves an order of magnitude speed improvement over the aLOCI method. Figure 4 shows that our method is the most scalable of the three: as the data size grows large, it clearly outperforms the other two methods.
Run time comparison of LOF, aLOCI, and RDF (run time in seconds):

Data Size    LOF        aLOCI     RDF
256          0.23       0.17      0.58
1024         1.92       1.87      2.1
4096         38.79      35.81     8.34
16384        103.19     87.34     37.82
65536        1813.43    985.39    108.91

Figure 3. Run time comparison
Figure 4. Scalability comparison of LOF, aLOCI, and RDF (run time vs. data size)
In this paper, we proposed a density-based outlier detection method based on a novel local density measurement, RDF. The method can efficiently mine outliers over large datasets and scales well as the data size increases. A vertical data representation, P-trees, is used to speed up the process further. Our method was tested on NHL and NBA datasets. Experiments show that it achieves an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.
[1] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, NY, 1994.
[2] E. M. Knorr and R. T. Ng, “A Unified Notion of Outliers: Properties and Computation”, Proc. 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
[3] E. M. Knorr and R. T. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets”, Proc. 24th International Conference on Very Large Data Bases, 1998.
[4] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets”, Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000.
[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers”, Proc. ACM SIGMOD 2000 International Conference on Management of Data, TX, 2000.
[6] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, “LOCI: Fast Outlier Detection Using the Local Correlation Integral”, Proc. 19th International Conference on Data Engineering, Bangalore, India, 2003.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, 31(3):264-323, 1999.
[8] A. Arning, R. Agrawal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases”, Proc. 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
[9] S. Sarawagi, R. Agrawal, and N. Megiddo, “Discovery-Driven Exploration of OLAP Data Cubes”, Proc. EDBT 1998.
[10] Q. Ding, M. Khan, A. Roy, and W. Perrizo, “The P-tree Algebra”, Proc. ACM Symposium on Applied Computing (SAC), 2002.
[11] W. Perrizo, “Peano Count Tree Technology”, Technical Report NDSU-CSOR-TR-01-1, North Dakota State University, 2001.
[12] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, and W. Perrizo, “Efficient Density Clustering for Spatial Data”, Proc. PKDD 2003.