
RDF: A Density-based Outlier Detection Method using Vertical Data Representation

Dongmei Ren, Baoying Wang, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
dongmei.ren@ndsu.nodak.edu

Abstract

Outlier detection can lead to discovering unexpected and interesting knowledge, which is critically important in areas such as the monitoring of criminal activities in electronic commerce, credit card fraud detection, etc. In this paper, we develop an efficient density-based outlier detection method for large datasets. Our contributions are: a) we introduce a relative density factor (RDF); b) based on RDF, we propose an RDF-based outlier detection method which efficiently prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data; c) the performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method with NHL and NBA data; it shows an order of magnitude speed improvement compared to contemporary approaches.

1. Introduction

The problem of mining rare events, deviant objects, and exceptions is critically important in many domains, such as electronic commerce, networking, surveillance, and health monitoring, and outlier mining is drawing more and more attention. Current outlier mining approaches can be classified into five categories: statistic-based [1], distance-based [2][3][4], density-based [5][6], clustering-based [7], and deviation-based [8][9]. Density-based outlier detection approaches are attracting the most attention for KDD in large databases.

Breunig et al. proposed a density-based approach to mining outliers over datasets with different densities and arbitrary shapes [5]. Their notion of outliers is local, in the sense that the outlier degree of an object is determined by the clustering structure in a bounded neighborhood of the object. The method does not suffer from the local density problem, so it can mine outliers over non-uniformly distributed datasets. However, it needs three scans over the data, and the neighborhood searches are computationally expensive, which makes the method inefficient.

Another density-based approach was introduced by Papadimitriou & Kitagawa [6] using the local correlation integral (LOCI). This method selects a point as an outlier if its multi-granularity deviation factor (MDEF) deviates by more than three standard deviations from the MDEF values in its neighborhood. However, the cost of computing the standard deviation is high.

In this paper, we propose an efficient density-based outlier detection method using a vertical data model, P-trees*. We introduce a novel local density measurement, the relative density factor (RDF). RDF indicates the degree to which the density of a point P contrasts with those of its neighbors, and we take it as an outlierness measurement. Based on RDF, our method prunes the data points that are deep within clusters and detects outliers only within the remaining small subset of the data, which makes the method efficient. The performance of the algorithm is further enhanced by means of P-trees. Our method was tested on the NHL and NBA datasets. Experiments show that it achieves an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.

2. Review of P-trees

In previous work, we proposed a novel vertical data structure, the P-tree. In the P-tree approach, we decompose the attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-tree. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees remarkably compress the data and P-tree logical operations scale extremely well, this vertical data structure has the potential to address non-scalability with respect to data size. In this section, we briefly review the features of P-trees that will be used in this paper, including their optimized logical operations.

* Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

Given a data set with d attributes, X = (A_1, A_2, ..., A_d), and the binary representation of the j-th attribute A_j as b_{j.m} b_{j.m-1} ... b_{j.i} ... b_{j.1} b_{j.0}, we decompose each attribute into bit files, one file for each bit position [10]. Each bit file is converted into a P-tree. Logical AND, OR and NOT are the most frequently used P-tree operations; they facilitate efficient neighborhood search, pruning and computation of RDF.
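For concreteness, the following minimal Python sketch illustrates the vertical decomposition, using uncompressed integers as stand-ins for P-trees (an assumption made for brevity; actual P-trees are compressed quadrant trees, and the function name is ours, not part of the P-tree API):

    def to_bit_slices(column, n_bits):
        """Vertically decompose one attribute column into n_bits bit files.
        slices[i] packs bit i of every record into one integer (bit j of
        slices[i] is the i-th data bit of record j); slices[0] is the LSB file."""
        slices = [0] * n_bits
        for j, value in enumerate(column):
            for i in range(n_bits):
                if (value >> i) & 1:
                    slices[i] |= 1 << j
        return slices

On such bitmaps, AND, OR and NOT each become a single integer operation over all records at once, which is the horizontal processing of vertical data described above.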

Calculation of inequality P-trees P_{x≥v} and P_{x<v}: Let x be a data point within a data set X, represented as an m-bit value; let P_m, P_{m-1}, ..., P_0 be the P-trees for the vertical bit files of X, and P'_m, P'_{m-1}, ..., P'_0 be the P-trees for the complements of those bit files. Let v = b_m ... b_i ... b_0, where b_i is the i-th binary bit of v. Then

P_{x≥v} = P_m op_m ... P_i op_i P_{i-1} ... op_1 P_0,  i = 0, 1, ..., m,  (a)

P_{x<v} = P'_m op_m ... P'_i op_i P'_{i-1} ... op_{k+1} P'_k,  k ≤ i ≤ m.  (b)

In (a), op_i is ∧ (AND) if b_i = 1 and ∨ (OR) if b_i = 0; in (b), op_i is ∧ if b_i = 0 and ∨ if b_i = 1, and k is the rightmost bit position at which b_k = 1, where the sequence terminates. In both (a) and (b), the operators are right binding, i.e., they associate from right to left; e.g., P_2 op_2 P_1 op_1 P_0 is equivalent to (P_2 op_2 (P_1 op_1 P_0)).
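As an illustration, the sketch below evaluates formula (a) on the uncompressed bit slices introduced above, scanning from the least significant bit upward, which is equivalent to the right-binding evaluation (trailing 0-bits of v are vacuous ORs against the all-ones base); P_{x<v} is then the complement. This is a sketch under our uncompressed-bitmap assumption, not the compressed P-tree implementation:

    def p_x_ge_v(slices, v, n_records):
        """Predicate bitmap P_{x>=v}: AND in slice i where bit i of v is 1
        (need x_i = 1 and the lower bits to stay >=), OR where it is 0
        (x_i = 1 already wins, or the lower bits decide)."""
        ones = (1 << n_records) - 1
        acc = ones                         # x >= empty suffix: always true
        for i in range(len(slices)):       # least to most significant bit
            acc = (slices[i] & acc) if (v >> i) & 1 else (slices[i] | acc)
        return acc

    def p_x_lt_v(slices, v, n_records):
        """P_{x<v} is the complement of P_{x>=v} over the n_records records."""
        return ((1 << n_records) - 1) & ~p_x_ge_v(slices, v, n_records)

For example, with column values [3, 5, 7, 2] and v = 5, p_x_ge_v selects exactly the records holding 5 and 7.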

High Order Bit metric (HOBit): The HOBit metric [12] is a bitwise distance function. It measures distance based on the most significant consecutive matching bit positions, starting from the left. Assume A_i is an attribute of a tabular data set R(A_1, A_2, ..., A_n) whose values are represented as binary numbers, i.e., x = x(m) x(m-1) ... x(1) x(0) . x(-1) ... x(-n). Let X and Y be the A_i values of two tuples; the HOBit similarity between X and Y is defined by

m(X, Y) = max {i | x_i ⊕ y_i},

where x_i and y_i are the i-th bits of X and Y respectively, and ⊕ denotes the XOR (exclusive OR) operation. Correspondingly, the HOBit dissimilarity is defined by dm(X, Y) = N_bit − m, where N_bit is the number of bits of the attribute.
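Reading m as the number of most significant consecutive matching bits, dm(X, Y) = N_bit − m is simply the bit length of X ⊕ Y; a small sketch under that reading:

    def hobit_dissimilarity(x, y, n_bits=8):
        """dm(X, Y) = N_bit - m: 0 when all n_bits bits of x and y match,
        otherwise one plus the position of the highest differing bit."""
        diff = (x ^ y) & ((1 << n_bits) - 1)   # bitmap of differing positions
        return diff.bit_length()

For instance, with n_bits = 4, x = 0b1010 and y = 0b1000 share their two leading bits, so dm = 2.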

3. RDF-based Outlier Detection Using P-trees

In this section, we first introduce some definitions related to outlier detection, and then propose an RDF-based outlier detection method. The performance of the algorithm is enhanced significantly by means of the bitwise vertical data structure, P-trees, and its optimized logical operations.

3.1. Outlier Definitions

From the density point of view, a point P is an outlier if its density is much lower than the densities of its neighbors. Based on this intuition, we propose some definitions related to outliers.

Definition 1 (neighborhood)

The neighborhood of a data point P with radius r is defined as the set Nbr(P, r) = {x ∈ X | |P − x| ≤ r}, where |P − x| is the distance between P and x. It is also called the r-neighborhood. The points in this neighborhood are called neighbors of P, or direct r-neighbors of P. The number of neighbors of P is denoted N(Nbr(P, r)). Indirect neighbors of P are those points that are within the r-neighborhood of the direct neighbors of P but are not direct neighbors of P themselves.

Definition 2 (density factor)

Given a data point P and the neighborhood radius r, the density factor (DF) of P, denoted DF(P, r), is a measurement of the local density around P. With d the number of dimensions, it is defined as

DF(P, r) = N(Nbr(P, r)) / r^d.  (1)

The neighborhood density factor of the point P, denoted DF_nbr(P, r), is the average density factor of the neighbors of P:

DF_nbr(P, r) = [ Σ_{i=1}^{N(Nbr(P,r))} DF(q_i, r) ] / N(Nbr(P, r)),

where the q_i are the neighbors of P, i = 1, 2, ..., N(Nbr(P, r)). The relative density factor (RDF) of the point P, denoted RDF(P, r), is the ratio of the neighborhood density factor of P over its own density factor:

RDF(P, r) = DF_nbr(P, r) / DF(P, r).  (2)

RDF indicates the degree to which the density of the point P contrasts with those of its neighbors. We take RDF as an outlierness measurement; it indicates the degree to which a point can be an outlier in the view of the whole dataset.
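The following brute-force sketch computes equations (1) and (2) directly on horizontal data (NumPy is used purely for illustration; Section 3.2 computes the same quantities with P-tree operations instead):

    import numpy as np

    def density_factor(X, p, r):
        """DF(P, r) = N(Nbr(P, r)) / r^d, equation (1); X is an (n, d) array."""
        nbr = np.linalg.norm(X - p, axis=1) <= r       # direct r-neighbors of p
        return nbr.sum() / r ** X.shape[1], nbr

    def relative_density_factor(X, p, r):
        """RDF(P, r) = DF_nbr(P, r) / DF(P, r), equation (2)."""
        df_p, nbr = density_factor(X, p, r)
        df_nbr = np.mean([density_factor(X, q, r)[0] for q in X[nbr]])
        return df_nbr / df_p

A point p is then reported as an outlier when relative_density_factor(X, p, r) > t, per Definition 3 below.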

Definition 3 (outliers)

Based on RDF, we define the outliers as the subset of the dataset X with RDF > t, where t is an RDF threshold chosen per application. The outlier set is denoted Ols(X, t) = {x ∈ X | RDF(x) > t}.

3.2. RDF-based Outlier Detection with Pruning

Given a dataset X and an RDF threshold t, RDF-based outlier detection proceeds in two phases: a “zoom-out” procedure and a “zoom-in” procedure. The detection process starts with the “zoom-out” procedure, which calls the “zoom-in” procedure when necessary; conversely, the “zoom-in” procedure calls the “zoom-out” procedure as the situation requires.

“Zoom-out” process: The procedure starts with an arbitrary point P and a small neighborhood radius r, and calculates the RDF of the point. There are three possible local data distributions with regard to the value of RDF, shown in Figure 1, where α is a small value and β is a large value. In our experiments, we chose α < 0.3 and β > 12, which leads to a good balance between accuracy and pruning speed.

(a) RDF = 1 ± α (b) RDF ≤ 1/β (c) RDF ≥ β

Figure 1. Three different local data distributions

In case (a), neither the point P nor its direct and indirect neighbors are outliers; the local neighbors are distributed uniformly. The “zoom-in” procedure is called to quickly reach points located on a cluster boundary, or outlier points.

In case (b), the point P is highly likely to be a center point of a cluster. We prune all neighbors of P, while calculating the RDF of each of the indirect r-neighbors. If the RDF of such a point is larger than the threshold t, the point is inserted into the outlier set together with its RDF value.

In case (c), the RDF is large and P is inserted into the outlier set. We prune all the indirect neighbors of P.
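The case analysis can be summarized in the following sketch, where rdf, direct_neighbors, indirect_neighbors and zoom_in are assumed helpers with the semantics of Definitions 1-3 and the procedures of this section (the names are illustrative, not the paper's code; unprocessed is a set of points, outliers a dict from point to RDF value):

    def zoom_out_step(p, r, t, alpha, beta, unprocessed, outliers):
        """One 'zoom-out' step: dispatch on RDF(p, r) per cases (a)-(c)."""
        v = rdf(p, r)
        if abs(v - 1.0) <= alpha:                   # case (a): uniform region
            zoom_in(p, r)                           # expand toward a boundary
        elif v <= 1.0 / beta:                       # case (b): p is deep in a cluster
            unprocessed -= direct_neighbors(p, r)   # prune p's direct neighbors
            for q in indirect_neighbors(p, r):      # but test the surrounding ring
                if rdf(q, r) > t:
                    outliers[q] = rdf(q, r)
        elif v >= beta:                             # case (c): p is an outlier
            outliers[p] = v
            unprocessed -= indirect_neighbors(p, r)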

“Zoom-in” process: The “zoom-in” procedure is a pruning process based on neighborhood expansion. We calculate DF and observe how its value changes. First we increase the radius from r to 2r, compute DF(P, 2r) and compare it with DF(P, r). If DF(P, r) is close to DF(P, 2r), the whole 2r-neighborhood has uniform density, so we keep increasing the radius (e.g., doubling or quadrupling it) until a significant change is observed. When a significant decrease of DF is observed, a cluster boundary and potential outliers have been reached, and the “zoom-out” procedure is called to detect outliers at a fine scale. Figure 2 shows this case: all the 4r-neighbors are pruned off and the “zoom-out” procedure detects outliers over the points in the 4r-6r ring. When a significant increase of DF is observed, we pick a point with a high DF value, which is likely to be in a denser cluster, and call the “zoom-in” procedure again to prune off all points in that dense cluster.
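A sketch of that expansion loop, under the same illustrative helper names (df implements equation (1); the drop/rise thresholds and the 2r/3r ring are placeholders for the radius schedule described above, where the paper prunes the 4r-neighborhood and scans the 4r-6r ring):

    def zoom_in(p, r, drop=0.5, rise=2.0):
        """Expand p's neighborhood until DF changes significantly."""
        while True:
            ratio = df(p, 2 * r) / df(p, r)
            if ratio <= drop:                    # significant decrease: boundary
                prune_neighborhood(p, 2 * r)     # prune the inner neighborhood,
                zoom_out_ring(p, 2 * r, 3 * r)   # detect outliers in the outer ring
                return
            if ratio >= rise:                    # significant increase: follow a
                p = densest_neighbor(p, 2 * r)   # high-DF point into the denser
                continue                         # cluster and keep zooming in
            r *= 2                               # density still uniform: expand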

As we can see, our method detects outliers by running the “zoom-out” process only over a small candidate outlier set: the boundary points and the outliers. This subset is much smaller than the original dataset, and this is where the performance gain of our algorithm lies.

Both the “zoom-in” and “zoom-out” procedures are further improved by the P-tree data structure and its optimized logical operations. The speed improvement lies in: a) P-trees allow the “zoom-in” process to be carried out on the fly using the HOBit metric; b) P-trees support very efficient neighborhood search through their logical operations; c) P-trees serve as self-indexes for the unprocessed dataset, the clustered dataset and the outlier set, so pruning is executed efficiently by P-tree logical operations.

Figure 2. “Zoom-in” process followed by “zoom-out”

“Zoom-in” using the HOBit metric: Given a point P, we define the neighbors of P hierarchically based on the HOBit dissimilarity between P and its neighbors, and denote them ξ-neighbors: the neighbors within ξ bits of dissimilarity, where ξ = 1, 2, ..., 8 if P is an 8-bit value. The basic calculations in the procedure are computing DF(P, ξ) for each ξ-neighborhood and pruning the neighborhood. The HOBit dissimilarity is calculated by means of P-tree ANDs. For any data point P, let P = b_11 b_12 ... b_nm, where b_ij is the j-th bit value in the i-th attribute column of P. The attribute P-tree matching P on the i-th attribute up to ξ bits of HOBit dissimilarity is defined by

Pv_{i,m-ξ} = Pp_{i,1} ∧ Pp_{i,2} ∧ ... ∧ Pp_{i,m-ξ}.

The ξ-neighborhood P-tree for P over all dimensions is then calculated by

PN_{p,ξ} = Pv_{1,m-ξ} ∧ Pv_{2,m-ξ} ∧ Pv_{3,m-ξ} ∧ ... ∧ Pv_{n,m-ξ}.

The density factor DF(P, ξ) of the ξ-neighborhood is simply the root count of PN_{p,ξ} divided by r^d, where r is the radius corresponding to ξ bits of dissimilarity (cf. equation (1)).
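A sketch of PN_{p,ξ} on the uncompressed bit slices used earlier: a record is a ξ-neighbor of P when, in every attribute, its top m − ξ bits equal P's (function and variable names are ours):

    def xi_neighborhood(attr_slices, p, m, xi, n_records):
        """PN_{p,xi}: AND, over every attribute, of the bit files (or their
        complements) matching P's m - xi most significant bits."""
        ones = (1 << n_records) - 1
        pn = ones
        for slices, p_val in zip(attr_slices, p):   # one slice list per attribute
            for j in range(m - 1, xi - 1, -1):      # bit positions m-1 ... xi
                pn &= slices[j] if (p_val >> j) & 1 else (ones & ~slices[j])
        return pn

The pruning step described next is then a single mask update, e.g. pu &= ones & ~xi_neighborhood(...).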

The neighborhood pruning is accomplished by

PU = PU ∧ PN'_{p,ξ},

where PU is a P-tree representing the unprocessed points of the dataset and PN'_{p,ξ} is the complement of PN_{p,ξ}.

“Zoom-out” using inequality P-trees: In the “zoom-out” procedure, we use inequality P-trees to search for the neighborhood, upon which the RDF is calculated. The direct neighborhood P-tree of a given point P within radius r, denoted PDN_{p,r}, is the P-tree representation of its direct neighbors. It is calculated by

PDN_{p,r} = P_{x>p-r} ∧ P_{x≤p+r}.

The root count of PDN_{p,r} is equal to N(Nbr(p, r)). Accordingly, DF(P, r) and RDF(P, r) are calculated based on equations (1) and (2), respectively.
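Reusing the inequality predicate p_x_ge_v sketched in Section 2, the direct neighborhood bitmap for one attribute could be formed as follows (the strict/non-strict bound conversions assume non-negative integer attribute values):

    def direct_neighborhood(slices, p_val, r, n_records):
        """PDN_{p,r} = P_{x > p-r} AND P_{x <= p+r} for a single attribute."""
        ones = (1 << n_records) - 1
        gt = p_x_ge_v(slices, max(p_val - r + 1, 0), n_records)   # x > p-r
        le = ones & ~p_x_ge_v(slices, p_val + r + 1, n_records)   # x <= p+r
        return gt & le

Its population count gives N(Nbr(p, r)) directly, from which DF(P, r) and RDF(P, r) follow.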

Using P-tree AND operations, the pruning is carried out as follows. In case RDF(p, r) = 1 ± α, we prune the non-outlier points by PU = PU ∧ PDN'_{P,r} ∧ PIN'_{P,r}; in case RDF < 1/β, the dataset is pruned by PU = PU ∧ PDN'_{P,r}; in case RDF > β, the dataset is pruned by PU = PU ∧ PDN'_{P,r} ∧ PIN'_{P,r}, where PIN_{P,r} is the P-tree of the indirect neighbors of P, and PDN'_{P,r} and PIN'_{P,r} are the complements of PDN_{P,r} and PIN_{P,r}.

4. Experimental Study

In this section, we experimentally compare our method (RDF) with two current approaches: LOF (local outlier factor) and aLOCI (approximate local correlation integral). LOF was the first density-based outlier detection approach; aLOCI is the fastest density-based approach so far. We compare the three methods in terms of run time and scalability with respect to data size, and show that our approach is efficient and highly scalable.

We ran the methods on a 1400-MHz AMD machine with 1 GB main memory running Debian Linux 4.0. The datasets used are the National Hockey League (NHL, 1996) dataset and an NBA dataset. Due to space limitations, we only show the results on the NHL dataset in this paper; the results on the NBA dataset lead to the same conclusions in terms of speed and scalability.

Figure 3 shows that our method achieves an order of magnitude improvement in speed compared to the aLOCI method, and Figure 4 shows that our method is the most scalable of the three. Once the data size is large (e.g., 16,384 points), our method starts to outperform the other two.

Run time (s) by data size:

Data Size   LOF       aLOCI     RDF
256         0.23      0.17      0.58
1024        1.92      1.87      2.1
4096        38.79     35.81     8.34
16384       103.19    87.34     37.82
65536       1813.43   985.39    108.91

Figure 3. Run Time Comparison of LOF, aLOCI, RDF

[Line chart: run time (s) versus data size (256 to 65536) for LOF, aLOCI and RDF.]

Figure 4. Scalability Comparison of LOF, aLOCI, RDF

5. Conclusion

In this paper, we propose a density-based outlier detection method based on a novel local density measurement, the relative density factor (RDF). The method can efficiently mine outliers over large datasets and scales well as data size increases. A vertical data representation, P-trees, is used to further speed up the process. Our method was tested on the NHL and NBA datasets; experiments show that it achieves an order of magnitude speed improvement, with comparable accuracy, over the current state-of-the-art density-based outlier detection approaches.

6. References

[1] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, NY, 1994.

[2] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 1997, pp. 219-222.

[3] E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.

[4] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data, 2000, ISSN 0163-5808.

[5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, TX, 2000.

[6] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th Int. Conf. on Data Engineering, Bangalore, India, 2003.

[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.

[8] A. Arning, R. Agrawal, and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 164-169.

[9] S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", Proc. EDBT '98.

[10] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra", Proc. ACM Symposium on Applied Computing (SAC), 2002.

[11] W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.

[12] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, and W. Perrizo, "Efficient Density Clustering for Spatial Data", Proc. PKDD 2003.
