SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining

Taufik Abidin and William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105 USA
+1 701 231 6257
{taufik.abidin, william.perrizo}@ndsu.edu

ABSTRACT
K-nearest neighbors (KNN) is the simplest method for classification. Given a set of objects in a multi-dimensional feature space, the method assigns a category to an unclassified object based on the plurality of categories among its k nearest neighbors. The closeness between objects is determined using a distance measure, e.g., Euclidean distance. Despite its simplicity, KNN has some drawbacks: 1) it suffers from expensive computational cost in training when the training set contains millions of objects; 2) its classification time is linear in the size of the training set, so the larger the training set, the longer it takes to search for the k nearest neighbors. In this paper, we propose a new algorithm, called SMART-TV (SMall Absolute diffeRence of ToTal Variation), that approximates a set of candidate nearest neighbors by examining the absolute difference of total variation between each data object in the training set and the unclassified object. The k nearest neighbors are then searched within that candidate set. We empirically evaluate the performance of our algorithm on both real and synthetic datasets and find that SMART-TV is fast and scalable. The classification accuracy of SMART-TV is high and comparable to the accuracy of the traditional KNN algorithm.

General Terms
Algorithms

Keywords
K-Nearest Neighbors Classification, Vertical Data Structure, Vertical Set Squared Distance.

1. INTRODUCTION
Classification on large datasets has become one of the most important research priorities in data mining. Given a collection of labeled (pre-classified) data objects, the classification task is to label a newly encountered, unlabeled object with a pre-defined class label. Classification algorithms such as k-nearest neighbors have shown good performance on various datasets. However, when the training set is very large, i.e., contains millions of objects, a sequential search for the k nearest neighbors over the entire training set is not the best approach, as it increases the classification time significantly. A variety of new approaches have been developed to accelerate the classification process by improving the KNN search, such as [5][8][12], to mention a few.

In this paper, we propose a new algorithm that aims at expediting the k-nearest neighbor search. The algorithm derives from the basic structure of the KNN algorithm. However, unlike KNN, in which the k nearest neighbors are searched from the entire training set, in SMART-TV they are searched from a set of candidate nearest neighbors. The candidates are approximated by examining the absolute difference (gap) of total variation between the data objects in the training set and the unclassified object. A small gap implies that an object is possibly near the unclassified object.
A parameter hs (the number of data objects with a small gap that will be considered) is required by the algorithm and can be set by the user. The empirical results show that with a relatively small hs, a high classification accuracy can still be obtained.

The total variation of a set X about a point a, denoted TV(X, a), is computed using a fast, efficient, and scalable vertical set squared distance function [1]. The total variations of each class about each data point in that class are measured once in the preprocessing phase and stored for later use in classification. We store the total variations of the data points because, in classification problems, the sets of pre-classified data points are known and unchanged; thus, the total variations of all data points in those sets can be pre-computed.

Specifically, our contributions are:
1. We introduce a quick and scalable way to approximate the candidate nearest neighbors by transforming all data points in a multi-dimensional feature space into a single-dimensional total variation space and examining their absolute difference from the total variation of the unclassified point.
2. We propose a new nearest-neighbor-based classification algorithm that classifies much more quickly and scalably over large datasets than the traditional KNN algorithm.
3. We demonstrate that the vertical total variation gap is a potent means of expediting nearest-neighbor-based classification and is worthy of investigation when speed and scalability are the issues.

The performance evaluations on both real and synthetic datasets show that the speed of our algorithm outperforms that of the traditional KNN and P-tree-based KNN (P-KNN) algorithms. In addition, the classification accuracy of our algorithm is high and comparable to the accuracy of the KNN algorithm.

The remainder of the paper is organized as follows. In Section 2, we review related works. In Section 3, we present the vertical data structure and show the derivation of the vertical set squared distance function used to compute total variation vertically. In Section 4, we discuss our approach. In Section 5, we report the empirical results, and in Section 6, we summarize our conclusions.

2. RELATED WORKS
Classification based on k-nearest neighbors was first investigated by Cover and Hart [3]. It can be summarized as follows: search for the k nearest neighbors and assign a class label to the unclassified sample based on the plurality of categories among the neighbors. KNN is simple and robust. However, it has some drawbacks. First, KNN suffers from expensive computational cost in training. Second, its classification time is linear in the size of the training set; the larger the training set, the longer it takes to search for the k nearest neighbors, and thus the more time is needed for classification. Third, the classification accuracy is very sensitive to the choice of the parameter k, and in most cases the user has no intuition regarding this choice. An extension of KNN that addresses the sensitivity to k has been proposed [9]. The idea is to adjust k based on the local density of the training set. However, this approach only slightly improves the classification accuracy, and the classification time remains intensive when the training set is very large.
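For reference, the following is a minimal sketch of the linear-scan KNN baseline that the approaches discussed next try to accelerate. It is illustrative only, not the implementation used in the experiments; the struct and function names are ours.

```cpp
// Minimal linear-scan KNN classifier (illustrative only, not the paper's code).
// It performs O(n) distance computations per query, which is exactly the cost
// that SMART-TV, P-KNN, kd-trees, and BOND try to reduce.
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct LabeledPoint {
    std::vector<double> features;
    int label;
};

double squaredEuclidean(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double diff = x[i] - y[i];
        sum += diff * diff;
    }
    return sum;
}

int knnClassify(const std::vector<LabeledPoint>& training,
                const std::vector<double>& query, std::size_t k) {
    // Distance from the query to every training point (the linear scan).
    std::vector<std::pair<double, int>> distLabel;
    distLabel.reserve(training.size());
    for (const LabeledPoint& p : training)
        distLabel.emplace_back(squaredEuclidean(p.features, query), p.label);

    // Keep only the k smallest distances.
    k = std::min(k, distLabel.size());
    std::partial_sort(distLabel.begin(), distLabel.begin() + k, distLabel.end());

    // Plurality vote among the k nearest neighbors.
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[distLabel[i].second];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& a, const auto& b) { return a.second < b.second; })
        ->first;
}
```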
P-KNN [8] employs a vertical data structure to accelerate classification and uses the Higher Order Bit (HOB) metric as its distance measure. HOB removes one bit at a time, starting from the lowest significant bit position, to search for the nearest neighbors. Experiments showed that P-KNN is fast and accurate on spatial data. However, the squared ring of HOB cannot expand evenly on both sides when a bit is removed, which can consequently reduce the classification accuracy. Another improvement of KNN speeds up the search by replacing the linear scan with a kd-tree [5]. This approach reduces the search complexity from O(n) to O(log n); however, the O(log n) behavior is realized only when the data points are dense. BOND [12] improves the KNN search by projecting each dimension into a separate table. Each dimension is scanned to estimate the lower and upper bounds of the k nearest neighbors, and the data points that fall outside these bounds are discarded. This strategy reduces the number of candidate nearest neighbors significantly. Shrinking the space in which the k nearest neighbors are searched is also the main idea of the SMART-TV algorithm. However, SMART-TV estimates the candidate nearest neighbors by examining the absolute difference of total variation between the data points in the training set and the unclassified point.

3. PRELIMINARIES
3.1 Vertical Data Structure
The vertical data structure used in this work is also known as the P-tree [10]. (The P-tree is patented by NDSU, United States Patent 6,941,303.) It is created by converting a relational table R(A1, A2, ..., Ad) of horizontal records, normally the training set, into a set of P-trees by decomposing each attribute in the table into separate bit vectors, one for each bit position of the numeric values in that attribute, and storing them separately in a file. Each bit vector can then be kept as a 0-dimensional P-tree (an uncompressed vertical bit vector) or built into a 1-dimensional, 2-dimensional, or multi-dimensional P-tree in the form of a tree. The creation of a 1-dimensional P-tree, for example, is done by recursively dividing a bit vector into halves and recording the truth of the "purely 1-bits" predicate: a node value of "1" indicates that the bits at the subsequent level are all "1", while "0" indicates otherwise. In this work, we use 0-dimensional P-trees, i.e., the uncompressed vertical bit vectors. The logical operations AND (∧), OR (∨), and NOT (′) are the main operations on this data structure. The advantage is gained when performing aggregate operations such as the root count, which counts the 1-bits of a basic P-tree or of the P-tree resulting from any sequence of logical operations. Refer to [4] for more information about the P-tree vertical data structure and its operations.
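To make the 0-dimensional P-tree idea concrete, the sketch below decomposes a small numeric attribute into vertical bit slices and computes a root count by ANDing slices and counting the 1-bits. The names (BitSlice, verticalDecompose, rootCount) and the fully uncompressed representation are our simplifications for illustration; the actual P-tree implementation [4][10] also provides compressed multi-level trees.

```cpp
// Illustrative 0-dimensional P-trees: an attribute of width b bits becomes
// b vertical bit vectors; rootCount(AND of slices) counts matching rows.
// These names are ours, not the P-tree library's API.
#include <cstdint>
#include <iostream>
#include <vector>

using BitSlice = std::vector<std::uint8_t>;   // one bit per training point (uncompressed)

// Decompose one numeric attribute (b-bit values) into b vertical bit slices.
// slices[j][row] holds bit j (j = 0 is the least significant bit) of row's value.
std::vector<BitSlice> verticalDecompose(const std::vector<std::uint32_t>& column, int b) {
    std::vector<BitSlice> slices(b, BitSlice(column.size(), 0));
    for (std::size_t row = 0; row < column.size(); ++row)
        for (int j = 0; j < b; ++j)
            slices[j][row] = (column[row] >> j) & 1u;
    return slices;
}

// Root count of the AND of several slices: how many rows have a 1 in all of them.
std::size_t rootCount(const std::vector<const BitSlice*>& operands) {
    if (operands.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t row = 0; row < operands.front()->size(); ++row) {
        bool all = true;
        for (const BitSlice* s : operands) all = all && (*s)[row];
        count += all;
    }
    return count;
}

int main() {
    // A toy attribute with 3-bit values: 5 = 101, 3 = 011, 7 = 111, 1 = 001.
    std::vector<std::uint32_t> attribute = {5, 3, 7, 1};
    auto slices = verticalDecompose(attribute, 3);

    // rc(P2 AND P0): rows whose bit 2 and bit 0 are both 1, i.e. the values 5 and 7.
    std::cout << rootCount({&slices[2], &slices[0]}) << "\n";   // prints 2
    return 0;
}
```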
3.2 Vertical Set Squared Distance
Let x_i be a value from the numerical domain of attribute A_i of relation R(A_1, A_2, ..., A_d). Then x_i in b-bit binary representation is written as

x_i = x_{i(b-1)} x_{i(b-2)} \cdots x_{i0} = \sum_{j=0}^{b-1} 2^{j} x_{ij}

The first subscript refers to the attribute to which the value belongs, and the second subscript refers to the bit position; the summation on the right-hand side is the actual value of x_i in base 10. Let x be a vector in a d-dimensional space; then x in b-bit binary representation can be written as

x = \left( x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{d(b-1)} \cdots x_{d0} \right)

Let X be a set of vectors in a relation R, let x ∈ X, and let a be either a training point or an unclassified point. The total variation of X about a, TV(X, a), measures the cumulative squared separation of the vectors in X about a; it transforms the multi-dimensional feature space of a into a single-dimensional total variation space. The total variation can be measured efficiently and scalably using the vertical set squared distance, derived as follows:

TV(X, a) = (X - a) \circ (X - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2
         = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2
         = T_1 - 2 T_2 + T_3

Expanding each x_i in its binary representation and using x_{ij}^2 = x_{ij} (the bits are 0 or 1), the first term becomes

T_1 = \sum_{i=1}^{d} \left[ \sum_{j=0}^{b-1} 2^{2j} \sum_{x \in X} x_{ij} \;+\; 2 \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l} \sum_{x \in X} x_{ij} x_{il} \right]

Let P_X be the P-tree class mask of the set X, which can quickly mask all data points in X. The equation above can be simplified by replacing \sum_{x \in X} x_{ij} with rc(P_X \wedge P_{ij}) and \sum_{x \in X} x_{ij} x_{il} with rc(P_X \wedge P_{ij} \wedge P_{il}). Thus, we get

T_1 = \sum_{i=1}^{d} \left[ \sum_{j=0}^{b-1} 2^{2j}\, rc(P_X \wedge P_{ij}) \;+\; 2 \sum_{j=1}^{b-1} \sum_{l=0}^{j-1} 2^{j+l}\, rc(P_X \wedge P_{ij} \wedge P_{il}) \right]

Similarly,

T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{ij})

T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \sum_{i=1}^{d} a_i^2

Hence, the vertical set squared distance is defined as

TV(X, a) = (X - a) \circ (X - a) = T_1 - 2 T_2 + T_3

Note that the root count operations are independent of a. These are the root count of the P-tree class mask, rc(P_X); the two-operand root counts rc(P_X \wedge P_{ij}), 1 \le i \le d, 0 \le j \le b-1; and the three-operand root counts rc(P_X \wedge P_{ij} \wedge P_{il}), 1 \le i \le d, 1 \le j \le b-1, 0 \le l \le j-1. In classification problems, where the classes are predefined, the independence of the root count operations from the input vector a is an advantage: it allows us to run the root count operations once in advance and keep the resulting values. Later, when the total variation of a class X about a given point is measured, for example about an unclassified point, the root counts that have already been computed can simply be reused. This reusability expedites the computation of total variation significantly [1].
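As a concrete reading of the formulas above, the sketch below evaluates TV(X, a) = T1 − 2·T2 + T3 from precomputed root counts. How the root counts are stored (plain nested vectors here) is our assumption for illustration; the actual system generates them during preprocessing and keeps them in files [1].

```cpp
// Total variation TV(X, a) = T1 - 2*T2 + T3 from precomputed root counts.
// rcMask            = rc(P_X)                      (number of points in class X)
// rcPair[i][j]      = rc(P_X AND P_ij)             (1-bits of bit slice j of attribute i within X)
// rcTriple[i][j][l] = rc(P_X AND P_ij AND P_il)    (co-occurring 1-bits, l < j)
// These values depend only on the class, so they can be computed once in preprocessing.
#include <cstddef>
#include <vector>

double totalVariation(const std::vector<double>& a,          // query point, d attributes
                      std::size_t rcMask,
                      const std::vector<std::vector<std::size_t>>& rcPair,
                      const std::vector<std::vector<std::vector<std::size_t>>>& rcTriple,
                      int b)                                   // bit width of each attribute
{
    const std::size_t d = a.size();
    double t1 = 0.0, t2 = 0.0, t3 = 0.0;

    for (std::size_t i = 0; i < d; ++i) {
        // T1: squared attribute values expanded over bit slices.
        for (int j = 0; j < b; ++j) {
            t1 += static_cast<double>(1ull << (2 * j)) * rcPair[i][j];
            for (int l = 0; l < j; ++l)
                t1 += 2.0 * static_cast<double>(1ull << (j + l)) * rcTriple[i][j][l];
        }
        // T2: inner products of the class points with the query point a.
        double columnSum = 0.0;
        for (int j = 0; j < b; ++j)
            columnSum += static_cast<double>(1ull << j) * rcPair[i][j];
        t2 += a[i] * columnSum;

        // T3: |X| times the sum of a_i^2.
        t3 += a[i] * a[i];
    }
    t3 *= static_cast<double>(rcMask);

    return t1 - 2.0 * t2 + t3;
}
```

Because only T2 and T3 involve a, a new query point only requires the cheap inner loops above; no pass over the training points themselves is needed once the root counts exist.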
4. PROPOSED ALGORITHM
4.1 Total Variation Gap
The total variation gap g of two vectors x1 and x2 of a set X is defined as the absolute difference of TV(X, x1) and TV(X, x2), i.e., g = |TV(X, x1) − TV(X, x2)|. In the SMART-TV algorithm, we examine the gap between the total variation of each data point in the training set and the total variation of the unclassified point. The main goal is to approximate a number of points in each class that are possibly "near" the unclassified point. Although examining the gap cannot guarantee that all such points are actually close to the unclassified point, it can quickly approximate a superset of the nearest neighbors in each class. In fact, we can increase the chance of including more true nearest neighbors in the candidate sets by considering more data points with a small total variation gap. For this reason, a parameter hs is introduced, which specifies the number of points in each class that will be included in the candidate set. Our empirical results show that with a relatively small hs, i.e., 25 <= hs <= 50, a high classification accuracy can still be obtained.

4.2 SMART-TV Algorithm
The SMART-TV algorithm consists of two phases: preprocessing and classifying. The preprocessing steps are conducted only once, whereas the classifying steps are repeated for every unclassified point. The preprocessing steps are:
1) The computation of the root counts of the P-tree operands of each class Cj, 1 <= j <= number of classes. The complexity of this computation is O(kdb^2), where k is the number of classes, d is the number of dimensions, and b is the bit width.
2) The computation of TV(Cj, xi) for every xi in Cj, 1 <= j <= number of classes. The complexity of this computation is O(n), where n is the cardinality of the training set.
The root count and total variation values generated in steps 1 and 2 are stored in files and loaded during classification.

The classifying phase consists of four steps. The first step is the filtering step; the remaining steps are similar to the KNN algorithm, except that the k nearest neighbors are not searched from the original training set but from the candidate points filtered in the first step. The classifying steps are summarized as follows (a sketch of this phase is given after the list):
1) For each class Cj, 1 <= j <= number of classes:
   a) Compute TV(Cj, a), where a is the unclassified point.
   b) Find the hs points in Cj whose total variations have the smallest gaps to the total variation of a, i.e., maintain an array A of the hs points xp with the smallest |TV(Cj, xp) − TV(Cj, a)|, ordered by increasing gap.
   c) Store the IDs of the points in A into the array TVGapList.
2) For each point ID t, 1 <= t <= Len(TVGapList), where Len(TVGapList) equals hs times the number of classes, fetch the feature attributes of the point x_t and measure its Euclidean distance to a, d_2(x_t, a) = \sqrt{\sum_{i=1}^{d} (x_{ti} - a_i)^2}.
3) Find the k nearest points in the list and let them vote on a class for a.
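The sketch below shows one way the classifying phase could be assembled, assuming the per-class total variations TV(Cj, xi) from preprocessing are already loaded in memory and that a routine computing TV(Cj, a) is supplied by the caller. All names, as well as the use of partial sorting for the gap selection, are our illustrative choices rather than the paper's implementation.

```cpp
// Illustrative SMART-TV classifying phase: filter candidates by total
// variation gap per class, then run Euclidean KNN over the candidates only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct ClassModel {
    int label;                                 // class C_j
    std::vector<std::vector<double>> points;   // training points of C_j
    std::vector<double> tv;                    // precomputed TV(C_j, x_i) for each x_i
};

// tvOfClassAbout(c, a) must return TV(C_j, a), e.g. via the root-count routine
// sketched earlier; it is passed in because its inputs live in the vertical store.
int smartTvClassify(const std::vector<ClassModel>& classes,
                    const std::vector<double>& a,
                    std::size_t hs, std::size_t k,
                    const std::function<double(const ClassModel&,
                                                const std::vector<double>&)>& tvOfClassAbout)
{
    // Step 1 (filtering): per class, keep the hs points with the smallest TV gap to a.
    std::vector<std::pair<const std::vector<double>*, int>> candidates;
    for (const ClassModel& c : classes) {
        const double tvA = tvOfClassAbout(c, a);
        std::vector<std::pair<double, std::size_t>> gaps;   // (|TV(C_j,x_i) - TV(C_j,a)|, i)
        for (std::size_t i = 0; i < c.points.size(); ++i)
            gaps.emplace_back(std::fabs(c.tv[i] - tvA), i);
        const std::size_t take = std::min<std::size_t>(hs, gaps.size());
        std::partial_sort(gaps.begin(), gaps.begin() + take, gaps.end());
        for (std::size_t i = 0; i < take; ++i)
            candidates.emplace_back(&c.points[gaps[i].second], c.label);
    }

    // Steps 2-4: Euclidean distances to the (hs x #classes) candidates,
    // take the k nearest, and vote on the class label.
    std::vector<std::pair<double, int>> distLabel;
    for (const auto& [point, label] : candidates) {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const double diff = (*point)[i] - a[i];
            sum += diff * diff;
        }
        distLabel.emplace_back(std::sqrt(sum), label);
    }
    const std::size_t kk = std::min(k, distLabel.size());
    std::partial_sort(distLabel.begin(), distLabel.begin() + kk, distLabel.end());

    std::map<int, int> votes;
    for (std::size_t i = 0; i < kk; ++i) ++votes[distLabel[i].second];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& u, const auto& v) { return u.second < v.second; })
        ->first;
}
```

The candidate list holds at most hs times the number of classes entries, which is why the distance step stays cheap in the experiments even when the training set contains millions of points.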
5. EMPIRICAL RESULTS
We compared SMART-TV with the KNN and P-KNN algorithms. P-KNN has been developed in the DataMIME™ system [11]. We measured the classification accuracy using the F-score and used datasets of different cardinality to test scalability.

5.1 Datasets
We conducted the experiments on both real and synthetic datasets. The first dataset is the network intrusion dataset that was used in KDDCUP 1999 [6]. This dataset contains approximately 4.8 million records; we normalized both the training and testing sets. We selected six different classes, normal, ipsweep, neptune, portsweep, satan, and smurf, each of which contains at least 10,000 records. We discarded the categorical attributes, because our method only deals with numerical attributes, leaving 32 numerical attributes. To analyze the scalability of the algorithms with respect to dataset size, we generated four sub-sampled datasets of different cardinality from this dataset by selecting records randomly while maintaining the class distribution proportionally. The cardinalities of these sampled datasets are 10,000 (SS-I), 100,000 (SS-II), 1,000,000 (SS-III), and 2,000,000 (SS-IV). In addition, we randomly selected 120 records, 20 records for each class, for the testing set.

The second dataset is the OPTICS dataset [2]. This dataset was originally used for clustering tasks, so to make it suitable for our classification task, we carefully added a class label to each data point based on the original clusters found in the dataset. The class labels are CL-1, CL-2, CL-3, CL-4, CL-5, CL-6, CL-7, and CL-8. The OPTICS dataset contains 8,000 points in a 2-dimensional space. In this experiment, we took 7,920 points for the training set and randomly selected 80 points, 10 from each class, for the testing set.

The third dataset is the IRIS dataset [7]. This dataset contains three classes: iris-setosa, iris-versicolor, and iris-virginica. IRIS is the smallest dataset, containing only 150 records with 4 attributes. Thirty points were selected randomly for the testing set, and the remaining 120 points were used for the training set. We measured the classification accuracy of the algorithms on this dataset to see how well our algorithm classifies a dataset whose classes are not linearly separable; in the IRIS dataset, only the iris-setosa class is linearly separable from the others. We neither determined the scalability nor measured the speed of the algorithms on this dataset because it is very small.

5.2 Classification Accuracy Measurement
Different classification accuracy measurements have been introduced in the literature. In this work, we use the F-score, defined as

F = \frac{2 P R}{P + R}

where P is precision, the ratio of correct assignments of a class to the total number of points assigned to that class, and R is recall, the ratio of correct assignments of a class to the actual number of points in that class. The higher the score, the better the accuracy.
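For example, taking the first row of Table 2 below (20 testing points in the class, TP = 18 of them labeled correctly, and FP = 0 other points wrongly assigned to the class):

P = \frac{18}{18 + 0} = 1.00, \qquad R = \frac{18}{20} = 0.90, \qquad F = \frac{2 \times 1.00 \times 0.90}{1.00 + 0.90} \approx 0.95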
5.3 Experimental Setup and Comparison
The experiments were performed on an Intel Pentium 4 2.6 GHz machine with 3.8 GB of RAM, running Red Hat Linux with kernel 2.4.20-8smp. All algorithms were implemented in C++.

5.3.1 Speed and Scalability Comparison
We compared the performance of the algorithms in terms of speed and scalability using the KDDCUP datasets. As mentioned before, the cardinality (N) of these datasets varies from 10,000 to 4,891,469. We evaluated the running time of the algorithms using k = 5. Figure 1(a) compares the running times for varying dataset cardinality. We found that SMART-TV is much faster than KNN and P-KNN. For example, for the second largest dataset, N = 2,000,000, with hs = 25, SMART-TV takes only 3.88 seconds on average to classify, P-KNN requires about 12.44 seconds, and KNN takes about 49.28 seconds. It is clear that with increasing cardinality, KNN scales super-linearly.

Figure 1(b) shows the SMART-TV running time for varying hs on the sub-sampled dataset of size 1,000,000, again with k = 5. We found that the SMART-TV running time is linear in hs, but it increases only insignificantly. The higher the hs, the more points are in the candidate set, and thus the more candidate points SMART-TV must examine to find the k nearest neighbors.

Figure 1. (a) Running time (in seconds) of SMART-TV, P-KNN, and KNN against varying cardinality; (b) SMART-TV running time against varying hs on a dataset of 1,000,000 records.

We also found that the SMART-TV and P-KNN algorithms scale very well. They were able to classify the unclassified samples using the dataset of size 4,891,469. For this large dataset, SMART-TV takes about 9.27 seconds and P-KNN about 30.79 seconds on average to classify. Note that both algorithms use the vertical data structure as their underlying structure, which makes them scale well to very large datasets. In contrast, KNN uses a horizontal record-based data structure, and with the same resources it failed to load the 4,891,469 training points into memory and thus failed to classify. This demonstrates that KNN requires more resources to scale to a very large dataset.

5.3.2 Classification Accuracy Comparison
We also found that the overall classification accuracy of SMART-TV moves toward the classification accuracy of the KNN algorithm as hs increases (see Table 1). For example, when using hs = 50 on the 2,000,000-record dataset (SS-IV), the difference between the overall accuracy of KNN and SMART-TV is only 1%. This difference indicates that, within the 300 candidate points filtered by SMART-TV, most are the true nearest points. In fact, the classification accuracy of KNN and SMART-TV became exactly the same when we used hs = 100. Similarly, for the OPTICS dataset, SMART-TV successfully filtered the right candidate points already with hs = 25. Furthermore, when we used the IRIS dataset, which contains only 120 training points, and specified hs = 100 or hs = 125, we actually did not filter out any points: we took all data points in the dataset, because hs times the number of classes is greater than the number of points in the dataset. In this situation, SMART-TV is no different from the KNN algorithm, because the k nearest neighbors are searched from the entire training set.

Table 1. Classification accuracy of SMART-TV against varying hs compared to KNN. Both algorithms used k = 5. (NA: KNN could not load the full NI dataset into memory.)

Dataset                | hs=25 | hs=50 | hs=75 | hs=100 | hs=125 | KNN
Network Intrusion (NI) | 0.93  | 0.93  | 0.94  | 0.96   | 0.96   | NA
SS-I                   | 0.96  | 0.96  | 0.96  | 0.96   | 0.96   | 0.89
SS-II                  | 0.92  | 0.96  | 0.96  | 0.96   | 0.97   | 0.97
SS-III                 | 0.94  | 0.96  | 0.96  | 0.96   | 0.96   | 0.96
SS-IV                  | 0.92  | 0.96  | 0.96  | 0.97   | 0.97   | 0.97
OPTICS                 | 0.96  | 0.96  | 0.96  | 0.96   | 0.97   | 0.97
IRIS                   | 0.97  | 0.97  | 0.97  | 0.97   | 0.97   | 0.97

We examined the accuracy of the algorithms for different values of k. However, due to space limitations, we only show the result for k = 5 on the dataset of size 1,000,000 (SS-III) in Table 2; we specified hs = 25 for the SMART-TV algorithm.

Table 2. Classification accuracy comparison for the SS-III dataset.

Algorithm | Class     | TP | FP | P    | R    | F
SMART-TV  | normal    | 18 | 0  | 1.00 | 0.90 | 0.95
SMART-TV  | ipsweep   | 20 | 1  | 0.95 | 1.00 | 0.98
SMART-TV  | neptune   | 20 | 0  | 1.00 | 1.00 | 1.00
SMART-TV  | portsweep | 18 | 0  | 1.00 | 0.90 | 0.95
SMART-TV  | satan     | 17 | 2  | 0.90 | 0.85 | 0.87
SMART-TV  | smurf     | 20 | 4  | 0.83 | 1.00 | 0.91
P-KNN     | normal    | 20 | 4  | 0.83 | 1.00 | 0.91
P-KNN     | ipsweep   | 20 | 1  | 0.95 | 1.00 | 0.98
P-KNN     | neptune   | 15 | 0  | 1.00 | 0.75 | 0.86
P-KNN     | portsweep | 20 | 0  | 1.00 | 1.00 | 1.00
P-KNN     | satan     | 14 | 1  | 0.93 | 0.70 | 0.80
P-KNN     | smurf     | 20 | 5  | 0.80 | 1.00 | 0.89
KNN       | normal    | 20 | 3  | 0.87 | 1.00 | 0.93
KNN       | ipsweep   | 20 | 1  | 0.95 | 1.00 | 0.98
KNN       | neptune   | 20 | 0  | 1.00 | 1.00 | 1.00
KNN       | portsweep | 18 | 0  | 1.00 | 0.90 | 0.95
KNN       | satan     | 17 | 1  | 0.94 | 0.85 | 0.89
KNN       | smurf     | 20 | 0  | 1.00 | 1.00 | 1.00

In this observation, we found that the classification accuracy of SMART-TV is high and very comparable to the accuracy of KNN; most classes have F-scores above 90%. The same phenomenon is also found in the other datasets. Figure 2 compares the overall accuracy of the algorithms (the average F-scores) on all datasets. We were not able to show the comparison for KNN on the largest dataset (NI) because it terminated while loading the dataset into memory.

Figure 2. Classification accuracy comparison (average F-score of SMART-TV, P-KNN, and KNN on the IRIS, OPTICS, SS-I, SS-II, SS-III, SS-IV, and NI datasets).

From this observation, we conclude that with a proper hs, the examination of the absolute difference of total variation between the data points and the unclassified point can be used to approximate the candidate nearest neighbors. Thus, when speed is the issue, our approach is an efficient way to expedite the nearest neighbor search.
6. CONCLUSIONS
In this paper, we have presented a new classification algorithm that begins its classification by filtering a set of candidate nearest neighbors, obtained by examining the absolute difference of total variation between the data points in the training set and the unclassified point. The k nearest neighbors are subsequently searched from those candidates to determine the appropriate class label for the unclassified point. We have conducted extensive performance evaluations in terms of speed, scalability, and classification accuracy. We found that the speed of our algorithm outperforms that of the KNN and P-KNN algorithms: our algorithm takes less than 10 seconds to classify a new sample using a training set of more than 4.8 million records. In addition, our method scales very well and classifies with high accuracy. In future work, we will devise a strategy for choosing the hs value automatically, e.g., from the inherent features of the training set.

7. REFERENCES
[1] Abidin, T., et al. Vertical Set Squared Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets. Proceedings of the International Conference on Computers and Their Applications (CATA), 2005, 60-65.
[2] Ankerst, M., et al. OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the ACM SIGMOD, 1999, 49-60.
[3] Cover, T.M. and Hart, P.E. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13, 1967, 21-27.
[4] Ding, Q., Khan, M., Roy, A., and Perrizo, W. The P-tree Algebra. Proceedings of the ACM SAC, 2002, 426-431.
[5] Grother, P.J., Candela, G.T., and Blue, J.L. Fast Implementations of Nearest Neighbor Classifiers. Pattern Recognition, 30, 1997, 459-465.
[6] Hettich, S. and Bay, S.D. The UCI KDD Archive, http://kdd.ics.uci.edu. University of California, Irvine, Department of Information and Computer Science, 1999.
[7] Iris Dataset, http://www.ailab.si/orange/doc/datasets/iris.htm.
[8] Khan, M., Ding, Q., and Perrizo, W. K-Nearest Neighbor Classification of Spatial Data Streams using P-trees. Proceedings of the PAKDD, 2002, 517-528.
[9] Mitchell, H.B. and Schaefer, P.A. A Soft K-Nearest Neighbor Voting Scheme. International Journal of Intelligent Systems, 16, 2001, 459-469.
[10] Perrizo, W. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1, 2001.
[11] Serazi, M., et al. DataMIME™. Proceedings of the ACM SIGMOD, 2004, 923-924.
[12] de Vries, A.P., et al. Efficient k-NN Search on Vertically Decomposed Data. Proceedings of the ACM SIGMOD, 2002, 322-333.