Weighted EIN-ring Based Nearest Neighbor Classifier

Fei Pan, Baoying Wang, William Perrizo
{fei.pan, baoying.wang, william.perrizo}@ndsu.nodak.edu
Computer Science Department, North Dakota State University, Fargo, ND 58105
Tel: (701) 231-6403/6257   Fax: (701) 231-8255

Abstract

Nearest neighbor classification, an instance-based learning algorithm, has been shown to be very effective for a variety of problem domains. In this paper, we propose a novel Equal Interval Neighborhood ring (EIN-ring) to facilitate efficient neighborhood search, and we develop a weighted EIN-ring based nearest neighbor classifier (ENN). The calculation of EIN-rings is based on a data structure called the Peano Count Tree* (P-tree). In ENN, each ring is weighted according to its diameter and each attribute is weighted based on its information gain. ENN works well for data sets with an arbitrary number of dimensions and an arbitrary number of data points, and provides accurate classification results. We compare ENN with other nearest neighbor classifiers, namely KNN, VSM, and IB3. Experiments show that our method is superior to all of them with respect to dimensional scalability, cardinality scalability, and accuracy.

Keywords: Nearest neighbor classification, Equal Interval Neighborhood Ring, Peano Tree.

* Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.

1. Introduction

Nearest neighbor classification is an instance-based learning algorithm. Instance-based methods are sometimes referred to as "lazy" learning methods because they delay processing until a new instance must be classified. A key advantage of this kind of learning is that instead of estimating the target function once for the entire instance space, these methods can estimate it locally and differently for each new instance to be classified [15]. This makes them well suited to fast adaptation and to natural handling of the multi-class case. K-nearest neighbor classification has been shown to be very effective for a variety of problem domains [3]. Scott Cost and Steven Salzberg proposed a nearest neighbor algorithm, PEBLS, which produces highly accurate predictive models on domains in which feature values are symbolic [4]. VSM is another nearest neighbor classification algorithm that uses a variable interpolation kernel in combination with conjugate gradient optimization of the similarity metric and kernel size [5]. The power of instance-based methods has been demonstrated in a number of important real-world domains, such as prediction of cancer recurrence, diagnosis of heart disease, and classification of congressional voting records [6][7][8][9].

As pointed out by Breiman, nearest neighbor classification has several problems, such as expensive computation and intolerance of irrelevant attributes. The nearest neighbor algorithm requires large memory and does not work well when the number of distinguishing features is large. It is highly sensitive to the number of irrelevant attributes used to describe instances. Its storage requirements increase exponentially and its learning rate decreases exponentially with increasing dimensionality. Two major approaches have been proposed for efficient example retrieval in nearest neighbor classifiers: speeding up retrieval by using index structures, and speeding up retrieval by reducing storage.
Several studies have demonstrated that edited nearest neighbor algorithms, which save and use only selected instances to generate classification predictions, can reduce storage requirements with, at most, small losses in classification accuracy [11][12][13]. David W. Aha proposed a family of nearest neighbor algorithms, IBL, focused on reducing storage requirements and tolerating noisy instances, which achieves robustness and slightly faster learning rates than the C4.5 algorithm [14].

In this paper, we propose a novel neighborhood searching approach, the Equal Interval Neighborhood ring (EIN-ring), to facilitate efficient neighborhood search, and we develop a weighted EIN-ring based nearest neighbor classifier (ENN). The calculation of EIN-rings is based on a data structure called the Peano Tree (P-tree) [1][2]. The P-tree is a lossless, bitwise, quadrant-based tree. It recursively partitions a bSQ file into quadrants and each quadrant into sub-quadrants until each sub-quadrant is pure (entirely 1-bits or entirely 0-bits). In ENN, each ring is weighted according to its diameter and each attribute is weighted based on its information gain. ENN works well for data sets with an arbitrary number of dimensions and an arbitrary number of data points, and provides accurate classification results. We compare ENN with other nearest neighbor classifiers, namely KNN, VSM, and IB3. Experiments show that our method is superior to all of them with respect to dimensional scalability, cardinality scalability, and accuracy.

This paper is organized as follows. In Section 2, P-tree techniques are briefly reviewed. In Section 3, we propose Equal Interval Neighborhood Rings (EIN-rings). In Section 4, we introduce weighted nearest neighbor classification using EIN-rings. Finally, we compare our method with KNN, VSM, and IB3 experimentally in Section 5 and conclude the paper in Section 6.

2. Review of Peano Trees

A new tree structure, called the Peano Tree (P-tree), was developed to facilitate efficient data mining on spatial data. Suppose we have a spatial data set with d feature attributes, X = (x1, x2, ..., xd). Let the binary representation of the jth feature attribute of a spatial image pixel be b_{j,m} b_{j,m-1} ... b_{j,i} ... b_{j,1} b_{j,0}, where b_{j,i} is 1 or 0. We strip each feature attribute into several files, one file for each bit position. Such files are called bit Sequential files, or bSQ files. The P-tree is a lossless, bitwise, quadrant-based tree. It recursively partitions a bSQ file into quadrants and each quadrant into sub-quadrants until each sub-quadrant is pure (entirely 1-bits or entirely 0-bits). This recursive raster ordering is called Peano or Z-ordering in the literature, hence the name Peano tree. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, etc. One quadrant of a 2-dimensional P-tree with four sub-quadrants, quadrants 0, 1, 2 and 3, is shown in Figure 1.

Figure 1. Peano Ordering or Z Ordering: (a) 2-Dimension, (b) 3-Dimension

For a two-dimensional P-tree, the root contains the 1-bit count of the entire bSQ file. The next level of the tree contains the 1-bit counts of the four quadrants. At the third level, each quadrant is partitioned into four sub-quadrants, and this level contains the 1-bit counts of the sub-quadrants. This construction continues recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level.
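To make the construction concrete, the following is a minimal Python sketch of the recursive quadrant partitioning described above. It is an illustration rather than the authors' implementation; the function names, the nested-tuple node representation, and the quadrant ordering are assumptions made for this example.

    # Minimal sketch (not the paper's code): build a PM-tree-like nested
    # structure from a 2^k x 2^k bit array by recursive quadrant partitioning.
    # A node is 0 (pure-0), 1 (pure-1), or ('m', [q0, q1, q2, q3]) with the
    # four sub-quadrants listed in Peano (Z) order.

    def build_pm_tree(bits):
        """bits: square list of lists of 0/1 with side a power of two."""
        flat = [b for row in bits for b in row]
        if all(b == 1 for b in flat):
            return 1                         # pure-1 quadrant
        if all(b == 0 for b in flat):
            return 0                         # pure-0 quadrant
        h = len(bits) // 2
        quads = [
            [row[:h] for row in bits[:h]],   # quadrant 0: upper-left
            [row[h:] for row in bits[:h]],   # quadrant 1: upper-right
            [row[:h] for row in bits[h:]],   # quadrant 2: lower-left
            [row[h:] for row in bits[h:]],   # quadrant 3: lower-right
        ]
        return ('m', [build_pm_tree(q) for q in quads])

    def root_count(node, side):
        """1-bit count of the quadrant represented by `node` with side length
        `side`, i.e., the count stored at the corresponding P-tree node."""
        if node == 1:
            return side * side
        if node == 0:
            return 0
        return sum(root_count(q, side // 2) for q in node[1])

Applied to one bSQ bit plane of the 8x8 example in the next paragraph, root_count of the resulting tree (with side 8) would reproduce the 1-bit count stored at the P-tree root.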
We illustrate the construction of P-trees using the example shown in Figure 2. The spatial data is the red reflectance value of an 8x8 2-D image, shown at the top of Figure 2. We represent each reflectance value in binary, e.g., (7)_10 = (111)_2, and strip the values into 3 separate bSQ files, one file for each bit, as shown in the middle of Figure 2. The corresponding basic P-trees, P-1, P-2 and P-3, are constructed by recursive partitioning and are shown at the bottom of Figure 2.

Figure 2. Construction of 2-D Basic P-trees for 8x8 Image Data (top: red reflectance values of the 8x8 spatial image; middle: bSQ files bSQ-1, bSQ-2, bSQ-3; bottom: basic P-trees P-1, P-2, P-3)

In the P-1 tree of Figure 2, the root is 36, which is the 1-bit count of the entire bSQ-1 file. The second level of P-1 contains the 1-bit counts of the four quadrants: 16, 7, 13, and 0. Since quadrant 0 and quadrant 3 are pure, there is no need to partition them further; quadrants 1 and 2 are partitioned recursively.

AND, OR and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees, called Peano Mask trees (PM-trees). We define a sub-tree of a PM-tree to be pure-1 (or pure-0) if all the values in the sub-tree are 1's (or 0's); otherwise it is a mixed sub-tree. In PM-trees, a three-value logic, i.e., 0, 1, and m, is used to represent pure-0, pure-1 and mixed sub-trees, respectively. Figure 3 shows the PM-trees corresponding to the three basic P-trees, P-1, P-2 and P-3, in Figure 2.

Figure 3. PM-trees for 8x8 Image Data

The AND, OR and NOT operations are performed level by level, starting from the root level. They are commutative and distributive, similar to logical Boolean operations. For example, ANDing a pure-0 tree with any P-tree results in a pure-0 tree, and ORing a pure-1 tree with any P-tree results in a pure-1 tree. The rules are summarized in Table 1.

Table 1. P-tree AND, OR and NOT rules

    op1   op2   op1 AND op2   op1 OR op2   NOT op1
    0     0     0             0            1
    0     1     0             1            1
    0     m     0             m            1
    1     1     1             1            0
    1     m     m             1            0
    m     m     0 or m        1 or m       m

In Table 1, op1 and op2 are two P-trees or sub-trees. Parts a) and b) of Figure 4 show the results of AND and OR of PM-1 and PM-2 from Figure 3; part c) of Figure 4 is the result of NOT PM-3 from Figure 3.

Figure 4. Results of AND, OR and NOT Operations: a) PM-1 AND PM-2, b) PM-1 OR PM-2, c) NOT PM-3
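As an illustration of the Table 1 rules (again a sketch under the same assumed node representation as above, not the authors' code), the AND, OR and NOT operations can be applied level by level: pure-0 dominates AND, pure-1 dominates OR, and NOT swaps the pure values while recursing into mixed nodes.

    # Sketch of the Table 1 logic on the nested PM-tree representation used in
    # the earlier sketch: 0 = pure-0, 1 = pure-1, ('m', quads) = mixed.

    def collapse(quads):
        # Re-collapse a node whose four children all turned out pure and equal.
        if all(q == 1 for q in quads):
            return 1
        if all(q == 0 for q in quads):
            return 0
        return ('m', quads)

    def pm_and(a, b):
        if a == 0 or b == 0:          # pure-0 AND anything = pure-0
            return 0
        if a == 1:                    # pure-1 AND x = x
            return b
        if b == 1:
            return a
        return collapse([pm_and(x, y) for x, y in zip(a[1], b[1])])

    def pm_or(a, b):
        if a == 1 or b == 1:          # pure-1 OR anything = pure-1
            return 1
        if a == 0:                    # pure-0 OR x = x
            return b
        if b == 0:
            return a
        return collapse([pm_or(x, y) for x, y in zip(a[1], b[1])])

    def pm_not(a):
        if a in (0, 1):
            return 1 - a              # NOT swaps pure-0 and pure-1
        return ('m', [pm_not(q) for q in a[1]])

For instance, pm_and(build_pm_tree(bsq1), build_pm_tree(bsq2)) would correspond to part a) of Figure 4 if bsq1 and bsq2 held the bit planes bSQ-1 and bSQ-2.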
3. Equal Interval Neighborhood Rings

In this section, we present the novel Equal Interval Neighborhood ring (EIN-ring), which is used to facilitate efficient neighborhood search. We first define the neighborhood ring and the EIN-ring, and then give propositions on the calculation of EIN-rings using P-trees. The notation used in the rest of the paper is summarized below.

    Symbol     Definition
    X          Data set, X = {A1, A2, ..., An}, where n is the number of attributes
    m          Maximal bit length of the attributes
    r          Radius of an EIN-ring
    P_{i,j}    Basic P-tree for bit j of attribute i
    P'_{i,j}   Complement of P_{i,j}
    b_{i,j}    The jth bit of the ith attribute of x
    Px_{i,j}   Operator P-tree of the jth bit of the ith attribute of x
    Pv_{i,r}   Value P-tree within ring r
    Px,r       Tuple P-tree within ring r
    ∧          AND operator on P-trees
    ∨          OR operator on P-trees

Definition 1. The Neighborhood Ring of c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c−x| ≤ r2}, where |c−x| is the distance between x and c. Figure 5 shows a diagram of a 2-D neighborhood ring R(c, r1, r2) within a data set X.

Definition 2. The Equal Interval Neighborhood Ring of c with radii r1 and r2 (r2 > r1) is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c−x| ≤ r2}, with r2 − r1 = kδ, where k = 1, 2, ..., |c−x| is the distance between x and c, and δ is the interval. The interval is a user-defined parameter based on the accuracy requirement: the higher the required accuracy, the smaller the interval that should be chosen.

The calculation of an EIN-ring is implemented by means of range masks of P-trees, defined below.

Definition 3. The Range Mask P_{x θ y} is a P-tree that represents the data points of X satisfying x θ y, where y is a boundary value and θ is a relational operator. The root of the range mask P_{x θ y} is the count of data points in X that satisfy x θ y.

Figure 5. Diagram of Neighborhood Rings

Lemma 1 (Complement Rule of P-trees). Let P1 and P2 be basic P-trees and let P1' be the complement P-tree of P1. Then P1 ∨ (P1' ∧ P2) = P1 ∨ P2 holds.

Proof: By the distributive property of P-tree operations, P1 ∨ (P1' ∧ P2) = (P1 ∨ P1') ∧ (P1 ∨ P2) = True ∧ (P1 ∨ P2) = P1 ∨ P2.

Proposition 1. Let A be the jth attribute of data set X, m be the number of bits in the binary representation of A, and Pm, Pm−1, ..., P0 be the basic P-trees of A. Let the boundary value be c = bm...bi...b0, where bi is the ith binary bit of c, and let P_{A>c} be the Range Mask that satisfies the inequality A > c. Then

    P_{A>c} = Pm op_m ... Pi op_i P_{i−1} ... op_{k+1} Pk,   k ≤ i ≤ m,

where 1) op_i is ∧ if bi = 1 and op_i is ∨ otherwise, 2) k is the rightmost bit position with value "0", i.e., bk = 0 and bj = 1 for all j < k, and 3) the operators bind to the right.

Proof (by induction): Base case: without loss of generality, assume b0 = 1; we need to show that P_{A>c} = P1 op_1 P0 holds. If b1 = 1, the range mask that satisfies A ≥ (11)_2 is clearly P_{A>c} = P1 ∧ P0. If b1 = 0, the range mask that satisfies A > (01)_2 is P_{A>c} = P1 ∨ (P1' ∧ P0); by Lemma 1, P_{A>c} = P1 ∨ P0 holds.

Inductive step: assume P_{A>c} = Pn op_n ... Pk; we need to show that P_{A>c} = P_{n+1} op_{n+1} Pn op_n ... Pk holds. Let P_right = Pn op_n ... Pk. If b_{n+1} = 1, then clearly the range mask is P_{A>c} = P_{n+1} ∧ P_right. If b_{n+1} = 0, then P_{A>c} = P_{n+1} ∨ (P'_{n+1} ∧ P_right); by Lemma 1, P_{A>c} = P_{n+1} ∨ P_right holds.
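To make Proposition 1 concrete, the sketch below evaluates the rule on plain Python integers used as bit-vector stand-ins for basic P-trees (bit t of the stand-in for P_i is 1 exactly when bit i of the t-th data value is 1). The helper names and this flat representation are illustrative assumptions, not the P-tree implementation.

    # Sketch: Proposition 1 evaluated on bit-vector stand-ins for basic P-trees.
    # column(values, i) plays the role of P_i; AND/OR on P-trees become &/|.

    def column(values, i):
        mask = 0
        for t, v in enumerate(values):
            if (v >> i) & 1:
                mask |= 1 << t
        return mask

    def range_mask_gt(values, c, m):
        """Mask of positions t with values[t] > c, built as in Proposition 1:
        P_{A>c} = P_m op_m ... P_k, where op_i is AND if bit i of c is 1 and
        OR otherwise, k is the rightmost 0-bit of c below which all bits are 1,
        and the operators bind to the right."""
        k = 0
        while (c >> k) & 1:            # find k: lowest 0-bit of c
            k += 1
        result = column(values, k)     # start from P_k
        for i in range(k + 1, m + 1):  # fold in P_{k+1} ... P_m, right to left
            if (c >> i) & 1:
                result = column(values, i) & result
            else:
                result = column(values, i) | result
        return result

    # Tiny check against the definition for c = (101)_2 and 3-bit values:
    values = [0, 1, 2, 3, 4, 5, 6, 7]
    mask = range_mask_gt(values, 5, 2)
    assert mask == sum(1 << t for t, v in enumerate(values) if v > 5)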
Proposition 2. Let A be the jth attribute of data set X, m be the number of bits in the binary representation of A, and Pm, Pm−1, ..., P0 be the basic P-trees of A. Let the boundary value be c = bm...bi...b0, where bi is the ith binary bit of c, and let P_{A≤c} be the Range Mask that satisfies the inequality A ≤ c. Then

    P_{A≤c} = P'm op_m ... P'i op_i P'_{i−1} ... op_{k+1} P'k,   k ≤ i ≤ m,

where 1) op_i is ∧ if bi = 0 and op_i is ∨ otherwise, 2) k is the rightmost bit position with value "0", i.e., bk = 0 and bj = 1 for all j < k, and 3) the operators bind to the right.

Proof (by induction): Base case: without loss of generality, assume b0 = 0; we need to show that P_{A≤c} = P'1 op_1 P'0 holds. If b1 = 0, the range mask that satisfies A ≤ (00)_2 is clearly P_{A≤c} = P'1 ∧ P'0. If b1 = 1, the range mask that satisfies A ≤ (10)_2 is P_{A≤c} = P'1 ∨ (P1 ∧ P'0); by Lemma 1, P_{A≤c} = P'1 ∨ P'0 holds.

Inductive step: assume P_{A≤c} = P'n op_n ... P'k; we need to show that P_{A≤c} = P'_{n+1} op_{n+1} P'n op_n ... P'k holds. Let P_right = P'n op_n ... P'k. If b_{n+1} = 0, then clearly the range mask is P_{A≤c} = P'_{n+1} ∧ P_right. If b_{n+1} = 1, then P_{A≤c} = P'_{n+1} ∨ (P_{n+1} ∧ P_right); by Lemma 1, P_{A≤c} = P'_{n+1} ∨ P_right holds.

Theorem (Range Mask Complement Rule). Let A be the jth attribute of data set X, and let P_{A≤c} and P_{A>c} be the Range Masks that satisfy A ≤ c and A > c, respectively, where c is a boundary value. Then P_{A≤c} = P'_{A>c} holds.

Proof: This follows directly from the properties of P-tree operations.

4. Weighted K-NN Classification Using EIN-ring

Given a set of training data X, a k-nearest neighbor classifier predicts the class value for an unknown data point x by searching the training set for the k nearest neighbors of x and then assigning to x the most common class among those neighbors. One major problem is that classical K-NN uses all the features equally in voting. In this paper, we propose a new approach to overcome this problem. In this approach, we weight each EIN-ring inversely proportionally to its radius, so that the votes of farther neighbors count less than those of closer neighbors, and the EIN-ring interval of each attribute is proportional to its information gain. In this section, we first describe EIN-ring based nearest neighbor search using P-trees, and then develop the weighted EIN-ring based nearest neighbor classifier.

4.1. EIN-ring based nearest neighbor search

Given a data set X = (A1, A2, ..., Ad), where Aj is the jth attribute of X, a data point c = (a1, a2, ..., ad), and EIN-ring radii r = (r1, r2, ..., rd), we calculate the range mask P_{X ≤ c+r} as

    P_{X ≤ c+r} = P_{A1 ≤ a1+r1} ∧ P_{A2 ≤ a2+r2} ∧ ... ∧ P_{Ad ≤ ad+rd}    (1)

and the range mask P_{X > c−r} as

    P_{X > c−r} = P_{A1 > a1−r1} ∧ P_{A2 > a2−r2} ∧ ... ∧ P_{Ad > ad−rd}    (2)

The range mask for X within the EIN-ring R(x, 0, r) is then calculated by

    P_{c−r < X ≤ c+r} = P_{X > c−r} ∧ P_{X ≤ c+r}    (3)

Here P_{c−r < X ≤ c+r} is a P-tree that represents the nearest neighbors within the EIN-ring R(x, 0, r). The algorithm for searching the nearest neighbors is shown in Figure 6.

    Algorithm: Calculation of the nearest neighbors
    Input: P-tree set P_{j,i} for bit i of attribute j of data set X
    Output: Range Mask Px,r
    // n - number of attributes, m - number of bits in each attribute,
    // b_{j,i} - bit i of element j of vector r
    FOR j = 1 TO n DO
        // set k where b_k = 1 and b_g = 0 for all g < k
        k ← 0
        FOR i = 0 TO m DO
            IF b_{j,i} = 0 THEN k ← k + 1
            ELSE break
        END FOR
        FOR i = k TO m DO
            IF b_{j,i} = 1 THEN Px,r ← Px,r ∧ P_{j,i}
            ELSE Px,r ← Px,r ∨ P_{j,i}
        END FOR
    END FOR

    Figure 6. Algorithm for calculation of the nearest neighbors
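Continuing the bit-vector stand-in from the previous sketch (an illustration only, assuming non-negative integer attribute values), equations (1)-(3) amount to ANDing per-attribute range masks for the upper bounds Aj ≤ aj + rj and the lower bounds Aj > aj − rj. The ≤ mask is obtained from the > mask through the Range Mask Complement Rule; range_mask_gt is the Proposition 1 sketch above, and the remaining names are assumptions.

    # Sketch of equations (1)-(3): the EIN-ring mask for R(x, 0, r) is the AND
    # over attributes of (A_j <= a_j + r_j) and (A_j > a_j - r_j) masks.

    def range_mask_le(values, c, m, n_points):
        full = (1 << n_points) - 1
        if c >= (1 << (m + 1)) - 1:           # c at or above the largest value
            return full
        return full & ~range_mask_gt(values, c, m)   # Range Mask Complement Rule

    def ein_ring_mask(data, center, radii, m):
        """data: list of integer tuples (one per point), center: query point c,
        radii: per-attribute radii r, m: highest bit position of the attributes.
        Returns a bit mask of the points x with c - r < x <= c + r."""
        n_points = len(data)
        full = (1 << n_points) - 1
        mask = full                                   # start with all points
        for j in range(len(center)):
            col = [row[j] for row in data]            # attribute A_j
            upper = range_mask_le(col, center[j] + radii[j], m, n_points)  # eq. (1)
            lo = center[j] - radii[j]
            lower = full if lo < 0 else range_mask_gt(col, lo, m)          # eq. (2)
            mask &= upper & lower                                          # eq. (3)
        return mask

In this stand-in, the root count of a mask is simply its number of 1-bits, e.g., bin(mask).count('1').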
4.2. Weighted EIN-ring Based Nearest Neighbor Classification

Weighted EIN-ring based nearest neighbor classification has two steps: 1) search the training set for the nearest neighbors of a given data point x within the EIN-ring R(x, 0, r) by calculating range masks; 2) assign to x the class label with the maximum P-tree root count among its neighbors. The P-tree root counts are calculated by ANDing the range masks with the class label P-trees.

The class P-tree, PCi, is built for class i within the data set X. A "1" value in PCi indicates that the corresponding data point has class label i; a "0" value indicates that it does not. Let r = (r1, r2, ..., rj, ..., rd) be the radii of the EIN-ring R(x, 0, r). According to the definition of the EIN-ring, rj = kδj, where k = 1, 2, ... and δj is the interval along the jth dimension. The interval δj is proportional to the information gain of the jth attribute. We calculate the information gain of the jth attribute as

    Gain(Aj) = − Σ_{i=1}^{n} p_i log p_i + Σ_{j=1}^{d} q_j ( Σ_{i=1}^{n} p_{ij} log p_{ij} ),

where p_i = s_i/s, p_{ij} = s_{ij}/s_j, and q_j = (s_{1j} + ... + s_{nj})/s. Here s is the total number of data points in X, s_i is the number of data points in class i, s_j is the number of data points in the jth partition of X by the fixed interval δ, and s_{ij} is the number of data points of class i in the jth partition of X. Hence we obtain δj as

    δj = Gain(Aj) / Σ_{i=1}^{d} Gain(Ai).

By substituting the adjusted intervals into equations (1), (2) and (3), we obtain the weighted range mask P_{c−r < X ≤ c+r}. Next, we find the neighbors within the EIN-ring R(x, 0, r) for each class i by P-tree ANDing:

    PN_i = P_{c−r < X ≤ c+r} ∧ PC_i,

where PN_i is a P-tree that represents the data points with class label i within the EIN-ring R(x, 0, r).

In nearest neighbor voting for a data point x, the farther a neighbor is, the less influence it should have on x. It is therefore reasonable to assign different weights to different neighbors, namely to weight the votes of farther neighbors less than those of closer neighbors. Our approach is to assign different weights to the neighbors within each EIN-ring based on kernel functions, such as the Gaussian function, RBF functions, step functions, etc. The weighted root count wrc_i(x) of x over the EIN-rings R(x, r_j, r_{j+1}) is calculated as

    wrc_i(x) = Σ_{j=1}^{k} w_j ( RC(PN_{j+1,i}) − RC(PN_{j,i}) ),

where RC denotes the root count of a P-tree. The class i that yields the maximum weighted root count wrc_i(x) is assigned as the class label of x. The algorithm is given in Figure 7.

    Algorithm: ENN classification
    Input: NC_i(x, r1, r2), the neighbor count of class i within R(x, r1, r2)
    Output: class label of x
    // k is the number of different classes
    // wrc[i] is the weighted root count for class i
    class ← 0
    FOR i = 1 TO k−1 DO
        wrc[i] ← 0
        FOR j = 0 TO m DO
            wrc[i] ← wrc[i] + w_j * NC_i(x, j, j+1)
        END FOR
    END FOR
    class ← argmax_i wrc[i]

    Figure 7. Algorithm of Weighted Voting
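Putting the pieces of Section 4.2 together, the following sketch computes the weighted root counts wrc_i(x) over successive EIN-rings and returns the class with the largest total. It reuses ein_ring_mask from the previous sketch, treats each ring as a box of radius j*δj per attribute with integer intervals, and uses w_j = 1/j as one illustrative kernel choice; these choices and the function names are assumptions for illustration, not the paper's implementation.

    # Sketch of the weighted EIN-ring vote: neighbors in ring j (between radii
    # (j-1)*delta and j*delta) are counted per class as a difference of root
    # counts, weighted by w_j; the class with the largest total wins.

    def bit_count(mask):
        return bin(mask).count("1")          # root count of a bit-vector stand-in

    def enn_classify(data, labels, x, deltas, m, n_rings):
        """data: integer-valued training tuples, labels: their class labels,
        x: query point, deltas: integer interval per attribute, m: highest bit
        position, n_rings: number of EIN-rings to examine."""
        classes = sorted(set(labels))
        # class P-tree stand-ins: PC_i has a 1 exactly where the label is i
        pc = {c: sum(1 << t for t, lab in enumerate(labels) if lab == c)
              for c in classes}
        wrc = {c: 0.0 for c in classes}      # weighted root count per class
        prev = {c: 0 for c in classes}       # neighbors inside radius (j-1)*delta
        for j in range(1, n_rings + 1):
            radii = [j * d for d in deltas]              # r_j = j * delta_j
            ring = ein_ring_mask(data, x, radii, m)      # P_{c-r < X <= c+r}
            w = 1.0 / j                                  # illustrative weight w_j
            for c in classes:
                inside = bit_count(ring & pc[c])         # RC(PN_{j,i})
                wrc[c] += w * (inside - prev[c])         # new neighbors in ring j
                prev[c] = inside
        return max(classes, key=lambda c: wrc[c])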
5. Experiment Evaluation

6. Conclusion

In this paper, we proposed a novel neighborhood searching approach, the Equal Interval Neighborhood ring (EIN-ring), to facilitate efficient neighborhood search, and we developed a weighted EIN-ring based nearest neighbor classifier (ENN). The calculation of EIN-rings is based on a data structure called the Peano Tree (P-tree). In ENN, each ring is weighted according to its diameter and each attribute is weighted based on its information gain. ENN works well for data sets with an arbitrary number of dimensions and an arbitrary number of data points, and provides accurate classification results. We compared ENN with other nearest neighbor classifiers, namely KNN, VSM, and IB3. Experiments show that our method is superior to all of them with respect to dimensional scalability, cardinality scalability, and accuracy.

Our method is particularly useful for data streams, such as large sets of transactions, remotely sensed images, and multimedia video, where new data keeps arriving continuously and both speed and accuracy are critical. Achieving high speed through the EIN-ring and high accuracy through the weighted intervals along dimensions yields an efficient lazy classifier that is well suited to the classification of stream data. Our method also has potential applications in other areas, such as DNA microarray and medical image analysis.

References

1. Perrizo, W. (2001). Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
2. Khan, M., Ding, Q., & Perrizo, W. (2002). k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees. PAKDD 2002, Springer-Verlag, LNAI 2336, 517-528.
3. Duda, R.O., & Hart, P.E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.
4. Cost, S., & Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10(1), 57-78.
5. Lowe, D.G. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 72-85, January 1995.
6. Aha, D., & Kibler, D. (1989). Noise-tolerant instance-based learning algorithms. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 794-799. Detroit, MI: Morgan Kaufmann.
7. Salzberg, S. (1989). Nested hyper-rectangles for exemplar-based learning. In K.P. Jantke (Ed.), Analogical and Inductive Inference: International Workshop AII'89, 184-201. Berlin: Springer-Verlag.
8. Jabbour, K., Riveros, J.F.V., Landsbergen, D., & Meyer, W. (1987). ALFA: Automated load forecasting assistant. Proceedings of the 1987 IEEE Power Engineering Society Summer Meeting. San Francisco, CA.
9. Clark, P.E., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261-284.
10. Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, CA.
11. Hart, P.E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14, 515-516.
12. Gates, G.W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 431-433.
13. Dasarathy, B.V. (1980). Nosing around the neighborhood: A new system structure and classification rule for recognition in partially exposed environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 67-71.
14. Aha, D., Kibler, D., & Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37-66.
15. Mitchell, T. (1997). Machine Learning. McGraw-Hill.
16. Dasarathy, B.V. (Ed.). NN concepts and techniques. In Nearest Neighbour (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press.
17. TIFF image data sets. Available at http://midas-10cs.ndsu.nodak.edu/data/images/.