An Optimized Approach for KNN Text Categorization using P-trees

Imad Rahal and William Perrizo
Computer Science Department, North Dakota State University
IACC 258, Fargo, ND, USA
001-701-231-7248
{imad.rahal, william.perrizo}@ndsu.nodak.edu

ABSTRACT
The importance of text mining stems from the availability of huge volumes of text databases holding a wealth of valuable information that needs to be mined. Text categorization is the process of assigning categories or labels to documents based entirely on their contents. Formally, it can be viewed as a mapping from the document space into a set of predefined class labels (aka subjects or categories): F: D → {C1, C2, ..., Cn}, where F is the mapping function, D is the document space, and {C1, C2, ..., Cn} is the set of class labels. Given an unlabeled document d, we need to find its class label Ci using the mapping function F, where F(d) = Ci. In this paper, we propose an optimized k-Nearest Neighbors (KNN) classifier that uses intervalization and the P-tree technology to achieve a high degree of accuracy, space utilization, and time efficiency: as new samples arrive, the classifier finds the k nearest neighbors of each new sample in the training space without a single database scan.

Categories and Subject Descriptors
I.5.4 [Pattern Recognition]: Applications – Text Processing. I.5.2 [Pattern Recognition]: Design Methodologies – Classifier design and evaluation. E.1 [Data Structures]: Trees.

General Terms
Algorithms, Management, Performance, Design.

Keywords
Text categorization, text classification, P-trees, intervalization, k-Nearest Neighbor.

1 Patents are pending for the P-tree technology. This work was partially supported by GSA grant ACT# K96130308.

1. INTRODUCTION
Nowadays, a great deal of the literature in most domains is available in text format. Document collections (aka text or document databases in the literature) are usually characterized by being very dynamic in size. They contain documents from various sources such as news articles, research publications, digital libraries, emails, and web pages. Perhaps the worldwide advent of the Internet is one of the main reasons for the rapid growth in the sizes of those collections.

In the term space model [6][7], a document is represented as a vector in the term space, where terms are used as features or dimensions. The data structure resulting from representing all the documents in a given collection as term vectors is referred to as a document-by-term matrix. Given that the term space has thousands of dimensions, most current text-mining algorithms fail to scale up. This very high dimensionality of the term space is an idiosyncrasy of text mining and must be addressed carefully in any text-mining application.

Within the term space model, many different representations exist. On one extreme, there is the binary representation, in which a document is represented as a binary vector where a 1 bit in slot i implies the presence of the corresponding term ti in the document, and a 0 bit implies its absence.
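To make the two extremes of the term space model concrete, the following minimal Python sketch (not part of the original paper; the toy documents, tokenization, and variable names are illustrative assumptions) builds a binary document-by-term matrix and a plain term-frequency matrix for a tiny collection:

from collections import Counter

# Toy collection; real collections contain thousands of documents and terms.
docs = {
    "d1": "oil prices rise as crude supply falls",
    "d2": "corn and wheat futures rise",
    "d3": "crude oil futures fall",
}

# Build the term space (the set of dimensions).
terms = sorted({t for text in docs.values() for t in text.split()})

# Binary representation: 1 if the term occurs in the document, else 0.
binary_matrix = {
    d: [1 if t in text.split() else 0 for t in terms]
    for d, text in docs.items()
}

# Frequency representation: raw term frequency (TF) per document.
tf_matrix = {
    d: [Counter(text.split())[t] for t in terms]
    for d, text in docs.items()
}

print(terms)
print(binary_matrix["d1"])
print(tf_matrix["d1"])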
This model is fast and efficient to implement but clearly lacks the needed degree of accuracy because most of the semantics are lost. On the other extreme, there is the frequency representation, where a document is represented as a frequency vector [6][7]. Many types of frequency measures exist: term frequency (TF), term frequency by inverse document frequency (TFxIDF), normalized TF, and the like. This representation is obviously more accurate than the binary one but is not as easy and efficient to implement. Text preprocessing such as stemming, case folding, and stop lists can be applied prior to the weighting process for efficiency purposes.

In this paper, we present a new model for representing text data based on the idea of intervalizing (aka discretizing) the data into a set of predefined intervals, and we propose an optimized KNN algorithm that exploits this model. Our algorithm is characterized by accuracy and by space and time efficiency because it is based on the P-tree technology.

The rest of this paper is organized as follows. In Section 2, an introduction to the P-tree technology is given. Section 3 discusses the data management aspects required for applying the text categorization algorithm which, in turn, is discussed in Section 4. Section 5 gives a performance analysis study. Finally, in Section 6, we conclude this paper by highlighting the achievements of our work and pointing out future directions in this area.

2. THE P-TREE TECHNOLOGY
The basic data structure exploited in the P-tree technology [1] is the Predicate Count Tree (PC-tree), formerly known as the Peano Count Tree, or simply the P-tree. Formally, P-trees are tree-like data structures that store numeric relational data (i.e., numeric data in relational format) in column-wise, bit-compressed format by splitting each attribute into bits (i.e., representing each attribute value by its binary equivalent), grouping together all bits in each bit position for all tuples, and representing each bit group by a P-tree. P-trees provide a lot of information and are structured to facilitate data mining processes.

After representing each numeric attribute value by its bit representation, we store all bits for each position separately. In other words, we group together all the bit values at bit position x of attribute A for all tuples t in the table. Figure 1 shows a relational table made up of three attributes and four tuples transformed from numeric to binary, and highlights the bits in the first three bit groups of Attribute 1; each of those bit groups will form a P-tree. Since each attribute value in our table is made up of 8 bits, 24 bit groups are generated in total, with each attribute generating 8 bit groups.

Figure 2 shows a group of 16 bits transformed into a P-tree after being divided into quadrants (i.e., subgroups of 4 bits). Each such tree is called a basic P-tree. In the lower part of Figure 2, 7 (the root count) is the total number of "1" bits in the whole bit group shown in the upper part; 4, 2, 1 and 0 are the numbers of 1's in the 1st, 2nd, 3rd and 4th quadrants of the bit group, respectively. Since the first quadrant is made up entirely of "1" bits (we call it a pure-1 quadrant), no sub-tree is needed for it (this is the node denoted by 4 on the second level of the tree). Similarly, quadrants made up entirely of "0" bits (such as the node denoted by 0 on the second level of the tree) are called pure-0 quadrants and have no sub-trees. As a matter of fact, this is how compression is achieved [1]; it is worth noting that good P-tree compression is achieved when the data is very sparse (which increases the chances of long sequences of "0" bits) or very dense (which increases the chances of long sequences of "1" bits).
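As a rough illustration of the construction just described, the following Python sketch builds a basic quadrant-count tree from a bit group. It follows only the description above (a node stores the count of "1" bits in its quadrant; pure-1 and pure-0 quadrants get no children); the node layout, the function name, and the example bit group (chosen to be consistent with the counts quoted from Figure 2, although the exact bits in the figure may differ) are assumptions.

def build_ptree(bits):
    """bits: a list of 0/1 values whose length is a power of 4."""
    count = sum(bits)
    node = {"count": count, "children": None}
    if count == 0 or count == len(bits) or len(bits) == 1:
        return node  # pure-0, pure-1, or single bit: stop partitioning
    quarter = len(bits) // 4
    node["children"] = [build_ptree(bits[i * quarter:(i + 1) * quarter])
                        for i in range(4)]
    return node

# A 16-bit group with quadrant counts 4, 2, 1, 0 and root count 7, as in Figure 2.
bit_group = [1, 1, 1, 1,   1, 0, 0, 1,   0, 0, 1, 0,   0, 0, 0, 0]
tree = build_ptree(bit_group)
print(tree["count"], [c["count"] for c in tree["children"]])  # 7 [4, 2, 1, 0]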
Non-pure quadrants, such as the nodes 2 and 1 on the second level of the tree, are recursively partitioned further into four quadrants, with a node for each quadrant. We stop the recursive partitioning of a node when it becomes pure-1 or pure-0 (eventually we reach a node composed of a single bit, which is necessarily pure because it is made up of either a "1" bit or a "0" bit).

P-tree algebra includes operations such as AND, OR, NOT (or complement) and RootCount (a count of the number of "1"s in the tree). Details of those operations can be found in [1]. The latest benchmark on P-tree ANDing has shown a speed of 6 milliseconds for ANDing two P-trees representing bit groups of 16 million bits each. Speed and compression aspects of P-trees are discussed in greater detail in [1]; [2], [4] and [5] give some applications exploiting the P-tree technology. Once we have represented our data using P-trees, no scans of the database are needed to perform text categorization, as we shall demonstrate later. In fact, this is one of the most important aspects of the P-tree technology.

Figure 1. Relational numeric data converted to binary format with the first three bit groups in Attribute 1 highlighted.

Figure 2. A 16-bit group converted to a P-tree.

3. DATA MANAGEMENT
3.1 Intervalization
Viewing the different text representations discussed briefly in the introduction as a concept hierarchy, with the binary representation on one extreme and the exact-frequency representation on the other, we suggest working somewhere along this hierarchy by using intervals. This enables us to deal with a finite number of possible values, thus approaching the speed of the binary representation, while still differentiating among term frequencies in different documents at a much more sophisticated level than the binary representation, thus approaching the accuracy of the exact-frequency representation.

Given a document-by-term matrix represented using the aforementioned TFxIDF measurement, we aim to intervalize this data. To do this, we first normalize the original set of weighted term-frequency values into values between 0 and 1 (any other range will do); this eliminates the problems resulting from differences in document sizes. After normalization, all values for terms in document vectors lie in the range [0, 1], and the intervalization phase starts. First, we must decide on the number of intervals and specify a range for every interval. After that, we replace the term values of document vectors by their corresponding intervals, so that values are now drawn from a finite set. For example, we can use a four-interval value logic: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1], where "(" and ")" are exclusive and "[" and "]" are inclusive. The optimal number of intervals and their ranges depend on the type of the documents and their domain; further discussion of those variables is environment dependent and outside the scope of this paper. Domain experts and experimentation can assist in this regard. After normalization, each interval is defined over a range of values.
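The following minimal sketch illustrates the intervalization step under stated assumptions: it normalizes a document's TFxIDF weights into [0, 1] by dividing by the maximum weight (the paper only requires some normalization into [0, 1], so this particular scheme, the function names, and the example weights are assumptions) and then replaces each weight with the ordered interval code I0..I3 of the example above.

def normalize(vector):
    """Scale a document's TFxIDF weights into [0, 1] by its maximum weight."""
    m = max(vector) or 1.0
    return [v / m for v in vector]

def to_interval(value):
    """Map a normalized weight to the example intervals I0=[0,0], I1=(0,0.1],
    I2=(0.1,0.2], I3=(0.2,1]; the returned integer preserves the interval order."""
    if value == 0.0:
        return 0          # I0
    if value <= 0.1:
        return 1          # I1
    if value <= 0.2:
        return 2          # I2
    return 3              # I3

tfidf_doc = [0.0, 1.7, 0.3, 6.2]            # hypothetical TFxIDF weights
intervalized = [to_interval(v) for v in normalize(tfidf_doc)]
print(intervalized)                          # [0, 3, 1, 3]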
The ranges are chosen in an ordered, consecutive manner so that the corresponding intervals have the same intrinsic order among each other. Consider the example interval set used previously in this section: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1]. We know that [0,0] < (0,0.1] < (0.1,0.2] < (0.2,1]; as a result, the corresponding ordering implied among the intervals is I0 << I1 << I2 << I3.

3.2 Data Representation
Each interval value is represented by a bit string preserving its order in the interval set. For example, for I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1], we can set I0=00, I1=01, I2=10 and I3=11. This enables us to represent our data using the efficient, lossless P-tree structure. Note that a bit string of length x is needed to represent 2^x intervals.

So far, we have created a binary matrix similar to that depicted in Figure 1. We follow the same steps presented in Section 2 to create the P-tree version of the matrix as in Figure 2: for every bit position in every term ti, we create a basic P-tree. We have two bits per term (since each term value is now one of the four intervals, each represented by two bits), and thus two P-trees are needed, one for each bit position: Pi,1 and Pi,0, where Pi,j is the P-tree representation of the bits lying at position j of term i for all documents (position 1 being the leftmost bit and position 0 the rightmost bit). The root count of Pi,j gives the number of documents having a 1 bit at position j of term i. This representation conveys a lot of information and is structured to facilitate fast data mining processing. To get the P-tree representing the documents having a certain interval value for some term i, we can follow the steps given in the following example: if the desired binary value for term i is 10, we calculate Pi,10 as Pi,10 = Pi,1 AND P'i,0, where ' indicates the bit-complement or NOT operation (which is simply the count complement in each quadrant [1]).

4. P-TREE BASED CATEGORIZATION
4.1 Document Similarity
Every document is a vector of terms represented by interval values. Similarity between two documents d1 and d2 can be measured by the number of common terms, where a term t is considered common between d1 and d2 if the interval value given to t is the same in both documents. The more common terms d1 and d2 have, the higher their degree of similarity. However, not all terms participate equally in this similarity; this is where the order of the intervals comes into the picture. If we use our previous example of four intervals I0, I1, I2 and I3, where I0 << I1 << I2 << I3, then common terms having higher interval values such as I3 contribute more to the similarity than terms having lower interval values such as I0. The rationale for this should be obvious: term values in document vectors reflect the degree of association between the terms and the corresponding documents, and the higher the value of a term in a document vector, the more this term contributes to the context of the document.

In addition to using common terms to measure the similarity between documents, we need to check how close the non-common terms are. If documents d1 and d2 have different interval values for certain terms, then the higher and closer those interval values are, the higher the degree of similarity between d1 and d2.
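As a rough sketch of the composition just described, the snippet below uses plain Python integers as uncompressed bitmaps standing in for P-trees (one bit per document), so that AND, NOT, and RootCount reduce to bitwise operations and a popcount. The bitmaps, document count, and names are hypothetical; this is an illustration of the idea, not the authors' compressed structure.

N_DOCS = 8
ALL_ONES = (1 << N_DOCS) - 1            # identity bitmap: a pure-1 "P-tree"

def root_count(bitmap):
    """Number of '1' bits, i.e. the RootCount of the corresponding P-tree."""
    return bin(bitmap).count("1")

# Hypothetical bit-position bitmaps for one term i over 8 documents:
# bit d of P_i1 (P_i0) is the left (right) bit of term i's interval code in document d.
P_i1 = 0b10110100
P_i0 = 0b01100110

# Documents whose interval code for term i is exactly "10" = P_i1 AND NOT(P_i0).
P_i_10 = P_i1 & (~P_i0 & ALL_ONES)
print(bin(P_i_10), root_count(P_i_10))   # 0b10010000 2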
For example, if term t has the value 11 in d1 and the value 01 in d2, then the similarity between d1 and d2 is higher than if t had the value 10 in d1 and the value 00 in d2, because 11 contributes more to the context of d1 than 10 does, and the same holds for d2. The similarity would be higher still if t had the value 11 in d1 and the value 10 in d2, because the gap between 11 and 10 is smaller than that between 11 and 01. In short, the similarity between documents is implicitly specified in our P-tree representation and is based on the number of common terms between the two documents, the interval closeness for non-common terms, and the interval values themselves (higher values mean higher similarity).

4.2 Categorization Algorithm
Before applying our classification algorithm, we assume that a TFxIDF document-by-term matrix has been created and intervalized and that the P-tree version of the matrix has been built; we operate only on the P-tree version. To categorize a new document, dnew, the algorithm first attempts to find its k most similar neighbors. Figure 5 presents the selection phase of our algorithm.

1. Initialize an identity P-tree, Pnn (a P-tree representing a bit group having only ones, i.e., a pure-1 quadrant).
2. Order the set S of all term P-trees in descending order, from term P-trees representing higher interval values in dnew to those representing lower values.
3. For every term P-tree, Pt, in S do the following:
   a. AND Pnn with Pt.
   b. If the root count of the result is less than k, expand Pt by removing the rightmost bit from the interval value (i.e., intervals 01 and 00 become 0, and intervals 10 and 11 become 1). This can be done by recalculating Pt while disregarding the rightmost-bit P-tree. Repeat this step until the root count of Pnn AND Pt is at least k.
   c. Else, put the result in Pnn.
   d. Loop.
4. End of selection phase.

Figure 5. Selection algorithm.

After creating and sorting the term P-trees according to the values in dnew as described in step 2 of the algorithm (the P-trees for terms having higher interval values in dnew are processed before term P-trees with lower values), the algorithm sequentially ANDs each term P-tree, Pt, with Pnn, always making sure that the root count of the result is greater than or equal to k. Should the root count drop below k, the ANDing operation that caused the drop is undone, and the Pt involved in that operation is reconstructed by removing the rightmost bit. To see how this happens, consider an example where we are ANDing Pnn with a Pt representing a term whose binary value in dnew is 10, and suppose that the root count of the result of this ANDing operation is less than k. Initially, Pt was calculated as Pt,1 AND P't,0 (because the desired value is 10). To reconstruct Pt by removing the rightmost bit, we assume that the value of t in dnew is 1 instead of 10, so we can now calculate Pt as Pt,1. This process of reconstructing Pt is repeated until either the result of ANDing Pnn with the newly constructed Pt has a root count of at least k, or the newly assumed value for t has no bits left because all the bits have been removed (in this case we say that t has been ignored). After looping through all the term P-trees, Pnn holds the documents that are nearest to dnew.
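The following self-contained sketch mirrors the selection phase above, again using integer bitmaps (one bit per training document) as stands-in for compressed P-trees. The toy data, variable names, and the bitmap representation are assumptions made purely for illustration.

N_DOCS = 8
ALL_ONES = (1 << N_DOCS) - 1

def root_count(bm):
    return bin(bm).count("1")

def select_neighbors(dnew, P1, P0, k):
    """dnew: {term: 2-bit interval code}; P1/P0: {term: bitmap of left/right bits}."""
    Pnn = ALL_ONES                                       # step 1: identity P-tree
    # Step 2: process terms with higher interval values in dnew first.
    for t in sorted(dnew, key=dnew.get, reverse=True):
        bits = format(dnew[t], "02b")                    # e.g. 2 -> "10"
        while bits:
            # Rebuild Pt for the current (possibly shortened) value of term t.
            Pt = P1[t] if bits[0] == "1" else (~P1[t] & ALL_ONES)
            if len(bits) == 2:
                Pt &= P0[t] if bits[1] == "1" else (~P0[t] & ALL_ONES)
            if root_count(Pnn & Pt) >= k:
                Pnn &= Pt                                # step 3c: keep the result
                break
            bits = bits[:-1]                             # step 3b: drop rightmost bit
        # if bits is empty, term t has been ignored
    return Pnn

# Toy data: 3 terms over 8 training documents.
dnew = {"oil": 3, "corn": 1, "rise": 2}
P1 = {"oil": 0b11101001, "corn": 0b00110010, "rise": 0b10101101}
P0 = {"oil": 0b01101100, "corn": 0b11010010, "rise": 0b00111001}
Pnn = select_neighbors(dnew, P1, P0, k=2)
print(bin(Pnn), root_count(Pnn))                         # 0b101000 2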
After finding the k nearest neighbors (or more, since the root count of Pnn may be greater than k due to the use of the "closed neighborhood," which has been shown to improve accuracy [2]) through the selection phase, the algorithm proceeds with the voting phase to find the target class label of document dnew. For voting purposes, every neighboring document (i.e., every document selected in the selection phase) is given a voting weight based on its similarity to dnew. For every class label ci, we loop through all terms t in dnew and calculate the number of nearest neighbors having the same value for t as dnew. To see how this works, suppose we have the document dnew(v1, v2, v3, ..., vn), where vj is the interval value of term j in dnew. We calculate Pj for each term j with value vj and then AND it with Pnn (to make sure we are only considering the selected neighbors) and with the P-tree representing the documents having class label ci, Pi. We then multiply the root count of the resulting P-tree by its predefined weight, which is derived from the value Ij of term j in dnew and is equal to the numeric value of Ij plus 1 (adding 1 handles the case where Ij's numeric value is 0). The resulting value is added to the counter maintained for class label ci. Let Pnn denote the P-tree representing the neighbors from the selection phase and Pi denote the P-tree for the documents having class label ci. A formal description of the voting phase is given in Figure 6.

1. For every class ci, loop through the dnew vector and do the following for every term t in dnew:
   a. Get the P-tree representing the neighboring documents (Pnn from the selection phase) that have the same value for t (Pt) and class ci (Pi). This is done by calculating Presult = Pnn AND Pt AND Pi.
   b. If the term under consideration has the value Ij in dnew, multiply the root count of Presult by (Ij + 1).
   c. Add the result to the counter of ci, w(ci).
   d. Loop.
2. Select the class ck having the largest counter, w(ck), as the class of dnew.
3. End of voting phase.

Figure 6. Voting algorithm.
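A minimal sketch of the voting phase, continuing the integer-bitmap stand-in for P-trees used in the earlier sketches, is given below. Pnn would come from the selection phase; Pclass maps each class label to the bitmap of training documents carrying it. All names and data here are illustrative assumptions.

N_DOCS = 8
ALL_ONES = (1 << N_DOCS) - 1

def root_count(bm):
    return bin(bm).count("1")

def vote(dnew, P1, P0, Pnn, Pclass):
    """Return the class whose weighted neighbor count w(ci) is largest."""
    weights = {}
    for ci, Pi in Pclass.items():
        w = 0
        for t, code in dnew.items():
            bits = format(code, "02b")
            Pt = P1[t] if bits[0] == "1" else (~P1[t] & ALL_ONES)
            Pt &= P0[t] if bits[1] == "1" else (~P0[t] & ALL_ONES)
            Presult = Pnn & Pt & Pi                  # neighbors in ci matching t's value
            w += root_count(Presult) * (code + 1)    # weight by interval value + 1
        weights[ci] = w
    return max(weights, key=weights.get), weights

# Continuing the toy data of the previous sketch:
dnew = {"oil": 3, "corn": 1, "rise": 2}
P1 = {"oil": 0b11101001, "corn": 0b00110010, "rise": 0b10101101}
P0 = {"oil": 0b01101100, "corn": 0b11010010, "rise": 0b00111001}
Pnn = 0b00101000                                      # result of the selection sketch
Pclass = {"earn": 0b00111100, "acq": 0b11000011}      # hypothetical class bitmaps
print(vote(dnew, P1, P0, Pnn, Pclass))                # ('earn', {'earn': 8, 'acq': 0})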
5. PERFORMANCE ANALYSIS STUDY
For the purpose of performance analysis, our algorithm has been tested against randomly selected subsets of the Reuters-21578 collection available at the University of California–Irvine data collections store (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html). We compare our speed and accuracy to the original KNN approach, which uses the cosine similarity measure between document vectors to decide upon the nearest neighbors. Since our aim is to propose a globally "good" text classifier and not only to outperform the original KNN approach, we also compare our results, in terms of accuracy only, to the string kernels approach [3] (based on support vector machines), which reports good results on small subsets of this collection (better than those reported on the whole collection). To get deeper insight into our performance, we empirically experimented on random subsets of documents having one of the following four classes: acq, earn, corn, and crude. We chose those 4 classes because [3] uses them to report its performance while varying other free parameters (the length of the subsequence, k, and the weight decay parameter λ). We used the precision, recall and F1 measurements on each class, averaged over 10 runs, as an indication of our accuracy. The F1 measure combines the precision (p) and recall (r) values into a single value giving the same importance weight to both; it is calculated as F1 = 2pr/(p+r). Large-scale testing over the whole Reuters collection is still underway.

We used k=3 and a 4-interval value set: I0=00=[0,0], I1=01=(0,0.25], I2=10=(0.25,0.75] and I3=11=(0.75,1]. Each of the 10 selected test datasets (one for each run) was composed of 380 randomly selected samples for neighbor selection (distributed as follows: 152 samples from class earn, 114 from class acq, 76 from class crude and 38 from class corn) and 90 randomly selected samples for testing (distributed as follows: 40 samples from class earn, 25 from class acq, 15 from class crude and 5 from class corn), 470 samples in total. This is the same sampling process reported in [3].

Table 1 lists comparative precision, recall and F1 measurements showing how we compare to the cosine-similarity-based KNN and the string kernels approaches (the values for the kernels approach were selected from [3] for k=5 and λ=0.3, which were among the highest reported). The values for our approach and the KNN approach are averaged over 10 runs to enable comparison with the values for the kernels approach as reported in [3]. For each of the three approaches, we show the precision, recall and F1 measurements over each of the four considered classes. The F1 measurements are then plotted in Figure 7. Table 2 lists the effect of using different matrix sizes on the total time needed by our approach and by the KNN approach to select the nearest neighbors of a given document; Figure 8 plots the time measurements graphically.

Table 1. Measurements comparison table.

Figure 7. F1 measure comparison graph.

Table 2. Time comparison table.

Figure 8. Time comparison graph.

Compared to the KNN approach, our approach shows much better results in terms of speed and accuracy. The improvement in speed is mainly due to the complexity of the two algorithms: the high cost of KNN-based algorithms is usually associated with their selection phases. The selection phase in our algorithm has a complexity of O(n), where n is the number of dimensions (terms), while the selection phase of the KNN approach has a complexity of O(mn), where m is the size of the dataset (number of documents or tuples) and n is the number of dimensions. The most drastic improvement is shown when the matrix is very large (the case of the 5000x20000 matrix size in Table 2). As for accuracy, the KNN approach bases its judgment of the similarity between two document vectors on the angle between those vectors, regardless of the actual distance between them. Our approach does a more sophisticated comparison: by using ANDing to compare the closeness of the value of each term in the corresponding vectors, it judges the distance between the two vectors and not only the angle. Also, terms that would skew the result are ignored in our algorithm, unlike in the KNN approach, which has to include all the terms in the space during the neighbor-selection process. It is clear that, in all cases, our approach's accuracy as summarized by the F1 measure is very comparable to that reported for the kernels approach (for k=5 and λ=0.3): the kernels approach only shows better results for class crude, while we show better results for the rest of the classes.
However, it would not be appropriate to compare speeds, because the two approaches are fundamentally different: our approach is example-based (lazy) while the kernels approach is eager. In general, after training and testing, eager classifiers tend to be faster than lazy classifiers; however, they lack the ability to operate on dynamic streams, in which case they need to be retrained and retested frequently to cope with changes in the data. Table 1 also shows that the ranges of values for the precision, recall and F1 measurements in the kernels and KNN approaches are wider than ours. For instance, our approach's precision values spread over the range [92.6, 98.3], while the KNN approach's precision values spread over [79.1, 90] and the kernels approach's precision values spread over [91.1, 99.2]. This observation reveals a greater potential for our P-tree-based approach to produce more stable results across different categories, which is a very desirable characteristic.

6. CONCLUSION
In this paper, we have presented an optimized KNN-based text categorization algorithm characterized by high efficiency and accuracy and based on intervalization and the P-tree technology. This algorithm has been devised with space, time and accuracy considerations in mind.

The high accuracy reported in this work is mainly due to the use of sequential ANDing in the selection phase, which ensures that only the nearest neighbors are selected for voting. In addition, in the voting algorithm, each neighboring document gets a vote weight depending on its similarity to the document being classified. Also, using the closed k-neighborhood [2] when the number of neighbors returned by our selection algorithm is greater than k has a good effect on accuracy because more nearby documents get to vote.

As for space, we operate on a reduced, compressed space. The first step after generating the normalized TFxIDF document-by-term matrix is to intervalize the data. If we assume, for simplicity only, that every term value is stored in 1 byte (8 bits) and that we have four intervals, then we reduce the size of the matrix by a factor of 4, since each term value stored in a byte is replaced by a 2-bit interval value. After reducing the size of the matrix, we apply compression by creating and exploiting the P-tree version of the matrix.

As for speed, our data is structured in the P-tree format, which enables us to perform the task without a single database scan. Add to this the fact that our algorithms are driven completely by the AND operation, which is among the fastest computer instructions.

In the future, we aim to examine more closely the effects of varying the number of intervals and their ranges over large datasets, including the whole Reuters collection.

7. REFERENCES
[1] Ding, Khan, Roy, and Perrizo, The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing (Madrid, Spain), 2002.
[2] Khan, Ding, and Perrizo, K-nearest neighbor classification on spatial data streams using P-trees. Proceedings of the PAKDD, Pacific-Asia Conference on Knowledge Discovery and Data Mining (Taipei, Taiwan), 2002.
[3] Lodhi, Saunders, Shawe-Taylor, Cristianini, and Watkins, "Text classification using string kernels." Journal of Machine Learning Research, 2:419-444, February 2002.
[4] Perrizo, Ding, and Roy, Deriving high confidence rules from spatial data using Peano count trees. Proceedings of the WAIM, International Conference on Web-Age Information Management (Xi'an, China), 91-102, July 2001.
[5] Rahal and Perrizo, Query acceleration in multi-level secure database systems using the P-tree technology. Proceedings of the ISCA CATA, International Conference on Computers and Their Applications (Honolulu, Hawaii), March 2003.
[6] Salton and Buckley, "Term-weighting approaches in automatic text retrieval." Information Processing & Management, 24(5):513-523, May 1988.
[7] Salton, Wong, and Yang, "A vector space model for automatic indexing." Communications of the ACM, 18(11):613-620, November 1975.