An Optimized Approach for KNN Text Categorization using
P-trees
Imad Rahal and William Perrizo
Computer Science Department
North Dakota State University
IACC 258
Fargo, ND, USA
001-701-231-7248
{imad.rahal, william.perrizo}@ndsu.nodak.edu
ABSTRACT
The importance of text mining stems from the availability of huge
volumes of text databases holding a wealth of valuable
information that needs to be mined. Text categorization is the
process of assigning categories or labels to documents based
entirely on their contents. Formally, it can be viewed as a
mapping from the document space into a set of predefined class
labels (aka subjects or categories), F: D → {C1, C2, …, Cn}, where F is the mapping function, D is the document space, and {C1, C2, …, Cn} is the set of class labels. Given an unlabeled document d, we need to find its class label Ci using the mapping function F, where F(d) = Ci. In this paper, we propose an optimized k-Nearest Neighbors (KNN) classifier that uses intervalization and the P-tree¹ technology to achieve a high degree of accuracy, space utilization, and time efficiency: as new samples arrive, the classifier finds the k nearest neighbors of each new sample in the training space without a single database scan.
Categories and Subject Descriptors
I.5.4 [Pattern Recognition]: Applications – Text Processing.
I.5.2 [Pattern Recognition]: Design Methodologies – Classifier
design and evaluation.
E.1 [Data Structures]: Trees.
General Terms
Algorithms, Management, Performance, Design.
Keywords
Text categorization, Text classification, P-trees, Intervalization, k-Nearest Neighbor.
1. INTRODUCTION
Nowadays, a great deal of the literature in most domains is
available in text format. Document collections (aka text or
¹ Patents are pending for the P-tree technology. This work was partially supported by the GSA grant ACT#: K96130308.
document databases in the literature) are usually characterized by
being very dynamic in size. They contain documents from various
sources such as news articles, research publications, digital
libraries, emails, and web pages. Perhaps the worldwide advent of
the Internet is one of the main reasons for the rapid growth in the
sizes of those collections.
In the term space model [6][7], a document is represented as a vector in the term space where terms are used as features or
dimensions. The data structure resulting from representing all the
documents in a given collection as term vectors is referred to as a
document-by-term matrix. Given that the term space has
thousands of dimensions, most current text-mining algorithms fail to scale up. This very high dimensionality of the term space is an
idiosyncrasy of text mining and must be addressed carefully in
any text-mining application.
Within the term space model, many different representations exist.
On one extreme, there is the binary representation in which a
document is represented as a binary vector where a 1 bit in slot i
implies the existence of the corresponding term ti in the
document, and a 0 bit implies its absence. This model is fast and
efficient to implement but clearly lacks the degree of accuracy
needed because most of the semantics are lost. On the other
extreme, there is the frequency representation where a document
is represented as a frequency vector [6][7]. Many types of
frequency measures exist: term frequency (TF), term frequency by
inverse document frequency (TFxIDF), normalized TF, and the
like. This representation is obviously more accurate than the
binary one but is not as easy and efficient to implement. Text
preprocessing such as stemming, case folding, and stop lists can
be exploited prior to the weighting process for efficiency
purposes.
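To make the frequency representation concrete, the following Python sketch builds a small document-by-term matrix under one common TFxIDF variant; the toy corpus, the whitespace tokenizer, and the exact weighting formula are illustrative assumptions rather than the preprocessing pipeline used in our experiments.

import math
from collections import Counter

# Toy corpus; real collections would first apply stemming, case folding and stop lists.
docs = [
    "crude oil prices rise as crude supply falls",
    "corn and wheat exports rise",
    "oil company reports record earnings",
]

tokenized = [d.lower().split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tfxidf_vector(doc_tokens):
    # TF x IDF weighting (one common variant; the paper does not fix a particular formula).
    tf = Counter(doc_tokens)
    return [tf[t] * math.log(n_docs / df[t]) if t in tf else 0.0 for t in vocab]

matrix = [tfxidf_vector(doc) for doc in tokenized]   # the document-by-term matrix
for row in matrix:
    print([round(w, 2) for w in row])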
In this paper we present a new model for representing text data
based on the idea of intervalizing (aka discretizing) the data into a
set of predefined intervals. We propose an optimized KNN
algorithm that exploits this model. Our algorithm is characterized
by accuracy and space and time efficiency because it is based on
the P-tree technology.
The rest of this paper is organized as follows: In Section 2, an
introduction to the P-tree technology is given. Section 3 discusses
data management aspects required for applying the text
categorization algorithm which, in turn, is discussed in Section 4.
Section 5 gives a performance analysis study. Finally, in Section
6, we conclude this paper by highlighting the achievements of our
work and pointing out future directions in this area.
2. THE P-TREE TECHNOLOGY
The basic data structure exploited in the P-tree technology [1] is
the Predicate Count Tree² (PC-tree), or simply the P-tree. Formally, P-trees are tree-like data structures that store numeric-relational data (i.e. numeric data in relational format) in column-wise, bit-compressed format by splitting each attribute into bits
(i.e. representing each attribute value by its binary equivalent),
grouping together all bits in each bit position for all tuples, and
representing each bit group by a P-tree. P-trees provide a lot of
information and are structured to facilitate data mining processes.
After representing each numeric attribute value by its bit
representation, we store all bits for each position separately. In
other words, we group together all the bit values at bit position x
of each attribute A for all tuples t in the table. Figure 1 shows a
relational table made up of three attributes and four tuples
transformed from numeric to binary, and highlights all the bits in the first three bit groups of Attribute 1; each of those bit groups will form a P-tree. Since each attribute value in our table is
made up of 8 bits, 24 bit groups are generated in total with each
attribute generating 8 bit groups. Figure 2 shows a group of 16
bits transformed into a P-tree after being divided into quadrants
(i.e. subgroups of 4). Each such tree is called a basic P-tree. In the
lower part of Figure 2, 7 is the total number of “1” bits in the whole bit group shown in the upper part, while 4, 2, 1 and 0 are the numbers of 1’s in the 1st, 2nd, 3rd and 4th quadrants of the bit group, respectively. Since the first quadrant is made up entirely of “1” bits (we call it a pure-1 quadrant), no sub-trees are needed for it (this is the node denoted by 4 on the second level of the tree). Similarly, quadrants made up entirely of “0” bits (the node denoted by 0 on the second level of the tree) are called pure-0 quadrants and also have no sub-trees. As a matter of fact, this is how compression is achieved³ [1]. Non-pure quadrants such as nodes 2
and 1 on the second level in the tree are recursively partitioned
further into four quadrants with a node for each quadrant. We stop
the recursive partitioning of a node when it becomes pure-1 or
pure-0 (eventually we will reach a point where the node is
composed of a single bit only and is pure because it is made up
entirely of either only “1” bits or “0” bits).
P-tree algebra includes operations such as AND, OR, NOT (or
complement) and RootCount (a count of the number of “1”s in the
tree). Details for those operations can be found in [1]. The latest
benchmark on P-tree ANDing has shown a speed of 6 milliseconds for ANDing two P-trees representing bit groups each containing 16 million bits. Speed and compression aspects of P-trees have been discussed in greater detail in [1]; [2], [4] and [5] give some applications exploiting the P-tree technology. Once we
have represented our data using P-trees, no scans of the database
are needed to perform text categorization as we shall demonstrate
later. In fact, this is one of the very important aspects of the P-tree
technology.
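As a rough illustration of the structure just described, the following Python sketch builds a quadrant-based count tree over a bit group and supports RootCount, complement and AND; it is a simplified approximation of the P-tree data structure of [1], not the actual (patented) implementation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    count: int                           # number of "1" bits covered by this node
    size: int                            # number of bits covered by this node
    kids: Optional[List["Node"]] = None  # None for pure-1 / pure-0 nodes (no sub-trees)

def build(bits):
    # Recursively partition a bit group into quadrants, stopping at pure quadrants.
    count, size = sum(bits), len(bits)
    if count == 0 or count == size or size < 4:
        return Node(count, size)
    q = size // 4
    kids = [build(bits[i * q:(i + 1) * q]) for i in range(4)]
    return Node(count, size, kids)

def root_count(p):
    return p.count

def p_not(p):
    # Complement: the count in each quadrant becomes (size - count).
    kids = [p_not(c) for c in p.kids] if p.kids else None
    return Node(p.size - p.count, p.size, kids)

def p_and(a, b):
    # AND two trees built over the same bit positions.
    if a.count == 0 or b.count == 0:
        return Node(0, a.size)
    if a.count == a.size:                # a is pure-1, so the result is b
        return b
    if b.count == b.size:
        return a
    kids = [p_and(x, y) for x, y in zip(a.kids, b.kids)]
    return Node(sum(k.count for k in kids), a.size, kids)

# The 16-bit group of Figure 2: quadrant counts 4, 2, 1 and 0, root count 7.
bits = [1,1,1,1, 1,1,0,0, 1,0,0,0, 0,0,0,0]
p = build(bits)
print(root_count(p))                     # 7
print(root_count(p_and(p, p_not(p))))    # 0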
² Formerly known as the Peano Count Tree.
³ It's worth noting that good P-tree compression can be achieved when the data is very sparse (which increases the chances of having long sequences of “0” bits) or very dense (which increases the chances of having long sequences of “1” bits).
Figure 1. Relational numeric data converted to binary format
with the first three bit groups in Attribute 1 highlighted.
Figure 2. A 16-bit group converted to a P-tree.
3. DATA MANAGEMENT
3.1 Intervalization
Viewing the different text representations discussed briefly in the
introduction as a concept hierarchy with the binary representation
on one extreme and the exact frequencies representation on the
other, we suggest working somewhere along this hierarchy by
using intervals. This would enable us to deal with a finite number of possible values, thus approaching the speed of the binary representation, while still being able to differentiate among term frequencies in different documents at a much finer level than the binary representation, thus approaching the accuracy of the exact-frequencies representation.
Given a document-by-term matrix represented using the
aforementioned TFxIDF measurement, we aim to intervalize this
data. To do this, we need to normalize the original set of weighted
term frequency measurement values into values between 0 and 1
(any other range will do). This would eliminate the problems
resulting from differences in document sizes. After normalization,
all values for terms in document vectors lie in the range [0, 1];
now the intervalization phase starts. First, we must decide on the
number of intervals and specify a range for every interval. After
that, we replace the term values of document vectors by their
corresponding intervals so that values are now drawn from a finite
set of values. For example, we can use a four-interval value logic: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2] and I3=(0.2,1], where "(" and ")" are exclusive and "[" and "]" are inclusive. The optimal number of
intervals and their ranges depend on the type of the documents
and their domain. Further discussion of those variables is
environment dependent and outside the scope of this paper.
Domain experts and experimentation can assist in this regard.
After normalization, each interval would be defined over a range
of values. The ranges are chosen in an ordered consecutive
manner so that the corresponding intervals would have the same
intrinsic order amongst each other. Consider the example interval
set used previously in this section; we have I0=[0,0], I1=(0,0.1],
I2=(0.1,0.2] and I3=(0.2,1]. We know that [0,0] < (0,0.1] <
(0.1,0.2] < (0.2,1]; as a result, the corresponding ordering implied
among the intervals is: I0 << I1 << I2 << I3.
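A minimal sketch of the normalization and intervalization step is given below, assuming per-document max-normalization and the four-interval example above; the normalization choice and the cut points are illustrative, not prescribed by the approach.

def normalize(vector):
    # Scale a document's term weights into [0, 1]; one simple choice among several.
    m = max(vector)
    return [v / m if m > 0 else 0.0 for v in vector]

# Running-example intervals: I0=[0,0], I1=(0,0.1], I2=(0.1,0.2], I3=(0.2,1].
def intervalize(value):
    if value == 0:
        return 0      # I0
    if value <= 0.1:
        return 1      # I1
    if value <= 0.2:
        return 2      # I2
    return 3          # I3

doc = [0.0, 3.2, 0.4, 12.0, 1.1]                 # raw TFxIDF weights for one document
codes = [intervalize(v) for v in normalize(doc)]
print(codes)   # [0, 3, 1, 3, 1] -> later encoded as the 2-bit strings 00, 11, 01, 11, 01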
3.2 Data Representation
Each interval value will be represented by a bit string preserving
its order in the interval set. For example, for I0=[0,0], I1=(0,0.1],
I2=(0.1,0.2] and I3=(0.2,1], we can set I0=00, I1=01, I2=10 and
I3=11. This will enable us to represent our data using the efficient
lossless tool, the P-tree. Note that a bit string of length x is needed to represent 2^x intervals.
So far, we’ve created a binary matrix similar to that depicted in
Figure 1. We follow the same steps presented in Section 2 to
create the P-tree version of the matrix as in Figure 2. For every bit
position in every term ti we will create a basic P-tree. We have
two bits per term (since each term value is now one of the four
intervals each represented by two bits) and thus two P-trees are needed (one for each bit position), Pi,1 and Pi,2, where Pi,j is the P-tree representation of the bits lying at the jth position of the ith term for all documents. Each Pi,j will give the number of documents having a 1 bit in position j for term i. This representation conveys a lot of information and is structured to facilitate fast data mining processing. To get the P-tree representing the documents having a certain interval value for some term i, we can follow the steps given in the following example: if the desired binary value for term i is 10, we calculate Pi,10 as Pi,10 = Pi,1 AND P'i,2, where ' indicates the bit-complement or NOT operation (which is simply the count complement in each quadrant [1]).
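The following sketch illustrates, over flat bit groups rather than compressed P-trees, how the two bit-position groups of a term are formed and how the documents having a particular interval value (here 10) are obtained by ANDing; the interval codes used are hypothetical.

# Interval codes for one term t_i across six documents (values 0..3 from the intervalization step).
term_codes = [2, 3, 0, 2, 1, 3]

# Split the 2-bit codes into two bit groups: position 1 is the high bit, position 2 the low bit.
bits_pos1 = [(c >> 1) & 1 for c in term_codes]   # would be stored as P-tree Pi,1
bits_pos2 = [c & 1 for c in term_codes]          # would be stored as P-tree Pi,2

# P-tree operations reduced to their effect on flat bit groups.
def AND(a, b):  return [x & y for x, y in zip(a, b)]
def NOT(a):     return [1 - x for x in a]
def root_count(a): return sum(a)

# Documents in which term i has interval value 10 (i.e. I2): Pi,10 = Pi,1 AND P'i,2.
p_i_10 = AND(bits_pos1, NOT(bits_pos2))
print(p_i_10)              # [1, 0, 0, 1, 0, 0] -> documents 1 and 4 have value 10
print(root_count(p_i_10))  # 2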
4. P-TREE BASED CATEGORIZATION
4.1 Document Similarity
Every document is a vector of terms represented by interval
values. Similarity between two documents d1 and d2 could be
measured by the number of common terms. A term t is considered
to be common between d1 and d2 if the interval value given to t in
both documents is the same. The more common terms d1 and d2
have, the higher is their degree of similarity. However, not all
terms participate equally in this similarity. This is where the order
of the intervals comes into the picture. If we use our previous
example of four intervals I0, I1, I2 and I3, where I0 << I1 << I2 << I3, then common terms having higher interval values such as I3 contribute more to the similarity than terms having lower interval values such as I0. The rationale for this should be
obvious. Term values in document vectors reflect the degree of
association between the terms and corresponding documents; the
higher the value for a term in a document vector, the more this
term contributes to the context of the document.
In addition to using common terms to measure the similarity
between documents, we need to check how close non-common
terms are. If documents d1 and d2 have different interval values
for certain terms, then the higher and closer those interval values
are, the higher the degree of similarity between d1 and d2. For
example, if term t has a value 11 in d1 and a value 01 in d2, then
the similarity between d1 and d2 would be higher than if term t
had a value 10 in d1 and a value 00 in d2 because 11 contributes
more to the context of d1 than does 10 and the same holds for d2.
However, the similarity between d1 and d2 would be higher than
the former case if term t had a value 11 in d1 and a value 10 in d2
because the gap between 11 and 10 is smaller than that between
11 and 01.
In short, the similarity between documents is implicitly specified
in our P-tree representation and is based on the number of
common terms between the two documents, the interval closeness
for non-common terms and the interval values themselves (higher
values mean higher similarity).
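Our algorithm never computes an explicit similarity score (closeness is enforced implicitly by the ANDing in the selection phase described next), but purely as an illustration of the factors above, the toy function below rewards common terms in proportion to their interval values and rewards closeness for non-common terms; the particular weighting is an assumption made for illustration only.

def toy_similarity(d1, d2):
    # Illustrative only: higher interval values and smaller gaps raise the score.
    score = 0.0
    for v1, v2 in zip(d1, d2):
        if v1 == v2:
            score += v1 + 1    # common term: weight grows with the interval value
        else:
            score += (min(v1, v2) + 1) / (1 + abs(v1 - v2))   # non-common: reward closeness
    return score

a = [3, 2, 0, 1]
b = [3, 1, 0, 1]
c = [3, 0, 0, 1]
print(toy_similarity(a, b) > toy_similarity(a, c))   # True: b is closer to a than c is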
4.2 Categorization Algorithm
Before applying our classification algorithm, we will assume that
a TFxIDF document-by-term matrix has been created and
intervalized and that the P-tree version of the matrix has also been
created. We will only operate on the P-tree version.
To categorize a new document, dnew, the algorithm first attempts
to find the k-most-similar neighbors. In Figure 5 we present the
selection phase of our algorithm.
1. Initialize an identity P-tree, Pnn (representing a bit group having only ones, i.e. a pure-1 quadrant).
2. Order the set of all term P-trees, S, in descending order, from the term P-trees representing higher interval values in dnew to those representing lower values.
3. For every term P-tree, Pt, in S do the following:
   a. AND Pnn with Pt.
   b. If the root count of the result is less than k, expand Pt by removing the rightmost bit from the interval value (i.e., intervals 01 and 00 become 0, and intervals 10 and 11 become 1). This could be done by recalculating Pt while disregarding the rightmost bit P-tree. Repeat this step until the root count of Pnn AND Pt is greater than or equal to k.
   c. Else, put the result in Pnn.
   d. Loop.
4. End of selection phase.
Figure 5. Selection algorithm.
After creating and sorting the term P-trees according to the values
in dnew as described in step 2 of the algorithm (the P-trees for
terms having higher interval values in dnew will be processed
before other term P-trees with lower values), the algorithm
sequentially ANDs each term P-tree, Pt, with Pnn, always making sure that the root count of the result is greater than or equal to k. Should the root count drop below k, the ANDing operation that resulted in this drop is undone and the Pt involved in that operation is reconstructed by removing the rightmost bit. To see how this happens, consider an example where we are ANDing Pnn with a Pt representing a term with a 10 binary value in dnew. Suppose that the root count of the result of this ANDing operation is less than k. Initially, Pt was calculated as Pt,1 AND P't,2 (because the desired value is 10). To reconstruct Pt by removing the rightmost bit, we assume that the value of t in dnew is 1 instead of 10; so now we can calculate Pt as Pt,1. This process of reconstructing Pt is repeated until either the result of ANDing Pnn with the newly constructed Pt has a root count greater than or equal to k
or the newly assumed value for t has no bits because all the bits
have been previously removed (in this case we say that t has been
ignored). After looping through all the term P-trees, Pnn will hold
the documents that are nearest to dnew.
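The following Python sketch renders the selection phase schematically, using flat bit vectors in place of compressed P-trees; the helper names and the toy interval-coded matrix are illustrative assumptions, not our actual implementation.

def root_count(bits): return sum(bits)
def AND(a, b): return [x & y for x, y in zip(a, b)]

def prefix_ptree(codes, value, n_bits, total_bits=2):
    # Bit group of documents whose top n_bits of the interval code match those of value.
    shift = total_bits - n_bits
    return [1 if (c >> shift) == (value >> shift) else 0 for c in codes]

def select_neighbors(matrix, d_new, k, total_bits=2):
    # matrix[t][d] = interval code of term t in document d; d_new[t] = code in the new document.
    n_docs = len(matrix[0])
    p_nn = [1] * n_docs                                    # identity (pure-1) P-tree
    # Process terms with higher interval values in d_new first.
    for t in sorted(range(len(d_new)), key=lambda t: d_new[t], reverse=True):
        n_bits = total_bits
        while n_bits > 0:
            p_t = prefix_ptree(matrix[t], d_new[t], n_bits, total_bits)
            candidate = AND(p_nn, p_t)
            if root_count(candidate) >= k:
                p_nn = candidate                           # keep the tighter neighborhood
                break
            n_bits -= 1                                    # drop the rightmost bit and retry
        # if n_bits reaches 0 the term is ignored (it would shrink the neighborhood below k)
    return p_nn

# 4 terms x 6 documents with codes 0..3; the new document has codes (3, 2, 0, 1).
matrix = [
    [3, 1, 3, 0, 2, 3],
    [2, 2, 0, 1, 2, 3],
    [0, 0, 1, 0, 0, 2],
    [1, 3, 1, 1, 0, 1],
]
print(select_neighbors(matrix, [3, 2, 0, 1], k=2))   # [1, 0, 0, 0, 0, 1]: documents 1 and 6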
After finding the k nearest neighbors (or more since the root count
of Pnn might be greater than k due to use of the “closed
neighborhood” which has been proven to improve accuracy [2])
through the selection phase, the algorithm proceeds with the
voting phase to find the target class label of document dnew. For
voting purposes, every neighboring document (i.e. every document selected in the selection phase) is given a voting weight based on its similarity to dnew. For every class label ci, we loop through all terms t in dnew and calculate the number of nearest neighbors having the same value for t as dnew. To see how this works, suppose we have the following document: dnew(v1, v2, v3, …, vn), where vj is the interval value for term j in dnew. We
need to calculate Pj for each term j with value vj and then AND it
with Pnn (to make sure we are only considering the selected
neighbors) and with the P-tree representing documents having
class label ci, Pi. Multiply the root count of the resulting P-tree by
its predefined weight which is derived from the value of term j in
dnew, Ij, and is equal to the numeric value of Ij + 1 (add 1 to
handle the case where Ij’s numeric value is 0). The resulting value
is added to the counter maintained for class label ci. Let Pnn
denote the P-tree representing the neighbors from the selection
phase and Pi denote the P-tree for the documents having class
label ci. A formal description of the voting phase is given in
Figure 6.
1. For every class ci, loop through the dnew vector and do the following for every term t in dnew:
   a. Get the P-tree representing the neighboring documents (Pnn from the selection phase) having the same value for t (Pt) and class ci (Pi). This could be done by calculating Presult = Pnn AND Pt AND Pi.
   b. If the term under consideration has value Ij in dnew, multiply the root count of Presult by (Ij + 1).
   c. Add the result to the counter of ci, w(ci).
   d. Loop.
2. Select the class ck having the largest counter, w(ck), as the respective class of dnew.
3. End of voting phase.
Figure 6. Voting algorithm.
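A schematic rendering of the voting phase is sketched below, again over flat bit vectors in place of compressed P-trees; the toy data continues the selection example and is purely illustrative.

def root_count(bits): return sum(bits)
def AND(*groups):
    return [min(col) for col in zip(*groups)]   # AND across several bit groups

def vote(matrix, d_new, p_nn, labels, classes):
    # matrix[t][d] = interval code of term t in document d; labels[d] = class of document d.
    n_docs = len(labels)
    weights = {}
    for ci in classes:
        p_i = [1 if labels[d] == ci else 0 for d in range(n_docs)]   # class P-tree Pi
        w = 0
        for t, vj in enumerate(d_new):
            p_t = [1 if matrix[t][d] == vj else 0 for d in range(n_docs)]
            p_result = AND(p_nn, p_t, p_i)                # Pnn AND Pt AND Pi
            w += root_count(p_result) * (vj + 1)          # weight by the interval value + 1
        weights[ci] = w
    return max(classes, key=lambda ci: weights[ci]), weights

# Continuing the selection example: neighborhood [1, 0, 0, 0, 0, 1], three classes.
matrix = [
    [3, 1, 3, 0, 2, 3],
    [2, 2, 0, 1, 2, 3],
    [0, 0, 1, 0, 0, 2],
    [1, 3, 1, 1, 0, 1],
]
labels = ["earn", "acq", "earn", "corn", "acq", "earn"]
winner, w = vote(matrix, [3, 2, 0, 1], [1, 0, 0, 0, 0, 1], labels, ["earn", "acq", "corn"])
print(winner, w)   # 'earn' wins the vote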
5. PERFORMANCE ANALYSIS STUDY
For the purpose of performance analysis, our algorithm has been
tested against randomly selected subsets from the Reuters-21578 collection available at the University of California–Irvine data collections store (http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html).
We compare our speed and accuracy to the original KNN
approach which uses the cosine similarity measure between
document vectors to decide upon the nearest neighbors. Since our
aim is to propose a globally “good” text classifier and not only to
outperform the original KNN approach, we also compare our
results in terms of accuracy only to the string kernels approach [3]
(based on support vector machines) which reports good results on
small subsets of this collection (better than those reported on the
whole collection).
To get deeper insight into our performance, we empirically experimented on random subsets of documents having one of the following four classes: acq, earn, corn, and crude. We
chose those 4 classes because [3] uses them to report its
performance while varying other free parameters (referred to as
the length of the subsequence, k, and weight decay parameter λ).
We used the precision, recall and F1 measurements on each class
averaged over 10 runs as an indication of our accuracy. The F1
measure combines both precision (p) and recall (r) values into a single value, giving the same importance weight to both; it is calculated as F1 = 2pr/(p+r). Large-scale testing over the whole Reuters collection is still underway.
We used k=3 and a 4-interval value set: I0=00=[0,0], I1=01=(0,0.25], I2=10=(0.25,0.75] and I3=11=(0.75,1]. Each of the 10 selected test datasets (1 for each run) was composed of 380 randomly-selected samples for neighbor selection (distributed as follows: 152 samples from class earn, 114 samples from class acq, 76 samples from class crude and 38 samples from class corn) and 90 randomly-selected samples for testing (distributed as follows: 40 samples from class earn, 25 samples from class acq, 15 samples from class crude and 5 samples from class corn) – 470 samples in total. This is the same sampling process reported in [3]. Table 1 lists comparative precision, recall and F1 measurements showing how well we compare to the cosine-similarity-based KNN and the string kernels⁴ approaches. The values for our approach and the KNN approach are averaged over 10 runs to enable comparison with the values for the kernels approach as reported in [3]. For each of the three approaches, we show the precision, recall and F1 measurements over each of the four considered classes. The F1 measurements are then plotted in Figure 7. Table 2 lists the effect of using different matrix sizes on the total time needed by our approach and by the KNN approach to select the nearest neighbors of a given document. Figure 8 plots the time measurements graphically.
Table 1. Measurements comparison table.
Figure 7. F1 measure comparison graph.
Table 2. Time comparison table.
⁴ Values for precision, recall and F1 measure were selected from [3] for k=5 and λ=0.3, which were among the highest reported.
Figure 8. Time comparison graph.
Compared to the KNN approach, our approach shows much better
results in terms of speed and accuracy. The improvement in speed is mainly due to the difference in complexity between the two algorithms. Usually, the high cost of KNN-based algorithms is
mainly associated with their selection phases. The selection phase
in our algorithm has a complexity of O(n) where n is the number
of dimensions (number of terms) while the KNN approach has a
complexity of O(mn) for its selection phase where m is the size of
the dataset (number of documents or tuples) and n is the number
of dimensions. Drastic improvement is shown when the size of the
matrix is very large (the case of 5000x20000 matrix size in Table
2). As for accuracy, the KNN approach bases its judgment of the similarity between two document vectors on the angle between those vectors, regardless of the actual distance between them. Our approach does a more sophisticated comparison by using ANDing to compare the closeness of the value of each term in the corresponding vectors, and is thus able to judge the distance between the two vectors and not only the angle. Also, terms that would skew the result are ignored in our algorithm, unlike in the KNN approach, which has to include all the terms in the space during the neighbor-selection process.
It is clear that, in all cases, our approach’s accuracy summarized
by the F1 measure is very comparable to that reported in the
kernels approach (for k=5 and λ=0.3). They only show better
results for class crude while we show better results for the rest of
the classes. However, it would not be appropriate to compare
speeds because the two approaches are fundamentally different.
Our approach is example-based while the kernels approach is
eager. In general, after training and testing, eager classifiers tend
to be faster than lazy classifiers; however, they lack the ability to
operate on dynamic streams in which case they need to be
retrained and retested frequently to cope with the change in the
data.
Table 1 shows that the ranges of values for the precision, recall
and F1 measurements in the kernels and the KNN approaches are
wider than ours. For instance, our approach’s precision values
spread over the range [92.6, 98.3] while the KNN approach’s
precision values spread over the range [79.1, 90] and kernels
approach’s precision values spread over the range [91.1, 99.2].
This observation reveals a greater potential for our P-tree-based approach to produce more stable results across different categories, which is a very desirable characteristic.
6. CONCLUSION
In this paper, we have presented an optimized KNN-based text-categorization algorithm characterized by high efficiency and accuracy and based on intervalization and the P-tree technology. This algorithm has been devised with space, time and accuracy considerations in mind.
The high accuracy reported in this work is mainly due to the use
of sequential ANDing in the selection phase which ensures that
only the nearest neighbors are selected for voting. In addition, in
the voting algorithm, each neighboring document gets a vote
weight depending on its similarity to the document being
classified. Also, using the closed k-neighborhood [2] when the number of neighbors returned by our selection algorithm is greater than k has a good effect on accuracy because more nearby documents get to vote.
As for space, we operate on a reduced, compressed space. The first step after generating the normalized TFxIDF document-by-term matrix is to intervalize the data. If we assume, for simplicity only, that every term value is stored in 1 byte or 8 bits and that we have four intervals, then we are able to reduce the size of this matrix by a factor of 4, since each term value stored in a byte is replaced by a 2-bit interval value. After reducing the size of the matrix, we apply compression by creating and exploiting the P-tree version of the matrix.
As for speed, our data is structured in the P-tree format, which
enables us to perform the task without a single database scan. Add
to this the fact that our algorithms are driven completely by the
AND operation which is among the fastest computer instructions.
In the future, we aim to examine more closely the effects of
varying the number of intervals and their ranges over large
datasets including the whole Reuters collection.
7. REFERENCES
[1] Ding, Khan, Roy, and Perrizo, The P-tree algebra.
Proceedings of the ACM SAC, Symposium on Applied
Computing (Madrid, Spain), 2002.
[2] Khan, Ding, and Perrizo, K-nearest neighbor classification
on spatial data streams Using P-trees. Proceedings of the
PAKDD, Pacific-Asia Conference on Knowledge Discovery
and Data Mining (Taipei, Taiwan), 2002.
[3] Lodhi, Saunders, Shawe-Taylor, Cristianini, and Watkins,
“Text classification using string kernels.” Journal of Machine
Learning Research, (2): 419-444, February 2002.
[4] Perrizo, Ding, and Roy, Deriving high confidence rules from spatial data using Peano count trees. Proceedings of the WAIM, International Conference on Web-Age Information Management (Xi'an, China), 91-102, July 2001.
[5] Rahal and Perrizo, Query acceleration in Multi-level secure
database systems using the P-tree technology, Proceedings of
the ISCA CATA, International Conference on Computers
and Their Applications (Honolulu, Hawaii), March 2003.
[6] Salton and Buckley, “Term-weighting approaches in
automatic text retrieval.” Information Processing &
Management, 24(5), 513-523, May 1988.
[7] Salton, Wong, and Yang, "A vector space model for
automatic indexing.” Communications of the ACM 18(11),
613-620, November 1975.