SMART-TV: A Fast and Scalable Nearest Neighbor Based
Classifier for Data Mining
Taufik Abidin
William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
+1 701 231 6257
{taufik.abidin, william.perrizo}@ndsu.edu
ABSTRACT
K-nearest neighbors (KNN) is the simplest method for
classification. Given a set of objects in a multi-dimensional
feature space, the method assigns a category to an unclassified object based on the plurality category of its k-nearest neighbors. The closeness between objects is determined using a distance measure, e.g., Euclidean distance. Despite its simplicity, KNN also has some drawbacks: 1) it suffers from expensive computational cost in training when the training set contains millions of objects; 2) its classification time is linear in the size of the training set. The larger the training set, the longer it takes to
search for the k-nearest neighbors. In this paper, we propose a
new algorithm, called SMART-TV (SMall Absolute diffeRence of
ToTal Variation), that approximates a set of potential candidates
of nearest neighbors by examining the absolute difference of total
variation between each data object in the training set and the
unclassified object. The k-nearest neighbors are then searched
from that candidate set. We empirically evaluate the performance
of our algorithm on both real and synthetic datasets and find that
SMART-TV is fast and scalable. The classification accuracy of
SMART-TV is high and comparable to the accuracy of the
traditional KNN algorithm.
General Terms
Algorithms
Keywords
K-Nearest Neighbors Classification, Vertical Data Structure,
Vertical Set Squared Distance.
1. INTRODUCTION
Classification on large datasets has become one of the most
important research priorities in data mining. Given a collection of
labeled (pre-classified) data objects, the classification task is to
label a newly encountered and unlabeled object with a pre-defined
class label. Classification algorithms such as k-nearest neighbors have shown good performance on various datasets. However, when the training set is very large, i.e., it contains millions of objects,
a sequential search for the k-nearest neighbors in the training set is not the best approach, as it increases the classification time significantly. A variety of new approaches has been developed to accelerate the classification process by improving the KNN search, such as [5][8][12], to mention just a few.
In this paper, we propose a new algorithm that aims at expediting
the k-nearest neighbors search. The algorithm derives from the
basic structure of the KNN algorithm. However, unlike KNN, in
which the k-nearest neighbors are searched from the entire
training set, in SMART-TV, they are searched from the candidates
of nearest neighbors. The candidates are approximated by
examining the absolute difference (gap) of total variation between
data objects in the training set and the unclassified object. A small
gap implies that the objects are possibly near the unclassified
object. A parameter hs, the number of data objects with the smallest gaps that will be considered, is required by the algorithm and can be set by the user. The empirical results show that with a relatively small hs, a high classification accuracy can still be obtained. The total variation of a set X about a point a, denoted TV(X, a), is computed using a fast, efficient, and scalable vertical set squared distance function [1]. The total variations of each class about each data point in that class are measured once in the preprocessing phase and stored for later use in classification. We store the total variations of data points because, in classification problems, the sets of pre-classified data points are known and unchanged; thus, the total variations of all data points in these sets can be pre-computed.
Specifically, our contributions are:
1. We introduce a quick and scalable way to approximate the candidates of nearest neighbors by transforming all data points in a multi-dimensional feature space into a single-dimensional total variation space and examining their absolute difference from the total variation of the unclassified point.
2. We propose a new nearest neighbor-based classification algorithm that classifies large datasets much more quickly and scalably than the traditional KNN algorithm.
3. We demonstrate that the vertical total variation gap is a potent means of expediting nearest neighbor-based classification and is worth investigating when speed and scalability are the issues.
The performance evaluations on both real and synthetic datasets
show that the speed of our algorithm outperforms the speed of the
traditional KNN and P-tree-based KNN (P-KNN) algorithms. In addition, the classification accuracy of our algorithm is high and comparable to that of the KNN algorithm.
The remainder of the paper is organized as follows. In section 2,
we review some related works. In section 3, we present the
vertical data structure and show the derivation of the vertical set
squared distance function to vertically compute a total variation.
In section 4, we discuss our approach. In section 5, we report the
empirical results, and in section 6, we summarize our conclusions.
2. RELATED WORKS
Classification based on k-nearest neighbors was first investigated by Cover and Hart [3]. It can be summarized as follows: search for the k-nearest neighbors and assign a class label to the unclassified sample based on the plurality category of those neighbors. KNN is simple and robust. However, it has some drawbacks. First, KNN suffers from expensive computational cost in training. Second, its classification time is linear in the size of the training set: the larger the training set, the longer it takes to search for the k-nearest neighbors, and thus the more time is needed for classification. Third, the classification accuracy is very sensitive to the choice of the parameter k, and in most cases the user has no intuition regarding this choice. An extension of KNN has been proposed to address the sensitivity to k [9]. The idea is to adjust k based on the local density of the training set. However, this approach improves the classification accuracy only slightly, and the classification time remains high when the training set is very large.
P-KNN [8] employs a vertical data structure to accelerate classification and uses Higher Order Bit (HOB) as the distance metric. HOB removes one bit at a time, starting from the least significant bit position, to search for the nearest neighbors. Experiments showed that P-KNN is fast and accurate on spatial data. However, the squared ring of HOB cannot expand evenly on both sides when a bit is removed, which consequently can reduce the classification accuracy.
Another improvement of KNN speeds up the search by using a kd-tree in place of the linear search [5]. This approach reduces the search complexity from O(n) to O(log n). However, the O(log n) behavior is realized only when the data points are dense.
BOND [12] improves the KNN search by projecting each dimension into a separate table. Each dimension is scanned to estimate the lower and upper bounds of the k-nearest neighbors, and the data points that fall outside these bounds are then discarded. This strategy significantly reduces the number of candidate nearest neighbors. Shrinking the space in which the k-nearest neighbors are searched is also the main idea of the SMART-TV algorithm. However, SMART-TV estimates the candidates of nearest neighbors by examining the absolute difference of total variation between the data points in the training set and the unclassified point.
3. PRELIMINARIES
3.1 Vertical Data Structure
The vertical data structure used in this work is also known as the P-tree¹ [10]. It is created by converting a relational table R(A1, A2, …, Ad) of horizontal records, normally the training set, into a set of P-trees by decomposing each attribute in the table into separate bit vectors, one for each bit position of the numeric values in that attribute, and storing them separately in a file. Each bit vector can then be kept as a 0-dimensional P-tree (an uncompressed vertical bit vector) or constructed into a 1-dimensional, 2-dimensional, or multi-dimensional P-tree in the form of a tree. The creation of a 1-dimensional P-tree, for example, is done by recursively dividing the bit vector into halves and recording the truth of the "purely 1-bits" predicate; a predicate of "1" indicates that the bits at the subsequent level are all "1", while "0" indicates otherwise. In this work, we used 0-dimensional P-trees, i.e., the uncompressed vertical bit vectors.
The logical operations AND (∧), OR (∨), and NOT (') are the main operations on this data structure. The advantage is gained when performing aggregate operations such as the root count, which counts the 1-bits of a basic P-tree or of the P-tree resulting from any sequence of logical operations. Refer to [4] for more information about the P-tree vertical data structure and its operations.
¹ The P-tree is patented by NDSU, United States Patent 6,941,303.
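To make the 0-dimensional P-tree representation concrete, the short sketch below (our illustration in Python with NumPy, not part of the original system) bit-slices a numeric attribute into vertical bit vectors and evaluates a root count through AND operations; the names bit_slice and root_count are ours, and the sketch assumes that attribute values fit in a fixed bit-width b.

import numpy as np

def bit_slice(column, b):
    # Decompose a column of non-negative integers into b vertical bit vectors.
    # Bit vector j holds bit j (j = 0 is the least significant) of every value.
    column = np.asarray(column, dtype=np.int64)
    return [((column >> j) & 1).astype(bool) for j in range(b)]

def root_count(*bit_vectors):
    # Root count: the number of 1-bits in the AND of the given bit vectors.
    result = bit_vectors[0].copy()
    for bv in bit_vectors[1:]:
        result &= bv
    return int(result.sum())

# Tiny example: one 3-bit attribute and a class mask P_X.
values = [5, 3, 7, 1]                      # binary: 101, 011, 111, 001
P = bit_slice(values, b=3)                 # P[0] = least significant bit vector
P_X = np.array([True, True, False, True])  # mask of the points belonging to class X
print(root_count(P_X, P[0]))               # 1-bits of P_X AND P_0 -> 3
print(root_count(P_X, P[2], P[0]))         # 1-bits of P_X AND P_2 AND P_0 -> 1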
3.2 Vertical Set Squared Distance
Let x_i be a value in the numerical domain of attribute A_i of relation R(A_1, A_2, …, A_d). Then x_i in b-bit binary representation is written as

x_i = x_{i(b-1)} x_{i(b-2)} \cdots x_{i0} = \sum_{j=0}^{b-1} 2^j x_{ij}

The first subscript of x refers to the attribute to which x belongs, and the second subscript refers to the bit position. The summation on the right-hand side is the actual value of x in base 10.

Let x be a vector in a d-dimensional space. Then x in b-bit binary representation can be written as

x = ( x_{1(b-1)} \cdots x_{10},\; x_{2(b-1)} \cdots x_{20},\; \ldots,\; x_{d(b-1)} \cdots x_{d0} )

Let X be a set of vectors in a relation R, x ∈ X, and let a be either a training point or an unclassified point. The total variation of X about a, denoted TV(X, a), measures the cumulative squared separation of the vectors in X about a; it transforms the multi-dimensional feature space of a into a single-dimensional total variation space. The total variation can be measured efficiently and scalably using the vertical set squared distance, derived as follows:

(X - a) \circ (X - a)
  = \sum_{x \in X} (x - a) \circ (x - a)
  = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2
  = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 - 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i + \sum_{x \in X} \sum_{i=1}^{d} a_i^2
  = T_1 - 2 T_2 + T_3

Expanding each x_i into its bit representation, the first term becomes

T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2
    = \sum_{x \in X} \sum_{i=1}^{d} \bigl( 2^{b-1} x_{i(b-1)} + 2^{b-2} x_{i(b-2)} + \cdots + 2^{0} x_{i0} \bigr)^2
    = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \Bigl( 2^{2j} \sum_{x \in X} x_{ij} + \sum_{l=0}^{j-1} 2^{\,j+l+1} \sum_{x \in X} x_{ij} x_{il} \Bigr)

Let P_X be the P-tree mask of the set X, which can quickly mask all data points in X. Then the equation above can be simplified by replacing \sum_{x \in X} x_{ij} with rc(P_X \wedge P_{ij}) and \sum_{x \in X} x_{ij} x_{il} with rc(P_X \wedge P_{ij} \wedge P_{il}). Thus, we get

T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \Bigl( 2^{2j}\, rc(P_X \wedge P_{ij}) + \sum_{l=0}^{j-1} 2^{\,j+l+1}\, rc(P_X \wedge P_{ij} \wedge P_{il}) \Bigr)

Similarly,

T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{ij})

T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = rc(P_X) \sum_{i=1}^{d} a_i^2

Hence, the vertical set squared distance is defined as

(X - a) \circ (X - a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \Bigl( 2^{2j}\, rc(P_X \wedge P_{ij}) + \sum_{l=0}^{j-1} 2^{\,j+l+1}\, rc(P_X \wedge P_{ij} \wedge P_{il}) \Bigr) - 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\, rc(P_X \wedge P_{ij}) + rc(P_X) \sum_{i=1}^{d} a_i^2

Note that the root count operations are independent of a. These are the root count of the P-tree class mask, rc(P_X), the two-operand root counts rc(P_X \wedge P_{ij}) for 1 ≤ i ≤ d and 0 ≤ j ≤ b-1, and the three-operand root counts rc(P_X \wedge P_{ij} \wedge P_{il}) for 1 ≤ i ≤ d and 0 ≤ l < j ≤ b-1.

In classification problems where the classes are predefined, the independence of the root count operations from the input vector a is an advantage. It allows us to run the root count operations once in advance and keep the resulting root count values. Later, when the total variation of a class X about a given point is measured, for example the total variation of class X about an unclassified point, the root count values that have already been computed can be reused. This reusability expedites the computation of total variation significantly [1].
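As an illustration of how the stored root counts are reused, the sketch below (our own Python, not the authors' C++ implementation) evaluates TV(X, a) = (X - a) o (X - a) from the T_1 - 2T_2 + T_3 decomposition above; the function name total_variation and the containers rc_one and rc_two are hypothetical layouts for the precomputed two-operand and three-operand root counts.

def total_variation(a, rc_x, rc_one, rc_two, b):
    # Vertical set squared distance TV(X, a) = T1 - 2*T2 + T3.
    #   a      : attribute values of the point (length d)
    #   rc_x   : rc(P_X), the number of points in class X
    #   rc_one : rc_one[i][j]    = rc(P_X ^ P_ij)               (assumed layout)
    #   rc_two : rc_two[i][j][l] = rc(P_X ^ P_ij ^ P_il), l < j (assumed layout)
    #   b      : bit-width of the attributes
    d = len(a)
    t1 = sum((1 << (2 * j)) * rc_one[i][j]
             + sum((1 << (j + l + 1)) * rc_two[i][j][l] for l in range(j))
             for i in range(d) for j in range(b))
    t2 = sum(a[i] * sum((1 << j) * rc_one[i][j] for j in range(b)) for i in range(d))
    t3 = rc_x * sum(ai * ai for ai in a)
    return t1 - 2 * t2 + t3

Because only the stored counts are touched, each evaluation costs on the order of d*b^2 arithmetic operations, independent of the number of points in X.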
4. PROPOSED ALGORITHM
4.1 Total Variation Gap


The total variation gap g of two vectors x1 and x2 of a set X is defined as the absolute difference of TV(X, x1) and TV(X, x2), i.e., g = |TV(X, x1) - TV(X, x2)|. In the SMART-TV algorithm, we examine the gap between the total variation of each data point in the training set and the total variation of the unclassified point. The main goal is to approximate a number of points in each class that are possibly "near" the unclassified point. Although the examination of the gap cannot guarantee that every selected point is truly close to the unclassified point, it can quickly approximate a superset of the nearest neighbors in each class. In fact, we can increase the chance of including more of the true nearest neighbors in the candidate sets by considering more data points with a small total variation gap. For this reason, a parameter hs is introduced, which specifies the number of points in each class that will be considered in the candidate set. Our empirical results show that with a relatively small hs, i.e., 25 ≤ hs ≤ 50, a high classification accuracy can still be obtained.
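A minimal sketch of the filtering idea, in Python and under the assumption that the total variations of one class have already been precomputed into a list tv_train: every training point is projected onto the one-dimensional total variation axis, and the hs points with the smallest gap to the unclassified point's total variation are kept.

def filter_candidates(tv_train, tv_a, hs):
    # Return the indices of the hs training points of one class whose
    # total variation gap |TV(C, x_p) - TV(C, a)| is smallest.
    #   tv_train : precomputed TV(C, x_p) values, one per training point
    #   tv_a     : TV(C, a) for the unclassified point a
    #   hs       : number of candidates to keep for this class
    ranked = sorted(range(len(tv_train)), key=lambda p: abs(tv_train[p] - tv_a))
    return ranked[:hs]

Keeping the per-class total variation values sorted would let the same hs points be located with a binary search around TV(C, a) instead of a full sort; either realization fits the description above.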
4.2 SMART-TV Algorithm
The SMART-TV algorithm consists of two phases: preprocessing and classifying. In the preprocessing phase, all steps are conducted only once, whereas in the classifying phase the steps are repeated for each classification. The preprocessing steps are:
1) The computation of the root counts of the P-tree operands of each class Cj, where 1 ≤ j ≤ number of classes. The complexity of this computation is O(kdb^2), where k is the number of classes, d is the number of dimensions, and b is the bit-width.
2) The computation of TV(Cj, xi) for every xi ∈ Cj, 1 ≤ j ≤ number of classes. The complexity of this computation is O(n), where n is the cardinality of the training set.
The root count and total variation values generated in steps 1 and 2 above are stored in files and are loaded during classification.
The classifying phase consists of four steps. The first step is the filtering step. The second, third, and fourth steps are similar to those of the KNN algorithm; the difference is that the k-nearest neighbors are not searched from the original training set but from the candidate points filtered in the first step. The classifying steps are summarized as follows:
1) For each class Cj, where 1 ≤ j ≤ number of classes, do the following:
   a) Compute TV(Cj, a), where a is the unclassified point.
   b) Find the hs points in Cj whose total variation gaps to a are the smallest, i.e., fill an array A of length hs such that A.t, 1 ≤ t ≤ hs, holds the point xp ∈ Cj with the t-th smallest gap |TV(Cj, xp) - TV(Cj, a)| among the points not already stored in A.1, …, A.(t-1).
   c) Store the IDs of the points in A into the array TVGapList.
2) For each point IDt, 1 ≤ t ≤ Len(TVGapList), where Len(TVGapList) is equal to hs times the total number of classes, fetch the feature attributes of the point x_t and measure the Euclidean distance d_2(x_t, a) = \sqrt{\sum_{i=1}^{d} (x_i - a_i)^2}.
3) Find the k nearest points from the list and vote on a class for a.
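Putting the classifying steps together, the following sketch (ours, in Python; total_variation and filter_candidates refer to the illustrative helpers above, and the dictionary layout of classes is an assumption, not the authors' data format) walks through the classification of a single unclassified point a.

import math
from collections import Counter

def classify(a, classes, hs, k):
    # classes maps a class label to a dict with (illustrative layout):
    #   'points': training points of the class, each a list of d values
    #   'tv'    : precomputed TV(C_j, x_i) values aligned with 'points'
    #   'tv_of' : callable computing TV(C_j, a) from the stored root counts
    # Step 1 (filtering): collect hs candidates per class by total variation gap.
    candidates = []
    for label, c in classes.items():
        tv_a = c['tv_of'](a)
        for idx in filter_candidates(c['tv'], tv_a, hs):
            candidates.append((label, c['points'][idx]))
    # Step 2: Euclidean distance from a to every candidate point.
    scored = [(math.dist(a, x), label) for label, x in candidates]
    # Steps 3-4: take the k nearest candidates and vote on a class label for a.
    scored.sort(key=lambda s: s[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]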
5. EMPIRICAL RESULTS
We compared SMART-TV with the KNN and P-KNN algorithms. P-KNN has been developed in the DataMIME™ system [11]. We measured the classification accuracy using the F-score and used datasets of different cardinalities to test scalability.
5.1 Datasets
We conducted the experiments on both real and synthetic datasets. The first dataset is the network intrusion dataset used in KDDCUP 1999 [6]. This dataset contains approximately 4.8 million records; we normalized both the training and testing sets. We selected six different classes, normal, ipsweep, neptune, portsweep, satan, and smurf, each of which contains at least 10,000 records. We discarded the categorical attributes, because our method only deals with numerical attributes, leaving 32 numerical attributes. To analyze the scalability of the algorithms with respect to dataset size, we generated four sub-sampled datasets of different cardinalities from this dataset by selecting records randomly while maintaining the class distribution proportionally. The cardinalities of these sampled datasets are 10,000 (SS-I), 100,000 (SS-II), 1,000,000 (SS-III), and 2,000,000 (SS-IV). In addition, we randomly selected 120 records, 20 records from each class, for the testing set.
The second dataset is the OPTICS dataset [2]. This dataset was originally used for clustering tasks, so, to make it suitable for our classification task, we carefully added a class label to each data point based on the original clusters found in the dataset. The class labels are CL-1, CL-2, CL-3, CL-4, CL-5, CL-6, CL-7, and CL-8. The OPTICS dataset contains 8,000 points in a 2-dimensional space. In this experiment, we took 7,920 points for the training set and randomly selected 80 points, 10 from each class, for the testing set.
The third dataset is the IRIS dataset [7]. This dataset contains three classes: iris-setosa, iris-versicolor, and iris-virginica. IRIS is the smallest dataset, containing only 150 records with 4 attributes. Thirty points were selected randomly for the testing set, and the remaining 120 points were used for the training set. We measured the classification accuracy of the algorithms on this dataset to see how well our algorithm classifies a dataset containing classes of points that are not linearly separable; in the IRIS dataset, only the iris-setosa class is linearly separable from the others. We neither determined the scalability nor measured the speed of the algorithms on this dataset because it is very small.
5.2 Classification Accuracy Measurement
Different classification accuracy measures have been introduced in the literature. In this work, we use the F-score, defined as

F = \frac{2 P R}{P + R}

where P is precision, the ratio of the number of points correctly assigned to a class to the total number of points assigned to that class, and R is recall, the ratio of the number of points correctly assigned to a class to the actual number of points in that class. The higher the score, the better the accuracy.
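For concreteness, a small helper (ours, not from the paper) that computes the F-score of one class from the true-positive count, the false-positive count, and the actual number of points in the class, matching the definitions above:

def f_score(tp, fp, actual):
    # Precision P = tp / (tp + fp); recall R = tp / actual; F = 2PR / (P + R).
    precision = tp / (tp + fp)
    recall = tp / actual
    return 2 * precision * recall / (precision + recall)

# Example: 18 points of a class correctly assigned, 0 false positives,
# 20 points actually in the class -> P = 1.00, R = 0.90, F ~ 0.95.
print(round(f_score(18, 0, 20), 2))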
5.3 Experimental Setup and Comparison
The experiments were performed on an Intel Pentium 4 CPU 2.6
GHz machine with 3.8GB RAM, running Red Hat Linux version
2.4.20-8smp. All algorithms were implemented in the C++ programming language.
5.3.1 Speed and Scalability Comparison
We compared the performance of the algorithms in terms of speed and scalability using the KDDCUP datasets. As mentioned before, the cardinality (N) of these datasets varies from 10,000 to 4,891,469. We evaluated the algorithms' running time using k = 5. Figure 1(a) compares the algorithms' running time for varying dataset cardinality. We found that SMART-TV is much faster than KNN and P-KNN. For example, for the second largest dataset, N = 2,000,000, with hs = 25, SMART-TV takes only 3.88 seconds on average to classify, P-KNN requires about 12.44 seconds, and KNN takes about 49.28 seconds. It is clear that with increasing cardinality, KNN scales super-linearly. Figure 1(b) shows the SMART-TV running time against varying hs on the sub-sampled dataset of size 1,000,000; we also used k = 5 in this observation. We found that the SMART-TV running time grows linearly with hs, although the increase is small. The higher the hs, the more points there are in the candidate set, and thus the more candidate points SMART-TV must examine to find the k-nearest neighbors.
Figure 1. (a) Running time (in seconds) of the algorithms against varying cardinality (x 1000); (b) SMART-TV running time (in seconds) against varying hs on a dataset of 1,000,000 records.
We also discovered that the SMART-TV and P-KNN algorithms scale very well. They were able to classify the unclassified samples using the dataset of size 4,891,469. For this large dataset, SMART-TV takes about 9.27 seconds and P-KNN about 30.79 seconds on average to classify. Note that both algorithms use the vertical data structure as their underlying structure, which makes them scale well to very large datasets. In contrast, KNN uses a horizontal, record-based data structure and, using the same resources, failed to load the 4,891,469 training points into memory and thus failed to classify. This demonstrates that KNN requires more resources to scale to a very large dataset.
5.3.2 Classification Accuracy Comparison
We also found that the overall classification accuracy of SMART-TV moves toward the classification accuracy of the KNN algorithm as hs increases (see Table 1). For example, when using hs = 50 on the 2,000,000-record dataset (SS-IV), the difference between the overall accuracy of KNN and SMART-TV is only 1%. This difference indicates that, among the 300 candidate closest points filtered by SMART-TV, most are the right nearest points. In fact, the classification accuracy of KNN and SMART-TV became exactly the same when we used hs = 100. Similarly, for the OPTICS dataset, when hs = 25, SMART-TV successfully filtered the right candidate points. Furthermore, when we used the IRIS dataset, which contains only 120 training points, and specified hs = 100 or hs = 125, we actually did not filter any points; in fact, we took all data points in the dataset, because (hs x number of classes) is greater than the number of points in the dataset. In this situation, SMART-TV is no different from the KNN algorithm, because the k-nearest neighbors are searched from the entire training set.

Table 1. Classification accuracy of SMART-TV against varying hs compared to KNN. Both algorithms used k = 5.
Dataset                   hs=25   hs=50   hs=75   hs=100  hs=125   KNN
Network Intrusion (NI)    0.93    0.93    0.96    0.97    0.96     NA
SS-I                      0.96    0.94    0.96    0.96    0.96     0.89
SS-II                     0.92    0.96    0.96    0.96    0.97     0.97
SS-III                    0.94    0.96    0.96    0.96    0.96     0.96
SS-IV                     0.92    0.96    0.96    0.97    0.97     0.97
OPTICS                    0.96    0.96    0.96    0.97    0.97     0.97
IRIS                      0.97    0.96    0.96    0.97    0.97     0.97

We examined the accuracy of the algorithms using different values of k. However, due to space limitations, we only show the result for k = 5 on the dataset of size 1,000,000 in Table 2. We specified hs = 25 for the SMART-TV algorithm.

Table 2. Classification accuracy comparison for the SS-III dataset.
Algorithm   Class       TP   FP   P      R      F
SMART-TV    normal      18   0    1.00   0.90   0.95
            ipsweep     20   1    0.95   1.00   0.98
            neptune     20   0    1.00   1.00   1.00
            portsweep   18   0    1.00   0.90   0.95
            satan       17   2    0.90   0.85   0.87
            smurf       20   4    0.83   1.00   0.91
P-KNN       normal      20   4    0.83   1.00   0.91
            ipsweep     20   1    0.95   1.00   0.98
            neptune     15   0    1.00   0.75   0.86
            portsweep   20   0    1.00   1.00   1.00
            satan       14   1    0.93   0.70   0.80
            smurf       20   5    0.80   1.00   0.89
KNN         normal      20   3    0.87   1.00   0.93
            ipsweep     20   1    0.95   1.00   0.98
            neptune     20   0    1.00   1.00   1.00
            portsweep   18   0    1.00   0.90   0.95
            satan       17   1    0.94   0.85   0.89
            smurf       20   0    1.00   1.00   1.00

In this observation, we found that the classification accuracy of SMART-TV is high and very comparable to the accuracy of KNN; most classes have an F-score above 90%. The same phenomenon is also found in the other datasets.
Figure 2 compares the algorithms' overall accuracy (the average F-scores) on all datasets. We were not able to show the comparison for KNN on the largest dataset (NI) because it terminated while loading the dataset into memory.
Figure 2. Classification accuracy comparison: average F-score of SMART-TV, P-KNN, and KNN on the IRIS, OPTICS, SS-I, SS-II, SS-III, SS-IV, and NI datasets.
From this observation, we conclude that with a proper hs, the examination of the absolute difference of total variation between the data points and the unclassified point can be used to approximate the candidates of nearest neighbors. Thus, when speed is the issue, our approach efficiently expedites the nearest neighbor search.
6. CONCLUSIONS
In this paper, we have presented a new classification algorithm that begins its classification by filtering a set of candidate nearest neighbors, obtained by examining the absolute difference of total variation between the data points in the training set and the unclassified point. The k-nearest neighbors are subsequently searched from those candidates to determine the appropriate class label for the unclassified point.
We have conducted extensive performance evaluations in terms of speed, scalability, and classification accuracy. We found that the speed of our algorithm surpasses that of the KNN and P-KNN algorithms. Our algorithm takes less than 10 seconds to classify a new sample using a training set of more than 4.8 million records. In addition, our method scales very well and classifies with high accuracy. In future work, we will devise a strategy for automatically determining the hs value, e.g., from the inherent features of the training set.
7. REFERENCES
[1] Abidin, T., et al. Vertical Set Squared Distance: A Fast and
Scalable Technique to Compute Total Variation in Large
Datasets. Proceedings of the International Conference on
Computers and Their Applications (CATA), 2005, 60-65.
[2] Ankerst, M., et al. OPTICS: Ordering Points to Identify the
Clustering Structure. Proceedings of the ACM SIGMOD,
1999, 49-60.
[3] Cover, T.M. and Hart, P.E. Nearest Neighbor Pattern Classification. IEEE Trans. on Info. Theory, 13, 1967, 21-27.
[4] Ding, Q., Khan, M., Roy, A., and Perrizo, W. The P-tree
Algebra, Proceedings of the ACM SAC, 2002, 426-431.
[5] Grother, P.J., Candela, G.T., and Blue, J.L. Fast
Implementations of Nearest Neighbor Classifiers. Pattern
Recognition, 30, 1997, 459-465.
[6] Hettich, S. and Bay, S. D. The UCI KDD Archive
http://kdd.ics.uci.edu. Irvine, University of California, CA.,
Department of Information and Computer Science, 1999.
[7] Iris Dataset, http://www.ailab.si/orange/doc/datasets/iris.htm.
[8] Khan, M., Ding, Q., and Perrizo, W. K-Nearest Neighbor
Classification of Spatial Data Streams using P-trees,
Proceedings of the PAKDD, 2002, 517-528.
[9] Mitchell, H. B. and Schaefer, P.A. A Soft K-Nearest
Neighbor Voting Scheme. International Journal of
Intelligent Systems, 16, 2001, 459-469.
[10] Perrizo, W. Peano Count Tree Technology, Technical Report
NDSU-CSOR-TR-01-1, 2001.
[11] Serazi, M., et al. DataMIME™. Proceedings of the ACM
SIGMOD, 2004, 923-924.
[12] Vries, A.P., et al. Efficient k-NN Search on Vertically
Decomposed Data. Proceedings of the ACM SIGMOD, 2002,
322-333.