Using Representative-Based Clustering for Nearest Neighbor Dataset Editing
Christoph F. Eick, Nidal Zeidat, and Ricardo Vilalta
Dept. of Computer Science, University of Houston
{ceick, nzeidat, vilalta}@cs.uh.edu
Abstract
The goal of dataset editing in instance-based learning
is to remove objects from a training set in order to
increase the accuracy of a classifier. For example, Wilson
editing removes training examples that are misclassified
by a nearest neighbor classifier so as to smooth the shape
of the resulting decision boundaries. This paper revolves
around the use of representative-based clustering
algorithms for nearest neighbor dataset editing. We term
this approach supervised clustering editing. The main
idea is to replace a dataset by a set of cluster prototypes.
A novel clustering approach called supervised clustering
is introduced for this purpose. Our empirical evaluation
using eight UCI datasets shows that both Wilson and
supervised clustering editing improve accuracy on more
than 50% of the datasets tested. However, supervised
clustering editing achieves four times higher compression
rates than Wilson editing.
1. Introduction
Nearest neighbor classifiers have received considerable attention from the research community (for a survey, see Toussaint [3]). Most of the research aims at producing time-efficient versions of the algorithm. For example, several condensing techniques have been proposed that replace the set of training examples O by a smaller set OC ⊂ O such that all examples in O are still classified correctly by a NN-classifier that uses OC.
Dataset editing techniques, on the other hand, aim at replacing a dataset O with a (usually smaller) dataset OE, with the goal of improving the accuracy of a NN-classifier. A popular technique in this category is Wilson
editing [5]; it removes all examples that have been
misclassified by the 1-NN rule from a dataset. Wilson
editing cleans interclass overlap regions, thereby leading
to smoother boundaries between classes. It has been
shown by Penrod and Wagner [2] that the accuracy of a
Wilson edited nearest neighbor classifier converges to
Bayes’ error as the number of examples approaches
infinity. Figure 1.a shows a hypothetical dataset where examples that are misclassified using the 1-NN rule are marked with circles around them. Figure 1.b shows the reduced dataset after applying Wilson editing.
Figure 1: Wilson editing for a 1-NN classifier. (a) Hypothetical dataset; (b) dataset edited using Wilson's technique (axes: Attribute1 vs. Attribute2).
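To make the rule concrete, the following minimal Python sketch implements Wilson editing as described above; the function name and the leave-one-out handling are our own illustrative choices, not details taken from the paper (the Manhattan distance anticipates the setup of Section 3):

```python
import numpy as np

def wilson_edit(X, y):
    """Wilson editing: flag every example that the 1-NN rule (computed
    over all *other* examples) misclassifies, then remove all flagged
    examples at once."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        dists = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance
        dists[i] = np.inf                     # leave-one-out: skip self
        if y[np.argmin(dists)] != y[i]:       # misclassified by 1-NN
            keep[i] = False
    return X[keep], y[keep]
```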
In addition to analyzing the benefits of Wilson editing,
this paper proposes a new approach to nearest neighbor
editing that replaces a dataset by a set of cluster
prototypes using a supervised clustering algorithm [6].
We will refer to this editing technique as supervised
clustering editing (SCE) and to the corresponding nearest
neighbor classifier as nearest representative (NR)
classifier. Section 2 introduces supervised clustering
editing. Section 3 discusses experimental results that
compare Wilson editing, supervised clustering editing,
and traditional, "unedited" nearest-neighbor classifiers.
Section 4 summarizes the results of this paper.
2. Using Supervised Clustering for Dataset
Editing
Because supervised clustering is a novel technique, its goals and objectives are discussed in the first subsection. The subsequent subsections explain how supervised clustering can be used for dataset editing.
2.1. Supervised Clustering
Supervised clustering deviates from traditional
clustering in that it is applied on classified examples with
the objective of identifying clusters having not only
strong cohesion but also class purity. Moreover, in
supervised clustering, we try to keep the number of
clusters small, and objects are assigned to clusters using a
notion of closeness with respect to a given distance
function. Consequently, supervised clustering evaluates a
clustering based on the following two criteria:
• Class impurity, Impurity(X). Measured by the
percentage of minority examples in the different
clusters of a clustering X. A minority example is an
example that belongs to a class different from the
most frequent class in its cluster.
• Number of clusters, k. In general, we favor a low
number of clusters.
In particular, we use the following fitness function in
our experimental work (lower values for q(X) indicate
‘better’ quality of clustering X).
q(X) = Impurity(X) + β ∗ Penalty(k)        (1)

where

    Impurity(X) = (# of minority examples) / n

    Penalty(k) = √((k − c) / n)   if k ≥ c
    Penalty(k) = 0                if k < c

with n being the total number of examples and c being the number of classes in a dataset. Parameter β (0 < β ≤ 3.0) determines the penalty that is associated with the number of clusters, k; i.e., higher values for β imply larger penalties as the number of clusters increases.
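For concreteness, here is a small Python sketch of the fitness function q(X) as reconstructed above; the data representation (a list of per-cluster label lists) is our own choice:

```python
import math
from collections import Counter

def impurity(clusters):
    """Impurity(X): fraction of minority examples, i.e., examples whose
    class differs from the most frequent class of their cluster.
    `clusters` is a list of label lists, one per cluster."""
    n = sum(len(c) for c in clusters)
    minority = sum(len(c) - Counter(c).most_common(1)[0][1]
                   for c in clusters)
    return minority / n

def q(clusters, num_classes, beta):
    """q(X) = Impurity(X) + beta * Penalty(k); lower values are better."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    penalty = math.sqrt((k - num_classes) / n) if k >= num_classes else 0.0
    return impurity(clusters) + beta * penalty
```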
2.2 Representative-Based Supervised Clustering
Algorithms
Representative-based clustering aims at finding a set of
k representatives that best characterizes a dataset. Clusters
are created by assigning each object to the closest
representative. Representative-based supervised clustering
algorithms seek to accomplish the following goal: Find a
subset OR of O such that the clustering X, obtained by
using the objects in OR as representatives, minimizes q(X).
As part of our research, we have designed and
evaluated several supervised clustering algorithms [1,6,7].
Among the algorithms investigated, one named Single
Representative Insertion/Deletion Steepest Descent Hill
Climbing with Randomized Restart (SRIDHCR for short)
performed quite well. This greedy algorithm starts by
randomly selecting a number of examples from the
dataset as the initial set of representatives. Clusters are
then created by assigning examples to their closest representative. The algorithm tries to improve the quality of the clustering by adding a single non-representative example to the set of representatives as well as by removing a single representative from the set of representatives. The algorithm terminates if the solution quality (measured by q(X)) of the current solution does not improve. Moreover, SRIDHCR is run r times, reporting the best solution found as its result.
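The paper leaves some details open (e.g., the size of the initial representative set and tie-breaking), so the following Python sketch captures only the insertion/deletion hill-climbing loop as we read it; the callback `q_of` and the default `init_size` are our assumptions:

```python
import random

def sridhcr(n_examples, q_of, r=50, init_size=10, seed=0):
    """Sketch of SRIDHCR. `q_of(reps)` must build the clustering induced
    by the representative index set `reps` (each object assigned to its
    closest representative) and return q(X)."""
    rng = random.Random(seed)
    best_reps, best_q = None, float("inf")
    for _ in range(r):                      # randomized restarts
        reps = set(rng.sample(range(n_examples), init_size))
        cur_q = q_of(reps)
        while True:
            # steepest descent: try every single insertion and deletion
            moves = [reps | {i} for i in range(n_examples) if i not in reps]
            moves += [reps - {i} for i in reps if len(reps) > 1]
            best_move = min(moves, key=q_of)
            if q_of(best_move) >= cur_q:    # no improvement -> terminate
                break
            reps, cur_q = best_move, q_of(best_move)
        if cur_q < best_q:                  # keep best solution over r runs
            best_reps, best_q = reps, cur_q
    return best_reps
```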
2.3 Using Cluster Prototypes for Dataset Editing

Figure 2 gives an example illustrating how supervised clustering is used for dataset editing. Figure 2.a shows a dataset that was partitioned into 6 clusters using a supervised clustering algorithm. Cluster representatives are marked with small circles around them. Figure 2.b shows the result of supervised clustering editing.

Figure 2: Editing a dataset using supervised clustering. (a) Dataset clustered using supervised clustering; (b) dataset edited using cluster representatives (axes: Attribute1 vs. Attribute2).
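The nearest representative (NR) classifier is then simply a 1-NN classifier over the surviving representatives; a minimal sketch (variable names are ours):

```python
import numpy as np

def nr_classify(x, reps_X, reps_y):
    """Nearest representative (NR) classification: label a new object by
    the class of its closest cluster representative."""
    dists = np.abs(reps_X - x).sum(axis=1)   # Manhattan distance
    return reps_y[np.argmin(dists)]

# Supervised clustering editing replaces (X, y) by the representatives:
#   reps = sridhcr(len(X), q_of)             # from Section 2.2
#   reps_X, reps_y = X[sorted(reps)], y[sorted(reps)]
```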
3. Experimental Results
To evaluate the benefits of Wilson editing and
supervised clustering editing, we applied these techniques
to a benchmark consisting of eight UCI datasets [4] (see Table 1).
Table 1: Datasets used in the experiments.

Dataset name             # of objects   # of attributes   # of classes
Glass                         214              9                6
Heart-Statlog                 270             13                2
Heart-H                       294             13                2
Iris Plants                   150              4                3
Pima Indians Diabetes         768              8                2
Image Segmentation           2100             19                7
Vehicle Silhouettes           846             18                4
Waveform                     5000             21                3
All datasets were normalized using a linear
interpolation function that assigns 1 to the maximum
value and 0 to the minimum value. Manhattan distance
was used to compute the distance between two objects.
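In code, this preprocessing amounts to per-attribute min-max scaling followed by L1 distances; a sketch, assuming all attributes are numeric:

```python
import numpy as np

def min_max_normalize(X):
    """Linear rescaling per attribute: the minimum maps to 0 and the
    maximum maps to 1, as described above."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant attributes
    return (X - lo) / span

def manhattan(a, b):
    """Manhattan distance between two normalized objects."""
    return np.abs(a - b).sum()
```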
Table 2: Prediction accuracy for the four classifiers.

Dataset               β     NR      Wilson   1-NN    C4.5
Glass (214)           0.1   0.636   0.607    0.692   0.677
                      0.4   0.589   0.607    0.692   0.677
                      1.0   0.575   0.607    0.692   0.677
Heart-Stat Log (270)  0.1   0.796   0.804    0.767   0.782
                      0.4   0.833   0.804    0.767   0.782
                      1.0   0.838   0.804    0.767   0.782
Diabetes (768)        0.1   0.736   0.734    0.690   0.745
                      0.4   0.736   0.734    0.690   0.745
                      1.0   0.745   0.734    0.690   0.745
Vehicle (846)         0.1   0.667   0.716    0.700   0.723
                      0.4   0.667   0.716    0.700   0.723
                      1.0   0.665   0.716    0.700   0.723
Heart-H (294)         0.1   0.755   0.809    0.783   0.802
                      0.4   0.793   0.809    0.783   0.802
                      1.0   0.809   0.809    0.783   0.802
Waveform (5000)       0.1   0.834   0.796    0.768   0.781
                      0.4   0.841   0.796    0.768   0.781
                      1.0   0.837   0.796    0.768   0.781
Iris-Plants (150)     0.1   0.947   0.936    0.947   0.947
                      0.4   0.973   0.936    0.947   0.947
                      1.0   0.953   0.936    0.947   0.947
Segmentation (2100)   0.1   0.938   0.966    0.956   0.968
                      0.4   0.919   0.966    0.956   0.968
                      1.0   0.890   0.966    0.956   0.968
The parameter β of q(X) has a strong influence on the
number k of representatives chosen by the supervised
clustering algorithm. In general, an editing technique
reduces the size n of a dataset to a smaller size k. We
define the dataset compression rate of an editing
technique as:
Compression Rate = 1 − k / n        (2)
In order to explore different compression rates for
supervised clustering editing, three different values for
parameter β were used in the experiments: 1.0, 0.4, and
0.1. Prediction accuracies were measured using 10-fold cross-validation throughout the experiments.
Representatives for the nearest representative (NR)
classifier were computed using the SRIDHCR supervised
clustering algorithm that was restarted 50 times.
Accuracies and compression rates were obtained for a 1-NN classifier that operates on subsets of the 8 datasets
obtained using Wilson editing. We also computed
prediction accuracy for a traditional 1-NN classifier that
uses all training examples when classifying a new
example. Finally, we report prediction accuracy for the
decision-tree learning algorithm C4.5 that was run using
its default parameter settings. Table 2 reports the
accuracies obtained by the four classifiers evaluated in
our experiments. Table 3 reports the average dataset
compression rates for supervised clustering editing and
Wilson editing. It also reports the average, minimum, and
maximum number of representatives found over the 10 runs
for SCE.
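The following sketch reproduces the shape of this evaluation protocol with scikit-learn; the fold construction, shuffling, and seed are not specified in the paper, so they are assumptions here:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cv_accuracy_1nn(X, y, edit=None, folds=10, seed=0):
    """10-fold cross-validated accuracy of a 1-NN (Manhattan) classifier.
    `edit`, if given, maps a training fold (X_tr, y_tr) to an edited
    subset, e.g. the wilson_edit sketch from Section 1."""
    accs = []
    for tr, te in KFold(n_splits=folds, shuffle=True,
                        random_state=seed).split(X):
        X_tr, y_tr = X[tr], y[tr]
        if edit is not None:
            X_tr, y_tr = edit(X_tr, y_tr)    # editing inside the fold
        clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
        clf.fit(X_tr, y_tr)
        accs.append(clf.score(X[te], y[te]))
    return float(np.mean(accs))
```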
If we inspect the results displayed in Table 2, we can
see that Wilson editing is quite a useful technique for improving traditional 1-NN classifiers. Using Wilson
editing leads to higher accuracies for 6 of the 8 datasets
tested and only shows a significant loss in accuracy for
the Glass dataset. The SCE approach, on the other hand,
accomplished significant improvement in accuracy for the
Heart-Stat Log, Waveform, and Iris-Plants datasets,
outperforming Wilson editing by at least 2% in accuracy
for those datasets. It should also be mentioned that the
achieved accuracies are significantly higher than those
obtained by C4.5 for those datasets. However, our results
also indicate a significant loss in accuracy for the Glass
and Segmentation datasets.
More importantly, looking at Table 3, we notice that
SCE accomplishes compression rates of more than 95%
without a significant loss in prediction accuracy for 6 of
the 8 datasets. For example, for the Waveform dataset, a 1-NN classifier that uses only 28 representatives on average outperforms the traditional 1-NN classifier that uses all 4500 training examples¹ by 7.3% (76.8% vs. 84.1%).

¹ Because we use 10-fold cross-validation, training sets contain 0.9 ∗ 5000 = 4500 examples.
As mentioned earlier, Wilson editing reduces the size
of a dataset by removing examples that have been
misclassified by a k-NN classifier, which explains the low
compression rates for the Iris and Segmentation datasets.
Condensing approaches, on the other hand, reduce the
size of a dataset by removing examples that have been
classified correctly by a nearest neighbor classifier.
Finally, supervised clustering editing reduces the size of a
dataset by removing examples that have been classified
correctly as well as examples that have not been classified
correctly. A representative-based supervised clustering
algorithm is used that aims at finding clusters that are
dominated by instances of a single class, and tends to pick
as cluster representatives² objects that are in the center of the region associated with the cluster.

² Representatives are rarely picked at the boundaries of a region dominated by a single class, because boundary points have the tendency to attract points of neighboring regions that are dominated by other classes, thereby increasing cluster impurity.
Table 3: Dataset compression rates for SCE and Wilson editing.

Dataset               β     Avg. k [Min–Max]   SCE Compression   Wilson Compression
                            for SCE            Rate (%)          Rate (%)
Glass (214)           0.1   34 [28–39]         84.3              27
                      0.4   25 [19–29]         88.4              27
                      1.0   6 [6–6]            97.2              27
Heart-Stat Log (270)  0.1   15 [12–18]         94.4              22.4
                      0.4   2 [2–2]            99.3              22.4
                      1.0   2 [2–2]            99.3              22.4
Diabetes (768)        0.1   27 [22–33]         96.5              30.0
                      0.4   9 [2–18]           98.8              30.0
                      1.0   2 [2–2]            99.7              30.0
Vehicle (846)         0.1   57 [51–65]         93.3              30.5
                      0.4   38 [26–61]         95.5              30.5
                      1.0   14 [9–22]          98.3              30.5
Heart-H (294)         0.1   14 [11–18]         95.2              21.9
                      0.4   2                  99.3              21.9
                      1.0   2                  99.3              21.9
Waveform (5000)       0.1   104 [79–117]       97.9              23.4
                      0.4   28 [20–39]         99.4              23.4
                      1.0   4 [3–6]            99.9              23.4
Iris-Plants (150)     0.1   4 [3–8]            97.3              6.0
                      0.4   3 [3–3]            98.0              6.0
                      1.0   3 [3–3]            98.0              6.0
Segmentation (2100)   0.1   57 [48–65]         97.3              2.8
                      0.4   30 [24–37]         98.6              2.8
                      1.0   14                 99.3              2.8
4. Conclusion
The goal of dataset editing in instance-based learning
is to remove objects from a training set in order to
increase the accuracy of the learnt classifier. This paper
evaluates the benefits of Wilson editing using a
benchmark consisting of eight UCI datasets. Our results
show that Wilson editing enhanced the accuracy of a
traditional nearest neighbor classifier on six of the eight
datasets tested, achieving an average compression rate of
about 20%. It is also important to note that Wilson
editing, although initially proposed for nearest neighbor
classification, can easily be used for other classification
tasks. For example, a dataset can easily be “Wilson
edited” by removing all training examples that have been
misclassified by a decision tree classification algorithm.
In this paper, we introduced a new technique for
dataset editing called supervised clustering editing (SCE).
The idea of this approach is to replace a dataset by a
subset of cluster prototypes. Experiments were conducted
that compare the accuracy and compression rates of SCE with Wilson editing and with a traditional, unedited 1-NN classifier. Results show that SCE accomplished significant improvements in prediction accuracy for 3 of the 8 datasets used in the experiments, outperforming the Wilson-edited 1-NN classifier by more than 2%. Moreover, the experimental results show that for 6 of the 8 datasets tested, SCE achieves compression rates of more than 95% without a significant loss in accuracy. In summary, surprisingly high accuracies were achieved for several datasets using only a very small number of representatives. In general, our empirical results stress the importance of devoting more research to dataset editing techniques.
5. References
[1] Eick, C., Zeidat, N., and Zhao, Z., “Supervised Clustering – Algorithms and Applications”, submitted for publication.
[2] Penrod, C. and Wagner, T., “Another look at the edited nearest neighbor rule”, IEEE Trans. Syst., Man, Cyber., SMC-7:92–94, 1977.
[3] Toussaint, G., “Proximity Graphs for Nearest Neighbor Decision Rules: Recent Progress”, Proceedings of the 34th Symposium on the INTERFACE, Montreal, Canada, April 17–20, 2002.
[4] University of California at Irvine, Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
[5] Wilson, D.L., “Asymptotic Properties of Nearest Neighbor
Rules Using Edited Data”, IEEE Transactions on Systems, Man,
and Cybernetics, 2:408-420, 1972.
[6] Zeidat, N., Eick, C., “Using k-medoid Style Algorithms for
Supervised Summary Generation”, Proceedings of MLMTA,
Las Vegas, June 2004.
[7] Zhao, Z., “Evolutionary Computing and Splitting Algorithms
for Supervised Clustering”, Master’s Thesis, Dept. of Computer
Science, University of Houston, May 2004.