entropy

Article

Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering

Meshal Shutaywi 1 and Nezamoddin N. Kachouie 2,*

1 Department of Mathematics, King Abdulaziz University, Rabigh 21911, Saudi Arabia; mshutaywi@kau.edu.sa
2 Department of Mathematical Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA
* Correspondence: nezamoddin@fit.edu
Abstract: Grouping objects based on their similarities is an important and common task in machine learning applications. Many clustering methods have been developed, among which k-means based clustering methods have been broadly used, and several extensions, such as k-means++ and kernel k-means, have been developed to improve the original k-means clustering method. K-means is a linear clustering method; that is, it divides the objects into linearly separable groups, while kernel k-means is a non-linear technique. Kernel k-means projects the elements to a higher dimensional feature space using a kernel function and then groups them. Different kernel functions may not perform similarly in clustering a data set and, in turn, choosing the right kernel for an application can be challenging. In our previous work, we introduced a weighted majority voting method for clustering based on normalized mutual information (NMI). NMI is a supervised measure, in the sense that the true labels of a training set are required to calculate it. In this study, we extend our previous work on aggregating clustering results to develop an unsupervised weighting function for the case where a training set is not available. The proposed weighting function is based on the Silhouette index, an unsupervised criterion, so a training set is not required to calculate it. This makes our new method more appropriate in the context of clustering.
Keywords: k-means; kernel k-means; machine learning; nonlinear clustering; silhouette index; weighted clustering

Citation: Shutaywi, M.; Kachouie, N.N. Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering. Entropy 2021, 23, 759. https://doi.org/10.3390/e23060759

Academic Editor: Antonio M. Scarfone

Received: 18 April 2021; Accepted: 4 June 2021; Published: 16 June 2021
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
There is a high demand for developing new methods to discover hidden structures,
identify patterns, and recognize different groups in machine learning applications [1].
Cluster analysis has been widely applied for dividing objects into different groups based
on their similarities [2]. Cluster analysis is an important task in different areas of study such
as medical diagnosis, information retrieval, marketing, and social sciences [3]. For instance,
in medical diagnosis, clustering can help to divide the patients into different groups based
on the stage of the disease, and in marketing, clustering can assist in determining customers
with similar interests to customize the advertisement process and make it more effective [4].
Cluster analysis is an unsupervised learning method [5] that optimizes an objective
function based on feature similarities [6]. The goal is to find groups such that elements
in the same group have similar features. Clustering algorithms often use a search method
to optimize the objective function. To quantify the distinctiveness of each pair of elements,
a distance measure such as Euclidean distance or Manhattan distance can be used. The
number of clusters is often unknown and must be chosen for cluster analysis. Hence, either
the number of clusters is specified by the user or it must be estimated using an overall
distance measure such as the sum of the squared distances of elements from their cluster
centers. An objective function is optimized by minimizing the distance of elements to their
cluster centers (within-cluster distance) and/or maximizing the distance between cluster
centers (between-cluster distance). A search strategy is required to find the groups that
optimize the objective function [1].
Several clustering methods have been developed [7], such as k-means, model-based
clustering, spectral clustering, and hierarchical clustering. The focus of this study is on the
k-means based clustering methods. K-means is a well-known clustering method with the
aim of minimizing the Euclidean distance between each point and the center of the cluster
to which it belongs. Advantages of k-means are its simplicity and speed [8]. However,
k-means can only discover clusters that are linearly separable. In contrast, kernel k-means
is a non-linear extension of the k-means clustering method that can identify clusters that are
not linearly separable. Although objective functions of both methods are similar, elements
are projected into a higher dimensional space in kernel k-means.
In our previous work [9], we introduced a weighted majority voting method for
clustering based on normalized mutual information (NMI). We showed that the clustering
results highly depend on the selected kernel function when using kernel k-means method.
For example, by choosing a different kernel such as Gaussian, polynomial, or hyperbolic
tangent kernels, we obtain different clustering results for the same dataset. Therefore, to
eliminate the partiality based on the chosen kernel for an arbitrary application, we aggregated the clustering results obtained by different kernels. To achieve this, we implemented
a weighting function to assign the weights to each kernel based on their performances.
The performance of each kernel for clustering application was assessed by normalized
mutual information (NMI) [10]. NMI is a supervised method, i.e., we need a training set for
calculating NMI to evaluate the performance of a clustering method. However, a training
set might not be available for the clustering purposes. Therefore, in this study, we extend
our previous work of ensemble clustering by developing a weighting function that does
not need a training set. Silhouette index is an unsupervised method for evaluating the
performance of a clustering method [11]. Since the Silhouette index does not need a training
set to evaluate the clustering performance, it is more relevant to the clustering concept.
In this work, we have developed a different weighting function based on the
Silhouette index. The paper is organized as follows. We review k-means, kernel k-means,
and the Silhouette index in Section 2. Section 3 explains the proposed weighting function
using the Silhouette index. Simulation studies and results are presented in Section 4, and
the conclusions are provided in Section 5.
2. Background
A brief review of k-means, kernel k-means, and the Silhouette index follows.
2.1. K-Means
K-means chooses K centers such that the total squared distance between each point and its
cluster center is minimized. The k-means technique can be summarized as follows. First, K
arbitrary centers are selected, usually, as Lloyd's algorithm suggests, uniformly at random
from the data. Second, the Euclidean distance between each element and all cluster centers
is calculated, and each element is assigned to its closest cluster center. Third, new cluster
centers are obtained by averaging all elements grouped in the same cluster in Step 2. Finally,
the second and third steps are repeated until the algorithm converges, i.e., the cluster centers
obtained in the current iteration are the same as or very close to the cluster centers obtained
in the previous iteration. The k-means objective function can be written as

$$\sum_{k=1}^{K} \sum_{x_i \in \pi_k} \| x_i - \mu_k \|^2,$$

where πk is cluster k, µk is the center of cluster k, and ‖·‖ is the Euclidean distance.
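For concreteness, a minimal sketch of this procedure (Lloyd's algorithm) in Python/NumPy is given below; the function name, the convergence tolerance, and the empty-cluster guard are our own choices rather than part of the description above.

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, rng=None):
    """Plain Lloyd's algorithm: X is an (n, p) array, K the number of clusters."""
    rng = np.random.default_rng(rng)
    # Step 1: pick K arbitrary centers uniformly at random from the data.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each element to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centers are the means of the elements grouped in each cluster.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        # Step 4: stop when the centers barely move between iterations.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```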
2.2. Kernel K-Means
Kernel k-means is an extension of k-means for grouping objects that are not linearly
separable. The idea of kernel k-means clustering relies on projecting the elements into
a higher-dimensional feature space using a non-linear function to make them linearly
separable in the projected space. The kernel k-means algorithm is summarized below [10].
Let {x1, x2, . . . , xn} be a set of n data points, K the number of clusters, πk cluster k,
{πk}, k = 1, . . . , K, a partitioning of the points into K groups, and φ a non-linear function
projecting each data point xi to a higher dimensional space. Each element in the kernel
matrix M is:
$$M_{ij} = \phi(x_i) \cdot \phi(x_j), \quad \text{for } i, j = 1, 2, \ldots, n. \qquad (1)$$

where φ(xi) and φ(xj) are the transformations of xi and xj, respectively. Some popular kernel
functions are the Radial Basis Function known as the Gaussian kernel [12], polynomial
kernel, and sigmoid kernel. The procedure is similar to k-means, but in the transformed
space as follows. Step 1: after transforming the data points into the new space, each
cluster center µk must be randomly initialized. Step 2: the Euclidean distance between each
element and all cluster centers µk ’s is computed in the transformed space by:
$$\|\phi(x_i) - \mu_k\|^2 = \left\|\phi(x_i) - \sum_{x_j \in \pi_k} \frac{\phi(x_j)}{|\pi_k|}\right\|^2 = \phi(x_i)\cdot\phi(x_i) - \frac{2\sum_{x_j \in \pi_k} \phi(x_i)\cdot\phi(x_j)}{|\pi_k|} + \frac{\sum_{x_j, x_c \in \pi_k} \phi(x_j)\cdot\phi(x_c)}{|\pi_k|^2}. \qquad (2)$$
where |πk| is the number of elements in cluster πk. Step 3: each data point is assigned to the
closest cluster, i.e., the cluster with minimum distance in the transformed space. Step 4: a new
cluster center µk is obtained for cluster k by averaging the transformed elements that were
assigned to cluster πk in the previous iteration:
$$\mu_k = \sum_{x_i \in \pi_k} \frac{\phi(x_i)}{|\pi_k|}, \quad \text{for } k = 1, 2, \ldots, K. \qquad (3)$$
Steps 2 to 4 will be repeated to minimize the objective function:
$$D\left(\{\pi_k\}_{k=1}^{K}\right) = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} \|\phi(x_i) - \mu_k\|^2. \qquad (4)$$
The algorithm will converge when the obtained cluster centers in the current iteration
are the same as or very close to the cluster centers obtained in the previous iteration.
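The following sketch (an illustration, not the authors' implementation) applies the distance of Equation (2) directly to a precomputed kernel matrix M; since the centers µk exist only implicitly in the feature space, it starts from a random partition and re-assigns points until the assignments stop changing.

```python
import numpy as np

def kernel_kmeans(M, K, max_iter=100, rng=None):
    """Kernel k-means on a precomputed n x n kernel matrix M, using Equation (2)."""
    rng = np.random.default_rng(rng)
    n = M.shape[0]
    labels = rng.integers(0, K, size=n)          # random initial partition
    for _ in range(max_iter):
        dist = np.zeros((n, K))
        for k in range(K):
            idx = np.where(labels == k)[0]
            if len(idx) == 0:                    # guard against empty clusters
                dist[:, k] = np.inf
                continue
            # ||phi(x_i) - mu_k||^2 = M_ii - 2*mean_j M_ij + mean_{j,c} M_jc
            dist[:, k] = (np.diag(M)
                          - 2.0 * M[:, idx].sum(axis=1) / len(idx)
                          + M[np.ix_(idx, idx)].sum() / len(idx) ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged: assignments are stable
            break
        labels = new_labels
    return labels
```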
2.3. Silhouette Index
There are several methods to evaluate clustering results, such as the Rand index [13],
adjusted Rand index [14], distortion score [11], and Silhouette index. While most of
the performance evaluation methods need a training set, the Silhouette index does not
need a training set to evaluate the clustering results. This makes it more appropriate for
a clustering task. In this work, we use the Silhouette index to evaluate the clustering
performance. The Silhouette width s( xi ) for the point xi is defined as [11]:
$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}}. \qquad (5)$$
where xi is an element in cluster πk, a(xi) is the average distance of xi to all other elements
in cluster πk (within dissimilarity), and

b(xi) = min {dl(xi)}, among all clusters l ≠ k,

where dl(xi) is the average distance from xi to all points in cluster πl for l ≠ k (between
dissimilarity). From Equation (5), the value of the Silhouette width can vary between −1
and 1. A negative value is undesirable because it corresponds to a case in which a(xi) is
greater than b(xi), which means the within dissimilarity is greater than the between
dissimilarity. A positive value is obtained where a(xi) < b(xi), and the Silhouette width
reaches its maximum s(xi) = 1 for a(xi) = 0 [11]. The greater the (positive) s(xi) value of an
element, the higher the likelihood that it is clustered in the correct group. Elements with
negative s(xi) are more likely to be clustered in wrong groups [15]. A typical example is
shown in Figure 1, where xi is shown by a black disk, a(xi) by red lines, and b(xi) by blue
lines. The average Silhouette width for a cluster is the average s(xi) over all points in the
cluster, and the average Silhouette width for the entire clustering result is the average s(xi)
over all points in every cluster. We discuss the proposed weighting function using the
Silhouette index in detail in the next section.
Figure 1. Procedure to calculate the Silhouette index for a typical data point X. Red lines show the distances between X and every data point clustered in the same group. Blue lines show the distances between X and every data point clustered in the nearest group.
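As an illustration of Equation (5) (a minimal sketch assuming Euclidean distances and at least two clusters; not the authors' code), the per-point Silhouette widths and the averages discussed above can be computed as follows. The same values are returned by scikit-learn's silhouette_samples and silhouette_score.

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_widths(X, labels):
    """Per-point Silhouette width s(x_i) as in Equation (5)."""
    labels = np.asarray(labels)
    D = cdist(X, X)                       # pairwise Euclidean distances
    s = np.zeros(len(X))
    for i, k in enumerate(labels):
        same = (labels == k)
        same[i] = False                   # exclude the point itself
        if not same.any():                # singleton cluster: s(x_i) = 0 by convention
            continue
        a = D[i, same].mean()             # within dissimilarity a(x_i)
        b = min(D[i, labels == l].mean()  # between dissimilarity b(x_i)
                for l in set(labels) if l != k)
        s[i] = (b - a) / max(a, b)
    return s

# The mean over one cluster gives its average Silhouette width; the mean over all
# points gives the overall average Silhouette width used as gamma in Section 3.
```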
3. Weighted Clustering Method Using Silhouette Index
The goal of the proposed method is to assess the performance of different clustering
methods and combine their results based on their performance to provide a single outcome.
The focus here is on kernel k-means, because selecting the right kernel function
F(xi, xj) = φ(xi) · φ(xj) for an arbitrary application is not obvious. Therefore, applying a
set of different kernels for clustering and aggregating the clustering results based on the
performance of the different kernels can provide consistent results and eliminate the partiality
of the results based on the selected kernel. We compute the performance of a kernel by
computing the average Silhouette width (γ) for the clustering results obtained by that kernel.
Three kernels, including Gaussian, polynomial, and hyperbolic tangent, are used to project
the elements as follows.

The Gaussian kernel with standard deviation σ is given by:

$$\phi(x_i) \cdot \phi(x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}. \qquad (6)$$
The polynomial kernel is defined by:

$$\phi(x_i) \cdot \phi(x_j) = \left(p\, x_i^{T} x_j + c\right)^{l}. \qquad (7)$$

where p is the slope, c is the intercept, and l is the polynomial order.

The hyperbolic tangent kernel is given by:

$$\phi(x_i) \cdot \phi(x_j) = \tanh\left(p\, x_i^{T} x_j + c\right). \qquad (8)$$

where p is the slope and c is the intercept.
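A sketch of the three kernel matrices in Equations (6)-(8) is given below; the hyper-parameter values shown (σ = 1, p = 1, c = 1, l = 2) are placeholders of our own choosing, since the experiments in Section 5 rely on kernlab's automatic defaults.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Equation (6): K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def polynomial_kernel(X, p=1.0, c=1.0, l=2):
    """Equation (7): K_ij = (p x_i^T x_j + c)^l."""
    return (p * X @ X.T + c) ** l

def tanh_kernel(X, p=1.0, c=1.0):
    """Equation (8): K_ij = tanh(p x_i^T x_j + c)."""
    return np.tanh(p * X @ X.T + c)
```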
We compute the average Silhouette width for the clustering results obtained by each
kernel separately. We then combine the results using computed weights for each kernel.
Let γj be the average Silhouette width for the clustering result obtained using the jth kernel.
The assigned weight δj for the clustering result of the jth kernel is computed by:
$$\delta_j = \frac{\gamma_j + 1}{\sum_{m=1}^{d} (\gamma_m + 1)}, \quad \text{for } j = 1, 2, \cdots, d. \qquad (9)$$

where d is the number of kernel functions (here d = 3) and

$$\sum_{j=1}^{d} \delta_j = 1. \qquad (10)$$
The Silhouette value is used to evaluate and assign weight to each kernel. Weights
must be non-negative real values between 0 and 1 and must sum up to one. Hence, we
shifted the Silhouette scores by adding one, obtaining shifted Silhouette scores ranging
from 0 to 2. Normalized non-negative weights are consequently computed by dividing the
shifted Silhouette scores by the sum of the shifted Silhouette scores. Next, for each data point,
we sum up the weights (δj) assigned to the kernels that clustered the data point into the
same group. To ensure the consistency of the group labels across different kernels, the clusters
found by the first kernel are considered the base group labels. The Euclidean distances
between the cluster centers found by each kernel and the cluster centers found by the first
kernel (reference groups) are computed to assign consistent cluster labels. The cluster
labels are then matched based on the minimum sum of Euclidean distances. Finally, for a
given data point, we compare the total weights computed for each cluster label, and the
data point will be considered in the group with the highest weight. The proposed method
is summarized below.
1. Let Ωd be a set of d kernel functions h1 to hd. Perform the kernel k-means method using
   the kernels in the kernel set and generate d clustering results Γj, for j = {1, 2, . . . , d};
2. Compute the average Silhouette width γj for the clustering result Γj obtained by kernel hj,
   for all kernels j = {1, 2, . . . , d};
3. Shift the Silhouette values from [−1, 1] to [0, 2] to compute non-negative weights δj
   for each kernel;
4. For each data point, use the computed weights δj from Step 3 to combine the clustering
   results Γj, for j = {1, 2, . . . , d}, as follows (see the sketch after this list):
   • Sum up the weights corresponding to the kernels that assign the same cluster label to
     the data point;
   • Compare the total weight of each cluster for the data point;
   • Group the data point into the cluster with the highest total weight.
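A minimal sketch of Step 4 (our illustration; it assumes the cluster labels have already been made consistent across kernels and are coded 0, . . . , K−1, and the function and variable names are ours):

```python
import numpy as np

def weighted_vote(label_matrix, weights, K):
    """Combine d clusterings by weighted majority voting.

    label_matrix: (d, n) array, row j holds the aligned labels from kernel j.
    weights:      the d kernel weights delta_j from Equation (9).
    """
    n = label_matrix.shape[1]
    combined = np.empty(n, dtype=int)
    for i in range(n):
        totals = np.zeros(K)
        for j, w in enumerate(weights):
            totals[label_matrix[j, i]] += w   # sum weights of kernels voting for each label
        combined[i] = totals.argmax()         # assign the label with the highest total weight
    return combined
```

With equal weights of 1/d for every kernel, the same function reduces to plain majority voting, the baseline used for comparison in Section 5.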
4. Simulation
To evaluate the performance of the proposed method, we have applied it to several
benchmark datasets as follows.
• Bensaid data: it contains 49 two-dimensional elements grouped in three classes [16].
• Dunn data: it contains 90 two-dimensional elements grouped in two classes [17].
• Iris data: it contains 150 four-dimensional elements grouped in three classes [18].
• Seed data: it contains 210 seven-dimensional elements grouped in three classes [19].
To address the random initialization of the cluster centers in kernel k-means, we
applied the proposed method in a Monte Carlo setting and averaged the results obtained
through 100 Monte Carlo trials. In each trial, initial cluster centers were randomly generated. For each dataset, kernel k-means with three different kernel functions (Gaussian,
polynomial, and hyperbolic tangent) was used. To evaluate the clustering performance
of each kernel, the clusters identified by these three kernels are matched using their cluster
centroids. The clustering performance of each kernel is then evaluated using the Monte Carlo
average of the Silhouette index. Next, the weight of each kernel is assigned based on its
estimated performance. Finally, the clustering results of the three kernels are merged using
majority voting and the proposed weighted majority voting method.
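The centroid-based matching of cluster labels can be sketched as follows; the text specifies only that labels are assigned by the minimum sum of Euclidean distances between cluster centers, so the use of the Hungarian assignment (scipy's linear_sum_assignment) and the function name are our own choices, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def align_labels(labels, X, ref_centers):
    """Relabel one clustering so that its centroids match the reference centroids
    (the clusters found by the first kernel) with minimum total Euclidean distance.
    Assumes labels are coded 0..K-1 and every cluster is non-empty."""
    K = ref_centers.shape[0]
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    cost = cdist(centers, ref_centers)        # pairwise centroid distances
    row, col = linear_sum_assignment(cost)    # one-to-one matching, minimum total cost
    mapping = dict(zip(row, col))             # original label -> reference label
    return np.array([mapping[l] for l in labels])
```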
5. Results
In this section, the clustering results obtained using the proposed method are demonstrated. Table 1 summarizes the Monte Carlo average for the average Silhouette indices,
and the Monte Carlo average of true rates for the clustering results of each dataset obtained
using Gaussian, polynomial, and tangent kernels along with combined results using the
proposed weighted method. We used the package "kernlab" in R, and the hyper-parameters
were set by the default "automatic" setting, which uses a heuristic to determine suitable
values for the hyper-parameters of each kernel.
Table 1. Monte Carlo average of the average Silhouette index (ASI) and Monte Carlo average of true
rate (TR) for the clustering results obtained using kernel k-means with three different kernel functions
(Gaussian, polynomial, and tangent) along with majority voting, and the proposed weighted majority
voting based on average Silhouette.
DATA           Gaussian   Polynomial   Tangent   Majority Voting   Weighted Majority Voting
Bensaid   ASI    0.159      0.453      −0.127        0.132                 0.167
          TR     0.528      0.641       0.413        0.592                 0.611
Dunn      ASI    0.269      0.418      −0.003        0.293                 0.293
          TR     0.453      0.521       0.417        0.474                 0.474
Iris      ASI    0.609      0.609      −0.068        0.482                 0.589
          TR     0.856      0.850       0.381        0.833                 0.873
Seed      ASI    0.594      0.601      −0.052        0.537                 0.570
          TR     0.869      0.842       0.376        0.852                 0.862
The first row of Table 1 summarizes the clustering results for the Bensaid dataset.
For this dataset, the polynomial kernel provided the best performance among all kernels
based on the Monte Carlo average true rate (0.641) and Monte Carlo average of average
Silhouette indices (0.453). Figure 2 (top left) demonstrates the three original groups in
the Bensaid dataset along with the clustering results obtained by three different kernels,
majority voting, and the proposed weighted method. It shows that among all kernels, the
polynomial kernel could better distinguish the original clusters in the Bensaid dataset. The
clustering results of these three kernels are combined using the calculated weights for each
kernel. Then, the Monte Carlo average of average Silhouette indices and the Monte Carlo
average of the true rates are obtained for both majority voting and the proposed method.
The proposed method outperforms majority voting with the Monte Carlo average true rate
of 0.611 in comparison with 0.592. Figure 3 shows the Silhouette index for all data points in
the Bensaid dataset, computed for clustering results obtained using each kernel separately.
A colored horizontal line shows the Silhouette index of each data element. We can see that
there is a smaller number of data points with negative Silhouette scores in the clustering
results obtained by the proposed method than in those of the majority voting method.
As a result, the average Silhouette index of the proposed method is higher.
The second row of Table 1 summarizes the clustering results for the Dunn dataset. For
this dataset with two original groups, clustering performance of three different kernels
were comparable with the Monte Carlo average true rates of 0.453, 0.521, and 0.417 for
Gaussian, polynomial, and tangent kernels, respectively. Among them, the polynomial kernel
provided slightly better results (with a true rate of 0.521). As we can see in Table 1, the
Monte Carlo average of the average Silhouette index for this kernel (0.418) is also higher than
those of the other two kernels, resulting in a higher weight for this kernel when we combine
the results. Figure 4 shows the original groups in the Dunn dataset and the clustering
results obtained using kernel k-means with three different kernel functions, majority voting,
and the proposed weighted method. Since three different kernels produced comparable
results, the proposed method (weighted averaging) and majority voting (averaging) will
also produce similar results, comparable with the results obtained by the original three
kernels (Table 1). Figure 5 illustrates the Silhouette indices obtained for the clustering
results of the Dunn dataset in a randomly selected trial (from 100 Monte Carlo trials).
Figure 3. Silhouette index obtained for the clustering results of the Bensaid dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette.
Figure 2. Original groups in the Bensaid data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette. Results are visualized using the 1st and 2nd PCA.
Figure 5. Silhouette index obtained for the clustering results of the Dunn dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette.
Figure 4. Original groups in the Dunn data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette. Results are visualized using the 1st and 2nd PCA.
The third row of Table 1 summarizes the clustering results for the Iris dataset. For
this dataset with three original groups, Gaussian kernels performed better than the other
two kernels with a Monte Carlo average true rate of 0.856. Polynomial kernels produced
comparable results to Gaussian kernels with a Monte Carlo average true rate of 0.850, while
the Monte Carlo average true rate of the tangent kernel was 0.381. The computed Monte
Carlo average Silhouette indices were 0.609, 0.609, and −0.068 for Gaussian, polynomial,
and tangent kernels, respectively. As a result, the Gaussian and polynomial kernels received
the same high weight and the tangent kernel received a low weight in the proposed weighted method.
The proposed method provided the Monte Carlo average true rate of 0.873, which is higher
than the Monte Carlo average true rates obtained by each kernel individually. As we can see in
Table 1, the clustering results obtained by the proposed method also have higher Monte
Carlo average true rates than the majority voting, since the latter assigns the same weights
to each kernel. Figure 6 shows the original groups in the Iris dataset, and the clustering
results obtained using kernel k-means with three different kernel functions, majority voting,
and the proposed weighted method. Figure 7 shows the Silhouette indices calculated for
the clustering results of Iris data obtained using kernel k-means with three different kernel
functions, majority voting, and the proposed weighted method in a randomly selected trial
(from 100 Monte Carlo trials).
The fourth row of Table 1 summarizes the clustering results for the Seed dataset. For
this dataset with three original groups, as with the Iris data, the Gaussian kernel
performed better than the other two kernels, with a Monte Carlo average true rate of 0.869.
Polynomial kernels produced comparable results to Gaussian kernels with a Monte Carlo
average true rate of 0.842, while the Monte Carlo average true rate of the tangent kernel
was 0.376. As a result, Gaussian and polynomial kernels had high weights and the tangent
received a low weight in the proposed weighted method. The proposed method obtained
the Monte Carlo average true rate of 0.862, which is higher than that of the majority
voting (0.852), and almost as high as the best performance obtained by the Gaussian kernel
(0.869). Figure 8 shows the original groups in the Seed dataset, and the clustering results
obtained using kernel k-means with three different kernel functions, majority voting, and
the proposed weighted method. Figure 9 shows the Silhouette indices calculated for the
clustering results of Seed data obtained using kernel k-means with three different kernel
functions, majority voting, and the proposed weighted method in a randomly selected trial
(from 100 Monte Carlo trials).
Figure 7. Silhouette index obtained for the clustering results of the Iris dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette.
Figure 6. Original groups in the Iris data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette. Results are visualized using the 1st and 2nd PCA.
Figure 9. Silhouette index obtained for the clustering results of the Seed dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette.
Figure 8. Original groups in the Seed data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted majority voting based on average Silhouette. Results are visualized using the 1st and 2nd PCA.
6. Conclusions and Future Research
In our previous work [9], we introduced a weighted mutual information method for
aggregated kernel clustering. We showed that the clustering results highly depend on
the selected kernel in the kernel k-means method. Therefore, we proposed a weighting
function based on the clustering performance to aggregate the clustering results obtained
by different kernels. The notion of aggregating the clustering results obtained by several
kernel functions including Gaussian, polynomial, and hyperbolic tangent was based on
assigning weight functions using their associated NMI values. Calculation of NMI requires
a training set which might not be available, therefore in this paper, we extended our
previous work of ensemble clustering by developing an unsupervised weighting function
based on the Silhouette index. For calculating the Silhouette index, a training set is not
required, which makes the approach more appropriate in the context of clustering.
We applied the proposed method to different benchmark datasets and combined
the results obtained by three different kernels. We should point out that the proposed
aggregated method either improved the clustering performance or provided comparable
results, depending on the performance of different kernels for the application at hand.
However, the main goal of the proposed method is to obtain impartial results independent
of the kernel function, rather than merely improving the clustering performance. The
former is essential because, in real applications, the true labels are not available to
measure the clustering performance, and as a result the choice of the right clustering
method is not obvious. In turn, by aggregating the results based on some sensible weights,
the aggregated clustering results are less biased with respect to the selected method. Here, we
showed that not only are the aggregated results impartial, with an ensemble performance
comparable to the best performance obtained by an individual kernel, but also that,
for some datasets, the aggregated results outperformed the best individual kernel.
The focus of future work is on further improvement of the
performance of the ensemble clustering. As the average Silhouette score of the entire model
demonstrates encouraging results, future research will be conducted to study a pointwise
Silhouette score for potential further improvement.
Author Contributions: Conceptualization, N.N.K. and M.S.; methodology, N.N.K.; software, M.S.;
validation, N.N.K. and M.S.; formal analysis, N.N.K. and M.S.; investigation, N.N.K. and M.S.;
data curation, N.N.K. and M.S.; writing—original draft preparation, N.N.K. and M.S.; writing—
review and editing, N.N.K. and M.S.; visualization, N.N.K. and M.S.; supervision, N.N.K.; project
administration, N.N.K. All authors have read and agreed to the published version of the manuscript.
Funding: The publication of this paper is funded by the Deanship of Scientific Research (DSR), King
Abdulaziz University, Jeddah, under grant No. J: 87-662-1441.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data that support the findings of this study are available from the
corresponding author upon reasonable request.
Acknowledgments: The authors would like to acknowledge the support of Deanship of Scientific
Research (DSR), King Abdulaziz University, Jeddah, for the publication of this paper.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Gentleman, R.; Carey, V.; Huber, W.; Irizarry, R.; Dudoit, S. (Eds.) Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Springer Science & Business Media: New York, NY, USA, 2006.
2. Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 2003, 52, 91–118. [CrossRef]
3. Wu, J. Advances in K-Means Clustering: A Data Mining Thinking; Springer Science & Business Media: New York, NY, USA, 2012.
4. Kassambara, A. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning; CreateSpace Independent Publishing Platform: Scotts Valley, CA, USA, 2017; Volume 1.
5. Brocard, D.; Gillet, F.; Legendre, P. Numerical Ecology with R (Use R!); Springer: New York, NY, USA, 2011; pp. 978–981. [CrossRef]
6. Nguyen, N.; Caruana, R. Consensus clusterings. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 607–612.
7. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. (CSUR) 1999, 31, 264–323. [CrossRef]
8. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics; pp. 1027–1035.
9. Kachouie, N.N.; Shutaywi, M. Weighted Mutual Information for Aggregated Kernel Clustering. Entropy 2020, 22, 351. [CrossRef] [PubMed]
10. Dhillon, I.S.; Guan, Y.; Kulis, B. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2004; pp. 551–556.
11. Camps-Valls, G. (Ed.) Kernel Methods in Bioengineering, Signal and Image Processing; IGI Global: Hershey, PA, USA, 2006.
12. Campbell, C. An introduction to kernel methods. Stud. Fuzziness Soft Comput. 2001, 66, 155–192.
13. Yeung, K.Y.; Haynor, D.R.; Ruzzo, W.L. Validating clustering for gene expression data. Bioinformatics 2001, 17, 309–318. [CrossRef] [PubMed]
14. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [CrossRef]
15. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [CrossRef]
16. Bensaid, A.M.; Hall, L.O.; Bezdek, J.C.; Clarke, L.P.; Silbiger, M.L.; Arrington, J.A.; Murtagh, R.F. Validity-guided (re)clustering with applications to image segmentation. IEEE Trans. Fuzzy Syst. 1996, 4, 112–123. [CrossRef]
17. Bezdek, J.C.; Keller, J.; Krisnapuram, R.; Pal, N. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing; Springer Science & Business Media: New York, NY, USA, 1999; Volume 4.
18. Kozak, M.; Łotocka, B. What should we know about the famous Iris data? Curr. Sci. 2013, 104, 579–580.
19. Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.A.; Łukasik, S.; Żak, S. Complete Gradient Clustering Algorithm for Features Analysis of X-Ray Images. In Information Technologies in Biomedicine; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–24.