entropy — Article

Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering

Meshal Shutaywi 1 and Nezamoddin N. Kachouie 2,*

1 Department of Mathematics, King Abdulaziz University, Rabigh 21911, Saudi Arabia; mshutaywi@kau.edu.sa
2 Department of Mathematical Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA
* Correspondence: nezamoddin@fit.edu

Entropy 2021, 23, 759. https://doi.org/10.3390/e23060759
Academic Editor: Antonio M. Scarfone
Received: 18 April 2021; Accepted: 4 June 2021; Published: 16 June 2021

Abstract: Grouping objects based on their similarities is an important and common task in machine learning applications. Many clustering methods have been developed; among them, k-means-based methods are broadly used, and several extensions of the original k-means algorithm, such as k-means++ and kernel k-means, have been proposed to improve it. K-means is a linear clustering method; that is, it divides the objects into linearly separable groups, while kernel k-means is a non-linear technique. Kernel k-means projects the elements to a higher dimensional feature space using a kernel function and then groups them. Different kernel functions may not perform similarly when clustering the same data set and, in turn, choosing the right kernel for an application can be challenging. In our previous work, we introduced a weighted majority voting method for clustering based on normalized mutual information (NMI). NMI is a supervised criterion: the true labels of a training set are required to calculate it. In this study, we extend our previous work on aggregating clustering results to develop an unsupervised weighting function for situations in which a training set is not available. The proposed weighting function is based on the Silhouette index, an unsupervised criterion, so a training set is not required to calculate it. This makes our new method more sensible in terms of the clustering concept.

Keywords: k-means; kernel k-means; machine learning; nonlinear clustering; silhouette index; weighted clustering

1. Introduction

There is a high demand for developing new methods to discover hidden structures, identify patterns, and recognize different groups in machine learning applications [1]. Cluster analysis has been widely applied for dividing objects into different groups based on their similarities [2]. It is an important task in different areas of study such as medical diagnosis, information retrieval, marketing, and the social sciences [3]. For instance, in medical diagnosis, clustering can help to divide patients into different groups based on the stage of the disease, and in marketing, clustering can assist in identifying customers with similar interests so that advertising can be customized and made more effective [4]. Cluster analysis is an unsupervised learning method [5] that optimizes an objective function based on feature similarities [6].
The goal is to find groups such that elements in the same group have similar features. Clustering algorithms often use a search method to optimize the objective function. To quantify the dissimilarity of each pair of elements, a distance measure such as the Euclidean or Manhattan distance can be used. The number of clusters is often unknown and must be chosen for cluster analysis. Hence, either the number of clusters is specified by the user or it must be estimated using an overall distance measure such as the sum of the squared distances of elements from their cluster centers. An objective function is optimized by minimizing the distance of elements to their cluster centers (within-cluster distance) and/or maximizing the distance between cluster centers (between-cluster distance). A search strategy is required to find the groups that optimize the objective function [1].

Several clustering methods have been developed [7], such as k-means, model-based clustering, spectral clustering, and hierarchical clustering. The focus of this study is on k-means-based clustering methods. K-means is a well-known clustering method that aims to minimize the Euclidean distance between each point and the center of the cluster to which it belongs. The advantages of k-means are its simplicity and speed [8]. However, k-means can only discover clusters that are linearly separable. In contrast, kernel k-means is a non-linear extension of k-means that can identify clusters that are not linearly separable. Although the objective functions of the two methods are similar, in kernel k-means the elements are projected into a higher dimensional space.

In our previous work [9], we introduced a weighted majority voting method for clustering based on normalized mutual information (NMI). We showed that the clustering results highly depend on the selected kernel function when using the kernel k-means method. For example, choosing a different kernel, such as a Gaussian, polynomial, or hyperbolic tangent kernel, yields different clustering results for the same dataset. Therefore, to eliminate the partiality introduced by the chosen kernel in an arbitrary application, we aggregated the clustering results obtained by different kernels. To achieve this, we implemented a weighting function that assigns a weight to each kernel based on its performance. The performance of each kernel for the clustering application was assessed by normalized mutual information (NMI) [10]. NMI is a supervised measure, i.e., a training set is needed to calculate NMI and evaluate the performance of a clustering method. However, a training set might not be available for clustering purposes. Therefore, in this study, we extend our previous work on ensemble clustering by developing a weighting function that does not need a training set. The Silhouette index is an unsupervised method for evaluating the performance of a clustering method [11]. Since the Silhouette index does not need a training set to evaluate the clustering performance, it is more relevant to the clustering concept. In this work, we have developed a different weighting function based on the Silhouette index.

The paper is organized as follows. We review k-means, kernel k-means, and the Silhouette index in Section 2. Section 3 explains the proposed weighting function using the Silhouette index.
Simulation studies and results are presented in Section 4, and the conclusions are provided in Section 5.

2. Background

A brief review of k-means, kernel k-means, and the Silhouette index follows.

2.1. K-Means

K-means chooses K centers such that the total squared distance between each point and its cluster center is minimized. The k-means technique can be summarized as follows. First, K arbitrary centers are selected, usually, as Lloyd's algorithm suggests, uniformly at random from the data. Second, the Euclidean distance between each element and every cluster center is calculated, and each element is assigned to its closest cluster center. Third, new cluster centers are obtained by averaging all elements grouped in the same cluster in the second step. Finally, the second and third steps are repeated until the algorithm converges, i.e., the cluster centers obtained in the current iteration are the same as, or very close to, the cluster centers obtained in the previous iteration. The k-means objective function can be written as

\sum_{k=1}^{K} \sum_{x_i \in \pi_k} \| x_i - \mu_k \|^2,

where \pi_k is cluster k, \mu_k is the center of cluster k, and \| \cdot \| is the Euclidean distance.

2.2. Kernel K-Means

Kernel k-means is an extension of k-means for grouping objects that are not linearly separable. The idea of kernel k-means clustering relies on projecting the elements into a higher-dimensional feature space using a non-linear function so that they become linearly separable in the projected space. The kernel k-means algorithm is summarized below [10]. Let {x_1, x_2, ..., x_n} be a set of n data points, K be the number of clusters, \pi_k be cluster k, \{\pi_k\}_{k=1}^{K} be a partitioning of the points into K groups, and \phi be a non-linear function projecting each data point x_i to a higher dimensional space. Each element of the kernel matrix M is

M_{ij} = \phi(x_i) \cdot \phi(x_j), for i, j = 1, 2, ..., n, (1)

where \phi(x_i) and \phi(x_j) are the transformations of x_i and x_j, respectively. Some popular kernel functions are the radial basis function (RBF), known as the Gaussian kernel [12], the polynomial kernel, and the sigmoid kernel. The procedure is similar to k-means, but operates in the transformed space as follows.

Step 1: after transforming the data points into the new space, each cluster center \mu_k is randomly initialized.

Step 2: the squared Euclidean distance between each element and every cluster center \mu_k is computed in the transformed space by

\| \phi(x_i) - \mu_k \|^2 = \phi(x_i) \cdot \phi(x_i) - \frac{2 \sum_{x_j \in \pi_k} \phi(x_i) \cdot \phi(x_j)}{|\pi_k|} + \frac{\sum_{x_j, x_c \in \pi_k} \phi(x_j) \cdot \phi(x_c)}{|\pi_k|^2}, (2)

where |\pi_k| is the number of elements in cluster \pi_k.

Step 3: each data point in the transformed space is assigned to the closest cluster, i.e., the cluster with minimum distance in the transformed space.

Step 4: a new cluster center \mu_k is obtained for cluster k by averaging all elements that were assigned to cluster \pi_k in the transformed space in the previous iteration:

\mu_k = \frac{\sum_{x_i \in \pi_k} \phi(x_i)}{|\pi_k|}, for k = 1, 2, ..., K. (3)

Steps 2 to 4 are repeated to minimize the objective function

D(\{\pi_k\}_{k=1}^{K}) = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} \| \phi(x_i) - \mu_k \|^2. (4)

The algorithm converges when the cluster centers obtained in the current iteration are the same as, or very close to, the cluster centers obtained in the previous iteration.
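Since the distinction between linear and non-linear separability is central here, the following short R sketch contrasts the two methods on a toy data set of two concentric rings. It is an illustrative example, not code from the paper; the data set and random seed are our own choices, and kkmeans() comes from the kernlab package that is also used for the experiments reported later.

```r
# Minimal sketch: linear k-means vs. kernel k-means on two concentric rings.
# The toy data and parameter choices are illustrative assumptions.
library(kernlab)  # provides kkmeans() and the rbfdot/polydot/tanhdot kernels

set.seed(1)
theta <- runif(200, 0, 2 * pi)
r     <- rep(c(1, 3), each = 100)                       # two concentric rings
x     <- cbind(r * cos(theta), r * sin(theta)) +
         matrix(rnorm(400, sd = 0.1), ncol = 2)

km  <- kmeans(x, centers = 2, nstart = 10)              # linear k-means
kkm <- kkmeans(x, centers = 2,                          # kernel k-means, Gaussian kernel
               kernel = "rbfdot", kpar = "automatic")   # heuristic kernel width

# k-means always cuts the rings with a straight boundary, whereas kernel k-means
# can recover the two rings when the kernel width is suitable.
table(km$cluster, r)
table(kkm@.Data, r)   # the returned specc object extends an integer label vector
```

Note that the projection into the feature space stays implicit: kkmeans() only evaluates the kernel matrix of Equation (1), which is why only inner products \phi(x_i) \cdot \phi(x_j) appear in Equations (2)-(4).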
2.3. Silhouette Index

There are several methods to evaluate clustering results, such as the Rand index [13], the adjusted Rand index [14], the distortion score [11], and the Silhouette index. While most of these performance evaluation methods need a training set, the Silhouette index does not need a training set to evaluate the clustering results. This makes it more appropriate for a clustering task. In this work, we use the Silhouette index to evaluate the clustering performance. The Silhouette width s(x_i) for the point x_i is defined as [11]

s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{b(x_i), a(x_i)\}}, (5)

where x_i is an element in cluster \pi_k, a(x_i) is the average distance of x_i to all other elements in cluster \pi_k (within dissimilarity), and b(x_i) = \min_{l \neq k} \{d_l(x_i)\}, where d_l(x_i) is the average distance from x_i to all points in cluster \pi_l for l \neq k (between dissimilarity). From Equation (5), the value of the Silhouette width varies between −1 and 1. A negative value is undesirable because it corresponds to a case in which a(x_i) is greater than b(x_i), which means the within dissimilarity is greater than the between dissimilarity. A positive value is obtained when a(x_i) < b(x_i), and the Silhouette width reaches its maximum, s(x_i) = 1, for a(x_i) = 0 [11]. The greater the (positive) s(x_i) value of an element, the higher the likelihood that it has been clustered in the correct group. Elements with negative s(x_i) are more likely to have been clustered in wrong groups [15]. A typical example is shown in Figure 1, where x_i is shown by a black disk, a(x_i) by red lines, and b(x_i) by blue lines. The average Silhouette width for a cluster is the average s(x_i) over all points in the cluster, and the average Silhouette width for the entire clustering result is the average s(x_i) over all points in every cluster. We discuss the proposed weighting function using the Silhouette index in detail in the next section.

Figure 1. Procedure to calculate the Silhouette index for a typical data point X. Red lines show the distances between X and every data point clustered in the same group. Blue lines show the distances between X and every data point clustered in the nearest group.
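As a concrete illustration of Equation (5), the short R sketch below computes the Silhouette width of every point and the average Silhouette width of a clustering. It is a minimal example under our own assumptions (Euclidean distances, every cluster containing at least two points), not the authors' code; the silhouette() function in the R package cluster implements the same definition.

```r
# Minimal sketch of Equation (5); assumes every cluster has at least two points.
silhouette_widths <- function(x, labels) {
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  n <- nrow(d)
  sapply(seq_len(n), function(i) {
    same <- labels == labels[i]
    a_i  <- mean(d[i, same & seq_len(n) != i])              # within dissimilarity a(x_i)
    b_i  <- min(tapply(d[i, !same], labels[!same], mean))   # between dissimilarity b(x_i)
    (b_i - a_i) / max(b_i, a_i)
  })
}

# Example: average Silhouette width of a k-means clustering of the iris measurements.
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 10)$cluster
s  <- silhouette_widths(iris[, 1:4], cl)
mean(s)   # average Silhouette width of the entire clustering result
```

The per-point widths returned by this function are the quantities displayed, one horizontal line per data point, in the Silhouette plots of Section 5; cluster::silhouette(cl, dist(iris[, 1:4])) should return the same values.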
3. Weighted Clustering Method Using Silhouette Index

The goal of the proposed method is to assess the performance of different clustering methods and combine their results based on their performances to provide a single outcome. The focus here is on kernel k-means, because selecting the right kernel function F(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) for an arbitrary application is not obvious. Therefore, applying a set of different kernels for clustering and aggregating the clustering results based on the performance of the different kernels can provide consistent results and eliminate the partiality of the results based on the selected kernel. We compute the performance of a kernel by computing the average Silhouette width (\gamma) for the clustering results obtained by that kernel. Three kernels, namely Gaussian, polynomial, and hyperbolic tangent, are used to project the elements as follows.

The Gaussian kernel with standard deviation \sigma is given by:

\phi(x_i) \cdot \phi(x_j) = e^{-\| x_i - x_j \|^2 / (2\sigma^2)}. (6)

The polynomial kernel is defined by:

\phi(x_i) \cdot \phi(x_j) = (p \, x_i^T x_j + c)^l, (7)

where p is the slope, c is the intercept, and l is the polynomial order. The hyperbolic tangent kernel is given by:

\phi(x_i) \cdot \phi(x_j) = \tanh(p \, x_i^T x_j + c), (8)

where p is the slope and c is the intercept.

We compute the average Silhouette width for the clustering results obtained by each kernel separately. We then combine the results using computed weights for each kernel. Let \gamma_j be the average Silhouette width for the clustering result obtained using the jth kernel. The assigned weight \delta_j for the clustering result of the jth kernel is computed by:

\delta_j = \frac{\gamma_j + 1}{\sum_{m=1}^{d} (\gamma_m + 1)}, for j = 1, 2, ..., d, (9)

where d is the number of kernel functions (here d = 3) and

\sum_{j=1}^{d} \delta_j = 1. (10)
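A minimal R sketch of the weight computation in Equations (9) and (10) follows; the average Silhouette values used in it are made-up numbers for illustration only.

```r
# Weights of Equation (9): shift the average Silhouette widths from [-1, 1] to
# [0, 2] and normalize so the weights sum to one (Equation (10)).
silhouette_to_weights <- function(gamma) (gamma + 1) / sum(gamma + 1)

gamma <- c(gaussian = 0.45, polynomial = 0.30, tangent = -0.10)  # hypothetical values
delta <- silhouette_to_weights(gamma)
delta        # non-negative weights, one per kernel
sum(delta)   # equals 1
```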
The Silhouette value is used to evaluate and assign a weight to each kernel. Weights must be non-negative real values between 0 and 1 and must sum up to one. Hence, we shift the Silhouette scores by adding one, obtaining shifted Silhouette scores ranging from 0 to 2. Normalized non-negative weights are then computed by dividing each shifted Silhouette score by the sum of the shifted Silhouette scores.

Next, for each data point, we sum up the weights (\delta_j) assigned to the kernels that clustered the data point into the same group. To ensure the consistency of the group labels across different kernels, the clusters found by the first kernel are considered the base group labels. The Euclidean distances between the cluster centers found by each kernel and the cluster centers found by the first kernel (reference groups) are computed to assign consistent cluster labels. The cluster labels are then matched based on the minimum sum of Euclidean distances. Finally, for a given data point, we compare the total weights computed for each cluster label, and the data point is assigned to the group with the highest total weight. The proposed method is summarized below.

1. Let \Omega_d be a set of d kernel functions h_1 to h_d; perform the kernel k-means method using the kernels in the kernel set and generate d clustering results \Gamma_j, for j = {1, 2, ..., d};
2. Compute the average Silhouette width \gamma_j for the clustering result \Gamma_j obtained by kernel h_j, for all kernels j = {1, 2, ..., d};
3. Shift the Silhouette values from [−1, 1] to [0, 2] to compute non-negative weights \delta_j for each kernel;
4. For each data point, use the weights \delta_j computed in step (3) to combine the clustering results \Gamma_j, for j = {1, 2, ..., d}, as follows:
   • Sum up the weights corresponding to the kernels that assign the same cluster label to the data point;
   • Compare the total weight of each cluster for the data point;
   • Group the data point into the cluster with the highest total weight.

4. Simulation

To evaluate the performance of the proposed method, we applied it to several benchmark datasets as follows.
• Bensaid data: contains 49 two-dimensional elements grouped in three classes [16].
• Dunn data: contains 90 two-dimensional elements grouped in two classes [17].
• Iris data: contains 150 four-dimensional elements grouped in three classes [18].
• Seed data: contains 210 seven-dimensional elements grouped in three classes [19].

To address the random initialization of the cluster centers in kernel k-means, we applied the proposed method in a Monte Carlo setting and averaged the results obtained over 100 Monte Carlo trials. In each trial, initial cluster centers were randomly generated. For each dataset, kernel k-means with three different kernel functions (Gaussian, polynomial, and hyperbolic tangent) was used. To evaluate the clustering performance of each kernel, the clusters identified by these three kernels are matched using their cluster centroids. The clustering performance of each kernel is then evaluated using the Monte Carlo average of the Silhouette index. Next, the weight of each kernel is assigned based on its estimated performance. Finally, the clustering results of the three kernels are merged using majority voting and the proposed weighted majority voting method.
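To make the procedure concrete, the following R sketch runs a single trial of the setup just described on the Iris data: kernel k-means with the three kernels, Silhouette-based weights (Equation (9)), centroid-based label matching, and weighted majority voting. It is an illustrative sketch rather than the authors' code; the kernel hyper-parameters, the choice of Iris as the example, and the assumption that every kernel recovers all K clusters are ours.

```r
# One illustrative trial of the proposed weighted majority voting (a sketch,
# not the authors' released code). Assumes every kernel returns all K clusters.
library(kernlab)   # kkmeans()
library(cluster)   # silhouette()

set.seed(1)
x <- as.matrix(iris[, 1:4]); K <- 3

res <- list(
  gaussian   = kkmeans(x, centers = K, kernel = "rbfdot",  kpar = "automatic"),
  polynomial = kkmeans(x, centers = K, kernel = "polydot", kpar = list(degree = 2, scale = 1, offset = 1)),
  tangent    = kkmeans(x, centers = K, kernel = "tanhdot", kpar = list(scale = 0.01, offset = 1))
)
labels <- lapply(res, function(r) as.integer(r@.Data))     # cluster labels per kernel

# Steps 2-3: average Silhouette width per kernel and the weights of Equation (9).
gamma <- sapply(labels, function(cl) mean(silhouette(cl, dist(x))[, "sil_width"]))
delta <- (gamma + 1) / sum(gamma + 1)

# Align every labeling to the first kernel's clusters: pick the label permutation
# that minimizes the total Euclidean distance between matched cluster centroids.
centroids <- function(cl) t(sapply(1:K, function(k) colMeans(x[cl == k, , drop = FALSE])))
perms <- function(v) if (length(v) == 1) list(v) else
  do.call(c, lapply(seq_along(v), function(i) lapply(perms(v[-i]), function(p) c(v[i], p))))
align <- function(cl, ref) {
  cen  <- centroids(cl)
  cost <- sapply(perms(1:K), function(p) sum(sqrt(rowSums((cen - ref[p, ])^2))))
  best <- perms(1:K)[[which.min(cost)]]
  best[cl]                                                  # relabeled assignment
}
ref     <- centroids(labels[[1]])
aligned <- c(labels[1], lapply(labels[-1], align, ref = ref))

# Step 4: weighted majority vote for every data point.
vote <- sapply(seq_len(nrow(x)), function(i) {
  w <- tapply(delta, vapply(aligned, `[`, integer(1), i), sum)   # total weight per label
  as.integer(names(w)[which.max(w)])
})
table(vote, iris$Species)   # compare the combined clustering with the known classes
```

A Monte Carlo version of the experiment would wrap this trial in a loop over random initializations and average the resulting Silhouette indices and true rates, as described above.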
5. Results

In this section, the clustering results obtained using the proposed method are demonstrated. Table 1 summarizes the Monte Carlo average of the average Silhouette indices and the Monte Carlo average of the true rates for the clustering results of each dataset obtained using the Gaussian, polynomial, and tangent kernels, along with the combined results using the proposed weighted method. We use the package "kernlab" in R, and the hyper-parameters are set by its default "automatic" setting, which uses a heuristic to determine suitable values for the hyper-parameters of the kernel.

Table 1. Monte Carlo average of the average Silhouette index (ASI) and Monte Carlo average of the true rate (TR) for the clustering results obtained using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), along with majority voting and the proposed weighted majority voting based on the average Silhouette.

Data     Metric  Gaussian  Polynomial  Tangent  Majority Voting  Weighted Majority Voting
Bensaid  ASI     0.159     0.453       -0.127   0.132            0.167
         TR      0.528     0.641        0.413   0.592            0.611
Dunn     ASI     0.269     0.418       -0.003   0.293            0.293
         TR      0.453     0.521        0.417   0.474            0.474
Iris     ASI     0.609     0.609       -0.068   0.482            0.589
         TR      0.856     0.850        0.381   0.833            0.873
Seed     ASI     0.594     0.601       -0.052   0.537            0.570
         TR      0.869     0.842        0.376   0.852            0.862

The first row of Table 1 summarizes the clustering results for the Bensaid dataset. For this dataset, the polynomial kernel provided the best performance among all kernels based on the Monte Carlo average true rate (0.641) and the Monte Carlo average of the average Silhouette indices (0.453). Figure 2 (top left) shows the three original groups in the Bensaid dataset along with the clustering results obtained by the three different kernels, majority voting, and the proposed weighted method. It shows that, among all kernels, the polynomial kernel could best distinguish the original clusters in the Bensaid dataset. The clustering results of these three kernels are combined using the calculated weights for each kernel. Then, the Monte Carlo average of the average Silhouette indices and the Monte Carlo average of the true rates are obtained for both majority voting and the proposed method. The proposed method outperforms majority voting, with a Monte Carlo average true rate of 0.611 in comparison with 0.592. Figure 3 shows the Silhouette index for all data points in the Bensaid dataset, computed for the clustering results obtained using each kernel separately. A colored horizontal line shows the Silhouette index of each data element. There are fewer data points with negative Silhouette scores in the clustering results obtained by the proposed method than in those of the majority voting method. As a result, the average Silhouette index of the proposed method is higher.

The second row of Table 1 summarizes the clustering results for the Dunn dataset. For this dataset with two original groups, the clustering performances of the three different kernels were comparable, with Monte Carlo average true rates of 0.453, 0.521, and 0.417 for the Gaussian, polynomial, and tangent kernels, respectively. Among them, the polynomial kernel provided slightly better results (with a true rate of 0.521). As we can see in Table 1, the Monte Carlo average of the average Silhouette index for this kernel is also higher than those of the other two kernels (0.418), resulting in a higher weight for this kernel when the results are combined.
Figure 4 shows the original groups in the Dunn dataset and the clustering results obtained using kernel k-means with three different kernel functions, majority voting, and the proposed weighted method. Since the three different kernels produced comparable results, the proposed method (weighted averaging) and majority voting (simple averaging) also produce similar results, comparable with the results obtained by the three individual kernels (Table 1). Figure 5 illustrates the Silhouette indices obtained for the clustering results of the Dunn dataset in a randomly selected trial (from the 100 Monte Carlo trials).

Figure 2. Original groups in the Bensaid data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method. Results are visualized using the first and second principal components.

Figure 3. Silhouette index obtained for the clustering results of the Bensaid dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method.
Figure 4. Original groups in the Dunn data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method. Results are visualized using the first and second principal components.

Figure 5. Silhouette index obtained for the clustering results of the Dunn dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method.

The third row of Table 1 summarizes the clustering results for the Iris dataset. For this dataset with three original groups, the Gaussian kernel performed better than the other two kernels, with a Monte Carlo average true rate of 0.856. The polynomial kernel produced comparable results to the Gaussian kernel with a Monte Carlo average true rate of 0.850, while the Monte Carlo average true rate of the tangent kernel was 0.381. The computed Monte Carlo average Silhouette indices were 0.609, 0.609, and −0.068 for the Gaussian, polynomial, and tangent kernels, respectively. As a result, the Gaussian and polynomial kernels received the same high weight and the tangent kernel received a low weight in the proposed weighted method.
The proposed method provided a Monte Carlo average true rate of 0.873, which is higher than the Monte Carlo average true rates obtained by each kernel individually. As we can see in Table 1, the clustering results obtained by the proposed method also have a higher Monte Carlo average true rate than majority voting, since the latter assigns the same weight to every kernel. Figure 6 shows the original groups in the Iris dataset and the clustering results obtained using kernel k-means with three different kernel functions, majority voting, and the proposed weighted method. Figure 7 shows the Silhouette indices calculated for the clustering results of the Iris data obtained using kernel k-means with three different kernel functions, majority voting, and the proposed weighted method in a randomly selected trial (from the 100 Monte Carlo trials).

The fourth row of Table 1 summarizes the clustering results for the Seed dataset. For this dataset with three original groups, as with the Iris data, the Gaussian kernel performed better than the other two kernels, with a Monte Carlo average true rate of 0.869. The polynomial kernel produced comparable results to the Gaussian kernel with a Monte Carlo average true rate of 0.842, while the Monte Carlo average true rate of the tangent kernel was 0.376. As a result, the Gaussian and polynomial kernels received high weights and the tangent kernel received a low weight in the proposed weighted method. The proposed method obtained a Monte Carlo average true rate of 0.862, which is higher than that of majority voting (0.852) and almost as high as the best performance obtained by the Gaussian kernel (0.869). Figure 8 shows the original groups in the Seed dataset and the clustering results obtained using kernel k-means with three different kernel functions, majority voting, and the proposed weighted method. Figure 9 shows the Silhouette indices calculated for the clustering results of the Seed data obtained using kernel k-means with three different kernel functions, majority voting, and the proposed weighted method in a randomly selected trial (from the 100 Monte Carlo trials).

Figure 6. Original groups in the Iris data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method. Results are visualized using the first and second principal components.

Figure 7. Silhouette index obtained for the clustering results of the Iris dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method.
Figure 8. Original groups in the Seed data and the clustering results of kernel k-means obtained using three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method. Results are visualized using the first and second principal components.

Figure 9. Silhouette index obtained for the clustering results of the Seed dataset using kernel k-means with three different kernel functions (Gaussian, polynomial, and tangent), majority voting, and the proposed weighted method.
6. Conclusions and Future Research

In our previous work [9], we introduced a weighted mutual information method for aggregated kernel clustering. We showed that the clustering results highly depend on the selected kernel in the kernel k-means method. Therefore, we proposed a weighting function based on the clustering performance to aggregate the clustering results obtained by different kernels. The notion of aggregating the clustering results obtained by several kernel functions, including Gaussian, polynomial, and hyperbolic tangent, was based on assigning weight functions using their associated NMI values. Calculation of NMI requires a training set, which might not be available; therefore, in this paper, we extended our previous work on ensemble clustering by developing an unsupervised weighting function based on the Silhouette index. Calculating the Silhouette index does not require a training set, which in turn is more sensible in the context of clustering. We applied the proposed method to different benchmark datasets and combined the results obtained by three different kernels. We should point out that the proposed aggregated method either improved the clustering performance or provided comparable results, depending on the performance of the different kernels for the application at hand. However, the main goal of the proposed method is to obtain impartial results independent of the kernel function, rather than merely improving the clustering performance. The former is essential because, in real applications, the true results are not available to measure the clustering performance and, as a result, the choice of the right clustering method is not obvious. In turn, by aggregating the results based on sensible weights, the aggregated clustering results are less biased with respect to the selected method. Here, we showed that not only are the aggregated results impartial, with ensemble performance comparable to the best performance obtained by an individual kernel, but also that, for some datasets, the aggregated results outperformed the best individual kernel. The focus of future work is on further improvement of the performance of the ensemble clustering.
As the average Silhouette score of the entire model demonstrates encouraging results, future research will be conducted to study a pointwise Silhouette score for potential further improvement.

Author Contributions: Conceptualization, N.N.K. and M.S.; methodology, N.N.K.; software, M.S.; validation, N.N.K. and M.S.; formal analysis, N.N.K. and M.S.; investigation, N.N.K. and M.S.; data curation, N.N.K. and M.S.; writing—original draft preparation, N.N.K. and M.S.; writing—review and editing, N.N.K. and M.S.; visualization, N.N.K. and M.S.; supervision, N.N.K.; project administration, N.N.K. All authors have read and agreed to the published version of the manuscript.

Funding: The publication of this paper was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. J: 87-662-1441.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments: The authors would like to acknowledge the support of the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, for the publication of this paper.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Gentleman, R.; Carey, V.; Huber, W.; Irizarry, R.; Dudoit, S. (Eds.) Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Springer Science & Business Media: New York, NY, USA, 2006.
2. Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 2003, 52, 91–118.
3. Wu, J. Advances in K-Means Clustering: A Data Mining Thinking; Springer Science & Business Media: New York, NY, USA, 2012.
4. Kassambara, A. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning; CreateSpace Independent Publishing Platform: Scotts Valley, CA, USA, 2017; Volume 1.
5. Borcard, D.; Gillet, F.; Legendre, P. Numerical Ecology with R (Use R!); Springer: New York, NY, USA, 2011.
6. Nguyen, N.; Caruana, R. Consensus clusterings. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 607–612.
7. Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323.
8. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
9. Kachouie, N.N.; Shutaywi, M. Weighted Mutual Information for Aggregated Kernel Clustering. Entropy 2020, 22, 351.
10. Dhillon, I.S.; Guan, Y.; Kulis, B. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM: New York, NY, USA, 2004; pp. 551–556.
11. Camps-Valls, G. (Ed.) Kernel Methods in Bioengineering, Signal and Image Processing; IGI Global: Hershey, PA, USA, 2006.
12. Campbell, C. An introduction to kernel methods. Stud. Fuzziness Soft Comput. 2001, 66, 155–192.
13. Yeung, K.Y.; Haynor, D.R.; Ruzzo, W.L. Validating clustering for gene expression data. Bioinformatics 2001, 17, 309–318.
14. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
15. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
16. Bensaid, A.M.; Hall, L.O.; Bezdek, J.C.; Clarke, L.P.; Silbiger, M.L.; Arrington, J.A.; Murtagh, R.F. Validity-guided (re)clustering with applications to image segmentation. IEEE Trans. Fuzzy Syst. 1996, 4, 112–123.
17. Bezdek, J.C.; Keller, J.; Krisnapuram, R.; Pal, N. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing; Springer Science & Business Media: New York, NY, USA, 1999; Volume 4.
18. Kozak, M.; Łotocka, B. What should we know about the famous Iris data? Curr. Sci. 2013, 104, 579–580.
19. Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, P.A.; Łukasik, S.; Żak, S. Complete Gradient Clustering Algorithm for Features Analysis of X-Ray Images. In Information Technologies in Biomedicine; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–24.