STATS 780 Assignment 3
Student number: 400415239
Kyuson Lim
15 March, 2022

Contents
Exploratory Data Analysis
Supplementary material
Technical supplemental material
Reference

Exploratory Data Analysis

The data, sourced from the UCI Machine Learning Repository, contain 88 Chinese ceramic samples named by collection location (the Longquan kiln at Dayao County and the Jingdezhen kiln), split evenly into 44 celadon body and 44 glaze samples, with 17 chemical compositions measured for each (He et al., 2016). Of these, only 8 compounds (Na2O, MgO, Al2O3, SiO2, K2O, CaO, TiO2, Fe2O3) are recorded as continuous percentage values suitable for classification. According to the chemical analysis, two compounds, Fe2O3 (mean = 1.56, s^2 = 0.365) and Al2O3 (mean = 17.46, s^2 = 22.12), are the key compounds that vary between the two locations from which the 44 celadon body samples were collected (He et al., 2016). Hence, the clustering models are fit on the standardized (scaled) Fe2O3 and Al2O3 values to assess whether classification is feasible with two variables; the exception is the PCA-based model, which uses all 8 compounds so that dimension reduction can be applied. There are no missing entries, and no outliers are detected in the two variables.

[Figure 1 appears here: (a) a circular hierarchical clustering dendrogram of the 44 celadon body samples; (b) a k-means cluster scatter plot of scaled Al2O3 versus Fe2O3.]

Figure 1: (a) A hierarchical clustering plot shows the 44 samples classified into two groups based on the two variables. (b) A k-means clustering model shows the classified clusters in a 2-dimensional plot.

As the purpose is to compare all pairs of points in two dimensions, Euclidean distance with average linkage is applied in the hierarchical clustering model to identify two clusters of samples in Figure 1 (James et al., 2013). The number of clusters chosen for k-means clustering is k = 2, based on the silhouette coefficient, with a high average silhouette width of 0.52 (Table 2; Chakravorty, 2020). Given that the true celadon body parts are distributed 13/31 between the two locations, the estimated cluster sizes of 15/29 are very close, indicating a likely good fit of the k-means clustering analysis (Table 1). The hierarchical clustering model, with the same cluster sizes of 15/29, has an accuracy of 0.909, higher than the k-means cluster model with an accuracy of 0.818. Both methods discriminate the two locations with accuracy high enough (above 0.7) to count as successful classification models.
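Since cluster labels carry no inherent meaning, the accuracies above require matching each cluster to a location. A minimal sketch of that computation is below; `truth` and `cl` are illustrative stand-ins, not the report's objects (the appendix builds the real ones from `era`, `cutree()`, and `kmeans()`, and scores them with caret::confusionMatrix).

# Minimal sketch: accuracy of a 2-cluster solution against known locations.
truth <- c(rep(1, 13), rep(2, 31))         # true 13/31 location split
cl    <- sample(1:2, 44, replace = TRUE)   # placeholder cluster labels

# Cluster numbering is arbitrary, so score both possible label matchings
# and keep the better one.
acc <- max(mean(cl == truth), mean((3 - cl) == truth))
acc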
[Figure 2 appears here: (a) a PCA biplot with k-means clusters on the first two PCs (PC1 45.0% and PC2 21.0% of variance explained); (b) a model-based (Gaussian mixture) clustering plot of scaled Al2O3 versus Fe2O3.]

Figure 2: (a) A PCA is performed and the scores on the first 2 PCs are used for k-means clustering with k = 2. (b) A model-based clustering is performed on the 2 variables to derive 2 clusters.

The number of principal components chosen for the PCA is 2, since the most drastic decrease in the scree plot occurs at PC2 (Figure 3) and, by the elbow method, the curve flattens after the 2nd PC; the first PC explains 0.45 of the variance and the second explains 0.21 (James et al., 2013). A two-PC model captures the essence of the data with as few PCs as possible (Ngo, 2018). Then, the number of clusters selected is k = 2, based on the silhouette plot with an average silhouette width of 0.45 (Table 4). Given that the true celadon body parts are distributed 13/31, the estimated cluster sizes for this k-means clustering analysis are 16/28, close to the true clusters (Table 6). With an accuracy of 0.886, k-means clustering on the first 2 PCs classifies the ceramics better than the ordinary k-means clustering, as the two PCs explain more of the total variability of the data than the two chosen chemical compounds, Fe2O3 and Al2O3.

Lastly, model-based clustering with BIC identifies a 2-cluster EII model, whose BIC of -236.27 is the maximum among the 14 mixture models with 2 clusters in Table 5 (Fraley et al., 2012). Although 4 clusters would be chosen from the BIC plot alone (EEI covariance, BIC = -223.05), the number of clusters is fixed at 2 so that the result can be compared with the true clusters for accuracy and across models (Figure 4). The Gaussian mixture model classifies the two locations with cluster sizes of 19/25 and an accuracy of 0.818, the same as k-means clustering without PCA (Figure 2). Thus, hierarchical clustering has the highest classification accuracy with 2 variables, and k-means clustering with PCA performs better than ordinary k-means clustering.

Supplementary material

Model selection and criterion

[Figure 3 appears here: (a) a k-means silhouette plot (average silhouette width = 0.52); (b) a scree plot of variance explained for PC1 through PC8.]

Figure 3: (a) A k-means clustering silhouette plot is shown for the average width among 2 clusters. (b) A scree plot with 8 principal components is shown for the variance explained.

[Figure 4 appears here: (a) a k-means silhouette plot on the first 2 PCs (average silhouette width = 0.45); (b) a line graph of BIC values for the 14 mclust covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV) across 1 to 5 clusters.]

Figure 4: (a) A k-means clustering silhouette plot based on 2 principal components is shown for the average width. (b) BIC values from the model-based clustering analysis across 5 different cluster counts are shown as a line graph.
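As a brief sketch of the scree computation behind this choice (assuming `df` is the 44 x 8 scaled compound matrix constructed in the technical supplement):

# Proportion of variance explained by each principal component.
pca <- prcomp(df, center = TRUE, scale. = FALSE)   # data already scaled
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)    # begins 0.450, 0.210, ... as in Table 3
sum(pve[1:2])    # the first two PCs explain roughly 0.66 of the variance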
Table 1: Only k-means clustering, number of elements per cluster for k = 2 to 6

  k = 2: 25 19
  k = 3: 18 7 19
  k = 4: 13 10 14 7
  k = 5: 7 13 12 2 10
  k = 6: 7 8 6 6 8 9

Table 2: Only k-means clustering, average silhouette width for different k

  k              2     3     4     5     6     7     8
  Average width  0.52  0.49  0.48  0.53  0.48  0.37  0.40

Table 3: Scree plot values for each PC

  PC                  PC1    PC2    PC3    PC4    PC5    PC6
  Variance explained  0.450  0.210  0.146  0.101  0.048  0.028

Table 4: Average silhouette width for k-means clustering after PCA

  k              2     3     4     5     6     7     8
  Average width  0.45  0.44  0.43  0.44  0.43  0.50  0.48

Table 5: Top 3 model-based clustering models by BIC when G = 2

  model  EII,2    VII,2    EEI,2
  BIC    -236.27  -239.19  -239.87

Table 6: K-means clustering after PCA, number of elements per cluster for k = 2 to 6

  k = 2: 28 16
  k = 3: 15 11 18
  k = 4: 12 11 9 12
  k = 5: 9 7 12 11 5
  k = 6: 7 4 6 7 12 8

Technical supplemental material

library(caret)
library(pander)
library(class)
library(pscl)
library(ggbiplot)
library(mclust)
library(gridExtra)
library(knitr)
library(cowplot)
library(tidyverse)
library(reshape2)
library(kableExtra)
library(RCurl)
library(readxl)
library(dplyr)
library(magrittr)
library(readr)
library(ggplot2)
library(ISLR2)
library(cluster)
library(factoextra)
library(pgmm)
library(fossil)
library(dendextend)
library(circlize)
library(ggdendro)
library(ape)
library(ggforce)
library(e1071)

set.seed(1)
gg = "https://archive.ics.uci.edu/ml/machine-learning-databases/00583/Chemical%20Composion%20of%20Ceramic.csv"
# download and parse the data as CSV
d = read.csv(url(gg), header = T)
# keep the 8 compound columns as numeric and drop incomplete rows
df = cbind(as.numeric(d[,3]), as.numeric(d[,4]), as.numeric(d[,5]),
           as.numeric(d[,6]), as.numeric(d[,7]), as.numeric(d[,8]),
           as.numeric(d[,9]), as.numeric(d[,10])) %>% na.omit()
d_cov = d[1:44, 1]        # sample names for the 44 celadon body parts
df = scale(df[1:44, ])    # standardize the body-part rows
colnames(df) = c(colnames(d)[3:10])
rownames(df) = c(d_cov)

# outlier check on Al2O3 and Fe2O3 (none found)
om  = which(df[,8] %in% c(boxplot.stats(df[,8])$out))
om2 = which(df[,3] %in% c(boxplot.stats(df[,3])$out))

# ----------------- hierarchical clustering -------------------- #
# compute hierarchical clustering on the two compounds
res.hc <- df[, c(3, 8)] %>%
  dist(method = "euclidean") %>%
  hclust(method = "average")

# visualize using factoextra
p1 = fviz_dend(res.hc, k = 2, cex = 0.315, type = "circular",
               k_colors = c("#87CEEB", "#FFD700"),
               labels_track_height = 0, color_labels_by_k = TRUE, rect = F) +
  theme(axis.text.y = element_blank(),
        plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm")) +
  labs(title = "A hierarchical clustering plot \n of ceramic samples", size = 11)

# ------- hierarchical clustering evaluation -------------------- #
era = c(rep(1, 13), rep(2, 31), rep(1, 13), rep(2, 31))  # true locations
e1 = cbind(hc = cutree(res.hc, 2), era = era)
a <- c(era[1:44])                     # true location labels
b <- as.numeric(cutree(res.hc, 2))    # hierarchical cluster labels
k = confusionMatrix(table(a, b))      # confusion matrix

# ----------------- k-means clustering ------------------------ #
km.out <- kmeans(df[, c(3, 8)], 2, nstart = 10, algorithm = "MacQueen")

# k-means cluster scatter plot
dataset_scaled = data.frame(Al2O3 = df[,3], Fe2O3 = df[,8],
                            cluster = factor(km.out$cluster))
p2 = ggplot(dataset_scaled, aes(x = Al2O3, y = Fe2O3)) +
  #geom_mark_ellipse(aes(fill = cluster), alpha=0.1) +
  scale_fill_manual(values = c("#1ED14B", "#11a6fc")) +
  geom_point(aes(color = cluster), size = 1.7175, pch = 8, alpha = 0.55) +
  geom_point(aes(color = cluster), shape = 21, stroke = 0.7,
             size = 1.05, fill = 'white') +
  scale_alpha_continuous(guide = 'none') +
  theme(legend.position = "right",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.text = element_text(size = 11), legend.title = element_text(size = 11),
        axis.text.x = element_text(size = 11, angle = 0),
        legend.key = element_rect(fill = "white"),
        axis.text.y = element_text(size = 11),
        legend.box.background = element_blank(),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(breaks = seq(-5, 3, 0.5)) +
  scale_x_continuous(breaks = seq(-2, 6, 1)) +
  scale_color_manual(values = c("#87CEEB", "#FFD700"),
                     label = c('cluster 1', 'cluster 2')) +
  labs(title = "A k-means cluster-scatter plot \n of ceramic samples", size = 11)

# ------------- k-means evaluation ------------------------ #
b <- as.numeric(km.out$cluster)     # k-means cluster labels
k2 = confusionMatrix(table(a, b))   # confusion matrix

# ------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(p1, p2, nrow = 1, widths = c(1.25, 1))

[Figure 5 appears here: the same hierarchical dendrogram and k-means scatter plot as Figure 1, produced by the code above.]

Figure 5: (a) A hierarchical clustering plot shows the 44 samples classified into two groups based on the two variables. (b) A k-means clustering model shows the classified clusters in a 2-dimensional plot.
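The accuracy values quoted in the analysis (0.909 for hierarchical clustering, 0.818 for k-means) can presumably be read off the caret confusionMatrix objects computed above, for example:

# overall accuracy from the caret confusion-matrix objects
k$overall["Accuracy"]    # hierarchical clustering, reported as 0.909
k2$overall["Accuracy"]   # k-means clustering, reported as 0.818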
# ----------------- model-based clustering ----------------- #
BIC = mclustBIC(df[, c(3, 8)], G = 2)   # BIC over covariance models, G = 2
#summary(BIC, df[,c(3,8)])
mod1 = Mclust(df[, c(3, 8)], x = BIC, modelNames = "EII")

# cluster assignments for plotting
dataset_scaled2 = data.frame(Al2O3 = df[,3], Fe2O3 = df[,8],
                             cluster = factor(mod1$classification))

# dimension reduction (not used)
#mod1dr = MclustDR(mod1)

p4 = ggplot(dataset_scaled2, aes(x = Al2O3, y = Fe2O3)) +
  geom_point(aes(color = cluster), size = 1.7175, pch = 8, alpha = 0.55) +
  geom_point(aes(color = cluster), shape = 21, stroke = 0.7,
             size = 1.05, fill = 'white') +
  geom_mark_ellipse(aes(fill = cluster), color = 'transparent', alpha = 0.1) +
  scale_alpha_continuous(guide = 'none') +
  theme(legend.position = "right",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.text = element_text(size = 11),
        legend.key = element_rect(fill = "white"),
        legend.title = element_text(size = 11),
        axis.text.x = element_text(size = 11, angle = 0),
        axis.text.y = element_text(size = 11),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(breaks = seq(-5, 3, 0.5)) +
  scale_x_continuous(breaks = seq(-2, 6, 1)) +
  scale_color_manual(values = c("#87CEEB", "#FFD700"),
                     label = c('cluster 1', 'cluster 2')) +
  scale_fill_manual(values = c("#87CEEB", "#FFD700"), guide = 'none') +
  labs(title = "A model-based cluster- \n Gaussian mixture plot", size = 11)

# ------------------------------------------------------- #
# model-based clustering evaluation
b <- as.numeric(mod1$classification)   # Gaussian mixture cluster labels
k4 = confusionMatrix(table(a, b))      # confusion matrix

# ----------------- classification + PCA ----------------- #
### rename columns so biplot variable labels do not overlap
colnames(df)[1] = c("NA2O")
colnames(df)[2] = c("MgO")
colnames(df)[3] = c("Al2O3")
colnames(df)[6] = c("CaO")
colnames(df)[7] = c("TiO2")
colnames(df)[8] = c("Fe2O3 \n")

# PCA on all 8 compounds (already scaled above)
pca_iris = prcomp(df, center = TRUE, scale = F)
d3 = as.data.frame(pca_iris$x[, 1:2])   # scores on the first 2 PCs
km.out2 <- kmeans(d3, 2, nstart = 20)   # k-means on the PC scores

# k-means cluster scatter on the biplot
dataset_scaled = data.frame(d3, cluster = factor(km.out2$cluster))
p3 = ggbiplot(pcobj = pca_iris, obs.scale = 1, var.scale = 1,
              alpha = 0, varname.size = 2) +
  geom_point(data = dataset_scaled,
             mapping = aes(color = cluster, x = PC1, y = PC2),
             size = 1.7175, pch = 8, alpha = 0.55) +
  geom_point(data = dataset_scaled,
             mapping = aes(color = cluster, x = PC1, y = PC2),
             shape = 21, stroke = 0.7, size = 1.05, fill = 'white') +
  scale_alpha_continuous(guide = 'none') +
  theme(legend.position = "none",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.key = element_rect(fill = "white"),
        legend.text = element_text(size = 11), legend.title = element_text(size = 11),
        axis.text.x = element_text(size = 11, angle = 0),
        axis.text.y = element_text(size = 11),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(breaks = seq(-5, 5, 1)) +
  scale_x_continuous(breaks = seq(-5, 6, 1)) +
  scale_color_manual(values = c("#87CEEB", "#FFD700"),
                     label = c('Cluster 1', 'Cluster 2')) +
  scale_fill_manual(values = c("#87CEEB", "#FFD700"), guide = 'none') +
  labs(title = "A PCA & k-means cluster-scatter \n plot of ceramic samples", size = 11)

# ------------------------------------------------------- #
# PCA + k-means evaluation
b <- as.numeric(km.out2$cluster)    # PCA + k-means cluster labels
k3 = confusionMatrix(table(a, b))   # confusion matrix

# ------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(p3, p4, nrow = 1, widths = c(1, 1.2))

[Figure 6 appears here: the same PCA biplot and Gaussian mixture plot as Figure 2, produced by the code above.]

Figure 6: (a) A PCA is performed and the scores on the first 2 PCs are used for k-means clustering with k = 2. (b) A model-based clustering is performed on the 2 variables to derive 2 clusters.

## color palette
mycol = c('violet', 'cornflowerblue', 'darkkhaki', 'darkturquoise', 'lightgreen',
          'lightgray', 'moccasin', 'sandybrown', 'lightcoral',
          "#87CEEB", "#FFD700", "#ffae00", 'grey20', 'red')

# ---------------------------------------------------------------------- #
# k-means clustering silhouette plot
sil <- silhouette(km.out$cluster, daisy(df[, c(3, 8)]))
q1 = fviz_silhouette(sil, label = FALSE, print.summary = FALSE) +
  scale_fill_manual(values = c("#87CEEB", "#FFD700"),
                    labels = c("Cluster 1", "Cluster 2")) +
  scale_color_manual(values = c("#87CEEB", "#FFD700")) +
  guides(col = 'none') +
  theme(legend.position = "right",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.text = element_text(size = 11), legend.title = element_text(size = 11),
        axis.title.x = element_text(size = 11), axis.title.y = element_text(size = 11),
        axis.text.x = element_blank(), axis.text.y = element_text(size = 11),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  labs(title = "A k-means cluster silhouette plot", size = 11,
       subtitle = 'Average silhouette width = 0.52')

# ---------------------------------------------------------------------- #
# scree plot of variance explained per PC
var_explained_df <- data.frame(PC = paste0("PC", 1:8),
                               var_explained = (pca_iris$sdev)^2 / sum((pca_iris$sdev)^2))
q2 = var_explained_df %>%
  ggplot(aes(x = PC, y = var_explained, group = 1)) +
  geom_line(aes(x = PC, y = var_explained, group = 1, color = mycol[1])) +
  geom_point(aes(x = PC, y = var_explained, group = 1, color = mycol[1]),
             shape = 21, stroke = 1, size = 2, fill = "white") +
  geom_point(aes(x = PC, y = var_explained, group = 1, color = mycol[1]),
             alpha = 0.2, shape = 21, stroke = 2, size = 2.75, fill = "transparent") +
  labs(title = "Scree plot: PCA on scaled data", size = 11) +
  ylab('Variance explained') +
  theme(legend.position = "none",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.text = element_text(size = 11), legend.title = element_text(size = 11),
        axis.title.x = element_text(size = 11), axis.title.y = element_text(size = 11),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank())

# ---------------------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(q1, q2, nrow = 1)

[Figure 7 appears here: the same silhouette plot and scree plot as Figure 3, produced by the code above.]

Figure 7: (a) A k-means clustering silhouette plot is shown for the average width among 2 clusters. (b) A scree plot with 8 principal components is shown for the variance explained.

# ---------------------------------------------------------------------- #
# PCA + k-means clustering silhouette plot
sil2 <- silhouette(km.out2$cluster, daisy(d3))
q3 = fviz_silhouette(sil2, print.summary = FALSE) +
  scale_fill_manual(values = c("#87CEEB", "#FFD700"),
                    labels = c("Cluster 1", "Cluster 2")) +
  scale_color_manual(values = c("#87CEEB", "#FFD700")) +
  guides(col = 'none') +
  theme(legend.position = "right",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y = element_blank(), panel.grid = element_blank(),
        title = element_text(size = 11), axis.title = element_text(size = 11),
        legend.text = element_text(size = 11), legend.title = element_text(size = 11),
        axis.title.x = element_text(size = 11), axis.title.y = element_text(size = 11),
        axis.text.x = element_blank(), axis.text.y = element_text(size = 11),
        legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  labs(title = "A PCA & k-means cluster silhouette plot", size = 11,
       subtitle = 'Average silhouette width = 0.45')

# ---------------------------------------------------------------------- #
# model-based clustering: BIC across 1 to 5 clusters for all 14 models
mod1 <- Mclust(df[, c(3, 8)])
d4 = data.frame(cbind(cluster = c(1:5), mod1$BIC[1:5, ]))
d5 = melt(d4, id = 'cluster')

# line plot of BIC by number of clusters, one line per covariance model
t1 = ggplot(data = d5, aes(x = cluster, y = value, group = variable, color = variable)) +
  geom_line() +
  geom_point(aes(x = cluster, y = value, group = variable, color = variable),
             shape = 21, stroke = 1, size = 2, fill = "white") +
  geom_point(aes(x = cluster, y = value, group = variable, color = variable),
             alpha = 0.2, shape = 21, stroke = 2, size = 2.75, fill = "transparent") +
  scale_color_manual(values = mycol) +
  labs(title = "A model-based clustering \n (Mclust) BIC score",
       y = "BIC", x = "5 clusters", size = 11, color = "Models") +
  theme(legend.position = "right",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.title.x = element_text(size = 11), axis.title.y = element_text(size = 11),
        axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 11),
        legend.text = element_text(size = 11), axis.title = element_text(size = 11),
        title = element_text(size = 11), panel.background = element_blank(),
        legend.key.size = unit(0.25, 'cm'),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_line(color = "grey90")) +
  scale_y_continuous(breaks = seq(-10, -2300, -10)) +
  scale_x_continuous(breaks = seq(0, 10, 1))

# ---------------------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(q3, t1, nrow = 1)
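As a quick cross-check of Table 5 (this call is not in the original script, though a similar `summary(BIC, ...)` line is commented out above), mclust can rank the covariance models directly:

# rank covariance models by BIC for G = 2; EII should come out on top,
# matching Table 5
BIC2 <- mclustBIC(df[, c(3, 8)], G = 2)
summary(BIC2)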
# -------------------- number of elements ------------------------ #
l1 = list()
for (i in 2:6) {
  x_k <- kmeans(df[, c(3, 8)], i, nstart = 20)
  l1[[i - 1]] = table(x_k$cluster)
}
names(l1) = c('Table 2. cluster 2 (Only K-mean clustering)',
              'Table 3. cluster 3 (Only K-mean clustering)',
              'Table 4. cluster 4 (Only K-mean clustering)',
              'Table 5. cluster 5 (Only K-mean clustering)',
              'Table 6. cluster 6 (Only K-mean clustering)')
kable(l1, format = "pipe", position = 'H',
      caption = 'Only K-mean clustering number of elements when k = 2 to 6')

[Figure 8 appears here: the same PCA silhouette plot and BIC line graph as Figure 4, produced by the code above.]

Figure 8: (a) A k-means clustering silhouette plot based on 2 principal components is shown for the average width. (b) BIC values from the model-based clustering analysis across 5 different cluster counts are shown as a line graph.

Table 7 (output, identical to Table 1): only k-means clustering, number of elements per cluster for k = 2 to 6.

# k-means clustering overall silhouette widths
plotSilK <- function(k) {
  x_k <- kmeans(df[, c(3, 8)], k, nstart = 20)
  si <- silhouette(x_k$cluster, dist(df[, c(3, 8)]))
  # average the mean silhouette width of each cluster
  km <- sapply(1:k, function(i) mean(si[si[, 1] == i, 3]))
  mean(km)
}

# compare k = 2 to 8
sill = sapply(2:8, plotSilK)
kable(t(round(data.frame(`k-mean` = c(2:8), `Average width` = sill), 2)),
      position = 'H', format = "latex",
      caption = 'Only K-mean clustering Silhouette plot for different k')

Table 8 (output, identical to Table 2): only k-means clustering, average silhouette width for different k.

# ------------------------- pca -------------------------------------- #
kable(t(data.frame(PC = var_explained_df$PC[1:6],
                   `variance explained` = as.numeric(round(var_explained_df[1:6, 2], 3)))),
      format = "latex", position = 'H', caption = 'Scree plot for each PC')

Table 9 (output, identical to Table 3): variance explained for each PC.

# -------------------------- k-means after pca ------------------------ #
plotSilK2 <- function(k) {
  x_k <- kmeans(d3, k, nstart = 20)
  si <- silhouette(x_k$cluster, dist(d3))
  km <- sapply(1:k, function(i) mean(si[si[, 1] == i, 3]))
  mean(km)
}

# compare k = 2 to 8
sill = sapply(2:8, plotSilK2)
kable(t(round(data.frame(`k-mean` = c(2:8), `Average width` = sill), 2)),
      format = "latex", position = 'H',
      caption = 'Silhouette plot for K-mean clustering after PCA')

Table 10 (output, identical to Table 4): average silhouette width for k-means clustering after PCA.

# -------------------- model-based clustering ------------------------ #
dmb = data.frame(cbind(model = c('EII,2', 'VII,2', 'EEI,2'),
                       BIC = round(BIC[2:4], 2)))
kable(t(dmb), format = "latex", position = 'H',
      caption = 'Top 3: Model Based clustering with BIC when G = 2')

Table 11 (output, identical to Table 5): top 3 model-based clustering models by BIC when G = 2.
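Beyond raw accuracy, an agreement index that is invariant to label switching could also be reported. A small sketch using mclust's adjustedRandIndex (an illustrative addition, not part of the report's results; `a` is the truth vector defined in the evaluation code above):

# adjusted Rand index between each clustering and the true locations
adjustedRandIndex(cutree(res.hc, 2), a)     # hierarchical
adjustedRandIndex(km.out$cluster, a)        # k-means
adjustedRandIndex(mod1$classification, a)   # Gaussian mixture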
# -------------------- number of elements ------------------------ #
l2 = list()
for (i in 2:6) {
  x_k <- kmeans(d3, i, nstart = 20)
  l2[[i - 1]] = table(x_k$cluster)
}
names(l2) = c('Table 8. cluster 2 (K-mean clustering after PCA)',
              'Table 9. cluster 3 (K-mean clustering after PCA)',
              'Table 10. cluster 4 (K-mean clustering after PCA)',
              'Table 11. cluster 5 (K-mean clustering after PCA)',
              'Table 12. cluster 6 (K-mean clustering after PCA)')
kable(l2, format = "pipe", position = 'H',
      caption = 'A K-mean clustering after PCA elements distribution for k = 2 to 6')

Table 12 (output, identical to Table 6): k-means clustering after PCA, number of elements per cluster for k = 2 to 6.

Reference

• Fraley, C., Raftery, A. E., Murphy, T. B., & Scrucca, L. (2012). mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation (Vol. 597, p. 1). Technical report. https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
• Reusova, A. (2019, March 26). Hierarchical clustering on categorical data in R. Medium. Retrieved March 10, 2022, from https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995
• Ordorica, A. D. (2021). Aggregation methods in R. Data Science Portfolio. Retrieved March 10, 2022, from https://www.alldatascience.com/clustering/aggregation-methods-in-r/
• Warin, T. (2020, April 7). [R course] Clustering with R. Thierry Warin, PhD. Retrieved March 10, 2022, from https://warin.ca/posts/rcourse-clustering-with-r/
• He, Z., Zhang, M., & Zhang, H. (2016). Data-driven research on chemical features of Jingdezhen and Longquan celadon by energy dispersive X-ray fluorescence. Ceramics International, 42(4), 5123-5129.
• Chakravorty, J. (2020). Unsupervised learning. Observable. Retrieved March 14, 2022, from https://observablehq.com/@jhelum-ch/unsupervised-learning
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
• Chemical Composition of Ceramic Samples Data Set. UCI Machine Learning Repository. (2012). Retrieved March 14, 2022, from https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples#
• Ngo, L. (2018, December 5). Principal component analysis explained simply. BioTuring's Blog. Retrieved March 14, 2022, from https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/