STATS 780
Assignment 3
Student number: 400415239
Kyuson Lim
15 March, 2022
Contents

Exploratory Data Analysis
Supplementary material
Technical supplemental material
Reference
Exploratory Data Analysis

The data contain 88 Chinese ceramic samples named by collection site (the Longquan kiln at Dayao County and the Jingdezhen kiln), split evenly between celadon body and glaze parts (44 each), with 17 chemical components, sourced from the UCI Machine Learning Repository (He et al., 2016). Only 8 compounds (Na2O, MgO, Al2O3, SiO2, K2O, CaO, TiO2, Fe2O3) are recorded as continuous percentage values suitable for classification. According to the chemical analysis, two compounds, Fe2O3 (mean = 1.56, s^2 = 0.365) and Al2O3 (mean = 17.46, s^2 = 22.12), are the key compounds that vary between the two locations from which the 44 celadon body samples were collected (He et al., 2016). Hence, all clustering models except the PCA (which uses the 8 compounds for dimension reduction) are fitted on the standardized (scaled) Fe2O3 and Al2O3 values, to test whether classification is feasible with two variables. There are no missing entries, and no outliers are detected in the two variables.
[Figure 1 appears here: (a) a circular dendrogram titled "A hierarchical clustering plot of ceramic samples" with the 44 labelled samples in two colours; (b) a scatter plot titled "A k-means cluster-scatter plot of ceramic samples" of scaled Al2O3 (x) against Fe2O3 (y), coloured by cluster 1 and cluster 2.]
Figure 1: (a) A hierarchical clustering plot of the 44 samples classified into two groups based on two variables. (b) A k-means clustering model shown in a two-dimensional plot of the classified clusters.
Since the purpose is to compare all pairs of points in two dimensions, a Euclidean distance with average linkage is applied in the hierarchical clustering model to identify two clusters of samples in Figure 1 (James et al., 2013). The number of clusters for k-means is chosen as k = 2 from the silhouette coefficient, with an average silhouette width of 0.52 (Table 2; Chakravorty, 2020). Given that the true celadon body parts are distributed 13/31, the estimated clusters, 15/29, are close to the truth, suggesting a reasonable fit for the k-means clustering analysis (Table 1). The hierarchical clustering model, which yields the same 15/29 cluster sizes, attains an accuracy of 0.909, higher than the k-means model's 0.818. Both methods discriminate the two locations with accuracy high enough (above 0.7) to count as successful classification models.
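
The two fits and their accuracies can be sketched as follows (the complete code, including the plots, is in the technical supplemental material; `truth`, `hc2`, and `km.fit` are illustrative names, and since cluster labels are arbitrary, the accuracies assume the labelling that matches the truth):

library(caret)  # for confusionMatrix()
set.seed(1)
# hierarchical clustering: Euclidean distance, average linkage, cut at k = 2
hc  <- hclust(dist(df[, c(3, 8)], method = "euclidean"), method = "average")
hc2 <- cutree(hc, k = 2)
# k-means with k = 2 on the same two scaled compounds
km.fit <- kmeans(df[, c(3, 8)], centers = 2, nstart = 10, algorithm = "MacQueen")
truth  <- c(rep(1, 13), rep(2, 31))   # true location labels (13/31 split)
confusionMatrix(table(truth, hc2))$overall["Accuracy"]            # ~0.909
confusionMatrix(table(truth, km.fit$cluster))$overall["Accuracy"] # ~0.818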
[Figure 2 appears here: (a) a biplot titled "A PCA & k-means cluster-scatter plot of ceramic samples" of PC1 (45.0% explained var.) against PC2 (21.0% explained var.) with the 8 compound loadings; (b) a scatter plot titled "A model based cluster - Gaussian mixture plot" of scaled Al2O3 against Fe2O3, coloured by cluster 1 and cluster 2.]
Figure 2: (a) A PCA is performed, and the first 2 PC scores are used for k-means clustering with k = 2. (b) A model-based clustering is performed on the two variables to derive 2 clusters.
The number of principal components is chosen as 2, since the scree plot drops most sharply up to PC2 and flattens thereafter (the elbow method, Figure 3); the first PC explains 0.45 of the variance and the second 0.21 (James et al., 2013). A two-PC model better captures the essence of the data with as few PCs as possible (Ngo, 2018). The number of clusters is then selected as k = 2 from the silhouette plot, with an average silhouette width of 0.45 (Table 4). Given that the true celadon body parts are distributed 13/31, the estimated clusters for the k-means analysis are 16/28, close to the true split (Table 6). With an accuracy of 0.886, k-means on the first two PCs classifies the ceramics better than ordinary k-means, as the two PCs explain more of the total variability of the data than the two chosen compounds, Fe2O3 and Al2O3. Lastly, model-based clustering with BIC identifies a 2-cluster EII model, whose BIC of -236.27 is the maximum among the 14 mixture models with 2 clusters in Table 5 (Fraley et al., 2012). Although 4 clusters would be chosen from the full BIC plot (EEI covariance, BIC = -223.05), the number of clusters is fixed at 2 so the result can be compared with the true clusters for accuracy and across models (Figure 4). The Gaussian mixture model classifies the two locations into 19/25 clusters with an accuracy of 0.818, the same as k-means without PCA (Figure 2). Thus, hierarchical clustering has the highest classification accuracy with two variables, and k-means with PCA performs better than ordinary k-means.
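
A sketch of the PCA pipeline and the Gaussian mixture fit (reusing `df` and `truth` from the sketches above; `pca`, `km.pca`, and `gmm` are illustrative names, and the full code is in the technical supplemental material):

library(mclust); library(caret)
set.seed(1)
pca <- prcomp(df, center = TRUE, scale. = FALSE)        # df is already scaled
summary(pca)$importance["Proportion of Variance", 1:2]  # ~0.45 and ~0.21
km.pca <- kmeans(pca$x[, 1:2], centers = 2, nstart = 20)  # k-means on PC scores
confusionMatrix(table(truth, km.pca$cluster))$overall["Accuracy"]     # ~0.886
bic <- mclustBIC(df[, c(3, 8)], G = 2)   # BIC for the 14 covariance structures
gmm <- Mclust(df[, c(3, 8)], x = bic, modelNames = "EII")
confusionMatrix(table(truth, gmm$classification))$overall["Accuracy"] # ~0.818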
Supplementary material
Model selection and criterion
[Figure 3 appears here: (a) "A K-means cluster silhouette plot", average silhouette width = 0.52, bars for clusters 1 and 2; (b) "Scree plot: PCA on scaled data", variance explained for PC1 through PC8.]
Figure 3: (a) A k-means clustering silhouette plot showing the average width across the 2 clusters. (b) A scree plot of the 8 principal components showing the variance explained.
[Figure 4 appears here: (a) "A PCA & k-means cluster silhouette plot", average silhouette width = 0.45; (b) "A model based clustering (Mclust) BIC score", BIC curves for the 14 covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV) over 1 to 5 clusters.]
Figure 4: (a) A k-means clustering silhouette plot based on 2 principal components, showing the average width. (b) BIC values of the model-based clustering analysis for 1 to 5 clusters, shown as line graphs.
Table 1: Only K-means clustering: number of elements for k = 2 to 6

  k = 2: 25, 19
  k = 3: 18, 7, 19
  k = 4: 13, 10, 14, 7
  k = 5: 7, 13, 12, 2, 10
  k = 6: 7, 8, 6, 6, 8, 9
Table 2: Only K-means clustering: average silhouette width for different k

  k:              2     3     4     5     6     7     8
  Average width:  0.52  0.49  0.48  0.53  0.48  0.37  0.40
Table 3: Variance explained by each PC

  PC:                  PC1    PC2    PC3    PC4    PC5    PC6
  Variance explained:  0.450  0.210  0.146  0.101  0.048  0.028
Table 4: Average silhouette width for K-means clustering after PCA

  k:              2     3     4     5     6     7     8
  Average width:  0.45  0.44  0.43  0.44  0.43  0.50  0.48
Table 5: Top 3: Model-based clustering by BIC when G = 2

  Model:  EII,2    VII,2    EEI,2
  BIC:    -236.27  -239.19  -239.87
Table 6: K-means clustering after PCA: element distribution for k = 2 to 6

  k = 2: 28, 16
  k = 3: 15, 11, 18
  k = 4: 12, 11, 9, 12
  k = 5: 9, 7, 12, 11, 5
  k = 6: 7, 4, 6, 7, 12, 8
Technical supplemental material
library(caret)
library(pander)
library(class)
library(pscl)
library(ggbiplot)
library(mclust)
library(gridExtra)
library(knitr)
library(cowplot)
library(tidyverse)
library(reshape2)
library(kableExtra)
library(RCurl)
library(readxl)
library(dplyr)
library(magrittr)
library(readr)
library(ggplot2)
library(ISLR2)
library(cluster)
library(factoextra)
library(pgmm)
library(fossil)
library(dendextend)
library(circlize)
library(ggdendro)
library(ape)
library(ggforce)
library(e1071)
set.seed(1)
gg = "https://archive.ics.uci.edu/ml/machine-learning-databases/00583/Chemical%20Composion%20of%20Ceramic.csv"
# parse the downloaded data as CSV
d = read.csv(url(gg), header=T)
df = cbind(as.numeric(d[,3]), as.numeric(d[,4]), as.numeric(d[,5]),
as.numeric(d[,6]), as.numeric(d[,7]), as.numeric(d[,8]),
as.numeric(d[,9]), as.numeric(d[,10])) %>% na.omit()
d_cov = d[1:44,1]
df= scale(df[1:44,])
colnames(df) = c(colnames(d)[3:10])
rownames(df) = c(d_cov)
#dgr = if()
#d = read.csv('echocardiogram.data', sep = ',')
## outlier exception
om = which(df[,8] %in% c(boxplot.stats(df[,8])$out))
om2 = which(df[,3] %in% c(boxplot.stats(df[,3])$out))
# ----------------- hierarchical clustering -------------------- #
# compute hierarchical clustering: Euclidean distance, average linkage
res.hc <- dist(df[, c(3, 8)], method = "euclidean") %>%
  hclust(method = "average")
# visualize using factoextra
p1 = fviz_dend(res.hc, k = 2, cex = 0.315, type = "circular",
               k_colors = c("#87CEEB", "#FFD700"),
               labels_track_height = 0,
               color_labels_by_k = TRUE, rect = FALSE
) + theme(axis.text.y = element_blank(),
          plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm")) +
  labs(title = "A hierarchical clustering plot \n of ceramic samples", size = 11)
# ------- hierarchical clustering evaluation -------------------- #
era = c(rep(1, 13), rep(2, 31))            # true location labels, 13/31 split
e1 = cbind(hc = cutree(res.hc, 2), era = era)
a <- era                                   # reference values
b <- as.numeric(cutree(res.hc, 2))         # predicted cluster labels
k = confusionMatrix(table(a, b))           # confusion matrix
# ----------------- k-means clustering ------------------------ #
# k-means with k = 2 on the two key compounds
km.out <- kmeans(df[,c(3,8)], 2, nstart = 10, algorithm = "MacQueen")
# Kmean cluster
dataset_scaled = data.frame(Al2O3 = df[,3], Fe2O3 = df[,8],
cluster = factor(km.out$cluster))
p2 = ggplot(dataset_scaled, aes(x = Al2O3, y = Fe2O3)) +
#geom_mark_ellipse(aes(fill = cluster), alpha=0.1) +
scale_fill_manual(values=c("#1ED14B","#11a6fc"))+
geom_point(aes(color = cluster), size = 1.7175, pch=8, alpha=0.55) +
geom_point(aes(color = cluster), shape = 21,
stroke = 0.7, size = 1.05, fill='white') +
scale_alpha_continuous(guide='none')+ theme(legend.position="right",
#plot.margin = margin(t = 3, r = 2, b = 0, l = 0, unit = "cm"),
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=11),
axis.title=element_text(size=11),
legend.text=element_text(size=11),
legend.title = element_text(size=11),
axis.text.x = element_text(size=11, angle=0),
legend.key = element_rect(fill = "white"),
axis.text.y = element_text(size=11),
legend.box.background = element_blank(),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
scale_y_continuous(breaks=seq(-5, 3, 0.5))+
scale_x_continuous(breaks=seq(-2, 6, 1))+
scale_color_manual(values=c("#87CEEB", "#FFD700"),#,"#ffae00"),
label = c('cluster 1', 'cluster 2')
)+
labs(title = "A k-mean cluster-scatter plot \n of ceramic samples", size=11)
# ------------- evaluation ------------------------ #
b <- as.numeric(km.out$cluster)
#reference values
k2 =confusionMatrix(table(a,b)) #confusion matrix
# ------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(p1, p2, nrow=1, widths=c(1.25,1))
Figure 5: Output of grid.arrange(p1, p2) above; identical to Figure 1.
# ----------------- model based clustering ----------------- #
BIC = mclustBIC(df[, c(3, 8)], G = 2)
# fit the selected 2-component EII model
mod1 = Mclust(df[, c(3, 8)], x = BIC, modelNames = "EII")
# cluster assignments for plotting
dataset_scaled2 = data.frame(Al2O3 = df[, 3], Fe2O3 = df[, 8],
                             cluster = factor(mod1$classification))
p4 = ggplot(dataset_scaled2, aes(x = Al2O3, y = Fe2O3)) +
  geom_point(aes(color = cluster), size = 1.7175, pch = 8, alpha = 0.55) +
  geom_point(aes(color = cluster), shape = 21,
             stroke = 0.7, size = 1.05, fill = 'white') +
  geom_mark_ellipse(aes(fill = cluster), color = 'transparent', alpha = 0.1) +
  scale_alpha_continuous(guide = 'none') + theme(legend.position = "right",
    axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
    axis.ticks.y = element_blank(),
    panel.grid = element_blank(),
    title = element_text(size = 11),
    axis.title = element_text(size = 11),
    legend.text = element_text(size = 11),
    legend.key = element_rect(fill = "white"),
    legend.title = element_text(size = 11),
    axis.text.x = element_text(size = 11, angle = 0),
    axis.text.y = element_text(size = 11),
    legend.key.size = unit(0.25, 'cm'),
    panel.background = element_blank(),
    panel.grid.major.x = element_line(color = "grey90"),
    panel.grid.major.y = element_line(color = "grey90"),
    panel.grid.minor = element_blank()) +
  scale_y_continuous(breaks = seq(-5, 3, 0.5)) +
  scale_x_continuous(breaks = seq(-2, 6, 1)) +
  scale_color_manual(values = c("#87CEEB", "#FFD700"),
                     label = c('cluster 1', 'cluster 2')) +
  scale_fill_manual(values = c("#87CEEB", "#FFD700"), guide = 'none') +
  labs(title = "A model based cluster- \n Gaussian mixture plot", size = 11)
# ------------------------------------------------------- #
# evaluation
b <- as.numeric(mod1$classification)       # predicted cluster labels
k4 = confusionMatrix(table(a, b))          # confusion matrix
# ----------------- classification + PCA ----------------- #
# rename compound columns to avoid overlapping labels in the biplot
colnames(df)[1] = c("NA2O")
colnames(df)[2] = c("MgO")
colnames(df)[3] = c("Al2O3")
colnames(df)[6] = c("CaO")
colnames(df)[7] = c("TiO2")
colnames(df)[8] = c("Fe2O3 \n")
# PCA
pca_iris = prcomp(df, center = TRUE, scale = F)
d3 = as.data.frame(pca_iris$x[,1:2])
km.out2 <- kmeans(d3, 2, nstart = 20)
# Kmean cluster
dataset_scaled = data.frame(d3, cluster = factor(km.out2$cluster))
p3 = ggbiplot(pcobj = pca_iris, obs.scale = 1,
var.scale = 1, alpha=0, varname.size = 2)+
geom_point(data = dataset_scaled, mapping = aes(color = cluster,
x = PC1, y = PC2), size = 1.7175, pch=8, alpha=0.55)+
geom_point(data = dataset_scaled, mapping = aes(color = cluster,
x = PC1, y = PC2), shape = 21, stroke = 0.7, size = 1.05, fill='white')+
scale_alpha_continuous(guide='none')+ theme(legend.position="none",
#plot.margin = margin(t = 3, r = 2, b = 0, l = 0, unit = "cm"),
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=11),
axis.title=element_text(size=11),
legend.key = element_rect(fill = "white"),
legend.text=element_text(size=11),
legend.title = element_text(size=11),
axis.text.x = element_text(size=11, angle=0),
axis.text.y = element_text(size=11),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
scale_y_continuous(breaks=seq(-5, 5, 1))+
scale_x_continuous(breaks=seq(-5, 6, 1))+
scale_color_manual(values=c("#87CEEB", "#FFD700"),
label = c('Cluster 1', 'Cluster 2'))+
scale_fill_manual(values=c("#87CEEB", "#FFD700"),#,"#ffae00"),
guide='none')+
labs(title="A PCA & k-mean cluster-scatter \n plot of ceramic samples",size=11)
# ------------------------------------------------------- #
# evaluation
b <- as.numeric(km.out2$cluster)
#reference values
k3 =confusionMatrix(table(a,b)) #confusion matrix
# ------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(p3, p4, nrow=1, widths=c(1,1.2))
Figure 6: Output of grid.arrange(p3, p4) above; identical to Figure 2.
## color
mycol = c('violet', 'cornflowerblue', 'darkkhaki', 'darkturquoise',
'lightgreen', 'lightgray', 'moccasin', 'sandybrown', 'lightcoral',
"#87CEEB", "#FFD700", "#ffae00", 'grey20', 'red')
# ---------------------------------------------------------------------- #
# K-means clustering silhouette plot
sil<-silhouette(km.out$cluster, daisy(df[,c(3,8)]))
q1=fviz_silhouette(sil, label=FALSE, print.summary=FALSE)+
scale_fill_manual(values =c("#87CEEB", "#FFD700"),
labels=c("Cluster 1","Cluster 2"))+
scale_color_manual(values =c("#87CEEB", "#FFD700"))+
guides(col='none')+
theme(legend.position="right",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=11),
axis.title=element_text(size=11),
legend.text=element_text(size=11),
legend.title = element_text(size=11),
axis.title.x = element_text(size=11),
axis.title.y = element_text(size=11),
axis.text.x = element_blank(),
axis.text.y = element_text(size=11),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
labs(title = "A K-mean cluster Sillhoutte plot", size=11,
subtitle = 'Average Sillhoutte width = 0.52')
# ---------------------------------------------------------------------- #
# scree plot: proportion of variance explained by each PC
var_explained_df <- data.frame(PC = paste0("PC", 1:8),
  var_explained = (pca_iris$sdev)^2 / sum((pca_iris$sdev)^2))
q2=var_explained_df %>%
ggplot(aes(x=PC,y=var_explained, group=1))+
geom_line(aes(x = PC, y = var_explained, group = 1, color = mycol[1]))+
geom_point(aes(x = PC, y = var_explained, group = 1, color = mycol[1]),
shape = 21, stroke = 1, size = 2, fill = "white") +
geom_point(aes(x = PC, y = var_explained, group = 1, color=mycol[1]),
alpha=0.2, shape = 21, stroke = 2, size =2.75, fill = "transparent") +
labs(title="Scree plot: PCA on scaled data", size=11)+
ylab ('Variance explained')+
theme(legend.position="none",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=11),
axis.title=element_text(size=11),
legend.text=element_text(size=11),
legend.title = element_text(size=11),
axis.title.x = element_text(size=11),
axis.title.y = element_text(size=11),
axis.text.x = element_text(size=11),
axis.text.y = element_text(size=11),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())
# ---------------------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(q1, q2, nrow=1)
Figure 7: Output of grid.arrange(q1, q2) above; identical to Figure 3.
# ---------------------------------------------------------------------- #
# PCA + k-mean clustering
sil2<-silhouette(km.out2$cluster, daisy(d3))
q3=fviz_silhouette(sil2, print.summary=FALSE)+
scale_fill_manual(values =c("#87CEEB", "#FFD700"),
labels=c("Cluster 1","Cluster 2"))+
scale_color_manual(values =c("#87CEEB", "#FFD700"))+
guides(col='none')+
theme(legend.position="right",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=11),
axis.title=element_text(size=11),
legend.text=element_text(size=11),
legend.title = element_text(size=11),
axis.title.x = element_text(size=11),
axis.title.y = element_text(size=11),
axis.text.x = element_blank(),
axis.text.y = element_text(size=11),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
labs(title = "A PCA & k-mean cluster Sillhoutte plot", size=11,
subtitle = 'Average Sillhoutte width = 0.45')
# ---------------------------------------------------------------------- #
# Mclust
mod1 <- Mclust(df[,c(3,8)])
d4 = data.frame(cbind(cluster = c(1:5), mod1$BIC[1:5,]))
d5 = melt(d4, id = 'cluster')
# BIC line plot for the 14 covariance models across 1 to 5 clusters
t1 = ggplot(data = d5, aes(x = cluster, y = value, group = variable,
color = variable))+
geom_line() +
geom_point(aes(x = cluster, y = value, group = variable, color = variable),
shape = 21, stroke = 1, size = 2, fill = "white") +
geom_point(aes(x = cluster, y = value, group = variable, color=variable),
alpha=0.2, shape = 21, stroke = 2, size =2.75, fill = "transparent") +
scale_color_manual(values = mycol) +
labs(title = "An model based clustering \n (Mclust) BIC score",
y = "BIC", x = "5 clusters", size=11,
color = "Models") +
theme(legend.position="right",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
#axis.ticks.y= element_line(color = "grey90"),
axis.title.x = element_text(size=11),
axis.title.y = element_text(size=11),
axis.text.x = element_text(size=11),
axis.text.y = element_text(size=11),
legend.text=element_text(size=11),
axis.title=element_text(size=11),
title =element_text(size=11),
panel.background = element_blank(),
legend.key.size = unit(0.25, 'cm'),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_line(color = "grey90"))+
scale_y_continuous(breaks=seq(-10, -2300, -10))+
scale_x_continuous(breaks=seq(0, 10, 1))
# ---------------------------------------------------------------------- #
# plots combined
gridExtra::grid.arrange(q3, t1, nrow=1)
# -------------------- number of elements ------------------------ #
l1 = list()
for (i in 2:6){
x_k <- kmeans(df[,c(3,8)], i, nstart = 20)
l1[[i-1]] = table(x_k$cluster)
}
names(l1) = c('cluster 2 (only k-means clustering)',
              'cluster 3 (only k-means clustering)',
              'cluster 4 (only k-means clustering)',
              'cluster 5 (only k-means clustering)',
              'cluster 6 (only k-means clustering)')
kable(l1, format="pipe", position = 'H', caption =
  'Only K-means clustering: number of elements for k = 2 to 6')
Figure 8: Output of grid.arrange(q3, t1) above; identical to Figure 4.
Table 7: Output identical to Table 1.
# K-means clustering: overall average silhouette width for a given k
plotSilK <- function(k){
  x_k <- kmeans(df[, c(3, 8)], k, nstart = 20)
  si  <- silhouette(x_k$cluster, dist(df[, c(3, 8)]))
  km  <- numeric()
  for (i in 1:k){                 # mean silhouette width within each cluster
    km <- c(km, mean(si[which(si[, 1] == i), 3]))
  }
  mean(km)                        # average across the k clusters
}
# compares
sill = c(plotSilK(2), plotSilK(3), plotSilK(4), plotSilK(5),
plotSilK(6), plotSilK(7), plotSilK(8))
kable(t(round(data.frame(`k-mean` = c(2:8),
  `Average width` = sill), 2)), position = 'H', format = "latex",
  caption = 'Only K-means clustering: average silhouette width for different k')
Table 8: Output identical to Table 2.
# ------------------------- pca -------------------------------------- #
kable(t(data.frame(PC = var_explained_df$PC[1:6],
  `variance explained` = as.numeric(round(var_explained_df[1:6, 2], 3)))),
  format = "latex", position = 'H',
  caption = 'Variance explained by each PC')
Table 9: Output identical to Table 3.
# -------------------------- k-means after PCA ------------------------ #
# same silhouette summary, now on the first two PC scores (d3)
plotSilK <- function(k){
  x_k <- kmeans(d3, k, nstart = 20)
  si  <- silhouette(x_k$cluster, dist(d3))
  km  <- numeric()
  for (i in 1:k){
    km <- c(km, mean(si[which(si[, 1] == i), 3]))
  }
  mean(km)
}
# compares
sill = c(plotSilK(2), plotSilK(3), plotSilK(4), plotSilK(5),
plotSilK(6), plotSilK(7), plotSilK(8))
kable(t(round(data.frame(`k-mean` = c(2:8),
  `Average width` = sill), 2)),
  format = "latex", position = 'H',
  caption = 'Average silhouette width for K-means clustering after PCA')
Table 10: Output identical to Table 4.
# -------------------- model based clustering ------------------------ #
# top 3 covariance models by BIC at G = 2 (mclust orders models EII, VII, EEI, ...)
dmb = data.frame(cbind(model = c('EII,2', 'VII,2', 'EEI,2'),
                       BIC = round(BIC[1:3], 2)))
kable(t(dmb), format = "latex", position = 'H',
      caption = 'Top 3: Model-based clustering by BIC when G=2')
Table 11: Output identical to Table 5.
# -------------------- number of elements ------------------------ #
l2 = list()
for (i in 2:6){
x_k <- kmeans(d3, i, nstart = 20)
l2[[i-1]] = table(x_k$cluster)
}
names(l2) = c('cluster 2 (k-means after PCA)',
              'cluster 3 (k-means after PCA)',
              'cluster 4 (k-means after PCA)',
              'cluster 5 (k-means after PCA)',
              'cluster 6 (k-means after PCA)')
kable(l2, format="pipe", position = 'H', caption =
  'K-means clustering after PCA: element distribution for k = 2 to 6')
17
Table 12: Output identical to Table 6.
Reference
• Fraley, C., Raftery, A. E., Murphy, T. B., & Scrucca, L. (2012). mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation (Technical Report No. 597). https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
• Reusova, A. (2019, March 26). Hierarchical clustering on categorical data in R. Medium. Retrieved March 10, 2022, from https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995
• Ordorica, A. D. (2021). Aggregation methods in R. Data Science Portfolio. Retrieved March 10, 2022, from https://www.alldatascience.com/clustering/aggregation-methods-in-r/
• Warin, T. (2020, April 7). [R course] Clustering with R. Retrieved March 10, 2022, from https://warin.ca/posts/rcourse-clustering-with-r/
• He, Z., Zhang, M., & Zhang, H. (2016). Data-driven research on chemical features of Jingdezhen and Longquan celadon by energy dispersive X-ray fluorescence. Ceramics International, 42(4), 5123-5129.
• Chakravorty, J. (2020). Unsupervised learning. Observable. Retrieved March 14, 2022, from https://observablehq.com/@jhelum-ch/unsupervised-learning
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
• Chemical Composition of Ceramic Samples Data Set. (2012). UCI Machine Learning Repository. Retrieved March 14, 2022, from https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples#
• Ngo, L. (2018, December 5). Principal component analysis explained simply. BioTuring's Blog. Retrieved March 14, 2022, from https://blog.bioturing.com/2018/06/14/principal-component-analysis-explained-simply/