STATS 780
Assignment 1
Student number: 400415239
Kyuson Lim
15 February, 2022
Contents
Supplementary material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
The data contain 900 raisins with 7 morphological features, sourced from the UCI Machine Learning
Repository (Çinar et al., 2020). The variables include the area (the number of pixels), the perimeter,
the major axis length (the longest line that can be drawn across the raisin), the minor axis length
(the shortest such line), the eccentricity, the extent, the convex area (the number of pixels in the
smallest convex shell containing the raisin), and the class, which consists of the Kecimen and Besni
varieties. This suits a logistic regression model, since the class consists of two kinds of raisin whose
numerical measurements vary between classes.
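As a quick sanity check, the class balance and the seven numeric features can be inspected directly in R. The sketch below is illustrative: it assumes the readxl package and a local copy of the UCI Excel file (the exact path used in this report appears in the supplementary material).
library(readxl)
raisin <- read_excel('Raisin_Dataset.xlsx')  # illustrative path; adjust to the local copy
table(raisin$Class)                          # expected: 450 Besni and 450 Kecimen
summary(raisin[, 1:7])                       # the seven numeric morphological features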
[Figure 1 image: panel (a) "A density-scatter plot of raisin" (MinorAxisLength against Extent, coloured by class: Besni, Kecimen); panel (b) "A box-plot of variables in raisin data" (scaled values of the seven variables by class). Data source: UCI Machine Learning Repository, Raisin Dataset Data Set.]
Figure 1: (a) The density-scatter plot shows the relationship between the extent and minor axis
length variables. (b) The box plot shows the scaled (standardized) distribution of each variable.
The variables in the data include area, perimeter, MajorAxisLength, MinorAxisLength, Eccentricity,
ConvexArea, and Extent, all of which are numeric, and Class, which is categorical. A total of 900
raisin grains were used, 450 from each of the Kecimen and Besni varieties, with no missing data.
From the box plot, the Kecimen raisins have an overall lower mean area of 63413.47 pixels with 4
outliers, a mean perimeter of 352.86 with 4 outliers, a mean major axis length of 229.35 with 5
outliers, a mean minor axis length of 0.74 with 9 outliers, a mean eccentricity of 65696.36 with 4
outliers, a mean convex area of 0.71 with 10 outliers, and a mean extent of 983.69 with 3 outliers.
For the same variables, the Besni raisins have means of 112194.79 with 3 outliers, 509 with 3 outliers,
279.62 with 2 outliers, 0.82 with 12 outliers, 116675.82 with 3 outliers, 0.69 with 2 outliers, and
1348.13 with 2 outliers. The minor axis length variable overlaps only slightly between the two
varieties and has a moderate correlation of 0.45 with the extent variable (Figure 1). The other
variables have correlations of 0.83, 0.98, 0.96, -0.17, and 0.98, which are either too strong or too
weak. However, the separation becomes obvious in the box plot (Figure 1), where the Kecimen
raisins have lower means for all scaled (standardized) variables except convex area, compared with
the Besni raisins.
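The pairwise correlations quoted above are not computed in the supplementary code; a minimal sketch, assuming the dataset_scaled data frame built in the supplementary material, would be:
round(cor(dataset_scaled[, 1:7]), 2)                        # full correlation matrix
cor(dataset_scaled$Extent, dataset_scaled$MinorAxisLength)  # about 0.45, as quoted above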
From the scaled (standardized) data, 675 observations are randomly chosen in each of 10 iterations
to fit the logistic regression and KNN classifiers, which are then tested on the remaining 225
observations.
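A minimal sketch of a single such split (the full 10-iteration loop with the same 75/25 proportion is given in the supplementary material):
set.seed(780)                             # seed used in the supplementary code
n   <- nrow(dataset_scaled)               # 900 observations
idx <- sample(n, size = trunc(0.75 * n))  # 675 training indices
dataset_trn  <- dataset_scaled[idx, ]     # training set
dataset_test <- dataset_scaled[-idx, ]    # remaining 225 observations for testing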
[Figure 2 image: panel (a) "An error rate of logistic regression" (error rates over 10 iterations, one line per variable: area, perimeter, MajorAxisLength, MinorAxisLength, Eccentricity, ConvexArea, Extent); panel (b) "An error rate of KNN classifier" (error rates over 10 iterations for k=25, k=75, k=125). Data source: UCI Machine Learning Repository, Raisin Dataset Data Set.]
Figure 2: (a) The line graph shows the 10 error rates for each of the 7 variables fitted by logistic
regression. (b) The line graph shows the 10 error rates of the KNN classifiers with k = 25, 75, and 125.
Each variable is fitted and evaluated on 10 different random test sets, generated with a while loop
and sample(). The resulting error rates are shown in Figure 2. The lowest error rates under the
logistic regression model are achieved by the extent and perimeter variables. For the KNN classifier,
the mean error rate is 0.151 for k = 25, 0.143 for k = 75, and 0.144 for k = 125, so the lowest
error rate occurs at k = 75. Both methods estimate the probability that a response belongs to a
specific category, and both reach over 0.85 accuracy. Logistic regression is parametric, cannot be
applied to non-linear classification problems, and is affected by collinearity among predictors; fitted
on the extent variable alone it attains an error rate of 0.148, while the convex area predictor yields
a higher error rate. Unlike logistic regression, the KNN classifier is non-parametric, does not tell
us which predictors are important, and can handle non-linear classification; its tuning parameter k
(here k = 75) is chosen empirically without considering individual predictors. From Table 1, the
ANOVA indicates that adding the area variable alone reduces the deviance drastically (from 935.71
to 518.95), yet the area coefficient in the fitted model is not statistically significant (Wald p-value
= 0.17), even though it has the largest magnitude (-9.20). Hence, the extent variable stands out as
important by its coefficient magnitude, statistical significance, deviance reduction (14.16), and the
lowest single-variable error rate (0.148). For every one-unit increase in (scaled) extent, the log odds
of Kecimen (versus Besni) decrease by 8.11, holding the other six variables fixed. Thus, the extent
variable is essential to the logistic regression model.
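As a rough illustration of this interpretation (values taken from Table 1), the extent coefficient of -8.11 converts to an odds ratio as follows:
exp(-8.11)        # ~ 0.0003: multiplicative change in the odds of Kecimen per one-unit increase in scaled extent
exp(-8.11 * 0.1)  # ~ 0.44: a 0.1-unit increase in scaled extent roughly halves the odds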
Table 1: ANOVA and an assessment of the fitted model

Sequential ANOVA (Chi-squared test)          Coefficient estimates (full model)
                 Resid. Dev   Pr(>Chi)                        Estimate   Pr(>|z|)
NULL                 935.71         NA       (Intercept)         -0.71       0.02
area                 518.95       0.00       area                -9.20       0.17
perimeter            439.93       0.00       perimeter            3.68       0.11
MajorAxisLength      439.93       0.96       MajorAxisLength      3.37       0.05
MinorAxisLength      439.90       0.86       MinorAxisLength      0.02       0.97
Eccentricity         430.65       0.00       Eccentricity         6.87       0.33
ConvexArea           430.47       0.67       ConvexArea          -0.03       0.88
Extent               416.31       0.00       Extent              -8.11       0.00
Supplementary material
# libraries used throughout the supplementary code
library(readxl)     # read_excel()
library(dplyr)      # %>%
library(ggplot2)    # plots
library(reshape2)   # melt()
library(gridExtra)  # grid.arrange()
library(class)      # knn()
library(knitr)      # kable()
library(kableExtra) # kable_styling()
# read the downloaded data (Excel file)
dataset = read_excel('/Users/kyusoinlims/Desktop/Raisin_Dataset.xlsx')
colnames(dataset) = c('area', 'perimeter', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity',
                      'ConvexArea', 'Extent', 'Class')
# data transformation
dataset_scaled = dataset[,1:7] %>% scale() %>% as.data.frame()
dataset_scaled$class = as.factor(dataset$Class)
# 2 plots
## Bivariate perspective
p1 = ggplot(dataset_scaled, aes(x=Extent, y=MinorAxisLength)) +
stat_density2d(geom = "polygon", aes(alpha = (..level..)/100,
fill = class), bins = 9)+
scale_fill_manual(values=c("#1ED14B","#11a6fc"))+
geom_point(aes(color = class), shape = 21,
stroke = 0.2, size = 0.5, fill='transparent') +
scale_alpha_continuous(guide='none')+ theme(legend.position="bottom",
#plot.margin = margin(t = 3, r = 2, b = 0, l = 0, unit = "cm"),
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
title =element_text(size=7),
axis.title=element_text(size=7),
legend.text=element_text(size=7),
legend.title = element_text(size=7),
axis.text.x = element_text(size=6),
legend.key.size = unit(0.25, 'cm'),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
scale_y_continuous(breaks=seq(-5, 2, 1))+
scale_x_continuous(breaks=seq(-2, 6, 2))+
scale_color_manual(values=c("#7CAE00","#3582c4"),#,"#ffae00"),
guide='none')+
labs(title = "A density-scatter plot of raisin",
caption = "Data source: UCI Machine learning repository,
Raisin Dataset Data Set", size=5)
## univariate perspective
data_long <- melt(dataset_scaled, id = "class")
p2= ggplot(data_long, aes(x = variable, y = value, color=class)) +
geom_boxplot(varwidth = T) +
theme(legend.position="bottom",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
panel.background = element_blank(),
axis.title=element_text(size=7),
legend.background = element_blank(),
title =element_text(size=7),
legend.title = element_text(size=7),
legend.text=element_text(size=7),
legend.key.size = unit(0.25, 'cm'),
axis.text.x = element_text(angle = 22.5, vjust = 0.5, hjust=1, size=6),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+xlab(NULL)+
scale_y_continuous(breaks=seq(-7, 25, 1))+
scale_color_manual(values=c("#7CAE00","#11a6fc"))+
labs(title = "A box-plot of varaibles in baisin data",
caption = "Data source: UCI Machine learning repository,
Raisin Dataset Data Set", cex.labs=0.5) #+
  # stat_summary(
  #   aes(label = round(stat(y), 1)),
  #   geom = "text",
  #   fun.y = function(y) { o <- boxplot.stats(y)$out; if(length(o) == 0) NA else o },
  #   hjust = -1)
# -------------------------------------------------------------------------- #
# plots combined
grid.arrange(p1, p2, nrow=1, widths=c(1,1))
[Figure 3 image: combined output of grid.arrange(p1, p2); identical to Figure 1.]
Figure 3: (a) The density-scatter plot shows the relationship between the extent and minor axis
length variables. (b) The box plot shows the scaled (standardized) distribution of each variable.
# test-train split this data
set.seed(780)
i=1
# logistics regression error rate
logError1=data.frame(matrix(nrow=7, ncol=10))
# KNN error rate
misClassError1=as.numeric(); misClassError2=as.numeric()
misClassError3=as.numeric()
while (i <= 10){
dataset_obs = nrow(dataset_scaled)
## split
dataset_idx = sample(dataset_obs, size = trunc(0.75 * dataset_obs))
dataset_trn = dataset_scaled[dataset_idx,] # training set
dataset_test = dataset_scaled[-dataset_idx,] # test set
# Feature Scaling
train_scale <- scale(dataset_trn[, 1:7])
test_scale <- scale(dataset_test[, 1:7])
# -------------------------------------------------------------------------- #
# logistic modeling
# each variables
model_1 = glm(as.factor(class) ~ area, data = dataset_trn,
family = "binomial")
model_2 = glm(as.factor(class) ~ perimeter, data = dataset_trn,
family = "binomial")
model_3 = glm(as.factor(class) ~ MajorAxisLength, data = dataset_trn,
family = "binomial")
model_4 = glm(as.factor(class) ~ MinorAxisLength, data = dataset_trn,
family = "binomial")
model_5 = glm(as.factor(class) ~ Eccentricity, data = dataset_trn,
family = "binomial")
model_6 = glm(as.factor(class) ~ ConvexArea, data = dataset_trn,
family = "binomial")
model_7 = glm(as.factor(class) ~ Extent, data = dataset_trn,
family = "binomial")
# model_8 = glm(as.factor(class) ~ ., data = dataset_trn,
#               family = "binomial")
# model_9 = glm(as.factor(class) ~ . -Extent, data = dataset_trn,
#               family = "binomial")
get_logistic_error = function(mod, data, res = "y", pos = 2,
neg = 1, cut = 0.5) {
probs = predict(mod, newdata = data, type = "response")
preds = ifelse(probs > cut, pos, neg)
mean(data[, res] != preds)
}
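# Illustrative one-off usage of the helper above (hypothetical; the loop
# below applies it to all seven models via sapply):
#   get_logistic_error(model_7, dataset_test, res = "class",
#                      pos = "Kecimen", neg = "Besni", cut = 0.5)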
# -------------------------------------------------------------------------- #
# Fitting KNN Model
classifier_knn1 <- knn(train = train_scale, test = test_scale,
                       cl = dataset_trn$class,
                       k = 25)  # k = 25, matching the 'k=25' label used below
classifier_knn2 <- knn(train = train_scale, test = test_scale,
cl = dataset_trn$class,
k = 75)
classifier_knn3 <- knn(train = train_scale, test = test_scale,
cl = dataset_trn$class,
k = 125)
# -------------------------------------------------------------------------- #
# knn table
misClassError1 <- c(misClassError1, mean(classifier_knn1 !=
dataset_test$class))
misClassError2 <- c(misClassError2, mean(classifier_knn2 !=
dataset_test$class))
misClassError3 <- c(misClassError3, mean(classifier_knn3 !=
dataset_test$class))
# logistics
model_list = list(model_1, model_2, model_3, model_4, model_5, model_6,
model_7)
## error
train_errors = sapply(model_list, get_logistic_error, data = dataset_trn,
res = "class", pos = "Kecimen", neg = "Besni", cut = 0.5)
test_errors = sapply(model_list, get_logistic_error, data = dataset_test,
                     res = "class", pos = "Kecimen", neg = "Besni", cut = 0.5)
logError1[,i] = test_errors
## iteration
i=i+1
}
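# Optional summary (not part of the original script): average test error per
# predictor across the 10 splits; rows follow the model_1..model_7 order
# (area, perimeter, MajorAxisLength, MinorAxisLength, Eccentricity, ConvexArea, Extent).
round(rowMeans(logError1), 3)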
# error rate graph
logError = as.data.frame(t(logError1))
logError[,8] = c(1:10)
colnames(logError) = c('area', 'perimeter', 'MajorAxisLength',
'MinorAxisLength', 'Eccentricity',
'ConvexArea', 'Extent', #'all', 'without extent',
'trial')
eror_log <- melt(logError, id = "trial")
## color
mycol = c('violet', 'cornflowerblue', 'darkkhaki', 'darkturquoise',
'lightgreen', 'lightgray',
'moccasin')#, 'sandybrown', 'lightcoral')
# logistic misclassification rate
t1 = ggplot(data = eror_log, aes(x = trial, y = value, group = variable,
color = variable))+
geom_line() +
geom_point(aes(x = trial, y = value, group = variable, color = variable),
shape = 21, stroke = 1, size = 2, fill = "white") +
geom_point(aes(x = trial, y = value, group = variable, color=variable),
alpha=0.2, shape = 21, stroke = 2, size =2.75, fill = "transparent") +
scale_color_manual(values = mycol) +
labs(title = "An error rate of logistic regression",
# subtitle = "10 splits",
y = "Error rates",
x = "10 iterations",
color = "Variables",
caption = "Data source: UCI Machine learning repository,
Raisin Dataset Data Set") +
theme(legend.position="bottom",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
axis.title.x = element_text(size=6),
panel.grid = element_blank(),
legend.text=element_text(size=7),
axis.title=element_text(size=7),
title =element_text(size=7),
panel.background = element_blank(),
legend.key.size = unit(0.25, 'cm'),
legend.title=element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
scale_x_continuous(breaks=seq(0, 10, 1))
# -------------------------------------------------------------------------- #
# knn misclassification rate
knnError1=data.frame(matrix(nrow=10, ncol=4))
knnError1[,1] = misClassError1; knnError1[,2] = misClassError2
knnError1[,3] = misClassError3; knnError1[,4] = c(1:10)
knnerr = knnError1; rownames(knnerr) = c(1:10)
colnames(knnerr) = c('k=25', 'k=75', 'k=125', 'trial')
knner <- melt(knnerr, id = "trial")
t2 = ggplot(data = knner, aes(x = trial, y = value, group = variable,
color = variable))+
geom_line() +
geom_point(aes(x = trial, y = value, group = variable, color = variable),
shape = 21, stroke = 1, size = 2, fill = "white") +
geom_point(aes(x = trial, y = value, group = variable, color=variable),
alpha=0.2, shape = 21, stroke = 2, size =2.75, fill = "transparent") +
scale_color_manual(values = mycol[1:3]) +
labs(title = "An error rate of KNN classifier",
#subtitle = "10 splits",
y = "Error rates",
x = "10 iterations",
color = "Variables",
caption = "Data source: UCI Machine learning repository,
Raisin Dataset Data Set", size=7) +
theme(legend.position="bottom",
axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
axis.ticks.y=element_blank(),
panel.grid = element_blank(),
axis.title.x = element_text(size=6),
title =element_text(size=7),
axis.title=element_text(size=7),
legend.text=element_text(size=7),
legend.key.size = unit(0.25, 'cm'),
legend.title=element_blank(),
panel.background = element_blank(),
panel.grid.major.x = element_line(color = "grey90"),
panel.grid.major.y = element_line(color = "grey90"),
panel.grid.minor = element_blank())+
scale_x_continuous(breaks=seq(0,10,1))+
scale_y_continuous(breaks=seq(0.04,0.24,0.02))
# -------------------------------------------------------------------------- #
# plots combined
grid.arrange(t1, t2, nrow=1, widths=c(1,1))
[Figure 4 image: combined output of grid.arrange(t1, t2); identical to Figure 2.]
Figure 4: (a) The line graph shows the 10 error rates for each of the 7 variables fitted by logistic
regression. (b) The line graph shows the 10 error rates of the KNN classifiers with k = 25, 75, and 125.
Table 2: ANOVA and an assessment of the fitted model (rendered output of the kable() call in this
section; identical in content to Table 1).
#data.frame(Accuracy = c(1-misClassError1, 1-misClassError2, 1-misClassError3),K = c())
## error rate for knn
clsf = round(cbind(mean(knnerr[,1]), mean(knnerr[,2]), mean(knnerr[,3])), 3)
colnames(clsf) = c('k=25', 'k=75', 'k=125')
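# Optional display (not part of the original script): the mean KNN error rates
# computed above; the report quotes roughly 0.151, 0.143 and 0.144.
print(clsf)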
# model
model_8 = glm(as.factor(class) ~ ., data = dataset_trn, family = "binomial")
# important variable
var_im = round(summary(model_8)$coefficients[,-c(2,3)], 2)
# variable importance
kable(list(round(anova(model_8, test="Chisq"),2)[,-c(1,2,3)], var_im),
#round(varImp(model_8, scale = FALSE),2)),
caption = "ANOVA and an assessment on fitted model",
format='latex')%>%
kable_styling(font_size = 7.5)
References
• Çinar, İ., Koklu, M., & Taşdemir, Ş. (2020). Classification of raisin grains using machine
  vision and artificial intelligence methods. Gazi Mühendislik Bilimleri Dergisi (GMBD), 6(3),
  200-209.
• UCI Machine Learning Repository: Raisin Dataset Data Set. (n.d.). Retrieved February 14,
  2022, from https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset
• Dalpiaz, D. (2020, October 28). R for Statistical Learning, Chapter 10: Logistic Regression.
  Retrieved February 14, 2022, from https://daviddalpiaz.github.io/r4sl/logistic-regression.html
• Soares, F. C. (2020, December 11). Exploring predictors' importance in binomial logistic
  regressions. Retrieved February 14, 2022, from
  https://cran.r-project.org/web/packages/dominanceanalysis/vignettes/da-logistic-regression.html
• K-NN classifier in R programming. GeeksforGeeks. (2020, June 22). Retrieved February 14,
  2022, from https://www.geeksforgeeks.org/k-nn-classifier-in-r-programming/
• Varghese, D. (2019, May 10). Comparative study on classic machine learning algorithms.
  Medium. Retrieved February 14, 2022, from
  https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222
• UCLA: Logit Regression | R Data Analysis Examples. OARC Stats. (n.d.). Retrieved
  February 15, 2022, from https://stats.oarc.ucla.edu/r/dae/logit-regression/