STATS 780 Assignment 1

Student number: 400415239

Kyuson Lim

15 February, 2022

Contents

Supplementary material
Reference

The data contain 900 raisin grains with 7 extracted morphological features, sourced from the UCI Machine Learning Repository (Çinar et al., 2020). The variables are the area (the number of pixels), the perimeter, the major axis length (the longest line), the minor axis length (the shortest line), the eccentricity, the convex area (the smallest convex shell), the extent, and the class, which consists of the Kecimen and Besni raisin varieties. The data suit a logistic regression model, since the class takes only two values and the numerical measurements vary by class.

[Figure: panels "A density-scatter plot of raisin" (Extent against MinorAxisLength) and "A box-plot of variables in raisin data", coloured by class (Besni, Kecimen). Data source: UCI Machine Learning Repository, Raisin Dataset.]

Figure 1: (a) The density-scatter plot shows the relationship between the extent and minor axis length variables. (b) The box-plot shows the scaled (normalized) distribution of each variable.

The variables in the data are area, perimeter, MajorAxisLength, MinorAxisLength, Eccentricity, ConvexArea and Extent, which are all numeric, and Class, which is categorical. A total of 900 raisin grains were used, 450 from each of the Kecimen and Besni varieties, with no missing data. From the box-plot, the Kecimen variety has an overall lower mean area of 63413.47 pixels with 4 outliers, a mean major axis length of 352.86 pixels with 4 outliers, a mean minor axis length of 229.35 pixels with 5 outliers, a mean eccentricity of 0.74 with 9 outliers, a mean convex area of 65696.36 pixels with 4 outliers, a mean extent of 0.71 with 10 outliers, and a mean perimeter of 983.69 pixels with 3 outliers. For the same variables, the Besni variety has a mean area of 112194.79 pixels with 3 outliers, a mean major axis length of 509 pixels with 3 outliers, a mean minor axis length of 279.62 pixels with 2 outliers, a mean eccentricity of 0.82 with 12 outliers, a mean convex area of 116675.82 pixels with 3 outliers, a mean extent of 0.69 with 2 outliers, and a mean perimeter of 1348.13 pixels with 2 outliers. The minor axis length overlaps only slightly between the two varieties and has a moderate correlation of 0.45 with the extent variable (Figure 1), whereas the other variable pairs have correlations of 0.83, 0.98, 0.96, -0.17 and 0.98, which are either very strong or very weak. The separation becomes obvious in the box-plot (Figure 1), where the Kecimen variety has a lower mean than Besni for every scaled (standardized) variable except extent. From the scaled (normalized) variables, 675 observations are randomly chosen in each of 10 iterations to fit the logistic regression and KNN classifiers, which are then tested on the remaining 225 observations.
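The correlations quoted above can be reproduced directly from the data. The lines below are a minimal sketch, not part of the original supplementary code, and assume the data have been loaded and scaled into dataset_scaled as in the supplementary material.

# correlation between the two plotted variables (quoted as 0.45 above)
round(cor(dataset_scaled$Extent, dataset_scaled$MinorAxisLength), 2)
# full correlation matrix of the 7 numeric features
round(cor(dataset_scaled[, 1:7]), 2)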
[Figure: panels "An error rate of logistic regression" and "An error rate of KNN classifier", showing error rates over 10 iterations for the 7 single-variable logistic models and for KNN with k = 25, 75 and 125. Data source: UCI Machine Learning Repository, Raisin Dataset.]

Figure 2: (a) The line graph shows 10 error rates for each of the 7 variables fitted by logistic regression. (b) The line graph shows 10 error rates of the KNN classifiers with k = 25, 75 and 125.

Each variable is tested on 10 different test sets, generated randomly with a while loop and sample(). The resulting error rates are shown in Figure 2. The lowest error rates for the logistic regression models are achieved by the extent and perimeter variables. For the KNN classifier, the mean error rate is 0.151 for k = 25, 0.143 for k = 75 and 0.144 for k = 125, so the lowest error rate is obtained at k = 75. Both methods use the probability that a response belongs to a specific category, and both achieve accuracy above 0.85. Logistic regression is parametric, cannot be applied to non-linear classification problems, and is sensitive to collinearity among predictors; the extent variable yields an error rate of 0.148, while the convex area predictor gives a higher error rate for this classification. Unlike logistic regression, the KNN classifier is non-parametric, does not tell us which predictors are important, and can handle non-linear classification. We identify the optimal value of the tuning parameter k (here k = 75) by this empirical approach, without considering the individual predictors.

From Table 1, the ANOVA table indicates that adding the area variable alone reduces the deviance drastically, but the variable is not statistically significant (p-value = 0.17), even though its coefficient has the largest magnitude. Hence, the extent variable is the important variable, given its coefficient magnitude, its statistical significance, its reduction in deviance (14.16) and the lowest error rate (0.148) achieved with it. For every one-unit change in extent, the log odds of Kecimen (versus Besni) decrease by 8.11 when the other 6 variables are held fixed. Thus, the extent variable is essential in the logistic regression.
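To make the coefficient interpretation concrete, the log-odds change of -8.11 for extent can be converted into an odds ratio. The line below is a minimal sketch, not part of the original supplementary code; note that "one unit" is one standard deviation here, since the predictors were scaled before fitting.

# odds ratio for a one-standard-deviation increase in extent,
# using the fitted coefficient of -8.11 reported in Table 1
exp(-8.11)   # roughly 3e-04, so the odds of Kecimen (versus Besni) shrink sharply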
Table 1: ANOVA and an assessment on fitted model

Sequential ANOVA (Chi-squared tests):
Term              Resid. Dev   Pr(>Chi)
NULL                  935.71         NA
area                  518.95       0.00
perimeter             439.93       0.00
MajorAxisLength       439.93       0.96
MinorAxisLength       439.90       0.86
Eccentricity          430.65       0.00
ConvexArea            430.47       0.67
Extent                416.31       0.00

Fitted coefficients:
Term              Estimate   Pr(>|z|)
(Intercept)          -0.71       0.02
area                 -9.20       0.17
perimeter             3.68       0.11
MajorAxisLength       3.37       0.05
MinorAxisLength       0.02       0.97
Eccentricity          6.87       0.33
ConvexArea           -0.03       0.88
Extent               -8.11       0.00

Supplementary material

# packages used throughout the supplementary code
library(readxl)      # read_excel()
library(dplyr)       # pipe operator %>%
library(reshape2)    # melt()
library(ggplot2)     # plotting
library(gridExtra)   # grid.arrange()
library(class)       # knn()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# parse the downloaded data (Excel file)
dataset = read_excel('/Users/kyusoinlims/Desktop/Raisin_Dataset.xlsx')
colnames(dataset) = c('area', 'perimeter', 'MajorAxisLength', 'MinorAxisLength',
                      'Eccentricity', 'ConvexArea', 'Extent', 'Class')

# data transformation: standardize the 7 numeric features
dataset_scaled = dataset[,1:7] %>% scale() %>% as.data.frame()
dataset_scaled$class = as.factor(dataset$Class)

# 2 plots
## Bivariate perspective
p1 = ggplot(dataset_scaled, aes(x=Extent, y=MinorAxisLength)) +
  stat_density2d(geom = "polygon", aes(alpha = (..level..)/100, fill = class), bins = 9) +
  scale_fill_manual(values=c("#1ED14B","#11a6fc")) +
  geom_point(aes(color = class), shape = 21, stroke = 0.2, size = 0.5, fill='transparent') +
  scale_alpha_continuous(guide='none') +
  theme(legend.position="bottom",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y=element_blank(), panel.grid = element_blank(),
        title=element_text(size=7), axis.title=element_text(size=7),
        legend.text=element_text(size=7), legend.title=element_text(size=7),
        axis.text.x = element_text(size=6), legend.key.size = unit(0.25, 'cm'),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_y_continuous(breaks=seq(-5, 2, 1)) +
  scale_x_continuous(breaks=seq(-2, 6, 2)) +
  scale_color_manual(values=c("#7CAE00","#3582c4"), guide='none') +
  labs(title = "A density-scatter plot of raisin",
       caption = "Data source: UCI Machine learning repository, Raisin Dataset Data Set",
       size=5)

## univariate perspective
data_long <- melt(dataset_scaled, id = "class")
p2 = ggplot(data_long, aes(x = variable, y = value, color=class)) +
  geom_boxplot(varwidth = T) +
  theme(legend.position="bottom",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y=element_blank(), panel.grid = element_blank(),
        panel.background = element_blank(), axis.title=element_text(size=7),
        legend.background = element_blank(), title=element_text(size=7),
        legend.title=element_text(size=7), legend.text=element_text(size=7),
        legend.key.size = unit(0.25, 'cm'),
        axis.text.x = element_text(angle = 22.5, vjust = 0.5, hjust=1, size=6),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  xlab(NULL) +
  scale_y_continuous(breaks=seq(-7, 25, 1)) +
  scale_color_manual(values=c("#7CAE00","#11a6fc")) +
  labs(title = "A box-plot of variables in raisin data",
       caption = "Data source: UCI Machine learning repository, Raisin Dataset Data Set",
       cex.labs=0.5)

# -------------------------------------------------------------------------- #
# plots combined
grid.arrange(p1, p2, nrow=1, widths=c(1,1))
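As an aside (not part of the original code), the scale() call in the data transformation step above is the standardization referred to as "scaled (normalized)" in the report: each feature is centred by its mean and divided by its standard deviation. A minimal check on one column:

# the scaled area column should match (x - mean(x)) / sd(x) computed by hand
all.equal(dataset_scaled$area,
          as.numeric((dataset$area - mean(dataset$area)) / sd(dataset$area)))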
[Figure: rendered output of the chunk above, identical to Figure 1: panels "A density-scatter plot of raisin" and "A box-plot of variables in raisin data", coloured by class (Besni, Kecimen). Data source: UCI Machine Learning Repository, Raisin Dataset.]

Figure 3: (a) The density-scatter plot shows the relationship between the extent and minor axis length variables. (b) The box-plot shows the scaled (normalized) distribution of each variable.

# test-train split of the data
set.seed(780)
i = 1
# logistic regression error rates (7 variables x 10 iterations)
logError1 = data.frame(matrix(nrow=7, ncol=10))
# KNN error rates
misClassError1 = as.numeric(); misClassError2 = as.numeric()
misClassError3 = as.numeric()

while (i <= 10){
  dataset_obs = nrow(dataset_scaled)
  ## split: 75% training, 25% test
  dataset_idx = sample(dataset_obs, size = trunc(0.75 * dataset_obs))
  dataset_trn = dataset_scaled[dataset_idx,]    # training set
  dataset_test = dataset_scaled[-dataset_idx,]  # test set

  # Feature Scaling
  train_scale <- scale(dataset_trn[, 1:7])
  test_scale <- scale(dataset_test[, 1:7])

  # -------------------------------------------------------------------------- #
  # logistic modeling: one model per variable
  model_1 = glm(as.factor(class) ~ area, data = dataset_trn, family = "binomial")
  model_2 = glm(as.factor(class) ~ perimeter, data = dataset_trn, family = "binomial")
  model_3 = glm(as.factor(class) ~ MajorAxisLength, data = dataset_trn, family = "binomial")
  model_4 = glm(as.factor(class) ~ MinorAxisLength, data = dataset_trn, family = "binomial")
  model_5 = glm(as.factor(class) ~ Eccentricity, data = dataset_trn, family = "binomial")
  model_6 = glm(as.factor(class) ~ ConvexArea, data = dataset_trn, family = "binomial")
  model_7 = glm(as.factor(class) ~ Extent, data = dataset_trn, family = "binomial")
  # model_8 = glm(as.factor(class) ~ ., data = dataset_trn, family = "binomial")
  # model_9 = glm(as.factor(class) ~ . -Extent, data = dataset_trn, family = "binomial")
  # misclassification rate of a fitted logistic model at a given cutoff
  get_logistic_error = function(mod, data, res = "y", pos = 2, neg = 1, cut = 0.5) {
    probs = predict(mod, newdata = data, type = "response")  # predicted P(pos)
    preds = ifelse(probs > cut, pos, neg)                    # classify at the cutoff
    mean(data[, res] != preds)                               # proportion misclassified
  }

  # -------------------------------------------------------------------------- #
  # Fitting KNN Model
  classifier_knn1 <- knn(train = train_scale, test = test_scale,
                         cl = dataset_trn$class, k = 25)
  classifier_knn2 <- knn(train = train_scale, test = test_scale,
                         cl = dataset_trn$class, k = 75)
  classifier_knn3 <- knn(train = train_scale, test = test_scale,
                         cl = dataset_trn$class, k = 125)

  # -------------------------------------------------------------------------- #
  # knn test error rates
  misClassError1 <- c(misClassError1, mean(classifier_knn1 != dataset_test$class))
  misClassError2 <- c(misClassError2, mean(classifier_knn2 != dataset_test$class))
  misClassError3 <- c(misClassError3, mean(classifier_knn3 != dataset_test$class))

  # logistic test error rates
  model_list = list(model_1, model_2, model_3, model_4, model_5, model_6, model_7)
  ## error
  train_errors = sapply(model_list, get_logistic_error, data = dataset_trn,
                        res = "class", pos = "Kecimen", neg = "Besni", cut = 0.5)
  test_errors = sapply(model_list, get_logistic_error, data = dataset_test,
                       res = "class", pos = "Kecimen", neg = "Besni", cut = 0.5)
  logError1[,i] = test_errors

  ## iteration
  i = i + 1
}

# error rate graph
logError = as.data.frame(t(logError1))
logError[,8] = c(1:10)
colnames(logError) = c('area', 'perimeter', 'MajorAxisLength', 'MinorAxisLength',
                       'Eccentricity', 'ConvexArea', 'Extent', 'trial')
eror_log <- melt(logError, id = "trial")

## color
mycol = c('violet', 'cornflowerblue', 'darkkhaki', 'darkturquoise',
          'lightgreen', 'lightgray', 'moccasin')

# logistic misclassification rate
t1 = ggplot(data = eror_log, aes(x = trial, y = value, group = variable, color = variable)) +
  geom_line() +
  geom_point(aes(x = trial, y = value, group = variable, color = variable),
             shape = 21, stroke = 1, size = 2, fill = "white") +
  geom_point(aes(x = trial, y = value, group = variable, color = variable),
             alpha = 0.2, shape = 21, stroke = 2, size = 2.75, fill = "transparent") +
  scale_color_manual(values = mycol) +
  labs(title = "An error rate of logistic regression",
       y = "Error rates", x = "10 iterations", color = "Variables",
       caption = "Data source: UCI Machine learning repository, Raisin Dataset Data Set") +
  theme(legend.position="bottom",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y=element_blank(), axis.title.x = element_text(size=6),
        panel.grid = element_blank(), legend.text=element_text(size=7),
        axis.title=element_text(size=7), title=element_text(size=7),
        panel.background = element_blank(), legend.key.size = unit(0.25, 'cm'),
        legend.title=element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_x_continuous(breaks=seq(0, 10, 1))

# -------------------------------------------------------------------------- #
# knn misclassification rate
knnError1 = data.frame(matrix(nrow=10, ncol=4))
knnError1[,1] = misClassError1; knnError1[,2] = misClassError2
knnError1[,3] = misClassError3; knnError1[,4] = c(1:10)
knnerr = knnError1; rownames(knnerr) = c(1:10)
colnames(knnerr) = c('k=25', 'k=75', 'k=125', 'trial')
knner <- melt(knnerr, id = "trial")

t2 = ggplot(data = knner, aes(x = trial, y = value, group = variable, color = variable)) +
  geom_line() +
  geom_point(aes(x = trial, y = value, group = variable, color = variable),
             shape = 21, stroke = 1, size = 2, fill = "white") +
  geom_point(aes(x = trial, y = value, group = variable, color = variable),
             alpha = 0.2, shape = 21, stroke = 2, size = 2.75, fill = "transparent") +
  scale_color_manual(values = mycol[1:3]) +
  labs(title = "An error rate of KNN classifier",
       y = "Error rates", x = "10 iterations", color = "Variables",
       caption = "Data source: UCI Machine learning repository, Raisin Dataset Data Set",
       size=7) +
  theme(legend.position="bottom",
        axis.ticks.length.x = unit(0, "cm"), axis.ticks.length.y = unit(0, "cm"),
        axis.ticks.y=element_blank(), panel.grid = element_blank(),
        axis.title.x = element_text(size=6), title=element_text(size=7),
        axis.title=element_text(size=7), legend.text=element_text(size=7),
        legend.key.size = unit(0.25, 'cm'), legend.title=element_blank(),
        panel.background = element_blank(),
        panel.grid.major.x = element_line(color = "grey90"),
        panel.grid.major.y = element_line(color = "grey90"),
        panel.grid.minor = element_blank()) +
  scale_x_continuous(breaks=seq(0,10,1)) +
  scale_y_continuous(breaks=seq(0.04,0.24,0.02))

# -------------------------------------------------------------------------- #
# plots combined
grid.arrange(t1, t2, nrow=1, widths=c(1,1))

[Figure: rendered output of the chunk above, identical to Figure 2: panels "An error rate of logistic regression" and "An error rate of KNN classifier" over 10 iterations. Data source: UCI Machine Learning Repository, Raisin Dataset.]

Figure 4: (a) The line graph shows 10 error rates for each of the 7 variables fitted by logistic regression. (b) The line graph shows 10 error rates of the KNN classifiers with k = 25, 75 and 125.

## mean error rate for knn
clsf = round(cbind(mean(knnerr[,1]), mean(knnerr[,2]), mean(knnerr[,3])), 3)
colnames(clsf) = c('k=25', 'k=75', 'k=125')

# full model on all 7 predictors
model_8 = glm(as.factor(class) ~ ., data = dataset_trn, family = "binomial")
# important variable: coefficient estimates and p-values
var_im = round(summary(model_8)$coefficients[,-c(2,3)], 2)
# variable importance: sequential ANOVA shown next to the coefficient table
kable(list(round(anova(model_8, test="Chisq"),2)[,-c(1,2,3)], var_im),
      caption = "ANOVA and an assessment on fitted model", format='latex') %>%
  kable_styling(font_size = 7.5)

Table 2: ANOVA and an assessment on fitted model

Sequential ANOVA (Chi-squared tests):
Term              Resid. Dev   Pr(>Chi)
NULL                  935.71         NA
area                  518.95       0.00
perimeter             439.93       0.00
MajorAxisLength       439.93       0.96
MinorAxisLength       439.90       0.86
Eccentricity          430.65       0.00
ConvexArea            430.47       0.67
Extent                416.31       0.00

Fitted coefficients:
Term              Estimate   Pr(>|z|)
(Intercept)          -0.71       0.02
area                 -9.20       0.17
perimeter             3.68       0.11
MajorAxisLength       3.37       0.05
MinorAxisLength       0.02       0.97
Eccentricity          6.87       0.33
ConvexArea           -0.03       0.88
Extent               -8.11       0.00

Reference

• Çinar, İ., Koklu, M., & Taşdemir, Ş. (2020). Classification of raisin grains using machine vision and artificial intelligence methods. Gazi Mühendislik Bilimleri Dergisi (GMBD), 6(3), 200-209.

• UCI Machine Learning Repository: Raisin Dataset Data Set. (n.d.). Retrieved February 14, 2022, from https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset
• Dalpiaz, D. (2020, October 28). R for Statistical Learning, Chapter 10: Logistic Regression. Retrieved February 14, 2022, from https://daviddalpiaz.github.io/r4sl/logistic-regression.html

• Soares, F. C. (2020, December 11). Exploring predictors' importance in binomial logistic regressions. Retrieved February 14, 2022, from https://cran.r-project.org/web/packages/dominanceanalysis/vignettes/da-logistic-regression.html

• GeeksforGeeks. (2020, June 22). K-NN classifier in R programming. Retrieved February 14, 2022, from https://www.geeksforgeeks.org/k-nn-classifier-in-r-programming/

• Varghese, D. (2019, May 10). Comparative study on classic machine learning algorithms. Medium. Retrieved February 14, 2022, from https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222

• UCLA OARC Stats. (n.d.). Logit regression | R data analysis examples. Retrieved February 15, 2022, from https://stats.oarc.ucla.edu/r/dae/logit-regression/