9 – NEAREST NEIGHBOR AND NAÏVE BAYES CLASSIFIERS

9.1 – Nearest Neighbor Classification

9.2 – Nearest Neighbor Classification in R

The function knn in the class library performs nearest neighbor classification for test cases based on a training dataset; see its help file (?knn) for details on the arguments. You will also need a function to determine the percentage of observations misclassified. Here fit is the predicted class from the model and y is the vector of actual response categories.

misclass = function(fit, y) {
  temp <- table(fit, y)                     # rows = predicted class, columns = actual class
  cat("Table of Misclassification\n")
  cat("(row = predicted, col = actual)\n")
  print(temp)
  cat("\n\n")
  numcor <- sum(diag(temp))                 # number classified correctly
  numinc <- length(y) - numcor              # number misclassified
  mcr <- numinc/length(y)                   # misclassification rate
  cat(paste("Misclassification Rate = ", format(mcr, digits = 3)))
  cat("\n")
}

Example 9.1 – Handwritten Digit Recognition on U.S. Postal Service Zip Codes

> library(class)
> library(ElemStatLearn)
> digits.knn = knn(zip.train[,-1], zip.test[,-1], cl = zip.train[,1], k = 3)

Note that the first column in zip.train and zip.test is the correct digit; the remaining 256 columns are the gray-scale darkness readings for the 16 x 16 pixel images.

[Figure: sample handwritten digit images]

> misclass(digits.knn, zip.test[,1])
Table of Misclassification
(row = predicted, col = actual)
     y
fit     0   1   2   3   4   5   6   7   8   9
  0   355   0   7   2   0   3   3   0   4   2
  1     0 258   0   0   2   0   0   1   0   0
  2     2   0 182   2   0   2   1   1   1   0
  3     0   0   1 154   0   4   0   1   4   0
  4     0   3   1   0 182   0   2   4   0   3
  5     1   0   0   6   2 145   0   0   2   0
  6     0   2   1   0   2   1 164   0   0   0
  7     0   1   2   1   2   0   0 138   1   4
  8     0   0   4   0   1   1   0   1 151   0
  9     1   0   0   1   9   4   0   1   3 168

Misclassification Rate =  0.0548

The klaR package also contains the function sknn, which performs kNN and weighted kNN classification and allows the use of standard formula notation. It would not work with the digit recognition data, however; the reason is unclear.

Example 9.2 – Classifying Species of Water Bears

> library(klaR)
> wb.sknn = sknn(Species ~ ., data = WaterBears)
> misclass(predict(wb.sknn)$class, WaterBears$Species)
Table of Misclassification
(row = predicted, col = actual)
            y
fit          bohleberi roanensis smokiensis unakaensis
  bohleberi         10         0          0          0
  roanensis          0        10          1          1
  smokiensis         0         0          9          1
  unakaensis         0         0          0          9

Misclassification Rate =  0.0732

A weighted nearest neighbor approach can yield superior results. It gives more weight to training observations that are close to the point x being classified, with the weight for observation x_i determined from the distance by

    w_i = exp(−γ‖x_i − x‖)

where γ ≥ 0 controls how quickly the weights decay with distance (γ = 0 gives ordinary unweighted kNN).

> wb.sknn = sknn(Species ~ ., data = WaterBears, gamma = 4)
> misclass(predict(wb.sknn)$class, WaterBears$Species)
Table of Misclassification
(row = predicted, col = actual)
            y
fit          bohleberi roanensis smokiensis unakaensis
  bohleberi         10         0          0          0
  roanensis          0        10          0          0
  smokiensis         0         0         10          0
  unakaensis         0         0          0         11

Misclassification Rate =  0

The weighted kNN approach gives no misclassifications on the training data. Cross-validation of classification methods is still critical, especially for methods that have the potential to overfit the training data.
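To get a feel for how the gamma argument drives this behavior, the short sketch below (an illustration added here, using made-up neighbor distances) evaluates w = exp(−γ·d) for a few distances at several values of γ. With γ = 0 every neighbor receives weight 1, so the rule reduces to ordinary unweighted kNN; as γ grows, all but the very closest neighbors are down-weighted toward zero, which is why large values of γ can reproduce the training data almost perfectly.

> d = c(0.1, 0.5, 1, 2, 4)                           # made-up neighbor distances
> w = sapply(c(0, 2, 4, 8), function(g) exp(-g*d))   # one column of weights per gamma value
> colnames(w) = paste("gamma =", c(0, 2, 4, 8))
> round(w, 4)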
Cross-validation function for knn()

knn.cv = function(train, y, B = 25, p = .333, k = 3) {
  # B random training/holdout splits; p = proportion of cases held out each time
  y = as.factor(y)
  n = length(y)
  cv <- rep(0, B)
  leaveout = floor(n*p)
  for (i in 1:B) {
    sam <- sample(1:n, leaveout, replace = F)        # indices of the held-out cases
    pred <- knn(train[-sam,], train[sam,], y[-sam], k = k)
    tab <- table(y[sam], pred)
    mc <- leaveout - sum(diag(tab))                  # number misclassified
    cv[i] <- mc/leaveout                             # holdout misclassification rate
  }
  cv
}

Cross-validation function for sknn()

sknn.cv = function(train, y, B = 25, p = .333, k = 3, gamma = 0) {
  y = as.factor(y)
  data = data.frame(y, train)
  n = length(y)
  cv <- rep(0, B)
  leaveout = floor(n*p)
  for (i in 1:B) {
    sam <- sample(1:n, leaveout, replace = F)        # indices of the held-out cases
    temp <- data[-sam,]                              # training portion
    fit <- sknn(y ~ ., data = temp, k = k, gamma = gamma)
    pred = predict(fit, newdata = train[sam,])$class
    tab <- table(y[sam], pred)
    mc <- leaveout - sum(diag(tab))                  # number misclassified
    cv[i] <- mc/leaveout                             # holdout misclassification rate
  }
  cv
}

> names(WaterBears)
[1] "Species" "SSI"  "BTWa" "BTWs" "BTWp" "BTWL" "BTPA"
> X = WaterBears[,-1]
> y = WaterBears[,1]

> results = knn.cv(X, y, B = 1000, p = .25)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1000  0.2000  0.1639  0.2000  0.6000

> results = sknn.cv(X, y, B = 1000, p = .25)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1000  0.2000  0.1673  0.2000  0.6000

> results = sknn.cv(X, y, B = 1000, p = .25, gamma = 4)    # weighted kNN
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1000  0.1000  0.1392  0.2000  0.6000

> results = sknn.cv(X, y, B = 1000, p = .25, gamma = 2)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1000  0.1000  0.1395  0.2000  0.5000

> results = sknn.cv(X, y, B = 1000, p = .25, gamma = 8)
> summary(results)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1000  0.1000  0.1446  0.2000  0.5000

For these data the weighted versions (γ = 2, 4, 8) give somewhat lower mean cross-validated misclassification rates (about 0.14) than unweighted kNN (about 0.16).

9.3 – Naïve Bayes Classification

9.4 – Naïve Bayes Classifiers in R

There are two packages that contain Naïve Bayes classifier functions: klaR (NaiveBayes) and e1071 (naiveBayes). Both functions seem to work similarly, but I did run into some issues obtaining the predicted classes when using the one in klaR. However, the klaR package has several functions for plotting the results from classification methods when the predictors are numeric.

Example 9.2 – Water Bears (cont'd)

> wb.nb = NaiveBayes(Species ~ ., data = WaterBears)
> misclass(predict(wb.nb)$class, WaterBears$Species)    # note the $class
Table of Misclassification
(row = predicted, col = actual)
            y
fit          bohleberi roanensis smokiensis unakaensis
  bohleberi         10         0          0          0
  roanensis          0         8          0          1
  smokiensis         0         1         10          0
  unakaensis         0         1          0         10

Misclassification Rate =  0.0732

> library(e1071)
> wb.nb2 = naiveBayes(Species ~ ., data = WaterBears)
> misclass(predict(wb.nb2, newdata = WaterBears[,-1]), WaterBears$Species)
Table of Misclassification
(row = predicted, col = actual)
            y
fit          bohleberi roanensis smokiensis unakaensis
  bohleberi         10         0          0          0
  roanensis          0         8          0          1
  smokiensis         0         1         10          0
  unakaensis         0         1          0         10

Misclassification Rate =  0.0732

> partimat(Species ~ ., data = WaterBears, method = "naiveBayes")
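To connect these fits to the underlying model, the sketch below (an illustration added here, not part of the original notes) computes the Naïve Bayes posterior probabilities for a single water bear "by hand," assuming Gaussian class-conditional densities for each numeric predictor, which is the default behavior in both NaiveBayes and naiveBayes. The posterior for a class is proportional to its estimated prior times the product of the per-predictor normal densities evaluated at the observation. The helper nb.post is hypothetical and uses the X and y objects defined above.

# Posterior probabilities for one observation under a Gaussian naive Bayes model
# (illustration only -- NaiveBayes/naiveBayes do this internally)
nb.post = function(x0, X, y) {
  y = as.factor(y)
  post = sapply(levels(y), function(k) {
    Xk = X[y == k, ]                   # training cases in class k
    prior = mean(y == k)               # estimated prior P(class k)
    # product of univariate normal densities -- the "naive" independence assumption
    lik = prod(dnorm(unlist(x0), mean = colMeans(Xk), sd = apply(Xk, 2, sd)))
    prior * lik
  })
  post/sum(post)                       # normalize so the posteriors sum to 1
}

> nb.post(X[1,], X, y)    # compare with predict(wb.nb)$posterior[1,]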
Cross-validation Functions for Naïve Bayes Classifiers

The first function below uses naiveBayes from e1071 and tends to behave better than the NaiveBayes-based version that follows it, because the naiveBayes fits are more stable.

nB.cv = function(X, y, B = 25, p = .333, laplace = 0) {
  y = as.factor(y)
  data = data.frame(y, X)
  n = length(y)
  cv <- rep(0, B)
  leaveout = floor(n*p)
  for (i in 1:B) {
    sam <- sample(1:n, leaveout, replace = F)        # indices of the held-out cases
    temp <- data[-sam,]                              # training portion
    fit <- naiveBayes(y ~ ., data = temp, laplace = laplace)
    pred = predict(fit, newdata = X[sam,])
    tab <- table(y[sam], pred)
    mc <- leaveout - sum(diag(tab))                  # number misclassified
    cv[i] <- mc/leaveout                             # holdout misclassification rate
  }
  cv
}

> results = nB.cv(X, y, B = 500)
> summary(results)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.07692  0.15380  0.14380  0.23080  0.61540

The version below, which uses NaiveBayes from klaR, tends to be buggier because the NaiveBayes fits can be unstable.

NB.cv = function(X, y, B = 25, p = .333, fL = 0, usekernel = F) {
  y = as.factor(y)
  data = data.frame(y, X)
  n = length(y)
  cv <- rep(0, B)
  leaveout = floor(n*p)
  for (i in 1:B) {
    sam <- sample(1:n, leaveout, replace = F)        # indices of the held-out cases
    temp <- data[-sam,]                              # training portion
    fit <- NaiveBayes(y ~ ., data = temp, fL = fL, usekernel = usekernel)
    pred = predict(fit, newdata = X[sam,])$class
    tab <- table(y[sam], pred)
    mc <- leaveout - sum(diag(tab))                  # number misclassified
    cv[i] <- mc/leaveout                             # holdout misclassification rate
  }
  cv
}

> results = NB.cv(X, y, B = 500, fL = 4)
There were 50 or more warnings (use warnings() to see the first 50)
> summary(results)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.07692  0.15380  0.14400  0.15380  0.53850
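As a wrap-up, one simple way to compare the classifiers on the Water Bears data is to run each of the cross-validation functions above under the same settings and compare the average holdout misclassification rates. The snippet below is a sketch along those lines; the exact numbers will change from run to run because the holdout sets are chosen at random, so set.seed is used to make a single run reproducible.

> set.seed(1)                                                # reproducible random splits
> mean(knn.cv(X, y, B = 1000, p = .25, k = 3))               # unweighted kNN
> mean(sknn.cv(X, y, B = 1000, p = .25, k = 3, gamma = 4))   # weighted kNN
> mean(nB.cv(X, y, B = 1000, p = .25))                       # naive Bayes (e1071)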