Classification by nearest neighbor strategy: Lazy Learning via the kNN algorithm.

Nearest Neighbor Algorithm. Given a dataset of cancer biopsies, each described by several features, we want to train a classifier that can decide whether a given sample represents a benign or a malignant case. Classification by the nearest neighbor approach has the following characteristics.

Pros: (a) simple and effective, (b) makes no assumption about the underlying distribution, (c) fast training phase.
Cons: (a) produces no model (no abstraction, hence "Lazy Learning"), (b) the classification phase is slow, (c) needs a lot of memory, (d) nominal features and missing data need additional handling.

Training dataset: examples already classified into the known categories. For each record in the test dataset, kNN identifies the k training samples most similar to it; the unlabeled test record is assigned the category label held by the majority of those k nearest neighbors.

[Figure: kNN neighborhood with k = 5]

The distance between the target record and a training sample (the "geographical" distance in feature space) can be measured in many ways; by default it is the Euclidean metric. For two points (x_1, y_1) and (x_2, y_2):

Euclidean: \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
Manhattan: |x_1 - x_2| + |y_1 - y_2|
Minkowski: (|x_1 - x_2|^q + |y_1 - y_2|^q)^{1/q}

We are going to use the Wisconsin breast-cancer data available on our website. The knn() function in R uses the Euclidean distance metric.

The major issue in kNN classification is the choice of k. If k is too large, the decision is essentially a majority vote over much of the dataset, regardless of which samples are actually nearest. If k is too small, noisy data or outliers dominate the decision. Neither extreme is acceptable; we need a k that avoids both.

[Figure: cluster separation boundary with a low k]

For any k we may encounter misclassifications: False Positives and False Negatives. k should be chosen so that these misclassifications are as low as possible.

First we read the data as a csv file from our website and examine its structure.

> md = read.csv("http://web.cs.sunyit.edu/~sengupta/num_maths/wisc_bc_data.csv")

Now let's look at its structure with str().

> str(md)

There are 32 variables (10 measurements x 3 summaries, plus id and diagnosis) over 569 samples. The id we do not need. The diagnosis is our focus; it is a factor with 2 levels: B (Benign) and M (Malignant). All the remaining features are numeric. We now inspect the data and prepare it further.

> table(md$diagnosis)
  B   M
357 212

The majority is Benign. Let us take proportions over the diagnosis factor to see what fraction is benign.

> round(prop.table(table(md$diagnosis))*100, digits=1)
   B    M
62.7 37.3

We are going to focus on three features of the data. You should try all of them, or some other combination of features.

> summary(md[c("radius_mean", "area_mean", "smoothness_mean")])
  radius_mean       area_mean      smoothness_mean
 Min.   : 6.981   Min.   : 143.5   Min.   :0.05263
 1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637
 Median :13.370   Median : 551.1   Median :0.09587
 Mean   :14.127   Mean   : 654.9   Mean   :0.09636
 3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530
 Max.   :28.110   Max.   :2501.0   Max.   :0.16340

Everything looks fine, except that the data need to be polished further: kNN is very sensitive to the measurement scale of the input features.
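To see why scale matters, here is a minimal sketch (not part of the original workflow; the two sample vectors are made-up illustrative values, not rows of the dataset). On the raw scale, area_mean is in the hundreds while smoothness_mean is below 0.2, so area_mean dominates the distance almost entirely.

# Hypothetical feature vectors (radius_mean, area_mean, smoothness_mean); values are made up
s1 <- c(14.1, 650.0, 0.096)
s2 <- c(11.7, 420.0, 0.086)

sqrt(sum((s1 - s2)^2))        # Euclidean distance on the raw scale
sum(abs(s1 - s2))             # Manhattan distance
(sum(abs(s1 - s2)^3))^(1/3)   # Minkowski distance with q = 3

# Distance computed from area_mean alone -- nearly the same number as the
# Euclidean distance above, so the unscaled area_mean swamps the other features
abs(s1[2] - s2[2])

This is exactly what the rescaling in the next step is meant to prevent.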
In order to make sure our classifier is not comparing apples with oranges, we rescale each of the features (except the diagnosis, which is a factor) with the function

> normalize <- function(x) {
+   return((x - min(x)) / (max(x) - min(x)))
+ }

> # let's try it on a vector
> q = c(12, 56, 3, 2.7, 9.8, 12.6, -5.3)
> q = normalize(q)
> q
[1] 0.2822186 1.0000000 0.1353997 0.1305057 0.2463295 0.2920065 0.0000000

The minimum lands at 0.0 and the maximum at 1.0. We now normalize the numerical columns and build a data frame with the normalized attributes, applying normalize to columns 2:31 with lapply. We drop the id column first and keep the diagnosis labels aside, since normalization applies only to the numeric columns and we will need the labels later.

> md = md[-1]            # drop the id column, which we do not need
> labels = md$diagnosis  # save the class labels before rescaling
> # redefine md by applying the normalize function to all the numeric columns via lapply
> md = as.data.frame(lapply(md[, 2:31], normalize))

Any time you want to create a new data frame that might contain mixed data, you need to wrap the call in as.data.frame. Let's pick the three attributes of interest and put them in a new data frame qq.

> qq = data.frame(md$area_mean, md$radius_mean, md$smoothness_mean)
> summary(qq)

Check that every attribute is bounded between 0 and 1. We are now ready to define our training and testing sets. We split the data into two portions: the first 469 rows (samples) form the training set train, and the remaining 100 rows form the test set test.

> train = qq[1:469, ]            # first 469 rows for training
> test = qq[470:569, ]           # last 100 rows for our testbed
> train_labels = labels[1:469]   # our training classification factors
> test_labels = labels[470:569]  # the true labels of the test samples

The classifier will produce predicted labels for the test set, which we then compare against test_labels. The knn function lives in the package "class", which we need to install. You may need to run install.packages("class") first from a mirror site, and then call library(class).

> library(class)
> p = knn(train, test, cl = train_labels, k = 3)

Thank Heaven, it worked. Now we need to tidy up the results. For this we use a package called "gmodels", which provides cross-tabulation.

> library("gmodels")
> CrossTable(x = test_labels, y = p, prop.chisq = FALSE)

The final output emerges now (we do not want the chi-square results). There were 100 test samples. The top-left cell (True Negatives) reads 56/61: the true labels mark 61 of the test samples as benign, and the classifier labeled 56 of them benign. The top-right cell holds the False Positives: 5 of those 61 benign samples were misclassified as malignant. The bottom-left cell gives the False Negatives: 10 of the 39 malignant samples were wrongly classified as benign. The bottom-right cell shows the True Positives: 29 of the 39 were correctly labeled malignant.

Perhaps we would get a different result with a different k. Find out how the misclassification counts change as you vary k. Let us try k = 7.

> p = knn(train, test, cl = train_labels, k = 7)
> CrossTable(x = test_labels, y = p, prop.chisq = FALSE)

Now we have made it slightly worse. Look at the cross-tabulation figures: the top row did not change, but there is now one more misclassification among the False Negatives. Now you know how to approach the problem.
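To explore the effect of k more systematically, one option is to sweep over several values and count the disagreements between the predictions and test_labels. This is a sketch, not part of the original notes; it assumes the train, test, train_labels, and test_labels objects defined above are already in the workspace, and the particular k values tried are an arbitrary choice.

library(class)

# Sweep over several k values and count misclassifications on the 100 test samples
for (k in c(1, 3, 5, 7, 9, 11, 15, 21)) {
  p <- knn(train, test, cl = train_labels, k = k)
  cat("k =", k, " misclassified:", sum(p != test_labels), "out of", length(test_labels), "\n")
}

A value of k with a low error count on this test set, and not so small that single noisy neighbors decide the label, is the kind of compromise discussed earlier.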