CZ4041/CE4041: Machine Learning (2017-18, Semester II)

Notes on K-Nearest Neighbor Classifier

Slide 3: Recall that in inductive learning there are two phases: a training phase and a prediction phase. In the training phase, a predictive model is explicitly trained on a set of labeled training data. In the prediction phase, the trained model is used to make predictions on unseen test data.

Slide 4: Different from inductive learning, lazy learning has no training phase in which a predictive model is explicitly learned from the labeled training data. Instead, when a test instance arrives, a lazy learner first retrieves from the database a subset of labeled training instances that are most similar to the test instance, and then makes a prediction on the test instance based on the retrieved training instances. A representative lazy learning algorithm is the K-Nearest Neighbor (K-NN) classifier.

Slide 5: The key idea of the K-NN classifier is to retrieve the K closest instances (nearest neighbors) from the training dataset in order to classify each specific test instance.

Slide 6: To design a K-NN classifier, three elements are required:
1. A set of labeled training instances.
2. A distance metric to compute the distance between instances.
3. The value of K, i.e., the number of nearest neighbors to be retrieved.
Given these three elements and a test instance, the K-NN classifier first computes the distance between the test instance and each training instance, then identifies the K nearest neighbors based on these distances, and finally uses the class labels of the K nearest neighbors to determine the class label of the test instance, e.g., by taking the majority class among the K nearest neighbors.

Slide 7: Here are examples of the K nearest neighbors retrieved for the same test instance from the same training dataset under varying values of K (K = 1, 2, 3).

Slide 8: Regarding the distance metric, the most commonly used choice is the Euclidean distance. Note that to apply the Euclidean distance to categorical features, one can first use one-hot encoding, as introduced in the 1st lecture, to map each categorical feature to a vector of binary values.

Slide 9: Regarding the value of K, if K is too small, e.g., in the extreme case K = 1, then the performance of K-NN is sensitive to noise instances. If K is too large, then the neighborhood may include instances from irrelevant class(es). As shown in the figure, the test instance (the cross point) is supposed to belong to the positive class. If K is set to an appropriate value, the K-NN classifier makes a correct prediction; however, if K is set too large, the majority class of the neighborhood becomes negative, resulting in an incorrect prediction. In practice, one can use cross-validation to determine a proper value for K.

Slide 10: To determine the class label of a test instance, a simple solution, as mentioned on the previous slides, is to take the majority class of the K nearest neighbors of the test instance. Mathematically, the majority voting scheme can be written as

    y^* = \arg\max_{y} \sum_{(x_i, y_i) \in N_{x^*}} I(y = y_i),

where x^* is the test instance, N_{x^*} denotes the K-nearest neighborhood of x^* in the training dataset, and I(·) is the indicator function that returns 1 if its argument is true and 0 otherwise.
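To make the procedure on Slides 6-10 concrete, here is a minimal Python sketch (not from the slides) of a K-NN classifier with Euclidean distance and majority voting; the function name and the toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test instance by majority vote among its K nearest neighbors."""
    # 1. Euclidean distance from the test instance to every training instance
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2. Indices of the K nearest neighbors (smallest distances)
    nn_idx = np.argsort(dists)[:k]
    # 3. Majority class among the K nearest neighbors
    votes = Counter(y_train[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# Toy labeled training set with two numeric features
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array(["pos", "pos", "neg", "neg"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "pos"
```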
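Slide 9 recommends cross-validation for choosing K. The sketch below shows one way to do this, assuming scikit-learn is available; its KNeighborsClassifier stands in for the classifier described in these notes, and the two-blob dataset is a hypothetical stand-in rather than course data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical dataset: two Gaussian blobs with binary labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Estimate the accuracy of each candidate K with 5-fold cross-validation
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print("chosen K:", best_k, "cross-validated accuracy:", round(best_acc, 3))
```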
The majority voting scheme implies that every neighbor has the same influence on the classification vote, which makes the K-NN classifier sensitive to the choice of K.

Slide 11: One way to overcome this limitation of the majority voting scheme is to assign each neighbor a distance-based weight. Specifically, the key idea is to weight the influence of each nearest neighbor x_i according to its distance to the test instance, w_i = 1 / d(x^*, x_i)^2: the smaller the distance to the test instance, the more influential the nearest neighbor is in the classification vote. Mathematically, the distance-weighted voting scheme can be written as

    y^* = \arg\max_{y} \sum_{(x_i, y_i) \in N_{x^*}} w_i \times I(y = y_i), \quad \text{where } w_i = \frac{1}{d(x^*, x_i)^2}.

Slide 12: Here is an example. Suppose we use the 5-NN classifier for a binary classification problem. For a test instance represented by the green cross point, we have retrieved 5 nearest neighbors whose distances to the test instance and corresponding class labels are shown in the table. If we use the majority voting scheme, then since three of the 5 nearest neighbors are positive, we predict the label of the test instance as positive. However, if we use the distance-weighted voting scheme, what is the predicted label for the test instance? This is a tutorial question.

Slide 13: There are some other issues in using the K-NN classifier to make predictions. An important one is the scaling issue: different features may have quite different ranges of values, in which case some feature(s) may dominate the distance between instances. For example, suppose an instance (a person) is represented by three features: height, weight and income. The height of a person may vary from 1.5m to 1.8m, the weight from 90lb to 300lb, and the income from 10K to 1M. Compared to height or weight, income values are on a much larger scale and will dominate the distance measure between two instances (persons). To address this issue, one can normalize each feature so that all features are transformed into the same predefined range.

Slide 14: A typical normalization technique is min-max normalization. Given a new range [min_new, max_new], min-max normalization transforms an original value v_old into the new range via

    v_{new} = \frac{v_{old} - min_{old}}{max_{old} - min_{old}} (max_{new} - min_{new}) + min_{new},

where v_new is the new value of v_old in the range [min_new, max_new], and max_old and min_old are the maximum and minimum of the original feature values, respectively, obtained from the training dataset.

Slide 15: Another typical normalization technique is standardization, which transforms the values such that, after transformation, their mean equals 0 and their standard deviation equals 1. This can be done via

    v_{new} = \frac{v_{old} - \mu_{old}}{\sigma_{old}},

where \mu_{old} and \sigma_{old} are the mean and the standard deviation of the original values, obtained from the training dataset.

Slide 16: In summary, the K-NN classifier is a lazy learner that does not train a predictive model explicitly. The "training" procedure is therefore efficient, but in return, classifying test instances is time consuming.
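As a concrete illustration of the distance-weighted voting scheme from Slide 11, here is a minimal Python sketch; the small eps term is an added safeguard against a zero distance and is not part of the formula on the slide.

```python
import numpy as np

def knn_weighted_predict(X_train, y_train, x_test, k=5, eps=1e-12):
    """Classify one test instance with distance-weighted voting, w_i = 1 / d(x*, x_i)^2."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nn_idx = np.argsort(dists)[:k]              # K nearest neighbors
    weights = 1.0 / (dists[nn_idx] ** 2 + eps)  # eps avoids division by zero (added assumption)
    # Accumulate the total weight voting for each class
    scores = {}
    for i, w in zip(nn_idx, weights):
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)
```

With this scheme, a single very close neighbor can outweigh several distant ones, which is exactly the effect exercised by the tutorial question on Slide 12.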
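Similarly, the min-max normalization and standardization formulas on Slides 14 and 15 can be sketched as follows. Both functions estimate their statistics on the training data only and then apply the same transformation to the test data, as the slides prescribe; the income values are hypothetical.

```python
import numpy as np

def min_max_normalize(train_vals, test_vals, new_min=0.0, new_max=1.0):
    """Map values into [new_min, new_max] using the min/max of the training values."""
    old_min, old_max = train_vals.min(), train_vals.max()
    scale = (new_max - new_min) / (old_max - old_min)
    return ((train_vals - old_min) * scale + new_min,
            (test_vals - old_min) * scale + new_min)

def standardize(train_vals, test_vals):
    """Shift and scale to zero mean and unit standard deviation (training statistics)."""
    mu, sigma = train_vals.mean(), train_vals.std()
    return (train_vals - mu) / sigma, (test_vals - mu) / sigma

# Hypothetical income feature on a large scale (cf. Slide 13)
income_train = np.array([12_000.0, 45_000.0, 250_000.0, 1_000_000.0])
income_test = np.array([80_000.0])
print(min_max_normalize(income_train, income_test))
print(standardize(income_train, income_test))
```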