CZ4041/CE4041: Machine Learning (2017-18, Semester II)
Notes on K-Nearest Neighbor Classifier
Slide 3: Recall that in inductive learning, there are two phases: the training phase and the prediction phase. In the training phase, a predictive model is explicitly trained using a set of labeled training data. The trained predictive model is then used to make predictions on unseen test data in the prediction phase.
Slide 4: Different from inductive learning, lazy learning does not have a training phase to explicitly learn a predictive model from the labeled training data. In the setting of lazy learning, when a test instance comes, a lazy learner first retrieves a subset of labeled training instances from the database that are most similar to the test instance. The lazy learner then makes a prediction on the test instance based on the retrieved training instances. A representative lazy learning algorithm is the K-Nearest Neighbor (K-NN) classifier.
Slide 5: The key idea of the K-NN classifier is to retrieve the K closest instances (nearest neighbors) from the training dataset to perform classification on each specific test instance.
Slide 6: To design the K-NN classifier to make predictions, there are three requirements:
1. A set of labeled training instances.
2. A distance metric to compute distance between instances.
3. The value of K, i.e., the number of nearest neighbors to be retrieved.
Given the above three elements and a test instance, the K-NN classifier first computes the distance between the test instance and each training instance, and then identifies the K-nearest neighbors based on these distances. Finally, the K-NN classifier uses the class labels of the K-nearest neighbors to determine the class label of the test instance, e.g., using the majority class of the K-nearest neighbors.
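As an illustration (not from the slides), here is a minimal Python sketch of this procedure, assuming the training data are stored as NumPy arrays, Euclidean distance is used, and the majority class of the neighbors is returned:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        # 1. Compute the Euclidean distance from the test instance to every training instance
        distances = np.linalg.norm(X_train - x_test, axis=1)
        # 2. Identify the K nearest neighbors
        nearest = np.argsort(distances)[:k]
        # 3. Use the majority class of the K nearest neighbors as the prediction
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # Toy labeled training set: two features, two classes
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
    y_train = np.array(["pos", "pos", "neg", "neg"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "pos"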
Slide 7: Here are examples of the K-nearest neighbors retrieved for the same test instance from the same training dataset under varying values of K (K = 1, 2, 3).
Slide 8: Regarding the distance metric, the most commonly used one is the Euclidean distance. Note that to compute the Euclidean distance on categorical features, one can first use one-hot encoding, as introduced in the 1st lecture, to map each categorical feature to a vector of binary values.
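To make this concrete, below is a small Python sketch (an illustration only, with a hypothetical categorical feature "colour") of one-hot encoding followed by the Euclidean distance:

    import numpy as np

    # Hypothetical categorical feature "colour" with three possible values
    categories = ["red", "green", "blue"]

    def one_hot(value):
        # Map a categorical value to a binary vector (one-hot encoding)
        return np.array([1.0 if value == c else 0.0 for c in categories])

    def euclidean(a, b):
        # Euclidean distance between two numeric feature vectors
        return np.sqrt(np.sum((a - b) ** 2))

    x1 = one_hot("red")   # [1, 0, 0]
    x2 = one_hot("blue")  # [0, 0, 1]
    print(euclidean(x1, x2))  # sqrt(2), roughly 1.414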
Slide 9: Regarding the value of K in the K-NN classifier, if K is too small, e.g., in the extreme case K = 1, then the performance of K-NN will be sensitive to noisy instances. However, if K is too large, then the neighborhood may include instances
from irrelevant class(es). As shown in the figure, the test instance (the cross point) is
supposed to be of the positive class. If we set K to be an appropriate value, then the
K-NN classifier is able to make a correct prediction. However, if K is set to be too
large, then the majority class of the neighborhood becomes negative, resulting in an
incorrect prediction. In practice, one can use cross-validation to determine a proper
value for K.
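As a sketch of this practice (assuming scikit-learn is available; the Iris dataset here is just a stand-in, not an example from the slides), one could select K by 5-fold cross-validation as follows:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Evaluate candidate values of K by 5-fold cross-validation and keep the best one
    best_k, best_acc = None, 0.0
    for k in range(1, 16):
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc

    print(f"best K = {best_k}, cross-validated accuracy = {best_acc:.3f}")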
Slide 10: To determine the class label for a test instance, a simple solution as mentioned on the previous slides is based on the majority class of the K-nearest neighbors
of the test instance. Mathematically, the majority voting scheme can be written as
$$y^* = \arg\max_{y} \sum_{(x_i, y_i) \in N_{x^*}} I(y = y_i),$$
where $x^*$ is the test instance, $N_{x^*}$ denotes the K-nearest neighborhood of the test instance $x^*$ in the training dataset, and $I(\cdot)$ is the indicator function that returns 1
if its argument is true, otherwise 0. The majority voting scheme implies that every
neighbor has the same impact on the classification voting, which indeed makes the
K-NN classifier sensitive to the choice of K.
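A literal rendering of this voting rule in Python (a sketch only, using the indicator sum over candidate classes) might look like:

    import numpy as np

    def majority_vote(neighbor_labels):
        # y* = argmax_y sum_i I(y == y_i), taken over the K nearest neighbors
        neighbor_labels = np.asarray(neighbor_labels)
        classes = np.unique(neighbor_labels)
        counts = [np.sum(neighbor_labels == c) for c in classes]  # sum of indicators per class
        return classes[int(np.argmax(counts))]

    print(majority_vote(["pos", "neg", "pos"]))  # -> "pos"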
Slide 11: A solution to overcome the limitation of the majority voting scheme is to introduce a distance-based weight for each neighbor. Specifically, the key idea is to weight the influence of each nearest neighbor $x_i$ according to its distance to the test instance, $w_i = \frac{1}{d(x^*, x_i)^2}$: the smaller the distance to the test instance, the more influential the nearest neighbor is in the classification voting. Mathematically, the distance-weighted voting scheme can be written as follows,
$$y^* = \arg\max_{y} \sum_{(x_i, y_i) \in N_{x^*}} w_i \times I(y = y_i),$$
where $w_i = \frac{1}{d(x^*, x_i)^2}$.
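A possible Python sketch of this scheme (the distances and labels below are hypothetical, not the table from Slide 12) is:

    import numpy as np

    def weighted_vote(distances, labels):
        # Each neighbor votes for its own class with weight w_i = 1 / d(x*, x_i)^2
        distances = np.asarray(distances, dtype=float)
        labels = np.asarray(labels)
        weights = 1.0 / distances ** 2
        classes = np.unique(labels)
        # Total weighted vote received by each candidate class
        scores = {c: weights[labels == c].sum() for c in classes}
        return max(scores, key=scores.get)

    # Hypothetical 5 nearest neighbors: distances to the test instance and class labels
    print(weighted_vote([0.5, 1.0, 1.5, 2.0, 2.5], ["+", "-", "+", "+", "-"]))  # -> "+"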
Slide 12: Here is an example. Suppose we use the 5-NN classifier for a binary classification problem. For a test instance represented by the cross point in green, we have
retrieved 5 nearest neighbors whose distances to the test data and the corresponding
class labels are shown in the table. If we use the majority voting scheme, as there are
three positive instances out of the 5 nearest neighbors, we predict the label of the test
data as positive. However, if we use the distance-weighted voting scheme, what is the predicted label for the test data? This is a tutorial question.
Slide 13: There are some other issues in using the K-NN classifier to make predictions. An important one is the scaling issue. Different features may have quite different ranges of feature values. In this case, some feature(s) may dominate the
distance between instances. For example, suppose an instance (or a person) is represented by the following three features: height, weight and income. Note that height
of a person may vary from 1.5m to 1.8m, weight of a person may vary from 90lb
to 300lb, and income of a person may vary from 10K to 1M. Compared to height or
weight, values of income are of relatively large scale, which will dominate the distance
measure between two instances (or persons). To address this issue, one can perform
normalization on each feature to transform them into the same predefined range.
Slide 14: A typical normalization technique is known as min-max normalization.
Given a new range $[min_{new}, max_{new}]$, min-max normalization transforms the original value $v_{old}$ into the new range via
$$v_{new} = \frac{v_{old} - min_{old}}{max_{old} - min_{old}} (max_{new} - min_{new}) + min_{new},$$
where $v_{new}$ is the new value of $v_{old}$ in the new range $[min_{new}, max_{new}]$, and $max_{old}$ and $min_{old}$ are the maximum and the minimum of the original feature values, respectively, obtained from the training dataset.
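For instance, the following Python sketch (an illustration only, with hypothetical income values) maps a feature into the default range [0, 1]:

    import numpy as np

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Rescale values into [new_min, new_max] using the (training-set) min and max
        old_min, old_max = values.min(), values.max()
        return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

    income = np.array([10_000.0, 50_000.0, 250_000.0, 1_000_000.0])  # hypothetical incomes
    print(min_max_normalize(income))  # smallest value -> 0.0, largest value -> 1.0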
Slide 15: Another typical normalization technique is known as standardization, which
aims to transform the values such that after transformation their mean equals 0 and
standard deviation equals 1. This can be done via
$$v_{new} = \frac{v_{old} - \mu_{old}}{\sigma_{old}},$$
where $\mu_{old}$ and $\sigma_{old}$ are the mean and the standard deviation of the original values obtained from the training dataset.
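A corresponding Python sketch (again an illustration only, with hypothetical weight values in lb) is:

    import numpy as np

    def standardize(values):
        # Subtract the mean and divide by the standard deviation of the training values
        mu, sigma = values.mean(), values.std()
        return (values - mu) / sigma

    weight = np.array([90.0, 150.0, 200.0, 300.0])  # hypothetical weights in lb
    z = standardize(weight)
    print(round(z.mean(), 6), round(z.std(), 6))  # approximately 0.0 and 1.0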
Slide 16: In summary, the K-NN classifier is a lazy learner, which does not train a predictive model explicitly. Therefore, the “training” procedure is efficient. However, in return, classifying test instances is time-consuming.