
K-Nearest Neighbors (KNN)

About KNN
• In statistics, the k-nearest neighbors algorithm
(k-NN) is a non-parametric supervised learning
method first developed by Evelyn Fix and
Joseph Hodges in 1951, and later expanded by
Thomas Cover.
• It is used for classification and regression.
Definition
• In a single sentence, nearest neighbor classifiers classify unlabeled examples by assigning them the class of the most similar labeled examples.
• They have been used successfully for:
• Computer vision applications, including optical character
recognition and facial recognition in both still images and
video.
• Predicting whether a person will enjoy a movie that has been recommended to them (as in the Netflix challenge).
• Identifying patterns in genetic data, for use in detecting specific proteins or diseases.
The KNN Algorithm
• The kNN algorithm begins with a training dataset made up of examples that are
classified into several categories, as labeled by a nominal variable.
• Assume that we have a test dataset containing unlabeled examples that
otherwise have the same features as the training data.
• For each record in the test dataset, kNN identifies k records in the training data
that are the "nearest" in similarity, where k is an integer specified in advance.
• The unlabeled test instance is assigned the class of the majority of its k nearest neighbors.
Example (figure): a new point is classified by majority vote among its k = 5 nearest neighbors.
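A minimal sketch of this procedure in Python (not from the original slides; the function name knn_classify is my own, and it uses Euclidean distance, which is introduced in the next section):

```python
from collections import Counter
import math

def knn_classify(train_X, train_y, query, k):
    # Distance from the query to every labelled training example
    # (Euclidean distance; see the next section on calculating distance).
    distances = [(math.dist(query, x), label) for x, label in zip(train_X, train_y)]
    # Keep the k records that are "nearest" in similarity.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Assign the class held by the majority of the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: three labelled points, one unlabeled query, k = 3
print(knn_classify([(1, 1), (2, 2), (8, 8)], ["A", "A", "B"], (1.5, 1.5), k=3))  # -> A
```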
Calculating Distance
• There are many different ways to calculate distance.
• The most commonly known methods are Euclidean and Manhattan distance (for continuous variables) and Hamming distance (for categorical variables).
• Traditionally, the kNN algorithm uses Euclidean distance, which is the distance you would measure if you used a ruler to connect two points.
Euclidean distance between A and B

For two dimensions:

$$AB = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

For multiple (n) dimensions, with $A = (X_1, X_2, \ldots, X_n)$ and $B = (Y_1, Y_2, \ldots, Y_n)$:

$$AB = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$
Other distance methods
Manhattan Distance: the distance between real vectors measured as the sum of the absolute differences of their components.
Hamming Distance: used for categorical variables. If the value x and the value y are the same, the distance D is 0; otherwise D = 1.
Manhattan distance:

$$AB = \sum_{i=1}^{n} |X_i - Y_i|$$

Hamming distance:

$$D(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases}$$
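These three measures can be written directly as small Python functions; a minimal sketch (the function names are illustrative, not from the slides):

```python
def euclidean(a, b):
    # Square root of the summed squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of the absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Count of attribute positions whose categorical values differ (0 if same, 1 if not)
    return sum(x != y for x, y in zip(a, b))

print(euclidean((170, 57), (169, 58)))             # ~1.41
print(manhattan((170, 57), (169, 58)))             # 2
print(hamming(("blue", "hot"), ("blue", "mild")))  # 1
```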
Choosing the value of K
• There is no structured method for finding the best value of "K"; we find it by trying various values, treating part of the training data as if it were unknown.
• Smaller values of K are noisy: individual neighbours have a higher influence on the result.
• Larger values of K give smoother decision boundaries, which means lower variance but increased bias; they are also more computationally expensive.
• Another way to choose K is through cross-validation: predict the label for every instance in the validation set using K = 1, K = 2, K = 3, and so on, then look at which value of K gives the best performance on the validation set. We take that value as the final setting of our algorithm, so we are minimizing the validation error.
• In general practice, a common choice is k = sqrt(N), where N stands for the number of samples in your training dataset.
• Try to keep the value of k not divisible by the number of classes in the data set, in order to avoid ties where every class has the same number of nearest neighbours for a given data point.
Example (figure): with K = 3 and 3 classes, a query point can end up with one nearest neighbour from each class, producing a tie.
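A sketch of choosing K by cross-validation; the use of scikit-learn, the Iris dataset, and the candidate range 1 to 25 are assumptions for illustration, not part of the slides:

```python
import numpy as np
from sklearn.datasets import load_iris          # stand-in dataset for illustration
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Rule-of-thumb starting point: k close to sqrt(N)
print("sqrt(N) =", int(np.sqrt(len(X))))

# Pick the k with the best mean 5-fold cross-validation accuracy
best_k, best_score = None, 0.0
for k in range(1, 26):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print("best k:", best_k, "validation accuracy:", round(best_score, 3))
```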
SOLVED EXAMPLE BASED ON EUCLIDEAN DISTANCE:

Height (CM) | Weight (KG) | Class
167         | 51          | Underweight
182         | 62          | Normal
176         | 69          | Normal
173         | 64          | Normal
172         | 65          | Normal
174         | 56          | Underweight
169         | 58          | Normal
173         | 57          | Normal
170         | 55          | Normal
170         | 57          | ?
STEPS TO USE IN ORDER TO SOLVE:
Step 1: Find the Euclidean distance between the new tuple and each existing tuple in the data set.
Step 2: Set the value of k (e.g. using the cross-validation method) and find the k nearest neighbours.
Step 3: Check the classification of the nearest neighbours.
Step 4: Determine the majority class among the k neighbours.
Step 5: Assign the majority class value to the new data point.
Step 6: If there are more data points, repeat the steps.
Height (CM) | Weight (KG) | Class       | Distance | Rank
169         | 58          | Normal      | 1.4      | 1
170         | 55          | Normal      | 2        | 2
173         | 57          | Normal      | 3        | 3
174         | 56          | Underweight | 4.1      | 4
167         | 51          | Underweight | 6.7      | 5
173         | 64          | Normal      | 7.6      | 6
172         | 65          | Normal      | 8.2      | 7
182         | 62          | Normal      | 13       | 8
176         | 69          | Normal      | 13.4     | 9

New tuple: Height = 170, Weight = 57, Class = ?
FROM THE ABOVE TABLE WE SEE THAT, FOR DIFFERENT VALUES OF K (e.g. k = 3 or k = 5), THE MAJORITY CLASS AMONG THE NEAREST NEIGHBOURS IS NORMAL, SO THE NEW TUPLE (170, 57) IS CLASSIFIED AS NORMAL.
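The worked example can be checked with a short Python script (a sketch; the data is the table above, the variable names are mine):

```python
from collections import Counter

data = [
    ((167, 51), "Underweight"), ((182, 62), "Normal"), ((176, 69), "Normal"),
    ((173, 64), "Normal"),      ((172, 65), "Normal"), ((174, 56), "Underweight"),
    ((169, 58), "Normal"),      ((173, 57), "Normal"), ((170, 55), "Normal"),
]
query = (170, 57)  # the unlabelled tuple

# Step 1: Euclidean distance from the query to every labelled tuple, sorted (ranked)
ranked = sorted(
    (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, label) for x, label in data
)
for rank, (dist, label) in enumerate(ranked, start=1):
    print(rank, round(dist, 1), label)

# Steps 2-5: majority class among the k = 5 nearest neighbours
k = 5
print(Counter(label for _, label in ranked[:k]).most_common(1)[0][0])  # Normal
```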
SOLVED EXAMPLE USING HAMMING DISTANCE:
  | Pepper | Ginger | Chilly | Liked
A | True   | True   | True   | False
B | True   | False  | False  | True
C | False  | True   | True   | False
D | False  | True   | False  | True
E | True   | False  | False  | True
New Example – Q: pepper: false, ginger: true, chilly: true
STEPS TO USE TO SOLVE:
• But how do we calculate the distance for attributes with nominal or categorical values?
• Here we can use Hamming distance to find the distance between categorical values.
• Let x1 and x2 be the attribute values of two instances.
• Then, under Hamming distance, if the categorical values match (x1 is the same as x2), the distance is 0; otherwise it is 1.
• For example:
• If the value of x1 is blue and x2 is also blue, then the distance between x1 and x2 is 0.
• If the value of x1 is blue and x2 is red, then the distance between x1 and x2 is 1.
  | Pepper | Ginger | Chilly | Liked | Distance
A | True   | True   | True   | False | 1 + 0 + 0 = 1
B | True   | False  | False  | True  | 1 + 1 + 1 = 3
C | False  | True   | True   | False | 0 + 0 + 0 = 0
D | False  | True   | False  | True  | 0 + 0 + 1 = 1
E | True   | False  | False  | True  | 1 + 1 + 1 = 3
New Example – Q: pepper: false, ginger: true, chilly; true
  | Pepper | Ginger | Chilly | Liked | Distance      | 3NN
A | True   | True   | True   | False | 1 + 0 + 0 = 1 | 2
B | True   | False  | False  | True  | 1 + 1 + 1 = 3 | -
C | False  | True   | True   | False | 0 + 0 + 0 = 0 | 1
D | False  | True   | False  | True  | 0 + 0 + 1 = 1 | 2
E | True   | False  | False  | True  | 1 + 1 + 1 = 3 | -
SO WE SEE THAT, WITH K = 3, THE NEAREST NEIGHBOURS ARE C, A, AND D; THE MAJORITY VALUE OF "LIKED" AMONG THEM IS FALSE, SO THE NEW EXAMPLE WILL NOT BE LIKED.
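A short Python sketch reproducing this Hamming-distance example (data from the tables above; the variable names are illustrative):

```python
from collections import Counter

train = {
    "A": ((True,  True,  True),  False),
    "B": ((True,  False, False), True),
    "C": ((False, True,  True),  False),
    "D": ((False, True,  False), True),
    "E": ((True,  False, False), True),
}
query = (False, True, True)  # pepper: false, ginger: true, chilly: true

# Hamming distance: number of attribute positions where the values differ
ranked = sorted(
    (sum(a != b for a, b in zip(attrs, query)), name, liked)
    for name, (attrs, liked) in train.items()
)
for dist, name, liked in ranked:
    print(name, "distance:", dist, "liked:", liked)

# Majority vote on "Liked" among the 3 nearest neighbours: False -> not liked
k = 3
print(Counter(liked for _, _, liked in ranked[:k]).most_common(1)[0][0])
```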
Applications of the KNN Algorithm
• It is used in many different areas, such as:
• handwriting recognition
• image recognition
• video recognition, etc.
Advantages of the KNN Algorithm:
1. The algorithm is simple and easy to implement.
2. There’s no need to build a model, tune several
parameters, or make additional assumptions.
3. The algorithm is versatile. It can be used for
classification, regression, and searching.
Disadvantages of the KNN Algorithm:
• Can be cost-intensive when working with a large data set.
• A lot of memory is required for processing large data sets.
• Choosing the right value of K can be tricky.
• It is a lazy learning method.