About KNN
• In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method, first developed by Evelyn Fix and Joseph Hodges in 1951 and later expanded by Thomas Cover.
• It is used for classification and regression.

Definition
In a single sentence, nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples.

They have been used successfully for:
• Computer vision applications, including optical character recognition and facial recognition in both still images and video.
• Predicting whether a person will enjoy a movie they have been recommended (as in the Netflix challenge).
• Identifying patterns in genetic data, for use in detecting specific proteins or diseases.

The KNN Algorithm
• The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable.
• Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data.
• For each record in the test dataset, kNN identifies the k records in the training data that are "nearest" in similarity, where k is an integer specified in advance.
• The unlabeled test instance is assigned the class of the majority of its k nearest neighbors (a short code sketch of this step is given at the end of this section).

(Figure: classifying a new point with k = 5; source: machinelearningknowledge.ai)

Example: Calculating Distance
• There are many different ways to calculate distance.
• The most commonly used methods are Euclidean and Manhattan distance (for continuous features) and Hamming distance (for categorical features).
• Traditionally, the kNN algorithm uses Euclidean distance, which is the distance you would measure if you could connect the two points with a ruler.

Euclidean distance between A and B
For two dimensions:
    AB = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}
For n dimensions, with A = (X_1, X_2, ..., X_n) and B = (Y_1, Y_2, ..., Y_n):
    AB = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}

Other distance methods
• Manhattan distance: the distance between real vectors, computed as the sum of the absolute differences of their components:
    AB = \sum_{i=1}^{n} |X_i - Y_i|
• Hamming distance: used for categorical variables. If the value x and the value y of an attribute are the same, the distance D for that attribute is 0; otherwise D = 1. The per-attribute distances are summed:
    D(A, B) = \sum_{i=1}^{n} [X_i \neq Y_i]

Choosing the Value of K
• There is no structured method to find the best value of k. We have to try various values by trial and error, treating part of the training data as if it were unknown.
• Smaller values of k are noisier: each individual neighbor has a larger influence on the result.
• Larger values of k give smoother decision boundaries, which means lower variance but increased bias; they are also more computationally expensive.
• Another way to choose k is through cross-validation: predict the label of every instance in a validation set with k = 1, k = 2, k = 3, and so on, see which value of k gives the best performance on the validation set, and use that value as the final setting of the algorithm, so that the validation error is minimized (a sketch of this approach appears below).
• In general practice, a common choice is k = sqrt(N), where N is the number of samples in the training dataset.
• Try to keep k not divisible by the number of classes in the dataset, to avoid ties in which every class has the same number of nearest neighbors for a given data point (for example, k = 3 with 3 classes can give exactly one neighbor per class).
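The three distance measures above can be written as short Python helpers. This is a minimal sketch for illustration; the function names are chosen here and are not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    # square root of the sum of squared differences across all dimensions
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # sum of the absolute differences of the components
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # count of attributes whose categorical values differ
    return sum(1 for x, y in zip(a, b) if x != y)
```

For example, euclidean_distance((170, 57), (169, 58)) is about 1.41, which matches the first row of the worked example below.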
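The whole kNN classification step (compute distances, pick the k nearest neighbors, take a majority vote) fits in a few lines. This is a sketch of the generic algorithm under the Euclidean distance, not a reference implementation; knn_predict and its parameter names are illustrative.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    # distance from the query point to every labeled training example
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # keep the k nearest neighbors and take a majority vote over their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```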
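One possible way to carry out the cross-validation search for k described above, assuming scikit-learn is available; choose_k, max_k, and cv are names and defaults chosen here for illustration, not part of any standard API.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, max_k=25, cv=5):
    # try odd values of k only, which helps avoid ties in two-class problems
    best_k, best_score = 1, -np.inf
    for k in range(1, max_k + 1, 2):
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

The k = sqrt(N) rule of thumb can then serve as a sanity check on the value the search returns.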
Solved Example Based on Euclidean Distance

Training data:

    Height (cm)   Weight (kg)   Class
    167           51            Underweight
    182           62            Normal
    176           69            Normal
    173           64            Normal
    172           65            Normal
    174           56            Underweight
    169           58            Normal
    173           57            Normal
    170           55            Normal
    170           57            ?

Steps to solve:
Step 1: Find the Euclidean distance between the new tuple and each existing tuple in the dataset.
Step 2: Set the value of k (for example, by cross-validation) and find the nearest neighbors.
Step 3: Check the classification of the nearest neighbors.
Step 4: Determine the majority class among the k neighbors.
Step 5: Assign the majority class to the new data point.
Step 6: If there are more data points, repeat the steps.

Distances from the new tuple (170, 57), ranked:

    Height (cm)   Weight (kg)   Class         Distance   Rank
    169           58            Normal        1.4        1
    170           55            Normal        2.0        2
    173           57            Normal        3.0        3
    174           56            Underweight   4.1        4
    167           51            Underweight   6.7        5
    173           64            Normal        7.6        6
    172           65            Normal        8.2        7
    182           62            Normal        13.0       8
    176           69            Normal        13.4       9

From the table we see that, for different values of k (for example k = 3 or k = 5), the majority class among the nearest neighbors is Normal, so the new tuple is classified as Normal. (A runnable sketch of this calculation appears at the end of this section.)

Solved Example Using Hamming Distance

Training data:

         Pepper   Ginger   Chilly   Liked
    A    True     True     True     False
    B    True     False    False    True
    C    False    True     True     False
    D    False    True     False    True
    E    True     False    False    True

New example Q: pepper = False, ginger = True, chilly = True.

How to calculate the distance for attributes with nominal or categorical values:
• We can use the Hamming distance to measure the distance between categorical values.
• Let x1 and x2 be the values of an attribute in two instances.
• Under the Hamming distance, if the categorical values match (x1 is the same as x2), the distance is 0; otherwise it is 1.
• For example, if x1 is blue and x2 is also blue, the distance between them is 0; if x1 is blue and x2 is red, the distance is 1.

Hamming distances from Q, with the three nearest neighbors ranked:

         Pepper   Ginger   Chilly   Liked   Distance    3-NN rank
    A    True     True     True     False   1+0+0 = 1   2
    B    True     False    False    True    1+1+1 = 3   -
    C    False    True     True     False   0+0+0 = 0   1
    D    False    True     False    True    0+0+1 = 1   2
    E    True     False    False    True    1+1+1 = 3   -

The three nearest neighbors are C (Liked = False), A (Liked = False), and D (Liked = True), so by majority vote the new example Q is predicted not to be liked. (A runnable sketch of this example also appears at the end of this section.)

Applications of the KNN Algorithm
It is used in many different areas, such as:
• handwriting detection
• image recognition
• video recognition

Advantages of the KNN Algorithm:
1. The algorithm is simple and easy to implement.
2. There is no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile: it can be used for classification, regression, and search.

Disadvantages of the KNN Algorithm:
• It can be computationally expensive when working with a large dataset.
• A lot of memory is required for processing large datasets.
• Choosing the right value of k can be tricky.
• It is a lazy learning method: all of the work is deferred to prediction time.
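As a check on the Euclidean-distance example above, the following sketch recomputes the distance table and the majority vote for k = 3 and k = 5. The data values are copied from the table; the code layout itself is only illustrative.

```python
import math
from collections import Counter

# (height_cm, weight_kg) -> class, copied from the worked example
train = [
    ((167, 51), "Underweight"), ((182, 62), "Normal"), ((176, 69), "Normal"),
    ((173, 64), "Normal"), ((172, 65), "Normal"), ((174, 56), "Underweight"),
    ((169, 58), "Normal"), ((173, 57), "Normal"), ((170, 55), "Normal"),
]
query = (170, 57)  # the unlabeled tuple

# Step 1: Euclidean distance to every training tuple, ranked by distance
ranked = sorted((math.dist(x, query), label) for x, label in train)
for rank, (dist, label) in enumerate(ranked, start=1):
    print(f"{rank}  {dist:4.1f}  {label}")

# Steps 2-5: majority class among the k nearest neighbors
for k in (3, 5):
    majority = Counter(label for _, label in ranked[:k]).most_common(1)[0][0]
    print(f"k={k}: {majority}")   # Normal in both cases
```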
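A short sketch of the Hamming-distance example above. The dictionary layout and variable names are just one way to encode the table; they are not from any particular library.

```python
from collections import Counter

# (pepper, ginger, chilly) -> liked, copied from the example
train = {
    "A": ((True, True, True), False),
    "B": ((True, False, False), True),
    "C": ((False, True, True), False),
    "D": ((False, True, False), True),
    "E": ((True, False, False), True),
}
query = (False, True, True)  # Q: pepper = False, ginger = True, chilly = True

def hamming(a, b):
    # 0 for every matching attribute, 1 for every mismatch, summed over attributes
    return sum(x != y for x, y in zip(a, b))

# rank the training examples by Hamming distance to Q
ranked = sorted(train.items(), key=lambda item: hamming(item[1][0], query))
for name, (attrs, liked) in ranked:
    print(name, hamming(attrs, query), liked)

# 3-NN majority vote over the "Liked" labels: C (0), A (1), D (1) -> False, False, True
prediction = Counter(liked for _, (_, liked) in ranked[:3]).most_common(1)[0][0]
print("Q predicted liked:", prediction)   # False, i.e. not liked
```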