K-Nearest Neighbour Algorithm

Questions:
1. How many neighbours should we consider?
2. How do we measure distance?
3. How do we combine the information from more than one observation?
4. Should all points be weighted equally?
5. Should some points have more influence?

Distance Function

Euclidean distance between two records x and y:

$$D_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

Example:

Patient   Age   Na/K (sodium/potassium ratio)
A         20    12
B         30    8

$$\text{Distance}(A, B) = \sqrt{(20 - 30)^2 + (12 - 8)^2} = \sqrt{116} \approx 10.77$$

Need for Normalization of Variables

Min-max normalization (MMN):

$$X^* = \frac{X - \min(X)}{\text{range}(X)}$$

Z-score normalization:

$$X^* = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$

For categorical (binary) variables we use a 'similar' or 'different' concept:

$$\text{Different}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$$

Patient   Age   Age (MMN)   Age (Z)   Gender
A         50    0.8         0.33      M
B         20    0.2         -1.67     M
C         50    0.8         0.33      F

Distance without normalization:
$$d(A, B) = \sqrt{(50 - 20)^2 + 0^2} = 30$$
$$d(A, C) = \sqrt{(50 - 50)^2 + 1^2} = 1$$

How can we interpret these? Without normalization the age difference dominates the distance, simply because age is measured on a much larger scale than the 0/1 gender indicator.

Distance with min-max normalization:
d(A, B) = 0.6
d(A, C) = 1.0

Distance with Z-score normalization:
d(A, B) = 2.0
d(A, C) = 1.0

Min-max normalized values lie between 0 and 1. Z-scores usually take values between -3 and +3. With MMN, B is the nearer neighbour of A; with Z-scores, C is. In a medical situation such as this one, MMN may be preferable to the Z-score.

Combination Functions

1. Simple Unweighted Voting

One record, one vote. Suppose K = 5: for a new patient, identify the 5 closest neighbours. If 2 of them received drug X and 3 received drug Y, the prediction is Y.

Should closer neighbours be weighted more heavily? If so, apply weighted voting, in which closer neighbours have a larger voice.

Age and Na/K table for 4 patients:

Patient   Drug   Age    Na/K   Age (MMN)
New       ?      17     12.5   0.05
A         X      16.8   12.4   0.0467
B         Y      17.2   10.5   0.0533
C         Y      19.5   13.5   0.0917

Let the 3 nearest neighbours of New be A, B and C. One record votes for X and two records vote for Y, so under simple voting New would be assigned drug Y. This count, however, ignores how close each neighbour is.
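The distance, normalization and simple-voting steps above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the age minimum (10), maximum (60), mean (45) and standard deviation (15) are assumptions chosen because they reproduce the Age (MMN) and Age (Z) columns of the table; the notes do not state them.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def min_max(value, lo, hi):
    """Min-max normalization: (X - min) / range."""
    return (value - lo) / (hi - lo)


def z_score(value, mean, sd):
    """Z-score normalization: (X - mean) / SD."""
    return (value - mean) / sd


def different(a, b):
    """0 if the two categorical values match, 1 otherwise."""
    return 0 if a == b else 1


def age_gender_distance(p, q):
    """Distance between two (age_value, gender) records."""
    return math.sqrt((p[0] - q[0]) ** 2 + different(p[1], q[1]) ** 2)


def simple_vote(labels):
    """Unweighted voting: the most frequent label among the neighbours."""
    return Counter(labels).most_common(1)[0][0]


# Age / Na/K example: A = (20, 12), B = (30, 8); about 10.77
d_ab = euclidean((20, 12), (30, 8))

# ASSUMPTION: these summary statistics are not given in the notes; they are
# picked only because they reproduce the normalized values in the table.
AGE_MIN, AGE_MAX, AGE_MEAN, AGE_SD = 10, 60, 45, 15

patients = {"A": (50, "M"), "B": (20, "M"), "C": (50, "F")}
mmn = {k: (min_max(age, AGE_MIN, AGE_MAX), g) for k, (age, g) in patients.items()}
zsc = {k: (z_score(age, AGE_MEAN, AGE_SD), g) for k, (age, g) in patients.items()}

for name, table in (("raw", patients), ("min-max", mmn), ("z-score", zsc)):
    d_to_b = age_gender_distance(table["A"], table["B"])
    d_to_c = age_gender_distance(table["A"], table["C"])
    print(f"{name}: d(A,B)={d_to_b:.2f}  d(A,C)={d_to_c:.2f}")

# K = 5 neighbours, 2 on drug X and 3 on drug Y
print(simple_vote(["X", "X", "Y", "Y", "Y"]))
```

Keeping the normalization parameters as explicit arguments makes it easy to see how the choice of scaling changes which neighbour of A counts as nearest.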
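Na/K (MMN) values for the same four patients:

Patient   Na/K (MMN)
New       0.25
A         0.2471
B         0.1912
C         0.2794

Distances from New, computed on the MMN values of Age and Na/K:

$$d(\text{new}, A) = \sqrt{(0.05 - 0.0467)^2 + (0.25 - 0.2471)^2} = 0.004393$$

Similarly,
d(new, B) = 0.058893
d(new, C) = 0.051022

Now weight each vote by closeness, using the inverse square of the distance.

Vote weight for X:

$$\frac{1}{d(\text{new}, A)^2} = \frac{1}{0.004393^2} \approx 51818$$

Vote weight for Y:

$$\frac{1}{d(\text{new}, B)^2} + \frac{1}{d(\text{new}, C)^2} = \frac{1}{0.058893^2} + \frac{1}{0.051022^2} \approx 672$$

Thus the weighted voting procedure would choose drug X for the new patient.

Estimation and Prediction of a Continuous Variable using K-Nearest Neighbours

How can we estimate or predict a continuous variable? Method: locally weighted averaging. Let us predict BP, which is continuous.

Patient   Age    Na/K   BP    Age (MMN)   Na/K (MMN)   Dist
New       17     12.5   ?     0.05        0.25         -
A         16.8   12.4   120   0.0467      0.2471       0.009305
B         17.2   10.5   122   0.0533      0.1912       0.17643
C         19.5   13.5   130   0.0917      0.2794       0.09756

The distances here are larger than in the voting example. They are consistent with the Na/K (MMN) differences being multiplied by 3 before the Euclidean distance is taken, that is, with the Na/K axis being stretched to give it more influence.

Locally weighted average of BP for the K = 3 nearest neighbours, using the inverse square of the distances as weights:

$$\hat{y}_{\text{new}} = \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad w_i = \frac{1}{d(\text{new}, x_i)^2}, \quad i = 1, \dots, K$$

where the x_i are the input vectors of the neighbours and the y_i are their BP values. Then

$$\hat{y}_{\text{new}} = \frac{\dfrac{120}{0.009305^2} + \dfrac{122}{0.17643^2} + \dfrac{130}{0.09756^2}}{\dfrac{1}{0.009305^2} + \dfrac{1}{0.17643^2} + \dfrac{1}{0.09756^2}} = 120.0954 \approx 120$$

As expected, the estimated BP is very close to that of patient A, by far the nearest neighbour.

Regression can be avoided!
The model can take care of non-linearity!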
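The weighted vote and the locally weighted average above can be checked with a short Python sketch. The function names are hypothetical, and the code assumes that no neighbour coincides exactly with the new record (a zero distance would make the inverse-square weight undefined). The BP estimate uses the three distances as they appear in the worked calculation.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def weighted_vote(query, neighbours):
    """Inverse-square-distance voting.

    neighbours is a list of (feature_vector, label) pairs.
    Returns the winning label and the total weight per label.
    ASSUMPTION: every distance is strictly positive.
    """
    totals = defaultdict(float)
    for features, label in neighbours:
        d = euclidean(query, features)
        totals[label] += 1.0 / d ** 2
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


def locally_weighted_average(distances, targets):
    """Average of the targets weighted by 1 / distance^2."""
    weights = [1.0 / d ** 2 for d in distances]
    return sum(w * y for w, y in zip(weights, targets)) / sum(weights)


# (Age MMN, Na/K MMN) for the new patient and its three neighbours
new = (0.05, 0.25)
neighbours = [
    ((0.0467, 0.2471), "X"),  # patient A
    ((0.0533, 0.1912), "Y"),  # patient B
    ((0.0917, 0.2794), "Y"),  # patient C
]

winner, totals = weighted_vote(new, neighbours)
print(winner, {label: round(w) for label, w in totals.items()})

# BP estimate from the distances and BP values in the table
bp_hat = locally_weighted_average([0.009305, 0.17643, 0.09756], [120, 122, 130])
print(round(bp_hat, 4))
```

Because the weights grow as the inverse square of the distance, a single very close neighbour (patient A here) can outweigh several more distant ones, both in the vote and in the average.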