Uploaded by Shubhro saha

K-nearest neighbourhood algorithm
Questions:
1. How many neighbours should we consider?
2. How do we measure distance?
3. How do we combine the information from more than one observation?
4. Should all points be weighted equally?
5. Should some points have more influence?
Distance Function
$D_{\text{Euclidean}}(X, Y) = \sqrt{\sum_i (X_i - Y_i)^2}$

Patient   Age   Sodium/Potassium ratio
A         20    12
B         30    8

Distance(A, B) = $\sqrt{(20 - 30)^2 + (12 - 8)^2} = \sqrt{116} \approx 10.77$
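The distance calculation above can be sketched in a few lines of Python. The function name `euclidean` and the tuple layout (age, Na/K ratio) are illustrative choices, not something the notes prescribe.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Patients A and B from the table above, as (age, Na/K ratio)
patient_a = (20, 12)
patient_b = (30, 8)

print(round(euclidean(patient_a, patient_b), 2))  # 10.77
```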
Need for Normalization of Variables
Min-Max Normalization
$X^* = \dfrac{X - \min(X)}{\text{range}(X)}$
Z-score Normalization
$X^* = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$
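A minimal sketch of the two normalizations follows. The notes do not state the minimum, range, mean or standard deviation of age; the constants below are assumptions, chosen only because they reproduce the Age(mmn) and Age(Z) values quoted for ages 50 and 20 in these notes.

```python
def min_max(x, x_min, x_range):
    """Min-max normalization: (x - min) / range."""
    return (x - x_min) / x_range

def z_score(x, mean, sd):
    """Z-score normalization: (x - mean) / standard deviation."""
    return (x - mean) / sd

# Assumed summary statistics for age (hypothetical, not given in the notes)
AGE_MIN, AGE_RANGE = 10, 50
AGE_MEAN, AGE_SD = 45, 15

for age in (50, 20):
    mmn = round(min_max(age, AGE_MIN, AGE_RANGE), 2)
    z = round(z_score(age, AGE_MEAN, AGE_SD), 2)
    print(age, mmn, z)
```

Under these assumed statistics, age 50 maps to 0.8 (min-max) and 0.33 (Z-score), and age 20 maps to 0.2 and -1.67.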
We use the 'Similar' or 'Different' concept for categorical (binary) variables:
$\text{Different}(X_i, Y_i) = \begin{cases} 0 & \text{if } X_i = Y_i \\ 1 & \text{otherwise} \end{cases}$
Patient   Age   Age(MMN)   Age(Z)   Gender
A         50    0.8         0.33    M
B         20    0.2        -1.67    M
C         50    0.8         0.33    F

Distance without Normalization
d(A, B) = $\sqrt{(50 - 20)^2 + 0^2} = 30$
d(A, C) = $\sqrt{(50 - 50)^2 + 1^2} = 1$
How can we interpret these?
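Distance with Min-Max Normalization
d(A, B) = $\sqrt{(0.8 - 0.2)^2 + 0^2} = 0.6$
d(A, C) = $\sqrt{(0.8 - 0.8)^2 + 1^2} = 1.0$
Distance with Z-Score
d(A, B) = $\sqrt{(0.33 - (-1.67))^2 + 0^2} = 2.0$
d(A, C) = $\sqrt{(0.33 - 0.33)^2 + 1^2} = 1.0$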
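The three sets of distances can be checked with a short sketch that combines the numeric age difference with the 0/1 "Different" function for gender. The names `different`, `mixed_distance` and the dictionary layout are illustrative; the patient values are copied from the table in these notes.

```python
import math

def different(a, b):
    """Return 0 when two categorical values match, 1 otherwise."""
    return 0 if a == b else 1

def mixed_distance(age_a, gender_a, age_b, gender_b):
    """Euclidean distance over one numeric field (age) and one categorical field (gender)."""
    return math.sqrt((age_a - age_b) ** 2 + different(gender_a, gender_b) ** 2)

# Age on three scales (raw, min-max, Z-score) plus gender, per patient
patients = {
    "A": {"raw": 50, "mmn": 0.8, "z": 0.33, "gender": "M"},
    "B": {"raw": 20, "mmn": 0.2, "z": -1.67, "gender": "M"},
    "C": {"raw": 50, "mmn": 0.8, "z": 0.33, "gender": "F"},
}

def distance(p, q, scale):
    """Distance between patients p and q using age on the given scale."""
    a, b = patients[p], patients[q]
    return mixed_distance(a[scale], a["gender"], b[scale], b["gender"])

for scale in ("raw", "mmn", "z"):
    print(scale, round(distance("A", "B", scale), 2), round(distance("A", "C", scale), 2))
```

On the raw scale B looks 30 times farther from A than C does; after min-max normalization C (different gender) is the farther one, while with Z-scores B is again farther.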
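Min-max normalized values lie between 0 and 1.
Z-scores usually take values between -3 and +3.
In the medical situation here, the MMN score may be preferable to the Z-score, because it keeps the age difference on the same 0-to-1 scale as the binary gender difference.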
Combination Functions
1. Simple Unweighted Voting
Here each record gets one vote.
If K = 5:
For a new patient, identify the 5 closest neighbours.
If 2 get drug X and 3 get drug Y, the prediction will be Y.
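As a small illustration of one record, one vote, the K = 5 case above can be written with a counter. The function name `majority_vote` and the list of labels are illustrative; only the 2-versus-3 split comes from the notes.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label that occurs most often among the neighbours."""
    return Counter(labels).most_common(1)[0][0]

# Drugs taken by the 5 closest neighbours: 2 on X, 3 on Y
neighbour_drugs = ["X", "X", "Y", "Y", "Y"]

print(majority_vote(neighbour_drugs))  # Y
```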
Should closer neighbours be weighted more heavily?
Apply weighted voting where closer neighbours have a larger voice.
Age and Sodium Table for 4 Patients
Patient   Drug   Age    Na/K   Age(MMN)   Na/K(MMN)
New       ?      17     12.5   0.05       0.25
A         X      16.8   12.4   0.0467     0.2471
B         Y      17.2   10.5   0.0533     0.1912
C         Y      19.5   13.5   0.0917     0.2794

Let the 3 nearest neighbours of New be A, B and C.
One record votes for X.
Two records vote for Y.
Thus, by unweighted voting, New would be given drug Y.
But consider the closeness of the neighbours:
d(new, A) = $\sqrt{(0.05 - 0.0467)^2 + (0.25 - 0.2471)^2} = 0.004393$
Similarly, d(new, B) = 0.058893
d(new, C) = 0.051022
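Now weight each vote by closeness, using the inverse square of the distance:
Vote weight for X = $\dfrac{1}{d(\text{new}, A)^2} = \dfrac{1}{0.004393^2} \approx 51818$
Vote weight for Y = $\dfrac{1}{d(\text{new}, B)^2} + \dfrac{1}{d(\text{new}, C)^2} = \dfrac{1}{0.058893^2} + \dfrac{1}{0.051022^2} \approx 672$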
Thus the weighted voting procedure would choose drug X for the new patient.
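A minimal sketch of inverse-square-distance voting on the normalized values from the table is given below. The function names are illustrative. Because the code uses unrounded distances, the weight for X comes out near 51,800 rather than exactly the hand-computed figure, while the weight for Y is again about 672.

```python
import math
from collections import defaultdict

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def weighted_votes(query, neighbours):
    """Sum 1/d^2 per label; assumes no neighbour coincides exactly with the query."""
    votes = defaultdict(float)
    for features, label in neighbours:
        votes[label] += 1.0 / euclidean(query, features) ** 2
    return dict(votes)

# (Age(MMN), Na/K(MMN)) and drug for each neighbour, from the table
new_patient = (0.05, 0.25)
neighbours = [
    ((0.0467, 0.2471), "X"),  # A
    ((0.0533, 0.1912), "Y"),  # B
    ((0.0917, 0.2794), "Y"),  # C
]

votes = weighted_votes(new_patient, neighbours)
winner = max(votes, key=votes.get)
print(votes, winner)
```

The single very close neighbour A outweighs B and C combined, so the winner is X.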
Estimation and Prediction of Continuous Variable using K-nearest Neighbourhood
How can we estimate or predict a continuous variable?
Method – Locally weighted averaging.
Let us predict BP, which is continuous.
Patient   Age    Na/K   BP    Age(MMN)   Na/K(MMN)   Dist
New       17     12.5   ?     0.05       0.25        -
A         16.8   12.4   120   0.0467     0.2471      0.009305
B         17.2   10.5   122   0.0533     0.1912      0.17643
C         19.5   13.5   130   0.0917     0.2794      0.09756
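The distances used for this BP example are not the plain Euclidean distances from the voting example. One reading that reproduces 0.009305 for A, 0.17643 for B and the 0.09756 used for C in the worked calculation that follows is that the Na/K(MMN) difference is multiplied by 3 before squaring. This stretch factor is an assumption inferred from the numbers; the notes do not state it.

```python
import math

NA_K_STRETCH = 3  # assumed stretch factor, not stated in the notes

def stretched_distance(x, y, stretch=NA_K_STRETCH):
    """Euclidean distance on (Age(MMN), Na/K(MMN)) with the Na/K axis stretched."""
    d_age = x[0] - y[0]
    d_nak = stretch * (x[1] - y[1])
    return math.sqrt(d_age ** 2 + d_nak ** 2)

new_patient = (0.05, 0.25)
records = {
    "A": (0.0467, 0.2471),
    "B": (0.0533, 0.1912),
    "C": (0.0917, 0.2794),
}

for name, features in records.items():
    print(name, round(stretched_distance(new_patient, features), 5))
```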
Locally weighted average of BP for the K = 3 nearest neighbours,
using the inverse square of the distances as weights:
$\hat{Y}_{\text{new}} = \dfrac{\sum_i W_i Y_i}{\sum_i W_i}$, where $W_i = \dfrac{1}{d(\text{new}, X_i)^2}$ for $i = 1, \dots, K$,
the $X_i$ are the input vectors of the neighbours and the $Y_i$ are their known target values (here, BP).
Then
$\hat{Y}_{\text{new}} = \dfrac{120/0.009305^2 + 122/0.17643^2 + 130/0.09756^2}{1/0.009305^2 + 1/0.17643^2 + 1/0.09756^2} = 120.0954 \approx 120$
As expected, the estimated BP value is quite close to that of patient A, the nearest neighbour.
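The locally weighted average can be sketched as follows, taking the three distances exactly as they appear in the worked calculation. The function name `locally_weighted_average` is illustrative.

```python
def locally_weighted_average(distances, targets):
    """Average of targets weighted by 1/d^2; assumes every distance is non-zero."""
    weights = [1.0 / d ** 2 for d in distances]
    return sum(w * y for w, y in zip(weights, targets)) / sum(weights)

# Distances from New to A, B, C and the neighbours' BP values
distances = [0.009305, 0.17643, 0.09756]
bp_values = [120, 122, 130]

estimate = locally_weighted_average(distances, bp_values)
print(round(estimate, 2))  # about 120.1
```

Because A's weight is roughly a hundred times larger than C's and several hundred times larger than B's, the estimate stays within about 0.1 of A's BP.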
Can avoid regression!
The model can take care of non-linearity!