
K-NN and K-Means

UNIVERSITY OF GUJRAT
HAFIZ HAYAT CAMPUS
DEPARTMENT OF INFORMATION TECHNOLOGY

Project No. 1

Name:                      Roll No:
Mubashir                   19011556-020
Ismail Ahmad Khan          19011556-022
Ehtisham Murtaza           19011556-030

Teacher:  Dr. Samina Naz
Section:  IT-19-A
K-NN Classification:
k-NN is a supervised learning method used for both classification and regression. We took two datasets from the UCI Machine Learning Repository, imported them into MATLAB, and applied the k-NN classification algorithm to each of them.
1. Haberman Dataset
This dataset was generated from a study of the survival of patients who had undergone surgery for breast cancer. It contains a total of 306 instances with the following attributes:
1) Age of patient
2) Year of operation
3) Number of positive axillary nodes
4) Survival status (class attribute):
   - Survived 5 years or longer
   - Survived less than 5 years
The Haberman dataset therefore has 2 classes and a total of 4 attributes. We applied the fine k-NN algorithm to it, taking different values of K and checking the resulting accuracy:
K = 3   ->  accuracy = 67.4%
K = 7   ->  accuracy = 72.3%
K = 10  ->  accuracy = 72.6%
K = 35  ->  accuracy = 74.6%
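As a rough sketch of how such a sweep could be reproduced in MATLAB (assuming haberman.data has been downloaded from UCI as a headerless CSV, and using fitcknn directly rather than the Classification Learner app; exact accuracies depend on the validation split):

% k-NN accuracy sweep on the Haberman data (minimal sketch).
M = readmatrix('haberman.data', 'FileType', 'text');
X = M(:, 1:3);      % age, year of operation, positive axillary nodes
y = M(:, 4);        % survival status (class attribute)

for K = [3 7 10 35]
    mdl   = fitcknn(X, y, 'NumNeighbors', K);
    cvmdl = crossval(mdl, 'KFold', 5);        % 5-fold cross-validation
    acc   = (1 - kfoldLoss(cvmdl)) * 100;     % accuracy in percent
    fprintf('K = %2d  ->  accuracy = %.1f%%\n', K, acc);
end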
2. Caesarian Section Classification Dataset
This dataset was contributed by M. Zain Amin. It was generated from research conducted to study whether or not a delivery required a caesarian section. The attributes are the following:
1) Age of the patient
2) Delivery number
3) Delivery time
4) Blood pressure
5) Heart problem
The dataset has a total of 80 observations and 5 attributes, plus the caesarian/non-caesarian class attribute.
We again took different values of K and got:
K = 3  ->  accuracy = 52.5%
K = 6  ->  accuracy = 67.5%
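The same workflow can be sketched with a hold-out split instead of cross-validation (assuming the UCI file has been converted to a headerless CSV named caesarian.csv; accuracies will vary with the random split):

% k-NN on the caesarian data with a 70/30 hold-out split (minimal sketch).
M = readmatrix('caesarian.csv');
X = M(:, 1:5);      % age, delivery number, delivery time,
                    % blood pressure, heart problem
y = M(:, 6);        % caesarian or not (class attribute)

cv     = cvpartition(y, 'HoldOut', 0.3);
Xtrain = X(training(cv), :);  ytrain = y(training(cv));
Xtest  = X(test(cv), :);      ytest  = y(test(cv));

for K = [3 6]
    mdl  = fitcknn(Xtrain, ytrain, 'NumNeighbors', K);
    pred = predict(mdl, Xtest);
    fprintf('K = %d  ->  accuracy = %.1f%%\n', K, mean(pred == ytest) * 100);
end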
K-Means Clustering:
K-means clustering is an unsupervised learning algorithm. Unlike in supervised learning, there is no labeled data for this clustering. K-means divides the objects into clusters such that objects in the same cluster share similarities and are dissimilar to the objects belonging to other clusters.
Working Mechanism
1. Choose the value of K.
2. Randomly select K data points to represent the initial cluster centroids.
3. Assign every other data point to its nearest cluster centroid.
4. Reposition each cluster centroid to the average of the points in its cluster.
5. Repeat step 3 and step 4 until there are no changes in the clusters (see the sketch after this list).
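A minimal MATLAB sketch of these steps (Lloyd's algorithm), assuming pdist2 from the Statistics and Machine Learning Toolbox is available; the function name simple_kmeans is our own, and empty clusters are ignored for simplicity:

function [idx, C] = simple_kmeans(X, K)
% Minimal K-means following the steps above. Save as simple_kmeans.m.
rng(1);                                   % reproducible random initialization
C   = X(randperm(size(X, 1), K), :);      % step 2: K random points as centroids
idx = zeros(size(X, 1), 1);
changed = true;
while changed                             % step 5: repeat until assignments settle
    [~, newIdx] = min(pdist2(X, C), [], 2);   % step 3: nearest centroid per point
    changed = any(newIdx ~= idx);
    idx = newIdx;
    for k = 1:K                           % step 4: move each centroid to its mean
        C(k, :) = mean(X(idx == k, :), 1);
    end
end
end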
The term 'K' is a number: you need to tell the system how many clusters to create. For example, K = 2 refers to two clusters. There are ways of finding the best or optimal value of K for given data; we use one of them, the elbow method, below.
1. K-Means Clustering on the Iris Species
In the Iris dataset we are given the following attributes:
- sepal length
- sepal width
- petal length
- petal width
- class:
  -- Setosa
  -- Versicolor
  -- Virginica
The dataset contains 150 instances, 50 of each of the 3 classes.
The next step is to find the number of clusters in the dataset, so that after finding the value of K we can build the clusters accordingly. We use the elbow curve method to find the number of clusters in the dataset.
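As a sketch, the elbow curve can be produced in MATLAB using its built-in copy of the Iris data (the fisheriris dataset, where meas holds the four measurements):

% Elbow curve: within-cluster sum of distances for K = 1..10.
load fisheriris                                % meas is 150x4
wcss = zeros(1, 10);
for k = 1:10
    [~, ~, sumd] = kmeans(meas, k, 'Replicates', 5);
    wcss(k) = sum(sumd);                       % total within-cluster distance
end
plot(1:10, wcss, '-o');
xlabel('Number of clusters K');
ylabel('Within-cluster sum of distances');
title('Elbow curve for the Iris data');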
After the analysis we get the value of K = 2, and we make 2 clusters of the dataset. If we change the value to K = 3, the clustering changes accordingly. This shows that we have to choose the value of K carefully when making the clusters.
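The resulting clusters can be visualized with a sketch like the following (plotting petal length against petal width is an arbitrary choice of axes):

% Cluster the Iris data with K = 3 and plot two of the four attributes.
load fisheriris
idx = kmeans(meas, 3, 'Replicates', 5);
gscatter(meas(:, 3), meas(:, 4), idx);         % color points by cluster
xlabel('Petal length (cm)');
ylabel('Petal width (cm)');
title('K-means clusters on the Iris data (K = 3)');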
Comparative Analysis of K-NN and K-Means:
K-means is an unsupervised learning clustering algorithm, while k-NN is a supervised learning classification algorithm. K-means creates classes out of unlabeled data, while k-NN classifies data into the available classes using labeled data.
k-NN has shown great utility in solving classification problems. However, selecting K can be complicated, and the method needs a large number of samples for precision. It requires no training phase; all the work is done during the testing phase.
Traditional k-NN is simple, effective, and non-parametric, and it is widely used for classification, but it may not be effective for large-scale databases or for data with many categories. Moreover, it uses all training samples for classification and prediction, which can become a problem for large-scale databases.