Jelena Vuković, "K-means algorithm"

K-MEANS ALGORITHM
Jelena Vuković 53/07
jeca.zr@gmail.com
Introduction
• Basic idea of k-means algorithm
• Detailed explanation
• Most common problems of the algorithm
• Applications
• Possible improvements
Elektrotehnički fakultet u Beogradu
Basic principles of the algorithm
• Given a set of points (x1, x2, … , xn)
• Partition the n points into k sets (S1, S2, … , Sk), where k < n
• The goal is to minimize the within-cluster sum of squares
• µi is the mean of points in Si
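The goal above can be written out as a formula. This is the standard within-cluster sum of squares objective, stated with the symbols from the bullets (x, Si, µi, k); the exact notation is an assumption, since no formula appears in the slide text.

```latex
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```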
The algorithm
• Choose the number of means (k) and their initial positions
• Iterate:
1. Assign each point to the nearest mean
2. Move each mean to the center of its cluster
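The two iterated steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to draw the initial means from the input points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: the initial means are k of the input points, picked at random.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest mean (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means would not move
        labels = new_labels
        # Step 2: move each mean to the center of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
means, labels = kmeans(points, k=2)
print(sorted(means))  # [(0.5,), (10.5,)]
```

The loop stops as soon as an assignment step changes nothing, which is the usual convergence test; `max_iter` is only a safety limit.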
The algorithm
[Figures: assign points to nearest mean; move means]
The algorithm
• The complexity is O(n * k * I * d)
• n – number of points
• k – number of clusters
• I – number of iterations
• d – number of attributes
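To make the formula concrete: each iteration compares every one of the n points with each of the k means, and each comparison touches d attributes. The sketch below just multiplies the four factors for a made-up workload; the function name `kmeans_cost` and the sample numbers are hypothetical.

```python
def kmeans_cost(n, k, iterations, d):
    """Rough count of attribute-level operations: n * k * I * d."""
    return n * k * iterations * d


# Hypothetical workload: 10,000 points, 5 clusters, 20 iterations, 3 attributes.
print(kmeans_cost(10_000, 5, 20, 3))  # 3000000
```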
[Figure: re-assign points]
K nearest neighbors
• A related distance-based algorithm, but used for classification rather than clustering
• The decision is made by a simple majority vote among the k closest neighbors
• In k-means, the Euclidean distance measure is used
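The majority-vote rule can be sketched as follows. The function name `knn_classify` and the layout of `train` as a list of (point, label) pairs are assumptions for this example; Euclidean distance is used here as well.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Label `query` by majority vote among its k nearest labeled points.

    `train` is a list of (point, label) pairs (an assumed layout).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (0.2, 0.2), k=3))  # a
```

Unlike k-means, this needs labeled training points, and here k is the number of neighbors consulted rather than the number of clusters.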
Some limitations of the algorithm
• The number of clusters needs to be known in advance
• The result depends on the initial positions of the means
• Problems appear when clusters have different
  • Shapes
  • Sizes
  • Densities
Initial centroids problem
• Random distribution (the most common)
• Multiple runs
• Testing on a data sample
• Analyze the data
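The "multiple runs" option can be sketched like this: run k-means several times from different random starting means and keep the run with the smallest within-cluster sum of squares. The helper names (`kmeans`, `wcss`, `best_of_runs`) and the `runs` and `seed` parameters are assumptions for the illustration.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """One k-means run; initial means are k randomly chosen input points."""
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, means[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


def wcss(points, means, labels):
    """Within-cluster sum of squared distances for one clustering."""
    return sum(math.dist(p, means[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat k-means with different random starts; keep the lowest-WCSS run."""
    rng = random.Random(seed)
    results = (kmeans(points, k, rng) for _ in range(runs))
    return min(results, key=lambda r: wcss(points, *r))


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
means, labels = best_of_runs(points, k=2)
print(sorted(means), wcss(points, means, labels))  # [(0.5,), (10.5,)] 1.0
```

Because the runs are independent of each other, they could also be executed in parallel.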
Different density
[Figure panels: original points; 3 clusters]
Non-globular shapes
[Figure panels: original points; 2 clusters]
Pros and cons
Pros
• Simple to implement
• Fast
• Not highly demanding
Cons
• K needs to be known
• Ellipsoid shape is assumed
• Requires some knowledge about data in advance
• Possibility of many loop turns, without significant changes in clusters
Applications of the algorithm
• Many different uses
• Computer vision
• Market segmentation
• Geostatistics
• Astronomy
• etc.
Improvements
• Pre-process the data in order to better estimate k
• Run the algorithm multiple times in parallel, each with a different centroid initialization
• Ignore possible errors to avoid non-standard cluster shapes
Thank you!