
Clustering Methods

Introduction to Clustering
• Clustering is the use of unsupervised techniques for
grouping similar objects. In machine learning, unsupervised
refers to the problem of finding hidden structure within
unlabeled data.
• Clustering techniques are unsupervised in the sense that
the data scientist does not determine, in advance, the labels
to apply to the clusters. The structure of the data describes
the objects of interest and determines how best to group
the objects.
• Suppose variables such as age, years of education, and
annual purchase expenditures were considered along with
the personal income variable. What are the naturally
occurring groupings of customers? Clustering tries to
answer this question.
Applications of Clustering
• Clustering is primarily an exploratory technique to discover
hidden structure in the data, possibly as a prelude to more
focused analysis or decision processes. Some specific
applications of clustering are image processing, medicine, and
customer segmentation.
• Successive frames of security video images can be examined
to identify any changes to the clusters. These newly
identified clusters may indicate unauthorized access to a
facility.
• Patient attributes such as age, height, weight, systolic and
diastolic blood pressures, cholesterol level, and other
attributes can identify naturally occurring clusters. These
clusters could be used to target individuals for specific
preventive measures or clinical trial participation. Clustering,
in general, is useful in biology for the classification of plants
and animals as well as in the field of human genetics.
K-Means Clustering
• Given a collection of objects, k-means identifies k clusters of
objects based on each object’s proximity to the centers of the
k groups. The center of each cluster is determined as the
arithmetic average (mean) of the n-dimensional attribute
vectors of the objects in that cluster.
Method step by step
• In this chapter, to find k clusters from a collection of M
objects with n attributes, the two-dimensional case (n=2) is
examined. It is much easier to visualize the k-means method
in two dimensions.
• Because each object has two attributes, it is useful to
consider each object as corresponding to the point (x, y),
where x and y denote the two attributes. For a given cluster
of m points, the point that corresponds to the cluster’s
mean is called a centroid. In mathematics, a centroid refers
to a point that corresponds to the center of mass for an
object.
• The k-means algorithm to find k clusters can be described in
the following three steps.
Selection of Initial Centroids
• Choose the value of k and the k initial guesses for the
centroids. In this example, k = 3 and the initial centroids are
indicated by the points shaded in red, green, and blue.
Assignment of Samples
• Compute the distance from each data point (x, y) to each
centroid. Assign each point to the closest centroid. This
association defines the first k clusters. In two dimensions, the
distance d between any two points (x1, y1) and (x2, y2) in
the Cartesian plane is typically expressed using the
Euclidean distance measure, shown below.
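
    d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}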
Re-computation of Centroids
• Compute the centroid, the center of mass, of each newly
defined cluster. In two dimensions, the centroid (xc, yc) of
the m points in a k-means cluster is calculated as follows.
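
    (x_c, y_c) = \left( \frac{1}{m}\sum_{i=1}^{m} x_i,\; \frac{1}{m}\sum_{i=1}^{m} y_i \right)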
• The assignment and re-computation steps are repeated until
the algorithm converges.
• Convergence is reached when the computed centroids do not
change, or when the centroids and the assigned points oscillate
back and forth from one iteration to the next. The latter case
can occur when one or more points are equidistant from more
than one computed centroid.
k-means Clustering Algorithm
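• A minimal sketch of the algorithm in Python with NumPy (the
function name, random initialization, and stopping test are
illustrative choices; empty clusters are not handled):

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        # Step 1: choose k initial centroids (here, k distinct data points).
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each point to its closest centroid (Euclidean distance).
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([points[labels == j].mean(axis=0)
                                      for j in range(k)])
            # Repeat steps 2 and 3 until the centroids stop changing.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels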
k-means Clustering Algorithm
(Figure: trajectories for the centroids of the k-means clustering
procedure applied to two-dimensional data.)
Fuzzy k-means (or c-means) Clustering Algorithm
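• Fuzzy k-means relaxes the hard assignment of k-means: each
point receives a membership weight in every cluster, and each
centroid is a membership-weighted mean of all points. A minimal
sketch of the standard fuzzy c-means updates in Python (the
fuzzifier m = 2, the function name, and the stopping rule are
illustrative choices, not taken from the slides):

    import numpy as np

    def fuzzy_cmeans(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        # Random initial membership matrix U; each row sums to 1.
        u = rng.random((len(points), c))
        u /= u.sum(axis=1, keepdims=True)
        for _ in range(max_iter):
            um = u ** m
            # Centroids are membership-weighted means of all points.
            centroids = (um.T @ points) / um.sum(axis=0)[:, None]
            # Distance from every point to every centroid.
            d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            d = np.fmax(d, 1e-10)  # guard against division by zero
            # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1)).
            u_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
            if np.abs(u_new - u).max() < tol:
                u = u_new
                break
            u = u_new
        return centroids, u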
Units of Measure
• From a computational perspective, the k-means algorithm
depends on the units of measure for a given attribute (for
example, meters or centimeters for a patient’s height). The
algorithm will identify different clusters depending on the
choice of the units of measure.
• When the height is expressed in meters, the age dominates
the distance calculation between two samples, because ages
differ by tens of years while heights differ by fractions of
a meter.
• A widely used rescaling approach is to divide each attribute
by the attribute’s standard deviation. The resulting attributes
will each have a standard deviation equal to 1 and will be
without units.
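• A minimal NumPy sketch of this rescaling (the patient values
below are made up for illustration):

    import numpy as np

    # Hypothetical patients: columns are age (years) and height (meters).
    X = np.array([[35.0, 1.80],
                  [62.0, 1.65],
                  [48.0, 1.72]])

    # Divide each attribute by its standard deviation; the rescaled
    # attributes are unitless, each with standard deviation 1.
    X_scaled = X / X.std(axis=0)
    print(X_scaled.std(axis=0))  # -> [1. 1.]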
Example Using Iris Dataset
• Consider the iris dataset, which is made up of three classes:
Iris setosa, Iris versicolor, and Iris virginica.
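• A short scikit-learn sketch of running k-means on the iris
measurements (the species labels are ignored during clustering
and are used here only as a hint of what the clusters should
recover):

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    iris = load_iris()
    X = iris.data  # 150 samples, 4 attributes (sepal/petal length and width)

    # Cluster into k = 3 groups, matching the number of iris species.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(km.cluster_centers_)  # the three centroids
    print(km.labels_[:10])      # cluster assignments of the first 10 samples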
Example (2 clusters)
(Figure: k-means clustering of the data with k = 2.)
Example (more clusters)
(Figures: k-means clustering of the data with k = 3, k = 4, and k = 5.)
Choosing the Best k value
• The Within Cluster Sum of Squares (WSS, also called inertia)
is the sum of the squared distances between each sample and
its corresponding centroid.
• The smaller the WSS value, the more cohesive the clusters
are. If as many clusters are created as there are samples in
the data set, the WSS value falls to zero.
• So how can the optimal number of clusters be found using the
inertia value? For this, the so-called Elbow Method can be
used, as sketched below.
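• A sketch of the Elbow Method using scikit-learn, which exposes
WSS as the inertia_ attribute of a fitted KMeans model (the iris
data and the range of k values are illustrative choices):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X = load_iris().data
    ks = range(1, 11)

    # WSS (inertia) for each candidate number of clusters.
    wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]

    plt.plot(ks, wss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("WSS (inertia)")
    plt.show()  # look for the 'elbow' where the curve flattens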
Choosing the Best k value
• The value of k can be chosen based on a reasonable guess or
some predefined requirement. However, even then, it would
be good to know how much better or worse having k clusters
versus (k − 1) or (k + 1) clusters would be in explaining the
structure of the data.
• A heuristic using the Within Cluster Sum of Squares (WSS)
metric can be examined to determine a reasonably optimal
value of k. Remember that WSS is the sum of the squares of
the distances between each data point and its closest
centroid.
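• In symbols, for M data points x_i and centroids c_1, ..., c_k:

    WSS = \sum_{i=1}^{M} \min_{j=1,\dots,k} \lVert x_i - c_j \rVert^2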
Choosing the Best k value
(Figure: WSS plotted against the number of clusters k.)
Choosing the Best k value
• As can be seen, WSS is greatly reduced when k increases
from one to two. Another substantial reduction in WSS
occurs at k = 3. However, the improvement in WSS is fairly
linear for k > 3. Therefore, the k-means analysis will be
conducted for k = 3.