
Data Mining Clustering: K-Means & K-Median

DATA MINING
LECTURE 6
Clustering
K-Means
K-Median
CLUSTERING
Clustering
• Clustering is the process of grouping abstract objects
into classes of similar objects.
• It is an unsupervised machine-learning technique.
• Each group of data objects is treated as one cluster.
• In cluster analysis, the first step is to partition the
set of data into groups based on data similarity; the
groups are then assigned their respective labels.
• The biggest advantage of clustering over classification
is that it can adapt to changes and helps single out the
features that differentiate the groups.
What is a Clustering?
• In general, a grouping of objects such that the objects in a
group (cluster) are similar (or related) to one another and
different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
Applications of Cluster Analysis
• It is widely used in many applications such as image
processing, data analysis, and pattern recognition.
• It helps marketers find the distinct groups in their customer
base and characterize those groups by their purchasing
patterns.
• It can be used in the field of biology, by deriving animal and
plant taxonomies and identifying genes with the same
capabilities.
• It also helps in information discovery by classifying
documents on the web.
Early applications of cluster analysis
• John Snow, London 1854
Notion of a Cluster can be Ambiguous
How many clusters?
[Figure: the same point set plausibly grouped as two, four, or six clusters]
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
• A division of data objects into subsets (clusters) such
that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical
tree
Partitional Clustering
[Figure: the original points (left) and a partitional clustering of them (right)]
Hierarchical Clustering
[Figure: points p1–p4 shown as a traditional hierarchical clustering with its
dendrogram, and as a non-traditional hierarchical clustering with its dendrogram]
Other types of clustering
• Exclusive (or non-overlapping) versus non-exclusive (or overlapping)
• In non-exclusive clusterings, points may belong to
multiple clusters.
• Points that belong to multiple classes, or ‘border’ points
• Fuzzy (or soft) versus non-fuzzy (or hard)
• In fuzzy clustering, a point belongs to every cluster
with some weight between 0 and 1
• Weights usually must sum to 1 (often interpreted as probabilities)
• Partial versus complete
• In some cases, we only want to cluster some of the
data
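As a toy illustration of the fuzzy (soft) clustering idea above, the snippet below gives one point a membership weight for each of three clusters, inversely proportional to hypothetical distances and normalized to sum to 1. This is only an illustrative weighting scheme; Fuzzy C-Means uses a related but different update rule.

```python
# Hypothetical distances from one point to three cluster centers.
dists = [1.0, 3.0, 4.0]

# Soft memberships: inverse-distance weights, normalized to sum to 1.
inv = [1.0 / d for d in dists]
weights = [w / sum(inv) for w in inv]

print(weights)       # the nearest center receives the largest weight
print(sum(weights))  # weights sum to 1, as in soft clustering
```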
Types of Clusters: Well-Separated
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of its cluster than to the
center of any other cluster
• The center of a cluster is often a centroid, the minimizer of
distances from all the points in the cluster, or a medoid, the
most “representative” point of a cluster
A center-based cluster is a type of cluster in which the grouping of data points is
determined based on their proximity to a central point, often referred to as the
cluster center or centroid. The center represents the mean or median of the data
points within the cluster
4 center-based clusters
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• DBSCAN
The various types of clustering are:
• Hierarchical clustering
• Partitioning clustering
Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering
K-MEANS
K-means Clustering
• Unsupervised Learning algorithm
• Partitional clustering approach
• groups the unlabeled dataset into different clusters.
• K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two
clusters, and for K=3, there will be three clusters.
• divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group
of points with similar properties.
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
K-means Clustering
• Problem: Given a set X of n points in a d-dimensional space
and an integer K, group the points into K clusters
C = {C1, C2, …, Ck} such that

$\mathrm{Cost}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i)$

is minimized, where c_i is the centroid of the points
in cluster C_i
K-means Clustering
• The most common definition uses Euclidean distance,
minimizing the Sum of Squares Error (SSE) function
• Sometimes K-means is defined as follows
• Problem: Given a set X of n points in a d-dimensional
space and an integer K, group the points into K clusters
C = {C1, C2, …, Ck} such that

$\mathrm{Cost}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$

is minimized, where c_i is the mean of the points in
cluster C_i
Sum of Squares Error (SSE)
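The SSE can be computed directly from its definition: sum, over all clusters, the squared Euclidean distances from each point to its cluster's centroid. A minimal sketch in plain Python; the point and centroid values are made-up illustrative data.

```python
def sse(clusters, centroids):
    """Sum of Squares Error: total squared Euclidean distance
    from each point to the centroid of its assigned cluster."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for x in points:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

# Two small 2-D clusters with their mean centroids (illustrative data).
clusters = [[(0, 0), (2, 0)], [(10, 10), (10, 12)]]
centroids = [(1, 0), (10, 11)]
print(sse(clusters, centroids))  # 1 + 1 + 1 + 1 = 4.0
```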
K-means Algorithm
• The name K-means is sometimes used to refer
specifically to this algorithm
• The K-means clustering algorithm
mainly performs two tasks:
• Determines the best positions for the
K center points (centroids) through an
iterative process.
• Assigns each data point to its
closest centroid; the points assigned
to the same centroid form a cluster.
How does the K-Means Algorithm
Work?
• Step-1: Select the number K of clusters.
• Step-2: Select K random points as initial centroids
(they need not come from the input dataset).
• Step-3: Assign each data point to its closest centroid,
which forms the K clusters.
• Step-4: Recompute the centroid of each cluster
(e.g., the mean of its assigned points).
• Step-5: Repeat step 3: reassign each data point to the
new closest centroid.
• Step-6: If any reassignment occurred, go to step 4;
otherwise go to FINISH.
• Step-7: The model is ready.
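The steps above can be sketched in plain Python. The random initialization, the toy 2-D data, and K=2 are illustrative choices, not part of the lecture's pseudocode.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: random initial centroids
    for _ in range(max_iters):
        # Steps 3/5: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, centroids[j])))
            clusters[i].append(x)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts))
            if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centroids == centroids:  # Step 6: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative 2-D data with two obvious groups.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
```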
K-means Algorithm – Initialization
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
Two different K-means Clusterings
[Figure: the same original points clustered two different ways by K-means;
one run finds the optimal clustering, the other a sub-optimal clustering]
Importance of Choosing Initial Centroids
[Figure: result after 6 iterations for a well-chosen initialization]
Importance of Choosing Initial Centroids
[Figure: per-iteration snapshots (iterations 1–6) of the well-initialized run]
Importance of Choosing Initial Centroids
[Figure: result after 5 iterations for a poorly chosen initialization;
K-means converges to a sub-optimal clustering]
Importance of Choosing Initial Centroids …
[Figure: per-iteration snapshots (iterations 1–5) of the poorly initialized run]
Dealing with Initialization
• Do multiple runs and select the clustering with the
smallest error
• Select the initial set of points by methods other than
random; e.g., pick points that are far from each other
as cluster centers (K-means++ algorithm)
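The K-means++ idea mentioned above is usually implemented probabilistically: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, so far-apart points are strongly favored. A minimal sketch of the seeding step, with illustrative data:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ seeding: first centroid uniformly at random, each
    subsequent centroid sampled with probability proportional to its
    squared distance from the nearest centroid already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c))
                  for c in centroids)
              for x in points]
        # Points already chosen have weight 0 and cannot repeat.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Illustrative data: the two seeds almost surely land in different groups.
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(data, k=2))
```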
K-means Algorithm – Centroids
• The centroid depends on the distance function: it is the
minimizer of that distance function over the cluster’s points
• ‘Closeness’ is measured by Euclidean distance
(SSE), cosine similarity, correlation, etc.
• Centroid:
• The mean of the points in the cluster for SSE
• The median for Manhattan distance.
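A quick 1-D check of the last two bullets: the mean minimizes the sum of squared distances, while the median minimizes the sum of absolute (Manhattan) distances, and is therefore robust to outliers. The numbers are made-up illustrative data.

```python
# 1-D points; the outlier 100 pulls the mean but not the median.
pts = [1, 2, 3, 100]

mean = sum(pts) / len(pts)               # 26.5
median = sorted(pts)[len(pts) // 2 - 1]  # lower median: 2

sq_err = lambda c: sum((x - c) ** 2 for x in pts)  # SSE around center c
abs_err = lambda c: sum(abs(x - c) for x in pts)   # Manhattan cost around c

print(abs_err(median) <= abs_err(mean))  # True: median wins on Manhattan cost
print(sq_err(mean) <= sq_err(median))    # True: mean wins on SSE
```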
Limitations of K-means
• K-means has problems when clusters are of
different
• Sizes
• Densities
• Non-globular shapes
• K-means has problems when the data contains
outliers.
Limitations of K-means: Differing Sizes
Original Points
K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points
K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points
K-means (2 Clusters)
Overcoming K-means Limitations
Original Points
K-means Clusters
One solution is to use many clusters: find parts of the natural
clusters, then put them back together.
Overcoming K-means Limitations
Original Points
K-means Clusters
Example
• Visualization
• This clustering can be visualized as two distinct
groups on a 2D graph, where the x-axis
represents Study Hours and the y-axis
represents Test Scores.
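The described 2-D picture can be reproduced with a short script. The (Study Hours, Test Score) values and the two centroids below are hypothetical, chosen only so the two groups are clearly separated.

```python
# Hypothetical (Study Hours, Test Score) data forming two visible groups.
students = [(1, 40), (2, 45), (2, 50), (8, 85), (9, 90), (10, 88)]

# Two illustrative centroids, one per group.
centroids = [(1.7, 45.0), (9.0, 87.7)]

def nearest(x, centroids):
    """Index of the centroid closest to point x (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[j])))

labels = [nearest(s, centroids) for s in students]
print(labels)  # low-hours students in one cluster, high-hours in the other
```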
Assignment
• K Median
• Self Study