Clustering
Prepared by : Bharat Gautam
Content
• Clustering and its algorithms
1/25/2024
Clustering Analysis
Clustering
• Clustering is the task of dividing the population or data points into a
number of groups such that data points in the same group are more
similar to each other than to data points in other groups.
• Suppose, you are the head of a rental store and wish to understand
preferences of your customers to scale up your business. Is it possible for
you to look at details of each customer and devise a unique business
strategy for each one of them? Definitely not.
• But what you can do is cluster all of your customers into, say, 10 groups
based on their purchasing habits and use a separate strategy for the
customers in each of these 10 groups. This is what we call clustering.
• Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their
similarity.
• Clustering can also be used for outlier detection, where outliers may be
more interesting than common cases.
• Applications of outlier detection include the detection of credit card fraud
and the monitoring of criminal activities in electronic commerce.
• For example, exceptional cases in credit card transactions, such as very
expensive and frequent purchases, may be of interest as possible
fraudulent activity
• In machine learning, clustering is an example of unsupervised learning.
• Unlike classification, clustering and unsupervised learning do not rely on
predefined classes and class-labeled training examples.
• For this reason, clustering is a form of learning by observation, rather than
learning by examples.
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
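These two criteria can be made concrete by comparing average distances (similarity is high when distance is small). A minimal sketch, where the two example clusters and their coordinates are invented for illustration:

```python
def dist(a, b):
    # Euclidean distance between two 2-D points.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def mean_intra(cluster):
    # Average distance between pairs of points inside one cluster.
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter(c1, c2):
    # Average distance between points of two different clusters.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

# Two invented, well-separated clusters.
cluster_a = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)]
cluster_b = [(8.0, 8.0), (8.3, 7.7), (7.9, 8.4)]

intra = (mean_intra(cluster_a) + mean_intra(cluster_b)) / 2
inter = mean_inter(cluster_a, cluster_b)
print(intra < inter)  # a good clustering keeps intra-cluster distances small
```

High similarity corresponds to small distance, so the two bullets translate to small `intra` and large `inter`.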
Outliers
• Outliers are objects that do not belong to any cluster or form
clusters of very small cardinality
Types of clustering algorithms
• Partitioning algorithms: Construct random partitions and then iteratively
refine them by some criterion. E.g., K-means clustering.
• Hierarchical algorithms: Create a hierarchical decomposition of the set of
data (or objects) using some criterion.
• Density-based: based on connectivity and density functions
K-means Clustering
• Given k, the k-means algorithm is implemented in 4 steps:
  • Step 1: Partition objects into k nonempty subsets.
  • Step 2: Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  • Step 3: Assign each object to the cluster with the nearest seed point.
  • Step 4: Go back to Step 2; stop when there are no new assignments.
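The four steps can be sketched directly in plain Python. This is a minimal sketch, where the random initial partition, the empty-cluster re-seeding, and the 2-D toy data in the usage are assumptions, not part of the slides:

```python
import random

def k_means(points, k, seed=0):
    """Plain k-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    # Step 1: partition objects into k nonempty subsets (random assignment).
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    while True:
        # Step 2: compute the centroid (mean point) of each current cluster.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids.append((sum(x for x, _ in members) / len(members),
                                  sum(y for _, y in members) / len(members)))
            else:
                centroids.append(rng.choice(points))  # re-seed an empty cluster
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                      + (p[1] - centroids[c][1]) ** 2)
                      for p in points]
        # Step 4: repeat until no assignment changes.
        if new_labels == labels:
            return labels, centroids
        labels = new_labels

# Usage on two invented, well-separated blobs.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(data, 2)
```

At convergence every point sits in the cluster whose centroid is nearest to it, which is exactly the stopping condition of Step 4.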
• Use the K-means algorithm to cluster the given data points, where K = 3:
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)
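As a check on the hand computation, the same data can be run through a small k-means loop. The initial centroids are an assumption here (A1, B1, C1 are used; the exercise itself does not fix them):

```python
def assign(points, centroids):
    # Assign each point to the nearest centroid (squared Euclidean distance).
    return [min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points]

def update(points, labels, k):
    # Recompute each centroid as the mean of its assigned points.
    cents = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        cents.append((sum(x for x, _ in members) / len(members),
                      sum(y for _, y in members) / len(members)))
    return cents

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
names  = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]

# Assumed initial centroids: A1, B1, C1.
centroids = [(2, 10), (5, 8), (1, 2)]
labels = assign(points, centroids)
while True:
    centroids = update(points, labels, 3)
    new = assign(points, centroids)
    if new == labels:
        break
    labels = new

for c in range(3):
    print(c, [n for n, l in zip(names, labels) if l == c])
# 0 ['A1', 'B1', 'C2']
# 1 ['A3', 'B2', 'B3']
# 2 ['A2', 'C1']
```

With these seeds the algorithm converges after a few iterations to the clusters {A1, B1, C2}, {A3, B2, B3}, {A2, C1}.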
Pros and cons of k-means clustering
Hierarchical methods:
• Hierarchical clustering techniques are a second important category of
clustering methods.
• A hierarchical clustering method works by grouping data objects into a
tree of clusters.
• These methods can be further classified into two categories, agglomerative and
divisive, depending on whether the hierarchical decomposition is formed in a
bottom-up (merging) or top-down (splitting) fashion.
• A hierarchical clustering is often displayed graphically using a tree-like
diagram called a dendrogram. It displays both the cluster-subcluster
relationships and the order in which the clusters were merged or split.
• A hierarchical method suffers from an inability to perform adjustment: if a
particular merge or split decision later turns out to have been a poor choice,
the method cannot backtrack and correct it.
• Agglomerative:
  • Start with the points as individual clusters and, at each step, merge the closest pair of clusters.
  • This requires a notion of cluster proximity.
  • Bottom-up approach.
• Divisive:
  • Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.
  • In this case, we need to decide which cluster to split at each step and how to do the splitting.
  • Top-down approach.
Agglomerative Algorithms
• Unsupervised machine learning algorithm
• This is the type of hierarchical clustering that follows bottom-up approach.
• This algorithm considers each data point as a single cluster at the beginning,
and then starts combining the closest pairs of clusters until everything is
merged into a single cluster containing the whole dataset.
• This hierarchy of clusters is represented in the form of the dendrogram.
Algorithm:
• Given a dataset (d1, d2, d3, ... dn) of size N:
  • Compute the distance matrix.
  • Repeat until only one cluster remains:
    • Merge the closest two clusters.
    • Update the distance matrix.
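This loop can be sketched in a few lines, assuming single-linkage (the cluster distance is the minimum distance between members, so the matrix update keeps the smaller entry). The toy distance matrix in the usage is invented:

```python
def agglomerative_merges(dist):
    """Single-linkage agglomerative clustering on a symmetric
    distance matrix; returns the merge order as (members, distance)."""
    n = len(dist)
    dist = [row[:] for row in dist]          # work on a copy
    active = [True] * n                      # clusters still in the matrix
    members = [[i] for i in range(n)]
    merges = []
    for _ in range(n - 1):
        # Merge the closest pair of remaining clusters.
        i, j = min(((a, b) for a in range(n) if active[a]
                           for b in range(a + 1, n) if active[b]),
                   key=lambda ab: dist[ab[0]][ab[1]])
        merges.append((members[i] + members[j], dist[i][j]))
        members[i] = members[i] + members[j]
        active[j] = False
        # Update the distance matrix: single-link keeps the minimum.
        for k in range(n):
            if active[k] and k != i:
                d = min(dist[i][k], dist[j][k])
                dist[i][k] = dist[k][i] = d
    return merges

# Usage: three items at 1-D positions 0, 1, 5 (entries are |a - b|).
matrix = [[0, 1, 5],
          [1, 0, 4],
          [5, 4, 0]]
order = agglomerative_merges(matrix)
```

Here items 0 and 1 merge first at distance 1, then item 2 joins at distance 4 (the single-link distance min(5, 4)).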
Example
Consider the following set of six one-dimensional data points:
18, 22, 25, 42, 27, 43
• Apply the agglomerative hierarchical clustering algorithm to build the
hierarchical clustering dendrogram.
• Merge the clusters using Min distance and update the proximity matrix
accordingly.
• Clearly show the proximity matrix corresponding to each iteration of the
algorithm.
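The merge order for these six points can be checked mechanically using min (single-link) distance. A plain-Python sketch (SciPy's `linkage` would do the same, but this keeps it self-contained):

```python
def single_link(c1, c2):
    # Min distance between members of the two clusters.
    return min(abs(a - b) for a in c1 for b in c2)

points = [18, 22, 25, 42, 27, 43]
clusters = [[p] for p in points]
merge_distances = []

while len(clusters) > 1:
    # Find the smallest entry of the current proximity matrix.
    i, j = min(((a, b) for a in range(len(clusters))
                       for b in range(a + 1, len(clusters))),
               key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
    merge_distances.append(single_link(clusters[i], clusters[j]))
    merged = sorted(clusters[i] + clusters[j])
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merge_distances)   # [1, 2, 3, 4, 15]
```

So 42 and 43 merge first (distance 1), then 25 and 27 (distance 2), then 22 joins {25, 27} (distance 3), then 18 joins (distance 4), and the final merge at distance 15 links the two remaining clusters. These are the heights at which the dendrogram branches join.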
Pros and cons of hierarchical clustering
• Advantages
  • No need to specify the number of clusters in advance.
  • Easy to use and implement.
  • The dendrogram provides a clear visualization.
• Disadvantages
  • We cannot take a step back (undo a merge or split) in this algorithm.
  • Higher time complexity.
  • Not suitable for larger datasets due to high time and space complexity.
Density based Cluster
• The clusters are created based upon the density of the data
points which are represented in the data space.
• The regions that become dense due to the large number of data points
residing in them are considered clusters.
• The data points in sparse regions (regions with very few data points)
are considered noise or outliers.
• The clusters created in these methods can be of arbitrary shape.
• Density-based clustering algorithms are very efficient at finding high
density regions and outliers.
• Algorithm: DBSCAN
DBSCAN algorithm
• DBSCAN stands for density-based spatial clustering of
applications with noise.
• It is able to find arbitrary shaped clusters and clusters with noise (i.e.
outliers).
• K-means and hierarchical clustering both fail to create clusters of
arbitrary shapes; DBSCAN helps us identify such clusters.
• The main idea behind DBSCAN is that a point belongs to a cluster if it is
close to many points from that cluster.
• There are two key parameters of DBSCAN:
  • epsilon (eps): the distance that specifies the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.
  • minPoints: the minimum number of data points required to define a cluster.
• Based on these two parameters, points are classified as core points, border
points, or outliers.
• Core point: A point is a core point if there are at least minPoints points
(including the point itself) within its surrounding area of radius epsilon.
• Border point: A point is a border point if it is reachable from a core point
and there are fewer than minPoints points within its surrounding area.
• Outlier: A point is an outlier if it is not a core point and is not reachable
from any core point.
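The three definitions can be written out directly. A sketch with brute-force neighborhood search, where the Euclidean metric and the sample points, `eps`, and `min_points` values in the usage are assumptions:

```python
def classify_points(points, eps, min_points):
    """Label each 2-D point 'core', 'border', or 'outlier' per the
    DBSCAN definitions above (brute-force neighborhoods)."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    # Neighborhood of p: all points within eps, including p itself.
    neigh = [[q for q in points if dist(p, q) <= eps] for p in points]
    core = [len(nb) >= min_points for nb in neigh]
    labels = []
    for i, p in enumerate(points):
        if core[i]:
            labels.append("core")
        elif any(core[j] for j, q in enumerate(points) if dist(p, q) <= eps):
            labels.append("border")   # reachable from a core point
        else:
            labels.append("outlier")  # neither core nor near a core point
    return labels

# Invented data: a tight square of points, one fringe point, one far point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = classify_points(pts, eps=1.5, min_points=4)
```

With these values, the four square corners are core points, (2, 1) is a border point (too few neighbors of its own but within eps of the core point (1, 1)), and (5, 5) is an outlier.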
In this case, minPoints is 4. Red points are core points. The yellow points are
border points because they are reachable from a core point but have fewer than
4 points within their neighborhood. N is an outlier because it is not a core
point and cannot be reached from any core point.
Algorithms
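A minimal sketch of the overall DBSCAN procedure: grow a cluster outward from each unvisited core point. Brute-force neighborhood queries, the Euclidean metric, cluster ids starting at 0, noise marked -1, and the sample data are all assumptions here:

```python
def dbscan(points, eps, min_points):
    """Minimal DBSCAN on 2-D points: expand a cluster from each
    unvisited core point. Returns one label per point; -1 is noise."""
    def region(i):
        # Indices of all points within eps of point i (including i).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2
                 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_points:   # not core (may become border later)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # previously noise: now a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = region(j)
            if len(nb) >= min_points: # j is itself core: keep expanding
                queue.extend(nb)
    return labels

# Invented data: two dense blobs and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (5, 5), (5, 6), (6, 5), (6, 6), (10, 0)]
labels = dbscan(data, eps=1.5, min_points=3)
```

Each blob becomes its own cluster regardless of where the loop starts inside it, and the isolated point at (10, 0) stays labeled -1.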
Pros and cons of DBSCAN
• Advantages:
  • Handles irregularly shaped and sized clusters.
  • Robust to outliers.
  • Does not require the number of clusters to be specified.
  • Relatively fast.
• Disadvantages:
  • Difficult to incorporate categorical features.
  • Struggles with clusters of differing densities.
  • Struggles with high-dimensional data.
Application of clustering
• Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image processing.
Some of the major applications are:
  • Business: discover distinct groups in a customer base, and characterize those groups based on their purchasing patterns.
  • Biological field: derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent to populations.
  • Geo-information: identify areas of similar land use in an earth observation database, and identify groups of houses in a city according to house type, value, and geographic location.
  • Information retrieval: classify documents on the web for information discovery and retrieval.
  • Outlier detection: applications such as the detection of credit card fraud.
Thank you