Clustering
Prepared by: Bharat Gautam

Content
• Clustering and its algorithms

Clustering
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups.
• Suppose you are the head of a rental store and wish to understand the preferences of your customers in order to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not.
• What you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
• Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.
• Clustering can also be used for outlier detection, where outliers may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.
• In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by examples.

What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity

Outliers
• Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality.

Types of clustering algorithms
• Partitioning algorithms: construct an initial (random) partition and then iteratively refine it by some criterion, e.g. k-means clustering.
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Density-based algorithms: based on connectivity and density functions.

K-means Clustering
• Given k, the k-means algorithm is implemented in four steps:
  • Step 1: Partition the objects into k nonempty subsets.
  • Step 2: Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  • Step 3: Assign each object to the cluster with the nearest seed point.
  • Step 4: Go back to Step 2; stop when no assignment changes.

Example
• Use the k-means algorithm to cluster the given data points, where K = 3:
  A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)
• A worked sketch of this example follows the pros and cons below.

Pros and cons of k-means clustering
• Advantages: simple to implement and computationally efficient, even on fairly large data sets.
• Disadvantages: the number of clusters k must be chosen in advance, the result depends on the initial seeds, and the method is sensitive to outliers.
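To make the four steps concrete, here is a minimal Python sketch that runs k-means on the example points. The slides do not specify the initial seeds; taking A1, B1, and C1 as the starting centroids is an assumption made here for illustration.

```python
# A minimal k-means sketch on the slide's eight points with K = 3.
# Assumption: A1, B1, C1 are the initial seeds (the slides leave this open).
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)], dtype=float)
names = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]

centroids = points[[0, 3, 6]].copy()   # assumed seeds: A1, B1, C1

while True:
    # Step 3: assign each point to the nearest centroid (Euclidean distance)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Step 2: recompute each centroid as the mean of its assigned points
    # (with these seeds no cluster ever becomes empty)
    new_centroids = np.array([points[assign == k].mean(axis=0) for k in range(3)])
    # Step 4: stop when the centroids (and hence the assignments) no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

for name, k in zip(names, assign):
    print(name, "-> cluster", k)
print("final centroids:\n", centroids)
```

With these seeds the loop converges to the clusters {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}, with centroids (3.67, 9), (7, 4.33), and (1.5, 3.5).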
Hierarchical methods
• Hierarchical clustering techniques are a second important category of clustering methods.
• A hierarchical clustering method works by grouping data objects into a tree of clusters.
• Hierarchical methods can be further classified into two categories, agglomerative and divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.
• A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram. It displays both the cluster-subcluster relationships and the order in which the clusters were merged or split.
• Hierarchical clustering suffers from its inability to perform adjustment: if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.
• Agglomerative:
  • Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This requires a notion of cluster proximity.
  • Bottom-up approach.
• Divisive:
  • Start with one all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide which cluster to split at each step and how to do the splitting.
  • Top-down approach.

Agglomerative Algorithms
• An unsupervised machine learning algorithm.
• This is the type of hierarchical clustering that follows the bottom-up approach.
• The algorithm treats each data point as a single cluster at the beginning, and then keeps merging the closest pair of clusters until everything is merged into a single cluster that contains the whole data set.
• This hierarchy of clusters is represented in the form of a dendrogram.

Algorithm
• Given a data set (d1, d2, d3, ..., dn) of size N:
  • Compute the distance matrix.
  • Repeat until only one cluster remains:
    • Merge the closest two clusters.
    • Update the distance matrix.

Example
• Consider the following set of six one-dimensional data points: 18, 22, 25, 42, 27, 43.
• Apply the agglomerative hierarchical clustering algorithm to build the hierarchical clustering dendrogram.
• Merge the clusters using the Min distance and update the proximity matrix accordingly.
• Clearly show the proximity matrix corresponding to each iteration of the algorithm. (A sketch that prints these matrices follows the pros and cons below.)

Pros and cons of hierarchical clustering
• Advantages
  • No need to specify the number of clusters in advance.
  • Easy to use and implement.
  • The dendrogram provides a clear visualization.
• Disadvantages
  • We cannot take a step back in this algorithm (no backtracking on merge decisions).
  • Time complexity is high.
  • Not suitable for larger data sets due to high time and space complexity.
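Here is a minimal sketch of the Min-distance (single-link) agglomerative procedure on the six example points. It follows the algorithm above literally, printing the proximity matrix at each iteration as the exercise asks; clusters are represented as plain lists of points rather than as a dendrogram.

```python
# Single-link (Min distance) agglomerative clustering on the slide's points.
import numpy as np

pts = [18, 22, 25, 42, 27, 43]
clusters = [[p] for p in pts]  # start: each point is its own cluster

def min_dist(a, b):
    # single-link (Min) distance between two clusters of 1-D points
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # build and show the proximity matrix for the current partition
    n = len(clusters)
    D = np.array([[min_dist(clusters[i], clusters[j]) if i != j else 0.0
                   for j in range(n)] for i in range(n)])
    print("clusters:", clusters)
    print(D)
    # merge the closest pair of clusters
    i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: D[ij])
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print("final cluster:", clusters[0])
```

With these points, the merges occur at distances 1 ({42, 43}), 2 ({25, 27}), 3, 4, and finally 15, which is the merge order a dendrogram of this data would record.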
Density-based Clustering
• The clusters are created based on the density of the data points in the data space.
• Regions that become dense because a large number of data points reside there are considered clusters.
• Data points in sparse regions (regions with very few data points) are considered noise or outliers.
• The clusters created by these methods can be of arbitrary shape.
• Density-based clustering algorithms are very efficient at finding high-density regions and outliers.
• Algorithm: DBSCAN

DBSCAN algorithm
• DBSCAN stands for density-based spatial clustering of applications with noise.
• It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• K-means and hierarchical clustering both fail at creating clusters of arbitrary shapes; DBSCAN helps us identify such clusters.
• The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
• There are two key parameters of DBSCAN:
  • epsilon (eps): the distance that specifies the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.
  • minPoints: the minimum number of data points required to define a cluster.
• Based on these two parameters, points are classified as core points, border points, or outliers:
  • Core point: a point is a core point if there are at least minPoints points (including the point itself) within its surrounding area of radius eps.
  • Border point: a point is a border point if it is reachable from a core point but has fewer than minPoints points within its surrounding area.
  • Outlier: a point is an outlier if it is not a core point and is not reachable from any core point.
• Example (illustrated in the figure): with minPoints = 4, the red points are core points; the yellow points are border points, because they are reachable from a core point but have fewer than 4 points within their neighborhood; and N is an outlier, because it is not a core point and cannot be reached from any core point.

Algorithm
• Identify the core points: for each point, count the points within distance eps and check the count against minPoints.
• Form a cluster from each core point together with all points density-reachable from it, i.e. connected through chains of core points whose eps-neighborhoods overlap.
• Assign each border point to a cluster of a core point in its neighborhood; label all remaining points as noise.
• A runnable sketch of DBSCAN appears in the appendix after the closing slide.

Pros and cons of DBSCAN
• Advantages
  • Handles irregularly shaped and sized clusters.
  • Robust to outliers.
  • Does not require the number of clusters to be specified.
  • Relatively fast.
• Disadvantages
  • Difficult to incorporate categorical features.
  • Struggles with clusters of differing density.
  • Struggles with high-dimensional data.

Application of clustering
• Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. Some of the major applications are:
  • Business: discover distinct groups in a customer base and characterize customer groups based on their purchasing patterns.
  • Biology: derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
  • Geo-information: identify areas of similar land use in an earth-observation database, and identify groups of houses in a city according to house type, value, and geographic location.
  • Information retrieval: classify documents on the web for information discovery and information retrieval.
  • Outlier detection: detect unusual cases such as credit card fraud.

Thank you
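Appendix: DBSCAN sketch
To make the two parameters concrete, here is a minimal DBSCAN sketch using scikit-learn. The two-moons data set and the eps / min_samples values are illustrative choices, not from the slides; they show DBSCAN finding the arbitrary-shaped clusters that k-means cannot separate, and labeling sparse points as noise.

```python
# A minimal DBSCAN sketch on synthetic two-moons data.
# Assumptions: the data set and the eps / min_samples values are
# illustrative choices for this example, not taken from the slides.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps: the neighborhood radius; min_samples: the minPoints threshold
# that decides whether a point is a core point.
labels = DBSCAN(eps=0.2, min_samples=4).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))
print("noise points  :", list(labels).count(-1))
```

Because the two moons are arbitrarily shaped, k-means cannot separate them, while DBSCAN typically recovers both and marks the sparse points as noise (label -1).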