CCCS323 Machine Learning
Dr. Elham Alghamdi
Clustering: Density-Based Clustering

[Figure: spherical-shape clusters vs. arbitrary-shape clusters]

When traditional clustering techniques such as K-Means or hierarchical clustering are applied to tasks with arbitrarily shaped clusters, or clusters within clusters, they may not achieve good results: elements placed in the same cluster may not share enough similarity, and performance may be poor.

Density-Based Clustering
Partitioning-based algorithms such as K-Means may be easy to understand and implement in practice, but the algorithm has no notion of outliers: all points are assigned to a cluster even if they do not belong in any. In the domain of anomaly detection this causes problems, because anomalous points are assigned to the same cluster as normal data points. The anomalous points pull the cluster centroid towards them, making them harder to classify as anomalous. In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density.

K-Means vs. Density-Based Clustering
• K-Means assigns all points to a cluster, even if they do not belong in any.
• Density-based clustering locates regions of high density and separates out outliers.

Density-Based Clustering
In density-based clustering we partition points into dense regions separated by not-so-dense regions. Each cluster has a considerably higher density of points than the region outside the cluster. This gives a density-based notion of a cluster: a cluster is defined as a maximal set of density-connected points.
DBSCAN is a density-based clustering algorithm; the name stands for "density-based spatial clustering of applications with noise". Several related studies:
• DBSCAN: Ester et al. (KDD '96)
• OPTICS: Ankerst et al. (SIGMOD '99)
• DENCLUE: Hinneburg & Keim (KDD '98)
• CLIQUE: Agrawal et al. (SIGMOD '98)

DBSCAN Clustering
Important questions: How do we measure density? What is a dense region?
DBSCAN uses two parameters:
• ε (epsilon): the maximum radius of the neighborhood around a point p. The ε-neighborhood of p is the set of points within radius ε of p: N_ε(p) = { q | dist(p, q) ≤ ε }.
• MinPts: the minimum number of points required in the ε-neighborhood N_ε(p) of a point p.
Dense region: the ε-neighborhood N_ε(p) of p contains at least MinPts points. The density at a point p is high if |N_ε(p)| ≥ MinPts, and the density at a point q is low if |N_ε(q)| < MinPts.

Characterization of points:
• A core point has at least MinPts points (including itself) within distance ε. These are points in the interior of a cluster (a dense region).
• A border point has fewer than MinPts points within ε, but lies in the ε-neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.

Direct density-reachability: given ε and MinPts, a point q is directly density-reachable from a point p if
• q is within the ε-neighborhood of p, i.e. q ∈ N_ε(p), and
• p is a core point, i.e. |N_ε(p)| ≥ MinPts.
Here, X is directly density-reachable from Y, but the reverse does not hold.
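To make these definitions concrete, here is a minimal sketch (our illustration, not part of the original slides) that classifies a small 2-D dataset into core, border, and noise points for given values of ε and MinPts. The function name classify_points and the toy data are assumptions made for the example.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'.

    A point p is core if its eps-neighborhood N_eps(p) (including p itself)
    contains at least min_pts points; border if it is not core but lies in the
    eps-neighborhood of some core point; noise otherwise.
    """
    n = len(X)
    # Pairwise Euclidean distances (n x n matrix).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # neighbors[i, j] is True iff dist(i, j) <= eps; the diagonal counts the point itself.
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts   # |N_eps(p)| >= MinPts

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i, is_core].any():        # within eps of at least one core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy example: a dense blob plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]])
print(classify_points(X, eps=0.5, min_pts=4))
# Expected: the four blob points are core, the outlier is noise.
```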
Density-reachability: given ε and MinPts, a point q is density-reachable from p if there is a chain of points q1, q2, …, qn with q1 = p and qn = q such that q(i+1) is directly density-reachable from qi for all 1 ≤ i < n. Density-reachability is the transitive closure of direct density-reachability. Here, X is density-reachable from Y: X is directly density-reachable from P2, P2 from P3, and P3 from Y. The inverse does not hold.

Density-connectedness: given ε and MinPts, a point q is density-connected to p if there is a point o such that both p and q are density-reachable from o. Here, both X and Y are density-reachable from O, so X is density-connected to Y.

DBSCAN Algorithm
• Classify the points as core, border, and noise points.
• Eliminate the noise points.
• For every core point p that has not yet been assigned to a cluster, create a new cluster containing p and all points that are density-connected to p.
• Assign each border point to the cluster of its closest core point.

DBSCAN: Heuristics for determining Eps and MinPts
The parameter ε is somewhat more important, as it determines what it means for points to be "close". Setting ε very small will mean that no points are core points and may lead to all points being labeled as noise; setting ε very large will result in all points forming a single cluster.
The MinPts setting mostly determines whether points in less dense regions are labeled as outliers or as their own clusters. If you increase MinPts, anything that would have been a cluster with fewer than MinPts samples is now labeled as noise. MinPts therefore determines the minimum cluster size.
• Smaller ε values: clusters will be denser, requiring data points to be closer to each other to form a cluster. This may result in smaller, more tightly packed clusters; however, a very small ε means no points are core points and may label all points as noise.
• Larger ε values: clusters will be less dense, allowing data points to be farther apart while still forming a cluster. This can lead to larger and more loosely defined clusters; however, a very large ε results in all points forming a single cluster.
When you decrease the value of MinPts, you are lowering the threshold for what qualifies as a core point:
• With a smaller MinPts value, more points become eligible to be core points, because they require fewer neighbors within ε to meet the MinPts criterion. As a result, many smaller groups of nearby points can now be labeled as clusters, and fewer points are left as noise because more of them become part of clusters.
• With a larger MinPts value, fewer points qualify as core points, because they need more neighbors within ε to meet the MinPts requirement. Only denser regions can form clusters, and points in sparser regions are more likely to be labeled as noise points (outliers). If the value is too large, smaller clusters will be incorporated into larger clusters.

Choosing ε with a k-distance plot: suppose you have chosen MinPts = 4. For every data point, find the distance to its 4th nearest neighbor; the i-th entry of the resulting distance array is the 4th-neighbor distance of the i-th data point. Sort this array and plot it, with the distance on the y-axis and the (sorted) point index i on the x-axis. Because the array is sorted, the 4th-neighbor distance increases with the index, and the curve shows an elbow shape. Cut the curve with a horizontal line at the elbow; the value where this line meets the y-axis is the ε to use. In the example plot, this gives eps = 10.
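The k-distance procedure just described can be sketched as follows. This is our illustration rather than code from the slides: it uses scikit-learn's NearestNeighbors to obtain each point's distance to its 4th neighbor and matplotlib to draw the sorted curve; the synthetic dataset and variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 4                                # MinPts chosen as in the slide's example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # stand-in for the real dataset

# kneighbors() on the training data returns each point itself as its own nearest
# neighbor (distance 0), so ask for min_pts + 1 neighbors and take the last
# column: the distance to the 4th *other* point, as described on the slide.
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Sorted k-distance curve: read eps off the elbow (eps = 10 in the slide's example).
plt.plot(k_distances)
plt.xlabel("points sorted by 4th-neighbor distance")
plt.ylabel("4th-neighbor distance")
plt.title("k-distance plot: pick eps at the elbow")
plt.show()
```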
The idea is that for points inside a cluster, the k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a much farther distance. In the example, this gives eps = 10.

Choosing MinPts: there is no automatic way to determine the MinPts value for DBSCAN. Ultimately, the MinPts value should be set using domain knowledge and familiarity with the data set. A few rules of thumb for selecting MinPts:
• The larger the data set, the larger the value of MinPts should be.
• If the data set is noisier, choose a larger value of MinPts.
• Generally, MinPts should be greater than or equal to the dimensionality of the data set.
• For 2-dimensional data, use DBSCAN's default value of MinPts = 4 (Ester et al., 1996).
• If your data has more than 2 dimensions, choose MinPts = 2 * dim, where dim is the dimensionality of your data set.

DBSCAN Clustering: Advantages and Disadvantages
Advantages:
• Does not require the user to set the number of clusters a priori.
• Can capture clusters of complex, arbitrary shapes.
• Can identify points that are not part of any cluster (noise).
Disadvantages:
• Struggles with clusters of varying density.
• Does not work well for high-dimensional data.
• Sensitive to its parameters (ε and MinPts).

When DBSCAN works well: [figure: original points and the resulting clusters] it is resistant to noise and can handle clusters of different shapes and sizes.
When DBSCAN does not work well: [figure: original points and two clusterings, (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)] it struggles with varying densities and is sensitive to its parameters.

DBSCAN: Exercise
Recall the DBSCAN algorithm, then consider twelve points P1, …, P12 whose ε-neighborhoods are as follows (each list excludes the point itself):
P1: P2, P10
P2: P1, P3, P11
P3: P2, P4
P4: P3, P5
P5: P4, P6, P7, P8
P6: P5, P7
P7: P5, P6
P8: P5
P9: P12
P10: P1, P11
P11: P2, P10, P12
P12: P9, P11

With MinPts = 4, a point is a core point when |N_ε(p)| ≥ 4 counting the point itself, i.e. when its list above contains at least three neighbors. The initial classification is:

Point   Status
P1      Noise
P2      Core
P3      Noise
P4      Noise
P5      Core
P6      Noise
P7      Noise
P8      Noise
P9      Noise
P10     Noise
P11     Core
P12     Noise
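As a quick check (ours, not part of the exercise slides), the classification in the table above can be reproduced from the neighbor lists; the dictionary name neighbors and the counting convention (the point itself adds one to its neighborhood size) follow the assumption stated above.

```python
# eps-neighborhoods from the exercise (each list excludes the point itself).
neighbors = {
    "P1": ["P2", "P10"],            "P2": ["P1", "P3", "P11"],
    "P3": ["P2", "P4"],             "P4": ["P3", "P5"],
    "P5": ["P4", "P6", "P7", "P8"],
    "P6": ["P5", "P7"],             "P7": ["P5", "P6"],
    "P8": ["P5"],                   "P9": ["P12"],
    "P10": ["P1", "P11"],           "P11": ["P2", "P10", "P12"],
    "P12": ["P9", "P11"],
}
MIN_PTS = 4

# Core iff |N_eps(p)| >= MinPts, counting the point itself (hence the +1).
core = {p for p, nbrs in neighbors.items() if len(nbrs) + 1 >= MIN_PTS}
for p in neighbors:
    print(p, "Core" if p in core else "Noise (for now)")
# Core points: P2, P5, P11, matching the table above.
```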
Step 1: start with a core point, P2. It has no cluster label yet, so form a new cluster C1 and include all points in its ε-neighborhood, changing their labels to border (P1, P3), except points that are themselves core (P11), which keep the core label.

Point   Status   Cluster label
P1      Border   C1
P2      Core     C1
P3      Border   C1
P4      Noise    -
P5      Core     -
P6      Noise    -
P7      Noise    -
P8      Noise    -
P9      Noise    -
P10     Noise    -
P11     Core     C1
P12     Noise    -

Step 2: look for the next core point, P5. It has no cluster label, so form a new cluster C2 and include all points in its ε-neighborhood, changing their labels to border (P4, P6, P7, P8).

Point   Status   Cluster label
P1      Border   C1
P2      Core     C1
P3      Border   C1
P4      Border   C2
P5      Core     C2
P6      Border   C2
P7      Border   C2
P8      Border   C2
P9      Noise    -
P10     Noise    -
P11     Core     C1
P12     Noise    -

Step 3: look for the next core point, P11. It already has a cluster label (C1), so simply expand C1 by including all points in its ε-neighborhood, changing their labels to border (P10, P12). No core point ever reaches P9, so it remains noise.

Point   Status   Cluster label
P1      Border   C1
P2      Core     C1
P3      Border   C1
P4      Border   C2
P5      Core     C2
P6      Border   C2
P7      Border   C2
P8      Border   C2
P9      Noise    -
P10     Border   C1
P11     Core     C1
P12     Border   C1
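Continuing that check, a compact sketch of the cluster-formation step over the same neighbor lists reproduces the final assignment: C1 = {P1, P2, P3, P10, P11, P12}, C2 = {P4, P5, P6, P7, P8}, with P9 left as noise. This is our own illustration of the expansion step, not code from the slides; in this small example the border points are unambiguous, so a simple first-reached assignment matches the "closest core point" rule.

```python
# eps-neighborhoods from the exercise (each list excludes the point itself).
neighbors = {
    "P1": ["P2", "P10"],  "P2": ["P1", "P3", "P11"], "P3": ["P2", "P4"],
    "P4": ["P3", "P5"],   "P5": ["P4", "P6", "P7", "P8"],
    "P6": ["P5", "P7"],   "P7": ["P5", "P6"],        "P8": ["P5"],
    "P9": ["P12"],        "P10": ["P1", "P11"],
    "P11": ["P2", "P10", "P12"], "P12": ["P9", "P11"],
}
MIN_PTS = 4
core = {p for p, nbrs in neighbors.items() if len(nbrs) + 1 >= MIN_PTS}

cluster = {}          # point -> cluster id; points never reached stay noise
next_id = 0
for p in neighbors:
    if p not in core or p in cluster:
        continue                          # only unlabelled core points seed clusters
    next_id += 1
    cluster[p] = next_id
    frontier = [p]
    while frontier:                       # expand through density-connected points
        q = frontier.pop()
        for r in neighbors[q]:
            if r not in cluster:
                cluster[r] = next_id      # core or border point joins the cluster
                if r in core:
                    frontier.append(r)    # only core points keep expanding

for p in neighbors:
    print(p, f"C{cluster[p]}" if p in cluster else "Noise")
# Prints C1 for P1, P2, P3, P10, P11, P12; C2 for P4..P8; Noise for P9.
```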