Density-Based Clustering: DBSCAN Algorithm Explained

CCCS323
Machine Learning
DR ELHAM ALGHAMDI
Clustering
Density-Based Clustering
[Figure: spherical-shape clusters vs. arbitrary-shape clusters]
When traditional clustering techniques such as K-Means or hierarchical clustering are applied to tasks with arbitrarily shaped clusters, or clusters within clusters, they might not achieve good results: elements in the same cluster might not share enough similarity, or the performance may be poor.
Density-Based Clustering
 Partitioning-based algorithms such as K-Means may be easy to understand and implement in practice, but they have no notion of outliers: all points are assigned to a cluster even if they do not belong in any.
 In the domain of anomaly detection this causes problems, as anomalous points will be assigned to the same cluster as normal data points.
 The anomalous points pull the cluster centroid towards them, making it harder to classify them as anomalous.
 In contrast, density-based clustering locates regions of high density that are
separated from one another by regions of low density.
K-means vs. Density-Based Clustering
K-means assigns all points to a cluster even if they do not belong in any.
Density-based clustering locates regions of high density and separates outliers.
Density-Based Clustering
 In density-based clustering we partition points into dense regions separated by not-so-dense regions. Each cluster has a considerably higher density of points than the area outside the cluster.
 A density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
 DBSCAN is a density-based clustering algorithm; the name stands for “Density-Based Spatial Clustering of Applications with Noise”.
Several interesting studies:
 DBSCAN: Ester et al. (KDD’96)
 OPTICS: Ankerst et al. (SIGMOD’99)
 DENCLUE: Hinneburg & Keim (KDD’98)
 CLIQUE: Agrawal et al. (SIGMOD’98)
DBSCAN Clustering
Important Questions:
 How do we measure density?
 What is a dense region?
Two parameters:
 ε (epsilon): the maximum radius of the neighborhood around a point p.
 ε-neighborhood of p: the points within a radius of ε from p:
Nε(p) = { q : dist(p, q) ≤ ε }
 MinPts: the minimum number of points required in the ε-neighborhood Nε(p) of a point p.
 Dense region: the ε-neighborhood Nε(p) of p contains at least MinPts points.
The density at a point p is high if |Nε(p)| ≥ MinPts.
The density at a point q is low if |Nε(q)| < MinPts.
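As a minimal sketch of these two definitions (the coordinates, eps, and min_pts values below are made-up illustrations, not data from the slides), the ε-neighborhood and the density test can be written directly in Python:

import numpy as np

def eps_neighborhood(X, i, eps):
    # Indices of all points within distance eps of point i (point i itself included)
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_dense(X, i, eps, min_pts):
    # Density at point i is "high" if |N_eps(i)| >= MinPts
    return len(eps_neighborhood(X, i, eps)) >= min_pts

# Toy data and parameters (assumed values, for illustration only)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0]])
print(eps_neighborhood(X, 0, eps=0.5))       # -> [0 1 2]
print(is_dense(X, 0, eps=0.5, min_pts=3))    # -> True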
DBSCAN Clustering
Characterization of points
 A point is a core point if it has at least a specified number of points (MinPts) within ε; these are points in the interior of a cluster (a dense region).
 A border point has fewer than MinPts points within ε, but it is in the neighborhood of a core point.
 A noise point is any point that is neither a core point nor a border point.
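A small sketch of this three-way labelling, using a naive pairwise-distance computation (the function and variable names are illustrative assumptions, not part of the slides):

import numpy as np

def classify_points(X, eps, min_pts):
    # Label every point as "core", "border" or "noise" following the definitions above
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]    # each point counts itself
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")   # not dense enough itself, but next to a core point
        else:
            labels.append("noise")
    return labels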
DBSCAN Clustering
Directly density-reachable: given ε and MinPts, a point q is directly density-reachable from a point p if:
q is within the ε-neighborhood of p, that is, q ∈ Nε(p), and
p is a core point, that is, |Nε(p)| ≥ MinPts.
Here, X is directly density-reachable from Y, but not vice versa.
DBSCAN Clustering
Density-reachable: given ε and MinPts, a point q is density-reachable from p if there is a chain of points q1, q2, ..., qn with q1 = p and qn = q such that qi+1 is directly density-reachable from qi for all 1 ≤ i ≤ n−1.
This is the transitive closure of “directly density-reachable”.
Here, X is density-reachable from Y: X is directly density-reachable from P2, P2 from P3, and P3 from Y. The inverse does not hold.
DBSCAN Clustering
Density-connected: given ε and MinPts, a point q is density-connected to a point p if there is a point o such that both p and q are density-reachable from o.
Here, both X and Y are density-reachable from O; therefore, X and Y are density-connected.
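These reachability notions can be traced with a short breadth-first expansion from core points. This is a sketch under assumed inputs: neighbors holds the ε-neighborhood of each point and core is the set of core points, as produced, for example, by the classification sketch above.

from collections import deque

def density_reachable_set(neighbors, core, p):
    # All points density-reachable from the core point p:
    # repeatedly follow "directly density-reachable" steps (a breadth-first search)
    reached, queue = {p}, deque([p])
    while queue:
        q = queue.popleft()
        if q not in core:            # only core points can reach further points
            continue
        for r in neighbors[q]:       # r is directly density-reachable from q
            if r not in reached:
                reached.add(r)
                queue.append(r)
    return reached

def density_connected(neighbors, core, x, y):
    # x and y are density-connected if some point o density-reaches both of them
    return any(x in density_reachable_set(neighbors, core, o) and
               y in density_reachable_set(neighbors, core, o)
               for o in core)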
DBSCAN Algorithm
 Classify the points as core, border and noise
 Eliminate noise points
 For every core point p that has not been assigned to a cluster
 Create a new cluster with the point p and all the points that
are density-connected to p.
 Assign border points to the cluster of the closest core point.
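In practice this procedure is usually run through a library implementation rather than coded by hand. A minimal sketch using scikit-learn's DBSCAN on a toy two-moons data set (the eps and min_samples values here are assumptions chosen for this example):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: an arbitrary-shape case that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)
labels = db.labels_                                  # one cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))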
DBSCAN: Heuristics for determining EPS and MinPts
The parameter ε is somewhat more important, as it determines what it means for points to be “close.”
 Setting ε to be very small will mean that no points are core points and may lead to all points being labeled as noise.
 Setting ε to be very large will result in all points forming a single cluster.
The MinPts setting mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters. If you increase MinPts, anything that would have been a cluster with fewer than MinPts points will now be labeled as noise. MinPts therefore determines the minimum cluster size.
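Both failure modes are easy to observe by sweeping eps on a toy data set (a sketch; the data set and the eps values are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

for eps in (0.01, 0.5, 50.0):                        # far too small, plausible, far too large
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
# tiny eps -> (almost) every point is noise; huge eps -> a single cluster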
DBSCAN: Heuristics for determining EPS and MinPts
 Smaller ε values: clusters will be denser, requiring data points to be closer to each other to form a cluster. This may result in smaller, more tightly packed clusters. However, setting ε to be very small will mean that no points are core points, and may lead to all points being labeled as noise.
 Larger ε values: clusters will be less dense, allowing data points to be farther apart while still forming a cluster. This can lead to larger and more loosely defined clusters. However, setting ε to be very large will result in all points forming a single cluster.
DBSCAN: Heuristics for determining EPS and MinPts
When you decrease the value of MinPts, you are essentially lowering the threshold for what qualifies as a core point.
With a smaller MinPts value:
 More points become eligible to be core points, because they require fewer neighbors within ε to meet the MinPts criterion. As a result, many smaller groups of data points that are close together can now be labeled as clusters.
 By decreasing MinPts you are effectively making it easier for points to be considered core points, and consequently fewer points are left as noise, because more of them become part of clusters.
With a larger MinPts value:
 Fewer points qualify as core points, because they need more neighbors within ε to meet the MinPts requirement.
 Only denser regions can form clusters, and points in sparser regions are more likely to be labeled as noise points (outliers).
 If the value is too large, smaller clusters will end up merged into larger clusters or labeled as noise.
DBSCAN: Heuristics for determining EPS and MinPts
 Suppose you have chosen MinPts = 4.
 For every data point, find the distance to its 4th nearest neighbor.
 This gives a distance array whose i-th entry is the distance to the 4th nearest neighbor of the i-th data point.
 Sort this distance array and plot it: the distance on the y-axis and the (sorted) point index i on the x-axis.
 Because the array is sorted, the distance increases with the index, and the curve shows an elbow. Cut the curve with a horizontal line at the elbow; the value where this line meets the y-axis is the choice of ε.
In the example plot, this gives eps = 10.
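The same k-distance plot can be produced with scikit-learn's NearestNeighbors (a sketch; the blob data set is an assumed stand-in for the slide's data, and MinPts = 4 as above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)   # stand-in data set

k = 4                                                 # MinPts from the slide
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])                         # distance to the k-th other point, sorted

plt.plot(k_dist)                                      # the y-value at the elbow is a candidate eps
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to 4th nearest neighbor")
plt.show()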
DBSCAN: Heuristics for determining EPS and MinPts
 The idea is that, for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
 Noise points have their k-th nearest neighbor at a farther distance.
In the example plot, this again gives eps = 10.
DBSCAN: Heuristics for determining EPS and MinPts
Choosing MinPts:
There is no automatic way to determine the MinPts value for DBSCAN. Ultimately, the MinPts value should be set using domain knowledge and familiarity with the data set. Here are a few rules of thumb for selecting the MinPts value:
• The larger the data set, the larger the value of MinPts should be.
• If the data set is noisier, choose a larger value of MinPts.
• Generally, MinPts should be greater than or equal to the dimensionality of the data set.
• For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).
• If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim is the dimensionality of your data set.
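Expressed as code, the last two rules amount to a small helper along these lines (a sketch; the helper name default_min_pts is made up, and the result should still be sanity-checked against the data):

def default_min_pts(X):
    # Rule-of-thumb MinPts: 4 for 2-dimensional data, otherwise 2 * dimensionality
    n_dims = X.shape[1]
    return 4 if n_dims <= 2 else 2 * n_dims

# e.g. DBSCAN(eps=chosen_eps, min_samples=default_min_pts(X))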
DBSCAN Clustering
Advantages:
 Does not require the user to set the number of clusters a priori.
 Can capture clusters of complex, arbitrary shapes.
 Can identify points that are not part of any cluster (noise).
Disadvantages:
 Struggles with clusters of varying density.
 Does not work well with high-dimensional data.
 Sensitive to its parameters.
When DBSCAN Works Well
[Figure: original points and the resulting clusters]
 Resistant to noise
 Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: original points and the clusterings found with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
 Struggles with varying densities
DBSCAN: Sensitive to Parameters
Recall: DBSCAN Algorithm
DBSCAN: exercise
ε-neighborhood of each point (the point itself is also counted towards MinPts):
P1: P2, P10
P2: P1, P3, P11
P3: P2, P4
P4: P3, P5
P5: P4, P6, P7, P8
P6: P5, P7
P7: P5, P6
P8: P5
P9: P12
P10: P1, P11
P11: P2, P10, P12
P12: P9, P11
DBSCAN: exercise
MinPts = 4

Point   Status
P1      Noise
P2      Core
P3      Noise
P4      Noise
P5      Core
P6      Noise
P7      Noise
P8      Noise
P9      Noise
P10     Noise
P11     Core
P12     Noise
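The Core/Noise column of this table can be checked mechanically from the neighbor lists. The sketch below assumes, as the slide's counts imply, that a point is counted as a member of its own ε-neighborhood:

neighbors = {
    "P1":  ["P2", "P10"],
    "P2":  ["P1", "P3", "P11"],
    "P3":  ["P2", "P4"],
    "P4":  ["P3", "P5"],
    "P5":  ["P4", "P6", "P7", "P8"],
    "P6":  ["P5", "P7"],
    "P7":  ["P5", "P6"],
    "P8":  ["P5"],
    "P9":  ["P12"],
    "P10": ["P1", "P11"],
    "P11": ["P2", "P10", "P12"],
    "P12": ["P9", "P11"],
}
MIN_PTS = 4

for p, nbrs in neighbors.items():
    size = len(nbrs) + 1                  # +1: the point itself is in its own eps-neighborhood
    print(p, "Core" if size >= MIN_PTS else "Noise")
# -> P2, P5 and P11 come out as Core; every other point starts as Noise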
DBSCAN: exercise
Start with a core point (P2). If it has no cluster label yet, form a new cluster (C1) and include all points in its ε-neighborhood. Change the label of these points to Border (P1, P3), except if a point is itself a core point (P11), in which case it keeps the Core label.
Point   Status           Cluster label
P1      Noise → Border   C1
P2      Core             C1
P3      Noise → Border   C1
P4      Noise
P5      Core
P6      Noise
P7      Noise
P8      Noise
P9      Noise
P10     Noise
P11     Core             C1
P12     Noise
DBSCAN: exercise
Look for the next core point (P5). Since it has no cluster label yet, form a new cluster (C2) and include all points in its ε-neighborhood; change the label of these points to Border (P4, P6, P7, P8).
Point   Status           Cluster label
P1      Noise → Border   C1
P2      Core             C1
P3      Noise → Border   C1
P4      Noise → Border   C2
P5      Core             C2
P6      Noise → Border   C2
P7      Noise → Border   C2
P8      Noise → Border   C2
P9      Noise
P10     Noise
P11     Core             C1
P12     Noise
DBSCAN: exercise
Look for the next core point (P11). Since it already has a cluster label (C1), just expand that cluster by including all points in its ε-neighborhood in C1, and change the label of these points to Border (P10, P12).
Point   Status           Cluster label
P1      Noise → Border   C1
P2      Core             C1
P3      Noise → Border   C1
P4      Noise → Border   C2
P5      Core             C2
P6      Noise → Border   C2
P7      Noise → Border   C2
P8      Noise → Border   C2
P9      Noise
P10     Noise → Border   C1
P11     Core             C1
P12     Noise → Border   C1
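For completeness, the final clustering of the exercise can be reproduced from the same neighbors dictionary and MIN_PTS value defined in the earlier sketch, by expanding each unassigned core point exactly as the algorithm slide describes:

from collections import deque

core = {p for p, nbrs in neighbors.items() if len(nbrs) + 1 >= MIN_PTS}   # P2, P5, P11
cluster, next_id = {}, 0

for p in sorted(core):                 # visit core points in a fixed order
    if p in cluster:
        continue                       # this core point was already absorbed into a cluster
    next_id += 1
    cluster[p] = next_id
    queue = deque([p])
    while queue:                       # grow the cluster through chains of core points
        q = queue.popleft()
        if q not in core:
            continue                   # border points join the cluster but do not expand it
        for r in neighbors[q]:
            if r not in cluster:
                cluster[r] = next_id
                queue.append(r)

print(cluster)   # P1-P3 and P10-P12 share one cluster, P4-P8 the other; P9 is never assigned (noise)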