Data Mining algorithm
Professor
Dr Veljko Milutinović
Student
Milan Micić
2011/3323 milan.z.micic@gmail.com
School of Electrical Engineering, University of Belgrade
Department of Computer Engineering
• Introduction
• The DBSCAN basic idea
• Algorithm
• DBSCAN in R
• Example
• Advantages
• Disadvantages
• References
2/13
• Data clustering algorithms
• Used in machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics
• Hierarchical, centroid-based, distribution-based, density-based, etc.
3/13
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Munich, 1996
• Derived from the natural human approach to clustering
• Input parameters
  • The size of the epsilon neighborhood – ε
  • Minimum points in cluster – MinPts
• The neighborhood of a given radius ε has to contain at least a minimum number of points, MinPts
4/13
• Directly density-reachable, p_1 from p_2
  1. p_1 belongs to the ε-neighborhood of p_2
  2. p_2's neighborhood contains at least MinPts points
• Density-reachable, p_n from p_0
  • There exists a chain of points p_1, ..., p_{n-1} between them, where each p_{i+1} is directly density-reachable from p_i
• Core, border and noise points
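The definitions above can be turned into a small point classifier. This is a minimal sketch, not a reference implementation: the function names `region_query` and `classify_points`, the Euclidean distance, and the sample data are assumptions made for illustration. A point is treated as core when its ε-neighborhood (including itself) holds at least MinPts points, as border when it is not core but lies in the neighborhood of a core point, and as noise otherwise.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise' (illustrative helper)."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, n in enumerate(neighborhoods) if len(n) >= min_pts}
    labels = []
    for i, n in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in n):
            # Not core itself, but directly density-reachable from a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical sample: a small square, one point hanging off it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
print(labels)
```

With these sample values the four square corners come out as core points, (2, 1) as a border point, and (8, 8) as noise.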
5/13
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = regionQuery(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = regionQuery(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C

• Complexity with an indexing structure: O(n*log(n))
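The pseudocode can be sketched in runnable form as follows. This is an illustrative version under stated assumptions: points are coordinate tuples, distance is Euclidean, clusters are numbered from 1, and noise is labeled -1 (the constant `NOISE` and the function names are choices made here, not part of the slide). The region query is a brute-force scan, so this sketch runs in O(n^2) rather than the O(n*log(n)) that an indexing structure would allow.

```python
import math

NOISE = -1  # assumed label for points that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    visited = [False] * len(points)
    cluster = 0
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE
            continue
        cluster += 1
        labels[p] = cluster
        # expandCluster: 'seeds' plays the role of N and grows while we scan it.
        seeds = list(neighbors)
        in_seeds = set(neighbors)
        k = 0
        while k < len(seeds):
            q = seeds[k]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_pts:
                    # "N = N joined with N'": add only points not already queued.
                    for j in q_neighbors:
                        if j not in in_seeds:
                            in_seeds.add(j)
                            seeds.append(j)
            if labels[q] is None or labels[q] == NOISE:
                # Not yet a member of any cluster; earlier "noise" becomes border.
                labels[q] = cluster
            k += 1
    return labels

# Hypothetical sample: two compact groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
result = dbscan(points, eps=1.5, min_pts=3)
print(result)
```

On this sample the two groups receive labels 1 and 2, and the isolated point (5, 5) is labeled as noise.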
6/13
• fpc - Flexible Procedures for Clustering
• GNU General Public License
• Various methods for clustering and cluster validation
• Interface functions for many methods implemented in the R language
• DBSCAN: O(n^2)
• dbscan(x,0.2,showplot=2)

dbscan Pts=600 MinPts=5 eps=0.2
         0  1  2  3  4  5  6  7  8  9 10 11
seed     0 50 53 51 52 51 54 54 54 53 51  1
border  28  4  4  8  5  3  3  4  3  4  6  4
total   28 54 57 59 57 54 57 58 57 57 57  5
7/13
• Astronomy task
• Identifying celestial objects by capturing the radiation they emit
• Captured noise (from sensors, diffuse emission from the atmosphere and space itself)
• Elimination method – constrain the relevant intensity by a known threshold
• In this case – only pixels whose intensity is less than 50 (and consequently darker) are considered
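The thresholding step can be illustrated with a short sketch. It assumes the image is a grayscale grid stored as a list of rows with integer intensities; the function name `dark_pixels` and the sample values are hypothetical. Only the cutoff of 50 comes from the slide.

```python
def dark_pixels(image, threshold=50):
    """Return (row, col) coordinates of pixels whose intensity is below threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value < threshold]

# Hypothetical 3x4 grayscale image.
image = [
    [200, 190,  30, 210],
    [ 45,  20, 220, 230],
    [240,  49,  50, 255],
]
coords = dark_pixels(image)
print(coords)
```

The resulting coordinate list is the kind of input on which a clustering step would then operate; note that a pixel with intensity exactly 50 is excluded, since the condition is strictly "less than 50".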
8/13
• DBSCAN algorithm applied to individual pixels
• Linking together a complete emission area
• Each generated cluster defines a celestial entity
• ε = 5, MinPts = 5: 64 clusters and 224 outliers found
9/13
• Choosing appropriate parameters ε and MinPts
  • Numerous experiments indicate that MinPts = 4 works best
• Clustering datasets with large differences in density
• “Curse of dimensionality”
  • Affects every algorithm based on the Euclidean distance on high-dimensional data sets
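One heuristic for picking ε, described in the original DBSCAN paper by Ester et al., is to fix k = MinPts = 4, compute each point's distance to its k-th nearest neighbor, sort these values in descending order, and look for the first "valley" in the resulting curve. The sketch below only computes the sorted k-distances; the function name `sorted_k_distances`, the brute-force search, and the sample points are assumptions for illustration, and reading ε off the curve is left to the user.

```python
import math

def sorted_k_distances(points, k=4):
    """Distance from each point to its k-th nearest neighbor, sorted descending.

    Requires more than k points. Brute force: O(n^2 log n).
    """
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)

# Hypothetical sample: a dense group of five points and one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
k_dists = sorted_k_distances(points, k=4)
print(k_dists)
```

In this sample the outlier produces one very large value at the front of the list, followed by a sharp drop to the values of the dense group; an ε chosen just after that drop would separate noise from cluster points.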
10/13
• Does not require the number of clusters in the data a priori
• Can find arbitrarily shaped clusters
  • Even clusters completely surrounded by a different cluster
• Mostly insensitive to the ordering of the points in the database
  • Only border points might swap cluster membership
• Has a notion of noise
• Requires just two parameters
11/13
• Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Institute for Computer Science, University of Munich, 1996
• Mehmed Kantardzic: “Data Mining: Concepts, Models, Methods, and Algorithms”, 2011
• Wikibooks: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
• Wiki: http://en.wikipedia.org/wiki/DBSCAN
12/13
Milan Micić milan.z.micic@gmail.com
13/13