Slides

advertisement
Presented by:
GROUP 7
Gayathri Gandhamuneni &
Yumeng Wang
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Validation
 Results
 Conclusions
 Future Work
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Validation
 Results
 Conclusions
 Future Work
PROBLEM STATEMENT
 Input:
Two different Clustering algorithms (DBScan, SatScan)
 Same Input Dataset
 Criteria of Comparison

 Output:

Result of Comparison – Data / Graph
 Constraints:
DBScan – No data about efficiency
 SatScan Software – 1 pre defined shape

 Objective:

Usage Scenarios – Which algorithm can be used where?
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Validation
 Results
 Conclusions
 Future Work
MOTIVATION/NOVELTY
 Different clustering algorithms
 Categorized into different types
 Comparisons
 Algorithms - Same category
 No Systematic way of comparison, Biased Comparisons
 No situation based comparison – Which to use where?
 No comparison betn. DBScan & SatScan
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Validation
 Results
 Conclusions
 Future Work
RELATED WORK
Comparison of Clustering
Algorithms
Same type of
Algorithms
Density Based
– DBScan &
OPTICS
Density Based
– DBScan &
SNN
Different type of
Algorithms
DBScan Vs
K-Means
Our Work –
DBSCan Vs
SatScan
K-means (Centroid Based) Vs
Hierarchical, Expectation Vs
Maximization (Distance Based)
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Validation
 Results
 Conclusions
 Future Work
PROPOSED APPROACH
 Our Approach:
 Two different types of Clustering algorithms
 DBScan
 SatScan
 Unbiased comparison
 Systematic – 3 factors & Same Input datasets
Shape of the cluster
 Statistical Significance
 Scalability

AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
KEY CONCEPTS - 1
 Clustering
 Task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more
similar to each other than to those in other groups.
 Data Mining, Statistical Analysis & many more fields
 Real world Application:
 Earthquake studies: Clustering observed earthquake
epicenters to identify dangerous zones
 Field Robotics: For robotic situational awareness to
track objects and detect outliers in sensor data
KEY CONCEPTS - 2
Types of Clustering Algorithms
………
Connectivity based /
Hierarchical
Core Idea Objects being
more related to
nearby objects
(distance) than
to objects
farther away
Centroid
Based
Core IdeaClusters are
represented by a
central vector,
which may not
necessarily be a
member of the
data set
Ex: K - Means
Distribution
Based
Core Idea Clusters can
be defined as
objects
belonging
most likely to
the same
distribution
Density
Based
Core Idea Clusters are
areas of
higher
density than
the
remainder of
the data set
KEY CONCEPTS - DBSCAN
 Density based Clustering
 Arguments
 Minimum number of Points – MinPts
 Radius - Eps
 Density = Number of Points within specified radius (Eps)
 Three types of Points
 Core Point – No. of points > MinPts within Eps
 Border point – No. of Points < MinPts within Eps but is
in neighborhood of a core point
 Noise point - Neither a core point nor a border point
EXAMPLE - DBSCAN
 Dataset 1 :
DBSCAN RESULTS - 1
DB Scan o/p on dataset1: Min-Neighbors=3, Radius = 5
150
Number of
Clusters = 36
100
50
0
-50
-100
-150
-150
-100
-50
0
50
100
150
DBSCAN RESULTS - 2
DB Scan o/p on dataset1: Min-Neighbors=7, Radius = 1
150
100
Number of clusters
=0
50
0
-50
-100
-150
-150
-100
-50
0
50
100
150
DBSCAN RESULTS - 3
DB Scan o/p on dataset1: Min-Neighbors=20,Radius =
20
150
Number of clusters
=4
100
50
0
-50
-100
-150
-150
-100
-50
0
50
100
150
KEY CONCEPTS - 3
 SaTScan – Spatial Scan Statistics
 Input:
 Dataset
 null hypothesis model
 Procedure:
 Pre-defined shape scanning window
 Variating size of the window
 Calculate likelyhood ratio => Most Likely clusters
 Test statistical significance (Monte Carlo Sampling, 1000 runs)
Significant/primary
 Output:
 Clusters with p-value
Insignificant/secondary
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
CHALLENGES
 Tuning parameters - DBScan
 Manual tuning to detect clusters
 Hard to set correct parameters
 Design of appropriate Datasets
 To demonstrate Criteria of Comparison
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
VALIDATION
 Experiment
 Assumptions based on theory
 Designing datasets and running experiment
 Able to validate them with results
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
CLUSTER SHAPE - DBSCAN Vs SatScan
CLUSTER SHAPE - DBSCAN Vs SatScan
STATISTICAL SIGNIFICANCE
DBScan k = 5, eps = 5
SATScan
100
100
90
90
80
80
70
70
60
60
50
50
40
40
30
30
20
20
10
10
0
0
10
20
30
40
50
60
70
80
90
100
CSR Dataset -1000 points
0
0
10
20
30
40
50
60
70
80
90
100
STATISTICAL SIGNIFICANCE
DBScan k = 2, eps = 10
SATScan
500
500
450
450
400
400
350
350
300
300
250
250
200
200
150
150
100
100
50
50
0
0
50
100
150
200
250
300
350
400
450
500
0
0
50
100
CSR Dataset - 2000 points
150
200
250
300
350
400
450
500
RUNTIME – Number of Points - DBScan
DBScan
3.5
3.252
3
2.5
Runtime/s
2.194
2
1.5
1.31
1.011
1
0.611
0.5
0.066 0.069
0
500
1000
0.384
1500
2000
2500
3000
3500
Number of Data Points
4000
4500
5000
RUNTIME – Number of Points - SATScan
SATScan
4500
4085
4000
3500
Runtime/s
3000
2566
2500
2000
1443
1500
965
1000
600
500
0
332
1
0
39
500
39
1000
1500
2000 2500 3000
Number of points
3500
4000
4500
5000
RUNTIME – Number of Clusters – DB Vs SAT
100
100
100
80
80
80
60
60
60
40
40
40
20
20
20
0
0
20
40
60
80
100
0
100
100
80
80
60
60
40
40
20
20
0
0
20
40
60
80
100
0
0
20
40
60
80
100
0
20
40
60
80
100
Datasets: 3000 points
0
0
20
40
60
80
100
RUNTIME – Number of Clusters – DBScan
DBScan
0.7
0.65
0.641
0.6
Runtime/s
0.55
0.533
0.5
0.45
0.434
0.4
0.346
0.35
0.302
1
1.5
2
2.5
3
3.5
Number of Clusters
4
4.5
5
RUNTIME – Number of Clusters – SATScan
SATScan
900
843
800
Runtime/s
700
686
600
549
500
432
400
325
300
1
1.5
2
2.5
3
3.5
Number of Clusters
4
4.5
5
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
CONCLUSIONS
S.N
o
1
Factor of Comparison
Number of clusters not
known beforehand
DBSCAN
SATSCAN
Yes
2
Shape: Data has different
shaped clusters
3
Runtime: How much time
to form clusters?
Less runtime
4
Scalability: How well it
scales when data size is
increased
Still manageable runtime - Runtime α Size, Number of
Curse of dimensionality
clusters
5
Statistical Significance:
How significant are the
clusters detected?
No significance factor
6
Noise: Is noise allowed or
should all points be in
Yes
Yes
No - Only 1 shape of clusters
(Circle, ellipse, rectangle.. )
More runtime
Iterative approach to detect
clusters and Monte Carlo
Sampling too
Yes
Significance is at the core
Yes
AGENDA
 Problem Statement
 Motivation / Novelty
 Related Work & Our Contributions
 Proposed Approach
 Key Concepts
 Challenges
 Validation
 Results
 Conclusions
 Future Work
FUTURE WORK
 Same project – Real World Datasets
 Run more instances of the experiments
 Control over parameters
 Compare with other types of clustering algorithms
QUESTIONS?
BACKUP SLIDE - 1
 DBSCAN requires two parameters: epsilon (eps) and minimum points
(minPts). It starts with an arbitrary starting point that has not been
visited. It then finds all the neighbor points within distance eps of the
starting point.
 If the number of neighbors is greater than or equal to minPts, a cluster is
formed. The starting point and its neighbors are added to this cluster and
the starting point is marked as visited. The algorithm then repeats the
evaluation process for all the neighbors recursively.
 If the number of neighbors is less than minPts, the point is marked as
noise.
 If a cluster is fully expanded (all points within reach are visited) then the
algorithm proceeds to iterate through the remaining unvisited points in
the dataset.
BACKUP SLIDE 2 -CONCLUSIONS
 DBScan Works
 Same density clusters
 Don’t know the number of clusters beforehand
 Different shaped clusters
 All points need not be in clusters – Noise concept is present
 DBScan doesn’t work
 Varying density clusters
 Quality of DBScan depends on – Epsilon – If Euclidean distance

High dimension data – Curse of dimensionality

TO DO
CLUSTER SHAPE - DBSCAN Results
150
5
30
100
20
0
50
10
0
-5
0
-50
-10
-10
-100
-150
-150
-20
-100
-50
0
50
100
150
-15
-15
40
15
30
10
20
-10
-5
0
5
10
15
-30
-30
-20
-10
0
10
20
20
40
60
80
100
100
80
5
60
10
0
0
40
-5
-10
-30
-30
20
-10
-20
-20
-10
0
10
20
30
-15
-10
-5
0
5
10
0
0
SHAPE - SATSCAN RESULTS
150
5
30
100
20
0
50
10
0
-5
0
-50
-10
-10
-100
-150
-150
-20
-100
-50
0
50
100
150
-15
-15
40
15
30
10
20
-10
-5
0
5
10
15
-30
-30
-20
-10
0
10
20
20
40
60
80
100
100
80
5
60
10
0
0
40
-5
-10
-30
-30
20
-10
-20
-20
-10
0
10
20
30
-15
-10
-5
0
p-value: 0.001
5
10
0
0
Download