Presented by: GROUP 7 – Gayathri Gandhamuneni & Yumeng Wang

AGENDA
- Problem Statement
- Motivation / Novelty
- Related Work & Our Contributions
- Proposed Approach
- Key Concepts
- Challenges
- Validation
- Results
- Conclusions
- Future Work

PROBLEM STATEMENT
- Input: two different clustering algorithms (DBSCAN, SaTScan), the same input dataset, and criteria of comparison
- Output: result of the comparison, as data and graphs
- Constraints: DBSCAN – no data about its efficiency; SaTScan software – only one pre-defined scanning-window shape
- Objective: usage scenarios – which algorithm can be used where?

MOTIVATION / NOVELTY
- Many different clustering algorithms, categorized into different types
- Existing comparisons cover algorithms of the same category
- No systematic way of comparing them, which leads to biased comparisons
- No situation-based comparison: which algorithm to use where?
- No comparison between DBSCAN and SaTScan

RELATED WORK
- Comparisons of algorithms of the same type:
  - Density based: DBSCAN & OPTICS
  - Density based: DBSCAN & SNN
- Comparisons of algorithms of different types:
  - DBSCAN vs. K-means
  - K-means (centroid based) vs. hierarchical and Expectation–Maximization (distribution based)
- Our work: DBSCAN vs. SaTScan

PROPOSED APPROACH
- Two different types of clustering algorithms: DBSCAN and SaTScan
- Unbiased comparison
- Systematic: three factors, same input datasets
  - Shape of the cluster
  - Statistical significance
  - Scalability

KEY CONCEPTS - 1: Clustering
- The task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups
- Used in data mining, statistical analysis, and many more fields
- Real-world applications:
  - Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
  - Field robotics: robotic situational awareness, to track objects and detect outliers in sensor data

KEY CONCEPTS - 2: Types of Clustering Algorithms
- Connectivity based / hierarchical – core idea: objects are more related to nearby objects (by distance) than to objects farther away
- Centroid based – core idea: clusters are represented by a central vector, which need not be a member of the data set (e.g., K-means)
- Distribution based – core idea: clusters are defined as objects most likely belonging to the same distribution
- Density based – core idea: clusters are areas of higher density than the remainder of the data set

KEY CONCEPTS - DBSCAN
- Density-based clustering
- Arguments: minimum number of points (MinPts) and radius (Eps)
- Density = number of points within the specified radius (Eps)
- Three types of points:
  - Core point: has at least MinPts points within Eps
  - Border point: has fewer than MinPts points within Eps, but lies in the neighborhood of a core point
  - Noise point: neither a core point nor a border point
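To make the parameters above concrete, here is a minimal sketch of running DBSCAN on a synthetic 2-D dataset with scikit-learn; the dataset and the eps / min_samples values are illustrative assumptions, not the settings used in our experiments.

```python
# Minimal DBSCAN sketch: eps plays the role of Eps, min_samples the role of MinPts.
# Dataset and parameter values are illustrative, not the experiment settings.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

db = DBSCAN(eps=1.5, min_samples=5).fit(X)
labels = db.labels_                      # label -1 marks noise points

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
n_border = int(np.sum((labels != -1) & ~core_mask))  # in a cluster, but not a core point

print(f"clusters: {n_clusters}, core: {core_mask.sum()}, "
      f"border: {n_border}, noise: {n_noise}")
```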
EXAMPLE - DBSCAN: Dataset 1
DBSCAN output on Dataset 1 for three parameter settings:
- DBSCAN RESULTS - 1: Min-Neighbors = 3, Radius = 5 → number of clusters = 36
- DBSCAN RESULTS - 2: Min-Neighbors = 7, Radius = 1 → number of clusters = 0
- DBSCAN RESULTS - 3: Min-Neighbors = 20, Radius = 20 → number of clusters = 4
[Figures: scatter plots of the resulting clusters for each parameter setting.]

KEY CONCEPTS - 3: SaTScan (Spatial Scan Statistics)
- Input: dataset and a null-hypothesis model
- Procedure:
  - Scan with a pre-defined-shape scanning window
  - Vary the size of the window
  - Calculate the likelihood ratio => most likely clusters
  - Test statistical significance (Monte Carlo sampling, 1000 runs)
- Output: clusters with p-values, labeled significant (primary) or insignificant (secondary)

CHALLENGES
- Tuning DBSCAN parameters: manual tuning is needed to detect clusters, and the correct parameters are hard to set
- Designing appropriate datasets to demonstrate each criterion of comparison

VALIDATION
- Form assumptions based on theory
- Design datasets and run the experiments
- Validate the assumptions against the results

CLUSTER SHAPE - DBSCAN vs. SaTScan
[Figures: cluster shapes detected by DBSCAN and by SaTScan on the same datasets.]

STATISTICAL SIGNIFICANCE
[Figure: DBSCAN (k = 5, eps = 5) vs. SaTScan on a CSR dataset of 1000 points.]
[Figure: DBSCAN (k = 2, eps = 10) vs. SaTScan on a CSR dataset of 2000 points.]

RUNTIME – Number of Points
[Figure: DBSCAN runtime vs. number of data points (500–5000); runtime grows from roughly 0.07 s to 3.25 s.]
[Figure: SaTScan runtime vs. number of points (500–5000); runtime grows from roughly 1 s to over 4000 s.]

RUNTIME – Number of Clusters (DBSCAN vs. SaTScan)
[Figure: five synthetic datasets of 3000 points each, used for the runtime-vs-number-of-clusters experiment.]
Number of clusters:    1      2      3      4      5
DBSCAN runtime (s):    0.302  0.346  0.434  0.533  0.641
SaTScan runtime (s):   325    432    549    686    843
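The DBSCAN runtime trend above could be reproduced with a simple timing loop along the following lines; this is only a sketch using scikit-learn's DBSCAN on assumed synthetic datasets, not the exact setup behind the figures, and the SaTScan runtimes were measured separately through the SaTScan software.

```python
# Sketch of the runtime-vs-number-of-points measurement for DBSCAN.
# Synthetic datasets and parameter values are assumptions for illustration.
import time
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

for n_points in range(500, 5001, 500):
    X, _ = make_blobs(n_samples=n_points, centers=4, cluster_std=2.0, random_state=0)
    start = time.perf_counter()
    DBSCAN(eps=2.0, min_samples=5).fit(X)
    elapsed = time.perf_counter() - start
    print(f"{n_points:5d} points -> {elapsed:.3f} s")
```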
CONCLUSIONS
1. Number of clusters not known beforehand – DBSCAN: yes; SaTScan: yes
2. Shape: data has differently shaped clusters – DBSCAN: yes; SaTScan: no, only one cluster shape (circle, ellipse, rectangle, ...)
3. Runtime: how much time to form clusters? – DBSCAN: less runtime; SaTScan: more runtime (iterative approach to detect clusters, plus Monte Carlo sampling)
4. Scalability: how well does it scale as data size increases? – DBSCAN: runtime stays manageable, but suffers from the curse of dimensionality; SaTScan: runtime grows with data size and number of clusters
5. Statistical significance: how significant are the detected clusters? – DBSCAN: no significance factor; SaTScan: yes, significance is at the core of the method
6. Noise: is noise allowed, or must all points be in clusters? – DBSCAN: yes, noise is allowed; SaTScan: yes

FUTURE WORK
- Apply the same comparison to real-world datasets
- Run more instances of the experiments
- Control over parameters
- Compare with other types of clustering algorithms

QUESTIONS?

BACKUP SLIDE - 1: DBSCAN Algorithm
DBSCAN requires two parameters: epsilon (eps) and minimum points (minPts). It starts from an arbitrary point that has not been visited and finds all neighbor points within distance eps of it. If the number of neighbors is greater than or equal to minPts, a cluster is formed: the starting point and its neighbors are added to the cluster, and the starting point is marked as visited. The algorithm then repeats the evaluation recursively for all the neighbors. If the number of neighbors is less than minPts, the point is marked as noise. Once a cluster is fully expanded (all points within reach have been visited), the algorithm moves on to the remaining unvisited points in the dataset. (A from-scratch sketch of this procedure appears at the end of these slides.)

BACKUP SLIDE - 2: Conclusions
Where DBSCAN works:
- Clusters of similar density
- The number of clusters is not known beforehand
- Differently shaped clusters
- Not all points need to belong to a cluster: the concept of noise is present
Where DBSCAN does not work:
- Clusters of varying density
- Quality depends on the choice of epsilon and the distance measure (e.g., Euclidean distance)
- High-dimensional data: curse of dimensionality

CLUSTER SHAPE - DBSCAN RESULTS
[Figure: DBSCAN output on several datasets containing differently shaped clusters.]

SHAPE - SATSCAN RESULTS
[Figure: SaTScan output on the same datasets; reported p-value: 0.001.]
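For completeness, the expansion procedure described in Backup Slide 1 can be written down as a short from-scratch sketch; the function and variable names are our own, and this is an illustration of the textbook algorithm, not the implementation used for the results above.

```python
# From-scratch sketch of the DBSCAN procedure in Backup Slide 1 (illustration only).
import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(X, i, eps):
    """Return the indices of all points within distance eps of point i."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0].tolist()

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNVISITED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:          # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        cluster_id += 1                        # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                           # expand the cluster through core points
            j = seeds.pop()
            if labels[j] == NOISE:             # border point reachable from a core point
                labels[j] = cluster_id
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:    # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels                              # cluster ids start at 1; -1 marks noise
```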