Spatial Data Mining, CS 697, Assignment 1
February 16, 2010
Pradnya Khutafale, Peter Lucas, and Chris Maio
Advisor: Dr. Wei Ding
Computer Science Department, UMass Boston

Discovery of Climate Indices using Clustering
Principal Investigators:
- Vipin Kumar (University of Minnesota)
- Michael Steinbach (University of Minnesota)
Collaborators:
- Steven Klooster (Cal. State Univ., Monterey Bay)
- Christopher Potter (NASA Ames Research Center)
- Pang-Ning Tan (Michigan State University)

Researchers: Department of Computer Science and Engineering
- Michael Steinbach, Pang-Ning Tan, Vipin Kumar
- Leading educators in the field of spatial data mining
- Investigating the use of data mining techniques to find interesting spatiotemporal patterns in Earth Science data
- Regarded as leaders in climate index identification and data mining research

Researchers: NASA Ames Research Center team members
- Chris Potter, Steven Klooster
- Working on cutting-edge computer science methods and technologies for finding solutions to complex environmental problems
Presentation Outline
- Background: Climate Change, Earth System Linkages (Chris)
- Earth Science Data and Climate Indices (Chris)
- Existing Eigenvalue Techniques and Limitations (Pete)
- New Clustering Based Methodology (Pete)
- Results and Comparisons (Pradnya)
- Conclusions and Future Research (Pradnya and Pete)

Background: Climate Change
- Extinctions of plants and animals
- Sea-level rise
- Rise in global temperatures
- IPCC predictions

Background: Climate Change Impacts
- Climate change leads to significant changes in rainfall and soil moisture (drought and flood)
- Agricultural activities (crop growth cycle) and world food supplies are strongly affected by climatic factors (desertification)
- Climate change increases the frequency and intensity, and shifts the distribution, of natural hazards such as hurricanes and other storms

Background: Earth System Linkages
- Ocean, atmosphere, and land processes are highly coupled
- Climate phenomena in one location can affect the climate at a faraway location; these links are known as climate teleconnections
- Understanding climate “teleconnections” is key to understanding and predicting ecosystem response to climate change

Earth Science Data: Time Series Data
- Sea Surface Temperature (SST)
- Sea Level Pressure (SLP)

Earth Science Data: Data Acquisition
- Thousands of floats, buoys, and other remote sensing devices throughout the oceans collect enormous amounts of oceanographic data, which are periodically transmitted to shore via satellite (Naval Research Laboratory)

Earth Science Data: Preprocessing Required
The spatial and temporal nature of the data poses a number of challenges:
- Noisy
- Cycles of varying lengths and regularity
- Strong seasonal component
- Long-term trends
- Temporal and spatial autocorrelation

Climate Indices
- Climate indices are time series that summarize the physical behavior of different regions of the ocean and atmosphere
- They distill climate variability at a regional or global scale into a single, manageable time series
- Usually based on sea level pressure and sea surface temperature
- Past methods of identifying indices were painstakingly slow and tedious

Climate Indices: Climate Index Nino 1+2

Climate Indices: El Nino Correlations
- SST of El Nino correlated indices

Climate Indices: Detection of Climate Indices
- Earth scientists have devoted a significant amount of time to discovering climate indices
- Traditional approaches include direct observation of climate phenomena (El Nino)
- Linear algebra techniques, including eigenvalue analysis, are also used

Climate Indices: Eigenvalue Analysis
- Driven by the massive amount of data obtained from satellites and remote sensing devices
- Provides a way to quickly and automatically detect patterns in large amounts of data
- Figure: Jason-2 IR satellite image

Climate Indices: Eigenvalue Analysis
Eigenvalue techniques include:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
Limitations of eigenvalue analysis:
- Weaker signals may be masked by stronger signals
- All discovered signals must be orthogonal to each other, which makes it difficult to attach a physical interpretation to them

Climate Indices: Alternative Clustering Methodology
- Uses data mining techniques and the enormous amount of remote sensing data to find climate indices
- Analysis yields clusters that represent ocean regions with relatively homogeneous behavior
- The centroids of these clusters summarize the behavior of a particular region
- Finding “meaningful” clusters will enable Earth scientists to better predict changes in the climate system

Climate Indices: Benefits of Clustering
- Discovered signals do not need to be orthogonal or statistically independent of one another
- Signals are more easily interpreted
- Weaker signals are more readily detected
- Provides an efficient way to determine the influence of one large set of points (all ocean points) on another large set of points (all land points)

Climate Indices: Results of Clustering Methodology
- Candidate indices that are highly correlated with known indices represent a rediscovery of well-known indices and validate the method
- Variants of well-known indices may be better predictors of land behavior for some regions of land
- Cluster centroids that have medium or low correlation with known indices may represent new Earth science phenomena

Eigenvalue Techniques: Finding Spatial or Temporal Patterns using SVD Analysis
- SVD: Singular Value Decomposition
- Earth scientists have typically used SVD analysis to identify climate indices
- Goal: find a new set of attributes that better describes the variability in the data, through dimensionality reduction
- Its operation can be thought of as revealing the internal structure of the data in the way that best explains the variance in the data
- Photo: Karl Pearson, statistician, 1857 – 1936

Eigenvalue Techniques: Overview of SVD Analysis
- These techniques are applied to a data set in the form of an m by n data matrix: m rows (objects), n columns (attributes)
- Data matrix: a variation of record data in which all attributes are numeric
- Figure: example of a data matrix

Eigenvalue Techniques: Overview of SVD Analysis
- Assume the data objects in a matrix all have the same fixed set of attributes
- Each data object can be thought of as a point, or vector, in multidimensional space
- Each dimension represents a distinct attribute describing the object

Simple Example of SVD Analysis
- An intuitive explanation of SVD is hard to find on the web
- Again, SVD is a way to expose the underlying structure of a matrix
- Simple example using golf: 3 golfers play 9 holes, each scoring par on every hole
- How do we predict the score of a player on a given hole?
- Assume two vectors, Player Ability and Hole Difficulty
- Predicted score = Player Ability * Hole Difficulty
- Hole Difficulty is the left singular vector
- Player Ability is the right singular vector

Eigenvalue Techniques: Finding Spatial or Temporal Patterns using SVD Analysis
Given a data matrix whose rows consist of time series from various points on the globe, the objective is to discover the strong temporal or spatial patterns in the data.
SVD decomposes the matrix into two sets of patterns: a set of spatial patterns (left singular vectors) and a set of temporal patterns (right singular vectors).
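The decomposition described above can be illustrated with a minimal sketch. It assumes a tiny made-up data matrix (4 grid points by 4 time steps) built from one strong and one weak pattern, and it uses plain power iteration to extract only the strongest singular triplet. The function name `leading_singular_triplet` and all data values are hypothetical, and this is not necessarily the procedure used in the study.

```python
import math

def leading_singular_triplet(a, iterations=100):
    """Power iteration for the largest singular value of a (a list of rows)
    and its left (u) and right (v) singular vectors."""
    rows, cols = len(a), len(a[0])
    # Starting guess; it must not be orthogonal to the true right singular vector.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iterations):
        # u is proportional to A v
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v is proportional to A^T u; its length converges to the singular value
        v = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

# Hypothetical data: rows are grid points, columns are time steps.
strong_space, strong_time = [1, 1, 1, 1], [1, 2, 1, 0]
weak_space, weak_time = [1, -1, 1, -1], [1, -1, 1, 0]
data = [[2 * strong_space[i] * strong_time[j] + weak_space[i] * weak_time[j]
         for j in range(4)] for i in range(4)]

sigma, spatial_pattern, temporal_pattern = leading_singular_triplet(data)
print("strength:", round(sigma, 3))
print("spatial pattern (left singular vector):", [round(x, 3) for x in spatial_pattern])
print("temporal pattern (right singular vector):", [round(x, 3) for x in temporal_pattern])
```

The recovered spatial pattern is uniform across the four grid points and the temporal pattern follows the strong time series, while the weaker pattern is not returned. Further patterns could be obtained by subtracting sigma * u * v^T from the matrix and repeating.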
We can plot the temporal patterns as regular line plots and the spatial patterns on a spatial grid to visualize them.

Eigenvalue Techniques: Example, Plotting SST (Sea Surface Temperature)
- Figure: strongest spatial pattern of SST
- Figure: temporal pattern of SST (blue) plotted against the NINO4 index (green)

Eigenvalue Techniques: Limitations of SVD Analysis
- Only useful for finding a few of the strongest signals
- Smaller patterns in the data may be obscured
- Signals must be orthogonal to each other (uncorrelated)
- May not identify all patterns in the data
- Efficiency can be a concern

Clustering Methods: Clustering Based Methodology for the Discovery of Climate Indices
Two key steps for finding climate indices:
1. Find candidate indices using clustering
2. Evaluate these candidate indices for Earth Science significance
Clustering method used for this study: the SNN (Shared Nearest Neighbor) clustering algorithm

Clustering Methods: Finding Candidate Indices Using Clustering
The SNN clustering algorithm:
- First finds the nearest neighbors of each data point
- Next, redefines the similarity between a pair of points in terms of how many nearest neighbors the two points share
- Using this definition of similarity, identifies core points
- Uses these core points to build clusters
- Stated time complexity of SNN algorithms: O(n log n); this assumes an efficient nearest-neighbor search, since a brute-force search is O(n^2)
- Figure: graph of the functions n log n and n

Clustering Methods: Evaluation of Candidate Indices
- Indices must be evaluated in terms of Earth Science significance, meaning the strength of the association between the behavior of a candidate index and land climate
- The goal is a numerical measure of that strength of association
- To evaluate the influence of climate indices on land, the researchers use area-weighted correlation
- Definition: the weighted average of the correlation of the candidate index with all land points, where the weight is based on the area of the land grid point

Clustering Methods: Calculating Area-Weighted Correlation
- Step 1: Compute the correlation of the time series of the candidate index with the time series, over the same period, associated with each land point
- Step 2: Compute the weighted average of the correlations, where the weight associated with each land point is its area
- The resulting area-weighted correlation ranges from 0 to 1
- General formula for a weighted average: $\bar{M} = \sum_c W_c M_c / \sum_c W_c$, where $W_c$ is the weight of each value and $M_c$ is a value to average
- Figure: general correlation index.
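The two steps above can be sketched on made-up data. This is a simplified illustration, not the researchers' implementation. It computes only the shared-nearest-neighbor similarity (core-point detection and cluster building are omitted), it uses Pearson correlation as the underlying similarity, and it averages absolute correlations, which is an assumption chosen to match the stated 0 to 1 range. All function names, series, and area weights are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def k_nearest(series, k):
    """For each series, the set of indices of its k most correlated other series."""
    neighbors = []
    for i, s in enumerate(series):
        others = [j for j in range(len(series)) if j != i]
        others.sort(key=lambda j: pearson(s, series[j]), reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Step 1 building block: number of nearest neighbors that i and j share."""
    return len(neighbors[i] & neighbors[j])

def area_weighted_correlation(index, land_series, areas):
    """Step 2: weighted average of |correlation| with each land point (weight = area)."""
    weighted = sum(w * abs(pearson(index, s)) for s, w in zip(land_series, areas))
    return weighted / sum(areas)

# Hypothetical ocean time series: three resemble pattern A, three resemble pattern B.
ocean = [
    [0, 1, 2, 1, 0, -1, -2, -1],
    [0.1, 1.0, 1.9, 1.1, 0.0, -1.0, -2.1, -0.9],
    [0, 1.2, 2, 0.8, 0, -1, -2, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1.1, -1, 0.9, -1, 1, -1.1, 1, -0.9],
    [1, -0.8, 1, -1.2, 1, -1, 1, -1],
]
neighbors = k_nearest(ocean, k=2)
print("shared neighbors, same region:", snn_similarity(neighbors, 0, 1))
print("shared neighbors, different regions:", snn_similarity(neighbors, 0, 3))

# Candidate index: centroid (pointwise mean) of the first three series.
centroid = [sum(col) / 3 for col in zip(*ocean[:3])]

# Hypothetical land points and grid-cell areas.
land = [[2 * x + 3 for x in ocean[0]], [-x for x in ocean[0]], ocean[3]]
areas = [1.0, 1.0, 2.0]
awc = area_weighted_correlation(centroid, land, areas)
print("area-weighted correlation of the candidate index:", round(awc, 3))
```

In this toy data, series from the same region share a neighbor while series from different regions share none, which is the signal SNN uses to group points. The candidate index is then scored against land points with larger grid cells counting more.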
On the general correlation index, 1 is the strongest.

Clustering Methods: Comparison of Area-Weighted Correlations
- A baseline is developed for comparing the area-weighted correlations of candidate indices
- Figure: histogram of the area-weighted correlation of 1000 random time series
- No random time series has an area-weighted correlation above 0.1
- This value is the baseline: it indicates whether a candidate index is a good one

Clustering Methods: Validation of Comparison Baseline
- Figure: area-weighted correlations of 11 well-known climate indices
- 10 of the 11 known indices have an area-weighted correlation above 0.1
- If a candidate index shows an area-weighted correlation above 0.1, investigate it

Results: SST Based Candidate Indices
- Applied SNN clustering to SST data covering 1958 to 1998
- Obtained 107 clusters
- Cluster centroids were used to categorize clusters into groups G0, G1, G2, and G3 according to their correlation with known indices

Results: 107 Sea Surface Temperature (SST) Clusters
- Find the correlation with known indices such as SOI and NINO1+2
- Find the area-weighted correlation with land

Results: SST Cluster Correlation
- Figure: correlation of known indices with SST cluster centroids and SVD components

Results: G0, clusters with correlation to known indices >= 0.8
- Very highly correlated
- Rediscovered well-known indices: NINO 1+2, NINO 3, NINO 3.4, NINO 4
- Serve to validate the approach

Results:
G0: SST Cluster Correlation
- Figure: correlation of known indices with SST cluster centroids and SVD components

Results: G1, clusters with correlation to known indices from 0.4 to 0.8
- Figure: Cluster 29 vs. El Nino indices

Results: G2, clusters with correlation to known indices from 0.25 to 0.4
- Less correlated
- May represent new Earth science phenomena
- May be a new index
- Figure: Cluster 62 vs. El Nino indices

Results: G3, clusters with correlation to known indices <= 0.25
- Less correlated
- May represent new Earth science phenomena or a weaker version of known phenomena
- New index

Results: SLP-Based Candidate Indices
- SLP data covering 1958 to 1998
- 25 clusters found
- Candidate indices are formed from the differences of pairs of cluster centroids
- Pairs with negative correlation are the interesting candidates
- Figure: 25 sea level pressure based clusters

Results: SLP Clusters Pairwise Correlation
- Note: only negative correlation values are shown

Comparisons: Comparison with SVD Based Indices
- Correlation of cluster centroids with land temperature
- Correlation of the first 30 SVD components with land temperature

Comparisons: SST Clusters Performance Comparison
- Correlation of known indices with SST cluster centroids and SVD components

Comparisons: SLP Clusters Performance Comparison
- Area-weighted correlation of known indices with SLP cluster centroids and SVD components
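The grouping of SST clusters into G0 through G3 reported in the Results can be sketched as a simple threshold rule. This sketch assumes each cluster is scored by its highest absolute correlation with any known index, and that boundary values fall as coded (0.8 in G0, 0.4 in G1, 0.25 in G3); the slides do not spell out either point. The cluster names and correlation values below are invented for illustration and are not results from the study.

```python
def group_for(max_corr):
    """Assign a cluster to G0-G3 from its highest absolute correlation
    with any known index (boundary handling is an assumption)."""
    if max_corr >= 0.8:
        return "G0"  # rediscovered well-known index
    if max_corr >= 0.4:
        return "G1"  # possible variant of a known index
    if max_corr > 0.25:
        return "G2"  # less correlated, possibly a new index
    return "G3"      # possibly new or weaker phenomenon

# Hypothetical correlations of three cluster centroids with known indices.
correlations = {
    "cluster_a": {"NINO 3": 0.91, "SOI": -0.62},
    "cluster_b": {"NINO 3": 0.35, "SOI": -0.52},
    "cluster_c": {"NINO 3": 0.10, "SOI": -0.05},
}

groups = {
    name: group_for(max(abs(r) for r in by_index.values()))
    for name, by_index in correlations.items()
}
for name, group in groups.items():
    print(name, "->", group)
```

Taking the absolute value means a strongly negative correlation, such as one with SOI, also counts as a match to a known index.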
Conclusions
- Demonstrated that clustering is a viable alternative to eigenvalue-based approaches for the discovery of climate indices
- Can replicate many well-known climate indices
- Have also discovered variants of known indices that may be “better” for some regions
- Some indices may represent new Earth Science phenomena
- No need for discovered indices to be orthogonal
- No need to pre-select the area to analyze

Future Work
- Investigation of candidate indices by Earth scientists
- Investigate whether there are climate indices that cannot be represented by clusters
- Noise elimination and other preprocessing improvements
- Aggregation

Questions?