Data Mining in Earth Sciences
Rahul Ramachandran, Sara Graves and Ken Keiser
Mathematical Challenges in Scientific Data Mining, IPAM, January 14-18, 2002
Information Technology and Systems Center, University of Alabama in Huntsville
http://datamining.itsc.uah.edu

Outline
• Introduction
• ADaM System
• Data Mining Taxonomy for Earth Science
• Event/Relationship based
• Application Examples
• Dimensionality Reduction
• References

Reasons for Data Mining of Earth Science Data
• Greatly increased data volume due to improvements in data collection, access, availability and storage technology (instruments, computational resources, internet…)
  • Terra data are about 1 terabyte per day, more than can be analyzed by conventional means
• High variability in data formats and content
• Need for high returns on expensive data investments
• Need for improved access to and availability of data, information and knowledge
• Need for higher level products for non-specialists and interdisciplinary/cross-domain researchers
• Questions/queries are getting more complex due, in part, to the heterogeneous nature of the data

Characteristics of Earth Science Data
• High variability
• Type:
  • Geostationary
  • Polar Orbiting
• Structure:
  • Raster
  • Vector
• Resolution:
  • Fine: AVHRR, 1 km
  • Coarse: SSM/I, 20 km
• Multi/Hyper Spectral
• Processing stage:
  • Level 0: Raw data (instrument counts)
  • Level 1: Annotated with geo-reference information
  • Level 2: Transformed by algorithm into geophysical parameter
  • Level 3: Spatial/temporal resampling
  • Level 4: Includes additional model data

Characteristics of Earth Science Data
• Need to know the physical basis (domain knowledge) before applying statistical techniques
• Multiple time scales
• Wide variety of data formats
• Includes spatial/temporal information
• Typically needs domain-specific algorithms

ADaM History
• Algorithm Development and Mining (ADaM) System
• The system provides knowledge discovery, feature detection and content-based searching for data values, as well
as for metadata.
• It contains over 120 different operations that can be performed on the input data stream.
• Operations range from algorithms specific to atmospheric science datasets to digital image processing techniques and processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms.
• An Event/Relationship Search System was developed for the environment.

ADaM Engine Architecture
Data flow: Data, Translated Data, Preprocessed Data, Patterns/Models, Results
Processing stages: Input, Preprocessing, Analysis, Output

• Input: HDF, HDF-EOS, GIF, PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII, Grass Vectors (ASCII Text), Intergraph Raster, Others...
• Preprocessing:
  • Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
  • Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
  • Image Processing: Cropping, Inversion, Thresholding
  • Others...
• Analysis:
  • Clustering: K Means, Isodata, Maximum
  • Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
  • Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
  • Genetic Algorithms
  • Neural Networks
  • Others...
• Output: GIF Images, HDF-EOS, HDF Raster Images, HDF SDS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, Others...
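The staged architecture above (input, preprocessing, analysis, output) can be pictured as a chain of small operations, each consuming the output of the previous one. The sketch below is only an illustration of that idea in Python. The function names `subset`, `threshold` and `run_pipeline` are hypothetical and are not ADaM's actual API; the grid values are made up.

```python
def subset(grid, row_range, col_range):
    """Selection step: keep rows and columns inside half-open index ranges."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [row[c0:c1] for row in grid[r0:r1]]


def threshold(grid, cutoff):
    """Image-processing step: 1 where the value exceeds cutoff, else 0."""
    return [[1 if value > cutoff else 0 for value in row] for row in grid]


def run_pipeline(data, steps):
    """Apply each (function, keyword-arguments) step to the previous output."""
    for func, kwargs in steps:
        data = func(data, **kwargs)
    return data


# Made-up 3x4 grid standing in for a translated input data set.
grid = [
    [1, 5, 9, 2],
    [7, 8, 3, 4],
    [6, 2, 9, 9],
]

# Hypothetical mining plan: subset first, then threshold.
steps = [
    (subset, {"row_range": (0, 2), "col_range": (1, 4)}),
    (threshold, {"cutoff": 4}),
]

result = run_pipeline(grid, steps)
print(result)
```

Keeping each operation a plain function with keyword arguments makes it easy to reorder steps or add new ones, which is the property the modular engine diagram suggests.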
ADaM Mining Environment
[Diagram labels: Distributed Clients (Web-based, Workstation based, Other Systems, Analysis/Vis Tools); Data Mining Server (Common Client API, Knowledge Base, Mining Engine (ADaM) with Input Modules, Analysis Modules and Output Modules); Data Stores; Mining Results; Event/Relationship Search System]

Data Mining Taxonomy
• Event-based mining
• Relationship-based mining

Event-based Mining
• Known events/Known algorithms
  • Tropical Cyclones from AMSU-A data
• Known events/Learned algorithms
  • Rainfall estimation from SSM/I data
  • Lightning Detection from OLS data
• Unknown event/Unknown algorithm
  • Target Independent Mining

Known Event/Known Algorithm
I know what phenomena to detect and I have the algorithm to do so!
[Diagram labels: Earth Science Data Sets; Add algorithm to Mining Environment; Results]
Results:
• Relationship analysis
• Coincidence searches
• Input for other algorithms

Tropical Cyclone Detection: Estimating Maximum Wind Speed
• Scientist: Dr. Roy Spencer (GHCC/MSFC NASA)
• Data used: Advanced Microwave Sounding Unit-A
• Radiometer can detect temperatures at different levels of the atmosphere
• Surface winds in tropical cyclones are directly related to the warm middle- and upper-atmosphere temperatures which exist around the cyclone center
• AMSU-A measures this warmth at several frequencies near 55 gigahertz (GHz)
• Calibrated using aircraft reconnaissance measurements in tropical depressions, tropical storms, and hurricanes from the 1998 Atlantic hurricane season
• Tropical cyclone detection based on ice scattering, water vapor and wind speed

Tropical Cyclone Detection: Estimating Maximum Wind Speed
Processing steps:
• Advanced Microwave Sounding Unit (AMSU-A) data
• Calibration / Limb Correction / Converted to Tb
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
[Figure: Hurricane Floyd]
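The last few processing steps above (mask out land, find contiguous regions above a wind-speed threshold, report the maximum wind speed and its location) can be sketched as follows. This is a minimal illustration under stated assumptions, not the science algorithm used in the study: the wind field is assumed to be already estimated, the function name `find_wind_regions` is hypothetical, regions are taken to be 4-connected, and the `min_cells` size filter is only a stand-in for the unspecified false-positive test. All numbers are made up.

```python
from collections import deque


def find_wind_regions(wind, water_mask, threshold, min_cells=1):
    """Find contiguous over-water regions with wind speed above threshold.

    wind and water_mask are equally sized lists of rows. Returns a list of
    (max_speed, (row, col), cell_count) tuples, one per 4-connected region.
    min_cells is a hypothetical stand-in for the false-positive test.
    """
    rows, cols = len(wind), len(wind[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not water_mask[r][c] or wind[r][c] <= threshold:
                continue
            # Breadth-first flood fill over one contiguous region.
            queue = deque([(r, c)])
            seen[r][c] = True
            best, best_loc, count = wind[r][c], (r, c), 0
            while queue:
                y, x = queue.popleft()
                count += 1
                if wind[y][x] > best:
                    best, best_loc = wind[y][x], (y, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and water_mask[ny][nx]
                            and wind[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if count >= min_cells:
                regions.append((best, best_loc, count))
    return regions


# Made-up wind speeds on a 4x4 grid; the bottom-left cell is land.
wind = [
    [10, 12, 40, 45],
    [11, 38, 50, 44],
    [9, 10, 12, 11],
    [36, 10, 11, 39],
]
water = [
    [True, True, True, True],
    [True, True, True, True],
    [True, True, True, True],
    [False, True, True, True],
]

print(find_wind_regions(wind, water, threshold=35))
print(find_wind_regions(wind, water, threshold=35, min_cells=2))
```

With these inputs the first call reports two regions (a five-cell region peaking at 50 and an isolated cell at 39), and the second call drops the isolated cell. The land cell with value 36 is never considered because the water mask excludes it.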
[Diagram labels: Data Archive; Mining Environment; Result]
Results are placed on the web and made available to the National Hurricane Center and the Joint Typhoon Warning Center

Known Event/Learned Algorithm
I know what phenomena I want to detect but I do not know the characteristics of the phenomena
[Diagram labels: Earth Science Data Sets; Data Mining System; Refine your algorithm using iteration; Results]
Results:
• Relationship analysis
• Coincidence searches
• Input for other algorithms

Rainfall Estimation and Identification Study using SSM/I data
• Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
• Goal: determine whether generic pattern recognition techniques could be applied to SSM/I data to detect rain
• Minimum Distance Classifier, Back-propagation Neural Network (BPNN) and Discrete Bayes Classifier were compared against a Science Algorithm (WetNet PIP Algorithm)
• US Composite rainfall product was used as ground truth
[Figure panels: Subsetted SSM/I data; NEXRAD Composite data]

Rainfall Estimation and Identification Study using SSM/I data
• SSM/I and US rain data over the southeastern United States for January and July 1995 were compared in the study
• SSM/I and radar data were gridded and registered to establish spatial and temporal coincidence
• BPNN performance was comparable to that of the WetNet PIP SSM/I rain rate algorithm
• Performance of the Bayes classifier was not as good as that of the WetNet PIP SSM/I rain rate algorithm. This is perhaps due to the small sample size used for estimating the density functions of the two classes (rain and non-rain)

Lightning Detection in Operational Linescan System (OLS) Images
• Scientist: Dr.
Steve Goodman (GHCC/MSFC NASA)
• Goal: identify lightning streaks in night-time portions of OLS images
• OLS is carried by DMSP satellites and produces visible and thermal images
• Lightning shows up as bright horizontal streaks, as do city lights and moonlight reflected off the clouds
• An approach based on morphological filtering and gradient detection was selected
• Both visible and thermal bands were used

Lightning Detection in Operational Linescan System (OLS) Images
• Erosion and dilation were used to find areas in or near clouds; other areas were removed
• Gradient detection in the direction of satellite propagation was applied to the visible image to extract horizontal streaks
• Texture measures were used to identify areas of small patchy cloud cover which exhibited small bright streaks
• A genetic algorithm was used to tune the parameters of the classification during training

Results (% Accuracy)
                    Correctly Detected   False Positives   False Negatives
Training Results    80                   0.7               19.2
Test Results        78.2                 4.3               17.3

Unknown Event/Unknown Algorithm
[Diagram label: Data Mining System]
I want to find anomalies in the data sets!
Let the miner “discover” it
[Diagram labels: Earth Science Data Sets; Results]
Results:
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Example: Target Independent Mining

Target Independent Mining of SSM/I Data
• Mine for data in a target independent manner (no specific phenomena under consideration)
• Interested in transient phenomena that move through an area
• Transient phenomena characterized as deviation from normal
• Objective: data reduction with minimum loss of information
  • Size of remotely sensed data prevents it from being maintained online
  • Data is archived in much slower tertiary storage
  • Need to develop techniques to minimize the need for data access from the tertiary storage
• Procedure: overlay the earth’s surface with a constant grid consisting of cells
  • For each cell a maximum and a minimum trend line are computed
  • The maximum trend line is computed by forming a set of daily maximum values over some period (a month)
  • The median for a series of months is used to form the maximum trend line
  • The same procedure is used to calculate the minimum trend line

Target Independent Mining of SSM/I Data
[Figure: Trend Lines Represent What Is Normal]

Target Independent Mining of SSM/I Data
• Extracted metadata is not oriented toward any particular transient phenomena
• Laboratory tests show 98% data compression while preserving 92% of MCSs detectable in raw data
• MCS events represented only 6.7% of extracted metadata

Relationship-based Mining
• Coincident Association
  • VARGA Algorithm for multispectral data
• Localized Spatial Association
  • Cumulus Cloud Classification in GOES Imagery
• Temporal Association

Coincident Association Mining
• Use Market Basket analysis to mine for association rules in vector data
• Rule has form [X -> Y]
• Rule characterized by
  • Support: % of vector instances that contain both X and Y (how likely is the rule to be applicable?)
  • Confidence: what % of vector instances that contain X also contain Y?
    (an estimate of the conditional probability of Y given X)

Coincident Association Applied to Multi-spectral Data Mining
• Developed and implemented the Vector Association Rule Generation Algorithm (VARGA) as a modification to market-basket association rule mining
• Modified to minimize memory usage for large multi-spectral satellite data such as SSM/I (90 megabytes per day uncompressed)
• Example SSM/I Rule:
  • [19V, 180.0] [37H, 140.0] -> [37V, 200.0] : 0.117037 0.945986

Localized Spatial Association Mining
• Extract association rules to characterize texture (dissertation of Dr. John Rushing)
• Each pixel in an n x n neighborhood is characterized by the triple (X,Y,I)
  • X and Y are the offsets from the pixel at the neighborhood center
  • I is its intensity
• Association rules can then be characterized by relationships between the triples

Association Rule Example
• The rule specified in the figure can be applied to this image at 9 of the 16 pixel locations because of the pixel offsets in the rule.
• Of these 9 locations, the antecedent matches at 5 locations, and both the antecedent and consequent match at 3 locations.
• This yields a support of 3/9 = 33.33% and a confidence of 3/5 = 60%.
[Figure: rule triples (0,0,2), (1,1,2), (1,0,0); Support: 3/9 = 33.33%; Confidence: 3/5 = 60%]

Association Rule Example
[Figure panels: Original Image; Segmented Image; Normal “J” Elements; Mirrored “J” Elements; Normal “J” Rules; Mirrored “J” Rules (rule triples shown graphically)]

GOES Cumulus Cloud Classification: Why Texture Features?
• Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery

GOES Cumulus Cloud Classification: The Need
• Cloud systems are important modulators of the earth’s radiation budget
• Large uncertainties are associated with cloud radiative forcing
• The radiative energy budget is impacted by changes in the distribution of clouds
• Cumulus clouds are a cloud field type that could respond strongly to climate change
• Knowledge of cloud geometry, size and spatial distribution is needed for the representation of cumulus clouds in radiative transfer models
• To derive models of cloud field characteristics, automated cumulus cloud detection schemes are required to analyze large amounts of data

GOES Cumulus Cloud Classification: Purpose of this study
• Compare different techniques for detecting cumulus cloud fields in Geostationary Operational Environmental Satellite (GOES) imagery
• Comparison based on
  • Accuracy of detection
  • Amount of time required to classify
• Feature measures used along with the Maximum Likelihood Classifier
  • Texture features
    • Gray Level Co-Occurrence Matrix (GLCM)
    • Gray Level Run Length (GLRL) Features
    • Association Rules
  • Edge Detection Features
    • Sobel Filter
    • Laplacian Filter
    • Combination of Sobel and Laplacian Filter

GOES Cumulus Cloud Classification: Texture Features (1)
• Gray Level Co-Occurrence Matrix:
  • First texture feature vector to be developed
  • GLCM is used as a benchmark
  • It is based on a positional operator
  • The positional operator defines the relationship between pixels in terms of an x,y offset or a distance, angle offset
  • The co-occurrence matrix is an N x N matrix, where N is the number of gray levels; functions are computed on the matrix
• Gray Level Run Length Features
  • Gray level statistical features based on homogeneous gray level runs
  • A run is a series of consecutive pixels of the same intensity
  • Run lengths are computed at orientations in increments of 45 degrees, starting at 0 degrees

GOES Cumulus Cloud Classification: Texture Features (2)
• Association Rules
  • Often used in business applications to identify relationships in databases
  • Adapted to discriminate textures in images
  • Based on frequently occurring local image structures
Triples (Pos X, Pos Y, Pixel Intensity): (0,0,2), (1,1,2), (1,0,0)
Rule: (0,0,2) ^ (1,1,2) => (1,0,0)
Then calculate Support and Confidence of this rule:
  Support = 3/9 = 33.33%
  Confidence = 3/5 = 60%
[Figure panels: (a), (b)]

GOES Cumulus Cloud Classification: Edge Detection Features
• These techniques are used for detecting discontinuities in an image
• They apply a local derivative operator to the image
• Sobel Filters
  • Calculate the magnitude of the rate of change of gray level and the direction of this change vector
  • Magnitude = |Gx| + |Gy|
  • Direction = tan^-1(Gx/Gy)
  • Gx = (z7 + 2z8 + z9) - (z1 + 2z2 + z3)
  • Gy = (z3 + 2z6 + z9) - (z1 + 2z4 + z7)
• Laplacian Filters
  • A second order derivative
  • F(z) = 4z5 - (z2 + z4 + z6 + z8)
3x3 neighborhood used in the formulas:
  z1 z2 z3
  z4 z5 z6
  z7 z8 z9

GOES Cumulus Cloud Classification: Experiment Process
• Training
  • Samples selected from a 1000x1000 GOES scene
  • Only two classes are used: Cumulus and Others (includes background)
  • For validation, samples were labeled by at least two experts and only pixels where the experts agreed were used for training
  • Maximum likelihood classifier was trained using GLCM, GLRL, Association Rules and Edge detection features
  • Window size was varied from 5x5 to 11x11
• Testing
  • 12 different GOES images (512x512) were used for testing
  • Classification results were compared against expert labeled images
  • Confusion matrix, classification accuracy and experiment run times were calculated

GOES Cumulus Cloud Classification: Sample Result
[Figure panels: Original; GLRL; Association Rules; GLCM; Expert Labeled; Sobel; Sobel + Laplacian; Laplacian]

GOES Cumulus Cloud Classification: Conclusions
• Accuracy
  • Best results using texture features
    • GLRL (78%) with a filter size of 11x11
    • Association Rules (75%) with a filter size of 5x5
    • GLCM gave the worst results (51-55%)
  • Best results using edge detection filters
    • Sobel Filter (78%) with a filter size of 11x11
    • Laplacian (73%) with a filter size of 9x9
    • Laplacian and Sobel (75%) with a filter size of 9x9
• Timing Results
  • Times were measured on a 933 MHz Pentium III PC with 512 MB of memory
  • Texture feature techniques in general required an order of magnitude more time than edge detection filters

Dimensionality Reduction: Mesoscale Convective System (MCS) Detection
Scientists populating the Knowledge Base (reducing data volume):
• Define the experiment
• Select algorithm (Devlin)
• Automatic extraction of MCSs from SSM/I data
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Knowledge Base; Event/Relationship Search System]

Dimensionality Reduction: Research Analysis
Scientists' analysis:
• Find MCSs over river basins in the Middle East?
• Data Sets
  • MCSs
  • River basin data set
  • Political boundaries
Benefits:
• Reduced amount of data
• Allow scientists to pose questions and get “results”
• Allow easy visualization
• Maximize knowledge discovery / minimize data handling
• Scientists can refine their knowledge repository
• Answer the science questions
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Knowledge Base; Event/Relationship Search System]

Dimensionality Reduction: Knowledge Reuse
[Figure: Latitudinal Distribution of MCS for 1998-1999; axis: Latitude]
[Diagram labels: Scientists; SSM/I Data; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs]
Climatological study of MCSs:
• What is the latitudinal distribution of MCSs?
• Which continent has more MCSs?
• What is the size distribution of the MCSs for JUN-JUL-AUG?
• What is the relationship between the number of MCSs and their intensities?
• Do results vary for El Niño years?
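Once mined MCS events sit in a knowledge base, a question such as "what is the latitudinal distribution of MCSs?" reduces to a simple aggregation over the stored events rather than a pass over the raw SSM/I data. The sketch below illustrates that idea only. The function name `latitude_histogram`, the 10-degree band width and the sample latitudes are all assumptions made for illustration, not values from the study.

```python
from collections import Counter


def latitude_histogram(latitudes, bin_width=10):
    """Count events per latitude band.

    Keys of the returned dict are the southern edge of each band in degrees.
    A latitude of exactly +90 is folded into the northernmost band.
    """
    counts = Counter()
    for lat in latitudes:
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude out of range: {lat}")
        # Floor to the band's southern edge, capping at the top band.
        edge = min(int(lat // bin_width) * bin_width, 90 - bin_width)
        counts[edge] += 1
    return dict(sorted(counts.items()))


# Made-up event latitudes standing in for mined MCS records.
mcs_latitudes = [35.2, 38.9, -5.2, -12.0, 7.5, 41.0, 90.0]

for edge, count in latitude_histogram(mcs_latitudes).items():
    print(f"{edge:>4} to {edge + 10:>4}: {count}")
```

The same pattern (filter the stored events, then group and count) would apply to the other questions on the slide, for example grouping by season or by size class instead of by latitude.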
[Figure, continued: axis: Number of MCS's (0 to 18000); series: Mar98-Mar99]
[Diagram labels: Knowledge Reuse; Knowledge Base; Event/Relationship Search System]

Event/Relationship Search System
• Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
• Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period

References
• Graves, Sara J., Thomas Hinke, Shanlini Kansal, "Metadata: The Golden Nuggets of Data Mining", First IEEE Metadata Conference, Bethesda, Maryland, April 16-18, 1996
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, "For Scientific Data Discovery: Why Can't the Archive be More Like the Web", Proceedings Ninth International Conference on Scientific Database Management, Evergreen State College, Olympia, Washington, August 11-13, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Techniques and Experience in Mining Remotely Sensed Satellite Data", Artificial Intelligence Review 14 (6): Issues on the Application of Data Mining, pp. 503-531, December 2000
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, Evans Criswell, "Eureka Phenomena Discovery and Phenomena Mining System", AMS 13th Int’l Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography and Hydrology, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining", Proceedings Third International Conference on Data Mining (KDD-97), Newport Beach, California, August 14-17, 1997
• Keiser, Ken, John Rushing, Helen Conover, Sara J.
Graves, "Data Mining System Toolkit for Earth Science Data", Earth Observation (EO) & Geo-Spatial (GEO) Web and Internet Workshop, Washington, D.C., February 1999
• Rushing, John, Heggere S. Ranganath, Thomas Hinke, Sara J. Graves, "Using Association Rules as Texture Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 8, pp. 845-858, 2001
• Nair, Udaysankar J., John Rushing, Rahul Ramachandran, Kwo-Sen Kuo, Sara J. Graves, Ron Welch, "Detection of Cumulus Cloud Fields in Satellite Imagery", The International Symposium on Optical Science, Engineering and Instrumentation, Denver, 1999
• Nair, U., J. Rushing, R. Ramachandran, R. Welch, and S. J. Graves, "Detection of boundary layer cumulus cloud fields in GOES satellite imagery", submitted to Journal of Applied Meteorology, September 2001