Data Mining in Earth Sciences
Rahul Ramachandran, Sara Graves and Ken Keiser
Mathematical Challenges in Scientific Data Mining
IPAM January 14-18, 2002
Information Technology and Systems Center
University of Alabama in Huntsville
http://datamining.itsc.uah.edu
Outline
• Introduction
• ADaM System
• Data Mining Taxonomy for Earth Science
• Event/Relationship based
• Application Examples
• Dimensionality Reduction
• References
Reasons for Data Mining of Earth
Science Data
• Greatly increased data volume due to improvements in data
collection/access/availability/storage technology (instruments,
computational resources, internet…)
• Data volumes from Terra are about 1 terabyte per day, more than can be analyzed by
conventional means
• High variability in data formats and content
• Need for high returns on expensive data investments
• Need for improved access/availability of data, information and
knowledge
• Need for higher level products for the non-specialist and
interdisciplinary/cross-domain researchers
• Questions/queries are getting more complex due, in part, to
heterogeneous nature of the data
Characteristics of Earth Science
Data
• High variability
• Type:
• Geostationary
• Polar Orbiting
• Structure
• Raster
• Vector
• Resolution
• Fine – AVHRR 1km
• Coarse – SSM/I 20km
• Multi/Hyper Spectral
• Processing stage:
• Level 0: Raw data – instrument counts
• Level 1: Annotated with Geo-reference information
• Level 2: Transformed by algorithm into geophysical parameter
• Level 3: Spatial/Temporal resampling
• Level 4: Includes additional model data
Characteristics of Earth Science
Data
• Need to know physical basis (domain knowledge)
before applying statistical techniques
• Multiple time scales
• Wide variety of data formats
• Includes spatial/temporal information
• Typically needs domain-specific algorithms
ADaM History
• Algorithm Development and Mining (ADaM) System
• The system provides knowledge discovery, feature detection
and content-based searching for data values, as well as for
metadata.
• It contains over 120 different operations that can be performed on
the input data stream.
• Operations range from specialized, dataset-specific atmospheric science algorithms to digital image processing
techniques, processing modules for automatic pattern
recognition, machine perception, neural networks and
genetic algorithms.
• Developed an Event/Relationship Search System for the
environment
ADaM Engine Architecture
Processing flow: Data → Input → Translated Data → Preprocessing → Preprocessed Data → Analysis → Patterns/Models → Output → Results
Input:
• HDF, HDF-EOS, GIF, PIP-2
• SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp
• US Rain, Landsat, ASCII Grass, Vectors (ASCII Text), Intergraph Raster, Others...
Preprocessing:
• Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
• Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
• Image Processing: Cropping, Inversion, Thresholding, Others...
Analysis:
• Clustering: K Means, Isodata, Maximum
• Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
• Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
• Genetic Algorithms, Neural Networks, Others...
Output:
• GIF Images, HDF-EOS, HDF Raster Images, HDF SDS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, Others...
ADaM Mining Environment
• Distributed Clients: Web-based, Workstation-based, Other Systems, Analysis/Vis Tools
• Data Mining Server
• Common Client API
• Knowledge Base
• Mining Engine (ADaM): Input Modules, Analysis Modules, Output Modules
• Data Stores
• Mining Results
• Event/Relationship Search System
Data Mining Taxonomy
Event-based mining
Relationship-based mining
Event-based Mining
• Known events/Known algorithms
• Tropical Cyclones from AMSU-A data
• Known events/Learned algorithms
• Rainfall estimation from SSM/I data
• Lightning Detection from OLS data
• Unknown event/Unknown algorithm
• Target Independent Mining
Known Event/Known Algorithm
"I know what phenomena to detect and I have the algorithm to do so!"
Add algorithm to Mining Environment
Earth Science Data Sets
Results
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Tropical Cyclone Detection:
Estimating Maximum Wind Speed
• Scientist: Dr. Roy Spencer (GHCC/MSFC NASA)
• Data used: Advanced Microwave Sounding Unit-A
• Radiometer can detect temperatures at different levels of the
atmosphere
• Surface winds in tropical cyclones are directly related to the
warm middle- and upper-atmosphere temperatures which exist
around the cyclone center
• AMSU-A measures this warmth at several frequencies near 55
gigahertz (GHz)
• Calibrated using aircraft reconnaissance measurements in
tropical depressions, tropical storms, and hurricanes from the
1998 Atlantic hurricane season
• Tropical cyclone detection based on ice scattering, water vapor
and wind speed
Tropical Cyclone Detection:
Estimating Maximum Wind Speed
• Advanced Microwave Sounding Unit (AMSU-A) data: calibration, limb correction, conversion to brightness temperature (Tb)
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired
threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
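The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not give the science algorithm, so `estimate_wind` is a hypothetical stand-in (a simple scaling of the Laplacian), and the `min_cells` check is an invented placeholder for the "additional test" against false positives. The grid values are made up.

```python
from collections import deque

def laplacian(grid):
    """4-neighbour Laplacian (4*centre minus neighbours); border cells stay 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (4 * grid[r][c] - grid[r - 1][c] - grid[r + 1][c]
                         - grid[r][c - 1] - grid[r][c + 1])
    return out

def estimate_wind(lap_value, scale=10.0):
    """HYPOTHETICAL stand-in for the science algorithm; not the real one."""
    return max(0.0, scale * lap_value)

def detect_cyclones(tb, is_water, threshold, min_cells=1):
    """Return (max wind, (row, col)) per contiguous region; threshold must be > 0."""
    rows, cols = len(tb), len(tb[0])
    lap = laplacian(tb)
    # Water mask: land cells are forced to zero wind so they never pass the threshold.
    wind = [[estimate_wind(lap[r][c]) if is_water[r][c] else 0.0
             for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    found = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or wind[r][c] < threshold:
                continue
            # Breadth-first search collects one contiguous above-threshold region.
            seen[r][c] = True
            queue, cells = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and wind[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Placeholder false-positive test: drop regions that are too small.
            if len(cells) >= min_cells:
                peak = max(cells, key=lambda p: wind[p[0]][p[1]])
                found.append((wind[peak[0]][peak[1]], peak))
    return found
```

Each returned pair corresponds to the "maximum wind speed and location" output of the last step.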
Hurricane Floyd
Data Archive, Mining Environment, Result
Results are placed on the web and made available to the National Hurricane Center and the Joint Typhoon Warning Center
Known Event/Learned Algorithm
Data Mining System
"I know what phenomena I want to detect but I do not know the characteristics of the phenomena"
Refine your algorithm using iteration
Earth Science Data Sets
Results
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Rainfall Estimation and Identification Study
using SSM/I data
• Scientist: Dr. Steve Goodman (GHCC/MSFC NASA)
• Data used: subsetted SSM/I data and NEXRAD composite data
• To determine whether generic pattern recognition techniques could be applied to SSM/I data to detect rain
• Minimum Distance Classifier, Back-propagation Neural Network (BPNN) and Discrete Bayes Classifier were compared against a Science Algorithm (WetNet PIP Algorithm)
• US Composite rainfall product was used as ground truth
Rainfall Estimation and Identification Study
using SSM/I data
• SSM/I and US rain data over the southeastern United States for January and July 1995 were compared in the study
• SSM/I and radar data were gridded and registered to establish spatial and temporal coincidence
• BPNN performance was comparable to that of the WetNet PIP SSM/I rain rate algorithm
• Performance of the Bayes classifier was not as good as that of the WetNet PIP SSM/I rain rate algorithm, perhaps because of the small sample size used for estimating the density functions of the two classes (rain and non-rain)
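Of the classifiers compared in this study, the minimum distance classifier is the simplest to illustrate. The sketch below assumes each gridded sample is a short feature vector (for example two SSM/I brightness temperatures) labeled "rain" or "no_rain" from the radar ground truth. The feature values and label names are hypothetical; the study's actual features are not given in the slides.

```python
import math

def train_min_distance(samples, labels):
    """Compute one mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(means, x):
    """Assign x to the class whose mean is nearest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(means[y], x))

# Hypothetical training data: two brightness-temperature features per sample.
samples = [[200.0, 220.0], [210.0, 230.0], [260.0, 270.0], [250.0, 260.0]]
labels = ["rain", "rain", "no_rain", "no_rain"]
means = train_min_distance(samples, labels)
```

Calling `classify(means, [215.0, 235.0])` returns the label of the nearer class mean, which for this toy data is "rain".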
Lightning Detection in Operational
Linescan System (OLS) Images
• Scientist: Dr. Steve Goodman
(GHCC/MSFC NASA)
• To identify lightning streaks in
night time portions of OLS
images
• OLS is carried by DMSP
satellites and produces a visible
and thermal image
• Lightning shows up as bright
horizontal streaks as do city
lights and moonlight reflected off
the clouds
• Approach based on
morphological filtering and
gradient detection was selected
• Both the visible and thermal bands were used
Lightning Detection in Operational Linescan
System (OLS) Images
• Erosion and dilation were used to find areas in/near clouds; other areas were removed
• Gradient detection in the direction of satellite propagation was applied to the visible image to extract horizontal streaks
• Texture measures were used to identify areas of small patchy cloud cover which exhibited small bright streaks
• A genetic algorithm was used to tune the parameters of the classification during training
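A minimal sketch of the morphological and gradient steps, assuming binary cloud masks, a 3x3 structuring element, and images stored with one scan line per row, so that the row index is taken as the along-track direction. The threshold and the way the two steps are combined in `streak_candidates` are illustrative assumptions, not the parameters tuned by the genetic algorithm.

```python
def _neighbourhood(mask, r, c):
    """Values in the 3x3 window around (r, c), clipped at the image border."""
    rows, cols = len(mask), len(mask[0])
    return [mask[rr][cc]
            for rr in range(max(0, r - 1), min(rows, r + 2))
            for cc in range(max(0, c - 1), min(cols, c + 2))]

def dilate(mask):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def erode(mask):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[int(all(_neighbourhood(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def along_track_gradient(image):
    """Absolute difference between each scan line and the previous one."""
    return [[abs(image[r][c] - image[r - 1][c]) if r > 0 else 0
             for c in range(len(image[0]))] for r in range(len(image))]

def streak_candidates(visible, cloud_mask, threshold):
    """Keep strong along-track gradients that lie in or near cloud."""
    near_cloud = dilate(cloud_mask)
    grad = along_track_gradient(visible)
    return [[int(bool(near_cloud[r][c]) and grad[r][c] >= threshold)
             for c in range(len(visible[0]))] for r in range(len(visible))]
```

A bright horizontal streak produces a large gradient on the scan line where it starts and on the line after it ends, so both lines are flagged as candidates.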
Results (% Accuracy)
                  Correctly Detected   False Positives   False Negatives
Training Results  80                   0.7               19.2
Test Results      78.2                 4.3               17.3
Unknown Event/Unknown Algorithm
Data Mining System
"I want to find anomalies in the data sets!"
Let the miner “discover” it
Example: Target Independent Mining
Earth Science Data Sets
Results
• Relationship analysis
• Coincidence searches
• Input for other algorithms
Target Independent Mining of
SSM/I Data
• Mine for data in a target independent manner (no specific phenomena under consideration)
• Interested in transient phenomena that move through an area
• Transient phenomena characterized as deviation from normal
• Objective: Data Reduction with minimum loss of information
• Size of remotely sensed data prevents it from being maintained online
• Data is archived in much slower tertiary storage
• Need to develop techniques to minimize the need for data access from the tertiary storage
• Procedure: Overlay the earth’s surface with a constant grid consisting of cells
• For each cell a maximum and minimum trend line is computed
• Maximum trend line is computed by forming a set of maximum values for a day over some period (month)
• Median for a series of months is used to form the maximum trend line
• Same procedure used to calculate minimum trend line
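One possible reading of the trend-line procedure, sketched for a single grid cell. It assumes that the daily maxima (and minima) within each month are reduced to their median, giving one trend-line point per month, and that a value outside the two lines is flagged as a transient. The slide wording allows other readings, and the observation values and month keys below are hypothetical.

```python
from statistics import median

def trend_lines(cell_obs):
    """cell_obs maps a month key to a list of days; each day is a list of values."""
    max_line, min_line = {}, {}
    for month, days in sorted(cell_obs.items()):
        daily_max = [max(day) for day in days if day]
        daily_min = [min(day) for day in days if day]
        if daily_max:
            max_line[month] = median(daily_max)  # one point per month
            min_line[month] = median(daily_min)
    return max_line, min_line

def is_transient(value, month, max_line, min_line):
    """A value outside the normal band is kept as potential metadata."""
    return value > max_line[month] or value < min_line[month]

# Hypothetical observations for one cell.
obs = {
    "1995-01": [[200.0, 210.0], [205.0, 215.0], [198.0, 230.0]],
    "1995-02": [[190.0, 200.0], [195.0, 205.0]],
}
max_line, min_line = trend_lines(obs)
```

Only values flagged by `is_transient` would need to be kept online, which is where the data reduction comes from.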
Target Independent Mining of
SSM/I Data
Trend Lines Represent What Is Normal
Target Independent Mining of
SSM/I Data
• Extracted metadata not oriented toward any
particular transient phenomena
• Laboratory tests show 98% data compression while preserving 92% of the Mesoscale Convective Systems (MCSs) detectable in the raw data
• MCS events represented only 6.7% of extracted metadata
Relationship-based Mining
• Coincident Association
• VARGA Algorithm for multispectral data
• Localized Spatial Association
• Cumulus Cloud Classification in GOES Imagery
• Temporal Association
Coincident Association Mining
• Use Market Basket analysis to mine for association rules in vector data
• A rule has the form [X -> Y]
• A rule is characterized by
• Support: the % of vector instances that contain both X and Y (how likely is it that the rule applies?)
• Confidence: the % of vector instances containing X that also contain Y (an estimate of the conditional probability)
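The two measures can be computed directly from their definitions. In the sketch below each vector instance is modeled as a set of (channel, binned value) items. That representation and the sample values are assumptions for illustration, not the data structures of any particular implementation.

```python
def support_confidence(instances, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for inst in instances if x <= inst)    # instances containing X
    n_xy = sum(1 for inst in instances if xy <= inst)  # instances containing X and Y
    support = n_xy / len(instances)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# Hypothetical vector instances: one set of (channel, value) items per pixel.
instances = [
    {("19V", 180.0), ("37H", 140.0), ("37V", 200.0)},
    {("19V", 180.0), ("37H", 140.0), ("37V", 210.0)},
    {("19V", 190.0), ("37H", 140.0), ("37V", 200.0)},
    {("19V", 180.0), ("37H", 140.0), ("37V", 200.0)},
]
support, confidence = support_confidence(
    instances,
    antecedent=[("19V", 180.0), ("37H", 140.0)],
    consequent=[("37V", 200.0)],
)
```

Here three instances contain the antecedent and two of them also contain the consequent, so the support is 2/4 and the confidence is 2/3.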
Coincident Association Applied to
Multi-spectral Data Mining
• Developed and implemented Vector Association Rule
Generation Algorithm (VARGA) as a modification to
market-basket association rule mining.
• Modified to minimize memory usage for large
multi-spectral satellite data such as SSM/I (90
megabytes per day uncompressed)
• Example SSM/I Rule:
• [19V, 180.0] [37H, 140.0] -> [37V, 200.0] : 0.117037 0.945986
Localized Spatial Association
Mining
• Extract association rules to characterize texture
(Dissertation of Dr. John Rushing)
• Each pixel in an n×n neighborhood is characterized by the triple (X,Y,I)
• X and Y: the offsets from the pixel at the neighborhood center
• I: its intensity
• Association rules can then be characterized by relationships between the triples
Association Rule Example
• The rule specified in the figure can be applied to this image in 9 of the 16 pixel locations due to the pixel offsets in the rule.
• Of these 9 locations, the antecedent matches at 5 locations, and both the antecedent and consequent match at 3 locations.
• This yields a support of 3/9 = 33.33% and a confidence of 3/5 = 60%.
Rule: (0,0,2) ^ (1,1,2) => (1,0,0)
Support: 3/9 = 33.33%
Confidence: 3/5 = 60%
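The counting behind these numbers can be sketched as follows. The 4x4 image here is made up (the slide's figure is not reproduced in the text) and was chosen so that the counts come out the same as in the example: 9 applicable locations, 5 antecedent matches, 3 full matches. The sketch assumes non-negative offsets and indexes the image as `image[y][x]`.

```python
def rule_stats(image, antecedent, consequent):
    """Return (applicable locations, antecedent matches, full matches)."""
    triples = antecedent + consequent
    max_dx = max(dx for dx, _, _ in triples)
    max_dy = max(dy for _, dy, _ in triples)
    rows, cols = len(image), len(image[0])

    def matches(x, y, ts):
        return all(image[y + dy][x + dx] == i for dx, dy, i in ts)

    # Only locations where every offset stays inside the image are applicable.
    locations = [(x, y) for y in range(rows - max_dy) for x in range(cols - max_dx)]
    n_ante = sum(1 for x, y in locations if matches(x, y, antecedent))
    n_both = sum(1 for x, y in locations
                 if matches(x, y, antecedent) and matches(x, y, consequent))
    return len(locations), n_ante, n_both

# Hypothetical 4x4 image with intensities 0..2 (not the image from the slide).
image = [
    [2, 0, 2, 0],
    [1, 2, 0, 2],
    [2, 1, 2, 1],
    [0, 2, 1, 2],
]
n_loc, n_ante, n_both = rule_stats(image, [(0, 0, 2), (1, 1, 2)], [(1, 0, 0)])
support = n_both / n_loc      # 3/9
confidence = n_both / n_ante  # 3/5
```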
Association Rule Example
Original Image
Segmented Image
Normal “J” Elements
Mirrored “J” Elements
Normal “J” Rules
 1,1,1   1,0,1   1,1,0   0,1,1  0,0,0   0,1,0   1,1,1  1,0,1  1,1,1
 1,1,1   1,0,1   1,1,1  0,1,1  0,0,0   0,1,1  1,1,1  1,0,0   1,1,0 
 1,1,1   1,0,1   1,1,1  0,1,0   0,0,0   0,1,1  1,1,0   1,0,1  1,1,1
 1,1,0    1,0,0    1,1,1  0,1,1  0,0,0   0,1,1  1,1,1  1,0,1  1,1,1
Mirrored “J” Rules
 1,1,1   1,0,1   1,1,1  0,1,1  0,0,0   0,1,0   1,1,1  1,0,1  1,1,0 
 1,1,1   1,0,1   1,1,1  0,1,1  0,0,0   0,1,1  1,1,0   1,0,0   1,1,1
 1,1,0    1,0,1   1,1,1  0,1,0   0,0,0   0,1,1  1,1,1  1,0,1  1,1,1
 1,1,1   1,0,0    1,1,0   0,1,1  0,0,0   0,1,1  1,1,1  1,0,1  1,1,1
GOES Cumulus Cloud Classification:
Why Texture Features?
• Cumulus cloud fields have a very characteristic
texture signature in the GOES visible imagery
GOES Cumulus Cloud Classification:
The Need
• Cloud systems are important modulators of earth’s radiation
budget
• Large uncertainties are associated with cloud radiative forcing
• Radiative energy budget is impacted by change in distribution of
clouds
• Cumulus clouds are a cloud field type that could respond
strongly to climate change
• Knowledge of cloud geometry, size and spatial distribution is
needed for the representation of cumulus clouds in radiative
transfer models
• To derive models of cloud field characteristics, automated
cumulus cloud detection schemes are required to analyze large
amounts of data
GOES Cumulus Cloud Classification:
Purpose of this study
• Compare different techniques for detecting Cumulus cloud fields
in Geostationary Operational Environmental Satellite (GOES) imagery
• Comparison based on
• Accuracy of detection
• Amount of time required to classify
• Feature measures used along with the Maximum Likelihood
Classifier
• Texture features
• Gray Level Co-Occurrence Matrix (GLCM)
• Gray Level Run Length (GLRL) Features
• Association Rules
• Edge Detection Features
• Sobel Filter
• Laplacian Filter
• Combination of Sobel and Laplacian Filter
GOES Cumulus Cloud Classification:
Texture Features (1)
• Gray Level Co-Occurrence Matrix:
• First texture feature vector to be developed
• GLCM is used as a benchmark
• It is based on a positional operator
• The positional operator defines the relationship between pixels in terms of an x,y offset or a distance, angle offset
• The co-occurrence matrix is an NxN matrix, where N is the number of gray levels, and functions are computed on the matrix
• Gray Level Run Length Features
• Gray level statistical features based on homogeneous gray level runs
• A run is a series of consecutive pixels of the same intensity
• Run lengths are computed at orientations in increments of 45 degrees, starting at 0 degrees
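A small sketch of both texture representations, assuming an image stored as a list of rows with integer gray levels 0..N-1. The positional operator is given as an (dx, dy) offset, contrast is used as one example of a function computed on the matrix, and only the 0 degree orientation is shown for run lengths. The tiny image is illustrative.

```python
from itertools import groupby

def glcm(image, dx, dy, levels):
    """Count how often gray level i has gray level j at offset (dx, dy)."""
    m = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                m[image[y][x]][image[y2][x2]] += 1
    return m

def contrast(m):
    """Example GLCM feature: mean squared gray-level difference of the pairs."""
    total = sum(map(sum, m))
    n = len(m)
    return sum((i - j) ** 2 * m[i][j] for i in range(n) for j in range(n)) / total

def horizontal_runs(image):
    """(gray level, run length) for every run along the 0 degree orientation."""
    return [(level, len(list(run))) for row in image for level, run in groupby(row)]

image = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
matrix = glcm(image, dx=1, dy=0, levels=2)
```

Run length features such as short-run or long-run emphasis would then be statistics over the list returned by `horizontal_runs`.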
GOES Cumulus Cloud Classification:
Texture Features (2)
• Association Rules
• Often used in business applications to identify relationships in
databases
• Adapted to discriminate textures in images
• Based on frequently occurring local image structures
0,0,2   1,1,2   1,0,0 
Triples
( Pos X, Pos Y, Pixel Intensity)
Rule:
(0,0,2) ^ (1,1,2) => (1,0,0)
Support = 3/9 = 33.33%
Then calculate Support and Confidence
of this
Rule
Confidence
= 3/5 = 60%
(a)
(b)
GOES Cumulus Cloud Classification:
Edge Detection Features
• These techniques are used for detecting discontinuities in an
image
• These techniques apply a local derivative operator on the image
• Sobel Filters
• It calculates the magnitude of rate of change of gray level and the
direction of this change vector
• Magnitude = | Gx | + |Gy|
• Direction = tan^-1(Gx/Gy)
• Gx = (z7 + 2z8 + z9) – (z1 + 2z2 + z3)
• Gy = (z3 + 2z6 + z9) – (z1 + 2z4 + z7)
• Laplacian Filters
• It is a second order derivative
• F(z) = 4z5 – (z2 + z4 + z6 + z8)
3x3 neighborhood:
z1 z2 z3
z4 z5 z6
z7 z8 z9
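The formulas above translate directly into code for a single 3x3 neighborhood. The sketch assumes the neighborhood is laid out row by row as z1 z2 z3 / z4 z5 z6 / z7 z8 z9, and it leaves out the direction term. The function name is illustrative.

```python
def sobel_laplacian(z):
    """z is a 3x3 list [[z1, z2, z3], [z4, z5, z6], [z7, z8, z9]]."""
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = z
    gx = (z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)  # bottom row minus top row
    gy = (z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)  # right column minus left column
    magnitude = abs(gx) + abs(gy)
    laplacian = 4 * z5 - (z2 + z4 + z6 + z8)
    return gx, gy, magnitude, laplacian

# A horizontal edge: dark rows above a bright bottom row.
edge = [[0, 0, 0], [0, 0, 0], [10, 10, 10]]
gx, gy, magnitude, lap = sobel_laplacian(edge)
```

To build a feature image, the same function would be evaluated at every interior pixel of the window being classified.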
GOES Cumulus Cloud Classification:
Experiment Process
• Training
• Samples selected from 1000x1000 GOES scene
• Only two classes are used: Cumulus and Others (includes
background)
• For validation, samples were labeled by at least two experts and
only pixels where experts agreed were used for training
• Maximum likelihood classifier was trained using GLCM, GLRL,
Association Rules and Edge detection features
• Window size was varied: 5x5 – 11x11
• Testing
• 12 different GOES images (512x512) were used for testing
• Classification results were compared against expert labeled images
• Confusion matrix, classification accuracy and experiment run times
were calculated
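The evaluation step can be illustrated with a short sketch that builds a confusion matrix from expert labels and classifier output and derives the classification accuracy from it. The class names follow the two classes named above; the label sequences are hypothetical.

```python
def confusion_matrix(expert, predicted, classes=("cumulus", "other")):
    """Rows are expert labels, columns are classifier labels."""
    index = {name: i for i, name in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for e, p in zip(expert, predicted):
        m[index[e]][index[p]] += 1
    return m

def accuracy(m):
    """Fraction of pixels on the diagonal, i.e. classified as the experts did."""
    total = sum(map(sum, m))
    return sum(m[i][i] for i in range(len(m))) / total

# Hypothetical per-pixel labels.
expert = ["cumulus", "cumulus", "other", "other", "other"]
predicted = ["cumulus", "other", "other", "other", "cumulus"]
matrix = confusion_matrix(expert, predicted)
```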
GOES Cumulus Cloud Classification:
Sample Result
[Figure panels: Original, Expert Labeled, GLCM, GLRL, Association Rules, Sobel, Laplacian, Sobel + Laplacian]
GOES Cumulus Cloud Classification:
Conclusions
• Accuracy
• Best results using texture features
• GLRL (78%) with a filter size of 11x11
• Association Rules (75%) with a filter size of 5x5
• GLCM gave the worst results (51-55%)
• Best results using edge detection filters
• Sobel Filter (78%) with a filter size of 11x11
• Laplacian (73%) with a filter size of 9x9
• Laplacian and Sobel (75%) with a filter size of 9x9
• Timing Results
• Times were calculated on a 933 MHz Pentium III PC with
512 MB of memory
• Texture feature techniques in general required an order of
magnitude more time than edge detection filters
Dimensionality Reduction:
Mesoscale Convective System (MCS) Detection
Scientists: Populating Knowledge Base (reducing data volume)
• Define the Experiment
• Select algorithm (Devlin)
• Automatic extraction of MCSs from SSM/I data
[Diagram components: SSM/I Data, Mining Engine (Input Modules, Analysis Modules, Output Modules), Mining Results: MCSs, Knowledge Base, Event/Relationship Search System]
Dimensionality Reduction: Research Analysis
Scientists
Analysis:
• Find MCSs over river basins in Middle East?
• Data Sets
• MCSs
• River basin data set
• Political boundaries
• Reduced amount of data
• Allow scientists to pose questions and get “results”
• Allow easy visualization
• Maximize knowledge discovery/minimize data handling
• Scientists can refine their knowledge repository
• Answer the science questions
[Diagram components: SSM/I Data, Mining Engine (Input Modules, Analysis Modules, Output Modules), Mining Results: MCSs, Knowledge Base, Event/Relationship Search System]
Dimensionality Reduction: Knowledge Reuse
Scientists
Climatological Study of MCSs:
• What is the latitudinal distribution of MCSs?
• Which continent has more MCSs?
• What is the size distribution of the MCSs for JUN-JUL-AUG?
• What is the relationship between the number of MCSs and their intensities?
• Do results vary for El-Nino years?
[Figure: Latitudinal Distribution of MCS for 1998-1999 (Mar98-Mar99); axes: Latitude (-90 to 90) and Number of MCS's (0 to 18000)]
[Diagram components: SSM/I Data, Mining Engine (Input Modules, Analysis Modules, Output Modules), Mining Results: MCSs, Knowledge Base, Event/Relationship Search System, Knowledge Reuse]
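A question such as the latitudinal distribution of MCSs can be answered from the mined results alone, without returning to the raw SSM/I data. The sketch below assumes each mined MCS is stored with a latitude in degrees and bins them into 10 degree bands; the band width and the sample latitudes are illustrative, not values from the study.

```python
import math
from collections import Counter

def latitude_histogram(latitudes, band=10):
    """Count events per latitude band, keyed by the band's lower edge."""
    counts = Counter()
    for lat in latitudes:
        lower = math.floor(lat / band) * band
        lower = min(lower, 90 - band)  # put exactly +90 into the top band
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical latitudes of mined MCS events.
histogram = latitude_histogram([5.0, 9.9, 10.0, -0.1, -35.2, 90.0])
```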
Event/Relationship Search System
• Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
• Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
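A coincidence search of this kind can be sketched as a filter over mined events. The `Event` and `Region` types, the bounding-box representation of a region, and the sample coordinates are all hypothetical. A real region such as a river basin or political boundary would be a polygon, and the box here is assumed not to cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    name: str
    lat: float
    lon: float
    day: date

@dataclass
class Region:
    name: str
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east

def coincidence_search(events, region, start, end):
    """Mined events that fall inside the region during the time period."""
    return [e for e in events
            if region.contains(e.lat, e.lon) and start <= e.day <= end]

# Hypothetical mined events and search region.
events = [
    Event("MCS-1", 33.0, 44.0, date(1998, 6, 10)),
    Event("MCS-2", 33.5, 45.0, date(1999, 1, 5)),
    Event("MCS-3", -10.0, 120.0, date(1998, 6, 12)),
]
box = Region("example basin", south=30.0, north=36.0, west=40.0, east=48.0)
hits = coincidence_search(events, box, date(1998, 6, 1), date(1998, 8, 31))
```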
References
• Graves, Sara J., Thomas Hinke, Shanlini Kansal, "Metadata: The Golden Nuggets of Data Mining", First IEEE Metadata Conference, Bethesda, Maryland, April 16-18, 1996
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, "For Scientific Data Discovery: Why Can't the Archive be More Like the Web", Proceedings Ninth International Conference on Scientific Database Management, Evergreen State College, Olympia, Washington, August 11-13, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Techniques and Experience in Mining Remotely Sensed Satellite Data", Artificial Intelligence Review 14 (6): Issues on the Application of Data Mining, pp. 503-531, December 2000
• Hinke, Thomas, John Rushing, Shanlini Kansal, Sara J. Graves, Heggere S. Ranganath, Evans Criswell, "Eureka Phenomena Discovery and Phenomena Mining System", AMS 13th Int’l Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography and Hydrology, 1997
• Hinke, Thomas, John Rushing, Heggere S. Ranganath, Sara J. Graves, "Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining", Proceedings Third International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 14-17, 1997
• Keiser, Ken, John Rushing, Helen Conover, Sara J. Graves, "Data Mining System Toolkit for Earth Science Data", Earth Observation (EO) & Geo-Spatial (GEO) Web and Internet Workshop, Washington, D.C., February 1999
• Rushing, John, Heggere S. Ranganath, Thomas Hinke, Sara J. Graves, "Using Association Rules as Texture Features", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 23, No. 8, 845-858, 2001
• Nair, Udaysankar J., John Rushing, Rahul Ramachandran, Kwo-Sen Kuo, Sara J. Graves, Ron Welch, "Detection of Cumulus Cloud Fields in Satellite Imagery", The International Symposium on Optical Science, Engineering and Instrumentation, Denver, 1999
• Nair, U., J. Rushing, R. Ramachandran, R. Welch, and S. J. Graves, "Detection of boundary layer cumulus cloud fields in GOES satellite imagery", submitted to Journal of Applied Meteorology, September, 2001