Environmental Data Warehousing and
Mining
Nabil R. Adam
Vijay Atluri, Dihua Guo, Songmei Yu
Rutgers University CIMIC
NSF Workshop on Next Generation Data Mining
NGDM02
November 1-3, 2002
Outline
• Setting
– A Real-world Lab – The NJ Meadowlands Area
• Motivating Examples
• Environmental Data Warehousing
• Environmental Data Mining
RUTGERS-CIMIC/MERI
MERI
(Meadowlands Environmental Research Institute)
• Established in 1998 as a collaboration between the New
Jersey Meadowlands Commission and Rutgers CIMIC.
• Provides a world-class environmental research institute for
urban and coastal wetlands, focused on the Meadowlands District
• Administered by Rutgers-CIMIC
Mission
• Conduct and sponsor research in ecology, environmental
science and information technology to monitor, preserve and
improve ecological and human health and welfare in the
Meadowlands District, NJ.
MERI
• Budget: $1.6 million/year (2002-2007)
• Staff
– Faculty, students, full-time NJMC/Rutgers scientists and staff
• Disciplines
– Biology, Ecology, Geology, Environmental Science, Hydrological Modeling,
Remote Sensing/Geographic Information Systems, and Information
Technology
• Work closely with the NJ Meadowlands Commission to
disseminate research results:
– To the scientific community and the various government agencies
– Information and technology transfer to local municipalities
– Develop scientific content for education and exhibits
– Provide high school and college students with science internships
Digital Meadowlands
[Figure: Digital Meadowlands overview. Diagram labels: processed satellite images, NASA archives, satellite imagery (AVHRR), aerial photos, radar, sensors, monitoring stations, environmental parameters, maps, reports, documents; interactive maps, 3D visualization, fly-by/drill-down, visualization drill-down; users.]
ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer)
– Ground resolution: 15 m (bands 1-3), 30 m (bands 4-9), 90 m (bands 10-14)
– Spectral bands: 14
– Swath width: 60 km
– Application: daily monitoring of flood-prone areas. Flood-prone areas are
shown in red. Under flood conditions the sensor would detect water (blue)
covering the flood-prone areas (red).
Satellite Images
• Various data sources and data types
– various types of satellite images with different resolutions, captured by
different sensors
• AVHRR: direct downloads from polar-orbiting satellites (NOAA 12, NOAA 14 and
NOAA 15), 1 km resolution, 5 bands
• LANDSAT and RADAR: obtained from NASA archives, 30 m resolution, 7 bands
• Hyper-spectral images: 34-band images from the AISA (Airborne Imaging
Spectrometer for Applications) sensor, 250-1000 m resolution
• Aerial ortho-photographs: high-resolution (1 m) images (IKONOS, QUICKBIRD)
• MODIS: images for global dynamics and processes occurring on the land, in
the oceans, and in the lower atmosphere, from NASA’s satellites Terra and Aqua,
90 m resolution, 36 bands
• ASTER: detailed maps of land surface temperature, emissivity, reflectance and
elevation from NASA’s satellite Terra, 14 bands, 15-90 m resolution
Automated, near real-time monitoring system
[Figure: monitoring system data flow. Diagram labels: water monitor wired to data logger; weather station with data logger; central computer ingests data and stores it in a database; WWW interface; users access the database through the World Wide Web. Source label on figure: Coastal GeoTools ‘01, 8/24/01.]
Real Time Data from Monitoring Stations
Use of Water Quality Data:
Tracking the effectiveness of pollution control measures
[Figure: water quality chart with the regulatory minimum level marked.]
One Example of Satellite Imagery: AVHRR
[Figure: AVHRR processing pipeline. Diagram labels: NOAA 12, 14, 15; direct readout AVHRR; elements.dat (downloaded daily from US Navy); NASA Rapid Application Tool; raw files; creation of level1b; georectification; georegistration; data band extraction; metadata extraction; metadata; thumbnails; ORACLE DB.]
[Figure: Region Of Interest]
Information Sources -- Traditional
• Include
• A library of a large variety of documents
– Scientific publications
– Guidelines and regulations
– Measurements and impact studies
• Documents contain text, tables, pictures, drawings,
and maps
• Census information that describes the socio-economic
and health characteristics of the population
The User Community
• Researchers, faculty, graduate students in a variety of
disciplines including biology, ecology, geology, environmental
science, and IT:
– make scientific observations such as the changes in vegetation
pattern and its effect on temperature over the years
• Policy Makers:
– query various critical parameters such as ambient air and water
quality and visualize the results in a graphical form
– gain help in the evaluation and formulation of environmental policies
• The Public:
– learn about their county, community, and home on such issues as
environment, health, and infrastructure
• K-12 Educators and students
The Data Volume
• Satellite images
– AVHRR: 50 MB per image, 2-4 images per satellite per day; CIMIC downloads
images from 3 satellites, generating 15 GB of data per month
– MODIS and ASTER images are available every day or every other day.
– IKONOS, QUICKBIRD 1 m resolution images (7.5 quad): each image is roughly
15000*12000*8, i.e. about 1.44 GB.
• Our Mass Storage
– EMC CLARiiON FC4500
– Capacity – up to 18TB
– Good cost per MB, excellent performance, scalability, and flexibility
• It satisfies the needs of the Online Querying Information System
• One GB cache: optimized for reads/writes at different times by a script
• Backplane provides a data transfer rate of 200 MB/s from the disks to
the fiber channel port, which transfers the data over the fiber channel cables to
the host at 100 MB/s.
• Additional fiber channel
• Very flexible configuration capabilities
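The volume figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: it assumes decimal units (1 MB = 10^6 bytes), a 30-day month, and that the factor 8 in the IKONOS/QUICKBIRD estimate means 8 bytes per pixel. All function and variable names are hypothetical.

```python
# Back-of-the-envelope check of the data volumes quoted on the slide.
# Assumptions (not from the slides): decimal units, 30-day month,
# and 8 bytes per pixel for the 1 m resolution images.
MB = 10**6
GB = 10**9


def avhrr_monthly_bytes(images_per_day, satellites=3, image_mb=50, days=30):
    """Bytes downloaded per month for a given number of images per satellite per day."""
    return images_per_day * satellites * image_mb * MB * days


avhrr_low = avhrr_monthly_bytes(2)    # 2 images per satellite per day
avhrr_high = avhrr_monthly_bytes(4)   # 4 images per satellite per day

# One 1 m resolution image: 15000 x 12000 pixels x 8 bytes.
highres_image_bytes = 15000 * 12000 * 8

# Time to move one such image to the host at the quoted 100 MB/s.
seconds_per_image = highres_image_bytes / (100 * MB)

print(avhrr_low / GB, avhrr_high / GB)
print(highres_image_bytes / GB)
print(seconds_per_image)
```

Under these assumptions the quoted 15 GB per month falls inside the range implied by 2-4 images per satellite per day.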
Environmental Knowledge Discovery
Examples
Data Warehousing
Data Mining
–How and Why
–Hypothesis testing
Motivating Examples (1)
• How can we identify a natural disturbance affecting wetland
vegetation, such as fire, pathogen infestation, or wilting by
drought, in the New Jersey Meadowlands?
• What should we have?
– A time series of satellite images (a few years)
– Calculated soil and vegetation indices for the images
– Digital elevation models (DEM) of the Meadowlands
– Precipitation record for the time series
– Zoning designation for the area being observed
• What do we need to do?
– Identify the sudden drop in the vegetation index (NDVI) in areas
where NDVI has been consistently high through time (outlier
detection)
– Determine areas where suddenly the soil index is high due to the
exposure of bare mineral soil. (classification)
– Combine high soil index with low NDVI and precipitation record
to determine the occurrence of vegetation disturbance
(characterization)
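The first step above (a sudden NDVI drop where NDVI has been consistently high) might be sketched as follows. This is a minimal illustration, not the authors' method: the function name, the array layout (time, rows, columns), and the thresholds `high` and `drop` are all assumptions.

```python
import numpy as np


def sudden_ndvi_drops(ndvi_series, high=0.6, drop=0.3):
    """Flag pixels whose NDVI falls sharply after being consistently high.

    ndvi_series: array of shape (T, H, W), one NDVI image per time step.
    A pixel is flagged at time t when every earlier value was >= `high`
    and the current value is at least `drop` below the earlier mean.
    Both thresholds are hypothetical.
    """
    series = np.asarray(ndvi_series, dtype=float)
    flags = np.zeros(series.shape, dtype=bool)
    for t in range(1, series.shape[0]):
        history = series[:t]
        consistently_high = history.min(axis=0) >= high
        fell_sharply = (history.mean(axis=0) - series[t]) >= drop
        flags[t] = consistently_high & fell_sharply
    return flags


# Two pixels over four dates: the first drops at t=2, the second is always low.
demo = np.array([
    [[0.8, 0.2]],
    [[0.8, 0.2]],
    [[0.3, 0.2]],
    [[0.3, 0.2]],
])
demo_flags = sudden_ndvi_drops(demo)
```

Only the onset of the drop is flagged; later dates are no longer "consistently high". Combining the flags with a soil index and the precipitation record, as the slide describes, would be a further step.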
Motivating Examples (2)
• Find bird resting patterns along the eastern seaboard
migration corridor
– Data needs
• Extent of ecosystems that support invertebrates along the migration
corridor
• Availability of invertebrates in water and sediments through the
migrating period.
– What we need to know
• The number of birds and bird types as related to the availability of
food at each rest stop. (trend detection)
• Detect abnormal bird populations (low or high) which are not
explained by availability of food at specific resting stops. (outlier
detection)
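One simple way to read "bird populations not explained by availability of food" is as large residuals from a fitted trend. The sketch below assumes a linear relationship and a z-score cut-off, neither of which comes from the slides; the function name and the sample numbers are made up for illustration.

```python
import numpy as np


def unexplained_counts(food, birds, z_threshold=2.0):
    """Return indices of rest stops whose bird count deviates from a linear
    food-versus-birds trend by more than z_threshold residual standard deviations.
    """
    food = np.asarray(food, dtype=float)
    birds = np.asarray(birds, dtype=float)
    # Least-squares line: birds ~ slope * food + intercept (trend detection).
    slope, intercept = np.polyfit(food, birds, 1)
    residuals = birds - (slope * food + intercept)
    spread = residuals.std()
    if spread == 0:
        return np.array([], dtype=int), slope
    # Stops far from the trend are candidate outliers (outlier detection).
    flagged = np.flatnonzero(np.abs(residuals / spread) > z_threshold)
    return flagged, slope


# Made-up data: counts follow food availability except at the fifth stop.
food_index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bird_counts = [10, 20, 30, 40, 150, 60, 70, 80, 90, 100]
flagged_stops, trend_slope = unexplained_counts(food_index, bird_counts)
```

A single extreme value also pulls the fitted line toward itself, so a robust fit may be preferable on real survey data.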
Motivating Examples (3)
• Investigate the associations between change in forest
cover and illegal exploitation of protected tropical
forests
– Data we need
• Satellite images/maps
• Calculated deforestation rates using NDVI indices
• Data on truck movement
• Records on ship movements from local ports
• Data on migrant worker camps
– What can we get?
• Relate deforestation rate to new road construction and truck traffic
in areas where the topography and local ecosystems support exotic
tropical trees. (association detection)
Other Motivating Examples
• Hydroclimatological study
(Praveen Kumar and Amanda BT. White, 2002)
– How can we link the changes in NDVI to changes in the hydrologic
condition?
– Can we distinguish between the changes due to various factors, such
as inter-annual climate variability and human action impact?
– Is it possible to distinguish between variability related to inter-annual and long-term trends?
– Is there correlation between NDVI variations and ecoregion, or
between NDVI with other parameters, such as climate, physiography,
topography, or hydrology?
– Are the trends confined to certain regions? Is the nature of the
variability and trend different in different regions?
– Are there any systematic changes over the last 10 years?
– Are there regions where changes are attributable to human impact,
such as logging?
Environmental Data Warehousing (EDW)
• Poses a number of challenging requirements
with respect to
– The design of the data model due to the nature of
analytical operations to be performed
– The nature of the views to be maintained by the
environmental warehouse.
EDW – Challenges
Nature of the Environmental Data
• Each dimension in itself is multi-dimensional in nature, e.g.,
– raster images such as satellite downloads
• used to generate various images of different types including land use, water, temperature, NDVI
• each of them has multiple dimensions
– the geographic extent and coordinates
– the time and date of its capture
– resolution, ...
– regional maps represented as vector data
• temporal and spatial
– streaming data collected from various sensors
• Temperature, air quality, atmospheric pressure, water quality: dissolved oxygen,
mineral contents, salinity
• geographic location (spatial dimension)
• temporal dimension
Nature of the Environmental Data
[Figure: example dimensions and their sub-dimensions. Diagram labels: Land-use, Vector Map, Temperature, Water, Vegetation, Themes, Image ID; Time (Year, Date, Timestamp); Spatial ((LX, LY), (UX, UY), Resolution); Attributes; Category; Types; Dot, Line, Polygon; Developed, Barren, Forested upland, Water, Etc.; NDWT Index, Chlorophyll.]
Nature of the Environmental data
• Each dimensional table is itself multi-dimensional by nature
• Traditional data warehouse models are not suitable for an environmental
data warehouse
• Our proposal: cascaded star schema
[Figure: cascaded star schema with nodes A, b, c, d, e, f.]
A is the fact table, and b, c, d, e and f are dimensions that are themselves multi-dimensional
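The idea of a fact table whose dimensions carry their own dimensions can be pictured with a small recursive data structure. The sketch below is only an illustration of that shape, not the authors' schema definition; the class names, field names, and example attributes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dimension:
    """A dimension that may itself own further dimensions (the cascade)."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    sub_dimensions: List["Dimension"] = field(default_factory=list)

    def depth(self) -> int:
        """1 for a flat dimension, more when sub-dimensions are nested."""
        return 1 + max((d.depth() for d in self.sub_dimensions), default=0)


@dataclass
class FactTable:
    name: str
    dimensions: List[Dimension]

    def is_cascaded(self) -> bool:
        """True when at least one dimension has dimensions of its own."""
        return any(d.sub_dimensions for d in self.dimensions)


def time_dim() -> Dimension:
    return Dimension("Time", {"levels": ["year", "date", "timestamp"]})


def spatial_dim() -> Dimension:
    return Dimension("Spatial", {"extent": ["(LX, LY)", "(UX, UY)"], "resolution": None})


# Hypothetical example: a vegetation dimension with its own time and space.
vegetation = Dimension("Vegetation", sub_dimensions=[time_dim(), spatial_dim()])
water = Dimension("Water", sub_dimensions=[time_dim(), spatial_dim()])
fact_a = FactTable("A", [vegetation, water])
flat_fact = FactTable("flat", [Dimension("Product"), Dimension("Store")])
```

In a conventional star schema every dimension would have depth 1; here `vegetation.depth()` is 2, which is the property the slide calls "cascaded".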
EDW Challenges
Complex Nature of Queries (1)
• Retrieve changes in the vegetation pattern over a
certain region during the last 10 years, and their effect on
the regional maps over that time period
• Requires
– layering the images representing the vegetation
patterns with the maps whose time intervals of
validity overlap
– traversing along the temporal dimension with the overlaid
image
• In the traditional data warehouse sense,
– first construct two data cubes along the time dimensions
for each of the vegetation images and maps
– then fuse these two cubes into one
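The layering step, pairing each vegetation image with the maps whose validity intervals overlap it, can be sketched as an interval join. The code below is an illustration under simple assumptions: validity is given in whole years with inclusive bounds, and the `Layer` type, function names, and sample data are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Layer(NamedTuple):
    name: str
    start: int  # first year of validity (inclusive)
    end: int    # last year of validity (inclusive)


def overlaps(a: Layer, b: Layer) -> bool:
    """Two inclusive intervals overlap when each starts before the other ends."""
    return a.start <= b.end and b.start <= a.end


def fuse_along_time(images: List[Layer], maps: List[Layer]) -> List[Tuple[int, int, str, str]]:
    """Pair every image with every map valid at the same time.

    Returns (start, end, image, map) tuples sorted by time, where
    [start, end] is the period during which both layers are valid.
    """
    fused = []
    for img in images:
        for m in maps:
            if overlaps(img, m):
                fused.append((max(img.start, m.start), min(img.end, m.end), img.name, m.name))
    return sorted(fused)


vegetation_images = [Layer("veg_1993", 1993, 1997), Layer("veg_1998", 1998, 2002)]
regional_maps = [Layer("map_a", 1990, 1995), Layer("map_b", 1996, 2002)]
timeline = fuse_along_time(vegetation_images, regional_maps)
```

Walking through `timeline` in order corresponds to traversing the temporal dimension with the overlaid image.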
Demo
• http://cimic.rutgers.edu/~songmei/dw.html
EDW Challenges
Complex Nature of Queries (2)
• Observe the changes in the surface water and population
due to the changes in the vegetation pattern
• fusion of multiple cubes is required
• Simulate a fly-by over a region starting with a specific point
and elevation, and traverse the region on a specific path
with reducing elevation levels at a certain speed, and
reaching a destination (a 3-dimensional trajectory)
– Requires
• retrieving images that span adjacent regions that overlap the
spatial trajectory, but with increasing resolution levels to simulate
the effect of reduced elevation level
• displaying them at a speed that matches the desired velocity of the
fly-by.
EDW Challenges
Efficient Software and Mature Technology
• We need software applications to efficiently manage and
manipulate images, either through preset operations or ad hoc queries
• Example of calculating NDVI
select (char) ( 255.0 * (band2 - band1)/(band2 + band1))
[1000:1500, 1000:1500]
from landsat_band1 as band1, landsat_band2 as band2
• Example in the area of DB: RasDaMan
– A basic research project sponsored by the European
Community to develop comprehensive MDD database
technology
– Multi-dimensional data models (MDD) to store images
– Interacts with Oracle for metadata and BLOB management
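For readers without a RasDaMan installation, the NDVI query above can be approximated on in-memory arrays. The NumPy sketch below is not RasQL and not part of RasDaMan. It assumes the window bounds are inclusive, sets pixels with a zero denominator to 0, and clips negative values to 0 before the 8-bit cast, all of which are choices made here for illustration. The function name is hypothetical.

```python
import numpy as np


def ndvi_bytes(band1, band2, rows=(1000, 1500), cols=(1000, 1500)):
    """Scaled NDVI, 255 * (band2 - band1) / (band2 + band1), on a window.

    rows and cols are (first, last) index pairs, treated as inclusive.
    Returns an unsigned 8-bit array.
    """
    r0, r1 = rows
    c0, c1 = cols
    b1 = np.asarray(band1, dtype=float)[r0:r1 + 1, c0:c1 + 1]
    b2 = np.asarray(band2, dtype=float)[r0:r1 + 1, c0:c1 + 1]
    total = b2 + b1
    ratio = np.zeros_like(total)
    # Leave pixels with a zero denominator at 0 instead of dividing by zero.
    np.divide(b2 - b1, total, out=ratio, where=total != 0)
    # Clip so that negative NDVI does not wrap around in the 8-bit cast.
    return np.clip(255.0 * ratio, 0, 255).astype(np.uint8)


small_band1 = [[10, 10], [10, 0]]
small_band2 = [[30, 10], [0, 0]]
small_ndvi = ndvi_bytes(small_band1, small_band2, rows=(0, 1), cols=(0, 1))
window_shape = ndvi_bytes(np.ones((1600, 1600)), np.ones((1600, 1600))).shape
```

The point of the database approach on the slide is that the same expression runs inside the server on stored tiles, so only the requested window has to be read.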
RasDaMan (1)
RasDaMan (2)
• Distinguished Features
– A clear distinction is made between the logical (query) level and the
physical (storage organization and data transmission) level of array
management.
– On the conceptual level, arrays are treated as a general data abstraction:
they can be of any dimensionality, they can have an arbitrary (fixed or
variable) number of elements per dimension, and both primitive and
derived types are admissible as array base types.
– The model has formal set-algebraic semantics based on AFATL Image
Algebra, a rigorous mathematical framework able to express any image
transformation.
– On the physical level, a novel combination of tiling and spatial indexing
allows for the efficient execution of queries on MDD while offering the
benefits of conventional database technology, such as query performance
depending on the result set (and not on the overall data set size),
concurrency control, support for crash recovery, and transaction
management.
– A data definition language for multidimensional arrays, together with an
SQL-based, optimized query language called RasQL, allows for powerful
associative retrieval and data manipulation.
Ongoing Work
• Formulating the necessary primitives for the specification and
execution of queries
• Extending the OLAP operations for the cascaded star
– roll-up: aggregating on a specific dimension, i.e., summarizing data
– drill-down: moving from a higher-level summary to lower-level detail
– slicing: projecting data along a subset of dimensions with an equality
selection of other dimensions
– dicing: similar to slicing except that instead of equality selection of
other dimensions, a range selection is used
– pivoting: reorienting the multidimensional cube
– zoom-in, zoom-out, aggregation of views using the above OLAP
operations
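On an ordinary (non-cascaded) cube, the operations listed above reduce to simple array manipulations. The NumPy sketch below shows them on a tiny cube. The axis meaning (year, region, theme), the helper names, and the data are hypothetical, and it does not attempt the cascaded-star extension the slide refers to.

```python
import numpy as np

# Hypothetical cube of measurements indexed by (year, region, theme).
cube = np.arange(24).reshape(2, 3, 4)


def roll_up(c, axis):
    """Aggregate (here: sum) over one dimension."""
    return c.sum(axis=axis)


def slice_cube(c, axis, index):
    """Equality selection on one dimension; that dimension disappears."""
    return np.take(c, index, axis=axis)


def dice_cube(c, ranges):
    """Range selection; ranges maps axis -> (start, stop), stop exclusive."""
    index = [slice(None)] * c.ndim
    for axis, (start, stop) in ranges.items():
        index[axis] = slice(start, stop)
    return c[tuple(index)]


def pivot(c, order):
    """Reorient the cube by permuting its axes."""
    return np.transpose(c, order)


by_region_theme = roll_up(cube, axis=0)          # summed over years
region_one = slice_cube(cube, axis=1, index=1)   # region == 1
sub_cube = dice_cube(cube, {1: (0, 2), 2: (1, 3)})
reoriented = pivot(cube, (1, 0, 2))
```

Drill-down is the inverse view of roll-up: going from `by_region_theme` back to the per-year values in `cube`.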
Environmental Data Mining – Challenges (1)
• How can we mine spatial and non-spatial data from
multispectral satellite images and thematic maps? (Krzysztof
Koperski, Junas Adhikary, and Jiawei Han, 1996)
– Current research uses only a single type of map or image
– We need to mine them at the same time
– Resolutions are different
– The representations of the thematic maps are different
• How do we deal with the complex relationships among
objects? (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)
– Relationships
• Spatial relationship: distance
• Topological relationship: disjoint, overlap, far away, etc.
• Direction
– Current clustering methods represent a large object by its centroid, which
suits objects of similar size and regular shape but not an object with a
very narrow, long band shape
Environmental Data Mining – Challenges (2)
• How to utilize the various data seamlessly
– The diverse data types
• Structured data: vector, raster, relational database
• Unstructured data: text, multimedia, and geo-referenced
stream data.
– Needs supporting data
• Some can be found in the Data Warehouse: summary, average
• Some need to be created on the fly: variation, etc.
• How to utilize the geographic visualization tool
– Can it replace statistical visualization tools in some areas?
Data Mining Techniques
• From the motivating examples we notice that several
data mining techniques are involved
– Segmentation
• Clustering
• Classification
– Rule detection
– Trend detection
– Outlier detection
EDM Techniques – Rule detection
• Examples:
– Can we distinguish between the changes due to various factors,
such as inter-annual climate variability and human action impact?
– Can we link the changes in NDVI to changes in the hydrologic
conditions, or changes in population?
• Various rules
– Characteristic rules: describe one characteristic of the data
– Discriminant rules: describe the features discriminating or contrasting a class
of data from other classes
– Association rules: one set of features is correlated with another set
of data
EDM Techniques – Association Rule detection
• Algorithms
– Classic algorithms
• Apriori: finds frequent itemsets for Boolean association rules
(Jiawei Han and Micheline Kamber, 2000)
• Statistical techniques: regression model
– Spatial data mining algorithm (Krzysztof Koperski and Jiawei Han, 1996)
• A top-down search technique
• Uses spatial approximation
• Pre-processing is required for object recognition
• Needs a comprehensive algorithm for mining a combination
of spatial and non-spatial data at the same time
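To make the Apriori bullet concrete, here is a compact level-wise frequent-itemset search on non-spatial Boolean items. It is a teaching sketch of the classic algorithm, not the spatial variant cited above. The function name, the item labels, and the support threshold are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    Level-wise search: frequent k-itemsets are built only from frequent
    (k-1)-itemsets, since every subset of a frequent itemset is frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical observations, one set of Boolean features per map cell.
cells = [
    {"low_ndvi", "high_soil", "dry"},
    {"low_ndvi", "high_soil"},
    {"low_ndvi", "dry"},
    {"high_ndvi", "wet"},
]
frequent_sets = apriori(cells, min_support=0.5)
```

Association rules such as "high_soil implies low_ndvi" would then be derived from the frequent itemsets by comparing supports.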
EDM Techniques - Segmentation
• Example
– Are the trends confined to certain regions? Is the nature of the
variability and trend different in different regions?
– Is it possible to distinguish between variability related to inter-annual and long-term trends?
• Clustering:
– groups spatial objects such that objects in the same group are
similar and objects in different groups are unlike each other.
• Classification:
– Selects a relevant set of attributes and attribute values that
determine an effective mapping of spatial objects into pre-defined
target classes. (H. J. Miller and J. Han, 2001)
– Name a set of pre-determined classes (inter-annual changes, long-term
changes)
EDM Techniques – Segmentation (Cont’d)
• Algorithms
– Classification: the classes are pre-defined
• Decision tree induction
• Bayesian classification
– Clustering
• Partitioning algorithms: k-means method, k-medoids method
– The problem here is that the result depends strongly on the
initial guess of the centroids
• Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE
– The hierarchical algorithms are not optimal for large datasets
• Density-based: DBSCAN, OPTICS, DENCLUE
– Only dots, without meaningful interpretation
• Grid-based: STING, WaveCluster, CLIQUE
– How to partition high-dimensional data
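The remark that k-means depends strongly on the initial centroids is easy to reproduce. The sketch below runs a plain Lloyd-style k-means from two different starting guesses on four made-up points; the function name and the data are hypothetical.

```python
import numpy as np


def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means starting from the given initial centroids.

    Returns (labels, centroids, sse), where sse is the sum of squared
    distances from each point to its assigned centroid.
    """
    pts = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()

    def assign(cents):
        dists = np.linalg.norm(pts[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        updated = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels = assign(centroids)
    sse = float(((pts - centroids[labels]) ** 2).sum())
    return labels, centroids, sse


# Corners of a 4 x 1 rectangle: the natural clusters are left and right.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels_good, _, sse_good = kmeans(corners, [(0, 0), (4, 0)])
labels_bad, _, sse_bad = kmeans(corners, [(2, 0), (2, 1)])
```

The second start converges to a top/bottom split that is stable but far worse, which is why k-means is usually restarted from several initial guesses.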
EDM Techniques – Outlier Detection
• To find inconsistencies and abnormalities
• Example
– Can we identify abnormal changes in NDVI or in particular species?
– Has it been unusually hot this October?
• Algorithms
(Raymond T. Ng, 2001)
– Distribution-based approach: outliers are the objects that do not follow the
standard distribution.
• Hard to know the distribution
• Not suitable for high-dimensional datasets
– Depth-based method: represent the data in k-dimensional space and
assign a depth to each object.
• Does not scale up for more than 3-D
– Distance-based outlier detection
• Requires the existence of an appropriate distance function
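A common formulation of the distance-based approach calls an object an outlier when at least a fraction p of all objects lie farther than a distance D from it. The brute-force sketch below uses Euclidean distance as the "appropriate distance function". The function name, the parameter values, and the sample points are assumptions made for illustration.

```python
import numpy as np


def distance_based_outliers(points, p, D):
    """Indices of objects for which at least a fraction p of all objects
    lie at Euclidean distance greater than D.

    Brute force over all pairs, so only suitable for small datasets.
    """
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # An object is never far from itself, so it counts toward the "near" side.
    far_fraction = (dists > D).sum(axis=1) / len(pts)
    return np.flatnonzero(far_fraction >= p)


# Four clustered observations and one isolated observation.
observations = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outlier_indices = distance_based_outliers(observations, p=0.75, D=3.0)
```

Unlike the distribution-based approach, nothing here assumes a particular distribution, but the result depends on the choice of p, D, and the distance function.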
References
[1] Praveen Kumar and Amanda BT. White, “Scalable Knowledge Discovery for Hydroclimatological Studies”,
University of Illinois, 2002.
[2] H. J. Miller and J. Han, “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.
[3] Nabil Adam, Vijay Atluri, Songmei Yu and Yelena Yesha, “Efficient Storage and Management of
Environmental Information”, presented at the 11th Mass Storage Conference held by IEEE and NASA,
Maryland, April 2002.
[4] Wendolin Bosques, Ricardo Rodriguez, Angelica Rondon and Ramon Vasquez, “A Spatial Data Retrieval
and Image Processing Expert System for the World Wide Web”, 21st International Conference on
Computers and Industrial Engineering, 1997, pages 433-436.
[5] Krzysztof Koperski and Jiawei Han, “Discovery of Spatial Association Rules in Geographic Information
Databases”, Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD), Vol. 951,
Springer-Verlag, pages 47-66.
[6] Kirk Barrett, “The Meadowlands Environmental Research Institute”, Science on the Semantic Web
(SWS) Workshop, Oct 2002.
[7] Jiawei Han, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon, “Emerging Scientific
Applications in Data Mining”, Communications of the ACM, August 2002, Vol. 45, No. 8, pages 54-58.
[8] Krzysztof Koperski, Junas Adhikary, and Jiawei Han, “Spatial Data Mining: Progress and Challenges
Survey Paper”, SIGMOD ’96 Workshop on Research Issues in Data Mining and Knowledge Discovery.
[9] Jiawei Han and Micheline Kamber, “Data Mining – Concepts and Techniques”, Morgan Kaufmann
Publishers, 2000.
[10] Raymond T. Ng, “Detecting Outliers from Large Datasets”, in “Geographic Data Mining and Knowledge
Discovery”, Taylor & Francis, 2001.
Focus Areas
• Environmental monitoring
• Remote sensing/GIS for land use planning
• Plant and animal inventory and assessment
• Salt-marsh and Landfill Characterization and Restoration
• Assessment and Remediation of Contaminated Sediments
• Land use information management for planning and engineering
(predict land use trends for planner, code enforcement for
engineers)
• Scientific data warehousing for efficient management of
environmental and remote sensing data
• Scientific data mining for discovering trends, patterns and
relationships among land use and environmental data
• Automating land use permit processing workflows through
transparent inter-agency interaction
Introduction to Environmental Data (cont’d.)
– Value-added products:
• water
• vegetation
• temperature
• true colors (composites)
– models of the topography and spatial attributes of the
landscape
• roads, rivers, parcels, schools, zip code areas, city streets and
administrative boundaries
• Maps, reports, data sets from government agencies
– census information that describes the socio-economic and
health characteristics of the population
– real-time data from ground monitoring stations