Environmental Data Warehousing and Mining
Nabil R. Adam, Vijay Atluri, Dihua Guo, Songmei Yu
Rutgers University CIMIC
NSF Workshop on Next Generation Data Mining (NGDM02), November 1-3, 2002

Outline
• Setting
  – A Real-world Lab
  – The NJ Meadowlands Area
• Motivating Examples
• Environmental Data Warehousing
• Environmental Data Mining

RUTGERS-CIMIC/MERI

MERI (Meadowlands Environmental Research Institute)
• Established in 1998 as a collaboration between the New Jersey Meadowlands Commission and Rutgers CIMIC
• Provides a world-class environmental research institute for urban and coastal wetlands, focused on the district
• Administered by Rutgers-CIMIC

Mission
• Conduct and sponsor research in ecology, environmental science and information technology to monitor, preserve and improve ecological and human health and welfare in the Meadowlands District, NJ.

MERI
• Budget: $1.6 million/year (2002-2007)
• Staff
  – Faculty, students, full-time NJMC/Rutgers scientists and staff
• Disciplines
  – Biology, ecology, geology, environmental science, hydrological modeling, remote sensing/geographic information systems, and information technology
• Work closely with the NJ Meadowlands Commission to disseminate research results:
  – To the scientific community and the various government agencies
  – Information and technology transfer to local municipalities
  – Develop scientific content for education and exhibits
  – Provide high school and college students with science internships

Digital Meadowlands
[Figure: Digital Meadowlands overview. Sources shown: satellite imagery (AVHRR), aerial photos, radar, NASA archives, monitoring stations, sensors, documents, reports, maps. Products shown for users: processed satellite images, environmental parameters, interactive maps, 3D visualization, fly-by and drill-down.]

ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer)
– Ground resolution: 15 m (bands 1-3), 30 m (bands 4-9), 90 m (bands 10-14)
– Spectral bands: 14
– Swath width: 60 km
– Application: Daily
monitoring of flood-prone areas. Flood-prone areas are shown in red. Under flood conditions the sensor would detect water (blue) covering flood-prone areas (red).

Satellite Images
• Various data sources and data types
  – various types of satellite images with different resolutions, captured by different sensors
    • AVHRR: direct downloads from polar-orbiting satellites (NOAA 12, NOAA 14 and NOAA 15), 1 km resolution, 5 bands
    • LANDSAT and RADAR: obtained from NASA archives, 30 m resolution, 7 bands
    • Hyper-spectral images: 34-band images from the AISA (Airborne Imaging Spectrometer for Applications) sensor, 250-1000 m resolution
    • Aerial ortho-photographs: high-resolution (1 m) images (IKONOS, QUICKBIRD)
    • MODIS: images for global dynamics and processes occurring on the land, in the oceans, and in the lower atmosphere, from NASA’s satellites Terra and Aqua, 90 m resolution, 36 bands
    • ASTER: detailed maps of land surface temperature, emissivity, reflectance and elevation from NASA’s satellite Terra, 14 bands, 15-90 m resolution

Real Time Data from Monitoring Stations
[Figure: automated, near real-time monitoring system. A water monitor wired to a data logger and a weather station with a data logger; a central computer ingests the data and stores it in a database; users access the database through a World Wide Web interface. Source slide: Coastal GeoTools ’01, 8/24/01.]

Use of Water Quality Data: Tracking the effectiveness of pollution control measures
[Figure: chart annotated with "Regulatory minimum level".]

One Example of Satellite Imagery: AVHRR
[Figure: AVHRR processing flow. Elements shown: direct readout AVHRR from NOAA 12, 14, 15; elements.dat downloaded daily from US Navy; NASA Rapid Application Tool; raw files; creation of level1b; georectification; georegistration; data band extraction; region of interest; metadata extraction; metadata and thumbnails; ORACLE DB.]

Information Sources -- Traditional
• Include
  • A library of a large variety of documents
    – Scientific Publications
    – Guidelines and Regulations
    –
Measurements and Impact Studies
  • Documents contain text, tables, pictures, drawings and maps
  • Census information that describes the socio-economic and health characteristics of the population

The User Community
• Researchers, faculty, and graduate students in a variety of disciplines, including biology, ecology, geology, environmental science, and IT:
  – make scientific observations, such as the changes in vegetation pattern and their effect on temperature over the years
• Policy makers:
  – query various critical parameters, such as ambient air and water quality, and visualize the results in graphical form
  – gain help in the evaluation and formulation of environmental policies
• The public:
  – learn about their county, community and home on such issues as environment, health, and infrastructure
• K-12 educators and students

The Data Volume
• Satellite images
  – AVHRR: 50 MB per image, 2-4 images per satellite per day; CIMIC downloads images from 3 satellites and generates about 15 GB of data per month
  – MODIS and ASTER images are available every day or every other day
  – IKONOS, QUICKBIRD 1 m resolution images (7.5-minute quad): each image is roughly 15,000 × 12,000 × 8 bytes, i.e. about 1.44 GB
• Our mass storage
  – EMC CLARiiON FC4500
  – Capacity: up to 18 TB
  – Good cost per MB, excellent performance, scalability, and flexibility
    • It satisfies the needs of the Online Querying Information System
    • 1 GB cache: optimized for reads or writes at different times by a script
    • The backplane provides a data transfer rate of 200 MB/s from the disks to the fiber channel port, which transfers the data over the fiber channel cables to the host at 100 MB/s.
    • Additional fiber channel
    • Very flexible configuration capabilities

Environmental Knowledge Discovery
• Examples
• Data Warehousing
• Data Mining
  – How and why
  – Hypothesis testing

Motivating Examples (1)
• How can we identify a natural disturbance affecting wetland vegetation, such as fire, pathogen infestation or wilting by drought, in the New Jersey Meadowlands?
• What should we have?
  – A time series of satellite images (a few years)
  – Calculated soil and vegetation indices for the images
  – Digital elevation models (DEM) of the Meadowlands
  – Precipitation record for the time series
  – Zoning designation for the area being observed
• What do we need to do?
  – Identify a sudden drop in the vegetation index (NDVI) in areas where NDVI has been consistently high through time (outlier detection)
  – Determine areas where the soil index is suddenly high due to the exposure of bare mineral soil (classification)
  – Combine high soil index with low NDVI and the precipitation record to determine the occurrence of vegetation disturbance (characterization)

Motivating Examples (2)
• Find bird resting patterns along the eastern seaboard migration corridor
  – Data needs
    • Extent of ecosystems that support invertebrates along the migration corridor
    • Availability of invertebrates in water and sediments through the migrating period
  – What we need to know
    • The number of birds and bird types as related to the availability of food at each rest stop (trend detection)
    • Detect abnormal bird populations (low or high) which are not explained by the availability of food at specific resting stops (outlier detection)

Motivating Examples (3)
• Investigate the associations between change in forest cover and illegal exploitation of protected tropical forests
  – Data we need
    • Satellite images/maps
    • Calculated deforestation rates using NDVI indices
    • Data on truck movement
    • Records on ship movements from local ports
    • Data on migrant worker camps
  – What can we get?
    • Relate deforestation rate to new road construction and truck traffic in areas where the topography and local ecosystems support exotic tropical trees (association detection)

Other Motivating Examples
• Hydroclimatological study (Praveen Kumar and Amanda BT. White, 2002)
  – How can we link the changes in NDVI to changes in the hydrologic condition?
  – Can we distinguish between the changes due to various factors, such as inter-annual climate variability and the impact of human action?
  – Is it possible to distinguish between variability related to inter-annual and long-term trends?
  – Is there a correlation between NDVI variations and ecoregion, or between NDVI and other parameters, such as climate, physiography, topography, or hydrology?
  – Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?
  – Are there any systematic changes over the last 10 years?
  – Are there regions where changes are attributable to human impact, such as logging?

Environmental Data Warehousing (EDW)
• Poses a number of challenging requirements with respect to
  – The design of the data model, due to the nature of the analytical operations to be performed
  – The nature of the views to be maintained by the environmental warehouse

EDW – Challenges
Nature of the Environmental Data
• Each dimension is itself multi-dimensional in nature, e.g.,
  – raster images such as satellite downloads
    • used to generate various images of different types, including land use, water, temperature, NDVI
    • each of them has multiple dimensions
      – the geographic extent and coordinates
      – the time and date of its capture
      – resolution, ...
  – regional maps represented as vector data
    • temporal and spatial
  – streaming data collected from various sensors
    • temperature, air quality, atmospheric pressure, water quality: dissolved oxygen, mineral contents, salinity
    • geographic location (spatial dimension)
    • temporal dimension

Nature of the Environmental Data
[Figure: multi-dimensional structure of the environmental data. Recoverable labels: Land-use, Category, Developed, Barren, Water, Forested upland, Vegetation; Vector Map, Types, Dot, Line, Polygon; Image ID, Themes, Index, NDWT, Chlorophyll, Temperature; Time (Year, Date, Timestamp); Spatial ((LX, LY), (UX, UY), Resolution); Attributes.]

Nature of the Environmental Data
• Each dimension table is itself multi-dimensional by nature
• Traditional data warehouse models are not suitable for an environmental data warehouse
• Our proposal: cascaded star schema
[Figure: fact table A surrounded by dimensions b, c, d, e and f.]
A is the fact table, and b, c, d, e and f are dimensions that are themselves multi-dimensional

EDW Challenges
Complex Nature of Queries (1)
• Retrieve changes in the vegetation pattern over a certain region during the last 10 years, and their effect on the regional maps over that time period
• Requires
  – layering the images representing the vegetation patterns with the maps whose time intervals of validity overlap
  – traversing along this temporal dimension with the overlaid image
• In the traditional data warehouse sense,
  – first construct two data cubes along the time dimension, one for the vegetation images and one for the maps
  – then fuse these two cubes into one

Demo
• http://cimic.rutgers.edu/~songmei/dw.html

EDW Challenges
Complex Nature of Queries (2)
• Observe the changes in the surface water and population due to the changes in the
vegetation pattern
• fusion of multiple cubes is required
• Simulate a fly-by over a region, starting at a specific point and elevation, traversing the region on a specific path with decreasing elevation at a certain speed, and reaching a destination (a 3-dimensional trajectory)
  – Requires
    • retrieving images that span adjacent regions overlapping the spatial trajectory, with increasing resolution levels to simulate the effect of reduced elevation
    • displaying them at a speed that matches the desired velocity of the fly-by

EDW Challenges
Efficient Software and Mature Technology
• We need software applications to efficiently manage and manipulate images, through either preset or ad hoc operations
• Example of calculating NDVI
    select (char) (255.0 * (band2 - band1) / (band2 + band1)) [1000:1500, 1000:1500]
    from landsat_band1 as band1, landsat_band2 as band2
  – Example in the area of DB: RasDaMan
    • A basic research project sponsored by the European Community to develop comprehensive MDD database technology
    • Multi-dimensional data models (MDD) to store images
    • Interacts with Oracle for metadata and BLOB management

RasDaMan (1)

RasDaMan (2)
• Distinguishing features
  – A clear distinction is made between the logical (query) level and the physical (storage organization and data transmission) level of array management.
  – On the conceptual level, arrays are treated as a general data abstraction: they can be of any dimensionality, they can have an arbitrary (fixed or variable) number of elements per dimension, and both primitive and derived types are admissible as array base types.
  – The model has formal set-algebraic semantics based on AFATL Image Algebra, a rigid mathematical framework able to express any image transformation.
  – On the physical level, a novel combination of tiling and spatial indexing allows for the efficient execution of queries on MDD while offering the benefits of conventional database technology, such as query performance depending on the result set size (and not on the overall data set size), concurrency control, support for crash recovery, and transaction management.
  – A data definition language for multidimensional arrays, together with an optimized SQL-based query language called RasQL, allows for powerful associative retrieval and data manipulation.

Ongoing Work
• Formulating the necessary primitives for the specification and execution of queries
• Extending the OLAP operations for the cascaded star
  – roll-up: aggregating on a specific dimension, i.e., summarizing data
  – drill-down: moving from a higher-level summary to lower-level detail
  – slicing: projecting data along a subset of dimensions with an equality selection on the other dimensions
  – dicing: similar to slicing, except that a range selection is used on the other dimensions instead of an equality selection
  – pivoting: reorienting the multidimensional cube
  – zoom-in, zoom-out, and aggregation of views using the above OLAP operations

Environmental Data Mining – Challenges (1)
• How can we mine spatial and non-spatial data from multispectral satellite images and thematic maps?
(Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)
  – Current research uses only a single type of map or image
  – Mining them at the same time
  – Resolutions are different
  – The representations of the thematic maps are different
• How to deal with the complex relationships among objects (Krzysztof Koperski, Junas Adhikary, and Jiawei Han, 1996)
  – Relationships
    • Spatial relationship: distance
    • Topological relationship: disjoint, overlap, far away, etc.
    • Direction
  – Current clustering represents a large object by its centroid, which is a poor fit when, e.g., the objects are of similar size and regular shape but one of them is a very narrow, long band

Environmental Data Mining – Challenges (2)
• How to utilize the various data seamlessly
  – The diverse data types
    • Structured data: vector, raster, relational database
    • Unstructured data: text, multimedia, and geo-referenced stream data
  – Needs supporting data
    • Some can be found in the data warehouse: summary, average
    • Some need to be created on the fly: variation, etc.
• How to utilize the geographic visualization tool
  – Can it replace the statistical visualization tools in some areas?

Data Mining Techniques
• From the motivating examples we notice that several data mining techniques are involved
  – Segmentation
    • Clustering
    • Classification
  – Rule detection
  – Trend detection
  – Outlier detection

EDM Techniques – Rule detection
• Examples:
  – Can we distinguish between the changes due to various factors, such as inter-annual climate variability and the impact of human action?
  – Can we link the changes in NDVI to changes in the hydrologic conditions, or changes in population?
• Various rules
  – Characteristic rules: describe a characteristic of the data
  – Discriminant rules: the features discriminating or contrasting a class of data from other classes
  – Association rules: one set of features is correlated with another set of features

EDM Techniques – Association Rule detection
• Algorithms
  – Classic algorithms
    • Apriori: finds frequent itemsets for Boolean association rules (Jiawei Han and Micheline Kamber, 2000)
    • Statistical techniques: regression models
  – Spatial data mining algorithm (Krzysztof Koperski and Jiawei Han, 1996)
    • A top-down search technique
    • Uses spatial approximation
    • Pre-processing is required for object recognition
    • Needs a comprehensive algorithm for mining a combination of spatial and non-spatial data at the same time

EDM Techniques – Segmentation
• Examples
  – Are the trends confined to certain regions? Is the nature of the variability and trend different in different regions?
  – Is it possible to distinguish between variability related to inter-annual and long-term trends?
• Clustering:
  – groups spatial objects such that objects in the same group are similar and objects in different groups are unlike each other
• Classification:
  – Selects a relevant set of attributes and attribute values that determine an effective mapping of spatial objects into pre-defined target classes (H. J.
Miller and J. Han, 2001)
  – Name a set of pre-determined classes (inter-annual changes, long-term changes)

EDM Techniques – Segmentation (Cont’d)
• Algorithms
  – Classification: the classes are pre-defined
    • Decision tree induction
    • Bayesian classification
  – Clustering
    • Partitioning algorithms: k-means method, k-medoids method
      – The problem here is that the result depends strongly on the initial guess of the centroids
    • Hierarchical algorithms: AGNES, DIANA, BIRCH, CURE
      – The hierarchical algorithms are not optimal for large datasets
    • Density-based: DBSCAN, OPTICS, DENCLUE
      – Only dots, without meaningful interpretation
    • Grid-based: STING, WaveCluster, CLIQUE
      – How to partition high-dimensional data

EDM Techniques – Outlier Detection
• To find inconsistencies and abnormalities
• Examples
  – Can we identify abnormal changes in NDVI or in a particular species?
  – Has it been unusually hot this October?
• Algorithms (Raymond T. Ng, 2001)
  – Distribution-based approach: outliers are the objects that do not follow the standard distribution
    • Hard to know the distribution
    • Not suitable for high-dimensional datasets
  – Depth-based method: represents the data in k-dimensional space and assigns a depth to each object
    • Does not scale up beyond 3-D
  – Distance-based outlier detection
    • Requires the existence of an appropriate distance function

References
[1] Praveen Kumar and Amanda BT. White, “Scalable Knowledge Discovery for Hydroclimatological Studies”, University of Illinois, 2002.
[2] H. J. Miller and J. Han, “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.
[3] Nabil Adam, Vijay Atluri, Songmei Yu and Yelena Yesha, “Efficient Storage and Management of Environmental Information”, presented at the 11th Mass Storage Conference held by IEEE and NASA, Maryland, April 2002.
[4] Wendolin Bosques, Ricardo Rodriguez, Angelica Rondon and Ramon Vasquez, “A Spatial Data Retrieval and Image Processing Expert System for the World Wide Web”, 21st International Conference on Computers and Industrial Engineering, 1997, pp. 433-436.
[5] Krzysztof Koperski and Jiawei Han, “Discovery of Spatial Association Rules in Geographic Information Databases”, Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD), Vol. 951, Springer-Verlag, pp. 47-66.
[6] Kirk Barrett, “The Meadowlands Environmental Research Institute”, Science on the Semantic Web (SWS) Workshop, Oct. 2002.
[7] Jiawei Han, Russ B. Altman, Vipin Kumar, Heikki Mannila, and Daryl Pregibon, “Emerging Scientific Applications in Data Mining”, Communications of the ACM, Vol. 45, No. 8, August 2002, pp. 54-58.
[8] Krzysztof Koperski, Junas Adhikary, and Jiawei Han, “Spatial Data Mining: Progress and Challenges (survey paper)”, SIGMOD ’96 Workshop on Research Issues in Data Mining and Knowledge Discovery.
[9] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2000.
[10] Raymond T. Ng, “Detecting Outliers from Large Datasets”, in “Geographic Data Mining and Knowledge Discovery”, Taylor & Francis, 2001.

Focus Areas
• Environmental monitoring
• Remote sensing/GIS for land use planning
• Plant and animal inventory and assessment
• Salt-marsh and landfill characterization and restoration
• Assessment and remediation of contaminated sediments
• Land use information management for planning and engineering (predicting land use trends for planners, code enforcement for engineers)
• Scientific data warehousing for efficient management of environmental and remote sensing data
• Scientific data mining for discovering trends, patterns and relationships among land use and environmental data
• Automating land use permit processing workflows through transparent inter-agency interaction

Introduction to Environmental Data (cont’d.)
  – Value-added products:
    • water
    • vegetation
    • temperature
    • true colors (composites)
  – Models of the topography and spatial attributes of the landscape
    • roads, rivers, parcels, schools, zip code areas, city streets and administrative boundaries
• Maps, reports, data sets from government agencies
  – census information that describes the socio-economic and health characteristics of the population
  – real-time data from ground monitoring stations
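
Backup: NDVI and Sudden-Drop Detection (illustrative sketch)

The NDVI query on the "Efficient Software and Mature Technology" slide and the first task of Motivating Examples (1), finding a sudden NDVI drop where NDVI has been consistently high, can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the MERI or RasDaMan implementation: the function names `ndvi` and `sudden_drops`, the thresholds `high` and `drop`, and the toy arrays are all hypothetical and chosen only for illustration.

```python
import numpy as np


def ndvi(red, nir):
    """Per-pixel NDVI = (NIR - red) / (NIR + red); 0 where both bands are 0."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    total = nir + red
    out = np.zeros_like(total)
    # Divide only where the denominator is non-zero; other pixels stay 0.
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


def sudden_drops(series, high=0.6, drop=0.3):
    """Flag pixels that were consistently green and then collapsed.

    series: NDVI stack of shape (time, rows, cols), oldest image first.
    high:   assumed threshold for "consistently high" NDVI.
    drop:   assumed minimum fall from the historical mean to the latest image.
    """
    series = np.asarray(series, dtype=float)
    history, latest = series[:-1], series[-1]
    consistently_high = (history >= high).all(axis=0)
    fell = (history.mean(axis=0) - latest) > drop
    return consistently_high & fell


# Toy 1x2 scene over three dates: pixel 0 stays green, pixel 1 collapses.
red = np.array([[[10, 10]], [[10, 10]], [[10, 40]]])
nir = np.array([[[50, 50]], [[50, 50]], [[50, 50]]])
stack = ndvi(red, nir)
mask = sudden_drops(stack)
print(mask)
```

In the toy scene only the second pixel is flagged. On real imagery the thresholds would need calibration per sensor and season, and the flagged pixels would still have to be combined with the soil index and precipitation record, as the motivating example describes.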