10. GEO-SPATIAL DATABASE MANAGEMENT

10.1 Preamble

The Ganga river basin management plan is an ambitious and unique proposal. It has been conceived to understand and rectify the various environmental issues that have arisen from the continuing expansion of human habitat in the basin. To achieve the goals of the project, scientists from different fields need to work in a synergistic manner. A crucial component of the entire exercise will be an integrated geo-spatial database management system to be used by all thematic groups and policy makers. The system will provide data storage, retrieval, visualization and search capabilities. In addition, it will provide interfaces that the different thematic groups can use for simulation, prediction and analysis of data.

The enhancement of sensor technologies, coupled with the advent of advanced geographic information systems (GIS), provides many opportunities for applications that assess and evaluate natural resources in a sustainable manner. However, such systems require robust and large databases in order to be successful in the long run. Thus, the proposed data centre is a crucial cog on which the smooth running of the Ganga basin project depends.

Several themes have been outlined for managing the different aspects of the plan. A major common effort will be to collect many types of data, including climate, soil conditions, bio-diversity, land usage and socio-economic practices. While these data will come from different sources, it will be beneficial from both a scientific and a management point of view to store all of them in a central repository. The proposed repository or data centre will provide the additional benefit of linking the data from different themes to get an overall perspective.
As the data has been (and will be) collected over a period of time across different spatial sites in the Ganga basin, it will be spatio-temporal in nature. Designing a geo-spatial database management system is, therefore, crucial to the entire project.

10.2 Objectives

We envision five important aspects of the system:
• A database (henceforth referred to as a data centre) that can house and interconnect the different types of data,
• Query, visualization, and retrieval capabilities for the stored data,
• Interfaces that will make the data available to simulation tools,
• Data mining, pattern recognition and knowledge modelling, and
• Creation of a data centre using open source technologies and tools, with the aim of migrating to such a system eventually. It is expected that in the initial stages proprietary software and tools may have to be used, since most end users are familiar with them and will require time and training to migrate.

A unified database that has access to all sorts of data is necessary not only to serve as a central repository, but also to establish the connections across the different spatial sites, periods and sources of data. Moreover, these connections will help the thematic groups to link their own sources of data to other related data for a better understanding of the problems they work on.

However, since the database is intended to cover all data collected or produced by every thematic group, the volume of data will be very large. Such voluminous and continuously increasing data calls for sophisticated query processing and indexing techniques. The database must support improved ways of searching and retrieval in order to be practically useful to the domain scientists. Since the data can have various attributes, multi-dimensional indexing techniques along with suitable similarity measures need to be developed.
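As one illustration of the multi-dimensional indexing mentioned above, the following is a minimal sketch of a uniform grid index over latitude, longitude and time. The class names, field names and cell sizes are hypothetical and chosen only for illustration; they are not part of the proposal. A deployed system would more likely rely on the spatial index of the chosen database or GIS package (for example, an R-tree).

```python
from collections import defaultdict
from dataclasses import dataclass
from math import floor


@dataclass(frozen=True)
class Observation:
    """One measurement at a site; all field names are illustrative."""
    lat: float    # degrees north
    lon: float    # degrees east
    day: int      # days since an arbitrary project epoch
    kind: str     # e.g. "rainfall_mm"
    value: float


class GridIndex:
    """Uniform grid over (lat, lon, day). Cell sizes are assumptions."""

    def __init__(self, cell_deg=0.5, cell_days=30):
        self.cell_deg = cell_deg
        self.cell_days = cell_days
        self.cells = defaultdict(list)

    def _key(self, lat, lon, day):
        # Map a point to the integer coordinates of its grid cell.
        return (floor(lat / self.cell_deg),
                floor(lon / self.cell_deg),
                day // self.cell_days)

    def insert(self, obs):
        self.cells[self._key(obs.lat, obs.lon, obs.day)].append(obs)

    def range_query(self, lat_min, lat_max, lon_min, lon_max,
                    day_min, day_max):
        """Return observations inside the closed box.

        Only grid cells that overlap the box are visited; each candidate
        is then checked exactly against the box bounds.
        """
        lo = self._key(lat_min, lon_min, day_min)
        hi = self._key(lat_max, lon_max, day_max)
        hits = []
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    for obs in self.cells.get((i, j, k), ()):
                        if (lat_min <= obs.lat <= lat_max
                                and lon_min <= obs.lon <= lon_max
                                and day_min <= obs.day <= day_max):
                            hits.append(obs)
        return hits
```

A grid is the simplest structure that prunes by space and time together. It works well when observations are spread fairly evenly, and it degrades when most data falls into a few cells, which is one reason tree-based indexes are usually preferred at scale.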
Another important feature of the data centre is visualization and representation of data in forms that suit the needs of the domain specialists. It is important that the provenance of each piece of data can be tracked and displayed on a map of the Ganga basin, together with the exact time period when it was collected. For a particular site, a time-series of each type of data needs to be displayed. This will also help in disseminating information about the progress of the project and the status of the river to general viewers.

Models will require various abstraction layers on top of the raw data, e.g., a flow abstraction that projects the flows into and out of a chosen object (such as the main stem of the Ganga or one of its tributaries), giving point and extended source flows. The comprehensive data gathering and modelling exercise, both qualitative and quantitative, will also reveal gaps in the existing data and help guide future data collection efforts.

Another very important aspect of the system will be the data mining and pattern recognition components. Since the amount of data is extremely large, it is not possible to sift through it manually and find patterns. Specialized machine learning techniques need to be applied for pattern discovery and trend analysis. Statistical methods and models can be incorporated to identify data that is statistically unlikely and that may, therefore, point to an unusual physical phenomenon warranting further exploration.

Due to the large area of the Ganga basin, it may not be possible to collect data from all the spatial sites at all times. Thus, building an appropriate generative model that describes the different data sources will be very useful. The model will also help to simulate different situations such as floods and droughts, and to predict the future values of various physical parameters. This will be a valuable resource for policy makers and scientists alike.
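To make the idea of flagging statistically unlikely data concrete, the sketch below applies a modified z-score based on the median and the median absolute deviation (MAD) to a single series of readings. This is one common robust technique, not a method prescribed by the proposal. The function name is hypothetical, and the constant 0.6745 and threshold 3.5 follow a widely used convention that would need tuning for each data type.

```python
from statistics import median


def flag_unusual(values, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less affected by the outliers
    themselves than a mean/standard-deviation z-score would be.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median, so the score is undefined;
        # this sketch flags nothing in that case.
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]
```

A flagged reading is only a candidate for review. It may reflect a sensor fault or a data entry error as easily as an unusual physical event, so domain specialists would still need to examine it.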
Since the data will come from different sources, linking the metadata is important for understanding the relationships among the various types of data. Therefore, the construction of knowledge models and ontologies is vital as well.

Research on pattern recognition and statistical analysis can add value as well as support the research of other thematic groups. This research will be long term and will evolve over time through interactions with other thematic groups to understand their data and identify their requirements. It includes, but is not limited to, the following ideas:
• Sensitivity tolerance and confidence levels can be added to the models developed by other thematic groups using statistical analysis.
• Pattern recognition on remote sensing data will be important for developing maps of surface water, glacier extent (and monitoring), soil composition, land use, forest cover (and monitoring), etc.
• Relevant processed data at different times can feed the models of other thematic groups to make better and/or additional parameter predictions.

10.3 Scope

The scope of the project extends to the entire Ganga basin management plan. The data centre will cover the data requirements of all the thematic groups. It will also include a portal cum qualitative knowledge map (Gangapedia) that will serve the communication needs of the project.

10.4 Types of Data

The collection of data is external to the project. It is assumed that the different thematic groups will feed the data collected or generated by them to this group.
Some of the typical sources of data that are expected from them are:
a) Data from water sources
b) River water levels
c) Pollution levels
d) Rainfall
e) Ground water levels
f) Ground water pollution levels
g) Glacier sizes and melt rates
h) Bio-diversity maps
i) Chemical substance levels
j) Data from land sources
k) Land use maps
l) Pollution levels
m) Bio-diversity maps
n) Remote-sensing data
o) Topographic data
p) Soil composition data

10.5 Methodology

The data objects, attributes, sources, views and interfaces will be identified in close consultation with representatives of all thematic groups. A consultative group with representation from each thematic group will be formed to understand the data requirements of each group, and the database group will design and implement a system that meets them. It is expected that these requirements will evolve over the course of the project. The steps below give a more detailed picture of the approach that will be taken:
a) Identify the objects in the entire system.
b) Identify the attributes for each object, in particular the spatial and temporal aspects.
c) Identify the type and structure of data elements and the interfaces needed.
d) Identify the metadata tags for the data elements in the system.
e) Design data mining techniques to access the raw and processed data in different ways.
f) Create a communication portal for within-project and external communication needs.
g) Create qualitative knowledge models showing dependencies and the nature of those dependencies.
h) Design a security policy for access to data.
i) Identify the hardware and software needs (e.g., servers, database, GIS and visualization software, network bandwidth for connectivity, etc.).
j) Design and implement a system that meets the requirements from (a) to (h) above.
k) Design pattern recognition techniques to identify trends and anomalies.
l) Conduct research on other aspects of mapping, modeling, prediction and support to other thematic groups.
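Steps (a) to (d) above can be pictured with a small record type that ties an object to one attribute, its spatial and temporal extent, and free-form metadata tags. The sketch below is only an assumption about how such a record might look. The actual objects, attributes and tags are to be agreed with the thematic groups, and every name here (DataElement, find_by_tags, the individual fields) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataElement:
    """One stored value with its spatial, temporal and metadata context."""
    object_id: str          # step (a): the object described, e.g. a river reach
    attribute: str          # step (b): what was measured or derived
    lat: float              # step (b): spatial aspect (a point, for simplicity)
    lon: float
    collected_from: date    # step (b): temporal aspect (closed interval)
    collected_to: date
    value: float            # step (c): a single numeric value in this sketch
    unit: str
    metadata: dict = field(default_factory=dict)   # step (d): metadata tags

    def overlaps(self, start, end):
        """True if the collection period intersects [start, end]."""
        return self.collected_from <= end and start <= self.collected_to


def find_by_tags(elements, **tags):
    """Return the elements whose metadata matches every given tag exactly."""
    return [e for e in elements
            if all(e.metadata.get(k) == v for k, v in tags.items())]
```

Keeping the metadata as open key-value tags lets each thematic group attach its own descriptors without a schema change. The trade-off is that tag names and allowed values then need to be governed separately, for instance through the knowledge models and ontologies discussed earlier.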
10.6 Work Plan

Activities are scheduled over six three-month periods (0-3, 3-6, 6-9, 9-12, 12-15 and 15-18 months):
• Setup of the basic data centre (items (i), (iv) and (viii) with basic/standard data access capabilities).
• Development of specialized interfaces for simulation and modeling; visualization; other abstraction layers; creation of qualitative knowledge models.
• Initiate research into data intensive modeling and prediction: data mining, pattern recognition, machine learning, knowledge modeling, and ontology creation.

10.7 Deliverables

Data centre with appropriate querying, retrieval, visualization, API interfaces and data abstraction facilities. The data will be acquired by the individual thematic groups and given to the database group.

10.8 The Team

S No  Name                    Affiliation  Role
1     Alka Bhushan            IIT Bombay   Member
2     N L Sarada              IIT Bombay   Member
3     Smita Sengupta          IIT Bombay   Member
4     Umesh Bellur            IIT Bombay   Member
5     A K Gosain              IIT Delhi    Member
6     A K Mittal              IIT Delhi    Member
7     Arnab Bhattacharya      IIT Kanpur   Member
8     Bharat Lohani           IIT Kanpur   Member
9     Harish Karnick          IIT Kanpur   Member
10    Krithika Venkataramani  IIT Kanpur   Leader
11    Onkar Dikshit           IIT Kanpur   Member
12    Purnendu Bose           IIT Kanpur   Member
13    Rajiv Sinha             IIT Kanpur   Member
14    T V Prabhakar           IIT Kanpur   Member
15    Vinod Tare              IIT Kanpur   Member