SDSC Data and Knowledge Systems Program & GEON: The Geosciences Network
Chaitan Baru
Director, DAKS Program
PI (SDSC), GEON
AMD Seminar, SDSC, April 1 2004
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES

Outline
• SDSC and cyberinfrastructure
• GEON: Cyberinfrastructure for the Geosciences

SDSC Organizational Structure
www.sdsc.edu
~400 employees/students total
Office of the Director
  • Fran Berman, Director
  • Alan Blatecky, Exec Director
  • Richard Moore, NPACI Exec Director
  • Anke Kamrath, COO
Integrative Biological Sciences (IBS)
  • Molecular biology
  • Neuroscience
  • Structural Genomics
  • Cell Signaling
  • Proteomics
Integrative Computational Sciences (ICS)
  • Computational chemistry
  • Applied math
  • Ecoinformatics
  • Environmental Science
  • Computational Economics
  • User Services
Data and Knowledge Systems (DAKS)
  • Data integration
  • Distributed data management
  • Scientific databases
  • Data mining
  • Scientific data visualization
Grids and Clusters (G&C)
  • Cluster management
  • Portals
  • Grid middleware
High-End Computing (HEC)
  • Production systems
Networking and Security (N&S)
  • Production networking and security
  • Research on network monitoring
Communications and Outreach
Business Office
Education and Training

Cyberinfrastructure Vision
[Cyberinfrastructure] refers to infrastructure based upon distributed computer, information, and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for the knowledge economy.
Source: [NSF Blue Ribbon Panel]
Cyberinfrastructure for a knowledge economy requires a new and innovative infrastructure for data management, data exploration, analysis, visualization, and knowledge sharing.

High-End Cyberinfrastructure
[Diagram: Cyberinfrastructure at the center, surrounded by:]
• Instrumentation (large and/or many small)
• People and Training
• Computation
• Large / Complex Databases and Libraries
• Software
• High-speed Network Connectivity
Courtesy: Dr. Peter Freeman, Assistant Director, CISE, NSF (NSF - pf - 8/02)

Data: A Cyberinfrastructure “Killer App”
• Over the next decade, data will come from everywhere
  • Scientific instruments
  • Experiments
  • Sensors and sensornets
  • New devices (personal digital devices, computer-enabled clothing, cars, …)
• And be used by everyone
  • Scientists
  • Consumers
  • Educators
  • General public
• SW environment will need to support unprecedented diversity, globalization, integration, scale, and use
[Figure labels: Data from sensors, Data from instruments, Data from simulations, Data from analysis]

The SDSC DAKS Program
• Organized as a set of R&D Labs
  1. Knowledge-based Integration (Bertram Ludaescher)
  2. Advanced Query Processing (Amarnath Gupta)
  3. Advanced Database Projects (David Archbell)
  4. Data Mining (Tony Fountain)
  5. Visualization (Michael Bailey)
  6. Spatial Information Systems (Ilya Zaslavsky)
  7. Geoinformatics (Dogan Seber)
  8. Storage Resource Broker, SRB (Arcot Rajasekar)
  9. Sustainable Archives and Digital Library Technology (Richard Marciano)

From Data to Information to Knowledge
[Layered diagram, labels as listed: Applications (Geoinformatics, Biosciences, Ecoinformatics, …); Visualization; Data Mining, Simulation Modeling, Analysis, Data Fusion; Knowledge-Based Integration; Advanced Query Processing; Grid Storage; Filesystems, Database Systems; High speed networking; Storage hardware]
How do we represent data, information and knowledge to the user?
How do we detect trends and relationships in data?
How do we obtain usable information from data?
How do we collect, access and organize data?
How do we configure computer architectures to optimally support data-oriented computing?
How do we combine data, knowledge and information management with simulation and modeling?
[Diagram labels: Networked Storage (SAN), sensornets, instruments]
SDSC Data and Knowledge Systems Program

SDSC and Cyberinfrastructure Projects
• SDSC is involved in several NSF- and NIH-funded, community-based CI projects
  • BIRN: Biomedical Informatics Research Network, funded by NIH. Integrating distributed brain image data
  • GEON: Geosciences Network. Integrating distributed Earth Sciences data
  • SEEK: Scientific Environment for Ecological Knowledge. Integrating distributed biodiversity data along with tools
  • TeraGrid: Providing access to high-end, national-scale, physical computing infrastructure
  • NEES: Network for Earthquake Engineering Simulation. Integrating distributed earthquake simulation and sensor data

TeraGrid: High-End Cyberinfrastructure

GEON: The Geosciences Network
• NSF ITR Project, 2002-2007, $11.5M
PI Institutions
• Arizona State University
• Bryn Mawr College
• Penn State University
• Rice University
• San Diego State University
• San Diego Supercomputer Center/UCSD
• University of Arizona
• University of Idaho
• University of Missouri, Columbia
• University of Texas at El Paso
• University of Utah
• Virginia Tech
• UNAVCO
• Digital Library for Earth System Education (DLESE)
Partners
• Chronos
• CUAHSI-HIS
• ESRI
• Geological Survey of Canada
• IBM
• Kansas Geological Survey
• Lawrence Livermore National Laboratory
• U.S.
Geological Survey (USGS)
• California Institute for Telecommunications and Information Technology (Cal-(IT)2)
• Georeference Online
Other Affiliates
• Southern California Earthquake Center (SCEC), EarthScope, IRIS, NASA GSFC

The GEON Project
• Close collaboration between geoscientists and IT to interlink databases and Grid-enable applications
• “Deep” data modeling of 4D data
  • Situating 4D data in context: spatial, temporal, topic, process
• Semantic integration of Geosciences data
  • Logic-based formalisms to represent knowledge and map between ontologies
• Grid computing
  • Deploy a prototype GEON Grid: heterogeneous networks, compute nodes, storage capabilities. Enable sharing of data, tools, expertise.
• Interaction environments
  • Information visualization. Visualization of concept maps
  • Remote data visualization via high-speed networks
  • Augmented reality in the field

Science Challenges and GEON Research
• Origin and 4-D Evolution of Continents
High Level
--Plate Tectonics
--Crustal Growth Through Time
--Terranes
--Terrane Recognition
--Integration of Distributed Databases
--Knowledge Representation of Domains
--Domain Ontology
--Databases
--Data Providers
Data Level
Krishna Sinha, VaTech

A Geoscientist’s Information Integration Scenario
• What are the distribution and U/Pb zircon ages of A-type plutons in VA?
• How about their 3-D geometry using gravity data?
• How do the plutons relate to the host rock structures?
Information Integration
• Digital geologic map of Virginia (plutons in Virginia)
• Geochemical database (chemical data)
• Geophysical database (gravity contours)
• Geochronologic database (Concordia)
• Structure database (foliation map)

Drilling into the Concept Space
[Figure: concept space for Plate Tectonics]
Krishna Sinha, VaTech

Components of the GEONgrid Architecture
• GEONgrid Physical Implementation
• Core Grid Services
  • Registry, authentication, access control, monitoring, replication, distributed filesystem, collection management (SRB), job submission, e.g. launch job to TeraGrid
• “Higher-Order” Services
  • Registration: data and metadata, schema, ontology, services
  • Data Integration: spatial data integration, data systems integration, schema integration
  • 2D Visualization, including GIS
  • Workflow
  • 3D Viz, Augmented Reality
• Portal
  • Portlet-based design. User space, GeonSearch/GeoWorkbench.
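
The Data Integration service listed above amounts to defining a view across independently maintained sources. As a minimal sketch only, assuming two hypothetical in-memory tables that share a `unit_id` key (the field names, pluton names, and ages are illustrative placeholders, not GEON schemas or data), the scenario question about U/Pb zircon ages of A-type plutons in VA could be expressed like this:

```python
# Hypothetical stand-ins for two registered sources. Every name and value
# here is a placeholder chosen for illustration, not real GEON content.
geologic_map = [
    {"unit_id": "P1", "name": "Pluton A", "rock_type": "A-type granite", "state": "VA"},
    {"unit_id": "P2", "name": "Pluton B", "rock_type": "I-type granite", "state": "VA"},
    {"unit_id": "P3", "name": "Pluton C", "rock_type": "A-type granite", "state": "NC"},
]
geochronology = [
    {"unit_id": "P1", "method": "U/Pb zircon", "age_ma": 100.0},
    {"unit_id": "P2", "method": "U/Pb zircon", "age_ma": 200.0},
]


def a_type_pluton_ages(map_rows, age_rows, state):
    """Return (name, age) pairs for A-type plutons in one state.

    Acts as a simple "view": it joins the map source to the geochronology
    source on unit_id. Plutons with no U/Pb zircon age get None.
    """
    # Index the geochronology source by unit_id, keeping only U/Pb zircon ages.
    ages = {
        row["unit_id"]: row["age_ma"]
        for row in age_rows
        if row["method"] == "U/Pb zircon"
    }
    # Filter the map source, then look up each matching unit's age.
    return [
        (row["name"], ages.get(row["unit_id"]))
        for row in map_rows
        if row["state"] == state and row["rock_type"].startswith("A-type")
    ]


result = a_type_pluton_ages(geologic_map, geochronology, "VA")
```

In a real deployment the two lists would be replaced by queries against registered databases, and the shared key would come from schema or ontology registration rather than being hard-coded.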

GEONgrid Physical Implementation
• PoP Nodes only
  • VaTech, Bryn Mawr, Penn State, Rice, Utah EGI, Utah, DLESE, UNAVCO
• PoP nodes + Data Nodes
  • Idaho, Arizona State, SDSC
• PoP nodes + Compute Nodes
  • Missouri, UTEP, SDSC

The GEON Grid
[Map of the GEON Grid. Legend: PoP node, Compute cluster, Data Cluster, 1TF cluster, Partner Projects, Partner services. Partner labels: Geological Survey of Canada, Chronos, Livermore, KGS, USGS, ESRI, CUAHSI]

GEON Node Status

Industry Involvement
• ESRI
  • PoP Node in Redlands
  • Access to ArcWeb services and content
  • Use of Arc software
  • Technical session on GEON, ESRI Users’ Conference, San Diego, Aug. 9-11, 2004
• IBM
  • Use of GMR (Grid Movement and Replication) software
  • Free DB2 for academic use
• HP
  • Donation of an Itanium cluster for GEON development and to power GEON portal

GEON Services
• “Hosted” vs “non-hosted” services
• Hosted: service is implemented within the physical GEONgrid environment (i.e. on one of the systems).
  • The implementation can benefit from core capabilities provided in GEONgrid, e.g.
replication, load-balancing
  • Need at least a PoP node to host a service
  • Hosted databases will be stored at Data Nodes, but may be replicated at one or more PoP nodes
• Data nodes
  • Require Internet2 connectivity
  • Will be backed up to SDSC
  • Will be replicated among themselves

GEON Compute Nodes
• Compute nodes
  • Want to create at least a few nodes as a TeraGrid “sandbox”
  • GEONgrid is currently based on Red Hat Linux, OGSI and Globus Toolkit Version 3 (GT3)
  • TeraGrid is currently based on SuSE Linux, GT2.4
  • Sandbox allows GEON PIs to develop and debug software in GEONgrid prior to sending jobs to TeraGrid
  • GEON has a TeraGrid allocation (30,000 hours)
• Need to keep in mind GEONgrid heterogeneity
  • Windows and other platforms

Core Grid Services
• Registry:
  • a place to register and find not only basic Web services but all services (e.g. PGAP, Gravity Database, Seismic Simulation Tool, …)
• Authentication:
  • using GEON Certificate Authority and Grid certificates
• Access control:
  • investigating various systems for policy-based access to services
• Data replication:
  • initial target is IBM GMR software for replicating files as well as databases
• Support for various data systems:
  • e.g., SDSC Storage Resource Broker (SRB) and OpenDAP
  • Implement servers at Data Nodes
• Job submission, e.g. launch job to TeraGrid.

Higher-Order Grid Services
• Registration
  • Data and metadata, schema, ontology, services
  • Important in order to support search functionality
• Data Integration
  • Defining “views” across multiple sources
  • Multiple database schemas, e.g. in GEON PAST (Paleogeography and AMOCO database), Chronos (Paleostrat, Neptune, Paleobiology), Geochemistry (Navdat, PetDB, …)
  • Multiple maps and map layers
• GIS and 2D Viz
  • Integrating map layers.
“Simple” mapping service.
  • SVG-based data access and visualization tools
• Workflow
  • Iconic representation of databases and tools
  • Ability to link together tools and data to specify computations
  • Based on Kepler system

GeonSearch
• Ad hoc search versus querying of pre-established “views”
• Ad hoc Search
  • Search/discover information on data, services, experiments, “other” (e.g., people, organizations)
  • Display results via map interfaces, semantic graphs
• View-based querying
  • E.g., use ad hoc search to find a set of databases, map layers of interest; define a specific way of combining data across these various sources

Knowledge Representation in GEON
• Controlled vocabularies
• Database schema (relational, XML, …)
• Conceptual schema (ER, UML, …)
• Thesauri (synonyms, broader term/narrower term)
• Taxonomies
• Informal/semi-formal representations
  • “Concept spaces”, “concept maps”
  • Labeled graphs / semantic networks (RDF)
• Formal ontologies, e.g., in [Description] Logic (OWL)
  • “formalization of a specification” constrains possible interpretation of terms
• What is an ontology? An ontology usually …
  • specifies a theory (a set of models) by …
  • defining and relating …
  • concepts representing features of a domain of interest

GEON Ontology Development Workshops
• Workshop format
  • Led by GEON PIs
  • Involves small group of domain experts from community
  • Participation by a few IT experts in data modeling and knowledge representation
• Igneous Petrology, led by Prof. Krishna Sinha, VaTech, 2003
• Seismology, led by Prof. Randy Keller, UT El Paso, Feb 24-25, 2004
• Aqueous Geochemistry, led by Dr. William Glassley, Livermore Labs, March 2-3, 2004
• Structural Geology, led by Prof. John Oldow, Univ.
of Idaho, 2004
• Metamorphic Petrology, led by Prof. Maria Crawford, Bryn Mawr, under planning

A Multi-Hierarchical Rock Classification “Ontology” (GSC)
[Figure: classification hierarchies for Genesis, Fabric, Composition, Texture]
Kai Lin, SDSC; Boyan Brodaric, GSC

Geologic Map Integration in the Portal
• After registering datasets, ontologies (here: “classes”), and an application (“OMI”), the datasets can be searched and displayed in an integrated way.
Kai Lin, SDSC

Use of Knowledge Structures
• Conceptual models of a domain or application (communication means, system design, …)
• Classification of …
  • concepts (taxonomy) and
  • data/object instances through classes
• Analysis of ontologies, e.g.
  • Graph queries (reachability, path queries, …)
  • Reasoning (concept subsumption, consistency checking, …)
• Targets for semantic data registration
• Conceptual indexes and views for
  • searching,
  • browsing,
  • querying, and
  • integration of registered data

Creating and Sharing Concept Maps (here: Seismology concept map & Cmap tool)
• Bring scientists together for 2+ days
• Add CS/KBMS “types”
• Create concept maps
• Refine
• Iterate from napkin drawings, to concept maps, to ontologies
Randy Keller (UTEP), Bertram Ludaescher, Kai Lin, Dogan Seber (SDSC), et al

Community-Based Ontology Development
• Draft of an aqueous geochemistry ontology developed by scientists
Bill Glassley (LLNL), Bertram Ludaescher, Kai Lin (SDSC), et al

GeoWorkbench
• Data and service registration
  • Create spatial, temporal, concept-based indexes as part of registration process
• Ability to define views
  • e.g.
using GeonSearch to find data, services, etc.
• Run analysis routines
  • e.g. via workflow specifications, using Kepler
• Personal space to “save” and “bookmark” work
• Visualize output, save output, feed output to other services

3D Earthquake Modeling using HPC

Use of LIDAR data for geo-morphology
Ramon Arrowsmith, Chris Crosby, Arizona State University
• Manipulation, analysis and use of LIDAR (LIght Detection And Ranging) data

LIght Detection And Ranging
• Airborne scanning laser rangefinder
• Differential GPS
• Inertial Navigation System
• 30,000 points per second at ~15 cm accuracy
• $400–$1000/mi^2, 10^6 points/mi^2, or 0.04–0.1 cents/point
• Extensive filtering to remove tree canopy (virtual deforestation)
Figure from R. Haugerud, U.S.G.S - http://duff.geology.washington.edu/data/raster/lidar/About_LIDAR.html
Ramon Arrowsmith, Chris Crosby, ASU

Northern San Andreas LIDAR: fault geomorphology
[Figure panels: Full Feature DEM, Bare Earth DEM]
Ramon Arrowsmith, Chris Crosby, ASU

Processing LiDAR data: the problems
• Huge datasets (Fort Ross, CA 7.5 min quad):
  • 1 GB of point return (.txt) data
  • 150 MB of point return (.txt) data
  • 5.5 MB after filtering for ground returns
• How do we grid these data?
  • ArcGIS can’t handle it
  • Expensive commercial software not an option for most data consumers
Ramon Arrowsmith, Chris Crosby, ASU

GRASS as a processing tool for LiDAR
• GRASS: Open source GIS
• Interpolation commands designed for large data sets
  • Splines use local point density to segment data into rectangular areas for interpolation
  • Can control spline tension and smoothness
• Modular configuration could easily be implemented within the GEON workflow
  • E.g.: User uploads point data to a remote site, where a GRASS interpolation module runs on a supercomputer and returns a raster file to the user.
  • Host the large LIDAR data sets on GEON Data Node at SDSC, with access to large cluster computers
Ramon Arrowsmith, Chris Crosby, ASU

Contact Information
baru@sdsc.edu
AMD Seminar, SDSC, April 1 2004 / NASA Seminar, March 23, 2004
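
As a quick check on the LIDAR acquisition costs quoted on the LIght Detection And Ranging slide, the per-point range follows directly from the per-square-mile range. The sketch below assumes the quoted density is 10^6 points per square mile; the function and constant names are introduced here only for illustration.

```python
# Assumed reading of the slide's density figure: one million points per mi^2.
POINTS_PER_SQ_MILE = 10**6


def cents_per_point(dollars_per_sq_mile, points_per_sq_mile=POINTS_PER_SQ_MILE):
    """Convert an acquisition cost in dollars per square mile to cents per point."""
    # Dollars to cents, then spread over the points collected in that square mile.
    return dollars_per_sq_mile * 100 / points_per_sq_mile


low = cents_per_point(400)    # lower bound of the quoted $/mi^2 range
high = cents_per_point(1000)  # upper bound of the quoted $/mi^2 range
print(f"{low:.2f} to {high:.2f} cents per point")
```

The two bounds reproduce the 0.04–0.1 cents/point range given on the slide.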