GEON: The Geosciences Network & The National Laboratory for Advanced Data Research (NLADR)
Chaitan Baru, Division Director, Science R&D, San Diego Supercomputer Center
e-Science Seminar, June 28th, 2004, Edinburgh, Scotland

Outline
• About SDSC
• Cyberinfrastructure projects, e.g., TeraGrid, BIRN, SCEC/CME, GEON, SEEK, NEES, …
• GEON
• NLADR

SDSC Organization Chart
• Director: Fran Berman; Executive Director: Vijay Samalam
• Administration & Operations
• Strategic Partnerships & External Relations
• User Services & Development (Anke Kamrath): consulting, training, documentation, user portals, outreach & education, user services
• Production Systems (Richard Moore): allocated systems, production servers, networking ops, SAN/storage ops, servers/integration, security ops, TeraGrid operations
• Technology R&D (Vijay Samalam)
• Science R&D (Chaitan Baru): Advanced Cyberinfrastructure Lab, SRB Lab, networking research, HPC research, Tech Watch group, SDSC/Cal-(IT)2 Synthesis Center, Data & Knowledge Labs, science projects (bio-, neuro-, eco-, geoinformatics, …), NLADR

An Emphasis on End-to-End "Cyberinfrastructure" (CI)
• Development of broad infrastructure, including services, not just computational cycles
  • Referred to as "e-Science" in the UK
• A major emphasis at SDSC on data, information, and knowledge
• Increased focus on:
  • Strategic applications and "strategic" communities
  • Training and outreach, e.g., Summer Institutes
  • Community codes, but also data collections and databases
  • "Researcher-level" services, e.g., Linux cluster management software, to ease the transition from a local environment to a large-scale computing environment

SDSC and CI Projects
• SDSC is involved in several NSF- and NIH-funded, community-based CI projects:
• TeraGrid – Provides access to high-end, national-scale, physical computing infrastructure
• BIRN – Biomedical Informatics Research Network, funded by NIH; integrating distributed brain image data
• GEON – Geosciences Network; integrating distributed Earth science data
• SCEC/CME – Southern California Earthquake Center Community Modeling Environment
• SEEK – Science Environment for Ecological Knowledge; integrating distributed biodiversity data along with tools
• OptIPuter – Distributed computing environment using Lambda Grids
• NEES – Network for Earthquake Engineering Simulation; integrating distributed earthquake simulation and sensor data
• ROADNet – Real-time Observatories, Applications, and Data management Network
• TeraBridge – Health monitoring of civil infrastructure, …

The TeraGrid: High-End Grid Infrastructure
[Map of TeraGrid sites, including PSC, Purdue, Indiana, Oak Ridge, and UT Austin]

Typical Characteristics of CI Projects
• Close collaboration between science and IT researchers
• Need to provide data and information management…
  • Storage management, archiving
  • Data modeling; semantic modeling (spatial, temporal, topic, process)
  • Data and information visualization
  • Semantic integration of data: logic-based formalisms to represent knowledge and map between ontologies
• …as well as high-end computing
  • BIRN, SCEC, GEON, and TeraBridge all have allocations on the TeraGrid
• Convert community codes into Web/Grid services, enabling scientists to access much larger computing capability from a local cluster/desktop (see the sketch after this slide)
• Provide support for scientific workflow systems (visual programming environments for Web services)
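A note on the "community codes into Web/Grid services" point: the pattern is to put a thin service layer in front of an existing batch code. The sketch below is a minimal, hypothetical illustration in Python; GEON itself used the Web/Grid service stacks of the day (e.g., SOAP and Globus), and the binary name `gravity_model` and its parameters are invented for the example.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical legacy community code, normally run from the shell as:
#   ./gravity_model <depth_km> <density>
COMMUNITY_CODE = "./gravity_model"

class CodeService(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the model parameters posted by the client as JSON.
        length = int(self.headers["Content-Length"])
        params = json.loads(self.rfile.read(length))
        # Invoke the unmodified code on the server's (larger) machine,
        # exactly as a user would at the command line.
        result = subprocess.run(
            [COMMUNITY_CODE, str(params["depth_km"]), str(params["density"])],
            capture_output=True, text=True,
        )
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"stdout": result.stdout}).encode())

if __name__ == "__main__":
    # Any HTTP client (or a workflow engine) can now drive the code remotely.
    HTTPServer(("", 8080), CodeService).serve_forever()
```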
Biomedical Informatics Research Network: Example of a "Community" Grid
• PI of BIRN CC: Mark Ellisman
• Co-Is of BIRN CC: Chaitan Baru, Phil Papadopoulos, Amarnath Gupta, Bertram Ludaescher

The GEONgrid: Another "Community" Grid
[Map of GEONgrid: PoP nodes at the Rocky Mountain and Mid-Atlantic Coast testbeds; partners including the Geological Survey of Canada, Chronos, OptIPuter, Livermore, NASA, USGS, KGS, NAVDAT, ESRI, SCEC, and CUAHSI; legend distinguishes PoP nodes, partner projects, compute clusters, data clusters, partner services, and a 1TF cluster]
www.geongrid.org

Project Overview
• Close collaboration between geoscientists and IT researchers to interlink databases and Grid-enable applications
• "Deep" data modeling of 4D data: situating 4D data in context (spatial, temporal, topic, process)
• Semantic integration of geoscience data: logic-based formalisms to represent knowledge and map between ontologies
• Grid computing: deploy a prototype GEONgrid with heterogeneous networks, compute nodes, and storage capabilities; enable sharing of data, tools, and expertise; specify and execute workflows (see the pipeline sketch after this slide)
• Interaction environments
  • Information visualization; visualization of concept maps
  • Remote data visualization via high-speed networks
  • Augmented reality in the field
• Linkage to BIRN
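To make "specify and execute workflows" concrete: a scientific workflow chains independently hosted services into a dataflow. GEON's workflow support drew on the scientific-workflow efforts at SDSC (the Kepler line of systems); the sketch below is only a plain-Python illustration of the dataflow pattern, and the step names and payloads are invented.

```python
from typing import Callable, Iterable

# A workflow step is any function from one intermediate product to the next.
Step = Callable[[object], object]

def run_pipeline(data: object, steps: Iterable[Step]) -> object:
    """Execute the steps in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical stand-ins for the Web-service calls a workflow engine would
# orchestrate; names are illustrative only.
def fetch_seismic_records(region):          # query a remote data service
    return {"region": region, "records": ["pick-1", "pick-2", ""]}

def filter_by_quality(dataset):             # drop empty/low-quality records
    dataset["records"] = [r for r in dataset["records"] if r]
    return dataset

def grid_and_map(dataset):                  # hand off to a mapping service
    return f"map of {dataset['region']} from {len(dataset['records'])} records"

print(run_pipeline("Rocky Mountain testbed",
                   [fetch_seismic_records, filter_by_quality, grid_and_map]))
```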
Funding Sources
• National Science Foundation ITR project, 2002–2007, $11.6M
• Also, $900K for Chronos and $1M for CUAHSI-HIS (NSF)

PI Institutions
• Arizona State University
• Bryn Mawr College
• Penn State University
• Rice University
• San Diego State University
• San Diego Supercomputer Center/UCSD
• University of Arizona
• University of Idaho
• University of Missouri, Columbia
• University of Texas at El Paso
• University of Utah
• Virginia Tech
• UNAVCO
• Digital Library for Earth System Education (DLESE)

Partners
• California Institute for Telecommunications and Information Technology, Cal-(IT)2
• Chronos
• CUAHSI-HIS
• ESRI
• Geological Survey of Canada
• Georeference Online
• HP
• IBM
• IRIS
• Kansas Geological Survey
• Lawrence Livermore National Laboratory
• NASA Goddard, Earth System Division
• Southern California Earthquake Center (SCEC)
• U.S. Geological Survey (USGS)

Affiliated Project
• EarthScope

Science Drivers (1): DYSCERN (DYnamics, Structure, and Cenozoic Evolution of the Rocky Mountains)
• The Rocky Mountain region sits at the apex of a broad, dynamic orogenic plateau between the stable interior of North America and the active plate margin along the west coast.
• For the past 1.8 billion years, the region has been the focus of repeated tectonic activity…
• …and it has experienced complex intra-plate deformation for the past 300 million years.
• The deformation processes involved are the subject of considerable debate.
• GEON is undertaking an ambitious project to map the lithospheric structure of the Rocky Mountain region in a highly integrated analysis, and to feed the result into a 3-D geodynamic model…
• …to improve our understanding of the Cenozoic evolution of this region.

Science Drivers (2): CREATOR (Crustal Evolution: Anatomy of an Orogen)
• The Appalachian Orogen is a continental-scale mountain belt that provides a geologic template for examining the growth and break-up of continents through plate-tectonic processes. The record spans a period in excess of 1,000 million years.
• Focus on developing an integrated view of the collisional processes represented by the Siluro-Devonian Acadian Orogeny. Integration scenarios will require IT-based solutions, including the design of ontologies and new tools.
• Research activities include:
  • Organization of a geologic and petrologic database for the Mid-Atlantic testbed
  • Development of an ontologic framework to facilitate Web-based analysis of data
  • Registration of geologic and terrane maps, and of data for igneous rocks
  • Application of data-mining techniques for discovering similarities in geologic databases
  • Design of workflows for Web-based navigation and analysis of maps and igneous-rock databases
  • Development of Web services for mineral and rock classification, including use of SVG-based graphics (a classification sketch follows this slide)
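For a sense of what a mineral/rock classification service computes: given modal mineralogy, it assigns a rock name from a standard scheme. The sketch below is a deliberately simplified stand-in (the slides do not give the actual GEON service logic); it classifies a few plutonic rocks from normalized quartz/alkali-feldspar/plagioclase percentages in the spirit of the QAPF diagram, with field boundaries rounded for illustration rather than taken verbatim from the IUGS reference.

```python
def classify_plutonic(quartz: float, alkali_fsp: float, plagioclase: float) -> str:
    """Toy QAPF-style classification from modal mineral percentages."""
    total = quartz + alkali_fsp + plagioclase
    q = 100.0 * quartz / total                              # normalized quartz %
    p_ratio = 100.0 * plagioclase / (alkali_fsp + plagioclase)
    if q > 60:
        return "quartz-rich granitoid"
    if q >= 20:                                             # the granitoid field
        if p_ratio < 10:
            return "alkali-feldspar granite"
        if p_ratio < 65:
            return "granite"
        if p_ratio < 90:
            return "granodiorite"
        return "tonalite"
    return "quartz-poor rock (outside this sketch)"

# A Web-service wrapper would accept these numbers in a request, and could
# return the result embedded in an SVG rendering of the classification diagram.
print(classify_plutonic(quartz=30, alkali_fsp=25, plagioclase=45))  # -> granite
```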
GEONgrid Service Layers
• Portal layer: login, myGEON, GEONsearch, GeoWorkbench
• Services layer: registration services, data mediation services, indexing services, visualization & mapping services, workflow services
• Core Grid services: authentication, monitoring, scheduling, catalog, data transfer, replication, collection management, databases
• Physical Grid: Red Hat Linux, ROCKS, Internet, I2, OptIPuter

GEON Workbench: Registration
• Uploadable:
  • OWL ontologies
  • OWL inter-ontology mappings ("articulations")
  • Data sets (shapefiles)
• "Semantic registration":
  • Link data set D with ontology O1 (with an instance-based heuristic)
  • Query D using ontology O2 (e.g., rock classification: O1 = GSC, O2 = BGS)
• Ontology-enabled applications

A Multi-Hierarchical Rock Classification "Ontology" (GSC)
[Diagram: parallel classification hierarchies by genesis, fabric, composition, and texture]
Kai Lin, SDSC; Boyan Brodaric, GSC

Geology Workbench: Uploading Ontologies
[Screenshot: choose an OWL file to upload; click a file to check its details; the namespace can be used to import this ontology into others]

Geology Workbench: Data Registration
• Step 1: Submit the data set
[Screenshot: click on Submission, enter a data set name, select a shapefile, and choose an ontology class]

Geology Workbench: Data Registration
• Step 2: Map data to the selected ontology
[Screenshot: among the shapefile attributes (AREA, PERIMETER, AZ_1000, AZ_1000_ID, GEO, PERIOD, ABBREV, DESCR, D_SYMBOL, P_SYMBOL), the PERIOD column contains information about geologic age]

Geology Workbench: Data Registration
• Step 3: Resolve mismatches
[Screenshot: two terms are not matched by any ontology term; "algonkian" is manually mapped into the ontology]

Geology Workbench: Ontology-Enabled Map Integrator
[Screenshot: choose the classes of interest; the integrated map highlights all areas with the age Paleozoic]

Geology Workbench: Change Ontology
[Screenshot: submit a mapping; an ontology mapping between the British and Canadian rock classifications lets the same data be queried under either scheme]
(A sketch of this registration-and-query flow follows this slide.)
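The workbench steps above boil down to two operations: registering a data set's attribute values against ontology classes, and answering queries by class subsumption. The sketch below shows that logic over a toy geologic-time hierarchy in plain Python; GEON does this over uploaded OWL ontologies with reasoning support, and the hierarchy and feature values here are trimmed examples, not GEON data.

```python
# Toy subclass table (child -> parent) standing in for an uploaded OWL
# ontology of geologic ages; the names are real, the structure abridged.
SUBCLASS_OF = {
    "Cambrian": "Paleozoic", "Devonian": "Paleozoic", "Permian": "Paleozoic",
    "Triassic": "Mesozoic", "Jurassic": "Mesozoic",
    "Paleozoic": "Phanerozoic", "Mesozoic": "Phanerozoic",
}

# Semantic registration, step 3: "algonkian" matched no ontology term and was
# mapped manually (it is an obsolete name roughly equivalent to Proterozoic).
MANUAL_MAPPING = {"algonkian": "Proterozoic"}

def ancestors(term: str) -> list[str]:
    """Walk subclass edges upward to collect every broader class."""
    out = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        out.append(term)
    return out

def select_units(features, target_class):
    """Ontology-enabled query: keep map units whose age falls under target."""
    hits = []
    for unit, period in features:
        cls = MANUAL_MAPPING.get(period.lower(), period)
        if cls == target_class or target_class in ancestors(cls):
            hits.append(unit)
    return hits

# PERIOD values as they might appear in a shapefile attribute table (invented).
features = [("unit-17", "Devonian"), ("unit-42", "Jurassic"), ("unit-03", "Algonkian")]
print(select_units(features, "Paleozoic"))   # -> ['unit-17']
```

Changing the query ontology (e.g., GSC to BGS) amounts to composing this lookup with an articulation mapping between the two class hierarchies before the subsumption test.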
GEON Ontology Development Workshops
• Workshop format:
  • Led by GEON PIs
  • Involves a small group of domain experts from the community
  • Participation by a few IT experts in data modeling and knowledge representation
• Igneous petrology, led by Prof. Krishna Sinha, Virginia Tech, 2003
• Seismology, led by Prof. Randy Keller, UT El Paso, Feb 24–25, 2004
• Aqueous geochemistry, led by Dr. William Glassley, Livermore Labs, March 2–3, 2004
• Structural geology, led by Prof. John Oldow, Univ. of Idaho, 2004
• Metamorphic petrology, led by Prof. Maria Crawford, Bryn Mawr, being planned
• Chronos and CUAHSI are planning ontology efforts
• Also, ongoing ontology work in SCEC
• Discussions with Steve Bratt, COO, W3C

Community-Based Ontology Development
• Draft of an aqueous geochemistry ontology developed by scientists
Bill Glassley (LLNL); Bertram Ludaescher, Kai Lin (SDSC); et al.

Levels of Knowledge Representation
• Controlled vocabularies
• Database schemas (relational, XML, …)
• Conceptual schemas (ER, UML, …)
• Thesauri (synonyms, broader term/narrower term)
• Taxonomies
• Informal/semi-formal representations: "concept spaces", "concept maps", labeled graphs/semantic networks (RDF)
• Formal ontologies, e.g., in [description] logic (OWL): the "formalization of a specification" constrains the possible interpretations of terms

Use of Knowledge Structures
• Conceptual models of a domain or application (a means of communication, system design, …)
• Classification of concepts (taxonomy) and of data/object instances through classes
• Analysis of ontologies, e.g., graph queries (reachability, path queries, …) and reasoning (concept subsumption, consistency checking, …)
• Targets for semantic data registration
• Conceptual indexes and views for searching, browsing, querying, and integration of registered data

Example of a Large Data Problem
• E.g., manipulation, analysis, and use of LIDAR (LIght Detection And Ranging) data
Ramon Arrowsmith, Chris Crosby, Arizona State University

LIght Detection And Ranging
• Airborne scanning laser rangefinder, differential GPS, inertial navigation system
• 30,000 points per second at ~15 cm accuracy
• $400–$1,000/mi², 10⁶ points/mi², i.e., 0.04–0.1 cents/point
• Extensive filtering to remove tree canopy ("virtual deforestation")
[Figure from R. Haugerud, USGS: http://duff.geology.washington.edu/data/raster/lidar/About_LIDAR.html]

Northern San Andreas LIDAR: Fault Geomorphology
[Paired images: full-feature DEM vs. bare-earth DEM]

Processing LIDAR Data: The Problems
• Huge datasets:
  • 1 GB of point-return (.txt) data
  • 150 MB of point-return (.txt) data for the Fort Ross, CA 7.5-minute quad; 5.5 MB after filtering for ground returns
• How do we grid these data? (see the sketch after this slide)
  • ArcGIS can't handle it
  • Expensive commercial software is not an option for most data consumers
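The gridding step itself is scattered-data interpolation onto a regular DEM grid. The sketch below shows it at toy scale with NumPy/SciPy (tools chosen here for brevity, not the ones named on these slides; the point counts and terrain are synthetic). At real LIDAR scale the data must be tiled and streamed, which is what motivates the GRASS-based approach on the next slide.

```python
import numpy as np
from scipy.interpolate import griddata

# Stand-in for filtered ground returns: x, y in meters, z = elevation.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, 20_000)
y = rng.uniform(0, 1000, 20_000)
z = 50 + 0.01 * x + 5 * np.sin(y / 100)          # synthetic terrain

# Target DEM grid at 5 m resolution.
gx, gy = np.meshgrid(np.arange(0, 1000, 5), np.arange(0, 1000, 5))

# Interpolate the scattered returns onto the grid. Linear interpolation is
# the simplest choice; spline-with-tension (as in GRASS) is the heavier-duty
# equivalent that also lets tension and smoothness be tuned.
dem = griddata((x, y), z, (gx, gy), method="linear")
print(dem.shape)                                  # (200, 200) raster, ready to export
```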
GRASS as a Processing Tool for LIDAR
• GRASS: open-source GIS
• Interpolation commands designed for large data sets
  • Splines use local point density to segment the data into rectangular areas for interpolation
  • Spline tension and smoothness can be controlled
• Its modular configuration could easily be implemented within the GEON workflow
  • E.g., a user uploads point data to a remote site, where a GRASS interpolation module runs on a supercomputer and returns a raster file to the user
• Host the large LIDAR data sets on the GEON data node at SDSC, with access to large cluster computers

Accessing Data from More Than One Information Source: Federated Metadata Query
• GSIDs, à la LSIDs (Life Science Identifiers)
• Metadata querying middleware: a search API and a result format (XML, URIs)
• Query and result wrappers (returning URIs) sit over each source, e.g., gsid:dlese:…, gsid:gn:…, gsid:iris:…
[Diagram: wrappers over DLESE (THREDDS, DLESE XML schema), the Geography Network (XML schema; ArcCatalog, ArcXML), IRIS (CORBA, Web services), and SRB (MCAT, SRB API), backed by a Grid metadata catalog and Grid services]

Federated GSID-Based Data Access
• GSID-based requests, e.g., gsid:srb:…, gsid:odbc:…
• Data access middleware maps URIs to local access protocols: SRB, ArcXML, GML, OPeNDAP, HTTP, FTP, ODBC/JDBC, GridFTP, scp
• Item-level and collection-level metadata
(A resolver sketch follows this slide.)
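The essence of the data-access middleware is a resolver that peels the authority out of a GSID and dispatches to a protocol-specific wrapper. A minimal sketch follows; the GSID layout (`gsid:<authority>:<path>`) and the handler names are assumptions for illustration, since the slides leave the identifier syntax unspecified.

```python
from urllib.parse import unquote

# Hypothetical wrappers, one per local access protocol; a real deployment
# would speak SRB, OPeNDAP, GridFTP, JDBC, etc. behind these functions.
def fetch_via_http(path): return f"HTTP GET {path}"
def fetch_via_srb(path):  return f"SRB read {path}"
def fetch_via_odbc(path): return f"SQL query against {path}"

HANDLERS = {"http": fetch_via_http, "srb": fetch_via_srb, "odbc": fetch_via_odbc}

def resolve(gsid: str) -> str:
    """Map a global identifier onto its local access protocol and fetch it."""
    scheme, authority, path = gsid.split(":", 2)
    if scheme != "gsid":
        raise ValueError(f"not a GSID: {gsid}")
    handler = HANDLERS.get(authority)
    if handler is None:
        raise KeyError(f"no wrapper registered for authority '{authority}'")
    return handler(unquote(path))

print(resolve("gsid:srb:/geon/lidar/fort_ross_ground.txt"))  # -> "SRB read ..."
```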
iGEON – International Cooperation: Experiences to Date
• Canada
  • Geological Survey of Canada (Ottawa, Vancouver): Dr. Boyan Brodaric is one of the original members of the GEON team
  • Contributing important data sets by setting up a WMS (Web Map Service) server at WestGrid in Vancouver, BC
  • 1 Gbps link from Vancouver to the GEON portal node at SDSC
• China
  • The Computational Geodynamics Lab will host a GEON PoP node for iGEON in China
• Australia
  • Interactions between GEON and EON (Earth and Ocean Network)
  • Working with Dietmar Mueller to run mantle convection codes on Linux clusters and provide them as a Web service in GEON
• Russia, Kyrgyzstan
  • Held discussions with scientists from the Russian Academy on data integration and the use of Grid computing for geodynamics codes

International Cooperation: Planned
• Australia: collaboration planned with ACCESS (www.access.edu.au), the Australian computational earth systems simulator; install a GEON node
• Mexico: meeting planned between CICESE earth scientists and GEON regarding connectivity into Mexico
• Japan: sending an invitation to the Earth Simulator visualization group to attend the GEON visualization workshop
• UK: visit to the UK e-Science Centre, June 28–29, 2004
• Targeted: iGEON in the Asia-Pacific could collaborate with the PRAGMA effort (Peter Arzberger, PI); GEON will participate in the next PRAGMA meeting as one of the featured applications

Opportunities
• Define common standards, e.g.:
  • Global geoscience identifiers (URIs, …)
  • Ontologies (Semantic Web standards)
  • Web service definitions and other standards
• Work towards linking GEON with other related efforts
  • Funds for travel to each other's science and IT workshops and individual meetings
  • Sabbatical and training visits
• Share computing capabilities for geoscience applications
• Technologies for 3D and 4D visualization, on-demand computing, …

FYI: Cyberinfrastructure Summer Institute for the Geosciences
August 16–20, 2004, San Diego
See www.geongrid.org/summerinstitute for more information

National Laboratory for Advanced Data Research (NLADR)
An SDSC/NCSA Data Collaboration
Co-Directors:
• Chaitan Baru, Data and Knowledge Systems (DAKS), SDSC
• Michael Welge, Automated Learning Group (ALG), NCSA

NLADR Vision
• A collaborative R&D activity between NCSA (Illinois) and SDSC in advanced data technologies…
• …guided by real applications from science communities…
• …to develop a broad data architecture framework…
• …within which to develop, deploy, and test data-related technologies…
• …in the context of a national-scale physical infrastructure (Internet-D)

NLADR Focus
• Solving the data needs of real applications
• Initially focused on some geoscience applications (GEON, LEAD)
• Also looking into environmental science applications (LTER, NEON, CLEANER)
• NLADR Fellows program: enables postdocs, faculty, and staff from the domain sciences to partner with NLADR staff

Core Activities
• Internet-D: fielding a distributed data testbed
• Core technologies and reference implementations of "data cyberinfrastructure"
• Standards activities
• Evaluation: usability and performance

Internet-D
• A distributed data testbed
  • Initially within the networked environment between SDSC and NCSA
  • Open to the community for testing new data management and data mining approaches, protocols, middleware, and technologies
• A minimum configuration will include distributed infrastructure, e.g., cluster systems at each end point with maximum memory and adequate disk capability, and high-speed network connectivity between the end points
• A high-end configuration
  • A prototype environment representing very high-end, "extreme" capability
  • Provides the highest possible end-to-end, disk-to-disk bandwidth (see the striped-transfer sketch after this slide)
  • Very large main memory and very large disk arrays
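One standard ingredient of high disk-to-disk bandwidth is striping a transfer across several parallel streams so that no single connection (or disk) is the bottleneck; GridFTP's parallel streams work this way. The sketch below illustrates the pattern locally with threads and byte-range slices; it is an illustration of the idea, not NLADR code.

```python
import os
import threading

def copy_slice(src_path: str, dst_path: str, offset: int, length: int,
               block: int = 1 << 20) -> None:
    """One stream: copy bytes [offset, offset+length) of the file.
    Across a WAN, each slice would ride its own TCP connection, which is
    how striped transfers keep a long, fat pipe full."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        done = 0
        while done < length:
            chunk = src.read(min(block, length - done))
            if not chunk:
                break
            dst.write(chunk)
            done += len(chunk)

def parallel_copy(src_path: str, dst_path: str, streams: int = 4) -> None:
    size = os.path.getsize(src_path)
    with open(dst_path, "wb") as dst:
        dst.truncate(size)          # pre-size so every stream can seek-and-write
    per = (size + streams - 1) // streams
    threads = [threading.Thread(target=copy_slice,
                                args=(src_path, dst_path, i * per,
                                      min(per, size - i * per)))
               for i in range(streams) if i * per < size]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```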
NLADR Core Technologies
• Core data services: caching, replication, prefetching, multiple transfer streams
• Integration of distributed data: integrating independently created, distributed, heterogeneous databases
• Mining complex data: data mining of distributed, complex scientific data, including exploratory analysis and visualization
• Long-term data preservation: developing tools to preserve data over long periods of time

NLADR Evaluation Activities
• Data Grid benchmarking efforts
  • Functionality and performance
  • In multi-user, concurrent-access environments
  • Online, on demand
• Evaluate parallel file systems and parallel database systems
• Develop "data experts" for the various modalities of data
• Investigate and characterize architectures and capabilities for long-term preservation

Joining NLADR
• No formal process yet; contact me (baru@sdsc.edu) if interested
• Be prepared to contribute one or more of:
  • Interesting applications
  • People's time, to work on NLADR objectives
  • Infrastructure (servers, storage, networking) towards Internet-D

Thank You!
• Visit www.geongrid.org
• Stay tuned for www.nladr.net
• My email: baru@sdsc.edu