Building a virtual data center for the biological, ecological and environmental sciences -- William Michener (University Libraries, University of New Mexico) -- DataONE Team Presenter Name Outline Challenges faced by the biological, ecological, and environmental sciences Scientific Cyberinfrastructure Socio-cultural Vision for change (a DataONE perspective) Effecting change Re-visiting the challenges 2 Global change Smith, Knapp, Collins. In press. 4 Critical areas in the Earth’s system 5 6 Increasing Process Knowledge Decreasing Spatial Coverage Building the knowledge pyramid Intensive science sites and experiments Extensive science sites Volunteer & education networks Remote sensing Data loss Petabytes Worldwide (including orphaned data) Information Available Storage Source: John Gantz, IDC Corporation: The Expanding Digital Universe 7 Data deluge “the flood of increasingly heterogeneous data” Data are heterogeneous Syntax (format) Schema (model) Semantics (meaning) Jones et al. 2007 8 Poor data practice “data entropy” Time of publication Information Content Specific details General details Retirement or career change Accident Death Time 9 (Michener et al. 1997) Scattered data sources “finding the needle in the haystack” Data are massively dispersed 10 Ecological field stations and research centers (100’s) Natural history museums and biocollection facilities (100’s) Agency data collections (100’s to 1000’s) Individual scientists (1000’s to 10,000s to 100,000s) The problem with standards… Metadata Interoperability Across Data Holdings Data Archive Types of Data Managed Biodiversity, taxonomic, ecological BDP, DwC, DC, OGIS Biogeochemical dynamics, terrestrial ecological Earth observation imagery DIF, BDP, ECHO Ecological, biodiversity, biophysical, social, genomics, and taxonomic EML Avian populations and molecular biology DwC Biological and taxonomic DC subset Biophysical, biodiversity, disturbance, and Earth observation imagery EML Biodiversity, biotic structure, function/process, biogeochemical, climate, and hydrologic EML BDP=Biological Data Profile DC subset=Dublin Core subset DwC=Darwin Core DC=Dublin Core DIF=Directory Interchange Format ECHO=EOS ClearingHOuse EML=Ecological Metadata Language Metadata Standard(s) OGIS=OpenGIS 11 Socio-cultural informatics challenges Why? What is there to help me? How should I best do this? 12 A Vision for Change: DataONE Providing universal access to data about life on earth and the environment that sustains it engaging the scientist in the data curation process supporting the full data life cycle encouraging data stewardship and sharing promoting best practices engaging citizens developing domain-agnostic solutions 1. Build on existing cyberinfrastructure 2. Create new cyberinfrastructure 3. Support new communities of practice DataONE Cyberinfrastructure Coordinating Nodes Member Nodes • retain complete • diverse institutions metadata catalog •• subset of allcommunity data serve local • perform basic indexing provide network-wide resources for •• provide managing their data services • ensure data availability (preservation) • provide replication services Flexible, scalable, sustainable network What data are in scope for DataONE? Biological (genes to biomes) Environmental Atmospheric Ecological Hydrological Oceanographic Social e.g., Land use, human population Economic e.g., trade, ecosystem services, resource extraction DataONE – Building new global CI 16 Effecting change 1. Engage the community Perform baseline and iterative community assessments Usability studies DataONE International Users Group 17 Effecting change 2. Leverage existing cyberinfrastructure 19 KNB, ESDIS, and Waters Networks Investigator Toolkit Many existing open source efforts exist Data management: MATT, UDig, Specify Analysis and modeling: R, Octave Workflow systems: Kepler, Taverna, Triana, Pegasus Grid systems: Condor, Globus, BOINC Data and workflow portals: VegBank, myExperiment Commercial tools important too Excel, MATLAB, SAS, ArcGIS DataONE: help communities build their own tools Integrate, interoperate, stabilize Create libraries to DataONE Service Interface 20 Effecting change 3. Educate Career Long Learning: • best practice guides • exemplary data management plans • podcasts, web-casts • workshops and seminars • downloadable curricula Best Practice Guide Using Best Metadata for Guide Practice e-research How to Cite Best Practice Guide Your Data 5 in a series How to Cite Your Data 6 in a series 6 in a series Gold Star Data Management Plan Here’s How Engaging citizens in science www.CitizenScience.org Effecting change 4. Enable new science and demonstrate success Pilot Catalog: Simple Search Interface (searches entire metadata record) >48,000 Data Set Records NBII Metadata Clearinghouse (39,550) Long Term Ecological Research (LTER) Network (6,897) ORNL Distributed Active Archive Center for Biogeochemical Data (810) Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA) (783) Organization of Biological Field Stations (124) Inter-American Institute for Global Change Research (IAI) (79) MODIS and ASTER Products (LPDAAC) (38) National Phenology Network (USANPN) (29) 23 Eurasian Collared Dove in North America Search can be emailed, bookmarked, or placed in RSS feed Search can be refined using multiple filters 24 Exploration, Visualization and Analysis (EVA) Working Group Objectives Guide creation of scientific workflows and DataONE data exploration, visualization and analysis functionality. EVA Tools 25 1st EVA Example: Vegetation and Bird Phenology at 130,000 sites in North America May 18 May 1 Phenology (Start of Season) June 28 Understand how bird migration patterns change through time (White et al. 2009) 26 The Process Iterative data-intensive workflow – VisTrails Data synthesis Spatio-Temporal Exploratory Model (STEM) 1m f ( X , s , t ) I ( s , t ) i i n ( s , t ) i 1 27 Results: Spring Migration Detail • Model predictions show population timing and location through migration • Early April – Concentration on Gulf of Mexico Coast • Mid April – Concentration along Mississippi River valley • Mid May – Breeding distribution 28 Occurrence of Indigo Bunting (2008) Neotropical migrant that winters in central and south America Spring Migration Results • Early April – Concentration on Gulf of Mexico Coast • Mid April – Concentration along Mississippi River valley • April 7 April 15 April 26 May 14 Mid May – Breeding distribution 29 Experimentation/Forecasting Red-eyed Vireo (Vireo olivaceus) 30 Questions: Experiment: 1. Is Red-eyed Vireo migration timing associated with NDVI? 2. If so, how is migration timing, direction, and speed affected by NDVI? 1. Fit STEM with NDVI predictor 2. Advance “Greening Date” by 14 days 3. Plot “Spring Arrival Dates” as the first date when predicted probability of occurrence > 5% Challenges “i-Science” – (from the perspective of the scientist) Fast Data & tool discovery Automated/automatic metadata annotation Visualization (i.e., rapid analysis) Easy Data & metadata upload Integrated, linked, and synthesized databases Scientific workflows Interoperability (but hidden from user) Usability (intuitiveness) Cheap Data preservation Free, open source tools But, also “support” for tools I am used to – e.g., Excel 31 The “elephant in the room” challenge It’s the metadata …. need tools and approaches that are: Fast ! Easy !!! Cheap ! 32 DataONE Institutions, Personnel & Support University of New Mexico – William Michener, Rebecca Koskela, Mark Servilla Cornell University – Paul Allen, Steve Kelling CSIRO (Australia) – Donald Hobern Duke University – Ryan Scherle, Todd Vision Ecological Society of America – Cliff Duke Oak Ridge National Laboratory – Bob Cook, John Cobb, Line Pouchard, Bruce Wilson University of California Curation Center – Patricia Cruse, John Kunze University of California Davis – Bertram Ludaescher University of California Santa Barbara – Stephanie Hampton, Matt Jones University of Edinburgh (Scotland) – Peter Buneman University of Illinois – Urbana Champaign – Randy Butler, Von Welch University of Illinois – Chicago – Robert Sandusky University of Kansas – Dave Vieglais University of Manchester (UK) – Carole Goble University of Michigan – Peter Honeyman University of Southampton (UK) – David DeRoure University of Southern California – Ewa Deelman University of Tennessee – Suzie Allard, Carol Tenopir, Maribeth Manoff USGS (National Biological Information Infrastructure, National Phenology Network) – Mike Frame, Vivian Hutchison, Jake Weltzin Utah State University – Jeffery Horsburgh Steve Kelling1, Daniel Fink1, Suresh Santhana Vannan2, Kevin Webb1, Robert Cook2, William Michener3, Jeff Morisette4, Claudio Silva5, Tom Dietterich6, Patrick O’Leary7, Alyssa Rosemartin4, and Damian Gessler8 1Lab of Ornithology, Cornell University and 2Oak Ridge National Laboratory, 3University of New Mexico, 4U.S. Geological Survey, 5University of Utah, 6Oregon State University, 7Idaho National Laboratory, 8I-Plant NSF CISE Pathways Computational Sustainability 0832782, Leon Levy Foundation, and National Aeronautics and Space Administration 33