DRAFT - 2/17/2016

Comparison of Scientific Digital Repositories

From the SOW:

1. A report assessing the strengths and limitations of selected scientific digital resource and data repositories (e.g., KNB: http://knb.ecoinformatics.org/index.jsp, and Center for International Earth Science Information Network (CIESIN): http://www.ciesin.org/). The report will include a preliminary list of key functionalities to include in DRIADE and recommend if any existing scientific repositories may satisfy NESCent’s requirements (December 31, 2006).

(Note: Additional basic information about many of the projects listed in this report can also be found in the DRIADE project Wiki: https://www.nescent.org/wg/digitaldata/index.php?title=Links ; specific research findings on digital repositories and related activities can be found in the annotated bibliography linked from the DRIADE project Wiki: https://www.nescent.org/wg/digitaldata/index.php?title=Participant_Activities#Jed_Dube )

This report focuses on several ongoing scientific digital repository and related projects that might inform the DRIADE project, based upon its evolving general goals through November 30, 2006. The following are the expressed goals (from the DRIADE wiki, as of November 30, 2006):

1. Heterogeneous digital datasets
2. In the field of evolutionary biology
3. Ensure a self-sustaining economic model
4. Plan for long-term data stewardship
5. Provide tools and incentives to researchers for quality metadata generation and dataset reuse
6. Minimize the technical expertise and time required for data deposition and metadata generation
7. Be sensitive to the intellectual property rights of researchers
8. Focus on published datasets
9. Provide tight linkages to major evolutionary biology journals and domain-specific community databases

The report sections are roughly defined as follows:

o Resource: Identification and main URL(s)
o Strengths: Applicability to the expressed goals of DRIADE (note: this does not imply anything about the success of the project); acceptance, reputation, collaborations; funding by NSF or other major sources; quantity and quality of available information about general or specific examples of best practices, standards-compliance, interoperability, development process/methodology, modeling, collaboration, usability, etc.
o Limitations: Aspects of project scope, scale, community, development efforts, etc. that particularly do not apply

Applicability of Various Scientific Repository Projects to DRIADE Project Goals

Projects (columns): GBIF | KNB | NSDL | SEEK | DLESE | CIESIN | ICPSR | MMI / ORION | Purdue DIR

Heterogeneous datasets: ▪ ▪ ▪ ▪ YES ▪ ▪ ▪
Evolutionary biology: * * *
Self-sustaining economic model: ▪ ▪ ? ▪ * ?
Long-term data stewardship: ▪ ▪ ? ▪ ?
Tools and incentives to researchers: ▪ ▪ ▪ ▪ ▪ ? ▪ ▪ ?
Minimize technical expertise and time required: ▪ ▪ ? ▪ ▪ ? ▪ ▪ ▪
Intellectual property rights: ▪ ▪ ? ▪ ? ▪ ?
Published datasets: ? ?
Tight linkages to journals and databases: ? ?

▪ = YES
* = SOMEWHAT
? = Not sure
(blank) = NO

Resource: Global Biodiversity Information Facility (GBIF)
(Public) Home: http://www.gbif.org/
Existing Data Portal: http://www.europe.gbif.net/portal/index.jsp
New Data Portal: http://newportal.gbif.org/
Wiki: http://wiki.gbif.org/gbif/wikka.php?wakka=HomePage

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: SOMEWHAT
3) Ensure a self-sustaining economic model: YES
4) Plan for long-term data stewardship: YES
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: YES
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
Disciplines are similar
NSF funds U.S.A. work = ~20% of overall GBIF budget (per NSF)
26 voting participants and additional 56 associate participants
Well-documented: http://wiki.gbif.org/gbif/wikka.php?wakka=HomePage etc.
Moving from prototype portal to new architecture (late 2006? into 2007?) http://newportal.gbif.org/welcome.htm - addressing issues, in a phased development schedule:
  o Inadequate search capability
  o Inadequate and unreliable download function
  o Confusing search by country (results)
  o Incomplete and confusing taxonomic information
  o Limited mapping capability
  o Inadequate user help
  o Weak national language support
  o Insufficiently rich data model (lack of specificity in results due to lack of clarity in model?)
  o Poor handling of non-scientific names
  o Loss of historical record for withdrawn data
  o Lack of web services support for access
  o Indexing against live database
  o Etc.
To utilize Taxon Concept Schema (TCS) with Taxon API (TAPIR) for nomenclature and taxonomy (first)
Will support taxon occurrence:
  o DiGIR with Darwin Core (v. 1.2, v. 2.0, MaNIS, OBIS)
  o BioCASe with ABCD (v. 1.20, v. 1.48, v. 2.06)
  o TAPIR with Darwin Core v. 2.0
  o TAPIR with ABCD v. 2.06 (including IPGRI)
Updated UDDI registry for all data resources (Jun 2006)
Standardised provider data use agreements (Dec 2006)
Provider 'console' interface (c. Dec. 2006)
Schema repository (preliminary setup)
Resource crawler: scheduled execution (Aug 2006) and strategy-driven approach (Aug 2006)
Numerous improvements to how the Validation Chain handles records from the Crawler
Numerous improvements to how data is stored, mirrored and synchronized, handled, etc.
Application logic to access index data stores:
  o Java service layer for data access
  o Model-View-Controller (MVC) framework
  o Configurable HTML user interface (Jul 2006 onwards)
  o Configurable APIs for web services, utilizing TDWG TAPIR protocols/standards
Support web browsers and client applications
HTML user interface (Jul 2006 onwards):
  o Browse and search capabilities
  o Download of tab-delimited or XML data sets
  o User feedback to data providers
  o User personalisation options
Portal interface exposed as web services (Dec 2006):
  o WSDL and SOAP access to all data access functions used to construct the HTML UI
  o Client implementation of Java Services layer API based on web services
Additional web services to support community standards:
  o TAPIR with Darwin Core
  o TAPIR with ABCD
  o Taxon API with Taxon Concept Schema
  o Web Feature Service
Encourage development of applications and toolkits:
  o Expose data for analysis by external tools and workflow applications
Considering (among many possibilities):
  o SPICE with Species 2000 Common Data Model (for taxonomic data)
  o Ecological data sets with EML metadata
  o Collection metadata
  o RDF data using LSIDs and (TDWG or GBIF) ontology

[Figure: GBIF Portal Architecture (new version). Source: “GBIF Data Portal Directions and Progress”, Donald Hobern, April 2006]

Resource: Knowledge Network for Biocomplexity (KNB)
(Public) Home: http://knb.ecoinformatics.org/index.jsp

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: SOMEWHAT
3) Ensure a self-sustaining economic model: NO
4) Plan for long-term data stewardship: NO
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: YES
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
Disciplines are similar
Funded by NSF Knowledge and Distributed Intelligence Program
Partnerships: National Center for Ecological Analysis and Synthesis (UC Santa Barbara et al.), Long Term Ecological Research Network (~30 sites), San Diego Supercomputer Center, Texas Tech University
Working system with required tools:
  o Ecological Metadata Language (XML-based metadata specification; supports description of datasets, citations, software, protocols)
  o Metacat (XML-based metadata database/server on PostgreSQL/Tomcat; open source)
  o Morpho (data/metadata management tool for scientists)
  o Storage Resource Broker (planned?)
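To make the KNB tool list more concrete, the following is a minimal sketch of what an XML metadata record in the style of Ecological Metadata Language might look like when generated programmatically. It is an illustration only: the namespace URI assumes EML 2.0.1, the function name build_eml and the package identifier, title and person names are hypothetical, and the output has not been validated against the EML schema or tested with Metacat or Morpho.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of EML 2.0.1; adjust to the EML version in use.
EML_NS = "eml://ecoinformatics.org/eml-2.0.1"


def build_eml(package_id, title, surname, given_name=None):
    """Return a small EML-style metadata record as an XML string.

    Only a handful of elements are emitted (title, creator, contact);
    a real record would carry much more (abstract, coverage, methods...).
    """
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": package_id, "system": "knb"},  # hypothetical values
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title

    # The same person is used as creator and contact to keep the sketch short.
    for role in ("creator", "contact"):
        party = ET.SubElement(dataset, role)
        name = ET.SubElement(party, "individualName")
        if given_name:
            ET.SubElement(name, "givenName").text = given_name
        ET.SubElement(name, "surName").text = surname

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical dataset description, for illustration only.
    print(build_eml("example.1.1", "Sample field survey data", "Doe", "Jane"))
```

A depositor-facing tool would collect these few fields through a form and generate the XML behind the scenes, which is the kind of reduction in required technical expertise that DRIADE goal 6 calls for.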
Resource: National Science Digital Library (NSDL)
Public Home: http://nsdl.org/
Wiki: http://ndr.comm.nsdl.org/cgi-bin/wiki.pl?NSDL_Data_Repository_(NDR)
NSDL Registry Wiki: http://metadataregistry.org/wiki/index.php/Main_Page

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: SOMEWHAT
3) Ensure a self-sustaining economic model: YES
4) Plan for long-term data stewardship: YES
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: ?
7) Be sensitive to the intellectual property rights of researchers: ?
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
Very well documented (papers, presentations)
NSF-funded
History of innovations
Currently uses FEDORA and OAI-PMH (Carl Lagoze)
Now in 2nd iteration:
  o “an information network overlay”
  o More resource-centric, rather than metadata-centric (?)
  o More (standard) ways to contribute metadata
  o Metadata aggregation and augmentation seems interesting (Hillman)
  o More (web services) ways to harvest: moving to SOAP and/or REST-based API

Resource: Science Environment for Ecological Knowledge (SEEK)
http://seek.ecoinformatics.org/

Strengths:
1) Heterogeneous digital datasets: NO
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: NO
4) Plan for long-term data stewardship: ?
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: ?
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
Disciplines are similar
Supports KNB plus systematics data
Focus on semantics and scientific workflows
Organized into working groups
Collaboration:
  o Partnership for Biodiversity Informatics (PBI)
  o National Center for Ecological Analysis and Synthesis at UC Santa Barbara (NCEAS)
  o San Diego Supercomputer Center (SDSC)
  o University of Kansas (KU)
  o University of New Mexico (UNM)
  o Genome Center at UC Davis (UCD)
  o Arizona State University (ASU)
  o University of North Carolina (UNC)
  o University of Vermont (UVM)
  o Napier University in Scotland (Napier)
Tools (more available on website):
  o EcoGrid: “2nd generation data network”, integrating distinct data systems and networks; prototype using Metacat, SRB, DiGIR, Xanthoria, etc.
  o Kepler: grid-enabled scientific workflows; collaborators: SEEK Project, SciDAC SDM Center, Ptolemy Project, GEON Project
  o GrOWL: “a visualization and editing tool for Ontology Web Language (OWL) and Description Logics (DL) ontologies based on a semantic network knowledge representation paradigm”
  o Ontologies
  o ConceptMapper

Resource: Digital Library for Earth System Education (DLESE)
Public Website: http://www.dlese.org/
Community Review System: http://crs.dlese.org/

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: NO (or: applies, but negatively?)
4) Plan for long-term data stewardship: NO
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: YES
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:
Ref: DRIADE goals 3) and 4): DLESE funding from NSF GEO is being phased out (2006)

Notes:
Partner with NSDL

Resource: Center for International Earth Science Information Network (CIESIN)
http://www.ciesin.org/index.html

Strengths: http://www.ciesin.org/metadata/documentation/supplements/netsites.html
1) Heterogeneous digital datasets: ?
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: ?
4) Plan for long-term data stewardship: ?
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: ?
6) Minimize the technical expertise and time required for data deposition and metadata generation: ?
7) Be sensitive to the intellectual property rights of researchers: ?
8) Focus on published datasets: ?
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: ?

Limitations:

Notes:
It appears that the bulk of the digital repository implementation work was carried out in the mid-to-late 1990s, and the documentation for this effort is no longer readily accessible. If further analysis of CIESIN is required, it will have to be undertaken by more traditional means (direct contact or other research) rather than via web discovery.
Repository: Marine Metadata Initiative (MMI) http://marinemetadata.org/ / Ocean Research Interactive Observatory Networks (ORION) http://www.orionprogram.org/default.html

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: NO
4) Plan for long-term data stewardship: NO
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: NO
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
NSF-funded
This project appears to be a model for effective collaboration and outreach
Extensive website

Repository: Inter-university Consortium for Political and Social Research (ICPSR)
http://www.icpsr.umich.edu/

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: YES
4) Plan for long-term data stewardship: YES
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: YES
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: YES
8) Focus on published datasets: NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: NO

Limitations:

Notes:
Established in 1962
Over 500 member colleges and universities
Data Documentation Initiative (DDI) http://www.icpsr.umich.edu/DDI/ is at ICPSR
Partner with Data-PASS http://www.icpsr.umich.edu/DATAPASS/

Repository: Purdue Distributed Institutional Repository (DIR) / E-Scholar
http://e-scholar.lib.purdue.edu/

Strengths:
1) Heterogeneous digital datasets: YES
2) In the field of evolutionary biology: NO
3) Ensure a self-sustaining economic model: ?
4) Plan for long-term data stewardship: ?
5) Provide tools and incentives to researchers for quality metadata generation and dataset reuse: ?
6) Minimize the technical expertise and time required for data deposition and metadata generation: YES
7) Be sensitive to the intellectual property rights of researchers: ?
8) Focus on published datasets: ?
9) Provide tight linkages to major evolutionary biology journals and domain-specific community databases: ?

Limitations:

Notes: (material to be researched)
http://www.cni.org/tfms/2005b.fall/abstracts/PB-research-brandt.html (Fall 2005): “Purdue University Libraries will build a model to archive datasets generated at a university during the research process and make these datasets available by linking them to the resulting research publication.”
http://dir.lib.purdue.edu/whitepaper.html
http://www.rcac.purdue.edu/rcac/events/library.htm (May 2006)
http://e-scholar.lib.purdue.edu/eScholar_RR_Sep2006.pdf (Sept 2006)
http://www.hpcwire.com/hpc/640030.html
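
The ratings recorded under Strengths for each project lend themselves to a mechanical tally. The sketch below is one possible way to do that, not part of the original assessment: the one-letter encoding (Y = YES, S = SOMEWHAT, N = NO, ? = not sure) and the function names are illustrative conventions, and the DLESE rating for goal 3, given as "NO (or: applies, but negatively?)", is simply recorded as N.

```python
from collections import Counter

# One character per DRIADE goal (1..9), copied from each project's
# Strengths list. Y = YES, S = SOMEWHAT, N = NO, ? = not sure.
RATINGS = {
    "GBIF":        "YSYYYYYNN",
    "KNB":         "YSNNYYYNN",
    "NSDL":        "YSYYY??NN",
    "SEEK":        "NNN?YY?NN",
    "DLESE":       "YNNNYYYNN",  # goal 3 simplified to N (see lead-in)
    "CIESIN":      "?N???????",
    "MMI / ORION": "YNNNYYNNN",
    "ICPSR":       "YNYYYYYNN",
    "Purdue DIR":  "YN???Y???",
}


def yes_counts(ratings):
    """Number of goals rated YES for each project."""
    return {name: Counter(codes)["Y"] for name, codes in ratings.items()}


def goals_without_yes(ratings):
    """Goal numbers (1-based) that no project is rated YES on."""
    n_goals = len(next(iter(ratings.values())))
    return [
        g + 1
        for g in range(n_goals)
        if all(codes[g] != "Y" for codes in ratings.values())
    ]


if __name__ == "__main__":
    # Print projects from most to fewest YES ratings.
    for name, count in sorted(yes_counts(RATINGS).items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {count} of 9 goals rated YES")
    print("Goals with no YES rating:", goals_without_yes(RATINGS))
```

A count of YES ratings is a coarse summary only: it treats all nine goals as equally important and ignores the "not sure" entries, so it should be read alongside the notes for each project rather than as a ranking.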