Comparison of Scientific Digital Repositories

DRAFT - 2/17/2016
From the SOW:
1. A report assessing the strengths and limitations of selected scientific digital resource and data
repositories (e.g., KNB: http://knb.ecoinformatics.org/index.jsp, and Center for International Earth
Science Information Network (CIESIN): http://www.ciesin.org/). The report will include a
preliminary list of key functionalities to include in DRIADE and recommend if any existing
scientific repositories may satisfy NESCent’s requirements (December 31, 2006).
(Note: Additional basic information about many of the projects listed in this report can also be found in the
DRIADE project Wiki: https://www.nescent.org/wg/digitaldata/index.php?title=Links ; specific research
findings on digital repositories and related activities can be found in the annotated bibliography linked
from the DRIADE project Wiki:
https://www.nescent.org/wg/digitaldata/index.php?title=Participant_Activities#Jed_Dube )
This report focuses on several ongoing scientific digital repository and related projects
that might inform the DRIADE project, based upon its evolving general goals through
November 30, 2006.
The following are the expressed goals (from the DRIADE wiki, as of November 30, 2006):
1. Heterogeneous digital datasets
2. In the field of evolutionary biology
3. Ensure a self-sustaining economic model
4. Plan for long-term data stewardship
5. Provide tools and incentives to researchers for quality metadata generation and
dataset reuse
6. Minimize the technical expertise and time required for data deposition and
metadata generation
7. Be sensitive to the intellectual property rights of researchers
8. Focus on published datasets
9. Provide tight linkages to major evolutionary biology journals and domain-specific
community databases
The report sections are roughly defined as follows:
o Resource: Identification and main URL(s)
o Strengths: Applicability to the expressed goals of DRIADE (note: this does not
imply anything about the success of the project); acceptance, reputation,
collaborations; funding by NSF or other major sources; quantity and quality of
available information about general or specific examples of best practices,
standards-compliance, interoperability, development process/methodology,
modeling, collaboration, usability, etc.
o Limitations: Aspects of project scope, scale, community, development efforts,
etc. that do not apply to DRIADE
Applicability of Various Scientific Repository Projects to DRIADE Project Goals
Resources (columns, left to right): GBIF, KNB, NSDL, SEEK, DLESE, CIESIN, ICPSR, MMI/ORION, Purdue DIR
Each row lists its non-blank marks in column order; blank (NO) cells are not shown.
Heterogeneous datasets: ▪ ▪ ▪ ▪ ▪ ▪ ▪ ▪
Evolutionary biology: * * *
Self-sustaining economic model: ▪ ▪ ? ▪ * ?
Long-term data stewardship: ▪ ▪ ? ▪ ?
Tools and incentives to researchers: ▪ ▪ ▪ ▪ ▪ ? ▪ ▪ ?
Minimize technical expertise and time required: ▪ ▪ ? ▪ ▪ ? ▪ ▪ ▪
Intellectual property rights: ▪ ▪ ? ▪ ? ▪ ?
Published datasets: ? ?
Tight linkages to journals and databases: ? ?
▪ = YES
* = SOMEWHAT
? = Not sure
(blank) = NO
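The per-resource ratings given in the individual sections of this report can be tallied against the nine goals. The following is a minimal Python sketch, using the ratings listed in the GBIF, KNB and ICPSR sections. The numeric weights (YES = 1, SOMEWHAT = 0.5, anything else = 0) and the function names are assumptions made purely for illustration; they are not part of the report's method.

```python
# Ratings for DRIADE goals 1-9 (in order), copied from the GBIF, KNB
# and ICPSR sections of this report.
RATINGS = {
    "GBIF":  ["YES", "SOMEWHAT", "YES", "YES", "YES", "YES", "YES", "NO", "NO"],
    "KNB":   ["YES", "SOMEWHAT", "NO", "NO", "YES", "YES", "YES", "NO", "NO"],
    "ICPSR": ["YES", "NO", "YES", "YES", "YES", "YES", "YES", "NO", "NO"],
}

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"YES": 1.0, "SOMEWHAT": 0.5, "NO": 0.0, "?": 0.0}


def score(ratings):
    """Sum the weight of each rating in a resource's list."""
    return sum(WEIGHTS[r] for r in ratings)


def rank(table):
    """Return resource names ordered from highest to lowest score."""
    return sorted(table, key=lambda name: score(table[name]), reverse=True)


if __name__ == "__main__":
    for name in rank(RATINGS):
        print(name, score(RATINGS[name]))
```

A simple sum treats all nine goals as equally important; any real comparison would need weights agreed by the DRIADE project.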
Resource: Global Biodiversity Information Facility (GBIF)
(Public) Home: http://www.gbif.org/
Existing Data Portal: http://www.europe.gbif.net/portal/index.jsp
New Data Portal: http://newportal.gbif.org/
Wiki: http://wiki.gbif.org/gbif/wikka.php?wakka=HomePage
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology SOMEWHAT
3) Ensure a self-sustaining economic model YES
4) Plan for long-term data stewardship YES
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers YES
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• Disciplines are similar
• NSF funds the U.S.A. work, ~20% of the overall GBIF budget (per NSF)
• 26 voting participants and an additional 56 associate participants
• Well documented: http://wiki.gbif.org/gbif/wikka.php?wakka=HomePage etc.
• Moving from the prototype portal to a new architecture (late 2006? into 2007?),
http://newportal.gbif.org/welcome.htm, addressing the following issues in a phased
development schedule:
o Inadequate search capability
o Inadequate and unreliable download function
o Confusing search by country (results)
o Incomplete and confusing taxonomic information
o Limited mapping capability
o Inadequate user help
o Weak national language support
o Insufficiently rich data model (lack of specificity in results due to lack of
clarity in model?)
o Poor handling of non-scientific names
o Loss of historical record for withdrawn data
o Lack of web services support for access
o Indexing against live database
o Etc.
• To utilize Taxon Concept Schema (TCS) with Taxon API (TAPIR) for
nomenclature and taxonomy (first)
• Will support taxon occurrence:
o DiGIR with Darwin Core (v. 1.2, v. 2.0, MaNIS, OBIS)
o BioCASe with ABCD (v. 1.20, v. 1.48, v. 2.06)
o TAPIR with Darwin Core v. 2.0
o TAPIR with ABCD v. 2.06 (including IPGRI)
• Updated UDDI registry for all data resources (Jun 2006)
• Standardised provider data use agreements (Dec 2006)
• Provider 'console' interface (c. Dec. 2006)
• Schema repository (preliminary setup)
• Resource crawler: Scheduled execution (Aug 2006) and Strategy-driven
approach (Aug 2006)
• (Numerous improvements to how the) Validation Chain handles records from
Crawler
• (Numerous improvements to how the) data is stored, mirrored and synchronized,
handled, etc.
• Application logic to access index data stores:
o Java service layer for data access
o Model-View-Controller (MVC) framework
o Configurable HTML user interface (Jul 2006 onwards)
o (Configurable APIs for) Web services, utilizing TDWG TAPIR
protocols/standards
• Support web browsers and client applications
• HTML user interface (Jul 2006 onwards)
o Browse and search capabilities
o Download of tab-delimited or XML data sets
o User feedback to data providers
o User personalisation options
• Portal interface exposed as web services (Dec 2006)
o WSDL and SOAP access to all data access functions used to construct
the HTML UI
o Client implementation of Java Services layer API based on web services
• Additional web services to support community standards
o TAPIR with Darwin Core
o TAPIR with ABCD
o Taxon API with Taxon Concept Schema
o Web Feature Service
• Encourage development of applications and toolkits
o Expose data for analysis by external tools and workflow applications
• Considering: (among many possibilities)
o SPICE with Species 2000 Common Data Model (for taxonomic data)
o Ecological data sets with EML metadata
o Collection metadata
o RDF data using LSIDs and (TDWG or GBIF) ontology
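The list above mentions downloads of tab-delimited data sets. As an illustration of how a client might consume such a download, here is a minimal Python sketch. The column names (scientificName, country, decimalLatitude, decimalLongitude) are assumed Darwin Core-style names, and the sample rows are invented; neither is taken from the GBIF portal itself.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a downloaded tab-delimited file.
# Column names are assumed, Darwin Core-style; a real download may differ.
SAMPLE = (
    "scientificName\tcountry\tdecimalLatitude\tdecimalLongitude\n"
    "Puma concolor\tUS\t35.9\t-79.0\n"
    "Puma concolor\tMX\t19.4\t-99.1\n"
    "Quercus alba\tUS\t36.0\t-78.9\n"
)


def read_occurrences(text):
    """Parse tab-delimited occurrence records into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        # Coordinates arrive as strings; convert them to floats.
        row["decimalLatitude"] = float(row["decimalLatitude"])
        row["decimalLongitude"] = float(row["decimalLongitude"])
        records.append(row)
    return records


def count_by_species(records):
    """Count occurrence records per scientific name."""
    return Counter(r["scientificName"] for r in records)


if __name__ == "__main__":
    for species, n in count_by_species(read_occurrences(SAMPLE)).items():
        print(species, n)
```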
GBIF Portal Architecture (new version):
Source: “GBIF Data Portal Directions and Progress”, Donald Hobern, April 2006
Resource: Knowledge Network for Biocomplexity (KNB)
(Public) Home: http://knb.ecoinformatics.org/index.jsp
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology SOMEWHAT
3) Ensure a self-sustaining economic model NO
4) Plan for long-term data stewardship NO
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers YES
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• Disciplines are similar
• Funded by NSF Knowledge and Distributed Intelligence Program
• Partnerships: National Center for Ecological Analysis and Synthesis (UC Santa
Barbara et al), Long Term Ecological Research Network (~30 sites), San Diego
Supercomputer Center, Texas Tech University
• Working system with required tools:
o Ecological Metadata Language (XML-based metadata specification; supports
description of datasets, citations, software, protocols)
o Metacat (open-source, XML-based metadata database/server built on
PostgreSQL and Tomcat)
o Morpho (data/metadata management tool for scientists)
o Storage Resource Broker (planned?)
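To give a sense of what an XML-based dataset description in the style of Ecological Metadata Language looks like, here is a minimal Python sketch that builds and reads back a small record. The namespace version (eml-2.0.1), the element subset shown, the "knb" system value and the sample identifiers are assumptions for illustration; the output is not validated against the EML schema and is not a substitute for tools such as Morpho.

```python
import xml.etree.ElementTree as ET

# Assumed EML namespace/version; check the EML specification before use.
EML_NS = "eml://ecoinformatics.org/eml-2.0.1"


def build_eml(package_id, title, surname):
    """Build a very small EML-style document describing one dataset."""
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(
        "{%s}eml" % EML_NS,
        {"packageId": package_id, "system": "knb"},  # hypothetical values
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # The same person is recorded as both creator and contact.
    for role in ("creator", "contact"):
        person = ET.SubElement(dataset, role)
        name = ET.SubElement(person, "individualName")
        ET.SubElement(name, "surName").text = surname
    return ET.tostring(root, encoding="unicode")


def read_title(xml_text):
    """Return the dataset title from an EML-style document."""
    return ET.fromstring(xml_text).find("dataset/title").text


if __name__ == "__main__":
    doc = build_eml("example.1.1", "Sample plant survey", "Smith")
    print(doc)
    print(read_title(doc))
```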
Resource: National Science Digital Library (NSDL)
Public Home: http://nsdl.org/
Wiki: http://ndr.comm.nsdl.org/cgi-bin/wiki.pl?NSDL_Data_Repository_(NDR)
NSDL Registry Wiki: http://metadataregistry.org/wiki/index.php/Main_Page
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology SOMEWHAT
3) Ensure a self-sustaining economic model YES
4) Plan for long-term data stewardship YES
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation ?
7) Be sensitive to the intellectual property rights of researchers ?
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• Very well documented (papers, presentations)
• NSF-funded
• History of innovations
• Currently uses FEDORA and OAI-PMH (Carl Lagoze)
• Now in 2nd iteration:
o “an information network overlay”
o More resource-centric, rather than metadata-centric (?)
o More (standard) ways to contribute metadata
o Metadata aggregation and augmentation seems interesting (Hillman)
o More (web services) ways to harvest: moving to a SOAP and/or REST-based API
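The notes above mention OAI-PMH as the current harvesting mechanism. The following minimal Python sketch shows the general shape of an OAI-PMH harvest: building a ListRecords request URL and extracting Dublin Core titles from a response. The base URL (http://example.org/oai) and the sample response are invented for illustration and do not describe the actual NSDL endpoint; no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Invented, trimmed-down response used instead of a live repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def identifiers(xml_text):
    """Return the record identifiers found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}identifier" % OAI_NS)]


def titles(xml_text):
    """Return the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(identifiers(SAMPLE_RESPONSE), titles(SAMPLE_RESPONSE))
```

A full harvester would also follow the resumptionToken element to page through large result sets, which this sketch omits.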
Resource: Science Environment for Ecological Knowledge (SEEK)
http://seek.ecoinformatics.org/
Strengths:
1) Heterogeneous digital datasets NO
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model NO
4) Plan for long-term data stewardship ?
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers ?
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• Disciplines are similar
• Supports KNB plus systematics data
• Focus on semantics and scientific workflows
• Organized into working groups
• Collaboration:
o Partnership for Biodiversity Informatics (PBI)
o National Center for Ecological Analysis and Synthesis at UC Santa
Barbara (NCEAS)
o San Diego Supercomputer Center (SDSC)
o University of Kansas (KU)
o University of New Mexico (UNM)
o Genome Center at UC Davis (UCD)
o Arizona State University (ASU)
o University of North Carolina (UNC)
o University of Vermont (UVM)
o Napier University in Scotland (Napier)
• Tools: (more available on website)
o EcoGrid – “2nd generation data network”, integrating distinct data
systems and networks, prototype using Metacat, SRB, DiGIR, Xanthoria,
etc.
o Kepler – grid-enabled scientific workflows; collaborators: SEEK Project,
SciDAC SDM Center, Ptolemy Project, GEON Project
o GrOWL – “a visualization and editing tool for Ontology Web Language
(OWL) and Description Logics (DL) ontologies based on a semantic
network knowledge representation paradigm”
o Ontologies
o ConceptMapper
Resource: Digital Library for Earth System Education (DLESE)
Public Website: http://www.dlese.org/
Community Review System: http://crs.dlese.org/
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model NO (or: applies, but negatively?)
4) Plan for long-term data stewardship NO
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers YES
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
• Ref: DRIADE goals 3) and 4): DLESE funding from NSF GEO is being phased
out (2006)
Notes:
• Partner with NSDL
Resource: Center for International Earth Science Information Network (CIESIN)
http://www.ciesin.org/index.html
Strengths:
http://www.ciesin.org/metadata/documentation/supplements/netsites.html
1) Heterogeneous digital datasets ?
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model ?
4) Plan for long-term data stewardship ?
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse ?
6) Minimize the technical expertise and time required for data deposition and
metadata generation ?
7) Be sensitive to the intellectual property rights of researchers ?
8) Focus on published datasets ?
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases ?
Limitations:
Notes:
• It appears that the bulk of the digital repository implementation work was carried out
in the mid-to-late 1990s and that the documentation for this effort is no longer readily
accessible. If further analysis of CIESIN is required, it will have to be undertaken
through direct contact or other traditional research rather than via web discovery.
Resource: Marine Metadata Initiative (MMI) / Ocean Research Interactive Observatory Networks (ORION)
MMI: http://marinemetadata.org/
ORION: http://www.orionprogram.org/default.html
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model NO
4) Plan for long-term data stewardship NO
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers NO
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• NSF-funded
• This project appears to be a model for effective collaboration and outreach
• Extensive website
Resource: Inter-university Consortium for Political and Social Research (ICPSR)
http://www.icpsr.umich.edu/
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model YES
4) Plan for long-term data stewardship YES
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse YES
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers YES
8) Focus on published datasets NO
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases NO
Limitations:
Notes:
• Established in 1962
• Over 500 member colleges and universities
• Data Documentation Initiative (DDI) http://www.icpsr.umich.edu/DDI/ is at ICPSR
• Partner with Data-PASS http://www.icpsr.umich.edu/DATAPASS/
Resource: Purdue Distributed Institutional Repository (DIR) / E-Scholar
http://e-scholar.lib.purdue.edu/
Strengths:
1) Heterogeneous digital datasets YES
2) In the field of evolutionary biology NO
3) Ensure a self-sustaining economic model ?
4) Plan for long-term data stewardship ?
5) Provide tools and incentives to researchers for quality metadata generation and
dataset reuse ?
6) Minimize the technical expertise and time required for data deposition and
metadata generation YES
7) Be sensitive to the intellectual property rights of researchers ?
8) Focus on published datasets ?
9) Provide tight linkages to major evolutionary biology journals and domain-specific
community databases ?
Limitations:
Notes: (material to be researched)
• http://www.cni.org/tfms/2005b.fall/abstracts/PB-research-brandt.html (Fall 2005):
“Purdue University Libraries will build a model to archive datasets generated at a
university during the research process and make these datasets available by
linking them to the resulting research publication.”
• http://dir.lib.purdue.edu/whitepaper.html
• http://www.rcac.purdue.edu/rcac/events/library.htm (May 2006)
• http://e-scholar.lib.purdue.edu/eScholar_RR_Sep2006.pdf (Sept 2006)
• http://www.hpcwire.com/hpc/640030.html