Enabling the reusability of scientific
data: Experiences with designing an
open access infrastructure for
sharing datasets
Simon J. Coles
EPSRC National Crystallography Service
School of Chemistry
University of Southampton
Usability WS, NeSC Jan 06
© S.J. Coles 2006
Data & the Publication Problem
[Figure: chemical structure diagrams, labelled with the figures 2,000,000, 25,000,000 and 450,000]
A Different Approach to Data Publication?
Intellect & Interpretation
Underlying data
Requirements
• Capture of all digital data and information
generated during the course of an experiment
• Data validation
• Adding value
• Archival system for data with attached
bibliographic and chemical metadata
• Automatic report generation
• Schema and protocols for publication and
dissemination of a dataset
Open Access Crystal Structure Archive
ecrystals.chem.soton.ac.uk
Access to the Underlying Data
Publicising Content
Harvesting, Linking and Aggregating
Usability: Quality & Uniformity of data
• Different laboratories, practices & instruments present
a heterogeneous body of data
• Publish according to IUCr ratified schema
• To support publication according to this schema a
toolbox add-on to the archive has been developed
• Toolbox requires only 2 mandatory files & is capable
of performing file format conversions and generating
value-added files
Usability: Ease of Deposition
& Metadata Quality
• Minimal number of manual metadata entries –
many can be hardwired into the system
• Deposition guidelines initially prepared by
students to provide impartial feedback
• Full documentation and in-line help/examples
• Restricted lists, e.g. keywords
• Data deposited automatically by toolbox
• Automated generation of metadata for report
and OAI interface
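
The deposition points above can be illustrated with a minimal sketch, assuming a deposit is represented as a plain dictionary. The hardwired values, the keyword list and the function name `build_metadata` are all hypothetical; none of them come from the archive itself.

```python
# Hypothetical values throughout: illustrative only, not taken from the archive.
HARDWIRED = {
    "publisher": "Example University",
    "type": "Crystal structure data",
    "rights": "Open access",
}
ALLOWED_KEYWORDS = {"organic", "inorganic", "organometallic"}


def build_metadata(manual_entries, keywords):
    """Merge hardwired defaults with the few manual entries and check keywords."""
    rejected = sorted(set(keywords) - ALLOWED_KEYWORDS)
    if rejected:
        raise ValueError(f"keywords not in the restricted list: {rejected}")
    record = dict(HARDWIRED)          # values the depositor never has to type
    record.update(manual_entries)     # the minimal manual entries
    record["keywords"] = sorted(set(keywords))
    return record


record = build_metadata(
    {"title": "Example structure", "creator": "A. Depositor"},
    ["organic"],
)
```

Rejecting a keyword at deposit time, rather than cleaning it up later, is one way a restricted list can keep metadata uniform across depositors.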
Usability: Data Validation
• Peer review removed from self-deposit publication
• Simple checks for consistency made by the toolbox
• Checks for crystallographic integrity made through a
web service (IUCr, ‘CHECKCIF’)
• Introduction of data ‘editor’ for the archive; a
deposition must be signed-off by a recognised
professional before going live
• Quality indicators automatically taken from dataset
and presented in HTML jump-off page
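
As a rough sketch of the last point, quality indicators could be read from a CIF and turned into table rows for an HTML page. The three data names are CIF core dictionary items; the parser handles only simple one-line `_name value` entries, and the function names and sample values are assumptions made for illustration, not the toolbox's own code.

```python
import re

# CIF core data names for common refinement quality indicators.
QUALITY_ITEMS = {
    "_refine_ls_r_factor_gt": "R factor (observed reflections)",
    "_refine_ls_wr_factor_ref": "Weighted R factor (refined reflections)",
    "_refine_ls_goodness_of_fit_ref": "Goodness of fit",
}


def quality_indicators(cif_text):
    """Return {label: float} for simple one-line `_name value` CIF items."""
    found = {}
    for line in cif_text.splitlines():
        parts = line.split()
        # CIF data names are case-insensitive, so compare in lower case.
        if len(parts) == 2 and parts[0].lower() in QUALITY_ITEMS:
            # Drop a trailing standard uncertainty such as "1.043(5)".
            value = re.sub(r"\(\d+\)$", "", parts[1])
            try:
                found[QUALITY_ITEMS[parts[0].lower()]] = float(value)
            except ValueError:
                pass  # "?" and "." mark unknown or inapplicable values
    return found


def as_html_rows(indicators):
    """Format the indicators as HTML table rows."""
    return "\n".join(
        f"<tr><td>{label}</td><td>{value:.4f}</td></tr>"
        for label, value in indicators.items()
    )


# Made-up sample values, for illustration only.
sample = """\
data_example
_refine_ls_R_factor_gt            0.0421
_refine_ls_wR_factor_ref          0.1103
_refine_ls_goodness_of_fit_ref    1.043
"""
indicators = quality_indicators(sample)
html_rows = as_html_rows(indicators)
```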
Usability: Identifiers
• URL of deposited dataset provides an identifier
• Persistent only if the institutional support model is
accepted / adopted
• Signed up with an agency to register metadata
relating to datasets with a DOI
• Pay registry to ensure that DOI always resolves to
associated dataset (10 cents to register, 1 cent per
annum to maintain)
• InChI chemical identifier - a unique text descriptor
for a molecule
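
The registry fees quoted above give a simple cost estimate. The sketch below assumes the annual fee is charged for every year counted, including the first; the archive size and time span are illustrative numbers only.

```python
REGISTER_CENTS = 10  # one-off registration fee per DOI (figure from the slide)
MAINTAIN_CENTS = 1   # annual maintenance fee per DOI (figure from the slide)


def doi_cost_cents(n_datasets, years):
    """Total registry cost in cents for n_datasets kept resolvable for `years`."""
    return n_datasets * (REGISTER_CENTS + MAINTAIN_CENTS * years)


# Illustrative case: 1,000 datasets maintained for 10 years.
total = doi_cost_cents(1000, 10)
print(f"{total / 100:.2f}")  # prints 200.00
```

Working in whole cents avoids floating-point rounding until the final display step.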
Usability: Dissemination & Aggregation
• OAI metadata schema; ratified
by IUCr & chemical community
• OAI covers bibliographic
terms; must introduce
chemical terms
• Both library and subject
specific aggregators satisfied
• Chemical linking: InChI,
chemical classifications and
restricted keywords list
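
To show what adding chemical terms alongside bibliographic ones might look like, here is a minimal sketch of a metadata record. The Dublin Core and `oai_dc` namespaces are the real ones; the `chem` namespace, its `inchi` element and all field values are hypothetical, and placing the chemical element inside the `oai_dc` container is a simplification rather than the schema ratified by the IUCr.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
CHEM = "http://example.org/chem-terms/"  # hypothetical namespace

ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("chem", CHEM)


def build_record(title, creator, identifier, inchi, keywords):
    """Build a record with bibliographic (dc) and chemical (chem) terms."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for tag, text in (("title", title), ("creator", creator),
                      ("identifier", identifier)):
        ET.SubElement(root, f"{{{DC}}}{tag}").text = text
    for keyword in keywords:
        ET.SubElement(root, f"{{{DC}}}subject").text = keyword
    # Chemical term: the InChI string lets aggregators link by molecule.
    ET.SubElement(root, f"{{{CHEM}}}inchi").text = inchi
    return root


record = build_record(
    title="Example structure",
    creator="A. Depositor",
    identifier="https://archive.example.org/123",  # placeholder URL
    inchi="InChI=1S/H2O/h1H2",                     # water, as a stand-in molecule
    keywords=["inorganic", "hydrate"],
)
xml_text = ET.tostring(record, encoding="unicode")
```

A library aggregator can read only the `dc` elements, while a subject-specific aggregator can also use the chemical element.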
Usability: Endorsement
• Feedback during development from technical
publishing arm of IUCr
• Designed for automatic incorporation into CSD
(global database operated by CCDC)
• Accepted by Executive Committee of IUCr
• Reuse of data achieved in collaboration with
Leverhulme Centre for Molecular Informatics
Usability: Community Uptake
• Southampton about to publish
routinely via the archive
• Five crystallography laboratories in the UK have agreed
to adopt the philosophy, and to install and populate
archives
• CCDC will harvest required data from all
archives
• IUCr will harvest and curate all data
• Develop aggregator services in collaboration
with IUCr
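
Harvesting from several archives could start from requests like the ones sketched below. `ListRecords`, `metadataPrefix`, `from` and `set` are standard OAI-PMH request arguments; the base URLs are placeholders, since the archives' actual OAI endpoints are not given here.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for one archive."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since then
    if set_spec:
        params["set"] = set_spec    # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoints standing in for the participating archives.
ARCHIVES = [
    "https://archive-a.example.org/oai",
    "https://archive-b.example.org/oai",
]
urls = [list_records_url(base, from_date="2006-01-01") for base in ARCHIVES]
```

An aggregator would fetch each URL in turn and follow any resumption token in the response; that step is omitted here.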
Usability: The Next Challenges
• Full acceptance by chemical community
– Validation worries
– Curation worries
– The requirement for as many peer reviewed
publications as possible (despite quality)
• Full acceptance by wider chemistry publishing
community
– Loss of control over underlying data
– Faith in Open Archives replacing experimental
descriptions in articles
• Development of fully functional aggregator services