Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing datasets
Simon J. Coles
EPSRC National Crystallography Service, School of Chemistry, University of Southampton
Usability WS, NeSC Jan 06 © S.J. Coles 2006

Data & the Publication Problem
[Figure: chemical structure diagrams annotated with the figures 2,000,000, 25,000,000 and 450,000; labels not recoverable]

A Different Approach to Data Publication?
Intellect & Interpretation
Underlying data

Requirements
• Capture of all digital data and information generated during the course of an experiment
• Data validation
• Adding value
• Archival system for data with attached bibliographic and chemical metadata
• Automatic report generation
• Schema and protocols for publication and dissemination of a dataset

Open Access Crystal Structure Archive
ecrystals.chem.soton.ac.uk

Access to the Underlying Data

Publicising Content

Harvesting, Linking and Aggregating

Usability: Quality & Uniformity of data
• Different laboratories, practices & instruments present a heterogeneous body of data
• Publish according to IUCr ratified schema
• To support publication according to this schema, a toolbox add-on to the archive has been developed
• Toolbox requires only 2 mandatory files & is capable of performing file format conversions and generating value-added files

Usability: Ease of Deposition & Metadata Quality
• Minimal number of manual metadata entries; many can be hardwired into the system
• Deposition guidelines initially prepared by students to provide impartial feedback
• Full documentation and in-line help/examples
• Restricted lists, e.g. keywords
• Data deposited automatically by toolbox
• Automated generation of metadata for report and OAI interface

Usability: Data Validation
• Peer review removed from self-deposit publication
• Simple checks for consistency made by the toolbox
• Checks for crystallographic integrity made through a web service (IUCr, ‘CHECKCIF’)
• Introduction of data ‘editor’ for the archive; a deposition must be signed off by a recognised professional before going live
• Quality indicators automatically taken from dataset and presented in HTML jump-off page

Usability: Identifiers
• URL of deposited dataset provides an identifier
• Persistent only if the institutional support model is accepted / adopted
• Signed up to an agency to register metadata relating to datasets with a DOI
• Pay registry to ensure that DOI always resolves to associated dataset (10 cents to register, 1 cent per annum to maintain)
• InChI chemical identifier: a unique text descriptor for a molecule

Usability: Dissemination & Aggregation
• OAI metadata schema; ratified by IUCr & chemical community
• OAI covers bibliographic terms; must introduce chemical terms
• Both library and subject-specific aggregators satisfied
• Chemical linking; InChI, chemical classifications and restricted keywords list

Usability: Endorsement
• Feedback during development from technical publishing arm of IUCr
• Designed for automatic incorporation into CSD (global database operated by CCDC)
• Accepted by Executive Committee of IUCr
• Reuse of data achieved in collaboration with Leverhulme Centre for Molecular Informatics

Usability: Community Uptake
• Southampton archive about to publish routinely via the archive
• Five crystallography laboratories in UK agreed to adopt philosophy, install and populate archives
• CCDC will harvest required data from all archives
• IUCr will harvest and curate all data
• Develop aggregator services in collaboration with IUCr

Usability: The Next Challenges
• Full acceptance by chemical community
  – Validation worries
  – Curation worries
  – The requirement for as many peer-reviewed publications as possible (despite quality)
• Full acceptance by wider chemistry publishing community
  – Loss of control over underlying data
  – Faith in Open Archives replacing experimental descriptions in articles
• Development of fully functional aggregator services
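
Illustrative sketch (not part of the original slides): the deposition rules described under Data Validation, Identifiers and Dissemination & Aggregation can be pictured roughly as below. This is a minimal Python sketch under stated assumptions, not the archive's or the toolbox's actual implementation. The class `Deposition`, its field names and the functions `consistency_problems`, `can_go_live` and `doi_cost_cents` are all hypothetical, the example URL is a placeholder, and the cost function assumes the annual fee is charged for every year the DOI is maintained. Only the 10 cent and 1 cent figures, the sign-off rule and the use of InChI and restricted keywords come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Costs quoted on the Identifiers slide, in cents.
DOI_REGISTRATION_CENTS = 10
DOI_ANNUAL_MAINTENANCE_CENTS = 1


@dataclass
class Deposition:
    """Hypothetical record for one deposited dataset (field names are assumptions)."""
    url: str                                  # dataset URL, which doubles as an identifier
    title: str                                # bibliographic term
    creator: str                              # bibliographic term
    inchi: str                                # chemical term: InChI text descriptor
    keywords: List[str] = field(default_factory=list)  # drawn from a restricted list
    signed_off_by: Optional[str] = None       # data 'editor' who approved the deposition


def consistency_problems(dep: Deposition, allowed_keywords: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the simple checks pass."""
    problems = []
    for name in ("url", "title", "creator", "inchi"):
        if not getattr(dep, name).strip():
            problems.append(f"missing {name}")
    # InChI strings begin with the literal prefix "InChI=".
    if dep.inchi.strip() and not dep.inchi.startswith("InChI="):
        problems.append("inchi does not start with 'InChI='")
    for kw in dep.keywords:
        if kw not in allowed_keywords:
            problems.append(f"keyword not in restricted list: {kw}")
    return problems


def can_go_live(dep: Deposition, allowed_keywords: Set[str]) -> bool:
    """A deposition goes live only if it passes the checks and has been signed off."""
    return not consistency_problems(dep, allowed_keywords) and bool(dep.signed_off_by)


def doi_cost_cents(years: int) -> int:
    """Registration plus maintenance, assuming the annual fee applies to every year."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return DOI_REGISTRATION_CENTS + DOI_ANNUAL_MAINTENANCE_CENTS * years


if __name__ == "__main__":
    allowed = {"hydrate"}                     # hypothetical restricted keyword list
    dep = Deposition(
        url="https://example.org/dataset/1",  # placeholder, not a real archive URL
        title="Example dataset",
        creator="A. Depositor",
        inchi="InChI=1S/H2O/h1H2",            # InChI for water, used only as an example
        keywords=["hydrate"],
    )
    print(can_go_live(dep, allowed))          # False: not yet signed off
    dep.signed_off_by = "A. Editor"
    print(can_go_live(dep, allowed))          # True
    print(doi_cost_cents(10))                 # 20
```

Under these assumptions, keeping a DOI resolvable for ten years costs 20 cents per dataset. Keeping the sign-off as a separate gate from the automated checks mirrors the slides' distinction between toolbox checks and approval by a recognised professional.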