Archiving Research Data, Dryad, and Publishers
Neil Beagrie, Charles Beagrie Ltd

Bloomsbury Conference June 2010
With contributions from Julia Chruszcz, Peter Williams, and Todd Vision
Overview
• The Challenge;
• The Dryad Consortium;
• Supplementary Data and Publishers;
• Research Data Preservation Costs (KRDS);
• The Future.
The Challenge
PRC Global Study
[Figure: chart from the PRC global study; the sample sizes shown ranged from n=841 to n=3759.]
Source: PRC global study (forthcoming)
Requesting Data
• Wicherts et al. (2006 Am. Psychol. 61, 726)
requested data from the 141 most recent
articles in American Psychological
Association (APA) journals.
“6 months later, after … 400 emails, [sending]
detailed descriptions of our study aims, approvals
of our ethical committee, signed assurances not to
share data with others, and even our full
resumes…”
Only 27% of authors shared their data
The Dryad Consortium of
Scholarly Societies and
publishers (and libraries)
Archiving at publication
• Avoids loss, corruption, obsolescence of data files;
• The point in time when authors are best able to ensure
the correctness of data and metadata;
• Authors have incentive to deposit their data in order to
complete the publication process;
• Journals are best able to monitor compliance with policy;
• In short, the “Genbank model” works.
Incentives to authors
• Access to colleagues’ data
• Visibility and citability
– Another way for work to have high impact
• Integration
– Combinability with other data adds value
• Long-term preservation
– Including data format migration
• Ad hoc data sharing can be burdensome
– Deposition to multiple specialized repositories
– Fulfilling individual requests for data takes effort
Joint Data Archiving Policy
• DEPOSIT AT PUBLICATION
– As a condition for publication, all data used in the paper should be
archived in an appropriate public archive.
• REPEATABILITY
– Data should be given with sufficient detail so that together with the
paper content, each result in the published paper may be re-created.
• EMBARGO
– Authors may elect to have the data publicly available at time of
publication or, if the archive allows, opt to embargo access to the data.
• EXCEPTIONS
– Exceptions may be granted at the discretion of the editor, especially
for sensitive information such as the location of endangered species.
• COORDINATION
– The aim is for the Dryad consortium of journals to adopt this policy
simultaneously.
That’s all well and good, but
where’s this “appropriate
public archive”?
A mosaic of specialized
databases
• A growing number of databases encourage or require
deposition (Genbank, Treebase)
– And others are emerging
• A world in which every datatype had its own
required database, each with its own submission
system:
– Would be a huge burden on authors
– Would inevitably leave some data orphaned
– Might never be financially possible
Overcoming the submission
burden
• Integrating journal submission and data
submission
– Prepopulating bibliographic metadata
– “Handshaking” with specialized repositories
• Enhancing low-quality author-provided
metadata
– Human curation
– Machine-assisted metadata enhancement
The Dryad Digital Repository
The Repository
• Dryad is a repository (at Duke) for datasets
underlying scientific research articles;
• Its initial focus has been evolution and ecology;
• Participating journals subscribe to the Joint Data
Archiving Policy;
• Dryad datasets will have digital object identifiers (DOIs) and Creative
Commons ‘CC-Zero’ licenses;
• Project funded by the National Science Foundation
2008-2012;
• Sustainability plan a key deliverable.
Supplementary Data and
Publishers
Overview
• Consultancy for Dryad Sustainability: covered areas of draft
business plan and sustainability for Dryad
• Presenting one of the contributions (publishers) to section on
Comparators and Costs
• Outcomes from desk research and 12 interviews with
publishers/data publishers + some additional input drawn
from Keeping Research Data Safe
• Very brief presentation: article in preparation for Learned
Publishing Oct 2010 issue; KRDS2 available from JISC
Interviewees
• Journal of Clinical Investigation
• Journal of the American Medical Association
• Molecular Phylogenetics and Evolution (Elsevier)
• Journal of Heredity (OUP)
• Ecological Society of America
• Wiley-Blackwell + Ecology Letters
• Royal Society
• Federation of American Societies for Experimental Biology
• OECD Publishing
• Internet Archaeology and Archaeology Data Service
• Pangaea: Publishing Network for Geoscientific &
Environmental Data
• Dataverse Network (Social Sciences, Harvard)
Some Findings: growth
• Many interviewees stated that supplementary data and
materials are showing rapid growth
• 3 gave figures: from 32 articles in 2000 to 251 in 2009 (an
increase of 684%, i.e. 784% of the 2000 figure); from 6% in 2005
to 38% in 2009; from 2% a decade ago to 87% in 2009.
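The first of these figures can be checked with a short calculation. The sketch below is illustrative only (the function names are not from the study): going from 32 to 251 articles puts the 2009 count at about 784% of the 2000 count, which corresponds to an increase of about 684%.

```python
def percent_increase(old: float, new: float) -> float:
    """Percentage increase from old to new (100 means the value doubled)."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (new - old) / old * 100


def percent_of(old: float, new: float) -> float:
    """New value expressed as a percentage of the old value."""
    if old <= 0:
        raise ValueError("old must be positive")
    return new / old * 100


# Article counts quoted by one interviewee.
articles_2000, articles_2009 = 32, 251

print(f"Increase: {percent_increase(articles_2000, articles_2009):.0f}%")
print(f"2009 as a share of 2000: {percent_of(articles_2000, articles_2009):.0f}%")
```

The two measures differ by exactly 100 percentage points, which is an easy place for a reported growth figure to slip.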
Some Findings: workflow
• Supplementary data have grown organically at the various
journals investigated (author driven);
• Both the work and the costs are being absorbed into the daily
running of journals;
• In 4 cases there was minimal impact on work duties; in 5 others there
was a significant but often unquantified impact (two of these
might be considered data publications, with a focus on
publishing data papers or datasets); and in 3 cases the
information was not available or unknown;
• These differences can be explained by the level of effort
applied: the greatest effort is associated with
copy editing, format migration, addition of metadata, etc.,
whilst the least effort is required for simply hosting the
material and/or where the workflow is highly automated.
Some Findings: costs
• These were in most cases unknown or only partially known;
• Costs mentioned but usually not quantified include: digital
storage costs, salary costs of journal staff; and long term
preservation costs;
• Detailed cost information was really only available from
Internet Archaeology via Archaeology Data Service which
had participated in an activity based costing study (KRDS2);
• Internet Archaeology archiving costs reflect those for a
“dataset publisher” so only a comparator for part of Dryad’s
content – large datasets.
Some Findings: revenue
• Only author fees and journal subscription fees were
mentioned as current revenue sources for the
supplementary materials in journals;
• 3 journals interviewed have author charges for
supplementary materials (see next slide);
• The data archiving and sharing organisations interviewed
relied primarily on (uncertain) research grants and
temporary or recurrent core funding, but one had access to
a small endowment and another has a charging policy for
some depositors.
Some Findings: author charges
• Journal of Clinical Investigation - authors are charged $300
for supplemental data to appear online with accepted
articles;
• Ecological Archives - submission of ‘appendices and
supplements’ is free up to 10MB. Above this, there is a fee
of $250 for the first 1 GB and $50 for each subsequent GB.
The fee for publication of a data paper is $250 for
publication of the abstract in the relevant journal plus
publication of up to 10 MB in Ecological Archives. An
additional $250 is charged for data sets between 10MB and
1GB, and for larger datasets there is an additional $50 per
GB fee;
• The Federation of American Societies for Experimental
Biology (FASEB) charges $100 for each Supplemental file.
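The Ecological Archives schedule above is tiered, so a small calculator may make it easier to compare with the flat charges. This is a minimal sketch under stated assumptions, not the publisher's own billing logic: it assumes 1 GB = 1000 MB, that any part of a GB beyond the first is billed as a whole GB, and that data papers follow the same tiers on top of the $250 abstract fee. All function and constant names are hypothetical.

```python
import math

FREE_LIMIT_MB = 10         # free allowance quoted in the slide
MB_PER_GB = 1000           # assumption: decimal gigabytes
FIRST_GB_FEE = 250         # USD, for material above 10 MB and up to 1 GB
EXTRA_GB_FEE = 50          # USD, for each subsequent GB
DATA_PAPER_BASE_FEE = 250  # USD, abstract plus up to 10 MB


def supplement_fee(size_mb: float) -> int:
    """Fee in USD for 'appendices and supplements' of the given size."""
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb <= FREE_LIMIT_MB:
        return 0
    fee = FIRST_GB_FEE
    if size_mb > MB_PER_GB:
        # Assumption: a partial GB beyond the first counts as a full GB.
        extra_gb = math.ceil((size_mb - MB_PER_GB) / MB_PER_GB)
        fee += extra_gb * EXTRA_GB_FEE
    return fee


def data_paper_fee(size_mb: float) -> int:
    """Fee in USD for a data paper: base fee plus the size-based tiers."""
    return DATA_PAPER_BASE_FEE + supplement_fee(size_mb)


def faseb_fee(n_files: int) -> int:
    """FASEB charge in USD: $100 per supplemental file."""
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    return 100 * n_files


for size in (5, 500, 2500):
    print(size, "MB:", supplement_fee(size), data_paper_fee(size))
```

Under these assumptions a 2.5 GB supplement costs $250 + 2 × $50 = $350, and the same data published as a data paper costs $600.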
Keeping Research Data Safe
(KRDS1 & KRDS2):
JISC-funded studies of Research Data
Preservation Costs
(separate Dryad costing project by Lori Eakin-Richards based on KRDS approach)
KRDS: what did we learn?
Whole of Service costing / Seeing the “Big Picture”
Selection of 2009 Allocation of UKDA Activity Costs
Acquisition: 5.8%
Ingest: 21.5%
A. Storage + Pres. Planning: 3.1%
Access: 16.9%
KRDS: Implications
• Changing view of digital preservation costs:
– “getting stuff in and out” costs much higher than
“keeping it (bit preservation + migration)”;
– Staff costs c.70% of total costs;
– Importance of economies of scale and
automation;
– Findings of KRDS and Dryad Repository’s own
activity costing projections fed into Dryad
sustainability planning.
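The first implication can be illustrated with the selection of 2009 UKDA activity shares shown earlier. The sketch below is only a rough check: treating acquisition, ingest and access as the "in and out" activities is an assumption made here for illustration, and the four shares are a selection, so they do not sum to 100%.

```python
# Selected 2009 UKDA activity cost shares (percent of total), from the slide.
ukda_cost_shares = {
    "Acquisition": 5.8,
    "Ingest": 21.5,
    "A. Storage + Pres. Planning": 3.1,
    "Access": 16.9,
}

# Assumption: these three activities represent "getting stuff in and out".
IN_AND_OUT = ("Acquisition", "Ingest", "Access")
KEEPING = ("A. Storage + Pres. Planning",)

in_and_out_share = sum(ukda_cost_shares[name] for name in IN_AND_OUT)
keeping_share = sum(ukda_cost_shares[name] for name in KEEPING)
ratio = in_and_out_share / keeping_share

print(f"In and out: {in_and_out_share:.1f}% of total costs")
print(f"Keeping:    {keeping_share:.1f}% of total costs")
print(f"Ratio:      {ratio:.1f}x")
```

On this grouping, getting data in and out accounts for 44.2% of costs against 3.1% for keeping it, roughly fourteen times as much.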
Future Plans
• Dryad sustainability plan being put to Dryad
member societies and publishers;
• Dryad extending consortium to new members
– achieving economies of scale;
• Bid to JISC to establish Dryad-UK;
• Extending KRDS research and
implementations.
Further Information
Dryad see www.datadryad.org
Keeping Research Data Safe 2 (KRDS2)
webpage at www.beagrie.com/jisc.php
KRDS2 report available from JISC website
http://www.jisc.ac.uk/publications/reports/2010/keepingresearchdatasafe2.aspx#downloads
Email: neil@beagrie.com