Research data and
reproducibility at Nature
Philip Campbell
Publishing Better Science through Better Data meeting
NPG
14-11-14
Contents
• Data: opportunities and costs
• Reproducibility: Nature’s approaches
Aspiration: all scientific literature online, all
data online, and for them to interoperate
Why is open data an urgent issue?
• Closing the concept-data gap
• Maintaining the credibility of science
• Exploiting the data deluge & computational potential
• Combating fraud
• Addressing planetary challenges
• Supporting citizen science
• Responding to citizens’ demands for evidence
• Restraining the “Database State”
Intelligent openness
Openness of data per se has no value. Open science is more
than disclosure
Data must be:
• Accessible
• Intelligible
• Assessable
• Re-usable
METADATA
Only when these four criteria are fulfilled are data properly
open
The transition to open data
Pathfinder disciplines where benefit is recognised and habits are
changing
Databases as publications
• Hosts/suppliers of databases are publishers
• They have a responsibility to curate and
provide reliable access to content.
• They may also deliver other services around
their products
• They may provide the data as a public good or
charge for access
Worldwide Protein Data Bank (wwPDB)
• The Worldwide Protein Data Bank (wwPDB) archive is the
single worldwide repository of information about the 3D
structures of large biological molecules, including proteins
and nucleic acids.
• As of January 2012 it held 78,477 structures; 8,120 were added in 2011, a rate of about 677 per month. In 2011 an average of 31.6 million data files were downloaded per month. Total storage for the archive was 135 GB.
• The total cost of the project is approximately $11-12 million per year (including overhead), spread across the four member sites, which employ 69 FTE staff. wwPDB estimates that $6-7 million of this is “data in” expenses relating to the deposition and curation of data.
UK Data Archive
• The UK Data Archive, founded in 1967, is curator of the largest collection of digital social-science data in the United Kingdom. The UKDA is funded mainly by the Economic and Social Research Council, the University of Essex and JISC, and is hosted at the University of Essex.
• On average around 2,600 (new or revised) files are uploaded to the repository monthly. (This includes file packages, so the absolute number of files is higher.) The baseline size of the main storage repository is <1 TB, though with multiple versions and files outside this system a total capacity of c. 10 TB is required.
• The UKDA currently (26/1/2012) employs 64.5 people. The total expenditure of the UK Data Archive (2010-11) was approx £3.43 million. Total staff costs (2010-11) across the whole organisation: £2.43 million.
• Non-staff costs in 2009-10 were approx £580,000, but will be much higher in 2011-12, i.e. almost £3 million, due to additional investment.
Institutional Repositories (Tier 3)
» Most university repositories in the UK have small amounts of staff time. The Repositories Support Project survey in 2011 received responses from 75 UK universities. It found that the average university repository employed a total of 1.36 FTE, combined across managerial, administrative and technical roles. 40% of these repositories accept research data. In the vast majority of cases (86%), the library has lead responsibility for the repository.
» ePrints Soton
» ePrints Soton, founded in 2003, is the institutional repository for the University of
Southampton. It holds publications including journal articles, books and chapters,
reports and working papers, higher theses, and some art and design items. It is
looking to expand its holdings of datasets.
» It has a staff of 3.2 FTE (1 FTE technical, 0.9 senior editor, 1.2 editors, 0.1 senior manager). Total costs of the repository are £116,318, comprising staff costs of £111,318 and infrastructure costs of £5,000. (These figures do not include a separate repository for electronics and computer science, which will be merged into the main repository later in 2012.) It is funded and hosted by the University of Southampton, and uses the ePrints server, developed by the University of Southampton School of Electronics and Computer Science.
Contingency of these databases
• PDB and arXiv are dependent on a mix of discretionary decisions by government bodies and philanthropy
• UK Data Archive is unusual in its centrality to the
social sciences funding system
• University repositories highly varied in
performance and in support from the top
• Funders and universities are under many
pressures
• But researchers can do more to promote data
access, as can journals
Approaches to reproducibility
Growth in formal corrections
(Examples from Nature, Nature Biotechnology, Nature
Neuroscience, Nature Methods)
• Missing controls, results not sufficiently representative of
experimental variability, data selection
• Investigator bias, e.g., in determining the boundaries of an area to
study (lack of blinding)
• Technical replicates wrongly described as biological replicates
• Over-fitting of models for noisy datasets in various experimental
settings: fMRI, x-ray crystallography, machine learning
• Errors and inappropriate manipulation in image presentation, poor
data management
• Contamination of primary culture cells
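The over-fitting failure mode above can be made concrete with a toy sketch (all numbers here are invented for illustration and are not drawn from any corrected paper): a maximally flexible model reproduces a small noisy training set exactly, yet predicts held-out points far worse than a simple straight-line fit.

```python
# Toy illustration of over-fitting: a degree-5 polynomial interpolates six
# noisy training points perfectly, but generalises worse than a fitted line.

def lagrange(xs, ys, x):
    """Evaluate the unique polynomial passing exactly through (xs, ys) at x."""
    total = 0.0
    for i, xi in enumerate(xs):
        term = ys[i]
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)) ** 0.5

# Training data: true relationship y = x, plus fixed invented "noise".
xs_train = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
noise    = [0.1, -0.2, 0.15, -0.1, 0.2, -0.15]
ys_train = [x + e for x, e in zip(xs_train, noise)]

# Held-out points between the training points; true values, no noise.
xs_test = [0.1, 0.3, 0.5, 0.7, 0.9]
ys_test = xs_test

slope, intercept = fit_line(xs_train, ys_train)
line_test = rmse([slope * x + intercept for x in xs_test], ys_test)
poly_test = rmse([lagrange(xs_train, ys_train, x) for x in xs_test], ys_test)

# The quintic has zero training error by construction, yet its held-out
# error is far larger than the line's.
print(f"line test RMSE: {line_test:.3f}, quintic test RMSE: {poly_test:.3f}")
```

The same logic applies, at scale, to fMRI, crystallographic and machine-learning models: a perfect in-sample fit of a noisy dataset is evidence of flexibility, not of validity.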
Mandating reporting standards is
not sufficient
MIAME – Minimal Information About a Microarray Experiment
2002: Nature journals mandate deposition of MIAME-compliant microarray data
2006: compliance issues identified
Ioannidis et al., Nat Genet 41, 149 (2009)
Of 18 papers containing microarray data published in NG in 2005-2006, 10 analyses could not be reproduced and 6 could be reproduced only partially.
Irreproducibility: NPG actions so far
• Awareness raising – meetings 2013/14: NINDS, NCI, Academy of Medical Sciences, Royal Society, Science Europe, …
• Awareness raising – Editorials, articles by experts
• We removed length limits on online methods sections
• We substantially increased figure limits in Nature and improved access to Supplementary Information data in research journals
• Statistical advisor (Terry Hyslop) and referees appointed
• ‘Reducing our irreproducibility’ Editorial + checklists for authors, editors and referees (23 April 2013)
• Nature + NIH + Science meeting of journal editors in Washington (May 2014)
Raising awareness: our content
• Tackling the widespread and critical impact of batch effects in high-throughput data, Leek et al., NRG, Oct 2010
• How much can we rely on published data on potential drug targets? Prinz et al., NRDD, Sep 2011
• The case for open computer programs, Ince et al., Nature, Feb 2012
• Raise standards for preclinical cancer research, Begley & Ellis, Nature, Mar 2012
• Must try harder – Editorial, Nature, Mar 2012
• Face up to false positives, MacArthur, Nature, Jul 2012
• Error prone – Editorial, Nature, Jul 2012
• Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nekrutenko & Taylor, NRG, Sep 2012
• A call for transparent reporting to optimize the predictive value of preclinical research, Landis et al., Nature, Oct 2012
• Know when your numbers are significant, Vaux, Nature, Dec 2012
• Reuse of public genome-wide gene expression data, Rung & Brazma, NRG, Feb 2013
Raising awareness: our content (2)
• Reducing our irreproducibility – Editorial, Nature, May 2013
• Reproducibility: Six red flags for suspect work, Begley, Nature, May
2013
• Reproducibility: The risks of the replication drive, Bissell, Nature,
Nov 2013
• Of carrots and sticks: incentives for data sharing, Kattge et al., Nature Geoscience, Nov 2014
• Open code for open science? Easterbrook, Nature Geoscience, Nov 2014
• Code share – Editorial, Nature 29 Oct 2014
• Journals unite – Editorial with Science and NIH 6 Nov 2014
Implementation of reporting checklist
• Onerous!
– Authors, referees, editors, copyeditors
• Referees:
– We are not yet sure whether they are paying much attention.
• Authors:
– Some papers submitted with checklist without prompt
– Many have embraced source data
• Improves reporting (see following slide).
• We have commissioned an external assessment of the impact.
• The list may be driving changes in experimental design in the
longer term
Reporting animal experiments in
Nature Neuroscience
[Figure: paired bar charts comparing Jan ‘12 (10 papers) with Oct ‘13 – Jan ‘14 (41 papers), showing the percentage of papers in which randomization, blinding and predetermination of sample size were done, not done, or not reported.]
‘Not reported’ includes cases for which the specific question was not relevant (e.g.,
investigator cannot be blinded to treatment)
Most frequent problems: power analysis calculations, low n (sample size justification), proper
blinding or randomization, multiple t-tests.
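The first two problems on that list, power analysis and sample-size justification, come down to a short calculation. A minimal sketch using the standard normal-approximation formula for a two-sided two-sample comparison (the effect sizes below are illustrative, not values prescribed by the checklist):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided two-sample test,
    via the normal approximation n = 2 * ((z_{1-a/2} + z_{power}) / d)^2."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

# Even a "large" effect (Cohen's d = 0.8) needs ~25 animals per group for
# 80% power; halving the effect size roughly quadruples the requirement.
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: n = {n_per_group(d)} per group")
```

The exact t-based answer is slightly larger (26 rather than 25 per group at d = 0.8); dedicated tools such as G*Power or statsmodels give the exact figure, but the approximation already shows why a study with n = 5 per group is rarely justifiable.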
Attention needed: Cell line identity
Identify the source of cell lines and indicate if they were recently
authenticated (e.g., by STR profiling) and tested for mycoplasma
contamination.
This checklist question is not yet enforced as a mandate
Audit of Nature Cell Biology papers (Aug’13 – Dec’13):
- Of 21 relevant papers:
- 20 indicate the source of cell lines (*)
- 4 indicate authentication was done (**)
- 5 acknowledge cell lines were not authenticated
- 17 indicate the cells were tested and demonstrated mycoplasma-free (**)
(*) quality of information variable
(**) timing of tests not always satisfactory
Question about developing author-contribution transparency
• Author contribution statements in Nature journals are informal, unstructured, non-templated.
• Should this change? How? (Possible
goals: increased credit, increased
accountability for potential flaws.)
• How granular should this information
become?
Irreproducibility: underlying issues
• Experimental design: randomization, blinding, sample size determination, independent experiments vs technical replicates
• Statistics
• Big data, overfitting (needs gut scepticism/tacit knowledge)
• Gels, microscopy images
• Reagent validity – antibodies, cell lines
• Animal studies description
• Methods description
• Data deposition
• Publication bias and refutations – where?
• IP confidentiality – replication failures unpublishable
• Lab supervision
• Lab training
• Pressure to publish
• “It pays to be sloppy”
Funders: The NIH
Collins and Tabak, Nature 27 January 2014
NIH is developing a training module on enhancing reproducibility and
transparency of research findings, with an emphasis on good experimental design.
This will be incorporated into the mandatory training on responsible conduct of
research for NIH intramural postdoctoral fellows later this year. Informed by this pilot,
final materials will be posted on the NIH website by the end of this year for broad
dissemination, adoption or adaptation, on the basis of local institutional needs.
Funders: The NIH (2)
Collins and Tabak, Nature 27 January 2014
• Several of the NIH's institutes and centres are also testing the use of a checklist to ensure a more systematic evaluation of grant applications. Reviewers are reminded to check, for example, that appropriate experimental design features have been addressed, such as an analytical plan and plans for randomization, blinding and so on. A pilot was launched last year, to be completed by the end of this year, to assess the value of assigning at least one reviewer on each panel the specific task of evaluating the 'scientific premise' of the application: the key publications on which the application is based (which may or may not come from the applicant's own research efforts). This question will be particularly important when a potentially costly human clinical trial is proposed on the basis of animal-model results. If the antecedent work is questionable and the trial is particularly important, key preclinical studies may first need to be validated independently.
• Informed by feedback from these pilots, the NIH leadership will decide by the fourth quarter of this year which approaches to adopt agency-wide, which should remain specific to institutes and centres, and which to abandon.
Universities/institutes: target issues
• Data validation
• Lab size and management
• Training
• Publication bias
• Data/notebooks access
• Reagent access
Nature and NPG data policies
• Enforce community database deposition
• Encourage community database
development
• Launch Scientific Data
• Nature-journal editors encourage
submissions of Data Descriptors to
Scientific Data
Thanks for listening