Prepublication data sharing

Nature | Vol 461 | 10 September 2009
OPINION
Rapid release of prepublication data has served the field of genomics well. Attendees at
a workshop in Toronto recommend extending the practice to other biological data sets.
Open discussion of ideas and full disclosure of supporting facts are the bedrock
for scientific discourse and new developments. Traditionally, published papers combine the salient ideas and the supporting facts
in a single discrete ‘package’. With the advent of
methods for large-scale and high-throughput
data analyses, the generation and transmission of the underlying facts are often replaced
by an electronic process that involves sending
information to and from scientific databases.
For such data-intensive projects, the standard
requirement is that all relevant data must be
made available on a publicly accessible website
at the time of a paper’s publication1.
One of the lessons from the Human Genome
Project (HGP) was the recognition that making data broadly available before publication
can be profoundly valuable to the scientific
enterprise and lead to public benefits. This
is particularly the case when there is a community of scientists that can productively use
the data quickly — beyond what the data producers could do themselves in a similar time
period, and sometimes for scientific purposes
outside the original goals of the project.
The principles for rapid release of genome-sequence data from the HGP were formulated
at a meeting held in Bermuda in 1996; these were
then implemented by several funding agencies. In
exchange for ‘early release’ of their data, the international sequencing centres retained the right to
be the first to describe and analyse their complete
data sets in peer-reviewed publications. The
draft human genome sequence2 was the highest profile data set rapidly released before publication, with sequence assemblies greater than
1,000 base pairs usually released within 24 hours
of generation. This experience demonstrated
that the broad and early availability of sequence
data greatly benefited life sciences research by
leading to many new insights and discoveries2, including new information on 30 disease
genes published prior to the draft sequence.
At a time when advances in DNA sequencing
technologies mean that many more laboratories
can produce massive data sets, and when an
ever-growing number of fields (beyond genome
sequencing) are grappling with their own data-sharing policies, a Data Release Workshop was
convened in Toronto in May 2009 by Genome
Canada and other funding agencies. The meeting brought together a diverse and multinational
group of scientists, ethicists, lawyers, journal
editors and funding representatives. The goal
was to reaffirm and refine, where needed, the
policies related to the early release of genomic
data, and to extend, if possible, similar data-release policies to other types of large biological
data sets — whether from proteomics, biobanking or metabolite research.
Building on the past
By design, the Toronto meeting continued
policy discussions from previous meetings,
in particular the Bermuda meetings (1996,
1997 and 1998)3–5 and the 2003 Fort Lauderdale meeting, which recommended that rapid
prepublication release be applied to other data
sets whose primary utility was as a resource for
the scientific community, and also established
the responsibilities of the resource producers,
resource users, and the funding agencies6.
A similar 2008 Amsterdam meeting extended
the principle of rapid data release to proteomics data7. Although the recommendations
of these earlier meetings can apply to many
genomics and proteomics projects, many
outside the major sequencing centres and
funding agencies remain unaware of the
details of these policies, and so one goal of the
Toronto meeting was to reaffirm the existing
principles for early data release with a wider
group of stakeholders.
In Toronto, attendees endorsed the value of
rapid prepublication data release for large reference data sets in biology and medicine that have
broad utility and agreed that prepublication
data release should go beyond genomics and
proteomics studies to other data sets — including chemical structure, metabolomic and RNA
interference data sets, and to annotated clinical
resources (cohorts, tissue banks and case-control studies). In each of these domains, there
are diverse data types and study designs, ranging from the large-scale ‘community resource
projects’ first identified at Fort Lauderdale
(for which meeting participants endorsed
prepublication data release) to investigator-led hypothesis-testing projects (for which
the minimum standard should be the release
of generated data at the time of publication).
Several issues discussed at previous data-release meetings were not revisited, as they were considered fundamental to all types of data release (whether prepublication or publication-associated). These included: specified quality standards for all data; database designs that meet the needs of both data producers and users alike; archiving of raw data in a retrievable form; housing of both ‘finished’ and ‘unfinished’ data in databases; and provision of long-term support for databases by funding agencies. New issues that were addressed include the importance of simultaneously releasing metadata (such as environmental or experimental conditions and phenotypes) that will enable users to fully exploit the data, as well as the complexities associated with clinical data because of concerns about privacy and confidentiality (see ‘Sharing data about human subjects’).

At a practical level, the Toronto meeting developed a set of suggested ‘best practices’ for funding agencies, for scientists in their different roles (whether as data producers, data analysts/users or manuscript reviewers), and for journal editors (see ‘The Toronto statement’).

EXAMPLES OF PREPUBLICATION DATA-RELEASE GUIDELINES
Project type | Prepublication data release recommended | Prepublication data release optional
Genome sequencing | Whole-genome or mRNA sequence(s) of a reference organism or tissue | Sequences from a few loci for cross-species comparisons in a limited number of samples
Polymorphism discovery | Catalogue of variants from genomic and/or transcriptomic samples in one or more populations | Variants in a gene, a gene family or a genomic region in selected pedigrees or populations
Genetic association studies | Genome-wide association analysis of thousands of samples | Genotyping of selected gene candidates
Somatic mutation discovery | Catalogue of somatic mutations in exomes or genomes of tumour and non-tumour samples | Somatic mutations of a specific locus or limited set of genomic regions
Microbiome studies | Whole-genome sequence of microbial communities in different environments | Sequencing of target locus in a limited number of microbiome samples
RNA profiling | Whole-genome expression profiles from a large panel of reference samples | Whole-genome expression profiles of a perturbed biological system(s)
Proteomic studies | Mass spectrometry data sets from large panels of normal and disease tissues | Mass spectrometry data sets from a well-defined and limited set of tissues
Metabolomic studies | Catalogue of metabolites in one or more tissues of an organism | Analyses of metabolites induced in a perturbed biological system(s)
RNAi or chemical library screen | Large-scale screen of a cell line or organism analysed for standard phenotypes | Focused screens used to validate a hypothetical gene network
3D-structure elucidation | Large-scale cataloguing of 3D structures of proteins or compounds | 3D structure of a synthetic protein or compound elucidated in the context of a focused project

The Toronto statement
Rapid prepublication data release should be encouraged for projects with the following attributes:
● Large scale (requiring significant resources over time)
● Broad utility
● Creating reference data sets
● Associated with community buy-in
Funding agencies should facilitate the specification of data-release policies for relevant projects by:
● Explicitly informing applicants of data-release requirements, especially mandatory prepublication data release
● Ensuring that evaluation of data-release plans is part of the peer-review process
● Proactively establishing analysis plans and timelines for projects releasing data prepublication
● Fostering investigator-initiated prepublication data release
● Helping to develop appropriate consent, security, access and governance mechanisms that protect research participants while encouraging prepublication data release
● Providing long-term support of databases
Data producers should state their intentions and enable analyses of their data by:
● Informing data users about the data being generated, data standards and quality, planned analyses, timelines, and relevant contact information, ideally through publication of a citable marker paper near the start of the project or by provision of a citable URL at the project or funding-agency website
● Providing relevant metadata (e.g., questionnaires, phenotypes, environmental conditions, and laboratory methods) that will assist other researchers in reproducing and/or independently analysing the data, while protecting interests of individuals enrolled in studies focusing on humans
● Ensuring that research participants are informed that their data will be shared with other scientists in the research community
● Publishing their initial global analyses, as stated in the marker paper or citable URL, in a timely fashion
● Creating databases designed to archive all data (including underlying raw data) in an easily retrievable form and facilitate usage of both pre-processed and processed data
Data analysts/users should freely analyse released prepublication data and act responsibly in publishing analyses of those data by:
● Respecting the scientific etiquette that allows data producers to publish the first global analyses of their data set
● Reading the citable document associated with the project
● Accurately and completely citing the source of prepublication data, including the version of the data set (if appropriate)
● Being aware that released prepublication data may be associated with quality issues that will be later rectified by the data producers
● Contacting the data producers to discuss publication plans in the case of overlap between planned analyses
● Ensuring that use of data does not harm research participants and is in conformity with ethical approvals
Scientific journal editors should engage the research community about issues related to prepublication data release and provide guidance to authors and reviewers on the third-party use of prepublication data in manuscripts.

Recommendations for funders
Funding agencies should require rapid prepublication data release for projects that generate data sets that have broad utility, are large in scale, are ‘reference’ in character and typically have community ‘buy-in’. The table ‘Examples of prepublication data-release guidelines’ provides examples of projects using different designs, technologies, and approaches that have several of these attributes, but also lists projects that are more hypothesis-based for which prepublication data release should not be mandated.

It was agreed at the meeting that the requirements for prepublication data release must be made clear when funding opportunities are first announced and that proactive engagement of funders is beneficial throughout a project, as has been the experience of many genome-sequencing efforts, the International HapMap Project, the ENCODE project, the 1000 Genomes project and, more recently, the International Cancer Genome Consortium, the Human Microbiome Project and the MetaHIT project.

For all projects generating large data sets, the Toronto meeting recommended that funding agencies require that data-sharing plans be presented as part of grant applications and that these plans be subjected to peer review. Such practice is currently the exception rather than the rule. Funding agencies will need to exercise flexibility by, for example, recognizing that large-scale data-generation projects need not necessarily lead to traditional publications, and that certain projects may need to release only some of their generated data before publication. At the same time, general consistency in data-sharing policies between funding agencies is desirable, whenever possible. To encourage compliance, funding agencies and academic institutions should give credit to investigators who adopt prepublication data-release practices. One option would be to recognize good data-release behaviour during grant renewals and promotion processes; another would be to track the usage and citation of data sets using electronic systems similar to those used for traditional publications8.
Data producers and data users
Early data release can lead to tensions between
the interests of the data-producing scientists
who request the right to publish a first description of a data set and other scientists who wish
to publish their own analyses of the same data.
To date, many papers have been published by
third parties reporting research findings enabled by data sets released before publication.
The experiences shared in Toronto suggest
that these have rarely affected subsequent
publications authored by the data producers.
Nevertheless, the Toronto meeting participants
recognized that this is an ongoing concern that
is best addressed by fostering a scientific culture that encourages transparent and explicit
cooperation on the part of data producers, data analysts, reviewers and journal editors.
Sharing data about human subjects
Data about human subjects participating in genetic and epidemiological research require particularly careful consideration owing to privacy-protection issues and the potential harms that could arise from misuse. These issues are critical to all databases housing information about human subjects, whether or not they contain prepublication data. For these reasons, it is important to develop and implement robust governance models and procedures for human subjects data early in a project. Lessons can probably be learned from data policies adopted by several genomics projects9 that generate human-subject data. For aggregated data that cannot be used to identify individuals, databases are open access, but for clinical and genomic data that are associated with a unique, but not directly identifiable individual, access may be restricted. Under such conditions, arguments can be made for the release of data for studies involving human subjects, as doing so can augment the opportunities for new discoveries that could ultimately benefit individuals, communities, and society at large.

Data producers should, as early as possible, and ideally before large-scale data generation begins, clarify their overall intentions for data analysis by providing a citable statement, typically a ‘marker paper’, that would be associated with their database entries. This statement should provide clear details about the data set to be produced, the associated metadata, the experimental design, pilot data, data standards, security, quality-control procedures, expected timelines, data-release mechanisms and contact details for lead investigators. If data producers request a protected time period to allow them to be the first to publish the data set, this should be limited to global analyses of the data and ideally expire within one year.
If the citable statement is a ‘marker paper’ it should be subjected to peer review and published in a scientific journal. Alternatively, other citable sources, such as digital object identifiers to specific pages on well-maintained funding agency or institutional websites, could also be used. Data producers benefit from creating a citable reference, as it can later be used to reflect the impact of the data sets8.

In turn, the data users should carefully read the source information, including any marker papers, associated with a released data set. Data analysts should pay particular attention to any caveats about data quality, because rapidly released data are often unstable, in that they may not yet have been subjected to full quality control and so may change. It would be prudent for data analysts to assess the benefits and potential problems in immediately analysing released data. They should communicate with data producers to clarify issues of data quality in relation to the intended analyses, whenever possible. In addition, data users should be aware that some data sets are associated with version numbers: the appropriate version number should be tracked and then provided in any published analyses of those data.

Resulting papers describing studies that do not overlap with the intentions stated by the data producers in the marker paper (or other citable source) may be submitted for publication at any time, but must appropriately cite the data source. Papers describing studies that do overlap with the data producer’s proposed analyses should be handled carefully and respectfully, ideally including a dialogue with the data producer to see if a mutually agreeable publication schedule (such as co-publication or inclusion within a set of companion papers) can be developed. In this regard, it is important for data users to realize that, historically, many such dialogues have led to coordinated publications and to new scientific insights. Despite the best intentions of all parties, on occasion a researcher may publish the results of analyses that overlap with the planned studies of the data producer. Although such instances are hopefully rare if good communication protocols are followed, these should be viewed as a small risk to the data producers, one that comes with the much greater overall benefit of early data release.

Editors and reviewers
As reviewers of manuscripts submitted for publication, scientists should be mindful that prepublication data sets are likely to have been released before extensive quality control is performed, and any unnoticed errors may cause problems in the analyses performed by third parties. Where the use of prepublication data is limited or not crucial to a study’s conclusions, the reviewers should only expect the normal scientific practice of clear citation and interpretation. However, when the main conclusions of a study rely on a prepublication data set, reviewers should be satisfied that the quality of the data is described and taken into account in the analysis.

Participants at the Toronto meeting recommended that journals play an active part in the dialogue about rapid prepublication data release (both in their formal guide to authors and informal instructions to reviewers). Journal editors should remind reviewers that large-scale data sets may be subject to specific policies regarding how to cite and use the data. Ultimately, journal editors must rely on their reviewers’ recommendations for reaching decisions about publication. However, encouraging reviewers to carefully check the conditions for using data that authors have not created themselves can help to raise both the quality of analysis and fairness in citation of published studies.

Conclusion
The rapid prepublication release of sequencing data has served the field of genomics well. The Toronto meeting participants acknowledged that policies for prepublication release of data need to evolve with the changing research landscape, that there is a range of opinion in the scientific community, and that actual community behaviour (as opposed to intentions) needs to be reviewed on a regular basis. To this end, we encourage readers to join the debate over data-sharing principles and practice in an online forum hosted at http://tinyurl.com/lqxpg3.

Toronto International Data Release Workshop Authors. A complete list of the authors and their affiliations accompanies this article online.
e-mail: birney@ebi.ac.uk, tom.hudson@oicr.on.ca

1. Committee on Responsibilities of Authorship in the Biological Sciences, National Research Council. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences (National Academy of Sciences, 2003).
2. International Human Genome Sequencing Consortium. Nature 409, 860–921 (2001).
3. Summary of Principles Agreed at the First International Strategy Meeting on Human Genome Sequencing, Bermuda, 25–28 February 1996 (HUGO, 1996); available at www.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml
4. Summary of the Report of the Second International Strategy Meeting on Human Genome Sequencing, Bermuda, 27 February–2 March 1997 (HUGO, 1997); available at www.ornl.org/sci/techresources/Human_Genome/research/bermuda.shtml#2
5. Guyer, M. Genome Res. 8, 413 (1998).
6. Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility (Wellcome Trust, 2003); available at www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtd003207.pdf
7. Rodriguez, H. et al. J. Proteome Res. 8, 3689–3692 (2009).
8. Nature Biotechnol. 27, 579 (2009).
9. Kaye, J., Heeney, C., Hawkins, N., de Vries, J. & Boddington, P. Nature Rev. Genet. 10, 331–335 (2009).

See online special at http://tinyurl.com/dataspecial
© 2009 Macmillan Publishers Limited. All rights reserved