Vol 461 | 10 September 2009

OPINION

Prepublication data sharing

Rapid release of prepublication data has served the field of genomics well. Attendees at a workshop in Toronto recommend extending the practice to other biological data sets.

Open discussion of ideas and full disclosure of supporting facts are the bedrock for scientific discourse and new developments. Traditionally, published papers combine the salient ideas and the supporting facts in a single discrete 'package'. With the advent of methods for large-scale and high-throughput data analyses, the generation and transmission of the underlying facts are often replaced by an electronic process that involves sending information to and from scientific databases. For such data-intensive projects, the standard requirement is that all relevant data must be made available on a publicly accessible website at the time of a paper's publication1.

One of the lessons from the Human Genome Project (HGP) was the recognition that making data broadly available before publication can be profoundly valuable to the scientific enterprise and lead to public benefits. This is particularly the case when there is a community of scientists that can productively use the data quickly — beyond what the data producers could do themselves in a similar time period, and sometimes for scientific purposes outside the original goals of the project.

The principles for rapid release of genome-sequence data from the HGP were formulated at a meeting held in Bermuda in 1996; these were then implemented by several funding agencies. In exchange for 'early release' of their data, the international sequencing centres retained the right to be the first to describe and analyse their complete data sets in peer-reviewed publications. The draft human genome sequence2 was the highest-profile data set rapidly released before publication, with sequence assemblies greater than 1,000 base pairs usually released within 24 hours of generation.
This experience demonstrated that the broad and early availability of sequence data greatly benefited life-sciences research by leading to many new insights and discoveries2, including new information on 30 disease genes published prior to the draft sequence.

At a time when advances in DNA-sequencing technologies mean that many more laboratories can produce massive data sets, and when an ever-growing number of fields (beyond genome sequencing) are grappling with their own data-sharing policies, a Data Release Workshop was convened in Toronto in May 2009 by Genome Canada and other funding agencies. The meeting brought together a diverse and multinational group of scientists, ethicists, lawyers, journal editors and funding representatives. The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data-release policies to other types of large biological data sets — whether from proteomics, biobanking or metabolite research.

Building on the past

By design, the Toronto meeting continued policy discussions from previous meetings, in particular the Bermuda meetings (1996, 1997 and 1998)3–5 and the 2003 Fort Lauderdale meeting, which recommended that rapid prepublication release be applied to other data sets whose primary utility was as a resource for the scientific community, and which also established the responsibilities of the resource producers, resource users and funding agencies6. A similar 2008 Amsterdam meeting extended the principle of rapid data release to proteomics data7. Although the recommendations of these earlier meetings can apply to many genomics and proteomics projects, many outside the major sequencing centres and funding agencies remain unaware of the details of these policies, and so one goal of the Toronto meeting was to reaffirm the existing principles for early data release with a wider group of stakeholders.
In Toronto, attendees endorsed the value of rapid prepublication data release for large reference data sets in biology and medicine that have broad utility, and agreed that prepublication data release should go beyond genomics and proteomics studies to other data sets — including chemical-structure, metabolomic and RNA-interference data sets, and annotated clinical resources (cohorts, tissue banks and case-control studies). In each of these domains, there are diverse data types and study designs, ranging from the large-scale 'community resource projects' first identified at Fort Lauderdale (for which meeting participants endorsed prepublication data release) to investigator-led hypothesis-testing projects (for which the minimum standard should be the release of generated data at the time of publication).

Several issues discussed at previous data-release meetings were not revisited, as they were considered fundamental to all types of data release (whether prepublication or publication-associated). These included: specified quality standards for all data; database designs that meet the needs of both data producers and users alike; archiving of raw data in a retrievable form; housing of both 'finished' and 'unfinished' data in databases; and provision of long-term support for databases by funding agencies. New issues that were addressed include the importance of simultaneously releasing metadata (such as environmental or experimental conditions and phenotypes) that will enable users to fully exploit the data, as well as the complexities associated with clinical data because of concerns about privacy and confidentiality (see 'Sharing data about human subjects').

At a practical level, the Toronto meeting developed a set of suggested 'best practices' for funding agencies, for scientists in their different roles (whether as data producers, data analysts/users or manuscript reviewers), and for journal editors (see 'The Toronto statement').

EXAMPLES OF PREPUBLICATION DATA-RELEASE GUIDELINES

Project type | Prepublication data release recommended | Prepublication data release optional
Genome sequencing | Whole-genome or mRNA sequence(s) of a reference organism or tissue | Sequences from a few loci for cross-species comparisons in a limited number of samples
Polymorphism discovery | Catalogue of variants from genomic and/or transcriptomic samples in one or more populations | Variants in a gene, a gene family or a genomic region in selected pedigrees or populations
Genetic association studies | Genome-wide association analysis of thousands of samples | Genotyping of selected gene candidates
Somatic mutation discovery | Catalogue of somatic mutations in exomes or genomes of tumour and non-tumour samples | Somatic mutations of a specific locus or limited set of genomic regions
Microbiome studies | Whole-genome sequence of microbial communities in different environments | Sequencing of a target locus in a limited number of microbiome samples
RNA profiling | Whole-genome expression profiles from a large panel of reference samples | Whole-genome expression profiles of a perturbed biological system(s)
Proteomic studies | Mass spectrometry data sets from large panels of normal and disease tissues | Mass spectrometry data sets from a well-defined and limited set of tissues
Metabolomic studies | Catalogue of metabolites in one or more tissues of an organism | Analyses of metabolites induced in a perturbed biological system(s)
RNAi or chemical library screen | Large-scale screen of a cell line or organism analysed for standard phenotypes | Focused screens used to validate a hypothetical gene network
3D-structure elucidation | Large-scale cataloguing of 3D structures of proteins or compounds | 3D structure of a synthetic protein or compound elucidated in the context of a focused project

The Toronto statement

Rapid prepublication data release should be encouraged for projects with the following attributes:
● Large scale (requiring significant resources over time)
● Broad utility
● Creating reference data sets
● Associated with community buy-in

Funding agencies should facilitate the specification of data-release policies for relevant projects by:
● Explicitly informing applicants of data-release requirements, especially mandatory prepublication data release
● Ensuring that evaluation of data-release plans is part of the peer-review process
● Proactively establishing analysis plans and timelines for projects releasing data prepublication
● Fostering investigator-initiated prepublication data release
● Helping to develop appropriate consent, security, access and governance mechanisms that protect research participants while encouraging prepublication data release
● Providing long-term support of databases

Data producers should state their intentions and enable analyses of their data by:
● Informing data users about the data being generated, data standards and quality, planned analyses, timelines, and relevant contact information, ideally through publication of a citable marker paper near the start of the project or by provision of a citable URL at the project or funding-agency website
● Providing relevant metadata (e.g., questionnaires, phenotypes, environmental conditions, and laboratory methods) that will assist other researchers in reproducing and/or independently analysing the data, while protecting the interests of individuals enrolled in studies focusing on humans
● Ensuring that research participants are informed that their data will be shared with other scientists in the research community
● Publishing their initial global analyses, as stated in the marker paper or citable URL, in a timely fashion
● Creating databases designed to archive all data (including underlying raw data) in an easily retrievable form and facilitate usage of both pre-processed and processed data

Data analysts/users should freely analyse released prepublication data and act responsibly in publishing analyses of those data by:
● Respecting the scientific etiquette that allows data producers to publish the first global analyses of their data set
● Reading the citable document associated with the project
● Accurately and completely citing the source of prepublication data, including the version of the data set (if appropriate)
● Being aware that released prepublication data may be associated with quality issues that will be later rectified by the data producers
● Contacting the data producers to discuss publication plans in the case of overlap between planned analyses
● Ensuring that use of data does not harm research participants and is in conformity with ethical approvals

Scientific journal editors should engage the research community about issues related to prepublication data release and provide guidance to authors and reviewers on the third-party use of prepublication data in manuscripts.

Recommendations for funders

Funding agencies should require rapid prepublication data release for projects that generate data sets that have broad utility, are large in scale, are 'reference' in character and typically have community 'buy-in'. The table above provides examples of projects using different designs, technologies and approaches that have several of these attributes, but it also lists projects that are more hypothesis-based, for which prepublication data release should not be mandated.

It was agreed at the meeting that the requirements for prepublication data release must be made clear when funding opportunities are first announced, and that proactive engagement of funders is beneficial throughout a project, as has been the experience of many genome-sequencing efforts, the International HapMap Project, the ENCODE project, the 1000 Genomes Project and, more recently, the International Cancer Genome Consortium, the Human Microbiome Project and the MetaHIT project.

For all projects generating large data sets, the Toronto meeting recommended that funding agencies require that data-sharing plans be presented as part of grant applications and that these plans be subjected to peer review. Such practice is currently the exception rather than the rule. Funding agencies will need to exercise flexibility by, for example, recognizing that large-scale data-generation projects need not necessarily lead to traditional publications, and that certain projects may need to release only some of their generated data before publication. At the same time, general consistency in data-sharing policies between funding agencies is desirable whenever possible.

To encourage compliance, funding agencies and academic institutions should give credit to investigators who adopt prepublication data-release practices. One option would be to recognize good data-release behaviour during grant renewals and promotion processes; another would be to track the usage and citation of data sets using electronic systems similar to those used for traditional publications8.

Data producers and data users

Early data release can lead to tensions between the interests of the data-producing scientists, who request the right to publish a first description of a data set, and other scientists who wish to publish their own analyses of the same data. To date, many papers have been published by third parties reporting research findings enabled by data sets released before publication. The experiences shared in Toronto suggest that these have rarely affected subsequent publications authored by the data producers.
Nevertheless, the Toronto meeting participants recognized that this is an ongoing concern that is best addressed by fostering a scientific culture that encourages transparent and explicit cooperation on the part of data producers, data analysts, reviewers and journal editors.

Data producers should, as early as possible, and ideally before large-scale data generation begins, clarify their overall intentions for data analysis by providing a citable statement, typically a 'marker paper', that would be associated with their database entries. This statement should provide clear details about the data set to be produced, the associated metadata, the experimental design, pilot data, data standards, security, quality-control procedures, expected timelines, data-release mechanisms and contact details for lead investigators. If data producers request a protected time period to allow them to be the first to publish the data set, this should be limited to global analyses of the data and ideally expire within one year.

If the citable statement is a 'marker paper', it should be subjected to peer review and published in a scientific journal. Alternatively, other citable sources, such as digital object identifiers pointing to specific pages on well-maintained funding-agency or institutional websites, could also be used. Data producers benefit from creating a citable reference, as it can later be used to reflect the impact of the data sets8.

In turn, data users should carefully read the source information, including any marker papers, associated with a released data set. Data analysts should pay particular attention to any caveats about data quality, because rapidly released data are often unstable, in that they may not yet have been subjected to full quality control and so may change. It would be prudent for data analysts to assess the benefits and potential problems in immediately analysing released data. They should communicate with data producers to clarify issues of data quality in relation to the intended analyses, whenever possible. In addition, data users should be aware that some data sets are associated with version numbers: the appropriate version number should be tracked and then provided in any published analyses of those data.

Resulting papers describing studies that do not overlap with the intentions stated by the data producers in the marker paper (or other citable source) may be submitted for publication at any time, but they must appropriately cite the data source. Papers describing studies that do overlap with the data producer's proposed analyses should be handled carefully and respectfully, ideally including a dialogue with the data producer to see whether a mutually agreeable publication schedule (such as co-publication or inclusion within a set of companion papers) can be developed. In this regard, it is important for data users to realize that, historically, many such dialogues have led to coordinated publications and to new scientific insights. Despite the best intentions of all parties, on occasion a researcher may publish the results of analyses that overlap with the planned studies of the data producer. Although such instances are hopefully rare if good communication protocols are followed, they should be viewed as a small risk to the data producers, one that comes with the much greater overall benefit of early data release.

Sharing data about human subjects

Data about human subjects participating in genetic and epidemiological research require particularly careful consideration, owing to privacy-protection issues and the potential harms that could arise from misuse. These issues are critical to all databases housing information about human subjects, whether or not they contain prepublication data. For aggregated data that cannot be used to identify individuals, databases are open access, but for clinical and genomic data that are associated with a unique, but not directly identifiable, individual, access may be restricted. Under such conditions, arguments can be made for the release of data from studies involving human subjects, as doing so can augment the opportunities for new discoveries that could ultimately benefit individuals, communities and society at large. For these reasons, it is important to develop and implement robust governance models and procedures for human-subjects data early in a project. Lessons can probably be learned from data policies adopted by several genomics projects9 that generate human-subject data.

Editors and reviewers

As reviewers of manuscripts submitted for publication, scientists should be mindful that prepublication data sets are likely to have been released before extensive quality control is performed, and any unnoticed errors may cause problems in the analyses performed by third parties. Where the use of prepublication data is limited or not crucial to a study's conclusions, reviewers should expect only the normal scientific practice of clear citation and interpretation. However, when the main conclusions of a study rely on a prepublication data set, reviewers should be satisfied that the quality of the data is described and taken into account in the analysis.

Participants at the Toronto meeting recommended that journals play an active part in the dialogue about rapid prepublication data release (both in their formal guides to authors and in informal instructions to reviewers). Journal editors should remind reviewers that large-scale data sets may be subject to specific policies regarding how to cite and use the data. Ultimately, journal editors must rely on their reviewers' recommendations for reaching decisions about publication. However, encouraging reviewers to carefully check the conditions for using data that authors have not created themselves can help to raise both the quality of analysis and the fairness of citation in published studies.

Conclusion

The rapid prepublication release of sequencing data has served the field of genomics well. The Toronto meeting participants acknowledged that policies for prepublication release of data need to evolve with the changing research landscape, that there is a range of opinion in the scientific community, and that actual community behaviour (as opposed to intentions) needs to be reviewed on a regular basis. To this end, we encourage readers to join the debate over data-sharing principles and practice in an online forum hosted at http://tinyurl.com/lqxpg3. ■

Toronto International Data Release Workshop Authors. A complete list of the authors and their affiliations accompanies this article online.
e-mail: birney@ebi.ac.uk; tom.hudson@oicr.on.ca

1. Committee on Responsibilities of Authorship in the Biological Sciences, National Research Council. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences (National Academy of Sciences, 2003).
2. International Human Genome Sequencing Consortium. Nature 409, 860–921 (2001).
3. Summary of Principles Agreed at the First International Strategy Meeting on Human Genome Sequencing, Bermuda, 25–28 February 1996 (HUGO, 1996); available at www.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml
4. Summary of the Report of the Second International Strategy Meeting on Human Genome Sequencing, Bermuda, 27 February–2 March 1997 (HUGO, 1997); available at www.ornl.org/sci/techresources/Human_Genome/research/bermuda.shtml#2
5. Guyer, M. Genome Res. 8, 413 (1998).
6. Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility (Wellcome Trust, 2003); available at www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtd003207.pdf
7. Rodriguez, H. et al. J. Proteome Res. 8, 3689–3692 (2009).
8. Nature Biotechnol. 27, 579 (2009).
9. Kaye, J., Heeney, C., Hawkins, N., de Vries, J. & Boddington, P. Nature Rev. Genet. 10, 331–335 (2009).

Join the discussion at http://tinyurl.com/lqxpg3
See online special at http://tinyurl.com/dataspecial

© 2009 Macmillan Publishers Limited. All rights reserved