BME 1450 Term Paper, Scott Young 994257362

Sharing Data: The Challenges Created by Systems Biology

Scott J. Young, Student, BME 1450

Abstract: Systems biology is a field of research that creates multi-layered models in order to understand biological systems. Because this approach produces large datasets, sharing data is more difficult than in previous biological research. Beginning with the Human Genome Project, databases have been an integral part of data sharing in systems biology. Databases are complex to create and maintain due to differences in data content, naming conventions, data consistency, and methods of data access. Sharing data is further complicated by issues in systems biology such as competition, interdisciplinary research, private-sector research, intellectual property, and irregular funding for databases. In spite of these challenges, there are many indications that data sharing is alive and well in systems biology.

Index Terms: Systems biology, data sharing, database, genomics

I. INTRODUCTION

Providing open access to experimental data is one of the most challenging, yet elemental, parts of systems biology. This paper discusses some of the challenges involved in the process.

Systems biology is a discipline that aims to integrate all molecular elements of a biological system into a model that can be used to explain the emergent behaviours of the system [1]. Information used to define models often includes the genes of the system, factors involved in transcription and translation, interactions among proteins and other molecules, as well as environmental conditions and interactions at larger scales. As a result, the amount of information used as system parameters and generated by the model can be very large. For example, characterization of a recent model of a single metabolic process in yeast (a single-celled organism) required ~10^5 measurements [1]. Models involving more complex processes, or more cells, can easily grow by many orders of magnitude.
The open sharing of scientific data and materials has long been a foundation of science. There are several advantages that result from sharing data [2]. First, freely available data promote open scientific dialogue and inquiry. A researcher’s claims can be confirmed or refuted by consulting his/her experimental data. Second, additional analyses of the data can be performed, perhaps with different methods. This additional work may uncover information that was ignored in previous work. Third, a body of data serves as a launching point for new methods or new areas of research. Fourth, the data can provide an excellent teaching resource, allowing students to walk through the steps of previous discoveries. Finally, publicly available data constitute a much larger dataset than a single researcher could produce on his/her own.

Manuscript received November 1, 2004. Scott J. Young is with the Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, ON, Canada (phone: 416-223-0463; e-mail: scott.young@rogers.com).

Systems biology has seen a shift in the methods used to share data. Prior to systems biology, sharing data in biological research was often easier. Data sets and experimental conditions could be adequately explained in the body of a paper or in an appendix at publication. Anyone who wished to use the data could get a copy of the paper, transcribe the data, and use it freely. Systems biology has been accompanied by larger data sets that cannot easily be contained within a publication. Further, the widespread demand for some systems biology data, such as genome sequences, has made acquisition through traditional methods inefficient. As a result, publicly available databases have been developed to disseminate data for systems biology [2].

In addition to a shift in methods, there has also been a shift in the pressures placed on sharing data.
These pressures include the competitiveness of research in the life sciences, the growth of private-sector research, the increasing awareness of intellectual property, and funding for databases. These pressures have forced the systems biology community to reexamine its approach to sharing data.

II. DATABASES IN SYSTEMS BIOLOGY

A. Growth from Human Genome Project

Databases and views on rapid data sharing developed as part of the Human Genome Project (HGP) laid the foundation for further database development in systems biology. Release of data in database form prior to publication was one of the operating principles of the HGP [3]. Initially, the HGP released sequence data within six months of production. This period was reduced to 24 hours in 1996 with the adoption of the Bermuda Principles. These principles have been endorsed as a basis for data sharing in all public sequencing projects since that time.

Systems biology databases have increased in size, number, and variety since the HGP. As researchers have realized the utility of web-based database platforms for accessing systems biology data, the technology to build such platforms has become much more easily available. As a result, there has been a large growth of publicly accessible systems biology databases on the web. An annual survey in Nucleic Acids Research [4] has plotted the growing number of databases (Fig. 1). These databases provide data at all levels of systems biology, from genome information, through gene and protein annotation and structure, to pathway and molecule interactions. Further, they provide information on virtually any organism being studied by systems biologists.

[Figure 1: plot of Number of Databases (vertical axis, 0 to 600) against Year (horizontal axis, 1996 to 2004).]

Fig. 1. Databases listed by Nucleic Acids Research [4]. The number has increased from 60 in 1996 (the first year of their listing) to 548 in 2004.
This total does not include databases with restrictions on data access.

B. Purpose of Databases

Systems biology databases serve three major purposes. First, a database provides a standard location for researchers in an area to submit pre-publication data as well as data to substantiate claims made in published research. Second, a database provides a place to get baseline data required in systems biology models. Information for models, such as genes and molecule interactions, can be acquired from a database instead of being created in the laboratory, thereby saving time and resources. Finally, databases provide a way to compare results within the field. This allows researchers to develop more complex hypotheses and to identify problems with their own research or that of a colleague.

III. TECHNICAL CHALLENGES

A. Overview

Although a central repository of scientific data sounds like a simple concept, it is rarely simple in practice. Creating and maintaining a database that serves all of its users’ needs can be difficult and expensive. Some of the largest hurdles are outlined below.

B. Content

The most fundamental consideration is what must be stored in a database. This depends upon the research area as well as the needs of the users. One major issue in this area is the amount of metadata, or surrounding data, that must be recorded with an entry. For example, whereas genome data can be stored as a simple series of letters, microarray data requires all experimental conditions for the experiment, and functional magnetic resonance imaging (fMRI) data requires all experimental conditions as well as calibration and calculation data [2, 5]. In some research areas, debate on the content of a database has held back implementation for years [6].

C. Ontology and Data Format

Once content has been determined, the ontology and data format can be established.
Past efforts have recognized the need for both a consistent data format and enough flexibility to accommodate future changes in the field. An ontology is a standardized set of names and concepts used to classify objects. Ontologies are particularly important in biology because many areas have conflicting and confusing naming conventions [5]. An example is the ontology maintained by the Gene Ontology Consortium, an open-source ontology that is used in many systems biology databases [7]. In addition to the ontology, the data format is an important consideration. Many databases have used extensible data formats such as XML.

D. Data Consistency

The most important issue when data is stored and maintained in a database is consistency. Consistency matters because the data will be compared with other entries in the database and used as baseline data in other researchers’ models. Ensuring consistency can be difficult because the curator of the database relies on the submitter to supply all the relevant data. In some areas, such as microarray and fMRI data, differences in brands of equipment or analysis methods can make comparison of different results very difficult [5, 8]. As a result, standardized samples may need to be used as controls for submitted data.

E. Data Access

Each database uses tools that allow users to access and/or analyze the information in the database. The types of tools offered depend upon the type of database and how users want to use the data. As a result, this portion of the database may be in continual development. Increasingly, tools allow users to integrate data from multiple databases and perform more complex analyses from a single webpage. Some tools also integrate published research results with data in the database [9].

IV. STRUCTURAL CHALLENGES

A. Overview

In addition to the technical issues, many issues related to the structure of the research community have affected the way data is shared in systems biology.
These issues include competition for resources, the interdisciplinary nature of research within the community, the rise of private-sector research, intellectual property awareness, and funding for databases.

B. Competition

Competition for scarce resources and credit has always been an impediment to data sharing [10], and systems biology is no different. Although most researchers accept the benefits of sharing data, some will avoid submitting data to a database to ensure that other laboratories cannot publish an analysis of the data. In neuroscience, for example, reluctance to share data has been a factor in delaying a common framework for years [6].

C. Interdisciplinary Research

Due to the interdisciplinary nature of systems biology, researchers may feel that they are competing not only with colleagues in their immediate field, but also with researchers in related fields. One particularly acrimonious relationship has been that between sequencing laboratories and bioinformaticists. Laboratories that produce a genetic sequence often wish to be the first to publish a global analysis. Performing the sequencing, however, keeps them too busy to perform this analysis until the sequence is complete. As a result, bioinformaticists may publish a full analysis based on draft sequence information published on the web. Although this practice does not contravene any guidelines of the sequencing database, the sequencer often feels that it is not proper scientific courtesy. As a result, several sequencers have either removed data from the web or delayed submitting it to a database to prevent this from happening [11].

D. Private Sector Research

Another challenge in data sharing is dealing with private-sector biological research. Up to 60% of genomics research is currently being performed in the private sector [12]. Because publishing in scientific journals carries the duty of sharing data, these companies typically do not publish their results. They opt instead to patent the genes or treatments resulting from their work. Many researchers feel that the information created by private-sector research could be very useful for related research. As a result, advocates have tried to develop incentives to bring this information into the public domain. For example, a timer-type arrangement has been suggested, whereby the data would be inaccessible for a defined period of time following a publication [12].

E. Intellectual Property

Intellectual property (IP) concerns have affected data sharing in two ways. First, material use agreements and other paperwork involved with IP make sharing materials and data much more onerous for some researchers. As a result, some laboratories will not share materials or data due to the time required for IP paperwork [13]. Second, the legal framework for databases is not uniform throughout the world. Therefore, some databases have resorted to using passwords or other methods to protect their information. This can hamper data sharing by making it much more difficult for a researcher to access data [14].

F. Funding

Funding can be a major challenge when creating and maintaining data sharing facilities. Costs for databases are high both initially and on an ongoing basis, due to the staff, software, and equipment required to curate and maintain the database. As well, operating costs for a database are rarely an attractive item for a funding agency to support. As a result, many databases obtain funding by collecting small amounts from many different sources. Subscription services have also been attempted, but with little success in systems biology. These typically fail because the number of users is small, the project is constantly developing, and users require a wide range of data access options. As a result, costs are high and returns are risky. To ensure reliable operation, some individuals have called for a new approach to funding data sharing projects [2].

V. STATE OF DATA SHARING

When researching this paper, the author kept returning to the same question: what is the state of data sharing, and how do you measure it? Although no concise answer to the second part was found, the first part of the question may be answered with the following information.

A recent report by the National Academies [15] in the US looked into the state of data sharing and the responsibilities of authorship within the life sciences. The report found a commonly held belief within the life sciences community that the author of scientific research and all other members of the community are responsible for providing data consistent with the goal of moving science forward. It suggested several principles and recommendations but did not state that data sharing was seriously threatened.

Information on two of the most popular databases shows that they are very active. GenBank contained 31 million sequences and 37 billion base pairs as of October 2004. The European Bioinformatics Institute database receives more than a million hits daily, and provided more than 25 terabytes of data to users in the first six months of 2004 [2].

Finally, when researching this paper, the author had no difficulty accessing and retrieving information from several databases. In fact, he was amazed at the amount of data that he was able to get without even knowing what to do with it. Although the author is very new to this field, it appears that much data is freely available.

That brings up the last point of this paper. There is a strong possibility that systems biology is limited not by the sharing of data, but by the abilities and resources needed to analyze and make sense of the data [16]. It is important for this data to be available, but it is also very important to ensure that all efforts are being made to develop understanding from this data.
REFERENCES

[1] L. Hood, J. R. Heath, M. E. Phelps, and B. Lin, "Systems biology and new technologies enable predictive and preventative medicine," Science, vol. 306, pp. 640-643, 2004.
[2] C. A. Ball, G. Sherlock, and A. Brazma, "Funding high-throughput data sharing," Nature Biotechnology, vol. 22, pp. 1179-1183, 2004.
[3] F. S. Collins, M. Morgan, and A. Patrinos, "The human genome project: lessons from large-scale biology," Science, vol. 300, pp. 286-290, 2003.
[4] Database Issue, Nucleic Acids Research, vols. 24-31 (January issue, 1996-2003).
[5] M. Chicurel, "Databasing the brain," Nature, vol. 406, pp. 822-825, 2000.
[6] S. H. Koslow, "Sharing primary data: a threat or asset to discovery?," Nature Reviews Neuroscience, vol. 3, pp. 311-313, 2002.
[7] Gene Ontology Consortium, "The Gene Ontology (GO) database and informatics resource," Nucleic Acids Research, vol. 32, pp. D258-D261, 2004.
[8] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, "One-stop shop for microarray data," Nature, vol. 403, pp. 699-700, 2000.
[9] J. LaBaer, "Mining the literature and large datasets," Nature Biotechnology, vol. 21, pp. 976-977, 2003.
[10] S. E. Fienberg, M. E. Martin, and M. L. Straf, "Sharing Research Data," National Research Council, Washington, D.C., 1985.
[11] L. Roberts, "A tussle over the rules for DNA data sharing," Science, vol. 298, pp. 1312-1313, 2002.
[12] A. Patrinos and D. Drell, "The times they are a-changin'," Nature, vol. 417, pp. 589-590, 2002.
[13] D. Adam, "Progress in human genetics hindered by reluctance to share," Nature, vol. 415, p. 462, 2002.
[14] D. Greenbaum and M. Gerstein, "A universal legal framework as a prerequisite for database interoperability," Nature Biotechnology, vol. 21, pp. 979-982, 2003.
[15] National Academies, "Sharing publication-related data and materials: responsibilities of authorship in the life sciences," National Research Council of The National Academies, Washington, D.C., 2003.
[16] S. Brenner, "Life sentences," The Scientist, vol. 16, p. 12, 2002.