
BME 1450 Term Paper
Scott Young 994257362
Sharing Data: The Challenges Created by
Systems Biology
Scott J. Young, Student, BME 1450

Abstract—Systems biology is a field of research that creates
multi-layered models in order to understand biological systems.
Due to the large datasets that result from this approach, sharing
data is more difficult than in previous biological research.
Beginning with the Human Genome Project, databases have been
an integral part of data sharing in systems biology. Databases are
complex to create and maintain due to differences in data content,
naming conventions, data consistency, and methods of data
access. Sharing data is further complicated by issues in systems
biology such as competition, interdisciplinary research, private-sector research, intellectual property, and irregular funding for
databases. In spite of these challenges, there are many indications
that data sharing is alive and well in systems biology.
Index Terms—Systems biology, data sharing, database,
genomics
I. INTRODUCTION
Providing open access to experimental data is one of the
most challenging, yet elemental, parts of systems biology.
This paper discusses some of the challenges involved in the
process.
Systems Biology is a discipline that aims to integrate all
molecular elements of a biological system into a model that
can be used to explain the emergent behaviours of the system
[1]. Information used to define models often includes genes of
the system, factors involved in transcription and translation,
interactions among proteins and other molecules, as well as
environmental conditions and interactions at larger scales. As
a result, the amount of information used as system parameters,
and the amount generated, can be very large. For example,
characterization of a recent model of a single metabolic
process in yeast (a single-celled organism) required ~10^5
measurements [1]. Models involving more complex processes
or more cells can easily grow by many orders of magnitude.
The open sharing of scientific data and materials has been a
foundation of all science for a long time. There are several
advantages that result from sharing of data [2]. First, freely
available data promote open scientific dialogue and inquiry. A
researcher’s claims can be confirmed or refuted by consulting
his/her experimental data. Second, additional analyses of the
data can be performed, perhaps with different methods. This
additional work may uncover information that was ignored in
previous work. Third, a body of data serves as a launching
point for new methods or new areas of research. Fourth, this
data can provide an excellent teaching resource, allowing
students to walk through the steps of previous discovery.
Finally, publicly-available data constitutes a much larger
dataset than a single researcher could produce on his/her own.

Manuscript received November 1, 2004.
Scott J. Young is with the Institute of Biomaterials and Biomedical
Engineering, University of Toronto, Toronto, ON, Canada (phone: 416-223-0463; e-mail: scott.young@rogers.com).

Systems biology has seen a shift in methods used to share
data. Prior to systems biology, sharing data in biological
research was often easier. Data sets and experimental
conditions could be adequately explained in the body of a
paper or an appendix at publication. Anyone who wished to
use the data could get a copy of the paper, transcribe the data,
and use it freely. Systems biology has been accompanied by
larger data sets that cannot easily be contained within a
publication. Further, the widespread demand for some systems
biology data, such as genome sequences, has made acquisition
through traditional methods inefficient. As a result, publicly
available databases of information have been developed to
disseminate data for systems biology [2].
In addition to a shift in methods, there has also been a shift
in the pressures placed on sharing data. These pressures
include the competitiveness of research in life sciences, the
growth of private-sector research, the increasing awareness of
intellectual property, and funding for databases. These
pressures have forced the systems biology community to
reexamine its approach to sharing data.
II. DATABASES IN SYSTEMS BIOLOGY
A. Growth from Human Genome Project
Databases and views on rapid data sharing developed as part
of the Human Genome Project (HGP) laid the foundation for
further database development in systems biology. Release of
data in a database form prior to publication was one of the
operating principles of the HGP [3]. Initially, the HGP
released sequence data within six months of production. This
period was reduced to 24 hours in 1996 with the adoption of
the Bermuda Principles. These principles have been endorsed
as a basis for data sharing in all public sequencing projects
since that time.
Systems biology databases have increased in size, number,
and variety since the HGP. At the same time as researchers
have realized the utility of web-based database platforms for
accessing systems biology data, the technology to perform this
function has become much more easily available. As a result,
there has been a large growth of publicly-accessible systems
biology databases on the web. An annual survey in Nucleic
Acids Research [4] has plotted the growing number of databases
(Fig. 1). These databases provide data at all levels of systems
biology, from genome information, through gene and protein
annotation and structure, to pathway and molecule
interactions. Further, they provide information on virtually
any organism being studied by systems biologists.
[Figure 1: plot of Number of Databases (axis range 0 to 600) against Year (1996 to 2004).]
Fig. 1. Databases listed by Nucleic Acids Research [4]. The number has
increased from 60 in 1996 (the first year of their listing) to 548 in 2004. This
total does not include databases with restrictions on data access.
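As a rough illustration of the growth summarized in Fig. 1, the two endpoint counts in the caption imply an average compound growth rate. The Python sketch below computes it. It assumes steady year-over-year compounding, which is a simplification introduced here and not a claim made by the survey; the function name is illustrative only.

```python
# Endpoint counts taken from the Fig. 1 caption.
first_year, first_count = 1996, 60
last_year, last_count = 2004, 548


def compound_annual_growth(start, end, years):
    """Average yearly growth rate, assuming steady compounding (a simplification)."""
    return (end / start) ** (1.0 / years) - 1.0


rate = compound_annual_growth(first_count, last_count, last_year - first_year)
print(f"Average growth: {rate:.1%} per year")
```

Under this assumption the listing grew by something on the order of 30% per year, although the actual year-to-year increases need not have been uniform.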
B. Purpose of databases
Systems biology databases serve three major purposes. First,
a database provides a standard location for researchers in an
area to submit pre-publication data as well as data to
substantiate claims made in published research. Second, the
database provides a place to get baseline data required in
systems biology models. Information for models, such as
genes and molecule interactions, can be acquired from a
database instead of created in the laboratory, thereby saving
time and resources. Finally, the databases provide a method to
compare results within the field. This allows researchers to
develop more complex hypotheses and identify problems with
their own research or that of a colleague.
III. TECHNICAL CHALLENGES
A. Overview
Although a central repository of scientific data sounds like a
simple concept, the reality is rarely simple. Creating and
maintaining a database that serves all of its users’ needs can be
difficult and expensive. Some of the largest hurdles are
outlined below.
B. Content
The most fundamental consideration is what must be stored
in a database. This depends upon the research area as well as
the needs of the users. One major issue in this area is the
amount of metadata, or surrounding data, that must be
recorded with an entry. For example, whereas genome data
can be stored as a simple series of letters, microarray data
requires all experimental conditions for the experiment, and
functional magnetic resonance imaging (fMRI) data requires
all experimental conditions as well as calibration and
calculation data [2, 5]. In some research areas, debate on the
content of a database has held back implementation for years
[6].
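The difference in metadata requirements described above can be sketched as a small data structure. The following Python example is a minimal illustration, not the schema of any real database: the class name, field names, and the table of required metadata are all hypothetical and only loosely follow the three data types mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical database entry: raw data plus the metadata needed to interpret it."""
    data_type: str
    payload: str
    metadata: dict = field(default_factory=dict)


# Hypothetical minimum metadata per data type, loosely following the text.
REQUIRED_METADATA = {
    "genome": set(),
    "microarray": {"experimental_conditions"},
    "fmri": {"experimental_conditions", "calibration", "calculation"},
}


def missing_metadata(entry):
    """Return the required metadata fields that the entry does not supply."""
    required = REQUIRED_METADATA.get(entry.data_type, set())
    return sorted(required - set(entry.metadata))


seq = Submission("genome", "ACGT")
scan = Submission("fmri", "<image data>", {"experimental_conditions": "visual task"})
print(missing_metadata(seq))
print(missing_metadata(scan))
```

In this sketch the genome entry is complete with no metadata at all, while the fMRI entry is flagged as incomplete until calibration and calculation details are supplied.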
C. Ontology and Data Format
Once content has been determined, the ontology and
data format can be established. Past efforts have recognized
the need to have both a consistent data format as well as
enough flexibility to accommodate future changes in the field.
An ontology is a standardized set of names and concepts used
to classify objects. Ontologies are particularly important in
biology because many areas have conflicting and confusing
naming conventions [5]. An example is the Gene Ontology,
an open-source ontology maintained by the Gene Ontology
Consortium and used in many systems biology databases [7]. In addition to the ontology, the
data format is an important consideration. Many databases
have used customizable data formats such as XML.
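To make the roles of an ontology and a data format concrete, the Python sketch below maps several free-text names onto one standardized term and writes the result as an XML record using the standard library. The synonym table, the term identifier, the gene name, and the XML element names are all invented for illustration; they are not taken from the Gene Ontology or from any database cited here.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table: lab-specific names mapped to one standard term.
# The identifier is made up for illustration and is not a real ontology ID.
SYNONYMS = {
    "sugar breakdown": "EX:0001",
    "glycolytic pathway": "EX:0001",
    "glycolysis": "EX:0001",
}


def to_xml(gene, process_name):
    """Build an XML record that carries the standard term ID alongside the submitted name."""
    term_id = SYNONYMS[process_name.lower()]
    record = ET.Element("annotation")
    ET.SubElement(record, "gene").text = gene
    ET.SubElement(record, "term", id=term_id).text = process_name
    return ET.tostring(record, encoding="unicode")


print(to_xml("geneA", "Glycolysis"))
print(to_xml("geneA", "Sugar breakdown"))
```

Both records carry the same term identifier even though the submitted names differ, which is what allows entries from different laboratories to be compared.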
D. Data Consistency
The most important issue when data is stored and maintained
in a database is consistency. This is important because this
data will be compared with other entries in the database and
used as baseline data in other researchers’ models. Ensuring
consistency can be difficult because the curator of the database
relies on the submitter to submit all the relevant data. In some
areas, such as microarray and fMRI data, differences in brands
of equipment or analysis methods can make comparison of
different results very difficult [5, 8]. As a result, standardized
samples may need to be used as controls for submitted data.
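One simple way a standardized control sample can help is to express every reading as a ratio to the control measured on the same equipment. The Python sketch below illustrates only this idea, with invented readings and hypothetical names. Real microarray or fMRI normalization is considerably more involved, and the sources cited here do not prescribe this particular method.

```python
def normalize_to_control(measurements, control_key="control"):
    """Express each measurement as a ratio to the shared control sample."""
    control = measurements[control_key]
    if control == 0:
        raise ValueError("control reading must be non-zero")
    return {
        name: value / control
        for name, value in measurements.items()
        if name != control_key
    }


# Hypothetical readings of the same two samples on two different platforms.
platform_a = {"control": 200.0, "sample1": 400.0, "sample2": 100.0}
platform_b = {"control": 50.0, "sample1": 100.0, "sample2": 25.0}

print(normalize_to_control(platform_a))
print(normalize_to_control(platform_b))
```

Although the raw readings differ by a factor of four between the two platforms, the ratios to the control agree, so the normalized entries can be compared directly.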
E. Data Access
Each database uses tools that allow users to access and/or
analyze the information in the database. The types of tools
offered depend upon the type of database and how users want
to use the data. As a result, this portion of the database may be
in continual development. Increasingly, tools allow users to
integrate data from multiple databases and perform more
complex analyses from a single webpage. Some tools also
integrate published research results with data in the database
[9].
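The integration of data from multiple databases can be pictured as merging records that share an identifier. The following Python sketch assumes two hypothetical sources keyed by the same gene identifier; the source names, gene names, and fields are invented, and real tools must also reconcile differing identifiers and formats.

```python
# Two hypothetical sources keyed by the same gene identifier.
sequence_db = {"geneA": {"length_bp": 1200}, "geneB": {"length_bp": 850}}
interaction_db = {"geneA": {"partners": ["geneB"]}, "geneC": {"partners": []}}


def integrate(*sources):
    """Merge per-gene records from several sources into one view per gene."""
    merged = {}
    for source in sources:
        for gene_id, fields in source.items():
            merged.setdefault(gene_id, {}).update(fields)
    return merged


combined = integrate(sequence_db, interaction_db)
print(combined["geneA"])
```

A gene that appears in only one source still appears in the combined view, with only the fields that source provides.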
IV. STRUCTURAL CHALLENGES
A. Overview
In addition to the technical issues, many issues related to the
structure of the research community have affected the way data
is shared in systems biology. These issues include competition
for resources; the interdisciplinary nature of research within
the community; the rise of private-sector research; intellectual
property awareness; and funding for databases.
B. Competition
Competition for scarce resources and credit has always
been an impediment to data sharing [10], and systems biology
is no different. Although most researchers accept the benefits
of sharing data, some will avoid submitting data to a database
to ensure that other laboratories cannot publish an analysis of
the data. In neuroscience, for example, reluctance to share
data has been a factor in delaying a common framework for
years [6].
C. Interdisciplinary Research
Due to the interdisciplinary nature of systems biology,
researchers may feel that they are competing not only with
colleagues in their immediate field, but also with researchers in
related fields. One particularly acrimonious area has been
between genetic sequencers and bioinformaticists.
Laboratories that produce a genetic sequence often wish to be
the first to publish a global analysis. Performing the
sequencing, however, keeps them too busy to perform this task
until the sequence is complete. As a result, bioinformaticists may
publish a full analysis based on draft sequence information
published on the web. Although this practice does not
contravene any guidelines of the sequencing database, the
sequencer often feels that this is not proper scientific courtesy.
As a result, several sequencers have either removed data
from the web or delayed submitting it to a
database to prevent this from happening [11].
D. Private Sector Research
Another challenge in data sharing is in dealing with private
sector biological research. Up to 60% of genomics research is
currently being performed in the private sector [12]. Because
publishing in scientific journals carries the duty of sharing
data, these companies typically do not publish their results.
They opt instead for patenting the genes or treatments resulting
from their work.
Many researchers feel that the information created from
private sector research could be very useful for related
research. As a result, advocates have tried to develop
incentives to bring this information into the public domain.
For example, a timer-type arrangement has been suggested,
whereby the data would be inaccessible for a defined period of
time following a publication [12].
E. Intellectual Property
Intellectual property (IP) concerns have impacted data
sharing in two ways. First, material use agreements and other
paperwork involved with IP make sharing materials and data
much more onerous for some researchers. As a result, some
laboratories will not share materials or data due to the time
required for IP paperwork [13]. Second, the legal framework
for databases is not uniform throughout the world. Therefore,
some databases have resorted to using passwords or other
methods to protect their information. This can hamper data
sharing by making it much more difficult for a researcher to
access data [14].
F. Funding
Funding can be a major challenge when creating and
maintaining data sharing facilities. Costs for databases are
high both initially and on an ongoing basis, due to the staff, software, and
equipment required to curate and maintain the database. In
addition, funding agencies rarely find the operating costs of a
database an attractive item to support. As a result, many
databases obtain funding by getting small amounts from many
different sources. Subscription services have also been
attempted, but with little success in systems biology. These
typically fail because the number of users is small, the
project is constantly developing, and users require a wide
range of data access options. As a result, costs are high, and
returns are risky. To ensure reliable operation, some
individuals have called for a new approach for funding data
sharing projects [2].
V. STATE OF DATA SHARING
When researching this paper, the author kept thinking of the
same question: what is the state of data sharing, and how do
you measure it? Although no concise answer for the second
part was ever found, the first part of the question may be
answered with the following information.
A recent report by the National Academies [15] in the US
looked into the state of data sharing and responsibilities of
authorship within the life sciences. They found that there did
exist within the life sciences community a commonly-held
belief that the author of scientific research and all other
members of the community are responsible for providing data
consistent with the goal of moving science forward. They
suggested several principles and recommendations but did not
state that they felt data sharing was seriously threatened.
Information on two of the most popular databases shows that
they are very active. GenBank contained 31 million sequences
and 37 billion base pairs as of October 2004. The European
Bioinformatics Institute database receives more than a million
hits daily, and provided more than 25 Terabytes of data to
users in the first six months of 2004 [2].
Finally, when researching this paper, the author had no
difficulty accessing and getting information from several
databases. In fact, he was amazed at the amount of data that
he was able to get without even knowing what to do with it.
Although the author is very new to this field, it appears that
much data is freely available.
That brings up the last point of this paper. There is a strong
possibility that systems biology is not limited by the sharing of
data, but by abilities and resources to analyze and make sense
of the data [16]. It is important for this data to be available,
but it is also very important to ensure all efforts are being
made to develop understanding from this data.
REFERENCES
[1] L. Hood, J. R. Heath, M. E. Phelps, and B. Lin, "Systems biology and new technologies enable predictive and preventative medicine," Science, vol. 306, pp. 640-643, 2004.
[2] C. A. Ball, G. Sherlock, and A. Brazma, "Funding high-throughput data sharing," Nature Biotechnology, vol. 22, pp. 1179-1183, 2004.
[3] F. S. Collins, M. Morgan, and A. Patrinos, "The human genome project: lessons from large-scale biology," Science, vol. 300, pp. 286-290, 2003.
[4] Database Issue. Nucleic Acids Research. 24-31 (January Issue, 1996-2003).
[5] M. Chicurel, "Databasing the brain," Nature, vol. 406, pp. 822-825, 2000.
[6] S. H. Koslow, "Sharing primary data: a threat or asset to discovery?," Nature Reviews Neuroscience, vol. 3, pp. 311-313, 2002.
[7] Gene Ontology Consortium, "The Gene Ontology (GO) database and informatics resource," Nucleic Acids Research, vol. 32, pp. D258-D261, 2004.
[8] A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, "One-stop shop for microarray data," Nature, vol. 403, pp. 699-700, 2000.
[9] J. LaBaer, "Mining the literature and large datasets," Nature Biotechnology, vol. 21, pp. 976-977, 2003.
[10] S. E. Fienberg, M. E. Martin, and M. L. Straf, "Sharing Research Data," National Research Council, Washington, D.C., 1985.
[11] L. Roberts, "A tussle over the rules for DNA data sharing," Science, vol. 298, pp. 1312-1313, 2002.
[12] A. Patrinos and D. Drell, "The times they are a-changin'," Nature, vol. 417, pp. 589-590, 2002.
[13] D. Adam, "Progress in human genetics hindered by reluctance to share," Nature, vol. 415, p. 462, 2002.
[14] D. Greenbaum and M. Gerstein, "A universal legal framework as a prerequisite for database interoperability," Nature Biotechnology, vol. 21, pp. 979-982, 2003.
[15] National Academies, "Sharing publication-related data and materials: responsibilities of authorship in the life sciences," National Research Council of The National Academies, Washington, D.C., 2003.
[16] S. Brenner, "Life sentences," The Scientist, vol. 16, p. 12, 2002.