Evaluation of Self-Self Hybridization as an Alternative Experiment

Omics Data Integration – A Systems Approach View
Hong Fang1, Roger Perkins1 and Weida Tong2
1 Division of Bioinformatics, Z-Tech Corporation, 3900 NCTR Road, Jefferson, AR 72079
2 Division of Systems Toxicology, National Center for Toxicological Research (NCTR), Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079
Email addresses:
Hong Fang - hong.fang@fda.hhs.gov
Roger Perkins - roger.perkins@fda.hhs.gov
Weida Tong - weida.tong@fda.hhs.gov
The views presented in this article do not necessarily reflect those of the US Food
and Drug Administration.
Introduction
Genomics, proteomics and metabolomics, collectively called omics or
panomics, have been increasingly contributing to drug discovery and development
processes. Each type of omics data conveys a different type of information and
biological context that depends on the particular experimental system. Nevertheless,
for a given biological system, the different data are inextricably linked. When
different data are integrated, knowledge synergies may emerge providing a more
holistic understanding of the system and its affected mechanisms. Convergence of
multiple data sources can provide additional confirmation of validity and broader
understanding of linkages across levels of biological organization. Looking ahead, the
process of linking genomic, proteomic and metabolomic data based on a network-like
understanding of the salient mechanisms enters the realm of systems biology, wherein
the dynamic response of a biological system (cells, organs, or entire organisms) might
be simulated across physiologic, developmental, or evolutionary time.
Genes, the proteins they encode, or the metabolites produced in a perturbed
system may provide a quantitative biomarker of disease or toxicity. However
necessary, such biomarkers alone are insufficient for a systems-level understanding of
relevant biology. Data integration is needed to determine how genes, proteins,
transcription factors, and other molecules act in concert through linked pathways and associated
molecular mechanisms. A flexible and integrated bioinformatics environment for
combining and interrogating omics information is crucial to generate knowledge of
the whole system by analyzing and integrating global changes in gene expression,
protein expression and interaction, and metabolite profiles.
The rapid advancement and adoption of the omics technology in drug discovery and
development has resulted in both opportunities and challenges for the regulatory
agencies. The FDA Critical Path initiative identifies pharmacogenomics (PGx), a subdiscipline of omics, as a major opportunity for advancing medical product
development and personalized medicine
(http://www.fda.gov/oc/initiatives/criticalpath/). To stimulate and facilitate a scientific
effort to promote PGx in drug development and regulation, FDA issued a guidance to
industry on PGx data submission by which a novel mechanism named the Voluntary
Genomics Data Submission (VGDS) was defined
(http://www.fda.gov/OHRMS/DOCKETS/98fr/2003d-0497-gdl0002.pdf). Through
VGDS, a sponsor is able to interact with the agency by submitting PGx data on a
voluntary basis. In the past two and a half years, many submissions have been received
[1], in which the safety or clinical benefit of a drug or biological product has often been
supported by omics technologies. The ability to manage, analyze and interpret omics
data is a major challenge being cooperatively worked on by FDA and sponsors. The
need for efficient software to enable review was identified early on, and the inevitable
need for the ability to integrate different data within that software became rapidly
apparent. The continuing development of a user-friendly and efficient software
environment to utilize multiple omics data types in demonstrating efficacy and safety
remains a high priority within FDA.
Omics data integration through systems approach
In the absence of data integration, lists of genes, proteins, and metabolic
products that are differentially expressed between sample groups are just lists
providing minimal information regarding biological cause and context. Moreover,
numerous effects such as data quality and biases, analysis bias, gene co-expression,
and systematic experimental variability can result in false inference of sample
differences, creating either false positives or false negatives (i.e., type I or type II errors).
The prospective fix for such omics data limitations is an integrative systems approach
to data interpretation. Systems biology or systems toxicology, like many other
recently introduced terminologies, is loosely defined and means different things to
different people, depending on their backgrounds. Integrating different omics data
types together with public data provides the ability to elucidate biological contexts
such as the perturbed functions, signalling pathway components, transcription-factor
mechanisms of action, gene regulatory networks, post-translational modifications, and
genetic (or SNP) cause-effect relationships, among many others. Moreover, when
different data types lead to the same hypothesis (data triangulation) with
comprehensible biological plausibility, both reliability and validity are enhanced.
In drug discovery and development, systems biology in the form of integrating
gene-level and protein-level information with metabolite-level information to identify such
biological contexts as the perturbed pathways or functions is rapidly reaching the
point of real-world utility. A review of the enormous potential of systems biology and
the concomitant research efforts that are underway is far beyond the scope of this paper
[2]. Rather, we limit our discussion to systems approaches to, and examples of,
integrating omics data that will, we expect, soon yield demonstrable benefits in drug
discovery and development as well as in regulatory activities.
Figure 1 depicts a generic work flow of a multiple-omics experiment design
and associated data integration. In a single-dose or repeat-dose toxicity study,
genomics, proteomics and metabolomics data can be collected at different time points
in a parallel fashion. A number of approaches can then be applied to gain
mechanistic understanding through integration of the multi-omics expression profiles.
For example, the metabolite data can be linked to enzymes through their enzyme
commission (EC) number. Protein analysis can be obtained through metabolic
analysis of protein expression pattern changes, which can then be linked to gene
expression profile changes. The EC number can be linked to the GenBank accession
number enabling gene expression data to be mapped onto an interactive KEGG map.
In terms of analysis of gene expression data (or protein expression data) alone,
examining the biological response to treatment permits the altered gene expression
profile to be clustered into groups of genes with similar temporal expression patterns.
These clusters can be assumed to contain similarly regulated genes forming
interconnected promoter regions that can be searched for common regulatory
sequences. The approach to putative co-regulated genes can also be extended to the
protein expression data, after which the common regulatory sequences from both gene
and protein expression analysis can be compared and correlated through integrated
analysis. The success of these approaches is largely dependent upon knowledge of
the regulatory relationships among genes, proteins, and metabolites; the more that is
known, the more effective the approaches should be.
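The EC-number linkage described above can be sketched in a few lines. This is a minimal illustration only: the two lookup tables are small hand-written assumptions rather than data from KEGG or any other resource, gene symbols stand in for GenBank accession numbers for readability, and the function name `link_metabolites_to_genes` is hypothetical.

```python
from collections import defaultdict

# Illustrative, hand-written annotation tables (assumptions for this sketch).
# In practice these mappings would be drawn from a curated pathway resource.
METABOLITE_TO_EC = {
    "glucose-6-phosphate": ["2.7.1.1", "5.3.1.9"],
    "pyruvate": ["2.7.1.40"],
}
EC_TO_GENES = {
    "2.7.1.1": ["HK1", "HK2"],
    "5.3.1.9": ["GPI"],
    "2.7.1.40": ["PKM"],
}


def link_metabolites_to_genes(metabolites, expressed_genes):
    """Link each altered metabolite to differentially expressed genes
    that encode an enzyme sharing one of the metabolite's EC numbers."""
    expressed = set(expressed_genes)
    links = defaultdict(set)
    for metabolite in metabolites:
        for ec in METABOLITE_TO_EC.get(metabolite, []):
            for gene in EC_TO_GENES.get(ec, []):
                if gene in expressed:
                    links[metabolite].add(gene)
    # Sort gene names so that the result is deterministic.
    return {m: sorted(genes) for m, genes in links.items()}


links = link_metabolites_to_genes(
    ["glucose-6-phosphate", "pyruvate", "lactate"],
    ["HK2", "GPI", "TP53"],
)
print(links)
```

Metabolites without an annotated EC number, or whose enzymes are not among the differentially expressed genes, simply drop out of the result, which mirrors the dependence on prior knowledge noted above.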
In some circumstances, however, sufficient knowledge is not yet available, in
which case an alternative is to develop a correlation network by comparing the
expression pattern for each gene, protein and metabolite across all samples in the
genomic, proteomic and metabolomic hyperspace to derive biomarkers and disease
pathways. The rationale of this approach is that induced gene expression levels
examined in the context of induced changes in protein and metabolite expression
should prove more elucidating and reliable. This approach first combines the
significant lists of genes, proteins, and metabolites into a single file. Then, the
associations and correlations of the different omics profiles are determined using
pattern recognition and statistical methods. Finally, the genes, proteins and
metabolites that have similar expression patterns across all samples are cross-linked to
produce a correlation network. Through this approach, the candidate biomarkers and
disease pathways may be derived from analysis of the network.
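The three steps above (combine, correlate, cross-link) can be sketched as follows. The text does not name a specific statistic, so the use of Pearson correlation, the 0.9 threshold, the entity labels and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_network(profiles, threshold=0.9):
    """Return edges (a, b, r) for every pair of entities whose profiles
    across all samples correlate at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Step 1: significant genes, proteins and metabolites combined into one table,
# one profile per entity, measured across the same four samples (toy values).
profiles = {
    "gene:G1": [1.0, 2.0, 3.0, 4.0],
    "gene:G2": [1.0, 3.0, 1.0, 3.0],
    "protein:P1": [2.1, 3.9, 6.2, 8.0],
    "metabolite:M1": [4.0, 3.0, 2.0, 1.0],
}

# Steps 2 and 3: correlate every pair and keep the strongly similar ones.
edges = correlation_network(profiles)
print(edges)
```

Only positive correlations are kept here, in line with "similar expression patterns"; comparing `abs(r)` against the threshold instead would also capture inverse relationships.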
Pathway analysis and network development are other beneficial strategies in
omics data integration. Use of omics technology in these two areas of research has
greatly enhanced our understanding of the complex roles played by genes, gene
products and their metabolites. Many commercial software solutions are already
available, and are continuing to evolve in parallel with advances in omics technology.
Figure 2 summarizes some of these tools and their associated number of citations
indexed in PubMed. Judging from this figure, it is not surprising that the public tools,
such as KEGG and GenMAPP, have played a prominent role in research. It is
worthwhile to note, however, that while a few tools enable multi-omics data
integration, most do not [3, 4]. For example, only a single type of expression data
(e.g., gene expression) can be processed by most of these tools. In contrast, omics
integration requires two or three different types of omics data (such as gene
expression profile, protein expression profile and metabolomics profile) to be
analyzed in an integrated manner.
ArrayTrack and Omics Data Integration
The VGDS program in FDA has implemented a bioinformatics infrastructure
for data review that enables integrative strategies for data analyses. For this purpose,
the bioinformatics platform ArrayTrack [5] has been developed and is being
continually refined as an FDA review tool that supports management, analysis and
biological interpretation of exploratory genomics, proteomics and
metabolomics data from both clinical and non-clinical experiments. ArrayTrack
undergoes constant refinements and enhancements based on feedback and expressed
needs by users at government, academic, and private institutions. ArrayTrack is also
freely available for download on the Internet
(http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/).
A single microarray experiment generates prodigious and complex data, the
management, analysis and interpretation of which constitute a formidable effort for
researchers and regulatory reviewers alike. Recognizing the need to integrate these
three essential components within a single platform, ArrayTrack was developed to
provide such a one-stop solution. The software was designed in a modular manner
with data integration strategies in mind to ease adaptations for integrative analyses
and interpretation of different data types. ArrayTrack comprises three integrated
components: (1) DB contains a set of relational databases, each storing experimental
platform-specific data (i.e., genomics, proteomics and metabolomics data) together
with experiment and sample annotation; (2) TOOL provides analysis capabilities for
data visualization, normalization, significance analysis, clustering, and classification;
and (3) LIB contains information (e.g., gene annotation, protein function and
pathways) from public repositories.
ArrayTrack’s LIB contains several libraries providing relevant data on gene
and protein function, signalling and metabolic pathways, cross-species orthologs, and
Gene Ontology. The primary emphasis of ArrayTrack is the direct linking of analysis
results with functional information in LIB for facilitating the interaction between the
choice of analysis methods and the biological relevance of analysis results. Using
ArrayTrack, we can select an analysis method from the TOOL and apply the method
to selected omics data stored in DB; the analysis results can be linked directly to
pathways, gene ontology and other functional information stored in LIB. To further
facilitate data interpretation, ArrayTrack also provides a direct link from analysis
results to external data repositories, such as OMIM, Unigene, Chromosomal Map
and GeneCard, among others. Importantly, the utility of ArrayTrack is further enhanced by its
interface to or integration with many commercial and public tools, including
Ingenuity Pathway Analysis, GeneGO MetaCore, PathArt, Spotfire, JMP Genomics
and R, among others.
Examining the pathways and functional categories shared among the
differentially expressed genes, proteins and metabolites is important for understanding
the mode of action underlying a toxicity. ArrayTrack offers a function that identifies the
common pathways or functions shared by a combination of
genes/proteins/metabolites. The function has been demonstrated as a powerful
discovery tool in systems biology to analyze the data from genomics, proteomics and
metabolomics. Figure 3 depicts the work flow of this function. Once significant
gene, protein and metabolite expression profiles have been derived from the respective
omics data, each profile is independently mapped to the pathways to determine
which pathways are altered. The three pathway lists from the gene, protein and metabolite
profiles are then compared using a Venn diagram to determine the common altered
pathways. The statistical significance of each pathway is assessed using Fisher’s
Exact Test. The detailed pathway can also be displayed with the differentially expressed
genes, proteins and metabolites highlighted in different colors. The same
process can be applied to Gene Ontology (GO) data to identify the common altered GO
terms (i.e., gene functions).
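The work flow above can be illustrated with a short sketch. It is not ArrayTrack's implementation: the pathway names and counts are toy values, the function names are hypothetical, and Fisher's Exact Test is taken here in its one-sided (over-representation) form, computed as the upper tail of the hypergeometric distribution.

```python
from math import comb


def common_pathways(gene_pathways, protein_pathways, metabolite_pathways):
    """Central region of the three-way Venn diagram: pathways altered in all three lists."""
    return set(gene_pathways) & set(protein_pathways) & set(metabolite_pathways)


def fisher_exact_enrichment(hits, pathway_size, list_size, background_size):
    """One-sided Fisher's exact p-value for pathway over-representation.

    hits            -- significant entities that fall in the pathway
    pathway_size    -- all background entities annotated to the pathway
    list_size       -- size of the significant list
    background_size -- all measured entities
    Returns P(X >= hits) for a hypergeometric random variable X.
    """
    max_hits = min(pathway_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Toy pathway lists, one per omics profile.
shared = common_pathways(
    {"Glycolysis", "Apoptosis", "TCA cycle"},
    {"Glycolysis", "TCA cycle"},
    {"Glycolysis", "TCA cycle", "Urea cycle"},
)
print(sorted(shared))

# 3 of 4 significant genes fall in a 5-gene pathway, out of 20 measured genes.
p_value = fisher_exact_enrichment(hits=3, pathway_size=5, list_size=4, background_size=20)
print(round(p_value, 4))
```

In this sketch the test is applied per pathway and per omics type; how the per-type p-values should be combined for a shared pathway is left open, since the text does not specify it.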
Future Perspectives
The combining of omics data for the purpose of identifying and confirming
relevant biological processes, such as pathways and gene or protein functions, is
crucial to realize the benefits of these high-throughput molecular technologies in drug
development and public health. There are a number of challenges, however, to move
this field forward. First, an effective data model is required for managing the omics
information along with the clinical and non-clinical data. A systems approach has to
be examined in the context of the phenotypic response. ArrayTrack applies the Study
Data Tabulation Model (SDTM) (http://www.cdisc.org/models/sds/v3.1/index.html)
developed by the Clinical Data Interchange Standards Consortium (CDISC) and the Standard
for Exchange of Non-Clinical Data (SEND) for managing both clinical and non-clinical data along with multi-omics information. Second, the systems approach
largely relies on the accuracy of pathways and molecular networks. Many
methodologies and mathematical algorithms have been developed and will be
continually improved to provide ever more accurate pathways and networks. Lastly, a
user-friendly software environment with effective data visualization capability is
essential to enable biologists to perform data interpretation, including statistical
analyses, efficiently and effectively.
References
1. Frueh FW: Impact of microarray data quality on genomic data submissions to the FDA. Nat Biotechnol 2006, 24(9):1105-1107.
2. Joyce AR, Palsson BO: The model organism as a system: integrating 'omics' data sets. Nat Rev Mol Cell Biol 2006, 7(3):198-210.
3. Paley SM, Karp PD: The Pathway Tools cellular overview diagram and Omics Viewer. Nucleic Acids Res 2006, 34(13):3771-3778.
4. Baitaluk M, Sedova M, Ray A, Gupta A: BiologicalNetworks: visualization and analysis tool for systems biology. Nucleic Acids Res 2006, 34(Web Server issue):W466-471.
5. Tong W, Cao X, Harris S, Sun H, Fang H, Fuscoe J, Harris A, Hong H, Xie Q, Perkins R et al: ArrayTrack--supporting toxicogenomic research at the U.S. Food and Drug Administration National Center for Toxicological Research. Environ Health Perspect 2003, 111(15):1819-1826.
Figure Captions
Figure 1. A general work flow of a systems approach based on multi-omics
technology is illustrated for a toxicogenomics study. This could be a single-dose
or repeat-dose toxicity experiment, where genomics, proteomics and
metabolomics data are collected at three different time points independently.
The multi-omics data are then integrated using various statistical methods, including
clustering, correlation and network modelling. Alternatively, significance
analysis can be carried out first on the data from each omics platform to
determine the differentially expressed genes, proteins and metabolites at different
time points, followed by data integration. The systems model derived
from either of these approaches needs to be validated on a new experiment.
Figure 2. The number of publications indexed in PubMed for different pathway/network analysis tools.
Figure 3. Omics data integration in ArrayTrack.