Omics Data Integration – A Systems Approach View

Hong Fang1, Roger Perkins1 and Weida Tong2

1 Division of Bioinformatics, Z-Tech Corporation, 3900 NCTR Road, Jefferson, AR 72079
2 Division of Systems Toxicology, National Center for Toxicological Research (NCTR), Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079

Email addresses:
Hong Fang - hong.fang@fda.hhs.gov
Roger Perkins - roger.perkins@fda.hhs.gov
Weida Tong - weida.tong@fda.hhs.gov

The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.

Introduction

Genomics, proteomics and metabolomics, collectively called omics or panomics, are contributing increasingly to drug discovery and development. Each type of omics data conveys a different type of information and biological context that depends on the particular experimental system. Nevertheless, for a given biological system, the different data are inextricably linked. When different data are integrated, knowledge synergies may emerge that provide a more holistic understanding of the system and its affected mechanisms. Convergence of multiple data sources can provide additional confirmation of validity and a broader understanding of linkages across levels of biological organization. Looking ahead, linking genomic, proteomic and metabolomic data on the basis of a network-like understanding of the salient mechanisms enters the realm of systems biology, wherein the dynamic response of a biological system (cells, organs, or entire organisms) might be simulated across physiologic, developmental, or evolutionary time.

Genes, the proteins they encode, or the metabolites produced in a perturbed system may provide a quantitative biomarker of disease or toxicity. Although necessary, such biomarkers alone are insufficient for a systems-level understanding of the relevant biology.
Data integration is needed to determine how genes, proteins, transcription factors and other components act in concert through linked pathways and associated molecular mechanisms. A flexible and integrated bioinformatics environment for combining and interrogating omics information is crucial for generating knowledge of the whole system by analyzing and integrating global changes in gene expression, protein expression and interaction, and metabolite profiles.

The rapid advancement and adoption of omics technology in drug discovery and development has created both opportunities and challenges for the regulatory agencies. The FDA Critical Path initiative identifies pharmacogenomics (PGx), a subdiscipline of omics, as a major opportunity for advancing medical product development and personalized medicine (http://www.fda.gov/oc/initiatives/criticalpath/). To stimulate and facilitate a scientific effort to promote PGx in drug development and regulation, FDA issued a guidance to industry on PGx data submission that defined a novel mechanism named the Voluntary Genomics Data Submission (VGDS) (http://www.fda.gov/OHRMS/DOCKETS/98fr/2003d-0497-gdl0002.pdf). Through VGDS, a sponsor is able to interact with the agency by submitting PGx data on a voluntary basis. In the past two and a half years, many submissions have been received [1] in which the safety or clinical benefits of a drug or biological product have often been supported by omics technologies.

The ability to manage, analyze and interpret omics data is a major challenge on which FDA and sponsors are working cooperatively. The need for efficient software to enable review was identified early on, and the need to integrate different data within that software rapidly became apparent. The continuing development of a user-friendly and efficient software environment that uses multiple omics data types to demonstrate efficacy and safety remains a high priority within FDA.

Omics data integration through a systems approach

In the absence of data integration, lists of genes, proteins, and metabolic products that are differentially expressed between sample groups are just lists, providing minimal information about biological cause and context. Moreover, numerous effects such as data quality and biases, analysis bias, gene co-expression, and systematic experimental variability can lead to false inference of sample differences, creating either false positives or false negatives (i.e., type I or type II errors). The prospective fix for such omics data limitations is an integrative systems approach to data interpretation.

Systems biology or systems toxicology, like many other recently introduced terms, is loosely defined and means different things to different people, depending on their backgrounds. Integrating different omics data types together with public data provides the ability to elucidate biological contexts such as the perturbed functions, signaling pathway components, transcription-factor mechanisms of action, gene regulatory networks, post-translational modifications, and genetic (or SNP) cause-effect relationships, among many others. Moreover, when different data types lead to the same hypothesis (data triangulation) with comprehensible biological plausibility, both reliability and validity are enhanced. In drug discovery and development, systems biology in the form of integrating gene-level and protein-level information with metabolite-level information, in order to identify biological contexts such as the perturbed pathways or functions, is rapidly reaching the point of real-world utility. A review of the enormous potential of systems biology and the concomitant research efforts that are underway is far beyond the scope of this paper [2].
Rather, we limit our discussion to systems approaches to integrating omics data, with examples, that we expect will soon yield demonstrable benefits in drug discovery and development as well as in regulatory activities.

Figure 1 depicts a generic work flow of a multiple-omics experiment design and the associated data integration. In a single-dose or repeat-dose toxicity study, genomics, proteomics and metabolomics data can be collected in parallel at different time points. A number of approaches can then be applied to gain mechanistic understanding through integration of the multi-omics expression profiles. For example, the metabolite data can be linked to enzymes through their Enzyme Commission (EC) numbers. Protein-level analysis can be derived from metabolic analysis of changes in protein expression patterns, which can in turn be linked to changes in gene expression profiles. The EC number can be linked to the GenBank accession number, enabling gene expression data to be mapped onto an interactive KEGG map.

For gene expression data (or protein expression data) analyzed alone, examining the biological response to treatment permits the altered gene expression profile to be clustered into groups of genes with similar temporal expression patterns. These clusters can be assumed to contain similarly regulated genes, whose promoter regions can be searched for common regulatory sequences. The same approach to putatively co-regulated genes can be extended to the protein expression data, after which the common regulatory sequences from the gene and protein expression analyses can be compared and correlated through integrated analysis. The success of these approaches is largely dependent upon knowledge of the regulatory relationships among genes, proteins, and metabolites; the more that is known, the more effective the approaches should be.
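
The identifier linkage described earlier in this section (metabolite to enzyme through the EC number, then EC number to gene accession) can be pictured as a chain of table lookups. The following is only a minimal sketch under stated assumptions: the lookup tables, the placeholder accession labels (ACC_HK and so on) and the function names are hypothetical and merely stand in for real annotation resources; the final step of drawing the result on a KEGG map is omitted.

```python
# Hypothetical lookup tables (illustrative only). In practice these
# would be populated from curated annotation resources.
METABOLITE_TO_EC = {
    "glucose-6-phosphate": ["2.7.1.1", "5.3.1.9"],
    "pyruvate": ["2.7.1.40"],
}

# Placeholder accession labels, NOT real GenBank accession numbers.
EC_TO_ACCESSION = {
    "2.7.1.1": ["ACC_HK"],
    "5.3.1.9": ["ACC_GPI"],
    "2.7.1.40": ["ACC_PK"],
}


def link_metabolite_to_genes(metabolite):
    """Return sorted (EC number, accession) pairs reachable from a metabolite."""
    pairs = []
    for ec in METABOLITE_TO_EC.get(metabolite, []):
        for accession in EC_TO_ACCESSION.get(ec, []):
            pairs.append((ec, accession))
    return sorted(pairs)


def annotate(metabolite, gene_fold_change):
    """Attach measured gene-level fold changes to each linked enzyme.

    gene_fold_change maps accession -> fold change; accessions that
    were not measured are reported with a fold change of None.
    """
    return [
        {
            "metabolite": metabolite,
            "ec": ec,
            "accession": accession,
            "fold_change": gene_fold_change.get(accession),
        }
        for ec, accession in link_metabolite_to_genes(metabolite)
    ]
```

Calling `annotate("glucose-6-phosphate", {"ACC_HK": 2.5})` would return one row per linked enzyme, so that a change in a metabolite can be read side by side with the expression change of the genes encoding the enzymes that act on it.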
In some circumstances, however, sufficient knowledge is not yet available. An alternative in that case is to develop a correlation network by comparing the expression pattern of each gene, protein and metabolite across all samples in the genomic, proteomic and metabolomic hyperspace in order to derive biomarkers and disease pathways. The rationale of this approach is that induced gene expression levels examined in the context of induced changes in protein and metabolite expression should prove more informative and reliable. This approach first combines the significant lists of genes, proteins, and metabolites into a single file. Then, the associations and correlations of the different omics profiles are determined using pattern recognition and statistical methods. Finally, the genes, proteins and metabolites that have similar expression patterns across all samples are cross-linked to produce a correlation network. Candidate biomarkers and disease pathways may then be derived from analysis of the network.

Pathway analysis and network development are other beneficial strategies in omics data integration. Use of omics technology in these two areas of research has greatly enhanced our understanding of the complex roles played by genes, gene products and their metabolites. Many commercial software solutions are already available and continue to evolve in parallel with advances in omics technology. Figure 2 summarizes some of these tools and the number of citations indexed in PubMed for each. Judging from this figure, it is not surprising that the public tools, such as KEGG and GenMAPP, have played a prominent role in research. It is worthwhile to note, however, that while a few tools enable multi-omics data integration, most do not [3, 4]. For example, most of these tools can process only a single type of expression data (e.g., gene expression).
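
The correlation-network approach outlined earlier in this section (combine the significant genes, proteins and metabolites, measure the similarity of their profiles across samples, and cross-link the similar ones) can be sketched as follows. This is an illustration under assumptions rather than the method of any particular tool: the text speaks only of "pattern recognition and statistical methods", so the use of the Pearson correlation coefficient, the cutoff of 0.9 and the function names are choices made here for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_network(profiles, threshold=0.9):
    """Cross-link molecules whose profiles are strongly (anti-)correlated.

    profiles maps (omics_type, name) -> list of values, one per sample,
    with all lists in the same sample order. Returns a list of edges
    (node_a, node_b, r) for every pair with |r| >= threshold.
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges
```

Because the nodes carry their omics type, an edge such as `(("gene", "G1"), ("protein", "P1"), r)` makes a cross-omics link explicit. Negative correlations are retained on the assumption that anti-correlated molecules may also be mechanistically related; a stricter analysis might treat them separately.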
In contrast, omics integration requires two or three different types of omics data (such as gene expression, protein expression and metabolomics profiles) to be analyzed in an integrated manner.

ArrayTrack and Omics Data Integration

The VGDS program in FDA has implemented a bioinformatics infrastructure for data review that enables integrative strategies for data analyses. For this purpose, the bioinformatics platform ArrayTrack [5] has been developed and is being continually refined as an FDA review tool, covering management, analysis and biological interpretation of exploratory genomics, proteomics and metabolomics data from both clinical and non-clinical experiments. ArrayTrack undergoes constant refinement and enhancement based on feedback and needs expressed by users at government, academic, and private institutions. ArrayTrack is also freely available for download on the Internet (http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/).

A single microarray experiment generates prodigious and complex data, the management, analysis and interpretation of which constitute a formidable effort for researchers and regulatory reviewers alike. ArrayTrack was developed to integrate these three essential components within a single platform and so provide a one-stop solution. The software was designed in a modular manner, with data integration strategies in mind, to ease adaptation for integrative analysis and interpretation of different data types. ArrayTrack comprises three integrated components: (1) DB contains a set of relational databases, each storing experimental platform-specific data (i.e., genomics, proteomics and metabolomics data) together with experiment and sample annotation; (2) TOOL provides analysis capabilities for data visualization, normalization, significance analysis, clustering, and classification; and (3) LIB contains information (e.g., gene annotation, protein function and pathways) from public repositories. ArrayTrack’s LIB contains several libraries providing relevant data on gene and protein function, signalling and metabolic pathways, cross-species orthologs, and Gene Ontology.

The primary emphasis of ArrayTrack is the direct linking of analysis results with functional information in LIB, which facilitates the interplay between the choice of analysis methods and the biological relevance of the analysis results. Using ArrayTrack, we can select an analysis method from TOOL and apply it to selected omics data stored in DB; the analysis results can then be linked directly to pathways, Gene Ontology and other functional information stored in LIB. To further facilitate data interpretation, ArrayTrack also links analysis results directly to external data repositories such as OMIM, Unigene, Chromosomal Map, GeneCard and others. Importantly, the utility of ArrayTrack is further enhanced by its interface to or integration with many commercial and public tools, including Ingenuity Pathway Analysis, GeneGO MetaCore, PathArt, Spotfire, JMP Genomics, and the R program, among others.

Examining the pathways and functional categories shared among the differentially expressed genes, proteins and metabolites is important for understanding the mode of action of a toxicity. ArrayTrack offers a function that identifies the common pathways or functions shared by a combination of genes, proteins and metabolites. The function has been demonstrated to be a powerful discovery tool in systems biology for analyzing data from genomics, proteomics and metabolomics. Figure 3 depicts the work flow of this function.
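
As a rough illustration of this kind of shared-pathway analysis, the sketch below maps a list of significant molecules onto pathways, scores each pathway with Fisher's exact test, and intersects the per-omics results. It is not ArrayTrack's implementation. The one-sided form of the test (computed here as a hypergeometric upper tail), the 0.05 cutoff, the function names and the input layout are all assumptions made for the example, and the significant identifiers are assumed to be drawn from the same background used for the test.

```python
from math import comb


def fisher_enrichment_p(hits, pathway_size, n_significant, n_background):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= hits) when n_significant molecules are drawn without
    replacement from n_background molecules, pathway_size of which
    belong to the pathway (the hypergeometric upper tail).
    """
    upper = min(pathway_size, n_significant)
    tail = sum(
        comb(pathway_size, i)
        * comb(n_background - pathway_size, n_significant - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(n_background, n_significant)


def altered_pathways(significant_ids, pathway_members, background_size, alpha=0.05):
    """Map one omics profile to pathways; keep those with p <= alpha.

    pathway_members maps pathway name -> iterable of member identifiers.
    Returns {pathway name: p-value}.
    """
    significant = set(significant_ids)
    result = {}
    for name, members in pathway_members.items():
        members = set(members)
        hits = len(significant & members)
        if hits == 0:
            continue
        p = fisher_enrichment_p(hits, len(members), len(significant), background_size)
        if p <= alpha:
            result[name] = p
    return result


def common_pathways(*per_omics_results):
    """Pathways altered in every omics profile (the Venn-diagram core)."""
    shared = set(per_omics_results[0])
    for result in per_omics_results[1:]:
        shared &= set(result)
    return sorted(shared)
```

In use, `altered_pathways` would be called once each for the gene, protein and metabolite profiles, each with its own pathway membership table and background size, and the three results passed to `common_pathways`. Correction for multiple testing across pathways is deliberately left out of this sketch.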
Once significant expression profiles of genes, proteins and metabolites have been derived from the respective multi-omics data, each profile is independently mapped to pathways to determine which pathways are altered. The three pathway lists from the gene, protein and metabolite profiles are then analyzed using a Venn diagram to determine the commonly altered pathways. The statistical significance of each pathway is assessed using Fisher’s Exact Test. A detailed pathway can also be displayed with the differentially expressed genes, proteins and metabolites highlighted in different colors. The same process can be applied to Gene Ontology (GO) data to identify the commonly altered GO terms (i.e., gene functions).

Future Perspectives

Combining omics data to identify and confirm relevant biological processes, such as pathways and gene or protein functions, is crucial to realizing the benefits of these high-throughput molecular technologies in drug development and public health. A number of challenges must be met, however, to move this field forward. First, an effective data model is required for managing the omics information along with the clinical and non-clinical data, because a systems approach has to be examined in the context of the phenotypic response. ArrayTrack applies the Study Data Tabulation Model (SDTM) (http://www.cdisc.org/models/sds/v3.1/index.html), developed by the Clinical Data Interchange Standards Consortium (CDISC), and the Standard for Exchange of Non-Clinical Data (SEND) for managing both clinical and nonclinical data along with multi-omics information. Second, the systems approach relies largely on the accuracy of pathways and molecular networks. Many methodologies and mathematical algorithms have been developed and will continue to be improved to provide ever more accurate pathways and networks.
Lastly, a user-friendly software environment with effective data visualization capability is essential to enable biologists to perform data interpretation, including statistical analyses, efficiently and effectively.

References

1. Frueh FW: Impact of microarray data quality on genomic data submissions to the FDA. Nat Biotechnol 2006, 24(9):1105-1107.
2. Joyce AR, Palsson BO: The model organism as a system: integrating 'omics' data sets. Nat Rev Mol Cell Biol 2006, 7(3):198-210.
3. Paley SM, Karp PD: The Pathway Tools cellular overview diagram and Omics Viewer. Nucleic Acids Res 2006, 34(13):3771-3778.
4. Baitaluk M, Sedova M, Ray A, Gupta A: BiologicalNetworks: visualization and analysis tool for systems biology. Nucleic Acids Res 2006, 34(Web Server issue):W466-471.
5. Tong W, Cao X, Harris S, Sun H, Fang H, Fuscoe J, Harris A, Hong H, Xie Q, Perkins R et al: ArrayTrack--supporting toxicogenomic research at the U.S. Food and Drug Administration National Center for Toxicological Research. Environ Health Perspect 2003, 111(15):1819-1826.

Figure Captions

Figure 1. A general work flow of a systems approach based on multi-omics technology, illustrated for a toxicogenomics study. This could be a single-dose or repeat-dose toxicity experiment in which genomics, proteomics and metabolomics data are collected independently at three different time points. The multi-omics data are then integrated using various statistical methods, including clustering, correlation and network modelling. Alternatively, significance analysis can be carried out first on the data from each omics platform to determine the differentially expressed genes, proteins and metabolites at the different time points, followed by data integration. The systems model derived from either of these approaches needs to be validated on a new experiment.

Figure 2. The number of publications indexed in PubMed for different pathway/network analysis tools.

Figure 3. Omics data integration in ArrayTrack.