Supplemental Text 1

Previous Work in Heterogeneous Data Integration

As mentioned in the main text, related prior work in the area of heterogeneous data integration falls into two categories: methodological precursors involving naive Bayesian classifiers and biological precursors performing data integration for simpler organisms. Naive Bayesian classifiers are themselves well-studied and robust (see Mitchell 1997 and Rish 2001 for reviews), and their applications for data integration in related biological areas have been mainly in the analysis of protein-protein interaction (PPI) data.

Beyond the biological and computational challenges inherent in integrating large heterogeneous genomic data collections, a major contribution of HEFalMp not addressed by any previous system is its summarization of data as systems-level functional maps. Previous data integration methods generally provided biomolecular interaction networks as their end product; HEFalMp includes such functional relationship networks, but also provides further analysis in the form of functional maps. These represent a uniform framework in which the millions of edges in such networks can be further summarized in a biologically informative way, yielding data-driven interactions between pathways, diseases, and (as future work) tissue types and developmental stages.

Rhodes et al. (2005) employed a semi-naive Bayesian model to integrate a relatively small and highly curated subset of human PPI data: ~40K PPI pairs from orthologous proteins in model organisms, ~200M coexpression measurements spanning only five microarray datasets, and ~40M curated coannotations from the Gene Ontology. The most important way in which this study differs from ours is in its use of prior knowledge: the Rhodes et al. classifier uses curated coannotations from the Gene Ontology to predict protein-protein interactions, which is fundamentally different from the standard bioinformatics paradigm of predicting new biological knowledge from experimental data. Beyond this basic difference, the Rhodes et al. study predicts only PPIs, not more general functional relationships, and it incorporates a large amount of data from orthologous organisms rather than from direct experimentation on human systems; this has the potential to introduce biases based on the way in which orthology is inferred. Finally, the scope of the Rhodes et al. study is quite different from that addressed in our manuscript: the previous study integrated ~250M data points based on a ~3M pair gold standard, versus ~30B data points and ~55M gold standard pairs used by HEFalMp in >200 distinct functional areas. This difference, combined with Rhodes et al.'s semi-manual normalization and filtering of their three data types, makes it infeasible to scale their solution to a comprehensive functional view of the human genome.

The STRING database (Jensen et al. 2009) focuses on a broader definition of PPIs that includes functional relationships, but the majority of its human interactions represent experimental results imported from existing databases (BioGRID (Stark et al. 2006), MINT (Chatr-aryamontri et al. 2007), etc.). STRING also suffers from the same potential drawback as the Rhodes et al. study: it uses curated databases (e.g. HPRD (Mishra et al. 2006), Reactome (Vastrik et al. 2007), and others) as training data relative to a small, similarly curated gold standard (KEGG pathways (Kanehisa et al. 2008)). While this provides an excellent means of accessing multiple reference databases of experimental results through a unified interface, it is potentially circular when framed as an application of machine learning to predict new biological relationships.
In cases where STRING performs data integration to predict new protein interactions, it does so by regressing a confidence score against new datasets, which maps their raw results to membership probabilities in KEGG pathways. While STRING can clearly scale to include a tremendous amount of data, its focus is on aggregation of existing PPI databases rather than on prediction of new functional relationships; STRING itself provides neither a uniform machine learning method for integrating its constituent data nor an interface for exploring the results at a systems level comparable to HEFalMp's functional maps.

Other existing systems differ from HEFalMp in their biological, rather than computational, scopes; several data integration techniques have proven to be quite successful in predicting functional relationships in simpler organisms. Such integrations are necessarily smaller in computational scope as well; the most comparable study in yeast (Myers and Troyanskaya 2007) included a similarly designed gold standard and coverage of ~200 functional contexts, but this still entailed almost 1,000x fewer data points. The machine learning methodology developed in the HEFalMp system is an evolution of this system and of those described in Huttenhower et al. (2006) and Myers et al. (2005), with the addition of Bayesian regularization to improve performance in the presence of very large data collections.

Other techniques applied to simpler organisms include those of Jansen et al. (2003) and Lee et al. (2004) in yeast and that of Date and Stoeckert (2006) in the malaria parasite P. falciparum. Jansen et al. focus exclusively on physical PPIs using a small data collection in yeast (e.g. no coexpression data are integrated); with the exception of several cutoffs to discretize continuous predictions into binary interactions, their methodology is essentially an unmodified application of naive Bayesian classifiers. Lee et al. employ a variety of customized model-fitting procedures to map S. cerevisiae experimental data values to a KEGG-based gold standard, and it is unclear whether their approach could scale to the biological and computational challenges of the human genome. Finally, Date and Stoeckert's work with P. falciparum provides an excellent example of the power of functional data integration to explore a largely uncharacterized biological system; they also build upon the integration techniques of Troyanskaya et al. (2003) with a biological focus on the mechanisms of malaria infection. This emphasizes the importance of applying data integration techniques to new biological systems where they can most effectively collect, focus, and expand upon large collections of experimental data.

Thus, from a biological perspective, it is significant that HEFalMp offers the first comprehensive functional integration of human genomic data. This provides an opportunity to explore fundamental human biology and the molecular mechanisms of disease at levels ranging from individual experimental results to the interplay between entire cellular pathways. Regularized naive Bayesian networks represent a machine learning technique with sufficient breadth to incorporate billions of experimental data points, and functional mapping provides the depth to hierarchically summarize this tremendous amount of data in a biologically meaningful way. We hope that both techniques are useful to computational and biological investigators alike in the investigation of the human genome and the genomes of other organisms.

References

Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., and Cesareni, G. 2007. MINT: the Molecular INTeraction database. Nucleic Acids Research 35: D572-574.
Date, S.V. and Stoeckert, C.J., Jr. 2006. Computational modeling of the Plasmodium falciparum interactome reveals protein function on a genome-wide scale. Genome Research 16: 542-549.
Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. 2006. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22: 2890-2897.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302: 449-453.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M. et al. 2009. STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research 37: D412-416.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T. et al. 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Research 36: D480-484.
Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes. Science 306: 1555-1558.
Mishra, G.R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T.M. et al. 2006. Human protein reference database--2006 update. Nucleic Acids Research 34: D411-414.
Mitchell, T.M. 1997. Machine Learning. McGraw-Hill.
Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K., and Troyanskaya, O.G. 2005. Discovery of biological networks from diverse functional genomic data. Genome Biology 6: R114.
Myers, C.L. and Troyanskaya, O.G. 2007. Context-sensitive data integration and prediction of biological networks. Bioinformatics 23: 2322-2330.
Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A., and Chinnaiyan, A.M. 2005. Probabilistic model of the human protein-protein interaction network. Nature Biotechnology 23: 951-959.
Rish, I. 2001. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.
Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. 2006. BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34: D535-539.
Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., and Botstein, D. 2003. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proceedings of the National Academy of Sciences of the United States of America 100: 8348-8353.
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S. et al. 2007. Reactome: a knowledge base of biologic pathways and processes. Genome Biology 8: R39.
