Supplemental Text 1

Previous Work in Heterogeneous Data Integration
As mentioned in the main text, related prior work in the area of heterogeneous data integration falls into two categories:
methodological precursors involving naive Bayesian classifiers and biological precursors performing data integration for
simpler organisms. Naive Bayesian classifiers are themselves quite well-studied and robust (see Mitchell 1997 and Rish
2001 for reviews), and their applications for data integration in related biological areas have been mainly in the analysis
of protein-protein interaction (PPI) data. Beyond the biological and computational challenges inherent in integrating large
heterogeneous genomic data collections, a major contribution of HEFalMp not addressed by any previous system is its
summarization of data as systems-level functional maps. Previous data integration methods generally provided
biomolecular interaction networks as their end product; HEFalMp includes such functional relationship networks, but
also provides further analysis in the form of functional maps. These represent a uniform framework in which the millions
of edges in such networks can be further summarized in a biologically informative way, yielding data-driven interactions
between pathways, diseases, and (as future work) tissue types and developmental stages.
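To make the naive Bayesian integration discussed above concrete, the following minimal Python sketch combines discretized evidence from several datasets into a posterior probability that a gene pair is functionally related. The prior, the dataset names, and the conditional probability tables are illustrative assumptions, not values taken from HEFalMp or from any of the cited studies.

```python
import math

def naive_bayes_posterior(prior, cpts, observations):
    """Return P(related | evidence) for one gene pair under a naive Bayes model.

    prior        -- assumed prior probability that a gene pair is related
    cpts         -- {dataset: {"related": [...], "unrelated": [...]}}; each list
                    holds P(bin | class) over that dataset's discretized bins
    observations -- {dataset: observed bin index}; datasets with no measurement
                    for this pair are simply left out and contribute nothing
    """
    log_rel = math.log(prior)
    log_unrel = math.log(1.0 - prior)
    for dataset, bin_index in observations.items():
        table = cpts[dataset]
        # Naive assumption: datasets are conditionally independent given the class,
        # so their log-likelihoods add.
        log_rel += math.log(table["related"][bin_index])
        log_unrel += math.log(table["unrelated"][bin_index])
    # Normalize in log space so that many datasets do not underflow.
    top = max(log_rel, log_unrel)
    rel = math.exp(log_rel - top)
    unrel = math.exp(log_unrel - top)
    return rel / (rel + unrel)

# Hypothetical conditional probability tables for two datasets.
CPTS = {
    "coexpression": {"related": [0.2, 0.3, 0.5], "unrelated": [0.5, 0.3, 0.2]},
    "physical_interaction": {"related": [0.7, 0.3], "unrelated": [0.95, 0.05]},
}

posterior = naive_bayes_posterior(
    0.1, CPTS, {"coexpression": 2, "physical_interaction": 1}
)
```

In this toy case the pair falls in the highest coexpression bin and has a reported physical interaction, so the posterior rises well above the assumed prior of 0.1. A pair with no observations at all keeps the prior unchanged.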
(Rhodes et al. 2005) employed a semi-naive Bayesian model to integrate a relatively small and highly curated subset of
human PPI data: ~40K PPI pairs from orthologous proteins in model organisms, ~200M coexpression measurements
spanning only five microarray datasets, and ~40M curated coannotations from the Gene Ontology. The most important
way in which this study differs from ours is in its use of prior knowledge: the Rhodes et al. classifier uses curated
coannotations from the Gene Ontology to predict protein-protein interactions, which is fundamentally different from the
standard bioinformatics paradigm of predicting new biological knowledge from experimental data. Beyond this
substantial basic difference, the Rhodes et al. study predicts only PPIs, not more general functional relationships, and it
also incorporates a large amount of data from orthologous organisms, not from direct experimentation on human
systems; this has the potential to introduce biases based on the way in which orthology is inferred. Finally, the scope of
the Rhodes et al. study is quite different from that addressed in our manuscript: the previous study integrated ~250M data
points based on a ~3M pair gold standard, versus ~30B data points and ~55M gold standard pairs used by HEFalMp in
>200 distinct functional areas. This difference, combined with Rhodes et al.'s semi-manual normalization and filtering of
their three data types, makes it infeasible to scale their solution to a comprehensive functional view of the human
genome.
The STRING database (Jensen et al. 2009) focuses on a broader definition of PPIs that includes functional relationships,
but the majority of its human interactions represent experimental results imported from existing databases (BioGRID
(Stark et al. 2006), MINT (Chatr-aryamontri et al. 2007), etc.). STRING also suffers from the same potential drawback of using
curated databases (e.g. HPRD (Mishra et al. 2006), Reactome (Vastrik et al. 2007), and others) as training data relative to a
small, similarly curated gold standard (KEGG pathways (Kanehisa et al. 2008)). While this provides an excellent means of
accessing multiple reference databases of experimental results through a unified interface, it is potentially circular when
framed as an application of machine learning to predict new biological relationships. In cases where STRING performs
data integration to predict new protein interactions, it does so by regressing a confidence score against new datasets,
which maps their raw results to membership probabilities in KEGG pathways. While STRING can clearly scale to include
a tremendous amount of data, its focus is on aggregation of existing PPI databases rather than on prediction of new
functional relationships; STRING itself provides neither a uniform machine learning method for integrating its
constituent data nor an interface for exploring the results at a systems level comparable to HEFalMp's functional maps.
Other existing systems differ from HEFalMp in their biological, rather than computational, scopes; several data
integration techniques have proven to be quite successful in predicting functional relationships in simpler organisms.
Such integrations are necessarily smaller in computational scope as well; the most comparable study in yeast (Myers et al.
2007) included a similarly designed gold standard and coverage of ~200 functional contexts, but this still entailed almost
1,000x fewer data points. The machine learning methodology developed in the HEFalMp system is an evolution of this
system and of those described in (Huttenhower et al. 2006) and (Myers et al. 2005), with the addition of Bayesian
regularization to improve performance in the presence of very large data collections. Other techniques applied to simpler
organisms include those of (Jansen et al. 2003) and (Lee et al. 2004) in yeast and that of (Date et al. 2006) in the malaria
parasite P. falciparum. Jansen et al. focus exclusively on physical PPIs using a small data collection in yeast (e.g. no
coexpression data is integrated); with the exception of several cutoffs to discretize continuous predictions into binary
interactions, their methodology is essentially an unmodified application of naive Bayesian classifiers. Lee et al. employ a
variety of customized model fitting to map S. cerevisiae experimental data values to a KEGG-based gold standard, and it is
unclear whether their approach could scale to the biological and computational challenges of the human genome. Finally,
Date and Stoeckert's work with P. falciparum provides an excellent example of the power of functional data integration to
explore a largely uncharacterized biological system; they also build upon the integration techniques of (Troyanskaya et al.
2003) with a biological focus on the mechanisms of malaria infection. This emphasizes the importance of applying data
integration techniques to new biological systems where they can most effectively collect, focus, and expand upon large
collections of experimental data.
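The Bayesian regularization mentioned above is not specified in detail in this text. As a hedged illustration only, one common way to regularize a naive Bayesian classifier is to shrink each dataset's conditional probability table toward the uniform distribution, so that a dataset given a high shrinkage weight contributes less evidence. The function names, the weight alpha, and the example tables below are assumptions for this sketch and may differ from the scheme actually used in HEFalMp.

```python
def regularize_cpt(cpt_row, alpha):
    """Mix a conditional distribution with the uniform distribution.

    alpha = 0 leaves the row unchanged; alpha = 1 makes it uniform, which
    removes the dataset's influence on the posterior entirely.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    uniform = 1.0 / len(cpt_row)
    return [(1.0 - alpha) * p + alpha * uniform for p in cpt_row]

def likelihood_ratio(related_row, unrelated_row, bin_index):
    """Evidence a single observation contributes: P(bin | related) / P(bin | unrelated)."""
    return related_row[bin_index] / unrelated_row[bin_index]

# Hypothetical tables for one dataset with three discretized bins.
related = [0.2, 0.3, 0.5]
unrelated = [0.5, 0.3, 0.2]

raw_lr = likelihood_ratio(related, unrelated, 2)
shrunk_lr = likelihood_ratio(
    regularize_cpt(related, 0.5), regularize_cpt(unrelated, 0.5), 2
)
```

Shrinking both rows pulls the likelihood ratio toward 1, which is the desired effect when many partially redundant datasets would otherwise be counted as independent evidence and produce overconfident posteriors.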
Thus, from a biological perspective, it is significant that HEFalMp offers the first comprehensive functional integration
of human genomic data. This provides an opportunity to explore fundamental human biology and the molecular
mechanisms of disease at levels ranging from individual experimental results to the interplay between entire cellular
pathways. Regularized naive Bayesian networks represent a machine learning technique with sufficient breadth to
incorporate billions of experimental data points, and functional mapping provides the depth to hierarchically summarize
this tremendous amount of data in a biologically meaningful way. We hope that both techniques will prove useful to
computational and biological investigators alike in studying the human genome and the genomes of other
organisms.
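As a rough illustration of how the edges of a functional relationship network might be summarized at the level of pathways or diseases, the sketch below averages the predicted edge weights between two gene sets. This mean-weight statistic, the gene identifiers, and the edge weights are simplifying assumptions made for illustration; they are not the scoring used to build HEFalMp's functional maps.

```python
def between_set_association(edges, set_a, set_b):
    """Mean predicted edge weight across all gene pairs spanning two gene sets.

    edges -- {frozenset({gene1, gene2}): weight}, an undirected weighted network
    Returns None when the network has no edge between the two sets.
    """
    # Build each unordered cross-set pair once, even if the sets overlap.
    pairs = {frozenset((a, b)) for a in set_a for b in set_b if a != b}
    weights = [edges[pair] for pair in pairs if pair in edges]
    if not weights:
        return None
    return sum(weights) / len(weights)

# Hypothetical network: weights stand in for predicted relationship probabilities.
EDGES = {
    frozenset(("G1", "G2")): 0.9,
    frozenset(("G1", "G3")): 0.2,
    frozenset(("G2", "G4")): 0.6,
    frozenset(("G3", "G4")): 0.8,
}

score = between_set_association(EDGES, {"G1", "G2"}, {"G3", "G4"})
```

Only the G1-G3 and G2-G4 edges span the two sets here, so the score is their mean. In practice such a raw mean would likely need to be compared against a background level for the whole network before being interpreted as a pathway-level association.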
References
Chatr-aryamontri, A., Ceol, A., Palazzi, L.M., Nardelli, G., Schneider, M.V., Castagnoli, L., and Cesareni, G. 2007. MINT:
the Molecular INTeraction database. Nucleic acids research 35: D572-574.
Date, S.V. and Stoeckert, C.J., Jr. 2006. Computational modeling of the Plasmodium falciparum interactome reveals
protein function on a genome-wide scale. Genome research 16: 542-549.
Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. 2006. A scalable method for integration and functional
analysis of multiple microarray datasets. Bioinformatics (Oxford, England) 22: 2890-2897.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein,
M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science (New
York, N.Y.) 302: 449-453.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M. et al.
2009. STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic acids research 37:
D412-416.
Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S.,
Tokimatsu, T. et al. 2008. KEGG for linking genomes to life and the environment. Nucleic acids research 36: D480-484.
Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes. Science (New
York, N.Y.) 306: 1555-1558.
Mishra, G.R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R.,
Raghavan, T.M. et al. 2006. Human protein reference database--2006 update. Nucleic acids research 34: D411-414.
Mitchell, T.M. 1997. Machine Learning. McGraw-Hill.
Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K., and Troyanskaya, O.G. 2005.
Discovery of biological networks from diverse functional genomic data. Genome biology 6: R114.
Myers, C.L. and Troyanskaya, O.G. 2007. Context-sensitive data integration and prediction of biological networks.
Bioinformatics (Oxford, England) 23: 2322-2330.
Rhodes, D.R., Tomlins, S.A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A.,
and Chinnaiyan, A.M. 2005. Probabilistic model of the human protein-protein interaction network. Nature biotechnology
23: 951-959.
Rish, I. 2001. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial
Intelligence.
Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. 2006. BioGRID: a general repository for
interaction datasets. Nucleic acids research 34: D535-539.
Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., and Botstein, D. 2003. A Bayesian framework for combining
heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proceedings of the National
Academy of Sciences of the United States of America 100: 8348-8353.
Vastrik, I., D'Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis,
S. et al. 2007. Reactome: a knowledge base of biologic pathways and processes. Genome biology 8: R39.