Using a Literature-based NMF Model for Discovering Gene Functional Relationships

Elina Tjioe§, Michael Berry*, Ramin Homayouni‡, and Kevin HeinrichΘ

§Genome Science and Technology Graduate School and *Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville; ‡Bioinformatics Program, Department of Biology, University of Memphis, TN; and ΘComputable Genomix, LLC

ABSTRACT

The rapid growth of the biomedical literature and of genomic information presents a major challenge for determining the functional relationships among genes. Several bioinformatics tools have been developed to extract and identify gene relationships from various biological databases. In this study, we develop a Web-based bioinformatics tool called Feature Annotation Using Nonnegative matrix factorization (FAUN) to facilitate both the discovery and the classification of functional relationships among genes. Both the computational complexity and the parameterization of nonnegative matrix factorization (NMF) for processing gene sets are discussed. FAUN is first tested on a small, manually constructed 50-gene collection that we, as well as others, have previously used. We then apply FAUN to analyze several microarray-derived gene sets obtained from studies of the developing cerebellum in normal and mutant mice. FAUN provides utilities for collaborative knowledge discovery and for the identification of new gene relationships from text streams and repositories (e.g., MEDLINE). It is particularly useful for validating and analyzing gene associations suggested by microarray experimentation.

Fig. 1. FAUN screenshot showing use of dominant terms across genes highly associated with the user-selected feature.

Fig. 2. FAUN screenshot of gene-to-gene correlation and feature strength.

Screenshot labels: Gene List; Feature Classifier; PMID Citations in Entrez Gene; New Gene; Dominant Document Features.
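To make the factorization step concrete, the sketch below approximates a small nonnegative term-by-gene-document matrix A as the product of a term-by-feature matrix W and a feature-by-gene matrix H. It is only an illustration under stated assumptions: the multiplicative update rule shown is one common NMF algorithm and is not necessarily the one FAUN runs, and the function name `nmf`, the toy matrix, the term list, and the gene names are all hypothetical.

```python
import numpy as np

def nmf(A, rank, iterations=1000, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix A (terms x gene documents) as W @ H.

    Uses multiplicative updates for the Frobenius-norm objective; this is an
    assumed algorithm choice for illustration, not FAUN's documented one.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_genes = A.shape
    W = rng.random((n_terms, rank))   # term x feature
    H = rng.random((rank, n_genes))   # feature x gene
    for _ in range(iterations):
        # eps keeps the denominators away from zero
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are terms, columns are gene documents.
terms = ["amyloid", "plaque", "tumor", "apoptosis"]
genes = ["geneA", "geneB", "geneC", "geneD"]
A = np.array([
    [2.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
])

W, H = nmf(A, rank=2)

# Relative reconstruction error of the rank-2 model.
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)

# Dominant terms of each feature: the terms with the largest weights in W.
dominant_terms = [
    [terms[i] for i in np.argsort(W[:, k])[::-1][:2]] for k in range(W.shape[1])
]

# Strongest feature of each gene: the row of H with the largest weight.
strongest_feature = H.argmax(axis=0)
```

In this toy setting the columns of W play the role of features described by their dominant terms, and the columns of H give each gene's weight on every feature. The rank (here 2) is the parameter whose choice the abstract refers to as parameterization.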
DISCUSSION

Workflow diagram labels: Gene Document Test Set (PubMed Titles & Abstracts) -> Gene Document Collection -> General Text Parser (GTP) -> Term x Gene Doc Matrix -> Nonnegative Matrix Factorization (NMF) -> Term x Feature (W) Matrix and Feature x Gene (H) Matrix -> Feature Annotation (Dominant Terms, Dominant Genes, Gene-to-Gene Correlation).

For a preliminary assessment of FAUN feature classification, each gene in the 50-test-gene collection was classified based either on its most dominant annotated feature or on a feature weight threshold. The FAUN classification using the strongest feature (per gene) yielded 90% accuracy (45 of 50 genes; Table 1).

Fig. 3. Classification of genes and features.
• Classification of genes in the 50-test-gene document collection based on manual examination of the biomedical literature.
• Classification of annotated features based on manual examination of the dominant terms in each feature.
Diagram labels: Alzheimer (11 genes; features 2, 6, 10, 11); Cancer (15 genes; features 1, 3, 4, 5, 8, 9, 12, 14, 15, 17, 18, 20); Development (5 genes; features 7, 13, 16, 19); Alzheimer and Development (3 genes); Cancer and Development (16 genes).

Table 1. Number of genes classified correctly using the strongest feature (SF) and using feature weight thresholds of 3, 5, 6, 2, and 1 (T3, T5, T6, T2, and T1, respectively). The Actual column gives the number of genes in each class based on manual examination of the biomedical literature.

Class          Actual   SF   T3   T5   T6   T2   T1
Alzheimer          11   10    3    4    5    1    1
Cancer             15   15    9   12   12    7    7
Development         5    5    4    2    1    0    0
Alz & Dev           3    2    3    3    2    3    0
Can & Dev          16   13   11    8    6   12    2
Total genes        50   45   30   29   26   23   10

A FAUN-based analysis of a new cerebellum gene set has revealed new knowledge: the gene set contains a large component of transcription factors.

FUTURE WORK

• Capture the annotation of features across different resolutions (ranks of the nonnegative matrix factorization).
• Improve the FAUN model by optimizing the NMF rank; consider the use of smoothness constraints.
• Build a FAUN-based classifier with annotated features for the real-time classification of new gene document streams.

REFERENCES

1. Heinrich, K.E., Berry, M.W., Homayouni, R. (2008) Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature. Journal of Computational Intelligence and Neuroscience, to appear.
2. Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J. (2007) Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics & Data Analysis 52(1), 155-173.
3. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J. (2006) Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2), 373-386.
4. Homayouni, R., Heinrich, K., Wei, L., Berry, M.W. (2005) Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts. Bioinformatics 21(1), 104-115.

FAUN site: https://shad.eecs.utk.edu/faun
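As a supplementary illustration of the assessment described in the Discussion, the sketch below shows one way to classify a gene by its strongest annotated feature or by a feature weight threshold, and recomputes the accuracies implied by the Table 1 totals. The function names, the three-feature annotation subset, and the example gene weights are hypothetical; only the correct-count totals are taken from Table 1, and the threshold rule (assign every class whose feature weight meets the threshold) is an assumption about how such a classifier might work.

```python
# Illustrative subset of feature annotations (feature number -> class).
FEATURE_CLASS = {1: "Cancer", 2: "Alzheimer", 7: "Development"}

def classify_strongest(weights, annotation=FEATURE_CLASS):
    """Return the class of the feature with the largest weight for one gene."""
    best_feature = max(weights, key=weights.get)
    return {annotation[best_feature]}

def classify_threshold(weights, threshold, annotation=FEATURE_CLASS):
    """Return the classes of all features whose weight meets the threshold."""
    return {annotation[f] for f, w in weights.items() if w >= threshold}

# Hypothetical feature weights for a single gene (one column of H).
gene_weights = {1: 0.2, 2: 1.4, 7: 0.6}
by_strongest = classify_strongest(gene_weights)
by_threshold = classify_threshold(gene_weights, threshold=0.5)

# Correctly classified genes per method, from the "Total genes" row of Table 1.
TOTAL_GENES = 50
correct = {"SF": 45, "T3": 30, "T5": 29, "T6": 26, "T2": 23, "T1": 10}
accuracy = {method: n / TOTAL_GENES for method, n in correct.items()}

for method, value in accuracy.items():
    print(f"{method}: {value:.0%}")
```

With these totals the strongest-feature rule reaches 45/50, the 90% accuracy quoted in the Discussion, while every threshold-based column falls at or below 60%.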