Using a Literature-based NMF Model for Discovering Gene Functional Relationships

Elina Tjioe§, Michael Berry*, Ramin Homayouni‡, and Kevin HeinrichΘ
§Genome Science and Technology Graduate School, *Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville; ‡Bioinformatics Program, Department of Biology, University of Memphis, TN; and ΘComputable Genomix, LLC
ABSTRACT
The rapid growth of the biomedical literature and genomic information presents
a major challenge for determining the functional relationships among genes. Several
bioinformatics tools have been developed to extract and identify gene relationships
from various biological databases. In this study, we develop a Web-based
bioinformatics tool called Feature Annotation Using Nonnegative matrix factorization
(FAUN) to facilitate both the discovery and classification of functional relationships
among genes. Both the computational complexity and parameterization of
nonnegative matrix factorization (NMF) for processing gene sets are discussed. FAUN
is first tested on a small, manually constructed 50-gene collection that we, as well as
others, have previously used. We then apply FAUN to analyze several
microarray-derived gene sets obtained from studies of the developing cerebellum in
normal and
mutant mice. FAUN provides utilities for collaborative knowledge discovery and
identification of new gene relationships from text streams and repositories (e.g.,
MEDLINE). It is particularly useful for the validation and analysis of gene
associations suggested by microarray experimentation.
Fig. 1. FAUN screenshot showing use of dominant terms across genes
highly associated with the user-selected feature.
[Screenshot callouts: Gene List; Feature Classifier; PMID Citations in Entrez Gene; New Gene Document; Dominant Features.]
Fig. 2. FAUN screenshot of gene-to-gene correlation and feature strength.
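The screenshots rest on a factorization of the term-by-gene-document matrix into a term-by-feature matrix (W) and a feature-by-gene matrix (H). The following is a minimal sketch of that idea, not FAUN's implementation: it assumes Lee-Seung multiplicative updates for the Frobenius-norm objective, uses a hypothetical toy matrix with made-up term and gene names, and uses cosine similarity between columns of H as one plausible stand-in for the gene-to-gene correlation (the measure FAUN actually uses is not stated here).

```python
import numpy as np

def nmf(A, rank, iters=500, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix A (terms x gene docs) as W @ H.

    Assumption: plain multiplicative updates minimizing ||A - WH||_F.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = A.shape
    W = rng.random((n_terms, rank)) + eps   # term x feature
    H = rng.random((rank, n_docs)) + eps    # feature x gene
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Hypothetical toy data: rows are terms, columns are gene documents.
terms = ["amyloid", "plaque", "tumor", "apoptosis", "neuron", "migration"]
genes = ["geneA", "geneB", "geneC", "geneD"]
A = np.array([
    [5, 4, 0, 0],
    [3, 5, 0, 0],
    [0, 0, 6, 4],
    [0, 0, 4, 5],
    [2, 1, 0, 0],
    [0, 0, 1, 2],
], dtype=float)

W, H = nmf(A, rank=2)

def dominant_terms(W, feature, k=3):
    """Terms with the largest weights in one column of W."""
    order = np.argsort(W[:, feature])[::-1][:k]
    return [terms[i] for i in order]

def genes_for_feature(H, feature):
    """Genes ranked by their weight on one feature (one row of H)."""
    order = np.argsort(H[feature])[::-1]
    return [genes[j] for j in order]

def gene_similarity(H):
    """Cosine similarity between gene columns of H (assumed measure)."""
    cols = H / (np.linalg.norm(H, axis=0, keepdims=True) + 1e-12)
    return cols.T @ cols

S = gene_similarity(H)
```

Selecting a feature and calling `dominant_terms` and `genes_for_feature` mirrors the kind of view described for Fig. 1, while `S` plays the role of the gene-to-gene view in Fig. 2.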
DISCUSSION
A FAUN-based analysis of a new cerebellum gene set
has revealed new knowledge: the gene set contains
a large component of transcription factors.
GENE DOCUMENT TEST SET
[Figure: FAUN workflow. Diagram labels: PubMed Titles & Abstracts; Gene Document Collection; General Text Parser (GTP); Term x Gene Doc Matrix; Nonnegative Matrix Factorization (NMF); Term x Feature (W) Matrix; Feature x Gene (H) Matrix; Feature Annotation (Dominant Terms, Dominant Genes, Gene-to-Gene Correlation).]
Fig. 3. Classification of genes and features.
• Classification of genes in the 50-test-gene
document collection based on manual
examination of the biomedical literature.
• Classification of annotated features based
on manual examination of the dominant
terms in each feature.
Categories shown in Fig. 3 (number of genes; annotated features F):
  Alzheimer: 11 genes; F: 2, 6, 10, 11
  Cancer: 15 genes; F: 1, 3, 4, 5, 8, 9, 12, 14, 15, 17, 18, 20
  Development: 5 genes; F: 7, 13, 16, 19
  Alzheimer & Development: 3 genes
  Cancer & Development: 16 genes
Table 1. The number of genes that are classified
correctly using the strongest feature (SF), feature
weight threshold 3, 5, 6, 2, and 1 for T3, T5, T6, T2, and
T1 respectively. The actual number of genes that are
classified based on manual examination of the
biomedical literature is shown in the first column.

Category      Actual   SF   T3   T5   T6   T2   T1
Alzheimer         11   10    3    4    5    1    1
Cancer            15   15    9   12   12    7    7
Development        5    5    4    2    1    0    0
Alz & Dev          3    2    3    3    2    3    0
Can & Dev         16   13   11    8    6   12    2
Total genes       50   45   30   29   26   23   10
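The strongest-feature (SF) and feature-weight-threshold rules summarized in Table 1 can be sketched as follows. This is an illustration under stated assumptions, not FAUN's code: the matrix `H`, the mapping `feature_labels` from features to categories, and the exact threshold rule (a gene receives every category that has a feature weight at or above the threshold) are hypothetical. Only the totals passed to `accuracy` come from Table 1.

```python
import numpy as np

def classify_strongest(H, feature_labels):
    """Label each gene (column of H) with the category of its largest feature weight."""
    return [feature_labels[f] for f in np.argmax(H, axis=0)]

def classify_threshold(H, feature_labels, threshold):
    """Assumed rule: a gene gets every category with a feature weight >= threshold.

    The result for a gene may be empty or contain several categories.
    """
    result = []
    for col in H.T:
        hits = np.flatnonzero(col >= threshold)
        result.append(sorted({feature_labels[f] for f in hits}))
    return result

def accuracy(n_correct, n_total):
    """Fraction of genes classified correctly."""
    return n_correct / n_total

# Hypothetical feature-by-gene weights: 3 annotated features, 4 genes.
feature_labels = ["Alzheimer", "Cancer", "Development"]
H = np.array([
    [4.0, 0.2, 0.1, 3.5],
    [0.3, 6.0, 0.4, 0.2],
    [0.1, 0.5, 2.5, 3.2],
])

sf_labels = classify_strongest(H, feature_labels)
t3_labels = classify_threshold(H, feature_labels, threshold=3)

# Totals from Table 1: genes classified correctly out of 50, per rule.
table1_totals = {"SF": 45, "T3": 30, "T5": 29, "T6": 26, "T2": 23, "T1": 10}
table1_accuracy = {rule: accuracy(n, 50) for rule, n in table1_totals.items()}
```

Under the assumed threshold rule a gene can fall into two categories at once, which is one way to represent the combined Alz & Dev and Can & Dev classes. A gene with no feature above the threshold stays unclassified.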
For a preliminary assessment of FAUN feature
classification, each gene in the 50-test-gene
collection was classified based on its most dominant
annotated feature or based on some feature weight
threshold. The FAUN classification using the
strongest feature (per gene) yielded 90% accuracy.
FUTURE WORK
• Capture annotation of features across different
resolutions (ranks of nonnegative matrix factorization).
• Improve the FAUN model by optimizing the NMF rank;
consider use of smoothness constraints.
• Build a FAUN-based classifier with annotated features
for the real-time classification of new gene document
streams.
REFERENCES
1. Heinrich, K.E., Berry, M.W., Homayouni, R. (2008) Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature. Journal of Computational Intelligence and Neuroscience, to appear.
2. Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J. (2007) Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics & Data Analysis 52(1), 155-173.
3. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J. (2006) Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2), 373-386.
4. Homayouni, R., Heinrich, K., Wei, L., Berry, M.W. (2005) Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts. Bioinformatics 21(1), 104-115.
FAUN site: https://shad.eecs.utk.edu/faun