Central Dogma Of Molecular Biology

Nonnegative Tensor Factorization
of Biomedical Literature for
Analysis of Genomic Data
Sujoy Roy, Ramin Homayouni (UofM)
Michael W. Berry, Andrey A. Puretskiy (UT)
Text Mining Workshop
SIAM Data Mining Conference 2011
DNA (‘Blueprint’) → mRNA (‘Transcript’) → protein (‘Function’)
Biotechnology is Driving Biology: The ‘omics’ era
Gene Regulatory Mechanisms
[Figure labels: DNA, mRNA, protein, organism; Biotechnology: Genome Sequencing, Microarray Expression Analysis, Mass Spectrometry “Proteomics”, ???; Molecular Biology of the Cell]
Gene Expression Profiling
• Genes that are coordinately up- or down-regulated function together.
• What are the key transcription factors that regulate the expression of gene sets?
Differentially expressed genes
MEDLINE
• MEDLINE is the premier bibliographic database for biomedicine, supported by the National Library of Medicine.
• MEDLINE contains >20 million references, most of which have abstracts.
• MEDLINE covers over 4800 journals, in over 30 languages.
• MEDLINE citations date back to 1966.
• Free abstracts!!
Alizadeh, et al., (2000) Nature 403:503.
Goal: Identify relationships between genes and transcription factors via PubMed
Genes
Transcription Factors
PubMed Abstracts
Matrix Factorization Techniques
• Singular Value Decomposition (SVD)
(Homayouni et al., 2005)
• Nonnegative Matrix Factorization (NMF)
(Heinrich et al., 2008; Tjioe et al., 2010)
• Nonnegative Tensor Factorization (NTF)
Tensor construction
[Figure: term × gene × TF tensor T, with genes (Gene_1, Gene_2, …) and TFs (TF_1, TF_2, …) indexing two of its modes]
$T_{ijk}$ = frequency count of term$_i$ in the set of abstracts shared by gene$_j$ and TF$_k$
Nonnegative Tensor Factorization
$T \approx x_1 \circ y_1 \circ z_1 + x_2 \circ y_2 \circ z_2 + \dots + x_k \circ y_k \circ z_k$
(the vectors $x_i$, $y_i$ and $z_i$ are nonnegative and run over terms, genes and TFs respectively)
Optimization problem: $\min \left\| T - \sum_{i=1}^{k} x_i \circ y_i \circ z_i \right\|_F^2$
-- May be construed as 3-way clustering
-- Generates ranked lists of terms, genes and TFs
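The optimization problem above can be illustrated with a minimal sketch. It assumes a small dense tensor and uses Lee-Seung style multiplicative updates, which is one common way to fit a nonnegative CP model and not necessarily the algorithm used for the results in this talk. The names `khatri_rao`, `ntf`, `reconstruct` and `ranked_indices` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product: (I x R) and (J x R) -> (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def ntf(tensor, rank, n_iter=200, eps=1e-9, seed=0):
    """Fit nonnegative factors [x, y, z] so that sum_i x_i o y_i o z_i ~ tensor."""
    rng = np.random.default_rng(seed)
    # Random nonnegative start; multiplicative updates keep entries >= 0.
    factors = [rng.random((dim, rank)) for dim in tensor.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Unfold along `mode`; the other two axes keep their relative order.
            unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfolded @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps  # eps avoids division by zero
            factors[mode] *= numer / denom
    return factors


def reconstruct(factors):
    """Sum of rank-one outer products x_i o y_i o z_i."""
    x, y, z = factors
    return np.einsum("ir,jr,kr->ijk", x, y, z)


def ranked_indices(factor, component):
    """Indices of one mode (terms, genes or TFs), highest weight first."""
    return np.argsort(-factor[:, component])


# Toy usage on a synthetic 6 x 5 x 4 tensor (made-up data, for illustration only).
toy = reconstruct([np.random.default_rng(1).random((n, 2)) for n in (6, 5, 4)])
x, y, z = ntf(toy, rank=2, n_iter=100)
top_terms_in_cluster_0 = ranked_indices(x, 0)
```

Each component i plays the role of one cluster: sorting the entries of x_i, y_i and z_i gives the ranked lists of terms, genes and TFs mentioned above. For a tensor as sparse as the one in the pilot study, a sparse tensor library would be a more practical choice than this dense sketch.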
Pilot Study: Interferon Signaling
Workflow
Microarray Data → Fold change and Significance analysis → Significant Genes
Significant Genes + TF List (from TRANSFAC) → termXgeneXTF tensor
NTF analysis (k=1) → 1 Term/Gene/TF cluster
…
NTF analysis (k=3) → 3 Term/Gene/TF clusters
…
NTF analysis (k=30) → 30 Term/Gene/TF clusters
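The tensor-building step of the workflow can be sketched as follows, based on the definition of T_ijk as the frequency count of term i in the abstracts shared by gene j and TF k. This is only an illustration under stated assumptions: the input dictionaries, the function `build_tensor_entries`, and the toy abstract IDs and terms are all hypothetical, and "frequency count" is taken here to mean total term occurrences across the shared abstracts.

```python
from collections import Counter


def build_tensor_entries(gene_abstracts, tf_abstracts, abstract_terms):
    """Return {(term, gene, tf): count} over abstracts shared by gene and TF.

    gene_abstracts / tf_abstracts map a name to a set of abstract IDs;
    abstract_terms maps an abstract ID to its list of (already filtered) terms.
    Only nonzero entries are stored, which suits a very sparse tensor.
    """
    entries = {}
    for gene, gene_ids in gene_abstracts.items():
        for tf, tf_ids in tf_abstracts.items():
            counts = Counter()
            for abstract_id in gene_ids & tf_ids:  # abstracts citing both
                counts.update(abstract_terms[abstract_id])
            for term, n in counts.items():
                entries[(term, gene, tf)] = n
    return entries


# Toy inputs (made up for illustration; not data from the pilot study).
gene_abstracts = {"Gene_1": {1, 2}, "Gene_2": {2, 3}}
tf_abstracts = {"TF_1": {2}, "TF_2": {3, 4}}
abstract_terms = {
    1: ["kinase"],
    2: ["interferon", "signaling", "interferon"],
    3: ["receptor", "interferon"],
    4: ["receptor"],
}
entries = build_tensor_entries(gene_abstracts, tf_abstracts, abstract_terms)
```

Gene_1 and TF_2 share no abstracts in this toy input, so no entries are created for that pair.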
Visualization of termXgeneXTF tensor
-- 2325 terms, 86 genes and 409 TFs.
-- 81779550 elements, 22495 nonzero values
-- Density 0.03%
Sample Clusters
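The element count and density quoted above follow directly from the tensor dimensions. A short check, using only the numbers on the slide (the function name `tensor_density` is hypothetical):

```python
def tensor_density(shape, nonzero):
    """Return (total number of elements, fraction of elements that are nonzero)."""
    total = 1
    for dim in shape:
        total *= dim
    return total, nonzero / total


# 2325 terms x 86 genes x 409 TFs, 22495 nonzero values (from the slide).
total, density = tensor_density((2325, 86, 409), 22495)
print(total)              # 81779550
print(f"{density:.2%}")   # 0.03%
```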
Global Analysis of top ranked terms, genes and TFs
Distribution of percentage counts of top ranked entities in clusters across k
[Figure panels: terms, genes, TFs]
Cluster Analysis
Validation: KEGG interferon signaling pathway
IFN induced TLR and Jak-Stat signaling pathways derived from KEGG. The red, blue and olive arrows point to the terms, genes and TFs respectively found in the 'ifn' cluster obtained by nonnegative tensor factorization at k=3.
Precision analysis using KEGG as gold standard
11 point average precision values for genes and TFs in 'ifn' clusters obtained with tensor rank k=3, 5 and 25. Precision and recall were calculated using TLR and IFN signaling pathway information in KEGG as gold standard.
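For readers unfamiliar with the 11 point average precision metric, a minimal sketch follows. It assumes the standard interpolated definition from information retrieval (precision at recall levels 0.0, 0.1, …, 1.0, each taken as the maximum precision at any recall at or above that level); the slides do not state which variant was used. The function name and the toy ranked list are hypothetical.

```python
def eleven_point_average_precision(ranked, relevant):
    """Interpolated 11-point average precision of a ranked list.

    ranked: items ordered from highest to lowest score (e.g. genes in a cluster).
    relevant: gold-standard items (e.g. members of a reference pathway).
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("gold standard must not be empty")
    hits = 0
    points = []  # (recall, precision) after each rank position
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolation: best precision at any recall >= this level.
        reachable = [p for r, p in points if r >= recall_level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11


# Toy example with made-up items: relevant ones sit at ranks 1 and 3.
score = eleven_point_average_precision(["a", "b", "c", "d"], {"a", "c"})
```

In the toy example, precision is 1.0 for recall levels up to 0.5 and 2/3 above that, so the score is (6 × 1.0 + 5 × 2/3) / 11 = 28/33.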
Discovery: Link to Cancer
Cluster with Rara as the top ranking TF, obtained with k=10.
Graph derived from the Chilibot NLP tool showing the sentence level correlation for Rara with various entities in their respective clusters. Genes, TFs, and terms are represented as nodes (rectangles) and the relationships between them as edges.
Future Work
• Expand to more datasets.
• Comprehensive analysis of the effect of factorization rank ‘k’: k = 1 to 50.
• Scaling/Normalization.
• Automated cluster labeling based on the terms.
Acknowledgements
• Ramin Homayouni (University of Memphis)
• Michael W. Berry, Andrey A. Puretskiy (University of Tennessee, Knoxville)
• Brett W. Bader, Tamara G. Kolda (Sandia National Laboratories)