Nonnegative Tensor Factorization of Biomedical Literature for Analysis of Genomic Data
Sujoy Roy, Ramin Homayouni (UofM)
Michael W. Berry, Andrey A. Puretskiy (UT)
Text Mining Workshop, SIAM Data Mining Conference 2011

Central Dogma of Molecular Biology
[Diagram: DNA ('Blueprint') → mRNA ('Transcript') → protein ('Function')]

Biotechnology is Driving Biology: The 'omics' era
[Diagram labels: Gene Regulatory Mechanisms; DNA, mRNA, protein, organism; Biotechnology: Genome Sequencing, Microarray Expression Analysis, Mass Spectrometry ("Proteomics"), ???; Molecular Biology of the Cell; MEDLINE]

Gene Expression Profiling
• Genes that are coordinately up- or down-regulated function together.
• What are the key transcription factors that regulate the expression of gene sets?
[Figure: Differentially expressed genes. Alizadeh et al. (2000) Nature 403:503.]

MEDLINE
• MEDLINE is the premier bibliographic database for biomedicine, supported by the National Library of Medicine.
• MEDLINE contains >20 million references, most of which have abstracts.
• MEDLINE covers over 4800 journals, in over 30 languages.
• MEDLINE citations date back to 1966.
• Free abstracts!
Goal: Identify relationships between genes and transcription factors via PubMed
[Diagram labels: Genes; Transcription Factors; PubMed Abstracts]

Matrix Factorization Techniques
• Singular Value Decomposition (SVD) (Homayouni et al., 2005)
• Nonnegative Matrix Factorization (NMF) (Heinrich et al., 2008; Tjioe et al., 2010)
• Nonnegative Tensor Factorization (NTF)

Tensor construction
$T_{ijk}$ = frequency count of term$_i$ in the set of abstracts shared by gene$_j$ and TF$_k$
[Diagram: term-by-gene slices for TF_1, TF_2, … stacked along the TF axis; gene axis labeled Gene_1, Gene_2, …]

Nonnegative Tensor Factorization
$$T \approx x_1 \circ y_1 \circ z_1 + x_2 \circ y_2 \circ z_2 + \dots + x_k \circ y_k \circ z_k$$
where $x_i$, $y_i$ and $z_i$ are nonnegative vectors over terms, genes and TFs respectively.
Optimization problem:
$$\min_{x_i,\, y_i,\, z_i \ge 0} \left\| T - \sum_{i=1}^{k} x_i \circ y_i \circ z_i \right\|_F^2$$
-- May be construed as 3-way clustering
-- Generates ranked lists of terms, genes and TFs

Pilot Study: Interferon Signaling

Workflow
[Flowchart: Microarray Data → Fold change and Significance analysis → Significant Genes; Significant Genes + TF List (from TRANSFAC) → termXgeneXTF tensor → NTF analysis (k=1) → Term/Gene/TF cluster; … NTF analysis (k=3) → 3 Term/Gene/TF clusters; … NTF analysis (k=30) → 30 Term/Gene/TF clusters]

Visualization of termXgeneXTF tensor
-- 2325 terms, 86 genes and 409 TFs.
-- 81779550 elements, 22495 nonzero values
-- Density 0.03%

Sample Clusters
[Figure: sample Term/Gene/TF clusters]

Global Analysis of top ranked terms, genes and TFs
[Figure, panels for terms, genes and TFs: Distribution of percentage counts of top ranked entities in clusters across k]

Cluster Analysis

Validation: KEGG interferon signaling pathway
[Figure: IFN induced TLR and Jak-Stat signaling pathways derived from KEGG. The red, blue and olive arrows point to the terms, genes and TFs respectively found in the 'ifn' cluster obtained by nonnegative tensor factorization at k=3.]

Precision analysis using KEGG as gold standard
[Figure: 11 point average precision values for genes and TFs in 'ifn' clusters obtained with tensor rank k=3, 5 and 25. Precision and recall were calculated using TLR and IFN signaling pathway information in KEGG as gold standard.]
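
Illustrative sketch: NTF on a small term x gene x TF tensor
The slides state the NTF optimization problem but not the algorithm used to solve it. The sketch below is one simple possibility, assuming multiplicative updates (a tensor generalization of the Lee-Seung NMF updates) written in NumPy. The function names `ntf`, `rel_error` and `top_ranked`, the iteration count and the `eps` safeguard are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product; row index is (row of a) * len(b) + (row of b)."""
    return np.einsum("ir,jr->ijr", a, b).reshape(-1, a.shape[1])

def ntf(T, k, n_iter=200, eps=1e-9, seed=0):
    """Rank-k nonnegative factorization of a 3-way tensor (terms x genes x TFs).

    Assumption: multiplicative updates; the source does not name its solver.
    Returns X (terms x k), Y (genes x k), Z (TFs x k), all nonnegative.
    """
    T = np.asarray(T, dtype=float)
    n_terms, n_genes, n_tfs = T.shape
    rng = np.random.default_rng(seed)
    X = rng.random((n_terms, k))
    Y = rng.random((n_genes, k))
    Z = rng.random((n_tfs, k))
    # Unfold the tensor along each mode so every update is a matrix product.
    T1 = T.reshape(n_terms, -1)                      # terms x (genes * TFs)
    T2 = np.moveaxis(T, 1, 0).reshape(n_genes, -1)   # genes x (terms * TFs)
    T3 = np.moveaxis(T, 2, 0).reshape(n_tfs, -1)     # TFs   x (terms * genes)
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so nonnegativity is preserved.
        X *= (T1 @ khatri_rao(Y, Z)) / (X @ ((Y.T @ Y) * (Z.T @ Z)) + eps)
        Y *= (T2 @ khatri_rao(X, Z)) / (Y @ ((X.T @ X) * (Z.T @ Z)) + eps)
        Z *= (T3 @ khatri_rao(X, Y)) / (Z @ ((X.T @ X) * (Y.T @ Y)) + eps)
    return X, Y, Z

def rel_error(T, X, Y, Z):
    """Relative Frobenius error ||T - sum_i x_i o y_i o z_i||_F / ||T||_F."""
    approx = np.einsum("ir,jr,lr->ijl", X, Y, Z)
    return np.linalg.norm(T - approx) / np.linalg.norm(T)

def top_ranked(column, n):
    """Indices of the n largest entries of one factor column (a ranked list)."""
    return np.argsort(-column)[:n]

# Toy data only: a random nonnegative rank-2 tensor, not the study's tensor.
rng = np.random.default_rng(1)
A, B, C = rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
T_toy = np.einsum("ir,jr,lr->ijl", A, B, C)
X, Y, Z = ntf(T_toy, k=2)
print("relative error:", rel_error(T_toy, X, Y, Z))
print("top terms in component 0:", top_ranked(X[:, 0], 3))
```

Each column triple (X[:, i], Y[:, i], Z[:, i]) plays the role of one Term/Gene/TF cluster; sorting a column in descending order gives the ranked list for that cluster. A tensor as sparse as the one in the pilot study (density 0.03%) would call for a sparse tensor representation rather than the dense arrays used here.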
Discovery: Link to Cancer
[Figure: Cluster containing the top ranking TF as Rara obtained with k=10. Graph derived from Chilibot NLP tool showing the sentence level correlation for Rara with various entities in their respective clusters. Genes, TFs, and terms are represented as nodes (rectangles) and the relationship between them as edges.]

Future Work
• Expand to more datasets.
• Comprehensive analysis of the effect of factorization rank 'k' (k = 1 to 50).
• Scaling/Normalization.
• Automated cluster labeling based on the terms.

Acknowledgements
• Ramin Homayouni (University of Memphis)
• Michael W. Berry, Andrey A. Puretskiy (University of Tennessee, Knoxville)
• Brett W. Bader, Tamara G. Kolda (Sandia National Laboratories)