Text mining, pathways and disease
Andrey Rzhetsky

Scientists are like plants
biomass -> texts/studies
sunlight, water -> peaceful time and resources for work

Production of new texts (in pages): new additions annually

We estimated that the world’s libraries host at least a TRILLION unique scholarly pages (Evans and Rzhetsky, 2009; analysis of the Online Computer Library Center’s WorldCat database of books and journals in 71,000 libraries across 121 countries).

A bit of context
PubMed search for cancer
Results: 2,358,785 papers (March 17, 2010)

GeneWays as an infogrinder
On-line journals -> GeneWays -> pathways

Graph: multi-type arcs and nodes

Context-free grammar: production rules
S -> A activates B
B -> binding of D by C
A -> p53
C -> bax
D -> bad

Context-free grammar: non-terminals
S, A, B, C, D

Context-free grammar: terminals
p53, bax, bad

Context-free grammar: parse tree
S
  A
    p53
  activates
  B
    binding of
    D
      bad
    by
    C
      bax

Context-free grammar: parsing
Subtree B (binding of D by C, with D = bad and C = bax):
[bind, [protein, bad], [protein, bax]]

Subtree S (A activates B, with A = p53):
[activate, [protein, p53], [action, B]]

Full sentence: “Our experiments show that p53 activates binding of bax by bad.”
[activate, [protein, p53], [bind, [protein, bad], [protein, bax]]]

To be more concrete…
The phosphorylation of ATP-citrate lyase by NDPK suggests that NDPK may have a role in the regulation of membrane biosynthesis.
-> ndpk phosphorylates atp-citrate lyase

Typical arcs
1001,'bind'  1004,'suppress'  1011,'replace'  1018,'interact'
1020,'activate'  1022,'stimulate'  1023,'phosphorylate'  1027,'increase'
1028,'associate'  1034,'up-regulate'  1036,'inhibit'  1040,'promote'
1041,'down-regulate'  1043,'trigger'  1049,'block'  1054,'modify'
1057,'digest'  1058,'degrade'  1062,'link'  1071,'cleave'
1072,'release'  1074,'catalyze'  1083,'inactivate'  1106,'repress'
1110,'acetylate'  1117,'methylate'

Typical nodes (the number in each entry is the database ID)
17767,'calcium channel antagonists'  20324,'hsp70 chaperone'
17467,'activator protein 1'  5104,'daunorubicin'
13194,'tyrosyl-phosphorylated'  9689,'paroxonase'
4190,'immunodeficiency'  4478,'iga2'
8552,'human fcgammarii'  4472,'iga1'
13151,'ikaros'  9820,'caveolin 1'
7277,'virus-triggered p-dcs'  4366,'complexes pr-3'
12290,'anti-alpha4 mabs'  2258,'gal4-mef2d'
14464,'polyneuropathy'  16044,'alk5'
10393,'mek-1 inhibitor'  13262,'pro-matrilysin'

Graph: multi-type arcs and nodes
Useful hairballs… dandelions… vermicelli… dust bunnies… rat’s nests…
Well, if reality is complicated, so is the corresponding figure.

Intel 8086: breadboard

Making the case for utility
Looking at cerebellar malformations through text-mined interactomes of mice and humans
Ivan Iossifov, Raul Rodriguez-Esteban, Ilya Mayzus, Kathleen J. Millen, and Andrey Rzhetsky
PLoS Computational Biology, 2009

Application to discovery of genes related to heritable disorders
Molecular triangulation
We (again!) generated a number of high-confidence predictions about associations of phenotypes with genes.
Testing experimentally: to be continued…
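The context-free grammar slides earlier in the deck (S -> A activates B; B -> binding of D by C; A -> p53; C -> bax; D -> bad) can be sketched as a tiny recursive-descent parser. This is only an illustration of how those five rules map a sentence to a nested structure; it is not the GeneWays parser, and the function names (`expect`, `parse_protein`, `parse_b`, `parse_s`, `parse`) are hypothetical. It parses only the string the grammar itself generates, "p53 activates binding of bad by bax".

```python
# Toy recursive-descent parser for the five production rules on the slides.
# Illustrative sketch only; all function names here are hypothetical.

PROTEINS = {"A": "p53", "C": "bax", "D": "bad"}  # rules A -> p53, C -> bax, D -> bad


def expect(tokens, pos, word):
    """Consume one literal word at tokens[pos] or raise ValueError."""
    if pos >= len(tokens) or tokens[pos] != word:
        raise ValueError(f"expected {word!r} at position {pos}")
    return pos + 1


def parse_protein(tokens, pos, nonterminal):
    """Parse A, C or D, each of which rewrites to a single protein name."""
    name = PROTEINS[nonterminal]
    pos = expect(tokens, pos, name)
    return ["protein", name], pos


def parse_b(tokens, pos):
    """B -> binding of D by C"""
    pos = expect(tokens, pos, "binding")
    pos = expect(tokens, pos, "of")
    d, pos = parse_protein(tokens, pos, "D")
    pos = expect(tokens, pos, "by")
    c, pos = parse_protein(tokens, pos, "C")
    return ["bind", d, c], pos


def parse_s(tokens, pos=0):
    """S -> A activates B"""
    a, pos = parse_protein(tokens, pos, "A")
    pos = expect(tokens, pos, "activates")
    b, pos = parse_b(tokens, pos)
    return ["activate", a, b], pos


def parse(sentence):
    """Tokenize on whitespace and require that S covers the whole input."""
    tokens = sentence.lower().rstrip(".").split()
    tree, pos = parse_s(tokens)
    if pos != len(tokens):
        raise ValueError("trailing words after a complete parse")
    return tree


print(parse("p53 activates binding of bad by bax"))
# ['activate', ['protein', 'p53'], ['bind', ['protein', 'bad'], ['protein', 'bax']]]
```

A real extraction system would need a far larger grammar and a way to skip framing text such as "Our experiments show that"; the sketch deliberately does neither.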
An alternative title: about useful hairballs… dandelions… ridiculograms… dust bunnies… rat’s nests…

Intel 8086: breadboard

Using the position in networks to describe function (from Mark Gerstein)
[NY Times, 2-Oct-05, 9-Dec-08]

Star witness: D-SPOP

D-SPOP: phenotype
Panels: wild eye; Egr-triggered cell death; one copy of D-SPOP deleted

D-SPOP as marker
In humans, SPOP was highly expressed in 99% of clear cell renal cell carcinomas, the most prevalent form of kidney cancer.

A case for (marginal) usefulness of our efforts

Next: A scary story…

The invisible plague
Steady rise of prevalence of neurodevelopmental disorders during the last 250 years:
Autism
Bipolar disorder
Schizophrenia

Genetic-linkage Mapping of Complex Hereditary Disorders to a Whole-genome Molecular-interaction Network
AR, Tian Zheng, Miron Baron, T. Conrad Gilliam

One-chromosome example
[Figure: two groups of individuals marked ☺ and ☹, labeled "Sick" and "Healthy"]

Hypothetical pedigree

Model

Now, we have 23 chromosomes in two copies, ~25,000 genes
Combinations of genes:
10^8 -- 2 genes
10^12 -- 3 genes
10^16 -- 4 genes
10^37 -- 10 genes

It doesn’t make sense (statistically) to test all possible combinations of genes for association with disease.

Possible way out: test only functionally plausible combinations of genes

Graph: multi-type arcs and nodes

Graph: only physical interactions (such as bind, phosphorylate, etc.)
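The orders of magnitude on the "Combinations of genes" slide can be checked with a few lines of Python. This is a minimal sketch that assumes exactly 25,000 genes (the slide says ~25,000) and counts unordered gene sets with the binomial coefficient; the helper name `order_of_magnitude` is made up for this illustration.

```python
from math import comb

N_GENES = 25_000  # assumption: the slide's "~25,000 genes" taken as exact


def order_of_magnitude(n: int) -> int:
    """Return floor(log10(n)) for a positive integer, computed exactly."""
    return len(str(n)) - 1


for k in (2, 3, 4, 10):
    n_sets = comb(N_GENES, k)  # number of unordered k-gene combinations
    print(f"{k:>2} genes: about 10^{order_of_magnitude(n_sets)} combinations")
```

Even at one statistical test per combination, the 10-gene case is far beyond what any multiple-testing correction could absorb, which is the motivation for restricting the search to functionally plausible gene sets.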
Then we realized that the clusters should be allowed to have arbitrary topology.

Genetic heterogeneity model
[Figure labels: pedigrees, phenotypes, marker states; parameters; cluster probability for gene 1; cluster probability for gene c; disease]

We did this exercise for all three disorders (autism, bipolar disorder, schizophrenia). We then combined the most significant predictions.

“Our” predictions indeed form overlapping networks!