Text Mining Applications for Literature Curation
Kimberly Van Auken
WormBase Consortium, Textpresso, Gene Ontology Consortium

WormBase: A Database for C. elegans and Other Nematodes
www.wormbase.org

Curating Diverse Data Types
Aggregation Behavior
Which worms aggregate with other worms and what contributes to that behavior?
Bendesky et al., 2012, PLoS Genetics

Curating Diverse Data Types
Aggregation Behavior
Which worms (Strain) aggregate with other worms and what contributes to that behavior?
Bendesky et al., 2012, PLoS Genetics
Strain information:
  August 1, 1972
  Pineapple field in Hawaii

Curating Diverse Data Types
Aggregation Behavior
Which worms aggregate with other worms (Phenotype) and what contributes to that behavior?
Bendesky et al., 2012, PLoS Genetics
Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)
Life stage ontology, e.g., L3 larval stage
Assay, e.g., food source

Curating Diverse Data Types
Aggregation Behavior
Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)?
Gene: npr-1
Bendesky et al., 2012, PLoS Genetics
Variation: ad609 (T(83)->I and T(144)->A)
Gene Ontology for npr-1:
  Biological Process: feeding behavior
  Molecular Function: neuropeptide receptor activity
  Cellular Component: integral to plasma membrane

Literature Curation Workflow
PubMed keyword search – ‘elegans’
Full text paper acquisition
Data type flagging and entity recognition
Detailed curation/Fact extraction

Finding Papers:
Daily, automated PubMed searches using keyword ‘elegans’
Download citation XML
PMID, Title, Authors, Abstract, Article type, Journal, Curator actions

Literature Curation Workflow – Full Text Acquisition
Fully manual step
Done for all papers we select
Electronic copies stored in curation database

Data Type Flagging/Triage
General classification of papers
What types of experiments are in a paper?
e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions

Data Type Flagging Methods
Main pipeline: Support Vector Machines (SVMs)
Other methods:
  Textpresso category searches
  Hidden Markov models
  Pattern matching scripts

Support Vector Machines: Document Classification
Machine learning models
Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)
Positives: 100s, Negatives: 1000s
Resulting model classifies all new papers as negative or positive (high, medium, low confidence)

Data Type Flagging – Support Vector Machines
SVMs trained for ten different data types:
  Antibody
  Variation Phenotypes
  Genetic Interactions
  Overexpression Phenotypes
  Physical Interactions
  RNAi Phenotypes
  Gene Expression
  Variation Sequence Change
  Regulation of Gene Expression
  Gene Structure Correction
See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

Curation from Support Vector Machine Results
SVM results lead directly to manual curation, e.g.
RNAi Phenotypes
Results from SVMs are processed further, e.g. Variation Sequence Change
  Pattern matching script – regular expressions
  New variations (entity recognition), e.g. mg366, ju43, e1360

Data Type Flagging – Textpresso
Full text of articles
Terms, phrases, entities – semantically tagged
Keyword or category search
Match within sentence or entire paper
www.textpresso.org
C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty, Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, ... many others

Textpresso Categories
Pre-existing dictionaries, vocabularies:
  Gene names
  ChEBI (Chemical Entities of Biological Interest)
  PATO
  Sequence Ontology (SO)
Manually constructed by curators using language from published literature:
  Sequence similarity – orthologous, conserved
  Localization assays – GFP, antibody, fluorescence
  Experimental verbs – required, regulates, exhibits

Data Type Flagging – Textpresso Category Searches
Data Type: C. elegans Human Disease Homologs
Three-category Textpresso search:
  C. elegans gene
  ‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’
  Human disease
“We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a protooncogene receptor tyrosine kinase.”

Literature Curation Workflow
PubMed keyword search – ‘elegans’
Full text paper acquisition
Data type flagging and entity recognition
Detailed curation/Fact extraction

Textpresso: Semi-Automated Fact Extraction
Genetic Interactions
Interestingly, pph-5(tm2979) behaved similarly to pph-5(av101) in its ability to dominantly (but weakly) suppress sep-1(e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1).
Physical Interactions – after SVM document classifier
Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7).
Gene Ontology – Cellular Component Curation
During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).

Textpresso: Semi-Automated GO Cellular Component Curation
[Diagram labels: Textpresso; Gene Products; Component; Suggested GO Annotations; Textpresso Search Results]
See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.

Future Directions
Textpresso, other methods (HMMs) applied to additional data types
  e.g. GO Biological Process curation (Phenotypes)
Focusing triage and fact extraction on novel findings
  How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?
  e.g. Commonly used molecular markers
Literature Annotation Tool – Tracking Evidence
  WB, GO Common Annotation Framework, BioCreative

Summary
Text Mining Applications for Literature Curation:
  Paper approval and full text acquisition
  Data type flagging and entity recognition
  Fact extraction – record evidence
All steps of our pipeline incorporate some form of semi- or fully automated approaches:
  Scripts for downloads, pattern matching
  Support Vector Machines for document classification
  Textpresso for flagging and fact extraction
  (Hidden Markov Models for flagging, fact extraction)

The WormBase Consortium, Textpresso
WormBase – Caltech: Paul Sternberg, Juancarlos Chan, Wen Chen, Chris Grove, Ranjana Kishore, Raymond Lee, Cecilia Nakamura, Daniela Raciti, Gary Schindelman, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Karen Yook. Former member: Ruihua Fang
Textpresso – Caltech: Hans-Michael Muller, Yuling Li, James Done. Former member: Arun Rangarajan
WormBase – OICR, Toronto: Lincoln Stein, Abigail Cabunoc, Todd Harris, JD Wong
WormBase –
Washington University: John Spieth, Tamberlyn Bieri, Phil Ozersky
WormBase – EBI, Sanger, Hinxton, UK: Richard Durbin, Paul Kersey, Matt Berriman, Paul Davis, Michael Paulini, Kevin Howe, Mary Ann Tuli, Gary Williams
CGC – Oxford University, Oxford, UK: Jonathan Hodgkin

Hidden Markov Models: Semi-Automated GO Molecular Function Curation
For each sentence, HMM yields:
  True positive score
  False positive score
For each sentence, curator assigns:
  Fully curatable (entity + indication of enzymatic activity)
  Positive (experiment was performed and a result is reported, but no entity)
  False Positive (not about enzymatic activity at all)
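
Illustrative sketch: pattern matching for new variation names
The Variation Sequence Change step described earlier uses a regular-expression script to recognize new variation names such as mg366, ju43 and e1360. The actual WormBase script is not shown in these slides, so the following is only a minimal Python sketch under stated assumptions: the pattern (one to three lowercase letters, then a number, with an optional "ts" suffix as in e2406ts) is inferred from the examples in the slides, and the names VARIATION_RE and find_candidate_variations are hypothetical.

```python
import re

# Assumption: a variation name is 1-3 lowercase letters followed by a number,
# optionally ending in "ts" (as in e2406ts). Inferred from slide examples only.
VARIATION_RE = re.compile(r"\b[a-z]{1,3}[1-9]\d*(?:ts)?\b")


def find_candidate_variations(text, known=frozenset()):
    """Return candidate variation names in order of first appearance.

    Names listed in `known` (e.g. variations already in the database)
    are skipped, so only potentially new variations are reported.
    """
    candidates = []
    for match in VARIATION_RE.finditer(text):
        name = match.group(0)
        if name not in known and name not in candidates:
            candidates.append(name)
    return candidates


sentence = ("Interestingly, pph-5(tm2979) behaved similarly to pph-5(av101) "
            "in its ability to suppress sep-1(e2406ts).")
print(find_candidate_variations(sentence, known={"av101"}))
# ['tm2979', 'e2406ts']
```

A pattern this loose will also match unrelated tokens of the same shape (for example, a protein name such as p53), so in practice its output would still need filtering or curator review, in keeping with the semi-automated approach described in the slides.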