Text Mining Applications for Literature Curation

Kimberly Van Auken
WormBase Consortium
Textpresso
Gene Ontology Consortium
WormBase: A Database for C. elegans
and Other Nematodes
www.wormbase.org
Curating Diverse Data Types
Aggregation Behavior
Which worms aggregate
with other worms
and what contributes
to that behavior?
Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types
Aggregation Behavior
Which worms (Strain)
aggregate with
other worms
and what contributes to
that behavior?
Bendesky et al., 2012, PLoS Genetics
Strain information:
• August 1, 1972
• Pineapple field in Hawaii
Curating Diverse Data Types
Aggregation Behavior
Which worms
aggregate with
other worms (Phenotype)
and what contributes to
that behavior?
Bendesky et al., 2012, PLoS Genetics
• Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)
• Life stage ontology, e.g., L3 larval stage
• Assay, e.g., food source
Curating Diverse Data Types
Aggregation Behavior
Which worms (Strain)
aggregate with
other worms (Phenotype)
and what contributes
to that
behavior (Molecular Basis)?
Bendesky et al., 2012, PLoS Genetics
• Gene: npr-1
• Variation: ad609 (T(83)->I and T(144)->A)
• Gene Ontology for npr-1:
  Biological Process: feeding behavior
  Molecular Function: neuropeptide receptor activity
  Cellular Component: integral to plasma membrane
Literature Curation Workflow
PubMed keyword search – ‘elegans’
Full text paper acquisition
Data type flagging and entity recognition
Detailed curation/Fact extraction
Finding Papers: Daily, automated PubMed
searches using keyword ‘elegans’
Download citation XML
PMID | Title | Authors | Abstract | Article type | Journal | Curator actions
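A minimal sketch of the search-and-download step, assuming the NCBI E-utilities ESearch endpoint is used for the daily keyword query and that citations arrive in standard PubMed XML. The function names, the `retmax` value, and the sample record are illustrative placeholders, not the WormBase scripts themselves. No network call is made here; the URL is only constructed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_daily_search_url(keyword="elegans", days=1):
    """Build an ESearch URL for records entered in the last `days` days."""
    params = {"db": "pubmed", "term": keyword, "reldate": days,
              "datetype": "edat", "retmax": 500}  # retmax is an arbitrary choice
    return EUTILS_ESEARCH + "?" + urlencode(params)

def parse_citations(xml_text):
    """Extract the fields a curator sees from PubMed citation XML."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        art = citation.find("Article")
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": art.findtext("ArticleTitle"),
            "authors": [f"{a.findtext('LastName')} {a.findtext('Initials')}"
                        for a in art.iter("Author")],
            "abstract": " ".join(t.text or "" for t in art.iter("AbstractText")),
            "journal": art.findtext("Journal/Title"),
        })
    return records

# Placeholder record, invented for illustration only (not a real citation).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><Title>Placeholder Journal</Title></Journal>
        <ArticleTitle>Placeholder title about C. elegans</ArticleTitle>
        <Abstract><AbstractText>Placeholder abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

url = build_daily_search_url()
records = parse_citations(SAMPLE_XML)
```

Each parsed record could then be shown to a curator alongside the article type and the available curator actions.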
Literature Curation Workflow – Full Text Acquisition
• Fully manual step
• Done for all papers we select
• Electronic copies stored in curation database
Data Type Flagging/Triage
• General classification of papers
• What types of experiments are in a paper?
  e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions
Data Type Flagging Methods
Main pipeline:
Support Vector Machines (SVMs)
Other methods:
Textpresso category searches
Hidden Markov models
Pattern matching scripts
Support Vector Machines: Document Classification
• Machine learning models
• Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)
• Positives: 100s, Negatives: 1000s
• Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
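To make the training and classification idea concrete, here is a toy sketch of a linear SVM over bag-of-words features, trained by subgradient descent on the hinge loss. Everything in it is an assumption made for illustration: the four training sentences, the omission of a bias term, the learning parameters, and the score thresholds for high/medium/low confidence. The production pipeline is described in Fang et al. (2012) and trains on hundreds of positive and thousands of negative papers.

```python
import re
from collections import Counter

def featurize(text):
    """Bag-of-words counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def train_linear_svm(examples, epochs=20, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss (no bias term)."""
    w = {}
    for _ in range(epochs):
        for feats, y in examples:
            score = sum(w.get(tok, 0.0) * n for tok, n in feats.items())
            for tok in w:                      # L2 shrinkage step
                w[tok] *= 1.0 - lr * lam
            if y * score < 1.0:                # example inside the margin
                for tok, n in feats.items():
                    w[tok] = w.get(tok, 0.0) + lr * y * n
    return w

def classify(w, text, high=1.0, medium=0.5):
    """Return (label, confidence band, raw score); thresholds are arbitrary."""
    score = sum(w.get(tok, 0.0) * n for tok, n in featurize(text).items())
    label = "positive" if score > 0 else "negative"
    strength = abs(score)
    band = "high" if strength >= high else "medium" if strength >= medium else "low"
    return label, band, score

# Toy gold standard: +1 = paper reports RNAi experiments, -1 = it does not.
TRAINING = [
    ("rnai knockdown caused embryonic lethality", +1),
    ("dsrna feeding rnai produced sterile animals", +1),
    ("crystal structure solved at high resolution", -1),
    ("genome sequencing revealed new gene models", -1),
]
weights = train_linear_svm([(featurize(t), y) for t, y in TRAINING])

rnai_result = classify(weights, "RNAi feeding caused lethality")
other_result = classify(weights, "genome structure solved")
```

In practice the confidence band would decide how soon a curator looks at a paper; here it is simply a function of the distance from the decision boundary.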
Data Type Flagging – Support Vector Machines
SVMs trained for ten different data types:
• Antibody
• Variation Phenotypes
• Genetic Interactions
• Overexpression Phenotypes
• Physical Interactions
• RNAi Phenotypes
• Gene Expression
• Variation Sequence Change
• Regulation of Gene Expression
• Gene Structure Correction
See: Fang R, et al. (2012) Automatic categorization of diverse experimental
information in the bioscience literature. BMC Bioinformatics. 13(1):16
Curation from Support Vector Machine Results
• SVM results lead directly to manual curation:
  e.g. RNAi Phenotypes
• Results from SVMs are processed further:
  e.g. Variation Sequence Change
  Pattern matching script (regular expressions)
  New variations (entity recognition), e.g. mg366, ju43, e1360
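The pattern-matching step can be sketched with a single regular expression. The pattern below is an assumption based on the examples on the slide (one to three lowercase letters followed by digits); it is not the script WormBase runs, and it would also pick up unrelated tokens of the same shape, so the output still needs curator review. The `known` set stands in for variations already in the database.

```python
import re

# Assumed allele-name shape: 1-3 lowercase letters, then digits (e.g. mg366).
ALLELE_RE = re.compile(r"\b[a-z]{1,3}\d+\b")

def find_variations(text, known=frozenset()):
    """Return allele-like names in order of first appearance, skipping known ones."""
    found = []
    for match in ALLELE_RE.finditer(text):
        name = match.group()
        if name not in known and name not in found:
            found.append(name)
    return found

text = ("The alleles mg366, ju43 and e1360 were isolated; "
        "npr-1(ad609) was also tested.")
all_hits = find_variations(text)
new_hits = find_variations(text, known={"e1360", "ad609"})
```

Note that the gene name npr-1 is not matched, because the hyphen separates its letters from its digit.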
Data Type Flagging – Textpresso
Full text of articles
Terms, phrases, entities – semantically tagged
Keyword or category search
Match within sentence or entire paper
www.textpresso.org
C. elegans
Mouse
D. melanogaster
Neuroscience
Arabidopsis
Dicty
Wnt Pathway
HIV
Nematodes
S. cerevisiae
RegulonDB
...many others
Textpresso Categories
• Pre-existing dictionaries, vocabularies:
  Gene names
  ChEBI (Chemical Entities of Biological Interest)
  PATO
  Sequence Ontology (SO)
• Manually constructed by curators using language from published literature:
  Sequence similarity – orthologous, conserved
  Localization assays – GFP, antibody, fluorescence
  Experimental verbs – required, regulates, exhibits
Data Type Flagging - Textpresso Category Searches
Data Type: C. elegans Human Disease Homologs
Three-category Textpresso search:
• C. elegans gene
• ‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’
• Human disease
“We map this defect in dauer response to a mutation in
the scd-2 gene, which, we show, encodes the nematode
anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
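A three-category search of this kind can be approximated as "every category must have at least one hit in the same sentence (or the same paper)". The sketch below is not Textpresso code: the category patterns are tiny hypothetical stand-ins for its curated dictionaries, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical mini-categories; real Textpresso categories are large lexica.
CATEGORIES = {
    "gene": re.compile(r"\b[a-z]{3,4}-\d+\b"),
    "homology": re.compile(r"\b(?:ortholog|homolog|similar|model)\w*", re.IGNORECASE),
    "human_disease": re.compile(r"\b(?:lymphoma|cancer|diabetes)\b", re.IGNORECASE),
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def category_search(text, categories=CATEGORIES, scope="sentence"):
    """Return the units (sentences, or the whole text) that hit every category."""
    units = split_sentences(text) if scope == "sentence" else [text]
    return [u for u in units if all(p.search(u) for p in categories.values())]

paper = ("We map this defect in dauer response to a mutation in the scd-2 gene, "
         "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
         "homolog, a proto-oncogene receptor tyrosine kinase. "
         "Mutant animals were grown at 20 degrees.")
sentence_hits = category_search(paper)

spread_out = ("The scd-2 gene was cloned. It is a homolog of ALK. "
              "ALK is mutated in lymphoma.")
strict = category_search(spread_out, scope="sentence")
loose = category_search(spread_out, scope="document")
```

Sentence scope favors precision, since all three categories must co-occur in one statement; document scope trades that for recall when the evidence is spread over several sentences.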
Textpresso: Semi-Automated Fact Extraction
• Genetic Interactions
Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its
ability to dominantly (but weakly) suppress sep-1 (e2406ts), but
recessively suppress sep-1(ax110) (supplementary material Table S1).
• Physical Interactions – after SVM document classifier
Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1
(Figure 3A and B, lane 7).
• Gene Ontology – Cellular Component Curation
During embryogenesis, PAN-1 protein is uniformly distributed
throughout the cytoplasm of the germline and somatic blastomeres, as
seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1
in the P granules (Fig. 2K, N).
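As a sketch of what semi-automated extraction might pull out of the genetic interaction sentence above: find gene(allele) mentions and an interaction term, and offer them to the curator as a candidate. Both regular expressions and the short list of interaction word stems are assumptions made for illustration, not the Textpresso categories themselves.

```python
import re

# Assumed shapes: gene like "pph-5", allele like "tm2979" or "e2406ts".
GENE_ALLELE_RE = re.compile(r"\b([a-z]{3,4}-\d+)\s*\(([a-z]{1,3}\d+(?:ts)?)\)")
# Hypothetical stems for genetic interaction vocabulary.
INTERACTION_RE = re.compile(r"\b(?:suppress|enhanc|synthetic|epista)\w*", re.IGNORECASE)

def candidate_interaction(sentence):
    """Return a candidate record if two genes and an interaction term co-occur."""
    pairs = GENE_ALLELE_RE.findall(sentence)
    genes = sorted({gene for gene, _ in pairs})
    terms = sorted({t.lower() for t in INTERACTION_RE.findall(sentence)})
    if len(genes) < 2 or not terms:
        return None
    return {"genes": genes, "alleles": pairs, "terms": terms}

sentence = ("Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its "
            "ability to dominantly (but weakly) suppress sep-1 (e2406ts), but "
            "recessively suppress sep-1(ax110) (supplementary material Table S1).")
candidate = candidate_interaction(sentence)
```

The record says only that pph-5, sep-1, and "suppress" co-occur; which allele suppresses which, and whether dominantly or recessively, is left for the curator to read off the sentence.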
Textpresso: Semi-Automated GO Cellular Component Curation
[Diagram labels: Textpresso; Textpresso Search Results; Gene Products; Component; Suggested GO Annotations]
See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of
protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular
component curation. BMC Bioinformatics. 10:228.
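A rough sketch of the suggestion step: pair each protein named in a matching sentence with each cellular component term found there, and propose the corresponding GO term for curator review. The protein pattern and the three-entry component table are assumptions for illustration (the GO identifiers should be checked against the current ontology release); the published method is in Van Auken et al. (2009).

```python
import re

# Assumed protein-name shape: 2-4 uppercase letters, hyphen, digits (e.g. PAN-1).
PROTEIN_RE = re.compile(r"\b[A-Z]{2,4}-\d+\b")

# Tiny illustrative lookup; verify IDs against the current GO release.
COMPONENT_TO_GO = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "P granule": "GO:0043186",
}

def suggest_annotations(sentence):
    """Return (protein, component term, GO id) suggestions for curator review."""
    proteins = sorted(set(PROTEIN_RE.findall(sentence)))
    lowered = sentence.lower()
    suggestions = []
    for term, go_id in COMPONENT_TO_GO.items():
        if term.lower() in lowered:
            suggestions.extend((protein, term, go_id) for protein in proteins)
    return suggestions

sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed throughout "
            "the cytoplasm of the germline and somatic blastomeres, as seen for "
            "pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the "
            "P granules (Fig. 2K, N).")
suggestions = suggest_annotations(sentence)
```

On this sentence the sketch suggests both cytoplasm and P granule for PAN-1. The second suggestion is wrong, because the text says there is no concentration in P granules, which is one reason the approach is semi-automated: a curator accepts or rejects each suggestion.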
Future Directions
• Textpresso, other methods (HMMs) applied to additional data types
  e.g. GO Biological Process curation (Phenotypes)
• Focusing triage and fact extraction on novel findings
  How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?
  e.g. Commonly used molecular markers
Literature Annotation Tool – Tracking Evidence
WB, GO Common Annotation Framework, BioCreative
Summary
Text Mining Applications for Literature Curation:
• Paper approval and full text acquisition
• Data type flagging and entity recognition
• Fact extraction – record evidence
All steps of our pipeline incorporate some form of semi- or fully automated approaches:
• Scripts for downloads, pattern matching
• Support Vector Machines for document classification
• Textpresso for flagging and fact extraction
• (Hidden Markov Models for flagging, fact extraction)
The WormBase Consortium, Textpresso
WormBase - Caltech
Textpresso - Caltech
Paul Sternberg
Juancarlos Chan
Wen Chen
Chris Grove
Ranjana Kishore
Raymond Lee
Cecilia Nakamura
Daniela Raciti
Gary Schindelman
Kimberly Van Auken
Daniel Wang
Xiaodong Wang
Karen Yook
Former member: Ruihua Fang
Hans-Michael Muller
Yuling Li
James Done
Former member: Arun Rangarajan
WormBase – OICR, Toronto
Lincoln Stein
Abigail Cabunoc
Todd Harris
JD Wong
WormBase – Washington University
John Spieth
Tamberlyn Bieri
Phil Ozersky
WormBase – EBI, Sanger, Hinxton, UK
CGC – Oxford University, Oxford, UK
Richard Durbin Paul Kersey
Matt Berriman
Paul Davis
Michael Paulini
Kevin Howe
Mary Ann Tuli Gary Williams
Jonathan Hodgkin
Hidden Markov Models: Semi-Automated GO Molecular Function Curation
For each sentence, HMM yields:
• True positive score
• False positive score
For each sentence, curator assigns:
• Fully curatable (entity + indication of enzymatic activity)
• Positive (experiment was performed and a result reported, but no entity)
• False Positive (not about enzymatic activity at all)