cui-2013 - Phenotype RCN

SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION
Hong Cui
University of Arizona
Phenotype RCN
Feb 25-27, 2013
Agenda
• CharaParser
• Methodology
• Evaluation
• Applications
• CharaParser for Phenoscape
• New modules
• Evaluations
• Challenges
“Fine-Grained Semantic Mark-up”
• To annotate factual information from textual
morphological descriptions of biodiversity in such
a detailed manner that the machine readable
annotation itself provides information equivalent
to the original text.
An Example
Previous Research
• Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004)
• Interactive extraction (Diederich, Fortuner & Milton, 1999)
• Semi-supervised bootstrapping for lexicons (Riloff, 1999)
• Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008)
• Ontology driven and parallel text (Woods et al., 2004)
• Supervised association rule learning (Cui & Heidorn, 2007)
General-Purpose Parsers?
CharaParser Approach
1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically
• No need to prepare training examples
• 50% - 80% terms learned
2. General-purpose syntactic parser (e.g., Stanford Parser) to parse syntactic structure of sentences
• No need to create special-purpose, domain-dependent parser
• Learned lexicon from 1 is used to adapt the Parser for biodiversity domains
3. Intuitive rules to produce annotations from parse trees.
Unsupervised lexicon learning
If it is known “roots” is an organ:
• Roots yellow to medium brown or black, thin.
• Petals yellow or white
• Petals absent;
• Subtending bracts absent;
• Abaxial hastula absent;
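The bootstrapping idea in the example above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not CharaParser's actual learning algorithm: it assumes that a word directly following a known organ is a character state, and that a word directly preceding a known state is an organ. The names `bootstrap_lexicon`, `tokenize`, and `STOPWORDS` are hypothetical.

```python
import re

# Connectives that are never learned as organs or states (assumed list).
STOPWORDS = {"to", "or", "and"}


def tokenize(clause):
    """Lowercase a clause and keep alphabetic words only."""
    return re.findall(r"[a-z]+", clause.lower())


def bootstrap_lexicon(clauses, seed_organs):
    """Grow organ and state sets from seed organs until nothing changes."""
    organs, states = set(seed_organs), set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            tokens = tokenize(clause)
            for prev, nxt in zip(tokens, tokens[1:]):
                if prev in STOPWORDS or nxt in STOPWORDS:
                    continue
                if prev in organs and nxt not in organs and nxt not in states:
                    # Known organ followed by an unknown word: learn a state.
                    states.add(nxt)
                    changed = True
                elif nxt in states and prev not in states and prev not in organs:
                    # Unknown word followed by a known state: learn an organ.
                    organs.add(prev)
                    changed = True
    return organs, states


clauses = [
    "Roots yellow to medium brown or black, thin.",
    "Petals yellow or white",
    "Petals absent;",
    "Subtending bracts absent;",
    "Abaxial hastula absent;",
]
organs, states = bootstrap_lexicon(clauses, {"roots"})
print(sorted(organs), sorted(states))
```

Starting from "roots", the sketch learns "yellow" as a state, then "petals" as an organ, then "absent" as a state, which in turn exposes "bracts" and "hastula" as organs. Words such as "brown" or "thin" stay unlearned under these two rules alone.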
CharaParser: Term Reviewer
Ontology Term Organizer
Compared against a Heuristics-Based Method
• Parser performance evaluated on the same data sets.
• CharaParser: unsupervised learning + Stanford Parser
• Heuristics-based: unsupervised learning + regular
expression rules
Annotation Problems
• Chunk errors:
• Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm,
pliant;
• Attachment errors:
• on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of
dusky white or pale yellow, plumose bristles 5–6 mm.
• Semantics:
• straight posterolateral bounding ridges to subtriangular, bilobed
ventral muscle field;
Applications at Various Development Stages
• Convert XML markup to
• SDD for identification key generation
• Character matrices for tree of life
• RDF for the Semantic Web and search
• Use marked-up descriptions to support search
• FNA Experimental Search
• Data source is RDF triples
• Allow character based search
• Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
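A character-based search over triples, like the query above, could look roughly like the following sketch. Everything here is an assumption for illustration: the taxa, predicate names, and values are invented, and plain Python tuples stand in for a real RDF store and query language.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; not real FNA data.
TRIPLES = [
    ("taxon:A", "flowerColor", "yellow"),
    ("taxon:A", "elevationMinM", 100),
    ("taxon:A", "elevationMaxM", 300),
    ("taxon:A", "floweringMonth", "April"),
    ("taxon:A", "distribution", "North Carolina"),
    ("taxon:B", "flowerColor", "yellow"),
    ("taxon:B", "elevationMinM", 900),
    ("taxon:B", "elevationMaxM", 1500),
    ("taxon:B", "floweringMonth", "April"),
    ("taxon:B", "distribution", "North Carolina"),
    ("taxon:C", "flowerColor", "white"),
    ("taxon:C", "elevationMinM", 200),
    ("taxon:C", "elevationMaxM", 400),
    ("taxon:C", "floweringMonth", "April"),
    ("taxon:C", "distribution", "North Carolina"),
]


def index_by_subject(triples):
    """Group objects by subject and predicate for quick lookup."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[subject][predicate].add(obj)
    return index


def character_search(triples, color, elev_low, elev_high, month, region):
    """Return taxa matching color, month, region, and an overlapping elevation range."""
    hits = []
    for subject, facts in index_by_subject(triples).items():
        if color not in facts["flowerColor"]:
            continue
        if month not in facts["floweringMonth"]:
            continue
        if region not in facts["distribution"]:
            continue
        lows, highs = facts["elevationMinM"], facts["elevationMaxM"]
        if not lows or not highs:
            continue
        # Two ranges overlap when each starts before the other ends.
        if min(lows) <= elev_high and max(highs) >= elev_low:
            hits.append(subject)
    return sorted(hits)


print(character_search(TRIPLES, "yellow", 200, 400, "April", "North Carolina"))
```

In this toy data only `taxon:A` satisfies all four constraints: `taxon:B` grows too high and `taxon:C` has white flowers.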
To-Dos
• Tighter integration of ontologies in annotation process.
• Currently internal glossaries are used in place of ontologies
to link a character state (e.g., “red”) to a character (“color”)
• Synonyms are not controlled
• “Petiolate” = “with petiole”
• Continue to reduce annotation errors
• Accommodate various syntactic styles
• Diagnosis paragraphs
• Comparison among different taxa
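The glossary lookup and the synonym problem from the first To-Do item can be pictured with a small sketch. It is only an illustration: the two dictionaries contain just the examples given on this slide, and the function names `normalize_state` and `character_for` are hypothetical, not part of CharaParser.

```python
# State -> character glossary; "red" -> "color" is the example from the slide.
GLOSSARY = {"red": "color"}

# Variant phrasing -> preferred state term, from the slide's synonym example.
SYNONYMS = {"with petiole": "petiolate"}


def normalize_state(state):
    """Map a state phrase to its preferred form when a synonym is known."""
    cleaned = state.strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


def character_for(state):
    """Look up the character a state belongs to; None if the glossary lacks it."""
    return GLOSSARY.get(normalize_state(state))


print(character_for("Red"))             # color
print(normalize_state("with petiole"))  # petiolate
```

Replacing these flat dictionaries with ontology lookups is what the "tighter integration" item refers to.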
Phenotype Curation
• Convert character and character state information from natural
language descriptions to EQ statements
Curator Mental Process
[Figure: read description -> identify key phrases (raw EQ) -> ontologized EQ, drawing on ontologies]
Adapted CharaParser
[Figure: Character Description and State Descriptions -> CharaParser -> XML to Raw EQs -> Raw EQs to Final EQs; Ontologies shown as an input]
Evaluations
• Internal evaluation:
• The development corpus (three publications on fishes and archosaurs)
provided 1,200 character descriptions; 100 of them were included in the
internal evaluation benchmark.
• Raw EQ performance: 90%
• Final EQ performance: 50%
• BioCreative2012 evaluation:
• 50 descriptions independently selected by the organizer (more than 50% of the
qualities (Qs) were not in ontologies)
• Gold standard created by chief Phenoscape curator (raw and final)
• Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser)
• Raw EQ performance: CharaParser better than biocurators
• Final EQ performance: biocurators better than CharaParser
• Inter-curator agreements:
Inter-Curator Agreements
                  Precision   Recall
Curator 1 vs 2    39          49
Curator 1 vs 3    47          56
Curator 2 vs 3    77          71
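One way pairwise agreement figures like these can be computed is to treat one curator's EQ statements as the reference and the other's as the candidate. The sketch below assumes exact-match comparison of (entity, quality) pairs; the slides do not state how matches were actually scored, and the EQ labels are placeholders rather than real annotations.

```python
def precision_recall(candidate, reference):
    """Exact-match precision and recall of candidate EQs against reference EQs."""
    candidate, reference = set(candidate), set(reference)
    matched = candidate & reference
    precision = len(matched) / len(candidate) if candidate else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Placeholder (entity, quality) pairs for two hypothetical curators.
curator_1 = {("E1", "Q1"), ("E2", "Q2"), ("E3", "Q3"), ("E4", "Q4")}
curator_2 = {("E1", "Q1"), ("E2", "Q2"), ("E5", "Q5"), ("E6", "Q6"), ("E7", "Q7")}

precision, recall = precision_recall(curator_1, curator_2)
print(precision, recall)
```

Because the two curators produce different numbers of EQs, precision and recall differ even though the set of matched statements is the same, which is why each pair in the table has two values.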
Error Analyses
• Various fixable syntactic problems
• E.g., “digits I-III”
• Curation granularity
• CharaParser generated more candidate EQs than curators
• “Preopercular latero-sensory canal leaves preopercle at first exit and
enters a plate: yes/no”
• Annotating relations (relational quality)
• “contact between …”
Ontology Access
• Currently use keyword-based search
• Class labels and exact, narrower, and related synonyms
• False positives
• acute (shape) =? acute (process)
• "margin" is a broad synonym of "marginal zone of embryo" in UBERON
• Pre-composed terms in ontology
• “ceratobranchial 5 tooth”, “rib of vertebra 5”, “body of humerus”
• Ambiguous term use in descriptions
• ‘epibranchial 1’ => epibranchial 1 element? bone? cartilage?
• No matching
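The keyword-based lookup and the "margin" false positive can be illustrated with a toy index. This is a sketch under assumptions: the class IDs are placeholders (not real UBERON identifiers), the second class is invented for contrast, and the real system's search code is not shown in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    class_id: str
    label: str
    # Synonym scope ("exact", "narrow", "related", "broad") -> list of synonyms.
    synonyms: dict = field(default_factory=dict)


def keyword_search(term, classes, scopes=("exact", "narrow", "related")):
    """Return IDs of classes whose label or in-scope synonym equals the term."""
    term = term.lower()
    hits = []
    for cls in classes:
        names = [cls.label]
        for scope in scopes:
            names.extend(cls.synonyms.get(scope, []))
        if term in (name.lower() for name in names):
            hits.append(cls.class_id)
    return hits


classes = [
    # The broad synonym is the example from the slide; the ID is a placeholder.
    OntologyClass("EX:0001", "marginal zone of embryo", {"broad": ["margin"]}),
    # Hypothetical class added so the query has a legitimate match.
    OntologyClass("EX:0002", "margin"),
]

print(keyword_search("margin", classes))
print(keyword_search("margin", classes, scopes=("exact", "narrow", "related", "broad")))
```

With the default scopes only `EX:0002` is returned; once broad synonyms are searched as well, "marginal zone of embryo" also matches, which is the kind of false positive listed above.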
Exploration of Solutions
• Experimented with
• Word sense disambiguation:
• “crinkly” not in PATO
• Candidate matches: [undulate->1.00000000000002]
[obovate->1.00000000000001]
[flat->1.00000000000001]
[flattened->1]
[circinate->0.884697579551583]
• Experimenting with
• Subsets
• Specify included classes: e.g. classes related to vertebrates
• Specify excluded classes: e.g. exclude certain developmental stages
• Ideas to try out:
• Bootstrapping to narrow down the search space
• starting from known classes
• evaluating candidate matches based on their distances to the known classes and other
sources of evidence.
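The bootstrapping idea, ranking candidate matches by how close they sit to classes already known to be relevant, might be prototyped as below. This is a sketch of one possible reading: it assumes the ontology is treated as an undirected graph and that "distance" means shortest path length. Node names are placeholders, and `rank_candidates` is a hypothetical helper.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance(graph, start, goal):
    """Shortest path length by breadth-first search; None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def rank_candidates(graph, candidates, known):
    """Sort candidates by their smallest distance to any known class."""
    def score(candidate):
        dists = [distance(graph, candidate, k) for k in known]
        dists = [d for d in dists if d is not None]
        return min(dists) if dists else float("inf")
    return sorted(candidates, key=score)


# Placeholder ontology fragment: cand_near is 2 steps from "known", cand_far is 4.
graph = build_graph([
    ("known", "n1"), ("n1", "cand_near"),
    ("known", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "cand_far"),
])
print(rank_candidates(graph, ["cand_far", "cand_near"], ["known"]))
```

Distance would be only one signal; the slide notes that other sources of evidence would be combined with it.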
Annotation consistency
• Instructions given to human curators are helpful to CharaParser
• Restricted relation list:
• http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions
Feed more info to EQ generation module
[Figure: Character Description and State Descriptions -> CharaParser -> XML to EQs -> Raw EQs to Final EQs; Ontologies shown as an input]
Recent Improvements
• Explorer of Taxon Concepts project
• Making it a pure-Java program/web-based application
• Currently requires MySQL + Perl
• Making it faster
• Optimization of the program
• Removing MySQL and reducing I/O
• “Parallel” computing using Java threads
• Preliminary evaluation shows
• 20 times faster: 2 sec/taxon description
• Memory requirements increased threefold
Acknowledgements
• Fine-Grained Semantic Markup Project (current and past)
• James Macklin: Agriculture and Agri-Food Canada
• Robert (Bob) Morris, Alex Dusenbery: UMass-Boston
• Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna
Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona
• Phenoscape Project
• Chris Mungall: Lawrence Berkeley National Lab
• Melissa Haendel: Oregon Health & Science University
• Paula Mabee, Alex Dececchi: University of South Dakota
• Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent
• NSF ABI and EF Programs
• The Flora of North America Project