Using ontologies for text processing

Lawrence Hunter & K. Bretonnel Cohen
Center for Computational Pharmacology
UCHSC School of Medicine
http://compbio.uchsc.edu
Larry.Hunter@uchsc.edu
Overview
Thesis: Ontologies (or even more elaborated
knowledge-bases) are required to solve the lexical
ambiguity problem
Describe the lexical ambiguity problem and its
central importance in natural language processing
Demonstrate how GO, combined with Direct
Memory Access Parsing, provides a simple solution
to some instances of this problem
Argue no alternative is likely to work as well
Lexical Ambiguity
A word (character string) means different things
in different contexts
– How can a program disambiguate (tell which is meant)?
Widespread problem even in “simple” bioNLP
– DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
– Gene symbol vs. non-gene acronym [Pustejovsky et al.
2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and
Hearst 2003]
– Gene/product vs. any other noun [Tanabe and Wilbur, 2002]
A particular example
“Hunk” can be a
– Cell type: human natural killer
– Gene: hormonally upregulated Neu-associated kinase
– Medical abbreviation: radiographic/orthopedic joint
classification system
– Non-technical English: a large lump, piece, or portion
All occur in Medline documents….
(e.g. “hunk of metal” in article on ambulance design)
How do ontologies help?
The idea that knowledge is relevant to
understanding words in context is controversial
only among linguists, but…
The Direct Memory Access Parsing (DMAP) technique
[Martin, 1991] [Fitzgerald, 2000] demonstrates the
power of knowledge-based methods for
disambiguation
GO & similar efforts make DMAP (or other
knowledge-based methods) practical today
What is DMAP?
Conceptual parser
– Maps from text to conceptual representations organized in
packaging and abstraction hierarchies (like GO)
– In contrast to: pure syntactic parsers, pattern matching and
machine learning systems
Conceptual representations include lexical patterns that
specify how to recognize the concept in text
– Patterns consist of text literals and/or references to other concepts
– Organized around concepts, not words; no independent lexicon.
Recognition creates expectations for related concepts
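The organization described above (concepts that carry their own lexical patterns, with no separate lexicon) can be sketched as a small data structure. This is only an illustrative sketch, not DMAP's actual implementation: the names Concept, Ref, KB, is_a and concepts_for_literal are hypothetical, and the entries are toy stand-ins for the "Hunk" senses discussed in this talk.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Ref:
    """A pattern item that refers to another concept (hypothetical helper)."""
    concept_id: str


@dataclass
class Concept:
    """A node in the abstraction hierarchy that owns its lexical patterns."""
    id: str
    is_a: Optional[str] = None
    # Each pattern is a list whose items are text literals (str) or Ref objects.
    patterns: list = field(default_factory=list)


# Toy knowledge base; the contents are assumptions made for illustration.
KB = {c.id: c for c in [
    Concept("gene"),
    Concept("cell-type"),
    Concept("expression", patterns=[["expression"]]),
    Concept("gene-26559", is_a="gene", patterns=[["hunk"]]),
    Concept("cell-type-HUNK", is_a="cell-type",
            patterns=[["hunk"], ["human", "natural", "killer"]]),
    Concept("gene-expression", patterns=[[Ref("gene"), Ref("expression")]]),
]}


def is_a(kb, concept_id, ancestor_id):
    """Walk the abstraction hierarchy upward from concept_id."""
    current = concept_id
    while current is not None:
        if current == ancestor_id:
            return True
        current = kb[current].is_a
    return False


def concepts_for_literal(kb, words):
    """Concepts that have a pattern made of exactly these text literals."""
    target = [w.lower() for w in words]
    return sorted(c.id for c in kb.values() if target in c.patterns)


print(concepts_for_literal(KB, ["Hunk"]))
print(is_a(KB, "gene-26559", "gene"))
```

Because the patterns hang off the concepts, asking which concepts a string can denote is a lookup over the knowledge base rather than over an independent word list.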
A real example
“…Hunk expression is restricted to subsets of cells…”
[Gardner et al. 2000]
ID: cell-type-HUNK
IS-A: cell-type
lex:
human natural killer
HUNK
ID: gene-expression
slots:
expressed-item: gene
mechanism: expression
lex:
(gene) (expression)
ID: gene-26559
IS-A: gene
lex:
hormonally upregulated Neu-associated kinase
hormonally upregulated neu tumor-associated kinase
HUNK
ID: GO-0006350
lex:
transcription
expression
DMAP output
with and without context
Hunk alone: ambiguous
(parse '(Hunk))
e-gene-26559
  begin: 1 end: 1
e-cell-type-HUNK
  begin: 1 end: 1
Hunk expression: not ambiguous
(parse '(Hunk expression))
c-gene-expression-1
  begin: 1 end: 2
  expressed-item: e-gene-26559
    begin: 1 end: 1
  mechanism: GO:0006350
    begin: 2 end: 2
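The effect shown above, where "Hunk" alone is ambiguous but "Hunk expression" is not, can be imitated with a very small slot-filling matcher. This is a rough sketch under simplifying assumptions, not the DMAP algorithm itself: IS_A, LEXICON, FRAMES, literal_matches and parse are hypothetical names, the knowledge base holds only the frames from this example, and the mechanism slot is constrained directly to GO:0006350.

```python
# Toy knowledge base mirroring the frames in this example (illustrative only).
IS_A = {"gene-26559": "gene", "cell-type-HUNK": "cell-type"}
LEXICON = {  # concept -> literal phrases, as tuples of lower-cased tokens
    "gene-26559": [("hunk",)],
    "cell-type-HUNK": [("hunk",), ("human", "natural", "killer")],
    "GO:0006350": [("expression",), ("transcription",)],
}
# A structured concept: ordered slots, each constrained to a concept class.
FRAMES = {
    "gene-expression": [("expressed-item", "gene"), ("mechanism", "GO:0006350")],
}


def isa(concept, ancestor):
    """True if concept equals ancestor or inherits from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def literal_matches(tokens):
    """All (concept, begin, end) literal matches, 1-based and inclusive."""
    toks = [t.lower() for t in tokens]
    found = []
    for concept, phrases in LEXICON.items():
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(toks) - n + 1):
                if tuple(toks[i:i + n]) == phrase:
                    found.append((concept, i + 1, i + n))
    return found


def parse(tokens):
    """Frames whose slots can all be filled, left to right, by adjacent matches."""
    matches = literal_matches(tokens)
    results = []
    for frame, slots in FRAMES.items():
        partial = [([], None)]  # (fillers so far, end position of last filler)
        for slot, constraint in slots:
            extended = []
            for fillers, last_end in partial:
                for concept, begin, end in matches:
                    adjacent = last_end is None or begin == last_end + 1
                    if adjacent and isa(concept, constraint):
                        extended.append((fillers + [(slot, concept, begin, end)], end))
            partial = extended
        results.extend((frame, fillers) for fillers, _ in partial)
    return results


print(literal_matches(["Hunk"]))       # two candidate senses
print(parse(["Hunk", "expression"]))   # one frame; only the gene sense fits
```

The cell-type sense is never rejected by a special rule. It simply fails the "gene" constraint on the expressed-item slot, so only the gene reading survives once the context is present.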
DMAP can handle much more
complex constructions
“Hunk is expressed in mouse epithelial cells
during cell proliferation.”
c-localized-gene-expression
expressed-item: e-gene-26559
mechanism: GO:0006350
where: c-epithelial-cell
taxon: ncbi_10090
when: GO:0008283
But uses our enriched knowledge-base, not just GO
Even just DMAP/GO is a big win
Recall 7,042 ambiguous symbols for 9,723 genes
Straightforward to disambiguate symbols that
map to 2 or more genes when:
– Each ambiguous gene referent has GO annotations, and
– There is no overlap between the annotations for the genes
3,333 of the symbols (for 4,715 of the genes) have
this feature – nearly half the problem (about 47% of the symbols) is solved!
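The two conditions above can be checked mechanically. The sketch below assumes that "no overlap" means the GO annotation sets of the candidate genes are pairwise disjoint; the function name disambiguable and all symbols, genes and annotations in the sample data are made up for illustration and are not the figures reported in this talk.

```python
from itertools import combinations


def disambiguable(symbol_to_genes, gene_to_go):
    """Symbols mapping to 2+ genes that are all annotated, with disjoint GO sets."""
    result = []
    for symbol, genes in symbol_to_genes.items():
        if len(genes) < 2:
            continue  # not ambiguous between genes
        annotations = [gene_to_go.get(g, set()) for g in genes]
        if not all(annotations):
            continue  # at least one candidate gene has no GO annotation
        if all(a.isdisjoint(b) for a, b in combinations(annotations, 2)):
            result.append(symbol)
    return result


# Invented sample data (assumption): three ambiguous symbols.
symbol_to_genes = {
    "SYM1": ["gene-a", "gene-b"],   # annotated, no overlap
    "SYM2": ["gene-c", "gene-d"],   # annotated, but annotations overlap
    "SYM3": ["gene-e", "gene-f"],   # gene-f has no annotation
}
gene_to_go = {
    "gene-a": {"GO:0006350"},
    "gene-b": {"GO:0008283"},
    "gene-c": {"GO:0016477"},
    "gene-d": {"GO:0016477", "GO:0008283"},
    "gene-e": {"GO:0006350"},
}

print(disambiguable(symbol_to_genes, gene_to_go))
```

A stricter or looser notion of overlap (for example, comparing ancestors in the GO hierarchy rather than exact term identifiers) would change which symbols qualify.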
Compare the alternatives
Statistical or machine learning approaches
– Must avoid being fooled by the word “cells” in the example (it co-occurs with “Hunk” but the gene sense is meant)
– Scalability: need statistics for many covariates of every
ambiguous word; doesn’t exploit the abstraction hierarchy
Full syntactic parse doesn’t disambiguate at all!
Cascaded FSTs, pattern-matching, etc.
– Where is source of knowledge for these?
– Much DMAP lexical information can be taken directly from
GO (and LocusLink, etc.)
Acknowledgments
Philip V. Ogren
Daniel J. McGoldrick
Christoffer S. Crosby
Jens Eberlein
George K. Acquaah-Mensah
I/NET’s (http://inetmi.com) CM / CMP software
Support from Wyeth Genetics Institute, NIAAA
http://compbio.uchsc.edu
Biognosticopoea representation
of the hunk gene
Attachment ambiguity
– These findings suggest that FAK functions
in the regulation of cell migration and cell
proliferation. (Gilmore and Romer 1996:1209)
– What does FAK do?
ALMOST RIGHT:
FAK functions in the regulation of cell migration
FAK functions in cell proliferation
RIGHT:
FAK functions in the regulation of cell migration
FAK functions in the regulation of cell proliferation
Attachment ambiguity
GO-0016477 isA go-process
lex: cell migration
GO-0008283 isA go-process
lex: cell proliferation
GO-0042127 isA go-process
lex: regulation of cell proliferation
regulation of ((go-process) and)* cell proliferation
GO-0030334 isA go-process
lex: regulation of cell migration
regulation of ((go-process) and)* cell migration
Attachment ambiguity
(parse '(These findings suggest
that FAK functions in the
regulation of cell migration and
cell proliferation))
GO:0030334
begin: 9 end: 12
GO:0042127
begin: 9 end: 15
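The pattern "regulation of ((go-process) and)* cell proliferation" is what lets the second conjunct attach to "regulation of". A minimal sketch of that matching follows, assuming whitespace tokenization and a two-entry process table; GO_PROCESS, REGULATION_OF, match_process and find_regulation are hypothetical names, and the real DMAP pattern machinery is more general than this loop.

```python
# GO process terms and their lexical forms, taken from the frames above.
GO_PROCESS = {
    "GO:0016477": ("cell", "migration"),
    "GO:0008283": ("cell", "proliferation"),
}
# "regulation of X" concept -> the process X it regulates.
REGULATION_OF = {
    "GO:0030334": "GO:0016477",
    "GO:0042127": "GO:0008283",
}


def match_process(tokens, i):
    """Yield (go_id, next_index) for each process phrase starting at 0-based i."""
    for go_id, phrase in GO_PROCESS.items():
        if tuple(tokens[i:i + len(phrase)]) == phrase:
            yield go_id, i + len(phrase)


def find_regulation(tokens):
    """Match 'regulation of ((go-process) and)* <process>'.

    Returns (regulation_id, begin, end) with 1-based inclusive positions.
    """
    toks = [t.lower() for t in tokens]
    hits = []
    for start in range(len(toks) - 1):
        if toks[start:start + 2] != ["regulation", "of"]:
            continue
        pos = start + 2
        while True:
            found = list(match_process(toks, pos))
            if not found:
                break
            go_id, end = found[0]
            for reg_id, target in REGULATION_OF.items():
                if target == go_id:
                    hits.append((reg_id, start + 1, end))
            # Continue only across a conjunction: "(go-process) and"
            if end < len(toks) and toks[end] == "and":
                pos = end + 1
            else:
                break
    return hits


sentence = ("These findings suggest that FAK functions in the "
            "regulation of cell migration and cell proliferation")
for hit in find_regulation(sentence.split()):
    print(hit)
```

Both hits begin at token 9, which is the point of the pattern: "cell proliferation" is read as a second thing being regulated rather than as a bare process.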
What do we have so far?
Gene Ontology
UMLS
MeSH
…
What more do we need?
Family
Location
– Macroanatomical
– Subcellular localization
Structure
Function
– Disease associations
– Protein/protein interactions
– …..
Where can we get it?
GO definitions
UMLS definitions
MeSH notes
Biomedical literature
If you don’t like DMAP….
full syntactic parse first
cascaded FSTs
“a little syntax, a little
semantics”
machine learning
pattern-matching
All can benefit from
ontology/KB