The Semantic Retrieval System:
Real-time System for Classifying and
Retrieving Unstructured Pediatric
Clinical Annotations
Charlotte Andersen
John Pestian
Karen Davis
Lukasz Itert
Paweł Matykewicz
Włodzisław Duch
Cincinnati, February 2005
Outline
• The project
• Goals
• Focus
• Software
• Results
• Plans …
CCHRF project outline (simplified)
INPUT (raw medical text) → Preprocessing → MetaMap input → MetaMap (UMLS software for concept discovery and indexing) → Annotations: Concept Space (UMLS concepts) → Hypothesis generation, validation, important relations → Decision support systems; automatic medical billing
Long-term goals (too ambitious?)
An IR system facilitating discoveries, helping to answer questions like:
– Retrieve similar cases using discharge summaries.
– Is X related to Y?
– Will X help a patient with Y?
– What correlates with X?
– What causes changes of X?
– What are the therapy options for X?
Automatic creation of medical billing codes from text.
Can we work out scenarios of use for our target system?
First big problem: disambiguation
Map raw text to a structured form, removing all ambiguities, expanding acronyms, etc.
Use the NLM's MetaMap to create XML-formatted data whose schema is based on the Unified Medical Language System (UMLS) Semantic Network ontology.
<semantic type>word</semantic type>
E.C. => <bacterium>Escherichia coli</bacterium>
<patient>
<FIRST-NAME>Bob</FIRST-NAME>
<LAST-NAME>Nope</LAST-NAME>
</patient>
XML or structured text
The final XML should include the maximum information that can be derived with high confidence for each word, including:
1. Annotations for parts of speech (tree tagger) – for which types of words?
2. Tags for semantic type (135 types in the UMLS + tags for other non-medical types);
3. Tags for word sense (UMLS + dictionaries such as WordNet);
4. Values assigned to some semantic types, e.g. Temperature=high, or T=102F.
What we should keep depends on the scenarios in which the system will be used.
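As a concrete illustration of the target output, a minimal sketch of wrapping recognized terms in semantic-type tags, as in the E.C. example above. The lookup table is a toy stand-in; real tagging would come from MetaMap/UMLS.

```python
# Minimal sketch: wrap known terms in semantic-type XML tags.
# The LEXICON dictionary is a hypothetical stand-in for a UMLS/MetaMap lookup.
import xml.etree.ElementTree as ET

# Toy lookup table: surface form -> (semantic type, canonical expansion)
LEXICON = {
    "E.C.": ("bacterium", "Escherichia coli"),
}

def annotate(tokens):
    """Return an XML <text> element with known terms wrapped in type tags."""
    root = ET.Element("text")
    for tok in tokens:
        if tok in LEXICON:
            sem_type, expansion = LEXICON[tok]
            child = ET.SubElement(root, sem_type)
            child.text = expansion
        else:
            word = ET.SubElement(root, "w")
            word.text = tok
    return ET.tostring(root, encoding="unicode")

print(annotate(["Culture", "grew", "E.C."]))
# -> <text><w>Culture</w><w>grew</w><bacterium>Escherichia coli</bacterium></text>
```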
Small steps to solve the big problem
Main subproblems:
• Removing patient-specific information while keeping all information related to a single case together; how do we link the sequence of records for a single person?
• Text cleaning: misspellings, obtaining unique terms.
• Expansion of abbreviations and acronyms.
• Ambiguity of medical terms.
• Ambiguity of common words; how interesting are common terms, and which categories/semantic types should be used?
• Assigning values to some categories, e.g. blood pressure, temperature.
• Check XML standards developed at AMIA.
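The first subproblem above (de-identify yet keep one person's records linked) can be sketched with a keyed hash: identifiers are replaced by a pseudonym that is stable for the same patient. The key and field choices here are assumptions for illustration only.

```python
# Sketch of de-identification with linkability: replace patient identifiers
# with a keyed hash so all records of one person still group together.
# SECRET_KEY and the identity fields are hypothetical choices.
import hashlib
import hmac

SECRET_KEY = b"replace-with-site-secret"

def pseudonym(first, last, dob):
    """Stable, non-reversible pseudonym for a (first, last, dob) identity."""
    ident = f"{first.lower()}|{last.lower()}|{dob}".encode()
    return hmac.new(SECRET_KEY, ident, hashlib.sha256).hexdigest()[:12]

a = pseudonym("Bob", "Nope", "1990-01-02")
b = pseudonym("Bob", "Nope", "1990-01-02")
c = pseudonym("Ann", "Else", "1991-03-04")
print(a == b, a == c)   # -> True False
```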
Human information retrieval
3 levels:
First: local recognition of terms; we bring a lot of background knowledge to reading the text, ignoring misspellings and small mistakes.
Second: larger units, semantic interpretation; discover and understand the meaning of concepts composed of several terms, define the semantic word sense for ambiguous words, expand terms and acronyms to reach an unambiguous interpretation.
Third: the episodic level of processing, or what the whole record or text is about. Knowing the category of a text helps in unique interpretation at the recognition and semantic levels.
Recognition
Pawel started some work; a short report on text recognition memory was written.
NLM has the GSpell and WedSpell spelling-suggestion tools, and the BagOWordsPlus phrase retrieval tool (new, worth trying).
GSpell Java classes are used to propose spelling corrections and a unique spelling for words that have alternative spellings.
Even a correctly spelled word may be a mistake, e.g.:
disease|disease|0.0|1.0|NGrams|Correct
disease|discase|1.0|0.873|NGrams|
disease|diseased|1.0|0.873|NGrams|
disease|decease|2.0|0.5819672267388108|NGrams|
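The ranking idea behind output like the above can be sketched as a mix of edit-based similarity and character n-gram overlap. This is not GSpell itself; the dictionary and the weighting are toy assumptions.

```python
# Sketch of spelling-suggestion ranking (not GSpell): combine character
# bigram overlap (Jaccard) with difflib sequence similarity.
# DICTIONARY is a toy stand-in for a medical lexicon.
from difflib import SequenceMatcher

DICTIONARY = ["disease", "diseased", "decease", "discase"]

def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def score(query, candidate):
    """Average of bigram Jaccard overlap and sequence similarity."""
    a, b = ngrams(query), ngrams(candidate)
    jaccard = len(a & b) / len(a | b) if a | b else 0.0
    return 0.5 * jaccard + 0.5 * SequenceMatcher(None, query, candidate).ratio()

def suggest(query, k=3):
    """Top-k dictionary words ranked by similarity to the query."""
    return sorted(DICTIONARY, key=lambda w: -score(query, w))[:k]

print(suggest("disease"))   # "disease" itself ranks first (score 1.0)
```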
Recognition (cont.)
Is this an issue in our case? Can we estimate how serious the problems at the recognition level are?
A term may be part of a phrase, and this would be discovered only when the term is correctly recognized.
How do we know that we have an acronym/abbreviation? Frequently capital letters, usually 2-4 letters, with a morphological structure that is improbable under a bi-gram model, e.g. DMI, CRC, IVF.
Acronyms and abbreviations should be recognized and expanded.
We need probabilities of various typos (keys that are close, characters that are inverted, frequent errors, anticipation of what character should come next, etc.), and of errors at the spelling and phonological levels.
External dictionaries should be checked to find out whether a word is a specific medical term that is not listed in the UMLS.
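The acronym heuristic above (short, capitalized, improbable bigram structure) can be sketched as follows; the list of "common" English bigrams is a toy stand-in for corpus-derived bigram frequencies.

```python
# Heuristic acronym detector sketched from the slide: 2-4 capital letters
# whose bigrams are mostly rare in ordinary English words.
# COMMON_BIGRAMS is a hypothetical stand-in for a corpus bigram model.
import re

COMMON_BIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "es"}

def looks_like_acronym(token):
    """True if token is 2-4 capital letters with improbable bigram structure."""
    if not re.fullmatch(r"[A-Z]{2,4}", token):
        return False
    bigrams = [token[i:i + 2].lower() for i in range(len(token) - 1)]
    common = sum(b in COMMON_BIGRAMS for b in bigrams)
    return common < len(bigrams)  # mostly rare bigrams -> likely acronym

for t in ["DMI", "CRC", "IVF", "The", "pain"]:
    print(t, looks_like_acronym(t))
```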
Semantic level
Required to:
• select the most probable term when the recognition process gives several alternatives at the same confidence level;
• perform WSD, i.e. find the semantic word sense for ambiguous words.
A word may have a correct spelling but no sense at the semantic level; then go back to the recognition level and generate more similar words to check which one is the most probable at the semantic level.
In most cases this should give a highly probable term; once this is achieved, a unique semantic word sense is defined.
Semantic knowledge representation may be done using:
• context vectors,
• concept-description vectors,
• more elaborate approaches, like frames (CYC).
Semantic knowledge representation
• Context vectors: numerical, easy to generate from co-occurrence. A widely used statistical approach, but it lacks semantics; a concept name and its properties may be far apart.
• Concept description vectors (CDV), knowledge-based: list properties of concepts, derive information from definitions, dictionaries, and ontologies; pay more attention to unique features.
• Frames, structured representations: more expressive power, with symbolic values such as color = blue or color in {blue, green}, etc.
time = admission_time;
time = day before discharge [time = morning ...], etc.
• Initially a simple vector representation should be sufficient for WSD, but remember that its expressive power is limited. Some thinking about a simplified, computationally efficient frame-based representation should be done.
Episodic level
• Try to understand what the whole record or paragraph is about.
• ACP has at least 14 distinct meanings in Medline abstracts; the recognition/semantic level is not sufficient for disambiguation.
• Essentially this requires categorization of documents/paragraphs. The record should be placed in some category, and this will restrict the type of semantic meanings that are probable in this category.
• This is more expensive than the semantic level. To achieve it, categories of records should be identified (document classification).
• Lukasz has made the first experiments using different knowledge representations with discharge summaries.
• The different levels – R, S, E – are coupled. Knowing the disease, it is easier to uniquely expand some acronyms and provide WSD. Adding some XML annotation should make text categorization easier.
• Several interpretations should be maintained, then one selected.
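The categorization step can be sketched as nearest-centroid classification in a bag-of-words space. The categories and training texts are invented stand-ins for discharge-summary classes.

```python
# Nearest-centroid document categorization sketch. TRAIN categories and
# documents are hypothetical examples, not real discharge summaries.
from collections import Counter
from math import sqrt

TRAIN = {
    "cardiology": ["chest pain ecg heart murmur", "heart failure ecg"],
    "orthopedics": ["fracture cast bone xray", "bone pain cast"],
}

def vec(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One centroid per category: the summed word counts of its documents.
CENTROIDS = {cat: sum((vec(d) for d in docs), Counter())
             for cat, docs in TRAIN.items()}

def categorize(text):
    """Assign the category whose centroid is closest in cosine similarity."""
    v = vec(text)
    return max(CENTROIDS, key=lambda c: cosine(v, CENTROIDS[c]))

print(categorize("ecg shows heart block"))   # -> cardiology
```

Once a record lands in a category, the set of plausible acronym expansions and word senses can be restricted to that category, as the slide argues.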
Billing codes
• Is it feasible? Complete automation may be hard.
• Many courses and books are on the market; billions of dollars annually.
• Simplest solution: a proper database => codes automatically.
• Knowledge-based approach to derive billing codes from texts: look at the rules in books, try to analyze the text, estimate which fields are easy and which are difficult.
• Memory-based approach: find similar descriptions that have the same codes (used in the national census).
• Correlation-based: look at the statistical distribution of codes and the correlation between digit values; useful for checking, sometimes for prediction.
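The memory-based approach can be sketched as nearest-neighbor retrieval: reuse the code of the most similar past description. The descriptions and codes below are illustrative examples, not an authoritative code table.

```python
# Memory-based billing sketch: reuse the code of the most similar stored
# description. MEMORY entries are illustrative examples only.
from difflib import SequenceMatcher

MEMORY = [
    ("closed fracture of radius", "813.81"),
    ("acute otitis media", "382.9"),
    ("asthma with exacerbation", "493.92"),
]

def assign_code(description):
    """Return the code of the nearest stored description by string similarity."""
    best = max(MEMORY,
               key=lambda m: SequenceMatcher(None, description, m[0]).ratio())
    return best[1]

print(assign_code("acute otitis media, left ear"))   # -> 382.9
```

A real system would retrieve in a concept space rather than raw strings, and the correlation-based check from the slide could then flag retrieved codes whose digit patterns are statistically unlikely.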
Demo
General questions
1. How should we proceed? Depending on the scenario of use, we can work on selected aspects of the problem or try to put the whole system together & go on improving it.
2. What data can we have access to? How reliable is it?
3. What should we still do at the pre-processing stage? Anonymizing but linking individual patients?
4. How should we leverage the POS-tagged corpus? Compare different unsupervised taggers; check the improvement of supervised taggers; use POS as additional info in concept discovery and WSD; other ideas?
Recognition memory level
1. Cleaning the text, focusing on details: many misspellings; various recognition memory techniques may be applied to token => term mappings. Pawel has made a good start, but be careful: it is easy to introduce errors.
2. Improvements of GSpell are of interest to NLM.
3. About 1000 disambiguation rules were derived from >700K trigrams, but how universal are these rules on new texts? Are some not too specific?
4. A semi-automatic approach may be based on context vectors: cluster the different uses of mm, ALL, etc. first, and for each try to assign a unique meaning from context. How does it compare with manually derived rules? Can we combine the two approaches for higher confidence?
Semantic memory level
1. So far we have used only MetaMap, but we need phrase and concept indexing: noun phrases, creating equivalence classes, compression of information; finding concepts in whole sentences or large windows, not only in phrases.
2. WSD, or rather concept sense disambiguation (CSD); work with the context vectors in the compressed text.
3. Knowledge-based approach: create concept-description vectors from medical dictionaries and ontologies; this goes beyond context vectors by providing reference knowledge.
4. Knowledge discovery: assigning values to concepts, assigning concepts to numbers and adjectives, e.g. blood_pressure=[xxx-yyy], or blood_pressure=normal; adjective-noun relations, or number-concept relations; look for relations at this stage, using fuzzy/similarity logic.
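Point 4 (assigning values to concepts) can be sketched with simple patterns that map numeric mentions to concept values, as in blood_pressure or T=102F. The patterns are illustrative, not a complete clinical grammar.

```python
# Value-assignment sketch: map numeric mentions to (concept, value) pairs.
# The two patterns are hypothetical examples, not a full clinical grammar.
import re

PATTERNS = [
    (re.compile(r"\bBP\s*(\d{2,3})/(\d{2,3})\b"),
     lambda m: ("blood_pressure", f"{m.group(1)}/{m.group(2)}")),
    (re.compile(r"\bT(?:emp)?\s*=?\s*(\d{2,3}(?:\.\d)?)\s*F\b"),
     lambda m: ("temperature", m.group(1) + "F")),
]

def extract_values(text):
    """Return all (concept, value) pairs found in the text."""
    found = []
    for pattern, build in PATTERNS:
        for m in pattern.finditer(text):
            found.append(build(m))
    return found

print(extract_values("BP 120/80, T=102F on admission"))
# -> [('blood_pressure', '120/80'), ('temperature', '102F')]
```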
Episodic memory level
1. Document categorization: what categories? For billing, very detailed ones; but even rough categories are useful to narrow down the choices for acronym expansion and WSD.
2. Lukasz: the most common categories were derived from the database; it is not clear how accurate the initial diagnosis is, but at this rough level it should be fine.
3. Use MeSH headings at some level? Challenge: select the best set of headings that will help to find the unique sense of words and acronyms.
4. Many advanced approaches to text categorization exist, like kernel-based methods for text; a nice field, but the secret is in pre-processing, finding a good feature space.
5. Relation to the 20Q game: gaining confidence stepwise.
Suggestions & priorities
1. What are our priorities? All 3 levels are important. Where will our greatest impact be?
2. Start with document categorization? People usually know the document category when they read it; misunderstanding is certain if short documents are given to the wrong experts. Try: knowledge-based clustering and supervised learning; recurrent NNs for structured problems; decision trees for many missing values ...
3. Good categorization needs concepts/phrases; we should focus on concept discovery and check the coupling with document categorization, exploring parallel hypotheses.
4. Some work should also be finished at the recognition memory level: acronyms + misspellings.