slides - Vrije Universiteit Brussel

R T U New York State
Center of Excellence in
Bioinformatics & Life Sciences
VUB Leerstoel 2009-2010
Theme: Ontology for Ontologies, theory and applications
Ontologies and Natural Language Understanding
May 20, 2010; 17h00-19h00
Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels
Room D2.01
Prof. Werner CEUSTERS, MD
Ontology Research Group, Center of Excellence in Bioinformatics and Life Sciences
and
Department of Psychiatry, University at Buffalo, NY, USA
Context of this lecture series
[Diagram of related fields: Knowledge Representation, Informatics, Linguistics, Computational Linguistics, Medical Natural Language Understanding, Electronic Health Records, Translational Research, Medicine, Biology, Ontology, Philosophy, Realism-Based Ontology, Referent Tracking, Pharmacogenomics, Pharmacology, Performing Arts, Defense & Intelligence]
Today’s topic
• May 20: ontologies and Natural Language Understanding
[Diagram, fields shown: Informatics, Linguistics, Computational Linguistics, Medical Natural Language Understanding, Electronic Health Records, Medicine, Realism-Based Ontology]
Amazing technology
A human being with function enhancing electronic implants
A tiny scanner capable of detecting bodily anomalies
A ‘doctor’ who is in fact some sort of computer program capable of making medical diagnoses
Flawless communication between a human and a computer
Or not amazing? … towards a bionic eye
http://bionicvision.org.au/
Or not ? … mobile diagnostics
SilhouetteMobile™: scans and stores information about a wound's width and depth, which helps nurses track healing over time as new tissue fills in the injury
GlucoPack™: reads and transmits glucose readings
Or not ? … Transhumanism
Max More
• "Philosophies of life that
seek the continuation
and acceleration of the
evolution of intelligent
life beyond its currently
human form and human
limitations by means of
science and technology,
guided by life-promoting
principles and values."
Beyond natural evolution …
to … mind uploading ?
Ray Kurzweil receives National Medal of Technology (1999).
But for today:
How to communicate with computers naturally ?
The supercomputer HAL from 2001: A Space Odyssey.
Michael & Scott’s solution
http://aboulet.files.wordpress.com/2007/05/traveling-salesmen1.jpg
Better: a combination of various technologies
My interest in NLU: the medical informatics dogma
• Fact: computers can only deal with a structured
representation of reality:
– structured data:
• relational databases, spread sheets
– structured information:
• XML simulates context
– structured knowledge:
• rule-based knowledge systems
• Conclusion: a need for structured data entry
Structured data entry
• Current technical solutions:
– rigid data entry forms
– coding and classification systems
• But:
– the description of biological variability requires the
flexibility of natural language and it is generally
desirable not to interfere with the traditional manner of
medical recording (Wiederhold, 1980)
– Initiatives to facilitate the entry of narrative data have
focused on the control rather than the ease of data
entry (Tanghe, 1997)
Drawbacks of structured data entry
• Loss of information
– qualitatively
• limited expressiveness of coding and classification systems,
controlled vocabularies, and “traditional” medical
terminologies
• use of purpose oriented systems
– don’t use data for another purpose than originally foreseen (J VDL)
– quantitatively
• too time-consuming to code all information manually
• Speech recognition and structured data entry forms
are not best friends
The pillars of healthcare informatics
• Clinical language
– medical narrative
• Clinical terminologies
– coding and classification systems
– nomenclatures
– formal ontologies
• Electronic Healthcare Record
Systems
The possibilities
• Text-based EHCRS able to generate structured
data
• An EHCR exclusively built around a collection
of coded data generated out of free text
• A multimedia EHCRS with clinical narrative
registration and structured data generation
• A multimedia EHCRS with structured data entry
and text generation
• An EHCR exclusively built around texts
generated out of controlled vocabularies
• An EHCR exclusively built around a collection
of structured data able to generate text
Main issues of MNLU
• Medical natural language understanding is:
– Making computers understand medical language
– Allowing computers to turn unstructured texts into
structured information
• Medical NLU is NOT:
– medical reasoning performed by computers
– reducing the richness of clinical language to a closed
set of codes
Typical examples of MNLU
• contextual spell checking
• information retrieval
– topic selection
– relevance ranking
• coding and classification
• software agents for clinical studies
• unstructured data registration for structured
reporting
Areas for application of MNLU
• Coding patient data
• Structured information extraction from
unstructured clinical notes
• Clinical protocols and guidelines
• Assessing patient eligibility for clinical trial entry
• Triggering and alerts
• Linking case descriptions to scientific literature
• Easy access to content
• ... towards a medical semantic web
A wealth of communication related applications (1)
• Speech as input:
– voice recognition:
• who is the sender?
– speech recognition:
• dictation: what is the corresponding text?
– irrespective of meaning
• command and control
• language learning (pronunciation checking)
• question answering
• spoken natural language understanding
A wealth of communication related applications (2)
• Text as input:
– speech generation (text-to-speech)
– spell checking
– grammar checking
– plagiarism detection
– indexing – semantic indexing – topic detection
• document retrieval
– return documents that tell me when Bonaparte was born
• information retrieval
– find in documents the date Bonaparte was born and return only the date
– clinical coding
Speech generation (1)
She lives near the highway where
three lives were lost.
Speech generation (2)
Chapter III is about Henry III.
Text-to-speech basics
http://upload.wikimedia.org/wikipedia/en/a/af/Festival_TTS_Telugu.jpg
Simple speech recognition algorithm
[Block diagram. Components: raw speech, signal analysis, speech frames, acoustic analysis, acoustic models, frame scores, time alignment, sequential constraints, word sequence, segmentation, train]
From the INRIA Parole project
Dialogue systems with automatic translation
http://www.oxygen.lcs.mit.edu/images/Speech.jpg
The disambiguation problem
• Some examples:
– ‘lives’: from ‘to live’ or plural of ‘life’
– ‘III’: as ‘three’ or ‘the third’
– ‘bow’: the weapon or from ‘to bow’
• Statistical models (n-grams):
– most often sufficient
– quite fast analysis
• Syntactic analysis
• Semantic analysis (deep or shallow)
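The n-gram idea can be sketched with a tiny bigram model that picks the reading of ‘lives’ from the preceding word. This is only an illustration: the training pairs and the names `TRAINING`, `train` and `disambiguate` are hypothetical, and a real model would be estimated from a large tagged corpus.

```python
from collections import Counter

# Hypothetical hand-tagged examples: (previous word, reading of "lives").
TRAINING = [
    ("she", "verb"), ("he", "verb"), ("who", "verb"),
    ("three", "noun"), ("many", "noun"), ("their", "noun"),
]


def train(pairs):
    """Count how often each reading follows each previous word."""
    counts = Counter()
    for prev, reading in pairs:
        counts[(prev.lower(), reading)] += 1
    return counts


def disambiguate(tokens, index, counts, readings=("verb", "noun")):
    """Pick the reading seen most often after the previous token.

    Ties (including unseen contexts) fall back to the first reading listed.
    """
    prev = tokens[index - 1].lower() if index > 0 else "<s>"
    return max(readings, key=lambda r: counts[(prev, r)])


counts = train(TRAINING)
sentence = "She lives near the highway where three lives were lost".split()
first, second = [i for i, t in enumerate(sentence) if t == "lives"]
print(disambiguate(sentence, first, counts))   # reading after "She"
print(disambiguate(sentence, second, counts))  # reading after "three"
```

With these toy counts the first ‘lives’ comes out as a verb and the second as a noun, which is the distinction a text-to-speech system needs to choose the pronunciation.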
A toy ontology for communication (1)
• Patterned particular (PP):
– piece of text: combination of characters
– sound wave
– series of signs in sign language, smoke
– combination and sequence of smells ?
• Some sender which generated a PP with the intention to provoke
something in some receiver, the PP thus becoming a linguistic
patterned particular (LPP)
– standard messages, questions, commands
• carry meaning directly encoded in the message
– poems, lies, deceptions, nonsense:
• no or partial directly encoded information
• Being a PP is not sufficient to be an LPP. There has to be a sender!
– a bird or insect flying in a pattern that looks like an LPP in some language
A toy ontology for communication (2)
• Aboutness relation from certain elementary LPPs
to real world entities when created under certain
circumstances
– ‘me’, ‘I’, ‘mine’
– ‘current’, ‘president’, ‘United States’, ‘king’, ‘France’
• Pattern types
– morphologic, syntactic, semantic and discourse
conventions
• ‘current President of the United States’
• ‘current king of France’
A toy ontology for communication (3)
• Questionable entities:
– ‘propositions’
• sort of factual, linguistically undressed statements about the
world
– ‘bare meanings’
Text analysis
‘The doctor checks Seven of Nine’s blood pressure’
Syntactic analysis
[Parse tree of ‘The doctor checks the blood pressure of Seven of Nine’:]
[sentence [noun phrase [det The] [noun doctor]] [verb phrase [verb checks] [noun phrase [det the] [compound noun blood pressure] [prepositional phrase [prep of] [noun phrase [person name Seven of Nine]]]]]]
Semantic analysis
[Parse tree of ‘The doctor checks the blood pressure of Seven of Nine’ annotated with semantic types and relations: ‘checks’ is typed as checking; has-agent links it to ‘The doctor’ (person); has-object links it to ‘the blood pressure’ (clinical sign); belongs-to links the blood pressure to ‘Seven of Nine’ (person)]
The doctor uses an instrument
[Parse tree of ‘The doctor examines the patient with a hammer’ in which ‘with a hammer’ modifies the verb. Semantic roles of checking: agent ‘The doctor’, object ‘the patient’, instrument ‘a hammer’]
Here the patient has the hammer !
[Parse tree of ‘The doctor examines the patient with a hammer’ in which the prepositional phrase ‘with a hammer’ is attached to the noun phrase ‘the patient’. Semantic roles of checking: agent ‘The doctor’, object ‘the patient with a hammer’; no instrument role]
The problem of reference
• ‘The surgeon examined Maria. She found a small
tumor on the left side of her liver. She had it
removed three weeks later.’
• Ambiguities:
– whom does the first ‘she’ denote: the surgeon or Maria ?
– on whose liver was the tumor found ?
– whom does the second ‘she’ denote: the surgeon or Maria ?
– what was removed: the tumor or the liver ?
• Here ontology can come to aid.
Ontologies and NLP
• A two-way collaboration:
– using NLP techniques to assist the development of
ontologies,
– using ontologies to make better NLP applications,
– bootstrapping: NLP applications that require ontologies
in some stage and intend to make these ontologies
better.
NLU as assistive technology
for ontology development
C-Tex: corpus-based term extraction
• Based on Deniz Yuret’s PhD thesis
• good news: (a particular) language independent
automatic linguistic knowledge extractor
– relationships between words
– grammar generation
– term extraction
– synonym / homonym detector (???)
• bad news:
– large corpora required (occ > 500 * different tokens)
– big PC required (3.000.000 words/day, DOS, PII-350)
C-Tex: term extraction
TERM                              Occurrences (5679 reps)
magnetic resonance                100
san francisco                     12
invasive fungal sinusitis         7
rhinosinusitis disability index   3
intensive care unit               178
food allergy                      31
th1 and th2                       32
positron emission                 29
C-Tex grammar induction
• Sentence encountered:
• Sentence analyzed:
C-Tex’s linguistic principles
• Words in natural language sentences:
– tend to collocate with a certain strength,
– are not linked in circular ways,
– have links that don’t cross.
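The last two principles can be read as a validity check on a candidate set of word-to-word links. The sketch below is not C-Tex itself: the function names and the example link set over ‘I saw a man carry a telescope’ are hypothetical, and links are assumed to be undirected pairs of word positions.

```python
def links_cross(a, b):
    """Links cross if exactly one end of b lies strictly between the ends of a."""
    i, j = sorted(a)
    k, m = sorted(b)
    return i < k < j < m or k < i < m < j


def has_cycle(links, n_words):
    """Union-find: a link that joins two already connected words closes a cycle."""
    parent = list(range(n_words))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False


def is_valid_linkage(links, n_words):
    """True if no two links cross and the links contain no cycle."""
    crossing = any(
        links_cross(a, b)
        for idx, a in enumerate(links)
        for b in links[idx + 1:]
    )
    return not crossing and not has_cycle(links, n_words)


words = "I saw a man carry a telescope".split()
# Hypothetical linkage: I-saw, saw-man, a-man, man-carry, carry-telescope, a-telescope
tree = [(0, 1), (1, 3), (2, 3), (3, 4), (4, 6), (5, 6)]
print(is_valid_linkage(tree, len(words)))
print(is_valid_linkage([(0, 2), (1, 3)], 4))          # crossing links
print(is_valid_linkage([(0, 1), (1, 2), (0, 2)], 3))  # circular links
```

The first principle, collocation strength, would decide which of the valid link sets is preferred; it is left out here because it needs corpus statistics.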
C-Tex processing
[Sequence of seven diagrams over the sentence ‘I saw a man carry a telescope’. Labels shown at each step: (1) s1 to s6; (2) s1 to s7; (3) s1, s3 to s8; (4) s1, s7 to s11; (5) s1, s7 to s12; (6) and (7) s1, s7, s9 to s12]
Advantages
• Defining the required coverage for a given
domain, by
– listing the terms that need to receive a description in
the ontology (= inverse annotation)
– listing the relationships that need to be named
• Catch up mechanism:
– things already done, don’t need to be done again
– If a C-Tex without prior knowledge works fine, one
with ontological knowledge should work even better
• Builds a grammar
Drawbacks
• very slow
• very sensitive to repeatedly seeing the same
documents
– requires very careful training set development
Gap Finder and Web Agent
“Domain specific” word detection
Indiana, Indianapolis, Inoue, Irani,
Irving, Ito, Iwanaga, J,
JAMA, Jaffe, Jain, Jama,
Janus, Japan, Jk-bind, Jk-binding,
Johannes, Johannsen, Johnson, K,
Kanno, Kaplan, Karin, Kaye,
Kd, Keegan, Keller, Kennedy,
Kern, Kimble, Kirsch, Kishimoto,
Knowles, Ko, Kozma, L,
L.M., LAV, LBF3-bind, LBF3-binding,
LBF4-bind, LBF4-binding, LBF5-and, LBF6-bind,
LBF6-binding, LD, LFA, LMP,
LMP-1-express, LMP-1-induce, LMP-1-mediate, LMP-1-negative,
LMP-1-positive, LMP-1-transfect, LN, LOH,
LPS, LT, LTR, LTR-CAT,
LTR-Cat, Laine, Lane, Lanes,
Laurent, Lechler, Lee, Left,
Lenny, Lenoir, Leonard, Lett,
Leung, Levels, Levine, Levy,
Lewis, Ley, Li, Liebowitz,
Lim, Lin, Ling, Listeria,
Listeria monocytogenes, Liu, Loisel, London
Kohonen clustering
Statistical relationship discovery
context               term               role        value term
EU 6th VAT Directive  member state       ACTOR-OF    necessary measure
EU 6th VAT Directive  condition          ACT-UPON    purpose
EU 6th VAT Directive  criterion          ACT-UPON    document
EU 6th VAT Directive  member state       ACT-UPON    amount
EU 6th VAT Directive  member state       ACT-UPON    condition
EU 6th VAT Directive  member state       ACT-UPON    method
EU 6th VAT Directive  member state       ACT-UPON    national currency of ecu
EU 6th VAT Directive  member state       ACT-UPON    period
EU 6th VAT Directive  member state       ACT-UPON    rules
EU 6th VAT Directive  accommodation      CAUSED-BY   similar establishment
EU 6th VAT Directive  committee          CAUSED-BY   commission
EU 6th VAT Directive  service            CAUSED-BY   agricultural holdings
EU 6th VAT Directive  service            CAUSED-BY   taxable person
EU 6th VAT Directive  supply of service  CAUSED-BY   aim
EU 6th VAT Directive  accompany          HAS_ACTION  luggage
EU 6th VAT Directive  achieve            HAS_ACTION  exemption
EU 6th VAT Directive  acquire            HAS_ACTION  goods
EU 6th VAT Directive  allow              HAS_ACTION  identification
EU 6th VAT Directive  animal             HAS_ACTION  animals
EU 6th VAT Directive  apply              HAS_ACTION  vat
EU 6th VAT Directive  authorise          HAS_ACTION  member
EU 6th VAT Directive  authorise          HAS_ACTION  suspension
EU 6th VAT Directive  avoid              HAS_ACTION  fraud
EU 6th VAT Directive  breeding           HAS_ACTION  boars
EU 6th VAT Directive  calculating        HAS_ACTION  turnover
The ‘clique’ - approach
• a clique in an undirected graph
is a subset of its vertices such
that every two vertices in the
subset are connected by an
edge.
• A clique is maximal iff not part
of a larger clique.
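Both definitions translate directly into code. The following is a minimal sketch, assuming the graph is stored as a dictionary from each vertex to the set of its neighbours; the small word graph is a made-up example, not data from the lecture.

```python
from itertools import combinations


def is_clique(graph, nodes):
    """Every two vertices in `nodes` are connected by an edge."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(graph, nodes):
    """A clique is maximal if no outside vertex is adjacent to all its members."""
    nodes = set(nodes)
    if not is_clique(graph, nodes):
        return False
    return not any(nodes <= graph[w] for w in graph if w not in nodes)


# Hypothetical undirected co-occurrence graph (adjacency sets are symmetric).
graph = {
    "salt": {"pepper", "vinegar"},
    "pepper": {"salt", "vinegar"},
    "vinegar": {"salt", "pepper", "oil"},
    "oil": {"vinegar"},
}

print(is_clique(graph, {"salt", "pepper"}))
print(is_maximal_clique(graph, {"salt", "pepper"}))
print(is_maximal_clique(graph, {"salt", "pepper", "vinegar"}))
```

Here {salt, pepper} is a clique but not a maximal one, because ‘vinegar’ is connected to both and can be added.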
Building cliques out of n-grams
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009,
Hyderabad, India. http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf
Sorts of cliques in linguistic corpora
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009,
Hyderabad, India. http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf
Category and hierarchy generation
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009,
Hyderabad, India. http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf
Ontology to improve
natural language understanding
Understanding content (1)
We see:
“John Doe has a pyogenic
granuloma of the left thumb”
The machine sees:
John Doe has a
pyogenic
granuloma of
the left thumb
Understanding content (2)
The XML misunderstanding
We see:
<record>
<patient>John Doe</patient>
<diagnosis>pyogenic granuloma of the left thumb</diagnosis>
</record>
The machine sees:
<record>
<subject> John Doe </subject>
<diagnosis> pyogenic granuloma
of the left thumb </diagnosis>
</record>
Requirements for NLU
1. Knowledge about terms and how they are used in valid
constructions within natural language;
2. Knowledge about the world, i.e. how the referents denoted by the
terms interrelate in reality and in given types of context;
3. An algorithm that :
a. is able to calculate a language user’s representation of that part
of the world described in the utterances that are the subject of
the analysis.
b. can track the ways in which people express what does NOT
represent anything in reality (e.g. for medico-legal reasons)
Only a realist ontology (and not an ontology that deals with
“alternative realities”) permits correct disambiguation
between 3a and 3b.
Exploit the relationships along the vertices
[Triangle with vertices: concept, language, referents]
• Halliday’s systemic functional grammar
• The structures of language are partially determined by our conceptualisation of the world. (Halliday)
• No mental representation without language (Fodor)
• Aristotelian realism
• Meaning is located in the interaction between living beings and the environment (James J. Gibson, Ecological Realism in Psychology)
• “Baboons and humans have different cut-off points for discerning "same" objects because our verbal expression for "same" makes the idea of "same" more restrictive.” Fagot and Wasserman (Centre for Research in Cognitive Neuroscience in Marseille)
The content
[Diagram. Components: Language A (Lexicon, Grammar); Language B (Lexicon, Grammar); Linguistic Ontology; Formal Domain Ontology; terminologies: ICPC, SNOMED, ICD, MEDRA, Proprietary Terminologies, Others ...]
Use of spatial logics
[Hierarchy of spatial relations. Relations shown: HAS-OVERLAPPING-REGION, HAS-PARTIAL-SPATIAL-OVERLAP, IS-SPATIAL-PART-OF, IS-PROPER-SPAT.-PART-OF, HAS-DISCRETED-REGION, HAS-SPATIAL-PART, HAS-PROPER-SPATIAL-PART, HAS-SPATIAL-POINT-REFERENCE, HAS-CONNECTING-REGION, HAS-DISCONNECTED-REGION, HAS-EXTERNAL-CONNECTING-REGION, IS-NON-TANG.-SPAT.-PART-OF, IS-TANG.-SPAT.-PART-OF, HAS-NON-TANG.-SPAT.-PART, HAS-TANG.-SPAT.-PART, IS-SPAT.-EQUIV.-OF, IS-PARTLY-IN-CONVEX-HULL-OF, IS-INSIDE-CONVEX-HULL-OF, IS-OUTSIDE-CONVEX-HULL-OF, IS-GEO-INSIDE-OF, IS-TOPO-INSIDE-OF]
Example: (canonical) joint anatomy
• joint HAS-HOLE joint space
• joint capsule IS-OUTER-LAYER-OF joint
• meniscus
– IS-INCOMPLETE-FILLER-OF joint space
– IS-TOPO-INSIDE joint capsule
– IS-NON-TANGENTIAL-MATERIAL-PART-OF
joint
• joint
– IS-CONNECTOR-OF bone X
– IS-CONNECTOR-OF bone Y
• synovia
– IS-INCOMPLETE-FILLER-OF joint space
• synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space
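The statements on this slide are subject-relation-object triples, so they can be stored and queried as such. The sketch below assumes nothing about the actual ontology software used; `TRIPLES`, `objects` and `subjects` are hypothetical names, and the relation names are copied from the slide with hyphens restored.

```python
# The joint anatomy statements as (subject, relation, object) triples.
TRIPLES = [
    ("joint", "HAS-HOLE", "joint space"),
    ("joint capsule", "IS-OUTER-LAYER-OF", "joint"),
    ("meniscus", "IS-INCOMPLETE-FILLER-OF", "joint space"),
    ("meniscus", "IS-TOPO-INSIDE", "joint capsule"),
    ("meniscus", "IS-NON-TANGENTIAL-MATERIAL-PART-OF", "joint"),
    ("joint", "IS-CONNECTOR-OF", "bone X"),
    ("joint", "IS-CONNECTOR-OF", "bone Y"),
    ("synovia", "IS-INCOMPLETE-FILLER-OF", "joint space"),
    ("synovial membrane", "IS-BONAFIDE-BOUNDARY-OF", "joint space"),
]


def objects(subject, relation):
    """All entities that `subject` stands in `relation` to."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """All entities that stand in `relation` to `obj`."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


print(objects("joint", "IS-CONNECTOR-OF"))
print(subjects("IS-INCOMPLETE-FILLER-OF", "joint space"))
```

A query such as "what partially fills the joint space?" then returns both the meniscus and the synovia, which is the kind of world knowledge a parser can use to rule out implausible readings.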
Linguistic, domain and BFO-based RUs
[Semantic network. Classes: Generalised Possession, Healthcare phenomenon, Human being, Having a healthcare phenomenon, Patient, Patient at risk, Patient at risk for osteoporosis, Risk Factor, Risk factor for osteoporosis, Osteoporosis. Relations: subclass-of, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon, Is-RiskFactor-Of. Some links carry the numbers 1 to 4]
Value of the three sorts of RUs
• Linguistic:
– capture the way language is used
• Domain:
– capture how domain experts conceptualize the
domain
• is in part reflected by the way they talk about the domain
• BFO-based:
– capture how matters are believed to be, without
referring to linguistic or domain RUs except when they
denote the same thing
One should try to maximize the number of
BFO-based Representational Units
• In this case: base RUs on the Ontology of General
Medical Science
– healthcare phenomenon → bodily feature ?
– risk factor → disposition ?
– osteoporosis → disorder, disease, path. process ?
MNLU: the general idea
[Diagram: Text → Result. Texts shown: Discharge letter, MedLine abstracts, English patient record, French patient record, Surgery report. Results shown: Keywords, ICD-Codes, Protocol checking]
MNLU: some requirements
[Diagram. Components: Processor, Domain representation, Text, Result, Goal representation]
Linguistic Application Components
[Diagram. Components: Processor, Domain representation, Text, Result, Linguistic Knowledge, Task Knowledge, Goal representation]
Implements Rector’s
‘Clean separation of knowledge’
• Conceptual knowledge: the knowledge of sensible domain
concepts
• Knowledge of definitions and criteria: how to determine
if a concept applies to a particular instance
• Surface linguistic knowledge: how to express the concepts
in any given language
• Knowledge of classification and coding systems: how an
expression has been classified by such a system
• Pragmatic knowledge: what users usually say or think,
what they consider important, how to integrate in software
What does this mean for applications?
[Diagram. Components: Processor, Domain representation, Text, Result, Linguistic Knowledge, Discourse Information, Coding rules, Task Knowledge, Goal representation. Examples shown: English P.Rec, French P.Rec, Reports, Keywords, ICD-Codes, Completeness]
Halliday’s systemic functional grammar
• A “complete” theory for NLU
– constructivistic basis: “language construes human
experience”
– English: It is raining
– Chinese: The sky drops water
• hence: natural languages are instances of generic schemes
– macro-structure of documents
• derive a “structural formula”
– micro-structure of documents
• lexical cohesion
• in-conjunction analysis
General Principle of Semantic Mapping
1. Semantic constraints are associated with:
a) Lexemes, or,
b) Syntactic classes which generalize over lexemes.
2. A word inherits all constraints associated with each of
the syntactic classes it instantiates, as well as any
associated with the lexeme itself.
3. Where the lexicon provides multiple semantic
interpretations of a word, these are tried in order until
one applies. (e.g., “with” can be interpreted as
HAS_HC_PHENOMENON, HAS_INSTRUMENT, etc.)
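Point 3 amounts to an ordered lookup: each candidate interpretation is tested in turn and the first one whose constraint holds is used. The sketch below illustrates this for “with”. The type hierarchy `IS_A`, the constraint that the complement's class selects the relation, and all function names are assumptions made for the example; only the two relation names come from the slide.

```python
# Hypothetical miniature type hierarchy: child -> parent.
IS_A = {
    "HAMMER": "INSTRUMENT",
    "FEVER": "HC_PHENOMENON",
    "DOCTOR": "HUMAN",
}

# Interpretations of a word in lexicon order: (relation, required class of the complement).
INTERPRETATIONS = {
    "with": [
        ("HAS_HC_PHENOMENON", "HC_PHENOMENON"),
        ("HAS_INSTRUMENT", "INSTRUMENT"),
    ],
}


def is_a(ru, required):
    """Walk up the hierarchy from `ru` until `required` is found or the top is reached."""
    while ru is not None:
        if ru == required:
            return True
        ru = IS_A.get(ru)
    return False


def interpret(word, complement_ru):
    """Return the first interpretation whose constraint applies, else None."""
    for relation, required in INTERPRETATIONS.get(word, []):
        if is_a(complement_ru, required):
            return relation
    return None


print(interpret("with", "HAMMER"))  # instrument reading
print(interpret("with", "FEVER"))   # healthcare phenomenon reading
print(interpret("with", "DOCTOR"))  # no interpretation applies
```

Because the list is tried in order, the ordering in the lexicon acts as a preference when more than one interpretation would fit.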
Lexicon-specified Mapping
• “Lexsem rules” fix the RU that a particular term can map to.
lexsem <string> <wordclass> <concept>
e.g., lexsem “present” verb CONSULTATION_PROCESS
• The <string> element defines the root form of the lexeme, so the
above example will also be applicable for “presents” and
“presenting”.
• The <wordclass> element distinguishes cases of lexical
ambiguity, e.g., “present” as a noun.
• Where a lexeme is polysemic, multiple lexsem entries are
provided.
• In some cases, a lexeme provides not only a RU, but some
structure as well, e.g.,
lexsem "since" preposition {}
{Head.Sem.HAS-CEN-OCCURENCE-SINCE PPHead.Sem}
(meaning: the concept expressed by the syntactic dominator of “since” is linked by a HAS-CEN-OCCURENCE-SINCE relation to the RU expressed by the NP following “since”)
Syntax-specified Mapping
Two reasons for associating mapping information with syntactic features:
• The syntactic feature represents a generalisation over a set of lexemes
(e.g., the syntactic feature human-surname contains the mapping information for all
surnames).
• The syntactic feature represents a syntactic configuration which itself implies meaning
(e.g., passive is not a feature of a word but of a configuration of words).
Syntactic constraints are of two types:
• Specify the class a particular role filler must have (whether syntactic element or conceptual):
e.g.,
Sem.Actor: human
(Sem.Actor is a role-chain, meaning “the Actor slot of the Sem slot”)
• Specify that the fillers of two role-chains are the same:
e.g.,
Sem.Actor = Subj.Sem
Logical combinations of syntactic constraints are possible:
{and {Head.Sem: COMPLAINING_PROCESS}
{Head.Sem.HAS_SAYING PPHead.Sem}
}
(‘or’ and ‘not’ are also possible)
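The two constraint types can be illustrated with nested dictionaries standing in for slots. This is a sketch of the idea only, not the notation or engine used in the lecture: `resolve`, `has_class`, `same_filler` and the `"class"` key are hypothetical, and class membership is tested by simple equality rather than through an ontology.

```python
def resolve(structure, chain):
    """Follow a role-chain such as "Sem.Actor" through nested slots."""
    node = structure
    for slot in chain.split("."):
        if not isinstance(node, dict) or slot not in node:
            return None
        node = node[slot]
    return node


def has_class(structure, chain, required_class):
    """Constraint type 1, e.g. Sem.Actor: human"""
    filler = resolve(structure, chain)
    return isinstance(filler, dict) and filler.get("class") == required_class


def same_filler(structure, chain_a, chain_b):
    """Constraint type 2, e.g. Sem.Actor = Subj.Sem (identity, not mere equality)."""
    filler = resolve(structure, chain_a)
    return filler is not None and filler is resolve(structure, chain_b)


# One shared object fills both the subject's Sem slot and the process's Actor slot.
patient = {"class": "human"}
clause = {
    "Subj": {"Sem": patient},
    "Sem": {"class": "COMPLAINING_PROCESS", "Actor": patient},
}

print(has_class(clause, "Sem.Actor", "human"))
print(same_filler(clause, "Sem.Actor", "Subj.Sem"))
print(has_class(clause, "Sem.Goal", "human"))  # missing slot
```

Logical combinations such as `and`, `or` and `not` would then be ordinary boolean combinations of these two checks.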
RUs involved in analyzing “Mr. Smith”
[Diagram with three levels. Ontology: Material Entity, human, male human, name, Is-assigned-name-of. Instance: MrSmith, “Smith”, Is-assigned-name-of. Text: Mr Smith]
“Mr Smith” analysed syntactically, and features used to drive mapping.
• The Orth slot of a word gives its surface string.
• The append( ) operator joins together its arguments as a single string.
[Feature network. Systems: HUMAN-NAME-TYPE, TITLED-HUMAN-TYPE, HUMAN-NAME-TYPE3, HUMAN-NAME-TYPE4. Features: human-name, human-surname, human-firstname, titled-human, untitled-human, female-titled, male-titled, genderless-titled, prenamed-provided, prename-not-provided. Constraints: Sem: human; Title: title; Title: female-title; Title: male-title; Sem: female-human; Sem: male-human; Prename: human-firstname; Sem.Assigned_Name = append{Prenam.Orth, Orth}; Sem.Assigned_Name = Orth. Slot positions: Title -2, Prename -1]
Analysis of “an 83-year-old man”
[Diagram with three levels. Ontology: Dom-ent, human, human age state, male human, with relations HAS-WE-STATE and P-TYPE. Instance: X1, X2, X3, with relations HAS-WE-STATE and P-TYPE. Syntax: Deict ‘An’, Epith ‘83-year-old’, ‘man’]
Syntactic-Semantic mapping
• Lexicon:
lexsem “man” noun MALE_HUMAN
lexsem “$int$-year-old” adjective HUMAN_AGE_STATE
• Syntax: one of the constraints (shown in red) on the feature ‘pre-qualified’
(which introduces the Epith role) fits:
Example of a bootstrapping approach
Syntactic relationship “discovery” process
• Text processed successively by:
– paragrapher
– segmenter
• sentence detection
• tokenisation
• rewriting of abbreviations
• identification of relevant sentences
– parser
– reference resolver
– relationship discoverer
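The first two stages can be sketched with regular expressions. This is a deliberately naive illustration, not the system shown in the lecture: all function names are hypothetical, the sentence splitter only looks for end punctuation followed by a capital letter, and the abbreviation table is supplied by hand instead of being learned from ‘long form (ABBR)’ patterns in the text. The selection of relevant sentences, the parser and the later stages are left out.

```python
import re


def paragraphs(text):
    """Paragrapher: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sentences(paragraph):
    """Naive sentence detection: ., ! or ? followed by whitespace and a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)


def tokenise(sentence):
    """Keep hyphenated terms such as FGF-2 together; punctuation is its own token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def rewrite_abbreviations(tokens, abbreviations):
    """Replace each known abbreviation by the tokens of its long form."""
    out = []
    for token in tokens:
        out.extend(abbreviations.get(token, [token]))
    return out


# Hypothetical, hand-made abbreviation table.
ABBREVIATIONS = {"S1P": ["sphingosine", "1-phosphate"]}

text = "S1P induced FGF-2 expression. It also induced Egr-1."
for paragraph in paragraphs(text):
    for sentence in sentences(paragraph):
        print(rewrite_abbreviations(tokenise(sentence), ABBREVIATIONS))
```

Each later stage would consume the token lists produced here, which is why the order of the stages matters.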
Text to be processed
Sphingosine 1-phosphate induces expression of early growth response-1 and fibroblast growth factor-2
through mechanism involving extracellular signal-regulated kinase in astroglial cells.
Sato K, Ishikawa K, Ui M, Okajima F.
Laboratory of Signal Transduction, Institute for Molecular and Cellular Regulation, Gunma University,
3-39-15 Showa-machi, Maebashi, Japan. kosato@akagi.sb.gunma-u.ac.jp
In rat type I astrocytes and C6 glioma cells, sphingosine 1-phosphate (S1P) clearly induced the
expression of fibroblast growth factor-2 (FGF-2) mRNA to an extent comparable to that achieved by
platelet-derived growth factor (PDGF) and endothelin. In C6 cells, Western blotting showed that S1P
also induced expression of early growth response-1 (Egr-1), one of the immediate early gene products
and an essential transcriptional factor for FGF-2 expression. On the other hand, sphingosine, a
substrate for sphingosine kinase which forms intracellular S1P, was a very weak activator for the
expression of either FGF-2 or Egr-1. The S1P-induced Egr-1 expression was partially inhibited by
treatment of the cells with either calphostin C, an inhibitor of protein kinase C (PKC), or pertussis
toxin (PTX), and completely inhibited by the combination of these agents. Essentially, the same
inhibitory pattern by these agents has been observed for S1P-induced extracellular signal-regulated
kinase (ERK) activation. The S1P-induced expression of Egr-1 was also completely inhibited in
association with complete inhibition of ERK by PD 98059, an ERK kinase inhibitor. Thus, the S1P-induced
activation of the Egr-1/FGF-2 system may be mediated through ERK activation, which may
involve at least two signaling pathways, i.e., a PTX-sensitive G-protein-dependent pathway and a
PKC-dependent pathway.
PMID: 10640689 [PubMed - indexed for MEDLINE]
Paragrapher output
Segmenter output
Re-use of resolved abbreviations
Parser output
Reference resolution
Domain-specific CUE-words
// Cue verbs for the PROTEINS domain, listed in every inflected form
if (domain.equals("PROTEINS"))
    subjObjVerbs_ar = new Object[]
        {"abolish", "abolishes", "abolished", "abolishing",
         "accompany", "accompanies", "accompanied", "accompanying",
         "acetylate", "acetylates", "acetylated", "acetylating",
         "activate", "activates", "activated", "activating",
         "affect", "affects", "affected", "affecting",
         ....};
// Cue nouns (nominalisations) for the PROTEINS domain
if (domain.equals("PROTEINS"))
    ofByNouns_ar = new Object[]
        {"acetylation", "activation", "affection", "aggregation", "altering",
         "amelioration", "antagonization", "association", "augmentation", "binding",
         "blocking", "blockage", ....};
Inter-protein relationship “discovery”
• Leptin rapidly inhibits hypothalamic neuropeptide
Y secretion and stimulates corticotropin-releasing
hormone secretion in adrenalectomized mice .
– (leptin)-INHIBITS-(hypothalamic neuropeptide Y
secretion)
– (leptin)-INHIBITS-(neuropeptide Y)
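How a cue verb can yield such a triple is sketched below in Java. This is a deliberately naive stand-in that works on plain string positions, whereas the process described in the lecture relies on parser output. The class RelationSketch, the entity list and the verb-to-relation map are hypothetical. The nearest known entity before the cue verb is taken as subject and the nearest one after it as object, so coordinated verbs sharing one subject ("... and stimulates ...") are not handled correctly by this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: proximity-based triple extraction around a cue verb.
final class RelationSketch {
    static List<String> discover(String sentence, List<String> entities, Map<String, String> cueVerbs) {
        String text = sentence.toLowerCase(Locale.ROOT);
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> cue : cueVerbs.entrySet()) {
            String verb = cue.getKey().toLowerCase(Locale.ROOT);
            int verbAt = text.indexOf(verb); // only the first occurrence is considered
            if (verbAt < 0) continue;
            String subject = null;
            String object = null;
            int subjectAt = -1;
            int objectAt = Integer.MAX_VALUE;
            for (String entity : entities) {
                String name = entity.toLowerCase(Locale.ROOT);
                // Last mention that ends before the verb starts.
                int before = text.lastIndexOf(name, verbAt - name.length());
                if (before > subjectAt) {
                    subjectAt = before;
                    subject = entity;
                }
                // First mention that starts after the verb ends.
                int after = text.indexOf(name, verbAt + verb.length());
                if (after >= 0 && after < objectAt) {
                    objectAt = after;
                    object = entity;
                }
            }
            if (subject != null && object != null) {
                triples.add("(" + subject + ")-" + cue.getValue() + "-(" + object + ")");
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        Map<String, String> cues = new LinkedHashMap<>();
        cues.put("inhibits", "INHIBITS");
        String sentence = "Leptin rapidly inhibits hypothalamic neuropeptide Y secretion "
                + "and stimulates corticotropin-releasing hormone secretion in adrenalectomized mice .";
        System.out.println(discover(sentence, Arrays.asList("leptin", "neuropeptide Y"), cues));
    }
}
```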
... special patterns
• These results indicate that oTP-1 may prevent luteolysis
by inhibiting development of endometrial responsiveness
to oxytocin and , therefore , reduce oxytocin-induced
synthesis of IP3 and PGF2 alpha .
– (oxytocin)-CAUSES-(synthesis of IP3 and PGF2
alpha)
– (oxytocin)-CAUSES-(pgf2 alpha)
From syntactic modification to subsumption
• (adj)-(noun) :: C_adj-noun IS_A C_noun
– steroid hormone IS_A hormone
– fetal liver IS_A liver
• BUT not:
– binding factor IS_A factor
– total protein IS_A protein
– two domain IS_A domain
• Usefulness ?
– relationship with the C_adj
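The rule and its exceptions can be written down as a small Java sketch. The slide does not say how the counter-examples are recognised, so the list of blocked modifiers below is purely an assumption made for illustration, as is the class name SubsumptionSketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: propose "modifier noun IS_A noun" unless the modifier
// is on an (assumed) list of modifiers for which the rule is known to fail.
final class SubsumptionSketch {
    static Optional<String> propose(String modifier, String noun, Set<String> blockedModifiers) {
        if (blockedModifiers.contains(modifier.toLowerCase(Locale.ROOT))) {
            return Optional.empty();
        }
        return Optional.of(modifier + " " + noun + " IS_A " + noun);
    }

    public static void main(String[] args) {
        Set<String> blocked = new HashSet<>(Arrays.asList("binding", "total", "two"));
        System.out.println(propose("steroid", "hormone", blocked)); // a proposal is made
        System.out.println(propose("total", "protein", blocked));   // no proposal
    }
}
```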
NLU in the GALEN project
The place of Galen
[Diagram; box labels: Processor, Text, Result, Linguistic Knowledge, Task Knowledge, Domain representation, Goal representation]
The processor at work ...
[Diagram; box labels: Processor, Text, Result, Linguistic Knowledge, Task Knowledge, Domain representation, Meaning Representation, Goal Representation]
Some claims by Galen (+)
• Europe-wide endeavour
• Result of work by highly competent researchers and developers
• Clean knowledge kernel of pure medical terminology
• Totally independent of any source or target system
• Openness
• Development not affordable by one single entity
NLP applications around Galen
[Diagram; labels: C-Tex, MultiTale, Cassandra, Text, Linguistic Knowledge, Linguistic Representation, Galen terminological Knowledge, Meaning Representation]
MultiTale: synsem - tagging
Dura was incised in linear fashion and the scar around the inlet of the reservoir
was dissected out until the ventricular catheter was exposed and withdrawn
under direct vision.
<Clause type="surg">
<Segment role="do" semtype="anat" syntax="sg" meaning="T-A1120">Dura
</Segment>
<Segment role="action" semtype="open" syntax="papa" meaning="P101000">
<SegConst.1 syntax="past">was </SegConst.1>
<SegConst.1 role="action" semtype="open" syntax="papa" meaning="P101000">incised </SegConst.1></Segment>
<Segment syntax="prep">in </Segment>
<Segment semtype="manner" syntax="adjnoun" meaning="(G-A148,G-D430)">
<SegConst.1 semtype="mod" syntax="adj" meaning="G-A148">linear
</SegConst.1>
<SegConst.1 semtype="manner" syntax="sg" meaning="G-D430">fashion
</SegConst.1></Segment>
</Clause>
MultiTale-II: Galen-ready linguistic representation
valgising osteotomy of humerus
({valgising}5(osteotomy)1{[of]3(humerus)2}4)22
Pre- and postmarker | Relationship with the GALEN ontology (exhaustive) | Relationship with natural language phenomena (examples)
[…] | link | explicit in prepositions, or implicit in adjectives
{…} | criterion | adjectives, adverbial constructions
(…) | descriptor / concept | nouns, idioms
@…# | co-ordination | “and”, “or”
\…/ | not represented in GALEN | function words such as articles, possessive pronouns, etc.
<…> | criterion modifier | adverbs
Cassandra-II: from LR to CR
({valgising}5(osteotomy)1{[of]3(humerus)2}4)22
((cutting)21
{[TO_ACHIEVE]6((Deed:valgising)7
{[ACTS_ON]17(Pathology:pathologicalposture)18}19)20}5
{[ACTS_ON]3(Anatomy:humerus)2}4)22
Linguistic versus Conceptual repr. (1)
(excision)35 {[of]111 ((cicatrix)2120 {[of]216 (skin)474}0)0}0
(debridement)82 {[of]142 ({palmar}1785 (skin)474)0}0
RefId | Prototype | Conceptual repr. | Linguistic repr.
35 | excision | excising | excising
82 | debridement | debriding | debriding
111 | of | ACTS_ON | THEME
142 | of | ACTS_ON | SOURCE
216 | of | HAS_LOCATION | SOURCE
474 | skin | skin | skin
1785 | palmar | IS_PART_OF(palm) | LOCATIVE(palm)
2120 | cicatrix | cicatrix | cicatrix
Linguistic versus Conceptual repr. (2)
The Galen view
ResourseManagementProcess
InstallingProcess
LiquidInstallingProcess
Filling
Injecting
The linguistic semantic view
To install <theme> [ in <goal> ]
To fill <goal> [with <theme> ]
To inject <theme> [ in <goal> ]
To inject <goal>
Semantic Indexing with and without using ontology
Goals of Semantic Indexing
1. How to identify in a running text those
“components that carry meaning” ?
2. How to assess how relevant these components
are in the context of the entire document ?
- aboutness or characterizing power (NLM MetaMap)
- topic
Statistics-based systems
• do not possess explicit domain knowledge,
• can only identify words or multi-word units in texts,
– Based on individual document statistics
– Based on corpus statistics
• project these on implicitly constructed concepts that are mathematically justifiable, but that do not necessarily correspond with metaphysical reality,
• are capable of finding those components that qualify as topic markers,
• are poor at identifying all components.
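To make the two kinds of statistics concrete, here is a Java sketch that combines an in-document measure (term frequency) with a corpus measure (inverse document frequency). TF-IDF is used only as a familiar example of such weighting; the lecture does not state which measures the systems under discussion use, and the class name TermStatsSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: TF-IDF as one example of combining individual
// document statistics with corpus statistics.
final class TermStatsSketch {
    // Individual document statistic: share of the document's tokens equal to the term.
    static double termFrequency(List<String> document, String term) {
        if (document.isEmpty()) return 0.0;
        long hits = document.stream().filter(term::equals).count();
        return (double) hits / document.size();
    }

    // Corpus statistic: log(N / df), or 0 when the term occurs in no document.
    static double inverseDocumentFrequency(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    static double tfIdf(List<String> document, List<List<String>> corpus, String term) {
        return termFrequency(document, term) * inverseDocumentFrequency(corpus, term);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("kinase", "inhibits", "kinase", "activation");
        List<String> d2 = Arrays.asList("leptin", "inhibits", "secretion");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        // "kinase" is frequent in d1 and rare in the corpus, so it scores above zero;
        // "inhibits" occurs in every document and scores zero.
        System.out.println(tfIdf(d1, corpus, "kinase"));
        System.out.println(tfIdf(d1, corpus, "inhibits"));
    }
}
```

A weighting of this kind picks out topic markers, but it only ever scores strings: nothing in it ties "kinase" to a concept.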
“Concept”-based systems
• use explicitly defined concepts to which words, terms or phrases are attached as known grammaticalizations in a specific language.
• “attachment” may be
– Lexically realised
– Grammatically realised
• Using syntactic grammar and/or semantic grammar
• tend to identify many components,
• perform less well at finding the topics.
TeSSI®: Terminology Supported Semantic Indexing
• Based on LinkBase®:
– formal ontologies dealing with time, mereology, partonomy, ...
(Smith, Varzi, Cohn, ...)
– domain ontology structured according to the way languages are
influenced by semantics (Bateman)
– linking towards multiple 3rd party terminologies, classification
systems, ontologies, ...
– multi-lingual
• Combines in-document statistics with spreading activation
enforcement in LinkBase®
• Implemented as a server
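The idea of combining in-document statistics with spreading activation can be sketched as follows in Java. This is not LinkBase® or TeSSI® code: the untyped link map, the single propagation pass, the decay factor and the class name SpreadingActivationSketch are all assumptions made for illustration. Each concept found in the document starts with an activation taken from its in-document count and passes a fraction of it to its neighbours in the ontology.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one pass of spreading activation over concept links.
final class SpreadingActivationSketch {
    static Map<String, Double> spread(Map<String, Double> initial,
                                      Map<String, List<String>> links,
                                      double decay) {
        Map<String, Double> result = new HashMap<>(initial);
        for (Map.Entry<String, Double> source : initial.entrySet()) {
            List<String> neighbours = links.getOrDefault(source.getKey(), Collections.emptyList());
            for (String neighbour : neighbours) {
                // A neighbour accumulates the decayed activation of every source linked to it.
                result.merge(neighbour, source.getValue() * decay, Double::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> counts = new HashMap<>();
        counts.put("astrocyte", 2.0); // in-document counts of recognised concepts
        counts.put("glioma", 1.0);
        Map<String, List<String>> links = new HashMap<>();
        links.put("astrocyte", Arrays.asList("glial cell"));
        links.put("glioma", Arrays.asList("glial cell", "tumour"));
        System.out.println(spread(counts, links, 0.5));
    }
}
```

In this toy run "glial cell" never occurs in the text, yet it ends up more strongly activated than "glioma" because two mentioned concepts point to it. An index built from in-document counts alone cannot produce that effect.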
Architectural Overview
[Architecture diagram; labels: LinkBase Database, JDBC, Java, LinkFactory Server, Business Objects (Concept tree, Linktype tree, Criteria / Full definitions, Translate, ...), TeSSI Server, Index, LinkFactory Workbench, Unix Workstation, PC, Mac, RMI, Corba, Soap, LAN, WAN, Internet, Server]
Phrase extraction
Disambiguation
Coding
Intermediate conclusions
• Good results
– (shown by means of recall/precision studies based on OHSUMED)
• BUT:
– considerable effort needed to build an appropriate ontology
• (we can live with that because we already did it for healthcare)
• Is there a risk that this effort would ever lose its value ?
22 page full paper
A “statistics only system”
ABSTRACT ONLY
How far can these systems go ?
• Some positive characteristics:
– Do not require detailed domain knowledge
– Are language independent
– Are able to find complex multi-word units
• Some negative (?) characteristics:
– seem to depend on document length
– unclear how to link to existing terminologies
• (find “words” instead of “concepts”)
To find this out
• Select from OHSUMED 29 abstracts with stated
high relevance for 5 concepts, hence supposed to
cover the same topic;
• Sort abstracts in ascending order with respect to
document length;
• Concatenate documents to get even larger
documents;
• Perform a forecast analysis;
• Compare TeSSI with the statistics-based system.
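The construction of the growing test documents can be sketched in Java as below. The slide does not specify how the abstracts are concatenated, so the cumulative scheme used here (each test document is the previous one plus the next-longer abstract) is an assumption, as are the class name GrowingDocumentsSketch and the whitespace-based word count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort abstracts by length, then concatenate cumulatively.
final class GrowingDocumentsSketch {
    // Document length measured as the number of whitespace-separated words.
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static List<String> cumulative(List<String> abstracts) {
        List<String> sorted = new ArrayList<>(abstracts);
        sorted.sort(Comparator.comparingInt(GrowingDocumentsSketch::wordCount)); // ascending length
        List<String> documents = new ArrayList<>();
        StringBuilder growing = new StringBuilder();
        for (String a : sorted) {
            if (growing.length() > 0) growing.append(' ');
            growing.append(a);
            documents.add(growing.toString()); // snapshot after each added abstract
        }
        return documents;
    }

    public static void main(String[] args) {
        for (String d : cumulative(Arrays.asList("c d e", "a", "b f"))) {
            System.out.println(wordCount(d) + " words: " + d);
        }
    }
}
```

Each resulting document would then be indexed by both systems, and the counts of words, concepts and nodes recorded against the word count.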
Word, concept and node identification per document (real)
[Chart. Y-axis: count of words, concepts or nodes (logarithmic, 1 to 10000); X-axis: document number (1 to 9); series: Words, Nodes, Concepts]
Absolute Concept/Node identification (real)
[Chart. Y-axis: Nr of nodes or concepts (0 to 1800); X-axis: Word Count (0 to 5000)]
Relative Concept/Node identification (real)
[Chart. Y-axis: 0 to 0,4; X-axis: Nr of words (0 to 5000); series: concepts, nodes]
Concept/Node identification % (forecast)
[Chart. Y-axis: 0 to 0,4; X-axis: Nr of words (0 to 120.000.000); series: concepts, nodes]
Conclusions
• The “ontological approach” that accepts language
as a medium of communication, provides a very
good basis for NLU if associative relationships are
prominently present.
– Hierarchies are not enough
• In-document (and even corpus) statistics provide
additional information but have an upper bound if
used without domain information.
– Detail and explicitness at the level of concept and
relationships determine indexing performance
The End