Invited talk on Specialist NLP Tools

The SPECIALIST Lexicon and NLP Tools
Allen Browne
Nov 6, 2009
Guy Divita
National Library of Medicine
The SPECIALIST Lexicon
Nov. 6, 2009
Text
processing
Lexical tools
SPECIALIST
LEXICON
The SPECIALIST Lexicon
• A syntactic lexicon
• Biomedical and general English
• Over 430,000 records
Lexicon Growth
George A. Miller, The Science of Words, 1991
Frequency Spectrum of Medline 2006
[Plot: V(m,N) against m; m axis logarithmic from 1 to 100,000,000, V(m,N) axis from 1 to about 3,000,000]
Frequency Spectrum: Alice in Wonderland
Baayen, 2001
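The frequency spectrum plotted on these slides can be computed from any token list. The sketch below assumes the usual word-frequency reading of the notation, in which V(m,N) is the number of distinct word types that occur exactly m times in a sample of N tokens; the function name and the sample sentence are illustrative only.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return {m: V(m, N)}: how many word types occur exactly m times."""
    type_counts = Counter(tokens)                 # word type -> frequency
    spectrum = Counter(type_counts.values())      # frequency m -> number of types
    return dict(sorted(spectrum.items()))

tokens = "the cat and the hat and the bat".split()
spectrum = frequency_spectrum(tokens)
print(spectrum)
```

Summing m * V(m,N) over all m gives back N, the number of tokens, which is a quick sanity check on any spectrum.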
The SPECIALIST Lexicon
• Morphology
– Inflection
– Derivation
• Orthography
– Spelling variants
• Syntax
– Complementation for verbs, nouns, and adjectives
Morphology
• Inflectional
– nucleus, nuclei
– cauterize, cauterizes, cauterized, cauterizing
– red, redder, reddest
• Derivational
– laryngeal -- larynx
– transport -- transportation
Derivational Morphology
Dictionary+ology+is
Inflectional Morphology
octopus
octopi
octopuses
Orthography
Spelling Variation
• align -- aline
• Grave’s disease -- Graves’s disease -- Graves’ disease
• anesthetize -- anesthetise
• Esophagus -- oesophagus
• foetus -- fetus
• centre -- center
Orthography
Syntax -- Verb Complements
• intran
– I’ll treat.
• tran=np
– He treated the patient.
• ditran=np,pphr(with,np)
– She treated the patient with the drug.
Syntax -- Verb Complements
{base=treat
entry=E0061964
cat=verb
variants=reg
intran
tran=np
tran=pphr(with,np)
tran=pphr(of,np)
ditran=np,pphr(to,np)
ditran=np,pphr(with,np)
ditran=np,pphr(for,np)
cplxtran=np,advbl
nominalization=treatment|noun|E0061968
}
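A unit record like the one above is a list of slot=value lines between braces, with some bare slots such as intran. The following is a minimal parsing sketch under that assumption; the function name and the dict-of-lists representation are illustrative and are not part of the SPECIALIST tools.

```python
def parse_lexicon_record(text):
    """Parse one '{...}' unit record into a dict mapping slot -> list of values.

    Bare slots with no '=' (for example 'intran') are stored with the value None.
    """
    record = {}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line or line == "}":
            continue
        line = line.lstrip("{")
        slot, sep, value = line.partition("=")
        record.setdefault(slot, []).append(value if sep else None)
    return record

TREAT = """{base=treat
entry=E0061964
cat=verb
variants=reg
intran
tran=np
tran=pphr(with,np)
tran=pphr(of,np)
ditran=np,pphr(to,np)
ditran=np,pphr(with,np)
ditran=np,pphr(for,np)
cplxtran=np,advbl
nominalization=treatment|noun|E0061968
}"""

rec = parse_lexicon_record(TREAT)
print(rec["base"], rec["cat"], rec["tran"])
```

Storing every slot as a list keeps repeated slots such as tran and ditran without special cases.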
Lexicon Parts of Speech
[Bar chart: counts by part of speech (Noun, Adj, Verb, Adv, Prep, Pron, Conj, Det, Modal, Aux, Compl); vertical axis from 0 to 350,000]
Miller -- 1991
Lexicon Unit Records
{base=chronic
entry=E0016869
cat=adj
variants=inv
position=attrib(1)
position=pred
stative
}
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}
{base=aspirate
entry=E0010803
cat=verb
variants=reg
tran=np
nominalization=aspiration|noun|E0010804
}
{base=in
entry=E0033870
cat=prep
}
Acronyms and Abbreviations
{base=BLM
entry=E0319730
cat=noun
variants=uncount
variants=metareg
abbreviation_of=bilayer lipid membrane|E0319734
abbreviation_of=bimolecular liquid membrane|E0319733
abbreviation_of=bleomycin|E0013378
}
Orthographic vs. Lexicographic Word:
Why, for instance, if a two-word boy scout feels chilly on his one-word campground, does he pull up a two-word camp chair in front of his one-word campfire? Anyone who seeks a strictly logical answer to such questions is chasing will-o'-the-wisps (chargeable in telegrams as a single word, because of the hyphens) in a semantic bog.
Louis Salomon, Semantics and Common Sense, Holt Rinehart and Winston, 1966.
UTF-8
{base=resume
spelling_variant=résumé
spelling_variant=resumé
entry=E0053099
cat=noun
variants=reg
}
{base=deja vu
spelling_variant=deja-vu
spelling_variant=déjà vu
entry=E0021340
cat=noun
variants=uncount
}
{base=role
spelling_variant=rôle
entry=E0053757
cat=noun
variants=reg
}
{base=cafe
spelling_variant=café
entry=E0420690
cat=noun
variants=reg
}
Noun Variants
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}
• Kaposi’s sarcoma
• Kaposi’s sarcomas
• Kaposi’s sarcomata
• Kaposi sarcoma
• Kaposi sarcomas
• Kaposi sarcomata
Regular Nouns
The plural suffix is s.
y becomes ie following a consonant before s.
e is inserted before s if the base ends in s, z, x, ch, or sh.
Leach – Leaches
Stomach – Stomachs (irregular)
Greco-Latin regular nouns
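The regular noun rules can be written as a small function. This is a sketch of the three rules as stated on the slide, assuming the final alternative in the e-insertion rule is sh; the function name is illustrative and the code is not the Lexicon's own implementation.

```python
VOWELS = "aeiou"

def regular_plural(base):
    """Apply the regular (variants=reg) noun plural rules to a base form."""
    lower = base.lower()
    # y becomes ie following a consonant before s
    if len(lower) > 1 and lower.endswith("y") and lower[-2] not in VOWELS:
        return base[:-1] + "ies"
    # e is inserted before s after s, z, x, ch, or sh
    if lower.endswith(("s", "z", "x", "ch", "sh")):
        return base + "es"
    # otherwise the plural suffix is s
    return base + "s"

for noun in ("leach", "stomach", "artery", "sarcoma"):
    print(noun, "->", regular_plural(noun))
```

Applied blindly, the rules turn stomach into stomaches, which is why a noun like stomach has to be recorded as an exception rather than as variants=reg.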
Uncount Nouns
(abstract or mass)
{base=smallpox
entry=E0056359
cat=noun
variants=uncount
}
{base=potassium
entry=E0049387
cat=noun
variants=uncount
}
* This form does not occur
• * a smallpox
• * two smallpoxes
• much smallpox
• * a potassium
• * two potassiums
• much potassium
Fixed Plural Nouns
{base=police
entry=E0048616
cat=noun
variants=plur
}
{base=scissors
entry=E0054633
cat=noun
variants=plur
}
Irregular Nouns
{base=corpus
entry=E0019113
cat=noun
variants=irreg|corpora|
variants=reg
}
{base=larynx
entry=E0036919
cat=noun
variants=irreg|larynges|
variants=reg
}
Regular Verbs
• The third person present tense suffix is s.
– y becomes ie following a consonant before s.
– e is inserted between s, z, x, ch, or sh and s.
• The past tense suffix is ed.
– y becomes i following a consonant before ed.
– Final e is deleted before ed.
• The past participle is the same as the past tense.
• The present participle suffix is ing.
– Final e is deleted before ing unless preceded by e, y or o.
Regular Verbs
• dismiss: dismisses, dismissed, dismissing
• agree: agrees, agreed, agreeing
• dry: dries, dried, drying
Regular Doubling Verbs
• End in a CVC pattern
• Double the final consonant before ed and
ing.
• Are otherwise regular
• variants=regd
control: controls, controlled, controlling
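The regular (reg) and regular doubling (regd) verb patterns can be sketched in a few lines. The code below assumes the rules read as: y becomes i before ed, final e is dropped before ed, and final e is dropped before ing unless preceded by e, y or o. The function name and the doubling flag are illustrative, not part of the Lexicon tools.

```python
VOWELS = "aeiou"

def inflect_regular_verb(base, doubling=False):
    """Return (third person singular, past, present participle) for a regular verb.

    doubling=True models variants=regd: the final consonant is doubled
    before ed and ing, and the verb is otherwise regular.
    """
    consonant_y = len(base) > 1 and base.endswith("y") and base[-2] not in VOWELS

    # Third person singular present
    if consonant_y:
        third = base[:-1] + "ies"
    elif base.endswith(("s", "z", "x", "ch", "sh")):
        third = base + "es"
    else:
        third = base + "s"

    if doubling:
        return third, base + base[-1] + "ed", base + base[-1] + "ing"

    # Past tense and past participle
    if consonant_y:
        past = base[:-1] + "ied"
    elif base.endswith("e"):
        past = base + "d"
    else:
        past = base + "ed"

    # Present participle
    if len(base) > 1 and base.endswith("e") and base[-2] not in "eyo":
        participle = base[:-1] + "ing"
    else:
        participle = base + "ing"
    return third, past, participle

print(inflect_regular_verb("dry"))
print(inflect_regular_verb("control", doubling=True))
```

Whether a verb doubles is a lexical fact recorded in the entry (variants=regd), so the sketch takes it as a parameter instead of guessing from the spelling.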
Irregular Verbs
{base=bite
entry=E0013219
cat=verb
variants=irreg|bite|bites|bit|bitten|biting|
intran
tran=np
cplxtran=np,advbl
}
Ancillary Data Bases
• Synonymy
– sm.db
• Derivation
– dm.db, dm.rules
• Inflection
– im.rules
• Neoclassical compounds
– nc.db
Derivational Facts and Rules
dm.facts
treatment|noun|treat|verb
prohibition|noun|prohibitive|adj
cell lineage|noun|cell line|noun
photochemotherapeutic|adj|photochemotherapy|noun
pharmacotherapeutic|adj|pharmacotherapy|noun
Derivational Facts and Rules
dm.rules
# e.g. alienation|alienate
ation$|noun|ate|verb
ration|rate; station|state;
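A derivational rule of this shape can be applied mechanically. The sketch below assumes each rule line reads suffix|category|suffix|category and that the pairs on the following line (ration|rate; station|state;) are exceptions the rule must not produce; the function names are illustrative.

```python
def parse_dm_rule(rule_line, exception_line=""):
    """Split a rule such as 'ation$|noun|ate|verb' and its exception pairs."""
    in_suffix, in_cat, out_suffix, out_cat = rule_line.split("|")
    exceptions = set()
    for pair in exception_line.split(";"):
        pair = pair.strip()
        if pair:
            exceptions.add(tuple(pair.split("|")))
    # '$' marks the end of the word; it is not part of the suffix itself
    return in_suffix.rstrip("$"), in_cat, out_suffix.rstrip("$"), out_cat, exceptions

def apply_dm_rule(term, cat, rule):
    """Return (derived term, category), or None if the rule does not apply."""
    in_suffix, in_cat, out_suffix, out_cat, exceptions = rule
    if cat != in_cat or not term.endswith(in_suffix):
        return None
    candidate = term[: len(term) - len(in_suffix)] + out_suffix
    if (term, candidate) in exceptions:
        return None
    return candidate, out_cat

rule = parse_dm_rule("ation$|noun|ate|verb", "ration|rate; station|state;")
print(apply_dm_rule("alienation", "noun", rule))
print(apply_dm_rule("station", "noun", rule))
```

The exception list is what keeps a purely orthographic rule from linking unrelated words such as station and state.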
Inflectional Facts and Rules
im.rules
# Noun rules (glreg)
us$|noun|singular|i$|noun|plural
antus|anti;
ma$|noun|singular|mata$|noun|plural
a$|noun|singular|ae$|noun|plural
um$|noun|singular|a$|noun|plural
on$|noun|singular|a$|noun|plural
sis$|noun|singular|ses$|noun|plural
is$|noun|singular|ides$|noun|plural
men$|noun|singular|mina$|noun|plural
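The Greco-Latin (glreg) suffix rules can be tried on a few nouns. This sketch makes two assumptions that the slide does not state: when several suffixes match, only the longest is used, and antus|anti is an exception pair that blocks the us rule. The names are illustrative.

```python
# (singular suffix, plural suffix) pairs taken from the glreg noun rules
GLREG_RULES = [
    ("us", "i"), ("ma", "mata"), ("a", "ae"), ("um", "a"),
    ("on", "a"), ("sis", "ses"), ("is", "ides"), ("men", "mina"),
]
GLREG_EXCEPTIONS = {("antus", "anti")}

def glreg_plural(base):
    """Return the Greco-Latin plural suggested by the longest matching suffix rule."""
    matches = [(sing, plur) for sing, plur in GLREG_RULES if base.endswith(sing)]
    if not matches:
        return None
    singular, plural = max(matches, key=lambda rule: len(rule[0]))
    candidate = base[: len(base) - len(singular)] + plural
    if (base, candidate) in GLREG_EXCEPTIONS:
        return None
    return candidate

for noun in ("nucleus", "sarcoma", "diagnosis", "foramen", "corpus"):
    print(noun, "->", glreg_plural(noun))
```

The rules give corpi for corpus, which is why the corpus record carries variants=irreg|corpora| instead of glreg.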
Neoclassical compounds
nc.db
abdomin(o)|abdomen|root
ab|away from|prefix
acanth(o)|prickle|root
acar(o)|mite|root
acetabul(o)|acetabulum|root
ad|towards|prefix
agogue|inducing|terminal
albumin(o)|albumin|root
sis|condition|terminal
stomy|surgical opening|terminal
PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS
pneu.mo.no.ul.tra.mi.cro.scop.ic.sil.i.co.vol.ca.no.co.ni.o.sis \'n(y)u:-m*-(.)no--.*l-tr*-.mi-kr*-'ska:p-ik-'sil-i-(.)ko--(.)va:l-'ka--no--.ko--ne--'o--s*s\ n [NL, fr. Gk pneumo-n + ISV ultramicroscopic + NL silicon + ISV volcano + Gk konis dust] : a pneumoconiosis caused by the inhalation of very fine silicate or quartz dust
-- Merriam Webster's 3rd International Dictionary, page 1747.
The Protein of a tobacco mosaic virus, Dahlemense strain
acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutami
nylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophy
lalanylaspartylprolylisoleucylglutamylleucylleucyllasparaginylv
alylcysteinylthreonylserylserylleucylglycllasparaginylglutaminy
lphenylalanylglutaminylthreonylglutaminylglutaminylalanylargi
nylthreonylthreonylglutaminylvalylglutaminylglutaminylphenyla
lanylserylglutaminylvalyltryptophyllysylprolylphenylalanylprolyl
glutaminylserylthreonylvalylarginylphenylalanylprolylglycylasp
artylvalyltyrosyllsyslvalyltyrosylarginyltyrosylasparaginylalanyl
valylleucylaspartylprolylleucylisoleucylthreonylalanylleucylleuc
ylglycylthryonylphenylalanylaspartylthreonylarginylasparaginyl
arginylisoleucylisoleucylglutamylvalylglutamylasparaginylgluta
minylglutaminylserylprolylthreonylthreonylalanylglutamylthreo
nylleucylaspartylalanylthreonylarginylarginylvalylaspartylaspar
tylalanylthreonylvalylalanylisoleucylarginylserylalanylasparagi
nylisoleucylasparaginylleucylvallasparaginylglutamylleucylvaly
larginylglycylthreonylglycylleucultyrosylasparaginylglutaminyla
sparaginylthreonylphenylalanylglutamylserylmethionylserylgly
cylleucylvalyltryptophylthreonylserylalanylprolylalanylserine
Synonyms
sm.db
alar|adj|wing|noun
amygdaline|adj|tonsil|noun
articular|adj|joint|noun
bulbar|adj|medulla oblongata|noun
furuncular|adj|boil|noun
genicular|adj|knee|noun
hepatocellular|adj|liver cells|noun
lazar|adj|leprosy|noun
lenticular|adj|crystalline lens|noun
ypsiliform|adj|upsiloid|adj
wolfram|noun|tungsten|noun
double vision|noun|diplopia|noun
Lexical Tools
• Wordind -- breaks strings into words
– Produces the Metathesaurus word indexes (MRXW)
• LVG -- performs various lexical transformations
• NORM -- a selection of LVG transformations
– Used for Metathesaurus indexing
– Produces the Metathesaurus Normalized word and string indexes (MRXNW & MRXNS)
– Used to access those indexes
Normalization
• Hodgkin Disease
• HODGKINS DISEASE
• Hodgkin's Disease
• Disease, Hodgkin's
• HODGKIN'S DISEASE
• Hodgkin's disease
• Hodgkins Disease
• Hodgkin's disease NOS
• Hodgkin's disease, NOS
• Disease, Hodgkins
• Diseases, Hodgkins
• Hodgkins Diseases
• Hodgkins disease
• hodgkin's disease
• Disease;Hodgkins
• Disease, Hodgkin
All normalize to:
• disease hodgkin
SPECIALIST NLP Tools
• Tokenizers
– Sentence, Section, Phrases, Words
• Term variant lookup
• Part of Speech Tagger
• Index Maker
The Lexical Systems Group
• Allen Browne: browne@nlm.nih.gov
• Guy Divita: divita@nlm.nih.gov
• Chris Lu: lu@nlm.nih.gov
SPECIALIST NLP Tools
Lister Hill National Center For Biomedical
Communications
National Library of Medicine
Guy Divita
Fall 2009
SPECIALIST NLP Tools
SPECIALIST.nlm.nih.gov
Tools
• The Lexicon
• Document Tokenization Tools
• Lexicon Term Lookup
• POS Tagger
• Term Manipulation Tools
• Spelling Suggestion
• Visual Annotation Tool
• Text Categorization Tool
SPECIALIST Lexical Tools Java
Utilities to build smarter indexes
Term Based Tools
SPECIALIST Lexical Tools
• 56 Term transformations
[Diagram: the term "treat" linked through inflections, nominalizations, derivations, and combinations of these to treats, treating, treated, treatment, treatments, treatability, treaty, treatable, treater]
SPECIALIST Lexical Tools Java
[Diagram: the term "color" linked through inflections, spelling variants, nominalizations, synonyms, derivations, and combinations of these to colour, coloring, colored, colors, chromaticities, colorlessness, Chromaticness, chromatic, colorless, colorant, colorful]
SPECIALIST Lexical Tools Java
[Diagram: the term "second" linked through inflections, acronyms, acronym expansions, nominalizations, synonyms, derivations, and combinations of these to seconds, seconded, serous, Ser, SOR, secant, secondarily, s’s, sec, secondly, secondary, s]
SPECIALIST Lexical Tools Java
The tools can be arranged so that the output of one is the input to another.
Example of a quick and dirty normalization:
• Input term
• lowercase
• Strip diacritics
• Remove possessive
• Remove stop words
• Strip punctuation
• Word order sort
• Output term
SPECIALIST Lexical Tools: Norm Java
Steps, in order:
• remove genitives
• replace punctuation with spaces
• remove stop words
• lowercase
• uninflect each word
• spelling variants
• word order sort
SPECIALIST Lexical Tools: Norm Java
• Hodgkin Disease
• HODGKINS DISEASE
• Hodgkin's Disease
• Disease, Hodgkin's
• HODGKIN'S DISEASE
• Hodgkin's disease
• Hodgkins Disease
• Hodgkin's disease NOS
• Hodgkin's disease, NOS
• Disease, Hodgkins
• Diseases, Hodgkins
• Hodgkins Diseases
• Hodgkins disease
• hodgkin's disease
• Disease;Hodgkins
• Disease, Hodgkin
Hash into a class of lexically similar terms:
• disease hodgkin
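The Norm steps can be imitated with a short pipeline. This is only a sketch: the stop-word list is an assumption, uninflection is a crude strip-one-s stand-in for the lexicon-based step, and the spelling-variant step is left out. It is not the NORM implementation.

```python
import re
import string

# Assumed stop-word list for illustration; the real Norm list may differ.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def uninflect(word):
    """Crude stand-in for lexicon-based uninflection: strip one plural s."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def norm(term):
    term = re.sub(r"'s\b", "", term, flags=re.IGNORECASE)     # remove genitives
    term = term.translate(PUNCT_TO_SPACE)                     # punctuation -> spaces
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]
    words = [uninflect(w.lower()) for w in words]             # lowercase, uninflect
    # The spelling-variant step is omitted in this sketch.
    return " ".join(sorted(words))                            # word order sort

variants = [
    "Hodgkin Disease", "HODGKIN'S DISEASE", "Disease, Hodgkin's",
    "Hodgkin's disease, NOS", "Diseases, Hodgkins", "Disease;Hodgkins",
]
for v in variants:
    print(v, "->", norm(v))
```

Because every variant maps to the same string, that string can serve as a hash key for the whole class of lexically similar terms.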
Spelling Retrieval Tools
• GSpell
– A term retrieval tool
– N-gram nearest neighbor algorithm
– MetaPhone phonetic spelling normalization
– Homophones
– Common misspellings
– Candidates sorted by an edit distance and frequency of occurrence from a corpus
• Build Your Own
– Custom crafted dictionaries are key to spelling suggestion
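The final ranking step, sorting candidates by edit distance and then by corpus frequency, can be sketched as follows. The dictionary and its frequencies are made up for illustration, and the code does not reproduce GSpell's n-gram or MetaPhone retrieval stages.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_distance=2):
    """Rank dictionary terms by edit distance, then by descending frequency."""
    scored = [(edit_distance(word, term), -freq, term)
              for term, freq in dictionary.items()]
    return [term for dist, _, term in sorted(scored) if dist <= max_distance]

# Hypothetical custom dictionary: term -> corpus frequency (invented numbers)
DICTIONARY = {"diabetes": 900, "diabetic": 400, "dialysis": 300}
print(suggest("diabetis", DICTIONARY))
```

Swapping in a domain-specific dictionary changes the suggestions without touching the ranking code, which is the point of the Build Your Own bullet.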
dTagger
• Assigns Parts of Speech (POS) to words in text
• NP parsers need terms with Parts of Speech assigned to determine phrase breaks and head assignment
Document Based Tools
Legend: noun, adj/adv, verb, conj, det, prep, aux/modal
SPECIALIST Text Tools
Tokenizes Text into
– Sections
– Sentences
– Phrases
– Terms
– Words
– Lexicon Entries
Document Formats
– Medline
– HL7
– Free text
Example document (slide labels: Title, Auth, Abstract):
SPME determination of volatile aldehydes for evaluation of in-vitro antioxidant activity
Elena E. Stashenko, Miguel A. Puertas, Jairo R. Martínez
A1 Chromatography Laboratory, Research Center for Biomolecules, School of Sciences, Industrial University of Santander. A.A. 678, Bucaramanga, Colombia
Abstract:
The in-vitro antioxidant activity of natural (essential oils, vitamin E) or synthetic substances (tert-butyl hydroxy anisole (BHA), Trolox) has been evaluated by monitoring volatile carbonyl compounds released in model lipid systems subjected to peroxidation. The procedure employed methodology previously developed for the determination of carbonyl compounds as their pentafluorophenylhydrazine derivatives which were quantified, with high sensitivity, by means of capillary gas chromatography with electron-capture detection. Linoleic acid and sunflower oil were used as model lipid systems. Lipid peroxidation was induced in linoleic acid by the Fe2+ ion (1 mmol L-1, 37 °C, 12 h) and in sunflower oil by heating in the presence of O2 (220 °C, 2 h).
SPECIALIST Text Tools
Word Tokenizer
Term Tokenizer
POS tagger
Phrase Chunker
Phrase Variant Generation
Text Annotation
Java Document Container: Document, Section, Sentence, Token
Text Annotation (2)
Simple Format
Offset|Size|Tag|SubTag|Annotation|..
0| 0|BOS | | |
0| 3|det | | |The
4| 2|adj | | |in
6| 1|adj | | |-|
7| 5|adj | | |vitro|
13| 11|noun | | |antioxidant|
25| 8|noun | | |activity|
34| 2|prep | | |of|
37| 7|adj | | |natural|
45| 1|lp | | |(|
46| 9|noun | | |essential
56| 4|noun | | |oils|
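Rows in this pipe-delimited format are easy to read back and check against the source text. The sketch assumes the first five fields are Offset, Size, Tag, SubTag and Annotation, and that whatever follows is the covered text with an optional trailing pipe; the class and function names are illustrative.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    offset: int
    size: int
    tag: str
    subtag: str
    annotation: str
    text: str

def parse_annotation(line):
    """Parse one 'Offset|Size|Tag|SubTag|Annotation|text' row."""
    # Split on the first five pipes only, so a '|' token in the text survives.
    offset, size, tag, subtag, annotation, rest = line.split("|", 5)
    text = rest[:-1] if rest.endswith("|") else rest
    return Annotation(int(offset), int(size), tag.strip(),
                      subtag.strip(), annotation.strip(), text)

SENTENCE = "The in-vitro antioxidant activity of natural (essential oils"
rows = [
    "0| 3|det | | |The",
    "13| 11|noun | | |antioxidant|",
    "45| 1|lp | | |(|",
]
for row in rows:
    a = parse_annotation(row)
    # Offset and size index straight into the original text.
    print(a.tag, repr(SENTENCE[a.offset:a.offset + a.size]))
```

Keeping offsets and sizes next to each tag means the annotations never need to modify the original text, which is what makes stand-off annotation practical.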
Text Categorization
• A set of tools for:
– Text categorization
– Indexing & retrieval
– Document classification
– Word sense disambiguation
– etc.
• Based on JD Indexing (Susanne Humphrey)
– Vector / Cosine coefficient method
– Unsupervised
– Uses the pre-existing assignment of Journal Descriptors to Medline abstracts
– High performance
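The cosine coefficient named above compares a document's word vector with one vector per category. The sketch below uses tiny invented descriptor vectors and raw word counts; real JD Indexing derives its vectors from MEDLINE training data, which this example does not attempt to reproduce.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_descriptors(text, descriptor_vectors):
    """Score each descriptor against the text's word counts, best first."""
    doc = Counter(text.lower().split())
    scores = [(cosine(doc, vec), name) for name, vec in descriptor_vectors.items()]
    return sorted(scores, reverse=True)

# Hypothetical descriptor vectors (word -> weight), invented for illustration
DESCRIPTORS = {
    "Ophthalmology": {"eye": 3, "retina": 2},
    "Cardiology": {"heart": 3, "artery": 2},
}
ranking = rank_descriptors("cancer of the eye and retina", DESCRIPTORS)
print(ranking)
```

The same scoring works for any category set, for example semantic types instead of journal descriptors, as long as each category has a word vector.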
Text Categorization
• Command line tools
– JDI (Journal Descriptor Indexing)
– STI (Semantic Type Indexing)
– STRI (Semantic Type Real-Time Indexing)
– MLT (MEDLINE Tokenizer)
– STWSD (ST Word Sense Disambiguation)
• Web Tools
• Java APIs
MetaMap Transfer (MMTx)
• Extracts UMLS concepts from text
• Java Implementation of MetaMap
Input text:
Retinoblastoma
What is retinoblastoma?
Retinoblastoma is a rare type of eye cancer that develops in the retina, which is the part of the eye that detects light and color. Although this disorder can occur at any age, it usually develops in young children.
Output:
Meta Mapping (1000):
C0496836 (Malignant neoplasm of eye, unspecified) [Neoplastic Process]
MMTx
Why would you want to use it?
[Diagram: Medical Text]
De-Identification
• Dates
• Names
• Addresses
• Phone No’s
• Age > 90
• Alpha numeric identifiers
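Some of the categories above (dates, phone numbers, ages over 90) are regular enough to illustrate with patterns. The regular expressions below are simplified assumptions for one date style and one phone style only; they are a sketch, not the method used by the SPECIALIST de-identification tools, and names and addresses need recognizers that patterns alone do not provide.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")            # e.g. 11/06/2009
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")                 # e.g. 301-555-0100
AGE = re.compile(r"\b(\d{2,3})(?=[- ]year[- ]old)")          # e.g. 93-year-old

def _mask_age(match):
    # Only ages above 90 are masked; other ages are left unchanged.
    return "[AGE>90]" if int(match.group(1)) > 90 else match.group(1)

def redact(text):
    """Replace matched dates, phone numbers and ages over 90 with placeholders."""
    text = DATE.sub("[DATE]", text)
    text = PHONE.sub("[PHONE]", text)
    return AGE.sub(_mask_age, text)

sample = "Seen 11/06/2009, call 301-555-0100. 93-year-old male."
print(redact(sample))
```

Pattern output like this would still pass through the human edit and review step, since missed identifiers are the costly error in de-identification.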
[Diagram: copies of the example document, labeled "P3I" and "Patient Record"]
De-Identification (2)
• Term Tokenizer
• POS tagger
• Name Recognition
• Address Recognition
• Human Edit and Review
• Annotation Tool
• Redaction
• Transform back to Original Format
To do List
Patient De-identification
Text Tools, GSpell, dTagger 2010 Distribution to include
• Using the 2010 Lexicon
• Updated to Java 1.6 (Generics)
• Berkeley Java DB replaced with HyperSQL
• dTagger integrated with Annotation Tool
• Eclipse Projects
SPECIALIST NLP Tools
Lister Hill National Center For Biomedical Communications
National Library of Medicine
Resources
SPECIALIST NLP Tools
http://SPECIALIST.nlm.nih.gov
Presentations, Tutorials and Documentation
http://lexsrv3.nlm.nih.gov/SPECIALIST/docs
Lexicon Technical Document
http://SPECIALIST.nlm.nih.gov/technicalReport.pdf
Contacts
General Questions
umlslex@nlm.nih.gov
Allen Browne
browne@nlm.nih.gov
Guy Divita
divita@nlm.nih.gov
Chris Lu
lu@nlm.nih.gov