Natural Language Processing COMPSCI 423/723

Computational Intelligence in
Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate
Natural Language Processing:
Words and Parses
Reading
• Chapter 8, Biomedical Informatics: Computer
Applications in Health Care and Biomedicine
by Edward H. Shortliffe (Editor) and James J.
Cimino (Editor), Springer, 2006.
Linguistics Essentials
Basic Steps of Natural
Language Processing
Sound waves → (Phonetics) → Words → (Syntactic processing) → Parses → (Semantic processing) → Meaning → (Pragmatic processing) → Meaning in context
We will skip phonetics and phonology.
Words: Morphology
• Study of internal structure of words
– carried → carry + ed (past tense)
– independently → in + (depend + ent) + ly
• English has relatively simple morphology; some other languages, such as
German or Finnish, have complex word structures
• Very accurate morphological analyzers are available for most
languages; considered a solved problem
• Biomedical domains have rich morphology:
– hydroxynitrodihydrothymine => hydroxy-nitro-di-hydro-thym-ine
– hepaticocholangiojejunostomy => hepatico-cholangio-jejuno-stom-y
• Identifying morphological structure also helps in dealing with
new words
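The decomposition of the biomedical terms above can be illustrated with a minimal sketch. It assumes a tiny hand-written morpheme inventory (`MORPHEMES` is hypothetical and covers only the two example words); real morphological analyzers use far richer lexicons and rules.

```python
from functools import lru_cache

# Hypothetical toy morpheme inventory, for illustration only.
MORPHEMES = frozenset({
    "hydroxy", "hydro", "nitro", "di", "thym", "ine",
    "hepatico", "cholangio", "jejuno", "stom", "y",
})

def segment(word):
    """Return one split of `word` into known morphemes, or None if none exists."""
    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(word):
            return ()
        # Try longer pieces first so "hydroxy" is preferred over "hydro".
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in MORPHEMES:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None

print("-".join(segment("hydroxynitrodihydrothymine")))
print("-".join(segment("hepaticocholangiojejunostomy")))
```

The search backtracks when a longer piece leads to a dead end, so it finds a segmentation whenever one exists in the inventory.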
Words: Parts of Speech
• Linguists group the words of a language into
categories that occur in similar places in a
sentence and have similar types of meaning,
e.g. nouns, verbs, adjectives; these are called
parts of speech (POS)
• A basic test to see if words belong to the same
category or not is the substitution test
– This is a good [dog/chair/pencil].
– This is a [good/bad/green/tall] chair.
Parts of Speech
• Nouns: Typically refer to entities and their names like
people, animals, things
– John, Mary, boy, girl, dog, cats, mug, table, idea
– Can be further divided as proper, singular, plural
• Pronouns: Variables or place-holders for nouns
– Nominative: I, you, he, she, we, they, it
– Accusative: me, you, him, her, us, them, it
– Possessive: my, your, his, her, our, their, its
– 2nd Possessive: mine, yours, his, hers, ours, theirs, its
– Reflexive: myself, yourself, himself, herself, ourselves,
themselves, itself
Parts of Speech
• Determiners: Describe particular reference of
a noun
– Articles: a, an, the
– Demonstratives: this, that, these, those
• Adjectives: Describe properties of nouns
– good, bad, green, tall
• Verbs: Describe actions
– talk, sleep, eat, throw
– Categorized based on tense, person,
singular/plural
Parts of Speech
• Adverbs: Modify verbs by specifying space, time, manner or
degree
– often, slowly, very
• Prepositions: Small words that express spatial relations and
other attributes
– in, on, over, of, about, to, with
– They head prepositional phrases, which often introduce
ambiguity into a sentence.
• I saw a man on the hill with a telescope.
– Prepositional phrase attachment: Another important NLP problem
• Particles: Subclass of prepositions that bond with verbs to
form phrasal verbs
– take off, air out, ran up
POS Tagging
• Automatic POS tagging is often the first step in
analyzing a sentence
John/NOUN saw/VERB the/DT saw/NOUN and/CONJ decided/VERB to/TO take/VERB it/PRP
to/PREP the/DT table/NOUN .
• Why is this a non-trivial task?
– The same word can have different POS tags in
different sentences:
• His position was near the tree. (position: Noun)
• Position him near the tree. (position: Verb)
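The need for context can be sketched with a deliberately tiny tagger. Everything here is an assumption made for illustration: the `LEXICON` is hand-written, and the single rule (an ambiguous word is a noun after a determiner or possessive, otherwise a verb) is not how real taggers work; they learn such preferences statistically.

```python
# Hypothetical toy lexicon: each word maps to the tags it can take.
LEXICON = {
    "john": ["NOUN"], "saw": ["VERB", "NOUN"], "the": ["DT"],
    "his": ["PRP$"], "him": ["PRP"], "position": ["NOUN", "VERB"],
    "was": ["VERB"], "near": ["PREP"], "tree": ["NOUN"],
}

def tag(sentence):
    """Tag each word, using the previous tag to resolve NOUN/VERB ambiguity."""
    tokens = sentence.lower().rstrip(".").split()
    tags = []
    for token in tokens:
        options = LEXICON[token]
        previous = tags[-1] if tags else None
        if len(options) > 1:
            # Toy rule: noun after a determiner or possessive, else verb.
            choice = "NOUN" if previous in ("DT", "PRP$") else "VERB"
        else:
            choice = options[0]
        tags.append(choice)
    return list(zip(tokens, tags))

print(tag("His position was near the tree."))
print(tag("Position him near the tree."))
print(tag("John saw the saw"))
```

A lookup table alone could not tag "position" or "saw" correctly in both uses; the previous tag supplies the missing context.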
Phrase Structure
• Most languages have a word order
• Words are organized into phrases: groups of
words that act as a single unit, or constituent
– [The dog] [chased] [the cat].
– [The fat dog] [chased] [the thin cat].
– [The fat dog with red collar] [chased] [the thin old
cat].
– [The fat dog with red collar named Tom] [suddenly
chased] [the thin old white cat].
Phrases
• Noun phrase: A syntactic unit of a sentence that
acts like a noun and usually contains a noun,
called its head
– An optional determiner followed by zero or more
adjectives, a noun head and zero or more prepositional
phrases
• Prepositional phrase: Headed by a preposition;
expresses spatial, temporal or other attributes
• Verb phrase: The part of the sentence that depends on the
verb. Headed by the verb.
• Adjective phrase: Acts like an adjective.
An Important NLP Task: Phrase
Chunking
• Find all non-recursive noun phrases (NPs) and
verb phrases (VPs) in a sentence.
– [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
– [NP He ] [VP reckons ] [NP the current account deficit ]
[VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ]
• Some applications need all the noun phrases
in a sentence
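Noun phrase chunking over POS-tagged input can be sketched as a single left-to-right scan. This is a simplified illustration, not a production chunker: it assumes the input is already tagged with Penn Treebank style tags, and it only recognizes the pattern "optional determiner, any adjectives, one or more nouns or pronouns".

```python
def chunk_noun_phrases(tagged):
    """Return non-recursive NP chunks from a list of (word, tag) pairs."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in ("NN", "PRP"):  # noun head(s)
            k += 1
        if k > j:                          # found at least one head word
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

sentence = [("I", "PRP"), ("ate", "VBD"), ("the", "DT"),
            ("spaghetti", "NN"), ("with", "IN"), ("meatballs", "NN")]
print(chunk_noun_phrases(sentence))
```

On the spaghetti example the scan yields the same three NP chunks shown in the bracketed annotation.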
Phrase Structure Grammars
• Syntax is the study of word orders and phrase
structures
• Syntactic analysis tells how to determine the meaning of
a sentence from the meanings of its words
– The dog bit the man.
– The man bit the dog.
• A basic question in Linguistics: What forms a legal
sentence in a language?
• Syntax helps to answer that question
– *Bit the the man dog.
• Conventionally, ‘*’ indicates an ungrammatical sentence
– Colorless green ideas sleep furiously.
• Meaningless but grammatical
Phrase Structure Grammars
• Linguists have come up with many grammar
formalisms to capture the syntax of languages;
phrase structure grammar is one of them and
is very commonly used
• A phrase structure grammar is a context-free grammar (CFG)
that generates sentences
– Context free: Only one non-terminal symbol on the left side of each production
• Productions of a small example grammar:
S → NP VP
VP → Verb
VP → Verb NP
NP → Article Noun
Verb → [slept|ate|made|bit]
Noun → [girl|cake|dog|man]
Article → [A|The]
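The small example grammar can be written down as data and expanded mechanically. The sketch below is one possible encoding (the dictionary layout and the `expand` helper are illustrative choices, and words are lowercased for simplicity); because this particular grammar has no recursive productions, every sentence it generates can be enumerated.

```python
from itertools import product

# The example grammar: each symbol maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "NP": [["Article", "Noun"]],
    "Verb": [["slept"], ["ate"], ["made"], ["bit"]],
    "Noun": [["girl"], ["cake"], ["dog"], ["man"]],
    "Article": [["a"], ["the"]],
}

def expand(symbol):
    """Yield every word sequence derivable from `symbol`.

    Terminates only because this grammar is non-recursive.
    """
    if symbol not in GRAMMAR:      # a terminal (a word)
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield [word for part in parts for word in part]

sentences = {" ".join(words) for words in expand("S")}
print(len(sentences))
print("the girl ate the cake" in sentences)
print("bit the the man dog" in sentences)
```

The grammar licenses "the girl ate the cake" but not the scrambled "bit the the man dog", which is the sense in which a grammar defines the legal sentences of a language.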
Phrase Structure Grammars
• The parse of the sentence is typically shown
as a tree
The girl ate the cake.
(S
  (NP (Article The) (Noun girl))
  (VP (Verb ate)
      (NP (Article the) (Noun cake))))
– Non-terminals: S, NP, VP, Article, Noun, Verb
– Terminals: The, girl, ate, the, cake
A syntactic derivation or a parse tree
Phrase Structure Grammars
• Some of the productions can be recursive (one inside
another, like NP → NP PP) and can then expand
several times
(S
  (NP (PRP I))
  (VP
    (VBD saw)
    (NP (NP (DT the) (NN man))
        (PP (IN on)
            (NP (NP (DT the) (NN hill))
                (PP (IN with)
                    (NP (DT the) (NN telescope)))))))))
• Because of recursion in the grammar, there is a
potentially infinite number of sentences in a
language
Syntactic Ambiguity
• A grammar can typically produce several parses for the same
sentence; this is called syntactic ambiguity
– "with the telescope" attached to "the hill":
(S
  (NP (PRP I))
  (VP
    (VBD saw)
    (NP (NP (DT the) (NN man))
        (PP (IN on)
            (NP (NP (DT the) (NN hill))
                (PP (IN with)
                    (NP (DT the) (NN telescope)))))))))
– "on the hill" and "with the telescope" both attached to "the man":
(S
  (NP (PRP I))
  (VP
    (VBD saw)
    (NP (NP (DT the) (NN man))
        (PP (IN on) (NP (DT the) (NN hill)))
        (PP (IN with) (NP (DT the) (NN telescope))))))
– "on the hill" and "with the telescope" both attached to the verb phrase:
(S
  (NP (PRP I))
  (VP
    (VBD saw)
    (NP (DT the) (NN man))
    (PP (IN on) (NP (DT the) (NN hill)))
    (PP (IN with) (NP (DT the) (NN telescope)))))
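How quickly parses multiply can be checked by counting them with a chart (CKY-style) algorithm. The sketch below is an illustration under stated assumptions: the grammar is a small binarized one written for this sentence (it uses VP → VP PP rather than the flat VP shown in the last tree), and the pronoun "I" is mapped directly to NP.

```python
from collections import defaultdict

# Assumed toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "VBD", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "DT", "NN"),
    ("NP", "NP", "PP"),
    ("PP", "IN", "NP"),
]
WORD_TAGS = {
    "i": ["NP"], "saw": ["VBD"], "the": ["DT"],
    "man": ["NN"], "hill": ["NN"], "telescope": ["NN"],
    "on": ["IN"], "with": ["IN"],
}

def count_parses(words, root="S"):
    """Count distinct parse trees for `words` rooted in `root`."""
    n = len(words)
    chart = defaultdict(int)   # (start, end, symbol) -> number of subtrees
    for i, word in enumerate(words):
        for symbol in WORD_TAGS[word]:
            chart[(i, i + 1, symbol)] += 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]

sentence = "I saw the man on the hill with the telescope".lower().split()
print(count_parses(sentence))
```

Under this toy grammar the sentence has five parses; each additional prepositional phrase adds further attachment choices, which is how longer sentences reach hundreds of parses.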
Syntactic Parsing: A Very
Important NLP Task
• Not uncommon to have hundreds of parses
for a sentence
• Syntactic parsing is the task of finding the best
parse for a sentence
• Earlier rule-based approaches were brittle
and did not work well
• Statistical methods for syntactic parsing have
been more successful and are currently being
used
Statistical Syntactic Parsing
• Statistical syntactic parsing uses a probabilistic model
of syntax to assign a probability to each
parse tree
• Provides a principled approach to resolving syntactic
ambiguity
– The more likely parse will have higher probability
– Includes POS tagging
• Probabilities are typically learned from annotated
parses of thousands of sentences, called a treebank
– Penn Treebank (http://www.cis.upenn.edu/~treebank/)
• The best-known treebank
• Contains annotated parse trees of a few thousand Wall Street
Journal articles
• Sparked progress in automated syntactic parsing methods
Probabilistic Context Free Grammar
(PCFG)
• A PCFG is a probabilistic version of a CFG where each
production has a probability.
• Probabilities of all productions rewriting a given non-terminal must sum to 1,
defining a distribution for each non-terminal.
Simple PCFG for Air Travel Domain
Grammar                      Prob
S → NP VP                    0.8
S → Aux NP VP                0.1
S → VP                       0.1    (S productions sum to 1.0)
NP → Pronoun                 0.2
NP → Proper-Noun             0.2
NP → Det Nominal             0.6    (NP productions sum to 1.0)
Nominal → Noun               0.3
Nominal → Nominal Noun       0.2
Nominal → Nominal PP         0.5    (Nominal productions sum to 1.0)
VP → Verb                    0.2
VP → Verb NP                 0.5
VP → VP PP                   0.3    (VP productions sum to 1.0)
PP → Prep NP                 1.0

Lexicon
Det → the (0.6) | a (0.2) | that (0.1) | this (0.1)
Noun → book (0.1) | flight (0.5) | meal (0.2) | money (0.2)
Verb → book (0.5) | include (0.2) | prefer (0.3)
Pronoun → I (0.5) | he (0.1) | she (0.1) | me (0.3)
Proper-Noun → Houston (0.8) | NWA (0.2)
Aux → does (1.0)
Prep → from (0.25) | to (0.25) | on (0.1) | near (0.2) | through (0.2)
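The requirement that the productions for each non-terminal sum to 1 can be checked mechanically. The sketch below encodes the air travel PCFG as a nested dictionary (the layout and the helper name `unnormalized` are illustrative choices) and reports any non-terminal whose probabilities do not form a distribution.

```python
import math

# The air travel PCFG: non-terminal -> {right-hand side: probability}.
PCFG = {
    "S": {"NP VP": 0.8, "Aux NP VP": 0.1, "VP": 0.1},
    "NP": {"Pronoun": 0.2, "Proper-Noun": 0.2, "Det Nominal": 0.6},
    "Nominal": {"Noun": 0.3, "Nominal Noun": 0.2, "Nominal PP": 0.5},
    "VP": {"Verb": 0.2, "Verb NP": 0.5, "VP PP": 0.3},
    "PP": {"Prep NP": 1.0},
    "Det": {"the": 0.6, "a": 0.2, "that": 0.1, "this": 0.1},
    "Noun": {"book": 0.1, "flight": 0.5, "meal": 0.2, "money": 0.2},
    "Verb": {"book": 0.5, "include": 0.2, "prefer": 0.3},
    "Pronoun": {"I": 0.5, "he": 0.1, "she": 0.1, "me": 0.3},
    "Proper-Noun": {"Houston": 0.8, "NWA": 0.2},
    "Aux": {"does": 1.0},
    "Prep": {"from": 0.25, "to": 0.25, "on": 0.1, "near": 0.2, "through": 0.2},
}

def unnormalized(pcfg):
    """Return the non-terminals whose production probabilities do not sum to 1."""
    return [
        lhs for lhs, rules in pcfg.items()
        if not math.isclose(sum(rules.values()), 1.0)
    ]

print(unnormalized(PCFG))
```

`math.isclose` is used instead of `==` because sums of decimal fractions are not exact in floating point.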
Sentence Probability
• Assume productions for each node are chosen
independently.
• Probability of derivation is the product of the
probabilities of its productions.
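Sentence: book the flight through Houston
Derivation D1 (probability of each production in brackets):
(S [0.1]
  (VP [0.5]
    (Verb [0.5] book)
    (NP [0.6]
      (Det [0.6] the)
      (Nominal [0.5]
        (Nominal [0.3] (Noun [0.5] flight))
        (PP [1.0]
          (Prep [0.2] through)
          (NP [0.2] (Proper-Noun [0.8] Houston)))))))
P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 1.0 x 0.2 x 0.2 x 0.5 x 0.8
      = 0.0000216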
Syntactic Disambiguation
• Resolve ambiguity by picking most probable parse
tree.
Derivation D2 (probability of each production in brackets):
(S [0.1]
  (VP [0.3]
    (VP [0.5]
      (Verb [0.5] book)
      (NP [0.6]
        (Det [0.6] the)
        (Nominal [0.3] (Noun [0.5] flight))))
    (PP [1.0]
      (Prep [0.2] through)
      (NP [0.2] (Proper-Noun [0.8] Houston)))))
P(D2) = 0.1 x 0.3 x 0.5 x 0.6 x 0.5 x 0.6 x 0.3 x 1.0 x 0.5 x 0.2 x 0.2 x 0.8
      = 0.00001296
D1 has a higher probability, hence it is the more likely parse according to the PCFG.
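The two derivation probabilities can be reproduced with a short recursive function. In this sketch the trees are encoded as nested tuples and only the productions used by D1 and D2 are listed; the encoding and the names `RULE_PROB` and `derivation_prob` are illustrative choices, while the probabilities come from the air travel PCFG.

```python
import math

# Probabilities of the productions that appear in D1 and D2.
RULE_PROB = {
    ("S", ("VP",)): 0.1,
    ("VP", ("Verb", "NP")): 0.5,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("Det", "Nominal")): 0.6,
    ("NP", ("Proper-Noun",)): 0.2,
    ("Nominal", ("Noun",)): 0.3,
    ("Nominal", ("Nominal", "PP")): 0.5,
    ("PP", ("Prep", "NP")): 1.0,
    ("Verb", ("book",)): 0.5,
    ("Det", ("the",)): 0.6,
    ("Noun", ("flight",)): 0.5,
    ("Prep", ("through",)): 0.2,
    ("Proper-Noun", ("Houston",)): 0.8,
}

def label(node):
    """A node is either a word (str) or a tuple (symbol, child, ...)."""
    return node if isinstance(node, str) else node[0]

def derivation_prob(tree):
    """Multiply the probabilities of all productions used in `tree`."""
    if isinstance(tree, str):      # a word contributes no production
        return 1.0
    head, *children = tree
    prob = RULE_PROB[(head, tuple(label(child) for child in children))]
    for child in children:
        prob *= derivation_prob(child)
    return prob

PP = ("PP", ("Prep", "through"), ("NP", ("Proper-Noun", "Houston")))
D1 = ("S", ("VP", ("Verb", "book"),
            ("NP", ("Det", "the"),
             ("Nominal", ("Nominal", ("Noun", "flight")), PP))))
D2 = ("S", ("VP",
            ("VP", ("Verb", "book"),
             ("NP", ("Det", "the"), ("Nominal", ("Noun", "flight")))),
            PP))

p1, p2 = derivation_prob(D1), derivation_prob(D2)
print(p1, p2, "D1" if p1 > p2 else "D2")
```

The only productions that differ between the two trees are Nominal → Nominal PP (0.5) in D1 and VP → VP PP (0.3) in D2, which is why D1 comes out ahead.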
Syntactic Parsing
• State-of-the-art syntactic parsers also use words
to influence the probabilities of productions
– VP → Verb NP with “sneeze” as the verb will have a
different probability than VP → Verb NP with “eat” as the
verb
• You don’t sneeze something but you eat something
• Try the online version of the Stanford Parser:
– http://nlp.stanford.edu:8080/parser/
Syntax of Biomedical Languages
• Clinical language often relaxes many syntactic constraints in
order to be highly compact
– The cough worsened
– Cough worsened
– Cough
– Increased tenderness.
• Because such forms are widely used, they are not considered
ungrammatical; they are treated as part of a sublanguage
• There is a wide variety of sublanguages in the biomedical
domains, each exhibiting specialized content and linguistic
forms
• Parsers trained in one domain typically do not work well in
another domain; adaptation is required