Morphological Parsing CS 4705

Parsing
• Taking a surface input and analyzing its
components and underlying structure
• Morphological parsing: taking a word or string of
words as input and identifying their stems and
affixes (and sometimes interpreting these)
– E.g.:
• goose → goose +N +SG or goose +V
• geese → goose +N +PL
• gooses → goose +V +3SG
– Bracketing: indecipherable → [in [[de [cipher]] able]]
Why ‘parse’ words?
• To find stems
– Simple key to word similarity
– Yellow, yellowish, yellows, yellowed, yellowing…
• To find affixes and the information they convey
– ‘ed’ signals a verb
– ‘ish’ an adjective
– ‘s’?
• Morphological parsing provides information about
a word’s semantics and the syntactic role it plays
in a sentence
Some Practical Applications
• For spell-checking
– Is muncheble a legal word?
• To identify a word’s part-of-speech (pos)
– For sentence parsing, for machine translation, …
• To identify a word’s stem
– For information retrieval
• Why not just list all word forms in a lexicon?
What do we need to build a morphological
parser?
• Lexicon: list of stems and affixes (with their
corresponding parts of speech)
• Morphotactics of the language: model of how and
which morphemes can be affixed to a stem
• Orthographic rules: spelling modifications that
may occur when affixation occurs
– in- → il- in the context of l (in- + legal → illegal)
• Most morphological phenomena can be described
with regular expressions, so finite-state
techniques are often used to represent morphological
processes
Using FSAs to Represent English Plural Nouns
• English nominal inflection
[Figure: FSA over states q0, q1, q2 with arcs labeled reg-n, irreg-sg-n, irreg-pl-n, and plural (-s)]
• Inputs: cats, geese, goose
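The plural-noun FSA can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the word lists are a hypothetical toy lexicon, and the topology (reg-n leads to q1, both irregular classes lead to q2, plural -s leads from q1 to q2, with q1 and q2 both final) is an assumed reading of the figure.

```python
# Hypothetical toy lexicon; the word lists are illustrative assumptions.
LEXICON = {
    "reg-n": {"cat", "dog", "fox"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}

# Assumed topology: arcs are labeled with morpheme classes, not letters.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL = {"q1", "q2"}


def accepts(word):
    """Return True if word is a known stem, optionally followed by plural -s."""
    # Try every split of the word into stem + suffix.
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        for noun_class, stems in LEXICON.items():
            if stem not in stems:
                continue
            state = TRANSITIONS[("q0", noun_class)]
            if suffix == "s":
                # Only regular nouns have a plural-s arc out of their state.
                state = TRANSITIONS.get((state, "plural-s"))
            elif suffix != "":
                continue
            if state in FINAL:
                return True
    return False


for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts(w))
```

Because no orthographic rules are modeled here, this sketch accepts foxs and rejects foxes; spelling changes need a separate mechanism.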
• Derivational morphology: adjective fragment
[Figure: FSA over states q0–q5 with arcs labeled un-, adj-root1, adj-root2, "-er, -ly, -est" (after adj-root1), and "-er, -est" (after adj-root2)]
• Adj-root1: clear, happy, real (clearly)
• Adj-root2: big, red (*bigly)
FSAs can also represent the Lexicon
• Expand each non-terminal arc in the previous FSA
into a sub-lexicon FSA (e.g. adj-root2 = {big,
red}) and then expand each of these stems into its
letters (e.g. red → r e d) to get a recognizer for
adjectives
[Figure: letter-level FSA over states q0–q7 with arcs un-, r, e, d, b, i, g, and "-er, -est"]
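The expansion of stems into letter arcs can be illustrated with a small trie-shaped FSA. This is a sketch, not the construction used in the lecture: the function names `build_letter_fsa` and `accepts` are hypothetical, states are plain integers, and only the adj-root2 stems with the suffixes -er and -est are covered.

```python
def build_letter_fsa(stems, suffixes):
    """Expand stems and suffixes into single-letter arcs (a trie-shaped FSA)."""
    transitions = {}  # (state, letter) -> next state
    finals = set()
    next_state = 1    # state 0 is the start state

    def add_path(start, letters):
        # Follow existing arcs where possible, creating new states as needed.
        nonlocal next_state
        state = start
        for ch in letters:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        return state

    for stem in stems:
        stem_end = add_path(0, stem)
        finals.add(stem_end)                       # the bare stem is a word
        for suffix in suffixes:
            finals.add(add_path(stem_end, suffix))  # stem + suffix is a word
    return transitions, finals


def accepts(fsa, word):
    """Run the deterministic letter FSA over word."""
    transitions, finals = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


adj_fsa = build_letter_fsa(["big", "red"], ["er", "est"])
print(accepts(adj_fsa, "red"), accepts(adj_fsa, "bigly"))
```

Without spelling rules the recognizer accepts reder and bigest rather than redder and biggest, and every new stem adds states, which hints at the size and maintenance problems of expanding a whole lexicon this way.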
But…
• Covering the whole lexicon this way will require
very large FSAs with consequent search and
maintenance problems
– Adding new items to the lexicon means recomputing
the whole FSA
– Non-determinism
• FSAs tell us whether a word is in the language or
not – but usually we want to know more:
– What is the stem?
– What are the affixes and what sort are they?
– We used this information to recognize the word: why
can’t we store it?
Parsing with Finite State Transducers
• cats → cat +N +PL (a plural NP)
• Kimmo Koskenniemi’s two-level morphology
– Idea: word is a relationship between lexical level (its
morphemes) and surface level (its orthography)
– Morphological parsing: find the mapping
(transduction) between lexical and surface levels
lexical: c a t +N +PL
surface: c a t s
Finite State Transducers can represent this mapping
• FSTs map between one set of symbols and another
using an FSA whose alphabet Σ is composed of
pairs of symbols from input and output alphabets
• In general, FSTs can be used for
– Translators (Hello:Ciao)
– Parser/generators (Hello:How may I help you?)
– As well as Kimmo-style morphological parsing
• An FST is a 5-tuple consisting of
– Q: a set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of complex symbols, each an i/o pair i:o s.t.
i ∈ I (an input alphabet) and o ∈ O (an output alphabet),
so Σ ⊆ I × O
– q0: a start state
– F: a set of final states in Q {q4}
– δ(q,i:o): a transition function mapping Q × Σ to Q
– Emphatic Sheep → Quizzical Cow
[Figure: FST over states q0–q4 with arcs b:m, a:o, a:o, a:o, !:?]
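The sheep-to-cow transducer is small enough to write out as a transition table. The sketch below assumes the usual sheep-language topology (q0 to q1 on b:m, q1 to q2 and q2 to q3 on a:o, a self-loop on q3 for a:o, and q3 to q4 on !:?); the figure's exact arcs are not recoverable from the text, so treat that layout as an assumption. The name `transduce` is hypothetical.

```python
# delta maps (state, input symbol) -> (next state, output symbol).
# Assumed topology: b:m, a:o, a:o, then a self-loop a:o on q3, then !:?.
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),
    ("q3", "!"): ("q4", "?"),
}
START = "q0"
FINALS = {"q4"}


def transduce(text):
    """Return the output string for text, or None if the FST rejects it."""
    state, output = START, []
    for symbol in text:
        if (state, symbol) not in DELTA:
            return None
        state, out_symbol = DELTA[(state, symbol)]
        output.append(out_symbol)
    return "".join(output) if state in FINALS else None


print(transduce("baa!"))    # moo?
print(transduce("baaaa!"))  # moooo?
```

Swapping the two symbols in every pair of `DELTA` would give the inverse machine, which maps cow utterances back to sheep utterances.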
FST for a 2-level Lexicon
• E.g.
[Figure: lexicon FST over states q0–q7, with arc labels including c:c, a:a, t:t, e:o, g, s, e]
Reg-n: cat
Irreg-pl-n: g o:e o:e s e
Irreg-sg-n: goose
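FST for English Nominal Inflection
[Figure: FST over states q0–q7 with stem arcs reg-n, irreg-n-sg, and irreg-n-pl out of q0, a +N:ε arc after each stem class, and number arcs labeled +SG:-#, +PL:^s#, and +PL:-s#]
lexical: c a t +N +PL
surface: c a t s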
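The mapping this inflection FST computes, from a lexical string to an intermediate string, can be approximated with two lookup tables. This is a table-driven sketch rather than an explicit state machine, and several details are assumptions: the toy stems, the reading of the number arcs as +SG:# and +PL:^s# for regular nouns and +SG:# / +PL:# for irregular ones, and the hypothetical function name `generate`.

```python
# Hypothetical toy lexicon: lexical stem -> stem as written on the
# intermediate tape (irregular plurals change o:e o:e in the stem).
LEXICON = {
    "reg-n": {"cat": "cat", "fox": "fox"},
    "irreg-n-sg": {"goose": "goose"},
    "irreg-n-pl": {"goose": "geese"},
}

# Number arcs available after each stem class (assumed outputs):
# ^ marks a morpheme boundary and # a word boundary.
FEATURE_ARCS = {
    "reg-n": {"+SG": "#", "+PL": "^s#"},
    "irreg-n-sg": {"+SG": "#"},
    "irreg-n-pl": {"+PL": "#"},
}


def generate(lexical):
    """Map e.g. 'cat +N +PL' to its intermediate forms, e.g. ['cat^s#']."""
    stem, *tags = lexical.split()
    if len(tags) != 2 or tags[0] != "+N":
        return []
    number = tags[1]
    results = []
    for noun_class, stems in LEXICON.items():
        if stem in stems and number in FEATURE_ARCS[noun_class]:
            # +N itself outputs nothing (the +N:ε arc).
            results.append(stems[stem] + FEATURE_ARCS[noun_class][number])
    return results


for form in ["cat +N +PL", "goose +N +SG", "goose +N +PL"]:
    print(form, "->", generate(form))
```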
Useful Operations on Transducers
• Cascade: running 2+ FSTs in sequence
• Intersection: represent the common transitions in
FST1 and FST2 (ASR: finding pronunciations)
• Composition: apply FST2 transition function to
result of FST1 transition function
• Inversion: exchanging the input and output
alphabets (recognize and generate with same FST)
• Cf. the AT&T FSM Toolkit and papers by Mohri,
Pereira, and Riley
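For intuition, these operations can be tried on the relations that transducers denote rather than on the machines themselves. The sketch below represents a finite relation as a Python set of (input, output) string pairs; the helper names `invert` and `compose` and the two toy relations are hypothetical, and real toolkits perform the same operations on states and arcs instead.

```python
def invert(relation):
    """Inversion: swap input and output, turning a generator into a parser."""
    return {(out, inp) for inp, out in relation}


def compose(r1, r2):
    """Composition: feed each output of r1 into r2 as its input."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}


# Toy relations (assumed examples): lexical -> intermediate -> surface.
LEX_TO_INTER = {("cat +N +PL", "cat^s#"), ("fox +N +PL", "fox^s#")}
INTER_TO_SURF = {("cat^s#", "cats"), ("fox^s#", "foxes")}

# Composition collapses the two-step cascade into one relation.
lex_to_surf = compose(LEX_TO_INTER, INTER_TO_SURF)

# Inverting the generator yields a parser from surface to lexical forms.
surf_to_lex = dict(invert(lex_to_surf))
print(surf_to_lex["foxes"])

# Intersection keeps only the pairs both relations agree on.
common = lex_to_surf & {("cat +N +PL", "cats"), ("dog +N +PL", "dogs")}
print(common)
```

Composing once and reusing the result avoids running the cascade step by step for every input, which is the practical reason composition matters.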
Orthographic Rules and FSTs
• Define additional FSTs to implement rules such as
consonant doubling (beg → begging), ‘e’ deletion
(make → making), ‘e’ insertion (watch →
watches), etc.
Lexical: f o x +N +PL
Intermediate: f o x ^ s #
Surface: f o x e s
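The step from the intermediate tape to the surface tape can be sketched with a regular-expression rewrite. This is a stand-in for the rule FST, not an FST itself. The triggering contexts (x, s, z, ch, sh before the suffix s) are an assumption chosen to cover fox and watch, and the function names are hypothetical.

```python
import re


def e_insertion(intermediate):
    """Insert e after the morpheme boundary ^ when the stem ends in
    x, s, z, ch, or sh and the suffix is s followed by the word boundary #."""
    return re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1^e", intermediate)


def to_surface(intermediate):
    """Apply the spelling rule, then erase the boundary symbols ^ and #."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "watch^s#"]:
    print(form, "->", to_surface(form))
```

Forms that do not match the context, such as cat^s#, pass through the rule unchanged and only lose their boundary symbols.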
Porter Stemmer (1980)
• Used for tasks in which you only care about the stem
– IR, modeling given/new distinction, topic detection, document
similarity
• Lexicon-free morphological analysis
• Cascades rewrite rules (e.g. misunderstanding →
misunderstand → understand → …)
• Easily implemented as an FST with rules, e.g.
– ATIONAL → ATE
– ING → ε
• Not perfect…
– Doing → doe
– Policy → police
• Does stemming help?
– IR, little
– Topic detection, more
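A lexicon-free cascade of suffix rewrites can be sketched as follows. This is not the Porter algorithm: it applies only the two rules named above, in order, and omits the conditions the real stemmer attaches to each rule. The name `toy_stem` is hypothetical.

```python
# Ordered (suffix, replacement) rules; only the two rules from the slide.
RULES = [
    ("ational", "ate"),  # ATIONAL -> ATE
    ("ing", ""),         # ING -> (empty string)
]


def toy_stem(word):
    """Apply each rule in turn to the output of the previous one."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
    return word


for w in ["relational", "misunderstanding", "walking", "sing"]:
    print(w, "->", toy_stem(w))
```

With no lexicon and no conditions on the stem, the sketch reduces sing to s, which shows why unconditioned rewrite rules over-stem.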
Summing Up
• FSTs provide a useful tool for implementing a
standard model of morphological analysis,
Kimmo’s two-level morphology
• But for many tasks (e.g. IR) much simpler
approaches are still widely used, e.g. the rule-based Porter Stemmer
• Next time:
– Read Ch 3.10-11, 3.13 (new version)