Morphological Parsing
CS 4705
Parsing
• Taking a surface input and analyzing its
components and underlying structure
• Morphological parsing: taking a word or string of
words as input and identifying the stems and
affixes (and possibly interpreting these)
– E.g.:
• goose  goose +N +SG or goose + V
• geese  goose +N +PL
• gooses  goose +V +3SG
– Bracketing: indecipherable  [in [ [de [cipher] ] able] ]
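As a minimal sketch of the input/output behaviour described above, the three goose examples can be stored in a hand-written table. The names `ANALYSES` and `parse` are hypothetical, and a real morphological parser would compute these analyses instead of listing them.

```python
# Toy analysis table covering only the slide's examples
# (an assumption for illustration, not a real lexicon).
ANALYSES = {
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PL"],
    "gooses": ["goose +V +3SG"],
}


def parse(word):
    """Return every listed analysis of a surface word, or [] if unknown."""
    return ANALYSES.get(word.lower(), [])


for w in ["goose", "geese", "gooses"]:
    print(w, "->", parse(w))
```

Note that goose is ambiguous between a noun and a verb reading, so the function returns a list of analyses.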
Why ‘parse’ words?
• To find stems
– Simple key to word similarity
– Yellow, yellowish, yellows, yellowed, yellowing…
• To find affixes and the information they convey
– ‘ed’ signals a verb
– ‘ish’ an adjective
– ‘s’?
• Morphological parsing provides information about
a word’s semantics and the syntactic role it plays
in a sentence
Some Practical Applications
• For spell-checking
– Is muncheble a legal word?
• To identify a word’s part-of-speech (pos)
– For sentence parsing, for machine translation, …
• To identify a word’s stem
– For information retrieval
• Why not just list all word forms in a lexicon?
What do we need to build a morphological
parser?
• Lexicon: list of stems and affixes (w/
corresponding p.o.s.)
• Morphotactics of the language: model of how and
which morphemes can be affixed to a stem
• Orthographic rules: spelling modifications that
may occur when affixation occurs
– in  il in context of l (in- + legal)
• Most morphological phenomena can be described
with regular expressions – so finite state
techniques often used to represent morphological
processes
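The orthographic rule for in- can be written as a single regular-expression rewrite. This is a sketch only: the function name `attach_in` is hypothetical, and it covers just the l context mentioned above, not other variants of the prefix.

```python
import re


def attach_in(stem):
    """Attach the prefix in- to a stem, rewriting it to il- before l."""
    word = "in" + stem
    # The lookahead (?=l) checks the following letter without consuming it.
    return re.sub(r"^in(?=l)", "il", word)


print(attach_in("legal"))   # illegal
print(attach_in("decent"))  # indecent
```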
Using FSAs to Represent English Plural
Nouns
• English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs reg-n, irreg-sg-n, irreg-pl-n, plural (-s)]
• Inputs: cats, geese, goose
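A small recognizer for the plural-noun FSA might look like the sketch below. The arc layout (reg-n to q1, then plural -s to q2, with both irregular classes going straight to q2) and the choice of q1 and q2 as final states are assumptions following the usual textbook presentation. The sub-lexicons and function names are hypothetical.

```python
# Hypothetical sub-lexicons for each arc label.
REG_N = {"cat", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

# Assumed arc layout over morpheme classes.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL = {"q1", "q2"}


def label(morpheme):
    """Map a morpheme to the arc label it can travel on, or None."""
    if morpheme in REG_N:
        return "reg-n"
    if morpheme in IRREG_SG_N:
        return "irreg-sg-n"
    if morpheme in IRREG_PL_N:
        return "irreg-pl-n"
    if morpheme == "s":
        return "plural-s"
    return None


def run(morphemes):
    """Follow the arcs for a morpheme sequence; accept in a final state."""
    state = "q0"
    for m in morphemes:
        state = TRANSITIONS.get((state, label(m)))
        if state is None:
            return False
    return state in FINAL


def accepts(word):
    """Try the word as one morpheme, then as stem + s."""
    candidates = [[word]]
    if len(word) > 1 and word.endswith("s"):
        candidates.append([word[:-1], "s"])
    return any(run(c) for c in candidates)


for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts(w))
```

The recognizer only answers yes or no. It rejects gooses because no plural -s arc leaves the state reached by an irregular singular noun.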
• Derivational morphology: adjective fragment
[Figure: FSA with states q0–q5 and arcs un-, ε, adj-root1 (followed by -er, -ly, -est), adj-root2 (followed by -er, -est)]
• Adj-root1: clear, happi, real (clearly)
• Adj-root2: big, red (*bigly)
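Because the fragment is finite-state, it can also be written as one regular expression over morpheme sequences. In this sketch morphemes are joined with "+", and the prefix un- is allowed only before adj-root1. Both choices are assumptions made for illustration, and the names `ADJ` and `is_adjective` are hypothetical.

```python
import re

ROOT1 = "clear|happi|real"  # roots that take -er, -ly, -est
ROOT2 = "big|red"           # roots that take only -er, -est

# Morphotactics as a regular expression over "+"-joined morphemes.
ADJ = re.compile(
    rf"(?:un\+)?(?:{ROOT1})(?:\+(?:er|ly|est))?"
    rf"|(?:{ROOT2})(?:\+(?:er|est))?"
)


def is_adjective(morphemes):
    """True if the morpheme sequence is licensed by the fragment."""
    return ADJ.fullmatch("+".join(morphemes)) is not None


print(is_adjective(["clear", "ly"]))       # True
print(is_adjective(["big", "ly"]))         # False
print(is_adjective(["un", "happi", "er"])) # True
```

The check works on morphemes, not spellings, so turning happi + er into happier is left to the orthographic rules.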
FSAs can also represent the Lexicon
• Expand each non-terminal arc in the previous FSA
into a sub-lexicon FSA (e.g. adj_root2 = {big,
red}) and then expand each of these stems into its
letters (e.g. red  r e d) to get a recognizer for
adjectives
[Figure: letter-level FSA with states q0–q6 and arcs b, i, g, r, e, d, ε]
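Expanding stems into letters can be sketched as building a letter-level automaton from a word list. The helper names below are hypothetical, and states are plain integers with 0 as the start state. This is a trie-shaped construction, so its arc layout may differ from the figure.

```python
def build_letter_fsa(words):
    """Build a deterministic letter-level FSA (a trie) from a word list."""
    transitions = {}   # (state, letter) -> next state
    final = set()
    next_state = 1     # state 0 is the start state
    for word in words:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        final.add(state)
    return transitions, final


def accepts(fsa, word):
    """Follow one arc per letter; accept only if we end in a final state."""
    transitions, final = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in final


adj_root2 = build_letter_fsa(["big", "red"])
print(accepts(adj_root2, "red"))  # True
print(accepts(adj_root2, "bed"))  # False
```

Adding a stem means extending the transition table, and that growth is the maintenance cost of this representation.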
But…
• Covering the whole lexicon this way will require very
large FSAs with consequent search and maintenance
problems
– Adding new items to the lexicon means recomputing the whole
FSA
– Non-determinism
– Some stems require modification when they acquire affixes
• FSAs tell us whether a word is in the language or not – but
usually we want to know more:
– What is the stem?
– What are the affixes and what sort are they?
– We used this information to recognize the word: why can’t we
store it?
Parsing with Finite State Transducers
• cats cat +N +PL (a plural NP)
• Kimmo Koskenniemi’s two-level morphology
– Idea: word is a relationship between lexical level (its
morphemes) and surface level (its orthography)
– Morphological parsing: find the mapping
(transduction) between lexical and surface levels
lexical:  c a t +N +PL
surface:  c a t s
Finite State Transducers can represent this
mapping
• FSTs map between one set of symbols and another
using an FSA whose alphabet Σ is composed of
pairs of symbols from input and output alphabets
• In general, FSTs can be used for
– Translators (Hello:Ciao)
– Parser/generators (Hello:How may I help you?)
– As well as Kimmo-style morphological parsing
• FST is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of complex symbols, each an i/o pair s.t.
i ∈ I (an input alphabet) and o ∈ O (an output alphabet)
and Σ ⊆ I × O
– q0: a start state
– F: a set of final states in Q {q4}
– δ(q,i:o): a transition function mapping Q × Σ to Q
– Quizzical Cow → Emphatic Sheep
[Figure: FST with states q0–q4 and arcs m:b, o:a, o:a, o:a, ?:!]
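The Quizzical Cow → Emphatic Sheep transducer can be sketched as a table from (state, input symbol) to (next state, output symbol). The five arc labels come from the slide. Which states they connect, in particular placing the repeated o:a as a self-loop on q3, is an assumption. The name `transduce` is hypothetical.

```python
# delta: (state, input symbol) -> (next state, output symbol)
TRANSITIONS = {
    ("q0", "m"): ("q1", "b"),
    ("q1", "o"): ("q2", "a"),
    ("q2", "o"): ("q3", "a"),
    ("q3", "o"): ("q3", "a"),  # assumed self-loop for extra o's
    ("q3", "?"): ("q4", "!"),
}
START = "q0"
FINAL = {"q4"}


def transduce(text):
    """Return the output string, or None if the input is not accepted."""
    state = START
    output = []
    for ch in text:
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None
        state, out_symbol = step
        output.append(out_symbol)
    return "".join(output) if state in FINAL else None


print(transduce("moo?"))   # baa!
print(transduce("mooo?"))  # baaa!
print(transduce("moo"))    # None
```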
FST for a 2-level Lexicon
• E.g.
[Figure: two-level lexicon FST with states q0–q7. Reg-n: c:c a:a t:t (cat). Irreg-sg-n: g o o s e (goose). Irreg-pl-n: g o:e o:e s e (goose → geese)]
FST for English Nominal Inflection
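[Figure: FST with states q0–q7. Arcs from q0: reg-n, irreg-n-sg, irreg-n-pl. Each is followed by a +N:ε arc. Number arcs: +SG:# and +PL:^s# after regular nouns, +SG:# after irregular singulars, +PL:# after irregular plurals]
lexical:  c a t +N +PL
surface:  c a t s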
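The generation direction of this transducer (lexical to intermediate) can be sketched as follows. The arc outputs (+N:ε, +SG:#, +PL:^s#, +PL:#) are taken from the slide. The mini-lexicon, the dictionary layout and the name `generate` are hypothetical, and the sketch looks up whole stems where a real transducer would walk letter arcs.

```python
# Hypothetical mini-lexicon.
REG_N = {"cat", "dog", "fox"}
IRREG_N = {"goose": "geese", "mouse": "mice"}  # singular -> plural form

# Output of the number arc on each path; the +N arc outputs nothing (ε).
NUMBER_ARCS = {
    "reg-n": {"+SG": "#", "+PL": "^s#"},
    "irreg-n-sg": {"+SG": "#"},
    "irreg-n-pl": {"+PL": "#"},
}


def generate(lexical):
    """Map e.g. 'cat +N +PL' to the intermediate form 'cat^s#'."""
    parts = lexical.split()
    if len(parts) != 3 or parts[1] != "+N":
        return None
    stem, _, number = parts
    paths = []
    if stem in REG_N:
        paths.append(("reg-n", stem))
    if stem in IRREG_N:
        paths.append(("irreg-n-sg", stem))
        paths.append(("irreg-n-pl", IRREG_N[stem]))
    for path, form in paths:
        out = NUMBER_ARCS[path].get(number)
        if out is not None:
            return form + out
    return None


print(generate("cat +N +PL"))    # cat^s#
print(generate("goose +N +PL"))  # geese#
```

The ^ marks a morpheme boundary and # a word boundary. Both are removed or rewritten when the intermediate form is mapped to the surface form.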
Useful Operations on Transducers
• Cascade: running 2+ FSTs in sequence
• Intersection: represent the common transitions in
FST1 and FST2 (ASR: finding pronunciations)
• Composition: apply FST2 transition function to
result of FST1 transition function
• Inversion: exchanging the input and output
alphabets (recognize and generate with same FST)
• cf AT&T FSM Toolkit and papers by Mohri,
Pereira, and Riley
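Composition and inversion can be illustrated on the relations that transducers define. In this sketch a transducer is modeled as a finite set of (input, output) string pairs instead of a state machine. That is a simplifying assumption, since toolkits operate on the automata directly. All names are hypothetical.

```python
def compose(r1, r2):
    """Pairs (a, c) such that r1 maps a to b and r2 maps b to c."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


def invert(r):
    """Swap input and output sides of every pair."""
    return {(o, i) for (i, o) in r}


LEX_TO_INTER = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTER_TO_SURF = {("fox^s#", "foxes"), ("cat^s#", "cats")}

# Composition gives a single lexical-to-surface generator.
generator = compose(LEX_TO_INTER, INTER_TO_SURF)
# Inversion turns the generator into a parser.
parser = invert(generator)

print(sorted(generator))
print(dict(parser)["foxes"])  # fox +N +PL
```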
Orthographic Rules and FSTs
• Define additional FSTs to implement rules such as
consonant doubling (beg → begging), ‘e’ deletion
(make → making), ‘e’ insertion (watch →
watches), etc.
Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #
Surface:       f o x e s
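The ‘e’ insertion step from the intermediate to the surface level can be sketched as a regular-expression rewrite. The set of triggering contexts (x, s, z, ch, sh) is an assumption, and the name `to_surface` is hypothetical. Consonant doubling and ‘e’ deletion are not covered.

```python
import re


def to_surface(intermediate):
    """Apply e-insertion, then drop the boundary symbols ^ and #."""
    # Replace the morpheme boundary with e when it follows
    # x, s, z, ch or sh and precedes the suffix s at the word end.
    with_e = re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


print(to_surface("fox^s#"))    # foxes
print(to_surface("cat^s#"))    # cats
print(to_surface("watch^s#"))  # watches
```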
Porter Stemmer (1980)
• Used for tasks in which you only care about the stem
– IR, modeling given/new distinction, topic detection, document
similarity
• Lexicon-free morphological analysis
• Cascades rewrite rules (e.g. misunderstanding -->
misunderstand --> understand --> …)
• Easily implemented as an FST with rules e.g.
– ATIONAL  ATE
– ING  ε
• Not perfect…
– doing → doe
– policy → police
• Does stemming help?
– IR, little
– Topic detection, more
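Lexicon-free suffix rewriting can be sketched with the two rules quoted above. This is not the Porter algorithm, which has many more rules and places conditions on the remaining stem. The names `RULES` and `strip_suffix` are hypothetical.

```python
# Only the two rewrite rules quoted on the slide, in priority order.
RULES = [
    ("ational", "ate"),
    ("ing", ""),
]


def strip_suffix(word):
    """Apply the first matching rule once; leave other words unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(strip_suffix("relational"))  # relate
print(strip_suffix("walking"))     # walk
print(strip_suffix("sing"))        # s
```

With no lexicon and no condition on the stem, sing is reduced to s. Over-stripping of this kind is what the extra conditions in a real stemmer try to limit.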
Summing Up
• FSTs provide a useful tool for implementing a
standard model of morphological analysis,
Kimmo’s two-level morphology
• But for many tasks (e.g. IR) much simpler
approaches are still widely used, e.g. the rule-based Porter Stemmer
• Next time:
– Read Ch 5:1-8
• HW1 assigned (read the assignment)