Finite state morphology

Natural Language Processing – Part 1: Words
Maria Holmqvist, marho@ida.liu.se
Part 1 – Designing a FST for lexical parsing of Swedish nouns
The task was to design a finite state transducer (FST) that translates Swedish noun word
forms into a lexical description containing the lemma, part-of-speech category,
gender, definiteness, number and case. An FST was written for each of the five Swedish
declensions, illustrated here by the following example nouns:
1 flicka, ros
2 pojke, spik
3 bild, sko
4 byte
5 hus
To combine these five transducers into one, a simple procedure was used: all states and
transitions were included in one large transducer for all noun declensions, with every
FST given the same start and end state. This larger FST for all nouns contains 36
states.
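The merging step can be illustrated with a short Python sketch (not the Perl used in this lab). It assumes each FST is stored as a dictionary from state number to a list of (input, output, next state) triples, with state 0 as the start state and a single accepting end state. The function name combine, this representation, and the second toy transducer TOY are assumptions made for the example; only ROS follows Table 1.

```python
def combine(fsts):
    """Merge FSTs by giving them one shared start state and one shared end state.

    fsts is a list of (transitions, end) pairs; transitions maps a state to
    a list of (input, output, next_state) triples and state 0 is the start.
    """
    next_free = 1
    mappings = []
    # First pass: renumber the inner states of every FST so they do not clash.
    for transitions, end in fsts:
        states = set(transitions) | {
            n for paths in transitions.values() for _, _, n in paths
        }
        mapping = {0: 0}  # every FST keeps the shared start state 0
        for s in sorted(states - {0, end}):
            mapping[s] = next_free
            next_free += 1
        mappings.append(mapping)
    shared_end = next_free
    # Second pass: copy all transitions into one table using the new numbers.
    merged = {0: []}
    for (transitions, end), mapping in zip(fsts, mappings):
        mapping[end] = shared_end
        for state, paths in transitions.items():
            merged.setdefault(mapping[state], []).extend(
                (x, y, mapping[n]) for x, y, n in paths
            )
    return merged, shared_end


# The FST for "ros" from Table 1 (end state 6).
ROS = {
    0: [("ros", "ros", 1)],
    1: [("E", " N UTR SG", 2), ("or", " N UTR PL", 3)],
    2: [("en", " DEF", 4), ("E", " INDEF", 5)],
    3: [("na", " DEF", 4), ("E", " INDEF", 4)],
    4: [("#", " NOM", 6), ("s#", " GEN", 6)],
    5: [("#", " NOM", 6), ("#", " GEN", 6)],
}
# A made-up two-step transducer, used only to show the merge (end state 2).
TOY = {0: [("x", "x", 1)], 1: [("#", " Y", 2)]}

merged, end = combine([(ROS, 6), (TOY, 2)])
print(len(set(merged) | {end}))  # 8 states: 7 + 3, minus the shared start and end
```

Because only the start and end states are shared, the state count of the combined FST is the sum of the individual counts minus two for every FST after the first.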
Part 2 – Implementing the Finite state transducer
The lexical parser is made up of two components, (i) the transducer description and (ii)
a program for processing an input and traversing through the states of the transducer.
FST description
The description of the finite state transducer is a simplified version of the table format in
Jurafsky and Martin (2000). Only the “significant” cells from the table are included in
this description, i.e., only the valid transitions between states. Table 1 contains the
description of the FST for inflections of the noun ros. The first column contains all
states and the rest of each row contains the alternative paths from that state. A path from
a state has the format x:y:n, where x is the accepted input, y the output and n the new
state. Starting at state 0, the only valid input is ‘ros’. All other inputs will result in a
failed analysis. When encountering input ‘ros’, ‘ros’ will be output and we move to state
1. In state 1, the valid input is ‘or’ (alongside an E-path that consumes no input), and
so on. Compare this description with the FST figure above.
Three special symbols are used in the description. The symbol ‘E’ is used as the
epsilon symbol. If ‘E’ stands in input position, it means that we move down this path
without looking at the input. If ‘E’ is in output position, nothing will be output. The
‘#’-symbol is used to denote ‘end-of-string’. The states in the first column are marked with
a ‘:’-symbol if they are accepting states.
State   Legal transitions
0       ros:ros:1
1       E: N UTR SG:2     or: N UTR PL:3
2       en: DEF:4         E: INDEF:5
3       na: DEF:4         E: INDEF:4
4       #: NOM:6          s#: GEN:6
5       #: NOM:6          #: GEN:6
6:
Table 1. The FST for “ros”
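To make the x:y:n format concrete, here is a minimal Python sketch of how such a description could be read into a transition table. The on-disk layout of the .fst files is not shown in this text, so the sketch assumes one state per line with tab-separated fields; that layout and the name parse_fst are assumptions, and this is not the Perl program used in the lab.

```python
def parse_fst(text):
    """Read a description with one state per line and tab-separated paths.

    Returns (transitions, accepting): transitions maps a state number to a
    list of (input, output, next_state) triples, accepting is a set of states.
    """
    transitions = {}
    accepting = set()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        state_field, *paths = line.split("\t")
        state = int(state_field.rstrip(":"))
        if state_field.endswith(":"):
            accepting.add(state)  # a trailing ':' marks an accepting state
        transitions[state] = []
        for path in paths:
            if not path:
                continue  # tolerate a trailing tab
            x, y, n = path.split(":")
            # 'E' in output position means that nothing is output.
            transitions[state].append((x, "" if y == "E" else y, int(n)))
    return transitions, accepting


# The rows of Table 1, with tabs between the fields (assumed layout).
ROS_DESCRIPTION = "\n".join([
    "0\tros:ros:1",
    "1\tE: N UTR SG:2\tor: N UTR PL:3",
    "2\ten: DEF:4\tE: INDEF:5",
    "3\tna: DEF:4\tE: INDEF:4",
    "4\t#: NOM:6\ts#: GEN:6",
    "5\t#: NOM:6\t#: GEN:6",
    "6:",
])

transitions, accepting = parse_fst(ROS_DESCRIPTION)
print(transitions[1])
print(accepting)
```

Splitting on tabs rather than on any whitespace matters here because the output strings themselves begin with a space (for example ‘ N UTR SG’).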
The description of finite state transducers for the five noun declensions in Swedish can
be found here:
ros.fst, flicka.fst, spik.fst, pojke.fst, bild.fst, sko.fst, byte.fst, hus.fst,
and the combined FST here:
noun.fst
Implementation
The program for transforming a noun from morphological to lexical level was
implemented in Perl and can be found here. When the user specifies a word, the
program will output the lexical description(s) of the word or else produce “Failed”.
The FST-description is supplied to the program as a command line argument.
> perl fst.pl noun.fst
Write a word and press Enter. (q = quit):
spik
spik N UTR SG INDEF NOM
Ambiguous input
The program will produce all possible morphological analyses of ambiguous word forms
like ros and hus:
ros
ros N UTR SG INDEF NOM
ros N UTR SG INDEF GEN
hus
hus N NEU SG INDEF NOM
hus N NEU SG INDEF GEN
hus N NEU PL INDEF NOM
hus N NEU PL INDEF GEN
This is done by keeping a stack of all transitions in progress. Each transition on the
stack is popped and advanced one step in the FST, and the resulting transitions are put
back on the stack. We can exemplify this by analysing the word form ‘ros’ using
the FST for ros in Table 1.
After seeing input ‘ros’ we will be in state 1. Since there has been no ambiguity so far
we only have one “transition in progress” on our stack. This transition contains three
pieces of information: the remaining input (‘#’), output so far (‘ros’) and the current
state (1):
Stack:
‘#’, ‘ros’, 1
After two E-transitions our stack still contains only one alternative:
Stack:
‘#’, ‘ros N UTR SG INDEF’, 5
We pop this transition from the stack and for each alternative path given the remaining
input ‘#’ we create a new transition and put it on the stack.
Stack:
‘’, ‘ros N UTR SG INDEF NOM’, 6
‘’, ‘ros N UTR SG INDEF GEN’, 6
In the following rounds we pop each of these transitions and find that it is in an
accepting state with no input left to process. Both analyses succeed
and their output strings are printed.
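The stack procedure walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl program from the lab: the names analyse, ROS and ACCEPTING are invented for the example, the table is the “ros” FST from Table 1, and the sketch does not guard against cycles of E-transitions (Table 1 has none).

```python
# FST for "ros" from Table 1: state -> list of (input, output, next state).
ROS = {
    0: [("ros", "ros", 1)],
    1: [("E", " N UTR SG", 2), ("or", " N UTR PL", 3)],
    2: [("en", " DEF", 4), ("E", " INDEF", 5)],
    3: [("na", " DEF", 4), ("E", " INDEF", 4)],
    4: [("#", " NOM", 6), ("s#", " GEN", 6)],
    5: [("#", " NOM", 6), ("#", " GEN", 6)],
    6: [],
}
ACCEPTING = {6}


def analyse(word, transitions=ROS, accepting=ACCEPTING):
    """Return every lexical description the FST produces for word."""
    results = []
    # Each stack entry is (remaining input, output so far, current state).
    stack = [(word + "#", "", 0)]
    while stack:
        remaining, output, state = stack.pop()
        if state in accepting and not remaining:
            results.append(output)  # accepting state and no input left
            continue
        # Push in reverse so that paths are explored in table order.
        for x, y, n in reversed(transitions.get(state, [])):
            out = output + ("" if y == "E" else y)
            if x == "E":
                stack.append((remaining, out, n))  # consumes no input
            elif remaining.startswith(x):
                stack.append((remaining[len(x):], out, n))
    return results


for word in ["ros", "rosorna", "spik"]:
    analyses = analyse(word)
    print(word, "->", analyses if analyses else "Failed")
```

With this table, ‘ros’ yields the two analyses shown earlier (NOM and GEN), ‘rosorna’ yields a single analysis, and ‘spik’ fails because the ros FST alone has no path for it.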
References
Daniel Jurafsky and James H. Martin (2000). Speech and language processing.