Natural Language Processing – Part 1: Words
Finite state morphology
Maria Holmqvist, marho@ida.liu.se

Part 1 – Designing an FST for lexical parsing of Swedish nouns

The task was to design a finite state transducer (FST) that translates Swedish noun word forms into a lexical description containing the lemma, part-of-speech category, gender, definiteness, number and case. An FST was written for each of the five declensions of Swedish:

1 flicka, ros
2 pojke, spik
3 bild, sko
4 byte
5 hus

To combine these five transducers into one, the simple procedure used here was to include all states and transitions in one large transducer for all noun declensions, giving every FST the same start and end state. This larger FST for all nouns contains 36 states.

Part 2 – Implementing the Finite state transducer

The lexical parser is made up of two components: (i) the transducer description and (ii) a program that processes an input and traverses the states of the transducer.

FST description

The description of the finite state transducer is a simplified version of the table format in Jurafsky and Martin (2000). Only the “significant” cells from the table are included in this description, i.e., only the valid transitions between states. Table 1 contains the description of the FST for inflections of the noun ros. The first column contains all states, and the rest of each row contains the alternative paths from that state. A path from a state has the format x:y:n, where x is the accepted input, y the output and n the new state. Starting at state 0, the only valid input is ‘ros’. All other inputs will result in a failed analysis. When encountering input ‘ros’, ‘ros’ will be output and we move to state 1. In state 1, the valid input is ‘or’, and so on. Compare this description with the FST figure above.

Three special symbols are used in the description. The symbol ‘E’ is used as the epsilon symbol.
If ‘E’ stands in input position, it means that we move down this path without looking at the input. If ‘E’ is in output position, nothing will be output. The ‘#’ symbol is used to denote ‘end-of-string’. The states in the first column are marked with a ‘:’ symbol if they are accepting states.

State   Legal transitions
0       ros:ros:1
1       E: N UTR SG:2     or: N UTR PL:3
2       en: DEF:4         E: INDEF:5
3       na: DEF:4         E: INDEF:4
4       #: NOM:6          s#: GEN:6
5       #: NOM:6          #: GEN:6
6:

Table 1. The FST for “ros”

The description of finite state transducers for the five noun declensions in Swedish can be found here: ros.fst, flicka.fst, spik.fst, pojke.fst, bild.fst, sko.fst, byte.fst, hus.fst, and the combined FST here: noun.fst

Implementation

The program for transforming a noun from morphological to lexical level was implemented in Perl and can be found here. When the user specifies a word, the program will output the lexical description(s) of the word or else produce “Failed”. The FST description is supplied to the program as a command line argument.

> perl fst.pl noun.fst
Write a word and press Enter. (q = quit):
spik
spik N UTR SG INDEF NOM

Ambiguous input

The program will produce all possible morphological analyses of ambiguous word forms like ros and hus:

ros
ros N UTR SG INDEF NOM
ros N UTR SG INDEF GEN
hus
hus N NEU SG INDEF NOM
hus N NEU SG INDEF GEN
hus N NEU PL INDEF NOM
hus N NEU PL INDEF GEN

This is done by keeping a stack of all transitions in progress. Each transition on the stack is processed one step forward in the FST, and the new result is put back on the stack. We can exemplify this by analysing the word form ‘ros’ using the FST for ros in Table 1. After seeing input ‘ros’ we will be in state 1. Since there has been no ambiguity so far, we only have one “transition in progress” on our stack.
This transition contains three pieces of information: the remaining input (‘#’), the output so far (‘ros’) and the current state (1):

Stack:
‘#’, ‘ros’, 1

After two E-transitions our stack still contains only one alternative:

Stack:
‘#’, ‘ros N UTR SG INDEF’, 5

We pop this transition from the stack, and for each alternative path that matches the remaining input ‘#’ we create a new transition and put it on the stack.

Stack:
‘’, ‘ros N UTR SG INDEF NOM’, 6
‘’, ‘ros N UTR SG INDEF GEN’, 6

In the next round we pop the first transition and find that the newly produced states are accepting states and that there is no input left to process. The analysis was a success and the output strings are printed.

References

Daniel Jurafsky and James H. Martin (2000). Speech and language processing.
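
Sketch of the stack-based traversal

The procedure described under “Ambiguous input” can be illustrated with a short sketch. This is a minimal Python illustration, not the Perl program itself. The dictionary encoding of Table 1, the names ROS_FST, ACCEPTING and analyse, and the choice to append ‘#’ to the input word are all assumptions made for this example.

```python
# Table 1 as a dict: state -> list of (input, output, next_state).
# "E" is the epsilon symbol and "#" marks end-of-string, as in the description.
# This encoding is an assumption; the real .fst file format may differ.
ROS_FST = {
    0: [("ros", "ros", 1)],
    1: [("E", " N UTR SG", 2), ("or", " N UTR PL", 3)],
    2: [("en", " DEF", 4), ("E", " INDEF", 5)],
    3: [("na", " DEF", 4), ("E", " INDEF", 4)],
    4: [("#", " NOM", 6), ("s#", " GEN", 6)],
    5: [("#", " NOM", 6), ("#", " GEN", 6)],
    6: [],
}
ACCEPTING = {6}  # states marked with ':' in Table 1


def analyse(word, fst=ROS_FST, accepting=ACCEPTING, start=0):
    """Return all lexical descriptions of `word` (empty list = "Failed")."""
    results = []
    # Each stack entry: (remaining input, output so far, current state).
    stack = [(word + "#", "", start)]
    while stack:
        remaining, output, state = stack.pop()
        if not remaining and state in accepting:
            results.append(output)
            continue
        # Push in reverse so that paths are popped in table order.
        for symbol, out, target in reversed(fst[state]):
            out = "" if out == "E" else out
            if symbol == "E":
                # Epsilon input: follow the path without consuming input.
                stack.append((remaining, output + out, target))
            elif remaining.startswith(symbol):
                stack.append((remaining[len(symbol):], output + out, target))
    return results


for line in analyse("ros"):
    print(line)
```

With this encoding, analyse("ros") yields the two readings from the example above, analyse("rosornas") yields a single reading, and a word that does not start with ‘ros’ yields an empty list. The sketch assumes the FST has no cycles of E-transitions, which holds for Table 1.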