Morphology

See Harald Trost “Morphology”. Chapter 2 of R Mitkov (ed.) The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP
D Jurafsky & JH Martin: Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 3 [quite technical]

Morphology - reminder
• Internal analysis of word forms
• morpheme
  – allomorphic variation
• Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
• lexeme
  – abstract notion of group of word forms that ‘belong’ together
  – lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form

Role of morphology
• Commonly made distinction: inflectional vs derivational
• Inflectional morphology is grammatical
  – number, tense, case, gender
• Derivational morphology concerns word building
  – part-of-speech derivation
  – words with related meaning

Inflectional morphology
• Grammatical in nature
• Does not carry meaning, other than grammatical meaning
• Highly systematic, though there may be irregularities and exceptions
  – Simplifies lexicon, only exceptions need to be listed
  – Unknown words may be guessable
• Language-specific and sometimes idiosyncratic
• (Mostly) helpful in parsing

Derivational morphology
• Lexical in nature
• Can carry meaning
• Fairly systematic, and predictable up to a point
  – Simplifies description of lexicon: regularly derived words need not be listed
  – Unknown words may be guessable
• But …
  – Apparent derivations have specialised meaning
  – Some derivations missing
• Languages often have parallel derivations which may be translatable

Morphological processes
• Affixes: prefix, suffix, infix, circumfix
• Vowel change (umlaut, ablaut)
• Gemination, (partial) reduplication
• Root and pattern
• Stress (or tone) change
• Sandhi

Morphophonemics
• Morphemes and allomorphs
  – eg {plur}: +(e)s, vowel change, y→ies, f→ves, um→a, ∅, ...
• Morphophonemic variation
  – Affixes and stems may have variants which are conditioned by context
    • eg +ing in lifting, swimming, boxing, raining, hoping, hopping
  – Rules may be generalisable across morphemes
    • eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses
    • Applies to both {plur} (nouns) and {3rd sing pres} (verbs)

Morphology in NLP
• Analysis vs synthesis
  – what does dogs mean? vs what is the plural of dog?
• Analysis
  – Need to identify lexeme
    • Tokenization
    • To access lexical information
  – Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
  – Morphology can be ambiguous
    • May need other process to disambiguate (eg German –en)
• Synthesis
  – Need to generate appropriate inflections from underlying representation

Morphology in NLP
• String-handling programs can be written
• More general approach
  – formalism to write rules which express correspondence between surface and underlying form (eg dogs = dog +{plur})
  – Computational algorithm (program) which can apply those rules to actual instances
  – Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis

Role of lexicon in morphology
• Rules interact with the lexicon
  – Obviously category information
    • eg rules that apply to nouns
  – Note also morphology-related subcategories
    • eg “er” verbs in French, rules for gender agreement
  – Other lexical information can impact on morphology
    • eg all fish have two forms of the plural (+s and ∅)
    • in Slavic languages case inflections differ for inanimate and animate nouns

Problems with rules
• Exceptions have to be covered
  – Including systematic irregularities
  – May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f→ves)
• Rules must not over/under-generate
  – Must cover all and only the correct cases
  – May depend
on what order the rules are applied in

Tokenization
• The simplest form of analysis is to reduce different word forms into tokens
• Also called “normalization”
• For example, if you want to count how many times a given ‘word’ occurs in a text
• Or you want to search for texts containing certain ‘words’ (e.g. Google)

Morphological processing
• Stemming
• String-handling approaches
  – Regular expressions
  – Mapping onto finite-state automata
• 2-level morphology
  – Mapping between surface form and lexical representation

Stemming
• Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
• (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form)
• Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped

Finite state automata
• A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
• A bit like a flow chart, but can be used for both recognition (analysis) and generation
• FSAs have a close relationship with “regular expressions”, a formalism for expressing strings, mainly used for searching texts, or stipulating patterns of strings

Finite state automata
• A bit like a flow chart, but can be used for both recognition and generation
• “Transition network”
• Unique start point
• Series of states linked by transitions
• Transitions represent input to be accounted for, or output to be generated
• Legal exit-point(s) explicitly identified

Example
Jurafsky & Martin, Figure 2.10
[Figure: FSA with states q0 to q4; arc labels b, a and !, including an a-loop on q3]
• Loop on q3 means that it can account for strings of any length
• “Deterministic” because in any state, its behaviour is fully predictable

Non-deterministic FSA
Jurafsky & Martin, Figures 2.18 and 2.19
[Figure: non-deterministic FSA with states q0 to q4; arc labels b, a, ! and ε]
• At state q2 with input “a” there is a choice of transitions
• We can also have “jump” arcs (or empty transitions), which also introduce non-determinism

An FSA to handle morphology
[Figure: FSA with states q0 to q7; arc labels c, f, o, x, r, y, i, e, s]
Spot the deliberate mistake: overgeneration

Finite State Transducers
• A “transducer” defines a relationship (a mapping) between two things
• Typically used for “two-level morphology”, but can be used for other things
• Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping

Finite State Transducers
• Three functions:
  – Recognizer (verification): takes a pair of strings and verifies if the FST is able to map them onto each other
  – Generator (synthesis): can generate a legal pair of strings
  – Translator (transduction): given one string, can generate the corresponding string
• Mapping usually between levels of representation
  – spy+s : spies
  – Lexical:intermediate foxNPs : fox^s
  – Intermediate:surface fox^s : foxes

Some conventions
• Symbol pairs on transitions are written with “:” (eg y:i)
• A non-changing transition “x:x” can be shown simply as “x”
• Wild-cards are shown as “@”
• Empty string shown as “ε” (some diagrams use “0”)

An example based on Trost p.42
#spy+s# : spies
  #:ε s p y:i +:e s #:ε
#toy+s# : toys
  #:ε t o y +:0 s #:ε
[Figure: further transducer paths sharing these arcs; arc labels include h, w, e, i, l, f:v, +:e, +:0, s and #:ε]

Using wild cards and loops
  #:0 s p y:i +:e s #:0
  #:0 t o y +:0 s #:0
Can be collapsed into a single FST:
[Figure: one FST with arcs #:0, an @ loop, the branches y:i +:e and y +:0, then s #:0]

Another example (J&M Fig.
3.9, p.74)
lexical:intermediate
[Figure: transducer with states q0 to q7. Arcs from q0 are labelled with regular nouns (fox, cat, dog), irregular singulars (goose, sheep, mouse) and irregular plurals (g o:e o:e s e, sheep, m o:i u:ε s:c e); the remaining arcs are N:ε, P:^ s #, S:# and P:#]

[Figure: the arc labelled fox, cat, dog between q0 and q1 expanded letter by letter through intermediate states s1 to s6]

foxNPs#:fox^s#
  [0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]
foxNS:fox#
  [0] f:f o:o x:x [1] N:ε [4] S:# [7]
catNPs#:cat^s#
  [0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]
sheepNS:sheep#
  [0] s:s h:h e:e e:e p:p [2] N:ε [5] S:# [7]
gooseNP:geese#
  [0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]
[Figure: the lexical:intermediate transducer, repeated]

Lexical:surface mapping
J&M Fig. 3.14, p.78
ε → e / {x, s, z} ^ __ s #
foxNPs#:fox^s#
catNPs#:cat^s#
[Figure: transducer for this rule with states q0 to q5; arc labels ^:ε, ε:e, z, s, x, # and other]

fox^s#:foxes#
  [0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]
cat^s#:cats#
  [0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]
[Figure: the same transducer, repeated]

FST
• But you don’t have to draw all these FSTs
• They map neatly onto rule formalisms
• What is more, these can be generated automatically
• Therefore, slightly different formalism

FST compiler
http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html

  [d o g N P .x. d o g s] |
  [c a t N P .x. c a t s] |
  [f o x N P .x. f o x e s] |
  [g o o s e N P .x. g e e s e]

  s0: c -> s1, d -> s2, f -> s3, g -> s4.
  s1: a -> s5.
  s2: o -> s6.
  s3: o -> s7.
  s4: <o:e> -> s8.
  s5: t -> s9.
  s6: g -> s9.
  s7: x -> s10.
  s8: <o:e> -> s11.
  s9: <N:s> -> s12.
  s10: <N:e> -> s13.
  s11: s -> s14.
  s12: <P:0> -> fs15.
  s13: <P:s> -> fs15.
  s14: e -> s16.
  fs15: (no arcs)
  s16: <N:0> -> s12.

[Figure: network diagram of the compiled FST, with arcs c, d, f, g leaving s0 for s1 to s4]
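
The state list on the FST compiler slide (s0 to s16, with fs15 as the final state) can be run by hand in either direction, which illustrates why one set of rules can serve both analysis and synthesis. The following is a minimal Python sketch, not the Xerox compiler: the names ARCS, FINAL and transduce are hypothetical, the arcs are transcribed from the slide's listing, N and P are treated as single symbols, and the empty string "" stands for the 0 in pairs such as <P:0>.

```python
# Arcs transcribed from the state list on the FST compiler slide.
# Each arc is (upper symbol, lower symbol, next state); "" is the empty
# symbol written as 0 on the slide. All names here are illustrative.
ARCS = {
    "s0": [("c", "c", "s1"), ("d", "d", "s2"), ("f", "f", "s3"), ("g", "g", "s4")],
    "s1": [("a", "a", "s5")],
    "s2": [("o", "o", "s6")],
    "s3": [("o", "o", "s7")],
    "s4": [("o", "e", "s8")],
    "s5": [("t", "t", "s9")],
    "s6": [("g", "g", "s9")],
    "s7": [("x", "x", "s10")],
    "s8": [("o", "e", "s11")],
    "s9": [("N", "s", "s12")],
    "s10": [("N", "e", "s13")],
    "s11": [("s", "s", "s14")],
    "s12": [("P", "", "fs15")],
    "s13": [("P", "s", "fs15")],
    "s14": [("e", "e", "s16")],
    "fs15": [],
    "s16": [("N", "", "s12")],
}
FINAL = {"fs15"}


def transduce(symbols, direction="down"):
    """Return every string the network pairs with `symbols`.

    direction="down": read the upper side, write the lower side (synthesis).
    direction="up":   read the lower side, write the upper side (analysis).
    """
    results = []

    def walk(state, pos, out):
        # Accept only when all input is consumed in a final state.
        if pos == len(symbols) and state in FINAL:
            results.append("".join(out))
        for upper, lower, nxt in ARCS[state]:
            src, dst = (upper, lower) if direction == "down" else (lower, upper)
            if src == "":
                # Empty symbol on the input side: follow the arc without consuming.
                walk(nxt, pos, out + [dst])
            elif pos < len(symbols) and symbols[pos] == src:
                walk(nxt, pos + 1, out + [dst])

    walk("s0", 0, [])
    return results


print(transduce(list("catNP")))        # ['cats']
print(transduce(list("gooseNP")))      # ['geese']
print(transduce(list("foxes"), "up"))  # ['foxNP']
print(transduce(list("foxs"), "up"))   # []
```

The same table drives both calls; only the choice of which side of each pair is read changes. The empty result for "foxs" shows that this small network does not overgenerate for the words it covers.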