Morphology

See:
Harald Trost, “Morphology”, Chapter 2 of R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: OUP (2004).
D. Jurafsky & J.H. Martin, Speech and Language Processing. Upper Saddle River, NJ: Prentice Hall (2000), Chapter 3 [quite technical].
Morphology - reminder
• Internal analysis of word forms
• morpheme – allomorphic variation
• Words usually consist of a root plus affix(es),
though some words can have multiple roots, and
some can be single morphemes
• lexeme – abstract notion of group of word forms
that ‘belong’ together
– lexeme ~ root ~ stem ~ base form ~ dictionary
(citation) form
Role of morphology
• Commonly made distinction: inflectional vs
derivational
• Inflectional morphology is grammatical
– number, tense, case, gender
• Derivational morphology concerns word
building
– part-of-speech derivation
– words with related meaning
Inflectional morphology
• Grammatical in nature
• Does not carry meaning, other than grammatical
meaning
• Highly systematic, though there may be
irregularities and exceptions
– Simplifies lexicon, only exceptions need to be listed
– Unknown words may be guessable
• Language-specific and sometimes idiosyncratic
• (Mostly) helpful in parsing
Derivational morphology
• Lexical in nature
• Can carry meaning
• Fairly systematic, and predictable up to a point
– Simplifies description of lexicon: regularly derived
words need not be listed
– Unknown words may be guessable
• But …
– Apparent derivations have specialised meaning
– Some derivations missing
• Languages often have parallel derivations which
may be translatable
Morphological processes
• Affixes: prefix, suffix, infix, circumfix
• Vowel change (umlaut, ablaut)
• Gemination, (partial) reduplication
• Root and pattern
• Stress (or tone) change
• Sandhi
Morphophonemics
• Morphemes and allomorphs
– eg {plur}: +(e)s, vowel change, y→ies, f→ves, um→a, ∅, ...
• Morphophonemic variation
– Affixes and stems may have variants which are
conditioned by context
• eg +ing in lifting, swimming, boxing, raining, hoping, hopping
– Rules may be generalisable across morphemes
• eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses
• Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
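The +(e)s generalisation above is regular enough to state as a short rule. A minimal sketch in Python (the function name and suffix list are my assumptions; vowel-change plurals and o→oes cases like tomatoes are deliberately not covered):

```python
import re

def add_s_morpheme(stem):
    """Pick the +(e)s allomorph from the stem's final segment(s).

    The same rule serves {plur} on nouns (cats, boxes)
    and {3rd sing pres} on verbs (walks, catches).
    """
    # Sibilant-final stems take +es; everything else takes +s
    if re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"   # boxes, matches, dishes, buses
    return stem + "s"        # cats
```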
Morphology in NLP
• Analysis vs synthesis
– what does dogs mean? vs what is the plural of dog?
• Analysis
– Need to identify lexeme
• Tokenization
• To access lexical information
– Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
– Morphology can be ambiguous
• May need other process to disambiguate (eg German –en)
• Synthesis
– Need to generate appropriate inflections from
underlying representation
Morphology in NLP
• String-handling programs can be written
• More general approach
– formalism to write rules which express
correspondence between surface and
underlying form (eg dogs = dog +{plur})
– Computational algorithm (program) which can
apply those rules to actual instances
– Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis
Role of lexicon in morphology
• Rules interact with the lexicon
– Obviously category information
• eg rules that apply to nouns
– Note also morphology-related subcategories
• eg “er” verbs in French, rules for gender agreement
– Other lexical information can impact on morphology
• eg all fish have two forms of the plural (+s and ∅)
• in Slavic languages, case inflections differ for animate and inanimate nouns
Problems with rules
• Exceptions have to be covered
– Including systematic irregularities
– May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f→ves)
• Rules must not over- or under-generate
– Must cover all and only the correct cases
– May depend on the order in which the rules are applied
Tokenization
• The simplest form of analysis is to reduce
different word forms into tokens
• Also called “normalization”
• For example, if you want to count how
many times a given ‘word’ occurs in a text
• Or you want to search for texts containing
certain ‘words’ (e.g. Google)
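Counting ‘words’ in this sense can be sketched with a crude lowercasing tokenizer (the splitting pattern is an assumption; real tokenizers also handle apostrophes, hyphens, numbers, etc.):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on runs of non-letters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

counts = Counter(tokenize("The dog saw the dogs. The dogs barked."))
# Without morphological analysis, 'dog' and 'dogs' count as different words.
```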
Morphological processing
• Stemming
• String-handling approaches
– Regular expressions
– Mapping onto finite-state automata
• 2-level morphology
– Mapping between surface form and lexical
representation
Stemming
• Stemming is the particular case of
tokenization which reduces inflected forms
to a single base form or stem
• (Recall our discussion of stem ~ base form
~ dictionary form ~ citation form)
• Stemming algorithms are basic string-handling algorithms, which depend on rules identifying affixes that can be stripped
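A toy stemmer in this spirit (the rule list and length guard are illustrative only; this is not the Porter stemmer, and it will mis-stem many words, eg hoping → hop):

```python
def stem(word):
    """Strip one affix using an ordered rule list (most specific suffix first)."""
    rules = [("ies", "y"), ("es", ""), ("ing", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        # Length guard: leave short words like 'is' or 'sing' alone
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word
```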
Finite state automata
• A finite state automaton is a simple and intuitive
formalism with straightforward computational
properties (so easy to implement)
• A bit like a flow chart, but can be used for both
recognition (analysis) and generation
• FSAs have a close relationship with “regular
expressions”, a formalism for expressing strings,
mainly used for searching texts, or stipulating
patterns of strings
Finite state automata
• A bit like a flow chart, but can be used for
both recognition and generation
• “Transition network”
• Unique start point
• Series of states linked by transitions
• Transitions represent input to be
accounted for, or output to be generated
• Legal exit-point(s) explicitly identified
Example
Jurafsky & Martin, Figure 2.10
[FSA diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q3; accepts baa!, baaa!, baaaa!, ...]
• Loop on q3 means that it can account for infinite length strings
• “Deterministic” because in any state, its behaviour is fully predictable
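The FSA above translates directly into a transition table; a sketch in Python (state numbering follows the slide):

```python
def accepts_sheeptalk(s):
    """DFA for /baa+!/: q0 -b-> q1 -a-> q2 -a-> q3 (a-loop) -!-> q4."""
    transitions = {
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,   # the loop on q3: baa!, baaa!, baaaa!, ...
        (3, "!"): 4,
    }
    state = 0
    for ch in s:
        if (state, ch) not in transitions:
            return False          # no legal transition: reject
        state = transitions[(state, ch)]
    return state == 4             # must halt in the accepting state q4
```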
Non-deterministic FSA
Jurafsky & Martin, Figures 2.18 & 2.19
[NFA diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q2, and an ε “jump” arc from q3 back to q2]
• At state q2 with input “a” there is a choice of transitions
• We can also have “jump” arcs (or empty transitions), which also introduce non-determinism
An FSA to handle morphology
[FSA diagram: states q0–q7 with single-letter transitions over c, o, f, x, e, s, i, r, y, handling forms such as fox/foxes]
Spot the deliberate mistake: overgeneration
Finite State Transducers
• A “transducer” defines a relationship (a
mapping) between two things
• Typically used for “two-level morphology”,
but can be used for other things
• Like an FSA, but each state transition
stipulates a pair of symbols, and thus a
mapping
Finite State Transducers
• Three functions:
– Recognizer (verification): takes a pair of strings and
verifies if the FST is able to map them onto each
other
– Generator (synthesis): can generate a legal pair of
strings
– Translator (transduction): given one string, can
generate the corresponding string
• Mapping usually between levels of
representation
– spy+s : spies
– Lexical:intermediate foxNPs : fox^s
– Intermediate:surface fox^s : foxes
Some conventions
• Transitions are marked by “:”
• A non-changing transition “x:x” can be
shown simply as “x”
• Wild-cards are shown as “@”
• Empty string shown as “ε”
An example
based on Trost p.42
#spy+s# : spies — #:ε s p y:i +:e s #:ε
#toy+s# : toys — #:ε t o y +:0 s #:ε
[the diagram also includes f:v paths for wife/wives-type plurals]
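To make the pair notation concrete, each path can be stored as a list of (lexical, surface) symbol pairs and run in translator mode. A sketch (the data structures and the use of "" for ε/0 are my assumptions, not Trost's notation):

```python
# (lexical, surface) symbol pairs; "" stands for epsilon / 0
SPY_PATH = [("#", ""), ("s", "s"), ("p", "p"), ("y", "i"),
            ("+", "e"), ("s", "s"), ("#", "")]
TOY_PATH = [("#", ""), ("t", "t"), ("o", "o"), ("y", "y"),
            ("+", ""), ("s", "s"), ("#", "")]

def translate(path, lexical):
    """Translator mode: map a lexical string along the path, or None on failure."""
    if len(lexical) != len(path):
        return None
    out = []
    for (lex_sym, surf_sym), ch in zip(path, lexical):
        if ch != lex_sym:
            return None               # path does not recognize this input
        out.append(surf_sym)
    return "".join(out)
```

A real FST shares states between paths rather than listing one path per word; matching against the surface side instead would give the recognizer direction.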
Using wild cards and loops
#:0 s p y:i +:e s #:0
#:0 t o y +:0 s #:0
Can be collapsed into a single FST:
[#:0, then a loop over the wild card @, then either y:i +:e s #:0 or y +:0 s #:0]
Another example (J&M Fig. 3.9, p.74)
lexical:intermediate
[FST diagram: from q0, regular stems (fox, cat, dog) lead to q1; irregular singular stems (goose, sheep, mouse) to q2; irregular plural spellings (g o:e o:e s e; sheep; m o:i u:ε s:c e) to q3. N:ε arcs lead on to q4, q5, q6 respectively; from there S:# or P:^ s # (from q4), S:# (from q5), and P:# (from q6) reach the final state q7]
[The same network with the stem arcs from q0 to q1 spelled out letter by letter through intermediate states s1–s6: f-o-x, c-a-t, d-o-g]
[0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]
[0] f:f o:o x:x [1] N:ε [4] S:# [7]
[0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]
[0] s:s h:h e:e p:p [2] N:ε [5] S:# [7]
[0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]
foxNPs# : fox^s#
foxNS : fox#
catNPs# : cat^s#
sheepNS : sheep#
gooseNP : geese#
Lexical:surface mapping
J&M Fig. 3.14, p.78
ε → e / {x s z} ^ __ s #
foxNPs# : fox^s#
catNPs# : cat^s#
[FST diagram (J&M Fig. 3.14): states q0–q5 implementing the E-insertion rule, with ^:ε arcs, an ε:e insertion arc, and transitions over z, s, x, # and “other”]
[0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]
[0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]
fox^s# : foxes#
cat^s# : cats#
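The intermediate-to-surface step traced above can be sketched as a string rewrite (the regex formulation is mine, not J&M's implementation):

```python
import re

def to_surface(intermediate):
    """Apply  ε → e / {x s z} ^ __ s #,  then erase the ^ and # boundaries."""
    # Insert e between a sibilant-final stem's boundary ^ and a following s#
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    return with_e.replace("^", "").replace("#", "")
```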
FST
• But you don’t have to draw all these FSTs
• They map neatly onto rule formalisms
• What is more, these can be generated automatically from the rules
• Therefore, a slightly different formalism is used
FST compiler
http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
Input regular expression:
[d o g N P .x. d o g s] |
[c a t N P .x. c a t s] |
[f o x N P .x. f o x e s] |
[g o o s e N P .x. g e e s e]
Compiled network:
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: <o:e> -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: <o:e> -> s11.
s9: <N:s> -> s12.
s10: <N:e> -> s13.
s11: s -> s14.
s12: <P:0> -> fs15.
s13: <P:s> -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: <N:0> -> s12.