Lecture 3 Morphology: Parsing Words CS 4705

advertisement
Lecture 3
Morphology: Parsing Words
CS 4705
What is morphology?
• The study of how words are composed from
smaller, meaning-bearing units (morphemes)
– Stems: children, undoubtedly,
– Affixes (prefixes, suffixes, circumfixes, infixes)
• Immaterial
• Trying
• Gesagt
• Absobl**dylutely
– Concatenative vs. non-concatenative (e.g. Arabic rootand-pattern) morphological systems
• Agglutinative (e.g. Turkish, Japanese) vs. analytic
(e.g. Mandarin) vs inflectional systems (e.g.
English, Latin, Russian)
(English) Inflectional Morphology
• Word stem + grammatical morpheme
– Usually produces word of same class
– Usually serves a syntactic function (e.g. agreement)
like  likes or liked
bird  birds
• Nominal morphology
– Plural forms
• s or es
• Irregular forms (goose/geese)
• Mass vs. count nouns (fish/fish,email or emails?)
– Possessives (cat’s, cats’)
• Verbal inflection
– Main verbs (sleep, like, fear) verbs relatively regular
• -s, ing, ed
• And productive: Emailed, instant-messaged, faxed,
homered
• But some are not regular: eat/ate/eaten,
catch/caught/caught
– Primary (be, have, do) and modal verbs (can, will,
must) often irregular and not productive
• Be: am/is/are/were/was/been/being
– Irregular verbs few (~250) but frequently occurring
– So….English inflectional morphology is fairly easy to
model….with some special cases...
(English) Derivational Morphology
• Word stem + grammatical morpheme
– Usually produces word of different class
– More complicated than inflectional
• E.g. verbs --> nouns
– -ize verbs  -ation nouns
– generalize, realize  generalization, realization
• E.g.: verbs, nouns  adjectives
– embrace, pity embraceable, pitiable
– care, wit  careless, witless
• Example: adjective  adverb
– happy  happily
• But “rules” have many exceptions
– Less productive: *evidence-less, *concern-less, *goable, *sleep-able
– Meanings of derived terms harder to predict by rule
• clueless, careless, nerveless
Parsing
• Taking a surface input and identifying its
components and underlying structure
• Morphological parsing: parsing a word into stem
and affixes, identifying its parts and their
relationships
– Stem and features:
• goose  goose +N +SG or goose + V
• geese  goose +N +PL
• gooses  goose +V +3SG
– Bracketing: indecipherable  [in [[de [cipher]] able]]
Why parse words?
• For spell-checking
– Is muncheble a legal word?
• To identify a word’s part-of-speech (pos)
– For sentence parsing, for machine translation, …
• To identify a word’s stem
– For information retrieval
• Why not just list all word forms in a lexicon?
What do we need to build a morphological
parser?
• Lexicon: list of stems and affixes (w/
corresponding pos)
• Morphotactics of the language: model of how and
which morphemes can be affixed to a stem
• Orthographic rules: spelling modifications that
may occur when affixation occurs
– in  il in context of l (in- + legal)
Using FSAs to Represent Morphotactic
Models (given a lexicon)
• English nominal inflection
plural (-s)
reg-n
q0
q1
irreg-pl-n
irreg-sg-n
•Inputs: cats, goose, geese
q2
• Derivational morphology: adjective fragment
adj-root1
unq0

-er, -ly, -est
q1
q2
adj-root1
q3
q5
q4
-er, -est
adj-root2
• Adj-root1: clear, happy, real
• Adj-root2: big, red
FSAs can also represent the Lexicon
• Expand each non-terminal arc in the previous FSA
into a sub-lexicon FSA (e.g. adj_root2 = {big,
red}) and then expand each of these stems into its
letters (e.g. red  r e d) to get a recognizer for
e
adjectives
r
unq0
q1
q2
q3
b
d
q4
i
q5
g
q6
q7
-er, -est
But…..
• Covering the whole lexicon this way will require
very large FSAs with consequent search and
maintenance problems
– Adding new items to the lexicon means recomputing
the whole FSA
– Non-determinism
• FSAs tell us whether a word is in the language or
not – but usually we want to know more:
– What is the stem?
– What are the affixes and what sort are they?
– We used this information to recognize the word: can
we get it back?
Parsing with Finite State Transducers
• cats cat +N +PL
• Koskenniemi’s two-level morphology
– Idea: word is a relationship between lexical level (its
morphemes) and surface level (its orthography)
– Morphological parsing : find the mapping
(transduction) between lexical and surface levels
c
a
t
c
a
t
+N +PL
s
Finite State Transducers can represent this
mapping
• FSTs map between one set of symbols and another
using an FSA whose alphabet  is composed of
pairs of symbols from input and output alphabets
• In general, FSTs can be used for
– Translators (Hello:Ciao)
– Parser/generator s(Hello:How may I help you?)
– As well as Kimmo-style morphological parsing
• FST is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– : an alphabet of complex symbols, each an i/o pair s.t.
i  I (an input alphabet) and o  O (an output alphabet)
and  is in I x O
– q0: a start state
– F: a set of final states in Q {q4}
– (q,i:o): a transition function mapping Q x  to Q
– Emphatic Sheep  Quizzical Cow
a:o
b:m
a:o
a:o
!:?
q0
q1
q2
q3
q4
FST for a 2-level Lexicon
• E.g.
q0
g
c:c
q1
q4
a:a
q5
e:o
q2
q6
e:o
t:t
q7
s
Reg-n
Irreg-pl-n
Irreg-sg-n
cat
g o:e o:e s e
goose
q3
e
FST for English Nominal Inflection
+N:
reg-n
q1
q4
+PL:^s#
+SG:-#
q0 irreg-n-sg
q2 +N:
q5
irreg-n-pl
q3
q6
+N:
c
a
t
c
a
t
+SG:-#
+PL:-s#
+N +PL
s
q7
• Combining (via cascade or composition)
this FSA with FSAs for each noun type
replaces e.g. reg-n with every regular noun
representation in the lexicon (cf. J&M p.76)
– e.g. q0
Reg-noun-stem: cat
q7
Orthographic Rules and FSTs
• Define additional FSTs to implement rules such as
consonant doubling (beg  begging), ‘e’ deletion
(make  making), ‘e’ insertion (watch 
watches), etc.
Lexical
f
o
x
+N
+PL
Intermediate
f
o
x
^
s
Surface
f
o
x
e
s
#
• Note: These FSTs can be used for generation or
recognition by simply exchanging the input and
output alphabets
Summing Up
• FSTs provide a useful tool for implementing a
standard model of morphological analysis,
Kimmo’s two-level morphology
– Key is to provide an FST for each of multiple levels of
representation and then to combine those FSTs using a
variety of operators (cf AT&T FSM Toolkit and papers
by Mohri, Pereira, and Riley, e.g.
– Other (older) approaches are still widely used, e.g. the
rule-based Porter Stemmer described in J&M appendix
B
• Next time: Read Ch 4
Word Classes
• AKA morphological classes, parts-of-speech
• Closed vs. open (function vs. content) class words
– Pronoun, preposition, conjunction, determiner,…
– Noun, verb, adverb, adjective,…
How do people represent words?
• Hypotheses:
– Full listing hypothesis: words listed
– Minimum redundancy hypothesis: morphemes listed
• Experimental evidence:
– Priming experiments (Does seeing/hearing one word
facilitate recognition of another?) suggest neither
– Regularly inflected forms prime stem but not derived
forms
– But spoken derived words can prime stems if they are
semantically close (e.g. government/govern but not
department/depart)
• Speech errors suggest affixes must be represented
separately in the mental lexicon
– easy enoughly
Download