Morphology: Words and their Parts
CS 4705
Slides adapted from Jurafsky, Martin, Hirschberg and Dorr.
English Morphology


Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
We can usefully divide morphemes into two classes
– Stems: The core meaning-bearing units
– Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions
Nouns and Verbs (English)

Nouns are simple (not really)
– Markers for plural and possessive
Verbs are only slightly more complex
– Markers appropriate to the tense of the verb
Regulars and Irregulars

OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
– Mouse/mice, goose/geese, ox/oxen
– Go/went, fly/flew
The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.
Regular and Irregular Nouns and Verbs

Regulars…
– Walk, walks, walking, walked, walked
– Table, tables
Irregulars
– Eat, eats, eating, ate, eaten
– Catch, catches, catching, caught, caught
– Cut, cuts, cutting, cut, cut
– Goose, geese
Why care about morphology?
– Spelling correction: referece
– Hyphenation algorithms: refer-ence
– Part-of-speech analysis: google, googler
– Text-to-speech: grapheme-to-phoneme conversion
  – hothouse (/T/ or /D/)
– Morphology in machine translation
  – Spanish words quiero and quieres are both related to querer ‘want’
– Allows us to guess at meaning
  – ‘Twas brillig and the slithy toves…
  – Muggles moogled migwiches
Concatenative Morphology


Morpheme+Morpheme+Morpheme+…
Stems: often called lemma, base form, root, lexeme
– hope+ing --> hoping
– hop --> hopping
Affixes
– Prefixes: Antidisestablishmentarianism (anti-, dis-)
– Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
– Infixes: hingi (borrow) – humingi (borrower) in Tagalog
– Circumfixes: sagen (say) – gesagt (said) in German
What useful information does morphology give us?

Different things in different languages
– Spanish: hablo, hablaré / English: I speak, I will speak
– English: book, books / Japanese: hon, hon
Languages differ in how they encode morphological information
– Isolating languages (e.g. Cantonese) have no affixes: each word usually has 1 morpheme
– Agglutinative languages (e.g. Finnish, Turkish) build words from prefixes and suffixes added to a stem (like beads on a string), with each feature realized by a single affix, e.g. Finnish
  epäjärjestelmällistyttämättömyydellänsäkäänköhän
  ‘Wonder if he can also ... with his capability of not causing things to be unsystematic’
– Inflectional languages (e.g. English) merge different features into a single affix (e.g. ‘s’ in likes indicates both person and tense); and the same feature can be realized by different affixes
– Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb’s arguments into the verb, e.g. Western Greenlandic
  Aliikusersuillammassuaanerartassagaluarpaalli.
  aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
  entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
  'However, they will say that he is a great entertainer, but ...'
So… different languages may require very different morphological analyzers
What we want






Something to automatically do the following kinds of mappings:
Cats --> cat +N +PL
Cat --> cat +N +SG
Cities --> city +N +PL
Merging --> merge +V +Present-participle
Caught --> catch +V +Past-participle
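The mappings listed above can be sketched with a toy lookup-plus-suffix analyzer. This is only an illustration: the lexicon entries, the exact tag spellings, and the function name analyze are assumptions made here, and a real analyzer would use the finite-state machinery introduced later rather than hand-written suffix checks.

```python
# Toy morphological analyzer: a sketch, not a full system.
# STEMS, IRREGULAR, and analyze() are hypothetical names for illustration.
STEMS = {"cat": "N", "city": "N", "merge": "V", "catch": "V"}
IRREGULAR = {"caught": "catch +V +Past-participle"}


def analyze(word):
    """Return an analysis string such as 'cat +N +PL', or None if unknown."""
    w = word.lower()
    if w in IRREGULAR:                      # irregular forms are listed whole
        return IRREGULAR[w]
    if STEMS.get(w) == "N":                 # bare noun stem: singular
        return f"{w} +N +SG"
    if w.endswith("ies") and STEMS.get(w[:-3] + "y") == "N":
        return f"{w[:-3]}y +N +PL"          # cities: undo the y/ie spelling change
    if w.endswith("s") and STEMS.get(w[:-1]) == "N":
        return f"{w[:-1]} +N +PL"           # cats: strip the plural -s
    if w.endswith("ing") and STEMS.get(w[:-3] + "e") == "V":
        return f"{w[:-3]}e +V +Present-participle"  # merging: restore the deleted e
    return None


for form in ["Cats", "Cat", "Cities", "Merging", "Caught"]:
    print(form, "-->", analyze(form))
```

Every spelling change needs its own hand-written case here, which is the motivation for separating the lexicon from the orthographic rules.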
Morphology Can Help Define Word Classes

AKA morphological classes, parts-of-speech
Closed vs. open (function vs. content) class words
– Closed: pronoun, preposition, conjunction, determiner,…
– Open: noun, verb, adverb, adjective,…
Identifying word classes is useful for almost any task in NLP, from translation to speech recognition to topic detection…very basic semantics
(English) Inflectional Morphology
Word stem + grammatical morpheme --> different forms of same word
– Usually produces word of same class
– Usually serves a syntactic or grammatical function (e.g. agreement)
like --> likes or liked
bird --> birds

Nominal morphology
– Plural forms
  – -s or -es
  – Irregular forms (goose/geese)
  – Mass vs. count nouns (fish/fish(es), email or emails?)
– Possessives (cat’s, cats’)
Verbal inflection
– Main verbs (sleep, like, fear) relatively regular
  – -s, -ing, -ed
  – And productive: emailed, instant-messaged, faxed, homered
  – But some are not: eat/ate/eaten, catch/caught/caught
– Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive
  – Be: am/is/are/were/was/been/being
– Irregular verbs few (~250) but frequently occurring
Derivational Morphology

Word stem + syntactic/grammatical morpheme --> new words
– Usually produces word of different class
– Incomplete process: derivational morphs cannot be applied to just any member of a class
Verbs --> nouns
– -ize verbs --> -ation nouns
– generalize, realize --> generalization, realization
– synthesize but not *synthesization
Verbs, nouns --> adjectives
– embrace, pity --> embraceable, pitiable
– care, wit --> careless, witless
Adjective --> adverb
– happy --> happily
Process selective in unpredictable ways
– Less productive: nerveless/*evidence-less, malleable/*sleep-able, rar-ity/*rareness
– Meanings of derived terms harder to predict by rule
– clueless, careless, nerveless, sleepless
Compounding

Two base forms join to form a new word
– Bedtime, Wienerschnitzel, Rotwein
– Careful? Compound or derivation?
Morphotactics

What are the ‘rules’ for constructing a word in a given language?
– Pseudo-intellectual vs. *intellectual-pseudo
– Rational-ize vs. *ize-rational
– Cretin-ous vs. *cretin-ly vs. *cretin-acious
Semantics: In English, un- cannot attach to adjectives that already have a negative connotation:
– Unhappy vs. *unsad
– Unhealthy vs. *unsick
– Unclean vs. *undirty
Phonology: In English, -er cannot attach to words of more than two syllables
– great, greater
– Happy, happier
– Competent, *competenter
– Elegant, *eleganter
– Unruly, ?unrulier
Morphological Parsing

These regularities enable us to create software to parse words into their component parts
Morphology and FSAs

We’d like to use the machinery provided by FSAs to capture facts about morphology
• I.e., accept strings that are in the language
• And reject strings that are not
• And do it in a way that doesn’t require us to in effect list all the words in the language
What do we need to build a morphological parser?

Lexicon: list of stems and affixes (w/ corresponding p.o.s.)
Morphotactics of the language: model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs
– in --> il in context of l (in- + legal)
Most morphological phenomena can be described with regular expressions, so finite state techniques are often used to represent morphological processes
Start Simple



Regular singular nouns are ok
Regular plural nouns have an -s on the end
Irregulars are ok as is
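The three statements above can be sketched as a small FSA whose arcs are labeled with morpheme classes. The state names, the word lists, and the helper functions are assumptions made for illustration; the split on a final s is deliberately naive and ignores spelling changes such as fox/foxes.

```python
# Sketch of a noun-inflection FSA; word lists and names are illustrative.
REG_NOUN = {"cat", "dog", "table"}
IRREG_SG = {"goose", "mouse", "ox"}
IRREG_PL = {"geese", "mice", "oxen"}

# Transitions over morpheme classes: (state, arc label) -> next state
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPTING = {"q1", "q2"}


def classify(morpheme):
    """Map a morpheme to its arc label, or None if it is not in the lexicon."""
    if morpheme in REG_NOUN:
        return "reg-noun"
    if morpheme in IRREG_SG:
        return "irreg-sg-noun"
    if morpheme in IRREG_PL:
        return "irreg-pl-noun"
    if morpheme == "s":
        return "plural-s"
    return None


def run(morphemes):
    """Run the FSA over a morpheme sequence; True if it ends in an accept state."""
    state = "q0"
    for m in morphemes:
        state = TRANSITIONS.get((state, classify(m)))
        if state is None:
            return False
    return state in ACCEPTING


def accepts(word):
    """Try the whole word, then a naive stem + s split (no spelling rules)."""
    if run([word]):
        return True
    return word.endswith("s") and run([word[:-1], "s"])


print(accepts("cats"), accepts("geese"), accepts("gooses"))
```

Because no plural-s arc leaves the state reached by an irregular noun, a form like gooses is rejected without being listed anywhere.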
Simple Rules
Now Add in the Words

Derivational morphology: adjective fragment
[Figure: FSA with states q0–q5; arcs labeled un-, adj-root1, adj-root2, -er, -ly, -est. Adj-root1 is followed by -er, -ly, -est; adj-root2 only by -er, -est.]
• Adj-root1: clear, happi, real (clearly)
• Adj-root2: big, red (*bigly)
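One way to read the adjective fragment is as the recognizer sketched below. The arc layout is an assumption reconstructed from the labels that survive on the slide (un-, adj-root1, adj-root2, -er, -ly, -est); in particular, letting un- attach only to adj-root1 is assumed, not stated. The machine works on plain concatenation of morphemes, so spelling changes such as big + -er giving bigger are outside its scope, and roots keep the slide's spelling (happi).

```python
# Sketch of the adjective FSA; the arc layout is an assumption (see above).
ADJ_ROOT1 = {"clear", "happi", "real"}   # roots that also take -ly
ADJ_ROOT2 = {"big", "red"}               # roots that take only -er, -est
SUFFIXES1 = {"", "er", "ly", "est"}      # "" means the bare root is accepted
SUFFIXES2 = {"", "er", "est"}


def accepts(word):
    """True if word is (un-) adj-root1 (-er|-ly|-est) or adj-root2 (-er|-est)."""
    candidates = [(word, ADJ_ROOT1, SUFFIXES1), (word, ADJ_ROOT2, SUFFIXES2)]
    if word.startswith("un"):
        # un- is assumed to attach only to adj-root1 adjectives
        candidates.append((word[2:], ADJ_ROOT1, SUFFIXES1))
    for rest, roots, suffixes in candidates:
        for root in roots:
            if rest.startswith(root) and rest[len(root):] in suffixes:
                return True
    return False


for w in ["clearly", "unclear", "happiest", "bigly", "unbig"]:
    print(w, accepts(w))
```

Splitting the roots into two classes is what blocks *bigly while still allowing clearly.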
Parsing/Generation vs. Recognition

We can now run strings through these machines to recognize strings in the language
• Accept words that are ok
• Reject words that are not
But recognition is usually not quite what we need
• Often if we find some string in the language we might like to find the structure in it (parsing)
• Or we have some structure and we want to produce a surface form (production/generation)
Example
• From “cats” to “cat +N +PL”
Finite State Transducers

The simple story
• Add another tape
• Add extra symbols to the transitions
• On one tape we read “cats”, on the other we write “cat +N +PL”
Applications


The kind of parsing we’re talking about is normally called morphological analysis
It can either be
• An important stand-alone component of an application (spelling correction, information retrieval)
• Or simply a link in a chain of processing

FSTs
Kimmo Koskenniemi’s two-level morphology
Idea: word is a relationship between lexical level (its morphemes) and surface level (its orthography)
Transitions
c:c   a:a   t:t   +N:ε   +PL:s
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
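The transition pairs above can be sketched as a single-path transducer that runs in either direction. The names ARCS, EPS, and transduce are illustrative assumptions, not a library API, and a real FST would have branching states rather than one fixed path.

```python
# Sketch of a one-path FST relating "cat +N +PL" and "cats".
EPS = ""  # epsilon: nothing is read or written on that tape

# Each arc is (lexical symbol, surface symbol), listed in path order.
ARCS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]


def transduce(symbols, arcs=ARCS, invert=False):
    """Follow the path, reading one side of each arc and writing the other.

    invert=False reads the lexical tape and writes the surface tape.
    invert=True reads the surface tape and writes the lexical tape.
    Returns the list of output symbols, or None if the input is rejected.
    """
    out, i = [], 0
    for lex, surf in arcs:
        read, write = (surf, lex) if invert else (lex, surf)
        if read != EPS:                     # epsilon arcs consume no input
            if i >= len(symbols) or symbols[i] != read:
                return None
            i += 1
        if write != EPS:                    # epsilon arcs produce no output
            out.append(write)
    return out if i == len(symbols) else None


print(transduce(["c", "a", "t", "+N", "+PL"]))   # lexical to surface
print(transduce(list("cats"), invert=True))      # surface to lexical
```

The same arc table serves both directions; only the choice of which side is read changes.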
Typical Uses



Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
And we’ll write to the second tape using the other symbols on the transitions.
In general, FSTs can be used for
– Translators (Hello:Ciao)
– Parser/generators (Hello:How may I help you?)
– As well as Kimmo-style morphological parsing
Ambiguity

Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
• Didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result
Ambiguity

What’s the right parse (segmentation) for
• Unionizable
• Union-ize-able
• Un-ion-ize-able
Each represents a valid path through the derivational morphology machine.
Ambiguity

There are a number of ways to deal with this problem
• Simply take the first output found
• Find all the possible outputs (all paths) and return them all (without choosing)
• Bias the search so that only one or a few likely paths are explored
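The "return all paths" option can be sketched as an exhaustive segmenter over a toy morpheme inventory. The inventory and the function name segmentations are assumptions for illustration; the key "iz" stands in for -ize before a vowel-initial suffix so that no spelling rule is needed, and no morphotactic checks are applied.

```python
# Enumerate every segmentation of a word over a toy morpheme inventory.
# Keys are surface spellings, values are the morphemes they realize.
MORPHEMES = {"un": "un", "union": "union", "ion": "ion", "iz": "ize", "able": "able"}


def segmentations(word):
    """Return all ways to split word into known morphemes (all paths)."""
    if not word:
        return [[]]                          # one way to segment nothing
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in MORPHEMES:
            for rest in segmentations(word[end:]):
                results.append([MORPHEMES[piece]] + rest)
    return results


for parse in segmentations("unionizable"):
    print("-".join(parse))
```

Taking the first element of the returned list corresponds to the "first output found" option; biasing the search would mean ordering or pruning the candidate pieces, for example by frequency.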
The Gory Details

Of course, it’s not as easy as
• “cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
• Cats vs Dogs
• Fox and Foxes
Multi-Tape Machines


To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
Generativity



Nothing really privileged about the directions.
We can write from one and read from the other or vice-versa.
One way is generation, the other way is analysis
Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface

The “add an e” rule, as in fox^s# <-> foxes#
Foxes
Note


A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.
Meaning that they are written out unchanged to the output tape.
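The e-insertion rule and its pass-through behavior can be sketched with a regular-expression rewrite. The trigger set (x, s, z) is an assumption following the usual textbook statement of the rule, and erasing the boundary symbols in the same function is a simplification made here; to_surface is a hypothetical name.

```python
import re

# Sketch of the e-insertion spelling rule on intermediate forms.
# "^" marks a morpheme boundary and "#" a word boundary, as in fox^s#.
E_INSERTION = re.compile(r"([xsz])\^(?=s#)")


def to_surface(intermediate):
    """Insert e after x, s, or z before a final s, then drop boundary markers."""
    with_e = E_INSERTION.sub(r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "dog#"]:
    print(form, "-->", to_surface(form))
```

Forms that do not match the rule's context, such as cat^s#, pass through with only their boundary symbols removed.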
Overall Scheme

We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
• Lexical level to intermediate forms
We have a larger set of machines that capture orthographic/spelling rules.
• Intermediate forms to surface forms
Overall Scheme
Cascades

This is a scheme that we’ll see again and again.
• Overall processing is divided up into distinct rewrite steps
• The output of one layer serves as the input to the next
• The intermediate tapes may or may not wind up being useful in their own right
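A cascade of this kind can be sketched as two functions composed in sequence, one per level. All tables and function names below are illustrative assumptions, the spelling stage covers only e-insertion after x, s, z, and plain functions stand in for transducers.

```python
import re

# Sketch of a two-stage cascade: lexical -> intermediate -> surface.
IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "ox": "oxen"}


def lexical_to_intermediate(lexical):
    """'fox +N +PL' becomes 'fox^s#'; irregular plurals are looked up whole."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    if tags == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported analysis: {lexical}")


def intermediate_to_surface(intermediate):
    """Apply e-insertion after x, s, z, then erase the boundary symbols."""
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical):
    """Run the cascade: the output of one stage is the input to the next."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for analysis in ["fox +N +PL", "cat +N +PL", "goose +N +PL", "cat +N +SG"]:
    print(analysis, "-->", generate(analysis))
```

The intermediate string (for example fox^s#) is visible between the two calls, which is the sense in which an intermediate tape can be inspected in its own right.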
Porter Stemmer (1980)

Used for tasks in which you only care about the stem
– IR, modeling given/new distinction, topic detection, document similarity
Lexicon-free morphological analysis
Cascaded rewrite rules (e.g. misunderstanding --> misunderstand --> understand --> …)
Easily implemented as an FST with rules e.g.
– ATIONAL --> ATE
– ING --> ε
Not perfect ….
– Doing --> doe
– Policy --> police
Does stemming help?
– IR, little
– Topic detection, more
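The flavor of lexicon-free suffix rewriting can be sketched with just the two rules listed above. This is not the Porter algorithm itself, which has many more rules and conditions on the remaining stem; the rule table, the minimum-length check, and the name stem are assumptions made for illustration.

```python
# Toy suffix-rewrite cascade using only the two rules shown on the slide.
RULES = [
    ("ational", "ate"),   # ATIONAL --> ATE
    ("ing", ""),          # ING --> ε
]


def stem(word):
    """Apply each rule in order; a rule rewrites a matching suffix."""
    w = word.lower()
    for suffix, replacement in RULES:
        # Require at least two letters to remain so very short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            w = w[: len(w) - len(suffix)] + replacement
    return w


for word in ["relational", "misunderstanding", "sing", "bring"]:
    print(word, "-->", stem(word))
```

With no lexicon to consult, a word like bring is stripped just like a real -ing form, which is the kind of error the slide's "Not perfect" examples point to.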
Summing Up




FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
Next time:
– Read Ch 4
HW1 assigned; see web page: http://www.cs.columbia.edu/~kathy/NLP