Morphology: Words
and their Parts
CS 4705
Basic Uses of Morphology
• The study of how words are composed from
smaller, meaning-bearing units (morphemes)
• Applications:
– Spelling correction: referece
– Hyphenation algorithms: refer-ence
– Part-of-speech analysis: googler
– Text-to-speech: grapheme-to-phoneme
conversion
• hothouse (/T/ or /D/)
– Speech recognition: phoneme-to-grapheme
conversion
– Amusing poetry and artificial languages in
standardized tests
• ‘Twas brillig and the slithy toves…
• Muggles moogled migwiches
What is a word?
• In formal languages, words are arbitrary strings
• In natural languages, words are made up of
meaningful subunits called morphemes
– Allows for productivity: googled, texted
– Abstract concepts denoting entities or
relationships in the world
• Roots +
• Syntactic or grammatical elements
– Realizations of morphemes: morphs
• Door realizes door; take and took realize take
• Allomorphs are classes of related morphs that realize a given
morpheme
– Allomorphs of s include en, men, es in English
– Take and took are allomorphs of take
– Sum: Morpheme [s] is realized by an allomorph class that
includes the related morphs {en,men,es}
– Syntactic or grammatical morphemes can convey many things
– In Italian, mark nouns for gender and number
        Singular    Plural
Masc    pomodoro    pomodori
Fem     cipolla     cipolle
pomodor- cipoll-: stems, may or may not occur on their own as words
– Stem may not occur as a word: derivative/deriv
– Base form (lemma) occurs as word: derivative/derive
– Sometimes the same: cars has stem ‘car’ and base form or lemma
‘car’ too
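The Italian examples above can be read as a stem plus an ending chosen by gender and number. The following is a minimal sketch of that idea; the names `ENDINGS` and `inflect_noun` are hypothetical, and the ending table covers only the two noun classes shown on the slide, not Italian nouns in general.

```python
# Endings for the two noun classes shown above (an illustrative subset only).
ENDINGS = {
    ("masc", "sg"): "o",
    ("masc", "pl"): "i",
    ("fem", "sg"): "a",
    ("fem", "pl"): "e",
}

def inflect_noun(stem, gender, number):
    """Attach the gender/number ending to a bare stem such as 'pomodor'."""
    return stem + ENDINGS[(gender, number)]

for stem, gender in [("pomodor", "masc"), ("cipoll", "fem")]:
    print(inflect_noun(stem, gender, "sg"), inflect_noun(stem, gender, "pl"))
```

Note that a single ending encodes both gender and number at once, which is why the table is keyed on the pair rather than on each feature separately.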
What useful information does morphology give us?
• Different things in different languages
– Spanish: hablo, hablaré/ English: I speak, I will speak
– English: book, books/ Japanese: hon, hon
• Languages differ in how they encode morphological
information
– Isolating languages (e.g. Cantonese) have no affixes:
each word usually has 1 morpheme
– Agglutinative languages (e.g. Finnish, Turkish) are
composed of prefixes and suffixes added to a stem (like
beads on a string) – each feature realized by a single
affix, e.g. Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
‘Wonder if he can also ... with his capability of not causing things
to be unsystematic’
– Inflectional languages (e.g. English) merge different
features into a single affix (e.g. ‘s’ in likes indicates
both person and tense); and the same feature can be
realized by different affixes
– Polysynthetic languages (e.g. Inuit languages) express
much of their syntax in their morphology, incorporating
a verb’s arguments into the verb, e.g. Western
Greenlandic
Aliikusersuillammassuaanerartassagaluarpaalli.
aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REPFUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
'However, they will say that he is a great entertainer, but ...'
– So….different languages may require very different
morphological analyzers
Morphology Can Help Define Word Classes
• AKA morphological classes, parts-of-speech
• Closed vs. open (function vs. content) class words
– Pronoun, preposition, conjunction,
determiner,…
– Noun, verb, adverb, adjective,…
• Identifying word classes is useful for almost any
task in NLP, from translation to speech recognition
to topic detection…very basic semantics
(English) Inflectional Morphology
Word stem + grammatical morpheme --> different
forms of same word
– Usually produces word of same class
– Usually serves a syntactic or grammatical
function (e.g. agreement)
like  likes or liked
bird  birds
• Nominal morphology
– Plural forms
• s or es
• Irregular forms (goose/geese)
• Mass vs. count nouns (fish/fish(es), email or emails?)
– Possessives (cat’s, cats’)
• Verbal inflection
– Main verbs (sleep, like, fear) relatively regular
• -s, ing, ed
• And productive: emailed, instant-messaged, faxed, homered
• But some are not:
– eat/ate/eaten, catch/caught/caught
– Primary (be, have, do) and modal verbs (can,
will, must) often irregular and not productive
» Be: am/is/are/were/was/been/being
– Irregular verbs few (~250) but frequently occurring
• In English, particles occur in only one form:
– Prepositions: to, from
– Adverbs: happily, quickly
– Conjunctions: but, and
– Articles: the, a, an
– Japanese?
• So….English inflectional morphology is fairly
easy to model….with some special cases...
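One way to picture "fairly easy to model, with some special cases" is a few regular suffix rules plus an exception table that is consulted first. The sketch below is an assumption-laden illustration, not a complete model: the function name `inflect_verb`, the form labels, and the tiny `IRREGULAR` table are hypothetical, and spelling changes such as consonant doubling or y/ie alternation are deliberately left out.

```python
# Exception table: consulted before any regular rule applies.
IRREGULAR = {
    "eat": {"past": "ate", "past_part": "eaten"},
    "catch": {"past": "caught", "past_part": "caught"},
}

def inflect_verb(lemma, form):
    """form is one of '3sg', 'prog', 'past', 'past_part'."""
    if form in IRREGULAR.get(lemma, {}):
        return IRREGULAR[lemma][form]
    if form == "3sg":
        # -es after sibilant-like spellings, otherwise -s
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            return lemma + "es"
        return lemma + "s"
    if form == "prog":
        # drop a final silent e before -ing (like --> liking), but keep 'ee'
        if lemma.endswith("e") and not lemma.endswith("ee"):
            return lemma[:-1] + "ing"
        return lemma + "ing"
    if form in ("past", "past_part"):
        return lemma + "d" if lemma.endswith("e") else lemma + "ed"
    raise ValueError("unknown form: " + form)

for verb in ["like", "email", "fax", "homer", "eat", "catch"]:
    print(verb, [inflect_verb(verb, f) for f in ("3sg", "prog", "past", "past_part")])
```

Because the regular rules apply to any lemma not in the table, new verbs such as email or fax are covered without extra work, which is the sense in which the regular pattern is productive.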
Derivational Morphology
• Word stem + syntactic/grammatical morpheme -->
new words
– Usually produces word of different class
– Incomplete process: derivational morphs cannot
be applied to just any member of a class
• Verbs --> nouns
– -ize verbs --> -ation nouns
– generalize, realize --> generalization, realization
– synthesize but no synthesization
• Verbs, nouns --> adjectives
– embrace, pity --> embraceable, pitiable
– care, wit --> careless, witless
• Adjective --> adverb
– happy --> happily
• Process selective in unpredictable ways
– Less productive: nerveless/*evidence-less,
malleable/*sleep-able, rar-ity/*rareness
– Meanings of derived terms harder to predict by
rule
• clueless, careless, nerveless, sleepless
• Derivation can be applied recursively:
– Hospital  hospitalize  hospitalization 
prehospitalization  …
– Morphological analysis identifies concatenative
processes as well as morphemes
[pre[[[hospital]ize]ation]]
– But there are bracketing paradoxes
unhappier
[un[happier]]: not happier
[[unhappy]er]: more unhappy
Compounding
• Two base forms join to form a new word
– Bedtime, Wienerschnitzel, Rotwein
– Careful? Compound or derivation?
Affixes can be attached to stems in different ways
– Prefixation
• Immaterial
– Suffixation: more common across languages
than prefixation
• Trying
– Circumfixation: combine prefixation and
suffixation
• Gesagt
– Infixation
• English: Absobl**dylutely
• Bontoc: ‘um’ turns adjectives and nouns into verbs
(kilad (red) --> kumilad (to be red))
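The four attachment types above differ only in where the affix material lands relative to the stem. A minimal sketch as string operations follows; the function names are hypothetical, and placing the infix after the first character is an assumption generalized from the single Bontoc example, not a full statement of the rule.

```python
def prefix(affix, stem):
    return affix + stem

def suffix(stem, affix):
    return stem + affix

def circumfix(before, stem, after):
    # One morpheme realized as two pieces around the stem.
    return before + stem + after

def infix_after_first(stem, affix):
    # Assumption: the infix goes right after the first character of the stem.
    return stem[:1] + affix + stem[1:]

print(prefix("im", "material"))
print(suffix("try", "ing"))
print(circumfix("ge", "sag", "t"))
print(infix_after_first("kilad", "um"))
```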
Concatenative vs. Non-concatenative Morphology
• Semitic root-and-pattern morphology
– Root (2-4 consonants) conveys basic semantics
(e.g. Arabic /ktb/)
– Vowel pattern conveys voice and aspect
– Derivational template (binyan) identifies word
class
Template    Vowel Pattern           Gloss
            active      passive
CVCVC       katab       kutib       write
CVCCVC      kattab      kuttib      cause to write
CVVCVC      ka:tab      ku:tib      correspond
tVCVVCVC    taka:tab    tuku:tib    write each other
nCVVCVC     nka:tab     nku:tib     subscribe
CtVCVC      ktatab      ktutib      write
stVCCVC     staktab     stuktib     dictate
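The /ktb/ forms listed above can be generated by interleaving three ingredients: the root consonants, a vowel pattern for the voice, and the template. The sketch below reproduces exactly those forms under stated assumptions; the function name `interdigitate` is hypothetical, the vowel melodies (all "a" for active, "u" spreading with a final "i" for passive) are read off the table, and doubling the middle consonant when the template has one extra C slot is an assumption made to cover kattab.

```python
import re

def interdigitate(root, template, voice):
    """Fill a template such as 'CVCCVC' with root consonants and voice vowels."""
    slots = re.findall(r"VV|V|C|[a-z]", template)
    consonants = list(root)
    n_c = slots.count("C")
    # Assumption: one extra C slot means the middle root consonant is doubled.
    if n_c == len(consonants) + 1:
        consonants.insert(2, consonants[1])
    if n_c != len(consonants):
        raise ValueError("root does not fit template")
    n_v = sum(1 for s in slots if s in ("V", "VV"))
    # Melodies read off the table: active a...a, passive u...u i.
    vowels = ["a"] * n_v if voice == "active" else ["u"] * (n_v - 1) + ["i"]
    c_iter, v_iter = iter(consonants), iter(vowels)
    out = []
    for slot in slots:
        if slot == "C":
            out.append(next(c_iter))
        elif slot == "V":
            out.append(next(v_iter))
        elif slot == "VV":
            out.append(next(v_iter) + ":")  # long vowel
        else:
            out.append(slot)  # literal template material such as t, n, s
    return "".join(out)

for t in ["CVCVC", "CVCCVC", "CVVCVC", "tVCVVCVC", "nCVVCVC", "CtVCVC", "stVCCVC"]:
    print(t, interdigitate("ktb", t, "active"), interdigitate("ktb", t, "passive"))
```

The point of the exercise is that nothing here is concatenation of a stem with an affix: root, vowels, and template are separate tiers that are merged position by position.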
Morphotactics
• What are the ‘rules’ for constructing a word in a
given language?
– Pseudo-intellectual vs. *intellectual-pseudo
– Rational-ize vs *ize-rational
– Cretin-ous vs. *cretin-ly vs. *cretin-acious
• Possible ‘rules’
– Suffixes are suffixes and prefixes are prefixes
– Certain affixes attach to certain types of stems
(nouns, verbs, etc.)
– Certain stems can/cannot take certain affixes
• Semantics: In English, un- cannot attach to
adjectives that already have a negative
connotation:
– Unhappy vs. *unsad
– Unhealthy vs. *unsick
– Unclean vs. *undirty
• Phonology: In English, -er cannot attach to words
of more than two syllables
– great, greater
– Happy, happier
– Competent, *competenter
– Elegant, *eleganter
– Unruly, ?unrulier
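A phonological constraint like the one on -er can be written as a predicate that is checked before an affix is attached. The sketch below is a rough illustration only: the names `rough_syllables` and `allows_er` are hypothetical, and counting runs of vowel letters is a crude stand-in for real syllabification that will miscount some words.

```python
import re

def rough_syllables(word):
    # Crude proxy: count maximal runs of vowel letters (y treated as a vowel).
    return len(re.findall(r"[aeiouy]+", word.lower()))

def allows_er(adjective):
    """Apply the slide's rule: no -er on words of more than two syllables."""
    return rough_syllables(adjective) <= 2

for adj in ["great", "happy", "competent", "elegant", "unruly"]:
    print(adj, rough_syllables(adj), allows_er(adj))
```

Under this strict rule unruly is rejected along with competent and elegant, even though ?unrulier is only marginal, which suggests the real constraint is softer than a hard syllable count.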
Morphological Parsing
• These regularities enable us to create software to
parse words into their component parts
– Known words and new ones (e.g.
Pneumonoultramicroscopicsilicovolcanoconiosis,
Columbianize, Columbianization)
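As a rough illustration of how such a parser can handle a word it has never seen, the sketch below strips known suffixes until it reaches a stem in a small lexicon. Everything in it is an assumption made for the example: the names `LEXICON`, `SUFFIXES`, and `parse`, the three-word lexicon, and the choice to list "ization" as a surface form of ize + ation rather than modeling e-deletion as a spelling rule.

```python
# Tiny illustrative lexicon of stems (hypothetical).
LEXICON = {"columbian", "hospital", "general"}

# (surface suffix, morphemes it realizes), longest first.
SUFFIXES = [
    ("ization", ["ize", "ation"]),
    ("ize", ["ize"]),
    ("s", ["s"]),
]

def parse(word):
    """Return a list of morphemes for word, or None if no analysis is found."""
    word = word.lower()
    if word in LEXICON:
        return [word]
    for surface, morphemes in SUFFIXES:
        if word.endswith(surface) and len(word) > len(surface):
            stem_analysis = parse(word[: -len(surface)])
            if stem_analysis is not None:
                return stem_analysis + morphemes
    return None

for w in ["Columbianize", "Columbianization", "hospitalizations", "blorple"]:
    print(w, parse(w))
```

Neither Columbianize nor Columbianization is in the lexicon, yet both receive an analysis because the stem and the suffixes are known separately.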
Morphological Representations: Evidence from
Human Performance
• Hypotheses:
– Full listing hypothesis: words listed
– Minimum redundancy hypothesis:
morphemes listed
• Experimental evidence:
– Priming experiments (Does seeing/hearing one
word facilitate recognition of another?) suggest
neither
– Regularly inflected forms (e.g. cars) prime stem
(car) but not derived forms (e.g. management,
manage)
– But spoken derived words can prime stems if
they are semantically close (e.g.
government/govern but not department/depart)
• Speech errors suggest affixes must be represented
separately in the mental lexicon
– ‘easy enoughly’ for ‘easily enough’
Summing Up
• Different languages have different morphological
systems
– If we can discover how to decode such a
system, we can identify useful information
about the word class and the semantic meaning
of a word
– Morphological regularities provide basis for
building (automatic) morphological analyzers
• Next time: Read Ch 3.2-3.6
– HW1 will be assigned (check the course
syllabus and courseworks)
Announcements
• HW1 will now be due 9/25/07
• WICS lunch tomorrow at noon in the CS Lounge,
452 MUDD (rsvp to hila@cs.columbia.edu)