Computational Morphology Lecture 1

1
COMPUTATIONAL
MORPHOLOGY
(Processing words)
2
What is it?
• Morphology: the study/knowledge of structure/form
• In this case: of words
• How words are created, structured, analyzed
• Morpheme: basic meaningful unit of language
• Computational morphology: developing/using computer
applications that involve morphology
3
Computational applications
• Analysis: parse/break a word into its constituent
morphemes
• Generation: create/generate a word from its constituent
morphemes
4
Basic approaches
• Match with a lexicon
• Problem: coverage
• Cut-and-paste
• Often ad-hoc
• English: Porter’s stemming algorithm
• Only useful for uncomplicated languages
• Finite-state morphology
• Using FSA’s to transduce
• Machine learning
• Morpheme boundary identification, classification
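The first two approaches can be sketched in a few lines. The lexicon and suffix list below are hypothetical toy data, and the stripping routine is only the cut-and-paste idea in miniature, not Porter's algorithm.

```python
# Toy lexicon and suffix list: illustrative assumptions, not from the lecture.
LEXICON = {"walk", "dog", "careful", "big"}
SUFFIXES = ["ing", "ed", "ly", "s"]  # tried in this order


def analyze(word):
    """Return (stem, suffix) using lexicon lookup, then suffix stripping."""
    if word in LEXICON:
        return (word, "")
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            return (stem, suffix)
    return None  # coverage problem: the word is simply unknown


print(analyze("walked"))  # ('walk', 'ed')
print(analyze("dogs"))    # ('dog', 's')
print(analyze("skies"))   # None
```

The last call fails because plain cutting leaves the non-word "skie", which is the kind of spelling change that motivates the finite-state approach.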
5
FSA’s for morphology
6
Word classification
• Part-of-speech category
• Noun, verb, adjective, adverb, etc.
• Simple word vs. complex word
• One morpheme vs. more morphemes
• Open-class/lexical word vs.
closed-class/function(al)/stop word
• Productive/inventive use vs. restricted use
7
Morpheme classification
• Root vs. affix
• Root: word’s basic meaning morpheme
• Affix: prefix, suffix, infix, circumfix
• Base: what affixes are added to
• Free/bound morphemes
• Whether or not they can stand alone
• Lexical/grammatical morphemes
• Often want to throw out the latter (stop words)
8
Some morpheme properties
• Ambiguity
• -er: agentive suffix (e.g. singer, kicker, …)
• -er: comparative suffix (e.g. bigger, hotter)
• Productivity: how widely can it be used?
• modernize, *newize
• often interacts with other areas of linguistics
• Allomorphy: distribution of variants
• illogical, irresponsible, inappropriate, ignoble, immodest
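The allomorphy of the negative prefix in the last example can be sketched as a choice conditioned on the first letter of the base. The function name `negate` and the letter-based conditions are illustrative assumptions. The rule works on spelling rather than sound, and it does not cover *ignoble*.

```python
def negate(adjective):
    """Attach the negative prefix in-, choosing the allomorph by the next letter."""
    first = adjective[0]
    if first == "l":
        prefix = "il"
    elif first == "r":
        prefix = "ir"
    elif first in "mbp":
        prefix = "im"
    else:
        prefix = "in"
    return prefix + adjective


for base in ["logical", "responsible", "appropriate", "modest"]:
    print(negate(base))
```

The sketch models allomorphy only. It says nothing about productivity, so it will happily prefix bases that do not take in- at all.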
9
Word-structure diagrams
• Each morpheme is labelled (root, affix type, POS)
• Each step is binary (2 branches)
• Each stage should span a real word
Adv
  Pref (Deriv): un-
  Adv
    Adj
      Root (N): condition
      Suff (Deriv): -al
    Suff (Deriv): -ly
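The three conditions can be checked mechanically on a small tree. The sketch below reads the diagram as *unconditionally*, bracketed un- + ((condition + -al) + -ly). That bracketing is an assumption about the original layout, and the nested-tuple encoding is just one convenient choice.

```python
# Each branching node is (label, left, right); each leaf is (label, morpheme).
# The bracketing below is an assumed reading of the diagram.
TREE = (
    "Adv",
    ("Pref", "un-"),
    (
        "Adv",
        ("Adj", ("Root/N", "condition"), ("Suff", "-al")),
        ("Suff", "-ly"),
    ),
)


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def spell(node):
    """Concatenate the morphemes under a node, dropping hyphens."""
    if is_leaf(node):
        return node[1].strip("-")
    return "".join(spell(child) for child in node[1:])


def stages(node):
    """List (label, spanned string) for every branching node, bottom-up."""
    if is_leaf(node):
        return []
    found = []
    for child in node[1:]:
        found.extend(stages(child))
    found.append((node[0], spell(node)))
    return found


def is_binary(node):
    """Check that every non-leaf node has exactly two children."""
    if is_leaf(node):
        return True
    return len(node) == 3 and all(is_binary(child) for child in node[1:])


print(is_binary(TREE))  # True
print(stages(TREE))
# [('Adj', 'conditional'), ('Adv', 'conditionally'), ('Adv', 'unconditionally')]
```

Each stage spans a real word, which is what rules out a bracketing that would build the non-word *uncondition first.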
10
English morphology
• Pluralization of nouns
• dog+s, bat+s, walk+s
• Conjugation of verbs
• walk+0, walk+s, walk+ed, walk+ing
• Adverbialization of adjectives
• careful+ly, reckless+ly
• Other possibilities
• out+swim/eat/run, re+do/think/release,
non-negotiable/returnable
• big+er, big+est
11
English morphology (cont.)
• English is not complicated, yet nontrivial:
• skies → sky+s, flies → fly+s, keys → key+s
• forgetting → forget+ing, targeting → target+ing
• baking → bake+ing, tragically → tragic+ly
• Lots of exceptions: went → go+ed,
automata → automaton+s, child(ren),
corpora → corpus+s
• Morphological ambiguity
• axes → axe+s OR axis+s
• runs → verb OR noun
12
Portuguese morphology
• Verb conjugation
• 63 possible forms
• 3 major conjugation classes, many sub-classes
• Over 1000 (semi)productive verb endings
• Noun pluralization
• Almost as simple as English
• Adjective inflection
• Number
• Gender
13
Portuguese verb (falar)
falando
falado
falar falares falar falarmos falardes falarem
falo falas fala falamos falais falam
falava falavas falava falávamos faláveis falavam
falei falaste falou falamos falastes falaram
falara falaras falara faláramos faláreis falaram
falarei falarás falará falaremos falareis falarão
falaria falarias falaria falaríamos falaríeis falariam
fala falai
fale fales fale falemos faleis falem
falasse falasses falasse falássemos falásseis falassem
falar falares falar falarmos falardes falarem
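Because falar is regular, each row of the table is the stem fal- plus a fixed set of endings. The sketch below regenerates two rows. The tense labels are standard grammatical names added for illustration, and the function name `conjugate` is hypothetical.

```python
# Endings read off the falar table for two tenses; the tense labels are
# standard grammatical names added for illustration.
ENDINGS = {
    "present": ["o", "as", "a", "amos", "ais", "am"],
    "imperfect": ["ava", "avas", "ava", "ávamos", "áveis", "avam"],
}


def conjugate(infinitive, tense):
    """Conjugate a regular -ar verb by attaching endings to the stem."""
    if not infinitive.endswith("ar"):
        raise ValueError("only regular -ar verbs are handled")
    stem = infinitive[:-2]
    return [stem + ending for ending in ENDINGS[tense]]


print(" ".join(conjugate("falar", "present")))
# falo falas fala falamos falais falam
print(" ".join(conjugate("falar", "imperfect")))
# falava falavas falava falávamos faláveis falavam
```

Covering the other conjugation classes and their sub-classes would mean many more ending tables, which is where the count of over 1000 endings comes from.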
14
Finnish complexity
• Nouns
• Cases, number, possessive affixes
• Potentially 840 forms for each noun
• Adjectives
• As for nouns, but also comparative, superlative
• Potentially 2,520 forms for each
• Verbs
• Potentially over 10,000 forms for each
15
Complexity
• Varying degrees of morphological richness across
languages
qasuiirsarvingssarsingitluinarnarpuq
“someone did not find a completely suitable resting
place”
Dampfschiffahrtsgesellschaftsdirektorsstellvertretersgemahlin
16
English complexity (WSJ)
superconductivity's
telecommunications
misrepresentations
biotechnological
immunodeficiency
nonparticipation
responsibilities
unconstitutional
capitalizations
computerization
congressionally
discontinuation
diversification
extraordinarily
internationally
microprocessors
philosophically
disproportionately
constitutionality
superconductivity
deoxyribonucleic
mischaracterizes
pharmaceuticals'
superspecialized
administrations
cerebrovascular
confidentiality
criminalization
dispassionately
entrepreneurial
inconsistencies
liberalizations
notwithstanding
professionalism
overspecialization
counterproductive
administration's
enthusiastically
nonmanufacturing
recapitalization
unapologetically
anthropological
competitiveness
confrontational
discombobulated ?????
dissatisfaction
experimentation
instrumentation
micromanagement
pharmaceuticals
proportionately
17
Morphological constraints
• dog+s, walk+ed, big(g)+est, sight+ing+s, punish+ment+s
• *s+dog, *ed+walk, *est+big, *sight+s+ing, *punish+s+ment
• big+er, hollow+est
• *interesting+er, *ridiculous+est
18
Morphological processes
• Affixation: prefix, suffix, infix
• Interleaving (KaTaB, uKTaB)
• Cliticization (isn’t, s’appelle)
• Internal change: (sing/sang, goose/geese)
• Suppletion (irregularity): (aller/ir, be/am)
• Stress placement: implant, import, contest
• Tone placement: dà vs. dá (will spank vs. spanked)
• Reduplication
• Full: iji/ijiiji
• Partial: lakad/lalakad
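Both kinds of reduplication are easy to state as string operations. The sketch assumes that the partial pattern copies everything up to and including the first vowel, which fits lakad/lalakad but is not claimed to hold for other languages or forms.

```python
VOWELS = "aeiou"


def full_reduplication(word):
    """Copy the whole word: iji -> ijiiji."""
    return word + word


def partial_reduplication(word):
    """Copy the initial consonant(s) plus the first vowel: lakad -> lalakad."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[: i + 1] + word
    return word  # no vowel found; leave unchanged


print(full_reduplication("iji"))       # ijiiji
print(partial_reduplication("lakad"))  # lalakad
```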
19
Word formation methods
• Conversion: down (Gatorade), up (price)
• Clipping: narc, fed, bra
• Blends: Cranicot, smog, infomercial, Tôdai
• Backformation: resurrect, liposuct, orientate
• Acronyms: RAM, CD-ROM
• Coinage: teflon, kleenex, skidoo
• Proper names: curie, watt, boycott
20
Base (citation) form
• Dictionaries typically don’t contain all morphological
variants of a word
• Citation form: base form, lemma
• Languages, dictionaries differ on citation form
• Armenian: verbs listed with first person sg.
• Semitic languages: triliteral roots
• Chinese/Japanese: character stroke order
21
Areas of focus in morphology
• Derivational
• do+able, adjourn+ment, depos+ition, un+lock, teach+er
• Inflectional
• dog+s, sneez+ed
• Compounding
• overkill, BYU intramural track star
• Cliticization
• I’m, she’ll, they’ve, o’clock
22
Derivational morphology
• Changes meaning and/or category (do+able,
adjourn+ment, depos+ition, un+lock, teach+er)
• Allows leveraging words of other categories (import)
• Not very productive
• Derivational morphemes usually surround root
23
Inflectional morphology
• Does not change meaning or category (dog+s, big+er,
run+s)
• (Almost) all languages use it, but to widely varying
degrees
• Highly productive
• Outermost part of word (usually)
• Categories: number, gender, case, tense, aspect,
honorifics, etc.
24
Compounding
• N+N: streetlight, bookcase
• V+N: swearword, washcloth, scrub board
• A+N: bluebird, happy hour, high chair
• P+N: overlord, outhouse, in-law
• N+A: sky-blue, blood-red
• P+A: overripe, ingrown
• (endo/exo)centricity: (dog food, redneck)
25
Constraints on morphology
• Ordering constraint
• Derivational morphology must precede inflectional
morphology
• *neighbor+s+hood, neighbor+hood+s
• Productivity constraint
• Derivational morphology is less productive
• -ize: only certain adjectives admit this suffix; *new+ize,
modern+ize
• -ment on verbs: *arrest+ment, confine+ment
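The ordering constraint can be checked on a segmented word. The two suffix sets below are small illustrative assumptions, and the input format (morphemes joined by "+", root first) follows the notation used on the slides.

```python
# Tiny suffix inventories, assumed for illustration only.
DERIVATIONAL = {"hood", "ment", "ize"}
INFLECTIONAL = {"s", "ed", "ing"}


def obeys_ordering(analysis):
    """True if no derivational suffix follows an inflectional one."""
    _, *suffixes = analysis.split("+")
    seen_inflection = False
    for suffix in suffixes:
        if suffix in INFLECTIONAL:
            seen_inflection = True
        elif suffix in DERIVATIONAL and seen_inflection:
            return False
    return True


print(obeys_ordering("neighbor+hood+s"))  # True
print(obeys_ordering("neighbor+s+hood"))  # False
```

The productivity constraint is a different matter. It would need a list of which bases accept which derivational suffixes, and nothing in this check captures that.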
26
Constraints (cont.)
• Incompatibility of certain roots/affixes
• defend+ant, assail+ant, serv+ant
• *fight+ant, *teach+ant
• Why? Latinate vs. non-Latinate borrowings
• whit(e)+en, soft+en, mad(d)+en, liv(e)+en
• *blu(e)+en, *calm+en, *angry+en, *die+en
• Why? Final sound of monosyllabic base/root.
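One common account of the -en pattern is that the base must be monosyllabic and end in an obstruent. The sketch below tests that account on the slide's examples, using spelling as a rough proxy for sound. The obstruent letter set and the syllable counter are simplifying assumptions.

```python
VOWELS = "aeiouy"
OBSTRUENTS = set("bdfgkpstvz")  # orthographic stand-ins for obstruent sounds


def syllable_count(word):
    """Rough count: number of maximal vowel-letter runs."""
    count, in_run = 0, False
    for ch in word:
        if ch in VOWELS and not in_run:
            count += 1
        in_run = ch in VOWELS
    return count


def takes_en(base):
    """Guess whether -en can attach, using spelling as a proxy for sound."""
    core = base[:-1] if base.endswith("e") else base  # drop silent e
    return syllable_count(core) == 1 and core[-1] in OBSTRUENTS


for base in ["white", "soft", "mad", "live", "blue", "calm", "angry", "die"]:
    print(base, takes_en(base))
```

The first four bases come out True and the last four False, matching the starred and unstarred forms above. A real system would work on phonemic transcriptions rather than letters.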
27
Sample long NCs (noun compounds)
off-highway truck final drive first reduction planetary assembly
parking brake / travel stop pilot control valve pressure switch
fuel injection pump drive sprocket bearing lubrication line
left rear suspension cylinder pressure sensor circuit fault
ground-level right rear leg elevation control valve
axle wish bone ball joint king pin bolts
28
Variation: morphology
• 217 air conditioning system
24 air conditioner system
1 air condition system
• 4 air start motor
48 air starter motor
131 air starting motor
• 91 combustion gases
16 combustible gases
• 5 washer fluid
1 washing fluid
• 4 synchronization solenoid
19 synchronizing solenoid
• 85 vibration motor
16 vibrator motor
118 vibratory motor
• 1 blowby / airflow indicator
12 blowby / air flow indicator
• 18 electric system
24 electrical system
3 electronic system
1 electronics system
• 1 cooling system pressurization pump group
103 cooling system pressurizing pump group
29
Variation: word boundaries
• 11 four wheel drive
30 four-wheel drive
• 5 one half turn
34 one-half turn
• 24 one way check valve
14 one-way check valve
• 1 right hand joystick
1 right-hand joystick
• 5 anti-oxidation additive
18 antioxidation additive
• 20 inter-axle differential
1 interaxle diferential
• 35 nonferrous particles
21 non-ferrous particles
• 2 water-cooled turbocharger
4 watercooled turbocharger
• 1 air/fuel mixture
14 air-fuel mixture
• 1 rear wiper/washer switch
2 rear wiper-washer switch
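Boundary variants like these can be merged by comparing terms on a key with spaces, hyphens and slashes removed. The sketch uses three of the variant pairs and their counts from the slide. The function names are hypothetical, and the key is deliberately crude, so it does not catch spelling errors.

```python
import re
from collections import defaultdict

# Counts copied from the slide for three of the variant pairs.
COUNTS = {
    "four wheel drive": 11,
    "four-wheel drive": 30,
    "anti-oxidation additive": 5,
    "antioxidation additive": 18,
    "air/fuel mixture": 1,
    "air-fuel mixture": 14,
}


def boundary_key(term):
    """Erase spaces, hyphens and slashes so boundary variants collide."""
    return re.sub(r"[\s/-]+", "", term.lower())


def merge_variants(counts):
    """Sum the counts of all terms that share a boundary key."""
    merged = defaultdict(int)
    for term, n in counts.items():
        merged[boundary_key(term)] += n
    return dict(merged)


print(merge_variants(COUNTS))
# {'fourwheeldrive': 41, 'antioxidationadditive': 23, 'airfuelmixture': 15}
```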
30
Lushootseed examples
gwd: seated
gwdil: sit down
gwdiltxw: seat someone, marry
gwdis: sit next to someone
sxwgwdil: chair
sxwgwigwdil: little chair
gwigwdil: sit briefly
sgwigwdil: brief sitting
sgwigwdilaltxw: outhouse
gwdgwdil: sitting around
gwaadil: people sitting around
bda (child)
bibda (DIM: infant)
bdbda (DTR: children)
bibdbda (DIM+DTR: dolls,
litter)
bibibda (DTR+DIM: young ch.)
pastd: white person
papastd: (pej.)
papstd: white child/friend
paspastd: many white folks
papapstd: many white children
pastdaltxw : white man’s house
31
Computational morphology
• Processing morphological structure via computer
(parsing, generation)
• Traditional approach
• ad-hoc methods
• Cut-and-paste algorithms
• Dictionary lookup
• Inadequate for highly inflected languages
• Even statistical approaches are often of little use
• Two-level approach w/ finite-state techniques
32
The two-level model
• Each word has 2 simultaneous representations
(correspondences)
• Lexical: underlying concatenation of morphemes
• Surface: actual orthographic form
• Describe and resolve the differences between
these levels by morphological rules
• Leverages finite-state technology, formal
specification, transduction approach between
correspondences
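The idea of two simultaneous representations can be made concrete by storing a word as a sequence of lexical:surface character pairs. The alignment chosen for sky+s and the single context rule below are illustrative assumptions. This is plain Python, not PC-KIMMO rule notation.

```python
# A correspondence is a list of (lexical, surface) pairs; "0" marks a null.
# The alignment for sky+s is an illustrative assumption.
SKIES = [("s", "s"), ("k", "k"), ("y", "i"), ("+", "e"), ("s", "s")]
SKY_BAD = [("s", "s"), ("k", "k"), ("y", "i")]


def lexical(pairs):
    """Read off the lexical level, skipping nulls."""
    return "".join(lex for lex, _ in pairs if lex != "0")


def surface(pairs):
    """Read off the surface level, skipping nulls."""
    return "".join(surf for _, surf in pairs if surf != "0")


def y_to_i_ok(pairs):
    """Toy rule: the pair y:i is licensed only right before +:e."""
    for i, pair in enumerate(pairs):
        if pair == ("y", "i"):
            if i + 1 >= len(pairs) or pairs[i + 1] != ("+", "e"):
                return False
    return True


print(lexical(SKIES), surface(SKIES))  # sky+s skies
print(y_to_i_ok(SKIES))                # True
print(y_to_i_ok(SKY_BAD))              # False
```

A finite-state transducer accepts or rejects such pair sequences in one pass. That is why the same rules serve for both analysis and generation.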
33
Sample correspondences
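Lexical: #sky#
Surface: sky
Lexical: #sky+s#
Surface: skies
Lexical: #dye+ing#
Surface: dye ing
Lexical: #die+ing#
Surface: dy ing
Lexical: #uta+ma+na-ca+pjja+samacha-i+wa#
Surface: uta ma n ca pjja samach i wa
Lexical: #travaill+er#
Surface: travaill es
Lexical: #katab+at#
Surface: k t b t
Lexical: #katab+ti#
Surface: k t b t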
34
The system
• PC-Kimmo: system for two-level processing
• Distributed by SIL for fieldwork, text analysis
• Components
• Lexicons: inventory of morphemes
• Rules: specify allomorphic variants,
morphophonological interface
• Word grammar (optional): specify word-level constraints
on order, structure, co-occurrence of morpheme classes
35
Sample parses (w/ glosses)
PC-KIMMO>recognize gWEdsutudZildubut
gWE+d+s+?u+^tudZil+du+b+ut
Dub+my+Nomz+Perf+bend_over+OOC+Midd+Rfx
PC-KIMMO>recognize adsukWaxWdubs
ad+s+?u+^kWaxW+du+b+s
Your+Nomz+Perf+help+OOC+Midd+his/hers
36
Sample constituency graph
PC-KIMMO>recognize LubElEskWaxWyildutExWCEL
Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL
Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our
Word
  NWord
    VWord
      VTnsAsp
        FUT: Lu+ (Fut+)
        VWord
          VAsp0
            ANEW: bE+ (ANEW+)
            VWord
              VAsp2
                PROGRSTAT: lEs+ (ProgrStatv+)
                VWord
                  VFrame
                    VFrame
                      VFrame
                        VFrame
                          VFrame
                            VFrame
                              ROOT: ^kWaxW (help)
                            VSUFYI: +yi (+yi)
                          ACHV: +il (+il)
                        VSUFTRX: +d (+Trx)
                      VSUFRFX: +ut (+Rfx)
                    NOW: +ExW (+Incho)
    DET2: +CEL (+our)
37
Sample generation
PC-KIMMO>generate ad+^pastEd=al?txW
adpastEdal?txW
PC-KIMMO>generate ad+s+?u+^kWaxW+du+b+s
adsukWaxWdubs
PC-KIMMO>generate Lu+ad+s+al?txW
Luadsal?txW
Ladsal?txW
38
Upper Chehalis word graph
PC-KIMMO>recognize ?acqW|a?stqlsCnCsa
?ac+qW|a?=stq=ls+Cn+Csa
stative+ache=fire=head+SubjITrx1s+again
Word
  VPredFull
    VPred
      VMain2
        VMain
          ASPTENSE: ?ac+ (stative+)
          VFrame
            Root3
              Root2
                Root1
                  ROOT: qW|a? (ache)
                FSUFF: =stq (=fire)
              LSUFF: =ls (=head)
      SUBJSUFF: +Cn (+SubjITrx1s)
    ADVSUFF: +Csa (+again)
39
Traditional analysis
d/ba7riyjuiuynnveiqս
Prefix
Root
Suffix
Endings
40
Armenian word graph
Word
  NDet
    NDecl
      NBase
        ROOT: tjpax'dowt'iwn (woe_tribulation)
        PLURAL: +ny'r (+plural)
      CASE: +ov (+Inst)
    ART: +s (+1sPoss.)