1 COMPUTATIONAL MORPHOLOGY (Processing words)

2 What is it?
• Morphology: the study/knowledge of structure/form
  • In this case: of words
  • How words are created, structured, analyzed
• Morpheme: basic meaningful unit of language
• Computational morphology: developing/using computer applications that involve morphology

3 Computational applications
• Analysis: parse/break a word into its constituent morphemes
• Generation: create/generate a word from its constituent morphemes

4 Basic approaches
• Match with a lexicon
  • Problem: coverage
• Cut-and-paste
  • Often ad hoc
  • English: Porter’s stemming algorithm
  • Only useful for morphologically uncomplicated languages
• Finite-state morphology
  • Using FSAs to transduce
• Machine learning
  • Morpheme boundary identification, classification

5 FSAs for morphology

6 Word classification
• Part-of-speech category
  • Noun, verb, adjective, adverb, etc.
• Simple word vs. complex word
  • One morpheme vs. more morphemes
• Open-class/lexical word vs. closed-class/function(al)/stop word
  • Productive/inventive use vs. restricted use

7 Morpheme classification
• Root vs. affix
  • Root: word’s basic meaning morpheme
  • Affix: prefix, suffix, infix, circumfix
  • Base: what affixes are added to
• Free/bound morphemes
  • Whether or not they can stand alone
• Lexical/grammatical morphemes
  • Often want to throw out the latter (stop words)

8 Some morpheme properties
• Ambiguity
  • -er: agentive suffix (e.g. singer, kicker, …)
  • -er: comparative suffix (e.g. bigger, hotter)
• Productivity: how widely can it be used?
  • modernize, *newize
  • often interacts with other areas of linguistics
• Allomorphy: distribution of variants
  • illogical, irresponsible, inappropriate, ignoble, immodest

9 Word-structure diagrams
• Each morpheme is labelled (root, affix type, POS)
• Each step is binary (2 branches)
• Each stage should span a real word
[Diagram: word-structure tree for un-condition-al-ly. Node labels: Adv, Adv, Adj; Pref Deriv un-; Root N condition; Suff Deriv -al; Suff Deriv -ly]

10 English morphology
• Pluralization of nouns
  • dog+s, bat+s, walk+s
• Conjugation of verbs
  • walk+0, walk+s, walk+ed, walk+ing
• Adverbialization of adjectives
  • careful+ly, reckless+ly
• Other possibilities
  • out+swim/eat/run, re+do/think/release, non-negotiable/returnable
  • big+er, big+est

11 English morphology (cont.)
• English is not complicated, yet nontrivial:
  • skies → sky+s, flies → fly+s, keys → key+s
  • forgetting → forget+ing, targeting → target+ing
  • baking → bake+ing, tragically → tragic+ly
• Lots of exceptions: went → go+ed, automata → automaton+s, child(ren), corpora → corpus+s
• Morphological ambiguity
  • axes → axe+s OR axis+s
  • runs → verb OR noun

12 Portuguese morphology
• Verb conjugation
  • 63 possible forms
  • 3 major conjugation classes, many sub-classes
  • Over 1000 (semi)productive verb endings
• Noun pluralization
  • Almost as simple as English
• Adjective inflection
  • Number
  • Gender

13 Portuguese verb (falar)
falando falado
falar falares falar falarmos falardes falarem
falo falas fala falamos falais falam
falava falavas falava falávamos faláveis falavam
falei falaste falou falamos falastes falaram
falara falaras falara faláramos faláreis falaram
falarei falarás falará falaremos falareis falarão
falaria falarias falaria falaríamos falaríeis falariam
fala falai
fale fales fale falemos faleis falem
falasse falasses falasse falássemos falásseis falassem
falar falares falar falarmos falardes falarem

14 Finnish complexity
• Nouns
  • Cases, number, possessive affixes
  • Potentially 840 forms for each noun
• Adjectives
  • As for nouns, but also comparative, superlative
  • Potentially 2,520 forms for each
• Verbs
  • Potentially over 10,000 forms for each

15 Complexity
• Varying degrees of morphological richness across languages
qasuiirsarvingssarsingitluinarnarpuq “someone did not find a completely suitable resting place”
Dampfschiffahrtsgesellschaftsdirektorsstellvertretersgemahlin

16 English complexity (WSJ)
superconductivity's telecommunications misrepresentations biotechnological
immunodeficiency nonparticipation responsibilities unconstitutional
capitalizations computerization congressionally discontinuation
diversification extraordinarily internationally microprocessors
philosophically disproportionately constitutionality superconductivity
deoxyribonucleic mischaracterizes pharmaceuticals' superspecialized
administrations cerebrovascular confidentiality criminalization
dispassionately entrepreneurial inconsistencies liberalizations
notwithstanding professionalism overspecialization counterproductive
administration's enthusiastically nonmanufacturing recapitalization
unapologetically anthropological competitiveness confrontational
discombobulated ????? dissatisfaction experimentation instrumentation
micromanagement pharmaceuticals proportionately

17 Morphological constraints
• dog+s, walk+ed, big(g)+est, sight+ing+s, punish+ment+s
• *s+dog, *ed+walk, *est+big, *sight+s+ing, *punish+s+ment
• big+er, hollow+est
• *interesting+er, *ridiculous+est

18 Morphological processes
• Affixation: prefix, suffix, infix
• Interleaving (KaTaB, uKTaB)
• Cliticization (isn’t, s’appelle)
• Internal change (sing/sang, goose/geese)
• Suppletion (irregularity) (aller/ir, be/am)
• Stress placement: implant, import, contest
• Tone placement: dà vs. dá (will spank vs. spanked)
• Reduplication
  • Full: iji/ijiiji
  • Partial: lakad/lalakad

19 Word formation methods
• Conversion: down (Gatorade), up (price)
• Clipping: narc, fed, bra
• Blends: Cranicot, smog, infomercial, Tôdai
• Backformation: resurrect, liposuct, orientate
• Acronyms: RAM, cd-rom
• Coinage: teflon, kleenex, skidoo
• Proper names: curie, watt, boycott

20 Base (citation) form
• Dictionaries typically don’t contain all morphological variants of a word
• Citation form: base form, lemma
• Languages, dictionaries differ on citation form
  • Armenian: verbs listed with first person sg.
  • Semitic languages: triliteral roots
  • Chinese/Japanese: character stroke order

21 Areas of focus in morphology
• Derivational
  • do+able, adjourn+ment, depos+ition, un+lock, teach+er
• Inflectional
  • dog+s, sneez+ed
• Compounding
  • overkill, BYU intramural track star
• Cliticization
  • I’m, she’ll, they’ve, o’clock

22 Derivational morphology
• Changes meaning and/or category (do+able, adjourn+ment, depos+ition, un+lock, teach+er)
• Allows leveraging words of other categories (import)
• Not very productive
• Derivational morphemes usually surround root

23 Inflectional morphology
• Does not change meaning or category (dog+s, big+er, run+s)
• (Almost) all languages use it, but to widely varying degrees
• Highly productive
• Outermost part of word (usually)
• Categories: number, gender, case, tense, aspect, honorifics, etc.
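The English inflection examples from the earlier slides (skies → sky+s, baking → bake+ing, forgetting → forget+ing, axes → axe+s OR axis+s) can be sketched as a small lexicon-backed analyzer. This is only an illustrative sketch: the `LEXICON` contents, the function names `candidate_stems` and `analyze`, and the handful of spelling rules are assumptions made for the example, not a complete description of English.

```python
# Toy lexicon of base forms (hypothetical; a real system would load a dictionary).
LEXICON = {"sky", "fly", "key", "forget", "target", "bake", "axe", "axis", "walk", "dog"}


def candidate_stems(word, suffix):
    """Yield possible stems of `word`, undoing a few English spelling changes."""
    if suffix == "s":
        if word.endswith("ies"):
            yield word[:-3] + "y"       # skies -> sky
        if word.endswith("es"):
            yield word[:-2] + "is"      # axes -> axis
        if word.endswith("s"):
            yield word[:-1]             # keys -> key, axes -> axe
    elif suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            yield stem                  # targeting -> target
            yield stem + "e"            # baking -> bake
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                yield stem[:-1]         # forgetting -> forget


def analyze(word):
    """Return every stem+suffix analysis whose stem is in the lexicon."""
    analyses = set()
    if word in LEXICON:
        analyses.add(word)
    for suffix in ("s", "ing", "ed"):
        for stem in candidate_stems(word, suffix):
            if stem in LEXICON:
                analyses.add(f"{stem}+{suffix}")
    return sorted(analyses)


for w in ["skies", "keys", "forgetting", "targeting", "baking", "axes", "went"]:
    print(w, analyze(w))
```

In this sketch the ambiguous form axes receives two analyses, while an irregular form such as went receives none, which is why exceptions have to be listed separately in the lexicon.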
24 Compounding
• N+N: streetlight, bookcase
• V+N: swearword, washcloth, scrub board
• A+N: bluebird, happy hour, high chair
• P+N: overlord, outhouse, in-law
• N+A: sky-blue, blood-red
• P+A: overripe, ingrown
• (endo/exo)centricity: (dog food, redneck)

25 Constraints on morphology
• Ordering constraint
  • Derivational morphology must precede inflectional morphology
  • *neighbor+s+hood, neighbor+hood+s
• Productivity constraint
  • Derivational morphology is less productive
  • -ize: only certain adjectives admit this suffix; *new+ize, modern+ize
  • -ment on verbs: *arrest+ment, confine+ment

26 Constraints (cont.)
• Incompatibility of certain roots/affixes
  • defend+ant, assail+ant, serv+ant
  • *fight+ant, *teach+ant
  • Why? Latinate vs. non-Latinate borrowings
  • whit(e)+en, soft+en, mad(d)+en, liv(e)+en
  • *blu(e)+en, *calm+en, *angry+en, *die+en
  • Why? Final sound of monosyllabic base/root.

27 Sample long NCs
off-highway truck final drive first reduction planetary assembly
parking brake / travel stop pilot control valve pressure switch
fuel injection pump drive sprocket bearing lubrication line
left rear suspension cylinder pressure sensor circuit fault
ground-level right rear leg elevation control valve
axle wish bone ball joint king pin bolts

28 Variation: morphology
• 217 air conditioning system; 24 air conditioner system; 1 air condition system
• 4 air start motor; 48 air starter motor; 131 air starting motor
• 91 combustion gases; 16 combustible gases
• 5 washer fluid; 1 washing fluid
• 4 synchronization solenoid; 19 synchronizing solenoid
• 85 vibration motor; 16 vibrator motor; 118 vibratory motor
• 1 blowby / airflow indicator; 12 blowby / air flow indicator
• 18 electric system; 24 electrical system; 3 electronic system; 1 electronics system
• 1 cooling system pressurization pump group; 103 cooling system pressurizing pump group

29 Variation: word boundaries
• 11 four wheel drive; 30 four-wheel drive
• 5 one half turn; 34 one-half turn
• 24 one way check valve; 14 one-way check valve
• 1 right hand joystick; 1 right-hand joystick
• 5 anti-oxidation additive; 18 antioxidation additive
• 20 inter-axle differential; 1 interaxle diferential
• 35 nonferrous particles; 21 non-ferrous particles
• 2 water-cooled turbocharger; 4 watercooled turbocharger
• 1 air/fuel mixture; 14 air-fuel mixture
• 1 rear wiper/washer switch; 2 rear wiper-washer switch

30 Lushootseed examples
gwd: seated
gwdil: sit down
gwdiltxw: seat someone, marry
gwdis: sit next to someone
sxwgwdil: chair
sxwgwigwdil: little chair
gwigwdil: sit briefly
sgwigwdil: brief sitting
sgwigwdilaltxw: outhouse
gwdgwdil: sitting around
gwaadil: people sitting around

bda (child)
bibda (DIM: infant)
bdbda (DTR: children)
bibdbda (DIM+DTR: dolls, litter)
bibibda (DTR+DIM: young ch.)

pastd: white person
papastd: (pej.)
papstd: white child/friend
paspastd: many white folks
papapstd: many white children
pastdaltxw: white man’s house

31 Computational morphology
• Processing morphological structure via computer (parsing, generation)
• Traditional approach
  • Ad hoc methods
  • Cut-and-paste algorithms
  • Dictionary lookup
• Inadequate for highly inflected languages
• Even statistical approaches are often not useful
• Two-level approach w/ finite-state techniques

32 The two-level model
• Each word has 2 simultaneous representations (correspondences)
  • Lexical: underlying concatenation of morphemes
  • Surface: actual orthographic form
• Describe and resolve the differences between these levels by morphological rules
• Leverages finite-state technology, formal specification, transduction approach between correspondences

33 Sample correspondences
Lexical                               Surface
#sky#                                 sky
#sky+s#                               skies
#dye+ing#                             dye ing
#die+ing#                             dy ing
#uta+ma+na-ca+pjja+samacha-i+wa#      uta ma n ca pjja samach i wa
#travaill+er#                         travaill es
#katab+at#                            k t b t
#katab+ti#                            k t b t

34 The system
• PC-Kimmo: system for two-level processing
  • Distributed by SIL for fieldwork, text analysis
• Components
  • Lexicons: inventory of morphemes
  • Rules: specify allomorphic variants, morphophonological interface
  • Word grammar (optional): specify word-level constraints on order, structure, co-occurrence of morpheme classes

35 Sample parses (w/ glosses)
PC-KIMMO>recognize gWEdsutudZildubut
gWE+d+s+?u+^tudZil+du+b+ut
Dub+my+Nomz+Perf+bend_over+OOC+Midd+Rfx

PC-KIMMO>recognize adsukWaxWdubs
ad+s+?u+^kWaxW+du+b+s
Your+Nomz+Perf+help+OOC+Midd+his/hers

36 Sample constituency graph
PC-KIMMO>recognize LubElEskWaxWyildutExWCEL
Lu+bE+lEs+^kWaxW+yi+il+d+ut+ExW+CEL
Fut+ANEW+PrgSttv+help+YI+il+Trx+Rfx+Inc+our

Word
  NWord
    VWord
      VTnsAsp
        FUT  Lu+  Fut+
        VWord
          VAsp0
            ANEW  bE+  ANEW+
            VWord
              VAsp2
                PROGRSTAT  lEs+  ProgrStatv+
                VWord
                  VFrame
                    VFrame
                      VFrame
                        VFrame
                          VFrame
                            VFrame
                              ROOT  ^kWaxW  help
                            VSUFYI  +yi  +yi
                          ACHV  +il  +il
                        VSUFTRX  +d  +Trx
                      VSUFRFX  +ut  +Rfx
                    NOW  +ExW  +Incho
    DET2  +CEL  +our

37 Sample generation
PC-KIMMO>generate ad+^pastEd=al?txW
adpastEdal?txW

PC-KIMMO>generate ad+s+?u+^kWaxW+du+b+s
adsukWaxWdubs

PC-KIMMO>generate Lu+ad+s+al?txW
Luadsal?txW
Ladsal?txW

38 Upper Chehalis word graph
PC-KIMMO>recognize ?acqW|a?stqlsCnCsa
?ac+qW|a?=stq=ls+Cn+Csa
stative+ache=fire=head+SubjITrx1s+again

Word
  VPredFull
    VPred
      VMain2
        VMain
          ASPTENSE  ?ac+  stative+
          VFrame
            Root3
              Root2
                Root1
                  ROOT  qW|a?  ache
                FSUFF  =stq  =fire
              LSUFF  =ls  =head
      SUBJSUFF  +Cn  +SubjITrx1s
    ADVSUFF  +Csa  +again

39 Traditional analysis
[Armenian example word, illegible in the source, segmented as: Prefix, Root, Suffix, Endings]

40 Armenian word graph
Word
  NDet
    NDecl
      NBase
        ROOT  tjpax'dowt'iwn  woe_tribulation
        PLURAL  +ny'r  +plural
      CASE  +ov  +Inst
    ART  +s  +1sPoss.
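To make the lexical/surface correspondences of the two-level model more concrete, here is a minimal sketch that maps lexical strings such as #sky+s# to surface forms and recognizes surface forms by generate-and-test. It is not PC-KIMMO and not a true two-level transducer (whose rules apply in parallel); it swaps in ordered regular-expression rewrites for simplicity. The `ROOTS`, `SUFFIXES`, and `RULES` tables and the function names are assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon: roots and the suffixes they may take ("" = bare root).
ROOTS = ["sky", "key", "dye", "die", "bake"]
SUFFIXES = ["", "s", "ing"]

# Ordered spelling rules over lexical strings; "+" marks a morpheme boundary.
RULES = [
    (re.compile(r"ie\+ing"), "ying"),               # die+ing  -> dying
    (re.compile(r"(?<=[^aeiou])y\+s"), "ies"),      # sky+s    -> skies (but key+s -> keys)
    (re.compile(r"(?<=[^aeiouy])e\+ing"), "ing"),   # bake+ing -> baking (but dye+ing -> dyeing)
]


def generate(lexical):
    """Map a lexical form like '#sky+s#' to its surface spelling."""
    surface = lexical
    for pattern, replacement in RULES:
        surface = pattern.sub(replacement, surface)
    # Drop any remaining boundary symbols.
    return surface.replace("+", "").replace("#", "")


def recognize(surface):
    """Return every lexical form in the toy lexicon that generates `surface`."""
    results = []
    for root in ROOTS:
        for suffix in SUFFIXES:
            lexical = f"#{root}+{suffix}#" if suffix else f"#{root}#"
            if generate(lexical) == surface:
                results.append(lexical)
    return results


print(generate("#sky+s#"))    # skies
print(generate("#dye+ing#"))  # dyeing
print(recognize("dying"))     # ['#die+ing#']
```

Enumerating the whole lexicon for recognition is only workable for a toy example; a finite-state implementation runs the same rules in both directions without enumeration, which is the motivation for the transducer approach described in the slides.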