I256: Applied Natural Language Processing
Marti Hearst
Sept 11, 2006

Elements of Language
Today: Morphology
Illustration from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html

Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English than it is like French or German or Chinese or …
Why do we pretty much understand the words?

Jabberwocky Analysis
Why do we pretty much understand the words?
We recognize combinations of morphemes.
Chortled: Laugh in a breathy, gleeful way. A combination of "chuckle" and "snort." (Definition from Oxford American Dictionary)
Galumphing: Moving in a clumsy, ponderous, or noisy manner. Perhaps a blend of "gallop" and "triumph." (Definition from Oxford American Dictionary)
Activity: Make up a word whose meaning can be inferred from the morphemes that you used.

Jabberwocky Analysis
Why do we pretty much understand the words?
Surrounding English words strongly indicate the parts of speech of the nonsense words.
toves: can probably perform an action (because they did gyre and gimble)
wabe: is probably a place (they did … in the wabe)
http://assets.cambridge.org/052185/542X/excerpt/052185542X_excerpt.pdf

Jabberwocky Analysis
Surrounding English words strongly indicate the parts of speech of the nonsense words.
It’s similar in the French translation.
Example from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html

Morphology
Morphology: The study of the way words are built up from smaller meaning units.
Morphemes: The smallest meaningful units in the grammar of a language.
Contrasts:
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs.
Templatic (root-and-pattern)
A useful resource: Glossary of linguistic terms by Eugene Loos
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Modified from Dorr and Habash (after Jurafsky and Martin)

Examples (English)
“unladylike”: 3 morphemes, 4 syllables
– un- ‘not’
– lady ‘(well behaved) female adult human’
– -like ‘having the characteristics of’
– Can’t break any of these down further without distorting the meaning of the units
“technique”: 1 morpheme, 2 syllables
“dogs”: 2 morphemes, 1 syllable
– -s, a plural marker on nouns

Morpheme Definitions
Root: The portion of the word that:
– is common to a set of derived or inflected forms, if any, when all affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem: The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
Affix: A bound morpheme that is joined before, after, or within a root or stem.
Clitic: A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)

Inflectional vs. Derivational
Word Classes
– Parts of speech: noun, verb, adjective, etc.
– Word class dictates how a word combines with morphemes to form new words
Inflection: Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, nonidiosyncratic change of meaning.
– run -> runs | running | ran
Derivation: The formation of a new word or inflectable stem from another word or stem.
– compute -> computer -> computerization

Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
– come is inflected for person and number: The pizza guy comes at noon.
– las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)

Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
– computerization
– appointee
– killer
– fuzziness
Formation of adjectives (primarily from nouns):
– computational
– clueless
– embraceable
Difficult cases:
– building: from which sense of “build”?

Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
– hope+ing -> hoping
– hop+ing -> hopping
Affixes
– Prefixes: Antidisestablishmentarianism (anti-, dis-)
– Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
– Infixes: hingi (borrow) – humingi (borrower) in Tagalog
– Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
– uygarlaştıramadıklarımızdanmışsınızcasına (Turkish)
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become civilized

Templatic Morphology
Roots and Patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb: active, passive, middle voice
Example: lmd (to learn or study)
– CaCaC -> lamad (he studied)
– CiCeC -> limed (he taught)
– CuCaC -> lumad (he was taught)

Morphological Analysis Tools
Porter stemmer: A simple approach: just hack off the end of the word!
Frequently used, especially for Information Retrieval, but results are pretty ugly!

porter.demo()
Original *****************************
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , researchers reported .
Results *******************************
Pierr Vinken , 61 year old , will join the board as a nonexecut director Nov. 29 .
Mr. Vinken is chairman of Elsevi N.V. , the Dutch publish group .
Rudolph Agnew , 55 year old and former chairman of Consolid Gold Field PLC , wa name a nonexecut director of thi British industri conglomer .
A form of asbesto onc use to make Kent cigarett filter ha caus a high percentag of cancer death among a group of worker expos to it more than 30 year ago , research report .

Morphological Analysis Tools
WordNet’s morphy()
A slightly more sophisticated approach
Uses an understanding of inflectional morphology
– Uses a set of Rules of Detachment
– Uses an Exception List for irregulars
– Handles collocations in a special way
Do the transformation, then compare the result to the WordNet dictionary
If the transformation produces a real word, then keep it, else use the original word.
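The detach-and-look-up procedure just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not WordNet's implementation: the lexicon, the exception list, the rule table, and the name toy_morphy are all made up for the example, and collocation handling is left out.

```python
# Toy stand-ins for WordNet's data. Every entry below is hypothetical.
LEXICON = {"dog", "run", "corpus", "church", "fly", "hope"}

# Forms the suffix rules cannot reach are looked up directly.
EXCEPTIONS = {"corpora": "corpus", "ran": "run", "running": "run"}

# (suffix to detach, text to put in its place), tried in order.
DETACHMENT_RULES = [
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
    ("ing", ""),
    ("ing", "e"),
    ("ed", ""),
    ("ed", "e"),
]


def toy_morphy(word):
    """Return a base form if one is found in LEXICON, else the word itself."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Keep the transformation only if it produces a known word.
            if candidate in LEXICON:
                return candidate
    return word  # nothing matched: fall back to the original word


for w in ["dogs", "churches", "flies", "hoping", "corpora", "toves"]:
    print(w, "->", toy_morphy(w))
```

The dictionary check is what separates this from blind suffix stripping: "toves" is returned unchanged because neither "tov" nor "tove" is in the toy lexicon.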
For more details, see http://wordnet.princeton.edu/man/morphy.7WN.html

Some morphy() output
>>> wntools.morphy('dogs')
'dog'
>>> wntools.morphy('running', pos='verb')
'run'
>>> wntools.morphy('corpora')
'corpus'
>>>

Morphological Analysis Tools
Very sophisticated programs have been developed
Use a technique called Two-Level Phonology
Has been applied to numerous languages
Best known: PCKimmo
– After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
– Uses:
  – A rules file which specifies the alphabet and the phonological (or spelling) rules
  – A lexicon file which lists lexical items and encodes morphotactic constraints
– http://www.sil.org/pckimmo/
– NLTK-Lite has a version too (not working in my download)
Commercial versions are available
– Inxight’s LinguistX is based on technology developed by Kaplan and others from Xerox PARC (or at least used to be)

Morphological Analysis Tools
CatVar: Categorial Variation Database
“A database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants.”
Example: the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
http://clipdemos.umiacs.umd.edu/catvar

Next Time
Computing with n-grams
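
Recap sketch: Templatic Morphology
As a recap of the root-and-pattern idea from the Templatic Morphology slide, the Hebrew example can be reproduced with a small Python sketch. The function name apply_template is hypothetical, and the code assumes each 'C' in the template is a slot for the next root consonant; real templatic systems involve more than this simple slot filling.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next consonant of the root."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


# The root lmd (to learn or study) combined with the three templates from the slide.
for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

The root contributes the consonants and the basic meaning, while the template contributes the vowels and their order, which is why one root yields lamad, limed, and lumad.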