Fondements du TAL (Foundations of Natural Language Processing)
Morphology
Delphine Bernhard
Contains slides adapted from Pierre Zweigenbaum, LIMSI-CNRS

Outline
1. Word segmentation
2. Linguistic morphology
   a) Morphemes
   b) Morphological processes
3. Computational morphology
   a) Normalisation: stemming, lemmatisation
   b) Analysis: lexical databases, unsupervised segmentation, rule-based analysis and parsing
4. Applications

Levels of linguistic structure
[Figure: the levels of linguistic structure, © David Groome, 2006. Our focus today: the word level.]

1. Word segmentation

Can you read this (fast!)?
wikipédiaestunprojetdencyclopédiecollectiveétabliesurinternetuniversellemultilingueetfonctionnantsurleprincipeduwikiwikipédiaapourobjectifdoffriruncontenulibrementréutilisableneutreetvérifiablequechacunpeutéditeretaméliorer

Wikipédia est un projet d'encyclopédie collective établie sur Internet, universelle, multilingue et fonctionnant sur le principe du wiki. Wikipédia a pour objectif d'offrir un contenu librement réutilisable, neutre et vérifiable, que chacun peut éditer et améliorer.

It's only words...
- ... but what are they exactly, and how can we automatically recognise them?
- In speech, there are no obvious breaks. So how do babies learn words?
- According to Saffran et al. (1996), they use distributional cues and statistical regularities in speech.

How do we recognise words in speech? (Bauer, 1988)
- There are no gaps between words in speech: Menbecomeoldbuttheyneverbecomegood
- Thanks to our knowledge of language, we recognise certain strings of sounds/letters. For instance, we can recognise men in the previous sequence because it also comes up in sequences like:
  Menareconservativeafterdinner
  Menlosetheirtempersindefendingtheirtaste
  Afterfortymenhavemarriedtheirhabits

Learning to read is difficult for humans
- Reading disabilities: dyslexia (an inability to decode, or break down, words into phonemes), comprehension difficulties
- The invention of writing and reading is recent. Unlike speech or vision, reading is an unnatural process that has to be learned: brains are not wired to read!

For computers: characters and strings
- Control characters: end of line (\n), tabulation (\t)
- Encodings:
  - ASCII: English alphabet
  - Latin 1 (ISO-8859-1): Western European languages
  - ISO-8859-15: similar to ISO-8859-1, but replaces some less common symbols with €, Œ or œ
  - Windows-1252 (Cp1252): superset of ISO-8859-1 (includes €, Œ and œ)
  - UTF-8: can represent every character in the Unicode character set, backward-compatible with ASCII

Practical definitions of words and sentences
- Bauer (1988): a word is a unit which, in print, is bounded by spaces on both sides. We will call this an orthographic word.
- Kučera and Francis (1967): a graphic word is a string of contiguous alphanumeric characters with space on either side; it may include hyphens and apostrophes, but no other punctuation marks.
- Grefenstette and Tapanainen (1994): sentences end with punctuation.

What are the "words" and sentences here?
- Pacific Lumber Co. was trying to figure out the safest way to bring the activists down.
- He doesn't need us.
- For additional information see also http://www.limsi.fr
- New York is situated on the east coast of the United States.
- c'est-à-dire les pommes de terre des U.S.A.

Tokenisation
- Tokenisation is the process which divides the input text into word tokens: punctuation marks, word-like units, numbers, etc.
- A system which splits text into word tokens is called a tokeniser.
- A very simple example:
  Input text: John likes Mary and Mary likes John.
  Tokens: {"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}

Problems of tokenisation: numeric expressions
- The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.
- How many words are there in 4.53 +/- 0.15%? One? Three? Nine ("four point five three, plus or minus fifteen percent")? Or not a word at all?
- The answer depends on the application at hand.

Problems of tokenisation: boundaries
- "Simple" words: delimited by spaces and punctuation
- Multiword expressions: several units, one word (pomme de terre, parce que)
- Contracted forms: one unit, several words (aux = à les, des = de les)
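To make the discussion concrete, here is a minimal sketch of a naive regex-based tokeniser in Python. The token pattern is an illustrative assumption, not the tokeniser used in the course: it keeps URLs, numbers, and hyphenated or apostrophised words together, but, as noted above, it can neither recognise multiword expressions nor split contracted forms.

```python
import re

# A naive token pattern: URLs, then numbers, then words (possibly with
# internal hyphens or apostrophes), then any other non-space character.
TOKEN = re.compile(r"""
    https?://\S+                 # URLs kept as single tokens
  | \d+(?:[.,]\d+)*%?            # numbers such as 4.53 or 0.15%
  | \w+(?:[-'’]\w+)*             # words, incl. doesn't, c'est-à-dire
  | \S                           # any other symbol (punctuation, ...)
""", re.VERBOSE)

def tokenise(text):
    """Return the list of word tokens found in `text`."""
    return TOKEN.findall(text)

print(tokenise("Pacific Lumber Co. was trying to figure it out."))
print(tokenise("He doesn't need us."))
print(tokenise("For additional information see also http://www.limsi.fr"))
```

Note how the abbreviation "Co." is split into two tokens, while "doesn't" stays whole: every such decision is a design choice that depends on the downstream application.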
2. Linguistic morphology

Morphology
Words can be further decomposed into smaller units:
pneumonoultramicroscopicsilicovolcanoconiosis
= pneumono (lung) + ultra (extreme) + microscopic + silico (silicium) + volcano + coni (dust) + osis (disease)
= a lung disease caused by the inhalation of very fine silica dust found in volcanoes

What is morphology?
- Morphology is the branch of linguistics which studies word forms and word formation.
- Word formation processes: inflection, derivation, composition / compounding.

Words vs. lexemes vs. lemmas
- A lexeme is the set of word forms which correspond to the same dictionary entry:
  small, smaller, smallest → SMALL
  knife, knives → KNIFE
- A lemma is the canonical form of a lexeme: SMALL
- In the following, capital letters are used to indicate lemmas.

Inflection
- Inflection is the process of forming different grammatical forms of a single lexeme:
  montrer → montrera
  cheval → chevaux
- The grammatical category of the word form remains the same.

Word formation
Word formation is the process of creating new lexemes from existing ones:
- Derivation: combines bases and affixes
- Compounding: combines lexemes

Derivation
- Derivation involves the creation of one lexeme from another:
  re- + create → RECREATE (re- is a derivational prefix)
  recreate + s → recreates (-s is an inflectional suffix: it provides another word form of the lexeme RECREATE!)
- Derivation may induce a change of grammatical category:
  be- + witch → BEWITCH changes a noun into a verb

Compounding
- Compounding involves the creation of one lexeme from two or more other lexemes:
  popcorn = a kind of corn which pops
  hot dog = a kind of food (an opaque compound)
- Compounding is particularly frequent in French medical language:
  appendice + ectomie → appendicectomie

Non-concatenative phenomena
- Root-and-pattern morphology (e.g. Arabic, Hebrew):
  The root consists of consonants only (3 by default): ktb = to write
  The pattern is a combination of vowels (possibly consonants too) with slots for the root consonants: kaatab = he corresponded
  (A toy sketch of this interdigitation follows below.)
- Apophony: vowel changes within a root
  Ablaut: sing, sang, sung
  Umlaut: Buch, Bücher
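As a toy illustration of root-and-pattern morphology (not an actual Arabic morphological analyser), the short Python sketch below interleaves the consonantal root ktb with a vowel pattern. The "C" slot notation is an assumption made for this example.

```python
def interdigitate(root, pattern):
    """Fill the consonant slots (written 'C') of a vowel pattern with the
    consonants of a triliteral root, in order."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# Root k-t-b ('to write') combined with a CaaCaC pattern, as on the slide:
print(interdigitate("ktb", "CaaCaC"))   # -> kaatab ('he corresponded')
```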
3. Computational morphology: a) Normalisation (stemming, lemmatisation)

Morphological normalisation
- Morphological normalisation consists in identifying a single canonical representative for morphologically related word forms.
- Methods: stemming, lemmatisation.

Stemming
- Stemming is an algorithmic approach to stripping off the endings of words.
- Objective: group words belonging to the same morphological family by transforming them into the same stemmed representation.
- Stemming does not distinguish between inflection and derivation.
- The stems obtained do not necessarily correspond to genuine word forms.
- The best-known stemming algorithms were developed by Lovins (1968) and Porter (1980).

Algorithmic stemming method
1) Desuffixing: removal of predefined word endings (sitting → sitt)
2) Recoding: transform the endings of the previously obtained stems using transformation rules (sitt → sit)
These two phases can be performed successively (Lovins) or simultaneously (Porter).

Porter's stemmer
- Based on a limited set of general cascaded transformation rules: -ational → -ate (relational → relate)
- Variants exist for many languages: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish
- Fast
- Accurate enough for some applications, e.g. Information Retrieval
- Available at http://snowball.tartarus.org/

Steps in Porter stemming (excerpts)
Step 1a: SSES → SS (caresses → caress)
Step 1b: (m>0) EED → EE (feed → feed, agreed → agree)
Step 1c: (*v*) Y → I (happy → happi, sky → sky)
Step 2: (m>0) ATIONAL → ATE (relational → relate)
Step 3: (m>0) ICATE → IC (triplicate → triplic)
Step 4: (m>1) AL → (nothing) (revival → reviv)
Step 5a: (m>1) E → (nothing) (probate → probat)
Step 5b: (m>1 and *d and *L) → single letter (controll → control)

Porter's stemmer: examples
Original word → stemmed word
vision → vision
visible → visibl
visibility → visibl
visionary → visionari
visioner → vision
visual → visual

[Figure: comparison of three stemmers, from Manning, Raghavan & Schütze, Introduction to Information Retrieval, © 2008 Cambridge University Press]

Stemming errors
- Under-stemming (related forms are not conflated): adhere → adher, adhesion → adhes
- Over-stemming (unrelated forms are conflated): appendicitis → append, append → append

Ambiguity
- Homographs: words which have the same spelling but different meanings:
  I saw the saw: preterite form of the verb SEE ≠ singular form of the noun SAW
- Such cases cannot be handled properly by stemming alone; the word's grammatical category has to be identified.

Lemmatisation
- Lemmatisation consists in mapping word forms to their lemma (base form): sing, sang, sung → sing
- Lemmatisation only handles inflection, not derivation.
- In order to resolve ambiguous cases, lemmatisation is usually combined with part-of-speech tagging.
- Additional morphological information is usually provided with the lemma (more about this later in the presentation).
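For hands-on experiments, both a Porter stemmer and a WordNet-based lemmatiser are available in the NLTK library; using NLTK here is an assumption of this sketch (the Snowball site above distributes the stemmers themselves for many languages), and the lemmatiser additionally needs the WordNet data.

```python
# pip install nltk ; then download the WordNet data once, e.g.:
#   import nltk; nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Algorithmic stems: not necessarily genuine word forms
for word in ["caresses", "relational", "visibility", "happy"]:
    print(word, "->", stemmer.stem(word))

# Lemmatisation needs the part of speech to resolve inflection correctly
print(lemmatizer.lemmatize("sang", pos="v"))    # expected: sing
print(lemmatizer.lemmatize("knives", pos="n"))  # expected: knife
```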
3. Computational morphology: b) Analysis (lexical databases, unsupervised segmentation, rule-based analysis and parsing)

Morphological analysis
Aim:
- split a word into its constituent morphemes: foxes → fox + es
- get morpho-syntactic information: part of speech (POS), tense, number, person, voice, gender, etc.
Morphological analysis can be performed:
- manually; the analyses are then stored in lexical databases
- automatically, based on manually written rules and lexicons
- in an unsupervised manner, using no external resources

Lexical databases: contents
Word entries + information:
- surface form, lemma
- syntactic properties: category, POS (part of speech); features such as masculine, feminine, etc.
- semantic properties: semantic relations (synonym, antonym, hypernym); semantic type (person, event, object)

Lexical databases: examples
- CELEX: a lexical database available for English, Dutch and German
- Morphalou: http://www.cnrtl.fr/lexiques/morphalou/
- Prolex: http://www.cnrtl.fr/lexiques/prolex/
- French Verbs (Dubois & Dubois-Charlier)

Unsupervised segmentation
- Unsupervised morphological segmentation consists in automatically breaking down words into their constituent morphemes.
- Only input: a list of words (no language-specific rules or lexicons).
- Scientific goals:
  - learn about the phenomena underlying word construction in natural languages
  - discover approaches suitable for a wide range of languages
  - advance machine learning methodology
- See the Morpho Challenge website: http://www.cis.hut.fi/morphochallenge2009/

Segmentation by analogy (Lepage, 1998)
Apply the analogy principle to word forms: fahre : fahren :: schlafe : X?  →  X = schlafen
[Figures: the analogy principle and its application to word segmentation.]

Segmentation by compression
Minimum Description Length and Bayesian inference (Goldsmith, 2001; Creutz & Lagus, 2005)

Segmentation by successor counts (Harris, 1955)
- At the end of a morpheme (or word), almost any sound can follow: design + #, design + ation, design + ing, design + ed, ...
- However, within a morpheme, the choice is more restricted: desig + n
- Basic algorithm:
  - at each position in an utterance, count the number of different sounds which can possibly follow
  - peaks in this count indicate morpheme boundaries

Segmentation of "He's quicker"
- Utterance: He's quicker (hiyzkwikәr)
- Successors of h-: His ship's in? Humans act like simians. ...
- Successors of hi-: Hip-high in water. Hidden meanings were discovered. ...
[Figure: successor counts at each position of hiyzkwikәr; peaks in the counts indicate morpheme boundaries.]
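Harris's idea is easy to prototype on written words. The sketch below is a simplified, letter-based version over a made-up mini-vocabulary (Harris worked with phoneme sequences, and word endings could be counted as an extra successor).

```python
def successor_counts(word, vocabulary):
    """For each prefix of `word`, count the distinct letters that can follow
    that prefix anywhere in the vocabulary (after Harris, 1955)."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in vocabulary if w.startswith(prefix) and len(w) > i}
        counts.append((prefix, len(successors)))
    return counts

vocab = ["design", "designs", "designer", "designed", "designing",
         "designation", "desire", "desired"]
for prefix, n in successor_counts("designer", vocab):
    print(f"{prefix:10s} {n}")
# The count peaks after 'design' (several possible continuations),
# suggesting a morpheme boundary: design + er.
```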
Morphological parsing
- Aim: break down a word into its component morphemes and build a structured representation of the analysis.
- Example: cats → cat +N +PL (lemma = cat, features = +N +PL)
- Our focus: finite-state morphological parsing.

Finite state automata
- A finite state automaton (FSA) recognises a set of strings.
- An FSA is represented as a directed graph: vertices (nodes) represent states, directed links between nodes represent transitions.

Sheep talk FSA
- The language of sheep includes the following utterances: baa!, baaa!, baaaa!, baaaaa!, etc.
- Regular expression for this language: baa+!
- An FSA that accepts this language: q0 →b→ q1 →a→ q2 →a→ q3 →!→ q4, with a self-loop on q3 labelled a.

Formal definition of an FSA
- Q = {q0, q1, ..., qN-1}: a finite set of N states
- Σ: a finite input alphabet of symbols
- q0: the start state
- F: the set of final states
- δ(q, i): the transition function
For the sheep talk automaton: Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}

Deterministic vs. non-deterministic FSA
- Deterministic FSA for sheep talk: the automaton above; every state has at most one outgoing transition per input symbol.
- Non-deterministic FSA for sheep talk: an equivalent automaton in which q2 has two outgoing a-transitions (to q2 itself and to q3), so the machine has to choose between them.

Morphological parsers
Components:
- lexicon: list of lemmas and affixes
- morphotactics: a word grammar which accounts for morpheme ordering
- orthographic rules: model the changes that occur when two morphemes combine (city + s → cities)
Morphological parsers can be implemented as finite-state transducers.

Finite state transducers
- Finite-state transducers map between one representation and another.
- Example transducer with states 0-3 and arcs cat:N (0→1), catch:V (0→2), s:PL (1→3):
  State 0: start state
  State 1: cat has been recognised as +N (possible end state)
  State 2: catch has been recognised as +V (possible end state)
  State 3: cats has been recognised as +N +PL (possible end state)

Two-level morphology (Koskenniemi, 1984)
- Surface level: words as they are pronounced or written.
- Lexical level: concatenation of morphemes.
  Lexical level:  c a t +N +PL
  Surface level:  c a t s
- The mapping between the surface and the lexical level is constrained by rules.

Two-level rules
Example rule (Trost, 2003), the e-insertion spelling rule:
  +:e ⇐ { s x z [ {s c} h ] } : _ s
The lexical:surface pair +:e states that a morpheme boundary (+) is realised as e on the surface when the left context is s, x, z, sh or ch and the right context is s.
Application of the rule (0 = empty symbol):
  Lexical level:  # d i s h + s #
  Surface level:  0 d i s h e s 0
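The orthographic rules above can be approximated procedurally. The sketch below hard-codes the e-insertion context (and the y → ie change mentioned earlier for city + s) instead of compiling the rules into a transducer, which is what two-level systems actually do; it is an illustration, not a two-level implementation.

```python
import re

def add_plural_s(stem):
    """A procedural approximation of two English orthographic rules when the
    plural suffix 's' attaches (not a real two-level transducer):
      - e-insertion after s, x, z, sh, ch   (dish + s -> dishes)
      - y -> ie after a consonant           (city + s -> cities)
    """
    if re.search(r"(s|x|z|sh|ch)$", stem):
        return stem + "es"
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    return stem + "s"

for stem in ["dish", "fox", "city", "cat"]:
    print(stem, "+ s ->", add_plural_s(stem))
```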
PC-KIMMO
Demo: http://languagelink.let.uu.nl/~lion

PC-KIMMO: POS ambiguity
The form walk receives two analyses, as a base-form verb and as a singular noun:
  1: Word: [ cat: Word, head: [ pos: V, vform: BASE ], root: `walk, root_pos: V, clitic: -, drvstem: - ]
  2: Word: [ cat: Word, head: [ agr: [ 3sg: + ], number: SG, pos: N, proper: -, verbal: - ], root: `walk, root_pos: N, clitic: -, drvstem: - ]

Inflectional analysis for French: Flemm
- Developed by F. Namer: http://www.cnrtl.fr/outils/flemm/
- Input: word + POS (as provided by the TreeTagger or the Brill tagger): renouent VER:pres renouer
- Output: lemma + morpho-syntactic features:
  renouent VER(pres):Vmip3p--1 renouer || renouent VER(pres):Vmsp3p--1 renouer
  (verb in the present indicative or present subjunctive, third person plural, 1st group; lemma renouer)

Inflectional analysis for French: Flemm, the case of -èrent
Linguistic analysis:
- In general, -èrent marks the passé simple, third person plural: céd-èrent.
- Sometimes the ending is shorter and -èrent marks the present tense: légifèr-ent.
- Very rarely, the segmentation is ambiguous: lac-èrent (lacer, passé simple) vs. lacèr-ent (lacérer, present).
Rules and exceptions:
- Ambiguous segmentations are lexicalised, since they are rare.
- Since the rule is to strip the longest matching suffix, verbs that actually take the shorter suffix -ent, such as légifèr-, are lexicalised.
- Other cases (e.g. céd-): regular desuffixation on -èrent.

Derivational analysis: DériF
- Developed by F. Namer: http://www.cnrtl.fr/outils/DeriF/
- Input: form/POS, e.g. sympathique/ADJ
- Output: analysis [ [ sympathie NOM] ique ADJ] → (sympathique/ADJ, sympathie/NOM), with the gloss "En rapport avec le(s) sympathie" ("related to sympathie")

Derivational analysis: DériF (continued)
Word formation rules:
- déXiser V: [dé [X N] +iser V]
- Xable A: [[X (er) V] +able A]
- inX A: [in [X A] A]
Sequence of decompositions:
- impensable/ADJ → in + pensable/ADJ
- décomposable/ADJ → décomposer/VERBE + able/ADJ
Ambiguous analyses:
- implantable/ADJ → implanter/VERBE + able/ADJ, or im + plantable/ADJ
- Each analysis produces a gloss: "(lequel - Que l')on peut implanter" ("which can be implanted") // "Non plantable" ("not plantable")

Analysis of neoclassical compounds: DériF
- Input: acrodynie/N
- Hierarchical decomposition: [ [ acr N* ] [ odyn N* ] ie NOM ]
- Definition (gloss): "douleur (du -- liée au) extrémité" ("pain of / linked to the extremities")
- Semantic type: maladie (disease)
- Lexical and semantic relations with other lexemes:
  eql: acr/algie, acr/algo, acr/algés, apex/algie, apex/algo, apex/algés, apex/odyn
  see: acr/ite, apex/ite

4. Applications

Information Retrieval: stemming
Stemming is frequently used in Information Retrieval:
- stemming is applied at indexing time
- user queries are analysed likewise
- stems in the user query are matched against stems in documents
It reduces the number of terms to index and improves recall (more of the relevant documents are retrieved).

Information Retrieval: morphological query expansion
- Morphological variants of a word can be used to perform query expansion.
- The original word forms are indexed; query terms are expanded with their morphological variants at retrieval time (Moreau et al., 2007).
- Original query: Ineffectiveness of U.S. embargoes or sanctions
- Expanded query: ineffectiveness ineffective effectiveness effective ineffectively embargoes embargo embargoed embargoing sanctioning sanction sanctioned sanctions sanctionable

Text-to-speech systems
- Aim: take text, in standard spelling, and synthesise a spoken version of it.
- Problems: proper names (places, persons), out-of-vocabulary words (words unknown to the system).
- Solutions from morphology: hothouse = hot + house, not hoth + ouse.

Machine translation
- Aim: translate a text from one language into another.
- Problems: a word in one language may correspond to two or more words in another language; out-of-vocabulary words.
- How can morphological analysis help?
  compounds: Aktionsplan (de) → action plan (en) (see the toy splitting sketch below)
  inflection: va, aller (fr) → go (en)
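As a rough illustration of how compound splitting can help translation (a toy heuristic, not the method of any particular MT system), the sketch below tries every split point of a German-style compound against a small hand-made lexicon, allowing an optional linking element such as -s-.

```python
def split_compound(word, lexicon, linkers=("", "s", "es")):
    """Toy greedy splitter for German-style compounds: try every split point
    and an optional linking element between the two parts."""
    w = word.lower()
    for i in range(3, len(w) - 2):          # require reasonably long parts
        left_full, right = w[:i], w[i:]
        for link in linkers:
            if not left_full.endswith(link):
                continue
            left = left_full[: len(left_full) - len(link)] if link else left_full
            if left in lexicon and right in lexicon:
                return left, right
    return (word,)

lexicon = {"aktion", "plan"}                  # tiny illustrative lexicon
print(split_compound("Aktionsplan", lexicon)) # expected: ('aktion', 'plan')
```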
Meditate on this...
"Maybe in order to understand mankind, we have to look at the word itself. Mankind. Basically, it's made up of two separate words – 'mank' and 'ind'. What do these words mean? It's a mystery, and that's why so is mankind."
Jack Handey (Deep Thoughts)

Relevant literature
Creutz, M. & Lagus, K. (2005), 'Inducing the Morphological Lexicon of a Natural Language from Unannotated Text', in Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), pp. 106-113.
Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153-198.
Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.
Koskenniemi, K. (1984), 'A general computational model for word-form recognition and production', in Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 178-181.
Lepage, Y. (1998), 'Solving analogies on words: an algorithm', in Proceedings of the 17th International Conference on Computational Linguistics, Morristown, NJ, USA, pp. 728-734.
Moreau, F.; Claveau, V. & Sébillot, P. (2007), 'Automatic morphological query expansion using analogy-based machine learning', in Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), Rome, Italy.
Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word Segmentation: The Role of Distributional Cues', Journal of Memory and Language 35(4), 606-621.
Trost, H. (2003), 'Morphology', in The Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 25-47.