Foundations of NLP (Fondements du TAL)
Delphine Bernhard
Morphology
Contains slides adapted from Pierre Zweigenbaum
LIMSI-CNRS | 1/72

Outline
1. Word segmentation
2. Linguistic morphology
   a) Morphemes
   b) Morphological processes
3. Computational morphology
   a) Normalisation: stemming, lemmatisation
   b) Analysis: lexical databases, unsupervised segmentation, rule-based analysis and parsing
4. Applications

Levels of linguistic structure
[Figure: levels of linguistic structure, with one level marked "Our focus today". (c) David Groome, 2006]

Can you read this (fast!)?
  wikipédiaestunprojetdencyclopédiecollectiveétabli
  esurinternetuniversellemultilingueetfonctionnants
  urleprincipeduwikiwikipédiaapourobjectifdoffrirun
  contenulibrementréutilisableneutreetvérifiableque
  chacunpeutéditeretaméliorer
With word boundaries restored:
  Wikipédia est un projet d’encyclopédie collective établie sur Internet, universelle, multilingue et fonctionnant sur le principe du wiki. Wikipédia a pour objectif d’offrir un contenu librement réutilisable, neutre et vérifiable, que chacun peut éditer et améliorer.

It's only words ...
... but what are they exactly and how can we automatically recognise them?
In speech, there are no obvious breaks.
So how do babies learn words?
According to Saffran et al. (1996), they use distributional cues and statistical regularities in speech.

How do we recognise words in speech? (Bauer, 1988)
There are no gaps between words in speech:
  Menbecomeoldbuttheyneverbecomegood
Thanks to our knowledge of language, we recognise certain strings of sounds/letters: e.g. we can recognise men in the previous sequence because it also comes up in sequences like:
  Menareconservativeafterdinner
  Menlosetheirtempersindefendingtheirtaste.
  Afterfortymenhavemarriedtheirhabits.

Learning to read is difficult for humans
Reading disabilities:
  Dyslexia: inability to decode, or break down, words into phonemes
  Comprehension difficulties
The invention of writing and reading is recent.
Unlike speech or vision, reading is an unnatural process that has to be learned: brains are not wired to read!

For computers: characters and strings
Control characters:
  End of line: \n
  Tabulation: \t
Encodings:
  ASCII: English alphabet
  Latin 1, ISO-8859-1: Western European languages
  ISO-8859-15: similar to ISO-8859-1, but replaces some less common symbols with €, Œ or œ
  Windows-1252, Cp1252: superset of ISO-8859-1 (includes €, Œ and œ)
  UTF-8: can represent every character in the Unicode character set, backward-compatible with ASCII

Practical definition of words and sentences
Bauer (1988): A word is a unit which, in print, is bounded by spaces on both sides. We will call this an orthographic word.
Kučera and Francis (1967): A graphic word is a string of contiguous alphanumeric characters with space on either side; it may include hyphens and apostrophes, but no other punctuation marks.
Grefenstette and Tapanainen (1994): Sentences end with punctuation.

What are the "words" and sentences here?
  Pacific Lumber Co. was trying to figure out the safest way to bring the activists down.
  He doesn't need us.
  For additional information see also
  New York is situated on the east coast of the United States.
  c’est-à-dire
  les pommes de terre
  des U.S.A.

Tokenisation
Tokenisation: the process which divides the input text into word tokens: punctuation marks, word-like units, numbers, etc.
A system which splits texts into word tokens is called a tokeniser.
A very simple example:
  Input text: John likes Mary and Mary likes John.
  Tokens: {"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}

Problems of tokenisation
Numeric expressions:
  The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.
How many words are there in 4.53 +/- 0.15%?
  1
  3
  9 (“four point five three, plus or minus fifteen percent”)
  not a word
The answer depends on the application at hand.

Problems of tokenisation
Boundaries for "simple" words:
  Spaces
  Punctuation
Multiword expressions: several units, one word
  pomme de terre
  parce que
Contracted forms: one unit, several words
  aux (à les), des (de les)

Morphology
Words can be further decomposed into smaller units:
  pneumono | ultra   | microscopic | silico  | volcano | coni | osis
  lung     | extreme | microscopic | silicon | volcano | dust | disease
Meaning: lung disease caused by the inhalation of very fine silica dust found in volcanoes

What is morphology?
Morphology is the branch of linguistics which studies word forms and word formation.
Word formation processes:
  Inflection
  Derivation
  Composition / Compounding

Words vs. lexemes vs. lemmas
A lexeme refers to the set of word forms which correspond to the same dictionary entry:
  small, smaller, smallest → SMALL
  knife, knives → KNIFE
A lemma is the canonical form of a lexeme:
  SMALL
In the following, capital letters are used to indicate lemmas.

Inflection
Inflection is the process of forming different grammatical forms of a single lexeme:
  montrer → montrera
  cheval → chevaux
The grammatical category of the word form remains the same.

Word formation
Word formation is the process of creating new lexemes from existing ones:
  Derivation: combines bases and affixes
  Compounding: combines lexemes

Derivation
Derivation involves the creation of one lexeme from another:
  re- + create → RECREATE
  re- is a derivational prefix
  recreate + s → recreates
  -s is an inflectional suffix: it provides another word form of the lexeme RECREATE!
Derivation might induce a change of the grammatical category:
  be- + witch → BEWITCH: changes a noun into a verb

Compounding
A compound involves the creation of one lexeme from two or more other lexemes:
  popcorn = a kind of corn which pops
  hot dog = a kind of food (opaque compound)
Compounding is particularly frequent in French medical language:
  appendice + ectomie → appendicectomie

Non-concatenative phenomena
Root-and-pattern morphology (e.g. Arabic, Hebrew):
  the root consists of consonants only (3 by default)
    ktb = to write
  the pattern is a combination of vowels (possibly consonants too) with slots for the root consonants
    kaatab = he corresponded
Apophony: vowel changes within a root
  Ablaut: sing, sang, sung
  Umlaut: Buch, Bücher
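The root-and-pattern mechanism described for Arabic and Hebrew can be illustrated with a minimal Python sketch. The pattern notation used here (an uppercase `C` marking each slot for a root consonant) and the function name `interdigitate` are assumptions made for this illustration; they do not come from the slides.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot of the pattern with the next consonant of the root.

    The 'C' slot notation is a hypothetical convention chosen for this sketch.
    """
    if pattern.count("C") != len(root):
        raise ValueError("the pattern must have one slot per root consonant")
    consonants = iter(root)
    # Copy pattern material as-is and replace every slot by a root consonant.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Root ktb ("to write") combined with a pattern that has three slots.
print(interdigitate("ktb", "CaaCaC"))
```

With the root ktb and the pattern CaaCaC, the sketch produces kaatab, the form given in the slide. Real analysers for Semitic languages also have to handle affixes and phonological adjustments, which this sketch ignores.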
Morphological normalisation
Morphological normalisation consists in identifying a single canonical representative for morphologically related word forms.
Methods:
  Stemming
  Lemmatisation

Stemming
Stemming is an algorithmic approach to stripping off the endings of words.
Objective: group words belonging to the same morphological family by transforming them into a similar stemmed representation.
Stemming does not distinguish between inflection and derivation.
The stems obtained do not necessarily correspond to a genuine word form.
The best known stemming algorithms were developed by Lovins (1968) and Porter (1980).

Algorithmic stemming method
1) Desuffixing: removal of predefined word endings
     sitting → sitt
2) Recoding: transformation of the endings of the previously obtained stems using transformation rules
     sitt → sit
These 2 phases can be performed successively (Lovins) or simultaneously (Porter).

Porter's stemmer
Based on a limited set of general cascaded transformational rules:
  -ational → -ate : relational → relate
Variants exist for many languages:
  English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, Turkish
Fast
Accurate enough for some applications, e.g. Information Retrieval
Available at

Steps in Porter stemming (excerpts)
  Step   Condition               Rule              Examples
  1a                             SSES → SS         caresses → caress
  1b     (m>0)                   EED → EE          feed → feed, agreed → agree
  1c     (*v*)                   Y → I             happy → happi, sky → sky
  2      (m>0)                   ATIONAL → ATE     relational → relate
  3      (m>0)                   ICATE → IC        triplicate → triplic
  4      (m>1)                   AL →              revival → reviv
  5a     (m>1)                   E →               probate → probat
  5b     (m > 1 and *d and *L)   → single letter   controll → control

Porter's stemmer
  Original word   Stemmed word
  vision          vision
  visible         visibl
  visibility      visibl
  visionary       visionari
  visioner        vision
  visual          visual

Comparison of three stemmers
[Figure: comparison of three stemmers. © 2008 Cambridge University Press, Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze]

Stemming errors
Under-stemming:
  adhere → adher
  adhesion → adhes
Over-stemming:
  appendicitis → append
  append → append

Ambiguity
Homographs: words which have the same spelling but different meanings
  I saw the saw
  Preterite form of the verb SEE ≠ singular form of the noun SAW
Stemming alone cannot handle such cases properly: the word's grammatical category has to be identified.

Lemmatisation
Lemmatisation consists in mapping word forms to their lemma (base form):
  sing, sang, sung → sing
Lemmatisation only handles inflection, not derivation.
In order to resolve ambiguous cases, lemmatisation is usually combined with part-of-speech tagging.
Additional morphological information is usually provided with the lemma (more about this later in the presentation).
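The Porter rule excerpts for Steps 1a, 1b, 1c and 2 can be sketched in Python as follows. This is a partial illustration under stated assumptions, not the full algorithm: each step is reduced to the single rule quoted in the slides, and the helper names (`is_consonant`, `measure`, `contains_vowel`, `stem_excerpt`) are hypothetical. The measure m counts vowel-consonant sequences in the candidate stem, following Porter's definition.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's convention: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """The m of the rule conditions: number of vowel-consonant sequences."""
    shape = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(shape, shape[1:]) if a == "v" and b == "c")


def contains_vowel(stem):
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def stem_excerpt(word):
    """Apply only the four rules quoted in the excerpt, in cascade."""
    if word.endswith("sses"):                                # Step 1a: SSES -> SS
        word = word[:-2]
    if word.endswith("eed") and measure(word[:-3]) > 0:      # Step 1b: (m>0) EED -> EE
        word = word[:-1]
    if word.endswith("y") and contains_vowel(word[:-1]):     # Step 1c: (*v*) Y -> I
        word = word[:-1] + "i"
    if word.endswith("ational") and measure(word[:-7]) > 0:  # Step 2: (m>0) ATIONAL -> ATE
        word = word[:-7] + "ate"
    return word


for w in ["caresses", "feed", "agreed", "happy", "sky", "relational"]:
    print(w, "->", stem_excerpt(w))
```

The conditions explain the contrasting examples: feed is left unchanged because the stem f has m = 0, whereas agreed becomes agree because agr has m = 1; sky is unchanged because sk contains no vowel.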
Morphological analysis
Aim:
  split a word into its constituent morphemes: foxes → fox + es
  get morpho-syntactic information: part-of-speech (POS), tense, number, person, voice, gender, etc.
Morphological analysis can be performed:
  manually: the analyses are then stored in lexical databases
  automatically:
    based on some manually written rules and lexicons
    in an unsupervised manner, using no external resources

Lexical databases: contents
Word entries + information:
  surface form, lemma
  syntactic properties
    category, POS (part of speech)
    features: masculine, feminine, etc.
  semantic properties
    semantic relations: synonym, antonym, hypernym
    semantic type: person, event, object

CELEX
CELEX is a lexical database which is available for English, Dutch and German.

Morphalou

Prolex

French Verbs (Dubois & Dubois-Charlier)

Unsupervised Segmentation
Unsupervised morphological segmentation consists in automatically breaking down words into their constituent morphemes.
Only input dataset: a list of words (no language-specific rules or lexicons)
Scientific goals:
  Learn about the phenomena underlying word construction in natural languages
  Discover approaches suitable for a wide range of languages
  Advance machine learning methodology
See the Morpho Challenge website

Segmentation by analogy (Lepage, 1998)

Application of the analogy principle
  fahre : fahren :: schlafe : X?

Segmentation by compression
Minimum Description Length and Bayesian inference (Goldsmith, 2001; Creutz & Lagus, 2005)

Harris (1955): Segmentation by successor counts
At the end of a morpheme (or word) almost any sound can follow:
  design + #, design + ation, design + ing, design + ed, ...
However, within morphemes, the choice is more restricted:
  desig + n
Basic algorithm:
  At each position in an utterance, count the number of different sounds which can possibly follow
  Peaks in this count indicate morpheme boundaries

Segmentation of He's quicker
Utterance = He's quicker (hiyzkwikәr)
Successors of h:
  His ship's in?
  Humans act like simians.
  ...
Successors of hi:
  Hip-high in water.
  Hidden meanings were discovered.
  ...

Successor counts
[Figure: successor counts (vertical axis, 0 to 35) after each sound of the utterance h - i - y - z - k - w - i - k - ә - r (horizontal axis: Sounds)]

Morphological parsing
Aim: break down a word into component morphemes and build a structured representation of the analysis
Example:
  cats → cat +N +PL
  (cat is the lemma; +N and +PL are features)
Our focus: finite-state morphological parsing

Finite state automata
A finite state automaton (FSA) recognises a set of strings.
An FSA is represented as a directed graph:
  vertices (nodes) represent states
  directed links between nodes represent transitions

Sheep Talk FSA
The language of sheep includes the following utterances:
  baa!, baaa!, baaaa!, baaaaa!, etc.
Regular expression for this language: baa+!
FSA that can accept this language:
[Figure: automaton with states q0, q1, q2, q3, q4 and arcs labelled b, a, a, a, !]

Formal definition of an FSA
  Q = q0 q1 q2 ... qN-1   a finite set of N states
  Σ                       a finite input alphabet of symbols
  q0                      the start state
  F                       the set of final states
  δ(q,i)                  the transition function
For the sheep talk automaton:
  Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}

Deterministic vs. non-deterministic FSA
Deterministic FSA for sheep talk
[Figure: automaton with states q0, q1, q2, q3, q4 and arcs labelled b, a, a, a, !]
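A deterministic recogniser for the sheep-talk language baa+! can be sketched in Python with a transition table. The table below is an assumption reconstructed from the regular expression and from the state set Q = {q0, ..., q4} with F = {q4}; it is not read off the figure, and the names `TRANSITIONS` and `accepts` are hypothetical.

```python
# Assumed transition function delta(q, i) for the deterministic automaton.
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",  # self-loop: any number of additional a's
    ("q3", "!"): "q4",
}
START_STATE = "q0"
FINAL_STATES = {"q4"}


def accepts(string):
    """Return True if the automaton ends in a final state after reading the string."""
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined for this symbol: reject
            return False
    return state in FINAL_STATES


for utterance in ["baa!", "baaaa!", "ba!", "baa"]:
    print(utterance, accepts(utterance))
```

Because the automaton is deterministic, each (state, symbol) pair leads to at most one next state, so recognition is a single left-to-right pass with no backtracking.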
Non-deterministic FSA for sheep talk
[Figure: automaton with states q0, q1, q2, q3, q4 and arcs labelled b, a, a, a, !]

Morphological parsers
Components:
  lexicon: list of lemmas and affixes
  morphotactics: word grammar which accounts for morpheme ordering
  orthographic rules: model the changes that occur when two morphemes combine
    city + s → cities
Morphological parsers can be implemented as finite-state transducers.

Finite State Transducers
[Figure: transducer with arcs 0 → 1 (cat:N), 1 → 3 (s:PL) and 0 → 2 (catch:V)]
Finite-state transducers map between one representation and another.
  State 0: start state
  State 1: cat has been recognised as a +N (possible end state)
  State 2: catch has been recognised as a +V (possible end state)
  State 3: cats has been recognised as +N +PL (possible end state)

Two-level morphology (Koskenniemi, 1984)
Surface level: words as they are pronounced or written
Lexical level: concatenation of morphemes
  Lexical level:  c a t +N +PL
  Surface level:  c a t s
The mapping between the surface and the lexical level is constrained by rules.

Two-level rules
Example rule (Trost, 2003):
  +:e ⇐ { s x z [ {s c} h ] } : _ s
  In the pair +:e, + belongs to the lexical level and e to the surface level.
  The expression in braces is the left context; the s after _ is the right context.
Application of the rule:
  Lexical level:  # d i s h + s #
  Rule applied:             1
  Surface level:  0 d i s h e s 0
Spelling rule: e-insertion

PC-KIMMO
Demo:

PC-Kimmo: POS Ambiguity
1: Word:
   [ cat:      Word
     head:     [ pos:    V
                 vform:  BASE ]
     root:     `walk
     root_pos: V
     clitic:   -
     drvstem:  - ]
2: Word:
   [ cat:      Word
     head:     [ agr:    [ 3sg: + ]
                 number: SG
                 pos:    N
                 proper: -
                 verbal: - ]
     root:     `walk
     root_pos: N
     clitic:   -
     drvstem:  - ]

Inflectional Analysis for French: Flemm
Developed by F. Namer
Input: word + POS (as provided by the TreeTagger or the Brill tagger)
  renouent VER:pres renouer
Output: lemma + morpho-syntactic features
  renouent VER(pres):Vmip3p--1 renouer || renouent VER(pres):Vmsp3p--1 renouer
  Verb in the present indicative or present subjunctive, third person plural, first group

Inflectional Analysis for French: Flemm
Linguistic analysis: the case of -èrent
  in general, -èrent marks the passé simple (third person plural): céd-èrent
  sometimes the ending is shorter and -èrent marks the present: légifèr-ent
  very rarely, the ending is ambiguous: lac-èrent and lacèr-ent
Rules and exceptions: the case of -èrent
  ambiguous segmentations are lexicalised because they are rare
  since the rule is to strip the longest suffix, verbs that take the suffix -ent, such as légifèr-, are lexicalised
  other cases (e.g. céd-): regular suffix stripping on -èrent

Derivational Analysis: DériF
Developed by F. Namer
Input: form/POS
  sympathique/ADJ
Output: analysis
  [ [ sympathie NOM] ique ADJ]
  (sympathique/ADJ, sympathie/NOM)
  "En rapport avec le(s) sympathie" (gloss: "related to sympathy")

Derivational Analysis: DériF
Word formation rules:
  déXiser V   [dé [X N] +iser V]
  Xable A     [[X (er) V] +able A]
  inX A       [in [X A] A]
Sequence of decompositions:
  impensable/ADJ    →  in + pensable/ADJ
  décomposable/ADJ  →  décomposer/VERBE + able/ADJ
Ambiguous analyses:
  implantable/ADJ   →  implanter/VERBE + able/ADJ
  implantable/ADJ   →  im + plantable/ADJ
Produces a gloss:
  "( lequel - Que l') on peut implanter" // "Non plantable" (glosses: "which can be implanted" // "not plantable")

Analysis of neoclassical compounds: DériF
acrodynie/N
Hierarchical decomposition:
  [ [ acr N* ] [ odyn N* ] ie NOM ]
Definition (gloss):
  "douleur (du -- liée au) extrémité" (gloss: "pain of, or related to, the extremity")
Semantic type:
  Type = maladie (disease)
Lexical and semantic relations with other lexemes:
  eql:acr/algie, eql:acr/algo, eql:acr/algés, eql:apex/algie, eql:apex/algo, eql:apex/algés, eql:apex/odyn
  see:acr/ite, see:apex/ite
Information Retrieval: Stemming
Stemming is frequently used in Information Retrieval:
  Stemming is applied at indexing time
  User queries are analysed likewise
  Stems in the user query are matched against stems in documents
It reduces the number of terms to index.
It improves recall (the proportion of relevant documents which are retrieved).

Information Retrieval: Morphological Query Expansion
Morphological variants of a word can be used to perform query expansion:
  The original word forms are indexed
  Query terms are expanded with their morphological variants at retrieval time (Moreau et al., 2007)
Original query:
  Ineffectiveness of U.S. embargoes or sanctions
Expanded query:
  ineffectiveness ineffective effectiveness effective ineffectively
  embargoes embargo embargoed embargoing
  sanctioning sanction sanctioned sanctions sanctionable

Text-To-Speech Systems
Aim: take text, in standard spelling, and synthesise a spoken version of the text
Problems:
  Proper names (places, persons)
  Out-of-vocabulary words (words unknown to the system)
Solutions from morphology:
  hothouse = hot + house and not hoth + ouse

Machine Translation
Aim: translate a text from one language into another language
Problems:
  A word in one language may correspond to two or more words in another language
  Out-of-vocabulary words
How can morphological analysis help?
  compounds: Aktionsplan (de) → action plan (en)
  inflection: va, aller (fr) → go (en)

Meditate on this...
"Maybe in order to understand mankind, we have to look at the word itself. Mankind. Basically, it's made up of two separate words – 'mank' and 'ind'. What do these words mean?
It's a mystery, and that's why so is mankind."
Jack Handey (Deep Thoughts)

Relevant Literature
Creutz, M. & Lagus, K. (2005), Inducing the Morphological Lexicon of a Natural Language from Unannotated Text, in 'Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05)', pp. 106-113.
Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153-198.
Harris, Z. (1955), 'From phoneme to morpheme', Language 31(2), 190-222.
Koskenniemi, K. (1984), A general computational model for word-form recognition and production, in 'Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 178-181.
Lepage, Y. (1998), Solving analogies on words: an algorithm, in 'Proceedings of the 17th International Conference on Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, pp. 728-734.
Moreau, F.; Claveau, V. & Sébillot, P. (2007), Automatic morphological query expansion using analogy-based machine learning, in 'Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007)', Roma, Italy, April 2007.
Saffran, J. R.; Newport, E. L. & Aslin, R. N. (1996), 'Word Segmentation: The Role of Distributional Cues', Journal of Memory and Language 35(4), 606-621.
Trost, H. (2003), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter Morphology, pp. 25-47.