RANLP tutorial, September 2013, Hissar, Bulgaria The Analytics of Word Sociology Violeta Seretan Department of Translation Technology Faculty of Translation and Interpreting University of Geneva 8 September 2013 Keywords: computer science; linguistics; computational linguistics; statistics; inferential statistics; syntactic parsing; dependency parsing; shallow parsing; chunking; POS-tagging; lemmatization; tokenisation; type vs. token; distribution; Zipf's law; hypothesis testing; statistical significance; null hypothesis; association measure; collocation extraction; mutual information; log-likelihood ratio; entropy; contingency table; co-occurrence; collocation; extraposition; long-distance dependency; n-gram; precision, recall, F-measure 2 Outline 1. Introduction 2. Terminology clarification 3. Theoretical description 4. Practical accounts 5. Behind the curtains: the maths and stats 6. Wrap up and outlook 3 Objectives • Understand the concept of collocation and its relevance for the fields of linguistics, lexicography and natural language processing. • Become aware of the definitional and terminological issues, the description of collocations in terms of semantic compositionality, and the relation with other multi-word expressions. • Understand the basic architecture of a collocation extraction system. • Become familiar with the most influential work in the area of collocation extraction. • Get (more than) an overview of the underlying technology – in particular, the statistical computation details. 4 INTRODUCTION 5 Social Analytics “Measuring + Analyzing + Interpreting interactions and associations between people, topics and ideas.” (http://en.wikipedia.org/wiki/Social_analytics) http://www.submitedge.com http://irevolution.net 6 You shall know someone … … by the company they keep http://flowingdata.com 7 Word Sociology • Barnbrook (1996) Language and Computers, Chapt.
5 «The sociology of words»: – collocation analysis: «automatic quantitative analysis and identification of word patterns around words of interest» [diagram: the ‘node’ word surrounded by collocate words 1 … n] 8 You shall know a word … … by the company it keeps! (Firth, 1957) ‘node’ word = ? … Seretan and Wehrli (2011): FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora 9 Collocation analysis: Key concepts • Node word: the word under investigation • Collocate: the “word patterns” around the node word • Association measure (AM): Evert (2004): “a formula that computes an association score from the frequency information […]” • Collocation extraction [from corpora]: the task of automatically identifying genuine associations of words in corpora 10 Relevance for Linguistics • Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, … Words are “separated in meaning at the collocational level” (Firth, 1968, 180) Word collocation is one of the most important forms of text cohesion: is a passage of language "a unified whole or is just a collection of unrelated sentences"? (Halliday and Hasan, 1976, 1) Collocations are found at the intersection of lexicon and grammar: "semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments” (Sinclair, 1991, 110); Collocations [“idioms of encoding”] are expressions “which are larger than words, which are like words in that they have to be learned separately as individual whole facts about the language" (Fillmore et al., 1988, 504) “We acquire collocations, as we acquire other aspects of language, through encountering texts in the course of our lives” (Hoey, 1991, 219). 11 Relevance for Linguistics (cont.)
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, … In the Meaning-Text Theory (e.g., Mel’čuk, 1998), collocations are described by means of lexical functions (associating meaning and the utterance expressing that meaning): Magn(problem) = big Magn(rain) = heavy Magn(injury) = serious Collocations often hold between words which share a positive or a negative connotation (semantic prosody – e.g., Louw, 1993). FipsCoView 12 Relevance for Lexicography • Dictionaries of co-occurrences/collocations/cum-corpus “Collocation is the way words combine in a language to produce natural-sounding speech and writing” (Lea and Runcie, 2002) “Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.” (Ellis, 2008) Benson et al., 1986 Sinclair, 1987 OCDSE (Lea and Runcie, 2002) 13 Relevance for Lexicography (cont.) http://dictionary.reverso.net/english-cobuild 14 Relevance for Lexicography (cont.) • Dictionaries of co-occurrences/collocations/cum-corpus Beauchesne, 2001 Charest et al., 2012 15 Relevance for Natural Language Processing • Machine translation EN ask a question – FR poser ‘put’ une question – ES hacer ‘make’ una pregunta “collocations are the key to producing more acceptable output” (Orliac and Dillinger, 2003) • Natural language generation EN to brush one’s teeth – * to wash one’s teeth “In the generation literature, the generation of collocations is regarded as a problem” (Heid and Raab, 1989) “However, collocations are not only considered useful, but also a problem both in certain applications (e.g. generation, […] machine translation […])” (Heylen et al., 1994) 16 Relevance for Natural Language Processing (cont.) • Syntactic parsing * vs.
• Word sense disambiguation break: about 50 senses record: about 10 senses to break a world record: 1 sense verb-object collocation break – record “a polysemous word exhibits essentially only one sense per collocation” (Yarowsky, 2003) 17 Senses of break (partial) 18 Senses of record 19 Relevance for Natural Language Processing (cont.) • OCR distinguish between homographs: terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage (Examples from Yarowsky, 2003) • Speech recognition distinguish between homophones: aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal (Examples from Yarowsky, 2003) (Examples from Church and Hanks, 1990) 20 Relevance for Natural Language Processing (cont.) • Text summarisation collocations capture the gist of a document (the most typical and salient phrases): be city, have population, people live, county seat, known as, be capital city, large city, city population, close to, area of city, most important, city name, most famous, located on coast (Examples from Seretan, 2011) • Text classification collocations are words which are characteristic of a body of texts • Context-sensitive dictionary look-up Context: The point doesn’t bear any relation to the question we are discussing. Idea: Display the subentry bear – relation instead of the entry for bear (Example from Michiels, 1998) 21 TERMINOLOGY CLARIFICATION 22 Etymology • cum ‘together’ • locare ‘to locate’ (from locus ‘place’) General meaning: collocated things (set side by side) Specific meaning: collocated words in a sentence http://www.collinsdictionary.com Note: In French, two different forms exist: colocation ‘flatsharing’/collocation. 23 One term – two acceptations • Broad acceptation: semantic collocation (doctor – hospital – nurse – …) “Collocation is the cooccurrence of two or more words within a short space of each other in a text.
The usual measure of proximity is a maximum of four words intervening.” (Sinclair 1991:170) • Narrow acceptation: typical syntagm (“conventional way of saying”) “co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern” (Cowie 1978:132) Note: The current literature uses the term co-occurrence to refer to the first acceptation. The term collocation is reserved exclusively for the second acceptation. 24 Collocation definitions 1. Collocations are actual words in habitual company. (Firth, 1968, 182) 2. We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun. (Hausmann, 1989, 1010) 3. a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133) 4. a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988) 5. A collocation is an arbitrary and recurrent word combination. (Benson, 1990) 6. Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170) 7. The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222) 8. recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143) 25 Collocation definitions (cont.) 9. Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5) 10.
A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151) 11. Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507) 12. We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7) 13. A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9) 14. lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76) 26 Features: Unit • Children memorise not only single words, but also groups (chunks) of words. • Collocations are prefabricated units available as blocks (cf. the idiom principle): “The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.” (Sinclair, 1991, 110) • “semi-finished products” of language (Hausmann, 1985, 124); “déjà-vu”. 27 Features: Recurrent, typical • Collocations are actual words in habitual company. (Firth, 1968, 182) • typical, specific and characteristic combination of two words (Hausmann, 1985) • We shall call collocation a characteristic combination of two words […]. (Hausmann, 1989, 1010) • a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133) • A collocation is an arbitrary and recurrent word combination.
(Benson, 1990) • recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143) • A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151) • Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507) • We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7) 28 Features: Arbitrary • typical, specific and characteristic combination of two words (Hausmann, 1985) • A collocation is an arbitrary and recurrent word combination (Benson, 1990) • The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222) • recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143) • Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5) • Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. 
(McKeown and Radev, 2000, 507) • lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76) 29 Features: Unpredictable • “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988): “With an encoding idiom, by contrast, we have an expression which language users might or might not understand without prior experience, but concerning which they would not know that it is a conventional way of saying what it says” (Fillmore et al., 1988, 505) • […] these affinities can not be predicted on the basis of semantic or syntactic rules, but can be observed with some regularity in text (Cruse, 1986) • A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9) 30 Features: Made up of two or more words • Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170) • co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern (Cowie 1978:132) • a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988) • A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.
(Manning and Schütze, 1999, 151) • the components of a collocation can again be collocational themselves: next to the German collocation Gültigkeit haben (n + v), we have allgemeine Gültigkeit haben [lit., ‘general validity have’], with allgemeine Gültigkeit, a collocation (n + a), as a component (Heid, 1994, 232). • In most of the examples, collocation patterns are restricted to pairs of words, but there is no theoretical restriction to the number of words involved (Sinclair, 1991, 170). 31 Summing up… • prefabricated unit • made up of two or more words • recursive • recurrent/typical • arbitrary • unpredictable • partly transparent • syntactically motivated • worth storing in a lexicon • asymmetric (base + collocate) But ultimately, the exact definition of collocations varies according to the application needs: “the practical relevance is an essential ingredient of their definition” (Evert, 2004, 75). 32 THEORETICAL DESCRIPTION 33 Prehistory • Collocations have been known and studied since ancient Greek times (Gitsaki, 1996). • Pedagogical interest in collocations: Harold Palmer (1877–1949): “polylogs”, “known units” Albert Sydney Hornby (1898–1978): Idiomatic and Syntactic English Dictionary (1942) A Learner’s Dictionary of Current English (1948) Advanced Learner’s Dictionary of Current English (1952), Oxford Advanced Learner’s Dictionary (multiple prints) Anthony P. Cowie Peter Howarth Michael Lewis: “islands of reliability” • Linguistic interest in collocations: “groupements usuels” ‘usual combinations’, opposed to “groupements passagers” ‘temporary/free combinations’ (Bally, 1909) “Lexikalische Solidaritäten” (Coseriu, 1967).
lexical solidarity 34 Syntactic characterisation Distinction between lexical and grammatical collocations (Benson et al., 1986) • Lexical collocations involve open-class words only (nouns, verbs, adjectives, most adverbs) most collocations • Grammatical collocations may contain function words (prepositions, conjunctions, pronouns, auxiliary verbs, articles): apathy towards, agreement that, in advance, angry at, afraid that (Examples from Benson et al., 1986) 35 Syntactic characterisation (cont.) Syntactic configurations relevant for collocations: • “We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun.” (Hausmann, 1989, 1010) N-A, N-V, V-N, V-Adv, A-Adv, N-P-N • BBI dictionary (Benson et al., 1986): many types, including: A-N, N-N, N-P:of-N, N-V, V-N, V-P-N, Adv-A, V-Adv, N-P, N-Conj, P-N, A-P, A-Conj • Unrestricted typology: “The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” (Fontenelle, 1992, 222) 36 Semantic characterisation • The collocation is a semantic unit: “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components” (Choueka, 1988) • “the noncompositionality of a string must be considered when assessing its holism” (Moon, 1998, 8) • Is the meaning of a collocation obtained by the composition of the meanings of individual words? 37 Semantic characterisation (cont.) 
• Collocations occupy the grey area of a continuum of compositionality: regular combinations collocations idiomatic expressions transparent opaque • Collocations are partly compositional (Meaning-Text Theory): heavy ‘A’ smoker ‘B’ ‘A B’ B: base – autosemantic (semantic head) A: collocate – synsemantic (semantically dependent) 38 Semantic characterisation (cont.) • “the meaning of a collocation is not a straightforward composition of the meaning of its parts” (Manning and Schütze, 1999, 172–173) white ‘A’ wine ‘B’ ‘A B’ “there is always an element of meaning added to the combination” (1999, 184); The meaning of a collocation like white wine contains an added element of connotation with respect to the connotation of wine and white together. • “the individual words in a collocation can contribute to the overall semantics of the compound” (McKeown and Radev, 2000, 507). 39 Semantic characterisation (cont.) • Easy to decode, difficult to encode: “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988) entertain ‘A’ hope ‘B’ ‘A B’ 40 Collocations vs. idioms collocations collocations idioms “fall somewhere along a continuum between free word combinations and idioms” (McKeown and Radev, 2000, 509) ? idioms collocations idioms “The term collocation will be used to refer to sequences of lexical items which habitually co-occur, but which are nonetheless fully transparent in the sense that each lexical constituent is also a semantic constituent.” (Cruse, 1986, 40) 41 Collocations vs. idioms (cont.) collocations collocations ? idioms “I will use the term collocation as the most general term to refer to all types of fixed combinations of lexical items; in this view, idioms are a special subclass of collocations” (van der Wouden, 1997, 9). idioms idioms collocations “Idiomaticity applies to encoding for collocations, but not to decoding” (Fillmore et al., 1988). 42 Collocations vs. 
other types of MWEs • Multi-word expressions (MWE) cover a broad spectrum of phenomena: Named entities European Union Compounds wheel chair Verb-particle constructions give up Light-verb constructions take a bath ... Note: While theoretically appealing, fine-grained distinctions are less important in practice. All expressions share the same fate: lexicon → special treatment. They are equally important; what changes is their share in language. 43 Predominance of collocations • “collocations make up the lion’s share of the phraseme [MWE] inventory, and thus deserve our special attention” (Mel’čuk 1998, 24). • “no piece of natural spoken or written English is totally free of collocation” (Lea and Runcie, vii) • “In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made” (Kjellmer 1987:140) Les députés réformistes surveilleront de près les mesures que prendra le gouvernement au sujet du rôle que jouera le Canada dans le maintien de la paix […] (Hansard Corpus) 44 Quiz agreement 45 46 PRACTICAL ACCOUNTS 47 Basic architecture Preprocessing → Candidate selection → Candidate ranking 48 (Collaborative) Synopsis 49 English • Choueka (1988): Looking for needles in a haystack … pre-processing: - (plain text) candidates: sequences of adjacent words, up to 7 words long ranking: raw frequency • Kjellmer (1994): A Dictionary of English Collocations plain text sequences of adjacent words raw frequency • Justeson and Katz (1995): Technical terminology: Some linguistic properties and an algorithm for identification in text NP chunking (patterns containing N, A, P) n-grams raw frequency EX: central processing unit 50 English (cont.) • Church and Hanks (1990): Word association norms, mutual information, and lexicography preprocessing: POS-tagging candidates: adjacent pairs (phrasal verbs) ranking: MI EX: allude to (P) vs. tend to (infinitive marker) • Church et al.
(1989): Parsing, word associations and typical predicate-argument relations shallow parsing predicate-argument relations (S-V-O) MI, t-test EX: drink beer/tea/cup/coffee 51 English (cont.) • Smadja (1993): Retrieving collocations from text: Xtract z-score POS-tagging “retains words (or parts of speech) occupying a position with probability greater than a given threshold” (p. 151) rigid noun phrases EX: stock market, foreign exchange, New York Stock Exchange phrasal templates EX: common stocks rose *NUMBER* to *NUMBER* predicative collocations EX: index [...] rose, stock [...] jumped, use [...] widely parser used as postprocessing (results validation) Note: First large-scale evaluation, with professional lexicographers. Impact of parsing: precision rises from 40% to 80%. 52 English (cont.) • Dagan and Church (1994): Termight: Identifying and translating technical terminology POS-tagging NP chunking (NPs defined by regular expressions over tags) ranking: frequency of the head word in document bilingual – word alignments EX: default paper size, software settings • Lin (1998): Extracting collocations from text corpora dependency parsing (sentences shorter than 25 words) A-N, N-N, N-P-N, S-V, V-O version of MI ("adjusted") 53 English (cont.) • Pearce (2001): Synonymy in collocation extraction data already preprocessed (syntactic treebank) noun+modifier pairs ranking: Web frequencies EX: baggage allowance, luggage compartment • Dias (2003): Multiword unit hybrid extraction POS-tagging sequences of words/POS tags Mutual Expectation EX: [Blue Mosque], [been able to], [can still be] • Orliac and Dillinger (2003): Collocation extraction for machine translation full parsing (but cannot handle relative constructions) MI, log-likelihood ratio EX: download/save/locate file 54 English (cont.) • Kilgarriff et al.
(2004): The Sketch Engine shallow parsing syntactic relations identified on the basis of regex over POS tags version of MI 55 German • Breidt (1993): Extraction of V-N-collocations from text corpora POS tagging sliding window: V-N pairs in a 5-word span (N precedes V) MI, t test EX: [in] Betracht kommen (‘to be considered’) • Krenn (2000): The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations POS tagging and shallow parsing P-N-V (i.e., PP-V) combinations MI, Dice coefficient, LLR, entropy, lexical keys (list of support verbs) EX: zur Verfügung stellen (lit., at the availability put, ‘make available’), am Herzen liegen (lit., at the heart lie, ‘have at heart’). 56 German (cont.) • Krenn and Evert (2001): Can we do better than frequency? A case study on extracting PP-verb collocations POS tagging, chunking PP-V (PP + V in the same sentence; inflected forms) frequency, t test, LLR, chi-square, MI EX: in Betrieb gehen/nehmen (’go/put into operation’) • Evert and Krenn (2001): Methods for the qualitative evaluation of lexical association measures also A-N pairs, POS tagging, same ranking measures EX: Rotes Kreuz (‘Red Cross’) 57 German (cont.) • Zinsmeister and Heid (2003): Significant Triples: Adjective+Noun+Verb Combinations full parsing A-N-V combinations LLR EX: (eine) klare Absage erteilen (lit.
give a clear refusal, ‘refuse resolutely’) • Schulte im Walde (2003): A Collocation Database for German Verbs and Nouns as above, but many syntactic configurations EX: Zeichen ‘symbol’ – Freundschaft ‘friendship’ • Wermter and Hahn (2004): Collocation extraction based on modifiability statistics POS tagging, shallow parsing PP-V combinations limited modifiability criterion (high relative frequency of collocate) EX: unter [stark/schwer] Druck geraten ‘to get under [strong/heavy] pressure’ 58 French • Lafon (1984): Dépouillements et statistiques en lexicométrie plain text directed/undirected pairs z-score • Bourigault (1992): Surface grammatical analysis for the extraction of terminological noun phrases POS tagging, chunking, shallow parsing NPs (terms) EX: disque dur ‘hard disk’, station de travail ‘workstation’ 59 French (cont.) • Daille (1994): Approche mixte pour l’extraction automatique de terminologie… lemmatization, POS tagging, shallow parsing (Finite State Automata) NPs: N-A, N-N, N-à-N, N-de-N, N-P-D-N many AMs: e.g., cubic MI, LLR, raw frequency EX: réseau national à satellites • Jacquemin et al. (1997): Expansion of multi-word terms for indexing and retrieval using morphology and syntax POS tagging, shallow parsing (regex over POS tags) combinations in a 10-word window; syntactic relations “A ±5-word window is considered as sufficient for detecting collocations in English (Martin, Al, and Van Sterkenburg, 1983). We chose a window-size twice as large because French is a Romance language with longer syntactic structures” EX: fruits et agrumes topicaux huile de palme ‘palm oil’ – palmier à huile ‘palm tree’ 60 French (cont.) • Goldman et al. (2001): Collocation extraction using a syntactic parser full parsing syntactic relations, many configurations LLR • Tutin (2004): Pour une modélisation dynamique des collocations dans les textes shallow parsing (INTEX) syntactic relations, many configurations EX: angoisse – saisir, lit. 
fear seize • Archer (2006): Acquisition semi-automatique de collocations … parsing verb-adverb version of MI EX: changer radicalement ‘to change radically’ 61 Other languages E.g., • Czech: Pecina (2008) • Dutch: Villada Moirón (2005) • Italian: Calzolari and Bindi (1990), Basili et al. (1994) • Chinese: Wu and Zhou (2003) • Korean: Kim et al. (1999) • Japanese: Ikehara et al. (1995) • Romanian: Todirascu et al. (2008) 62 (Collaborative) Synopsis 63 64 BEHIND THE CURTAINS: MATHS, STATISTICS 65 Extraction systems: What is behind? FipsCoView 66 Extraction procedure • Input: Text corpus • Output: Collocations (typical combinations) Procedure: 1. Candidate selection 2. Candidate ranking Many options: 1. Candidate selection: which criteria? – n-grams: what length? – skip-grams: what distance? directed or not? – syntactic relations: which tools? (shallow/dependency/full parser?) – frequency threshold: yes/no? if yes, which threshold? (2? 5? 10? more?) 2. Candidate ranking: which criteria? – statistically significant (more frequent than expected by chance)? – semantic unit, partly transparent? – arbitrary? Note: Not all of these criteria can easily be put into practice; most of them are not. There is plenty of room for future work. 67 Ranking based on statistical significance • Statistical significance (in inferential – as opposed to descriptive – statistics): An event is statistically significant if it is not due to chance alone. In our case, the event is the co-occurrence of the component words of a candidate in language: e.g., great – interest. • Statistical hypothesis tests tell whether an event is statistically significant or not. • Null hypothesis: the default assumption is that the event is due to chance. In our case, the null hypothesis is that great and interest occur together by chance (“groupements passagers” – Bally, 1909).
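The selection-then-ranking procedure described above can be made concrete in a few lines of Python. This is an illustrative sketch only: the sample tokens, the window size and the frequency threshold are arbitrary choices for demonstration, not values prescribed here.

```python
from collections import Counter

def select_candidates(tokens, window=2, min_freq=2):
    """Candidate selection: collect ordered pairs of tokens co-occurring
    within a span of `window` tokens (window=2 keeps adjacent bigrams),
    then apply a frequency threshold."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            pairs[(w, v)] += 1
    return {p: f for p, f in pairs.items() if f >= min_freq}

tokens = "great interest in the project and great interest in the results".split()
cands = select_candidates(tokens, window=2, min_freq=2)
# ('great', 'interest'), ('interest', 'in') and ('in', 'the') survive the threshold
```

Candidate ranking, the second step, would then score each surviving pair with an association measure.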
68 Great – interest: observed co-occurrences 69 Ranking based on statistical significance • Method: Comparing chance-expected (E) against observed (O) frequencies of occurrence of the event. The larger the difference, the more significant the event. In our case, O: How often did we see great and interest together (in the candidate dataset)? E: How often would we expect two words like great and interest to occur together? Consider that great can be replaced by a lot of other words: big, special, major… Similarly, the place of interest can be taken by words like fantasy, experience, work … How can we compute the probability of seeing great and interest together, under the assumption that they are independent (→ chance-expected frequency)? If we know P(A) – the probability of seeing great in our dataset, and P(B) – the probability of seeing interest in our dataset, then according to the formula for computing probabilities of independent events, the probability of seeing great and interest together is the product of the individual probabilities. 
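The chance-expected frequency just described reduces to a one-line computation; the counts below are made-up toy numbers, not corpus data:

```python
def expected_joint(f_u, f_v, N):
    """Chance-expected co-occurrence frequency under independence:
    E = N * P(u) * P(v) = N * (f_u / N) * (f_v / N) = f_u * f_v / N."""
    return f_u * f_v / N

# Toy counts (hypothetical): great occurs 30 times, interest 20 times,
# in a dataset of N = 1000 candidate pairs.
E = expected_joint(30, 20, 1000)  # -> 0.6 co-occurrences expected by chance
# An observed count well above E (say, 8) points to a positive association.
```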
70 Ranking based on statistical significance • Individual probabilities: P(A) – the probability of seeing great in our dataset: P(A) = number of times great occurs in the dataset / size of dataset P(B) – the probability of seeing interest in our dataset: P(B) = number of times interest occurs in the dataset / size of dataset • Joint probability – the probability of seeing both great and interest: under independence, P(A, B) = P(A) × P(B) • Chance-expected frequency (E): joint probability × size of dataset 71 Contingency table – Observed values: a = (great, interest), b = (great, ¬interest), c = (¬great, interest), d = (¬great, ¬interest). In general: Two random variables (a set of possible different values), X – first position in a candidate pair, Y – second position in a candidate pair; ¬ means ‘not’; a – joint frequency; N – sample size; R – row marginal, C – column marginal. Sample: data (our candidate set) selected from a population (corpus). 72 Contingency table – Expected values • Expected values under the null hypothesis: E11 = N × (R1/N) × (C1/N) = R1 × C1 / N • Sample computation: the expected joint frequency (first cell) is the sample size × the individual probability of seeing u in the first position × the individual probability of seeing v in the second position. 73 Comparing O and E • Question: Is the difference large? Idea: Take O – E or log O/E, because log O/E = log O – log E. The result of the comparison might be either positive or negative. The test is a two-tailed test (≠). • Question: Are the observed frequencies higher than chance-expected ones? The test is a one-tailed test (>). If the answer is yes, we have identified a positive association. • Question: Are the observed frequencies lower than chance-expected ones? The test is a one-tailed test (<). If the answer is yes, we have identified a negative association.
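The three questions above amount to comparing O against E and reading off the sign; a minimal helper (O and E are taken as given, the numbers in the test below are hypothetical):

```python
from math import log

def association_direction(O, E):
    """Compare observed (O) vs chance-expected (E) frequencies:
    O > E -> positive association, O < E -> negative association.
    Returns the direction together with O - E and log O/E."""
    diff = O - E
    ratio = log(O / E)  # log O/E = log O - log E
    if diff > 0:
        return "positive", diff, ratio
    if diff < 0:
        return "negative", diff, ratio
    return "none", diff, ratio
```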
74 Popular association measures • AM: “a formula that computes an association score from the frequency information in a pair type’s contingency table” (Evert, 2004, 75) Assumptions on data distribution: t test – normal; z-score – normal; chi-square – (none stated); log-likelihood ratio – binomial. (Explicit formulas are given on the ‘AMs in MS Access SQL’ slide below.) 75 Ranking based on mutual information • Pointwise mutual information (MI, or PMI): PMI(u, v) = log2 [ P(u, v) / (P(u) P(v)) ] = the information about u provided by the occurrence of v = the information about v provided by the occurrence of u 76 Comments on AM applicability • Lexical data has a Zipfian distribution, with a small number of highly frequent words and a high number of infrequent words. Most tests make assumptions on data distribution which are wrong. The application of the t test and z-score to lexical data is often contested (Kilgarriff 1996, Dunning 1993, Evert 2004). • AMs are less reliable for infrequent data. Minimal suggested frequency: 5 (Church and Hanks, 1990). Some AMs overemphasise rare events: PMI, chi-square. • AMs are not reliable for small sample sizes (N): z-score, chi-square. • Some AMs overemphasise common events: chi-square. • Results vary with the experimental setting: type of candidates, domain, amount of data excluded by the frequency threshold, linguistic preprocessing… (Evert and Krenn, 2005). • Plain frequency is already a competitive AM. • There is no single all-purpose AM. 77 Exercise • Some values in the contingency table are more difficult to compute than others. For instance, a, N, R1 and C1 are relatively easy to compute by looking for occurrences of u and v together or in isolation, and by counting the items in the dataset (N). But what about b, c, and d? Can you give formulas for computing b, c, and d depending on a, N, R and C? Example: b = R1 – a c = ____________________________ d = ____________________________ 78 Hands-on session • Build a minimally viable collocation extractor (well, a candidate ranking module; we assume candidate data is already available).
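Before turning to the MS Access queries, the same computations can be sketched in Python. The counts are hypothetical; the AM formulas follow the explicit versions given on the ‘AMs in MS Access SQL’ slide, with the usual convention 0·log 0 = 0 for the log-likelihood ratio:

```python
from math import log, sqrt

def xlogx(x):
    # Convention: 0 * log(0) = 0, so zero cells do not break the LLR sum.
    return x * log(x) if x > 0 else 0.0

def cells_from_counts(a, R1, C1, N):
    """Recover the full 2x2 table from the easily counted values:
    b = R1 - a, c = C1 - a, d = N - R1 - C1 + a."""
    return a, R1 - a, C1 - a, N - R1 - C1 + a

def ams(a, b, c, d):
    """Association scores for one candidate pair: t test, z-score,
    chi-square, log-likelihood ratio, and PMI (base 2)."""
    N = a + b + c + d
    R1, C1 = a + b, a + c
    E = R1 * C1 / N  # chance-expected joint frequency
    return {
        "t":    (a - E) / sqrt(a),
        "z":    (a - E) / sqrt(E),
        "chi2": N * (a * d - b * c) ** 2
                / ((a + b) * (a + c) * (b + d) * (c + d)),
        "llr":  2 * (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                     - xlogx(a + b) - xlogx(a + c)
                     - xlogx(b + d) - xlogx(c + d) + xlogx(N)),
        "pmi":  log(a * N / (R1 * C1), 2),
    }

# Hypothetical counts: a = 8 joint occurrences, R1 = 30, C1 = 20, N = 1000.
scores = ams(*cells_from_counts(8, 30, 20, 1000))
```

Note that cells_from_counts also spells out one possible solution to the exercise above.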
• Data: lex, key – lexeme index and key for a word, e.g., 111011778 decision
• Candidate dataset: provided in a database, with the table structure <lex1, lex2, key1, key2, type, prep_key>, where:
– lex1, key1, lex2, key2 – the two items of a candidate pair
– type – the syntactic type
– prep_key – the intervening preposition, if any (e.g., comply with duty)
• Method: implement queries in MS Access for computing:
– the dataset size N
– the joint frequencies a
– the row marginals R1
– the column marginals C1
– all contingency values a, b, c, d
– the AM formulas

79

AMs in MS Access SQL – "even more explicit" formulas, using only the cell counts a, b, c, d:

t test:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)*(a^(1/2)))

z-score:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)^(1/2) * ((a+b)*(a+c))^(1/2))

chi-square:
((a+b+c+d)*(a*d - b*c)^2) / ((a+b)*(a+c)*(b+d)*(c+d))

log-likelihood ratio:
2*(a*log(a) + b*log(b) + c*log(c) + d*log(d) - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d) + (a+b+c+d)*log(a+b+c+d))

PMI:
log((a*(a+b+c+d)) / ((a+b)*(a+c))) / log(2)

80

Resuming …

81

Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook

82

WRAP UP AND OUTLOOK

83

Word sociology
• Do we know more about it and how to analyse it?
• About how it has been approached in theoretical and computational linguistics?
• About why it is important and which applications can exploit this type of knowledge?
• About the types of constructions dealt with in practical work?
• … the underlying language technology?
• … the portability across languages?
• … the computational work behind association strength quantification?
• Have you identified less explored, potential areas of further research?

84

A look at other multi-word expressions
• Those which were more studied in the literature:
– Idioms: Rosamund Moon. 1998. Fixed Expressions and Idioms in English: A Corpus-Based Approach. Clarendon Press, Oxford.
– Compounds: Gaston Gross. 1996. Les expressions figées en français.
OPHRYS, Paris.
• And those on which empirical work was particularly focused:
– Idioms: Christiane Fellbaum (ed.). 2007. Idioms and Collocations: Corpus-based Linguistic and Lexicographic Studies. Continuum, London.
– Light-verb constructions: Afsaneh Fazly. 2007. Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph.D. thesis, University of Toronto.
– Verb-particle constructions: e.g., Baldwin and Villavicencio (2002), Bannard et al. (2003).
– Nominal compounds: Christian Jacquemin. 2001. Spotting and Discovering Terms through NLP. MIT Press, Cambridge, MA.

85

Selected readings: Books
• Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
• Thierry Fontenelle. 1997. Turning a Bilingual Dictionary into a Lexical-Semantic Database. Max Niemeyer Verlag, Tübingen.
• Sylviane Granger and Fanny Meunier (eds.). 2008. Phraseology: An Interdisciplinary Perspective. John Benjamins, Amsterdam/Philadelphia.
• Francis Grossmann and Agnès Tutin (eds.). 2003. Les collocations : analyse et traitement. Travaux et recherches en linguistique appliquée. De Werelt, Amsterdam.
• Pavel Pecina. 2008. Lexical Association Measures: Collocation Extraction. Ph.D. thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.
• John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
• Michael Stubbs. 2002. Words and Phrases: Corpus Studies of Lexical Semantics. Blackwell, Oxford.
• Ton van der Wouden. 1997. Negative Contexts: Collocation, Polarity, and Multiple Negation. Routledge, London and New York.
• María Begoña Villada Moirón. 2005. Data-driven Identification of Fixed Expressions and Their Modifiability. Ph.D. thesis, University of Groningen.

86

Selected readings: Chapters/Articles
• Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
• Beatrice Daille. 1994.
Study and Implementation of Combined Techniques for Automatic Extraction of Terminology. In Proceedings of the Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 29–36, Las Cruces, New Mexico, U.S.A.
• Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
• Ulrich Heid. 1994. On ways words work together – research topics in lexical combinatorics. In Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX '94), pages 226–257, Amsterdam, The Netherlands.
• Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.
• Kathleen R. McKeown and Dragomir R. Radev. 2000. Collocations. In Robert Dale, Hermann Moisl, and Harold Somers, editors, A Handbook of Natural Language Processing, pages 507–523. Marcel Dekker, New York, U.S.A.
• Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.
• Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

87

Resources
• UCS toolkit, by Stefan Evert
• mwetoolkit, by Carlos Ramisch
• Ngram Statistics Package (NSP), by Ted Pedersen et al.

88

Events
• SIGLEX-MWE: Workshops on Multiword Expressions
• PARSEME COST Action

89

90

References: http://www.issco.unige.ch/en/staff/seretan/data/ranlp/tutorial/RANLP-2013-tutorial-references.pdf

91