Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Computational Approaches to Lexical Collocations Stefan Evert IMS Stuttgart Brigitte Krenn ÖFAI Vienna Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Collocations IMS • • • • What kind of animal are they? What are they good for? How can we find them? How likely are we to find them? Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Schedule IMS • Monday: Introduction to Collocations • Tuesday: Extraction of Coocurrence Data • Wednesday: Association Measures (AMs) • Thursday: Evaluation of AMs • Friday: Significance of Result Differences Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Collocations Terminology & Definitions IMS • Firth's Notion of Collocation ``Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words.'' ``One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night.'‘ (Firth 1957) Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Collocations Terminology & Definitions IMS • Choueka's Notion of Collocation "A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components." (Choueka 1988) Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Collocations Terminology & Definitions IMS Summing up, we have 2 different views 1. Firth: collocations as lexical proximities in text 2. Choueka: collocations as syntactic and semantic units, semantic irregularity Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS • Idioms preferably used in the English literature, e.g. Bar-Hillel:55, Hockett:58, Katz;Postal:63, Healey:68,Makkai:72. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS • Phraseological Units, (Ge.: Phraseologismen) a widely used generic term in the German literature, e.g. BurgerEA:82, Fleischer:82. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS • Light-verb Constructions, Support-verb Constructions refer to very particular phenomena, cross-categorisation with idioms Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS Terms stemming from Computational Linguistics, e.g. • multi-word lexemes, e.g. Tschichold:97, BreidtEA:96. • multi-word expressions, e.g. Segond;Tapanainen:95 • non-compositional compounds, e.g. Melamed:97 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS • Collocation e.g, red herring, kick the bucket, dark-night, dog-bark • Base, Collocate elements if a collocation, the base selects its collocates, it is not always clear to define what the base is, e.g. red-herring, kick-bucket Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Terminology IMS • Collocation Phrase grammatical unit which is constituted by a base and its collocates or by two or more collocates, e.g. a red herring, kick the bucket, unexpectedly kick the bucket Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summing up IMS • Terminology and definitions are influenced by • different linguistic traditions • computational linguistic applications Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summing up IMS • The phenomena covered are manifold: • • • • lexical proximities in texts syntactic and semantic units semantic irregularity syntactic rigidity Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Observations: Generativity IMS • Collocations range from completely fixed to syntactically flexible constructions. • Syntactic restrictions usually coincide with semantic restrictions and thus are indicators for the degree of lexicalization of a particular word combination. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Observations: Generativity IMS • Particular word combinations are associated with specific restrictions that cannot be inferred from standard rules of grammar and thus need to be stored together with the collocation. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Observations: Recurrence IMS • Within corpora, the proportion of collocations is larger among highly recurrent word combination than among infrequent ones. • Recurrence is an effect of lexicalization. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Observations: Idiomaticity IMS • Semantic opacity is not sufficient for the definition of collocations as there exists a variety of conventionalized word combinations that range from • fully compositional ones like Hut aufsetzen (`put on a hat'), Jacke anziehen (`put on a jacket') to • semantically opaque ones like ins Gras beissen (`bite into the grass' literal meaning, `die' idiomatic meaning). Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Observations: Words, Multi-words or Phrases IMS • Collocations can be • word level phenomena • phrase level phenomena (collocation phrase) • Collocation phrases consist of the lexically determined words (collocates) only or contain additional lexically underspecified material. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Word-level Collocations IMS • Adjective- and Adverb-Like Collocations • nichts desto trotz (`nonetheless') adverb • fix und fertig (`exhausted') adjective • Preposition-Like Collocations • im Lauf(e), im Zuge (`during') • an Hand (`with the help of') Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Word-level Collocations IMS • Noun-Like Collocations • Rotes Kreuz (Red Cross) • Wiener Sängerknaben (Vienna choir boys) • Hinz und Kunz (`every Tom, Dick and Harry') • Sequences where the nouns are duplicated • Schulter an Schulter (shoulder to shoulder), • Kopf an Kopf (neck and neck) Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Phrase-level Collocations IMS • Modal constructions • sich (nicht) lumpen lassen (`to splash out') • Verb-object combinations • übers Ohr hauen (`take somebody for a ride') • unter die Lupe nehmen (`take a close look at') • zum Vorschein bringen (`bring something to the light') • des Weges kommen (`to approach') • Lügen strafen (`prove somebody a liar') Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Phrase-level Collocations IMS • Copula constructions • guten Glaubens sein (`be in good faith') • auf Draht sein (`be on the ball') • Proverbs • Morgenstund hat Gold im Mund (morning hour has gold in the mouth, ‘the early bid catches the worm’) • wissen, wo der Barthel den Most holt (know where the Barthel the cider fetches, `know every trick in the book') Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summing up IMS • Structural dependency • the collocates of a collocation are syntactic dependents, thus knowledge of syntactic structure is a precondition for accurate collocation identification. • Syntactic context • may help to discriminate literal and collocational readings, see for instance im Lauf, im Zug where a genitive to the right is a strong indicator for collocational reading. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summing up IMS • Markedness • morphologically or syntactically marked constructions like seemingly incomplete syntactic structure or archaic e-suffix are suitable indicators for collocations, see im Laufe, im Zuge for e-suffix and zu Recht, an Hand for incomplete syntactic structures. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summing up IMS • Single-word versus multi-word units • single-word occurrences of word combinations indicate word-level collocations, see for instance zu Recht, zurecht. • Syntactic rigidity • is an important indicator for collocations see for instance Hinz und Kunz, an und für sich, fix und fertig, Kopf an Kopf. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Identification of Collocations: Operationalizable Criteria IMS • over proportionally high recurrence of collocational word combinations compared to noncollocational word combinations in corpora • lexical determination of the collocates of a collocation • collocations constitute grammatical units • grammatical restrictions in the collocation phrases Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams IMS • numerical span • typically AMs work on bi-grams , i.e., all <wi, wj> pairs within a certain span are considered • <wi,wi> pairs are not considered! • e.g.: wi-3 ... wi-1 wi wi+1 ... wi+3 (span size is 6) Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams IMS • grammatical span • e.g. we assume that collocational relations do not exceed sentence boundaries i.e., span size = sentence size Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS • huge amounts of cooccurrence data need to be managed this requires special algorithms (e.g. Yamamoto,Church 2001) Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS • definition of span size is crucial If the span size is kept small, it is unlikely to properly cover nonadjacent collocates of structurally flexible collocations. Enlarging the span size leads to an increase of candidate collocations including an increase of noisy data which need to be discarded in a further processing step. Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS High amount of noise in the collocation data due to • inappropriate span size Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS High amount of noise in the collocation data due to • over-proportional frequency of function words within texts use stop word lists Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS High amount of noise in the collocation data due to • insensitivity to punctuation use a sentence as the largest unit within which the collocates of a collocation may occur Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS High amount of noise in the collocation data due to • insensitivity to parts-of-speech knowing parts-of-speech allows a large number of syntactically invalid n-grams to be excluded beforehand Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Positional N-Grams: Problems IMS High amount of noise in the collocation data due to • insensitivity to syntactic structure further improvement of the appropriateness of the collocation candidates selected is achieved by the availability of structural and/or dependency information Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Proposal IMS • (Partial) replacement of • positional n-grams by • relational n-grams • What does it imply? • In which cases do we want/need it? Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien From Positional to Relational N-Grams IMS • positional with part-of-speech, eg. (Evert,Kermes 2003) • <Adj,N> pairs within certain span Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien From Positional to Relational N-Grams IMS • positional with part-of-speech and lexical knowledge, e.g. Breidt 93 • sentence final <noun,past participle> pairs • past participles must be part of a list of 16 German support verbs e.g. kommen, bringen, stehen, stellen, ... Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien From Positional to Relational N-Grams IMS • partially relational, e.g. Krenn 2000 • extraction of PN-combinations from PPs • extraction of main verbs • combination of PN-pairs and verbs co-occurring in a sentence Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien From Positional to Relational N-Grams IMS • Result • a theoretical maximum of PNV combinations, i.e., • verbs are duplicated in sentences that contain more than one PP, • PPs are duplicated in sentences where more than one main verb is found. This has effects on counting! Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Relational N-Grams IMS Assumption The collocates of a collocation are syntactically related! Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Relational N-Grams IMS Requirements • part-of-speech tagging • partial parsing (phrase chunking) • full parsing Evert,Kermes 2003 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Relational N-Grams IMS Problems • full parsing of free text is error prone • there are cases where collocation extraction from fully parsed text is less accurate Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien Summary IMS • Use application dependent notion of collocation as opposed to Firth/Choueka • Extract collocations from relational data (relational n-grams) • Interested in collocation phrases and their collocates • Consider grammatically homogenous data (e.g. Adj-N, PP-V) • Recurrence/cooccurrence frequency is main criterion for collocation extraction (statistical approaches)