RANLP tutorial, September 2013, Hissar, Bulgaria
The Analytics of Word Sociology
Violeta Seretan
Department of Translation Technology
Faculty of Translation and Interpreting
University of Geneva
8 September 2013
Keywords
• computer science
• linguistics
• computational linguistics
• statistics
• inferential statistics
• syntactic parsing
• dependency parsing
• shallow parsing
• chunking
• POS-tagging
• lemmatization
• tokenisation
• type vs. token
• distribution
• Zipf law
• hypothesis testing
• statistical significance
• null hypothesis
• association measure
• collocation extraction
• mutual information
• log-likelihood ratio
• entropy
• contingency table
• co-occurrence
• collocation
• extraposition
• long-distance dependency
• n-gram
• precision, recall, F-measure
2
Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook
3
Objectives
• Understand the concept of collocation and its relevance for the fields of
linguistics, lexicography and natural language processing.
• Become aware of the definitional and terminological issues, the description of
collocations in terms of semantic compositionality, and the relation with other
multi-word expressions.
• Understand the basic architecture of a collocation extraction system.
• Become familiar with the most influential work in the area of collocation
extraction.
• Get (more than) an overview of the underlying technology – in particular, the
statistical computation details.
4
INTRODUCTION
5
Social Analytics
“Measuring + Analyzing + Interpreting interactions and associations between
people, topics and ideas.” (http://en.wikipedia.org/wiki/Social_analytics)
http://www.submitedge.com
http://irevolution.net
6
You shall know someone …
… by the company they keep
http://flowingdata.com
7
Word Sociology
• Barnbrook (1996) Language and Computers, Chapt. 5
«The sociology of words»:
– collocation analysis: «automatic quantitative analysis and
identification of word patterns around words of interest»
[Diagram: a ‘node’ word at the centre, linked to its collocates: word 1, word 2, word 3, word 4, word 5, …, word n]
8
You shall know a word …
… by the company it keeps! (Firth, 1957)
[Figure: ‘node’ word = ? (to be guessed from the collocates shown)]
Seretan and Wehrli (2011): FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora
9
Collocation analysis: Key concepts
• Node word:
the word under investigation
• Collocate:
the “word patterns” around the node word
• Association measure (AM):
Evert (2004): “a formula that computes an association score from the frequency
information […]”
• Collocation extraction [from corpora]:
the task of automatically identifying genuine associations of words in corpora
10
Relevance for Linguistics
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface,
Meaning-Text Theory, semantic prosody, …
Words are “separated in meaning at the collocational level” (Firth, 1968, 180)
Word collocation is one of the most important forms of text cohesion, the property
that determines whether a passage of language is “a unified whole or is just a
collection of unrelated sentences” (Halliday and Hasan, 1976, 1)
Collocations are found at the intersection of lexicon and grammar
"semi-preconstructed phrases that constitute single choices, even though they
might appear to be analysable into segments” (Sinclair, 1991, 110);
Collocations [“idioms of encoding”] are expressions “which are larger than words,
which are like words in that they have to be learned separately as individual whole
facts about the language" (Fillmore et al., 1988, 504)
“We acquire collocations, as we acquire other aspects of language,
through encountering texts in the course of our lives” (Hoey, 1991, 219).
11
Relevance for Linguistics (cont.)
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface,
Meaning-Text Theory, semantic prosody, …
In the Meaning-Text Theory (e.g., Mel’čuk, 1998), collocations are described by means
of lexical functions (associating meaning and the utterance expressing that meaning):
Magn(problem) = big
Magn(rain) = heavy
Magn(injury) = serious
Collocations are often between words which share a positive or a negative
connotation (semantic prosody – e.g., Louw, 1993).
FipsCoView
12
Relevance for Lexicography
• Dictionaries of co-occurrences/collocations/cum-corpus
“Collocation is the way words combine in a language to produce natural-sounding
speech and writing” (Lea and Runcie, 2002)
“Advanced learners of second language have great difficulty with nativelike collocation
and idiomaticity. Many grammatical sentences generated by language learners sound
unnatural and foreign.” (Ellis, 2008)
Benson et al., 1986
Sinclair, 1987
OCDSE (Lea and Runcie, 2002)
13
Relevance for Lexicography (cont.)
http://dictionary.reverso.net/english-cobuild
14
Relevance for Lexicography (cont.)
• Dictionaries of co-occurrences/collocations/cum-corpus
Beauchesne, 2001
Charest et al., 2012
15
Relevance for Natural Language Processing
• Machine translation
EN ask a question – FR poser `put’ une question – ES hacer `make’ una pregunta
“collocations are the key to producing more acceptable output”
(Orliac and Dillinger, 2003)
• Natural language generation
EN to brush one’s teeth – * to wash one’s teeth
“In the generation literature, the generation of collocations is regarded as a problem”
(Heid and Raab, 1989)
“However, collocations are not only considered useful, but also a problem both in
certain applications (e.g. generation, […] machine translation […])”
(Heylen et al., 1994)
16
Relevance for Natural Language Processing (cont.)
• Syntactic parsing
[Figure: two parses compared, incorrect (*) vs. correct]
• Word sense disambiguation
break: about 50 senses
record: about 10 senses
to break a world record: 1 sense
verb-object collocation break – record
“a polysemous word exhibits essentially only one sense per collocation” (Yarowsky, 2003)
17
Senses of break (partial)
18
Senses of record
19
Relevance for Natural Language Processing (cont.)
• OCR
distinguish between homographs:
terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage
(Examples from Yarowsky, 2003)
• Speech recognition
distinguish between homophones:
aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal
(Examples from Yarowsky, 2003)
(Examples from Church and Hanks, 1990)
20
Relevance for Natural Language Processing (cont.)
• Text summarisation
collocations capture the gist of a document (the most typical and salient phrases):
be city, have population, people live, county seat, known as, be capital city, large city, city
population, close to, area of city, most important, city name, most famous, located on coast
(Examples from Seretan, 2011)
• Text classification
collocations are words which are characteristic of a body of texts
• Context-sensitive dictionary look-up
Context: The point doesn’t bear any relation to the question we are discussing.
Idea: Display the subentry bear – relation instead of the entry for bear
(Example from Michiels, 1998)
21
TERMINOLOGY CLARIFICATION
22
Etymology
• cum ‘together’
• locare ‘to locate’ (from locus ‘place’)
General meaning: collocated things (set side by side)
Specific meaning: collocated words in a sentence
http://www.collinsdictionary.com
Note: In French, two different forms exist: colocation ‘flatsharing’/collocation.
23
One term – two acceptations
• Broad acceptation: semantic collocation (doctor – hospital – nurse – …)
“Collocation is the cooccurrence of two or more words within a short space of each
other in a text. The usual measure of proximity is a maximum of four words
intervening.” (Sinclair 1991:170)
• Narrow acceptation: typical syntagm (“conventional way of saying”)
“co-occurrence of two or more lexical items as realizations of structural elements
within a given syntactic pattern” (Cowie 1978:132)
Note:
The current literature uses the term co-occurrence to refer to the first acceptation.
The term collocation is reserved exclusively for the second acceptation.
24
Collocation definitions
1. Collocations are actual words in habitual company. (Firth, 1968, 182)
2. We shall call collocation a characteristic combination of two words in a structure like
the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d)
verb + adverb; e) adjective + adverb; f) noun + (prep) + noun. (Hausmann, 1989, 1010)
3. a sequence of words that occurs more than once in identical form [...] and which is
grammatically well structured (Kjellmer, 1987, 133)
4. a sequence of two or more consecutive words, that has characteristics of a syntactic
and semantic unit whose exact and unambiguous meaning or connotation cannot be
derived directly from the meaning or connotation of its components (Choueka, 1988)
5. A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
6. Collocation is the cooccurrence of two or more words within a short space of each
other in a text. (Sinclair, 1991, 170)
7. The term collocation refers to the idiosyncratic syntagmatic combination of lexical
items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222)
8. recurrent combinations of words that co-occur more often than expected by chance
and that correspond to arbitrary word usages (Smadja, 1993, 143)
[Figure: scale from Broad to Narrow, with positions 1–15]
25
Collocation definitions (cont.)
9. Collocation: idiosyncratic restriction on the combinability of lexical items (van der
Wouden, 1997, 5)
10. A collocation is an expression consisting of two or more words that correspond to some
conventional way of saying things. (Manning and Schütze, 1999, 151)
11. Collocations [...] cover word pairs and phrases that are commonly used in language, but
for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000,
507)
12. We reserve the term collocation to refer to any statistically significant cooccurrence,
including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
13. A collocation is a word combination whose semantic and/or syntactic properties cannot
be fully predicted from those of its components, and which therefore has to be listed in
a lexicon. (Evert, 2004, 9)
14. lexically and/or pragmatically constrained recurrent co-occurrences of at least two
lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
[Figure: scale from Broad to Narrow, with positions 1–15]
26
Features: Unit
• Children memorise not only single words, but also groups (chunks) of words.
• Collocations are prefabricated units available as blocks (cf. the idiom principle):
“The principle of idiom is that a language user has available to him or her a large
number of semi-preconstructed phrases that constitute single choices, even though
they might appear to be analysable into segments.” (Sinclair, 1991, 110)
• “semi-finished products” of language (Hausmann, 1985, 124); “déjà-vu”.
27
Features: Recurrent, typical
• Collocations are actual words in habitual company. (Firth, 1968, 182)
• typical, specific and characteristic combination of two words (Hausmann,
1985)
• We shall call collocation a characteristic combination of two words […].
(Hausmann, 1989, 1010)
• a sequence of words that occurs more than once in identical form [...] and
which is grammatically well structured (Kjellmer, 1987, 133)
• A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
• recurrent combinations of words that co-occur more often than expected by
chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• A collocation is an expression consisting of two or more words that correspond
to some conventional way of saying things. (Manning and Schütze, 1999, 151)
• Collocations [...] cover word pairs and phrases that are commonly used in
language, but for which no general syntactic and semantic rules apply.
(McKeown and Radev, 2000, 507)
• We reserve the term collocation to refer to any statistically significant
cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag
et al., 2002, 7)
28
Features: Arbitrary
• typical, specific and characteristic combination of two words (Hausmann,
1985)
• A collocation is an arbitrary and recurrent word combination (Benson, 1990)
• The term collocation refers to the idiosyncratic syntagmatic combination of
lexical items and is independent of word class or syntactic structure.
(Fontenelle, 1992, 222)
• recurrent combinations of words that co-occur more often than expected by
chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• Collocation: idiosyncratic restriction on the combinability of lexical items (van
der Wouden, 1997, 5)
• Collocations [...] cover word pairs and phrases that are commonly used in
language, but for which no general syntactic and semantic rules apply.
(McKeown and Radev, 2000, 507)
• lexically and/or pragmatically constrained recurrent co-occurrences of at least
two lexical items which are in a direct syntactic relation with each other
(Bartsch, 2004, 76)
29
Features: Unpredictable
• “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988):
“With an encoding idiom, by contrast, we have an expression which language users
might or might not understand without prior experience, but concerning which they
would not know that it is a conventional way of saying what it says” (Fillmore et al.,
1988, 505)
• […] these affinities can not be predicted on the basis of semantic or syntactic
rules, but can be observed with some regularity in text (Cruse, 1986)
• A collocation is a word combination whose semantic and/or syntactic
properties cannot be fully predicted from those of its components, and which
therefore has to be listed in a lexicon. (Evert, 2004, 9)
30
Features: Made up of two or more words
• Collocation is the cooccurrence of two or more words within a short space of each
other in a text. (Sinclair 1991:170)
• co-occurrence of two or more lexical items as realizations of structural elements
within a given syntactic pattern (Cowie 1978:132)
• a sequence of two or more consecutive words, that has characteristics of a
syntactic and semantic unit whose exact and unambiguous meaning or connotation
cannot be derived directly from the meaning or connotation of its components
(Choueka, 1988)
• Collocation is the cooccurrence of two or more words within a short space of each
other in a text. (Sinclair, 1991, 170)
• A collocation is an expression consisting of two or more words that correspond to
some conventional way of saying things. (Manning and Schütze, 1999, 151)
• the components of a collocation can again be collocational themselves: next to the
German collocation Gültigkeit haben (n + v), we have allgemeine Gültigkeit haben
[lit., ‘general validity have’], with allgemeine Gültigkeit, a collocation (n + a), as a
component (Heid, 1994, 232).
• In most of the examples, collocation patterns are restricted to pairs of words, but
there is no theoretical restriction to the number of words involved (Sinclair, 1991,
170).
31
Summing up…
• prefabricated unit
• made up of two or more words
• recursive
• recurrent/typical
• arbitrary
• unpredictable
• partly transparent
• syntactically motivated
• worth storing in a lexicon
• asymmetric (base + collocate)
But ultimately, the exact definition of collocations varies according to the
application needs:
“the practical relevance is an essential ingredient of their definition” (Evert, 2004, 75).
32
THEORETICAL DESCRIPTION
33
Prehistory
• Collocations were already known and studied by the ancient Greeks (Gitsaki,
1996).
• Pedagogical interest in collocations:
Harold Palmer (1877–1949): “polylogs”, “known units”
Albert Sydney Hornby (1898–1978):
Idiomatic and Syntactic English Dictionary (1942)
A Learner’s Dictionary of Current English (1948)
Advanced Learner’s Dictionary of Current English (1952),
Oxford Advanced Learner’s Dictionary (multiple prints)
Anthony P. Cowie
Peter Howarth
Michael Lewis: “islands of reliability”
• Linguistic interest in collocations:
“groupements usuels” (usual combinations), opposed to “groupements passagers”
(temporary/free combinations) (Bally, 1909)
“Lexikalische Solidaritäten” (lexical solidarities) (Coseriu, 1967).
34
Syntactic characterisation
Distinction between lexical and grammatical collocations (Benson et al., 1986)
• Lexical collocations
involve open-class words only (nouns, verbs, adjectives, most adverbs)
most collocations
• Grammatical collocations
may contain function words (prepositions, conjunctions, pronouns, auxiliary verbs,
articles):
apathy towards, agreement that, in advance, angry at, afraid that
(Examples from Benson et al., 1986)
35
Syntactic characterisation (cont.)
Syntactic configurations relevant for collocations:
• “We shall call collocation a characteristic combination of two words in a structure
like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun
(object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun.”
(Hausmann, 1989, 1010)
N-A, N-V, V-N, V-Adv, A-Adv, N-P-N
• BBI dictionary (Benson et al., 1986): many types, including:
A-N, N-N, N-P:of-N, N-V, V-N, V-P-N, Adv-A, V-Adv,
N-P, N-Conj, P-N, A-P, A-Conj
• Unrestricted typology:
“The term collocation refers to the idiosyncratic syntagmatic combination of lexical items
and is independent of word class or syntactic structure.” (Fontenelle, 1992, 222)
36
Semantic characterisation
• The collocation is a semantic unit:
“a sequence of two or more consecutive words, that has characteristics of a syntactic
and semantic unit whose exact and unambiguous meaning or connotation cannot be
derived directly from the meaning or connotation of its components” (Choueka, 1988)
• “the noncompositionality of a string must be considered when assessing its
holism” (Moon, 1998, 8)
• Is the meaning of a collocation obtained by the composition of the meanings of
individual words?
37
Semantic characterisation (cont.)
• Collocations occupy the grey area of a continuum of compositionality:
regular combinations (transparent) → collocations → idiomatic expressions (opaque)
• Collocations are partly compositional (Meaning-Text Theory):
heavy ‘A’ + smoker ‘B’: ‘A B’
B: base – autosemantic (semantic head)
A: collocate – synsemantic (semantically dependent)
38
Semantic characterisation (cont.)
• “the meaning of a collocation is not a straightforward composition of the
meaning of its parts” (Manning and Schütze, 1999, 172–173)
white ‘A’ + wine ‘B’: ‘A B’
“there is always an element of meaning added to the combination” (1999, 184);
The meaning of a collocation like white wine contains an added element of connotation
with respect to the connotation of wine and white together.
• “the individual words in a collocation can contribute to the overall semantics of
the compound” (McKeown and Radev, 2000, 507).
39
Semantic characterisation (cont.)
• Easy to decode, difficult to encode:
“idioms of encoding” (Makkai, 1972; Fillmore et al., 1988)
entertain ‘A’ + hope ‘B’: ‘A B’
40
Collocations vs. idioms
[Diagrams: alternative views of how the sets of collocations and idioms relate]
• Collocations “fall somewhere along a continuum between free word combinations
and idioms” (McKeown and Radev, 2000, 509)
• “The term collocation will be used to refer to sequences of lexical items
which habitually co-occur, but which are nonetheless fully transparent in
the sense that each lexical constituent is also a semantic
constituent.” (Cruse, 1986, 40)
41
Collocations vs. idioms (cont.)
[Diagrams: alternative views of how the sets of collocations and idioms relate]
• “I will use the term collocation as the most general term to refer to all types of fixed
combinations of lexical items; in this view, idioms are a special subclass of collocations”
(van der Wouden, 1997, 9).
• “Idiomaticity applies to encoding for collocations, but not to
decoding” (Fillmore et al., 1988).
42
Collocations vs. other types of MWEs
• Multi-word expressions (MWE) cover a broad spectrum of phenomena:
Named entities: European Union
Compounds: wheel chair
Verb-particle constructions: give up
Light-verb constructions: take a bath
...
Note:
While theoretically appealing, fine-grained distinctions are less important in practice.
All expressions share the same fate: lexicon → special treatment. They are equally
important; what changes is their share in language.
43
Predominance of collocations
• “collocations make up the lion’s share of the phraseme [MWE] inventory, and
thus deserve our special attention” (Mel’čuk 1998, 24).
• “no piece of natural spoken or written English is totally free of collocation” (Lea
and Runcie, 2002, vii)
• “In all kinds of texts, collocations are indispensable elements with which our
utterances are very largely made” (Kjellmer 1987:140)
Les députés réformistes surveilleront de près les
mesures que prendra le gouvernement au sujet du
rôle que jouera le Canada dans le maintien de la
paix […]
(Hansard Corpus)
44
Quiz
agreement
45
46
PRACTICAL ACCOUNTS
47
Basic architecture
Preprocessing → Candidate selection → Candidate ranking
48
(Collaborative) Synopsis
49
English
• Choueka (1988): Looking for needles in a haystack …
preprocessing: none (plain text)
candidates: sequences of adjacent words up to 7 words long
ranking: raw frequency
• Kjellmer (1994): A Dictionary of English Collocations
plain text
sequences of adjacent words
raw frequency
• Justeson and Katz (1995): Technical terminology: Some linguistic properties and
an algorithm for identification in text
NP chunking (patterns containing N, A, P)
n-grams
raw frequency
EX: central processing unit
50
English (cont.)
• Church and Hanks (1990): Word association norms, mutual information, and
lexicography
preprocessing: POS-tagging
candidates: adjacent pairs (phrasal verbs)
ranking: MI
EX: allude to (P) vs. tend to (infinitive marker)
• Church et al. (1989): Parsing, word associations and typical predicate-argument
relations
shallow parsing
predicate-argument relations (S-V-O)
MI, t-test
EX: drink beer/tea/cup/coffee
51
English (cont.)
• Smadja (1993): Retrieving collocations from text: Xtract
z-score
POS-tagging
“retains words (or parts of speech) occupying a position with probability
greater than a given threshold” (p. 151)
rigid noun phrases
EX: stock market, foreign exchange, New York Stock Exchange
phrasal templates
EX: common stocks rose *NUMBER* to *NUMBER*
predicative collocations
EX: index [...] rose, stock [...] jumped, use [...] widely
parser used as postprocessing (results validation)
Note: First large-scale evaluation, with professional lexicographers.
Impact of parsing: precision rises from 40% to 80%.
52
English (cont.)
• Dagan and Church (1994): Termight: Identifying and translating technical
terminology
POS-tagging
NP chunking (NPs defined by regular expressions over tags)
ranking: frequency of the head word in document
bilingual – word alignments
EX: default paper size, software settings
• Lin (1998): Extracting collocations from text corpora
dependency parsing (sentences shorter than 25 words)
A-N, N-N, N-P-N, S-V, V-O
version of MI ("adjusted")
53
English (cont.)
• Pearce (2001): Synonymy in collocation extraction
data already preprocessed (syntactic treebank)
noun+modifier pairs
ranking: Web frequencies
EX: baggage allowance, luggage compartment
• Dias (2003): Multiword unit hybrid extraction
POS-tagging
sequences of words/parts of speech
Mutual Expectation
EX: [Blue Mosque], [been able to], [can still be]
• Orliac and Dillinger (2003): Collocation extraction for machine translation
full parsing (but cannot handle relative constructions)
MI, log-likelihood ratio
EX: download/save/locate file
54
English (cont.)
• Kilgarriff et al. (2004): The Sketch Engine
shallow parsing
syntactic relations identified on the basis of regex over POS tags
version of MI
55
German
• Breidt (1993): Extraction of V-N-collocations from text corpora
POS tagging
sliding window: V-N pairs in a 5-word span (N precedes V)
MI, t test
EX: [in] Betracht kommen (‘to be considered’)
• Krenn (2000): The Usual Suspects: Data-Oriented Models for Identification and
Representation of Lexical Collocations
POS tagging and shallow parsing
P-N-V (i.e., PP-V) combinations
MI, Dice coefficient, LLR, entropy, lexical keys (list of support verbs)
EX: zur Verfügung stellen (lit., at the availability put, ‘make available’),
am Herzen liegen (lit., at the heart lie, ‘have at heart’).
56
German (cont.)
• Krenn and Evert (2001): Can we do better than frequency? A case study on
extracting PP-verb collocations
POS tagging, chunking
PP-V (PP + V in the same sentence; inflected forms)
frequency, t test, LLR, chi-square, MI
EX: in Betrieb gehen/nehmen (’go/put into operation’)
• Evert and Krenn (2001): Methods for the qualitative evaluation of lexical
association measures
also A-N pairs, POS tagging, same ranking measures
EX: Rotes Kreuz (‘Red Cross’)
57
German (cont.)
• Zinsmeister and Heid (2003): Significant Triples: Adjective+Noun+Verb
Combinations
full parsing
A-N-V combinations
LLR
EX: (eine) klare Absage erteilen
(lit. give a clear refusal, ‘refuse resolutely’)
• Schulte im Walde (2003): A Collocation Database for German Verbs and Nouns
as above, but many syntactic configurations
EX: Zeichen ‘symbol’ – Freundschaft ‘friendship’
• Wermter and Hahn (2004): Collocation extraction based on modifiability statistics
POS tagging, shallow parsing
PP-V combinations
limited modifiability criterion (high relative frequency of collocate)
EX: unter [stark/schwer] Druck geraten
‘to get under [strong/heavy] pressure’
58
French
• Lafon (1984): Dépouillements et statistiques en lexicométrie
plain text
directed/undirected pairs
z-score
• Bourigault (1992): Surface grammatical analysis for the extraction of
terminological noun phrases
POS tagging, chunking, shallow parsing
NPs (terms)
EX: disque dur ‘hard disk’, station de travail ‘workstation’
59
French (cont.)
• Daille (1994): Approche mixte pour l’extraction automatique de terminologie…
lemmatization, POS tagging, shallow parsing (Finite State Automata)
NPs: N-A, N-N, N-à-N, N-de-N, N-P-D-N
many AMs: e.g., cubic MI, LLR, raw frequency
EX: réseau national à satellites
• Jacquemin et al. (1997): Expansion of multi-word terms for indexing and
retrieval using morphology and syntax
POS tagging, shallow parsing (regex over POS tags)
combinations in a 10-word window; syntactic relations
“A ±5-word window is considered as sufficient for
detecting collocations in English (Martin, Al, and
Van Sterkenburg, 1983). We chose a window-size
twice as large because French is a Romance language
with longer syntactic structures”
EX: fruits et agrumes tropicaux
huile de palme ‘palm oil’ – palmier à huile ‘palm tree’
60
French (cont.)
• Goldman et al. (2001): Collocation extraction using a syntactic parser
full parsing
syntactic relations, many configurations
LLR
• Tutin (2004): Pour une modélisation dynamique des collocations dans les textes
shallow parsing (INTEX)
syntactic relations, many configurations
EX: angoisse – saisir, lit. fear seize
• Archer (2006): Acquisition semi-automatique de collocations …
parsing
verb-adverb
version of MI
EX: changer radicalement, ‘to change radically’
61
Other languages
E.g.,
• Czech: Pecina (2008)
• Dutch: Villada Moirón (2005)
• Italian: Calzolari and Bindi (1990), Basili et al., (1994)
• Chinese: Wu and Zhou (2003)
• Korean: Kim et al. (1999)
• Japanese: Ikehara et al. (1995)
• Romanian: Todirascu et al. (2008)
62
(Collaborative) Synopsis
63
64
BEHIND THE CURTAINS:
MATHS, STATISTICS
65
Extraction systems: What is behind?
FipsCoView
66
Extraction procedure
• Input: Text corpus
• Output: Collocations (typical combinations)
Procedure:
1. Candidate selection
2. Candidate ranking
Many options:
1. Candidate selection: which criteria?
– n-grams: what length?
– skip-grams: what distance? directed or not?
– syntactic relations: which tools? (shallow/dependency/full parser?)
– frequency threshold: yes/no? if yes, which threshold? (2? 5? 10? more?)
2. Candidate ranking: which criteria?
– statistically significant (more frequent than expected by chance)?
– semantic unit, partly transparent?
– arbitrary?
Note: Not all of these criteria are easy to put into practice; most of them are not. There is plenty of room for future work.
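The candidate selection options listed above (n-gram length, skip-gram distance, direction, frequency threshold) can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the function name `select_candidates`, its parameters and the toy sentence are hypothetical, the input is assumed to be already tokenised, and only directed pairs are handled.

```python
from collections import Counter

def select_candidates(tokens, max_distance=1, min_freq=2):
    """Collect directed word pairs (w1 before w2) as collocation candidates.

    max_distance=1 yields adjacent pairs (bigrams); larger values yield
    skip-grams. Pairs seen fewer than min_freq times are discarded.
    All names and defaults here are illustrative assumptions.
    """
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # Pair w1 with each of the next max_distance tokens.
        for w2 in tokens[i + 1 : i + 1 + max_distance]:
            counts[(w1, w2)] += 1
    # Apply the frequency threshold.
    return {pair: f for pair, f in counts.items() if f >= min_freq}

# Toy input (hypothetical), already tokenised.
tokens = "great interest in a topic of great general interest".split()
bigrams = select_candidates(tokens, max_distance=1, min_freq=1)
skipgrams = select_candidates(tokens, max_distance=2, min_freq=2)
```

With adjacent pairs only, great – interest is seen once; allowing one intervening word also catches "great general interest", so the pair passes a threshold of 2. Selection based on syntactic relations would replace the window by the output of a parser.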
67
Ranking based on statistical significance
• Statistical significance (in inferential – as opposed to descriptive – statistics):
An event is statistically significant if it is not due to chance alone.
In our case, the event is the co-occurrence of the component words of a candidate in
language: e.g., great – interest.
• Statistical hypothesis tests tell whether an event is statistically significant or
not.
• Null hypothesis: the default assumption is that the event is due to chance.
In our case, the null hypothesis is that great and interest occur together by chance
(“groupements passagers” – Bally, 1909).
68
Great – interest: observed co-occurrences
69
Ranking based on statistical significance
• Method: Comparing chance-expected (E) against observed (O) frequencies of
occurrence of the event. The larger the difference, the more significant the event.
In our case,
O: How often did we see great and interest together (in the candidate dataset)?
E: How often would we expect two words like great and interest to occur together?
Consider that great can be replaced by a lot of other words: big, special, major…
Similarly, the place of interest can be taken by words like fantasy, experience, work …
How can we compute the probability of seeing great and interest together, under the
assumption that they are independent (→ chance-expected frequency)?
If we know
P(A) – the probability of seeing great in our dataset, and
P(B) – the probability of seeing interest in our dataset,
then according to the formula for computing probabilities of independent events,
the probability of seeing great and interest together is the product of the individual
probabilities.
70
Ranking based on statistical significance
• Individual probabilities:
P(A) – the probability of seeing great in our dataset
P(A) = number of times great occurs in the dataset / size of dataset
P(B) – the probability of seeing interest in our dataset
P(B) = number of times interest occurs in the dataset / size of dataset
• Joint probability:
$P(A \cap B) = P(A) \cdot P(B)$ – the probability of seeing both great and interest, assuming independence;
• Chance-expected frequency (E): joint probability × size of dataset
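A worked example may help fix the computation. The counts below are hypothetical, chosen only to keep the arithmetic simple: a dataset of 10,000 candidate pairs in which great occurs 200 times and interest occurs 100 times.

```latex
\begin{align*}
P(A) &= \frac{200}{10\,000} = 0.02 \\
P(B) &= \frac{100}{10\,000} = 0.01 \\
P(A \cap B) &= P(A) \cdot P(B) = 0.02 \times 0.01 = 0.0002 \\
E &= P(A \cap B) \times N = 0.0002 \times 10\,000 = 2
\end{align*}
```

Under these assumed counts, chance alone would produce about two co-occurrences of great and interest; an observed frequency well above 2 would point towards a genuine association.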
71
Contingency table – Observed values
            interest    ¬interest
great          a            b
¬great         c            d
In general: Two random variables (each with a set of possible different values),
X – first position in a candidate pair
Y – second position in a candidate pair
¬ means ‘not’
a – joint frequency; N – sample size; R – row marginal; C – column marginal
Sample: data (our candidate set) selected from a population (corpus)
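As an illustration, the four observed cells can be obtained by classifying every candidate pair against the pair under investigation. This is a sketch under assumptions: the function name `contingency_table` and the toy candidate list are hypothetical, and each pair is taken to be a (first position, second position) tuple.

```python
def contingency_table(pairs, u, v):
    """Count the four observed cells for the pair type (u, v).

    a: u in first position and v in second position
    b: u in first position, something other than v in second
    c: something other than u in first position, v in second
    d: neither u first nor v second
    """
    a = b = c = d = 0
    for x, y in pairs:
        if x == u and y == v:
            a += 1
        elif x == u:
            b += 1
        elif y == v:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Toy candidate set (hypothetical data).
pairs = [
    ("great", "interest"), ("great", "interest"), ("great", "deal"),
    ("special", "interest"), ("big", "deal"), ("major", "work"),
]
a, b, c, d = contingency_table(pairs, "great", "interest")
```

The four cells always add up to the sample size, which gives a quick sanity check on any implementation.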
72
Contingency table – Expected values
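• Expected values under the null hypothesis:
$E_{11} = \frac{R_1 C_1}{N}$, $E_{12} = \frac{R_1 C_2}{N}$, $E_{21} = \frac{R_2 C_1}{N}$, $E_{22} = \frac{R_2 C_2}{N}$
where $R_2 = N - R_1$ and $C_2 = N - C_1$ are the marginals of the second row and column
• Sample computation: expected joint frequency (first cell)
$E_{11} = N \cdot \frac{R_1}{N} \cdot \frac{C_1}{N} = \frac{R_1 C_1}{N}$
(sample size
× individual probability of seeing u in the first position
× individual probability of seeing v in the second position)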
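The expected table follows directly from the marginals and the sample size. The sketch below assumes the usual convention that each expected cell is the product of its row and column marginals divided by N; the function name `expected_table` and the input numbers are hypothetical.

```python
def expected_table(n, r1, c1):
    """Expected cell frequencies under the null hypothesis of independence.

    n  - sample size
    r1 - row marginal of u (u in first position)
    c1 - column marginal of v (v in second position)
    The second row and column marginals are the complements n - r1, n - c1.
    """
    r2 = n - r1
    c2 = n - c1
    e11 = r1 * c1 / n
    e12 = r1 * c2 / n
    e21 = r2 * c1 / n
    e22 = r2 * c2 / n
    return e11, e12, e21, e22

# Hypothetical counts: 10,000 pairs, u seen 200 times, v seen 100 times.
e11, e12, e21, e22 = expected_table(n=10000, r1=200, c1=100)
```

As with the observed table, the expected cells sum to the sample size.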
73
Comparing O and E
• Question: Is the difference large?
Idea: Take O – E, or log O/E (since log O/E = log O – log E).
The result of the comparison may be either positive or negative.
The test is a two-tailed test (≠).
• Question: Are the observed frequencies higher than the chance-expected ones?
The test is a one-tailed test (>).
If the answer is yes, we have identified a positive association.
• Question: Are the observed frequencies lower than the chance-expected ones?
The test is a one-tailed test (<).
If the answer is yes, we have identified a negative association.
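The two comparison ideas (difference and log ratio) and the direction of the association can be sketched as follows. The function name `compare` is hypothetical, and base 2 for the logarithm is an assumption, since the slide does not fix a base. The sketch only reports the direction of the deviation; deciding whether it is significant requires one of the tests discussed later.

```python
import math

def compare(o, e):
    """Compare an observed frequency o with a chance-expected frequency e.

    Returns the difference O - E, the log ratio log2(O/E), and the
    direction of the association ("positive", "negative" or "none").
    """
    diff = o - e
    # log(O/E) is undefined for O = 0; report minus infinity in that case.
    log_ratio = math.log2(o / e) if o > 0 else float("-inf")
    if diff > 0:
        direction = "positive"
    elif diff < 0:
        direction = "negative"
    else:
        direction = "none"
    return diff, log_ratio, direction

# Hypothetical values: 16 observed co-occurrences against 2 expected.
diff, log_ratio, direction = compare(16, 2.0)
```

Both measures agree on the sign: O – E and log O/E are positive exactly when the pair is observed more often than expected by chance.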
74
Popular association measures
• AM: “a formula that computes an association score from the frequency information in a
pair type’s contingency table” (Evert, 2004, 75)
AM                      Assumption on data distribution
t test                  normal
z-score                 normal
chi-square              -
log-likelihood ratio    binomial
[Table columns “Formula” and “Explicit formula”: contents not preserved]
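For orientation, the four measures can be sketched in their commonly cited contingency-table formulations, as found in the association-measure literature. This is an assumption about which variants are meant, not a reproduction of the slide's formulas. The function name `association_scores` and the input numbers are hypothetical; observed and expected cells are passed in the order (first row, first column), (first row, second column), (second row, first column), (second row, second column).

```python
import math

def association_scores(observed, expected):
    """Compute four association scores from observed and expected cells.

    t-score:    (O11 - E11) / sqrt(O11)
    z-score:    (O11 - E11) / sqrt(E11)
    chi-square: sum over cells of (O - E)^2 / E
    LLR:        2 * sum over cells of O * ln(O / E), skipping empty cells
    """
    o11, e11 = observed[0], expected[0]
    t = (o11 - e11) / math.sqrt(o11)
    z = (o11 - e11) / math.sqrt(e11)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)
    return {"t": t, "z": z, "chi2": chi2, "llr": llr}

# Hypothetical tables for a sample of 10,000 pairs.
observed = (16, 184, 84, 9716)
expected = (2.0, 198.0, 98.0, 9702.0)
scores = association_scores(observed, expected)
```

Candidates are then sorted by the chosen score in descending order; as the scores live on different scales, only the ranking within one measure is meaningful.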
75
Ranking based on mutual information
• Pointwise mutual information (MI, or PMI):
PMI(u, v) = log2 [ P(u, v) / (P(u) x P(v)) ]
= the information about u provided by the occurrence of v
= the information about v provided by the occurrence of u
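A small Python sketch of ranking candidates by PMI is given below. The candidate pairs and all counts are invented for illustration, and the helper name `pmi` is an assumption. The formula is written in terms of the joint frequency a, the marginals R1 and C1, and the sample size N.

```python
import math


def pmi(a, r1, c1, n):
    """PMI = log2( P(u,v) / (P(u) * P(v)) ) = log2( a * N / (R1 * C1) )."""
    return math.log2((a * n) / (r1 * c1))


# Hypothetical candidates: pair -> (joint frequency a, row marginal R1, column marginal C1)
candidates = {
    ("great", "interest"): (20, 200, 100),
    ("rare", "pair"): (1, 1, 1),
    ("the", "thing"): (50, 5000, 400),
}
n = 10000  # hypothetical dataset size

# Sort candidate pairs from highest to lowest PMI score.
ranked = sorted(candidates, key=lambda p: pmi(*candidates[p], n), reverse=True)
for pair in ranked:
    print(pair, round(pmi(*candidates[pair], n), 2))
```

In this toy data the pair seen only once comes out on top, which illustrates why PMI is usually combined with a frequency threshold.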
Comments on AM applicability
• Lexical data has a Zipfian distribution, with a small number of highly frequent words and a large number of infrequent words. Most tests make assumptions about the data distribution which do not hold for such data. The application of the t test and z-score to lexical data is often contested (Kilgarriff 1996, Dunning 1993, Evert 2004).
• AMs are less reliable for infrequent data. Minimal suggested frequency: 5 (Church and Hanks, 1990). Some AMs overemphasise rare events: PMI, chi-square.
• Some AMs are not reliable for small sample sizes (N): z-score, chi-square.
• Some AMs overemphasise common events: chi-square.
• Results vary with the experimental setting: type of candidates, domain, amount of data excluded by the frequency threshold, linguistic preprocessing… (Evert and Krenn, 2005).
• Plain frequency is already a competitive AM.
• There is no single all-purpose AM.
Exercise
• Some values in the contingency table are more difficult to compute than others. For instance, a, N, R1 and C1 are relatively easy to compute by looking for occurrences of u and v together or in isolation, and by counting the items in the dataset (N). But what about b, c, and d?
Can you give formulas for computing b, c, and d in terms of a, N, R and C?
Example:
b = R1 – a
c = ____________________________
d = ____________________________
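One possible solution can be sketched in Python, using only the fact that rows and columns of the contingency table must add up to their marginals and that all four cells add up to N. The function name is an assumption made for this sketch.

```python
def fill_table(a, r1, c1, n):
    """Derive b, c, d from the joint frequency a, the marginals R1, C1 and N."""
    b = r1 - a             # first row:    a + b = R1
    c = c1 - a             # first column: a + c = C1
    d = n - r1 - c1 + a    # all cells:    a + b + c + d = N
    return a, b, c, d


# Hypothetical counts, for illustration only.
print(fill_table(10, 30, 40, 100))
```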
Hands-on session
• Build a minimally viable collocation extractor
(well, a candidate ranking module; we assume candidate data is already available).
Data:
lex, key – lexeme index and key for a word, e.g., 111011778 decision
Candidate dataset:
Provided in a database, table structure: <lex1, lex2, key1, key2, type, prep_key>
lex1, key1, lex2, key2 – the two items of a candidate pair
type – the syntactic type
prep_key – the intervening preposition, if any (e.g., comply with duty)
Method:
Implement queries in MS Access for computing:
– dataset size: N
– joint frequencies: a
– row marginals: R1
– column marginals: C1
– all contingency values: a, b, c, d
– AM formulas
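The same counting steps can be sketched outside MS Access, for instance in Python. The sketch below assumes the candidate table has been exported as a list of (key1, key2, type) rows, with one row per extracted pair occurrence. The sample rows, the type label and the function name are all invented for illustration, and candidates of all syntactic types are counted together.

```python
from collections import Counter

# Hypothetical candidate rows (key1, key2, type); not taken from the tutorial data.
rows = [
    ("great", "interest", "A-N"),
    ("great", "interest", "A-N"),
    ("great", "deal", "A-N"),
    ("keen", "interest", "A-N"),
    ("small", "house", "A-N"),
]


def contingency_tables(rows):
    """Map each pair type to its contingency values (a, b, c, d)."""
    n = len(rows)                                       # dataset size N
    joint = Counter((k1, k2) for k1, k2, _ in rows)     # joint frequencies a
    row_marg = Counter(k1 for k1, _, _ in rows)         # row marginals R1
    col_marg = Counter(k2 for _, k2, _ in rows)         # column marginals C1
    tables = {}
    for (k1, k2), a in joint.items():
        r1, c1 = row_marg[k1], col_marg[k2]
        tables[(k1, k2)] = (a, r1 - a, c1 - a, n - r1 - c1 + a)
    return tables


tables = contingency_tables(rows)
print(tables[("great", "interest")])
```

Each Counter plays the role of one grouping query over the candidate table.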
AMs in MS Access SQL
AM, followed by its even more explicit formula (for MS Access):

t test
(a*(a+b+c+d) - (a+b)*(a+c)) /
((a+b+c+d)*(a^(1/2)))

z-score
(a*(a+b+c+d) - (a+b)*(a+c)) /
((a+b+c+d)^(1/2) * ((a+b)*(a+c))^(1/2))

chi-square
((a+b+c+d)*(a*d - b*c)^2) /
((a+b)*(a+c)*(b+d)*(c+d))

log-likelihood ratio
2*(a*log(a) + b*log(b) + c*log(c) + d*log(d)
- (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d)
+ (a+b+c+d)*log(a+b+c+d))

PMI
log((a*(a+b+c+d)) /
((a+b)*(a+c)))/log(2)
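For readers without MS Access, the five formulas above can be sketched in Python as follows. The function names are assumptions made for this sketch. It assumes a > 0 and non-zero marginals, and it treats 0*log(0) as 0 in the log-likelihood ratio, a common convention that the slide formula does not spell out.

```python
import math


def _xlogx(x):
    """x * log(x), with the convention that 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def association_scores(a, b, c, d):
    """Scores for one contingency table; assumes a > 0 and non-zero marginals."""
    n = a + b + c + d
    r1, r2 = a + b, c + d      # row marginals
    c1, c2 = a + c, b + d      # column marginals
    e = r1 * c1 / n            # expected joint frequency
    llr = 2 * (_xlogx(a) + _xlogx(b) + _xlogx(c) + _xlogx(d)
               - _xlogx(r1) - _xlogx(c1) - _xlogx(c2) - _xlogx(r2)
               + _xlogx(n))
    return {
        "t": (a - e) / math.sqrt(a),
        "z": (a - e) / math.sqrt(e),
        "chi2": n * (a * d - b * c) ** 2 / (r1 * c1 * c2 * r2),
        "llr": llr,
        "pmi": math.log2(a / e),
    }


# Hypothetical counts, for illustration only.
scores = association_scores(10, 20, 30, 40)
for name, value in scores.items():
    print(name, round(value, 4))
```

The t test and z-score are written here as (O – E) divided by the square root of O and of E respectively, which is algebraically the same as the MS Access expressions.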
Resuming …
WRAP UP AND OUTLOOK
Word sociology
• Do we know more about it and how to analyse it?
– About how it has been approached in theoretical and computational linguistics?
– About why it is important and which applications can exploit this type of knowledge?
– About the types of constructions dealt with in practical work?
– … the underlying language technology?
– … the portability across languages?
– … the computational work behind association strength quantification?
• Have you identified less explored, potential areas of further research?
A look at other multi-word expressions
• Those which were more studied in the literature:
Idioms:
Rosamund Moon. 1998. Fixed Expressions and Idioms in English: A Corpus-Based Approach. Clarendon Press, Oxford.
Compounds:
Gaston Gross. 1996. Les expressions figées en français. OPHRYS, Paris.
• And those on which empirical work was particularly focused:
Idioms:
Christiane Fellbaum (ed.). 2007. Idioms and Collocations: Corpus-based Linguistic and Lexicographic Studies. Continuum, London.
Light-verb constructions:
Afsaneh Fazly. 2007. Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph.D. thesis, University of Toronto.
Verb-particle constructions: e.g., Baldwin and Villavicencio (2002), Bannard et al. (2003)
Nominal compounds:
Christian Jacquemin. 2001. Spotting and Discovering Terms through NLP. MIT Press, Cambridge, MA.
Selected readings: Books
• Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
• Thierry Fontenelle. 1997. Turning a Bilingual Dictionary into a Lexical-Semantic Database. Max Niemeyer Verlag, Tübingen.
• Sylviane Granger and Fanny Meunier (eds.). 2008. Phraseology: An Interdisciplinary Perspective. John Benjamins, Amsterdam/Philadelphia.
• Francis Grossmann and Agnès Tutin (eds.). 2003. Les collocations : analyse et traitement. Travaux et recherches en linguistique appliquée. De Werelt, Amsterdam.
• Pavel Pecina. 2008. Lexical Association Measures: Collocation Extraction. Ph.D. thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.
• John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
• Michael Stubbs. 2002. Words and Phrases: Corpus Studies of Lexical Semantics. Blackwell, Oxford.
• Ton van der Wouden. 1997. Negative Contexts: Collocation, Polarity, and Multiple Negation. Routledge, London and New York.
• María Begoña Villada Moirón. 2005. Data-driven Identification of Fixed Expressions and their Modifiability. Ph.D. thesis, University of Groningen.
Selected readings: Chapters/Articles
• Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
• Beatrice Daille. 1994. Study and Implementation of Combined Techniques for Automatic Extraction of Terminology. In Proceedings of the Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 29–36, Las Cruces (New Mexico), U.S.A.
• Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
• Ulrich Heid. 1994. On ways words work together – research topics in lexical combinatorics. In Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), pages 226–257, Amsterdam, The Netherlands.
• Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.
• Kathleen R. McKeown and Dragomir R. Radev. 2000. Collocations. In Robert Dale, Hermann Moisl, and Harold Somers, editors, A Handbook of Natural Language Processing, pages 507–523. Marcel Dekker, New York, U.S.A.
• Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.
• Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
Resources
• UCS toolkit, by Stefan Evert
• mwetoolkit, by Carlos Ramisch
• Ngram Statistics Package (NSP), by Ted Pedersen et al.
Events
• SIGLEX-MWE: Workshops on Multiword Expressions
• PARSEME COST Action
References:
http://www.issco.unige.ch/en/staff/seretan/data/ranlp/tutorial/RANLP-2013-tutorial-references.pdf