COLLOCATIONALITY (AND HOW TO MEASURE IT)
Adam Kilgarriff (1), Jan Pomikalek (1,2), Pavel Rychly (2)
(1) Lexical Computing Ltd
Brighton, UK
adam@lexmasterclass.com
(2) Masaryk University
Brno, Czech Republic
ABSTRACT
Collocation is increasingly recognised as a central aspect of language, a fact that English
learners’ dictionaries have responded to extensively. Statistical measures for identifying
collocations in large corpora are now well-established. We move on to a further issue:
which words have a particularly strong tendency to occur in collocations, or are most
“collocational”, and thereby merit having their collocates shown in dictionaries. We
propose a measure of collocationality based on entropy, a measure defined in
information theory. We describe experiments to find the most ‘collocational’ words in
the British National Corpus and present results with the most ‘collocational’ nouns and
verbs in relation to the grammatical relation “object”.
1. Introduction
As Firth pointed out, “you shall know a word by the company it keeps” (Firth 1968:179).
The dictum has been taken to heart, in linguistics (Hoey 2005), psycholinguistics (Wray
forthcoming), language teaching (Carter and McCarthy 1988) and lexicography (Hanks
2002).
As we gain access to large corpora, so it becomes possible to use statistical methods to
automatically identify collocations. This line of “lexical statistics” research was
inaugurated by Church and Hanks’s 1989 paper which introduced Mutual Information
(MI), a measure from Information Theory (Shannon and Weaver 1963), as a statistic for
measuring how closely related two words were. Since then there have been a number of
papers refining and comparing different statistics (e.g. Dunning 1993, Krenn and Evert
2003), and very widespread use of MI and related statistics: all the widely used corpus
query tools compute them and present collocation lists of the highest-scoring items.
MI and related measures are measures of association. They assess how noteworthy the
association is between two words. In this paper we address a slightly different question:
which words have a strong tendency to occur in collocations? Which words are very
“collocational”?
While the association measures are more basic, the issue of collocationality arises in a
number of contexts. One is language teaching: what are good words to use, to start
teaching about collocationality? Another is lexicographic. Dictionaries such as the
Macmillan English Dictionary for Advanced Learners (2002) and the Cambridge
Advanced Learner’s Dictionary (2005) present brief accounts of collocations for words
where collocation is a particularly salient theme. Which words should these accounts
appear for?
2. Entropy
A word that is very ‘collocational’ is one which has a strong tendency to appear with
particular words, rather than appearing freely with large numbers of words. This theme is
captured mathematically by ‘entropy’, again from Information Theory. Entropy is
defined over a probability distribution, and states how much information there is in that
distribution. (A probability distribution is a set of possible outcomes, and the probability
of each, so for an unbiased coin, the probability distribution is
Heads, 0.5
Tails, 0.5)
The entropy is defined as the sum, across all possible outcomes, of the product of the
probability and the log (to base 2) of the probability (which is a negative quantity, so we
then change the sign to positive):

H = -Σ p(x) log2 p(x)
For an unbiased coin, for both the heads and the tails case, we have p(x) = 0.5, and the
log of 0.5 is -1, so we have 0.5 x -1 + 0.5 x -1 = -1, which we then negate to give an
entropy of 1. Distributions over more possible outcomes can give higher entropy, as the
examples below show.
Consider three words bag, beg and bug which occur (hypothetically) with the other words
pin, pan, pen and pun (and no others) in the proportions shown below (in the p(x)
column). The third column gives the log (base 2) of the proportion, and the fourth gives
the product of the previous two. Sums and entropy scores are also shown.
Table 1: Example entropy calculations.

Distribution for bag:
  x     p(x)   log2 p(x)   p(x) x log2 p(x)
  pan   0.25   -2          -0.5
  pen   0.25   -2          -0.5
  pin   0.25   -2          -0.5
  pun   0.25   -2          -0.5
  Sum                      -2.0
  Entropy                   2.0

Distribution for beg:
  x     p(x)   log2 p(x)   p(x) x log2 p(x)
  pan   0.7    -0.51       -0.36
  pen   0.1    -3.3        -0.33
  pin   0.1    -3.3        -0.33
  pun   0.1    -3.3        -0.33
  Sum                      -1.3
  Entropy                   1.3

Distribution for bug:
  x     p(x)   log2 p(x)   p(x) x log2 p(x)
  pan   0.5    -1          -0.5
  pen   0.5    -1          -0.5
  pin   0      -            0
  pun   0      -            0
  Sum                      -1.0
  Entropy                   1.0
Comparing bag and beg, we see that both have four collocates, but all are equally likely for bag,
whereas beg has one strong collocate, pan; this results in bag having higher entropy than beg.
This is the characteristic of entropy that underlies the paper: bag, with the more even
distribution of collocates, has higher entropy; beg, with a more skewed distribution, has lower.
Comparing bag and bug, we see another feature of entropy. Both distributions are entirely even,
with all collocates equally likely, but bag has four possible collocates whereas bug has just two.
Bag, the word with the higher number of collocates, has the higher entropy. Entropy is a measure
of how (un)informative a probability distribution is. There are two aspects to this: a distribution
is more informative the fewer its possible outcomes, and more informative the more that a small
number of outcomes account for what happens most of the time.
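The entropy values in Table 1 can be checked with a few lines of code. The following is a
minimal sketch in Python; the function and variable names are ours, not part of any published
implementation.

```python
import math

def entropy(probs):
    # Shannon entropy (base 2); outcomes with zero probability
    # contribute nothing, by the usual convention that 0 log 0 = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three hypothetical distributions of Table 1
bag = [0.25, 0.25, 0.25, 0.25]  # even, four collocates
beg = [0.7, 0.1, 0.1, 0.1]      # skewed towards one collocate
bug = [0.5, 0.5, 0.0, 0.0]      # even, but only two collocates

print(entropy(bag))  # 2.0
print(entropy(beg))  # about 1.36 (Table 1 rounds the logs, giving 1.3)
print(entropy(bug))  # 1.0
```

The ordering of the three values reproduces the two observations above: the skewed distribution
(beg) scores lower than the even one (bag), and the even distribution with fewer outcomes (bug)
scores lower still.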
2.1 Probabilities, proportions
How can a word’s collocations be modeled as a probability distribution? As hinted above, we take
a very straightforward approach. We view all the words that occur with a nodeword as the set of
possibilities. We then count how often each of them occurs with the nodeword, in a corpus. We
can then estimate the probability of each collocate as the proportion of the complete set of
occurrences of the nodeword (with some collocate or other) that occurred with this particular
collocate. Technically, this is a maximum likelihood estimate (MLE) for the probability of the
collocate, given the data. (While MLEs are not good estimates for probabilities when based on low
counts, and there will be very many collocates which, in principle, form part of the probability
distribution but did not happen to appear in one particular sample, we ask readers to look to the
results to assess whether these considerations are damaging for the purposes of producing a measure
of collocationality.)
2.2 Grammatical relations
Next we clarify what we mean by the collocate occurring “with” the nodeword.
In the linguistics literature, relations between base and collocates are generally grammatical.
Prototypical collocations associate a base noun with the verb it is object of (pay attention) or a base
noun with an adjective that modifies it (bright idea). In dictionaries such as the Oxford
Collocations Dictionary (OCD; 2002), collocates are divided according to the grammatical relation
they stand in to the base: in noun entries, OCD typically first lists adjectives that modify the noun,
then verbs that the noun is object of, then verbs that the noun is subject of, then prepositions, then
phrases.
Our experience with corpus-derived data also demonstrates the usefulness of grammatical relations
as an organizing principle. We use the Sketch Engine (Kilgarriff et al 2004) and note the merit as
argued in Kilgarriff and Rundell (2002) of having a separate list of high-salience collocates for each
grammatical relation. This both classifies collocates and provides a method for filtering out the high
levels of noise that are otherwise typical of collocate lists.
In this report, we treat each grammatical relation separately. Thus we aim to identify the most
collocational items with respect to a particular grammatical relation, for example, the most
collocational nouns with respect to the verbs they are object of. We can envisage a number of ways
in which different lists might be merged, but leave that for further work.
So: a collocate occurs “with” a nodeword if it occurs in the specified grammatical relation to the
nodeword.
3. Experiment
We used the British National Corpus1 – a 100 million word corpus of spoken and written British
English, covering a wide range of text types – as loaded into the Sketch Engine. The version of the
BNC we used was lemmatized and part-of-speech-tagged, so we could treat grammatical relations
as holding between lemmas (“take”) rather than word forms (“take” or “took” or “takes” or
“taken” or “taking”).
The Sketch Engine supports shallow parsing, and we used this parsing facility to
identify all instances of triples of the form <grammatical-relation, nodeword, word2> in the corpus,
using the system’s API (Application Programming Interface).
1 http://www.natcorp.ox.ac.uk
For illustration we use the relation ‘object’, where the nodeword is a noun and the second word
is a verb. We identified, for each noun, what verbs it occurred as object of, and how often. For
the noun advantage we identified the verbs below, which we treat as the population of possible
outcomes. The Maximum Likelihood Estimates of probabilities are then, for each verb, the number
of times that advantage has that verb as object, divided by the total number of times that a verb was
identified which had advantage as its object. (This means that the sum of the proportions, in the
third column below, is 1, as required for a probability distribution.)
Verb      Freq   MLE-prob (freq/3730)   log2      prob x log
take      2084   .5587                  -0.84     -.469
gain       131   .0351                  -4.83     -.169
offer      117   .0314                  -4.99     -.157
see        110   .0295                  -5.08     -.150
enjoy       67   .0180                  -5.79     -.104
obtain      58   .0155                  -6.01     -.093
…           …    …                      …         …
clarify      1   .000268                -11.86    -.0031
…           …    …                      …         …
Total     3730   1.000                            -3.909

Table 2: Calculation of entropy for advantage (object relation).
We perform the calculation as before: the entropy of the noun advantage, in relation to the object
relation, given BNC data, is thus 3.909.
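The same computation, starting from raw counts rather than probabilities, can be sketched as
follows in Python. The counts given are only the six verbs shown in Table 2, so the result is
lower than the full-distribution figure of 3.909; the function and variable names are ours.

```python
import math

def entropy_from_counts(counts):
    # Maximum likelihood estimate of each probability (count / total),
    # then the entropy of the resulting distribution.
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

# Illustrative only: the six verbs shown in Table 2. The full
# distribution for 'advantage' has many more verbs and sums to 3730,
# giving the entropy of 3.909 reported above.
object_of_advantage = {"take": 2084, "gain": 131, "offer": 117,
                       "see": 110, "enjoy": 67, "obtain": 58}
h = entropy_from_counts(object_of_advantage)
```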
We also calculate entropy for all other nouns in relation to ‘object’. (We limit the data set to nouns
found as objects more than fifty times in the BNC, using the Sketch Engine’s lemmatiser, parser,
etc.)
Plotting entropy against the frequency of the noun (occurring with an identifiable object, e.g.
3730 for advantage) gives us the graph in Figure 1. Each cross represents an individual noun.
Note that the frequency axis is logarithmic.
Fig. 1. Graph of entropy vs frequency for object-of relation for 5738 nouns in the BNC
It is evident from the graph that entropy tends to increase with frequency. As can be seen, the
lowest-entropy items are low-frequency items. This relates to the comparison between bag and bug
presented in Table 1: entropy tends to increase with the number of possible outcomes. We consider
collocationality a “frequency-neutral” term: that is, we would like to say that common words do not
intrinsically have any stronger likelihood to be highly collocational than rare words. Thus, in order
to use entropy as the basis for a measure of collocationality, we should first normalize it, to take
away frequency effects. There are various ways this could be done. We use the simplest: we take
the best-fit (least-squares) straight line through the graph, that is, the line to which the points
are, on average, closest. The normalized value is then a word’s distance from this line.
4. Results
We present below the 100 “most collocational” items for nouns (with respect to the verbs they are
objects of) and verbs (with respect to their object-nouns) ordered by frequency. Frequency ordering
is useful as the higher-frequency words often present different patterns to the lower-frequency ones.
(Numbers in brackets give the number of data points each result is based on.) We discuss
nouns/objects in some detail: for verbs/objects, we make only brief observations.
4.1 Nouns/objects
place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise
(3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath
(2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button
(661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472),
refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon
(399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness
(288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence
(228), go-ahead (202), sip (198), accountancy (188), climax (173), nod (172), brunt (163), headway
(161), fancy (157), damn (151), plunge (147), credence (146), amends (146), piss (145), inroad
(145), sway (142), communiqué (140), crying (133), para (133), overdose (132), heed (127), toss
(124), centenary (121), detour (117), sae (115), hang (110), shrug (107), stir (106), save (103),
gamble (101), cholangitis (100), chess (92), stroll (89), twinge (80), papillum (76), virginity (75),
errand (75), gauntlet (73), gulp (69), bluff (67), swig (66), robemaker (62), cool (61), doorbell (60),
glossary (59), shrift (58), keep (57), spokeswoman (57), grab (56), snort (52), quarter-final (52),
fray (52), esc (52), cropper (52), mickey (51)
Many of these nouns occur with high frequency in support-verb constructions: thus events take
place, we pay attention and take care, steps and advantage (of situations). Door does not have one
support verb but, rather, seven strong collocates: as well as opening doors, we close, shut, lock,
slam, push and unlock them. (Checking is easily undertaken, as the data is the same as the data
used to prepare the word sketch for the noun, so the word sketch, which is hyperlinked to the
relevant KWIC concordance, can be examined, e.g. at http://www.sketchengine.co.uk.)
Looking at lower-frequency items from the bottom of the list, we find idioms where the noun rarely
or never occurs outside the idiom: take the mickey, come a cropper, enter (or join) the fray, make
headway, bear the brunt (of), take the piss, beg pardon and give the go-ahead. Shrift (which is
nearly always short) is given, got and received. We see a formulaic aspect of sports journalism in
teams usually reaching quarter-finals.
A little noise must be set aside. Appendix, chapter, page, glossary and para result from the verb see
as in “see chapter 5”. Sae features because people are often asked to “send SAE” (self-addressed
envelope), and SAE rarely occurs anywhere else.2 Accountancy results from self-references in one
of the BNC’s sources, the magazine “Accountancy”, which often says, e.g., “See ACCOUNTANCY,
Jan 9 1992, p 21”. Esc results from pages of computer manuals which often say “Tap ESC for
ESCAPE”. (Corpus statistics of this kind aim to find surprising facts about the language, and
surprisingness is identified with being much more common than the norm. So the statistics also
succeed in finding items that are strikingly common for arbitrary non-linguistic reasons, often to do
with the composition, collection and encoding of the corpus.)
The items which occur largely with one verb, in fairly fixed expressions, will already be specified in
good dictionaries. The words that are more lexicographically useful are those where the entropy is
low, but the list of strong collocates is long enough and diverse enough for the entry to be a useful
locus for some description of collocational behaviour. Around half of the high-frequency,
high-collocationality items meet this criterion: attention (draw, pay, attract, give, focus, turn),
care (take, provide, need), impression (give, make, get, create), notice (serve, take, give),
breath (catch, draw, take, hold), hold (grab, get, take, catch, keep).3
Macmillan comparison
We identified the 322 nouns for which there were collocation panels in the Macmillan Dictionary,
and considered their collocationality scores with respect to the object relation. They had markedly
higher scores than a random sample of nouns: eight were in the top hundred, a quarter were
amongst the highest-scoring 11%, and two thirds were in the top half. Where a word had a low
collocation score, but nonetheless had a Macmillan panel, there are three possible interpretations:
the word may have strong collocates in other grammatical relations; the collocation measure may
be flawed; or Macmillan’s selection may be open to improvement.

2 Capitalisation would ideally be taken into account, although we note that capitalisation in most corpora – even
carefully edited ones such as the BNC – is an unreliable clue to linguistic status. Words may be capitalised because
they are in headings, or for emphasis, and there are also interactions with the POS-tagger, which uses capitalisation
as part of its evaluation of whether a word is a proper name. Proper names are excluded from this list.

3 Here we list a verb if it accounts for over 5% of the data. Diversity was harder to evaluate, and impression is a
marginal case as the verbs are not so diverse.
4.2 Verbs/objects
take (106749), pay (18925), play (17832), raise (15477), spend (15267), open (11362), close
(6106), shake (5483), sign (5100), answer (4177), exercise (3265), speak (3013), solve (2555),
score (2495), live (2201), waste (2091), thank (1926), pose (1897), fulfil (1885), wait (1768), shut
(1675), last (1521), incur (1365), research (1072), devote (1025), age (1009), exert (966), bite
(919), park (836), beg (739), slam (634), sip (574), narrow (540), levy (450), nod (433), part (425),
adjourn (424), pave (420), clasp (411), ratify (391), reap (376), bridge (337), shrug (324), enlist
(322), clench (313), bow (303), wage (299), clap (256), redress (248), dial (232), retrace (205), poll
(202), cock (200), coin (194), comb (193), purse (191), grit (170), stake (169), allay (167), wring
(157), wag (154), peacekeep (151), fell (147), incline (139), wreak (138), ruffle (136), wrinkle
(134), preheat (134), adduce (133), broach (122), foot (121), hunch (109), blink (103), bide (103),
disobey (99), whet (89), sclerose (86), jog (85), buck (85), moisten (81), jumble (81), recharge (81),
wuther (81), overstep (74), scroll (74), crane (74), hazard (70), mince (66), pervert (65), elapse
(60), hesitate (60), grope (59), elbow (57), re-run (57), transact (55), contort (55), redouble (55),
immunise (53), pry (52)
It is interesting to see take in the list, and this clearly reflects its common role as a support verb. It
has a long list of very strong collocates, including many of the items in the nouns list above. Some
other items in this group are corollaries of nouns in the noun/objects list (slam doors, beg pardon).
Some relate to past and present participles taking adjectival roles, rather than finite verbs
(peacekeeping force/troops/operation, redoubled efforts, sclerosing cholangitis), and some items,
like this last, relate to specialist documents in the corpus. Shake (hand, head, fist), wag (tail,
finger, dog), clap (hands, eyes) and clasp (hands, eyes, fingers) form an intriguing group.
5. Discussion
The measure captures some aspects of lexicographically interesting collocationality. It has the
additional merit of being based on well-understood mathematics, from Information Theory.
It gives greatest weight to those bases which occur predominantly with just one strong collocate:
while this provides one useful list, another will focus on words with a small number of strong
collocates.
The measure makes no acknowledgement of the frequency of the collocate. Collocates which are
lower-frequency in the corpus at large make for more striking collocations, as is explored
extensively in relation to measures of collocation strength: it is as yet unclear whether this
should play a role in a measure of collocationality.
measure for collocationality. Intuitively, it seems appropriate that enlist help contributes more to
the collocationality score for help than provide help, even though it is less frequent. Entropy does
not capture the intuition.
Collocationality as explored in this paper is grammatical-relation-specific. It may be appropriate to
bring together data from different grammatical relations to provide a unified collocationality profile
for a word.
Each of these proposed extensions will lead away from direct use of entropy as the underlying
mathematical idea. The approach taken here is to say that, while this may be inevitable, it is good
to start from as simple mathematics as possible.
References
A. Dictionaries
Cambridge Advanced Learner’s Dictionary, 2nd edition 2005. CUP.
Macmillan English Dictionary for Advanced Learners. 2002. Macmillan.
Oxford Collocations Dictionary. 2002. OUP.
B. Other Literature
Carter, R. and M. McCarthy (1988) Vocabulary and Language Teaching. Longman.
Church, K. and P. Hanks (1989) “Word Association Norms, Mutual Information, and Lexicography”, in Proceedings of
the 27th Annual Meeting of the Association for Computational Linguistics; reprinted in Computational Linguistics,
Spring 1990.
Dunning, T. (1993) Accurate models for the statistics of surprise and coincidence. Computational Linguistics, 19(1):
61-74.
Evert, S. and B. Krenn (2001) Methods for the qualitative evaluation of lexical association measures. Proc. 39th
Meeting of the Association for Computational Linguistics, Toulouse, France. 188-193.
Firth, J. (1957) “A Synopsis of Linguistic Theory 1930-1955” in Studies in Linguistic Analysis, Philological Society,
Oxford; reprinted in Palmer, F. (ed. 1968) Selected Papers of J. R. Firth, Longman, Harlow.
Hanks, P. (2002) Mapping Meaning onto Use. In Marie-Hélène Corréard (ed.): Lexicography and Natural Language
Processing: a Festschrift in honour of B. T. S. Atkins. Euralex.
Hoey, M. (2005). Lexical Priming: A new theory of words and language. Routledge.
Kilgarriff, A. and M. Rundell (2002) “Lexical profiling software and its lexicographic applications - a case study.” Proc
EURALEX, Copenhagen, August: 807-818.
Kilgarriff, A., P. Rychly, P. Smrz and D. Tugwell (2004) “The Sketch Engine” Proc. Euralex. Lorient, France, July:
105-116.
Shannon, C. and W. Weaver (1963) The Mathematical Theory of Communication. University of Illinois Press.
Wray, A. (forthcoming, 2006) Formulaic Language. In Brown, K. (ed.) Encyclopedia of Language and Linguistics, 2nd
edition. Oxford: Elsevier.