
COLLOCATION DISCOVERY USING STOP WORDS AND NON-PARAMETRIC STATISTICS

1.1 Introduction

Linguists frequently classify parts of speech as open and closed. Closed parts of speech are composed of a fixed, or nearly fixed, group of words, with few or no additions over decades or centuries. A few examples of closed parts of speech include the article (a, the, an), the preposition (to, in, on, etc.), modals/auxiliaries (was, has, have, etc.), and the personal pronoun (I, him, she, etc.). Other parts of speech, such as the noun and, perhaps slightly less, the verb, are continuously changing: new words are added, such as iPhone or Google, while others are labeled archaic (fortnight, etc.) and fade from common use.

Understanding language requires a basic understanding of its parts of speech. While a person may not be familiar with the label noun or verb, he must be able to differentiate between a thing or person, and an action or state. Similarly, in computational language processing, language analysis tools frequently require the ability to distinguish standard parts of speech.

The noun is a particular challenge for computational linguists, not only because it is an open part of speech, but also because it is often formed from multiple words, or tokens.

For example, the multiple word expression, Kennedy Airport, may be mistakenly assumed to be two nouns, one proper and the other common. The meaning of either word in isolation is quite remote from the meaning of the two words taken together. Since such nouns are constantly being added to the corpus, it is a common problem in natural language processing to detect and isolate these multiword expressions.


[1] use the term collocation to describe a multi-word expression that corresponds to some conventional way of saying things. The authors suggest that collocations can be described by one of the following attributes:

Non-Compositionality: As described in the example above, the meaning of Kennedy Airport is not the simple composition of the two words Kennedy + Airport.

Non-Substitutionality: Near-synonyms cannot be substituted for the components of a collocation; for example, white wine cannot be rephrased as yellow wine, even though yellow may describe the wine's color equally well.

Non-Modifiability: Many collocations cannot be freely modified with additional lexical material or through grammatical transformations; the idiom to get a frog in one's throat, for example, does not admit to get an ugly frog in one's throat.

Collocations play an important part in understanding language and common expression. For example, Lewis (2000) provides a theory of language acquisition and emphasizes the importance of collocation to language learning:

“The single most important task facing language learners is acquiring a sufficiently large vocabulary. We now recognize that much of ‘vocabulary’ consists of prefabricated chunks of different kinds. The single most important kind of chunk is collocation. Self-evidently, then, teaching collocation should be a top priority in every language course.” (Lewis, 2000)

1.2 Related Work

Humans learn collocations in a similar way to single words: through repeated encounters in context. Studies of human English second-language learners have shown that reliably learning collocation form and meaning requires more than a single encounter [2]. The researchers demonstrated an increasing ability to recall collocations when learners were presented with 5, 10, or 15 examples rather than only one.


The use of linguistic tools, such as WordNet, has been proposed for extracting collocations from synonyms [3]. For example, collocations for the WordNet synonyms baggage and luggage can be extracted using collocational preferences, where certain words will co-occur more frequently with one synonym than with the other(s). Such a technique is effective at determining that emotional has a much stronger collocational preference for baggage than for luggage.

A comprehensive inventory of association measures for determining the true association of a word pair, given the observed number of times the words occurred together and apart, is provided in [4]. The measures compared include those that have achieved popularity in collocation analysis, such as Mutual Information, the t-score, and the log-likelihood measure, as well as more unusual measures of association, including hypergeometric likelihood and the Liddell, Dice, and Jaccard point estimates. Evert introduces a useful parameter space called the ebo-system, for expectation, balance, and observed, which provides a convenient way of comparing and visualizing the various association measures. A critical problem addressed is that, for low-frequency terms, the observed counts are much higher than the expectation, which leads to inflated estimates of association. By applying the Zipf-Mandelbrot population model, an attempt was made to derive an accurate measure of association for low-frequency collocations. However, the theoretical results differed from observed texts, and Evert concludes that probability estimates of the hapax and dis legomena (i.e. words that occur only once or twice) are distorted in unpredictable ways. He concludes with a recommendation to apply a cutoff ratio of 4 to ensure probabilities can be accurately assessed. The association measures were evaluated using the annotated British National Corpus (BNC) and the German Frankfurter Rundschau (FR) Corpus, with performance compared using precision and recall statistics. In general, the best precision and recall were obtained using the log-likelihood score on the FR. However, for fine-grained comparative evaluation of German figurative expressions (idioms) (e.g. "unter die Arme greifen", to lend a hand, or "ins Gras beissen", to bite the dust) and support-verb constructions (e.g. "in Kraft treten", to come into force, or "unter Druck setzen", to put under pressure), the best association measures differed. The log-likelihood and Pearson X² measures did equivalently well on the figurative expressions; however, for the support-verb constructions the t-score achieved significantly higher precision than the other measures of association.

It has been pointed out in [5] that the assumption of independence for estimating the expected probability, $p_{exp}$, as the null hypothesis for word combinations is unrealistic. For example, since the word the has a high probability of occurring, the probability of the word pair the the should be quite high under independence, since P(UV = 'the the') = P(u = 'the') P(v = 'the'). Independence ignores the semantic, lexical, and syntactic structures inherent in every language. To avoid the independence assumption, the authors propose the use of an Aggregate Markov Model (AMM) for the task of estimating $p_{exp}$, where a hidden class variable represents the level of dependency between the two words. By specifying the number of hidden classes, one can vary the dependency from completely independent (a unigram model) to completely dependent (a bigram model). They apply their solution to three gold standards: the German adjective-noun dataset, the German pp-verb dataset, and the English verb-particle dataset. They compare precision against the independence assumption, but were only able to demonstrate significant improvement on the German adjective-noun dataset.
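As a rough illustration of the AMM factorization (following the aggregate Markov model formulation; the matrices and sizes below are random placeholders rather than fitted parameters, which would normally be estimated with EM):

```python
import numpy as np

# Sketch of the expected probability under an Aggregate Markov Model:
# P(v | u) = sum_c P(c | u) * P(v | c). The parameter matrices here are
# random placeholders; in practice they are fit with EM.

rng = np.random.default_rng(0)
V, C = 1_000, 8                                    # vocabulary size, hidden classes

p_c_given_u = rng.dirichlet(np.ones(C), size=V)    # shape (V, C)
p_v_given_c = rng.dirichlet(np.ones(V), size=C)    # shape (C, V)
p_u = rng.dirichlet(np.ones(V))                    # unigram distribution

def p_exp(u: int, v: int) -> float:
    """Expected pair probability P(u) * P(v | u) under the AMM."""
    return float(p_u[u] * (p_c_given_u[u] @ p_v_given_c[:, v]))

# C = 1 recovers the unigram (independence) null model, while letting C
# grow toward the vocabulary size approaches the full bigram model.
print(p_exp(3, 7))
```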

In [6], 84 different measures of association were compared against a manually annotated Czech corpus. The authors report that the best precision and recall for two-word collocations were provided by point-wise mutual information and logistic regression.

A frequently mentioned limitation of much of the research related to collocation discovery is that meaningful collocations are assumed to be of some fixed length, such as bigrams or trigrams (XXX). For example, consider the following:

“He is a gentlemanly little fellow --his head reaches about to my shoulder--cultured and travelled, and can talk splendidly, which Jack never could”.

Pair-wise collocation analysis as described by Evert (2004), Piao et al. (2005), Mason (2006), and others is unable to discover the collocation gentlemanly little fellow. Furthermore, many of the described solution methods assume adjacency between the words within a collocation. Finally, several measures of association used to discover collocations assume that type occurrence counts follow specific statistical distributions (e.g. Normal, t-Distribution, etc.).

Several methods have been developed to discover collocations that are longer than two words, and collocations that consist of non-adjacent words. The original example of such a method is Xtract [7]. Given a keyword, concordances are formed from the corpus using a fixed-length span of words that precede and follow the keyword. These concordances enable generating position-dependent counts of associated words relative to the keyword. The keyword, an associated word, and their separation can then be used to query the corpus for multi-word collocations. For example, given the keyword controller, the associated word air, and a separation of 2, the collocation air traffic controller can be extracted from the corpus. While this method can discover extended collocations, its inherent shortcoming is that low-frequency words are not productive in terms of generating collocations. These methods also require that a list of keywords be available.
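A minimal sketch of that position-dependent counting idea (our illustration, not Smadja's actual implementation; the tokenization and span width are assumptions):

```python
from collections import Counter

# Xtract-style position-dependent counts: for a given keyword, count
# associated words at each signed offset within a fixed-length span.

def position_counts(tokens, keyword, span=5):
    counts = Counter()                      # (word, offset) -> count
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        for off in range(-span, span + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                counts[(tokens[j], off)] += 1
    return counts

tokens = "the air traffic controller cleared the plane for takeoff".split()
# ('air', -2) records 'air' two positions before 'controller', so the
# span 'air traffic controller' can then be pulled from the corpus.
print(position_counts(tokens, "controller")[("air", -2)])
```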


An alternative heuristic approach [8][9] seeks to avoid these limitations by first finding occurrences of a key word (node word) and all non-function words occurring within a fixed span (typically three or four words to the left and right of the key word), and then recursively growing a series of "collocates" until no neighboring word has an occurrence count above a predefined threshold. The notion of the concgram [10] encapsulates a similar approach, but uses a statistical test to evaluate the significance of the multi-part collocation obtained over a user-defined span.

Collocational chains [11] provide another alternative for extending the range of collocations beyond pairs of words. Multiple adjacent pairs of words are combined to form variable-length collocations as long as all word pairs meet a minimum association criterion, using Gravity Counts and other common measures of association described above. Returning to the previous example, if both gentlemanly little and little fellow achieved high association scores (high collocability), the proposed algorithm would declare gentlemanly little fellow a collocational chain.
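A minimal sketch of the chain-growing idea, with a stand-in association function in place of Gravity Counts (the threshold and scores below are illustrative):

```python
# Grow a chain word by word as long as every adjacent pair clears a
# minimum association threshold; `assoc` stands in for Gravity Counts
# or any other pairwise association measure.

def chains(tokens, assoc, threshold):
    found, i = [], 0
    while i < len(tokens) - 1:
        j = i
        while j < len(tokens) - 1 and assoc(tokens[j], tokens[j + 1]) >= threshold:
            j += 1
        if j > i:
            found.append(tokens[i:j + 1])   # e.g. ['gentlemanly', 'little', 'fellow']
        i = j + 1
    return found

good = {("gentlemanly", "little"), ("little", "fellow")}
print(chains("a gentlemanly little fellow".split(),
             lambda a, b: 1.0 if (a, b) in good else 0.0, 0.5))
```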

1.3 Method

Stop words (or function words) are routinely removed as part of preprocessing a corpus. Indeed, these high-frequency words carry very little semantic content; however, they serve to express grammatical relationships with other words and can also be used to delimit collocations. Our approach uses these cues to extract collocation candidates of various lengths (n ≥ 2). We assume collocations of interest are multiword extensions of single-word prototypes; collocations are delimited by the same surrounding stop words.

Consider the collocations in the following sentences:

"Start the buzz-tail," said Cap'n Bill, with a tremble in his voice.

There was now a promise of snow in the air, and a few days later the ground was covered to the depth of an inch or more.


In both sentences, the nominals tremble and promise of snow are delimited by the surrounding pairing of the article a and the preposition in. Table 1 shows the top twenty surrounds extracted from the Gutenberg Youth Corpus. Unsurprisingly, the surrounds consist of stop words that delimit one or multiple parts of speech. The majority of these surrounds delimit open parts of speech and can extract single- and multiple-word instances of that part of speech.

Table 1. Top twenty rank-ordered surrounds from the Gutenberg Youth Corpus. The examples show that high-frequency surrounds delimit both single- and multi-word expressions.

| Surrounds U_V | POS identified | Single-word example | Two-word collocation candidate example |
|---|---|---|---|
| the_of | Noun | but I heard the sounds of conflict and thus knew that they | sordid appetite for dollars, or the dreary existence of country |
| the_and | Noun | Dan Baxter leaped into the rowboat and took Dora | alone on a church-top, with the blue sky and a few tall pinnacles, |
| a_of | Noun | Great Desert west of the Colorado found a stretch of burning salt | They heard a faint creaking of the flooring |
| and_the | Verb | Heidi was to go and fetch the bag from the shady hollow | of a lookout, and would visit the adventurers again the next day. |
| and_the | Preposition | rays on her bed and on the large heap of hay | they sailed in and out over the great snow-covered peaks |
| as_as | Adverb | I know as much as you, Ned. That fellow ran us down, that's all. | the Rover boys became as light hearted as ever. |
| the_. | Noun | "The old girl there," he answered, pointing to the wreck. | following day to join him at the Tavistock Hotel. |
| and_and | Verb | of the village laughing and yelling and knew that | tried not to show it, and sang songs and cheered its opponents. |
| and_and | Adjective | His [..] call had been officious and unpleasant and unsolicited | was [..] quite broad and led upward and in the general direction |
| and_and | Noun | and Beatrice and Benedick remained alone in the church. | in snowballing each other and Jack Ness and Aleck Pop. |
| his_and | Noun | John Ellis laid down his paper and stood up with a sarcastic smile. | his filth or his armament were his cackling laughter and the strange |
| was_to | Verb | and I was forced to try | Pan-at-lee was listening intently to the sounds of the […] gryfs |
| to_and | Verb | friendship had been formed which was to grow and deepen | He commenced to laugh aloud and stood up very straight |
| to_and | Noun | after paying his fare to Montrose and buying his cheese | he says, be a lower class, given up to physical toil and confined |
| to_a | Verb | Dr. Henderson was to give a report today on the condition of | and in the meantime she thought it well to search out a place |
| the_in | Noun | through the crowd in the store, | me once more revisiting the glimpses of the street lamps in my favourite |
| to_. | Noun/Pronoun | take a child of five years old with me to Frankfurt. | He turned to Henry Cale. |
| to_. | Verb | The Hun ceased blustering and began to plead. | if we open the door and allow any one to peep in. |
| the_were | Noun | cross the fields where the fireflies were lighting their starry lamps. | The Cowardly Lion and the Hungry Tiger were unharnessed |
| and_it | Verb | she took it out of the basket and threw it on the ground. | from the pocket, and had snatched it away. |
| it_be | Modal/Auxiliary | chance he should be led through the lion pit it would be a simple | No, it could not be from Justin. |
| one_the | Preposition | The one in the cart. | The girl had moved to one side of the apartment and was pretending |
| a_to | Noun | We'll give the little sneak a chance to turn over a new leaf | he made a silent gesture to Miss Ophelia to come and look. |
| was_and | Adjective | She was capricious and exacting | the creature was jet black and entirely covered with hair |
| you_to | Verb | Do you wish to leave any name or message? | Not at all; you are beginning to get civilized. |

We define the triplet UVW to represent the combined predecessor, center, and successor components of three or more words contained within a sentence:

$$U = \{u\}, \qquad V = \{v_1, v_2, \ldots, v_n\}, \qquad W = \{w\}$$

where u, v, and w are individual words that are members of the predecessor, center, and successor sets respectively. U and W are fixed at a single word each, but the center V may consist of one or more words. We define a surrounds $\varphi = \{U, W\}$ as any observed pairing of predecessor and successor words that encloses one or more centers: $\mathfrak{V}_\varphi = \{V : \exists\, U_\varphi V W_\varphi\}$. In the example above, $\varphi = \{a, in\}$ encloses the centers $\mathfrak{V}_\varphi = \{tremble,\ promise\ of\ snow\}$.
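Since the paper gives no code for this definition, the following is a minimal sketch under our reading of it: every ordered pair of stop words enclosing at least one intervening word contributes a surrounds and a center. The stop-word list and the cap on center length are illustrative assumptions.

```python
import re

STOPS = {"a", "an", "the", "in", "on", "of", "to", "and", "was", "with"}

# Enumerate surrounds phi = (u, w) and the centers V they enclose.
# Sentence boundaries would additionally contribute <start> and <.|!|?|;>.

def surrounds(sentence, max_center=4):
    toks = re.findall(r"[\w'-]+", sentence.lower())
    stops = [i for i, t in enumerate(toks) if t in STOPS]
    out = {}
    for i in stops:
        for j in stops:
            if i + 1 < j <= i + 1 + max_center:      # at least one center word
                out.setdefault((toks[i], toks[j]), []).append(
                    " ".join(toks[i + 1:j]))
    return out

cents = surrounds("There was now a promise of snow in the air")
print(cents[("a", "in")])   # ['promise of snow']
```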

In the first processing step, we discover surrounds from the corpus, count their occurrences, and rank order them from highest to lowest occurrence count:

$$\mathfrak{S} = \{\varphi^{(1)}, \varphi^{(2)}, \ldots, \varphi^{(M)}\}, \qquad c(\varphi^{(1)}) \ge c(\varphi^{(2)}) \ge \cdots \ge c(\varphi^{(M)})$$

Note that the first word in an expression (sentence or independent clause) is preceded by u = <start>, and the last is followed by w = <!|.|?|;>, so that the first and last words are captured within surrounds. Figure 1 shows the cumulative distribution function of the surrounds ordered by occurrence count.

Surrounds with high occurrence counts tend to consist of function words, but as the counts decrease, the surrounds instead tend to delimit function words. For example, at the left side of the chart, {the, of} clearly delimits the nominals sounds and dreary existence, but at the other extreme, the surrounds consisting of the nouns {prejudices, society} delimits the function word of. The many low-count surrounds tend not to be useful for extracting collocations, while the high-count surrounds are extremely productive. This observation is not a hard and fast rule (e.g. the surrounds {poor, mamma} occurred once and delimited the adjective little); it is nonetheless a language characteristic that can be exploited for extracting collocations.

Figure 1. Empirical CDF of surrounds extracted from the Gutenberg Youth Corpus ordered from highest to lowest occurrence count. In total, the corpus generated 1,975,016 surrounds types and 9,743,797 surrounds instances. The most frequently occurring surrounds type {the, of} was observed 77,611 times.

The second step in the process selects the top k of the rank-ordered surrounds, $\mathfrak{S}_\upsilon$, satisfying the surrounds total proportionality criterion υ:

$$\upsilon = \frac{\sum_{i=1}^{k} c(\varphi^{(i)})}{\sum_{j=1}^{M} c(\varphi^{(j)})}$$

Using the example in Figure 1, if υ = 25%, the top 1,848 surrounds are selected for collocation candidate extraction.
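A sketch of this selection step; apart from the 77,611 count for {the, of} quoted in the Figure 1 caption, the counts below are invented for illustration:

```python
# Keep the top-k surrounds whose cumulative count reaches a fraction
# upsilon of all surrounds instances.

def select_surrounds(counts, upsilon=0.25):
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total, running, selected = sum(counts.values()), 0, []
    for phi, c in ranked:
        if running / total >= upsilon:
            break
        selected.append(phi)
        running += c
    return selected

counts = {("the", "of"): 77_611, ("the", "and"): 50_000, ("a", "of"): 30_000,
          ("and", "the"): 20_000, ("his", "and"): 5_000}
print(select_surrounds(counts, upsilon=0.5))   # [('the', 'of'), ('the', 'and')]
```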

Borrowing terminology from signal processing, these selected surrounds become a type of collocation-extraction kernel to be convolved over the original corpus in search of new multi-word centers (i.e. n > 1). Algorithm ExtractCollocationCandidates extracts collocation candidates from corpus S using the selected surrounds $\mathfrak{S}_\upsilon$.

Algorithm ExtractCollocationCandidates(S, 𝔖_υ)
    𝔙 := ∅
    for all expressions in the corpus: ∀ s ∈ S
        for all selected surrounds: ∀ φ ∈ 𝔖_υ
            for word indices in s: i = 1 to |s|
                if word s_i matches the predecessor: s_i = U_φ
                    for word indices in s: j = i + 3 to |s|
                        if word s_j matches the successor: s_j = W_φ
                            add the enclosed center to the candidates:
                            𝔙 := 𝔙 ∪ { s_{i+1} ... s_{j-1} }
    return the candidate list, 𝔙

Applying the surrounds {a, in} to the following sentences, the collocation candidates wild beast and hole bored are extracted:

Here it leaped futilely a half dozen times for the top of the palisade, and then trembling and chattering in rage it ran back and forth along the base of the obstacle, just as a wild beast in captivity paces angrily before the bars of its cage.

It was the one that seemed to have had a hole bored in it and then plugged up again.
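As a concrete check, a direct Python rendering of the pseudocode above (our translation; tokenization simplified) reproduces exactly these two candidates:

```python
# `corpus` is a list of tokenized expressions and `selected` a set of
# (u, w) surrounds. j starts at i + 3 so that only multi-word centers
# (n > 1) are produced.

def extract_candidates(corpus, selected):
    candidates = set()
    for s in corpus:                           # each tokenized expression
        for (u, w) in selected:                # each selected surrounds
            for i, tok in enumerate(s):
                if tok != u:
                    continue
                for j in range(i + 3, len(s)):
                    if s[j] == w:
                        candidates.add(tuple(s[i + 1:j]))   # the center V
    return candidates

corpus = ["just as a wild beast in captivity paces angrily".split(),
          "to have had a hole bored in it and then plugged".split()]
print(extract_candidates(corpus, {("a", "in")}))
# {('wild', 'beast'), ('hole', 'bored')}
```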

Clearly, wild beast meets the collocational criteria described above, whereas hole bored is less compelling. The objective of the final step is to select the candidates that satisfy the collocation criteria and discard those that do not. We apply a non-parametric variation of a frequently applied method to determine which collocation candidates co-occur significantly more often than chance. The null hypothesis assumes words are selected independently at random, and claims that the probability of a collocation candidate V is the product of the probabilities of the individual words [12]:

$$H_0 : P(V) = \prod_{i=1}^{n} P(v_i)$$

We also apply the standard test statistic:

$$Z = \frac{\bar{X} - \mu}{\sqrt{s^2 / n}} = \frac{P(V) - \prod_{i=1}^{n} P(v_i)}{\sqrt{P(V)/n}}$$

As described in [1], the sample variance $s^2 \approx P(V)$ is based on the assumption that the null hypothesis is true and that the selection of a word is essentially a Bernoulli trial with parameter P(V), giving sample variance $s^2 = P(V)(1 - P(V))$ and mean $\mu = P(V)$. Since $P(V) \ll 1.0$, $s^2 \approx P(V)$.
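A small numeric sketch of this statistic; the counts are hypothetical, and the probabilities are maximum-likelihood estimates from them:

```python
from math import prod, sqrt

# Z = (P(V) - prod P(v_i)) / sqrt(P(V) / n), with probabilities
# estimated from raw counts over the corpus positions.

def z_score(candidate_count, word_counts, n_positions):
    p_v = candidate_count / n_positions                  # P(V)
    p_null = prod(c / n_positions for c in word_counts)  # product of P(v_i)
    return (p_v - p_null) / sqrt(p_v / n_positions)

# Hypothetical counts: 'wild beast' seen 80 times, 'wild' 1_500 and
# 'beast' 2_000 times, over 10_000_000 bigram positions.
print(round(z_score(80, [1_500, 2_000], 10_000_000), 2))   # ~ 8.91
```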

The histogram of counts and its empirical CDF with respect to the Z score for collocation candidates of lengths 2, 3, and 4 words are shown in Figure 2. The median of the distribution (1.15) is significantly greater than 0, giving credence to the notion, as many have already indicated, that words are not selected independently at random. Furthermore, the asymmetrical shape of the distribution suggests that nonparametric methods should be preferred.


Figure 2. Distribution of collocation candidates of lengths 2-4.

The empirical CDF [13], $\hat{F}_N$, can be used as a non-parametric alternative for approximating the true, but unknown, distribution $F$:

$$\hat{F}_N(x) = \frac{1}{N}\sum_{i=1}^{N} I(x_i \le x), \qquad I = \begin{cases} 1, & \text{if } x_i \le x \\ 0, & \text{otherwise} \end{cases}$$


The Dvoretzky, Kiefer, and Wolfowitz (DKW) inequality provides a method of computing the upper and lower confidence bounds for an empirical CDF, given a type-I error probability α and the total number of instances within the sample, N:

$$P\left\{\sup_x |F(x) - \hat{F}_N(x)| > \varepsilon\right\} \le 2e^{-2N\varepsilon^2} = \alpha$$

$$\varepsilon = \sqrt{\frac{1}{2N}\ln\left(\frac{2}{\alpha}\right)}$$

The upper and lower bounds of the empirical CDF can then be calculated:

$$L(x) = \max\{\hat{F}_N(x) - \varepsilon,\ 0\} \quad \text{and} \quad U(x) = \min\{\hat{F}_N(x) + \varepsilon,\ 1\}$$

For the Gutenberg Youth Corpus, with N = 9,743,797 and a selected α = 0.05, this provides a very tight uncertainty bound of 95% ± 0.04% and a critical value of 2.51.
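A quick check of these figures; the function is a direct transcription of the ε formula above:

```python
from math import log, sqrt

# DKW band half-width: epsilon = sqrt(ln(2 / alpha) / (2 N)).

def dkw_epsilon(n: int, alpha: float = 0.05) -> float:
    return sqrt(log(2 / alpha) / (2 * n))

def band(f_hat: float, eps: float) -> tuple:
    """Clip the empirical CDF value to the [L(x), U(x)] confidence band."""
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)

eps = dkw_epsilon(9_743_797)
print(f"{eps:.6f}")      # 0.000435, i.e. about the 0.04% quoted above
print(band(0.95, eps))   # (0.949565..., 0.950435...)
```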

1.4 Results

In total, 1,403 collocations were selected at α = 0.05. Tables 2 and 3 show examples of accepted and rejected collocation candidates.


Table 2. Accepted collocation candidates (α = 0.05), rank-ordered by collocation confidence.

| Accepted collocation candidate | Occurrence count | Collocation confidence |
|---|---|---|
| there was | 3685 | 100% |
| don't know | 1181 | 100% |
| Young Inventor | 967 | 100% |
| Rover Boys | 674 | 100% |
| went on | 1295 | 100% |
| Mr. Damon | 537 | 100% |
| Emerald City | 276 | 99% |
| little girl | 407 | 99% |
| at once | 592 | 99% |
| other side of | 299 | 99% |
| steam yacht | 137 | 98% |
| all right | 414 | 98% |
| two men | 204 | 98% |
| was decided | 212 | 97% |
| long ago | 99 | 97% |
| pilot house | 86 | 97% |
| Von Horn | 80 | 97% |
| Aunt Martha | 83 | 97% |
| had gone | 281 | 96% |
| out of the question | 58 | 96% |
| last night | 92 | 96% |
| the outer world | 54 | 96% |
| have a chance | 67 | 96% |
| proved to be | 60 | 96% |
| he was going | 131 | 96% |
| you will find | 46 | 95% |
| i want to | 170 | 95% |
| i should like | 51 | 95% |
| the living room | 41 | 95% |
| beg your pardon | 38 | 95% |
| quickly as possible | 39 | 95% |

Table 3. Rejected collocation candidates.

| Rejected collocation candidate | Occurrence count | Collocation confidence |
|---|---|---|
| corner of the house | 19 | 93% |
| may be added here | 18 | 93% |
| as quickly as possible | 18 | 93% |
| late in the afternoon | 18 | 93% |
| she was glad | 25 | 92% |
| the breakfast table | 14 | 92% |
| there must be | 23 | 92% |
| hold fast to | 10 | 81% |
| as he rushed from | 2 | 78% |
| have you a member | 2 | 78% |
| paper to his eyes | 2 | 78% |
| quietly drew | 2 | 49% |
| it'll do | 2 | 32% |
| the chemist's | 2 | 32% |
| rob began | 2 | 16% |
| you explain | 6 | 16% |
| the detective | 32 | 16% |
| surprised that children might | 19 | 16% |
| been the first | 3 | 16% |
| did i know | 13 | 16% |
| or the | 2 | 0% |
| the as | 3 | 0% |
| the and | 2 | 0% |
| that the | 2 | 0% |
| and in the | 8 | 0% |
