

COLLOCATION DISCOVERY USING STOP WORDS AND NON PARAMETRIC STATISTICS

1.1

Introduction

Linguists frequently classify parts of speech as open and closed. Closed parts of speech are composed of a fixed, or nearly fixed, group of words, with few or no additions over decades or centuries. Examples of closed parts of speech include the article (a, the, an), the preposition (to, in, on, etc.), modals/auxiliaries (was, has, have, etc.), and the personal pronoun (I, him, she, etc.). Other parts of speech, such as the noun and, perhaps slightly less, the verb, are continuously changing: new words are added, such as iPhone or Google, while others are labeled archaic (fortnight, etc.) and fade from common use.

Understanding language requires a basic understanding of its parts of speech. While a person may not be familiar with the label noun or verb, he must be able to differentiate between a thing or person and an action or state. Similarly, in computational language processing, language analysis tools frequently require the ability to distinguish standard parts of speech. The noun is a particular challenge for computational linguists not only because it is an open part of speech, but also because it is often formed by multiple words, or tokens. For example, the multiword expression Kennedy Airport may be mistakenly assumed to be two nouns, one proper and the other common. The meaning of either word in isolation is quite remote from the meaning of the two words taken together. Since such nouns are constantly being added to the corpus, detecting and isolating these multiword expressions is a common problem in natural language processing.

[1] use the term collocation to describe a multi-word expression that corresponds to some conventional way of saying things. The authors suggest that collocations can be described by one of the following attributes:

Non-Compositionality: As described in the example, the meaning of Kennedy Airport is not the simple composition of the two words Kennedy + Airport.

Non-Substitutionality:

Non-Modifiability:

Collocations play an important part in understanding language and common expression. For example, Lewis (2000) offers a theory of language acquisition and emphasizes the importance of collocations to language learning:

“The single most important task facing language learners is acquiring a sufficiently large vocabulary. We now recognize that much of ‘vocabulary’ consists of prefabricated chunks of different kinds. The single most important kind of chunk is collocation. Self-evidently, then, teaching collocation should be a top priority in every language course.” (Lewis, 2000)

Related Work

Humans learn collocations in much the same way that they learn single words: through repeated encounters in context. It has been shown for second-language learners of English that reliably learning the form and meaning of a collocation requires more than a single encounter [2]. The researchers demonstrated an increasing ability to recall collocations when learners were presented with 5, 10, or 15 examples compared to only one example.

Linguistic tools such as WordNet have been proposed for extracting collocations from synonyms [3]. For example, collocations for the WordNet synonyms baggage and luggage can be extracted using collocational preferences, where certain words co-occur more frequently with one synonym than with the other(s). Such a technique is effective at determining that emotional has a much stronger collocational preference for baggage than for luggage.

A comprehensive inventory of association measures for determining the true association of a word pair, given the observed number of times the words occurred together and apart, is provided in [4]. The measures compared include those that have achieved popularity in collocation analysis, such as Mutual Information, the t-score, and the log-likelihood measure, as well as more unusual measures of association, including hypergeometric likelihood and the Liddell, Dice, and Jaccard point estimates. Evert introduces a parameter space called the ebo-system, for expectation, balance, and observed, which provides a useful way of comparing and visualizing the various association measures. A critical problem addressed is that, for low-frequency terms, the observed counts are much higher than the expectation, which leads to inflated estimates of association. By applying the Zipf-Mandelbrot population model, an attempt was made to derive an accurate measure of association for low-frequency collocations. However, the theoretical results differed from those observed in texts, and Evert concludes that probability estimates of the hapax and dis legomena (i.e. words that occur only once or twice) are distorted in unpredictable ways. He concludes with a recommendation to apply a cutoff ratio of 4 to ensure probabilities can be accurately assessed.

The association measures were evaluated using the annotated British National Corpus (BNC) and the German Frankfurter Rundschau (FR) Corpus, and were compared using precision and recall statistics. In general, the best precision and recall were obtained using the log-likelihood score on the FR. However, for fine-grained comparative evaluation of German figurative expressions (idioms) (e.g. “unter die Arme greifen” or “ins Gras beissen”) and support-verb constructions (e.g. “in Kraft treten” or “unter Druck setzen”), the best association measures were different. The log-likelihood and Pearson $\chi^2$ did equivalently well on the figurative expressions; however, for the support verbs the t-score achieved significantly higher precision than the other measures of association.

It has been pointed out in [5] that the assumption of independence for estimating the expected probability, $p_{exp}$, as the null hypothesis in word combinations is unrealistic. For example, since the word ‘the’ has a high probability of occurring, the probability of the word pair the the should be quite high under independence, since P(UV = ‘the the’) = P(u = ‘the’)P(v = ‘the’). Independence ignores the semantic, lexical, and syntactic structures inherent in every language. To avoid the independence assumption, the authors propose the use of an Aggregate Markov Model (AMM) for the task of estimating $p_{exp}$, where the hidden variable represents the level of dependency between the two words. By specifying the number of hidden classes, one can vary the dependencies from completely independent (unigram model) to completely dependent (bigram model). They apply their solution to three gold standards: the German adjective-noun dataset, the German pp-verb dataset, and the English verb-particle dataset. They compare precision against the independence assumption, but were only able to demonstrate significant improvement on the German adjective-noun dataset.

In [6], 84 different measures of association were compared against a manually annotated Czech corpus. The authors report that the best precision and recall for two-word collocations were provided by pointwise mutual information and logistic regression.

A frequently mentioned limitation of much of the research related to collocation discovery is that meaningful collocations are assumed to be of some fixed length, such as bigrams or trigrams (XXX). For example, consider the following:

“He is a gentlemanly little fellow--his head reaches about to my shoulder--cultured and travelled, and can talk splendidly, which Jack never could”.

Pair-wise collocation analysis as described by Evert (2004), Piao et al. (2005), Mason (2006), and others is unable to discover the collocation gentlemanly little fellow. Furthermore, many of the described solution methods assume adjacency between the words within a collocation. Finally, several measures of association used to discover collocations assume that type occurrence counts follow specific statistical distributions (e.g. Normal, t-distribution, etc.).

Several methods have been developed to discover collocations that are longer than two words, and collocations that consist of non-adjacent words. The original example of such a method is Xtract [7]. Given a keyword, concordances are formed from the corpus using a fixed-length span of words that precede and follow the keyword. These concordances enable generating position-dependent counts of associated words relative to the keyword. The keyword, an associated word, and the separation can be used to query the corpus for multi-word collocations. For example, given a keyword controller, associated word air, and a separation of 2, the collocation air traffic controller can be extracted from the corpus. While this method can discover extended collocations, its inherent shortcoming is that low-frequency words are not productive in terms of generating collocations. These methods also require that a list of keywords be available.

An alternative heuristic approach [8] [9] seeks to avoid these limitations by first finding occurrences of a key word (node word) and all non-function words occurring within a fixed span (typically three or four words to the left and right of the key word), and then recursively growing a series of “collocates” until no neighboring words have occurrence counts above a predefined threshold. The notion of the concgram [10] encapsulates a similar approach, but uses a statistical test to evaluate the significance of the multi-part collocation obtained over a user-defined span.

Collocational chains [11] provide another alternative for extending the range of collocations beyond pairs of words. Multiple adjacent pairs of words are combined to form variable-length collocations as long as all word pairs meet a minimum association criterion, using Gravity Counts and other common measures of association described above. Returning to the previous example, if both gentlemanly little and little fellow achieved high association scores (high collectivity), the proposed algorithm would declare gentlemanly little fellow a collocational chain.
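The chaining idea can be illustrated with a minimal sketch. It assumes that pairwise association scores have already been computed by some measure; the function name `collocational_chains`, the score dictionary, and the threshold are hypothetical and are not taken from [11].

```python
def collocational_chains(tokens, scores, threshold):
    """Join runs of adjacent word pairs whose association score meets the threshold.

    tokens:    list of words in one sentence
    scores:    dict mapping (word, next_word) -> association score (assumed given)
    threshold: minimum score for a pair to be linked (hypothetical parameter)
    """
    chains = []
    current = tokens[:1]  # start a potential chain at the first word
    for a, b in zip(tokens, tokens[1:]):
        if scores.get((a, b), float("-inf")) >= threshold:
            current.append(b)  # the pair is strongly associated: extend the chain
        else:
            if len(current) > 1:
                chains.append(tuple(current))
            current = [b]  # restart the chain at the second word of the pair
    if len(current) > 1:
        chains.append(tuple(current))
    return chains


# Illustrative scores only; they are not measured values.
tokens = "he is a gentlemanly little fellow".split()
scores = {("gentlemanly", "little"): 5.0, ("little", "fellow"): 4.0}
chains = collocational_chains(tokens, scores, threshold=3.0)
```

With these illustrative scores, the two linked pairs merge into the single three-word chain from the example sentence.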

Method

Stop words (or function words) are routinely removed as part of preprocessing a corpus. Indeed, these high-frequency words carry very little semantic content; however, they serve to express grammatical relationships with other words and can also be used to delimit collocations. Our approach seeks to use these cues to extract collocation candidates of various lengths, $n \geq 2$. We assume collocations of interest are multiword extensions of single-word prototypes; collocations are delimited by the same surrounding stop words as their prototypes. Consider the following sentences:

"Start the buzz-tail," said Cap'n Bill, with a tremble in his voice.

There was now a promise of snow in the air, and a few days later the ground was covered to the depth of an inch or more.

In both sentences, the nouns tremble and promise of snow are delimited by the surrounding pairing of the article a and the preposition in. Table 1 shows the top twenty surrounds extracted from the Gutenberg Youth Corpus. Unsurprisingly, the surrounds consist of stop words that delimit one or multiple parts of speech. The majority of these surrounds delimit open parts of speech and can extract single- and multiple-word instances of that part of speech.

Table 1. Top twenty rank-ordered surrounds from the Gutenberg Youth Corpus. The examples show that high-frequency surrounds delimit both single- and multi-word expressions.

Surrounds U_V

the_of the_and a_of

POS identified

Noun Noun Noun

and_the

Verb Preposition

Single-Word Examples

but I heard the sounds of conflict and thus knew that they Dan Baxter leaped into the rowboat and took Dora Great Desert west of the Colorado found a stretch of burning salt Heidi was to go and fetch the bag from the shady hollow rays on her bed and on the large heap of hay

Two-Word Collocation Candidate Examples

sordid appetite for dollars, or the dreary existence of country alone on a church-top, with the blue sky and a few tall pinnacles, They heard a faint creaking of the flooring of a lookout, and would visit the adventurers again the next day. they sailed in and out over the great snow-covered peaks as_as the_.

Adverb Noun

I know as much as you, Ned. That fellow ran us down, that's all. "The old girl there," he answered, pointing to the wreck . the Rover boys became as light hearted as ever. following day to join him at the Tavistock Hotel . and_and

Verb Adjective Noun

of the village laughing and yelling and knew that His [..] call had been officious and unpleasant and unsolicited and Beatrice and Benedick remained alone in the church. tried not to show it, and sang songs and cheered its opponents. was [..] quite broad and led upward and in the general direction in snowballing each other and Jack Ness and Aleck Pop. his_and was_to

Noun Verb

John Ellis laid down his paper and stood up with a sarcastic smile. and I was forced to try his filth or his armament were his cackling laughter and the strange Pan-at-lee was listening intently to the sounds of the […] gryfs to_and to_a the_in to_. the_were and_it it_be one_the a_to

Verb Noun Verb Noun Noun/Pronoun Verb Noun Verb Modal Auxiliary Preposition Noun

friendship had been formed which was to grow and deepen after paying his fare to Montrose and buying his cheese Dr. Henderson was to give a report today on the condition of through the crowd in the store, take a child of five years old with me to Frankfurt . The Hun ceased blustering and began to plead . cross the fields where the fireflies were lighting their starry lamps. she took it out of the basket and threw it on the ground. chance he should be led through the lion pit it would be a simple The one in the cart. We'll give the little sneak a chance to turn over a new leaf He commenced to laugh aloud and stood up very straight he says, be a lower class, given up to physical toil and confined and in the meantime she thought it well to search out a place me once more revisiting the glimpses of the street lamps in my favourite He turned to Henry Cale . if we open the door and allow any one to peep in. The Cowardly Lion and the Hungry Tiger were unharnessed from the pocket, and had snatched it away. No, it could not be from Justin. The girl had moved to one side of the apartment and was pretending he made a silent gesture to Miss Ophelia to come and look.


was_and you_to

Adjective

She was capricious and exacting the creature was jet black and entirely covered with hair

Verb

Do you wish to leave any name or message? Not at all; you are beginning to get civilized.

We define the triplet UVW to represent the combined predecessor, center, and successor components of three or more words contained within a sentence:

$$U = \{u\}, \qquad V = \{v_1, v_2, \ldots, v_n\}, \qquad W = \{w\}$$

where u, v, and w are individual words that are members of the predecessor, center, and successor sets, respectively. For U and W we apply a fixed length of a single word, but the center V may consist of one or more words. We define a surrounds $\varphi = \{U, W\}$ as any observed pairing of predecessor and successor words that encloses one or more centers:

$$\mathfrak{V}_\varphi = \forall \{V\} : \exists\, U_\varphi V W_\varphi$$

In the example above, $\varphi = \{a, in\}$ encloses the centers $\mathfrak{V}_\varphi = \{\text{tremble}, \text{promise of snow}\}$.

In the first processing step, we discover surrounds from the corpus, count their occurrences, and rank order them from highest to lowest occurrence count:

$$\mathfrak{S} = \{\varphi_{(1)}, \varphi_{(2)}, \ldots, \varphi_{(M)}\} : c(\varphi_{(1)}) \geq c(\varphi_{(2)}) \geq \cdots \geq c(\varphi_{(M)})$$

Note that the first word in an expression (sentence or independent clause) is preceded by a start marker assigned to u, and the last word is followed by an end marker assigned to w, so that the first and last words can be captured within surrounds.

Figure 1 shows the cumulative distribution function of the surrounds ordered by occurrence count. Surrounds with high occurrence counts tend to consist of function words, but as the counts decrease, the surrounds tend to delimit the function words. For example, at the left side of the chart, {the, of} clearly delimits the nominals sounds and dreary existence, but at the other extreme, the surrounds consisting of the nouns {prejudices, society} delimit the function word of. The many low-count surrounds tend not to be useful in extracting collocations, while the high-count surrounds are extremely productive. This observation is not a hard and fast rule (e.g. the surrounds {poor, mamma} occurred once and delimited the adjective little); it is nonetheless a language characteristic that can be exploited for extracting collocations.

Figure 1. Empirical CDF of surrounds extracted from the Gutenberg Youth Corpus ordered from highest to lowest occurrence count. In total, the corpus generated 1,975,016 surrounds types and 9,743,797 surrounds instances. The most frequently occurring surrounds type {the, of} was observed 77,611 times.
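The first processing step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes surrounds are discovered around single-word centers (the single-word prototypes), that sentences are already tokenized, and that `<s>` and `</s>` serve as boundary markers. The marker strings and the function name `count_surrounds` are hypothetical.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def count_surrounds(sentences, center_len=1):
    """Count (predecessor, successor) pairs that enclose a center of center_len words.

    sentences:  iterable of token lists
    center_len: number of words in the enclosed center (assumed to be 1 here)
    """
    counts = Counter()
    for sent in sentences:
        toks = [START] + [t.lower() for t in sent] + [END]
        # The successor sits center_len + 1 positions after the predecessor.
        for i in range(len(toks) - center_len - 1):
            counts[(toks[i], toks[i + center_len + 1])] += 1
    return counts


sentences = [
    ["a", "tremble", "in", "his", "voice"],
    ["a", "promise", "in", "the", "air"],
]
counts = count_surrounds(sentences)
ranked = counts.most_common()  # surrounds ordered from highest to lowest count
```

Here `ranked[0]` is the surrounds `("a", "in")` with a count of 2, since it encloses both tremble and promise.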

The second step in the process starts with selecting the top k surrounds from the rank-ordered list, $\mathfrak{S}_\upsilon$, satisfying the surrounds total proportionality criterion, $\upsilon$:

$$\upsilon = \frac{\sum_{i=1}^{k} c(\varphi_{(i)})}{\sum_{j=1}^{M} c(\varphi_{(j)})}$$

Using the example in Figure 1, if $\upsilon$ = 25%, the top 1,848 surrounds are selected for collocation candidate extraction. Borrowing terminology from signal processing, these selected surrounds become a type of collocation-extraction kernel to be convolved over the original corpus in search of new multi-word centers (i.e. n > 1). Algorithm ExtractCollocationCandidates is used to extract collocation candidates from corpus S using the selected surrounds $\mathfrak{S}_\upsilon$.

Algorithm ExtractCollocationCandidates(S, 𝔖_υ)

  𝔙 := ∅
  for all expressions in corpus: ∀ s ∈ S
    for all selected surrounds: ∀ φ ∈ 𝔖_υ
      for word indices in s: i = 1 to |s|
        if word s_i = predecessor: s_i = U_φ
          for word indices in s: j = i + 3 to |s|
            if word s_j = successor: s_j = W_φ
              Add to collocation candidates: 𝔙 := 𝔙 ∪ ⋃_{k=i+1}^{j−1} s_k
  Return candidate list, 𝔙

Applying the surrounds {a, in} to the following sentences, the collocation candidates wild beast and hole bored are extracted:

Here it leaped futilely a half dozen times for the top of the palisade, and then trembling and chattering in rage it ran back and forth along the base of the obstacle, just as a wild beast in captivity paces angrily before the bars of its cage.

It was the one that seemed to have had a hole bored in it and then plugged up again.
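The selection criterion and the extraction step can be sketched together. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, tokens are lower-cased, and the `max_len` cap on center length is an assumption motivated by the candidate lengths of 2 to 4 reported later (the algorithm as stated searches to the end of the expression).

```python
from collections import Counter


def select_top_surrounds(ranked, upsilon):
    """Select top-ranked surrounds until their share of all instances reaches upsilon.

    ranked:  list of (surrounds, count) pairs sorted by descending count
    upsilon: target proportion of total surrounds instances, e.g. 0.25
    """
    total = sum(count for _, count in ranked)
    selected, running = [], 0
    for surround, count in ranked:
        if total == 0 or running / total >= upsilon:
            break
        selected.append(surround)
        running += count
    return selected


def extract_candidates(sentences, surrounds, min_len=2, max_len=4):
    """Collect multi-word centers enclosed by any selected (predecessor, successor) pair."""
    selected = set(surrounds)
    candidates = Counter()
    for sent in sentences:
        toks = [t.lower() for t in sent]
        for i, u in enumerate(toks):
            # A center of min_len words puts the successor at i + min_len + 1,
            # which for min_len = 2 matches j = i + 3 in the algorithm.
            last_j = min(i + max_len + 1, len(toks) - 1)
            for j in range(i + min_len + 1, last_j + 1):
                if (u, toks[j]) in selected:
                    candidates[tuple(toks[i + 1:j])] += 1
    return candidates


# Illustrative counts only.
ranked = [(("a", "in"), 5), (("the", "of"), 3), (("x", "y"), 2)]
kernel = select_top_surrounds(ranked, upsilon=0.5)
sentences = [
    "just as a wild beast in captivity paces".split(),
    "it was the one that seemed to have had a hole bored in it".split(),
]
candidates = extract_candidates(sentences, kernel)
```

With these inputs the kernel contains only `("a", "in")`, and the extracted candidates are wild beast and hole bored, matching the example sentences.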

Clearly, wild beast meets the collocational criteria described above, whereas hole bored is less compelling. The objective of the final step is to select the candidates that satisfy the collocation criteria and discard those that do not. We apply a non-parametric variation of a frequently applied method to determine which collocation candidates co-occur significantly more often than chance. The null hypothesis assumes words are selected independently at random and claims that the probability of a collocation candidate V is the same as the product of the probabilities of the individual words [12]:

$$H_0 : P(V) = \prod_{i=1}^{n} P(v_i)$$

We also apply the standard test statistic:

$$Z = \frac{X - \mu}{\sqrt{s^2 / n}} = \frac{P(V) - \prod_{i=1}^{n} P(v_i)}{\sqrt{s^2 / n}}$$

As described in [1], the approximation $s^2 = P(V)$ is based on the assumption that the null hypothesis is true and that the selection of a word is essentially a Bernoulli trial with parameter P(V), with sample variance $s^2 = P(V)(1 - P(V))$ and mean $\mu = p$. Since $P(V) \ll 1.0$, $s^2 \approx P(V)$.

The histogram of counts and its empirical CDF with respect to the Z score for collocation candidates of lengths 2, 3, and 4 words are shown in Figure 2. The median of the distribution (1.15) is significantly greater than 0, giving credence to the notion, as many have already indicated, that words are not selected independently at random. Furthermore, the asymmetrical shape of the distribution suggests that nonparametric methods should be preferred.
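A minimal sketch of the test statistic follows, using the approximation $s^2 \approx P(V)$. It assumes that all probabilities are estimated as relative frequencies over the same sample size, and that this sample size is also the one used in the denominator; both are assumptions, as are the function name `z_score` and the counts in the usage lines, which are illustrative only.

```python
import math


def z_score(candidate_count, word_counts, sample_size):
    """Z statistic for a candidate under the independence null hypothesis.

    candidate_count: occurrences of the whole candidate V
    word_counts:     occurrences of each individual word v_i in V
    sample_size:     sample size used for every relative frequency (assumption)
    """
    if candidate_count <= 0:
        raise ValueError("candidate_count must be positive")
    p_v = candidate_count / sample_size
    # Expected probability if the words were selected independently.
    p_independent = math.prod(c / sample_size for c in word_counts)
    # With s^2 ~ P(V), the standard error is sqrt(P(V) / sample_size).
    return (p_v - p_independent) / math.sqrt(p_v / sample_size)


# Illustrative counts: a candidate seen 30 times whose words occur 40 and 50 times.
z_strong = z_score(30, [40, 50], 1_000_000)
# A candidate seen 8 times whose words are both very frequent.
z_weak = z_score(8, [15828, 4675], 14_307_668)
```

The first candidate occurs almost every time its words occur and scores well above the second, whose count is close to what independence would predict.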

Figure 2. Distribution of Collocation Candidates of Lengths 2 - 4

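The empirical CDF [13], $\hat{F}_N$, can be used as a non-parametric alternative for approximating the true but unknown distribution, $F_N$:

$$\hat{F}_N(x) = \frac{1}{N} \sum_{i=1}^{N} I(x_i \leq x), \qquad \text{where } I(x_i \leq x) = \begin{cases} 1, & \text{if } x_i \leq x \\ 0, & \text{otherwise} \end{cases}$$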
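The empirical CDF can be computed directly from a list of Z scores. The sketch below is one straightforward way to do so with the standard library; the function name `make_ecdf` and the sample values are hypothetical.

```python
from bisect import bisect_right


def make_ecdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("samples must not be empty")

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return ecdf


# Illustrative Z scores only.
ecdf = make_ecdf([0.5, 1.0, 1.0, 2.5])
share_at_one = ecdf(1.0)
```

Three of the four illustrative scores are at or below 1.0, so `share_at_one` is 0.75.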

The Dvoretzky, Kiefer and Wolfowitz (DKW) inequality provides a method of computing the upper and lower confidence bounds for an empirical CDF given a type-I error probability α and the total number of instances within the sample N:

$$P\left\{ \sup_x \left| F(x) - \hat{F}_N(x) \right| > \varepsilon \right\} \leq 2e^{-2N\varepsilon^2} = \alpha, \qquad \varepsilon = \sqrt{\frac{1}{2N} \ln\left(\frac{2}{\alpha}\right)}$$

The upper and lower bounds of the empirical CDF can then be calculated:

$$L(x) = \max\{\hat{F}_N(x) - \varepsilon, 0\} \quad \text{and} \quad U(x) = \min\{\hat{F}_N(x) + \varepsilon, 1\}$$

The Gutenberg Youth Corpus, with N = 9,743,797 and a selected α = 0.05, provides a very tight uncertainty bound of 95% ± 0.04% and a critical value of 2.51.
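The width of the DKW band for the reported N and α can be checked numerically. The sketch below evaluates the expression for ε and clips the band to [0, 1]; the function names are hypothetical.

```python
import math


def dkw_epsilon(n, alpha):
    """Half-width of the DKW confidence band for an empirical CDF of n samples."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def dkw_band(f_hat, eps):
    """Lower and upper bounds for one ECDF value, clipped to the interval [0, 1]."""
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)


eps = dkw_epsilon(9_743_797, 0.05)   # N and alpha as reported for the corpus
lower, upper = dkw_band(0.95, eps)   # band around the 95% point of the ECDF
```

For these inputs ε is roughly 0.000435, or about 0.04%, which agrees with the uncertainty bound quoted above.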

Results

1403 collocations were selected.

| Accepted Collocation Candidate (α = 0.05) | Occurrence count | Collocation Confidence |
|---|---|---|
| there was | 3685 | 100% |
| don't know | 1181 | 100% |
| Young Inventor | 967 | 100% |
| Rover Boys | 674 | 100% |
| went on | 1295 | 100% |
| Mr. Damon | 537 | 100% |
| Emerald City | 276 | 99% |
| little girl | 407 | 99% |
| at once | 592 | 99% |
| other side of | 299 | 99% |
| steam yacht | 137 | 98% |
| all right | 414 | 98% |
| two men | 204 | 98% |
| was decided | 212 | 97% |
| long ago | 99 | 97% |
| pilot house | 86 | 97% |
| Von Horn | 80 | 97% |
| Aunt Martha | 83 | 97% |
| had gone | 281 | 96% |
| out of the question | 58 | 96% |
| last night | 92 | 96% |
| the outer world | 54 | 96% |
| have a chance | 67 | 96% |
| proved to be | 60 | 96% |
| he was going | 131 | 96% |
| you will find | 46 | 95% |
| i want to | 170 | 95% |
| i should like | 51 | 95% |
| the living room | 41 | 95% |
| beg your pardon | 38 | 95% |
| quickly as possible | 39 | 95% |

Rejected Collocation Candidate

corner of the house may be added here as quickly as possible late in the afternoon she was glad the breakfast table there must be hold fast to as he rushed from

Occurrence count

19 18 18 18 25 14 23 10 2 14

Collocation Confidence

93% 93% 93% 93% 92% 92% 92% 81% 78%

have you a member paper to his eyes quietly drew it`ll do the chemist`s rob began you explain the detective surprised that children might been the first did i know or the the as the and that the and in the

2 2 2 2 2 2 6 32 19 3 13 2 3 2 2 8

78% 78% 49% 32% 32% 16% 16% 16% 16% 16% 16% 0% 0% 0% 0% 0%
