CHAPTER 1: ANALYSIS
1.1 Problem Description
Linguists frequently classify parts of speech as open or closed. Closed parts of speech are composed of a fixed, or nearly fixed, group of words, with few or no additions over decades or centuries. Examples of closed parts of speech include the article (a, the, an), the preposition (to, in, on, etc.), modals/auxiliaries (was, has, have, etc.), and the personal pronoun (I, him, she, etc.). Other parts of speech, such as the noun and, to a somewhat lesser degree, the verb, are continuously changing: new words are added, such as iPhone or Google, while others are labeled archaic (fortnight, etc.) and fade from common use.
Understanding language requires a basic understanding of its parts of speech. While
a person may not be familiar with the label noun or verb, he must be able to differentiate
between a thing or person, and an action or state. Similarly, in computational language
processing, language analysis tools frequently require the ability to distinguish standard
parts of speech.
The noun is a particular challenge for computational linguists not only because it is an open part of speech, but also because it is often formed from multiple words, or tokens. For example, the multiword expression Kennedy Airport may be mistakenly treated as two separate nouns, one proper and the other common. The meaning of either word in isolation is quite remote from the meaning of the two words taken together. Because such nouns are constantly being added to the language, detecting and isolating these multiword expressions is a common problem in natural language processing.
Manning and Schütze (1999) use the term collocation to describe a multi-word expression that corresponds to some conventional way of saying things. The authors suggest that collocations can be characterized by the following attributes:
Non-Compositionality: As described in the example above, the meaning of Kennedy Airport is not the simple composition of the meanings of the two words Kennedy + Airport.
Non-Substitutability: The components of a collocation cannot be replaced by near-synonyms; one says white wine rather than yellow wine.
Non-Modifiability: Many collocations resist modification with additional lexical material or through grammatical transformations.
Collocations play an important part in understanding language and common expression. For example, Lewis (2000), writing on language acquisition, emphasizes the importance of collocation to language learning:
“The single most important task facing language learners is acquiring
a sufficiently large vocabulary. We now recognize that much of
‘vocabulary’ consists of prefabricated chunks of different kinds. The
single most important kind of chunk is collocation. Self-evidently, then,
teaching collocation should be a top priority in every language
course.” (Lewis, 2000)
1.2 Performance Criteria
1.3 RELATED WORK
A comprehensive inventory of association measures for determining the true association of a word pair, given the observed number of times the words occurred together and apart, is provided by Evert (2004). The association measures compared include those that have achieved popularity in collocation analysis, such as mutual information, the t-score, and the log-likelihood measure, as well as more unusual measures of association, including the hypergeometric likelihood and the Liddell, Dice, and Jaccard point estimates. Evert introduces a useful parameter space called the ebo-system, for expectation, balance, and observed, which provides a convenient way of comparing and visualizing the various association measures. A critical problem addressed is that, for low-frequency terms, the observed counts are much higher than the expected counts, which leads to inflated estimates of association. By applying a Zipf-Mandelbrot population model, Evert attempts to derive an accurate measure of association for low-frequency collocations. However, the theoretical results differ from those observed in real texts, and he concludes that probability estimates of hapax and dis legomena (i.e., words that occur only once or twice) are distorted in unpredictable ways. He recommends applying a cutoff ratio of 4 to ensure that probabilities can be accurately assessed. The association measures were evaluated using the annotated British National Corpus (BNC) and the German Frankfurter Rundschau (FR) corpus, and their collocation extraction quality was compared using precision and recall statistics. In general, the best precision and recall performance was achieved by the log-likelihood score on the FR corpus. However, for a fine-grained comparative evaluation of German figurative expressions (idioms) (e.g., “unter die Arme greifen” or “ins Gras beissen”) and support-verb constructions (e.g., “in Kraft treten” or “unter Druck setzen”), the best association measures differed. The log-likelihood and Pearson X2 measures did equally well on the figurative expressions; however, for the support-verb constructions the t-score achieved significantly higher precision than the other measures of association.
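To make these measures concrete, the following sketch (an illustration of the standard formulas, not Evert's implementation; the function name and the toy counts are invented for this example) computes pointwise mutual information, the t-score, and the log-likelihood ratio from the 2x2 contingency table of a word pair:

import math

def association_scores(o11, o12, o21, o22):
    """Compute PMI, t-score, and log-likelihood for a word pair from its
    2x2 contingency table:
        o11 = count(u, v)        o12 = count(u, not v)
        o21 = count(not u, v)    o22 = count(not u, not v)
    """
    n = o11 + o12 + o21 + o22
    r1, c1 = o11 + o12, o11 + o21          # marginal counts for u and v
    e11 = r1 * c1 / n                      # expected co-occurrences under independence

    pmi = math.log2(o11 / e11)
    t_score = (o11 - e11) / math.sqrt(o11)

    # log-likelihood ratio (G^2), summed over all four cells
    def term(obs, exp):
        return obs * math.log(obs / exp) if obs > 0 else 0.0
    e12 = r1 * (o12 + o22) / n
    e21 = (o21 + o22) * c1 / n
    e22 = (o21 + o22) * (o12 + o22) / n
    g2 = 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))

    return {"pmi": pmi, "t": t_score, "log_likelihood": g2}

# Toy example: a pair occurring 30 times together in a corpus of one million bigrams
print(association_scores(o11=30, o12=70, o21=120, o22=999_780))

All three scores are driven by the gap between the observed co-occurrence count and the count expected under independence, which is why they diverge most sharply for low-frequency pairs.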
Bouma (2010) points out that the assumption of independence for estimating the expected probability, pexp, as the null hypothesis for word combinations is unrealistic. For example, under independence, because the word ‘the’ has a high probability of occurring, the expected probability of the combination ‘the the’ would also be quite high, since P(uv = ‘the the’) = P(u = ‘the’)P(v = ‘the’); in practice such a sequence is rare. Independence ignores the semantic, lexical, and syntactic structures inherent in every language. To avoid the independence assumption, Bouma proposes the use of an Aggregate Markov Model (AMM) for the task of estimating pexp, in which a hidden variable represents the level of dependency between the two words. By varying the number of hidden classes, the model ranges from completely independent (a unigram model) to completely dependent (a bigram model). The approach is evaluated against three gold standards: a German adjective-noun dataset, a German PP-verb dataset, and an English verb-particle dataset. Precision is compared against the independence assumption, but a significant improvement is demonstrated only on the German adjective-noun dataset.
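As a sketch of the underlying idea (the notation here is mine and simplifies Bouma's presentation), an aggregate Markov model with C hidden classes c estimates the expected probability of the second word given the first as

$$ p_{\mathrm{exp}}(v \mid u) = \sum_{c=1}^{C} P(c \mid u)\, P(v \mid c), $$

so that C = 1 reduces to the independent (unigram) baseline, while letting C grow toward the vocabulary size approaches a full bigram model.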
In Pecina (2005), 84 different measures of association were compared against a manually annotated Czech corpus. The authors report that the best precision and recall performance for two-word collocations was achieved by pointwise mutual information and logistic regression.
A frequently mentioned limitation of much of the research on collocation discovery is that meaningful collocations are assumed to be of some fixed length, such as bigrams or trigrams (XXX). For example, consider the following:
“He is a gentlemanly little fellow--his head reaches about to my
shoulder--cultured and travelled, and can talk splendidly, which Jack
never could”.
The pair-wise collocation analyses described by Evert (2004), Piao et al. (2005), Mason (2006), and others are unable to discover the collocation gentlemanly little fellow. Furthermore, many of the described solution methods assume adjacency between the words within a collocation. Finally, several measures of association used to discover collocations assume that type occurrence counts follow specific statistical distributions (e.g., Normal, t-distribution, etc.). An alternative heuristic approach (Danielsson, 2003, 2007) seeks to avoid these limitations by first finding occurrences of a key word (node word) and all non-function words occurring within a fixed span (typically three or four words to the left and right of the key word), and then recursively growing a series of “collocates” until no neighbor words have occurrence counts above a predefined threshold. The notion of a concgram (Cheng, Greaves, & Warren, 2006) encapsulates a similar approach, but uses a statistical test to evaluate the significance of the multi-part collocation obtained over a user-defined span.
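As an illustration of this span-based strategy (a simplification rather than Danielsson's published procedure; the stop-word list, span width, and count threshold are assumptions), the sketch below collects the non-function words that co-occur with a node word within a fixed window and keeps those above a count threshold:

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}  # assumed, abbreviated list

def span_collocates(sentences, node_word, span=4, min_count=3):
    """Count the non-function words occurring within `span` words of the node
    word and keep those seen at least `min_count` times."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != node_word:
                continue
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w not in STOP_WORDS)
    return {w: c for w, c in counts.items() if c >= min_count}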
Collocational chains (Daudaravičius & Marcinkevičienė, 2004) provide another way of extending the range of collocations beyond pairs of words. Multiple adjacent pairs of words are combined to form variable-length collocations as long as all word pairs meet a minimum association criterion, using Gravity Counts and other common measures of association described above. Returning to the previous example, if both gentlemanly little and little fellow achieved high association scores (high collectivity), the proposed algorithm would declare gentlemanly little fellow a collocational chain.
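The chaining idea can be sketched as follows (a greedy illustration under assumptions of my own about how pairwise scores are supplied; it is not the authors' published algorithm):

def collocational_chains(tokens, assoc, threshold):
    """Greedy sketch of collocational chains: extend a chain while each
    adjacent word pair's association score stays above the threshold.
    `assoc` maps a (word, next_word) pair to a score computed elsewhere
    (e.g. a Gravity Count or log-likelihood value)."""
    chains, chain = [], [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        if assoc.get((left, right), 0.0) >= threshold:
            chain.append(right)            # pair is strong enough: extend the chain
        else:
            if len(chain) > 1:
                chains.append(tuple(chain))
            chain = [right]                # weak link: start a new chain
    if len(chain) > 1:
        chains.append(tuple(chain))
    return chains

# Toy example with invented scores for the fragment discussed above
scores = {("gentlemanly", "little"): 7.2, ("little", "fellow"): 6.8}
print(collocational_chains(["a", "gentlemanly", "little", "fellow"], scores, threshold=5.0))
# -> [('gentlemanly', 'little', 'fellow')]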
1.4 METHOD
Stop words (or function words) are routinely removed as part of preprocessing a
corpus. Indeed, these high-frequency words carry very little semantic content; however
they serve to express grammatical relationships with other words. Our approach seeks to
use these cues of relationships to extract collocation candidates of various lengths, n > 2.
We assume collocations of interest are multiword extensions of single-word prototypes.
Consider the collocation following sentences:
"Start the buzz-tail," said Cap'n Bill, with a tremble in his voice.
There was now a promise of snow in the air, and a few days later the ground was covered to the depth of an inch or more.
In both sentences, the nouns tremble and promise of snow are delimited by the surrounding pairing of the article a and the preposition in. We define a polygram as the triplet uvw, representing the combined fore, center, and aft components of three or more words contained within a sentence. The fore u and aft w are fixed at a length of a single word each, while v may consist of one or more words. We define a surround u*w as a pair of observed words in which one or more words, represented by the wildcard *, may be embedded between the surrounding words u and w, where u precedes w in the sentence word order. All other polygram prototypes, such as There*a, are also valid, but will occur less frequently.
The first step in extracting collocations is to generate a polygram prototype language model by setting the length |v| = 1, counting all occurrences of u*w, and rank-ordering the surrounds in descending order of occurrence. Note that the first word in an expression (sentence or independent clause) is preceded by u = <start>, and the final punctuation w = <!|.|?|;> follows the last word, so that the first and last words of a sentence are captured within surrounds.
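A minimal sketch of this first step (the tokenization, the <start>/<end> markers, and the data structures are simplifying assumptions of mine) might look as follows:

from collections import Counter

def surround_counts(sentences):
    """Count polygram prototypes u*w with |v| = 1, i.e. every pair of words
    that surrounds a single center word. `sentences` is an iterable of token
    lists; each sentence is padded with <start> and <end> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<start>"] + list(sentence) + ["<end>"]
        for u, _v, w in zip(tokens, tokens[1:], tokens[2:]):
            counts[(u, w)] += 1
    return counts

# Rank the surrounds in descending order of occurrence:
# ranked = surround_counts(corpus).most_common()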
The second step in the process of extracting collocations is to select the top k percent of the rank-ordered surrounds. Borrowing terminology from signal processing, these selected surrounds become a kind of kernel to be convolved over the original corpus in search of new multi-word centers (i.e., 1 < |v| < L), where L represents the maximum length of the center. Any multi-word centers delimited by the surrounds are extracted as collocation candidates. Applying the surround a*in to the following sentences, the collocation candidates wild beast and hole bored are extracted:
Here it leaped futilely a half dozen times for the top of the palisade, and then trembling and chattering in rage it ran back and forth along the base of the obstacle, just as a wild beast in captivity paces angrily before the bars of its cage.
It was the one that seemed to have had a hole bored in it and then plugged up again.
Clearly, wild beast meets the collocational criteria described above, but hole bored appears
less compelling. The objective of the final step is to select the candidates that satisfy the
collocational criteria and discard those that do not.
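Continuing the earlier sketch, the second step can be illustrated as follows (again a simplification under assumed parameters; surround_counts is the helper sketched above, and the selection fraction and maximum center length are arbitrary choices):

from collections import Counter

def extract_candidates(sentences, top_k=0.01, max_len=4):
    """Convolve the most frequent surrounds over the corpus and collect the
    multi-word centers (more than one word, up to max_len words) they delimit."""
    counts = surround_counts(sentences)                      # from the sketch above
    n_keep = max(1, int(top_k * len(counts)))
    kernels = {s for s, _ in counts.most_common(n_keep)}     # top k percent of surrounds

    candidates = Counter()
    for sentence in sentences:
        tokens = ["<start>"] + list(sentence) + ["<end>"]
        for i, u in enumerate(tokens):
            for width in range(2, max_len + 1):
                j = i + width + 1                             # position of the aft word w
                if j >= len(tokens):
                    break
                if (u, tokens[j]) in kernels:
                    candidates[tuple(tokens[i + 1:j])] += 1   # the center v becomes a candidate
    return candidates

The third step, filtering out candidates such as hole bored that fail the collocational criteria, is a separate decision and is not shown in this sketch.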
CHAPTER 2: PROBLEM DOMAIN
2.1 MOTIVATION
2.2 WORDS AND COLLOCATIONS
Definition of word from Lexical Analysis.
2.3 WORD CO-OCCURRENCE ANALYSIS
2.4 SYNTACTIC CLUSTERING
2.4.1 Partitioned versus Hierarchical Clustering Methods
2.4.2 Distance Measures
2.4.3 Hard and Soft Clustering
2.4.4 Incremental and Batch Clustering
2.4.5 Bidirectional Clustering
2.4.6 Part-of-Speech Induction
2.6 RESEARCH OBJECTIVE
Primary research question:
Primary research objective: