CHAPTER 1: ANALYSIS

1.1 PROBLEM DESCRIPTION

Linguists frequently classify parts of speech as open or closed. Closed parts of speech are composed of a fixed, or nearly fixed, group of words, with few or no additions over decades or centuries. Examples of closed parts of speech include the article (a, an, the), the preposition (to, in, on, etc.), modals and auxiliaries (was, has, have, etc.), and the personal pronoun (I, him, she, etc.). Other parts of speech, such as the noun and, to a slightly lesser degree, the verb, are continuously changing: new words are added, such as iPhone or Google, while others are labeled archaic (fortnight, etc.) and fade from common use.

Understanding language requires a basic understanding of its parts of speech. While a person may not be familiar with the labels noun or verb, they must be able to differentiate between a thing or person and an action or state. Similarly, in computational language processing, language analysis tools frequently require the ability to distinguish standard parts of speech. The noun is a particular challenge for computational linguists not only because it is an open part of speech, but also because it is often formed from multiple words, or tokens. For example, the multi-word expression Kennedy Airport may be mistakenly assumed to be two nouns, one proper and the other common. The meaning of either word in isolation is quite remote from the meaning of the two words taken together. Since such nouns are constantly being added to the corpus, it is a common problem in natural language processing to detect and isolate these multiword expressions.

Manning and Schütze (1999) use the term collocation to describe a multi-word expression that corresponds to some conventional way of saying things. The authors suggest that collocations can be characterized by the following attributes:

Non-Compositionality: As described in the example above, the meaning of Kennedy Airport is not the simple composition of the two words Kennedy + Airport.

Non-Substitutability: Near-synonyms cannot be substituted for the components of a collocation without changing or breaking its meaning.

Non-Modifiability: Many collocations resist modification with additional words or through grammatical transformation.

Collocations play an important part in understanding language and common expression. For example, Lewis (2000) presents a theory of language acquisition and emphasizes the importance of collocation to language learning: "The single most important task facing language learners is acquiring a sufficiently large vocabulary. We now recognize that much of 'vocabulary' consists of prefabricated chunks of different kinds. The single most important kind of chunk is collocation. Self-evidently, then, teaching collocation should be a top priority in every language course." (Lewis, 2000)

1.2 PERFORMANCE CRITERIA

1.3 RELATED WORK

Evert (2004) provides a comprehensive inventory of association measures for determining the true association of a word pair, given the observed number of times the words occur together and apart. The measures compared include those that have achieved popularity in collocation analysis, such as mutual information, the t-score, and the log-likelihood measure, as well as less common measures of association, including the hypergeometric likelihood and the Liddell, Dice, and Jaccard point estimates. Evert introduces a useful parameter space, the ebo-system (expectation, balance, and observed frequency), which provides a convenient way of comparing and visualizing the various association measures. A critical problem addressed is that, for low-frequency terms, the observed counts are often much higher than the expected counts, which leads to inflated estimates of association.
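To make the role of observed and expected counts concrete, the following sketch computes three of the popular measures named above for a single word pair. It is a minimal illustration, not code from Evert (2004); the corpus size N and the frequencies f_u, f_v, and f_uv are assumed values chosen only for demonstration.

# Minimal sketch: computing common association scores for a word pair (u, v)
# from a 2x2 contingency table.  All counts below are illustrative assumptions,
# not taken from any of the corpora discussed in the text.
import math

N = 1_000_000        # total number of bigram tokens in the corpus (assumed)
f_u, f_v = 120, 300  # marginal frequencies of u and v (assumed)
f_uv = 24            # observed co-occurrence frequency of the pair (u, v)

expected = f_u * f_v / N                      # expected frequency under independence

# Pointwise mutual information: log ratio of observed to expected frequency.
pmi = math.log2(f_uv / expected)

# t-score: (observed - expected) scaled by an estimate of the standard deviation.
t_score = (f_uv - expected) / math.sqrt(f_uv)

# Log-likelihood ratio (G2) over the full 2x2 contingency table.
def g2(f_uv, f_u, f_v, N):
    table = [
        (f_uv,                  f_u * f_v / N),
        (f_u - f_uv,            f_u * (N - f_v) / N),
        (f_v - f_uv,            (N - f_u) * f_v / N),
        (N - f_u - f_v + f_uv,  (N - f_u) * (N - f_v) / N),
    ]
    return 2 * sum(o * math.log(o / e) for o, e in table if o > 0)

print(f"expected={expected:.3f}  PMI={pmi:.2f}  t={t_score:.2f}  G2={g2(f_uv, f_u, f_v, N):.2f}")

Because the expected frequency sits in the denominator of the mutual information ratio, a pair observed only once or twice can already receive a very high score; this is the low-frequency inflation problem noted above.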
To derive a more accurate measure of association for low-frequency collocations, Evert applies a Zipf-Mandelbrot population model. However, the theoretical results differed from those observed in real texts, and he finds that probability estimates for hapax and dis legomena (i.e., words that occur only once or twice) are distorted in unpredictable ways. He concludes with a recommendation to apply a cutoff of 4 to ensure that probabilities can be assessed accurately. The association measures were evaluated on the annotated British National Corpus (BNC) and the German Frankfurter Rundschau (FR) corpus, and their performance was compared using precision and recall statistics. In general, the best precision and recall were obtained with the log-likelihood score on the FR corpus. However, for a fine-grained comparative evaluation of German figurative expressions (idioms, e.g., "unter die Arme greifen" or "ins Gras beissen") and support-verb constructions (e.g., "in Kraft treten" or "unter Druck setzen"), the best association measures differed. The log-likelihood and Pearson's X2 measures performed equally well on the figurative expressions; however, for the support-verb constructions the t-score achieved significantly higher precision than the other measures of association.

Bouma (2010) points out that the assumption of independence, used to estimate the expected probability pexp as the null hypothesis for word combinations, is unrealistic. For example, since the word 'the' has a high probability of occurring, under independence the probability of the combination 'the the' should also be quite high, since P(uv = 'the the') = P(u = 'the')P(v = 'the'), even though the sequence is rarely observed. Independence ignores the semantic, lexical, and syntactic structures inherent in every language. To avoid the independence assumption, Bouma proposes an Aggregate Markov Model (AMM) for estimating pexp, in which a hidden variable represents the level of dependency between the two words. By varying the number of hidden classes, the model ranges from complete independence (a unigram model) to complete dependence (a bigram model). The approach is evaluated against three gold standards: a German adjective-noun dataset, a German PP-verb dataset, and an English verb-particle dataset. Precision is compared against the independence baseline, but a significant improvement is demonstrated only on the German adjective-noun dataset.
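The sketch below makes this objection concrete. The counts are assumed, illustrative values (not taken from Bouma, 2010); they show how large an expected frequency the independence null hypothesis assigns to 'the the' in a corpus where roughly six percent of tokens are 'the'.

# Illustration of the independence null hypothesis criticized above.
# Counts are assumed for illustration only.
N = 100_000_000          # bigram tokens in the corpus (assumed)
p_the = 6_000_000 / N    # unigram probability of 'the' (assumed, ~6%)

# Under independence, p_exp('the the') = P(u = 'the') * P(v = 'the').
p_exp = p_the * p_the
expected_count = p_exp * N

observed_count = 40      # 'the the' is almost never observed (assumed)

print(f"expected under independence: {expected_count:,.0f} occurrences")
print(f"actually observed:           {observed_count:,} occurrences")
# Independence predicts hundreds of thousands of 'the the' bigrams, while the
# syntax of English makes the sequence nearly impossible; the AMM is intended
# to give a more realistic estimate of p_exp.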
Pecina (2005) compares 84 different measures of association against a manually annotated Czech corpus and reports that the best precision and recall for two-word collocations were obtained with pointwise mutual information and logistic regression.

A frequently mentioned limitation of much of the research on collocation discovery is that meaningful collocations are assumed to be of some fixed length, such as bigrams or trigrams (XXX). For example, consider the following: "He is a gentlemanly little fellow--his head reaches about to my shoulder--cultured and travelled, and can talk splendidly, which Jack never could". The pair-wise collocation analyses described by Evert (2004), Piao et al. (2005), Mason (2006), and others are unable to discover the collocation gentlemanly little fellow. Furthermore, many of the described methods assume adjacency between the words within a collocation. Finally, several measures of association used to discover collocations assume that type occurrence counts follow specific statistical distributions (e.g., the normal or t-distribution).

An alternative heuristic approach (Danielsson, 2003, 2007) seeks to avoid these limitations by first finding occurrences of a key word (node word) and all non-function words occurring within a fixed span (typically three or four words to the left and right of the key word), and then recursively growing a series of "collocates" until no neighboring word has an occurrence count above a predefined threshold. The notion of a concgram (Cheng, Greaves, & Warren, 2006) encapsulates a similar approach, but uses a statistical test to evaluate the significance of the multi-part collocation obtained over a user-defined span. Collocational chains (Daudaravičius & Marcinkevičienė, 2004) provide another way of extending the range of collocations beyond pairs of words. Multiple adjacent pairs of words are combined to form variable-length collocations as long as all word pairs meet a minimum association criterion, using Gravity Counts and other common measures of association described above. Returning to the previous example, if both gentlemanly little and little fellow achieved high association scores (high collectivity), the proposed algorithm would declare gentlemanly little fellow a collocational chain.

1.4 METHOD

Stop words (or function words) are routinely removed as part of preprocessing a corpus. These high-frequency words carry very little semantic content; however, they serve to express grammatical relationships with other words. Our approach uses these relational cues to extract collocation candidates of various lengths n ≥ 2. We assume collocations of interest are multi-word extensions of single-word prototypes. Consider the following sentences:

"Start the buzz-tail," said Cap'n Bill, with a tremble in his voice.

There was now a promise of snow in the air, and a few days later the ground was covered to the depth of an inch or more.

In both sentences, the nouns tremble and promise of snow are delimited by the surrounding pair of the article a and the preposition in. We define a polygram as the triplet uvw, representing the combined fore, center, and aft components of three or more words contained within a sentence. The fore u and the aft w are fixed at a length of a single word, but the center v may consist of one or more words. We define a surrounds u*w as a pair of observed words u and w, where one or more words, represented by the wildcard *, may be embedded between them, and u precedes w in sentence word order. All other polygram prototypes, such as There*a, are also valid but occur less frequently.

The first step in extracting collocations is to generate a polygram prototype language model by setting the length |v| = 1, counting all occurrences of each surrounds u*w, and rank-ordering the surrounds in descending order of occurrence. Note that the first word in an expression (sentence or independent clause) is preceded by u = <start>, and the final token is w = <!|.|?|;>, so that the first and last words of a sentence are also captured within surrounds.

The second step is to select the top k percent of the rank-ordered surrounds. Borrowing terminology from signal processing, these selected surrounds become a type of kernel to be convolved over the original corpus in search of new multi-word centers (i.e., 1 < |v| < L, where L represents the maximum length of the center). Any multi-word centers delimited by the surrounds are extracted as collocation candidates. Applying the surrounds a*in to the following sentences, the collocation candidates wild beast and hole bored are extracted:

Here it leaped futilely a half dozen times for the top of the palisade, and then trembling and chattering in rage it ran back and forth along the base of the obstacle, just as a wild beast in captivity paces angrily before the bars of its cage.

It was the one that seemed to have had a hole bored in it and then plugged up again.

Clearly, wild beast meets the collocational criteria described above, but hole bored appears less compelling. The objective of the final step is to select the candidates that satisfy the collocational criteria and discard those that do not.
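The sketch below outlines the pipeline just described: counting single-word-center surrounds, keeping the most frequent ones as kernels, and convolving those kernels over the corpus to collect multi-word centers. It is a minimal illustration under assumed settings, not the implementation developed in this work: the toy tokenizer, the choice of kernels, and the maximum center length L are placeholders, and <end> stands in for the end-of-sentence token <!|.|?|;>.

# A minimal sketch of the surrounds-based extraction pipeline described above.
from collections import Counter
import re

def tokenize(sentence):
    # Toy tokenizer: lowercase word tokens framed by <start> and <end> markers.
    return ["<start>"] + re.findall(r"[a-z'-]+", sentence.lower()) + ["<end>"]

def count_surrounds(sentences):
    # Step 1: count every surrounds u*w observed with a single-word center (|v| = 1).
    counts = Counter()
    for s in sentences:
        toks = tokenize(s)
        for i in range(len(toks) - 2):
            counts[(toks[i], toks[i + 2])] += 1
    return counts

def extract_candidates(sentences, kernels, L=4):
    # Step 3: convolve the selected surrounds (kernels) over the corpus and keep
    # any multi-word centers (1 < |v| < L) they delimit as collocation candidates.
    candidates = Counter()
    for s in sentences:
        toks = tokenize(s)
        for i, u in enumerate(toks):
            for length in range(2, L):              # center lengths 2 .. L-1
                j = i + length + 1
                if j < len(toks) and (u, toks[j]) in kernels:
                    candidates[" ".join(toks[i + 1:j])] += 1
    return candidates

corpus = [
    "just as a wild beast in captivity paces angrily before the bars of its cage",
    "it was the one that seemed to have had a hole bored in it and then plugged up again",
]
print(count_surrounds(corpus).most_common(3))       # step 1: rank-ordered surrounds
# Step 2 would keep the top k percent of those surrounds as kernels; with this
# toy corpus we simply use the surrounds a*in from the example in the text.
kernels = {("a", "in")}
print(extract_candidates(corpus, kernels))          # Counter({'wild beast': 1, 'hole bored': 1})

On a full corpus the kernels would come from the rank-ordered surrounds counts rather than being chosen by hand, and the final filtering step that separates wild beast from hole bored is not shown here.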
CHAPTER 2: PROBLEM DOMAIN

2.1 MOTIVATION

2.2 WORDS AND COLLOCATIONS

Definition of word from Lexical Analysis.

2.3 WORD CO-OCCURRENCE ANALYSIS

2.4 SYNTACTIC CLUSTERING

2.4.1 Partitioned versus Hierarchical Clustering Methods

2.4.2 Distance Measures

2.4.3 Hard and Soft Clustering

2.4.4 Incremental and Batch Clustering

2.4.5 Bidirectional Clustering

2.5 PART-OF-SPEECH INDUCTION

2.6 RESEARCH OBJECTIVE

Primary research question:

Primary research objective: