Abstract

Analyzing the internal morphological structure of words can benefit natural language processing (NLP) applications from grapheme-to-phoneme conversion (Demberg et al., 2007) to machine translation (Goldwater and McClosky, 2005). But for most of the world’s languages, no morphological analysis system yet exists. Unsupervised induction techniques, which learn the morphology of a language from unannotated text data, can facilitate the development of computational morphology systems for new languages. Moreover, the morphological analyses that an unsupervised system can produce have helped NLP tasks including speech recognition (Creutz, 2006) and information retrieval (Kurimo et al., 2008b). This thesis describes ParaMor, an unsupervised induction algorithm that leverages the paradigmatic structure of inflectional morphology to learn morphological analysis systems for any language that possesses an alphabetic writing system. Paradigms, sets of mutually substitutable morphological operations, organize the inflectional morphology of natural languages. For example, most adjectives in Spanish inflect for two paradigms. First, adjectives are marked for Gender: an o suffix for Masculine or an a for Feminine. Then Spanish adjectives mark Number: an s suffix signals Plural, while the absence of any marking indicates Singular. The four surface forms of the cross-product of the gender and number paradigms on the Spanish word for ‘beautiful’ are then: bello, bella, bellos, and bellas. ParaMor focuses on the most common morphological process, suffixation. ParaMor capitalizes on natural language paradigms in a two-stage algorithm. ParaMor’s first stage identifies candidate paradigms that likely model suffixes of morphological paradigms and their cross-products, while the second stage uses the identified candidate paradigms to segment word forms at character boundaries for which there is paradigmatic evidence of a morpheme boundary.
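The cross-product of the Spanish Gender and Number paradigms described above can be illustrated with a minimal sketch. This is not ParaMor's code: the function name `inflect`, the dictionaries `GENDER` and `NUMBER`, and the stem `bell` are assumptions introduced here purely for illustration, with the empty string standing in for the unmarked Singular.

```python
from itertools import product

# Toy paradigms taken from the Spanish adjective example in the text.
# The empty string represents the unmarked Singular (hypothetical encoding).
GENDER = {"Masculine": "o", "Feminine": "a"}
NUMBER = {"Singular": "", "Plural": "s"}


def inflect(stem, *paradigms):
    """Map each feature combination to its surface form.

    Each paradigm is a dict of {feature name: suffix}; exactly one suffix is
    chosen from every paradigm, and the suffixes are concatenated in order.
    """
    forms = {}
    for combo in product(*(p.items() for p in paradigms)):
        features = tuple(name for name, _ in combo)
        suffix = "".join(s for _, s in combo)
        forms[features] = stem + suffix
    return forms


forms = inflect("bell", GENDER, NUMBER)
for features, form in forms.items():
    print(features, form)
```

Because one suffix is drawn from each paradigm, two paradigms of two members each yield the four surface forms listed in the text. An induction system faces the inverse problem: it sees only the surface forms and must recover the suffix sets.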
ParaMor’s first stage, paradigm identification, is further broken down into three steps. First, a recall-centric search scours a space of candidate partial paradigms for those which possibly model suffixes of true paradigms. Next, ParaMor merges selected candidates which appear to model the same paradigm. Finally, ParaMor’s paradigm identification stage concludes by discarding those clusters which most likely do not model true paradigms. Since paradigms are the organizational structure of inflectional morphology, ParaMor can analyze little derivational morphology. Other unsupervised morphology induction systems, such as Morfessor (Creutz, 2006), seek to identify all morphemes, whether inflectional or derivational. Because the two systems emphasize very different aspects of morphology, ParaMor’s morphological analyses are largely complementary to those of a system like Morfessor. This thesis leverages the individual strengths of the general-purpose morphology induction system Morfessor and the inflection-specific system ParaMor by combining the analyses from the two systems into a single compound analysis. To evaluate ParaMor’s morphological analyses, this thesis follows the methodology of Morpho Challenge 2007, a peer-operated competition for morphology analysis systems (Kurimo et al., 2008a; Kurimo et al., 2008b). Morpho Challenge 2007 evaluated each system’s morphological analyses of up to four languages, English, German, Finnish, and Turkish, in two ways: first, in a linguistically motivated assessment of morpheme identification; and second, in a task-based evaluation that augmented an information retrieval system with morphological segmentations. When ParaMor’s morphological analyses are merged with those of Morfessor, the resulting morpheme recall in the linguistic evaluations of all four languages is higher than that of any system which competed in the Challenge; in Turkish, ParaMor’s recall, at 52.1%, is twice that of the next highest system.
Although focused on morpheme recall, ParaMor maintains strong precision, achieving an F1 at morpheme identification above all competing systems in the German, Finnish, and Turkish tracks. In the three languages of the IR evaluation, ParaMor significantly outperforms a morphologically naïve baseline at average precision over newswire queries, scoring just behind the leading system from Morpho Challenge 2007 in English and ahead of the first-place system in German.
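For readers less familiar with the F1 measure mentioned above, the following sketch shows how it combines precision and recall as their harmonic mean. The function name `f1_score` and the input values are hypothetical and chosen only for illustration; they are not results from the thesis, and the sketch does not reproduce how Morpho Challenge 2007 computes precision and recall themselves.

```python
def f1_score(precision, recall):
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, not taken from the thesis.
balanced = f1_score(0.5, 0.5)   # equal precision and recall
lopsided = f1_score(0.9, 0.1)   # high precision cannot offset very low recall
print(balanced, lopsided)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system tuned for recall still needs reasonable precision to obtain a high F1.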
