Abstract

Analyzing the internal morphological structure of words can benefit natural language
processing (NLP) applications from grapheme-to-phoneme conversion (Demberg et al., 2007) to
machine translation (Goldwater and McClosky, 2005). But for most of the world’s languages, no
morphological analysis system yet exists. Unsupervised induction techniques, which learn the
morphology of a language from unannotated text data, can facilitate the development of
computational morphology systems for new languages. And the morphological analyses that an
unsupervised system can produce have helped NLP tasks including speech recognition (Creutz,
2006) and information retrieval (Kurimo et al., 2008b).
This thesis describes ParaMor, an unsupervised induction algorithm which leverages the
paradigmatic structure of inflectional morphology to learn morphological analysis systems for
any language that possesses an alphabetic writing system. Paradigms, sets of mutually substitutable
morphological operations, organize the inflectional morphology of natural languages. For
example, most adjectives in Spanish inflect for two paradigms. First, adjectives are marked for
Gender: an o suffix for Masculine or an a for Feminine. Then Spanish adjectives mark Number:
an s suffix signals Plural, while the absence of any marking indicates Singular. The four surface
forms of the cross-product of the gender and number paradigms on the Spanish word for
‘beautiful’ are then: bello, bella, bellos, and bellas. ParaMor focuses on the most common
morphological process, suffixation.
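The cross-product in the Spanish example can be made concrete with a minimal sketch. The names `inflect`, `GENDER`, and `NUMBER` are illustrative only and do not come from ParaMor; the sketch simply enumerates the surface forms that two suffix paradigms generate on one stem.

```python
from itertools import product

def inflect(stem, paradigms):
    """Return every surface form in the cross-product of the given suffix paradigms."""
    return [stem + "".join(suffixes) for suffixes in product(*paradigms)]

# Hand-written paradigms for the example in the text (not induced from data).
GENDER = ["o", "a"]   # Masculine, Feminine
NUMBER = ["", "s"]    # Singular (no marking), Plural

forms = inflect("bell", [GENDER, NUMBER])
print(forms)
```

Each form combines exactly one suffix from each paradigm, which is what makes the suffixes within a paradigm mutually substitutable.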
ParaMor capitalizes on natural language paradigms in a two-stage algorithm. ParaMor’s first
stage identifies candidate paradigms which likely model suffixes of morphological paradigms and
their cross-products, while the second stage uses the identified candidate paradigms to segment
word forms at character boundaries for which there is paradigmatic evidence of a morpheme
boundary. ParaMor’s first stage, paradigm identification, is further broken down into three steps:
First, a recall-centric search scours a space of candidate partial paradigms for those which
possibly model suffixes of true paradigms. Next, ParaMor merges selected candidates which
appear to model the same paradigm. And ParaMor’s paradigm identification stage concludes by
discarding those clusters which most likely do not model true paradigms.
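As a rough illustration of what "paradigmatic evidence of a morpheme boundary" can mean, the toy sketch below splits a word only where the candidate stem occurs in the corpus with several suffixes of one candidate paradigm. It is not ParaMor's algorithm: the search, merging, and filtering steps are omitted, the paradigm is supplied by hand rather than induced, and the names `candidate_suffix_sets`, `segment`, `max_suffix_len`, and `min_support` are hypothetical.

```python
from collections import defaultdict

def candidate_suffix_sets(words, max_suffix_len=2):
    """Map each candidate stem to the set of suffixes (including the empty one) it occurs with."""
    stems = defaultdict(set)
    for w in words:
        for k in range(max_suffix_len + 1):
            if len(w) - k >= 1:                      # keep a non-empty stem
                stems[w[:len(w) - k]].add(w[len(w) - k:])
    return stems

def segment(word, stems, paradigm, min_support=2):
    """Propose stem+suffix splits where the stem takes at least min_support suffixes of the paradigm."""
    splits = []
    for k in range(1, len(word)):
        stem, suffix = word[:k], word[k:]
        support = len(stems.get(stem, set()) & paradigm)
        if suffix in paradigm and support >= min_support:
            splits.append(stem + "+" + suffix)
    return splits

# Tiny assumed corpus and a hand-written candidate paradigm (gender x number suffixes).
corpus = ["bello", "bella", "bellos", "bellas", "casa"]
paradigm = {"o", "a", "os", "as"}
stems = candidate_suffix_sets(corpus)

print(segment("bellas", stems, paradigm))
print(segment("casa", stems, paradigm))
```

In this toy corpus the stem "bell" occurs with all four suffixes, so "bellas" is split, whereas "cas" occurs with only one suffix of the paradigm and "casa" is left whole.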
Since paradigms are the organizational structure of inflectional morphology, ParaMor can
analyze little derivational morphology. Other unsupervised morphology induction systems, such
as Morfessor (Creutz, 2006), seek to identify all morphemes whether inflectional or derivational.
With their emphasis on very different aspects of morphology, ParaMor’s morphological analyses
are largely complementary to those of a system like Morfessor. And this thesis leverages the
individual strengths of the general-purpose morphology induction system Morfessor and the
inflection-specific system ParaMor by combining the analyses from the two systems into a single
compound analysis.
To evaluate ParaMor’s morphological analyses, this thesis follows the methodology of
Morpho Challenge 2007, a peer-operated competition for morphology analysis systems (Kurimo
et al., 2008a; Kurimo et al., 2008b). Morpho Challenge 2007 evaluated each system’s
morphological analyses of up to four languages, English, German, Finnish, and Turkish, in two
ways: First, in a linguistically motivated assessment of morpheme identification; and second, in a
task-based evaluation that augmented an information retrieval system with morphological
segmentations. When ParaMor’s morphological analyses are merged with those of Morfessor, the
resulting morpheme recall in the linguistic evaluations of all four languages is higher than that of
any system which competed in the Challenge; in Turkish, ParaMor’s recall, at 52.1%, is twice
that of the next highest system. Although focused on morpheme recall, ParaMor maintains a
strong precision, achieving an F1 at morpheme identification above all competing systems in the
German, Finnish, and Turkish tracks. In the three languages of the IR evaluation, ParaMor
significantly outperforms a morphologically naïve baseline at average precision over newswire
queries, scoring just behind the leading system from Morpho Challenge 2007 in English and
ahead of the first-place system in German.