
Machine Translation (2005) 18:275–297
DOI 10.1007/s10590-004-7693-4
© Springer 2005
A Morphological Tagger for Korean: Statistical
Tagging Combined with Corpus-Based
Morphological Rule Application
CHUNG-HYE HAN1 and MARTHA PALMER2
1 Department of Linguistics, Simon Fraser University, 8888 University Dr., Burnaby, BC V5A 1S6, Canada. E-mail: chunghye@sfu.ca
2 Department of Computer and Information Sciences, University of Pennsylvania, 3330 Walnut St., Philadelphia, PA 19104-6389, USA. E-mail: mpalmer@linc.cis.upenn.edu
Abstract. This paper describes a novel approach to morphological tagging for Korean, an
agglutinative language with a very productive inflectional system. The tagger takes raw
text as input and returns a lemmatized and morphologically disambiguated output for
each word: the lemma is labeled with a part-of-speech (POS) tag and the inflections are
labeled with inflectional tags. Unlike the standard approach to tagging for morphologically complex languages, in our proposed approach the tagging phase precedes the analysis phase. It comprises a trigram-based tagging component followed by a morphological
rule application component, obtaining 95% precision and recall on unseen test data.
Key words: agglutinative morphology, Korean, morphological rules, morphological tagger,
statistical tagging, Treebank
1. Introduction
Korean is an agglutinative language with a very productive inflectional system. Inflections include different types of case markers, postpositions, prefixes, and suffixes on nouns, as in (1a); tense, honorific and sentence type
markers on verbs and adjectives, as in (2a,b); among others. Furthermore,
these inflections can combine with each other to form complex compound
inflections. For example, postpositions, which correspond to English prepositions, can be followed by case markers such as nominative or accusative
markers, as in (1b), or auxiliary postpositions such as to ‘also’ and man ‘only’, as in (1c), which then can be followed by a topic marker, as
in (1d).1 Honorific and tense markers on verbs can be followed by a nominalizer and then a case marker or a postposition and a topic marker, as
in (2c,d). Based on the possible inflectional sequences, we estimate that the
number of possible morphological variants of a word can, in principle, be
in the tens of thousands.
(1) a. hakkyo-ka
       school-nom
    b. hakkyo-eyse-ka
       school-from-nom
    c. hakkyo-eyse-man
       school-from-only
    d. hakkyo-eyse-man-un
       school-from-only-topic
(2) a. ka-ess-ta
       go-past-decl
    b. ka-si-ess-ta
       go-honorific-past-decl
    c. ka-ki-ka
       go-nominalizer-nom
    d. ka-si-ess-ki-ey-nun
       go-honorific-past-nominalizer-to-topic
This morphological complexity implies that for any NLP application on
Korean to be successful, a component that does morphological analysis is
necessary. Without it, any application that makes use of a computational
lexicon would call for large and unnecessary space requirements, because it
would require a lexicon that lists all possible morphological variants of a
word; moreover, the development of applications such as a part-of-speech
(POS) tagger and a parser based on statistical models would not be feasible
due to the sparse data problem caused by the multiplicity of morphological variants of the same stem in the training data. Further, to best exploit
the agglutinative nature of morphological structures, a tagger for Korean
needs to be morpheme-based, rather than word-based. That is, an output
of a tagger should consist of a segmentation of each word into its constituent morphemes and an assignment of a tag to each of them.
In this paper, we describe an implemented morphological tagger (that
also does morphological analysis) for Korean that takes raw text as an
input and returns a lemmatized and morphologically disambiguated output where for each word, the base-form is labeled with a POS tag and
the inflections are labeled with inflectional tags. Throughout this paper, we
will use the term “word” to mean character sequences separated by white
spaces in the original raw text, and the term “lemma” to mean the stem or
base-form of a word stripped of all inflectional and derivational morphemes.
In the standard approach to tagging for morphologically complex languages like Korean, a morphological analysis phase precedes a tagging
phase. That is, all possible ways of morphologically segmenting a given
word are generated which are then disambiguated at the tagging phase
(Lim et al., 1995, 1997; Hong et al., 1996; Chan et al., 1998; Yoon et al.,
1999; Lee et al., 2000).2 The main motivation behind applying morphological analysis prior to tagging is the sparse data problem which could potentially arise due to the fact that base forms of words can appear with a
wide array of morphological affixes. However, this presupposes knowledge
of all possible morphological rules, requiring quite a lot of time and effort
for their construction and implementation. Furthermore, while the generation of all possible morphological analyses is not an especially onerous
task, resolving the resulting ambiguity in the tagging phase is quite challenging. In contrast, in our proposed approach, the tagging phase precedes
the analysis phase. It will be shown that even with the sparse data problem, by applying tagging before analysis, we can achieve accuracy results
that are comparable to state-of-the-art systems. This is shown to be possible because tagging is done in two stages: (i) trigram-based tagging, and
(ii) updating the tags for unknown words using morphological information
automatically extracted from the training data. For us, the rich inflections
are not seen as a problem causing sparse data, but rather they are exploited
when guessing the tags for unknown words. As we will see in Section 2.4,
our approach allows for morphological analysis to be done deterministically by using the information obtained from the tagging phase. This makes
our approach efficient since there is no time-consuming morphological disambiguation step.
Our approach employs techniques from POS tagging to resolve ambiguity in morphological tagging and analysis. POS tagging was introduced
by Church (1988), and Ratnaparkhi (1996) currently has one of the best
performances for a POS tagger that has been trained on the Penn English
Treebank (Marcus et al., 1993). Since English is not as morphologically
complex as Korean, Ratnaparkhi models very limited suffix information
to deal with unseen words. Taggers for morphologically rich languages
like Czech, an inflectional language, and Hungarian, Basque and Turkish, agglutinative languages, have also been developed. Hajič and Hladká
(1998) use an exponential probabilistic model for tag disambiguation after
employing morphological analysis for Czech. Hajič et al. (2001) develop
a hybrid system for Czech, where a rule-based system performs partial
disambiguation, followed by the application of a trigram Hidden Markov
Model (HMM) tagger. Tufiş et al. (2000) propose a two-step tagging process for Hungarian in which initial tagging is done using a reduced tag
set with an off-the-shelf HMM tagger, followed by a recovery of the full
morphosyntactic description. Hakkani-Tür et al. (2002) propose to break
up the full morphological tag sequence into smaller units called inflectional groups, treating the members of each group as subtags, and then
to determine the correct sequence of tags via statistical techniques. For
Basque, Ezeiza et al. (1998) suggest the application, after initial morphological analysis, of morphological rules followed by an HMM-based tagger for disambiguation. In comparison, our model is based on a generative
statistical model and a rule-based approach, and applies a trigram-based
tagging component followed by a morphological rule application component.
We employed a corpus-based approach for the implementation of our
morphological tagger, and automatically extracted the training data and
morphological rules from the Penn Korean Treebank which contains 54,366
words and 5,078 sentences (Han et al., 2002). The Treebank is composed
of 33 files in total, out of which 30 files were used for training data, and
three files (files 05, 20, and 30, comprising 9% of the entire corpus) were
set aside for test data. The Treebank has phrase-structure annotation, with
head/phrase-level tags as well as function tags. Each word is morphologically analyzed, where the lemma and the inflections are identified. The
lemma is tagged with a POS tag (e.g., NNC, NPN, VV, VX), and the
inflections are tagged with inflectional tags (e.g., PCA, EAU, EFN).3 An
example annotation is given in (4) for the sentence in (3).
(3) kwenhan-ul nwu-ka kacko iss-ci
    authority-acc who-nom have be-int
    ‘Who has the authority?’
(4)
As we did not need to spend any time on manual construction of morphological rules, the implementation of the entire system was done quite
efficiently, allowing us to develop a quick and effective technique for rapidly prototyping a working system.
The rest of the paper is organized as follows. In Section 2, we discuss
the general approach, the training data, and the workings of each component underlying the system. Section 3 presents the results of the performance evaluation done on the tagger.
2. Description of the Morphological Tagger
The tagger follows several sequential steps for the labeling of the raw text
with POS and inflectional tags. After tokenization (mostly applied to punctuation), all morphological contractions in the input string are decomposed
(spelling recovery). The known words are then tagged (morph tagging) followed by a phase that updates the tags for any unknown words (update
morph tagging). Finally, the lemma and its tags are separated from the
inflections with their tags, creating the final output (lemma/inflection identification). These steps are summarized in Figure 1.
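The following Python sketch gives a purely structural view of this pipeline, under the assumption that each stage can be modelled as a function over the sentence; the function names and stub bodies are ours, not the authors' implementation, and each stage is described in Sections 2.1 to 2.4.

```python
# Purely structural sketch of the pipeline summarized in Figure 1.
# The stage names are ours; each body is a trivial placeholder so the
# example runs, and the real behaviour is described in Sections 2.1-2.4.

def tokenize(sentence: str) -> list[str]:
    return sentence.split()              # 2.1: punctuation handling omitted here

def spelling_recovery(words: list[str]) -> list[str]:
    return words                         # 2.2: deletions/contractions undone

def morph_tag(words: list[str]) -> list[tuple[str, str]]:
    return [(w, "NNC") for w in words]   # 2.3.1: trigram tagger (NNC is the default tag)

def update_morph_tags(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    return tagged                        # 2.3.2: inflectional-template update for unknowns

def identify_lemma_inflections(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    return tagged                        # 2.4: split each word into lemma + inflections

def tag_sentence(raw: str) -> list[tuple[str, str]]:
    return identify_lemma_inflections(
        update_morph_tags(morph_tag(spelling_recovery(tokenize(raw)))))
```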
2.1. TOKENIZATION
When the raw text is fed through our tagger, the first task of the tagger is
to separate out the punctuation symbols. This is a simple process in which
white space is inserted between the word and the associated punctuation
symbol.
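A minimal sketch of this step, assuming tokenization amounts to inserting white space around punctuation symbols; the punctuation inventory in the regular expression is our own guess, not the one used by the authors.

```python
import re

# The punctuation inventory below is our own assumption; the paper only says
# that white space is inserted between a word and its adjacent punctuation.
PUNCT = re.compile(r'([.,?!;:"\'()\[\]{}])')

def tokenize(sentence: str) -> str:
    # Put a space on both sides of each punctuation symbol, then collapse
    # any runs of white space that the substitution introduces.
    spaced = PUNCT.sub(r' \1 ', sentence)
    return re.sub(r'\s+', ' ', spaced).strip()

# e.g. tokenize("pokohayssupnita.") -> "pokohayssupnita ."
```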
2.2. SPELLING RECOVERY
In many cases, in the creation of an inflected word, a syllable or a character is deleted from the stem as in (5), and/or a morphological contraction
occurs between the stem and the inflection as in (6).
Figure 1. Overview of the tagger.
(5) a. nwukwu + ka → nwuka
       who+nom
    b. tep + umyen → tewumyen
       hot+if
(6) a. ka + ass + ta → kassta
       go+past+decl
    b. mwues + i + nka → mwuenka
       what+co+int
In other words, there can be several allomorphs for a given morpheme, and thus an important part of the task for a morphological tagger/
analyzer is to identify the base-form of a morpheme, by recovering the
materials that have been deleted, and decomposing the materials that have
been contracted. We call this process “spelling recovery.”
To obtain training material for the spelling recovery component, the
Treebank was converted to a data format containing a list of triples for
each sentence: a word with morphological deletion and contraction, the
corresponding string with the deletion and contraction undone, and a POS
tag for the word. All the inflectional tags are stripped away, and all the
POS tags are mapped onto one of the five tags: NN (noun), AN (noun
modifier), VV (verb and adjective), AD (adverb and conjunction), or NC
(noun with a copula inflection). This means that the tag for a word annotated with a noun POS tag and a case inflectional tag is reduced to NN,
and the tag for a word annotated with a verb POS tag and a tense and
sentence type inflectional tags is reduced to VV. Importantly, spelling recovery is conditioned on these five POS tags. Note that the Treebank makes
finer distinctions in tagging, and has 34 tags in total. For instance, the
Treebank has six noun tags: NPR for proper nouns, NNC for common nouns, NNX for dependent nouns, NPN for personal pronouns, NNU for numerals, and NFW for words written in foreign characters. For the purposes of spelling recovery, though, all the words belonging to these six noun categories behave the same way. The same goes for verb/adjective categories. Although the Treebank makes a three-way distinction for verb/adjective tags, all types of verbs and adjectives behave the same way with respect
to spelling recovery. We thus simplified the rules involved in the spelling
recovery process by collapsing 34 tags to five tags in order to condition the
process.
We then automatically extracted 605 spelling recovery templates from
the list of triples, by comparing the word and the corresponding string. If
the two forms differ, they are listed from the point where the two forms
start to differ, with the corresponding POS tag. For a small set of function words, entire words and the corresponding strings are listed. Some
example spelling recovery templates are given in Table I. The first column
contains "subwords" (which are the parts of words that are to be targeted for spelling recovery), the second column POS tags, and the third column substrings (which define the strings the subwords will be converted into).4
The POS for each word is needed before we can attempt the spelling
recovery step because this is constrained by POS tags. For instance,
tonghay is ambiguous between a verb ‘move’ and a noun ‘east sea’, and it
should have different spelling recovery outputs depending on its POS tag,
as illustrated in (7).
Table I. Examples of spelling recovery templates
(7) a. tonghay ‘east sea’ → tonghay
    b. tonghay ‘move’ → tonghae
We use a statistical model to infer the most likely sequence of POS tags
for the input word sequence. The training data is the word and its POS tag
from the list of triples shown in Table I. This training data is used to create a probability model to infer the most likely tag sequence $t^* = t^*_0, \ldots, t^*_n$ given an input word sequence $w_0, \ldots, w_n$. This probability model is used as a POS tagger which finds $t^*$ by finding the tag sequence with the highest probability, as in (8).

(8) $t^* = \arg\max_{t_0,\ldots,t_n} P(t_0, \ldots, t_n \mid w_0, \ldots, w_n)$
This provides the most likely tag from the five possible POS tags for each
word in a given input sentence, as in (9) (for gloss see (3)). Because the tag
set at this stage contains only five tags, the sparse data problem is minimized.
(9)
a.
b.
In our implementation, we used a standard trigram model combined with
Katz backoff smoothing for unseen events. Unknown words are tagged as
NN. The details of the model we use are explained in Appendix B. The
code for the implementation of this trigram-based tagger was adapted and
modified from the trigram-based SuperTagger by Srinivas (1997).5
In the spelling recovery phase, the input sentence is first run through the
POS tagger, followed by the application of the spelling recovery templates
to each word. At this point, each template is taken as an instruction, as
follows:
If the input word contains the subword and is tagged with the POS
tag, then replace the part from the word matching the subword with the
substring.
Some spelling recovery templates that are unambiguously verb/adjective templates are made to apply to input words tagged as NN as well as VV. This
takes care of unknown verbs and adjectives that are wrongly tagged as NN.
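The template application just described can be sketched as follows, where each template is a (subword, POS, substring) triple; the Template type, the helper name, and the example template are hypothetical, since the 605 real templates were extracted from the Treebank in Hangul and are not reproduced here.

```python
from typing import NamedTuple

class Template(NamedTuple):
    subword: str    # part of the word targeted for spelling recovery
    pos: str        # one of the five reduced POS tags: NN, AN, VV, AD, NC
    substring: str  # string the matching part is rewritten into

def apply_templates(word: str, pos: str, templates: list[Template]) -> str:
    for t in templates:
        # In the paper only unambiguous verb/adjective templates are also
        # tried on NN-tagged words; here we simplify and try every VV
        # template on NN as well.
        pos_ok = t.pos == pos or (t.pos == "VV" and pos == "NN")
        if pos_ok and t.subword in word:
            return word.replace(t.subword, t.substring, 1)
    return word  # no template matched: leave the word unchanged

# Hypothetical template in the spirit of (6a); the delimiter used in the
# recovered string is our assumption.
templates = [Template("kassta", "VV", "ka ass ta")]
print(apply_templates("kassta", "VV", templates))  # -> "ka ass ta"
```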
An example input and output of spelling recovery is given in (10) (for
gloss see (3)). The parts to which spelling recovery has been applied are
underlined.
(10) a. Input:
b. POS tagging:
c. Output:
The spelling recovery phase was evaluated on the test sentences which
were set aside from the Treebank. The results are given in Table II.
2.3. MORPHOLOGICAL TAGGING
Morphological tagging is done in two stages: first, input words are tagged
with a trigram-based morphological tagger, and then the tag for unknown
words is updated using what we call “inflectional templates”. Each step is
described in more detail below.
Table II. Evaluation of spelling recovery

Process                                          Accuracy (%)
POS tagging for spelling recovery                92.03
Spelling recovery                                96.33
Spelling recovery assuming 100% POS tagging      98.62
2.3.1. Initial Morphological Tagging
A morphological tagger based on a trigram-based model was trained on
91% of the 54k Korean Treebank.6 The statistical model for this morphological tagger is the same as the one used for the POS tagger that
was used in the spelling recovery phase. That is, given the input word
sequence $w_0, \ldots, w_n$, the model is used to find the tag sequence $t^* = t^*_0, \ldots, t^*_n$ with the highest probability. The only difference is that the morphological tagger uses a different set of tags. See Appendix B for the
details of the model. The tag set for the morphological tagger consists of
possible complex tags extracted from the Treebank, altogether 165 complex
tags. These complex tags are based on 14 POS tags, 10 inflectional tags,
and five punctuation tags.7 Examples are given in Table III. For example,
NNC+CO+EFN+PAD is the complex tag for a common noun, inflected
with a copula, a sentence type marker and a postposition. The tagger
assigns an NNC (common noun) tag to unknown words, NNC being the
most frequent tag for unknown words. Dealing with unknown words using
inflectional information is discussed in Section 2.3.2.
Table III. Examples of complex tags
NNC
NNC+PCA
NNC+CO+EFN+PAD
NPR+PAD+PAU
NNU+PAU
VJ+EPF+EAN
VV+EPF+EFN
VV+EPF+EFN+PAU
VX+EPF+EFN+PAU
The tagging phase was evaluated on the test sentences, obtaining 81.96%
accuracy. An example input and output of the morphological tagger is
given in (11).
(11) a. Input:
Cey-ka kwanchuk sahang-ul pokoha-ess-supnita.
I-nom observation item-acc report-past-decl
‘I reported the observation items.’
b. Output:
2.3.2. Updating the Morphological Tags
The results obtained from the trigram-based morphological tagger are not
very accurate, the main reason being that both the number of unknown
words and the number of tags in the tag set are quite large. This vindicates the discussion in Section 1 that argues for the importance of morphological analysis. Without morphological analysis, the tagging results will be
quite low. To compensate for this, we added a phase that updates the morphological tags for unknown words. The update process exploits the fact
that the POS tag of a given word is in general predictable from the inflections on the word. That is, a large number of inflections can only occur on
verbs/adjectives, and a large number of inflections can only occur on nouns.
For this purpose, inflectional templates, which are pairs of inflection
sequences and the corresponding complex tag, were extracted from the
Treebank. The total number of extracted inflectional templates is 594. This
comprises only 1% of the number of word tokens in the training data,
showing that the size of inflectional templates is relatively quite small in
comparison to the size of the training data. A few example templates are
listed in Table IV. Further, a list of common noun stems was extracted
from the Treebank.
The procedure for updating the morphological tags is shown in Figure 2.
An example of the tagging update process is given in (12). The main
verb of the sentence, icesssupnita ‘forgot’, which is inflected with
a past-tense marker and a declarative sentence type marker, is wrongly
tagged as NNC in the trigram-based tagging phase. This tag is updated as
VV+EPF+EFN, VV for verb, EPF for tense, and EFN for sentence type.
Table IV. Examples of inflectional templates
1. For each input word that is tagged as an NNC, first check if the word is included in
the common noun list:
a) If the word is included in the common noun list, it is a known word, hence leave
the tag as it is.
b) If the word is not included in the common noun list, it is an unknown word, hence
check the inflectional templates to update the tag.
i) Match the inflection sequence with the word, starting from the right side. If there
are matching inflection sequences, choose the complex tag paired with the longest
matching inflection sequence, and replace the NNC tag with that complex tag.
That is, adopt a right-to-left longest match requirement.
ii) If there is no matching inflection sequence, leave the complex tag as NNC.
2. For those words that are not tagged simply as NNC, leave their tags as they are.
Figure 2. Procedure for updating the morphological tags.
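A sketch of this update procedure, assuming the inflectional templates are available as a mapping from word-final inflection strings to complex tags and the common noun list as a set; the romanized template entry in the usage example is a hypothetical stand-in for the Hangul entries extracted from the Treebank.

```python
def update_tag(word: str, tag: str,
               templates: dict[str, str],
               common_nouns: set[str]) -> str:
    if tag != "NNC":
        return tag                      # step 2: only NNC-tagged words are revisited
    if word in common_nouns:
        return tag                      # step 1a: known common noun, keep NNC
    # Step 1b(i): right-to-left longest match against the inflection sequences.
    best = None
    for suffix, complex_tag in templates.items():
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, complex_tag)
    return best[1] if best else "NNC"   # step 1b(ii): no match, leave as NNC

# Hypothetical usage in the spirit of (12): a past-tense declarative verb
# that the trigram tagger mis-tagged as NNC.
templates = {"esssupnita": "VV+EPF+EFN"}
print(update_tag("icesssupnita", "NNC", templates, {"kwanchuk"}))  # -> VV+EPF+EFN
```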
(12) a. Input:
Cey-ka kwanchuk sahang-ul ic-ess-supnita.
I-nom observation item-acc forget-past-decl
‘I forgot the observation items.’
b. Output:
After the tags for unknown words were updated, the morphological tagging accuracy on the test set increased to 93.86%, a 12-point increase from
the trigram-based morphological tagging. If the shortest match is employed in the tagging update process instead, the accuracy drops to 90.29%.
2.4. LEMMA/INFLECTION IDENTIFICATION
Lemma/inflection identification produces the final output of our morphological tagger. Given a pair of a word and its complex tag, the complex tag is
decomposed and each constituent tag is paired up with the corresponding
morpheme in the word. That is, the lemma and its POS tag are paired up,
and then each inflection is paired up with the corresponding inflectional tag.
This is illustrated in (13), where a, b, c and d stand for substrings of a word,
p stands for a POS tag, and x, y and z stand for inflectional tags.
Table V. Example entries from inflection and stem dictionaries
Inflection dictionary
Stem dictionary
(13) abcd/p + x + y + z
⇓
a/p + b/x + c/y + d/z
The complex tag in conjunction with the inflection and stem dictionary (extracted from the Treebank) look-up are used to determine the
lemma/inflection identification process. The inflection dictionary contains
230 entries in total, and the stem dictionary contains 3,880 entries in total.
Example entries from the inflection dictionary and stem dictionary are
given in Table V. The list in the stem dictionary is small because the size of
the Treebank itself is small. It is possible to include a much larger stem dictionary, using resources from outside the Treebank, to enhance the overall
performance of the tagger. We however did not do so for present purposes
in order to be faithful to the corpus-based approach we have adopted; the
data used for the implementation and subsequently the evaluation of
the tagger is restricted to the Treebank. An example input and output of the lemma/inflection identification is given in (14) (for gloss see
(11)).
(14) a. Input:
b. Output:
The procedures for lemma/inflection identification are shown in Figure 3.
Given a pair of a word and its complex tag (abcd, p + x + y + z):8
1. Separate the substring from the input word that matches the longest inflection in the
inflection dictionary associated with the last inflectional tag in the input complex tag,
and label it with the inflectional tag (e.g., d/z);
If there is no matching inflection in the inflection dictionary, separate the substring
that matches the longest inflection in the inflection dictionary associated with the second-to-last inflectional tag, and label it with the inflectional tag (e.g., d/y);
2. Repeat the above procedure until the POS tag (i.e., the first tag) is reached. The
remaining substring is labeled with the POS tag (e.g., a/p).
Figure 3. Procedures for lemma/inflection identification.
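The procedure in Figure 3 can be sketched as below, assuming the inflection dictionary maps each inflectional tag to the surface strings that can realize it; the dictionary entries and the romanized input word are hypothetical, and the fallback to the second-to-last tag is omitted for brevity.

```python
def split_lemma_inflections(word: str, complex_tag: str,
                            infl_dict: dict[str, list[str]]) -> list[tuple[str, str]]:
    tags = complex_tag.split("+")
    pos, infl_tags = tags[0], tags[1:]
    pieces = []
    rest = word
    # Peel inflections off the right edge: last tag first, longest match first.
    for tag in reversed(infl_tags):
        matches = [s for s in infl_dict.get(tag, []) if rest.endswith(s)]
        if matches:
            s = max(matches, key=len)
            pieces.append((s, tag))
            rest = rest[: -len(s)]
    pieces.append((rest, pos))          # the remaining substring is the lemma
    return list(reversed(pieces))       # abcd/p+x+y+z -> a/p, b/x, c/y, d/z

# Hypothetical entries and a hypothetical romanized form in the spirit of (11).
infl_dict = {"EPF": ["ess"], "EFN": ["supnita"]}
print(split_lemma_inflections("pokohaesssupnita", "VV+EPF+EFN", infl_dict))
# -> [('pokoha', 'VV'), ('ess', 'EPF'), ('supnita', 'EFN')]
```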
In this section we showed that the combination of corpus-based rules
and the simple algorithm given above applied to morphologically tagged
data is sufficient for the usually complex task of lemma/inflection identification.9 Hoping to improve the performance, we have also implemented a
variation of this technique using corpus-particular knowledge such as special cases for POS tags like NNC, VV, and VJ based on our intuitions. The
performance however did not improve much. The performance results are
given in Table VI.
3. Evaluation and Discussion
3.1. EVALUATION OF THE MORPHOLOGICAL TAGGER
The performance of the morphological tagger trained on the Treebank has
been evaluated on 9% of the Treebank. The test set consists of 3,717 word
tokens and 425 sentences. Both precision and recall were computed by
comparing the morpheme/tag pairs in the test file and the “gold file”. The
gold file contains hand-corrected annotations that are used to evaluate the
accuracy of the tagger output. The precision corresponds to the percentage of morpheme/tag pairs in the test file that match the morpheme/tag pairs in the gold file. The recall corresponds to the percentage of morpheme/tag pairs in the gold file that match the morpheme/tag pairs in the test file. Our approach yielded 95.43% precision and 95.04% recall for the
Treebank-trained tagger. Further, using the corpus-specific lemma/inflection
Table VI. Evaluation of the morphological tagger

Method              Precision (%)   Recall (%)
Treebank-trained    95.43           95.04
Treebank-trained+   95.78           95.39
Table VII. Comparison with other morphological taggers

Reference            Accuracy (%)
Chan et al. (1998)   97.0
Yoon et al. (1999)   94.7
Lim et al. (1997)    94.8
identification procedure in the Treebank-trained+ tagger, the performance
increased only very slightly, yielding precision and recall scores of 95.78%
and 95.39%, respectively.
The performance obtained is comparable to the results reported for
state-of-the-art systems. Table VII shows accuracy scores reported for
instance by systems such as those described in Chan et al. (1998), Yoon
et al. (1999), and Lim et al. (1997), when tested on the test set drawn from
the same corpus each was trained on. Ideally, for a fair comparison, it
would be desirable to retrain other systems on the Penn Korean Treebank
and compare their performances with the proposed tagger on the same test
set, or retrain the proposed tagger on the same domain as the other taggers
and compare the results on the same test set. Unfortunately, neither option is realistic, given that neither the training data nor the source code for the other taggers is publicly available.
3.2. ERROR ANALYSIS
The sources for errors in our system can be divided into six major types:
tagging, spelling recovery and ambiguity resolution errors, and missing
entries from the common noun list, the inflectional template list and the
inflection dictionary. The most frequent type of error was a tagging error,
comprising 63% of total errors. In examples with tagging errors, the lemma
or the inflection is mistagged. The second most frequent type of error was
caused by incorrect spelling recovery, comprising 21% of total errors. If the morphological contractions in the input string are not all correctly decomposed, the input will be misanalyzed and mistagged. Ambiguity resolution
errors are mostly caused by our longest match strategy. Given a pair of
a word and a complex tag (abc, x + y), if the inflection dictionary contains an entry where bc is associated with y and another entry where c is
associated with y, then the possible analyses for abc are ab/x + c/y and
a/x + bc/y. The lemma/inflection identification procedure will always select
the analysis where the longest substring is identified as an inflection, selecting the second analysis. Thus, an error will result if the correct analysis is
the first one. This type of error covered 6% of total errors. Further, in
a small number of cases, missing entries from the common noun list, the
inflectional template list and the inflection dictionary resulted in errors. The
number and the percentage of the types of errors from the test data is summarized in Table VIII, along with examples.
4. Conclusion
In this paper, we describe an implemented morphological tagger (that also
does morphological analysis) for Korean, a language with a rich inflectional system. The training data for the trigram-based tagging component,
and morphological rules, inflection and stem dictionaries for morphological tagging and analysis are automatically extracted from an annotated corpus, the Penn Korean Treebank. This corpus-based approach enabled us
to implement the entire system in a short period of time (roughly eight
Table VIII. Summary of error analysis

Type of error        Number   (%)   Gold                                                                   Test
Tagging              174      63    cengcikha-ci (honest-connective ‘be honest’)                           cengikha-ci
Spelling recovery    57       21    piwu-l (eminate-adnominal ‘which will eminate’)                        piwul
Ambiguity            16       6     pwutay-lo (squad-to ‘to the squad’)                                    pwu-taylo
Missing NNC          15       5     hakto (cadet ‘cadet’)                                                  hak-to
Missing template     11       4     hwunlyun-i-kwun (drill-cop-decl ‘be a drill’)                          hwunlyunikwun
Missing inflection   4        1     sotaykup-eykkaci (platoon level-towards ‘towards the platoon level’)   sotaykupe-kkaci
weeks), with resulting performance in the range of 95% precision/recall on
the test set from the same domain as the training data. The treebank-trained morphological tagger performed as well as state-of-the-art taggers
whose training data are from different domains, even though, in contrast
to other taggers, in our system the analysis phase follows the tagging phase.
This was possible because of the two-stage morphological tagging process
we employed. In particular, in the second tagging stage, the morphological tags for the unknown words are updated, exploiting the inflections by
matching these against information stored in inflectional templates. Our
tagger has been successfully incorporated into a Korean statistical parser
(Sarkar and Han, 2002), and a Korean–English machine translation system as part of the source language analysis component (Palmer et al.,
2002). Our approach is applicable to any language, as long as a corpus
annotated with morphological information is available. Although the training data and the morphological rules will be corpus-specific, and hence
language-specific, the general approach and the procedures that we have
proposed for spelling recovery, updating morphological tags, and lemma/
inflection identification as well as trigram-based tagging are language-neutral, and can be applied to any agglutinative language. In the future, we
would like to test the portability of our approach by first retraining our
tagger on a larger corpus, and second by applying our approach to morphological tagger development for other languages with rich morphology.
Appendix A. A List of Tags used in the Penn Korean Treebank
A.1. POS TAGS
NPR: proper noun: (hankwuk ‘Korea’), (kullinthon ‘Clinton’)
NNC: common noun: (hakkyo ‘school’), (kemphyute ‘computer’)
NNX: dependent noun: (kes ‘thing’), (tung ‘etc’), (nyen ‘year’)
NPN: personal pronoun, demonstratives: (ku ‘he’), (ikes ‘this’)
NNU: ordinal, cardinal, numeral: 1, (hana ‘one’), (chesccay ‘first’)
NFW: words written in foreign characters: Clinton, computer
VV: verb: (ka ‘go’), (mek ‘eat’)
VJ: adjective: (yeppu ‘pretty’), (talu ‘different’)
VX: auxiliary predicate: (iss ‘present progressive’), (ha ‘must’)
ADV: constituent and clausal adverb: (maywu ‘very’), (coyonghi ‘quietly’), (manil ‘if’)
ADC: conjunctive adverb: (kuliko ‘and’), (kulena ‘but, however’)
DAN: configurative, demonstrative: (say ‘new’), (hen ‘old’), (ku ‘that’)
IJ: exclamation: (a ‘ah’)
LST: list marker: a, (b), 1, 2.3.1, (ka), (na.)
A.2. INFLECTIONAL TAGS
PCA: case marker: (ka/i nominative), (ul/lul accusative), (uy possessive), (ya vocative)
PAD: adverbial postposition: (yese ‘from’), (lo ‘to’)
PCJ: conjunctive postposition: (wa/kwa ‘and’), (hako ‘and’)
PAU: auxiliary postposition: (man ‘only’), (to ‘also’), (nun topic)
CO: copula: (i ‘be’)
EFN: sentence type marker: (nunta/nta declarative), (ela/la imperative), (ni/nunka/nunci interrogative), (ca propositive)
ECS: coordinate, subordinate, adverbial, complementizer: (ko ‘and’), (mulo ‘because’), (key attaches to adjectives to derive adverbs), (tako ‘that’)
EAU: auxiliary, on verbs or adjectives that immediately precede auxiliary predicates: (a), (key), (ci), (ko)
EAN: adnominal, on main verbs or adjectives in relative clauses or complement clauses of a complex NP: (nun/n)
ENM: nominal, on nominalized verb: (ki), (um)
EPF: pre-final ending: (ess past), (si honorific)
XSF: suffix: (nim), (tul), (cek)
XPF: prefix: (cey), (kak), (may)
XSV: verbalization suffix: (ha), (toy), (siki)
XSJ: adjectivization suffix: (sulep), (tap), (ha)
A.3. PUNCTUATION TAGS
SCM: comma: ,
SFN: sentence ending markers: . ? !
SLQ: left quotation mark: ‘ ( “ {
SRQ: right quotation mark: ’ ) ” }
SSY: other symbols: ... ; : -
Appendix B. Description of the Statistical Model for Tagging
In this appendix, we provide details of the statistical model we used in two
different steps in our overall morphological tagging algorithm. The only
difference in the two uses of the model was the change in the tag set used.
In Section 2.2, we used a tag set of five tags in order to perform the task
of spelling recovery, while in Section 2.3.1, we used a tag set of 165 tags in
order to perform the full task of tagging each morpheme.
Let the input sentence (word sequence) be $w = w_0, w_1, \ldots, w_n$. Let the most likely tag sequence be $t^* = t^*_0, t^*_1, \ldots, t^*_n$. Given an input word sequence $w$ we want to find the most likely tag sequence, which we do by constructing a probability model which provides a method for comparing all possible tag sequences given the input word sequence, $P(t_0, t_1, \ldots, t_n \mid w_0, w_1, \ldots, w_n)$.
The best (or most likely) tag sequence is (1).

$$t^* = \arg\max_{t_0,\ldots,t_n} P(t_0, \ldots, t_n \mid w_0, \ldots, w_n) \quad (1)$$
In order to estimate this from our training data, we use Bayes rule to
convert our conditional probability into a generative model (2). We are
comparing tag sequences for the same input word sequence, and so we can
ignore the denominator $P(w_0, \ldots, w_n)$ (3) since it does not contribute to
the distinction between different tag sequences.
$$\arg\max_{t_0,\ldots,t_n} P(t_0, \ldots, t_n \mid w_0, \ldots, w_n) = \arg\max_{t_0,\ldots,t_n} \frac{P(w_0, \ldots, w_n \mid t_0, \ldots, t_n) \times P(t_0, \ldots, t_n)}{P(w_0, \ldots, w_n)} \quad (2)$$

$$= \arg\max_{t_0,\ldots,t_n} P(w_0, \ldots, w_n \mid t_0, \ldots, t_n) \times P(t_0, \ldots, t_n) \quad (3)$$
The two terms in this equation cannot be reasonably estimated due to
sparse data problems. We apply the Markov assumption to simplify each
term. For the first term that involves lexical probabilities, the assumption
made is that the probability of generating a word is dependent only on its
POS tag (4).
$$P(w_0, \ldots, w_n \mid t_0, \ldots, t_n) = P(w_0 \mid t_0) \times P(w_1 \mid t_1) \times \cdots \times P(w_n \mid t_n) = \prod_{i=0}^{n} P(w_i \mid t_i) \quad (4)$$
The second term is the model that produces POS tag sequences. We use
the Markov assumption to generate tag sequences using a trigram model,
where each tag is generated based on the previous two tags (5).
$$P(t_0, \ldots, t_n) = P(t_0) \times P(t_1 \mid t_0) \times P(t_2 \mid t_0, t_1) \times \cdots \times P(t_n \mid t_{n-2}, t_{n-1}) = P(t_0) \times P(t_1 \mid t_0) \times \prod_{i=2}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \quad (5)$$
Putting these two terms back into the original equation we get (6):
$$\arg\max_{t_0,\ldots,t_n} P(t_0, \ldots, t_n \mid w_0, \ldots, w_n) = \arg\max_{t_0,\ldots,t_n} P(w_0, \ldots, w_n \mid t_0, \ldots, t_n) \times P(t_0, \ldots, t_n)$$

$$= \arg\max_{t_0,\ldots,t_n} \prod_{i=0}^{n} P(w_i \mid t_i) \times \left( P(t_0) \times P(t_1 \mid t_0) \times \prod_{i=2}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \right) \quad (6)$$

$$= \arg\max_{t_0,\ldots,t_n} \prod_{i=0}^{n} P(w_i \mid t_i) \times P(t_i \mid t_{i-2}, t_{i-1}) \quad (7)$$
In order to simplify our Equation from (6) to (7), we add special tokens
of type <bos> to the beginning of each sentence in order to condition the
first word on these tokens. We also add tokens of type <eos> to the end
of the sentence.
Now that we have the model in (7), all we need to do to find the most
likely tag sequence is to train the following two probability models, $P(w_i \mid t_i)$ and $P(t_i \mid t_{i-2}, t_{i-1})$.
This is done using our training data which consists of word and
tag sequences. To find the most likely tag sequence, we use the same
algorithm used to find the best tag sequence in HMM: the Viterbi
algorithm.
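A compact sketch of trigram Viterbi decoding for the model in (7) is given below; the probability tables are passed in as plain dictionaries, and the smoothing, <unseen> handling and <eos> bookkeeping described in the text are simplified away, so this illustrates the search only, not the full system.

```python
from math import log

def viterbi(words, tags, p_word_given_tag, p_tag_given_prev2):
    """Most likely tag sequence under the model in (7).

    p_word_given_tag[(w, t)]     ~ P(w_i | t_i)
    p_tag_given_prev2[(u, v, t)] ~ P(t_i | t_{i-2}=u, t_{i-1}=v)
    Unseen events get a tiny floor probability instead of real smoothing.
    """
    def emit(w, t):
        return log(p_word_given_tag.get((w, t), 1e-12))

    def trans(u, v, t):
        return log(p_tag_given_prev2.get((u, v, t), 1e-12))

    # best[(u, v)] = (log score, tag sequence) of the best path ending in u, v
    best = {("<bos>", "<bos>"): (0.0, [])}
    for w in words:
        new_best = {}
        for (u, v), (score, path) in best.items():
            for t in tags:
                s = score + trans(u, v, t) + emit(w, t)
                if (v, t) not in new_best or s > new_best[(v, t)][0]:
                    new_best[(v, t)] = (s, path + [t])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Toy usage with hypothetical probabilities and a two-tag tag set.
tags = ["NNC", "VV"]
p_w = {("hakkyo", "NNC"): 0.9, ("ka", "VV"): 0.8, ("ka", "NNC"): 0.1}
p_t = {("<bos>", "<bos>", "NNC"): 0.6, ("<bos>", "<bos>", "VV"): 0.4,
       ("<bos>", "NNC", "VV"): 0.7, ("<bos>", "NNC", "NNC"): 0.3}
print(viterbi(["hakkyo", "ka"], tags, p_w, p_t))  # -> ['NNC', 'VV']
```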
The model however cannot handle unseen data, such as unseen words $w_i$, unseen word–tag combinations $(w_i, t_i)$, and unseen trigrams $(t_{i-2}, t_{i-1}, t_i)$.
For unseen words, we rename those words with singleton counts in our
training data as a special <unseen> token. Due to this model, as stated
in Section 2.3.1, the trigram tagger for morphological tagging assigns the
NNC (common noun) tag to unknown words by default, NNC being the
most frequent tag for unknown words. The update tagging step (see Section
2.3.2) does further disambiguation for the unknown words. We also use
suffix and prefix features from the words and use them in the original model
as was done in Weischedel et al. (1993). For unseen tri-tag sequences in
the trigram model of tag sequences, we use Katz backoff smoothing (Katz,
1987) which uses bigrams of tag sequences when the trigram sequence
was unobserved in our training data. The discount factor for the bigram
sequence is computed using the Good–Turing estimate (Good, 1953) which
is the standard practice in Katz backoff smoothing.
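The backoff scheme can be illustrated with the following simplified sketch, which replaces the Good–Turing discounts of true Katz smoothing with a fixed discount; it shows how probability mass freed by discounting observed trigrams is redistributed to unseen trigrams in proportion to their bigram estimates, and is an illustration of the idea rather than a faithful reimplementation of the system.

```python
from collections import Counter

def backoff_tag_model(tag_sequences, discount=0.5):
    """Discounted trigram model over tags with bigram backoff (fixed discount
    instead of Good-Turing; assumes every bigram context is continued, which
    proper <eos> padding would guarantee)."""
    tri, bi, uni = Counter(), Counter(), Counter()
    for seq in tag_sequences:
        padded = ["<bos>", "<bos>"] + list(seq)
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
        tri.update(zip(padded, padded[1:], padded[2:]))

    def p_bigram(v, t):
        return (bi[(v, t)] - discount) / uni[v] if bi[(v, t)] > 0 else 1e-12

    def p(t, u, v):
        """~ P(t | u, v): discounted trigram estimate, else bigram backoff."""
        if tri[(u, v, t)] > 0:
            return (tri[(u, v, t)] - discount) / bi[(u, v)]
        seen = [w for (a, b, w) in tri if (a, b) == (u, v)]
        if not bi[(u, v)] or not seen:
            return p_bigram(v, t)        # context never seen: plain bigram estimate
        freed = discount * len(seen) / bi[(u, v)]         # mass freed by discounting
        norm = 1.0 - sum(p_bigram(v, w) for w in seen)    # bigram mass of unseen tags
        return freed * p_bigram(v, t) / max(norm, 1e-12)
    return p

# Toy usage with hypothetical tag sequences.
p = backoff_tag_model([["NNC", "VV"], ["NNC", "NNC", "VV"]])
print(round(p("VV", "NNC", "NNC"), 3))   # observed trigram           -> 0.5
print(round(p("NNC", "NNC", "NNC"), 3))  # unseen trigram via backoff -> 0.167
```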
Acknowledgements
We thank Anoop Sarkar and Aravind Joshi for valuable discussions on
many occasions. We are also grateful to the anonymous reviewer whose
critical comments helped to reshape and improve this paper. All errors are
ours. This work was supported by IRCS, and contract DAAD 17-99-C0008 awarded by the Army Research Lab to CoGenTex, Inc., with the University of Pennsylvania as a subcontractor. In addition, the first author was
supported by SSHRC #410-2003-0544.
Notes
1. Hangul characters are romanized following the Yale Romanization Convention throughout the paper. In this and subsequent examples, the following abbreviations are used:
acc(usative case marker), co(pula inflection), decl(arative), int(errogative), nom(inative case
marker).
2. Further relevant works, all in Korean, which the present authors were unfortunately
unable to locate, and for which incomplete references are provided, include a presentation by E. C. Lee and J. H. Lee at the 4th Conference of Hangul and Korean Information Processing, a journal paper by J. H. Choi and S. J. Lee in the 1993 Journal of Korea
Information Science Society, and Masters theses by S. Y. Kim, from KAIST in 1987, and
by H. S. Lim from Korea University in 1994.
3. NNC is a tag for common nouns, NPN for pronouns, VV for verbs, and VX for auxiliary verbs. PCA is a tag for case markers, EAU for inflections on a verb followed by an
auxiliary verb, and EFN for sentence-type inflections. The tags used in the Penn Korean
Treebank are listed in Appendix A. See Han and Han (2001) for further explanations of
these tags.
4. The spelling recovery templates are similar to the morphographemic rewrite rules found
in traditional morphological analyzers. The difference however is that while traditional
morphographemic rewrite rules are conditioned on morphological notions such as morpheme boundaries, hence presupposing a segmentation of a word into its component morphemes, our spelling recovery templates are conditioned on the POS tag assigned to the
entire word.
5. Srinivas's 1997 SuperTagger is publicly available from the XTAG Project webpage:
http://www.cis.upenn.edu/˜xtag.
6. While the Korean Treebank is smaller than the usual data set used in English POS tagging (the 1m-word Penn English Treebank (Marcus et al., 1993)), the size of the Treebank
is big enough to provide statistically significant results. Hence, we did not find the need
to conduct n-fold cross validation experiments.
7. The tag set used in our morphological tagger is a subset of the Korean Treebank tag
set. In the tag set for the tagger, EAU and ECS are collapsed to ECS, and affixal tags
such as XPF, XSF, XSV and XSJ are not included. In particular, in the training data,
the POS tag for words with an XSV tag was converted to VV, and for words with an XSJ tag to VJ.
8. The longest match requirement employed in lemma/inflection identification is similar to
the one employed in a finite-state transducer with Directed Replacement, as described in
Karttunen (1996).
9. A reviewer asks if we assume that all morphological processes are via overt suffixes.
Although there may be empirical and theoretical motivations for positing null morphemes
in Korean, we did not model this in our system. The main reason for not doing so is
that the Treebank we used did not have any annotations for null morphemes.
References
Chan, Jeongwon, Geunbae Lee, and Jong-Hyeok Lee: 1998, ‘Generalized Unknown Morpheme Guessing for Hybrid POS Tagging of Korean’, in Proceedings of the Sixth
Workshop on Very Large Corpora, Montreal, Canada, pp. 85–93.
Church, Kenneth: 1988, ‘A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text’, Computer Speech and Language 5, 19–54.
Ezeiza, N., I. Alegria, J. M. Arriola, R. Urizar, and I. Aduriz: 1998, ‘Combining Stochastic
and Rule-based Methods for Disambiguation in Agglutinative Languages’, in COLINGACL ’98: 36th Annual Meeting of the Association for Computational Linguistics and 17th
International Conference on Computational Linguistics, Montreal, Canada, pp. 379–384.
Good, I. J.: 1953, ‘The Population Frequencies of Species and the Estimation of Population Parameters’, Biometrika 40, 237–264.
Hajič, Jan and Barbora Hladká: 1998, ‘Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset’, in COLING-ACL ’98: 36th Annual
Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 483–490.
Hajič, Jan, Pavel Krbec, Pavel Květoň, Karel Oliva, and Vladimír Petkevič: 2001, ‘Serial
Combination of Rules and Statistics: A Case Study in Czech Tagging’, in Association
for Computational Linguistics 39th Annual Meeting and 10th Conference of the European
Chapter, Toulouse, France, pp. 260–267.
Hakkani-Tür, Dilek Z., Kemal Oflazer, and Gökhan Tür: 2002, ‘Statistical Morphological Disambiguation for Agglutinative Languages’, Computers and the Humanities 36,
381–410.
Han, Chung-hye and Na-Rae Han: 2001, ‘Part of Speech Tagging Guidelines for Penn
Korean Treebank’, IRCS Report 01-09, IRCS, University of Pennsylvania.
Han, Chung-hye, Na-Rare [sic] Han, Eon-Suk Ko, and Martha Palmer: 2002, ‘Development and Evaluation of a Korean Treebank and its Application to NLP’, in LREC
2002: Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, pp. 1635–1642.
Hong, Y., M. W. Koo, and G. Yang: 1996, ‘A Korean Morphological Analyzer for Speech
Translation System’, in ICSLP 96: The Fourth International Conference on Spoken
Language Processing, Philadelphia, PA, pp. 676–679.
Karttunen, Lauri: 1996, ‘Directed Replacement’, in 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, pp. 108–115.
Katz, Slava M.: 1987, ‘Estimation of Probabilities from Sparse Data for the Language
Model Component of a Speech Recognizer’, IEEE Transaction on Acoustics, Speech
and Signal Processing 35, 400–401.
Lee, Sang-Zoo, Jun-ichi Tsujii, and Hae-Chang Rim: 2000, ‘Lexicalized Hidden Markov
Models for Part-of-speech Tagging’, in Proceedings of the 18th International Conference on Computational Linguistics, COLING 2000 in Europe, Saarbrücken, Germany, pp.
481–487.
Lim, Hewui Seok, Jin-Dong Kim, and Hae-Chang Rim: 1997, ‘A Korean Part-of-speech
Tagger using Transformation-based Error-driven Learning’, in Proceedings of the 1997
International Conference on Computer Processing of Oriental Languages, Hong Kong,
pp. 456–459.
Lim, Heui-Suk, Sang-Zoo Lee, and Hae-Chang Rim: 1995, ‘An Efficient Korean Morphological Analyzer Using Exclusive Information’, in International Conference on Computer
Processing of Oriental Languages, ICCPOL ’95, Honolulu, HI.
Marcus, Mitch, Beatrice Santorini, and M. Marcinkiewicz: 1993, ‘Building a Large Annotated Corpus of English’, Computational Linguistics 19, 313–330.
Palmer, Martha, Chung-hye Han, Anoop Sarkar, and Ann Bies: 2002, ‘Integrating Korean
Analysis Components in a Modular Korean/English Machine Translation System’, Ms.
University of Pennsylvania and Simon Fraser University.
Ratnaparkhi, Adwait: 1996, ‘A Maximum Entropy Model for Part-of-speech Tagging’, in
Proceedings of the Conference on Empirical Methods in Natural Language Processing,
Philadelphia, PA, pp. 133–142.
Sarkar, Anoop and Chung-hye Han: 2002, ‘Statistical Morphological Tagging and Parsing
of Korean with an LTAG Grammar’, in Proceedings of the 6th International Workshop
on Tree Adjoining Grammars and Related Formalisms TAG+6, Venice, Italy, pp. 48–56.
Srinivas, B.: 1997, ‘Complexity of Lexical Descriptions and its Relevance to Partial Parsing’. Ph.D. thesis, Department of Computer and Information Sciences, University of
Pennsylvania.
Tufiş, Dan, Péter Dienes, Csaba Oravecz, and Tamás Váradi: 2000, ‘Principled Hidden
Tagset Design for Iterated Tagging of Hungarian’, in LREC 2000: 2nd International
Conference on Language Resources and Evaluation, Athens, Greece, pp. 1421–1426.
Weischedel, Ralph, Richard Schwartz, Jeff Palmucci, Marie Meteer, and Lance Ramshaw:
1993, ‘Coping with Ambiguity and Unknown Words through Probabilistic Models’,
Computational Linguistics 19, 359–382.
Yoon, Juntae, C. Lee, S. Kim, and M. Song: 1999, ‘Morphological Analyzer of Yonsei University Morany: Morphological Analysis based on Large Lexical Database Extracted from Corpus’ (in Korean), in Proceedings of the 11th Conference on Hangul and Korean Language Information Processing, pp. 92–98.