Parsing-based Chinese Word Segmentation
Integrating Morphological and Syntactic Information
Xihong WU, Meng ZHANG, Xiaojun LIN
Speech and Hearing Research Center
Key Laboratory of Machine Perception (Ministry of Education)
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Email: {wxh,zhangm,linxj}@cis.pku.edu.cn
Abstract—The conventional sequence labeling methods for Chinese word segmentation do not fully utilize linguistic information, which restricts further improvements in performance. Chinese morphology intensively investigates the construction and usage of Chinese words, which is helpful to Chinese word segmentation. Furthermore, some word segmentation ambiguities cannot be resolved by lexical information alone, and the final disambiguation takes place during parsing. In this paper, we propose a parsing-based Chinese word segmentation model, which fully utilizes morphological and syntactic information. Experiments on Penn Chinese Treebank (CTB) 5.0 show that the proposed model achieves performance competitive with the CRFs-based model. To investigate the relationship between our parsing-based model and the CRFs-based model, a maximum entropy framework for integrating different knowledge sources is employed. The integrated model obtains an F-measure of 97.9, a 25% reduction in segmentation error rate relative to the CRFs-based model, which indicates that the two models are complementary to each other.
I. INTRODUCTION
Chinese word segmentation (CWS) is one of the core techniques in Chinese language processing, and it is a prerequisite for many tasks, such as syntactic parsing, machine translation and information extraction. Character1 sequence labeling has become the prevailing technique for this task [1], [2], [3]. Among machine learning methods, the Maximum Entropy model (ME) [4] and Conditional Random Fields (CRFs) [5] have turned out to be effective, and have obtained excellent performances in the word segmentation tracks of the SIGHAN Bakeoff.
Though the conventional methods based on ME or CRFs emphasize the contexts of Chinese characters, they do not fully utilize linguistic information, such as the construction structure within a word and high-level syntactic information. Chinese morphology intensively investigates the construction and usage of Chinese words, which is beneficial to word segmentation. Furthermore, some word segmentation ambiguities cannot be resolved by lexical information alone, and the final disambiguation takes place during parsing. Therefore, morphological and syntactic information can be used to improve word segmentation.
1 Characters here stand for various tokens occurring in naturally written Chinese text, including Chinese characters (hanzi), foreign letters, digits, and punctuation.
978-1-61284-729-0/11/$26.00 ©2011 IEEE
A variety of approaches have been developed to explore the use of this linguistic information in Chinese word segmentation. For the morphological information, Lin et al. introduced this kind of information into the conventional CRFs-based model, using morpheme tags to refine the character boundary tags, and obtained improvements across multiple corpora [6]. These gains come from the morphological information. For the syntactic information, there is much research on joint Chinese word segmentation and part-of-speech (POS) tagging, with the purpose of avoiding the propagation of word segmentation errors and making full use of the POS information to resolve segmentation ambiguities. Based on a maximum entropy word segmenter, Ng et al. implemented joint tagging in a labeling fashion by expanding boundary tags to include POS information [7]. Zhang et al. used a single perceptron as the joint tagging model and adopted a multiple-beam search algorithm for efficient decoding [8]. These approaches all achieved improved word segmentation results, compared to models in a pipeline style. The improvements are due to
the introduction of the POS information. With regard to high-level syntactic information beyond POS, Luo proposed a character-based parser, which can parse unsegmented Chinese sentences, and achieved decent performance [9]. In his work, three simple rules were employed to convert word-based parse trees into character-based parse trees, which encode both word boundary and POS information. A maximum entropy parser [10] was then trained on the character-based treebank. In this character-based parsing approach, syntactic information can directly influence word segmentation.
However, investigating only the morphological or only the syntactic information is not sufficient for Chinese word segmentation. In this paper, we propose a parsing-based Chinese word segmentation model, which combines morphological and syntactic information in a single model. Fig. 1 illustrates the similarity between the construction of Chinese words and that of phrases and sentences. The preterminal node of the morpheme structure tree is the morpheme tag of the hanzi at its terminal node; “n”, “v” and “a” stand for nominal, verbal and adjectival morphemes respectively. Here, the “modifier head construction” refers to an adjective modifying a noun or a noun modifying a noun, and the “serial verb construction” refers to two verbs in sequence. Both constructions are among the most common types of Chinese words. We may thus regard the Chinese word morpheme structure as a special case of conventional syntactic structure.
Our work is similar to Luo’s [9], but differs in two ways. Firstly, the parsing model we use is an unlexicalized parser [11] rather than a maximum entropy parser [10]. The Berkeley parser is a robust syntactic parser, which has achieved high performance across multiple languages and domains, even without any language-specific tuning [11]. Secondly, Luo derived character-level tags from POS tags [9]. POS is believed to encode information between words rather than within words, while morphology reflects construction patterns within words. To integrate both kinds of information, character-level tags are further refined with morpheme tags in our approach.
The remainder of this paper is organized as follows. Section
II introduces Chinese morphology. Then, Section III gives the
CRFs-based baseline word segmenter. Following that, Section
IV describes our parsing-based word segmentation approach
in detail. In Section V, several experiments are carried out for
evaluation. The appealing characteristics of providing multiple
segmentation choices are discussed in Section VI. Conclusions
are drawn at the end.
II. CHINESE MORPHOLOGY
Chinese morphology investigates the construction and usage of Chinese words, which is helpful to Chinese word segmentation. A word in Chinese is defined as the smallest independently usable part of the language, or that part of a sentence which can be used independently [12]. The evolution of Chinese words is a process from single-character words to multi-character words. As the constituents of words, morphemes can be characterized in many ways, such as relational description, modification structure description, semantic description, syntactic description and form class description [13]. The word structure of Chinese is mainly described with the form class description, in which morphemes are viewed in terms of their form class identities, or “part-of-speech”. Deciding the form class of a morpheme by its part-of-speech works well for morphemes as they appear in syntactic contexts, i.e., when they are used as free words. When they occur outside of syntactic contexts or inside words, morphemes may behave in two ways. Firstly, they may retain their original class, whether they serve as the left- or right-hand member of the word, such as the noun “纸”(Zhi) in “纸板”(ZhiBan), “纸花”(ZhiHua), “宣纸”(XuanZhi), “剪纸”(JianZhi) and “信纸”(XinZhi). Secondly, their class may change according to certain principles, for example, nouns have nominal constituents on the right and verbs have verbal constituents on the left [13]. However, these Headedness Principles in the grammar of Chinese word formation may also be contravened at times.
Four types of morphemes, i.e. function word, root word, bound root and affix, are induced from the criteria of free vs. bound and content vs. function. Their functions are described below. According to [13], there are two subcategories of affix, namely word-forming affixes and grammatical affixes. Word-forming affixes include the nominalizing suffixes “-子”(Zi), “-头”(Tou), “-性”(Xing) and “-度”(Du), the verbalizing suffix “-化”(Hua), the negative prefixes “无-”(Wu), “未-”(Wei) and “非-”(Fei), the adverbial suffix “-然”(Ran) and the agentive suffix “-者”(Zhe). Grammatical affixes include the verbal aspect markers “-了”(Le), “-着”(Zhe) and “-过”(Guo), the resultative potential infixes “-得”(De) and “不-”(Bu), and the human noun plural suffix “-们”(Men). Like affixes, bound roots are bound morphemes. But unlike word-forming affixes, bound roots tend to have meanings that are more fixed and lexicalized, while word-forming affixes tend to be more general and abstract. Additionally, word-forming affixes are more productive than bound roots and involve a grammatical change. Except for function words, the three morpheme types - root word, bound root and affix - are free to combine with other morphemes to form larger words in Chinese.
In view of the two general approaches to the morphological
analysis presented above, Packard proposed a joint approach
based on the X-bar theory of syntax [13]. A basic property of
the X-bar syntactic theory is that a category is expanded as
a lower-bar copy of itself, optionally including other maximal
expansions. It is possible to apply the X-bar theory to words,
because the basic units can be categorized according to the
following four facets:
(i) form class identities
(ii) the extent to which they can stand alone as free morphemes
(iii) whether they have a “grammatical” or a “lexical” identity
(iv) the manner in which they combine to form words
All of the above suggests that there exist construction rules for forming words from morphemes in Chinese, which provide cues for word segmentation.
A. The Construction of Chinese Morpheme Corpus
It is difficult to construct a Chinese morpheme corpus owing to its large scale and the ambiguities involved. The Contemporary Chinese Compound Words Database [14] developed by Peking University played an important role in the initial construction stage. It contains 10914 single-character words, 32711 two-character words, 7296 three-character words, and 7850 four-character words. In this database, each word is described with its part-of-speech, the form class identities of its morphemes, the construction type of the word, and so on.
After intensive observation of the database entries that share the same word and the same part-of-speech tag, we make a strong assumption: the morpheme category of a character is definite given the word containing the character and its part-of-speech tag. This assumption reduces the morpheme annotation of a corpus to compiling a morpheme database like the Contemporary Chinese Compound Words Database, except that there is only one entry for a given word and part-of-speech tag.
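Under this assumption, the database reduces to a plain mapping keyed by (word, part-of-speech). The sketch below illustrates the idea; the entries and helper name are ours, not quoted from the database:

```python
# Hypothetical excerpt: one morpheme-category list per (word, POS) key.
MORPHEME_DB = {
    ("政府", "NN"): ["n", "n"],  # illustrative entry
    ("充满", "VV"): ["v", "v"],
}

def morpheme_tags(word, pos):
    """Look up the (assumed definite) morpheme categories of a word's
    characters, given the word and its part-of-speech tag."""
    return MORPHEME_DB.get((word, pos))
```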
The morphemes are divided into 26 categories, as shown in Table I. The definitions of the morpheme categories follow [15].
Fig. 1. The parse tree of the sentence “政府充满活力。”(The government is full of vitality.) and the morpheme structure tree of the words. A. is the parse
tree, B. is the morpheme structure tree.
TABLE I
THE CHINESE MORPHEME CATEGORIES. CAT. AND DESC. STAND FOR CATEGORY AND DESCRIPTION RESPECTIVELY.

Cat.  Desc.            Cat.  Desc.
a     adjective        b     difference
c     conjunction      d     adverb
e     interjection     f     direction
h     prefix           i     idiom
j     abbreviation     k     suffix
m     numeral          n     normal noun
o     onomatopoeia     p     preposition
q     quantifier       r     pronoun
t     time             u     auxiliary
v     verb             w     punctuation
x     not morpheme     y     mood
z     status           nr    person
ns    location         nt    organization
Next, we briefly describe the annotation process for CTB. Firstly, only sentences with all their words present in the Contemporary Chinese Compound Words Database are selected, resulting in a small morpheme-annotated corpus. Then, with these sentences as the training set, a CRFs-based annotation model is trained using the following features: unigram features, part-of-speech tags, word boundaries and bigram morpheme tags. After training, the model is used to label the rest of the corpus with morpheme categories. Sentences with high coverage of words in our morpheme database are selected and, after an additional pass of manual correction, added to the training set. The whole process is iterated for several rounds until all words in CTB are covered.
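The coverage criterion used to promote sentences into the training set can be sketched as follows; the function names and the threshold value are our assumptions (the paper does not fix a numeric threshold):

```python
def word_coverage(sentence_words, morpheme_db):
    """Fraction of the sentence's words present in the morpheme database."""
    if not sentence_words:
        return 0.0
    hits = sum(1 for w in sentence_words if w in morpheme_db)
    return hits / len(sentence_words)

def select_sentences(corpus, morpheme_db, threshold=1.0):
    """Keep sentences whose coverage reaches the threshold; threshold=1.0
    reproduces the initial all-words-present selection."""
    return [s for s in corpus if word_coverage(s, morpheme_db) >= threshold]
```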
III. THE BASELINE SYSTEM
Chinese word segmentation is an essential technique in Chinese language processing. According to [7], the segmentation task can be transformed into a tagging problem by assigning each character a boundary tag of one of four types: “b” for a character that begins a word, “m” for a character in the middle of a word, “e” for a character that ends a word, and “s” for a character that forms a single-character word by itself. The labeled result can be split into subsequences matching the pattern s or b(m)*e, which denote a single-character word and a multi-character word respectively. Among machine learning methods, Maximum Entropy (ME) [16] and Conditional Random Fields (CRFs) [5] have turned out to be effective and obtain excellent performances in the word segmentation tracks of the SIGHAN Bakeoff.
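The tag scheme can be sketched in a few lines (the function names are ours):

```python
def words_to_tags(words):
    """Map a segmented sentence to per-character boundary tags (b/m/e/s)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(w) - 2) + ["e"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their boundary tags."""
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in ("e", "s"):
            words.append(cur)
            cur = ""
    if cur:  # tolerate a dangling "b"/"m" at sentence end
        words.append(cur)
    return words
```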
We adopt the CRFs-based character sequence labeling approach as our baseline word segmenter, implemented with the CRF++ package2. The common four tags {b, m, e, s} are used to encode the word boundary information. Eight unigram feature templates, namely C−2, C−1, C0, C1, C2, C−1C0, C0C1 and C−1C1, are selected, where C0 refers to the current character and C−n (or Cn) is the n-th character to the left (or right) of the current character. We also use the bigram feature template, which captures the dependency between adjacent tags.
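A small sketch of how the eight unigram templates instantiate at one position; the "<B>" padding symbol and the dict representation are our assumptions (CRF++ itself expands templates from its template file):

```python
def unigram_features(chars, i):
    """Instantiate the eight unigram templates C-2, C-1, C0, C1, C2,
    C-1C0, C0C1, C-1C1 at position i; out-of-range positions map to
    a boundary symbol."""
    def C(n):
        j = i + n
        return chars[j] if 0 <= j < len(chars) else "<B>"
    return {
        "C-2": C(-2), "C-1": C(-1), "C0": C(0), "C1": C(1), "C2": C(2),
        "C-1C0": C(-1) + C(0), "C0C1": C(0) + C(1), "C-1C1": C(-1) + C(1),
    }
```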
IV. PARSING-BASED WORD SEGMENTATION
2 http://crfpp.sourceforge.net/.
This section describes our parsing-based word segmenter. The Penn Chinese Treebank (CTB) [17] is manually segmented and annotated at the word level. In order to build a parser operating at the character level, first and foremost, the original word-based trees need to be converted into character-based trees. Then a parsing model is trained on the character-based trees using the Berkeley parser [11], which achieves high performance across multiple languages.
A. Word-Tree to Character-Tree
Word-based trees in CTB are converted into character-based trees according to a few simple rules, as done in [9]:
(i) Word-level part-of-speech tags become constituent labels in character-based trees.
(ii) Character-level tags are morpheme tags appended with a positional tag and the word-level POS tag.
(iii) Morpheme tags are annotated as described in Section II-A.
Fig. 2 illustrates a conversion example. The character-level POS tag “n b NN” of the character “政”(politics) indicates the first, nominal character of a noun.
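Rule (ii) can be sketched as follows; the underscore separator and the function name are our own conventions (the paper prints the tag with spaces, e.g. “n b NN”):

```python
def char_level_tags(word, pos, morphemes):
    """Derive character-level preterminal labels for one word, per rule (ii):
    morpheme tag + positional tag (b/m/e/s) + word-level POS tag.
    `morphemes` lists the morpheme category of each character."""
    n = len(word)
    tags = []
    for i, m in enumerate(morphemes):
        if n == 1:
            p = "s"
        elif i == 0:
            p = "b"
        elif i == n - 1:
            p = "e"
        else:
            p = "m"
        tags.append(f"{m}_{p}_{pos}")
    return tags
```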
It is obvious that character-level tags encode word structure and word boundary information, while chunk-level labels are word-level POS tags. Therefore, parsing a Chinese character sentence essentially performs word segmentation, POS tagging and syntactic parsing at the same time.
The parse tree for a character sentence can easily be transformed back into a word parse tree, the yield of which is the word segmentation result.
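The back-conversion can be sketched as follows, assuming preterminal labels of the form morpheme_position_POS (the underscore separator is our convention):

```python
def tree_yield_to_words(chars, preterminals):
    """Recover (word, POS) pairs from a character-based tree's preterminal
    labels; word boundaries are read off the positional tags (e/s close a word)."""
    out, cur = [], ""
    for c, label in zip(chars, preterminals):
        _, p, pos = label.split("_")
        cur += c
        if p in ("e", "s"):
            out.append((cur, pos))
            cur = ""
    return out
```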
B. Parsing Model
Probabilistic Context-Free Grammars (PCFGs) lay the foundation for most high-performance syntactic parsers. However, restricted by its strong context-free assumptions, the original PCFG model, which simply reads the grammar and probabilities off a treebank, does not perform well. Therefore, a variety of techniques have been developed to enrich and generalize the original grammar, ranging from lexicalization [18], [19] to symbol annotation [20], [21], [22], [23].
Recently, the Berkeley parser introduced a hierarchical state-split approach to refine the original grammar, and achieves state-of-the-art performance [23]. Starting from the basic nonterminals, this method repeats a split-merge (SM) cycle to increase the complexity of the grammar. Specifically, it splits every symbol into two, and then re-merges those new subcategories whose removal incurs little loss in likelihood. Petrov et al. showed that this parsing technique generalizes well across languages and obtained state-of-the-art performance for Chinese and German, even without any language-specific tuning [11]. Therefore, their parser3 is used for our parsing-based word segmentation.
3 http://code.google.com/p/berkeleyparser/.
C. Integrating Parsing-based Model and CRFs-based Model
The Berkeley parser is a generative PCFG-based parser, which cannot fully utilize informative lexical features, and this may hurt word segmentation performance. In order to investigate the relationship between our parsing-based model, which fully utilizes morphological and syntactic information, and the CRFs-based model, which incorporates diverse and overlapping features, we attempt to integrate the two models into one. We adopt the general framework for integrating different knowledge sources introduced for machine translation in [24], which incorporates the different knowledge sources as feature functions of a maximum entropy model.
Given a Chinese character sentence, the posterior probability for its corresponding tree is computed as (1):
\[ \Pr(T|C) = p_{\alpha_1^M}(T|C) = \frac{\exp\bigl[\sum_{m=1}^{M} \alpha_m h_m(T,C)\bigr]}{\sum_{T'} \exp\bigl[\sum_{m=1}^{M} \alpha_m h_m(T',C)\bigr]} \quad (1) \]
where C is the character sequence and T is the parse tree.
The decision rule here is:
\[ T_0 = \arg\max_T \Pr(T|C) = \arg\max_T \sum_{m=1}^{M} \alpha_m h_m(T,C) \quad (2) \]
The parameters α_1^M of this model can be optimized by standard approaches, such as the Minimum Error Rate Training used in machine translation [25]. In fact, generative PCFG-based parsing can be treated as a special case if we use only the log probability of the generative PCFG model as the feature function:
\[ h_1(T,C) = \log P_{\mathrm{PCFGs}}(T|C) \quad (3) \]
In our approach, the logarithms of the following two scores are used as feature functions:
\[ h_1(T,C) = \log P_{\mathrm{PCFGs}}(T|C), \qquad h_2(T,C) = \log P_{\mathrm{CRFs}}(W|C) = \log \prod_{w_i} P_{\mathrm{CRFs}}(w_i|C) \quad (4) \]
Instead of computing the score of the whole word sequence W for character sequence C through P_CRFs(W|C) directly, we use the posterior probability P_CRFs(w_i|C) of each subsequence being tagged as one whole word.
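The combination can be sketched as a log-linear rescoring of candidate trees; the normalizer in (1) cancels under arg max, so only the weighted sum in (2) is needed. Candidate trees and their precomputed probabilities are hypothetical here:

```python
import math

# Hypothetical feature functions over candidate trees, represented as
# dicts carrying precomputed model probabilities.
h1 = lambda t: math.log(t["p_pcfg"])                       # log P_PCFGs(T|C)
h2 = lambda t: sum(math.log(p) for p in t["p_crf_words"])  # log prod_i P_CRFs(w_i|C)

def combined_score(tree, alphas, feature_fns):
    """sum_m alpha_m * h_m(T, C); the normalizer of (1) is omitted
    because it cancels under arg max."""
    return sum(a * h(tree) for a, h in zip(alphas, feature_fns))

def decode(trees, alphas, feature_fns):
    """Decision rule (2): return the highest-scoring candidate tree."""
    return max(trees, key=lambda t: combined_score(t, alphas, feature_fns))
```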
V. EXPERIMENTS AND RESULTS
In this section, several experiments are designed to investigate the validity of the proposed model. A direct comparison with Luo’s work [9] is not possible, since he used a different release of the Chinese Treebank.
A. Experimental Settings
We present experimental results on the Penn Chinese Treebank (CTB) 5.0. We adopt the standard data split: files CHTB 001.fid to CHTB 270.fid and CHTB 400.fid to CHTB 1151.fid are used as the training set. The development set comprises files CHTB 301.fid to CHTB 325.fid, and the test set comprises files CHTB 271.fid
to CHTB 300.fid. Word-based trees are converted to character-based trees using the procedure described in Section IV-A. All traces and functional tags are stripped.
With regard to the parser from [11], all experiments are carried out after six cycles of split-merge.
The parameters α_1^M of the integrated model are tuned with Minimum Error Rate Training on the development set.

Fig. 2. The conversion of the tree for the sentence “政府充满活力。”(The government is full of vitality.). A. is the original word-based tree, B. is the converted character-based tree.

TABLE III
THE PERFORMANCE OF SYNTACTIC PARSING IN DIFFERENT MODELS.

          |     ≤ 40 words     |   all sentences
Model     |  LP     LR     F   |  LP     LR     F
Oracle    |  90.0   87.1  88.5 |  85.6   82.8  84.2
Pipeline  |  88.7   85.6  87.1 |  83.3   80.8  82.1
ChParse   |  87.4   83.9  85.6 |  82.7   79.3  81.0
Integ     |  87.7   84.2  85.9 |  83.3   79.6  81.4
B. Results
Three metrics are used for the evaluation of word segmentation: precision (P), recall (R) and F-measure (F), defined as 2PR/(P+R).
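These metrics can be computed as follows (a standard span-matching sketch; the function name is ours):

```python
def prf(gold_words, pred_words):
    """Word-segmentation precision, recall and F-measure; a predicted word
    counts as correct when its character span matches a gold span."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    g, p = spans(gold_words), spans(pred_words)
    correct = len(g & p)
    P = correct / len(p)
    R = correct / len(g)
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F
```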
As shown in Table II, our parsing-based model “ChParse” achieves an F-measure of 97.1, competitive with the 97.2 of the conventional CRFs-based word segmenter. “Morph” gives the results using only the morphological information in the character-based trees, and “ChParse-Morph” shows the results after all morpheme tags are stripped off the character-based trees. Both perform worse than “ChParse”, which indicates that the morphological, boundary and POS information are all beneficial to word segmentation. The best result, an F-measure of 97.9, a 25% reduction in segmentation error rate relative to the CRFs-based model, comes from the integrated model “Integ”; this large improvement suggests that our parsing-based model is complementary to the CRFs-based model.
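The quoted 25% figure follows from the F-measures, treating 100 − F as the segmentation error rate:

```python
def error_rate_reduction(f_base, f_new):
    """Relative reduction in segmentation error rate (100 - F)."""
    e_base, e_new = 100.0 - f_base, 100.0 - f_new
    return (e_base - e_new) / e_base

# (100 - 97.2) = 2.8 vs. (100 - 97.9) = 2.1, a relative reduction of 25%.
```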
TABLE II
THE PERFORMANCE OF WORD SEGMENTATION IN DIFFERENT MODELS.

Model          |   P      R      F
CRFs           |  96.9   97.5   97.2
ChParse        |  96.7   97.5   97.1
Morph          |  96.0   96.3   96.2
ChParse-Morph  |  96.4   97.0   96.7
Integ          |  97.6   98.2   97.9
Take the sentence “澳向中国提供一笔贷款”(Australia provides China with a loan.) as an example of how the linguistic information is used. Fig. 3 shows the parse tree from the pipeline approach, which takes the CRFs-based word segmentation as input for word-based parsing, and Fig. 4 shows the parse tree from “ChParse”. In this example, the morpheme tag “p” for the character “向” helps identify the correct boundary. The morphological information and the high-level syntactic information provide strong guidance for the low-level word segmentation.
Since our word segmenter is based on a parsing model, whose by-product is the parse of the character sequence, we also examine the parsing performance. Three measures over constituent labels, namely labeled precision (LP), labeled recall (LR) and F-measure (F), are computed as usual, except that constituent spans are defined in terms of characters instead of words. We make a few modifications to the EVALB parseval reference implementation4 to meet the variable space requirements, and set its parameters so as to evaluate at the same level as word-based parsing.
Table III shows the detailed results of syntactic parsing. The first row, “Oracle”, is treated as the upper-bound performance, taking the manual word segmentation as the parser input. Surprisingly, despite a better word segmentation, the integrated model “Integ” does not surpass the pipeline approach in parsing performance. In our opinion, the reason for this divergence between word segmentation and parsing performance in our parsing-based model is that a good word segmentation requires only correct character boundaries, while a good parse additionally requires the refined preterminal labels, with POS tags and morpheme tags, to be labeled correctly, which is a more sophisticated task.
4 http://nlp.cs.nyu.edu/evalb/.

Fig. 3. The word-level parse tree for the sentence “澳向中国提供一笔贷款”(Australia provides China with a loan.) in the pipeline approach.
VI. DISCUSSION
For Chinese, no agreement has been reached on the definition of words, and there exist multiple corpora with different and incompatible annotation guidelines, such as the word segmentation guidelines released by Peking University and by Microsoft Research Asia. Conventional data-driven approaches rely heavily on the training data, which leads to performance degradation when training data and test data are annotated under different guidelines. The problem becomes even more serious for downstream applications in practice.
However, little previous work has paid attention to this important problem. Jiang et al. proposed a simple yet effective strategy for annotation-style adaptation in Chinese word segmentation and part-of-speech tagging [26]. To transfer knowledge from a differently annotated corpus to the corpus with the desired annotation, they used the predictions of a classifier trained on the source corpus as additional features for training a classifier on the target corpus. Their strategy obtained improvements over classifiers trained on either the source corpus or the target corpus alone.
One of the reasons why no segmentation standard is widely accepted is the paucity of morphology in Chinese [26]. Our character-based parser, which integrates morphological information, can shed light on this problem. Take the two Chinese character sequences “学习机”(learning machine) and “作曲家”(composer) as examples. The parsing results, with positional tags and POS tags stripped from the preterminal nodes, are shown in Fig. 5. For “学习机”, although the three characters form a noun, the first two characters conform to the “serial verb construction”. Similarly, for “作曲家”, the first two characters conform to the “verb object construction”. Both the “serial verb construction” and the “verb object construction” are among the most common constructions in Chinese morphology. If our character-based parser could learn these constructions automatically, so that the flat morpheme structure became a hierarchical morpheme structure, it would provide words of different granularities. In that case, character-based parsing would break free of the constraints that the segmentation guidelines impose through the training corpus, and provide multiple word segmentation choices for different applications. This characteristic is very appealing in practice, for example in information retrieval.
Fig. 5. The morpheme parse trees of the Chinese character sequences “学习机”(learning machine) and “作曲家”(composer).
VII. CONCLUSION
In this paper, we have presented a parsing-based Chinese word segmentation model, which utilizes morphological and syntactic information to further improve performance. Evaluations show that both morphological and syntactic information are useful for word segmentation. The performance of the integrated model, combining our parsing-based model and the conventional CRFs-based model, increases to a large extent, which indicates that the two models are complementary to each other. In other words, the high performance of the integrated model suggests that the morphological and syntactic information complement the lexical information. Discriminative syntactic parsing models have the capability to admit multiple features, and may incorporate all of this useful information in a unified manner.
The parsing-based word segmentation model can be treated as a character-based parsing model, which performs word segmentation, POS tagging and syntactic parsing in a unified model. Experiments show that this kind of character-based unified model is promising.
Our future work is to convert the flat morpheme structure into a hierarchical structure, to provide multiple word segmentation choices for different applications.

Fig. 4. The character-level parse tree for the sentence “澳向中国提供一笔贷款”(Australia provides China with a loan.) from the parsing-based model.
ACKNOWLEDGMENT
The work was supported in part by the National Natural
Science Foundation of China (No.90920302), a HGJ Grant of
China (No. 2011ZX01042-001-001), and a research program
from Microsoft China.
REFERENCES
[1] J. K. Low, H. T. Ng, and W. Guo, “A maximum entropy approach to Chinese word segmentation,” in Proceedings of the Fourth SIGHAN
Workshop on Chinese Language Processing, Jeju Island, Korea, 2005,
pp. 161–164.
[2] H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning, “A conditional random field word segmenter for SIGHAN Bakeoff 2005,” in
Proceedings of the Fourth SIGHAN Workshop on Chinese Language
Processing, Jeju Island, Korea, 2005, pp. 168–171.
[3] H. Zhao, C.-N. Huang, and M. Li, “An improved Chinese word segmentation system with conditional random field,” in Proceedings of the Fifth
SIGHAN Workshop on Chinese Language Processing, Sydney, Australia,
2006, pp. 162–165.
[4] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, no. 1, pp. 39–72, 1996.
[5] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random
fields: Probabilistic models for segmenting and labeling sequence data,”
in Proceedings of ICML 2001, 2001, pp. 282–289.
[6] X. Lin, L. Zhao, M. Zhang, and X. Wu, “A morphology-based Chinese word segmentation method,” in Proceedings of NLP-KE 2010, 2010, pp.
1–5.
[7] H. T. Ng and J. K. Low, “Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based?” in Proceedings of
EMNLP 2004, D. Lin and D. Wu, Eds., Barcelona, Spain, 2004, pp.
277–284.
[8] Y. Zhang and S. Clark, “Joint word segmentation and POS tagging
using a single perceptron,” in Proceedings of ACL:HLT 2008, Columbus,
Ohio,USA, 2008, pp. 888–896.
[9] X. Luo, “A maximum entropy Chinese character-based parser,” in Proceedings of EMNLP 2003, M. Collins and M. Steedman, Eds., 2003, pp.
192–199.
[10] A. Ratnaparkhi, “A linear observed time statistical parser based on maximum entropy models,” in Proceedings of EMNLP 1997, Providence,
Rhode Island, 1997, pp. 1–10.
[11] S. Petrov and D. Klein, “Improved inference for unlexicalized parsing,”
in Proceedings of HLT-NAACL 2007, Rochester, New York, USA, 2007,
pp. 404–411.
[12] L. Wang, “Problems with the boundary between words and word
groups,” in Zhongguo Yuwen, 1953, pp. 3–8.
[13] J. L. Packard, The Morphology of Chinese. Cambridge, UK: Cambridge
University Press, 2000.
[14] Y. Liu, S. Yu, and X. Zhu, “Construction of the Contemporary Chinese Compound Words Database and its application,” in Modern Education
Technology and Chinese Teaching to Foreigners : Proceedings of the
Second International Conference on New Technologies in Teaching and
Learning Chinese, 2000, pp. 273–278.
[15] S. Yu, H. Duan, X. Zhu, and B. Sun, “The basic processing of the contemporary Chinese corpus at Peking University: specification,” Journal
of Chinese Information Processing, vol. 16, no. 5, pp. 49–64, 2002.
[16] N. Xue and L. Shen, “Chinese word segmentation as LMR tagging,” in
Proceedings of the Second SIGHAN Workshop on Chinese Language
Processing, Sapporo, Japan, 2003, pp. 176–179.
[17] N. Xue, F.-D. Chiou, and M. Palmer, “Building a large-scale annotated Chinese corpus,” in Proceedings of the 19th International Conference
on Computational Linguistics, Morristown, New Jersey, USA, 2002, pp.
1–8.
[18] M. Collins, “Three generative, lexicalised models for statistical parsing,”
in Proceedings of ACL 1997, Madrid, Spain, 1997, pp. 16–23.
[19] E. Charniak, “Statistical parsing with a context-free grammar and word
statistics,” in Proceedings of AAAI 1997, 1997, pp. 598–603.
[20] M. Johnson, “PCFG models of linguistic tree representations,” Computational Linguistics, vol. 24, pp. 613–632, 1998.
[21] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in
Proceedings of ACL 2003, Sapporo, Japan, 2003, pp. 423–430.
[22] T. Matsuzaki, Y. Miyao, and J. Tsujii, “Probabilistic CFG with latent
annotations,” in Proceedings of ACL 2005, Ann Arbor, Michigan, USA,
2005, pp. 75–82.
[23] S. Petrov, L. Barrett, R. Thibaux, and D. Klein, “Learning accurate,
compact, and interpretable tree annotation,” in Proceedings of ACL 2006,
Sydney, Australia, 2006, pp. 433–440.
[24] F. J. Och and H. Ney, “Discriminative training and maximum entropy
models for statistical machine translation,” in Proceedings of ACL,
Philadelphia, Pennsylvania, USA, 2002, pp. 295–302.
[25] F. J. Och, “Minimum error rate training in statistical machine translation,” in Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics, July 2003, pp. 160–167.
[26] W. Jiang, L. Huang, and Q. Liu, “Automatic adaptation of annotation
standards: Chinese word segmentation and POS tagging – a case study,”
in Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP, Suntec, Singapore, August 2009, pp. 522–
530.