Parsing-based Chinese Word Segmentation Integrating Morphological and Syntactic Information

Xihong WU, Meng ZHANG, Xiaojun LIN
Speech and Hearing Research Center
Key Laboratory of Machine Perception (Ministry of Education)
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Email: {wxh,zhangm,linxj}@cis.pku.edu.cn

Abstract—The conventional sequence labeling methods for Chinese word segmentation do not fully utilize linguistic information, which restricts further improvement of performance. Chinese morphology intensively investigates the construction and usage of Chinese words, which is helpful for Chinese word segmentation. Furthermore, some word segmentation ambiguities cannot be resolved by lexical information alone; the final disambiguation takes place in the parsing process. In this paper, we propose a parsing-based Chinese word segmentation model that fully utilizes morphological and syntactic information. Experiments on Penn Chinese Treebank (CTB) 5.0 show that the proposed model obtains performance competitive with a CRFs-based model. To investigate the relationship between our parsing-based model and the CRFs-based model, a maximum entropy framework for integrating different knowledge sources is employed. The integrated model obtains an F-measure of 97.9, a 25% reduction in segmentation error rate relative to the CRFs-based model, which indicates that the two models are complementary.

I. INTRODUCTION

Chinese word segmentation (CWS) is one of the core techniques in Chinese language processing, and it is a prerequisite for many tasks, such as syntactic parsing, machine translation and information extraction. Character sequence labeling has become a prevailing technique for this task [1], [2], [3].
Among machine learning methods, the Maximum Entropy model (ME) [4] and Conditional Random Fields (CRFs) [5] have proven effective, obtaining excellent performance in the word segmentation tracks of the SIGHAN Bakeoff. Although the conventional ME- or CRFs-based methods emphasize the contexts of Chinese characters, they do not fully utilize linguistic information such as the construction structure within a word and higher-level syntactic information. Chinese morphology intensively investigates the construction and usage of Chinese words, which is beneficial to word segmentation. Furthermore, some word segmentation ambiguities cannot be resolved by lexical information alone; the final disambiguation takes place in the parsing process. Therefore, morphological and syntactic information can be used to improve word segmentation. A variety of approaches have been developed to exploit this linguistic information in Chinese word segmentation. (Throughout, "characters" stand for the various tokens occurring in naturally written Chinese text, including Chinese characters (hanzi), foreign letters, digits, and punctuation.) For morphological information, Lin et al. introduced this kind of information into the conventional CRFs-based model, using morpheme tags to refine the character boundary tags, and obtained improvements across multiple corpora [6]. These gains are attributable to the morphological information. For syntactic information, there is a large body of research on joint Chinese word segmentation and part-of-speech (POS) tagging, aimed at avoiding the propagation of word segmentation errors and making full use of POS information to resolve segmentation ambiguities. Building on a maximum entropy word segmenter, Ng et al. implemented joint tagging in a labeling fashion by expanding the boundary tags to include POS information [7]. Zhang et al.
utilized a single perceptron as the joint tagging model and adopted a multiple-beam search algorithm for efficient decoding [8]. These approaches all improved word segmentation results compared to pipeline-style models, and the improvements are due to the introduction of POS information. With regard to higher-level syntactic information beyond POS, Luo proposed a character-based parser that can parse unsegmented Chinese sentences, achieving decent performance [9]. In his work, three simple rules were employed to convert word-based parse trees into character-based parse trees, which encode both word boundary and POS information. A maximum entropy parser [10] was then trained on the character-based treebank. In this character-based parsing approach, syntactic information can directly influence word segmentation. However, investigating only part of the morphological or syntactic information is not sufficient for Chinese word segmentation. In this paper, we propose a parsing-based Chinese word segmentation model that combines morphological and syntactic information in a single model. Fig. 1 illustrates the similarity between the construction of Chinese words and that of phrases and sentences. The preterminal node of the morpheme structure tree is the morpheme tag corresponding to the hanzi at its terminal node; "n", "v" and "a" stand for nominal, verbal and adjectival morphemes respectively. Here, the "modifier-head construction" refers to an adjective modifying a noun or a noun modifying a noun, and the "serial verb construction" refers to two verbs in sequence. Both constructions are among the most common types for Chinese words. We may thus regard the Chinese word morpheme structure as a special case of the conventional syntactic structure.
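The word-level morpheme structure trees described above can be pictured with a minimal nested-pair encoding. This is our own illustrative convention, not a data format from the paper; each leaf pairs a morpheme tag with one hanzi, and the node label names the construction type from Fig. 1.

```python
# Illustrative only: flat morpheme structure trees for two words from the
# Fig. 1 sentence. "n" and "v" are the nominal and verbal morpheme tags.
ZHENGFU = ("modifier-head", [("n", "政"), ("n", "府")])   # noun + noun, "government"
CHONGMAN = ("serial-verb", [("v", "充"), ("v", "满")])    # verb + verb, "be full of"

def preterminal_tags(word_tree):
    """Collect the morpheme tags at the preterminal nodes of a word tree."""
    _construction, leaves = word_tree
    return [tag for tag, _hanzi in leaves]
```

With this encoding, the parallel to syntactic structure is direct: a word is a small tree whose preterminals are morpheme tags, just as a phrase is a tree whose preterminals are POS tags.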
Our work is similar to Luo's [9] but differs in two ways. Firstly, the parsing model we use is an unlexicalized parser [11] rather than a maximum entropy parser [10]; the Berkeley parser is a robust syntactic parser that has achieved high performance across multiple languages and domains, even without any language-specific tuning [11]. Secondly, Luo derived character-level tags from POS tags [9]. POS is believed to encode information between words rather than within words, while morphology reflects construction patterns within words. In order to integrate both kinds of information, the character-level tags are further refined with morpheme tags in our approach. The remainder of this paper is organized as follows. Section II introduces Chinese morphology. Section III gives the CRFs-based baseline word segmenter. Section IV describes our parsing-based word segmentation approach in detail. In Section V, several experiments are carried out for evaluation. The appealing characteristic of providing multiple segmentation choices is discussed in Section VI. Conclusions are drawn at the end.

II. CHINESE MORPHOLOGY

Chinese morphology investigates the construction and usage of Chinese words, which is helpful for Chinese word segmentation. A word in Chinese is defined as the smallest independently usable part of language, or that part of the sentence which can be used independently [12]. The evolution of Chinese words is a process from single-character words to multi-character words. As the constituents of words, morphemes can be characterized in many ways, such as relational description, modification structure description, semantic description, syntactic description and form class description [13]. The word structure of Chinese is mainly described with the form class description, in which morphemes are viewed in terms of their form class identities, or "part-of-speech".
Deciding the form class of a morpheme by its part-of-speech works well for morphemes as they appear in syntactic contexts, i.e., when they are used as free words. When they occur outside syntactic contexts or inside words, the morphemes may behave as follows. Firstly, they may retain the class of the original free word, whether they serve as the left- or right-hand member of the word, such as the noun "纸" (Zhi) in "纸板" (ZhiBan), "纸花" (ZhiHua), "宣纸" (XuanZhi), "剪纸" (JianZhi) and "信纸" (XinZhi). Secondly, their class may change according to certain principles; for example, nouns have nominal constituents on the right and verbs have verbal constituents on the left [13]. However, these Headedness Principles in the grammar of Chinese word formation may also be contravened at times. Four types of morphemes, i.e., function word, root word, bound root and affix, are induced from the free-bound and content-function criteria. Their functions are described in detail below. According to [13], there are two subcategories in the category of affix, namely word-forming affix and grammatical affix. Word-forming affixes include the nominalizing suffixes "-子" (Zi), "-头" (Tou), "-性" (Xing) and "-度" (Du), the verbalizing suffix "-化" (Hua), the negative prefixes "无-" (Wu), "未-" (Wei) and "非-" (Fei), the adverbial suffix "-然" (Ran) and the agentive suffix "-者" (Zhe). The subcategory of grammatical affix includes the verbal aspect markers "-了" (Le), "-着" (Zhe) and "-过" (Guo), the resultative potential infixes "-得-" (De) and "-不-" (Bu), and the human noun plural suffix "-们" (Men). Like affixes, bound roots are bound morphemes. But unlike word-forming affixes, bound roots tend to have meanings that are more fixed and lexicalized, while word-forming affixes tend to be more general and abstract. Additionally, word-forming affixes are more productive than bound roots and involve a grammatical change.
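The affix inventory just listed can be pictured as a small lookup. This is illustrative only: the sets below contain exactly the affixes named above, the full inventory in [13] is larger, and in real text affix status is context dependent (many of these characters also occur as free words), so a character-level lookup is a deliberate simplification.

```python
# Illustrative only: the affixes named above, keyed by subcategory.
WORD_FORMING = {"子", "头", "性", "度", "化", "无", "未", "非", "然", "者"}
GRAMMATICAL = {"了", "着", "过", "得", "不", "们"}

def affix_subcategory(hanzi):
    """Return the affix subcategory of a character, or None if it is
    not in the (toy) affix inventory."""
    if hanzi in GRAMMATICAL:
        return "grammatical"
    if hanzi in WORD_FORMING:
        return "word-forming"
    return None
```

A lookup of this kind only gives a candidate analysis; deciding whether a character actually functions as an affix in a given word is exactly the kind of decision the morpheme-annotated corpus of Section II-A supports.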
Except for function words, the three morpheme types - root word, bound root and affix - are free to combine with other morphemes to form larger words in Chinese. In view of the two general approaches to morphological analysis presented above, Packard proposed a joint approach based on the X-bar theory of syntax [13]. A basic property of X-bar syntactic theory is that a category is expanded as a lower-bar copy of itself, optionally including other maximal expansions. It is possible to apply X-bar theory to words, because the basic units can be categorized according to the following four facets: (i) their form class identities; (ii) the extent to which they can stand alone as free morphemes; (iii) whether they have a "grammatical" or a "lexical" identity; (iv) the manner in which they combine to form words. All of the above suggests that there exist construction rules for forming words from morphemes in Chinese, which provides cues for word segmentation.

A. The Construction of the Chinese Morpheme Corpus

It is difficult to construct a Chinese morpheme corpus owing to its large scale and the ambiguities involved. The Contemporary Chinese Compound Words Database [14] developed by Peking University plays an important role during the initial construction stage. It contains 10914 single-character words, 32711 two-character words, 7296 three-character words, and 7850 four-character words. In this database, a word is described with its part-of-speech, the form class identities of its morphemes, the construction type of the word, and so on. After intensive observation of the database entries that share the same word and part-of-speech tag, we make the strong assumption that the morpheme category of a character is determinate given the word containing the character and the word's part-of-speech tag.
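The assumption just stated can be sketched as a lookup: the morpheme categories of the characters in a word are a function of the (word, POS) pair, so a database needs only one entry per pair. The toy entries below are ours, chosen from the Fig. 1 sentence; the real resource is the Contemporary Chinese Compound Words Database.

```python
# A sketch of the paper's assumption: one database entry per (word, POS)
# pair determines the per-character morpheme categories. Toy data only.
MORPHEME_DB = {
    ("政府", "NN"): ["n", "n"],   # both characters are nominal morphemes
    ("充满", "VV"): ["v", "v"],   # both characters are verbal morphemes
}

def morpheme_tags_for(word, pos):
    """Look up the per-character morpheme categories of (word, POS)."""
    return MORPHEME_DB[(word, pos)]
```

Under this assumption, annotating a segmented, POS-tagged sentence with morpheme categories reduces to one dictionary lookup per word, which is what makes the bootstrapped annotation of CTB described below feasible.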
This assumption simplifies morpheme annotation for a corpus to compiling a morpheme database just like the Contemporary Chinese Compound Words Database, except that there is only one entry per word and part-of-speech tag. The morphemes are divided into 26 categories, as shown in Table I. The definitions of the morpheme categories follow [15].

Fig. 1. The parse tree of the sentence "政府充满活力。" (The government is full of vitality.) and the morpheme structure trees of the words. A. is the parse tree; B. is the morpheme structure tree.

TABLE I
THE CHINESE MORPHEME CATEGORIES. CAT. AND DESC. STAND FOR CATEGORY AND DESCRIPTION RESPECTIVELY.

Cat.  Desc.           Cat.  Desc.
a     adjective       b     difference
c     conjunction     d     adverb
e     interjection    f     direction
h     prefix          i     idiom
j     abbreviation    k     suffix
m     numeral         n     normal noun
o     onomatopoeia    p     preposition
q     quantifier      r     pronoun
t     time            u     auxiliary
v     verb            w     punctuation
x     not morpheme    y     mood
z     status          nr    person
ns    location        nt    organization

Next, we briefly describe the annotation process on CTB. Firstly, only sentences whose words are all present in the Contemporary Chinese Compound Words Database are selected, resulting in a small morpheme-annotated corpus. With these sentences as the training set, a CRFs-based annotation model is trained using the following features: unigram features, part-of-speech tags, word boundaries and bigram morpheme tags. After training, the model is used to label the rest of the corpus with morpheme categories. Sentences with high coverage of words in our morpheme database are then selected and, after an additional pass of manual correction, added to the training set. The whole process is iterated for several rounds until all words in CTB are covered.

III. THE BASELINE SYSTEM

Chinese word segmentation is an essential technique in Chinese language processing.
According to [7], the segmentation task can be transformed into a tagging problem by assigning each character one of four boundary tags: "b" for a character that begins a word, "m" for a character in the middle of a word, "e" for a character that ends a word, and "s" for a character that occurs as a single-character word. The labeled result can be split into subsequences matching the pattern s or b(m)*e, which denote a single-character word and a multi-character word respectively. Among machine learning methods, Maximum Entropy (ME) [16] and Conditional Random Fields (CRFs) [5] have proven effective and obtain excellent performance in the word segmentation tracks of the SIGHAN Bakeoff. We adopt the CRFs-based character sequence labeling approach as our baseline word segmenter, implemented with the CRF++ package (http://crfpp.sourceforge.net/). The common four tags {b, m, e, s} are used to encode word boundary information. Eight unigram feature templates, namely C−2, C−1, C0, C1, C2, C−1C0, C0C1 and C−1C1, are selected, where C0 refers to the current character and C−n (or Cn) is the n-th character to the left (or right) of the current character. We also use the bigram feature template, which captures the dependency between adjacent tags.

IV. PARSING-BASED WORD SEGMENTATION

This section describes our parsing-based word segmenter. The Penn Chinese Treebank (CTB) [17] is manually segmented and annotated at the word level. In order to build a parser that operates at the character level, the original word-based trees first need to be converted into character-based trees. A parsing model is then trained on the character-based trees using the Berkeley parser [11], which achieves high performance across multiple languages.

A. Word-Tree to Character-Tree

Word-based trees in CTB are converted into character-based trees according to a few rules, as done in [9].
The simple rules used in this conversion are as follows: (i) word-level part-of-speech tags become constituent labels in the character-based trees; (ii) character-level tags consist of a morpheme tag appended with a positional tag and the word-level POS tag; (iii) morpheme tags are annotated as described in Section II-A. Fig. 2 illustrates a conversion example. The character-level tag "n b NN" of the character "政" (politics) marks the first, nominal character of a noun. Character-level tags thus encode word structure and word boundary information, while chunk-level labels are word-level POS tags. Therefore, parsing a Chinese character sentence essentially performs word segmentation, POS tagging and syntactic parsing at the same time. The parse tree of a sentence can easily be transformed back into a word-level parse tree, whose yield is the word segmentation result.

B. Parsing Model

Probabilistic Context-Free Grammars (PCFGs) lay the foundation for most high-performance syntactic parsers. However, restricted by its strong context-free assumptions, the original PCFG model, which simply reads the grammar and probabilities off a treebank, does not perform well. Therefore, a variety of techniques have been developed to enrich and generalize the original grammar, ranging from lexicalization [18], [19] to symbol annotation [20], [21], [22], [23]. Recently, the Berkeley parser introduced a hierarchical state-split approach to refine the original grammar and achieves state-of-the-art performance [23]. Starting with the basic nonterminals, this method repeats a split-merge (SM) cycle to increase the complexity of the grammar: it splits every symbol into two, and then re-merges those new subcategories whose removal causes little loss in likelihood. Petrov et al.
showed that this parsing technique generalizes well across languages, obtaining state-of-the-art performance for Chinese and German even without any language-specific tuning [11]. Therefore, their parser (http://code.google.com/p/berkeleyparser/) is used for our parsing-based word segmentation.

C. Integrating the Parsing-based Model and the CRFs-based Model

The Berkeley parser derives from generative PCFG-based parsing, which cannot fully utilize informative lexical features and may thus limit word segmentation performance. In order to investigate the relationship between our parsing-based model, which fully utilizes morphological and syntactic information, and the CRFs-based model, which incorporates diverse and overlapping features, we attempt to integrate the two models into one. We adopt a general framework for integrating different knowledge sources, introduced for machine translation by [24]. Based on maximum entropy models, it incorporates different knowledge sources as feature functions. Given a Chinese character sentence, the posterior probability of a corresponding tree is computed as (1):

\Pr(T|C) = p_{\alpha_1^M}(T|C) = \frac{\exp[\sum_{m=1}^{M} \alpha_m h_m(T,C)]}{\sum_{T'} \exp[\sum_{m=1}^{M} \alpha_m h_m(T',C)]}   (1)

where C is the character sequence and T is the parse tree. The decision rule is:

T_0 = \arg\max_T \Pr(T|C) = \arg\max_T \sum_{m=1}^{M} \alpha_m h_m(T,C)   (2)

The parameters \alpha_1^M of this model can be optimized by standard approaches, such as the Minimum Error Rate Training used in machine translation [25]. In fact, generative PCFG-based parsing can be treated as a special case in which the only feature function is the log probability from the generative PCFG model, as in (3).
h_1(T,C) = \log P_{PCFG}(T|C)   (3)

In our approach, the logarithms of the following two scores are used as feature functions:

h_1(T,C) = \log P_{PCFG}(T|C)
h_2(T,C) = \log P_{CRFs}(W|C) = \log \prod_i P_{CRFs}(w_i|C)   (4)

Instead of computing the score of the whole word sequence W for character sequence C through P_{CRFs}(W|C) directly, we take the posterior probability P_{CRFs}(w_i|C) of each sub-sequence being tagged as one whole word.

V. EXPERIMENTS AND RESULTS

In this section, we design several experiments to investigate the validity of the proposed model. A direct comparison with Luo's work [9] is not possible, since he uses a different release of the Chinese Treebank.

A. Experimental Settings

We present experimental results on Chinese Treebank (CTB) 5.0. We adopt the standard data allocation and split the corpus as follows: files CHTB 001.fid to CHTB 270.fid and CHTB 400.fid to CHTB 1151.fid are used as the training set; the development set comprises files CHTB 301.fid to CHTB 325.fid; and the test set comprises files CHTB 271.fid to CHTB 300.fid. Word-based trees are converted to character-based trees using the procedure described in Section IV-A. All traces and functional tags are stripped. With regard to the parser from [11], all experiments are carried out after six cycles of split-merge. The parameters \alpha_1^M of the integrated model are tuned with Minimum Error Rate Training on the development set.

Fig. 2. The conversion of the tree for the sentence "政府充满活力。" (The government is full of vitality.). A. is the original word-based tree; B. is the converted character-based tree.

TABLE III
THE PERFORMANCE OF SYNTACTIC PARSING IN DIFFERENT MODELS.

                ≤ 40 words           all sentences
Model      LP    LR    F        LP    LR    F
Oracle     90.0  87.1  88.5     85.6  82.8  84.2
Pipeline   88.7  85.6  87.1     83.3  80.8  82.1
ChParse    87.4  83.9  85.6     82.7  79.3  81.0
Integ      87.7  84.2  85.9     83.3  79.6  81.4

B.
Results

Three metrics are used for the evaluation of word segmentation: precision (P), recall (R), and the F-measure (F), defined as 2PR/(P+R). As shown in Table II, our parsing-based model "ChParse" achieves an F-measure of 97.1, competitive with the 97.2 of the conventional CRFs-based word segmenter. "Morph" gives the results using only the morphological information in the character-based trees, and "ChParse-Morph" shows the results after all morpheme tags are stripped off the character-based trees. Both perform worse than "ChParse", which indicates that the morphological, boundary and POS information are all beneficial to word segmentation. The best result, 97.9, a 25% reduction in segmentation error rate relative to the CRFs-based model, comes from the integrated model "Integ"; this large improvement suggests that our parsing-based model is complementary to the CRFs-based model.

TABLE II
THE PERFORMANCE OF WORD SEGMENTATION IN DIFFERENT MODELS.

Model           P     R     F
CRFs            96.9  97.5  97.2
ChParse         96.7  97.5  97.1
Morph           96.0  96.3  96.2
ChParse-Morph   96.4  97.0  96.7
Integ           97.6  98.2  97.9

We take the sentence "澳向中国提供一笔贷款" (Australia provides China with a loan.) as an example to illustrate the use of linguistic information. Fig. 3 shows the parse tree from the pipeline approach, which takes CRFs-based word segmentation as the input to word-based parsing, and Fig. 4 shows the parse tree from "ChParse". In this example, the morpheme tag "p" for the character "向" helps identify the correct boundary. The morphological information and the higher-level syntactic information provide strong guidance for low-level word segmentation. Since our word segmenter is based on a parsing model, whose by-product is the parse of the character sequence, we also examine the parsing performance.
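The word-level P, R and F used above can be computed by comparing the two segmentations as sets of character spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below is our illustrative scorer, not the official evaluation script.

```python
def prf(gold_words, pred_words):
    """Word-level precision/recall/F with F = 2PR/(P+R), comparing
    words as (start, end) character spans over the same sentence."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out

    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For instance, if the gold segmentation is 澳 | 向 | 中国 and a system outputs 澳向 | 中国, only 中国 is correct, giving P = 1/2, R = 1/3 and F = 0.4.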
Three measures over constituent labels, namely labeled precision (LP), labeled recall (LR) and the F-measure (F), are computed as usual, except that constituent spans are defined in terms of characters instead of words. We make a few modifications to the EVALB parseval reference implementation (http://nlp.cs.nyu.edu/evalb/) to meet the variable space requirements, and set some parameters so as to evaluate at the same level as word-based parsing. Table III shows the detailed results of syntactic parsing. The first row, "Oracle", is treated as the upper-bound performance, taking the manual word segmentation as input to the parser. Surprisingly, despite its better word segmentation, the integrated model "Integ" does not surpass the pipeline approach in parsing performance. In our opinion, the reason for this divergence between word segmentation and parsing performance in our parsing-based model is that a good word segmentation requires only correct character boundaries, while for a good parse the refined preterminal labels, with POS tags and morpheme tags, must additionally be labeled correctly, which is a more sophisticated task.

Fig. 3. The word-level parse tree for the sentence "澳向中国提供一笔贷款" (Australia provides China with a loan.) in the pipeline approach.

VI. DISCUSSION

For Chinese, no agreement has been reached on the definition of words, and there exist multiple corpora with different, incompatible annotation guidelines, such as the word segmentation guidelines released by Peking University and by Microsoft Research Asia. Conventional data-driven approaches rely heavily on the training data, which leads to a decrease in performance when training data and test data are annotated under different guidelines. This becomes more serious in downstream applications in practice. However, very little previous work has paid attention to this important problem. Jiang et al.
proposed a simple yet effective strategy for annotation-style adaptation in Chinese word segmentation and part-of-speech tagging [26]. With the purpose of transferring knowledge from a differently annotated corpus to the corpus with the desired annotation, they used the predictions of a classifier trained on the source corpus as additional features for training a classifier on the target corpus. Their strategy obtained improvements over classifiers trained on either the source corpus or the target corpus alone. One of the reasons why no segmentation standard is widely accepted is the lack of morphology in Chinese [26]. Our character-based parser, which integrates morphological information, can shed light on this problem. Take the two Chinese character sequences "学习机" (learning machine) and "作曲家" (composer) as examples. The parsing results, with positional tags and POS tags stripped from the preterminal nodes, are shown in Fig. 5. For "学习机", although the three characters form a noun, the first two characters conform to the "serial verb construction". Similarly, for "作曲家", the first two characters conform to the "verb-object construction". Both the "serial verb construction" and the "verb-object construction" are among the most common constructions in Chinese morphology. If our character-based parser could learn these constructions automatically, i.e., if the flat morpheme structure became a hierarchical morpheme structure, it would provide words of different granularities. In that case, character-based parsing would break free of the constraints that the segmentation guidelines impose through the training corpus, and provide multiple word segmentation choices for different applications. This characteristic is very appealing in practice, for example in information retrieval.

Fig. 5. The morpheme parse trees of the Chinese character sequences "学习机" (learning machine) and "作曲家" (composer).

VII.
CONCLUSION

In this paper, we present a parsing-based Chinese word segmentation model that utilizes morphological and syntactic information to further improve performance. Evaluations show that both morphological and syntactic information are useful for word segmentation. The performance of the integrated model, combining our parsing-based model and the conventional CRFs-based model, improves substantially, which indicates that the two models are complementary. In other words, the high performance of the integrated model suggests that morphological and syntactic information are complementary to lexical information. Discriminative syntactic parsing models have the capability to admit multiple features, and may incorporate all of this useful information in a unified manner. The parsing-based word segmentation model can be treated as a character-based parsing model that performs word segmentation, POS tagging and syntactic parsing in a unified model. Experiments show that this kind of character-based unified model is promising. Our future work is to convert the flat morpheme structure into a hierarchical structure to provide multiple word segmentation choices for different applications.

Fig. 4. The character-level parse tree for the sentence "澳向中国提供一笔贷款" (Australia provides China with a loan.) from the parsing-based model.

ACKNOWLEDGMENT

The work was supported in part by the National Natural Science Foundation of China (No. 90920302), a HGJ Grant of China (No. 2011ZX01042-001-001), and a research program from Microsoft China.

REFERENCES

[1] J. K. Low, H. T. Ng, and W. Guo, "A maximum entropy approach to Chinese word segmentation," in Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005, pp. 161–164.
[2] H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C.
Manning, "A conditional random field word segmenter for SIGHAN Bakeoff 2005," in Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005, pp. 168–171.
[3] H. Zhao, C.-N. Huang, and M. Li, "An improved Chinese word segmentation system with conditional random field," in Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 2006, pp. 162–165.
[4] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39–72, 1996.
[5] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of ICML 2001, 2001, pp. 282–289.
[6] X. Lin, L. Zhao, M. Zhang, and X. Wu, "A morphology-based Chinese word segmentation method," in Proceedings of NLP-KE 2010, 2010, pp. 1–5.
[7] H. T. Ng and J. K. Low, "Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based?" in Proceedings of EMNLP 2004, D. Lin and D. Wu, Eds., Barcelona, Spain, 2004, pp. 277–284.
[8] Y. Zhang and S. Clark, "Joint word segmentation and POS tagging using a single perceptron," in Proceedings of ACL:HLT 2008, Columbus, Ohio, USA, 2008, pp. 888–896.
[9] X. Luo, "A maximum entropy Chinese character-based parser," in Proceedings of EMNLP 2003, M. Collins and M. Steedman, Eds., 2003, pp. 192–199.
[10] A. Ratnaparkhi, "A linear observed time statistical parser based on maximum entropy models," in Proceedings of EMNLP 1997, Providence, Rhode Island, 1997, pp. 1–10.
[11] S. Petrov and D. Klein, "Improved inference for unlexicalized parsing," in Proceedings of HLT-NAACL 2007, Rochester, New York, USA, 2007, pp. 404–411.
[12] L. Wang, "Problems with the boundary between words and word groups," in Zhongguo Yuwen, 1953, pp. 3–8.
[13] J. L. Packard, The Morphology of Chinese. Cambridge, UK: Cambridge University Press, 2000.
[14] Y. Liu, S. Yu, and X. Zhu, "Construction of the Contemporary Chinese Compound Words Database and its application," in Modern Education Technology and Chinese Teaching to Foreigners: Proceedings of the Second International Conference on New Technologies in Teaching and Learning Chinese, 2000, pp. 273–278.
[15] S. Yu, H. Duan, X. Zhu, and B. Sun, "The basic processing of contemporary Chinese corpus at Peking University: specification," Journal of Chinese Information Processing, vol. 16, no. 5, pp. 49–64, 2002.
[16] N. Xue and L. Shen, "Chinese word segmentation as LMR tagging," in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, 2003, pp. 176–179.
[17] N. Xue, F.-D. Chiou, and M. Palmer, "Building a large-scale annotated Chinese corpus," in Proceedings of the 19th International Conference on Computational Linguistics, Morristown, New Jersey, USA, 2002, pp. 1–8.
[18] M. Collins, "Three generative, lexicalised models for statistical parsing," in Proceedings of ACL 1997, Madrid, Spain, 1997, pp. 16–23.
[19] E. Charniak, "Statistical parsing with a context-free grammar and word statistics," in Proceedings of AAAI 1997, 1997, pp. 598–603.
[20] M. Johnson, "PCFG models of linguistic tree representations," Computational Linguistics, vol. 24, pp. 613–632, 1998.
[21] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of ACL 2003, Sapporo, Japan, 2003, pp. 423–430.
[22] T. Matsuzaki, Y. Miyao, and J. Tsujii, "Probabilistic CFG with latent annotations," in Proceedings of ACL 2005, Ann Arbor, Michigan, USA, 2005, pp. 75–82.
[23] S. Petrov, L. Barrett, R. Thibaux, and D. Klein, "Learning accurate, compact, and interpretable tree annotation," in Proceedings of ACL 2006, Sydney, Australia, 2006, pp. 433–440.
[24] F. J. Och and H. Ney, "Discriminative training and maximum entropy models for statistical machine translation," in Proceedings of ACL 2002, Philadelphia, Pennsylvania, USA, 2002, pp. 295–302.
[25] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 2003, pp. 160–167.
[26] W. Jiang, L. Huang, and Q. Liu, "Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging – a case study," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, August 2009, pp. 522–530.